Multimodal Emotion Detection on Videos using CNN-RNN, 3D Convolutions and Audio features
- Mentors: Mehul Bhatt, Francis Steen, Cristóbal Pagán Cánovas, Jakob Suchan
- Organization: Red Hen Lab
This project takes a deep learning approach that uses both the image and audio modalities of a video to detect and characterize emotion. It combines a CNN-RNN (convolutional-recurrent neural network), 3D convolutions (C3D), and audio features, following the winning solution of the EmotiW 2016 competition. In the CNN-RNN branch, an LSTM (Long Short-Term Memory) network learns the temporal dynamics of the video from per-frame features extracted by the CNN, while the 3D convolutional network convolves over space and time and so models appearance and motion jointly. Audio features are extracted with the openSMILE toolkit. The network is trained on the EmotiW dataset.

The goal of this project is to provide an API that takes a video from the user and returns a characterization of the emotions in it. The API will be fully tested, documented with examples, and deployable on the HPC.
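As a rough illustration of the two video branches described above, the sketch below shows a per-frame CNN whose features feed an LSTM, a small C3D-style network, and a simple score-level fusion of the two. This is a minimal PyTorch sketch, not the project's code: the ResNet-18 backbone, the layer sizes, the fusion weights, and the seven-class output (matching the EmotiW emotion categories) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision import models


class CnnLstm(nn.Module):
    """Per-frame CNN features -> LSTM over time -> emotion scores."""

    def __init__(self, num_classes=7, hidden_size=128):
        super().__init__()
        cnn = models.resnet18()          # any frame-level CNN would do here
        cnn.fc = nn.Identity()           # keep the 512-d frame features
        self.cnn = cnn
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, frames):           # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(feats)     # last hidden state summarizes the clip
        return self.head(h[-1])


class C3D(nn.Module):
    """Tiny 3D-convolutional network over a clip (appearance + motion)."""

    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, clip):             # clip: (B, 3, T, H, W)
        return self.head(self.features(clip).flatten(1))


# Score-level fusion of the two video branches on a dummy batch of 16-frame clips.
cnn_lstm, c3d = CnnLstm(), C3D()
frames = torch.randn(2, 16, 3, 112, 112)
scores = 0.5 * cnn_lstm(frames) + 0.5 * c3d(frames.permute(0, 2, 1, 3, 4))
print(scores.shape)                      # (2, 7)
```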
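For the audio branch, a small wrapper script can strip the audio track from the video and run openSMILE on it. The sketch below assumes ffmpeg is available and uses openSMILE's SMILExtract command-line tool; the config file path and output file name are placeholders for whichever feature set (for example one of the emobase configurations shipped with the toolkit) ends up being used.

```python
import subprocess


def extract_audio_features(video_path, wav_path="audio.wav",
                           out_path="audio_features.arff",
                           smile_config="config/emobase2010.conf"):
    """Strip the audio track from a video and run openSMILE on it."""
    # Extract a mono 16 kHz WAV track from the video.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", "16000", wav_path],
        check=True,
    )
    # SMILExtract: -C selects the feature configuration,
    # -I the input audio file, -O the output feature file.
    subprocess.run(
        ["SMILExtract", "-C", smile_config, "-I", wav_path, "-O", out_path],
        check=True,
    )
    return out_path
```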