
VIDEO BASED EMOTION RECOGNITION

Submitted By:-

17103127 – B2 - ATHARVA TRIPATHI

17103292 - B7 - SUSHEN SHROTRIYA

17103021 - B1 - TRINENDRA MEHRIA


INTRODUCTION
Emotion recognition is the process of identifying human emotion. People vary
widely in their accuracy at recognizing the emotions of others. Use of technology to
help people with emotion recognition is a relatively nascent research area. Generally,
the technology works best if it uses multiple modalities in context. To date, the most
work has been conducted on automating the recognition of facial expressions from
video, spoken expressions from audio, written expressions from text, and physiology
as measured by wearables.

Decades of scientific research have been conducted developing and evaluating
methods for automated emotion recognition. There is now an extensive literature
proposing and evaluating hundreds of different kinds of methods, leveraging
techniques from multiple areas, such as signal processing, machine learning, computer
vision, and speech processing.

Emotion recognition from video is also more difficult than general video
recognition. Once emotions can be recognized reliably and understood well, this capability can
provide the same or even greater benefits than face recognition. Concealed emotions make the
potential benefit even larger: humans need rare, expert knowledge to recognize concealed
emotions, while machines could potentially perform this task easily, opening up new research areas.

In this work, we attempt to contribute further to the field of emotion recognition by
presenting our solution. Our work focuses on improving the accuracy of existing models
and then using that power to build a useful application. Our model will categorize
human emotions into distinct classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral),
as described below.
PROBLEM STATEMENT

Emotion recognition potentially has many applications in academia and industry, and
emotional intelligence is an important part of artificial intelligence. However, in contrast to
tasks such as face recognition (FR), emotion recognition has not yet become as widespread. We
believe the reason for this is that emotion recognition is much harder and requires more
research and effort to succeed.

Face recognition is also hard, but training data with clean ground-truth labels can be collected
more easily, and benchmarks are usually objective (i.e. we know the identity). In emotion recognition,
there is a lack of understanding and agreement about what the labels should be. This is evidenced by
the recent appearance of datasets with compound emotions [4] or with dominant and complementary
emotions [11]. There is also a lack of training data due to the difficulty of collecting rare emotions
(how often do you clearly show fear?).

Thus, as the above paragraphs state, our problem statement is confined to analyzing the
research papers and implementations proposed in this field and producing the model that
can attain the best possible accuracy. After successfully building such a model, which combines
the advantages of earlier research, we also aim to use it to build a useful application that
utilizes the model efficiently.

Future approaches will be aimed at using a combination of both voice and video to
predict the mood and emotion of a person, which will be more user friendly.
PAPER DETAILS

For each paper reviewed below, we summarize its description, the methods used, a description of the methods, the resulting accuracy, and the planned future work.

Paper 1: Video-Based Emotion Recognition using CNN-RNN and C3D Hybrid Networks, published by Yin Fan, Xiangju Lu, Dian Li, Yuanliu Liu at ICMI'16, November 2016.
Methods used: 1. CNN-RNN classifier; 2. C3D, a direct spatio-temporal model; 3. Audio features.
Method description: The CNN-RNN classifier is used after preprocessing the video sample into frames, with a pretrained model for facial recognition. The C3D network models both appearance and motion simultaneously.
Result: Classified samples into 7 different classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral) with an overall accuracy of 59.02%.
Future work: Enhance the dataset to increase efficiency and prediction accuracy.

Paper 2: Emotion Recognition in the Wild from Videos using Images, published by Sarah Adel Bargal, Emad Barsoum, Cristian Canton Ferrer, Cha Zhang at ICMI'16, November 2016.
Methods used: 1. CNN pre-processing; 2. Encoding of deep features: the fully connected layer 5 (fc5) of the VGG13 network, the fully connected layer 7 (fc7) of the VGG16 network, and the global pooling layer (pool) of the RESNET network.
Method description: Video sequences are encoded into a feature vector formed by computing and concatenating the mean, variance, minimum and maximum of each feature dimension (a sketch of this encoding appears after this summary). A CNN is used to predict the label, trained with a probabilistic label drawing process.
Result: Classified samples into 7 different classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral) with an overall accuracy of 59.42%.
Future work: Leverage the temporal relationship between video frames, and combine video emotion recognition with audio emotion recognition.

Paper 3: CNN pretrained on large face recognition datasets for emotion classification from video, by Boris Knyazev, Roman Shvetsov, Natalia Efremova, Artem Kuharenko, 2017.
Methods used: 1. Four convolutional neural networks; 2. VGG-Face, FR-Net-A, FR-Net-B, FR-Net-C; 3. Dlib; 4. Frame shuffling.
Method description: Frame faces are extracted using dlib, which is also used for face tracking and alignment. The four networks are fine-tuned to predict emotions from images obtained from video frames. First, features for all frames are computed using all four networks. A linear SVM is trained on the training data (one SVM per network) when reporting validation accuracy, and on training plus validation data when reporting test accuracy; the regularization constant of the SVMs is found by 5-fold cross-validation. 18 transformations are computed per frame and the features of these transformations are averaged.
Result: Classified samples into 7 different classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral) with an overall accuracy of 60.03%.
Future work: Audio features will be used to complement the models with an additional modality. The frame-level features computed with the networks will be made publicly available to help the research community by reducing the task of emotion recognition from video to learning from high-level features.

Paper 4: Video-based Emotion Recognition using Deeply Supervised Network, published by Yingruo Fan, Jacqueline C.K. Lam, Victor O.K. Li at ICMI'18, October 16-20, 2018, Boulder, CO, USA.
Methods used: 1. Deeply supervised CNN; 2. FR-Net, a facial recognition network trained on around six million images of human faces; 3. FFmpeg, a complete, cross-platform solution to record, convert and stream audio and video; 4. Dlib, an open-source image processing library.
Method description: Frames are extracted from the videos, and frames that do not contain faces are effectively filtered out. Face deformations are handled using dlib. The network effectively classifies the 7 classes of emotions.
Result: Classified samples into 7 different classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral) with an overall accuracy of 61.1%.
Future work: Combine this method with other emotion recognition methods based on other modalities, e.g. electroencephalogram (EEG) or heart rate, to develop a more complementary and high-level framework for emotion recognition.

Paper 5: Multi-Feature Based Emotion Recognition for Video Clips, published by Chuanhe Liu, Tianhao Tang, Kui Lv at ICMI'18, October 16-20, 2018, Boulder, CO, USA.
Methods used: 1. CNN feature based: pre-processing and feature extraction; 2. Landmark Euclidean Distance (LMED); 3. Temporal features: face tracking and alignment, VGG facial features; 4. Audio: SoundNet model.
Method description: Frame faces are extracted using MTCNN, with SDM used for face alignment. Four networks (Inception-V3, DenseNet-121, DenseNet-161, DenseNet-201) are fine-tuned to predict single static images. Face landmarks are critical high-level features on faces; 34 landmark-based features are obtained for each frame in the video, and this method achieved 39.95% accuracy on AFEW.
Result: Classified samples into 7 different classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral) with an overall accuracy of 61.8%.
Future work: Enhance the dataset to increase efficiency and prediction accuracy.
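
Two ideas recur in the method descriptions above: encoding a whole video as the concatenated mean, variance, minimum and maximum of its per-frame CNN features (Paper 2), and selecting the regularization constant of a linear SVM by 5-fold cross-validation (Paper 3). The following Python sketch illustrates both steps; the feature arrays, dimensions and parameter grid are placeholders, not the papers' actual features or settings.

# Minimal sketch: (1) encode a video as the concatenated mean/variance/min/max
# of its per-frame CNN features, (2) fit a linear SVM whose regularization
# constant C is chosen by 5-fold cross-validation. Placeholder data only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def encode_video(frame_features):
    """frame_features: (num_frames, feat_dim) array of per-frame CNN features."""
    return np.concatenate([
        frame_features.mean(axis=0),
        frame_features.var(axis=0),
        frame_features.min(axis=0),
        frame_features.max(axis=0),
    ])  # -> (4 * feat_dim,) video-level descriptor

# Hypothetical data: 200 videos, each with 30 frames of 512-dimensional features.
rng = np.random.default_rng(0)
X = np.stack([encode_video(rng.normal(size=(30, 512))) for _ in range(200)])
y = rng.integers(0, 7, size=200)  # 7 emotion classes

# Linear SVM; C selected by 5-fold cross-validation.
svm = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=5)
svm.fit(X, y)
print(svm.best_params_, svm.best_score_)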
Proposed Solution
Dataset:

AFEW (Acted Facial Expressions in the Wild) and RAF-DB (Real-world Affective Faces Database)
Dataset link - ‘https://sites.google.com/site/emotiwchallenge/’
Dataset link - ‘http://www.whdeng.cn/RAF/model1.html’

Acted Facial Expressions in the Wild (AFEW) is a dynamic temporal facial expression data
corpus, extracted from movies, whose conditions are close to real-world environments. It
consists of 773 training, 383 validation and 653 test video clips.

Real-world Affective Faces Database (RAF-DB) is a large-scale facial expression database with
around 30K highly diverse facial images downloaded from the Internet. Based on crowdsourced
annotation, each image has been independently labeled by about 40 annotators.
Images in this database vary greatly in the subjects' age, gender and ethnicity, head pose,
lighting conditions, occlusions (e.g. glasses, facial hair or self-occlusion), post-processing
operations (e.g. various filters and special effects), etc. RAF-DB has large diversity, large
quantity, and rich annotations, including:

• 29,672 real-world images,
• a 7-dimensional expression distribution vector for each image (illustrated in the sketch after this list),
• two different subsets: a single-label subset, including 7 classes of basic emotions, and a two-tab subset,
including 12 classes of compound emotions,
• 5 accurate landmark locations, 37 automatic landmark locations, bounding box, race, age range and
gender attribute annotations per image,
• baseline classifier outputs for basic emotions and compound emotions.
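
As a small illustration of how the 7-dimensional expression distribution vector relates to the single-label subset, the Python sketch below takes the argmax of a distribution vector to obtain one basic-emotion label. The class ordering used here is only an assumption for the sketch, not RAF-DB's official index order.

# Convert a 7-dimensional expression distribution vector into a single
# basic-emotion label by taking the argmax. Class order is assumed.
import numpy as np

BASIC_EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

def distribution_to_label(distribution):
    """distribution: length-7 vector of per-class annotator vote fractions."""
    return BASIC_EMOTIONS[int(np.argmax(np.asarray(distribution, dtype=float)))]

# Example: an image where most of the ~40 annotators voted "Happy".
print(distribution_to_label([0.02, 0.00, 0.03, 0.80, 0.05, 0.05, 0.05]))  # -> Happy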

Libraries Used:

• FFmpeg: FFmpeg is a free and open-source software suite of libraries and programs for
handling video, audio, and other multimedia files and streams. It is used to fragment the
video clips in our dataset into frames (see the sketch after this list).

• MTCNN: Multi-task Cascaded Convolutional Neural Networks for face detection,
based on TensorFlow.

• Dlib: Dlib is a modern C++ toolkit containing machine learning algorithms and tools for
creating complex software in C++ to solve real world problems.
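
Below is a minimal Python sketch of how these libraries could be combined in our preprocessing pipeline: FFmpeg fragments a clip into frames, and MTCNN detects and crops the faces in each frame. File paths and the sampling rate are illustrative assumptions, not fixed choices.

# Minimal preprocessing sketch: FFmpeg dumps frames, MTCNN crops faces.
import glob
import subprocess

import cv2
from mtcnn import MTCNN  # pip package "mtcnn" (TensorFlow-based)

def extract_frames(video_path, out_dir, fps=5):
    """Call the ffmpeg command-line tool to dump frames at a fixed rate."""
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{out_dir}/frame_%04d.png"],
        check=True,
    )

def detect_faces(frame_dir):
    """Return cropped face regions for every extracted frame."""
    detector = MTCNN()
    faces = []
    for path in sorted(glob.glob(f"{frame_dir}/frame_*.png")):
        img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
        for det in detector.detect_faces(img):
            x, y, w, h = det["box"]
            x, y = max(x, 0), max(y, 0)  # boxes can extend past the frame edge
            faces.append(img[y:y + h, x:x + w])
    return faces

# Example usage (hypothetical paths):
# extract_frames("afew/train/Angry/000123.avi", "frames/000123")
# face_crops = detect_faces("frames/000123")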
Model Used:

We will be using a multimodal system comprising 3 main parts:

1. SVM (Support Vector Machine): an SVM predictor will be applied to the input vector of
features to predict the outcome.

2. Deeply supervised CNN: applying a deeply supervised CNN (ResNet-50, VGG-Face) that takes
multi-level and multi-scale features extracted from the convolutional layers.

3. VGG-16 and LSTM: efficient for working in video-related domains, since the LSTM models the
temporal relationship between frame-level features. A sketch of this pipeline follows.
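
The following is a minimal sketch of the third part, assuming frames have already been extracted and cropped to 224x224 faces: a frozen ImageNet-pretrained VGG-16 produces one feature vector per frame, and an LSTM classifies the frame sequence into the 7 emotion classes. The sequence length and layer sizes are assumptions, not tuned values.

# Minimal VGG-16 + LSTM sketch for sequence-level emotion classification.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_FRAMES, NUM_CLASSES = 16, 7

# Frame-level feature extractor: VGG-16 without its classification head.
vgg = VGG16(weights="imagenet", include_top=False, pooling="avg",
            input_shape=(224, 224, 3))
vgg.trainable = False  # keep the pretrained weights frozen

model = models.Sequential([
    # TimeDistributed applies VGG-16 independently to each of the NUM_FRAMES frames.
    layers.TimeDistributed(vgg, input_shape=(NUM_FRAMES, 224, 224, 3)),
    layers.LSTM(128),                       # temporal modelling across frames
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(frame_sequences, one_hot_labels, ...)  # hypothetical training data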

Expected Result:

After completion of the multimodal system described above, we will be able to categorize
emotions into 7 distinct classes, namely: Angry, Disgust, Happy, Sad, Surprise, Fear and
Neutral.

Future Work:

One of the major goals is to embed voice along with video in the prediction of emotions. This will
not only improve the user experience but will also make the prediction and analysis of human
emotions much more accurate. The second goal is to make use of more recent advances to further
increase the accuracy.
REFERENCES
1. Video-Based Emotion Recognition using CNN-RNN and C3D Hybrid Networks, published by Yin
Fan, Xiangju Lu, Dian Li, Yuanliu Liu at ICMI'16, November 2016.

2. Emotion Recognition in the Wild from Videos using Images, published by Sarah Adel Bargal,
Emad Barsoum, Cristian Canton Ferrer, Cha Zhang at ICMI'16, November 2016.

3. Convolutional Neural Networks Pretrained on Large Face Recognition Datasets for Emotion
Classification from Video, by Boris Knyazev, Roman Shvetsov, Natalia Efremova, Artem Kuharenko,
at ICMI'17, November 2017.

4. Video-based Emotion Recognition using Deeply Supervised Network, published by Yingruo Fan,
Jacqueline C.K. Lam, Victor O.K. Li at ICMI'18, October 16-20, 2018, Boulder, CO, USA.

5. Multi-Feature Based Emotion Recognition for Video Clips, published by Chuanhe Liu, Tianhao
Tang, Kui Lv at ICMI'18, October 16-20, 2018, Boulder, CO, USA.
