Emotion recognition from video is more difficult than general video recognition. Once emotions can be recognized reliably and understood well, it can provide the same benefits as face recognition, or even more. Concealed emotions, in particular, could yield additional benefits: humans need rare, expert knowledge to recognize them, while machines could potentially perform this task easily, opening up new research areas.
Emotion recognition has many potential applications in academia and industry, and emotional intelligence is an important part of artificial intelligence. However, in contrast to tasks such as face recognition (FR), emotion recognition has not yet become widespread. We believe the reason is that emotion recognition is much harder and requires more research and effort to succeed.
Face recognition is also hard, but training data with clean ground-truth labels can be collected more easily, and benchmarks are usually objective (i.e., we know the identity). In emotion recognition, there is a lack of understanding of, and agreement on, what the labels should be. This is evidenced by the recent appearance of datasets with compound emotions [4] or with dominant and complementary emotions [11]. There is also a lack of training data due to the difficulty of collecting rare emotions (how often do you clearly show fear?).
As the paragraphs above indicate, our problem statement is therefore confined to analyzing the research papers and implementations proposed in this field and identifying the model that can attain the best possible accuracy. After successfully building such a model, which combines the advantages of earlier research, we will also use it to build a useful application that employs the model efficiently.
Future approaches will aim to combine both voice and video to predict a person's mood and emotion, which will be more user-friendly.
PAPER DETAILS
Paper: Multi-Feature Based Emotion Recognition for Video Clips, published by Chuanhe Liu, Tianhao Tang, Kui Lv at ICMI'18, October 16-20, 2018, Boulder, CO, USA.
Features used:
1. CNN features: pre-processing and feature extraction.
2. Landmark Euclidean Distance (LMED).
3. Temporal features: face tracking and alignment, VGG facial features.
4. Audio: SoundNet model.
Method: Frame faces are extracted using MTCNN, and SDM is used for face alignment. Four networks (Inception-V3, DenseNet-121, DenseNet-161, DenseNet-201) are fine-tuned to predict emotions from single static images. Face landmarks are critical high-level features on faces; for each frame in the video, 34 landmark features are obtained. This method achieved 39.95% accuracy on AFEW.
Result: Samples are classified into 7 classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral) with an overall accuracy of 61.8%.
Proposed improvement: Enhance the dataset to increase efficiency and prediction accuracy.
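To make the LMED idea concrete, below is a minimal Python sketch, assuming landmarks follow the common 68-point layout. The landmark pairs and the inter-ocular normalization are illustrative assumptions, not the exact 34 features used in the paper.

# Minimal sketch of Landmark Euclidean Distance (LMED) features.
import numpy as np

# Hypothetical landmark index pairs (eye corners, mouth corners, etc.)
# in the 68-point layout; NOT the exact pairs from the paper.
PAIRS = [(36, 45), (48, 54), (51, 57), (39, 42)]

def lmed_features(landmarks: np.ndarray) -> np.ndarray:
    """Distances between chosen landmark pairs, normalized by the
    inter-ocular distance to reduce face-scale effects."""
    inter_ocular = np.linalg.norm(landmarks[36] - landmarks[45])
    dists = [np.linalg.norm(landmarks[a] - landmarks[b]) for a, b in PAIRS]
    return np.asarray(dists) / max(inter_ocular, 1e-6)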
Proposed Solution
Datasets:
Acted Facial Expressions in the Wild (AFEW) is a dynamic temporal facial expressions data corpus, extracted from movies, that is close to a real-world environment. It consists of training (773), validation (383), and test (653) video clips.
Real-world Affective Faces Database (RAF-DB) is a large-scale facial expression database with around 30K highly diverse facial images downloaded from the Internet. Based on crowdsourced annotation, each image has been independently labeled by about 40 annotators. Images in this database vary greatly in subjects' age, gender, and ethnicity, head poses, lighting conditions, occlusions (e.g., glasses, facial hair, or self-occlusion), post-processing operations (e.g., various filters and special effects), etc. RAF-DB thus offers large diversity, large quantity, and rich annotations.
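Because each image carries about 40 independent votes, a single training label must be derived from them. The following minimal sketch shows one plausible way to do this by majority vote; the vote lists and file names are invented for illustration and do not reflect RAF-DB's actual annotation files.

from collections import Counter

# Illustrative per-image vote lists; RAF-DB distributes its
# annotations in a different format.
annotations = {
    "img_0001.jpg": ["Happy"] * 31 + ["Neutral"] * 6 + ["Surprise"] * 3,
}

labels = {img: Counter(votes).most_common(1)[0][0]
          for img, votes in annotations.items()}
print(labels)  # {'img_0001.jpg': 'Happy'}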
Libraries Used:
• FFmpeg: FFmpeg is a free and open-source library consisting of a vast suite of programs for handling video, audio, and other multimedia files and streams. It is used to fragment the video clips in our dataset into frames.
• Dlib: Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real-world problems.
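To show how these two libraries could slot into our pipeline, here is a minimal Python sketch. The file names, the frame rate, and the choice of Dlib's 68-point shape predictor (which must be downloaded separately) are our assumptions.

import subprocess
import dlib

# 1. Fragment a video clip into frames with FFmpeg (10 fps is illustrative).
subprocess.run(
    ["ffmpeg", "-i", "clip.avi", "-vf", "fps=10", "frames/frame_%04d.png"],
    check=True,
)

# 2. Detect the face and its 68 landmarks in one frame with Dlib.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_rgb_image("frames/frame_0001.png")
for face in detector(img):
    shape = predictor(img, face)
    points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    print(len(points), "landmarks detected")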
Models Used:
1. SVM (Support Vector Machines): An SVM predictor will be used on the input vector of extracted features to predict the outcome (see the sketch after this list).
2. Deeply supervised CNN: A deeply supervised CNN (ResNet-50, VGGFace) taking multi-level and multi-scale features extracted from the convolution layers.
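As a rough sketch of the SVM stage, assuming per-clip feature vectors (e.g., pooled CNN features concatenated with LMED features) and their labels have already been extracted and saved (the file names below are illustrative), scikit-learn's SVC could be used as follows.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

# Illustrative file names: (n_clips, n_features) features and
# integer labels in 0..6 indexing into EMOTIONS.
X = np.load("features.npy")
y = np.load("labels.npy")

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_val, y_val))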
Expected Result:
After completion of the multimodal system described above, we will be able to categorize emotions into 7 distinct classes, namely: Angry, Disgust, Happy, Sad, Surprise, Fear, and Neutral.
Future Work:
One major goal is to embed voice along with video in the prediction of emotions. This will not only improve the user experience but will also make the prediction and analysis of human emotions much more accurate. The second goal is to make use of more recent advances to further increase accuracy.
REFERENCES
1. Video-Based Emotion Recognition using CNN-RNN and C3D Hybrid Networks, published by Yin Fan, Xiangju Lu, Dian Li, Yuanliu Liu at ICMI'16, November 2016.
2. Emotion Recognition in the Wild from Videos using Images, published by Sarah Adel Bargal, Emad Barsoum, Cristian Canton Ferrer, Cha Zhang.
3. Convolutional Neural Networks Pretrained on Large Face Recognition Datasets for Emotion Classification from Video, by Boris Knyazev, Roman Shvetsov, Natalia Efremova, Artem Kuharenko, at ICMI'17, November 13, 2017.
4. Video-based Emotion Recognition using Deeply Supervised Networks, published by Yingruo Fan, Jacqueline C.K. Lam, Victor O.K. Li at ICMI'18, October 16-20, 2018, Boulder, CO, USA.
5. Multi-Feature Based Emotion Recognition for Video Clips, published by Chuanhe Liu, Tianhao Tang, Kui Lv at ICMI'18, October 16-20, 2018, Boulder, CO, USA.