
Deep Learning of Human Emotion Recognition in Videos

Yuqing Li
Uppsala University
Abstract
Machine learning in computer vision has made great progress in recent years. Tasks like object detection, object classification and image segmentation have reached near or even above human performance. Meanwhile, tasks like human emotion recognition remain challenging. In this paper, machine learning techniques are used to recognize human emotions in movie images and videos. First, the theoretical background of these techniques is introduced. Secondly, informative content including audio, single video frames and multiple video frames is extracted from videos to represent emotions. In this step, OpenSMILE and the Inception-ResNet-v2 model are used to extract feature vectors from audio and frames separately. Thirdly, various models are trained to classify the emotions: an SVM is used to classify audio feature vectors, Inception-ResNet-v2 is used to recognize emotions in static images, and a C3D model is used to classify sequences of frames (videos). After that, the accuracy of these models is reported. Finally, the advantages and disadvantages of these models are discussed, as well as possible improvements for future studies on human emotion recognition.
Contents

1 Introduction
  1.1 Background
  1.2 Previous research
    1.2.1 Emotion categories
    1.2.2 Data set
    1.2.3 Hand-crafted features
    1.2.4 Deep features
  1.3 Problem Formulation
2 Theory
  2.1 Artificial Neural Network
    2.1.1 Architecture
    2.1.2 Neurons
    2.1.3 Training process
  2.2 Convolutional Neural Network
    2.2.1 Convolutional layer and feature map
    2.2.2 Pooling layer
    2.2.3 C3D
  2.3 Deep Neural Network
    2.3.1 Batch Normalization
    2.3.2 Residual Learning
  2.4 RNN
    2.4.1 LSTM
  2.5 Transfer Learning
3 Methodology
  3.1 Data
  3.2 Pre-processing
  3.3 Feature Extraction
    3.3.1 Audio Features
    3.3.2 CNN Deep Features
  3.4 Model Training
    3.4.1 SVM for Audios
    3.4.2 LSTM Model
    3.4.3 C3D
4 Results
  4.1 On Audios
  4.2 Inception ResNet V2 On Static Images
    4.2.1 Testing on SFEW
    4.2.2 Failed Images
  4.3 On Videos
    4.3.1 LSTM
    4.3.2 C3D
5 Discussion
  5.1 Conclusion
    5.1.1 Audios
    5.1.2 Image model
    5.1.3 Video models
  5.2 Future Work
References
1. Introduction

1.1 Background
In recent years, thanks to the rapid development of computer vision and machine learning, tasks like object classification, action recognition, and face recognition have produced fruitful achievements. However, human emotion recognition remains one of the most challenging tasks, and a lot of effort has been made to solve this problem. Since the first Emotion Recognition in the Wild (EmotiW) challenge was held in 2013, the classification accuracy of video emotion classification has increased from the 38% baseline to 59%[2]. Great progress has been made, but the results are still unsatisfying. On the one hand, this is probably due to the lack of labeled video data and the ambiguous nature of human facial expressions. On the other hand, the lack of effective ways to extract facial emotion features also affects model performance. In recent years, pretrained deep convolutional neural networks have been proven to perform well at extracting image features in challenging databases such as ImageNet[1]; the Long Short-Term Memory (LSTM) network shows exciting prediction accuracy when analyzing sequential data[6]; and the three-dimensional convolutional neural network (C3D) achieves high performance in video action detection[2]. Thus, applying these new techniques and combining them may boost the accuracy of human emotion recognition in videos.

1.2 Previous research


The study of automatic human facial emotion recognition started with defining and categorizing human facial expressions. After that, researchers built databases containing labelled facial expression examples. Finally, various approaches have been used to recognize human emotions.

1.2.1 Emotion categories


The study of facial emotion recognition can be traced back to the 1970s. Paul Ekman and his colleagues[7] found that there are six facial expressions (happiness, sadness, anger, fear, surprise, disgust) that can be understood by people from different cultures. Differences in background influence facial expressions mainly in intensity[10]. For example, when watching the same comedy film, Americans tend to laugh with their mouths wide open while Japanese are more likely to smile without showing their teeth. The observation that infants are able to show a wide range of facial expressions and respond to facial expressions from others without being taught suggests that the ability to deliver and understand emotions via facial expressions is inherent in humans.

1.2.2 Data set


Several data sets have been established to build emotion recognition models and evaluate their performance. The same method can yield dramatically different results on different data sets, due to the variance between them. Facial emotion databases can be divided into two categories: lab data sets like the Cohn-Kanade CK+ database[16] and wild data sets like Acted Facial Expressions in the Wild (AFEW)[5]. In the latter, facial expression images or videos are taken from movies and online videos, which differ significantly in resolution, illumination, head pose and so on, while most lab data sets control all these factors carefully. Figure 1.1 shows the difference between the two kinds of data sets.

Figure 1.1. The Cohn-Kanade CK+ database (above) has frontal facial images with stable illumination. The Facial Expression Recognition 2013 (FER-2013) data set (below) has images cropped from movies that vary in head posture and illumination.

1.2.3 Hand-crafted features


There are two approaches to hand-craft facial features from original images/videos: the geometric-feature approach and the appearance approach.

Geometric-feature-based methods extract information about facial components and their movements, imitating how humans understand facial emotions. One example is the Facial Action Coding System (FACS). In order to describe facial expressions precisely, FACS breaks each facial expression down into several Action Units (AU), each representing a facial muscular movement[7]. Based on FACS, categorization of facial expressions was conducted by recognizing certain facial movements[19]. Before the 1990s, encoding facial expressions using FACS was done manually and thus was very inefficient. Moreover, geometric-feature-based methods are highly dependent on the accuracy of facial component recognition and tracking, which makes them less reliable than appearance-based methods.

Computers became part of the game in the 1990s. Since then, appearance-based methods have become quite popular. Optical flow (OF)[26], 2D Fourier transform coefficients[15], Local Binary Patterns (LBP)[21] and facial motions[8] were popular new features.

Among these features, optical flow captures the movement of surfaces and objects in video; the 2D Fourier transform converts spatial-domain information into the frequency domain, which allows researchers to decrease the dimension of an image/video significantly; LBP, on the other hand, mainly compares a pixel with its nearby pixels and encodes the unique spatial pattern.

New classification models also contributed to the task. The Hidden Markov model, a simplified Bayesian network that aims to discover the hidden pattern of features, was able to classify facial expressions in near real-time[17].

1.2.4 Deep features


There are plenty of challenges in the Computer Vision (CV) area besides human facial emotion recognition, and there are data sets and competitions built for these tasks. The most famous are the ImageNet database and the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), focusing on object recognition in images. ImageNet contains over 10 million images in around 1000 classes.

Since AlexNet, a convolutional neural network with five convolutional layers, proved successful in ILSVRC 2012, deep neural networks, which can extract more complex features, came into popularity[13] in all CV research areas. For a convolutional neural network, the convolutional layers are considered to be the feature extractors while the fully connected layers are considered classifiers. If the network consists of several convolutional layers, the output of the last convolutional layer is called a deep feature. For a deep network, multiple convolutional layers mean a large solution domain, so deep features (the output tensor of the last non-classification layer) have higher dimensions and contain more information from the input image. Consequently, deep features have been used in emotion recognition tasks and have significantly improved classification results.

Similar to ImageNet and ILSVRC, the facial emotion recognition area has the Emotion Recognition in the Wild (EmotiW) challenge and the databases designed for it. The challenge was first held in 2013 with two databases, Acted Facial Expressions in the Wild (AFEW) and Static Facial Expressions in the Wild (SFEW). The baseline accuracy was 38%, with Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) as features and a Support Vector Machine (SVM) as classifier[4]. In the following years of the competition, solutions utilizing deep pretrained neural networks to extract image features and Long Short-Term Memory (LSTM) networks to take temporal influence into account proved efficient with limited labeled data[6]. The winning team of EmotiW 2016 successfully implemented a three-dimensional convolutional neural network (C3D) and achieved the best performance with an accuracy of 59%[2].

In this research, the databases from EmotiW will be used to train the models, and the baseline and competition results will be used to evaluate the performance of the models trained in this work.

1.3 Problem Formulation


This project consists of several sub-problems that need to be solved:
1. Compare various emotion features extracted from videos.
2. Evaluate the Inception-ResNet-v2 model's performance on human facial emotion recognition.
3. Evaluate the performance of the C3D model.

2. Theory

This chapter covers the theoretical background of the models and methods implemented in this project to extract deep features and classify video emotions.

2.1 Artificial Neural Network


An Artificial Neural Network (ANN) is a computational model consisting of a collection of artificial neurons as its basic computation units. There is a set of variants of the ANN, such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). Based on different architectures and neurons, ANNs can be used to solve different problems.

2.1.1 Architecture
The structure of an ANN is determined by two factors. The first is how many layers the ANN has and how many neurons each layer has. The second is how information/inputs are transferred through the ANN.
For the former factor, the more layers an ANN has, the deeper it is, while the more neurons each layer has, the fatter it is. More neurons mean a larger solution domain at the cost of longer training time. With limited computational power and time (a limited number of weights that can be trained), thinner and deeper ANNs have been shown to perform better[20].
For the latter factor, the number of possible ways to connect neurons is enormous, and most of them are not feasible to train at the moment. Of all the feasible networks, the most commonly used and typical ones are the fully connected feed-forward network and the recurrent neural network (RNN), as shown in Figure 2.1.

2.1.2 Neurons
Neurons in an ANN work in a similar way to neuron cells in animal brains. Neuron cells receive stimuli, process them and produce feedback. Artificial neurons do exactly the same thing by summing their inputs, adding a bias and using an activation function to decide the response, as shown in Figure 2.2.
Figure 2.1. The feed-forward network is shown on the left and the RNN on the right. As the names indicate, neurons in a fully connected feed-forward network only take the outputs of the neurons in the previous layer as their input; the flow of information is unidirectional. Meanwhile, a neuron's input in an RNN may also come from other neurons in the same layer.

Figure 2.2. Artificial neurons.

Mathematically speaking, the operations in a neuron can be summarized as below:

y = ϕ(∑_i w_i x_i + a)    (2.1)

where x_i is an input, w_i is the weight of x_i, a is the bias added in this neuron and ϕ is the activation function.

Activation function
The activation function determines the output of a neuron. There are quite a lot of commonly used activation functions, including the logistic function (sigmoid function), the hyperbolic tangent function (tanh function), the ramp function (ReLU) and the normalized exponential function (softmax function). These activation functions work as filters, deciding whether the information will be passed on and how strong the signal will be.
For instance, a node with ReLU as its activation function can be written as below:

y = max(∑_i w_i x_i + a, 0)    (2.2)

If the sum of the weighted inputs is larger than zero, the signal is passed on without changing its intensity. Otherwise, the signal vanishes.
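As a concrete illustration, the following minimal NumPy sketch implements the ReLU neuron of Equation 2.2; the input, weight and bias values are arbitrary examples and are not taken from any model in this thesis.

    import numpy as np

    def relu_neuron(x, w, a):
        """Single artificial neuron: weighted sum plus bias, passed through ReLU (Eq. 2.2)."""
        z = np.dot(w, x) + a          # sum_i w_i * x_i + a
        return max(z, 0.0)            # ReLU: pass the signal on only if it is positive

    # Example with arbitrary values
    x = np.array([0.5, -1.2, 3.0])    # inputs
    w = np.array([0.8, 0.1, -0.4])    # weights
    print(relu_neuron(x, w, a=0.2))   # -> 0.0, since the weighted sum is negative here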

2.1.3 Training process


The training process of an ANN is a process of finding a value for each parameter in the ANN so that the output of the ANN is optimal. This process involves three problems: what initial values should be given to the parameters, how to update the parameters, and how to define the optimal output.

Initial parameters
For smaller ANNs, the initial values of the parameters are normally set to a number between 0 and 1 or to random numbers in a certain range generated by the computer. However, this approach is reported to perform poorly in deep neural networks[9], and it takes more time for networks to converge even in "shallow" networks. In some extreme cases, an ANN with poor initial parameters is not able to converge at all. Besides this approach, parameters can also be initialized with values from pretrained models, as explained in Section 2.5.

Loss function
Loss functions are used to calculate the distance between the actual value (target) and the output value. Take the mean squared error (MSE) as an example: Loss = (1/n) ∑_{i=1}^{n} (y_i − p_i)², where y_i is the actual label/value and p_i is the prediction of the model. The aim of training is to decrease the loss as much as possible.
In order to achieve this goal, the choice of loss function plays an important role. Cross entropy, Loss = −(1/n) ∑_{i=1}^{n} y_i ln p_i, is capable of representing the loss properly when the output layer is a softmax layer, as shown in Figure 2.3.

Figure 2.3. Cross entropy (black) and squared error (red) of a two-layer network. W1 and W2 are the weights of the first and second layer.
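A minimal NumPy sketch of the two loss functions discussed above; it assumes the network output is a softmax probability vector and the target is one-hot encoded, and uses arbitrary example values.

    import numpy as np

    def mse(y_true, y_pred):
        """Mean squared error between targets and predictions."""
        return np.mean((y_true - y_pred) ** 2)

    def cross_entropy(y_true, y_pred, eps=1e-12):
        """Cross entropy for one-hot targets and softmax outputs."""
        return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=-1))

    y_true = np.array([[0., 1., 0.]])        # one-hot target
    y_pred = np.array([[0.2, 0.7, 0.1]])     # softmax output
    print(mse(y_true, y_pred), cross_entropy(y_true, y_pred))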

Gradient descent and learning


With parameters and a loss function, the mechanism linking them together is gradient descent, a method that guides how to approximate the optimal values of the parameters. There are plenty of optimization methods, but they are all based on gradient descent. With L as the loss, w as a weight and µ as the learning rate (a speed set manually, normally less than 0.1), gradient descent in a one-layer network includes the following steps (a minimal sketch is given after the list):
1. Compute the derivative of L with respect to w, ∂L/∂w.
2. Update w by adding −µ ∂L/∂w.
3. Repeat steps 1-2 until ∂L/∂w is approximately 0.
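A minimal NumPy sketch of these three steps for a one-layer linear model trained with MSE loss; the data and learning rate are arbitrary example values.

    import numpy as np

    # Toy data: one-layer linear model p = w * x, trained with MSE loss
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.0, 6.0, 8.0])      # true relation y = 2x
    w, mu = 0.0, 0.05                       # initial weight and learning rate

    for step in range(200):
        p = w * x                           # forward pass
        grad = np.mean(2 * (p - y) * x)     # step 1: dL/dw for the MSE loss
        w -= mu * grad                      # step 2: w <- w - mu * dL/dw
        if abs(grad) < 1e-6:                # step 3: stop when the gradient is ~0
            break

    print(w)                                # converges towards 2.0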
In order to simplify the calculation, back-propagation (BP) is introduced. With h_i as the output of layer i, w_i as the weights of layer i, b_i as the bias of layer i and L as the loss, we have h_i = w_i h_{i−1} + b_i. Thus:

∂L/∂w_i = (∂L/∂h_i)(∂h_i/∂w_i) = (∂L/∂h_i) h_{i−1}    (2.3)

where

∂L/∂h_i = (∂L/∂h_{i+1})(∂h_{i+1}/∂h_i) = (∂L/∂h_{i+1}) w_{i+1} = (∂L/∂h_l) ∏_{t=i+1}^{l} w_t    (2.4)

2.2 Convolutional Neural Network
A CNN is a type of feed-forward ANN inspired by the animal visual cortex and is known for outstanding performance in image classification. Compared to regular fully connected feed-forward ANNs, CNNs are much easier to train due to sparse connectivity and shared weights. Sparse connectivity means that each neuron in a convolutional layer only takes a certain number of output values from the previous layer, instead of all of them as in fully connected ANNs. Meanwhile, CNNs also share weights among hidden layers, which means that inputs at different locations are filtered by the same learned kernels. These two features dramatically decrease the number of parameters that need to be trained. Figure 2.4 shows LeNet-5, a simple convolutional neural network designed for handwritten and machine-printed character recognition.

Figure 2.4. Structure of LeNet-5. Each plane is a feature map.

2.2.1 Convolutional layer and feature map


Feature maps, as shown in Figure 2.4, are the results of applying functions across sub-regions of the entire image. The operations in a convolutional layer are listed below (a sketch follows the list):
1. Convolve the input image f_{m×n} with a linear filter g_{p×q}. Mathematically, the 2-dimensional convolution o_{st} of image f_{m×n} at location (s, t) is:

   o_{st} = f[p, q]_{st} ∗ g[p, q] = ∑_{u=0}^{p} ∑_{v=0}^{q} f[u, v] g[p − u, q − v]    (2.5)

2. Add a bias b to o_{st}.
3. Apply a non-linear function ϕ (activation function) to o_{st} + b.
4. Add the stride to change the location (the values of s, t) and repeat steps 1-3 until all required locations are exhausted.
5. Change the filter and repeat steps 1-4 until all filters are exhausted.
The number of output feature maps is equal to the number of filters in the convolutional layer.
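A minimal NumPy sketch of steps 1-4 for a single filter with stride 1: a valid 2D convolution followed by a bias and ReLU, producing one feature map. The input image and filter values are arbitrary examples.

    import numpy as np

    def conv2d_feature_map(image, kernel, bias=0.0):
        """Valid 2D convolution of one image with one filter, then bias and ReLU."""
        kernel = np.flipud(np.fliplr(kernel))          # flip the filter for true convolution (Eq. 2.5)
        m, n = image.shape
        p, q = kernel.shape
        out = np.zeros((m - p + 1, n - q + 1))
        for s in range(out.shape[0]):                  # stride 1 over all locations (s, t)
            for t in range(out.shape[1]):
                out[s, t] = np.sum(image[s:s + p, t:t + q] * kernel) + bias
        return np.maximum(out, 0.0)                    # ReLU activation

    image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
    kernel = np.array([[1., 0.], [0., -1.]])           # toy 2x2 filter
    print(conv2d_feature_map(image, kernel).shape)     # -> (4, 4) feature map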

2.2.2 Pooling layer
Pooling layers are used for down-sampling in CNNs. Down-sampling or sub-sampling decreases the size of the feature maps. Of the different kinds of pooling methods, average pooling and max pooling are the most commonly used. The benefit of pooling layers lies not only in reducing the number of dimensions, which lowers the computation cost, but also in controlling overfitting. The process of pooling is shown in Figure 2.5 and sketched in code after the figure.

Figure 2.5. Pooling
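A minimal NumPy sketch of 2×2 max pooling with stride 2, the most common down-sampling configuration; the feature-map values are arbitrary.

    import numpy as np

    def max_pool2x2(fmap):
        """2x2 max pooling with stride 2: keep the largest value in each 2x2 block."""
        h, w = fmap.shape
        fmap = fmap[:h - h % 2, :w - w % 2]            # drop odd borders if necessary
        blocks = fmap.reshape(h // 2, 2, w // 2, 2)
        return blocks.max(axis=(1, 3))

    fmap = np.array([[1., 3., 2., 0.],
                     [4., 6., 1., 1.],
                     [0., 2., 5., 7.],
                     [1., 1., 8., 2.]])
    print(max_pool2x2(fmap))   # -> [[6. 2.] [2. 8.]]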

2.2.3 C3D
Three-dimensional convolutional neural networks are a special kind of convolutional neural network that performs convolution over three dimensions. These networks extract features not only from the spatial dimension (images) but also integrate information from the temporal dimension (video), as shown in Figure 2.6. In a 2D CNN all filters are two-dimensional, while in C3D all filters are 3D filters. C3D has shown good performance (82.3% top-1 accuracy) on UCF101, a data set of 101 human action classes from videos in the wild[24].

Figure 2.6. 3D Convolutional Neural Network.

2.3 Deep Neural Network


Deep neural networks (DNN) have shown state-of-the-art accuracy on many challenging databases such as ImageNet due to their larger feature space and solution space[9]. However, the training process of a DNN is much trickier compared to "shallow" networks, since deeper networks are more likely to suffer from the vanishing gradient problem and the exploding gradient problem[9]. As shown in Equation 2.3 and Equation 2.4, if the network goes deeper, the product of the weights becomes so influential that the value of ∂L/∂w_i will be extremely large when the absolute values of the weights are larger than 1 (exploding gradient problem) and extremely small when they are smaller than 1 (vanishing gradient problem). In this case, the loss of the model will oscillate instead of decreasing. Batch normalization and residual learning are two methods to solve these problems.

2.3.1 Batch Normalization


Batch normalization was first introduced in [12]. One successful application is the Inception network by Google. As indicated by its name, batch normalization normalizes data, specifically layer inputs. Batch normalization makes higher learning rates possible, which accelerates the learning process of a DNN and prevents the vanishing and exploding gradient problems[12].
For each mini-batch x_1, x_2, ..., x_m, first calculate the mean µ and the variance σ²:

µ = (1/m) ∑_{i=1}^{m} x_i    (2.6)

σ² = (1/m) ∑_{i=1}^{m} (x_i − µ)²    (2.7)

Then normalize x_1, x_2, ..., x_m, using a small number ε in case σ = 0:

x̂_i = (x_i − µ) / √(σ² + ε)    (2.8)
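A minimal NumPy sketch of Equations 2.6-2.8, normalizing each feature over a mini-batch; the learnable scale and shift parameters that usually follow in a batch normalization layer are omitted here for brevity.

    import numpy as np

    def batch_norm(x, eps=1e-5):
        """Normalize each feature over the mini-batch (Eqs. 2.6-2.8)."""
        mu = x.mean(axis=0)                  # per-feature mean over the batch
        var = x.var(axis=0)                  # per-feature variance over the batch
        return (x - mu) / np.sqrt(var + eps)

    batch = np.random.randn(32, 8) * 5 + 3   # 32 samples, 8 features, shifted and scaled
    normed = batch_norm(batch)
    print(normed.mean(axis=0).round(3), normed.std(axis=0).round(3))   # ~0 and ~1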

2.3.2 Residual Learning


Residual learning is a deep neural network framework first introduced in [11]. It enables DNNs to be trained with high accuracy and takes less time to converge[11]. The idea of residual learning is quite simple: instead of mapping inputs to outputs directly with stacked layers, a residual network uses the stacked layers to fit the residual F(x) = H(x) − x and recovers the desired mapping as H(x) = F(x) + x. The comparison is shown in Figure 2.7, and a minimal sketch follows it.

Figure 2.7. H(x) is any desired mapping. The plain net hopes the 2 weight layers fit H(x), while the residual net hopes the 2 weight layers fit F(x) and lets H(x) = F(x) + x.
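The skip connection can be written in a few lines: the stacked layers compute the residual F(x) and the input is added back to form the output. The two-layer F below uses arbitrary random weights purely for illustration.

    import numpy as np

    def residual_block(x, w1, w2):
        """y = F(x) + x, where F is a small stack of weight layers learning the residual."""
        f = np.maximum(x @ w1, 0.0) @ w2     # F(x): two weight layers with ReLU in between
        return f + x                         # skip connection adds the input back

    x = np.random.randn(4, 16)               # 4 samples, 16 features
    w1 = np.random.randn(16, 16) * 0.01
    w2 = np.random.randn(16, 16) * 0.01
    print(residual_block(x, w1, w2).shape)   # -> (4, 16), i.e. H(x) = F(x) + x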

2.4 RNN
As shown in Figure 2.1, recurrent neural networks are networks that have connections between neurons within hidden layers. This makes it possible for an RNN to handle sequential information, where the order of the inputs matters and the meaning of the data depends on its "context". While CNNs share weights by using filters in the spatial dimension, RNNs share weights by using the same function to handle information at different times in the time domain.

Figure 2.8. A standard RNN neuron. RNNs share weights across sequential data.

Figure 2.8 shows a simple example of a neuron in a standard RNN. With sequential data x_1, x_2, ..., x_m, the output of the node for input x_t is h_t:

h_t = F_θ(h_{t−1}, x_t)    (2.9)

For all x_t and h_t, the parameters of F_θ are the same.

2.4.1 LSTM
Long Short-Term Memory networks are a special kind of RNN with a different and more complex structure for their neural cells. An LSTM neuron has three gates: an input gate, a forget gate and an output gate.

Figure 2.9. An LSTM neuron.

Inside an LSTM neuron, three things need to be decided. First, how "clearly" the new information should be remembered. Secondly, how much of the previous memory should be forgotten. Thirdly, what signal should be passed on to influence other neurons.
With sequential data x_1, x_2, ..., x_m, the influence of x_t on x_{t+1} is h_t, and c_t is the cell state after processing x_t. The first step is:

i_t = σ(W_i [x_t, h_{t−1}] + b_i)    (2.10)

where σ is a sigmoid function producing a number in [0, 1]; 0 means the information conveyed by x_t and h_{t−1} will not be used at all and 1 means it will be kept entirely. The new information is then preprocessed:

c̃_t = tanh(W_c [x_t, h_{t−1}] + b_c)    (2.11)

i_t ◦ c̃_t is the new part that needs to be added to the existing memory.
While adding the new part is necessary, the old information in the memory will also be affected by new signals. The forget gate works as:

f_t = σ(W_f [x_t, h_{t−1}] + b_f)    (2.12)

The range of f_t is [0, 1]; 0 means all the old memory will be removed and 1 means it will be kept entirely. Thus, the memory at step t can be inferred:

c_t = f_t ◦ c_{t−1} + i_t ◦ c̃_t    (2.13)

Finally, the output o_t for the next layer and the message h_t passed from x_t to x_{t+1} are calculated:

o_t = σ(W_o [x_t, h_{t−1}] + b_o)    (2.14)

h_t = o_t ∗ tanh(c_t)    (2.15)
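A minimal NumPy sketch of one LSTM step implementing Equations 2.10-2.15. The weight matrices act on the concatenation [x_t, h_{t−1}]; their values are random here purely for illustration, and the dimensions are arbitrary.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_c, b_i, b_f, b_o, b_c):
        """One LSTM step (Eqs. 2.10-2.15) on the concatenated input [x_t, h_{t-1}]."""
        z = np.concatenate([x_t, h_prev])
        i_t = sigmoid(W_i @ z + b_i)              # input gate (Eq. 2.10)
        c_tilde = np.tanh(W_c @ z + b_c)          # candidate memory (Eq. 2.11)
        f_t = sigmoid(W_f @ z + b_f)              # forget gate (Eq. 2.12)
        c_t = f_t * c_prev + i_t * c_tilde        # new cell state (Eq. 2.13)
        o_t = sigmoid(W_o @ z + b_o)              # output gate (Eq. 2.14)
        h_t = o_t * np.tanh(c_t)                  # message to the next step (Eq. 2.15)
        return h_t, c_t

    d_x, d_h = 8, 16
    rng = np.random.default_rng(0)
    W = {k: rng.standard_normal((d_h, d_x + d_h)) * 0.1 for k in "ifoc"}
    b = {k: np.zeros(d_h) for k in "ifoc"}
    h, c = np.zeros(d_h), np.zeros(d_h)
    for x_t in rng.standard_normal((5, d_x)):     # a toy sequence of 5 steps
        h, c = lstm_step(x_t, h, c, W["i"], W["f"], W["o"], W["c"],
                         b["i"], b["f"], b["o"], b["c"])
    print(h.shape)                                # -> (16,)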

2.5 Transfer Learning


Transfer learning describes the fact that humans can apply knowledge learned in one field to another field to generate better results. In machine learning, predictions on new data are based on a statistical model trained with previously collected data. Once the problem domain, task or data distribution changes, models need to be retrained from scratch with the relevant data. In real-world applications, data collection is extremely resource-consuming, and so is the training process. In order to solve this problem, data scientists started to apply the transfer learning observed in humans to machine learning. Compared to the traditional learning approach, transfer learning allows knowledge learned from previous tasks to be used in the new target task[18]. The comparison between learning from scratch and transfer learning is shown in Figure 2.10.

Figure 2.10. Different learning approaches: traditional learning (left) and transfer learning (right).

3. Methodology

In this chapter, the specific approach is described. It can be divided into three parts: data collection and pre-processing, feature extraction, and classification and evaluation.
This research involves three models. The first is the audio-SVM model: audio features are extracted with OpenSMILE and classified with an SVM. The second is the CNN-LSTM model, which involves three steps: train the feature extractor, a CNN (the Inception-ResNet-v2 model); use the CNN to extract deep features from face images cropped from video frames; and use an LSTM as classifier to integrate the deep features and classify the emotion. The third is the video-C3D model, which uses face frames from videos as input and the C3D model as both feature extractor and classifier.

3.1 Data
Three data sets are involved in the training and evaluation of the models. The first is AFEW 6.0, the second is Static Facial Expressions in the Wild (SFEW) and the third is the Facial Expression Recognition 2013 data set (FER2013). FER2013 is used to train the CNN (Inception-ResNet-v2) model and SFEW is used to evaluate it. AFEW 6.0 is used for both training (60%) and evaluating (40%) the SVM, LSTM and C3D models.

AFEW 6.0
AFEW 6.0 is a data set consisting of video clips collected from movies and reality TV shows. It is the newest version of the AFEW data set; compared to AFEW 1.0-5.0, reality TV show clips are newly added. There are 1750 video clips in the data set, originally divided into 774 training videos, 383 validation videos and 593 test videos. Each clip is labelled with exactly one of the seven universally recognized emotions, as shown in Table 3.1. Due to the EmotiW contest, the labels of the test videos were not available when this project was conducted.
All videos have a frame rate of 25 fps (25 frames per second) and a resolution of 720*576.

Emotion            Angry  Disgusted  Fear  Happy  Sad  Surprise  Neutral
Number of videos     197        114   127    213  178       120      207

Table 3.1. Emotion distribution of the AFEW 6.0 data set for training and validation.

SFEW
SFEW is a data set consisting of images collected from movie frames, each with a label from the seven emotions. There are 861 labelled images in total in the training and validation sets. The distribution of the SFEW data set is shown in Table 3.2. All images are movie frames of 720*567 resolution.

Emotion            Angry  Disgusted  Fear  Happy  Sad  Surprise  Neutral
Number of images     153         53    91    188  143        86      147

Table 3.2. Emotion distribution of the SFEW data set for training and validation.

FER2013
The FER2013 database is an image data set containing 35889 48*48-pixel gray-scale facial expression images labelled with the seven universal emotions above. The facial expression images in FER2013 are also gathered from a wild environment (movies), and thus features learned from it can be applied to the AFEW 6.0 data set. The distribution of images across categories is shown in Table 3.3.

Emotion            Angry  Disgusted  Fear  Happy   Sad  Surprise  Neutral
Number of images    4952        546  5120   8988  4829      6076     6197

Table 3.3. Emotion distribution of the FER2013 data set.

3.2 Pre-processing
Videos contain rich information. However, in order to train the models more efficiently, only the audio and the facial crops from video frames are used in training.

AFEW 6.0
For the AFEW 6.0 data set, ffmpeg, a cross-platform open-source audio-video processing framework, is used to extract audio files and video frames. All video clips are around 2 seconds long and have a frame rate of 25 frames per second (25 fps). Since not every frame in a video contains at least one human face, in order to get enough cropped faces to analyze temporal information, all frames were extracted from the videos and the Dlib frontal face detector was used to crop the largest face in each frame. In the end, all faces are resized to two standard sizes, 48*48 and 299*299, and converted to gray-scale images. The gray-scale facial images and the audio from the videos are used as input for further feature extraction, as shown in Figure 3.1; a code sketch of this procedure follows the figure.

Figure 3.1. Pre-process of AFEW 6.0 data
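The sketch below illustrates the pre-processing pipeline described above; the file paths are placeholders, ffmpeg is invoked through subprocess, and Dlib's frontal face detector with OpenCV handles the cropping, gray-scale conversion and resizing. It is a sketch of the procedure, not the exact script used in this work.

    import glob
    import os
    import subprocess

    import cv2
    import dlib

    clip = "video_clip.avi"                                    # placeholder path to one AFEW clip
    os.makedirs("frames", exist_ok=True)
    subprocess.run(["ffmpeg", "-i", clip, "-vn", "audio.wav"], check=True)   # extract the audio track
    subprocess.run(["ffmpeg", "-i", clip, "frames/%04d.png"], check=True)    # extract every frame

    detector = dlib.get_frontal_face_detector()
    for path in sorted(glob.glob("frames/*.png")):
        gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if not faces:
            continue                                           # skip frames without a detected face
        face = max(faces, key=lambda r: r.width() * r.height())   # keep the largest detected face
        crop = gray[max(face.top(), 0):face.bottom(), max(face.left(), 0):face.right()]
        for size in (48, 299):                                 # the two standard sizes used later
            cv2.imwrite(f"{path[:-4]}_face_{size}.png", cv2.resize(crop, (size, size)))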

SFEW
For the SFEW data set, the Dlib frontal face detector is also used to crop the largest face in each frame. Cropped faces are then resized to 299*299 and converted to gray-scale images for later use.

FER2013
For the FER2013 data set, all faces are resized to 299*299.

Normalization
In addition, all faces are linearly normalized to decrease the influence of illumination. For a pixel at location (x, y) with intensity value I(x,y), where I_max and I_min are the largest and smallest intensity values of the original image, the normalized intensity value I'(x,y) is calculated as:

I'(x,y) = (I(x,y) − I_min) / (I_max − I_min) * 255    (3.1)
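A minimal NumPy sketch of Equation 3.1; a small guard is added for the degenerate case of a constant image, which the equation itself does not cover.

    import numpy as np

    def linear_normalize(img):
        """Stretch pixel intensities to the full [0, 255] range (Eq. 3.1)."""
        img = img.astype(np.float64)
        i_min, i_max = img.min(), img.max()
        if i_max == i_min:                       # constant image: nothing to stretch
            return np.zeros_like(img, dtype=np.uint8)
        out = (img - i_min) / (i_max - i_min) * 255.0
        return out.astype(np.uint8)

    face = np.random.randint(40, 180, size=(48, 48))            # toy low-contrast image
    print(linear_normalize(face).min(), linear_normalize(face).max())   # -> 0 255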

3.3 Feature Extraction


Before training the classifiers, features must be extracted. In this research, audio features are extracted using OpenSMILE, while facial image features are extracted with a pretrained and fine-tuned deep CNN model.

3.3.1 Audio Features
The OpenSMILE (Speech and Music Interpretation by Large-space Extraction) feature extraction toolkit is used to extract audio features. OpenSMILE can extract audio low-level descriptors such as Mel-frequency cepstral coefficients, loudness, perceptual linear predictive cepstral coefficients, line spectral frequencies and formant frequencies. The extracted feature of a 2-second audio clip is a 1582-dimensional feature vector.
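OpenSMILE is driven by a configuration file; the 1582-dimensional vector corresponds to the INTERSPEECH 2010 paralinguistic challenge feature set. The sketch below calls the SMILExtract command-line tool from Python; the config and file paths are assumptions and depend on the local openSMILE installation.

    import subprocess

    def extract_audio_features(wav_path, out_csv,
                               config="opensmile/config/IS10_paraling.conf"):
        """Run SMILExtract to compute the 1582-dimensional feature vector for one audio file."""
        subprocess.run(
            ["SMILExtract", "-C", config, "-I", wav_path, "-O", out_csv],
            check=True,
        )

    extract_audio_features("audio.wav", "audio_features.csv")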

3.3.2 CNN Deep Features


Inception ResNet V2 network
The Inception-ResNet-v2 network pretrained on the ImageNet database is used for deep feature extraction. Inception-ResNet-v2 is a deep convolutional neural network utilizing residual connections and the Inception deep convolutional architecture[22], as shown in Figure 3.2 and Figure 3.4.

Figure 3.2. The structure of the Inception-ResNet-v2 network. Detailed structures of the repeating blocks are shown in Figure 3.4 and Figure 3.3.

Inception-ResNet-v2 has, so far, the highest classification accuracy (top-1 accuracy: 80.4%, top-5 accuracy: 95.3%)[22] on the ILSVRC image classification benchmark. Consequently, it can be considered sufficiently capable of extracting features from images of varied content. Thus, the parameter values of Inception-ResNet-v2 pretrained on ImageNet are used as the initial values of the Inception-ResNet-v2 model in this project.
Figure 3.3. On the left is the Stem block in Figure 3.2. On the right are block A' (below) and block B' (above) in Figure 3.2.

Figure 3.4. The structure of the Inception-ResNet-v2 network blocks. The block on the left is block A in Figure 3.2, the block in the middle is block B, and the block on the right is block C.

However, since ImageNet has 1000 classes while the 7 universal emotions are used as classes in this research, the parameters of the fully connected layers in the ImageNet-pretrained model are not suitable for this model. All parameters in the 'AuxLogits' block and the 'Logits' block are therefore generated randomly instead of being restored from the pretrained model.

Fine-tune
In order to enhance the ability of the Inception-ResNet-v2 model to extract facial expression features, the Facial Expression Recognition 2013 data set (FER2013) is used to fine-tune the Inception-ResNet-v2 model pretrained on ImageNet.
All layers of the pretrained Inception-ResNet-v2 model were tuned with the FER2013 database. The FER2013 data set is divided into a training set of 28709 images and a validation set of 7188 images. The learning rate is set to 0.01 for steps 1-30000; then, with a learning rate of 0.0001, the model is fine-tuned for another 2000 steps.
After fine-tuning, this model is able to classify facial emotions from static images. The output layer produces the classification result, while the outputs of the convolutional layers serve as deep emotional features of the facial images. An illustrative sketch of this setup follows.
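The thesis uses the TensorFlow-Slim checkpoint of Inception-ResNet-v2. As an illustration only, an equivalent setup can be sketched with the Keras application of the same architecture, dropping the 1000-class ImageNet head and attaching a randomly initialized 7-class softmax; the dropout rate and optimizer settings below are assumptions, not the exact training script used in this work.

    import tensorflow as tf

    # ImageNet-pretrained backbone without the 1000-class head
    base = tf.keras.applications.InceptionResNetV2(
        weights="imagenet", include_top=False, input_shape=(299, 299, 3))

    # New, randomly initialized classification head for the 7 emotions
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    x = tf.keras.layers.Dropout(0.2)(x)
    probs = tf.keras.layers.Dense(7, activation="softmax")(x)
    model = tf.keras.Model(base.input, probs)

    # All layers are trainable, as in the fine-tuning described above
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(fer2013_train, epochs=..., validation_data=fer2013_val)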

3.4 Model Training


3.4.1 SVM for Audios
Input
The audio features explained in Section 3.3.1 will be used as model input.

Model details
After extracting the 1582-dimensional audio features, an SVM is trained as the classifier. Scikit-learn, an open-source machine learning library, is used to train this SVM model. To be specific, Classification SVM type 1 (C-SVM) is used. Training a C-SVM is a process of minimizing the error function:

(1/2) wᵀw + C ∑_{i=1}^{N} ε_i    (3.2)

with the constraints:

y_i (wᵀ Φ(x_i) + b) ≥ 1 − ε_i    (3.3)

ε_i ≥ 0, i = 1, ..., N    (3.4)

where y_i is the class label, x_i is the input data, w is the coefficient vector, b is a bias, ε_i is the slack variable of sample i, and C is the capacity constant. The mapping Φ, implicitly defined by the kernel, transforms the input data into the feature space.

Training Method
To find the optimal parameter set, 10-fold cross-validation is used. For the linear kernel, the possible values of C are 1, 10 and 100. For the radial basis function (rbf) kernel, the possible values of C are 1, 10, 100 and 1000, and the possible values of gamma are 0.01, 0.001 and 0.0001.
The training set of AFEW is used for training, and the validation set of AFEW is used to evaluate the performance of all SVM models. The best-performing combination of parameters is chosen as the model to test. A sketch of this grid search is shown below.
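A sketch of the parameter search with scikit-learn's GridSearchCV; the placeholder arrays stand in for the 1582-dimensional OpenSMILE features and the AFEW training labels, which are assumed to be loaded elsewhere.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Placeholder data: replace with the 1582-dim OpenSMILE features and AFEW labels
    X_train = np.random.randn(140, 1582)
    y_train = np.repeat(np.arange(7), 20)        # 7 emotion classes, 20 samples each

    param_grid = [
        {"kernel": ["linear"], "C": [1, 10, 100]},
        {"kernel": ["rbf"], "C": [1, 10, 100, 1000], "gamma": [0.01, 0.001, 0.0001]},
    ]
    search = GridSearchCV(SVC(), param_grid, cv=10, n_jobs=-1)   # 10-fold cross-validation
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)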

Output
The output of the SVM model is a single number indicating the predicted class.

3.4.2 LSTM Model


The deep feature of a facial image (single frame) is extracted from the output of the last flatten layer, which is the input of the dropout layer (the purple layer in Figure 3.2). The deep feature of a single facial image is a 1536-dimensional vector. For each video, 16 facial images are used to build the input feature.
For videos in the AFEW database, each video contains around 50 frames. However, not every frame contains a face, and in some cases the Dlib frontal face detector is not able to locate faces due to head posture, illuminance and so on. Thus, it is common that some videos have more than 16 faces while others have fewer. For videos with x faces and x ≥ 16, a random number s between 0 and x − 16 is generated and the deep features from face s to face s + 15 are used as input to the LSTM model. For videos with x faces and x < 16, padding is needed: a random face selected from the existing faces is duplicated and added to the face collection until the number of faces reaches 16. A sketch of this procedure follows.
Thus, the input of the LSTM is a 16*1536-dimensional feature.
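The selection and padding procedure can be sketched as follows; `features` is assumed to be the list of 1536-dimensional deep feature vectors extracted from the detected faces of one video, in temporal order. Duplicated faces are appended at the end here, which is a simplification.

    import random
    import numpy as np

    def build_lstm_input(features, seq_len=16):
        """Return a (16, 1536) array of deep features for one video."""
        feats = list(features)
        if len(feats) >= seq_len:
            s = random.randint(0, len(feats) - seq_len)      # random starting face s
            feats = feats[s:s + seq_len]                     # 16 consecutive faces
        else:
            while len(feats) < seq_len:                      # pad by duplicating random faces
                feats.append(random.choice(feats))
        return np.stack(feats)

    video_features = [np.random.randn(1536) for _ in range(11)]   # toy video with 11 detected faces
    print(build_lstm_input(video_features).shape)                 # -> (16, 1536)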

Model Details
A one-layer LSTM model is used.

Training Method
The learning rate of the LSTM is set to 0.001 with 40000 training iterations and a batch size of 32.

Output
The output of the LSTM model is seven logits indicating the likelihood of each emotion.

3.4.3 C3D
The input of the C3D model consists of cropped faces of size 48*48, described in Section 3.2. As explained in Section 3.4.2, the number of faces found in a video varies, so a similar method is used to produce fixed-size inputs for C3D. For videos with more than 16 faces, 16 sequential faces are chosen randomly as input. For videos with at least one face but fewer than 16 faces, the padding technique from Section 3.4.2 is used. Videos with no detected face are removed from the training set.

Model Details
There are 7 hidden layers in the C3D model, as shown in Figure 3.5. The first five layers are convolutional layers that extract video features; their kernels are all 3*3*3 with stride 1 in every dimension. The last 2 layers are fully connected layers used for classification.
This model structure has been proven efficient at classifying videos in the UCF101 database, a database of 101 different kinds of human actions (101 classes), where it achieved an accuracy of 72.6%[25]. A hedged sketch of such a network is given below.
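The Keras sketch below builds a C3D-style network matching the description above, i.e. five 3*3*3 convolutional layers with stride 1 followed by two fully connected layers. The filter counts, pooling configuration and hidden size are assumptions, since the exact values are fixed by Figure 3.5 in the original work.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_c3d(num_classes=7, frames=16, size=48):
        """C3D-style model: 5 Conv3D layers (3x3x3 kernels, stride 1) + 2 fully connected layers."""
        inputs = tf.keras.Input(shape=(frames, size, size, 1))    # 16 gray-scale 48x48 face crops
        x = inputs
        for filters, pool in [(64, (1, 2, 2)), (128, (2, 2, 2)), (256, (2, 2, 2)),
                              (256, (2, 2, 2)), (256, (2, 2, 2))]:
            x = layers.Conv3D(filters, (3, 3, 3), strides=1, padding="same", activation="relu")(x)
            x = layers.MaxPooling3D(pool_size=pool, padding="same")(x)
        x = layers.Flatten()(x)
        x = layers.Dense(2048, activation="relu")(x)              # first fully connected layer
        outputs = layers.Dense(num_classes)(x)                    # seven logits, one per emotion
        return tf.keras.Model(inputs, outputs)

    model = build_c3d()
    model.summary()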

Output
The output is seven logits indicating the likelihood of each emotion.

Figure 3.5. The structure of the C3D model.
4. Results

In this chapter, the performance of each model is evaluated independently. The accuracy of all models is reported, and confusion matrices for the Inception-ResNet-v2, LSTM and C3D models are displayed as well. The learning processes of the Inception-ResNet-v2, LSTM and C3D models are also shown to provide more detail.

4.1 On Audios
Of the 15 SVM models trained, the linear ones have better accuracy, as shown in Table 4.1. Moreover, the accuracy does not vary much with the parameters C and gamma. After the grid search, the linear SVM model with C = 1 is chosen as the classifier due to its accuracy and efficiency.

kernel \ C              1      10     100    1000
linear                0.229   0.229   0.229    -
rbf  gamma=0.01       0.194   0.194   0.194   0.194
rbf  gamma=0.001      0.194   0.194   0.194   0.194
rbf  gamma=0.0001     0.193   0.193   0.193   0.194

Table 4.1. The accuracy of SVM models with different parameters.

With the model trained on the AFEW 6.0 training set, the AFEW 6.0 validation set is used for evaluation, as shown in Table 4.2. The accuracy of the SVM model on the validation set is 25%. The classifier is better at angry videos, with an f1-score of 0.37, while its performance on disgust and surprise videos is much worse than average.

            precision  recall  f1-score  support
angry          0.31     0.44     0.37       64
disgust        0.07     0.10     0.08       40
fear           0.24     0.22     0.23       46
happy          0.24     0.30     0.27       63
sad            0.31     0.20     0.24       61
surprise       0.15     0.09     0.11       46
neutral        0.31     0.24     0.27       63
total          0.25     0.24     0.24      383

Table 4.2. Accuracy of the audio SVM model tested on the AFEW validation set.

4.2 Inception ResNet V2 On Static Images
The Inception-ResNet-v2 model is first fine-tuned with FER2013. The training process is shown in Figure 4.1. The total number of training steps is 32000; after 30000 steps, the learning rate is lowered to 0.0001 to fine-tune the model. After 20000 steps of training, the loss tends to be stable and the accuracy stops improving.

Figure 4.1. Details of fine-tuning: (a) training loss, (b) learning rate.

After fine-tuning, the accuracy of the model on the validation set of FER2013 is shown in Table 4.3. The model performs better on the disgusted, happy and surprise emotions; the high accuracy on disgusted images is notable given how few training examples that class has (see Table 3.3). Overall, the accuracy is 60% on 5740 images from FER2013.

Emotion    Angry  Disgusted  Fear  Happy  Sad  Surprise  Neutral  Overall
Accuracy    54%        80%   48%    78%   49%      67%      48%      60%

Table 4.3. Accuracy on the FER2013 test set.

4.2.1 Testing on SFEW


The accuracy of the Inception-ResNet-v2 model shown in Table 4.3 is based on training and testing data that both come from the FER2013 database and are thus cropped with the same method. However, this model is meant to recognize facial images from movies cropped by the Dlib frontal face detector. Different cropping methods may produce different facial images and may affect the prediction results, so testing the model on SFEW is necessary. All pre-processed labelled images from SFEW, described in Section 3.2, are used to test the fine-tuned model.
The confusion matrix of the fine-tuned Inception-ResNet-v2 model on SFEW is shown in Table 4.4.
Label \ Prediction   Angry  Disgust  Fear  Happy   Sad  Surprise  Neutral
Angry                 42.5      0.0  15.7   16.3  11.8       5.9      7.8
Disgust               24.5      0.0  13.2   24.5  15.1       0.0     22.6
Fear                   6.6      0.0  22.0    9.9  22.0      24.2     15.4
Happy                  1.6      0.0   0.5   86.7   4.8       0.5      5.9
Sad                    0.7      0.0  22.4   11.9  43.4       4.9     16.8
Surprise               5.8      0.0  16.3    7.0  14.0      41.9     15.1
Neutral                4.8      0.0  12.9    4.1  25.9       6.8     45.6

Table 4.4. Results on static facial expressions (percentages per row).

A total of 861 images were tested and 413 images were correctly predicted. The overall accuracy across all emotions is 47.97%. As shown in Table 4.4, the Inception-ResNet-v2 model retains satisfying performance on happy images. However, it completely fails to recognize disgusted images, with an accuracy of 0.

4.2.2 Failed Images


In Figure 4.2, some of the failed images from the SFEW database are listed. They are chosen from the movie 'The Hangover'. The prediction of the model and the ground-truth label of each image are given in the caption.

Figure 4.2. Failed images from the SFEW data set. (a) Prediction: Sad. Ground truth: Neutral. (b) Prediction: Sad. Ground truth: Angry. (c) Prediction: Neutral. Ground truth: Disgust. (d) Prediction: Surprise. Ground truth: Fear. (e) Prediction: Angry. Ground truth: Surprise. (f) Prediction: Happy. Ground truth: Fear.

4.3 On Videos
4.3.1 LSTM
The validation set of AFEW 6.0 is used to test the accuracy of the LSTM model. The loss decreases dramatically during the first 5000 training steps, as shown in Figure 4.3, and declines slowly during the rest of training. In the end, it fluctuates around 1.3.
While the loss decreases, the accuracy increases during training, as shown in Figure 4.4. The training accuracy and testing accuracy both increase during the first 5000 steps of training; however, further training fails to increase the accuracy on the testing set.
The confusion matrix of the LSTM model is shown in Table 4.5. The LSTM model is more capable of classifying the angry and happy emotions, while it largely fails at the rest. 41.67% of the videos are classified as neutral, which indicates that overfitting might exist.

Label \ Prediction   Angry  Disgust  Fear  Happy  Sad  Surprise  Neutral
Angry                   24        1     4      1    0         0        8
Disgust                  8        1     1      3    1         3       14
Fear                     9        1     3      1    1         0       12
Happy                    6        0     2     28    0         2       10
Sad                      9        1     5      0    1         0       14
Surprise                 4        1     4      4    5        18       25
Neutral                  4        1     2      3    1         1       32

Table 4.5. The confusion matrix of the LSTM model.

Figure 4.3. The training loss of the LSTM model.

Figure 4.4. The training accuracy and testing accuracy of the LSTM model.
4.3.2 C3D
Two C3D models are trained at different learning rates.

C3D-1
C3D-1 is the first C3D model trained. It is trained with a learning rate of 0.01, which decays every 2700 steps with a decay rate of 0.1, as shown in Figure 4.5. The loss decreases dramatically during the first 1000 training steps, decreases slowly during the following 9000 steps and finally fluctuates slightly around 1.9. The accuracy of the model increases fast during the first 3000 training steps, but the training accuracy is not stable at all. After smoothing the curve, we can see that the accuracy stays around 22%.
After training, the validation set of AFEW 6.0 is used to evaluate the accuracy of C3D-1. The confusion matrix is shown in Table 4.6. All videos are labelled as happy, which shows that the model overfits the training data completely.

Label \ Prediction   Angry  Disgust  Fear  Happy  Sad  Surprise  Neutral
Angry                    0        0     0     48    0         0        0
Disgust                  0        0     0     34    0         0        0
Fear                     0        0     0     32    0         0        0
Happy                    0        0     0     53    0         0        0
Sad                      0        0     0     30    0         0        0
Surprise                 0        0     0     45    0         0        0
Neutral                  0        0     0     39    0         0        0

Table 4.6. C3D-1 results on the AFEW 6.0 validation set.

The overall accuracy of C3D-1 on the AFEW 6.0 validation set is 18.86%; 53 videos out of 281 are correctly labelled.

C3D-2
In order to further decrease the possibility of overfitting, another C3D model is trained; its training process is shown in Figure 4.6. C3D-2 is trained with a smaller learning rate and more training steps. Ideally, it should be able to avoid overfitting and converge more slowly.
As shown in Figure 4.6, the learning rate of C3D-2 is set to 0.00001 and it decays every 1600 steps with a decay rate of 0.1. The average loss drops fast to around 3.3 during the beginning phase of training but remains the same in the following training steps. After the learning rate drops below 0.000001, the learning of the model can be considered stopped. Both the loss and the training accuracy remain very unstable. By the end of the training process, the training accuracy is around 19%.
Figure 4.5. Details of the C3D-1 training process: (a) learning rate, (b) training loss, (c) training accuracy.

Figure 4.6. Details of the C3D-2 training process: (a) learning rate, (b) training loss, (c) training accuracy.

After training, the validation set of AFEW 6.0 is used to evaluate the accuracy of C3D-2. The confusion matrix is shown in Table 4.7. No videos are labelled as disgust or fear, and some videos are labelled as surprise correctly. Most videos are classified as happy, which clearly still indicates overfitting.

Label \ Prediction   Angry  Disgust  Fear  Happy  Sad  Surprise  Neutral
Angry                    4        0     0     33    0         7        4
Disgust                  1        0     0     26    0         6        1
Fear                     3        0     0     17    0        10        2
Happy                    4        0     0     35    0         8        6
Sad                      0        0     0     35    0         0        0
Surprise                 0        0     0     30    0         8        7
Neutral                  2        0     0     24    0         6        7

Table 4.7. C3D-2 results on the AFEW 6.0 validation set.

The overall accuracy of C3D-2 on the AFEW 6.0 validation set is 21.7%; 61 videos out of 281 are correctly labelled.

5. Discussion

In this chapter, conclusions are drawn based on a comparison with the EmotiW competitors' work, and future work is proposed based on these conclusions.

5.1 Conclusion
5.1.1 Audios
The SVM model shows a weak ability to distinguish certain emotions. Compared to the 14.3% accuracy obtained by randomly assigning an emotion to a given video, the model has a better accuracy of 25%.
As shown in Table 4.2, the SVM model is more capable of distinguishing the angry emotion than any other, with an f1-score of 0.37, much higher than the average f1-score of 0.24. This might be due to the fact that most angry videos involve someone shouting loudly, which makes it easy for the classifier to find a key feature.
On the other hand, the SVM model is not capable of classifying disgust and surprise. Hardly any sounds are particularly "surprised" even to human ears, and there is not enough audio for the model to find the key features of disgust.
Besides, the audio within a certain category can vary even more than audio across categories. For instance, some surprise audio clips are so quiet that they could be considered neutral, while others are so noisy that they could be classified as angry or fear.
It is difficult to compare this audio-SVM model with state-of-the-art work, since the EmotiW competitors did not provide the precise accuracy of their audio models. Overall, the audio model has a limited ability to classify emotions, and with a 1582-dimensional feature for a 2-second audio clip, it is neither efficient nor precise.

5.1.2 Image model


The Inception-ResNet-v2 model shows a good ability to classify static facial emotions under challenging conditions. On the FER2013 database, the accuracy is 60%, 6% lower than the state-of-the-art performance[14]. On the SFEW database, the accuracy is 47.97%, an increase of 9% over the image baseline, while the state-of-the-art result on SFEW from the winning team of EmotiW 2015 is 61.6%[6].
As shown in Table 4.3 and Table 4.4, the deeper model did not improve performance significantly, but it generalizes well across databases. The reasons why the Inception-ResNet-v2 model could not outperform previous models might be the different cropping method, the varied input sizes and the different training method.
The state-of-the-art result on SFEW uses faces cropped with the Viola-Jones face detector, while in this research the Dlib frontal face detector is used, which means faces at certain angles might not be detected.
The FER2013 images are 48*48 gray-scale images, whereas the input size of the pretrained Inception-ResNet-v2 model is 299*299*3. After resizing all images to 299*299, every pixel of a FER2013 image is stretched over an area of at least 6*6 pixels. This prevents the first several layers of the Inception-ResNet-v2 model from extracting much useful information, since the stride of those layers is 1 or 2 and the filter sizes are 3*3 or 1*1 pixels; unless a filter crosses the border between two 6*6-pixel areas, it can hardly extract any useful information. On the other hand, the cropped faces from the SFEW database are of different sizes, many of them large and full of details, and they are RGB color images. The Inception-ResNet-v2 model trained with FER2013 is not able to exploit those image details.
Besides, ImageNet is a database of both colored and gray-scale images. Without color information in the FER2013 database, the full power of the Inception-ResNet-v2 model is not utilized, and neither are the pretrained parameters of the model.

5.1.3 Video models


LSTM
The LSTM model shows a good ability to integrate temporal information. As shown in Section 4.3.1, the accuracy of the LSTM model on AFEW 6.0 is 41.67%, an increase of 4% over the baseline[5] but still 3.67% lower than the state of the art[2]. Compared to the state-of-the-art research, this LSTM model uses less training data, a different cropping method and different input features.
The state-of-the-art result using the same method has more training data[2]: both the training set and the validation set of AFEW 6.0 were used for training, with the test set of AFEW 6.0 used for testing. In this research, only the training set is used for training, since the labels of the test set were not available at the time.
Also, the state-of-the-art research uses the faces provided with AFEW 6.0, which are cropped with the Viola-Jones face detector, and then applies a face classifier developed to remove non-faces. In this research, faces are cropped with the Dlib frontal face detector and no face classifier is developed, due to time constraints.
Besides, the input features of the models differ. The state-of-the-art model uses deep features extracted by VGG16-FACE, a CNN model pretrained with a face database and FER2013. Even though VGG16-FACE has a much higher accuracy (70.74%) on the FER2013 database, the outcome of the LSTM model is not very different from this research, which indicates that the Inception-ResNet-v2 model has an ability to extract emotion features similar to that of VGG16-FACE.
In all, LSTM works well with deep CNN features. The accuracy of the LSTM model on AFEW 6.0 is not far below that of Inception-ResNet-v2 on SFEW, even though videos carry much more information.

C3D
Both C3D models failed badly. The performance is poor and the C3D models
are difficult to train. Both C3D models overfit the training set, even though
C3D-2 has an extremely small learning rate (0.00001) that decays quickly.
The accuracy of C3D-2 is almost the same as that of C3D-1, which labels every video
as happy. The accuracy of C3D is 20% lower than the state of the art. This might
result from camera movement.
C3D has shown promising results on the UCF101 database, which contains 101
categories of human activities, including putting on make-up and playing basketball.
Seemingly, since human activities involve more complex situations and movements
of multiple parts of the human body, and the C3D model is able to classify these
activities, it should be capable of dealing with 7 classes of human facial movement.
This ignores the fact that UCF101 only consists of videos shot without camera
movement, while the AFEW 6.0 videos are cut from movies and contain a lot of
camera movement within 2 seconds. As a result, the C3D models have a hard time
capturing the movement of an object or pixel. Not to mention that a wrongly cropped
non-face image, as a wrong input, will completely ruin the prediction of the C3D
model.
On the other hand, the state-of-the-art research achieves an accuracy of 39.69%
on AFEW 6.0 with a C3D model because they not only apply a face classifier to
remove all non-face images from the input, but also transform the cropped faces with
different head poses into frontal faces and rescale them to a standardized size with
their facial key points at similar spots on each image. In other words, they stabilize
the moving camera.
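To illustrate why unstabilized camera motion is so damaging, the sketch below shows
a minimal C3D-style block in Keras: the 3*3*3 kernels convolve jointly over time and
space, so a global shift of the whole frame is indistinguishable from object motion.
The input shape follows the spirit of C3D[24], but the layer count is a simplified
assumption, not the exact architecture trained here:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(16, 112, 112, 3)),     # 16 frames of 112x112 RGB
    layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),  # pool spatially only at first
    layers.Conv3D(128, (3, 3, 3), padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),  # later pools also cover time
    layers.Flatten(),
    layers.Dense(7, activation="softmax"),     # the seven emotion classes
])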

Overall
Among all video information, cropped faces are the most informative when it
comes to classifying human emotions. The Inception-ResNet-v2 model proves its
outstanding ability to extract image features, including facial emotion features.
When taking temporal information into account, the CNN-LSTM approach performs
better and is easier to train compared to the C3D model.

5.2 Future Work


Emotion recognition in videos is a challenging task involving many sub-tasks,
including face detection, face tracking, face recognition, 3D transformation and
others. Any improvement in those sub-tasks will benefit the study of the main task.
Due to the limited time and resources of this project, none of these sub-tasks reaches
the state of the art. Of all the sub-tasks mentioned above, a better face detection
technique combined with face recognition might improve the results significantly,
by distinguishing the different individuals in one video and labelling their emotions
accordingly.
Also, in this research, audio is only used to extract features from a signal-processing
perspective. However, if used to generate subtitles or to recognize the environment,
audio could be a feature as effective as facial emotions. The loudness or frequency
of the sound carries less meaning than its content. For instance, gunshots in the
background are quite common in movies and apparently provide decisive information.
Finally, more labelled data is required for a more reliable model. Emotion is
complex in its very nature, its solution domain is very large, and enough data is
needed to learn it. In an earlier stage of this research, a lot of animated images were
downloaded from giphy.com with emotions as keywords, but since there are so many
images to be checked and labelled manually, none of them were used in the training
process. If possible, having more correctly labelled data from the wild would boost
the accuracy.
Overall, there is still plenty of improvement to be made in emotion recognition,
and hopefully, one day, it can be implemented in real-life scenarios.

References

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
ImageNet: A large-scale hierarchical image database. In Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255.
IEEE, 2009.
[2] Abhinav Dhall, Roland Goecke, Jyoti Joshi, Jesse Hoey, and Tom Gedeon.
EmotiW 2016: Video and group-level emotion recognition challenges. In
Proceedings of the 18th ACM International Conference on Multimodal
Interaction, pages 427–432. ACM, 2016.
[3] Abhinav Dhall, Roland Goecke, Jyoti Joshi, Karan Sikka, and Tom Gedeon.
Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In
Proceedings of the 16th International Conference on Multimodal Interaction,
pages 461–466. ACM, 2014.
[4] Abhinav Dhall, Roland Goecke, Jyoti Joshi, Michael Wagner, and Tom Gedeon.
Emotion recognition in the wild challenge (EmotiW) challenge and workshop
summary. In Proceedings of the 15th ACM on International conference on
multimodal interaction, pages 371–372. ACM, 2013.
[5] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Static facial
expression analysis in tough conditions: Data, evaluation protocol and
benchmark. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE
International Conference on, pages 2106–2112. IEEE, 2011.
[6] Abhinav Dhall, OV Ramana Murthy, Roland Goecke, Jyoti Joshi, and Tom
Gedeon. Video and image based emotion recognition challenges in the wild:
EmotiW 2015. In Proceedings of the 2015 ACM on International Conference on
Multimodal Interaction, pages 423–426. ACM, 2015.
[7] Paul Ekman and Wallace V Friesen. Facial action coding system. 1977.
[8] Beat Fasel and Juergen Luettin. Automatic facial expression analysis: a survey.
Pattern recognition, 36(1):259–275, 2003.
[9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep
feedforward neural networks. In Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[10] Hatice Gunes and Massimo Piccardi. Bi-modal emotion recognition from
expressive face and body gestures. Journal of Network and Computer
Applications, 30(4):1334–1345, 2007.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep
network training by reducing internal covariate shift. In International
Conference on Machine Learning, pages 448–456, 2015.
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long,
Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional
architecture for fast feature embedding. In Proceedings of the 22nd ACM
international conference on Multimedia, pages 675–678. ACM, 2014.
[14] Bo-Kyeong Kim, Hwaran Lee, Jihyeon Roh, and Soo-Young Lee. Hierarchical
committee of deep CNNs with exponentially-weighted decision fusion for static
facial expression recognition. In Proceedings of the 2015 ACM on International
Conference on Multimodal Interaction, pages 427–434. ACM, 2015.
[15] Jian Huang Lai, Pong C Yuen, and Guo Can Feng. Face recognition using
holistic fourier invariant features. Pattern Recognition, 34(1):95–109, 2001.
[16] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar,
and Iain Matthews. The extended Cohn-Kanade dataset (CK+): A complete
dataset for action unit and emotion-specified expression. In Computer Vision
and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society
Conference on, pages 94–101. IEEE, 2010.
[17] Ara V Nefian and Monson H Hayes. Hidden Markov models for face
recognition. In Acoustics, Speech and Signal Processing, 1998. Proceedings of
the 1998 IEEE International Conference on, volume 5, pages 2721–2724. IEEE,
1998.
[18] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE
Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
[19] Stephen M Platt and Norman I Badler. Animating facial expressions. In ACM
SIGGRAPH computer graphics, volume 15, pages 245–252. ACM, 1981.
[20] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using
context-dependent deep neural networks. In Twelfth Annual Conference of the
International Speech Communication Association, 2011.
[21] Caifeng Shan, Shaogang Gong, and Peter W McOwan. Facial expression
recognition based on local binary patterns: A comprehensive study. Image and
Vision Computing, 27(6):803–816, 2009.
[22] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi.
Inception-v4, Inception-ResNet and the impact of residual connections on
learning. In AAAI, pages 4278–4284, 2017.
[23] Y-I Tian, Takeo Kanade, and Jeffrey F Cohn. Recognizing action units for facial
expression analysis. IEEE Transactions on pattern analysis and machine
intelligence, 23(2):97–115, 2001.
[24] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar
Paluri. Learning spatiotemporal features with 3d convolutional networks. In
Proceedings of the IEEE international conference on computer vision, pages
4489–4497, 2015.
[25] Du Tran, Lubomir D Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar
Paluri. C3D: Generic features for video analysis. CoRR, abs/1412.0767, 2(7):8,
2014.
[26] Yaser Yacoob and Larry S Davis. Recognizing human facial expressions from
long image sequences using optical flow. IEEE Transactions on pattern
analysis and machine intelligence, 18(6):636–642, 1996.
