Yuqing Li
Uppsala University
Abstract
Machine learning in computer vision has made great progress in recent years. Tasks such as object detection, object classification and image segmentation have reached near-human or even above-human performance. Meanwhile, tasks such as human emotion recognition remain challenging. In this paper, machine learning techniques are used to recognize human emotions in movie images and videos. First, the theoretical background of these techniques is introduced. Secondly, informative content, including audio, single video frames and sequences of video frames, is extracted from videos to represent emotions. In this step, OpenSMILE and the Inception-ResNet-v2 model are used to extract feature vectors from audio and frames respectively. Thirdly, various models are trained to classify the emotions. An SVM is used to classify audio feature vectors, Inception-ResNet-v2 is used to recognize emotions in static images, and a C3D model is used to classify sequences of frames (video). After that, the accuracy of these models is reported. Finally, the advantages and disadvantages of these models are discussed, together with possible improvements for future studies on human emotion recognition.
Contents
1 Introduction
1.1 Background
1.2 Previous research
1.2.1 Emotion categories
1.2.2 Data set
1.2.3 Hand-crafted features
1.2.4 Deep features
1.3 Problem Formulation
2 Theory
2.1 Artificial Neural Network
2.1.1 Architecture
2.1.2 Neurons
2.1.3 Training process
2.2 Convolutional Neural Network
2.2.1 Convolutional layer and feature map
2.2.2 Pooling layer
2.2.3 C3D
2.3 Deep Neural Network
2.3.1 Batch Normalization
2.3.2 Residual Learning
2.4 RNN
2.4.1 LSTM
2.5 Transfer Learning
3 Methodology
3.1 Data
3.2 Pre-processing
3.3 Feature Extraction
3.3.1 Audio Features
3.3.2 CNN Deep Features
3.4 Model Training
3.4.1 SVM for Audios
3.4.2 LSTM Model
3.4.3 C3D
4 Results
4.1 On Audios
4.2 Inception ResNet V2 On Static Images
4.2.1 Testing on SFEW
4.2.2 Failed Images
4.3 On Videos
4.3.1 LSTM
4.3.2 C3D
5 Discussion
5.1 Conclusion
5.1.1 Audios
5.1.2 Image model
5.1.3 Video models
5.2 Future Work
References
1. Introduction
1.1 Background
In recent years, thanks to the rapid development of computer vision and machine learning, tasks like object classification, action recognition, and face recognition have seen fruitful achievements. However, human emotion recognition remains one of the most challenging tasks. A lot of effort has been made to solve this problem. Since the first Emotion Recognition in the Wild (EmotiW) challenge was held in 2013, the accuracy of video emotion classification has increased from the 38% baseline to 59%[2]. This is great progress, but the result is still unsatisfying. On the one hand, this is probably due to the lack of labeled video data and the ambiguous nature of human facial expressions. On the other hand, the lack of effective ways to extract facial emotion features also affects model performance. In recent years, pretrained deep convolutional neural networks have been shown to perform well at extracting image features on challenging databases such as ImageNet[1]; Long Short-Term Memory (LSTM) networks show promising prediction accuracy on sequential data[6]; and three-dimensional convolutional neural networks (C3D) achieve high performance in video action detection[2]. Thus, applying these new techniques and combining them may boost the accuracy of human emotion recognition in videos.
likely to smile without showing their teeth. The observation that infants are able to show a wide range of facial expressions and respond to the facial expressions of others without being taught suggests that the ability to convey and understand emotions via facial expressions is inherent in humans.
Figure 1.1. The Cohn-Kanade CK+ database (above) has frontal facial images with stable illumination. The Facial Expression Recognition 2013 (FER-2013) data set (below) has images cropped from movies that vary in head posture and illumination.
Geometric-feature-based methods extract information about facial components and their movements, imitating how humans understand facial emotions. One example is the Facial Action Coding System (FACS). FACS was developed to describe facial expressions precisely: each facial expression is broken down into several Action Units (AU), each representing a facial muscular movement[7]. Based on FACS, facial expressions were categorized by recognizing certain facial movements[19]. Before the 1990s, encoding facial expressions using FACS was done manually and was therefore very inefficient. Moreover, geometric-feature-based methods depend heavily on the accuracy of facial component recognition and tracking, which makes them less reliable than appearance-based methods.
Computers have been part of the game since the 1990s. Since then, appearance-based methods have become quite popular. Optical flow (OF)[26], 2D Fourier transform coefficients[15], Local Binary Patterns (LBP)[21] and facial motions[8] were popular new features.
Among these features, optical flow captures the movement of surfaces and objects in video; the 2D Fourier transform converts spatial-domain information into the frequency domain, which allows researchers to decrease the dimension of images and videos significantly; LBP, on the other hand, mainly focuses on comparing a pixel with its nearby pixels and encoding the resulting spatial pattern.
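As an illustration of the LBP idea, the following minimal sketch computes the basic 8-neighbour LBP code of a single pixel. The function name `lbp_code`, the clockwise neighbour order and the `>=` threshold are assumptions chosen for this example; they are not taken from [21].

```python
def lbp_code(image, row, col):
    """Return the basic 8-neighbour LBP code of image[row][col].

    image is a list of rows of gray-scale intensities, and the pixel must
    not lie on the border. Neighbour order (an assumption): clockwise,
    starting at the top-left neighbour, which maps to the highest bit.
    """
    center = image[row][col]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for d_row, d_col in offsets:
        neighbour = image[row + d_row][col + d_col]
        # Each neighbour contributes one bit: 1 if it is at least as
        # bright as the centre pixel, otherwise 0.
        code = (code << 1) | (1 if neighbour >= center else 0)
    return code


patch = [
    [9, 1, 9],
    [1, 5, 1],
    [9, 1, 9],
]
print(lbp_code(patch, 1, 1))  # prints 170 (binary 10101010)
```

A histogram of such codes over an image region is what is typically used as the feature vector; that step is omitted here.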
New classification models were also applied to the task. The hidden Markov model, a simplified version of a Bayesian network that aims to discover the hidden pattern of features, was able to classify facial expressions in near real time[17].
mation from input image. Consequently, deep features are used in emotion
recognition tasks and significantly improved classification results.
Similar to ImageNet and ILSVRC, the facial emotion recognition area has the Emotion Recognition in the Wild (EmotiW) Challenge and the EmotiW databases designed for it. The challenge was first held in 2013 with two databases, Acted Facial Expressions in the Wild (AFEW) and Static Facial Expressions in the Wild (SFEW). The baseline accuracy was 38% with Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) as features and a Support Vector Machine (SVM) as classifier[4]. In the following years of competition, solutions that use deep pretrained neural networks to extract image features and Long Short-Term Memory (LSTM) networks to take temporal influence into account have proven effective with limited labeled data[6]. The winning team of EmotiW 2016 successfully implemented a three-dimensional convolutional neural network (C3D) and achieved the best performance with an accuracy of 59%[2].
In this research, the databases from EmotiW are used to train the models. The baseline and competition results are used to evaluate the performance of the models trained in this work.
2. Theory
This chapter covers the theoretical background of the models and methods implemented in this project to extract deep features and classify video emotions.
2.1.1 Architecture
The structure of an ANN is determined by two factors. The first is how many layers the ANN has and how many neurons each layer has. The second is how information (inputs) is transferred through the ANN.
For the former factor, the more layers an ANN has, the deeper it is, while the more neurons each layer has, the wider it is. More neurons mean a larger solution domain at the cost of longer training time. With limited computational power and time (that is, a limited number of weights that can be trained), thinner and deeper ANNs have been shown to perform better[20].
For the latter factor, the number of possible ways to connect neurons is enormous. Most of them cannot be trained at the moment. Of all the feasible networks, the most commonly used are the fully connected feed-forward network and the recurrent neural network (RNN), as shown in Figure 2.1.
2.1.2 Neurons
Neurons in an ANN work in a similar way to neuron cells in animal brains. Neuron cells receive stimuli, process them and produce feedback based on them. Artificial neurons do the same thing by summing the weighted inputs, adding a bias and using an activation function to decide the response, as shown in Figure 2.2.
Mathematically speaking, the operations in a neuron can be summarized as below:
Figure 2.1. The feed-forward network is on the left and the RNN is on the right. As the name indicates, a neuron in a fully connected feed-forward network takes only the outputs of all neurons in the previous layer as its input, so the flow of information is unidirectional. In an RNN, a neuron's input may also come from other neurons in the same layer.
$$y = \varphi\left(\sum_i w_i x_i + a\right) \tag{2.1}$$
Activation function
The activation function determines the output of a neuron. Commonly used activation functions include the logistic function (sigmoid function), the hyperbolic tangent function (tanh function), the ramp function (ReLU) and the normalized exponential function (softmax function). These activation functions work as filters that decide whether the information will be passed on and how strong the signal will be.
For instance, a node with ReLU as activation function can be written as below:
$$y = \max\left(\sum_i w_i x_i + a,\; 0\right) \tag{2.2}$$
If the sum of weighted input is larger than zero, the signal will be passed on
without changing its intensity. Otherwise, the signal will vanish.
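The neuron computation of Equations (2.1) and (2.2) can be sketched in a few lines. The function names `relu` and `neuron_output` are illustrative choices, not part of any library.

```python
def relu(value):
    """Ramp function: pass positive signals unchanged, block the rest."""
    return max(value, 0.0)


def neuron_output(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, followed by an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Weighted sum is 0.5 - 2.0 + 0.25 = -1.25, so ReLU blocks the signal.
print(neuron_output([1.0, 2.0], [0.5, -1.0], 0.25))
# Weighted sum is 0.5 + 2.0 + 0.25 = 2.75, which passes through unchanged.
print(neuron_output([1.0, 2.0], [0.5, 1.0], 0.25))
```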
Initial parameters
For smaller ANNs, the initial values of the parameters are normally set to a number between 0 and 1 or to random numbers in a certain range. However, this approach is reported to perform poorly in deep neural networks[9], and it takes more time for networks to converge even in "shallow" networks. In some extreme cases, an ANN with poor initial parameters is not able to converge at all. Besides this approach, parameters can also be initialized with values from pretrained models, as explained in Section 2.5.
Loss function
Loss functions are used to calculate the distance between the actual value (target) and the output value. Take mean square error (MSE) as an example: $\mathrm{Loss} = \frac{1}{n}\sum_{i=1}^{n} (y_i - p_i)^2$, where $y_i$ is the actual label/value and $p_i$ is the prediction of the model. The aim of training is to decrease the loss as much as possible. In order to achieve this goal, the choice of loss function plays an important role. Cross entropy, $\mathrm{Loss} = -\frac{1}{n}\sum_{i=1}^{n} y_i \ln p_i$, is capable of representing the loss properly when the output layer is a softmax layer, as shown in Figure 2.3.
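To make the two losses concrete, here is a minimal sketch using the standard squared-difference form of MSE and the standard cross entropy, with y as the target and p as the prediction. The small constant `eps` that guards against `log(0)` and the max-shift inside `softmax` are assumptions added for numerical safety.

```python
import math


def mean_squared_error(targets, predictions):
    """Average squared difference between targets y_i and predictions p_i."""
    n = len(targets)
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n


def cross_entropy(targets, predictions, eps=1e-12):
    """Average of -y_i * ln(p_i); eps avoids taking the log of zero."""
    n = len(targets)
    return -sum(y * math.log(p + eps) for y, p in zip(targets, predictions)) / n


def softmax(logits):
    """Normalized exponential; subtracting the max keeps exp() stable."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


target = [1.0, 0.0, 0.0]
probabilities = softmax([2.0, 1.0, 0.1])
print(mean_squared_error(target, probabilities))
print(cross_entropy(target, probabilities))
```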
Figure 2.3. Cross entropy(black) and square error(red) of a two layer network. W1 and
W2 are weights of first and second layer.
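$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial h_i}\frac{\partial h_i}{\partial w_i} = \frac{\partial L}{\partial h_i}\, h_{i-1} \tag{2.3}$$
Where
$$\frac{\partial L}{\partial h_i} = \frac{\partial L}{\partial h_{i+1}}\frac{\partial h_{i+1}}{\partial h_i} = \frac{\partial L}{\partial h_{i+1}} \times w_{i+1} = \frac{\partial L}{\partial h_l} \times \prod_{t=i+1}^{l} w_t \tag{2.4}$$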
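Equations (2.3) and (2.4) can be checked on a toy network. The sketch below assumes a chain of scalar layers with linear activations, h_i = w_i * h_{i-1}, and a squared-error loss L = 0.5 * (h_l - target)^2; both are simplifying assumptions made so that the product-of-weights form of Equation (2.4) holds exactly.

```python
def forward(weights, x):
    """Return [h_0, h_1, ..., h_l] for the chain h_i = w_i * h_{i-1}."""
    activations = [x]
    for w in weights:
        activations.append(w * activations[-1])
    return activations


def weight_gradients(weights, x, target):
    """Back-propagate L = 0.5 * (h_l - target)^2 through the chain.

    weights[i - 1] holds w_i, so the returned list is [dL/dw_1, ..., dL/dw_l].
    """
    h = forward(weights, x)
    depth = len(weights)
    grad_h = h[depth] - target           # dL/dh_l
    grads = [0.0] * depth
    for i in range(depth, 0, -1):        # i = l, l-1, ..., 1
        grads[i - 1] = grad_h * h[i - 1]     # Eq. (2.3)
        grad_h = grad_h * weights[i - 1]     # Eq. (2.4): step to dL/dh_{i-1}
    return grads


print(weight_gradients([2.0, 3.0], 1.0, 5.0))  # prints [3.0, 2.0]
```

The repeated multiplication by the weights in the last line of the loop is what makes gradients shrink or grow with depth.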
2.2 Convolutional Neural Network
A CNN is a type of feed-forward ANN inspired by the animal visual cortex and is known for outstanding performance in image classification. Compared to regular fully connected feed-forward ANNs, CNNs are much easier to train due to sparse connectivity and shared weights. Sparse connectivity means that each neuron in a convolutional layer takes only a certain number of output values from the previous layer instead of all of them, as in fully connected ANNs. Meanwhile, CNNs also share weights within hidden layers, which means that inputs at different locations are filtered by the same learned kernels. These two features of CNNs dramatically decrease the number of parameters that need to be trained. Figure 2.4 shows LeNet-5, a simple convolutional neural network designed for handwritten and machine-printed character recognition.
2.2.2 Pooling layer
Pooling layers are used for down-sampling in CNNs. Down-sampling, or sub-sampling, decreases the size of feature maps. Of the different pooling methods, average pooling and max pooling are the most commonly used. The benefit of pooling layers lies not only in the reduced dimensions, which lessen the computation cost, but also in controlling overfitting. The process of pooling is shown in Figure 2.5.
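A minimal sketch of max and average pooling on a single feature map follows. The function name `pool2d`, the non-overlapping windows (stride equal to window size) and the dropping of incomplete border windows are assumptions made for this example.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Down-sample a 2D feature map with non-overlapping size x size windows."""
    if mode == "max":
        reduce_window = max
    elif mode == "average":
        reduce_window = lambda values: sum(values) / len(values)
    else:
        raise ValueError("mode must be 'max' or 'average'")

    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(reduce_window(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(pool2d(feature_map, mode="max"))      # prints [[6, 8], [14, 16]]
print(pool2d(feature_map, mode="average"))  # prints [[3.5, 5.5], [11.5, 13.5]]
```

A 4*4 map becomes a 2*2 map, so every later layer processes a quarter of the values.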
2.2.3 C3D
Three-dimensional convolutional neural networks are a special kind of convolutional neural network that performs convolution over three dimensions. These networks extract features not only from the spatial dimensions (images) but also from the temporal dimension (videos), as shown in Figure 2.6. In a 2D CNN, all the filters are two-dimensional, while in C3D all the filters are 3D filters. C3D has shown good performance (82.3% top-1 accuracy) on UCF101, a data set of 101 human action classes from videos in the wild[24].
Figure 2.6. 3D Convolutional Neural Network
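$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \tag{2.6}$$
$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \tag{2.7}$$
Then normalize $x_1, x_2, ..., x_m$, using a small number $\varepsilon$ in case $\sigma = 0$:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}} \tag{2.8}$$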
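Equations (2.6) to (2.8) translate directly into code. The sketch below normalizes one mini-batch of scalar activations; the function name `batch_normalize` and the default `eps=1e-5` are assumptions, and the learnable scale and shift that usually follow the normalization are left out because they are not part of the three equations above.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Normalize a mini-batch to zero mean and (almost) unit variance."""
    m = len(batch)
    mean = sum(batch) / m                                # Eq. (2.6)
    variance = sum((x - mean) ** 2 for x in batch) / m   # Eq. (2.7)
    # eps keeps the division defined when every value in the batch is equal.
    return [(x - mean) / math.sqrt(variance + eps) for x in batch]  # Eq. (2.8)


print(batch_normalize([1.0, 2.0, 3.0, 4.0]))
print(batch_normalize([3.0, 3.0, 3.0]))  # constant batch: no division by zero
```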
of mapping inputs to outputs with stacked layers, residual networks use the layers to fit the residual between the desired mapping and the input, so the output is mapped gradually. The comparison is shown in Figure 2.7.
Figure 2.7. H(x) is any desired mapping, plain net hopes the 2 weight layers fit H(x)
while residual net hopes the 2 weight layers fit F(x) and let H(x) = F(x) + x.
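The relation H(x) = F(x) + x from Figure 2.7 can be sketched as follows. The function name `residual_block` and the representation of F as an arbitrary Python callable are illustrative assumptions; in a real network F would be the two weight layers.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


# If the layers learn F(x) = 0, the block reduces to the identity mapping,
# so adding more blocks does not have to change the output.
print(residual_block([1.0, 2.0], lambda v: [0.0 for _ in v]))  # prints [1.0, 2.0]
```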
2.4 RNN
As shown in Figure 2.1, recurrent neural networks are networks that have connections between neurons in hidden layers. This makes it possible for an RNN to handle sequential information, where the order of the input matters and the meaning of the data depends on the "context". While a CNN shares weights by using "filters" in the spatial dimension, an RNN shares weights by using the same function to handle information at different times in the time domain.
Figure 2.8. A standard RNN neuron. The RNN shares weights across sequential data.
$$h_t = F_\theta(h_{t-1}, x_t) \tag{2.9}$$
For all $x_t$ and $h_t$, the parameters of $F_\theta$ are the same.
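The weight sharing expressed by Equation (2.9) can be illustrated with a scalar hidden state. The concrete form of F used here, tanh(w_h * h + w_x * x + bias), is a common choice but an assumption in this sketch, as are the function names.

```python
import math


def rnn_step(h_prev, x_t, w_h, w_x, bias):
    """One application of F_theta to the previous state and current input."""
    return math.tanh(w_h * h_prev + w_x * x_t + bias)


def run_rnn(inputs, w_h, w_x, bias, h0=0.0):
    """Apply the same step function, with the same parameters, at every t."""
    h = h0
    states = []
    for x_t in inputs:
        h = rnn_step(h, x_t, w_h, w_x, bias)
        states.append(h)
    return states


# The same two inputs in a different order give a different final state,
# which is what lets the network take the "context" into account.
print(run_rnn([1.0, 0.0], w_h=0.5, w_x=1.0, bias=0.0)[-1])
print(run_rnn([0.0, 1.0], w_h=0.5, w_x=1.0, bias=0.0)[-1])
```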
2.4.1 LSTM
Long Short-Term Memory networks are a special kind of RNN with a different and more complex structure for neural cells. An LSTM neuron has three gates: an input gate, a forget gate and an output gate.
Inside an LSTM neuron, three things need to be decided. First, how clearly the new information should be remembered. Secondly, how much of the previous memory should be forgotten. Thirdly, what signal should be passed on to influence other neurons.
Given sequential data $x_1, x_2, ..., x_m$, $h_t$ is the influence of $x_t$ on $x_{t+1}$ and $c_t$ is the cell state after processing $x_t$. The first step is:
The range of $f_t$ is [0, 1]: 0 means all the old memory will be removed and 1 means it will be kept entirely. Thus, the memory at sequence t can be inferred:
$$h_t = o_t * \tanh(c_t) \tag{2.15}$$
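The interplay of the three gates can be sketched for a scalar cell. The gate equations below follow the standard LSTM formulation, which is assumed here; only the last computation corresponds directly to Equation (2.15). The function names and the `params` layout (gate name mapped to input weight, recurrent weight and bias) are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell (standard formulation, assumed).

    params maps each of "forget", "input", "output" and "candidate"
    to a tuple (w_x, w_h, bias).
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x_t + w_h * h_prev + bias

    f_t = sigmoid(pre_activation("forget"))     # how much old memory to keep
    i_t = sigmoid(pre_activation("input"))      # how much new information to store
    o_t = sigmoid(pre_activation("output"))     # how much memory to expose
    candidate = math.tanh(pre_activation("candidate"))
    c_t = f_t * c_prev + i_t * candidate        # updated cell state
    h_t = o_t * math.tanh(c_t)                  # Eq. (2.15)
    return h_t, c_t


# With the forget gate near 1 and the input gate near 0, the old memory
# passes through the step almost unchanged.
keep = {"forget": (0.0, 0.0, 50.0), "input": (0.0, 0.0, -50.0),
        "output": (0.0, 0.0, 50.0), "candidate": (0.0, 0.0, 0.0)}
print(lstm_step(1.0, 0.0, 0.7, keep))
```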
Figure 2.10. Different learning approaches of traditional learning (left) and transfer learning (right).
3. Methodology
In this section, the specific approach is described. It can be divided into three parts: data collection and pre-processing, feature extraction, and classification and evaluation.
This research involves three models. The first one is the audio-SVM model: audio features are extracted from the audio tracks with OpenSMILE as feature extractor and classified with an SVM. The second one is the CNN-LSTM model, which involves three steps: train the feature extractor, a CNN model (Inception-ResNet-v2); use the CNN model to extract deep features from face images cropped from video frames; and use an LSTM as classifier to integrate the deep features and classify the emotion. The third one is the video-C3D model, which uses face frames from videos as input and a C3D model as both feature extractor and classifier.
3.1 Data
Three data sets are involved in the training and evaluation of the models. The first one is AFEW 6.0. The second one is Static Facial Expressions in the Wild (SFEW). The third one is the Facial Expression Recognition 2013 data set (FER2013). FER2013 is used to train the CNN model (Inception-ResNet-v2) and SFEW is used to evaluate it. AFEW 6.0 is used for both training (60%) and evaluating (40%) the SVM, LSTM and C3D models.
AFEW 6.0
AFEW 6.0 is a data set consisting of video clips collected from movies and reality TV shows. It is the newest version of the AFEW data set. Compared to AFEW 1.0-5.0, reality TV show clips are newly added. There are 1750 video clips in the data set, originally divided into 774 training videos, 383 validation videos and 593 test videos. Each of them is labelled with exactly one of the seven universally recognized emotions, as shown in Table 3.1. Due to the EmotiW contest, the labels of the test videos were not available when this project was conducted.
All the videos have a frame rate of 25 fps (25 frames per second) and a resolution of 720*576.
Emotion          | Angry | Disgusted | Fear | Happy | Sad | Surprise | Neutral
Number of videos | 197   | 114       | 127  | 213   | 178 | 120      | 207
Table 3.1. Emotion distribution of AFEW 6.0 data set for training and validation
SFEW
SFEW is a data set consisting of images collected from movie frames, each labelled with one of the seven emotions. There are 861 labelled images in total in the training and validation sets. The distribution of the SFEW data set is shown in Table 3.2. All the images are movie frames of 720*567 resolution.
FER2013
The FER2013 database is an image data set containing 35889 gray-scale facial expression images of 48*48 pixels, labelled with the seven universal emotions above. The facial expression images in the FER2013 data set are also gathered from wild environments (movies), and thus the features learned from them can be applied to the AFEW 6.0 data set. The distribution of images in different categories is shown in Table 3.3.
3.2 Pre-processing
Videos contain rich information. However, in order to train models more effi-
ciently, only audios and facial crops from video frames are used in the training.
AFEW 6.0
For the AFEW 6.0 data set, ffmpeg, a cross-platform open source audio-video processing framework, is used to extract audio files and video frames. All the video clips are around 2 seconds long and have a frame rate of 25 frames per second (25 fps). Since not every frame in a video contains a human face, all frames were extracted from the videos in order to get enough cropped faces to analyze temporal information, and the Dlib frontal face detector was used to crop the largest face in each frame. In the end, all the faces are resized to two standard sizes, 48*48 and 299*299, and converted to gray-scale images. The gray-scale facial images and the audio from the videos are used as input for further feature extraction, as shown in Figure 3.1.
SFEW
For the SFEW data set, the Dlib frontal face detector is also used to crop the largest face in the frame. The cropped faces are then resized to 299*299 and converted to gray-scale images for later use.
FER2013
For FER2013 data set, all the faces are resized to 299*299.
Normalization
Besides, all the faces are linearly normalized to decrease the influence of illumination. For a pixel at location $(x, y)$ with intensity value $I_{(x,y)}$, where $I_{max}$ and $I_{min}$ are the largest and smallest intensity values of the original image, the normalized intensity value of the pixel, $I'_{(x,y)}$, is calculated as:
$$I'_{(x,y)} = \frac{I_{(x,y)} - I_{min}}{I_{max} - I_{min}} * 255 \tag{3.1}$$
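Equation (3.1) can be applied to a gray-scale image stored as a list of rows. The function name `normalize_intensity` and the handling of a completely flat image (all pixels equal, where the equation is undefined) are assumptions made for this sketch.

```python
def normalize_intensity(image):
    """Stretch the intensities of a gray-scale image to the range [0, 255]."""
    flat = [value for row in image for value in row]
    i_min, i_max = min(flat), max(flat)
    if i_max == i_min:
        # Eq. (3.1) would divide by zero; returning zeros is an assumption.
        return [[0.0 for _ in row] for row in image]
    return [[(value - i_min) / (i_max - i_min) * 255 for value in row]
            for row in image]


# A dim image with intensities 50..200 is stretched to 0..255.
print(normalize_intensity([[50, 100], [150, 200]]))
```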
3.3.1 Audio Features
The OpenSMILE (Speech and Music Interpretation by Large-space Extraction) feature extraction toolkit is used to extract audio features. OpenSMILE can extract audio low-level descriptors such as Mel-frequency cepstral coefficients, loudness, perceptual linear predictive cepstral coefficients, line spectral frequencies and formant frequencies. The feature extracted from a 2-second audio clip is a vector of 1582 dimensions.
Figure 3.3. On the left is Stem block in Figure 3.2. On the right there is block A’
(below) and block B’ (above) in Figure 3.2.
Figure 3.4. The structure of Inception-ResNet-v2 network blocks. The block on the left is block A shown in Figure 3.2. The block in the middle is block B shown in Figure 3.2. The block on the right is block C shown in Figure 3.2.
in ImageNet pretrained model is not suitable in this model. All the parameters
in ’AuxLogits’ block and ’Logits’ block will be generated randomly instead
of restored from pretrained model.
Fine-tune
In order to enhance the ability of the Inception-ResNet-v2 model to extract facial expression features, the Facial Expression Recognition 2013 data set (FER2013) is used to fine-tune the Inception-ResNet-v2 model pretrained on ImageNet.
All the layers of the pretrained Inception-ResNet-v2 model were tuned with the FER2013 database. The FER2013 data set is divided into a training set of 28709 images and a validation set of 7188 images. The learning rate is set to 0.01 for steps 1-30000. The model is then fine-tuned for a further 2000 steps with a learning rate of 0.0001.
After fine-tuning, this model is able to classify facial emotions from static images. The output layer produces the classification result, while the output of the convolutional layers provides deep emotional features of facial images.
Model details
After extracting the 1582-dimensional audio features, an SVM is trained as classifier. Scikit-learn, the open source machine learning library, is used to train this SVM model. To be specific, Classification SVM type 1 (C-SVM) is used. Training a C-SVM is a process of minimizing the error function:
$$\frac{1}{2} w^T w + C \sum_{i=1}^{N} \varepsilon_i \tag{3.2}$$
with constraints:
$$y_i (w^T \Phi(x_i) + b) \geq 1 - \varepsilon_i \tag{3.3}$$
$$\varepsilon_i \geq 0, \quad i = 1, ..., N \tag{3.4}$$
where $y_i$ is the class label, $x_i$ is the input data, $w$ is the coefficient vector, $b$ is a bias, $\varepsilon_i$ is the slack parameter of the i-th input, and $C$ is the capacity constant. The kernel $\Phi$ is used to transform the input data into the feature space.
Training Method
To find the optimal parameter set, 10-fold cross validation is used. For the linear kernel, the possible values of C are 1, 10 and 100. For the radial basis function (rbf) kernel, the possible values of C are 1, 10, 100 and 1000, and the possible values of gamma are 0.01, 0.001 and 0.0001.
The training set of AFEW is used for training, and the validation set of AFEW is used to evaluate the performance of all the SVM models. The best-performing combination of parameters is chosen as the model to test.
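The search space described above can be written down and enumerated explicitly. The sketch below uses only the standard library; the list-of-dictionaries layout mirrors the format that scikit-learn's `GridSearchCV` accepts for its `param_grid` argument, but the helper `expand_grid` is a hypothetical function written for this example and not part of scikit-learn.

```python
from itertools import product

# One dictionary per kernel, because gamma only applies to the rbf kernel.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100]},
    {"kernel": ["rbf"], "C": [1, 10, 100, 1000],
     "gamma": [0.01, 0.001, 0.0001]},
]


def expand_grid(grid):
    """List every parameter combination described by the grid."""
    combinations = []
    for block in grid:
        names = sorted(block)
        for values in product(*(block[name] for name in names)):
            combinations.append(dict(zip(names, values)))
    return combinations


candidates = expand_grid(param_grid)
print(len(candidates))  # prints 15: 3 linear models plus 4 * 3 rbf models
```

With 10-fold cross validation, each of these candidates is trained and scored ten times on different splits of the training data.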
Output
The output of SVM model will be a one-digit number indicating the classifi-
cation result.
faces will be duplicated and added to the face collection until the number of faces reaches 16.
Thus, the input of the LSTM is a feature of dimension 16*1536.
Model Details
A one-layer LSTM model is used.
Training Method
The learning rate of LSTM is set to 0.001 with 40000 training iterations. Batch
size is 32.
Output
The output of LSTM model is seven logits indicating the likelihood of each
emotion.
3.4.3 C3D
The input of C3D consists of the cropped faces of size 48*48 described in Section 3.2. As explained in Section 3.4.2, the number of faces found in videos varies. Thus, a similar method is used to obtain inputs of the same size for C3D. For videos with more than 16 faces, 16 sequential faces are chosen randomly as input. For videos with more than 1 face but fewer than 16 faces, the padding technique in Section 3.4.2 is used. Videos in which no face is found are removed from the training set.
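The three cases above can be sketched as one selection function. The name `select_clip` is hypothetical, and two details are assumptions made for this example: "16 sequential faces chosen randomly" is read as a contiguous window with a random start, and padding duplicates the available faces in order until 16 are reached.

```python
import random


def select_clip(faces, clip_length=16, rng=random):
    """Return exactly clip_length faces, or None if the video has no faces."""
    if not faces:
        return None  # videos without any detected face are dropped
    if len(faces) >= clip_length:
        # Random contiguous window, so the temporal order is preserved.
        start = rng.randrange(len(faces) - clip_length + 1)
        return faces[start:start + clip_length]
    # Too few faces: duplicate existing ones in order until the clip is full.
    padded = list(faces)
    index = 0
    while len(padded) < clip_length:
        padded.append(faces[index % len(faces)])
        index += 1
    return padded


print(select_clip(list(range(5))))
print(select_clip(list(range(40)), rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes the window choice reproducible between runs.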
Model Details
There are 7 hidden layers in the C3D model, as shown in Figure 3.5. The first five layers are convolutional layers that extract video features. The kernels of these layers are all 3*3*3, with a stride of 1 in every dimension.
The last 2 layers are fully connected layers used for classification.
This model structure has proven effective in classifying videos in the UCF101 database, a database of 101 different kinds of human actions (101 classes), where it showed an accuracy of 72.6%[25].
Output
The output is seven logits indicating the likelihood of each emotion.
Figure 3.5. The structure of C3D model
4. Results
4.1 On Audios
Of the 15 SVM models trained, the linear ones have better accuracy, as shown in Table 4.1. Besides, the accuracy does not vary much with the parameters C and gamma. After the grid search, the linear SVM model with C = 1 is chosen as the classifier due to its accuracy and efficiency.
With the model trained on the AFEW 6.0 training set, the AFEW 6.0 validation set is used to evaluate the model, as shown in Table 4.2. The accuracy of the SVM model on the validation set is 25%. This classifier is better at angry videos, with an F1-score of 0.37, while its performance on disgust and surprise videos is much worse than average.
4.2 Inception ResNet V2 On Static Images
The Inception-ResNet-v2 model is first fine-tuned with FER2013. The training process of the Inception-ResNet-v2 model is shown in Figure 4.1. The model is trained for 32000 steps. After 30000 training steps, the learning rate is adjusted to 0.0001 to fine-tune the model. After 20000 steps of training, the loss tends to be stable and the accuracy remains the same.
After fine-tuning, the accuracy of the model on the validation set of FER2013 is shown in Table 4.3. The model has better performance on disgusted, happy and surprise emotions. Among all emotions, disgusted has the highest accuracy. Overall, the accuracy is 60% on 5740 images in FER2013.
Emotion    Angry   Disgusted   Fear   Happy   Sad   Surprise   Neutral   Overall
Accuracy   54%     80%         48%    78%     49%   67%        48%       60%
In total, 861 images are tested, of which 413 are correctly predicted.
The overall accuracy on all emotions is 47.97%. As shown in Table 4.4, the
Inception-ResNet-v2 model retains satisfactory performance on happy images.
However, it fails completely to recognize disgusted images, with an accuracy
of 0.
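The overall accuracy quoted above follows directly from the two counts in the text. A minimal check, with a hypothetical helper name:

```python
def accuracy_percent(correct, total):
    """Share of correctly predicted samples, in percent."""
    return 100.0 * correct / total


# 413 correct predictions out of 861 tested images.
print(f"{accuracy_percent(413, 861):.2f}%")  # 47.97%
```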
4.3 On Videos
4.3.1 LSTM
The validation set of AFEW 6.0 is used to test the accuracy of LSTM. The loss
decreases dramatically during the first 5000 training steps, as shown in
Figure 4.3, and then declines slowly during the rest of the training process.
In the end, it fluctuates around 1.3.
While the loss decreases, the accuracy increases during training, as shown in
Figure 4.4. The training accuracy and testing accuracy both increase during
the first 5000 steps of training. However, further training fails to increase
the accuracy on the testing set.
The confusion matrix of the LSTM model is shown in Table 4.5. The LSTM model
is more capable of classifying angry and happy emotions, while it fails
completely on the remaining emotions. The fact that 41.67% of the videos are
classified as neutral indicates that overfitting might exist.
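A figure such as the share of videos classified as one emotion is read from the column sums of a confusion matrix. The sketch below shows the computation on a hypothetical 3-class matrix; the numbers are invented for illustration and are not the values of Table 4.5.

```python
def predicted_share(confusion):
    """Fraction of all samples assigned to each predicted class.

    `confusion[i][j]` counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    n_classes = len(confusion[0])
    return [sum(row[j] for row in confusion) / total for j in range(n_classes)]


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical matrix: the third class absorbs most predictions.
toy = [
    [5, 0, 5],
    [1, 4, 5],
    [0, 0, 10],
]
print([round(s, 3) for s in predicted_share(toy)])  # [0.2, 0.133, 0.667]
print(round(accuracy(toy), 3))  # 0.633
```

A predicted share far above a class's true share is the pattern described above, where one emotion absorbs a large part of the predictions.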
Figure 4.3. The training loss of LSTM model.
Figure 4.4. The training accuracy and testing accuracy of LSTM model.
4.3.2 C3D
Two C3D models are trained at different learning rates.
C3D-1
C3D-1 is the first C3D model trained. It is trained with an initial learning
rate of 0.01, which decays every 2700 steps with a decay rate of 0.1, as shown
in Figure 4.5. The loss decreases dramatically during the first 1000 training
steps and then decreases slowly during the following 9000 steps. Finally, the
training loss fluctuates slightly around 1.9. The accuracy of the model
increases fast during the first 3000 training steps. However, the training
accuracy is not stable at all. After smoothing the curve, we can see that the
accuracy stays around 22%.
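The text does not state how the accuracy curve was smoothed. One common choice is an exponential moving average, sketched below; the function name and the smoothing weight are assumptions made for illustration.

```python
def smooth(values, weight=0.9):
    """Exponential moving average of a noisy training curve.

    A larger `weight` gives a smoother curve that reacts more slowly.
    """
    if not values:
        return []
    smoothed = []
    last = values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed


# Hypothetical noisy accuracy readings oscillating around 0.22.
noisy = [0.10, 0.35, 0.15, 0.30, 0.12, 0.32, 0.20]
print([round(v, 3) for v in smooth(noisy)])
```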
After training, the validation set of AFEW 6.0 is used to evaluate the
accuracy of C3D-1. The confusion matrix is shown in Table 4.6. All the videos
are labelled as happy, which shows that the model overfits the training data
completely.
C3D-2
In order to further decrease the possibility of overfitting, another C3D model
is trained. The training process is shown in Figure 4.6. C3D-2 is trained
with a smaller learning rate and more training steps. Ideally, it should
avoid overfitting and converge more slowly.
As shown in Figure 4.6, the learning rate of C3D-2 is set to 0.00001 and
decays every 1600 steps with a decay rate of 0.1. The average loss drops fast
to around 3.3 during the beginning phase of training but remains the same
in the following training steps. After the learning rate drops below 0.000001,
the learning of the model can be considered stopped. Both the loss and the
training accuracy remain very unstable. By the end of the training process,
the training accuracy is around 19%.
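The learning rate schedule described for the C3D models can be written as a staircase exponential decay. This is a minimal sketch: the staircase form (the rate changes only at multiples of the decay interval) is an assumption based on the phrase "decays every ... steps", and the function name is hypothetical.

```python
def decayed_learning_rate(step, initial_rate, decay_steps, decay_rate=0.1):
    """Staircase decay: multiply by `decay_rate` every `decay_steps` steps."""
    return initial_rate * decay_rate ** (step // decay_steps)


# C3D-1: start at 0.01, decay by 0.1 every 2700 steps.
for step in (0, 2700, 5400):
    print("C3D-1", step, decayed_learning_rate(step, 0.01, 2700))

# C3D-2: start at 0.00001, decay by 0.1 every 1600 steps.
for step in (0, 1600, 3200):
    print("C3D-2", step, decayed_learning_rate(step, 0.00001, 1600))
```

Under this schedule the rate of C3D-2 reaches 0.000001 at step 1600 and falls below it from step 3200 onward, which is the point after which learning is considered stopped.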
After training, the validation set of AFEW 6.0 is used to evaluate the
accuracy of C3D-2. The confusion matrix is shown in Table 4.7. No videos are
labelled as disgust or fear. Some videos are correctly labelled as surprise.
Most videos are classified as happy, which still clearly indicates
overfitting.
5. Discussion
In this chapter, conclusions are drawn based on a comparison with the work of
EmotiW competitors, and future work is proposed based on these conclusions.
5.1 Conclusion
5.1.1 Audios
The SVM model shows a weak ability to distinguish certain emotions. Compared
to the accuracy of 14.3% obtained by randomly assigning one of the seven
emotions to a given video, the model has a better accuracy of 25%.
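The 14.3% baseline is the accuracy expected from guessing uniformly among the seven emotions. A short check, assuming uniform guessing and using the 25% accuracy reported for the SVM model:

```python
NUM_EMOTIONS = 7

# Expected accuracy when each emotion is guessed with equal probability.
chance_accuracy = 1 / NUM_EMOTIONS
print(f"{100 * chance_accuracy:.1f}%")  # 14.3%

# Ratio between the reported SVM accuracy and the chance level.
svm_accuracy = 0.25
print(f"{svm_accuracy / chance_accuracy:.2f}")  # 1.75
```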
As shown in Table 4.2, the SVM model is more capable of distinguishing the
angry emotion than any other emotion, with an F1-score of 0.37, much higher
than the average F1-score of 0.24. This might be due to the fact that most
angry videos involve someone shouting loudly, which makes it easy for the
classifier to find the key feature.
On the other hand, the SVM model is not capable of classifying disgust and
surprise. Hardly any sound is distinctively surprised, even to human ears,
and there are not enough audio samples for the model to find the key features
of disgust.
Besides, the audio within a certain category can vary even more than audio
across categories. For instance, some surprise audio is so quiet that it can
be considered neutral, while other surprise audio is so noisy that it can be
classified as angry or fear.
It is difficult to compare this audio-SVM model with state-of-the-art work,
since the EmotiW competitors did not provide the precise accuracy of their
audio models.
Overall, the audio model has limited ability to classify emotions. With a
1582-dimensional feature for a 2-second audio clip, it is neither efficient
nor precise.
The reasons why the Inception-ResNet-v2 model could not outperform previous
models might be the different cropping method, the different input sizes and
the different training method.
The state-of-the-art result on SFEW uses faces cropped with the Viola-Jones
face detector, while in this research the Dlib frontal face detector is used,
which means that faces at different angles might not be detected.
The FER2013 images are 48*48 gray-scale images. However, the input size of
the pretrained Inception-ResNet-v2 model is 299*299*3. After resizing all
images to 299*299, every pixel of a FER2013 image is extended to an area
containing at least 6*6 pixels. This prevents the first several layers of the
Inception-ResNet-v2 model from extracting much useful information, since the
stride of those layers is 1 or 2 and the size of the filters is 3*3 or 1*1
pixels. Unless the filters move across the border of two 6*6 pixel areas,
they can hardly extract any useful information. On the other hand, the
cropped faces from the SFEW database are of different sizes. A lot of them are
large enough and full of details. Moreover, the faces from SFEW are RGB color
images. The Inception-ResNet-v2 model trained with FER2013 is not able to
extract those image details.
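The argument about enlarged pixels can be illustrated with a small experiment. The sketch below enlarges a tiny image by an integer factor of 6 with nearest-neighbour interpolation and counts how many 3*3 windows see only one constant value. The interpolation method, the integer factor and the toy image are assumptions made for illustration; the actual resizing from 48 to 299 pixels uses a non-integer factor and may use a different interpolation.

```python
def upscale_nearest(image, factor):
    """Nearest-neighbour upscaling of a 2-D list by an integer factor."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def constant_window_share(image, kernel=3):
    """Share of kernel*kernel windows (stride 1) whose pixels are all equal."""
    height, width = len(image), len(image[0])
    windows = constant = 0
    for top in range(height - kernel + 1):
        for left in range(width - kernel + 1):
            values = {
                image[top + dy][left + dx]
                for dy in range(kernel)
                for dx in range(kernel)
            }
            windows += 1
            constant += len(values) == 1
    return constant / windows


print(round(299 / 48, 2))  # 6.23, the enlargement factor per dimension

# Hypothetical 2*2 image with four distinct grey levels, enlarged 6 times.
tiny = [[0, 50], [100, 150]]
large = upscale_nearest(tiny, 6)
print(len(large), len(large[0]))  # 12 12
print(constant_window_share(large))  # 0.64
```

In this toy case, 64 of the 100 possible 3*3 windows lie entirely inside one enlarged pixel and therefore carry no local contrast for a filter to respond to.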
Besides, ImageNet is a database of both color images and gray-scale
images. Without color information in the FER2013 database, the full power of
the Inception-ResNet-v2 model is not utilized, and neither are the pretrained
parameters of the model.
very different from this research. It indicates that Inception-ResNet-v2 model
has a similar ability to extract emotion features compared to VGG16-FACE.
In all, LSTM works well with deep CNN features. The accuracy of the LSTM
model on AFEW 6.0 is not far below that of Inception-ResNet-v2 on SFEW, even
though there is far more information in videos.
C3D
Both C3D models failed. Their performance is poor and they are difficult to
train. Both C3D models overfit the training set even though C3D-2 has an
extremely small learning rate (0.00001) that decays quickly.
The accuracy of C3D-2 is almost the same as that of C3D-1, which labels every
video as happy. The accuracy of C3D is 20% lower than the state of the art.
This might result from camera movement.
C3D has shown promising results on the UCF101 database, which contains 101
categories of human activities, including putting on make-up and playing
basketball. Seemingly, since human activities involve more complex situations
and movements of multiple parts of the human body, and the C3D model is able
to classify these activities, it should be capable of dealing with 7 human
facial movements. This ignores the fact that UCF101 is a database that only
consists of videos shot without camera movement, while AFEW 6.0 videos are
cut from movies with a lot of camera movement within 2 seconds. This means
that the C3D model has a hard time capturing the movement of an object or
pixel. Moreover, a wrongly cropped non-face image given as input will ruin the
prediction of the C3D model completely.
On the other hand, the state-of-the-art research achieves an accuracy of
39.69% on AFEW 6.0 with a C3D model because it not only applies a face
classifier to remove all non-face images from the input but also transforms
all the cropped faces with different head postures into frontal faces and
rescales the faces to a standardized size, with their facial key points at
similar positions in each image. In other words, it stabilizes the moving
camera.
Overall
Among all video information, cropped faces are the most informative when it
comes to classifying human emotions. The Inception-ResNet-v2 model proves its
outstanding ability to extract image features, including facial emotion
features. When temporal information is taken into account, the CNN-LSTM
approach performs better and is easier to train than the C3D model.
the main task. Due to the limited time and resources of this project, none of
the above sub-tasks achieves state-of-the-art results. Of all the sub-tasks
mentioned below, a better face detection technique combined with face
recognition might improve the results significantly by distinguishing the
different individuals in one video and labelling their emotions accordingly.
Also, in this research, audio is only used to extract features with a
signal-based approach. However, if used to generate subtitles or recognize the
environment, audio can be a feature as effective as facial emotions. The
loudness or frequency of the sound is less meaningful than the content of the
sound. For instance, gunfire in the background is quite common in movies and
apparently provides decisive information.
Finally, more labelled data is required for a more reliable model. Emotion
is complex in its very nature; its solution domain is very large and
requires enough data to learn. In an earlier stage of this research, a lot of
dynamic images were downloaded from giphy.com with emotions as keywords.
But since there were so many images to be checked and labelled manually, none
of them were used in the training process. If possible, having more correctly
labelled data from the wild will boost the accuracy.
Overall, there is still plenty of room for improvement regarding emotion
recognition. Hopefully, one day, it can be implemented in real-life
scenarios.
References
[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
Imagenet: A large-scale hierarchical image database. In Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255.
IEEE, 2009.
[2] Abhinav Dhall, Roland Goecke, Jyoti Joshi, Jesse Hoey, and Tom Gedeon.
Emotiw 2016: Video and group-level emotion recognition challenges. In
Proceedings of the 18th ACM International Conference on Multimodal
Interaction, pages 427–432. ACM, 2016.
[3] Abhinav Dhall, Roland Goecke, Jyoti Joshi, Karan Sikka, and Tom Gedeon.
Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In
Proceedings of the 16th International Conference on Multimodal Interaction,
pages 461–466. ACM, 2014.
[4] Abhinav Dhall, Roland Goecke, Jyoti Joshi, Michael Wagner, and Tom Gedeon.
Emotion recognition in the wild challenge (emotiw) challenge and workshop
summary. In Proceedings of the 15th ACM on International conference on
multimodal interaction, pages 371–372. ACM, 2013.
[5] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Static facial
expression analysis in tough conditions: Data, evaluation protocol and
benchmark. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE
International Conference on, pages 2106–2112. IEEE, 2011.
[6] Abhinav Dhall, OV Ramana Murthy, Roland Goecke, Jyoti Joshi, and Tom
Gedeon. Video and image based emotion recognition challenges in the wild:
Emotiw 2015. In Proceedings of the 2015 ACM on International Conference on
Multimodal Interaction, pages 423–426. ACM, 2015.
[7] Paul Ekman and Wallace V Friesen. Facial action coding system. 1977.
[8] Beat Fasel and Juergen Luettin. Automatic facial expression analysis: a survey.
Pattern recognition, 36(1):259–275, 2003.
[9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep
feedforward neural networks. In Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[10] Hatice Gunes and Massimo Piccardi. Bi-modal emotion recognition from
expressive face and body gestures. Journal of Network and Computer
Applications, 30(4):1334–1345, 2007.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep
network training by reducing internal covariate shift. In International
Conference on Machine Learning, pages 448–456, 2015.
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long,
Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional
architecture for fast feature embedding. In Proceedings of the 22nd ACM
international conference on Multimedia, pages 675–678. ACM, 2014.
[14] Bo-Kyeong Kim, Hwaran Lee, Jihyeon Roh, and Soo-Young Lee. Hierarchical
committee of deep cnns with exponentially-weighted decision fusion for static
facial expression recognition. In Proceedings of the 2015 ACM on International
Conference on Multimodal Interaction, pages 427–434. ACM, 2015.
[15] Jian Huang Lai, Pong C Yuen, and Guo Can Feng. Face recognition using
holistic fourier invariant features. Pattern Recognition, 34(1):95–109, 2001.
[16] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar,
and Iain Matthews. The extended cohn-kanade dataset (ck+): A complete
dataset for action unit and emotion-specified expression. In Computer Vision
and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society
Conference on, pages 94–101. IEEE, 2010.
[17] Ara V Nefian and Monson H Hayes. Hidden markov models for face
recognition. In Acoustics, Speech and Signal Processing, 1998. Proceedings of
the 1998 IEEE International Conference on, volume 5, pages 2721–2724. IEEE,
1998.
[18] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE
Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
[19] Stephen M Platt and Norman I Badler. Animating facial expressions. In ACM
SIGGRAPH computer graphics, volume 15, pages 245–252. ACM, 1981.
[20] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using
context-dependent deep neural networks. In Twelfth Annual Conference of the
International Speech Communication Association, 2011.
[21] Caifeng Shan, Shaogang Gong, and Peter W McOwan. Facial expression
recognition based on local binary patterns: A comprehensive study. Image and
Vision Computing, 27(6):803–816, 2009.
[22] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi.
Inception-v4, inception-resnet and the impact of residual connections on
learning. In AAAI, pages 4278–4284, 2017.
[23] Y-I Tian, Takeo Kanade, and Jeffrey F Cohn. Recognizing action units for facial
expression analysis. IEEE Transactions on pattern analysis and machine
intelligence, 23(2):97–115, 2001.
[24] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar
Paluri. Learning spatiotemporal features with 3d convolutional networks. In
Proceedings of the IEEE international conference on computer vision, pages
4489–4497, 2015.
[25] Du Tran, Lubomir D Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar
Paluri. C3d: generic features for video analysis. CoRR, abs/1412.0767, 2(7):8,
2014.
[26] Yaser Yacoob and Larry S Davis. Recognizing human facial expressions from
long image sequences using optical flow. IEEE Transactions on pattern
analysis and machine intelligence, 18(6):636–642, 1996.