Master Project
Report
presented at
by
Awatef MESSAOUDI
Mr F F President
Mr Zied LACHIRI Supervisor
Mr F F Reviewer
Dedication
To all of you,
I dedicate this work.
Awatef MESSAOUDI
Acknowledgements
First and foremost, I would like to express my infinite gratitude and respect to my
supervisor, Mr Zied LACHIRI.
I am also grateful to all my teachers, without whom this work would not have been possible.
I will not forget, of course, to express my gratitude to my Master's colleagues, who
kindly agreed to cooperate. Most of all, I am thankful to my family for their exceptional
support through all challenges.
INTRODUCTION Awatef MESSAOUDI
Introduction
Automatic recognition of human emotion has been an active research area for decades, with
growing applications including avatar animation, neuro-marketing and social robots. Emotion
recognition is a challenging task, as it involves predicting abstract emotional states from
multi-modal input data. These modalities include video, audio and physiological signals.
Facial expression is one of the most informative channels, and facial emotion recognition
remains one of the most challenging tasks in computer vision. In this preliminary study, we
propose a simple solution for facial expression recognition. We first introduce the state of
the art of facial expression recognition, focusing on notions such as emotions, facial
expressions and the six universal emotions. The first chapter is an introduction to facial
expression recognition and to related terms such as face detection, feature extraction,
machine learning and deep learning. In the second chapter, we focus on deep learning and its
different building blocks, with a simple comparison with machine learning. The third chapter
presents the architecture of two different deep learning networks; we study their different
blocks and the FER2013 data set used in our work. Finally, the fourth chapter covers
implementation and results, focusing on the performance of our architectures.
CHAPTER 1. FER:STATE OF THE ART Awatef MESSAOUDI
Chapter
1
FER:state of the art
1.1 Introduction:
Due to the important role of facial expression in human interaction, the ability to
perform facial expression recognition automatically via computer vision enables a range
of applications such as human-computer interaction and data analytics. In this
chapter, we present some notions of emotions and different coding theories, as well as
the architecture of facial recognition. We present some approaches that help us to
recognize facial expressions, and we end the chapter with different machine learning
techniques.
1.2.1 Definitions:
1.2.1.1 Emotions:[1]
Emotion is expressed through many channels such as body position, voice and facial
expressions. It is a mental and physiological state which is subjective and private. It
involves many behaviours, actions, thoughts and feelings.
A facial expression is a meaningful movement of the face. The meaning can be the expression
of an emotion, a semantic cue or an intonation in sign language. The interpretation
of a set of muscle movements as an expression depends on the context of the application. For
example, in a Human-Machine interaction application where we want an indication of the
emotional state of an individual, we will try to classify the measurements
in terms of emotions.
Charles DARWIN wrote in his 1872 book « The Expression of the Emotions in Man and
Animals » that facial expressions of emotion are universal, not learned differently in each
culture. Several studies since have attempted to classify human emotions and demonstrate
how the face can reveal a person's emotional state.[2] In 1971, Ekman and Friesen defined
six basic emotions based on a cross-cultural study, which indicated that humans perceive
certain basic emotions in the same way regardless of culture. These prototypical facial
expressions are anger, disgust, fear, happiness, sadness and surprise.[2]
Facial expressions are a consequence of the activity of facial muscles. These muscles are
also called mimetic muscles or muscles of facial expression. The study of facial expressions
cannot be done without studying the anatomy of the face and the underlying structure
of the muscles. That is why some researchers have focused on coding systems for facial
expressions. Several systems have been proposed, such as Ekman's system: in 1978, Ekman
developed a tool for coding facial expressions that is widely used today. We present some
of these systems.
1.2.3.1 FACS:
The Facial Action Coding System (FACS), developed by Ekman and Friesen, is a
standard way of describing facial expressions in both psychology and computer animation.
FACS is based on 44 action units (AUs) that represent facial movements that cannot be
decomposed into smaller ones. FACS is very successful, but it suffers from some drawbacks
such as:
• Lack of precision: the transition between two states of a muscle is represented
linearly, which is only an approximation of reality.
1.2.3.2 MPEG4:
The MPEG4 video encoding standard includes a model of the human face developed by the
Face and Body Ad Hoc Group. This is a 3D model, built
on a set of facial attributes called Facial Feature Points (FFP). Measurements are used
to describe muscle movements (Facial Animation Parameters, the equivalents of Ekman's
Action Units).
1.2.3.3 Candide:
It is a model of the face containing 75 vertices and 100 triangles. It is composed of a
generic face model and a set of parameters (Shape Units). These parameters are used
to adapt the generic model to a particular individual. They represent the differences
between individuals and are 12 in number:
1. head height.
4. eye width.
5. eye height.
Automatic facial expression recognition systems have many applications, including human
behavior understanding and the detection of mental disorders [3]. The topic has become a
research field involving many scientists specializing in different areas such as artificial
intelligence, computer vision, psychology, physiology, education and website customization.
A system that performs automatic recognition of facial expressions consists of three
modules. The first one detects and registers the face in the input image or image
sequence. It can be a detector that finds the face in every image, or one that detects the
face in the first image and then tracks it through the rest of the video sequence. The second module
consists in extracting and representing the facial changes caused by facial expressions. The
last one determines the similarity between the set of extracted characteristics and a set of
reference characteristics. Other filters or data preprocessing modules can be used between
these main modules to improve the results of detection, feature extraction or
classification.
Face detection consists of determining the presence or absence of faces in a picture. This
is a preliminary task necessary for most face analysis techniques. The techniques used
come from the field of pattern recognition. There are several techniques for
detecting the face; we mention the most widely used.
• LBP (local binary patterns): this technique divides the
face into square subregions of equal size in which the LBP characteristics are calculated.
The vectors obtained are concatenated to get the final feature vector.
• Haar filter: this face detection method uses a multiscale Haar filter. The
characteristics of a face are described in an XML file.
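The LBP description above can be illustrated with a minimal pure-Python sketch. It assumes the basic 3×3 LBP operator (8 neighbours compared with the centre pixel) and a 256-bin histogram per square cell; the function names and the cell size are hypothetical choices for illustration, not taken from this report.

```python
def lbp_code(img, y, x):
    """8-bit LBP code of pixel (y, x): one bit per neighbour >= centre."""
    center = img[y][x]
    # Neighbours visited clockwise, starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img, top, left, size):
    """256-bin histogram of LBP codes inside one square subregion."""
    hist = [0] * 256
    # Border pixels are skipped because they lack a full 3x3 neighbourhood.
    for y in range(max(top, 1), min(top + size, len(img) - 1)):
        for x in range(max(left, 1), min(left + size, len(img[0]) - 1)):
            hist[lbp_code(img, y, x)] += 1
    return hist


def lbp_feature_vector(img, cell):
    """Concatenate the histograms of all cell x cell subregions."""
    features = []
    for top in range(0, len(img), cell):
        for left in range(0, len(img[0]), cell):
            features.extend(lbp_histogram(img, top, left, cell))
    return features
```

For a 48×48 face image and a hypothetical cell size of 8, `lbp_feature_vector` would return 36 × 256 values, one histogram per subregion.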
The characteristics of the face are mainly located around the facial components such as
the eyes, mouth, eyebrows, nose and chin. The detection of the characteristic points of the
face is done within a rectangular box returned by a detector which locates the face. The
extraction of geometric features, such as the contours of facial components and facial
distances, provides the location or appearance of the characteristics. Therefore, there are
two types of approaches:
Geometric characteristics represent the shape and location of the components of the face
(including the mouth, eyes, eyebrows and nose). The facial components or facial features are
extracted to form a feature vector representing the geometry of the face.
Appearance characteristics represent changes in the appearance of the face, such as wrinkles
and furrows. With these methods, the effects of head rotation and of different shooting
scales can be eliminated by normalization before the feature extraction step, or by a
suitable feature representation before the expression recognition step.
Existing research can be divided into three families: global approaches,
local approaches and hybrid approaches. Each approach has advantages and
disadvantages related to environmental conditions, image orientation, head position,
etc.
Global approaches are independent of head position (top, bottom) and face image
orientation. These methods are effective but require a heavy learning phase, and the
result depends on the number of samples used.
Local approaches are based on the detection of facial objects and are robust to changes
in luminance. However, the position of the head and its orientation can cause errors in the
system.
The alternative is to combine the two approaches (local and global) in order to benefit
from the advantages of both. The recognition phase in this system is based on
machine learning theory: a feature vector is formed to describe the facial expression, and
the first stage of the classifier is learning. Classifier training consists of labeling the
images after detection; once the classifier is trained, it can recognize input images.
Classification methods can be divided into two groups:
• Recognition based on static data, which only concerns images.
• Recognition based on dynamic data, which concerns image sequences or videos.
Various classifiers have been applied, such as neural networks,
Bayesian networks, SVMs, etc.
Having sufficient labeled training data that include as many variations of the populations
and environments as possible is important for the design of a deep expression recognition
system. We will introduce some databases that contain a large amount of affective images
collected from the real world to benefit the training of deep neural networks.
1.3.4.1 CK+:
1.3.4.2 MMI:
This database is laboratory-controlled and includes 326 sequences from 32 subjects. A total
of 213 sequences are labeled with the six basic expressions, and 205 sequences are captured
in frontal view. In contrast to CK+, sequences in MMI are onset-apex-offset labeled:
the sequence begins with a neutral expression and reaches its peak near the middle before
returning to the neutral expression.
1.3.4.3 JAFFE:
1.3.4.4 FER-2013:
This database was introduced during the ICML 2013 Challenges in Representation Learning.
FER-2013 is a large-scale, unconstrained database collected automatically by the
Google image search API. All images have been registered and resized to 48×48 pixels
after rejecting wrongly labeled frames and adjusting the cropped region. FER-2013
contains 28,709 training images, 3,589 validation images and 3,589 test images with seven
expression labels (anger, disgust, fear, happiness, sadness, surprise and neutral).
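As an illustration of how such 48×48 samples can be decoded, the following sketch parses one record. It assumes the CSV layout of the public Kaggle distribution (an integer `emotion` label from 0 to 6 and a `pixels` field of 2,304 space-separated grey levels); this layout and the helper name `parse_fer_row` are assumptions that are not stated in this report.

```python
# Label order assumed from the Kaggle distribution of FER-2013.
EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]

# Split sizes quoted in the text above.
SPLIT_SIZES = {"training": 28709, "validation": 3589, "test": 3589}


def parse_fer_row(emotion, pixels, size=48):
    """Turn one (emotion, pixels) record into (label name, 2D image)."""
    values = [int(v) for v in pixels.split()]
    if len(values) != size * size:
        raise ValueError("expected %d pixel values" % (size * size))
    # Rebuild the square image row by row.
    image = [values[r * size:(r + 1) * size] for r in range(size)]
    return EMOTIONS[int(emotion)], image
```

The three splits add up to 35,887 images in total.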
Deep learning, or deep machine learning, is a branch of machine learning that takes data
as input and makes intuitive and intelligent decisions using an artificial neural network
stacked layer-wise. It is being applied in various domains for its ability to find patterns
in data, extract features and generate intermediate representations.
1.4 Conclusion:
CHAPTER 2. DEEP LEARNING Awatef MESSAOUDI
Chapter
2
Deep learning
2.1 Introduction:
Deep learning is a subset of machine learning which uses neural networks to analyze
different factors, with a structure that is similar to the human neural system. It uses
complex multi-layered neural networks, where the level of abstraction increases gradually
through nonlinear transformations of the input data.[5] It concerns algorithms inspired by
the structure and function of the brain. They can learn several levels of representation in
order to model complex relationships between data.
Machine learning algorithms work well for a wide variety of problems. However, they have
failed to solve some major AI problems such as speech, face and emotion recognition.
• Train and evaluate model performance (for different algorithms, evaluate and select
the best performing model).
Most features must be determined by an expert and then encoded as a data type.
Features can be pixel values, shapes, etc. The performance of machine learning
algorithms depends on the accuracy of the extracted features. Deep learning reduces
the task of developing new feature extractors by automating the phase of extracting
and learning features.[10] Deep learning uses neural networks to learn representations of
characteristics directly from data.
An artificial neural network (ANN) is a computing model that tries to mimic the human brain
in a very primitive way, in order to emulate human capabilities in a very limited
sense. ANNs have been developed as a generalization of mathematical models of human
cognition or neural biology. An ANN takes an input vector X and produces an output vector
Y; the relationship between X and Y is determined by the network architecture.[12] An
ANN is a network for parallel, distributed information processing. It consists of a number
of information processing elements called neurons or nodes, which are grouped in layers.
The processing elements of the input layer receive the input vector and transmit the values
to the next layer of processing elements across connections, where this process is continued.
This type of network, where data flows one way (forward), is known as a feedforward network.
A feedforward ANN has an input layer, an output layer and one or more hidden layers
between the input and the output layers. Each neuron in a layer is connected to
all the neurons of the next layer, and the neurons in one layer are connected only to the
neurons of the immediately following layer. The strength of the signal passing from one
neuron to another depends on the weight of the interconnection. The hidden layers enhance
the network's ability to model complex functions. The performance of a back-propagation
artificial neural network (BPANN) model has been compared with that of a linear
transfer function (LTF) model and was found to be superior.
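The forward flow of data through a feedforward network can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the weights, biases, layer sizes and function names below are hypothetical and are not the networks used in this work.

```python
def relu(z):
    """Rectified linear activation."""
    return max(0.0, z)


def identity(z):
    """Linear output activation."""
    return z


def dense(x, weights, biases, activation):
    """One fully connected layer: every input feeds every neuron."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Propagate the input vector X layer by layer to obtain Y."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
EXAMPLE_LAYERS = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [0.5], identity),
]
```

Calling `forward([2.0, 1.0], EXAMPLE_LAYERS)` passes the vector through the hidden layer and then the output layer; the weight of each interconnection determines how strongly a signal is transmitted.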
2.4.1 Presentation:
A convolutional neural network (CNN) is a type of artificial neural network proposed by
Yann LeCun in 1988. CNNs are among the most popular deep learning architectures for
image classification, recognition and segmentation. A CNN consists of multiple hierarchical
hidden layers. Its artificial neurons take input from the image, multiply it by weights,
add a bias and then apply an activation function. In this way, artificial neurons can be
used for image classification, recognition and segmentation by performing simple
convolutions, provided that the network is fed with a huge amount of data.[16]
2.4.2 Architecture:[7][12]
Convolutional neural networks are the most efficient models for classifying image data.
They were inspired by the mammalian visual cortex.[10] Each CNN channel is made up of
convolutional layers, max pooling layers, fully connected layers and an output layer.[14]
The convolution layer is the first layer that extracts features from an input image[12]. It
is the fundamental unit of a convnet.[15] It contains a set of filters whose parameters need
to be learned. Once the information reaches a convolution layer, the layer convolves every
filter across the spatial dimensions of the data to produce a 2D activation map. The
result of convolving an (N,M) image matrix with an (n,m) filter matrix is called a « feature
map ». Convolving an image with different filters can perform operations such
as edge detection, blurring and sharpening[15]. During the forward pass,
each filter is convolved across the width and height of the input volume, computing dot
products between the entries of the filter and the input at every position. As the filter
slides over the width and height of the input volume, it produces a 2-dimensional
activation map that gives the responses of the filter at every spatial position. Each layer
has an entire set of filters, and each of them produces a separate 2-dimensional activation
map.[17] The 2D convolution between image A and filter B can be given as:
$$C(i,j) = \sum_{m=0}^{M_a-1} \sum_{n=0}^{N_a-1} A(m,n)\, B(i-m,\, j-n)$$
where $M_a \times N_a$ is the size of image A.
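The 2D convolution of an image A with a filter B can be sketched directly as nested loops. The following minimal pure-Python version computes the full convolution, treating B as zero outside its bounds; the function name `conv2d_full` is a hypothetical choice, and real frameworks use much faster implementations.

```python
def conv2d_full(A, B):
    """Full 2D convolution: C(i, j) = sum over m, n of A(m, n) * B(i-m, j-n)."""
    Ma, Na = len(A), len(A[0])
    Mb, Nb = len(B), len(B[0])
    # The full result has (Ma + Mb - 1) rows and (Na + Nb - 1) columns.
    C = [[0] * (Na + Nb - 1) for _ in range(Ma + Mb - 1)]
    for i in range(Ma + Mb - 1):
        for j in range(Na + Nb - 1):
            for m in range(Ma):
                for n in range(Na):
                    # Only indices that fall inside the filter contribute.
                    if 0 <= i - m < Mb and 0 <= j - n < Nb:
                        C[i][j] += A[m][n] * B[i - m][j - n]
    return C
```

For example, convolving `[[1, 2], [3, 4]]` with the filter `[[1, 0], [0, 1]]` gives `[[1, 2, 0], [3, 5, 2], [0, 3, 4]]`: each output entry is the sum of the products of the overlapping image and filter entries.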
A CNN learns the values of these filters on its own during the training process (although
parameters such as the number of filters, the filter size and the architecture of the
network still need to be specified before training). Increasing the number of filters
lets the network extract more image features, which generally improves it. Three parameters
control the size of the feature map (convolved feature):
• Depth: corresponds to the number of filters we use for the convolution operation.
• Stride: the number of pixels by which the filter slides over the input matrix.
• Zero padding: it is convenient to pad the input matrix with zeros around the
border, so that the filter can be applied to the bordering elements of the input image matrix.
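The effect of these parameters on the size of the feature map can be computed with the usual relation (W - F + 2P) / S + 1, where W is the input width, F the filter width, P the zero padding and S the stride. The sketch below is an illustration; the function names are hypothetical.

```python
def feature_map_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size of the feature map along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span < 0:
        raise ValueError("filter larger than padded input")
    # Positions that do not fit a whole stride are dropped (floor).
    return span // stride + 1


def feature_map_shape(height, width, filter_size, depth, stride=1, padding=0):
    """(height, width, depth) of the output volume; depth = number of filters."""
    return (feature_map_size(height, filter_size, stride, padding),
            feature_map_size(width, filter_size, stride, padding),
            depth)
```

For a 48×48 input, a 3×3 filter with stride 1 and one pixel of zero padding keeps the size at 48×48, whereas the same filter without padding gives 46×46.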
An additional operation, called the RELU layer, is used after every convolution operation.
A rectified linear unit applies an activation function whose output is F(x) = max(0, x).
There are other nonlinear functions, such as tanh or sigmoid, that can also be used instead
of RELU. Most data scientists use RELU, since performance-wise it is better than the other
two.[16]
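The three activation functions mentioned here can be written in a few lines using only the standard `math` module. This is a scalar sketch for illustration; deep learning libraries apply the same functions element-wise to whole feature maps.

```python
import math


def relu(x):
    """RELU: F(x) = max(0, x), negative inputs are set to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid: squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: squashes any real input into (-1, 1)."""
    return math.tanh(x)
```

Unlike sigmoid and tanh, RELU does not saturate for large positive inputs, which is one commonly cited reason for its better training behaviour.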
In the end, the extracted features (the CNN code) are concatenated into a single vector,
which is fed into the fully connected layer (a multilayer perceptron). The term
« fully connected » indicates that every neuron in the previous layer is connected to every
neuron in the next layer. The outputs of the convolutional and pooling layers represent
high-level features of the input image. The purpose of the fully connected layer is to use
these features to classify the input image into various classes based on the training
dataset.
where j is the index of the image and K is the total number of facial expression classes.
RELU is an activation function which sets all negative values to zero.
In recent years, CNN architectures have evolved considerably. These networks have
become so deep that it is extremely difficult to visualise the entire model.
2.5.2 AlexNet(2012):
2.5.3 VGG-16(2014):
With this architecture, we notice that CNNs were starting to get deeper and deeper.
This is because the most straightforward way of improving the performance of deep neural
networks is to increase their size. VGG-16 has 13 convolutional and 3 fully connected
layers, carrying with them the RELU tradition from AlexNet. It consists of 138M
parameters and takes about 500MB of storage space.
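The 138M figure can be checked with simple arithmetic. The sketch below assumes the standard VGG-16 configuration (3×3 convolutions, a 224×224×3 input, a 7×7×512 volume before the first fully connected layer, and 1000 output classes) and 4 bytes per parameter; these assumptions come from the commonly published VGG-16 layout, not from this report.

```python
# (input channels, output channels) of the 13 convolutional layers.
VGG16_CONV = [(3, 64), (64, 64),
              (64, 128), (128, 128),
              (128, 256), (256, 256), (256, 256),
              (256, 512), (512, 512), (512, 512),
              (512, 512), (512, 512), (512, 512)]

# (inputs, outputs) of the 3 fully connected layers.
VGG16_FC = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in, c_out, k=3):
    """Weights of a k x k convolution plus one bias per filter."""
    return k * k * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights of a fully connected layer plus one bias per neuron."""
    return n_in * n_out + n_out


def vgg16_param_count():
    conv = sum(conv_params(i, o) for i, o in VGG16_CONV)
    fc = sum(fc_params(i, o) for i, o in VGG16_FC)
    return conv + fc


def size_in_mib(n_params, bytes_per_param=4):
    """Storage needed for float32 weights, in MiB."""
    return n_params * bytes_per_param / 2 ** 20
```

Under these assumptions `vgg16_param_count()` gives 138,357,544 parameters, of which about 124M belong to the three fully connected layers, and the float32 weights occupy a little over 500 MiB.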
2.5.4 Inception-v1(2014):
This 22-layer architecture with 5M parameters is called Inception-v1. The design of
the Inception module is a product of research on approximating sparse
structures.
2.5.5 ResNet-50(2015):
In the past few CNNs, we have seen nothing but an increasing number of layers in
the design, achieving better performance. But as the network depth increases,
accuracy gets saturated and then degrades rapidly. Researchers from Microsoft Research
addressed this problem with ResNet, using skip connections while building deeper models.
ResNet, with 26M parameters, is one of the early adopters of batch normalisation.
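The idea of a skip connection can be shown with a minimal sketch: the block outputs F(x) + x, so the input is added back to the result of the inner layers. The function below is a simplified illustration on plain lists (the name `residual_block` and the toy inner layers are hypothetical), not the actual ResNet block.

```python
def residual_block(x, inner):
    """Return inner(x) + x: the skip connection adds the input back."""
    fx = inner(x)
    if len(fx) != len(x):
        raise ValueError("inner layers must preserve the vector size")
    return [f + xi for f, xi in zip(fx, x)]
```

If the inner layers learn to output zeros, the block reduces to the identity, which is why adding such blocks should not make a deeper model worse than a shallower one.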
2.5.6 Xception(2016):
Xception is an adaptation of Inception in which the Inception modules have been replaced
with depthwise separable convolutions. It has roughly the same number of parameters
as Inception-v3 (23M).
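The benefit of a depthwise separable convolution can be seen by counting weights. The sketch below compares a standard k×k convolution with its separable counterpart (one k×k filter per input channel followed by a 1×1 pointwise convolution); biases are ignored and the channel numbers in the example are hypothetical.

```python
def standard_conv_weights(k, c_in, c_out):
    """Every output channel has its own k x k filter over all input channels."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Depthwise step (k x k per input channel) plus 1 x 1 pointwise step."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise
```

For a 3×3 layer going from 128 to 256 channels, the standard convolution needs 294,912 weights while the separable version needs 33,920, roughly 8.7 times fewer.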
2.6 Conclusion:
In this chapter, we have presented neural networks and their different architectures.
We focused on CNNs, their structure and their different layers, and then presented a
few examples of architectures. In the next chapter, we will explain the idea behind the
architecture that we have chosen for our facial expression recognition system.
CHAPTER 3. FACIAL EMOTION RECOGNITION: SYSTEM DESIGN
Awatef MESSAOUDI
Chapter
3
Facial Emotion Recognition: system
design
3.1 Introduction:
Despite the notable success of traditional facial recognition methods based on the extraction
of handcrafted features, over the past decade researchers have turned to the deep learning
approach due to its high automatic recognition capacity. The goal of our project is to use
a CNN to recognize facial expressions.
In this section, we present our facial emotion recognition system based on a CNN.
This system detects the face of a person in an image, a video sequence or
a camera stream, and determines the expression, with an associated accuracy rate, among the
six universal expressions (happy, disgust, fear, anger, sad, surprise) and neutral.
Our research on facial expression recognition has enabled us to note that all the solutions
proposed for the recognition of emotions are structured according to the same overall
architecture, in three main modules: face detection, feature extraction and classification.
These will be the principal modules of our system.
The effectiveness of the system depends on the method used to locate the face in the
picture. We will use the Viola-Jones algorithm to detect various parts of the human face
such as the mouth, eyes, nose, eyebrows, lips and ears. This algorithm exploits
Haar-type characteristics via a cascade classifier, which can effectively combine many
features and determine the different filters of the resulting classifier.
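Haar-type features are sums of pixel intensities over adjacent rectangles, and the Viola-Jones method evaluates them quickly with an integral image. The following pure-Python sketch illustrates that idea for a single two-rectangle feature; the function names are hypothetical and the sketch leaves out the cascade and the training of the classifier.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Haar-like feature: left half of the window minus right half."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))
```

Once the integral image is built, any rectangle sum costs four lookups regardless of its size, which is what makes it practical to evaluate many features at many scales.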
Once the face is detected, the system starts the feature extraction process, which
converts the pixel data into a smaller representation to be used in the classification
process. This step reduces the size of the input image while keeping the most useful
information. This process is based on a convolutional neural network (CNN).
In the second architecture, we tried to focus on the impact of adding more layers to
our system. We added two convolutional layers to our network, with 256 filters of size 3×3
and then 512 filters of size 3×3.
This model has demonstrated that the depth of the network is beneficial for
classification accuracy. However, the VGG16 network has two major drawbacks:
• Slow to train.
• Due to its depth and its number of fully connected nodes, the VGG16 model exceeds 533
MB in size.
Our facial emotion recognition system is adapted from VGG16, which has shown excellent
performance in many computer vision tasks.
3.6 Conclusion:
In this chapter, we have introduced two models (a CNN network and a VGG16 network).
These architectures are used for image classification, and we noted that each
architecture has specific characteristics. In the next chapter, we will present the
implementation of these two models and test them in order to assess the performance
of a deep learning FER system, and we will compare the different results.
CHAPTER 4. IMPLEMENTATION AND REALISATION: Awatef MESSAOUDI
Chapter
4
Implementation and realisation:
4.1 Introduction:
The goal of our project is to design and implement an application that
recognizes facial expressions. We are interested in the application of deep learning models.
In this chapter, we present the implementation of some of the code of the different models,
the development environment and the various tools used. Finally, we present the
results obtained.
4.2.1 Python 3.7:
4.2.2 OpenCV:
4.2.3 TensorFlow:
4.2.4 Keras:
4.2.5 NumPy:
4.2.6 Sklearn:
4.2.7 Matplotlib:
4.3 Database:
For best performance, we should train the network on a large number of sample images. This
increases the accuracy and improves the performance of the model. Unfortunately,
very large datasets are not publicly available, but we have access to two
public databases (FER2013 and CK+). For our system, we will use the FER2013 database.
4.4 Implementation:
The system of Facial Emotion Recognition consists of two modules : The first one is for
face detection and the second one is for emotion recognition. In the following, we will
detail each module.
At this stage, we chose the method proposed by Paul Viola and Michael Jones.
In our system, we opted for Haarcascade.eye.xml from the OpenCV library, which provides
the Haar cascade method.
We will use a convolutional neural network to recognize the facial emotion of the
detected faces. The first things to consider are the architecture chosen and the number
of layers implemented in this architecture. In our project, we will use two different
architectures and compare their performance. The first
architecture chosen is VGG16, a simple architecture that is easy to implement and handle in
Python. The second one was inspired by VGG16, but with a reduced number of layers.
Our system contains several modules; in the following, we specify the role and purpose of
each module.
4.4.3 Libraries:
In this module, we imported all the libraries needed during the training phase.
We applied data augmentation to our data because we needed to add more data to our
training set.
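As a simple illustration of data augmentation, the sketch below doubles a training set by adding a horizontally mirrored copy of every image. The report does not state which transformations were applied, so the horizontal flip and the function names are assumptions chosen only because a mirrored face keeps the same expression label.

```python
def flip_horizontal(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def augment_with_flips(images, labels):
    """Return the original samples followed by their mirrored copies."""
    aug_images = list(images) + [flip_horizontal(img) for img in images]
    aug_labels = list(labels) + list(labels)
    return aug_images, aug_labels
```

In practice, libraries such as Keras provide image augmentation utilities (rotations, shifts, zooms, flips) that generate such variations on the fly during training.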
This architecture consists of eleven layers in all (8 convolutional layers and 3
fully connected layers).
4.4.6 Optimization:
When we train this model, we are basically trying to solve an optimization problem. The
weights are initialised arbitrarily and then optimised: our task is to find the weights that
most accurately map our input data to the correct output class. During training, these
weights are updated and saved in an .h5 file. The weights are optimised using an optimization
algorithm; in our case, we used the SGD (stochastic gradient descent) optimizer.
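The principle of stochastic gradient descent can be sketched on a toy problem: fitting a single weight w so that w·x matches y, updating w after each sample. The learning rate, number of epochs, toy data and function names below are hypothetical and only illustrate the update rule w ← w − lr · gradient; they are not the settings used in this work.

```python
import random


def sgd_step(weights, grads, lr):
    """One SGD update: move each weight against its gradient."""
    return [w - lr * g for w, g in zip(weights, grads)]


def fit_slope(samples, lr=0.05, epochs=100, seed=0):
    """Fit y = w * x by SGD on the squared error, one sample at a time."""
    rng = random.Random(seed)
    data = list(samples)
    w = [0.0]  # arbitrary starting weight
    for _ in range(epochs):
        rng.shuffle(data)  # samples are visited in random order
        for x, y in data:
            grad = 2.0 * (w[0] * x - y) * x  # d/dw of (w*x - y)^2
            w = sgd_step(w, [grad], lr)
    return w[0]
```

With the samples (1, 2), (2, 4) and (3, 6), the fitted weight converges towards 2, the slope that maps every input to its target.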
4.4.7 Learning :
The next step is to learn from the given data. To train the network, we use the « fit() »
function.
An epoch refers to a single pass of the entire data set through the network during
training.
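The relation between epochs, batch size and the number of weight updates can be made concrete with a short calculation on the 28,709 training images of FER2013. The batch size of 64 and the 50 epochs used in the example are hypothetical values, not the settings of this work; in Keras, the corresponding `fit()` arguments are `batch_size` and `epochs`.

```python
import math

TRAIN_IMAGES = 28709  # size of the FER2013 training split


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


def total_updates(n_samples, batch_size, epochs):
    """Weight updates performed over the whole training run."""
    return steps_per_epoch(n_samples, batch_size) * epochs
```

With a batch size of 64, one epoch corresponds to 449 batches, the last one being only partially filled.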
4.5 Results:
4.5.1 Performances:
The second figure shows the difference between the two architectures.
A confusion matrix is used to find which emotions are usually confused with each other.
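A confusion matrix counts, for each true emotion, how often each emotion was predicted. The sketch below builds one from integer class labels and reports the most frequent confusion; the function names are hypothetical, and libraries such as scikit-learn provide an equivalent `confusion_matrix` function.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] = number of samples of true class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def most_confused_pair(matrix):
    """(true class, predicted class, count) of the largest off-diagonal cell."""
    best = (None, None, 0)
    for t, row in enumerate(matrix):
        for p, count in enumerate(row):
            if t != p and count > best[2]:
                best = (t, p, count)
    return best
```

The diagonal of the matrix holds the correct predictions, so large off-diagonal cells point directly to the pairs of emotions that the model tends to mix up.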
4.6 Discussion
4.7 Conclusion:
In this chapter, we have presented a facial expression recognition system based on a CNN.
We presented the results obtained for each architecture. This system has been tested on
the FER2013 database from Kaggle.
CONCLUSION Awatef MESSAOUDI
Chapter
5
Conclusion
Applying deep learning methods to emotion recognition is still a challenge, due to the
variation of gesture, pose and emotions detected in real time. The long-term goal is to
develop a complete system that encompasses speech, gesture and facial expressions. To
do so, collecting data for this work requires more effort than for some other tasks
such as object recognition.
Bibliography
[1] Andoni Beristain. “Emotion recognition based on the analysis of facial expressions:
a survey”. In: New Mathematics and Natural Computation 5.2 (July 2009),
pp. 513–534.
[2] Charles Bazerman et al. Emotion recognition based on the analysis of Facial
expressions : a survey. Vol. 356. University of Wisconsin Press Madison, 1988.
[5] Ashraf Aboulnaga, Alaa R Alameldeen, and Jeffrey F Naughton. “Estimating the
selectivity of XML path expressions for internet scale applications”. In: VLDB. Vol. 1.
2001, pp. 591–600.