
Republic of Tunisia LR-SITI-ENIT
Ministry of Higher Education, Scientific

Research and Information and
Communication Technologies
Tunis El Manar University ST-EN07/00

National School of Engineering of Tunis Master Project
Serial N°: 2020 /
Master Project
Report
presented at
National School of Engineering of Tunis

(LR-SITI-ENIT)
in order to obtain the
Master degree in Systems, Science and Data
by
Awatef MESSAOUDI
Facial Emotion Recognition based on CNN
Defended on 18/12/2020 in front of the committee composed of
Mr F F President
Mr Zied LACHIRI Supervisor
Mr F F Reviewer
Dedication
I dedicate this work to my parents who have provided me with their encouragement,
love and understanding.
To all who were there for me, thank you for your help and encouragement.
To all of you,
I dedicate this work.
Awatef MESSAOUDI
Acknowledgements
First and foremost, I would like to express my infinite gratitude and respect to my
supervisor, Mr Zied LACHIRI.
I am also grateful to all my teachers, without whom this work would not have been possible.
I will not forget, of course, to express my gratitude to my colleagues in the Master's
programme who kindly agreed to cooperate. Most of all, I am thankful to my family for their
exceptional support through all challenges.
Introduction
Automatic recognition of human emotion has been an active research area for decades, with
growing application areas including avatar animation, neuro-marketing and social robots.
Emotion recognition is a challenging task as it involves predicting abstract emotional
states from multi-modal input data. These modalities include video, audio and physiological
signals. The face is one of the most informative of these channels, and facial emotion
recognition remains one of the most challenging tasks in computer vision. In this
preliminary study, we propose a simple solution for facial expression recognition. We first
review the state of the art of facial expression recognition, focusing on notions such as
emotions, facial expressions and the six universal emotions. The first chapter is an
introduction to facial expression recognition and to related terms such as face detection,
feature extraction, machine learning and deep learning. The second chapter focuses on deep
learning and its different blocks, with a simple comparison with machine learning. The
third chapter presents the architecture of two different deep learning networks; we study
their different blocks and the FER2013 data set used in our work. The final chapter covers
the implementation and the results, with a focus on the performance of our architectures.
Chapter 1
FER: State of the art
1.1 Introduction:
Due to the important role of facial expressions in human interaction, the ability to
perform facial expression recognition automatically via computer vision enables a range
of applications such as human-computer interaction and data analytics. In this chapter,
we present some notions of emotion and different coding systems, as well as the
architecture of a facial expression recognition system. We then present some approaches
that help us to recognize facial expressions, and we end the chapter with different
machine learning techniques.
1.2 Facial expressions and emotions :
1.2.1 Definitions:
1.2.1.1 Emotions:[1]
Emotion is expressed through many channels such as body position, voice and facial
expressions. It is a mental and physiological state which is subjective and private. It
involves many behaviours, actions, thoughts and feelings.
Scherer proposes the following definition: « Emotion is a set of episodic variations
in several components of the organism in response to events assessed as important by
the organism. »
1.2.1.2 Facial expressions:
A facial expression is a meaningful movement of the face. Its meaning can be the expression
of an emotion, a semantic cue or an intonation in sign language. The interpretation
of a set of muscle movements as an expression depends on the context of the application.
For example, in a human-machine interaction application where we want an indication of
the emotional state of an individual, we will try to classify the measurements in terms
of emotions.
1.2.2 The universal facial expressions:
Charles Darwin wrote in his 1872 book « The Expression of the Emotions in Man and
Animals » that facial expressions of emotion are universal, not learned differently in each
culture. Several studies since have attempted to classify human emotions and to demonstrate
how the face can give away an emotional state.[2] In 1971, Ekman and Friesen defined
six basic emotions based on a cross-cultural study, which indicated that humans perceive
certain basic emotions in the same way regardless of culture. These prototypical facial
expressions are anger, disgust, fear, happiness, sadness and surprise.[2]
Figure 1. The six universal emotions
1.2.3 Coding systems:
Facial expressions are a consequence of the activity of the facial muscles, also called
mimetic muscles or muscles of facial expression. The study of facial expressions
cannot be done without studying the anatomy of the face and the underlying structure
of the muscles. That is why some researchers have focused on coding systems for facial
expressions. Several systems have been proposed, such as Ekman's: in 1978, Ekman
developed a tool for coding facial expressions that is still widely used today. We present
some of these systems below.
1.2.3.1 FACS:
The Facial Action Coding System (FACS) is a system developed by Ekman and Friesen which is
a standard way of describing facial expressions in both psychology and computer animation.
FACS is based on 44 action units (AUs) that represent facial movements that cannot be
decomposed into smaller ones. FACS is very successful, but it suffers from some drawbacks
such as:
• Complexity: it takes 100 hours of learning to master the main concepts.
• Difficulty of handling by a machine: FACS was created for psychologists; some
measurements remain vague and difficult for a machine to assess.
• Lack of precision: the transition between two states of a muscle is represented in a
linear way, which is only an approximation of reality.
1.2.3.2 MPEG4:
The MPEG-4 video encoding standard includes a model of the human face developed by the
Face and Body Ad Hoc Group. It is a 3D model built on a set of facial attributes called
Facial Feature Points (FFP). Measurements on these points are used to describe muscle
movements (Facial Animation Parameters, the equivalent of Ekman's Action Units).
Figure 2. MPEG4 Model
1.2.3.3 Candide:
Candide is a model of the face containing 75 vertices and 100 triangles. It is composed of
a generic face model and a set of parameters (Shape Units). These parameters are used
to adapt the generic model to a particular individual. They represent the differences
between individuals and are 12 in number:
1. head height.
2. vertical position of the eyebrows.
3. vertical eye position.
4. eye width.
5. eye height.
6. eye separation distance.
7. depth of the cheeks.
8. depth of the nose.
9. vertical position of the nose.
10. degree of the curvature of the nose.
11. vertical position of the mouth.
12. width of the mouth.
Figure 3. Candide Model
1.2.4 Areas of application of FER:
Automatic facial expression recognition systems have many applications, including the
understanding of human behavior, the detection of mental disorders, etc.[3] It has become
a research field involving many scientists specializing in different areas such as
artificial intelligence, computer vision, psychology, physiology, education, website
customization, etc.
1.3 Architecture of Facial expression recognition:
A system that performs automatic recognition of facial expressions consists of three
modules. The first one detects and registers the face in the input image or image
sequence. It can be a detector that locates the face in each image, or one that detects
the face in the first image and then tracks it through the rest of the video sequence. The
second module extracts and represents the facial changes caused by facial expressions. The
last one determines the similarity between the set of extracted characteristics and a set
of reference characteristics. Other filters or data preprocessing modules can be used
between these main modules to improve the results of detection, feature extraction or
classification.
1.3.1 Face detection:
Face detection consists of determining the presence or absence of faces in a picture. This
is a preliminary task necessary for most face analysis techniques. The techniques used
come from the field of pattern recognition. There are several techniques for detecting
the face; we mention the most used ones.
• Automatic facial processing: a method that characterizes faces by distances and
proportions between particular points around the eyes, nose and corners of the mouth,
but it is not effective when the light is low.
• Eigenface: an effective characterization method in facial processing tasks such as
face detection and recognition. It is based on the representation of face features
from model grayscale images.
• LDA (linear discriminant analysis): it is based on predictive discriminant analysis.
It is about explaining and predicting the membership of an individual to a
predefined class based on measured characteristics used as predictor variables.
• LBP (local binary patterns): this technique divides the face into square subregions
of equal size in which the LBP characteristics are calculated. The vectors obtained
are concatenated to get the final feature vector.
• Haar filter: this face detection method uses multiscale Haar filters. The
characteristics of a face are described in an XML file.
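As an illustration of the LBP item in the list above, the following minimal sketch computes the basic 3*3 LBP code of a pixel and the code histogram of one block. The function names `lbp_code` and `lbp_histogram`, the clockwise bit ordering and the toy block are assumptions made for this example; they are not taken from the system described in this report.

```python
def lbp_code(img, r, c):
    """Return the 8-bit LBP code of pixel (r, c) from its 3x3 neighbourhood."""
    center = img[r][c]
    # Neighbours visited clockwise from the top-left pixel (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit to 1.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over the interior pixels of one block."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


# Toy 3x3 grayscale block (hypothetical values).
block = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
code = lbp_code(block, 1, 1)
hist = lbp_histogram(block)
print(code)
```

In a full descriptor, one such histogram would be computed per subregion of the face and the histograms concatenated into the final feature vector.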
1.3.2 Feature extraction:
The characteristics of the face are mainly located around facial components such as
the eyes, mouth, eyebrows, nose and chin. The detection of the characteristic points of
the face is done within the rectangular box returned by the detector which locates the
face. The extraction of geometric features such as the contours of facial components and
facial distances provides the location or appearance of characteristics. There are
therefore two types of approaches:
1.3.2.1 The geometric characteristics:
These characteristics represent the shape and location of the components of the face
(including the mouth, eyes, eyebrows and nose). The facial components or facial features
are extracted to form a feature vector representing the geometry of the face.
1.3.2.2 The characteristics of appearance:
These characteristics represent changes in the appearance of the face such as wrinkles and
furrows. With these methods, the effect of head rotation and of different shooting scales
can be eliminated by a normalization before the feature extraction step or by a suitable
representation of the features before the expression recognition step.
1.3.3 Emotion recognition:
Existing work can be divided into three families: global approaches, local approaches
and hybrid approaches. Each approach has advantages and disadvantages related to
environmental issues, orientation of images, position of the head, etc.
1.3.3.1 Global approach:
These approaches are independent of head position (top, bottom) and face image
orientation. These methods are effective but require a heavy learning phase, and the
result depends on the number of samples used.
1.3.3.2 Local approach:
These approaches are based on the detection of facial objects and they are robust to
changes of luminance. However, the position of the head and its orientation can cause
some failures in the system.
1.3.3.3 Hybrid approach:
The alternative is to combine the two approaches (local and global) in order to take
advantage of both. The recognition phase in this system is based on machine learning
theory: a feature vector is formed to describe the facial expression, and the first stage
of the classifier is learning. Classifier training consists of labeling the images
after detection; once the classifier is trained, it can recognize the input images. The
classification methods can be divided into two groups:
• Recognition based on static data, which only concerns images.
• Recognition based on dynamic data, concerning image sequences or videos.
Various classifiers have been applied, such as neural networks, Bayesian networks, SVM, etc.
1.3.4 Facial expression databases:
Having sufficient labeled training data that include as many variations of the populations
and environments as possible is important for the design of a deep expression recognition
system. We will introduce some databases that contain a large amount of affective images
collected from the real world to benefit the training of deep neural networks.
1.3.4.1 CK+:
The Extended Cohn-Kanade (CK+) database is the most extensively used laboratory-controlled
database for evaluating FER systems. CK+ contains 593 video sequences from 123
subjects. The sequences vary in duration from 10 to 60 frames and show a shift from a
neutral facial expression to the peak expression. Among these videos, 327 sequences from
118 subjects are labeled with seven basic expression labels (anger, contempt, disgust, fear,
happiness, sadness and surprise) based on the Facial Action Coding System (FACS). Because
CK+ does not provide specified training, validation and test sets, the evaluation protocols
used on this database are not uniform.
1.3.4.2 MMI:
This database is laboratory-controlled and includes 326 sequences from 32 subjects. A total
of 213 sequences are labeled with six basic expressions, and 205 sequences are captured
in frontal view. In contrast to CK+, sequences in MMI are onset-apex-offset labeled:
the sequence begins with a neutral expression, reaches the peak near the middle and then
returns to the neutral expression.
1.3.4.3 JAFFE:
The Japanese Female Facial Expression (JAFFE) database is a laboratory-controlled image
database that contains 213 samples of posed expressions from 10 Japanese females. Each
person has 3 or 4 images for each of the six basic facial expressions (anger, disgust, fear,
happiness, sadness and surprise) and one image with a neutral expression. The database is
challenging because it contains few examples per subject and expression.
1.3.4.4 FER-2013:
This database was introduced during the ICML 2013 Challenges in Representation Learning.
FER-2013 is a large-scale and unconstrained database collected automatically by the
Google image search API. All images have been registered and resized to 48*48 pixels
after rejecting wrongfully labeled frames and adjusting the cropped region. FER-2013
contains 28,709 training images, 3,589 validation images and 3,589 test images with seven
expression labels (anger, disgust, fear, happiness, sadness, surprise and neutral).
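To make the 48*48 image format concrete, here is a minimal sketch of how one FER-2013 record could be decoded. It assumes the commonly distributed CSV layout with the columns `emotion`, `pixels` (space-separated gray levels) and `Usage`, and the label order shown in `EMOTIONS`; these details, the helper name `parse_row` and the synthetic sample row are assumptions for illustration, not taken from this report.

```python
import csv
import io

# Assumed label order of the public CSV distribution (0 = anger ... 6 = neutral).
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]
SIDE = 48  # images are 48*48 pixels


def parse_row(row):
    """Turn one CSV record into (label name, 48x48 list of gray levels, split name)."""
    values = [int(v) for v in row["pixels"].split()]
    if len(values) != SIDE * SIDE:
        raise ValueError("expected %d pixels, got %d" % (SIDE * SIDE, len(values)))
    image = [values[i * SIDE:(i + 1) * SIDE] for i in range(SIDE)]
    return EMOTIONS[int(row["emotion"])], image, row["Usage"]


# Synthetic one-record file standing in for the real data set (hypothetical values).
fake_pixels = " ".join(str((i * 7) % 256) for i in range(SIDE * SIDE))
sample = io.StringIO("emotion,pixels,Usage\n3," + fake_pixels + ",Training\n")

label, image, usage = parse_row(next(csv.DictReader(sample)))
print(label, len(image), len(image[0]), usage)
```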
1.3.5 Machine learning:
Machine learning is one of the most exciting areas of technology at the moment. Every day
we see stories that herald new breakthroughs in facial recognition technology,
self-driving cars or computers that can hold a conversation just like a person. Machine
learning technology is set to revolutionise almost every area of human life and work. The
primary reason for using machine learning is to automate complex tasks and to analyze
data of great variety and complexity.
1.3.6 Deep learning:
Deep learning, or deep machine learning, is a branch of machine learning that takes data
as an input and makes intuitive and intelligent decisions using an artificial neural network
stacked layer-wise. It is being applied in various domains for its ability to find patterns
in data, extract features and generate intermediate representations.
1.4 Conclusion:
Chapter 2
Deep learning
2.1 Introduction:
Deep learning is a subset of machine learning which uses neural networks to analyze
different factors with a structure that is similar to the human neural system. It uses
complex multi-layered neural networks, where the level of abstraction increases gradually
through nonlinear transformations of the input data.[5] It concerns algorithms inspired by
the structure and function of the brain. They can learn several levels of representation in
order to model complex relationships between data.
2.2 Machine learning vs Deep learning:
Machine learning algorithms work well for a wide variety of problems. However, they have
failed to solve some major AI problems such as speech, face and emotion recognition.
Figure 4. Machine learning vs Deep learning.
A machine learning method includes the following four steps:
• Feature engineering: choose the attributes (features) used as a basis for prediction.
• Choose the appropriate machine learning algorithm (such as a classification algorithm
or a regression algorithm).
• Train and evaluate model performance (for different algorithms, evaluate and select
the best performing model).
• Use the trained model to classify or predict unknown data.[9]
Most features must be determined by an expert and then encoded as a data type.
Features can be pixel values, shapes, etc. The performance of machine learning
algorithms depends on the accuracy of the extracted features. Deep learning reduces
the task of developing new feature extractors by automating the phase of extracting
and learning features.[10] Deep learning uses neural networks to learn representations of
characteristics directly from data.
2.3 Artificial neural network:[11]
An artificial neural network (ANN) is a computing model that tries to mimic the human brain
in a very primitive way, in order to emulate the capabilities of human beings in a very
limited sense. ANNs have been developed as a generalization of mathematical models of human
cognition or neural biology. An ANN takes an input vector X and produces an output vector
Y. The relationship between X and Y is determined by the network architecture.[12] An
ANN is a parallel, distributed information processing network. It consists of a number
of information processing elements called neurons or nodes, which are grouped in layers.
The processing elements of the input layer receive the input vector and transmit the values
to the next layer of processing elements across connections, where this process is
continued. This type of network, where data flow one way (forward), is known as a
feedforward network. A feedforward ANN has an input layer, an output layer and one or more
hidden layers between the input and the output layers. Each neuron in a layer is connected
to all the neurons of the next layer, and the neurons in one layer are connected only to
the neurons of the immediately following layer. The strength of the signal passing from one
neuron to another depends on the weight of the interconnection. The hidden layers enhance
the network's ability to model complex functions. In the cited work, the performance of a
back-propagation ANN (BPANN) model was compared with a linear transfer function (LTF)
model and was found to be superior.
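The feedforward computation described above can be sketched in a few lines of plain Python. This is only an illustration: the layer sizes, the weights, the sigmoid activation and the names `dense` and `forward` are hypothetical choices made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every output neuron sees every input value."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Propagate the input vector X through the layers, one way only (forward)."""
    for weights, biases in layers:
        x = dense(x, weights, biases, sigmoid)
    return x


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [0.0, 0.0]], [0.0, -1.0, 0.0]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]
y = forward([1.0, 1.0], layers)
print(y)
```

Each row of a weight matrix holds the interconnection weights of one neuron, which is how the strength of the signal between two neurons enters the computation.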
Figure 5. Artificial neural network architecture.
2.4 Convolutional neural network CNN:
2.4.1 Presentation:
The convolutional neural network (CNN) is a type of artificial neural network that was
proposed by Yann LeCun in 1988. CNNs are one of the most popular deep learning
architectures for image classification, recognition and segmentation. A CNN consists of
multiple hierarchical hidden layers. Its artificial neurons take inputs from the image,
multiply them by weights, add a bias and then apply an activation function. Thus, by
performing simple convolutions and being fed with a huge amount of data, these artificial
neurons can be used for image classification, recognition and segmentation.[16]
2.4.2 Architecture:[7][12]
Convolutional neural networks are among the most efficient models for classifying image
data. They were inspired by the mammalian visual cortex.[10] Each CNN channel is made up of
convolutional layers, max pooling layers, fully connected layers and an output layer.[14]
Figure 6. Architecture for a convolutional neural network.
2.4.2.1 The convolution layer CONV:
The convolution layer is the first layer to extract features from an input image.[12] It is
the fundamental unit of a convnet.[15] It contains a set of filters whose parameters need
to be learned. Once the data reach a convolution layer, the layer convolves every
filter across the spatial dimensions of the data to produce a 2D activation map. The
result of convolving an (N,M) image matrix with an (n,m) filter matrix is called a
« feature map ». Convolving an image with different filters can perform operations such
as edge detection, blurring and sharpening.[15] During the forward pass, each filter is
convolved across the width and height of the input volume, computing dot products between
the entries of the filter and the input at each position. As the filter slides over the
width and height of the input volume, it produces a 2-dimensional activation map that
gives the responses of the filter at every spatial position. Each layer has an entire set
of filters, and each of them produces a separate 2-dimensional activation map.[17] The 2D
convolution between image A and filter B can be given as:
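$$C(i,j) = \sum_{m=0}^{M_a-1} \sum_{n=0}^{N_a-1} A(m,n)\, B(i-m,\, j-n)$$
where the size of A is (Ma * Na), the size of B is (Mb * Nb), 0 <= i < Ma+Mb-1 and
0 <= j < Na+Nb-1, and B(i-m, j-n) is taken as zero when its indices fall outside the filter.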
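The convolution sum can be checked on a small example. The sketch below is a direct, unoptimized transcription of the formula for the full output of size (Ma+Mb-1) * (Na+Nb-1); the function name `conv2d_full` and the toy matrices are assumptions made for the example.

```python
def conv2d_full(A, B):
    """Full 2D convolution: C(i, j) = sum over m, n of A(m, n) * B(i - m, j - n)."""
    Ma, Na = len(A), len(A[0])
    Mb, Nb = len(B), len(B[0])
    C = [[0] * (Na + Nb - 1) for _ in range(Ma + Mb - 1)]
    for i in range(Ma + Mb - 1):
        for j in range(Na + Nb - 1):
            total = 0
            for m in range(Ma):
                for n in range(Na):
                    # Terms whose filter index falls outside B contribute zero.
                    if 0 <= i - m < Mb and 0 <= j - n < Nb:
                        total += A[m][n] * B[i - m][j - n]
            C[i][j] = total
    return C


A = [[1, 2],
     [3, 4]]    # toy "image"
B = [[1, 0],
     [0, -1]]   # toy filter
C = conv2d_full(A, B)
print(C)
```

Deep learning libraries usually compute only the positions where the filter fully overlaps the input, and often without flipping the filter, but the dot-product structure is the same.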
The CNN learns the values of these filters on its own during the training process (although
parameters such as the number of filters, the filter size and the architecture of the
network still need to be specified before training). The more filters we use, the more
image features get extracted and the better the network generally becomes at recognizing
patterns. Three parameters control the size of the feature map (convolved feature):
• Depth: corresponds to the number of filters we use for the convolution operation.
• Stride: the number of pixels by which the filter slides over the input matrix at each
step; for example, with a stride of 3 the filter moves 3 pixels at a time.
• Zero padding: it is convenient to pad the input matrix with zeros around the
border, so that the filter can be applied to the bordering elements of the input image
matrix.
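The effect of these parameters on the spatial size of the feature map can be summarized by the usual output-size relation. The symbols W (input width or height), F (filter size), P (zero padding added on each side), S (stride) and O (output width or height) are introduced here for illustration and are not used elsewhere in this report; the numeric values are assumed examples.

```latex
O = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1
% Example (assumed values): W = 48, F = 3, P = 1, S = 1  =>  O = 48
% Example (assumed values): W = 48, F = 3, P = 0, S = 1  =>  O = 46
```

The depth of the output volume is simply the number of filters used in the layer.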
An additional operation, called the RELU layer, is used after every convolution operation.
A rectified linear unit applies an activation function whose output is F(x) = max(0, x).
Other nonlinear functions such as tanh or sigmoid can also be used instead of RELU, but
most data scientists use RELU since, performance-wise, it is better than the other
two.[16]
2.4.2.2 The pooling layer:[12][16][17]
A pooling layer is inserted between successive convolution layers. It applies a
downsampling operation along the spatial dimensions (width and height), which reduces the
dimensionality of each map but retains the important information. Spatial pooling can be of
different types such as max pooling, average pooling and sum pooling. In max pooling,
a spatial neighborhood (for example a 2*2 window) is defined and the largest element is
taken from the rectified feature map within that window. In the case of average or sum
pooling, the average or the sum of all elements in that window is taken. In practice, max
pooling has been shown to work better. Max pooling reduces the input by applying the max
function over the input x_i. Let m be the size of the filter; then the output is
calculated as follows:
$$M(x_i) = \max\{\, x_{i+k,\, i+l} \;:\; |k| \le m/2,\ |l| \le m/2,\ k, l \in \mathbb{N} \,\}$$
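A minimal sketch of max pooling on a small feature map follows. It uses non-overlapping 2*2 windows anchored at their top-left corner, as is common in CNN layers, rather than the centred window of the formula; the function name `max_pool2d` and the sample values are assumptions made for the example.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value of each size x size window of the feature map."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[r * stride + dr][c * stride + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(cols)]
            for r in range(rows)]


# Toy 4x4 rectified feature map (hypothetical values).
fm = [[1, 3, 2, 0],
      [4, 6, 5, 1],
      [0, 2, 9, 8],
      [1, 1, 7, 3]]
pooled = max_pool2d(fm)
print(pooled)
```

With a 2*2 window and a stride of 2, the width and height of the map are halved while the strongest response of each neighbourhood is kept.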
2.4.2.3 The fully connected layer:[16][17]
In the end, the outputs of the feature extractor are concatenated into a single vector
(the CNN code), which is fed into the fully connected layer (multilayer perceptron). The
term « fully connected » indicates that every neuron in the previous layer is connected to
every neuron in the next layer. The outputs of the convolutional and pooling layers
represent high-level features of the input image. The purpose of the fully connected layer
is to use these features for classifying the input image into various classes based on the
training dataset.
2.4.2.4 Activation function:
The activation function is a mathematical function applied to the signal at the output of
an artificial neuron. The term comes from the biological equivalent, the « activation
potential », a stimulation threshold which, once reached, leads to a response of
the neuron. Softmax is used as the activation function of the output layer; it treats the
outputs as scores for each class. With softmax, the function mapping stays unchanged and
these scores are interpreted as the unnormalized log probabilities of each class. Softmax
is calculated as:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$
where z_j is the score of class j for the input image and K is the total number of facial
expression classes. RELU is an activation function which sets all negative values to zero.
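The two activation functions named above can be sketched as follows. The scores are hypothetical values for seven expression classes, and the subtraction of the maximum score is a common numerical-stability step that does not change the result; neither is taken from this report.

```python
import math


def relu(x):
    """Rectified linear unit: negative values become zero."""
    return max(0.0, x)


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # stability trick: softmax is unchanged by a constant shift
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output scores for 7 expression classes.
scores = [2.0, 1.0, 0.1, -1.0, 0.0, 0.5, 1.5]
probs = softmax(scores)
predicted = probs.index(max(probs))
print(predicted, round(sum(probs), 6))
```

The predicted expression is the class with the highest probability, which is also the class with the highest raw score.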
2.5 Visualisation of some CNN architectures:[5]
In recent years, CNN architectures have evolved considerably. These networks have
become so deep that it is extremely difficult to visualise the entire model.
2.5.1 LeNet-5 (1998):
LeNet-5 is one of the simplest architectures. It has 2 convolutional and 3 fully connected
layers. This architecture has about 60,000 parameters.
2.5.2 AlexNet(2012):
With 60M parameters, AlexNet has 8 layers: 5 convolutional and 3 fully connected.
Compared with LeNet-5, AlexNet essentially stacked a few more layers. At the time of its
publication, it was one of the largest convolutional neural networks trained on subsets of
ImageNet. Its authors were among the first to use RELU as an activation function.
2.5.3 VGG-16(2014):
With this architecture, we notice that CNNs were starting to get deeper and deeper.
This is because the most straightforward way of improving the performance of deep neural
networks is to increase their size. VGG-16 has 13 convolutional and 3 fully connected
layers, carrying with them the RELU tradition from AlexNet. It has 138M
parameters and takes about 500MB of storage space.
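The parameter count quoted above can be reproduced with a short calculation. The sketch below assumes the standard VGG-16 layout (3*3 convolutions with 64, 128, 256, 512 and 512 filters per block, a 224*224*3 input, and fully connected layers of 4096, 4096 and 1000 units for the ImageNet classes) and 32-bit weights for the storage estimate; the function name is hypothetical.

```python
# "M" marks a 2x2 max pooling layer; numbers are the filters of each 3x3 convolution.
CONV_CHANNELS = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                 512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]  # assumed: 1000 output classes (ImageNet)


def count_vgg16_parameters(input_size=224, input_channels=3):
    """Return (total number of weights and biases, number of convolutional layers)."""
    total, channels, size, n_conv = 0, input_channels, input_size, 0
    for item in CONV_CHANNELS:
        if item == "M":
            size //= 2  # pooling halves width and height, adds no parameters
        else:
            total += 3 * 3 * channels * item + item  # weights + biases
            channels = item
            n_conv += 1
    features = size * size * channels  # flattened input of the first dense layer
    for units in FC_SIZES:
        total += features * units + units
        features = units
    return total, n_conv


params, n_conv = count_vgg16_parameters()
size_mib = params * 4 / 2**20  # 4 bytes per 32-bit weight (assumption)
print(params, n_conv, round(size_mib))
```

Under these assumptions most of the parameters sit in the first fully connected layer, which explains why the storage size is dominated by the dense part of the network.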
2.5.4 Inception-v1(2014):
This 22-layer architecture with 5M parameters is called Inception-v1. The design of
the Inception module is a product of research on approximating sparse structures.
2.5.5 ResNet-50(2015):
In the previous CNNs, we have seen nothing but an increasing number of layers in
the design, which achieved better performance. But as the network depth increases,
accuracy gets saturated and then degrades rapidly. Researchers at Microsoft Research
addressed this problem with ResNet, using skip connections while building deeper models.
ResNet, with 26M parameters, was one of the early adopters of batch normalisation.
2.5.6 Xception(2016):
Xception is an adaptation of Inception in which the Inception modules have been replaced
with depthwise separable convolutions. It has roughly the same number of parameters
as Inception-v3 (23M).
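The saving brought by a depthwise separable convolution can be illustrated by counting weights. The layer sizes below (3*3 filters, 128 input channels, 256 output channels) are hypothetical values chosen for the example, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard convolution: one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights of a depthwise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256  # hypothetical layer sizes
standard = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)
print(standard, separable, round(standard / separable, 1))
```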
2.6 Conclusion:
In this chapter, we have presented neural networks and their different architectures.
We focused on CNNs, their structure and their different layers, and then presented a
few examples of architectures. In the next chapter, we will explain the architecture that
we have chosen for our facial expression recognition system.
Chapter 3
Facial Emotion Recognition: system design
3.1 Introduction:
Despite the notable success of traditional facial recognition methods based on the
extraction of handcrafted features, over the past decade researchers have turned to the
deep learning approach due to its high automatic recognition capacity. The goal of our
project is to use a CNN to recognize facial expressions.
3.2 System presentation:
In this section, we present our facial emotion recognition system based on CNN.
This system detects the face of a person in an image, a video sequence or a camera stream
and determines its expression, with an associated accuracy rate, among the six
universal expressions (happiness, disgust, fear, anger, sadness, surprise) and neutral.
3.2.1 General architecture:
Our research on facial expression recognition has enabled us to note that all the solutions
proposed for the recognition of emotions are structured according to the same overall
architecture, in three main modules: face detection, feature extraction and classification.
These will be the main modules of our system.
Figure 7. Facial Emotion Recognition Workflow
3.2.2 Face detection:
The effectiveness of the system depends on the method used to locate the face in the
picture. We use the Viola-Jones algorithm to detect various parts of the human face
such as the eyes, nose, eyebrows, mouth, lips and ears. This algorithm exploits
Haar-type features via a cascade classifier, which can efficiently combine many
features and determine the different filters of the resulting classifier.
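Haar-type features are differences between sums of pixels over adjacent rectangles, and the Viola-Jones detector evaluates them quickly with an integral image. The integral image is not described in this report; the sketch below is only an illustration of that standard device, and the function names and the toy image are assumptions made for the example.

```python
def integral_image(img):
    """ii[r][c] = sum of all pixels above and to the left of (r, c), exclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels of a rectangle, obtained from four table look-ups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))


# Toy image with a dark left half and a bright right half (hypothetical values).
img = [[1, 1, 5, 5],
       [1, 1, 5, 5],
       [1, 1, 5, 5],
       [1, 1, 5, 5]]
ii = integral_image(img)
print(rect_sum(ii, 0, 0, 4, 4), two_rect_feature(ii, 0, 0, 4, 4))
```

A large absolute value of the feature signals a strong contrast between the two halves of the window, which is the kind of cue the cascade combines to accept or reject a candidate region.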
Figure 8. Viola-Jones algorithm
3.2.3 Facial features extraction:
Once the face is detected, the system starts the feature extraction process, which
converts the pixel data into a smaller representation to be used in the classification
process. This step reduces the size of the input image while keeping the most useful data.
This process is based on a convolutional neural network (CNN).
3.3 CNNs model presentation:
To evaluate the performance of different models for facial emotion recognition, we
developed in this project several CNN models with variable depth. Our first convolutional
neural network model has two convolutional layers followed by two fully connected layers.
Each convolutional block comprises a convolution layer, batch normalization, RELU, dropout
and max pooling. In the first block, we have 64 (3*3) filters along with batch
normalization, RELU, dropout and max pooling with a 2*2 window. In the second
convolutional block, we have 128 (3*3) filters. The fully connected part has two hidden
layers, with 256 neurons then 512 neurons, and softmax as the output activation function.
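To visualize how the data shrink through this first model, the following sketch traces the tensor shapes layer by layer. It is an illustration under stated assumptions, not the implementation used in this project: a 48*48 grayscale input (FER2013), "same" padding for the 3*3 convolutions, a 2*2 max pooling after each convolution, and 7 output classes. Batch normalization, RELU and dropout are omitted because they do not change the shape.

```python
def conv_same(shape, filters):
    """3x3 convolution with 'same' padding: spatial size kept, depth = number of filters."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, size=2):
    """2x2 max pooling: width and height divided by the window size."""
    height, width, depth = shape
    return (height // size, width // size, depth)


shape = (48, 48, 1)  # assumed input: one 48*48 grayscale face
trace = [("input", shape)]
for name, filters in [("conv1", 64), ("conv2", 128)]:
    shape = conv_same(shape, filters)
    trace.append((name, shape))
    shape = max_pool(shape)
    trace.append((name + "_pool", shape))

flattened = shape[0] * shape[1] * shape[2]  # size of the vector fed to the dense layers
dense_units = [256, 512, 7]                 # two hidden layers, then 7 classes (assumed)

for name, s in trace:
    print(name, s)
print("flatten", flattened, "dense", dense_units)
```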
Figure 9. model with two convolutional layers
In the second architecture, we focus on the impact of adding more layers to
our system. We added two convolutional layers to our network, with 256 (3*3) filters then
512 (3*3) filters.
3.4 VGG16 architecture presentation:
VGG16 is a convolutional neural network (CNN) architecture whose name refers to the
Visual Geometry Group, which developed it. This network facilitates the recognition of
objects based on the output probabilities of the different classes to which an image could
belong. This architecture contains 13 convolutional layers, 5 max pooling layers and 3
fully connected layers. It takes an image of size 224*224*3 as input and uses only 3*3
convolutions and 2*2 pooling.
Figure 10. VGG16 architecture
This model has demonstrated that the depth of the network is beneficial for the
classification accuracy. However, the VGG16 network has two major drawbacks:
• It is slow to train.
• Due to its depth and its fully connected layers, VGG16 exceeds the size of 533
MB.
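The order of magnitude of this size can be checked by counting the weights of the network. The sketch below assumes the standard VGG16 configuration (filter counts 64, 128, 256, 512, 512 per block, two fully connected layers of 4096 units and a 1000-class ImageNet output) and 32-bit floating point weights. The function name vgg16_parameter_count is hypothetical.

```python
# Standard VGG16 configuration: number of filters per 3*3 convolution,
# with "M" marking a 2*2 max pooling layer.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_parameter_count(input_size=224, channels=3, num_classes=1000):
    """Count weights and biases of the convolutional and dense layers."""
    total = 0
    size = input_size
    for item in VGG16_CONFIG:
        if item == "M":
            size //= 2  # pooling halves height and width
        else:
            total += 3 * 3 * channels * item + item
            channels = item
    features = size * size * channels  # 7 * 7 * 512 for a 224*224 input
    for units in (4096, 4096, num_classes):
        total += features * units + units
        features = units
    return total


params = vgg16_parameter_count()
size_mb = params * 4 / 1e6  # 4 bytes per float32 weight
print(params, round(size_mb, 1))
```

With these assumptions the count is about 138 million weights, most of them in the first fully connected layer, which corresponds to roughly 550 MB of 32-bit values.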
3.5 Architecture of Facial emotion recognition:
Our facial emotion recognition system is adapted from VGG16, which has shown excellent
performance in many computer vision tasks.
3.6 Conclusion:
In this chapter we have introduced two models (a CNN network and the VGG16 network).
These architectures are used for the classification of images, and each
architecture has its own specific characteristics. In the next chapter, we will present the
implementation of these two models and test them in order to assess the performance
of a deep learning FER system, and we will compare the different results.
Chapter
4
Implementation and realisation:
4.1 Introduction:
The goal of our project is to design and implement an application that
recognizes facial expressions. We are interested in the application of deep learning models.
In this chapter, we will present the implementation of the different models, the
development environment and the various tools used. Finally, we will present the
results obtained.
4.2 Software and tools used for implementation:
4.2.1 Python 3.7:
4.2.2 OpenCV:
4.2.3 TensorFlow:
4.2.4 Keras:
4.2.5 NumPy:
4.2.6 Sklearn:
4.2.7 Matplotlib:
4.3 Database:
For the best performance, we should train the network with a large number of sample images. This
would increase the accuracy and improve the performance of the model. Unfortunately,
very large datasets are not publicly available, but we have access to two
public databases (FER2013 and CK+). For our system, we will use the FER2013 database.
4.4 Implementation:
The facial emotion recognition system consists of two modules: the first one is for
face detection and the second one is for emotion recognition. In the following, we will
detail each module.
In this stage, we chose the method proposed by Paul Viola and Michael Jones.
In our system, we opted for Haarcascade.eye.xml from the OpenCV library, which provides
the Haar cascade method.
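The Viola and Jones detector relies on an integral image, which lets any rectangular sum of pixels, and therefore any Haar-like feature, be evaluated with a few table lookups. The sketch below is a pure-Python illustration of that idea on a toy patch; it is not the OpenCV implementation used in our system, and the function names are hypothetical.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) summed-area table with a zero first row/column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))


# Toy 4*4 patch: bright left half, dark right half.
patch = [[9, 9, 1, 1]] * 4
ii = integral_image(patch)
print(rect_sum(ii, 0, 0, 4, 4), two_rect_feature(ii, 0, 0, 4, 4))
```

A large feature value signals a strong vertical edge in the patch. A cascade combines many such features, evaluated at several positions and scales, to decide whether a window contains a face.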
4.4.2 Recognition method:
We will use a convolutional neural network to recognize the facial emotion of the
detected faces. The first thing to consider is the architecture chosen and the number
of layers implemented in this architecture. In our project, we will use two different
architectures and compare their performance. The first
architecture chosen is VGG16, a simple architecture that is easy to implement and handle in
Python. The second one was inspired by VGG16, but we reduced the number of layers.
Our system contains several modules; in the following, we will specify the role and
purpose of each module.
4.4.3 Libraries:
In this module, we imported all the libraries we needed during the training phase.
Figure 11. Libraries
4.4.4 Loading data and data augmentation:
We applied data augmentation to our data because we needed to add more data to our
training set.
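The principle of data augmentation can be illustrated with two simple transformations. The exact transformations applied in our pipeline are not listed here, so the horizontal flip and the one-pixel shift below are assumptions chosen for illustration, and the function names are hypothetical. Images are represented as plain lists of pixel rows.

```python
def horizontal_flip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def shift_right(img, pixels=1, fill=0):
    """Shift the image to the right, padding the left border with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in img]


def augment(img):
    """Return the original image plus two transformed copies."""
    return [img, horizontal_flip(img), shift_right(img)]


face = [[1, 2, 3],
        [4, 5, 6]]
for variant in augment(face):
    print(variant)
```

Each transformed copy keeps the label of the original image, so the training set grows without any new annotation effort.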
Figure 12. Loading data
4.4.4.1 Data load:
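As an illustration of this step, FER2013 is distributed as a CSV file with the columns emotion, pixels and Usage, where pixels holds the 48*48 gray levels of one face separated by spaces. The sketch below assumes that layout and parses a small in-memory stand-in rather than the real file. The function name parse_fer2013 and the scaling to the range 0 to 1 are assumptions, not a description of our actual loading code.

```python
import csv
import io

IMAGE_SIZE = 48  # FER2013 faces are 48*48 grayscale images


def parse_fer2013(csv_text):
    """Yield (label, image, usage) triples; image is a list of pixel rows."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        pixels = [int(p) / 255.0 for p in row["pixels"].split()]
        image = [pixels[i:i + IMAGE_SIZE]
                 for i in range(0, len(pixels), IMAGE_SIZE)]
        yield int(row["emotion"]), image, row["Usage"]


# Tiny in-memory stand-in for the real file (one all-gray training image).
sample = "emotion,pixels,Usage\n3," + " ".join(["128"] * 48 * 48) + ",Training\n"
samples = list(parse_fer2013(sample))
label, image, usage = samples[0]
print(label, len(image), len(image[0]), usage)
```

The Usage column can then be used to separate the training images from the test images.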
4.4.5 CNN architecture in KERAS:
4.4.5.1 VGG16 architecture:
This architecture consists of eleven layers in all (8 convolutional layers and 3
fully connected layers).
Figure 13. Vgg network in keras
We applied batch normalization after every convolutional layer, followed by ReLU (rectified
linear unit) to eliminate the negative values. At the end, we added a global
average pooling layer, then applied the softmax function to compute the probabilities of the 7
classes of expressions. We observed that increasing the number of layers further decreases the accuracy.
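The last two operations, global average pooling followed by softmax, can be sketched in a few lines. The example assumes that the final convolutional stage outputs one feature map per class and uses made-up 2*2 maps. The label order follows the FER2013 convention, and the function names are hypothetical.

```python
import math

# FER2013 label order (index 0 to 6).
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def global_average_pooling(feature_maps):
    """Reduce each 2-D feature map to its mean value."""
    return [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Seven made-up 2*2 feature maps, one per class.
maps = [[[v, v], [v, v]] for v in (0.1, 0.0, 0.3, 2.0, 0.5, 0.2, 1.0)]
probs = softmax(global_average_pooling(maps))
best = EMOTIONS[probs.index(max(probs))]
print(best, round(sum(probs), 6))
```

The predicted expression is the class with the highest probability.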
4.4.5.2 Convolutional neural network architecture:
Figure 14. CNN network in keras
4.4.6 Optimization:
When we train this model, we are essentially solving an optimization problem. We
are trying to optimize weights that are initialized arbitrarily. Our task is to find the weights that
most accurately map our input data to the correct output class. During training, these
weights are updated and saved in an .h5 file. The weights are optimized using an optimization
algorithm. In our implementation we used the SGD (stochastic gradient descent) optimizer.
Figure 15. Optimization in keras
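The update rule behind SGD can be shown on a toy problem. The sketch below fits a single weight to data generated by y = 2x, updating the weight after every sample. The learning rate, the squared-error loss and the function names are assumptions made for the example; in our system the update is handled by the Keras optimizer.

```python
def sgd_step(weights, grads, learning_rate):
    """One SGD update: move each weight against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]


def loss_and_grad(w, x, y):
    """Squared error of a one-weight linear model on a single sample."""
    error = w[0] * x - y
    return error ** 2, [2 * error * x]


# Toy data generated by y = 2x; the weight should converge towards 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = [0.0]
for epoch in range(50):
    for x, y in data:  # one sample at a time: the "stochastic" part
        _, grad = loss_and_grad(w, x, y)
        w = sgd_step(w, grad, learning_rate=0.05)
print(round(w[0], 4))
```

A real training loop would also shuffle the samples at each epoch and work on mini-batches rather than single samples.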
4.4.7 Learning :
The next step is to train the network on the given data. To train the network we use the « fit() »
function.
Figure 16. Learning Phase
An epoch refers to a single pass of the entire dataset through the network during
training.
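The link between epochs, batch size and the number of weight updates can be made concrete with a short calculation. The training split of FER2013 contains 28709 images. The batch size and the number of epochs below are assumed values chosen for illustration, not the settings of our experiments.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


TRAIN_SAMPLES = 28709  # size of the FER2013 training split
BATCH_SIZE = 64        # assumed value, not taken from our experiments
EPOCHS = 30            # assumed value, not taken from our experiments

steps = steps_per_epoch(TRAIN_SAMPLES, BATCH_SIZE)
print(steps, steps * EPOCHS)
```

With these values one epoch corresponds to 449 updates, the last batch being smaller than the others.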
4.5 Results:
4.5.1 Performances:
Figure 17. Performance VGG16
Figure 18. Performance VGG16
The second figure shows the difference between the two architectures.
Figure 19. CNN accuracy
If we add a convolutional layer to this model, the accuracy decreases.
4.5.2 Confusion matrix:
A confusion matrix is used to find which emotions are usually confused with each other.
Figure 20. Confusion matrix
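A confusion matrix can be built directly from the true and predicted labels. The sketch below uses made-up labels for three classes only, and the helper names are hypothetical. It is meant to show how the matrix is read, not to reproduce our results.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def most_confused_pair(matrix):
    """Return the (true, predicted) pair with the most off-diagonal errors."""
    pairs = [(matrix[i][j], i, j)
             for i in range(len(matrix))
             for j in range(len(matrix)) if i != j]
    _, i, j = max(pairs)
    return i, j


# Made-up labels for three classes: 0 = fear, 1 = sad, 2 = happy.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm, most_confused_pair(cm))
```

The diagonal holds the correct predictions, while each off-diagonal cell counts how often one emotion was mistaken for another.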
4.6 Discussion
The performance of the VGG16 model obtained on the FER2013 dataset is relatively
low (58 percent) compared to the performance of the second architecture (72 percent), which
we attribute to the number of parameters trained in the second architecture.
Figure 21. Differences
4.7 Conclusion:
In this chapter, we have presented a facial expression recognition system based on CNNs.
We presented the results obtained for each architecture. This system has been tested on
the FER2013 database from Kaggle.
Chapter
5
Conclusion
Applying deep learning methods to emotion recognition is still a challenge, due to the
variation of gesture, pose and emotions detected in real time. The long-term goal is to
develop a complete system that encompasses speech, gesture and facial expressions. To
do so, the collection of data for this work requires more effort than for some other tasks
such as object recognition.
Appendix
Webography
[3] Latex @ Wikipedia. url: https://www.kairos.com/blog/the-universally-recognized-facial-expressions-of-emotion (visited on 0004–2016).
[4] ENIT. url: http://www.enit.rnu.tn/site/enit_fr/ (visited on 0004–2016).
Bibliography
[1] Andoni Beristain. “Emotion recognition based on the analysis of Facial expressions :
a survey”. In: New Mathematics and Natural Computation 05(02):513-534 05.6 (July
2009), pp. 513–534.
[2] Charles Bazerman et al. Emotion recognition based on the analysis of Facial
expressions : a survey. Vol. 356. University of Wisconsin Press Madison, 1988.
[5] Ashraf Aboulnaga, Alaa R Alameldeen, and Jeffrey F Naughton. “Estimating the
selectivity of XML path expressions for internet scale applications”. In: VLDB. Vol. 1.
2001, pp. 591–600.