
Republic of Tunisia
Ministry of Higher
Education, Scientific
Research and
Information and
Communication
Technologies
National Engineering School of Tunis, Tunis El Manar University
Master Project
Report
presented at
National Engineering School of Tunis

(LR-SITI-ENIT)
in order to obtain the
Master degree in Systems, Signals and Data
by
Awatef MESSAOUDI
Facial Emotion Recognition based on CNN
Defended on 18/12/2020 in front of the committee composed of
Mr F F President
Mr Zied LACHIRI Supervisor
Mr F F Reviewer
Dedication
I dedicate this work to my parents who have provided me with their encouragement,
love and understanding.
To all who were there for me, thank you for your help and encouragement.
To all of you,
I dedicate this work.
Awatef MESSAOUDI
Acknowledgements
First and foremost, I would like to express my infinite gratitude and respect to my
supervisor, Mr Zied LACHIRI.
I am also grateful to all my teachers, without whom this work would not have been
possible.
I will not forget, of course, to express my gratitude to my Master's colleagues who
kindly agreed to cooperate. Most of all, I am thankful to my family for their
exceptional support through all challenges.
CONTENTS Awatef MESSAOUDI
Contents
Dedication
Acknowledgements
Contents
List of Figures
Acronyms
Introduction
1 Facial Emotion Recognition (FER): state of the art
1.1 Introduction:
1.2 Facial expressions and emotions:
1.2.1 Definitions:
1.2.2 The universal facial expressions:[3]
1.2.3 Coding systems:
1.2.4 Areas of application of FER:[7]
1.3 Architecture of Facial expression recognition:
1.3.1 Face detection:
1.3.2 Feature extraction:
1.3.3 Emotion recognition:
1.3.4 Facial expression databases:[7]
1.3.5 Machine learning:
1.3.6 Deep learning:
1.4 Conclusion:
2 Deep learning
2.1 Introduction:
2.2 Machine learning vs Deep learning:
2.3 Artificial neural network:[11]
2.4 Convolutional neural network CNN:
2.4.1 Presentation:
2.4.2 Architecture:[13][12]
2.5 Visualisation of some CNN architectures:[8]
2.5.1 LeNet-5 (1998):
2.5.2 AlexNet(2012):
2.5.3 VGG-16(2014):
2.5.4 Inception-v1(2014):
2.5.5 ResNet-50(2015):
2.5.6 Xception(2016):
2.6 Conclusion:
3 Facial Emotion Recognition (FER): system design
3.1 Introduction:
3.2 System presentation:
3.2.1 General architecture:
3.2.2 Face detection:
3.2.3 Facial features extraction:
3.3 Conclusion:
4 Implementation and results:
4.1 Introduction:
4.2 Software and tools used for implementation:
4.2.1 Python3.7:[18]
4.2.2 OpenCv:[19]
4.2.3 Tensorflow:[20]
4.2.4 Keras:
4.2.5 Numpy:[21]
4.2.6 Sklearn:
4.2.7 Matplotlib:[21]
4.3 Database:
4.4 Implementation:
4.4.1 Import libraries:
4.4.2 Data download:
4.4.3 Build the CNN model:
4.4.4 Training phase:
4.4.5 Face detection:
4.5 Results:
4.5.1 CNN architecture results:
4.5.2 VGG16 network result:
4.6 Discussion
4.7 Conclusion:
5 Conclusion
Webography
Bibliography
LIST OF FIGURES Awatef MESSAOUDI
List of Figures
1 The six universal emotions
2 MPEG4 model
3 Candide Model
4 Machine learning vs Deep learning
5 Artificial neural network architecture
6 Architecture for a convolutional neural network
7 Facial Emotion Recognition Workflow
8 Viola-Jones algorithm
9 Model with two convolutional layers
10 VGG16 architecture
11 Libraries and APIs imported from python
12 Facial Emotion distribution
13 Emotion distribution
14 Load data and data augmentation
15 Convolutional neural network 2 layers in keras
16 Convolutional neural network 4 layers in keras
17 VGG16 network in keras
18 Training phase in keras
19 Face detection in keras
20 Training and validation accuracy and loss graph (Lr=0.0001)
21 Training and validation accuracy and loss graph (Lr=0.005)
22 Plot between accuracy and validation accuracy
23 CNN’s Confusion matrix
24 CNN’s Confusion matrix
25 VGG16’s Confusion matrix
Acronyms
AU Action unit
CNN Convolutional neural network
FACS Facial Action Coding System
FER Facial Emotion Recognition
FFP Facial Feature Points
JAFFE Japanese Female Facial Expression
LDA Linear Discriminant Analysis
LBP Local Binary Patterns
SVM Support Vector Machine
INTRODUCTION Awatef MESSAOUDI
Introduction
Automatic recognition of human emotion has been an active research area for decades, with
a growing range of applications including avatar animation, neuro-marketing and social
robots. Emotion recognition is a challenging task because it involves predicting abstract
emotional states from multi-modal input data. These modalities include video, audio and
physiological signals. Just like the tone of voice, facial expressions play an important
role in recognizing emotions and in identifying people, and they act as a form of
non-verbal communication. The face has an important role in daily emotional
communication. It is one of the most informative channels because the human face is the
most exposed part of the body. Facial emotion recognition is still one of the most
challenging tasks in computer vision. In recent years, with the evolution of deep learning
methods, researchers have introduced end-to-end frameworks for facial expression
recognition using deep learning models. In this preliminary study, we propose a simple
solution for facial expression recognition based on deep learning methods. We first
introduce the state of the art of facial expression recognition, focusing on notions such
as emotions, facial expressions and the six universal emotions. The first chapter is an
introduction to facial expression recognition and to related terms such as face detection,
feature extraction, machine learning and deep learning. In the second chapter, we focus on
deep learning and its building blocks, with a simple comparison with machine learning. The
third chapter presents the architecture of two different deep learning networks; we study
their building blocks and the FER2013 data set used in our work. Finally, the last chapter
covers the implementation and the results: we discuss the results of the presented work
and draw conclusions.
CHAPTER 1. FER: STATE OF THE ART Awatef MESSAOUDI
Chapter 1
Facial Emotion Recognition (FER): state of the art
1.1 Introduction:
Due to the important role of facial expressions in human interaction, the ability to
perform facial expression recognition automatically via computer vision enables a range
of applications such as human-computer interaction and data analytics. In this chapter,
we present some notions about emotions and different coding theories, as well as the
architecture of facial expression recognition. We present some approaches that help us to
recognize facial expressions and we end the chapter with different machine learning
techniques.
1.2 Facial expressions and emotions :
1.2.1 Definitions:
1.2.1.1 Emotions:[1]
Emotion is expressed through many channels such as body position, voice and facial
expressions. It is a mental and physiological state which is subjective and private. It
involves many behaviours, actions, thoughts and feelings.
Scherer proposes the following definition: « emotion is defined as an episode of
interrelated, synchronized changes in the states of all or most of the five organismic
subsystems in response to the evaluation of an external or internal stimulus event as
relevant to major concerns of the organism »[2].
1.2.1.2 Facial expressions:
A facial expression is a meaningful movement of the face. Its meaning can be the
expression of an emotion, a semantic cue or an intonation in sign language. The
interpretation of a set of muscle movements as an expression depends on the context of
the application. For example, in a Human-Machine interaction application where we want an
indication of the emotional state of an individual, we will try to classify the
measurements in terms of emotions.
1.2.2 The universal facial expressions:[3]
Charles DARWIN wrote in his 1872 book « The Expression of the Emotions in Man and
Animals » that facial expressions of emotion are universal, not learned differently
in each culture. Several studies since have attempted to classify human emotions and
to demonstrate how the face can give away a person's emotional state.
In 1971, Ekman and Friesen defined six basic emotions based on a cross-cultural study,
which indicated that humans perceive certain basic emotions in the same way regardless of
culture. These prototypical facial expressions are anger, disgust, fear, happiness,
sadness and surprise.
Figure 1. The six universal emotions
1.2.3 Coding systems:
Facial expressions are a consequence of the activity of the facial muscles, also called
mimetic muscles or muscles of facial expression. The study of facial expressions cannot
be done without studying the anatomy of the face and the underlying structure of the
muscles. That is why some researchers have focused on coding systems for facial
expressions. Several systems have been proposed, such as Ekman's: in 1978, Ekman
developed a tool for coding facial expressions that is still widely used today. We
present some of these systems below.
1.2.3.1 Facial Action Coding System (FACS):[4]
The Facial Action Coding System (FACS) is a comprehensive, anatomically based system
for describing all visually discernible facial movement. It breaks down facial expressions
into individual components of muscle movement, called Action Units (AU). FACS is very
successful but it suffers from some drawbacks:
• Complexity: it takes 100 hours of learning to master the main concepts.
• Difficulty of handling by a machine: FACS was created for psychologists, and some
measurements remain vague and difficult for a machine to assess.
• Lack of precision: the transition between two states of a muscle is represented
linearly, which is an approximation of reality.
1.2.3.2 MPEG4:[5]
The MPEG4 video encoding standard includes a model of the human face developed by the
Face and Body Ad Hoc Group. It is a 3D model built on a set of facial attributes called
Facial Feature Points (FFP). Measurements are used to describe muscle movements (Facial
Animation Parameters, the equivalents of Ekman's Action Units).
Figure 2. MPEG4 Model
1.2.3.3 Candide:[6]
Candide is a model of the face containing 75 vertices and 100 triangles. It is composed
of a generic face model and a set of parameters (Shape Units). These parameters are used
to adapt the generic model to a particular individual. They represent the differences
between individuals and are 12 in number:
1. head height.
2. vertical position of the eyebrows.
3. vertical eye position.
4. eye width.
5. eye height.
6. eye separation distance.
7. depth of the cheeks.
8. depth of the nose.
9. vertical position of the nose.
10. degree of the curvature of the nose.
11. vertical position of the mouth.
12. width of the mouth.
Figure 3. Candide Model
1.2.4 Areas of application of FER:[7]
Automatic facial expression recognition systems have many applications, including
understanding human behavior and detecting mental disorders.
FER has become a research field involving many scientists specializing in different areas
such as artificial intelligence, computer vision, psychology, physiology, education and
website customization.
1.3 Architecture of Facial expression recognition:
A system that performs automatic recognition of facial expressions consists of three
modules. The first one detects and registers the face in the input image or image
sequence. It can be a detector that finds the face in each image, or one that detects the
face in the first image and then tracks it through the rest of the video sequence. The
second module extracts and represents the facial changes caused by facial expressions.
The last one determines the similarity between the set of extracted characteristics and a
set of reference characteristics. Other filters or data preprocessing modules can be used
between these main modules to improve the results of detection, feature extraction or
classification.
1.3.1 Face detection:
Face detection consists of determining the presence or absence of faces in a picture. It
is a preliminary task required by most face analysis techniques. The techniques used come
from the field of pattern recognition. There are several techniques for detecting the
face; we mention the most widely used:
• Automatic face processing: a method that characterizes faces by the distances and
proportions between particular points around the eyes, the nose and the corners of the
mouth. It is not effective when the light is low.
• Eigenfaces: an effective characterization method for face processing tasks such as
face detection and recognition. It is based on representing face features from grayscale
model images.
• Linear Discriminant Analysis (LDA): it is based on predictive discriminant analysis.
It explains and predicts the membership of an individual in a predefined class from
measured characteristics used as predictor variables.
• Local Binary Patterns (LBP): the technique divides the face into square sub-regions of
equal size in which the LBP features are computed. The resulting vectors are concatenated
to obtain the final feature vector.
• Haar filter: this face detection method uses a multi-scale Haar filter. The
characteristics of a face are described in an XML file.
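The LBP description in the list above can be illustrated with a minimal pure-Python sketch. It assumes the basic 8-neighbour variant (each neighbour is compared with the centre pixel and contributes one bit) and a 256-bin histogram per square cell; the function names, the neighbour order and the cell size are illustrative choices, not taken from this report.

```python
# Neighbour offsets visited clockwise from the top-left pixel (an assumed convention).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """Return the 8-bit LBP code of the interior pixel (r, c) of a grayscale image."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_feature_vector(img, cell):
    """Concatenate the 256-bin LBP histograms of square cells of side `cell`."""
    rows, cols = len(img), len(img[0])
    feature = []
    # Border pixels are skipped because they lack a full 3x3 neighbourhood.
    for r0 in range(1, rows - 1, cell):
        for c0 in range(1, cols - 1, cell):
            hist = [0] * 256
            for r in range(r0, min(r0 + cell, rows - 1)):
                for c in range(c0, min(c0 + cell, cols - 1)):
                    hist[lbp_code(img, r, c)] += 1
            feature.extend(hist)
    return feature


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
code = lbp_code(patch, 1, 1)
```

In this toy patch only the neighbours 9, 7 and 8 are at least as large as the centre value 6, so only their three bits are set.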
1.3.2 Feature extraction:
The characteristics of the face are mainly located around facial components such as the
eyes, mouth, eyebrows, nose and chin. The detection of the characteristic points of the
face starts from the rectangular box returned by the detector that locates the face. The
extraction of geometric features, such as the contours of facial components and facial
distances, provides the location or appearance of the characteristics. There are
therefore two types of approaches:
1.3.2.1 The geometric characteristics:
These characteristics represent the shape and location of the components of the face
(including the mouth, eyes, eyebrows and nose). The facial components or facial features
are extracted to form a feature vector representing the geometry of the face.
1.3.2.2 The characteristics of appearance:
These characteristics represent changes in the appearance of the face, such as wrinkles
and furrows. With these methods, the effects of head rotation and of different face
scales can be eliminated by a normalization before the feature extraction step or by a
representation of the features before the expression recognition step.
1.3.3 Emotion recognition:
Existing research can be divided into three groups: global approaches, local approaches
and hybrid approaches. Each approach has advantages and disadvantages related to
environmental conditions, image orientation, head position, etc.
1.3.3.1 Global approach:
These approaches are independent of head position (top, bottom) and face image
orientation. They are effective but require a heavy learning phase, and the result
depends on the number of samples used.
1.3.3.2 Local approach:
These approaches are based on the detection of facial objects and are robust to changes
in luminance. The position and orientation of the head can, however, cause failures in
the system.
1.3.3.3 Hybrid approach:
The alternative is to combine the two approaches (local and global) in order to take
advantage of both. The recognition phase in such a system is based on machine learning
theory: a feature vector is formed to describe the facial expression, and the first stage
of the classifier is learning. Classifier training consists of labeling the images after
detection; once the classifier is trained, it can recognize input images. Classification
methods can be divided into two groups:
• Recognition based on static data, which only concerns images.
• Recognition based on dynamic data, which concerns image sequences or videos.
Various classifiers have been applied, such as neural networks, Bayesian networks and the
Support Vector Machine (SVM).
1.3.4 Facial expression databases:[7]
Having sufficient labeled training data that include as many variations of the populations
and environments as possible is important for the design of a deep expression recognition
system. We will introduce some databases that contain a large amount of affective images
collected from the real world to benefit the training of deep neural networks.
1.3.4.1 CK+:
The Extended Cohn-Kanade (CK+) database is the most extensively used
laboratory-controlled database for evaluating FER systems. CK+ contains 593 video
sequences from 123 subjects. The sequences vary in duration from 10 to 60 frames and show
a shift from a neutral facial expression to the peak expression. Among these videos, 327
sequences from 118 subjects are labeled with seven basic expression labels (anger,
contempt, disgust, fear, happiness, sadness and surprise) based on the Facial Action
Coding System (FACS). Because CK+ does not provide specified training, validation and
test sets, the algorithms evaluated on this database are not uniform.
1.3.4.2 MMI:
This database is laboratory-controlled and includes 326 sequences from 32 subjects. A
total of 213 sequences are labeled with the six basic expressions, and 205 sequences are
captured in frontal view. In contrast to CK+, sequences in MMI are onset-apex-offset
labeled: the sequence begins with a neutral expression, reaches the peak near the middle
and then returns to the neutral expression.
1.3.4.3 Japanese Female Facial Expression (JAFFE):
The Japanese Female Facial Expression database is a laboratory-controlled image database
that contains 213 samples of posed expressions from 10 Japanese women. Each person has
3 or 4 images for each of the six basic facial expressions (anger, disgust, fear,
happiness, sadness and surprise) and one image with a neutral expression. The database is
challenging because it contains few examples per subject and expression.
1.3.4.4 FER-2013:
This database was introduced during the ICML 2013 Challenges in Representation Learning.
FER-2013 is a large-scale and unconstrained database collected automatically with the
Google image search API. All images have been registered and resized to 48×48 pixels
after rejecting wrongly labeled frames and adjusting the cropped region. FER-2013
contains 28,709 training images, 3,589 validation images and 3,589 test images with seven
expression labels (anger, disgust, fear, happiness, sadness, surprise and neutral).
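As an illustration of how such 48×48 samples can be read, here is a minimal sketch. It assumes the commonly distributed CSV layout of FER-2013, with the columns `emotion`, `pixels` (2304 space-separated gray levels) and `Usage`, and the label order 0 to 6 shown in `EMOTIONS`; both the layout and the helper names are assumptions to check against the actual file.

```python
import csv
import io

# Assumed label order of the public CSV release (index = value of the `emotion` column).
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]
SIDE = 48  # images are 48x48 pixels


def parse_row(row):
    """Turn one CSV row into (image, label, usage); image is a 48x48 list of floats in [0, 1]."""
    values = [int(v) for v in row["pixels"].split()]
    if len(values) != SIDE * SIDE:
        raise ValueError("expected %d pixel values, got %d" % (SIDE * SIDE, len(values)))
    image = [[v / 255.0 for v in values[r * SIDE:(r + 1) * SIDE]] for r in range(SIDE)]
    return image, EMOTIONS[int(row["emotion"])], row["Usage"]


def load_rows(text_stream):
    """Parse every row of an open CSV text stream."""
    return [parse_row(row) for row in csv.DictReader(text_stream)]


# Synthetic one-row file standing in for the real data set.
fake_csv = "emotion,pixels,Usage\n3," + " ".join(["255"] * (SIDE * SIDE)) + ",Training\n"
samples = load_rows(io.StringIO(fake_csv))
```

With the real file, `load_rows(open(path, newline=""))` would be used instead of the synthetic stream, where `path` is wherever the CSV was saved.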
1.3.5 Machine learning:
Machine learning is one of the most exciting areas of technology at the moment. Every day
we see stories that herald new breakthroughs in facial recognition technology,
self-driving cars or computers that can hold a conversation just like a person. Machine
learning technology is set to revolutionise almost every area of human life and work. The
primary reason for using machine learning is to automate complex tasks and to analyze
data of great variety and complexity.
1.3.6 Deep learning:
Deep learning, or deep machine learning, is a branch of machine learning that takes data
as input and makes intuitive and intelligent decisions using an artificial neural network
stacked layer-wise. It is being applied in various domains for its ability to find
patterns in data, extract features and generate intermediate representations.
1.4 Conclusion:
Deep learning, or deep machine learning, is a branch of machine learning that takes data
as input and makes intuitive and intelligent decisions using an artificial neural network
stacked layer-wise. It is being applied in various domains for its ability to find
patterns in data, extract features and generate intermediate representations.
CHAPTER 2. DEEP LEARNING Awatef MESSAOUDI
Chapter 2
Deep learning
2.1 Introduction:
Deep learning is a subset of machine learning that uses neural networks to analyze
different factors with a structure similar to the human neural system. It uses complex
multi-layered neural networks, where the level of abstraction increases gradually through
nonlinear transformations of the input data.[8] It concerns algorithms inspired by the
structure and function of the brain. They can learn several levels of representation in
order to model complex relationships between data.
2.2 Machine learning vs Deep learning:
Machine learning algorithms work well for a wide variety of problems. However, they have
failed to solve some major AI problems such as speech, face and emotion recognition.
Figure 4. Machine learning vs Deep learning.
A machine learning method includes the following four steps:
• Feature engineering: choose the attributes (features) that serve as a basis for
prediction.
• Choose the appropriate machine learning algorithm (such as a classification or
regression algorithm).
• Train the model and evaluate its performance (for different algorithms, evaluate and
select the best-performing model).
• Use the trained model to classify or predict unknown data.[9]
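The four steps can be sketched on a toy problem. The example below is only illustrative: the two hand-crafted features (hypothetical normalized mouth-opening and eyebrow-raise measurements), the data values and the choice of a nearest-centroid classifier are all assumptions made for the sake of a short, dependency-free example.

```python
def train_centroids(samples, labels):
    """Step 3 (training): average the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Step 4 (prediction): pick the class whose centroid is closest to x."""
    def squared_distance(y):
        return sum((a - b) ** 2 for a, b in zip(centroids[y], x))
    return min(centroids, key=squared_distance)


# Step 1 (feature engineering): each sample is [mouth_opening, eyebrow_raise].
train_x = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
train_y = ["surprise", "surprise", "neutral", "neutral"]
test_x = [[0.85, 0.7], [0.15, 0.25]]
test_y = ["surprise", "neutral"]

# Step 2 (algorithm choice) is the nearest-centroid rule; step 3 trains and evaluates it.
centroids = train_centroids(train_x, train_y)
correct = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
```

The quality of the result depends entirely on how well the two chosen features separate the classes, which is the dependence on expert-designed features discussed in this section.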
Most features must be determined by an expert and then encoded as a data type. Features
can be pixel values, shapes, etc. The performance of machine learning algorithms depends
on the accuracy of the extracted features. Deep learning reduces the task of developing
new feature extractors by automating the phase of extracting and learning features.[10]
Deep learning uses neural networks to learn representations of characteristics directly
from data.
2.3 Artificial neural network:[11]
An artificial neural network (ANN) is a computing model that tries to mimic the human
brain in a very primitive way, in order to emulate human capabilities in a very limited
sense. ANNs have been developed as a generalization of mathematical models of human
cognition or neural biology. An ANN takes an input vector X and produces an output vector
Y; the relationship between X and Y is determined by the network architecture.[12] An
ANN is a parallel, distributed information processing network. It consists of a number of
information processing elements, called neurons or nodes, which are grouped in layers.
The processing elements of the input layer receive the input vector and transmit the
values across connections to the next layer of processing elements, where the process
continues. This type of network, in which data flows one way (forward), is known as a
feedforward network. A feedforward ANN has an input layer, an output layer and one or
more hidden layers between them. Each neuron in a layer is connected to all the neurons
of the next layer, and the neurons in one layer are connected only to the neurons of the
immediately following layer. The strength of the signal passing from one neuron to
another depends on the weight of the interconnection. The hidden layers enhance the
network’s ability to model complex functions. In the cited study, a back-propagation
artificial neural network (BPANN) model was compared with a linear transfer function
(LTF) model and was found to be superior.
Figure 5. Artificial neural network architecture.
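The forward flow of data through such a feedforward network can be sketched in a few lines. This is a minimal illustration, assuming a 2-3-1 layer layout, a sigmoid activation and arbitrary weight values; none of these choices come from this report.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every neuron sees every input value."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def feedforward(x, layers):
    """Propagate the input vector X through the layers to get the output vector Y."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Illustrative weights: a hidden layer of 3 neurons, then an output layer of 1 neuron.
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, 0.0, 0.0], sigmoid),
    ([[1.0, -1.0, 0.5]], [0.0], sigmoid),
]
y = feedforward([1.0, 1.0], layers)
```

Each row of a weight matrix holds the connection strengths of one neuron, so changing a single weight changes how strongly one neuron's signal reaches the next layer.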
2.4 Convolutional neural network CNN:
2.4.1 Presentation:
A convolutional neural network (CNN) is a type of artificial neural network proposed by
Yann LeCun in 1988. CNNs are among the most popular deep learning architectures for
image classification, recognition and segmentation. A CNN consists of multiple hidden
layers arranged hierarchically. Its artificial neurons take their input from the image,
multiply it by weights, add a bias and then apply an activation function. Artificial
neurons can therefore be used for image classification, recognition and segmentation by
performing simple convolutions, provided that the network is fed with a large amount of
data.[13]
2.4.2 Architecture:[13][12]
Convolutional neural networks are among the most efficient models for classifying image
data. They were inspired by the mammalian visual cortex.[10] Each CNN channel is made up
of convolutional layers, max pooling layers, fully connected layers and an output
layer.[14]
Figure 6. Architecture for a convolutional neural network.
2.4.2.1 The convolution layer CONV:
The convolution layer is the first layer to extract features from an input image.[12]
It is the fundamental unit of a convnet.[15] It contains a set of filters whose parameters
need to be learned. Once the data reaches a convolution layer, the layer convolves each
filter across the spatial dimensions of the data to produce a 2D activation map. The
result of convolving an (N, M) image matrix with an (n, m) filter matrix is called a
« feature map ». Convolving an image with different filters can perform operations such
as edge detection, blurring and sharpening.[15]
During the forward pass, each filter is convolved across the width and height of the
input volume, computing dot products between the entries of the filter and the input at
each position. As the filter slides over the width and height of the input volume, it
produces a 2-dimensional activation map that gives the responses of that filter at every
spatial position. Each layer has an entire set of filters, and each of them produces a
separate 2-dimensional activation map.[16] The 2D convolution between image A and
filter B can be given as:
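$$C(i, j) = \sum_{m=0}^{M_a-1} \sum_{n=0}^{N_a-1} A(m, n)\, B(i - m,\ j - n)$$
where the size of A is $M_a \times N_a$, the size of B is $M_b \times N_b$, terms whose
indices fall outside B are taken as zero, and $0 \le i \le M_a + M_b - 2$,
$0 \le j \le N_a + N_b - 2$, so that C has size $(M_a + M_b - 1) \times (N_a + N_b - 1)$.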
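The 2D convolution formula above can be checked with a direct, unoptimized sketch in pure Python. The function name and the toy matrices are illustrative. Note also, as a general remark rather than a statement about the implementation in this report, that deep learning libraries typically slide the filter without flipping it (cross-correlation) and often keep only the positions where the filter fully overlaps the image.

```python
def conv2d_full(A, B):
    """Full 2D convolution: C(i, j) = sum over m, n of A(m, n) * B(i - m, j - n)."""
    Ma, Na = len(A), len(A[0])
    Mb, Nb = len(B), len(B[0])
    C = [[0] * (Na + Nb - 1) for _ in range(Ma + Mb - 1)]
    for i in range(Ma + Mb - 1):
        for j in range(Na + Nb - 1):
            for m in range(Ma):
                for n in range(Na):
                    # Terms whose index falls outside B contribute nothing.
                    if 0 <= i - m < Mb and 0 <= j - n < Nb:
                        C[i][j] += A[m][n] * B[i - m][j - n]
    return C


A = [[1, 2],
     [3, 4]]
B = [[1, 0],
     [0, 1]]
C = conv2d_full(A, B)
```

Here B has ones at (0, 0) and (1, 1), so C is the sum of A and a copy of A shifted by one row and one column, which gives a 3×3 result.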
A CNN learns the values of these filters on its own during the training process (although
parameters such as the number of filters, the filter size and the architecture of the
network still need to be specified before training). Increasing the number of filters
lets the network extract more image features, which generally improves its ability to
recognize patterns. Three parameters control the size of the feature map (convolved
feature):
• Depth: corresponds to the number of filters used for the convolution operation.
• Stride: the number of pixels by which the filter is shifted over the input matrix at
each step. With a stride of 1 the filter moves one pixel at a time; a larger stride
produces a smaller feature map.
• Zero padding: it is convenient to pad the input matrix with zeros around the
border, so that the filter can be applied to the bordering elements of the input image
matrix.
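The effect of these three parameters on the feature map size follows the usual relation output = floor((input - filter + 2 × padding) / stride) + 1 along each spatial dimension. The sketch below applies it; the function name and the example values (a 48×48 input with 32 filters of size 3) are illustrative assumptions.

```python
def feature_map_shape(height, width, filter_size, num_filters, stride=1, padding=0):
    """Return (out_height, out_width, depth) of the output of a convolution layer."""
    def out(n):
        span = n - filter_size + 2 * padding
        if span < 0:
            raise ValueError("filter is larger than the padded input")
        return span // stride + 1
    # The depth of the output equals the number of filters.
    return out(height), out(width), num_filters


no_padding = feature_map_shape(48, 48, filter_size=3, num_filters=32)
same_size = feature_map_shape(48, 48, filter_size=3, num_filters=32, padding=1)
strided = feature_map_shape(48, 48, filter_size=2, num_filters=32, stride=2)
```

Without padding, a 3×3 filter shrinks each side by 2 pixels; one ring of zero padding restores the original size, and a stride of 2 halves it.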
An additional operation, called the ReLU layer, is used after every convolution operation.
A rectified linear unit (ReLU) applies an activation function whose output is
F(x) = max(0, x). Other non-linear functions such as tanh or sigmoid can also be used
instead of ReLU, but most data scientists use ReLU since, performance-wise, it is better
than the other two.[17]
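A minimal sketch of these activation functions, applied element-wise to a small feature map, is given below. The helper names and the sample values are illustrative.

```python
import math


def relu(x):
    """Rectified linear unit: negative values become 0, positive values pass through."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function, squashing values into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_activation(fn, feature_map):
    """Apply an activation function to every element of a 2D feature map."""
    return [[fn(v) for v in row] for row in feature_map]


feature_map = [[-2.0, 0.5],
               [3.0, -0.1]]
rectified = apply_activation(relu, feature_map)
squashed = apply_activation(math.tanh, feature_map)
```

ReLU leaves positive responses unchanged, whereas tanh and sigmoid saturate for large inputs.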
2.4.2.2 The pooling layer:[12][17][16]
A pooling layer is inserted between successive convolution layers. It applies a
downsampling operation along the spatial dimensions (width and height), which reduces the
dimensionality of each map but retains the important information. Spatial pooling can be
of different types, such as max pooling, average pooling and sum pooling. In max pooling,
a spatial neighborhood (for example a 2×2 window) is defined and the largest element is
taken from the rectified feature map within that window. In average or sum pooling, the
average or the sum of all elements in that window is taken instead. In practice, max
pooling has been shown to work better. Max pooling reduces the input by applying the max
function over the input $X_i$. Let m be the size of the filter; the output is then
calculated as follows:
$$M(X_i) = \max\{\, X_{i+k,\, i+l} \;:\; |k| \le m/2,\ |l| \le m/2,\ k, l \in \mathbb{N} \,\}$$
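The pooling operation can be sketched as follows. The example assumes non-overlapping windows (stride equal to the window size), which is a common configuration but an assumption here; the function names and the 4×4 feature map are illustrative.

```python
def pool2d(feature_map, size=2, reduce=max):
    """Downsample a 2D feature map with non-overlapping size x size windows."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            # `reduce` is max for max pooling, mean for average pooling, sum for sum pooling.
            pooled_row.append(reduce(window))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 1, 1]]
max_pooled = pool2d(fmap, size=2, reduce=max)
avg_pooled = pool2d(fmap, size=2, reduce=mean)
sum_pooled = pool2d(fmap, size=2, reduce=sum)
```

Each 2×2 window is reduced to a single value, so the 4×4 map becomes a 2×2 map.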
2.4.2.3 The fully connected layer:[17][16]
At the end, the outputs of the feature extractor are concatenated into a single vector
(the CNN code), which is fed into the fully connected layer (a multilayer perceptron). The
term « fully connected » indicates that every neuron in the previous layer is connected to
every neuron in the next layer. The outputs of the convolutional and pooling layers represent
high-level features of the input image. The purpose of the fully connected layer is to use
these features to classify the input image into various classes based on the training
dataset.
2.4.2.4 Activation function:
The activation function is a mathematical function applied to the signal at the output of
an artificial neuron. The term comes from its biological equivalent, the « activation
potential », a stimulation threshold which, once reached, leads to a response of
the neuron. Softmax is used as the activation function of the output layer; it treats the
outputs as scores for each class. With softmax, the function mapping stays unchanged and
these scores are interpreted as the unnormalized log probabilities of each class. Softmax
is calculated as:
$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$
where z_j is the score of class j and K is the total number of facial expression classes.
ReLU, in contrast, is an activation function which eliminates all the negative values.
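The two activation functions mentioned in this chapter can be sketched in a few lines. This is a minimal illustration using the standard definitions of softmax and ReLU; subtracting the largest score before exponentiating is a usual numerical-stability trick and does not change the result. The example scores are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(scores)  # stability: avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu(x):
    """Rectified linear unit: negative values are replaced by zero."""
    return max(0.0, x)


# hypothetical scores for three classes
probabilities = softmax([2.0, 1.0, 0.1])
```

The class with the highest score receives the highest probability, and the ordering of the scores is preserved.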
2.5 Visualisation of some CNN architectures:[8]
In recent years, CNN architectures have evolved considerably. These networks have
become so deep that it is extremely difficult to visualise the entire model.
2.5.1 LeNet-5 (1998):
It is one of the simplest architectures. It has 2 convolutional and 3 fully connected layers.
This architecture has about 60,000 parameters.
2.5.2 AlexNet(2012):
With 60M parameters, AlexNet has 8 layers: 5 convolutional and 3 fully connected.
Compared with LeNet-5, AlexNet essentially stacked a few more layers. At the time of its
publication, this architecture was one of the largest convolutional neural networks
trained on subsets of ImageNet. Its authors popularised the use of ReLU as an
activation function.
2.5.3 VGG-16(2014):
With this architecture, we notice that CNNs were starting to get deeper and deeper.
This is because the most straightforward way of improving the performance of deep neural
networks is to increase their size. VGG-16 has 13 convolutional and 3 fully connected
layers, carrying with them the ReLU tradition from AlexNet. It consists of 138M
parameters and takes about 500MB of storage space.
2.5.4 Inception-v1(2014):
This 22-layer architecture with 5M parameters is called Inception-v1. The design of
its inception module is a product of research on approximating sparse
structures.
2.5.5 ResNet-50(2015):
With the past few CNNs, we have seen nothing but an increasing number of layers in
the design, achieving better performance. But as the network depth increases,
accuracy gets saturated and then degrades rapidly. Researchers at Microsoft
addressed this problem with ResNet, using skip connections to build deeper models.
ResNet is one of the early adopters of batch normalisation and has 26M parameters.
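The idea of a skip connection can be sketched on plain vectors: the input of a block is added back to the output of its layers, so the block only has to learn a residual. The sketch below is an illustration under simple assumptions (a ReLU applied after the addition, and a hypothetical transform function standing in for the convolutional layers of the block).

```python
def residual_block(x, transform):
    """Return relu(transform(x) + x) for a vector x.

    `transform` stands in for the stacked layers of the block (hypothetical);
    the `+ x` term is the skip connection.
    """
    fx = transform(x)
    return [max(0.0, f + xi) for f, xi in zip(fx, x)]


# if the layers output zeros, the block simply passes its (positive) input through
identity_like = residual_block([1.0, 2.0], lambda v: [0.0] * len(v))
```

Because a block can fall back to the identity mapping, adding more blocks does not have to degrade the accuracy of the network.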
2.5.6 Xception(2016):
Xception is an adaptation of Inception in which the inception modules have been replaced
with depthwise separable convolutions. It has roughly 23M parameters, about the same
as Inception-v3.
2.6 Conclusion:
In this chapter, we have presented neural networks and their different architectures.
We focused on CNNs, their structure and their different layers, and then presented a
few examples of architectures. In the next chapter, we will explain the idea behind the
architecture that we have chosen for our facial expression recognition system.
Chapter
3
FER: system design
3.1 Introduction:
Despite the notable success of traditional facial recognition methods based on the
extraction of handcrafted features, over the past decade researchers have turned to the
deep learning approach due to its high automatic recognition capacity. The goal of our
project is to use a convolutional neural network (CNN) to recognize facial expressions.
3.2 System presentation:
In this section, we present a Facial Emotion Recognition system based on CNN.
This system detects the face of a person in an image, a video sequence or a camera
feed and finds the expression, with an associated accuracy rate, among the six
universal expressions (happy, disgust, fear, anger, sad, surprise) and neutral.
3.2.1 General architecture:
Our research on facial expression recognition has enabled us to note that all the solutions
proposed for the recognition of emotions are structured according to the same overall
architecture, in three main modules: face detection, feature extraction and classification.
These will be the principal modules of our system.
Figure 7. Facial Emotion Recognition Workflow
The effectiveness of the system depends on the method used to locate the face in the
picture. We use the Viola-Jones algorithm to detect the face and its various parts,
such as the eyes, nose, eyebrows, mouth, lips and ears. This algorithm
relies on Haar-like features evaluated through a cascade classifier, which can effectively
combine many simple features into a single resulting classifier.
Figure 8. Viola-Jones algorithm
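To make the notion of a Haar-like feature more concrete, the sketch below computes a two-rectangle feature with an integral image, which is the mechanism that lets the Viola-Jones detector evaluate many features quickly. It is a simplified illustration, not the detector used in our system; the function names and the toy patch are hypothetical.

```python
def integral_image(image):
    """Summed-area table with an extra leading row and column of zeros."""
    rows, cols = len(image), len(image[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += image[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def rect_sum(table, top, left, height, width):
    """Sum of the pixels of a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def two_rect_feature(table, top, left, height, width):
    """Haar-like edge feature: sum of the left half minus sum of the right half."""
    half = width // 2
    left_sum = rect_sum(table, top, left, height, half)
    right_sum = rect_sum(table, top, left + half, height, half)
    return left_sum - right_sum


# toy 4*4 patch: bright left half, dark right half (a vertical edge)
patch = [[9, 9, 1, 1] for _ in range(4)]
table = integral_image(patch)
edge_response = two_rect_feature(table, 0, 0, 4, 4)
```

A large response indicates a strong vertical contrast inside the window; the cascade combines many such responses to decide whether a window contains a face.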
3.2.3 Facial features extraction:
Once the face is detected, the system starts extracting features, converting the
pixel data into a smaller representation to be used in the classification process.
This step reduces the size of the input image while keeping the most useful data. This
process is based on a convolutional neural network (CNN). In our project, we chose to
experiment with two architectures: a custom CNN and VGG16. These architectures were
originally proposed for image classification; in our case, we test them for facial
emotion recognition.
3.2.3.1 CNNs model presentation:
To evaluate the performance of different models for facial emotion recognition, we
developed in this project several CNN models with a variable number of layers. Our first
convolutional neural network model has two convolutional layers followed by two fully
connected layers. The first part of this network consists of convolutional blocks, each
comprising a convolution layer, batch normalization, ReLU, max pooling and dropout. In the
first block, we have 32 (3*3) filters along with batch normalization, ReLU, dropout and
max pooling with a 2*2 window. In the second convolutional block we have 64 (3*3)
filters. The convolutional layer receives the input, transforms it, then
outputs the transformed input to the next layer. This operation is called convolution.
Then, we apply batch normalization to the output. This operation normalizes and
standardizes the output data and puts all the data points on the same scale. After that,
an activation function maps a node’s inputs to its corresponding output. Two common
activation functions are ReLU and sigmoid. In our case, we used the ReLU
function because it is one of the most widely used activation functions today; it
sets all negative values to zero and leaves positive values unchanged. Finally, we apply
a max pooling operation. When we add max pooling to our convolutional layer, the
dimensionality of the image is reduced by reducing the number of pixels in the output.
Figure 9. model with two convolutional layers
In the second architecture, we focused on the impact of adding more layers to
our system. We added two convolutional layers to our network: one with 256 filters of
size (3*3), then one with 512 (3*3) filters. The choice of the number of output filters is
arbitrary, and the chosen kernel size of 3*3 is a very common size to use. The deeper the
network goes, the more sophisticated the filters become. In later layers the filters may be
able to detect specific objects.
3.2.3.2 VGG16 architecture presentation:
VGG16 is a convolutional neural network (CNN) architecture whose name refers to
the Visual Geometry Group, the group that developed it. We selected the VGG model because
it has shown excellent performance in many computer vision tasks and has been very
successful in many different localization and classification tasks. This network
facilitates the recognition of objects based on the output probabilities of the different
classes to which an image could belong. This architecture contains 13 convolutional
layers, 5 max pooling layers and 3 fully connected layers. It takes an image of size
224*224*3 as input and uses only 3*3 convolutions and 2*2 pooling.
Figure 10. VGG16 architecture
This model has demonstrated that the depth of the network is beneficial for
classification accuracy. However, the VGG16 network has two major drawbacks:
• It is slow to train.
• Due to its depth and the number of fully connected nodes, VGG16 exceeds 533 MB
in size.
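The order of magnitude of these figures can be checked with a short parameter count. The sketch below assumes the standard published VGG16 configuration (the filter counts per layer and the 1000-class ImageNet output layer are not given in this report and are stated here as assumptions), with 3*3 kernels, biases, and weights stored as 32-bit floats.

```python
# Assumed standard VGG16 configuration: 13 convolutional layers, 3 dense layers
CONV_FILTERS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
DENSE_UNITS = [4096, 4096, 1000]  # original ImageNet classifier


def vgg16_parameter_count(input_channels=3, flattened_features=7 * 7 * 512):
    """Count weights and biases of the convolutional and dense layers."""
    total = 0
    in_channels = input_channels
    for out_channels in CONV_FILTERS:
        # 3*3 kernel per input/output channel pair, plus one bias per filter
        total += 3 * 3 * in_channels * out_channels + out_channels
        in_channels = out_channels
    in_features = flattened_features  # 224 / 2**5 = 7 after five 2*2 poolings
    for units in DENSE_UNITS:
        total += in_features * units + units
        in_features = units
    return total


total_params = vgg16_parameter_count()
size_mb = total_params * 4 / 1e6  # 4 bytes per 32-bit weight
```

Most of the parameters sit in the first fully connected layer, which explains why the fully connected part dominates the storage size.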
3.3 Conclusion:
In this chapter we have introduced two models (a CNN network and the VGG16 network).
These architectures are used for the classification of images, and we noticed that each
architecture has its own specific characteristics. In the next chapter, we will present the
implementation of these two models and test them in order to assess the performance
of a deep learning FER system, and we will compare the different results.
Chapter
4
Implementation and results:
4.1 Introduction:
The goal of our project is to design and implement an application which allows us
to recognize facial expressions. We are interested in the application of deep learning
models. In this chapter, we will present the implementation of the different
models, the development environment and the various tools used. Finally, we will
present the results obtained.
4.2 Software and tools used for implementation:
4.2.1 Python3.7:[18]
Python is a general-purpose programming language in a similar vein to other programming
languages that you might have heard of, such as C++, JavaScript, Microsoft’s C# and
Oracle’s Java. It has been around for some considerable time, having been originally
conceived back in the 1980s by Guido van Rossum at Centrum Wiskunde & Informatica
(CWI) in the Netherlands. The language is named after one of Guido’s favourite programs,
“Monty Python’s Flying Circus”, a classic and somewhat anarchic British comedy sketch
show originally running from 1969 to 1974 (but which has been rerun on various stations
ever since) and with several film spin-offs. You will even find various references to this
show in the documentation available with Python.
As a language it has gained in interest over recent years, particularly within the
commercial world, with many people wanting to learn the language. This increased
interest in Python is driven by several different factors:
• Its flexibility and simplicity, which make it easy to learn.
• Its use by the Data Science community, where it provides a more standard
programming language than some rivals such as R.
• Its suitability as a scripting language for those working in the DevOps field, where it
provides a higher level of abstraction than alternative languages traditionally used.
• Its ability to run on (almost) any operating system, but particularly the big three
operating systems: Windows, Mac OS and Linux.
• The availability of a wide range of libraries (modules) that can be used to extend
the basic features of the language.
4.2.2 OpenCv:[19]
Open Source Computer Vision (OpenCV) is a set of cross-platform libraries containing
functions that provide computer vision in real time.
4.2.3 Tensorflow:[20]
TensorFlow, originally created by researchers at Google, is the most popular of
the many deep learning libraries. In the field of deep learning, neural networks
have achieved tremendous success and gained wide popularity in various areas. This
family of models also has tremendous potential to promote data analysis and modeling for
various problems in educational and behavioral sciences given its flexibility and scalability.
The review in [20] gives an overview of the basics of neural network models such as the
multilayer perceptron, the convolutional neural network, and stochastic gradient descent,
the most commonly used optimization method for neural network models. However,
the implementation of these models and optimization algorithms is time-consuming and
error-prone. Fortunately, TensorFlow greatly eases and accelerates the research and
application of neural network models.
4.2.4 Keras:
Keras is a neural network API written in Python and fully integrated with TensorFlow.
It can also be used to visualize the filters of the convolutional layers.
4.2.5 Numpy:[21]
NumPy is the fundamental package for scientific computing with Python. Its features are
as follows:
• It has a powerful custom N-dimensional array object for efficient and convenient
representation of data.
• It has tools for integration with other programming languages used for scientific
programming like C/C++ and FORTRAN.
• It is used for mathematical operations like linear algebra, matrix operations, image
processing, and signal processing.
4.2.6 Sklearn:
Scikit-learn is a Python library which provides simple and efficient tools for data analysis.
It has the following major modules:
• Regression.
• Classification.
• Model selection.
• Preprocessing.
4.2.7 Matplotlib:[21]
Matplotlib is a MATLAB-style data visualization library. Data processing and mining is a
vast topic and outside the scope of this report; however, we can use images as a convenient
data source to demonstrate some of the data processing capabilities of matplotlib.
4.3 Database:
For the best performance, the network should be trained with a large number of sample
images. This increases the accuracy and improves the performance of the model.
Unfortunately, very large datasets are not publicly available, but we have access to two
public databases (FER2013 and CK+). To implement our system, we use the FER2013
database, downloaded from the Kaggle challenge on FER. This dataset consists of 35,887
labeled images, divided into 3,589 test images, 3,589 validation images and 28,709
training images. It contains 48*48 pixel grayscale images that vary in viewpoint, lighting
and scale. Our objective is to classify each facial image into one of the seven facial emotion
categories: anger, disgust, fear, happiness, sadness, surprise and neutral.
4.4 Implementation:
The system of Facial Emotion Recognition consists of two modules : The first one is for
face detection and the second one is for emotion recognition.
4.4.1 Import libraries:
First of all, we need to import the libraries required for building the network.
The code for importing the libraries is given below:
Figure 11. Libraries
4.4.2 Data download:
4.4.2.1 Dataset overview:
Our FER2013 dataset contains 4953 angry faces, 547 disgust faces, 5121 fear faces, 8989
happy faces, 6077 sad faces, 4002 surprise faces and 6198 neutral faces.
Figure 12. Facial Emotion distribution
We can notice that our FER2013 dataset is unbalanced, and this will have
an impact on the performance of our model. We will see this in the next section.
Figure 13. Emotion distribution
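The imbalance can be quantified with a few lines of code. The counts below follow the dataset overview, reading the 547-image class as disgust and taking 6198 neutral images so that the classes sum to the 35,887 images of the database; both readings are assumptions. The inverse-frequency class weights are shown only as a common remedy for imbalance, not as a step applied in this work.

```python
# Class counts (assumed reading of the dataset overview, summing to 35,887)
CLASS_COUNTS = {
    "angry": 4953,
    "disgust": 547,
    "fear": 5121,
    "happy": 8989,
    "sad": 6077,
    "surprise": 4002,
    "neutral": 6198,
}


def class_shares(counts):
    """Fraction of the dataset that belongs to each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: rare classes get a larger weight."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


shares = class_shares(CLASS_COUNTS)
weights = balanced_class_weights(CLASS_COUNTS)
imbalance_ratio = max(CLASS_COUNTS.values()) / min(CLASS_COUNTS.values())
```

The happy class is more than sixteen times larger than the disgust class, which is consistent with the difference in detection quality discussed in the results.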
4.4.2.2 Data augmentation:
After downloading our data, we need to add more data to our training set to improve
our results. Data augmentation consists of flipping the image horizontally or vertically,
rotating the image, zooming in or out, cropping, or varying the color. With the Keras
ImageDataGenerator class, we create new data from our existing dataset using data
augmentation.
Figure 14. Load data and data augmentation
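The principle of data augmentation can be shown without any library. The sketch below only implements the two flips on an image stored as a list of rows; it is a simplified stand-in for what ImageDataGenerator does, and the function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]


def augment(image):
    """Return the original image together with its two flipped versions."""
    return [image, flip_horizontal(image), flip_vertical(image)]


# toy 2*2 image
toy_image = [[1, 2], [3, 4]]
augmented = augment(toy_image)
```

Each training image thus yields several variants that carry the same emotion label.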
4.4.2.3 Data preprocessing:
The input images of the FER2013 database may contain variations in illumination,
size and colour, so we apply some preprocessing operations to the images to get more
accurate and faster results from the algorithm. Those operations consist of normalization,
gray scaling and resizing.
• Normalization: this step consists of removing illumination variations to obtain an
improved face image.
• Gray scaling: the process of converting a coloured image into a grayscale image.
• Resizing: brings the image to the input size expected by the network and removes
the unnecessary parts of the image.
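The three operations above can be sketched on images stored as nested lists. This is an illustration under stated assumptions: gray scaling uses the usual luminance weights (0.299, 0.587, 0.114), normalization is a simple min-max rescaling to [0, 1], and resizing uses nearest-neighbour sampling; the report does not specify which variants were used, and the function names are hypothetical.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels into rows of luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def min_max_normalize(gray_image):
    """Rescale pixel values to the range [0, 1]."""
    flat = [p for row in gray_image for p in row]
    low, high = min(flat), max(flat)
    if high == low:  # constant image: avoid a division by zero
        return [[0.0 for _ in row] for row in gray_image]
    return [[(p - low) / (high - low) for p in row] for row in gray_image]


def resize_nearest(image, new_height, new_width):
    """Resize with nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    return [[image[r * height // new_height][c * width // new_width]
             for c in range(new_width)]
            for r in range(new_height)]


# toy example: one white and one black pixel
gray = to_grayscale([[(255, 255, 255), (0, 0, 0)]])
normalized = min_max_normalize(gray)
enlarged = resize_nearest([[1, 2], [3, 4]], 4, 4)
```

In the real pipeline the target size would be the 48*48 input size of the network.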
4.4.3 Build the CNN model:
To build the CNN architecture, we first create a variable named model, which is an
instance of a Sequential object. In our case, we pass a list of layers to the Sequential
constructor. Our first model contains two convolutional layers, two fully connected layers
and the output layer.
4.4.3.1 Architecture of the two convolutional layer of CNN model:
Conv -> BN -> Activation -> MaxPooling -> Dropout
Conv -> BN -> Activation -> MaxPooling -> Dropout
Flatten
Dense -> BN -> Activation -> Dropout
Dense -> BN -> Activation -> Dropout
Output layer
Figure 15. Convolutional neural network 2 layers in keras
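A possible Keras transcription of this block structure is sketched below. It is not the code of Figure 15: the filter counts (32 and 64 filters of size 3*3), the 2*2 pooling, the 48*48 grayscale input and the 7 output classes come from this report, whereas the padding mode, the dropout rates and the sizes of the two dense layers (256 and 128) are assumptions made for illustration.

```python
from tensorflow.keras import layers, models


def build_two_conv_cnn(input_shape=(48, 48, 1), n_classes=7):
    """Two convolutional blocks, two dense blocks and a softmax output."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        # Block 1: Conv -> BN -> Activation -> MaxPooling -> Dropout
        layers.Conv2D(32, (3, 3), padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),  # assumed rate
        # Block 2: Conv -> BN -> Activation -> MaxPooling -> Dropout
        layers.Conv2D(64, (3, 3), padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),  # assumed rate
        layers.Flatten(),
        # Dense -> BN -> Activation -> Dropout (sizes are assumptions)
        layers.Dense(256),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Dropout(0.5),
        layers.Dense(128),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Dropout(0.5),
        # Output layer: one probability per emotion class
        layers.Dense(n_classes, activation="softmax"),
    ])


model = build_two_conv_cnn()
```

Calling model.summary() lists the layers and their output shapes, which is a convenient way to check that the structure matches the block list above.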
The first model was not able to properly classify even the data it was trained on: the
model was underfitting. To reduce this underfitting, we increased the number of layers and
the number of neurons. The learning rate in this architecture was 0.0001; we
also experimented with another value to see the effect of the learning rate on our model.
4.4.3.2 Architecture of the four convolutional layer of CNN model:
The second model contains 4 convolutional blocks, as shown below:
Conv -> BN -> Activation -> MaxPooling -> Dropout
Conv -> BN -> Activation -> MaxPooling -> Dropout
Conv -> BN -> Activation -> MaxPooling -> Dropout
Conv -> BN -> Activation -> MaxPooling -> Dropout
Flatten
Dense -> BN -> Activation -> Dropout
Dense -> BN -> Activation -> Dropout
Output layer
Figure 16. Convolutional neural network 4 layers in keras
4.4.3.3 Architecture of VGG16 network model:
The aim of our project is to experiment with several architectures in order to find the
most accurate one. This architecture was inspired by the VGG model. It consists
of several layers: convolution layers using filters of different sizes, to which we apply
batch normalization (BatchNormalization), then ReLU corrections to eliminate negative
values. At the end, we added a global average pooling layer and applied the softmax
function to calculate the score of the 7 classes of expressions (the six universal
expressions and the neutral state). It therefore returns a vector of size 7, which contains
the probabilities of belonging to each of the classes.
Figure 17. VGG16 network in keras
4.4.4 Training phase:
The next step is to train the model on the given data. The model learns what value to
assign to each weight based on how incremental changes affect the loss function. In
the compile() function, we specify the Adam optimizer (a variant of SGD), the loss
function and the metrics. The learning rate is specified in the Adam constructor; in our
case, we have chosen a learning rate of 0.001. Finally, we fit our model to the data with the
model.fit() function, which trains the model.
Figure 18. Training phase in keras
4.4.4.1 Optimization:
When we train this model, we are basically trying to solve an optimization problem. We
are trying to optimize weights that are initialized arbitrarily. Our task is to find the
weights that most accurately map our input data to the correct output class. During
training these weights are updated and saved in an .h5 file (file.h5). The weights are
optimised using an optimization algorithm. In our algorithm we used the SGD (stochastic
gradient descent) optimizer.
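The principle of stochastic gradient descent can be illustrated on a much smaller problem than a CNN. The sketch below fits a line y = w*x + b to toy data by updating the two weights after each sample; the data, the learning rate and the number of epochs are hypothetical values chosen for the illustration.

```python
import random


def sgd_fit(samples, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = weight * x + bias with per-sample gradient updates."""
    rng = random.Random(seed)
    # weights start from arbitrary values
    weight, bias = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order each epoch
        for x, target in data:
            error = (weight * x + bias) - target
            # gradient of the squared error with respect to each weight
            weight -= learning_rate * 2 * error * x
            bias -= learning_rate * 2 * error
    return weight, bias


# toy data generated from y = 3x + 1
samples = [(x / 10, 3 * (x / 10) + 1) for x in range(-10, 11)]
weight, bias = sgd_fit(samples)
```

The weights move step by step towards the values that minimise the loss, which is the same mechanism Keras applies to the weights of the network.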
For the face detection stage, we chose the method proposed by Paul Viola and Michael
Jones. In our system, we opted for Haarcascade.xml from the OpenCV library, which
provides the Haar cascade method.
Figure 19. Face detection in keras
4.5 Results:
After the training and validation phases, we observe that our system is able to detect
emotions. During the training phase we save a file.h5 file to keep the accuracy and loss
obtained during each iteration. One of the main things to avoid when we train our model
is overfitting. This occurs when the model fits the training data well but is not able
to make accurate predictions for data it has not seen before. To test whether our models
are overfitting, we used a simple form of cross-validation (hold-out validation), in which
we split our data into two parts: a training set (to train the model) and a validation set
(to evaluate the model’s performance). In our experiments, increasing the number of
layers decreased the accuracy.
4.5.1 CNN architecture results:
Accuracy denotes the training accuracy (maximum attained = 0.69, i.e. 69%). Loss denotes
the training loss (minimum value = 0.8). With the first architecture, which has two
convolutional layers, and a learning rate of 0.0001, the model was underfitting: the
training accuracy is low and the training loss is high.
Figure 20. training and validation accuracy and loss graph(Lr=0.0001)
For the next architectures we chose a learning rate of 0.005.
Figure 21. training and validation accuracy and loss graph(Lr=0.005)
4.5.1.1 Accuracy and Loss plot in CNN network:
An epoch refers to a single pass of the entire dataset through the network during training.
The blue line denoting training accuracy increases steeply between epoch 0 and epoch 20,
then slows down and becomes stable around epoch 25. The red line denoting
validation accuracy, by contrast, varies somewhat randomly between the 5th and 10th epochs
and then becomes constant, indicating that the classifier may be overfitting. The blue
line denoting training loss declines gradually from epoch 0 to 20 and at an almost constant
rate from epoch 20 onwards. The red line denoting the validation loss follows the same
variation, except for some fluctuations between epochs 5 and 10.
Figure 22. Plot between accuracy and validation accuracy
If we add a convolutional layer to this model, the accuracy decreases.
4.5.1.2 Confusion matrix:
A confusion matrix is used to find which emotions usually get confused with each other.
Figure 23. CNN’s Confusion matrix
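As a minimal sketch of how such a matrix is built, the code below counts, for each true emotion (row), how often each emotion was predicted (column), and then extracts the most frequent confusion. The labels in the toy example are hypothetical and are not results of our models.

```python
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def confusion_matrix(true_labels, predicted_labels, labels=EMOTIONS):
    """matrix[i][j] = number of samples of class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def most_confused_pair(matrix, labels=EMOTIONS):
    """Largest off-diagonal entry as (true label, predicted label, count)."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and (best is None or count > best[2]):
                best = (labels[i], labels[j], count)
    return best


# hypothetical labels, for illustration only
y_true = ["sad", "sad", "sad", "happy", "disgust", "disgust"]
y_pred = ["neutral", "neutral", "sad", "happy", "angry", "disgust"]
matrix = confusion_matrix(y_true, y_pred)
worst = most_confused_pair(matrix)
```

The diagonal of the matrix holds the correctly classified samples; everything off the diagonal is a confusion between two emotions.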
4.5.2 VGG16 network result:
4.5.2.1 Accuracy and Loss plot:
Figure 24. CNN’s Confusion matrix
4.5.2.2 Confusion matrix for VGG16 network:
Figure 25. VGG16’s Confusion matrix
4.6 Discussion
In our case, we notice that the happy emotion is the best detected, as it has the largest
number of examples. Sad, surprise, neutral and anger emotions are also well detected,
thanks to a sufficient number of examples. Fear and disgust perform worse; possible reasons
are fewer training examples and, for disgust, features similar to those of anger. Sad
emotions are also often detected as neutral, because it is hard to distinguish them.
4.7 Conclusion:
In this chapter, we have presented a Facial expression recognition system based on CNN.
We presented the results obtained for each architecture. This system has been tested on
the FER2013 database of kaggle.
Chapter
5
Conclusion
Deep learning is playing an important role in our lives. It has a huge impact in many
areas such as cancer diagnosis, precision medicine, self-driving cars, speech recognition,
etc.
Our work has shown that deep learning can be applied successfully to the task of
emotion recognition. Applying deep learning methods to emotion recognition is
nevertheless still a challenge. One of the main challenges in emotion recognition, as in
most computer vision tasks, is to deal with the complexity of real-world scenarios.
This includes the large variations in illumination and in the appearance of subjects, and
the variation of gesture, pose and emotions detected in real time. In addition, collecting
data for this task requires more effort than for some other tasks such as object recognition.
This offers the opportunity for research and improvement.
The long-term goal is to develop a human activity recognition system
which encompasses speech, gesture and facial expressions.
Webography
[3] The Universally Recognized Facial Expressions of Emotion. url:
https://www.kairos.com/blog/the-universally-recognized-facial-expressions-of-emotion
(visited on 0003–2015).
[4] Facial Action Coding System. url:
https://www.paulekman.com/facial-action-coding-system/ (visited on 0012–2020).
[6] Facial Action Tracking (Face Recognition Techniques) Part 1. url:
http://what-when-how.com/face-recognition/facial-action-tracking-face-recognition-techniques-part-1/
(visited on 0012–2020).
[8] A Simple Guide to Convolutional Neural Networks. url:
https://towardsdatascience.com/a-simple-guide-to-convolutional-neural-networks-751789e7bd88
(visited on 0001–2019).
[13] Artificial Intelligence vs. Machine Learning vs. Deep Learning: What’s the
Difference. url:
https://medium.com/ai-in-plain-english/artificial-intelligence-vs-machine-learning-vs-deep-learning-whats-the-difference-dccce18efe7f
(visited on 0004–2020).
[17] Understanding of Convolutional Neural Network (CNN) — Deep Learning.
[22] Illustrated: 10 CNN Architectures. url:
https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d
(visited on 0007–2019).
Bibliography
[1] MANUEL GRAÑA Andoni Beristain. “Emotion recognition based on the analysis
of Facial expressions : a survey”. In: New Mathematics and Natural Computation
05.6 (July 2009), pp. 513–534.
[2] Klaus Scherer. “What are emotions? And how can they be measured?”
In: Social Science Information, SAGE Publications (London, Thousand Oaks,
CA and New Delhi) 44.04 (December 2005), pp. 695–729.
[5] Marco Fratarcangeli and Marco Schaerf. “Realistic Modeling of Animatable Faces
in MPEG-4”. In: Researchgate 01.04 (2010), pp. 1–11.
[7] KM.Pooja yoti Kumaria R.Rajesha. “Facial Expression Recognition : A survey”. In:
Second International Symposium on Computer Vision and the Internet 58.6 (August
2015), pp. 486 – 491.
[9] Afef Abdelkrim Nadia Jmour Sehla Zayen. “Convolutional neural networks for image
classification”. In: 2018 International Conference on Advanced Systems and Electric
Technologies (ICA SET ) 06.17843532 (June 2018), pp. 397–402.
[10] Durgansh Sharma Aditya Kakde Nitin Arora. “A COMPARATIVE STUDY OF
DIFFERENT TYPES OF CNN AND HIGHWAY CNN TECHNIQUES”. In: Global
Journal of Engineering Science and Research 04.04 (April 2019), pp. 18–21.
[11] Shilpa Jain Dinesh Bisht and M. Mohan Raju. “Prediction of Water Table Elevation
Fluctuation through Fuzzy Logic Artificial Neural Networks”. In: International
Journal of Advanced Science and Technology 51.04 (March 2013), pp. 108–119.
[12] Lean Yu, Shouyang Wang and Kin Keung Lai. “Basic Learning Principles of Artificial
Neural Networks”. In: Foreign-Exchange-Rate Forecasting With Artificial Neural
Networks. 2007, pp. 27–37.
[14] IshapUnwala Xiakun Yang Lucy Nwosu Huiwang Jiang Lu and Ting Zhang.
“Deep Neural Network for facial expression recognition using Facial Part”.
In: 15th Intl Conf on Dependable, Autonomic and Secure Computing,
15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on
Big Data Intelligence and Computing and Cyber Science and Technology
Congress(DASC/PiCom/DataCom/CyberSciTech) 213.17661650 (April 2018),
pp. 1318–1321.
[15] Ahmed Jawad Kabir Shadman Sakib Nazib Ahmed and Hridon Ahmed. “An
Overview of Convolutional Neural Network: Its Architecture and Applications”.
In: Proceedings of the IEEE conference on computer vision 1.1 (November 2019),
pp. 1–5.
[16] Deepesh Lekhak. “Facial Expression Recognition System using Convolutional
Neural Network”. In: A Project Report 72.645 (May 2017), pp. 1–17.
[18] John Hunt. “A Beginners Guide to Python 3 Programming”. In: Undergraduate
Topics in Computer Science 01.01 (January 2019), pp. 1–2.
[19] Manoel Carlos Ramon. “Using OpenCV”. In: Intel® Galileo and Intel® Galileo Gen
2 02.01 (December 2014), pp. 319–400.
[20] Yingnian Wu Bo Pang Erik Nijkamp. “Deep Learning With TensorFlow: A Review”.
In: Journal of Educational and Behavioral Statistics 45.02 (September 2019),
pp. 1–22.
[21] Ashwin Pajankar. “Introduction to NumPy”. In: Raspberry Pi Supercomputing and
Scientific Programming (May 2017), pp. 109–128.
[23] Shan Li and Weihong Deng. “Deep Facial Expression Recognition: A Survey”. In:
IEEE Transactions on Affective Computing 1109.10 (March 2020), p. 99.
[24] Zhi Liu Yang Xin Lingshuang Kong. “Machine learning and deep learning method
for cybersecurity”. In: IEEE Access 06.17905844 (May 2018), pp. 35365 –35381.
[25] Aysegul ALAYBEYOGLU Reza SADIGHZADEH and Aydin AKAN Mehmet
Akif OZDEMIR Berkay ELAGOZ. “Realtime Emotion Recognition from
Facial expression using CNN Architecture”. In: Medical Technologies Congress
(TIPTEKNO) 107.19136172 (November 2019), pp. 27–36.