Ministry of Higher
Education, Scientific
Research and
Information and
Communication
Technologies
Master Project
Report
presented at
by
Awatef MESSAOUDI
Mr F F President
Mr Zied LACHIRI Supervisor
Mr F F Reviewer
Dedication
I dedicate this work to my parents who have provided me with their encouragement,
love and understanding.
To all who were there for me: thank you for your help and encouragement.
To all of you,
I dedicate this work.
Awatef MESSAOUDI
Acknowledgements
First and foremost, I would like to express my infinite gratitude and respect to my
supervisor, Mr Zied LACHIRI.
I am also grateful to all my teachers, without whom this work would not have been
possible.
I will not forget, of course, to express my gratitude to my Master's colleagues, who
kindly agreed to cooperate. Most of all, I am thankful to my family for their
exceptional support through all challenges.
CONTENTS Awatef MESSAOUDI
Contents
Dedication
Acknowledgements
Contents
Acronyms
Introduction
1.4 Conclusion:
2 Deep learning
2.1 Introduction:
2.2 Machine learning vs Deep learning:
2.3 Artificial neural network:[11]
2.4 Convolutional neural network CNN:
2.4.1 Presentation:
2.4.2 Architecture:[13][12]
2.5 Visualisation of some CNN architectures:[8]
2.5.1 LeNet-5 (1998):
2.5.2 AlexNet(2012):
2.5.3 VGG-16(2014):
2.5.4 Inception-v1(2014):
2.5.5 ResNet-50(2015):
2.5.6 Xception(2016):
2.6 Conclusion:
4.4 Implementation:
4.4.1 Import libraries:
4.4.2 Data download:
4.4.3 Build the CNN model:
4.4.4 Training phase:
4.4.5 Face detection:
4.5 Results:
4.5.1 CNN architecture results:
4.5.2 VGG16 network result:
4.6 Discussion
4.7 Conclusion:
5 Conclusion
Webography
Bibliography
LIST OF FIGURES Awatef MESSAOUDI
List of Figures
Acronyms
AU Action unit
INTRODUCTION Awatef MESSAOUDI
Introduction
Automatic recognition of human emotion has been an active research area for decades, with a
growing range of applications including avatar animation, neuro-marketing and social robots.
Emotion recognition is a challenging task because it involves predicting abstract emotional
states from multi-modal input data. These modalities include video, audio and physiological
signals. Just like the tone of voice, facial expressions play an important role in the
recognition of emotions and, as a form of non-verbal communication, are also used to identify
people. The face plays an important role in daily emotional communication. Facial expression
is one of the most informative channels, because the face is the most exposed part of the
body. Facial emotion recognition nevertheless remains one of the most challenging tasks in
computer vision. In recent years, with the evolution of deep learning methods, researchers
have introduced end-to-end frameworks for facial expression recognition using deep learning
models. In this preliminary study, we propose a simple solution for facial expression
recognition based on deep learning methods. We first review the state of the art of facial
expression recognition, focusing on notions such as emotions, facial expressions and the
six universal emotions. The first chapter is an introduction to facial expression
recognition and to related terms such as face detection, feature extraction, machine learning
and deep learning. The second chapter focuses on deep learning and its building blocks, with
a brief comparison with machine learning. The third chapter presents the architecture of two
different deep learning networks; we study their building blocks and the FER2013 data set
used in our work. The final chapter covers the implementation and the results: we discuss
the results of the presented work and draw conclusions.
CHAPTER 1. FER: STATE OF THE ART Awatef MESSAOUDI
Chapter
1
Facial Emotion Recognition (FER):
state of the art
1.1 Introduction:
Owing to the important role of facial expressions in human interaction, the ability to
perform facial expression recognition automatically via computer vision enables a range
of applications such as human-computer interaction and data analytics. In this
chapter, we present some notions about emotions and different coding systems, as well as
the architecture of a facial expression recognition system. We then present some approaches
that help us to recognize facial expressions, and we end the chapter with different machine
learning techniques.
1.2.1 Definitions:
1.2.1.1 Emotions:[1]
Emotion is expressed through many channels, such as body posture, voice and facial
expressions. It is a mental and physiological state which is subjective and private. It
involves many behaviours, actions, thoughts and feelings.
A facial expression is a meaningful movement of the face. Its meaning can be the expression
of an emotion, a semantic cue or an intonation in sign language. The
interpretation of a set of muscle movements as an expression depends on the context of the
application. For example, in a human-machine interaction application
where we want an indication of the emotional state of an individual, we will try
to classify the measurements in terms of emotions.
Charles Darwin wrote in his 1872 book « The Expression of the Emotions in Man
and Animals » that facial expressions of emotion are universal, not learned differently
in each culture. Several studies since then have attempted to classify human emotions and
to demonstrate how the face can give away a person's emotional state.
In 1971, Ekman and Friesen defined six basic emotions based on a cross-cultural study,
which indicated that humans perceive certain basic emotions in the same way regardless of
culture. These prototypical facial expressions are anger, disgust, fear, happiness, sadness
and surprise.
Facial expressions are a consequence of the activity of the facial muscles. These muscles are
also called mimetic muscles or muscles of facial expression. The study of facial expressions
cannot be done without studying the anatomy of the face and the underlying structure
of its muscles. This is why some researchers have focused on coding systems for facial
expressions. Several systems have been proposed, such as Ekman's. In 1978, Ekman
developed a tool for coding facial expressions that is still widely used today. We present
some of these systems below.
The Facial Action Coding System (FACS) is a comprehensive, anatomically based system
for describing all visually discernible facial movements. It breaks down facial expressions
into individual components of muscle movement, called Action Units (AUs). FACS is very
successful but it suffers from some drawbacks, such as:
• Lack of precision: the transition between two states of a muscle is represented
linearly, which is only an approximation of reality.
1.2.3.2 MPEG4:[5]
The MPEG-4 video coding standard includes a model of the human face developed by the Face
and Body Ad Hoc Group. It is a 3D model built on a set of
facial attributes called Facial Feature Points (FFP). Measurements on these points
are used to describe muscle movements (Facial Animation Parameters, the equivalent of
Ekman's Action Units).
1.2.3.3 Candide:[6]
Candide is a face model containing 75 vertices and 100 triangles. It is composed of a
generic face model and a set of parameters (Shape Units). These parameters are used
to adapt the generic model to a particular individual. They represent the differences
between individuals and are 12 in number:
1. head height.
4. eye width.
5. eye height.
Automatic facial expression recognition systems have many applications, including human
behavior understanding and the detection of mental disorders.
The field has become a research area involving scientists specializing in different areas
such as artificial intelligence, computer vision, psychology, physiology, education and
website customization.
A system that performs automatic recognition of facial expressions consists of three
modules. The first detects and registers the face in the input image or image
sequence. It can be a detector that finds the face in every frame, or one that detects the
face in the first frame and then tracks it through the rest of the video sequence. The second
module extracts and represents the facial changes caused by facial expressions. The
last one determines the similarity between the set of extracted features and a set of
reference features. Other filters or data preprocessing modules can be used between
these main modules to improve the results of detection, feature extraction or
classification.
Face detection consists of determining the presence or absence of faces in a picture. It
is a preliminary task required by most face analysis techniques. The
techniques used come from the field of pattern recognition. There are several techniques for
detecting the face; we mention the most widely used.
• Local Binary Patterns (LBP): this technique divides the face
into square sub-regions of equal size, in which the LBP features are computed.
The resulting vectors are concatenated to form the final feature vector.
• Haar filter: this face detection method uses a multi-scale Haar filter. The
characteristics of a face are described in an XML file.
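The LBP description above (square sub-regions, one descriptor per region, concatenation) can be illustrated with a minimal pure-Python sketch. The choices below are assumptions made for illustration and are not taken from the report: 8 neighbours at radius 1, a clockwise bit order starting at the top-left neighbour, and non-overlapping cells. The function names are hypothetical.

```python
# Minimal sketch of Local Binary Pattern (LBP) features.
# Assumptions (not from the report): 8 neighbours at radius 1, clockwise bit
# order starting at the top-left neighbour, non-overlapping square cells.

def lbp_code(img, r, c):
    """8-bit LBP code of pixel (r, c): a neighbour >= centre sets its bit."""
    center = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_feature_vector(img, cell):
    """Concatenate one 256-bin LBP histogram per square cell of side `cell`."""
    rows, cols = len(img), len(img[0])
    features = []
    for top in range(0, rows - cell + 1, cell):
        for left in range(0, cols - cell + 1, cell):
            hist = [0] * 256
            for r in range(top, top + cell):
                for c in range(left, left + cell):
                    # Skip border pixels that lack a full 3x3 neighbourhood.
                    if 0 < r < rows - 1 and 0 < c < cols - 1:
                        hist[lbp_code(img, r, c)] += 1
            features.extend(hist)
    return features
```

For a 48×48 face split into 8×8 cells, this sketch would produce 36 histograms of 256 bins each; the resulting vector is what a classifier would consume.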
The features of the face are mainly located around the facial components such as
the eyes, mouth, eyebrows, nose and chin. The detection of the characteristic points of the
face is performed within the rectangular box returned by the detector that locates the face.
The extraction of geometric features, such as the contours of the facial components and
facial distances, provides the location or appearance of these features. Accordingly, there
are two types of approaches:
Geometric features represent the shape and location of the components of the face (including
the mouth, eyes, eyebrows and nose). The facial components or facial features are extracted
to form a feature vector representing the geometry of the face.
Appearance features represent changes in the appearance of the face, such as wrinkles and
furrows. According to these methods, the effect of head rotation and of the different facial
shooting scales could
Recognition methods are divided into three families: global approaches,
local approaches and hybrid approaches. Each approach has advantages and
disadvantages related to environmental conditions, image orientation, head position,
etc.
Global approaches are independent of head position (top, bottom) and face image
orientation. These methods are effective but require a heavy learning phase, and the
result depends on the number of samples used.
Local approaches are based on the detection of facial components and are robust to changes
in luminance. The position of the head and its orientation can, however, cause failures in
the system.
The alternative is to combine the two approaches (local and global) in order to take
advantage of both. The recognition phase in this system is based on
machine learning: a feature vector is formed to describe the facial expression,
and the first stage of the classifier is learning. Classifier training consists of labeling
the images after detection; once the classifier is trained, it can recognize input images.
Classification methods can be divided into two groups:
• Recognition based on static data, which only concerns images.
• Recognition based on dynamic data, which concerns image sequences or videos.
Various classifiers have been applied, such as neural networks, Bayesian
networks and Support Vector Machines (SVM).
Having sufficient labeled training data that includes as many variations of populations
and environments as possible is important for the design of a deep expression recognition
system. We introduce some databases that contain a large number of affective images
and can benefit the training of deep neural networks.
1.3.4.1 CK+:
The Extended Cohn-Kanade (CK+) database is the most extensively used laboratory-controlled
database for evaluating FER systems. CK+ contains 593 video sequences from 123
subjects. The sequences vary in duration from 10 to 60 frames and show a shift from
a neutral facial expression to the peak expression. Among these videos, 327 sequences
from 118 subjects are labeled with seven basic expression labels (anger, contempt, disgust,
fear, happiness, sadness and surprise) based on the Facial Action Coding System (FACS).
Because CK+ does not provide specified training, validation and test sets, the evaluation
protocols used on this database are not uniform.
1.3.4.2 MMI:
This database is laboratory-controlled and includes 326 sequences from 32 subjects. A total
of 213 sequences are labeled with six basic expressions, and 205 sequences are captured
in frontal view. In contrast to CK+, sequences in MMI are onset-apex-offset labeled:
the sequence begins with a neutral expression and reaches its peak near the middle before
returning to the neutral expression.
1.3.4.4 FER-2013:
This database was introduced during the ICML 2013 Challenges in Representation Learning.
FER-2013 is a large-scale, unconstrained database collected automatically by the
Google image search API. All images have been registered and resized to 48*48 pixels
after rejecting wrongfully labeled frames and adjusting the cropped region. FER-2013
contains 28,709 training images, 3,589 validation images and 3,589 test images with seven
expression labels (anger, disgust, fear, happiness, sadness, surprise and neutral).
Machine learning is one of the most exciting areas of technology at the moment. Every day
we see stories that herald new breakthroughs in facial recognition technology,
self-driving cars or computers that can hold a conversation just like a person. Machine
learning technology is set to revolutionise almost every area of human life and work. The
primary reason for using machine learning is to automate complex tasks
and to analyze the variety and complexity of data.
Deep learning, or deep machine learning, is a branch of machine learning that takes data
as input and makes intuitive and intelligent decisions using artificial neural networks
stacked layer-wise. It is being applied in various domains for its ability to find patterns
in data, extract features and generate intermediate representations.
1.4 Conclusion:
Deep learning, or deep machine learning, is a branch of machine learning that takes data
as input and makes intuitive and intelligent decisions using artificial neural networks
stacked layer-wise. It is being applied in various domains for its ability to find patterns
in data, extract features and generate intermediate representations.
CHAPTER 2. DEEP LEARNING Awatef MESSAOUDI
Chapter
2
Deep learning
2.1 Introduction:
Deep learning is a subset of machine learning which uses neural networks to analyze
different factors with a structure that is similar to the human neural system. It uses
complex multi-layered neural networks, where the level of abstraction increases gradually
through nonlinear transformations of the input data.[8] It concerns algorithms inspired by
the structure and function of the brain. They can learn several levels of representation in
order to model complex relationships between data.
Machine learning algorithms work well for a wide variety of problems. However, they have
failed to solve some major AI problems such as speech, face and emotion recognition.
• Train and evaluate model performance (for different algorithms, evaluate and select
the best-performing model).
Most features must be determined by an expert and then encoded as a data type.
Features can be pixel values, shapes, etc. The performance of machine learning
algorithms depends on the accuracy of the extracted features. Deep learning reduces
the task of developing new feature extractors by automating the phase of extracting
and learning features.[10] Deep learning uses neural networks to learn feature
representations directly from data.
An artificial neural network (ANN) is a computing model that tries to mimic the human brain
in a very primitive way, emulating the capabilities of human beings in a very limited
sense. ANNs have been developed as a generalization of mathematical models of human
cognition or neural biology. An ANN takes an input vector X and produces an output vector
Y. The relationship between X and Y is determined by the network architecture.[12] An
ANN is a parallel, distributed information processing network. It consists of a number
of information processing elements called neurons or nodes, which are grouped in layers.
The processing elements of the input layer receive the input vector and transmit the values
to the next layer of processing elements across connections, where this process continues.
This type of network, where data flows one way (forward), is known as a feedforward network.
A feedforward ANN has an input layer, an output layer and one or more hidden layers
between the input and the output layers. Each neuron in a layer is connected to
all the neurons of the next layer, and the neurons in one layer are connected only to the
neurons of the immediately following layer. The strength of the signal passing from one
neuron to another depends on the weight of the interconnection. The hidden layers enhance
the network's ability to model complex functions. The performance of the BPANN
(back-propagation artificial neural network) model was compared with the developed linear
transfer function (LTF) model and was found to be superior.
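To make the feedforward description concrete, here is a minimal sketch of a forward pass through fully connected layers. It is an illustration under stated assumptions, not the network used in this work: the layer sizes, the weight layout (`weights[j][i]` is the weight from input i to neuron j) and the function names are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic activation, one common choice for hidden neurons."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """One layer: each neuron sums its weighted inputs, adds a bias, activates."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def feed_forward(x, layers):
    """Propagate the input vector X forward, layer by layer, to the output Y.

    `layers` is a list of (weights, biases, activation) tuples.
    """
    for weights, biases, activation in layers:
        x = dense_layer(x, weights, biases, activation)
    return x
```

Each neuron only reads the outputs of the previous layer, which is exactly the one-way data flow described above; training (for example by back-propagation) would adjust `weights` and `biases`.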
2.4.1 Presentation:
The convolutional neural network (CNN) is a type of artificial neural network proposed by
Yann LeCun in 1988. CNNs are one of the most popular deep learning architectures for
image classification, recognition and segmentation. A CNN consists of multiple hierarchical
hidden layers. Its artificial neurons take input from the image, multiply it by weights, add
a bias and then apply an activation function. In this way, artificial neurons can be used
for image classification, recognition and segmentation by performing simple convolutions,
provided the convolutional neural network is fed with a huge amount of data.[13]
2.4.2 Architecture:[13][12]
Convolutional neural networks are among the most effective models for classifying image data.
They were inspired by the mammalian visual cortex.[10] Each CNN channel is made up of
convolutional layers, max pooling layers, fully connected layers and an output layer.[14]
The convolution layer is the first layer that extracts features from an input image.[12]
During the forward pass, each filter is convolved across the width and height of the
input volume, computing dot products between the entries of the filter and the input
at every position. As the filter slides over the width and height of the input volume,
it produces a 2-dimensional activation map that gives the responses of the filter at every
spatial position. Each layer has an entire set of filters, and each of them produces a
separate 2-dimensional activation map.[16] The 2D convolution between an image A and a
filter B can be given as:
$$C(i,j) = \sum_{m=0}^{M_a-1} \sum_{n=0}^{N_a-1} A(m,n)\, B(i-m,\, j-n)$$
where A is of size $M_a \times N_a$, B is of size $M_b \times N_b$, and
$0 \le i < M_a + M_b - 1$, $0 \le j < N_a + N_b - 1$.
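The convolution sum above can be checked with a direct, unoptimized sketch in pure Python. The function name `conv2d_full` is hypothetical, and treating B as zero outside its bounds is the usual convention for a "full" convolution rather than something stated in the text.

```python
def conv2d_full(A, B):
    """Full 2D convolution C(i, j) = sum_m sum_n A(m, n) * B(i - m, j - n).

    A is Ma x Na, B is Mb x Nb; the result is (Ma + Mb - 1) x (Na + Nb - 1).
    """
    Ma, Na = len(A), len(A[0])
    Mb, Nb = len(B), len(B[0])
    C = [[0] * (Na + Nb - 1) for _ in range(Ma + Mb - 1)]
    for i in range(Ma + Mb - 1):
        for j in range(Na + Nb - 1):
            total = 0
            for m in range(Ma):
                for n in range(Na):
                    # B(i - m, j - n) counts as zero outside the filter.
                    if 0 <= i - m < Mb and 0 <= j - n < Nb:
                        total += A[m][n] * B[i - m][j - n]
            C[i][j] = total
    return C
```

Note that deep learning libraries typically compute the unflipped variant (cross-correlation) in their convolution layers; since the filter values are learned, the two are equivalent in practice.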
A CNN learns the values of these filters on its own during the training process (although
parameters such as the number of filters, the filter size and the architecture of the network
still need to be specified before training). The more filters we use, the more
image features are extracted and the better the network becomes. Three parameters control
the size of the feature map (convolved feature):
• Depth: corresponds to the number of filters we use for the convolution operation.
• Zero padding: it is convenient to pad the input matrix with zeros around the
border, so that the filter can be applied to the bordering elements of the input image matrix.
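How these parameters determine the feature map size can be sketched with the standard output-size formula, (W − F + 2P) / S + 1 per spatial dimension. The stride S is an assumption here: it is the usual remaining parameter in this formula but is not listed above, and the function name is hypothetical.

```python
def feature_map_shape(height, width, kernel, depth, stride=1, padding=0):
    """Shape (H, W, depth) of a convolution output.

    Uses the standard formula (size - kernel + 2 * padding) // stride + 1;
    `stride` is an assumed parameter, not taken from the list above.
    """
    out_h = (height - kernel + 2 * padding) // stride + 1
    out_w = (width - kernel + 2 * padding) // stride + 1
    return out_h, out_w, depth
```

For example, a 48×48 input convolved with 64 filters of size 3×3 gives a 46×46×64 volume without padding and keeps 48×48×64 with a zero padding of 1.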
An additional operation, called the ReLU layer, is used after every convolution operation.
A rectified linear unit applies an activation function whose output is:
F(x) = max(0, x). Other nonlinear functions, such as tanh or sigmoid,
can also be used instead of ReLU. Most data scientists use ReLU since, performance-wise,
it is better than the other two.[17]
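The three activation functions mentioned here are short enough to write out. This is a plain illustration using only the Python standard library; the function names are chosen for this sketch.

```python
import math


def relu(x):
    """Rectified linear unit: F(x) = max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent, with outputs in (-1, 1)."""
    return math.tanh(x)


# ReLU zeroes negative activations and leaves positive ones unchanged.
rectified = [relu(v) for v in [-1.0, 0.0, 4.0]]
```

Unlike sigmoid and tanh, ReLU does not saturate for large positive inputs, which is one commonly cited reason for its better training behaviour.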
been shown to work better. Max pooling reduces the input by applying the max function
over the input Xi. Let m be the size of the filter; the output is then calculated as follows:
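As a hedged illustration of max pooling with a filter of size m, the sketch below slides an m×m window over a 2D feature map and keeps the maximum of each window. Using a stride equal to m (non-overlapping windows) is an assumption, and the function name is hypothetical.

```python
def max_pool2d(feature_map, m, stride=None):
    """Max pooling over m x m windows of a 2D feature map.

    The stride defaults to m (non-overlapping windows), which is an
    assumption made for this sketch.
    """
    stride = stride or m
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + dr][c + dc]
                 for dr in range(m) for dc in range(m))
             for c in range(0, cols - m + 1, stride)]
            for r in range(0, rows - m + 1, stride)]
```

With m = 2, a 4×4 feature map is reduced to 2×2, so each pooling stage divides the number of values by four.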
In the end, a feature vector (or CNN code) concatenates the output information
into a single vector and feeds it into a fully connected layer (multilayer perceptron). The
term « fully connected » indicates that every neuron in the previous layer is connected to
every neuron in the next layer. The outputs of the convolutional and pooling layers represent
high-level features of the input image. The purpose of the fully connected layer is to use
these features to classify the input image into various classes based on the training
dataset.
where j is the image index and K is the total number of facial expression classes. ReLU
is an activation function which eliminates all negative values.
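A common way for the output layer to turn K class scores into class probabilities is the softmax function. Whether this is the exact formula intended here is an assumption; the sketch below only illustrates the general technique, with K = 7 for the seven expression classes.

```python
import math


def softmax(scores):
    """Convert K raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for K = 7 expression classes.
probabilities = softmax([0.2, -1.0, 0.5, 3.1, 0.0, 1.2, 0.7])
predicted_class = probabilities.index(max(probabilities))
```

The predicted expression is the class with the highest probability, here index 3.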
In recent years, CNN architectures have evolved considerably. These networks have
become so deep that it is extremely difficult to visualise the entire model.
2.5.2 AlexNet(2012):
2.5.3 VGG-16(2014):
With this architecture, we notice that CNNs were starting to get deeper and deeper.
This is because the most straightforward way of improving the performance of deep neural
networks is by increasing their size. VGG-16 has 13 convolutional and 3 fully connected
layers, carrying with them the ReLU tradition from AlexNet. It has 138M
parameters and takes about 500 MB of storage space.
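The storage figure follows from the parameter count by simple arithmetic. The sketch below assumes 32-bit floating-point weights (4 bytes per parameter), which is a common default but is not stated in the text.

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes).

    Assumes every parameter is stored as a 32-bit float by default.
    """
    return num_params * bytes_per_param / 1e6


vgg16_size = model_size_mb(138_000_000)  # 552.0 MB under this assumption
```

Under this assumption the weights alone take about 552 MB (roughly 526 MiB), which is of the same order as the figure quoted above.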
2.5.4 Inception-v1(2014):
This 22-layer architecture with 5M parameters is called Inception-v1. The design of
the architecture of an Inception module is a product of research on approximating sparse
structures.
2.5.5 ResNet-50(2015):
In the past few CNNs, we have seen nothing but an increasing number of layers in
the design, achieving better performance. But as the network depth increases,
accuracy gets saturated and then degrades rapidly. Researchers at Microsoft Research
addressed this problem with ResNet, using skip connections while building deeper models.
ResNet is one of the early adopters of batch normalisation, with 26M parameters.
2.5.6 Xception(2016):
Xception is an adaptation of Inception in which the Inception modules have been replaced
with depthwise separable convolutions. It has roughly the same number of parameters
as Inception-v3 (23M).
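The benefit of depthwise separable convolutions can be seen by counting weights. The sketch below compares a standard k×k convolution with its depthwise separable counterpart (a k×k depthwise step followed by a 1×1 pointwise step). Biases are ignored, and the channel counts in the example are hypothetical.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights of a depthwise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernels, 256 input channels, 512 output channels.
standard = standard_conv_params(3, 256, 512)    # 1,179,648 weights
separable = separable_conv_params(3, 256, 512)  # 133,376 weights
```

For this example layer the separable version needs almost nine times fewer weights, which lets such an architecture spend its parameter budget on more layers or wider feature maps.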
2.6 Conclusion:
In this chapter, we have presented neural networks and their different architectures.
We focused on CNNs, their structure and their different layers, and then presented a
few example architectures. In the next chapter, we will explain the idea behind the
architecture that we have chosen for our facial expression recognition system.
CHAPTER 3. FER: SYSTEM DESIGN Awatef MESSAOUDI
Chapter
3
FER: system design
3.1 Introduction:
Despite the notable success of traditional facial recognition methods based on the
extraction of handcrafted features, over the past decade researchers have turned to the
deep learning approach because of its high automatic recognition capacity. The goal of our
project is to use a convolutional neural network (CNN) to recognize facial expressions.
In this chapter, we present a facial emotion recognition system based on CNNs.
This system detects the face of a person in an image, a video sequence or
a camera stream, and determines the expression, with an associated accuracy rate, among the
six universal expressions (happiness, disgust, fear, anger, sadness, surprise) and neutral.
Our research on facial expression recognition has led us to note that all the solutions
proposed for emotion recognition are structured according to the same overall
architecture, in three main modules: face detection, feature extraction and classification.
These will be the principal modules of our system.
The effectiveness of the system depends on the method used to locate the face in the
picture. We use the Viola-Jones algorithm to detect various parts of the human
face such as the eyes, nose, eyebrows, mouth, lips and ears. This algorithm
uses Haar-like features in a cascade classifier, which can efficiently
combine many features and determine the different filters of the resulting classifier.
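The Haar-like features used by Viola-Jones are differences of pixel sums over adjacent rectangles, and the algorithm evaluates them quickly through an integral image. The sketch below illustrates only this building block, not the cascade itself or the detector used in this work; the function names and the choice of a two-rectangle (left minus right) feature are illustrative.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column.

    ii[r][c] holds the sum of img over rows < r and columns < c.
    """
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ii[r + 1][c + 1] = (img[r][c] + ii[r][c + 1]
                                + ii[r + 1][c] - ii[r][c])
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle, using four lookups in the table."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_haar(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))
```

Because every rectangle sum costs four table lookups regardless of its size, a cascade can evaluate many such features at many scales in real time.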
Once the face is detected, the system starts the process of extracting features, which
converts pixel data into a smaller representation to be used in the classification process.
This step reduces the size of the input image while keeping the most useful data. This
process is based on a convolutional neural network (CNN). In our project, we chose to
experiment with two architectures, a CNN and VGG16. These architectures were proposed for
image classification; in our case, we test them for facial emotion recognition.
Then, we apply batch normalization to the output. This operation normalizes and
standardizes the output data and puts all the data points on the same scale. After that,
an activation function maps a node's inputs to its corresponding output. Two common
activation functions are ReLU and sigmoid. In our case, we use the ReLU
function because it is one of the most widely used activation functions today; it
sets negative inputs to zero and leaves positive inputs unchanged. Finally, we apply a max
pooling operation. When we add max pooling after a convolutional layer, the dimensionality
of the image is reduced by reducing the number of pixels in the output.
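The normalization and activation steps of this block can be sketched on a small list of values. This is a simplified illustration: real batch normalization works per channel over a mini-batch and learns `gamma` and `beta` during training, whereas here they are fixed default values chosen for the sketch.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize values to zero mean and unit variance, then scale and shift.

    gamma and beta are learned in practice; fixed defaults are assumed here.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def relu(values):
    """Set negative values to zero and keep positive values unchanged."""
    return [max(0.0, v) for v in values]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
activations = relu(normalized)
```

After normalization the values are centred on zero, so ReLU zeroes the lower half and passes the upper half through unchanged.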
In the second architecture, we focus on the impact of adding more layers to
our system. We added two convolutional layers to our network, the first with 256 filters of
size (3*3) and the second with 512 filters of size (3*3). The choice of the number of output
filters is arbitrary, and the chosen kernel size of 3x3 is a very common one. The deeper the
network goes, the more sophisticated the filters become. In later layers, the filters may be
able to detect specific objects.
layers. It takes an image of size 224*224*3 as input and uses only 3*3 convolutions
and 2*2 pooling.
This model has demonstrated that the depth of the network is beneficial for
classification accuracy. However, the VGG16 network has two major drawbacks:
• It is slow to train.
• Due to its depth and its number of fully connected nodes, VGG16 exceeds 533
MB in size.
3.3 Conclusion:
In this chapter we have introduced two models (a CNN network and the VGG16 network).
These architectures are used for image classification, and we noted that each
architecture has specific characteristics. In the next chapter, we will present the
implementation of these two models and test them in order to assess the performance
of a deep learning FER system, and we will compare the different results.
CHAPTER 4. IMPLEMENTATION AND RESULTS: Awatef MESSAOUDI
Chapter
4
Implementation and results:
4.1 Introduction:
The goal of our project is to design and implement an application which allows us
to recognize facial expressions. We are interested in the application of deep learning
models. In this chapter, we present the development environment and the various tools used,
as well as the implementation of the different models. Finally, we
present the results obtained.
4.2.1 Python3.7:[18]
ever since) and with several film spin offs. You will even find various references to this
show in the documentation available with Python.
As a language it has gained in interest over recent years, particularly within the
commercial world, with many people wanting to learn the language. This increased
interest in Python is driven by several different factors:
• Its use by the Data Science community where it provides a more standard
programming language than some rivals such as R.
• Its suitability as a scripting language for those working in the DevOps field where it
provides a higher level of abstraction than alternative languages traditionally used.
• Its ability to run on (almost) any operating system, but particularly the big three
operating systems: Windows, Mac OS and Linux.
• The availability of a wide range of libraries (modules) that can be used to extend
the basic features of the language.
4.2.2 OpenCv:[19]
4.2.3 Tensorflow:[20]
TensorFlow, originally created by researchers at Google, is the most popular one among
the plethora of deep learning libraries. In the field of deep learning, neural networks
have achieved tremendous success and gained wide popularity in various areas. This
family of models also has tremendous potential to promote data analysis and modeling for
various problems in educational and behavioral sciences given its flexibility and scalability.
We give the reader an overview of the basics of neural network models such as the
multilayer perceptron, the convolutional neural network, and stochastic gradient descent,
the most commonly used optimization method for neural network models. However,
the implementation of these models and optimization algorithms is time-consuming and
error-prone. Fortunately, TensorFlow greatly eases and accelerates the research and
application of neural network models.
4.2.4 Keras:
Keras is a neural network API written in Python and fully integrated with TensorFlow.
Among other things, it can be used to visualize the filters of convolutional layers.
4.2.5 Numpy:[21]
NumPy is the fundamental package for scientific computing with Python. Its features are
as follows:
• It has a powerful custom N-dimensional array object for efficient and convenient
representation of data.
• It has tools for integration with other programming languages used for scientific
programming like C/C++ and FORTRAN.
• It is used for mathematical operations like linear algebra, matrix operations, image
processing, and signal processing.
4.2.6 Sklearn:
Scikit-learn is a Python library which provides simple and efficient tools for data analysis.
It has the following major modules:
• Regression.
• Classification.
• Model selection.
• Preprocessing.
4.2.7 Matplotlib:[21]
4.3 Database:
For the best performance, the network should be trained on a large number of sample
images, which improves the accuracy of the model. Unfortunately, very large facial
expression data sets are not publicly available, but we have access to two public
databases (FER2013 and CK+). To implement our system, we use the FER2013
database downloaded from the Kaggle challenge on FER. This dataset consists of 35,887
labeled images, divided into 28,709 training images, 3,589 validation images and 3,589
test images. It contains 48×48 pixel grayscale images that vary in viewpoint, lighting
and scale. Our objective is to classify each facial image into one of seven facial emotion
categories: anger, disgust, fear, happiness, sadness, surprise and neutral.
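As a minimal sketch of how such a dataset can be read, the snippet below parses rows in the layout of the Kaggle fer2013.csv file (an emotion label, a pixels string of 48×48 space-separated grey levels, and a Usage column). The two-row in-memory sample and the helper name load_fer_rows are illustrative assumptions, not the code used in this project.

```python
import csv
import io

import numpy as np

IMG_SIZE = 48  # FER2013 images are 48x48 grayscale


def load_fer_rows(csv_text):
    """Parse FER2013-style CSV text into per-split image and label arrays."""
    splits = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        pixels = np.array([int(p) for p in row["pixels"].split()], dtype=np.uint8)
        image = pixels.reshape(IMG_SIZE, IMG_SIZE)
        images, labels = splits.setdefault(row["Usage"], ([], []))
        images.append(image)
        labels.append(int(row["emotion"]))
    return {name: (np.stack(imgs), np.array(lbls))
            for name, (imgs, lbls) in splits.items()}


# Hypothetical two-row sample standing in for the real file.
fake_pixels = " ".join(["128"] * (IMG_SIZE * IMG_SIZE))
sample_csv = (
    "emotion,pixels,Usage\n"
    f"3,{fake_pixels},Training\n"
    f"0,{fake_pixels},PublicTest\n"
)
data = load_fer_rows(sample_csv)
train_images, train_labels = data["Training"]
print(train_images.shape, train_labels)
```

With the real file, the same function could be called on the text read from disk; the split names would then be those found in the Usage column.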
4.4 Implementation:
The Facial Emotion Recognition system consists of two modules: the first one performs
face detection and the second one performs emotion recognition.
First of all, we import the libraries required to build the network.
Our fer2013 dataset contains 4953 angry faces, 547 disgust faces, 5121 fear faces, 8989
happy faces, 6077 sad faces, 4002 surprise faces and 6198 neutral faces.
We can see that our fer2013 dataset is unbalanced, and this has an impact on the
performance of our model, as we will see in the next section.
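The imbalance can be quantified with a few lines of Python. The sketch below takes the class counts quoted above, reading the 547 figure as the disgust class and assuming 6198 neutral images so that the total equals the 35,887 images of FER2013. The balanced class weights it computes are one common remedy for imbalance; they are shown for illustration only and are not part of the method described here.

```python
# Class counts as quoted in the text. The neutral count is taken as 6198
# (an assumption) so that the total matches the 35,887 images of FER2013.
counts = {
    "angry": 4953, "disgust": 547, "fear": 5121, "happy": 8989,
    "sad": 6077, "surprise": 4002, "neutral": 6198,
}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the dataset.
shares = {name: n / total for name, n in counts.items()}

# "Balanced" weights: total / (n_classes * count). Rare classes get weights > 1.
class_weights = {name: total / (n_classes * n) for name, n in counts.items()}

for name in counts:
    print(f"{name:9s} share={shares[name]:.3f} weight={class_weights[name]:.2f}")
```

The happy class is roughly sixteen times larger than the disgust class, which is why the rarest class receives by far the largest weight.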
After loading the data, we add more data to the training set to improve our results.
Data augmentation consists of flipping the image horizontally or vertically, rotating
it, zooming in or out, cropping, or varying the color. With the Keras
ImageDataGenerator() class, we create new data from our existing data set using data
augmentation.
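To make the idea concrete, here is a minimal NumPy sketch of two of these transformations, a random horizontal flip and a random shift. It is a stand-in written for illustration and does not reproduce the behaviour of ImageDataGenerator(); the function name augment and the max_shift parameter are assumptions.

```python
import numpy as np


def augment(image, rng, max_shift=4):
    """Return a randomly flipped and shifted copy of a 2-D grayscale image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)  # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.zeros_like(out)
    h, w = out.shape
    # Copy the overlapping region; uncovered pixels stay black.
    src = out[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted


rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(48, 48), dtype=np.uint8)  # stand-in image
batch = np.stack([augment(face, rng) for _ in range(8)])
print(batch.shape)  # (8, 48, 48)
```

Each call produces a slightly different variant of the same face, so the network sees more varied examples than the original training set contains.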
The input images may vary in illumination, size and colour, so we apply some
preprocessing operations to the images to obtain more accurate and faster results from
the algorithm. These operations consist of normalization, grayscale conversion and
resizing.
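The three operations can be sketched in NumPy as follows. This is an illustrative version under stated assumptions: the luma weights are the standard ITU-R BT.601 coefficients, resizing uses simple nearest-neighbour sampling, and the function names are hypothetical. The project itself may rely on OpenCV for these steps.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB image to (H, W) using ITU-R BT.601 luma weights."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114


def resize_nearest(image, size=48):
    """Resize a 2-D image to (size, size) with nearest-neighbour sampling."""
    h, w = image.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows[:, None], cols]


def preprocess(rgb, size=48):
    """Grayscale, resize and scale pixel values to [0, 1]."""
    gray = to_grayscale(rgb.astype(np.float32))
    small = resize_nearest(gray, size)
    return np.clip(small / 255.0, 0.0, 1.0).astype(np.float32)


rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(96, 120, 3), dtype=np.uint8)  # stand-in input
x = preprocess(photo)
print(x.shape, x.dtype)
```

Scaling the pixel values to the range [0, 1] keeps the inputs on a scale that is convenient for gradient-based training.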
To build the CNN architecture, we first create a variable named model, which is an
instance of a Sequential object. In our case, we pass a list of layers to the Sequential
constructor. Our first model contains two convolutional layers, two fully connected
layers and the output layer.
Conv -> BN -> Activation -> MaxPooling -> Dropout
Conv -> BN -> Activation -> MaxPooling -> Dropout
Flatten
Dense -> BN -> Activation -> Dropout
Dense -> BN -> Activation -> Dropout
Output layer
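To show what one such block computes, the following NumPy sketch runs a single-channel forward pass through Conv -> BN -> Activation -> MaxPooling -> Dropout. It is deliberately simplified and rests on assumptions not stated above: a 3×3 kernel with zero "same" padding, ReLU as the activation, 2×2 pooling, a dropout rate of 0.25, and a normalization computed over one feature map rather than over a mini-batch with learned scale and shift.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Single-channel 2-D cross-correlation with zero 'same' padding (odd kernel)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


def batch_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def relu(x):
    """Replace negative values with zero."""
    return np.maximum(x, 0.0)


def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (input sides must be even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)


rng = np.random.default_rng(0)
image = rng.random((48, 48))          # stand-in for one input image
kernel = rng.standard_normal((3, 3))  # stand-in for one learned filter

features = dropout(max_pool_2x2(relu(batch_norm(conv2d_same(image, kernel)))), 0.25, rng)
print(features.shape)  # (24, 24)
```

In a real network each convolutional layer holds many such filters and dropout is only active during training.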
The first model is not able to classify even the data it was trained on: it is
underfitting. To reduce this underfitting, we increased the number of layers and
the number of neurons. The learning rate in this architecture was 0.0001; we also
experimented with another value to see the effect of the learning rate on our model.
The second model contains four convolutional blocks, as shown below:
Conv -> BN -> Activation -> MaxPooling -> Dropout
Conv -> BN -> Activation -> MaxPooling -> Dropout
Conv -> BN -> Activation -> MaxPooling -> Dropout
Conv -> BN -> Activation -> MaxPooling -> Dropout
Flatten
Dense -> BN -> Activation -> Dropout
Dense -> BN -> Activation -> Dropout
Output layer
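Stacking more blocks shrinks the feature maps quickly. As a rough check, the sketch below tracks the spatial size of a 48×48 input through the blocks, assuming "same" padding in the convolutions and 2×2 max pooling with stride 2. Neither setting is stated above, so both are assumptions.

```python
def block_output_sizes(size, n_blocks, pool=2):
    """Spatial size after each ('same' conv -> pool x pool max pooling) block."""
    sizes = [size]
    for _ in range(n_blocks):
        size = size // pool  # 'same' convolution keeps the size; pooling divides it
        sizes.append(size)
    return sizes


print(block_output_sizes(48, 2))  # [48, 24, 12]
print(block_output_sizes(48, 4))  # [48, 24, 12, 6, 3]
```

Under these assumptions, the two-block model feeds 12×12 feature maps to the Flatten layer, while the four-block model feeds 3×3 maps.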
The aim of our project is to experiment with several architectures in order to find the
most accurate one. The second architecture was inspired by the VGG model. It consists
of several convolution layers using filters of different sizes, each followed by batch
normalization (BatchNormalization) and then a ReLU activation to eliminate negative
values. At the end, we added a global average pooling layer and applied the softmax
function to compute the probability of each of the 7 classes of expressions (the six
universal expressions and the neutral state). The network therefore returns a vector of
size 7, which contains the probability of belonging to each of the classes.
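The last two operations are simple enough to write out directly. The NumPy sketch below averages each feature map to a single value and turns the resulting 7 scores into probabilities. The random feature maps and the one-map-per-class layout are hypothetical inputs chosen for illustration.

```python
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]


def global_average_pooling(feature_maps):
    """Average each (H, W) feature map to one value: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))


def softmax(logits):
    """Numerically stable softmax over a 1-D vector."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((3, 3, 7))  # hypothetical: one map per class
probs = softmax(global_average_pooling(feature_maps))
predicted = EMOTIONS[int(np.argmax(probs))]
print(probs.round(3), predicted)
```

The predicted expression is the class with the highest probability, and the 7 probabilities always sum to 1.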
The next step is to train the model on the data. During training, the model learns which
value to assign to each weight, based on how incremental changes to the weights affect
the loss function. In the compile() function, we specify the optimizer, Adam (a variant
of SGD), the loss function and the metrics. The learning rate is specified in the Adam
constructor; in our case, we chose a learning rate of 0.001. Finally, we fit the model to
the data with the model.fit() function, which trains the model.
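The loss and the metric drive everything that happens during fitting, so it helps to see what they compute. The sketch below implements categorical cross-entropy and accuracy in NumPy on three hypothetical predictions. Categorical cross-entropy is assumed here because it is the usual choice for a 7-class softmax output; the text above does not name the loss that was actually used.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))


def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class is the labelled one."""
    return float(np.mean(y_true.argmax(axis=1) == y_prob.argmax(axis=1)))


# Three hypothetical samples over 7 classes.
labels = np.array([3, 0, 6])
y_true = np.eye(7)[labels]
y_prob = np.full((3, 7), 0.05)
y_prob[0, 3] = 0.70   # confident and correct
y_prob[1, 0] = 0.70   # confident and correct
y_prob[2, 4] = 0.70   # confident but wrong (true class is 6)

print(categorical_cross_entropy(y_true, y_prob), accuracy(y_true, y_prob))
```

The single confident mistake contributes most of the loss, which is what pushes the optimizer to correct it.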
4.4.4.1 Optimization:
When we train this model, we are essentially solving an optimization problem. Starting
from arbitrarily initialized weights, our task is to find the weights that most accurately
map the input data to the correct output class. During training, these weights are
updated and saved in an .h5 file. The weights are optimized using an optimization
algorithm. In our algorithm, we used the SGD (stochastic gradient descent) optimizer.
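The update rules themselves are short. The following sketch applies plain gradient descent and Adam to a toy two-parameter loss whose minimum is known, to show how each rule moves the weights. The toy loss, the learning rate of 0.05 and the number of steps are illustrative assumptions; in the real network the gradient is computed by backpropagation on mini-batches of images.

```python
import numpy as np

TARGET = np.array([1.0, -2.0])  # minimum of the toy loss


def loss(w):
    """Toy quadratic loss with its minimum at TARGET."""
    return float(np.sum((w - TARGET) ** 2))


def grad(w):
    """Gradient of the toy loss."""
    return 2.0 * (w - TARGET)


def sgd_step(w, g, lr):
    """Plain gradient descent update."""
    return w - lr * g


def adam_step(w, g, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the moment estimates and the step count."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)


w_sgd = np.zeros(2)
w_adam = np.zeros(2)
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
for _ in range(500):
    w_sgd = sgd_step(w_sgd, grad(w_sgd), lr=0.05)
    w_adam = adam_step(w_adam, grad(w_adam), state, lr=0.05)

print(loss(np.zeros(2)), loss(w_sgd), loss(w_adam))
```

Both rules drive the loss from its starting value towards zero; Adam differs in that it rescales each step using running averages of the gradient and of its square.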
For the face detection stage, we chose the method proposed by Paul Viola and Michael
Jones. In our system, we opted for Haarcascade.xml from the OpenCV library, which
provides the Haar cascade method.
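The Viola-Jones method is fast because Haar-like features are evaluated on an integral image, where the sum of any rectangle costs four lookups. The NumPy sketch below illustrates that idea on a tiny patch. It is not OpenCV's implementation, and the function names are assumptions made for the example.

```python
import numpy as np


def integral_image(image):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]


def two_rect_feature(ii, top, left, height, width):
    """Haar-like edge feature: left half minus right half (width must be even)."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))


# Stand-in 4x4 patch: bright left half, dark right half.
patch = np.array([[9, 9, 1, 1]] * 4)
ii = integral_image(patch)
print(rect_sum(ii, 0, 0, 4, 4), two_rect_feature(ii, 0, 0, 4, 4))  # 80 64
```

A large feature value signals a vertical edge in the patch; a cascade classifier combines many such features, evaluated at different positions and scales, to decide whether a window contains a face.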
4.5 Results:
After the training and validation phase, we observe that our system is able to detect
emotions. During the training phase, we save a file.h5 and record the accuracy and loss
obtained at each iteration. One of the main things to avoid when training a model is
overfitting: the model fits the training data well but is not able to make accurate
predictions for data it has not seen before. To test whether our models are
overfitting, we used a simple form of cross-validation (hold-out validation): we split
our data into two parts, a training set (to train the model) and a validation set (to
evaluate the model's performance). If we increase the number of layers, the accuracy
decreases.
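A hold-out split of this kind can be sketched in a few lines of NumPy. The validation fraction of 10% and the blank stand-in images are assumptions made for the example; FER2013 already ships with its own training, validation and test partitions.

```python
import numpy as np


def train_val_split(images, labels, val_fraction=0.1, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_val = int(round(len(images) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (images[train_idx], labels[train_idx]), (images[val_idx], labels[val_idx])


# Stand-in data: 100 blank 48x48 images with labels 0..6.
images = np.zeros((100, 48, 48), dtype=np.float32)
labels = np.arange(100) % 7
(train_x, train_y), (val_x, val_y) = train_val_split(images, labels)
print(train_x.shape, val_x.shape)  # (90, 48, 48) (10, 48, 48)
```

Because the two index sets are disjoint, the validation accuracy is measured only on images the model never saw during training.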
Accuracy is the training accuracy (maximum attained: 0.69, i.e. 69%). Loss is the
training loss (minimum value: 0.8). In the first architecture, with two convolutional
layers and a learning rate of 0.0001, the plot shows underfitting: the training
accuracy is low and the training loss is high.
For the next architectures, we chose a learning rate of 0.005.
An epoch refers to a single pass of the entire data set through the network during training.
The blue line denoting training accuracy increases steeply between epoch 0 and epoch 20,
then slows down and becomes stable around epoch 25. The red line denoting
validation accuracy varies somewhat randomly between the 5th and 10th epochs
and then becomes constant, which suggests that the classifier may be overfitting. The blue
line denoting training loss declines gradually from epoch 0 to 20 and continues with an
almost constant decline from epoch 20. The red line denoting the validation loss follows
the same variation, except for some fluctuations between epochs 5 and 10.
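Curves of this kind can also be inspected programmatically. The sketch below finds the epoch with the lowest validation loss and measures the final gap between validation and training loss, two simple indicators of overfitting. The loss values are hypothetical and are not the measured results of this project; the patience parameter is likewise an assumption.

```python
def best_epoch(val_losses, patience=3):
    """Return the epoch with the lowest validation loss, stopping early once
    the loss has not improved for `patience` consecutive epochs."""
    best, best_at = float("inf"), -1
    for epoch, value in enumerate(val_losses):
        if value < best:
            best, best_at = value, epoch
        elif epoch - best_at >= patience:
            break
    return best_at


# Hypothetical curves, not the measured results of this project.
train_loss = [1.8, 1.5, 1.3, 1.1, 0.9, 0.8, 0.7, 0.6]
val_loss = [1.8, 1.6, 1.4, 1.3, 1.3, 1.35, 1.4, 1.5]

stop = best_epoch(val_loss)
gap = val_loss[-1] - train_loss[-1]
print(stop, round(gap, 2))  # 3 0.9
```

A training loss that keeps falling while the validation loss stalls or rises, as in this made-up example, is the typical signature of overfitting.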
A confusion matrix is used to find which emotions are usually confused with each other.
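As a minimal sketch, a confusion matrix can be built in NumPy by counting (true class, predicted class) pairs. The labels and predictions below are hypothetical and serve only to show how the most frequent confusion is read from the matrix.

```python
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix


# Hypothetical labels and predictions, for illustration only.
y_true = np.array([3, 3, 3, 4, 4, 4, 6, 1, 0])
y_pred = np.array([3, 3, 3, 6, 6, 4, 6, 0, 0])
cm = confusion_matrix(y_true, y_pred, len(EMOTIONS))

# The largest off-diagonal entry is the most common confusion.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"{EMOTIONS[i]} is most often mistaken for {EMOTIONS[j]}")
```

The diagonal holds the correctly classified samples of each class, so dividing each diagonal entry by its row sum gives the per-class recall.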
4.6 Discussion
In our case, we notice that the happy emotion is the best detected, as it has the largest
number of examples. The sad, surprise, neutral and anger emotions are also detected well,
thanks to a sufficient number of examples. The fear and disgust emotions perform worse;
possible reasons are the smaller number of training examples and, for disgust, the
similarity of its features to those of anger. Sad faces are also often classified as
neutral, because the two expressions are hard to distinguish.
4.7 Conclusion:
In this chapter, we have presented a facial expression recognition system based on CNNs.
We presented the results obtained for each architecture. This system has been tested on
the FER2013 database from Kaggle.
CONCLUSION Awatef MESSAOUDI
Chapter 5
Conclusion
Deep learning plays an important role in our lives. It has a huge impact in many
areas such as cancer diagnosis, precision medicine, self-driving cars and speech
recognition.
Our work has shown that deep learning can be applied successfully to the task of
emotion recognition. Nevertheless, applying deep learning methods to emotion
recognition remains a challenge. One of the main challenges in emotion recognition, as
in most computer vision tasks, is dealing with the complexity of real-world scenarios.
This includes large variations in illumination and in the appearance of subjects, as well
as variations in gesture, pose and emotion when detection is performed in real time. In
addition, collecting data for this task requires more effort than for some other tasks
such as object recognition. This offers opportunities for research and improvement.
The long-term goal is to develop a human activity recognition system that
encompasses speech, gesture and facial expressions.
WEBOGRAPHY Awatef MESSAOUDI
Webography
[3] The Universally Recognized Facial Expressions of Emotion. url:
https://www.kairos.com/blog/the-universally-recognized-facial-expressions-of-emotion
(visited on 0003–2015).
[4] Facial Action Coding System. url:
https://www.paulekman.com/facial-action-coding-system/ (visited on 0012–2020).
[6] Facial Action Tracking (Face Recognition Techniques) Part 1. url:
http://what-when-how.com/face-recognition/facial-action-tracking-face-recognition-techniques-part-1/
(visited on 0012–2020).
[13] Artificial Intelligence vs. Machine Learning vs. Deep Learning: What’s the
Difference. url:
https://medium.com/ai-in-plain-english/artificial-intelligence-vs-machine-learning-vs-deep-learning-whats-the-difference-dccce18efe7f
(visited on 0004–2020).
BIBLIOGRAPHY Awatef MESSAOUDI
Bibliography
[1] MANUEL GRAÑA Andoni Beristain. “Emotion recognition based on the analysis
of Facial expressions : a survey”. In: New Mathematics and Natural Computation
05.6 (July 2009), pp. 513–534.
[2] Klaus Scherer. “What are emotions? And how can they be measured?”
In: Social Science Information 2005 SAGE Publications (London, Thousand Oaks,
CA and New Delhi) 44.04 (December 2005), pp. 695–729.
[5] Marco Fratarcangeli and Marco Schaerf. “Realistic Modeling of Animatable Faces
in MPEG-4”. In: Researchgate 01.04 (2010), pp. 1–11.
[7] KM.Pooja yoti Kumaria R.Rajesha. “Facial Expression Recognition : A survey”. In:
Second International Symposium on Computer Vision and the Internet 58.6 (August
2015), pp. 486 – 491.
[9] Afef Abdelkrim Nadia Jmour Sehla Zayen. “Convolutional neural networks for image
classification”. In: 2018 International Conference on Advanced Systems and Electric
Technologies (ICA SET ) 06.17843532 (June 2018), pp. 397–402.
[11] Shilpa Jain Dinesh Bisht and M. Mohan Raju. “Prediction of Water Table Elevation
Fluctuation through Fuzzy Logic Artificial Neural Networks”. In: International
Journal of Advanced Science and Technology 51.04 (March 2013), pp. 108–119.
[12] Lean Yu, Shouyang Wang, Kin Keung Lai. “Basic Learning Principles of Artificial
Neural Networks”. In: Foreign-Exchange-Rate Forecasting With Artificial Neural
Networks. 2007, pp. 27–37.
[14] IshapUnwala Xiakun Yang Lucy Nwosu Huiwang Jiang Lu and Ting Zhang.
“Deep Neural Network for facial expression recognition using Facial Part”.
In: 15th Intl Conf on Dependable, Autonomic and Secure Computing,
15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on
Big Data Intelligence and Computing and Cyber Science and Technology
Congress(DASC/PiCom/DataCom/CyberSciTech) 213.17661650 (April 2018),
pp. 1318–1321.
[15] Ahmed Jawad Kabir Shadman Sakib Nazib Ahmed and Hridon Ahmed. “An
Overview of Convolutional Neural Network: Its Architecture and Applications”.
In: Proceedings of the IEEE conference on computer vision 1.1 (November 2019),
pp. 1–5.
[19] Manoel Carlos Ramon. “Using OpenCV”. In: Intel® Galileo and Intel® Galileo Gen
2 02.01 (December 2014), pp. 319–400.
[20] Yingnian Wu Bo Pang Erik Nijkamp. “Deep Learning With TensorFlow: A Review”.
In: Journal of Educational and Behavioral Statistics 45.02 (September 2019),
pp. 1–22.
[23] Shan Li and Weihong Deng. “Deep Facial Expression Recognition: A Survey”. In:
IEEE Transactions on Affective Computing 1109.10 (March 2020), p. 99.
[24] Zhi Liu Yang Xin Lingshuang Kong. “Machine learning and deep learning method
for cybersecurity”. In: IEEE Access 06.17905844 (May 2018), pp. 35365 –35381.