
Emotion Recognition for an E-Learning Platform Using Deep Learning: A Comparison of Different Approaches

KODAD Mohammed
Informatique, Intelligence Artificielle et Cyber Sécurité (L2IAS)
ENSET Mohammedia, Université Hassan II de Casablanca
Mohammedia, Morocco
kodad.mohammed.me@gmail.com

ZBAIDA Achraf
Informatique, Intelligence Artificielle et Cyber Sécurité (L2IAS)
ENSET Mohammedia, Université Hassan II de Casablanca
Mohammedia, Morocco
zbaida.achraf@gmail.com

YOUSSFI Mohamed
Informatique, Intelligence Artificielle et Cyber Sécurité (L2IAS)
ENSET Mohammedia, Université Hassan II de Casablanca
Mohammedia, Morocco
m.youssfi@enset-media.ac.ma

BOUSSELHAM Abdelmajid
Informatique, Intelligence Artificielle et Cyber Sécurité (L2IAS)
ENSET Mohammedia, Université Hassan II de Casablanca
Mohammedia, Morocco
bousselham@enset-media.ac.ma

Abstract— This paper provides a brief summary of utilizing deep learning techniques to recognize emotions through facial expressions. Deep learning models, specifically Convolutional Neural Networks (CNNs), have gained significant popularity in accurately analyzing and understanding emotions from facial images.

By training these models on extensive datasets of labeled facial expression images, they can effectively learn and extract crucial features. CNNs excel at capturing spatial details from facial images.

The application of deep learning-based emotion recognition extends to various domains, including human-computer interaction, healthcare, and entertainment. Real-time emotion detection enables personalized interventions, adaptive content delivery, and the creation of emotionally captivating experiences, particularly in the context of E-Learning. However, challenges remain, such as the limited availability of diverse and well-annotated datasets and the need to account for variations in facial expressions across individuals and cultures.

Nevertheless, the integration of deep learning techniques for emotion recognition has the potential to revolutionize human-computer interaction, enhance user experiences, and foster more empathetic and adaptable technologies across different fields. Continuous research and advancements in deep learning approaches are expected to further refine the accuracy and reliability of emotion recognition systems based on facial expressions.

Keywords— emotion recognition, e-learning, facial expression, online machine learning, real-time, deep learning, dataset

I. INTRODUCTION

Deep learning has emerged as a powerful technology that can help machines understand complex patterns in data. One area where deep learning shows promise is in recognizing emotions, especially by analyzing facial expressions. Emotion recognition is important in how computers interact with humans, and it can greatly improve e-learning platforms.

E-learning platforms have become popular for their flexibility and accessibility in education. However, they often struggle to understand and respond to learners' emotions, which affects the effectiveness of personalized and engaging learning experiences.

To address this, researchers are using deep learning techniques to build emotion recognition systems. These systems analyze facial expressions using specialized neural networks like CNNs and RNNs to accurately identify and classify emotions in real time. This opens up exciting possibilities for integrating emotion recognition technology into e-learning platforms, making learning more personalized and adaptive.

This paper explores the potential of using deep learning and facial expression analysis for emotion recognition in e-learning platforms. We will discuss why emotion recognition is important for improving human-computer interaction and its specific benefits in e-learning. We will also explain the methods and techniques used in deep learning systems, focusing on how CNNs analyze facial expressions.

By integrating deep learning-based emotion recognition into e-learning platforms, we can gain real-time insights into learners' emotions, allowing for personalized support and customized content. Deep learning models can also create engaging learning materials that make the learning experience more enjoyable.

However, while integrating emotion recognition technology has many benefits, it's important to balance its use with respect for privacy and user autonomy. This paper will also discuss the ethical considerations and challenges associated with using emotion recognition in e-learning platforms.
In summary, this paper highlights the potential of deep learning techniques for emotion recognition and their application in e-learning platforms. By using facial expressions to understand learners' emotions, we can transform the way people interact with online education, creating a more empathetic, adaptable, and effective learning environment.

II. RESEARCH PROBLEM

In e-learning, the way teachers and students communicate is usually through written messages, which makes it difficult to express and understand emotions. Unlike in traditional classrooms, where students can show their feelings through their words and actions, e-learning lacks this ability.

Emotions are important in learning because they affect how motivated and engaged students are, as well as how well they remember information. That's why it's essential to include emotions in e-learning systems to make the learning experience better and improve results. By finding ways for students to express their emotions and for the system to recognize those emotions, e-learning can become more like face-to-face learning and help students feel more connected to what they're learning.

Researchers have looked into different ways to recognize emotions, such as by analyzing speech, body signals, and facial expressions. Facial expressions, in particular, have shown promise for recognizing emotions in e-learning. However, most studies have focused on analyzing pre-recorded data and not on recognizing emotions in real time during online learning.

One challenge in e-learning is figuring out whether students are satisfied and engaged. How students feel about their learning experience affects how motivated they are and how well they do. But the usual ways of measuring satisfaction, like feedback forms or surveys, can be subjective and may not capture the true emotions of students. That's why it's important to develop more accurate and objective ways to measure how satisfied and engaged students are in e-learning.

There's also a lack of research on recognizing emotions in real time using facial expressions in e-learning. This is an area that deserves more attention. Developing a system that can recognize emotions in real time using facial expressions would greatly improve the e-learning experience. This study aims to fill this research gap by creating an online system that uses machine learning to recognize emotions from facial expressions in real time. The system will be tested with a group of students in an e-learning environment to see how well it measures their satisfaction and engagement.

III. RELATED WORKS

The study [1] describes an algorithm for real-time emotion recognition using virtual markers, facial landmarks, and EEG signals. The study focused on physically disabled individuals and children with autism. The algorithm used CNN and LSTM classifiers to classify six facial emotions and EEG signals. The study involved fifty-five undergraduate students for facial emotion recognition and nineteen for collecting EEG signals. Virtual markers were placed on the subject's face, and the markers were tracked using an optical flow algorithm. The distance between the center of the subject's face and each marker position was used as a feature for facial expression classification, while the fourteen signals collected from the EEG signal reader were used for emotional classification.

The article [2] discusses the importance of facial recognition in various applications, such as security, identity verification, and database management systems. The article presents a deep learning algorithm for accurate facial recognition and identification, using Haar cascade detection and a convolutional neural network model. The proposed work includes three objectives: face detection, recognition, and emotion classification, using OpenCV, Python programming, and a dataset. An experiment was conducted to identify the emotions of multiple students, and the results demonstrate the efficacy of the face analysis system. Finally, the accuracy of the automatic face detection and recognition is measured.

The authors of [3] proposed a Convolutional Neural Network (CNN) based LeNet architecture for facial expression recognition. First, they merged three datasets (JAFFE, KDEF, and their own custom dataset). Then they trained the LeNet architecture for emotion state classification. In this study, they achieved an accuracy of 96.43% and a validation accuracy of 91.81% for the classification of 7 different emotions through facial expressions.

The authors of [4] present an approach to Facial Expression Recognition (FER) using Convolutional Neural Networks (CNN). This model, created using a CNN, can be used to detect facial expressions in real time. The system can be used for the analysis of emotions while users watch movie trailers or video lectures.

The authors of [5] discuss how computer-animated agents and robots can add a social dimension to human-computer interaction. Real-time face-to-face communication requires relying on sensory-rich perceptual primitives rather than slow symbolic inference processes, due to the high level of uncertainty at this time scale. The system presented in the paper detects frontal faces and codes them with respect to 7 dimensions in real time. It employs boosting techniques and SVM classifiers to enhance performance and has been tested on a dataset of posed facial expressions. The system's outputs change smoothly over time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner.

The authors of [6] compared five different methods for real-time emotion recognition from facial images, specifically for the four basic emotions of happiness, sadness, anger, and fear. Three of the approaches are based on convolutional neural networks (CNN) and two are conventional methods using Histogram of Oriented Gradients (HOG) features. The approaches compared are: AlexNet CNN, Affdex CNN, FER-CNN, an SVM using HOG features, and an MLP artificial neural network using HOG features. The paper presents the results of testing these methods in real time on a group of eight volunteers.
This paper [7] presents an advanced deep learning technique for emotion prediction through facial expression analysis. The proposed approach employs a two-stage convolutional neural network (CNN) model. The first CNN predicts the primary emotion of the input image as happy or sad, while the second CNN predicts the secondary emotion. The model was trained on the FER2013 and JAFFE datasets and achieved superior results compared to existing state-of-the-art methods for emotion prediction from facial expressions.

This paper [8] addresses the challenging task of real-time emotion recognition through facial expression in live video, using an automatic facial feature tracker for face localization and feature extraction. The extracted facial features are fed into a Support Vector Machine classifier to infer emotions. The paper presents the results of experiments evaluating the accuracy of the approach for various scenarios, including person-dependent and person-independent recognition. The findings show that the proposed method is effective in achieving fully automatic and unobtrusive expression recognition in live video. The paper concludes by discussing the significance of the research for affective and intelligent man-machine interfaces and suggesting possible future improvements.

This paper [9] focuses on the importance of analyzing users' facial expressions to improve the interaction between humans and machines. The paper proposes a method for extracting facial features and recognizing the user's emotional state that is robust to facial expression variations among different users. The method extracts facial animation parameters (FAPs) and uses a novel neuro-fuzzy system to analyze FAP variations both in the discrete emotional space and in the 2D continuous activation-evaluation space. The system can further learn and adapt to specific users' facial expression characteristics using clustering analysis. The paper reports experimental results on emotionally expressive datasets, indicating the good performance and potential of the proposed approach.

The aim of the study [10] is to develop predictive models that can classify emotions in real time from videos of workshop participants engaging with an educational robot. The authors combine the two best generalizing models (Inception-v3 and ResNet-34) to achieve better prediction accuracy. To test the approach, they apply the models to video data and analyze the predicted emotions based on the participants' gender, activities, and tasks. Statistical analysis reveals that female participants are more likely to show emotions in almost all activity types, and happiness is the most frequently predicted emotion for all activity types, regardless of gender. Additionally, programming is the activity type where the analyzed emotions were the most frequent. These findings highlight the potential of using facial expressions to improve teaching practices and understand student engagement.

IV. FOUNDATIONAL CONCEPTS

A. Machine Learning (ML) vs. Deep Learning: A Brief Comparison

Machine Learning (ML) and Deep Learning are both subfields of artificial intelligence (AI) that focus on training algorithms to learn patterns and make predictions or decisions. While they share similarities, there are key differences between ML and Deep Learning.

1) Machine Learning:
Machine Learning involves training algorithms to analyze and interpret data, and to make predictions or decisions based on patterns and statistical models. ML algorithms learn from labeled data and use features derived from that data to make predictions on new, unseen examples. ML algorithms can be broadly categorized into supervised, unsupervised, and reinforcement learning; a short illustration of the supervised case follows the descriptions below.

Supervised Learning: In supervised learning, algorithms are trained using labeled data, where each data point is associated with a corresponding label or outcome. The goal is to learn a mapping function that can predict labels for new, unseen data accurately. Examples of supervised learning algorithms include linear regression, decision trees, and support vector machines.

Unsupervised Learning: Unsupervised learning involves training algorithms on unlabeled data, without any predefined labels or outcomes. The algorithms learn to identify patterns, similarities, and structures within the data. Clustering algorithms and dimensionality reduction techniques are common examples of unsupervised learning.

Reinforcement Learning: Reinforcement learning involves training algorithms to make decisions or take actions in an environment to maximize a cumulative reward signal. The algorithms learn through trial and error, receiving feedback from the environment based on their actions. Reinforcement learning has been successful in applications such as game playing and robotics.
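To make the supervised setting concrete, here is a minimal scikit-learn sketch (a generic illustration on a toy dataset, not part of this paper's pipeline):

    # Minimal supervised-learning sketch: a decision tree learns a mapping
    # from labeled examples and is evaluated on unseen data.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)  # features X, labels y
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = DecisionTreeClassifier().fit(X_train, y_train)  # learn from labels
    print("accuracy on unseen examples:", clf.score(X_test, y_test))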
2) Deep Learning:
Deep Learning is a subset of ML that focuses on training deep neural networks with multiple layers to automatically learn hierarchical representations of data. Deep Learning algorithms are inspired by the structure and function of the human brain, and they excel at capturing complex patterns and relationships in large-scale datasets. Deep neural networks consist of interconnected layers of artificial neurons (nodes), with each layer extracting increasingly abstract features from the input data.

Deep Learning architectures, such as Convolutional Neural Networks (CNNs) for image recognition and Recurrent Neural Networks (RNNs) for sequential data, have achieved remarkable performance in various domains, including computer vision, natural language processing, and speech recognition. Deep Learning algorithms often require substantial amounts of labeled training data and powerful computational resources for training due to their complex architectures.

3) Key Differences:
Representation and Feature Engineering: In traditional ML, domain experts often need to manually engineer features from raw data. Deep Learning, on the other hand, automatically learns hierarchical representations from the raw input data, eliminating the need for extensive feature engineering.
Performance and Scalability: Deep Learning algorithms can achieve state-of-the-art performance in certain tasks when trained on large amounts of data. ML algorithms may be more suitable for smaller datasets or when interpretability of the model is critical.

Computational Requirements: Deep Learning algorithms require significant computational resources, such as powerful GPUs, due to their complex architectures. ML algorithms can often be trained on more modest hardware.

In summary, Machine Learning focuses on training algorithms to learn patterns and make predictions based on features derived from data, while Deep Learning utilizes deep neural networks to automatically learn hierarchical representations of data. The choice between ML and Deep Learning depends on the task, the available data, the computational resources, and the desired level of interpretability.

B. Recognition of emotion based on facial expressions

Recognition of emotion based on facial expressions is a fascinating field that focuses on developing methods and technologies to accurately detect and interpret human emotions by analyzing facial movements and expressions. Human faces are incredibly expressive, conveying a wide range of emotions through subtle changes in muscle movements, such as smiles, frowns, and raised eyebrows. Researchers and scientists in this field explore various approaches to recognize and understand emotions based on facial expressions. They investigate the underlying mechanisms of facial expressions, the relationship between specific facial movements and emotional states, and the patterns that indicate different emotions.

The aim is to develop computer vision and machine learning algorithms that can automatically detect and classify facial expressions to accurately recognize emotions like happiness, sadness, anger, surprise, fear, and disgust. These algorithms learn from large datasets of labeled facial expressions, training models to identify the unique features and patterns associated with each emotion.

The applications of emotion recognition based on facial expressions are wide-ranging. In psychology, it can aid in understanding human behavior, studying emotional disorders, and improving therapy techniques. In human-computer interaction, it enables more natural and empathetic interactions between humans and machines. In fields like virtual reality and gaming, it enhances immersion and user experience. Additionally, it has potential applications in areas like market research, customer feedback analysis, and security systems.

Researchers employ various techniques in emotion recognition, including feature extraction from facial landmarks, deep learning models such as convolutional neural networks (CNNs), and multimodal approaches that combine facial expressions with other modalities like voice and physiological signals.

The ultimate goal is to develop sophisticated systems that can accurately interpret and respond to human emotions in real time, enabling machines to understand and adapt to human emotional states. This advancement can revolutionize numerous industries and significantly enhance our interaction with technology and each other.

Fig. 1. The steps of emotion recognition using facial expression

V. MATERIALS AND METHODS

In this study, we will compare different ways of recognizing emotions from facial expressions to see which one works the best. We will look at different techniques and algorithms used in this field and analyze how well they perform. Based on this analysis, we will create our own special Neural Network (NN) that can recognize emotions accurately.

To make our NN better, we will consider things like how we prepare the data, the methods we use to pick out important features, and the type of structure we give to the network. We will also fine-tune the model using advanced methods and check that it works well in different situations.

By creating our own NN based on what we learn from the study, we hope to improve the technology used to recognize emotions. We want to make it more accurate and faster, which can be useful in many areas like computers interacting with people, E-Learning, virtual reality, and healthcare.

In summary, this study involves comparing different methods of recognizing emotions from facial expressions and using that knowledge to create our own specialized Neural Network. Our goal is to improve how well computers can understand emotions from faces and make the technology more reliable and effective in the future.

A. Data Set Description

For our research, we will utilize the dataset available at [DataSET] as the primary dataset to train and test our models. This dataset contains folders representing different facial expressions, including Surprise, Anger, Happiness, Sad, Neutral, Disgust, and Fear.

The dataset is divided into two main folders, Training and Testing, to facilitate model configuration for end-users. The training set comprises a total of 28,079 samples, while the testing set contains 7,178 samples. Each sample consists of grayscale images of faces with dimensions of 48x48 pixels. The dataset ensures that the faces are registered automatically, resulting in a more or less centered face occupying a similar amount of space in each image.
It's important to note that this dataset was obtained from the "Challenges in Representation Learning: Facial Expression Recognition Challenge" competition. The dataset was prepared by Pierre-Luc Carrier and Aaron Courville as part of their ongoing research project. They generously provided a preliminary version of their dataset to the workshop organizers for use in this contest.
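As a hedged sketch of how such a folder layout can be consumed, the snippet below uses a standard Keras utility; the directory names "Training" and "Testing" come from the description above, while the batch size and other settings are illustrative assumptions:

    # Sketch: load the 48x48 grayscale face images from the two dataset folders.
    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "Training", image_size=(48, 48), color_mode="grayscale",
        label_mode="categorical", batch_size=64)
    test_ds = tf.keras.utils.image_dataset_from_directory(
        "Testing", image_size=(48, 48), color_mode="grayscale",
        label_mode="categorical", batch_size=64)

    print(train_ds.class_names)  # one class per expression folder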
B. VGG16 Model

The VGG16 [11] model is a convolutional neural network (CNN) architecture that has been widely used in various computer vision tasks, including emotion recognition based on facial expressions. It was developed by the Visual Geometry Group (VGG) at the University of Oxford.

The VGG16 model is characterized by its deep structure, consisting of 16 layers, including 13 convolutional layers and 3 fully connected layers. The architecture of VGG16 is known for its simplicity and uniformity, where the convolutional layers have a small receptive field (3x3) and are stacked one after another. This design choice allows the model to learn intricate features by capturing local patterns in the early layers and then gradually capturing more complex patterns as the depth increases.

The input to the VGG16 model is typically an image, and the network performs a series of convolutional operations to extract features from the input image. Each convolutional layer is followed by a rectified linear unit (ReLU) activation function, which introduces non-linearity into the model. The intermediate outputs from the convolutional layers are downsampled using max-pooling layers to reduce spatial dimensions while preserving the most salient features.

After the convolutional layers, the extracted features are flattened and passed through a series of fully connected layers. These fully connected layers further process the features and eventually produce the output predictions. In the case of emotion recognition, the output layer of the VGG16 model is typically configured to have multiple units corresponding to different emotion classes (e.g., happy, sad, angry, etc.). The final output is obtained using a softmax activation function, which produces a probability distribution over the different emotion classes.

To train the VGG16 model for emotion recognition, a labeled dataset of facial expression images is used. The model is optimized using methods such as stochastic gradient descent (SGD) or the Adam optimizer, and the loss function used is typically categorical cross-entropy, which measures the difference between the predicted probabilities and the true labels. The model is trained on the labeled dataset iteratively, adjusting the network's weights to minimize the loss and improve its prediction accuracy.

The VGG16 model has demonstrated strong performance in various computer vision tasks, including emotion recognition based on facial expressions. Its deep architecture allows it to learn complex patterns and features from images, enabling accurate recognition of different emotions. However, it's worth noting that the VGG16 model can be computationally intensive and may require substantial computational resources for training and inference, particularly when dealing with large-scale datasets.

In summary, the VGG16 model is a powerful CNN architecture commonly employed in emotion recognition based on facial expressions. Its deep structure, along with its ability to learn intricate features, makes it suitable for capturing meaningful representations from images and accurately predicting different emotions.
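As an illustration of this architecture in use, the hedged sketch below builds a VGG16-based emotion classifier in Keras; the head sizes, the seven-class output, and the three-channel input are illustrative assumptions (the dataset's grayscale images would need to be replicated to three channels for this backbone):

    # Sketch: ImageNet-pretrained VGG16 backbone with a small emotion head.
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    base = VGG16(weights="imagenet", include_top=False, input_shape=(48, 48, 3))

    model = models.Sequential([
        base,                                   # stacked 3x3 conv + max-pool blocks
        layers.Flatten(),                       # feature maps -> one vector
        layers.Dense(256, activation="relu"),   # fully connected processing
        layers.Dense(7, activation="softmax"),  # probabilities over 7 emotions
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",  # loss named in the text
                  metrics=["accuracy"])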
C. VGG19 Model

The VGG19 [12] model is an extension of the VGG16 model and is also a convolutional neural network (CNN) architecture commonly utilized in various computer vision tasks, including emotion recognition based on facial expressions.

Similar to VGG16, VGG19 was developed by the Visual Geometry Group (VGG) at the University of Oxford. It is called VGG19 because it consists of 19 layers, including 16 convolutional layers and 3 fully connected layers. The architecture of VGG19 shares similarities with VGG16, but it has a deeper structure, allowing it to capture more complex patterns and features from images.

VGG19 follows a similar design principle to VGG16, utilizing small receptive fields (3x3) in its convolutional layers, stacked on top of each other. This design choice allows the network to learn rich and detailed features by applying multiple convolutional operations sequentially. The convolutional layers are typically followed by rectified linear unit (ReLU) activation functions to introduce non-linearity into the model.

Max-pooling layers are used after each set of convolutional layers to downsample the feature maps, reducing spatial dimensions while preserving the most salient information. The intermediate feature maps are then passed through a series of fully connected layers, which further process the extracted features and generate the final output predictions.

In the context of emotion recognition based on facial expressions, the VGG19 model's input is an image of a face, and it learns to extract discriminative features from facial regions to distinguish between different emotions. The output layer of VGG19 is configured to have multiple units representing the possible emotion classes.
A softmax activation function is commonly used to produce a probability distribution over the emotion classes, allowing the model to make predictions about the dominant emotion in the input facial expression.

Training the VGG19 model for emotion recognition typically involves using a labeled dataset of facial expression images. The model is optimized using algorithms such as stochastic gradient descent (SGD) or the Adam optimizer, and the categorical cross-entropy loss function is commonly employed to measure the discrepancy between the predicted probabilities and the true emotion labels. Through iterative training, the VGG19 model adjusts its weights to minimize the loss and improve its ability to accurately classify emotions.

The VGG19 model's increased depth compared to VGG16 enables it to capture more intricate and nuanced features, potentially leading to improved performance in emotion recognition tasks. However, it's important to note that the additional depth also increases the model's complexity and computational requirements, demanding more computational resources during training and inference.

In summary, the VGG19 model is an extension of the VGG16 architecture widely utilized in emotion recognition tasks based on facial expressions. Its deeper structure enables it to capture more complex patterns, allowing for better discrimination among different emotions. By leveraging the convolutional and fully connected layers, the VGG19 model can effectively extract features from facial images and provide accurate predictions for various emotion classes.
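Since VGG19 shares the VGG16 design, swapping it into the earlier sketch is a one-line change (again illustrative, not the authors' exact code):

    # Sketch: the deeper VGG19 backbone drops into the same pipeline.
    from tensorflow.keras.applications import VGG19

    base = VGG19(weights="imagenet", include_top=False, input_shape=(48, 48, 3))
    # ...the Flatten/Dense head and compile call stay as in the VGG16 sketch.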
D. ResNet50V2 Model

The ResNet50V2 [13] model is a convolutional neural network (CNN) architecture that has been widely employed in various computer vision tasks, including emotion recognition based on facial expressions.

ResNet50V2 is an extension of the original ResNet architecture introduced by Microsoft Research. The "50" in the name refers to the number of layers in the network, indicating its depth. The "V2" denotes that it is an updated version of the model with improved performance and efficiency.

One of the key features of the ResNet architecture is the introduction of residual connections, or skip connections. These connections allow the network to bypass some layers and directly propagate the input or intermediate activations to subsequent layers. This mitigates the problem of vanishing gradients, enabling the network to learn more effectively, especially when dealing with very deep architectures.

The ResNet50V2 model consists of a series of convolutional layers, followed by global average pooling and fully connected layers. It incorporates residual blocks, which are composed of multiple convolutional layers, batch normalization, and activation functions. The residual connections within the blocks facilitate the flow of information and improve gradient flow during training.

In the context of emotion recognition based on facial expressions, the ResNet50V2 model takes an input image of a face and processes it through the layers to extract discriminative features. These features capture the unique characteristics of facial expressions associated with different emotions.

The output layer of the ResNet50V2 model is typically configured to have multiple units corresponding to different emotion classes. The final activation function, often softmax, generates a probability distribution over these emotion classes, enabling the model to make predictions about the dominant emotion exhibited in the facial expression.

Training the ResNet50V2 model for emotion recognition involves using a labeled dataset of facial expression images. The model's weights are optimized using algorithms such as stochastic gradient descent (SGD) or the Adam optimizer. The choice of loss function, such as categorical cross-entropy, helps measure the dissimilarity between predicted probabilities and the true emotion labels. Through iterative training, the model adjusts its weights to minimize the loss and improve its ability to accurately classify emotions.

The ResNet50V2 architecture has shown remarkable performance in various computer vision tasks due to its deep structure, residual connections, and efficient training. These attributes make it capable of capturing complex visual patterns and effectively recognizing emotions.

In summary, the ResNet50V2 model is a deep CNN architecture with residual connections, designed for tasks such as emotion recognition based on facial expressions. Its ability to learn intricate features, along with improved gradient flow through residual connections, allows it to effectively capture and classify different emotions. By leveraging its layers and connections, the ResNet50V2 model demonstrates strong performance in recognizing emotions from facial expressions.
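The skip connection described above can be sketched in a few lines of Keras; this is a simplified illustration of the idea, not the exact ResNet50V2 block:

    # Sketch: simplified pre-activation residual block in the ResNetV2 spirit.
    from tensorflow.keras import layers

    def residual_block(x, filters):
        # assumes x already has `filters` channels so the addition is valid
        shortcut = x                                     # skip connection
        y = layers.BatchNormalization()(x)               # pre-activation ordering
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        return layers.Add()([shortcut, y])               # output = x + F(x)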
E. EfficientNetB0 Model

The EfficientNetB0 [14] model is a convolutional neural network (CNN) architecture that has gained attention for its efficiency and excellent performance in various computer vision tasks, including emotion recognition based on facial expressions.

EfficientNetB0 belongs to a family of models known as EfficientNets, which were designed by combining principles from neural architecture search and model scaling. These models achieve high accuracy while maintaining computational efficiency, making them suitable for resource-constrained environments.

The EfficientNetB0 architecture follows a compound scaling method, which uniformly scales the network's depth, width, and resolution. This scaling allows the model to achieve a good balance between model capacity and computational efficiency. The "B0" in the name signifies the base configuration of the EfficientNet family, where "B0" represents the smallest and least computationally expensive variant.

EfficientNetB0 consists of multiple stacked layers of depth-wise separable convolutions, which reduce the number of parameters and the computational cost while maintaining model performance. These depth-wise separable convolutions split the standard convolutional operation into two separate steps: a depth-wise convolution, which processes each input channel separately, and a point-wise convolution, which combines the output of the depth-wise convolution across channels.
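In Keras, the two-step factorization described above can be written as follows (a minimal sketch of the idea):

    # Sketch: a depth-wise separable convolution as the two steps above.
    from tensorflow.keras import layers

    def separable_conv(x, filters):
        x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(x)  # per-channel 3x3
        x = layers.Conv2D(filters, kernel_size=1)(x)                  # 1x1 point-wise mix
        return x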

The EfficientNetB0 model also incorporates other techniques such as batch normalization, activation functions, and skip connections. These techniques aid in improving the learning process, increasing model accuracy, and facilitating gradient flow during training.

In the context of emotion recognition based on facial expressions, the EfficientNetB0 model takes an input image of a face and passes it through its layers to extract meaningful features. These features capture the relevant patterns and expressions associated with different emotions.

The output layer of the EfficientNetB0 model is typically configured to have multiple units corresponding to the different emotion classes. The final activation function, often softmax, produces a probability distribution over these emotion classes, enabling the model to make predictions about the dominant emotion displayed in the facial expression.

Training the EfficientNetB0 model for emotion recognition involves using a labeled dataset of facial expression images. The model's weights are optimized using algorithms such as stochastic gradient descent (SGD) or the Adam optimizer. The choice of a suitable loss function, such as categorical cross-entropy, helps measure the dissimilarity between the predicted probabilities and the true emotion labels. Through iterative training, the model adjusts its weights to minimize the loss and improve its ability to accurately classify emotions.

The EfficientNetB0 model's efficiency and performance make it well-suited for emotion recognition based on facial expressions. Its ability to capture important features while being computationally efficient allows for accurate emotion classification, even in resource-limited settings.

In summary, the EfficientNetB0 model is a highly efficient CNN architecture that achieves excellent performance in emotion recognition tasks based on facial expressions. Its compound scaling approach, depth-wise separable convolutions, and other techniques contribute to its efficiency and accuracy. By leveraging these features, the EfficientNetB0 model demonstrates strong performance in accurately recognizing emotions from facial expressions.

F. EfficientNetB7 Model

EfficientNetB7 is a deep learning model and a part of the EfficientNet family, a series of convolutional neural networks (CNNs) designed to achieve state-of-the-art performance with significantly fewer parameters compared to other models. The EfficientNetB7 model is the largest and most powerful variant in the EfficientNet series.

The main idea behind EfficientNet models is compound scaling, which involves scaling the depth, width, and resolution of the network in a balanced manner. This allows EfficientNetB7 to achieve better performance and efficiency by effectively utilizing computational resources; the scaling rule is written out after the list below.

Specifically, the EfficientNetB7 model has the following characteristics:

● Depth: EfficientNetB7 has a deep network architecture with a large number of layers, allowing it to capture complex patterns and features from the input data.

● Width: It has a significantly larger number of channels or filters in each layer compared to smaller variants, which enables it to learn more expressive representations.

● Resolution: The input images to EfficientNetB7 are of higher resolution, which allows the model to capture fine-grained details and improve recognition accuracy.
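For reference, the compound scaling rule published with the EfficientNet family chooses a single coefficient phi and scales all three dimensions together (the standard formulation from the EfficientNet paper, quoted here for clarity):

    % EfficientNet compound scaling with coefficient \phi
    d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi},
    \qquad \text{subject to } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
    \quad \alpha, \beta, \gamma \geq 1

Here d, w, and r multiply the baseline network's depth, width, and input resolution, and the constants alpha, beta, and gamma are found by a small grid search on the base model.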
EfficientNetB7 has been pre-trained on large-scale datasets, such as ImageNet, using techniques like transfer learning. As a result, it has learned to recognize a wide range of features from different images. This pre-training makes it a powerful feature extractor that can be fine-tuned on specific tasks or datasets with relatively few additional training samples.

Due to its efficiency and high performance, EfficientNetB7 is commonly used in various computer vision tasks, such as image classification, object detection, and semantic segmentation, where it consistently achieves top-tier results. However, it should be noted that EfficientNetB7 might require significant computational resources, especially during training, due to its large size.
G. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an iterative optimization algorithm commonly used in machine learning for training models, including neural networks, to minimize a given loss function and find the optimal set of parameters. It is particularly effective when dealing with large datasets.

The name "stochastic" refers to the fact that the algorithm operates on randomly selected subsets of the training data, known as mini-batches, instead of the entire dataset. This random sampling introduces a level of randomness into the optimization process and allows the algorithm to escape local minima more easily.

Here is a step-by-step explanation of how SGD works:

1. Initialize Parameters: The algorithm starts by initializing the model parameters with random values. These parameters are the variables that the model will learn during the training process.

2. Select Mini-Batch: SGD randomly selects a mini-batch of training examples from the dataset. The size of the mini-batch is typically chosen based on computational constraints and can range from a few samples to a few hundred samples.

3. Compute Gradient: The selected mini-batch is used to compute the gradient of the loss function with respect to the model parameters. The gradient represents the direction and magnitude of the steepest ascent or descent in the loss function's landscape. It indicates how the parameters should be adjusted to minimize the loss.

4. Update Parameters: The parameters are updated by taking a small step in the opposite direction of the gradient. This step is controlled by a learning rate, which determines the size of the update. A smaller learning rate leads to slower convergence, while a larger learning rate may cause overshooting and instability.

5. Repeat: Steps 2-4 are repeated for a fixed number of iterations or until a stopping criterion is met. The entire dataset is usually processed multiple times, with each pass over the data known as an epoch. The order in which the mini-batches are processed can either be randomly shuffled or preserved in their original order.

The key advantage of SGD is its efficiency in processing large datasets. Since it operates on mini-batches, it requires less memory and fewer computational resources compared to batch gradient descent, where the entire dataset is used in each iteration. The stochastic nature of SGD also adds a regularizing effect, helping the model generalize better and avoid overfitting.

However, SGD introduces some noise due to the randomness in mini-batch selection, which can cause the optimization process to be more erratic. To address this, various modifications of SGD have been proposed, such as momentum, adaptive learning rates (e.g., AdaGrad, RMSprop, Adam), and learning rate schedules.

In summary, Stochastic Gradient Descent is an iterative optimization algorithm that updates model parameters based on randomly selected mini-batches of training data. It efficiently handles large datasets and helps models converge to an optimal solution. By iteratively adjusting the parameters in the direction of steepest descent, SGD enables the training of complex machine learning models.
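The loop described in steps 1-5 reduces to a few lines; below is a minimal NumPy sketch, in which grad_fn is a hypothetical stand-in for backpropagation on the chosen loss:

    # Sketch: plain mini-batch SGD; grad_fn(params, batch) is hypothetical.
    import numpy as np

    def sgd(params, data, grad_fn, lr=0.01, batch_size=32, epochs=10):
        n = len(data)
        for _ in range(epochs):                      # step 5: repeat per epoch
            order = np.random.permutation(n)         # shuffle mini-batch order
            for start in range(0, n, batch_size):
                batch = data[order[start:start + batch_size]]  # step 2: sample
                grad = grad_fn(params, batch)        # step 3: gradient of loss
                params = params - lr * grad          # step 4: descend
        return params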
H. Stochastic Gradient Descent – Adam Method

The Adam optimizer is an extension of the stochastic gradient descent (SGD) algorithm that combines elements of both the AdaGrad and RMSprop optimization techniques. It is widely used in training deep neural networks due to its effectiveness in finding good solutions and its adaptive learning rate capabilities.

Here is a deeper explanation of the Adam optimizer:

1. Initialization: The Adam optimizer initializes two variables, namely the first moment estimate (often called the "mean") and the second moment estimate (often called the "variance"). These variables are initialized as vectors of zeros with the same dimensions as the model's parameters.

2. Computing Gradients: During each iteration of the optimization process, a mini-batch of training examples is randomly sampled. The gradients of the loss function with respect to the model parameters are computed using backpropagation.

3. Updating the First Moment Estimate: The Adam optimizer updates the first moment estimate by calculating the exponentially decaying average of the gradients. This step helps capture the overall trend of the gradients over time.

4. Updating the Second Moment Estimate: The second moment estimate is updated by calculating the exponentially decaying average of the squared gradients. This step helps capture the magnitudes of the gradients and acts as a form of adaptive learning rate adjustment.

5. Bias Correction: In the early iterations of training, the estimates of the first and second moments can be biased towards zero due to their initialization as zero vectors. To address this, bias correction is applied to the first and second moment estimates to make them unbiased.
6. Learning Rate Scaling: The Adam optimizer scales the gradients by dividing them by the square root of the second moment estimate. This scaling allows for adaptive learning rates, where the learning rate is automatically adjusted based on the magnitudes of the gradients.

7. Parameter Update: Finally, the model parameters are updated by subtracting the scaled gradients, which are divided by the square root of the second moment estimate and multiplied by the learning rate. This step effectively moves the parameters in the direction that minimizes the loss function.

The Adam optimizer's adaptive learning rate mechanism makes it less sensitive to the choice of an initial learning rate and helps achieve faster convergence. It combines the benefits of AdaGrad, which adjusts the learning rate for each parameter individually, and RMSprop, which performs adaptive learning rate scaling. Additionally, the bias correction step ensures that the estimates of the first and second moments are accurate, particularly during the early stages of training.

In summary, the Adam optimizer is an adaptive optimization algorithm that updates model parameters by maintaining estimates of the first and second moments of the gradients. It incorporates adaptive learning rates, making it robust and efficient in training deep neural networks. By combining the strengths of AdaGrad and RMSprop, the Adam optimizer provides an effective approach for minimizing the loss function during the training process.
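In compact form, the seven steps above correspond to the standard Adam update equations, where g_t is the gradient at step t, beta_1 and beta_2 are the decay rates of the two moment estimates, eta is the learning rate, and epsilon is a small constant for numerical stability:

    m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t                 % first moment (mean)
    v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^{2}             % second moment (variance)
    \hat{m}_t = m_t / (1 - \beta_1^{t}), \quad \hat{v}_t = v_t / (1 - \beta_2^{t})   % bias correction
    \theta_t = \theta_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)       % parameter update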

VI. EXPERIMENTAL STUDY

The main objective of this part is to carry out an experimental study that focuses on comparing the performance of five different algorithms used for emotion recognition based on facial expressions. The diagram provided below illustrates the sequential steps involved in this study, starting from the importation of data and progressing through the various stages until the testing of the best model and parameters identified through our comprehensive comparison.

Fig. 2. The pipeline of our experiment

A. Experimental Setup

Scaling and resizing an image involves adjusting its size while preserving its aspect ratio or changing the aspect ratio as desired. The process typically involves two steps: scaling and resizing.

The pixel values of an image typically range from 0 to 255, representing the intensity of each pixel. Scaling the image by dividing it by 255 transforms the pixel values to a range between 0 and 1. This normalization is often performed to ensure that the pixel values are within a consistent and standardized range, which can be beneficial for various image processing algorithms and models.

In the context of facial expression recognition, resizing an image to (48, 48) is commonly done to preprocess facial images and prepare them as input for emotion recognition models. The dimensions of 48x48 pixels have been widely adopted in facial expression datasets and models. This size is typically sufficient to capture important facial features while keeping the computational requirements manageable.
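Concretely, the two operations described above are one line each; a minimal TensorFlow sketch:

    # Sketch: resize faces to 48x48 and rescale 0-255 intensities to 0-1.
    import tensorflow as tf

    def preprocess(image):
        image = tf.image.resize(image, (48, 48))    # resizing step
        image = tf.cast(image, tf.float32) / 255.0  # scaling/normalization step
        return image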
Fig. 3. Data importation

To load a pre-trained model means to import and initialize a pre-trained deep learning model that has been trained on a large dataset. The pre-trained model has already learned meaningful representations of the data it was trained on, making it capable of performing specific tasks, such as image classification.

A pre-trained model on ImageNet refers to a deep learning model that has been trained on the large-scale ImageNet dataset. The ImageNet dataset is a widely used benchmark in computer vision, consisting of millions of labeled images belonging to thousands of different classes or categories.

Fig. 4. Load the pre-trained models
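A hedged sketch of this loading step (one common way to obtain ImageNet-pretrained backbones in Keras; the exact calls shown in Fig. 4 are not reproduced here):

    # Sketch: load the five ImageNet-pretrained backbones compared in this study.
    from tensorflow.keras.applications import (
        VGG16, VGG19, ResNet50V2, EfficientNetB0, EfficientNetB7)

    backbones = {
        "VGG16": VGG16(weights="imagenet", include_top=False),
        "VGG19": VGG19(weights="imagenet", include_top=False),
        "ResNet50V2": ResNet50V2(weights="imagenet", include_top=False),
        "EfficientNetB0": EfficientNetB0(weights="imagenet", include_top=False),
        "EfficientNetB7": EfficientNetB7(weights="imagenet", include_top=False),
    }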
The term "freezing" refers to the process of preventing hyperparameter space using different strategies to
the weights and parameters of specific layers in a find the best combination of hyperparameters.
pre-trained model from being updated or trained further ● Objective Functions: You can define an objective
during the fine-tuning or transfer learning process. function that quantifies the performance of your
model based on specific metrics, such as accuracy,
When we load a pre-trained model, all the layers in the loss, or any custom evaluation metric. The tuner
model have already been trained on a large dataset to uses this objective function to guide the search
learn meaningful representations. However, in some process and optimize the hyperparameters.
cases, we may want to fine-tune the model for a specific
task using your own dataset. ● Early Stopping: Keras Tuner supports early
stopping, which allows you to stop the search
By freezing layers, we keep the learned representations process if the performance of the model plateaus or
intact, especially in the early layers that capture basic worsens. This helps save computational resources
features and patterns. This is useful because the by terminating the search early if no further
pre-trained model has been trained on a dataset similar improvements are observed.
to our task, and we want to leverage the pre-existing
knowledge while fine-tuning on your specific data. ● Results Analysis: Keras Tuner provides utilities to
analyze and visualize the results of the
hyperparameter search, such as the best
hyperparameters found, performance metrics
across different trials, and search history.

By using Keras Tuner, you can automate the process of


finding the best hyperparameters for your deep learning
models. This helps you save time and resources by
optimizing your models' performance and generalization
without the need for manual trial and error.
Fig. 5. Freezing layers Overall, Keras Tuner simplifies the hyperparameter
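In Keras, this amounts to a short loop over the backbone's layers (a sketch of the idea behind Fig. 5):

    # Sketch: freeze the pretrained backbone so only the new head is trained.
    for layer in base.layers:       # `base` is a loaded backbone, as above
        layer.trainable = False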
Keras Tuner is a hyperparameter optimization library for Keras, a popular deep learning framework. It provides a convenient and efficient way to search and tune hyperparameters to improve the performance of your deep learning models.

Hyperparameters are parameters that are not learned during the training process but are set prior to training. They control various aspects of the model, such as the learning rate, the number of layers, the number of units in each layer, the activation functions, the dropout rate, and more. Tuning these hyperparameters can have a significant impact on the performance and generalization of your model.

Keras Tuner offers different search algorithms and strategies to efficiently explore the hyperparameter space and find optimal or near-optimal combinations. Here are some key components and features of Keras Tuner:

● Search Spaces: Keras Tuner allows you to define the search space for each hyperparameter. You can specify the range, values, or distributions that the tuner should consider when searching for the best hyperparameters.

● Tuners: Keras Tuner provides different tuners, such as RandomSearch, Hyperband, and Bayesian optimization-based tuners like BayesianOptimization and TPE (Tree-structured Parzen Estimators). These tuners explore the hyperparameter space using different strategies to find the best combination of hyperparameters.

● Objective Functions: You can define an objective function that quantifies the performance of your model based on specific metrics, such as accuracy, loss, or any custom evaluation metric. The tuner uses this objective function to guide the search process and optimize the hyperparameters.

● Early Stopping: Keras Tuner supports early stopping, which allows you to stop the search process if the performance of the model plateaus or worsens. This helps save computational resources by terminating the search early if no further improvements are observed.

● Results Analysis: Keras Tuner provides utilities to analyze and visualize the results of the hyperparameter search, such as the best hyperparameters found, performance metrics across different trials, and the search history.

By using Keras Tuner, you can automate the process of finding the best hyperparameters for your deep learning models. This helps you save time and resources by optimizing your models' performance and generalization without the need for manual trial and error.

Overall, Keras Tuner simplifies the hyperparameter optimization process by providing an interface to define search spaces, select search algorithms, define objective functions, and analyze the results. It empowers you to find the optimal hyperparameters for your deep learning models efficiently; a short usage sketch follows Fig. 6 below.

Fig. 6. Keras-tuner_1
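Putting these pieces together, a minimal Keras Tuner search might look like the sketch below; the search ranges and trial count are illustrative assumptions, not the paper's exact configuration (which is shown in Figs. 6 and 7):

    # Sketch: RandomSearch over a dense-layer width and the learning rate.
    import keras_tuner as kt
    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_model(hp):
        model = models.Sequential([
            layers.Flatten(input_shape=(48, 48, 1)),
            layers.Dense(hp.Int("units", min_value=128, max_value=1024, step=128),
                         activation="relu"),         # searched hyperparameter
            layers.Dense(7, activation="softmax"),
        ])
        lr = hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])
        model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                      loss="categorical_crossentropy", metrics=["accuracy"])
        return model

    tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
    # tuner.search(train_ds, validation_data=test_ds, epochs=5)
    # best_model = tuner.get_best_models(num_models=1)[0]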
Fig. 7. Keras-tuner_2

Fig. 8. Early Stopping
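The behaviour shown in Fig. 8 corresponds to Keras's standard callback; a hedged sketch (the patience value is an assumption):

    # Sketch: stop training when the validation loss stops improving.
    import tensorflow as tf

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",           # metric watched for plateaus
        patience=5,                   # epochs without improvement tolerated
        restore_best_weights=True)    # roll back to the best epoch

    # model.fit(train_ds, validation_data=test_ds, epochs=50,
    #           callbacks=[early_stop])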

B. Results and decision

The optimal model consists of the following components:

● VGG16: A convolutional neural network architecture known for its deep layers and excellent performance in image classification tasks.

● Flatten Layer: This layer is responsible for transforming the multi-dimensional output of the VGG16 model into a one-dimensional vector, enabling compatibility with fully connected layers.

● Dense Layer with 1000 Units: A fully connected layer with 1000 units. Each unit is connected to every neuron in the previous layer, allowing for complex feature extraction and representation.

● Output Layer (Dense Layer): This layer serves as the final layer of the model and contains 6 units, corresponding to the 6 classes in the classification task. It utilizes the softmax activation function.

To summarize, the optimal model is built upon the VGG16 architecture, which serves as the foundational backbone. It is complemented by a flatten layer to reshape the output, a dense layer with 1000 units to capture intricate patterns, and an output layer with softmax activation for effective multiclass classification. The softmax function plays a pivotal role in transforming the output of the final layer into meaningful class probabilities, enabling reliable and interpretable predictions.

Fig. 9. Best model
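A hedged sketch of this winning configuration, mirroring the components listed above (the ReLU activation of the 1000-unit layer and the compile settings are assumptions; the exact code is in Fig. 9):

    # Sketch: VGG16 backbone -> Flatten -> Dense(1000) -> softmax over 6 classes.
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    base = VGG16(weights="imagenet", include_top=False, input_shape=(48, 48, 3))

    best_model = models.Sequential([
        base,
        layers.Flatten(),                        # multi-dimensional maps -> vector
        layers.Dense(1000, activation="relu"),   # 1000-unit fully connected layer
        layers.Dense(6, activation="softmax"),   # 6 emotion classes, as listed
    ])
    best_model.compile(optimizer="adam",
                       loss="categorical_crossentropy", metrics=["accuracy"])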
Here is the graph of the loss and the accuracy:

Fig. 10. The graphical representation depicting the loss and accuracy metrics

The confusion matrix provides a comprehensive view of the model's performance by showing the distribution of correct and incorrect predictions across different classes. It helps in understanding the types of errors made by the model, such as false positives and false negatives.

Fig. 11. Confusion matrix
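For completeness, a matrix like the one in Fig. 11 can be computed with scikit-learn; y_true and y_prob below are assumed to hold the test labels and the model's softmax outputs:

    # Sketch: confusion matrix from model predictions (y_true, y_prob assumed).
    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_pred = np.argmax(y_prob, axis=1)   # most probable class per sample
    cm = confusion_matrix(y_true, y_pred)
    print(cm)                            # rows: true class, columns: predicted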
VII. CONCLUSION

In conclusion, this research study aimed to compare the performance of five different algorithms (VGG16, VGG19, ResNet50V2, EfficientNetB0, and EfficientNetB7) for recognizing emotions based on facial expressions. The study followed a step-by-step process, starting from importing the data and going through different stages until finding the best model and settings through our comparison.

The results of this study provide valuable insights into how well these algorithms performed in recognizing emotions. By comparing their accuracy and confusion matrices, we were able to determine which algorithm worked the best for recognizing emotions based on facial expressions.

These findings contribute to improving emotion recognition technology and can be useful in various areas like e-learning platforms or computer interactions. By using the best model and settings identified in this study, we can enhance the effectiveness and efficiency of emotion recognition systems, leading to better user experiences and tailored support.

It's important to mention that further research can be done to explore other algorithms or improve the existing ones. Also, when using these algorithms in real-life applications, we should consider factors like computational resources, complexity, and finding a balance between accuracy and efficiency.

Overall, this research provides a foundation for future developments in emotion recognition based on facial expressions. It highlights the importance of selecting the most suitable algorithm to achieve accurate and reliable results. By using the findings of this study, we can advance emotion recognition technology and improve human-computer interactions in various fields.

VIII. LIMITATIONS AND FUTURE WORKS

While the tested algorithms showed promise, we encountered several limitations that we aim to address in our future works. Below, we outline these limitations and discuss our objectives for future research.

1. Limitations:

● Limited Accuracy: The tested algorithms for emotion recognition based on facial expressions did not provide highly accurate results. Further improvements are needed to enhance their performance.

● Dataset Limitations: The dataset used for training and evaluation may have had limitations, such as being small or not representing emotions well. This could have affected the accuracy of the algorithms.

● Algorithm Suitability: The five algorithms tested may not cover all possible options. Exploring more algorithms could lead to better results.

● Resource Requirements: The tested algorithms may require powerful computers with high-end graphics cards and processors. Access to such resources could be a challenge in real-life applications.

2. Future Works:

● Improved Algorithm Development: More research should focus on developing better algorithms for emotion recognition based on facial expressions. Exploring new approaches, like considering time or attention, could improve accuracy.

● Enhanced Dataset Collection: Gathering larger and more diverse datasets that represent different emotions and people can improve algorithm accuracy.

● Real-life Testing and Validation: It is important to test and validate the algorithms in real-life situations, such as in e-learning platforms. This can provide feedback for improvements and adjustments.

● Hybrid Approaches: Combining multiple algorithms or using ensemble techniques could lead to better accuracy in emotion recognition.

● Personalized CNN Configuration: Creating a custom Convolutional Neural Network (CNN) with personalized settings specifically for emotion recognition could improve accuracy and adaptability to different datasets.

By addressing these limitations and exploring future research directions, we can improve the accuracy and practical application of emotion recognition technology. This will enhance human-computer interactions and provide better user experiences in various fields.
FIGURE TABLE

Figure 1: The steps of emotion recognition using facial expression
Figure 2: The pipeline of our experiment
Figure 3: Data importation
Figure 4: Load the pre-trained models
Figure 5: Freezing layers
Figure 6: Keras-tuner_1
Figure 7: Keras-tuner_2
Figure 8: Early Stopping
Figure 9: Best model
Figure 10: The graphical representation depicting the loss and accuracy metrics
Figure 11: Confusion matrix

REFERENCES

[1] Uğur Ayvaz, Hüseyin Gürüler, and Mehmet Osman Devrim, "Use of Facial Emotion Recognition in E-Learning Systems." [CrossRef]
[2] Francesca D'Errico, Marinella Paciello, Bernardina De Carolis, Alessandro Vattanid, Giuseppe Palestra, and Giuseppe Anzivino, "Cognitive Emotions in E-Learning Processes and Their Potential Relationship with Students' Academic Adjustment." [CrossRef]
[3] Aya Hassouneh, A. M. Mutawa, and M. Murugappan, "Development of a Real-Time Emotion Recognition System Using Facial Expressions and EEG Based on Machine Learning and Deep Neural Network." [CrossRef]
[4] Shaik Asif Hussain and Ahlam Salim Abdallah Al Balushi, "A Real Time Face Emotion Classification and Recognition Using Deep Learning Model." [CrossRef]
[5] Mehmet Akif Ozdemir, Berkay Elagoz, Aysegul Alaybeyoglu, Reza Sadighzadeh, and Aydin Akan, "Real Time Emotion Recognition from Facial Expressions Using CNN Architecture." [CrossRef]
[6] Isha Talegaonkar, Kalyani Joshi, Shreya Valunj, Rucha Kohok, and Anagha Kulkarni, "Real Time Facial Expression Recognition Using Deep Learning." [CrossRef]
[7] Marian Stewart Bartlett, Gwen Littlewort, Ian Fasel, and Javier R. Movellan, "Real Time Face Detection and Facial Expression Recognition: Development and Applications to Human Computer Interaction." [CrossRef]
[8] Aneta Kartali, Miloš Roglić, Marko Barjaktarović, Milica Đurić-Jovičić, and Milica M. Janković, "Real-Time Algorithms for Facial Emotion Recognition: A Comparison of Different Approaches." [CrossRef]
[9] Ati Jain and Hare Ram Sah, "Student's Feedback by Emotion and Speech Recognition through Deep Learning." [CrossRef]
[10] Wisal Hashim Abdulsalam, Rafah Shihab Alhamdani, and Mohammed Najm Abdullah, "Facial Emotion Recognition from Videos Using Deep Convolutional Neural Networks." [CrossRef]
[11] Philipp Michel and Rana El Kaliouby, "Real Time Facial Expression Recognition in Video Using Support Vector Machines." [CrossRef]
[12] David Dukić and Ana Sovic Krzic, "Real-Time Facial Expression Recognition Using Deep Learning with Application in the Active Classroom Environment." [CrossRef]
[13] VGG16. [CrossRef]
[14] VGG19. [CrossRef]
[15] ResNet50V2. [CrossRef]
[16] EfficientNetB0. [CrossRef]
[17] EfficientNetB7. [CrossRef]
[18] Adam optimizer. [CrossRef]
