
J. Vis. Commun. Image R. 97 (2023) 103948

Contents lists available at ScienceDirect

Journal of Visual Communication and Image Representation


journal homepage: www.elsevier.com/locate/jvci

Full Length Article

Alibaba and forty thieves algorithm and novel Prioritized Prewitt Pattern
(PPP)-based convolutional neural network (CNN) using hyperspherically
compressed weights for facial emotion recognition☆
A. Sherly Alphonse a, S. Abinaya a, S. Abirami a
a Vellore Institute of Technology, Chennai, India

A R T I C L E  I N F O

Keywords:
Optimization
CNN
Weights
Emotion
Hyperparameters

A B S T R A C T

The visual representations created using the self-distillation paradigm of Bootstrap Your Emotion Latent (BYEL) are empirically found to be less evenly distributed than those created using the proposed technique. The proposed work promotes the compression of weights on a hypersphere by minimizing the hyperspherical energy of network weights, using a novel method of optimizing manifolds through Riemannian metrics and the conjugate gradient technique. The proposed work demonstrates how regularising the networks of the BYEL architecture reduces the hyperspherical energy of neurons by directly optimising a measure of uniformity alongside the standard loss. This leads to more uniformly distributed representations and better performance on downstream tasks. The Alibaba and Forty Thieves Algorithm-based Optimization (AFTAO) methodology is used to select the most precise collection of hyperparameters for a novel Prioritized Prewitt Pattern (PPP)-based Convolutional Neural Network (CNN), which results in higher accuracy on all six datasets used for facial emotion recognition.

1. Introduction

The phrase "Human-Computer Interaction (HCI)" was created and is widely understood to refer to the study of human–machine interaction with the potential for interdisciplinary application. HCI largely includes the study of interface design, with its applications focusing on the communications between computers and their users [1]. As computers are employed in practically every aspect of human life, HCI applications are widespread in all industry verticals, such as computer and industrial engineering, social science, psychology, and plenty of others. The advancement of the Convolutional Neural Network model to obtain improved performance in facial emotion identification is one of the factors driving this study. Using the Alibaba and Forty Thieves Algorithm-based Optimization (AFTAO) approach to choose the most precise set of hyperparameters helps to achieve the intended accuracy. The goal of the suggested approach is to achieve the aforementioned aims. The first stage entails gaining access to the publicly accessible dataset. The facial images are cropped using the 'Chehra' face detection algorithm. The weights are regularized using the proposed hypersphere-based regularization technique. The hyperparameter optimization process is then carried out using the AFTAO search algorithm, and the resulting hyperparameters are input into the CNN to produce the needed output. Finally, the model is compared to other models that are considered to be cutting edge, and the findings unequivocally demonstrate the superiority of the model. As a result, the original contribution of the paper emphasizes the employment of AFTAO techniques for hyperparameter fine-tuning. The effectiveness of the algorithm in providing the best accuracy in the shortest amount of time can be attributed to the fact that it is the most recent algorithm for solving optimization-based problems while taking the smallest number of control parameters into account.

The following are the special contributions of the suggested framework:

1. A novel Prioritized Prewitt Pattern (PPP)-based CNN is proposed.
2. Using the AFTAO to select the ideal hyper-parameters for CNN data training.
3. Proposing a novel method of compressing the weights on a hypersphere using Riemannian metrics and conjugate gradient-based optimization.
4. Minimizing the hyperspherical energy of weights.


☆ This paper has been recommended for acceptance by Zicheng Liu.
E-mail address: sherly.a@vit.ac.in (A. Sherly Alphonse).

https://doi.org/10.1016/j.jvcir.2023.103948
Received 19 August 2022; Received in revised form 16 May 2023; Accepted 1 October 2023
Available online 2 October 2023
1047-3203/© 2023 Elsevier Inc. All rights reserved.

5. The facial expression dataset achieves accuracy which is greater than the current works in the literature.

Hyperparameters are important because they have a significant impact on how the training algorithm functions and how well the model being trained works. An algorithm can really be improved by a carefully designed set of hyperparameters. This paper proposes a novel algorithm for hyperparameter selection that improves the performance of a facial expression recognition system. In order to achieve an appropriate distribution of weights on a hypersphere without overlapping, the suggested work makes use of the hypersphere-based compression technique. It also improves the numerical characteristics of the kernel matrix by decreasing the influence of other transformation variables in the features, which improves the overall performance.

The paper is organised as outlined below. The related works are described in section 2, the technique for the proposed study is described in section 3, the findings are analysed in section 4, and the paper is concluded in section 5.

2. Literature survey

An intriguing topic of research, automatic emotion recognition based on facial expression, has been presented and used in a number of fields, including social networks, the health sector, and human–machine interactions. Researchers in this field are interested in developing techniques to interpret, encode, and extract these features from facial expressions in order to enhance computer prediction. Many deep learning architectures are being used to improve performance due to the exceptional success of this technology [1,2].

Human conduct is innately influenced by emotions, which play a significant role in how people behave and communicate. For a fuller knowledge of human behaviour, it is crucial to accurately analyze and interpret the emotional content of facial expressions. Although it takes little to no effort for a human to recognize faces and interpret facial emotions, computer systems still have a difficult time accurately and robustly recognizing facial expressions. Analysis of the features of the human face and the identification of its emotional states are thought to be exceedingly demanding and difficult jobs. The non-uniformity of the human face and differences in factors like lighting, shadows, facial position, and direction present the biggest challenges. In order to attain resilience and the necessary scalability on new types of data, deep learning approaches have been investigated as a stream of effective methodologies [2].

The most important field of the man–machine interface is emotion recognition via facial expressions. Facial ornaments, uneven lighting, varying poses, etc. are a few difficulties in the emotion recognition field. The disadvantage of feature extraction and classification is mutual optimization in emotion detection using conventional methods. Deep learning algorithms are receiving increased attention from researchers as a solution to this issue. Deep-learning techniques are now heavily utilised in categorization jobs. Some works in the literature use transfer learning techniques to address emotion recognition. They employed the ResNet50, VGG19, Inception V3, and MobileNet pre-trained networks. They removed the fully connected layers from the pre-trained ConvNets and replaced them with fully connected layers appropriate for the task at hand. Last but not least, only the newly added layers are trained to update the weights. The experiment used the CK+ database and had a 96 percent average accuracy rate for emotion identification [3]. Humans display their emotional states through their facial expressions. Facial expression identification is nevertheless still a difficult and intriguing issue in computer vision. The purpose of this project is to categorise each image into one of six types of facial emotion [4].

One of the most effective and hard study tasks in social communication is human emotion recognition from images. The performance of deep learning (DL) based emotion detection is superior to that of conventional image processing techniques. The design of an artificial intelligence (AI) system that can identify emotions from facial expressions is presented in this research. It covers the method for detecting emotions, which essentially consists of three steps: face detection, feature extraction, and emotion classification. A convolutional neural network (CNN)-based deep learning architecture for emotion recognition from facial photos was proposed by Jaiswal et al. (2020). The Facial Emotion Recognition Challenge (FERC-2013) and Japanese Female Facial Emotion (JAFFE) datasets are used to assess the performance of the proposed technique. For the FERC-2013 and JAFFE datasets, the accuracy rates with the CNN model are 70.14 and 98.65 percent, respectively [5].

The quick development of artificial intelligence has made significant contributions to the field of technology. Machine learning and deep learning algorithms have achieved remarkable success in a variety of applications, including prediction models, recommender systems, pattern identification, etc., where previous algorithms failed to fulfil the needs of humans in real time. The influence of emotion on a person's thoughts, actions, and feelings is crucial [6]. By exploiting the advantages of deep learning, an emotion detection system can be created, and many applications, including feedback analysis and face unlocking, may be executed with high accuracy. The primary goal of this research is to develop a Deep Convolutional Neural Network (DCNN) model that can categorize seven distinct facial expressions of human emotion. Using this technique, the model is trained, examined, and verified.

A combined architecture of two networks, a local network and a global network, based on Local Enhanced Motion History Images (LEMHI) and CNN-LSTM cascaded networks, respectively, is proposed by Pranav et al. [6] in their study dated 2020. The challenge of identifying facial emotional expressions in video sequences is the main topic of this article. Frames from unlabelled video are blended into a single frame in the local network using a cutting-edge method called LEMHI. This method enhances MHI by leveraging recognized human facial landmarks as attention areas to increase local value in difference image calculation, effectively capturing the action of a key facial unit. After that, a CNN network will receive this one frame and make a forecast. However, the global feature extractor and classifier of the global network for video face emotion recognition use an enhanced CNN-LSTM model. In order to arrive at a final prediction, a random search weighted summation approach is used. In order to determine which features of the face have an impact on the networks' predictions, this work also provides an understanding of the networks and observable feature maps from each layer of the CNN. Studies employing a subject-independent validation method on the AFEW, CK+, and MMI datasets show that the combined framework of two networks performs better than using each network independently. The proposed framework performs better than the state-of-the-art techniques [7].

Humans have traditionally found it simple to identify emotions from facial expressions, but it is far more difficult for a computer system to do the same. Using machine learning and computer vision, it is now possible to identify emotions in photographs using various approaches, as in the literature. The two-fragment convolutional neural network (CNN) on which FERC is built emphasizes the extraction of facial feature vectors, while the first portion of the CNN removes the background from the image. The expressional vector (EV) in the FERC model is used to identify the five varieties of regular facial expressions. Supervisory information was gathered from the 10,000-image database that was saved (154 persons). Using an EV of length 24 values, it was feasible to accurately emphasize the emotion with 96 percent accuracy. The final perceptron layer in the two-level CNN operates in series and continuously modifies the exponent and weight values. With single-level CNN, FERC deviates from commonly used methodologies, increasing accuracy. Additionally, a revolutionary background removal technique used prior to the formation of the EV prevents dealing with a variety of potential issues (for example, distance from the camera). Caltech Faces, CMU, NIST, the extended Cohn-Kanade expression dataset, and more than 750 K photos


Fig. 1. Architecture of proposed novel PPP-CNN.

were used to thoroughly test FERC. It is anticipated that FERC emotion detection will be beneficial in numerous applications, including student predictive learning and lie detectors, among others [8,9].

The fact that most present approaches only take into account frontal photos and overlook profile views from different angles out of convenience is a noteworthy flaw, as these views are crucial for a workable FER system. A very Deep CNN (DCNN) modelling through Transfer Learning (TL) technique has already been suggested in the literature for creating a highly accurate FER system. A pre-trained DCNN model is adopted by swapping out its dense top layer(s) to suit FER, and the model is then fine-tuned with facial expression data [10].

A multi-level technique incorporating spatio-temporal information was proposed by Chu et al. in 2017 [11]. First, a CNN is utilized to extract the spatial representations, reducing the person-specific biases brought on by descriptors like HoG and Gabor. A fusion network that further aggregates the outputs of CNNs and LSTMs produces 12 AUs.

The 3D Inception with an LSTM unit, which Hasani and Mahoor (2017) [12] presented, jointly extracts the spatial and temporal relationships present in the facial images across multiple frames of a video series. This network also takes facial landmark points as inputs, stressing the relevance of facial components over facial regions, which might not be as important in producing facial expressions. Graves et al. (2008) [13] used a recurrent network to account for the temporal dependencies found in the image sequences during classification. This work demonstrated, through results on two types of LSTM, that a bidirectional network is better than a unidirectional LSTM.

Using multi-angle-based optimum configurations, Jain et al. (2017) [14] suggested a Multi-Angle Optimal Pattern-based Deep Learning (MAOP-DL) technique to address the issue of abrupt variations in lighting and determine the right creation of the feature set. In order to extract the texture patterns and the pertinent key elements of the facial points, this method first removes the backdrop and separates the foreground from the photos. Following the selective extraction of the pertinent characteristics, an LSTM-CNN is used to predict the facial expressions. Contrary to conventional methodologies, deep learning-based systems typically use deep neural networks to find the features. Utilizing deep convolutional neural networks, they extract the best features with the necessary properties from the data. To learn deep networks, it is difficult to get the needed training data for facial expressions under various settings. Additionally, deep learning-based systems need a much larger and more powerful computing system than conventional approaches to run training and testing. Therefore, it is essential to lessen the computational load placed on the deep learning algorithm during inference [14]. Traditional FER techniques employ classification algorithms like AdaBoost, SVM, and random forest, while deep learning-based FER methods significantly reduce the need for face-physics-based models and other pre-processing methods by enabling FER. Recent research has focused on a hybrid CNN-LSTM architecture for facial expressions that can outperform commonly used CNN methods using temporal averaging for aggregation. Deep-learning-based FER approaches still have some limitations, including the need for large datasets, extremely powerful computers, and plenty of memory, as well as lengthy training and testing procedures. Additionally, despite the greater performance of a hybrid architecture, micro-expressions continue to be a difficult problem to tackle since they are more spontaneous and delicate facial motions that take place unintentionally.

The foundation of Convolutional Neural Networks (CNNs), and the secret to learning complete visual representations, is convolution as the inner product. Recent CNNs have shown increased strength in representation thanks to deeper architectures. Despite these advancements, correctly training a network has become more difficult due to the increasing depth and bigger parameter space. In response to these difficulties, this proposed work suggests a novel hyperspherical mapping of weights that provides angular representations on hyperspheres. This method can efficiently encode discriminative representations, reduce training difficulty, and produce classification accuracy that is equivalent or even superior to conventional algorithms.

Face detection algorithms can be used for detecting the face from the background [15]. Facial expression detection has made extensive use of the effective machine learning technique known as Deep Multi-Task Learning (DMTL). The information of sample spatial distribution is often ignored in current deep multi-task learning algorithms, which normally only take into account the information of class labels. The Transformer architecture's applications to computer vision are still limited, despite the fact that it has emerged as the de facto standard for natural language processing tasks. Attention is either used in


Fig. 2. Overall view of the proposed architecture using CNN.

conjunction with convolutional networks in vision or is used to replace some of the constituent parts of the convolutional network while maintaining the overall structure of the network. It has been demonstrated that there is no need for this dependency on CNNs and that pure transformers applied directly to sequences of picture patches can achieve excellent results on image classification tasks. Driving safety is impacted by a driver's mood, and it is concluded that tracking a driver's emotions could improve traffic safety. However, camera-based facial expression recognition (FER) systems are greatly hindered by the challenging lighting circumstances in a vehicle cockpit. Utilizing attention-based convolutional neural networks (CNNs) locally to indicate the importance of facial areas, and combining this with global facial features and/or other complementary context information, is a recent trend in understanding facial expressions in real-world situations. However, different channels react differently in the presence of obstructions and pose alterations, and furthermore, a channel's reaction intensity varies across spatial regions. Additionally, for the purpose of defining attention, contemporary facial expression recognition (FER) designs rely on outside inputs like landmark detectors [16–18].

Larger parameter spaces have made it more difficult to train a network effectively. According to the "no-free-lunch" idea, it is challenging to use a meta-heuristic process in an effort to address every optimization problem that might possibly exist. In practice, a single meta-heuristic algorithm may be effective in optimizing specific issues in a given sector yet fall short in other fields when trying to identify the global optima. So, there is a need to search for fresh, original, and nature-inspired approaches to address problems and get better results on tuning the parameters of a CNN. A new metaheuristic algorithm based on human behavior, using the story of Ali Baba and the forty thieves, is used in this proposed work for optimizing the hyperparameters of the CNN [19–22]. These approaches reduce the number of computations and also increase accuracy while identifying micro-expressions. The proposed approach achieves better accuracy compared to the existing systems [23–28].

3. Background and proposed architecture

The facial images obtained from the dataset are cropped using the 'Chehra' [15] (https://sites.google.com/site/chehrahome/home) face detection algorithm. The obtained images are fed into the proposed architecture of CNN to categorize the facial images into six emotions. The CNN proposed is a Prioritized Prewitt Pattern-based CNN as in


Fig. 3. An illustration of how neurons {w1, …, wN ∈ R^(d+1)} are regularised to have the least amount of spherical energy possible on the unit hypersphere S^d.

Fig. 4. Hypersphere-based compression of weights.
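As a rough numerical sketch of the quantity behind Fig. 4 (our own illustration, not the authors' code), the hyperspherical energy that the compression drives down can be written as the sum of inverse pairwise distances between unit-normalized weight vectors:

```python
import numpy as np

def project_to_hypersphere(w):
    """Scale each weight vector (one per row) to unit norm, i.e. onto the unit hypersphere."""
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def hyperspherical_energy(w):
    """Sum of inverse pairwise Euclidean distances between unit neurons.

    Lower energy corresponds to neurons spread more uniformly over the
    hypersphere, which is what the regularizer minimizes.
    """
    n = w.shape[0]
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            energy += 1.0 / np.linalg.norm(w[i] - w[j])
    return energy

rng = np.random.default_rng(0)
w = project_to_hypersphere(rng.normal(size=(8, 3)))
print(hyperspherical_energy(w))  # clustered neurons would give a larger value
```

Two antipodal unit vectors, the most uniform two-neuron arrangement, give the minimum energy 1/2 for a pair; pushing any two neurons together drives the energy toward infinity.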

Fig. 1. The proposed CNN extracts both deep learning features and PPP [26] features simultaneously. There are two separate CNNs and a fusion module in this model. The first branch extracts the global features from the raw image. The second branch takes the PPP code images as input and creates an edge-cum-texture-based feature vector. The fusion module takes the dimension-reduced features and the global features as input. The global features are a representation of the integrity of the expression, whereas the second branch creates an edge and texture information-based feature vector. A reduction in dimensions is done by the two convolutional layers of CNN that make up the second branch. It does so by bringing down the size of the PPP features, which makes it easier to combine the two features in the following way.

In order to extract the PPP code image, the Prewitt edge detector uses the eight directional masks as follows. It is a conventional technique and uses the masks in eq. (1), based on first derivatives.
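The PPP code-image construction of eqs. (1)–(2) can be sketched as follows (a minimal NumPy illustration of ours; it convolves each 3 × 3 directional mask over the image interior and keeps the per-pixel maximum response, with the border handling being our own choice):

```python
import numpy as np

# The eight directional Prewitt masks of eq. (1).
PREWITT_MASKS = [np.array(m) for m in [
    [[ 1,  1,  1], [ 0,  0,  0], [-1, -1, -1]],   # P_theta1
    [[ 1,  1,  0], [ 1,  0, -1], [ 0, -1, -1]],   # P_theta2
    [[ 1,  0, -1], [ 1,  0, -1], [ 1,  0, -1]],   # P_theta3
    [[ 0, -1, -1], [ 1,  0, -1], [ 1,  1,  0]],   # P_theta4
    [[-1, -1, -1], [ 0,  0,  0], [ 1,  1,  1]],   # P_theta5
    [[-1, -1,  0], [-1,  0,  1], [ 0,  1,  1]],   # P_theta6
    [[-1,  0,  1], [-1,  0,  1], [-1,  0,  1]],   # P_theta7
    [[ 0,  1,  1], [-1,  0,  1], [-1, -1,  0]],   # P_theta8
]]

def ppp_code_image(img):
    """Maximum response-based code image of eq. (2), computed over the
    'valid' interior of the image (no padding)."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    responses = np.empty((8, h - 2, w - 2))
    for k, mask in enumerate(PREWITT_MASKS):
        for y in range(h - 2):
            for x in range(w - 2):
                responses[k, y, x] = np.sum(img[y:y + 3, x:x + 3] * mask)
    return responses.max(axis=0)
```

On a vertical step edge, the largest response comes from the left-to-right mask P_theta7, as expected for a first-derivative edge detector.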


Fig. 5. Convolutional Neural Network with Hyperparameter Optimization.

Fig. 6. AFTAO algorithm-based optimization of hyperparameters.
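Fig. 6 summarizes the AFTAO-driven tuning loop. This excerpt does not give AFTAO's update equations, so the skeleton below substitutes plain random sampling for the thief/Marjaneh moves; only the outer evaluate-and-keep-best structure reflects the figure, and the search space shown is illustrative, not the paper's exact hyperparameter set:

```python
import random

SEARCH_SPACE = {                      # illustrative hyperparameters only
    "learning_rate": (1e-4, 1e-1),    # continuous range
    "batch_size": [16, 32, 64, 128],  # discrete choices
    "n_filters": [16, 32, 64],
}

def sample_candidate(rng):
    """Draw one hyperparameter configuration from the search space."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": rng.uniform(lo, hi),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "n_filters": rng.choice(SEARCH_SPACE["n_filters"]),
    }

def tune(fitness, n_thieves=40, n_iters=20, seed=1):
    """Population search: evaluate candidates and keep the best one.

    In the paper, `fitness` would be the validation accuracy of the
    PPP-CNN trained with the candidate hyperparameters; here it is any
    callable mapping a candidate dict to a score to be maximized.
    """
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_iters):
        for _ in range(n_thieves):   # forty "thieves" per iteration
            cand = sample_candidate(rng)
            score = fitness(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score
```

A true AFTAO implementation would perturb existing candidates according to the algorithm's behavioral rules instead of resampling independently, but the interface to the CNN (candidate in, fitness out) is the same.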


Fig. 7. BYEL Architecture.
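The bootstrapping step used by the BYOL/BYEL family sketched in Fig. 7, where the target network tracks an exponential moving average of the online network, can be written in a few lines (the tau value below is illustrative):

```python
def ema_update(target_params, online_params, tau=0.99):
    """Move each target parameter toward its online counterpart:
    target <- tau * target + (1 - tau) * online."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

target = [0.0, 10.0]
online = [1.0, 0.0]
target = ema_update(target, online, tau=0.75)
print(target)  # [0.25, 7.5]
```

With tau close to 1 the target changes slowly, which is what makes the bootstrapped prediction target stable during self-supervised training.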

Pθ1 = [1 1 1; 0 0 0; −1 −1 −1]
Pθ2 = [1 1 0; 1 0 −1; 0 −1 −1]
Pθ3 = [1 0 −1; 1 0 −1; 1 0 −1]
Pθ4 = [0 −1 −1; 1 0 −1; 1 1 0]
Pθ5 = [−1 −1 −1; 0 0 0; 1 1 1]
Pθ6 = [−1 −1 0; −1 0 1; 0 1 1]
Pθ7 = [−1 0 1; −1 0 1; −1 0 1]
Pθ8 = [0 1 1; −1 0 1; −1 −1 0]    (1)

The maximum response-based code image P(x, y) is created using the Prewitt masks as in eq. (2).

P(x, y) = max(R_Pθi(x, y) | 1 ≤ i ≤ 8)    (2)

where R_Pθi is the response produced by each mask Pθi, 1 ≤ i ≤ 8. The code image generated is used for further computations. The weights of the CNN are tuned and regularized using the proposed hypersphere-based technique as in Fig. 2. Then the hyperparameters of the CNN are tuned using the Alibaba and forty thieves algorithm.

In contrast to its contrastive predecessors, BYEL does not distribute the weights as uniformly in a normalized unit space, which is the surface of a unit hypersphere [8]. As a result, BYEL can profit from the techniques that contrastive methods use to introduce feature homogeneity. In order to achieve this in BYEL, a uniformity restriction different from that supplied by [8] and derived from the contrastive loss is suggested, with the goal of maintaining the avoidance of negative sampling. A minimal hyperspherical energy weight regularisation is advised [10] to guarantee kernel/neuron consistency while being independent of and resistant to decreased batch size. By utilizing the idea of neuron redundancy reduction in the normalized unit space, the regularisation improves learned picture representations by increasing uniformity of representation.

Studies on techniques for obtaining representations via self-supervision are currently being actively carried out. In BYOL (Bootstrap Your Own Latent) [23,24], an online network forecasts how the target network will be represented. It carries out bootstrapping and changes the parameters of the target network to an exponential moving average of those of the online network. Contrast learning is carried out by MoCo (Momentum Contrast) using dictionary lookup. When the key and query are produced from the same data, learning for the query representation and key representation is carried out in the direction of increasing similarity. SimCLR (a simple framework for contrastive learning of visual representations) trains representations to function as a positive pair of two augmented picture pairs. SimCLR excelled in self-supervised learning on ImageNet, achieving the best results [24,25]. This proposed work suggests a novel hypersphere-based uniformity approach as in section 3.1.

3.1. Proposed hypersphere-based uniformity approach

It has been demonstrated by [9,14] that the loading and regularisation of weights in batch normalization are essential for effective self-


Fig. 8. Confusion matrix for the JAFFE dataset.
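Confusion matrices such as those in Figs. 8–10 count, for every true emotion class i, how often the classifier predicted class j; the diagonal holds the correct predictions. A short NumPy sketch (the labels below are toy values, not the paper's results):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Six emotion classes as in the paper; these labels are made up.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=6)
print(cm.diagonal().sum() / cm.sum())  # overall accuracy: 4/6
```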

supervision. This ensures the uniformity of neurons and avoids negative sampling constraints, which is needed to further increase the capacity of regularisation of supervised learning. In order to explicitly regularise the network and decrease the hyperspherical energy of the neurons, the hyperspherical regularisation of weights is suggested [10]. This will increase the diversity of the weights of the network and, as a result, the uniformity of the representation. Such techniques fundamentally try to eliminate undesirable redundancy in representation brought on by an uneven distribution of neurons. The results in [10] described instances of class imbalance, where the underrepresented classes were demonstrated to be clearly differentiated as a result of more evenly distributed neurons. According to [10], the hyperspherical energy of neurons can be used to measure the power of a neural representation. Because of this, minimal hyperspherical energy arrangements can encourage greater variation and improve the separability of representations.

It has been demonstrated empirically that expanding the diversity of weights inside the network leads to more widely distributed representations. The proposed approach demonstrates a significant improvement in representation uniformity. In comparison to the BYEL baseline, regularising the neurons by MHE retains reduced hyperspherical energy on its activations/representations over the whole network. It is encouraging to observe that the final output layer representations immediately show decreased hyperspherical energy and increased uniformity.

The weights are plotted onto a closed surface prior to the mapping onto the hypersphere, so that the points are repeatedly scaled as in Fig. 3. Due to the fact that the distances within the weights projected onto the hypersphere are proportional to their real distances, these points are plotted evenly across the hypersphere without overlapping. As a result, the suggested work uses the hypersphere-based compression technique, which causes an effective distribution of weights on a hypersphere without overlapping. It lessens the impact of other transformation variables in the features and enhances the numerical properties of the kernel matrix. When mapped onto the hypersphere, the Conjugate Gradient solver is utilised in order to obtain a smooth inner product. By modelling the space using a Riemannian manifold, the optimization problem is resolved [29]. Due to the limitations of the search space, there is a minimal degree of complexity and excellent numerical properties. The Riemannian metric is therefore taken into account when optimising. The tangent vector obtained on the Riemannian manifold generates a Riemannian metric as in Eq. (3).

g_kl |_x = ⟨ ∂/∂x_k , ∂/∂x_l ⟩    (3)

where the co-ordinate space is depicted as x = (x_1, x_2, …, x_m), and (∂/∂x_1, ∂/∂x_2, …, ∂/∂x_m) forms the tangent space.

If the inner product is greater for mapping onto the hypersphere, it


Fig. 9. Confusion matrix for the CK+ dataset.
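The gradient step of Eqs. (4)–(5), taken on the unit hypersphere, can be sketched as follows (our simplified illustration using a plain gradient step; the paper uses the conjugate gradient variant, and the retraction by renormalization is our choice): the Euclidean gradient is turned into a Riemannian (tangent) gradient by removing its radial component, the step is taken opposite that gradient, and the iterate is mapped back onto the sphere.

```python
import numpy as np

def riemannian_grad(x, eg):
    """Project the Euclidean gradient `eg` onto the tangent space at x (||x|| = 1)."""
    return eg - np.dot(eg, x) * x

def sphere_step(x, eg, lr=0.1):
    """One descent step: Delta x0 = -grad f(x0) as in Eq. (5), then
    retract the updated point back onto the unit hypersphere."""
    x_new = x + lr * (-riemannian_grad(x, eg))
    return x_new / np.linalg.norm(x_new)
```

By construction the Riemannian gradient is orthogonal to x, so the step moves along the sphere's surface rather than off it; the renormalization absorbs the small residual drift.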

must be decreased to ensure a smooth distribution. Equation (4) states algorithms are discussed, followed by the architecture of the proposed
that the Euclidean Gradient (EG) for the inner products is the function f model. The “no-free-lunch” (NFL) idea asserts that it is challenging to
(x). The Riemannian metric is therefore taken into account when opti­ use a single meta-heuristic algorithm to attempt to address every
mising. As shown in Equation, the tangent vector on the Riemannian possible optimization problem.
manifold generates a Riemannian metric. If the inner product between As it stands, a single meta-heuristic method may be effective at
any two points is greater, it must be minimised in order to be dispersed optimising specific issues in a certain field, but it is ineffective at
evenly when mapping onto the hypersphere. Equation (4) states that the function f represents the inner product between any two weight points, and the Euclidean Gradient (EG) of that inner product is ∇f(x):

EG = ∇f(x) (4)

This Euclidean gradient is then transformed into its Riemannian counterpart. The Riemannian gradient points in the direction of largest increase of the inner product, so, to minimise it, the search starts in the direction opposite to the Riemannian gradient:

Δx0 = − ∇x f(x0) (5)

The algorithm then moves in the direction tm after travelling in the original direction t0 = Δx0. The Conjugate Gradient method converges globally, and the algorithm is developed to update the feature space and provide a smooth inner product. Fig. 4 presents the algorithm for hypersphere-based compression of weights. The procedure is considered complete when the given tolerance condition is satisfied or the inner product becomes smooth. The new set of values is then spread over a closed hypersphere surface with smoothed inner products, so the weights are normalised and contained in a bounded region.
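The steps above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: it assumes a simple pairwise inner-product energy over the weight vectors and uses plain Riemannian gradient descent (projecting the Euclidean gradient onto the tangent space of the unit hypersphere and renormalising after each step) in place of the full conjugate-gradient update.

```python
import numpy as np

def pairwise_energy(W):
    """Sum of pairwise inner products of the rows of W (an Eq. (4)-style f);
    lower energy means the unit weight vectors are spread more evenly."""
    G = W @ W.T
    np.fill_diagonal(G, 0.0)
    return G.sum() / 2.0

def compress_on_hypersphere(W, lr=0.05, tol=1e-9, max_iter=500):
    """Riemannian descent on the unit hypersphere: project the Euclidean
    gradient onto the tangent space at each weight vector, step against
    it (Eq. (5)), then retract back onto the sphere by renormalising."""
    W = W / np.linalg.norm(W, axis=1, keepdims=True)      # start on sphere
    prev = pairwise_energy(W)
    for _ in range(max_iter):
        egrad = W.sum(axis=0) - W                # d/dw_i of pairwise energy
        radial = (egrad * W).sum(axis=1, keepdims=True) * W
        rgrad = egrad - radial                   # Euclidean -> Riemannian
        W = W - lr * rgrad                       # move against the gradient
        W = W / np.linalg.norm(W, axis=1, keepdims=True)  # retraction
        cur = pairwise_energy(W)
        if abs(prev - cur) < tol:                # tolerance condition met
            break
        prev = cur
    return W
```

Because every weight vector is constrained to the sphere, the energy can only be lowered by spreading the vectors apart, which is the uniformity the regulariser is after.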
3.2. Alibaba and forty thieves Algorithm-based CNN approach for tuning of hyperparameters

In this section, Convolutional Neural Networks and the AFTAO algorithm [30,22] are described. Metaheuristics have repeatedly shown their ability in locating the global optima in other disciplines. This has driven scholars in this field, including ourselves, to explore fresh, inventive, and nature-inspired approaches that solve, and demonstrate superior results on, present and emerging challenging real-life situations. The door is still open, and in this article a new metaheuristic algorithm is introduced based on human behaviour, using the fable of Ali Baba and the forty thieves as a model.

3.2.1. Convolution neural network (CNN)
The general structure of the CNN, the various optimization function types, epochs, batch sizes, convolutional layers, pool layers, dense layers, loss functions, and activation functions are all covered in this article. The most widely used network for tasks involving image analysis, data processing, and classification is the convolutional neural network (CNN). In general, a CNN is an artificial neural network that excels at picking up or detecting patterns and deciphering them, which makes it particularly helpful in image analysis. A CNN is a type of ANN that differs from a traditional Multi-Layer Perceptron (MLP): its convolution layers, the hidden layers of a CNN, detect patterns according to the number of filters selected in each layer. Although a CNN also uses additional non-convolutional layers, its core is made up of convolutional layers. The function of the convolution layers is to take in input and transform it before sending the transformed information to the following layer. The loss function is the function used to predict the error, and the loss is defined as the error found in the NN during training.
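Concretely, AFTAO treats one hyperparameter assignment as a candidate solution scored by a single fitness value. The sketch below is an illustrative assumption (the bounds, names and the `evaluate` callback are hypothetical, not the paper's exact search space); the real objective would train the PPP-based CNN with the decoded hyperparameters and return its validation accuracy.

```python
# Hypothetical search-space bounds for the CNN hyperparameters tuned in
# this work; names and ranges are illustrative assumptions only.
SPACE = {
    "epochs":       (1, 20),
    "batch_size":   (16, 128),
    "conv_layers":  (1, 5),
    "dense_layers": (1, 4),
    "pool_layers":  (1, 5),
}

def decode(position):
    """Round a continuous search position to a concrete hyperparameter
    set, one dimension per entry of SPACE."""
    return {name: int(round(x)) for name, x in zip(SPACE, position)}

def fitness(position, evaluate):
    """Quality of one candidate; `evaluate` stands in for training the
    CNN with these hyperparameters and returning validation accuracy."""
    return evaluate(decode(position))
```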


Fig. 10. Matrix representing the confusions in SFEW dataset.

While there are many different loss functions available, it might be challenging to select the best one for loss estimation. Among the accessible loss functions are Mean Squared Error, Sparse Categorical Cross-entropy, and Categorical Cross-entropy [31]. Fig. 5 shows the convolutional network used in our proposed work, which uses the novel hyperspherical mapping of weights and the AFTAO algorithm for the optimization of hyperparameters.

3.2.2. Alibaba and forty thieves algorithm
The story is search-based, with a group of forty robbers pursuing Ali Baba [22]. The ultimate objective of the gang is to capture Ali Baba in order to take revenge and recover their fortune. The gang's search for Ali Baba is repetitive, going through numerous rounds, and consolidates the answers discovered in earlier iterations. The search is based on the aggregate behaviour of the population of thieves. The other primary character of the story, Marjaneh, used countermeasures in each iteration to stop the gang from carrying out their search mission. The large city in which Ali Baba resides stands for the search area. The story demonstrates the success of the forty thieves in finding Ali Baba and locating his home: based on cunning maneuvers and strategies, the looters were able to discover Ali Baba and reach their strategic goal. The gang's assistant captain approached the town in the first trial dressed as a foreigner, hoping to overhear a conversation or discover some information that might lead him to Ali Baba. In his effort to carry out the group's ultimate goal, he made it to Ali Baba's home and left an "X" mark on his door. That plot was foiled when Marjaneh noticed the mark and put identical marks on all of the neighborhood's doors in response. When a different captain's assistant accepted the assignment, the second trial began. He expanded on the methods his fellow soldier had already used; this is utilized in our method by making each iteration of the search build upon the best solution discovered thus far in earlier rounds. This time, the thief used a sign that was difficult to spot by accident to mark Ali Baba's home, which again reflects the use and improvement of previously discovered solutions in AFTAO. As a result, the suggested method has robust exploration. The third incident happened when the captain changed the course of events and took on the responsibility of apprehending Ali Baba himself, while still building on the successes made thus far by his two helpers. After successfully entering Ali Baba's home, the captain and his followers attempted to attack him without being seen, but they were unsuccessful. The proposed algorithm makes use of the robbers' strategies to increase exploration efficiency; but in each of these strategies, Marjaneh, Ali Baba's shrewd maid, saved his life.

The main objective of this work is to introduce a new optimization technique that mimics the story of Ali Baba and the forty thieves as a coordinated model of human social interaction. The main presumptions of this algorithm are satisfied by the following rules deduced from the story. To locate Ali Baba's home, the forty thieves band together and seek out advice from others or from one another; the accuracy of this information is uncertain. The forty thieves travel a distance from a starting point until they can reach Ali Baba's home. Marjaneh can trick the thieves numerous times, using cunning ways to protect Ali Baba from their arrival. It is possible to link the actions of Marjaneh and the thieves to an objective function that needs to be optimized, which makes the development of the new meta-heuristic method described below possible.

3.2.3. Random initialization
The AFTAO algorithm is started by initialising the positions of n individuals at random in a d-dimensional search space, as shown in Eq. (6).


Fig. 11. Matrix representing the confusions in Oulu-CASIA dataset.

p =
⎡ p11 p12 p13 ⋯ p1d ⎤
⎢ p21 p22 p23 ⋯ p2d ⎥
⎢ p31 p32 p33 ⋯ p3d ⎥
⎢ p41 p42 p43 ⋯ p4d ⎥
⎢ ⋯⋯⋯⋯⋯ ⎥
⎣ pn1 pn2 pn3 ⋯ pnd ⎦ (6)

where pij is the jth dimension of the ith thief's position, d is the number of variables in the given problem, and p is the position matrix of all thieves. The starting position of the population can be generated, dimension by dimension, as

pij = lj + r × (uj − lj) (7)

where r is a uniformly distributed random number between 0 and 1, lj and uj stand for the lower and upper bounds in the jth dimension, respectively, and pi is the position of the ith thief, which represents a proposed solution to the problem. Marjaneh's wit ranking in relation to all thieves can be initialised in the same way:

ma =
⎡ ma11 ma12 ma13 ⋯ ma1d ⎤
⎢ ma21 ma22 ma23 ⋯ ma2d ⎥
⎢ ma31 ma32 ma33 ⋯ ma3d ⎥
⎢ ma41 ma42 ma43 ⋯ ma4d ⎥
⎢ ⋯⋯⋯⋯⋯ ⎥
⎣ man1 man2 man3 ⋯ mand ⎦ (8)

where maij indicates the astuteness level of Marjaneh with respect to the ith thief in the jth dimension.

The decision-variable values are passed to a fitness function for evaluating each position. The quality of the solution at each thief's new location is assessed in the AFTAO algorithm simulation using a predefined fitness function. If the new location's solution is better than the old one, the location is updated; if a thief's existing position is more advantageous than the new one, he or she stays there.

3.2.4. The suggested mathematical model
The decision-variable values are passed to a user-defined fitness function, which is then assessed for each thief's position. The associated fitness values are saved in an array with the following format:

F =
⎡ f(p11, p12, p13, ⋯, p1d) ⎤
⎢ f(p21, p22, p23, ⋯, p2d) ⎥
⎢ f(p31, p32, p33, ⋯, p3d) ⎥
⎢ f(p41, p42, p43, ⋯, p4d) ⎥
⎢ ⋯⋯⋯⋯⋯ ⎥
⎣ f(pn1, pn2, pn3, ⋯, pnd) ⎦ (9)

where F holds the fitness value of each thief and pnd is the dth dimension of the position of the nth thief. The quality of the solution at each thief's new location is assessed in the AFTAO algorithm simulation using the predefined fitness function. If the new location's solution quality is higher than the old one, the location is updated; if his present solution quality is better than the new one, each thief stays where he is, as in Fig. 6.
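Eqs. (6)–(9) translate directly into a few lines of array code. The sketch below is an illustrative assumption, not the authors' code: it initialises the thieves' (and, drawn the same way, Marjaneh's) matrices with Eq. (7) and evaluates the fitness array of Eq. (9).

```python
import numpy as np

def initialise(n, d, lower, upper, rng):
    """Eq. (7): p[i, j] = l_j + r * (u_j - l_j) with r ~ U(0, 1).
    Builds the n-by-d position matrix of Eq. (6); Marjaneh's wit
    matrix ma of Eq. (8) can be drawn in the same way."""
    return lower + rng.random((n, d)) * (upper - lower)

def fitness_array(p, f):
    """Eq. (9): the predefined fitness function applied to every row."""
    return np.array([f(row) for row in p])
```

For example, with bounds lower = [1, 16] and upper = [20, 128], the two columns could encode an epoch count and a batch size.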
robbers look for Ali Baba. In each instance, it is supposed that the thieves


Fig. 12. Matrix representing the confusions in MMI dataset.

3.2.5. Mathematical models
As was previously mentioned, three basic situations could arise as the robbers look for Ali Baba. In each instance, it is supposed that the thieves peruse the immediate area, yet there is likewise a portion of the search that happens as a result of Marjaneh's intellect, which drives the thieves to peruse arbitrary regions. The following mathematical model can be used to represent this searching behaviour.

Case 1. The thieves might use information they learned from someone to find Ali Baba. In this instance, the thieves' new whereabouts can be discovered as follows:

pit+1 = gbestt + [Tdt (bestit − yit) r1 + Tdt (yit − mab(i)t) r2] sgn(random − 0.5); r3 ≥ 0.5, r4 > Ppt (10)

b = Γ((n − 1) · random(n, 1)) (11)

mab(i)t = { pit, if f(pit) ≥ f(mab(i)t); mab(i)t, if f(pit) < f(mab(i)t) } (12)

Tdt = α0 e^(−α1 (t/T)^α1) (13)

Case 2. Upon realising their mistake, the robbers might start frantically looking for Ali Baba. The following can be used to determine the current location of the thieves in this case:

pit+1 = Tdt [(uj − lj) random + lj]; r3 ≥ 0.5, r4 ≤ Ppt (14)

Case 3. This work advances the search in positions other than those that could be found using Eq. (10), in order to improve the exploration and exploitation features of the AFTAO algorithm. In this occurrence, the new whereabouts of the thieves can be obtained as follows:

pit+1 = gbestt − [Tdt (bestit − yit) r1 + Tdt (yit − mab(i)t) r2] sgn(random − 0.5); r3 < 0.5 (15)

According to the algorithm in Fig. 6, AFTAO starts out by generating a collection of hyperparameter values (potential solutions) at random while taking into account the upper and lower bounds. The best values and the overall best value are then initialised, as well as Marjaneh's cunning plots. The appropriateness of each constructed solution is reassessed in each iteration to determine the best solution, and the quality of each solution is evaluated using a predefined fitness function. The algorithm iteratively determines the new values of the parameters in each dimension. Each new value is given a feasibility test to see whether it has moved outside the search region; in that situation it is returned to the boundary using the AFTAO simulation procedures. The new values, the best values, the global best values, and Marjaneh's cunning plans are then evaluated and revised as necessary. With the exception of the initialization stages, all of the AFTAO processes outlined in the algorithm are carried out iteratively until the termination condition is met. The optimal values of the hyper-parameters are ultimately returned as the answer to the optimization problem.

3.2.6. Utilizing AFTAO's potential
The following essential variables can be used to execute local search and exploitation in AFTAO:
Tdt: As iteration progresses, exploitation gradually replaces exploration. Local searches are conducted in promising parts of the search space when Tdt takes small values. Therefore, in the last iterations, if the thieves are close to Ali Baba's home, the position-updating process aids the local search around the best solution, which results in exploitation.


Fig. 13. Matrix representing the confusions in MUG dataset.

The probability of getting stuck in local optima is decreased, and the likelihood of attaining the global optimum increased, by choosing proper values for α0 and α1 in Tdt. Based on the experimental results, α0 = 1 and α1 = 2 provide a fair mix between exploration and exploitation.

Ppt: This parameter regulates the exploitation feature by quantifying the amount of exploitation [22]. As iterations go by, the exploitation stage gets more intense for relatively high values of this parameter. As a result, the position-updating procedure improves AFTAO's capacity to locally intensify the space searches, which leads to additional exploitation. Although the values for β0 and β1 in Ppt were chosen rather arbitrarily, they were based on pilot testing on a substantial number of test functions. The candidates are widely separated from one another in the first iterations. AFTAO is better able to avoid becoming trapped in local-optimum stagnation when the parameter Ppt is updated, and it comes closer to the global optimum. Empirical research has shown that β0 = 1 and β1 = 2 present a fair balance between exploration and exploitation.

sgn(rand − 0.5): This term modifies the search direction to adjust the quality of exploitation. There is an equal chance of a negative and a positive sign, since rand has a uniform distribution between 0 and 1.

Table 1
Performance validation of the proposed hypersphere-based compression of weights.

Techniques F-Score
ResNet50 0.80
BYOL 0.83
BYEL 0.85
Pre-trained transfer learning 0.74
Proposed hypersphere-based method 0.96

3.2.7. BYEL
The BYEL architecture is given in Fig. 7. The target network (hθ′, gθ′), which serves as the label for BYEL's learning, is updated solely through hθ, gθ, as in BYOL, and not by Lbyel. Here h is the feature extractor (ResNet50 was utilised), and t and t′ are BYOL-like augmentation functions. Additionally, gθ and qθ are projection and prediction layers, respectively; both are MLPs made up of linear + batch normalisation + ReLU + linear layers.

An Emotion Classifier Eθ has been implemented to allow for emotion-aware training. Eθ contains a matrix WE with one vector per emotion class; WE transforms the prediction qθ(z) into a class of emotions.
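The emotion-removal step that this classifier enables can be sketched as follows. This is a hedged illustration with assumed shapes (WE stored here with one column per emotion class): the learned class vector is subtracted from the prediction qθ(z) and from the projection z′ so that the self-supervised loss compares emotion-agnostic vectors.

```python
import numpy as np

def strip_emotion(vec, W_E, class_idx):
    """Subtract the learned emotion-class vector W_E[:, class_idx] from
    a prediction q(z) or projection z', leaving an emotion-agnostic
    vector; the (feature_dim, num_emotions) layout is an assumption."""
    return vec - W_E[:, class_idx]
```

For instance, if the ground-truth emotion of Xsyn is fear and fear is column 2, both strip_emotion(q_z, W_E, 2) and strip_emotion(z_prime, W_E, 2) would be fed to the BYOL-style similarity loss.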


Fig. 14. Accuracy achieved for JAFFE using proposed CNN.

By using Lemotion to calculate the output emotion prediction and the cross-entropy against the emotion label corresponding to the ground truth, the matrix WE is updated. As WE is trained, each column turns into a vector reflecting the corresponding emotion label. By deducting the emotion information from the prediction qθ(z) of Xsyn using these columns of WE, a prediction vector without the emotion information can be produced. In a similar manner, the emotion vector may be removed from the target network's z′ value to get the projection vector without the emotion data. In order to make qθ(z) and z′ conduct emotion-aware self-supervised learning for Xsyn, they are trained to be comparable without taking emotion information into account. Fig. 7 illustrates how to update hθ, gθ and qθ and then compute Lbyel if the ground-truth emotion is fear: the WE[:,2] vector associated with fear is subtracted from both q(z) and z′ [23].

Fig. 15. Performance assessment based on training and testing loss.

4. Results and discussion

The tests make use of six datasets [32–40]: JAFFE, Cohn-Kanade (CK+), Multimedia Understanding Group (MUG), Static Facial Expressions in the Wild (SFEW), the Oulu-CASIA NIR&VIS facial expression database (OULU-CASIA), and Man Machine Interface (MMI). There are 213 facial expression photos from 10 people in the JAFFE collection. The photos are 256 × 256 in size, and the dataset includes labels for the photos (Lyons et al., 1998). The pictures are organised into seven different emotion categories. There are two versions of the Cohn-Kanade dataset, CK and CK+; compared to CK, the facial representations in CK+ are more emotive. The facial expression photos are 640 × 490 in size. In CK+, 593 image sequences from 123 people are accessible. Only the most expressive pictures are used in the experiments; the neutral emotion in each series is offered first. The experiments make use of 1281 photos. The MUG dataset consists of image sequences in which the first image is the neutral-emotion image. There are 60 pictures in each series, the facial expression pictures have an 896 × 896 pixel resolution, and 1462 picture sequences from 86 people were collected; the experimental analysis uses 81 photos from each class. Images in SFEW are captured in an unrestricted environment, in a variety of environmental settings, at a resolution of 720 × 576. There are 436 photos for testing and 958 for training, and the experiments employ subject-independent testing and two-fold cross-validation. Images of 80 people were taken for the Oulu-CASIA project (Oulu, the Machine Vision Group, the Chinese Academy of Sciences, and the Institute of Automation); five to six of the most emotive photographs from each of the 480 sequences are picked for the experiments, and 500 photos from each class are selected. Thirty people contributed 312 series of photographs; three to four photographs with exceptional expressions are chosen from each series, and the final picture in each series portrays a neutral feeling. The experiments that are carried out are subject-independent [41–47,2].


Fig. 16. Running time of various optimization algorithms.

Fig. 17. The accuracy of the proposed method using JAFFE dataset.

Figs. 8 to 13 demonstrate the effectiveness of the suggested strategy by listing the number of samples that were correctly categorized as well as the number that were incorrectly classified. The neutral and sad expressions confound the classification of the JAFFE dataset photos when predicting other images. Anger shows comparatively lower classification accuracy in the CK+ dataset. Expressions like neutral, happiness, and surprise are mixed up with other emotions in the MUG dataset, among others. The fundamental issue with the SFEW dataset is that the photos were taken in an unrestricted environment and the samples were out of balance; therefore, more training data is required to increase accuracy. Because the PPP code image can recognize sharp edges and has scale-invariant and rotation-invariant properties, using it together with the deep-learning features the proposed technique achieves greater accuracy than the other existing techniques.
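Since the PPP descriptor builds on Prewitt edge responses, the underlying edge-strength computation can be sketched as below. This is a minimal illustration only; the actual PPP coding (prioritisation and pattern encoding) is not reproduced here.

```python
import numpy as np

PREWITT_X = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=float)
PREWITT_Y = PREWITT_X.T

def prewitt_magnitude(img):
    """Horizontal and vertical Prewitt responses by direct 3x3
    correlation over the valid region, combined into edge strength."""
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * PREWITT_X).sum()
            gy[i, j] = (patch * PREWITT_Y).sum()
    return np.hypot(gx, gy)
```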


The disgust expression is typically mislabeled. The fear and sadness facial expressions cause uncertainty in relation to the other facial expressions in the Oulu-CASIA and MMI datasets.

The F-Scores in Table 1 show that the proposed hypersphere-based CNN achieves the best results compared to existing techniques like ResNet50, BYOL, BYEL and pre-trained transfer learning: the proposed hypersphere-based regularization technique achieves the highest F-Score of 0.96. The hyperspherically regularized CNN also achieves high accuracy for JAFFE compared to the CNNs proposed in the literature, as in Fig. 14.

The outcomes of the proposed AFTAO approach are then contrasted with those from other CNN models based on cutting-edge nature-inspired algorithms, such as the Gravitational Search Algorithm (GSA), Grey Wolf Optimization (GWO), Artificial Bee Colony (ABC) algorithm, Particle Swarm Optimization (PSO), Whale Optimization Algorithm (WOA), Genetic Algorithm (GA), and Cuckoo Search Algorithm (CSA) [37]. The training error and testing error of AFTAO are small when compared to the other existing optimization algorithms. The running times of the algorithms are also compared, and AFTAO achieves a lower running time than the other techniques, as in Figs. 15 and 16.

The accuracy of the proposed approach on the JAFFE dataset is plotted in Fig. 17. The results show that the proposed approach achieves the best results.

The CNN hyper-parameters were chosen via AFTAO optimization. Table 2 lists the hyper-parameters present in the CNN that the AFTAO algorithm selected.

Table 2
The hyper-parameters obtained using the proposed AFTAO.

Optimizer AFTAO
Epochs 8
Size of each batch 64
Count of convolution layers 3
Count of dense layers 2
Count of pool layers 3
Activation function used in output layer Softmax
Activation function used in intermediate layers ReLU

The hyperspherical energy of standard BYOL and of the proposed system of hyperspherical mapping, measured at an intermediate layer of the CNN during training on the MMI dataset, is given in Fig. 18. Empirically, it is demonstrated that the proposed system of hyperspherical compression of weights retains lower hyperspherical energy on its representations across the entire network as compared to the BYOL baseline. It is encouraging to note that the final output-layer representations show immediately lower hyperspherical energy and increased uniformity. The empirical results in Fig. 18 confirm the proposed theory and justification, according to which a greater diversity of weights results in representations that are more evenly distributed.

The six datasets employed in the cross-database experiments were the JAFFE, CK+, SFEW, MMI, MUG and OULU-CASIA datasets; names for each group in Table 3 are assigned based on the test-set database that was used. The JAFFE database, which only includes Japanese female participants, is racially and ethnically quite biased, and results from the JAFFE database did not meet expectations. The accuracy of SFEW in the cross-database experiment is low since its photos were taken in incredibly unrestricted conditions, making it challenging to compare them to other datasets.

Table 3
Accuracy obtained using cross-database experiments.

Dataset Accuracy (%)
JAFFE 51.1
CK+ 90.2
SFEW 51.2
MMI 69.5
MUG 71.2
OULU-CASIA 68.6

The above comparison of results with the state-of-the-art methods [42–49] indicates that the proposed approach achieves better performance on all six datasets. The multitask network identifies some attributes like gender, age of the person, and posture of the head; the proposed method identifies the emotions using only the facial expression data. In a future enhancement of the proposed work, this novel feature vector, PPP, will be concatenated with some other feature descriptors and the results will be analysed.
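The hyperspherical-energy curves of Fig. 18 can be reproduced in spirit with a Riesz-style pairwise energy. The exact metric is an assumption here; the paper's definition may differ in exponent or normalisation.

```python
import numpy as np

def hyperspherical_energy(W, s=1.0, eps=1e-12):
    """Riesz s-energy of the row-normalised vectors: the sum of inverse
    pairwise distances on the unit sphere. Lower values mean the
    vectors are spread more uniformly (cf. Fig. 18)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    diff = Wn[:, None, :] - Wn[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(W), k=1)
    return float((1.0 / (dist[iu] ** s + eps)).sum())
```

Logging this quantity once per training epoch for an intermediate layer's weights yields an energy-versus-iterations curve like the one in Fig. 18.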

Fig. 18. MHE Vs. Iterations.


Table 4
Comparison with some other existing techniques.

Technique Dataset Classification accuracy (%)
2-channel convolutional neural network [42] JAFFE 95.8
Boosted deep belief network [43] JAFFE 91.8
Soft Locality Preserving Map [44] JAFFE 91.8
End-to-End Multi-Pose [45] JAFFE 95.8
Proposed JAFFE 99.5
Occlusion-aware facial expression recognition [46] CK+ 97.2
De-expression residue learning [47] CK+ 97.3
Effective multitask network [48] CK+ 98.9
Soft Locality Preserving Map CK+ 96.1
ROI-guided deep architecture [49] CK+ 94.6
Deep multi-task learning [50] CK+ 97.8
Discriminative deep multi-task learning [51] CK+ 99.6
Proposed CK+ 99.1
De-expression residue learning MUG 95.4
Interpersonal relation prediction [48] MUG 96.3
Proposed MUG 99.2
Interpersonal relation prediction SFEW 38.9
Soft Locality Preserving Map SFEW 39.5
Spatio-channel attention [52] SFEW 58.93
Dynamic multi-channel metric network [53] SFEW 54.39
Proposed SFEW 75.8
A graph-structured representation with BRNN [54] OULU-CASIA 93.0
De-expression residue learning OULU-CASIA 88.0
Deep multi-task learning OULU-CASIA 86.9
Proposed OULU-CASIA 95.10
A graph-structured representation with BRNN [54] MMI 85.4
De-expression residue learning MMI 87.4
Deep multi-task learning [50] MMI 75.3
Discriminative deep multi-task learning [51] MMI 67.7
Proposed MMI 92.0

The recognition results are compared in Table 4. The Prewitt operator used in the PPP patterns has some issues when employed for image edge detection: it has weak anti-noise performance and a low detection effect, so a few erroneous edges will be discovered throughout the detection process when this method is utilised. An image edge-detection optimization approach is required to combat these issues. This is the proposed work's limitation, and it will be removed in subsequent research works.

5. Conclusion

Traditional FER employs classification algorithms like SVM, random forest, and Adaboost. However, deep-learning-based FER approaches significantly reduce the reliance on face-physics-based models and other pre-processing methods by enabling "end-to-end" FER. A CNN architecture for facial expressions can outperform other machine-learning algorithms, but the manual selection of hyperparameters is a difficult process. Deep-learning-based FER systems still have a number of limitations, including the need for large datasets, super-powerful computers, and a lot of memory, as well as the fact that both the training and testing processes are time-consuming. The AFTAO algorithm helps in the optimization of hyperparameters, and the hypersphere-based regularization of weights helps in reducing the time needed for computations, thus leading to better accuracy in less time. Additionally, despite the greater performance of an optimized architecture, micro-expressions continue to be a difficult problem to tackle, since they are more spontaneous and delicate facial motions that take place unintentionally; this will be overcome in our future research.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

References

[1] A.S. Alphonse, D. Dharma, A novel Monogenic Directional Pattern (MDP) and pseudo-Voigt kernel for facilitating the identification of facial emotions, J. Vis. Commun. Image Represent. 1 (49) (2017 Nov) 459–470.
[2] P. Giannopoulos, I. Perikos, I. Hatzilygeroudis, Deep learning approaches for facial emotion recognition: a case study on FER-2013, in: Advances in Hybridization of Intelligent Methods, Springer, Cham, 2018, pp. 1–16.
[3] M.K. Chowdary, T.N. Nguyen, D.J. Hemanth, Deep learning-based facial emotion recognition for human–computer interaction applications, Neural Comput. Applic. 22 (2021 Apr) 1–8.
[4] D.K. Jain, P. Shamsolmoali, P. Sehdev, Extended deep neural network for facial emotion recognition, Pattern Recogn. Lett. 1 (120) (2019 Apr) 69–74.
[5] A. Jaiswal, A.K. Raju, S. Deb, Facial emotion detection using deep learning, in: 2020 International Conference for Emerging Technology (INCET), IEEE, 2020, pp. 1–5.
[6] E. Pranav, S. Kamal, C.S. Chandran, M.H. Supriya, Facial emotion recognition using deep convolutional neural network, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), IEEE, 2020, pp. 317–320.
[7] M. Hu, H. Wang, X. Wang, J. Yang, R. Wang, Video facial emotion recognition based on local enhanced motion history image and CNN-CTSLSTM networks, J. Vis. Commun. Image Represent. 1 (59) (2019 Feb) 176–185.
[8] N. Mehendale, Facial emotion recognition using convolutional neural networks (FERC), SN Appl. Sci. 2 (3) (2020 Mar) 1–8.
[9] W. Mellouk, W. Handouzi, Facial emotion recognition using deep learning: review and insights, Procedia Comput. Sci. 1 (175) (2020 Jan) 689–694.
[10] M.A. Akhand, S. Roy, N. Siddique, M.A. Kamal, T. Shimamura, Facial emotion recognition using transfer learning in the deep CNN, Electronics 10 (9) (2021 Jan) 1036.
[11] W.S. Chu, F.D. Torre, J.F. Cohn, Learning spatial and temporal cues for multi-label facial action unit detection, in: Proceedings of the 12th IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 30 May–3 June 2017, pp. 1–8.
[12] B. Hasani, M.H. Mahoor, Facial expression recognition using enhanced deep 3D convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Hawaii, HI, USA, 21–26 July 2017, pp. 1–11.
[13] A. Graves, C. Mayer, M. Wimmer, J. Schmidhuber, B. Radig, Facial expression recognition with recurrent neural networks, in: Proceedings of the International Workshop on Cognition for Technical Systems, Santorini, Greece, 6–7 October 2008, pp. 1–6.
[14] D.K. Jain, Z. Zhang, K. Huang, Multi angle optimal pattern-based deep learning for automatic facial expression recognition, Pattern Recognit. Lett. 1 (2017) 1–9.
automatic facial expression recognition, Pattern Recognit. Lett. 1 (2017) 1–9
[CrossRef].


[15] A. Asthana, S. Zafeiriou, S. Cheng, M. Pantic, Incremental face alignment in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1859–1866.
[16] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929, 2020.
[17] Y. Cui, Y. Ma, W. Li, N. Bian, G. Li, D. Cao, Multi-EmoNet: a novel multi-task neural network for driver emotion recognition, IFAC-PapersOnLine 53 (5) (2020) 650–655.
[18] T. Kanade, J.F. Cohn, Y. Tian, Comprehensive database for facial expression analysis, in: Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580), 2000, pp. 46–53.
[19] M. Koppen, D.H. Wolpert, W.G. Macready, Remarks on a recent paper on the "no free lunch" theorems, IEEE Trans. Evolut. Comput. 5 (3) (2001) 295–296.
[20] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1) (1997).
[21] W. Yong, W. Tao, Z. Cheng-Zhi, H. Hua-Juan, A new stochastic optimization approach: dolphin swarm optimization algorithm, Int. J. Comput. Intell. Appl. 15 (02) (2016) 1650011.
[22] M. Braik, M.H. Ryalat, H. Al-Zoubi, A novel meta-heuristic algorithm for solving numerical optimization problems: Ali Baba and the forty thieves, Neural Comput. Applic. 34 (1) (2022 Jan) 409–455.
[23] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B.A. Pires, Z. Guo, M.G. Azar, et al., Bootstrap your own latent: a new approach to self-supervised learning, Adv. Neural Inf. Proces. Syst. 33 (2020) 21271–21284.
[24] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[25] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International Conference on Machine Learning, PMLR, 2020, pp. 1597–1607.
[26] S. Balochian, H. Baloochian, Edge detection on noisy images using Prewitt operator and fractional order differentiation, Multimed. Tools Appl. 81 (7) (2022 Mar) 9759–9770.
[27] A. Ramirez Rivera, J. Rojas Castillo, O. Chae, Local Directional Number Pattern
[34] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[35] M. Pantic, M. Valstar, R. Rademaker, L. Maat, Web-based database for facial expression analysis, in: Multimedia and Expo, IEEE International Conference, 2005, p. 5.
[36] M. Valstar, M. Pantic, Induced disgust, happiness and surprise: an addition to the MMI facial expression database, in: Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, 2010, p. 65.
[37] R. Rai, K.G. Dhal, Recent developments in Equilibrium Optimizer algorithm: its variants and applications, Arch. Comput. Meth. Eng. 12 (2023 Apr) 1–54.
[38] N. Aifanti, C. Papachristou, A. Delopoulos, The MUG facial expression database, in: Proc. 11th Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Desenzano, Italy, April 12–14, 2010.
[39] A. Dhall, R. Goecke, J. Joshi, K. Sikka, T. Gedeon, Emotion recognition in the wild challenge 2014: baseline, data and protocol, ACM ICMI, 2014.
[40] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, Collecting large, richly annotated facial expression databases from movies, IEEE MultiMedia 19 (2012) 34–41.
[41] G. Zhao, X. Huang, M. Taini, S.Z. Li, M. Pietikäinen, Facial expression recognition from near-infrared videos, Image Vision Comput. 29 (9) (2011) 607–619.
[42] D. Hamester, P. Barros, S. Wermter, Face expression recognition with a 2-channel convolutional neural network, in: Neural Networks (IJCNN), 2015 International Joint Conference on, IEEE, 2015, pp. 1–8.
[43] P. Liu, S. Han, Z. Meng, Y. Tong, Facial expression recognition via a boosted deep belief network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
[44] C. Turan, K.M. Lam, X. He, Soft Locality Preserving Map (SLPM) for facial expression recognition, arXiv preprint arXiv:1801.03754, 2018.
[45] W. Wang, Q. Sun, T. Chen, A fine-grained facial expression database for end-to-end multi-pose facial expression recognition, arXiv preprint arXiv:1907.10838, 2019.
[46] Y. Li, J. Zeng, S. Shan, X. Chen, Occlusion aware facial expression recognition using CNN with attention mechanism, IEEE Trans. Image Process. 28 (5) (2018) 2439–2450.
[47] H. Yang, U. Ciftci, L. Yin, Facial expression recognition by de-expression residue learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2168–2177.
for Face Analysis: Face and Expression Recognition,”, Image Processing, IEEE [48] Z. Zhang, P. Luo, C.L. Chen, X. Tang, From facial expression recognition to
Trans. 22 (5) (May 2013) 1740–1752, https://doi.org/10.1109/ interpersonal relation prediction, Int. J. Comput. Vis. 126 (5) (2018) 1–20.
TIP.2012.2235848. [49] X. Sun, P. Xia, L. Zhang, L. Shao, A ROI-guided deep architecture for robust facial
[28] A.R. Rivera, J.R. Castillo, O. Chae, Local directional texture pattern image expressions recognition, Inf. Sci. 1 (522) (2020 Jun) 35–48.
descriptor, Pattern Recogn. Lett. 51 (2015) 94–100. [50] R. Zhao, T. Liu, J. Xiao, D.P. Lun, K.M. Lam. Deep multi-task learning for facial
[29] N. Boumal, B. Mishra, P.A. Absil, R. Sepulchre, Manopt, a Matlab toolbox for expression recognition and synthesis based on selective feature sharing. In2020
optimization on manifolds, J. Mach. Learn. Res. 15 (1) (2014) 1455–1459. 25th International Conference on Pattern Recognition (ICPR) 2021 Jan 10 (pp.
[30] M. Braik, Enhanced Ali Baba and the forty thieves algorithm for feature selection, 4412-4419). IEEE.
Neural Comput. Applic. 35 (8) (2023 Mar) 6153–6184. [51] H. Zheng, R. Wang, W. Ji, M. Zong, W.K. Wong, Z. Lai, H. Lv, Discriminative deep
[31] R. Marazzato, A.C. Sparavigna. Astronomical image processing based on fractional multi-task learning for facial expression recognition, Inf. Sci. 1 (533) (2020 Sep)
calculus: The AstroFracTool. arXiv 2009, arXiv:0910.4637. 60–71.
[32] P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews. The extended [52] D. Gera, S. Balasubramanian, Landmark guidance independent spatio-channel
cohn-kanade dataset (ck+): A complete dataset for action unit and emotion- attention and complementary context information based facial expression
specified expression. IEEE Computer Society Conference on Computer Vision and recognition, Pattern Recogn. Lett. 1 (145) (2021 May) 58–66.
Pattern Recognition Workshops (CVPRW) (2010), pp 94–101. [53] Y. Liu, W. Dai, F. Fang, Y. Chen, R. Huang, R. Wang, B. Wan, Dynamic multi-
[33] M. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba. Coding facial expressions with channel metric network for joint pose-aware and identity-invariant facial
gabor wavelets. Third IEEE International Conference on Automatic Face and expression recognition, Inf. Sci. 1 (578) (2021 Nov) 195–213.
Gesture Recognition (1998), pp 200–205. [54] Zhong, Lei, et al., A graph-structured representation with BRNN for static-based
facial expression recognition, 2019 14th IEEE International Conference on
Automatic Face & Gesture Recognition (FG 2019), IEEE, 2019.