
Emotion Detection in Social Robotics: Empath-Obscura - An Ensemble Approach with Novel Face Augmentation Using SPIGA

Debajyoti Dasgupta, Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India, debajyotidasgupta6@gmail.com
Arijit Mondal, Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India, arijit@iitp.ac.in
Partha P. Chakrabarti, Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India, ppchak@cse.iitkgp.ac.in

Abstract—Emotion recognition is a key component of human-computer interaction in social robotics. In this paper, we present Empath-Obscura, an ensemble model designed to detect emotions in obfuscated faces. The model combines the object detection models YOLO V5 and V8 with the well-established POSTER++ facial emotion recognition model. A significant contribution of this work is the development of a novel data augmentation technique that utilizes SPIGA, a shape-preserving facial landmark detection model, to selectively obscure facial features. This approach enhances the model's robustness against partially hidden facial expressions, improving the performance of the overall model by 13.18%. Empath-Obscura is validated on the FER-2013 dataset, which is well-suited for this study due to its representation of low-resolution and poor-quality facial images. A manually obfuscated and annotated test set further ensures accurate evaluation. The ensemble model achieved an accuracy of 69.3%, outperforming the individual models. The results presented in this paper, along with the innovations in our ensemble and data augmentation techniques, offer a significant contribution to the fields of social robotics and emotion recognition. This work provides researchers and practitioners with a robust and reliable tool for emotion detection from obfuscated faces, contributing to advancements in human-computer interaction for social robotics.

Index Terms—Emotion Recognition, Social Robotics, Data Augmentation, Obfuscated Faces, Ensemble Model, Human-Computer Interaction, Computer Vision

Fig. 1. Samples of the FER-2013 [5] emotion dataset with obfuscated eyes.

I. INTRODUCTION

Emotion recognition is a critical aspect of human-computer interaction, especially in social robotics, where robots are designed to interact and communicate with humans in a natural and socially acceptable manner [1], [2]. Since the inception of social robotics in the late 1990s [3], a plethora of methods and models have been developed to enable robots to understand and respond to human emotions. One of the primary methods for emotion recognition involves analyzing facial expressions, as they provide essential non-verbal cues about a person's emotional state [4]. However, real-world scenarios often involve obfuscated faces, for example, due to poor lighting, low-resolution images, or occlusions, which pose significant challenges to emotion recognition.

Over the past few decades, there has been a gradual shift in the techniques used for facial emotion recognition. Traditional approaches primarily relied on hand-crafted features, such as facial landmarks, and machine learning models like support vector machines [6]. The advent of deep learning in the 2010s revolutionized emotion recognition, leading to the development of powerful models such as convolutional neural networks (CNNs) that can automatically learn features from raw data [7]. Since then, several deep learning models have been proposed for facial emotion recognition, including the POSTER++ model [8], which is known for its high accuracy.

The recent introduction of object detection models like YOLO V5 [9] and V8 [10] has further advanced the field of emotion recognition. These models are capable of detecting multiple objects in an image and are well-suited for detecting facial features and emotions. The use of ensemble models, which combine the predictions of multiple models, has also become increasingly popular for emotion recognition [11].

Data augmentation is another essential aspect of emotion recognition, especially when dealing with obfuscated faces. Techniques such as rotation, scaling, and cropping have been widely used to augment data [12]. In this paper, we introduce a novel data augmentation technique using SPIGA [13], a shape-preserving facial landmark detection model, to selectively obscure facial features, thereby improving the model's robustness against partially hidden facial expressions.
In today's world, where human-robot interactions are becoming increasingly prevalent, accurate emotion recognition is of utmost importance. Social robots are used in various applications, including healthcare, education, and customer service, where understanding and responding to human emotions can significantly enhance the quality of service and user experience [14]. In this study, we delve into the crucial task of emotion recognition from facial expressions, essential for human-robot interactions.

In this paper, we present Empath-Obscura, an ensemble model built from a weighted combination of the POSTER++, YOLO V5, and YOLO V8 architectures. The model is designed to decode emotions from faces even when crucial features are obscured. What distinguishes Empath-Obscura from its counterparts is our data augmentation methodology based on the SPIGA model [13]. This technique mirrors the challenges posed by real-world obscured facial landmarks, providing a robust training landscape for our model. As a testament to our approach's efficacy, Empath-Obscura achieves an accuracy of 69.3%, a substantial improvement over the individual models. Our comprehensive ablation study further elucidates the interplay of epochs and batch sizes, especially on YOLO V5's performance. Notably, the introduction of our facial landmark augmentation technique alone improved the model's performance by 13.18%. Anchored in the FER-2013 dataset and enriched with manually annotated test sets (Fig. 1), our findings promise transformative implications for emotion recognition, particularly in the context of social robotics, facilitating more nuanced and empathetic robot-human interactions.

The remainder of this paper is structured as follows: Section II reviews relevant literature in emotion recognition and social robotics. Section III details our methodology and data augmentation techniques. Sections IV and V present the experimental setup, the results of our ensemble model Empath-Obscura, and the ablation study. Section VI concludes with a discussion of the results.

II. RELATED WORK

Emotion recognition from facial expressions has attracted considerable interest in the fields of computer vision and machine learning. Several methods and algorithms have been proposed over the years, showing steady improvements in performance. Initially, the primary focus was on feature engineering and hand-crafted features. However, with the advent of deep learning, the field has shifted towards data-driven approaches, leveraging large labeled datasets and deep neural networks.

Traditional emotion recognition approaches used image processing techniques and hand-crafted features to extract facial landmarks [6]. These methods relied on manually designed features like distances between facial landmarks, eye corners, and mouth shapes. However, these approaches were sensitive to variations in lighting conditions, facial occlusions, and pose changes [3]. These issues limited their performance in real-world scenarios.

The development of convolutional neural networks (CNNs) revolutionized the field of emotion recognition, leading to the creation of powerful models capable of automatically learning relevant features from raw images [7]. Several popular architectures have been proposed, such as VGG [15], ResNet [16], and Inception [17]. These architectures have been successfully applied to the problem of emotion recognition [18].

In 2017, Barsoum et al. proposed real-time convolutional neural networks for emotion and gender classification, using CNNs to extract facial features and predict emotions and gender [19]. This work showcased the potential of CNNs for emotion recognition, even in real-time applications. Subsequently, a major milestone in facial expression recognition (FER) was achieved by Mao et al. with POSTER++, a simpler and stronger facial expression recognition network [8], which improves the previously proposed POSTER in three directions: cross-fusion, two-stream architecture, and multi-scale feature extraction. It has been a state-of-the-art model in facial emotion recognition across multiple datasets, including RAF-DB [20] and AffectNet [21].

Ensemble methods have also gained popularity in emotion recognition, combining multiple models to achieve higher accuracy and robustness [11]. Ensembling allows leveraging the strengths of multiple models while compensating for their weaknesses. Recent work by Shah et al. [22] shows that ensemble learning has the potential to improve the performance of face recognition models, which motivated the idea behind the ensemble-based Empath-Obscura model.

The field has seen significant advancements, but the challenge of recognizing emotions from obfuscated faces remains. This paper aims to address this issue by proposing a novel data augmentation technique using the SPIGA [13] model and an ensemble approach called Empath-Obscura.

III. METHODOLOGY

A. Dataset and Data Preprocessing

The dataset used in this study (FER-2013 [5]) consists of grayscale images of faces, with each image having a resolution of 48 x 48 pixels. The dataset provides automatic registration of the faces such that each face is approximately centered and occupies a similar amount of space in every image. The images primarily belong to one of seven categories: Angry (0), Disgust (1), Fear (2), Happy (3), Sad (4), Surprise (5), and Neutral (6). The dataset is divided into a training set and a public test set. The training set comprises 28,709 examples, while the public test set includes 3,589 examples. For the purpose of this study, we sampled a subset of 3,500 images from the test set, ensuring a balanced distribution with approximately 500 images for each of the seven emotion categories. These test set images were then manually obfuscated to remove features related to the eyes, as shown in Fig. 1. For the training set images, we took the images as is, resized them to a resolution of 64x64 pixels, and normalized them to facilitate better training of the model.
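To illustrate this preprocessing step, the following is a minimal sketch, assuming the FER-2013 crops are available as individual image files (the file layout and helper name are ours, not part of any released dataset tooling):

```python
import numpy as np
from PIL import Image

def preprocess_fer_image(path: str) -> np.ndarray:
    """Load a FER-2013 face crop, resize it to 64x64, and scale pixels to [0, 1]."""
    img = Image.open(path).convert("L")          # FER-2013 images are grayscale
    img = img.resize((64, 64), Image.BILINEAR)   # upscale from the native 48x48
    return np.asarray(img, dtype=np.float32) / 255.0
```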
Fig. 2. Examples of the obfuscated images generated using the Novel Data Augmentation Algorithm using SPIGA’s facial landmarks.

B. Novel Obfuscating Augmentation with SPIGA

In this work, we leverage the facial landmarks generated by the SPIGA network [13] to apply a novel data augmentation technique that obfuscates specific facial features. Facial landmarks are critical points on a face image that correspond to key facial features, including the eyes, nose, and lips. The SPIGA network is capable of identifying these landmarks with high precision even in the case of low-resolution images, which makes it an ideal choice for this task.

The obfuscating augmentation process is performed in multiple steps, as described below:

1) The SPIGA network is applied to the original face image to extract facial landmarks corresponding to the eyes, nose, and lips (amongst other points).
2) Three new images are generated by obfuscating each of the three features separately in the original image, using the landmarks as reference points.
3) We apply the Graham Scan algorithm [23] to the landmark points of each facial feature to produce the smallest fitting polygon that covers all the points of the feature, and then bound it in the smallest fitting rectangle (third column in Fig. 2).
4) The obfuscation process involves altering the pixel values within the region defined by the landmarks to hide the facial feature. This can be achieved by blurring, pixelation, or replacement with a constant value, depending on the desired level of obfuscation. We study full obfuscation by replacing the region with a constant value.
5) The three newly generated images serve as additional training samples, increasing the diversity of the dataset and contributing to the robustness of the model.

The obfuscating augmentation with SPIGA allows us to generate more diverse training samples and explore the impact of individual facial features on emotion recognition. In this study, we focus on the eyes, nose, and lips, obfuscating each feature separately in order to investigate their individual contribution to emotion recognition.
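A minimal sketch of steps 2-4 is shown below. It assumes the SPIGA landmarks for one feature are already available as an (N, 2) array of pixel coordinates; the grouping of landmark indices into eyes/nose/lips and the fill value are our illustrative choices, and the convex hull computed by OpenCV plays the role of the Graham Scan step:

```python
import numpy as np
import cv2

def obfuscate_feature(image: np.ndarray, landmarks: np.ndarray,
                      fill_value: int = 0) -> np.ndarray:
    """Hide one facial feature given its landmark points.

    image      : HxW (or HxWx3) face image.
    landmarks  : (N, 2) array of (x, y) landmark coordinates for one feature,
                 e.g. the eye points returned by the SPIGA detector.
    fill_value : constant used to overwrite the region (full obfuscation).
    """
    pts = landmarks.astype(np.int32)
    hull = cv2.convexHull(pts)            # smallest fitting polygon over the points
    x, y, w, h = cv2.boundingRect(hull)   # smallest axis-aligned bounding rectangle
    out = image.copy()
    out[y:y + h, x:x + w] = fill_value    # replace the region with a constant value
    return out

# One original image yields three augmented samples, one per obfuscated feature.
# feature_landmarks = {"eyes": ..., "nose": ..., "lips": ...}  # from SPIGA (hypothetical grouping)
# augmented = [obfuscate_feature(img, lm) for lm in feature_landmarks.values()]
```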
C. Training-time Augmentation

To further enhance the diversity and robustness of our training dataset, we have applied several preprocessing and data augmentation techniques during the training phase.

1) Preprocessing:
- Auto-Orient: The orientation of each image is automatically adjusted to ensure consistent alignment of facial features.
- Auto-Adjust Contrast: The contrast of each image is adjusted using adaptive equalization, which redistributes the pixel intensities to enhance visibility and emphasize facial features.

2) Training-Time Augmentations: For each training example, we generate three augmented versions with varying combinations of the following transformations (Fig. 4):
- Rotation: Each image is randomly rotated between -22 and +22 degrees to introduce variations in head pose.
- Shear: The images are subjected to horizontal and vertical shear transformations, each varying within +/-22 degrees.
- Brightness: The brightness of each image is randomly altered within a range of -25% to +25%.
- Exposure: The exposure level of each image is randomly modified within a range of -25% to +25%.
- Blur: A blur effect with a radius of up to 3 pixels is applied to the images.
- Noise: Random noise is introduced in up to 10% of the pixels in each image.
- Cutout: Four random cutout boxes, each occupying 25% of the image size, are applied to the images.

By incorporating these preprocessing and augmentation techniques into our training pipeline, we aim to improve the model's ability to generalize to diverse and challenging real-world scenarios, thereby enhancing its overall performance and robustness in emotion recognition. We make use of the platform provided by Roboflow [24] for achieving this task.
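These augmentations were configured in Roboflow; the snippet below is only an approximate re-creation of a similar pipeline with torchvision transforms. The Gaussian-noise helper, the blur kernel size, and the use of ColorJitter and RandomErasing as stand-ins for Roboflow's exposure and multi-box cutout operators are our assumptions, not the exact pipeline used in the paper:

```python
import torch
from torchvision import transforms

def add_pixel_noise(img: torch.Tensor, frac: float = 0.10) -> torch.Tensor:
    """Replace up to `frac` of the pixels with random values (rough 'Noise' analogue)."""
    mask = torch.rand_like(img) < frac
    return torch.where(mask, torch.rand_like(img), img)

train_transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.RandomAffine(degrees=22, shear=(-22, 22, -22, 22)),  # rotation and shear
    transforms.ColorJitter(brightness=0.25, contrast=0.25),         # brightness/exposure proxy
    transforms.GaussianBlur(kernel_size=3),                         # mild blur
    transforms.ToTensor(),                                          # scales pixels to [0, 1]
    transforms.Lambda(add_pixel_noise),
    transforms.RandomErasing(p=1.0, scale=(0.02, 0.0625)),          # single-box cutout approximation
])

# Applying the pipeline three times per training image yields three augmented versions.
# augmented = [train_transform(pil_image) for _ in range(3)]
```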
Fig. 3. Architecture of the Empath-Obscura model, showing the YOLOv5, YOLOv8, and POSTER++ models and the voting ensembler.

Fig. 4. Examples of training-time augmented images generated using the preprocessing and augmentation techniques described above, applied to unobfuscated images.

D. Empath-Obscura Model

The Empath-Obscura model comprises an ensemble of three pre-trained models, namely YOLOv5, YOLOv8, and POSTER++, which were adapted and fine-tuned for the task of emotion recognition from obfuscated facial images. A voting ensembler aggregates the predictions of the three models, with the weights of the ensemble being fine-tuned to optimize performance. The important components in the modeling and fine-tuning of our final model are described below.

1) Model Adaptation and Fine-tuning:

a) YOLOv5 and YOLOv8: These models are known for their efficiency in real-time object detection and were adapted for emotion recognition from obfuscated facial images. The models were initialized with weights pre-trained on the COCO dataset and then fine-tuned on the FER-2013 dataset. We make use of the implementation provided by Ultralytics [9], [10], since it is actively maintained and amongst the most widely used YOLO implementations. The training process was conducted using the Adam optimizer with a learning rate of 1 x 10^-4 and a batch size of 16. During fine-tuning, the last few layers were unfrozen, allowing for adaptation to the specific task. The training process involved a total of 100 epochs with early stopping based on validation loss.

b) POSTER++: This model was initially trained for facial expression recognition on high-quality images from AffectNet and RAF-DB. We adapted it to work with obfuscated images by initializing it with the pre-trained weights and fine-tuning it on the FER-2013 dataset. The optimization was conducted with the Adam optimizer, a learning rate of 1 x 10^-4, and a batch size of 16, and the training process lasted for 50 epochs with early stopping based on validation loss. For POSTER++, we make use of the official implementation provided by the authors on GitHub [8]. Similar to the YOLO models, during fine-tuning the last few layers were unfrozen, allowing for adaptation to the specific task.
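The exact training scripts are not part of the paper, so the following is only a generic PyTorch sketch of the recipe just described (freeze the backbone, unfreeze the last few layers, Adam with a 1e-4 learning rate, early stopping on validation loss). The model object, the data loaders, and the choice of which parameter tensors count as the "last few layers" are placeholders:

```python
import torch
from torch import nn

def finetune(model: nn.Module, train_loader, val_loader,
             max_epochs: int = 100, patience: int = 10, lr: float = 1e-4):
    """Generic last-layers fine-tuning loop with early stopping on validation loss."""
    # Freeze all parameters, then unfreeze only the last few parameter tensors
    # (a stand-in for "the last few layers").
    for p in model.parameters():
        p.requires_grad = False
    for p in list(model.parameters())[-6:]:
        p.requires_grad = True

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    criterion = nn.CrossEntropyLoss()

    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:          # batch size 16 set in the loader
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item()
                           for x, y in val_loader) / len(val_loader)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                 # early stopping
```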
2) Voting Ensembler: The voting ensembler aggregates the predictions of the YOLOv5, YOLOv8, and POSTER++ models, taking into account the confidence scores of the individual models' predictions. We optimized the weights of the ensemble using a simple grid search on the validation set. The optimal weights were then used to compute the final ensemble prediction, which is obtained by performing a weighted majority vote. For our experiments, we used the weights shown in Fig. 3; these weights are obtained by performing a grid search over the weights in Equation (1). In the case of YOLOv5 and YOLOv8, we chose the variants that performed best individually.

    ŷ_ensemble = argmax_i ( w_1 p_{1,i} + w_2 p_{2,i} + w_3 p_{3,i} )        (1)

where p_{j,i} is the probability of class i predicted by model j, and w_j is the weight of model j. The weights were optimized such that the ensemble achieved the highest accuracy on the validation set.
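Equation (1) and the weight search can be summarized in a few lines of NumPy. The sketch below assumes each model's per-class probabilities have already been computed for the validation set; the grid step of 0.1 is our illustrative choice rather than the exact grid used in the paper:

```python
import itertools
import numpy as np

def ensemble_predict(probs, weights):
    """Weighted soft vote: probs[j] has shape (n_samples, n_classes) for model j."""
    combined = sum(w * p for w, p in zip(weights, probs))   # inner sum of Eq. (1)
    return combined.argmax(axis=1)                          # argmax over classes i

def grid_search_weights(probs, labels, step: float = 0.1):
    """Pick (w1, w2, w3) maximizing accuracy on the validation labels."""
    grid = np.arange(0.0, 1.0 + step, step)
    best_w, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=3):
        if sum(w) == 0:
            continue                                         # skip the all-zero weighting
        acc = (ensemble_predict(probs, np.array(w)) == labels).mean()
        if acc > best_acc:
            best_w, best_acc = np.array(w), acc
    return best_w, best_acc
```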
By combining the predictions of multiple models and optimizing the ensemble weights, the Empath-Obscura model achieves robust and accurate emotion recognition performance on obfuscated facial images.

E. Training Procedure

The training procedure of the Empath-Obscura model was conducted on an Nvidia Quadro RTX 5000 16GB GPU.
The FER-2013 dataset, augmented with obfuscated and standard data augmentation techniques, was used as the training data. The training process consisted of the following steps:

1) Data Preprocessing: The images were preprocessed by resizing them to a size of 64x64 pixels and normalizing the pixel values to the range [0, 1]. The obfuscated images were generated by removing the eye features from the original images, as shown in Fig. 1.
2) Data Augmentation: In addition to the novel obfuscation technique, other standard data augmentation techniques were applied to the training images, including rotation, shear, brightness adjustment, exposure adjustment, blurring, noise addition, and cutout, as described in Section III-C. This step increased the diversity of the training data, helping to mitigate overfitting and enhance model generalization.
3) Model Initialization: The YOLOv5, YOLOv8, and POSTER++ models were initialized with pre-trained weights from the COCO dataset and the AffectNet dataset, respectively. These weights served as a good starting point for further fine-tuning on the FER-2013 dataset.
4) Model Fine-tuning: The models were fine-tuned on the FER-2013 dataset with the Adam optimizer, a learning rate of 1 x 10^-4, and a batch size of 16. The last few layers of each model were unfrozen to adapt the models to the emotion recognition task. Early stopping was employed based on the validation loss, and the training process lasted for a total of 50 epochs.
5) Ensemble Weight Optimization: The optimal weights for the ensemble were determined by performing a grid search on the validation set. The goal was to maximize the accuracy of the ensemble by adjusting the weights assigned to each model's predictions.
6) Model Evaluation: The performance of the Empath-Obscura model was evaluated on the manually annotated test set of the FER-2013 dataset. The accuracy, precision, recall, and F1 scores were computed to assess the model's ability to recognize emotions from obfuscated facial images (a short sketch of this metric computation follows the list).
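As noted in step 6, the metrics can be computed with scikit-learn. The snippet below is a minimal sketch with placeholder prediction arrays; macro averaging for precision, recall, and F1 is our assumption, since the paper does not state the averaging mode:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus (assumed) macro-averaged precision, recall, and F1 over the 7 emotions."""
    acc = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}
```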
IV. EXPERIMENTS AND RESULTS

In this section, we present a comprehensive set of experiments to evaluate the effectiveness of our proposed Empath-Obscura model for emotion recognition in obfuscated facial images. Our experiments were designed to examine the model's performance in different scenarios, including baseline models without augmentation, augmented data experiments, ensemble model experiments, and ablation studies. The experiments were conducted on a high-performance computing environment, specifically an Nvidia Quadro RTX 5000 16GB GPU. We benchmark our results against state-of-the-art methods in the field of emotion recognition and analyze the contributions of different components of our model to its overall performance. Our experimental results shed light on the potential of data augmentation techniques and ensemble methods for improving emotion recognition performance in challenging conditions, such as obfuscated facial images.

A. Experimental Setup

1) Dataset: For our experiments, we utilized the FER-2013 dataset, which is a widely recognized dataset for emotion recognition in facial images. The dataset consists of 48x48 pixel grayscale images of faces, which have been categorized into seven different emotions, namely: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral.

TABLE I
DATASET STATISTICS

Split        Number of Images   Size After Preprocessing   Preprocessed Set Size
Training     25,000             64x64                      100,000
Validation   3,000              64x64                      -
Test         3,500              64x64                      -
Total        31,500
Emotions: Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral

The test set of the FER-2013 dataset was augmented with a sample of 3,500 manually obfuscated images. The obfuscation was performed by annotators who were asked to obscure facial features related to the eyes in the images, as shown in Fig. 1. The images were then resized to a 64x64 resolution and normalized for better training.

The dataset was divided into three splits for our experiments, as presented in Table I. The training set consists of 25,000 images, the validation set has 3,000 images, and the test set comprises 3,500 images.

The human annotation of the obfuscated test set ensured that the obfuscation was performed in a consistent manner across all images, and the annotations were verified by multiple reviewers to ensure their accuracy.

V. RESULTS AND DISCUSSION

A. Performance Against State-of-the-Art Models

1) Overall Performance: Table II compares the performance of our Empath-Obscura model with other state-of-the-art models before and after data augmentation. We evaluate the models in terms of accuracy, precision, recall, and F1 score. Our Empath-Obscura model outperforms the other state-of-the-art models in terms of accuracy, precision, and F1 score. Moreover, the performance of all models has improved significantly after data augmentation, with our model achieving the highest improvement. This improvement clearly shows that the novel augmentation technique with SPIGA [13] provides images that are beneficial for learning facial features that remain relevant under different levels of obfuscation.

2) Class-wise Performance: Table III shows the performance of the YOLOv5 [9], YOLOv8 [10], POSTER++ [8], and Empath-Obscura models on each of the seven classes of emotions, before and after data augmentation. The performance of all the models improved considerably for all emotion classes after data augmentation, with the highest improvement observed for the Neutral class.
TABLE II
OVERALL PERFORMANCE COMPARISON

                                       Before Augmentation                          After Augmentation
Model                   Year           Accuracy (%)  Precision  Recall  F1 Score    Accuracy (%)  Precision  Recall  F1 Score
PAZ MiniXception [25]   ESANN-2019     48.23         0.56       0.40    0.47        51.87         0.48       0.54    0.51
EfficientFace [26]      AAAI-2021      57.21         0.61       0.47    0.53        60.17         0.65       0.56    0.60
GLAMOR-Net [27]         NCA-2021       59.35         0.62       0.49    0.55        61.32         0.60       0.62    0.61
YOLOv5 [9]              2022           52.77         0.59       0.48    0.53        57.25         0.63       0.54    0.58
YOLOv8 [10]             2023           60.54         0.70       0.59    0.64        63.91         0.75       0.59    0.66
POSTER++ [8]            2023           62.66         0.69       0.61    0.65        66.23         0.71       0.75    0.73
Empath-Obscura          -              64.83         0.78       0.67    0.72        69.32         0.85       0.69    0.76

TABLE III
CLASS-WISE PERFORMANCE COMPARISON

             Before Augmentation                              After Augmentation
Emotion      YOLOv5   YOLOv8   POSTER++   Empath-Obscura      YOLOv5   YOLOv8   POSTER++   Empath-Obscura
Angry        0.33     0.37     0.39       0.40                0.37     0.42     0.46       0.48
Disgust      0.65     0.76     0.75       0.78                0.71     0.79     0.83       0.84
Fear         0.41     0.43     0.46       0.51                0.45     0.50     0.54       0.56
Happy        0.79     0.76     0.82       0.83                0.85     0.87     0.87       0.88
Sad          0.51     0.63     0.63       0.66                0.63     0.68     0.64       0.71
Surprise     0.46     0.50     0.52       0.52                0.50     0.51     0.53       0.53
Neutral      0.35     0.43     0.47       0.48                0.59     0.65     0.67       0.73

Fig. 5. Normalized confusion matrix of our Empath-Obscura model.

Fig. 6. Correctly classified images by Empath-Obscura.

Fig. 5 presents the normalized confusion matrix over the predictions produced by the Empath-Obscura model. Inspecting the matrix, we see that the model performs extremely well for the emotions Happy and Disgust and significantly worse for Angry and Sad.

Fig. 6 shows some sample images for which our model Empath-Obscura performed well and classified the emotions correctly even though the eyes, a significant indicator of human emotions, are obfuscated. Additionally, we observe from Fig. 5 that the largest number of wrong classifications happened because a large proportion of Angry, Sad, and Surprised samples are classified as Fear, and a significant number of Angry, Fear, and Neutral samples are classified as Sad. Fig. 7 shows some of the samples for which the wrong classification took place.

Indeed, on closer inspection, we find that in many cases where the eyes are obscured, the predicted label makes more sense than the actual label, and such cases can be considered ambiguous.

An interesting observation arises when we change the obfuscation region for the test subject. Fig. 8 shows the image of a girl whose actual emotion is Fear. We find that on obfuscating the eyes, the model predicts the emotion as Surprise. This makes sense when we look into the training data, where most images expressing Surprise have the mouth wide open. Then again, if we obscure the mouth instead of the eyes, the model now predicts the emotion as Angry, which can be seen from the furrowed brows.
Fig. 7. Wrongly classified images by Empath-Obscura.

Fig. 8. Emotions changing on changing the obfuscated area.

B. Ablation Study

1) Empath-Obscura Ablation: For assessing the performance of individual obfuscation techniques, each training set comprised 50K images: 25K original images and an additional 25K images with the specific obfuscated facial feature (either eyes, nose, or mouth). In contrast, the training set for the combined obfuscation method encompassed 100K images, adhering to the previously detailed distribution. The subsequent ablation study specifically gauged the model's efficacy on the golden test set with manually obfuscated eyes.

In the evaluation of the Empath-Obscura model (Table IV), the highest accuracy, 69.32%, is achieved with combined obfuscation of eyes, nose, and mouth. Notably, the eye and mouth obfuscations individually result in accuracies of 67.85% and 66.59%, respectively, highlighting their crucial role in emotion recognition. This underscores the importance of these features in conveying emotional cues, suggesting that their obfuscation forces the model to tap into secondary cues for effective emotion discernment. Interestingly, the ensemble model, trained with all three features obfuscated, demonstrates superior robustness. The observed enhancement in accuracy when utilizing combined obfuscation can be attributed to the fact that different obfuscation types highlight or obscure distinct facial elements pivotal for emotion recognition. By integrating these aspects, the model attains a more comprehensive perspective, which evidently leads to improved performance.

TABLE IV
ABLATION STUDY OF EMPATH-OBSCURA

Obfuscation Type (Accuracy %)
Eyes     Nose     Mouth    Combined
67.85    65.11    66.59    69.32

2) YOLO Ablation: We evaluate the performance of different configurations of the YOLOv5 and YOLOv8 models on the test set. Table V presents the best-performing models among the different variants available (nano (n), small (s), medium (m), large (l), and extra-large (x)). Each model is trained on an NVIDIA Quadro RTX-5000 GPU with a training batch size of 16 images and 1000 epochs, with early stopping if no improvement is observed for 10 epochs. We observe that as the model size increases for the YOLOv5 variants, the learning ability is limited even with a large number of epochs, and the features learnt are not helpful enough for predictions. Hence, the nano model proves to be the best-performing YOLOv5 variant. However, for the YOLOv8 variants, we see a contrasting trend: models of higher complexity achieve better performance. YOLOv8's training methodology provides more robust and accurate feature learning compared to the YOLOv5 models. Thus, for YOLOv8, the extra-large model performs best.

TABLE V
YOLO RESULTS (ACCURACY) WITH VARYING MODEL SIZE

                 YOLOv5 [9]          YOLOv8 [10]
YOLO Model       Top1     Top5       Top1     Top5
yolov5n-cls      0.572    0.963      0.606    0.978
yolov5s-cls      0.327    0.929      0.611    0.983
yolov5m-cls      0.311    0.904      0.622    0.984
yolov5l-cls      0.289    0.896      0.628    0.979
yolov5x-cls      0.271    0.883      0.639    0.986

Table VI presents the results of the YOLOv5 nano model with varying numbers of epochs. We see that the model learns to predict better when the number of epochs is higher. However, beyond 1000 epochs, the model usually starts over-fitting, with the training loss still decreasing while the test loss remains constant.

TABLE VI
YOLOV5N-CLS ABLATION RESULTS (ACCURACY)

Batch Size (256)                 Epochs (1000)
Epochs   Top1     Top5           Batch Size   Top1     Top5
15       0.383    0.930          16           0.572    0.963
50       0.496    0.964          64           0.544    0.945
100      0.539    0.969          256          0.539    0.969
1000     0.572    0.961          1024         0.533    0.958
On varying the batch size while keeping the number of epochs constant at 1000, we observe that most of the models perform better with a lower batch size. The optimum batch size found to work well for most models is 16. Below a batch size of 16, the model is not able to benefit from mini-batches, and convergence takes significantly more time.

VI. CONCLUSION

In this paper, we introduced the Empath-Obscura model, which effectively identifies human emotions even in challenging conditions where significant facial indicators like the eyes are obfuscated. Fig. 6 exhibits instances where our model distinguished the emotions despite the absence of one of the most crucial features in emotion detection, the eyes. This accomplishment attests to the robustness and versatility of Empath-Obscura, as it defies the typical dependency on the eyes for emotion classification.

However, our research is not without its intricacies and nuances. We discerned a prominent misclassification pattern where emotions such as Angry, Sad, and Surprised were frequently identified as Fear, and the emotions Angry, Fear, and Neutral were also misclassified as Sad on notable occasions. Intriguingly, a closer inspection of these instances suggests that the predicted labels might be closer to a person's intuitive assessment when the eyes are obscured, hinting at the presence of ambiguous cases in the dataset. A particularly interesting insight emerges when we modulate the obfuscation region: obscuring different parts of the face can steer the model towards different emotions.

This exploration underscores the dynamic and sometimes ambiguous nature of human emotional expression. It also highlights the importance of developing models like Empath-Obscura that are not just accurate but adaptable to diverse scenarios. The capability to interpret emotions from partial facial data has vast implications for areas like security, remote communication, and entertainment. We believe that this study lays the foundation for future research aimed at refining and enhancing the interpretability and adaptability of emotion recognition models in varied contexts.

REFERENCES

[1] C. Breazeal, "Toward sociable robots," Robotics and Autonomous Systems, vol. 42, no. 3-4, pp. 167-175, 2003.
[2] K. Dautenhahn, "Socially intelligent robots: dimensions of human-robot interaction," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 362, no. 1480, pp. 679-704, 2007.
[3] T. Fong, I. Nourbakhsh, and K. Dautenhahn, "A survey of socially interactive robots," Robotics and Autonomous Systems, vol. 42, no. 3-4, pp. 143-166, 2003.
[4] P. Ekman, "Facial expression and emotion," American Psychologist, vol. 48, no. 4, pp. 384-392, 1993.
[5] P.-L. Carrier and A. Courville, "FER-2013: Facial expression recognition 2013," Kaggle, 2013. [Online]. Available: https://www.kaggle.com/msambare/fer2013
[6] M. Pantic and L. J. Rothkrantz, "Expert system for automatic analysis of facial expressions," Image and Vision Computing, vol. 18, no. 11, pp. 881-905, 2000.
[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[8] J. Mao, R. Xu, X. Yin, Y. Chang, B. Nie, and A. Huang, "POSTER++: A simpler and stronger facial expression recognition network," 2023.
[9] G. Jocher, A. Chaurasia, A. Stoken, J. Borovec, NanoCode012, Y. Kwon, K. Michael, TaoXie, J. Fang, imyhxy, Lorna, Z. Yifu, C. Wong, A. V, D. Montes, Z. Wang, C. Fati, J. Nadar, Laughing, UnglvKitDe, V. Sonck, tkianai, yxNONG, P. Skalski, A. Hogan, D. Nair, M. Strobel, and M. Jain, "ultralytics/yolov5: v7.0 - YOLOv5 SOTA realtime instance segmentation," Nov. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.7347926
[10] G. Jocher, A. Chaurasia, and J. Qiu, "YOLO by Ultralytics," Jan. 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[11] T. G. Dietterich, "Ensemble methods in machine learning," in Multiple Classifier Systems. Springer, 2000, pp. 1-15.
[12] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, vol. 6, no. 1, p. 60, 2019.
[13] A. Prados-Torreblanca, J. M. Buenaposada, and L. Baumela, "Shape preserving facial landmarks with graph attention networks," in 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022. BMVA Press, 2022. [Online]. Available: https://bmvc2022.mpi-inf.mpg.de/0155.pdf
[14] C. Breazeal, "Social robots in the wild," Science, vol. 352, no. 6283, pp. 148-149, 2016.
[15] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in 3rd International Conference on Learning Representations (ICLR), 2014.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9.
[18] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning for human part discovery in images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1342-1350.
[19] E. Barsoum, C. Zhang, C. Canton Ferrer, and Z. Zhang, "Real-time convolutional neural networks for emotion and gender classification," arXiv:1710.07557, 2017.
[20] S. Li, W. Deng, and J. Du, "Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 2584-2593.
[21] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A database for facial expression, valence, and arousal computing in the wild," IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18-31, Jan. 2019. [Online]. Available: https://doi.org/10.1109%2Ftaffc.2017.2740923
[22] A. Shah, B. Ali, M. Habib, J. Frnda, I. Ullah, and M. Shahid Anwar, "An ensemble face recognition mechanism based on three-way decisions," J. King Saud Univ. Comput. Inf. Sci., vol. 35, no. 4, pp. 196-208, May 2023. [Online]. Available: https://doi.org/10.1016/j.jksuci.2023.03.016
[23] R. Graham, "An efficient algorithm for determining the convex hull of a finite planar set," Information Processing Letters, vol. 1, no. 4, pp. 132-133, 1972. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0020019072900452
[24] B. Dwyer, J. Nelson, J. Solawetz et al., "Roboflow (version 1.0)," 2022, computer vision. [Online]. Available: https://roboflow.com
[25] O. Arriaga, M. Valdenegro-Toro, and P. G. Plöger, in ESANN 2019 Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges, Belgium: i6doc.com, April 24-26, 2019. [Online]. Available: http://www.i6doc.com/en/
[26] Z. Zhao, Q. Liu, and F. Zhou, "Robust lightweight facial expression recognition network with label distribution training," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, pp. 3510-3519, May 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/16465
[27] N. Le, K. Nguyen, A. Nguyen et al., "Global-local attention for emotion recognition," Neural Computing and Applications, vol. 34, pp. 21625-21639, 2022. [Online]. Available: https://doi.org/10.1007/s00521-021-06778-x
