images as is, resized them to a resolution of 64x64 pixels, and normalized them to facilitate better training of the model.

B. Novel Obfuscating Augmentation with SPIGA

In this work, we leverage the facial landmarks generated by the SPIGA network [13] to apply a novel data augmentation technique that obfuscates specific facial features. Facial landmarks are critical points on a face image that correspond to key facial features, including the eyes, nose, and lips. The SPIGA network identifies these landmarks with high precision even in low-resolution images, which makes it an ideal choice for this task.

The obfuscating augmentation is performed in the following steps:
1) The SPIGA network is applied to the original face image to extract facial landmarks corresponding to the eyes, nose, and lips (among other points).
2) Three new images are generated by obfuscating each of the three features separately in the original image, using the landmarks as reference points.
3) For each of the three images, we apply the Graham Scan algorithm [23] to the landmark points of the corresponding feature to produce the smallest polygon covering all of its points, and then bound that polygon in the smallest fitting rectangle (3rd column in Fig. 2).
4) The obfuscation alters the pixel values within the region defined by the landmarks so that the facial feature is hidden. This can be achieved by blurring, pixelation, or replacement with a constant value, depending on the desired level of obfuscation; we study full obfuscation, replacing the region with a constant value (a code sketch follows at the end of this subsection).
5) The three newly generated images serve as additional training samples, increasing the diversity of the dataset and contributing to the robustness of the model.

The obfuscating augmentation with SPIGA allows us to generate more diverse training samples and to explore the impact of individual facial features on emotion recognition. In this study, we focus on the eyes, nose, and lips, obfuscating each feature separately in order to investigate their individual contribution to emotion recognition.
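To make steps 3 and 4 concrete, the following Python sketch obfuscates one feature. It assumes `landmarks` is an (N, 2) array of SPIGA points for that feature, and uses OpenCV's convexHull (which yields the same hull a Graham Scan would) in place of a hand-rolled implementation of [23]:

```python
# Sketch of steps 3-4: convex hull of one feature's landmarks, bounded by the
# smallest fitting rectangle, then filled with a constant value.
import cv2
import numpy as np

def obfuscate_feature(image: np.ndarray, landmarks: np.ndarray,
                      fill_value: int = 0) -> np.ndarray:
    """Return a copy of `image` with one facial feature hidden."""
    out = image.copy()
    pts = landmarks.astype(np.int32)
    hull = cv2.convexHull(pts)           # smallest polygon covering the points
    x, y, w, h = cv2.boundingRect(hull)  # smallest fitting rectangle
    out[y:y + h, x:x + w] = fill_value   # full obfuscation: constant value
    return out

# One augmented image per feature, as in step 2:
# eyes_img = obfuscate_feature(img, eye_landmarks)
# nose_img = obfuscate_feature(img, nose_landmarks)
# lips_img = obfuscate_feature(img, lip_landmarks)
```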
C. Training-time Augmentation

To further enhance the diversity and robustness of our training dataset, we applied several preprocessing and data augmentation techniques during the training phase.

1) Preprocessing:
• Auto-Orient: The orientation of each image is automatically adjusted to ensure consistent alignment of facial features.
• Auto-Adjust Contrast: The contrast of each image is adjusted using adaptive equalization, which redistributes the pixel intensities to enhance visibility and emphasize facial features.

2) Training-Time Augmentations: For each training example, we generate three augmented versions with varying combinations of the following transformations (Fig. 4; a sketch of such a pipeline follows at the end of this subsection):
• Rotation: Each image is randomly rotated between -22° and +22° to introduce variations in head pose.
• Shear: The images are subjected to horizontal and vertical shear transformations, each varying between ±22°.
• Brightness: The brightness of each image is randomly altered within a range of -25% to +25%.
• Exposure: The exposure level of each image is randomly modified within a range of -25% to +25%.
• Blur: A blur effect with a radius of up to 3 pixels is applied to the images.
• Noise: Random noise is introduced in up to 10% of the pixels in each image.
• Cutout: Four random cutout boxes, each occupying 25% of the image size, are applied to the images.

By incorporating these preprocessing and augmentation techniques into our training pipeline, we aim to improve the model's ability to generalize to diverse and challenging real-world scenarios, thereby enhancing its overall performance and robustness in emotion recognition. We make use of the platform provided by Roboflow [24] for this task.
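The augmentations above were applied through Roboflow [24]; purely for illustration, an approximately equivalent pipeline can be written with torchvision. This is a sketch, not the Roboflow implementation: exposure is approximated by a second brightness jitter, and the noise and four-box cutout steps are small custom transforms.

```python
import torch
from torchvision import transforms

def salt_pepper(img: torch.Tensor, frac: float = 0.10) -> torch.Tensor:
    """Randomize up to `frac` of the pixels (Noise augmentation)."""
    mask = torch.rand_like(img) < frac * torch.rand(1).item()
    return torch.where(mask, torch.rand_like(img), img)

def cutout4(img: torch.Tensor, area: float = 0.25) -> torch.Tensor:
    """Apply four random cutout boxes, each covering ~25% of the image area."""
    _, h, w = img.shape
    bh, bw = int(h * area ** 0.5), int(w * area ** 0.5)
    for _ in range(4):
        y = torch.randint(0, h - bh + 1, (1,)).item()
        x = torch.randint(0, w - bw + 1, (1,)).item()
        img[:, y:y + bh, x:x + bw] = 0.0
    return img

train_tf = transforms.Compose([
    transforms.RandomRotation(22),                          # ±22° rotation
    transforms.RandomAffine(0, shear=(-22, 22, -22, 22)),   # ±22° x/y shear
    transforms.ColorJitter(brightness=0.25),                # ±25% brightness
    transforms.ColorJitter(brightness=0.25),                # stand-in for exposure
    transforms.GaussianBlur(7, sigma=(0.1, 3.0)),           # blur radius up to ~3 px
    transforms.ToTensor(),
    transforms.Lambda(salt_pepper),
    transforms.Lambda(cutout4),
])
```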
Fig. 3. Architecture of the Empath-Obscura model, showing the YOLOv5, YOLOv8, and POSTER++ models and the voting ensembler.

Fig. 4. Examples of training-time augmented images generated by applying the preprocessing and augmentation techniques described above to unobfuscated images.

D. Empath-Obscura Model

The Empath-Obscura model comprises an ensemble of three pre-trained models, namely YOLOv5, YOLOv8, and POSTER++, which were adapted and fine-tuned for the task of emotion recognition from obfuscated facial images. A voting ensembler aggregates the predictions of the three models, with the ensemble weights tuned to optimize performance. The important components of the modeling and fine-tuning of our final model are described below.

1) Model Adaptation and Fine-tuning:

a) YOLOv5 and YOLOv8: These models are known for their efficiency in real-time object detection and were adapted for emotion recognition from obfuscated facial images. They were initialized with weights pre-trained on the COCO dataset and then fine-tuned on the FER-2013 dataset. We use the implementation provided by Ultralytics [9], [10], since it is actively maintained and among the most widely used YOLO implementations. Training used the Adam optimizer with a learning rate of 1×10⁻⁴ and a batch size of 16. During fine-tuning, the last few layers were unfrozen, allowing adaptation to the specific task, and training ran for a total of 100 epochs with early stopping based on validation loss.
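As a rough illustration of this setup, fine-tuning with the Ultralytics Python API might look like the sketch below; the dataset path, the choice of classification variant, and the freeze/patience values are assumptions, not settings reported in this work.

```python
# Illustrative fine-tuning via the Ultralytics API [9], [10].
from ultralytics import YOLO

model = YOLO("yolov8n-cls.pt")   # pre-trained checkpoint (assumed variant)
model.train(
    data="datasets/fer2013",     # hypothetical path to the prepared dataset
    epochs=100,                  # 100 epochs, early-stopped via `patience`
    patience=10,
    batch=16,
    lr0=1e-4,
    optimizer="Adam",
    freeze=10,                   # keep early layers frozen; last few adapt
)
```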
b) POSTER++: This model was initially trained for facial expression recognition on high-quality images from AffectNet and RAF-DB. We adapted it to obfuscated images by initializing with the pre-trained weights and fine-tuning on the FER-2013 dataset. Optimization used the Adam optimizer, a learning rate of 1×10⁻⁴, and a batch size of 16, and training lasted 50 epochs with early stopping based on validation loss. For POSTER V2, we use the official implementation provided by the team on GitHub [8]. As with the YOLO models, the last few layers were unfrozen during fine-tuning, allowing adaptation to the specific task.
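The "unfreeze only the last few layers" recipe used for both backbones can be sketched generically in PyTorch; the prefix "head" below is a placeholder, since the actual POSTER++ module names depend on the official implementation [8].

```python
import torch

def freeze_all_but_last(model: torch.nn.Module,
                        trainable_prefixes: tuple = ("head",)) -> None:
    """Freeze every parameter except those whose name starts with a prefix."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)

# freeze_all_but_last(poster_model)
# optimizer = torch.optim.Adam(
#     (p for p in poster_model.parameters() if p.requires_grad), lr=1e-4)
```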
2) Voting Ensembler: The voting ensembler aggregates the predictions of the YOLOv5, YOLOv8, and POSTER++ models, taking into account the confidence scores of the individual models' predictions. We optimized the ensemble weights with a simple grid search on the validation set; the optimal weights were then used to compute the final ensemble prediction through a weighted majority vote. For our experiments, we used the weights shown in Fig. 3, obtained by a grid search over Equation (1). For YOLOv5 and YOLOv8, we chose the variants that performed best individually.

\hat{y}_{\text{ensemble}} = \arg\max_{i} \left( w_1 p_{1,i} + w_2 p_{2,i} + w_3 p_{3,i} \right)    (1)

where p_{j,i} is the probability of class i predicted by model j, and w_j is the weight of model j. The weights were optimized such that the ensemble achieved the highest accuracy on the validation set.

By combining the predictions of multiple models and optimizing the ensemble weights, the Empath-Obscura model achieves robust and accurate emotion recognition on obfuscated facial images.
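Equation (1) translates directly into a few lines of NumPy; here `probs` stacks the three models' class-probability vectors for a single image:

```python
import numpy as np

def ensemble_predict(probs: np.ndarray, weights: np.ndarray) -> int:
    """probs: (3, n_classes) per-model probabilities; weights: (3,)."""
    combined = weights @ probs       # sum_j w_j * p_{j,i} for every class i
    return int(np.argmax(combined))  # arg max over classes i
```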
E. Training Procedure

The training procedure of the Empath-Obscura model was conducted on an Nvidia Quadro RTX 5000 16GB GPU. The FER-2013 dataset, augmented with obfuscated and standard data augmentation techniques, was used as the training data. The training process consisted of the following steps:
1) Data Preprocessing: The images were preprocessed by resizing them to 64x64 pixels and normalizing the pixel values to the range [0, 1]. The obfuscated images were generated by removing the eye features from the original images, as shown in Figure 1.
2) Data Augmentation: In addition to the novel obfuscation technique, standard data augmentation techniques were applied to the training images, including rotation, shear, brightness adjustment, exposure adjustment, blurring, noise addition, and cutout, as described in Section III-C. This step increased the diversity of the training data, helping to mitigate overfitting and enhance model generalization.
3) Model Initialization: The YOLOv5, YOLOv8, and POSTER++ models were initialized with weights pre-trained on the COCO dataset and the AffectNet dataset, respectively. These weights served as a good starting point for further fine-tuning on the FER-2013 dataset.
4) Model Fine-tuning: The models were fine-tuned on the FER-2013 dataset with the Adam optimizer, a learning rate of 1×10⁻⁴, and a batch size of 16. The last few layers of each model were unfrozen to adapt the models to the emotion recognition task. Early stopping was employed based on the validation loss, and the training process lasted for a total of 50 epochs.
5) Ensemble Weight Optimization: The optimal weights for the ensemble were determined by a grid search on the validation set, adjusting the weights assigned to each model's predictions to maximize ensemble accuracy (a sketch of this search is shown after this list).
6) Model Evaluation: The performance of the Empath-Obscura model was evaluated on the manually annotated test set of the FER-2013 dataset. Accuracy, precision, recall, and F1 scores were computed to assess the model's ability to recognize emotions from obfuscated facial images.
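A possible implementation of the grid search in step 5, assuming each model's validation-set class probabilities have been precomputed into an array of shape (n_models, n_samples, n_classes); the 0.1 grid step is an assumption, not a value reported here.

```python
import itertools
import numpy as np

def grid_search_weights(val_probs: np.ndarray, val_labels: np.ndarray,
                        step: float = 0.1) -> np.ndarray:
    """Return the weight vector maximizing ensemble accuracy on validation."""
    grid = np.arange(0.0, 1.0 + step, step)
    best_w, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=val_probs.shape[0]):
        w = np.asarray(w)
        if w.sum() == 0:
            continue
        w = w / w.sum()  # normalize so the weights form a convex combination
        preds = np.einsum("m,msc->sc", w, val_probs).argmax(axis=1)
        acc = float((preds == val_labels).mean())
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w
```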
IV. EXPERIMENTS AND RESULTS

In this section, we present a comprehensive set of experiments to evaluate the effectiveness of our proposed Empath-Obscura model for emotion recognition in obfuscated facial images. The experiments were designed to examine the model's performance in different scenarios, including baseline models without augmentation, augmented-data experiments, ensemble-model experiments, and ablation studies. All experiments were conducted on a high-performance computing environment, specifically an Nvidia Quadro RTX 5000 16GB GPU. We benchmark our results against state-of-the-art methods in emotion recognition and analyze the contributions of the different components of our model to its overall performance. Our experimental results shed light on the potential of data augmentation techniques and ensemble methods for improving emotion recognition performance in challenging conditions, such as obfuscated facial images.

A. Experimental Setup

1) Dataset: For our experiments, we utilized the FER-2013 dataset, a widely recognized dataset for emotion recognition in facial images. It consists of 48x48 pixel grayscale images of faces, categorized into seven emotions: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral.

TABLE I
DATASET STATISTICS

Split        Number of Images   Size After Preprocessing   Preprocessed Set Size
Training     25,000             64x64                      100,000
Validation   3,000              64x64                      -
Test         3,500              64x64                      -
Total        31,500
Emotions: Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral

The test set of the FER-2013 dataset was augmented with a sample of 3,500 manually obfuscated images. The obfuscation was performed by annotators who were asked to obscure facial features related to the eyes, as shown in Figure 1. The images were then resized to a 64x64 resolution and normalized for better training.

The dataset was divided into three splits for our experiments, as presented in Table I: the training set consists of 25,000 images, the validation set of 3,000 images, and the test set of 3,500 images.

The human annotation of the obfuscated test set ensured that the obfuscation was performed consistently across all images, and the annotations were verified by multiple reviewers to ensure their accuracy.

V. RESULTS AND DISCUSSION

A. Performance Against State-of-the-Art Models

1) Overall Performance: Table II compares the performance of our Empath-Obscura model with other state-of-the-art models before and after data augmentation, in terms of accuracy, precision, recall, and F1 score.

Our Empath-Obscura model outperforms the other state-of-the-art models in accuracy, precision, and F1 score. Moreover, the performance of all models improved significantly after data augmentation, with our model achieving the largest improvement. This shows that the novel SPIGA-based augmentation [13] provides images that help the models learn facial features that remain relevant under different levels of obfuscation.

2) Class-wise Performance: Table III shows the performance of YOLOv5 [9], YOLOv8 [10], POSTER++ [8], and the Empath-Obscura model on each of the seven emotion classes, before and after data augmentation. The performance of all models improved considerably across all emotion classes after data augmentation, with the largest improvement observed for the Neutral class.
TABLE II
OVERALL PERFORMANCE COMPARISON

TABLE III
CLASS-WISE PERFORMANCE COMPARISON

TABLE IV
ABLATION STUDY OF EMPATH-OBSCURA

TABLE V
YOLO RESULTS (ACCURACY) WITH VARYING MODEL SIZE

...which can be seen from the furrowed brows.