Mai El Seknedy
Nile University
Giza, Egypt
Mai.Magdy@nu.edu.eg

Sahar Fawzi
Nile University
Giza, Egypt
sfawzi@nu.edu.eg
Abstract— Speech emotion recognition systems are needed nowadays in many human interaction applications such as call centers, e-learning, autonomous driver emotion detection, and physiological disease analysis. Existing speech emotion recognition systems focus mainly on a single corpus; speech emotion recognition performance on cross-corpus data, on the other hand, is still an ongoing challenge in the research domain. This paper presents a study of speech emotion recognition tested on 3 widely used languages (English, German and French) using the RAVDESS, EmoDB and CaFE datasets. Four machine learning classifiers have been used: Multi-Layer Perceptron, Support Vector Machine, Random Forest and Logistic Regression. A newly developed feature set is introduced, consisting of the main speech features: prosodic features, spectral features and energy. Very promising results were obtained using this feature set, even when compared with the benchmark "Interspeech 2009" feature set. Furthermore, feature importance techniques were used to study the feature importance per classifier across each corpus. From our results, SVM was found to be the best classifier in terms of both recognition rate and running performance. The model achieved an accuracy of 70.56% on RAVDESS, 85.97% on EmoDB and 70.61% on CaFE.

Keywords: Speech emotion recognition, Cross corpus, Mel frequency cepstral coefficients, prosodic features, Mel-spectrogram features, Acted datasets

I. INTRODUCTION

Speech Emotion Recognition (SER) has a wide range of applications in human interacting systems that can comprehend human emotions to enhance the interactive experience. Useful applications include human-computer interaction, mental health analysis, autonomous vehicles, commercial applications, call centers, web-based e-learning, computer games analytics and psychological disease analysis. For example, call center agents can receive triggers about the customer's state, which helps them handle customers more efficiently. SER can also be integrated with churn prediction models, as in telecom companies [1]. In other words, call center agencies will be able to raise an alarm for unsatisfied, satisfied, or neutral customers to support behavioral studies of their customers. This will help to improve the quality of service and to analyze the client's psychological attitude, so that the needed actions can be taken on the spot during the call. Furthermore, e-learning is growing at incredible rates due to the current situation. It would be beneficial to have an indication of the emotional state of students/attendants, to differentiate whether they are frustrated, angry, happy, calm, or neutral. This will lead to a high improvement in social communication and in ways of transferring information [2]. Moreover, SER is taking an emerging role in the medical sector, where identifying the patient's response to the prescribed medicine or therapy is very important in deciding the best medical treatment program. The authors of [3] mention the impact of SER on Autism Spectrum Disorder (ASD) patients, who usually suffer from deficits in facial expressions; thanks to this advanced technology, SER can offer a psychophysiological alternative modality to analyze emotions through speech. Emotions play an important role in interpreting human feelings, since research has discovered the power of emotions in defining human social interactions [4].

In this research, we provide a model that is capable of identifying the effect of the emotional state of the end-user customer in the previously mentioned applications. It also tackles the challenge of cross-corpus SER, especially since much SER research targets a single language without taking cross-corpus model generalization into consideration. This will enable SER systems to interact in real-life applications, which are not just unilingual. This work studies the effect of different feature-set combinations, including prosodic features (intonation, fundamental frequency), spectral features (contrast), ZCR, RMS, MFCCs and Mel spectrograms, on predicting accurate emotions in single- and cross-corpus settings. Furthermore, it studies the impact of each of the introduced languages with respect to the features. The model performance is tested on 3 benchmark acted databases, RAVDESS, EmoDB and CaFE, in 3 different languages (English, German and French). To compare the performance of our developed feature set, we also tested the model on the benchmark INTERSPEECH 2009 feature set (IS09), which is based on the INTERSPEECH 2009 Emotion Challenge [21], [32]. For classification, four supervised machine learning models were used: Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), Simple Logistic Regression (SLR) and Random Forest (RF).
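As an illustration of two of the low-level features listed above, the following is a minimal NumPy sketch of per-frame ZCR and RMS extraction from a mono waveform; the frame and hop lengths are illustrative choices, not the exact settings used in this work:

```python
import numpy as np

def frame_features(signal, frame_len=2048, hop=512):
    """Compute per-frame zero-crossing rate (ZCR) and root-mean-square
    (RMS) energy from a 1-D mono waveform."""
    zcrs, rmss = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # ZCR: fraction of consecutive sample pairs whose sign differs
        zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
        # RMS: square root of the mean squared amplitude
        rms = np.sqrt(np.mean(frame ** 2))
        zcrs.append(zcr)
        rmss.append(rms)
    return np.array(zcrs), np.array(rmss)
```

On a pure sine tone the RMS approaches amplitude divided by the square root of two, and the ZCR grows with the tone's frequency, which is why these two features carry loudness and pitch-related information cheaply.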
databases are affected by a group of features. We also studied the classifier impact and how each classifier is affected by the feature subset. Firstly, we used the filter method based on information gain.

Table.4 (fragment) – top features on EmoDB per classifier (MLP | SVM | Random Forest | Logistic Regression):
Pitch(3) | MFCC(7) | RMS(1) | MFCC(7)
ZCR(1) | pitch(2) | MFCC(4) | Pitch(2)
Mel-Spec(3), MFCC(1), tonnetz(1), Mean-Energy(1), Tonnetz(1)
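The information-gain criterion behind this filter-based ranking can be sketched for a single feature and a candidate split threshold; this is a simplified illustration (real feature-selection toolkits discretize continuous features more carefully):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting `labels` on feature <= threshold."""
    mask = feature <= threshold
    n = len(labels)
    # Weighted entropy of the two partitions after the split
    cond = (mask.sum() / n) * entropy(labels[mask]) \
         + ((~mask).sum() / n) * entropy(labels[~mask])
    return entropy(labels) - cond
```

A feature whose split separates the classes perfectly yields a gain equal to the full label entropy, while an uninformative feature yields a gain near zero; ranking features by this quantity gives the filter ordering used for Table 3.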
Table.3 – Top 10 features using Information gain per language
Language | Features
RADVESS | Mel spectrogram (8), RMS (3), MFCC (6), F0 (1), Signal Mean Energy (1)
EmoDB | Pitch (3), MFCC (6), ZCR
CaFE | RMS (1), Mel spectrogram (1), MFCC (2), F0, Pitch (2), Contrast (2), Signal Mean Energy (1)

D. Feature Scaling

Many machine learning algorithms work better when features are normalized to a relatively similar scale and are close to normally distributed. Normalization reduces the effect of variations in speakers, languages and recording environment conditions on the recognition process. There are many normalization techniques, such as the Standard Scaler and the Minimum and Maximum Scaler (MMS) [23], [26]. The MMS method is used in this work after feature extraction to transform the features by scaling each one to the range 0-1, as calculated in Eq. (1). MMS was selected after experimenting with both MMS and the Standard Scaler, where MMS showed better results; MMS was also used in the literature, e.g. in [26].
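A minimal NumPy sketch of this per-feature min-max scaling, Eq. (1) applied column-wise (in practice the min/max should be fitted on the training split only and reused at test time, as e.g. scikit-learn's MinMaxScaler does):

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature column of X to [0, 1] per Eq. (1):
    X_scaled = (X - min) / (max - min), with min/max taken per feature."""
    mn = X.min(axis=0)
    mx = X.max(axis=0)
    return (X - mn) / (mx - mn)
```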
X_scaled = (X − min) / (max − min)    (1)

where X is the input feature and min/max are the feature's range over the data.

2. Permutation Importance (PI): a model-agnostic global explanation method that provides insights into a machine learning model's behavior. It estimates and ranks feature importance based on the impact each feature has on the trained model's predictions [22]. Table.4 shows the top 10 features (weighted per statistical function) per classifier per corpus. MFCC is the most dominant feature across all the classifiers: in Logistic Regression, MFCC appeared 7 times in the top 10 across the 3 corpuses, and likewise in SVM, where on the CaFE dataset MFCC appeared 9 times in the top 10. Mel-spectrogram features rank highest with the RADVESS database, which complements the information gain results. It was also found that Random Forest is more sensitive to mel-spectrogram features than the other classifiers; in SVM, MFCC and pitch features are more dominant, while MLP is more dynamic with respect to each language.

E. Machine Learning Models

During this research, four classification techniques were considered. Firstly, Support Vector Machine (SVM), which is known to perform well on higher-dimensional data such as audio; this is why it is one of the most popular classifiers in the SER field, owing to its high running speed and accurate results [6], [24], [26]. Secondly, Random Forest, a tree-based ensemble classifier and another benchmark classifier widely used in SER [26], [27]. Thirdly, the Logistic Regression algorithm, used to analyze the linear model's performance [27]. Finally, Multi-Layer Perceptron (MLP), a feedforward neural network algorithm that can be considered as the first
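The permutation importance procedure described above can be sketched model-agnostically in a few lines; here a toy threshold rule stands in for the trained classifier, and the toy data (one informative feature, one pure-noise feature) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: feature 0 determines the label, feature 1 is pure noise.
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)

def predict(X):
    # Stand-in "trained model": thresholds feature 0.
    return (X[:, 0] > 0).astype(int)

def permutation_importance(predict, X, y, n_repeats=10):
    """Mean drop in accuracy after shuffling one feature column at a time."""
    baseline = np.mean(predict(X) == y)
    drops = []
    for j in range(X.shape[1]):
        d = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the feature/label link for column j
            d.append(baseline - np.mean(predict(Xp) == y))
        drops.append(np.mean(d))
    return np.array(drops)
```

Shuffling the informative column destroys roughly half of the correct predictions, while shuffling the noise column changes nothing, so the accuracy drop ranks the features exactly as intended.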
(Acc = accuracy, Prc = precision, Rec = recall)

RADVESS
Feature-set | MLP (Acc / Prc / Rec) | SVM (Acc / Prc / Rec) | Random Forest (Acc / Prc / Rec) | Logistic Regression (Acc / Prc / Rec)
Featureset-1 | 63.54% / 65.36% / 62.51% | 57.92% / 57.25% / 55.64% | 55.35% / 57.97% / 53.78% | 50.42% / 49.34% / 47.87%
Featureset-2 | 68.06% / 68.89% / 67.16% | 70.42% / 69.25% / 68.21% | 62.97% / 64.63% / 62.84% | 58.81% / 59.19% / 57.99%
IS09 | 64.93% / 65.76% / 64.77% | 70.56% / 69.96% / 70.07% | 59.31% / 62.06% / 57.56% | 62.64% / 62.39% / 62.10%

EmoDB
Feature-set | MLP (Acc / Prc / Rec) | SVM (Acc / Prc / Rec) | Random Forest (Acc / Prc / Rec) | Logistic Regression (Acc / Prc / Rec)
Featureset-1 | 78.12% / 77.39% / 76.99% | 82.42% / 81.06% / 81.81% | 70.27% / 70.79% / 67.92% | 77.00% / 77.00% / 75.22%
Featureset-2 | 84.86% / 85.65% / 84.37% | 85.97% / 86.71% / 85.33% | 74.01% / 73.71% / 70.67% | 82.61% / 83.76% / 81.42%
IS09 | 81.28% / 81.26% / 82.12% | 84.69% / 85.43% / 85.24% | 74.56% / 76.84% / 71.35% | 80.75% / 80.70% / 81.19%

CaFE
Feature-set | MLP (Acc / Prc / Rec) | SVM (Acc / Prc / Rec) | Random Forest (Acc / Prc / Rec) | Logistic Regression (Acc / Prc / Rec)
Featureset-1 | 57.48% / 57.63% / 58.02% | 59.51% / 60.17% / 61.17% | 51.50% / 52.51% / 51.11% | 51.71% / 52.53% / 52.76%
Featureset-2 | 69.62% / 68.82% / 69.31% | 70.61% / 70.02% / 69.47% | 63.69% / 63.54% / 58.49% | 65.87% / 65.41% / 62.87%
IS09 | 55.24% / 55.94% / 55.97% | 58.99% / 59.44% / 59.42% | 49.37% / 51.37% / 49.44% | 54.92% / 55.39% / 55.70%
Table.7 – Comparison of SER performance of proposed system with that of existing research
(Proposed, continued): SVM+FS2: 88.89%; SVM+IS09: 87.04% (using 90:10 data split)
A. J. et al (2020) [2] | Random Forest | MFCCs, RMS, Zero Crossings, Spectral Smoothness | RADVESS, 7 emotions | 66.04% (using 10-fold CV)
J. Ancilin et al (2021) [29] | SVM | MFMC | RADVESS, 8 emotions | 64.31% (using 10-fold CV)
A. Koduru et al (2020) [24] | SVM, LDA, Dtree | Pitch, energy, ZCR, Wavelet, MFCCs | RADVESS, 4 emotions (Angry, Happy, Neutral, Sad), trained on a sample of 30-40 per emotion | SVM: 70%; LDA: 65%; Dtree: 85% (evaluation criteria not specified)
Proposed | MLP, SVM | Feature-set2, IS09 | RADVESS, 8 emotions | SVM+IS09: 70.56%; SVM+FS2: 70.42%; MLP+FS2: 68.06%; MLP+FS2: 82.22% on 4 emotions (using 10 folds); SVM+FS2: 77.08%; SVM+IS09: 72.92% (using 90:10 data split)
Accuracy comparison with previous work: our model enhanced the EmoDB recognition rate by 4.47% in comparison with [29], where improved MFMC features were used to train an SVM model. For RADVESS, our model achieved 4.54% better accuracy than the SER system in [2] and a 6.25% performance increase over the SER system in [29]. Also, we trained our model on 4 emotions to compare our results against [24], which achieved 85% using a decision tree on a RADVESS data subset, while our model achieved 82.22% when trained on the whole dataset.

[Fig.1 – Model running performance (training and testing time in seconds) using feature-set2, for MLP, SVM, RF and LR on RADVESS, EmoDB and CaFE]

Fig.1 shows the model running performance analysis. MLP takes the longest time to train, which was expected since it is a feed-forward neural network model with 400 neurons; in particular, when the data size is big, as in the case of the RADVESS corpus, it took around 227 seconds. Random Forest comes in second place, which is justified since it belongs to the ensemble-model category and takes time to run through all its trees; it took 10 seconds for RADVESS and 4 seconds for CaFE. SVM showed the best training and testing times, even better than Logistic Regression.

In this study, we present the results of cross-corpus training, where we explored the performance of the model on cross-corpus data. During this experiment, we used Featureset-2 on the 4 emotions common to the corpuses: angry, happy, neutral and sad. Data were split into an 80:20 train-test ratio. Firstly, we used RADVESS as the training corpus, since it is the largest, and tested on the other 2 datasets, as shown in Fig.2. The model achieved an accuracy of 41.18% on EmoDB as test data using Random Forest as the classifier. Also, we achieved an accuracy of 43.56% on EmoDB as test data with MLP as the classifier.

[Fig.2 – SER Accuracy Training on RADVESS]

Fig.3 shows the results obtained when training on both the EmoDB and CaFE corpuses while using RADVESS as test data. A recognition rate of 46.67% was achieved using Random Forest. Anger recognition reaches 56%, neutral 43%, happy 33%, and sad 43%. Fig.4 shows the confusion matrix, where it can be seen that neutral is mixed up with sad in 38 test samples.

[Fig.3 – SER Accuracy training on EmoDB and CaFE]

[Fig.4 – Confusion matrix when training on EmoDB and CaFE and testing on RADVESS]

Table.8 shows the prediction results when using the cross corpus composing the 3 languages as training data. Promising results were found: RADVESS gets 79.26%, EmoDB 88.24% and CaFE gets 82.35%. These findings show very promising potential for our model and emphasize model generalization. We believe that better results can be obtained if the model is enriched with more training samples from cross corpuses.
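The cross-corpus protocol used in these experiments (train on some corpora, test on a held-out one) can be sketched as a leave-one-corpus-out loop; the corpus arrays and the nearest-centroid stand-in classifier below are hypothetical placeholders, not the paper's actual features or models:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the three corpora: (features, labels) pairs,
# with 4 emotion classes (e.g. angry/happy/neutral/sad) encoded as 0..3.
corpora = {
    "RAVDESS": (rng.normal(size=(60, 8)), rng.integers(0, 4, 60)),
    "EmoDB":   (rng.normal(size=(60, 8)), rng.integers(0, 4, 60)),
    "CaFE":    (rng.normal(size=(60, 8)), rng.integers(0, 4, 60)),
}

def leave_one_corpus_out(corpora, train_and_eval):
    """Train on all-but-one corpus, test on the held-out one."""
    results = {}
    for held_out in corpora:
        X_tr = np.vstack([corpora[c][0] for c in corpora if c != held_out])
        y_tr = np.concatenate([corpora[c][1] for c in corpora if c != held_out])
        X_te, y_te = corpora[held_out]
        results[held_out] = train_and_eval(X_tr, y_tr, X_te, y_te)
    return results

def nearest_centroid_acc(X_tr, y_tr, X_te, y_te):
    # Toy classifier: assign each test sample to the nearest class centroid.
    cents = np.array([X_tr[y_tr == k].mean(axis=0) for k in np.unique(y_tr)])
    pred = np.argmin(((X_te[:, None, :] - cents) ** 2).sum(-1), axis=1)
    return np.mean(pred == y_te)
```

Swapping in the real feature extraction and any of the four classifiers studied here turns this skeleton into the evaluation loop behind Figs. 2-3 and Table 8.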