
An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for Speech

Emotion Recognition
Md. Rayhan Ahmed1,2, Salekul Islam1, A. K. M. Muzahidul Islam1, Swakkhar Shatabda1*
1 Department of Computer Science and Engineering, United International University, Bangladesh
2 Department of Computer Science and Engineering, Stamford University Bangladesh
Email address: rayhansimanto@stamforduniversity.edu.bd, salekul@cse.uiu.ac.bd, muzahid@cse.uiu.ac.bd,
swakkhar@cse.uiu.ac.bd
*Corresponding author: Swakkhar Shatabda (swakkhar@cse.uiu.ac.bd), Department of Computer Science and
Engineering, United International University, Plot-2, United City, Madani Avenue, Badda, Dhaka-1212, Bangladesh.

Abstract: The ability to detect human emotions and reactions aids in the improvement of human-computer interaction (HCI). The process of
distinguishing emotions from speech signals is known as speech emotion recognition (SER). SER’s performance and implementation value are
dependent on the derived features from speech signals. In this article, we propose four new architectures, namely Model-A (1D CNNs-FCNs),
Model-B (1D CNNs-LSTM-FCNs), Model-C (1D CNNs-GRU-FCNs), and Ensemble Model-D, which combines Model-A, Model-B, and Model-
C through a weighted average methodology. To assess the performance of those models, we have utilized five standard benchmark datasets: TESS,
EMO-DB, RAVDESS, SAVEE, and CREMA-D. Since the number of samples in those datasets is relatively small, we have performed data
augmentation (DA) by injecting Additive White Gaussian Noise (AWGN), pitch shifting, and time stretching to improve the models' generalization,
increase their accuracy, and reduce overfitting. We handcrafted five categories of features: Mel-frequency cepstral coefficients (MFCC), Log
Mel-Scaled Spectrogram (LMS), Zero-Crossing Rate (ZCR), Chromagram, and the statistical Root Mean Square Energy (RMSE) value from each
audio file in those datasets. These features serve as the input for the one-dimensional (1D) convolutional neural network (CNN) architecture
that further extracts hidden local features from those speech utterances. To acquire additional long-term contextual representations from the
local features learned by the 1D CNN blocks, we also extended our experiment by incorporating LSTM and GRU layers after the CNN blocks,
resulting in higher accuracy than the baseline Model-A. It is observed that after performing DA, all four models perform
exceptionally well in the SER task of detecting emotions from raw speech audio; notably, the ensemble Model-D accomplishes the state-of-the-art
weighted average accuracy of 99.46% for TESS, 95.42% for EMO-DB, 95.62% for RAVDESS, 93.22% for SAVEE, and 74.09% for CREMA-D
datasets. The source code of the four proposed frameworks is accessible in the following GitHub repository to allow future extension and reproducibility
of the findings: https://github.com/simanto99/ser-ensemble

Keywords: Speech Emotion Recognition, Data Augmentation, 1D CNN, LSTM, GRU, Ensemble.

1. Introduction

The interaction between humans and computers is progressing swiftly. HCI studies how humans interact with computers and
the extent to which computers are designed for productive human interaction [1]. Interactions between these two entities should be as
spontaneous as human-to-human conversations. Speech is the principal mode of communication among human beings, and it is a
noteworthy means for HCI by using an auditory sensor like a microphone [2]. Through speech, we humans express one of our most
fundamental components: emotions. SER is imperative for enhancing HCI. It is a powerful area of HCI that sets the
direction in which modern-day electronic devices and equipment are rapidly moving [3]. SER refers to the domain of mining emotional data
from speech signals. Various significant applications, such as intelligent robots, audio surveillance, criminal investigations, automated
smart home appliances, movie or music recommendation systems, and dialogue systems, rely on the user's emotional state and would
benefit from a system that automatically detects the user's emotion from speech. Researchers have developed various techniques over the
last decade to provide robust and lightweight SER systems [4,5]. However, due to a lack of technologies and tools, the ambiguous nature
of emotions, diversity in language and accent across different cultures, and frequency and amplitude variation in human utterances
with respect to gender and age, recognizing a human's emotional state from speech has proven to be complicated.
A generic model of SER is comprised of two stages as shown in Fig. 1. At the first stage, features are extracted from the speech
signals. Several studies have derived many features in the area of speech audio processing. Major feature types include continuous
speech features (e.g., pitch, timing, energy, articulation, and formant), voice quality features (e.g., voice level and phrase), spectral-
based features such as MFCC [6], Log-Frequency Power Coefficients (LFPC) [7], Linear-Prediction Coefficients (LPC) [8], and
Teager-Energy-Operator (TEO) [9] based features and many more [10,11]. Since these features change over time, speech audio is
divided into frames of suitable sizes. Low-level descriptor features such as MFCC, LMS, ZCR, Energy, Pitch, Chromagram, Tonal
Centroid features, RMSE, and Spectral contrast are extensively used in the SER task [2,3,7,12–16]. However, the problem of a lower
accuracy rate in SER still exists. The features mentioned earlier are presumed to remain constant throughout a frame. As a result, the
audio of the speech can be divided into frames, each of which is represented by a feature vector [17]. These frames can then be used
as a data set for training the proposed models. For this study, ZCR, Chromagram, MFCC, RMSE value, and LMS features are
explored.

Fig. 1. A generic architecture of the SER model.

The second stage involves classifying those speech signal features, combined into feature vectors, using linear
and non-linear classifiers. Frequently used linear classifiers for SER are Support Vector Machine (SVM), Bayesian Networks (BN),
Linear Discriminant Analysis (LDA), and K-Nearest Neighbors (KNN) [18–22]. Due to the non-stationary nature of speech
signals, various non-linear classifiers such as the Hidden Markov Model (HMM) and Gaussian Mixture Model
(GMM) are considered to work effectively for the SER task [23–25].
DL is a subset of Machine Learning (ML) that has received substantial interest from the research community in recent years. To
address the shortcomings of conventional handcrafted feature-based approaches, DL has automated the feature extraction process.
Handcrafted features have had much success in the SER task. Several studies have been carried out in SER utilizing 1D CNNs as the
classification model [26–29]; however, they lack high accuracy and are memory expensive. Besides that, to study
long-term contextual correlations and understand the cues of emotion in speech, researchers have used Recurrent Neural Networks (RNN), Long
Short-Term Memory (LSTM), Gated Recurrent Units (GRU), Bidirectional GRU (Bi-GRU), and Bidirectional LSTM
along with Fully Connected Networks (FCNs) [27,28,30–32]. 1D CNN models are used effectively on time-series data and have shown
great promise for audio classification tasks.
We have used five publicly available datasets that are extensively used in the literature: Toronto Emotional Speech Set (TESS)
[33], Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [34], Surrey Audio-Visual Expressed Emotion
(SAVEE) [35], Crowd-Sourced Emotional Multimodal Actors Dataset (CREMA-D) [36], and Berlin Database of Emotional Speech
(EMO-DB) [37]. The number of samples in each of these datasets is relatively low for a Deep Learning (DL) model to train properly
without overfitting. Besides, some of these datasets have class imbalance issues. For training the models, we performed
audio DA by injecting AWGN [38], changing the pitch of the signal, and stretching the signal in time. The models were trained with
augmented data along with the original datasets, yielding a higher accuracy rate with minimal overfitting and improved
generalization ability. Combined with DA, each proposed model produces exceptional results for the SER task.
In this paper, we propose four frameworks: first, a baseline dilated 1D CNNs-FCNs based framework; second, a dilated 1D CNNs-
LSTM-FCNs based framework; third, a dilated 1D CNNs-GRU-FCNs based framework; and fourth, an Ensemble of those three
frameworks through a weighted average mechanism. All four frameworks perform SER tasks with higher accuracy than most of the
previous studies. The datasets utilized in this study do not have a significant number of utterances in each of them. Whenever we
train a DL model with a small dataset, the issue of overfitting arises. We address this overfitting issue for the SER-task in this study
by performing data augmentation, which increases the data samples for training, giving the model generalizability property, applying
kernel and bias regularization (L2=0.01) that reduces the weights squared magnitude, and adding dropout layers that arbitrarily
eliminate the neurons during training the models as well as performing 4-fold cross-validation. Initially, MFCC, LMS, Chromagram,
ZCR, RMSE value features are extracted from the audio, as previous studies have suggested their efficacy in the SER task [12,39,40].
The mean value of these features is calculated and is used to train the model to detect human emotions such as "happiness," "sadness,"
"fearful", "surprise," "anger," "surprise," "boredom," "neutral" etc., from audio signals with improved recognition performance. The
noteworthy contribution of this work is as follows:

• This work first extracts the low-level descriptor features from the audio files, uses seven local feature acquiring blocks (LFABs) inside the 1D CNN-
based baseline model to better understand the high-level hidden local features from those extracted features during training,
and finally classifies the speech emotion. The study attains state-of-the-art detection accuracy in five public speech emotion
datasets: TESS, RAVDESS, SAVEE, CREMA-D, and EMO-DB, covering two languages: English and German.
• We propose four deep neural network-based models built using LFABs. The baseline Model-A uses seven LFABs followed
by Fully Connected Network (FCN) layers and a softmax layer for classification. The other two models, Model-B and
Model-C are proposed by adding a global feature acquiring block (GFAB) after the LFABs. Model-B uses LSTM and FCNs
and Model-C uses GRU and FCNs. LSTM and GRU are able to acquire the long-term contextual representations of the
speech signals. The final ensemble Model-D combines the three individual models by adjusting their weights and
achieves better performance compared to the individual models (i.e., Model-A, B, and C).
• We have extensively experimented the models on five widely used datasets for speech emotion recognition that are publicly
available: TESS, RAVDESS, SAVEE, CREMA-D, and EMO-DB, covering two languages: English and German.
• Data Augmentation is performed to increase the training samples and reduce overfitting. Detection accuracy increased by
3-32% compared to the models trained with the original datasets only. A comparison is provided of the performance of all the
models on non-augmented and augmented datasets.
• An Ensemble framework is also proposed that combines the three individual models by adjusting their weights and achieves
better performance compared to the individual Model-A, B, and C.
• The performance of all the proposed models is compared with the previous state-of-the-art models. Amongst all four models,
the ensemble Model-D achieves the state-of-the-art weighted average accuracy of 99.46% for TESS, 95.42% for EMO-DB,
95.62% for RAVDESS, 93.22% for SAVEE, and 74.09% for CREMA-D datasets. These are significantly improved results
compared to the single models and the previous state-of-the-art methods on each dataset.

The remainder of this paper is organized as follows. Section 2 reviews the existing literature on the SER
task to grasp the current trends, gain intuition, and identify scope for improving the task. Section 3 provides an overview of the
system, covering the datasets, the data augmentation techniques, the feature extraction process, and the proposed
architectures. We comprehensively analyze the experimental results of the SER task on five datasets in Section 4. A comparative
analysis of the performance of this work against the existing literature is presented in Section 5. Finally, in Section 6, we conclude with a
discussion of the existing limitations and future research directions in SER.

2. Related Work

Digital signal processing (DSP) is a matter of great interest among the research community, and researchers have come up with
several approaches for a robust improvement by eliminating the existing issues in the SER task. For any SER task, feature selection
is the most critical part because irrelevant features directly affect the next part, which is speech emotion classification. Currently,
researchers worldwide are utilizing DL for SER-related tasks due to their vast triumphs in representing features. One of the significant
features utilized by most researchers nowadays regarding SER tasks is the speech spectrogram. It is a 2D depiction of the speech
signals [2,27,40–42]. It visually represents a signal's power at different frequencies over time. MFCC
is the most widely used feature in SER tasks. By converting the conventional frequency scale to the mel scale, MFCC accounts for
the human ear's sensitivity at different frequencies, making it appropriate for SER tasks. When training models, 12-20 MFCC
coefficients are usually taken into account, containing information about the rate of change in the different spectrum bands [11–13,43].
Another feature that is often considered is the ZCR, representing the positive and negative sign change in a signal [3]. Chroma-based
audio features have proven effective for investigating, evaluating, and extracting information from music audio and SER-related
tasks [13].

CNN, RNN with LSTM, Bi-GRU, or their ensemble or combination make up most SER architectures that use neural networks
[15,28,30,32,42]. Zhao et al. utilize 1D CNN-LSTM and 2D CNN-LSTM models for SER from audio clips, learning hierarchical local
correlations and global contextual features; using LMS as input data for the models, they attain a state-of-the-
art accuracy of 95.89% on the EMO-DB dataset [27]. Liu demonstrates that a feature set consisting of Gamma-tone frequency cepstral
coefficients improves the SER accuracy by 3.6% over MFCCs by investigating three frameworks: Fully Connected Networks (FCN),
LSTM and Attention-LSTM [11]. Liu et al. propose an SER framework that selects features based on correlation analysis and Fisher
criterion [44]. This process reduces the number of irrelevant and redundant features [45]. They then used an extreme learning machine
decision tree scheme to classify the emotions into different categories by utilizing the Chinese emotion corpus CASIA. Liu et al.
utilize a Genetic Algorithm (GA) in their study of SER based on learning the brain emotion [22]. They used 21 statistical
features, 14 MFCCs, 7 LogMelFreqBand features, and their first-order delta coefficients as the feature set. They reduced the feature set
by using LLDs and principal component analysis (PCA) [46]. The model was tested on the CASIA, SAVEE, and FAU Aibo [47] datasets.
Using a frame-based and utterance-based approach, Fayek et al. [48] evaluate multiple DL architectures for SER on the IEMOCAP
[49] dataset and obtain a test accuracy of 64.78% on frame-based and 57.74% on the utterance-based approach. An SER architecture
recently developed by Sainath et al. [50] obtains features directly from raw audio data using temporal features and a Deep Neural
Network (DNN). Mustaqeem and S. Kwon [51] also extract a similar spatiotemporal feature for SER using a ConvLSTM model.
Using four blocks of 1D CNN and LSTM, they have gathered the most significant distinctive emotional features. The extracted
features were then fed into a GRU-based network, which used them to re-adjust the global weights. The model attained 75%
accuracy on the IEMOCAP dataset and 80% accuracy on the RAVDESS dataset. Li et al. propose a complex model that combines a
spatiotemporal attention network, which narrows down the emotional frequency regions from a long spectrogram image, with a frequency-
based attention network to focus on the desired emotional regions. They have also developed a large-margin learning technique to
deal with the problem of feature aliasing. It improves intra-class compactness while increasing inter-class distances among features
[52].
Yoon et al. implement a dual recurrent encoder model approach for SER tasks by utilizing text and audio data from the IEMOCAP
dataset [53]. They employ MFCC derivatives along with text tokens as the input features of the proposed framework. Their
multimodal methodology led to an accuracy of 71.8% on the testing dataset. Singh et al. use the CREMA-D dataset to train two
classifiers, SVM and RNN, that account for variance in speech intensity. The classifiers were trained at three levels of intensity: low,
medium, and high. The "Happy" and "Neutral" labeled emotions have the highest classification accuracy, while the "Disgust" labeled
emotion has the lowest [54]. However, they did not provide any accuracy curve showing their stated accuracy in an epoch-wise
manner, nor any confusion matrix representing class-wise accuracy. Meng et al. propose a 3D LMS-based dilated CNN with
residual blocks and memory attention mechanisms [31]. They used a composite of static LMS, delta, and delta-delta features
to build the feature vector from the raw speech signal as input for the model. The model achieved 74.96% accuracy on the
IEMOCAP (speaker-dependent) setting and 69.32% accuracy on the IEMOCAP (speaker-independent) setting. It achieved its best
accuracy of 90.37% on the EMO-DB (speaker-dependent) dataset.
Anvarjon et al. propose an SER model based on extracting high-level features from the spectrograms of speech utterances. The
model’s performance was evaluated with two datasets. It achieved 77.01% and 92.02% accuracy for the IEMOCAP and EMO-DB
dataset, respectively [15]. A new multi-task learning approach is proposed by Zhang et al. [55]. It combines features from both speech
and song samples from the RAVDESS dataset. The authors used only four categories of emotions out of eight: "anger," "happiness,"
"neutral," and "sad." Their group multi-task feature selection model achieved an accuracy of 57.14%. Wang et al. suggests a new
kind of sound feature, Fourier Parameter (FP) functions, estimated by Fourier analysis [56]. They only used six of the EMO-DB
dataset’s seven emotion classes, omitting the "disgust" category. They have extracted two types of features: FP and MFCC, from the
dataset and utilized them as input to the SVM classifier that achieved 73.3% average accuracy.
By collecting 20-MFCC, 20-delta, and 20-delta-delta features and computing their mean values, Nantasri et al. [12] propose an
SER model. These mean values were used as the input for an Artificial Neural Network (ANN) classifier. They evaluated their
model with the RAVDESS and EMO-DB datasets and achieved 82.3% and 87.8% accuracy, respectively. Badshah et al. propose a 2D
CNN-based SER architecture that uses spectrograms generated from the EMO-DB dataset [41]. They have also explored
the field of Transfer Learning (TL) and utilized the pre-trained AlexNet [57] architecture but got poor results. The initially proposed model
achieved 84.3% accuracy on the test set. Numerous other studies suggested using ensemble techniques in SER tasks [18,58,59] and
achieved a satisfactory outcome in terms of SER accuracy from speech signals.

3. System Overview

Primarily, an SER framework comprises two components. The first is a preprocessing component that obtains appropriate
features from the speech signals, and the second is a classifier that uses those obtained features to perform SER. This section
will provide details about the datasets utilized in this study, extracted features from the audio files, and the organization and design
of the proposed models.

3.1. Datasets

Emotion is a subtle phenomenon. To perform a meticulous evaluation of the proposed models, five datasets were explored, covering
two languages: English and German. We performed DA on all the datasets since the number of samples in each of them is not
sufficient for a DL model to train appropriately. A summary of each of them is provided below.

TESS: TESS [33] is the first dataset that this study explored. It contains 200 target words. Those words were spoken by two English
actresses, ages 26 & 64 years, respectively. The dataset is well balanced and contains 2800 audio files and depicts seven emotions:
"angry," "neutral," "happy," "disgust," "surprise," "fear," and "sad". Note that this dataset has not been extensively used in SER
studies previously. After augmentation, the number of samples increased to 8400. The average sample duration across all datasets is 2.8
seconds, with TESS being the outlier with an average duration of 2.1 seconds.
RAVDESS: RAVDESS [34] is one of the most explored datasets in the SER tasks. It includes both audio and video recordings of
twelve male and twelve female actors reciting English sentences while exhibiting eight distinct emotional expressions. For this study,
only the speech audio samples were utilized. The total number of audio files is 1440 with a sampling rate of 48 kHz, with 60 trials
per actor. Only the speech audio sample from the dataset of the following eight categories are covered in this study: "sad," "happy,"
"angry," "calm," "fearful," "surprised," "disgust," and "neutral". It is a balanced dataset though the "neutral" class has less number of
records compared to other classes. After performing DA, the samples increased to 7200 samples.
SAVEE: SAVEE [35] has not been thoroughly utilized in SER studies yet. It consists of 480 TIMIT corpus sentences spoken by four
English actors aged 27 to 31 years in seven diverse emotions: "angry," "happy," "neutral," "disgust," "sad," "fear," and "surprise" in
a phonetically stable manner. The utterances are sampled at a rate of 44.1 kHz with a resolution of 16 bits. However, this dataset has
a class imbalance issue, with the "neutral" class containing almost double the utterances of each of the other classes. For this study, only the speech
audio samples were utilized. DA increased the number of samples to 1920.
Berlin EMO-DB: EMO-DB [37] is the most well-known and extensively used dataset in the SER research field. The utterances are
sampled at a rate of 16 kHz with a resolution of 16 bits. It comprises 535 audio recordings in the German language categorized into
seven emotional categories: "anger," "fear/anxiety," "sadness," "happiness," "disgust," "boredom," and "neutral". However, this dataset
has a class imbalance issue, with the "anger" class containing a larger number of utterances than the other classes. With DA, the number of samples
increased to 2140.
CREMA-D: CREMA-D [36] is the least explored dataset in the SER research field. It uses 7442 recordings from ninety-one
actors/actresses (48 male actors and 43 female actresses) from diverse races and backgrounds, making it the most challenging to use.
Actors spoke from a group of twelve sentences in six different emotional categories: "angry," "happy," "neutral," "disgust," "fear,"
and "sad". Though the original number of samples is quite large compared to the other four datasets, DA was also applied to this dataset,
increasing the number of samples to 22,326. Fig. 2 shows the class-wise utterance distribution for each of the datasets.

Fig. 2: Class-wise utterance distribution in all five datasets: (a) CREMA-D, (b) RAVDESS, (c) SAVEE, (d) EMO-DB, (e) TESS, and (f) all classes combined.
3.2. Data Augmentation (DA)
DA is the method of applying minor modifications to our original training dataset to produce new artificial training samples. Since
the number of speech utterance records in each class is relatively low, this study performs three types of audio DA: AWGN [38]
injection, time stretching, and pitch shifting of the audio files. The result is more data for proper training of the models. The impact
of these techniques is visually presented in Fig. 3. The signal obtained with AWGN is equal to the transmitted signal with some
added noise, which is statistically independent of the signal. AWGNs are random samples dispersed at consistent intervals with a
mean value of zero and a standard deviation of one. We added AWGN to the samples by using numpy's normal and uniform methods
with a rate of 0.030. We can adjust the speed or duration of a sound sample without changing the pitch by time stretching. We
performed this task by using the time_stretch method of Python's librosa library, with a factor of 0.8. We also changed the sound's
pitch without affecting the speed. Pitch shifting was done by using the pitch_shift method of librosa, with a factor of 0.6.
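A minimal sketch of these three augmentation steps is shown below, assuming librosa and numpy; the noise-scaling formula is one plausible reading of "numpy's normal and uniform method with a rate of 0.030," and the file path is hypothetical.

```python
import numpy as np
import librosa

def add_awgn(y, rate=0.030):
    # AWGN scaled by the signal's peak amplitude and a uniform random factor
    # (an assumed interpretation of the rate-0.030 description above).
    noise_amp = rate * np.random.uniform() * np.amax(y)
    return y + noise_amp * np.random.normal(size=y.shape)

def stretch(y, rate=0.8):
    # Time stretching: changes duration/speed without altering pitch.
    return librosa.effects.time_stretch(y, rate=rate)

def shift_pitch(y, sr, n_steps=0.6):
    # Pitch shifting: changes pitch without altering duration.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# Example: build the three augmented copies of one utterance (hypothetical file path).
y, sr = librosa.load("speech_sample.wav", sr=44100)
augmented = [add_awgn(y), stretch(y), shift_pitch(y, sr)]
```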

Fig. 3. Pictorial illustration of the impact of various DA techniques on the speech utterances: (a) original sound waveform, (b) AWGN-injected waveform, (c) time-stretched waveform, and (d) waveform with shifted pitch.

3.3. Feature Extraction


Extracting salient features from speech audio signals is one of the most important measures in SER-related activities. Precise
extraction of crucial features improves the performance in terms of SER accuracy of the model. Specifically, this study uses five
different spectral features: MFCC, LMS, ZCR, Chromagram, and RMSE values of the speech audio files as the input for the proposed
dilated 1D CNNs-FCNs, 1D CNNs-LSTM-FCNs, 1D CNNs-GRU-FCNs, and Ensemble of those three models. The brief details of
the spectral features are given below.
MFCC: Human-generated sounds are filtered through the vocal tract shape that includes tongue and teeth elements, which also is
unique for each individual. The structure of these elements determines the voice of an individual. A precise measurement of the shape
represents the phoneme being created. This shape is exhibited in the short-time power spectrum envelope, which is represented by
MFCCs [60], and this feature is commonly used in SER research [12,22,40,61]. The MFCC feature extraction process is depicted in
Fig. 4. It starts with the speech signal being converted into short frames using a 20-30 millisecond (ms) window, and every 10 ms, the window is
advanced, allowing the temporal features of individual speech signals to be traced. Then Discrete Fourier Transform (DFT) is
performed on every windowed frame, and they are converted into magnitude spectrum using Eq. 1.
X_i(k) = \sum_{n=0}^{N-1} x_i(n)\, h(n)\, e^{-j 2\pi k n / N}, \qquad 0 \le k \le N-1 \qquad (1)

Here, h(n) is the Hamming window, k is the DFT frequency index, x_i(n) represents the time-domain signal, i is the frame
number, and N is the number of points used to calculate the DFT. After that, the Mel-Scaled Filter-bank (MSFB) is computed by applying 26
filters to the magnitude spectrum. The Mel scale is a measurement unit based on the frequency perception of the human
ear. As a result, we obtain 26 numbers that describe the energy of each frame. The logarithm of these energies is then taken to obtain the log filter-
bank energies. The mapping from physical frequency to the Mel scale is given by Eq. 2.
f_{Mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (2)

Here, f denotes the physical frequency (Hz) and f_Mel denotes the perceived frequency on the Mel scale. Finally, the Discrete Cosine
Transform (DCT) is performed to get the MFCCs from the log filter-bank energies.

Fig. 4. MFCC feature extraction process.


For this study, the 13 lower-order MFCCs were extracted from each audio file. The spectral envelopes are sufficient to reflect the differences
between phonemes, allowing phonemes to be recognized using MFCCs. The sampling rate was set at 44.1 kHz, with the type-2 DCT.
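As a hedged illustration, the extraction described above maps onto a single librosa call; the 13 coefficients, 44.1 kHz sampling rate, and type-2 DCT follow the text, while the frame and hop lengths are library defaults (an assumption) and the file path is hypothetical.

```python
import numpy as np
import librosa

# Load the utterance and compute 13 MFCCs with a type-2 DCT.
y, sr = librosa.load("speech_sample.wav", sr=44100)            # hypothetical file path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, dct_type=2)  # shape: (13, n_frames)
mfcc_mean = np.mean(mfcc.T, axis=0)                             # 13-dimensional time-averaged summary
```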
Chromagram and Pitch: The Chromagram is a time-frequency representation of an acoustic signal that projects the spectrum onto the
twelve pitch classes and is used extensively in the SER task [13,16]. Chromagram features are collected by applying the Short-
Time Fourier Transform (STFT) to the waveforms of the datasets' audio files. For this
study, 12 Chromagram bins were extracted from each audio file. The frequencies of the sound wave determine the pitch feature in the SER
task [62]. When the frequency is high, the pitch is considered high, and when the frequency is low, the pitch is considered low as well. In this
study, the pitch-shift factor was set at 0.6 during DA to create more samples for training.
ZCR: ZCR is widely used for SER as well as music information retrieval tasks [63–65]. ZCR measures the number of times
the amplitude of a speech signal passes through zero in a given period. ZCR is a simple way to discriminate between voiced
and unvoiced segments: frequent zero crossings indicate the absence of a dominant low-frequency oscillation.

Fig. 5. Waveforms, spectrograms, and MFCCs for selected emotion categories (Angry, Bored, Disgust, and Surprise) from the experimented datasets.
Spectrogram: Spectrogram portrays a signal’s intensity in terms of the time-frequency domain. A spectrogram is generated by
dividing a time-domain signal into equal-length segments. After that, each segment is subjected to the Fast Fourier Transform (FFT).
The spectrogram is a plot of each segment’s spectrum. It is a significant feature for any speech-related classification task and performs
exceptionally with CNN [31,40,41,66]. For this study, 128 LMS features were extracted from each audio file.
The use of multiple audio features rather than just one integrates several sound characteristics such as pitch, tone, harmony, etc.,
into a single training speech. This gives the SER models a more detailed interpretation of a speech sound sample, which improves
their performance. Some of the randomly selected waveforms of the dataset’s recordings (both male and female) and their
corresponding MFCCs and LMSs are visually represented in Fig. 5.
Root Mean Square Energy (RMSE): The cumulative magnitude of a signal refers to the signal's energy, which roughly
corresponds to the signal's loudness in audio signals. RMSE is a normalization technique used by researchers [51,67] for speech
audio that uses a signal's magnitude as a metric for signal power, irrespective of positive or negative amplitude.
In total, this study extracts 13 MFCC, 12 Chromagram, 128 LMS, one ZCR, and one RMSE feature, creating
a feature vector of dimension 155 (128+13+12+1+1=155).
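A minimal sketch of assembling this 155-dimensional vector with librosa is given below; the concatenation order, the frame settings, and the helper name are assumptions for illustration, while the feature counts and the use of time-averaged (mean) values follow the text.

```python
import numpy as np
import librosa

def extract_features(path, sr=44100):
    """Build the 155-dimensional vector (13 MFCC + 12 Chroma + 128 log-Mel + 1 ZCR + 1 RMSE)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T, axis=0)              # 13
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr).T, axis=0)                # 12
    lms = np.mean(librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)).T, axis=0)             # 128
    zcr = np.mean(librosa.feature.zero_crossing_rate(y).T, axis=0)                     # 1
    rmse = np.mean(librosa.feature.rms(y=y).T, axis=0)                                 # 1
    return np.hstack([mfcc, chroma, lms, zcr, rmse])                                   # 155 values

features = extract_features("speech_sample.wav")  # hypothetical file path
```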
3.4. Proposed baseline Model-A (1D CNNs-FCNs)
This study uses 1D CNN followed by FCNs to build the first baseline model for SER. 1D CNN performs well with structured data.
In terms of audio data, 1D CNN extracts the temporal information within the speech signal. The extracted ZCR, Chromagram, MFCC,
RMSE, and LMS features from the speech signals are stored in an array creating a vector of features. This vector of features is fed
to the proposed baseline model as input.
Using seven LFABs containing convolutional, max-pooling, batch normalization (BN), and dropout layers, the model extracts
hidden local patterns from the speech audio signals, as shown in Fig. 6. The 1D-convolution layer and 1D-max-pooling layer are the
essential layers of the LFABs. Two FCNs collect the global features from the speech signals.

Fig. 6. Architecture of the proposed baseline Model-A (1D CNNs-FCNs).

The proposed 1D CNNs-FCNs Model-A takes 155×1 feature vector arrays as the input. The first LFAB has 256 filters, with
a kernel size of eight, padding = 'same', a dilation rate of (1×1), and a stride of one. The dilation rate reduces the input vector
feature-map. The Rectified Linear Unit (ReLU) then triggers its output after BN is added. By solving the vanishing gradient problem,
the BN layer aids all layers of the NN in learning at a normalized rate. It speeds up the training process by normalizing the hidden
layer activation. In addition, to cope with the model overfitting issue, we used the dropout layer and kernel regularization (L1 and
L2) methods with a rate of 0.01. The output of a preceding input layer is received by the second layer in this stack, which consists of
identical 256 filters with the corresponding kernel size, dilation rate, and stride. ReLU also enables the output of this layer, and then
dropout at a rate of 0.25 is added. Following that, BN is performed, with the output being fed to a 1D max-pooling layer with a
window size of two. The following six LFABs, with 256, 128, 128, 128, 256, and 64 filters, use the same kernel size, dilation
rate, and stride configuration as the first block. A flattening layer and 50% dropout follow the final LFAB. The
flattening layer's output is received by two FCN layers of 128 and 64 units with a dropout of 50%, and finally the output layer with a softmax
activation function whose size depends on the number of predicted classes.
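The block structure just described can be sketched in Keras as follows. This is a best-effort reading of the text, not the authors' exact implementation: the two-convolution layout of each LFAB, the ReLU activations on the FCN layers, the exact placement of dropout, and where the L2 regularization is applied are all assumptions.

```python
from tensorflow.keras import layers, models, regularizers

def lfab(x, filters):
    # One Local Feature Acquiring Block (LFAB): Conv -> BN -> ReLU -> Conv -> ReLU
    # -> Dropout(0.25) -> BN -> MaxPooling, following the description above.
    x = layers.Conv1D(filters, kernel_size=8, padding="same", dilation_rate=1, strides=1,
                      kernel_regularizer=regularizers.l2(0.01),
                      bias_regularizer=regularizers.l2(0.01))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv1D(filters, kernel_size=8, padding="same", dilation_rate=1, strides=1)(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(0.25)(x)
    x = layers.BatchNormalization()(x)
    return layers.MaxPooling1D(pool_size=2)(x)

def build_model_a(num_classes):
    # Baseline Model-A: seven LFABs followed by the FCN head and a softmax output.
    inputs = layers.Input(shape=(155, 1))
    x = inputs
    for filters in [256, 256, 128, 128, 128, 256, 64]:
        x = lfab(x, filters)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```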

3.5 Proposed baseline Model-B (1D CNNs-LSTM-FCNs) and Model-C (1D CNNs-GRU-FCNs)
Proposed Model-B and Model-C, as shown in Fig. 7, are built on top of Model-A. After the final LFAB in Model-A, one
additional GFAB comprising an LSTM layer (Model-B) or a GRU layer (Model-C) of 512 units, activated by the 'tanh' function and
followed by a dropout of 50%, is added to learn the global features as well as adjust the global weights. The FCN configuration
remains the same as in Model-A.
We adopt GRU and LSTM architecture to obtain the long-term contextual representations in speech utterances. In an LSTM cell,
there are three gates: forget, input, and an output gate. The three gates control the transfer of information into and out of the cell, and
the cell retains values over different periods. Gates are a way to allow information to pass through selectively. The GRU is similar
to the LSTM; it is a combination of two gates, an update gate and a reset gate, but its internal structure differs from the LSTM. During training,
the gates learn what information is essential to retain or overlook. The update gate in the GRU replaces the forget and input gates
of the LSTM. Reset gates aid in capturing short-term representations of sequences. Since the GRU has fewer gates than the LSTM, it is
less complicated. A GRU can be preferred if the dataset is relatively small; otherwise, an LSTM can be applied for large-volume datasets.
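A minimal Keras sketch of Model-B/Model-C is shown below, reusing the lfab() helper from the Model-A sketch. The assumption that the GFAB consumes the pre-flatten (time × channels) output of the last LFAB, rather than the flattened vector, is a reading of the text and not confirmed by the source.

```python
from tensorflow.keras import layers, models

def build_model_bc(num_classes, recurrent="lstm"):
    """Model-B (recurrent="lstm") or Model-C (recurrent="gru"):
    seven LFABs, one 512-unit GFAB with tanh activation and 50% dropout, then the FCN head."""
    inputs = layers.Input(shape=(155, 1))
    x = inputs
    for filters in [256, 256, 128, 128, 128, 256, 64]:
        x = lfab(x, filters)                        # LFABs from the Model-A sketch above
    rnn = layers.LSTM if recurrent == "lstm" else layers.GRU
    x = rnn(512, activation="tanh")(x)              # GFAB: long-term contextual representations
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```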

Fig. 7. Architecture of the proposed Model-B (1D CNNs-LSTM-FCNs) and Model-C (1D CNNs-GRU-FCNs).

3.6 Proposed Ensemble Model-D


Ensemble learning (EL) combines the learning procedures of several models to achieve a more stable and comprehensive
prediction with a maximum accuracy that is superior to the individual DL models’ accuracy. Specific models are good at modeling
one part of the data, and others are good at modeling another. EL succeeds because several models will not make identical errors in
the same test dataset. It assures that the most accurate and reliable prediction is generated. Many features in the field of SER can
reflect the emotion of speech. When the distinct advantages and accuracy of various SER-related models are merged and the features
are combined, the recognition efficiency can be significantly enhanced. A weighted average ensemble was performed in this study
by combining Model-A, B, and C (see Fig. 8). Each model is assigned a weight, which we obtain by executing a grid-search
technique. We compute predictions with each of the three individual models. Using numpy's tensordot function, we multiply the
prediction vectors by these assigned weights and sum the products over the model axis. Then, from this
weighted sum, the class with the maximum value is chosen as the final prediction. The weighted average ensemble Model-D, tested with the original
dataset and augmented data, achieved higher WAA than the individual models.
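The weighted-average step can be sketched as follows, assuming three trained Keras models (e.g., built from the sketches above) and a held-out test set x_test; the weight values shown are placeholders, since the grid-searched weights are not reported here.

```python
import numpy as np

# Stack the softmax outputs of the three individual models: shape (3, n_samples, n_classes).
preds = np.stack([model.predict(x_test) for model in (model_a, model_b, model_c)])

weights = np.array([0.4, 0.3, 0.3])                 # hypothetical grid-searched weights
# Weighted sum over the model axis: shape (n_samples, n_classes).
weighted_sum = np.tensordot(preds, weights, axes=((0,), (0,)))
y_pred = np.argmax(weighted_sum, axis=1)            # final ensemble prediction
```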

Fig. 8. Visual representation of the proposed ensemble Model-D.

3.7 Model Training


After getting the feature vector, the study performs data normalization by calculating the mean and standard deviation of the
features. Data is divided into training data and testing data with an 80:20 proportion. Those data are then turned into arrays for input.
Since we deal with categorical data, each label is given a specific number dependent on alphabetical order. We investigate the system's
potency, robustness, and effectiveness using a 4-fold cross-validation strategy for model generalizability. In each fold, 20% of the
data is used for model validation, and the remaining 80% of the data is used to train the models. Every model is trained on both
original datasets and augmented datasets. The whole process was carried out using the Keras framework for DL [68]. Grid-Search
[69] was applied to tune hyper-parameters such as the optimizer, batch size, learning rate, and ensemble weights to define the optimal values
for all four models. According to Grid-Search's output, the batch size is set at 32, 'Adam' is selected as the optimizer, and weights
were calculated to produce the highest WAA in the ensemble Model-D.
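A minimal sketch of this training setup is given below, assuming pre-extracted features and integer-encoded labels stored in hypothetical .npy files; the use of StratifiedKFold, the random seed, and shuffling are assumptions, while the 4 folds, mean/standard-deviation normalization, batch size of 32, and Adam optimizer follow the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

X = np.load("features.npy")      # hypothetical (n_samples, 155) feature matrix
y = np.load("labels.npy")        # hypothetical integer-encoded emotion labels

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    scaler = StandardScaler().fit(X[train_idx])                 # mean/std normalization
    X_tr, X_val = scaler.transform(X[train_idx]), scaler.transform(X[val_idx])
    X_tr, X_val = X_tr[..., np.newaxis], X_val[..., np.newaxis]  # (samples, 155, 1)
    model = build_model_a(num_classes=len(np.unique(y)))        # from the Model-A sketch above
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(X_tr, to_categorical(y[train_idx]), batch_size=32,
              epochs=1000,                                      # 1000 epochs per Figs. 9-16; 500 for CREMA-D (Fig. 17)
              validation_data=(X_val, to_categorical(y[val_idx])))
```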
4. Results Analysis
The number of correctly classified speech emotion labels (True Positives-TP), correctly classified instances that do not belong to
the speech emotion label (True Negatives-TN), and instances that were either incorrectly classified to the speech emotion label (False
Positives-FP) or were not classified as the speech emotion label (False Negatives-FN) will all be used to evaluate the correctness of
a speech emotion classification task. A confusion matrix is made up of these four measurements [70]. In this section, we present a
detailed analysis of our experimental results. We show the training vs. validation accuracy curve and confusion matrix of every
experiment performed on all five datasets for the three individual models (Model-A, B, and C) and the ensemble Model-D. Since the ensemble Model-
D performed best, we present its confusion matrix both before and after performing DA. The confusion matrices for
Model-A, B, and C represent the models' performance after DA only.
4.1. Evaluation metrics
We analyze the performance of individual Model-A, B, and C in terms of a weighted average (WA) and macro average (MA) F1-
score, since the datasets have imbalanced distribution of classes. We also provide the precision, recall, and F1-score of the best-
performed individual model in each dataset because of the class imbalance issues in those datasets. For the ensemble Model-D, we
have utilized the WAA metric, which calculates accuracy by adjusting the weights of each model. Table 1 presents the performance
measurement metrics for the multiclass SER task. For a distinct emotion label L_i, the evaluation is defined in terms of TP_i, TN_i, FN_i, and FP_i,
and Accuracy, Precision, and Recall are computed from these per-label counts.
Table 1
Performance measurement metrics for multiclass SER of this study (K denotes the number of emotion labels).

Accuracy (Eq. 1): Presents the overall efficacy of the SER classifier.
    \mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}

WAA (Eq. 2): Computes the average accuracy of the ensemble Model-D classifier by assigning weights to each model.
    \mathrm{WAA} = \frac{1}{K}\sum_{i=1}^{K}\frac{TP_i + TN_i}{TP_i + FN_i + TN_i + FP_i}

Precision (Eq. 3): Calculates the proportion of TP_i recognitions among the instances assigned to the positive speech emotion labels (TP_i + FP_i) in a multiclass SER task.
    \mathrm{Precision} = \frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} TP_i + \sum_{i=1}^{K} FP_i}

Recall (Eq. 4): Proportion of correctly recognized positive speech emotion labels across all labels.
    \mathrm{Recall} = \frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} TP_i + \sum_{i=1}^{K} FN_i}

F1-score (Eq. 5): Combines precision and recall into a single metric that captures properties of both in a multiclass SER task.
    \mathrm{F1} = \frac{2 \times \sum_{i=1}^{K}\mathrm{Precision}_i \times \sum_{i=1}^{K}\mathrm{Recall}_i}{\sum_{i=1}^{K}\mathrm{Precision}_i + \sum_{i=1}^{K}\mathrm{Recall}_i}

WA-F1 (Eq. 6): Calculates the F1-score for each class independently, using a weight that depends on the number of actual labels of each class in the dataset, and provides the average value taking the percentage of each label into account.
    \mathrm{WA\text{-}F1} = \frac{2 \times \sum_{i=1}^{K}\mathrm{Precision}_i \times \sum_{i=1}^{K}\mathrm{Recall}_i}{\sum_{i=1}^{K}\mathrm{Precision}_i + \sum_{i=1}^{K}\mathrm{Recall}_i} \times \mathrm{Weight}_{L_i}

MA-F1 (Eq. 7): Calculates the F1-score for each class in the dataset and returns the average value without considering the percentage of each label and without using any weights; all classes are treated equally.
    \mathrm{MA\text{-}F1} = \frac{2 \times \sum_{i=1}^{K}\mathrm{Precision}_i \times \sum_{i=1}^{K}\mathrm{Recall}_i}{\sum_{i=1}^{K}\mathrm{Precision}_i + \sum_{i=1}^{K}\mathrm{Recall}_i}
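For reference, a sketch of computing these metrics with scikit-learn is shown below; y_true and y_pred are assumed integer-encoded labels for one dataset (the toy arrays are placeholders, not results from the paper), and sklearn's "weighted" and "macro" averaging correspond to WA-F1 and MA-F1, respectively.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Placeholder labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

print(classification_report(y_true, y_pred, digits=4))          # per-class precision / recall / F1
print("Accuracy:", accuracy_score(y_true, y_pred))
print("WA-F1:", f1_score(y_true, y_pred, average="weighted"))   # weighted-average F1
print("MA-F1:", f1_score(y_true, y_pred, average="macro"))      # macro-average F1
```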

4.2. Performance Analysis


All four models performed exceptionally well in terms of emotion recognition accuracy. At first, each model's performance was evaluated
on the original datasets. All four models performed exceptionally well on the TESS dataset, with over 96% WAA. After performing
data augmentation, the performance improved further, reaching a WAA of 99.46% with the ensemble Model-D. For the EMO-DB
dataset, the recognition performance of all four models was not up to the mark and suffered from overfitting, a common issue
when deep models are trained on a relatively small dataset. Another reason is the class distribution imbalance and the
low number of samples available for the deep models to train efficiently. The same reasoning applies to the SAVEE dataset, where the WAA is
poor on the original dataset because very few samples are available for training the models. However, the performance of the models
drastically improved when tested with the increased number of samples obtained through DA. The performance increased by around 32%
over the non-augmented EMO-DB dataset and 22% over the non-augmented SAVEE dataset. In the EMO-DB dataset, the emotion
"neutral" is often classified as "boredom"; after rechecking the audio files, we observed that the spectral entropy properties of these two
categories are quite similar, which is why all the models misclassify these two types. The same reasoning goes for the SAVEE
dataset for the emotions "happy" and "surprise." After rechecking, we found that some recordings with the "happy" label are high
pitched, almost the same as the emotion "surprise."
The highest WAA achieved by the ensemble Model-D in the EMO-DB and SAVEE datasets is 95.42% and 93.22%, respectively.
When tested on the RAVDESS dataset, the models performed well, with the highest WAA achieved on the original dataset and
the augmented dataset being 89.19% and 95.62%, respectively. CREMA-D is the least explored dataset in SER-related studies. It is
challenging to work with because many actors and actresses from different races uttered several different sentences in the dataset.
The human accuracy on this dataset's audio-only part is around 40.9%. All four models exceeded that number, with the ensemble
achieving around 68.14% accuracy on the original dataset and around 74% accuracy on the augmented dataset. Tested against the
original CREMA-D dataset, the "neutral" emotion is often misclassified as "sad" because both have a lower pitch and amplitude in
the signal waveform; hence, the spectral properties are similar enough. That is why the model interprets "neutral" as "sad." Table 2
presents every model’s SER performance with and without DA in each dataset. The accuracy curve for training versus validation and
confusion matrix for the highest achieved accuracy in terms of SER after DA is also presented for all four models in Fig. 9-18. The
confusion matrix for the ensemble Model-D is achieved by adjusting the proper weights of the Model-A, B, and C after performing
a grid-search technique. Since the ensemble Model-D attained the best accuracies, we provide the confusion matrix of Model-D to
evaluate the performance before and after DA.

Fig. 9. Performance evaluation after performing DA of the four proposed models on the TESS dataset. (a), (b), and (c) show the training vs. validation accuracy
curves for Model-A, Model-B, and Model-C, respectively, trained for 1000 epochs.

Fig. 10. Performance evaluation after performing DA of the four proposed models on the TESS dataset. (a), (b), and (c) show the confusion matrices of Model-A,
Model-B, and Model-C, respectively; (d) and (e) present the confusion matrix for the ensemble Model-D before and after performing DA, respectively.

Fig. 11. Performance evaluation after performing DA of the four proposed models on the EMO-DB dataset. (a), (c), and (e) show the training vs. validation
accuracy curves for Model-A, Model-B, and Model-C, respectively, trained for 1000 epochs.

Fig. 12. Performance evaluation after performing DA of the four proposed models on the EMO-DB dataset. (a), (b), and (c) show the confusion matrices of Model-
A, Model-B, and Model-C, respectively; (d) and (e) present the confusion matrix for the ensemble Model-D before and after performing DA, respectively.

Fig. 13. Performance evaluation after performing DA of the four proposed models on the RAVDESS dataset. (a), (b), and (c) show the training vs. validation
accuracy curves for Model-A, Model-B, and Model-C, respectively, trained for 1000 epochs.

Fig. 14. Performance evaluation after performing DA of the four proposed models on the RAVDESS dataset. (a), (b), and (c) show the confusion matrices of Model-
A, Model-B, and Model-C, respectively; (d) and (e) present the confusion matrix for the ensemble Model-D before and after performing DA, respectively.

Fig. 15. Performance evaluation after performing DA of the four proposed models on the SAVEE dataset. (a), (b), and (c) show the training vs. validation accuracy
curves for Model-A, Model-B, and Model-C, respectively, trained for 1000 epochs.

Fig. 16. Performance evaluation after performing DA of the four proposed models on the SAVEE dataset. (a), (b), and (c) show the confusion matrices of Model-
A, Model-B, and Model-C, respectively; (d) and (e) present the confusion matrix for the ensemble Model-D before and after performing DA, respectively.

Fig. 17. Performance evaluation after performing DA of the four proposed models on the CREMA-D dataset. (a), (b), and (c) show the training vs.
validation accuracy curves for Model-A, Model-B, and Model-C, respectively, trained for 500 epochs.

Fig. 18. Performance evaluation after performing DA of the four proposed models on the CREMA-D dataset. (a), (b), and (c) show the confusion matrices of Model-
A, Model-B, and Model-C, respectively; (d) and (e) present the confusion matrix for the ensemble Model-D before and after performing DA, respectively.

Table 2
Comparison of the proposed models based on the SER performance on TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D datasets.

Dataset     Model      Accuracy (Without Augmentation)    Accuracy (With Augmentation)
TESS        Model-A    96.78%                             99.05%
TESS        Model-B    96%                                98.49%
TESS        Model-C    95.68%                             98.10%
TESS        Model-D    98%                                99.46%
EMO-DB      Model-A    65.88%                             92%
EMO-DB      Model-B    64.32%                             92.21%
EMO-DB      Model-C    65.18%                             91.53%
EMO-DB      Model-D    67.74%                             95.42%
RAVDESS     Model-A    86.11%                             94.38%
RAVDESS     Model-B    88.54%                             93.61%
RAVDESS     Model-C    86.77%                             94%
RAVDESS     Model-D    89.19%                             95.62%
SAVEE       Model-A    68%                                92%
SAVEE       Model-B    65.87%                             93%
SAVEE       Model-C    68.14%                             88.28%
SAVEE       Model-D    71%                                93.22%
CREMA-D     Model-A    66.60%                             72.19%
CREMA-D     Model-B    66%                                68%
CREMA-D     Model-C    65.88%                             64.30%
CREMA-D     Model-D    68.14%                             74%

Tables 3-7 present the class-wise SER results after performing DA in terms of precision, recall, F1-score, WA-F1, and MA-F1 for
the TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D datasets. We only present the best individual model's performance in
those tables due to space constraints. The results show that the "Disgust" labeled emotion has the lowest accuracy in most cases. The
"angry," "fear," "happy," "sad," and "surprise" labeled emotions achieved the highest accuracy in almost every dataset utilized in this
study. The "boredom" class is only available in the EMO-DB dataset and achieved an accuracy of 92%. The system occasionally
misclassified the "sad" emotion as "neutral" and "happy" as "surprise" because of pitch and vocal cord similarities.
Table 3
Class-wise SER performance on the TESS dataset (Model-A).

Category    Precision   Recall   F1-score
Angry       100%        100%     100%
Disgust     97%         99%      98%
Fear        100%        100%     100%
Happy       98%         100%     99%
Neutral     100%        100%     100%
Sad         100%        100%     100%
Surprise    100%        95%      97%
WA-F1: 99.05%, MA-F1: 99%

Table 4
Class-wise SER performance on the EMO-DB dataset (Model-B).

Category    Precision   Recall   F1-score
Angry       96%         94%      95%
Boredom     92%         88%      90%
Disgust     86%         90%      88%
Fear        95%         95%      95%
Happy       83%         92%      87%
Neutral     89%         91%      90%
Sadness     97%         92%      95%
WA-F1: 92.21%, MA-F1: 91%

Table 5
Class-wise SER performance on the RAVDESS dataset (Model-A).

Category    Precision   Recall   F1-score
Angry       97%         96%      96%
Calm        97%         94%      96%
Disgust     92%         97%      94%
Fear        96%         90%      93%
Happy       92%         95%      94%
Neutral     90%         90%      90%
Sad         93%         93%      93%
Surprise    96%         97%      97%
WA-F1: 94.38%, MA-F1: 94%

Table 6
Class-wise SER performance on the SAVEE dataset (Model-B).

Category    Precision   Recall   F1-score
Angry       98%         96%      97%
Disgust     91%         88%      89%
Fear        90%         88%      89%
Happy       95%         97%      96%
Neutral     92%         96%      94%
Sad         93%         89%      91%
Surprise    90%         92%      91%
WA-F1: 93%, MA-F1: 93%

Table 7
Class-wise SER performance on the CREMA-D dataset (Model-A).

Category    Precision   Recall   F1-score
Angry       84%         83%      83%
Disgust     66%         66%      66%
Fear        72%         66%      69%
Happy       70%         77%      73%
Neutral     67%         69%      68%
Sad         71%         68%      70%
WA-F1: 72.19%, MA-F1: 72%

5. Comparative analysis with other methods
There has already been extensive research in the field of SER. However, comparing performance is difficult because only a few studies
performed DA on those datasets for SER [67,71,72], using generative adversarial networks (GAN), a multi-window method, or
added generative noise. We have identified that augmenting data is necessary for the SAVEE, EMO-DB, and RAVDESS
datasets because these datasets are too small for the proper training of a DL-based model. Besides that, only
a few studies have utilized the TESS and CREMA-D datasets for this task. Some of the existing articles [14,40,52] use only a
subset of those datasets; some perform feature extraction from the audio, text, and video samples of those datasets [53,73]. Our scope
in this research is only audio samples. Some [29,74,75] evaluate their framework's performance using a different metric from ours,
such as unweighted average accuracy or recall [31,76], and some choose a questionable training and testing split ratio of 90:10 [18]; therefore,
we only compare with those articles that match our criteria.
In Tables 8-12, we present the performance comparison of our work with previous work on the TESS, EMO-DB,
RAVDESS, SAVEE, and CREMA-D datasets, respectively. Table 13 presents a comparison of our work with articles adopting
different DA methods to increase SER accuracy on the EMO-DB, RAVDESS, and SAVEE datasets. The comparison clearly shows
that this study's DA approach uses a smaller feature dimension than other methods while providing better results. Among all four
models, the ensemble Model-D performed best in terms of SER accuracy on all the datasets. All three proposed individual models (Model-A, B,
and C) are lightweight, occupying only around 20 to 35 MB of memory. The individual models' excellent performance in detecting
emotion from speech across all five datasets and the adjustment of proper weights for each model in the ensemble prediction contribute
mainly to the improved recognition rate of the ensemble Model-D.
Table 8
Performance comparison of this work with previous literature on the TESS dataset.

Reference    Methodology                           Features                                     Accuracy
[77]         DCNN                                  MFCC                                         55.71%
[78]         1D-CNN                                MSFB-Cepstral Coefficients                   95.79%
[23]         2D CNN with Global average pooling    LMS                                          62%
This work    1D CNNs-FCNs                          MFCC, LMS, ZCR, Chromagram and RMSE value    99.05%
This work    1D CNNs-LSTM-FCNs                     MFCC, LMS, ZCR, Chromagram and RMSE value    98.49%
This work    1D CNNs-GRU-FCNs                      MFCC, LMS, ZCR, Chromagram and RMSE value    98.10%
This work    1D CNNs-LSTM-GRU-ENSEMBLE             MFCC, LMS, ZCR, Chromagram and RMSE value    99.46%

Table 9
Performance comparison of this work with previous literature on the EMO-DB dataset.

Reference    Methodology                      Features                                              Accuracy
[13]         1D CNN                           MFCC, LMS, Chromagram, Spectral contrast, Tonnetz     86.10%
[72]         DNN                              ZCR, RMS energy, MFCC and statistical features        82.73%
[28]         1D CNN, Bi-LSTM                  Acoustic features                                     94%
[27]         1D-2D DCNN, LSTM                 Spectral features                                     95.33%
[15]         2D CNN                           LMS                                                   92.02%
[32]         Bi-LSTM                          LMS                                                   85.57%
[79]         DCNN, SVM, MLP                   LMS                                                   95.10%
[12]         ANN                              MFCCs, Delta, Delta-Deltas                            87.80%
[67]         DNN, SVM, GAN, Autoencoder       MFCC, ZCR, RMS                                        84.49%
This work    1D CNNs-FCNs                     MFCC, LMS, ZCR, Chromagram and RMSE value             92%
This work    1D CNNs-LSTM-FCNs                MFCC, LMS, ZCR, Chromagram and RMSE value             92.21%
This work    1D CNNs-GRU-FCNs                 MFCC, LMS, ZCR, Chromagram and RMSE value             91.53%
This work    1D CNNs-LSTM-GRU-ENSEMBLE        MFCC, LMS, ZCR, Chromagram and RMSE value             95.42%

Table 10
Performance comparison of this work with previous literature on the RAVDESS dataset.

Reference    Methodology                      Features                                              Accuracy
[13]         1D CNN                           MFCC, LMS, Chromagram, Spectral contrast, Tonnetz     71.61%
[28]         1D CNN, Bi-LSTM                  Acoustic features                                     73%
[77]         DCNN                             MFCC                                                  75.83%
[79]         DCNN, SVM, MLP                   LMS                                                   81.30%
[71]         CNN                              MFCC, Chromagram and Time-domain features             88%
[12]         ANN                              MFCCs, Delta, Delta-Deltas                            82.3%
[32]         Bi-LSTM                          LMS                                                   77.02%
[51]         1D CNN                           Not mentioned                                         80%
This work    1D CNNs-FCNs                     MFCC, LMS, ZCR, Chromagram and RMSE value             94.38%
This work    1D CNNs-LSTM-FCNs                MFCC, LMS, ZCR, Chromagram and RMSE value             93.61%
This work    1D CNNs-GRU-FCNs                 MFCC, LMS, ZCR, Chromagram and RMSE value             94%
This work    1D CNNs-LSTM-GRU-ENSEMBLE        MFCC, LMS, ZCR, Chromagram and RMSE value             95.62%

Table 11
Performance comparison of this work with previous literature on the SAVEE dataset.

Reference    Methodology                      Features                                              Accuracy
[40]         3D CNN                           LMS                                                   81.05%
[79]         DCNN, SVM, MLP                   LMS                                                   82.10%
[71]         CNN                              MFCC, Chromagram and Time-domain features             70%
[77]         DCNN                             MFCC                                                  65.83%
[22]         GA, PCA, LLD                     MFCC                                                  76.40%
This work    1D CNNs-FCNs                     MFCC, LMS, ZCR, Chromagram and RMSE value             92%
This work    1D CNNs-LSTM-FCNs                MFCC, LMS, ZCR, Chromagram and RMSE value             93%
This work    1D CNNs-GRU-FCNs                 MFCC, LMS, ZCR, Chromagram and RMSE value             88.28%
This work    1D CNNs-LSTM-GRU-ENSEMBLE        MFCC, LMS, ZCR, Chromagram and RMSE value             93.22%

Table 12
Performance comparison of this work with previous literature on the CREMA-D dataset.

Reference    Methodology                      Features                                              Accuracy
[77]         DCNN                             MFCC                                                  65.77%
[54]         SVM                              MFCC, ZCR, RMSE                                       58.22%
This work    1D CNNs-FCNs                     MFCC, LMS, ZCR, Chromagram and RMSE value             72.19%
This work    1D CNNs-LSTM-FCNs                MFCC, LMS, ZCR, Chromagram and RMSE value             68%
This work    1D CNNs-GRU-FCNs                 MFCC, LMS, ZCR, Chromagram and RMSE value             64.30%
This work    1D CNNs-LSTM-GRU-ENSEMBLE        MFCC, LMS, ZCR, Chromagram and RMSE value             74.09%

Table 13
Performance comparison of this work with previous literature adopting different DA techniques on the aforementioned datasets.

Reference    Methodology                      Dataset     Accuracy    Features                                          Feature Dimension    DA method
[72]         DNN                              EMO-DB      76.77%      ZCR, RMS energy, MFCC and statistical features    6552                 Generative noise model
[71]         CNN                              SAVEE       70%         MFCC, Chromagram and Time-domain features         34                   Multi-window based DA method
[71]         CNN                              RAVDESS     88%         MFCC, Chromagram and Time-domain features         34                   Multi-window based DA method
[67]         DNN, SVM, GAN, Autoencoder       EMO-DB      84.49%      MFCC, ZCR, RMS                                    4368                 Adversarial DA network

This work (Features: MFCC, LMS, ZCR, Chromagram and RMSE value; Feature Dimension: 155; DA method: injecting AWGN, stretching the speech audio files, and modification of the pitch of the sound):
             1D CNNs-FCNs                     EMO-DB      92%
             1D CNNs-LSTM-FCNs                EMO-DB      92.21%
             1D CNNs-GRU-FCNs                 EMO-DB      91.53%
             1D CNNs-LSTM-GRU-ENSEMBLE        EMO-DB      95.42%
             1D CNNs-FCNs                     SAVEE       92%
             1D CNNs-LSTM-FCNs                SAVEE       93%
             1D CNNs-GRU-FCNs                 SAVEE       88.28%
             1D CNNs-LSTM-GRU-ENSEMBLE        SAVEE       93.22%
             1D CNNs-FCNs                     RAVDESS     94.38%
             1D CNNs-LSTM-FCNs                RAVDESS     93.61%
             1D CNNs-GRU-FCNs                 RAVDESS     94%
             1D CNNs-LSTM-GRU-ENSEMBLE        RAVDESS     95.62%

6. Discussion and Conclusion


Inadequate data can prevent any DL model from achieving its full potential, which is a significant issue in the deep-NN-
based SER task. A lack of data samples often leads a complex DL model to overfit. This paper presents a comprehensive study
of different DL model-based SER systems utilizing five datasets, covering two languages: English and German. We
perform DA to increase the training samples and make robust and generalized models. Data are augmented by injecting AWGN,
stretching the sound, and modifying the pitch of the sound sample. We have handcrafted five types of low-level descriptor features
from each audio file. Those features work as the input for the baseline 1D CNN model. We have designed multiple LFABs inside
the 1D CNNs-FCNs baseline model to learn local hidden features of the speech samples. We have also added a GFAB block
containing the LSTM algorithm (Model-B) and the GRU algorithm (Model-C) that learns the long-term contextual representations
and correlations from those local features learned through LFABs. To the best of our knowledge, this is the first effort where the
efficacies of all the proposed models are assessed on five standard benchmark datasets.
A comprehensive comparison of the performance of the proposed models with and without DA is also provided. The results of
all the models are satisfactory, and a detailed comparison with the existing literature shows that we achieve state-of-the-art
results in SER. Although there have been steady advancements in methods, features, and reported accuracy in SER, many
limitations must still be addressed before an effective, industry-grade SER scheme is possible. The majority of the datasets are scripted
and cover only a few discrete statements and expressions throughout the corpus. Moreover, in many cases, the sample size is
insufficient to train a DL model adequately. The recordings are simulated or semi-natural: they contain little noise and are
far removed from real-world acoustic conditions. This raises questions about whether a system built on these datasets can
detect the correct emotion in a real-world scenario.
Even though the proposed models perform exceptionally well in SER across five datasets, we believe that further analysis of this subject is
necessary. Detecting the emotions of multiple actors in a dialogue from continuous speech audio is an area that needs
further research. Another direction worth exploring is the use of transformers to build a language-aware DL
framework that adapts to the spoken language for the SER task. In the future, the use of GANs to generate additional synthetic data, as well as
feature reduction, may also be investigated. The following GitHub repository contains all the Python notebooks that implement the proposed
models: https://github.com/simanto99/ser-ensemble

References

[1] E. Hudlicka, To feel or not to feel: The role of affect in human-computer interaction, Int. J. Hum. Comput. Stud. 59 (2003) 1–32.
https://doi.org/10.1016/S1071-5819(03)00047-8.
[2] Mustaqeem, S. Kwon, A CNN-assisted enhanced audio signal processing for speech emotion recognition, Sensors (Switzerland). 20 (2020).
https://doi.org/10.3390/s20010183.
[3] J. Chatterjee, V. Mukesh, H.H. Hsu, G. Vyas, Z. Liu, Speech emotion recognition using cross-correlation and acoustic features, in: Proc. - IEEE 16th Int.
Conf. Dependable, Auton. Secur. Comput. IEEE 16th Int. Conf. Pervasive Intell. Comput. IEEE 4th Int. Conf. Big Data Intell. Comput. IEEE 3, IEEE,
2018: pp. 250–255. https://doi.org/10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.00050.
[4] B.J. Abbaschian, D. Sierra-Sosa, A. Elmaghraby, Deep learning techniques for speech emotion recognition, from databases to models, Sensors
(Switzerland). 21 (2021) 1–27. https://doi.org/10.3390/s21041249.
[5] R.A. Khalil, E. Jones, M.I. Babar, T. Jan, M.H. Zafar, T. Alhussain, Speech Emotion Recognition Using Deep Learning Techniques: A Review, IEEE
Access. 7 (2019) 117327–117345. https://doi.org/10.1109/ACCESS.2019.2936124.
[6] J. Lyons, MFCC, (2020). http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/ (accessed
April 30, 2021).
[7] K.H. Hyun, E.H. Kim, Y.K. Kwak, Robust speech emotion recognition using log frequency power ratio, in: 2006 SICE-ICASE Int. Jt. Conf., 2006: pp.
2586–2589. https://doi.org/10.1109/SICE.2006.314794.
[8] M.A. Yusnita, A.M. Hafiz, M.N. Fadzilah, A.Z. Zulhanip, M. Idris, Automatic gender recognition using linear prediction coefficients and artificial neural
network on speech signal, Proc. - 7th IEEE Int. Conf. Control Syst. Comput. Eng. ICCSCE 2017, 2017-November (2018) 372–377.
https://doi.org/10.1109/ICCSCE.2017.8284437.
[9] H.A. Patil, S. Viswanath, Effectiveness of Teager energy operator for epoch detection from speech signals, Int. J. Speech Technol. 14 (2011) 321–337.
https://doi.org/10.1007/s10772-011-9110-8.
[10] M. El Ayadi, M.S. Kamel, F. Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognit. 44 (2011)
572–587. https://doi.org/10.1016/j.patcog.2010.09.020.
[11] G.K. Liu, Evaluating gammatone frequency cepstral coefficients with neural networks for emotion recognition from speech, ArXiv. (2018) 2–6.
[12] P. Nantasri, E. Phaisangittisagul, J. Karnjana, S. Boonkla, S. Keerativittayanun, A. Rugchatjaroen, S. Usanavasin, T. Shinozaki, A Light-Weight Artificial
Neural Network for Speech Emotion Recognition using Average Values of MFCCs and Their Derivatives, 17th Int. Conf. Electr. Eng. Comput.
Telecommun. Inf. Technol. ECTI-CON 2020. (2020) 41–44. https://doi.org/10.1109/ECTI-CON49241.2020.9158221.
[13] D. Issa, M. Fatih Demirci, A. Yazici, Speech emotion recognition with deep convolutional neural networks, Biomed. Signal Process. Control. 59 (2020)
101894. https://doi.org/10.1016/j.bspc.2020.101894.
[14] S. Demircan, H. Kahramanli, Application of fuzzy C-means clustering algorithm to spectral features for emotion classification from speech, Neural Comput.
Appl. 29 (2018) 59–66. https://doi.org/10.1007/s00521-016-2712-y.
[15] T. Anvarjon, Mustaqeem, S. Kwon, Deep-net: A lightweight cnn-based speech emotion recognition system using deep frequency features, Sensors
(Switzerland). 20 (2020) 1–16. https://doi.org/10.3390/s20185212.
[16] G.K. Birajdar, M.D. Patil, Speech/music classification using visual and spectral chromagram features, J. Ambient Intell. Humaniz. Comput. 11 (2020) 329–
347. https://doi.org/10.1007/s12652-019-01303-4.
[17] M. Ghai, S. Lal, S. Duggal, S. Manik, Emotion recognition on speech signals using machine learning, in: Proc. 2017 Int. Conf. Big Data Anal. Comput.
Intell. ICBDACI 2017, 2017: pp. 34–39. https://doi.org/10.1109/ICBDACI.2017.8070805.
[18] A. Bhavan, P. Chauhan, Hitkul, R.R. Shah, Bagged support vector machines for emotion recognition from speech, Knowledge-Based Syst. 184 (2019)
104886. https://doi.org/10.1016/j.knosys.2019.104886.
[19] W.Q. Zheng, J.S. Yu, Y.X. Zou, An experimental study of speech emotion recognition based on deep convolutional neural networks, 2015 Int. Conf. Affect.
Comput. Intell. Interact. ACII 2015. (2015) 827–831. https://doi.org/10.1109/ACII.2015.7344669.
[20] A. Iqbal, K. Barua, A Real-time Emotion Recognition from Speech using Gradient Boosting, 2nd Int. Conf. Electr. Comput. Commun. Eng. ECCE 2019.
(2019) 1–5. https://doi.org/10.1109/ECACE.2019.8679271.
[21] Z. Esmaileyan, H. Marvi, A database for automatic persian Speech Emotion Recognition: Collection, processing and evaluation, Int. J. Eng. Trans. A Basics.
27 (2014) 79–90. https://doi.org/10.5829/idosi.ije.2014.27.01a.11.
[22] Z.T. Liu, Q. Xie, M. Wu, W.H. Cao, Y. Mei, J.W. Mao, Speech emotion recognition based on an improved brain emotion learning model, Neurocomputing.
309 (2018) 145–156. https://doi.org/10.1016/j.neucom.2018.05.005.
[23] K. Venkataramanan, H.R. Rajamohan, Emotion Recognition from Speech, ArXiv:1912.10458v1. (2019). https://doi.org/10.1007/978-3-319-02732-
6_7.
[24] T.L. Nwe, S.W. Foo, L.C. De Silva, Speech emotion recognition using hidden Markov models, Speech Commun. 41 (2003) 603–623.
https://doi.org/10.1016/S0167-6393(03)00099-2.
[25] H. Hu, M.X. Xu, W. Wu, GMM supervector based SVM with spectral features for speech emotion recognition, in: ICASSP, IEEE Int. Conf. Acoust. Speech
Signal Process. - Proc., 2007: pp. 1–5. https://doi.org/10.1109/ICASSP.2007.366937.
[26] Mustaqeem, S. Kwon, 1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features, Comput. Mater. Contin. 67
(2021) 4039–4059. https://doi.org/10.32604/cmc.2021.015070.
[27] J. Zhao, X. Mao, L. Chen, Speech emotion recognition using deep 1D & 2D CNN LSTM networks, Biomed. Signal Process. Control. 47 (2019) 312–323.
https://doi.org/10.1016/j.bspc.2018.08.035.
[28] A. Yadav, Di.K. Vishwakarma, A Multilingual Framework of CNN and Bi-LSTM for Emotion Classification, in: 2020 11th Int. Conf. Comput. Commun.
Netw. Technol. ICCCNT 2020, 2020. https://doi.org/10.1109/ICCCNT49239.2020.9225614.
[29] Mustaqeem, S. Kwon, MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach, Expert Syst. Appl. 167
(2021) 114177. https://doi.org/10.1016/j.eswa.2020.114177.
[30] W. Lim, D. Jang, T. Lee, Speech Emotion Recognition using Convolutional Recurrent Neural Networks and Spectrograms, in: Asia-Pacific Signal Inf.
Process. Assoc. Annu. Summit Conf., 2017. https://doi.org/10.1109/APSIPA.2016.7820699.
[31] H. Meng, T. Yan, F. Yuan, H. Wei, Speech Emotion Recognition from 3D Log-Mel Spectrograms with Deep Learning Network, IEEE Access. 7 (2019)
125868–125881. https://doi.org/10.1109/ACCESS.2019.2938007.
[32] Mustaqeem, M. Sajjad, S. Kwon, Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM, IEEE Access. 8
(2020) 79861–79875. https://doi.org/10.1109/ACCESS.2020.2990405.
[33] M.K. Pichora-Fuller, K. Dupuis, Toronto emotional speech set (TESS), (2020). https://doi.org/10.5683/SP2/E8H2MF.
[34] S. Livingstone, F. Russo, The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), PLoS One. 13 (2018).
https://doi.org/10.5281/zenodo.1188976.
[35] P. Jackson, S. Haq, SAVEE DATASET, (2014). http://kahlan.eps.surrey.ac.uk/savee/Database.html.
[36] H. Cao, D.G. Cooper, M.K. Keutmann, R.C. Gur, A. Nenkova, R. Verma, CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset, IEEE Trans.
Affect. Comput. 5 (2014) 377–390. https://doi.org/10.1109/TAFFC.2014.2336244.
[37] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss, A database of German emotional speech, in: 9th Eur. Conf. Speech Commun. Technol.,
2005: pp. 1517–1520.
[38] K. Zeng, J. Huang, M. Dong, White Gaussian noise energy estimation and wavelet multi-threshold de-noising for heart sound signals, Circuits, Syst. Signal
Process. 33 (2014) 2987–3002. https://doi.org/10.1007/s00034-014-9784-7.
[39] W.S. Lee, Y.W. Roh, D.J. Kim, J.H. Kim, K.S. Hong, Speech emotion recognition using spectral entropy, in: Lect. Notes Comput. Sci. (Including Subser.
Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 2008: pp. 45–54. https://doi.org/10.1007/978-3-540-88518-4_6.
[40] N. Hajarolasvadi, H. Demirel, 3D CNN-based speech emotion recognition using k-means clustering and spectrograms, Entropy. 21 (2019).
https://doi.org/10.3390/e21050479.
[41] A.M. Badshah, J. Ahmad, N. Rahim, S.W. Baik, Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network, in: 2017 Int.
Conf. Platf. Technol. Serv. PlatCon 2017 - Proc., 2017. https://doi.org/10.1109/PlatCon.2017.7883728.
[42] S. Zhang, X. Zhao, Q. Tian, Spontaneous Speech Emotion Recognition Using Multiscale Deep Convolutional LSTM, IEEE Trans. Affect. Comput. PP
(2019) 1. https://doi.org/10.1109/TAFFC.2019.2947464.
[43] A. Christy, S. Vaithyasubramanian, A. Jesudoss, M.D.A. Praveena, Multimodal speech emotion recognition and classification using convolutional neural
network techniques, Int. J. Speech Technol. 23 (2020) 381–388. https://doi.org/10.1007/s10772-020-09713-y.
[44] Q. Gu, Z. Li, J. Han, Generalized fisher score for feature selection, Proc. 27th Conf. Uncertain. Artif. Intell. UAI 2011. (2011) 266–273.
[45] Z.T. Liu, M. Wu, W.H. Cao, J.W. Mao, J.P. Xu, G.Z. Tan, Speech emotion recognition based on feature selection and extreme learning machine decision
tree, Neurocomputing. 273 (2018) 271–280. https://doi.org/10.1016/j.neucom.2017.07.050.
[46] H. Abdi, L.J. Williams, Principal component analysis, Wiley Interdiscip. Rev. Comput. Stat. 2 (2010) 433–459. https://doi.org/10.1002/wics.101.
[47] A. Batliner, S. Steidl, E. Nöth, Releasing a thoroughly annotated and processed spontaneous emotional database: the FAU Aibo Emotion Corpus., Proc.
Work. Corpora Res. Emot. Affect Lr. (2008) 28–31.
[48] H.M. Fayek, M. Lech, L. Cavedon, Evaluating deep learning architectures for Speech Emotion Recognition, Neural Networks. 92 (2017) 60–68.
https://doi.org/10.1016/j.neunet.2017.02.013.
[49] C. Busso, M. Bulut, C.C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, S.S. Narayanan, IEMOCAP: Interactive emotional dyadic motion
capture database, Lang. Resour. Eval. 42 (2008) 335–359. https://doi.org/10.1007/s10579-008-9076-6.
[50] T.N. Sainath, O. Vinyals, A. Senior, H. Sak, Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks, in: ICASSP, IEEE Int.
Conf. Acoust. Speech Signal Process. - Proc., 2015: pp. 4580–4584. https://doi.org/10.1109/ICASSP.2015.7178838.
[51] Mustaqeem, S. Kwon, CLSTM: Deep feature-based speech emotion recognition using the hierarchical convlstm network, Mathematics. 8 (2020) 1–19.
https://doi.org/10.3390/math8122133.
[52] S. Li, X. Xing, W. Fan, B. Cai, P. Fordson, X. Xu, Spatiotemporal and frequential cascaded attention networks for speech emotion recognition,
Neurocomputing. 448 (2021) 238–248. https://doi.org/10.1016/j.neucom.2021.02.094.
[53] S. Yoon, S. Byun, K. Jung, Multimodal Speech Emotion Recognition Using Audio and Text, in: 2018 IEEE Spok. Lang. Technol. Work., IEEE, 2018: pp.
112–118. https://doi.org/10.1109/SLT.2018.8639583.
[54] R. Singh, H. Puri, N. Aggarwal, V. Gupta, An Efficient Language-Independent Acoustic Emotion Classification System, Arab. J. Sci. Eng. 45 (2020) 3111–
3121. https://doi.org/10.1007/s13369-019-04293-9.
[55] B. Zhang, E.M. Provost, G. Essl, Cross-Corpus Acoustic Emotion Recognition From Singing And Speaking: A Multi-Task Learning Approach, in: 2016
IEEE Int. Conf. Acoust. Speech Signal Process., IEEE, 2016: pp. 5805–5809. https://ieeexplore.ieee.org/document/7472790.
[56] K. Wang, N. An, B.N. Li, Y. Zhang, L. Li, Speech emotion recognition using Fourier parameters, IEEE Trans. Affect. Comput. 6 (2015) 69–75.
https://doi.org/10.1109/TAFFC.2015.2392101.
[57] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: ACM Int. Conf. Proceeding Ser., 2012: pp.
1–9. https://doi.org/10.1145/3383972.3383975.
[58] K. Zvarevashe, O. Olugbara, Ensemble Learning of Hybrid Acoustic Features for Speech Emotion Recognition, Algorithms. 13 (2020) 1–24.
https://doi.org/10.26782/jmcms.2020.09.00016.
[59] C. Zheng, C. Wang, N. Jia, An ensemble model for multi-level speech emotion recognition, Appl. Sci. 10 (2020). https://doi.org/10.3390/app10010205.
[60] S.B. Davis, P. Mermelstein, Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, IEEE
Trans. Acoust. ASSP-28 (1980) 357–366. https://doi.org/10.1016/b978-0-08-051584-7.50010-3.
[61] L. Abdel-Hamid, Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features, Speech Commun. 122 (2020) 19–30.
https://doi.org/10.1016/j.specom.2020.04.005.
[62] F. Noroozi, T. Sapiński, D. Kamińska, G. Anbarjafari, Vocal-based emotion recognition using random forests and decision tree, Int. J. Speech Technol. 20
(2017) 239–246. https://doi.org/10.1007/s10772-017-9396-2.
[63] R.G. Bachu, S. Kopparthi, B. Adapa, B.D. Barkana, Separation of Voiced and Unvoiced using Zero crossing rate and Energy of the Speech Signal, in: Am.
Soc. Eng. Educ. Zo. Conf. Proc., 2008: pp. 1–7. https://doi.org/10.1007/978-90-481-3660-5_47.
[64] E. Widiyanti, S.N. Endah, Feature Selection for Music Emotion Recognition, 2018 2nd Int. Conf. Informatics Comput. Sci. ICICoS 2018. (2018) 120–124.
https://doi.org/10.1109/ICICOS.2018.8621783.
[65] C. Xu, N.C. Maddage, X. Shao, F. Cao, Q. Tian, Musical genre classification using support vector machines, ICASSP, IEEE Int. Conf. Acoust. Speech
Signal Process. - Proc. 5 (2003) 429–432. https://doi.org/10.1109/icassp.2003.1199998.
[66] A. Satt, S. Rozenberg, R. Hoory, Efficient emotion recognition from speech using deep learning on spectrograms, in: Proc. Annu. Conf. Int. Speech
Commun. Assoc. INTERSPEECH, 2017: pp. 1089–1093. https://doi.org/10.21437/Interspeech.2017-200.
[67] L. Yi, M. Mak, Improving Speech Emotion Recognition With Adversarial Data Augmentation Network, IEEE Trans. Neural Networks Learn. Syst. (2020) 1–13.
[68] F. Chollet, Keras: The python deep learning library, (2018).
[69] P. Liashchynskyi, P. Liashchynskyi, Grid Search, Random Search, Genetic Algorithm: A Big Comparison for NAS, ArXiv. (2019) 1–11.
[70] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Inf. Process. Manag. 45 (2009) 427–437.
https://doi.org/10.1016/j.ipm.2009.03.002.
[71] S. Padi, D. Manocha, R.D. Sriram, Multi-Window Data Augmentation Approach for Speech Emotion Recognition, ArXiv Prepr. ArXiv2010.09895. (2020).
http://arxiv.org/abs/2010.09895.
[72] U. Tiwari, M. Soni, R. Chakraborty, A. Panda, S.K. Kopparapu, Multi-Conditioning and Data Augmentation Using Generative Noise Model for Speech
Emotion Recognition in Noisy Conditions, ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc. 2020-May (2020) 7194–7198.
https://doi.org/10.1109/ICASSP40776.2020.9053581.
[73] N.C. Ristea, L.C. Dutu, A. Radoi, Emotion recognition system from speech and visual information based on convolutional neural networks, in: 2019 10th
Int. Conf. Speech Technol. Human-Computer Dialogue, SpeD 2019, 2019. https://doi.org/10.1109/SPED.2019.8906538.
[74] M. Chen, X. He, J. Yang, H. Zhang, 3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition, IEEE Signal
Process. Lett. 25 (2018) 1440–1444. https://doi.org/10.1109/LSP.2018.2860246.
[75] J. Kim, G. Englebienne, K.P. Truong, V. Evers, Towards speech emotion recognition “in the wild” using aggregated corpora and deep multi-task learning,
ArXiv Prepr. 2017-Augus (2017) 1113–1117. https://doi.org/10.21437/Interspeech.2017-736.
[76] K. Feng, T. Chaspari, A siamese neural network with modified distance loss for transfer learning in speech emotion recognition, ArXiv Prepr.
ArXiv2006.03001. (2020).
[77] S. Mekruksavanich, A. Jitpattanakul, N. Hnoohom, Negative Emotion Recognition using Deep Learning for Thai Language, in: 2020 Jt. Int. Conf. Digit.
Arts, Media Technol. with ECTI North. Sect. Conf. Electr. Electron. Comput. Telecommun. Eng. ECTI DAMT NCON 2020, 2020: pp. 71–74.
https://doi.org/10.1109/ECTIDAMTNCON48261.2020.9090768.
[78] R. Chatterjee, S. Mazumdar, R.S. Sherratt, R. Halder, T. Maitra, D. Giri, Real-time speech emotion analysis for smart home assistants, IEEE Trans. Consum.
Electron. 67 (2021) 68–76. https://doi.org/10.1109/TCE.2021.3056421.
[79] M. Farooq, F. Hussain, N.K. Baloch, F.R. Raja, H. Yu, Y. Bin Zikria, Impact of feature selection algorithm on speech emotion recognition using deep
convolutional neural network, Sensors (Switzerland). 20 (2020) 1–18. https://doi.org/10.3390/s20216008.
