
Emotion and Depression Detection

from Speech

Yash Deshpande, Shreya Patel, Meghan Lendhe, Manpreet Chavan,


and Reeta Koshy

Abstract Human speech conveys information about the underlying emotional state of the interlocutor, so detecting emotions by analyzing speech is important for identifying a subject's emotional state. Numerous features extracted from human speech are used by convolutional neural network (CNN) and support vector machine (SVM) techniques to detect the emotions associated with the speech, such as anger, happiness, fear, sadness, surprise and neutral. Prolonged sadness is considered a precursor of depression, so monitoring a subject's speech over a period of time helps in detecting clinical depression. Databases covering different accents of English are used to ensure that the system accommodates multiple accents. Emotion and depression detection have applications in fields such as lie detection, the military, counseling and database access systems.

Keywords Emotion detection using machine learning · Depression detection · SVM · Supervised neural networks · Mel frequency cepstral coefficients · Health informatics · Spectral feature extraction

Y. Deshpande (B) · S. Patel · M. Lendhe · M. Chavan · R. Koshy
Sardar Patel Institute of Technology, Mumbai, India
e-mail: yash.deshpande@spit.ac.in
S. Patel
e-mail: shreya.patel@spit.ac.in
M. Lendhe
e-mail: meghan.lendhe@spit.ac.in
M. Chavan
e-mail: manpreet.chavan@spit.ac.in
R. Koshy
e-mail: reeta_koshy@spit.ac.in


1 Introduction

Sentiment analysis is one of the most useful and significant applications of machine learning and is used extensively for understanding the sentiments (emotions) expressed in tweets or online comments. Research shows that a person's voice contains information reflecting that person's mental state and hence can be used as an indicator for detecting depression and its severity [1]. Speech features also do not vary as widely as video or text features, which makes them easier to record and process. Detecting depression (and thereafter seeking therapy) can help subjects improve their mental health and become more productive. Emotion detection from speech is a relatively unexplored area with many interesting possibilities and a plethora of potential applications in the near future, especially in the healthcare sector for easy detection of clinical depression. Many customer services can use speech emotion detection to provide better service; for example, automated call-center systems can detect angry callers and prioritize them accordingly.
Mental health is a rising matter of concern among the younger generation, and mental well-being is as important as physical well-being. According to a report by the World Health Organization, more than 264 million people suffer from depression worldwide. As depression has become a common cause of suicide, early detection and treatment are important. But many people suffering from mental disorders do not open up easily because of the stigma attached to mental health. Currently, most cases of depression are assessed through subjective evaluation of the symptoms by a clinical expert. This method is labor-intensive and time-consuming; the evaluation also varies across professionals and is expensive [1]. Detection of depression from speech can facilitate the diagnosis and treatment of patients, and machine learning provides an easy and reliable pre-screening tool for detecting the severity of depression from speech. Chatbots can also be made more humane by evoking appropriate responses that consider the mood and emotion of the user. Thus, emotion detection from speech has a lot of potential for use in interactive systems and for providing better customer service.

2 Literature Survey

Emotion detection is not restricted to tweets or online comments; speech also has enormous scope for its application. Emotions affect human experience, especially leadership skills, in a big way and have a very significant influence on one's life [2]. Identification of emotion from speech signals can aid professionals in counseling students, ensuring the mental stability of employees [3] and helping new businesses grow [4]. Feature extraction is one of the most important steps in emotion detection [5].

For detecting emotions, algorithms use prosodic and spectral features. Pitch and inflection are examples of prosodic features [2]. Spectral features try to decipher the frequency content of speech. Mel-frequency cepstral coefficients (MFCCs) represent audio signals in the frequency domain and help in studying how speech is shaped by the vocal tract. The continuous speech signal is split into short pseudo-stationary frames that can be treated as discrete segments. MFCCs space frequency bands logarithmically, which matches human speech perception more closely than the linearly spaced frequency bands of plain DCTs or FFTs [6].
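For reference, the logarithmic band spacing follows the mel scale, whose commonly cited mapping from a frequency f (in Hz) to mels is

m(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right),

so equal steps in m correspond to progressively wider frequency bands at higher frequencies, mirroring human pitch perception.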
For emotion detection, MFCCs show better performance on well-researched databases such as the Berlin Database of Emotional Speech (emoDB) than other coefficients [4]. Low-level descriptors can be extracted in various domains such as time, space, etc. However, global descriptors prove more useful than frame-level (windowed) descriptors; they are obtained by aggregating the low-level descriptors into feature vectors.
Classification models can be built using linear discriminant analysis (LDA) and support vector machines. For the emoDB dataset, an SVM classifier performed well in one-vs-one classification for most emotions, except anger and sadness, which have similar acoustic signatures. The LDA classifier was not particularly accurate due to the sparse amount of data. On the other hand, on the RED database accuracies of 73.3% and 71.8% were obtained for SVM and LDA, respectively. Overall, SVM was better at both multi-class and binary classification. Sonawane et al. [7] used MFCCs for feature extraction and SVM for classification; the SVM could only classify fixed-length data, not variable-length data. Both linear and nonlinear multiple-SVM techniques were used for emotion classification, with better performance obtained using the nonlinear multiple SVM. As a classifier, the multiple SVM suffered on a training set containing multiple languages and accents when using MFCC speech features [7].
Harár et al. [8] used a deep neural network (DNN)-based approach for emotion detection. EmoDB was used, containing 271 labeled recordings but only for three emotions, viz. anger, neutral and sad. The DNN architecture had only fully connected layers (one of size 480 and another of size 240) and a final softmax layer for the output. The stochastic gradient descent algorithm was used for training, with input data fed in batches, and the accuracy obtained with this approach was 96.97%.
On the Berlin Database of Emotional Speech (BDES), different classifiers paired with different acoustic features showed variations in the emotions detected. The acoustic features used can be linear prediction cepstral coefficients or Mel-frequency cepstral coefficients. A support vector machine (SVM) classifier showed better accuracy than a GMM on BDES when MFCCs were used as speech features [9]. Parameters of convolutional neural network (CNN) and long short-term memory (LSTM) layers were trained on the interactive emotional dyadic motion capture (IEMOCAP) database; the pace at which the two models trained differed significantly [10]. Gradient boosting on The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and Surrey Audio-Visual Expressed Emotion (SAVEE) datasets showed higher accuracy in detecting emotions on SAVEE than on RAVDESS. On live data, i.e., spontaneous, non-acted, genuine speech samples from subjects, gradient boosting showed better results than both K-Nearest Neighbors (KNN) and SVM [11]. It was also evident that the pitch feature and the MFCC feature yielded different accuracies on the BDES dataset with an HMM classifier [9]. When audio segments were classified independently on BDES using a deep neural network (DNN) trained with stochastic gradient descent, the DNN had no knowledge of the actual context of the speech samples it was supposed to detect emotions from [8].
There exists a strong correlation between speech features and the presence or absence of depression. This was studied by Sahu et al. [12] using the Mundt database. In comparing sustained vowel utterances and free-flowing speech, attributes like jitter, shimmer and degree of breathiness were considered to extract MFCC features and train an SVM model for classification. The average magnitude difference function (AMDF) was used to quantify these parameters, and it was observed that shimmer, jitter and breathiness were high for a depressed person. Mari et al. [9] also found that jitter, shimmer and degree of breathiness computed from the AMDF on the Mundt database, evaluated against Hamilton depression scores, showed that dip profiles contain important information about the state of a depressed person.
When speakers decide to say something, a stream of information is processed in their brain; thus, their thoughts, feelings and sense of well-being affect their speech. Apart from spectral features, depression affects not only semantic features (the content of the information) but also syntactic features (its structure). Mitra et al. [13] found that spontaneous speech is better than read speech for depression detection, which challenges traditional clinical data collection methods.
Chlasta et al. [14] used the Distress Analysis Interview Corpus (DAIC) dataset with several pre-trained CNN architectures, replacing the last layer of each model to match the dimensions of their dataset. By fine-tuning the parameters, they built a strong classification model with two classes, depressed and non-depressed.
Tasnim et al. [15] attempted to develop a system that captures a user's voice and analyzes it to detect depression severity. They proposed a comparative study for distinguishing depressed from non-depressed individuals (a binary classification problem) and for determining the severity of depression (a regression problem). They compared four models (random forest, SVM, gradient boosting tree (GBT) and a deep neural network) on the AVEC 2013 and AVEC 2017 datasets. It was observed that the DNN was the most effective model for classifying depressed and non-depressed individuals.
The Patient Health Questionnaire depression scale (PHQ-8) and Beck scores [16] are among the popular scales for diagnosing depression; the classification accuracies obtained were similar in both cases. He et al. [16] used the AVEC 2013 and AVEC 2014 databases to develop a DCNN classification model. The study asserted that deep-learned features performed better than hand-crafted features (energy descriptors), which are tedious to derive, but the combination of both kinds of features boosted the performance further.

3 Dataset

For developing the emotion classifier, the datasets used are the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Toronto emotional speech set (TESS) and Surrey Audio-Visual Expressed Emotion (SAVEE).
RAVDESS provides audio samples in a North American English accent for four emotions: happy, sad, angry and fearful. RAVDESS consists of 7356 files in three modality formats, i.e., audio-only (16-bit, 48 kHz, .wav), audio-video (720p H.264, AAC 48 kHz, .mp4) and video-only (no sound), for each of the 24 actors (12 female, 12 male).
The SAVEE database was recorded from four native English male speakers,
postgraduate students and researchers at the University of Surrey aged from 27 to
31 years. Emotions for which audio samples are recorded are anger, disgust, fear,
happiness, sadness and surprise. A neutral category is also added to provide record-
ings of seven emotion categories. The text material consisted of three common, two
emotion-specific and ten generic sentences that were different for each emotion and
phonetically balanced.
TESS contains a set of 200 target words spoken by two actresses (aged 26 and 64 years) portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness and neutral), giving 2800 audio files in total. The dataset is organized so that each actress and emotion has its own folder containing the audio files for all 200 target words, stored in WAV format.
For depression detection, the DAIC-WOZ database was used, which is part of the Distress Analysis Interview Corpus (DAIC) [1]. DAIC-WOZ is a collection of 189 interview sessions with extensive accompanying questionnaire data. A session was labeled depressed if its PHQ-8 score was greater than or equal to 10. The dataset is highly imbalanced: only around a quarter of the 189 sessions are labeled depressed and the rest are labeled non-depressed. Out of the 189 sessions, we used 49 sessions each from depressed and non-depressed subjects. Furthermore, of the many types of data available in the corpus for every session, the COVAREP feature files along with the audio files were used (Fig. 1).
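As a rough illustration of this labeling and balancing step (not taken from the paper; the metadata file name and column names below are hypothetical placeholders, since the actual DAIC-WOZ split files may differ), selecting a balanced 49/49 subset could look like this:

```python
import pandas as pd

# Hypothetical metadata file listing one row per session with its PHQ-8 score.
# Column names are placeholders; the real DAIC-WOZ split files may differ.
meta = pd.read_csv("daic_woz_sessions.csv")  # columns: session_id, phq8_score

# A session counts as depressed when its PHQ-8 score is >= 10 (as in the paper).
meta["depressed"] = (meta["phq8_score"] >= 10).astype(int)

# Balance the classes by sampling 49 sessions from each group.
depressed = meta[meta["depressed"] == 1].sample(n=49, random_state=42)
non_depressed = meta[meta["depressed"] == 0].sample(n=49, random_state=42)
subset = pd.concat([depressed, non_depressed]).reset_index(drop=True)
print(subset["depressed"].value_counts())
```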

4 Methodology

4.1 Feature Extraction

Features for emotion detection were extracted using the pyAudioAnalysis [17] Python library. All 34 short-term features, viz. zero crossing rate, energy, entropy of energy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral rolloff, 13 MFCCs, 12 chroma vectors and chroma deviation, were used to train both models. A window size of 50 ms was used to frame the signal for feature extraction.
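A minimal sketch of this extraction step is shown below. It assumes the ShortTermFeatures module of recent pyAudioAnalysis releases (older releases expose the same functionality under audioFeatureExtraction.stFeatureExtraction); the 25 ms step size and the per-file averaging are assumptions, since the paper only specifies the 50 ms window.

```python
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

def extract_emotion_features(wav_path):
    # Read the audio file; returns the sampling rate and the signal.
    fs, signal = audioBasicIO.read_audio_file(wav_path)
    signal = audioBasicIO.stereo_to_mono(signal)

    # Short-term features with a 50 ms window (the 25 ms step is an assumption).
    window, step = int(0.050 * fs), int(0.025 * fs)
    features, feature_names = ShortTermFeatures.feature_extraction(signal, fs, window, step)

    # Keep the 34 base features (recent versions may also append delta features).
    features = features[:34, :]

    # Aggregate frame-level features into one fixed-length vector per file
    # (averaging is one common choice; the paper does not state the aggregation).
    return np.mean(features, axis=1), feature_names[:34]

vector, names = extract_emotion_features("example.wav")
print(len(vector), names[:3])
```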

Fig. 1 System architecture

On the other hand, for detecting depression, features extracted with COVAREP [18] were used on the DAIC-WOZ dataset. A window size of 10 ms was used to frame the audio signals, and a total of 74 features were extracted per frame; the details can be found in the COVAREP documentation.
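A small sketch of how these per-frame COVAREP features might be loaded and shaped per session is given below; the file name pattern, the absence of a header row and the fixed session length are assumptions about the DAIC-WOZ distribution, not something stated in the paper.

```python
import numpy as np
import pandas as pd

def load_covarep_matrix(csv_path, n_frames=3000):
    # Each row is one 10 ms frame with 74 COVAREP features (assumed headerless CSV).
    frames = pd.read_csv(csv_path, header=None).to_numpy(dtype=np.float32)
    assert frames.shape[1] == 74, "expected 74 COVAREP features per frame"

    # Crop or zero-pad to a fixed number of frames so every session has the same
    # 2-D shape for the CNN (the fixed length of 3000 frames, i.e. 30 s, is an
    # assumption; the paper does not specify how sessions were shaped).
    if frames.shape[0] >= n_frames:
        return frames[:n_frames]
    pad = np.zeros((n_frames - frames.shape[0], 74), dtype=np.float32)
    return np.vstack([frames, pad])

# Hypothetical file name; DAIC-WOZ sessions follow their own naming scheme.
session = load_covarep_matrix("303_COVAREP.csv")
print(session.shape)  # (3000, 74)
```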

4.2 Classification Models

In this work, two algorithms, SVM and CNN, are used for emotion classification. The emotion classes used for the classifier were Anger, Disgust, Fear, Happy, Neutral, Sad and Surprise. Four cases were analyzed: both SVM and CNN were trained first on MFCC features only and then on the other acoustic features as well.
Before training the SVM model, the predictor feature vectors were standardized so that the data had zero mean and unit standard deviation, and the response variable, i.e., the emotion class, was label-encoded. A radial basis function (RBF) kernel was used for the SVM classifier, with a gamma coefficient of 0.1 when training on MFCC features only and 0.01 when training on all 34 features (Figs. 2 and 3).
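Figures 2 and 3 plot the SVM hinge-loss function and its gradients; assuming these refer to the standard soft-margin formulation, the objective being minimized is

L(w, b) = \frac{1}{2}\,\lVert w\rVert^{2} + C \sum_{i} \max\!\left(0,\; 1 - y_{i}\bigl(w^{\top}\phi(x_{i}) + b\bigr)\right),

whose subgradient with respect to w is w - C \sum_{i \in M} y_{i}\,\phi(x_{i}), where M is the set of margin-violating samples. A minimal sketch of the preprocessing and RBF-SVM training described above follows; the train/test split and the regularization parameter C are assumptions not specified in the paper, and X denotes the 34-dimensional feature matrix with y the emotion labels.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_emotion_svm(X, y, gamma=0.1):
    # Label-encode the emotion classes (Anger, Disgust, Fear, ...).
    labels = LabelEncoder().fit_transform(y)

    # Hold out a test set (the split ratio is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=42)

    # Standardize predictors to zero mean and unit standard deviation.
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # RBF kernel; gamma=0.1 for MFCC-only features, 0.01 for all 34 features.
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0)  # C=1.0 is an assumption
    clf.fit(X_train, y_train)
    return clf, accuracy_score(y_test, clf.predict(X_test))
```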
Fig. 2 SVM hinge-loss function

Fig. 3 SVM gradients

The convolutional neural network (CNN), on the other hand, comprised eight convolutional layers with the ReLU activation function, two pooling layers and a dense layer with the softmax activation function. This configuration was used because it gave the most stable model, with the training and testing losses converging neatly. The Adam optimizer was used to train the network with a learning rate of 0.01. Both the MFCC-only features and the full set of 34 features were trained on the same network.
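A rough Keras sketch of such a network is shown below; the filter counts, kernel sizes and the treatment of the input as a (frames, features) sequence are all assumptions, since the paper states only the layer counts, activations, optimizer and learning rate.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_emotion_cnn(n_frames, n_features=34, n_classes=7):
    model = models.Sequential([layers.Input(shape=(n_frames, n_features))])

    # Eight convolutional layers with ReLU, split around two pooling layers
    # (filter counts and kernel sizes are assumptions).
    for filters in (64, 64, 128, 128):
        model.add(layers.Conv1D(filters, kernel_size=5, padding="same", activation="relu"))
    model.add(layers.MaxPooling1D(pool_size=2))
    for filters in (128, 128, 256, 256):
        model.add(layers.Conv1D(filters, kernel_size=5, padding="same", activation="relu"))
    model.add(layers.MaxPooling1D(pool_size=2))

    # Dense softmax output over the seven emotion classes.
    model.add(layers.Flatten())
    model.add(layers.Dense(n_classes, activation="softmax"))

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```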
For detecting depression, a 2-D CNN was used, with four convolutional layers, each followed by a pooling layer, and a global average pooling layer; a batch size of 32 was used for training.
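A comparable Keras sketch for the depression model is given below; the filter counts, kernel sizes, the sigmoid output layer and the input shape (fixed-length COVAREP frame matrices like those sketched in Sect. 4.1) are assumptions beyond the four-convolution-plus-pooling and global-average-pooling structure stated above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_depression_cnn(n_frames=3000, n_features=74):
    model = models.Sequential([layers.Input(shape=(n_frames, n_features, 1))])

    # Four convolutional layers, each followed by a pooling layer
    # (filter counts and kernel sizes are assumptions).
    for filters in (16, 32, 64, 128):
        model.add(layers.Conv2D(filters, kernel_size=(3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))

    # Global average pooling followed by a binary output (sigmoid is an assumption).
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(1, activation="sigmoid"))

    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Training would use the stated batch size of 32, e.g.:
# model.fit(X_train, y_train, batch_size=32, epochs=20, validation_split=0.2)
```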

5 Experimentation and Results

Figure 4 shows the accuracies for the two models, SVM and CNN, when trained on MFCCs only and on the other acoustic features along with MFCCs. In both cases, SVM gave better accuracy than CNN, as SVM works better on classification problems, especially in emotion recognition.
While the extra features increased the accuracy of SVM by almost 4%, the CNN accuracy dipped by around 14%. SVM found a better hyperplane to classify the data points with the extra features, whereas the CNN model could not make use of the increased dimensionality.
In light of an imbalanced and scarce dataset and the computational limitations of our machines, depression detection accuracy with the 2-D CNN remained at 60% (Table 1).

Fig. 4 Loss with MFCC only

Table 1 Accuracies (%) of emotion detection algorithms

Algorithm   MFCCs   MFCCs + others
SVM         84.53   88.26
CNN         84.21   70.31

6 Conclusion

The model detects different emotions from speech, and its applications range from counseling and health care to music recommendation. Most previous work, however, incorporates only one kind of accent, so pan-accent emotion detection has not been established yet. This work attempts to overcome this issue by incorporating speech data from various accents to provide a more comprehensive emotion detection system. We also found that the neural network worked as well as the SVM classifier on only one type of feature, but as the dimensionality of the data increased, the CNN model, already struggling with the small dataset, could not find relevant patterns and its accuracy fell behind the SVM's.
In the future, the accuracy can be increased further with intelligent feature selection. This work can also be coupled with a real-time application for both emotion and depression detection that processes spontaneous speech instead of acted emotions.

References

1. Gratch, J., Artstein, R., Lucas, G., Stratou, G., Scherer, S., Nazarian, A., Wood, R., Boberg, J.,
DeVault, D., Marsella, S., Traum, D., Rizzo, S., & Morency, L. -P. (2014) The distress analysis
interview corpus of human and computer interviews. In Proceedings of Language Resources
and Evaluation Conference (LREC).
2. Rajvanshi, K. (2018). An Efficient approach for emotion detection from speech using neural
networks. International Journal for Research in Applied Science and Engineering Technology,
6, 1062–1065. https://doi.org/10.22214/ijraset.2018.5170.
3. Sharma, U. (2016). Identification of emotion from speech signal. In 2016 3rd International
Conference on Computing for Sustainable Global Development (INDIACom), New Delhi
(pp. 2805–2807).
4. Kamaruddin, N., Abdul Rahman, A. W., & Abdullah, N. S. (2014). Speech emotion identifica-
tion analysis based on different spectral feature extraction methods. In The 5th International
Conference on Information and Communication Technology for The Muslim World (ICT4M),
Kuching (pp. 1–5). https://doi.org/10.1109/ict4m.2014.7020588.

5. Semwal, N., Kumar, A., & Narayanan, S. (2017). Automatic speech emotion detection system
using multi-domain acoustic feature selection and classification models. In 2017 IEEE Inter-
national Conference on Identity, Security and Behavior Analysis (ISBA), New Delhi (pp. 1–6).
https://doi.org/10.1109/isba.2017.7947681.
6. Chamoli, A., Semwal, A., & Saikia, N. (2017). Detection of emotion in analysis of speech using linear predictive coding techniques (L.P.C). In 2017 International Conference on Inventive Systems and Control (ICISC), Coimbatore (pp. 1–4). https://doi.org/10.1109/icisc.2017.8068642.
7. Sonawane, A., Inamdar, M. U., & Bhangale, K. B. (2017). Sound based human emotion recognition using MFCC & multiple SVM. In 2017 International Conference on Information, Communication, Instrumentation and Control (ICICIC), Indore (pp. 1–4). https://doi.org/10.1109/icomicon.2017.8279046.
8. Harár, P., Burget, R., & Dutta, M. K. (2017). Speech emotion recognition with deep learning.
In 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN),
Noida (pp. 137–140). https://doi.org/10.1109/spin.2017.8049931.
9. Lukose, S., & Upadhya, S. S. (2017). Music player based on emotion recognition of voice signals. In 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT) (pp. 1751–1754). https://doi.org/10.1109/icicict1.2017.8342835.
10. Sudhakar, R. S., & Anil, M. C. (2015). Analysis of speech features for emotion detection:
A review. In 2015 International Conference on Computing Communication Control and
Automation, Pune (pp. 661–664). https://doi.org/10.1109/iccubea.2015.135.
11. Iqbal, A., & Barua, K. (2019). A real-time emotion recognition from speech using gradient boosting. In 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE) (pp. 1–5). https://doi.org/10.1109/ecace.2019.8679271.
12. Sahu, S., & Espy-Wilson, C. (2016). Speech features for depression detection. In Interspeech 2016 (pp. 1928–1932). https://doi.org/10.21437/interspeech.2016-1566.
13. Mitra, V., & Shriberg, E. (2015). Effects of feature type, learning algorithm and speaking style for depression detection from speech. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4774–4778). https://doi.org/10.1109/icassp.2015.7178877.
14. Chlasta, K., Wołk, K., & Krejtz, I. (2019). Automated speech-based screening of depression using deep convolutional neural networks. In 8th International Conference on Health and Social Care Information Systems and Technologies, Oct 2019. https://doi.org/10.1016/j.procs.2019.12.228.
15. Tasnim, M., & Stroulia, E. (2019). Detecting depression from voice. Cham: Springer Nature Switzerland AG, April 2019 (pp. 472–478). https://doi.org/10.1007/978-3-030-18305-9_47.
16. He, L., & Cao, C. (2018). Automated depression analysis using convolutional neural network
from speech. Journal of Biomedical Informatics, 83, 103–111.
17. Giannakopoulos, T. (2015). pyAudioAnalysis: An open-source python library for audio signal
analysis. PLoS ONE, 10(12), e0144610. https://doi.org/10.1371/journal.pone.0144610.
18. Degottex, G., Kane, J., Drugman, T., Raitio, T., & Scherer, S. (2014). COVAREP—A collab-
orative voice analysis repository for speech technologies. In Proceedings IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy 2014.
