
SPEECH EMOTION RECOGNITION USING CONVOLUTIONAL NEURAL

NETWORKS

By

JABRENEE ANGELINA HUSSIE

A THESIS

submitted in partial fulfillment of the requirements


for the Honors Program of Delaware State University
DOVER, DELAWARE

May 2023

This Honors thesis is approved by the following Thesis Committee:


Dr. Fatima Boukari, Research Advisor, Computer Science Department, Delaware State
University
Dr. -------------, Honors Council Committee Member, …. (position), Department of ---------------,
Delaware State University
Dr. Gulnihal Ozbay, Honors Council Committee Chair, Professor, Department of Agriculture
and Natural Resources, Delaware State University
Ms. Shonda Poe, Honors Program Director, URELAH, Delaware State University
SPEECH EMOTION RECOGNITION USING CONVOLUTIONAL NEURAL
NETWORKS
ABSTRACT
Our society is designed to professionally and socially benefit people who find themselves comfortable in social situations and have strong interpersonal skills, while those who aren't as strong in this area develop coping mechanisms to better navigate the world. The development of these skills comes from the ability to recognize the emotions of others and act accordingly.

The inability to recognize emotions affects around 13% of the population. Doctors recommend

different forms of therapy to cope, while researchers recommend understanding one’s own

physical, emotional, and biological responses to understand others. While these are good tools to

help one learn about emotional responses, everyone reacts to situations differently and learning

how one internally reacts to a situation won’t necessarily help them identify that in others.

Speech-based emotion recognition (SER) research aims to provide aid by building deep learning models that analyze speech to identify the emotions of the speaker. Previous SER studies represent English speakers only. In this research, deep learning techniques are used to analyze speech sound signals in different languages and develop a model that predicts emotions from the speech it is given. This study analyzes 5 different languages and 7 basic sentiments. The primary model, a Convolutional Neural Network (CNN), achieved 67% accuracy and a 78% F1 score; a secondary model using a Multilayer Perceptron (MLP) achieved 85% accuracy. Five out of seven sentiments were predicted correctly.


Table of Contents
Problem Statement
Significance of Work
Artificial Intelligence
Psychology
Methodology
Results
Future Advancements
Conclusion
References
List of Figures
Figure 1: Waveplot & Spectrogram of happy emotion sample audio
Figure 2: Waveplot & Spectrogram of angry emotion sample audio
Figure 3: Graphical depiction of the Accuracy of testing and training for the CNN Model

The backbone of human life is our relationships with others. Part of the ability to maintain those connections is the ability to recognize the emotions of yourself and of the people around you. Our society is designed in a way that those who find themselves comfortable and confident in social situations, and who have strong interpersonal skills, excel professionally and socially, while those who aren't as strong in this area develop masks and coping mechanisms to better navigate a world not designed for them.

Problem Statement

The inability to recognize emotions generally affects 13% of the population (Lo, 2021).

This ailment can develop from a host of diagnoses that can arise at any time in someone's life. The condition has its greatest impact on the social, economic, and mental well-being of those who live with it, and its effects reach into broader society: platonic, professional, and romantic relationships are all affected, because the foundation of maintaining these relationships lies in the ability to interact with, understand, and empathize with others. People whose brains develop or work differently (neurodivergent people) suffer discrimination in various ways, whether economic or social, and employers suffer as well, because studies show that neurodiverse teams are 30% more productive than neurotypical ones and make fewer errors (MyDisabilityJobs, 2022). In the workforce as well as in social settings, gaps in pay, responsibility, and treatment can already be seen within many marginalized communities, and the intersectionality between these communities can create social and economic disparities that put an immense amount of pressure and difficulty on one person.

Doctors recommend different forms of therapy to cope, while researchers recommend that people living with this condition learn their own responses, such as their heart rate and its fluctuations, and journal the physical and emotional responses that they experience (Cherney, 2021). While these are good tools to help one learn about emotional responses, everyone reacts to situations differently, and learning how one internally reacts to a situation won't necessarily help them identify that reaction in others. This research aims to bridge the gap for those who live with these circumstances by building models that analyze speech to identify the emotions of the speaker and thereby aid the ability to recognize emotions through speech.

Significance of Work

Artificial Intelligence

This year we are starting to see a different focus in technology news when it comes to

artificial intelligence. Currently there are a lot of discussions around AI and emotion recognition.

The market for emotion recognition-based AI is growing and is projected to grow substantially

from now to 2029 (TheExpressWire, 2022). The projected usage lies in consumer and producer relations; for example, companies want to better understand customer behavior in order to sell their products more effectively. Many industries are branching out into computer vision

and speech recognition to make their products more human-like (nishi, 2022). The appeal of

emotional AI is that it allows markets to better understand people and provide more useful

services (nishi, 2022). It has even been proposed that, in the era of online learning, the technology can aid teachers: they can collect and analyze students' reactions to help alleviate communication problems between teachers and students, which can lead to curriculum changes that improve learning outcomes (nishi, 2022).

Psychology

Many mental conditions and disorders affect one's ability to recognize emotion, and a common response is emotion recognition training (Andersen, 2022), which is generally done by studying facial expressions and attaching them to emotions to help alleviate some of the pressures of context blindness (Andersen, 2022). For example, what does it look like when someone is anxious or uninterested? Could you tell from a simple facial expression? There is a general understanding of the intersection between facial expressions, body language, and emotions, but what about when you don't have the ability to read those facial expressions? Conditions such as prosopagnosia, or face blindness, make speech-based emotion recognition very important (NHS, 2019). Living with any mental disorder, diagnosed or undiagnosed, can be alienating when no one understands what that experience is like and everyone expects a form of normalcy that you just can't give them. Trying to cope with these expectations can be really damaging to one's mental health. For example, people with autism are six times more likely to attempt suicide and seven times more likely to die by suicide than those who are not autistic (Jachyra et al., 2022). Those disparities are something that should not be taken lightly, and if speech-based emotion recognition research can help in any way, it is worth exploring.

Methodology

In layman's terms, the plan for this study is to use English phrases and their corresponding translations in French, Spanish, Japanese, Mandarin Chinese, and Korean to develop a system of speech-based emotion recognition. This format was chosen after carefully scouring related works and noticing the imbalance of representation: many of the datasets used in related works are English-only and biased toward mostly male speakers, so creating an alternative dataset helps give the study a more balanced and worldly view. This is an important aspect of the study because the inability to recognize emotions doesn't only affect English speakers.

Data collection is the first step. Collection consists of capturing clips ranging from 10 to 30 seconds in all of the languages by screen recording two Netflix series, Julie and The Phantoms and Heartstopper. The TESS dataset was also used in combination with these recorded clips. After the data is collected, the clips are classified by emotion. The emotions focused on in this study are calm, happy, sad, angry, fearful, disgust, and surprised, labeled from 1 to 7 respectively. A Jupyter notebook is utilized for the analysis in this study. The data is compiled in a spreadsheet, which is then imported into the notebook. The dataset is then visualized using pie charts, spectrograms, and wave plots with the Python libraries matplotlib and librosa.

Figure 1: Waveplot & Spectrogram of happy emotion sample audio

Figure 2: Waveplot & Spectrogram of angry emotion sample audio
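The figures above were produced with librosa and matplotlib. The following is a minimal sketch of how such a waveplot and spectrogram could be generated for one clip; the file path is a hypothetical placeholder, and the plotting parameters are illustrative rather than the exact notebook code of this study.

```python
# Minimal sketch: waveplot and spectrogram of a single labeled clip with librosa.
# "clips/happy_sample.wav" is a hypothetical path, not a file from this study.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("clips/happy_sample.wav")  # waveform and sampling rate

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(10, 6))

# Waveplot: signal amplitude over time
librosa.display.waveshow(y, sr=sr, ax=ax_wave)  # waveplot() in older librosa versions
ax_wave.set_title("Waveplot - happy sample")

# Spectrogram: magnitude of the short-time Fourier transform, in decibels
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
img = librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz", ax=ax_spec)
ax_spec.set_title("Spectrogram - happy sample")
fig.colorbar(img, ax=ax_spec, format="%+2.0f dB")

plt.tight_layout()
plt.show()
```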

Before starting any modeling or analysis, the data must be prepared. We begin with data augmentation and then preprocess the data. Data augmentation is used to create new data samples and widen the range of data we use. Adding these small changes helps account for variation that occurs naturally, which builds a more general model. The changes consist of altering the pitch and length of each clip and injecting noise.
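A minimal sketch of these three augmentations with numpy and librosa follows; the noise factor, stretch rate, and semitone shift are illustrative assumptions, not the exact values used in the study.

```python
# Minimal sketch of the augmentations described above: noise injection,
# time stretching (length), and pitch shifting. Parameter values are assumed.
import numpy as np
import librosa

def inject_noise(y, noise_factor=0.005):
    """Add low-amplitude random noise to the waveform."""
    return y + noise_factor * np.random.randn(len(y))

def stretch(y, rate=0.9):
    """Slow down (rate < 1) or speed up (rate > 1) the clip without changing pitch."""
    return librosa.effects.time_stretch(y, rate=rate)

def shift_pitch(y, sr, n_steps=2):
    """Shift the pitch up or down by a number of semitones."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# Each original clip can yield several augmented copies, enlarging the dataset
# and making the model less sensitive to natural variation in speech.
```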

Data preprocessing consists of standardization, label encoding, and feature extraction. Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, shifting the distribution to have a mean of zero and a standard deviation of one; it is used to improve the performance of predictive modeling. Label encoding utilizes OneHotEncoder(), which converts categorical labels into numerical features, because machine learning algorithms assume and require that data be numeric. Feature extraction helps the algorithms grasp the important parts of the data they should pay attention to, in a form they can easily consume. In this study we use the most common features for audio analysis: Zero Crossing Rate, Mel Frequency Cepstral Coefficients, Root Mean Square value, and Mel Spectrogram.
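A minimal sketch of these preprocessing steps is shown below. Here `clips` is a hypothetical list of (waveform, sampling rate, label) tuples built from the spreadsheet, and the number of MFCCs is an assumption rather than the study's exact setting.

```python
# Minimal sketch: feature extraction, standardization, and one-hot label encoding.
# `clips` is a hypothetical list of (waveform, sampling_rate, label) tuples.
import numpy as np
import librosa
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def extract_features(y, sr):
    """Stack the four features used in this study into one vector per clip."""
    zcr = np.mean(librosa.feature.zero_crossing_rate(y))                 # Zero Crossing Rate
    rms = np.mean(librosa.feature.rms(y=y))                              # Root Mean Square value
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20), axis=1)  # MFCCs
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)    # Mel Spectrogram
    return np.hstack([zcr, rms, mfcc, mel])

X = np.array([extract_features(y, sr) for y, sr, _ in clips])
labels = np.array([label for _, _, label in clips]).reshape(-1, 1)   # emotions labeled 1-7

# Standardization: zero mean and unit variance for each feature column
X = StandardScaler().fit_transform(X)

# Label encoding: one-hot vectors for the seven emotion classes
y_onehot = OneHotEncoder().fit_transform(labels).toarray()
```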

The model built in this study contains two convolutional blocks, each containing three convolutional layers. The convolutional blocks are separated by max pooling layers. The output is then passed through a global average pooling layer, into six fully connected layers, and finally a classification output layer. To overcome overfitting, we added a dropout layer that removes some of the connections between layers; by lowering the complexity of the model in this way, we prevent it from overfitting to the training data, resulting in much better accuracy. The model is trained using Python code together with the PyTorch deep learning framework. The data is split into train and test subsets with sklearn's train_test_split, using a random 80/20 split. The loss function we settled on is categorical cross entropy, and Adam was used as the optimization function.
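A PyTorch sketch of this architecture and training setup is given below. The channel widths, kernel sizes, hidden layer sizes, dropout rate, and learning rate are assumptions for illustration; only the overall structure (two blocks of three convolutional layers, max pooling, global average pooling, six fully connected layers plus a classification output, dropout, categorical cross entropy, and Adam) follows the description above. `X` is the standardized feature matrix and `y_labels` is a hypothetical array of integer class indices.

```python
# Sketch of the described CNN in PyTorch. Hyperparameter values are assumptions.
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split

class EmotionCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()

        def conv_block(in_ch, out_ch):
            # one convolutional block = three convolutional layers
            return nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(out_ch, out_ch, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(out_ch, out_ch, kernel_size=5, padding=2), nn.ReLU(),
            )

        self.block1 = conv_block(1, 64)
        self.pool = nn.MaxPool1d(kernel_size=2)      # separates the two blocks
        self.block2 = conv_block(64, 128)
        self.gap = nn.AdaptiveAvgPool1d(1)           # global average pooling
        self.dropout = nn.Dropout(0.3)               # reduces overfitting
        self.fc = nn.Sequential(                     # six fully connected layers + output
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, n_classes),
        )

    def forward(self, x):                 # x: (batch, 1, n_features)
        x = self.pool(self.block1(x))
        x = self.block2(x)
        x = self.gap(x).squeeze(-1)       # (batch, 128)
        return self.fc(self.dropout(x))

# Random 80/20 train/test split, categorical cross entropy, and the Adam optimizer.
# y_labels holds integer class indices (0-6), as expected by CrossEntropyLoss.
X_train, X_test, y_train, y_test = train_test_split(X, y_labels, test_size=0.2, random_state=42)
model = EmotionCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```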

Results

We applied the categorical cross entropy loss function for optimization and assessed the generated loss curves against training time using the Adam optimizer. After training and validating the proposed CNN model, we obtained a mean squared error value of 0.6 and an accuracy of 0.67. An F1 score of 78% was achieved; the F1 score is a machine learning evaluation metric used to assess a model's accuracy. With the MLP of sklearn, however, we obtained an accuracy of 85%. This difference is attributed to the size of the dataset for the multi-label classification and the challenges of the problem. Five of the seven sentiments were predicted correctly.

Figure 3: Graphical depiction of the Accuracy of testing and training for the CNN Model
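For the secondary model, a sketch along the following lines, using scikit-learn's MLPClassifier on the same standardized features, illustrates how the accuracy and F1 score can be computed; the hidden layer sizes and iteration limit are assumptions, and `X_train`, `X_test`, `y_train`, and `y_test` are assumed to hold the features and integer emotion labels from the split above.

```python
# Sketch of the secondary MLP model and the two reported metrics.
# Hidden layer sizes and max_iter are assumptions, not the study's exact settings.
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score

mlp = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500, random_state=42)
mlp.fit(X_train, y_train)                 # integer emotion labels 1-7

pred = mlp.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("f1 (weighted):", f1_score(y_test, pred, average="weighted"))
```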

Future Advancements

In the future, this study would benefit from expansion. We want to expand the number of clips used in the various languages, and we could add more variety in languages to account for linguistic structure and dialect nuances. Accuracy can be improved by running multiple experiments so that time and repetition help the model learn more. One can also compare the multi-label margin loss and multi-label soft margin loss functions for the best results, and test several optimization functions, including Adam, Adamax, and SGD.
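As a sketch of how such a comparison could be set up in PyTorch (candidate names only; the training loop and data loading are omitted):

```python
# Candidate loss functions and optimizers that a follow-up experiment could compare.
# Each (loss, optimizer) pair would be used to retrain the model from scratch,
# and the resulting test accuracy would be recorded.
import torch
import torch.nn as nn

losses = {
    "multi-label margin": nn.MultiLabelMarginLoss(),
    "multi-label soft margin": nn.MultiLabelSoftMarginLoss(),
}
optimizers = {
    "adam": lambda params: torch.optim.Adam(params, lr=1e-3),
    "adamax": lambda params: torch.optim.Adamax(params, lr=1e-3),
    "sgd": lambda params: torch.optim.SGD(params, lr=1e-2, momentum=0.9),
}
```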

Conclusion

We proposed and implemented a CNN model using an accessible open-source dataset and a few data samples that we generated from series available on the internet. The architecture of the model is composed of linked layers, and we implemented it using Python code together with the PyTorch deep learning framework and tools. The categorical cross entropy loss function was applied for optimization, and we assessed the generated optimization curves against training time with the Adam optimizer. After training and validating the proposed model, we achieved a mean squared error value of 0.6, which is still high; this is due to the size of the dataset for the multi-label classification.

References

Andersen, R. (2022, August 12). How to help your autistic child with context blindness. Autism Parenting Magazine. Retrieved September 11, 2022, from https://www.autismparentingmagazine.com/autism-context-blindness/

Biswal, A. (2022, September 9). Top 10 deep learning algorithms you should know in 2022. Simplilearn.com. Retrieved September 11, 2022, from https://www.simplilearn.com/tutorials/deep-learning-tutorial/deep-learning-algorithm

Cherney, K. (2021, September 9). Alexithymia: Causes, symptoms, and treatments. Healthline. Retrieved September 11, 2022, from https://www.healthline.com/health/autism/alexithymia#tips-to-cope

Jachyra, P., Rodgers, J., & Cassidy, S. (2022, July 11). Autistic people are six times more likely to attempt suicide – poor mental health support may be to blame. The Conversation. Retrieved September 11, 2022, from https://theconversation.com/autistic-people-are-six-times-more-likely-to-attempt-suicide-poor-mental-health-support-may-be-to-blame-180266

Lo, I. (2021, February 6). Alexithymia: Do you know what you feel? Psychology Today. Retrieved September 11, 2022, from https://www.psychologytoday.com/us/blog/living-emotional-intensity/202102/alexithymia-do-you-know-what-you-feel

MyDisabilityJobs. (2022, August 25). Neurodiversity in the workplace: Statistics: Update 2022. MyDisabilityJobs.com. Retrieved September 11, 2022, from https://mydisabilityjobs.com/statistics/neurodiversity-in-the-workplace/

NHS. (2019). Prosopagnosia (face blindness). NHS Choices. Retrieved September 11, 2022, from https://www.nhs.uk/conditions/face-blindness/

nishi. (2022, September 1). The future of online learning is being shaped by emotional AI after covid-19. Inventiva. Retrieved September 11, 2022, from https://www.inventiva.co.in/trends/the-future-of-online-learning-is-being/

TheExpressWire. (2022, September 5). Artificial Intelligence-emotion recognition market insight manufacturers analysis, revenue, covid-19 impact, supply, growth, upcoming demand, regional outlook till 2029. Digital Journal. Retrieved September 11, 2022, from https://www.digitaljournal.com/pr/artificial-intelligence-emotion-recognition-market-insight-manufacturers-analysis-revenue-covid-19-impact-supply-growth-upcoming-demand-regional-outlook-till-2029
