
Empowering Communication: The

Evolution and Potential of Speech


Recognition System

Vaibhav Tiwari
(10621051)

Applied Research Project submitted in partial fulfilment of the requirements for the degree of MSc

in Data Analytics at Dublin Business School

Supervisor: Agatha Mattos

January 2024

Declaration

I declare that this Applied Research Project that I have submitted to Dublin Business School

for the award of MSc in Data Analytics is the result of my own investigations, except where

otherwise stated, where it is clearly acknowledged by references. Furthermore, this work has not

been submitted for any other degree.

Signed: Vaibhav Tiwari

Student number: 10621051

Date: 5th January 2024



Acknowledgement

I want to thank Professor Agatha Mattos, my research mentor, for providing me with guidance,

encouragement, and valuable suggestions throughout my research journey. I thank my supervisor for

her practical advice and assistance in my study. Additionally, I extend special thanks to the DBS

library and academic operations for their support in providing necessary reference materials for my

literature work.

Abstract

This project explores the classification of emotions in speech data using neural network

models. The project aims to build robust models to correctly identify emotions such as happiness,

sadness, anger, fear, and disgust from audio clips. The process involves data collection, cleaning,

feature extraction using Mel-frequency cepstral coefficients (MFCCs), and the creation and

assessment of three distinct neural network architectures. The first model, a sequential architecture with dense

layers, is used to compare the performances of the subsequent models. The second model, which

uses an augmented dataset, performed better. The third model, which added an

LSTM layer, produced mixed results compared with the dense-only models. The results underline the importance of

varied data in improving emotion identification accuracy from speech data. This research brings

essential knowledge to the field of neural network-based emotion classification. It sets the

foundation for valuable applications in monitoring areas, including mental health and interactions

between humans and computers.



Table of Contents

Declaration .......................................................................................................................................................2
Acknowledgement ...........................................................................................................................................3
Abstract ............................................................................................................................................................4
List of Tables ....................................................................................................................................................7
List of Figures ...................................................................................................................................................8
1 Introduction .............................................................................................................................................9
1.1 Introduction ...........................................................................................................................................9
1.2 Background ......................................................................................................................................... 10
1.3 Motivation........................................................................................................................................... 11
1.3.1 Enhancing Human-Computer Interaction ........................................................................................ 12
1.3.2 Improving Mental Health Assessment ............................................................................................. 12
1.3.3 Unveiling Valuable Customer Insights ............................................................................................. 12
1.3.4 Human-Emotion Synthesis ............................................................................................................... 13
1.3.5 Unlocking Unseen Insights ............................................................................................................... 13
1.4 Research Objectives ............................................................................................................................ 13
1.5 Research Question .............................................................................................................................. 14
1.6 Report Overview ................................................................................................................................. 14
2 LITERATURE REVIEW ............................................................................................................................. 17
2.1 Evolution of Artificial Intelligence (AI) ................................................................................................ 17
2.1.1 Early Challenges in Artificial Intelligence ......................................................................................... 17
2.2 Machine Learning and the Quest for Representation ........................................................................ 18
2.3 Representation Learning and Deep Learning ..................................................................................... 18
2.4 The Role of Depth in Deep Learning ................................................................................................... 18
2.5 Related Work ...................................................................................................................................... 19
3 Methodology......................................................................................................................................... 28
3.1 Business Understanding:..................................................................................................................... 28
3.2 Data Understanding ............................................................................................................................ 28
3.3 Data Preparation: ................................................................................................................................ 30
3.4 Modeling: ............................................................................................................................................ 30
3.5 Research ethics: .................................................................................................................................. 35
4 Implementation .................................................................................................................................... 36


5 Evaluation ............................................................................................................................................. 41
6 Conclusion and Future Work ................................................................................................................ 48
6.1 Conclusion ........................................................................................................................................... 48
6.2 Future Work ........................................................................................................................................ 49
7 References ............................................................................................................................................ 51
8 Appendix ............................................................................................................................................... 54
8.1 Artefact Links ...................................................................................................................................... 54
8.2 Dataset Links ....................................................................................................................................... 54


List of Tables

Table 1: Comparison metrics for different models .............................................................. 46


List of Figures

Figure 1: Emotion and Gender Distribution ......................................................................... 36


Figure 2: Waveplot for sad emotion audio - RAVDESS dataset......................................... 37
Figure 3: Accuracy for training data - RAVDESS ............................................................... 42
Figure 4: Loss for training data - RAVDESS ....................................................................... 42
Figure 5: Accuracy/loss for CNN+Attention net+LSTM model .......................................... 43
Figure 6: Accuracy/loss for ANN+LSTM model ................................................................ 44
Figure 7: Accuracy/loss for ANN+multi-layered LSTM model ........................................... 45


1 Introduction

This chapter introduces the subject, gives background information, emphasizes the

motivation, details the research objectives, poses research questions, and provides a structure for the

report's organization, setting the scene for the remaining study.

1.1 Introduction

Human communication is an intricate and multifaceted phenomenon. At its core, speech is the

primary medium through which we express thoughts, convey information, and engage with one

another. Yet, beneath the surface of words and sentences lies a richer layer of human

interaction—emotions. These intangible, ever-shifting states of mind are as fundamental to our

conversations and relationships as our words. Understanding and interpreting these emotional nuances

within speech has become a focal point of the study, especially in the backdrop of the swift and

profound advancements in artificial intelligence (AI).

Emotions are the colour palette that paints the canvas of human communication.

They are the joy that fills laughter, the sorrow that trembles in a sigh, the excitement that crackles in

enthusiasm, and the serenity that flows in comforting words. For centuries, humans have been aware

of the emotional richness that speech carries. Still, the ability to systematically recognise, quantify,

and categorise these emotional cues within spoken language has been a complex and elusive

challenge.

In the current age, we stand at the precipice of a remarkable evolution in technology and

understanding. The rise of artificial intelligence has revolutionised the way we interact with machines

and the world around us. It has ushered in an era where machines are becoming increasingly proficient

at tasks that once required human insight and judgment. The amalgamation of AI and studying human


emotions through speech has birthed a promising field known as Speech Emotion Recognition (SER).

This intersection of AI and emotional intelligence has sparked an array of previously

unimaginable possibilities. With AI as the catalyst, we can now embark on a journey to develop

systems capable of deciphering emotional undertones, recognising and categorising the emotions

embedded in spoken language. This evolution in technology has the potential to provide us with a

window into the emotional landscape of human communication, which has profound implications for

various sectors of our society.

As we delve deeper into this thesis, we will explore the intricate mechanisms of SER and the

remarkable possibilities it brings to the table. We will discover how the evolution of AI is not just a

technological advancement but a transformation in how we comprehend and respond to the emotional

dimensions of our interactions. It is a journey into the uncharted territory of artificial emotional

intelligence, allowing us to uncover the subtle emotional cues hidden within our words, thus enriching

our understanding of human communication and our capacity to use technology to decipher the human

experience.

1.2 Background

Over the years, extensive interdisciplinary research has provided valuable insights into speech

emotion recognition. The advent of AI has been instrumental in shedding new light on this domain,

opening exciting possibilities for understanding and utilizing emotional cues in speech.

The marriage of AI with emotion recognition has enabled researchers to address a multifaceted

problem. It's the culmination of efforts by neuroscientists, computational intelligence experts,

linguists, and AI specialists that have unlocked this potential. Neuroscientists have delved into

understanding how the human brain processes and perceives emotional stimuli, providing

foundational insights. Computational intelligence researchers have translated this knowledge into

mathematical solutions, bridging the gap between neural processes and machine learning. Linguists

have approached speech emotion recognition by dissecting speech's semantic and syntactic aspects,

adding linguistic context to the equation. The collaboration of these interdisciplinary fields has

unraveled the complexities of emotional content within the speech.

Scherer's (2003) work presented various design paradigms, often using a modified version of

Brunswik's functional lens model of perception to study speech emotion recognition. Researchers

have experimented with techniques like spectral analysis and Hidden Markov Models to identify and

categorize emotions within spoken language. The application of AI and machine learning has

significantly enhanced the accuracy and efficiency of these recognition systems.

In human-computer interaction, AI-driven emotion recognition from speech is proving

invaluable. Affective systems, empowered by AI, can detect a user's emotional state in real-time,

offering the opportunity to adapt system responses and enhance user satisfaction. Speech and gesture

recognition are cornerstones of this burgeoning field known as affective computing, with audio-based

devices being the most widely adopted, primarily due to the established trust in the technology and

concerns over privacy associated with video sensors.

1.3 Motivation

The study of Speech Emotion Recognition (SER) driven by artificial intelligence is not merely

an academic pursuit but a critical endeavor with profound implications for individuals, businesses,

and society. This research is motivated by several compelling factors that highlight the urgency and

significance of this work.


1.3.1 Enhancing Human-Computer Interaction

In an increasingly digital world, how humans interact with technology has evolved

significantly. From smartphones to smart speakers, machines are integral to our daily lives. However,

for technology to truly serve us, it must understand us on a deeper level. The ability of AI-driven

systems to recognize and respond to human emotions in speech is fundamental in creating more

meaningful and satisfying human-computer interactions. Whether it's in virtual assistants providing

empathetic responses or customer service chatbots tailoring their support to user emotions, the

potential for enhancing our technological interactions is limitless.

1.3.2 Improving Mental Health Assessment

Emotions play an essential role in mental health and well-being. Accurate detection of

emotions in speech offers a powerful tool for assessing and monitoring mental health conditions. It

can contribute to early intervention, improved therapy, and a deeper understanding of emotional

well-being. The potential to develop AI systems that can recognize signs of emotional distress or instability

in speech could be transformative in addressing the global mental health crisis.

1.3.3 Unveiling Valuable Customer Insights

Businesses thrive on understanding their customers. Emotion recognition in customer

interactions, whether through phone calls, chat messages, or reviews, offers a goldmine of

information. Recognizing the emotional tone of customer feedback can aid businesses in tailoring

their products and services, improving customer satisfaction, and addressing issues proactively. The

ability to discern the sentiment behind the words can lead to enhanced decision-making, better

products, and, ultimately, stronger customer relationships.


1.3.4 Human-Emotion Synthesis

The potential of AI to detect and respond to human emotions opens doors to a new era of

human-AI coexistence. Machines with affective properties can detect user emotions and adapt their

responses to meet emotional needs. This improves user satisfaction and lays the groundwork for a

more empathetic and understanding AI-human partnership.

1.3.5 Unlocking Unseen Insights

Human communication is a treasure trove of unspoken emotions. While words may convey

one message, emotions often reveal a different narrative. SER enables us to explore these hidden

narratives, providing invaluable insights into human behavior, sentiment, and well-being. This

research motivates us to unveil these unseen communication layers, contributing to our collective

understanding of human nature and enhancing our ability to address diverse real-world challenges.

In a world where technology and human interaction are increasingly entwined, the ability to

decipher and respond to human emotions in speech is not just a scientific pursuit; it is a quest to

improve lives, augment businesses, and expand the horizons of AI-human collaboration. This thesis

embarks on this quest, aiming to unlock the potential of artificial emotional intelligence in the service

of humanity.

1.4 Research Objectives

This research's primary objective is to develop and evaluate efficient neural network

architectures for accurate emotion classification from diverse speech datasets, focusing on the

impact of data augmentation and advanced feature engineering techniques. Further, some

other objectives are as follows:

• Developing neural network architectures: This research investigates different model



architectures for emotion classification.

• Accurate emotion classification: Aim to improve the accuracy of emotion recognition from

speech.

• Diverse speech datasets: Leverage multiple datasets for model training and evaluation.

• Data augmentation: Explore the impact of data augmentation on model performance.

• Advanced feature engineering: Consider alternative feature extraction

approaches beyond MFCCs.

1.5 Research Question

How can the integration of Convolutional Neural Networks (CNN) and Long Short-Term

Memory networks (LSTM) be optimized to improve the accuracy and efficiency of Speech Emotion

Recognition (SER) systems, and what impact will this have on real-world applications?

1.6 Report Overview

This section gives the reader a brief outline of the report, which is structured as follows:

• Chapter 2 – Literature Review, presents an overview of the background research on the topic. It begins with the evolution of artificial intelligence and deep learning, then navigates through previous studies and methodologies used for speech emotion recognition. Additionally, this section critically assesses the limitations and challenges inherent in existing SER systems, offering insights into the gaps and opportunities for improvement.

• Chapter 3 – Methodology, explains the process of data collection, elaborating on the speech emotion datasets used (RAVDESS, TESS, CREMA-D, and SAVEE). Subsequently, it describes the steps involved in data preparation, including cleaning, formatting, feature extraction, and augmentation, and it emphasizes the importance of high-quality, well-preprocessed data for effective emotion recognition models. It also introduces the modelling techniques applied and the ethical considerations of the research.

• Chapter 4 – Implementation, walks through the practical work: exploring and visualizing the datasets, extracting MFCC features with Librosa, encoding the emotion labels, and building and training the neural network models used for emotion classification.

• Chapter 5 – Evaluation, summarizes and compares the performance metrics of the evaluated models for speech emotion recognition. It discusses the strengths, weaknesses, and applicability of each model, and the comparative analysis drawn from these evaluations provides insights and implications for future studies and practical applications in the field.

• Chapter 6 – Conclusion and Future Work, encapsulates the project's key findings and the insights derived from the study's exploration of speech emotion recognition with neural network models. It discusses the implications of these findings and offers recommendations for future research directions and practical implementations to enhance SER methodologies.

• Chapter 7 – References, compiles a comprehensive list of sources and materials cited


throughout the thesis, providing readers with an extensive repository for further exploration and

validation of the presented information.


2 LITERATURE REVIEW

The Literature Review chapter delves into the existing body of knowledge and research

relevant to Speech Emotion Recognition (SER) and the use of Convolutional Neural Networks (CNN)

and Long Short-Term Memory networks (LSTM) in the field. This section serves as the foundation

for understanding the evolution of SER, the significance of AI-driven techniques, and the context in

which this thesis is situated.

2.1 Evolution of Artificial Intelligence (AI)

The aspiration to create machines with the capacity for thought and intelligence has been a

long-standing desire throughout human history. This aspiration traces back to ancient times, with

legendary figures such as Pygmalion, Daedalus, and Hephaestus often interpreted as early inventors

of artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997). The vision of intelligent

machines predates the creation of programmable computers by over a century (Lovelace, 1842).

Today, the field of artificial intelligence (AI) has transformed into a thriving area of study with

practical applications and ongoing research endeavors.

2.1.1 Early Challenges in Artificial Intelligence

In its early days, AI mainly tackled tasks that humans found mentally challenging but that were straightforward for computers: tasks governed by strict, formal rules that lend themselves to computational solutions. An impressive early achievement was IBM's chess-playing system, Deep Blue, which defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess offered a precise playing field with fixed rules that could be programmed into the computer. While AI could solve such rule-based tasks, the real test was to crack tasks that humans perform easily but find hard to describe in formal terms.


2.2 Machine Learning and the Quest for Representation

Machine learning sparked a crucial change in AI. It made computers able to learn from past

events and grasp the world in a layered way. Each layer is based on simpler ones. This method eased

the dependence on humans to provide all the knowledge a computer needs directly. Learning from

past events let computers skip the complications of human-set rules. They could grasp the key points

from raw data.

2.3 Representation Learning and Deep Learning

A crucial aspect of machine learning is representation learning, which involves discovering

representations that explain observed data effectively. Representation learning algorithms, such as

autoencoders, allow computers to build complex concepts from simpler ones. Deep learning, a subset

of representation learning, focuses on forming a deep hierarchy of concepts, each constructed from

simpler components. The deep learning approach involves learning representations by considering

each layer as a state of memory and executing a sequence of instructions, making it highly capable of

capturing complex relationships and patterns within data.

2.4 The Role of Depth in Deep Learning

The concept of "depth" in deep learning is multifaceted. It can be measured based on the

number of sequential instructions or the depth of the graph describing relationships between concepts.

The choice of perspective influences the interpretation of depth in deep learning models. What

remains consistent is that deep learning models are designed to discover complex representations by

iteratively combining simpler ones, enabling the extraction of intricate patterns from data.

This section sets the stage for the in-depth exploration of deep learning and its relevance to

our thesis on Speech Emotion Recognition using AI with CNN and LSTM. It establishes the historical

context and evolution of AI and machine learning, ultimately leading to the emergence of deep

learning as a promising approach for complex tasks.

2.5 Related Work

In this section, we provide an overview of the related work in the field of speech emotion

recognition (SER). The SER domain encompasses various aspects, including feature extraction,

classification methods, and applications in different sectors. We discuss the key research contributions

in these areas and summarize their findings and insights.

In simple terms, speech is a sequence of sounds we use to convey our feelings and thoughts. A speech

signal carries various types of information, including the identity of the speaker, their gender, the

message they want to convey, the language they are speaking, and even their emotional state. This

realization has led researchers to consider speech a powerful medium for interaction between humans

and machines (Frant & Stoica, 2017; Schuller, 2018). While significant progress has been made in

speech recognition over the last two decades (Nassif et al., 2019), the task of recognizing emotions

from speech signals still presents challenges that require further investigation.

Speech carries a complexity and richness, where even small changes in tonal attributes can

alter the meaning of the same words. Therefore, detecting emotions from voice remains a valuable

area of research. Psychologists have developed various theories about emotions, and two prevalent

models for vocal emotions are the dimensional and discrete (or categorical) models. The discrete

model categorizes emotions into specific, distinct categories, such as happiness, sadness, and fear,

assuming clear boundaries between these emotions (Ekman, 1992).

However, some researchers, like Ronan et al. (2018), argue that this categorization is

insufficient to express the full spectrum of human emotions. This has led to the development of

dimensional models considering emotions along broader dimensions. The widely accepted

two-dimensional model classifies emotions based on valence (positive to negative) and arousal (low to

high) (Russell, 1980). To account for more complexity, a three-dimensional model introduces tension

measured on a tense-to-relaxed scale (Sarprasatham, 2015), while a four-dimensional model adds

intensity and potency dimensions to valence and arousal (Fontaine et al., 2007).

Despite these advancements, dimensional models are criticized for not distinguishing certain

emotions like fear and anger. The subject of emotion measurement remains highly subjective,

influenced by personal and cultural differences. Recent research by Cowen et al. (2018) introduces

an intriguing hypothesis suggesting that people can perceive over 20 different emotions in wordless

sounds. They tested this theory through various approaches, involving more than 2000 sounds and

1000 participants. Through a combination of free-choice and forced-choice descriptions, they

identified 24 reliable emotional categories, demonstrating that emotion is a complex and multifaceted

aspect.

These findings highlight the need for deeper exploration and understanding of emotions,

particularly in the context of Speech Emotion Recognition (SER) systems. The proposed model in

this study aims to address the challenges associated with misclassifying different emotional states,

building upon the evolving understanding of emotions and their expression in speech.

Advancements in Speech Emotion Recognition (SER) through Deep Learning

Artificial intelligence (AI) has seen remarkable growth over time, and machine learning has become a stand-out area within it. Deep learning, a machine learning technique loosely modelled on how the brain works, has grown steadily in importance. Within the landscape of Speech Emotion Recognition (SER), artificial neural networks, a form of deep learning, have gained considerable ground, giving researchers working on emotion detection from speech faster routes to improving their systems (Nassif et al., 2019).

Deep Learning Models in SER: A Revolution in Emotion Detection

Deep learning models have been extensively explored to enhance SER. Zhao et al. (2017)

conducted research on phoneme recognition and SER, demonstrating that the Recurrent

Convolutional Neural Network (RCNN) model could effectively detect emotions with a weighted

accuracy of 53.6% on the IEMOCAP dataset. This research prompted Microsoft to investigate

pitch-based features and deep neural networks, leading to an accuracy of 54.3% (Han et al., 2014). Zhao et

al. (2017) countered this by proving that comparable results could be achieved using spectral features

alone with the RCNN model, highlighting the versatility of deep learning approaches. In the same

year, Microsoft introduced another research by Mirsamadi et al. (2017) involving Recurrent Neural

Network (RNN) with local attention, achieving an accuracy of 61.8% on the same dataset for four

classes of emotions. Fayek et al. (2017) explored multiple deep-learning methods and achieved an

impressive 64.78% accuracy on the same dataset with five emotional classes. Their findings revealed

the superiority of frame-based feature models over utterance-based feature models.

Feature Selection

Feature selection has been a focus of several studies to identify speech's most relevant

emotional components. Liu et al. (2013) explored various techniques, including the Fisher criterion,

distance analysis, partial correlation analysis, and bivariate correlation analysis, to determine the best

feature subset for recognizing emotions. They used an extreme learning machine (ELM) to build a

decision tree that can effectively classify emotions. The study concluded that ELM was a particularly

best fit for the decision tree method, and feature selection techniques like the Fisher criterion and

correlation analysis had been thoroughly verified (Liu, Z. T., Li, X., & Chen, W., 2013).


Another area for feature extraction is the application of deep auto-encoders (DAE). In the

research conducted by Wang et al. (2014), a DAE method with five hidden layers was used to extract

speech emotion characteristics. Alongside DAE, standard features like MFCC, Perceptual Linear

Prediction cepstral coefficients (PLP), and LPCC were extracted from speech signals. When all these

features were utilized as input for a Support Vector Machine (SVM) model, the findings indicated

that DAE-extracted features exhibited a clear advantage over other feature types, highlighting the

potential of deep learning in improving feature extraction for SER (Wang, F., Yang, J., Chen, H., &

Wu, J., 2014).

A Leap Towards Multidimensional SER: Deep Learning Breakthroughs

The journey of integrating deep learning into SER continued with innovative models. Chen et

al. (2018) introduced a three-dimensional attention convolution Recurrent Neural Network (CRNN)

SER model, leveraging Mel spectrogram features, which achieved high accuracy on both IEMOCAP

and Emo-DB datasets. Zhao et al. (2018) proposed an improved method by merging 1D and 2D

Convolutional Neural Networks (CNNs) to reach remarkable accuracy rates of 86.36% on IEMOCAP

and 91.78% on Emo-DB for seven emotional classes. These advancements signaled a shift towards

more robust and multidimensional SER, reflecting the power of deep learning in capturing intricate

emotional nuances.

Deep Learning Revolution in Music Emotion Recognition with CNN-LSTM

In music emotion identification, a groundbreaking technique emerged through the

convolutional long short-term memory deep neural network (CLDNN) architecture (Hizlisoy et al.,

2021). This approach was implemented and tested on a novel collection of 124 Turkish traditional

music snippets. This approach harnessed log-Mel filter bank energies, MFCCs, and essential acoustic


characteristics to recognize emotions effectively. Remarkable results were achieved by combining the

LSTM and DNN classifiers, incorporating the new features with traditional ones. Compared to

conventional methods like KNN, SVM, and random forest classifiers, the LSTM+DNN classifiers

demonstrated superior accuracy. This approach, consisting of four convolutional layers, one LSTM

layer, and fully connected layers, showcases the potential of the CNN-LSTM fusion for music

emotion recognition.

Exploring the Landscape of Speech Emotion Recognition with CNN-LSTM

As the demand for efficient real-time Speech Emotion Recognition (SER) continues to grow

in human-computer interactions, it is essential to comprehensively investigate SER's current

approaches and datasets to arrive at optimal solutions for this ongoing challenge (Abbaschian et al.,

2021). This article delves into the deep learning techniques for SER using publicly available datasets

and discusses traditional machine learning methods for speech emotion detection. Furthermore, it

offers a multifaceted exploration of SER techniques using functional neural networks, shedding light

on the nuances of speech emotion recognition. In this context, CNN-LSTM architectures have taken

the lead, showcasing their capabilities to address emotion detection challenges due to their enhanced

low-level and short-term discriminative skills. The integration of LSTM networks in CNN models

has further improved the network's performance, allowing it to recognize long-term paralinguistic

patterns and demonstrating exceptional speaker-independent emotional processing abilities.

The Convergence of Audio and Lyrics in Music Emotion Classification with CNN-LSTM

Music emotion classification has posed a challenging yet intriguing problem, leading to

innovative approaches in artificial intelligence and machine learning. Chen and Li (2020) proposed a

hybrid network classifier that integrates audio and lyrics using a CNN-LSTM architecture, marking a


departure from the limitations of single network classification paradigms. This hybrid model

leverages two-dimensional and one-dimensional emotional features and significantly enhances

classification accuracy compared to the single-modal classification approach. The study underscores

the critical role of audio and lyrics as crucial elements for categorizing music based on its emotional

content. It highlights the potential for further exploration in multimodal music emotion detection

through deep learning. CNN-LSTM-based models continue to pave the way for a more

comprehensive understanding of emotions in music, significantly contributing to the evolving

landscape of music emotion recognition.

A research study completed by Latif et al. focused on boosting the precision of SER systems.

They utilized a new approach called transition learning. This method was particularly effective when

dealing with different languages and databases. Compared to other models like support vector

machines (SVMs) and sparse autoencoders, deep belief networks (DBNs) performed better. DBNs

gave more precise results in emotion recognition across five databases in three languages. An

interesting observation was that using various languages during training significantly boosted

accuracy while limiting target data. This improvement was evident even in databases with minimum

training examples.

Zhao et al. proposed two CNN+LSTM networks, one 1D CNN+LSTM network and one 2D

CNN+LSTM network, to learn local and global emotion-related features from speech and log-Mel

spectrograms, respectively. The architecture of the two networks is identical, with four regional

function learning blocks (LFLBs) and one LSTM layer in each. LFLB is designed to learn local

correlations and derive hierarchical correlations, and it consists primarily of one convolutional layer

and one max-pooling layer. The LSTM layer is used to learn long-term dependencies from the locally

known functions.


Sun and colleagues introduced a technique combining a sparse autoencoder with an attention mechanism. The sparse autoencoder was used to learn from both labelled and unlabelled data, while the attention mechanism focused on the segments of speech that carry the most emotional content, so that emotionally neutral segments receive less weight. They evaluated the technique on three public databases in a multilingual setting and found that, compared with other popular approaches to recognizing emotions in spoken language, it delivered reliable results.

Jiang et al. suggested a feature representation extraction method based on deep learning from

heterogeneous acoustic feature groups that could include redundant and irrelevant content, resulting

in poor emotion recognition output. A fusion network is trained to jointly learn the

discriminative acoustic feature representation, with an SVM as the final classifier after the informative

features are obtained. The proposed architecture increased recognition efficiency by 64% compared

to current state-of-the-art methods, according to experimental findings on the IEMOCAP dataset.

Pandey et al. [29] provided an overview of deep learning strategies for extracting and

classifying emotional states from speech utterances. They investigate the most commonly used simple

deep learning architectures in the literature. On the two standard datasets, Emo-DB and IEMOCAP,

architectures such as CNN and LSTM were used to measure the emotion capture capability of various

standard speech representations such as Mel-spectrograms, magnitude spectrograms, and MFCCs.

The experiments’ results and their reasoning have been discussed to determine which architecture and

function combination is best for speech emotion detection.

Meng et al. employed the bidirectional LSTM along with CNN to recognize speech emotions.

In addition, they adopted the Mel-spectrogram features in the 3D space as the main features used to

train the CNN network. That model was evaluated based on IEMOCAP and Emo-DB datasets.


Although the results achieved by this model are promising, they lack generalization, as the model

performs well on the training data; however, the performance is worse on the test set.

Zhen et al. proposed a model composed of CNN, BLSTM, and SVM for recognizing speech

emotions based on log-Mel spectrogram features. The model is evaluated on the IEMOCAP dataset

and performs better when compared with another approach in the literature. Despite the promising

performance of the model, it still needs to be evaluated using other datasets to show its generalization

capability. On the other hand, the study presented in [32] showed the performance of various models

used in SER using six speech datasets. This study concluded that the CNN+LSTM model performs

better than the other models for five of the six datasets.

Lili Guo et al. employed a kernel extreme learning machine (KELM) to classify speech

emotion classes. This approach uses a fusion of spectral features to train the presented model. The

evaluation of this model is performed in terms of two datasets, Emo-DB and IEMOCAP. However,

the results show promising performance on only one dataset, which means the approach lacks proper

generalization. In addition, the authors concluded that the fusion of the spectral features allows the

models to achieve higher classification accuracy.

Misbah et al. investigated the application of a deep convolutional neural network (DCNN) to

extract features from the log-Mel spectrogram of the raw speech. The study employed four datasets:

IEMOCAP, Emo-DB, SAVEE, and RAVDESS. The classification of speech emotions is performed

using four classifiers: SVM, random forest, k nearest neighbors, and neural networks. The

performance of these classifiers is promising; however, no single classifier could perform well on the

four datasets. This indicates that these classifiers lack generalization capability.

Sonawane et al. demonstrated a deep learning approach for speech emotion understanding. A

multilayer convolutional neural network is used with a basic K-nearest neighbor (KNN) classifier to

classify emotions such as positive, negative, indifferent, disgust, and surprise. The combination of


MFCC-CNN and the KNN classifier performs better than the current MFCC algorithm, according to

experimental findings on a real-time database obtained from the open-access social media site

YouTube.

Sajjad et al. presented a new SER system focused on Radial basis function network (RBFN)

similarity calculation in clusters and the main sequence segment selection method. The STFT

algorithm transforms the chosen sequence into a spectrogram, which is then fed into the CNN model,

which extracts the discriminative and salient features from the speech spectrogram. Additionally, to

ensure precise recognition performance, CNN features were normalized and fed to the deep

bidirectional long short-term memory (BiLSTM) for emotion recognition based on the learned

temporal information.

In conclusion, integrating deep learning techniques, particularly the fusion of CNN and LSTM

networks, has ushered in a new era of accuracy and efficiency in Speech Emotion Recognition and

Music Emotion Recognition. These models have showcased their capabilities in capturing human

emotions' intricacies and paved the way for more nuanced and multidimensional emotion recognition systems.


3 Methodology
3.1 Business Understanding:

It's essential to understand emotions in speech. This is useful in many areas, such as human-

computer interaction, sentiment analysis, customer service, and mental health evaluations.

Recognising emotions from speech can enhance user interaction, improve customer service, and support

mental health assessment.

In our digital world, we often interact with technology. Hence, machines need to understand

human emotions. Consider customer service. Knowing a user's emotions can make interactions more

personal and increase service quality. In mental health evaluations, machines that can recognise

emotions could help experts carry out early assessment and intervention.

Imagine a world where we could understand emotions in voices through machine learning.

This research aims to make that happen, bridging the gap between speech and emotional intelligence.

The potential impacts could be widespread. Think of customer service where an AI can soothe angry

customers. Or in healthcare, where speech analysis helps diagnose mental health problems. In

education, personalised learning could keep students engaged. Video games could even change

according to the player's feelings, making the play more personal. In essence, this research envisions

accurately capturing emotions from speech. It seeks to transform businesses, technology, medicine,

and relationships.

3.2 Data Understanding

For this research, I have used the Ryerson Audio-Visual Database of Emotional Speech and

Song (RAVDESS), Toronto emotional speech set (TESS) and Crowd Sourced Emotional Multimodal

Actors Dataset (CREMA-D) datasets for training the model.



The speech portion of the RAVDESS used here contains 1440 files (24 actors × 60 trials = 1440). Twenty-four professional actors (twelve female and twelve male) perform two lexically matched statements in a neutral North American accent. The spoken emotions include calm, happy, sad, angry, fearful, surprised, and disgusted expressions. Each expression is produced at two levels of emotional intensity (normal and strong), along with an additional neutral expression.

Regarding the TESS dataset, two actresses, ages 26 and 64, each performed a set of 200 target

words in the carrier phrase "Say the word _." The set was recorded depicting the seven emotions

(anger, disgust, fear, happiness, pleasant surprise, sorrow, and neutral). In total, there are 2800 data

points (audio files).

The dataset is organised so that each of the two actresses and each emotion has its own folder, containing the audio files for all 200 target words. The audio files are in WAV format.

CREMA-D is an emotional multimodal actor dataset comprising 7,442 original clips from 91 actors (48 male and 43 female) between the ages of 20 and 74, representing a range of racial and ethnic backgrounds (African American, Asian, Caucasian, Hispanic, and Unspecified).

For testing, I have used the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset. Four native English male speakers (DC, JE, JK, and KL), postgraduate students and researchers at the University of Surrey aged 27 to 31, provided the data for the SAVEE database. The recordings cover anger, disgust, fear, happiness, sadness, and surprise, distinct categories that psychology has used to characterise emotion, with a neutral category also included to give recordings of 7 emotion categories.

Each emotion was represented by 15 TIMIT sentences: two emotion-specific, three common, and ten phonetically balanced generic sentences. To obtain 30 neutral sentences, the 2 × 6 = 12 emotion-specific sentences and the three common sentences were also recorded as neutral. As a result, each speaker produced a total of 120 utterances.

3.3 Data Preparation:

Several steps were followed in creating a dataset for detecting emotion in speech. Firstly, audio

data were gathered from open-source databases such as RAVDESS, CREMA-D, and TESS,

which contain diverse emotional content. The data were then converted into a single format (WAV) and

adjusted to the same sampling rate for consistency. In the data, each audio clip is tagged with its

respective emotion. This labelling process made it easier to differentiate clips by emotion. To convert

the audio into data readable by machines, we used feature extraction techniques such as Mel-

frequency cepstral coefficients (MFCCs) and spectral attributes. The dataset was enhanced by

reducing noise and ensuring each emotional class was well represented, using oversampling and

augmentation methods. After preparing the dataset, we split it into training, validation, and test sets.

These sets became the foundation for our emotion recognition models.
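
To make the preparation steps concrete, the following is a minimal sketch of the kind of waveform augmentation and data splitting described above. The noise level, pitch-shift amount, split ratios, and the variables X and labels are illustrative assumptions rather than the project's exact settings.

import numpy as np
import librosa
from sklearn.model_selection import train_test_split

def add_noise(y, noise_level=0.005):
    # Inject low-amplitude Gaussian noise into the waveform.
    return y + noise_level * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    # Shift the pitch by a couple of semitones without changing the duration.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# X is the feature matrix and labels the emotion labels built during feature
# extraction; a stratified split keeps the class balance similar in each subset.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)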

3.4 Modeling:

This research focused on how well Neural Network (NN) and Long Short-Term Memory (LSTM) models could identify emotions in speech data. The dataset consisted of speech clips labelled with different emotions. The data were processed into model-ready inputs using Mel-frequency cepstral coefficients (MFCCs), which convert raw audio into numerical features. The NN model used hidden dense layers with ReLU activation functions, while the LSTM model used stacked, bidirectional LSTM cells to capture patterns in the data over time. Training used the Adam optimizer, with the learning rate and batch size tuned to avoid overfitting. Accuracy, precision, recall, and F1-score measured how well the models performed. Both models identified emotions from speech reasonably well, but the LSTM model performed better, thanks to its ability to model temporal structure, suggesting that LSTM models hold considerable promise for emotion recognition from speech. Both techniques are discussed in detail below:

Neural Network:

Neural Networks (NNs) are a class of powerful machine-learning models based on the

structure and functionality of the human brain. They excel in handling complex tasks, learning

intricate patterns, and making predictions from data. Neural networks consist of interconnected nodes,

or neurons, arranged in layers that work collaboratively to process information.

The basic structure of a Neural Network comprises three main types of layers:

1. Input Layer: This layer receives the initial data or features to be processed.

Each node in this layer represents a feature of the input data.

2. Hidden Layers: These layers lie between the input and output layers and

perform complex computations by applying weights to the input and passing it through

activation functions. The number of hidden layers and the number of neurons within each

layer can differ based on the complexity of the problem.

3. Output Layer: This layer produces the network's final predictions or outputs

based on the computations performed in the hidden layers. The number of nodes in the output

layer depends on the nature of the problem, for example, classification or regression.

Within a neural network, connections between neurons are represented by weights,

strengthening the connections. During the learning process, these weights are adjusted to minimize

the error between predicted outputs and actual targets.

Moreover, biases are additional parameters within each neuron that shift its activation, providing

flexibility to the network by enabling it to fit more


complex functions.

Neural networks rely heavily on activation functions. These functions allow networks to

handle and learn from complex data by adding a factor of non-linearity. Here are a few well-known

activation functions:

• Rectified Linear Unit (ReLU): Returns the input unchanged if it is positive and zero otherwise. ReLU is popular because it is simple and helps mitigate vanishing gradient issues.

• SoftMax: Commonly used for multi-class outputs. It converts raw scores into probabilities that sum to 1, giving the predicted probability of each class.

Neural networks learn iteratively: they take in data, predict the outcomes, compare the predictions with the actual results, and adjust their weights to reduce the error. Optimization methods such as gradient descent drive this process.

The architecture and adaptiveness of Neural Networks, combined with their ability to learn

from data, make them invaluable. They have had incredible influence in areas such as image and

voice recognition, language processing, and data prediction.
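
As a small, self-contained illustration (not taken from the project code), the two activation functions described above can be written in a few lines of NumPy:

import numpy as np

def relu(x):
    # Returns the input where it is positive and zero elsewhere.
    return np.maximum(0.0, x)

def softmax(x):
    # Shift by the maximum for numerical stability, then normalize the
    # exponentials so the outputs sum to 1.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

scores = np.array([2.0, -1.0, 0.5, 0.5, -3.0])   # e.g. raw scores for 5 emotion classes
print(relu(scores))      # [2.  0.  0.5 0.5 0. ]
print(softmax(scores))   # five probabilities that sum to 1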

LSTM (Long Short-Term Memory):

Long Short-Term Memory (LSTM) networks are an enhancement of Recurrent Neural Networks (RNNs), designed to retain important information across long stretches of a sequence. Standard RNNs often struggle with this because of the vanishing or exploding gradient problem, which limits their usefulness on long sequences.

LSTM networks were made to beat this issue. They did it by adding special memory cells,

making it easier for them to remember things over different time steps. These networks have the

unique power to decide what to remember, update, or forget. Because of this, they're suitable for jobs


that deal with a sequence of data, like checking a series of numbers over time, processing language,

recognizing speech and more.

The two things that make LSTMs different from normal RNNs are memory cells and these

Gating Mechanisms. They help the network remember and manage long sequences:

1. Memory Cells: LSTMs have these memory cells. They work like a storage unit,

able to keep information for a long time. These cells have a state vector that can change over

time. This way, the network can selectively remember or forget stuff depending on its

relevance to the task at hand.

2. Gating Mechanisms (Forget, Input, Output Gates): LSTMs use gates to control

how information moves within the memory cells. The gates use two mathematical operations,

sigmoid and element-wise multiplication, to control the flow of information.

Forget Gate: Decides which details in the memory state should be let go or erased. It looks at

the current entry and the past condition to determine what details are no longer helpful for future

forecasts.

Input Gate: Chooses to revamp the memory state by pinpointing fresh data to be stored. This

gate figures out the significance of the new data along with the present condition.

Output Gate: Decides the cell's output based on the updated state. This gate controls which

sections of the memory state are used to produce the output at the current time step.

LSTMs, with these gate mechanisms, can learn when to keep or let go of information. This

ability helps lessen the vanishing and exploding gradient problems common in standard RNNs. This

allows LSTMs to model and predict long-distance sequences accurately and efficiently.

LSTMs have proved their worth in real-life tasks where context and time relationships are

key. For example, they excel in predicting stock prices, scrutinizing feelings in writings, creating

meaningful text sequences, and handling time series data. The talent for taking in and holding on to


context over long sequences has made them an essential structure in studying and modelling

sequential data.
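
To make the structure concrete, the following is a minimal Keras sketch of a stacked, bidirectional LSTM classifier of the kind referred to in Section 3.4. The layer sizes and the (timesteps, features) input shape are illustrative assumptions rather than this project's exact configuration.

import tensorflow as tf

# Illustrative shapes: sequences of 128 frames, each with 40 MFCC features,
# classified into 5 emotion classes. These values are assumptions.
num_timesteps, num_features, num_classes = 128, 40, 5

lstm_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(num_timesteps, num_features)),
    # The first bidirectional LSTM layer returns the full sequence so that a
    # second recurrent layer can be stacked on top of it.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    # The second bidirectional LSTM layer summarizes the sequence into one vector.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

lstm_model.compile(optimizer="adam",
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])
lstm_model.summary()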

MFCC:

MFCC or Mel-frequency cepstral coefficients are extensively utilized to extract features in

audio or speech signal processing. They originate from the audio signal's short-term power spectrum

with the aim of capturing features similar to our human hearing system.

Here are the steps to compute MFCCs:

1. Signal Framing: The audio signal is chopped into brief frames, typically

between 20 and 40 milliseconds, with partial overlaps.

2. Pre-emphasis: This process uses a pre-emphasis filter, which amplifies

high-frequency sounds, making the signal clearer.

3. Windowing: Here, frames get multiplied by a window function (like Hamming

or Hanning) to decrease spectral leakage.

4. Fast Fourier Transform (FFT): FFT is applied to every frame, altering the

signal from the time domain into the frequency domain.

5. Mel Filterbank: The power spectrum goes through a set of triangular filters

placed on the Mel-frequency scale. This process emulates how humans perceive sound

non-linearly.

6. Log Compression: The filterbank energies are passed through a logarithm, mirroring the logarithmic way humans perceive loudness.

7. Discrete Cosine Transform (DCT): The last step applies a DCT to the log filterbank energies, yielding a set of coefficients (MFCCs) that represent the spectral features of each audio frame.


MFCCs offer a compact representation of the audio signal, capturing the essential spectral information while discarding less distinctive detail. Because they approximate characteristics of the human auditory system, they are popular in applications such as emotion recognition from audio, speaker identification, and speech recognition.

3.5 Research ethics:

I have taken steps to align my research with data protection standards, emphasizing my

commitment to ethical conduct. I can confidently affirm that there has been no misuse or misrepresentation of the data considered. The dataset remains unaltered and openly accessible on

Kaggle for research purposes. By strictly adhering to data protection regulations, I have prioritized

transparency and integrity throughout the research process, ensuring responsible and ethical handling

of the information. The public availability of the dataset on Kaggle fosters collaboration and supports

research initiatives. Additionally, for data preparation, augmentation and concatenation were

performed with other publicly available datasets.


4 Implementation

The process started with creating a Pandas DataFrame named ravdess_dataframe, containing file paths, gender categories, and emotion labels. The value_counts() method was used on the 'emotion_label' column to observe the count of each emotion label in the DataFrame.

The dataset was largely balanced, with 192 samples for each of the emotions 'angry', 'fear', 'disgust', 'sad', and 'happy', but only 96 samples for the 'neutral' emotion.

Figure 1 Emotion and Gender Distribution

The above bar plot shows the distribution of emotions across genders within the dataset, which is again broadly balanced.

A waveform plot was then generated for an audio clip expressing the 'sad' emotion, and the clip was played back. This was done purely to build familiarity with the dataset; a sketch of this step follows.
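The sketch below illustrates this exploratory step. The column name "path" and the plotting details are assumptions made for the example rather than the project's exact code.

import librosa
import librosa.display
import matplotlib.pyplot as plt
from IPython.display import Audio

# Hypothetical lookup of one 'sad' clip from ravdess_dataframe (the "path" column name is assumed)
sad_path = ravdess_dataframe.loc[ravdess_dataframe["emotion_label"] == "sad", "path"].iloc[0]
y, sr = librosa.load(sad_path)

plt.figure(figsize=(10, 3))
librosa.display.waveshow(y, sr=sr)      # waveform plot of the clip
plt.title("Waveplot - sad emotion")
plt.show()

Audio(sad_path)                         # plays the clip in a notebook environment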


Figure 2 Waveplot for sad emotion audio - RAVDESS dataset

The above image shows a waveform plot for a specific emotion (sad in this case) from an audio file in the RAVDESS dataset.

Then, the next step was the process of extracting Mel-frequency cepstral coefficients (MFCC)

from audio files using the Librosa library. It loads the audio files, normalizes them, computes the

MFCC features, and calculates the mean of these features. The extracted MFCC features and their corresponding emotion labels are then stored in a new DataFrame, which consisted of 960

samples, each containing 128 MFCC features. The emotion labels were one-hot encoded after being

encoded with Label Encoder with the help of Scikit-learn, which resulted in a shape of (960, 5) for

the labels. This preprocessed dataset is suitable for training machine learning or deep learning models

to predict emotions based on the extracted MFCC features from audio recordings.
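A minimal sketch of this feature-extraction and encoding pipeline is given below. The "path" column name and the use of 128 coefficients mirror the description above but are assumptions for illustration, not the project's verbatim code.

import numpy as np
import librosa
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

def mfcc_mean(path, n_mfcc=128):
    # Load, normalise and average the MFCCs over time to get one vector per file
    y, sr = librosa.load(path)
    y = librosa.util.normalize(y)
    return np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)

features = np.stack([mfcc_mean(p) for p in ravdess_dataframe["path"]])   # one 128-dim vector per clip
encoder = LabelEncoder()
labels = to_categorical(encoder.fit_transform(ravdess_dataframe["emotion_label"]))   # one-hot labels, e.g. (960, 5)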

A neural network model was then trained using TensorFlow's Keras API. The model

architecture was built with four dense layers: three hidden layers with 256 neurons each, employing

ReLU activation functions, and an output layer with five neurons using a softmax activation function

for multiclass classification. The model was then compiled using the Adam optimizer and categorical

cross-entropy loss function while tracking metrics such as accuracy, precision, recall, and area under

the curve (AUC). An EarlyStopping callback was employed to monitor the validation loss and prevent

overfitting by restoring the best weights after observing no improvement for three consecutive epochs.

During training with 15 epochs and a batch size of 32, the model's performance metrics (accuracy,

precision, recall, AUC) and loss values for training were recorded and visualized using Matplotlib. The

accuracy and loss curves plotted against the number of epochs show the model's learning progress

and generalization performance on the training data, aiding in assessing its training behavior, potential

overfitting, or underfitting tendencies.
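A hedged sketch of this architecture and training setup is given below. The layer sizes, metrics, early-stopping patience, epochs and batch size follow the description above; the validation split is an assumption made for illustration, and features/labels refer to the mean-MFCC matrix and one-hot labels from the previous sketch.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Dense(256, activation="relu", input_shape=(128,)),   # hidden layer 1
    Dense(256, activation="relu"),                       # hidden layer 2
    Dense(256, activation="relu"),                       # hidden layer 3
    Dense(5, activation="softmax"),                      # one output per emotion class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy", tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])

early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
history = model.fit(features, labels,
                    validation_split=0.2,                # assumed split; the report does not state one
                    epochs=15, batch_size=32,
                    callbacks=[early_stop])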

Further, some more data was added using the TESS dataset and CREMA-D for better training

of the model. The RAVDESS, CREMA-D and TESS datasets were concatenated, which contained

audio files expressing various emotions like anger, fear, disgust, sadness, happiness, and neutrality in

different proportions across the three datasets. After consolidating and cleaning the combined

dataset, neutral expressions were removed from the training data, focusing solely on emotional

expressions for model training. Emotionally labeled audio files underwent various augmentation

techniques, including noise addition, time stretching, shifting, and pitch modification, to diversify the

dataset. These augmented data are then used for feature extraction, encompassing essential audio

characteristics such as Zero Crossing Rate, Root Mean Square, Mel-frequency cepstral coefficients,

Chroma_stft, and Mel Spectrogram. The resulting features are standardized, labels are one-hot

encoded, and sequences are uniformly padded to a fixed length. The processed features and labels are

stored as numpy arrays and pickled files, readying them for model training.
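The augmentation and feature-extraction steps can be sketched as follows. The parameter values (noise factor, stretch rate, shift range, pitch steps) are illustrative assumptions rather than the project's exact settings.

import numpy as np
import librosa

def add_noise(y, factor=0.005):
    return y + factor * np.random.normal(size=y.shape)               # additive white noise

def stretch(y, rate=0.9):
    return librosa.effects.time_stretch(y=y, rate=rate)              # time stretching

def shift(y, max_shift=1600):
    return np.roll(y, np.random.randint(-max_shift, max_shift))      # time shifting

def pitch(y, sr, n_steps=2):
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)  # pitch modification

def extract_features(y, sr):
    # Frame-averaged versions of the features named above, stacked into one vector
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=y))
    rms = np.mean(librosa.feature.rms(y=y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.hstack([zcr, rms, mfcc, chroma, mel])

Each augmented copy of a clip is passed through extract_features, and the resulting vectors are then standardized and padded before training, as described above.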

A sequential model was then initiated and trained on the training data. The model combines convolutional layers (CNN) for feature extraction with recurrent layers (LSTM) for capturing long-term temporal dependencies (see the sketch after this description). Layers are arranged sequentially: two convolutional blocks, followed by two LSTM layers, and finally dense output layers. The input shape was (164, 1), indicating a 1D input sequence with 164 time steps and a single feature channel.

Convolutional blocks: the 1st block used 256 filters, a 5-point kernel, strides of 2, padding to maintain input length, and ReLU activation; the 2nd block used 128 filters, a 5-point kernel, strides of 2, padding, and ReLU activation.

MaxPooling layers: used to reduce dimensionality and extract dominant features.

Dropout layers: regularization to prevent overfitting (20% dropout after each convolutional and LSTM layer).

LSTM layers: the 1st layer had 64 units and returned sequences for further processing; the 2nd layer had 32 units without returning sequences, providing the final temporal context representation.

Dense layers: the 1st layer had 16 units with ReLU activation for intermediate processing; the 2nd layer had 5 units with softmax activation, producing a probability distribution over the five classes.

Training and evaluation: training ran for 50 epochs with a batch size of 32, using categorical cross-entropy loss and the Adam optimizer, with the following metrics tracked: accuracy, precision, recall, and AUC.
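Under the assumptions that 'same' padding and a pool size of 2 were used, the described CNN + LSTM architecture can be sketched as follows; X_train and y_train stand for the padded feature sequences and one-hot labels prepared earlier.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, LSTM, Dense

cnn_lstm = Sequential([
    Conv1D(256, kernel_size=5, strides=2, padding="same", activation="relu",
           input_shape=(164, 1)),                     # 1st convolutional block
    MaxPooling1D(pool_size=2),
    Dropout(0.2),
    Conv1D(128, kernel_size=5, strides=2, padding="same", activation="relu"),   # 2nd block
    MaxPooling1D(pool_size=2),
    Dropout(0.2),
    LSTM(64, return_sequences=True),                  # keeps the sequence for the next LSTM
    Dropout(0.2),
    LSTM(32),                                         # final temporal context representation
    Dropout(0.2),
    Dense(16, activation="relu"),
    Dense(5, activation="softmax"),                   # probability distribution over five classes
])
cnn_lstm.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy", tf.keras.metrics.Precision(),
                          tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])
# cnn_lstm.fit(X_train, y_train, epochs=50, batch_size=32)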

After this, a model was introduced that analyses the data with a combination of convolutional layers, an attention mechanism, and LSTMs, achieving high accuracy, precision, recall, and AUC on the multi-class classification task.

Two artificial neural networks were trained for multi-class classification using a sequential

architecture and Adam optimiser. Both models displayed improvement in accuracy and other metrics

over 50 training epochs with 32 samples per batch.

The first ANN achieved a final accuracy of 73.83% and an AUC of 94.55%, suggesting robust

performance. The second ANN, while starting with lower accuracy, exhibited potential for further

improvement through hyperparameter tuning or additional training. Overall, the results indicate

successful model training.

Further, an ANN-LSTM model was trained for multi-class classification. The model used an LSTM layer with 256 units to capture temporal dependencies, followed by fully

connected layers and dropout for feature extraction and regularization. Finally, a softmax output layer


with five units predicted the class probabilities.
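A minimal sketch of this ANN-LSTM model is shown below; the input shape, hidden dense sizes and dropout rate are assumptions made for illustration, since the report specifies only the 256-unit LSTM and the five-unit softmax output.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

ann_lstm = Sequential([
    LSTM(256, input_shape=(164, 1)),    # temporal dependencies; input shape assumed
    Dense(128, activation="relu"),      # fully connected feature extraction (size assumed)
    Dropout(0.3),                       # regularization (rate assumed)
    Dense(64, activation="relu"),
    Dense(5, activation="softmax"),     # class probabilities for the five emotions
])
ann_lstm.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])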

During training with the Adam optimiser and categorical cross-entropy loss, the model showed steady improvement in accuracy, reaching around 60% on the training set. Loss decreased continuously, reflecting effective error minimization. Recall, initially low, gradually increased, pointing to better identification of positive cases. The AUC score also improved, suggesting enhanced class discrimination. Overall, the model showed promising performance.


5 Evaluation

This section delves into the evaluation of the model performances. Six models were trained in

this research project. These are as follows:

Model Development:

• Model 1 (CNN): This model uses a sequential neural network with dense layers. It achieved moderate scores across accuracy, precision, recall, and AUC on the training set.

• Model 2 (CNN + LSTM): This model extended the training data by including additional samples

from the TESS and CREMA datasets. The performance improved noticeably, achieving higher

accuracy, precision, recall, and AUC scores on the test data compared to Model 1.

• Model 3 (CNN + Attention net + LSTM): The third model combined an attention net with a Long Short-Term Memory (LSTM) layer. This model performed best among all the models.

• Model 4 (ANN): This model performed lower than the CNN and CNN+LSTM models.

• Model 5 (ANN + LSTM): This model performed even lower than the ANN model.

• Model 6 (ANN + More layers of LSTM): This model showed improvement over Model 5 but still fell short of the plain ANN model.


Below is the evaluation of these models:

Figure 3 Accuracy for training data - RAVDESS

Figure 4 Loss for training data - RAVDESS

The graphs visualise the first model's accuracy and loss metrics across epochs for training data

from the RAVDESS dataset, offering insights into the training process and potential overfitting or

convergence issues.


Figure 5 Accuracy/loss for CNN + Attention net + LSTM model

The graphs visualise the CNN + Attention net + LSTM model's accuracy and loss metrics across epochs for training data from the combined dataset (RAVDESS, CREMA-D and TESS), assembled to improve training, offering insights into the training process and potential overfitting or convergence issues.

From the above graphs, it can be observed how the Model Accuracy and Model Loss metrics

showed performance improvement.


Figure 6 Accuracy/loss for ANN + LSTM model

The accuracy graph of the ANN + LSTM model shows that the training accuracy starts at about 0.3 and increases to about 0.6 over the course of 50 epochs. The loss starts at about 1.5 and falls to around 0.8 after 50 epochs.


Figure 7 Accuracy/loss for ANN + multi-layered LSTM model

The loss graph for the ANN + multi-layered LSTM model shows that the training loss starts at about 1.7 and decreases to about 0.6 over the course of 50 epochs, while the accuracy starts at about 0.25 and increases to about 0.60. This suggests that the model is learning from the data, as the loss decreases over time.

Overall, the graphs suggest that the model is learning from the data but is overfitting to the training set.


Table 1 Comparison metrics for different models

Models                        Accuracy   Precision   Recall   AUC
CNN                           0.5698     0.6790      0.4406   0.8401
CNN + LSTM                    0.6887     0.8185      0.5436   0.9235
CNN + Attention Net + LSTM    0.8925     0.9008      0.8843   0.9872
ANN                           0.7383     0.8265      0.6437   0.9455
ANN + LSTM                    0.6030     0.7961      0.4060   0.8747
ANN + More layers of LSTM     0.6164     0.7811      0.4400   0.8812

The table above compares the performance of the six models, evaluated on four metrics: accuracy, precision, recall, and AUC. Here is an analysis of the results:

Overall Performance:

• The CNN + Attention Net + LSTM model achieves the highest accuracy of 0.8925, precision of 0.9008, recall of 0.8843, and AUC of 0.9872, making it the best-performing model of the six.

• CNN + LSTM also performed strongly across all metrics: accuracy 0.6887, precision 0.8185, recall 0.5436, AUC 0.9235.

• The ANN-based models generally performed lower than the best CNN- and LSTM-based models, underlining the importance of capturing temporal dependencies for this task.

• Adding more LSTM layers to the ANN model (ANN + More layers of LSTM) did not significantly improve performance (accuracy: 0.6164), potentially indicating overfitting.


6 Conclusion and Future Work

6.1 Conclusion
This project addressed emotion classification with neural networks using speech data sourced primarily from the RAVDESS, CREMA-D and TESS datasets. The project's primary objective

was to develop efficient models that accurately identify emotions such as happiness, sadness, anger,

fear, disgust, and more from speech samples. The process included data collection, feature extraction

(notably Mel-frequency cepstral coefficients - MFCCs), model development, evaluation, and insights

into potential future work to enhance the current models' performance.

Data Collection and Preparation: The analysis began with collecting audio files from the RAVDESS, CREMA-D and TESS datasets, ensuring a diverse representation of emotional

states. After that, thorough data preprocessing was executed to extract essential features from the

speech samples. The utilization of MFCCs allowed for transforming the speech signals into a format

compatible with neural network-based analysis.

Further, the project had six distinct neural network architectures for the emotion classification

task. Model 1 employed a sequential neural network structure with dense layers; despite moderate performance across accuracy, precision, recall, and AUC, it served as a benchmark for further evaluations. Model 2 extended the dataset by including additional samples from the CREMA-D and TESS datasets. This augmentation notably enhanced the model's performance, yielding marked improvements in accuracy, precision, recall, and AUC scores on the test dataset compared to

Model 1. Furthermore, Model 3, which incorporated an attention net and LSTM layer for sequence

processing, resulted in comparatively better performance metrics than the dense layer-based models.

The evaluation demonstrated promising results, particularly with the augmentation of the

dataset in Models 2 and 3, underscoring the significance of diverse data in improving emotion classification accuracy from speech inputs. Additionally, the comparative analysis indicated the superior effectiveness of the attention net and LSTM-based models over the purely dense-layer architectures in handling the extracted audio features for emotion recognition.

Several recommendations were outlined to further enhance the models' performance. Firstly,

exploring advanced data augmentation techniques or alternative feature engineering approaches could

enrich the dataset, potentially boosting classification accuracy. Secondly, fine-tuning

hyperparameters to optimize the neural network models and improve their generalization capabilities

could be helpful. Thirdly, investigating model interpretability techniques could highlight the most influential features contributing to emotion classification, enhancing the models' transparency and understanding.

In conclusion, the analysis successfully implemented neural network architectures for emotion

classification from speech data, achieving commendable results, especially with the integration of

additional diverse data. However, the scope for improvement remains, including further data

augmentation, hyperparameter tuning, and interpretability exploration. These enhancements could

further improve the models' accuracy and reliability in recognizing emotions from audio samples,

contributing significantly to the field of emotion recognition and human-computer interaction.

6.2 Future Work


Moving forward, future work in this domain could focus on various aspects. Augmenting the

dataset further with a wider variety of emotional states and demographic diversity might improve the model's capacity to generalize across diverse populations. Additionally, exploring advanced neural network architectures or leveraging transfer learning from pre-trained models could potentially yield more robust and nuanced emotion classification frameworks. Moreover, investigating

real-time emotion classification applications or multimodal approaches combining audio and visual

inputs could lead to more comprehensive emotion recognition systems for practical use cases in fields


such as mental health monitoring, human-computer interaction, and beyond. Lastly, continuous

improvement of interpretability methods to explore the neural network model’s decision-making

processes would not only enhance model understanding but also increase trust and acceptance of these

technologies in real-world applications.


7 References

S. R. Livingstone and F. A. Russo, ‘‘The Ryerson audio-visual database of emotional speech and

song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American

English,’’ PLoS ONE, vol. 13, no. 5, pp. 1–35, May 2018.

P. Jackson and S. U. Haq, ‘‘Surrey audio-visual expressed emotion (SAVEE) database,’’ Univ.

Surrey, Guildford, U.K., Tech. Rep., Apr. 2014.

A. Satt, S. Rozenberg, and R. Hoory, ‘‘Efficient emotion recognition from speech using deep

learning on spectrograms,’’ in Proc. Interspeech, Aug. 2017.

J. Chang and S. Scherer, ‘‘Learning representations of emotional speech with deep convolutional

generative adversarial networks,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process.

(ICASSP), Mar. 2017, pp. 2746–2750.

M. Chen, X. He, J. Yang, and H. Zhang, ‘‘3-D convolutional recurrent neural networks with

attention model for speech emotion recognition,’’ IEEE Signal Process. Lett., vol. 25, no. 10, pp.

1440–1444, Oct. 2018.

J. Zhao, X. Mao, and L. Chen, ‘‘Learning deep features to recognise speech emotion using merged

deep CNN,’’ IET Signal Process., vol. 12, no. 6, pp. 713–721, 2018.

P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, ‘‘Speech emotion recognition

using spectrogram & phoneme embedding,’’ in Proc. Interspeech, Sep. 2018.

M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, ‘‘Emotion

identification from raw speech signals using DNNs,’’ in Proc. Interspeech, Sep. 2018.

S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, ‘‘Transfer learning for improving speech

emotion classification accuracy,’’ in Proc. Interspeech, Sep. 2018.

J. Zhao, X. Mao, and L. Chen, ‘‘Speech emotion recognition using deep 1D & 2D CNN LSTM

networks,’’ Biomed. Signal Process. Control, vol. 47, pp. 312–323, 2019.

T.-W. Sun and A.-Y.-A. Wu, ‘‘Sparse autoencoder with attention mechanism for speech emotion

recognition,’’ in Proc. IEEE Int. Conf. Artif. Intell. Circuits Syst. (AICAS), Mar. 2019, pp. 146–

149.

W. Jiang, Z. Wang, J. S. Jin, X. Han, and C. Li, ‘‘Speech emotion recognition with heterogeneous

feature unification of deep neural network,’’ Sensors, vol. 19, no. 12, p. 2730, Jun. 2019.

S. K. Pandey, H. S. Shekhawat, and S. R. M. Prasanna, ‘‘Deep learning techniques for speech

emotion recognition: A review,’’ in Proc. 29th Int. Conf. Radioelektronika

(RADIOELEKTRONIKA), Apr. 2019, pp. 1–6.

H. Meng, T. Yan, F. Yuan, and H. Wei, ‘‘Speech emotion recognition from 3D log-mel

spectrograms with deep learning network,’’ IEEE Access, vol. 7, pp. 125868–125881, 2019.

Z.-T. Liu, P. Xiao, D.-Y. Li, and M. Hao, Speaker-Independent Speech Emotion Recognition Based

on CNN-BLSTM and Multiple SVMs. Aug. 2019, pp. 481–491.

J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, and G. Hofer, ‘‘Analysis of deep

learning architectures for cross-corpus speech emotion recognition,’’ in Proc. Interspeech, Sep.

2019, pp. 1656–1660.

L. Guo, L. Wang, J. Dang, Z. Liu, and H. Guan, ‘‘Exploration of complementary features for speech

emotion recognition based on kernel extreme learning machine,’’ IEEE Access, vol. 7, pp. 75798–

75809, 2019.

M. Farooq, F. Hussain, N. K. Baloch, F. R. Raja, H. Yu, and Y. B. Zikria, ‘‘Impact of feature

selection algorithm on speech emotion recognition using deep convolutional neural network,’’

Sensors, vol. 20, no. 21, p. 6008, Oct. 2020.

S. Sonawane and N. Kulkarni, ‘‘Speech emotion recognition based on MFCC and convolutional

neural network,’’ Int. J. Adv. Sci. Res. Eng. Trends, Jul. 2020.

Mustaqeem, M. Sajjad, and S. Kwon, ‘‘Clustering-based speech emotion recognition by


incorporating learned features and deep BiLSTM,’’ IEEE Access, vol. 8, pp. 79861–79875, 2020.

Mustaqeem and S. Kwon, ‘‘A CNN-assisted enhanced audio signal processing for speech emotion

recognition,’’ Sensors, vol. 20, no. 1, p. 183, Dec. 2019.

N. Vryzas, L. Vrysis, M. Matsiola, R. Kotsakis, C. Dimoulas, and G. Kalliris, ‘‘Continuous speech

emotion recognition with convolutional neural networks,’’ J. Audio Eng. Soc., vol. 68, nos. 1–2, pp.

14–24, Feb. 2020.

N.-H. Ho, H.-J. Yang, S.-H. Kim, and G. Lee, ‘‘Multimodal approach of speech emotion

recognition using multi-level multi-head fusion attention-based recurrent neural network,’’ IEEE

Access, vol. 8, pp. 61672–61686, 2020.

O. Atila and A. Şengür, ‘‘Attention guided 3D CNN-LSTM model for accurate speech-based

emotion recognition,’’ Appl. Acoust., vol. 182, Nov. 2021, Art. no. 108260.

T. Tuncer, S. Dogan, and U. R. Acharya, ‘‘Automated accurate speech emotion recognition system

using twine shuffle pattern and iterative neighbourhood component analysis techniques,’’ Knowl.-

Based Syst., vol. 211, Jan. 2021, Art. no. 106547.

J. Liu and H. Wang, ‘‘A speech emotion recognition framework for better discrimination of

confusions,’’ in Proc. Interspeech, Aug. 2021, pp. 4483–4487.


8 Appendix
8.1 Artefact Links
Google Drive Link -

https://drive.google.com/file/d/19audQU8H0pmuiAxibxCL5d4YGE6trl55/view?usp=sharing

Microsoft OneDrive Link -

https://mydbsmy.sharepoint.com/:u:/g/personal/10621051_mydbs_ie/EecAZ63OQX1Gq_6f3IRr_UBAtxMkfeIQZM6MDgT9JzAeA?e=xRPMI

8.2 Dataset Links

RAVDESS - https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio?resource=download

TESS - https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess

CREMA-D - https://www.kaggle.com/datasets/ejlok1/cremad

SAVEE - https://www.kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee
