
Empowering Communication: The

Evolution and Potential of Speech


Recognition System

Vaibhav Tiwari
(10621051)

Applied Research Project submitted in partial fulfilment of the requirements for the degree of MSc

in Data Analytics at Dublin Business School

Supervisor: Agatha Mattos

January 2024

Declaration

I declare that this Applied Research Project that I have submitted to Dublin Business School

for the award of MSc in Data Analytics is the result of my own investigations, except where

otherwise stated, where it is clearly acknowledged by references. Furthermore, this work has not

been submitted for any other degree.

Signed: Vaibhav Tiwari

Student number: 10621051

Date: 5th January 2024



Acknowledgement

I want to thank Professor Agatha Mattos, my research mentor, for providing me with guidance,

encouragement, and valuable suggestions throughout my research journey. I thank my supervisor for

her practical advice and assistance in my study. Additionally, I extend special thanks to the DBS

library and academic operations for their support in providing necessary reference materials for my

literature work.

Abstract

This project explores the classification of emotions in speech data using neural network

models. The project aims to build robust models to correctly identify emotions such as happiness,

sadness, anger, fear, and disgust from audio clips. The process involves data collection, cleaning,

feature extraction using Mel-frequency cepstral coefficients (MFCCs), and the creation and

assessment of three distinct neural network architectures. The first model, a sequential architecture with dense

layers, is used to compare the performances of the subsequent models. The second model, which

uses an augmented dataset, performed better. The third model, which added an

LSTM layer, produced mixed results compared with the dense-only models. The results underline the importance of

varied data in improving emotion identification accuracy from speech data. This research brings

essential knowledge to the field of neural network-based emotion classification. It sets the

foundation for valuable applications in monitoring areas, including mental health and interactions

between humans and computers.



Table of Contents

Declaration .......................................................................................................................................................2
Acknowledgement ...........................................................................................................................................3
Abstract ............................................................................................................................................................4
List of Tables ....................................................................................................................................................7
List of Figures ...................................................................................................................................................8
1 Introduction .............................................................................................................................................9
1.1 Introduction ...........................................................................................................................................9
1.2 Background ......................................................................................................................................... 10
1.3 Motivation........................................................................................................................................... 11
1.3.1 Enhancing Human-Computer Interaction ........................................................................................ 12
1.3.2 Improving Mental Health Assessment ............................................................................................. 12
1.3.3 Unveiling Valuable Customer Insights ............................................................................................. 12
1.3.4 Human-Emotion Synthesis ............................................................................................................... 13
1.3.5 Unlocking Unseen Insights ............................................................................................................... 13
1.4 Research Objectives ............................................................................................................................ 13
1.5 Research Question .............................................................................................................................. 14
1.6 Report Overview ................................................................................................................................. 14
2 LITERATURE REVIEW ............................................................................................................................. 17
2.1 Evolution of Artificial Intelligence (AI) ................................................................................................ 17
2.1.1 Early Challenges in Artificial Intelligence ......................................................................................... 17
2.2 Machine Learning and the Quest for Representation ........................................................................ 18
2.3 Representation Learning and Deep Learning ..................................................................................... 18
2.4 The Role of Depth in Deep Learning ................................................................................................... 18
2.5 Related Work ...................................................................................................................................... 19
3 Methodology......................................................................................................................................... 28
3.1 Business Understanding:..................................................................................................................... 28
3.2 Data Understanding ............................................................................................................................ 28
3.3 Data Preparation: ................................................................................................................................ 30
3.4 Modeling: ............................................................................................................................................ 30
3.5 Research ethics: .................................................................................................................................. 35
4 Implementation .................................................................................................................................... 36


5 Evaluation ............................................................................................................................................. 41
6 Conclusion and Future Work ................................................................................................................ 48
6.1 Conclusion ........................................................................................................................................... 48
6.2 Future Work ........................................................................................................................................ 49
7 References ............................................................................................................................................ 51
8 Appendix ............................................................................................................................................... 54
8.1 Artefact Links ...................................................................................................................................... 54
8.2 Dataset Links ....................................................................................................................................... 54


List of Tables

Table 1: Comparison metrics for different models .............................................................. 46


List of Figures

Figure 1: Emotion and Gender Distribution ......................................................................... 36


Figure 2: Waveplot for sad emotion audio - RAVDESS dataset......................................... 37
Figure 3: Accuracy for training data - RAVDESS ............................................................... 42
Figure 4: Loss for training data - RAVDESS ....................................................................... 42
Figure 5: Accuracy/loss for CNN+Attention net+LSTM model .......................................... 43
Figure 6: Accuracy/loss for ANN+LSTM model ................................................................ 44
Figure 7: Accuracy/loss for ANN+multi-layered LSTM model ........................................... 45


1 Introduction

This chapter introduces the subject, gives background information, emphasizes the

motivation, details the research objectives, poses research questions, and provides a structure for the

report's organization, setting the scene for the remaining study.

1.1 Introduction

Human communication is an intricate and multifaceted phenomenon. At its core, speech is the

primary medium through which we express thoughts, convey information, and engage with one

another. Yet, beneath the surface of words and sentences lies a richer layer of human

interaction—emotions. These intangible, ever-shifting states of mind are as fundamental to our

conversations and relationships as our words. Understanding and interpreting these emotional nuances

within speech has become a focal point of the study, especially in the backdrop of the swift and

profound advancements in artificial intelligence (AI).

Emotions are the colour palette that paints the canvas of human communication.

They are the joy that fills laughter, the sorrow that trembles in a sigh, the excitement that crackles in

enthusiasm, and the serenity that flows in comforting words. For centuries, humans have been aware

of the emotional richness that speech carries. Still, the ability to systematically recognise, quantify,

and categorise these emotional cues within spoken language has been a complex and elusive

challenge.

In the current age, we stand at the precipice of a remarkable evolution in technology and

understanding. The rise of artificial intelligence has revolutionised the way we interact with machines

and the world around us. It has ushered in an era where machines are becoming increasingly proficient

at tasks that once required human insight and judgment. The amalgamation of AI and studying human


emotions through speech has birthed a promising field known as Speech Emotion Recognition (SER).

This intersection of AI and emotional intelligence has sparked an array of previously

unimaginable possibilities. With AI as the catalyst, we can now embark on a journey to develop

systems capable of deciphering emotional undertones, recognising and categorising the emotions

embedded in spoken language. This evolution in technology has the potential to provide us with a

window into the emotional landscape of human communication, which has profound implications for

various sectors of our society.

As we delve deeper into this thesis, we will explore the intricate mechanisms of SER and the

remarkable possibilities it brings to the table. We will discover how the evolution of AI is not just a

technological advancement but a transformation in how we comprehend and respond to the emotional

dimensions of our interactions. It is a journey into the uncharted territory of artificial emotional

intelligence, allowing us to uncover the subtle emotional cues hidden within our words, thus enriching

our understanding of human communication and our capacity to use technology to decipher the human

experience.

1.2 Background

Over the years, extensive interdisciplinary research has provided valuable insights into speech

emotion recognition. The advent of AI has been instrumental in shedding new light on this domain,

opening exciting possibilities for understanding and utilizing emotional cues in speech.

The marriage of AI with emotion recognition has enabled researchers to address a multifaceted

problem. It's the culmination of efforts by neuroscientists, computational intelligence experts,

linguists, and AI specialists that have unlocked this potential. Neuroscientists have delved into

understanding how the human brain processes and perceives emotional stimuli, providing

foundational insights. Computational intelligence researchers have translated this knowledge into

mathematical solutions, bridging the gap between neural processes and machine learning. Linguists

have approached speech emotion recognition by dissecting speech's semantic and syntactic aspects,

adding linguistic context to the equation. The collaboration of these interdisciplinary fields has

unraveled the complexities of emotional content within the speech.

Scherer's (2003) work presented various design paradigms, often using a modified version of

Brunswik's functional lens model of perception to study speech emotion recognition. Researchers

have experimented with techniques like spectral analysis and Hidden Markov Models to identify and

categorize emotions within spoken language. The application of AI and machine learning has

significantly enhanced the accuracy and efficiency of these recognition systems.

In human-computer interaction, AI-driven emotion recognition from speech is proving

invaluable. Affective systems, empowered by AI, can detect a user's emotional state in real-time,

offering the opportunity to adapt system responses and enhance user satisfaction. Speech and gesture

recognition are cornerstones of this burgeoning field known as affective computing, with audio-based

devices being the most widely adopted, primarily due to the established trust in the technology and

concerns over privacy associated with video sensors.

1.3 Motivation

The study of Speech Emotion Recognition (SER) driven by artificial intelligence is not merely

an academic pursuit but a critical endeavor with profound implications for individuals, businesses,

and society. This research is motivated by several compelling factors that highlight the urgency and

significance of this work.


1.3.1 Enhancing Human-Computer Interaction

In an increasingly digital world, how humans interact with technology has evolved

significantly. From smartphones to smart speakers, machines are integral to our daily lives. However,

for technology to truly serve us, it must understand us on a deeper level. The ability of AI-driven

systems to recognize and respond to human emotions in speech is fundamental in creating more

meaningful and satisfying human-computer interactions. Whether it's in virtual assistants providing

empathetic responses or customer service chatbots tailoring their support to user emotions, the

potential for enhancing our technological interactions is limitless.

1.3.2 Improving Mental Health Assessment

Emotions play an essential role in mental health and well-being. Accurate detection of

emotions in speech offers a powerful tool for assessing and monitoring mental health conditions. It

can contribute to early intervention, improved therapy, and a deeper understanding of emotional

well-being. The potential to develop AI systems that can recognize signs of emotional distress or instability

in speech could be transformative in addressing the global mental health crisis.

1.3.3 Unveiling Valuable Customer Insights

Businesses thrive on understanding their customers. Emotion recognition in customer

interactions, whether through phone calls, chat messages, or reviews, offers a goldmine of

information. Recognizing the emotional tone of customer feedback can aid businesses in tailoring

their products and services, improving customer satisfaction, and addressing issues proactively. The

ability to discern the sentiment behind the words can lead to enhanced decision-making, better

products, and, ultimately, stronger customer relationships.


1.3.4 Human-Emotion Synthesis

The potential of AI to detect and respond to human emotions opens doors to a new era of

human-AI coexistence. Machines with affective properties can detect user emotions and adapt their

responses to meet emotional needs. This improves user satisfaction and lays the groundwork for a

more empathetic and understanding AI-human partnership.

1.3.5 Unlocking Unseen Insights

Human communication is a treasure trove of unspoken emotions. While words may convey

one message, emotions often reveal a different narrative. SER enables us to explore these hidden

narratives, providing invaluable insights into human behavior, sentiment, and well-being. This

research motivates us to unveil these unseen communication layers, contributing to our collective

understanding of human nature and enhancing our ability to address diverse real-world challenges.

In a world where technology and human interaction are increasingly entwined, the ability to

decipher and respond to human emotions in speech is not just a scientific pursuit; it is a quest to

improve lives, augment businesses, and expand the horizons of AI-human collaboration. This thesis

embarks on this quest, aiming to unlock the potential of artificial emotional intelligence in the service

of humanity.

1.4 Research Objectives

This research's primary objective is to develop and evaluate efficient neural network

architectures for accurate emotion classification from diverse speech datasets, focusing on the

impact of data augmentation and advanced feature engineering techniques. Further, some

other objectives are as follows:

• Developing neural network architectures: This research investigates different model



architectures for emotion classification.

• Accurate emotion classification: Aim to improve the accuracy of emotion recognition from

speech.

• Diverse speech datasets: Leverage multiple datasets for model training and evaluation.

• Data augmentation: Explore the impact of data augmentation on model performance.

• Advanced feature engineering: Consider alternative feature extraction

approaches beyond MFCCs.

1.5 Research Question

How can the integration of Convolutional Neural Networks (CNN) and Long Short-Term

Memory networks (LSTM) be optimized to improve the accuracy and efficiency of Speech Emotion

Recognition (SER) systems, and what impact will this have on real-world applications?

1.6 Report Overview

This section gives the reader a brief outline of the report, which is structured as follows:

• Chapter 2 – Literature Review, presents an overview of the background research on the topic. It begins with the evolution of artificial intelligence and deep learning, then navigates through previous studies and methodologies used for speech emotion recognition. Additionally, this section critically assesses the limitations and challenges inherent in existing SER systems, offering insights into the gaps and opportunities for improvement.

• Chapter 3 – Methodology, explains the process of data collection, elaborating on the speech emotion datasets used (RAVDESS, TESS, CREMA-D, and SAVEE). Subsequently, it describes the steps involved in data preparation, including cleaning, formatting, feature extraction, and augmentation, and it emphasizes the importance of high-quality, well-preprocessed data for effective emotion recognition models. It also introduces the modelling techniques applied and the ethical considerations of the research.

• Chapter 4 – Implementation, walks through the practical work: exploring and visualizing the datasets, extracting MFCC features with Librosa, encoding the emotion labels, and building and training the neural network models used for emotion classification.

• Chapter 5 – Evaluation, summarizes and compares the performance metrics of the evaluated models for speech emotion recognition. It discusses the strengths, weaknesses, and applicability of each model, and the comparative analysis drawn from these evaluations provides insights and implications for future studies and practical applications in the field.

• Chapter 6 – Conclusion and Future Work, encapsulates the project's key findings and the insights derived from the study's exploration of speech emotion recognition with neural network models. It discusses the implications of these findings and offers recommendations for future research directions and practical implementations to enhance SER methodologies.

• Chapter 7 – References, compiles a comprehensive list of sources and materials cited


throughout the thesis, providing readers with an extensive repository for further exploration and

validation of the presented information.


2 LITERATURE REVIEW

The Literature Review chapter delves into the existing body of knowledge and research

relevant to Speech Emotion Recognition (SER) and the use of Convolutional Neural Networks (CNN)

and Long Short-Term Memory networks (LSTM) in the field. This section serves as the foundation

for understanding the evolution of SER, the significance of AI-driven techniques, and the context in

which this thesis is situated.

2.1 Evolution of Artificial Intelligence (AI)

The aspiration to create machines with the capacity for thought and intelligence has been a

long-standing desire throughout human history. This aspiration traces back to ancient times, with

legendary figures such as Pygmalion, Daedalus, and Hephaestus often interpreted as early inventors

of artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997). The vision of intelligent

machines predates the creation of programmable computers by over a century (Lovelace, 1842).

Today, the field of artificial intelligence (AI) has transformed into a thriving area of study with

practical applications and ongoing research endeavors.

2.1.1 Early Challenges in Artificial Intelligence

In its early days, AI mainly tackled tasks that humans found mentally challenging but that were straightforward for computers: tasks governed by strict, formal rules that lend themselves to computational solutions. An impressive early achievement was IBM's chess-playing system, Deep Blue, which defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess offered a precise playing field with fixed rules that could be programmed into the computer. While AI could solve such rule-based tasks, the real test was to crack tasks that humans perform easily but find hard to describe in formal terms.


2.2 Machine Learning and the Quest for Representation

Machine learning sparked a crucial change in AI. It made computers able to learn from past

events and grasp the world in a layered way. Each layer is based on simpler ones. This method eased

the dependence on humans to provide all the knowledge a computer needs directly. Learning from

past events let computers skip the complications of human-set rules. They could grasp the key points

from raw data.

2.3 Representation Learning and Deep Learning

A crucial aspect of machine learning is representation learning, which involves discovering

representations that explain observed data effectively. Representation learning algorithms, such as

autoencoders, allow computers to build complex concepts from simpler ones. Deep learning, a subset

of representation learning, focuses on forming a deep hierarchy of concepts, each constructed from

simpler components. The deep learning approach involves learning representations by considering

each layer as a state of memory and executing a sequence of instructions, making it highly capable of

capturing complex relationships and patterns within data.

2.4 The Role of Depth in Deep Learning

The concept of "depth" in deep learning is multifaceted. It can be measured based on the

number of sequential instructions or the depth of the graph describing relationships between concepts.

The choice of perspective influences the interpretation of depth in deep learning models. What

remains consistent is that deep learning models are designed to discover complex representations by

iteratively combining simpler ones, enabling the extraction of intricate patterns from data.

This section sets the stage for the in-depth exploration of deep learning and its relevance to

our thesis on Speech Emotion Recognition using AI with CNN and LSTM. It establishes the historical

context and evolution of AI and machine learning, ultimately leading to the emergence of deep

learning as a promising approach for complex tasks.

2.5 Related Work

In this section, we provide an overview of the related work in the field of speech emotion

recognition (SER). The SER domain encompasses various aspects, including feature extraction,

classification methods, and applications in different sectors. We discuss the key research contributions

in these areas and summarize their findings and insights.

In simple terms, speech is a sequence of sounds we use to convey our feelings and thoughts. A speech

signal carries various types of information, including the identity of the speaker, their gender, the

message they want to convey, the language they are speaking, and even their emotional state. This

realization has led researchers to consider speech a powerful medium for interaction between humans

and machines (Frant & Stoica, 2017; Schuller, 2018). While significant progress has been made in

speech recognition over the last two decades (Nassif et al., 2019), the task of recognizing emotions

from speech signals still presents challenges that require further investigation.

Speech carries a complexity and richness, where even small changes in tonal attributes can

alter the meaning of the same words. Therefore, detecting emotions from voice remains a valuable

area of research. Psychologists have developed various theories about emotions, and two prevalent

models for vocal emotions are the dimensional and discrete (or categorical) models. The discrete

model categorizes emotions into specific, distinct categories, such as happiness, sadness, and fear,

assuming clear boundaries between these emotions (Ekman, 1992).

However, some researchers, like Ronan et al. (2018), argue that this categorization is

insufficient to express the full spectrum of human emotions. This has led to the development of

dimensional models considering emotions along broader dimensions. The widely accepted

two-dimensional model classifies emotions based on valence (positive to negative) and arousal (low to

high) (Russell, 1980). To account for more complexity, a three-dimensional model introduces tension

measured on a tense-to-relaxed scale (Sarprasatham, 2015), while a four-dimensional model adds

intensity and potency dimensions to valence and arousal (Fontaine et al., 2007).

Despite these advancements, dimensional models are criticized for not distinguishing certain

emotions like fear and anger. The subject of emotion measurement remains highly subjective,

influenced by personal and cultural differences. Recent research by Cowen et al. (2018) introduces

an intriguing hypothesis suggesting that people can perceive over 20 different emotions in wordless

sounds. They tested this theory through various approaches, involving more than 2000 sounds and

1000 participants. Through a combination of free-choice and forced-choice descriptions, they

identified 24 reliable emotional categories, demonstrating that emotion is a complex and multifaceted

aspect.

These findings highlight the need for deeper exploration and understanding of emotions,

particularly in the context of Speech Emotion Recognition (SER) systems. The proposed model in

this study aims to address the challenges associated with misclassifying different emotional states,

building upon the evolving understanding of emotions and their expression in speech.

Advancements in Speech Emotion Recognition (SER) through Deep Learning

Artificial intelligence (AI) has seen remarkable growth over time, and machine learning has become a stand-out area within it. Deep learning, a machine learning technique loosely modelled on how the brain works, has grown steadily in importance. Within the landscape of Speech Emotion Recognition (SER), artificial neural networks, a form of deep learning, have gained considerable ground, giving researchers working on emotion detection from speech faster routes to improving their systems (Nassif et al., 2019).

Deep Learning Models in SER: A Revolution in Emotion Detection

Deep learning models have been extensively explored to enhance SER. Zhao et al. (2017)

conducted research on phoneme recognition and SER, demonstrating that the Recurrent

Convolutional Neural Network (RCNN) model could effectively detect emotions with a weighted

accuracy of 53.6% on the IEMOCAP dataset. This research prompted Microsoft to investigate

pitch-based features and deep neural networks, leading to an accuracy of 54.3% (Han et al., 2014). Zhao et

al. (2017) countered this by proving that comparable results could be achieved using spectral features

alone with the RCNN model, highlighting the versatility of deep learning approaches. In the same

year, Microsoft introduced another research by Mirsamadi et al. (2017) involving Recurrent Neural

Network (RNN) with local attention, achieving an accuracy of 61.8% on the same dataset for four

classes of emotions. Fayek et al. (2017) explored multiple deep-learning methods and achieved an

impressive 64.78% accuracy on the same dataset with five emotional classes. Their findings revealed

the superiority of frame-based feature models over utterance-based feature models.

Feature Selection

Feature selection has been a focus of several studies to identify speech's most relevant

emotional components. Liu et al. (2013) explored various techniques, including the Fisher criterion,

distance analysis, partial correlation analysis, and bivariate correlation analysis, to determine the best

feature subset for recognizing emotions. They used an extreme learning machine (ELM) to build a

decision tree that can effectively classify emotions. The study concluded that ELM was a particularly

best fit for the decision tree method, and feature selection techniques like the Fisher criterion and

correlation analysis had been thoroughly verified (Liu, Z. T., Li, X., & Chen, W., 2013).


Another area for feature extraction is the application of deep auto-encoders (DAE). In the

research conducted by Wang et al. (2014), a DAE method with five hidden layers was used to extract

speech emotion characteristics. Alongside DAE, standard features like MFCC, Perceptual Linear

Prediction cepstral coefficients (PLP), and LPCC were extracted from speech signals. When all these

features were utilized as input for a Support Vector Machine (SVM) model, the findings indicated

that DAE-extracted features exhibited a clear advantage over other feature types, highlighting the

potential of deep learning in improving feature extraction for SER (Wang, F., Yang, J., Chen, H., &

Wu, J., 2014).

A Leap Towards Multidimensional SER: Deep Learning Breakthroughs

The journey of integrating deep learning into SER continued with innovative models. Chen et

al. (2018) introduced a three-dimensional attention convolution Recurrent Neural Network (CRNN)

SER model, leveraging Mel spectrogram features, which achieved high accuracy on both IEMOCAP

and Emo-DB datasets. Zhao et al. (2018) proposed an improved method by merging 1D and 2D

Convolutional Neural Networks (CNNs) to reach remarkable accuracy rates of 86.36% on IEMOCAP

and 91.78% on Emo-DB for seven emotional classes. These advancements signaled a shift towards

more robust and multidimensional SER, reflecting the power of deep learning in capturing intricate

emotional nuances.

Deep Learning Revolution in Music Emotion Recognition with CNN-LSTM

In music emotion identification, a groundbreaking technique emerged through the

convolutional long short-term memory deep neural network (CLDNN) architecture (Hizlisoy et al.,

2021). This approach was implemented and tested on a novel collection of 124 Turkish traditional

music snippets. This approach harnessed log-Mel filter bank energies, MFCCs, and essential acoustic


characteristics to recognize emotions effectively. Remarkable results were achieved by combining the

LSTM and DNN classifiers, incorporating the new features with traditional ones. Compared to

conventional methods like KNN, SVM, and random forest classifiers, the LSTM+DNN classifiers

demonstrated superior accuracy. This approach, consisting of four convolutional layers, one LSTM

layer, and fully connected layers, showcases the potential of the CNN-LSTM fusion for music

emotion recognition.

Exploring the Landscape of Speech Emotion Recognition with CNN-LSTM

As the demand for efficient real-time Speech Emotion Recognition (SER) continues to grow

in human-computer interactions, it is essential to comprehensively investigate SER's current

approaches and datasets to arrive at optimal solutions for this ongoing challenge (Abbaschian et al.,

2021). This article delves into the deep learning techniques for SER using publicly available datasets

and discusses traditional machine learning methods for speech emotion detection. Furthermore, it

offers a multifaceted exploration of SER techniques using functional neural networks, shedding light

on the nuances of speech emotion recognition. In this context, CNN-LSTM architectures have taken

the lead, showcasing their capabilities to address emotion detection challenges due to their enhanced

low-level and short-term discriminative skills. The integration of LSTM networks in CNN models

has further improved the network's performance, allowing it to recognize long-term paralinguistic

patterns and demonstrating exceptional speaker-independent emotional processing abilities.

The Convergence of Audio and Lyrics in Music Emotion Classification with CNN-LSTM

Music emotion classification has posed a challenging yet intriguing problem, leading to

innovative approaches in artificial intelligence and machine learning. Chen and Li (2020) proposed a

hybrid network classifier that integrates audio and lyrics using a CNN-LSTM architecture, marking a


departure from the limitations of single network classification paradigms. This hybrid model

leverages two-dimensional and one-dimensional emotional features and significantly enhances

classification accuracy compared to the single-modal classification approach. The study underscores

the critical role of audio and lyrics as crucial elements for categorizing music based on its emotional

content. It highlights the potential for further exploration in multimodal music emotion detection

through deep learning. CNN-LSTM-based models continue to pave the way for a more

comprehensive understanding of emotions in music, significantly contributing to the evolving

landscape of music emotion recognition.

A research study completed by Latif et al. focused on boosting the precision of SER systems.

They utilized a new approach called transition learning. This method was particularly effective when

dealing with different languages and databases. Compared to other models like support vector

machines (SVMs) and sparse autoencoders, deep belief networks (DBNs) performed better. DBNs

gave more precise results in emotion recognition across five databases in three languages. An

interesting observation was that using various languages during training significantly boosted

accuracy while limiting target data. This improvement was evident even in databases with minimum

training examples.

Zhao et al. proposed two CNN+LSTM networks, one 1D CNN+LSTM network and one 2D

CNN+LSTM network, to learn local and global emotion-related features from speech and log-Mel

spectrograms, respectively. The architecture of the two networks is identical, with four regional

function learning blocks (LFLBs) and one LSTM layer in each. LFLB is designed to learn local

correlations and derive hierarchical correlations, and it consists primarily of one convolutional layer

and one max-pooling layer. The LSTM layer is used to learn long-term dependencies from the locally

known functions.


Sun and colleagues introduced a technique combining a sparse autoencoder with an attention mechanism. The sparse autoencoder was used to learn from both labelled and unlabelled data, while the attention mechanism focused on the segments of speech that carry the most emotional content, so that emotionally neutral segments receive less weight. They evaluated the technique on three public databases in a multilingual setting and found that, compared with other popular approaches to recognizing emotions in spoken language, it delivered reliable results.

Jiang et al. suggested a feature representation extraction method based on deep learning from

heterogeneous acoustic feature groups that could include redundant and irrelevant content, resulting

in poor emotion recognition output. A fusion network is trained to jointly learn the

discriminative acoustic feature representation, with an SVM as the final classifier after the informative

features are obtained. The proposed architecture increased recognition efficiency by 64% compared

to current state-of-the-art methods, according to experimental findings on the IEMOCAP dataset.

Pandey et al. [29] provided an overview of deep learning strategies for extracting and

classifying emotional states from speech utterances. They investigate the most commonly used simple

deep learning architectures in the literature. On the two standard datasets, Emo-DB and IEMOCAP,

architectures such as CNN and LSTM were used to measure the emotion capture capability of various

standard speech representations such as Mel-spectrograms, magnitude spectrograms, and MFCCs.

The experiments’ results and their reasoning have been discussed to determine which architecture and

function combination is best for speech emotion detection.

Meng et al. employed the bidirectional LSTM along with CNN to recognize speech emotions.

In addition, they adopted the Mel-spectrogram features in the 3D space as the main features used to

train the CNN network. That model was evaluated based on IEMOCAP and Emo-DB datasets.


Although the results achieved by this model are promising, they lack generalization, as the model

performs well on the training data; however, the performance is worse on the test set.

Zhen et al. proposed a model composed of CNN, BLSTM, and SVM for recognizing speech

emotions based on log-Mel spectrogram features. The model is evaluated on the IEMOCAP dataset

and performs better when compared with another approach in the literature. Despite the promising

performance of the model, it still needs to be evaluated using other datasets to show its generalization

capability. On the other hand, the study presented in [32] showed the performance of various models

used in SER using six speech datasets. This study concluded that the CNN+LSTM model performs

better than the other models for five of the six datasets.

Lili Guo et al. employed a kernel extreme learning machine (KELM) to classify speech

emotion classes. This approach uses a fusion of spectral features to train the presented model. The

evaluation of this model is performed in terms of two datasets, Emo-DB and IEMOCAP. However,

the results show promising performance on only one dataset, which means the approach lacks proper

generalization. In addition, the authors concluded that the fusion of the spectral features allows the

models to achieve higher classification accuracy.

Misbah et al. investigated the application of a deep convolutional neural network (DCNN) to

extract features from the log-Mel spectrogram of the raw speech. The study employed four datasets:

IEMOCAP, Emo-DB, SAVEE, and RAVDESS. The classification of speech emotions is performed

using four classifiers: SVM, random forest, k nearest neighbors, and neural networks. The

performance of these classifiers is promising; however, no single classifier could perform well on the

four datasets. This indicates that these classifiers lack generalization capability.

Sonawane et al. demonstrated a deep learning approach for speech emotion understanding. A

multilayer convolutional neural network is used with a basic K-nearest neighbor (KNN) classifier to

classify emotions such as positive, negative, indifferent, disgust, and surprise. The combination of


MFCC-CNN and the KNN classifier performs better than the current MFCC algorithm, according to

experimental findings on a real-time database obtained from the open-access social media site

YouTube.

Sajjad et al. presented a new SER system focused on Radial basis function network (RBFN)

similarity calculation in clusters and the main sequence segment selection method. The STFT

algorithm transforms the chosen sequence into a spectrogram, which is then fed into the CNN model,

which extracts the discriminative and salient features from the speech spectrogram. Additionally, to

ensure precise recognition performance, CNN features were normalized and fed to the deep

bidirectional long short-term memory (BiLSTM) for emotion recognition based on the learned

temporal information.

In conclusion, integrating deep learning techniques, particularly the fusion of CNN and LSTM

networks, has ushered in a new era of accuracy and efficiency in Speech Emotion Recognition and

Music Emotion Recognition. These models have showcased their capabilities in capturing human

emotions' intricacies and paved the way for more nuanced and multidimensional emotion recognition systems.


3 Methodology
3.1 Business Understanding:

It's essential to understand emotions in speech. This is useful in many areas, such as human-

computer interaction, sentiment analysis, customer service, and mental health evaluations.

Recognising emotions from speech can enhance user interaction, improve customer service, and support

mental health assessment.

In our digital world, we often interact with technology. Hence, machines need to understand

human emotions. Consider customer service. Knowing a user's emotions can make interactions more

personal and increase service quality. In mental health evaluations, machines that can recognise

emotions could help experts carry out early assessment and intervention.

Imagine a world where we could understand emotions in voices through machine learning.

This research aims to make that happen, bridging the gap between speech and emotional intelligence.

The potential impacts could be widespread. Think of customer service where an AI can soothe angry

customers. Or in healthcare, where speech analysis helps diagnose mental health problems. In

education, personalised learning could keep students engaged. Video games could even change

according to the player's feelings, making the play more personal. In essence, this research envisions

accurately capturing emotions from speech. It seeks to transform businesses, technology, medicine,

and relationships.

3.2 Data Understanding

For this research, I have used the Ryerson Audio-Visual Database of Emotional Speech and

Song (RAVDESS), Toronto emotional speech set (TESS) and Crowd Sourced Emotional Multimodal

Actors Dataset (CREMA-D) datasets for training the model.



The speech portion of the RAVDESS used here contains 1440 files (24 actors × 60 trials = 1440). Twenty-four professional actors (twelve female and twelve male) perform two lexically matched statements in a neutral North American accent. The spoken emotions include calm, happy, sad, angry, fearful, surprised, and disgusted expressions. Each expression is produced at two levels of emotional intensity (normal and strong), along with an additional neutral expression.

Regarding the TESS dataset, two actresses, ages 26 and 64, each performed a set of 200 target

words in the carrier phrase "Say the word _." The set was recorded depicting the seven emotions

(anger, disgust, fear, happiness, pleasant surprise, sorrow, and neutral). In total, there are 2800 data

points (audio files).

The dataset is organised so that each of the two actresses and each emotion has its own folder, containing the audio files for all 200 target words. The audio files are in WAV format.

CREMA-D is an emotional multimodal actor dataset comprising 7,442 original clips from 91 actors (48 male and 43 female) between the ages of 20 and 74, representing a range of racial and ethnic backgrounds (African American, Asian, Caucasian, Hispanic, and Unspecified).

For testing, I have used the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset. Four native English male speakers (DC, JE, JK, and KL), postgraduate students and researchers at the University of Surrey aged 27 to 31, provided the data for the SAVEE database. The recordings cover anger, disgust, fear, happiness, sadness, and surprise, distinct categories that psychology has used to characterise emotion, with a neutral category also included to give recordings of 7 emotion categories.

Each emotion was represented by 15 TIMIT sentences: two emotion-specific, three common, and ten phonetically balanced generic sentences. To obtain 30 neutral sentences, the 2 × 6 = 12 emotion-specific sentences and the three common sentences were also recorded as neutral. As a result, each speaker produced a total of 120 utterances.

3.3 Data Preparation:

Several steps were followed in creating a dataset for detecting emotion in speech. Firstly, audio

data were gathered from open-source databases such as RAVDESS, CREMA-D, and TESS,

which contain diverse emotional content. The data were then converted into a single format (WAV) and

adjusted to the same sampling rate for consistency. In the data, each audio clip is tagged with its

respective emotion. This labelling process made it easier to differentiate clips by emotion. To convert

the audio into data readable by machines, we used feature extraction techniques such as Mel-

frequency cepstral coefficients (MFCCs) and spectral attributes. The dataset was enhanced by

reducing noise and ensuring each emotional class was well represented, using oversampling and

augmentation methods. After preparing the dataset, we split it into training, validation, and test sets.

These sets became the foundation for our emotion recognition models.
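
To make the preparation steps concrete, the following is a minimal sketch of the kind of waveform augmentation and data splitting described above. The noise level, pitch-shift amount, split ratios, and the variables X and labels are illustrative assumptions rather than the project's exact settings.

import numpy as np
import librosa
from sklearn.model_selection import train_test_split

def add_noise(y, noise_level=0.005):
    # Inject low-amplitude Gaussian noise into the waveform.
    return y + noise_level * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    # Shift the pitch by a couple of semitones without changing the duration.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# X is the feature matrix and labels the emotion labels built during feature
# extraction; a stratified split keeps the class balance similar in each subset.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)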

3.4 Modeling:

This research focused on how well Neural Network (NN) and Long Short-Term Memory (LSTM) models could identify emotions in speech data. The dataset consisted of speech clips labelled with different emotions. The data were processed into model-ready inputs using Mel-frequency cepstral coefficients (MFCCs), which convert raw audio into numerical features. The NN model used hidden dense layers with ReLU activation functions, while the LSTM model used stacked, bidirectional LSTM cells to capture patterns in the data over time. Training used the Adam optimizer, with the learning rate and batch size tuned to avoid overfitting. Accuracy, precision, recall, and F1-score measured how well the models performed. Both models identified emotions from speech reasonably well, but the LSTM model performed better, thanks to its ability to model temporal structure, suggesting that LSTM models hold considerable promise for emotion recognition from speech. Both techniques are discussed in detail below:

Neural Network:

Neural Networks (NNs) are a class of powerful machine-learning models based on the

structure and functionality of the human brain. They excel in handling complex tasks, learning

intricate patterns, and making predictions from data. Neural networks consist of interconnected nodes,

or neurons, arranged in layers that work collaboratively to process information.

The basic structure of a Neural Network comprises three main types of layers:

1. Input Layer: This layer receives the initial data or features to be processed.

Each node in this layer represents a feature of the input data.

2. Hidden Layers: These layers lie between the input and output layers and

perform complex computations by applying weights to the input and passing it through

activation functions. The number of hidden layers and the number of neurons within each

layer can differ based on the complexity of the problem.

3. Output Layer: This layer produces the network's final predictions or outputs

based on the computations performed in the hidden layers. The number of nodes in the output

layer depends on the nature of the problem, for example, classification or regression.

Within a neural network, connections between neurons are represented by weights,

strengthening the connections. During the learning process, these weights are adjusted to minimize

the error between predicted outputs and actual targets.

Moreover, biases are additional parameters within each neuron that shift its activation, providing

flexibility to the network by enabling it to fit more


complex functions.

Neural networks rely heavily on activation functions. These functions allow networks to

handle and learn from complex data by adding a factor of non-linearity. Here are a few well-known

activation functions:

• Rectified Linear Unit (ReLU): Returns the input unchanged if it is positive and zero otherwise. ReLU is popular because it is simple and helps mitigate vanishing gradient issues.

• SoftMax: Commonly used for multi-class outputs. It converts raw scores into probabilities that sum to 1, giving the predicted probability of each class.

Neural networks learn iteratively: they take in data, predict the outcomes, compare the predictions with the actual results, and adjust their weights to reduce the error. Optimization methods such as gradient descent drive this process.

The architecture and adaptiveness of Neural Networks, combined with their ability to learn

from data, make them invaluable. They have had incredible influence in areas such as image and

voice recognition, language processing, and data prediction.
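
As a small, self-contained illustration (not taken from the project code), the two activation functions described above can be written in a few lines of NumPy:

import numpy as np

def relu(x):
    # Returns the input where it is positive and zero elsewhere.
    return np.maximum(0.0, x)

def softmax(x):
    # Shift by the maximum for numerical stability, then normalize the
    # exponentials so the outputs sum to 1.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

scores = np.array([2.0, -1.0, 0.5, 0.5, -3.0])   # e.g. raw scores for 5 emotion classes
print(relu(scores))      # [2.  0.  0.5 0.5 0. ]
print(softmax(scores))   # five probabilities that sum to 1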

LSTM (Long Short-Term Memory):

Long Short-Term Memory (LSTM) networks are an enhancement of Recurrent Neural Networks (RNNs), designed to retain important information across long stretches of a sequence. Standard RNNs often struggle with this because of the vanishing or exploding gradient problem, which limits their usefulness on long sequences.

LSTM networks were made to beat this issue. They did it by adding special memory cells,

making it easier for them to remember things over different time steps. These networks have the

unique power to decide what to remember, update, or forget. Because of this, they're suitable for jobs


that deal with a sequence of data, like checking a series of numbers over time, processing language,

recognizing speech and more.

The two things that make LSTMs different from normal RNNs are memory cells and these

Gating Mechanisms. They help the network remember and manage long sequences:

1. Memory Cells: LSTMs have these memory cells. They work like a storage unit,

able to keep information for a long time. These cells have a state vector that can change over

time. This way, the network can selectively remember or forget stuff depending on its

relevance to the task at hand.

2. Gating Mechanisms (Forget, Input, Output Gates): LSTMs use gates to control

how information moves within the memory cells. The gates use two mathematical operations,

sigmoid and element-wise multiplication, to control the flow of information.

Forget Gate: Decides which details in the memory state should be let go or erased. It looks at

the current entry and the past condition to determine what details are no longer helpful for future

forecasts.

Input Gate: Chooses to revamp the memory state by pinpointing fresh data to be stored. This

gate figures out the significance of the new data along with the present condition.

Output Gate: Decides the cell's output based on the updated state. This gate controls which

sections of the memory state are used to produce the output at the current time step.

LSTMs, with these gate mechanisms, can learn when to keep or let go of information. This

ability helps lessen the vanishing and exploding gradient problems common in standard RNNs. This

allows LSTMs to model and predict long-distance sequences accurately and efficiently.

LSTMs have proved their worth in real-life tasks where context and time relationships are

key. For example, they excel in predicting stock prices, scrutinizing feelings in writings, creating

meaningful text sequences, and handling time series data. The talent for taking in and holding on to


context over long sequences has made them an essential structure in studying and modelling

sequential data.
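
To make the structure concrete, the following is a minimal Keras sketch of a stacked, bidirectional LSTM classifier of the kind referred to in Section 3.4. The layer sizes and the (timesteps, features) input shape are illustrative assumptions rather than this project's exact configuration.

import tensorflow as tf

# Illustrative shapes: sequences of 128 frames, each with 40 MFCC features,
# classified into 5 emotion classes. These values are assumptions.
num_timesteps, num_features, num_classes = 128, 40, 5

lstm_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(num_timesteps, num_features)),
    # The first bidirectional LSTM layer returns the full sequence so that a
    # second recurrent layer can be stacked on top of it.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    # The second bidirectional LSTM layer summarizes the sequence into one vector.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

lstm_model.compile(optimizer="adam",
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])
lstm_model.summary()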

MFCC:

MFCC or Mel-frequency cepstral coefficients are extensively utilized to extract features in

audio or speech signal processing. They originate from the audio signal's short-term power spectrum

with the aim of capturing features similar to our human hearing system.

Here are the steps to compute MFCCs:

1. Signal Framing: The audio signal is chopped into brief frames, typically

between 20 and 40 milliseconds, with partial overlaps.

2. Pre-emphasis: This process uses a pre-emphasis filter, which amplifies

high-frequency sounds, making the signal clearer.

3. Windowing: Here, frames get multiplied by a window function (like Hamming

or Hanning) to decrease spectral leakage.

4. Fast Fourier Transform (FFT): FFT is applied to every frame, altering the

signal from the time domain into the frequency domain.

5. Mel Filterbank: The power spectrum goes through a set of triangular filters

placed on the Mel-frequency scale. This process emulates how humans perceive sound

non-linearly.

6. Log Compression: The filterbank energies are passed through a logarithm, mirroring the logarithmic way humans perceive loudness.

7. Discrete Cosine Transform (DCT): The last step applies a DCT to the log filterbank energies, yielding a set of coefficients (MFCCs) that represent the spectral features of each audio frame.


MFCCs offer a compact representation of the audio signal, capturing the essential spectral information while discarding less distinctive detail. Because they approximate characteristics of the human auditory system, they are popular in applications such as emotion recognition from audio, speaker identification, and speech recognition.

3.5 Research ethics:

I have taken steps to align my research with data protection standards, emphasizing my

commitment to ethical conduct. I can confidently affirm that there has been no misuse or misrepresentation of the data considered. The dataset remains unaltered and openly accessible on

Kaggle for research purposes. By strictly adhering to data protection regulations, I have prioritized

transparency and integrity throughout the research process, ensuring responsible and ethical handling

of the information. The public availability of the dataset on Kaggle fosters collaboration and supports

research initiatives. Additionally, for data preparation, augmentation and concatenation were

performed with other publicly available datasets.


4 Implementation

The process started with creating a Pandas DataFrame named ravdess_dataframe, containing file paths, gender categories, and emotion labels. The value_counts() method was used on the 'emotion_label' column to observe the count of each emotion label in the DataFrame.

The dataset was largely balanced, with 192 samples for each of the emotions 'angry', 'fear', 'disgust', 'sad', and 'happy', but only 96 samples for the 'neutral' emotion.

Figure 1 Emotion and Gender Distribution

The above bar plot shows the distribution of emotions across genders within the dataset, which is again broadly balanced.

A waveform plot was then generated for an audio clip expressing the 'sad' emotion, and the clip was played back. This was done purely to build familiarity with the dataset; a sketch of this step follows.
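The sketch below illustrates this exploratory step. The column name "path" and the plotting details are assumptions made for the example rather than the project's exact code.

import librosa
import librosa.display
import matplotlib.pyplot as plt
from IPython.display import Audio

# Hypothetical lookup of one 'sad' clip from ravdess_dataframe (the "path" column name is assumed)
sad_path = ravdess_dataframe.loc[ravdess_dataframe["emotion_label"] == "sad", "path"].iloc[0]
y, sr = librosa.load(sad_path)

plt.figure(figsize=(10, 3))
librosa.display.waveshow(y, sr=sr)      # waveform plot of the clip
plt.title("Waveplot - sad emotion")
plt.show()

Audio(sad_path)                         # plays the clip in a notebook environment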


Figure 2 Waveplot for sad emotion audio - RAVDESS dataset

The above image shows a waveform plot for a specific emotion (sad in this case) from an audio file in the RAVDESS dataset.

Then, the next step was the process of extracting Mel-frequency cepstral coefficients (MFCC)

from audio files using the Librosa library. It loads the audio files, normalizes them, computes the

MFCC features, and calculates the mean of these features. The extracted MFCC features and their corresponding emotion labels are then stored in a new DataFrame, which consisted of 960

samples, each containing 128 MFCC features. The emotion labels were one-hot encoded after being

encoded with Label Encoder with the help of Scikit-learn, which resulted in a shape of (960, 5) for

the labels. This preprocessed dataset is suitable for training machine learning or deep learning models

to predict emotions based on the extracted MFCC features from audio recordings.
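A minimal sketch of this feature-extraction and encoding pipeline is given below. The "path" column name and the use of 128 coefficients mirror the description above but are assumptions for illustration, not the project's verbatim code.

import numpy as np
import librosa
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

def mfcc_mean(path, n_mfcc=128):
    # Load, normalise and average the MFCCs over time to get one vector per file
    y, sr = librosa.load(path)
    y = librosa.util.normalize(y)
    return np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)

features = np.stack([mfcc_mean(p) for p in ravdess_dataframe["path"]])   # one 128-dim vector per clip
encoder = LabelEncoder()
labels = to_categorical(encoder.fit_transform(ravdess_dataframe["emotion_label"]))   # one-hot labels, e.g. (960, 5)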

A neural network model was then trained using TensorFlow's Keras API. The model

architecture was built with four dense layers: three hidden layers with 256 neurons each, employing

ReLU activation functions, and an output layer with five neurons using a softmax activation function

for multiclass classification. The model was then compiled using the Adam optimizer and categorical

cross-entropy loss function while tracking metrics such as accuracy, precision, recall, and area under

the curve (AUC). An EarlyStopping callback was employed to monitor the validation loss and prevent

overfitting by restoring the best weights after observing no improvement for three consecutive epochs.

During training with 15 epochs and a batch size of 32, the model's performance metrics (accuracy,

precision, recall, AUC) and loss values for training were recorded and visualized using Matplotlib. The

accuracy and loss curves plotted against the number of epochs show the model's learning progress

and generalization performance on the training data, aiding in assessing its training behavior, potential

overfitting, or underfitting tendencies.
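A hedged sketch of this architecture and training setup is given below. The layer sizes, metrics, early-stopping patience, epochs and batch size follow the description above; the validation split is an assumption made for illustration, and features/labels refer to the mean-MFCC matrix and one-hot labels from the previous sketch.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Dense(256, activation="relu", input_shape=(128,)),   # hidden layer 1
    Dense(256, activation="relu"),                       # hidden layer 2
    Dense(256, activation="relu"),                       # hidden layer 3
    Dense(5, activation="softmax"),                      # one output per emotion class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy", tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])

early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
history = model.fit(features, labels,
                    validation_split=0.2,                # assumed split; the report does not state one
                    epochs=15, batch_size=32,
                    callbacks=[early_stop])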

Further, some more data was added using the TESS dataset and CREMA-D for better training

of the model. The RAVDESS, CREMA-D and TESS datasets were concatenated, which contained

audio files expressing various emotions like anger, fear, disgust, sadness, happiness, and neutrality in

different proportions across the three datasets. After consolidating and cleaning the combined

dataset, neutral expressions were removed from the training data, focusing solely on emotional

expressions for model training. Emotionally labeled audio files underwent various augmentation

techniques, including noise addition, time stretching, shifting, and pitch modification, to diversify the

dataset. These augmented data are then used for feature extraction, encompassing essential audio

characteristics such as Zero Crossing Rate, Root Mean Square, Mel-frequency cepstral coefficients,

Chroma_stft, and Mel Spectrogram. The resulting features are standardized, labels are one-hot

encoded, and sequences are uniformly padded to a fixed length. The processed features and labels are

stored as numpy arrays and pickled files, readying them for model training.
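The augmentation and feature-extraction steps can be sketched as follows. The parameter values (noise factor, stretch rate, shift range, pitch steps) are illustrative assumptions rather than the project's exact settings.

import numpy as np
import librosa

def add_noise(y, factor=0.005):
    return y + factor * np.random.normal(size=y.shape)               # additive white noise

def stretch(y, rate=0.9):
    return librosa.effects.time_stretch(y=y, rate=rate)              # time stretching

def shift(y, max_shift=1600):
    return np.roll(y, np.random.randint(-max_shift, max_shift))      # time shifting

def pitch(y, sr, n_steps=2):
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)  # pitch modification

def extract_features(y, sr):
    # Frame-averaged versions of the features named above, stacked into one vector
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=y))
    rms = np.mean(librosa.feature.rms(y=y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.hstack([zcr, rms, mfcc, chroma, mel])

Each augmented copy of a clip is passed through extract_features, and the resulting vectors are then standardized and padded before training, as described above.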

A sequential model was then initiated and trained on the training data. The model combines convolutional layers (CNN) for feature extraction with recurrent layers (LSTM) for capturing long-term temporal dependencies (see the sketch after this description). Layers are arranged sequentially: two convolutional blocks, followed by two LSTM layers, and finally dense output layers. The input shape was (164, 1), indicating a 1D input sequence with 164 time steps and a single feature channel.

Convolutional blocks: the 1st block used 256 filters, a 5-point kernel, strides of 2, padding to maintain input length, and ReLU activation; the 2nd block used 128 filters, a 5-point kernel, strides of 2, padding, and ReLU activation.

MaxPooling layers: used to reduce dimensionality and extract dominant features.

Dropout layers: regularization to prevent overfitting (20% dropout after each convolutional and LSTM layer).

LSTM layers: the 1st layer had 64 units and returned sequences for further processing; the 2nd layer had 32 units without returning sequences, providing the final temporal context representation.

Dense layers: the 1st layer had 16 units with ReLU activation for intermediate processing; the 2nd layer had 5 units with softmax activation, producing a probability distribution over the five classes.

Training and evaluation: training ran for 50 epochs with a batch size of 32, using categorical cross-entropy loss and the Adam optimizer, with the following metrics tracked: accuracy, precision, recall, and AUC.
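Under the assumptions that 'same' padding and a pool size of 2 were used, the described CNN + LSTM architecture can be sketched as follows; X_train and y_train stand for the padded feature sequences and one-hot labels prepared earlier.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, LSTM, Dense

cnn_lstm = Sequential([
    Conv1D(256, kernel_size=5, strides=2, padding="same", activation="relu",
           input_shape=(164, 1)),                     # 1st convolutional block
    MaxPooling1D(pool_size=2),
    Dropout(0.2),
    Conv1D(128, kernel_size=5, strides=2, padding="same", activation="relu"),   # 2nd block
    MaxPooling1D(pool_size=2),
    Dropout(0.2),
    LSTM(64, return_sequences=True),                  # keeps the sequence for the next LSTM
    Dropout(0.2),
    LSTM(32),                                         # final temporal context representation
    Dropout(0.2),
    Dense(16, activation="relu"),
    Dense(5, activation="softmax"),                   # probability distribution over five classes
])
cnn_lstm.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy", tf.keras.metrics.Precision(),
                          tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])
# cnn_lstm.fit(X_train, y_train, epochs=50, batch_size=32)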

After this, a model was introduced that analyses the data with a combination of convolutional layers, an attention mechanism, and LSTMs, achieving high accuracy, precision, recall, and AUC on the multi-class classification task.

Two artificial neural networks were trained for multi-class classification using a sequential

architecture and Adam optimiser. Both models displayed improvement in accuracy and other metrics

over 50 training epochs with 32 samples per batch.

The first ANN achieved a final accuracy of 73.83% and an AUC of 94.55%, suggesting robust

performance. The second ANN, while starting with lower accuracy, exhibited potential for further

improvement through hyperparameter tuning or additional training. Overall, the results indicate

successful model training.

Further, an ANN-LSTM model was trained for multi-class classification. The model used an LSTM layer with 256 units to capture temporal dependencies, followed by fully

connected layers and dropout for feature extraction and regularization. Finally, a softmax output layer


with five units predicted the class probabilities.
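A minimal sketch of this ANN-LSTM model is shown below; the input shape, hidden dense sizes and dropout rate are assumptions made for illustration, since the report specifies only the 256-unit LSTM and the five-unit softmax output.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

ann_lstm = Sequential([
    LSTM(256, input_shape=(164, 1)),    # temporal dependencies; input shape assumed
    Dense(128, activation="relu"),      # fully connected feature extraction (size assumed)
    Dropout(0.3),                       # regularization (rate assumed)
    Dense(64, activation="relu"),
    Dense(5, activation="softmax"),     # class probabilities for the five emotions
])
ann_lstm.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])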

During training with the Adam optimiser and categorical cross-entropy loss, the model showed steady improvement in accuracy, reaching around 60% on the training set. Loss decreased continuously, reflecting effective error minimization. Recall, initially low, gradually increased, pointing to better identification of positive cases. The AUC score also improved, suggesting enhanced class discrimination. Overall, the model showed promising performance.


5 Evaluation

This section delves into the evaluation of the model performances. Six models were trained in

this research project. These are as follows:

Model Development:

• Model 1 (CNN): This model uses a sequential neural network with dense layers. It achieved moderate scores across accuracy, precision, recall, and AUC on the training set.

• Model 2 (CNN + LSTM): This model extended the training data by including additional samples

from the TESS and CREMA datasets. The performance improved noticeably, achieving higher

accuracy, precision, recall, and AUC scores on the test data compared to Model 1.

• Model 3 (CNN + Attention net + LSTM): The third model combined an attention net with a Long Short-Term Memory (LSTM) layer. This model performed best among all the models.

• Model 4 (ANN): This model performed lower than the CNN and CNN+LSTM models.

• Model 5 (ANN + LSTM): This model performed even lower than the ANN model.

• Model 6 (ANN + More layers of LSTM): This model showed improvement over Model 5 but still fell short of the plain ANN model.


Below is the evaluation of these models:

Figure 3 Accuracy for training data - RAVDESS

Figure 4 Loss for training data - RAVDESS

The graphs visualise the first model's accuracy and loss metrics across epochs for training data

from the RAVDESS dataset, offering insights into the training process and potential overfitting or

convergence issues.


Figure 5 Accuracy/loss for CNN + Attention net + LSTM model

The graphs visualise the CNN + Attention net + LSTM model's accuracy and loss metrics across epochs for training data from the combined dataset (RAVDESS, CREMA-D and TESS), assembled to improve training, offering insights into the training process and potential overfitting or convergence issues.

From the above graphs, it can be observed how the Model Accuracy and Model Loss metrics

showed performance improvement.


Figure 6 Accuracy/loss for ANN + LSTM model

The accuracy graph of the ANN + LSTM model shows that the training accuracy starts at about 0.3 and increases to about 0.6 over the course of 50 epochs. The loss starts at about 1.5 and falls to around 0.8 after 50 epochs.


Figure 7 Accuracy/loss for ANN + multi-layered LSTM model

The loss graph for the ANN + multi-layered LSTM model shows that the training loss starts at about 1.7 and decreases to about 0.6 over the course of 50 epochs, while the accuracy starts at about 0.25 and increases to about 0.60. This suggests that the model is learning from the data, as the loss decreases over time.

Overall, the graphs suggest that the model is learning from the data but is overfitting to the training set.


Table 1 Comparison metrics for different models

Models                        Accuracy   Precision   Recall   AUC
CNN                           0.5698     0.6790      0.4406   0.8401
CNN + LSTM                    0.6887     0.8185      0.5436   0.9235
CNN + Attention Net + LSTM    0.8925     0.9008      0.8843   0.9872
ANN                           0.7383     0.8265      0.6437   0.9455
ANN + LSTM                    0.6030     0.7961      0.4060   0.8747
ANN + More layers of LSTM     0.6164     0.7811      0.4400   0.8812

The table above compares the performance of the six models, evaluated on four metrics: accuracy, precision, recall, and AUC. Here is an analysis of the results:

Overall Performance:

• The CNN + Attention Net + LSTM model achieves the highest accuracy of 0.8925, precision of 0.9008, recall of 0.8843, and AUC of 0.9872, making it the best-performing model of the six.

• CNN + LSTM also performed strongly across all metrics: accuracy 0.6887, precision 0.8185, recall 0.5436, AUC 0.9235.

• The ANN-based models generally performed lower than the best CNN- and LSTM-based models, underlining the importance of capturing temporal dependencies for this task.

• Adding more LSTM layers to the ANN model (ANN + More layers of LSTM) did not significantly improve performance (accuracy: 0.6164), potentially indicating overfitting.


6 Conclusion and Future Work

6.1 Conclusion
This project addressed emotion classification with neural networks using speech data sourced primarily from the RAVDESS, CREMA-D and TESS datasets. The project's primary objective

was to develop efficient models that accurately identify emotions such as happiness, sadness, anger,

fear, disgust, and more from speech samples. The process included data collection, feature extraction

(notably Mel-frequency cepstral coefficients - MFCCs), model development, evaluation, and insights

into potential future work to enhance the current models' performance.

Data Collection and Preparation: The analysis began with collecting audio files from the RAVDESS, CREMA-D and TESS datasets, ensuring a diverse representation of emotional

states. After that, thorough data preprocessing was executed to extract essential features from the

speech samples. The utilization of MFCCs allowed for transforming the speech signals into a format

compatible with neural network-based analysis.

Further, the project had six distinct neural network architectures for the emotion classification

task. Model 1 employed a sequential neural network structure with dense layers; despite moderate performance across accuracy, precision, recall, and AUC, it served as a benchmark for further evaluations. Model 2 extended the dataset by including additional samples from the CREMA-D and TESS datasets. This augmentation notably enhanced the model's performance, yielding marked improvements in accuracy, precision, recall, and AUC scores on the test dataset compared to

Model 1. Furthermore, Model 3, which incorporated an attention net and LSTM layer for sequence

processing, resulted in comparatively better performance metrics than the dense layer-based models.

The evaluation demonstrated promising results, particularly with the augmentation of the

dataset in Models 2 and 3, underscoring the significance of diverse data in improving emotion classification accuracy from speech inputs. Additionally, the comparative analysis indicated the superior effectiveness of the attention net and LSTM-based models over the purely dense-layer architectures in handling the extracted audio features for emotion recognition.

Several recommendations were outlined to further enhance the models' performance. Firstly,

exploring advanced data augmentation techniques or alternative feature engineering approaches could

enrich the dataset, potentially boosting classification accuracy. Secondly, fine-tuning

hyperparameters to optimize the neural network models and improve their generalization capabilities

could be helpful. Thirdly, investigating model interpretability techniques could highlight the most influential features contributing to emotion classification, enhancing the models' transparency and understanding.

In conclusion, the analysis successfully implemented neural network architectures for emotion

classification from speech data, achieving commendable results, especially with the integration of

additional diverse data. However, the scope for improvement remains, including further data

augmentation, hyperparameter tuning, and interpretability exploration. These enhancements could

further improve the models' accuracy and reliability in recognizing emotions from audio samples,

contributing significantly to the field of emotion recognition and human-computer interaction.

6.2 Future Work


Moving forward, future work in this domain could focus on various aspects. Augmenting the

dataset further with a wider variety of emotional states and demographic diversity might improve the model's capacity to generalize across diverse populations. Additionally, exploring advanced neural network architectures or leveraging transfer learning from pre-trained models could potentially yield more robust and nuanced emotion classification frameworks. Moreover, investigating

real-time emotion classification applications or multimodal approaches combining audio and visual

inputs could lead to more comprehensive emotion recognition systems for practical use cases in fields


such as mental health monitoring, human-computer interaction, and beyond. Lastly, continuous

improvement of interpretability methods to explore the neural network model’s decision-making

processes would not only enhance model understanding but also increase trust and acceptance of these

technologies in real-world applications.


7 References

S. R. Livingstone and F. A. Russo, ‘‘The Ryerson audio-visual database of emotional speech and

song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American

English,’’ PLoS ONE, vol. 13, no. 5, pp. 1–35, May 2018.

P. Jackson and S. U. Haq, ‘‘Surrey audio-visual expressed emotion (SAVEE) database,’’ Univ.

Surrey, Guildford, U.K., Tech. Rep., Apr. 2014.

A. Satt, S. Rozenberg, and R. Hoory, ‘‘Efficient emotion recognition from speech using deep

learning on spectrograms,’’ in Proc. Interspeech, Aug. 2017.

J. Chang and S. Scherer, ‘‘Learning representations of emotional speech with deep convolutional

generative adversarial networks,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process.

(ICASSP), Mar. 2017, pp. 2746–2750.

M. Chen, X. He, J. Yang, and H. Zhang, ‘‘3-D convolutional recurrent neural networks with

attention model for speech emotion recognition,’’ IEEE Signal Process. Lett., vol. 25, no. 10, pp.

1440–1444, Oct. 2018.

J. Zhao, X. Mao, and L. Chen, ‘‘Learning deep features to recognise speech emotion using merged

deep CNN,’’ IET Signal Process., vol. 12, no. 6, pp. 713–721, 2018.

P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, ‘‘Speech emotion recognition

using spectrogram & phoneme embedding,’’ in Proc. Interspeech, Sep. 2018.

M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, ‘‘Emotion

identification from raw speech signals using DNNs,’’ in Proc. Interspeech, Sep. 2018.

S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, ‘‘Transfer learning for improving speech

emotion classification accuracy,’’ in Proc. Interspeech, Sep. 2018.

J. Zhao, X. Mao, and L. Chen, ‘‘Speech emotion recognition using deep 1D & 2D CNN LSTM

networks,’’ Biomed. Signal Process. Control, vol. 47, pp. 312–323, 2019.

T.-W. Sun and A.-Y.-A. Wu, ‘‘Sparse autoencoder with attention mechanism for speech emotion

recognition,’’ in Proc. IEEE Int. Conf. Artif. Intell. Circuits Syst. (AICAS), Mar. 2019, pp. 146–

149.

W. Jiang, Z. Wang, J. S. Jin, X. Han, and C. Li, ‘‘Speech emotion recognition with heterogeneous

feature unification of deep neural network,’’ Sensors, vol. 19, no. 12, p. 2730, Jun. 2019.

S. K. Pandey, H. S. Shekhawat, and S. R. M. Prasanna, ‘‘Deep learning techniques for speech

emotion recognition: A review,’’ in Proc. 29th Int. Conf. Radioelektronika

(RADIOELEKTRONIKA), Apr. 2019, pp. 1–6.

H. Meng, T. Yan, F. Yuan, and H. Wei, ‘‘Speech emotion recognition from 3D log-mel

spectrograms with deep learning network,’’ IEEE Access, vol. 7, pp. 125868–125881, 2019.

Z.-T. Liu, P. Xiao, D.-Y. Li, and M. Hao, Speaker-Independent Speech Emotion Recognition Based

on CNN-BLSTM and Multiple SVMs. Aug. 2019, pp. 481–491.

J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, and G. Hofer, ‘‘Analysis of deep

learning architectures for cross-corpus speech emotion recognition,’’ in Proc. Interspeech, Sep.

2019, pp. 1656–1660.

L. Guo, L. Wang, J. Dang, Z. Liu, and H. Guan, ‘‘Exploration of complementary features for speech

emotion recognition based on kernel extreme learning machine,’’ IEEE Access, vol. 7, pp. 75798–

75809, 2019.

M. Farooq, F. Hussain, N. K. Baloch, F. R. Raja, H. Yu, and Y. B. Zikria, ‘‘Impact of feature

selection algorithm on speech emotion recognition using deep convolutional neural network,’’

Sensors, vol. 20, no. 21, p. 6008, Oct. 2020.

S. Sonawane and N. Kulkarni, ‘‘Speech emotion recognition based on MFCC and convolutional

neural network,’’ Int. J. Adv. Sci. Res. Eng. Trends, Jul. 2020.

Mustaqeem, M. Sajjad, and S. Kwon, ‘‘Clustering-based speech emotion recognition by


incorporating learned features and deep BiLSTM,’’ IEEE Access, vol. 8, pp. 79861–79875, 2020.

Mustaqeem and S. Kwon, ‘‘A CNN-assisted enhanced audio signal processing for speech emotion

recognition,’’ Sensors, vol. 20, no. 1, p. 183, Dec. 2019.

N. Vryzas, L. Vrysis, M. Matsiola, R. Kotsakis, C. Dimoulas, and G. Kalliris, ‘‘Continuous speech

emotion recognition with convolutional neural networks,’’ J. Audio Eng. Soc., vol. 68, nos. 1–2, pp.

14–24, Feb. 2020.

N.-H. Ho, H.-J. Yang, S.-H. Kim, and G. Lee, ‘‘Multimodal approach of speech emotion

recognition using multi-level multi-head fusion attention-based recurrent neural network,’’ IEEE

Access, vol. 8, pp. 61672–61686, 2020.

O. Atila and A. Şengür, ‘‘Attention guided 3D CNN-LSTM model for accurate speech-based

emotion recognition,’’ Appl. Acoust., vol. 182, Nov. 2021, Art. no. 108260.

T. Tuncer, S. Dogan, and U. R. Acharya, ‘‘Automated accurate speech emotion recognition system

using twine shuffle pattern and iterative neighbourhood component analysis techniques,’’ Knowl.-

Based Syst., vol. 211, Jan. 2021, Art. no. 106547.

J. Liu and H. Wang, ‘‘A speech emotion recognition framework for better discrimination of

confusions,’’ in Proc. Interspeech, Aug. 2021, pp. 4483–4487.


8 Appendix
8.1 Artefact Links
Google Drive Link -

https://drive.google.com/file/d/19audQU8H0pmuiAxibxCL5d4YGE6trl55/view?usp=sharing

Microsoft OneDrive Link -

https://mydbsmy.sharepoint.com/:u:/g/personal/10621051_mydbs_ie/EecAZ63OQX1Gq_6f3IRr_UBAtxMkfeIQZM6MDgT9JzAeA?e=xRPMI

8.2 Dataset Links

RAVDESS - https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio?resource=download

TESS - https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess

CREMA-D - https://www.kaggle.com/datasets/ejlok1/cremad

SAVEE - https://www.kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee
