Vaibhav Tiwari
(10621051)
Applied Research Project submitted in partial fulfilment of the requirements for the degree of MSc in Data Analytics at Dublin Business School
January 2024
Declaration
I declare that this Applied Research Project that I have submitted to Dublin Business School for the award of MSc in Data Analytics is the result of my own investigations, except where otherwise stated and clearly acknowledged by references. Furthermore, this work has not been previously submitted, in whole or in part, for any other degree or qualification.
Acknowledgement
I want to thank Professor Agatha Mattos, my research mentor, for providing me with guidance,
encouragement, and valuable suggestions throughout my research journey. I thank my supervisor for
her practical advice and assistance in my study. Additionally, I extend special thanks to the DBS
library and academic operations for their support in providing necessary reference materials for my
literature work.
Abstract
This project explores the classification of emotions in speech data using neural network models. The aim is to build robust models that correctly identify emotions such as happiness, sadness, anger, fear, and disgust from audio clips. The process involves data collection, cleaning, feature extraction using Mel-frequency cepstral coefficients (MFCCs), and the creation and assessment of three distinct neural network architectures. The first model, a sequential structure with dense layers, serves as a baseline for comparing the performance of the subsequent models. The second model, trained on an augmented dataset, performed better. The third model, which adds an LSTM layer, produced mixed results compared with models using only dense layers. The results underline the importance of varied data in improving emotion identification accuracy from speech. This research contributes to the field of neural network-based emotion classification and lays the foundation for applications in areas such as mental health monitoring and human-computer interaction.
Table of Contents
Declaration .......................................................................................................................................................2
Acknowledgement ...........................................................................................................................................3
Abstract ............................................................................................................................................................4
List of Tables ....................................................................................................................................................7
List of Figures ...................................................................................................................................................8
1 Introduction .............................................................................................................................................9
1.1 Introduction ...........................................................................................................................................9
1.2 Background ......................................................................................................................................... 10
1.3 Motivation........................................................................................................................................... 11
1.3.1 Enhancing Human-Computer Interaction ........................................................................................ 12
1.3.2 Improving Mental Health Assessment ............................................................................................. 12
1.3.3 Unveiling Valuable Customer Insights ............................................................................................. 12
1.3.4 Human-Emotion Synthesis ............................................................................................................... 13
1.3.5 Unlocking Unseen Insights ............................................................................................................... 13
1.4 Research Objectives ............................................................................................................................ 13
1.5 Research Question .............................................................................................................................. 14
1.6 Report Overview ................................................................................................................................. 14
2 LITERATURE REVIEW ............................................................................................................................. 17
2.1 Evolution of Artificial Intelligence (AI) ................................................................................................ 17
2.1.1 Early Challenges in Artificial Intelligence ......................................................................................... 17
2.2 Machine Learning and the Quest for Representation ........................................................................ 18
2.3 Representation Learning and Deep Learning ..................................................................................... 18
2.4 The Role of Depth in Deep Learning ................................................................................................... 18
2.5 Related Work ...................................................................................................................................... 19
3 Methodology......................................................................................................................................... 28
3.1 Business Understanding:..................................................................................................................... 28
3.2 Data Understanding ............................................................................................................................ 28
3.3 Data Preparation: ................................................................................................................................ 30
3.4 Modeling: ............................................................................................................................................ 30
3.5 Research ethics: .................................................................................................................................. 35
4 Implementation .................................................................................................................................... 36
5 Evaluation ............................................................................................................................................. 41
6 Conclusion and Future Work ................................................................................................................ 48
6.1 Conclusion ........................................................................................................................................... 48
6.2 Future Work ........................................................................................................................................ 49
7 References ............................................................................................................................................ 51
8 Appendix ............................................................................................................................................... 54
8.1 Artefact Links ...................................................................................................................................... 54
8.2 Dataset Links ....................................................................................................................................... 54
List of Tables
List of Figures
1 Introduction
This chapter introduces the subject, gives background information, emphasizes the motivation, details the research objectives, poses the research question, and provides a structure for the rest of the report.
1.1 Introduction
Human communication is an intricate and multifaceted phenomenon. At its core, speech is the primary medium through which we express thoughts, convey information, and engage with one another. Yet, beneath the surface of words and sentences lies a richer layer of emotion that shapes our conversations and relationships as much as our words do. Understanding and interpreting these emotional nuances within speech has become a focal point of study, especially against the backdrop of swift advances in artificial intelligence.
Emotions are the colour palette that paints the canvas of human communication.
They are the joy that fills laughter, the sorrow that trembles in a sigh, the excitement that crackles in
enthusiasm, and the serenity that flows in comforting words. For centuries, humans have been aware
of the emotional richness that speech carries. Still, the ability to systematically recognise, quantify,
and categorise these emotional cues within spoken language has been a complex and elusive
challenge.
In the current age, we stand at the precipice of a remarkable evolution in technology and
understanding. The rise of artificial intelligence has revolutionised the way we interact with machines
and the world around us. It has ushered in an era where machines are becoming increasingly proficient
at tasks that once required human insight and judgment. The amalgamation of AI and studying human
emotions through speech has birthed a promising field known as Speech Emotion Recognition (SER).
This convergence has opened up previously unimaginable possibilities. With AI as the catalyst, we can now embark on a journey to develop systems capable of deciphering emotional undertones, recognising and categorising emotions embedded in spoken language. This evolution in technology has the potential to provide a window into the emotional landscape of human communication, with profound implications for how we design and interact with technology.
As we delve deeper into this thesis, we will explore the intricate mechanisms of SER and the
remarkable possibilities it brings to the table. We will discover how the evolution of AI is not just a
technological advancement but a transformation in how we comprehend and respond to the emotional
dimensions of our interactions. It is a journey into the uncharted territory of artificial emotional
intelligence, allowing us to uncover the subtle emotional cues hidden within our words, thus enriching
our understanding of human communication and our capacity to use technology to decipher the human
experience.
1.2 Background
Over the years, extensive interdisciplinary research has provided valuable insights into speech
emotion recognition. The advent of AI has been instrumental in shedding new light on this domain,
opening exciting possibilities for understanding and utilizing emotional cues in speech.
The marriage of AI with emotion recognition has enabled researchers to address a multifaceted challenge; it is the collaboration of neuroscientists, computational intelligence researchers, linguists, and AI specialists that has unlocked this potential. Neuroscientists have delved into
understanding how the human brain processes and perceives emotional stimuli, providing
foundational insights. Computational intelligence researchers have translated this knowledge into
mathematical solutions, bridging the gap between neural processes and machine learning. Linguists
have approached speech emotion recognition by dissecting speech's semantic and syntactic aspects,
adding linguistic context to the equation. The collaboration of these interdisciplinary fields has steadily advanced the state of the art.
Scherer's (2003) work presented various design paradigms, often using a modified version of Brunswik's functional lens model of perception to study speech emotion recognition. Researchers
have experimented with techniques like spectral analysis and Hidden Markov Models to identify and
categorize emotions within spoken language. The application of AI and machine learning has proven invaluable. Affective systems, empowered by AI, can detect a user's emotional state in real time,
offering the opportunity to adapt system responses and enhance user satisfaction. Speech and gesture
recognition are cornerstones of this burgeoning field known as affective computing, with audio-based
devices being the most widely adopted, primarily due to the established trust in the technology and its relative ease of deployment.
1.3 Motivation
The study of Speech Emotion Recognition (SER) driven by artificial intelligence is not merely
an academic pursuit but a critical endeavor with profound implications for individuals, businesses,
and society. This research is motivated by several compelling factors, outlined below, that highlight its urgency and relevance.
1.3.1 Enhancing Human-Computer Interaction
In an increasingly digital world, how humans interact with technology has evolved
significantly. From smartphones to smart speakers, machines are integral to our daily lives. However,
for technology to truly serve us, it must understand us on a deeper level. The ability of AI-driven
systems to recognize and respond to human emotions in speech is fundamental in creating more
meaningful and satisfying human-computer interactions. Whether it's in virtual assistants providing
empathetic responses or customer service chatbots tailoring their support to user emotions, the applications are far-reaching.
1.3.2 Improving Mental Health Assessment
Emotions play an essential role in mental health and well-being. Accurate detection of
emotions in speech offers a powerful tool for assessing and monitoring mental health conditions. It
can contribute to early intervention, improved therapy, and a deeper understanding of emotional well-
being. The potential to develop AI systems that can recognize signs of emotional distress or instability
interactions, whether through phone calls, chat messages, or reviews, offers a goldmine of
information. Recognizing the emotional tone of customer feedback can aid businesses in tailoring
their products and services, improving customer satisfaction, and addressing issues proactively. The
ability to discern the sentiment behind the words can lead to enhanced decision-making and better customer relationships.
1.3.4 Human-Emotion Synthesis
The potential of AI to detect and respond to human emotions opens doors to a new era of
human-AI coexistence. Machines with affective properties can detect user emotions and adapt their
responses to meet emotional needs. This improves user satisfaction and lays the groundwork for a more natural form of human-AI collaboration.
1.3.5 Unlocking Unseen Insights
Human communication is a treasure trove of unspoken emotions. While words may convey
one message, emotions often reveal a different narrative. SER enables us to explore these hidden
narratives, providing invaluable insights into human behavior, sentiment, and well-being. This
research motivates us to unveil these unseen communication layers, contributing to our collective
understanding of human nature and enhancing our ability to address diverse real-world challenges.
In a world where technology and human interaction are increasingly entwined, the ability to
decipher and respond to human emotions in speech is not just a scientific pursuit; it is a quest to
improve lives, augment businesses, and expand the horizons of AI-human collaboration. This thesis
embarks on this quest, aiming to unlock the potential of artificial emotional intelligence in the service
of humanity.
1.4 Research Objectives
This research's primary objective is to develop and evaluate efficient neural network architectures for accurate emotion classification from diverse speech datasets, focusing on the impact of data augmentation and advanced feature engineering techniques. Further, some specific objectives include:
• Accurate emotion classification: Aim to improve the accuracy of emotion recognition from
speech.
• Diverse speech datasets: Leverage multiple datasets for model training and evaluation.
1.5 Research Question
How can the integration of Convolutional Neural Networks (CNN) and Long Short-Term
Memory networks (LSTM) be optimized to improve the accuracy and efficiency of Speech Emotion
Recognition (SER) systems, and what impact will this have on real-world applications?
1.6 Report Overview
• Chapter 2 - Literature Review presents an overview of the background research on the topic. It traces the evolution of artificial intelligence and deep learning, and then surveys previous studies and methodologies used for speech emotion recognition, critically assessing the strengths and limitations of existing approaches.
• Chapter 3 - Methodology explains the research process, from business and data understanding to data preparation, modelling, and research ethics. It elaborates on the datasets used (RAVDESS, TESS, CREMA-D, and SAVEE), the preprocessing and feature extraction steps, and the neural network and LSTM architectures considered.
• Chapter 4 - Implementation describes how the models were built, covering the construction of the data frames, MFCC feature extraction, data augmentation, and the training of the dense, CNN+LSTM, and attention-based architectures.
• Chapter 5 - Evaluation summarizes and compares the performance metrics of the trained models, discussing the strengths and weaknesses of each architecture for emotion classification from speech.
• Chapter 6 - Conclusion and Future Work encapsulates the project's key findings and insights, explores their implications, and offers recommendations and potential directions for further research.
• Chapters 7 and 8 contain the references cited throughout the thesis and an appendix with links to the project artefact and datasets, providing readers with a repository for further exploration.
2 LITERATURE REVIEW
The Literature Review chapter delves into the existing body of knowledge and research
relevant to Speech Emotion Recognition (SER) and the use of Convolutional Neural Networks (CNN)
and Long Short-Term Memory networks (LSTM) in the field. This section serves as the foundation
for understanding the evolution of SER, the significance of AI-driven techniques, and the context in which this research is situated.
2.1 Evolution of Artificial Intelligence (AI)
The aspiration to create machines with the capacity for thought and intelligence has been a
long-standing desire throughout human history. This aspiration traces back to ancient times, with
legendary figures such as Pygmalion, Daedalus, and Hephaestus often interpreted as early inventors
of artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997). The vision of intelligent
machines predates the creation of programmable computers by over a century (Lovelace, 1842).
Today, the field of artificial intelligence (AI) has transformed into a thriving area of study with countless practical applications.
2.1.1 Early Challenges in Artificial Intelligence
When AI started, it mainly tackled tasks that humans found challenging mentally but were
simple for computers. Such tasks followed strict, math-based laws, perfect for computer-based
solutions. An impressive early achievement was IBM's chess-playing system, Deep Blue. Deep Blue
beat the world chess champion, Garry Kasparov, in 1997 (Hsu, 2002). Chess offered a precise playing field with set rules that could be programmed into the computer. AI could solve rule-based tasks; the real test was to crack tasks that humans do easily but find hard to express in formal terms.
2.2 Machine Learning and the Quest for Representation
Machine learning sparked a crucial change in AI. It made computers able to learn from past
events and grasp the world in a layered way. Each layer is based on simpler ones. This method eased
the dependence on humans to provide all the knowledge a computer needs directly. Learning from
past events let computers skip the complications of human-set rules. They could grasp the key patterns in the data on their own.
2.3 Representation Learning and Deep Learning
Representation learning is concerned with discovering representations that explain observed data effectively. Representation learning algorithms, such as
autoencoders, allow computers to build complex concepts from simpler ones. Deep learning, a subset
of representation learning, focuses on forming a deep hierarchy of concepts, each constructed from
simpler components. The deep learning approach involves learning representations by considering each layer as a state of memory executing a sequence of instructions, making it highly capable of capturing complex structure in data.
2.4 The Role of Depth in Deep Learning
The concept of "depth" in deep learning is multifaceted. It can be measured based on the
number of sequential instructions or the depth of the graph describing relationships between concepts.
The choice of perspective influences the interpretation of depth in deep learning models. What
remains consistent is that deep learning models are designed to discover complex representations by
iteratively combining simpler ones, enabling the extraction of intricate patterns from data.
This section sets the stage for the in-depth exploration of deep learning and its relevance to our thesis on Speech Emotion Recognition using AI with CNN and LSTM. It establishes the historical context and evolution of AI and machine learning, ultimately leading to the emergence of deep learning.
2.5 Related Work
In this section, we provide an overview of the related work in the field of speech emotion
recognition (SER). The SER domain encompasses various aspects, including feature extraction,
classification methods, and applications in different sectors. We discuss the key research contributions below.
In simple terms, speech is a sequence of sounds we use to convey our feelings and thoughts. A speech
signal carries various types of information, including the identity of the speaker, their gender, the
message they want to convey, the language they are speaking, and even their emotional state. This
realization has led researchers to consider speech a powerful medium for interaction between humans
and machines (Frant & Stoica, 2017; Schuller, 2018). While significant progress has been made in
speech recognition over the last two decades (Nassif et al., 2019), the task of recognizing emotions
from speech signals still presents challenges that require further investigation.
Speech carries a complexity and richness, where even small changes in tonal attributes can
alter the meaning of the exact words. Therefore, detecting emotions from voice remains a valuable
area of research. Psychologists have developed various theories about emotions, and two prevalent
models for vocal emotions are the dimensional and discrete (or categorical) models. The discrete
model categorizes emotions into specific, distinct categories such as happiness, sadness, and fear.
However, some researchers, like Ronan et al. (2018), argue that this categorization is
insufficient to express the full spectrum of human emotions. This has led to the development of
dimensional models considering emotions along broader dimensions. The widely accepted two-
dimensional model classifies emotions based on valence (positive to negative) and arousal (low to
high) (Russell, 1980). To account for more complexity, a three-dimensional model introduces tension
intensity and potency dimensions to valence and arousal (Fontaine et al., 2007).
Despite these advancements, dimensional models are criticized for not distinguishing certain
emotions like fear and anger. The subject of emotion measurement remains highly subjective,
influenced by personal and cultural differences. Recent research by Cowen et al. (2018) introduces
an intriguing hypothesis suggesting that people can perceive over 20 different emotions in wordless
sounds. They tested this theory through various approaches, involving more than 2000 sounds and
identified 24 reliable emotional categories, demonstrating that emotion is a complex and multifaceted
aspect.
These findings highlight the need for deeper exploration and understanding of emotions,
particularly in the context of Speech Emotion Recognition (SER) systems. The proposed model in
this study aims to address the challenges associated with misclassifying different emotional states,
building upon the evolving understanding of emotions and their expression in speech.
Artificial intelligence (AI) has seen remarkable growth over time, and machine learning has
become a stand-out area. Within machine learning, deep learning, a technique loosely modelled on how our brains work, has risen in importance. Within the landscape of Speech Emotion Recognition (SER), artificial neural networks, a form of deep learning, have gained a lot of ground. This approach gives those studying emotion detection from speech more ways to improve their systems.
Deep learning models have been extensively explored to enhance SER. Zhao et al. (2017)
conducted research on phoneme recognition and SER, demonstrating that the Recurrent
Convolutional Neural Network (RCNN) model could effectively detect emotions with a weighted
accuracy of 53.6% on the IEMOCAP dataset. This research prompted Microsoft to investigate pitch-
based features and deep neural networks, leading to an accuracy of 54.3% (Han et al., 2014). Zhao et
al. (2017) countered this by proving that comparable results could be achieved using spectral features
alone with the RCNN model, highlighting the versatility of deep learning approaches. In the same
year, Microsoft introduced another research by Mirsamadi et al. (2017) involving Recurrent Neural
Network (RNN) with local attention, achieving an accuracy of 61.8% on the same dataset for four
classes of emotions. Fayek et al. (2017) explored multiple deep-learning methods and achieved an impressive 64.78% accuracy on the same dataset with five emotional classes. Their findings revealed the strong potential of deep architectures for this task.
Feature Selection
Feature selection has been a focus of several studies to identify speech's most relevant
emotional components. Liu et al. (2013) explored various techniques, including the Fisher criterion,
distance analysis, partial correlation analysis, and bivariate correlation analysis, to determine the best
feature subset for recognizing emotions. They used an extreme learning machine (ELM) to build a
decision tree that can effectively classify emotions. The study concluded that ELM was a particularly good fit for the decision tree method, and that feature selection techniques like the Fisher criterion and correlation analysis had been thoroughly verified (Liu et al., 2013).
Another area for feature extraction is the application of deep auto-encoders (DAE). In the
research conducted by Wang et al. (2014), a DAE method with five hidden layers was used to extract
speech emotion characteristics. Alongside DAE, standard features like MFCC, Perceptual Linear
Prediction cepstral coefficients (PLP), and LPCC were extracted from speech signals. When all these
features were utilized as input for a Support Vector Machine (SVM) model, the findings indicated
that DAE-extracted features exhibited a clear advantage over other feature types, highlighting the
potential of deep learning in improving feature extraction for SER (Wang et al., 2014).
The journey of integrating deep learning into SER continued with innovative models. Chen et
al. (2018) introduced a three-dimensional attention convolution Recurrent Neural Network (CRNN)
SER model, leveraging Mel spectrogram features, which achieved high accuracy on both IEMOCAP
and Emo-DB datasets. Zhao et al. (2018) proposed an improved method by merging 1D and 2D
Convolutional Neural Networks (CNNs) to reach remarkable accuracy rates of 86.36% on IEMOCAP
and 91.78% on Emo-DB for seven emotional classes. These advancements signaled a shift towards
more robust and multidimensional SER, reflecting the power of deep learning in capturing intricate
emotional nuances.
Hizlisoy et al. (2021) proposed a convolutional long short-term memory deep neural network (CLDNN) architecture for music emotion recognition. This approach was implemented and tested on a novel collection of 124 Turkish traditional
music snippets. This approach harnessed log-Mel filter bank energies, MFCCs, and essential acoustic
characteristics to recognize emotions effectively. Remarkable results were achieved by combining the
LSTM and DNN classifiers, incorporating the new features with traditional ones. Compared to
conventional methods like KNN, SVM, and random forest classifiers, the LSTM+DNN classifiers
demonstrated superior accuracy. This approach, consisting of four convolutional layers, one LSTM
layer, and fully connected layers, showcases the potential of the CNN-LSTM fusion for music
emotion recognition.
As the demand for efficient real-time Speech Emotion Recognition (SER) continues to grow, researchers have surveyed deep learning approaches and datasets to arrive at optimal solutions for this ongoing challenge (Abbaschian et al., 2021). Their survey delves into the deep learning techniques for SER using publicly available datasets
and discusses traditional machine learning methods for speech emotion detection. Furthermore, it
offers a multifaceted exploration of SER techniques using functional neural networks, shedding light
on the nuances of speech emotion recognition. In this context, CNN-LSTM architectures have taken
the lead, showcasing their capabilities to address emotion detection challenges due to their enhanced
low-level and short-term discriminative skills. The integration of LSTM networks in CNN models
has further improved the network's performance, allowing it to recognize long-term paralinguistic cues.
The Convergence of Audio and Lyrics in Music Emotion Classification with CNN-LSTM
Music emotion classification has posed a challenging yet intriguing problem, leading to
innovative approaches in artificial intelligence and machine learning. Chen and Li (2020) proposed a
hybrid network classifier that integrates audio and lyrics using a CNN-LSTM architecture, marking a
departure from the limitations of single network classification paradigms. This hybrid model achieved higher classification accuracy compared to the single-modal classification approach. The study underscores
the critical role of audio and lyrics as crucial elements for categorizing music based on its emotional
content. It highlights the potential for further exploration in multimodal music emotion detection
through deep learning. CNN-LSTM-based models continue to pave the way for a more nuanced treatment of emotion in music.
A research study completed by Latif et al. focused on boosting the precision of SER systems. They utilized transfer learning, which proved particularly effective when
dealing with different languages and databases. Compared to other models like support vector
machines (SVMs) and sparse autoencoders, deep belief networks (DBNs) performed better. DBNs
gave more precise results in emotion recognition across five databases in three languages. An
interesting observation was that using various languages during training significantly boosted
accuracy when target data was limited. This improvement was evident even in databases with very few training examples.
Zhao et al. proposed two CNN+LSTM networks, one 1D CNN+LSTM network and one 2D
CNN+LSTM network, to learn local and global emotion-related features from speech and log-Mel
spectrograms, respectively. The architecture of the two networks is identical, with four local feature learning blocks (LFLBs) and one LSTM layer in each. An LFLB is designed to learn local
correlations and derive hierarchical correlations, and it consists primarily of one convolutional layer
and one max-pooling layer. The LSTM layer is used to learn long-term dependencies from the locally learned features.
Sun and his team introduced a new technique combining a sparse autoencoder with an attention mechanism. Their goal was to use the autoencoder to learn from both labelled and unlabelled data, while the attention mechanism focuses on the parts of speech that really drive emotion; speech sections carrying little emotion are usually down-weighted. The team put their technique to the test on three online databases using a multilingual system and found that, compared to other popular methods for identifying emotions in speech, it performed favourably.
Jiang et al. suggested a feature representation extraction method based on deep learning from
heterogeneous acoustic feature groups that could include redundant and irrelevant content, resulting
in poor emotion recognition output in their research. A fusion network is learned to jointly learn the
discriminative acoustic feature representation and SVM as the final classifier after the informative
features are obtained. The proposed architecture increased recognition efficiency by 64% compared to existing approaches.
Pandey et al. [29] provided an overview of deep learning strategies for extracting and
classifying emotional states from speech utterances. They investigate the most commonly used simple
deep learning architectures in the literature. On the two standard datasets, Emo-DB and IEMOCAP,
architectures such as CNN and LSTM were used to measure the emotion capture capability of various configurations. The experiments' results and their reasoning were discussed to determine which architecture and setup best suit the task.
Meng et al. employed the bidirectional LSTM along with CNN to recognize speech emotions.
In addition, they adopted the Mel-spectrogram features in the 3D space as the main features used to
train the CNN network. That model was evaluated based on IEMOCAP and Emo-DB datasets.
Although the results achieved by this model are promising, they lack generalization, as the model
performs well on the training data; however, the performance is worse on the test set.
Zhen et al. proposed a model composed of CNN, BLSTM, and SVM for recognizing speech
emotions based on log-Mel spectrogram features. The model is evaluated on the IEMOCAP dataset
and performs better when compared with another approach in the literature. Despite the promising
performance of the model, it still needs to be evaluated using other datasets to show its generalization
capability. On the other hand, the study presented in [32] showed the performance of various models
used in SER using six speech datasets. This study concluded that the CNN+LSTM model performs
better than the other models for five of the six datasets.
Lili Guo et al. employed a kernel extreme learning machine (KELM) to classify speech
emotion classes. This approach uses a fusion of spectral features to train the presented model. The
evaluation of this model is performed in terms of two datasets, Emo-DB and IEMOCAP. However,
the results show promising performance on only one dataset, which means the approach lacks proper
generalization. In addition, the authors concluded that the fusion of the spectral features allows the model to capture complementary emotional information.
Misbah et al. investigated the application of a deep convolutional neural network (DCNN) to
extract features from the log-Mel spectrogram of the raw speech. The study employed four datasets:
IEMOCAP, Emo-DB, SAVEE, and RAVDESS. The classification of speech emotions is performed
using four classifiers: SVM, random forest, k nearest neighbors, and neural networks. The
performance of these classifiers is promising; however, no single classifier could perform well on the
four datasets. This indicates that these classifiers lack generalization capability.
Sonawane et al. demonstrated a deep learning approach for speech emotion understanding. A
multilayer convolutional neural network is used with a basic K-nearest neighbor (KNN) classifier to
classify emotions such as positive, negative, indifferent, disgust, and surprise. The combination of
MFCC-CNN and the KNN classifier performs better than the current MFCC algorithm, according to
experimental findings on a real-time database obtained from the open-access social media site
YouTube.
Sajjad et al. presented a new SER system focused on Radial basis function network (RBFN)
similarity calculation in clusters and the main sequence segment selection method. The STFT
algorithm transforms the chosen sequence into a spectrogram, which is then fed into the CNN model,
which extracts the discriminative and salient features from the speech spectrogram. Additionally, to
ensure precise recognition performance, CNN features were normalized and fed to the deep
bidirectional long short-term memory (BiLSTM) for emotion recognition based on the learned
temporal information.
In conclusion, integrating deep learning techniques, particularly the fusion of CNN and LSTM
networks, has ushered in a new era of accuracy and efficiency in Speech Emotion Recognition and
Music Emotion Recognition. These models have showcased their capabilities in capturing human
emotions' intricacies and have paved the way for more nuanced and multidimensional emotion analysis.
3 Methodology
3.1 Business Understanding:
It's essential to understand emotions in speech. This is useful in many areas, such as human-
computer interaction, sentiment analysis, customer service, and mental health evaluations.
Recognising emotions from speech can enhance user interaction, improve customer service, and boost overall engagement.
In our digital world, we often interact with technology. Hence, machines need to understand
human emotions. Consider customer service. Knowing a user's emotions can make interactions more
personal and increase service quality. In mental health evaluations, machines that can recognise emotional cues could support clinicians in assessing patients.
Imagine a world where we could understand emotions in voices through machine learning.
This research aims to make that happen, bridging the gap between speech and emotional intelligence.
The potential impacts could be widespread. Think of customer service where an AI can soothe angry
customers. Or in healthcare, where speech analysis helps diagnose mental health problems. In
education, personalised learning could keep students engaged. Video games could even change
according to the player's feelings, making the play more personal. In essence, this research envisions
accurately capturing emotions from speech. It seeks to transform businesses, technology, medicine,
and relationships.
3.2 Data Understanding
For this research, I have used the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Toronto Emotional Speech Set (TESS), and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) for training.
There are 1,440 files in this portion of the RAVDESS (24 actors x 60 trials each). Twenty-four professional actors, twelve female and twelve male, perform two lexically matched
phrases in a neutral North American accent for the RAVDESS. Calm, joyful, sad, furious, afraid,
surprised, and disgusted expressions are examples of spoken emotions. Every expression is generated
at two different emotional intensity levels (strong and normal), along with a neutral expression.
Regarding the TESS dataset, two actresses, ages 26 and 64, each performed a set of 200 target
words in the carrier phrase "Say the word _." The set was recorded depicting the seven emotions
(anger, disgust, fear, happiness, pleasant surprise, sorrow, and neutral). In total, there are 2,800 audio files.
Due to its organisational structure, the two female actors and their emotions are contained
under separate folders in the dataset, and all 200 target words' audio files can be found within each.
CREMA-D is an emotive multimodal dataset comprising 7,442 original clips from 91 actors between the ages of 20 and 74, with 48 male and 43 female actors representing a range of racial and ethnic backgrounds (African American, Asian, Caucasian, Hispanic, and Unspecified).
For testing, I have used the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset. Four native English male speakers (DC, JE, JK, and KL), postgraduate students and researchers at the University of Surrey, ranging in age from 27 to 31, provided data for the SAVEE
database. Anger, disgust, fear, happiness, sadness, and surprise are the distinct categories that psychology has used to characterise emotion. A neutral category is also included to offer a baseline, giving seven categories in total.
Each emotion in the text was represented by 15 TIMIT sentences: two emotion-specific, three
common, and ten phonetically balanced generic sentences. To get 30 neutral sentences, the 2 × 6 =
12 emotion-specific and 3 common sentences were also recorded as neutral. Each speaker thus produced 120 utterances in total.
3.3 Data Preparation:
Several steps were followed in creating a dataset for detecting emotion in speech. Firstly, audio data was gathered from open-source sources such as the RAVDESS, CREMA-D, and TESS databases, which provide diverse emotional content. The data were then converted into a single format (WAV) and
adjusted to the same sampling rate for consistency. In the data, each audio clip is tagged with its
respective emotion. This labelling process made it easier to differentiate clips by emotion. To convert
the audio into data readable by machines, we used feature extraction techniques such as Mel-
frequency cepstral coefficients (MFCCs) and spectral attributes. The dataset was enhanced by
reducing noise and ensuring each emotional class was well represented, using oversampling and
augmentation methods. After preparing the dataset, we split it into training, validation, and test sets.
These sets became the foundation for our emotion recognition models.
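As an illustration of the final splitting step described above, the sketch below uses scikit-learn to produce stratified training, validation, and test sets. The array names, the 15% split ratios, and the placeholder data are assumptions for demonstration rather than the exact values used in this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the extracted MFCC features and emotion labels.
features = np.random.rand(960, 128)                              # 960 clips x 128 features
labels = np.repeat(["angry", "fear", "disgust", "sad", "happy"], 192)

# Hold out a stratified test set so every emotion is represented.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.15, stratify=labels, random_state=42)

# Carve a validation set out of the remaining training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15, stratify=y_train, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)
```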
3.4 Modeling:
This research focused on how well Neural Networks (NNs) and Long Short-Term Memory
(LSTM) models could find emotions in speech data. The dataset had speech clips with different
emotions. The data was processed and made ready for study by using Mel-frequency cepstral
coefficients (MFCCs). It turned raw audio into inputs the models could use. The NN model had hidden
layers with a ReLU activation function. The LSTM model used stacked layers. It used bidirectional
LSTM cells to understand patterns in the data over time. Training used the Adam optimizer. The best
learning rate and batch size were used to avoid overfitting. Accuracy, precision, recall, and F1-score
measured how well the models performed. Both models were good at finding emotions from speech.
But the LSTM model did better. It was great at understanding data over time. It was found that LSTM
models have a lot of promise and might be even better at finding emotions in speech. Below, we describe the main building blocks used in these models.
Neural Network:
Neural Networks (NNs) are a class of powerful machine-learning models based on the
structure and functionality of the human brain. They excel in handling complex tasks, learning
intricate patterns, and making predictions from data. Neural networks consist of interconnected nodes, or neurons, organised into layers.
The basic structure of a Neural Network comprises three main types of layers:
1. Input Layer: This layer receives the initial data or features to be processed.
2. Hidden Layers: These layers lie between the input and output layers and
perform complex computations by applying weights to the input and passing it through
activation functions. The number of hidden layers and the number of neurons within each can vary with the complexity of the task.
3. Output Layer: This layer produces the network's final predictions or outputs
based on the computations performed in the hidden layers. The number of nodes in the output
layer depends on the nature of the problem, for example, classification or regression.
Each connection between neurons carries a weight that determines its influence, with larger weights strengthening the connections. During the learning process, these weights are adjusted to minimize the prediction error.
Moreover, biases are additional parameters within each neuron that allow the model to handle
the delta or error in the output. They provide flexibility to the network by enabling it to fit more
complex functions.
Neural networks rely heavily on activation functions. These functions allow networks to
handle and learn from complex data by adding a factor of non-linearity. Here are a few well-known
activation functions:
• Rectified Linear Unit (ReLU): This takes an input and, if it is positive, returns it unchanged; otherwise, it returns zero. ReLU is popular because it is simple and computationally efficient.
• SoftMax: This is commonly used for classification into categories. It changes raw scores into probabilities that sum to one.
Neural Networks learn by testing and adjusting. They take in data, predict the outcomes, compare their predictions with the actual results, and make changes to get better results. Tools such as backpropagation and gradient descent drive these adjustments.
The architecture and adaptiveness of Neural Networks, combined with their ability to learn from data, make them invaluable. They have had incredible influence in areas such as image and speech recognition.
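To make the roles of layers, weights, biases, and activation functions concrete, here is a minimal NumPy sketch of a single forward pass through a small dense network; the layer sizes are arbitrary and the weights are random placeholders rather than trained values.

```python
import numpy as np

def relu(x):
    # ReLU: passes positive values through, clamps negatives to zero.
    return np.maximum(0, x)

def softmax(x):
    # Softmax: turns raw scores into a probability distribution.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=128)                              # one input feature vector (e.g. MFCCs)

W1, b1 = rng.normal(size=(64, 128)), np.zeros(64)     # hidden layer weights and biases
W2, b2 = rng.normal(size=(5, 64)), np.zeros(5)        # output layer for 5 emotion classes

hidden = relu(W1 @ x + b1)                            # hidden layer with ReLU activation
probs = softmax(W2 @ hidden + b2)                     # class probabilities from the output layer
print(probs, probs.sum())
```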
LSTM:
Long Short-Term Memory (LSTM) networks are a specialised kind of Recurrent Neural Network (RNN). They're made to handle the tricky business of keeping track of essential details in a sequence
over a long period of time. Normal RNNs often fumble this task due to a tricky problem called the
vanishing or exploding gradient, which impacts their usefulness with long sequences.
LSTM networks were made to beat this issue. They did it by adding special memory cells,
making it easier for them to remember things over different time steps. These networks have the
unique power to decide what to remember, update, or forget. Because of this, they're suitable for jobs
that deal with a sequence of data, like analysing a series of numbers over time, processing language, or recognising speech.
The two things that make LSTMs different from normal RNNs are memory cells and gating mechanisms. They help the network remember and manage long sequences:
1. Memory Cells: LSTMs have these memory cells. They work like a storage unit,
able to keep information for a long time. These cells have a state vector that can change over
time. This way, the network can selectively remember or forget information depending on its relevance.
2. Gating Mechanisms (Forget, Input, Output Gates): LSTMs use gates to control
how information moves within the memory cells. The gates rely on a sigmoid activation and element-wise multiplication:
Forget Gate: Decides which details in the memory state should be let go or erased. It looks at
the current entry and the past condition to determine what details are no longer helpful for future
forecasts.
Input Gate: Chooses to revamp the memory state by pinpointing fresh data to be stored. This
gate figures out the significance of the new data along with the present condition.
Output Gate: Decides the cell's output based on the refreshed state. This gate controls which
sections of the memory condition are used to make the output for this moment.
LSTMs, with these gate mechanisms, can learn when to keep or let go of information. This
skill helps lessen the problems of vanishing or blowing up gradients common in standard RNNs. This
allows LSTMs to model and predict long-distance sequences accurately and efficiently.
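The gating logic described above can be summarised in a few lines of code. The following is a simplified single-time-step LSTM cell in NumPy, intended only to illustrate how the forget, input, and output gates interact with the cell state; biases are omitted and the weights are random placeholders, not a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n_in, n_hid = 40, 32                                   # input size and hidden size (arbitrary)
x_t = rng.normal(size=n_in)                            # current input
h_prev, c_prev = np.zeros(n_hid), np.zeros(n_hid)      # previous hidden and cell states

# One weight matrix per gate, acting on the concatenated [h_prev, x_t].
W_f, W_i, W_o, W_c = [rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(4)]
z = np.concatenate([h_prev, x_t])

f_t = sigmoid(W_f @ z)                                 # forget gate: what to erase from the cell state
i_t = sigmoid(W_i @ z)                                 # input gate: what new information to store
o_t = sigmoid(W_o @ z)                                 # output gate: what to expose as the new output
c_t = f_t * c_prev + i_t * np.tanh(W_c @ z)            # updated cell state
h_t = o_t * np.tanh(c_t)                               # new hidden state at this time step
```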
LSTMs have proved their worth in real-life tasks where context and time relationships are
key. For example, they excel in predicting stock prices, scrutinizing feelings in writings, creating
meaningful text sequences, and handling time series data. The talent for taking in and holding on to
context over long sequences has made them an essential structure in studying and modelling sequential data.
MFCC:
Mel-frequency cepstral coefficients (MFCCs) are among the most widely used features in audio and speech signal processing. They originate from the audio signal's short-term power spectrum with the aim of capturing features similar to those used by the human hearing system.
1. Signal Framing: The audio signal is chopped into brief frames, typically 20-40 ms long, with some overlap between consecutive frames.
2. Windowing: Each frame is multiplied by a window function (for example, a Hamming window) to reduce spectral leakage.
3. Fast Fourier Transform (FFT): The FFT is applied to every frame, converting the windowed signal into its power spectrum.
4. Mel Filterbank: The power spectrum goes through a set of triangular filters placed on the Mel-frequency scale. This process emulates how humans perceive sound non-linearly.
5. Logarithm: The logarithm of the filterbank energies is taken, compressing their dynamic range.
6. Discrete Cosine Transform (DCT): The last step applies the DCT to the log filterbank energies. The result is a set of coefficients (the MFCCs) that represent the frame.
MFCCs offer a summarized depiction of the audio signal. They capture essential spectral data while excluding less distinctive features. Their effectiveness in reflecting the human auditory system's characteristics makes them popular in applications such as emotion recognition, speaker identification, and speech recognition.
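The sketch below mirrors these steps with Librosa and SciPy on a synthetic signal. It is a simplified illustration of the pipeline rather than an exact reproduction of Librosa's internal MFCC implementation, and the frame sizes and coefficient count are assumed values.

```python
import numpy as np
import librosa
import scipy.fftpack

sr = 22050
signal = np.random.randn(sr * 2)          # two seconds of noise standing in for a speech clip

# Framing, windowing and FFT in one call: the short-time Fourier transform (power spectrum).
power_spec = np.abs(librosa.stft(signal, n_fft=512, hop_length=256)) ** 2
# Triangular Mel filterbank applied to the power spectrum.
mel_energies = librosa.feature.melspectrogram(S=power_spec, sr=sr, n_mels=40)
# Logarithm of the filterbank energies, then DCT to obtain the cepstral coefficients.
log_mel = np.log(mel_energies + 1e-10)
mfcc = scipy.fftpack.dct(log_mel, axis=0, norm="ortho")[:13]   # keep the first 13 coefficients
print(mfcc.shape)                                              # (13, number_of_frames)
```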
3.5 Research ethics:
I have taken steps to align my research with data protection standards, emphasizing my
commitment to ethical conduct. I can confidently affirm that there has been no misuse or
misrepresentation of the considered data. The dataset remains unaltered and openly accessible on
Kaggle for research purposes. By strictly adhering to data protection regulations, I have prioritized
transparency and integrity throughout the research process, ensuring responsible and ethical handling
of the information. The public availability of the dataset on Kaggle fosters collaboration and supports
research initiatives. Additionally, for data preparation, augmentation and concatenation were applied only to working copies of the data, leaving the original source files unchanged.
4 Implementation
The process started with creating a Pandas Data Frame named ravdess_dataframe; this data
frame contained file paths, gender categories, and emotion labels. The value_counts() method was
used on the 'emotion_label' column for observing the count of each emotion label in the DataFrame.
This was a balanced dataset with 192 samples for each of the five emotions used: 'angry', 'fear', 'disgust', 'sad', and 'happy'.
The above bar plot shows the distribution of emotions across different genders within the RAVDESS dataset.
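A minimal sketch of this cataloguing step is shown below. The folder name, column names, and file-name parsing are assumptions based on the description above and on the standard RAVDESS naming convention, in which the third field of the file name encodes the emotion and the last field the actor (even actor numbers being female).

```python
import os
import pandas as pd
import matplotlib.pyplot as plt

records = []
for root, _, files in os.walk("ravdess"):              # hypothetical folder holding the wav files
    for name in files:
        if not name.endswith(".wav"):
            continue
        # Example file name: 03-01-05-01-02-01-12.wav
        parts = name.split(".")[0].split("-")
        emotion_code, actor_id = parts[2], int(parts[-1])
        records.append({
            "file_path": os.path.join(root, name),
            "gender": "female" if actor_id % 2 == 0 else "male",
            "emotion_label": emotion_code,
        })

ravdess_dataframe = pd.DataFrame(records)
print(ravdess_dataframe["emotion_label"].value_counts())          # check class balance
ravdess_dataframe.groupby(["gender", "emotion_label"]).size().unstack().plot(kind="bar")
plt.show()
```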
The waveform was used for plotting and for playing back the audio file for the 'sad' emotion: the code loads the audio file path for the 'sad' emotion, generates a waveform plot, and plays the audio. This was done as a quick sanity check on the data.
The above image represents a waveform plot for a specific emotion (sad in this case) from an audio file in the RAVDESS dataset.
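A short sketch of this inspection step, assuming a hypothetical path to one 'sad' clip, might look as follows.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
from IPython.display import Audio

sad_path = "ravdess/Actor_01/03-01-04-01-01-01-01.wav"   # hypothetical 'sad' clip

signal, sr = librosa.load(sad_path)
plt.figure(figsize=(10, 3))
librosa.display.waveshow(signal, sr=sr)                   # amplitude over time for the clip
plt.title("Waveform - sad")
plt.show()

Audio(sad_path)                                           # play the clip inside a notebook
```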
Then, the next step was the process of extracting Mel-frequency cepstral coefficients (MFCC)
from audio files using the Librosa library. It loads the audio files, normalizes them, computes the
MFCC features, and calculates the mean of these features. The extracted MFCC features and their corresponding emotion labels are then stored in a new Data Frame. This Data Frame consisted of 960
samples, each containing 128 MFCC features. The emotion labels were one-hot encoded after being
encoded with Label Encoder with the help of Scikit-learn, which resulted in a shape of (960, 5) for
the labels. This preprocessed dataset is suitable for training machine learning or deep learning models
to predict emotions based on the extracted MFCC features from audio recordings.
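A condensed sketch of this extraction and encoding step is given below. It assumes the ravdess_dataframe built earlier and uses 128 mean MFCC coefficients per clip, matching the shapes reported above; the helper function name is illustrative.

```python
import numpy as np
import librosa
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

def mfcc_vector(path, n_mfcc=128):
    # Load, normalise, and summarise one clip as the mean MFCC vector.
    signal, sr = librosa.load(path)
    signal = librosa.util.normalize(signal)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)

features = np.stack([mfcc_vector(p) for p in ravdess_dataframe["file_path"]])

encoder = LabelEncoder()
labels = to_categorical(encoder.fit_transform(ravdess_dataframe["emotion_label"]))
print(features.shape, labels.shape)       # expected (960, 128) and (960, 5)
```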
A neural network model was then trained using TensorFlow's Keras API. The model architecture was built with four dense layers: three hidden layers with 256 neurons each, employing
ReLU activation functions, and an output layer with five neurons using a softmax activation function
for multiclass classification. The model was then compiled using the Adam optimizer and categorical
cross-entropy loss function while tracking metrics such as accuracy, precision, recall, and area under
the curve (AUC). An EarlyStopping callback was employed to monitor the validation loss and prevent
overfitting by restoring the best weights after observing no improvement for three consecutive epochs.
During training with 15 epochs and a batch size of 32, the model's performance metrics (accuracy,
precision, recall, AUC) and loss values for training are recorded and visualized using Matplotlib. The
accuracy and loss curves plotted against the number of epochs show the model's learning progress
and generalization performance on the training data, aiding in assessing its training behavior, potential overfitting, and convergence.
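Based on the description above, a minimal version of this first model could look like the following sketch. Layer sizes, optimizer, loss, callback, epochs, and batch size follow the text, while the validation split used to monitor val_loss is an assumption; features and labels refer to the MFCC matrix and one-hot labels prepared earlier.

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

model = Sequential([
    Input(shape=(128,)),                         # one 128-dimensional MFCC vector per clip
    Dense(256, activation="relu"),
    Dense(256, activation="relu"),
    Dense(256, activation="relu"),
    Dense(5, activation="softmax"),              # five emotion classes
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall(),
                       tf.keras.metrics.AUC()])

# features, labels: arrays prepared in the earlier extraction step.
early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
history = model.fit(features, labels,
                    validation_split=0.2,        # assumed split for monitoring validation loss
                    epochs=15, batch_size=32,
                    callbacks=[early_stop])

# Learning curves for the training run, as described in the text.
plt.plot(history.history["accuracy"], label="accuracy")
plt.plot(history.history["loss"], label="loss")
plt.xlabel("epoch"); plt.legend(); plt.show()
```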
Further, some more data was added using the TESS dataset and CREMA-D for better training
of the model. The RAVDESS, CREMA-D and TESS datasets were concatenated, which contained
audio files expressing various emotions like anger, fear, disgust, sadness, happiness, and neutrality in different proportions across the three datasets. After consolidating and cleaning the combined
dataset, neutral expressions were removed from the training data, focusing solely on emotional
expressions for model training. Emotionally labeled audio files underwent various augmentation
techniques, including noise addition, time stretching, shifting, and pitch modification, to diversify the
dataset. These augmented data are then used for feature extraction, encompassing essential audio
characteristics such as Zero Crossing Rate, Root Mean Square, Mel-frequency cepstral coefficients,
Chroma_stft, and Mel Spectrogram. The resulting features are standardized, labels are one-hot
encoded, and sequences are uniformly padded to a fixed length. The processed features and labels are
stored as numpy arrays and pickled files, readying them for model training.
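The augmentation and feature-extraction steps might be sketched as follows with Librosa and NumPy. The noise factor, shift range, pitch steps, and feature configuration are assumed values, and the resulting feature dimensionality will differ slightly from the 164 used in the text depending on the exact settings.

```python
import numpy as np
import librosa

def add_noise(signal, noise_factor=0.005):
    # Additive white noise scaled to the signal's amplitude.
    return signal + noise_factor * np.random.randn(len(signal)) * np.max(np.abs(signal))

def time_stretch(signal, rate=0.9):
    # Slow the clip down (or speed it up) without changing pitch.
    return librosa.effects.time_stretch(y=signal, rate=rate)

def shift(signal, max_shift=1600):
    # Roll the waveform left or right by a random number of samples.
    return np.roll(signal, np.random.randint(-max_shift, max_shift))

def pitch_shift(signal, sr, n_steps=2):
    # Raise (or lower) the pitch by a number of semitones.
    return librosa.effects.pitch_shift(y=signal, sr=sr, n_steps=n_steps)

def extract_features(signal, sr):
    # Concatenate the per-frame means of several spectral descriptors into one vector.
    return np.hstack([
        np.mean(librosa.feature.zero_crossing_rate(y=signal)),
        np.mean(librosa.feature.rms(y=signal)),
        np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20), axis=1),
        np.mean(librosa.feature.chroma_stft(y=signal, sr=sr), axis=1),
        np.mean(librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128), axis=1),
    ])

# Example usage on a synthetic clip standing in for one audio file.
sr = 22050
clip = np.random.randn(sr * 2)
augmented = [clip, add_noise(clip), time_stretch(clip), shift(clip), pitch_shift(clip, sr)]
feature_rows = np.stack([extract_features(a, sr) for a in augmented])
print(feature_rows.shape)
```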
A sequential model was then initiated and trained using the training data. The model
effectively combines convolutional layers (CNN) for feature extraction with recurrent layers (LSTM)
for capturing long-term temporal dependencies.
Sequential Structure: Layers are arranged sequentially, starting with two convolutional blocks, followed by two LSTM layers, and ending with dense output layers. The input shape was (164, 1), indicating a 1D input sequence with 164 time steps and a single feature channel.
Convolutional Blocks: 1st block: 256 filters, 5-point kernel; 2nd block: 128 filters, 5-point kernel, strides of 2, padding, and ReLU activation.
MaxPooling Layers: These were used for reducing dimensionality and extracting dominant features.
Dropout Layers: Regularization to prevent overfitting (20% dropout after each convolutional and
LSTM layer).
LSTM Layers: 1st layer: 64 units, returning sequences for further processing.
2nd layer: 32 units, not returning sequences, providing final temporal context representation.
Dense Layers:
2nd layer: 5 units with softmax activation, producing a probability distribution over the five classes.
Training and Evaluation: Training was done for 50 epochs with a batch size of 32, using categorical cross-entropy loss and the Adam optimizer. The metrics tracked were accuracy, precision, recall, and AUC.
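Under the hyperparameters listed above, a minimal Keras sketch of this CNN + LSTM architecture could look like the following; pooling sizes, the padding mode, and the size of the first dense layer are not stated in the text and are assumptions here.

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Conv1D, MaxPooling1D, Dropout, LSTM, Dense

model = Sequential([
    Input(shape=(164, 1)),                                  # 164 time steps, 1 feature channel

    # First convolutional block: 256 filters, 5-point kernel.
    Conv1D(256, kernel_size=5, padding="same", activation="relu"),
    MaxPooling1D(pool_size=2),
    Dropout(0.2),

    # Second convolutional block: 128 filters, 5-point kernel, stride 2.
    Conv1D(128, kernel_size=5, strides=2, padding="same", activation="relu"),
    MaxPooling1D(pool_size=2),
    Dropout(0.2),

    # Recurrent part: the first LSTM returns sequences for the second to consume.
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(32),
    Dropout(0.2),

    Dense(32, activation="relu"),      # size of the first dense layer is assumed, not given in the text
    Dense(5, activation="softmax"),    # probability distribution over the five emotion classes
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC()])
model.summary()

# X, y would be the padded feature sequences and one-hot labels prepared earlier:
# history = model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32)
```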
After this, a model was introduced which analyzes the data with a combination of convolutional layers,
attention, and LSTMs, achieving high accuracy, precision, recall, and AUC on a multi-class
classification task.
Two artificial neural networks were trained for multi-class classification using a sequential
architecture and Adam optimiser. Both models displayed improvement in accuracy and other metrics as training progressed.
The first ANN achieved a final accuracy of 73.83% and an AUC of 94.55%, suggesting robust
performance. The second ANN, while starting with lower accuracy, exhibited potential for further
improvement through hyperparameter tuning or additional training. Overall, the results indicate that both networks learned meaningful patterns from the data.
Further, the ANN-LSTM model was trained on a dataset for a multi-class classification. The
model used an LSTM layer with 256 units to record temporal dependencies, followed by fully
connected layers and dropout for feature extraction and regularization. Finally, a softmax output layer produces the class probabilities.
During training with Adam optimiser and categorical cross-entropy loss, the model shows
steady improvement in accuracy, reaching around 60% on the training set. Loss continuously
decreased, indicating effective error minimization. Recall, initially low, gradually increased, reflecting better identification of positive cases. The AUC score also improved, indicating enhanced class separability.
5 Evaluation
This section delves into the evaluation of the model performances. Six models were trained in total.
Model Development:
• Model 1 (CNN): This model utilizes a sequential neural network with dense layers. The model
achieved moderate accuracy and other metric scores (accuracy, precision, recall, and AUC) on the
training set.
• Model 2 (CNN + LSTM): This model extended the training data by including additional samples
from the TESS and CREMA datasets. The performance improved noticeably, achieving higher
accuracy, precision, recall, and AUC scores on the test data compared to Model 1.
• Model 3 (CNN + Attention net + LSTM): The third model experimented with a Long Short-Term
Memory (LSTM) layer and attention net. This model performed best among all the models.
• Model 4 (ANN): This model performed lower than the CNN and CNN+LSTM models.
• Model 5 (ANN + LSTM): This model performed even lower than the ANN model.
• Model 6 (ANN + More layers of LSTM): This model showed improvement when compared to
model 5 but was still not good enough and was lower than the ANN model.
The graphs visualise the first model's accuracy and loss metrics across epochs for training data
from the RAVDESS dataset, offering insights into the training process and potential overfitting or
convergence issues.
The graphs visualise the CNN + Attention net + LSTM model's accuracy and loss metrics across epochs for training data from the combined dataset (RAVDESS, CREMA-D, and TESS), which was assembled to improve the training, offering insights into the training process and potential overfitting or convergence issues.
From the above graphs, it can be observed how the model accuracy and model loss metrics evolve across the training epochs.
The accuracy graph of the ANN + LSTM model shows that the training accuracy starts off at about 0.3
and increases to about 0.6 over the course of 50 epochs. The Loss starts off at 1.5 and after 50 epochs
it is around 0.8.
The ANN + More layers of LSTM loss graph shows that the training loss starts at about 1.7 and decreases to about 0.6 over the course of 50 epochs. The accuracy starts at about 0.25 and increases to about 0.60 over the same period. This suggests that the model is learning from the data. Overall, the graphs suggest that the models are learning steadily from the data, although the ANN-based variants level off at a noticeably lower accuracy than the CNN-based models.
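Plots such as these are typically produced from the training history returned by Keras. A small sketch is shown below, assuming a history object returned by model.fit() is available.

    import matplotlib.pyplot as plt

    # 'history' is assumed to be the object returned by model.fit(...)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(history.history['accuracy'], label='training accuracy')
    ax1.set_xlabel('epoch'); ax1.set_ylabel('accuracy'); ax1.legend()
    ax2.plot(history.history['loss'], label='training loss')
    ax2.set_xlabel('epoch'); ax2.set_ylabel('loss'); ax2.legend()
    plt.tight_layout()
    plt.show()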
[Table: comparison of the six models (CNN, CNN + LSTM, CNN + Attention Net + LSTM, ANN, ANN + LSTM, ANN + More layers of LSTM) across accuracy, precision, recall, and AUC.]
The table above compares the performance of the six models, evaluated on four metrics: accuracy, precision, recall, and AUC. An analysis of the results follows:
Overall Performance:
• The CNN + Attention Net + LSTM model achieves the highest accuracy of 0.8925, precision of 0.9008, recall of 0.8843, and AUC of 0.9872, and hence is the best-performing model of the six.
• CNN + LSTM performed second best, with a strong performance across all metrics.
• ANN models generally perform lower than the CNN and LSTM-based models, which accounts for the performance gap visible in the table.
• Adding more LSTM layers to the ANN model (ANN + More layers of LSTM) does not close this gap; it improves slightly on ANN + LSTM but remains below the plain ANN model.
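For completeness, the four metrics in the table can be reproduced from a model's test-set predictions. A minimal scikit-learn sketch is shown below, assuming y_test holds integer emotion labels and y_prob holds the predicted class probabilities; both names are placeholders.

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

    # y_prob: (n_samples, n_classes) probabilities from model.predict(X_test)  (placeholder)
    # y_test: (n_samples,) integer emotion labels                              (placeholder)
    y_pred = np.argmax(y_prob, axis=1)

    accuracy  = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall    = recall_score(y_test, y_pred, average='macro')
    auc       = roc_auc_score(y_test, y_prob, multi_class='ovr', average='macro')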
6.1 Conclusion
This project aimed to perform emotion classification through neural networks using speech data sourced primarily from the RAVDESS, CREMA and TESS datasets. The project's primary objective was to develop efficient models that accurately identify emotions such as happiness, sadness, anger, fear, disgust, and more from speech samples. The process included data collection, feature extraction (notably Mel-frequency cepstral coefficients, MFCCs), model development, evaluation, and the insights drawn from the results.
Data Collection and Preparation: The analysis began with collecting audio files from the RAVDESS, CREMA and TESS datasets, ensuring a diverse representation of emotional states. After that, thorough data preprocessing was executed to extract essential features from the speech samples. The utilization of MFCCs allowed for transforming the speech signals into a format suitable for the neural network models.
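As an illustration of this step, MFCC features can be extracted with librosa along the following lines; the number of coefficients and the averaging over time frames are assumptions made for the sketch, not necessarily the exact settings used in this project, and the example file name is hypothetical.

    import numpy as np
    import librosa

    def extract_mfcc(path, n_mfcc=40):
        """Load an audio file and return a fixed-length MFCC feature vector."""
        signal, sr = librosa.load(path, sr=None)                 # keep the file's native sample rate
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        return np.mean(mfcc.T, axis=0)                           # average over time frames

    # Example (hypothetical RAVDESS file name):
    # features = extract_mfcc('03-01-05-01-01-01-01.wav')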
Further, the project developed six distinct neural network architectures for the emotion classification task. Model 1 employed a sequential neural network structure with dense layers; despite only moderate performance metrics (accuracy, precision, recall, and AUC), it served as a benchmark for further evaluations. Model 2 extended the dataset by including additional samples from the CREMA and TESS datasets. This augmentation notably enhanced the model's performance, yielding marked improvements in accuracy, precision, recall, and AUC scores on the test dataset compared to Model 1. Furthermore, Model 3, which incorporated an attention net and an LSTM layer for sequence processing, resulted in comparatively better performance metrics than the dense layer-based models.
The evaluation demonstrated promising results, particularly with the augmentation of the dataset in Models 2 and 3, underscoring the significance of diverse data in improving emotion classification accuracy from speech inputs. Additionally, the comparative analysis indicated the superior effectiveness of the attention net and LSTM-based models over the plain dense layer architectures in handling sequential speech features.
Several recommendations were outlined to further enhance the models' performance. Firstly, exploring advanced data augmentation techniques or alternative feature engineering approaches could enrich the training data. Secondly, tuning the hyperparameters to optimize the neural network models and improve their generalization capabilities could be helpful. Thirdly, investigating model interpretability techniques could highlight the most influential features contributing to emotion classification, enhancing the models' transparency and understandability.
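As an example of the first recommendation, simple waveform-level augmentations such as noise injection and pitch shifting can be applied before feature extraction. The sketch below uses librosa and NumPy; the noise factor and pitch step are chosen arbitrarily for illustration.

    import numpy as np
    import librosa

    def add_noise(signal, noise_factor=0.005):
        """Inject Gaussian noise into the waveform (noise_factor chosen for illustration)."""
        return signal + noise_factor * np.random.randn(len(signal))

    def shift_pitch(signal, sr, n_steps=2):
        """Shift the pitch by n_steps semitones (value chosen for illustration)."""
        return librosa.effects.pitch_shift(signal, sr=sr, n_steps=n_steps)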
In conclusion, the analysis successfully implemented neural network architectures for emotion classification from speech data, achieving commendable results, especially with the integration of additional diverse data. However, scope for improvement remains: further data augmentation and hyperparameter tuning could further improve the models' accuracy and reliability in recognizing emotions from audio samples, and expanding the dataset further with a wider variety of emotional states and demographic diversities might augment the model's capacity to generalize across diverse populations. Additionally, the exploration of advanced neural network architectures or leveraging transfer learning from pre-trained models could potentially yield more robust and nuanced emotion classification frameworks. Moreover, investigating real-time emotion classification applications or multimodal approaches combining audio and visual inputs could lead to more comprehensive emotion recognition systems for practical use cases in fields such as mental health monitoring, human-computer interaction, and beyond. Lastly, continuous attention to model interpretability and evaluation processes would not only enhance model understanding but also increase trust and acceptance of these systems.
7 References
8 Appendix
8.1 Artefact Links
Google Drive Link -
https://drive.google.com/file/d/19audQU8H0pmuiAxibxCL5d4YGE6trl55/view?usp=sharing
SharePoint Link -
https://mydbsmy.sharepoint.com/:u:/g/personal/10621051_mydbs_ie/EecAZ63OQX1Gq_6f3IRr_UBAtxMkfeIQZM6MDgT9JzAeA?e=xRPMI
8.2 Dataset Links
RAVDESS - https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio?resource=download
TESS - https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess
CREMA-D - https://www.kaggle.com/datasets/ejlok1/cremad
SAVEE - https://www.kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee