Abstract— Speech is one of the most common and effective modes of communication, as the listener receives not only the message but also the intent, motive and emotions of the speaker. It is not only the facial expression but also the signals generated by one's speech that help us recognize the emotion of an individual. The main hurdle that emotion recognition systems face is that people have varied emotional responses when exposed to the same stimuli. We have used the dataset from --. It is also observed that using CNN along with MFCC improves the accuracy of the model.

Keywords—Speech Emotion Recognition, Recursive Neural Network, Long Short-Term Memory, K Neural Network

I. INTRODUCTION

Deep learning behaves like our nervous system: when presented with a problem to solve, it processes the data it already has and builds a solution upon it. The data already present in the system is called training data. In other words, deep learning models a computer to process information and compute results like the human brain, but with much more computational power.

Since the rapid advancement in the field of AI, and especially deep learning, it is being widely used in various fields nowadays, such as medical treatment [1], fraud detection and emotion detection. Speech emotion recognition is the process of categorizing and classifying emotions. It has a varied field of applications and can be used to upscale and optimize existing business operations; it also has applications that help in resolving social issues, such as suicide prevention, and much more.

Multiple studies use Support Vector Machines (SVM) for emotion detection, but these were found to be less optimal than other deep learning algorithms [2]. We have used a Convolutional Neural Network (CNN) in our model, which yields better results compared to SVM. For prediction we pre-train the data and construct a training set and a testing set, and our prediction obtains an optimum node such that the predicted node provides satisfactory output. A CNN can be viewed as a feed-forward artificial neural network in which the connection pattern among its nodes is inspired by the animal visual cortex; its dimension is bounded by the depth of the structure, as well as the filter size and zero-padding.

Based on existing emotion recognition research, feature extraction is used for speech recognition; in our implementation of emotion recognition, the feature extraction used is the Mel-Frequency Cepstral Coefficient (MFCC). This is because this method is considered the most appropriate for modeling human frequencies, and therefore the authors use it for feature extraction. The dialect used in this research as voice data is the dialect of the Makassar people. The authors chose this dialect because of its unique characteristics, which differ from the dialects of other regions. This rough-sounding dialect marks the identity of a strong and mighty person.

II. LITERATURE SURVEY

a. Machine learning-based speech emotion recognition system [2]: this paper provides a block diagram of a speech emotion recognition system. It also proposes the methodology of breaking the speech signal into segments of small time intervals, which makes it easier to extract features.

b. Speech emotion recognition using Deep Learning [3]: this project uses deep learning models to detect emotions and attempts to classify them according to audio signals. The IEMOCAP dataset is used for the detection of emotions. It uses SER to gauge the emotional state of the driver, which comes in handy to prevent accidents.

c. Emotion detection in dialog systems: applications, strategies and challenges [4]: this paper has two conclusions. The first is to label audio signals as angry; the second is to gauge how satisfied a customer is. It uses 21 hours of recorded data in which users report phone connectivity issues; the actors then rate their experience on a scale of 1-5. Finally the model labels the data as "garbage", "angry", "not angry" or "don't know".

d. Emotion detection: a technology overview [5]: this paper explores the various factors that stimulate emotions; emotion recognition with voice is dealt with here. It describes three technologies, including Vokaturi, which runs on a computer, and Beyond Verbal, a cloud-based service that is used in the project to send API requests and get results; it needs at least 13 seconds of voice audio.

e. Speech emotion recognition using feature and word embedding [6]: this paper describes categorical speech emotion recognition. It tells us that we can combine text and voice features to improve the accuracy of the model.
f.
Authorized licensed use limited to: SRM Institute of Science and Technology. Downloaded on February 15,2023 at 06:36:35 UTC from IEEE Xplore. Restrictions apply.
g. A review on emotion detection and classification using speech [7]: this paper demonstrates the techniques we can use for feature extraction, which, when coupled with certain algorithms, can yield improved efficiency.

III. METHOD

We have used a CNN for our model, and the architecture diagram of the system is given below.

The dataset is collected from Kaggle (the RAVDESS dataset). It has 1440 files recorded by 24 professional actors (60 trials per actor). RAVDESS consists of lexically matched sentences spoken in a North American accent, expressing emotions such as happy, sad and angry. Every expression is generated in 2 levels of emotional intensity (normal, strong), along with a neutral expression. Each of the 1440 files has a unique filename, which holds a 7-part numerical identifier (e.g., 03-02-05-01-02-02-11.wav); these parts constitute the evoking features.
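The MFCC feature extraction named in the introduction can be sketched from first principles: pre-emphasis, framing and windowing, a power spectrum, a mel filterbank, and a DCT. This is a minimal NumPy/SciPy illustration and not the authors' implementation; the parameter choices (25 ms frames, 26 mel filters, 13 coefficients) are common defaults assumed here.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
         n_filt=26, n_fft=512, n_ceps=13):
    # Pre-emphasis boosts the high frequencies of the speech signal.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice the signal into short overlapping frames and apply a Hamming window.
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
    frames = np.stack([emphasized[i * fstep: i * fstep + flen]
                       for i in range(n_frames)])
    frames = frames * np.hamming(flen)
    # Per-frame power spectrum.
    mag = np.abs(np.fft.rfft(frames, n_fft))
    pow_spec = (mag ** 2) / n_fft
    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        lo, center, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, center):
            fbank[m - 1, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - center, 1)
    feat = pow_spec @ fbank.T
    feat = np.where(feat == 0, np.finfo(float).eps, feat)  # avoid log(0)
    # The DCT decorrelates the log filterbank energies; keep n_ceps coefficients.
    return dct(np.log(feat), type=2, axis=1, norm="ortho")[:, :n_ceps]
```

The resulting (frames x coefficients) matrix is the kind of two-dimensional feature map that a CNN of the sort described in Section III can consume.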