
Human Emotion Detection with Speech Recognition

Using Mel-frequency Cepstral Coefficient and CNN


Akshat Kumar Tiwari
SRM Institute of Science and Technology, SRM University
at7989@srmist.edu.in

Anika Pandey
SRM Institute of Science and Technology, SRM University
ap7481@srmist.edu.in

Dr Aruna Anguvel
SRM Institute of Science and Technology, SRM University

Abstract— Speech is one of the most common and effective modes of communication, as the listener receives not only the message but also the intent, motive and emotions of the speaker. It is not only the facial expression but also the signals in a person's speech that help us recognize the emotion of the individual. The main hurdle that emotion recognition systems face is the fact that people have varied emotional responses when exposed to the same stimuli. We have used the RAVDESS dataset from Kaggle. It is also observed that using a CNN along with MFCC features improves the accuracy of the model.

Keywords—Speech Emotion Recognition, Recursive Neural Network, Long Short-Term Memory, K Neural Network
I. INTRODUCTION

Deep learning behaves like our nervous system: when presented with a problem to solve, it processes the data it already has and builds a solution upon it. The data already present in the system is called training data. In other words, deep learning models enable a computer to process information and compute results like the human brain, but with much more computational power.

Since the rapid advancement in the field of AI, and especially deep learning, it is being widely used in various fields nowadays, such as medical treatment [1], fraud detection and emotion detection. Speech emotion recognition is the process of categorizing and classifying emotions. It has a varied field of applications and can be used to upscale and optimize existing business operations; it also has applications that help in resolving social issues, such as suicide prevention, and much more.

Multiple studies use Support Vector Machines (SVM) for emotion detection, but these were found to be less optimal than other deep learning algorithms [2]. We have used a Convolutional Neural Network (CNN) in our model, which yields better results than an SVM. For prediction we pre-train on the data, constructing a training set and a testing set, so that the prediction reaches an optimum node such that the predicted node provides satisfactory output. A CNN can be viewed as a feed-forward artificial network in which the connection pattern among its nodes is inspired by the animal visual cortex; its dimension is bounded by the depth of the structure, the size of the filters and the zero-padding.

Based on existing emotion recognition research, feature extraction is used in speech recognition as well as in the implementation of emotion recognition; the feature extraction method used here is the Mel-Frequency Cepstral Coefficient (MFCC). This is because this method is considered the most appropriate for human frequency modeling, and we therefore use it for feature extraction.

II. LITERATURE SURVEY

a. Machine learning-based speech emotion recognition system [2]: this paper provides a block diagram of a speech emotion recognition system. It also proposes the methodology of breaking the speech signal into segments of small time intervals, which makes it easier to extract features.

b. Speech emotion recognition using deep learning [3]: this project uses deep learning models to detect emotions and attempts to classify them from audio signals. The IEMOCAP dataset is used for the detection of emotions. It uses SER to gauge the emotional state of a driver, which comes in handy for preventing accidents.

c. Emotion detection in dialog systems: applications, strategies and challenges [4]: this paper has two aims. The first is to label audio signals as angry; the second is to gauge how satisfied a customer is. It uses 21 hours of recorded data in which users report phone connectivity issues; the callers then rate their experience on a scale of 1-5. Finally, the model labels the data as "garbage", "angry", "not angry" or "don't know".

d. Emotion detection: a technology overview [5]: this paper explores various factors that stimulate emotions; emotion recognition from voice is dealt with here. It describes three technologies, among them Vokaturi, which runs on a computer, and Beyond Verbal, a cloud-based service used in the project to send API requests and retrieve results; it needs at least 13 seconds of voice audio.

e. Speech emotion recognition using feature and word embedding [6]: this paper describes categorical speech emotion recognition. It tells us that we can combine text and voice features to improve the accuracy of the model.

f. A review on emotion detection and classification using speech [7]: this paper demonstrates the techniques we can use for feature extraction, which, when coupled with certain algorithms, can yield improved efficiency.

III. METHOD

We have used a CNN for our model; the architecture diagram of the system is given below.

[Figure: architecture diagram of the system]

Dataset: the dataset is collected from Kaggle (the RAVDESS dataset). It has 1440 files: 60 trials recorded by each of 24 actors. RAVDESS consists of 24 professional voices speaking lexically matched sentences in a North American accent; the emotions used here are happy, sad and angry, with the remaining recordings labeled as garbage. Every expression is generated at two levels of emotional intensity (normal, strong), together with a neutral expression. Each of the 1440 files has a unique filename holding a 7-part numerical identifier (e.g., 03-02-05-01-02-02-11.wav); these parts constitute the evoking features.
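As an illustration, the sketch below recovers this paper's four labels from the 7-part identifier. The field order (modality, vocal channel, emotion, intensity, statement, repetition, actor) follows the published RAVDESS naming convention; the helper name and the collapsing of all other emotions into "garbage" are our own assumptions, chosen to mirror the labels used in this paper.

# Hypothetical sketch: mapping a RAVDESS filename to this paper's labels.
# The field order follows the public RAVDESS naming convention; collapsing
# the unused emotions into "garbage" is our assumption, not RAVDESS itself.

RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
KEPT = {"happy", "sad", "angry"}

def label_from_filename(filename: str) -> str:
    """Map e.g. '03-02-05-01-02-02-11.wav' -> 'angry'."""
    parts = filename.removesuffix(".wav").split("-")
    emotion = RAVDESS_EMOTIONS[parts[2]]   # the third field is the emotion code
    return emotion if emotion in KEPT else "garbage"

print(label_from_filename("03-02-05-01-02-02-11.wav"))   # prints: angry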

MFCC for feature extraction:

It is one of the methods used for feature extraction. The human auditory system is assumed to process speech signals non-linearly, and this behavior is measured on a mel-frequency scale. In speech recognition, the mel-frequency cepstrum represents the short-term power spectrum of a given speech frame using a linear cosine transform of the log power spectrum on a non-linear mel-frequency scale [8].
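A minimal sketch of this extraction step using librosa (the library named in the algorithm below). The normalization step and the choice of 40 coefficients are illustrative assumptions rather than values taken from the paper.

# Sketch of MFCC feature extraction with librosa; n_mfcc=40 is assumed.
import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 40) -> np.ndarray:
    signal, sr = librosa.load(path, sr=None)   # load audio at its native rate
    signal = librosa.util.normalize(signal)    # normalize the raw waveform
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                              # (n_frames, n_mfcc) sequence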
Training and testing model: we feed training data, consisting of labeled examples, to the system, and weight training is also performed for the network. An audio clip is taken as input and is normalized before it is used to train the CNN, the purpose being that the presentation order of the examples does not affect the training performance. The outcome of this process is that the network acquires the best result from the learning data. The system is fed with energy along with pitch. The trained network weights give the determined emotion. The output is represented as a numerical value, each value corresponding to one of the expressed emotions.

The emotions detected here are happy, angry, sad and garbage.

Modules of CNN

In our CNN module we have four important layers (sketched in code after this list):

1. Convolution layer: identifies salient regions within the variable-length utterances and produces the sequence of feature maps.
2. Activation layer: a non-linear activation function applied, as is customary, to the convolution layer outputs; we have used the Rectified Linear Unit (ReLU) for this.
3. Max pooling layer: this layer passes the features with the maximum values on to the dense layers. It helps to map the variable-length inputs to a fixed-size feature array.
4. Dense layer.
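The paper does not name a framework, so the following is only a sketch of this four-layer module in Keras; every layer size is an assumption. Global max pooling is chosen because, as described above, the pooling stage must map variable-length input to a fixed-size feature array.

# Hedged Keras sketch of the four-layer module; all sizes are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_mfcc: int = 40, n_classes: int = 4) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(None, n_mfcc)),            # variable-length MFCC input
        layers.Conv1D(64, kernel_size=5),              # 1. convolution layer
        layers.Activation("relu"),                     # 2. activation layer (ReLU)
        layers.GlobalMaxPooling1D(),                   # 3. max pooling to fixed size
        layers.Dense(n_classes, activation="softmax"), # 4. dense layer
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model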
Algorithm used (a code sketch follows the list):

1. A sample audio clip is given as input.
2. The waveform is plotted from the input.
3. Using librosa, we extract the MFCCs (Mel-Frequency Cepstral Coefficients).
4. After constructing the CNN model and its layers, the data is divided into training and testing sets to train on the dataset.
5. The emotions are predicted.
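Putting these steps together, one possible end-to-end sketch is shown below. It reuses the hypothetical helpers extract_mfcc, label_from_filename and build_model defined earlier; the directory layout, the 200-frame padding and the 80/20 split are all assumptions, and the purely visual waveform-plotting step is omitted.

# End-to-end sketch of the algorithm; helpers and constants are assumed.
import glob
import os
import numpy as np
from sklearn.model_selection import train_test_split

LABELS = ["happy", "angry", "sad", "garbage"]

def pad_frames(m: np.ndarray, n_frames: int = 200) -> np.ndarray:
    """Pad or truncate an MFCC sequence to a fixed number of frames."""
    if len(m) >= n_frames:
        return m[:n_frames]
    return np.pad(m, ((0, n_frames - len(m)), (0, 0)))

files = glob.glob("ravdess/**/*.wav", recursive=True)        # step 1: input clips
X = np.stack([pad_frames(extract_mfcc(f)) for f in files])   # step 3: MFCC features
y = np.array([LABELS.index(label_from_filename(os.path.basename(f)))
              for f in files])

X_train, X_test, y_train, y_test = train_test_split(         # step 4: split the data
    X, y, test_size=0.2, random_state=42)

model = build_model(n_mfcc=40)
model.fit(X_train, y_train, epochs=30, validation_split=0.1)
pred = model.predict(X_test).argmax(axis=1)                  # step 5: predict emotions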

