Key Words: Speech Emotion Recognition, Convolutional Neural Network.

1. INTRODUCTION

In today's digital era, speech signals have become a mode of communication between humans and machines, made possible by various technological advancements. Speech recognition techniques combined with signal processing methodologies have led to Speech-to-Text (STT) technology, which is used in mobile phones as a mode of communication. Speech recognition is a fast-growing research topic that attempts to recognize speech signals. This has led to Speech Emotion Recognition (SER), a growing research topic in which advancements can drive progress in fields such as automatic translation systems, machine-to-human interaction, and synthesizing speech from text. This paper surveys and reviews various speech feature-extraction methods, emotional speech databases, classifier algorithms, and the problems present in each of these topics.

Convolutional neural networks (CNNs) are among the most popular deep learning models and have shown remarkable success in research areas [14] such as object recognition, face recognition, handwriting recognition, speech recognition, and natural language processing. The term "convolution" comes from the fact that convolution, the mathematical operation, is employed in these networks. Generally, CNNs have three fundamental building blocks: the convolutional layer, the pooling layer, and the fully connected layer. In the following, we describe these building blocks along with some basic concepts such as the SoftMax unit, the rectified linear unit, and dropout.

1.2 CONVOLUTIONAL LAYER

Convolutional layers in CNNs use convolution instead of general matrix multiplication to compute their output. As a result, the neurons in a convolutional layer are not connected to all the neurons in the preceding layer. This architecture is inspired by the fact that neurons of the visual cortex have local receptive fields; that is, each neuron is specialized to respond to stimuli limited to a specific location and structure. Using convolution thus introduces sparse connectivity and parameter sharing to CNNs, which decreases the number of parameters that must be learned.
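To make sparse connectivity and parameter sharing concrete, here is a minimal NumPy sketch (not from the paper) of the valid-mode 2-D convolution a convolutional layer applies: one small kernel, whose weights are shared across every spatial position, produces each output value from only a local patch of the input.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid-mode 2-D convolution (technically cross-correlation,
    as used in most deep learning frameworks)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # each output neuron sees only a local kh x kw patch,
            # and the SAME kernel weights are reused at every position
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0   # a 3x3 averaging filter
print(conv2d_valid(image, kernel))
```

Note that a 4x4 input and a 3x3 kernel give a 2x2 output, and the layer needs only 9 weights regardless of the input size, which is exactly the parameter saving described above.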
© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 4302
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 06 | June 2020 www.irjet.net p-ISSN: 2395-0072
Figure 1: Human Speech Emotion Recognition

2. LITERATURE REVIEW

A complete review of speech emotion recognition is presented, covering the properties of the datasets and the choice of classifier in speech emotion recognition studies. Various acoustic features of speech are investigated and some of the classifier methods are analyzed, which is helpful for the further investigation of modern methods of emotion recognition. One study investigated the prediction of subsequent reactions from emotional vocal signals based on the recognition of emotions, using different categories of classifiers. Classification algorithms such as K-NN and Random Forest are used to classify emotions accordingly. Recurrent neural networks have risen enormously and are applied to many problems in the field of data science; deep RNNs such as LSTM and bi-directional LSTM trained on acoustic features are used. A range of CNNs implemented and trained for speech emotion recognition are evaluated. Emotion is inferred from speech signals using filter banks and deep CNNs with a high accuracy rate, which suggests that deep learning can also be used for emotion detection. Speech emotion recognition has also been performed using image spectrograms with deep convolutional networks.

3.1 Matrix factorization

When a user gives feedback on a certain movie they saw (say, a rating from one to five), this collection of feedback can be represented as a matrix, where each row represents a user and each column represents a different movie. The matrix will obviously be sparse, since not everyone is going to watch every movie (we all have different tastes when it comes to movies).

One strength of matrix factorization is that it can incorporate implicit feedback: information that is not given directly but can be derived by analyzing user behavior. Using this strength, we can estimate whether a user is going to like a movie they have never seen, and if that estimated rating is high, we can recommend that movie to the user.

The concept of matrix factorization can be written mathematically as follows: the rating matrix R is approximated by the product of two low-rank factors, R ≈ P Qᵀ, so that the predicted rating of user u for item i is the dot product r̂(u, i) = p_u · q_i.

3.2 K-nearest neighbor
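No detail on the K-NN classifier survives in this excerpt. As a minimal illustrative sketch (the function name and toy data are our own, not the paper's), K-NN labels a sample by a majority vote among the k training samples closest to it in feature space, here over pre-extracted feature vectors:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # labels of the k closest training samples
    nearest = [y_train[i] for i in np.argsort(dists)[:k]]
    # majority vote among the k neighbors
    return Counter(nearest).most_common(1)[0][0]

# toy 2-D "acoustic features" with two emotion labels
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = ["sad", "sad", "happy", "happy"]
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))  # happy
```

In practice, the feature vectors would be acoustic descriptors (e.g., MFCCs) rather than 2-D points, and k is chosen by cross-validation.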
…0 or 1. The Jaccard similarity coefficient J is given as

J = Z11 / (Z11 + Z01 + Z10)

where
Z11 represents the total number of attributes where X and Y both have a value of 1,
Z01 represents the total number of attributes where the attribute of X is 0 and the attribute of Y is 1,
Z10 represents the total number of attributes where the attribute of X is 1 and the attribute of Y is 0.

For example, for X = (1, 1, 0, 1) and Y = (1, 0, 0, 1): Z11 = 2, Z10 = 1, Z01 = 0, so J = 2 / (2 + 1 + 0) = 2/3.

Emotion label codes (column headers not recovered in this excerpt):

happy      1  1  1  1
sad        2  2  2  2
angry      3  3  3  3
scared     4  4  4  4
bored      5  -  -  -
surprised  -  5  5  -

6. DATABASES

6.1 RAVDESS: NORTH AMERICAN ENGLISH DATABASE

The RAVDESS is a validated multimodal database of emotional speech and song. The database is gender balanced, consisting of 24 professional actors vocalizing lexically matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity, with an additional neutral expression. All conditions are available in face-and-voice, face-only, and voice-only formats. The set of
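As a practical aside (not stated in the paper): RAVDESS file names encode modality, vocal channel, emotion, intensity, statement, repetition, and actor as seven hyphen-separated two-digit codes, so the emotion label can be recovered directly from the file name. A minimal sketch, assuming the database's standard naming scheme:

```python
# Emotion codes used in RAVDESS file names (per the database's naming scheme)
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_name(filename):
    """Split a RAVDESS file name of the form
    Modality-VocalChannel-Emotion-Intensity-Statement-Repetition-Actor.wav
    and return the emotion label and actor id."""
    parts = filename.removesuffix(".wav").split("-")
    emotion = EMOTIONS[parts[2]]   # third field is the emotion code
    actor = int(parts[6])          # last field is the actor (odd = male, even = female)
    return emotion, actor

print(parse_ravdess_name("03-01-06-01-02-01-12.wav"))  # ('fearful', 12)
```

A parser like this is typically the first step when building SER training labels from the raw RAVDESS download.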