ADVANCED EMOTION DETECTION FOR

CUSTOMER SUPPORT ORGANIZATION USING


VOICE RECOGNITION

An Interim Project Report

Submitted by

VADDI RAVI TEJA [CB.EN.U4CSE18062]


M V LOKESH CHOWDARY [CB.EN.U4CSE18238]
V SANJEEVI [CB.EN.U4CSE18251]
M J S D AKHIL [CB.EN.U4CSE18435]

Under the guidance of


Arun Kumar C
(Assistant Professor, Department of Computer Science & Engineering)

in partial fulfillment for the award of the degree


of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING

AMRITA VISHWA VIDYAPEETHAM


Amrita Nagar PO, Coimbatore - 641 112, Tamil Nadu
March 8, 2022

ABSTRACT

Emotion recognition is a part of speech recognition that is gaining popularity, and the need for it is increasing enormously. Although there are methods to recognize emotion using machine learning techniques, this project attempts to use deep learning and an image classification method to recognize and classify emotion from speech signals.
TABLE OF CONTENTS

ABSTRACT

1 INTRODUCTION

2 LITERATURE SURVEY
2.1 Machine Learning Based Speech Emotions Recognition System
2.2 Speech Emotion Recognition using Deep Learning
2.3 Emotion Detection in Dialog Systems: Applications, Strategies and Challenges
2.4 Emotion Detection: A Technology Review
2.5 Speech Emotion Recognition Using Speech Feature and Word Embedding
2.6 A Review on Emotion Detection and Classification using Speech
2.7 Detection and Analysis of Emotion from Speech Signals
2.8 Speech Emotion Recognition: Methods and Cases Study

3 ARCHITECTURE DIAGRAM

4 CONCLUSION

5 REFERENCES
Chapter 1

INTRODUCTION

The proposed system is an application that helps an organization improve its quality of service, refine business ideas, and evaluate its employees. The application will be used by the administration and the employees, with a separate login for each. Employees will have a system that informs them of the customer's mood, while the administration will be able to see what customers think about the company. On the employee side, we feed the model an audio recording of a phone conversation, split into time slices of 10 seconds each. Each 10-second slice is used by the model to infer the emotion of the customer. If at some point the customer is found to be angry, the employee can try to talk them through the problem or be careful about what they are about to say. Emojis will be used to indicate the mood of the customer.
On the administration side, there is access to the details of all customers. A rating system will rate each employee according to their performance (e.g., if the customer's mood was happy before the call ended, the call is considered successful and the employee receives a high rating). This rating system can be used to track the performance of every employee, so employees with high ratings can be given a raise or assigned to customers who are difficult to handle. With all the collected data, we will also visualize the customer experience. If many calls end with the customer being angry, it ultimately means there is some problem with the product; with this information, companies can work on their product and improve their business. In this way, the application also helps a company make its business more profitable.
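The report does not fix an implementation for the slicing step; the following is a minimal sketch, assuming a mono WAV recording and the librosa and soundfile libraries (illustrative choices, not mandated by the project), of how a call could be split into 10-second slices for the model.

```python
# Minimal sketch of the 10-second slicing step described above.
# librosa/soundfile and the file names are illustrative assumptions.
import librosa
import soundfile as sf

def split_call(path, slice_seconds=10):
    """Split a recorded call into fixed-length slices for the emotion model."""
    audio, sr = librosa.load(path, sr=None, mono=True)
    samples_per_slice = int(slice_seconds * sr)
    slices = []
    for start in range(0, len(audio), samples_per_slice):
        # Each chunk would be handed to the emotion model in turn.
        slices.append(audio[start:start + samples_per_slice])
    return slices, sr

# Example usage: write each slice to disk for the analysis module.
slices, sr = split_call("call_recording.wav")
for i, chunk in enumerate(slices):
    sf.write(f"slice_{i:03d}.wav", chunk, sr)
```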
Chapter 2

LITERATURE SURVEY

2.1 Machine Learning Based Speech Emotions Recognition System

This paper gives a brief introduction to speech emotion recognition, along with a block-diagram description of a speech emotion recognition system. In the field of affect detection, a very important role is played by the suitable choice of a speech database. For a good emotion recognition system, mainly three kinds of databases are used: (1) elicited emotional speech databases, (2) actor-based speech databases, and (3) natural speech databases.
Based on the kind of utterances they are able to recognize, speech recognition systems can be separated into different classes: isolated, connected, spontaneous, and continuous words.
Extracting relevant emotional features from the speech is the second important step in emotion recognition. There is no unique way to classify features, but acoustic and linguistic feature taxonomies are usually considered separately.

2.2 Speech Emotion Recognition using Deep Learning

This project attempts to use deep learning and an image classification method to recognize and classify emotion from speech signals. Inception Net is used for emotion recognition with the IEMOCAP data set; the final accuracy of this emotion recognition model using the Inception Net v3 model is 35%.
SER is also used in in-car board systems: information about the mental state of the driver can be provided to the system to take safety measures and prevent accidents.
Classification algorithms such as k-NN and Random Forest have been used to classify emotions. Recurrent neural networks, which try to solve many problems in the field of data science, are rising enormously; deep RNNs such as LSTM and bidirectional LSTM trained on acoustic features have been used, and a wide range of CNNs implemented and trained for speech emotion recognition have been evaluated. Emotion has been inferred from speech signals using filter banks and deep CNNs with a high accuracy rate, which suggests that deep learning can also be used for emotion detection. Speech emotion recognition has also been performed using image spectrograms with deep convolutional neural networks.
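As a concrete illustration of the image-classification approach summarized above, a speech clip can be rendered as a log-mel spectrogram image and then fed to an image model such as Inception v3. The sketch below assumes librosa and matplotlib; the surveyed paper does not specify its tooling.

```python
# Sketch: render a speech clip as a spectrogram image, the input
# format used by image-classification approaches to SER.
# Library and parameter choices here are illustrative assumptions.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

audio, sr = librosa.load("utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Save the spectrogram as an image an Inception-style CNN could consume.
fig, ax = plt.subplots(figsize=(3, 3))
librosa.display.specshow(log_mel, sr=sr, ax=ax)
ax.set_axis_off()
fig.savefig("utterance_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```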

2.3 Emotion Detection in Dialog Systems: Applications, Strategies and Challenges

In this paper, they describe experiments conducted on two applications of emotion detection in human-machine communication: first, detecting anger in voice portal services, and second, predicting user satisfaction in dialog recordings.
The data set has 21 hours of recordings in which customers report problems with their phone connection. The data are labeled as Garbage, Unsure, Angry, and Not Angry. For each turn, the labelers can assign an anger value between 1 and 5 or mark the turn as "non applicable" (garbage) using a self-developed GUI-based labeling tool. The challenge with user satisfaction is that it usually can only be measured by asking the user; in the experiment, 25 users were given a dialog script and their dialogs were rated manually. For the classification part, a Hidden Markov Model is used.
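The paper names the Hidden Markov Model without implementation details; a common setup, sketched below under that assumption, trains one HMM per emotion class and classifies by maximum likelihood. The hmmlearn library and the feature shapes are illustrative choices, not taken from the paper.

```python
# Sketch of per-class HMM emotion classification (one Gaussian HMM
# per label, decided by maximum log-likelihood). hmmlearn and the
# feature layout are assumptions for illustration.
import numpy as np
from hmmlearn import hmm

def train_class_hmms(features_by_label, n_states=3):
    """Fit one Gaussian HMM per emotion label.

    features_by_label maps a label (e.g. "angry") to a list of
    per-utterance feature matrices of shape (frames, n_features).
    """
    models = {}
    for label, utterances in features_by_label.items():
        X = np.concatenate(utterances)          # stack frames of all utterances
        lengths = [len(u) for u in utterances]  # per-utterance frame counts
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, utterance_features):
    """Pick the label whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda lbl: models[lbl].score(utterance_features))
```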

2.4 Emotion Detection: A Technology Review

In this paper, they explore the different sources from which emotions can be read, along with existing technologies developed to recognize them. Many sources are discussed, such as speech, image, video, and gait; here we look only at emotion detection from speech.
The paper discusses three technologies:
Beyond Verbal: the services are cloud-based, so using them in a project is as simple as making a request to their API. The technology needs at least 13 seconds of spoken voice to give a result.
Vokaturi: it operates on the computer where it is being used and does not require the internet. Vokaturi results are expressed as a combination of Paul Ekman's six basic emotions.
EmoVoice: a framework for real-time recognition of emotions from the acoustic properties of speech. It offers tools to record, analyze, and recognize human behavior in real time.
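To illustrate the "simple request to their API" pattern described for the cloud-based service, the sketch below posts a recording over HTTP. The endpoint URL, auth header, and response fields are entirely hypothetical; the paper gives no API details, so the vendor's documentation would define the real interface.

```python
# Hypothetical illustration of calling a cloud emotion-analysis API.
# The URL, auth header, and response fields are made up for the sketch.
import requests

API_URL = "https://api.example-emotion-service.com/v1/analyze"  # hypothetical
API_KEY = "YOUR_API_KEY"

with open("call_recording.wav", "rb") as f:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": f},
        timeout=30,
    )
response.raise_for_status()
print(response.json())  # e.g. {"emotion": "angry", "confidence": 0.82} (hypothetical)
```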

2.5 Speech Emotion Recognition Using Speech Feature and Word Embedding

This paper presents categorical speech emotion recognition using speech features and word embeddings. Text features can be combined with speech features to improve emotion recognition accuracy, and both features can be obtained from speech. Here, speech segments, obtained by removing silences in an utterance, are where the acoustic features are extracted for speech-based emotion recognition. Word embeddings are used as the input feature for text emotion recognition, and a combination of both features is proposed to improve performance.
Two unidirectional LSTM (Long Short-Term Memory) layers are used for text, and fully connected layers are applied for acoustic emotion recognition. For speech-based emotion recognition, processing the raw data can introduce issues such as unnecessary information that yields poor performance; one solution is to use only the segmented speech part of the utterance, removing silence before feature extraction.
To start acoustic feature extraction from speech segments, the speech files in the data set are first read as vectors. For each utterance (each file), silence removal is performed to obtain speech segments, based on two parameters, one of which is a minimum threshold (0.001).
A simple model to improve speech emotion recognition accuracy is obtained by integrating LSTM with fully connected (dense) layers. The results show that the combination of speech and text achieves higher accuracy, i.e., 75.49%.
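A minimal sketch of the silence-removal step follows, keeping only frames whose amplitude exceeds the minimum threshold of 0.001 quoted above; the frame length and the use of numpy/librosa are assumptions, since the paper does not give them.

```python
# Sketch of amplitude-threshold silence removal as described above.
# Frame length and libraries are illustrative assumptions.
import numpy as np
import librosa

def remove_silence(path, threshold=0.001, frame_len=400):
    """Keep only frames whose mean absolute amplitude exceeds threshold.

    frame_len=400 is 25 ms at 16 kHz (an assumed frame size).
    """
    audio, sr = librosa.load(path, sr=None)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    voiced = frames[np.mean(np.abs(frames), axis=1) > threshold]
    return voiced.reshape(-1), sr  # concatenated speech segments
```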

2.6 A Review on Emotion Detection and Classification using Speech

This paper surveys the models proposed by scientists and researchers and their performances. Using the extracted features, SVM and GMM classifiers sort the speaker's age into different age slots, and the emotion is then predicted on the basis of the trained data.
LPCC stands for Linear Prediction Cepstral Coefficient. Feature extraction is used to represent the speech signal by a limited number of measures of the signal. LPCC encapsulates the qualities of the specific speech channel and characterizes vowels more successfully. LPCC uses the autocorrelation technique for its computation; its principal disadvantage is that it is significantly sensitive to quantization error. LPC stands for Linear Predictive Coding; it emulates the human vocal tract and yields robust speech features.
MFCC is the most important technique for feature extraction from speech. One of the challenges is words with multiple sentiment polarities; for such words, identifying the exact emotion is very difficult. Audio also sometimes carries a lot of noise, and emotion analysis cannot be done properly on noisy audio.
Most of the research done on speech emotion recognition used only classification techniques, but recent advancements show development in the sentiment analysis of the classifier. Feature extraction from the speech signal is the second most important step in this field. The MFCC algorithm is widely used in most papers for feature extraction because it performs better than the other techniques in the presence of noise. Still, there are many open issues to be solved, such as diversity in emotion, recognizing spontaneous emotion, and speaker recognition in the case of simultaneous conversation. Further research is expected to improve accuracy and increase the number of emotions recognized in an input speech signal.
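Since MFCC recurs throughout the survey as the dominant feature, a short sketch of computing MFCCs follows; librosa and the parameter values are assumed choices, as the surveyed papers do not prescribe a toolkit.

```python
# Sketch: MFCC extraction, the most widely used front end in the
# surveyed papers. librosa and the parameters are assumptions.
import librosa

audio, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames): one 13-coefficient vector per frame
```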

2.7 Detection and Analysis of Emotion from Speech Signals

This study states that in the presence of negative stimuli that evoke negative emotions, the heart rate slows more markedly than in the presence of positive stimuli. Emotion classification is one of the most difficult tasks in the field of audio signal processing; speaker or speech recognition problems are relatively simpler than recognizing emotions from speech. Audio signals are one of the most important communication media and can be processed to recognize speakers, languages, and even emotions. The basic principle of emotion recognition lies in the analysis of the acoustic differences that occur when the same thing is said in different emotional situations. In one of the cited works, a child's mood is identified based on the audio signal. In addition to speaker and/or speech recognition capabilities, audio signals carry several features that represent the speaker's emotional state. This paper addresses the emotional classification of human speech; the purpose of the study is to investigate whether the type of speech depends on the human emotional state. Since emotions directly affect the nervous system, heart rate is also affected, so a person's heart rate can be measured to provide information about their emotional state. Interestingly, the voice signal also reflects the speaker's heart rate, because the heart rate affects the voice.

2.8 Speech Emotion Recognition: Methods and Cases Study

This paper states that there is still an ongoing debate about which features best help in identifying the correct emotion. Two data sets are used in the study: one Spanish and one Berlin (German). An RNN classifier is used for the classification, and MLR and SVM classifiers are used to compare performance. Many more classifiers can help with this task, such as GMM, HMM, NN, BEL, ANFIS, MLP, and GP. The voice segment selection algorithm treats the voice segment as a texture image, which differs from traditional processing methods. The combination of MFCC and modulation spectral (MS) features achieves higher accuracy than using either of them independently.
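The paper names the RNN classifier without architectural details; the following is a minimal Keras sketch of an LSTM classifier over per-frame MFCC features, with the layer sizes and four-emotion output chosen purely for illustration.

```python
# Minimal sketch of an RNN emotion classifier over MFCC sequences.
# Keras, the layer sizes, and the 4-class output are assumptions.
from tensorflow.keras import layers, models

NUM_EMOTIONS = 4  # e.g. angry, happy, sad, neutral (illustrative)
N_MFCC = 13       # MFCC coefficients per frame

model = models.Sequential([
    # Variable-length MFCC sequences; zero-padded frames are masked out.
    layers.Masking(mask_value=0.0, input_shape=(None, N_MFCC)),
    layers.LSTM(64),
    layers.Dense(32, activation="relu"),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```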

Chapter 3

ARCHITECTURE DIAGRAM

Figure 3.1: Architecture diagram

There are two sides to the project: one is the web framework and the other is the module that performs the analysis. These modules work on the analysis part. If a user requests an operation from the web framework, the respective process is started on the module side and the result is returned. All the important details are collected and stored in a database for further analysis. Once all operations are done, the running process is terminated.
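The report does not name the web framework; the following is an illustrative sketch of the request flow just described (the web side starts an analysis, the result is stored and returned), using Flask and dummy names as assumptions.

```python
# Illustrative sketch of the web-framework side of the architecture.
# Flask, the route, and analyze_audio are assumed names; the report
# does not specify a framework or interface.
from flask import Flask, request, jsonify

app = Flask(__name__)

def analyze_audio(file_storage):
    """Placeholder for the analysis module: returns a detected emotion."""
    # The real module would slice the audio, run the emotion model,
    # and store the important details in the database.
    return {"emotion": "neutral", "confidence": 0.5}  # dummy result

@app.route("/analyze", methods=["POST"])
def analyze():
    result = analyze_audio(request.files["audio"])
    # ... store result in the database for further analysis ...
    return jsonify(result)

if __name__ == "__main__":
    app.run()
```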
Chapter 4

CONCLUSION

This chapter reviews the ongoing developments in speech emotion recognition. In the field of emotion identification, database choice plays an important role in achieving good accuracy. Feature extraction from the speech signal is the second most important step in this field. From our examination, we find that the MFCC algorithm is widely used in most papers for feature extraction because it performs better than the other techniques in the presence of noise. The next important part of sentiment analysis is the choice of classifier. Still, there are many open issues to be solved, such as diversity in emotion, recognizing spontaneous emotion, and speaker recognition in the case of simultaneous conversation. Further research is expected to improve accuracy and increase the number of emotions recognized in an input speech signal.
Chapter 5

REFERENCES

[1] B. T. Atmaja, K. Shirai and M. Akagi, "Speech Emotion Recognition Using Speech Feature and Word Embedding," 2019.

[2] A. Tripathi, U. Singh, G. Bansal, R. Gupta and A. K. Singh, "A Review on Emotion Detection and Classification using Speech," May 2020.

[3] F. Burkhardt, M. van Ballegooy, K. Engelbrecht, T. Polzehl and J. Stegmann, "Emotion Detection in Dialog Systems: Applications, Strategies and Challenges," 2009.

[4] J. García-García, V. Penichet and M. Lozano, "Emotion Detection: A Technology Review," September 2017.

[5] A. Davletcharova, S. Sugathan, B. Abraham and A. P. James, "Detection and Analysis of Emotion From Speech Signals."

[6] L. Kerkeni, Y. Serrestou, M. Mbarki, K. Raoof and M. A. Mahjoub, "Speech Emotion Recognition: Methods and Cases Study."
