
INFERRING EMOTION FROM SPEECH

A project Report submitted


in partial fulfillment for the award of the Degree of

Bachelor of Technology in
Computer Science and Engineering by

K. JOTHIKA (U18CS013)
NEETU (U18CS017)
S. MANASA (U18CS027)
P. CHANDRIKA (U18CS062)

Under the guidance of


Dr. Anitha Karthi

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SCHOOL OF COMPUTING


BHARATH INSTITUTE OF HIGHER EDUCATION AND RESEARCH
(Deemed to be University Estd u/s 3 of UGC Act, 1956)

CHENNAI 600 073, TAMILNADU, INDIA


April, 2022
CERTIFICATE

This is to certify that the project report entitled “Inferring Emotion from Speech” submitted by
K. Jothika (U18CS013), Neetu (U18CS017), S. Manasa (U18CS027), and P. Chandrika (U18CS062)
to the Department of Computer Science and Engineering, Bharath Institute of Higher Education
and Research, in partial fulfillment for the award of the degree of B. Tech in Computer Science
and Engineering is a bona fide record of project work carried out by them under my supervision.
The contents of this report, in full or in parts, have not been submitted to any other Institution or
University for the award of any other degree.

<Signature of Supervisor>
Dr. Anitha Karthi
Department of Computer Science & Engineering,
School of Computing, Bharath Institute of Higher Education
and Research
April, 2022

<Signature of Head of the Department>

Dr. B. Persis Urbana Ivy
Professor & Head


Department of Computer Science & Engineering,
School of Computing, Bharath Institute of Higher
Education and Research,
April, 2022
DECLARATION

We declare that this project report titled “Inferring Emotions from Speech” submitted in partial
fulfillment of the degree of B. Tech in Computer Science and Engineering is a record of
original work carried out by us under the supervision of Dr. Anitha Karthi, and has not formed
the basis for the award of any other degree or diploma, in this or any other Institution or
University. In keeping with the ethical practice in reporting scientific information, due
acknowledgements have been made wherever the findings of others have been cited.

<Signature>
K. Jothika
(U18CS013)

<Signature>
Neetu
(U18CS017)

<Signature>
S. Manasa
(U18CS027)

<Signature>
P. Chandrika
(U18CS062)

Chennai
<Date>
ACKNOWLEDGMENTS

First, we wish to thank the almighty who gave us good health and success throughout our project work.
We express our deepest gratitude to our beloved President Dr. J. Sundeep Aanand, and Managing
Director Dr. E. Swetha Sundeep Aanand for providing us the necessary facilities for the completion of our
project.
We take great pleasure in expressing sincere thanks to Vice Chancellor (I/C) Dr. K. Vijaya Baskar Raju,
Pro Vice Chancellor (Academic) Dr. M. Sundararajan, Registrar Dr. S. Bhuminathan and Additional Registrar
Dr. R. Hari Prakash for backing us in this project.
We thank our Dean Engineering Dr. J. Hameed Hussain for providing sufficient facilities for the
completion of this project.
We express our immense gratitude to our Academic Coordinator Mr. G. Krishna Chaitanya for his
eternal support in completing this project.
We thank our Dean, School of Computing Dr. S. Neduncheliyan for his encouragement and the valuable
guidance.
We record our indebtedness to our Head, Department of Computer Science and Engineering, Dr. B. Persis
Urbana Ivy, for her immense care and encouragement towards us throughout the course of this project.
We also take this opportunity to express a deep sense of gratitude to our Supervisor Dr. Anitha Karthi for
her cordial support, valuable information and guidance, which helped us in completing this project through its
various stages.
We thank our department faculty, supporting staff and friends for their help and guidance to complete this
project.
ABSTRACT

Emotion recognition has been a rapidly growing research domain in recent years. Unlike humans, machines
lack the ability to perceive and show emotions, but human-computer interaction can be improved by
automated emotion recognition, thereby reducing the need for human intervention. It offers tremendous
scope for human-computer interaction, robotics, health care, biometric security and behavioral modeling.
Emotion recognition systems recognize emotions from speech signals.
Inferring Emotion from Speech, abbreviated as IEFS, is the act of attempting to recognize human emotion
and the associated affective states from speech. It capitalizes on the fact that voice often reflects
underlying emotion through tone and pitch. In this project, basic emotions like calm, happy, fearful and
disgust are analyzed from emotional speech signals. We use machine learning techniques like the
Multilayer Perceptron Classifier (MLP Classifier), which categorizes the given data into respective groups
that are non-linearly separated. Mel-frequency cepstrum coefficients (MFCC), chroma and Mel features are
extracted from the speech signals and used to train the MLP classifier. To achieve this objective, we
use Python libraries like Librosa, sklearn, pyaudio, numpy and soundfile to analyze the speech
modulations and recognize the emotion.

TABLE OF CONTENTS

DESCRIPTION PAGE NUMBER

CERTIFICATE
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS/ NOTATIONS/ NOMENCLATURE
1. INTRODUCTION
1.1 About the Project
2. LITERATURE SURVEY
3. EXISTING SYSTEM & PROPOSED SYSTEM
3.1 EXISTING SYSTEM
3.2 PROPOSED SYSTEM
4. IMPLEMENTATION
4.1 SPEECH DATABASE
4.2 DATA PRE-PROCESSING
4.3 MACHINE LEARNING ALGORITHM
4.4 EMOTION RECOGNITION
4.5 DEPLOYMENT
4.6 ACCURACY
5. METHODOLOGY
5.1 IMPORT MODULES
5.2 LOAD THE SPEECH EMOTION DATASET
5.3 EXPLORATORY DATA ANALYSIS
5.4 FEATURE EXTRACTION USING MFCC
5.5 CREATING THE MODEL
5.6 PLOT THE MODEL RESULTS
6. SYSTEM ANALYSIS AND DESIGN
6.1 SYSTEM ARCHITECTURE
6.2 FLOWCHART
7. SYSTEM REQUIREMENTS AND SPECIFICATIONS
7.1 BASIC REQUIREMENTS
7.2 REQUIREMENTS
7.2.1 SOFTWARE REQUIREMENTS
7.2.2 HARDWARE REQUIREMENTS
8. RESULT AND DISCUSSION
8.1 RESULT
9. CONCLUSION AND FUTURE WORK
9.1 CONCLUSION
9.2 FUTURE WORK
REFERENCES
APPENDIX

LIST OF FIGURES

ABBREVIATIONS/ NOTATIONS/ NOMENCLATURE

Inferring Emotion from Speech (IEFS)


Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Multi-Variate Regression (MVR)
Support Vector Machine (SVM)
Multi-Layer Perceptron (MLP)
Artificial Neural Network (ANN)
Toronto Emotional Speech Set (TESS)
Mel-Frequency Cepstral Coefficient (MFCC)

CHAPTER 1
INTRODUCTION

In naturalistic human-computer interaction (HCI), Inferring Emotion from Speech (IEFS) is becoming
increasingly important in various applications. At present, Inferring Emotion from Speech is an emerging
field at the crossing of artificial intelligence and artificial psychology; besides, it is a popular research
topic in signal processing and pattern recognition. The research is widely applied in human-computer
interaction, interactive teaching, entertainment, security, and so on. A speech emotion processing and
recognition system is generally composed of three parts: speech signal pre-processing, followed by
feature extraction and then emotion recognition. The most promising technique for emotion recognition
is the neural network-based approach. Artificial Neural Networks (ANNs) are biologically inspired tools
for information processing.
Inferring emotions from speech has applications in natural human-machine interaction, such as web
movies and computer tutorial applications, and in in-car board systems that use information about the
mental state of the driver. It can also be used as a diagnostic tool for therapists, and it may be useful in
automatic translation systems.

Inferring emotions from speech (IEFS) is the task of recognizing the emotional aspects of speech
irrespective of the semantic contents. While humans can efficiently perform this task as a natural part of
speech communication, the ability to conduct it automatically using programmable devices is still an
ongoing subject of research.
Robots capable of understanding emotions could provide appropriate emotional responses and
exhibit emotional personalities. In some circumstances, humans could be replaced by computer-generated
characters with the ability to conduct very natural and convincing conversations by appealing to human
emotions. Machines therefore need to understand the emotions conveyed by speech; only with these
capabilities can an entirely meaningful dialogue, based on mutual human-machine trust and understanding,
be achieved.

Emotion recognition is gaining popularity in research, as it is a key to solving many problems and
makes life easier.

Inferring Emotion from Speech is challenging to achieve compared with other recognition tasks due to its
complexity. Moreover, present computer systems lack the ability to impersonate human responses.
Inferring Emotions from Speech is therefore built as a classification problem solved using numerous ML
algorithms, and recognizing the emotional state of the user gives a significant advantage in such
applications.

In addition to emotion recognition, various other factors such as valence, polarity and arousal play
prominent roles in identifying one's state of mind. Mapping the state of mind using valence, polarity,
arousal and emotion recognition is called sentiment analysis. Sentiment analysis is used to understand a
person's opinion and attitude towards a particular topic, or at a particular instant of time, using various
computational approaches.
Emotion recognition has wide scope in many areas such as human-computer interaction, biometric
security, etc. It thus provides insight into artificial intelligence or machine intelligence, which uses various
supervised and unsupervised machine-learning algorithms to simulate the human brain. The study of
human emotions and their interpretation, processing and adaptation by machines is known as affective
computing or artificial emotional intelligence. A human emotional state can be recognized from facial
expressions, body movements, speech, written text, brain or heart signals, etc., using various machine
learning techniques that extract the required features or patterns from the collected data.

The approach for Inferring Emotion from Speech (IEFS) primarily comprises two phases, known as
the feature extraction phase and the feature classification phase. In the field of speech processing, several
features have been derived for the first phase. The second phase performs feature classification using
Multi-Layer Perceptron (MLP) classifiers. The speech signal is usually considered to be non-stationary;
hence, non-linear classifiers are considered to work effectively for IEFS. The MLP is widely used for
classifying information derived from basic-level features. Energy-based features such as Mel-Frequency
Cepstrum Coefficients (MFCC) are often used for effective emotion recognition from speech.
The aim of developing machines that interpret paralinguistic data, such as emotion, is to support
human-machine interaction and to make that interaction clearer and more natural. In this project we use
classification models such as an ANN to predict the emotion in a speech sample. MFCC is used for
feature extraction, and the model is trained on the TESS dataset along with data augmentation.

CHAPTER 2
LITERATURE SURVEY

This study discusses seven different approaches that are widely used for speech recognition systems
(SRS), including Mel Frequency Cepstral Coefficients (MFCC) and acoustic-phonetic recognition. After a
comparative study of these approaches, it concludes that the Hidden Markov Model (HMM) is the best-suited
approach for an SRS because it is efficient and robust, reduces time and complexity, can match sequences
with missing information (so continuity is less important), and provides reliable time alignment between
reference and test patterns [1].
This work covers statistical modeling, robust and noisy speech recognition, classifiers, feature
extraction, performance evaluation and databases. It notes that there is now increasing interest in finding
ways to bridge the performance gap between human and machine speech recognition, even though what we
know about human speech processing is still very limited [2].
This system comprises signal preprocessing, feature extraction, a language model, a decoder and
emotion recognition. The speech sound is captured using a microphone, which converts it into an electrical
signal, and the sound card inside the computer changes the analog signal into a digital signal. Since speech
is the basic mode of communication between human beings, a feasible interface is required to connect
humans with machines [3].
Sequence-to-sequence end-to-end (E2E) automatic speech recognition (ASR) models directly map an
acoustic feature sequence to a word sequence. This work investigates a set of models, including an LSTM
model. Due to their general capability to handle even non-monotonic alignment, attention models are
widely used in many machine learning problems such as translation, and the techniques are evaluated on
top of the best recipe for the Switchboard English speech recognition benchmark [4].
This work again spans signal preprocessing, feature extraction, a language model, a decoder and
speech recognition, and notes the increasing interest in bridging the performance gap given how limited
our knowledge of human speech processing is. Although these areas of investigation are important, the
significant advances will come from studies in acoustic phonetics, speech perception and linguistics.
Various neural network models, such as deep neural networks, recurrent networks and LSTMs, are used to
obtain the required output [5].
CHAPTER 3
EXISTING SYSTEM & PROPOSED SYSTEM

3.1 EXISTING SYSTEM


The existing systems mostly use the following algorithms, methods and datasets to predict
emotions from speech:
• Random Decision Forest (RDF)
• Convolutional neural network (CNN)
• CREMA-D dataset
• Naïve Bayes
• Recurrent neural network (RNN) classifier
• Multivariate Regression (MVR)
• Support Vector Machine Technique (SVM)

3.1.1 Random Decision Forest (RDF)


With the increase in computational power, we can now choose algorithms that perform very
intensive calculations. One such algorithm is Random Forest, which is discussed in this section.
While the algorithm is very popular in various competitions (e.g., the ones running on Kaggle),
the end output of the model is like a black box and hence it should be used judiciously.

3.1.2 CREMA-D Dataset


CREMA-D is a multi-modal actor dataset of original clips from 91 actors, labelled with the
emotions embedded in each audio file.
3.1.3 Naive Bayes
Naive Bayes is a probabilistic classifier and supervised learning algorithm; it predicts on the
basis of the probability of an object and can make quick predictions. It assumes that all features are
independent.
3.1.4 RNN Classifier
RNNs are suitable for temporal data, also called sequential data. A recurrent neural network is a
type of neural network where the output from the previous step is fed as input to the current step.
RNNs are mainly used for:
• Sequence classification
• Sequence labelling
• Sequence generation

3.1.5 Convolutional Neural Network (CNN)


Convolutional neural networks (CNNs) are one of the most popular models used today. This
neural network computational model uses a variation of multilayer perceptrons and contains one or
more convolutional layers that can be either entirely connected or pooled. These convolutional
layers create feature maps that record a region of the image, which is ultimately broken into
rectangles and sent out for non-linear processing.

3.1.6 Support Vector Machine (SVM)


A Support Vector Machine (SVM) is a supervised machine learning algorithm which can be
used for either classification or regression challenges; however, it is mostly used for classification
problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the
number of features) with the value of each feature being the value of a particular coordinate. Then,
we perform classification by finding the hyper-plane that differentiates the two classes well.

3.2 PROPOSED SYSTEM


In this system we use a feed-forward Artificial Neural Network (ANN) model, which is the most widely
used model in many practical applications. It is a biologically inspired classification algorithm in which
every unit in a layer is connected with all the units in the previous layer. The MLP Classifier is a
Multi-Layer Perceptron, a supervised classification technique that uses back-propagation; it consists of an
input layer, one or more hidden layers, and an output layer. Based on the fact that voice often reflects
underlying emotions through tone and pitch, the MLP classifier is used to classify the emotion in the given
input signals (a configuration sketch follows the list below).
• Instead of a CNN architecture, a feed-forward ANN model with an MLP Classifier is used.
• Unlike SVM or Naive Bayes, the MLP Classifier relies on an underlying neural network to classify the
emotions in the given input signals.
• Librosa, a Python library for analyzing audio and music, is used to extract features from the sound files;
Soundfile is used to read the audio files, and scikit-learn is used to build and evaluate the model.
• TESS dataset is used. It is a collection of audio clips of 2 women expressing 7 different emotions (anger,
disgust, fear, happiness, pleasant surprise, sadness, and neutral).
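
As a concrete illustration of the proposed classifier, a scikit-learn MLPClassifier could be configured along the following lines. This is a minimal sketch: the hyperparameter values shown are plausible defaults, not the exact settings used in this project.

```python
from sklearn.neural_network import MLPClassifier

# Feed-forward MLP trained with back-propagation: an input (feature) layer,
# one hidden layer, and an output layer over the emotion classes.
model = MLPClassifier(
    hidden_layer_sizes=(300,),    # one hidden layer of 300 units (illustrative)
    activation="relu",            # non-linear activation in the hidden layer
    solver="adam",                # gradient-based optimizer for back-propagation
    alpha=0.01,                   # L2 regularization strength
    batch_size=256,
    learning_rate="adaptive",
    max_iter=500,
)
```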
CHAPTER 4
IMPLEMENTATION

4.1 Speech database


The Toronto Emotional Speech Set (TESS) is a collection of audio clips of 2 women expressing 7
different emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). The TESS
dataset differs from other existing speech datasets in its inclusion of the pleasant-surprise emotion.
Speakers and emotions are organized in separate folders, which is very convenient.
It contains 2800 sound files in total (200 target phrases spoken by each of the two speakers in each of the
seven emotions).
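
The sketch below shows one way the TESS files and their emotion labels could be indexed, assuming the commonly distributed folder layout in which each sub-folder is named `<SPEAKER>_<emotion>` (e.g. `OAF_angry`, `YAF_happy`); the directory name is a placeholder, not something fixed by this report.

```python
import os
from glob import glob

# Hypothetical local path to the unpacked TESS corpus; adjust to your setup.
DATA_DIR = "TESS Toronto emotional speech set data"

def load_tess_index(data_dir=DATA_DIR):
    """Collect (file_path, emotion_label) pairs from the TESS folder layout."""
    index = []
    for folder in sorted(os.listdir(data_dir)):
        # Folder names are assumed to end with the emotion label, e.g.
        # 'OAF_angry' -> 'angry', 'YAF_Pleasant_surprise' -> 'pleasant_surprise'.
        emotion = folder.split("_", 1)[-1].lower()
        for wav_path in glob(os.path.join(data_dir, folder, "*.wav")):
            index.append((wav_path, emotion))
    return index

files = load_tess_index()
print(len(files), "audio files indexed")   # expected: 2800 for the full corpus
```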

4.2 Data Pre-Processing


Preprocessing is the very first step after collecting the data that will be used to train the classifier in an
IEFS system. Preprocessing includes silence removal, pre-emphasis, normalization and windowing, so it
is an important phase for obtaining a clean signal for the next stage (feature extraction).
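
A minimal pre-processing sketch with librosa and NumPy. The coefficient values (0.97 pre-emphasis, a 30 dB silence threshold, 22050 Hz sampling rate) are conventional defaults assumed for illustration, not values prescribed by this report.

```python
import numpy as np
import librosa

def preprocess(path, sr=22050, pre_emph=0.97, top_db=30):
    """Load a clip, remove leading/trailing silence, pre-emphasize and normalize."""
    y, sr = librosa.load(path, sr=sr)                 # resample to a fixed rate
    y, _ = librosa.effects.trim(y, top_db=top_db)     # silence removal
    y = np.append(y[0], y[1:] - pre_emph * y[:-1])    # pre-emphasis filter
    y = librosa.util.normalize(y)                     # peak normalization
    return y, sr
```

Windowing itself is applied later, inside the frame-based feature extraction described in Chapter 5.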

4.3 Machine Learning Algorithm


Artificial Neural Networks (ANNs) are a special type of machine learning algorithm modeled after the
human brain. An ANN is a computational algorithm intended to simulate the behavior of a biological
system composed of "neurons"; as a computational model, it is capable of machine learning as well as
pattern recognition.
The system will predict emotions (happy, sad, surprise, angry, fear, neutral) from the given input speech
dataset.
It is built using Python; we use the Python libraries librosa, soundfile, PyAudio and scikit-learn to build a
model based on an MLP Classifier.
The system will receive sound files from the dataset, which is available on the internet.
• From the audio data, three key features will be extracted:
-- MFCC (Mel Frequency Cepstral Coefficients)
-- Chroma
-- Mel spectrogram
A model is then built that uses ML techniques such as the MLP Classifier to classify the given input
speech signals into their respective emotion classes, following the steps below (a code sketch of this
pipeline is given after the list):
1. Load the dataset.
2. Extract features from the speech signals.
3. Split the dataset (TESS) into training and testing sets.
4. Initialize an MLP classifier and train the model.
5. Finally calculate the accuracy of the model.
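
A compact sketch of these five steps using the libraries named above (soundfile, librosa, scikit-learn). The feature sizes, split ratio and MLP hyperparameters are illustrative choices rather than the exact values used in this project, and `files` refers to the (path, emotion) index sketched in Section 4.1.

```python
import numpy as np
import soundfile as sf
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def extract_features(path, n_mfcc=40):
    """MFCC, chroma and Mel-spectrogram features, averaged over time frames."""
    with sf.SoundFile(path) as f:
        y = f.read(dtype="float32")      # TESS clips are mono, so y is 1-D
        sr = f.samplerate
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.hstack([mfcc, chroma, mel])

# Steps 1-2: load the dataset index and extract features for every file.
X = np.array([extract_features(path) for path, _ in files])
y = np.array([label for _, label in files])

# Step 3: split TESS into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Step 4: initialize an MLP classifier and train the model.
model = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01,
                      batch_size=256, max_iter=500)
model.fit(X_train, y_train)

# Step 5: calculate the accuracy of the model.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```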

4.4 Emotion Recognition


Hereby, we recognize emotions from speech. We used a machine learning algorithm (ANN) with an
MLP Classifier, making use of the soundfile library to read the sound files and the librosa library to
extract features from them.

4.5 Deployment
The proposed system recommends the most suitable model for inferring emotions from speech,
combining the ANN approach with an MLP Classifier.
We deployed our speech emotion recognition model using the TESS dataset; trained on TESS, the
deployed model achieved higher accuracy and more appropriate speech emotion predictions.

4.6 Accuracy
This model delivered an accuracy of 72.4%.
CHAPTER 5
METHODOLOGY
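
5.1 Import Modules

The first step of the methodology is to import the Python modules used in the remaining steps. This is a minimal sketch assuming the libraries named in Chapter 4 (librosa, soundfile, NumPy, scikit-learn), plus matplotlib, which is assumed here for the plotting step.

```python
import os                                   # path handling for the dataset folders
import numpy as np                          # numerical arrays for feature vectors
import librosa                              # audio loading and feature extraction
import soundfile as sf                      # reading .wav files
import matplotlib.pyplot as plt             # plots for EDA and model results
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier    # the classifier used in this work
from sklearn.metrics import accuracy_score, classification_report
```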

5.2 Load the Speech Emotion Dataset


5.3 Exploratory Data Analysis
5.4 Feature Extraction using MFCC
The MFCC feature extraction technique basically includes windowing the signal, applying the DFT,
taking the log of the magnitude, and then warping the frequencies on a Mel scale, followed by applying
the inverse DCT. The detailed description of various steps involved in the MFCC feature extraction is
explained below:

 Pre-emphasis: Pre-emphasis refers to filtering that emphasizes the higher frequencies. Its purpose is
to balance the spectrum of voiced sounds that have a steep roll-off in the high-frequency region.

 Frame blocking and windowing: The speech signal is a slowly time-varying or quasi-stationary signal; it is therefore analyzed in short overlapping frames, and each frame is multiplied by a window function (typically a Hamming window) to reduce spectral leakage at the frame edges.
 DFT spectrum: Each windowed frame is converted into a magnitude spectrum by applying the DFT.

 Mel spectrum: Mel spectrum is computed by passing the Fourier transformed signal through a set of band-
pass filters known as the Mel-filter bank. A Mel is a unit of measure based on the human ear's perceived
frequency. It does not correspond linearly to the physical frequency of the tone, as the human auditory
system apparently does not perceive pitch linearly. The Mel scale is approximately a linear frequency
spacing below 1 kHz and a logarithmic spacing above 1 kHz.

 Discrete cosine transform (DCT): Since the vocal tract is smooth, the energy levels in adjacent bands
tend to be correlated. Applying the DCT to the transformed Mel frequency coefficients produces a set of
cepstral coefficients. Prior to computing the DCT, the Mel spectrum is usually represented on a log scale.
This results in a signal in the cepstral domain with a quefrency peak corresponding to the pitch of the
signal and a number of formants representing low-quefrency peaks. Since most of the signal
information is represented by the first few MFCC coefficients, the system can be made robust by
extracting only those coefficients, ignoring or truncating the higher-order DCT components.
 Dynamic MFCC features: The cepstral coefficients are usually referred to as static features, since they
only contain information from a given frame. Extra information about the temporal dynamics of the
signal is obtained by computing the first and second derivatives of the cepstral coefficients (a code sketch
of this follows the list).
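
A compact sketch of how the static and dynamic MFCC features described above can be computed with librosa; the number of coefficients is an illustrative default, not a value fixed by this report.

```python
import numpy as np
import librosa

def mfcc_with_deltas(path, n_mfcc=13):
    """Static MFCCs plus first (delta) and second (delta-delta) derivatives."""
    y, sr = librosa.load(path, sr=None)
    # Windowing, DFT, Mel filter bank, log and DCT are handled inside librosa.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)            # temporal first derivative
    delta2 = librosa.feature.delta(mfcc, order=2)  # temporal second derivative
    return np.vstack([mfcc, delta, delta2])        # shape: (3 * n_mfcc, n_frames)
```
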
5.5 Creating the Model
5.6 Plot the Model Results
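
Sections 5.5 and 5.6 correspond to training the MLP Classifier (as described in Chapter 4) and visualizing its behaviour. Below is a minimal plotting sketch; it assumes the fitted classifier `model` and the held-out test set `X_test`, `y_test` from the pipeline sketched in Section 4.3, and uses matplotlib, which is an assumed dependency not named elsewhere in this report.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Training loss recorded by scikit-learn at every iteration of the MLP.
plt.figure()
plt.plot(model.loss_curve_)
plt.xlabel("Iteration")
plt.ylabel("Training loss")
plt.title("MLP training loss curve")

# Per-emotion performance of the classifier on the held-out test set.
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, xticks_rotation=45)
plt.tight_layout()
plt.show()
```
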
CHAPTER 6
SYSTEM ANALYSIS AND DESIGN

6.1 SYSTEM ARCHITECTURE


The system architecture describes the structural design of the speech emotion recognition model and the
distribution of its functional components: the formal elements that embody the concepts and information
flow used for inferring emotion from speech.
6.2 FLOWCHART
A flowchart is simply a graphical representation of steps. It shows steps in sequential order and is
widely used in presenting the flow of algorithms, workflow or processes. Typically, a flowchart shows the
steps as boxes of various kinds, and their order by connecting them with arrows. It originated in
computer science as a tool for representing algorithms and programming logic, but has since been extended
to all other kinds of processes. Nowadays, flowcharts play an extremely important role in displaying
information and assisting reasoning. They help us visualize complex processes, or make explicit the
structure of problems and tasks. A flowchart can also be used to define a process or project to be
implemented.
CHAPTER 7
SYSTEM REQUIREMENTS AND SPECIFICATIONS

A software requirements specification (SRS) is a detailed description of a software system to be developed.
It lays out the functional and non-functional requirements, and may include a set of use cases that describe
the user interactions that the software must provide. It is very important in an SRS to list out the requirements
and how to meet them; this helps the team save time, as they are able to comprehend how they are going to
go about the project, and it also enables the team to find out about limitations and risks early on. The
software requirements specification document must be consistent with all requirements necessary for project
development. To develop the software system, we should have a clear understanding of the software system;
to achieve this, we need continuous communication with customers to gather all requirements. A good SRS
defines how the software system will interact with all internal modules, hardware, other programs and human
users across a wide range of real-life scenarios. It is very important that testers are clear about every detail
specified in this document in order to avoid faults in the test cases and their expected results.
Qualities of SRS
• Correct
• Unambiguous
• Complete
• Consistent
• Ranked for importance and/or stability
• Verifiable
• Modifiable
• Traceable
7.2 REQUIREMENTS
In particular, we specify the hardware required to collect the input data, the software used to store, visualize
and synchronize the acquired data, and the tools used to assess the users' emotions.
7.2.1 Hardware Requirements
The hardware requirements include the requirements specification of the physical computer
resources for a system to work efficiently. The hardware requirements may serve as the basis for a
contract for the implementation of the system and should therefore be a complete and consistent
specification of the whole system. The Hardware Requirements are listed below:
1. Processor: A processor is an integrated electronic circuit that performs the calculations that run a
computer. A processor performs arithmetical, logical, input/output (I/O) and other basic instructions
that are passed from an operating system (OS). Most other processes are dependent on the
operations of a processor. A minimum 1 GHz processor should be used, although we would
recommend 2 GHz or more. A processor includes an arithmetic logic unit (ALU) and a control unit (CU),
and its capability is measured in terms of the following:
• Ability to process instructions at a given time
• Maximum number of bits/instructions
• Relative clock speed
The proposed system requires a minimum of a Core i3 processor or higher.

2. Memory (RAM): Random-access memory (RAM) is a form of computer data storage that
stores data and machine code currently being used. A random- access memory device allows data
items to be read or written in almost the same amount of time irrespective of the physical location of
data inside the memory. In today's technology, random-access memory takes the form of integrated
chips. RAM is normally associated with volatile types of memory (such as DRAM modules), where
stored information is lost if power is removed, although non- volatile RAM has also been developed.
A minimum of 8 GB RAM is recommended for the proposed system.

3. Hard Drive: A hard drive is an electro-mechanical data storage device that uses magnetic storage
to store and retrieve digital information using one or more rigid rapidly rotating disks, commonly
known as platters, coated with magnetic material. The platters are paired with magnetic heads,
usually arranged on a moving actuator arm, which reads and writes data to the platter surfaces. Data
is accessed in a random-access manner, meaning that individual blocks of data can be stored or
retrieved in any order and not only sequentially. HDDs are a type of nonvolatile storage, retaining
stored data even when powered off. 32 GB or higher is recommended for the proposed system.
CHAPTER 8
RESULT AND DISCUSSION

8.1 Result
In this work, we experimented with IEFS on one database: TESS. The dataset was divided into training
and testing parts. For training, 200 samples (50 samples from each category) of the emotions angry, happy,
disgust and pleasant surprise were separated, while 24 samples (6 from each category) were set aside for
testing. The MFCC vectors of this training data were passed to the classifier, which gave 97% accuracy.
When 30 samples (10 from each category) were instead treated as test data, the accuracy after training
was 86%. The results show that our proposed method outperforms the other three baseline methods.
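
To complement the overall accuracies reported above, per-emotion metrics can be computed with scikit-learn. The sketch below assumes the fitted classifier `model` and the test split `X_test`, `y_test` from the pipeline sketched in Section 4.3; the split sizes there are illustrative and not the exact splits used in this experiment.

```python
from sklearn.metrics import accuracy_score, classification_report

# Evaluate the trained classifier on the held-out samples.
y_pred = model.predict(X_test)
print("Overall accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall and F1 per emotion
```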
CHAPTER 9
CONCLUSION AND FUTURE WORK

9.1 CONCLUSION
This paper shows that MLPs are very powerful in classifying speech signals. Even with simplified
models, a limited set of emotion classes can be identified easily. We have obtained higher accuracies
compared to other approaches for individual emotions. The performance of such a module is highly
dependent on the quality of pre-processing. Each human emotion considered has been thoroughly studied
and analyzed, and its recognition accuracy has been checked. The results obtained in this study demonstrate
that speech emotion recognition is feasible, and they indicate the accuracy achieved for each emotion
present in the speech.

9.2 FUTURE WORK


Systems linked to speech are rapidly changing and evolving. The early implementations of speech
engineering have achieved different levels of success.
The hope for the future is significantly higher quality in almost every area of speech-related technologies,
with more robustness to speaker variation, ambient noise, etc. Such tools can be used as application-based
devices in the medical field by recognizing speech emotion; moreover, they can be used for home security
purposes, helping households and neighbours provide emotional assistance to people in need.
- To increase the efficiency of emotion recognition systems in terms of accuracy. In addition, the
security of the system can be improved by using cloud storage and cancelable biometrics.
- We plan to make the speech emotion recognition system more robust, and real-time analysis
would be carried out.
A few possible steps that can be implemented to make the models more robust and accurate are the
following:
 An accurate implementation of the pace of speaking can be explored to check whether it can resolve some of
the deficiencies of the model.
 Figuring out a way to clear random silence from the audio clips.
REFERENCES

[1] Y. Kumar and M. Mahajan, "Machine Learning based Speech Recognizing Emotion Systems," International Journal of Scientific and Technology Research (IJSTR), 2019.

[2] M. Badshah, J. Ahmad, N. Rahim, and S. W. Baik, "Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network," 2017 International Conference on Platform Technology and Service, pp. 1–5, 2017.

[3] R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar, and T. Alhussain, "Speech Emotion Recognition Using Deep Learning Techniques: A Review," IEEE Access, vol. 7, pp. 117327–117345, 2019, doi: 10.1109/ACCESS.2019.2936124.

[4] H. Murugan, "Speech Emotion Recognition using ANN," International Journal of Psychosocial Rehabilitation, vol. 24, no. 8, 2020, doi: 10.37200/IJPR/V24I8/PR280260.

[5] T. Seehapoch and S. Wongthanavasu, "Speech Emotion Recognition using Support Vector Machines," 2013 5th International Conference on Knowledge and Smart Technology (KST), pp. 86–91, 2013, doi: 10.1109/KST.2013.6512793.

[6] S.-W. Byun and S.-P. Lee, "A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms," Applied Sciences, vol. 11, no. 4, p. 1890, 2021, doi: 10.3390/app11041890.

[7] K. Aurangzeb, N. Ayub, and M. Alhussein, "Aspect Based Multi-Labeling Using SVM Based Ensembler," IEEE Access, vol. 9, pp. 26026–26040, 2021.

[8] B. Jena, A. Mohanty, and S. K. Mohanty, "Gender Recognition of Speech Signal Using KNN and SVM," Available at SSRN 3769786, 2021.

[9] M. J. Al Dujaili, A. Ebrahimi-Moghadam, and A. Fatlawi, "Speech Emotion Recognition Based on SVM and KNN Classifications Fusion," International Journal of Electrical and Computer Engineering, vol. 11, no. 2, p. 1259, 2021.

[10] B. T. Atmaja and M. Akagi, "Two-Stage Dimensional Emotion Recognition by Fusing Predictions of Acoustic and Text Networks Using SVM," Speech Communication, vol. 126, pp. 9–21, 2021.

[11] M. H. Abdul-Hadi and J. Waleed, "Human Speech and Facial Emotion Recognition Technique Using SVM," 2020 International Conference on Computer Science and Software Engineering (CSASE), IEEE, 2020, pp. 191–196.

[12] A. Host-Madsen and P. Handel, "Effects of Sampling and Quantization on Single-Tone Frequency Estimation," IEEE Transactions on Signal Processing, vol. 48, no. 3, pp. 650–662, 2000.

[13] M. K. Pichora-Fuller and K. Dupuis, "Toronto Emotional Speech Set (TESS)," 2020.
