Bachelor of Technology in
Computer Science and Engineering by
K. JOTHIKA (U18CS013)
NEETU (U18CS017)
S. MANASA (U18CS027)
P. CHANDRIKA (U18CS062)
This is to certify that the project report entitled “Inferring Emotion from Speech” submitted by
K. Jothika (U18CS013), Neetu (U18CS017), S. Manasa (U18CS027), P. Chandrika (U18CS062)
to the Department of Computer Science and Engineering, Bharath Institute of Higher Education
and Research, in partial fulfillment for the award of the degree of B. Tech in Computer Science
and Engineering is a bona fide record of project work carried out by them under my supervision.
The contents of this report, in full or in parts, have not been submitted to any other Institution or
University for the award of any other degree.
<Signature of Supervisor>
Dr. Anitha Karthi
Department of Computer Science & Engineering,
School of Computing, Bharath Institute of Higher Education
and Research
April, 2022
We declare that this project report titled “Inferring Emotions from Speech” submitted in partial
fulfillment of the degree of B. Tech in Computer Science and Engineering is a record of
original work carried out by us under the supervision of Dr. Anitha Karthi, and has not formed
the basis for the award of any other degree or diploma, in this or any other Institution or
University. In keeping with the ethical practice in reporting scientific information, due
acknowledgements have been made wherever the findings of others have been cited.
<Signature>
K. Jothika
(U18CS013)
<Signature>
Neetu
(U18CS017)
<Signature>
S. Manasa
(U18CS027)
<Signature>
P. Chandrika
(U18CS062)
Chennai
<Date>
ACKNOWLEDGMENTS
First, we wish to thank the almighty who gave us good health and success throughout our project work.
We express our deepest gratitude to our beloved President Dr. J. Sundeep Aanand, and Managing
Director Dr. E. Swetha Sundeep Aanand for providing us the necessary facilities for the completion of our
project.
We take great pleasure in expressing sincere thanks to Vice Chancellor (I/C) Dr. K. Vijaya Baskar Raju,
Pro Vice Chancellor (Academic) Dr. M. Sundararajan, Registrar Dr. S. Bhuminathan and Additional Registrar
Dr. R. Hari Prakash for backing us in this project.
We thank our Dean Engineering Dr. J. Hameed Hussain for providing sufficient facilities for the
completion of this project.
We express our immense gratitude to our Academic Coordinator Mr. G. Krishna Chaitanya for his
eternal support in completing this project.
We thank our Dean, School of Computing Dr. S. Neduncheliyan for his encouragement and the valuable
guidance.
We record our indebtedness to our Head, Department of Computer Science and Engineering, Dr. B. Persis Urbana Ivy, for her immense care and encouragement towards us throughout the course of this project.
We also take this opportunity to express a deep sense of gratitude to our Supervisor Dr. Anitha Karthi for her cordial support, valuable information, and guidance, which helped us complete this project through its various stages.
We thank our department faculty, supporting staff and friends for their help and guidance to complete this
project.
ABSTRACT
Emotion recognition has been a rapidly growing research domain in recent years. Unlike humans, machines lack the ability to perceive and express emotions, but human-computer interaction can be improved by automated emotion recognition, thereby reducing the need for human intervention. It offers tremendous scope in human-computer interaction, robotics, health care, biometric security, and behavioral modeling. Emotion recognition systems recognize emotions from speech signals.
Inferring Emotion from Speech, abbreviated as IEFS, is the task of recognizing human emotions and the associated affective states from speech. It capitalizes on the fact that the voice often reflects underlying emotion through tone and pitch. In this project, basic emotions such as calm, happy, fearful, and disgust are analyzed from emotional speech signals. We use machine learning techniques such as the Multilayer Perceptron (MLP) classifier, which categorizes the given data into groups that are not linearly separable. Mel-frequency cepstral coefficients (MFCC), chroma, and mel features are extracted from the speech signals and used to train the MLP classifier. To achieve this objective, we use Python libraries such as Librosa, scikit-learn, PyAudio, NumPy, and SoundFile to analyze the speech modulations and recognize the emotion.
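As a minimal illustration of the training step described above, the sketch below fits a scikit-learn MLP classifier on a placeholder feature matrix. The random features, the 180-dimensional layout (40 MFCC + 12 chroma + 128 mel), and the hidden-layer size are illustrative assumptions standing in for the report's actual extracted features, not the project's real code:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder feature matrix: 200 utterances x 180 features
# (e.g. 40 MFCC + 12 chroma + 128 mel values, averaged over time).
X = rng.normal(size=(200, 180))
emotions = np.array(["calm", "happy", "fearful", "disgust"])
y = rng.choice(emotions, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# One hidden layer, as in a typical MLP-based IEFS pipeline.
clf = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500, random_state=42)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
```

With real MFCC/chroma/mel vectors in place of the random matrix, the same two calls (`fit`, then `predict`) produce the emotion labels that the report evaluates.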
TABLE OF CONTENTS
CERTIFICATE
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS/ NOTATIONS/ NOMENCLATURE
1. INTRODUCTION
1.1 About the Project
2. LITERATURE SURVEY
3. EXISTING SYSTEM & PROPOSED SYSTEM
3.1 EXISTING SYSTEM
3.2 PROPOSED SYSTEM
4. IMPLEMENTATION
4.1 SPEECH DATABASE
4.2 DATA PRE-PROCESSING
4.3 MACHINE LEARNING ALGORITHM
4.4 EMOTION RECOGNITION
4.5 DEPLOYMENT
4.6 ACCURACY
5. METHODOLOGY
5.1 IMPORT MODULES
5.2 LOAD THE SPEECH EMOTION DATASET
5.3 EXPLORATORY DATA ANALYSIS
5.4 FEATURE EXTRACTION USING MFCC
5.5 CREATING THE MODEL
5.6 PLOT THE MODEL RESULTS
6. SYSTEM ANALYSIS AND DESIGN
6.1 SYSTEM ARCHITECTURE
6.2 FLOWCHART
7. SYSTEM REQUIREMENTS AND SPECIFICATIONS
7.1 BASIC REQUIREMENTS
7.2 REQUIREMENTS
7.2.1 SOFTWARE REQUIREMENTS
7.2.2 HARDWARE REQUIREMENTS
8. RESULT AND DISCUSSION
8.1 RESULT
9. CONCLUSION AND FUTURE WORK
9.1 CONCLUSION
9.2 FUTURE WORK
REFERENCES
APPENDIX
LIST OF FIGURES
ABBREVIATIONS/ NOTATIONS/ NOMENCLATURE
CHAPTER 1
INTRODUCTION
Inferring emotions from speech (IEFS) is the task of recognizing the emotional aspects of speech
irrespective of the semantic contents. While humans can efficiently perform this task as a natural part of
speech communication, the ability to conduct it automatically using programmable devices is still an
ongoing subject of research.
Robots capable of understanding emotions could provide appropriate emotional responses and
exhibit emotional personalities. In some circumstances, humans could be replaced by computer-generated
characters having the ability to conduct very natural and convincing conversations by appealing to human
emotions. Machines need to understand emotions conveyed by speech. Only with these capabilities can an entirely meaningful dialogue based on mutual human-machine trust and understanding be achieved.
Emotion recognition is gaining popularity in research; it is key to solving many problems and can make life easier.
Inferring emotion from speech remains more challenging than the other components of such systems because of the complexity of the speech signal; moreover, present computer systems still lack the ability to convincingly interpret human responses. Inferring emotions from speech is therefore framed as a classification problem and addressed using numerous ML algorithms. Recognizing the emotional state of the user gives this application a significant advantage.
In addition to emotion recognition, various other factors such as valence, polarity, and arousal play prominent roles in identifying one's state of mind. Combining valence, polarity, arousal, and emotion recognition to model a person's state of mind is known as sentiment analysis. Sentiment analysis is used to understand a person's opinion of, and attitude towards, a particular topic at a given instant of time using various computational approaches.
Emotion recognition has wide scope in many areas such as human-computer interaction and biometric security. It also provides insight into artificial intelligence, or machine intelligence, which uses various supervised and unsupervised machine-learning algorithms to simulate the human brain. The study of human emotions and their interpretation, processing, and adaptation by machines is known as affective computing, or artificial emotional intelligence. A human emotional state can be recognized from facial expressions, body movements, speech, written text, brain or heart signals, etc., using various machine learning techniques that extract the required features or patterns from the collected data.
The approach for Inferring Emotion from Speech (IEFS) primarily comprises two phases: feature extraction and feature classification. The field of speech processing has derived several features for the first phase. The second phase classifies those features, here using multilayer perceptron (MLP) classifiers. The speech signal is usually considered non-stationary; hence non-linear classifiers are considered to work effectively for IEFS. MLPs are widely used for classification of information derived from low-level features. Energy-based features such as Mel-Frequency Cepstral Coefficients (MFCC) are often used for effective emotion recognition from speech.
The aim of developing machines that interpret paralinguistic data, such as emotion, is to improve human-machine interaction and to make that interaction clearer and more natural. In this project we use classification models such as an ANN to predict the emotion in a speech sample. MFCC is used for feature extraction, and the model is trained on the TESS dataset along with data augmentation.
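The data augmentation step mentioned above can be sketched with two simple transforms, additive noise and time shifting; this is a minimal NumPy illustration on a synthetic tone, and the noise factor and shift amount are assumed values, not the project's actual settings:

```python
import numpy as np

def add_noise(signal, noise_factor=0.005):
    """Additive white Gaussian noise augmentation."""
    noise = np.random.default_rng(0).normal(size=signal.shape)
    return signal + noise_factor * noise

def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples."""
    return np.roll(signal, shift)

# One second of a synthetic 440 Hz tone at 16 kHz stands in for an utterance.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)

noisy = add_noise(clean)            # same label, perturbed waveform
shifted = time_shift(clean, sr // 10)  # shift by 100 ms
```

Each augmented copy keeps the original emotion label, effectively enlarging the TESS training set.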
CHAPTER 2
LITERATURE SURVEY
Mel Frequency Cepstral Coefficients (MFCC), acoustic-phonetic recognition: this study discusses a total of seven different approaches widely used for speech recognition systems (SRS). After a comparative study of these approaches, it concludes that the Hidden Markov Model (HMM) is the best-suited approach for an SRS because it is efficient and robust, reduces time and complexity, can match sequences with missing information (so continuity is less important), and provides reliable time alignment between reference and test patterns [1].
Statistical modeling, robust speech recognition, noisy speech recognition, classifiers, feature extraction, performance evaluation, databases: there is now increasing interest in finding ways to bridge the performance gap between human and machine speech processing, although what we know about human speech processing is still very limited [2].
Signal preprocessing, feature extraction, language model, decoder, emotion recognition: the speech sound is captured using a microphone, which converts it into an electrical signal, and the sound card inside the computer changes the analog signal into a digital signal. Speech is the basic mode of communication between human beings, so a feasible interface is required to connect humans with machines [3].
Sequence-to-sequence, end-to-end (E2E) automatic speech recognition (ASR): these models directly map an acoustic feature sequence to a word sequence. The authors investigated a set of models, including an LSTM model. Due to their universal capability to handle even non-monotonic alignment, attention models are widely used in many machine learning problems, e.g., translation, and were applied on top of the best recipe for the Switchboard English speech recognition benchmark [4].
Signal preprocessing, feature extraction, language model, decoder, speech recognition: there is now increasing interest in finding ways to bridge the human-machine performance gap, though what we know about human speech processing is very limited. Although these areas of investigation are important, the significant advances will come from studies in acoustic phonetics, speech perception, and linguistics. Various neural network models such as deep neural networks, RNNs, and LSTMs are used to obtain the requested output [5].
CHAPTER 3
EXISTING SYSTEM & PROPOSED SYSTEM
4.5 Deployment
The proposed system selects the most suitable model for inferring emotions from speech by considering the ANN algorithm and the MLP classifier.
We deployed our speech emotion recognition model using the TESS dataset; with TESS, the model was deployed with higher accuracy and more appropriate speech emotion labels.
4.6 Accuracy
This model delivered an accuracy of 72.4%.
CHAPTER 5
METHODOLOGY
PRE-EMPHASIS - Pre-emphasis refers to filtering that emphasizes the higher frequencies. Its purpose is
to balance the spectrum of voiced sounds that have a steep roll-off in the high-frequency region.
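The pre-emphasis filter described above is a first-order high-pass filter, y[n] = x[n] - a * x[n-1], commonly with a around 0.97 (the value of a here is an assumed typical setting). A minimal NumPy sketch:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# A 100 Hz sine at 16 kHz: a low-frequency tone the filter should attenuate.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
x = np.sin(2 * np.pi * 100 * t)
emphasized = pre_emphasis(x)
```

After filtering, the low-frequency tone's amplitude drops to a few percent of its original value, illustrating how pre-emphasis boosts the relative weight of high frequencies.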
Frame blocking and windowing - The speech signal is a slowly time-varying, quasi-stationary signal, so it is divided into short overlapping frames over which it can be treated as stationary; each frame is then multiplied by a window function (e.g. Hamming) to reduce spectral leakage at the frame edges.
DFT spectrum: Each windowed frame is converted into a magnitude spectrum by applying the DFT.
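The frame blocking, windowing, and DFT steps above can be sketched as follows; the 25 ms frame length, 10 ms hop, and 512-point FFT are assumed typical values, not settings taken from the report:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames (frame blocking)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone

frame_len, hop = 400, 160  # 25 ms frames, 10 ms hop at 16 kHz
frames = frame_signal(x, frame_len, hop)
windowed = frames * np.hamming(frame_len)

# Magnitude spectrum of each windowed frame via the DFT (real FFT).
mag = np.abs(np.fft.rfft(windowed, n=512, axis=1))
```

Each row of `mag` is one frame's magnitude spectrum; for the 440 Hz tone, the peak lands near bin 14 (440 * 512 / 16000).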
Mel spectrum: The Mel spectrum is computed by passing the Fourier-transformed signal through a set of band-pass filters known as the Mel filter bank. A mel is a unit of measure based on the human ear's perceived frequency. It does not correspond linearly to the physical frequency of the tone, as the human auditory system apparently does not perceive pitch linearly. The Mel scale has approximately linear frequency spacing below 1 kHz and logarithmic spacing above 1 kHz.
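The linear-below-1-kHz, logarithmic-above behavior follows from the common mel-scale formula m = 2595 * log10(1 + f / 700), sketched here:

```python
import numpy as np

def hz_to_mel(f):
    """Common mel-scale formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

# Roughly linear spacing below 1 kHz...
low = hz_to_mel([100, 200, 400])
# ...but compressed, logarithmic spacing above 1 kHz.
high = hz_to_mel([2000, 4000, 8000])
```

Conveniently, this formula maps 1000 Hz to almost exactly 1000 mels, which is the conventional anchor point of the scale.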
Discrete cosine transform (DCT): Since the vocal tract is smooth, the energy levels in adjacent bands tend to be correlated. Applying the DCT to the Mel frequency coefficients produces a set of cepstral coefficients. Prior to computing the DCT, the Mel spectrum is usually represented on a log scale. This results in a signal in the cepstral domain with a high-quefrency peak corresponding to the pitch of the signal and a number of formants representing low-quefrency peaks. Since most of the signal information is represented by the first few MFCC coefficients, the system can be made robust by extracting only those coefficients, ignoring or truncating the higher-order DCT components.
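The DCT-and-truncate step above can be written out directly; this is a self-contained orthonormal DCT-II in NumPy, with the 26-filter log-mel input and 13 retained coefficients being assumed typical sizes rather than values from the report:

```python
import numpy as np

def dct2(x, n_coeff):
    """Orthonormal DCT-II of vector x, keeping the first n_coeff coefficients."""
    N = len(x)
    n = np.arange(N)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeff), 2 * n + 1) / (2 * N))
    c = basis @ x * np.sqrt(2.0 / N)
    c[0] /= np.sqrt(2.0)  # orthonormal scaling of the DC term
    return c

# Stand-in log-mel spectrum for one frame (26 filterbank energies).
log_mel = np.log(np.linspace(1.0, 10.0, 26))
mfcc = dct2(log_mel, n_coeff=13)  # keep only the first 13 cepstral coefficients
```

Because the DCT concentrates the smooth spectral envelope into its first few coefficients, discarding the rest is exactly the truncation the text describes.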
Dynamic MFCC features: The cepstral coefficients are usually referred to as static features, since they
only contain information from a given frame. The extra information about the temporal dynamics of the
signal is obtained by computing first and second derivatives of cepstral coefficients.
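The delta computation described above is usually a regression over a small window of neighboring frames; the window half-width N = 2 below is an assumed common choice:

```python
import numpy as np

def delta(c, N=2):
    """First-order delta (regression) coefficients over a +/-N frame window."""
    T = len(c)
    padded = np.pad(c, N, mode="edge")  # replicate edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1))
        for t in range(T)
    ]) / denom

mfcc_track = np.arange(10.0)  # stand-in trajectory of one cepstral coefficient
d1 = delta(mfcc_track)        # delta (velocity)
d2 = delta(d1)                # delta-delta (acceleration)
```

For a linearly increasing trajectory, the interior delta values equal the slope (here 1.0), and applying `delta` twice yields the second-derivative (delta-delta) features.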
5.5 Creating the Model
5.6 Plot the Model Results
CHAPTER 6
SYSTEM ANALYSIS AND DESIGN
2. Memory (RAM): Random-access memory (RAM) is a form of computer data storage that stores data and machine code currently being used. A random-access memory device allows data items to be read or written in almost the same amount of time irrespective of the physical location of the data inside the memory. In today's technology, random-access memory takes the form of integrated circuit chips. RAM is normally associated with volatile types of memory (such as DRAM modules), where stored information is lost if power is removed, although non-volatile RAM has also been developed. A minimum of 8 GB RAM is recommended for the proposed system.
3. Hard Drive: A hard drive is an electro-mechanical data storage device that uses magnetic storage to store and retrieve digital information using one or more rigid, rapidly rotating disks, commonly known as platters, coated with magnetic material. The platters are paired with magnetic heads, usually arranged on a moving actuator arm, which read and write data to the platter surfaces. Data is accessed in a random-access manner, meaning that individual blocks of data can be stored or retrieved in any order and not only sequentially. HDDs are a type of nonvolatile storage, retaining stored data even when powered off. 32 GB or higher is recommended for the proposed system.
CHAPTER 8
RESULT AND DISCUSSION
8.1 Result
In this work, we experimented with IEFS on one database: TESS. In TESS, we divided the dataset into training and testing parts. For training purposes, 200 samples (50 samples from each category) of emotions (angry, happy, disgust, pleasant surprise) were separated, while 24 samples (6 from each category) were separated for testing the data. The MFCC vectors of this training data were passed to the classifier, which gave 97% accuracy. When 30 samples (10 from each category) were treated as test data, the accuracy on this dataset after training was 86%. The results in the table show that our proposed method outperforms the other three baseline methods.
CHAPTER 9
CONCLUSION AND FUTURE WORK
9.1 CONCLUSION
This report shows that MLPs are very powerful in classifying speech signals. Even with simplified models, a limited set of emotion classes can be easily identified. We obtained higher accuracies compared to other approaches for individual emotions. The performance of a module is highly dependent on the quality of pre-processing. Every human emotion considered has been thoroughly studied and analyzed, and its accuracy checked. The results obtained in this study demonstrate that speech emotion recognition is feasible, and they demonstrate the accuracy achieved for each emotion present in the speech.
REFERENCES
[1] Y. Kumar and M. Mahajan, "Machine Learning based Speech Recognizing Emotion Systems," International Journal of Scientific and Technology Research (IJSTR), 2019.
[4] H. Murugan, "Speech Emotion Recognition using ANN," International Journal of Psychosocial Rehabilitation, vol. 24, 2020. doi: 10.37200/IJPR/V24I8/PR280260.
[6] S.-W. Byun and S.-P. Lee, "A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms," Applied Sciences, vol. 11, no. 4, p. 1890, 2021. https://doi.org/10.3390/app11041890
[7] K. Aurangzeb, N. Ayub, and M. Alhussein, "Aspect based multi-labeling using SVM-based ensembler," IEEE Access, vol. 9, pp. 26026–26040, 2021.
[8] B. Jena, A. Mohanty, and S. K. Mohanty, "Gender recognition of speech signal using KNN and SVM," Available at SSRN 3769786, 2021.
[11] M. H. Abdul-Hadi and J. Waleed, "Human speech and facial emotion recognition technique using SVM," in 2020 International Conference on Computer Science and Software Engineering (CSASE), IEEE, 2020, pp. 191–196.
[13] M. K. Pichora-Fuller and K. Dupuis, "Toronto emotional speech set (TESS)," 2020.