
Classification of Emotions from Speech using Implicit Features

Mohit Srivastava
Human Computer Interaction
Indian Institute of Information Technology Allahabad
srivastava.mohit88@gmail.com

Anupam Agarwal
Human Computer Interaction
Indian Institute of Information Technology Allahabad
anupam@iiita.ac.in

Abstract— Human computer interaction has, with time, extended its branches to many other fields such as engineering, cognition and medicine. Speech analysis has also become an important area of concern. People are using this mode of interaction with machines to bridge the gap between the physical and the digital world, and speech emotion recognition has become an integral subfield of the domain. Human beings have an excellent capability to determine the situation by knowing the emotions involved, and can change the emotion of the interaction depending on the context. In the following work, implicit features of speech have been used for the detection of emotions such as anger, happiness, sadness, fear and disgust. The standard Berlin emotional database has been used as the data set for testing. The classification is done using SVM (support vector machine), which is found to be more consistent across all the emotions as compared to ANN (artificial neural network).

Keywords— implicit features, SVM, emotions, ANN

I. INTRODUCTION

With the advancement in technology, there is a need to make the interaction of humans with computers more and more user friendly. Making the interaction user friendly essentially means that it should resemble a conversation between two humans [1]. In the physical world the response of one person depends on the emotion with which the other person is speaking or interacting, which makes the communication more meaningful and healthy.

The importance of emotions in human life can be understood from the study in [2], which suggests that only 10% of human life is unemotional. Voice, gestures and expressions are some of the ways of expressing yourself. Facial expressions and gestures have been the primary modes used to make systems understand emotions, but in recent times there has been an effort to carry the primary way of interaction in the physical world over to the digital world. Detecting emotions through facial expressions is much more complicated and costly, as it requires high quality cameras for capturing the images. Speech can prove to be a more cost effective and efficient mode of recognition. Speech signals do not obey the property of stationary signals, so we usually analyze speech samples by performing short time analysis. The study of human emotions can be done in two fields: one is the psychological study, which perceives emotion through changes in states and behavior, and the other is the signal processing point of view, which studies the transition in emotions through changes in the vocal parameters [3].

Earlier, to categorize emotions from speech, researchers used the explicit channel content of the speech. In recent years the focus has shifted to the implicit channel [1] ("how it has been said"), i.e. the non-linguistic content. In the former approach we analyze the linguistic content of the sentence being spoken by an individual, while in the latter case we emphasize the acoustic parameters of the utterance. With a change in the emotional orientation of the person, the linguistic content does not get modified while the paralinguistic features do change; these features vary with the variation in the emotion of an individual. Tawari et al. [4] also proposed utilizing contextual information for efficient detection of emotion. Various studies have been done to find features that can predict emotions. We generally denote different emotions in the form of a coordinate system, with arousal and valence as the two axes. In fig (1), arousal signifies the intensity of the spoken words, while valence specifies the kind of response an individual is showing.

It is quite easy for humans to understand the circumstances by knowing the other person's emotion. For example, if we hear sadness or grief in the voice then we can easily conclude that something bad has happened. If fear is perceived, then in all likelihood the concerned person has been in some kind of trouble [6].
Fig 1: Two dimensional view of emotional coordinate system [5]

II. RELATED WORK

Speech processing has been part of research for many decades now, with speech recognition being the primary focus of that research. Gradually, with time, different areas developed out of it, of which emotion detection is one. Quite a few works have been done on categorizing emotions from speech samples. Cheng et al. [7] discussed the process of deriving features like pitch, intensity and MFCC that are used to determine the variation in emotions; in order to obtain the feature vector for one speech sample the authors used principal component analysis. W. Ser et al. [8] utilized the same set of features, with the classification done using a hybrid of a Probabilistic Neural Network and a Gaussian Mixture Model (GMM). Q. Zhang et al. [9] had previously used an HMM for classification by considering features like MFCC and energy, and found an improvement in performance when using MFCC together with autocorrelation function coefficients. For classification of the emotions anger, neutral, sadness and happiness, features have been extracted using the Discrete Wavelet Transform while an Artificial Neural Network is used for detection [10].

In recent years researchers have been utilizing the Support Vector Machine (SVM) [11, 12], but since the SVM is essentially a two-class classifier, A. Milton et al. [12] use multi-level SVMs. The features used for detection are mainly divided into prosodic and spectral features, each of which has its own importance. Features like pitch and energy are utilized for the detection of several emotions with good accuracy in [13], which shows that paralinguistic features carry the emotional content. I. Luengo et al. [13] have also used a Gaussian Mixture Model as a classification tool with only different prosodic features.

At a later stage, spectral features like Mel Frequency Cepstral Coefficients (MFCC) gained popularity and are used in [14]. As time has gone by, researchers have tried to combine both of these classes of features to make the best use of them. MFCC has proved to be a key feature for detecting emotions, as witnessed by the amount of work in the speech processing domain built on the back of this important feature. It has also been observed that there is no fixed set of features that alone carries emotion. So, apart from these spectral and prosodic features, researchers are continuously looking for new feature sets. Voice quality parameters [15] like the harmonic to noise ratio (HNR), shimmer and jitter are also being analyzed, and the Teager Energy Operator and epoch parameters are a few more new feature sets that seem to carry emotional content. On the other hand, researchers also try to combine the available features in such a way that they can maximize the utilization of these sets for classification [16]. Analyzing speech as a non-stationary signal generates a local feature vector; authors have used various statistical tools in order to obtain a global feature vector [15]. Although this particular area of research is relatively fresh, its extensive applications and the challenges it involves make researchers from different domains contribute to the growth and nurturing of this area of research.

III. PROPOSED METHODOLOGY

In order to perform the recognition of emotions from implicit features of speech, the process flow of the proposed methodology is shown in fig (2). Before extracting the features from each of the speech samples we needed to follow some preprocessing steps.

A. Pre-processing

The speech samples carry some quiet or silent sections which arise during vocalization of the sentence. These portions of the sample are not of our concern during feature extraction. So we have selected a window of duration 20 ms, and if the energy associated with a window is less than the selected threshold we ignore that section of the sample. As we are aware of the fact that speech signals have a varying nature, in order to calculate features from this preprocessed signal we again need to break the signal into frames [17]. Now, in order to make sure that all the speech samples, which are of different durations, are divided into an equal number of frames, we use a window of varying size.
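As a rough illustration of this step, the sketch below drops low-energy 20 ms windows and then splits the remaining signal into a fixed number of frames by letting the frame length vary with the signal duration. The 20 ms window and the count of 60 frames come from the text; the relative energy threshold and the function names are our own assumptions for illustration.

```python
import numpy as np

def remove_silence(signal, fs, win_ms=20, rel_threshold=0.02):
    """Drop 20 ms windows whose energy falls below an (assumed) relative threshold."""
    win = int(fs * win_ms / 1000)
    chunks = [signal[i:i + win] for i in range(0, len(signal) - win + 1, win)]
    energies = np.array([np.sum(c.astype(float) ** 2) for c in chunks])
    keep = energies >= rel_threshold * energies.max()   # threshold choice is an assumption
    return np.concatenate([c for c, k in zip(chunks, keep) if k])

def split_into_frames(signal, n_frames=60):
    """Divide a signal of any duration into a fixed number of frames
    by letting the frame (window) size vary with the signal length."""
    frame_len = len(signal) // n_frames
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```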
B. Extraction of Implicit Features

The first feature that we look to compute is pitch.
Pitch: - The glottis vibrates with a fundamental frequency called the pitch. There are numerous ways to calculate pitch, such as autocorrelation and the average magnitude difference function [18], but we will be calculating it using cepstrum analysis. The speech signal, while it passes through the vocal tract, gets influenced by its multiplicative behavior. In order to avoid this effect we go for homomorphic filtering; after this we calculate the inverse transform, in which the sample number having the highest magnitude signifies the pitch period [3]. The plots in fig (3) and fig (4) show the pitch variation with frame number for two different emotions, anger and sadness, for a single speech sample.

Fig 3: Plot showing pitch variation with frame for anger speech sample

Fig 4: Plot showing pitch variation with frame for sadness speech sample
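A minimal sketch of the cepstral pitch estimate described above is given below. The pitch search range (50-400 Hz) and the function name are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate the pitch of one frame via the real cepstrum:
    log-magnitude spectrum (homomorphic step) followed by an inverse FFT;
    the quefrency of the strongest peak gives the pitch period."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)           # avoid log(0)
    cepstrum = np.fft.irfft(log_mag)
    qmin, qmax = int(fs / fmax), int(fs / fmin)           # search range in samples (assumed)
    peak = qmin + np.argmax(cepstrum[qmin:qmax])
    return fs / peak                                      # pitch in Hz
```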
Formant frequencies: - After computing the pitch we move towards the calculation of three formant frequencies [18]. These are the resonant frequencies that depend on the shape of the vocal tract. We used linear predictive coding for calculating the formant frequencies, first obtaining the coefficients (about 18) and then computing the poles of the order-p all-pole model that represents the vocal tract:

H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}          (1)

Here, a_k represents the coefficients computed using Durbin's algorithm [18]. The spectrum of the vocal tract will tend to give the formant frequencies; we can use either the magnitude spectrum or the phase spectrum. In this work we have considered three formant frequencies. It can be inferred from fig (5) and fig (6) that anger has more peaks in the formant frequencies compared to the sadness speech sample.

Fig 5: Plot showing three formant frequency variation with frame for anger speech sample

Fig 6: Plot showing three formant frequency variation with frame for sadness speech sample
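The sketch below illustrates the LPC route to formants under the assumptions stated above (about 18 coefficients, three formants kept): the autocorrelation normal equations are solved for the a_k, and the angles of the complex poles of 1/A(z) are converted to frequencies. The Toeplitz solve stands in for the Levinson-Durbin recursion, and the low-frequency cutoff is an illustrative choice rather than the paper's exact procedure.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=18):
    """Autocorrelation-method LPC: solve the normal equations for the predictor a_k."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])      # equivalent to Levinson-Durbin

def formant_frequencies(frame, fs, order=18, n_formants=3):
    """Formants from the angles of the complex poles of the all-pole model H(z) = 1/A(z)."""
    a = lpc_coefficients(frame, order)
    poles = np.roots(np.concatenate(([1.0], -a)))          # roots of A(z) = 1 - sum a_k z^-k
    poles = poles[np.imag(poles) > 0]                      # keep one of each conjugate pair
    freqs = np.sort(np.angle(poles) * fs / (2 * np.pi))    # pole angle -> frequency in Hz
    return freqs[freqs > 90][:n_formants]                  # drop very low poles (assumed cutoff)
```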
Mel Frequency cepstral coefficients: - The spectral features which we calculate are the cepstral coefficients. In order to compute these we have divided the signal into short duration frames and then computed the periodogram of each frame. The periodogram, or power spectrum, gives a lot of information that is not required, so we sum up the values of the spectrum over various frequency bands. This is done with the help of a mel filter bank, whose filters are narrow at low frequencies and become wider as the frequency goes higher. Frequencies on the mel scale and on the linear scale are related as [19]:

m = 1125 \ln\left(1 + \frac{f}{700}\right)          (2)

f = 700\left(\exp\left(\frac{m}{1125}\right) - 1\right)          (3)

where f and m represent the frequency on the linear and the mel scale respectively. After getting the filter bank energies, we take their logarithm. This is driven by the fact that we do not hear loudness on a linear scale. Since our filter banks are overlapping, the energies are correlated with each other; in order to make them uncorrelated we compute the discrete cosine transform. Out of all the DCT coefficients obtained we retain only the lower-order ones, since the higher coefficients denote fast changes in the energies; we have considered 12 coefficients here. Fig (7) and fig (8) show the variation of one of the MFCC coefficients with frame for the anger and sadness samples.

Fig 7: Plot of MFCC coefficient variation with frame in anger speech sample.

Fig 8: Plot of MFCC coefficient variation with frame in sadness speech sample.
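A compact sketch of this MFCC pipeline (periodogram, mel filter bank built from equations (2) and (3), logarithm, DCT keeping 12 coefficients) is shown below. The number of filters and the FFT size are assumptions made for illustration; in practice a library such as librosa or python_speech_features would normally be used.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frame, fs, n_filters=26, n_fft=512, n_coeffs=12):
    """Periodogram -> mel filter bank energies -> log -> DCT (keep 12 coefficients)."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2 / n_fft

    # Mel filter bank: triangular filters equally spaced on the mel scale, eq. (2) and (3)
    mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)
    mel_points = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * imel(mel_points) / fs).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)

    log_energies = np.log(fbank @ power + 1e-12)           # log of the filter bank energies
    return dct(log_energies, norm="ortho")[:n_coeffs]      # decorrelate, keep the lower 12
```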
Energy: - Finally we have calculated the energy associated with each of the assumed 60 frames and considered it as one of the features. Here E_i represents the energy of the i-th frame, N is the number of samples in each frame and s(n) is the speech sample:

E_i = \sum_{n=1}^{N} |s(n)|^2          (4)

Fig 2: Process flow for emotion detection (input speech sample, remove silent parts, framing/windowing, feature extraction from each frame: pitch, three formant frequencies, energy and mel frequency cepstral coefficients, form a global vector by combining the features of the frames, apply the feature vector to a neural network and to a support vector machine, emotion detected)

These locally calculated features are now combined using some statistical tools, namely the mean and the standard deviation, in order to obtain the global feature vector which represents each speech sample [16]. These two parameters for each of the above calculated features form the feature vector. Thus, two values for pitch, six for the formant frequencies, two for energy and twenty four for the mel frequency cepstral coefficients make a vector of size 34x1 for one speech sample.

Classifiers: - In this paper we have done the classification using the back propagation algorithm and the support vector machine. For the back propagation algorithm we have used a three layered neural network. The input layer has thirty four neurons and the output layer has as many neurons as the number of classes; we have used a single hidden layer for the classification. In order to terminate the adjustment of the weights we count the number of zeros in the error matrix, which is the difference between the desired class and the predicted class of the training samples.

The emotions on which we are performing the classification are anger (A), happy (H), sadness (S), fear (F) and disgust (D). After the neural network was trained, we selected a different set of speech samples from the database to perform the testing for each emotion; ten samples were selected from each emotion. The matrix in table (1) shows the number of emotions recognized correctly. After performing the classification with the back propagation algorithm, we have also tested the data set using the Support Vector Machine.
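As an illustration of how the 34-dimensional global vector and the two classifiers described above could be wired together, the sketch below uses scikit-learn. The specific hyperparameters (hidden layer size, RBF kernel, iteration count) are assumptions, since the paper does not state them.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def global_feature_vector(pitch, formants, energy, mfccs):
    """Mean and standard deviation of each per-frame feature:
    2 (pitch) + 6 (three formants) + 2 (energy) + 24 (12 MFCCs) = 34 values."""
    parts = [np.asarray(pitch).reshape(-1, 1),        # (n_frames, 1)
             np.asarray(formants),                    # (n_frames, 3)
             np.asarray(energy).reshape(-1, 1),       # (n_frames, 1)
             np.asarray(mfccs)]                       # (n_frames, 12)
    return np.concatenate([np.r_[p.mean(axis=0), p.std(axis=0)] for p in parts])

def train_classifiers(X, y):
    """X: (n_samples, 34) global vectors, y: emotion labels A/H/S/F/D."""
    ann = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X, y)   # assumed size
    svm = SVC(kernel="rbf").fit(X, y)                                        # assumed kernel
    return ann, svm
```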
There are different classification schemes available when using SVMs. In the "one vs. rest" scheme we need (c-1) classifiers, where c is the number of classes. The problem with this scheme is that the training samples seen by these classifiers are not in equal ratios. L. Yu et al. [20] have also described a different classification scheme, namely the directed acyclic graph (DAG), as shown in fig (9). The training in this scheme is easier compared to the previous one. First, this scheme classifies the given test data into two classes (Anger/Sadness). Then, depending on the outcome of the first level of classification, it classifies the sample further as Anger/Happy or Sadness/Fear, and so on. The resulting emotion is given by the terminal (leaf) node of the classifier.

Fig 9: Directed acyclic graph showing the classification scheme for SVM, adopted from [20]
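A toy sketch of the DAG idea is given below, assuming one binary SVM per emotion pair and elimination of one candidate emotion at each node; the class ordering and kernel are illustrative assumptions and only one plausible arrangement of the scheme in fig (9).

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

class DagSVM:
    """Decision DAG over pairwise SVMs: each test eliminates one candidate
    emotion until a single leaf (the predicted emotion) remains."""
    def __init__(self, classes=("anger", "happy", "sadness", "fear", "disgust")):
        self.classes = list(classes)
        self.pair_svms = {}

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        for a, b in combinations(self.classes, 2):         # one binary SVM per pair
            mask = (y == a) | (y == b)
            self.pair_svms[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])
        return self

    def predict_one(self, x):
        remaining = list(self.classes)
        while len(remaining) > 1:                          # walk the DAG from root to leaf
            a, b = remaining[0], remaining[-1]
            winner = self.pair_svms[(a, b)].predict(np.asarray(x).reshape(1, -1))[0]
            remaining.remove(b if winner == a else a)      # drop the losing class
        return remaining[0]
```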

IV. RESULTS AND ANALYSIS

The different classifiers that have been trained using the implicit feature set have been tested using ten speech samples from each of the emotions.

Data sets: - Designing a data set for speech related applications is more difficult than for applications involving facial expressions. The Berlin database [21], which is used in this paper, was recorded at a 48 kHz sampling rate and then downsampled to 16 kHz. It involves five male and five female actors. In order to perform the testing of the designed classifiers, we select 10 speech samples from each emotion. The matrices in tables (1) and (2) show to which class these speech samples are predicted to belong. The results are also displayed in graphical form in fig (10) and fig (11).

S. Wang et al. [22] utilize principal component analysis and back propagation with 52%-56% accuracy. Padmawar et al. [23] have done detection of the primary emotions, achieving a 72.87% detection rate. In this paper we have trained the designed neural network to predict the respective emotions. The response achieved from the network shows that the anger and fear emotions are classified with average efficiency, while sadness, happy and disgust have been detected with high efficiency. The overall detection accuracy is about 74%.

Table 1: Matrix showing classification results using back propagation algorithm

            ANGER   HAPPY   SADNESS   FEAR   DISGUST
ANGER         6       2        0        0       0
HAPPY         3       8        0        0       0
SADNESS       1       0        9        5       0
FEAR          0       0        1        5       1
DISGUST       0       0        0        0       9

Fig 10: Graph showing classification results using back propagation algorithm

The classification using SVM is displayed in table (2) and fig (11). A. Hassan et al. [24] have classified the emotions using the "one against all" criterion for the SVM and have gained an accuracy of 80%. Javidi et al. [25] computed pitch, energy, zero crossing rate and MFCC, which are implicit features, for the Berlin dataset; their classification was performed using a neural network and an SVM separately. The classification accuracy we have achieved by combining the different sets of features is higher than that reported in [25] for the same group of five emotions. Hence, the method described in this paper and the features chosen for the classification have shown better results. As shown in table (3), with the proposed classification scheme of the SVM on the Berlin dataset, the results obtained show better consistency than those using the back propagation algorithm.

Table 2: Matrix showing classification results using support vector machine

            ANGER   HAPPY   SADNESS   FEAR   DISGUST
ANGER         8       2        0        0       0
HAPPY         1       8        0        0       0
SADNESS       1       0        8        3       0
FEAR          0       0        2        6       2
DISGUST       0       0        0        1       8

Fig 11: Graph showing classification results using support vector machine

Table 3: Comparison chart showing classification accuracy for different emotions

            BACK PROPAGATION ALGORITHM   SUPPORT VECTOR MACHINE
ANGER                 60%                         80%
HAPPY                 80%                         80%
SADNESS               90%                         80%
FEAR                  50%                         60%
DISGUST               90%                         80%

V. CONCLUSION AND FUTURE WORK

It has been witnessed that combining different kinds of features can increase the detection rate. In order to design a speaker independent emotion detection system, more focus should be on the implicit features of the human voice. The work can be extended to classify more emotions, and, in order to increase the detection accuracy, some other classification tools can be used.

REFERENCES

[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, Vol. 18, No. 1, pp. 32-80, Jan. 2001.
[2] S. Ramakrishnan and I. M. M. E. Emary, "Speech emotion recognition approaches in human computer interaction," Telecomm. Syst., Vol. 52, No. 3, pp. 1467-1478, Mar. 2013.
[3] K. Cherry, "What are the 3 Major Theories of Emotion?," About.com Psychology, 06-24-2014. [Online]. Available: http://psychology.about.com/od/psychologytopics/a/theories-of-emotion.htm. [Accessed: 24-Jun-2014].
[4] Tawari A., Trivedi M. M., "Speech Emotion Analysis: Exploring the Role of Context," IEEE Transactions on Multimedia, Vol. 12, No. 6, pp. 502-509, 2010.
[5] E.-H. Kim, K.-H. Hyun, S.-H. Kim, and Y.-K. Kwak, "Improved Emotion Recognition With a Novel Speaker-Independent Feature," IEEE/ASME Transactions on Mechatronics, Vol. 14, No. 3, pp. 317-325, Jun. 2009.
[6] Saheer L., Potard B., "Understanding Factors in Emotion Perception," 8th ISCA Speech Synthesis Workshop, Barcelona, Spain, pp. 59-64, Aug 31-Sep 2, 2013.
[7] Cheng Xianglin and Qiong Duan, "Speech Emotion Recognition Using Gaussian Mixture Model," in Proceedings of the 2012 International Conference on Computer Application and System Modeling, Atlantis Press, pp. 1222-1225, 2012.
[8] W. Ser, L. Cen, and Z.-L. Yu, "A Hybrid PNN-GMM classification scheme for speech emotion recognition," in 19th International Conference on Pattern Recognition (ICPR), Tampa, pp. 1-4, Dec 2008.
[9] Q. Zhang, N. An, K. Wang, F. Ren, and L. Li, "Speech emotion recognition using combination of features," in 2013 Fourth International Conference on Intelligent Control and Information Processing (ICICIP), Beijing, pp. 523-528, 2013.
[10] Firoz Shah A., Raji Sukumar A., Babu Anto P., "Discrete wavelet transforms and artificial neural networks for speech emotion recognition," Int'l Journal of Computer Theory and Engineering, Vol. 2, No. 3, pp. 319-322, 2010.
[11] Y.-L. Lin and G. Wei, "Speech emotion recognition based on HMM and SVM," in Proceedings of 2005 International Conference on Machine Learning and Cybernetics, Guangzhou, Vol. 8, pp. 4898-4901, 2005.
[12] A. Milton and S. Tamil Selvi, "Class-specific multiple classifiers scheme to recognize emotions from speech signals," Computer Speech & Language, Vol. 28, No. 3, pp. 727-742, May 2014.
[13] I. Luengo, E. Navas, I. Hernáez, and J. Sánchez, "Automatic Emotion Recognition using Prosodic Parameters," in Proc. of INTERSPEECH, pp. 493-496, 2005.
[14] K. V. Krishna Kishore and P. Krishna Satish, "Emotion recognition in speech using MFCC and wavelet features," IEEE 3rd International Advance Computing Conference (IACC), Ghaziabad, pp. 842-847, 2013.
[15] C. N. Anagnostopoulos, T. Iliou, and I. Giannoukos, "Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011," Artif Intell Rev, pp. 1-23, Nov. 2012.
[16] Y. Zhou, Y. Sun, J. Zhang, and Y. Yan, "Speech Emotion Recognition Using Both Spectral and Prosodic Features," in International Conference on Information Engineering and Computer Science (ICIECS), Wuhan, pp. 1-4, 2009.
[17] "Speech Signal Processing Laboratory: Electronics & Communications: IIT GUWAHATI Virtual Lab." [Online]. Available: http://iitg.vlab.co.in/index.php?sub=59&brch=164. [Accessed: 25-Jun-2014].
[18] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, first edition, New Jersey: Prentice Hall, 1978.
[19] Hongyu Xu, Xia Z., Liang J., "The Extraction and Simulation of Mel Frequency Cepstrum Speech Parameters," IEEE International Conference on Systems and Informatics, Yantai, pp. 1765-1768, May 2012.
[20] L. Yu, B. Wu, and T. Gong, "A hierarchical support vector machine based on feature-driven method for speech emotion recognition," Int'l Conference on Artificial Immune Systems, Nottingham, pp. 901-908, 2013.
[21] "www.expressive-speech.net." [Online]. Available: www.expressive-speech.net. [Accessed: 25-Jun-2014].
[22] S. Wang, X. Ling, F. Zhang, and J. Tong, "Speech Emotion Recognition Based on Principal Component Analysis and Back Propagation Neural Network," in International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), Vol. 3, pp. 437-440, 2010.
[23] Padmawar S., Deshpande P. S., "Classification of Speech using MFCC and Power Spectrum," Int'l Journal of Engg. Research and Applications, Vol. 3, Issue 1, pp. 1451-1454, 2013.
[24] A. Hassan and R. I. Damper, "Multi-class and hierarchical SVMs for emotion recognition," in INTERSPEECH, pp. 2354-2357, 2010.
[25] Javidi, Mohammad Masoud, and Ebrahim Fazlizadeh Roshan, "Speech Emotion Recognition by Using Combinations of C 5.0, Neural Network, and Support Vector Machines Classification Methods," Journal of Mathematics and Computer Science, Vol. 6, No. 2, pp. 191-200, 2013.
