
Proceedings of the 2013 IEEE/SICE International Symposium on System Integration, Kobe International Conference Center, Kobe, Japan, December 15-17, 2013

Reduce the Dimensions of Emotional Features by Principal Component Analysis for Speech Emotion Recognition

Changqin QUAN, Dongyu WAN, Bin ZHANG, Fuji REN, Member, IEEE

Abstract: In this paper, principal component analysis (PCA) is applied to speech emotion recognition to improve the accuracy of the system. Traditional prosodic features such as pitch-related and formant-related features are extracted from the Berlin speech database [5] and a Chinese database. The collected feature data are processed by PCA to remove irrelevant information. After that, three kinds of features, namely the features processed by PCA, the unprocessed features, and other speech-related features, are used to train an SVM classifier, and six emotions are tested in the experiment. The classification accuracy of the features processed by PCA is about 3.1% higher than that of the unprocessed features and about 17.6% higher than that of the MFCC features when using 240 utterances. The recognition accuracies among different emotions in both databases are also presented in the study.

This research has been partially supported by the National Natural Science Foundation of China under Grant No. 61203312, the National High-Tech Research & Development Program of China (863 Program) under Grant No. 2012AA011103, the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, and the Key Science and Technology Program of Anhui Province under Grant No. 1206c0805039.

Changqin QUAN is with the AnHui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, School of Computer and Information, HeFei University of Technology, HeFei, China.
Dongyu WAN is with the School of Computer and Information, HeFei University of Technology, HeFei, China.
Bin ZHANG is with the School of Computer and Information, HeFei University of Technology, HeFei, China.
Fuji REN is with the AnHui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, School of Computer and Information, HeFei University of Technology, HeFei, China.

978-1-4799-2625-1/13/$31.00 ©2013 IEEE

I. INTRODUCTION

SPEECH communication conveys not only the semantic and syntactic contents of linguistic sentences but also the emotional states of human beings. Voice is able to express emotion because it contains parameters that reflect the characteristics of emotions, and emotional changes are reflected in differences in these parameters. Therefore, it is extremely important for speech emotion recognition to extract from the speech signals the parameters that best reflect the features of emotion. In general, most characteristics of emotion are conveyed by the prosodic features of the voice. For example, when a man is angry, his speech tempo becomes faster, his tone becomes higher, and his volume becomes louder. All these changes can be easily perceived, and much research has already been done to detect the emotional characteristics of speech [1, 2, 4, 10, 11, 14, 16, 17, 18, 19].

The whole process of speech emotion recognition is usually divided into two major steps. The first step is to extract distinctive features from the speech signals. Then, classification tools are used to decode the emotional state from the extracted speech features. (There are already many successful classification tools.) While the theory of classification is well developed [3], we focus in this paper on the feature selection issue.

One problem is that we usually obtain a large number of features after feature extraction; it is difficult but meaningful to utilize these features effectively, since there are no general rules on how to combine them. To solve this problem, a common approach is to perform dimensionality reduction, which can generate new data that better reflect the characteristics of the speech information. In our research, principal component analysis (PCA) [12, 13], a traditional linear dimensionality reduction technique, is adopted to help remove irrelevant feature data. By doing so, the previous set of high-dimensional speech features is replaced by a more distinctive feature set, and the accuracy of the recognition is clearly improved.

The outline of this paper is as follows. In section II, we first give a brief overview of the two steps of speech emotion recognition to help the reader better understand the background of the whole system. Then we focus on the feature selection method in section III. The details of the experiment, including the databases and the speech emotion recognition results, are presented in section IV. Finally, a brief conclusion is given in section V.

II. EMOTION RECOGNITION

A. Emotion Categorization

There are many schemes for the classification of emotions proposed by different psychologists. The American psychologist Ekman proposed six kinds of basic emotions [6]: anger, fear, sadness, surprise, happiness, and disgust. Another scientist, Plutchik, regarded anger, fear, acceptance, sadness, surprise, happiness, vigilance, and resentment as the eight basic emotions [7]. These schemes are all similar in the range of affect they cover. At present, most speech emotion recognition systems use the five kinds of emotional states defined by the MPEG-4 international standard: anger, surprise, sadness, fear, and happiness. In this paper, we take these five kinds of emotions plus neutral as the six basic emotions, and in the following we call them the six standard emotions.

B. Feature Extraction

A large number of paralinguistic and linguistic feature
information related to emotion expression is present in speech signals. Among these features, the Mel Frequency Cepstrum Coefficient (MFCC) and the Linear Prediction Cepstrum Coefficient (LPCC) are widely employed. In speech recognition, the MFCC features are segmental features, since they are generated for each short speech frame to decode the phonemes. However, according to [22], MFCC features are less effective for emotion recognition, because they are related to linguistic information rather than paralinguistic features. In addition, the emotional state of the speaker is unlikely to change as fast as the phonemes do; typically, we assign one emotion to one short utterance of a few seconds. Hence we need suprasegmental features with one feature value per utterance. Even so, the MFCC features were tested in the experiment for emotion recognition, and their results compared to the other prosodic features are shown in section IV.

In this paper, we take several prosodic features and formant-related features as the main candidates in testing for the best feature sets. The names of these features and their descriptions are given in TABLE I.

TABLE I
Feature             Description
Rate                passing rate of syllables per unit time
Pitch average       the average value of the pitch
Pitch range         the range of the pitch
Energy              the intensity and the amplitude of the voice
Zero crossing rate  the number of times the signal passes through the zero axis
Formant frequency   the frequency of the formant
Formant bandwidth   the bandwidth of the formant
All these speech features are discussed under short-time analysis; each is extracted from a 30 ms speech frame.

Some of these acoustic features are described as follows.

Pitch, also referred to as the fundamental frequency, is defined as the lowest frequency of a periodic waveform. In terms of a superposition of sinusoids, the fundamental frequency is the lowest-frequency sinusoid in the sum. In speech processing, the dynamic range and the rate of change of the fundamental frequency reflect the change of emotions to a certain degree.

Another common speech feature in recognition is the formant. A formant, often measured as an amplitude peak in the frequency spectrum of the sound, is an important feature reflecting the resonance characteristics of the human vocal tract. It corresponds to resonances within the spectral shape of the speech signals and gives direct information about the source of the sound. In addition to conveying phonemic identities, formants are also affected by speaker and accent characteristics. Thus the study of the formant structure becomes an essential step for many kinds of analyses and applications in speech processing.

In the spectrogram, the formant with the lowest frequency is called f1, the second f2, and the third f3. There are multiple formants in every frame of voice, but the first three formants (f1, f2, f3) are often enough to distinguish the resonance characteristics of the human vocal tract. These three formants determine the quality of the voice, as they contain the greater part of the spectrum. Most formants are produced by tube and chamber resonance, but the pronunciation of different emotions changes the shape of the vocal tract and thus affects where the formants form.

Extraction of formants is an important research topic in speech recognition. Many extraction methods have been proposed; the most frequently used is linear predictive analysis (LPC) [23], chosen for its low algorithmic complexity and ease of implementation. It is a technique that models the vocal tract and extracts the formant frequencies of speech from the linear prediction spectrum. In general, the LPC spectrum follows the envelope of the magnitude spectrum, and the vowel formants correspond to its peaks.

Fig.1. The original magnitude spectrum and the first four formants [24]

In Fig.1 [24], we can see the original magnitude spectrum of a speech signal and its final formants. The resulting curve is presented in two dimensions, like the FFT spectrum.

Other important speech features in emotion recognition are energy and the zero crossing rate (ZCR). The former shows the intensity and the amplitude of the signal. It is calculated as (1):

    E_n = sum_{i=0}^{N-1} x_n^2(i)                                  (1)

where x_n(i) is the i-th sample of the n-th frame of the speech signal and N is the frame length.

The ZCR indicates the number of times the signal crosses the zero axis in each speech frame. It is calculated as (2):

    Z_n = (1/2) * sum_{i=0}^{N-1} |sgn[x_n(i)] - sgn[x_n(i-1)]|     (2)

where sgn[x] = 1 for x >= 0 and sgn[x] = -1 for x < 0.

These features also play an effective role in speech emotion recognition. For example, when a voice expresses an active emotion (such as happiness, anger, or surprise), it usually carries a relatively higher amount of energy, which is reflected in the average amplitude of the signals. And the ZCR is an indicator of the frequency of the signal, which can be used to detect the voiced and unvoiced parts of the speech.
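As a concrete illustration of these two measures, the short-time energy of eq. (1) and the zero crossing rate of eq. (2) can be computed per frame as follows. This is a minimal Python/NumPy sketch; the frame length and the test tone are arbitrary choices for the example, not values taken from the experiments in this paper:

```python
import numpy as np

def frame_energy(frame):
    """Short-time energy, eq. (1): sum of squared samples in one frame."""
    frame = np.asarray(frame, dtype=float)
    return np.sum(frame ** 2)

def zero_crossing_rate(frame):
    """Zero crossing count, eq. (2): half the summed |sgn(x_i) - sgn(x_{i-1})|,
    with sgn(x) = 1 for x >= 0 and -1 for x < 0."""
    frame = np.asarray(frame, dtype=float)
    s = np.where(frame >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(s[1:] - s[:-1]))

# Example: one 30 ms frame of a 100 Hz tone at a 16 kHz sampling rate.
fs = 16000
t = np.arange(int(0.030 * fs)) / fs
frame = np.sin(2 * np.pi * 100 * t)
print(frame_energy(frame))        # short-time energy of the frame
print(zero_crossing_rate(frame))  # sign changes; roughly 2*f*T for a pure tone
```

A high-energy, low-ZCR frame is typically voiced, while a low-energy, high-ZCR frame is typically unvoiced, which is how these two measures support the voiced/unvoiced decision described above.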


All the speech acoustic features mentioned above are called standard features in the following, as listed in TABLE I.

C. Classification

Speech emotion recognition is a supervised learning problem, and various supervised learning methods have been proposed. The most popular approaches are the support vector machine (SVM) as an extension of linear discriminant analysis (LDA) to a high-dimensional feature space, the hidden Markov model (HMM) to capture temporal state transitions, Bayesian learning, linear discriminant analysis, and the multilayer neural network. In our experiment, SVM is used as the classifier to perform the recognition process. Here is a quick review of this learning machine.

SVM is a popular supervised learning method used for classification, regression, and many other tasks. Each pattern used for training the classifier carries the correct emotion class label. The core idea is to find a hyperplane that divides the data so that all samples marked with the same label are on the same side of the hyperplane. No matter what dimensional space the original problem is stated in, we often find that the data to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be transformed into a much higher-dimensional space, where the separation is much easier.

Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class. There may be many linear classifiers that can separate the original samples, but only one maximizes the margin; in general, the larger the margin between the two sample sets, the more easily the two pattern categories can be distinguished, the better the classifier performs on them, and the lower its generalization error. SVM has proved to be an excellent tool in the classification field.

III. FEATURE SELECTION

As described in the introduction, an excessive number of selected features will not only slow down the computation but also reduce the accuracy of classification. So the use of dimensionality reduction tools to remove irrelevant feature data is essential.

The most widely used linear dimensionality reduction techniques include Fisher's linear discriminant analysis (LDA) [8, 9] and principal component analysis (PCA) [12, 13], which have been successfully used to reduce the dimensionality of original data and abandon irrelevant information [14, 15, 20, 21]. In this study, PCA is adopted to deal with the dimensionality reduction problem.

PCA is considered an effective tool for statistical process monitoring [12], and it is widely used in numerous areas including feature extraction and signal analysis. The essence of PCA in emotion recognition is to transfer the high-dimensional data onto a low-dimensional space through dimensionality reduction. The transfer must not distort the data, which means only the noise and the redundant data are filtered out by PCA. Noise is the useless information in the extracted data; it strongly affects the relevance between the features. Redundant data is the overlapping information among the features; it makes no contribution to improving the accuracy of the recognition.

Next, we give a brief review of the PCA algorithm. The whole approach can be divided into three steps, summarized as follows:

1. Data preparation: The main task of this step is to collect N samples a_i (i = 1, 2, ..., N), where each a_i is a p-dimensional feature vector. The data matrix is formed as (3):

    X_{p x N} = [a_1, a_2, ..., a_N],  a_i in R^p                   (3)

Each row of the matrix X is then centered to zero mean (each feature has its mean subtracted).

2. Eigenvalue decomposition and formation of the projection matrix: The covariance matrix is defined as (4):

    Z_{p x p} = (1 / (N - 1)) X_{p x N} X_{p x N}^T                 (4)

After the covariance matrix is formed, it has to be diagonalized. Note that Z is a symmetric matrix, so diagonalizing it means finding an orthogonal matrix P that satisfies (5):

    V_{p x p} = P^T Z_{p x p} P                                     (5)

The diagonalization is carried out by eigenvalue decomposition of the matrix Z, which yields p eigenvalues and their corresponding eigenvectors; the orthogonalized eigenvectors form the matrix P described above, and the sum of all the eigenvalues equals the trace of Z, i.e., the total variance of the original variables. If we take the m largest eigenvalues and their eigenvectors P_1, P_2, ..., P_m, these eigenvectors together form the projection matrix as shown in (6):

    W_{m x p} = [P_1, P_2, ..., P_m]^T                              (6)

3. Obtain the new dimensionality-reduced matrix by projecting the original matrix:

    X'_{m x N} = W_{m x p} X_{p x N}                                (7)

Comparing (3) with (7), we can easily see that the dimension of the original matrix has been reduced from p to m (m < p). The new data set reduces the noise and redundancy mentioned above to a large extent. At this point, the entire process of PCA is finished.
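The three steps above can be sketched directly in Python with NumPy. This is an illustrative implementation of equations (3)-(7) under the convention that features are rows and samples are columns; it is not the exact code used in the experiments, and the example data are synthetic:

```python
import numpy as np

def pca_reduce(X, m):
    """Reduce a p x N data matrix X (p features, N samples) to m x N.
    Follows steps (3)-(7): center rows, form the covariance matrix,
    eigendecompose it, and project onto the m leading eigenvectors."""
    X = np.asarray(X, dtype=float)
    p, N = X.shape
    # Step 1: center each row (feature) to zero mean.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Step 2: covariance matrix Z = Xc Xc^T / (N - 1), eq. (4).
    Z = Xc @ Xc.T / (N - 1)
    # Eigendecomposition of the symmetric matrix Z (eigh returns
    # eigenvalues in ascending order), eq. (5).
    eigvals, eigvecs = np.linalg.eigh(Z)
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    W = eigvecs[:, order[:m]].T              # projection matrix, eq. (6), m x p
    # Step 3: project the centered data, eq. (7).
    return W @ Xc

# Example: 5 correlated features, 100 samples, reduced to 2 dimensions.
rng = np.random.default_rng(0)
base = rng.normal(size=(2, 100))
X = np.vstack([base, base + 0.01 * rng.normal(size=(2, 100)),
               rng.normal(size=(1, 100))])
Y = pca_reduce(X, 2)
print(Y.shape)  # (2, 100)
```

Because the two duplicated feature pairs are almost perfectly correlated, most of their variance collapses onto the leading components, which is exactly the redundancy removal described above.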



IV. EXPERIMENT

A. Database Preparation

Two emotional databases were tested in our experiment to perform the emotion recognition. One is the well-known German Berlin emotional speech database [5]. The other is a Chinese emotional dialogue database collected by our team. The Berlin speech database contains 694 utterances for six emotions: happiness, anger, sadness, anxiety, boredom, and neutral. All these utterances were recorded by ten actors in German; the utterances are between two and five seconds long, and the signals are sampled at 16 kHz. The database established by our team was collected in the form of dialogues. It contains a total of 99 dialogues with 488 utterances, and each dialogue is between two persons. The emotion types of this database are happiness, anger, sadness, neutral, fear, and surprise.

B. Experiment Setup

We applied the linear predictive coding method [23] to extract the pitch and the frequencies and bandwidths of the formants. In the experiment, each utterance is divided into 30 ms speech frames with an overlap of 10 ms between successive frames, and the unvoiced parts have been removed. All the experiments, from feature selection and PCA to the SVM classification, are implemented on the MATLAB platform.

C. Experiment Results

Three kinds of speech features are used to train the classifier: the MFCC features, the standard features, and the standard features after processing by PCA. The original utterances are picked from the Berlin speech database, with all six kinds of emotions selected evenly. The accuracies of the recognition are shown in Fig.2 and Fig.3.

Fig.2. Emotion recognition accuracy among different features, using database 1.

Fig.3. Emotion recognition accuracy among different features, using database 2.

It is obvious that the MFCC features are not suitable for speech emotion recognition compared to the standard features, and as the number of utterances increases, the features processed by PCA show a greater advantage in recognition. But why is the recognition accuracy with PCA lower than that of MFCC when the data set is small? As we know, the purpose of PCA is to reduce noise and redundancy; when the data set is small, the effectiveness of PCA decreases.

Both of the databases are used for the second experiment. Four emotions shared by the two databases (happiness, anger, sadness, and neutral) are applied. We selected 60 utterances for each of these four emotions from the two databases. Then each of the former three emotions (happiness, sadness, and anger) is paired with the emotion neutral. After feature extraction, these three pairs are called the original feature sets, and the new feature sets are formed after processing by PCA. Both the original feature sets and the new feature sets are used to train the recognition system. The accuracies for these pairs are presented in Fig.4 and Fig.5.

Fig.4. Recognition accuracy of three pairs of emotions, using database 1.
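For readers who want to reproduce a comparable setup, the overall pipeline (features, then PCA, then SVM) can be sketched with scikit-learn instead of the MATLAB tools used in the paper. The synthetic feature matrix, the number of retained components, and the SVM parameters below are placeholders for illustration, not the values from the experiments above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 240 utterances x 20 prosodic features, 6 emotion labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 20))
y = rng.integers(0, 6, size=240)
X[:, 0] += y  # make the classes weakly separable so training is meaningful

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Standardize, keep the leading principal components, then classify with SVM.
clf = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

Fitting the PCA inside the pipeline ensures the projection matrix is estimated on the training split only, which avoids leaking test-set statistics into the reduced features.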


Fig.5. Recognition accuracy of three pairs of emotions, using database 2.

V. CONCLUSION

In this paper, we first gave a brief review of emotion recognition from speech signals. Then we introduced PCA to improve the accuracy of the recognition. The first experiment showed that the standard features processed by PCA have an advantage in recognition over the unprocessed features and the MFCC features. The second experiment showed the accuracy for different emotions compared with neutral. This study combined PCA with an SVM classifier to increase the accuracy of emotion recognition and showed the superiority of prosodic features over MFCC features in speech emotion recognition.

ACKNOWLEDGMENT

This research has been partially supported by the National Natural Science Foundation of China under Grant No. 61203312, the National High-Tech Research & Development Program of China (863 Program) under Grant No. 2012AA011103, the Key Science and Technology Program of Anhui Province under Grant No. 1206c0805039, and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.

REFERENCES

[1] J. Ang, R. Dhillon, A. Krupski, E. Shriberg, A. Stolcke, "Prosody-based automatic detection of annoyance and frustration in human-computer dialog," in 7th International Conference on Spoken Language Processing, Denver, Colorado, pp. 2037-2040, 2002.
[2] R. Banse, K. R. Scherer, "Acoustic profiles in vocal emotion expression," J. Pers. Soc. Psychol., vol. 70, pp. 614-636, 1996.
[3] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd ed., Wiley, New York, 2001.
[4] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, V. Aharonson, "Whodunnit: searching for the most important feature types signalling emotion-related user states in speech," Comput. Speech Lang., vol. 25, no. 1, pp. 4-28, 2011.
[5] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss, "A database of German emotional speech," in Interspeech-2005, Lisbon, Portugal, pp. 1-4, 2005.
[6] P. Ekman, "An argument for basic emotions," Cognition and Emotion, vol. 6, pp. 169-200, 1992.
[7] R. Plutchik, "A general psychoevolutionary theory of emotion," in Emotion: Theory, Research, and Experience, vol. 1, pp. 3-33, 1980.
[8] R. Fisher, "The use of multiple measurements in taxonomic problems," Ann. Eugenics, vol. 7, pp. 179-188, 1936.
[9] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic, Boston, 1990.
[10] V. Hozjan, Z. Kacic, "Improved emotion recognition with large set of statistical features," in EUROSPEECH-2003, Geneva, pp. 133-136, 2003.
[11] A. Iliev, M. Scordilis, J. Papa, A. Falcao, "Spoken emotion recognition through optimum-path forest classification using glottal features," Comput. Speech Lang., vol. 24, no. 3, pp. 445-460, 2010.
[12] I. T. Jolliffe, Principal Component Analysis, 2nd ed., Springer, New York, 1986.
[13] K. Pearson, "On lines and planes of closest fit to systems of points in space," Phil. Mag., vol. 2, no. 6, pp. 559-572, 1901.
[14] C. M. Lee, S. S. Narayanan, "Toward detecting emotions in spoken dialogs," IEEE Trans. Speech Audio Process., vol. 13, no. 2, pp. 293-303, 2005.
[15] C. M. Lee, S. S. Narayanan, R. Pieraccini, "Recognition of negative emotions from the speech signal," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Trento, pp. 240-243, 2001.
[16] V. Petrushin, "Emotion recognition in speech signal: experimental study, development, and application," in 6th International Conference on Spoken Language Processing (ICSLP'00), Beijing, China, pp. 222-225, 2000.
[17] B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, "The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals," in INTERSPEECH-2007, Antwerp, Belgium, pp. 2253-2256, 2007.
[18] D. Ververidis, C. Kotropoulos, "Emotional speech recognition: resources, features, and methods," Speech Commun., vol. 48, no. 9, pp. 1162-1181, 2006.
[19] S. Yildirim, S. Narayanan, A. Potamianos, "Detecting emotional state of a child in a conversational computer game," Comput. Speech Lang., vol. 25, no. 1, pp. 29-44, 2011.
[20] B. Schuller, G. Rigoll, M. Lang, "Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture," in IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Quebec, Canada, pp. 577-580, 2004.
[21] D. Ververidis, C. Kotropoulos, I. Pitas, "Automatic emotional speech classification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'04), Montreal, Quebec, Canada, pp. 593-596, 2004.
[22] M. Lugger, B. Yang, "Psychological motivated multi-stage emotion classification exploiting voice quality features," in F. Mihelic, J. Zibert (Eds.), Speech Recognition, In-Tech, 2008, Chapter 22.
[23] Q. Zhao, T. Shimamura, J. Takahashi, J. Suzuki, "Improvement of noise robustness for formant frequency extraction based on linear predictive analysis," Electronics and Communications in Japan, vol. 85, no. 9, pp. 745-758, 2002.
[24] J. Darch, B. Milner, S. Vaseghi, "MAP prediction of formant frequencies and voicing class from MFCC vectors in noise," Speech Communication, vol. 48, pp. 1556-1572, 2006.
