
Speech Emotion Recognition Based on Deep Learning and Kernel Nonlinear PSVM

HAN Zhiyan, WANG Jian
College of Engineering, Bohai University, Jinzhou 121013
E-mail: hanzyme@126.com

Abstract: To improve the accuracy of speech emotion recognition, this paper proposes a new emotion recognition technique based on deep learning and a kernel nonlinear PSVM (Proximal Support Vector Machine) to discern four fundamental human emotions (anger, joy, sadness, surprise). First, the speech signal is preprocessed. Then a DBN (Deep Belief Network) is used to extract emotional features from the speech signal automatically. Next, the DBN automatic features and the traditional features (prosody features and quality features) are integrated into a total feature set. Finally, six nonlinear Proximal Support Vector Machines are used to recognize the emotion, and the majority voting principle yields the final identification result. To assess the new method, we compare the total features, the DBN automatic features, and the traditional features. The experimental results indicate that the total features outperform the other two feature sets.
Key Words: Emotion Recognition, Speech Signal, Deep Learning, PSVM

1 INTRODUCTION

Emotion recognition is an interdisciplinary research field that has received increasing attention over the last few years. Emotional states can be inferred from the speech signal, facial expressions, or physiological signals. The speech signal is the most practical choice, because acquiring the other signals is complicated in many real applications [1]. We therefore use the speech signal to recognize emotion.

A speech emotion recognition system generally consists of three parts: speech signal preprocessing, feature extraction, and classification. Feature extraction is the most important part. Most previous work on emotion recognition has been based on the analysis of speech prosody features and voice quality features. The quality of these characteristic parameters directly influences the result of emotion identification.

Although many speech emotion features have been proposed, there is still no manually designed optimal feature set. Researchers tend to combine more and more features, which leads to an excessive-dimension problem. Moreover, the robustness of an emotion recognition system is easily affected by variation in speaker, content, and environment. Recently, deep learning has been successfully applied in various fields such as speech recognition and image understanding [2]; Microsoft and Google use deep learning techniques to address challenges in these key areas. Deep learning is a rapidly growing field that models high-level patterns as intricate multilayer networks and is used mainly in artificial intelligence and machine learning. For example, in 2014, Kun Han et al. [3-4] developed a speech emotion recognition model using a Deep Neural Network (DNN) and an Extreme Learning Machine (ELM); their experiments showed that the method effectively learns emotional features and yields a 20% increase in accuracy. In 2018, Jin et al. [5-6] proposed a sentiment analysis model combining deep learning and ensemble learning, and showed that the combination improves the accuracy of sentiment analysis. In 2014, Liu et al. [7] presented a Boosted Deep Belief Network that performs three training stages iteratively in a unified loopy framework; extensive experiments on two public databases showed that the framework yields dramatic improvements in facial expression analysis.

Among the advantages of deep learning are its ability to detect complex interactions among features and its capability to learn low-level features from minimally processed raw data.

Based on the above analysis, we propose a novel emotion recognition method using deep learning and a kernel nonlinear PSVM, aiming to improve both the recognition rate and the computational efficiency, and we apply it to recognize four basic human emotions (anger, joy, sadness, surprise).

This work is supported by the National Natural Science Foundation under Grants 61503038 and 61403042, by the Liaoning Natural Science Foundation under Grant 20180550189, and by the Bohai University Teaching Reform Program BDJG2016QA02.

2 SYSTEM ARCHITECTURE

The system architecture is shown in Figure 1. The implementation proceeds as follows:

Step 1: Acquire the emotional speech signal and perform sampling, quantization, pre-emphasis, framing, windowing, and endpoint detection.
Step 2: Use a DBN to extract emotional features from the speech signal automatically.
Step 3: Extract speech prosody features and voice quality features.
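The preprocessing of Step 1 can be sketched as follows. This is a minimal illustration rather than the authors' exact implementation; the 25 ms Hamming window, 10 ms hop, and pre-emphasis coefficient 0.97 are common choices assumed here, not values stated in the paper.

```python
import numpy as np

def preprocess(signal, fs=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasize, frame, and window a speech signal (illustrative values)."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)   # samples per frame (400 at 16 kHz)
    hop = int(fs * hop_ms / 1000)           # samples between frame starts (160)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)

    # Split into overlapping frames and apply a Hamming window to each
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)

# Usage: one second of a 100 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
frames = preprocess(np.sin(2 * np.pi * 100 * t))
```

Endpoint detection (trimming leading and trailing silence) would then operate on per-frame energies computed from these windowed frames.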

978-1-7281-0106-4/19/$31.00 © 2019 IEEE
Step 4: Integrate the DBN automatic features and the traditional features into a total feature set.
Step 5: Use six nonlinear Proximal Support Vector Machines to recognize the emotion.
Step 6: Use the majority voting principle to obtain the final identification result.

Fig 1. A block diagram of the system architecture: the speech is preprocessed; DBN features and traditional features are extracted and integrated; six PSVMs classify the emotion; majority voting gives the final result.

3 FEATURE EXTRACTION

3.1 Extract DBN Features

A Restricted Boltzmann Machine (RBM) is one of the most common components of deep probabilistic models. An RBM is an undirected probabilistic graphical model consisting of a visible layer and a hidden layer. RBMs can be stacked to form deeper models, such as the DBN. Figure 2 shows the structure of an RBM.

Fig 2. The diagram structure of an RBM: a visible layer fully connected to a hidden layer.

Assuming binary visible variables v \in \{0,1\}^M and hidden variables h \in \{0,1\}^N, the joint distribution over the visible and hidden units is

  p(v, h; \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta))        (1)

where \theta = \{a, b, W\} are the parameters, Z(\theta) is the normalization constant, and E(v, h; \theta) is the energy function:

  Z(\theta) = \sum_{v} \sum_{h} \exp(-E(v, h; \theta))        (2)

  E(v, h; \theta) = -\sum_{i=1}^{M} \sum_{j=1}^{N} W_{ij} v_i h_j - \sum_{i=1}^{M} b_i v_i - \sum_{j=1}^{N} a_j h_j        (3)

Deep belief networks [4,8] are highly complex directed acyclic graphs formed by stacking a sequence of RBMs. The DBN was one of the first non-convolutional models to which deep architecture training was successfully applied [9,10], and its introduction began the current renaissance of deep learning. A DBN can be trained by training its constituent RBMs layer by layer from bottom to top. Since each RBM can be trained rapidly with the contrastive divergence algorithm, this layer-wise scheme avoids the high complexity of training the whole DBN at once, reducing the problem to training each RBM separately. Studies of the DBN have shown that it can overcome the slow convergence and local-optimum problems of the traditional back-propagation algorithm for training multilayer neural networks. Figure 3 shows the architecture of a three-hidden-layer DBN, in which the RBMs are trained layer by layer from bottom to top.

Fig 3. The diagram structure of a three-hidden-layer DBN: RBM1, RBM2, and RBM3 stacked from bottom to top.
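The energy function of Eq. (3) and one contrastive-divergence (CD-1) update, the layer-wise training step described above, can be sketched as follows. This is a minimal illustrative sketch with toy layer sizes and an arbitrary learning rate, not the network trained in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, b, a):
    """E(v, h) = -v'Wh - b'v - a'h, matching Eq. (3)."""
    return -(v @ W @ h) - (b @ v) - (a @ h)

def cd1_step(v0, W, b, a, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM."""
    # Up pass: hidden activation probabilities given the data
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down pass: reconstruct the visible layer, then re-infer hidden probs
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a)
    # Gradient approximation: positive phase minus negative phase
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    a += lr * (ph0 - ph1)
    return W, b, a

# Toy usage: M = 6 visible units, N = 3 hidden units
M, N = 6, 3
W = rng.normal(0, 0.01, (M, N))
b, a = np.zeros(M), np.zeros(N)
v = rng.integers(0, 2, M).astype(float)
W, b, a = cd1_step(v, W, b, a)
```

In the stacked DBN, the hidden activations of one trained RBM become the "visible" input of the next, which is how the layer-by-layer bottom-to-top training proceeds.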

3.2 Extract Traditional Features

In this paper, 16 emotion features are extracted: the first nine are prosody features and the remaining seven are quality features [11].
The first nine features are shown in Table 1.

Table 1. Prosody Features

Feature 1: the ratio of the duration of the sentence pronunciation to the duration of the corresponding calm statement
Feature 2: the pitch frequency average value
Feature 3: the maximum pitch frequency average value
Feature 4: the difference between the pitch frequency average value and the corresponding calm statement's pitch frequency average value
Feature 5: the difference between the maximum pitch frequency and the corresponding calm statement's maximum pitch frequency
Feature 6: the amplitude average energy
Feature 7: the amplitude energy dynamic range
Feature 8: the difference between the amplitude average energy and the corresponding calm statement's amplitude average energy
Feature 9: the difference between the amplitude energy dynamic range and the corresponding calm statement's value

The 31st Chinese Control and Decision Conference (2019 CCDC)

The latter seven features are shown in Table 2.

Table 2. Quality Features

Feature 10: the first resonance peak frequency average value
Feature 11: the second resonance peak frequency average value
Feature 12: the third resonance peak frequency average value
Feature 13: the harmonic-to-noise ratio mean
Feature 14: the maximum of the harmonic-to-noise ratio
Feature 15: the minimum of the harmonic-to-noise ratio
Feature 16: the harmonic-to-noise ratio variance

4 NONLINEAR PROXIMAL SUPPORT VECTOR MACHINE

The Proximal Support Vector Machine (PSVM) classifies points according to their proximity to one of two parallel planes. Obtaining a linear or nonlinear PSVM classifier requires nothing more complicated than solving a single system of linear equations [12-16].

To obtain our nonlinear proximal support vector machine classifier, we rewrite the optimization problem (4) as (5) by replacing the primal variable \omega with its dual equivalent \omega = A'Du:

  \min_{(\omega, \gamma, y) \in R^{n+1+m}} \frac{\nu}{2} \|y\|^2 + \frac{1}{2} (\omega'\omega + \gamma^2)
  s.t.  D(A\omega - e\gamma) + y = e        (4)

  \min_{(u, \gamma, y) \in R^{m+1+m}} \frac{\nu}{2} \|y\|^2 + \frac{1}{2} (u'u + \gamma^2)
  s.t.  D(AA'Du - e\gamma) + y = e        (5)

If we now replace the linear kernel AA' by a nonlinear kernel K(A, A'), we get:

  \min_{(u, \gamma, y) \in R^{m+1+m}} \frac{\nu}{2} \|y\|^2 + \frac{1}{2} (u'u + \gamma^2)
  s.t.  D(K(A, A')Du - e\gamma) + y = e        (6)

Here we abbreviate K(A, A') by the shorthand notation K. We use the following Gaussian kernel for the PSVM:

  (K(A, B))_{ij} = \exp(-\mu \|A_i' - B_{\cdot j}\|^2), \quad i = 1, \ldots, m, \; j = 1, \ldots, k        (7)

where A \in R^{m \times n}, B \in R^{n \times k}, and \mu is a positive constant.

The Lagrangian of (6) is

  L(u, \gamma, y, v) = \frac{\nu}{2} \|y\|^2 + \frac{1}{2} (u'u + \gamma^2) - v'(D(KDu - e\gamma) + y - e)        (8)

where v \in R^m is the Lagrange multiplier. Setting the gradients of the Lagrangian with respect to (u, \gamma, y, v) to zero gives the following KKT optimality conditions:

  u - DK'Dv = 0
  \gamma + e'Dv = 0
  \nu y - v = 0
  D(KDu - e\gamma) + y = e        (9)

From these we obtain u = DK'Dv, \gamma = -e'Dv, y = v / \nu, and hence:

  v = \left( \frac{I}{\nu} + D(KK' + ee')D \right)^{-1} e = \left( \frac{I}{\nu} + GG' \right)^{-1} e        (10)

where G is given by:

  G = D [K \;\; -e]        (11)

The nonlinear separating surface is

  x'\omega - \gamma = x'A'Du - \gamma = 0        (12)

We obtain the kernelized separating surface by replacing x'A' with K(x', A'):

  K(x', A')Du - \gamma = K(x', A')DDK(A, A')'Dv + e'Dv = (K(x', A')K(A, A')' + e')Dv = 0        (13)

The corresponding nonlinear classifier for this separating surface is then:

  (K(x', A')K(A, A')' + e')Dv  \begin{cases} > 0, & \text{then } x \in A+, \\ < 0, & \text{then } x \in A-, \\ = 0, & \text{then } x \in A+ \text{ or } x \in A- \end{cases}        (14)
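Equations (7) and (10)-(11) reduce PSVM training to forming the kernel matrix and solving one linear system, and Eq. (14) gives the resulting classifier. The sketch below illustrates this on made-up 2-D data; the kernel width mu, regularization nu, and the toy points are arbitrary choices for illustration, not the paper's experimental setup.

```python
import numpy as np

def gaussian_kernel(A, B, mu=1.0):
    """(K(A, B))_ij = exp(-mu * ||A_i - B_j||^2), as in Eq. (7)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def train_psvm(A, d, nu=1.0, mu=1.0):
    """Solve v = (I/nu + G G')^{-1} e with G = D [K  -e], Eqs. (10)-(11)."""
    m = A.shape[0]
    D = np.diag(d)                          # class labels +1 / -1 on the diagonal
    K = gaussian_kernel(A, A, mu)
    G = D @ np.hstack([K, -np.ones((m, 1))])
    e = np.ones(m)
    return np.linalg.solve(np.eye(m) / nu + G @ G.T, e)

def classify(x, A, d, v, mu=1.0):
    """Sign of (K(x', A') K(A, A')' + e') D v, as in Eq. (14)."""
    D = np.diag(d)
    Kx = gaussian_kernel(x[None, :], A, mu)          # 1 x m row vector
    K = gaussian_kernel(A, A, mu)
    score = (Kx @ K.T + np.ones((1, A.shape[0]))) @ D @ v
    return np.sign(score.item())

# Toy usage: two well-separated 2-D clusters
A = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
d = np.array([1.0, 1.0, -1.0, -1.0])
v = train_psvm(A, d, nu=10.0, mu=1.0)
print(classify(np.array([0.05, 0.1]), A, d, v))   # prints 1.0
```

For the one-versus-one scheme with four emotions, six such classifiers would be trained (one per emotion pair) and their sign outputs combined by majority voting.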

5 EXPERIMENTS

To evaluate the performance of the emotion recognition system, we conducted simulation experiments. Four discrete emotion states (anger, joy, sadness, and surprise) are classified. We recorded the emotional speech database in Chinese at a sampling rate of 16 kHz under 10 dB SNR with seven speakers; 1000 utterances per emotion were used for training, and another 1000 utterances per emotion were used for testing. For the deep network, one important observation is that there must be a sufficient reduction in information from the visible layer to the first hidden layer. This paper uses a four-hidden-layer deep belief network to extract the speech emotion features. We adopt one-versus-one classification, so six PSVMs are used, and the final recognition result is obtained through the majority voting principle. Table 3 shows the results of emotion recognition using traditional features and PSVM. Table 4 shows the results using DBN automatic features and PSVM. Table 5 shows the results using the integrated features (DBN automatic features plus traditional features) and PSVM.

Table 3. Recognition correct rate using traditional features + PSVM
(rows: actual emotion; columns: recognized emotion)

              Joy    Anger   Surprise   Sadness
  Joy         90%    4%      5%         1%
  Anger       4%     85%     10%        1%
  Surprise    17%    3%      80%        0
  Sadness     3%     3%      2%         92%

Table 4. Recognition correct rate using DBN automatic features + PSVM

              Joy    Anger   Surprise   Sadness
  Joy         93%    1%      3%         3%
  Anger       0      90%     7%         3%
  Surprise    3%     1%      90%        6%
  Sadness     0      1%      5%         94%

Table 5. Recognition correct rate using integrated features + PSVM

              Joy    Anger   Surprise   Sadness
  Joy         98%    1%      0          1%
  Anger       2%     97%     1%         0
  Surprise    5%     4%      91%        0
  Sadness     2%     1%      3%         94%

From Table 3, the average recognition error rate using traditional features and PSVM is 13.25%; from Table 4, the average error rate using DBN automatic features and PSVM is 8.25%; and from Table 5, the average error rate using the integrated features (DBN automatic features plus traditional features) and PSVM is 5.0%. The average recognition error rate using the integrated features is thus the lowest. Moreover, the error rate of the DBN automatic features is clearly lower than that of the traditional features. This suggests that the new algorithm extracts meaningful features and that the overall extraction process is closer to human emotion recognition.

6 CONCLUSION

In this paper, we propose a novel emotion recognition technique based on deep learning and a kernel nonlinear PSVM to discern four human emotions. We have shown that a significant improvement can be obtained with this method; in particular, integrating the DBN automatic features with the traditional features performs best, so the choice of features has a great influence on the recognition results. However, the ways human beings express emotion are diverse, complex, and culturally dependent, and there are many limitations to recognizing emotion from speech alone. Future work can therefore combine facial expression signals with speech to recognize emotion.

REFERENCES

[1] E. M. Albornoz, M. Sanchez-Gutierrez, F. Martinez-Licona, H. L. Rufiner, J. Goddard, Spoken emotion recognition using deep learning, Proceedings of the 19th Iberoamerican Congress on Pattern Recognition, 104-111, 2014.
[2] Z. W. Huang, W. T. Xue, Q. R. Mao, Speech emotion recognition with unsupervised feature learning, Frontiers of Information Technology & Electronic Engineering, Vol.16, No.5, 358-366, 2015.
[3] K. Han, D. Yu, I. Tashev, Speech emotion recognition using deep neural network and extreme learning machine, Interspeech, 14-18, 2014.
[4] C. Pushpa, M. M. Priya, A review on deep learning algorithms for speech and facial emotion recognition, International Journal of Control Theory & Applications, Vol.9, No.24, 183-204, 2016.
[5] Z. G. Jin, Y. Han, Q. Zhu, A sentiment analysis model with the combination of deep learning and ensemble learning, Journal of Harbin Institute of Technology, Vol.50, No.11, 1-8, 2018.
[6] N. Agarwalla, D. Panda, M. K. Modi, Deep learning using restricted Boltzmann machines, International Journal of Computer Science & Information Security, Vol.7, No.3, 1552-1556, 2016.
[7] P. Liu, S. Han, Z. Meng, Y. Tong, Facial expression recognition via a boosted deep belief network, IEEE Conference on Computer Vision & Pattern Recognition, 1805-1812, 2014.
[8] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, Posts & Telecom Press, Beijing, China, 2017.
[9] G. E. Hinton, Learning multiple layers of representation, Trends in Cognitive Sciences, Vol.11, No.10, 428-434, 2007.
[10] G. E. Hinton, S. Osindero, Y. W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, Vol.18, No.7, 1527-1554, 2006.
[11] Z. Han, J. Wang, Speech emotion recognition based on Gaussian kernel nonlinear proximal support vector machine, Proceedings of the Chinese Automation Congress, 2513-2516, 2017.
[12] G. Fung, O. L. Mangasarian, Proximal support vector machine classifiers, International Conference on Knowledge Discovery & Data Mining, 77-86, 2001.
[13] L. H. Chiang, M. E. Kotanchek, A. K. Kordon, Fault diagnosis based on Fisher discriminant analysis and support vector machines, Computers & Chemical Engineering, Vol.28, No.8, 1389-1401, 2004.
[14] V. N. Vapnik, The nature of statistical learning theory, Springer, New York, 2000.
[15] P. S. Bradley, O. L. Mangasarian, Massive data discrimination via linear support vector machines, Optimization Methods and Software, Vol.13, No.1, 1-10, 2000.
[16] D. D. Aroor, C. S. Chellu, GMM-based intermediate matching kernel for classification of varying length patterns of long duration speech using support vector machines, IEEE Transactions on Neural Networks and Learning Systems, Vol.25, No.8, 1421-1432, 2014.

