
Speech Emotion Recognition Based on Convolution Neural Network Combined with Random Forest

Li Zheng1, Qiao Li2, Hua Ban1, Shuhua Liu1*
1. School of Information Science and Technology, Northeast Normal University, Changchun, China, 130024
E-mail: {zhengl887, banh622, liush129}
2. College of Humanities and Sciences of Northeast Normal University, Changchun, China, 130117

Abstract: The key to speech emotion recognition is the extraction of speech emotion features. In this paper, a new network
model (CNN-RF) based on a convolution neural network combined with a random forest is proposed. Firstly, the
convolution neural network is used as the feature extractor to extract speech emotion features from the normalized
spectrogram, and the random forest classification algorithm is used to classify these features. Experimental results
show that the CNN-RF model is superior to the traditional CNN model. Secondly, the Record Sound command box of Nao is
improved and the CNN-RF model is applied to the Nao robot. Finally, the Nao robot can "try to figure out" a human's
psychology through speech emotion recognition and recognize people's happiness, anger, sadness and joy,
achieving more intelligent human-computer interaction.
Key Words: Convolution Neural Network, Speech Emotion Recognition, Random Forest, Spectrogram, Nao Robot

This work is supported by the project of Jilin Provincial Science and Technology Department under Grant 20180201003GX.
978-1-5386-1243-9/18/$31.00 © 2018 IEEE

1 INTRODUCTION
Facing the development trend of the era and the increasing material and spiritual demands of people, the robot industry is extending from manufacturing to the service industry, and home service robots based mainly on speech interaction are about to enter thousands of households. With the gradual development of a new generation of man-machine interaction technology and the continuously increasing demand for emotional intelligence in home service robots, man-machine interaction technology based on speech emotion recognition has attracted wide research attention. Traditional machine learning methods have made great progress in speech emotion recognition. However, some problems remain. On the one hand, there is no consensus on which features reflect the differences between emotions. On the other hand, these hand-designed features depend heavily on the database, generalize poorly, and take a long time to extract. Deep learning can extract different layers of features from the original data through automatic learning, and it has been widely used in speech recognition, image recognition and other fields. In this study, a CNN model is used as the feature extractor to extract high-order features of the spectrogram, and RF is used as the classifier to design and implement a speech emotion recognition system, which is then applied to the NAO robot.

2 RELATED WORK
In traditional speech emotion recognition methods, the extraction of speech emotional features mainly focuses on superphonetics and linguistics to extract prosodic features [1], spectral features [2], and other features [3-4]. These features are combined with support vector machines (SVM) [5] and other classifiers to recognize speech emotions. Yu Gu et al. [6] proposed the voiced segment selection (VSS) algorithm, which realizes accurate segmentation of voice signals. Compared with traditional methods, the VSS algorithm treats the voiced signal segments as a texture image processing problem. Voiced and unvoiced sound features in the spectrogram were classified using the Log-Gabor filter. Jun Deng et al. [7] established an emotion recognition system based on phase features. Fisher kernels and a GMM were introduced to extract the phase features: Fisher vectors encoded with the GMM were applied, and the phase features of speech signals were then extracted from these Fisher vectors. Finally, a linear support vector machine was used for classification and recognition. Pavitra Patel et al. [8] applied the PCA algorithm to extract pitch, loudness and formants (resonance peaks) from speech to reduce data dimensions. In that study, an improved GMM algorithm was introduced and the expectation maximization (EM) algorithm was integrated into the Boosting framework. Experiments showed that the Boosted-GMM algorithm effectively increases the speech emotion recognition rate.
Considering the shortcomings of traditional machine learning and the advantages of deep learning, Wootaek Lim et al. [9] combined the convolution neural network with a special recurrent neural network and constructed a new deep neural network using the Time Distributed layer: Time Distributed CNNs. The CNNs and the LSTM network perform feature learning together. Experiments demonstrated that the recognition rate of Time Distributed CNNs on the 7 emotions in the EmoDB database was higher than those of CNNs and LSTM networks. Reference [10] applied the same approach, combining CNNs and LSTM networks for
automatic learning of features that are easily distinguished from the original speech signals, and used these features to address context-dependent feature extraction. Compared with traditional speech emotion recognition methods based on signal processing, the proposed method has good prediction capacity on the RECOLA natural emotion database.

3 ... ON CNN-RF
The framework of the speech emotion recognition system is shown in Figure 1. The spectrogram is calculated from emotional speech samples through framing, windowing, short-time Fourier transform (STFT) and power spectral density (PSD), and the normalized spectrogram is used as the input of the CNN. Speech emotion features are extracted by the CNN, and the output of the CNN Flatten layer is fed into the RF classifier as the feature vectors of the speech emotion samples. In the recognition stage, test speech signals are transformed into spectrograms and then fed into the CNN-RF model to recognize the type of speech emotion.

[Figure: Signals -> Spectrogram -> CNN-RF Model -> Types of Emotions, with a "Train CNN-RF Model" stage]
Fig 1. Framework of speech emotion recognition system

3.1 CNN Feature Extraction
Speech emotion features are the basis of speech emotion recognition. The accuracy with which speech emotion features are extracted from the original speech emotion samples directly influences the final recognition rate. The comparison diagram between CNN-RF and CNN is shown in Figure 2. In order to learn more global features from different angles, multiple convolution kernels are set in different layers. The parameter settings of the CNN model for feature extraction are as follows:
INPUT layer: Emotional speech samples of different durations are normalized into color spectrograms of 99*145 pixels with 3 channels in jpg format and used as the input;
C1 and C2 layers: Convolution layers. Each of these two layers uses 16 convolution kernels of size 5*5. The feature map size changes from 99*145 to 95*141 in C1 and to 91*137 in C2, and each layer generates 16 feature maps;
S3 layer: Pooling layer. This layer uses 2*2 max pooling to reduce the data dimension and generates 16 feature maps of size 45*68.
C4 and C5 layers: Convolution layers. Each of these layers uses 16 convolution kernels of size 3*3. The size of the 16 feature maps changes from 43*66 in C4 to 41*64 in C5.
S6 layer: Pooling layer. 2*2 max pooling is used, and 16 feature maps of size 20*32 are obtained.
F1 layer: Fully connected layer. This paper uses two fully connected layers. F1 has 128 units and F2 has 6 units (6 is the number of speech emotion categories), and softmax is used as the classifier.
D layer: Dropout layer. Dropout is applied to F1 with the parameter set to 0.6, the value at which the training loss and the validation loss (val_loss) decreased best. Its purpose is to improve the generalization ability of the model.
It can be seen from Fig. 2 that the applied CNN feature extraction model resembles VGG16 and VGG19 [11] in the design of its convolution layers. Stacking convolution layers repeatedly is conducive to fuller feature learning.

[Figure: layers C1, C2, S3, C4, C5, S6, F1, D]
Fig 2. Structure of CNN model

3.2 RF Classifier
RF [12] is used as the classifier after the CNN completes feature extraction. The RF designed in this study contains 200 decision trees, and the Gini index is used as the splitting criterion. Apart from these two parameters, all other parameters use the default values of the RF classifier in the scikit-learn package.
When the RF is generated, the following two random sampling methods are used to construct the decision trees, which effectively increases generalization ability:
• Row sampling. From the original training dataset of N samples, n (n<<N) samples are drawn randomly with replacement to generate each decision tree. The input samples of each tree are a proper subset of the original training dataset, which helps prevent overfitting.
• Column sampling. Suppose there are M features; each node chooses m (m<<M) features to determine the best split point. This ensures that each tree is split completely, or that every leaf node points to a single class.
In addition, the classification result of the RF is determined by the number of votes for each output class across all decision trees, resulting in relatively high classification accuracy.

3.3 CNN-RF Model Analysis
In actual applications of a speech emotion recognition system, environmental noise makes the speech samples collected by the robot far less clean than the samples in the database. Therefore, it is extremely important to enhance the generalization ability of the CNN-RF model while increasing recognition accuracy.
The CNN-RF model uses the CNN as a dedicated feature extractor, and each convolution layer is set with multiple convolution kernels to make the extracted features more comprehensive. Random sampling is used when the RF classifier is generated, which effectively prevents overfitting and improves the generalization ability of the system. Therefore, the proposed CNN-RF model can meet the practical demands of speech emotion recognition.

4 CONTRAST EXPERIMENT
The experimental environment was a Core i7 3.4 GHz CPU, 16 GB memory and a Windows system. The graphics card was an NVIDIA GeForce GTX 1070, and the underlying framework was Keras on Theano. GPU acceleration was set up through the Anaconda software. The experimental database was the CASIA Chinese emotion database of the Institute of Automation, Chinese Academy of Sciences [13]. This database was recorded by 4 professional speakers and covers six emotions (angry, fear, happy, neutral, sad and surprise) with 9600 samples in total. Each emotion accounts for 1/6 of the samples. The normalized spectrograms of two neutral-emotion utterances chosen randomly from the database are shown in Figure 3.

Fig 3. Normalized spectrogram corresponding to speech signals

The CASIA database was divided into a training set and a test set at a ratio of 5:1. Classification performance is evaluated by accuracy (ACC). It can be seen from Fig. 2 that two fully connected layers (indicated by dotted lines) follow the Flatten layer of the CNN model. The first fully connected layer can be regarded as a convolution layer. The second fully connected layer is connected after F1 and D, and uses the softmax activation function to realize multi-class classification. The output size of softmax is set to 6, corresponding to the 6 emotions. The classification results of the CNN model and the CNN-RF model are shown in Table 1. The recognition accuracy of CNN-RF is 3.25 percentage points higher than that of the CNN model.

Table 1. Classification Results of CNN and CNN-RF

Network Model    ACC
CNN              0.8143
CNN-RF           0.8468

5 APPLICATIONS OF CNN-RF MODEL IN NAO ROBOT
NAO is a humanoid robot developed by Aldebaran Robotics. It is especially suitable as a research platform for service robots, and users can program NAO visually through the Choregraphe software. The Record Sound box provided by Choregraphe calls the microphones of the NAO robot. It supports the following two types of audio recording:
• Four tracks, 48000 Hz, .wav.
• Single track (front microphone), 16000 Hz, .ogg.
In existing studies of man-machine interaction, speech signal processing in Python and Matlab mainly uses single-track audio in .wav format. Reference [14] proposed methods for extracting a single track from multiple tracks and converting the audio format. However, these extra operations complicate man-machine interaction. Therefore, this study addresses the problem at the source and obtains the desired single-track audio by modifying the command box. This not only simplifies audio processing but also improves the real-time performance of man-machine interaction.

5.1 The Improved Record Sound Box

Fig 4. Choregraphe interface

In this study, the Record Sound box was improved: the “ALAudioRecorder” application programming interface was used
in place of the “ALAudioDevice” interface to generate a new Recorder box. As the Choregraphe interface in Figure 4
shows, the improved box adds single-track (front microphone) .wav and four-track .ogg formats for recording.
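As a rough illustration of recording through “ALAudioRecorder”, the sketch below requests single-track .wav audio from the front microphone. It assumes the NAOqi Python SDK, whose ALAudioRecorder module exposes startMicrophonesRecording(filename, type, samplerate, channels) and stopMicrophonesRecording(). The channel order (left, right, front, rear), the 16000 Hz rate, the file path and the helper names are assumptions made for illustration and are not taken from the authors' Recorder box. A stand-in recorder object is used so that the sketch runs without a robot.

```python
import time

# Assumed channel order: (left, right, front, rear); 1 enables a microphone.
FRONT_MIC_ONLY = (0, 0, 1, 0)


def record_front_mic_wav(recorder, remote_path, seconds,
                         sample_rate=16000, sleep=time.sleep):
    """Record `seconds` of single-track .wav audio from the front microphone.

    `recorder` is any object that offers startMicrophonesRecording() and
    stopMicrophonesRecording(), such as an ALAudioRecorder proxy.
    """
    recorder.startMicrophonesRecording(remote_path, "wav", sample_rate,
                                       FRONT_MIC_ONLY)
    try:
        sleep(seconds)  # keep recording for the requested duration
    finally:
        recorder.stopMicrophonesRecording()  # always close the file on the robot
    return remote_path


class FakeRecorder(object):
    """Hypothetical stand-in for the robot proxy; it only logs the calls."""

    def __init__(self):
        self.calls = []

    def startMicrophonesRecording(self, filename, file_type, sample_rate, channels):
        self.calls.append(("start", filename, file_type, sample_rate, tuple(channels)))

    def stopMicrophonesRecording(self):
        self.calls.append(("stop",))


fake = FakeRecorder()
# The path is a hypothetical location in the robot's storage.
saved = record_front_mic_wav(fake, "/home/nao/Recorder.wav", seconds=0,
                             sleep=lambda s: None)
print(saved)
print(fake.calls)
```

On a real robot the stand-in would be replaced by ALProxy("ALAudioRecorder", robot_ip, 9559) from the naoqi module, where robot_ip is the robot's network address.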
In this study, the same text, “Hello, Nao!”, was recorded with the original Record Sound box and with the improved Recorder box. The two files in different formats generated by the original Record Sound box were named RS1.wav and RS2.ogg, while the file generated by the improved Recorder box was named Recorder.wav. The three recordings in the internal memory of NAO are shown in Figure 5. All three audio files were downloaded and opened in Audacity, where the name, number of tracks and sampling frequency of each file were examined (Figure 6). The first row is Recorder.wav, and the second to the fifth rows are RS1.wav, indicating that RS1 is four-track mixed audio; from top to bottom, these rows show the audio collected by the left, right, front and back microphones of the robot. The sixth row is RS2.ogg, a single-track recording, but .ogg is a lossy compressed audio format and can affect sound quality to some extent. In addition, the .wav format is preferred in order to keep the audio collected by NAO consistent with the CASIA database and to eliminate the influence of audio format on application results. Hence, Recorder.wav meets the above requirements.

Fig 5. Memory space of NAO

Fig 6. Time-domain waveforms of Recorder, RS1 and RS2

It can be seen from Figure 5 and Figure 6 that the audio collected by the Recorder box needs no further processing and is suitable for NAO-based speech signal analysis, Chinese word segmentation, speech recognition and speech emotion recognition. It simplifies the man-machine interaction process and thus improves time efficiency.

5.2 Speech Emotion Test on Nao
Applying the CNN-RF speech emotion recognition model to the Nao robot requires the three steps shown in Figure 7. The detailed process is as follows:
(1) Acquire emotional speech samples in real time with NAO. Connect the NAO robot to the local computer, use the improved Recorder box to store the emotional speech samples in .wav format, and download them through the SFTP functions of Python;
(2) Draw the spectrogram;
(3) Classify the test samples with the trained and stored CNN-RF network model, and use the “ALTextToSpeech” application interface of the NAO robot to “speak out” the recognition results.

[Figure: Acquire Emotional Speech Samples by NAO -> Draw the Spectrogram -> Speech Emotion Recognition -> “Speak Out” the Recognition Results]
Fig 7. Speech emotion recognition of NAO robot

Test results on the NAO robot are shown in Table 2. During the test, different sentences were used to express emotions. According to the test, “surprise” and “fear” were the most easily confused. The recognition rates of “fear” and “happy” were lower than the average level. The highest recognition rate was achieved for “neutral”, reaching 90.24%.
Table 2. Test results in NAO robots

            angry  neutral  surprise  happy  fear  sad  ACC
angry          19        3         0      1     0    0  82.6%
neutral         0       37         4      0     0    0  90.24%
surprise        1        0        20      0     6    0  74.07%
happy           1        2         3     12     1    1  60%
fear            0        0         3      0    17    0  85%
sad             1        1         1      0     2    8  61.53%
Average                                                 75.57%
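The per-emotion accuracies in Table 2 can be recomputed from the counts. The sketch below assumes that each row holds the counts for one spoken emotion, that each column is the recognized emotion, and that ACC is the diagonal count divided by the row total; the variable and function names are illustrative only.

```python
LABELS = ["angry", "neutral", "surprise", "happy", "fear", "sad"]

# Counts copied from Table 2 (row: assumed spoken emotion, column: recognized emotion).
CONFUSION = [
    [19, 3, 0, 1, 0, 0],
    [0, 37, 4, 0, 0, 0],
    [1, 0, 20, 0, 6, 0],
    [1, 2, 3, 12, 1, 1],
    [0, 0, 3, 0, 17, 0],
    [1, 1, 1, 0, 2, 8],
]


def per_class_accuracy(matrix, labels):
    """Return {label: correct / row total} for a square confusion matrix."""
    return {
        label: matrix[i][i] / float(sum(matrix[i]))
        for i, label in enumerate(labels)
    }


accuracies = per_class_accuracy(CONFUSION, LABELS)

# Unweighted mean of the six per-class accuracies (the "Average" row).
macro_accuracy = sum(accuracies.values()) / len(accuracies)

# Share of all test utterances that were recognized correctly.
correct = sum(CONFUSION[i][i] for i in range(len(LABELS)))
total = sum(sum(row) for row in CONFUSION)
overall_accuracy = correct / float(total)

for label in LABELS:
    print("%-9s %.2f%%" % (label, 100 * accuracies[label]))
print("macro average   %.2f%%" % (100 * macro_accuracy))
print("overall         %.2f%%" % (100 * overall_accuracy))
```

Averaging the unrounded per-class values gives about 75.58%, which agrees with the 75.57% in Table 2 up to rounding of the per-class figures. The share of all 144 test utterances recognized correctly is 113/144, about 78.5%.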
6 CONCLUSION
In this study, a CNN model is used as the feature extractor and combined with an RF classifier. On this basis, a Chinese speech emotion recognition system is designed and deployed on the NAO robot. In the application process, the original Record Sound box is improved and a new Recorder box is generated. Speech signals collected by the improved Recorder box not only meet the format requirements of speech emotion recognition, but also meet the needs of studying NAO-based speech signal analysis, Chinese word segmentation and speech recognition. After a systematic test, the proposed CNN-RF model gives the NAO robot basic speech emotion recognition capability.

REFERENCES
[1] Juan Pablo Arias, Carlos Busso, Nestor Becerra Yom. Shape-based modeling of the fundamental frequency contour for emotion detection in speech[J]. Computer Speech & Language, 2014, 28(1):278-294.
[2] Yongming Huang, Guobao Zhang, Yue Li. Improved Emotion Recognition with Novel Task-Oriented Wavelet Packet Features[J]. Intelligent Computing Theory, 2014, 8588:706-714.
[3] H.M.Teager Sc.D, S.M.Teager. Allen. Evidence for nonlinear production mechanisms in the vocal tract[J]. Speech Production and Speech Modelling, 1990, 55:241-261.
[4] Chong Feng, Chunhui Zhao. Voice activity detection based on ensemble empirical mode decomposition and teager kurtosis[A]. International Conference on Signal Processing[C]. USA: IEEE, 2014:455-460.
[5] Lin Yilin, Wei Gang. Speech Emotion Recognition Based on HMM and SVM // Proc of the 4th International Conference on Machine Learning and Cybernetics. Guangzhou, China, 2005, VIII:4898-4901.
[6] Gu Y, Postma E, Lin H X, et al. Speech Emotion Recognition Using Voiced Segment Selection Algorithm[J]. 2016.
[7] Deng J, Xu X, Zhang Z, et al. Fisher kernels on phase-based features for speech emotion recognition[M]//Dialogues with Social Robots. Springer Singapore, 2017: 195-203.
[8] Patel P, Chaudhari A, Kale R, et al. EMOTION ... International Journal of Research In Science & Engineering, 2017, 3.
[9] Lim W, Jang D, Lee T. Speech emotion recognition using convolutional and Recurrent Neural Networks[C]//Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016 Asia-Pacific. IEEE, 2016: 1-4.
[10] Trigeorgis G, Ringeval F, Brueckner R, et al. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network[C]//Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016: 5200-5204.
[11] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[12] Liaw A, Wiener M. Classification and regression by randomForest[J]. R news, 2002, 2(3): 18-22.
[13] Institute of Automation Chinese Academy of Sciences. The selected Speech Emotion Database of Institute of Automation Chinese Academy of Sciences (CASIA) [DB/OL]. 2012/5/17.
[14] Cheng Ming. An approach of speech interaction and software design for humanoid robots, thesis, Changsha, Hunan University, 2016.

The 30th Chinese Control and Decision Conference (2018 CCDC)