
Information Sciences 509 (2020) 150–163


Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction

Luefeng Chen a,b, Wanjuan Su a,b, Yu Feng a,b, Min Wu a,b,∗, Jinhua She c, Kaoru Hirota d,e

a School of Automation, China University of Geosciences, Wuhan 430074, China
b Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China
c School of Engineering, Tokyo University of Technology, Tokyo 192-0982, Japan
d Tokyo Institute of Technology, Yokohama 226-8502, Japan
e School of Automation, Beijing Institute of Technology, Beijing 100081, China

Article info

Article history: Received 20 March 2019; Revised 31 August 2019; Accepted 5 September 2019; Available online 6 September 2019

MSC: 00-01; 99-00

Keywords: Speech emotion recognition; Fuzzy C-means; Multiple random forest; Human-robot interaction

Abstract

The two-layer fuzzy multiple random forest (TLFMRF) is proposed for speech emotion recognition. Speech emotion recognition usually faces two problems: feature extraction relies on personalized features, and recognition does not consider the differences among different categories of people. In the proposal, personalized and non-personalized features are fused for speech emotion recognition. High-dimensional emotional features are divided into different subclasses by the fuzzy C-means clustering algorithm, and multiple random forests are used to recognize different emotional states; the result is the TLFMRF. Moreover, emotions that are particularly difficult to recognize are classified separately. The results show that the TLFMRF identifies emotions in a stable manner. To demonstrate the effectiveness of the proposal, experiments on the CASIA corpus and the Berlin EmoDB are conducted. Experimental results show that the recognition accuracies of the proposal are 1.39%–7.64% and 4.06%–4.30% higher than those of a back propagation neural network and a random forest, respectively. Meanwhile, preliminary application experiments are also conducted on an emotional social robot system, and the application results indicate that a mobile robot can track six basic emotions (angry, fear, happy, neutral, sad, and surprise) in real time.

© 2019 Published by Elsevier Inc.

1. Introduction

Robotics has developed widely and rapidly in recent years, and people hope that human-robot interaction (HRI) will become more humanized and natural, with robots even understanding human intention [8,9]. Moreover, robots are also expected to have the ability to express and react to emotions [10]. Affective computing is therefore attracting increasing attention. Emotions are an important bond in human-robot interaction, and they can be perceived from speech signals [14], facial expressions [12], physiological signals [30] such as electrocardiograph (ECG), blood pressure and electroencephalogram (EEG), and body posture, among others.


∗ Corresponding author at: School of Automation, China University of Geosciences, Wuhan 430074, China.
E-mail addresses: chenluefeng@cug.edu.cn (L. Chen), wumin@cug.edu.cn (M. Wu).

https://doi.org/10.1016/j.ins.2019.09.005
0020-0255/© 2019 Published by Elsevier Inc.

Emotion recognition has received considerable attention in recent years [1,25,33]. Human beings communicate with each other mostly through voice, which not only expresses meaning clearly and conveniently but also conveys emotional information. Speech signals, as a main channel of affective computing, have been widely applied in HRI [42,49,50]. Speech emotion recognition (SER), defined as extracting the emotional state of a speaker from his or her speech, is used to obtain useful semantic information from speech and is a key component of HRI and machine intelligence. With the in-depth study of SER, its importance to computer development and social life has become increasingly prominent, and it has been applied in many fields, such as interactive movies, emotional translation, psychological testing, video games, and assisted psychotherapy.
An SER system consists of two major stages: the choice of suitable features for speech representation and the design of an appropriate classifier [3]. Feature extraction focuses on emotion-relevant features of the voice, and feature classification consists of a training phase and a test phase for identifying the type of emotion. Generally speaking, we reveal many kinds of features when we speak. One kind is linguistic features, which convey the speaker's purpose in a specific way. The other is hyperlinguistic features, which are expressed through small changes in the voice; they may also convey information about the speaker's accent and social stratum. According to existing research, hyperlinguistic features can be roughly divided into prosodic features, voice quality features, and spectral features. In particular, the characteristics of the speaker and the language can improve SER performance. Many studies have considered acoustic and/or prosodic features, such as pitch, intensity, voice quality features, spectrum, and cepstrum [43]. However, such studies focused only on personalized speech features [29]. In general, emotional speech data expressed by different speakers show large variations in acoustic characteristics, even when the speakers intend to express the same emotion. Moreover, several pairs of representative emotions tend to have similar acoustic characteristics; for example, voices of sadness and boredom have similar characteristics, indicating a large overlap among acoustic features [38]. Thus, speaker-independent features, namely non-personalized speech emotional features that do not rely on the speaker, are adopted for SER [31]. As a result, personalized and non-personalized speech emotional features are combined for SER in this paper. This is the first motivation of the proposal.
At the stage of speech feature recognition, many researchers have analyzed the characteristics and distribution of speech emotion features, and hierarchical classifiers have been used to improve classification performance. Based on the hierarchical classification of prosodic information and semantic labels, Wu et al. [47] obtained the final result by weighted integration of semantic labels and tested it in a natural sound environment. Yuncu et al. [48] estimated how easily the emotion categories can be distinguished by selecting the feature set according to the degree of confusion between different emotions; the corresponding binary decision trees were constructed, and the emotional state was recognized layer by layer. Considering the limitations of the feature sets, Sheikhan et al. [44] presented a fuzzy neural-support vector machine (neural-SVM) recognition method. Ensemble learning can partly overcome the curse of dimensionality and make better use of the information provided by the features. Morrison et al. [35] integrated the results of multiple classifiers: in the test phase, five base classifiers are used to obtain classification results, and a multi-response linear regression classifier then determines whether the results of the base classifiers are correct. The identification results based on different feature sets are divided into three layers to make fuzzy decisions step by step, and the final recognition results are then obtained. At present, the combination of standard methods, such as fusion, ensemble, and hierarchical classifiers, has become a key point in SER.
In this paper, the random forest (RF) is adopted to recognize emotional speech, covering angry, fear, happy, neutral, sad, and surprise. Personalized features and non-personalized features are fused for SER. Identification information of a person (i.e., gender, province, and age) has an important influence on emotional intention understanding [8]. It follows that identification information also influences SER, and this is the other motivation of the proposal. According to the identification information, such as gender and age, the feature data can be divided into different subclasses by fuzzy C-means (FCM) clustering, in which the Euclidean distance and membership functions are used to cluster the speech data based on its characteristics. Then, the two-layer fuzzy multiple random forest (TLFMRF) is proposed, where decision trees and the bootstrap method are employed in the RF to recognize the speech features. Finally, the resulting confusion matrix indicates that the model has been established. Moreover, a separate classification of certain emotions that are relatively difficult to recognize is carried out in each of the multiple classifiers.
The contributions of this work are as follows. 1) A TLFMRF model is proposed for speech emotion recognition, which fuses personalized features and non-personalized features and takes the identification information of a person (i.e., gender, province, and age) into account. The results show that the TLFMRF identifies emotions in a stable manner. 2) High-dimensional emotional features are divided into different subclasses by the fuzzy C-means clustering algorithm, and multiple random forests are used to recognize different emotional states. A separate classification of certain emotions that are relatively difficult to recognize is carried out in each of the multiple classifiers. 3) Preliminary application experiments applying the proposal to an emotional social robot system indicate that a mobile robot can track six basic emotions in real time, including angry, fear, happy, neutral, sad, and surprise.
The remainder of this article is structured as follows. Section 2 reviews related work. Section 3 introduces the TLFMRF for SER. The experimental simulations and analysis, including a preliminary application, are presented in Section 4. Section 5 concludes the paper.

2. Related work

Research on SER mainly involves the extraction of speech emotional features and the selection of emotional classifiers. In terms of feature extraction, various features have been investigated and applied to SER over the past decades [26,41], and optimization algorithms have also been applied to feature selection, such as particle swarm optimization (PSO) [17], ant colony optimization (ACO) [18], and others [19]. Among these studies, global statistics over low-level descriptors (LLDs), e.g., fundamental frequency (F0), durations, intensities, and Mel frequency cepstral coefficients (MFCCs) [3], have achieved dominant performance. Wu et al. [47] used acoustic-prosodic information and semantic labels for SER. Eyben et al. [22] used the openSMILE [40] toolkit to extract short-term acoustic features, such as pitch, energy, F0, duration, and MFCCs. In this paper, the features F0, root-mean-square signal frame energy (RMS energy), zero-crossing rate (ZCR), harmonic noise ratio, and MFCCs are applied to SER, and personalized and non-personalized speech emotional features are combined.
At the stage of classifier selection, representative classifiers have been used for SER, including the hidden Markov model (HMM) [13], the Gaussian mixture model (GMM) [34], the support vector machine (SVM) [11,16], and artificial neural networks (ANNs) [15,23]. Albornoz et al. [2] developed a novel ensemble classifier consisting of multiple standard classifiers, in which an SVM is used to deal with multiple languages, and tested it on never-seen languages. Sarker et al. [45] adopted neural networks, decision trees, SVM, and k-nearest neighbor (KNN) to classify test data. To cope with high-dimensional feature data, Breiman [5] proposed the random forest (RF), an algorithm based on classification trees that can handle a large number of independent variables, up to several thousand. In addition, the RF has been used as a machine learning algorithm both for individual feature sets and for decision-level fusion [24,46]. Recently, RFs have been used for natural language recognition [27,28]. Kondo et al. [28] reported that the RF performed better than ANN, support vector regression (SVR), and logistic regression (LR). At present, the combination of standard methods, such as fusion, ensemble, or hierarchical classifiers, has become a key point in SER. Based on the above factors, personalized and non-personalized speech emotional features are proposed for SER in this paper, and a suitable classifier for the identification of emotional states is designed.

3. Two-layer fuzzy multiple random forest for speech emotion recognition

The framework of the TLFMRF for SER is shown in Fig. 1. As shown in Fig. 1, the proposal first extracts the speech emotional features using openSMILE [22]. Then, using FCM, the training set is clustered into multiple subclasses. Finally, the RF is used to identify the emotion of the selected speech features, and the output is the label of the emotional state.

3.1. Feature extraction

For speech emotional feature extraction, we adopt derivative-based non-personalized speech emotional features to supplement the traditional personalized speech emotional characteristics, and thereby obtain universal and generalizable emotional characteristics. The speech emotional feature sets are computed using the openSMILE toolkit (version 2.3) [22]. As shown in Table 1, 16 basic features and their first derivatives are extracted as fundamental features. The 16 basic features are F0, ZCR, RMS energy, harmonic noise ratio, and MFCC 1–12. The derivative features are less affected by speaker differences and are regarded as non-personalized features, and 12 statistic values of these fundamental features are calculated. In this way, the personalized features and non-personalized features are obtained.
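To make the derivative-based, non-personalized features concrete, the sketch below (an illustration under our own assumptions, not the authors' released code) computes first-order delta coefficients of a frame-level feature matrix and the 12 segment-level statistics named in Table 1 with NumPy/SciPy; the exact statistic definitions used by openSMILE may differ slightly.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def delta(llds):
    """First-order delta of low-level descriptors.
    llds: array of shape (n_frames, n_features); a simple backward difference
    approximates the openSMILE delta regression."""
    d = np.zeros_like(llds)
    d[1:] = llds[1:] - llds[:-1]
    return d

def segment_statistics(contour):
    """12 statistics of one feature contour (one utterance, one LLD),
    mirroring the statistic names in Table 1."""
    n = len(contour)
    t = np.arange(n)
    slope, intercept = np.polyfit(t, contour, 1)       # linregc1, linregc2
    residual = contour - (slope * t + intercept)
    return np.array([
        contour.max(), contour.min(), contour.mean(), contour.std(),
        contour.max() - contour.min(),                  # range
        contour.argmax() / n, contour.argmin() / n,     # maxpos, minpos (relative)
        slope, intercept,
        np.mean(residual ** 2),                         # linregerrQ (quadratic error)
        kurtosis(contour), skew(contour),
    ])

def utterance_vector(llds):
    """16 personalized LLDs + 16 delta LLDs, each summarized by 12 statistics,
    giving a 32 * 12 = 384-dimensional vector as in the paper."""
    both = np.hstack([llds, delta(llds)])               # (n_frames, 32)
    return np.concatenate([segment_statistics(both[:, j]) for j in range(both.shape[1])])
```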
Among the speech emotional features, the ZCR records how often the speech waveform crosses the zero level. The ZCR of the speech signal x(m) is defined as

Z = \frac{1}{2} \sum_{m=0}^{N-1} \left| \mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)] \right|    (1)

Table 1
Emotional speech features.

Index                       16 Basic Features                          12 Statistic Values
Personalized Features       F0, ZCR, RMS energy,                       max, min, average, std, range,
                            harmonic noise ratio, MFCC 1–12            maxpos, minpos, linregc1, linregc2,
Non-personalized Features   1st-order delta coefficients of            linregerrQ, kurtosis, skewness
                            the 16 basic features

Fig. 1. The framework of two-layer fuzzy multiple random forest for speech emotion recognition.

where sgn[·] is the sign function:

\mathrm{sgn}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}    (2)
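A direct NumPy transcription of Eqs. (1) and (2) might look as follows (a minimal sketch; x is assumed to be one analysis frame of the speech signal).

```python
import numpy as np

def zero_crossing_rate(x):
    """Eq. (1): half the number of sign changes between consecutive samples,
    with sgn as in Eq. (2) (+1 for x >= 0, -1 otherwise)."""
    s = np.where(np.asarray(x) >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(s[1:] - s[:-1]))
```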

The MFCCs can be calculated by the following steps:

Step 1: Pre-process the speech signal by framing and applying a Hamming window. After such pre-processing, the fast Fourier transform (FFT) is applied to obtain the spectrum.
Step 2: Square the result of Step 1 and pass it through a bank of triangular filters whose center frequencies are evenly arranged on the Mel frequency scale, with a spacing of 150 Mel and a bandwidth of 300 Mel. Suppose the number of filters is M and the output after filtering is X(k), k = 1, 2, . . . , M.
Step 3: Take the logarithm of the output of the bandpass filters in Step 2, and then transform the resulting log power spectrum by the following formula to obtain K MFCCs (K = 12–16), where K is the order of the MFCC parameters. Using the symmetry of the transform, it can be simplified as follows:


C_n = \sum_{k=1}^{K} \log Y(k)\, \cos\!\left[\pi (k - 0.5)\, n / N\right], \qquad n = 1, 2, \ldots, N    (3)

where N represents the number of filters and C_n is the nth cepstral coefficient.
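As an illustration of Step 3, the following sketch applies the transform of Eq. (3) to the Mel filterbank outputs of one frame; it assumes the conventional reading in which the sum runs over all N filterbank channels, and the filterbank outputs themselves (the variable Y) are assumed to have been computed in Steps 1–2.

```python
import numpy as np

def mfcc_from_filterbank(Y, n_coeff=12):
    """Eq. (3)-style transform: C_n = sum_k log(Y_k) * cos(pi * (k - 0.5) * n / N),
    where Y holds the N Mel filterbank outputs of one frame and the first
    n_coeff cepstral coefficients C_1 ... C_n_coeff are returned."""
    N = len(Y)
    log_Y = np.log(Y)
    k = np.arange(1, N + 1)
    return np.array([np.sum(log_Y * np.cos(np.pi * (k - 0.5) * n / N))
                     for n in range(1, n_coeff + 1)])
```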


The personalized features work well for the SER of a specific person, but for an unfamiliar speaker who is not in the database, the emotion recognition rate is not very high. Derivative-based non-personalized features can alleviate this problem. Therefore, both personalized features and non-personalized features are used for SER in this paper.

3.2. FCM based features classification

The FCM algorithm is used for data clustering. Based on the bootstrap method, the training sample data set D is obtained, i.e., D = [y1 y2 . . . yN]^T with yo = [yo1 yo2 . . . yo384], o = 1, 2, . . . , N, where N is the number of samples. FCM is an iterative clustering algorithm that partitions the N normalized samples into L clusters by minimizing the following objective function,



\min J_m(U, V) = \sum_{k=1}^{L} \sum_{o=1}^{N} (\mu_{ko})^m D_{ko}^2, \qquad D_{ko}^2 = \| y_o - c_k \|^2    (4)

\text{s.t.} \quad \sum_{k=1}^{L} \mu_{ko} = 1, \quad 0 < \mu_{ko} < 1, \qquad k = 1, \ldots, L, \; o = 1, \ldots, N
where μko is the membership value of the oth sample in the kth cluster, U is the fuzzy partition matrix consisting of the μko, V = (c1, c2, . . . , cL) is the cluster center matrix, L is the number of clusters, and m is the fuzzification exponent, which has an important regulatory effect on the degree of fuzziness of the clusters and is usually set to m = 2 [4,39]. Dko is the Euclidean distance between the oth sample yo and the kth cluster center ck.
To minimize Jm , the following update equations are used [49]
\mu_{ko} = \frac{1}{\sum_{f=1}^{L} \left( D_{ko} / D_{fo} \right)^{2/(m-1)}}    (5)

c_k = \frac{\sum_{o=1}^{N} (\mu_{ko})^m y_o}{\sum_{o=1}^{N} (\mu_{ko})^m}    (6)

where Dfo is the Euclidean distance between yo and cf. After appropriate initial cluster centers are selected, the algorithm starts iterating, and the iteration stops when the termination criterion is satisfied, i.e., Dfo ≤ ε, where ε is a given sensitivity threshold. Finally, the training set S is clustered into L subclasses, denoted S = {S1, S2, . . . , SL}, by using FCM to classify the features.
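A compact NumPy sketch of the update rules (5) and (6) is given below; it is an illustration under the paper's setting m = 2 rather than the authors' implementation, and the random initialization and the membership-change stopping test are our own assumptions.

```python
import numpy as np

def fcm(Y, L, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy C-means on the rows of Y (N x d): returns the membership matrix
    U (L x N) and the cluster centers C (L x d), minimizing Eq. (4)."""
    rng = np.random.default_rng(seed)
    N = Y.shape[0]
    U = rng.random((L, N))
    U /= U.sum(axis=0)                        # memberships of each sample sum to 1
    for _ in range(max_iter):
        Um = U ** m
        C = (Um @ Y) / Um.sum(axis=1, keepdims=True)                # Eq. (6)
        D = np.linalg.norm(Y[None, :, :] - C[:, None, :], axis=2) + 1e-12
        U_new = 1.0 / np.sum((D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1.0)),
                             axis=1)                                # Eq. (5)
        if np.max(np.abs(U_new - U)) < eps:                         # termination test
            return U_new, C
        U = U_new
    return U, C

# Each training sample is then assigned to the subclass with the largest membership:
# subclass = U.argmax(axis=0)
```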

3.3. Multiple random forest algorithm for speech emotion recognition

The RF (random forest) is an ensemble classifier consisting of a group of decision trees h(x, θi), i = 1, 2, . . . , k, where the θi are independent and identically distributed random vectors and k is the number of decision trees. For a given speech feature vector x, each decision tree classifier votes to determine the optimal classification result. The steps to generate the RF are as follows:
Step 1: From the original training data, using the bootstrap method, draw k subset samples at random with replacement, where k is the number of decision trees. Each subset is the training set for growing one tree. The samples not selected form the out-of-bag (OOB) data, which are used to test the trained model.
Step 2: Assume there are N features, and choose n features at random (n ≪ N). The nodes are split using the optimal split on these n variables, i.e., the one with the best classification capability. The value of n does not change during the growth of the forest. Repeat the process until the whole tree is complete.
Step 3: Each tree grows to its maximum depth without pruning.
The RF consists of the generated trees and is used to classify the speech test data. The output of the RF is determined from the results of all decision trees: the category with the most votes is taken as the final output. The decision-making process is as follows,

H(x) = \arg\max_{Y} \sum_{i=1}^{M} I\left( h_i(x) = Y \right)    (7)
where H(x) is the output of the ensemble classifier, I(·) is the indicator function, hi(x) is a single decision tree model, and Y is the target label, which here is the type of emotion. The bootstrap method extracts training subsets at random, and splits are evaluated with the aid of Gini coefficients: the smaller the Gini coefficient, the better the selected feature. Suppose there are K classes of samples in set D; then the corresponding Gini coefficient is defined as

Gini(D) = \sum_{i=1}^{K} p_i (1 - p_i) = 1 - \sum_{i=1}^{K} p_i^2    (8)
where pi is the probability of class i. In the two-class case, assuming the probability of the first class is p, the Gini coefficient becomes

Gini(D) = 2p(1 - p)    (9)
If a node is split into k branches, the corresponding Gini index of the split is

Gini_{branch}(D) = \frac{|D_1|}{|D|}\, Gini(D_1) + \cdots + \frac{|D_k|}{|D|}\, Gini(D_k)    (10)
where Dk is the kth subset of set D. The Gini index reflects the impurity of a node: the smaller the Gini index, the lower the impurity. If all data in a node belong to one class, the Gini index of that node is 0. The importance of each feature can be ranked according to the Gini index.
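In practice, the tree growing of Steps 1–3 and the Gini-based importance ranking are available off the shelf; the scikit-learn sketch below is one possible realization (the random data only stands in for the 384-dimensional statistic vectors and the six emotion labels).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 384-dimensional feature vectors and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 384))
y = rng.integers(1, 7, size=600)            # 1-angry ... 6-surprise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,       # k decision trees, each grown on a bootstrap sample (Step 1)
    max_features="sqrt",    # n << N features tried at each split (Step 2)
    criterion="gini",       # node impurity of Eqs. (8)-(10)
    random_state=0,
)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_te)                              # majority vote of Eq. (7)
ranking = np.argsort(rf.feature_importances_)[::-1]  # Gini-based feature ranking
```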

Fig. 2. The structure of multiple random forest for speech emotion recognition.

3.4. Two-layer fuzzy multiple random forest

Because certain emotions are difficult to recognize to some extent, a separate classification of those emotions is added to the multiple random forest algorithm. Given K emotional states, two categories of emotions that are relatively difficult to recognize are taken out at a time. It follows that M random forests are needed, given by
 
M_{RF} = 2 \left\lceil \frac{K}{2} \right\rceil - 1, \qquad K = 1, 2, \ldots, n    (11)
where M is the number of random forests and K is the number of emotional states. In this paper, the emotional corpus includes K = 6 basic emotions, so Eq. (11) gives M_{RF} = 2⌈6/2⌉ − 1 = 5 random forests. The proposed method reduces the mutual interference between emotions, so that the recognition rates of the individual emotions are greatly improved. The TLFMRF algorithm includes four steps.
Step 1: Extract the feature data from the pre-processed speech signal by using openSMILE.
Step 2: Using FCM, cluster the training set into L subclasses, taking into account the influence of the identification information on emotions.
Step 3: Train the RF classifiers. A total of 5 classifiers are trained, where the number of classifiers is determined by (11). The structure of the multiple random forest is decided based on experience and the similarity between emotions. According to the experiments on the benchmark databases, the accuracies of sad and fear are relatively low in most cases, and there is always some confusion between these two emotions; therefore, in the training phase, sad and fear are trained separately from the other four emotions. The structure is as follows: Classifier 1 distinguishes {sad, fear} from the others; Classifier 2 distinguishes sad from fear; Classifier 3 separates the remaining four emotions into two pairs; Classifier 4 distinguishes happy from neutral; and Classifier 5 distinguishes angry from surprise. The structure of the multiple random forest for SER is shown in Fig. 2.
Step 4: Use the trained TLFMRF for the classification of the six basic emotions, and integrate the results of the L subclasses into the final classification result. A sketch of the cascade used within one subclass is given below.
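The five-classifier structure of Fig. 2 can be written as a simple cascade. The sketch below is one possible reading of that structure (class names, label encoding, and routing are our assumptions, not the authors' code), with Classifier 1 first separating the hard pair {sad, fear} from the other emotions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

HARD = {"sad", "fear"}          # the two easily confused emotions

class MultipleRandomForest:
    """Second layer of the TLFMRF for one FCM subclass: five RFs arranged as in Fig. 2."""

    def __init__(self, n_estimators=500, random_state=0):
        make = lambda: RandomForestClassifier(n_estimators=n_estimators,
                                              random_state=random_state)
        self.c1, self.c2, self.c3, self.c4, self.c5 = (make() for _ in range(5))

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        hard = np.isin(y, list(HARD))
        self.c1.fit(X, hard)                          # {sad, fear} vs. others
        self.c2.fit(X[hard], y[hard])                 # sad vs. fear
        pair34 = np.isin(y, ["happy", "neutral"])
        self.c3.fit(X[~hard], pair34[~hard])          # {happy, neutral} vs. {angry, surprise}
        self.c4.fit(X[pair34], y[pair34])             # happy vs. neutral
        pair5 = np.isin(y, ["angry", "surprise"])
        self.c5.fit(X[pair5], y[pair5])               # angry vs. surprise
        return self

    def predict_one(self, x):
        x = np.asarray(x).reshape(1, -1)
        if self.c1.predict(x)[0]:                     # routed to the hard pair
            return self.c2.predict(x)[0]
        if self.c3.predict(x)[0]:                     # happy / neutral branch
            return self.c4.predict(x)[0]
        return self.c5.predict(x)[0]                  # angry / surprise branch
```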

4. Experiments on two-layer fuzzy multiple random forest

4.1. Data setting

We use the CASIA corpus [7] and the Berlin EmoDB [6] for the simulation experiments. In the CASIA corpus, emotional speech of four people (2 male, 2 female) is recorded. They speak the same 300 basic emotional short utterances in six basic emotions, i.e., angry, fear, happy, neutral, sad, and surprise, for a total of 7200 speech segments. The Berlin EmoDB contains about 500 utterances spoken by 10 different actors in happy, angry, fear, boredom, disgust, and sad styles as well as in a neutral version.
Speech emotion feature sets are extracted using the openSMILE toolkit [22]. Basic features such as RMS energy, ZCR, harmonic noise ratio, and MFCC are obtained, as shown in Table 2. In order to realize SER that does not depend on the speaker and the environment, the speech emotional features are divided into personalized features and non-personalized features. Among the personalized features, the MFCCs comprise 12 spectral energy dynamic coefficients on equally spaced Mel frequency bands, and 12 statistics are calculated from these basic features. Unlike the personalized features, the non-personalized speech emotion features reduce the influence of individual speakers by introducing rates of change.

Table 2
Speech emotional features extracted by openSMILE.

Index                  Maximum value   Minimum value   Mean        Maximum value   Slope
RMS energy             7.51E-05        1.61E-01        2.25E+02    4.30E+01        1.98E-02
MFCC                   7.38E+00        −3.62E+01       4.36E+01    3.73E+02        −1.15E+01
ZCR                    4.82E+00        −2.58E+01       2.23E+01    3.71E+02        2.26E+00
Harmonic noise ratio   4.30E+01        −2.24E+01       2.07E+00    2.43E+02        9.43E-01

4.2. Environment setting

After the preprocessing of the speech emotional feature sets, 3600 × 384-dimensional feature vectors are obtained from one man and one woman, where each vector corresponds to a label (1-angry, 2-fear, 3-happy, 4-neutral, 5-sad, 6-surprise). In fact, age and gender affect SER because of the differences in the way men and women express their emotions. Thus, according to gender (male and female), the training set is clustered into L = 2 subclasses using FCM. In each subclass, 80% of the feature data are used to train an RF model for SER, and the remaining 20% are used to test the model. In the process of building each RF, the bootstrap method is used to sample randomly; 500 sample subsets are formed, in which the same sample may be selected repeatedly. A decision tree is then grown on each sample subset, forming an RF model, and the RF is used for the classification of the six basic emotions. Because certain emotions have low recognition rates, the multiple random forest (MRF) algorithm is adopted to identify those emotions that are difficult to distinguish. A sketch of this training procedure is given below.
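Under the settings of this subsection (L = 2 subclasses, an 80/20 split inside each subclass, 500 bootstrap-trained trees per forest), one possible training loop is sketched below; the random arrays only stand in for the extracted feature vectors and labels, and fcm() refers to the FCM sketch in Section 3.2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3600, 384))        # placeholder for the 3600 x 384 feature vectors
y = rng.integers(1, 7, size=3600)       # placeholder labels: 1-angry ... 6-surprise

U, _ = fcm(X, L=2)                      # fcm(): FCM sketch from Section 3.2
subclass = U.argmax(axis=0)             # hard assignment to one of the two subclasses

models = []
for s in range(2):
    Xs, ys = X[subclass == s], y[subclass == s]
    X_tr, X_te, y_tr, y_te = train_test_split(
        Xs, ys, test_size=0.2, stratify=ys, random_state=0)      # 80% train / 20% test
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
    models.append((rf, rf.score(X_te, y_te)))                    # per-subclass model + accuracy
```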

4.3. Simulations and analysis

To verify the effectiveness of SER, three classification models are compared: a back propagation neural network (BPNN), the RF, and the TLFMRF. The BPNN is commonly used as a baseline algorithm to verify the effectiveness of a proposal [37]. To verify the validity of the model and account for the amount of simulation data, five-fold cross validation is used, so the experiments are carried out with different data each time. In the end, the results of the 5 folds are reported, where the fold index is k; the protocol is sketched below.
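For reference, the five-fold protocol can be reproduced with a standard stratified split; the sketch below uses the placeholder data and forest settings from the previous snippet and is not the authors' exact evaluation script.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for k, (tr, te) in enumerate(skf.split(X, y), start=1):
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X[tr], y[tr])
    print(f"k={k}: accuracy = {rf.score(X[te], y[te]):.4f}")
```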
The comparison of the three SER methods on the CASIA corpus and the Berlin EmoDB is shown in Tables 3 and 4. In the first case, the BPNN is adopted; it is a 3-layer neural network, the number of hidden layer nodes is set to 100, the activation function is the sigmoid function, and the number of output nodes is 6 on the CASIA corpus and 7 on the Berlin EmoDB. The confusion matrices of the SER results obtained by the BPNN are shown in Figs. 3 and 6; these results are obtained by cross validation. The average recognition rates are 81.75% and 77.94% on the CASIA corpus and the Berlin EmoDB, respectively.

Table 3
Comparison of speech emotion recognition for the CASIA corpus.

Index               BPNN          RF            TLFMRF
k=1 (%)             77.64         71.94         82.22
k=2 (%)             76.11         73.75         81.39
k=3 (%)             81.25         77.36         85.56
k=4 (%)             85.97         86.39         85.83
k=5 (%)             87.78         85.87         80.69
Average ± std (%)   81.75 ± 5.08  79.08 ± 6.7   83.14 ± 2.40
Kappa Coefficient   0.781         0.755         0.799
Sensitivity         0.825         0.829         0.872
Specificity         0.964         0.959         0.967

Table 4
Comparison of speech emotion recognition for the Berlin EmoDB.

Index               BPNN          RF            TLFMRF
k=1 (%)             81.31         84.11         85.98
k=2 (%)             78.50         82.24         84.11
k=3 (%)             75.70         80.37         85.05
k=4 (%)             76.64         79.44         87.85
k=5 (%)             77.57         80.37         85.04
Average ± std (%)   77.94 ± 1.92  81.31 ± 1.67  85.61 ± 1.26
Kappa Coefficient   0.69          0.73          0.77
Sensitivity         0.641         0.62          0.652
Specificity         0.872         0.91          0.972

Fig. 3. Confusion matrix of recognition results by using BPNN (CASIA corpus).

Fig. 4. Confusion matrix of recognition results by using random forest (CASIA corpus).

In the second case, the RF algorithm is adopted. The confusion matrices of the SER results obtained by the RF are shown in Figs. 4 and 7. These results are again obtained by cross validation; the average recognition rates are 79.08% and 81.31% on the CASIA corpus and the Berlin EmoDB, respectively. It can be seen that the average recognition rate of the RF is lower than that of the BPNN on the CASIA corpus. Thus, the TLFMRF with L = 2 is applied. The results show that the average SER accuracy of the TLFMRF is 83.14% on the CASIA corpus and 85.61% on the Berlin EmoDB, which is 4.06%–4.30% higher than that of the RF. Meanwhile, compared with the BPNN, the proposed TLFMRF is 1.39%–7.64% higher; the confusion matrices of the SER results obtained by the TLFMRF are shown in Figs. 5 and 8, again obtained by cross validation. According to the comparison of the three SER methods, the proposed method is clearly better, and the model is relatively more stable, which is of great importance in HRI.
Moreover, according to Figs. 3–5, the average accuracies of the six basic emotions on the CASIA corpus by using BPNN are 86.83% for angry, 67.49% for fear, 79.83% for happy, 93.17% for neutral, 75.50% for sad, and 89.33% for surprise; by using RF they are 87.34% for angry, 71.33% for fear, 67.19% for happy, 89.00% for neutral, 74.50% for sad, and 89.50% for surprise; and by using TLFMRF they are 86.84% for angry, 71.33% for fear, 78.00% for happy, 93.77% for neutral, 78.65% for sad, and 80.17% for surprise. According to Figs. 6–8, the average accuracies of the seven emotions on the Berlin EmoDB by using BPNN are 76.60% for angry,

Fig. 5. Confusion matrix of recognition results by using two-layer fuzzy multiple random forest (CASIA corpus).

Fig. 6. Confusion matrix of recognition results by using BPNN (Berlin EmoDB).

75.40% for boredom, 62.40% for disgust, 79.40% for fear, 87.60% for neutral, 78.40% for happy, and 80.60% for sad; by using RF they are 69.20% for angry, 68.40% for boredom, 60.00% for disgust, 79.60% for fear, 89.00% for neutral, 86.00% for happy, and 98.60% for sad; and by using TLFMRF they are 74.00% for angry, 80.60% for boredom, 63.20% for disgust, 84.00% for fear, 96.20% for neutral, 85.80% for happy, and 98.60% for sad. Therefore, the accuracies of the individual emotions obtained by the proposed method are, on the whole, higher than those of BPNN and RF.
Although the BPNN and RF achieve slightly better recognition of some emotions than the TLFMRF on the CASIA corpus, which is caused by the uneven sample distribution, the average recognition rate of the TLFMRF model is higher than that of the other models, with a lower variance. The accuracies of fear, happy, and sad are relatively low for all three methods. This is because there are similarities in the speech features of the high-activation emotions (i.e., anger, happy, and fear), which may lead to misidentification. In addition, the current speech features are not ideal for distinguishing emotions

Fig. 7. Confusion matrix of recognition results by using random forest (Berlin EmoDB).

Fig. 8. Confusion matrix of recognition results by using two-layer fuzzy multiple random forest (Berlin EmoDB).

of valence, because valence is not prominent in the audio frequency domain. As a result, there is always some confusion between sadness and other emotions. There is also relatively high confusion between fear and sad for all three methods, because fear and sad are similar when expressed through speech signals.
To analyze the experimental results of the TLFMRF model in more depth, we compute the Kappa coefficient, sensitivity, and specificity of the TLFMRF from its confusion matrix; the details are shown in Tables 3 and 4. The mean values of the Kappa coefficient, sensitivity, and specificity show a high degree of consistency. The computation of these indices is sketched below.
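All three indices can be derived from a confusion matrix; the sketch below shows one way to compute them (cm is an assumed confusion matrix with rows as true classes, and sensitivity and specificity are macro-averaged over the emotion classes).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score   # kappa = cohen_kappa_score(y_true, y_pred)

def sensitivity_specificity(cm):
    """Macro-averaged sensitivity (recall) and specificity from a confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = cm.sum() - tp - fn - fp
    return float(np.mean(tp / (tp + fn))), float(np.mean(tn / (tn + fp)))
```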

Fig. 9. The structure of emotional social robot system.

Fig. 10. The preliminary application in emotional social robot system.

According to the SER results of the three methods, the average recognition rate of the TLFMRF is higher than that of the BPNN and the RF. As shown in Table 3, the standard deviations of BPNN, RF, and TLFMRF on the CASIA corpus are 5.08, 6.7, and 2.40, respectively; according to Table 4, they are 1.92, 1.67, and 1.26 on the Berlin EmoDB. Therefore, the proposed method is relatively more stable. Moreover, the TLFMRF has clear advantages in dealing with high-dimensional data in which identification information is embedded; as a result, age and gender are taken into account for data classification. In addition, for the results in Table 3, the computation times of TLFMRF, RF, and BPNN are 0.0707 s, 0.0196 s, and 0.0024 s, respectively. Although the proposed algorithm has the longest computation time, it is still at the sub-second level.

4.4. Preliminary application experiment

4.4.1. Experimental environment setting


The structure of our emotional social robot system under development is shown in Fig. 9. It is mainly composed of a mobile robot, an affective computing workstation, a personal computer (PC), a router, and data transmission equipment. The system first acquires the speech signal through a Kinect fixed to the mobile robot and then transmits the data to the affective computing workstation. The workstation feeds the data into the trained SER system to identify the emotion, and the result is fed back to the mobile robot so that it can understand human emotions and react accordingly. The PC is used to debug the program remotely when people are not at the workstation, which makes the whole system more flexible.

4.4.2. Results and analysis of preliminary application experiment


The preliminary application experiment is carried out with the affective computing workstation, the system debugging interface, and the Kinect, as shown in Fig. 10. The microphone array in the Kinect is used to capture the multichannel audio that constitutes the original audio stream. The source data are then imported into the audio processing software, and the ComParE feature set is extracted using the openSMILE toolkit.
Moreover, the HRI setup for SER is shown in Fig. 10. Five volunteers (three men and two women), who are postgraduate students in our laboratory, are invited to the experiment. The volunteers speak the same 100 basic emotional short utterances using the six basic

Table 5
Comparison of speech emotion recognition for the application experiment.

Index               BPNN          RF            TLFMRF
k=1 (%)             73.45         83.24         71.33
k=2 (%)             82.26         87.34         86.18
k=3 (%)             67.17         76.57         88.49
k=4 (%)             77.35         89.50         84.00
k=5 (%)             61.41         65.92         73.65
Average ± std (%)   72.33 ± 0.67  80.51 ± 0.91  80.73 ± 0.59
Kappa Coefficient   0.60          0.68          0.73
Sensitivity         0.538         0.601         0.639
Specificity         0.812         0.869         0.936

emotions, i.e., angry, fear, happy, neutral, sad, and surprise. In the end, a total of 1200 speech segments are formed. These speech signals are captured by the mobile robot through the Kinect, the data are transmitted to the affective computing workstation, and the proposed method is used to obtain the emotion recognition results. The results obtained by BPNN, RF, and TLFMRF are shown in Table 5. The average accuracies are 70.84 ± 0.55% for BPNN, 79.33 ± 0.49% for RF, and 79.08 ± 0.35% for TLFMRF. It is obvious that the SER result obtained by BPNN is lower than those of RF and TLFMRF. Although the proposed algorithm has slightly lower accuracy than the RF, it has stronger stability, as shown by the variances of the two algorithms. The Kappa coefficient, sensitivity, and specificity are also calculated to show the effectiveness of the TLFMRF model and are given in Table 5; their mean values show a high degree of consistency with the accuracies, which illustrates the validity of the proposal. Moreover, the computation times of TLFMRF, RF, and BPNN are 0.0579 s, 0.0128 s, and 0.0013 s, respectively. Although the proposed algorithm has the longest computation time, it is still at the sub-second level, which keeps the accuracy of real-time tracking within an acceptable range. Therefore, it can be seen that our model can adapt to other databases. From another perspective, the proposed method provides a deeper view of real-time SER in HRI and makes HRI more meaningful.

5. Conclusion

A TLFMRF has been developed to recognize emotional states from speech signals. It mainly addresses two problems: the choice of features and the design of a classification method. For speech emotional feature extraction, we adopt derivative-based non-personalized speech emotional features to supplement the traditional personalized emotional characteristics and thereby obtain universal and generalizable emotional characteristics. For speech emotional feature classification, the TLFMRF is adopted to deal with high-dimensional correlated features and improve the recognition result. In the TLFMRF, FCM is first adopted to divide the feature data into different subclasses according to the identification information, using Euclidean distances and membership functions. Next, the multiple RF, built with decision trees and the bootstrap method, is employed to recognize these feature data.
The proposed TLFMRF considers the impact of the features sufficiently. Its novelty lies not only in extracting the non-personalized features by taking the derivatives of the personalized features, but also in dividing the high-dimensional feature data into different subsets, in such a way that the computational dimension is reduced and the characteristics within each subset are similar, which ensures learning efficiency. More detailed descriptions are given as follows:

(1) To avoid the problem that feature extraction relies only on personalized features, personalized features and non-personalized features are extracted and fused.
(2) To address the fact that emotion recognition usually does not take different categories of people into account, multiple random forests are adopted to recognize different emotional states.
(3) The two-layer fuzzy multiple random forest is proposed to improve the recognition rate. Since the high-dimensional correlated features are divided into different subclasses by fuzzy C-means, and separate classifications of emotions are carried out in each of the random forests, indistinguishable emotions are identified and the recognition rates are improved.

To verify the validity of the TLFMRF, experiments on the CASIA corpus and the Berlin EmoDB were carried out, and the experimental results show that the recognition accuracies of the proposal are higher than those of the baseline algorithms. Meanwhile, preliminary application experiments were also carried out on the emotional social robot system, and the application results indicate that the mobile robot can recognize the basic emotions using the TLFMRF. Moreover, the Kappa coefficient, specificity, and sensitivity are used to evaluate the proposed method; they are effective and frequently used measures of interrater reliability [32,36]. According to the values of these indexes in the simulation experiments and the application experiments, the proposal has application prospects.
For an emotional social robot system, it would be more accurate to use multimodal information to recognize emotion, such as facial expression, speech, and body gesture. However, in some situations, acquiring visual information is very difficult, because it requires frontal face and body information in most cases; in such situations, speech is the better choice and can really help recognize human emotion.
For further research, intelligent optimization algorithms such as the genetic algorithm (GA) can be employed in the TLFMRF to further improve recognition performance [20,21]. Moreover, the TLFMRF is being applied to human-robot interaction in our emotional social robot system under development, in which robots will be able to sense human emotion so that people can communicate with robots more smoothly.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61973286, 61603356, and
61733016, the 111 project under Grant B17040, the Wuhan Science and Technology Project under Grant 2017010201010133,
and the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan) (No. 2018039).

References

[1] E.M. Albornoz, D.H. Milone, H.L. Rufiner, Feature extraction based on bio-inspired model for robust emotion recognition, Soft Comput. 21 (17) (2017)
5145–5158.
[2] M.E. Albornoz, D. Milone, Emotion recognition in never-seen languages using a novel ensemble method with emotion profiles, IEEE Trans. Affect.
Comput. 8 (99) (2016) 1–11.
[3] M.E. Ayadi, M.S. Kamel, F. Karray, Survey on speech emotion recognition: features classification schemes, and databases, Pattern Recognit. 44 (3) (2011)
572–587.
[4] J.C. Bezdek, A physical interpretation of fuzzy ISODATA, IEEE Trans. Syst. Man Cybern. 6 (5) (1976) 387–389.
[5] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[6] Berlin database of emotional speech, 2005. [Online]. Available: http://emodb.bilderbar.info/index-1280.html (accessed October 10).
[7] CASIA Chinese emotion corpus, 2008. [Online]. Available: http://www.chineseldc.org/resourceinfo.php?rid=76 (accessed June 11).
[8] L.F. Chen, M. Wu, M.T. Zhou, Z.T. Liu, J.H. Hua, K. Hirota, Dynamic emotion understanding in human-robot interaction based on two-layer fuzzy SVR-TS
model, IEEE Trans. Syst. Man Cybern (2017), doi:10.1109/TSMC.2017.2756447.
[9] L.F. Chen, Z.T. Liu, M. Wu, M. Ding, F.Y. Dong, K. Hirota, Emotion-age-gender-nationality based intention understanding in human-robot interaction
using two-layer fuzzy support vector regression, Int. J. Soc. Robot. 7 (5) (2015) 709–729.
[10] L.F. Chen, M. Wu, M.T. Zhou, J.H. She, F.Y. Dong, K. Hirota, Information-driven multi-robot behavior adaptation to emotional intention in human-robot
interaction, IEEE Trans. Cognit. Dev.Syst. 10 (3) (2018) 647–658.
[11] L.F. Chen, M.T. Zhou, W.J. Su, M. Wu, J.H. She, K. Hirota, Softmax regression based deep sparse autoencoder network for facial emotion recognition in
human-robot interaction, Inf. Sci. 428 (2018) 49–61.
[12] L.F. Chen, M.T. Zhou, M. Wu, J.H. She, Z.T. Liu, F.Y. Dong, K. Hirota, Three-layer weighted fuzzy support vector regression for emotional intention
understanding in human-robot interaction, IEEE Trans. Fuzzy Syst. 26 (5) (2018) 2524–2538.
[13] M. Deriche, A.H.A. Absa, A two-stage hierarchical bilingual emotion recognition system using a hidden Markov model and neural networks, Arabian J.
Sci. Eng. 42 (12) (2017) 5231–5249.
[14] L. Devillers, M. Tahon, M.A. Sehili, Inference of human beings’ emotional states from speech in human-robot interactions, Int. J. Soc. Robot. 7 (4) (2015)
451–463.
[15] J. Deng, Z. Zhang, E. Marchi, Sparse autoencoder-based feature transfer learning for speech emotion recognition, in: Proceedings of Humaine Associa-
tion Conference on Affective Computing and Intelligent Interaction, Geneva, Switzerland, 2013, pp. 511–516.
[16] A.D. Dileep, C.C. Sekhar, GMM-based intermediate matching kernel for classification of varying length patterns of long duration speech using support
vector machines, IEEE Trans. Neural Netw. Learn.Syst. 25 (8) (2014) 1421–1432.
[17] W. Deng, R. Yao, H. Zhao, A novel intelligent diagnosis method using optimal LS-SVM with improved PSO algorithm, Soft Comput. (2–4) (2017) 1–18.
[18] W. Deng, H.M. Zhao, L. Zou, A novel collaborative optimization algorithm in solving complex optimization problems, Soft Comput. 21 (15) (2017)
4387–4398.
[19] W. Deng, S. Zhang, H. Zhao, A novel fault diagnosis method based on integrating empirical wavelet transform and fuzzy entropy for motor bearing,
IEEE Access 6 (1) (2018) 35042–35056.
[20] W. Deng, R. Chen, B. He, A novel two-stage hybrid swarm intelligence optimization algorithm and application, Soft Comput. 16 (10) (2012) 1707–1722.
[21] W. Deng, H. Zhao, X. Yang, Study on an improved adaptive PSO algorithm for solving multi-objective gate assignment, Appl. Soft Comput. 59 (2017)
288–302.
[22] F. Eyben, M. Wöllmer, A. Graves, Online emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues, J. Multi-
modal Interfaces 3 (1–2) (2010) 7–19.
[23] H.M. Fayek, M. Lech, L. Cavedon, Evaluating deep learning architectures for speech emotion recognition, Neural Netw. 92 (2017) 62–68.
[24] R. Genuer, J.M. Poggi, C. Tuleau-Malot, Variable selection using random forests, Pattern Recogn. Lett. 31 (14) (2010) 2225–2236.
[25] V.P. Gonçalves, G.T. Giancristofaro, G.P.R. Filho, Assessing users’ emotion at interaction time: a multimodal approach with multiple sensors, Soft Com-
put. 21 (18) (2017) 5309–5323.
[26] K. Hakhyun, E. Hokim, Y. Keunkwak, Emotional feature extraction method based on the concentration of phoneme influence for human-robot interac-
tion, Adv. Rob. 24 (1–2) (2010) 47–67.
[27] T. Iliou, C.N. Anagnostopoulos, Comparison of different classifiers for emotion recognition, in: Proceedings of Panhellenic Conference on Informatics,
Corfu, Greece, 2009, pp. 102–106.
[28] K. Kondo, K. Taira, Estimation of binaural speech intelligibility using machine learning, Appl. Acoust. 129 (2018) 408–416.
[29] J.B. Kim, J.S. Park, Multistage data selection-based unsupervised speaker adaptation for personalized speech emotion recognition, Eng. Appl. Artif.
Intell. 52 (C) (2016) 126–134.
[30] J. Kim, E. André, Emotion recognition based on physiological changes in music listening, IEEE Trans. Pattern Anal. Mach.Intell. 30 (12) (2018)
2067–2083.
[31] E.H. Kim, K.H. Hyun, S.H. Kim, Improved emotion recognition with a novel speaker-independent feature, IEEE/ASME Trans. Mechatron. 14 (3) (2009)
317–325.

[32] C.Q. Laura, D. Andrew, G. Ekin, The matchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics
in-the-wild during free-standing conversations and speed dates, IEEE Trans. Affect. Comput. (2018), doi:10.1109/TAFFC.2018.2848914.
[33] F.Y. Leu, J.C. Liu, Y.T. Hsu, The simulation of an emotional robot implemented with fuzzy logic, Soft Comput. 18 (9) (2014) 1729–1743.
[34] A. Mohamed, G.E. Dahl, G. Hinton, Acoustic modeling using deep belief networks, IEEE Trans. Audio Speech Lang.Process. 20 (1) (2012) 14–22.
[35] D. Morrison, R. Wang, L.C.D. Silva, Ensemble methods for spoken emotion recognition in call-centres, Speech Commun. 49 (2) (2007) 98–112.
[36] E.W. McGinnis, S.P. Anderau, J. Hruschak, Giving voice to vulnerable children: machine learning analysis of speech detects anxiety and depression in
early childhood, IEEE J. Biomed. Health Inform. (2019), doi:10.1109/JBHI.2019.2913590.
[37] O.K. Oyedotun, A. Khashman, Prototype-incorporated emotional neural network, IEEE Trans. Neural Netw. Learn. Syst. 29 (8) (2018) 3560–3572.
[38] J.S. Park, J.H. Kim, Y.H. Oh, Feature vector classification based speech emotion recognition for service robots, IEEE Trans. Consum. Electron. 55 (3)
(2009) 1590–1596.
[39] N.R. Pal, J.C. Bezdek, On cluster validity for the fuzzy c-means model, IEEE Trans. Fuzzy Syst. 3 (3) (1995) 370–379.
[40] F. Raposo, R. Ribeiro, D.M.D. Matos, Using generic summarization to improve music information retrieval tasks, IEEE/ACM Trans. Audio Speech
Lang.Process. 24 (6) (2015) 1119–1128.
[41] P. Song, S.F. Ou, Z.B. Du, Learning corpus-invariant discriminant feature representations for speech emotion recognition, IEICE Trans. Inf. Syst. E100-D
(5) (2017) 1136–1139.
[42] B.W. Schuller, A.M. Batliner, Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing, John Wiley & Sons Inc,
2014.
[43] B. Schuller, S. Steidl, A. Batliner, The INTERSPEECH emotion challenge, Proce. INTERSPEECH (2009) 312–315.
[44] M. Sheikhan, M. Bejani, D. Gharavian, Modular neural-SVM scheme for speech emotion recognition using ANOVA feature selection method, Neural
Comput. Appl. 23 (1) (2013) 215–227.
[45] Y. Sun, G. Wen, Ensemble softmax regression model for speech emotion recognition, Multimed. Tools Appl. 76 (6) (2016) 8305–8328.
[46] E. Vaiciukynas, A. Verikas, A. Gelzinis, Detecting Parkinson’s disease from sustained phonation and speech signals, PLoS ONE 12 (10) (2017) 1–16.
[47] C.H. Wu, W.B. Liang, Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels,
IEEE Trans. Affect. Comput. 2 (1) (2010) 10–21.
[48] E. Yuncu, H. Hacihabiboglu, C. Bozsahin, Automatic speech emotion recognition using auditory models with binary decision tree and SVM, in: Pro-
ceedings of International Conference on Pattern Recognition, 2014, pp. 773–778.
[49] M.T. Zhou, L.F. Chen, J.P. Xu, X.H. Cheng, M. Wu, W.H. Cao, J.H. She, K. Hirota, FCM-based multiple random forest for speech emotion recognition, in:
Proceedings of the 5th International Workshop on Advanced Computational Intelligence and Intelligent Informatics, 2017.
[50] S. Zhang, X. Zhao, B. Lei, Speech emotion recognition using an enhanced kernel isomap for human-robot interaction, Int. J. Adv. Rob. Syst. 10 (2) (2013)
1–7.
