A Breakthrough in Speech Emotion Recognition Using Deep Retinal Convolution Neural Networks
accuracy of 69.2%. S. Zhang et al. [32] proposed multimodal DCNNs, which fuse audio and visual cues in a deep model; this is an early work fusing audio and visual cues in DCNNs. George Trigeorgis et al. [33] combined CNNs with LSTM networks, which can automatically learn the best representation of the speech signal from the raw time representation.

Though previous studies have achieved some results, recognition accuracy remains relatively low and is far from practical application. To address the problems of small training data and low accuracy, this paper proposes DRCNNs, which consist of two parts:
1) The Data Augmentation Algorithm Based on Retinal Imaging Principle (DAARIP): using the principle of retina and convex-lens imaging, we obtain more training data by changing the size of the spectrogram.
2) Deep Convolution Neural Networks (DCNNs) [34], which extract high-level features from the spectrogram and make precise predictions. This novel method achieves an average accuracy over 99% on the IEMOCAP, EMO-DB and SAVEE databases.

AlexNet [35] and GoogleNet have similar requirements for input images (256×256); like human eyes, they can therefore make accurate judgments about the same thing at different sizes.
However, training a deep neural network needs a large amount of data, while the data provided by current common speech emotion databases is very limited, so the deep neural network cannot be fully trained. Referring to the imaging principle of the retina and the convex lens, we propose the DRCNNs method consisting of two parts. The working process is shown in Fig. 2.
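As an illustration of DAARIP's first part, the thin-lens equation gives a magnification |m| = f/(u − f) for each virtual "object distance" u, which determines how much each copy of the spectrogram is enlarged or shrunk. The sketch below is our own illustration (function name, focal length and sample distances are assumptions, not values from the paper):

```python
def daarip_scales(f, object_distances):
    """Thin-lens magnification |m| = f / (u - f) for each object distance u > f.
    F < u < 2F -> m > 1 (enlarged copy)
    u == 2F    -> m == 1 (original size)
    u > 2F     -> m < 1 (reduced copy)"""
    return [f / (u - f) for u in object_distances]

# x = 2 points between F and 2F, one point at 2F, y = 2 points beyond 2F,
# giving x + y + 1 = 5 rescaled spectrogram copies from one utterance.
scales = daarip_scales(1.0, [1.25, 1.5, 2.0, 3.0, 4.0])
sizes = [round(256 * m) for m in scales]  # rescale a 256x256 spectrogram
```

Each resulting size would then be used to resample the original spectrogram, multiplying the training set roughly (x+y+1)-fold per utterance.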
This work was supported by the Natural Science Foundation of China (No. 61309013) and Chongqing Basic and Frontier
Fig. 3. Using a convex lens to simulate our eyes, we take x points at location L1 (F < L1 < 2F), one point at location L2 = 2F, and y points at location L3 (L3 > 2F). In this way, we get x+y+1 spectrograms.

Fig. 4. The DCNNs architecture for SER using the spectrogram as input, which has 5 convolution layers (C1~C5), 3 pooling layers (P1, P2, P5) and 3 fully connected layers (F6, F7, F8).

TABLE 4. CONFUSION MATRIX OF THE FIRST EXPERIMENT ON THE ORIGINAL TESTING DATA
      ang  exc  fea  fru  hap  neu  sad  sur
ang    64   20    0   35    0   29   16    2
exc    33   47    0   16    0   41   19    0
fea     1    1    0    1    0    2    1    0
fru    25   25    0   82    0   84   62    0
hap     3   11    0   10    2   32   31    0
neu     5    9    0   25    0  135   82    0
sad     2    0    0    0    1   20  139    0
sur     2    0    0    2    0    3    9    0
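The C1~C5 / P1, P2, P5 / F6~F8 layout in Fig. 4 matches the AlexNet family, so the per-layer feature-map sizes can be traced with the standard convolution output formula. The kernel, stride and padding values below are assumptions borrowed from AlexNet [35], which the paper cites; they are not stated in the text:

```python
# Trace spatial sizes through an AlexNet-style DCNN (assumed hyperparameters).
def conv_out(size, kernel, stride=1, pad=0):
    """Output spatial size of a convolution/pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

size = 227                       # assumed input resolution, as in AlexNet
size = conv_out(size, 11, 4)     # C1 -> 55
size = conv_out(size, 3, 2)      # P1 -> 27
size = conv_out(size, 5, 1, 2)   # C2 -> 27
size = conv_out(size, 3, 2)      # P2 -> 13
size = conv_out(size, 3, 1, 1)   # C3 -> 13
size = conv_out(size, 3, 1, 1)   # C4 -> 13
size = conv_out(size, 3, 1, 1)   # C5 -> 13
size = conv_out(size, 3, 2)      # P5 -> 6
flat = size * size * 256         # 6*6*256 = 9216 inputs feeding F6
```

Under these assumptions F6 receives a 9216-dimensional vector, and F8 would end in one output unit per emotion class.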
The main parameters of the second experiment are shown in Table 6, and the confusion matrix on the augmented data is shown in Table 7.

Emotions      Augmented data   Accuracy
anger         44213            99.55%
happiness     29450            100%
sadness       43360            99.15%
neutral       70028            99.06%
frustration   73960            99.17%
excitement    42681            99.41%
surprise      4815             96.26%
fear          1560             100%
Total         310067           99.25%

TABLE 6. MAIN PARAMETERS OF THE SECOND EXPERIMENT
Parameter name        Parameter value
base_learning_rate    0.001
learning_rate_policy  fixed
momentum              0.9
weight_decay          1e-05
solver_type           SGD

Our results are better in both the number of emotions and accuracy; the details are shown in Table 8.

Method    Year  Database  Emotions   Accuracy
Ref [31]  2015  IEMOCAP   4 classes  69.2%
Ref [30]  2015  IEMOCAP   4 classes  63.89%
Ref [28]  2015  IEMOCAP   5 classes  40.02%
Our Work        IEMOCAP   8 classes  99.25%

In order to test the robustness of the proposed DRCNNs method, we run experiments on the EMO-DB [37] and SAVEE databases.
On the EMO-DB database, we augment the original data according to the DAARIP algorithm and randomly select 70% of the data as training data, 15% as validation data, and the remaining 15% as testing data. We then train the DCNNs model on the augmented data; the accuracy reaches 99.9% on the validation data after 10 epochs. After training, the accuracy of the 7 emotion classes on the testing data is 99.79%. The training process is shown in Fig. 7, the accuracy of each emotion is shown in Table 9, and the confusion matrix is shown in Table 10.
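The solver settings in Table 6 describe plain SGD with momentum and L2 weight decay at a fixed learning rate. A one-parameter sketch of that update rule (our own illustration of the standard algorithm, not the authors' code):

```python
# SGD with momentum 0.9 and weight decay 1e-05 at a fixed lr of 0.001,
# applied to a single scalar weight for clarity.
def sgd_momentum_step(w, v, grad, lr=0.001, momentum=0.9, weight_decay=1e-05):
    g = grad + weight_decay * w   # L2 weight decay folds into the gradient
    v = momentum * v - lr * g     # velocity accumulates past gradients
    return w + v, v

w1, v1 = sgd_momentum_step(1.0, 0.0, grad=2.0)  # single step

w, v = 1.0, 0.0
for _ in range(3):                # three steps against a constant gradient
    w, v = sgd_momentum_step(w, v, grad=2.0)
```

With a constant gradient the velocity keeps growing in magnitude, which is the usual acceleration effect of momentum on consistent descent directions.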
TABLE 7. CONFUSION MATRIX OF THE SECOND EXPERIMENT ON THE AUGMENTED TESTING DATA
      ang   exc  fea    fru   hap    neu   sad  sur
ang  6602     3    0      6     0      7     0   14
exc     5  6364    0      2     0     23     6    2
fea     0     0  234      0     0      0     0    0
fru    30    17    0  11002     0     20    25    0
hap     0     0    0      0  4462     0      0    0
neu     5    10    0      7     9  10405    65    3
sad     0     2    0      0     2     44  6449    7
sur     0     0    0      1     0     17     9  695

TABLE 9. EXPERIMENT ON THE AUGMENTED DATA OF EMO-DB
Emotions   Augmented data  Accuracy
fear       1320            100%
disgust    1380            100%
happiness  1360            99.51%
boredom    1440            100%
neutral    1482            100%
sadness    1364            100%
anger      1386            99.04%
Total      9732            99.79%
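The per-emotion accuracies reported for the second experiment can be recomputed directly from the Table 7 confusion matrix: each emotion's accuracy is its diagonal entry divided by its row sum, and the overall accuracy is the trace over the grand total. A quick check in Python:

```python
# Rows of Table 7: predicted counts per true emotion, order ang..sur.
table7 = [
    [6602, 3, 0, 6, 0, 7, 0, 14],      # ang
    [5, 6364, 0, 2, 0, 23, 6, 2],      # exc
    [0, 0, 234, 0, 0, 0, 0, 0],        # fea
    [30, 17, 0, 11002, 0, 20, 25, 0],  # fru
    [0, 0, 0, 0, 4462, 0, 0, 0],       # hap
    [5, 10, 0, 7, 9, 10405, 65, 3],    # neu
    [0, 2, 0, 0, 2, 44, 6449, 7],      # sad
    [0, 0, 0, 1, 0, 17, 9, 695],       # sur
]
per_class = [row[i] / sum(row) for i, row in enumerate(table7)]
overall = sum(row[i] for i, row in enumerate(table7)) / sum(map(sum, table7))
# anger: 6602 / 6632 ~ 99.55%, matching the anger row reported above
```

The recomputed anger accuracy (99.55%) agrees with the per-emotion table, and the overall value lands near the reported 99.25% average.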
TABLE 10. CONFUSION MATRIX ON THE AUGMENTED TESTING DATA OF EMO-DB
     fea  dis  hap  bor  neu  sad  ang
fea  198    0    0    0    0    0    0
dis    0  207    0    0    0    0    0
hap    0    0  203    0    0    0    1
bor    0    0    0  216    0    0    0
neu    0    0    0    0  223    0    0
sad    0    0    0    0    0  204    0
ang    0    0    2    0    0    0  206

TABLE 12. CONFUSION MATRIX ON THE AUGMENTED TESTING DATA OF SAVEE
     ang  dis  fea  hap  neu  sad  sur
ang  955    0    0    0    0    0    0
dis    0  900    0    0    0    0    0
fea    0    0  900    0    0    0    0
hap    0    0    0  900    0    0    0
neu    0    2    0    0  898    0    0
sad    0    0    0    0   34  866    0
sur    0    0    0    0    0    0  900
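Both robustness experiments use the same random 70%/15%/15% partition of the augmented data. A minimal stdlib sketch of such a split (the helper name and seed are ours; the paper does not publish its splitting code):

```python
import random

def split_dataset(samples, seed=0):
    """Randomly partition samples into 70% train, 15% validation, 15% test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.70 * n)
    n_val = int(0.15 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
```

Fixing the seed makes the partition reproducible, which matters when comparing accuracies across training runs.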
On the SAVEE database, we augment the original data according to the DAARIP algorithm and randomly select 70% of the data as training data, 15% as validation data, and the remaining 15% as testing data. We then train the DCNNs model on the augmented data; the accuracy reaches 99.9% on the validation data after 5 epochs. After training, the accuracy of the 7 emotion classes on the testing data is 99.43%. The training process is shown in Fig. 8, the accuracy of each emotion is shown in Table 11, and the confusion matrix is shown in Table 12.

TABLE 11. EXPERIMENT ON THE AUGMENTED DATA OF SAVEE
Emotions  Augmented data  Accuracy
anger     6368            100%
disgust   6000            100%
fear      6000            100%
happy     6000            100%
neutral   6000            99.78%
sadness   6000            96.22%
surprise  6000            100%
Total     42368           99.43%

These experimental results show the good adaptability and stability of the proposed DRCNNs method for SER.

IV. CONCLUSION AND FUTURE WORK
SER is particularly useful for enhancing the naturalness of speech-based human-machine interaction, and SER systems have extensive applications in day-to-day life. For example, emotion analysis of telephone conversations between criminals would help criminal investigation departments to solve cases. Conversations with robotic pets and humanoid partners will become more realistic and enjoyable if they are capable of understanding and expressing emotions as humans do. Automatic emotion analysis may also be applied to automatic speech-to-speech translation systems, where speech in one language is translated into another language by the machine.
In this paper, we propose a novel method called DRCNNs, addressing the problems of small training data and low prediction accuracy in SER. The main idea of this method is two-fold. First, referring to the imaging principle of the retina and convex lens, the DAARIP algorithm is used to augment the original datasets, which are then input into the DCNNs. Second, the DCNNs learn high-level features from the spectrograms and classify speech emotion. Experimental results indicate that the average accuracy on three databases exceeds 99%. The proposed method dramatically improves the state of the art in speech emotion recognition and will ultimately support major progress in HCI and artificial intelligence. We plan to extend the proposed method and evaluate its performance on multilingual speech emotion databases in the near future.
Fig. 8. The training process of DRCNNs on the augmented data of SAVEE. The accuracy reaches 99.9% on the validation data after 5 epochs of training.

REFERENCES
[1]. Luo, Q. "Speech emotion recognition in E-learning system by using general regression neural network." Nature 153.3888(2014):542-543.
[2]. Koolagudi, Shashidhar G., and K. S. Rao. "Emotion recognition from speech: a review." International Journal of Speech Technology 15.2(2012):99-117.
[3]. Ayadi, Moataz El, M. S. Kamel, and F. Karray. "Survey on speech emotion recognition: Features, classification schemes, and databases." Pattern Recognition 44.3(2011):572-587.
[4]. Wang, Sheguo, et al. "Speech Emotion Recognition Based on Principal Component Analysis and Back Propagation Neural Network." Measuring Technology and Mechatronics Automation (ICMTMA), 2010 International Conference on, 2010:437-440.
[5]. Bhatti, M. W., Y. Wang, and L. Guan. "A neural network approach for human emotion recognition in speech." International Symposium on Circuits and Systems, IEEE, 2004: II-181-4 Vol.2.
[6]. Fragopanagos, N., and J. G. Taylor. "2005 Special Issue: Emotion recognition in human-computer interaction." Neural Networks 18.4(2005):389-405.
[7]. Nicholson, J., K. Takahashi, and R. Nakatsu. "Emotion Recognition in Speech Using Neural Networks." International Conference on Neural Information Processing (ICONIP), IEEE, 1999:495-501 vol.2.
[8]. Ververidis, Dimitrios, and C. Kotropoulos. "Fast and accurate sequential floating forward feature selection with the Bayes classifier applied to speech emotion recognition." Signal Processing 88.12(2008):2956-2970.
[9]. Mao, Xia, L. Chen, and L. Fu. "Multi-level Speech Emotion Recognition Based on HMM and ANN." Computer Science and Information Engineering, 2009 WRI World Congress on, 2009:225-229.
[10]. Nwe, Tin Lay, S. W. Foo, and L. C. D. Silva. "Speech emotion recognition using hidden Markov models." Speech Communication 41.4(2003):603-623.
[11]. Schuller, B., G. Rigoll, and M. Lang. "Hidden Markov model-based speech emotion recognition." International Conference on Multimedia and Expo (ICME '03), IEEE, 2003: I-401-4 vol.1.
[12]. Ntalampiras, Stavros, and N. Fakotakis. "Modeling the Temporal Evolution of Acoustic Parameters for Speech Emotion Recognition." IEEE Transactions on Affective Computing 3.99(2012):116-125.
[13]. Zhou, Jian, et al. "Speech Emotion Recognition Based on Rough Set and SVM." International Conference on Machine Learning and Cybernetics, 2005:53-61.
[14]. Hu, Hao, M. X. Xu, and W. Wu. "GMM Supervector Based SVM with Spectral Features for Speech Emotion Recognition." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007: IV-413-IV-416.
[15]. Neiberg, Daniel, K. Laskowski, and K. Elenius. "Emotion Recognition in Spontaneous Speech Using GMMs." INTERSPEECH 2006-ICSLP (2006):1-4.
[16]. Wu, Chung Hsien, and W. B. Liang. "Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels." IEEE Transactions on Affective Computing 2.1(2011):10-21.
[17]. Schuller, B., et al. "Speaker Independent Speech Emotion Recognition by Ensemble Classification." IEEE International Conference on Multimedia & Expo, 2005:864-867.
[18]. Bengio, Yoshua, A. Courville, and P. Vincent. "Representation Learning: A Review and New Perspectives." IEEE Transactions on Pattern Analysis & Machine Intelligence 35.8(2013):1798-1828.
[19]. LeCun, Y., Y. Bengio, and G. Hinton. "Deep learning." Nature 521.7553(2015):436-444.
[20]. Mnih, V., et al. "Human-level control through deep reinforcement learning." Nature 518.7540(2015):529-533.
[21]. Silver, D., et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587(2016):484.
[22]. Kim, Yelin, H. Lee, and E. M. Provost. "Deep learning for robust feature generation in audiovisual emotion recognition." IEEE International Conference on Acoustics, Speech and Signal Processing, 2013:3687-3691.
[23]. Zheng, Wei Long, et al. "EEG-based emotion classification using deep belief networks." IEEE International Conference on Multimedia & Expo, 2014:1-6.
[24]. Mao, Q., et al. "Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks." IEEE Transactions on Multimedia 16.8(2014):2203-2213.
[25]. Huang, Zhengwei, et al. "Speech Emotion Recognition Using CNN." ACM International Conference, 2014:801-804.
[26]. Han, Kun, D. Yu, and I. Tashev. "Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine." INTERSPEECH 2014.
[27]. Prasomphan, S. "Improvement of speech emotion recognition with neural network classifier by using speech spectrogram." International Conference on Systems, Signals and Image Processing, IEEE, 2015:73-76.
[28]. Zheng, W. Q., J. S. Yu, and Y. X. Zou. "An experimental study of speech emotion recognition based on deep convolutional neural networks." International Conference on Affective Computing and Intelligent Interaction, 2015:827-831.
[29]. Fayek, H. M., M. Lech, and L. Cavedon. "Towards real-time Speech Emotion Recognition using deep neural networks." International Conference on Signal Processing and Communication Systems, 2015:1-5.
[30]. Lee, Jinkyu, and I. Tashev. "High-level Feature Representation using Recurrent Neural Network for Speech Emotion Recognition." INTERSPEECH 2015.
[31]. Jin, Qin, et al. "Speech emotion recognition with acoustic and lexical features." IEEE International Conference on Acoustics, Speech and Signal Processing, 2015:4749-4753.
[32]. Zhang, Shiqing, et al. "Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition." ACM, 2016:281-284.
[33]. Trigeorgis, George, et al. "Adieu Features? End-to-end Speech Emotion Recognition using a Deep Convolutional Recurrent Network." ICASSP 2016.
[34]. Esteva, A., et al. "Dermatologist-level classification of skin cancer with deep neural networks." Nature 542.7639(2017):115.
[35]. Krizhevsky, Alex, I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25(2012).
[36]. Busso, Carlos, et al. "IEMOCAP: interactive emotional dyadic motion capture database." Language Resources and Evaluation 42.4(2008):335-359.
[37]. Berlin Emotion Database (German emotional speech)

Yafeng Niu was born in Handan, China, in 1990. He received the B.S. degree in software engineering from the University of Xinjiang. He is now a Master's candidate in the College of Computer Science, Chongqing University. His main research interests include machine learning, deep learning and affective computing.

Dongsheng Zou received the B.S., M.S. and Ph.D. degrees in Computer Science and Technology from Chongqing University, Chongqing, China, in 1999, 2002, and 2009, respectively. He was a Postdoctoral Fellow in the College of Automation at Chongqing University from October 2009 to December 2012. He is currently an Assistant Professor in computer science at Chongqing University, and a Member of the China Computer Federation. His current research interests include machine learning, data mining and pattern recognition.