
Int. J. Computational Science and Engineering, Vol. 19, No. 2, 2019

Recognising continuous emotions in dialogues based on DISfluencies and non-verbal vocalisation features for a safer network environment

Huan Zhao*, Xiaoxiao Zhou and Yufeng Xiao


School of Information Science and Engineering,
Hunan University (Changsha),
No. 2, Lushan South Road, Hunan, 410082, China
Email: zhouxiaoxiao@hnu.edu.cn
Email: hzhao@hnu.edu.cn
Email: hnxiaoyf@hnu.edu.cn
*Corresponding author

Abstract: With the development of networks and social media, audio and video have become increasingly popular ways to communicate. Such audio and video can spread information that creates negative effects, e.g., negative sentiment with suicidal tendencies or threatening messages that make people panic. To keep the network environment safe, it is necessary to recognise emotion in dialogues. To improve the recognition of continuous emotion in dialogues, we propose to combine DISfluency and non-verbal vocalisation (DIS-NV) features with a bidirectional long short-term memory (BLSTM) model to predict continuous emotion. DIS-NV features are effective emotion features, comprising filled pauses, fillers, stutters, laughter and breath. BLSTM can learn from past information and also make use of future information. State-of-the-art recognition attains 62% accuracy; our experimental method increases accuracy to 76%.

Keywords: continuous emotion; bidirectional long short-term memory; BLSTM; dialogue; knowledge-inspired features; safe network environment; DISfluencies and non-verbal vocalisation; DIS-NV; AVEC2012; discretisation; speech emotion recognition; low-level descriptors; LLD.

Reference to this paper should be made as follows: Zhao, H., Zhou, X. and Xiao, Y. (2019)
‘Recognising continuous emotions in dialogues based on DISfluencies and non-verbal
vocalisation features for a safer network environment’, Int. J. Computational Science and
Engineering, Vol. 19, No. 2, pp.169–176.
Biographical notes: Huan Zhao is a Professor at the School of Information Science and Engineering, Hunan University. She obtained her BSc and MS in Computer Application Technology at Hunan University in 1989 and 2004, respectively, and completed her PhD in Computer Science and Technology at the same school in 2010. Her current research interests include speech information processing, speech emotion recognition, embedded system design and embedded speech recognition. She was a visiting scholar at the University of California San Diego (UCSD), USA from March 2008 to September 2008. She is a member of the China Computer Federation, a governing member of the Hunan Computer Society, China, and a China Education Ministry Steering Committee Member of Computer Education on Arts. She has published more than 40 papers and six books.

Xiaoxiao Zhou received her BE in Computing Science and Technology from Hunan University, Changsha, China, in 2015. Her current research interests include speech information processing and speech emotion recognition.

Yufeng Xiao received his MS from Hunan University, Changsha, China, in 2013, and his BE from the same school in 2010, both in Computing Science and Technology. His current research interests include speech information processing, speech emotion recognition and affective computing.

1 Introduction

Emotions play a very important role in our lives. They assist us in maintaining good interpersonal relationships and communication, making sentimental decisions, and expressing ourselves. More and more people use audio and video to express their emotions and communicate with each other on social networking sites. Such audio and video may spread deep depression with suicidal tendencies, or spread information that makes people panic.



It is a thought-provoking issue whether the communication of these emotions benefits the construction of a harmonious network environment or not. For the sake of safeguarding the network environment, it is very necessary to identify user behaviours (Xian et al., 2017) and emotions in the network. Besides, encrypted speech and information retrieval over encrypted speech (He and Zhao, 2017a) have received much attention from scholars, and encrypting speech information will be the shape of things to come. How to recognise emotions in encrypted speech can therefore become one of the future research directions. Against this background, progress in speech emotion recognition can be very helpful for building a safer network environment, and the ability of intelligent systems to recognise emotions in dialogues should be improved.

2 Related work

According to research in psychology, three major approaches to affect modelling can be distinguished (Grandjean et al., 2008): categorical, dimensional, and appraisal-based. The majority of studies in automated emotion recognition are based on categorical approaches. These approaches suggest the existence of a few emotions that are basic, hard-wired in our brains, and universally recognised (Ekman and Friesen, 1975). Work on universal basic categories focused on the analysis of six discrete basic emotions (Ekman, 1992): happiness, sadness, surprise, fear, anger and disgust. Junek (2007) showed that cognitive mental states occur more often in day-to-day interactions than basic emotions do. Therefore, a single label or a limited number of discrete classes may not properly describe subtle and complex affective states. Researchers studying automatic affect recognition therefore started exploring continuous models that label emotions in a multidimensional space instead of viewing emotions categorically. Russell (1980) proposed a two-dimensional emotional space consisting of valence (whether the affect is positive or negative) and arousal (whether the affect is excited or apathetic). The three-dimensional emotional space of pleasure, arousal and dominance (Mehrabian, 1996) was raised after that. Four dimensions of emotional states, namely valence, arousal, power (sense of control over the affect), and expectation (degree of anticipation or being taken unaware), were identified as the most sufficient affect dimensions (Fontaine et al., 2007).

To simplify the problem of dimensional emotion recognition and make full use of the advantages of mature speech recognition based on categorical emotions, some researchers quantised continuous labels into a finite number of discrete levels. For example, if the actual continuous labels range from –1 to 1, taking –1 to 0 as the negative class and 0 to 1 as the positive class turns the continuous problem into a classification task. The simple strategy is to translate the continuous prediction problem into a two-class recognition problem (positive vs. negative classification) (Schuller et al., 2009; Wöllmer et al., 2013) or a four-class recognition problem (Wöllmer et al., 2008). Wöllmer et al. (2008) quantised the valence-arousal dimensions into four or seven levels and used conditional random fields (CRFs) to predict the quantised labels. Tian et al. (2015) discretised dimensional labels into three levels and applied long short-term memory (LSTM) to classify emotion.

Extraction of effective emotional features is one aspect of emotion recognition in speech. Low-level descriptors (LLDs) are the most commonly used audio features, covering different acoustic features such as energy-related features, pitch frequency features, formant frequencies, linear prediction coefficients (LPC), and mel-frequency cepstrum coefficients (MFCC). Some statistical functions (such as mean, median, standard deviation, maximum and minimum) are then applied to the LLDs (e.g., MFCCs) and their corresponding delta coefficients to form global statistical features. The final LLD features consist of local acoustic features and global statistical features. The LLD features used for different databases are not exactly the same. For example, energy-, spectral-, and voicing-related features and their statistics are used for the AVEC 2012 database (Valstar et al., 2012), while the LLDs of the IEMOCAP database (Busso et al., 2008) mainly contain energy-, spectral-, and pitch-related information and their statistics. The number of LLDs is fairly large, ranging from 1 to 2,000. Such a large number of features can result in information redundancy, and the interrelationships between features may run out of control, which hampers model training. High-level features, such as knowledge-inspired features, were then developed. Results indicate that knowledge-inspired features can be highly predictive (Bone et al., 2014; Zhang et al., 2017). Lexical and word-based features have proven to be effective in sentiment classification (Zhao et al., 2017). Psycholinguistic studies showed that emotions can influence neural mechanisms in the brain, and thereby influence sensory processing and attention (Vuilleumier, 2005). Therefore, the use of DIS-NV features to recognise emotions in dialogues has been proposed (Moore et al., 2014), which attained good recognition precision.

Another indispensable aspect of emotion recognition is the selection of a suitable classification model. Many machine learning algorithms, such as support vector machines (Lubis et al., 2016; Gupta and Ahmad, 2017), hidden Markov models (Ozkan et al., 2012), CRFs (Baltrušaitis et al., 2013), and neural networks (Njikam and Zhao, 2016; Kumar et al., 2018; Gu et al., 2017), are widely used to build classification models. These models contribute considerably to affect recognition, but they cannot accurately capture the temporal information of affect. By contrast, LSTM (Hochreiter and Schmidhuber, 1997) can learn long-range temporal dependencies. LSTM-RNNs are able to deal with variable-length inputs and model sequential data with long-range context. Studies (Wöllmer et al., 2009, 2010) show that BLSTM networks can further enhance context-sensitive sequence processing.

The article by Schuster and Paliwal (1997) presented bidirectional long short-term memory (BLSTM), which trains LSTM networks in both time directions. It overcomes the restriction to past context by training sequences in a forward-order and a backward-order recurrent network, respectively, which are connected to a common output layer. The BLSTM model has been applied in many fields, such as text-to-speech (Fan et al., 2014; Shan et al., 2017), speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2005), handwriting recognition (Graves, 2008; Chen et al., 2016), and speech synthesis (Li et al., 2017; Ding et al., 2016).

The success attained by these methods inspired us to combine BLSTM with DIS-NV features to recognise emotions in audio dialogues. The mean recognition rate of this method reaches 76%, which is more than 14% above the performance of LSTM with DIS-NV features in Tian et al. (2015). This finding also demonstrates that BLSTM is more suitable than LSTM for DIS-NV features in terms of recognising emotion in dialogues. In analysing the experimental results, we further define the strengths and weaknesses of the DIS-NV features. The remainder of this study is organised as follows. The next section describes the related database, the computational method for the emotion features, and the specific structure of the classification model. Section 4 presents the setting of the experiments and discusses the results of the research. The final section concludes the study and discusses future directions of speech emotion recognition.

3 BLSTM and DIS-NVs-based method

This section introduces the material and methods used in our approach: the database on which the experiments are run, the computation of the DIS-NV emotion features, and the structure of the BLSTM classification model.

3.1 Database

The AVEC2012 database (Valstar et al., 2012) is used in the following experiments. This database, which was prepared for the audio-visual emotion challenge in 2012, is widely used in emotion recognition and affective computing. It is part of the SEMAINE corpus (Mckeown et al., 2010); some samples have no text information corresponding to the audio. AVEC2012 is a spontaneous corpus, not one acted out by professional actors based on transcripts, which means that all speakers can express themselves in any emotion as much as they want. Twenty-four people were invited to participate in conversations with four on-screen characters role-played by workers. These four characters are acted out with different styles: Poppy is happy, Spike remains angry, Obadiah is in a depressive state, and Prudence is relatively even-tempered. When people converse with one of the characters, they receive guidance in expressing the corresponding affect. The audio-visual recordings last for about eight hours. Each fragment lasts for approximately five minutes, and the topics of conversation vary from daily life to political issues. These recordings are divided into training, development, and test sets. Emotion labels, timings, transcriptions, audio, and video of each dialogue session are offered by the AVEC2012 database.

The AVEC2012 database labels emotion information in the arousal-expectancy-power-valence emotion space. The arousal dimension describes the subject's initiative: excitement indicates high values, and boring states get low values. Expectancy indicates whether or not the topic can be predicted by the speaker; it also indicates the level of interest in the topic. Power shows the degree of dominance of the speaker in dialogues; when a person speaks to his or her superior, he or she is prone to obtain a low power value, while the superior's power is much higher. Valence conveys the feelings of a subject: positive feeling maps to the positive range of values, and negative feeling matches the negative range. Two types of labels, namely fully continuous and word-level, respectively correspond to the two sub-challenges of AVEC2012, FCSC and WLSC. This study only uses the word-level labels, wherein each word in a dialogue is a data instance. The training and development sets are combined into a total training set for training the model. The number of training instances is 36,469, and 13,410 test instances are used to test the model.

3.2 Features

This section discusses five knowledge-inspired features, which consist of three types of disfluencies (DIS) and two types of non-verbal vocalisations (NV). DIS-NV features were initially proposed by Moore et al. (2014), who applied DIS-NV features with an LSTM model (Hochreiter and Schmidhuber, 1997) to recognise emotions in dialogues and achieved good performance, indicating that DIS-NV features are effective emotion features.

The three types of DIS are filled pauses, fillers, and stutters. Filled pauses are non-lexical sounds that have no actual meaning but are filled with emotion information, which makes them well suited to emotion recognition. An example is 'hmm' in the utterance 'hmm… I don't know how to do it.'; we can easily read a hesitant emotion from this word. The three commonly used filled pauses in the AVEC database are 'em', 'eh', and 'oh'. Fillers are phrases used to maintain the pace of dialogue and avoid pauses. An example is 'you know' in the utterance 'I thought I'd, you know, have a chat with you'. The three most common fillers in the AVEC database are 'well', 'you know', and 'I mean'. Stutters are words, or parts of words, that the speaker involuntarily repeats when speaking. For example, 'ma' in the utterance 'ma maybe it will come true' may convey the speaker's tension. The two types of non-verbal vocalisations are laughter and breath. Laughter is a typical emotion display, and the intensity of breathing while speaking can indicate how strong a person's feelings are. We use the labels <LAUGH> and <BREATH> for the laughter and breath features, respectively, in the transcripts provided with the AVEC challenge.

Table 1 shows the proportions of the DIS-NV features in the AVEC database.

Table 1 Frequency of DIS-NV features

Database     FP     FL     ST     LA     BR
AVEC2012     32.0   14.7   9.4    11.9   2.7

Note: Filled pauses [FP], fillers [FL], stutters [ST], laughter [LA], and breath [BR].

The method for calculating the DIS-NV features is based on the word timings provided by the AVEC database. Each word is given five numerical values that correspond to the five DIS-NV features, computed using equation (1):

D_d = t_d / T_d    (1)

where D_d stands for the value of a DIS-NV feature D, t_d stands for the duration of feature D in an utterance, and T_d stands for the duration of the whole sentence. The AVEC database records timing word by word. The lengths of the sentences differ, and the interval between utterances is inconsistent, so we assume each sentence contains 15 words and take the duration of these 15 words as T_d. A moving window containing 15 words is used to represent an utterance. The position of the window does not change from the first word to the 15th word of a dialogue, which means that the first 15 words share the same T_d. After the 15th word, the window moves forward one word at a time, so the utterance duration T_d of word w_i (where i is the position of the word in a session) is equal to the sum of the durations from w_{i-14} to w_i. The sentence length is set to 15 according to the average length of a speaker turn, and we hypothesise that the speaker's emotional state remains at a similar level within a moving window.

In the present study, we program a string matching algorithm in Python to automatically calculate the values of these features. First, we use five lists to hold the words that describe the five DIS-NV features: the words frequently used in daily life for filled pauses, fillers, and stutters, and only the laughter label <LAUGH> and breath label <BREATH> for the laughter and breath features. We then traverse the transcript documents and compare each word in the transcripts with the words in the lists. If they match, we take the word's time span, calculate the sum of the timings of the previous 15 words (including this word) as the sentence time, and obtain the feature value using equation (1); if not, the feature value is set to 0. To ensure accurate results, we manually examine and align the features. For example, 'bye bye' is recognised as a stutter by the string matching algorithm, but actually it is not, and the 'well' in 'it works well' is considered a filler, while here it is an ordinary sentence element. We make the necessary corrections for such inaccurate matches.
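To illustrate this procedure, the sketch below applies equation (1) with the 15-word moving window described above. It is a minimal reconstruction in Python, not the authors' released code: the word lists are abbreviated, the transcript is assumed to be a list of (word, duration) pairs taken from the AVEC word timings, and the crude stutter test reproduces the kind of false positive ('bye bye') that has to be corrected manually.

```python
# Minimal sketch of the DIS-NV extraction described above. Assumptions: a
# transcript is a list of (word, duration) pairs from the AVEC word timings,
# and the word lists below are illustrative, not the authors' exact lists.
FILLED_PAUSES = {"em", "eh", "oh"}
FILLERS = {"well", "you know", "i mean"}   # multi-word fillers need phrase matching
WINDOW = 15                                # assumed utterance length (average speaker turn)

def dis_nv_features(transcript):
    """Return one five-dimensional vector [FP, FL, ST, LA, BR] per word."""
    words = [w.lower() for w, _ in transcript]
    durations = [d for _, d in transcript]
    features = []
    for i, (word, dur) in enumerate(zip(words, durations)):
        # T_d: duration of the current word plus the 14 preceding ones; the
        # first 15 words of a dialogue share the same window, as in the text.
        window = durations[:WINDOW] if i < WINDOW else durations[i - WINDOW + 1:i + 1]
        t_d, vec = sum(window), [0.0] * 5
        if t_d > 0:
            ratio = dur / t_d                               # equation (1): D_d = t_d / T_d
            if word in FILLED_PAUSES:
                vec[0] = ratio
            elif word in FILLERS:
                vec[1] = ratio
            elif i + 1 < len(words) and words[i + 1].startswith(word):
                vec[2] = ratio                              # crude stutter test ('ma maybe');
                                                            # also flags 'bye bye', fixed by hand
            elif word == "<laugh>":
                vec[3] = ratio
            elif word == "<breath>":
                vec[4] = ratio
        features.append(vec)
    return features
```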
3.3 Model

Traditional recurrent neural networks are unable to learn temporal dependencies longer than a few time steps due to the vanishing gradient problem. The LSTM model was proposed to solve the problem of long-term dependencies in traditional recurrent neural networks (Hochreiter and Schmidhuber, 1997). This model has an input gate and an output gate; each gate is an activation function, and the two gates help the network architecture store and access information. Gers et al. (2000) improved the architecture with a forget gate, so that some previous inputs can be selected for forgetting through the forget activation. The latter is the widely used structure and the one adopted in this study.

LSTM is a recurrent neural network architecture that contains an input layer, a hidden layer made up of recurrently connected memory cells, and an output layer. The processing of the recurrent network is as follows. Given an input feature sequence x = (x_1, ..., x_t), where t denotes the time step of the input, the sequence x goes through the hidden LSTM memory cells to obtain a new sequence h = (h_1, ..., h_t). The hidden sequence h is fed into the output layer (commonly an activation function) to obtain the final result y = (y_1, ..., y_t). The iteration procedure for moment t is:

(h_t, c_t) = H(x_t, h_{t-1}, c_{t-1})    (2)

y_t = W_{hy} h_t + b_y    (3)

Index t indicates the present moment and t - 1 the immediately preceding moment. The value h indicates the hidden output, c indicates the cell state, x is the input, and y denotes the final output. Equation (2) shows that the present hidden output h_t and cell state c_t are decided by the current input, the past hidden output, and the past cell state. In equation (3), W_{hy} and b_y are the parameters of the output layer, denoting its weight matrix and bias vector, respectively.

We then focus on the special part of the LSTM model, the hidden LSTM cell. Each LSTM cell contains memory cells and a set of multiplicative gates, namely the input gate, output gate, and forget gate. The gate states are modified during the training phase of the model, and the final states are then applied in the testing phase. Figure 1 shows the specific structure of the LSTM memory cell.

Figure 1 LSTM cell (diagram: a memory cell c_t with input, forget and output gates, each a sigmoid over x_t and h_{t-1}, and tanh non-linearities on the cell input and output)

The processing in Figure 1 is executed by equations (4) to (8):
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (4)

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (5)

c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (6)

o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o)    (7)

h_t = o_t tanh(c_t)    (8)

Equation (4) is a sigmoid activation that chooses which information to discard; the corresponding sigmoid part in Figure 1 is called the forget gate. Equations (5) and (6) decide which information should be updated and which values should be stored, respectively: equation (5) describes the input gate in Figure 1, and equation (6) updates the old cell state c_{t-1} into the new cell state c_t. Equation (7) selects the part of the cell state to be output; it is the output gate. The hidden output h_t is computed using equation (8), so the previous output, the current input and the new cell state jointly decide the final output. These gates enable the LSTM-RNN to remove or add information to the cell state and make full use of the information while avoiding the long-term dependency problem.
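To make the gate equations concrete, the following NumPy sketch performs one step of the memory cell exactly as written in equations (4) to (8). It is our own illustrative implementation under assumed weight names and shapes, not the model code used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step; W maps gate names to weight matrices, b to bias vectors."""
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])        # (4) forget gate
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])        # (5) input gate
    c_t = f_t * c_prev + i_t * np.tanh(
        W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])                  # (6) cell state update
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev
                  + W["co"] @ c_prev + b["o"])                      # (7) output gate
    h_t = o_t * np.tanh(c_t)                                        # (8) hidden output
    return h_t, c_t
```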
In addition, traditional RNNs process the input in temporal order, thus learning the input information by relating it only to past context. A bidirectional RNN model (BRNN) was therefore proposed (Schuster and Paliwal, 1997). BRNNs have two hidden layers that process the data in different directions, one in forward order and the other in backward order.

Figure 2 Bidirectional RNN model (diagram: a forward and a backward hidden layer over the input sequence x_{t-1}, x_t, x_{t+1}, both connected to a common output layer producing y_{t-1}, y_t, y_{t+1})

Figure 2 shows that one hidden layer computes the forward result from the start of the input sequence to the end, while the other hidden layer calculates the backward result from the end to the start. In this way, the BRNN can relate both future and past events to the current step. The final output y_t at moment t is decided by the hidden output of the forward layer and the hidden output of the backward layer. The iterating functions for moment t are as follows:

\overrightarrow{h}_t = H(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})    (9)

\overleftarrow{h}_t = H(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})    (10)

y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y    (11)

Graves (2008) proposed BLSTM by combining the BRNN and LSTM. Long-range context can be used efficiently in both input directions via the BLSTM model, and it has been successfully applied to the task of automatic speech emotion prediction (Wöllmer et al., 2008, 2010).

This approach is not realistic in a real-time environment, but it is excellent for off-line sequence labelling. In the present study, we aim to label the emotions in a recorded dialogue accurately, so the BLSTM model is selected as the training model; its bidirectional characteristic makes it more suitable than the LSTM model.
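A compact sketch of the bidirectional pass in equations (9) to (11) is given below; it reuses the lstm_step function from the previous sketch and is again a simplified reconstruction under our own naming assumptions, not the authors' implementation.

```python
import numpy as np

def blstm_layer(xs, fwd_params, bwd_params, W_fy, W_by, b_y, hidden_size):
    """xs: list of input vectors; fwd_params/bwd_params: (W, b) pairs for the
    forward and backward cells; W_fy, W_by, b_y form the common output layer."""
    T = len(xs)
    h_f = np.zeros(hidden_size); c_f = np.zeros(hidden_size)
    h_b = np.zeros(hidden_size); c_b = np.zeros(hidden_size)
    fwd, bwd = [None] * T, [None] * T
    for t in range(T):                        # equation (9): forward layer, start -> end
        h_f, c_f = lstm_step(xs[t], h_f, c_f, *fwd_params)
        fwd[t] = h_f
    for t in reversed(range(T)):              # equation (10): backward layer, end -> start
        h_b, c_b = lstm_step(xs[t], h_b, c_b, *bwd_params)
        bwd[t] = h_b
    # equation (11): both hidden sequences feed a common output layer
    return [W_fy @ fwd[t] + W_by @ bwd[t] + b_y for t in range(T)]
```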
4 Result and discussion

4.1 Experiment setting

Our experiments use the TensorFlow toolbox (Abadi et al., 2016) to build the BLSTM-RNN model. The forward and backward directions of the hidden layer each have 16 LSTM memory cells. A dropout layer is added between the hidden layer and the output layer to avoid overfitting, with a dropout factor of 0.5. The Adam algorithm is chosen as the optimiser, with the learning rate and momentum set to 0.01 and 0.8, respectively. We normalise the continuous emotion annotations into [-1, 1]. The continuous values of each dimension are then converted into three discrete categories: category 0 (value range [-1, -0.333]), category 1 (value range [-0.333, 0.333]), and category 2 (value range [0.333, 1]).
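A rough Keras sketch of this configuration is shown below: 16 LSTM cells per direction, a dropout of 0.5 between the hidden and output layers, Adam with a learning rate of 0.01 (we map the quoted momentum of 0.8 onto Adam's beta_1 parameter, which is an assumption), and the discretisation of the normalised annotations into three categories. The input shape and loss are also assumptions; this is an illustration of the setting, not the original experiment script.

```python
import numpy as np
import tensorflow as tf

# Normalised annotations in [-1, 1] are mapped to the three categories
# described above, split at -0.333 and 0.333.
def discretise(labels):
    return np.digitize(labels, bins=[-0.333, 0.333])       # -> 0, 1 or 2

NUM_FEATURES = 5   # the five DIS-NV features per word
NUM_CLASSES = 3

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, NUM_FEATURES)),             # variable-length word sequences
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(16, return_sequences=True)),   # 16 cells in each direction
    tf.keras.layers.Dropout(0.5),                            # dropout before the output layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.8),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```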
Table 2 Recognition rates on the AVEC2012 test set

Features + model    A      E      P      V      M
L + S               52.4   60.8   67.5   59.2   60.0
L + LS              52.4   60.7   66.1   58.1   59.3
D + LS              54.1   65.8   68.3   60.1   62.0
D + BLS             77.0   78.0   71.9   77.0   76.0

Notes: In the first row, A is arousal, E is expectancy, P is power, V is valence, and M is mean. In the first column, L means LLD features, D means DIS-NV features, S means the SVM model, LS means the LSTM model, and BLS means the BLSTM model.

4.2 Results

The experiments on AVEC2012 comprise a baseline experiment based on LLD features and an SVM model, two comparable experiments with similar approaches (one based on LLD features and an LSTM model, the other based on DIS-NV features and an LSTM model (Tian et al., 2015)), and our experiment based on DIS-NV features and the BLSTM model; the results are exhibited in Table 2. The first column of the table shows the features and the model used in the corresponding experiment. The numbers in the table are the weighted F-measures of the three classes, expressed as percentages, with one result per emotion dimension. The last column 'mean' is the unweighted average of the F-measures of the four dimensions. Table 3 presents the confusion matrices of the classification results for the four dimensions (arousal, expectancy, power and valence). The quantity distribution in each dimension is shown in Figure 3.
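For clarity about the metric, the snippet below shows how a weighted F-measure over the three categories, and the unweighted mean over the four dimensions, can be computed with scikit-learn. The file names are hypothetical placeholders; this is our illustration of the scoring, not the challenge's official evaluation script.

```python
import numpy as np
from sklearn.metrics import f1_score

dimensions = ["arousal", "expectancy", "power", "valence"]
scores = []
for dim in dimensions:
    y_true = np.load(f"{dim}_true.npy")     # hypothetical files of word-level labels (0, 1, 2)
    y_pred = np.load(f"{dim}_pred.npy")
    f1 = f1_score(y_true, y_pred, average="weighted")   # weighted F-measure over the 3 classes
    scores.append(f1)
    print(f"{dim}: {100 * f1:.1f}")
print(f"mean: {100 * np.mean(scores):.1f}")              # unweighted average over the 4 dimensions
```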
feature is not enough for use in practical applications. In the
Table 3 Confusion matrix of each dimension of the best future, we will integrate other knowledge-inspired features
recognition rates or acoustic features to improve the recognition performance.
Category 0 1 2
Figure 3 Emotion attribution
0 0 0 0
1 2382 8368 2655
2 0 5 0
Category 0 1 2
0 0 0 0
1 4567 8555 288
2 0 0 0
Category 0 1 2
0 0 0 0
1 0 8 7
2 540 5295 7560
Category 0 1 2
0 0 0 0
1 2382 8368 2655
The results in Table 2 show that recognition with DIS-NV features and the BLSTM model outperforms DIS-NV features with the LSTM model on every dimension of discretised continuous emotion recognition. Emotion is continuous in our dialogues, which means that the affect carried by neighbouring words is not independent. The result in the last row of Table 2 indicates that the typical bidirectional characteristic of the BLSTM model is fairly beneficial for predicting the feelings carried by words. The small number of DIS-NV features also decreases the complexity of model training. The largest improvement occurs on the arousal dimension, which indicates that past information alone is not enough for arousal and that future information can be an effective supplement. However, the increase on the power dimension is only about 4%, which may be limited by the predictability of the DIS-NV features. Table 2 and Figure 3 list the classification results and the emotion attribution of each dimension. Together they show that the DIS-NV features perform well on the category with the most instances of each dimension, but have little effect on the other categories. This phenomenon also suggests that the DIS-NV features may lack sufficient emotion information. As Table 1 at the beginning of the article shows, the sparsity of the DIS-NV features may be another important influencing factor. Sparse DIS-NV features are common in the database; in particular, we could hardly find DIS-NV features in databases with fixed text. As we can see, the DIS-NV features are much more colloquial than other features. In daily conversations, the number of words that match the DIS-NV features can be expected to increase, and the frequency of those words would be noticeably higher. We will further explore the performance of DIS-NV features in more natural dialogues. From a long-term point of view, these features alone are not enough for practical applications; in the future, we will integrate other knowledge-inspired features or acoustic features to improve recognition performance.

Figure 3 Emotion attribution (distribution of instances over the three categories for each emotion dimension)

5 Conclusions

We introduce a new method that combines DIS-NV features with a BLSTM model to recognise continuous emotion in dialogues, and we conduct experiments to confirm the effectiveness of the method. Our results are more accurate than those of LLD features with an SVM model and of DIS-NV features with an LSTM model by more than 10%. We simplified the emotion recognition task to discrete emotion classification through discretisation, so it is not an actual continuous emotion recognition approach. DIS-NV features are remarkable features for recognising emotions in audio, but, as we can see, they still have some shortcomings. Using a feature fusion method (Li et al., 2015) to combine DIS-NV features with the spectrograms of audio files to predict continuous emotions in audio can be a future study. Besides, combining DIS-NV features in Chinese speech with the speech recognition method of He and Zhao (2017b) to recognise emotions in Chinese audio is another area worth considering.
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C. et al. (2016) Tensorflow: Large-scale Machine Learning on Heterogeneous Distributed Systems, arXiv:1603.04467.

Baltrušaitis, T., Banda, N. and Robinson, P. (2013) 'Dimensional affect recognition using continuous conditional random fields', IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pp.1–8.

Bone, D., Lee, C.C. and Narayanan, S. (2014) 'Robust unsupervised arousal rating: a rule-based framework with knowledge-inspired vocal features', IEEE Transactions on Affective Computing, Vol. 5, No. 2, pp.201–213.

Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S. et al. (2008) 'IEMOCAP: interactive emotional dyadic motion capture database', Language Resources and Evaluation, Vol. 42, No. 4, p.335.

Chen, K., Yan, Z.J. and Huo, Q. (2016) 'A context-sensitive-chunk BPTT approach to training deep LSTM/BLSTM recurrent neural networks for offline handwriting recognition', International Conference on Document Analysis and Recognition, pp.411–415.

Ding, C., Xie, L., Yan, J., Zhang, W. and Liu, Y. (2016) 'Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features', Automatic Speech Recognition and Understanding, pp.98–102.

Ekman, P. (1992) 'An argument for basic emotions', Cognition and Emotion, Vol. 6, Nos. 3–4, pp.169–200.

Ekman, P. and Friesen, W.V. (1975) Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues, Prentice-Hall International Inc., Englewood Cliffs, NJ.

Fan, Y., Qian, Y., Xie, F.L. et al. (2014) 'TTS synthesis with bidirectional LSTM-based recurrent neural networks', Interspeech, pp.1964–1968.

Fontaine, J.R., Scherer, K.R., Roesch, E.B. and Ellsworth, P.C. (2007) 'The world of emotions is not two-dimensional', Psychological Science, Vol. 18, No. 12, pp.1050–1057.

Gers, F.A., Schmidhuber, J. and Cummins, F. (2000) 'Learning to forget: continual prediction with LSTM', Neural Computation, Vol. 12, No. 10, p.2451.

Grandjean, D., Sander, D. and Scherer, K.R. (2008) 'Conscious emotional experience emerges as a function of multilevel, appraisal-driven response synchronization', Consciousness and Cognition, Vol. 17, No. 2, p.484.

Graves, A. (2008) 'Supervised sequence labelling with recurrent neural networks', Studies in Computational Intelligence, p.385.

Graves, A. and Schmidhuber, J. (2005) 'Framewise phoneme classification with bidirectional LSTM and other neural network architectures', Neural Networks, Vol. 18, No. 5, pp.602–610.

Graves, A., Fernández, S. and Schmidhuber, J. (2005) 'Bidirectional LSTM networks for improved phoneme classification and recognition', Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005, International Conference, Warsaw, Poland, 11–15 September, Vol. 3697, pp.799–804.

Gu, L., Guo, H. and Liu, X. (2017) 'Fuzzy time series forecasting based on information granule and neural network', International Journal of Computational Science and Engineering, Vol. 15, Nos. 1/2, p.146.

Gupta, D. and Ahmad, M. (2017) 'A hybrid technique based on fuzzy methods and support vector machine for prediction of brain tumor', International Journal of Computational Science and Engineering, No. 8, p.9.

He, S. and Zhao, H. (2017a) 'A retrieval algorithm of encrypted speech based on syllable-level perceptual hashing', Computer Science and Information Systems, Vol. 14, No. 3, pp.704–718.

He, S. and Zhao, H. (2017b) 'Automatic syllable segmentation algorithm of Chinese speech based on MF-DFA', Speech Communication, Vol. 92, pp.42–51.

Hochreiter, S. and Schmidhuber, J. (1997) 'Long short-term memory', Neural Computation, Vol. 9, No. 8, pp.1735–1780.

Junek, W. (2007) 'Mind reading: the interactive guide to emotions', Journal of the Canadian Academy of Child and Adolescent Psychiatry/Journal de l'Académie Canadienne de Psychiatrie de l'enfant et de l'adolescent, Vol. 16, No. 4, pp.182–183.

Kumar, P.S.J., Huan, T.L., Li, X. and Yuan, Y. (2018) 'Panchromatic and multispectral remote sensing image fusion using machine learning for classifying bucolic and farming region', International Journal of Computational Science and Engineering, Vol. 15, Nos. 5/6, pp.340–370.

Li, R., Wu, Z., Liu, X., Meng, H. and Cai, L. (2017) 'Multi-task learning of structured output layer bidirectional LSTMs for speech synthesis', IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5510–5514.

Li, Z., He, S. and Hashem, M. (2015) 'Robust object tracking via multi-feature adaptive fusion based on stability: contrast analysis', Visual Computer, Vol. 31, No. 10, pp.1319–1337.

Lubis, N., Sakti, S., Neubig, G., Toda, T., Purwarianti, A. and Nakamura, S. (2016) 'Emotion and its triggers in human spoken dialogue: recognition and analysis', Proceedings of the 5th International Workshop on Spoken Dialog Systems, pp.224–229.

Mckeown, G., Valstar, M.F., Cowie, R. and Pantic, M. (2010) 'The SEMAINE corpus of emotionally coloured character interactions', IEEE International Conference on Multimedia and Expo, Vol. 26, pp.1079–1084.

Mehrabian, A. (1996) 'Pleasure-arousal-dominance: a general framework for describing and measuring individual differences in temperament', Current Psychology, Vol. 14, No. 4, pp.261–292.

Moore, J.D., Tian, L. and Lai, C. (2014) 'Word-level emotion recognition using high-level features', Lecture Notes in Computer Science, Vol. 8404, pp.17–31.

Njikam, A.N.S. and Zhao, H. (2016) 'A novel activation function for multilayer feed-forward neural networks', Applied Intelligence, Vol. 45, No. 1, pp.75–82.

Ozkan, D., Scherer, S. and Morency, L.P. (2012) 'Step-wise emotion recognition using concatenated-HMM', ACM.

Russell, J.A. (1980) 'A circumplex model of affect', Journal of Personality and Social Psychology, Vol. 39, No. 6, pp.1161–1178.

Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G. and Wendemuth, A. (2010) 'Acoustic emotion recognition: a benchmark comparison of performances', IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2009, pp.552–557.

Schuster, M. and Paliwal, K.K. (1997) 'Bidirectional recurrent neural networks', IEEE Transactions on Signal Processing, Vol. 45, No. 11, pp.2673–2681.

Shan, C., Xie, L. and Yao, K. (2017) 'A bi-directional LSTM approach for polyphone disambiguation in Mandarin Chinese', International Symposium on Chinese Spoken Language Processing, pp.1–5.

Tian, L., Moore, J.D. and Lai, C. (2015) 'Emotion recognition in spontaneous and acted dialogues', International Conference on Affective Computing and Intelligent Interaction, Vol. 54, pp.698–704.

Valstar, M., Cowie, R. and Pantic, M. (2012) 'AVEC 2012: the continuous audio/visual emotion challenge – an introduction', ACM International Conference on Multimodal Interaction, Vol. 37, pp.361–362.

Vuilleumier, P. (2005) 'How brains beware: neural mechanisms of emotional attention', Trends in Cognitive Sciences, Vol. 9, No. 12, pp.585–594.

Wöllmer, M., Eyben, F., Graves, A., Schuller, B. and Rigoll, G. (2010) 'Bidirectional LSTM networks for context-sensitive keyword detection in a cognitive virtual agent framework', Cognitive Computation, Vol. 2, No. 3, pp.180–190.

Wöllmer, M., Eyben, F., Keshet, J., Graves, A., Schuller, B. and Rigoll, G. (2009) 'Robust discriminative keyword spotting for emotionally coloured spontaneous speech using bidirectional LSTM networks', IEEE International Conference on Acoustics, Speech and Signal Processing, pp.3949–3952.

Wöllmer, M., Eyben, F., Reiter, S., Schuller, B., Cox, C., Douglas-Cowie, E. et al. (2008) 'Abandoning emotion classes – towards continuous emotion recognition with modelling of long-range dependencies', Conference of the International Speech Communication Association, incorporating the Australasian International Conference on Speech Science and Technology, INTERSPEECH, pp.597–600.

Wöllmer, M., Kaiser, M., Eyben, F., Schuller, B. and Rigoll, G. (2013) 'LSTM-modeling of continuous emotions in an audiovisual affect recognition framework', Image and Vision Computing, Vol. 31, No. 2, pp.153–163.

Wöllmer, M., Schuller, B., Eyben, F. and Rigoll, G. (2010) 'Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening', IEEE Journal of Selected Topics in Signal Processing, Vol. 4, No. 5, pp.867–881.

Xian, X., Chen, F. and Wang, J. (2017) 'An insight into campus network user behaviour analysis decision system', International Journal of Embedded Systems, Vol. 9, No. 1, p.3.

Zhang, J., He, J., Wu, Z., Azmat, F. and Li, P. (2017) 'Prosodic features-based speaker verification using speaker-specific-text for short utterances', International Journal of Embedded Systems, Vol. 9, No. 3, p.250.

Zhao, H., Zhang, X. and Li, K. (2017) 'A sentiment classification model using group characteristics of writing style features', International Journal of Pattern Recognition & Artificial Intelligence, Vol. 31, No. 12.
