

Autoencoder for Semisupervised Multiple Emotion Detection of Conversation Transcripts

Duc-Anh Phan, Yuji Matsumoto, Hiroyuki Shindo

The authors are with the Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan. Email: phan.duc_anh.oq3, matsu, shindo -at- is.naist.jp

Abstract—Textual emotion detection is a challenge in computational linguistics and affective computing as it involves the discovery of all associated emotions expressed in a given piece of text. It becomes even more difficult when applied to conversation transcripts, as there arises a need to model the spoken utterances between speakers while keeping in mind the context of the entire conversation. In this paper, we propose a semisupervised multi-label method of predicting emotions from conversation transcripts. The corpus contains conversational quotes extracted from movies. A small number of them are annotated, whereas the rest are used for unsupervised training. The word2vec word-embedding method is used to build an emotion lexicon from the corpus and then embed the utterances into vector representations. A deep-learning autoencoder is then used to discover the underlying structure of the unsupervised data. We fine-tune the learned model based on labeled training data and measure its performance on a test set. The experiment result suggests that the method is effective and only slightly less effective than human annotators.

Index Terms—Emotion recognition, semisupervised learning, multilabel, word2vec, autoencoder.

I. INTRODUCTION

EMOTION detection is an upcoming field of research closely related to sentiment analysis. Extending beyond sentiment analysis in terms of complexity, emotion detection primarily aims to recognize feelings concealed within a media source. Most research until now has focused on brain signals, video of facial expressions, audio recordings, and so on, along with multiclass classification of emotions. Little research has been done on detecting multiple emotions simultaneously in textual data.

Nowadays, along with the popularity of social networks, the Internet also contains an enormous amount of unlabeled data, most of which is textual. By mining and applying semisupervised emotion detection techniques to such data, we open ourselves to a wide range of useful applications, such as measuring citizen happiness, improving customer service, social mental health care, and early screening of possible suicides or crimes. While much such research has been undertaken for news headlines [1], tweets [2], and paragraphs [3], the most useful type of textual data, conversational text such as chat logs and replies on social media, is often overlooked. Text of this type is of great use to applications that use emotion detection. It gives us information about the initial emotional state, how it changes during the conversation, what causes those changes, what the final state is, and what kind of actions result from that final state. A survey on "Trend analysis in social networking using opinion mining" [4] predicted the need for emotion detection in streaming data and live chat.

Conversational text is also more of a challenge than other types of text. For short text samples, such as news headlines [1] or tweets [5], the expression of emotion generally depends on the words being used. Meanwhile, for longer text, the grammatical structure and syntactic variables such as negations, embedded sentences, and type of sentence (question, exclamation, command, or statement) play a part in expressing emotions [6]. Identifying emotions in conversational text is also very different from doing so in paragraphs because there is often more than one party in a conversation. Each party takes turns, called utterances, to express different ideas and emotions, thus making an impact on the other party's emotions. Therefore, to detect emotions in conversational text, one needs to monitor not only the current utterance but also the previous utterances, as well as the context of the entire conversation [7].

In this study, we follow Plutchik's work [8] and make assumptions about the complicated nature of human emotions. Emotions have connections: some are similar to each other but have different intensities, some are opposite to each other, and some occur together at the same time, resonate, and create other emotional states called dyads (Figure 1).

Robert Plutchik proposed a hybrid model of both basic emotions and dimensional theory [9]. He put forward four axes of bipolar basic emotions, joy-sadness, fear-anger, trust-disgust, and surprise-anticipation, with different levels of intensity (subfigure 1a). These primary emotions may blend to form the full spectrum of human emotional experience. The new complex emotions formed by a mix of these primary emotions are called dyads (subfigure 1b). Plutchik's theory explains the connection between emotions: he proposes that some emotions are similar but have different intensities, while some emotions do not occur at the same time since they are on opposite sides of an axis. He also says that complex emotions can be viewed as combinations of the relevant primary ones. This idea enables us to approach emotion detection in a more comprehensive manner [7].

In this paper, we propose a five-step method for detecting emotions in a conversation:

1) We follow in [10]'s footsteps by building an emotion lexicon using word2vec [11] on the IMDb quotes dataset (ftp://ftp.fu-berlin.de/pub/misc/movies/database/).
2) Transforming the raw input data into feature vectors using the lexicon.
3) Using an autoencoder to learn the underlying structure of the unsupervised data.
4) Retraining deep networks initialized from the parameters of the unsupervised models using labeled training data.
5) Verifying the performance of the system on annotated test data and analyzing the evaluation result.


The approach described in this paper exploits Plutchik's theory in a comprehensive manner, covering the full spectrum of human emotions, to work on a challenging multi-label conversation corpus. Our work differs from previous research in four main ways:

• We have integrated Plutchik's idea of basic emotion dyads and intensity in our model, which provides scalability to address emotions at a fine-grained level in the future.
• We use word2vec to automatically produce an emotion lexicon, which is essential for automatic feature extraction from the raw input data.
• Our method is semisupervised: an autoencoder is used on a large unsupervised dataset and its parameters are used to retrain a deep network on annotated data. This allows us to take advantage of both the scarce labeled data and the vast reserve of unlabeled data available on the Internet.
• The output of our system is multi-label, which means that it is capable of addressing multiple emotions simultaneously.

The remainder of this paper is organized into six sections. Section II summarizes related work on emotion detection. Section III discusses the nature of our dataset and explains the annotating scheme. Section IV proposes the visualization of our lexicon and discusses our approach, which includes the steps mentioned in the preceding paragraphs. Section V evaluates the proposed method and section VI presents the conclusion and discusses future work.

Fig. 1. Plutchik's basic emotions and dyads (combinations of two basic emotions); image taken from http://twinklet8.blogspot.jp. (a) Plutchik's wheel of emotions has four major axes (dimensions) of basic emotions. Along each axis, there are opposite pairs of basic emotions and different levels of intensity. The outer circle indicates milder emotions while the inner circle indicates stronger ones. The basic emotions from two axes may blend together and form more complex emotions called dyads; for example, the combination of disgust and anger is contempt, and the combination of disgust and sadness is remorse. (b) Dyads: the combination of two emotions. Conflicting (opposite) basic emotions cannot combine, whereas the other emotions can blend and form emotions that are more complex. Primary dyads are combinations from neighboring axes, as shown in figure 1a, whereas secondary and tertiary dyads are combinations of emotions that are distanced by one or two axes, respectively.

II. RELATED WORKS


In previous studies on emotion analysis, Ekman's model of six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) [12] was often employed. However, this model was developed from the observation of human facial expressions. Therefore, it becomes irrelevant when applied to text classification, where there are opposing emotions that cannot occur together. For example, a person will feel either happy or sad about an event, yet he may feel both happy and surprised at the same time. Ekman's notion also does not address other complicated emotions such as annoyance, pride, or dominance.

Newer works rely on dimensional models using the valence-arousal (VA) space [13], [14], where valence indicates the positive-negative level of the emotion and arousal indicates the intensity of the emotion. Both works focus on the projection of emotional words into the VA space, which can be used for lexicon augmentation. [15] proposed building affective resources in VA dimensions. However, the VA space suggests a common and interconnected neurophysiological system that is responsible for all affective states [16], which is in contrast to the theory of basic emotions [8], [12], where different emotions are assumed to arise from separate neural systems. The VA space is mostly used to test the stimuli of emotional words or facial expressions, as it appears to be too noisy for the emotion classification task.


One of the most obvious clues to identifying emotions in conversation is the choice of words, as suggested in [17]. Therefore, [18], [19] proposed databases of English lemmas giving information about the affective meanings of words. These works inspired the use of emotion lexicons in emotion analysis, including the NRC emotion lexicon [2] and WordNet-Affect [20]. However, their lexicon size is small and therefore lacks coverage, and they only work on short-text domains such as tweets and news headlines, where the recognition of emotion heavily depends on words. They relied on a set of specific seed emotional words and then expanded the lexicon by finding all their synonyms and antonyms. The lexicons are often annotated in a multiclass manner, which means that one lexical item can only be associated with a specific emotion. However, in reality, one word might express different emotions in different situations. Their methods not only oversimplify the complex nature of emotions but also remove the collocation connections between the lexical items. As a result, previous lexicons show inflexibility when applied to other domains, such as emotion analysis of customer reviews, restaurants, or personal blogs, because they do not take into account the different meanings of words (word senses) in different contexts [21]. In this paper, despite the fact that we build our lexicon from the IMDb movie quotes domain, the lexicon is automatically extracted from the corpus through word embedding and can be reproduced on other domains without much effort.

Most of the previous work concentrated on narrowing down the complexity of the problem by focusing on only a small set of emotions that barely involved three or four emotional states [22], [23]. While such approaches may have succeeded in particular problems and domains, they lack the capacity to predict all genres of emotions in different kinds of text. Another work by [24] performed a multiclass classification of dialogue data sourced from Twitter in Japanese. The researchers automatically labeled the obtained dialogues using emotional expression clues, which is similar to our use of an emotional word lexicon. The work assumed that one tweet might portray only one emotion. Although this may be the case in short exchanges of tweets, it is not applicable to real-life conversations, where the number of characters per utterance is not limited and the context information complicates the process of emotion analysis.

Existing multi-label classification methods can be classified into either the problem transformation category or the algorithm adaptation category. The former group of methods transforms the multi-label task into multiple single-label classifications. They use the multi-class approach of one-vs-rest but ignore the correlations among labels, similar to the binary relevance method [25]. The less common approach in this group is the Label Powerset (LP) method, which produces new labels from every possible combination of the original labels. This approach considers label correlations but also produces a very large number of label subsets with very few examples of each. Random k-Labelsets (RAkEL) [26] follows LP; however, it trains the LP classifiers with only k random labelsets from all possible subsets. RAkEL takes advantage of the label correlations and at the same time manages to avoid the problem of a large number of subsets that traditional LP methods face.

Algorithm adaptation methods include multi-label extensions of decision trees, support vector machines, neural networks, and so on [27]. Meka's (http://meka.sourceforge.net/#about) implementation of the deep back-propagation neural network (DBPNN) [28] is also capable of handling multi-label problems. The method focuses on using multiple layers of restricted Boltzmann machines (RBM) to create an autoencoder for pre-training. After that, the whole network is fine-tuned using back-propagation of error derivatives. In our study, we implement a similar adaptation of a multi-label neural network using a stack of fully connected layers and also incorporate the correlations between labels by directly modifying the loss function.

Using Plutchik's basic emotions, [29] proposed a simple bag-of-words approach and fine-tuned RAkEL for multi-label classification of movie reviews. We delve further and work on conversation data, where the exchange between the characters and the context of the entire dialogue is of great importance. The work closest to ours is [3], on paragraphs and documents, which tried to improve the sentence-level prediction of some special emotions which, owing to data sparseness and inherent multi-label classification, were very difficult to predict. These researchers incorporated label dependency and context dependency into their graph model to achieve the goal. However, their work is for paragraphs in Chinese. In our case, we take advantage of an autoencoder to capture the abstract representation of context information.

III. CORPUS

The IMDb quotes corpus is newly published and gets updated frequently. It includes approximately 2,107,863 utterances (turns in conversation) from 117,425 movies. There are several reasons for us to choose the IMDb corpus. First, movies give the annotators more input than just text, with video and audio, which helps them provide a more accurate annotation. Second, movies tend to evoke many emotions among the viewers [30]. Third, there are various genres of movies, such as romance, sci-fi, and documentary. These genres might be equivalent to real-life domains, for example, normal life conversations and scientific exchanges and debates. Lastly, unlike other types of art, such as literature or theatre plays, moviemakers often try to give their dialogues as much real-life feeling as possible [31]. In the work of [32] and [33], it was concluded that spoken language in movies resembles spontaneous spoken language, which supports our choice of using a movie corpus to imitate real-life conversations.

To produce labeled data, we sample quotes from a few chosen famous movies. We then try to match the quotes with subtitle files to get the time of the corresponding scenes in the movies. Annotators are given a description of the basic emotions and dyads and are then asked to watch the scenes with subtitles (Figure 2), following the annotation scheme shown below:


• One utterance may hold zero, one, or more emotions at the same time. The list of emotions to assign includes Plutchik's basic emotions and dyads. The system will treat the dyads as combinations of basic emotions. In case an utterance holds no emotion, it should be annotated with "None." The intensity of emotions is also considered in the labeling phase, as in subfigure 2a.
• The annotators need to assign the entire utterance, which may have two or more sentences, a set of all the emotions expressed inside it. There may be cases where opposing emotions appear simultaneously in the same utterance, for example the pair trust-disgust in the last example of subfigure 2b.

Fig. 2. In the annotating scheme of the testing data, each utterance is annotated with basic emotions or dyads. On the website, the confidence levels are adjusted using the bar scale in 2a and translated into intensity values in 2b. (a) UI of the annotating website. Users can choose the appropriate emotions by adjusting the confidence bars or by typing the emotions or dyads into the text box. The dyads are then decomposed automatically into primary emotions and the bars are readjusted. (b) Examples of annotated transcripts from the movie Braveheart (1995); in the last example, the opposing emotions of trust and disgust are both annotated.

We understand that the annotated emotions might be linked to different entities (for example, in the last example of subfigure 2b, trust is linked to the princess while disgust is linked to the king). However, in the scope of this paper, we would like to cover only the emotions that can be felt by a speaker. Hence, the annotation scheme does not require the annotators to point out the target entities of the emotions.

We have several intensity values for each annotation; however, the standard deviations are very high, as shown in Table I. While annotators agree on the existence of some emotions, their views on the intensity vary. Thus, we decided to avoid using the values in this work. In future work, we hope to be able to predict the intensity of the emotion and conduct more thorough annotating sessions.

TABLE I
AVERAGE INTENSITY VALUES PER EMOTION AND STANDARD DEVIATIONS. THE DEVIATIONS ARE HIGH, INDICATING THE VARIED VIEWS OF THE ANNOTATORS ON THE INTENSITY OF EMOTIONS.

Emotion class | Average intensity
Anger         | 0.495 ± 0.200
Fear          | 0.426 ± 0.181
Disgust       | 0.411 ± 0.175
Trust         | 0.498 ± 0.167
Joy           | 0.564 ± 0.194
Sadness       | 0.477 ± 0.202
Surprise      | 0.484 ± 0.190
Anticipation  | 0.433 ± 0.152

We define the gold-standard data by applying the majority rule: any emotion annotated by at least three annotators is considered a valid label for the utterance. If an utterance has no valid label, it is considered to be objective and to have no emotion. There might be complex cases where a single sample has been annotated with different emotions of high confidence values by different annotators. We acknowledge that this is the current blind spot of our scheme, as these cases might get labeled as objective. We report the statistics of the data in Table II. Of the 1,000 utterances of gold-standard data, 896 are agreed on by at least three annotators and have emotion labels. The remaining 104 utterances that have no label according to the gold standard are, as defined above, considered to be objective. All these 1,000 utterances are used in the testing phase.

TABLE II
NUMBER OF UTTERANCES RECEIVING AGREEMENT BY ANNOTATORS IN THE TESTING DATA.

Agreement                              | No. utterances
Agreed by 2 annotators                 | 996
Agreed by 3 annotators (gold-standard) | 896
Agreed by 4 annotators                 | 443
Agreed by 5 annotators                 | 253
Total No. utterances                   | 1,000
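As an illustration of the majority rule described above, the following sketch (our own, not code released with the paper) derives a gold-standard label set from five annotators' label sets; only the "at least three votes" threshold and the "objective" fallback are taken from the text, everything else is an assumption.

```python
from collections import Counter

def gold_standard(annotations, min_votes=3):
    """annotations: list of label sets, one set of basic emotions per annotator.
    Returns the set of emotions chosen by at least `min_votes` annotators
    (an empty set means the utterance is treated as objective / no emotion)."""
    votes = Counter(label for labels in annotations for label in set(labels))
    return {label for label, count in votes.items() if count >= min_votes}

# Example: five annotators labeling one utterance
example = [{"trust"}, {"trust", "joy"}, {"trust"}, {"sadness"}, {"joy", "trust"}]
print(gold_standard(example))  # {'trust'} -> valid gold label; 'joy' has only 2 votes
```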
The following are some statistics of the corpus: a total of 2,107,863 utterances comprising over 26 million words. We separated the training data of 10,000 utterances, which was annotated by only one annotator, from the testing data of 1,000 utterances (the gold standard), which was annotated by five annotators. The rest of the nearly two million utterances are used as unsupervised data.

One of the most common inter-annotator agreement measurements is the Kappa statistic [34]. However, it is not applicable to multi-label datasets because its way of computing causes the hypothetical probability-of-chance agreement Pe to be greater than 1, since there are cases where two or more labels are annotated to a given instance. Therefore, we consider the gold-standard data to be the ground-truth data and measure the average accuracy for each of the emotions and the F1 score of the annotators in Table III. A survey by [35] suggests that low agreement scores are often observed in multi-label annotating tasks.
their views on the intensity vary. Thus, we decided to avoid


TABLE III
AGREEMENT SCORES AMONG ANNOTATORS WITH THE GOLD-STANDARD DATA AS THE GROUND TRUTH. WE COMPARE THE ANNOTATION OF EACH ANNOTATOR TO THE GOLD-STANDARD DATA AND THEN INFER THE ACCURACY AND THE F1 SCORE.

Emotion class                   | Frequency | Accuracy
Anger                           | 0.214     | 0.72
Fear                            | 0.154     | 0.673
Disgust                         | 0.127     | 0.624
Trust                           | 0.279     | 0.65
Joy                             | 0.183     | 0.606
Sadness                         | 0.232     | 0.584
Surprise                        | 0.141     | 0.575
Anticipation                    | 0.064     | 0.491
Average accuracy (by class)     |           | 0.615
Average accuracy (by annotator) |           | 0.43
Average F1 (by annotator)       |           | 0.626
Total No. utterances            | 1,000     |
Emotions per utterance          | 1.41      |

From Table III we can see that there is an unbalanced distribution of emotions in our gold-standard data. While emotions such as anger, trust, and sadness are annotated very frequently, the annotators rarely label anticipation. There might be two reasons for this, the first being the subtlety of the emotion itself and the second being the choice of movie genres used in our data. However, both seem to be interesting points to investigate in our future research. On average, our data has 1.41 emotions per utterance.

IV. PROPOSED METHOD

A. Emotional words to vectors

Using a lexicon is proven to provide significant improvements in identifying the emotion conveyed by a word [2]. Therefore, in our case, we build a new lexicon in which each lexical item displays not only its association with Plutchik's basic emotions but also the strength of the association. We combine the word2vec features and calculated emotion features to form a hybrid vector representation of a lexicon item as follows:

1) word2vec features: Using word2vec, we generate the embedding of all words available in the corpus. With the embedding, the cosine similarity between each word and the primary emotion words is calculated as in equation (1). In our work, the embedding of a word is a 96-dimensional vector.

$$\mathrm{sim}(word, e_i) = \frac{vec(word) \cdot vec(e_i)}{\lVert vec(word) \rVert_2 \, \lVert vec(e_i) \rVert_2} \qquad (1)$$

2) Emotion features: By contrast, we define the primary emotions and dyads proposed in Plutchik's theories as the emotional vectors of our lexicon and give them initial values. Different levels of intensity of emotional words are also considered. Each lexical item in the lexicon has a vector of values on every axis of the basic emotions: joy-sadness, fear-anger, trust-disgust, and surprise-anticipation. We manually assign the primary emotions a value vector of 1, 0, or -1, and the others 1.5, 0.5, -0.5, or -1.5, depending on the intensity of the emotion according to Plutchik's theory. For example, "joy" comes from the axis of joy-sadness; thus, its vector is [1,0,0,0], while the vector for "sadness" is [-1,0,0,0]. The word "ecstasy" is of higher intensity than "joy"; hence, its vector is [1.5,0,0,0]. It is to be noted that the minus sign indicates only that the emotion is on the other side of the axis and does not suggest a negative emotion in any case. Figure 3 explains the intensity and polarity of basic emotions according to the theory.

Fig. 3. Intensity and polarity of basic emotions.

For the dyads and other words, we calculate the similarity between these words and all primary emotions e_i. We assume that the higher the similarity, the closer the emotional state of the word to the primary emotion. The emotional vector of one word is the averaged result of all primary emotion vectors multiplied by the similarity weights, as in equation (2). Since there are four axes of basic emotions in Plutchik's theory, the result of this step is a four-dimensional vector of emotion features.

$$vec(word) = \frac{\sum_{i=1}^{n} \mathrm{sim}(word, e_i) \times vec(e_i)}{n} \qquad (2)$$

The final embedding is the concatenation of the word2vec-generated vector and the newly calculated emotional vector. In this research, our embedding is a 100-dimensional vector, 96 dimensions of which are generated with word2vec. The other four are generated using the preceding steps.
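To make the construction of a lexicon entry concrete, the sketch below (our own illustration, not the authors' released code) builds the 100-dimensional hybrid vector from equations (1) and (2). It assumes a pre-trained gensim word2vec model `w2v` with 96-dimensional vectors and a hand-specified table of primary-emotion axis vectors; both names, and the file path in the usage comment, are placeholders.

```python
import numpy as np
from gensim.models import KeyedVectors  # assumed: 96-d vectors trained on the corpus

# Manually assigned axis vectors for the eight primary emotions
# (joy-sadness, fear-anger, trust-disgust, surprise-anticipation), as in the text.
PRIMARY = {
    "joy": [1, 0, 0, 0], "sadness": [-1, 0, 0, 0],
    "fear": [0, 1, 0, 0], "anger": [0, -1, 0, 0],
    "trust": [0, 0, 1, 0], "disgust": [0, 0, -1, 0],
    "surprise": [0, 0, 0, 1], "anticipation": [0, 0, 0, -1],
}

def cosine(u, v):
    # Equation (1): cosine similarity between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def emotion_vector(word, w2v):
    # Equation (2): similarity-weighted average of the primary-emotion axis vectors
    acc = np.zeros(4)
    for emo, axis in PRIMARY.items():
        acc += cosine(w2v[word], w2v[emo]) * np.array(axis, dtype=float)
    return acc / len(PRIMARY)

def lexicon_entry(word, w2v):
    # 96-d word2vec features concatenated with the 4-d emotion features -> 100-d entry
    return np.concatenate([w2v[word], emotion_vector(word, w2v)])

# Usage (hypothetical path): w2v = KeyedVectors.load("imdb_w2v_96d.kv")
#                            vec = lexicon_entry("ecstasy", w2v)
```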
B. Visualization of the lexicon

Our lexicon consists of 181,276 lexical words, which is much larger than most previous lexicons proposed by other researchers. The NRC Emotion Lexicon [2] and WordNet-Affect [20] contain 25,000 and 2,876 synsets, respectively. Figure 4a is a visualization of the top 5,000 most popular lexical items and some of the basic emotions. The projection is done directly on the 100-dimensional vectors using the matplotlib library (https://matplotlib.org) and the t-SNE algorithm from the scikit-learn package (http://scikit-learn.org/stable/) for dimension reduction. Despite the fact that the visualization reduces each lexical item to only two dimensions, some interesting results can be observed in figure 4.

From the figure, we can see that except for the pair of fear and anger, the opposite basic emotions are located quite far from each other (subfigure 4a), which is the desirable outcome for the lexicon. Interestingly, in a small cluster in subfigure 4b, we observe three basic emotions: joy, fear, and anger. The surrounding lexical items, while appearing to be random at first, turn out to be more relevant on closer inspection.


Words such as pain, rage, evil, and curse are close to anger; pride, happiness, and beauty are close to joy. The dyad guilt, which according to Plutchik's theory is a combination of joy and fear (subfigure 1b), is also present in this small cluster. In the cluster of trust (subfigure 4c), we see lexical items that suggest agreement, such as nods, agreed, and appreciate. The results show that our lexicon learns the association between lexical items and the basic emotions and dyads.

Fig. 4. Visualization of the lexicon in two dimensions using matplotlib and scikit-learn's t-SNE. The opposite emotions are often far from each other, while lexical items with similar meanings are close, as in 4a. However, because we reduced the number of embedding dimensions from the original 100 to only 2, some clusters are mixed together, as in 4b. (a) Embedding of the top 5000 most frequent lexical items (small dots) and the basic emotions, which are annotated in red with arrows. We notice some clusters are overlapped by basic emotions. (b) Items with the most similarity to the basic emotions joy, fear, and anger. Dyads or intensified/milder emotions are boxed and annotated in blue. Rage is the intensification of anger, and guilt is a combination of joy and fear. (c) Items that are most similar to the basic emotion trust, such as nods, appreciate, together, and agreed.
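A projection of this kind can be reproduced with a few lines of scikit-learn and matplotlib. The sketch below is our own illustration under the assumption that `vectors` is an (N, 100) array of lexicon embeddings and `labels` the corresponding words; perplexity and other t-SNE settings are not stated in the paper and are left at library defaults.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

BASIC = ("joy", "sadness", "fear", "anger", "trust", "disgust", "surprise", "anticipation")

def plot_lexicon(vectors, labels, highlight=BASIC):
    """Project 100-d lexicon vectors to 2-D with t-SNE and scatter-plot them,
    annotating the eight basic emotions in red."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(np.asarray(vectors))
    plt.scatter(coords[:, 0], coords[:, 1], s=3, color="grey")
    for word, (x, y) in zip(labels, coords):
        if word in highlight:
            plt.annotate(word, (x, y), color="red")
    plt.show()
```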

C. Text to vector

We consider a bag-of-lexical-items approach to transform the raw input text into vector form. Therefore, for a piece of text, its representation is the sum vector of all the lexical items inside it. As our goal is to predict the emotional label for each utterance in a conversation, we also vectorize the previous utterance and the entire conversation to capture the contextual information (Figure 5). As a result, the vector representation of an utterance is a 300-dimensional vector, the concatenation of the utterance itself and the above-mentioned contextual information. This representation is then fed to the input layer of the neural network described in the following sections.
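A minimal sketch of this step is shown below, assuming the `lexicon_entry` helper from the lexicon sketch above; the helper, the token handling, and the choice of summing only tokens present in the embedding vocabulary are our own placeholders, not the authors' code.

```python
import numpy as np

def text_vector(tokens, w2v, dim=100):
    """Bag-of-lexical-items: sum of the 100-d lexicon vectors of the known tokens."""
    vecs = [lexicon_entry(t, w2v) for t in tokens if t in w2v]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

def utterance_vector(current, previous, conversation, w2v):
    """300-d input vector: current utterance, previous utterance, and the
    remaining utterances of the conversation, each as a 100-d bag vector."""
    return np.concatenate([
        text_vector(current, w2v),
        text_vector(previous, w2v),
        text_vector([t for utt in conversation for t in utt], w2v),
    ])
```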
D. Autoencoder semisupervised learning

Similar to [36], the goal of our autoencoder is to understand the representation of the input data. We hope that, in the process of encoding and reconstructing the input data, the underlying structure is revealed and that the model, after retraining on the labeled data, provides better results than it would if it depended solely on the labeled data.


Fig. 5. The vector representation of an utterance is a concatenation of the word2vec and emotion features of the current utterance, the previous utterance, and all other utterances in the conversation, as three 100-dimensional vectors.

Fig. 6. Structure of the autoencoder deep network: 1. Shared encoder layers (bottom). 2. Unsupervised part (top-left): a straight feed-forward decoder neural network with mirror settings from the encoder. 3. Supervised part (top-right): with eight-dimensional output and threshold layers corresponding to the eight basic emotions e_i.

From our perspective, the process of the autoencoder learning the representation of our data is similar to the language-learning process of an infant in its early years (see https://www.linguisticsociety.org/resource/faq-how-do-we-learn-language and http://www.ling.upenn.edu/courses/Fall_2011/ling001/acquisition.html). Both processes include the repetition of decomposing input into smaller components (morphemes in the infant's language learning and feature vectors in the autoencoder) and re-composing these components to produce meaningful output (babbling sounds in the infant's case and the reconstruction of the input in the autoencoder's case).

In our opinion, an autoencoder is much simpler because it is limited in the way it can interact with the "world". The "world" to an autoencoder is just the information we feed to it, while an infant can interact freely and deliberately with the real world to acquire more information and come up with its own abstract representation of the information it receives. The features in an autoencoder are just compressed information, while the infant can understand the meaning of each input and output and transform them into concepts and ideas.

Figure 6 displays the components of our network. During the unsupervised training phase, we use the encoder and decoder networks, and during the supervised training phase, we use the encoder and the classifier. In our implementation, TensorFlow [37] was used for its GPU-computing power. The entire system uses sigmoid activations and a gradient descent optimizer with a learning rate of 0.001. To avoid overfitting, we use dropout with a keep rate of 0.75 at all hidden layers.

1) Unsupervised training: The encoder is a straight feed-forward neural network with a 300-dimensional input layer and two 100-dimensional fully connected hidden layers. Logically, the decoder is the mirror image of the encoder with the same settings but in the reverse order. Unlike autoencoders in image-classification tasks, we do not add noise to the network. The least-squares error loss function for reconstructing the input is $L = (X - X')^2$, where X is the input vector and X' is the reconstructed vector. For the autoencoder to learn the underlying structure completely, we set the minibatch size to 128 and the number of training epochs to 200. (We are aware of LSTM seq2seq architectures that could handle dialogues directly instead of our method of appending the three vectors. However, LSTM is a more complex method with higher computational costs; our trial run of an LSTM on 100,000 utterances of unsupervised training data took much longer than our current method with no real improvement. Taking our current limits of computing power into consideration, we leave LSTM for future research.)
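The unsupervised stage can be sketched roughly as below. This is our own illustration using the Keras API that ships with TensorFlow rather than the authors' original TensorFlow code; the layer sizes, sigmoid activations, SGD with learning rate 0.001, dropout keep rate 0.75 (i.e., drop rate 0.25), batch size 128, and 200 epochs follow the text, while everything else is an assumption.

```python
import tensorflow as tf

def build_autoencoder(input_dim=300, hidden_dim=100):
    # Encoder: two 100-d fully connected sigmoid layers with dropout (keep rate 0.75)
    encoder = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_dim, activation="sigmoid", input_shape=(input_dim,)),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(hidden_dim, activation="sigmoid"),
        tf.keras.layers.Dropout(0.25),
    ])
    # Decoder: mirror of the encoder, reconstructing the 300-d input
    decoder = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_dim, activation="sigmoid", input_shape=(hidden_dim,)),
        tf.keras.layers.Dense(input_dim, activation="sigmoid"),
    ])
    autoencoder = tf.keras.Sequential([encoder, decoder])
    autoencoder.compile(optimizer=tf.keras.optimizers.SGD(0.001), loss="mse")
    return encoder, autoencoder

# Usage (X_unsup is the matrix of 300-d utterance vectors from section IV-C):
# encoder, ae = build_autoencoder()
# ae.fit(X_unsup, X_unsup, batch_size=128, epochs=200)
```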
2) Supervised retraining: After learning the representation of the input, the model is trained further with 10,000 labeled utterances. We use the biases and weights of the encoder to initialize the network, and an output layer and a threshold layer are added to produce the multi-label prediction (Figure 6). The output of our system is a set of predicted labels e_i for the eight basic emotions. The threshold layer is simply a set of thresholds t_i, one for each e_i. Let o_i be a node of the output layer. We have the following rule:

$$e_i = \begin{cases} 1, & \text{if } o_i \geq t_i \\ 0, & \text{otherwise} \end{cases}$$

The thresholds are initialized randomly and then updated after each epoch, in a manner similar to how we update the biases and weights. Only labels with output values greater than the corresponding threshold are considered valid. Initially, we used a fixed threshold for all the emotions but soon realized that a flexible set of thresholds is more effective and reasonable.

With Y the true label set and Y' the set of labels predicted by our model, we define the cross-entropy loss function as follows:

$$\mathrm{CrossEntropy} = -[Y \ln Y' + (1 - Y) \ln(1 - Y')] \qquad (3)$$

$$\mathrm{Loss} = \mathrm{CrossEntropy} + \lambda \sum_{j}^{l} \frac{w_j^2}{2} \qquad (4)$$

The global cost function is regularized with L2 regularization using the weights w_j of all layers, and the value of lambda λ is fixed to 0.01 in this study. As we use only a small part of the corpus as training data, the L2 regularization helps us avoid overfitting problems.
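A rough sketch of the supervised stage is given below, again as our own Keras-based illustration rather than the authors' implementation: the pre-trained encoder is reused, an eight-unit sigmoid output layer with L2 weight regularization (λ = 0.01) is added, training uses binary cross-entropy, and per-label thresholds turn the eight outputs into a multi-label prediction. In this sketch the thresholds are held in a NumPy array and would have to be tuned separately, rather than updated each epoch exactly as in the paper.

```python
import numpy as np
import tensorflow as tf

EMOTIONS = ["joy", "sadness", "fear", "anger", "trust", "disgust", "surprise", "anticipation"]

def build_classifier(encoder):
    # Reuse the pre-trained encoder and add an 8-d sigmoid output head with L2 regularization.
    model = tf.keras.Sequential([
        encoder,
        tf.keras.layers.Dense(len(EMOTIONS), activation="sigmoid",
                              kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(0.001), loss="binary_crossentropy")
    return model

def predict_labels(model, X, thresholds):
    """Apply the per-emotion thresholds t_i: e_i = 1 if o_i >= t_i else 0."""
    outputs = model.predict(X)
    return (outputs >= thresholds).astype(int)

# Usage sketch: clf = build_classifier(encoder); clf.fit(X_lab, Y_lab, batch_size=128, epochs=...)
# thresholds = np.full(len(EMOTIONS), 0.5)  # placeholder; the paper learns per-label thresholds
# Y_pred = predict_labels(clf, X_test, thresholds)
```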


V. EXPERIMENT

A. Experiment setting

As mentioned earlier, we used a set of 1,000 annotated utterances for testing our method. For the gold standard of the test data, we applied the majority rule to the annotations, which states that if a basic emotion is labeled by at least three of the five annotators, we accept it as a true label for the utterance.

Evaluation metrics: In our study, two common evaluation metrics, which have been popularly used in multi-label classification problems [3], [38], are employed to measure the performance of our system. Let Y_i be the set of true labels for a given instance i, and Y_i' the set of labels predicted by a system. Let N be the total number of instances. Then:

1) Hamming score, or accuracy in multi-label classification, gives the degree of similarity between the ground-truth set of labels and the predicted set of labels.

$$\mathrm{Hamming\ score} = \frac{1}{N} \sum_{i}^{N} \frac{|Y_i \cap Y_i'|}{|Y_i \cup Y_i'|} \qquad (5)$$

2) F1-measure: the harmonic mean of precision and recall. In our study, we give equal importance to precision and recall.

$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (6)$$

Precision is the fraction of correctly predicted labels among all the predicted labels in the set.

$$\mathrm{Precision} = \frac{1}{N} \sum_{i}^{N} \frac{|Y_i \cap Y_i'|}{|Y_i'|} \qquad (7)$$

Recall is the fraction of correctly predicted labels among all the true labels in the set.

$$\mathrm{Recall} = \frac{1}{N} \sum_{i}^{N} \frac{|Y_i \cap Y_i'|}{|Y_i|} \qquad (8)$$
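The example-based metrics in equations (5)–(8) can be computed directly from the predicted and gold label sets; the sketch below is our own illustration. Empty-set corner cases are handled with simple conventions (a perfect per-instance score when both sets are empty, zero otherwise), which is our assumption rather than something the paper spells out.

```python
def multilabel_scores(true_sets, pred_sets):
    """Example-based Hamming score, precision, recall, and F1 over label sets (eqs. 5-8)."""
    n = len(true_sets)
    hamming = precision = recall = 0.0
    for Y, Yp in zip(true_sets, pred_sets):
        Y, Yp = set(Y), set(Yp)
        inter, union = len(Y & Yp), len(Y | Yp)
        hamming += inter / union if union else 1.0   # both sets empty -> perfect match
        precision += inter / len(Yp) if Yp else 0.0
        recall += inter / len(Y) if Y else 0.0
    hamming, precision, recall = hamming / n, precision / n, recall / n
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"hamming": hamming, "precision": precision, "recall": recall, "f1": f1}

# Example:
# multilabel_scores([{"joy", "trust"}, {"anger"}], [{"joy"}, {"anger", "fear"}])
```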
B. Experimental results

To evaluate our system, a comparison must be made against other systems. We replicated the works of others and applied them to our new corpus. [29] is a similar work that also used Plutchik's theory of basic emotions and worked on multi-label data, without considering the intensity of the emotion labels. That study achieved an F1 score of 45.6 on its own dataset of 629 sentences of user-generated movie reviews. Like [29], we use Meka's RAkEL method with a bag-of-words approach as the first baseline. However, it would be unfair to apply [29]'s system to our corpus and make comparisons, since it is fine-tuned specifically for their corpus. Their study considers neither the emotions of each sentence nor the contextual information. Therefore, the second system is Meka's DBPNN, which is said to have a better performance than RAkEL [39]. We consider RAkEL and DBPNN to be state-of-the-art systems for multi-label classification of emotions in text.

To our knowledge, the most important baseline is human annotation. To obtain this baseline, we calculated the evaluation metrics based on the average agreement score between each annotator and the gold standard, as mentioned in section III. We evaluated the performance of our system under different settings: autoencoder semisupervised learning, unsupervised self-learning, and supervised learning using labeled data only. Figure 7 compares the performance of our system to the baselines.

vs. bag-of-words approaches: Our system, under its different settings, performed remarkably better than the simple approaches using Meka's DBPNN and RAkEL. The supervised setting used the same dataset as the other two methods, and none of them took advantage of the unsupervised data. Yet the supervised setting outperformed the two methods by 14 and 19 points in Hamming score and F1 score, respectively. We believe that the context features and our emotion lexicon were the deciding factors.

Our system: autoencoder vs. self-learning: We can clearly see that the autoencoder has a better performance compared to the self-learning method. While both are semisupervised methods, self-learning uses the supervised data to produce a model. This model tries to classify the unsupervised data and assimilates the obtained result to retrain. Naturally, when the size of the unsupervised data becomes larger than that of the supervised data, the model starts dealing with more and more unseen examples and the performance drops. Figure 8 confirms this explanation.

Our system: semisupervised vs. supervised: The performance of the autoencoder is much better than that of the supervised method. Our system first learns and tries to imitate the enormous number of unsupervised examples. During the retraining process, it figures out the connections between the concepts that it has imitated and the true results. Therefore, it has a better understanding of the data and makes better predictions than when using only the supervised examples.

vs. human annotators: This is the most important baseline, which explains how well our system performs in comparison with a human. Please note that this evaluation of the human annotators is the average agreement between each annotator and the gold-standard data (decided by the majority rule as discussed in the earlier section). Our system's performance is slightly worse than that of the human annotators, by 4 points in Hamming score and 6 points in F1. However, we should also take into consideration that the input for the human annotators is movies with full video, sound, and transcript text, while the input for our system is only the transcripts. We acknowledge that the different inputs of the annotators and our system may affect the results. However, as the purpose of this research is to make a system that can identify emotions in text messages, we did not incorporate features from other modalities.

In short, our system with autoencoder semisupervised learning performs much better than the existing methods, not only because of the emotion lexicon that we built and its way of extracting contextual information but also because of its ability to use the largely available unsupervised data. When relying only on textual data, our system performs slightly worse than human annotators, for whom full movie clips serve as the input.

C. Publication of the data and possibility of replicating the work in other domains

The annotated data and the emotion lexicon will soon be published in the author's GitHub repository.

The movie conversation domain is very close to real-life settings. Therefore, we believe that the model and lexicon would perform reasonably well when applied to other domains. Our work can be replicated easily in other domains without a lot of effort, using little annotated data and the automatic emotion lexicon extraction process.


Fig. 7. Evaluation of the system: 1) human annotators, 2) our system using the autoencoder, 3) our system using self-learning, 4) our supervised system, 5) RAkEL, and 6) DBPNN.

Fig. 8. Performance comparison between the autoencoder, self-learning, and supervised settings of the system: the autoencoder achieves the best performance, growing steadily as the amount of unsupervised data increases, while self-learning performs worse and worse. The supervised method does not use any unsupervised data; thus, it is not affected.

VI. CONCLUSION & FUTURE WORK

In this paper, we proposed a method of detecting and classifying multiple emotions in the IMDb movie quotes corpus. The corpus is a set of conversation transcripts annotated with multi-label emotions following Plutchik's notion of basic emotions and dyads. Our method involves building an emotion lexicon using the word2vec word-embedding technique, extracting a vectorized representation of the input, and classifying the emotions in a semi-supervised manner with the help of an autoencoder that exploits both the unlabeled and labeled data. The experiments show that the ability of our method to detect emotion is comparable to that of a human annotator despite the fact that it received only text as input. By building our system in a movie conversation domain, we attempted to produce a close imitation of real conversations. We hope that our system will perform well in real-life settings, such as SMS texting and messenger applications, and will play a supporting role in identifying emotions in speech processing. We are planning to apply our work to other domains and expect a similar level of performance.

In the future, it will be very interesting to incorporate features from all modalities, such as text, speech, and video signals, into our current system, and once again evaluate how well it performs against a human. Another useful and important aspect for future research is to link the emotions to entities in the conversation. From our perspective, such a feature might play an important part in future emotion analysis systems, as it can indicate how the machine should react and respond to human emotional remarks.

ACKNOWLEDGMENT

This research was supported by JST CREST Grant Number JPMJCR1513, Japan. We are grateful to our colleagues from the Computational Linguistics Lab, NAIST, Japan, who provided insights and expertise that greatly assisted in the research. We also appreciate all the annotators who participated in this research for their valuable efforts and patience. We would like to thank Editage (www.editage.jp) for English language editing.

REFERENCES

[1] C. Strapparava and R. Mihalcea, "Semeval-2007 task 14: Affective text," in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 70–74.


[2] S. Mohammad, "Portable features for classifying emotional text," in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2012, pp. 587–591.
[3] S. Li, L. Huang, R. Wang, and G. Zhou, "Sentence-level emotion classification with label and context dependence," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China: Association for Computational Linguistics, July 2015, pp. 1045–1053. [Online]. Available: http://www.aclweb.org/anthology/P15-1101
[4] S. Dave and H. Diwanji, "Trend analysis in social networking using opinion mining: a survey," 2015.
[5] J. Bollen, H. Mao, and A. Pepe, "Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena," 2011.
[6] G. Collier, Emotional expression. Psychology Press, 2014.
[7] D.-A. Phan, H. Shindo, and Y. Matsumoto, "Multiple emotions detection in conversation transcripts," PACLIC 30, p. 85, 2016.
[8] R. Plutchik, "A general psychoevolutionary theory of emotion," Theories of emotion, vol. 1, pp. 3–31, 1980.
[9] ——, "The nature of emotions: human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice," American Scientist, vol. 89, no. 4, pp. 344–350, 2001.
[10] M. Li, Q. Lu, Y. Long, and L. Gui, "Inferring affective meanings of words from word embedding," IEEE Transactions on Affective Computing, vol. 8, no. 4, pp. 443–456, 2017.
[11] R. Řehůřek and P. Sojka, "Software framework for topic modelling with large corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45–50, http://is.muni.cz/publication/884893/en.
[12] P. Ekman, W. V. Friesen, M. O'Sullivan, A. Chan, I. Diacoyanni-Tarlatzis, K. Heider, R. Krause, W. A. LeCompte, T. Pitcairn, P. E. Ricci-Bitti et al., "Universals and cultural differences in the judgments of facial expressions of emotion," Journal of Personality and Social Psychology, vol. 53, no. 4, p. 712, 1987.
[13] R. A. Calvo and S. Mac Kim, "Emotions in text: dimensional and categorical models," Computational Intelligence, vol. 29, no. 3, pp. 527–543, 2013.
[14] L.-C. Yu, J. Wang, K. R. Lai, and X.-j. Zhang, "Predicting valence-arousal ratings of words using a weighted graph method," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), vol. 2, 2015, pp. 788–793.
[15] L.-C. Yu, L.-H. Lee, S. Hao, J. Wang, Y. He, J. Hu, K. R. Lai, and X. Zhang, "Building chinese affective resources in valence-arousal dimensions," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 540–545.
[16] J. Posner, J. A. Russell, and B. S. Peterson, "The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology," Development and Psychopathology, vol. 17, no. 03, pp. 715–734, 2005.
[17] S. M. Mohammad and P. D. Turney, "Emotions evoked by common words and phrases: Using mechanical turk to create an emotion lexicon," in Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text. Association for Computational Linguistics, 2010, pp. 26–34.
[18] A. B. Warriner, V. Kuperman, and M. Brysbaert, "Norms of valence, arousal, and dominance for 13,915 english lemmas," Behavior Research Methods, vol. 45, no. 4, pp. 1191–1207, 2013.
[19] M. M. Bradley and P. J. Lang, "Affective norms for english words (anew): Instruction manual and affective ratings," Citeseer, Tech. Rep.
[20] C. Strapparava, A. Valitutti et al., "Wordnet affect: an affective extension of wordnet," Citeseer, 2004.
[21] B. Heredia, T. M. Khoshgoftaar, J. Prusa, and M. Crawford, "Cross-domain sentiment analysis: an empirical investigation," in Information Reuse and Integration (IRI), 2016 IEEE 17th International Conference on. IEEE, 2016, pp. 160–165.
[22] S. K. D'Mello, S. D. Craig, J. Sullins, and A. C. Graesser, "Predicting affective states expressed through an emote-aloud procedure from autotutor's mixed-initiative dialogue," International Journal of Artificial Intelligence in Education, vol. 16, no. 1, pp. 3–28, 2006.
[23] C. Yang, K. H.-Y. Lin, and H.-H. Chen, "Emotion classification using web blog corpora," in Web Intelligence, IEEE/WIC/ACM International Conference on. IEEE, 2007, pp. 275–278.
[24] T. Hasegawa, N. Kaji, N. Yoshinaga, and M. Toyoda, "Predicting and eliciting addressee's emotion in online dialogue," in ACL (1), 2013, pp. 964–972.
[25] K. Brinker, J. Fürnkranz, and E. Hüllermeier, "A unified model for multilabel classification and ranking," in Proceedings of the 2006 Conference on ECAI 2006: 17th European Conference on Artificial Intelligence, August 29–September 1, 2006, Riva del Garda, Italy. IOS Press, 2006, pp. 489–493.
[26] G. Tsoumakas and I. Vlahavas, "Random k-labelsets: An ensemble method for multilabel classification," Machine Learning: ECML 2007, pp. 406–417, 2007.
[27] G. Tsoumakas and I. Katakis, "Multi-label classification: An overview," International Journal of Data Warehousing and Mining, vol. 3, no. 3, 2006.
[28] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[29] L. Buitinck, J. Van Amerongen, E. Tan, and M. de Rijke, "Multi-emotion detection in user-generated reviews," in Advances in Information Retrieval. Springer, 2015, pp. 43–48.
[30] J. T. Hancock, K. Gee, K. Ciaccio, and J. M.-H. Lin, "I'm sad you're sad: emotional contagion in cmc," in Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work. ACM, 2008, pp. 295–298.
[31] S. I. Rauma, "Cinematic dialogue, literary dialogue, and the art of adaptation: dialogue metamorphosis in the film adaptation of the green mile," 2004.
[32] P. Forchini, "Movie language revisited."
[33] I. V. Serban, R. Lowe, L. Charlin, and J. Pineau, "A survey of available corpora for building data-driven dialogue systems," CoRR, vol. abs/1512.05742, 2015. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1512.html#SerbanLCP15
[34] J. Cohen, "Kappa: Coefficient of concordance," Educ. Psych. Measurement, vol. 20, p. 37, 1960.
[35] R. Artstein and M. Poesio, "Inter-coder agreement for computational linguistics," Computational Linguistics, vol. 34, no. 4, pp. 555–596, 2008.
[36] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning, "Semi-supervised recursive autoencoders for predicting sentiment distributions," in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 151–161.
[37] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[38] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Advances in Knowledge Discovery and Data Mining. Springer, 2004, pp. 22–30.
[39] P. Fernandez-Gonzalez, C. Bielza, and P. Larranaga, "Multidimensional classifiers for neuroanatomical data," in ICML Workshop on Statistics, Machine Learning and Neuroscience (Stamlins 2015), 2015.


Phan Duc Anh received his B.E. degree from Hanoi University of Science and Technology (HUST), Vietnam, in 2011, and his M.Sc. degree from the University of Poitiers, France, in 2013. He is currently pursuing the Ph.D. degree at the Graduate School of Information Science, Nara Institute of Science and Technology (NAIST). His research interests include natural language processing and machine learning, in particular, emotion analysis and opinion mining.

Yuji Matsumoto is currently a Professor of Information Science at the Nara Institute of Science and Technology. He received his M.Sc. and Ph.D. degrees in information science from Kyoto University in 1979 and 1989, respectively. He joined the Machine Inference Section of the Electrotechnical Laboratory in 1979. He then served as an academic visitor at the Imperial College of Science and Technology, a deputy chief of the First Laboratory at ICOT, and an associate professor at Kyoto University. His main research interests are natural language understanding and machine learning.

Hiroyuki Shindo received his B.E. and M.E. degrees from Waseda University, Japan, in 2007 and 2009, respectively, and his Ph.D. degree in engineering from Nara Institute of Science and Technology (NAIST), Ikoma, Japan, in 2013. He is currently working as an assistant professor at NAIST in the Graduate School of Information Science. From 2009 to 2014, he was a researcher at NTT Communication Science Laboratories. His research interests include machine learning and computational linguistics.

