
DepecheMood++: a Bilingual Emotion Lexicon Built Through Simple Yet Powerful Techniques

Oscar Araque (1), Lorenzo Gatti (2), Jacopo Staiano (3), Marco Guerini (4,5)

(1) Grupo de Sistemas Inteligentes, Departamento de Ingeniería Telemática, Universidad Politécnica de Madrid, E.T.S.I. de Telecomunicación, Madrid – Spain
(2) Human Media Interaction, University of Twente, Enschede – The Netherlands
(3) reciTAL, 34 boulevard de Bonne Nouvelle, 75010 Paris – France
(4) Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento – Italy
(5) AdeptMind Scholar – Canada

o.araque@upm.es, l.gatti@utwente.nl, jacopo@recital.ai, guerini@fbk.eu

Abstract

Several lexica for sentiment analysis have been developed and made available in the NLP community. While most of these come with word polarity annotations (e.g. positive/negative), attempts at building lexica for finer-grained emotion analysis (e.g. happiness, sadness) have recently attracted significant attention. Such lexica are often exploited as a building block in the process of developing learning models for which emotion recognition is needed, and/or used as baselines against which to compare the performance of such models. In this work, we contribute two new resources to the community: a) an extension of an existing and widely used emotion lexicon for English; and b) a novel version of the lexicon targeting Italian. Furthermore, we show how simple techniques can be used, both in supervised and unsupervised experimental settings, to boost performance on datasets and tasks of varying degrees of domain-specificity.

1 Introduction

Obtaining high-quality and high-coverage lexica is an active subject of research (Mohammad and Turney, 2010). Traditionally, lexicon acquisition can be done in two distinct ways: either manual creation (e.g. crowdsourced annotation) or automatic derivation from already annotated corpora. While the former approach provides more precise lexica, the latter usually grants a higher coverage. Regardless of the approach chosen, when used as baselines or as additional features for learning models, lexica are often "taken for granted", meaning that the performances against which a proposed model is evaluated are rather weak, a fact that can arguably slow down progress in the field. Thus, in this paper we first investigate whether simple and computationally cheap techniques (e.g. document filtering, text pre-processing, frequency cut-off) can be used to improve both precision and coverage of a state-of-the-art lexicon that has been automatically inferred from a dataset of emotionally tagged news. Then, we try to answer the following research questions:

• Can straightforward machine learning techniques that only rely on lexicon scores provide even more challenging baselines for complex emotion analysis models, under the constraint of keeping the required pre-processing at a minimum?

• Are such techniques portable across languages?

• Can the coverage of a given lexicon be significantly increased using a straightforward and effective methodology?

To do so, we build upon the methodology proposed in (Staiano and Guerini, 2014; Guerini and Staiano, 2015), the publicly available DepecheMood lexicon described therein, and the corresponding details of the source dataset we were provided with.

We evaluate and release to the community an extension of the original lexicon built on a larger dataset, as well as a novel emotion lexicon targeting the Italian language and built with the same methodology. We perform experiments on six datasets/tasks exhibiting a wide diversity in terms of domain (namely: news, blog posts, mental health forum posts, Twitter), language (English and Italian), setting (both supervised and unsupervised), and task (regression and classification). The results obtained show that:

1. training straightforward classifiers/regressors from a high-coverage/high-precision lexicon, derived from general news data, allows one to obtain good performance even on domain-specific tasks, and provides more challenging baselines for complex task-specific models;

2. depending on the characteristics of the target language, specific pre-processing steps (e.g. lemmatization in the case of morphologically-rich languages) can be beneficial;

3. the coverage of the original lexicon can be extended using embeddings, and such a technique can provide performance improvements.

2 Related Work

Here we provide a short review of efforts towards building sentiment and emotion lexica; the interested reader can find a more thorough overview in (Pang and Lee, 2008; Liu and Zhang, 2012; Wilson et al., 2004; Paltoglou et al., 2010).

2.1 Sentiment Lexica

A number of sentiment lexica have been developed over the years, with considerable differences in the number of annotated words (from a thousand to hundreds of thousands), the values they associate with a single word (from binary to multi-class, to finer-grained scales), and the way these ratings are collected (manually or automatically). Here, we only report some notable and accessible examples.

General Inquirer (Stone et al., 1966) is one of the earliest resources of this kind, and provides binary ratings for about 4k sentiment words, as well as a number of syntactic, semantic, and pragmatic categories. More than three times larger, the resource by Warriner et al. (2013) provides fine-grained ratings for 14k frequent-usage words, obtained by averaging the crowdsourced answers of multiple annotators. This dataset is an extension of the Affective Norms for English Words (ANEW), which reports similar scores for a set of 1k words (Bradley and Lang, 1999). It is worth noting that ANEW valence scores have been manually assigned by several annotators, leading to an increase in precision.

Following the ANEW methodology, a microblogging-oriented resource called AFINN has been introduced by Nielsen (2011). Its latest version comprises 2477 words and phrases that have been manually annotated. As shown by the original author, the precision of the AFINN resource, in comparison to other lexica, can be higher when applied to the analysis of microblogging platforms. Similarly, SO-CAL (Taboada et al., 2011) entries have been generated by a small group of human annotators. Such annotation follows a multi-class approach, obtaining a finer resolution in the valence scores, which range from -5 (very negative) to 5 (very positive); further, these valence scores have been subsequently validated using crowd-sourcing, with the final size of the resource amounting to over 4k words.

Another relevant resource is SentiWordNet (Baccianella et al., 2010), which has been generated from a few seed terms to annotate each word sense of WordNet (Fellbaum, 1998) with both a positive and a negative score, as well as an objectivity score, in the [0, 1] range. Building on top of it, the SentiWords resource (Gatti et al., 2016) has been generated by using machine learning to improve the precision of these scores, annotating all the 144k lemmas of WordNet: taking into account the valence expressed in manually annotated lexica, the proposed method predicts the valence score of previously unseen words.

Several works in the literature make use of the lexicon presented in (Hu and Liu, 2004): this dictionary consists of more than 6k words, including frequent sentiment words, slang words, misspelled terms, and common variants. The annotations are automated using adjective words as seeds, and expanding the valence values through the synonym and antonym relations between words expressed in WordNet. A recent work (Mohammad, 2018a), the NRC Valence, Arousal, Dominance Lexicon, contains 20k terms annotated with valence, arousal and dominance: the proposed generation process relies on a method known as Best-Worst Scaling (Kiritchenko and Mohammad, 2016), which aims at avoiding some issues common to human annotators.

2.2 Emotion Lexica

While many sentiment lexica have been produced, fewer linguistic resources for emotion research are described in the literature. Among these, a well-known resource is WordNet-Affect (Strapparava and Valitutti, 2004), a manually-built extension of WordNet in which about 1k lemmas are assigned a label taken from a hierarchy of 311 affective labels, including Ekman's six emotions (Ekman and Friesen, 1971).
Domain     Name              No. entries  Annotation type  Annotation process  Reference
Sentiment  General Inquirer  4k           Categorical      Manual              (Stone et al., 1966)
Sentiment  ANEW              1k           Numerical        Manual              (Bradley and Lang, 1999)
Sentiment  -                 6k           Categorical      Automatic           (Hu and Liu, 2004)
Sentiment  SentiWordNet      117k         Numerical        Automatic           (Baccianella et al., 2010)
Sentiment  AFINN             2k           Numerical        Manual              (Nielsen, 2011)
Sentiment  SO-CAL            4k           Numerical        Manual              (Taboada et al., 2011)
Sentiment  -                 14k          Numerical        Manual              (Warriner et al., 2013)
Sentiment  SentiWords        144k         Numerical        Automatic           (Gatti et al., 2016)
Sentiment  NRC VAD           20k          Numerical        Manual              (Mohammad, 2018a)
Emotion    Fuzzy Affect      4k           Numerical        Manual              (Subasic and Huettner, 2001)
Emotion    WordNet-Affect    1k           Categorical      Manual              (Strapparava and Valitutti, 2004)
Emotion    Affect database   2k           Numerical        Manual              (Neviarouskaya et al., 2010)
Emotion    AffectNet         10k          Categorical      Automatic           (Cambria et al., 2011)
Emotion    EmoLex            14k          Categorical      Manual              (Mohammad and Turney, 2013)
Emotion    NRC Hashtag       16k          Numerical        Automatic           (Mohammad and Kiritchenko, 2015)
Emotion    NRC Affect        6k           Numerical        Manual              (Mohammad, 2018b)

Table 1: Overview of related Sentiment and Emotion lexica.

AffectNet (Cambria et al., 2011) is a semantic network containing about 10k items, created by blending entries from ConceptNet (Havasi et al., 2007) and the emotional labels of WordNet-Affect.

Similarly, the Affect database (Neviarouskaya et al., 2010) contains 2.5k lemmas taken from WordNet-Affect, manually enriched with the strength of their association with Izard's basic emotions (Izard, 1977). EmoLex (Mohammad and Turney, 2013) is a crowdsourced lexicon containing 14k lemmas, each annotated with binary associations to Plutchik's eight emotions (Plutchik, 1980).

Further, a fuzzy approach is taken by Subasic and Huettner (2001), who provide 4k entries manually annotated over a range of 80 emotion labels. A recent resource is the NRC Affect Intensity Lexicon (Mohammad, 2018b), which includes 6k entries manually annotated with a set of four basic emotions: joy, fear, anger, and sadness. For this lexicon, similarly to before, the Best-Worst Scaling method was used. A similar line of work produced the NRC Hashtag Emotion Lexicon (Mohammad and Kiritchenko, 2015): with a coverage of over 16k unigrams, this resource has been automatically inferred from microblogging messages distantly annotated with emotional hashtags. As such, this lexicon is particularly useful when applied to the Twitter domain.

3 DepecheMood++

In this section we provide details on the techniques and datasets we used to create DepecheMood++ (DM++ for short), an extension of the DepecheMood lexicon (which from now on we will refer to as DepecheMood2014, or DM2014). The original lexicon (Staiano and Guerini, 2014), built in a completely automated and domain-agnostic fashion, has been extensively used by the research community and has demonstrated high performance even in domain-specific tasks, often outperformed only by domain-specific lexica/systems; see for instance (Bobicev et al., 2015).

The new version we release in this work is made available for both English and Italian. While the English version of DM++ is an improved version of DM2014 built using a larger dataset, the Italian one is completely new and, to the best of our knowledge, is the first publicly-available large-scale emotion lexicon for this language.

3.1 Data Used

To build DepecheMood2014, the original authors exploited a dataset consisting of 13.5M words from 25.3K documents, with an average of 530 words per document.

As previously mentioned, in this paper we use an expanded source dataset in order to i) re-build the English lexicon on a larger corpus, and ii) build a novel lexicon targeting the Italian language. To this end, we used an extended corpus harvested for a subsequent study on emotions and virality (Guerini and Staiano, 2015) – besides the English articles from rappler.com over a longer time span, this corpus includes crowd-annotated news articles in Italian from corriere.it.

In brief, rappler.com is a "social-news" website that embeds a small interface, called Mood Meter, in every article it publishes, allowing its readers to express with a click their emotional reaction to the story they are reading. Similarly, corriere.it, the online version of the very popular Italian newspaper Corriere della Sera, adopts a comparable approach, based on emoticons, to sense the emotional reactions of its readers. We note that the latter has since discontinued its "emotional" widgets, removing them also from the archived articles, a fact that contributes to the relevance of our effort to release the Italian DM++ lexicon.

In Table 2 we report a quantitative description of the data collected from Rappler and Corriere. For more details, we refer the reader to the original work of Guerini and Staiano (2015).

                      Rappler    Corriere
Documents             53,226     12,437
Tot. words            18.17 M    4.04 M
Words per document    341        325
No. of annotations    1,145,543  320,697

Table 2: Corpus statistics.

While previous research efforts have exploited documents with emotional annotations for various affect-related tasks (Mishne, 2005; Strapparava and Mihalcea, 2008; Bellegarda, 2010; Tang et al., 2014), the data used in these works share the limitation of only providing discrete labels, rather than a continuous score for each emotional dimension. Moreover, these annotations were performed by the document author rather than the readers. Conversely, in this work we leverage the fact that rappler.com and corriere.it readers can select as many emotions as desired, so that the resulting annotations represent a distribution of emotional scores for each article.

3.2 Emotion Lexica Creation

Consistently with Staiano and Guerini (2014), the lexica creation methodology consists of the following steps:

1. First, we produced a document-by-emotion matrix (M_DE) per language, containing the voting percentages for each document in the eight affective dimensions available on rappler.com for English and the six available on corriere.it for Italian.

2. Then, we computed the word-by-document matrices (M_WD) using normalized frequencies.

3. After that, we applied matrix multiplication between the word-by-document and document-by-emotion matrices (M_WD · M_DE) to obtain a (raw) word-by-emotion matrix M_WE. This allows us to 'merge' words with emotions by summing, over all documents, the product of the weight of a word and the weight of each emotion.

Finally, we transformed M_WE by first applying normalization column-wise (so as to eliminate the over-representation of happiness) and then scaling the data row-wise so that each row sums to one.
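To make these steps concrete, the following is a minimal sketch in Python/NumPy. It is our own illustration, not the released implementation, and all names (docs, votes, and the function itself) are ours:

```python
import numpy as np

def build_word_emotion_matrix(docs, votes):
    # docs: list of tokenized documents; votes: (n_docs x n_emotions)
    # array of crowd-voting percentages, one row per document.
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}

    # M_DE: document-by-emotion matrix of voting percentages.
    m_de = np.asarray(votes, dtype=float)

    # M_WD: word-by-document matrix of normalized frequencies.
    m_wd = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc:
            m_wd[index[w], j] += 1.0
        m_wd[:, j] /= max(len(doc), 1)

    # M_WE = M_WD . M_DE: raw word-by-emotion matrix.
    m_we = m_wd @ m_de

    # Column-wise normalization, then row-wise scaling to sum to one.
    m_we = m_we / m_we.sum(axis=0, keepdims=True)
    m_we = m_we / m_we.sum(axis=1, keepdims=True)
    return vocab, m_we
```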
Excerpts of the final matrices M_WE for English and Italian are presented in Tables 3 and 4: they can be interpreted as lists of words with scores that represent how much weight a given word has in each affective dimension. These matrices, which we call DepecheMood++ (in French, 'dépêche' means dispatch/news), represent our emotion lexica for English and Italian, and are freely available for research purposes at https://git.io/fxGAP.

3.3 Validation and Optimization

In this section we describe the several configurations explored to generate the DepecheMood++ lexicon. To fairly assess the performance of each configuration, we employ randomly selected validation sets amounting to 25% of the articles in our data, for both rappler.com and corriere.it. These left-out sets are used in all the following evaluation experiments.

Also, in order to facilitate comparisons with previous works, we used the simple approach adopted by both Staiano and Guerini (2014) and Strapparava and Mihalcea (2008): for a given headline, a single value for each affective dimension is computed by simply averaging the DepecheMood++ affective scores of all the words contained in the headline; Pearson correlation is then measured by comparing this averaged value to the annotation for the headline.
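A minimal sketch of this scoring-and-correlation procedure follows; it assumes a lexicon dictionary mapping each word to its per-emotion scores (the structure and names are ours, not part of the released resource):

```python
from scipy.stats import pearsonr

def score_text(words, lexicon, emotion):
    # Average the lexicon scores of the known words; 0.0 if none are known.
    known = [lexicon[w][emotion] for w in words if w in lexicon]
    return sum(known) / len(known) if known else 0.0

def headline_correlation(headlines, gold, lexicon, emotion):
    # Pearson correlation between averaged scores and gold annotations.
    predicted = [score_text(words, lexicon, emotion) for words in headlines]
    r, _ = pearsonr(predicted, [g[emotion] for g in gold])
    return r
```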
Word        AFRAID  AMUSED  ANGRY  ANNOYED  DON'T CARE  HAPPY  INSPIRED  SAD
awe         0.04    0.18    0.04   0.12     0.14        0.12   0.32      0.04
criminal    0.08    0.11    0.27   0.16     0.11        0.09   0.07      0.10
dead        0.17    0.07    0.18   0.08     0.07        0.05   0.08      0.29
funny       0.04    0.31    0.04   0.13     0.17        0.09   0.16      0.06
warning     0.30    0.07    0.13   0.12     0.08        0.07   0.06      0.16
rapist      0.09    0.08    0.37   0.08     0.18        0.08   0.06      0.07
virtuosity  0.00    0.24    0.00   0.01     0.00        0.41   0.33      0.01

Table 3: An excerpt of M_WE for English. The dominant emotion of each word is highlighted for readability purposes.
Word        ANNOYED  AFRAID  SAD   AMUSED  HAPPY
stupore     0.18     0.16    0.16  0.28    0.22
criminale   0.21     0.28    0.16  0.20    0.15
morto       0.19     0.19    0.39  0.12    0.11
divertente  0.10     0.13    0.15  0.32    0.30
allarme     0.16     0.25    0.34  0.14    0.11
stupratore  0.40     0.11    0.16  0.18    0.15
virtuoso    0.14     0.18    0.14  0.25    0.29

Table 4: An excerpt of M_WE for Italian. The dominant emotion of each word is highlighted for readability purposes.

Word Representations. Throughout the following experiments, we consider three word representations, corresponding to three different pre-processing levels: (i) tokenization, (ii) lemmatization, and (iii) lemmatization combined with Part-of-Speech (PoS) tagging (the only representation used in DM2014). We denote these variations as token, lemma and lemma#PoS, respectively.

Untagged Document Filtering. First, we noted that a significant percentage of the training documents has no emotional annotations: this figure amounts to 8% for rappler.com and 16% for corriere.it. We therefore compare two versions of the proposed lexica, built using two different sets of documents: (i) all documents, as done in the original DM2014; and (ii) only documents with a non-zero emotion annotation vector. The results of this experiment are shown in Table 5.

Dataset              token  lemma  lemma#PoS
Rappler (all)        0.31   0.31   0.30
Rappler (filtered)   0.33   0.32   0.32
Corriere (all)       0.22   0.25   0.24
Corriere (filtered)  0.27   0.30   0.30

Table 5: Averaged Pearson's correlation over all emotions on the left-out sets for Rappler and Corriere, using all documents in the datasets vs. filtering out those without emotional annotations.

It is evident that training only on documents with emotion annotations leads to improved results, which could indicate that untagged documents add noise to the lexicon generation process. Consequently, we use this improved variant in the following experiments.

Frequency Cutoff. We also explore different word frequency cutoff values to find a threshold that removes noisy items without eliminating informative ones (in DM2014 no cutoff was used, hence hapax legomena were also included in the vocabulary). The performance of DepecheMood++ under different cutoff values is reported in Table 6: the best performance was obtained with a cutoff value of 10 for both Rappler and Corriere, and we therefore use this value in the following experiments. In Table 7 we also report the vocabulary size as a function of the cutoff value.

         Rappler              Corriere
Cutoff   token  lemma  l#p    token  lemma  l#p
1        0.33   0.32   0.31   0.26   0.30   0.29
10       0.33   0.33   0.33   0.27   0.31   0.30
20       0.33   0.33   0.32   0.25   0.30   0.29
50       0.33   0.32   0.31   0.21   0.28   0.27
100      0.31   0.31   0.30   0.16   0.24   0.24

Table 6: Frequency cutoff impact on Pearson's correlation on the left-out sets for Rappler and Corriere.

         Rappler              Corriere
Cutoff   token  lemma  l#p    token  lemma  l#p
1        165k   154k   249k   116k   72k    81k
10       37k    30k    44k    20k    13k    13k
20       26k    20k    29k    12k    8k     8k
50       16k    12k    16k    6k     4k     4k
100      10k    8k     10k    3k     3k     3k

Table 7: Number of words in the generated lexica using different cutoff values.
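Both filters are simple to express in code; the sketch below is our own illustration of the two steps (dropping documents with an all-zero annotation vector, then applying a corpus-level frequency cutoff), not the authors' implementation:

```python
from collections import Counter

def filter_corpus(docs, votes, cutoff=10):
    # (i) keep only documents with a non-zero emotion annotation vector;
    kept = [(doc, v) for doc, v in zip(docs, votes) if any(v)]
    # (ii) keep only words occurring at least `cutoff` times overall.
    counts = Counter(w for doc, _ in kept for w in doc)
    vocab = {w for w, c in counts.items() if c >= cutoff}
    return [([w for w in doc if w in vocab], v) for doc, v in kept]
```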
[Figure omitted: average correlation over all emotions as a function of training corpus size, for the token, lemma and lemma#PoS representations; panels: (a) Rappler, (b) Corriere.]

Figure 1: Token - lemma - lemma#PoS comparison in both Rappler and Corriere.

Learning Curves. Next, we aim to understand whether there is a limit, in terms of training dataset size, after which performance saturates (indicating that further expansions of the corpus would not be beneficial). To this end, we vary the amount of documents used to build the lexica using the three different text pre-processing strategies (tokens, lemmatization, and lemmatization with PoS) and evaluate their performance.

Figure 1 shows the correlation values on the left-out sets, yielded by lexica built upon training subsets of increasing size – documents included at each subsequent step have been randomly selected from the original training sets.

The results show that, for the Rappler dataset, the tokenization and lemmatization approaches consistently achieve the highest performance across various dataset sizes; conversely, on the Corriere dataset the lemmatization-based strategies yield the best performance, with a significant improvement over tokenization. A possible explanation for this gap is that Italian (the language of the Corriere data) is morphologically richer than English (the adjective good, for instance, can be written as buono, buona, buoni, or buone); thus, lemmatization can reduce the data sparseness that harms the final lexicon quality.

Furthermore, Tables 8 and 9 show the results obtained using all three versions of our resource, measured by Pearson correlation between the emotion annotation and the computed value (as indicated before). When possible, the obtained values are compared to those of DepecheMood2014; in general, the current work improves on the earlier version by a significant margin (6 points on average). Considering the naïve approach we used, we can reasonably conclude that the quality and coverage of our resource explain such results, and that adopting more complex approaches (e.g. compositionality) could further improve the performance of emotion recognition.

Emotion      DM2014  token  lemma  l#p
AFRAID       0.30    0.38   0.39   0.38
AMUSED       0.16    0.33   0.32   0.31
ANGRY        0.40    0.39   0.40   0.40
ANNOYED      0.15    0.21   0.21   0.21
DON'T CARE   0.16    0.21   0.22   0.21
HAPPY        0.30    0.35   0.35   0.35
INSPIRED     0.31    0.36   0.36   0.35
SAD          0.36    0.39   0.41   0.41
AVERAGE      0.27    0.33   0.33   0.33

Table 8: Pearson's correlation on the Rappler left-out set.

Emotion   token  lemma  lemma#PoS
ANNOYED   0.40   0.44   0.44
AFRAID    0.14   0.16   0.15
SAD       0.35   0.40   0.39
AMUSED    0.20   0.22   0.22
HAPPY     0.29   0.32   0.31
AVERAGE   0.27   0.31   0.30

Table 9: Pearson's correlation on the Corriere left-out set.

4 Evaluation

In order to thoroughly evaluate the generated lexica, we assess their performance in both regression and classification tasks. The methodology described in Section 3, used to obtain emotion scores for a given sentence/document, is common to all these experiments. The experiments are also restricted to the English DM++, as we are not aware of Italian datasets for emotion recognition.
For all the experiments in this section we used the best DM++ configuration found in the previous section, i.e. filtering out untagged documents and using a frequency cut-off of 10.

We report the performance obtained on the datasets described in Table 10, and compare such results with the relevant previous works. Furthermore, in Table 11, we compare the coverage statistics over the same datasets for DM2014 and DM++ (both with and without frequency cut-off). As can be seen, using DM++ with cut-off 10 still grants a significantly higher coverage with respect to DM2014, without losing too much coverage as compared to the version without cut-off.

4.1 Unsupervised Regression Experiments

The SemEval 2007 dataset on "Affective Text" (Strapparava and Mihalcea, 2007) was gathered for a competition focused on emotion recognition in over one thousand news headlines, in both regression and classification settings. This dataset was meant for unsupervised approaches (only a small development sample was provided) to avoid simple text categorization approaches.

It is to be observed that the affective dimensions present in the test set – based on the six basic emotions model (Ekman and Friesen, 1971) – do not exactly match the ones provided by Rappler's Mood Meter; therefore, we adopted the mapping previously proposed in (Staiano and Guerini, 2014) for consistency.

In Table 12 we report the results obtained using our lexicon on the SemEval 2007 test set. DM++ consistently improves upon the previous version (DM2014) on all emotions, granting an improvement over the previous state of the art of up to 6 points.

4.2 Supervised Regression Experiments

Turning to evaluation on supervised regression tasks, the approach we use is inspired by (Beck et al., 2014): we cast the problem of predicting emotions as multi-task instead of single-task, i.e. rather than only using happiness scores to predict happiness, we also use the scores for sadness, fear, etc. This approach is justified by the evidence that emotion scores often tend to be correlated or anti-correlated (e.g. joy and surprise are correlated, joy and sadness are anti-correlated).

Hence, for each dataset we built N prediction models (one for each emotion present in the dataset), using as features the lexicon scores computed from either tokens, lemmas, or lemma#PoS:

    e_i ∼ Σ_{j=1}^{N} lex_j        (1)

where e_i is the predicted score on emotion i, and lex_j is the average score (computed on the elements of the test title/sentence) derived from the lexicon for emotion j.

We have used several learning algorithms, namely linear regression, support vector machines (SVM), decision trees, random forests and multilayer perceptrons, in a ten-fold cross-validation setting.
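As an illustration, the sketch below reproduces this setting with scikit-learn (our choice of toolkit; the paper does not name an implementation), using a random forest as one of the listed learners:

```python
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def supervised_baseline(X, y):
    # X: one row per document, one column per emotion, holding the
    # averaged lexicon scores; y: gold scores for the target emotion.
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    predictions = cross_val_predict(model, X, y, cv=10)  # ten-fold CV
    r, _ = pearsonr(y, predictions)
    return r
```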
Again, consider that we are not trying to optimize the aforementioned models, but just trying to understand whether there is room for strong supervised baselines that use standard machine learning methodologies on top of our simple features (i.e. feeding the DepecheMood++ lexicon scores to each emotion model). Table 13 shows that this is the case, with improvements ranging from 8 to 15 points depending on the dataset.

Additionally, we have performed another experiment that compares DepecheMood++ to existing approaches on a popular emotion dataset (the Tweet Emotion Intensity Dataset): we replicated the approach outlined in (Mohammad and Bravo-Marquez, 2017) on the dataset therein presented, using DepecheMood++ as lexicon.

The results are reported in Table 14: replicating the original work, Pearson's correlation has been used to measure the performance; since DepecheMood++ is not a Twitter-specific lexicon, we also report non-domain-specific approaches.

It can be seen that DepecheMood++ improves over DM2014 on all emotions, while the lexicon from Hu and Liu (2004) performs slightly better only on joy. In an aggregate view of the problem (average over all emotions), DepecheMood++ yields the best performance among the non-Twitter-specific lexica.
Name                                           Domain               Reference
SemEval-2007                                   News headlines       (Strapparava and Mihalcea, 2007)
Tweet Emotion Intensity Dataset (WASSA-2017)   Twitter messages     (Mohammad and Bravo-Marquez, 2017)
Blog                                           Blog posts           (Aman and Szpakowicz, 2007)
CLPsych-2016                                   Mental health blogs  (Cohan et al., 2016)

Table 10: Annotated public datasets used in the evaluation.

                      DepecheMood++, cutoff 1   DepecheMood++, cutoff 10
Dataset     DM2014    tok    lem    l#p         tok    lem    l#p
SemEval07   0.64      0.91   0.94   0.92        0.85   0.88   0.85
WASSA17     0.40      0.66   0.62   0.65        0.56   0.52   0.53
Blog posts  0.64      0.93   0.92   0.91        0.84   0.83   0.80
CLPsych16   0.57      0.87   0.87   0.87        0.81   0.78   0.76

Table 11: Statistics on word coverage per headline. tok: tokens, lem: lemma, l#p: lemma and PoS.

Emotion    DM2014  token  lemma  lemma#PoS
ANGER      0.37    0.47   0.47   0.44
FEAR       0.51    0.60   0.59   0.60
JOY        0.34    0.38   0.37   0.35
SADNESS    0.44    0.46   0.48   0.50
SURPRISE   0.19    0.21   0.23   0.22
Average    0.37    0.43   0.43   0.42

Table 12: Pearson's correlation on the SemEval 2007 dataset.

Dataset   Word rep.  LR    SVM   DT    RF    MLP   DM++
Rappler   token      0.38  0.38  0.37  0.40  0.38  0.33
Rappler   lemma      0.37  0.37  0.36  0.40  0.39  0.33
Rappler   l#p        0.36  0.36  0.35  0.38  0.38  0.33
Corriere  token      0.35  0.35  0.34  0.39  0.37  0.27
Corriere  lemma      0.36  0.36  0.35  0.40  0.38  0.31
Corriere  l#p        0.35  0.36  0.35  0.40  0.37  0.30
SemEval   token      0.49  0.49  0.49  0.53  0.46  0.43
SemEval   lemma      0.50  0.50  0.48  0.53  0.47  0.43
SemEval   l#p        0.49  0.49  0.50  0.53  0.48  0.42

Table 13: Regression results in supervised settings: Pearson's correlation averaged over all emotions. LR: linear regression, DT: decision trees, RF: random forest, MLP: multilayer perceptron.

Lexicon          Anger  Fear  Joy   Sad   Avg.
Bing Liu         0.33   0.31  0.37  0.23  0.31
MPQA             0.18   0.20  0.28  0.12  0.20
NRC-EmoLex       0.18   0.26  0.36  0.23  0.26
SentiWordNet     0.14   0.19  0.26  0.16  0.19
DM2014           0.35   0.28  0.27  0.30  0.30
DM++ token       0.33   0.32  0.34  0.40  0.33
DM++ lemma       0.36   0.32  0.36  0.42  0.35
DM++ lemma#PoS   0.31   0.31  0.34  0.41  0.34

Table 14: Comparison with the generic lexica of (Mohammad and Bravo-Marquez, 2017) in the task of emotion intensity prediction. Avg. is the average over all four emotions.

4.3 Supervised Classification Experiments

Finally, akin to Section 4.2 but this time in a classification setting, we performed additional experiments to benchmark DepecheMood++ against existing works in the literature.
In a first experiment, we replicated the work described in (Aman and Szpakowicz, 2007), tackling emotion detection in blog data. Consistently with the original work, we trained Naive Bayes and Support Vector Machine models using only the DepecheMood++ affective scores. This serves as a comparison to the original work, where General Inquirer and WordNet-Affect annotations were used. Results are reported in Table 15, including the performance of the original DM2014 lexicon for comparison purposes. We observe that DepecheMood++ brings a significant improvement and outperforms the other models.

System                                                      Naive Bayes  SVM
(Aman and Szpakowicz, 2007): General Inquirer + WN-Affect   72.08        73.89
DM2014                                                      75.86        76.89
DM++ token                                                  76.01        78.04
DM++ lemma                                                  76.50        78.03
DM++ lemma#PoS                                              74.18        76.18

Table 15: Comparison with (Aman and Szpakowicz, 2007) in the task of emotion classification accuracy.
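A sketch of this classification set-up follows, again with scikit-learn as an assumed toolkit (the original experiments do not specify an implementation):

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def classification_baselines(X, y):
    # X: DepecheMood++ affective scores, one row per post;
    # y: emotion labels. Returns mean cross-validated accuracy.
    classifiers = {"naive_bayes": GaussianNB(), "svm": SVC()}
    return {name: cross_val_score(clf, X, y, cv=10).mean()
            for name, clf in classifiers.items()}
```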
Further, we replicated the work detailed in (Cohan et al., 2016), and assessed and compared with it the performance of DepecheMood++ on a domain-specific task: in (Cohan et al., 2016), the authors tackle relevance prediction of blog posts on medical forums dedicated to mental health. It is worth noting that, as (Cohan et al., 2016) used the original DepecheMood2014, this experiment also serves the purpose of evaluating the improvements brought by DepecheMood++, which are shown in Table 16.

System                              Accuracy  F1-macro
(Cohan et al., 2016) using DM2014   81.62     75.21
DM++ token                          82.15     81.34
DM++ lemma                          82.26     81.46
DM++ lemma#PoS                      82.25     81.44

Table 16: Comparison with (Cohan et al., 2016) in the task of mental health post classification.

5 Discussion of Results

Several findings that are consistent across the datasets emerge from the above experiments.

As shown in Tables 8 and 12, the new version of DepecheMood++ effectively and consistently improves (see also the additional benchmarks reported in Tables 14, 15 and 16) over the original work (Staiano and Guerini, 2014). Such improvements can be explained by the expansion of the training data, which enables the generated lexicon to better capture emotional information; in Figure 1, we showed the performance obtained by lexica built on random and increasing subsets of the source data, and observed consistent improvements until a certain saturation point is met.

Moreover, we found that adding a word frequency cutoff parameter benefits the performance of the generated lexicon; in our experiments we find an optimal value of 10 for both the English and Italian lexica.

Turning to the benefits of common pre-processing stages, our experiments included tokenization, lemmatization and PoS tagging. While the original DM2014 lexicon only provided a lemma#PoS-based vocabulary, we show that – for English – tokenization suffices, and further stages in the pre-processing pipeline do not significantly contribute to the generated lexicon precision; conversely, we obtained significant improvements by adding a lemmatization stage for Italian (see Figure 1), a fact we hypothesize is due to the morphologically richer nature of the Italian language.

Further, as shown in Table 5, filtering out untagged documents contributes to lexicon precision, arguably resulting in a higher-quality resource.

Finally, the extensive experiments reported in the previous sections show the quality of the English lexicon we release, in diverse domains/tasks. Our results indicate that additional data would not lead to further improvements, at least for English. Conversely, we note that the Italian resource we also provide to the community shows promising results.

6 Increasing Coverage through Embeddings

Over the last few years, word embeddings (dense and continuously valued vectorial representations of words) have gained wide acceptance and popularity in the research community, and have proven to be very effective in several NLP tasks, including sentiment analysis (Giatsoglou et al., 2017).

Taking into account this outlook, we propose a technique we call embedding expansion, which aims to expand a lexicon's vocabulary by means of a word embedding model. The idea is to map words that do not originally appear in a certain lexicon to a word that is contained in that lexicon.

Hence, given an out-of-vocabulary (OOV) word w_i, we search the embedding space for the closest word l_j that is included in the lexicon, so that d(w_i, l_j) is minimal, where d(·, ·) is the cosine distance in the embedding model. Finally, the emotion scores of l_j are assigned to w_i.
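A minimal sketch of the expansion step is given below; it assumes unit-normalized embedding vectors, so that minimal cosine distance corresponds to maximal dot product (names and data structures are illustrative):

```python
import numpy as np

def expand(word, lexicon, emb):
    # Return emotion scores for `word`, falling back to its nearest
    # in-lexicon neighbour in the embedding space when it is OOV.
    if word in lexicon:
        return lexicon[word]
    if word not in emb:
        return None  # no embedding available either
    candidates = [w for w in lexicon if w in emb]
    # With unit-normalized vectors, the smallest cosine distance
    # corresponds to the largest dot product.
    nearest = max(candidates, key=lambda w: float(np.dot(emb[word], emb[w])))
    return lexicon[nearest]
```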
We have performed an evaluation over the Rappler and Corriere datasets using the token-based versions of DepecheMood++, with the aim of observing the effect of embedding expansion. To this end, we proceed to:

1. remove random subsets, of decreasing size, from the original lexicon vocabulary;

2. apply the expansion at each step;

3. measure performance against the corresponding test sets (i.e. the left-out sets described in Section 3.3).

The pre-trained word embeddings used for English are the ones published by Mikolov et al. (2013), while for Italian we use those by Tripodi and Li Pira (2017); Figure 2 shows the results of this evaluation.

As observed, performing the embedding expansion can improve the performance of the emotion lexicon. The highest improvement is achieved when the vocabulary has been reduced to roughly half of its elements, for both datasets. When the vocabulary is not reduced, instead, the improvement tends to disappear.

Thus, we conclude that embedding expansion can enhance the performance of lexica with low coverage by expanding their vocabulary through an embedding model.
[Figure omitted: average correlation over all emotions as a function of lexicon vocabulary size, with and without embedding expansion; panels: (a) Rappler, (b) Corriere.]

Figure 2: Difference in performance when using the embedding expansion in both Rappler and Corriere.

Nevertheless, when the lexicon has a high coverage (as in the case of DepecheMood++), further extending it through embedding expansion does not lead to meaningful improvements.

7 Conclusions

The contributions of this paper are two-fold: first, we release to the community two new high-performance and high-coverage lexica, targeting the English and Italian languages; second, we extensively benchmark different setup decisions affecting the construction of the two resources, and further evaluate the performance obtained on several datasets/tasks exhibiting a wide diversity in terms of domain, language, setting and task. Our findings are summarized below.

Better baselines come cheap: we have shown how straightforward classifiers/regressors built on top of the proposed lexica, without additional features, obtain good performance even on domain-specific tasks, and can provide more challenging baselines when evaluating complex task-specific models; we hypothesize that such computationally cheap approaches might benefit any lexicon.

Target language matters: we built our lexica for two languages, English and Italian, using consistent techniques and data, a fact that allowed us to experiment with different settings and cross-evaluate the results. In particular, we found that for English a token-based vocabulary suffices and further pre-processing stages do not help, while for Italian our experiments highlighted significant improvements when adding a lemmatization step. We interpret this in light of the morphologically richer nature of the Italian language with respect to English.

Embeddings do help (once again): we investigated a simple embeddings-based extension approach, and showed how it benefits both performance and coverage of the lexica. We deem this technique particularly promising when dealing with very limited annotated datasets.

References

Saima Aman and Stan Szpakowicz. 2007. Identifying expressions of emotion in text. In Proceedings of the International Conference on Text, Speech and Dialogue, pages 196–205.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Conference on International Language Resources and Evaluation (LREC), pages 2200–2204.

Daniel Beck, Trevor Cohn, and Lucia Specia. 2014. Joint emotion analysis via multi-task Gaussian processes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1798–1803.

Jerome R. Bellegarda. 2010. Emotion analysis using latent affective folding and embedding. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 1–9.

Victoria Bobicev, Marina Sokolova, and Michael Oakes. 2015. What goes around comes around: learning sentiments in online medical forums. Cognitive Computation, 7(5):609–621.

Margaret M. Bradley and Peter J. Lang. 1999. Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1, University of Florida.
Erik Cambria, Thomas Mazzocco, Amir Hussain, and Chris Eckl. 2011. Sentic medoids: Organizing affective common sense knowledge in a multi-dimensional vector space. In Proceedings of the International Symposium on Neural Networks, pages 601–610.

Arman Cohan, Sydney Young, and Nazli Goharian. 2016. Triaging mental health forum posts. In Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology, pages 143–147.

Paul Ekman and Wallace V. Friesen. 1971. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17:124–129.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA.

Lorenzo Gatti, Marco Guerini, and Marco Turchi. 2016. SentiWords: Deriving a high precision and high coverage lexicon for sentiment analysis. IEEE Transactions on Affective Computing, 7(4):409–421.

Maria Giatsoglou, Manolis G. Vozalis, Konstantinos Diamantaras, Athena Vakali, George Sarigiannidis, and Konstantinos Ch. Chatzisavvas. 2017. Sentiment analysis leveraging emotions and word embeddings. Expert Systems with Applications, 69:214–224.

Marco Guerini and Jacopo Staiano. 2015. Deep feelings: A massive cross-lingual study on the relation between emotions and virality. In Proceedings of the 24th International Conference on World Wide Web (WWW '15), pages 299–305.

Catherine Havasi, Robert Speer, and Jason Alonso. 2007. ConceptNet 3: a flexible, multilingual semantic network for common sense knowledge. In Proceedings of the 12th International Conference on Recent Advances in Natural Language Processing (RANLP 2007), pages 27–29.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

Carroll E. Izard. 1977. Human Emotions. Plenum Press.

Svetlana Kiritchenko and Saif M. Mohammad. 2016. Capturing reliable fine-grained sentiment associations by crowdsourcing and best–worst scaling. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), San Diego, California.

Bing Liu and Lei Zhang. 2012. A survey of opinion mining and sentiment analysis. Mining Text Data, pages 415–463.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Gilad Mishne. 2005. Experiments with mood classification in blog posts. In Proceedings of the ACM SIGIR 2005 Workshop on Stylistic Analysis of Text for Information Access.

Saif M. Mohammad. 2018a. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL), Melbourne, Australia.

Saif M. Mohammad. 2018b. Word affect intensities. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018), Miyazaki, Japan.

Saif M. Mohammad and Felipe Bravo-Marquez. 2017. WASSA-2017 shared task on emotion intensity. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 34–49.

Saif M. Mohammad and Svetlana Kiritchenko. 2015. Using hashtags to capture fine emotion categories from tweets. Computational Intelligence, 31(2):301–326.

Saif M. Mohammad and Peter D. Turney. 2010. Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34.

Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29:436–465.

Alena Neviarouskaya, Helmut Prendinger, and Mitsuru Ishizuka. 2010. EmoHeart: conveying emotions in Second Life based on affect sensing from text. Advances in Human-Computer Interaction, 2010:1.

Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.

Georgios Paltoglou, Mike Thelwall, and Kevan Buckley. 2010. Online textual communications annotated with grades of emotion strength. In Proceedings of the 3rd International Workshop of Emotion: Corpora for Research on Emotion and Affect (EMOTION), pages 25–31.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. Theories of Emotion, 1:3–31.

Jacopo Staiano and Marco Guerini. 2014. DepecheMood: a lexicon for emotion analysis from crowd-annotated news. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 427–433.

Philip J. Stone, Dexter C. Dunphy, and Marshall S. Smith. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval), pages 70–74.

Carlo Strapparava and Rada Mihalcea. 2008. Learning to identify emotions in text. In Proceedings of the 2008 ACM Symposium on Applied Computing, pages 1556–1560.

Carlo Strapparava and Alessandro Valitutti. 2004. WordNet-Affect: an affective extension of WordNet. In Proceedings of the Conference on International Language Resources and Evaluation (LREC), pages 1083–1086.

Pero Subasic and Alison Huettner. 2001. Affect analysis of text using fuzzy semantic typing. IEEE Transactions on Fuzzy Systems, 9(4):483–496.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307.

Duyu Tang, Furu Wei, Bing Qin, Ming Zhou, and Ting Liu. 2014. Building large-scale Twitter-specific sentiment lexicon: A representation learning approach. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), pages 172–182.

Rocco Tripodi and Stefano Li Pira. 2017. Analysis of Italian word embeddings. arXiv preprint arXiv:1707.08783.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.

Teresa Wilson, Janyce Wiebe, and Rebecca Hwa. 2004. Just how mad are you? Finding strong and weak opinion clauses. In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI '04), pages 761–769.
