Sarcasm (also known as verbal irony) is a sophisticated form of speech act in which the speakers convey their message in an implicit way. One inherent characteristic of the sarcastic speech act is that it is sometimes hard to recognize. The difficulty in recognizing sarcasm causes misunderstanding in everyday communication and poses problems for automatic text processing.

Example (1) refers to the supposedly lame music performance in Super Bowl 2010 and attributes it to the aftermath of the scandalous performance of Janet Jackson in the previous year. Note that the previous year is not mentioned: the reader has to guess the context (using universal knowledge). The words 'yet' and 'another' might hint at sarcasm. Example (2) is composed of three short sentences, each of them sarcastic on its own; combining them in one tweet, however, brings the sarcasm to its extreme. Example (3) is a factual statement without an explicit opinion. However, having a fast connection is a positive thing, so a possible sarcasm emerges from the over-exaggeration ('wow', 'blazing-fast'). Example (4), from Amazon, might be a genuine compliment had it appeared in the body of the review; recalling the expression 'don't judge a book by its cover', however, choosing it as the title of the review reveals its sarcastic nature. Although the negative sentiment is very explicit in the iPod review (5), the sarcastic effect emerges from a pun that assumes the knowledge that design is one of the most celebrated features of Apple's products. (None of the above reasoning was directly introduced to our algorithm.)

Modeling the underlying patterns of sarcastic utterances is interesting from the psychological and cognitive perspectives and can benefit various NLP systems such as review summarization (Popescu and Etzioni, 2005; Pang and Lee, 2004; Wiebe et al., 2004; Hu and Liu, 2004) and dialogue systems. Following the 'brilliant-but-cruel' hypothesis (Danescu-Niculescu-Mizil et al., 2009), it can also help improve ranking and recommendation systems (Tsur and Rappoport, 2009). All such systems currently fail to correctly classify the sentiment of sarcastic sentences.
In this paper we utilize the semi-supervised sarcasm identification algorithm (SASI) of Tsur et al. (2010). The algorithm employs two modules: semi-supervised pattern acquisition, which identifies sarcastic patterns that serve as features for a classifier, and a classification stage that assigns each sentence to a sarcasm class. We experiment with two radically different datasets: 5.9 million tweets collected from Twitter, and 66,000 Amazon product reviews. Although the algorithm utilizes structured information for the Amazon dataset, results for the Twitter dataset are higher. We discuss the possible reasons for this, as well as the utility of Twitter #sarcasm hashtags for the task. Our algorithm performed well in both domains, substantially outperforming a strong baseline based on the semantic gap and on user annotations. To further test its robustness, we also trained the algorithm in a cross-domain manner, achieving good results.

2 Data

The datasets we used are interesting in their own right for many applications. In addition, our algorithm utilizes some aspects that are unique to these datasets. Hence, before describing the algorithm, we describe the datasets in detail.

Twitter dataset. Since Twitter is a relatively new service, a somewhat lengthy description of the medium and the data is appropriate. Twitter is a very popular microblogging service. It allows users to publish and read short messages called tweets ('to tweet' is also used as a verb for the act of publishing on Twitter). Tweet length is restricted to 140 characters. A user who publishes a tweet is referred to as a tweeter; the readers are casual readers, or followers if they are registered to receive all tweets by this tweeter.

Apart from plain text, tweets may contain references to URL addresses, references to other Twitter users (these appear as @<user>) or content tags (called hashtags) assigned by the tweeter (#<tag>). An example of a tweet is: "listening to Andrew Ridgley by Black Box Recorder on @Grooveshark: http://tinysong.com/cO6i #goodmusic", where 'grooveshark' is a Twitter user name and #goodmusic is a tag that makes it possible to search for tweets with the same tag. Though frequently used, these types of meta tags are optional. In order to abstract away from specific references, we substituted such occurrences with the special tags [LINK], [USER] and [HASHTAG], so the tweet above becomes "listening to Andrew Ridgley by Black Box Recorder on [USER]: [LINK] [HASHTAG]". It is important to mention that hashtags are not formal, and each tweeter can define and use new tags as s/he likes. The number of special tags in a tweet is limited only by the 140-character constraint, and there is no specific grammar that enforces the location of special tags within a tweet.

The informal nature of the medium and the 140-character length constraint encourage massive use of slang, shortened lingo, ASCII emoticons and other tokens absent from formal lexicons. These characteristics make Twitter a fascinating domain for NLP applications, although they pose great challenges due to the length constraint, the complete freedom of style and the out-of-discourse nature of tweets.

We used 5.9 million unique tweets in our dataset (the Twitter data was generously provided to us by Brendan O'Connor): the average tweet length is 14.2 words, 18.7% of the tweets contain a URL, 35.3% contain a reference to another tweeter and 6.9% contain at least one hashtag.
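A minimal sketch of this substitution step is given below; the regular expressions are our own illustration and are not specified in the paper.

```python
import re

def normalize_tweet(text: str) -> str:
    """Replace URLs, @-mentions and hashtags with generalized meta tags.

    A sketch only: the exact rules used in the original preprocessing are
    not spelled out in the paper, so these regular expressions are assumptions.
    """
    text = re.sub(r"https?://\S+|www\.\S+", "[LINK]", text)
    text = re.sub(r"@\w+", "[USER]", text)
    text = re.sub(r"#\w+", "[HASHTAG]", text)
    return text

print(normalize_tweet("listening to Andrew Ridgley by Black Box Recorder "
                      "on @Grooveshark: http://tinysong.com/cO6i #goodmusic"))
# listening to Andrew Ridgley by Black Box Recorder on [USER]: [LINK] [HASHTAG]
```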
The #sarcasm hashtag. One of the hashtags used by Twitter users is dedicated to indicating sarcastic tweets. An example of the use of the tag is: 'I guess you should expect a WONDERFUL video tomorrow. #sarcasm'. The sarcastic hashtag is added by the tweeter. The hashtag is used infrequently, as most users are not aware of it; hence, the majority of sarcastic tweets are not explicitly tagged by their tweeters. We use tagged tweets as a secondary gold standard. We discuss the use of this tag in Section 5.

Amazon dataset. We used the same dataset used by Tsur et al. (2010), containing 66,000 reviews of 120 products from Amazon.com. The corpus contains reviews of books from different genres and of various electronic products. Amazon reviews are much longer than tweets (some reach 2,000 words; the average length is 953 characters), they are more structured and grammatical (good reviews are very structured), and they come in the known context of a specific product. Reviews are semi-structured: besides the body of the review, they all have the following fields: writer, date, star rating (the overall satisfaction of the review writer) and a one-line summary.

Reviews refer to a specific product and rarely address each other. Each review sentence is, therefore, part of a context: the specific product, the star rating, the summary and the other sentences in that review. In that sense, sentences in the Amazon dataset differ radically from the contextless tweets. It is worth mentioning that the majority of reviews are on the very positive side (the average star rating is 4.2 stars).

3 Classification Algorithm

Our algorithm is semi-supervised. The input is a relatively small seed of labeled sentences. The seed is annotated on a discrete scale of 1...5, where 5 indicates a clearly sarcastic sentence and 1 indicates a clear absence of sarcasm. A 1...5 scale was used in order to allow for some subjectivity, and because some instances of sarcasm are more explicit than others.

Given the labeled sentences, we extracted a set of features to be used in feature vectors. Two basic feature types are utilized: syntactic and pattern-based features. We constructed feature vectors for each of the labeled examples in the training set and used them to build a classifier model and to assign scores to unlabeled examples. We next provide a description of the algorithmic framework of Tsur et al. (2010).

Data preprocessing. A sarcastic utterance usually has a target. In the Amazon dataset these targets can be exploited by a computational algorithm, since each review targets a product, its manufacturer or one of its features, and these are explicitly represented or easily recognized. The Twitter dataset is totally unstructured and lacks textual context, so we did not attempt to identify targets.

Our algorithmic methodology is based on patterns. We could use patterns that include the targets identified in the Amazon dataset. However, in order to use less specific patterns, we automatically replace each appearance of a product, author, company or book name (Amazon) and of a user, URL or hashtag (Twitter) with the corresponding generalized meta tags '[PRODUCT]', '[COMPANY]', '[TITLE]' and '[AUTHOR]' (the appropriate names are provided with each review, so this replacement can be done automatically), and '[USER]', '[LINK]' and '[HASHTAG]'. We also removed all HTML tags and special symbols from the review text.
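A minimal sketch of the review-side replacement is shown below, assuming the relevant names are taken from the review meta-data; the matching rules and example values are our own assumptions.

```python
import re

def generalize_review(text: str, meta: dict) -> str:
    """Replace known names with meta tags and strip HTML (a sketch).

    `meta` maps tag names to the names supplied with the review, e.g.
    {"COMPANY": "Garmin", "PRODUCT": "nuvi 350"}; the exact matching rules
    are not specified in the paper.
    """
    text = re.sub(r"<[^>]+>", " ", text)                    # drop HTML tags
    for tag, name in meta.items():
        if name:
            text = re.sub(re.escape(name), f"[{tag}]", text, flags=re.IGNORECASE)
    return text

print(generalize_review("Garmin apparently does not care much about product "
                        "quality or customer support", {"COMPANY": "Garmin"}))
# [COMPANY] apparently does not care much about product quality or customer support
```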
Pattern extraction. Our main feature type is based on surface patterns. In order to extract such patterns automatically, we followed the algorithm given in Davidov and Rappoport (2006). We classified words into high-frequency words (HFWs) and content words (CWs). A word whose corpus frequency is more (less) than FH (FC) is considered to be a HFW (CW). Unlike Davidov and Rappoport (2006), we consider all punctuation characters as HFWs. We also consider the [PRODUCT], [COMPANY], [TITLE] and [AUTHOR] tags as HFWs for pattern extraction. We define a pattern as an ordered sequence of high-frequency words and slots for content words. The FH and FC thresholds were set to 1,000 words per million (upper bound for FC) and 100 words per million (lower bound for FH); note that these bounds allow overlap between some HFWs and CWs.

The patterns allow 2-6 HFWs and 1-6 slots for CWs. For each sentence it is possible to generate dozens of patterns that may overlap. For example, given the sentence "Garmin apparently does not care much about product quality or customer support", we generated several patterns, including "[COMPANY] CW does not CW much", "does not CW much about CW CW or", "not CW much" and "about CW CW or CW CW .". Note that "[COMPANY]" and "." are treated as high-frequency words.
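The sketch below illustrates this kind of extraction: tokens are split into HFWs and CWs by corpus frequency, and candidate patterns are read off contiguous token windows containing 2-6 HFWs and 1-6 CW slots. It is a simplified approximation under our own assumptions (window-based enumeration, a single class per word), not the exact procedure of Davidov and Rappoport (2006).

```python
FH = 100    # HFW threshold: corpus frequency above 100 words per million
FC = 1000   # CW threshold: corpus frequency below 1000 words per million
PUNCT = set(".,!?;:'\"()")

def word_class(token, freq_per_million):
    """Classify a token as 'HFW' or 'CW'. Simplification: the paper's bounds
    overlap, so words between FH and FC may actually act as both."""
    if token in PUNCT or token.startswith("["):      # punctuation and meta tags
        return "HFW"
    return "HFW" if freq_per_million.get(token.lower(), 0.0) > FH else "CW"

def candidate_patterns(tokens, freq_per_million):
    """Enumerate surface patterns from contiguous token windows (a sketch)."""
    classes = [word_class(t, freq_per_million) for t in tokens]
    patterns = set()
    for i in range(len(tokens)):
        for j in range(i + 2, min(len(tokens), i + 12) + 1):
            n_hfw = sum(c == "HFW" for c in classes[i:j])
            n_cw = (j - i) - n_hfw
            if 2 <= n_hfw <= 6 and 1 <= n_cw <= 6:
                patterns.add(" ".join(t if c == "HFW" else "CW"
                                      for t, c in zip(tokens[i:j], classes[i:j])))
    return patterns
```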
Pattern selection. The pattern extraction stage provides us with hundreds of patterns; however, some of them are either too general or too specific. In order to reduce the feature space, we used two criteria to select useful patterns. First, we removed all patterns that appear only in sentences originating from a single product or book (Amazon); such patterns are usually product-specific. Next, we removed all patterns that appear in the seed both in some example labeled 5 (clearly sarcastic) and in some other example labeled 1 (clearly not sarcastic). This filters out frequent, generic and uninformative patterns. Pattern selection was performed only on the Amazon dataset, as it exploits the reviews' meta-data.

Pattern matching. Once patterns are selected, we used each pattern to construct a single entry in the feature vectors. For each sentence, we calculated a feature value for each pattern as follows:

1: Exact match – all the pattern components appear in the sentence in the correct order without any additional words.

α: Sparse match – same as exact match, but additional non-matching words can be inserted between pattern components.

γ · n/N: Incomplete match – only n > 1 of the N pattern components appear in the sentence, while some non-matching words can be inserted in between. At least one of the appearing components must be a HFW.

0: No match – nothing, or only a single pattern component, appears in the sentence.

0 ≤ α ≤ 1 and 0 ≤ γ ≤ 1 are parameters we use to assign reduced scores to imperfect matches. Since the patterns we use are relatively long, exact matches are uncommon, and taking advantage of partial matches allows us to significantly reduce the sparsity of the feature vectors. We used α = γ = 0.1 in all experiments. Table 1 demonstrates the results obtained with different values of α and γ.

α\γ    0.05   0.1    0.2
0.05   0.48   0.45   0.39
0.1    0.50   0.51   0.40
0.2    0.40   0.42   0.33

Table 1: Results (F-score, "no enrichment" mode) of cross-validation with various values of α and γ on the Twitter+Amazon data.

Thus, for the sentence "Garmin apparently does not care much about product quality or customer support", the value for "[company] CW does not" would be 1 (exact match); for "[company] CW not" it would be 0.1 (sparse match, due to the insertion of 'does'); and for "[company] CW CW does not" it would be 0.1 · 4/5 = 0.08 (incomplete match, since the second CW is missing).
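A simplified implementation of this scoring scheme is sketched below. Both the pattern and the sentence are represented as component lists in which content words already appear as the slot symbol 'CW'; the exact matching procedure of the original system may differ, so this is an approximation that reproduces the worked example above.

```python
def _lcs(a, b):
    """Length of the longest ordered (not necessarily contiguous) match."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[-1][-1]

def pattern_feature_value(pattern, sentence, alpha=0.1, gamma=0.1):
    """Feature value of one pattern for one (pre-converted) sentence."""
    N = len(pattern)
    if any(sentence[i:i + N] == pattern for i in range(len(sentence) - N + 1)):
        return 1.0                                   # exact match
    n = _lcs(pattern, sentence)
    if n == N:
        return alpha                                 # sparse match
    has_hfw = any(c != "CW" and c in sentence for c in pattern)
    if n > 1 and has_hfw:
        return gamma * n / N                         # incomplete match
    return 0.0                                       # no match

sent = "[company] CW does not CW much about CW CW or CW CW .".split()
print(pattern_feature_value("[company] CW does not".split(), sent))     # 1.0
print(pattern_feature_value("[company] CW not".split(), sent))          # 0.1
print(pattern_feature_value("[company] CW CW does not".split(), sent))  # ~0.08
```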
Punctuation-based features. In addition to pattern-based features, we used the following generic features: (1) sentence length in words, (2) number of '!' characters in the sentence, (3) number of '?' characters in the sentence, (4) number of quotes in the sentence, and (5) number of capitalized/all-capitals words in the sentence. All these features were normalized by dividing them by the (maximal observed value · averaged maximal value of the other feature groups), so that the maximal weight of each of these features equals the averaged weight of a single pattern/word/n-gram feature.
Data enrichment. Since we start with only a small annotated seed for training (in particular, the number of clearly sarcastic sentences in the seed is modest), and since annotation is noisy and expensive, we would like to find more training examples without requiring additional annotation effort. To achieve this, we posited that sarcastic sentences frequently co-appear in texts with other sarcastic sentences (see example (2) in Section 1). We performed an automated web search using the Yahoo! BOSS API (http://developer.yahoo.com/search/boss), where for each sentence s in the training set (seed) we composed a search engine query q_s containing this sentence (if the sentence contained more than 6 words, only the first 6 words were included in the query). We collected up to 50 search engine snippets for each example and added the sentences found in these snippets to the training set. The label (level of sarcasm) Label(s_q) of a newly extracted sentence s_q is similar to the label Label(s) of the seed sentence s whose query acquired it. The seed sentences together with the newly acquired sentences constitute the (enriched) training set.

Data enrichment was performed only for the Amazon dataset, where we have a manually tagged seed and the sentence structure is closer to standard English grammar. We refer the reader to Tsur et al. (2010) for more details about the enrichment process and for a short discussion of the usefulness of web-based data enrichment in the scope of sarcasm recognition.
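A sketch of the enrichment loop is shown below. The `search_snippets` callable stands in for the web-search service (the paper used the Yahoo! BOSS API, whose client code is not reproduced here), and the sentence splitting is a naive assumption.

```python
import re

def enrich(seed, search_snippets, max_snippets=50):
    """Web-based data enrichment (a sketch).

    `seed` is a list of (sentence, label) pairs; `search_snippets(query)`
    is assumed to return a list of text snippets. Sentences found in the
    snippets inherit the label of the seed sentence whose query found them.
    """
    enriched = list(seed)
    for sentence, label in seed:
        query = " ".join(sentence.split()[:6])          # at most the first 6 words
        for snippet in search_snippets(query)[:max_snippets]:
            for new_sentence in re.split(r"(?<=[.!?])\s+", snippet):
                if new_sentence.strip():
                    enriched.append((new_sentence.strip(), label))
    return enriched
```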
Classification. In order to assign a score to new examples in the test set, we use a k-nearest-neighbors (kNN)-like strategy. We construct feature vectors for each example in the training and test sets. We would like to calculate the score of each example in the test set. For each feature vector v in the test set, we compute the Euclidean distance to each of the matching vectors in the extended training set, where matching vectors are those that share at least one pattern feature with v.

Let t_i, i = 1..k, be the k vectors with the lowest Euclidean distance to v (we used k = 5 for all experiments). Then v is classified with a label l as follows:

Count(l) = fraction of training vectors with label l

Label(v) = (1/k) · Σ_i [ Count(Label(t_i)) · Label(t_i) / Σ_j Count(Label(t_j)) ]

Thus the score is a weighted average of the labels of the k closest training set vectors. If there are fewer than k matching vectors for the given example, then fewer vectors are used in the computation. If no matching vectors are found for v, we assign the default value Label(v) = 1, since sarcastic sentences are fewer in number than non-sarcastic ones (this is a 'most common tag' strategy).
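The scoring rule above translates directly into code; the sketch below assumes feature vectors are stored as sparse dictionaries of non-zero feature values (our own representation choice).

```python
import math
from collections import Counter

def classify(v, training, k=5, default_label=1):
    """Weighted kNN-like scoring as described above (a sketch).

    `v` is a dict {feature: value}; `training` is a list of (vector, label)
    pairs. Only training vectors sharing at least one feature with `v`
    ('matching vectors') are considered.
    """
    matching = [(tv, lab) for tv, lab in training if set(tv) & set(v)]
    if not matching:
        return default_label                       # 'most common tag' fallback

    def dist(a, b):
        keys = set(a) | set(b)
        return math.sqrt(sum((a.get(f, 0.0) - b.get(f, 0.0)) ** 2 for f in keys))

    nearest = sorted(matching, key=lambda p: dist(v, p[0]))[:k]
    frac = {lab: c / len(training)                 # Count(l): fraction with label l
            for lab, c in Counter(lab for _, lab in training).items()}
    denom = sum(frac[lab] for _, lab in nearest)
    return sum(frac[lab] * lab for _, lab in nearest) / (len(nearest) * denom)
```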
4 Evaluation Setup

Seed and extended training sets (Amazon). As described in the previous section, SASI is semi-supervised and hence requires a small seed of annotated data. We used the same seed of 80 positive (sarcastic) examples and 505 negative examples described in Tsur et al. (2010). After automatically expanding the training set, our training data contains 471 positive examples and 5,020 negative examples. These ratios are to be expected, since non-sarcastic sentences outnumber sarcastic ones, especially given that most online reviews are positive (Liu et al., 2007). This generally positive tendency is also reflected in our data: the average number of stars is 4.12.

Seed training set with #sarcasm (Twitter). We used a sample of 1,500 tweets marked with the #sarcasm hashtag as a positive set that represents sarcasm styles special to Twitter. However, this set is very noisy (see the discussion in Section 5).

Seed training set (cross domain). Results obtained by training on the 1,500 #sarcasm-tagged tweets were not promising. Examination of the #sarcasm-tagged tweets shows that the annotation is biased and noisy, as we discuss at length in Section 5. A better annotated set was needed in order to properly train the algorithm. Sarcastic tweets are sparse and hard to find and annotate manually. In order to overcome sparsity, we used the positive seed annotated on the Amazon dataset. The training set was completed by manually selected negative examples from the Twitter dataset. Note that in this setting our training set is thus of mixed domains.

4.1 Star-sentiment baseline

Many studies on sarcasm suggest that sarcasm emerges from the gap between the expected utterance and the actual utterance (see the echoic mention, allusion and pretense theories in the Related Work section, Section 6). We implemented a baseline designed to capture the notion of sarcasm as reflected by these models, trying to meet the definition "saying the opposite of what you mean in a way intended to make someone else feel stupid or show you are angry".

We exploit the meta-data provided by Amazon, namely the star rating each reviewer is obliged to provide, in order to identify unhappy reviewers. From this set of negative reviews, our baseline classifies as sarcastic those sentences that exhibit strong positive sentiment. The list of positive sentiment words is predefined and captures words typically found in reviews (for example, 'great', 'excellent', 'best', 'top', 'exciting', etc.).
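A sketch of this baseline is shown below; the star-rating cutoff for 'unhappy reviewers' and the sentiment word list are our own illustrative choices rather than the paper's exact resources.

```python
POSITIVE_WORDS = {"great", "excellent", "best", "top", "exciting"}  # illustrative

def star_sentiment_baseline(sentences, star_rating, negative_cutoff=3):
    """Flag as sarcastic any strongly positive sentence in a negative review.

    `sentences` are the sentences of one review and `star_rating` its star
    rating; reviews above `negative_cutoff` stars are not considered negative.
    """
    if star_rating > negative_cutoff:
        return [False] * len(sentences)
    return [any(w.strip('.,!?"').lower() in POSITIVE_WORDS for w in s.split())
            for s in sentences]

print(star_sentiment_baseline(["The best GPS device money can buy.",
                               "It stopped working after two days."], star_rating=1))
# [True, False]
```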
4.2 Evaluation procedure

We used two experimental frameworks to test SASI's accuracy. In the first experiment we evaluated the pattern acquisition process: how consistent it is and to what extent it contributes to correct classification. We did that by 5-fold cross-validation over the seed data.

In the second experiment we evaluated SASI on a test set of unseen sentences, comparing its output to a gold standard annotated by a large number of human annotators (using Mechanical Turk). This way we verify that there is no over-fitting and that the algorithm is not biased by the notion of sarcasm of a single seed annotator.

5-fold cross validation (Amazon). In this experimental setting, the seed data was divided into 5 parts and a 5-fold cross-validation test was executed. Each time, four parts of the seed were used as training data, and only this training portion was used for feature selection and data enrichment. This 5-fold process was repeated ten times, and the procedure was repeated with different sets of optional features.

We used 5-fold cross-validation and not the standard 10-fold because the number of seed examples (especially positive ones) is relatively small, hence 10-fold is too sensitive to the broad range of possible sarcastic patterns (see the examples in Section 1).
Classifying new sentences (Amazon & Twitter). Evaluation of sarcasm is a hard task due to the elusive nature of sarcasm, as discussed in Section 1. In order to evaluate the quality of our algorithm, we used SASI to classify all sentences in both corpora (except the small seed that was pre-annotated and used for the evaluation in the 5-fold cross-validation experiment). Since it is impossible to create a gold standard classification of each and every sentence in the corpus, we created a small test set by sampling 90 sentences that were classified as sarcastic (labels 3-5) and 90 sentences classified as not sarcastic (labels 1, 2). The sampling was performed on the whole corpus, leaving out only the seed data.

Again, the meta-data available in the Amazon dataset allows us a stricter evaluation. In order to make the evaluation harder for our algorithm and more relevant, we introduced two constraints on the sampling process: (i) we sampled only sentences containing a named entity or a reference to a named entity; this constraint was introduced in order to keep the evaluation set relevant, since sentences that refer to the named entity (the target of the review) are more likely to contain an explicit or implicit sentiment; (ii) we restricted the non-sarcastic sentences to belong to negative reviews (1-3 stars), so that all sentences in the evaluation set are drawn from the same population, increasing the chances that they convey various levels of direct or indirect negative sentiment. Note that the second constraint makes the problem harder: if sentences were taken from all reviews, many of them would be positive sentences that are clearly non-sarcastic, and this would bias the selection towards positive vs. negative samples instead of sarcastic vs. non-sarcastic samples.

Experimenting with the Twitter dataset, we simply classified each tweet into one of 5 classes (class 1: not sarcastic, class 5: clearly sarcastic) according to the label given by the algorithm. Just like in the evaluation on the Amazon dataset, we created a small evaluation set by sampling 90 sentences which were classified as sarcastic (labels 3-5) and 90 sentences classified as not sarcastic (labels 1, 2).

Procedure. Each evaluation set was randomly divided into 5 batches. Each batch contained 36 sentences from the evaluation set and 4 anchor sentences: two sarcastic and two clearly neutral. The anchor sentences were not part of the test set and were the same in all five batches. The purpose of the anchor sentences is to control the evaluation procedure and to verify that the annotation is reasonable. We ignored the anchor sentences when assessing the algorithm's accuracy.

We used Amazon's Mechanical Turk service (https://www.mturk.com/mturk/welcome) in order to create a gold standard for the evaluation. We employed 15 annotators for each evaluation set. We used a relatively large number of annotators in order to overcome the possible bias induced by subjectivity (Muecke, 1982). Each annotator was asked to assess the level of sarcasm of each sentence in a set of 40 sentences on a scale of 1-5. In total, each sentence was annotated by three different annotators.
Inter Annotator Agreement. To simplify the assessment of inter-annotator agreement, the scale was reduced to a binary classification in which 1 and 2 were marked as non-sarcastic and 3-5 as sarcastic (recall that 3 indicates a hint of sarcasm and 5 indicates 'clearly sarcastic'). We used the Fleiss' κ statistic to measure agreement between multiple annotators. The inter-annotator agreement statistic was κ = 0.34 on the Amazon dataset and κ = 0.41 on the Twitter dataset. These agreement statistics indicate fair agreement. Given the fuzzy nature of the task at hand, this κ value is certainly satisfactory.
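For reference, Fleiss' κ can be computed from the per-sentence category counts as in the sketch below (the standard formula; the binary categories and three ratings per sentence follow the setup described above).

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings[i][j]` = number of annotators assigning
    item i to category j; every item must have the same number of ratings."""
    N = len(ratings)
    n = sum(ratings[0])                                   # ratings per item
    k = len(ratings[0])
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# two categories (non-sarcastic, sarcastic), three annotators per sentence:
print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))     # 0.333...
```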
              Prec.   Recall  Accuracy  F-score
punctuation   0.256   0.312   0.821     0.281
patterns      0.743   0.788   0.943     0.765
pat+punct     0.868   0.763   0.945     0.812
enrich punct  0.4     0.390   0.832     0.395
enrich pat    0.762   0.777   0.937     0.769
all: SASI     0.912   0.756   0.947     0.827

Table 2: 5-fold cross-validation results on the Amazon gold standard using various feature types. punctuation: punctuation marks; patterns: patterns; enrich: after data enrichment; enrich punct: data enrichment based on punctuation only; enrich pat: data enrichment based on patterns only; SASI: all features combined.

            Prec.   Recall  FalsePos  FalseNeg  F-score
Star-sent.  0.5     0.16    0.05      0.44      0.242
SASI (AM)   0.766   0.813   0.11      0.12      0.788
SASI (TW)   0.794   0.863   0.094     0.15      0.827

Table 3: Evaluation on the Amazon (AM) and Twitter (TW) evaluation sets, obtained by averaging over 3 human annotations per sentence. TW results were obtained with cross-domain training.

              Prec.   Recall  Accuracy  F-score
punctuation   0.259   0.26    0.788     0.259
patterns      0.765   0.326   0.889     0.457
enrich punct  0.18    0.316   0.76      0.236
enrich pat    0.685   0.356   0.885     0.47
all no enrich 0.798   0.37    0.906     0.505
all SASI      0.727   0.436   0.896     0.545

We used SASI, the first robust algorithm for the recognition of sarcasm, to experiment with a novel Twitter dataset and to compare performance with an Amazon product reviews dataset. Evaluating in various ways and with different parameter configurations, we achieved high precision, recall and F-score on both datasets, even for cross-domain training and with no need for domain adaptation. In the future we will test the contribution of

References

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177, New York, NY, USA. ACM.

A. Katz. 2005. Discourse and social-cultural factors in understanding non literal language. In H. Colston and A. Katz, editors, Figurative Language Comprehension: Social and Cultural Influences, pages 183–208. Lawrence Erlbaum Associates.

Jingjing Liu, Yunbo Cao, Chin-Yew Lin, Yalou Huang, and Ming Zhou. 2007. Low-quality product review detection in opinion summarization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 334–342.

D. Wilson and D. Sperber. 1992. On verbal irony. Lingua, 87:53–76.