Hierarchical Classification
Afroze Ibrahim Baqapuri∗, Saad Saleh†, Muhammad U. Ilyas†,
Muhammad Murtaza Khan† and Ali Mustafa Qamar†
∗Dept. of Computer Science, School of Computer and Communication Sciences,
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
†Dept. of Electrical Engineering, School of Electrical Engineering and Computer Science,
Abstract—This paper addresses the problem of sentiment classification of short messages on microblogging platforms. We apply machine learning and pattern recognition techniques to design and implement a classification system for microblog messages, assigning them to one of three classes: positive, negative or neutral. As part of this work, we contributed a dataset consisting of approximately 10,000 tweets, each labeled on a five point sentiment scale by three different people. Experiments demonstrate a detection rate of approximately 70% and an average false alarm rate of approximately 18% across all three classes. The developed classifier has been made available for online use.

I. INTRODUCTION

Motivation & Problem Statement: The introduction and wide adoption of blogs and social networking platforms has been a multiplying factor in the number of data sources recording online communication. Twitter is a widely used microblogging platform with about 255 million active users [26]. In this work we design an algorithm to detect the polarity of sentiments expressed in tweets. When applied to tweets related to a certain subject or containing certain tags / keywords, sentiment analysis can be used as a barometer to gauge the public's sentiment on a subject. Sentiment analysis can also be performed on other, larger bodies of text, such as articles and blogs. However, the longer length of articles affords their authors the freedom to use complex linguistic expressions. Tweets, on the other hand, are limited to a length of only 140 characters, which forces Twitter users to express themselves concisely and succinctly. While this still permits the expression of complex sentiments (such as satire or jokes), the short length restricts microblogs to the communication of a single idea only. During live events, blogs and articles are also relatively slow to post updates, whereas tweets can be posted at the rate of several per minute. In this way, sentiment analysis of tweets provides a way of tracking a population's instantaneous sentiment regarding almost any subject of discussion. Recently, sentiment analysis of tweets has been used for applications such as estimating the market sentiment of a company stock to predict its movement [17], and predicting the outcomes of national elections [24].

Technical Challenges: First, a key challenge in detecting the polarity of sentiments in tweets lies in the fact that people express themselves in a great number of ways, even when writing about the same subject. Second, a key challenge for any data set of tweets lies in the lack of a clear ground truth with regard to the sentiments expressed. Third, with the growing number of share-by-Twitter buttons on websites, a large number of tweets merely provide links to news and opinion articles. Many of these consist of objective reports that are hard to classify as expressing a positive or negative sentiment.

Proposed Solution & Findings: Instead of delving into the complexity of language-dependent features, we treat tweets as bags of words. This has the disadvantage that it does not take into consideration the sequence in which words appear in tweets. However, at the same time, it allows us to work with a relatively smaller data set. We collected an original data set and had each tweet labeled by at least three independent human volunteers. This gave us a more reliable and stable sentiment rating for each tweet. We propose to use a hierarchical two-level statistical classifier [9], [4]. Its design follows standard data mining and machine learning practices [28]. To deal with the problem of many tweets containing objective statements expressing no sentiments, we also propose to rank the degree of objectivity / subjectivity of each tweet. We also determine its lexical sentiment polarity based on lexical features of the tweet. We explored a large set of 55 features including sentiment lexicon scores, capitalization, punctuation symbols, letter repetition, parts-of-speech tagging, emoticons and numbers / digits. At the second stage, classification is performed based on the estimated degree of objectivity / subjectivity of a tweet as well as its lexical sentiment polarity. The complete classifier categorizes tweets into not just two but three different categories, adding a third category for objective tweets expressing no sentiment.

Key Contributions: The contributions of this paper are fourfold:

1) An original data set of Twitter was collected and labeled with the help of multiple volunteers. Each tweet in this data set was rated by at least three people (Section-II).
2) All tweets in the data set were analysed based upon their features (Subsection-III-A). Using the performance metrics, top features were short-listed for the design of the classifier (Subsection-III-B).
3) A hierarchical two-level statistical classifier was designed that classifies tweets on a sentiment scale as positive, negative or neutral and yields a detection rate of approximately 70% (Subsection-III-C).
4) We developed a website that lets users run the developed classifier on real-time Twitter traffic (available at URL: http://tweet-mood-check.appspot.com/).

II. DATA SET

Data Acquisition: Data from Twitter was acquired using the API implemented in the Python module Tweetstream [13]. Tweetstream provides easy access to the Twitter streaming API through two different classes: SampleStream and FilterStream. SampleStream returns a small, random sample of all the tweets streaming in real time. FilterStream delivers tweets which match a given set of criteria, including keywords, user IDs and locations (for geo-tagged tweets). We used the SampleStream class to acquire data on four different days spanning a period of several weeks, in order to obtain a diverse data set and avoid sourcing tweets that may be skewed towards a particular trending topic at any one given time. Thus, data was collected on December 17, 2011, December 29, 2011, January 19, 2012 and February 8, 2012.

Cleaning & Preprocessing: In this step we applied a set of rules to remove short, non-English and similar tweets. The following rules are applied to clean the data set: (1) Remove retweets: all tweets which contain the commonly used string "RT" in the body of the tweet (denoting a retweet or repost) are removed. (2) Remove uninformative tweets: very short tweets are unlikely to contain much verbal information, so a minimum threshold of 20 characters is used as the cutoff tweet length. (3) Remove non-English tweets: words in tweets are compared with a list of 2,000 common English words; tweets with less than 15% of their content matching the list are discarded. (4) Remove similar tweets: every tweet is compared with every other tweet, and a tweet with more than 90% of its content matching any other tweet is discarded. After cleaning, only 10,173 tweets, about 30% of the original, remained, because roughly 70% of the collected tweets were retweets.

Ground Truth Labeling: We used three independent volunteers to rate the sentiment of each tweet in the cleaned data set on a five point scale (strongly positive, positive, neutral, negative, strongly negative). Volunteers were also asked to rate the objectivity / subjectivity of the statement made in each tweet on a three point scale (objective, ambiguous, subjective). For each tweet, the averaged sentiment rating of all three volunteers was taken as the final, true sentiment rating, called the ground truth. This averaging was performed by majority vote as a means of denoising tweet sentiment ratings. A larger number of volunteer ratings per tweet would be preferable; however, time and cost limitations forced us to limit ourselves to three ratings per tweet. The number of volunteers was deliberately kept odd to avoid ties. The guidelines given to volunteers are available in the appendix. Of the 10,173 tweets, 1,210 were either classified as ambiguous or were such that the human volunteers were unable to arrive at a rating consensus. Of the remaining 8,963 tweets on which consensus was reached, 2,543 were positive, 1,877 negative and 4,543 neutral. Of the same 8,963 tweets, 4,543 are objective tweets and 4,420 are subjective tweets (the sum of positive and negative tweets). We also calculated the human-human agreement for our tweet labeling task, the results of which are as follows.

Two criteria for determining consensus were tried for the labeling of tweets. The "strict" criterion of agreement required that all ratings assigned by all human labelers be in exact agreement, i.e. all ratings should be positive / strongly positive or all negative / strongly negative. Pairwise agreement (agreement between the first & second, second & third, and third & first labelers) between labels assigned by the three different humans was 58.9%, 59.9% and 62.5%. When the "lenient" criterion of agreement is used, 'ambiguous' ratings by human evaluators are disregarded. Pairwise agreement between labels assigned by the three different humans was then 65.1%, 67.1% and 73.0%.

Since the pairwise agreement rate lies in the 60−70% range, depending on the strictness of the definition of agreement, this shows that sentiment classification is inherently a difficult task even for human beings. For reference, we compare these values with the agreement rates between humans reported by Kim and Hovy [15] on the sentiment rating of various adjectives and verbs by three volunteers. For strict and lenient definitions of consensus, they saw agreement rates of 62−77% and 85−89%, respectively. Considering that the volunteers used by Kim and Hovy [15] only had to rate individual words instead of the phrases that tweets are composed of, the slightly lower agreement rate for our volunteers was to be expected. These results reiterate our initial claim that sentiment analysis is an inherently difficult task.
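The pairwise agreement rates above can be reproduced with a small helper along the following lines. This is our own sketch of the procedure, not the paper's code: the function name and label strings are illustrative, and strict agreement here assumes ratings have already been collapsed to polarity classes.

```python
def pairwise_agreement(ratings_a, ratings_b, lenient=False):
    """Fraction of items on which two labelers agree.

    Under the "lenient" criterion, items that either labeler
    rated 'ambiguous' are dropped before measuring agreement.
    """
    pairs = list(zip(ratings_a, ratings_b))
    if lenient:
        pairs = [(a, b) for a, b in pairs if 'ambiguous' not in (a, b)]
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)
```

Applying such a helper to each of the three labeler pairs yields the three pairwise percentages quoted above, once under each criterion.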
III. PROPOSED METHOD

A. Feature Extraction

We perform the following pre-processing steps to facilitate and simplify automated feature extraction. (1) URL Extraction: the start and end character positions of URL strings contained inside tweets are explicitly provided as part of the data structure accompanying a tweet. (2) Mention: the start and end character positions of mentions of other Twitter users contained in tweets are likewise explicitly provided as part of the data structure accompanying a tweet. (3) Punctuation removal: punctuation marks and numbers are removed, leaving only words. (4) Lowercase conversion: tweets are normalized by converting them to lowercase, which makes comparison with an English dictionary easier. (5) Tokenization: the process of breaking a stream of text up into words, symbols and other meaningful elements called tokens (whitespace characters are used as separators). (6) Stemming: the process of normalizing text by reducing derived forms to their root or stem [3]. The purpose of stemming is to simplify the identification of different forms of the same word while avoiding complex grammatical transformations of the word. We used Porter's stemming algorithm when comparing tokenized words in tweets against words in the dictionary. (7) Stop-word removal: stop words are a class of commonly appearing, low-information content words such as prepositions and pronouns [3], e.g. 'a', 'an', 'the', 'he', 'she', 'by', 'on', etc. It is convenient to remove these words because they hold no additional information, since they are used almost equally in all classes of text. (8) Parts-of-Speech (POS) Tagging: POS-tagging is the process of assigning a tag to each word in a sentence identifying its grammatical POS, i.e. noun, verb, adjective, adverb, coordinating conjunction.

There are two classifiers in our system that will be discussed in detail in the next section: (1) the objective-subjective (OS) classifier, and (2) the lexical polarity (LP) classifier. A list of 34 features was considered for the OS classifier and a separate list of 21 features was considered for the LP classifier. The former differentiates between the objective and subjective classes, while the latter differentiates between the positive and negative classes. The lists of features considered for the OS and LP classifiers are shown in the Appendix.

Features OS.4 and LP.3 are both probabilities, computed using the naïve Bayes model, assuming that features are independent. Let the discrete random variable (RV) C model the class membership of a tweet, i.e. for the OS classifier C ∈ {objective (+1), subjective (−1)} and for the LP classifier L ∈ {positive lexical polarity (+1), lexically neutral (0), negative lexical polarity (−1)}. Furthermore, let F1, F2, ..., Fn denote a set of n different RVs, each modeling one of the n features of a tweet. Then the naïve Bayes model computes the class membership of a tweet with a certain set of n features by Equ. 1.

P{C | F_OS.1, F_OS.2, ..., F_OS.35}
    = P{C} × P{F_OS.1, F_OS.2, ..., F_OS.35 | C} / P{F_OS.1, F_OS.2, ..., F_OS.35}
    ∝ P{C} × P{F_OS.1, F_OS.2, ..., F_OS.35 | C}        (1)

When features are assumed to be conditionally independent, the naïve Bayes model becomes as in Equ. 2.

P{L = +1 | F_LP.1, F_LP.2, ..., F_LP.21}
    = Π_{i=1}^{21} P{F_LP.i | L = +1} / ( Π_{i=1}^{21} P{F_LP.i | L = +1} + Π_{i=1}^{21} P{F_LP.i | L = −1} )        (4)

P{L = −1 | F_LP.1, F_LP.2, ..., F_LP.21} = 1 − P{L = +1 | F_LP.1, F_LP.2, ..., F_LP.21}        (5)

Features F_OS.7 and F_LP.3 of a tweet are joint probabilities of the joint occurrence of its constituent words in a tweet that makes an objective statement. The presence or absence of each word in a tweet is modeled by Bernoulli RVs W_i, where 1 ≤ i ≤ n and n is the number of words used. We treat tweets as bags of words, i.e. we assume that the occurrence of words in tweets is independent. This leads us to Equ. 6 and 7.

P{F_OS.7 | C = +1} = P{W_1, W_2, ..., W_n | C = +1} = Π_{i=1}^{n} P{W_i | C = +1}        (6)

Similarly,

P{F_LP.3 | L = +1} = P{W_1, W_2, ..., W_n | L = +1} = Π_{i=1}^{n} P{W_i | L = +1}        (7)

At this stage, P{C = +1 | F_OS.1, F_OS.2, ..., F_OS.35} and P{L = +1 | F_LP.1, F_LP.2, ..., F_LP.21} are a posteriori probabilities expressing the likelihood that a tweet exhibiting certain features belongs to one of the two classes or the other. When P{C = +1 | F_OS.1, F_OS.2, ..., F_OS.35} of a tweet, denoted for brevity by p_C(+1|tweet), is greater than 1/2, the tweet is labeled as containing an objective statement. Otherwise, it is labeled as a tweet containing a subjective statement (C = −1). Similarly, when P{L = +1 | F_LP.1, F_LP.2, ..., F_LP.21} of a tweet, denoted for brevity by p_L(+1|tweet), is greater than 1/2, the tweet is labeled as having positive lexical polarity. Otherwise, it is labeled as a tweet having negative lexical polarity (L = −1). The likelihoods p_C(+1|tweet) and p_L(+1|tweet) are the input features for the second stage classifier.
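The two-class decision rule built on Equ. 4 and 5 can be sketched as follows. This is a minimal illustration, assuming equal class priors (as the normalization in Equ. 4 implies); the function and variable names are ours, not the paper's, and the log-space trick is a standard numerical precaution rather than something the paper specifies.

```python
import math

def posterior_positive(likelihoods_pos, likelihoods_neg):
    """P{L=+1 | features} for two-class naive Bayes with equal
    priors: the product of per-feature likelihoods under L=+1,
    normalized by the sum of the products for both classes
    (Equ. 4). Sums of logs avoid underflow for many features."""
    log_pos = sum(math.log(p) for p in likelihoods_pos)
    log_neg = sum(math.log(p) for p in likelihoods_neg)
    m = max(log_pos, log_neg)
    num = math.exp(log_pos - m)
    return num / (num + math.exp(log_neg - m))

def lexical_polarity_label(p_pos):
    """Label +1 (positive lexical polarity) when the posterior
    exceeds 1/2, else -1 (Equ. 5's complement rule)."""
    return +1 if p_pos > 0.5 else -1
```

The complementary probability of Equ. 5 is simply `1 - posterior_positive(...)`, so only one posterior needs to be computed per tweet.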
[Figure: two histograms, (a) count vs. f_C and (b) count vs. f_L, both over the range 0 to 1; only the axis labels were recoverable from the original figure.]
[Bar charts; the highest-ranked features shown on the horizontal axes are OS.4, OS.15, OS.2, OS.1, OS.5, OS.3, OS.6 and OS.25 in panel (a), and LP.3, LP.1, LP.9, LP.2, LP.12 and LP.10 in panel (b).]

Fig. 2: Information gain of (a) OS.1 through OS.35 features, (b) LP.1 through LP.21 features.
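The rankings in Fig. 2 are by information gain, i.e. the reduction in class-label entropy once a feature's value is known. A generic sketch for discrete feature values follows; this is our own helper illustrating the standard definition IG = H(class) − H(class | feature), not the paper's code.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(labels) - sum_v P(feature=v) * H(labels | feature=v)."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional
```

Ranking the 34 OS and 21 LP features by this quantity over the labeled data set produces orderings of the kind plotted in Fig. 2.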
[Figure 3 plot: scatter of tweets with horizontal axis "Subjective <−> Objective" and vertical axis "Lexically Negative <−> Lexically Positive", both ranging 0 to 1; only the axis labels were recoverable.]

Fig. 5: Classification boundaries of (a) naïve Bayes classifier, (b) C4.5 / J48 decision tree classifier, and (c) SVM classifier. f_C and f_L are on the horizontal and vertical axes, respectively.
Fig. 3: Scatterplot of tweets in the OS – LP plane. '+' denotes positive (p) labeled tweets, '×' denotes negative (n) labeled tweets and '◦' denotes neutral (u) labeled tweets.

Fig. 4: High-level architecture of the 2-level hierarchical classifier.

…give the weighted average of each performance metric. Each class' metric is weighted by the number of instances in the test set. These metrics, like all other reported metrics, are computed using 10-fold cross validation. With the exception of the ROC Area, which is slightly greater for naïve Bayes, all remaining metrics of the two classifiers show almost equal performance. Both classifiers give a TP of approximately 70%. Naïve Bayes has an average FP of 18%, which is lower than the 20% average FP rate of both the decision tree and SVM. However, SVM provided an FP of only 12% for the positive sentiment class, compared to the much higher 22% and 29% of the naïve Bayes and decision tree classifiers. Precision, recall and F-measure of both classifiers are all at approximately 70%. The small difference in FP between classifiers manifests itself again in the value of the ROC Area. Naïve Bayes has an ROC Area of 84% against the decision tree's 81% and SVM's much lower 75%. A more detailed analysis of these results shows that SVM has a much lower TP rate of 65% for positive sentiment tweets, compared to 73% and 79% for the naïve Bayes and C4.5 decision tree, respectively. Fig. 3 clearly shows that the distribution of tweets of all three classes in the OS – LP plane does not lend itself to easy separation using linear classifiers of the kind SVM generates. The inherent non-linearity of the class boundaries in Fig. 3 explains SVM's poorer overall performance relative to the naïve Bayes and decision tree classifiers.

V. RELATED WORK

The problem of analyzing tweets for sentiments is similar to phrase level sentiment analysis. [27] did seminal work in
TABLE I: Performance table of naïve Bayes classifier.

Class   | TP  | FP  | Prec | Rec | F-Meas | ROC
p       | .73 | .22 | .77  | .73 | .75    | .83
u       | .70 | .15 | .65  | .70 | .67    | .85
n       | .63 | .11 | .61  | .63 | .62    | .86
Wt Avg  | .70 | .18 | .70  | .70 | .70    | .84

TABLE III: Performance table of SVM classifier.

Class   | TP  | FP  | Prec | Rec | F-Meas | ROC
p       | .65 | .12 | .68  | .65 | .67    | .77
u       | .79 | .29 | .74  | .79 | .76    | .75
n       | .57 | .08 | .64  | .57 | .60    | .74
Wt Avg  | .70 | .20 | .70  | .70 | .70    | .75
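For reference, the F-measure column and the "Wt Avg" row of the performance tables can be recomputed from the other entries. The helpers below are our own sketch, with class instance counts taken from Section II.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F-Meas column)."""
    return 2 * precision * recall / (precision + recall)

def weighted_average(per_class_metric, class_counts):
    """Average a per-class metric, weighting each class by its
    number of test-set instances (the 'Wt Avg' table row)."""
    total = sum(class_counts.values())
    return sum(per_class_metric[c] * class_counts[c] / total
               for c in per_class_metric)
```

For example, `f_measure(0.77, 0.73)` reproduces the 0.75 entry of Table I's positive row, and weighting the naïve Bayes TP rates {p: .73, u: .70, n: .63} by the class counts {p: 2,543, u: 4,543, n: 1,877} from Section II gives roughly the reported 0.70 weighted average.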