
Sentiment Classification of Tweets using Hierarchical Classification

Afroze Ibrahim Baqapuri∗, Saad Saleh†, Muhammad U. Ilyas†, Muhammad Murtaza Khan† and Ali Mustafa Qamar†

∗ Dept. of Computer Science, School of Computer and Communication Sciences,
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
† Dept. of Electrical Engineering, School of Electrical Engineering and Computer Science,
National University of Sciences and Technology, H-12, Islamabad – 44000, Pakistan

Email: afroze.baqapuri@alumni.epfl.ch∗, {saad.saleh, usman.ilyas, muhammad.murtaza, mustafa.qamar}@seecs.edu.pk†

Abstract—This paper addresses the problem of sentiment classification of short messages on microblogging platforms. We apply machine learning and pattern recognition techniques to design and implement a classification system for microblog messages, assigning them to one of three classes: positive, negative or neutral. As part of this work, we contributed a dataset consisting of approximately 10,000 tweets, each labeled on a five-point sentiment scale by three different people. Experiments demonstrate a detection rate of approximately 70% and an average false alarm rate of approximately 18% across all three classes. The developed classifier has been made available for online use.

I. INTRODUCTION

Motivation & Problem Statement: The introduction and wide adoption of blogs and social networking platforms has been a multiplying factor in the number of data sources recording online communication. Twitter is a widely used microblogging platform with about 255 million active users [26]. In this work we design an algorithm to detect the polarity of sentiments expressed in tweets. When applied to tweets related to a certain subject or containing certain tags / keywords, sentiment analysis can be used as a barometer to gauge the public's sentiment on a subject. Sentiment analysis can also be performed on other, larger bodies of text, such as articles and blogs. However, the longer length of articles affords their authors the freedom to use complex linguistic expressions. Tweets, on the other hand, are limited to a length of 140 characters, which forces Twitter users to express themselves concisely and succinctly. While this still permits the expression of complex sentiments (such as satire or jokes), the short length restricts microblogs to the communication of a single idea. During live events, blogs and articles are also relatively slow to post updates, whereas tweets can be posted at the rate of several per minute. In this way, sentiment analysis of tweets provides a way of tracking a population's instantaneous sentiment regarding almost any subject of discussion. Recently, sentiment analysis of tweets has been used for applications such as estimating the market sentiment of a company stock to predict its movement [17], and predicting the outcomes of national elections [24].

Technical Challenges: First, the key challenge of detecting the polarity of sentiments in tweets lies in the fact that people express themselves in a great number of ways, even when writing about the same subject. Second, a key challenge for any data set of tweets lies in the lack of a clear ground truth with regard to the sentiments expressed. Third, with the growing number of share-by-Twitter buttons on websites, a large number of tweets merely provide links to news and opinion articles. Many of these consist of objective reports that are hard to classify as expressing a positive or negative sentiment.

Proposed Solution & Findings: Instead of delving into the complexity of language-dependent features, we treat tweets as bags of words. This has the disadvantage that it does not take into consideration the sequence in which words appear in tweets. However, at the same time, it allows us to work with a relatively small data set. We collected an original data set and had each tweet labeled by at least three independent human volunteers, which gave us a more reliable and stable sentiment rating for each tweet. We propose a hierarchical, two-level statistical classifier [9], [4], designed following standard practices of data mining and machine learning [28]. To deal with the problem of many tweets containing objective statements expressing no sentiment, we also rank the degree of objectivity / subjectivity of each tweet, and we determine its lexical sentiment polarity based on lexical features of the tweet. We explored a very large set of 55 features, including sentiment lexicon scores, capitalization, punctuation symbols, letter repetition, parts-of-speech tags, emoticons and numbers / digits. At the second stage, classification is performed based on the estimated degree of objectivity / subjectivity of a tweet as well as its lexical sentiment polarity. The complete classifier categorizes tweets into not just two but three different categories, adding a third category for objective tweets expressing no sentiment.

Key Contributions: The contributions of this paper are fourfold:
1) An original data set of tweets was collected and labeled with the help of multiple volunteers. Each tweet in this data set was rated by at least three people (Section II).
2) All tweets in the data set were analysed based upon their features (Subsection III-A). Using the performance metrics, top features were short-listed for the design of the classifier (Subsection III-B).
3) A hierarchical two-level statistical classifier was designed that classifies tweets on a sentiment scale as positive, negative or neutral and yields a detection rate of approximately 70% (Subsection III-C).
4) We developed a website that lets users run the developed classifier on real-time Twitter traffic (available at http://tweet-mood-check.appspot.com/).
II. DATA SET

Data Acquisition: Data from Twitter was acquired using the API implemented in the Python module Tweetstream [13]. Tweetstream provides easy access to the Twitter streaming API through two classes: SampleStream and FilterStream. SampleStream returns a small, random sample of all tweets streaming in real time, while FilterStream delivers tweets which match a given set of criteria, including keywords, user IDs and locations (for geo-tagged tweets). We used the SampleStream class to acquire data on four different days spanning a period of several weeks, in order to obtain a diverse data set and avoid sourcing tweets that may be skewed towards a topic trending at any one given time. Data was collected on December 17, 2011, December 29, 2011, January 19, 2012 and February 8, 2012.
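For illustration, the listing below sketches this acquisition step in Python. It is a minimal sketch, assuming the basic-auth streaming interface that Tweetstream [13] exposed at the time of collection; the credentials, output file name and tweet limit are placeholders rather than values from our setup.

import json
import tweetstream

# Placeholder credentials; Tweetstream authenticated against Twitter's
# basic-auth streaming endpoints available at the time of collection.
USERNAME, PASSWORD = "example_user", "example_pass"

def collect_sample(out_path="tweets_sample.jsonl", limit=100000):
    """Save a random sample of public tweets, one JSON object per line."""
    stream = tweetstream.SampleStream(USERNAME, PASSWORD)
    with open(out_path, "a") as out:
        for i, tweet in enumerate(stream):
            # Each item is a dict mirroring the raw JSON payload of a tweet.
            out.write(json.dumps(tweet) + "\n")
            if i + 1 >= limit:
                break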
Cleaning & Preprocessing: In this step we applied a set of rules to remove short, non-English and near-duplicate tweets. The following rules are applied to clean the data set: (1) Remove retweets: All tweets which contain the commonly used string "RT" in the body of the tweet (denoting a retweet or repost) are removed. (2) Remove uninformative tweets: Very short tweets are unlikely to contain much verbal information; a minimum threshold of 20 characters is used as the cutoff tweet length. (3) Remove non-English tweets: The words in each tweet are compared with a list of 2,000 common English words, and tweets with less than 15% of their content matching this list are discarded. (4) Remove similar tweets: Every tweet is compared with every other tweet, and any tweet whose content matches another tweet by more than 90% is discarded. After cleaning, only 10,173 tweets remained, 30% of the original, because roughly 70% of the collected tweets were retweets.
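A compact sketch of these four rules is given below, assuming each tweet is available as a plain string and that common_words holds the list of 2,000 common English words; the word-overlap measure standing in for the 90% content-match test is our simplification.

def is_retweet(text):
    return "RT" in text.split()                    # rule (1): reposts

def is_too_short(text):
    return len(text) < 20                          # rule (2): < 20 characters

def is_non_english(text, common_words):
    words = text.lower().split()
    if not words:
        return True
    hits = sum(1 for w in words if w in common_words)
    return hits / len(words) < 0.15                # rule (3): < 15% dictionary hits

def similarity(a, b):
    # Jaccard word overlap: a simple stand-in for the 90% content-match test.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def clean(tweets, common_words):
    kept = []
    for t in tweets:
        if is_retweet(t) or is_too_short(t) or is_non_english(t, common_words):
            continue
        if any(similarity(t, k) > 0.9 for k in kept):  # rule (4): near-duplicates
            continue
        kept.append(t)
    return kept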
Ground Truth Labeling: We used three independent volunteers to rate the sentiment of each tweet in the cleaned data set on a five-point scale (strongly positive, positive, neutral, negative, strongly negative). Volunteers were also asked to rate the objectivity / subjectivity of the statement made in each tweet on a three-point scale (objective, ambiguous, subjective). For each tweet, the averaged sentiment rating of the three volunteers was taken as the final, true sentiment rating, called the ground truth; this averaging was performed by majority vote as a means of denoising tweet sentiment ratings. A larger number of volunteer ratings per tweet would be preferable, but time and cost limitations forced us to limit ourselves to three ratings per tweet. The number of volunteers was deliberately kept odd to avoid ties. The guidelines given to volunteers are available in the appendix.
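As a concrete illustration of the majority-vote rule, consider the sketch below; the label strings are ours, and collapsing the strongly positive / strongly negative ratings onto their base polarities reflects our reading of the procedure.

from collections import Counter

# Five-point ratings collapsed onto the three final classes (our convention).
COLLAPSE = {"strongly positive": "positive", "positive": "positive",
            "neutral": "neutral",
            "negative": "negative", "strongly negative": "negative"}

def consensus(ratings):
    """ratings: the three volunteers' labels; returns the majority label,
    or None when no 2-of-3 majority exists (no consensus)."""
    votes = Counter(COLLAPSE[r] for r in ratings)
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else None

print(consensus(["positive", "strongly positive", "neutral"]))  # -> positive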
Of the 10,173 tweets, 1,210 were either classified as ambiguous or were such that the human volunteers were unable to arrive at a rating consensus. Of the remaining 8,963 tweets on which consensus was reached, 2,543 were positive, 1,877 negative and 4,543 neutral. Of the same 8,963 tweets, 4,543 are objective tweets and 4,420 are subjective tweets (the sum of the positive and negative tweets). We also calculated the human-human agreement for our tweet labeling task, with the following results.

Two criteria for determining consensus were tried for the labeling of tweets. The "strict" criterion of agreement requires that all ratings assigned by all raters be in exact agreement, i.e. all ratings should be positive / strongly positive or all negative / strongly negative. Under this criterion, the pairwise agreement (between the first & second, second & third, and third & first labelers) of the labels assigned by the three humans is 58.9%, 59.9% and 62.5%, respectively. Under the "lenient" criterion of agreement, 'ambiguous' ratings by human evaluators are disregarded; the pairwise agreement of the labels assigned by the three humans is then 65.1%, 67.1% and 73.0%, respectively.

Since the pairwise agreement rate lies in the 60–70% range, depending on the strictness of the definition of agreement, sentiment classification is evidently a difficult task even for human beings. For reference, we compare these values with the agreement rates between humans reported by Kim and Hovy [15] for the sentiment rating of various adjectives and verbs by three volunteers. For strict and lenient definitions of consensus, they saw agreement rates of 62–77% and 85–89%, respectively. Considering that the volunteers used by Kim and Hovy [15] only had to rate individual words instead of the phrases that tweets are composed of, the slightly lower agreement rate for our volunteers was to be expected. These results reiterate our initial claim that sentiment analysis is an inherently difficult task.
III. PROPOSED METHOD

A. Feature Extraction

We perform the following pre-processing steps to facilitate and simplify automated feature extraction. (1) URL extraction: The start and end character positions of URL strings contained inside tweets are explicitly provided as part of the data structure accompanying a tweet. (2) Mention extraction: The start and end character positions of mentions of other Twitter users contained in tweets are likewise provided as part of the data structure accompanying a tweet. (3) Punctuation removal: Punctuation marks and numbers are removed, leaving only words. (4) Lowercase conversion: Tweets are normalized by converting them to lowercase, which makes comparison with an English dictionary easier. (5) Tokenization: This is the process of breaking a stream of text up into words, symbols and other meaningful elements called tokens (whitespace characters are used as separators). (6) Stemming: This is the process of normalizing text by reducing derived forms to their root or stem [3]. The purpose of stemming is to simplify the identification of different forms of the same word while avoiding complex grammatical transformations of the word. We used Porter's stemming algorithm when comparing tokenized words in tweets against words in the dictionary. (7) Stop-word removal: Stop words are a class of commonly appearing, low-information-content words such as prepositions and pronouns [3], e.g. 'a', 'an', 'the', 'he', 'she', 'by', 'on', etc. It is convenient to remove these words because they hold little additional information, since they appear almost equally in all classes of text. (8) Parts-of-Speech (POS) tagging: POS tagging is the process of assigning a tag to each word in a sentence identifying its grammatical part of speech, e.g. noun, verb, adjective, adverb, coordinating conjunction.
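The listing below sketches steps (3) through (8) with NLTK [3]; only Porter's stemming algorithm is prescribed above, so the tokenizer, stop-word list and POS tagger from NLTK are illustrative stand-ins.

import re
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # (3) + (4): remove punctuation marks and digits, convert to lowercase.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = text.split()                  # (5): whitespace tokenization
    tagged = pos_tag(tokens)               # (8): POS-tag before stemming
    return [(stemmer.stem(w), tag)         # (6): Porter stemming
            for w, tag in tagged
            if w not in stop_words]        # (7): stop-word removal

print(preprocess("She is posting the coolest tweets, by far!"))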
There are two classifiers in our system that will be discussed in detail in the next section: (1) the objective-subjective (OS) classifier, and (2) the lexical polarity (LP) classifier. A list of 34 features was considered for the OS classifier and a separate list of 21 features was considered for the LP classifier. The former differentiates between the objective and subjective classes, while the latter differentiates between the positive and negative classes. The lists of features considered for the OS classifier and the LP classifier are given in the appendix.
Features OS.4 and LP.3 are both probabilities, computed using the naïve Bayes model under the assumption that features are independent. Let the discrete random variable (RV) C model the class membership of a tweet for the OS classifier, C ∈ {objective (+1), subjective (−1)}, and let L do the same for the LP classifier, L ∈ {positive lexical polarity (+1), lexically neutral (0), negative lexical polarity (−1)}. Furthermore, let F_1, F_2, ..., F_n denote a set of n different RVs, each modeling one of the n features of a tweet. Then the naïve Bayes model computes the class membership of a tweet with a certain set of n features by Equ. 1:

P\{C \mid F_{OS.1}, F_{OS.2}, \ldots, F_{OS.34}\}
  = \frac{P\{C\} \, P\{F_{OS.1}, F_{OS.2}, \ldots, F_{OS.34} \mid C\}}{P\{F_{OS.1}, F_{OS.2}, \ldots, F_{OS.34}\}}
  \propto P\{C\} \, P\{F_{OS.1}, F_{OS.2}, \ldots, F_{OS.34} \mid C\}    (1)

When the features are assumed to be conditionally independent (and the two classes equally likely a priori), the naïve Bayes model becomes Equ. 2:

P\{C = +1 \mid F_{OS.1}, F_{OS.2}, \ldots, F_{OS.34}\}
  = \frac{\prod_{i=1}^{34} P\{F_{OS.i} \mid C = +1\}}{\prod_{i=1}^{34} P\{F_{OS.i} \mid C = +1\} + \prod_{i=1}^{34} P\{F_{OS.i} \mid C = -1\}}    (2)

P\{C = -1 \mid F_{OS.1}, F_{OS.2}, \ldots, F_{OS.34}\} = 1 - P\{C = +1 \mid F_{OS.1}, F_{OS.2}, \ldots, F_{OS.34}\}    (3)

Similarly,

P\{L = +1 \mid F_{LP.1}, F_{LP.2}, \ldots, F_{LP.21}\}
  = \frac{\prod_{i=1}^{21} P\{F_{LP.i} \mid L = +1\}}{\prod_{i=1}^{21} P\{F_{LP.i} \mid L = +1\} + \prod_{i=1}^{21} P\{F_{LP.i} \mid L = -1\}}    (4)

P\{L = -1 \mid F_{LP.1}, F_{LP.2}, \ldots, F_{LP.21}\} = 1 - P\{L = +1 \mid F_{LP.1}, F_{LP.2}, \ldots, F_{LP.21}\}    (5)

Features F_{OS.7} and F_{LP.3} of a tweet are joint probabilities of the occurrence of its constituent words in a tweet belonging to a given class. The presence or absence of each word in a tweet is modeled by Bernoulli RVs W_i, where 1 ≤ i ≤ n and n is the number of words used. We treat tweets as bags of words, i.e. we assume that the occurrences of words in tweets are independent. This leads to Equ. 6 and 7:

P\{F_{OS.7} \mid C = +1\} = P\{W_1, W_2, \ldots, W_n \mid C = +1\} = \prod_{i=1}^{n} P\{W_i \mid C = +1\}    (6)

Similarly,

P\{F_{LP.3} \mid L = +1\} = P\{W_1, W_2, \ldots, W_n \mid L = +1\} = \prod_{i=1}^{n} P\{W_i \mid L = +1\}    (7)

At this stage, P\{C = +1 \mid F_{OS.1}, \ldots, F_{OS.34}\} and P\{L = +1 \mid F_{LP.1}, \ldots, F_{LP.21}\} are a posteriori probabilities expressing the likelihood that a tweet exhibiting certain features belongs to one class or the other. When P\{C = +1 \mid F_{OS.1}, \ldots, F_{OS.34}\} of a tweet, denoted for brevity by p_C(+1|tweet), is greater than 1/2, the tweet is labeled as containing an objective statement; otherwise, it is labeled as containing a subjective statement (C = −1). Similarly, when P\{L = +1 \mid F_{LP.1}, \ldots, F_{LP.21}\} of a tweet, denoted for brevity by p_L(+1|tweet), is greater than 1/2, the tweet is labeled as having positive lexical polarity; otherwise, it is labeled as having negative lexical polarity (L = −1). The likelihoods p_C(+1|tweet) and p_L(+1|tweet) are the input features for the second-stage classifier.
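The decision rules of Equ. 2–7 reduce to a few lines of code; the sketch below assumes the per-feature likelihood tables have already been estimated from the training folds, and it works in log space for numerical stability (a detail implicit in the equations above). The function and variable names are ours.

import numpy as np

def posterior_positive(log_lik_pos, log_lik_neg):
    """Equ. 2 / 4 with equal priors: the two products of per-feature
    likelihoods, evaluated in log space."""
    a, b = np.sum(log_lik_pos), np.sum(log_lik_neg)
    return 1.0 / (1.0 + np.exp(b - a))     # = prod_pos / (prod_pos + prod_neg)

def word_log_likelihood(words, p_word_given_class):
    """Equ. 6 / 7: bag-of-words model, log P{W1,...,Wn | class}. Unseen
    words get a small floor probability to avoid log(0)."""
    return sum(np.log(p_word_given_class.get(w, 1e-6)) for w in words)

def label(posterior):
    """Threshold at 1/2, as in the text: +1 (objective / lexically
    positive) versus -1 (subjective / lexically negative)."""
    return +1 if posterior > 0.5 else -1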

B. Feature Selection

We do not use all available features to compute p_C(+1|tweet) and p_L(+1|tweet). Instead, for each classifier (OS and LP), we select a small subset of the most informative features to compute the likelihoods. We use information gain to rank each feature's average usefulness in predicting the class label of tweets, using the Waikato Environment for Knowledge Analysis (WEKA) [28] for the task of feature ranking. For example, the information gain [8] between a particular feature (modeled by the random variable OS.i) and the class label (modeled by the random variable C) is defined as:

I(OS.i; C) = \sum_{c} \sum_{os.i} P_{OS.i,C}\{os.i, c\} \log_2 \frac{P_{OS.i,C}\{os.i, c\}}{P_{OS.i}\{os.i\} \, P_C\{c\}}    (8)

where c ranges over the two class labels and os.i over the values of the feature.
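For exposition, Equ. 8 can also be evaluated directly from empirical frequencies, as sketched below; the ranking reported in this paper was produced with WEKA's information gain evaluator, so this standalone version and its names are ours.

import numpy as np
from collections import Counter

def information_gain(feature_values, labels):
    """Equ. 8: mutual information I(OS.i; C) from empirical joint frequencies."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    p_f, p_c = Counter(feature_values), Counter(labels)
    gain = 0.0
    for (f, c), cnt in joint.items():
        p_fc = cnt / n
        gain += p_fc * np.log2(p_fc / ((p_f[f] / n) * (p_c[c] / n)))
    return gain

# A feature perfectly aligned with the class label yields H(C) = 1 bit.
print(information_gain([1, 1, 0, 0], ["obj", "obj", "subj", "subj"]))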

Fig. 1: Normalized histograms of (a) fC scores of tweets, (b) fL scores of tweets.

Plots of the information gains of the OS and LP class labels for all features OS.1 through OS.34 and LP.1 through LP.21, sorted in descending order, are shown in Fig. 2a and Fig. 2b, respectively. The information gain functions obtained from each fold are almost identical across folds, which means the information gain of features is stable across different data sets. The selected features are therefore expected to be stable and to perform equally well on other data sets. We selected the six most significant OS features and the six most significant LP features by information gain; they are listed below.
[OS.4] Unigram word models
[OS.15] Number of emoticons
[OS.2] Presence of emoticons
[OS.1] Presence of URL
[OS.5] Number of pronouns of all forms
[OS.3] Prior polarity of words through online lexicon MPQA
[LP.3] (Scaled) likelihood based on unigram word model
[LP.1] Net total emoticon score
[LP.9] Number of negative words from MPQA lexicon
[LP.2] Net total score from online polarity lexicon MPQA
[LP.12] Number of present participle verbs
[LP.10] Number of base-form verbs
the number of correctly classified instances is 6, 254 (70.27%)
C. Classifier Design

We are now left with two likelihoods, p_C(+1|tweet) and p_L(+1|tweet), which for simplicity we denote by f_C and f_L, respectively. These two likelihoods are the features used for a second level of classification of tweets into one of three classes: positive, negative and neutral. Fig. 1a shows the normalized histograms of the f_C scores of tweets, plotted separately for all three classes of tweets. Similarly, Fig. 1b shows the normalized histograms of the f_L scores of tweets, plotted separately for all three classes. Fig. 3 is a scatter plot of tweets in the f_C–f_L plane.

As Fig. 4 depicts, the Level 2 (L2) classifier is designed to label each tweet with the kind of sentiment (modeled by random variable S) expressed in it: positive (p), negative (n) or neutral (u). We used three different classifiers for the Level 2 classification: the naïve Bayes classifier defined earlier in Equ. 1, a C4.5 decision tree classifier [28], and an SVM, each operating on the two high-level features f_C and f_L to predict the sentiment S. We used the classifier implementations in WEKA [12]. Fig. 5a shows the class boundaries in the two-dimensional plane of the f_C and f_L features, as determined by the naïve Bayes classifier. Fig. 5b shows the class boundaries in the same plane as determined by the C4.5 decision tree classifier, using WEKA's Java-based implementation (named J48). Fig. 5c shows the class boundaries in the same plane as determined by the SVM classifier, using the LibSVM library developed by [7], plugged into WEKA. In the following section, we describe and analyze the performance of these classifiers.
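For readers without a WEKA setup, the Level 2 stage can be approximated in Python as sketched below; scikit-learn's Gaussian naïve Bayes, CART decision tree and linear SVC are substituted for WEKA's naïve Bayes, J48 and LibSVM, and the matrix of f_C, f_L values is random placeholder data rather than our data set.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier   # CART, standing in for C4.5 / J48
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder data: one row per tweet, columns f_C and f_L in [0, 1].
X = np.random.rand(1000, 2)
y = np.random.choice(["p", "n", "u"], size=1000)  # sentiment labels S

for clf in (GaussianNB(), DecisionTreeClassifier(), SVC(kernel="linear")):
    scores = cross_val_score(clf, X, y, cv=10)    # 10-fold cross-validation
    print(type(clf).__name__, round(scores.mean(), 3))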

Fig. 2: Information gain of (a) the OS.1 through OS.34 features, (b) the LP.1 through LP.21 features.

Fig. 3: Scatter plot of tweets in the OS–LP plane. '+' denotes positive (p) labeled tweets, '×' denotes negative (n) labeled tweets and '◦' denotes neutral (u) labeled tweets.

Fig. 4: High-level architecture of the two-level hierarchical classifier.

Fig. 5: Classification boundaries of (a) the naïve Bayes classifier, (b) the C4.5 / J48 decision tree classifier, and (c) the SVM classifier. f_C and f_L are on the horizontal and vertical axes, respectively.

IV. RESULTS

This section describes the classification results of three classification algorithms: the naïve Bayes, C4.5 decision tree and SVM classifiers. All results reported in this section are based on 10-fold cross-validation [28]. The number of instances correctly classified by the naïve Bayes classifier is 6,223 (69.92%) and the number of incorrectly classified instances is 2,677 (30.08%). For the C4.5 / J48 decision tree classifier, the number of correctly classified instances is 6,254 (70.27%) and the number of incorrectly classified instances is 2,646 (29.73%).

We also report the true positive rate (TP), false positive rate (FP), precision, recall, F-measure and the receiver operating characteristic (ROC) area for each of the three tweet classes. For the naïve Bayes and C4.5 decision tree classifiers these metrics are given in Tab. I and Tab. II, respectively, and for the SVM in Tab. III. Along with the class-wise performance breakdown, the last line of each table gives the weighted average of each performance metric, where each class' metric is weighted by the number of instances of that class in the test set. These metrics, like all other reported metrics, are computed using 10-fold cross-validation.

With the exception of the ROC area, which is slightly greater for naïve Bayes, all remaining metrics of the naïve Bayes and decision tree classifiers show almost equal performance: both give a TP rate of approximately 70%. Naïve Bayes has an average FP rate of 18%, which is lower than the 20% average FP rate of both the decision tree and the SVM. However, the SVM achieved an FP rate of only 12% for the positive sentiment class, compared to the much higher 22% and 29% of the naïve Bayes and decision tree classifiers, respectively. Precision, recall and F-measure of both the naïve Bayes and decision tree classifiers are all at approximately 70%. The small difference in FP rates between the classifiers manifests itself again in the ROC area: naïve Bayes has an ROC area of 84%, against the decision tree's 81% and the SVM's much lower 75%. A more detailed analysis of these results shows that the SVM has a much lower TP rate of 65% for positive sentiment tweets, compared to 73% and 79% for the naïve Bayes and C4.5 decision tree classifiers, respectively. Fig. 3 clearly shows that the distribution of tweets of the three classes in the OS–LP plane does not lend itself to easy separation by linear classifiers of the kind the SVM generates. The inherent non-linearity of the class boundaries in Fig. 3 explains the SVM's poorer overall performance relative to the naïve Bayes and decision tree classifiers.
TP    FP    Prec   Rec   F-Meas   ROC   Class
.73   .22   .77    .73   .75      .83   p
.70   .15   .65    .70   .67      .85   u
.63   .11   .61    .63   .62      .86   n
.70   .18   .70    .70   .70      .84   Wt Avg

TABLE I: Performance table of the naïve Bayes classifier.

TP    FP    Prec   Rec   F-Meas   ROC   Class
.79   .29   .74    .79   .76      .80   p
.66   .13   .68    .66   .67      .82   u
.55   .08   .64    .55   .59      .82   n
.70   .20   .70    .70   .70      .81   Wt Avg

TABLE II: Performance table of the C4.5 decision tree classifier.

TP    FP    Prec   Rec   F-Meas   ROC   Class
.65   .12   .68    .65   .67      .77   p
.79   .29   .74    .79   .76      .75   u
.57   .08   .64    .57   .60      .74   n
.70   .20   .70    .70   .70      .75   Wt Avg

TABLE III: Performance table of the SVM classifier.
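The per-class rows and "Wt Avg" lines of Tab. I–III can be reproduced from cross-validated predictions as sketched below; scikit-learn's weighted average weights each class by its instance count, matching the tables, and the toy labels shown are placeholders for the cross-validated ground truth and predictions.

from sklearn.metrics import classification_report, confusion_matrix

y_true = ["p", "u", "n", "u", "p", "u"]   # placeholder ground-truth labels
y_pred = ["p", "u", "u", "u", "n", "u"]   # placeholder cross-validated predictions

# The precision / recall / f1-score columns correspond to Prec / Rec
# (= TP rate) / F-Meas; the "weighted avg" row corresponds to "Wt Avg".
print(classification_report(y_true, y_pred, digits=2))

# Per-class FP rates can be derived from the off-diagonal counts here.
print(confusion_matrix(y_true, y_pred, labels=["p", "u", "n"]))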
V. RELATED WORK

The problem of analyzing tweets for sentiments is similar to phrase-level sentiment analysis. [27] did seminal work in this area and formulated it as classification between prior and contextual polarity. Unlike contextual polarity, the prior polarity of a word is the general sentiment associated with the word (positive, negative or neutral) without any information about the specific context it is being used in. [27] stated that the prior polarity of a word may in fact differ significantly from its contextual use in a particular phrase, citing illustrative examples. Therefore, the prior polarities of individual words may not be enough to accurately classify phrases by sentiment polarity.

The sentiment classification problem is often formulated as a supervised learning problem. A broad variety of methods has been explored for labeling training data, ranging from completely manual to completely automatic. The most reliable method is labeling by human subjects, although it is also the most time-consuming and expensive approach. A very widely used alternative is the use of emoticons to automatically label tweets as positive or negative, which found wide adoption due to its ease of use, e.g. [16], [20] and [10]. However, it has the drawback that it cannot be used for phrases that do not contain emoticons. Some other techniques include using tags ([16]) or classification labels generated by some other noisy classifier ([2]). These methods have the advantage of labeling tweets as neutral, which is not possible relying on emoticons alone; however, they add noise to the labeling process.

Some of the earliest work in this field classified text phrases only as positive or negative, assuming that the data provided is subjective ([10] and [21]). While this may be a valid assumption for inherently subjective content (reviews, opinion columns), tweets and blogs often contain many objective posts as well that have to be considered, making it necessary to incorporate a neutral class in the classification process. Some of the works which included a neutral class in the classifier are [16], [2], [20], [27] and [14].

Some recent work goes one step further by attempting to classify emotions in tweets not just as positive or negative, but on a multi-dimensional, finer and more granular scale that includes a wider range of emotions. [5] developed a technique to classify tweets by six distinct moods: tension, depression, anger, vigor, fatigue and confusion, using an extended version of the profile of mood states (POMS) ([18]), a widely accepted psychometric instrument, for the automated labeling of tweets. [11] compares the performance of previous approaches employing emoticons, word sentiments, etc., and combines the features of various methods to increase the performance of their sentiment classification technique.

Although the prior polarities of words do not provide an accurate picture of the sentiment, they are still widely used in the domain of phrase-level sentiment analysis as they give generally good results. There are two ways in which prior polarities are computed. One simple technique is to use publicly available online lexicons with prior polarity information for words. Multi-perspective question answering (MPQA) ([19]) and SentiWordNet 3.0 ([1]) are two examples of such lexicons. The disadvantage of this approach is that these lexicons usually do not provide an accurate measure of the strength of polarity. Nevertheless, it is still used in many sentiment analysis techniques due to its simplicity [16], [22], [27]. In [23], the authors use semantic word spaces as features and build them over a recursive neural tensor network, obtaining up to 9.7% improvement over baseline approaches. Another approach is to construct a custom dictionary of prior polarities from training data according to the frequency of occurrence of each word in each particular class. This approach yields better performance, since the prior polarity of words is more closely fitted to a particular type of text ([20], [2] and [25]). It has the disadvantage that it requires a very large labeled data set to construct the dictionary.

An approach to include partial contextual information is to use bigrams and trigrams instead of just unigrams. In practice, the general conclusion is that using bigrams along with unigrams enhances performance ([10] and [20]). In addition, [21] and [20] used a model in which the prior polarity of a word is reversed if there is a negating word in its vicinity, e.g. 'not', 'no', 'don't', etc.

More recent approaches have included two other types of features. The first type consists of grammatical features, namely parts of speech (POS). This method requires tagging each word of the phrase with a grammatical POS, e.g. noun, pronoun, adjective, adverb, etc., using publicly available POS tagging algorithms. One way to use these is to simply construct prior polarity models of POS tags based on their frequency in each class, as done by [20]. This information can also be used in conjunction with unigram models to identify useful word phrases, i.e. words which match a predetermined grammatical rule or pattern ([25]).

The second type of features is platform-specific, i.e. features particular to Twitter. The presence of a URL and the number of capitalized words / letters in a tweet have been explored by [16] and [2].
[16] also reported positive results for using emoticons and Internet slang words as features. [6] detected word lengthening as a sign of subjectivity in a tweet, and reported a positive correlation between the two.

Many techniques have been explored for classification. The most widely used classifiers include naïve Bayes, support vector machines (SVM) and the maximum entropy classifier. However, there is no consensus on a single consistently best-performing classification algorithm. [2] reported better results for SVMs, while others like [20] support naïve Bayes. [10] and [21] also reported good results for the maximum entropy classifier.

VI. CONCLUSIONS

We developed a two-stage hierarchical classifier to label tweets by the sentiment expressed in them as either positive, negative or neutral. For the purpose of this study, we collected an original data set of microblog posts from Twitter. Each tweet in the data set was manually labeled by three humans. After removing duplicates and tweets expressing ambiguous sentiments, the data set consisted of exactly 8,900 original tweets. We explored a large space of 34 features to predict objectivity / subjectivity, as well as 21 features to predict the lexical positivity of tweets. The OS and LP meta-features are each computed using only the six most discriminating features by information gain. Naïve Bayes, decision tree and SVM classifiers were used to classify tweets as expressing positive, negative or neutral sentiments based on the OS and LP meta-features. Our results show that the naïve Bayes and decision tree classifiers yield similar results, delivering almost identical performance numbers for TP rate, precision, recall and F-measure, although the naïve Bayes classifier outperforms the decision tree by a slight margin on FP rate. The SVM gives the worst average performance of the three classifiers, either matching or underperforming the other two. We explained the SVM's relatively lower performance by the inherently non-linear nature of the class boundaries between positive, neutral and negative sentiment tweets in the OS–LP plane, visible in Fig. 3. We also made the tweet classifier publicly available for online use.

VII. APPENDIX

The explored OS and LP features, as well as data collection details, are available online at http://andash.seecs.nust.edu.pk/onlinereference/Appendix_afroze15sentiment.pdf.

REFERENCES

[1] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Conference on International Language Resources and Evaluation (LREC), Valletta, Malta, 2010.
[2] L. Barbosa and J. Feng. Robust sentiment detection on Twitter from biased and noisy data. In International Conference on Computational Linguistics: Posters, pages 36–44. Association for Computational Linguistics, 2010.
[3] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.
[4] C.M. Bishop. Pattern Recognition and Machine Learning, volume 4. Springer New York, 2006.
[5] Johan Bollen, Alberto Pepe, and Huina Mao. Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. In International AAAI Conference on Weblogs and Social Media, pages 450–453, 2011.
[6] Samuel Brody and Nicholas Diakopoulos. Cooooooooooooooolllllllllllll!!!!!!!!!!!!!!: Using word lengthening to detect sentiment in microblogs. In Conference on Empirical Methods in Natural Language Processing, pages 562–570. Association for Computational Linguistics, 2011.
[7] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, May 2011.
[8] T.M. Cover and J.A. Thomas. Elements of Information Theory, volume 6. Wiley Online Library, 1991.
[9] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2001.
[10] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, pages 1–12, 2009.
[11] Pollyanna Gonçalves, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. Comparing and combining sentiment analysis methods. In Proceedings of the First ACM Conference on Online Social Networks, pages 27–38. ACM, 2013.
[12] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[13] Rune Halvorsen. Tweetstream 1.1.1: Python package index. https://pypi.python.org/pypi/tweetstream, October 13, 2011. Last accessed: Sep 8, 2014.
[14] C.J. Hutto and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media, 2014.
[15] S.M. Kim and E. Hovy. Determining the sentiment of opinions. In International Conference on Computational Linguistics, page 1367. Association for Computational Linguistics, 2004.
[16] E. Kouloumpis, T. Wilson, and J. Moore. Twitter sentiment analysis: The Good the Bad and the OMG! In International AAAI Conference on Weblogs and Social Media, 2011.
[17] J. Krauss and S. Nann. TweetTrader. http://tweettrader.net, March 2014. Last accessed: Sep 8, 2014.
[18] Douglas M. McNair, Maurice Lorr, and Leo F. Droppleman. Profile of Mood States. Educational and Industrial Testing Service, 1971.
[19] University of Pittsburgh Department of Computer Science. Multi-perspective question answering (MPQA). http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/, 2014. Last accessed: Sep 8, 2014.
[20] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), 2010.
[21] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: Sentiment classification using machine learning techniques. In Conference on Empirical Methods in Natural Language Processing (ACL-02), volume 10, pages 79–86. Association for Computational Linguistics, 2002.
[22] Rudy Prabowo and Mike Thelwall. Sentiment analysis: A combined approach. Journal of Informetrics, 3(2):143–157, 2009.
[23] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1631, page 1642. Citeseer, 2013.
[24] Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In International AAAI Conference on Weblogs and Social Media, pages 178–185, 2010.
[25] Peter D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Annual Meeting of the Association for Computational Linguistics, pages 417–424. Association for Computational Linguistics, 2002.
[26] Twitter. About Twitter, Inc. https://about.twitter.com/company, 2014. Last accessed: Sep 8, 2014.
[27] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics, 2005.
[28] I.H. Witten, E. Frank, and M.A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2011.
