
CS 671: Natural Language Processing

Sentiment Analysis in Twitter


Project Report

Rohit Kumar Jha [11615]


Sakaar Khurana [10627]
November 19, 2013

Contents

1 Introduction

2 Motivation

3 Previous Works
3.1 Bag of Words Model
3.2 Naive Bayesian Classifier
3.3 Support Vector Machine

4 Implementation Details
4.1 Approach
4.1.1 Bag of Words Model
4.1.2 Other features specific to short text messages like tweets
4.1.3 Support Vector Machine
4.1.4 Support Vector Machine + Bag of Words Model
4.2 Data

5 Results and Conclusion

6 Future Work

1 Introduction
In the past decade, new forms of communication, such as microblogging and text
messaging, have emerged and become ubiquitous. While there is no limit to the range of
information conveyed by tweets and texts, these short messages are often used to share
opinions and sentiments that people have about what is going on in the world around
them. We worked on the following task, which was part of the SEMEVAL 2013 challenge:

• Given a message, classify whether the message is of positive, negative, or neutral
sentiment. For messages conveying both a positive and a negative sentiment,
whichever is the stronger sentiment should be chosen.

2 Motivation
Working with these informal text genres presents challenges for natural language
processing beyond those typically encountered when working with more traditional text
genres, such as newswire data. Tweets and texts are short: a sentence or a headline rather
than a document. The language used is very informal, with creative spelling and
punctuation, misspellings, slang, new words, URLs, and genre-specific terminology and
abbreviations, such as RT for "re-tweet" and #hashtags, which are a type of tagging for
Twitter messages. How to handle such challenges so as to automatically mine and
understand the opinions and sentiments that people are communicating has only very
recently been the subject of research (Jansen et al., 2009; Barbosa and Feng, 2010; Bifet
and Frank, 2010; Davidov et al., 2010; O'Connor et al., 2010; Pak and Paroubek, 2010;
Tumasjan et al., 2010; Kouloumpis et al., 2011).
Another aspect of social media data such as Twitter messages is that it includes rich
structured information about the individuals involved in the communication. For
example, Twitter maintains information about who follows whom, and re-tweets and tags
inside tweets provide discourse information. Modelling such structured information is
important because (i) it can lead to more accurate tools for extracting semantic
information, and (ii) it provides a means for empirically studying properties of social
interactions (e.g., we can study properties of persuasive language or what properties are
associated with influential users).

3 Previous Works
3.1 Bag of Words Model
The bag-of-words model is a simplifying representation used in natural language
processing and information retrieval (IR). In this model, a text (such as a sentence or a
document) is represented as an unordered collection of words, disregarding grammar and
even word order. The bag-of-words model is commonly used in methods of document
classification, where the (frequency of) occurrence of each word is used as a feature for
training a classifier. For sentiment analysis, the most common approach uses a word list
in which each word has been assigned a polarity score; the overall polarity (and sentiment
strength) of a text is then determined by aggregating the polarities of all its words. In
SEMEVAL 2013, Team IITB used a Bag of Words model with Discourse Information and
achieved an accuracy of 39.80%.
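
As a minimal sketch of the representation and the aggregate-polarity rule described above (not the code of any SEMEVAL team), the following builds a bag of words and sums word scores. The tiny `LEXICON`, the function names and the whitespace tokenisation are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy lexicon; real scored word lists are far larger.
LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}


def bag_of_words(text):
    """Unordered word counts: grammar and word order are discarded."""
    return Counter(text.lower().split())


def aggregate_polarity(text, lexicon=LEXICON):
    """Sum the score of every word occurrence; unknown words count as 0."""
    bag = bag_of_words(text)
    return sum(lexicon.get(word, 0) * count for word, count in bag.items())


def label(score):
    """Map an aggregate score to one of the three sentiment classes."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because only counts are kept, "good but bad" and "bad but good" receive the same score, which is the limitation that the discourse features in Section 4.1.2 try to address.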

3.2 Naive Bayesian Classifier


A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem with strong (naive) independence assumptions. A more descriptive term for the
underlying probability model would be "independent feature model". Maximum Entropy
classifiers are often used as an alternative to the Naïve Bayes classifier because they do
not require statistical independence of the features that serve as predictors. In SEMEVAL
2013, Team uottawa used a Naïve Bayes classifier and achieved an accuracy of 42.51%.
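
For reference, the decision rule that follows from Bayes' theorem under the independence assumption can be written as below. The notation is ours and is not taken from the SEMEVAL submissions: c ranges over the sentiment classes and x_1, ..., x_n are the features (for example, words) observed in a message.

```latex
\hat{c} \;=\; \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The product form is exactly where the "naive" assumption enters: each feature is treated as conditionally independent of the others given the class.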

3.3 Support Vector Machine


The basic SVM takes a set of input data and predicts, for each given input, which of two
possible classes forms the output, making it a non-probabilistic binary linear classifier.
Given a set of training examples, each marked as belonging to one of two categories, an
SVM training algorithm builds a model that assigns new examples to one category or the
other based on their feature vectors. An SVM model is a representation of the examples
as points in space, mapped so that the examples of the separate categories are divided
by a clear gap that is as wide as possible. New examples are then mapped into that same
space and predicted to belong to a category based on which side of the gap they fall on.
In addition to performing linear classification, SVMs can efficiently perform non-linear
classification using what is called the kernel trick, implicitly mapping their inputs into
high-dimensional feature spaces. More formally, a support vector machine constructs
a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can
be used for classification, regression, or other tasks. Intuitively, a good separation is
achieved by the hyperplane that has the largest distance to the nearest training data
point of any class, since in general the larger the margin, the lower the generalization
error of the classifier. In SEMEVAL 2013, Team NRC-Canada used an SVM model with
unigrams, bigrams, POS tags, negation, etc. as features and achieved an accuracy
of 69.02%, securing the first position.
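
The "largest distance to the nearest training point" idea can be stated as the standard hard-margin optimisation problem for linearly separable two-class data. The symbols are our notation: x_i are the training vectors, y_i in {-1, +1} their labels, w the normal of the hyperplane and b its offset.

```latex
\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, m
```

The width of the gap between the two classes is 2 / ||w||, so minimising ||w|| maximises the margin.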

4 Implementation Details
4.1 Approach
4.1.1 Bag of Words Model
Section 3.1 gave a brief description of the Bag of Words model as used in previous
methods for sentiment analysis. We implemented it with some additional features in order
to improve accuracy. In general, each word in the list is assigned a polarity from the set
{−1, 0, 1}. We instead used two lists: one containing the most common words, with
polarities from the set {−4, −3, −2, −1, 0, 1, 2, 3, 4}, and another containing the less
common words, with polarities from the set {−1, 0, 1}. With this simple change, we
achieved an accuracy of around 42%.
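
The two-list lookup can be sketched roughly as follows. The entries in `COMMON_WORDS` and `RARE_WORDS` are invented for illustration, and giving the fine-grained list precedence when a word appears in both is our assumption; the report does not say how such overlaps were handled.

```python
# Hypothetical entries; the real lists are not reproduced in this report.
COMMON_WORDS = {"love": 3, "good": 2, "hate": -4}   # scale -4 .. 4
RARE_WORDS = {"commendable": 1, "dreary": -1}       # scale -1 .. 1


def word_polarity(word):
    """Look the word up in the fine-grained list first, then the coarse one."""
    if word in COMMON_WORDS:
        return COMMON_WORDS[word]
    return RARE_WORDS.get(word, 0)


def tweet_score(text):
    """Aggregate polarity of a tweet over both word lists."""
    return sum(word_polarity(w) for w in text.lower().split())
```

The wider scale lets frequent, strongly polar words outweigh several weak ones, which a uniform {−1, 0, 1} scale cannot express.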

4.1.2 Other features specific to short text messages like tweets


Emoticons: We then incorporated emoticons. If emoticons are present in the text of the
tweet, the classification is based solely on their presence: positive if only positive ones
are present, negative if only negative ones are present, and if both are present, the
polarity with the higher count is reported.
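
A minimal sketch of this emoticon rule is given below. The emoticon sets are small placeholders, and returning `None` on a tie (so that later features can decide) is our assumption, since the report does not specify the tie case.

```python
# Placeholder emoticon sets; the project used lists from SentiStrength.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}


def emoticon_label(text):
    """Classify by emoticon counts alone; None means 'no clear signal'."""
    tokens = text.split()
    pos = sum(t in POSITIVE_EMOTICONS for t in tokens)
    neg = sum(t in NEGATIVE_EMOTICONS for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None  # no emoticons, or a tie (assumed to defer to later stages)
```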

Discourse relations: We have also considered the use of discourse relations, as first
suggested by Mukherjee and Bhattacharyya [1]. This is an important factor that
significantly improves accuracy. Consider this example: "I didn't think I would like the
movie but it turned out to be great." A naive Bag of Words model would find the overall
sentiment to be neutral, although the tweet is clearly positive. The misclassification
arises because the model ignores the discourse markers "would" and "but", which signal
that the clause after "but" carries the writer's actual opinion. We also tried assigning
double weight to polarity words in sentences occurring later in a tweet, but this made no
overall difference in accuracy.
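
One simple way to let a contrastive marker such as "but" influence the score is sketched below. This is an illustration only and not the scheme of [1]: the marker set, the factor of 2 applied to words after the marker and the toy lexicon are all assumptions.

```python
# Hypothetical marker set and lexicon, for illustration only.
CONTRAST_MARKERS = {"but", "however"}
LEXICON = {"boring": -1, "great": 1}


def plain_score(text, lexicon=LEXICON):
    """Naive bag-of-words aggregate, ignoring discourse structure."""
    return sum(lexicon.get(w, 0) for w in text.lower().split())


def discourse_score(text, lexicon=LEXICON, after_weight=2):
    """Weight polar words after a contrastive marker more heavily."""
    weight = 1
    score = 0
    for word in text.lower().split():
        if word in CONTRAST_MARKERS:
            weight = after_weight  # the clause after the marker dominates
            continue
        score += weight * lexicon.get(word, 0)
    return score
```

For "the plot was boring but the acting was great", the plain aggregate is 0 (neutral), while the weighted score is positive because "great" follows the marker.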

Hash tags: We also incorporated the effect of #hashtags. We first segmented the text of
each #hashtag at word boundaries and then applied the Bag of Words model to it. If the
hashtags showed a clear sentiment, we reported the aggregate polarity of the hashtags
for that tweet; otherwise, we proceeded to the remaining features.
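
The report does not say which segmentation algorithm was used. As one possible sketch, a dictionary-based dynamic programme can split a hashtag into known words before the lexicon is applied; the vocabulary, lexicon and function names here are hypothetical.

```python
def segment(hashtag, vocab):
    """Split a hashtag into vocabulary words; return None if impossible."""
    text = hashtag.lstrip("#").lower()
    # best[i] holds one segmentation of text[:i], or None if none exists.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in vocab:
                best[end] = best[start] + [text[start:end]]
                break
    return best[len(text)]


def hashtag_polarity(hashtag, vocab, lexicon):
    """Aggregate lexicon polarity over the segmented words of a hashtag."""
    words = segment(hashtag, vocab)
    if words is None:
        return 0  # unsegmentable hashtags contribute nothing (assumption)
    return sum(lexicon.get(w, 0) for w in words)
```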

Polarity boosters: We also incorporated the effect of modifiers like "very", "too",
etc.
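
How exactly boosters changed the score is not stated in this report. A minimal sketch, assuming that a booster multiplies the polarity of the next word by a fixed factor (here 2, an arbitrary choice), could look like this:

```python
# Hypothetical booster factors; the project took its list from SentiStrength.
BOOSTERS = {"very": 2, "too": 2}


def boosted_score(text, lexicon):
    """Aggregate polarity where a booster scales the word that follows it."""
    score = 0
    factor = 1
    for word in text.lower().split():
        if word in BOOSTERS:
            factor = BOOSTERS[word]
            continue
        score += factor * lexicon.get(word, 0)
        factor = 1  # a booster only affects the immediately following word
    return score
```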

Normalize and correct spelling: Tweets contain important words that are often
intentionally misspelt to convey emotion; for example, happy might be spelt as
haaaaaappyyyyy. So, whenever a letter is repeated consecutively, it is counted only once.
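
Read literally, this rule collapses every run of a repeated letter to a single letter, which can be sketched with one regular expression. Note that legitimate double letters are collapsed as well, so we assume here that the lexicon keys are normalised in the same way before lookup; the report does not describe this detail.

```python
import re


def collapse_repeats(word):
    """Replace each run of the same character by a single occurrence."""
    return re.sub(r"(.)\1+", r"\1", word)


def normalise_lexicon(lexicon):
    """Apply the same collapsing to lexicon keys so lookups still match."""
    return {collapse_repeats(word): score for word, score in lexicon.items()}
```

With this, both "haaaaaappyyyyy" and "happy" map to the same key, so the elongated spelling is still found in the normalised word list.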

Considering negation: For every tweet, the polarities of words are reversed if they
appear near a negation word. All this helped achieve an accuracy of around 56%.
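
The negation rule can be sketched as follows. What counts as "near" is not defined in the report, so the window of two preceding tokens and the negator set below are assumptions.

```python
# Hypothetical negator list; the project took its list from SentiStrength.
NEGATORS = {"not", "no", "never", "didn't", "don't"}


def negated_score(text, lexicon, window=2):
    """Aggregate polarity, flipping words preceded closely by a negator."""
    tokens = text.lower().split()
    score = 0
    for i, word in enumerate(tokens):
        polarity = lexicon.get(word, 0)
        # Reverse the polarity if a negator occurs within `window` tokens before.
        if any(t in NEGATORS for t in tokens[max(0, i - window):i]):
            polarity = -polarity
        score += polarity
    return score
```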

4.1.3 Support Vector Machine


We also tried classification by building an SVM classifier on the following features:

• Most frequent words: The top 1000 words in the training set are selected. The
presence/absence of each of these words counts as a feature. This constitutes 1000
features.

• Important POS tags: Each word is POS tagged. The presence/absence of each of seven
POS tags (including 'NN', 'VG', 'CD', 'JJ', 'CC', 'RB') is considered as a feature. (Not
all tags are used; only important POS tags such as noun, adjective, adverb, verb and
conjunction are considered.)

• Presence of negators: The presence/absence of negators in a sentence is considered
as a feature.

We achieved an accuracy of around 61% using these features.
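
A rough sketch of how such binary feature vectors could be assembled is shown below. It assumes the tweets have already been POS tagged into (word, tag) pairs by some tagger, and the negator set and function names are hypothetical. The resulting vectors could then be passed to any SVM library, for example scikit-learn's `LinearSVC`; the report does not say which implementation was used.

```python
from collections import Counter

# The six tags listed above; the negator set is a placeholder.
POS_TAGS = ["NN", "VG", "CD", "JJ", "CC", "RB"]
NEGATORS = {"not", "no", "never"}


def top_words(tagged_tweets, k=1000):
    """Return the k most frequent words in the (tagged) training tweets."""
    counts = Counter(w.lower() for tweet in tagged_tweets for w, _ in tweet)
    return [word for word, _ in counts.most_common(k)]


def features(tagged_tweet, vocab):
    """Binary vector: word presence, POS tag presence, negator presence."""
    words = {w.lower() for w, _ in tagged_tweet}
    tags = {t for _, t in tagged_tweet}
    vector = [int(w in words) for w in vocab]
    vector += [int(t in tags) for t in POS_TAGS]
    vector.append(int(bool(words & NEGATORS)))
    return vector
```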

4.1.4 Support Vector Machine + Bag of Words Model


Finally, we built a hybrid classifier with three layers of classification: emoticons, the
Bag of Words model (with the other features described above) and the SVM classifier. If
there is a clear presence of emoticons denoting the emotion of a tweet, we do not proceed
further and report the emotion represented by these emoticons. Otherwise, we proceed to
the Bag of Words model with all the features discussed before. If it yields a polarity with
a certain confidence level, we stop and report that polarity. If this also fails, we move
on to the SVM classifier and report the polarity obtained by it. This hybrid classifier
achieved an accuracy of 68.36% with training on only around 8000 tweets.
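
The three-layer cascade can be sketched as a short fall-through function. The three stage callables are hypothetical stand-ins for the components described earlier, and using the absolute Bag of Words score against a `threshold` as the "confidence level" is our assumption; the report does not define how confidence was measured.

```python
def hybrid_classify(tweet, emoticon_stage, lexicon_stage, svm_stage, threshold=1):
    """Try emoticons, then the lexicon score, then fall back to the SVM."""
    # Layer 1: a clear emoticon signal decides immediately.
    label = emoticon_stage(tweet)
    if label is not None:
        return label
    # Layer 2: accept the Bag of Words polarity only if it is strong enough.
    score = lexicon_stage(tweet)
    if abs(score) >= threshold:
        return "positive" if score > 0 else "negative"
    # Layer 3: otherwise defer to the trained SVM classifier.
    return svm_stage(tweet)
```

Ordering the stages from most to least precise means the cheap, high-precision rules handle the easy tweets and the SVM only sees the ambiguous remainder.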

4.2 Data
We have used data from the following sources in our project:

• SEMEVAL 2013 provided around 10000 labelled tweets for the "Message Polarity
Classification" problem.
http://www.cs.york.ac.uk/semeval-2013/task2/index.php?id=data

• Lists of emoticons, emotions, negating words, and booster words (like very, too,
etc.) are taken from http://sentistrength.wlv.ac.uk/.

5 Results and Conclusion


Our results, compared to the performance of other teams in SEMEVAL 2013:

Methods Used                             Data Size             Accuracy Obtained on SEMEVAL 2013 data
Bag of Words Model                       N.A.                  39.80%
Naïve Bayes Classifier                   10,000 tweets         42.51%
Support Vector Machine                   1.6 Million tweets    69.02%
SVM + Bag of Words Model (Our result)    9,000 tweets          68.36%

We have achieved an accuracy of 68.36% with training on only around 9000 tweets and
testing on 1100 tweets. The only team with better performance than us in SemEval 2013
achieved an accuracy of 69.02% while using 1.6 million tweets for training. Our method
achieves good accuracy with relatively small data size.

6 Future Work
• We have covered most of the features in our classification, but we did not study the
effect of the following on classification accuracy:
– Taking care of emotions conveyed by abbreviations
– Analysing whether subsequent sentences in a tweet are more important (e.g.,
giving greater weight to the 2nd line in a tweet of 2 lines)

• Although it was clear from work done by others on the same problem that SVM
tends to perform better than other classifiers, it would be interesting to see how a
hybrid of other classifiers (like the naive Bayes classifier) with SVM would perform.
(In our work we tried a hybrid of Bag of Words with SVM, which improved the accuracy.)

References
[1] S. Mukherjee and P. Bhattacharyya. Sentiment analysis in Twitter with lightweight
discourse analysis, December 2012.

[2] P. Nakov and S. Rosenthal. Sentiment analysis in Twitter, March 2013.