
International Journal of Emerging Technology in Computer Science & Electronics (IJETCSE)

ISSN: 0976-1353
Volume 14, Issue 2, April 2015

Sentiment Analysis on Twitter Data Using Hadoop


ABHINANDAN P SHIRAHATTI 1, NEHA PATIL 2, DURGAPPA KUBASAD 3, ARIF MUJAWAR 4
1,2,3,4 Department of CSE, Jain College of Engineering, Belagavi-590014, India

Abstract: Industries and survey companies today take many of their decisions from data obtained on the web. The WWW is a rich collection of data, largely unstructured, on which analysis can be performed for a given situation or a particular topic. In this paper we discuss how sentiment analysis can be done effectively on data collected from Twitter using Flume. Twitter is an online web application containing rich amounts of data, which may be structured, semi-structured or unstructured. We collect this data with the BIGDATA ecosystem's online streaming tool, Flume. Analysing Twitter data is also difficult because of the informal language used in comments, and many different kinds of analysis can be performed on the collected data. Here we take up sentiment analysis, using Hive and its queries to classify the data into the groups defined in HQL (Hive Query Language). We categorise the tweets into three groups: those carrying positive, moderate and negative comments.

Keywords: Analysis, BIGDATA, Comment, Flume, Hive, HQL, Sentiment Analysis, Structured, Semi-Structured, Twitter, Tweets, Un-Structured, WWW (World Wide Web).

I. INTRODUCTION
Since the late 20th century, the WWW has completely changed the way people express their views. Today those views appear in online blogs, discussion forums and applications such as Facebook and Twitter. Taking Twitter as an example, nearly 1 TB of text data is generated every week in the form of tweets. From this it is clear how the Internet is changing people's way of living and communicating. Tweets can be categorised by the hashtags under which users comment and post. Many companies, including survey companies, therefore use this data for analytics: they can predict the success rate of their products, or present different views drawn from the data they have collected. But computing such views in the conventional way is very difficult given the heavy volumes of data generated day by day.

Fig. 1: The Apache Hadoop ecosystem.

The figure above shows the different ecosystem components available on Hadoop; the problem just described can be tackled by treating it as a BIGDATA problem. To get data from Twitter, one would normally have to use a programming language to crawl it from Twitter's database or web pages. Here, instead, we collect the data using the BIGDATA online streaming ecosystem tool known as Flume, and the shuffling of the data and its arrangement into structured tables is done using Apache Hive.
II. LITERATURE SURVEY

Sentiment analysis has been handled as a Natural Language Processing task at many levels of granularity. Microblog data such as Twitter, on which users post real-time reactions to and opinions about everything, poses newer and different challenges. Some of the early and recent results on sentiment analysis of Twitter data are by Go et al. (2009), Bermingham and Smeaton (2010), and Pak and Paroubek (2010). Go et al. (2009) use distant learning to acquire sentiment data: they treat tweets ending in positive emoticons like :) and :-) as positive, and tweets ending in negative emoticons like :( and :-( as negative.


III. PROBLEM STATEMENT

3.1 Existing System


As discussed above, in the older way of getting data and performing sentiment analysis on it, coding techniques are used to crawl the data from Twitter: the data is extracted from Twitter's web pages by code written in Java, Python or another language, using the libraries provided by Twitter to crawl exactly the data required. The raw data obtained is then filtered using traditional techniques, and the positive, negative and moderate words are picked out from a list of words kept in a text file. All these words have to be collected beforehand; they act as the dictionary set with which the sentiment analysis is performed on the filtered data. Finally, the results have to be stored in a database, typically an RDBMS, which brings limitations in creating tables and in accessing the tables effectively.

3.2 Proposed System

Having seen the drawbacks of the existing system, we overcome them by treating the task as a Big Data problem. We use Hadoop and its ecosystems: to get the raw data from Twitter we use the Hadoop online streaming tool Apache Flume, in which we configure everything needed to fetch data from Twitter, defining exactly what information we want to collect. All of it is saved into our HDFS (Hadoop Distributed File System) in the prescribed format. From this raw data we create a table, filter the information we need, and sort it into a Hive table. We then perform the sentiment analysis using UDFs (User Defined Functions), taking Stanford CoreNLP as the dictionary that decides which words count as positive, moderate or negative. The following figure shows the architecture of the proposed system, from which it can be seen how the data is stored from Flume, how tables are created using Hive, and how the sentiment analysis is performed.

Fig. 2: Architecture diagram for the proposed system.

IV. METHODOLOGY
To overcome the problems faced in the existing system, as set out in the proposed system above, we follow these steps:
Creating a Twitter application.
Getting data using Flume.
Querying using Hive Query Language (HQL).

4.1 Creating a Twitter Application

To do sentiment analysis on Twitter data we first need the data itself, so we create an account on the Twitter developer site and create an application by clicking the new application button provided there [3]. After creating the application we generate the access tokens, so that we do not have to supply our authentication details each time; the application is also issued consumer keys for accessing it to fetch Twitter data. The following figure shows how the application page looks once the application has been created; on it we can see the consumer details as well as the access token details. We take these keys and tokens and set them in the Flume configuration file so that we can get the required data from Twitter in the form of tweets.


Fig. 3: Creating a Twitter application on the Twitter developer site.

The figure shows the application keys generated after creating the application. The top two keys are the API key and the API secret. The remaining two are the access tokens, which we generate ourselves by clicking the generate access token button: one is the Access token and the other is the Access token secret.

4.2 Getting Data Using Flume

After creating the application on the Twitter developer site, we use the consumer key and consumer secret together with the access token and access token secret. With these credentials Flume can access Twitter and fetch exactly the information we want; everything arrives in JSON format and is stored in HDFS at the location we have configured for the incoming Twitter data. The following is the configuration file used to get the Twitter data.

Fig. 4: Flume configuration file for Twitter data.
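
Since the file in Fig. 4 is reproduced only as an image, here is a minimal sketch of such an agent configuration, modelled on the widely used Cloudera TwitterSource example. The agent name, keyword filters and HDFS path are placeholder assumptions, not the exact values used; the four credential values are the ones obtained in Section 4.1.

    # Sketch of a Flume agent: Twitter source -> memory channel -> HDFS sink.
    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks = HDFS

    # Source: streams tweets from the Twitter API using the application credentials.
    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <consumer key>
    TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
    TwitterAgent.sources.Twitter.accessToken = <access token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
    TwitterAgent.sources.Twitter.keywords = cricket, worldcup, smartphones

    # Sink: writes the raw JSON tweets into HDFS as plain text files.
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000

    # Channel: buffers events in memory between the source and the sink.
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 1000

An agent configured this way would be started with a command along the lines of flume-ng agent --conf-file twitter.conf --name TwitterAgent.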

After running Flume with the above configuration, the Twitter data is automatically saved into HDFS at the storage path we have set for the data taken by Flume. The following figures show how the data is stored in HDFS in a documented format, and that the raw data we got from Twitter is likewise in JSON format:

Fig. 5: Twitter data in HDFS (Hadoop Distributed File System).

Fig. 6: Twitter data in JSON format.

4.3 Querying Using Hive Query Language (HQL)

From this data we first create a table into which the filtered data is set in a formatted structure, so that the unstructured data is effectively converted into structured form. For this we use a custom SerDe: it defines how Hive is to read data that is in JSON format, so that Hive can parse the JSON records and create a table in our prescribed format.

Fig. 7: HQL query for creating the tweets table.
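
Since the query in Fig. 7 is likewise only an image, the following is a minimal sketch of what such a table definition can look like. The SerDe class and jar (here the openx JSON SerDe), the table name and the selected fields are illustrative assumptions, not necessarily the exact schema used.

    -- Sketch: expose the raw JSON tweets in HDFS as a structured Hive table.
    ADD JAR json-serde-with-dependencies.jar;

    CREATE EXTERNAL TABLE tweets (
      id BIGINT,
      created_at STRING,
      text STRING,
      `user` STRUCT<screen_name:STRING, followers_count:INT>
    )
    -- The SerDe parses each JSON record into the declared columns.
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION '/user/flume/tweets/';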

We also use further UDFs (User Defined Functions) to perform the sentiment analysis on the tables created with Hive [5]. From these we acquire the results in a new table built with Hive's partition concept: all comments rated positive go into the positive partition, all comments rated moderate go into the moderate partition, and all comments rated negative go into the negative partition.

Fig. 8: Inserting data by performing sentiment analysis.

The following figure shows how the rating is done and how the data is partitioned into the three classes.

Fig. 9: Result after performing the sentiment analysis.
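
To make the partitioning step concrete, here is a hedged HiveQL sketch of such an insert. The UDF name rate and its implementing class are hypothetical stand-ins for the user defined function described above, not the actual code used.

    -- Hypothetical: rate(text) stands in for the sentiment UDF and is
    -- assumed to return 'positive', 'moderate' or 'negative'.
    ADD JAR sentiment-udf.jar;
    CREATE TEMPORARY FUNCTION rate AS 'com.example.hive.udf.RateTweet';

    CREATE TABLE tweets_rated (id BIGINT, text STRING)
    PARTITIONED BY (sentiment STRING);

    -- Dynamic-partition insert: each tweet lands in the partition
    -- named after its sentiment class.
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    INSERT OVERWRITE TABLE tweets_rated PARTITION (sentiment)
    SELECT id, text, rate(text) AS sentiment
    FROM tweets;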

V. RELATED WORK

Sentiment analysis of tweet data is considered a much harder problem than that of conventional text such as review documents, partly because of the short length of tweets, the frequent use of informal and irregular words, and the rapid evolution of language on Twitter. A large amount of work on Twitter sentiment analysis follows feature-based approaches. Go et al. explored augmenting different n-gram features, in conjunction with POS tags, in the training of supervised classifiers including Naive Bayes (NB), Maximum Entropy (MaxEnt) and Support Vector Machines (SVMs). They found that MaxEnt trained on a combination of unigrams and bigrams outperforms models trained on a combination of POS tags and unigrams by almost 3%. However, a contrary finding was reported elsewhere: adding POS tag features to n-grams improves sentiment classification accuracy on tweets.

Barbosa and Feng argued that using n-grams on tweet data may hinder classification performance because of the large number of infrequent words on Twitter. Instead, they proposed using microblogging features such as re-tweets, hashtags, replies, punctuation and emoticons. They found that training SVMs on these features enhances sentiment classification accuracy by 2.2% compared with SVMs trained on unigrams only. A similar finding was reported by Kouloumpis et al., who explored microblogging features including emoticons, abbreviations and the presence of intensifiers such as all-caps and character repetitions. Their results show that the best performance comes from using the n-grams together with the microblogging features and lexicon features in which words are tagged with their prior polarity; including the POS features, however, produced a drop in performance.

Agarwal et al. also explored the POS features, the lexicon features and the microblogging features. Apart from simply combining various features, they designed a tree representation of tweets to combine many categories of features in one succinct representation, with a partial tree kernel used to calculate the similarity between two trees. They found that the most important features are those that combine the prior polarity of words with their POS tags; all other features play only a marginal role. Furthermore, they showed that combining unigrams with the best set of features outperforms the tree kernel-based model and gives about a 4% absolute gain over a unigram baseline.
Rather than directly incorporating the microblogging features into sentiment classifier training, Speriosu et al. constructed a graph whose nodes are some of the microblogging features, such as hashtags and emoticons, together with users, tweets, and word unigrams and bigrams, connected according to the links that exist among them (e.g. users are connected to the tweets they created; tweets are connected to the word unigrams they contain). They then applied a label propagation method in which sentiment labels were propagated from a small set of nodes seeded with some initial label information throughout the graph.
They claimed that their label propagation method outperforms MaxEnt trained from noisy labels, obtaining an accuracy of 84.7% on a subset of the Twitter sentiment test set.

Existing work mainly concentrates on three types of features for sentiment analysis: lexicon features, POS features and microblogging features, and mixed findings have been reported. Some argue for the importance of POS tags, with or without word prior polarity involved, while others emphasise the use of microblogging features. In this paper we propose a new type of feature for sentiment analysis, called semantic features, where for each entity in a tweet (e.g. iPhone, iPad, MacBook) the abstract concept that represents it is added as a new feature (e.g. Apple product). We compare the accuracy of sentiment analysis against other types of features: unigrams, POS features and sentiment-topic features. To the best of our knowledge, using such semantic features is novel in the context of sentiment analysis.

VI. DATASETS
For the work and experiments described in this paper we used two different Twitter datasets, detailed below. The statistics of the datasets are shown in Table 1.

Table 1. Statistics of the two Twitter datasets used in this paper.

6.1 Cricket World Cup (CWC15)

This dataset consists of 60,000 tweets randomly selected from the Cricket World Cup (CWC15) collection. Half of the tweets in this dataset contain positive emoticons, such as :), :-), : ), :D and =), and the other half contain negative emoticons such as :(, :-( or : (. The original dataset contained 1.6 million general tweets, and its manually annotated test set consisted of 177 negative and 182 positive tweets. The training set was collected based on specific emoticons, whereas the test set was collected by searching the Twitter API with specific queries. To extend the test set, we added 641 tweets randomly selected from the original dataset and annotated manually by 12 users (researchers in our lab), with each tweet annotated by one user. Our final CWC15 dataset consists of 60K general tweets, with a test set of 1,000 tweets: 527 annotated negative and 473 annotated positive.
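
The emoticon-based selection described above can be expressed directly over the Hive table from Section IV. The following sketch is illustrative only, using the table and column names assumed in Section 4.3; it is not the query actually used to build the dataset.

    -- Illustrative: distant labelling of tweets by emoticon, mirroring
    -- how the emoticon-bearing halves of the dataset were selected.
    SELECT id, text,
           CASE
             WHEN text LIKE '%:)%' OR text LIKE '%:-)%'
               OR text LIKE '%:D%' OR text LIKE '%=)%' THEN 'positive'
             WHEN text LIKE '%:(%' OR text LIKE '%:-(%' THEN 'negative'
           END AS emoticon_label
    FROM tweets
    WHERE text LIKE '%:)%' OR text LIKE '%:-)%' OR text LIKE '%:D%'
       OR text LIKE '%=)%' OR text LIKE '%:(%' OR text LIKE '%:-(%';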

6.2 Business (#smartphones)

The Business (#smartphones) dataset was built by crawling tweets containing various hashtags, e.g. #smartphones (business), in March 2010. A subset of this corpus was manually annotated with three polarity labels (positive, negative, neutral) and split into training and test sets. This paper focuses on identifying positive and negative tweets, so we exclude neutral tweets from this dataset. The final #smartphones dataset consists of 839 tweets for training and another 839 tweets for testing.

VII. SEMANTIC FEATURES FOR SENTIMENT ANALYSIS

This section describes our semantic features and their incorporation into our sentiment analysis method. The semantic concepts of entities extracted from tweets can be used to measure the overall correlation of a group of entities (e.g. all Samsung products) with a given sentiment polarity. Hence, adding such features to the analysis could help identify the sentiment of tweets that contain any of the entities that such concepts represent, even if those entities never appeared in the training set (e.g. a new gadget from Samsung), assuming of course that the entity extractor successfully identifies the new entities as sub-types of concepts already correlated with negative or positive sentiment. Semantic features refer to these semantically hidden concepts extracted from tweets. An example of using semantic features for sentiment classifier training is shown in Figure 10, where the left box lists entities that appeared in the training set together with their occurrence probabilities in positive and negative tweets. For example, the entities Galaxy I and Galaxy II appeared more often in tweets of positive polarity, and both are mapped to the semantic concept Samsung products. As a result, the test-set tweet "Finally, I got my Galaxy II. What a product!" is more likely to have positive polarity, because it contains the entity Galaxy II, which is also mapped to the concept Samsung products.

Fig. 10. Measuring the correlation of semantic concepts with negative/positive sentiment. These semantic concepts are then incorporated into sentiment classification.

7.1 Extracting Semantic Entities and Concepts

There are several open APIs that provide entity extraction services for online textual data. Rizzo and Troncy evaluated the use of five popular entity extraction tools on a dataset of news articles. Their experimental results showed that AlchemyAPI performs best for entity extraction and semantic concept mapping. Our datasets, however, consist of informal tweets, so we conducted our own evaluation: we randomly selected tweets from CWC15 and asked 3 evaluators to assess the semantic concept extraction outputs generated by AlchemyAPI, OpenCalais and Zemanta.



The assessment of the outputs was based on the correctness of the extracted entities and the correctness of the entity-concept mappings. The evaluation results presented in Table 2 show that AlchemyAPI extracted the highest number of concepts and also has the highest entity-concept mapping accuracy compared with OpenCalais and Zemanta. As such, we chose AlchemyAPI to extract the semantic concepts from our datasets. Table 3 lists, for each dataset, the total number of entities extracted and the number of semantic concepts mapped to them.

Table 2. Evaluation results of AlchemyAPI, Zemanta and OpenCalais.

Table 3. Entity/concept extraction statistics of CWC15 and Smartphones using AlchemyAPI.
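
To show how such entity-to-concept mappings could plug into the Hive pipeline of Section IV, here is a hedged HiveQL sketch; the entity_concepts table and its contents are hypothetical stand-ins for stored AlchemyAPI output, not part of the original experiments.

    -- Hypothetical lookup table of AlchemyAPI-style mappings,
    -- e.g. 'Galaxy II' -> 'Samsung product'.
    CREATE TABLE entity_concepts (entity STRING, concept STRING);

    -- Attach the abstract concept of every entity matched in the tweet
    -- text as an additional semantic feature for classifier training.
    SELECT t.id, t.text, c.concept AS semantic_feature
    FROM tweets t CROSS JOIN entity_concepts c
    WHERE t.text LIKE concat('%', c.entity, '%');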

VIII. BASELINES

We compare the performance of our semantic sentiment analysis approach against the baselines described below.

8.1 Unigram Features

Word unigrams are the simplest features used for sentiment analysis of tweet data. Models trained on word unigrams have been shown to outperform random classifiers by a decent margin of 20%. In this work we use NB classifiers trained on word unigrams as our first baseline model. Table 4 lists, for each dataset, the total number of extracted unigram features used for classifier training.

Table 4. Total number of unigram features extracted from each dataset.

8.2 Part-of-Speech Features

POS features are common features that have been widely used in the literature for the task of Twitter sentiment analysis. Here we build various NB classifiers trained on a combination of word unigrams and POS features and use them as baseline models. We extract the POS features using the TweetNLP POS tagger, which is trained specifically on tweets. This differs from previous work, which relies on POS taggers trained on treebanks from the newswire domain. The TweetNLP tagger has been shown to outperform the Stanford tagger with a relative error reduction of 25% when evaluated on 500 manually annotated tweets. Moreover, it offers additional recognition capabilities for abbreviated phrases, emoticons and interjections (e.g. lol, omg).

8.3 Sentiment-Topic Features

The sentiment-topic features are extracted from tweets using the weakly-supervised joint sentiment-topic (JST) model that we developed earlier. We trained this model on the training set with the tweet sentiment labels discarded. The resulting model assigns each word in a tweet a sentiment label and a topic label, so JST essentially groups different words that share similar sentiment and topic. We list some of the topic words extracted by this model from the CWC15 and Smartphones corpora in Table 5. Words in each cell are grouped under a single topic; the upper half of the table shows topic words bearing positive sentiment, while the lower half shows topic words bearing negative sentiment. The rationale behind this model is that grouping words under the same topic and bearing similar sentiment reduces data sparseness in Twitter sentiment classification and improves accuracy.

Table 5. Extracted sentiment-topic words, grouped by sentiment and topic.

IX. CONCLUSION
There are various ways to get Twitter data, or any other online streaming data, but they require writing many lines of code, and performing sentiment analysis on the stored data then becomes complex. In this paper we have addressed this by treating it as a BIGDATA problem and solving it with Hadoop and its ecosystems: the data is collected with Flume, stored in HDFS, and the sentiment analysis is performed on it there. The processing time taken is also much lower than with the previous methods, because Hadoop MapReduce and Hive are well suited to processing large amounts of data in a short time.
REFERENCES
[1] Apoorv Agarwal, Jasneet Singh Sabharwal, "End to End Sentiment Analysis of Twitter Data".
[2] Sunil B. Mane, Yashwant Sawant, Saif Kazi, Vaibhav Shinde, "Real Time Sentiment Analysis of Twitter Data Using Hadoop".

[3] Luciano Barbosa and Junlan Feng, "Robust sentiment detection on Twitter from biased and noisy data", 2010.
[4] Hassan Saif, Yulan He and Harith Alani, "Semantic Sentiment Analysis of Twitter", Knowledge Media Institute, The Open University, United Kingdom.
[5] Kouloumpis, E., Wilson, T., Moore, J., "Twitter sentiment analysis: The good the bad and the omg!", in Proceedings of the ICWSM, 2011.
[6] Go, A., Bhayani, R., Huang, L., "Twitter sentiment classification using distant supervision", CS224N Project Report, Stanford, 2009.
