
5th IEEE International Conference on Parallel, Distributed and Grid Computing(PDGC-2018), 20-22 Dec, 2018, Solan, India

Sentiment Analysis Framework of Twitter Data Using Classification

Medha Khurana, Department of CSE, Amity University, Noida, India (medha.khurana31@gmail.com)
Anurag Gulati, Department of EE, Delhi Technological University, Delhi, India (gulatianurag1995@gmail.com)
Saurabh Singh, Department of CSE, Amity University, Noida, India (saurabh.iiet@gmail.com)

978-1-5386-6026-3/18/$31©2018 IEEE 459

Abstract- Text mining is the process of exploring and analysing large amounts of unstructured text in order to identify concepts, patterns, topics, keywords and other attributes in the data. Twitter is one of the forums that allow people across the world to post and exchange their views and ideas on the major and minor issues that arise every day. Microblogging on Twitter attracts the interest of data researchers because there is immense scope for mining and analysing this large volume of unstructured data in several ways. In this paper, various algorithms for analysing the sentiment of tweets are discussed, and their performance is compared on several metrics. Challenges encountered during the study are described in terms of improvement and future scope. Since the machine learning algorithms have been applied to an unexplored dataset, language barriers to these algorithms have also been identified in terms of future scope and current feasibility. The analysis has been performed using three classification algorithms: Naïve Bayes, Support Vector Machine and Random Forest. The experimental work has been executed in Python, and Excel has been used to further evaluate and plot some of the results. Since the true sentiment of the tweets is not known in advance, the test set has been prepared manually in order to prevent errors in evaluating the accuracy and precision of the models.

Keywords- Sentiment analysis; Twitter; Classification; Naïve Bayes; Support Vector Machine; Random Forest Classifier; Precision; Recall

INTRODUCTION

Twitter is one of the most popular microblogging social networking sites, where people tweet their opinions concisely, typically in 140 characters or fewer. It is an open forum where people from all around the world can express themselves. Twitter has an edge over other social networking sites because of features such as subscribing, retweeting, adding to favourites and filtering information by keyword. Twitter produces far more information than can be processed manually to extract valuable data; consequently, machine-aided classification is needed to deal with that data. Hence Twitter is a great source from which to extract data and dig deeper into the insights of a particular area of research. Thus, this platform has been chosen for extracting the tweets.

In this paper, classification techniques have been used to establish a sentiment analysis framework for Twitter data. Classification refers to the procedure by which thoughts, opinions, objects or items are recognised, differentiated and understood, and hence separated into categories. The data fetched from Twitter for a particular keyword has been analysed, and its polarity has been calculated to classify the tweets as positive, negative or neutral. The dataset is unexplored and sensitive. Moreover, it is extremely sparse and poses many challenges for correctly evaluating the performance of the algorithms. The rest of the paper is organized as follows: Section I reviews the literature, Section II discusses classification, Section III describes the methodology adopted, Section IV presents the results, Section V concludes the paper, and Section VI gives a brief outline of future work.

I. LITERATURE REVIEW

Garg et al. [1] describe sentiment analysis on Twitter data following the Uri attack and analyse the retweets. They conclude that negative tweets tend to survive longer than positive tweets. Huma et al. [6] explain how to increase the efficiency of sentiment analysis using Hadoop MapReduce. They also conclude that the neutrality of tweets drops sharply if the emoticons in the tweets/retweets are fed into the analysis. Trupthi et al. [5] explore the process of mining large amounts of data extracted using the Twitter API. They use a Hadoop system to process the tweets and then apply sentiment analysis algorithms to understand the tweets better. Zamani et al. [11] use data from Facebook, the busiest social networking site, and use people's suggestions and comments to quantify sentiments and rate them according to their emotion. Lavanya et al. [4] perform topic-adaptive sentiment analysis using support vector machines and develop ways to perform adaptive analysis for better accuracy and precision. Ahuja et al. [2] explain sentiment analysis using the K-means clustering algorithm; they rely on the fact that dictionaries cannot contain the exact emotion of the

sentiment, rather the sentiments have to be analysed based on the subjectivity and relevance of the text being processed.

II. CLASSIFICATION

Classification is a technique that determines which category a particular data instance falls into. In text mining, all text classifiers can operate on a large amount of data, each subject to its own constraints. In the K-nearest neighbour classifier, the classes need not be linearly separable, but finding the nearest neighbours is very time consuming if the data is large. In the SVM classifier, the accuracy of results can be high, but it is complex and requires more space and time in both training and testing [3]. The ANN classifier works very well with only a few parameters to adjust, but the processing time can be very high if the neural network is large. In this paper, the probabilistic Naïve Bayes classifier has been used; its implementation is simple, and it also has excellent efficiency and classification rate [7,8]. For the data size used in this paper, this algorithm gives the best results and is therefore the most appropriate for text classification of the data.

III. METHODOLOGY ADOPTED

The methodology adopted is described in the process flow diagram (Fig.1), and every step is described in detail in the remainder of the paper.

Create app and obtain credentials.
→ Collecting tweets from twitter API (using any keyword).
→ Data Pre-processing.
• Removal of URLs
• Removal of special symbols
• Removal of hashtags
• Removal of additional white spaces
• Removal of stop words
• Removal of digits
→ Apply Classification Techniques - Naive Bayes, SVM, Random Forest
→ Selection of best fit algorithm and analysis of results
Fig.1. Process flow diagram

A. Fetching Twitter Data in Python

Twitter data is distinctive in comparison with the information shared by most other social networking sites, since it reflects data that users choose to share openly in public. The Twitter API platform gives broad access to public tweets that users across the world have shared. In order to access the Twitter API, the following procedure has been adopted.

The first step in fetching tweets from Twitter has been to create a Twitter app, which gives access to the Twitter developer account under the same username as the one logged in. This has been done in order to obtain the credentials that are needed to stream tweets from the Twitter API.

Further, the tweets were fetched from the Twitter API using a Python library called Tweepy [9]. Tweepy enables Python to interact with the Twitter API and hence to stream tweets from Twitter. The tweets so obtained have been written to a JSON file. In this paper, the framework has been executed by fetching tweets with the keyword Kashmir, and a dataset of about 339MB has been obtained.

B. Data Pre-processing: NLTK

Text-based communication is the basis of the tweets on Twitter; hence unstructured text data has become extremely common, and analysing these large quantities of raw text is now a key method for understanding what people are thinking. Natural language processing (NLP) provides an interface between computers and humans. Text is analysed using NLP techniques, which give computers a way to understand human language. The NLP tool for Python is the Natural Language Toolkit, NLTK [10]. It is a suite of libraries for symbolic and statistical natural language processing for English, written in the Python programming language. NLTK is intended to support research and teaching in natural language processing or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval and machine learning. NLTK supports tokenization, stemming, tagging, parsing and semantic reasoning functionalities. It helps in pre-processing the data by cleaning the text through removal of stop words, punctuation, emoticons and digits. The steps followed in data pre-processing are shown in Fig.2.
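The pre-processing steps listed in Fig.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `clean_tweet`, the regular expressions and the tiny `STOP_WORDS` set are hypothetical (the paper uses NLTK, whose English stop-word list is far larger), and it assumes that "removal of hashtags" drops the whole hashtag token.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# NLTK's English stop-word corpus, as used in the paper, is much larger.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")      # assumption: drop the whole hashtag token
DIGIT_RE = re.compile(r"\d+")
SYMBOL_RE = re.compile(r"[^a-z\s]")   # anything that is not a lowercase letter or whitespace


def clean_tweet(text):
    """Apply the Fig.1 pre-processing steps to one tweet and return its tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # removal of URLs
    text = HASHTAG_RE.sub(" ", text)   # removal of hashtags
    text = DIGIT_RE.sub(" ", text)     # removal of digits
    text = SYMBOL_RE.sub(" ", text)    # removal of special symbols, punctuation, emoticons
    tokens = text.split()              # splitting on whitespace also drops extra spaces
    return [tok for tok in tokens if tok not in STOP_WORDS]  # removal of stop words


if __name__ == "__main__":
    print(clean_tweet("The attack in Kashmir is sad!!  24 updates: https://t.co/abc #Kashmir :("))
```

Note that the ASCII-only symbol filter in this sketch also discards text written in non-Latin scripts, so a real pipeline for multilingual tweets would need a different rule at that step.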


Fig.2. Data pre-processing steps

C. Sentiment Analysis Using Different Algorithms

1. Naïve Bayes Algorithm

Naïve Bayes is a classification algorithm used in machine learning. It is used chiefly for text classification and is particularly suitable when the dimensionality of the input features is high. It is named after Thomas Bayes, who proposed the probabilistic Bayes theorem on which the Naïve Bayes algorithm is based. It is called naïve because the classifier assumes that the features are independent of one another given the class, which is often not correct. For instance, a fruit may be considered an orange if it is orange in colour and round. Even if these characteristics depend on one another or on the existence of other features of the class, the algorithm treats each of them as contributing independently to the probability that the fruit is an orange. However, the short computational time for training is one of the major advantages of the Naïve Bayes classifier. Mathematical formula (1):

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (1)$$

Here, P(A|B): Posterior probability

P(B|A): Likelihood

P(A): Proposition prior probability

P(B): Evidence prior probability

The Naïve Bayes algorithm is considered very good in terms of speed and accuracy. Moreover, it is computationally lightweight, as it is built on simple probability modelling. It performs well when used together with other techniques, such as collaborative filtering for building recommender systems, and in multi-class prediction and spam filtering. A particular strength of Naïve Bayes is that it does not need a large training set to build a precise and accurate model. For this dataset, the Naïve Bayes algorithm gave fairly good results when evaluated on the test data, whose sentiments had been spot checked before running the algorithm. The ROC graph (Fig.3) shows the relationship between the true positive rate and the false positive rate.

[Figure: ROC Curve - Naive Bayes (A=0.9145); x-axis: false positive rate, y-axis: true positive rate]
Fig.3. ROC Curve for Naïve Bayes

The area under the curve for the Naïve Bayes algorithm is good, and the model has proven capable in terms of sensitivity and specificity. After applying this algorithm to the fetched data, a data frame has been obtained showing the polarity of the extracted tweets. Fig.4 indicates the polarity, where 1 represents positive, 0 represents neutral and -1 represents negative. The figure is a screenshot of only the head and tail of the data.
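To make formula (1) concrete, the following is a minimal from-scratch sketch of a multinomial Naïve Bayes text classifier in Python. It is an illustration under assumptions rather than the implementation used in the paper: the function names `train_naive_bayes` and `predict` are hypothetical, add-one (Laplace) smoothing is an added assumption, and the toy tweets in the usage example are invented. The evidence term P(B) is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count class frequencies and per-class word frequencies from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(tokens, model):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)           # log P(A): class prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue                                # unseen words carry no evidence
            # log P(B_k | A): smoothed likelihood of each word given the class;
            # summing the logs is the "naive" independence assumption at work
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Invented toy data: 1 = positive, 0 = neutral, -1 = negative.
    docs = [["peace", "hope"], ["good", "peace"], ["attack", "sad"],
            ["terror", "attack"], ["news", "update"], ["update", "report"]]
    labels = [1, 1, -1, -1, 0, 0]
    model = train_naive_bayes(docs, labels)
    print(predict(["good", "peace"], model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.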


Fig.4. Data frame

2. Support Vector Machine Classifier

Support Vector Machines have revolutionized the classification and regression world [12]. The SVM algorithm is based on selecting the best-fit hyperplane that separates the data points. Linearly separable data points can be separated by the linear kernel; however, this highly flexible algorithm also handles the multi-dimensional case. The "linear", "rbf" and "poly" kernels help in restructuring the data points according to the size and complexity of the dataset. The basic equation for the Type 1 (C-SVM) Support Vector Machine, in its standard soft-margin form, is the error function (2) to be minimized:

$$\min_{w,\, b,\, \xi}\ \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_i \qquad (2)$$

The restriction on (2) is described by (3). Here 'w' is the vector of coefficients, 'b' is a constant and 'ξ_i' represents the parameters for handling non-separable data [13]. The index 'i' labels the N training cases, 'x_i' represents the independent variables, 'y_i' is the class label and 'C' is the capacity constant. Moreover, the input (tokenized words in our case) is transformed into the feature space by a mapping 'φ', so that it can be used in the optimization that selects the hyperplane.

$$y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, N \qquad (3)$$

In this paper's case, there is an immense number of data points, and the Twitter data is quite sparse. The extracted data has been handled using the 'rbf' kernel for separation. Since the vectors created from the Twitter data consist of sparse rows, the 'rbf' kernel has been helpful for creating the multi-dimensional hyperplane. The Receiver Operating Characteristic (ROC) curve has been plotted for the SVM classifier; Fig.5 shows the relationship between the false positive rate and the true positive rate.

[Figure: ROC Curve - SVM (A=0.7867); x-axis: false positive rate, y-axis: true positive rate]
Fig.5. ROC curve for SVM

The area under the curve has been calculated as 0.7867. The better a classifier trades off the true positive rate (sensitivity) against the false positive rate (1 − specificity), the larger the area under the ROC curve. An area under the curve tending to 1 is considered perfect for sentiment analysis. SVM has given decent results on the Kashmir dataset, but its performance does not surpass the Naïve Bayes results for our dataset.

3. Random Forest Classifier

Random Forest Classifier is an ensemble of many decision tree models; it is a complex model with a lot of computational capability. When the dataset is large enough for performing the calculations, the random forest model works well. It is one of the most reliable models in the classification world, and its sensitivity and specificity depend a lot on overfitting [14]. While it is a good machine learning model, it has its disadvantages as well. When it is fed too much data and too many parameters, it tends to overfit, since it creates an ensemble of complex decision trees that work well on the training set but prove inaccurate on the test set. Fig.6 shows that random forest has not performed well on this complex dataset, as it tends to overfit heavily. Hence, random forest did not prove to be well suited to this particular dataset.
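The ROC curves and area-under-curve values quoted for the classifiers in this section, as well as the F1 scores reported in Table 1, follow from simple counting. The sketch below shows one way to compute them in plain Python; it is illustrative only, not the authors' evaluation code. The function names are hypothetical, the scores in the usage example are invented, one class is assumed to be treated as "positive" (label 1) against the rest, and the scores are assumed to be distinct.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores.

    Assumes binary labels (1 = positive, 0 = negative) and distinct scores.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Sort by decreasing score; each step lowers the threshold past one sample.
    for _, y in sorted(zip(scores, y_true), key=lambda pair: -pair[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented example scores for six samples.
    y_true = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print(auc(roc_points(y_true, scores)))
    # Precision and recall as reported for Naive Bayes in Table 1.
    print(round(f1_score(0.89, 0.65), 2))
```

Applied to the precision and recall columns of Table 1, `f1_score` gives values that round to the reported F1 scores of 75%, 63% and 41%, which is a quick consistency check on the table.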


[Figure: ROC Curve - Random Forest (A=0.6712); x-axis: false positive rate, y-axis: true positive rate]
Fig.6. ROC curve for Random Forest

D. Data Visualization:

The tweets have been classified into positive, negative and neutral as indicated in Fig.7.

Fig.7. Number of tweets

Subsequently, the results have been visualized in graphical form using a pie chart and a histogram depicting the percentage of each classified value.

IV. RESULTS

1. The Kashmir dataset has been retrieved from Twitter using the Tweepy library through the Twitter API. The extraction was not easy on a regular machine without cloud computing for storage and extraction. This has been one of the major challenges faced in this research work. It took around one and a half days to extract the large dataset of Kashmir tweets on the Uri attack. Twitter is a world of news and happenings, and it can be challenging to filter out the relevant details and create a model that helps in extracting knowledge and learnings. Three models have been applied to the dataset to analyse the sentiments of the tweets. Sentiment analysis is a difficult task because language is being used to create a model that can predict emotions [15]. People also use emoticons and different languages on social media, so it is difficult even for complex models to predict the sentiments correctly. Table 1 compares the results of the three models that were applied to the extracted data [16]. Spot checking of ~2.3k tweets has been done to calculate the change of the false positive rate with the true positive rate.

Algorithm                Accuracy   Precision   Recall   F1 Score   Overfitting performance
Naïve Bayes              88%        89%         65%      75%        Very Good
Support Vector Machine   81%        84%         51%      63%        Good
Random Forest            63%        63%         30%      41%        Not Good

Table 1. Performance comparison of algorithms

Accuracy, which is calculated from the error matrix (confusion matrix), is treated here as the primary metric for evaluating the performance of a sentiment analysis algorithm. However, precision and recall are further metrics that help assess the model, and they give more insight into the overall sentiment of the dataset. For reference, (5), (6) and (7) are the formulas for precision, recall and F1 score, where TP, FP and FN denote the numbers of true positives, false positives and false negatives.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (6)$$

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (7)$$

2. Apart from identifying the best algorithm for the extracted dataset, the analysis has shown that most of the false negative, false positive and false neutral predictions stem from tweets written in languages other than English. The findings show that algorithms trained for text classification and sentiment analysis work well with English-language datasets.

3. Using the Naïve Bayes algorithm, which has given the maximum accuracy and precision, an analysis of the overall sentiment of the sparse dataset has been done and the following


figure (Fig.7) depicts the prediction of the algorithm over the dataset as a mix of positive, negative and neutral tweets.

[Figure: pie chart "Sentiments visualization" with slices of 72%, 17% and 10%; legend: positive, neutral, negative]
Fig. 7. Pie chart

Here, the percentage of the number of positive, negative and neutral tweets has been shown.

V. CONCLUSION

Twitter is a powerful source where people across the world come together to interact on a common platform on varied issues. Hence, it gives researchers wide scope to fetch a large amount of raw data. Processing this raw data helps to analyse the opinion of the masses. A complex dataset of Kashmir attacks has been retrieved, and three text classification algorithms have been applied to it. Naïve Bayes has worked best in terms of accuracy and precision. Moreover, the overall sentiment of the dataset has been predicted over the test set, which has been created manually in order to avoid errors. Random forest has given lower accuracy due to overfitting on a large dataset, and the support vector machine performs comparably to Naïve Bayes. Naïve Bayes, being computationally strong and simple, outperformed the other two algorithms for this case. All the algorithms have given false results mainly on tweets in non-English languages (Hindi, Urdu). Moreover, based on the entire dataset, the Naïve Bayes model has predicted that most of the tweets in the extracted dataset are actually neutral.

VI. FUTURE WORK

This framework can be used to analyse any kind of tweets on Twitter. Beyond the text itself, the doors for future amendments and improvements of this research are wide open. There exist many different algorithms for performing sentiment analysis. However, complexity and computational edge are two important factors in evaluating which algorithm would outperform the others in terms of accuracy and precision. There is a plethora of information on social media that can be analysed, and this information is very sparse. Predictive models built as hybrids of different machine learning algorithms can be used to assess sentiment correctly, which is in fact one of the most difficult problem statements in the machine learning world. The accuracy of models for non-English statements can also be improved in future.

REFERENCES
[1] P. Garg, H. Garg, and V. Ranga, "Sentiment analysis of the Uri terror attack using Twitter," Computing, Communication and Automation (ICCCA), 2017.
[2] S. Ahuja and G. Dubey, "Clustering and sentiment analysis on Twitter data," 2nd International Conference on Telecommunication and Networks (TEL-NET), IEEE, 2017.
[3] M. Kumar and A. Bala, "Analyzing Twitter sentiments through big data," Computing for Sustainable Global Development (INDIACom), 3rd International Conference on. IEEE, 2016.
[4] K. Lavanya and C. Deisy, "Twitter sentiment analysis using multi-class SVM," Intelligent Computing and Control (I2C2), International Conference on. IEEE, 2017.
[5] M. Trupthi, S. Pabboju, and G. Narasimha, "Sentiment analysis on twitter using streaming API," Advance Computing Conference (IACC), IEEE 7th International. IEEE, 2017.
[6] P. Huma and S. Pandey, "Sentiment analysis on Twitter Data-set using Naive Bayes algorithm," Applied and Theoretical Computing and Communication Technology (iCATccT), 2nd International Conference on. IEEE, 2016.
[7] S. Chen, C. Peng, L. Cai, and L. Guo, "A Deep Neural Network Model for Target-based Sentiment Analysis," International Joint Conference on Neural Networks (IJCNN), pp. 1-7, IEEE, 2018.
[8] M. Mittal, et al., "Monitoring the Impact of Economic Crisis on Crime in India Using Machine Learning," Computational Economics, pp. 1-19, 2018.
[9] K. Zvarevashe and O. O. Olugbara, "A framework for sentiment analysis with opinion mining of hotel reviews," in Information Communications Technology and Society (ICTAS), 2018 Conference, pp. 1-4. IEEE, 2018.
[10] N. Sagar, "A comparative study of classification techniques in data mining algorithms," Oriental Journal of Computer Science & Technology 8.1, pp. 13-19, 2015.
[11] N. Zamani and M. Azminam, "Sentiment analysis: Determining people's emotions in facebook," Universiti Teknologi MARA, Malaysia, 2013.
[12] M. Saif, "Sentiment analysis: Detecting valence, emotions, and other affectual states from text," Emotion measurement, pp. 201-237, 2016.
[13] C. Aggarwal and C. Xiang Zhai, "A survey of text classification algorithms," Mining text data. Springer, Boston, MA, pp. 163-222, 2012.
[14] H. Kaur and V. Mangat, "A survey of sentiment analysis techniques," I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), International Conference, pp. 921-925, IEEE, 2017.
[15] K. Zvarevashe and O. O. Olugbara, "A framework for sentiment analysis with opinion mining of hotel reviews," Information Communications Technology and Society (ICTAS) Conference, pp. 1-4. IEEE, 2018.
[16] J. Li and L. Qiu, "A Sentiment Analysis Method of Short Texts in Microblog," Computational Science and Engineering (CSE) and Embedded and Ubiquitous Computing (EUC), IEEE International Conference, vol. 1, pp. 776-779, IEEE, 2017.
