Sentiment Analysis Framework of Twitter Data Using Classification
Abstract- Text mining is the process of exploring and analysing large amounts of unstructured text to identify concepts, patterns, topics, keywords and other attributes in the data. Twitter is one of the forums that allow people across the world to post and exchange their views and ideas on the major and minor issues that arise around the world every day. Microblogging on Twitter attracts the interest of data researchers because there is immense scope for mining and analysing this large volume of unstructured data in several ways. In this paper, various algorithms for analysing the sentiment of tweets are discussed, and their performance is compared on certain metrics. Challenges encountered during the study are also described in terms of improvement and future scope. Since the machine learning algorithms have been applied to an unexplored dataset, language barriers to these algorithms have also been identified in terms of future scope and current feasibility of the algorithms. The analysis has been performed using three classification algorithms: Naïve Bayes, Support Vector Machine and Random Forest. This experimental work has been executed in Python, and Excel has been used to further evaluate and plot some of the results. Since the sentiment of the tweets is not known beforehand, the test set has been prepared manually in order to prevent errors in evaluating the accuracy and precision of the models.

Keywords- Sentiment analysis; Twitter; Classification; Naïve Bayes; Support Vector Machine; Random Forest Classifier; Precision; Recall

INTRODUCTION

Twitter is one of the most popular microblogging social networking sites, where people tweet their opinions concisely, typically in 140 characters or fewer. It is an open forum where people from all around the world can express themselves. Twitter has an edge over other social networking sites because of features such as subscribing, retweeting, adding to favourites and filtering information using keywords. Twitter produces an immense amount of information that cannot be processed manually to extract valuable data; consequently, machine-aided classification is needed to deal with that data. Hence Twitter is a great source from which to extract data and dig deeper into the insights in a particular area of research. Thus, this platform has been chosen to extract the tweets.

In this paper, classification techniques have been used to establish a sentiment analysis framework for Twitter data. Classification refers to the procedure by which thoughts, opinions, objects or items are perceived, distinguished and comprehended, and hence separated into categories. With reference to a particular keyword, the data fetched from Twitter has been analysed and the polarity has been calculated to classify the tweets as positive, negative or neutral. The dataset is unexplored and sensitive. Moreover, it is extremely sparse and poses many challenges for correctly evaluating the performance of the algorithms. The rest of the paper is organized as follows: section 2 throws light on classification, section 3 focuses on the methodology adopted, section 4 showcases the results and the paper concludes in section 5. Section 6 gives a brief of future work.

I. LITERATURE REVIEW

Garg, et al. [1] describe sentiment analysis on Twitter data after the Uri attack and discuss the analysis of the retweets. They conclude that negative tweets tend to survive more than positive tweets. Huma, et al. [6] explain how to increase the efficiency of sentiment analysis using Hadoop MapReduce. They also conclude that the neutrality of tweets plunges if the emoticons in the tweets/retweets are fed into the analysis. Trupthi, et al. [5] dig into the process of mining large amounts of data extracted using the Twitter API. They use a Hadoop system to process the tweets and then use sentiment analysis algorithms for a better understanding of the tweets. Zamani, et al. [11] use data from Facebook, the busiest social networking site, and use people's suggestions and comments to quantify sentiments and rate them according to their emotion. Lavanya, et al. [4] use topic-adaptive sentiment analysis with support vector machines and develop ways to perform adaptive analysis for better accuracy and precision. Ahuja, et al. [2] explain sentiment analysis using the K-means clustering algorithm and rely on the fact that dictionaries cannot contain the exact emotion of the
sentiment; rather, the sentiments have to be analysed based on the subjectivity and relevance of the text being processed.

II. CLASSIFICATION

Classification is a technique for determining which category a particular data instance falls into. In text mining, all text classifiers can operate on large amounts of data, subject to their respective constraints. In the K-nearest neighbour classifier, classes need not be linearly separable, but finding the nearest neighbours is very time consuming when the data is large. In the SVM classifier, the accuracy of results can be high, but it is complex and requires more space and time in both training and testing [3]. The ANN classifier works very well with only a few parameters to adjust, but the processing time can be very high if the neural network is large. In this paper, the probabilistic Naïve Bayes classifier has been used; its implementation is simple, and it also has excellent efficiency and classification rate [7,8]. Also, for the data size taken in this paper, this algorithm proves to give the best results and is hence the most appropriate for text classification of the data.

III. METHODOLOGY ADOPTED

The methodology adopted is shown in the process flow diagram (Fig.1), and every step is described in detail in the rest of the paper.

Create app and obtain credentials.
Collecting tweets from twitter API. (Using any keyword)
Data Pre-processing.
• Removal of URLs
• Removal of special symbols
• Removal of hashtags
• Removal of additional white spaces
• Removal of stop words
• Removal of digits
Apply Classification Techniques - Naive Bayes, SVM, Random Forest
Selection of best fit algorithm and analysis of results
Fig.1. Process flow diagram

A. Fetching extraordinary Twitter Data in Python.

Twitter data differs from the information shared by most other social networking sites in that it reflects data that users opt to share openly in public. The Twitter API platform gives broad access to public tweets that users across the world have posted. In order to access the Twitter API, the following procedure has been adopted.

The first step in fetching tweets has been to create a Twitter app in order to get access to a Twitter developer account under the same username as the logged-in account. This has been done to obtain the credentials needed to stream tweets from the Twitter API.

Further, the tweets were fetched from the Twitter API using a Python library called Tweepy [9]. Tweepy enables Python to interact with the Twitter API and hence to stream tweets from Twitter. The tweets so obtained have been directed into a JSON file. In this paper, the framework has been executed by fetching tweets using the keyword Kashmir, and a dataset of about 339MB has been obtained.

B. Data Pre-processing: NLTK

Text-based communication is the basis of tweets on Twitter; unstructured text data has therefore become extremely common, and analysing these large quantities of raw text is now a key way to understand what people are thinking. NLP (natural language processing) provides an interface between computers and humans. Text is analysed using NLP techniques, which give computers a way to comprehend human language. The NLP tool for Python is the Natural Language Toolkit, NLTK [10]. It is a suite of libraries for symbolic and statistical natural language processing for the English language, written in the Python programming language. NLTK is intended to support research and teaching in natural language processing and closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval and machine learning. NLTK supports tokenization, stemming, tagging, parsing and semantic reasoning functionalities. It helps in pre-processing data by cleaning the text through removal of stop words, punctuation, emoticons and digits. The steps followed in data pre-processing are shown in Fig.2.
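The pre-processing steps listed in Fig.1 (removal of URLs, special symbols, hashtags, additional white spaces, stop words and digits) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `clean_tweet` and the tiny stop-word set are hypothetical, the whole hashtag token is dropped (the paper does not say whether only the `#` sign is removed), and the paper itself relies on NLTK rather than hand-written regular expressions.

```python
import re

# Small illustrative stop-word set (an assumption); NLTK's English
# stop-word list could be substituted for it.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
HASHTAG_RE = re.compile(r"#\w+")                # whole hashtag tokens
DIGIT_RE = re.compile(r"\d+")                   # runs of digits
SYMBOL_RE = re.compile(r"[^\w\s]|_")            # punctuation, emoticons, symbols
SPACE_RE = re.compile(r"\s+")                   # runs of white space


def clean_tweet(text):
    """Apply the Fig.1 cleaning steps to one tweet and return the cleaned text."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    # Collapse additional white space, then drop stop words.
    tokens = SPACE_RE.sub(" ", text).strip().split(" ")
    return " ".join(t for t in tokens if t and t not in STOP_WORDS)
```

Letters from non-English scripts are kept by this sketch, since only punctuation-like characters are treated as special symbols; whether such tweets should be filtered before classification is a separate design choice.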
Here, P ( ) – Posterior
P( ) – Likelihood
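\[ \text{Precision} = \frac{TP}{TP + FP} \tag{5} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{6} \]

\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7} \]

where TP, FP and FN are the numbers of true positives, false positives and false negatives for a class.

Fig.7. Number of tweets

Subsequently, the results have been visualized in graphical form using a pie chart and a histogram depicting the percentage of each classified value.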
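The evaluation and visualization steps can be illustrated with a short sketch. It assumes that the manually labelled test set and the classifier's predictions are available as two equal-length lists of label strings; the function names and the label strings are illustrative and not taken from the paper. The first function computes per-class precision, recall and F1, and the second computes the percentage of each predicted class, which is the quantity a pie chart or histogram would display.

```python
from collections import Counter

# Label strings are an assumption made for this sketch.
LABELS = ("positive", "neutral", "negative")


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treated one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def class_percentages(y_pred):
    """Return the percentage of predictions that fall into each class."""
    total = len(y_pred)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    counts = Counter(y_pred)
    return {label: 100.0 * counts[label] / total for label in LABELS}
```

Calling `per_class_metrics(y_true, y_pred, "neutral")` for each label gives the figures on which the classifiers can be compared, while `class_percentages(y_pred)` applied to predictions over the whole dataset gives the slice sizes for the chart.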
IV. RESULTS

1. The Kashmir dataset has been retrieved from Twitter using the Tweepy library through the Twitter API. The extraction was not easy on a regular machine without cloud computing for storage and extraction, and this has been one of the major challenges faced in this research work. It took around one and a half days to extract the large dataset of Kashmir tweets on the Uri attack. Twitter is a world of news and happenings, and it can be a bit challenging to filter out the

2. Apart from identifying the best algorithm for the extracted dataset, the analysis has shown that most of the false negative, false positive and false neutral results stem from tweets in languages other than English. The findings show that trained algorithms for text classification and sentiment analysis work well with English-language datasets.

3. Using the Naïve Bayes algorithm, which has given the maximum accuracy and precision, the overall sentiment of the sparse dataset has been analysed. The following figure (Fig.7) depicts the algorithm's predictions over the dataset as a mix of positive, negative and neutral tweets.

[Pie chart "Sentiments visualization" with slices of 72%, 17% and 10%; legend: positive, neutral, negative]
Fig. 7. Pie chart

Here, the percentage of positive, negative and neutral tweets is shown.

V. CONCLUSION

Twitter is a powerful source where people across the world come together to interact on a common platform on varied issues. Hence, it gives researchers wide scope to fetch a large amount of raw data. Processing this raw data helps to analyse the opinion of the masses. A complex dataset on the Kashmir attacks has been retrieved, and three text classification algorithms have been applied to it. Naïve Bayes has worked best in terms of accuracy and precision. Moreover, the overall sentiment of the dataset has been predicted over the test set, which has been created manually in order to avoid errors. Random forest has given lower accuracy due to overfitting on a large dataset, and support vector machine performs comparably to Naïve Bayes. Naïve Bayes, being computationally strong and simple, outperformed the other two algorithms in this case. All the algorithms have given false results mainly on tweets in non-English languages (Hindi, Urdu). Moreover, based on the entire dataset, the Naïve Bayes model has predicted that most of the tweets in the extracted dataset are actually neutral.

VI. FUTURE WORK

This framework can be used to analyse any kind of tweets on Twitter. Beyond text, the doors for future amendments and improvements of this research are wide open. There exist many different algorithms for performing sentiment analysis. However, complexity and computational edge are two important factors in evaluating which algorithm would outperform the others in terms of accuracy and precision. There is a plethora of information on social media which can be analysed, and this information is very sparse. Predictive models built from hybrids of different machine learning algorithms can be used to correctly assess sentiment, which is in fact one of the most difficult problem statements in the machine learning world. The accuracy of models for non-English statements can also be improved in future.

REFERENCES
[1] P. Garg, H. Garg, and V. Ranga, "Sentiment analysis of the Uri terror attack using Twitter," Computing, Communication and Automation (ICCCA), 2017.
[2] S. Ahuja and G. Dubey, "Clustering and sentiment analysis on Twitter data," 2nd International Conference on Telecommunication and Networks (TEL-NET), IEEE, 2017.
[3] M. Kumar and A. Bala, "Analyzing Twitter sentiments through big data," Computing for Sustainable Global Development (INDIACom), 3rd International Conference on. IEEE, 2016.
[4] K. Lavanya and C. Deisy, "Twitter sentiment analysis using multi-class SVM," Intelligent Computing and Control (I2C2), International Conference on. IEEE, 2017.
[5] M. Trupthi, S. Pabboju, and G. Narasimha, "Sentiment analysis on twitter using streaming API," Advance Computing Conference (IACC), IEEE 7th International. IEEE, 2017.
[6] P. Huma and S. Pandey, "Sentiment analysis on Twitter Data-set using Naive Bayes algorithm," Applied and Theoretical Computing and Communication Technology (iCATccT), 2nd International Conference on. IEEE, 2016.
[7] S. Chen, C. Peng, L. Cai, and L. Guo, "A Deep Neural Network Model for Target-based Sentiment Analysis," International Joint Conference on Neural Networks (IJCNN), pp. 1-7, IEEE, 2018.
[8] M. Mittal, et al., "Monitoring the Impact of Economic Crisis on Crime in India Using Machine Learning," Computational Economics, pp. 1-19, 2018.
[9] K. Zvarevashe and O. O. Olugbara, "A framework for sentiment analysis with opinion mining of hotel reviews," in Information Communications Technology and Society (ICTAS), 2018 Conference, pp. 1-4. IEEE, 2018.
[10] N. Sagar, "A comparative study of classification techniques in data mining algorithms," Oriental Journal of Computer Science & Technology 8.1, pp. 13-19, 2015.
[11] N. Zamani, M. Azminam, "Sentiment analysis: Determining people's emotions in facebook," University Teknologies MARA, Malaysia 2013.
[12] M. Saif, "Sentiment analysis: Detecting valence, emotions, and other affectual states from text," Emotion measurement, pp. 201-237, 2016.
[13] C. Aggarwal and C. Xiang Zhai, "A survey of text classification algorithms," Mining text data. Springer, Boston, MA, pp. 163-222, 2012.
[14] H. Kaur and V. Mangat, "A survey of sentiment analysis techniques," I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC), International Conference, pp. 921-925, IEEE, 2017.
[15] K. Zvarevashe and O. O. Olugbara, "A framework for sentiment analysis with opinion mining of hotel reviews," Information Communications Technology and Society (ICTAS) Conference, pp. 1-4. IEEE, 2018.
[16] J. Li and L. Qiu, "A Sentiment Analysis Method of Short Texts in Microblog," Computational Science and Engineering (CSE) and Embedded and Ubiquitous Computing (EUC), IEEE International Conference, vol. 1, pp. 776-779, IEEE, 2017.