
Sentiment Analysis on Twitter Data

Rahil Arshad
Email Address: rahilarshad9@gmail.com
Department of Physics and Computer Science, Dayalbagh Educational Institute, Agra, UP, India

Abstract---Sentiment analysis is one of the Natural Language Processing fields, dedicated to the exploration of subjective opinions or feelings collected from various sources about a particular subject.

In more strict business terms, it can be summarized as: ‘Sentiment Analysis is a set of tools to identify and extract opinions and use them for the benefit of the business operation’.

Sentiment Analysis finds many applications, such as review monitoring, hate message filtering and product feedback. This paper focuses on the application of Sentiment Analysis to classify tweets into hatred positive and hatred negative tweets. We train our classifier and assess its performance.

Keywords---Sentiment Analysis, Opinion Mining, Natural Language Processing, Text Classification, Twitter Analysis, Hatred Filtering, Machine Learning

I. INTRODUCTION

Sentiment analysis is one of the Natural Language Processing fields, dedicated to the exploration of subjective opinions or feelings collected from various sources about a particular subject.

In more strict business terms, it can be summarized as: ‘Sentiment Analysis is a set of tools to identify and extract opinions and use them for the benefit of the business operation’.

Twitter is a web service and social communication platform which allows users to address their tweets in different domains. Users can easily and efficiently share their perspectives or ideas on a wide variety of topics via social networking websites. As online data is abundantly available through different platforms on social networks like Twitter, Facebook, Reddit etc., analyzing the data is of paramount importance in drawing inferences from it. Hence, in our research, we perform sentiment analysis on Twitter data by using a Naive Bayesian algorithm. By using our model, we can classify the public tweets made by the users into hatred positive or hatred negative tweets.

Various machine learning techniques are used to perform sentiment analysis. Some of these are classification, random forests, neural networks, lexicon-based methods and support vector machines.

In our model we have used classification, which assigns the data to different classes. The available classifiers can be broadly divided into two categories: supervised and unsupervised classifiers.

In supervised classification, we feed the model labeled data, on which it is trained to fit our needs. In unsupervised classification, unlabeled data is fed to the classifier, which learns from that data on its own.

II. LITERATURE REVIEW

Huma Parveen and Prof. Shikha Pandey [1] did sentiment analysis on a movie dataset by downloading the tweets through the Twitter API. They used the Hadoop framework for processing the dataset. Classification is done using the Naïve Bayes algorithm, and its performance is increased by pre-processing the tweets. The final results show the classification of text into the required classes with accurate performance.

Neethu M S and Rajasree R [2] perform analysis on tweets from a specific domain using different machine learning techniques. They focus on the problems faced when identifying emotional keywords among multiple keywords and on the difficulty of handling misspellings and slang words. A feature vector is created, whose accuracy is tested using naïve Bayes, SVM, maximum entropy and ensemble classifiers.

Bac Le and Huy Nguyen [3] built a model to analyze sentiment on Twitter using machine learning techniques, applying an effective feature set (bigram, unigram and object-oriented features) that enhances the accuracy. The classification of tweets is done using 2 algorithms, i.e., the Naïve Bayes classifier and Support Vector Machines (SVM), whose accuracies are tested by calculating precision, recall and F-score; the two are reported to cluster around the same accuracy.

Sayali P. Nazare, Prasad S. Nar, Akshay S. Phate and Prof. Dr. D. R. Ingle [4] created a dataset through the Twitter API by collecting all tweets regarding the topic of the blue whale game. Their main aim is to perform analysis on sentimental tweets. They used Naïve Bayes, Support Vector Machines, Maximum Entropy and an Ensemble classifier. The SVM and Naive Bayes classifiers are implemented using MATLAB built-in functions. The Maximum Entropy classifier is implemented using MaxEnt software. Based on the comparative results, Naïve Bayes has better precision and slightly lower recall and accuracy, i.e., 89%, while the other classifiers have similar accuracy levels, i.e., 90%. The result shows the pie-chart
which represents the positive, negative and neutral hashtags with percentages.

Dey, Chakraborty et al. [14] collected 2 datasets, movie reviews and hotel reviews, and applied 2 classifiers, naïve Bayes and K-NN. Their aim is to check which classifier gives better results on both datasets. The experimental results show that the naïve Bayes classifier gives better performance on the movie reviews dataset, while on the hotel reviews dataset both classifiers show approximately the same results. Overall, the naïve Bayes classifier is better for movie review classification.

R. Dey and S. Chakraborty [15] developed a new technique which predicts weather conditions from an air pollution dataset. They applied a convex-hull technique suitable for dynamic databases where the climate data change frequently. Incremental DBSCAN clustering is used to cluster newly inserted data, and a protocol is used to give the weather prediction. The results give the accuracy of the model based on hit and miss.

III. NAÏVE BAYES CLASSIFIER

Algorithm:

I. Consider a training data set D consisting of documents which belong to different classes, say class A and class B.

II. The prior probability of both classes A and B is calculated as shown:
P(A) = number of objects of class A / total number of objects
P(B) = number of objects of class B / total number of objects

III. Now calculate the total number of word frequencies of both classes A and B, i.e., na and nb:
na = the total number of word frequencies of class A
nb = the total number of word frequencies of class B

IV. Calculate the conditional probability of keyword occurrence for the given class:
P(word1 | class A) = wordcount / na
P(word1 | class B) = wordcount / nb
P(word2 | class A) = wordcount / na
P(word2 | class B) = wordcount / nb
…
P(wordn | class A) = wordcount / na
P(wordn | class B) = wordcount / nb
where wordcount is the number of times the given word occurs in the documents of the given class.

V. A uniform distribution (smoothing) is applied to the counts in order to avoid the zero-frequency problem, in which a word that never occurs in a class makes the whole product zero.

VI. Now a new document M, made up of the words W = word1, word2, …, wordn, is classified by calculating the probability of both classes A and B given W (up to a common normalizing factor):
P(A | W) = P(A) * P(word1 | class A) * P(word2 | class A) * … * P(wordn | class A)
P(B | W) = P(B) * P(word1 | class B) * P(word2 | class B) * … * P(wordn | class B)

VII. After calculating the probability for both classes A and B, the new document M is assigned to the class with the higher probability.

IV. PROPOSED WORK

In the proposed work, we discuss how sentiment is extracted from a tweet/text using a Twitter dataset. Twitter is a place where users post their views and opinions based on the situation. The main objective of our proposed system is to perform analysis on tweets having sentiment, which is of great help to business intelligence in predicting the future. This paper addresses sentiment analysis on a Twitter dataset; that is, classification is first performed on tweets using the naïve Bayes classifier. Each tweet is represented by the sentiment it asserts, in terms of positive, negative and neutral. Sentiment analysis is vital for businesses to find out the pros and cons of their products in the market as seen by the public, which results in improved business productivity. The aim of this project is to develop a classification technique using machine learning which gives accurate results and automatically classifies the sentiment of an unknown tweet by predicting its class.

In this paper, sentiment analysis is done on Twitter data. The collected dataset contains 31962 tweets, gathered on all topics based on the situation. There are different attributes in the database, namely item-id, label and tweet, of which tweet is the one considered for our proposed research. The first attribute, item-id, contains the id of the tweet; the second attribute, label, represents the sentiment as a Boolean value (1 or 0), i.e., a tweet not containing hatred is taken as 0 and a tweet with hatred is declared as 1; and the last attribute, tweet, contains the text of the tweet, on any situation, either containing sentiment or not. Our main aim is to perform analysis on these tweets and determine which tweets are positive and which are negative.

In order to classify the data, we first need to perform the following steps:

1. Tokenization: It is a method that divides a document into small parts called tokens. These tokens may be in the form of words, numbers or punctuation marks.

Ex: it is going to rain today
2. After performing tokenization, the sentence is divided into tokens as follows:

“It”, “is”, “going”, “to”, “rain”, “today”.

3. Stop words: These are common words that are to be ignored, which reduces the size of the dataset and also the number of words (tokens). In our programming language, Python, we use a tool called the Natural Language Toolkit (NLTK), which includes lists of stop words in 16 different languages.

Ex: I like dancing, so I dance.

4. After removing stop words, the sentence will be as follows:

Like, dancing, dance.

5. The bag-of-words concept is applied to these tokens.

6. Finally, our classification technique, the Naïve Bayesian classifier, is applied; it calculates the probability of all words in the document and gives the result, i.e., the probability of each tweet being positive and being negative.

7. The results show the probability of each tweet, indicating whether the tweet is positive or negative.

A. Bag-of-words

A bag-of-words is a representation of text that describes the occurrence of words within a document. The occurrence of words is represented as a numerical feature. It is a way of extracting features from text for use in modeling, such as with machine learning algorithms. The approach is very simple and flexible and can be used for extracting features from documents. There is, however, some complexity in two places: designing the vocabulary of known words, and scoring the presence of known words.

B. Application of sentiment analysis

The Naïve Bayes classifier is a supervised classification technique which assigns a text/sentence to a particular class. It is a probabilistic algorithm which calculates the probability of each word in the text/sentence for every class, and the class with the highest probability is taken as the output.

V. RESULTS AND ANALYSIS

Figure 1 Confusion Matrix and Prediction Labels

Figure 2 Classification Analysis

VI. CONCLUSIONS

In conclusion, we have developed a model which performs sentiment analysis on Twitter data using a machine learning technique. The model proposed in this research was built using the Natural Language Toolkit (NLTK) on the dataset containing tweets. The bag-of-words concept is used, which holds the positive and negative words separately. The classification was done using the Naïve Bayes classifier by calculating the probability of new input data for each class, and the tweet is labeled positive or negative according to the highest value. We chose an effective Twitter feature dataset, which enhances the effectiveness and accuracy of the classifier. This model can be further enhanced to any desired level by incorporating more features in the database.

REFERENCES

[1] Huma Parveen and Shikha Pandey, “Sentiment analysis on Twitter Data-set using Naive Bayes algorithm”, 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pp. 416-419.
[2] M. S. Neethu and R. Rajasree, “Sentiment analysis in twitter using machine learning techniques,” 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), Tiruchengode, 2013, pp. 1-5.
[3] Le B., Nguyen H. (2015) “Twitter Sentiment Analysis Using Machine Learning Techniques”. In:
Le Thi H., Nguyen N., Do T. (eds) Advanced Computational Methods for Knowledge Engineering. Advances in Intelligent Systems and Computing, vol 358. Springer, Cham.
[4] Sayali P. Nazare, Prasad S. Nar, Akshay S. Phate, Prof. Dr. D. R. Ingle, “Sentiment Analysis in Twitter”, International Research Journal of Engineering and Technology (IRJET), Volume: 05, Jan-2018.
[5] Anuja Prakash Jain and Padma Dandannavar, “Application of machine learning techniques to sentiment analysis”, 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pp. 628-632.
[6] Mejova, Yelena. (2019). “Sentiment Analysis: An Overview”.
[7] Boiy, Erik, Hens, Pieter, Deschacht, Koen & Moens, Marie-Francine. (2007). “Automatic Sentiment Analysis in Online Text”. ELPUB2007. Openness in Digital Publishing: Awareness, Discovery, and Access - Proceedings of the 11th International Conference on Electronic Publishing held in Vienna, Austria 13-15 June 2007 / Edited by Leslie Chan and Bob Martens. ISBN 978-3-85437-292-9, 2007, pp. 349-360.
[8] Niu, Zhen, Zelong Yin, and Xiangyu Kong. “Sentiment classification for microblog by machine learning.” In 2012 Fourth International Conference on Computational and Information Sciences, pp. 286-289. IEEE, 2012.
[9] J. Ren, S. D. Lee, X. Chen, B. Kao, R. Cheng, and D. Cheung, “Naive Bayes Classification of Uncertain Data,” 2009 Ninth IEEE International Conference on Data Mining, Miami, FL, 2009, pp. 944-949.
[10] F. Neri, C. Aliprandi, F. Capeci, M. Cuadros and T. By, “Sentiment Analysis on Social Media,” 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Istanbul, 2012, pp. 919-926.
[11] Gaurav D Rajurkar, Rajeshwari M Goudar, “A speedy data uploading approach for Twitter Trend And Sentiment Analysis using HADOOP”, 2015 International Conference on Computing Communication Control and Automation, pp. 580-584.
[12] Saif, Hassan; He, Yulan and Alani, Harith (2012), “Semantic sentiment analysis of Twitter,” in The 11th International Semantic Web Conference (ISWC 2012), 11-15 November 2012, Boston, MA, USA, pp. 1320-1326.
[13] Efthymios Kouloumpis, Theresa Wilson, Johanna Moore (2011), “Twitter Sentiment Analysis: The Good the Bad and the OMG!,” in The Fifth International AAAI Conference on Weblogs and Social Media, pp. 538-541.
[14] Dey, Lopamudra & Chakraborty, Sanjay & Biswas, Anuraag & Bose, Beepa & Tiwari, Sweta. (2016). “Sentiment Analysis of Review Datasets Using Naïve Bayes and K-NN Classifier”. International Journal of Information Engineering and Electronic Business. 8. 54-62. doi: 10.5815/ijieeb.2016.04.07.
[15] R. Dey and S. Chakraborty, “Convex-hull & DBSCAN clustering to predict future weather”, 6th International IEEE Conference and Workshop on Computing and Communication, Canada, 2015, pp. 1-8.
[16] Sathyadevan, Shiju & S, Devan & S Gangadharan, Surya. (2014). “Crime Analysis and Prediction Using Data Mining”. doi: 10.1109/CNSC.2014.6906719.
[17] Leena A. Deshpande, M.R. Narasingarao, “Addressing social popularity in twitter data using drift detection technique”, Journal of Engineering Science and Technology, Vol. 14, No. 2 (2019), 922-934.
[18] Tiruveedhula, Sajana & Ramanarasingarao, Manda. (2017). “Machine learning techniques for malaria disease diagnosis - A review”. Journal of Advanced Research in Dynamical and Control Systems. 9. 349-369.
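
The Naïve Bayes procedure of Section III (steps I to VII) can be illustrated with a minimal Python sketch. It is not the code used in this work: the function names `train` and `classify`, the toy training tweets, and the reading of the "uniform distribution" step as add-one (Laplace) smoothing are all assumptions made for illustration. The labels follow the dataset convention of Section IV (1 = hatred, 0 = no hatred).

```python
import math
from collections import Counter, defaultdict


def train(documents):
    """Steps I-III: count documents and word frequencies per class.

    `documents` is a list of (tokens, label) pairs.
    """
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()                  # all distinct words seen in training
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(tokens, model):
    """Steps II and IV-VII: score each class for a new document, pick the best."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        # Step II: prior probability of the class (kept as a logarithm).
        score = math.log(doc_count / total_docs)
        # Step III: total word frequency of the class (na or nb).
        n_class = sum(word_counts[label].values())
        for token in tokens:
            # Steps IV-V: conditional probability with add-one smoothing
            # (assumed interpretation of the "uniform distribution" step).
            count = word_counts[label][token]
            score += math.log((count + 1) / (n_class + len(vocabulary)))
        scores[label] = score
    # Step VII: the class with the higher score is assigned to the document.
    return max(scores, key=scores.get), scores


# Toy, made-up training data: label 1 = hatred, label 0 = no hatred.
training = [
    (["love", "this", "day"], 0),
    (["happy", "love", "friends"], 0),
    (["hate", "this", "people"], 1),
    (["hate", "ugly", "people"], 1),
]
model = train(training)
label_a, scores_a = classify(["love", "friends"], model)
label_b, scores_b = classify(["hate", "people"], model)
print(label_a, label_b)
```

The sketch sums logarithms instead of multiplying the probabilities of step VI directly; the ranking of the classes is unchanged, and long tweets no longer risk numerical underflow.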

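
The pre-processing steps of Section IV (tokenization, stop-word removal and bag-of-words) can likewise be sketched in a few lines of Python. Section IV relies on the NLTK stop-word lists; to stay self-contained, the sketch below substitutes a small hand-written stop-word set, which is a hypothetical stand-in. The function names are illustrative, and the tokenizer is simplified: it lowercases the text and drops punctuation rather than keeping punctuation marks as tokens.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; the paper uses NLTK's lists.
STOP_WORDS = {"i", "so", "it", "is", "to", "the", "a", "an", "and"}


def tokenize(text):
    """Lowercase the text and split it into word/number tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


def bag_of_words(tokens):
    """Map each remaining token to the number of times it occurs."""
    return Counter(tokens)


# The two example sentences from Section IV.
tokens = tokenize("It is going to rain today")
filtered = remove_stop_words(tokenize("I like dancing, so I dance."))
bow = bag_of_words(filtered)

print(tokens)
print(filtered)
print(dict(bow))
```

With this stop-word set, the second sentence reduces to the tokens like, dancing and dance, matching the example in step 4, and each of them has a count of 1 in the resulting bag of words.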