
Lexicon-Based Approach to Sentiment Analysis of Tweets Using R Language:


Second International Conference, ICACDS 2018, Dehradun, India, April 20-21,
2018, Revised Selected Papers,...

Chapter · April 2018


DOI: 10.1007/978-981-13-1810-8_16


2 authors:

Nitika Nigam Divakar Yadav


Indian Institute of Technology (Banaras Hindu University) Varanasi National Institute of Technology, Hamirpur


Lexicon-based approach to Sentiment Analysis
of tweets using R language

Nitika Nigam (1) and Divakar Yadav (2)

(1) M.Tech., Department of CSE, M.M.M. University of Technology,
Gorakhpur-273010, U.P., India
nigamniti8@gmail.com
(2) Associate Professor, Department of CSE, M.M.M. University of Technology,
Gorakhpur-273010, U.P., India
dsy99@rediffmail.com

Abstract. Sentiment analysis is a method for studying users' opinions
on a subject, such as product reviews, appraisals, or emotions expressed
about an entity. There are two main approaches to sentiment analysis:
lexicon-based and machine-learning-based. We emphasize the lexicon-based
approach, which depends on an external dictionary. Our aim is to classify
a given set of tweets into two classes: Positive and Negative. We extract
the semantics from the tweets and calculate a score, which is used to
classify each tweet as positive or negative. In this experiment, we used
the R language as a tool. R is freely available software used for
statistical computation, data manipulation, and graphical display.

Keywords: Sentiment analysis, Twitter, lexicon based approach.

1 Introduction

Many people around the world now use social sites such as Twitter, Facebook,
and LinkedIn to share their views, which makes these sites some of the most
effective communication tools. As a result, a large volume of data (known as
big data) is generated, and sentiment analysis was introduced to analyse these
reviews. Sentiment Analysis (SA) is the process of determining whether a given
text expresses a positive, negative, or neutral opinion. It is also used to
detect people's emotions, to support decision making, etc. The formal
definition of sentiment analysis is “extracting the semantics and determining
the attitude of a speaker, which is concluded to be either a positive,
negative, or neutral reaction.” It was first used in 2003.
It has also been applied to the analysis of pre- or post-criminal activities on
social media, product reviews, movie reviews, news, blogs, etc. The advantages
of sentiment analysis include product improvement, innovation, and market
growth [1]. This method is also known as opinion mining. The analysis depends
entirely on the context provided by the speaker. Sentiment analysis is handled
at several levels of granularity: the document level, the sentence level, and
the phrase level. The best-known use of sentiment analysis is in reviews of
items and services given by users. It is an application of natural language
processing (NLP) and is commonly used in recommender systems. In our paper, we
use data from Twitter. Twitter is an online social networking site that
provides a virtual environment for people who are interested in hanging out
together. It helps people express their thoughts on a subject. People post
their views on numerous topics such as recent issues, political issues,
Bollywood and Hollywood, etc. Many NLP techniques are used to detect sentiment
on Twitter, such as stop word removal, Part-of-Speech tagging, and Named Entity
Recognition (NER), followed by bag-of-words. These techniques use dictionaries
as references. Since no training is required, they need less computational
power. We use the lexicon approach to classify text into two classes,
“Positive” and “Negative”, with the help of dictionaries. The challenges that
arise during feature extraction and classification of the text are listed
below; some of them can be removed by cleaning the text dataset.
– Handling big data consisting of the opinions given by people.
– Informal language, slang words, abbreviations, and emoticons.
– Spelling mistakes and typos.
– Detection of sarcasm [2]. E.g. “Don't bother me. I am living happily ever
after.” Sarcasm: the speaker is taunting as well as hurting the person.
– Ambiguous sentences used by a user. E.g. “I have never tasted a pizza quite
like that one before!” Ambiguity: was the pizza good or bad?
– Hashtag-based text detection [3].
– Detecting the hidden sentiment of a user.
– Polarity shift detection [4].

2 Related work and Techniques used on Twitter datasets


Hearst in 1992 and Kessler et al. in 1997 initiated research on sentiment
analysis. Two major techniques are used to classify the sentiment of text:
lexical analysis and machine-learning-based analysis. Lexical analysis takes a
dictionary-based approach, in which the dictionary is created manually by an
expert. These dictionaries are used to interpret the meaning of words so that
classification can be done easily. A dictionary contains words, typically
adjectives, as indicators, each paired with a semantic orientation (SO) value
(the polarity or strength of the text). The tokens are compared with the
previously compiled dictionary. The matched tokens are annotated with their
corresponding SO values from the dictionary, and the SO values are combined
into a single score.
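
As a minimal sketch of the dictionary lookup described above, the R snippet below annotates tokens with SO values and sums them into one score. The dictionary contents, the SO values, and the names so_dictionary and so_score are illustrative assumptions; they are not taken from any of the cited works.

```r
# Hypothetical semantic-orientation (SO) dictionary: word -> SO value.
# The words and values are illustrative assumptions only.
so_dictionary <- c(good = 2, excellent = 4, bad = -2, terrible = -4)

# Combine the SO values of all matched tokens into a single score.
so_score <- function(text, dictionary) {
  # Lowercase the text and split it on anything that is not a letter
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  # Look up each token by name; tokens missing from the dictionary give NA
  values <- dictionary[tokens]
  # Sum the matched SO values, ignoring unmatched tokens
  sum(values, na.rm = TRUE)
}

example_score <- so_score("The food was good but the service was terrible",
                          so_dictionary)
print(example_score)
```

Here "good" contributes 2 and "terrible" contributes -4, so the combined score is negative; a text with no dictionary words scores 0.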
In [5], the authors presented a method for opinion mining using the
lexicon-based approach. The dataset consisted of product reviews. They
extracted features from the reviews and classified the opinions as positive or
negative. These results were summarized so that shoppers could obtain useful
information. In [6], the authors set out to resolve two major problems of the
lexicon-based method: (1) context-dependent opinion words and (2) the
combination of multiple opinion words in one sentence. They proposed a holistic
lexicon-based approach that consults other customer reviews when a review is
ambiguous. In [7], the authors extracted sentiment from text using a
monolingual dictionary. In this approach, they calculated the semantic
orientation (SO) value with the help of manually created dictionaries, which
consist of words annotated with their strength and polarity. The lists contain
semantic-bearing words such as adjectives, nouns, and adverbs with their SO
values. Their model also handles negation and intensification words (valence
shifters). Without using any prior knowledge or training, their approach
performs well, including across domains. In machine-learning-based analysis,
opinions are extracted automatically, i.e. the computer learns without being
explicitly programmed [8]. It has gained popularity because it is adaptive and
extracts many features easily. It is divided into three subcategories:
supervised, unsupervised, and semi-supervised learning. These techniques are
used to extract features such as terms with their frequency counts, parts of
speech, negation, and syntactic dependencies. In supervised learning, the
technique is applied under the guidance of a supervisor, i.e. the model is
trained on labelled data. The Naive Bayes algorithm is a supervised technique
used for classification [9]. It is reported to work well for document-level
classification. The Support Vector Machine is another algorithm, reported to
achieve high accuracy in text classification [10].
In [2], the authors proposed a pattern-based approach that detects sarcasm on
Twitter. To find sarcasm, they used a pattern-based approach with the help of
Part-of-Speech (PoS) tags, and a machine learning approach for classification.
The features they extracted were grouped as (i) sentiment based, (ii)
punctuation based, (iii) syntactic and semantic based, and (iv) pattern based.
This grouping helps remove noisy or useless data. They detected whether a text
was sarcastic or not, achieving an accuracy of 83.1% with a precision of 91.1%.
In [11], the authors proposed a supervised technique in which pattern analysis
is performed on the writing patterns and unigrams of tweets. The SENTA tool (an
open-source tool) was used to extract features from the text, which was
classified into 7 classes: “happy”, “sad”, “anger”, “hate”, “love”, “fun”, and
“neutral”. The accuracy of multi-class classification was about 60.2%, and
after removal of neutral tweets it was 70.1%.
In [12], the authors addressed the problem of polarity shift detection by
proposing a model called Dual Sentiment Analysis (DSA). DSA uses pairs of
reviews: original reviews and reversed reviews. The reversed reviews were
created through a data expansion technique applied to both the training and
the test reviews. A supervised technique was used for classification with the
help of a dictionary that was domain adaptive as well as language independent.
Removing the dependency on an external antonym dictionary improves performance,
but the dual reviews consume additional space and time. In [13], the authors
used a supervised learning approach with unigram features, which resulted in
73% accuracy. In unsupervised learning, the input data is unlabelled and has
no associated output. No predefined pattern is followed, and the data may
contain discrete values. It is further subdivided into two categories:
clustering and association. Expectation-maximization is an example of an
unsupervised learning algorithm. The work in [14] focuses on sentiment analysis
of data from social media sites such as Twitter, MySpace, and Digg. The authors
proposed a lexicon-based, less domain-specific, intuitive, and unsupervised
algorithm to obtain better results. Their solution is applicable to
subjectivity detection and polarity classification. The advantage of the
approach is that it provides a robust and reliable solution. In semi-supervised
learning, the features are extracted using a combination of supervised and
unsupervised learning. The work in [15] emphasizes feature extraction at the
phrase level, distinguishing between semantic orientation and contextual
polarity. The goal was to extract the important features that identify
contextual polarity. The experiment was a two-step procedure: in the first
step, all instances of a clue are identified with the help of a lexicon and
each is classified as polar or neutral; in the second step, the contextual
polarity of each instance is disambiguated. This improved accuracy, and the
main advantage of the approach is that it supports higher-level NLP tasks.
In [16], the authors proposed an approach for analysing the sentiments of
tweets. They focus on data mining classifiers such as k-nearest neighbour,
random forest, Naive Bayes, and BayesNet, and compare the accuracy of these
classifiers with and without stop words. The work in [17] uses the Naive Bayes
classifier, which is a probability-based method. The dataset consists of movie
opinions posted on Twitter. The sentiments of the tweets were calculated using
the Hadoop framework. The authors compare datasets with and without emoticons:
in the first case, emoticons are converted into their equivalent words, while
in the other case they are ignored. In their approach, performance increases
when emoticons are used. In [18], the authors proposed a novel system for
Hindi-language opinions given by users on different movies, known as the Hindi
Opinion Mining System (HOMS). It uses the Naive Bayes classifier combined with
Part-of-Speech (PoS) tagging and machine learning approaches to classify the
dataset into “positive”, “negative”, and “neutral” classes. In PoS tagging,
only words that are adjectives are considered. The drawback of HOMS is that it
cannot handle discourse relations such as “but”.
In [19], the authors performed sentiment analysis using machine learning
approaches in different languages (English, French, and Dutch). Their aim was
to classify the opinions given by users on the products they used. Since they
were extracting people's feelings, they trained on a set of opinions whose
words had already been manually tagged as “positive”, “negative”, or “neutral”.
They achieved 83% accuracy for English, 70% for Dutch, and 68% for French. In
[20], the authors concentrated on data distributed over the web in the form of
reviews. Opinion mining is the automated analysis and summarization of content
available on the web; it identifies positive and negative viewpoints in order
to examine the positive and negative feelings of the customer.

3 The Proposed Method

The analysis of Twitter data is an emerging field that needs substantially
more attention. There are various methods for classifying tweets into positive
or negative classes. Some researchers use machine learning approaches and some
use lexicon-based methods. The ultimate goal is to extract the sentiments of
the given dataset.
In our paper, we use the R language for our experiment. R is freely available
software used for statistical computation, data manipulation, and graphical
display. It is a dialect of S, which was designed by John M. Chambers in 1980.
It provides many statistical techniques such as clustering and classification.
It runs on any major operating system (Windows, Unix, macOS). It has become
popular because it provides the following facilities:

1. Handles big data.
2. Open-source and free software.
3. Provides storage facilities.
4. Good graphical facilities: it produces graphical output in jpg, png, pdf,
and svg formats, and tables in LaTeX and HTML. It can be easily extended via
packages.
In our approach, we collected data from Twitter and evaluated the results with
the help of the R language. The proposed methodology is illustrated as a flow
chart in Fig. 1.

It consists of four steps, which are listed below:

1. Collection of dataset.
2. Noise removal from tweets.
3. Lexical analysis.
4. Classification and calculation of score.

These steps are explained in detail in the following subsections.
Fig. 1. Flow chart of the proposed methodology.

3.1 Collection of dataset:

The corpus is a collection of tweets about the Hon'ble Prime Minister Narendra
Modi. The dataset was collected with the help of the Twitter API, which
requires authentication to access the tweets. We acquired about 150 tweets
using the following R commands:
# load the twitteR package, which provides searchTwitter()
library(twitteR)
# extract the tweets
modi.tweets <- searchTwitter("Modi", n = 150)

where modi.tweets is the variable that stores the results of the search on the
topic “Modi”, and searchTwitter() is a function provided by the twitteR
package, which is available from CRAN [21]. We passed two arguments: the topic
on which tweets are to be collected and the number of tweets required. Fig. 2
shows the extracted tweets.
Fig. 2. Extracted tweets (10) about Prime Minister Modi.

3.2 Noise Removal from tweets:

To enhance performance, the dataset given as input should not contain any
noise, i.e. it should be a clean dataset. In this section, we remove noise from
the tweets after extracting them. Noise is removed because it does not provide
any kind of knowledge about the output we want. While scanning the dataset, the
useless data is also scanned, which consumes a lot of time (CPU cycles are
wasted). For these reasons, we eliminate the noise (useless data) from the
tweets. Using R, the tweets are extracted, and the next step is to clean the
data. During cleaning, emoticons, URLs, punctuation marks, and targets are
removed, as shown in Fig. 3.

The useless data is explained below:

– Emoticons: These are facial expressions represented pictorially using
punctuation and letters. Emoticons convey the attitude of a user.
– URL: Users sometimes attach a URL to their tweet, which gives the address
of a page.
– Hashtag: Users use hashtags to mark subjects.
– Target: Users use the @ symbol to refer to other users, which automatically
alerts them. This is primarily done to increase the visibility of their tweets.

These are cleaned because the sentiments of the user are easier to understand
once the useless data is removed. For example, “I like a @YouTube video
http://t.co/et8m Shyam Rangilla performed good” looks like “I like a @YouTube
video Shyam Rangilla performed good” after cleaning. Likewise, when “I love my
India” is followed by emoticons, the emoticons are removed and the cleaned
tweet is “I love my India”.
Fig. 3. Extracted clean tweets (10) about Prime Minister Modi.
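
The cleaning step can be sketched in base R as follows. This is an illustration under stated assumptions, not the authors' script: the function name clean_tweets and the remove_mentions flag are hypothetical, and the regular expressions are one plausible way to strip URLs, targets, punctuation, and punctuation-based emoticons.

```r
# Remove noise from a character vector of tweets.
# Assumption: mentions (@user) are dropped by default; pass
# remove_mentions = FALSE to keep them, as in the @YouTube example above.
clean_tweets <- function(tweets, remove_mentions = TRUE) {
  # Drop URLs such as http://t.co/...
  x <- gsub("https?://[^[:space:]]+", " ", tweets)
  # Characters to keep: letters, digits, and whitespace
  keep <- "[:alnum:][:space:]"
  if (remove_mentions) {
    # Drop targets (@user)
    x <- gsub("@[[:alnum:]_]+", " ", x)
  } else {
    # Keep the characters that make up a mention
    keep <- paste0(keep, "@_")
  }
  # Drop everything else: punctuation, hashtag markers, emoticon characters
  x <- gsub(paste0("[^", keep, "]"), " ", x)
  # Collapse repeated whitespace and trim both ends
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

print(clean_tweets("I love my India :)"))
print(clean_tweets(
  "I like a @YouTube video http://t.co/et8m Shyam Rangilla performed good",
  remove_mentions = FALSE))
```

With remove_mentions = FALSE the second call reproduces the cleaned example given in the text; with the default setting the mention is removed as well.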

3.3 Lexical Analysis:


The tweets are subdivided into words, known as lexemes. The lexemes are
matched against the words in the dictionaries. These dictionaries contain
positive and negative words and were created manually. They cover almost all
types of words; for example, most users use short forms to express their
views: “You are looking good” can be written as “U r looking gud”. We have
therefore considered all the possible word forms that people use to express
their opinions.
Some people use a hybrid language such as Hinglish, a combination of Hindi and
English. For example, “modi saheb jitenge”, written by a user, expresses the
user's wish for Modi ji to win. We do not consider hybrid language for
sentiment analysis, because preparing a dictionary for it would be
time-consuming and fewer resources are available for such languages than for
others. Thus, our dictionary does not contain hybrid-language words.
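
The matching of lexemes against the dictionaries can be sketched as below. The word lists are small hypothetical stand-ins for the manually created dictionaries, and the function names tokenize and match_lexemes are assumptions made for this illustration.

```r
# Hypothetical mini-dictionaries, including informal short forms.
positive_words <- c("good", "gud", "love", "great")
negative_words <- c("bad", "hate", "poor")

# Split a cleaned tweet into lowercase lexemes.
tokenize <- function(tweet) {
  lexemes <- strsplit(tolower(tweet), "[[:space:]]+")[[1]]
  lexemes[lexemes != ""]
}

# Report which lexemes of a tweet appear in each dictionary.
match_lexemes <- function(tweet, pos = positive_words, neg = negative_words) {
  lexemes <- tokenize(tweet)
  list(positive = lexemes[lexemes %in% pos],
       negative = lexemes[lexemes %in% neg])
}

print(match_lexemes("U r looking gud"))
print(match_lexemes("modi saheb jitenge"))
```

The short form "gud" is found because it is listed explicitly, while the Hinglish tweet matches nothing, which mirrors the limitation described above.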

3.4 Classification and calculation of score:


In this section, lexemes are tagged with the help of the dictionaries. Tagging
helps classify tweets as having a positive or a negative sense. The
classification is done by calculating a score. The words of a tweet are matched
against the dictionary words: a positive word scores +1, a negative word
scores -1, and any other word scores 0.
The formula for calculating the score is given below:
\[ \mathrm{Score} = \mathrm{Pos}(x, \text{“pos word”}) - \mathrm{Neg}(x, \text{“neg word”}) \tag{1} \]
where x is a phrase, “pos word” denotes the positive words, and “neg word”
denotes the negative words.
The results for some tweets are shown in Fig. 4. Each lexeme in a sentence is
compared with the dictionary words and assigned a score accordingly.

Fig. 4. Score of clean tweets (10) about Prime Minister Modi.
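
Equation (1) can be read as the number of positive matches minus the number of negative matches. The sketch below implements that reading in base R; the word lists, the example tweets, and the name score_tweet are illustrative assumptions. A score of 0 is labelled “Negative” here only because the paper's rule assigns every non-positive score to the negative class.

```r
# Hypothetical dictionaries and tweets for illustration only.
pos_words <- c("good", "great", "love")
neg_words <- c("bad", "poor", "hate")
tweets <- c("good work and great speech",
            "bad roads and poor planning",
            "good idea but bad timing")

# Score = (number of positive lexemes) - (number of negative lexemes)
score_tweet <- function(tweet, pos_words, neg_words) {
  lexemes <- strsplit(tolower(tweet), "[[:space:]]+")[[1]]
  sum(lexemes %in% pos_words) - sum(lexemes %in% neg_words)
}

# Score every tweet, then classify: positive score -> "Positive",
# anything else -> "Negative"
scores <- vapply(tweets, score_tweet, integer(1),
                 pos_words = pos_words, neg_words = neg_words,
                 USE.NAMES = FALSE)
labels <- ifelse(scores > 0, "Positive", "Negative")

print(data.frame(tweet = tweets, score = scores, label = labels))
```

The third tweet contains one positive and one negative word, so its score is 0 and it falls into the negative class under this rule.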

4 Experimental results and Discussion


In this section, we discuss the output obtained after calculating the score of
all tweets. All the results are stored in tabular form, converted into a csv
file, and merged into one final table (table final$Score). The final table
consists of positive, negative, and score values, and the scores are
represented in the form of a histogram. The histogram plots the score values
in the final table (-2 to 2) against the frequency of occurrence of each score
(0 to 5), as shown in Fig. 5.
The range is small because our dataset contains a small number of tweets. Some
parameters are not taken into consideration because they would make the
analysis more complex, for instance the hybrid language “Hinglish”. The results
can be improved by considering these parameters and using new techniques. In
future, we will work on these problems.
Fig. 5. Histogram.
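
A small sketch of the tabulation and plotting step follows. The counts are made-up values rather than the paper's data, and the data frame name table_final and its column names are assumptions chosen to resemble the final table described above.

```r
# Hypothetical per-tweet counts of positive and negative words.
table_final <- data.frame(
  positive = c(2, 0, 1, 1, 0, 3),
  negative = c(0, 2, 1, 0, 1, 1)
)
table_final$Score <- table_final$positive - table_final$negative

# Store the table as a csv file and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(table_final, csv_path, row.names = FALSE)
reloaded <- read.csv(csv_path)

# Frequency of each score value
score_counts <- table(reloaded$Score)
print(score_counts)

# Draw the histogram only in an interactive session; one bar per score value
if (interactive()) {
  hist(reloaded$Score,
       breaks = seq(min(reloaded$Score) - 0.5, max(reloaded$Score) + 0.5, by = 1),
       main = "Histogram of tweet scores",
       xlab = "Score", ylab = "Frequency")
}
```

Centering the breaks on integer scores gives one bar per score value, which matches the small score range (-2 to 2) reported for this dataset.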

5 Conclusion

Sentiment analysis is a method of investigating the sentiments of a given text
so that good decisions can be made for improvement. There are two main
approaches: lexicon-based and machine-learning-based. We have focused on the
lexicon-based approach. In our experiment, we used a Twitter dataset and two
manually designed dictionaries (Positive and Negative), with the R language as
the tool. The tweets concern the Hon'ble Prime Minister Narendra Modi. The
difference between the number of positive words and the number of negative
words in a sentence was calculated and stored in the variable Score. The score
states the polarity of the sentence: if the score has a positive value, the
sentence is positive; otherwise it is negative. The result is shown in Fig. 5.
In the future, we will use a machine learning approach and compare its results
with the lexicon-based approach. In addition, we will consider emoticons,
discourse words, and slang words used in tweets to express feelings. Hybrid
language and complex sentences will be considered too.

References
1. H. Thakkar and D. Patel, “Approaches for sentiment analysis on Twitter: A
state-of-art study,” arXiv preprint arXiv:1512.01043, 2015.
2. M. Bouazizi and T. Ohtsuki, “A pattern-based approach for sarcasm detection
on Twitter,” IEEE Access, vol. 4, pp. 5477–5488, 2016.
3. A. Joshi, P. Bhattacharyya and M. J. Carman, “Automatic sarcasm detection: A
survey,” arXiv preprint arXiv:1602.03426, 2016.
4. R. Xia, F. Xu, C. Zong, Q. Li, Y. Qi and T. Li, “Dual sentiment analysis:
Considering two sides of one review,” IEEE Transactions on Knowledge and Data
Engineering, vol. 27, pp. 2120–2133, 2015.
5. M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proceedings
of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, ACM, pp. 168–177, 2004.
6. X. Ding, B. Liu and P. S. Yu, “A holistic lexicon-based approach to opinion
mining,” ACM, pp. 231–240, 2008.
7. M. Taboada, J. Brooke, M. Tofiloski, K. Voll and M. Stede, “Lexicon-based
methods for sentiment analysis,” Computational Linguistics, vol. 37,
pp. 267–307, 2011.
8. M. Kanakaraj and R. M. R. Guddeti, “NLP based sentiment analysis on Twitter
data using ensemble classifiers,” IEEE, pp. 1–5, 2015.
9. J. D. Rennie, L. Shih, J. Teevan and D. R. Karger, “Tackling the poor
assumptions of naive Bayes text classifiers,” in Proceedings of the 20th
International Conference on Machine Learning (ICML-03), pp. 616–623, 2003.
10. S. Schrauwen, “Machine learning approaches to sentiment analysis using the
Dutch Netlog Corpus,” Computational Linguistics and Psycholinguistics Research
Center, pp. 30–34, 2010.
11. M. Bouazizi and T. Ohtsuki, “A pattern-based approach for multi-class
sentiment analysis in Twitter,” IEEE Access, vol. 5, pp. 20617–20639, 2017.
12. R. Xia, F. Xu, C. Zong, Q. Li, Y. Qi and T. Li, “Dual sentiment analysis:
Considering two sides of one review,” IEEE Transactions on Knowledge and Data
Engineering, vol. 27, pp. 2120–2133, 2015.
13. E. Boiy and M.-F. Moens, “A machine learning approach to sentiment analysis
in multilingual Web texts,” Information Retrieval, Springer, vol. 12,
pp. 526–558, 2009.
14. G. Paltoglou and M. Thelwall, “Twitter, MySpace, Digg: Unsupervised
sentiment analysis in social media,” ACM Transactions on Intelligent Systems
and Technology (TIST), vol. 3, p. 66, 2012.
15. T. Wilson, J. Wiebe and P. Hoffmann, “Recognizing contextual polarity: An
exploration of features for phrase-level sentiment analysis,” Computational
Linguistics, vol. 35, pp. 399–433, 2009.
16. A. P. Jain and V. D. Katkar, “Sentiments analysis of Twitter data using
data mining,” in 2015 International Conference on Information Processing
(ICIP), IEEE, pp. 807–810, 2015.
17. H. Parveen and S. Pandey, “Sentiment analysis on Twitter data-set using
Naive Bayes algorithm,” in 2016 2nd International Conference on Applied and
Theoretical Computing and Communication Technology (iCATccT), IEEE,
pp. 416–419, 2016.
18. V. Jha, N. Manjunath, P. D. Shenoy, K. R. Venugopal and L. M. Patnaik,
“HOMS: Hindi opinion mining system,” in 2015 IEEE 2nd International Conference
on Recent Trends in Information Systems (ReTIS), IEEE, pp. 366–371, 2015.
19. E. Boiy and M.-F. Moens, “A machine learning approach to sentiment analysis
in multilingual Web texts,” Information Retrieval, Springer, vol. 12, no. 5,
pp. 526–558, 2009.
20. S. Y. Ganeshbhai and B. K. Shah, “Feature based opinion mining: A survey,”
in 2015 IEEE International Advance Computing Conference (IACC), pp. 919–923,
2015.
21. Available at: http://rfunction.com/archives/1984
