
CHAPTER - 1

INTRODUCTION
Sentiment analysis, also called opinion mining, is the branch of text mining that studies
the sentiments, attitudes, reactions, and evaluations expressed in text. It analyses
people's opinions, sentiments, evaluations, attitudes, and reactions towards entities such
as services, products, organizations, individuals, events, topics, and issues, and towards
the attributes of those entities.

Twitter is a real-time microblogging platform on which a person or group can express a
view about a particular topic. A message posted on Twitter is called a tweet. Users, their
friends and followers, tweets, and timelines are the key components of Twitter. A timeline
is a chronologically sorted collection of tweets. A person can express a view to the
world in various forms, such as text and multimedia. The popularity of Twitter as an
information source has led to the development of applications and research in many fields.
For example, Twitter has been used to predict the occurrence of earthquakes and to identify
relevant users to follow for disaster-related information.

Web search applications and real-world applications, such as tracking current trends and
world events or extracting the latest information about incidents, use microblog data for
their analysis and decision making.

Twitter differs considerably from other social media. It is a microblogging service, and
its content is more opinion oriented than informative about a topic. Tweets contain many
abbreviations and symbols, and the content often resembles conversation. Because Twitter
uses an asymmetric following model, one user can follow another without mutual acceptance.

Thus, a large and varied amount of knowledge can be extracted from tweets. This project
studies sentiment detection from tweets. This report discusses the step-by-step
methodology and a comparative analysis of existing systems for sentiment classification.

1.1 Problem description

In this project, we try to implement an NLP-based Twitter sentiment analysis model that
helps to overcome the challenges of identifying the sentiment of tweets. The details of
the dataset used in the project are as follows.

The dataset is the Sentiment140 dataset, which consists of 1,600,000 tweets extracted
using the Twitter API. The columns present in the dataset are:

target: the polarity of the tweet (positive or negative)

ids: the unique id of the tweet

date: the date of the tweet

flag: the query used to retrieve the tweet; if there is no query, the value is NO_QUERY

user: the name of the user who posted the tweet

text: the text of the tweet
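
The column layout above can be sketched in code. This is a minimal illustration and not the project's loader: it assumes the raw CSV has no header row and that the target column encodes negative as 0 and positive as 4. The two sample rows are made up for the example and are not taken from the dataset.

```python
import csv
import io

# Column order as listed above; the raw CSV is assumed to have no header row.
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

# Assumed encoding of the target column: 0 = negative, 4 = positive.
POLARITY = {"0": "negative", "4": "positive"}


def read_tweets(handle):
    """Yield one dict per CSV row, with a readable polarity label added."""
    for row in csv.reader(handle):
        record = dict(zip(COLUMNS, row))
        record["polarity"] = POLARITY.get(record["target"], "unknown")
        yield record


# Two made-up rows in the same shape as the dataset (not real tweets).
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","my phone died again"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","loving the new update"\n'
)

rows = list(read_tweets(io.StringIO(SAMPLE)))
```

To read the downloaded file instead of the in-memory sample, pass an open file handle to `read_tweets`; the file is commonly opened with a Latin-1 encoding, which is also an assumption here.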

1.4 Motivation

Countries around the world are facing extraordinary challenges in implementing measures
to slow down the spread of the novel coronavirus (COVID-19) and to sustain their
healthcare systems. A good illustration of these measures comes from China, which
implemented home isolation of cases, home quarantine, social distancing, and closure of
schools and universities in highly affected regions. These measures reduced R0, a measure
of the reproduction of new infections, to less than one and thus suppressed the local
spread of the virus (Kucharski et al., 2020). A report published by the Imperial College
COVID-19 Response Team on March 16, 2020 showed that effective suppression of the virus
spread was achievable by implementing policies that include population-wide social
distancing combined with home isolation of cases and school and university closures
(Ferguson et al., 2020).

Guided by such policies, the government of Saudi Arabia implemented various public
health measures after the detection of the first confirmed case on March 2nd, 2020. The
following timeline reflects the series of measures that were gradually implemented
within the first two weeks (Saudi Press Agency, 2020):
- March 5: temporary daily closures of the Great Mosque for sterilization purposes

- March 8: a temporary control of all traffic in and out of Qatif, where the majority of
the first cases of COVID-19 were recorded

- March 8: closure of schools and universities announced, starting from the following day
until further notice

- March 14: closure of all shopping malls, restaurants, coffee shops, and public parks,
with the exception of essential businesses such as pharmacies and supermarkets

- March 15: all sport leagues and competitions suspended until further notice

- March 17: all congregational and weekly Friday prayers suspended across the Kingdom

- March 23: Saudi Arabia announced a nationwide curfew for the next 21 days

Fig 1.4 illustrates the sequence of the events over time.

Fig 1.4: Timeline of events and covid-19 cases in Saudi Arabia on daily basis.
1.2 Objective

This project is a web app built with Python and the Flask framework. It has a registration
system and a dashboard. Users can enter a keyword to retrieve live tweets that match it
and analyse them for customer feelings and sentiments. The results can be visualized in a
graph. The project mines data using Tweepy, a popular Python library for the Twitter API.
Tweepy connects to Twitter in real time and gathers the tweet text along with its
metadata.
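
The report lists TextBlob among its libraries (Section 4.3). The following is a minimal sketch of how a retrieved tweet might be scored with it, assuming TextBlob is installed. The `label_polarity` helper and its 0.05 threshold are hypothetical choices made for illustration, not values taken from the project.

```python
def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse label (threshold is illustrative)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def analyse(text):
    """Score one tweet with TextBlob; the import is lazy so it is only needed here."""
    from textblob import TextBlob

    polarity = TextBlob(text).sentiment.polarity  # float between -1.0 and 1.0
    return polarity, label_polarity(polarity)
```

Keeping the labelling rule separate from the library call makes it possible to swap in a different scorer later without touching the dashboard code.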

CHAPTER - 2

SENTIMENT ANALYSIS

2.1 Definition

Sentiment and subjectivity are highly context and domain dependent. This is not only
because vocabulary changes between domains, but also because the same expression can carry
different sentiments in different domains. Consider the expression ‘go and read the book’.
In a book review, this expression conveys positive polarity about the product, but in a
movie review the same expression conveys negative polarity. Sentiment analysis focuses on
extracting the polarity towards a particular topic rather than assigning a particular
emotion to the text. Opinion mining and sentiment analysis are branches of text mining,
which refers to the process of extracting nontrivial patterns and interesting information
from unstructured text documents. They can be seen as an extension of data mining and
knowledge discovery. Opinion mining and sentiment analysis focus on polarity detection and
emotion recognition, respectively. Opinion mining has higher commercial potential than
data mining because text is the most natural form of storing information. It is also a
much more complex task than data mining because it has to deal with unstructured and fuzzy
data. It is a multi-disciplinary area of research because it adopts techniques from
information retrieval, text analysis and extraction, auto-categorization, machine
learning, clustering, and visualization. Although sentiment analysis and opinion mining
may look the same as traditional text mining or fact-based analysis, they differ in the
following ways. Sentiment classification is a binary polarity classification that deals
with a relatively small number of classes. It is therefore an easier task than text
auto-categorization.

2.2 Levels

Sentiment analysis can be divided into the following levels.

2.2.1 Document

The task at this level is to classify the sentiment of a whole document. The document is
assumed to be about a single topic. Thus, texts that contain comparisons between several
entities cannot be considered at this level.
2.2.2 Sentence

The task at this level works on individual sentences: it determines whether each sentence
expresses a positive opinion, a negative opinion, or a neutral view. A sentence that
states no opinion is neutral. This level of analysis is closely related to subjectivity
classification. A subjective statement expresses the polarity of an entity in affirmative
or negative terms, i.e., good or bad. Hence it is easy to obtain sentiment from it. An
objective statement, in contrast, does not express polarity directly in affirmative or
negative terms. Such sentences are fact based.

2.2.3 Entity or Aspect

Aspect-level analysis gives a more detailed result. The core task at the entity level is
to identify the aspects discussed in the text.

For example, in a review of a mobile phone a customer says, “Sound is good but the handset
is not handy.” Here the aspects are sound and handiness. Sentiment analysis becomes a
two-stage task: finding the aspects in the text and then classifying the sentiment for
each aspect. Aspect-level sentiment analysis is more fine-grained than document-level and
sentence-level sentiment analysis. It analyses the sentiment towards a topic or entity
that may or may not be stated explicitly in the document. Thus, comparative statements are
also part of entity-level sentiment analysis.
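
The two-stage task can be sketched on the example sentence. This is a toy illustration under stated assumptions: the aspect vocabulary, the opinion lexicon, and the clause-splitting rule are all hypothetical, and the sketch attaches the second opinion to the word "handset" rather than to the abstract aspect "handiness".

```python
import re

ASPECTS = {"sound", "handset", "battery", "screen"}        # hypothetical aspect vocabulary
OPINIONS = {"good": 1, "handy": 1, "bad": -1, "poor": -1}  # hypothetical opinion lexicon
NEGATIONS = {"not", "no", "never"}


def aspect_sentiments(sentence):
    """Return {aspect: 'positive' | 'negative'} for each clause that names an aspect."""
    results = {}
    # Stage 1: split into clauses on "but", "and", commas, or semicolons, then find the aspect.
    for clause in re.split(r"\bbut\b|\band\b|[,;]", sentence.lower()):
        tokens = re.findall(r"[a-z']+", clause)
        aspect = next((t for t in tokens if t in ASPECTS), None)
        if aspect is None:
            continue
        # Stage 2: score the clause, flipping an opinion word that follows a negation.
        score = 0
        negate = False
        for token in tokens:
            if token in NEGATIONS:
                negate = True
            elif token in OPINIONS:
                score += -OPINIONS[token] if negate else OPINIONS[token]
                negate = False
        if score != 0:
            results[aspect] = "positive" if score > 0 else "negative"
    return results


result = aspect_sentiments("Sound is good but the handset is not handy.")
```

Here `result` maps "sound" to positive and "handset" to negative, which a single document-level or sentence-level label could not express.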

2.3 Approaches

Opinion mining and sentiment analysis can be carried out in the following ways: keyword
spotting, lexical affinity, and statistical methods.

2.3.1 Keyword Spotting

In this technique, the text is categorized based on the presence of fairly unambiguous
words. The keywords present in the text therefore determine the outcome of sentiment
analysis.
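
A minimal sketch of keyword spotting follows. The keyword lists are hypothetical and far smaller than any curated lexicon; the tie-breaking rule (equal hits count as neutral) is also an assumption.

```python
import re

# Hypothetical keyword lists; a real system would use a curated lexicon.
KEYWORDS = {
    "positive": {"good", "great", "love", "excellent", "happy"},
    "negative": {"bad", "terrible", "hate", "awful", "sad"},
}


def spot_keywords(text):
    """Classify text by which category has more keyword hits; ties are 'neutral'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = {label: sum(t in words for t in tokens) for label, words in KEYWORDS.items()}
    if hits["positive"] == hits["negative"]:
        return "neutral"
    return max(hits, key=hits.get)
```

The approach is simple but brittle: `spot_keywords("not good")` returns "positive" because the negation is ignored.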

2.3.2 Lexical affinity

Lexical affinity assigns arbitrary words a probabilistic affinity for a particular
emotion.

2.3.3 Statistical methods

Statistical methods calculate the valence of affective keywords and word co-occurrence
frequencies on the basis of a large training corpus. Early work aimed to classify an
entire document as overall affirmative or negative. These systems depend mainly on
supervised learning approaches, which require manually labelled data. Examples of such
data are movie or product review databases. Sentiments are often not restricted to
document-level text; they can also be extracted at the sentence level. In such cases
sentiment analysis can be done using detected opinion-bearing lexicon items. Sentiments
are also not limited to a single target: a document may contain contrasting sentiments
towards the same topic, or may cover multiple topics.

2.4 Features

Sentiment features are as follows:

2.4.1 Terms presence and frequency

These features are individual words or word n-grams and their frequency counts. Each term
is given either a term-frequency weight or a binary weight that records only whether the
term is present.
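
The difference between frequency weighting and binary weighting can be sketched as follows. The tokenizer is a deliberately simple assumption (lowercase alphabetic tokens), and the function names are illustrative.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_features(text, n=1, binary=False):
    """Count n-grams in text; with binary=True every present n-gram gets weight 1."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(ngrams(tokens, n))
    if binary:
        return {gram: 1 for gram in counts}
    return dict(counts)


freq = term_features("good phone good price")
presence = term_features("good phone good price", binary=True)
bigrams = term_features("good phone good price", n=2)
```

In `freq` the unigram "good" has weight 2, while in `presence` it has weight 1, so binary weighting records only that the word occurred.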

2.4.2 Parts of speech (POS)

This feature involves finding adjectives in the text, as they are important indicators of opinions.

2.4.3 Opinion words and phrases

These are words that themselves express an opinion about the product or service in the
text, e.g., good or bad, like or hate. Some phrases also express opinions without using
opinion words.

Negations: the presence of negation words may change the opinion orientation; for example,
“not good” is equivalent to “bad”.
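
One simple way to handle negation, sketched here under stated assumptions, is to invert the polarity of the first opinion word that follows a negation word. The lexicon and the negation list are hypothetical, and real systems usually need a more careful notion of negation scope.

```python
import re

LEXICON = {"good": 1, "like": 1, "bad": -1, "hate": -1}  # hypothetical opinion words
NEGATIONS = {"not", "no", "never", "don't", "isn't"}     # hypothetical negation words


def polarity_with_negation(text):
    """Sum word polarities, inverting the first opinion word after each negation."""
    score = 0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATIONS:
            negate = True
        elif token in LEXICON:
            score += -LEXICON[token] if negate else LEXICON[token]
            negate = False
    return score
```

With this rule, "not good" and "bad" receive the same score, matching the equivalence described above.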

CHAPTER – 3

LITERATURE REVIEW
It is no surprise that Twitter can be responsible for the commendation or defamation of a
brand or company, since it is very convenient for users to post their personal likes and
preferences in the form of online reviews.

Bernard showed how people's sentiments fluctuate from week to week and how brands struggle
to maintain a positive image in front of their potential customers.

Ted developed a simple supervised machine learning approach to break down meaningless
words and provide cleaner datasets.

Varsha proposed the use of a part-of-speech (POS) specific prior polarity feature. They
also introduced a tree kernel based methodology in order to remove repetitive features.

Tony suggested using support vector machines (SVMs) to obtain an efficient sentiment
analysis system. The results show that a hybrid SVM indeed yields better accuracy. Their
research also showed that adding Osgood values did not improve the performance score,
whereas introducing Turney values did improve accuracy.

In the paper “Sentiment analysis on Twitter data”, Apoorv introduced new features and
experimented with combinations of various models: a unigram model, a tree kernel model, a
100 Senti-features model, kernel plus Senti-features, and unigram plus Senti-features.
They compared the accuracy score in each case over a provided data set. Their results
showed that the tree kernel and feature-based models perform better than the unigram
baseline.

G. Vinodhini and RM. Chandrasekaran focused on the challenges and problems prevalent in
this field, along with a comparison of analyses done on movie reviews and product reviews.
They also concluded that most researchers prefer to use the movie reviews dataset, but
that it is not possible to judge which dataset will give better performance.

A recent work on sentiment analysis of movie reviews on Twitter was done by Kiruthika et
al. They extracted Twitter data through the Twitter API after building the required
application on the developer site. They then performed sentiment analysis of tweets about
six movies using a supervised learning approach: they pre-processed the data, applied
various models to it, and used feature-based opinion mining to analyse various aspects of
the movie reviews. The result was a system combining supervised learning and a POS tagger
that weighed the sentiment orientation of the tweets reviewing those movies.

Changhua et al. used a support vector machine (SVM) and a conditional random field (CRF)
to explore emotion classification. After training the classifiers with common emotion
words, they presented their results in terms of precision, recall, and F-score. Their
research was carried out at the document level as well as the sentence level. At the
sentence level, they compared the performance of the CRF classifier with that of the SVM
classifier. The results showed that the CRF outperformed the SVM. Another interesting
finding was that the emotion conveyed in the last sentence generally described the emotion
of the entire document, which suggests that people usually conclude an opinionated text in
a way that reflects their real motive for writing it.

Kushal raised the problem of mixed reviews. The author expressed concern over the
classification of reviews that contain both positive and negative sentiments. Such reviews
often reduce the performance score because they are categorized incorrectly. The paper
also concluded that Amazon reviews are likely to give better results than Twitter reviews
under the same machine learning algorithm for sentiment prediction, because they are
comparatively longer.

Neelima and Ela proposed the use of a Bayes classifier and a Maximum Entropy classifier
for Twitter sentiment analysis and compared the two results.

Bhumika compared the accuracies of the following models: DAN2, SVM, Bayesian Logistic
Regression, Naïve Bayes, Random Forest Classifier, Neural Network, Maximum Entropy, and an
Ensemble classifier. The last two classifiers gave the highest performance. They also
concluded that the efficiency of a classifier is inversely proportional to the number of
classes.

CHAPTER - 4

METHODOLOGY

4.1 Proposed system:


To overcome the drawbacks of the methods we have reviewed above, we propose a new
model for sentiment analysis. In this model we combine many techniques to reach our
final goal of emotion extraction.

The steps for the process are documented below:

1. Retrieval of Data: Public Twitter data is mined using the existing Twitter APIs for data
extraction. Tweets would be selected based on a few chosen keywords pertaining to the
domain of our concern, i.e., product reviews. We have elected to use the Twitter API due
to ease of data extraction.

2. Preprocessing: In this stage, we remove identifying information such as Twitter
handles, message timestamps, and embedded links and videos. Such information is largely
irrelevant and may cause our system to give false results.

3. Tweet Correction: As tweets are written for human perusal, they often contain slang,
misspellings and other irrelevant data. Thus, we correct the misspellings in the sentences
and look to replace the slang in the sentences with words from standard English that may
roughly relate to the slang in question. As slang itself can be used to display a wide
variety of sentiment, often with greater emotional impact, this process is necessary so that
slang words may be considered as part of the emotion expressed.

4. Polarity detection: In this step we begin the second phase of our proposed system, in
which we try to identify the polarity of the sentence in question. If emoticons exist in the
statements, they will be used as well to compute the overall polarity of the statement. We
aim to find sentences where the polarity detection is not very clear or where the expressed
sentiment may be low. We also try to isolate the opinion words in the sentence in relation
to a given concept in the sentence.

a. We train the system to understand the relation between words in various
contexts. Pre-existing dictionaries like SenticNet can be used in this phase to
segregate the emotion from the context it is in.

b. Once the opinion words are identified with context, we can find the polarities of
the words using NLTK-SentiWordNet.

c. To help with detection of the concepts associated, we train our system on a
large dataset that expresses a wide variety of complex and ambiguous emotions.
The system is given this data in an unsupervised fashion and will proceed by
clustering.

5. Emotion Extraction: Emotion models often map the core emotions to a computational
scale from which we can broadly classify and detect the emotions expressed. For the
purposes of our system, we consider the “Plutchik’s Wheel of Emotion” which divides all
emotions into an eight-point wheel which represents the intensity and complexity of
human feeling as we move from the centre of the wheel to the outer rim. The central core
is made of 8 basic emotions that decrease in intensity as we move away from the centre,
often blending with one or more emotion to become increasingly complex. For example,
the wheel may express the simple emotions “rage” and “loathing” at the centre, but the
rims contain the harder to identify emotions of “contempt”, “boredom” and “annoyance”.

a. Mapping: Once the emotional relation has been extracted, we map it to
Plutchik’s model using a neuro-fuzzy inference system. Because ambiguous phrases
are likely to express two or more emotions together to create a complex feeling,
the neuro-fuzzy system represents each emotion by a degree of membership rather
than by a single label.

b. Once the system calculates the degree of membership of the emotion or
emotions expressed in the statements, we use it to determine the most significant
emotions. This is decided by comparing all the degrees of membership given by the
opinion words in the statement.

6. A graphical representation is provided for the statement. The block diagram for the
proposed system is given in Fig 4.1:

Fig 4.1: Model of proposed system
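
Parts of steps 3, 4, and 5b of the proposed system can be sketched in a few lines. This is an illustrative sketch and not the proposed implementation: the slang map, emoticon scores, and word lexicon are hypothetical, and the neuro-fuzzy inference system itself is not implemented. The degrees of membership are simply taken as given inputs, and the `margin` used to keep near-maximal emotions is an assumption.

```python
SLANG = {"gr8": "great", "luv": "love", "thx": "thanks"}          # hypothetical slang map
EMOTICONS = {":)": 1, ":D": 1, ":(": -1, ":'(": -1}               # hypothetical emoticon scores
WORD_POLARITY = {"great": 1, "love": 1, "awful": -1, "hate": -1}  # hypothetical lexicon


def replace_slang(text):
    """Step 3: swap known slang tokens for standard English words."""
    return " ".join(SLANG.get(token.lower(), token) for token in text.split())


def polarity(text):
    """Step 4: add word and emoticon scores into one overall polarity."""
    score = 0
    for token in text.split():
        if token in EMOTICONS:
            score += EMOTICONS[token]
        else:
            score += WORD_POLARITY.get(token.lower().strip(".,!?"), 0)
    return score


def dominant_emotions(memberships, margin=0.1):
    """Step 5b: keep emotions whose membership is within `margin` of the maximum."""
    if not memberships:
        return []
    top = max(memberships.values())
    return sorted(name for name, degree in memberships.items() if top - degree <= margin)


corrected = replace_slang("luv this phone gr8 camera :)")
overall = polarity(corrected)
significant = dominant_emotions({"joy": 0.8, "trust": 0.75, "anger": 0.1})
```

Returning every emotion near the maximum, rather than a single winner, reflects the idea that an ambiguous statement may express more than one emotion at once.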

4.2 Hardware Requirements

PROCESSOR: x86_64 CPU ARCHITECTURE; 2nd GENERATION INTEL CORE or NEWER, or AMD CPU.

RAM: 4 GB or MORE.

HARD DISK: MINIMUM 8 GB OF AVAILABLE DISK SPACE.

4.3 Software Requirements

IDE: PyCharm

PROGRAMMING LANGUAGE: JavaScript, Python, HTML/CSS

Libraries: Tweepy, MySQL Connector, Textblob, Flask, matplotlib

ADMIN: phpMyAdmin

DATABASE: MySQL

CHAPTER-5

SOLUTION APPROACH

5.1 Manual approach

Twitter sentiment analysis can be done largely by hand.

Natural Language Processing (NLP) is a hotbed of research in data science these days and
one of the most common applications of NLP is sentiment analysis. From opinion polls to
creating entire marketing strategies, this domain has completely reshaped the way
businesses work, which is why this is an area every data scientist must be familiar with.

Thousands of text documents can be processed for sentiment (and other features including
named entities, topics, themes, etc.) in seconds, compared to the hours it would take a
team of people to manually complete the same task.

5.2 Automatic approach

5.2.1. Collect data

Data needs to be representative because you’ll make further strategic decisions based
upon it. Choose whether you want to gather current or historical tweets and then start
collecting data. You have a few different options to do that:

You can create a so-called Zap – an automated workflow on Zapier. Choose an app for
which you want to gather data (in this case, Twitter) and an application to which the data
will be sent (e.g., Google Sheets).

You can use the Twitter API to gain access to public Twitter data or to collect tweets
from specific users. You can also connect to the Twitter Streaming API to gather tweets
with keywords, brand mentions, and hashtags.

You can use Tweepy, a Python package for the Twitter API. For this, you will need your own
Twitter developer account and API keys for authentication.
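
The Tweepy option might look like the following minimal sketch. It assumes Tweepy 4.x, the Twitter API v2 recent-search endpoint, and a valid bearer token with sufficient access. The `build_query` helper and the choice to exclude retweets are illustrative assumptions, not part of the project.

```python
def build_query(keyword, lang="en"):
    """Build a v2 search query that skips retweets and filters by language."""
    return f"{keyword} -is:retweet lang:{lang}"


def fetch_recent_tweets(bearer_token, keyword, limit=10):
    """Return the text of recent tweets matching keyword (needs network and credentials)."""
    import tweepy  # imported lazily so build_query works without Tweepy installed

    client = tweepy.Client(bearer_token=bearer_token)
    response = client.search_recent_tweets(query=build_query(keyword), max_results=limit)
    # response.data is None when nothing matches, so fall back to an empty list.
    return [tweet.text for tweet in (response.data or [])]
```

The recent-search endpoint accepts between 10 and 100 results per request, so `limit` should stay within that range.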

5.2.2. Organize the data

Twitter data is unstructured, so you need to clean it up first. The higher the quality of
the data, the better the results. Delete any unnecessary information such as emojis, extra
blank spaces, and special characters. Make sure to remove duplicate tweets and those that
are too short.
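
The clean-up described above can be sketched with regular expressions. The patterns and the `min_words` threshold are assumptions chosen for illustration; a project may well want to keep hashtags or emojis as sentiment signals instead of deleting them.

```python
import re


def clean_tweet(text):
    """Strip links, handles, emojis, and special characters, then collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)      # embedded links
    text = re.sub(r"@\w+", " ", text)              # Twitter handles
    text = re.sub(r"[^A-Za-z0-9\s']", " ", text)   # emojis and special characters
    return re.sub(r"\s+", " ", text).strip()


def organise(tweets, min_words=3):
    """Clean tweets, then drop duplicates (case-insensitive) and very short ones."""
    seen = set()
    kept = []
    for raw in tweets:
        cleaned = clean_tweet(raw)
        key = cleaned.lower()
        if len(cleaned.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept
```

Duplicates are detected after cleaning, so two tweets that differ only in punctuation, case, or attached links count as the same tweet.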

5.2.3. Analyse your mentions

This is the most challenging part of the process. Your task is to categorize all mentions
and analyse their sentiment yourself. You can use Excel formulas or dictionary references
to determine the positive or negative value of certain words and then average the scores
to obtain the sentiment of the text. However, with a large number of mentions this quickly
becomes overwhelming.
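
The dictionary-and-average idea can be written as a short function. The word scores below are hypothetical values on a scale from -1 to 1, and returning 0.0 when no dictionary word is found is an assumption.

```python
import re

# Hypothetical word scores in [-1, 1]; a real dictionary would be far larger.
SCORES = {"excellent": 1.0, "good": 0.5, "slow": -0.5, "broken": -1.0}


def average_sentiment(text):
    """Average the scores of the dictionary words found in text (0.0 if none are found)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    found = [SCORES[token] for token in tokens if token in SCORES]
    return sum(found) / len(found) if found else 0.0
```

Averaging keeps long and short mentions on the same scale, but one strongly scored word can dominate a short text.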

5.2.4. Visualize your results

It’s the last step – visualize your results to see everything clearly. You can use tools like
Google Data Studio to create interactive reports and share them with other team members.
You can easily integrate the data from Google Sheets or Excel, so this shouldn’t cause
much of a problem.
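
Since the report lists matplotlib among its libraries (Section 4.3), the same step could also be done in Python. The following is a minimal sketch, assuming matplotlib is installed; the function names and the output file name are placeholders.

```python
from collections import Counter


def count_labels(labels):
    """Count how many tweets fall into each sentiment class, in a fixed display order."""
    counts = Counter(labels)
    return {name: counts.get(name, 0) for name in ("positive", "neutral", "negative")}


def plot_counts(counts, path="sentiment.png"):
    """Save a bar chart of the counts (requires matplotlib; the file name is a placeholder)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_xlabel("Sentiment")
    ax.set_ylabel("Number of tweets")
    fig.savefig(path)
    plt.close(fig)
```

Fixing the order of the classes keeps the bars in the same position from one chart to the next, which makes charts for different keywords easier to compare.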

As you can see, the old method is time-consuming, and you have to repeat each step over
and over every time you want to analyse new tweets. One of the biggest drawbacks is that
you will have to set it up and organize everything yourself. However, Twitter sentiment
analysis can be much simpler and faster.

Note: We are using the automatic approach in our project.

CHAPTER – 6

IMPLEMENTATION
Setting Up Environment for Sentiment Analysis Using Python.

The following components are required to be downloaded and installed properly.

6.1. Python

Python is a high-level, interpreted programming language created by Guido van Rossum.
The language is popular for its code readability and compact code. It uses whitespace
indentation to delimit blocks.
Python provides a large standard library and is widely used for applications such as
natural language processing, machine learning, and data analysis.
It is favoured for complex projects because of its simplicity, diverse range of features,
and dynamic nature.

6.2. PyCharm
PyCharm is the IDE we are using to build our project.

Fig 6.2 PyCharm IDE

We are going to write the code for our Twitter sentiment analysis project, which consists
of a Home page, a Login page, and a User Registration page.

Fig 6.2.1: Login Page

Fig. 6.2.2: Registration Page

We are going to merge our different files with our main code so that it works smoothly,
without any manual intervention.

Fig 6.2.3: Home Page

6.3. XAMPP control panel.

Fig 6.3 XAMPP control panel

We will run our project on localhost. All user data will be stored in a MySQL database
that runs under XAMPP alongside the Apache server.

As soon as a user enters their credentials and logs in to their account, they get access
to analyse the sentiment of Twitter data.

CHAPTER – 7

CONCLUSION
Performing sentiment analysis on data obtained from Twitter is a huge challenge because
of the amount of ambiguity involved. The widespread use of slang, misspellings,
emoticons, etc. makes the automatic detection of emotions from tweets difficult.
This project is a small step towards the efficient automation of sentiment analysis
by focusing on ambiguous statements. The system proposed by us also attempts to extract
actual emotions from tweets. Such a system will be very useful for various marketing
teams to gain actual and detailed feedback from their users. At present, we have only
proposed a system to perform the extraction of emotions from ambiguous tweets. The
implementation has to be done and the system must be trained. At this stage, the project is
limited to product reviews aired by users on Twitter. In the future, the system can also be
extended to analyse sentiments about politics, finance and other affairs. Complete
removal of ambiguity is an uphill task indeed. Therefore, interpretation and classification
of sarcastic sentences are not a part of the current scope. However, in the future, the scope
can be extended to accommodate the same. Finally, the project can be extended to work
for natural languages other than English. We are still waiting for elevated access to the
Twitter API v2 in order to run our project.

Chapter – 7

REFERENCES
David Zimbra, M. Ghiassi and Sean Lee, “Brand-Related Twitter Sentiment Analysis
using Feature Engineering and the Dynamic Architecture for Artificial Neural
Networks”, IEEE 1530-1605, 2016.

Varsha Sahayak, Vijaya Shete and Apashabi Pathan, “Sentiment Analysis on Twitter
Data”, (IJIRAE) ISSN: 2349-2163, January 2015.

Peiman Barnaghi, John G. Breslin and Parsa Ghaffari, “Opinion Mining and Sentiment
Polarity on Twitter and Correlation between Events and Sentiment”, 2016 IEEE Second
International Conference on Big Data Computing Service and Applications.

Mondher Bouazizi and Tomoaki Ohtsuki, “Sentiment Analysis: from Binary to Multi-Class
Classification”, IEEE ICC 2016 SAC Social Networking, ISBN 978-1-4799-6664-6.

Nehal Mamgain, Ekta Mehta, Ankush Mittal and Gaurav Bhatt, “Sentiment Analysis of
Top Colleges in India Using Twitter Data”, (IEEE) ISBN -978-1-5090-0082-1, 2016.
