
DECLARATION

We are students of the B.Sc. in Computer Science and Engineering program at International Islamic
University Chittagong. We have submitted this thesis in partial fulfillment of the requirements for
the degree of Bachelor of Science in Computer Science and Engineering. We hereby declare that the
report titled “Real Time Sentiment Analysis and Opinion Mining on Refugee Crisis” was prepared and
completed entirely by us, that it is an original work, and that it has not been submitted elsewhere
for any other purpose.

_________________________
Sarjis Muhammad Abdullah
ID: C141098
Computer Science and Engineering, 38th batch.
International Islamic University Chittagong

_________________________
Didarul Karim Jewel
ID: C141113
Computer Science and Engineering, 38th batch.
International Islamic University Chittagong

_________________________
Abdullah Al Mahfuj
ID: C141093
Computer Science and Engineering, 38th batch.
International Islamic University Chittagong
TABLE OF CONTENTS

LIST OF ABBREVIATIONS
ACKNOWLEDGEMENT
ABSTRACT

Chapter 01: Introduction
1.1 Introduction
1.2 History
1.3 Motivation
1.4 Problem Statement
1.5 Objectives
1.6 Contributions
1.7 Summary

Chapter 02: Background of Sentiment Analysis
2.1 Introduction
2.2 What is Sentiment Analysis?
2.3 Why Sentiment Analysis is Important?
2.4 Reason of Hardness of Sentiment Analysis?
2.5 Sentiment Analysis (SA) & Natural Language processing (NLP)
2.6 Sentiment Analysis Models

Chapter 03: Literature Review
3.1 Literature Overview
3.2 Existing works on Sentiment Analysis:
3.3 Strong and Weak Point
3.4 Summary

Chapter 04: Environment Study
4.1 Environment Overview
4.2 Tools and Extension

Chapter 05: Methodology and Experiment
5.1 Methodology Overview
5.2 Methodology
5.3 Process Strategy
5.4 Summary

Chapter 06: Result
6.1 Visualizing NLP Results
6.2 Algorithmic Performance
6.3 Algorithmic Result Visualization
6.4 Comparison Table
6.5 Summary

Chapter 07: Conclusion
7.1 Discussion
7.2 Limitation
7.3 Future Work

References
LIST OF FIGURES
Figure 1 Methodology Diagram.
Figure 2 Process Diagram.
Figure 3 Data Collection Process Diagram.
Figure 4 Data pre-processing diagram.
Figure 5 Duplicate Tweets Removal diagram.
Figure 6 Symbol removal diagram.
Figure 7 Removal of stop words.
Figure 8 Sentiment classification operator’s diagram.
Figure 9 Sentiment classification diagram.
Figure 10 Visualization of polarity class.
Figure 11 Transformation of Attributes diagram.
Figure 12 Implementation diagram.
Figure 13 Overview of Polarity Class.
Figure 14 Overview of Subjectivity Class.
Figure 15 Subjectivity and Polarity Diagram.
Figure 16 English and Polarity Diagram.
Figure 17 Bangla and Polarity Diagram.
Figure 18 Turkish and Polarity Diagram.
Figure 19 Urdu and Polarity Diagram.
Figure 20 Chinese and Polarity Diagram.
Figure 21 Retweet Visualization Diagram.
Figure 22 Accuracy of predicting Polarity.
Figure 23 Recall, Precision and F-measure of Polarity Class.
LIST OF TABLES

Table 1 Confusion Matrix.
Table 2 Recall, Precision and F-measure of Polarity Class.
Table 3 Comparison Table of Accuracy.
Table 4 Comparison Table of Recall, Precision and F-Measure.
LIST OF ABBREVIATIONS
TM = Text Mining
NB = Naïve Bayes
SVM = Support Vector Machine
LSVM = Linear Support Vector Machine
DT = Decision Tree
RF = Random Forest
K-NN = K Nearest Neighbor
NN = Neural Network
SMLA = Supervised Machine Learning Algorithm
ML = Machine Learning
RC = Refugee Crisis
SaaS = Software as a Service
DL = Deep Learning
NLP = Natural Language Processing
SA = Sentiment Analysis
BOW = Bag-Of-Words
SO = Sentiment Orientation
POS = Part-Of-Speech
AI = Artificial Intelligence
NLTK = Natural Language Toolkit

ACKNOWLEDGEMENT

First of all, we are grateful to the almighty Allah, the merciful and the benevolent, who has
enabled us to complete this report.

The successful completion of this thesis and the preparation of this report would not have been
possible without the generous help of our supervisor, Md. Mahiuddin, Assistant Professor, CSE, IIUC.
We acknowledge his contribution and express our heartfelt gratitude for his valuable time, kind
encouragement, guidance, opinions, and views.

Last but not least, we would like to thank our senior brothers Mazharul Islam and Md. Taufeeq
Uddin, former students of CSE, IIUC, for their kind instructions.

Finally, we would like to thank all the people who shared their views about our work, provided us
with necessary information, and offered criticism. We express our heartfelt gratitude to all of them.

ABSTRACT
Twitter is a microblogging social site where millions of people share their thoughts, views, and
reactions and comment on particular subjects, expressing opinions or facts to gain the attention of
different groups of people. In this analysis and experimentation, we investigated public opinions,
facts, and sentiments on the Refugee Crisis, which has recently been a widely discussed topic on
social media platforms. To analyze public sentiment on this crisis, we extracted around 35,000
relevant tweets in five languages (English, Bangla, Turkish, Chinese, and Urdu) and used them for
sentiment analysis and decision mining with data mining and data science techniques. On this basis
we present a new way of performing real-time sentiment analysis on the Refugee Crisis to provide
predictions that may support political improvements. Through binomial classification into positive
and negative classes, this paper gives an end-level decision on how many comments support refugees
and how many are posted against them. Using supervised machine learning algorithms, namely DT, RF,
NB, and KNN, we were able to predict polarity and subjectivity. The KNN algorithm gave the best
polarity accuracy, 95%, compared with the DT, RF, and NB classifiers. People responsible for refugee
affairs can gain better knowledge from our analysis.

Keywords— refugee crisis, sentiment analysis, data science, machine learning, opinion mining,
prediction.

Chapter 01: Introduction

1.1 Introduction
The Refugee Crisis has recently become one of the biggest issues all over the world. The problem
originated many years ago in countries such as Afghanistan, South Sudan, and Somalia. More
recently, people from Myanmar and Syria have become the new victims, the so-called refugees. We are
living in the age of science, and science has given us many gifts; social media is one of them. We
therefore aim to combine science with social data by exploring the deeper meaning of social media
posts in a particular knowledge domain, so that the results can be helpful for public service.

In the present world, social media is one of the biggest platforms for expressing people’s opinions,
thoughts, views, intentions, reactions, and ideas. Among the many social media platforms, Twitter is
one of the most popular. In addition to its global coverage of issues, Twitter makes it easy to
share opinions in various content forms, including text, images, and links, subject to a character
restriction that many other social media platforms do not have. More than 300 million people use
Twitter all over the world. Statistics show that more than 6000 tweets per second, 350000 tweets per
minute, and 500 million tweets per day are posted on Twitter. These statistics show how significant
Twitter's impact is all over the world [1].

Twitter sentiment mining can therefore be helpful in different situations, such as analyzing
people’s comments on different events, products, movies, songs, etc. Natural language processing can
be helpful in these circumstances. A lot of research has already been done on Sentiment Analysis for
similar problems. We also performed such an analysis, adding some new steps to obtain smoother
results and better accuracy. In our proposed system we present a new way of performing real-time
sentiment analysis on the current Refugee Crisis to provide predictions of polarity types that may
support political improvements. The extracted and preprocessed datasets were used with various
opinion mining algorithms, such as Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), and
K Nearest Neighbor (KNN), implemented using Python packages and programming.

This paper gives an end-level decision on how many people comment in support of refugees and how
many comment against them; binomial classification is then done by classifying tweets as positive
or negative. Using supervised machine learning algorithms, namely DT, RF, NB, and KNN, we were able
to predict polarity. The KNN algorithm gave the best accuracy, 95%. People responsible for refugee
affairs can gain better knowledge from our analysis.

1.2 History
The Refugee Crisis can be defined in terms of groups of displaced persons, who may be either
internally displaced or migrated people. Internally Displaced Persons are people who were forced to
leave their homes but failed to reach a neighboring country, while Migrated People are persons who
left their homes and took shelter in another country.

Internally Displaced Persons do not fulfill the definition of a refugee in the 1951 Refugee
Convention, because they have not left their home country. But because the nature of war has changed
in the last few decades, the number of Internally Displaced Persons has increased. The United
Nations estimated in 2018 that 68.5 million people worldwide had been forcibly displaced. Of them,
25.4 million are refugees and 40 million are internally displaced [2].

85% of refugees took shelter in developed countries like Australia, USA, and France. Of them, 57%
come from Syria, Afghanistan, and South Sudan. Almost 3.5 million refugees took shelter in Turkey.
Most recently, more than 0.7 million Rohingya took shelter in Bangladesh to escape religious
oppression by the security forces of Myanmar [3].

Although millions of people have been suffering from this serious problem, we still cannot see any
effective decision or proper solution to this deeply embedded problem.

1.3 Motivation
In recent times researchers have done a huge amount of research on Sentiment Analysis and Opinion
Mining for different subjects and events. As social media has become a great source of data,
Sentiment Analysis or Opinion Mining can have more impact on community service.

At present the Refugee Crisis is one of the important matters in the international arena. The
problem emerged many years ago in countries such as Syria, Afghanistan, South Sudan, and Somalia.
It has not ended: countries such as Myanmar and Venezuela are its new victims. Millions of people
are suffering every day, but there is still no significant solution.

It is a kind of crisis in which people's opinions are very important. We therefore decided to
classify thoughts and opinions about this matter through NLP using machine learning algorithms,
aiming at a result that can be helpful for decision making in the political arena.

1.4 Problem Statement
Nowadays people are experimenting on various kinds of data, and new information and theories are
gathered from them. But analyzing the scientific papers domain is hard, because several features
affect the evaluation of sentiment reviews.

Previous researchers have used different analysis protocols to differentiate the sentiment classes.
The actual meaning and classification of a sentence is the most important part in decision making.
There is plenty of research on sentiment analysis but not much on the Refugee Crisis, and the
existing works on the Refugee Crisis focus more on comparing sentiment between people of different
countries. The drawback of the previous work is that it did not show any effective result. So we
tried to carry out that research more efficiently and precisely by measuring prediction accuracy
using MLA (Machine Learning Algorithm).

1.5 Objectives
The objective of our paper is to understand people’s emotions and opinions about the Refugee Crisis
and to predict polarity. Polarity may be supportive of refugees or against them; it is decided from
the tweets after extracting different features from them and analyzing them by natural language
processing with the help of existing machine learning algorithms.

• To analyze people’s emotions and opinions about the Refugee Crisis.
• To extract the positive and negative sentiment about the Refugee Crisis.
• To train the machine by extracting features from our data so that it can predict what people
think about refugees.

1.6 Contributions
Our contributions in completing this thesis are as follows:

• We achieved a more accurate prediction model than the related work.
• We explored supporting and non-supporting tweets.
• We manually labeled the neutral tweets as positive or negative (because in a crisis neutral
opinions are redundant).
• We added new attributes to our dataset for better results.
• We worked with English, Turkish, Chinese, Urdu, and Bangla tweets.
• We exhibited the comparative performance of the Naïve Bayes, K Nearest Neighbor, Decision Tree,
and Random Forest classifiers.

1.7 Summary
In this chapter we described our thesis objectives, our contributions, a short statement of the
problem domain, the motivation behind our thesis, and a brief historical background of the Refugee
Crisis.

Chapter 02: Background of Sentiment Analysis

2.1 Introduction
Sentiment Analysis and Text Mining is the process of analyzing, experimenting on, and testing
people's opinions after collecting data in text form. The following sections discuss it briefly.

2.2 What is Sentiment Analysis?


2.2.1 What is Sentiment?
Sentiments can be defined as emotions, opinions, feelings, or ideas expressed by people. There are
two types of sentiment information: facts and opinions. A fact is simply a report about a matter or
event and is called objective; an opinion is someone’s view, thought, intention, or idea about an
event or topic and is called subjective.

2.2.2 What is Sentiment Analysis?


Sentiment Analysis is the process of analyzing, experimenting on, and testing people's opinions or
facts after the data has been collected and pre-processed in text form. It is also known as opinion
mining, emotion gathering, decision mining, or idea generation regarding a person's view of a
particular matter, which can be subjective or objective [4].

Subjective means that the text is the person's own opinion, while objective means that the
intention is to share only the facts of an incident. Subjective sentences are of three different
types:
• References to private states. For example: “He was taking with fear.”
• References to speech events expressing private states. For example: “The editors of the New
Age paper attacked the police officer.”
• Expressive subjective elements. For example: “That government is brilliant.”

2.3 Why Sentiment Analysis is Important?
There are billions of online users of Facebook, Twitter, WhatsApp, Google+, Skype, Sina Weibo,
Instagram, etc., where they express their reactions, opinions, thoughts, ideas, and views about
different events and topics. That is why daily online sentiment has become one of the most important
resources in decision making. A survey conducted by World Research reported on the percentage of
people who trust online customer reviews as much as personal recommendations.

According to one study, nearly 95% of shoppers read online reviews before making a purchase (Spiegel
Research Center, 2017); this research also showed that displaying reviews can increase sales rates
by 270%. According to a 2016 study (Fan and Fuel), 94% of customers read online reviews and
97% of shoppers say reviews influence buying decisions [5].

2.4 Reason of Hardness of Sentiment Analysis?


Sentiment analysis is one of the hardest problems for computers. Identifying some entities and
features is difficult or even impossible for machines, while it is comparatively easy for human
beings. Some situations that are intractable for computers are given below:

2.4.1 Dealing with Sarcasm


It is difficult for a machine to understand that a sentence means the opposite of what it says.
Sometimes sarcasm can be recognized through a special algorithm, but depending on such an algorithm
is not reliable.

2.4.2 Dealing with Strength of an Opinion


Dealing with the strength of a particular opinion or review can make classification difficult for
computers. Opinions have different strengths. Some of them are very strong: “This war should be
stopped” and some of them are weak: “I think this war can be stopped” [6].

Although a dictionary of weak or strong opinion words can be created depending on the application's
needs, computers still struggle when the strength of an opinion interacts with the position of that
opinion in the document, because in many cases this completely changes the polarity of the document.

2.5 Sentiment Analysis (SA) & Natural Language processing (NLP)
2.5.1 Overview
This section describes some important definitions from sentiment analysis terminology.

2.5.1.1 Natural language processing (NLP)


NLP is an area of computer science and AI concerned with the interactions between humans and
computers. The field of Human Computer Interaction (HCI) is related to NLP. Understanding and
interacting in natural language is a challenging job for NLP, because the computer must extract
meaning from human (natural) language given as input [7]. NLP has developed a wide set of programmed
tactics and techniques for automated generation, for analysis of the deeper meaning of sentences,
and for interaction with humans. NLP inherits many techniques from AI, so newer domains are affected
as well. We need to discuss some very basic definitions and terminology in order to understand NLP
models and techniques.

2.5.1.2 Token
Before any real processing can be done on the input text, it needs to be segmented into linguistic
units such as words, punctuation, numbers, or alphanumerics. These units are known as tokens.

2.5.1.3 Sentence
This refers to an ordered sequence of tokens.

2.5.1.4 Tokenization
Tokenization is defined as the operation of splitting a sentence into tokens. For English, which is
a segmented language, tokenization is relatively easy because of the existence of whitespace.
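
As an illustration of tokenization, the following minimal Python sketch splits an English sentence into word, number, hashtag, and punctuation tokens with a regular expression. The pattern and the function name tokenize are assumptions made for this example only; they are not the tokenizer used in the thesis.

```python
import re

# Hypothetical pattern: hashtags or mentions, words (with an optional
# apostrophe part such as "don't"), or a single punctuation character.
TOKEN_PATTERN = re.compile(r"[#@]\w+|\w+(?:'\w+)?|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into a list of tokens."""
    return TOKEN_PATTERN.findall(sentence)


tokens = tokenize("Refugees need shelter, food and #support!")
print(tokens)
```

Splitting on whitespace alone would leave punctuation attached to words ("shelter,"), which is why the sketch treats each punctuation mark as its own token.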

2.5.1.5 Corpus
A corpus is usually a large collection of sentences, documents, blog data, or website data; it
simply means a body of text.

2.5.1.6 Part-of-speech (POS) Tag
A POS tag is a symbol representing a word class; a word can be categorized into one or more classes
such as NN (Noun), VB (Verb), JJ (Adjective), and AT (Article). One of the oldest and most commonly
used tag sets is the Brown Corpus tag set.

2.5.1.7 Parse Tree


A parse tree is a tree that represents the structure of a sentence. Given the essential terminology
and a formal grammar, it captures the syntactic structure of the given sentence.

2.5.2 Some Common NLP Tasks:


2.5.2.1 Computational Morphology
Morphemes or stems are the basic building blocks from which the large number of words in a natural
language are constructed. Computational morphology uses computer systems to discover and analyze
the internal structure of words.

2.5.2.2 Parsing
In the parsing task, a parser builds the parse tree for a given sentence. Some parsers assume the
existence of a set of grammar rules in order to parse, but recent parsers are smart enough to deduce
parse trees directly from the given data using complex statistical models. Most parsers also operate
in a supervised setting and require the sentence to be POS-tagged before it can be parsed.
Statistical parsing is an area of active research in NLP.

2.5.2.3 Machine Translation (MT)


In machine translation, the goal is to translate a given text from one natural language to another
without human involvement. This is one of the most difficult tasks in NLP and has been tackled in
many different ways over the years. Almost all MT approaches use POS tagging and parsing as
preliminary steps.

2.5.2.4 Subjective Sentence
When a writer or user expresses their own reviews, feelings, or sentiments regarding incidents,
entities, or events, the sentence is called a subjective sentence. For example: “I like to give
shelter for refugee”.

2.5.2.5 Objective Sentence


When a writer or user describes incidents, entities, or events as facts, the sentence is called an
objective sentence. For example: “Rohingya and Syria crisis made millions of people helpless and”

2.5.2.6 Opinion
An opinion is a belief or judgment based on knowledge of a specific topic. Sometimes an opinion is
stated explicitly, for example: “Refugees are facing dangerous situation of their life.” Sometimes
it is hidden in the sentiment of a sentence, for example: “Current refugee problem has no solution
yet.” The polarity class is determined by the positivity or negativity of an opinion. Determining
the polarity of each subjective sentence is one of the main subtasks of sentiment analysis and
opinion mining when the polarity of a text is to be determined in detail.

2.5.2.7 Opinion words


Opinion words are words that are commonly used to express positive or negative sentiment. For
example: {Support, pretty, love} express positive sentiment; {Crisis, stop, hate} express negative
sentiment.

2.5.2.8 Sentiment Orientation


Sentiment orientation indicates whether the opinion expressed by opinion words is positive or
negative. For example: "The government good decision for staying them with us" is positive.
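
To make the link between opinion words and sentiment orientation concrete, the sketch below counts positive and negative opinion words in a sentence and reports the resulting orientation. The two word sets reuse the example words from section 2.5.2.7; the function name sentiment_orientation and the "neutral" fallback for ties are assumptions of this illustration, not the labeling procedure of the thesis.

```python
# Tiny illustrative lexicons built from the example opinion words.
POSITIVE_WORDS = {"support", "pretty", "love"}
NEGATIVE_WORDS = {"crisis", "stop", "hate"}


def sentiment_orientation(sentence):
    """Return 'positive', 'negative', or 'neutral' from opinion-word counts."""
    # Lowercase each word and strip surrounding punctuation before lookup.
    words = [word.strip(".,!?\"'").lower() for word in sentence.split()]
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"


print(sentiment_orientation("We love and support refugees."))
print(sentiment_orientation("I hate this crisis!"))
```

A pure word-count rule like this ignores negation and sarcasm, which is one reason trained classifiers are used instead of a fixed lexicon.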

2.5.2.9 Opinion Sentence


An opinion sentence is a sentence that contains one or more opinion words. For example: "The
government policy was amazing as they were facing political harassment and our government showed
humanity."

2.5.2.10 Object / Features


For example: "A state wants to begin Refugee Crisis for showing off their heroism."

Object: heroism
Explicit object feature: show off heroism

Here the word "crisis" is an objective word.

In this example the feature is stated explicitly, but sometimes object features have to be inferred
from the sentence. This kind of feature is called an "implicit feature".

For example: "The crisis is going dangerously"

Object: crisis
Implicit feature: dangerous
Opinion word: dangerously

2.5.3 Classifier
A classifier helps us separate our dataset into categories. When a classifier receives some data,
it decides which category the data belongs to. To classify something, it needs various features that
identify the appropriate category.

It is a function that classifies different objects and outputs a label for each. In sentiment
analysis, classifiers are used to find the polarity of a subjective sentence in the given data.
There are two types of classification. In supervised classification, a classifier is learned from
the training set, and the classification algorithm can then predict a label (positive or negative)
for any pre-processed input data. In contrast, unsupervised classification infers the hidden
structure of raw data. Both classification types are widely used in sentiment analysis. The main
task of Sentiment Analysis is extracting suitable features and constructing an engineered feature
vector as input for the classifier.
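
One common way to build such a feature vector is a bag-of-words count vector. The following minimal sketch shows the idea on three toy documents; the helper names build_vocabulary and to_feature_vector and the sample texts are hypothetical, and the thesis may have used different features.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct lowercase word to a fixed column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_feature_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words unseen during training are ignored
            vector[vocabulary[word]] = count
    return vector


docs = ["support refugees", "stop the crisis", "support support peace"]
vocabulary = build_vocabulary(docs)
vectors = [to_feature_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each document becomes a list of the same length, so any of the classifiers described below can take these vectors as input.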

2.5.3.1 Naïve Bayes


Bayesian methods are those that apply Bayes' theorem to probabilistic problems such as
classification and regression. A popular Bayesian algorithm, the Gaussian Naive Bayes classifier,
is often used for text classification in different domains. It is a simple probabilistic classifier
that applies Bayes' theorem to predict the class of a new document.

\[ P(X \mid \text{polarity}) = \frac{P(X) \times P(\text{polarity} \mid X)}{P(\text{polarity})} \]

\[ P(X \mid \text{subjectivity}) = \frac{P(X) \times P(\text{subjectivity} \mid X)}{P(\text{subjectivity})} \]

We used the two relations above in our model. Compared with other classifiers, a Naive Bayes model
is efficient because it can work with a small training dataset. GaussianNB implements the Gaussian
Naive Bayes algorithm for classification [8] [9].
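
The idea behind Gaussian Naive Bayes can be sketched in plain Python: each class gets a prior plus a per-feature mean and variance, and a new sample is assigned to the class with the highest log posterior. This is an illustrative re-implementation, not the GaussianNB code used in the thesis; the two-feature toy data (counts of positive and negative words per tweet) and the variance_floor parameter are assumptions.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels, variance_floor=1e-9):
    """Estimate log prior, feature means, and feature variances per class."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(column) / n for column in columns]
        # The small floor keeps a variance from being exactly zero.
        variances = [
            sum((value - mean) ** 2 for value in column) / n + variance_floor
            for column, mean in zip(columns, means)
        ]
        model[label] = (math.log(n / len(samples)), means, variances)
    return model


def predict_gaussian_nb(model, features):
    """Return the class with the highest log prior plus log likelihood."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
            for value, mean, var in zip(features, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)


# Hypothetical features per tweet: [positive word count, negative word count].
train_x = [[3, 0], [2, 1], [4, 0], [0, 3], [1, 2], [0, 4]]
train_y = ["positive"] * 3 + ["negative"] * 3
nb_model = fit_gaussian_nb(train_x, train_y)
print(predict_gaussian_nb(nb_model, [3, 1]))
print(predict_gaussian_nb(nb_model, [0, 2]))
```

The "naive" part is the sum over features in the log likelihood, which treats every feature as independent given the class.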

2.5.3.2 Support Vector Machine


The SVM algorithm is a traditional algorithm that is often used for text classification and performs well in different domains. It was originally developed for binary classification; however, it has been applied successfully in many applications, such as in [10].
In this study, the linear Support Vector Machine algorithm is used for sentiment classification. It uses a hyperplane, represented by a weight vector, that separates the negative and positive training vectors with maximum margin. In this experiment, we implemented the SVM classifier in Python [9].
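
The maximum-margin idea can be written down compactly. The block below is the textbook hard-margin formulation, given only as a sketch: the symbols $w$, $b$, $x_i$, $y_i$ and $N$ are notation introduced here, not taken from this thesis, and practical implementations usually solve a soft-margin variant that tolerates some misclassified points.

```latex
% Decision rule: the side of the hyperplane w^T x + b = 0 gives the class.
f(x) = \operatorname{sign}\left(w^{\top} x + b\right)

% Training: labels y_i \in \{-1, +1\}; the margin width is 2 / \lVert w \rVert,
% so maximizing the margin is equivalent to minimizing \lVert w \rVert^2.
\min_{w,\, b} \; \tfrac{1}{2} \lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left(w^{\top} x_i + b\right) \ge 1, \qquad i = 1, \dots, N
```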

2.5.3.3 K Nearest Neighbor


KNN is an algorithm used for non-linear classification. For a new data point, it takes the k nearest neighbors (five in our setting) and measures distance using a metric such as the Euclidean or Minkowski distance [11].
The k-Nearest Neighbor classifier is often called the simplest of all machine learning algorithms. The training dataset is stored in an n-dimensional pattern space, and the algorithm searches that space for the k training points closest to a given point x. The algorithm is used for both classification and regression. For classification, it assigns the class chosen by a majority vote of the k nearest neighbors, where k is a positive integer. When k = 1, the point is assigned the class of its single nearest neighbor. When k = 5, the algorithm computes the distances to the 5 nearest points and assigns the majority class among those 5 examples. We used the Refugee Crisis dataset as the training dataset and implemented the algorithm using the K-NN class in Python as a classification model [9].
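
The majority-vote rule can be sketched in a few lines of plain Python. This is only an illustration with Euclidean distance and made-up two-dimensional points; the function name `knn_predict` and the toy data are assumptions, not the code used in our experiment.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Classify x by a majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to x with its label.
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training points forming two well-separated groups.
train_X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
train_y = ["support"] * 4 + ["against"] * 4

label_k5 = knn_predict(train_X, train_y, [0.5, 0.5], k=5)
label_k1 = knn_predict(train_X, train_y, [4, 4], k=1)
```

With k = 5 the query near the origin has four "support" neighbors and one "against" neighbor, so the vote decides the class; with k = 1 only the single closest point matters.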

2.5.3.4 Decision Tree


The Decision Tree classifier is a simple and widely used classification technique. It applies a straightforward idea to solve the classification problem. In theory, we build the decision tree after calculating the class information, the entropy of each attribute and the gain using the three formulas below [12]. The information I(p, n) is calculated only for the class attribute, where p and n are the numbers of positive and negative examples. The entropy E(a) is calculated for each attribute a, whose v distinct values split the data into subsets containing p_i positive and n_i negative examples. Finally, the gain G(a) is calculated to select the root node of the tree.

$$\text{Information Gain, } I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

$$\text{Entropy, } E(a) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \times I(p_i, n_i)$$

$$\text{Gain, } G(a) = I(p, n) - E(a)$$

The Decision Tree classifier poses a series of carefully crafted questions about the attributes of the test record. Each time it receives an answer, it asks a follow-up question until it reaches a conclusion about the class label of the record. In the decision tree, the root and internal nodes contain attribute test conditions that separate records with different characteristics.
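
The three quantities I(p, n), E(a) and G(a) can be computed directly from class counts. The sketch below is illustrative only: the function names and the `splits` argument (a list of (p_i, n_i) pairs, one per attribute value) are assumptions introduced for this example.

```python
import math

def information(p, n):
    """I(p, n): expected information of a set with p positive, n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # the term 0 * log2(0) is treated as 0
            fraction = count / total
            result -= fraction * math.log2(fraction)
    return result

def attribute_entropy(splits):
    """E(a): weighted information of the subsets produced by attribute a."""
    p = sum(p_i for p_i, _ in splits)
    n = sum(n_i for _, n_i in splits)
    return sum((p_i + n_i) / (p + n) * information(p_i, n_i)
               for p_i, n_i in splits)

def gain(splits):
    """G(a) = I(p, n) - E(a)."""
    p = sum(p_i for p_i, _ in splits)
    n = sum(n_i for _, n_i in splits)
    return information(p, n) - attribute_entropy(splits)

# An attribute that separates the classes perfectly versus one that does not.
perfect_gain = gain([(4, 0), (0, 4)])
useless_gain = gain([(2, 2), (2, 2)])
```

The attribute with the largest gain is the one chosen for the root node; the same calculation is repeated on each subset to grow the tree.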

2.5.3.5 Random Forest Classifier


The Random Forest classifier is also known as a voting machine. It builds a set of decision trees, each from a randomly selected subset of the training set. The n_estimators parameter sets the number of trees. If n_estimators is 300, the classifier builds 300 decision trees and combines their outputs; that is, it aggregates the votes from the individual decision trees to decide the final class of the test object [13].
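
The two ingredients described above, random subsets and vote aggregation, can be sketched as follows. Sampling with replacement (a bootstrap sample) is an assumption of this example, as are the function names and the made-up vote counts; the trees themselves are omitted.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement to train one tree."""
    return [rng.choice(rows) for _ in rows]

def forest_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_sample(list(range(10)), rng)

# Hypothetical outputs of 300 trees for a single test tweet.
votes = ["support"] * 180 + ["against"] * 120
final_class = forest_vote(votes)
```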

2.5.4 Text Mining
Text Mining (TM) is the process of collecting important and necessary information from unstructured text. TM identifies facts, relationships and statements that would otherwise remain buried in the mass of textual big data. These facts, relationships and statements are extracted and turned into structured data for analysis and visualization.

Typical TM tasks include sentiment analysis, text categorization, text clustering, document summarization and concept/entity extraction. Text mining is a field of computer science that has a strong connection with NLP, DM, ML, information retrieval and knowledge management [14]. TM is much more effective than traditional keyword search. Traditional keyword search retrieves every document that contains the specified keywords. The drawback is that we need to read all those documents to find out whether they actually contain relevant information. Text mining software is very different, because it reads and analyzes the documents on our behalf. It works at the word level, sentence level and document level.

There are two categories of texts:


i. Transcripts of communication or Verbal recordings.
ii. Outputs of communication or messages produced by user.

2.6 Sentiment Analysis Models


Traditionally, sentiment classification of text relies on a dataset and a classifier. The classifier is applied to divide the documents into two sets, positive and negative, in order to find the representation that best captures what is important for sentiment analysis.

2.6.1 Bag of Words (BOW)


The bag-of-words model is one of the most popular representation methods for object categorization [15]. A document is represented as a collection of words, disregarding grammatical structure and even word order. In this model, a dictionary is built from the training data during analysis and is then used to distinguish between positive and negative documents in the testing procedure.

For example, suppose we have the two documents below:
1) Rohingya crisis started from many years before.
2) Crisis should be stopped.
The dictionary constructed by BOW will be:
Dictionary = {(1: “Rohingya”), (2: “crisis”), (3: “started”), (4: “from”), (5: “many”), (6: “years”), (7: “before”), (8: “should”), (9: “be”), (10: “stopped”)}
Hence, the feature vector of each document has 10 dimensions, one per dictionary entry. As discussed for sentiment analysis, word frequency is very informative. In NLP, a single word can often express the author’s opinion more clearly than a whole sequence of words. For example, in the sentence below, only the words “like” and “approach” indicate the polarity, while the sentence as a whole has positive polarity.
“I like the way of Myanmar government’s approach”.
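
The dictionary construction and the resulting 10-dimensional feature vectors can be reproduced with a short sketch. The tokenizer (lower-casing and keeping alphabetic runs) and the 0-based indices are assumptions of this example; the dictionary in the text is numbered from 1.

```python
import re

def tokenize(text):
    """Lower-case the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())

def build_dictionary(documents):
    """Assign each new word the next free index, in order of appearance."""
    dictionary = {}
    for doc in documents:
        for word in tokenize(doc):
            if word not in dictionary:
                dictionary[word] = len(dictionary)  # 0-based index
    return dictionary

def to_vector(text, dictionary):
    """Count how often each dictionary word occurs in the text."""
    vector = [0] * len(dictionary)
    for word in tokenize(text):
        if word in dictionary:
            vector[dictionary[word]] += 1
    return vector

docs = ["Rohingya crisis started from many years before.",
        "Crisis should be stopped."]
dictionary = build_dictionary(docs)
vectors = [to_vector(doc, dictionary) for doc in docs]
```

Both documents share only the word "crisis", so their vectors overlap in exactly one position.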

2.6.2 Part-Of-Speech (POS) Tagging


A natural language processing task can automatically assign POS tags to each word in a sentence. For example, given the sentence "The crisis is dangerous", the output of a POS tagger would be The/AT crisis/NN is/VB dangerous/AJ. State-of-the-art POS taggers can reach an accuracy as high as 96%. Tagging text with parts of speech turns out to be very beneficial for more complicated NLP tasks such as parsing and machine translation.

Chapter 03: Literature Review

3.1 Literature Overview


Having described the background, aim, contribution and objective of our research, we now review the literature related to our work. Below we discuss several research papers related to sentiment analysis. For each, we first identify the paper and then summarize its work.

3.2 Existing works on Sentiment Analysis:

In [16], the authors used supervised learning algorithms to find the polarity of student feedback based on pre-defined features of teaching and learning. For this work they gathered student feedback data from survey results at Middle East College, Oman. They explain the implementation step by step using the analytics tool RapidMiner. The paper also presents a comparative performance study of the SVM, NB, KNN and NN classifiers. They compared the results to find the best outcome for the different algorithms with respect to different features. Their results show that KNN achieved the best precision, 100%, while NB achieved the best recall and accuracy, 97.07% and 99.11% respectively.

In [17], the authors used a text analysis tool to collect tweets in the Hindi language. They analyzed 42,235 tweets that referenced various political parties in India during the campaigning period of the 2016 elections. They used both supervised and unsupervised techniques: Dictionary Based, NB and SVM algorithms classified the data as positive, negative or neutral.
In their work, the predicted winner was the BJP according to NB, the BJP according to SVM and the Indian National Congress according to the Dictionary Approach. They predicted that the BJP had a 78.4% chance of winning the elections because of the positive sentiment it received in tweets. When the election was held, the BJP won 60 out of 126 constituencies and the NC won only 26 out of 126 constituencies. The NB algorithm gave them an accuracy of 62.1%.

In [18], the authors experimented with extracting sentiment from Twitter, a popular website where people post their views and opinions. They performed sentiment analysis on tweets related to movie reviews. They used Hadoop tools to process the data available on Twitter in the form of people’s reviews, feedback and comments. They presented the results of their sentiment analysis in separate sections for positive, negative and neutral sentiments.

In [19], the authors analyze public opinions and sentiments towards the Syrian refugee crisis. They collected tweets about the crisis in two languages, Turkish and English. They considered Turkish tweets because Turkey has sheltered a huge number of refugees, so Turkish tweets carry important information about public perception and views. They performed a comparative SA of the retrieved tweets. Their results showed a significant difference between the sentiments of Turkish tweets and those of English tweets. The paper found more positive sentiment towards Syrians and refugees in Turkish tweets, whereas the largest share of English tweets was neutral.

In [20], the authors focus on analyzing customer feedback on SaaS products by predicting reviewers’ attitudes. The goal of their paper was to predict the sentiment of SaaS customer reviews. They proposed five techniques based on five algorithms, namely SVM, NB, NB (Kernel), KNN and DT, to predict the attitude of SaaS reviews. In their experiment, the SVM algorithm achieved 92.37% accuracy, which showed that it gives better results on the sentiment of online reviews than the other models. Their work provides important insight into online SaaS reviews, which can help in the design of SaaS review websites.

Recently, most research on opinion mining of online users has used data such as tweets, reviews, blogs and comments. In [21], the authors used SentiWordNet to perform opinion mining on newspaper headlines. They separated the adverb-adjective combinations that exist in the statements and analyzed the part-of-speech tags of the news headlines. During their research they used Python packages to classify words, and SentiWordNet 3.0 to find the polarity (positive and negative) of each word. In this way they evaluated the impact of a news headline by measuring its total positive and negative polarity.

In [22], the authors proposed an approach in which heterogeneous features, both machine learning based and lexicon based, are combined with classification algorithms such as NB and LSVM to build the system model. With their proposed heterogeneous features and hybrid approach they obtained better sentiment accuracy than other approaches. These heterogeneous features can be used to build more advanced and more accurate models using DL.

In their first experiment they used a training set of 250 and a testing set of 100, and obtained 89% for NB and 76% for SVM. In a second experiment they used a training set of 300 and a testing set of 150, and obtained 84% for NB and 79% for SVM.

In [23], the authors worked on a dataset of tweets about 6 major US airlines and performed a multi-class sentiment analysis. They start with pre-processing techniques to clean the tweets and then represent the tweets as vectors using a deep learning concept to perform a phrase-level analysis. They used 7 different classification strategies: DT, RF, SVM, KNN, LR, NB and AdaBoost. They trained on 80% of their data and tested on the remaining 20%. They used 3 sentiment classes (positive, negative and neutral). Based on the results, they calculated the accuracies to compare the classification approaches, and they visualized the overall sentiment count for all six airlines combined.

3.3 Strong and Weak Point
The strong points of the existing work on sentiment analysis are as follows:

• In many experiments, researchers built their own systems to work with Text Mining and Sentiment Analysis.
• From their experience we learned that big data can be handled using different tools.
• Previous works used classifiers for fraud detection, face recognition and election prediction.
• Researchers pointed to RapidMiner, a data science environment that is a suitable gateway for working with data.
• Existing research can help to find the best approach to marketing and reviewing.
• Some research worked with 5 classes of polarity.

Although we noticed many advantages in the existing research, we also found some drawbacks in the existing environments and systems:
• It is difficult for a machine to find the accurate sentiment of a text every time. Sometimes the prediction may not be correct.
• Most existing research works with smaller datasets, which can be handled without difficulty.
• When performing Sentiment Analysis, some works considered only the positivity of the text and treated the rest of the data as negative.
• The classification results they obtained can be improved.

3.4 Summary
In this chapter we discussed several existing papers on sentiment analysis and summarized their work. We also discussed the strong and weak points of the existing work.

Chapter 04: Environment Study

4.1 Environment Overview


In this chapter we discuss the classifiers, tools and techniques we used to model our proposed work. The chapter consists of a brief discussion and a system overview.

4.1.1 Motivation
We were motivated to work with classification algorithms and the RapidMiner tools for one reason: to handle a large amount of data with the help of classifiers. We could have completed our analysis with Python and R programming alone, but that would have been a slow process for our experiment. With a single tool, we could collect data, pre-process it, access a natural language processing platform and apply classifiers, so we chose these tools to obtain our desired result. Although we evaluated algorithmic performance with a Python program, we first experimented on our dataset with the RapidMiner tools.

4.2 Tools and Extension


4.2.1 Why we used Classifier? How we relate classifier to our model?
A classifier can take text as input, analyze its content and automatically assign it to the relevant class. Machine learning classifiers are generally used to classify tweets into different classes based on various attributes.
In our analysis we also used classification algorithms to obtain such results: we split the data into training and testing sets and added some new attributes as features.

Instead of relying on manually crafted rules, text classification with machine learning learns to make classifications based on past observations. By using pre-labeled examples as training data, a machine learning algorithm can learn the associations between pieces of tweets and the output that is expected for a particular input.

The first step towards training a classifier with machine learning is feature extraction: a method is
used to transform each text into a numerical representation in the form of a vector. One of the most
frequently used approaches is bag of words, where a vector represents the frequency of a word in
a predefined dictionary of words.
Then, the machine learning algorithm is fed with training data that consists of pairs of feature sets
and tags to produce a classification model:

4.2.1.1 Training process in Text Classification


Once it’s trained with enough training samples, the machine learning model can begin to make
accurate predictions. The same feature extractor is used to transform unseen text to feature sets
which can be fed into the classification model to get predictions on tags (e.g. sports, politics):

4.2.1.2 Prediction process in Text Classification


Text classification with machine learning is usually much more accurate than human-crafted rule
systems, especially on complex classification tasks. Also, classifiers with machine learning are
easier to maintain and you can always tag new examples to learn new tasks.

4.2.2 RapidMiner tools:


RapidMiner is a machine learning and data science environment, also described as a toolset, for performing different mathematical operations easily. Data preparation, text mining, machine learning and more are integrated in this environment. Its vendor provides extensive guidelines and tutorials to help users [24].

Any data science project needs many data pre-processing steps. With a smaller dataset, we could easily complete them using free and open source tools. We chose the RapidMiner tools to carry out our data pre-processing steps, and we also found that RM is one of the most reliable working tools for data scientists.

We collected around 35k rows of Twitter data; extracting tweets while following the Twitter community authorization guideline is not an easy task.

4.2.3 Aylien Tap
We were searching for a tool that combines an NLP platform with everything described in the previous section. Because we were working with the RapidMiner software, we searched its marketplace for a text analysis extension. Sentiment analysis can be done in various ways, but without the AYLIEN extension we might not have obtained good results, because our dataset is large, containing 35k rows of data. Not using it would therefore have been unwise. It is not a free extension. We were able to evaluate one thousand texts per day by showing our university identification and educational certificate.

4.2.3.1 Credentials and Connecting


The first thing we needed to do before analyzing text was to connect to the AYLIEN API. To connect to the AYLIEN API we needed an App ID and an API Key. We therefore created a new connection of type "AYLIEN Text Analysis Connection" and added our credentials, the App ID and API Key.

4.2.3.2 Analysis phases


Extracting sentiment from a tweet can give us valuable insight about its authors and publishers. A tweet may express the author's own opinion or an opinion of others that the author has shared. Its tone can be classified as positive, neutral or negative, and the tweet is either subjective, meaning it reflects the author's opinion, or objective, meaning it expresses a fact.

4.2.3.3 Mode parameter


Before starting our AYLIEN Sentiment Analysis process, we set the mode parameter to ‘tweet’ to match our input text, and we set the input attribute to Text because the tweets are in the Text column of the Excel file. The following example shows how the AYLIEN NLP platform performed.
{
"text": "Rohingya Killing should be stopped.",
"polarity": "negative",
"subjectivity": "subjective",
"polarity confidence": 0.7,
"subjectivity confidence": 0.7
}
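
A record of this shape can be read in Python with the standard `json` module. The sketch below is only an illustration: the field names are copied from the example above rather than from the AYLIEN API documentation, and the confidence threshold of 0.6 is a hypothetical value, not one used in our experiment.

```python
import json

record_text = """
{
  "text": "Rohingya Killing should be stopped.",
  "polarity": "negative",
  "subjectivity": "subjective",
  "polarity confidence": 0.7,
  "subjectivity confidence": 0.7
}
"""

record = json.loads(record_text)

def is_confident(record, threshold=0.6):
    """Check that both confidence scores reach the (hypothetical) threshold."""
    return (record["polarity confidence"] >= threshold
            and record["subjectivity confidence"] >= threshold)

polarity = record["polarity"]
confident = is_confident(record)
```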
4.2.4 Spyder
Spyder is the Scientific Python Development Environment (SPDE). It is an open source IDE included in the Anaconda distribution. It gave us the opportunity to write, test and debug our modules in Python.

4.2.4.1 Python
Python is a high-level programming language in which we implemented all the machine learning algorithms of our thesis in an object-oriented way. It provided us with the different packages needed to produce our final results.

4.2.5 Microsoft Product


4.2.5.1 Microsoft Excel
Normalization is one of the most important parts of our process. We added the required features as strings, which were then normalized to numerical values. We did this with Microsoft Excel 2016.

4.2.5.2 Microsoft Word


The complete report was written with Microsoft Word 2016.

4.2.5.3 Microsoft Power Point


We used many figures and diagrams in our project, which were drawn with Microsoft Power Point 2016.

4.2.6 Why we used Google Translator?


We knew that Google Translator can produce poor results. We agree that in some cases it generates poor translations, but in our experience it does not change the core meaning of a sentence to the extent that a tweet's underlying sentiment flips from positive to negative or vice versa. Moreover, for finding the positivity or negativity of tweets written by people from different countries, Google Translator was the best performer available to us.

Chapter 05: Methodology and Experiment
5.1 Methodology Overview
Methodology describes how our work was implemented and modeled; it is an elaborated description of the strategy we followed, together with diagrams. It includes our strategy, procedure, scheme and process.

5.2 Methodology
In the following part, we describe our methodology and the overall experimental procedure in sequence.
1. First, we created a developer account on Twitter.
2. After the account was created, Twitter authorized us to collect tweets using the Twitter API.
3. Using the required queries, we received tweets in five languages: Bangla, Turkish, Urdu, Chinese and English.
4. We then converted the Bangla, Turkish, Urdu and Chinese tweets to English text.
5. After collection and conversion, we performed data pre-processing on the collected tweets.
6. Then we converted all tweets to lower case.
7. We added attributes to the text.
8. Then we used the AYLIEN NLP analyzer to find the polarity of the tweets: positive, neutral or negative.
9. We defined the support and against classes for polarity, and the opinion and fact classes for subjectivity.
10. Then we added sentiment attributes to get better prediction results from the classification algorithms.
11. After that, we encoded our string data into numeric form for the numerical representation of our dataset.
12. Then we split the dataset into training and testing data.
13. After that, we trained a model on the training data using the KNN algorithm.
14. Then we tested the trained model on the testing data to find its accuracy.
15. We repeated steps 13 and 14 for the DT, NB, SVM and RF supervised ML algorithms.
16. We presented the results in terms of Recall, Precision and F-measure.

The following diagram shows our strategy:

Figure 1 Methodology Diagram.

5.3 Process Strategy
In our strategy, the process is one of the sub-parts of our methodology. Our process combines data collection, data pre-processing, classification of the data into different class levels and the implementation technique.

The following is the diagram of our process strategy:

Figure 2 Process Diagram.

5.3.1 Data Collection


5.3.1.1 Overview
Collecting data from a social blogging site was one of the hardest jobs we faced, though it was interesting too. There are many ways to extract or collect social media data.

5.3.1.2 Motivation
We planned to work with real-time data. Such data is available, but in an unstructured form. We ensured that we were not collecting corpus data. We needed to work with a document-formatted file. We also learned that formatted and structured data is much more dependable and workable for algorithm implementation. Many websites can provide data. We searched Kaggle.com for data related to our topic but failed to find any.

At the beginning of our data collection process we were able to extract tweets into a Google Sheet for a limited period. After a few days Twitter made its website more secure than before with respect to sharing its data. To extract tweets through Google Sheets, Twitter required a university-provided email address, which we did not have at that time.

In the meantime we fortunately learned of a well-known software package called RapidMiner, which can extract tweets from Twitter through an authorized gateway created by RapidMiner and the Twitter developer community. Notably, RapidMiner can also handle data science projects and modeling with SML algorithms in a proper and dependable way.

5.3.1.3 Account Creation


We created a Twitter developer app on the apps.twitter.com website, as recommended by the official Twitter developer team, to create the access token, access token secret, consumer key and consumer secret, which are needed to make API access requests on the Twitter account's behalf. All of these are required for collecting tweets from the Twitter microblog.

5.3.1.4 Collection Process
We planned to collect tweets using queries that match our search tags against tweets posted by Twitter users. If a search tag is found in a user’s tweet, the tweet is extracted into our document file. For this, we used two RapidMiner operators, “search Twitter” and “write excel”. In the search Twitter operator we used the query parameters “refugee”, “save syria”, “save rohingya”, “help rohingya”, “help syria”, “help syria children”, “syria crisis”, “refugee crisis”, “sudan refugee”, “afganisthan refugee”, “stop killing rohingya”, “stop killing refugee”, “stop syria war” etc.

The following figure shows the operators that we used in RapidMiner for collecting tweets.

Figure 3 Data Collection Process Diagram.

As a Twitter developer, we collected around thirty-five thousand rows of tweets in five different languages through the authorized Twitter gateway. The data therefore needed to be converted from the various languages to a common language, and we used Google Translator for this conversion. The data was not noise free: it included duplicated tweets, many URL links, special symbols, emojis and mentioned names, all of which needed to be cleaned.

5.3.2 Data Pre-Processing
5.3.2.1 Overview
The data extracted from Twitter or other social media websites contains various non-sentiment content, such as duplicate tweets, website links, emoticons, mention symbols or usernames, white spaces, retweet tags and hashtags, which we removed before processing our tweets so that the sentiment could be generated accurately. Our pre-processing includes the following steps:

Figure 4 Data pre-processing diagram.

5.3.2.2 Duplicate Tweets Removal


Every day people share a huge number of tweets. When we extract their tweets from the social media platform, the same tweet sometimes appears multiple times. We call these duplicated tweets, and they should be removed for our analysis.

Figure 5 Duplicate Tweets Removal diagram.
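
Outside RapidMiner, duplicate removal can be sketched in a few lines of Python. This is an illustration rather than the operator we used; treating tweets that differ only in letter case or spacing as duplicates is an assumption of this example.

```python
def remove_duplicates(tweets):
    """Keep the first occurrence of each tweet, preserving the original order."""
    seen = set()
    unique = []
    for tweet in tweets:
        # Normalise case and spacing so near-identical copies compare equal.
        key = " ".join(tweet.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique

tweets = ["Save Rohingya", "save  rohingya", "Help Syria"]
unique_tweets = remove_duplicates(tweets)
```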
5.3.2.3 Removal of Website Link
The extracted Twitter data, that is, the tweets, contains URLs. A Twitter user may post a link to a video, audio or article, which is unnecessary for sentiment analysis. Therefore, such URLs should be removed from our tweets. Many of the URLs we found were YouTube and Facebook video links.
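
A regular expression is one common way to strip links. The pattern below is a simple assumption for illustration (anything starting with http://, https:// or www. up to the next space) and is not the expression used in our RapidMiner process; the sample tweet is made up.

```python
import re

# Hypothetical pattern: a scheme or "www." followed by non-space characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Delete every substring that looks like a web link."""
    return URL_PATTERN.sub("", text)

cleaned = remove_urls("Stop the war https://youtu.be/abc123 now")
```

The removal leaves a double space behind, which the later white-space clean-up step takes care of.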

5.3.2.4 Special Symbol Removing Part


Social media users use different types of symbols, which we call special symbols. We removed the exclamation mark (!), full stop (.), mention symbol (@), double quotes (“ ”), single quotes (‘ ’) etc., which do not carry sentiment. We removed all the special symbols from our tweet dataset using an operator provided by RapidMiner together with a regular expression. The following figure shows this:

Figure 6 Symbol removal diagram.
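
The same idea can be expressed as a Python regular expression. The pattern here is an assumption chosen for illustration, not the expression configured in RapidMiner: it deletes every character that is not an ASCII letter, a digit or white space, which also removes hashtag signs and emojis.

```python
import re

# Hypothetical pattern: anything other than ASCII letters, digits and spaces.
SYMBOL_PATTERN = re.compile(r"[^A-Za-z0-9\s]")

def remove_symbols(text):
    """Strip punctuation and other special symbols from a tweet."""
    return SYMBOL_PATTERN.sub("", text)

cleaned = remove_symbols('Stop killing! "Save" them.')
```

Because this pattern keeps only ASCII letters, it is suitable only after the tweets have been translated to English.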

5.3.2.5 Username Removing Part
Each user has one username, which must be unique according to the Twitter guidelines; anything posted by a user carries his/her username preceded by @, which acts as a proper noun. For example, @someones_username. These were also removed from our dataset for effective analysis.
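
Mentions follow a regular shape, so a short pattern is enough to remove them. The sketch below assumes that a username consists of letters, digits and underscores after the @ sign; the pattern and function name are illustrative, not taken from our RapidMiner process.

```python
import re

# Hypothetical pattern: "@" followed by letters, digits or underscores.
MENTION_PATTERN = re.compile(r"@\w+")

def remove_mentions(text):
    """Delete @username mentions from a tweet."""
    return MENTION_PATTERN.sub("", text)

cleaned = remove_mentions("@someones_username please help refugees").strip()
```

Mentions should be removed while the @ sign is still present; if the symbol is stripped first, the bare username can no longer be told apart from ordinary words.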

5.3.2.6 Removal of Additional White Spaces


The data may contain extra white space, which needs to be removed. Removing white space lets the analysis be done more efficiently.

5.3.2.7 Removal of Stop Words


"Stop words" are common words that do not carry useful information about a sentence, such as "the," "a," and "and". Removing stop words means that the model will not see these words and will be trained on a clean dataset. The Text Analysis Platform (TAP) recognizes stop words based on a manually curated, comprehensive list, although what counts as a stop word can vary depending on context. Because short documents such as tweets contain so little text, filtering stop words is needed before training a model.

Figure 7 Removal of stop words.


Following stop words are removed from our dataset:
I, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him, his,
himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what, which, who,
whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do,
does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about,
against, between, into, though, during, before, after, above, below, to, from, up, down, in, out, on,
off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each,
few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, can, will,
just, don, should, now.
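
Filtering against such a list is a simple membership test. In the sketch below, `STOP_WORDS` holds only a short excerpt of the list above to keep the example compact, and the function name is an assumption; the platform's own filter may behave differently.

```python
# Excerpt of the stop word list above; the full list would be used in practice.
STOP_WORDS = {"i", "me", "my", "the", "a", "an", "and", "is", "are",
              "be", "should", "of", "to", "from", "now"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every word that appears in the stop word set (case-insensitive)."""
    return " ".join(word for word in text.split()
                    if word.lower() not in stop_words)

filtered = remove_stop_words("the crisis should be stopped")
```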

5.3.3 Classification
After completing our pre-processing steps, we got 8788 structured tweets from around 35,000
unstructured tweets.

5.3.3.1 Sentiment Classification


We analyzed our tweets with the AYLIEN extension, an NLP platform available through RapidMiner. First, we determined the sentiment of each tweet: Positive, Negative or Neutral. We added the Analyze Sentiment operator to our process and selected "text" as our input attribute in the RapidMiner software.

Figure 8 Sentiment classification operator’s diagram.

After processing and analyzing our tweets, we obtained the following Polarity (positive, neutral and negative) and Subjectivity (subjective and objective) values.

Figure 9 Sentiment classification diagram.

5.3.3.2 Polarity Handling


Step1:
Usually in sentiment analysis, negative words make up the negative class and positive words make up the positive class. We therefore reviewed each tweet in the negative, positive and neutral polarity classes. If a tweet contained supporting words or hashtags such as help, #help, save, #save, or non-supporting words, we read it manually; if we understood the tweet to be supporting refugees, we classified it as the support class, and otherwise as the against class, in the polarity attribute.

Figure 10 Visualization of polarity class (chart title: Support and Against class; Support: 8330 tweets, Against: 458 tweets).

5.3.3.4 Attributes
Our dataset has several attributes. The text in our dataset comes from various sources: Twitter Web Client, Twitter for Android, Twitter for iPhone, Twitter Lite, Twitter for iPad, Facebook, Google and LinkedIn. Our attributes are as follows:
• Id
• Name
• Tweet date
• Source of text
• Language
• Retweet
• Subjectivity
• Subjectivity confidence
• Polarity
• Polarity confidence.
However, a user’s Id, Name and Tweet date carry no sentiment value, so we removed them from our dataset. In particular, Tweet date was removed because we are working in real time, which makes the tweet date redundant for this analysis.

5.3.3.5 Transformation
Many classification algorithms cannot deal with string data, which is the reason for our data transformation. We transformed our attributes into numerical values with Python so that we could measure algorithmic performance. This is only a change of representation. We encoded the source attribute of each tweet into numeric labels using the LabelEncoder class in Python. We transformed our language attribute by dummy encoding using the OneHotEncoder class in Python. In the subjectivity attribute, subjective, which we call opinion, is represented by 1, and objective, which we call fact, is represented by 0. In the polarity class, support is represented by 0 and against by 1.

Figure 11 Transformation of Attributes diagram.
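
The transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the column names, the language codes and the four sample rows are hypothetical, and only the encoder classes and the 0/1 mappings come from the text.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample rows; column names are assumptions for illustration.
tweets = pd.DataFrame({
    "source": ["Twitter Web Client", "Twitter for Android",
               "Twitter for iPhone", "Twitter for Android"],
    "language": ["en", "bn", "tr", "en"],
    "subjectivity": ["subjective", "objective", "subjective", "objective"],
    "polarity": ["support", "against", "support", "support"],
})

# Label-encode the source attribute: each distinct source gets an integer.
source_encoder = LabelEncoder()
tweets["source_code"] = source_encoder.fit_transform(tweets["source"])

# Dummy-encode the language attribute: one 0/1 column per language.
language_encoder = OneHotEncoder()
language_onehot = language_encoder.fit_transform(tweets[["language"]]).toarray()

# Fixed mappings stated in the text.
tweets["subjectivity_code"] = tweets["subjectivity"].map(
    {"objective": 0, "subjective": 1})
tweets["polarity_code"] = tweets["polarity"].map({"support": 0, "against": 1})
```

LabelEncoder assigns integers in sorted order of the class names, so the codes carry no ranking; that is one reason the language attribute is dummy encoded instead.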

5.3.4 Implementation
5.3.4.1 Overview
We implemented our supervised machine learning (SML) algorithms in Python, following an
object-oriented programming (OOP) approach. The following section presents the steps we followed
to complete the project.

5.3.4.2 Implementation Strategy


Step1: At the beginning of the implementation of the supervised ML algorithms, we imported three
Python packages: numpy, matplotlib.pyplot and pandas.

Step2: We then imported the dataset and separated the dependent and independent variables, as we
do in a typical mathematical problem.

Step3: We performed label encoding on our categorical data with the LabelEncoder class from
scikit-learn.

Step4: We applied one-hot encoding (a binary representation) to the required attribute with the
OneHotEncoder class from scikit-learn.

Step5: We avoided the dummy variable trap of OneHotEncoder.

Step6: With the help of scikit-learn's cross-validation module, we split our dataset into training
and testing data.

Step7: We fitted and transformed the data with the StandardScaler class from scikit-learn.

Step8: We trained the DecisionTreeClassifier, KNeighborsClassifier, SVC, RandomForestClassifier
and GaussianNB classes from scikit-learn on our dataset.

Step9: Finally, we built the confusion matrix with scikit-learn and computed our desired results
from it.

Figure 12 Implementation diagram.
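
Steps 5 to 9 can be sketched as one short script. This is an illustrative sketch, not our project code: the data is synthetic and randomly generated, the variable names are hypothetical, all classifiers use their default settings, and the split uses `train_test_split` from `sklearn.model_selection`, which is where current scikit-learn releases provide it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded dataset (assumption: not the thesis data).
rng = np.random.default_rng(0)
n = 400
language_onehot = np.eye(3)[rng.integers(0, 3, size=n)]  # three dummy columns
source_code = rng.integers(0, 5, size=n).reshape(-1, 1)
subjectivity = rng.integers(0, 2, size=n).reshape(-1, 1)
y = rng.integers(0, 2, size=n)  # 0 = support, 1 = against

# Step 5: drop one dummy column to avoid the dummy variable trap.
X = np.hstack([language_onehot[:, 1:], source_code, subjectivity])

# Step 6: hold out one-fourth of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 7: fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 8: train each classifier with default settings.
classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "RF": RandomForestClassifier(random_state=0),
    "NB": GaussianNB(),
}

# Step 9: one 2x2 confusion matrix per classifier.
matrices = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    matrices[name] = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])
```

Because the labels here are random, the resulting matrices say nothing about the real dataset; the sketch only shows the order of the operations.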

5.3.4.2 Coding
In the following section, we show our label encoding in Python step by step.

In the next phase, we present our one-hot encoding in Python.

5.4 Summary
In this methodology and experiment chapter, we described our methodology and process diagram,
which included the data collection process, the data pre-processing process, the data
classification strategy based on our feature analysis and, finally, the implementation of our
classification and machine learning algorithms in Python.

Chapter 06: Result

6.1 Visualizing NLP Results


As a brief overview of this chapter: we completed our experiment on our dataset of 8788 tweets,
taking three-fourths of the data for training and one-fourth for testing. The following sections
present graphical visualizations of the dataset from various perspectives.

6.1.1 Support and Against Class visualization


Our results were stored in a document; to make them more presentable, we visualized them in the
following three-dimensional pie diagram. Microsoft Excel let us display and visualize the results
of our process.

[Pie chart "Support and Against": Support 8330, Against 458.]

Figure 13 Overview of Polarity Class.

6.1.2 Opinion and Fact Class Visualization


After the natural language processing analysis, we found that our dataset contained 5129 opinion
sentences and 3659 fact (objective) sentences from Twitter users. The following figure shows this
graphically.

[Bar chart "Subjectivity Class": Fact 3659, Opinion 5129.]

Figure 14 Overview of Subjectivity Class.

6.1.3 Subjectivity and Polarity

[Bar chart "Count of polarity": fact/against 126, opinion/against 332, fact/support 3533,
opinion/support 4797.]

Figure 15 Subjectivity and Polarity Diagram.
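
The four counts in Figure 15 can be cross-checked against the class totals in Figures 13 and 14. The sketch below is only a consistency check; the assignment of each count to a bar is inferred, since it is the only assignment that reproduces the reported totals of 3659 facts, 5129 opinions, 8330 support and 458 against. The variable names are hypothetical.

```python
import pandas as pd

# Rows: subjectivity class; columns: polarity class (counts from Figure 15).
counts = pd.DataFrame(
    {"against": [126, 332], "support": [3533, 4797]},
    index=["fact", "opinion"],
)

subjectivity_totals = counts.sum(axis=1)  # fact and opinion totals
polarity_totals = counts.sum(axis=0)      # against and support totals
total_tweets = int(counts.to_numpy().sum())
```

The row totals match Figure 14, the column totals match Figure 13, and the grand total is the 8788 tweets of the dataset.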

6.1.4 English and Polarity

[Pie chart: against English 133, support English 3223.]

Figure 16 English and Polarity Diagram.

6.1.5 Bangla and Polarity

[Pie chart: against Bangla 110, support Bangla 2229.]

Figure 17 Bangla and Polarity Diagram.

6.1.6 Turkish and Polarity

[Pie chart: against Turkish 54, support Turkish 909.]

Figure 18 Turkish and Polarity Diagram.

6.1.7 Urdu and Polarity

[Pie chart: support Urdu 835, against Urdu 96.]

Figure 19 Urdu and Polarity Diagram.

6.1.8 Chinese and Polarity

[Pie chart: support Chinese 1592, against Chinese 65.]

Figure 20 Chinese and Polarity Diagram.

6.1.9 Retweet Visualization
We also examined retweets. Here, retweet means the number of times other users re-posted a
particular tweet.

[Chart "Retweet": retweet value plotted against tweet index, with the horizontal axis running
from 1 to 8498.]
Figure 21 Retweet Visualization Diagram.

6.2 Algorithmic Performance


To measure algorithmic performance, we used the confusion matrix. The following sections describe
the measures we used.

6.2.1 Confusion Matrix


In our thesis, the confusion matrix is a 2x2 matrix that combines the actual class and the
predicted class. It lets us evaluate the model's predictions by comparing them with the actual
labels of the testing dataset.

Table 1 Confusion Matrix.

                      Predicted Class: yes    Predicted Class: no
Actual Class: yes     True Positive (TP)      False Negative (FN)
Actual Class: no      False Positive (FP)     True Negative (TN)
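
The four cells of Table 1 can be read from scikit-learn's `confusion_matrix`. The sketch below is a minimal example with made-up labels; treating label 1 as the "yes" class is an assumption for illustration only. Passing `labels=[1, 0]` puts the "yes" class first, so the output has the same layout as Table 1.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels; 1 is treated as "yes" (assumption).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes, "yes" first.
matrix = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = matrix.tolist()
```

Without the `labels` argument, scikit-learn orders the classes in sorted order, which would place TN in the top-left cell instead of TP.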

6.2.2 Accuracy
Accuracy is defined as the ratio of the sum of TP and TN to the sum of TP, TN, FN and FP. We can
write it as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$$

6.2.3 Recall
Recall is defined as the ratio of TP to the sum of TP and FN. We can write it as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$

6.2.4 Precision
Precision is defined as the ratio of TP to the sum of TP and FP. We can write it as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

6.2.5 F-measure
To calculate the F-measure, we multiply 2, recall and precision together and divide the product
by the sum of recall and precision:

$$F\text{-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$
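
The four measures defined in this section can be computed directly from the cells of the confusion matrix. The sketch below is a minimal illustration; the function name `classification_metrics` and the example counts are hypothetical and are not results from our experiment.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision and F-measure from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
    }

# Hypothetical counts for 100 test tweets, used only to show the call.
metrics = classification_metrics(tp=90, tn=5, fp=3, fn=2)
```

The function assumes that TP + FN and TP + FP are both non-zero; a classifier that never predicts "yes" would need a separate guard against division by zero.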

6.3 Algorithmic Result Visualization
We ran our experiment on the structured data twice: once with seventy-five percent of the dataset
for training and twenty-five percent for testing, and once with eighty percent for training and
twenty percent for testing.
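
The two splits can be sketched with scikit-learn's `train_test_split`. This is only an illustration of the split sizes for a dataset of 8788 rows: the list of integer indices stands in for the real tweets, and the `random_state` value is an arbitrary assumption.

```python
from sklearn.model_selection import train_test_split

# One integer index per tweet in the 8788-tweet dataset (stand-in for the data).
indices = list(range(8788))

# Record (training size, testing size) for each testing fraction.
split_sizes = {}
for test_fraction in (0.25, 0.20):
    train_idx, test_idx = train_test_split(
        indices, test_size=test_fraction, random_state=0)
    split_sizes[test_fraction] = (len(train_idx), len(test_idx))
```

With a testing fraction of 0.25 the split is exact: 6591 training rows and 2197 testing rows.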

6.3.1 Result of Polarity


We trained the learning algorithms on our dataset, taking three-fourths of the data for training
and one-fourth for testing.

6.3.1.1 Accuracy
We found that the KNN algorithm gave the best accuracy, with only five percent error, and that NB
gave the lowest accuracy, at eighty-three percent. The following pie diagram shows the accuracy of
the KNN, NB, DT and RF classifiers.

[Pie chart "Polarity Accuracy". Legend: RF, KNN, NB, DT. Data labels: 93%, 94%, 83%, 95%.]

Figure 22 Accuracy of predicting Polarity.

6.3.1.2 Table
The following table presents the Recall, Precision and F-measure for each classifier we used.

Table 2 Recall, Precision and F-measure of Polarity Class.

Classifier    Recall    Precision    F-measure
KNN           0.99      0.95         0.98
RF            0.99      0.95         0.97
NB            0.86      0.95         0.90
DT            0.98      0.95         0.96

6.3.1.3 Graph View


The following graph shows the Recall, Precision and F-measure for each classifier we used.

[Bar chart "Recall, Precision and F-measure" for RF, KNN, NB and DT. Recall: 99%, 99%, 86%, 98%.
Precision: 95%, 95%, 95%, 95%. F-measure: 97%, 98%, 90%, 96%.]

Figure 23 Recall, Precision and F-measure of Polarity Class.

6.4 Comparison Table
We compared our accuracy results with related work. Based on the accuracy calculated from the
confusion matrix, our predictions were better.

Table 3 Comparison Table of Accuracy.

In [23], the authors reported the following table for their analysis.


Table 4 Comparison Table of Recall, Precision and F-Measure.

Classifier    Recall    Precision    F-measure
KNN           0.592     0.590        0.593
RF            0.865     0.856        0.865
NB            0.647     0.642        0.646
DT            0.646     0.630        0.645

6.5 Summary
In this chapter, we presented our algorithmic analysis by calculating accuracy, recall, precision
and F-measure, using graphs and tables to make the outcome of our thesis easier to understand.
Finally, we presented a comparison table with related work.

Chapter 07: Conclusion

7.1 Discussion
In this thesis, we have shown a different approach to text mining and sentiment analysis. The
improvements can be seen in the results chapter, where our classifiers gave accurate results in
classifying the support and against classes. We wanted to work with big data but could do so only
partially, because handling big data is complex and a major issue under these circumstances. We
used some protocols that can deal with big datasets, but they were time-consuming. With a
higher-configuration system, we could have obtained better outcomes in less time.
After completing our analysis, we observed that text mining and sentiment analysis can be done in
many ways.
We see this thesis as a demonstration that we can now handle part of a data science project: using
theoretical concepts, we were able to analyze a real-life problem.

7.2 Limitation
As data and information have become more secure than before, it is becoming difficult to collect
data for experiments. It is also difficult for a machine to find the accurate sentiment of a text
every time.

We observed that working on real-life problems with data mining requires a powerful,
high-configuration system. It was a tough job for us to complete our work with the ordinary
system we normally use.

Among our thirty-five thousand tweets, our tools found many incomplete words and sarcastic posts,
which were removed when we started to check the neutral class. We avoided the neutral class
because we judged that it could not give us an appropriate decision for our purpose.

We depended heavily on the AYLIEN natural language processing tools, although they are provided
through RapidMiner, a dependable source.

7.3 Future Work


We intend to build a Bangla natural language processing platform, for which we will compile a
list of Bangla stop words and a part-of-speech tagging model for analyzing Bangla text only.

Using our experience from this project, we intend to work with classification algorithms for
facial recognition, fraud detection and other real-life classification problems.

In future, we intend to work with data from more social media platforms, such as Facebook and
YouTube.

We also intend to work with big data on a high-configuration system and to gain more training on
real-life data science projects. Our future plan is therefore to use big data for text mining and
text analysis.

References

[1] Team Internet Live Stats, "Internet Live Stats," InternetLiveStats.com, 20 October 2018.
[Online]. Available: http://www.internetlivestats.com/twitter-statistics/. [Accessed 20
October 2018].
[2] Wikipedia writers, "Wikipedia," Wikimedia Foundation Inc., 5 October 2018. [Online].
Available: https://en.wikipedia.org/wiki/Internally_displaced_person. [Accessed 10 October
2018].
[3] World Vision Staff, "World Vision," World Vision, Inc., 26 June 2018. [Online]. Available:
https://www.worldvision.org/refugees-news-stories/forced-to-flee-top-countries-refugees-
coming-from. [Accessed 1 September 2018].
[4] Haseena Rahmath P, "Opinion Mining and Sentiment Analysis Challenges and
Applications," International Journal of Application or Innovation in Engineering &
Management (IJAIEM), vol. 3, no. 5, 2014.
[5] Kristen McCabe, "Crowd Learning Hub," G2 Crowd, Inc, 16 May 2018. [Online]. Available:
https://learn.g2crowd.com/customer-reviews-statistics. [Accessed 20 October 2018].
[6] Doaa Mohey El-Din Mohamed Hussein, "Analyzing Scientific Papers Based on Sentiment
Analysis," Research Gate, Cairo University, 2016.
[7] Bird, S., Klein, E., and Loper, E., Natural Language Processing with Python, vol. 1, O’Reilly
Media publisher, 2009.
[8] Zengchang Qin, "Naive Bayes Classification Given Probability Estimation Trees," IEEE,
Orlando, FL, USA, 2006.
[9] Andrew Christian Flores ; Rogelyn I. Icoy ; Christine F. Peña ; Ken D. Gorro, "An Evaluation
of SVM and Naive Bayes with SMOTE on Sentiment Analysis Data Set," Phuket, Thailand,
2018.
[10] Bo Liu ; Zhi-Feng Hao ; Xiao-Wei Yang, "Nesting support vector machinte for muti-
classification [machinte read machine]," IEEE, Guangzhou, China, China, 2005.
[11] Zonghu Wang ; Zhijing Liu, "Graph-based KNN text classification," in IEEE, Yantai, China,
2010.

[12] Shiueng-Bien Yang ; Shen-I Yang, "New decision tree based on genetic algorithm," in
International Symposium on Computer, Communication, Control and Automation (3CA),
Tainan, Taiwan, 2010.
[13] Yashaswini Hegde ; S.K. Padma, "Sentiment Analysis Using Random Forest Ensemble for
Mobile Product Reviews in Kannada," in IEEE 7th International Advance Computing
Conference (IACC), Hyderabad, India, 2017.
[14] Ian, H.W., Eibe, F., and Mark A. H., Data Mining: Practical machine learning tools.,
Waikato, New Zealand: Morgan Kaufmann Publishers, 2011.
[15] Yin, Z., Rong, J., and Zhi-Hua, Z., "Understanding Bag-of-Words Model: A Statistical
Framework," International Journal of Machine Learning and Cybernetics, 2010.
[16] Dhanalakshmi V., Dhivya Bino, Saravanan A. M., "Opinion mining from student feedback
data using supervised learning algorithms," in 3rd MEC International Conference on Big
Data and Smart City, Muscat, Oman, 2016.
[17] Parul Sharma, Teng-Sheng Moh., "Prediction of Indian Election Using Sentiment Analysis.,"
in IEEE International Conference on Big Data (Big Data), San Jose, CA, USA, 2016.
[18] Huma Parveen, Prof. Shikha Pandey., "Sentiment Analysis on Twitter Data-set using Naive,"
in 2nd International Conference on Applied and Theoretical Computing and Communication
Technology (iCATccT)., Bhilai, India, 2016.
[19] Nazan Ozturka, Serkan Ayvaz, "Sentiment analysis on Twitter: A text mining approach to
the Syrian refugee crisis," in Science Direct, Istanbul, Turkey, 2017.
[20] Asma Musabah Alkalbani, Ahmed Mohamed Ghamry, Farookh Khadeer Hussain, Omar
Khadeer Hussain., "Predicting the sentiment of SaaS online reviews using supervised
machine learning techniques.," in 2016 International Joint Conference on Neural Networks
(IJCNN)., Sydney, Australia., 2016.
[21] Apoorv Agarwal, Vivek Sharma, Geeta Sikka, Renu Dhir., "Opinion Mining of News
Headlines using SentiWordNet.," in 2016 Symposium on Colossal Data Analysis and
Networking (CDAN)., Punjab, India., 2016.
[22] Rachana Bandana, "Sentiment Analysis of Movie Reviews Using Heterogeneous Features,"
in IEEE, Nadiad, India. , 2018.

[23] Ankita Rane, Dr. Anand Kumar., "Sentiment Classification System of Twitter Data for US
Airline Service Analysis.," in 42nd IEEE International Conference on Computer Software
& Applications., Dubai, UAE., 2018.
[24] "RapidMiner," RapidMiner, 10 September 2018. [Online]. Available:
https://rapidminer.com/. [Accessed 10 September 2018].
[25] Ch. Nanda Krishna, Dr. P. Vidya Sagar, Dr. Nageswara Rao Moparthi, "Sentiment Analysis
of Top Colleges," in 4th International Conference on Advances in Electrical, Electronics,
Information, Communication and Bio-Informatics (AEEICB-18), Andhra Pradesh, India,
2018.
