
ASCERTAINING POLARITY OF PUBLIC OPINIONS ON
BANGLADESH CRICKET THROUGH SENTIMENT ANALYSIS

A THESIS REPORT
Submitted By
MD ABDULLAH FARUQUE
ID No: 11508041

In partial fulfillment for the award of the degree of
BACHELOR OF ENGINEERING
in
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

COMILLA UNIVERSITY :: CUMILLA-3506

SEPTEMBER 2020

BONAFIDE CERTIFICATE

Certified that this thesis report “ASCERTAINING POLARITY OF PUBLIC OPINIONS ON BANGLADESH CRICKET THROUGH SENTIMENT ANALYSIS” is the bona fide work of “MD ABDULLAH FARUQUE”, who carried out the thesis work under our supervision.

SIGNATURE
MD. KAMAL HOSSAIN CHOWDHURY
CHAIRMAN, EXAM COMMITTEE
Associate Professor
Department of Computer Science and Engineering
Comilla University, Comilla

SIGNATURE
PARTHA CHAKRABORTY
SUPERVISOR
Assistant Professor
Department of Computer Science and Engineering
Comilla University, Comilla
Abstract

In the present world we are not only consumers of information but creators as well. The virtual world of social media, considered a free and open forum for discussion, gives its participants a chance to shape or re-shape the news provided by the media and to post their own views on their walls. Sentiment analysis, the computational study of identifying and extracting the sentiment content of textual data, is used to classify such public opinions posted on various topics on social media. In this work, a sentiment polarity detection approach is presented that detects the polarity of textual Facebook posts in Bangla containing people's points of view on Bangladesh Cricket, using three popular machine learning algorithms: Naive Bayes (NB), Support Vector Machines (SVM), and Logistic Regression (LR). A comparative result analysis is also provided, in which LR performed slightly better than SVM and NB, achieving an accuracy of 83% with n-gram features.
Acknowledgements
First, I want to express gratitude to the Almighty for His endless kindness in keeping me mentally and physically fit to complete this demanding task.

I would like to express my special gratitude to my honorable supervisor PARTHA CHAKRABORTY, who gave me the golden opportunity to do this wonderful thesis work. His sage advice, insightful criticism, and patient encouragement aided the completion of this project in innumerable ways. His direction and guidance in preparing manuscripts, reports, and presentations made me more dynamic and well-organized throughout the whole project. I came to know about many new things, and I am greatly thankful to him.

I wish to thank all of my seniors for their utmost support. I also want to thank my classmates, who always inspired, helped, and motivated me.

Contents

Abstract i

Acknowledgements ii

Contents iii

List of Figures vi

List of Tables viii

1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Organization of the Thesis Work . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Literature Review and Background Study 5


2.1 Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Types of Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1.1 Subjectivity/Objectivity Identification . . . . . . . . . . . . 6
2.1.1.2 Feature/Aspect-Based Identification . . . . . . . . . . . . . 6
2.1.2 Major Tasks in a Sentiment Analysis . . . . . . . . . . . . . . . . . . . 6
2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Methodology 9
3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Data Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4.1 Removing Extra characters . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.4.3 Removing stopwords . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


3.5 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.6 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.6.1 Data split for training and testing . . . . . . . . . . . . . . . . . . . . . 13
3.6.2 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.6.2.1 Gaussian Naïve Bayes . . . . . . . . . . . . . . . . . . . . . 15
3.6.2.2 Multinomial Naïve Bayes . . . . . . . . . . . . . . . . . . . 15
3.6.2.3 Bernoulli Naïve Bayes . . . . . . . . . . . . . . . . . . . . . 15
3.6.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6.3.1 Linear Kernel . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.6.3.2 Polynomial Kernel . . . . . . . . . . . . . . . . . . . . . . . 17
3.6.3.3 Radial Basis Function (RBF) Kernel . . . . . . . . . . . . . 17
3.6.4 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6.4.1 Binary or Binomial . . . . . . . . . . . . . . . . . . . . . . 19
3.6.4.2 Multinomial . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.6.4.3 Ordinal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.7 Performance Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.7.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.7.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.7.3 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.7.4 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.7.5 F1 Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.8 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 Experimental Analysis and Results 23


4.1 Data Collection and Annotation . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Reading Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Training the Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.5.1 Training Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . 27
4.5.2 Training the SVM Classifier . . . . . . . . . . . . . . . . . . . . . 28
4.5.3 Training the Logistic Regression Classifier . . . . . . . . . . . . . . 28
4.6 Results Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.6.1 Multinomial Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . 30
4.6.2 Bernoulli Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.6.3 Gaussian Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.6.4 Linear SVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.6.5 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.7 Results Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5 Conclusion and Future Directions 37



5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Bibliography 39
List of Figures

1.1 Social Media users in Bangladesh. . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 Major tasks in Sentiment Analysis. . . . . . . . . . . . . . . . . . . . . . . . 6

3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10


3.2 Steps of data preprocessing. . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Data splitting for training and testing. . . . . . . . . . . . . . . . . . . . . . . 13
3.4 SVM Classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1 Simple Statistical Analysis of the dataset. . . . . . . . . . . . . . . . . . . . . 24


4.2 Description of the Dataframe. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Removing Extra Characters. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Tokenization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Stopwords Removal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.6 Vectorization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.7 Importing necessary models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.8 Training Multinomial, Bernoulli, and Gaussian Naive Bayes Classifiers. . . . . 28
4.9 Training the SVM classifier with different kernels. . . . . . . . . . . . . . . . 28
4.10 Training the Logistic Regression classifier. . . . . . . . . . . . . . . . . . . . 28
4.11 Definition of method for calculating accuracy and drawing the heatmap for confusion matrices. . . . . . . . . . . . . . 29
4.12 Definition of method for printing classification report. . . . . . . . . . . . . 29
4.13 Accuracy for Multinomial Naive Bayes classifier. . . . . . . . . . . . . . . . . 30
4.14 Confusion matrix for Multinomial Naive Bayes classifier. . . . . . . . . . . . . 30
4.15 Classification report for Multinomial Naive Bayes classifier. . . . . . . . . . . 30
4.16 Accuracy for Bernoulli Naive Bayes classifier. . . . . . . . . . . . . . . . . . . 31
4.17 Confusion matrix for Bernoulli Naive Bayes classifier. . . . . . . . . . . . . . 31
4.18 Classification report for Bernoulli Naive Bayes classifier. . . . . . . . . . . . . 31
4.19 Accuracy of Gaussian Naive Bayes classification. . . . . . . . . . . . . . . . . 32
4.20 Confusion matrix of Gaussian Naive Bayes classification. . . . . . . . . . . . . 32
4.21 Classification report of Gaussian Naive Bayes classification. . . . . . . . . . . 32
4.22 Accuracy for Linear SVC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33


4.23 Confusion matrix for Linear SVC. . . . . . . . . . . . . . . . . . . . . . . . . 33


4.24 Classification report for Linear SVC. . . . . . . . . . . . . . . . . . . . . . . . 33
4.25 Accuracy for Logistic Regression. . . . . . . . . . . . . . . . . . . . . . . . . 34
4.26 Confusion matrix for Logistic Regression. . . . . . . . . . . . . . . . . . . . . 34
4.27 Classification report for Logistic Regression. . . . . . . . . . . . . . . . . . . 34
4.28 Comparison of accuracy among the models. . . . . . . . . . . . . . . . . . . . 35
List of Tables

4.1 Head of the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


4.2 Label Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Comparing accuracy of the models. . . . . . . . . . . . . . . . . . . . . . . . 35

Chapter 1

Introduction

This chapter provides a descriptive viewpoint of the introductory aspects of the thesis work, including the problem statement, objectives, and motivation behind it. It also includes a brief portrayal of the experiments carried out in this thesis. Furthermore, the thesis contribution is discussed, and finally this chapter concludes with the organization of the report, which gives an outline for the rest of it.

1.1 Introduction

Gathering information is an important part of our daily life. Every day we check our news feeds for what is new and hunt for interesting topics on social media. Social media has become a central source for information mining, as it contains user-generated content such as comments, reviews, posts, likes, and dislikes, and platforms like blogs and forums carry heaps of user-created data. This information incorporates the feelings of the users, for example how positively or negatively a user writes his comments or reviews. Positivity and negativity are the critical traits portraying a user's mood and emotions. Sentiment analysis is an important field in natural language processing in the present world, and its popularity on social media is growing very rapidly. Many works have already been done in this field; for example, the business sector uses sentiment analysis for product reviews drawn from social media posts and comments, reducing the workload of an organization. Sentiment analysis in natural language processing is a process by which we can analyze a person's opinion, emotion, and attitude. It is done by a sentiment analyzer tool built on machine learning algorithms. Almost every sentence has some specific words which express whether the sentence is positive, negative, or neutral, and sentiment analysis tools use them to detect the polarity (i.e., positivity or negativity) of a sentence. Thus, it reveals the sentiment of an individual. In the present world, Bangla is spoken as the first language by almost 200 million people, of whom 160 million are Bangladeshi. There are approximately 3 billion people using social media worldwide; in Bangladesh, 95.13% of social media users use Facebook and 1.35% use Twitter [1].

Figure 1.1: Social Media users in Bangladesh.

Most Bangladeshis readily express and share their thoughts and opinions on microblogging and social networking sites like Facebook and Twitter by writing blogs, posts, and comments that contain a person's point of view in the Bangla language.

1.2 Motivation

Cricket is a game with a massive and passionate following in Bangladesh. Bangladesh joined the elite group of countries eligible to play Test cricket in 2000. The Bangladesh national cricket team goes by the nickname of the Tigers, after the Royal Bengal Tiger. The people of Bangladesh enjoy watching live sports. Whenever there is a cricket match between popular local teams or international teams in any local stadium, a significant number of spectators gather to watch the match live. People who cannot go to the stadium do not miss watching the cricket matches played by the Bangladesh team live on TV or over the Internet. They also celebrate major victories of the national team with great enthusiasm.

As people love to express their feelings about cricket on social media in their native language, it has become an interesting area for us to analyze real people's emotions about cricket. But it is not enough to know what people are talking about; we must also know how they feel. Sentiment analysis is one way to uncover those feelings.

Much of the research on polarity classification of social media posts and comments has been carried out on the English language, but the construction of resources and tools for sentiment analysis in other languages is a growing need, since social media posts and comments are not posted only in English but in other natural languages as well. Work on other languages is growing, including Japanese, Chinese, German, and Romanian. Bangla is one of the most widely spoken languages, ranked seventh in the world, but surprisingly very few works have been done on sentiment analysis in Bangla.

1.3 Problem Statement

In this work, my aim is to extract the sentiment, or polarity, conveyed by users' posts and comments written in the Bangla language on the most popular social platform in the world, Facebook. I then identify the overall polarity of texts as positive, negative, or neutral. First I created a Google Form and labeled the data with these three classes, then preprocessed the data, extracted features using TF-IDF, and applied machine learning models. From related works in English, Logistic Regression (LR), Support Vector Machine (SVM), and Naive Bayes have proven to outperform other classifiers in this field. Hence, for classification, I have used Logistic Regression, Support Vector Machine, and Multinomial, Bernoulli, and Gaussian Naive Bayes, and conducted a comparative analysis of the performance of these machine learning algorithms.

1.4 Objectives

This thesis aims to develop a method that categorizes social media posts and comments on Bangladesh Cricket. For this purpose, previously labeled sentences are collected from social media. These documents are divided into a train set and a test set, and the train set is used in a variety of experiments including the proposed model. After experimentation, the results of the different experiments are compared for model evaluation, and predictions are made on the test set.

1.5 Organization of the Thesis Work

The remaining report is divided into four chapters:

In Chapter 2, the related work of this project is presented.
In Chapter 3, the methodology is discussed.
In Chapter 4, the experiments and their results are shown.
Chapter 5 concludes the report with a brief mention of future possibilities and experiments.
Chapter 2

Literature Review and Background Study

This chapter discusses the background and literature review of the work conducted and provides an insight into the necessary preliminaries of this thesis. Reviewing existing literature provides substantial knowledge on the topic of interest, offering guidance to follow throughout the thesis work.

2.1 Sentiment Analysis

Sentiment analysis is a term that refers to the use of natural language processing, text analysis,
and computational linguistics in order to ascertain the attitude of a speaker or writer toward a
specific topic. Basically, it helps to determine whether a text is expressing sentiments that are
positive, negative, or neutral. Sentiment analysis is an excellent way to discover how people,
particularly consumers, feel about a particular topic, product, or idea.

2.1.1 Types of Sentiment Analysis

Sentiments refer to attitudes, opinions, and emotions. In other words, they are subjective impres-
sions as opposed to objective facts. Different types of sentiment analysis use different strategies
and techniques to identify the sentiments contained in a particular text. There are two main types
of sentiment analysis: subjectivity/objectivity identification and feature/aspect-based sentiment
analysis.

2.1.1.1 Subjectivity/Objectivity Identification

Subjectivity/objectivity identification entails classifying a sentence or a fragment of text into one of two categories: subjective or objective. However, it should be noted that there are challenges in conducting this type of analysis. The main challenge is that the meaning of a word or even a phrase is often contingent on its context.

2.1.1.2 Feature/Aspect-Based Identification

Feature/aspect identification allows for the determination of different opinions or sentiments (features) in relation to different aspects of an entity. Unlike subjectivity/objectivity identification, feature/aspect-based identification allows for a much more nuanced overview of opinions and feelings.

2.1.2 Major Tasks in a Sentiment Analysis

There are two major tasks in a sentiment analysis:

1. Opinion Holder Extraction: the discovery of opinion holders or sources. Opinion holder detection recognizes direct or indirect sources of opinion.

2. Object/Feature Extraction: the discovery of the target entity.

Figure 2.1: Major tasks in Sentiment Analysis.



2.2 Related Works

A large number of approaches have been developed to date for classifying sentiments or polarities in English texts. These methods can be classified into two categories: (1) machine learning or statistical approaches and (2) unsupervised lexicon-based approaches. Machine learning methods use classifiers that learn from training data to automatically annotate new unlabeled texts with their corresponding sentiment or polarity.

One of the first papers on the automatic classification of sentiments in Twitter messages using machine learning techniques is [2]. Through distant supervision, the authors build a training corpus of Twitter messages with positive and negative emoticons and train three different machine learning techniques on this corpus (SVM, Naïve Bayes, and MaxEnt), with features such as N-grams (unigrams and bigrams) and Part-of-Speech (POS) tags. They obtain a good accuracy of above 80%.

A system [3] proposed a model that follows the same procedure as [2] to develop the training corpus of Twitter messages, but introduces a third class of objective tweets, forming a dataset of three classes: positive sentiments, negative sentiments, and a set of objective texts (no sentiments). They use multinomial NB, SVM, and Conditional Random Fields (CRF) as classifiers with N-grams and POS tags as features. The authors of [4] use 50 hashtags and 15 emoticons as sentiment labels to train a supervised sentiment classifier using the K-Nearest Neighbors (KNN) algorithm.

Work on other natural languages is increasing, including Japanese [5] [6] [7] [8], Chinese [9] [10], and other languages. Although a lot of research has been done on sentiment analysis of English datasets, not much research has been performed on movie reviews in Bangla, mostly due to the lack of sufficient data.

The paper entitled "Evaluation of Naive Bayes and Support Vector Machines on Bangla Textual Movie Reviews" [11] compares the performance of Naive Bayes and SVM in classifying movie reviews in Bangla. The dataset contains 800 comments, collected by the authors using web crawling methods from Bangla movie review sites and social media. In this paper, the performance of the models was judged by recall and precision values. In their experiment, SVM provided the best precision of 0.86. N-gram Based Sentiment Mining for Bangla Text
Using Support Vector Machine [12] approaches the sentiment analysis problem primarily using SVM for classification and the N-gram method for vectorization. An interesting technique used in this paper was negativity separation, which separates the negative postfix of a word from the actual word, thus putting more emphasis on the fact that the overall sentence contains negativity. Furthermore, a comparison between linear and non-linear SVM was presented, which indicates that linear SVM performed better for text classification. An experiment on detecting multilabel sentiment and emotions from Bangla YouTube comments was performed by Irtiza et al. [13], where their deep learning based [14] LSTM approach achieves 65.97% accuracy on a three-label dataset and 54.24% accuracy on a five-label dataset.

A system [15] proposed a model named ‘A Sentiment Classification in Bengali and Machine Translated English Corpus’. They classified a combination of 2489 cricket data samples and 1016 additional samples into three polarity classes: Positive, Negative, and Neutral. Considering accuracy, SVM is the top-performing classifier for the English corpus with an accuracy of 73.2%. For the Bengali corpus, the highest accuracy comes from the LR classifier, at 70.9%.

My work is mainly inspired by [16], which presented research on sentiment analysis of Bangladesh Cricket with Support Vector Machines. The dataset used there, named ABSA, contains 2979 data samples labeled with three classes: Positive, Negative, and Neutral. The authors also collected a dataset of their own, containing 1601 data samples with the same three classes. Python NLTK was used for tokenizing and the TF-IDF Vectorizer for vectorization. Accuracy was 64.6% on the custom dataset and 73.49% on the ABSA dataset.
Chapter 3

Methodology

This chapter provides details about my proposed method and the other experiments conducted in this study, step by step. The methodology of a thesis offers intricate specificity of the work conducted by providing information about data sources, analysis techniques, performance scrutiny, etc.

3.1 System Architecture

The architecture of my system is depicted in Figure 3.1.

The system is decomposed into the following modules:

1. Data Collection - Collecting data from the web using data mining techniques.

2. Data Annotation - Adding labels to the collected data.

3. Preprocessing - Removing extra characters, tokenization, and removing stopwords.

4. Feature Extraction - Turning text documents into numerical feature vectors.

5. Classification - Using classification algorithms to classify the data.

6. Performance Measurement - The performance of each model is measured and compared with the others.

7. Prediction - The best model is used for real-life data prediction.



Figure 3.1: System Architecture

3.2 Data Collection

Data collection is the process of gathering and measuring information on variables of interest,
in an established systematic fashion that enables one to answer stated research questions, test
hypotheses, and evaluate outcomes. The data collection component of research is common to all
fields of study including physical and social sciences, humanities, business, etc. While methods
vary by discipline, the emphasis on ensuring accurate and honest collection remains the same.

3.3 Data Annotation

Data annotation is a necessary step before preprocessing the data. Annotation denotes adding a label to each data item, indicating the class to which that specific item belongs. After annotation, the dataset is stored as a text file or a CSV file. This dataset is then preprocessed for use in a classification model.

3.4 Preprocessing

Data preprocessing is a data mining technique that involves transforming raw data into an under-
standable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method
of resolving such issues. Data preprocessing prepares raw data for further processing.

Figure 3.2: Steps of data preprocessing.

To get accurate results, we must exclude all unnecessary words and punctuation. These techniques include removing extra characters, tokenization, and deletion of stopwords.

3.4.1 Removing Extra characters

Extra characters such as English letters, numbers, all URLs (e.g., www.xyz.com), hashtags (e.g., #topic), targets (@username), and all special characters are removed. That means all characters except Bangla letters and numbers are excluded from the sentences.

3.4.2 Tokenization

Tokenization is a method in which a sentence is divided into small parts (words), called tokens, which are then treated as separate units in the later processing steps.

3.4.3 Removing stopwords

A stopword is a commonly used word that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We do not want these words to take up space in our feature set, so we remove all stopwords, because stopwords carry little information or meaning on their own.
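The tokenization and stopword removal steps can be sketched as follows; the stopword set here is a tiny illustrative sample, not the full Bangla stopword list used in the experiments:

```python
# Illustrative subset of Bangla stopwords (conjunctions, particles, etc.).
BANGLA_STOPWORDS = {"এবং", "ও", "কি", "না", "এই", "যে", "তার"}

def tokenize(sentence):
    """Split a cleaned sentence into word tokens on whitespace."""
    return sentence.split()

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in BANGLA_STOPWORDS]

print(remove_stopwords(tokenize("বাংলাদেশ এবং ক্রিকেট")))
```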

3.5 Feature Extraction

Feature extraction is used to convert a collection of text documents into vectors of term/token counts. It also enables the preprocessing of text data prior to generating the vector representation. The count vector is a well-known encoding technique for making a word vector for a given document. CountVectorizer takes what is known as the bag-of-words approach: each message or document is divided into tokens, and the number of times each token occurs in a message is counted.

3.6 Classification

Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). To perform classification, we first split the dataset into two parts known as the train set and the test set. Then machine learning algorithms are trained on the train data and used to perform classification on the test data.

3.6.1 Data split for training and testing

Data split for training and testing means separating the columns into dependent and independent variables (features and labels), then splitting those variables into a train set and a test set.

Figure 3.3: Data splitting for training and testing.

After splitting, four variables are obtained: X_train, X_test, y_train, and y_test.

Where:

• X_train - This includes all the independent variables that will be used to train the model. As we have specified test_size = 0.2, 80% of the observations from the complete data will be used to train/fit the model and the remaining 20% will be used to test it.

• X_test - This is the remaining 20% portion of the independent variables, which is not used in the training phase and is used to make predictions that test the accuracy of the model.

• y_train - This is the dependent variable that needs to be predicted by the model; it includes the category labels for the independent variables. The dependent variable must be specified while training/fitting the model.

• y_test - This data holds the category labels for the test data; these labels are used to compare actual and predicted categories.

3.6.2 Naive Bayes

Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate, and reliable algorithm, with high accuracy and speed on large datasets [19].

The Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of the other features. For example, whether a loan applicant is desirable or not depends on his/her income, previous loan and transaction history, age, and location. Even if these features are interdependent, they are still considered independently. This assumption simplifies computation, and that is why it is considered naive. This assumption is called class conditional independence.

P(L|features) = P(L) P(features|L) / P(features)    (3.1)
Where,

• P(L): the probability of class L, regardless of the data. This is known as the prior probability of L.

• P(features): the probability of the data, regardless of the class. This is known as the evidence.

• P(L|features): the probability of class L given the data. This is known as the posterior probability.

• P(features|L): the probability of the data given that class L is true. This is known as the likelihood.

The Python library scikit-learn is the most useful library for building a Naïve Bayes model in Python. We have the following three types of Naïve Bayes model under the scikit-learn Python library −

3.6.2.1 Gaussian Naïve Bayes

It is the simplest Naïve Bayes classifier, with the assumption that the data for each label is drawn from a simple Gaussian distribution.

3.6.2.2 Multinomial Naïve Bayes

Another useful Naïve Bayes classifier is Multinomial Naïve Bayes, in which the features are assumed to be drawn from a simple multinomial distribution. This kind of Naïve Bayes is most appropriate for features that represent discrete counts.

3.6.2.3 Bernoulli Naïve Bayes

Another important model is Bernoulli Naïve Bayes, in which the features are assumed to be binary (0s and 1s). Text classification with the 'bag of words' model can be an application of Bernoulli Naïve Bayes.
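As a quick sketch, the three variants above map directly onto scikit-learn classes and share the same fit/predict interface (the tiny count matrix below is illustrative only, not the thesis data):

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Tiny illustrative word-count matrix: 4 documents, 3 features
X = [[2, 0, 1], [1, 1, 0], [0, 3, 2], [0, 2, 3]]
y = [0, 0, 1, 1]

for cls in (GaussianNB, MultinomialNB, BernoulliNB):
    model = cls().fit(X, y)  # same fit/predict API for all three variants
    print(cls.__name__, model.predict([[0, 2, 2]]))
```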

3.6.3 Support Vector Machine

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both classification and regression, though they are mostly applied to classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. They have their own unique way of implementation compared to other machine learning algorithms, and they have become popular for their ability to handle multiple continuous and categorical variables. An SVM model is basically a representation of the different classes separated by a hyperplane in a multidimensional space. The hyperplane is generated iteratively by SVM so that the classification error is minimized. The goal of SVM is to divide the dataset into classes by finding a maximum marginal hyperplane (MMH) [20].

Figure 3.4: SVM Classification.

The following are important concepts in SVM:

• Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.

• Hyperplane − As shown in the diagram above, it is the decision plane or space that divides a set of objects belonging to different classes.

• Margin − It may be defined as the gap between two lines at the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered good and a small margin is considered bad.

The SVM algorithm is implemented with a kernel that transforms the input data space into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space. In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions. This makes SVM more powerful, flexible, and accurate. The following are some of the kernels used by SVM:

3.6.3.1 Linear Kernel

It can be used as a dot product between any two observations. The formula of the linear kernel is as follows:

K(x, xi) = sum(x ∗ xi) (3.2)

From the formula above, the kernel value between two vectors x and xi is the sum of the products of each pair of input values.

3.6.3.2 Polynomial Kernel

It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces. The formula for the polynomial kernel is:

K(x, xi) = (1 + sum(x ∗ xi))^d    (3.3)

Here d is the degree of polynomial, which we need to specify manually in the learning algorithm.

3.6.3.3 Radial Basis Function (RBF) Kernel

The RBF kernel, mostly used in SVM classification, maps the input space into an indefinite-dimensional space. The following formula expresses it mathematically:

K(x, xi) = exp(−gamma ∗ sum((x − xi)^2))    (3.4)



Here, gamma ranges from 0 to 1 and must be specified manually in the learning algorithm. A good default value of gamma is 0.1.
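Equations 3.2-3.4 can be written out directly in plain Python as a sanity check (the vectors and parameter values below are chosen only to exercise the formulas):

```python
import math

def linear_kernel(x, xi):
    # Eq. 3.2: dot product of the two vectors
    return sum(a * b for a, b in zip(x, xi))

def polynomial_kernel(x, xi, d):
    # Eq. 3.3: (1 + dot product) raised to degree d
    return (1 + linear_kernel(x, xi)) ** d

def rbf_kernel(x, xi, gamma):
    # Eq. 3.4: exponential of the negative scaled squared distance
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

x, xi = [1, 2], [3, 4]
print(linear_kernel(x, xi))           # 1*3 + 2*4 = 11
print(polynomial_kernel(x, xi, d=2))  # (1 + 11)^2 = 144
print(round(rbf_kernel(x, xi, gamma=0.1), 4))
```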

Although I implemented SVM for linearly separable data here, it can also be applied in Python to data that is not linearly separable by using these kernels.

SVM is also a tool for text classification that can reduce the need for labeled training examples in both standard inductive and transductive settings. Image classification, a part of image processing, can likewise be performed using support vector machines; in experimental analyses SVM achieved higher precision than other conventional schemes after only a few rounds of feedback, following the same general approach as ordinary text analysis. The SVM algorithm has also been widely used in the biological and other sciences, where SVM classification has correctly identified up to 90% of the compounds considered, and support vector machine weights have been used to interpret SVM models. Post-hoc interpretation of support vector machine models in order to identify features is a relatively new area of research with special importance in the biological sciences [22].

3.6.4 Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. It is mainly used for predicting between two possible classes, but it can be extended to multi-class problems using one-vs-rest (OVR) logistic regression or multinomial logistic regression, in which the dependent variable can have 3 or more possible unordered types. Logistic regression uses a sigmoid function, called the logistic sigmoid function, to map probability values to the respective classes. The logistic sigmoid function can be written as,

f(X) = 1 / (1 + e^(−X))    (3.5)

In simple words, the dependent variable is binary in nature, with data coded as either 1 (success/yes) or 0 (failure/no).
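The logistic sigmoid, in its standard form 1/(1 + e^(−x)), and its use as a binary decision rule can be sketched as follows (the 0.5 threshold is the usual convention):

```python
import math

def sigmoid(x):
    # Logistic sigmoid: maps any real value into the interval (0, 1)
    return 1 / (1 + math.exp(-x))

def classify(x, threshold=0.5):
    # Probability >= threshold -> class 1 (success), else class 0 (failure)
    return 1 if sigmoid(x) >= threshold else 0

print(sigmoid(0))      # 0.5: the decision boundary
print(classify(2.3))   # 1
print(classify(-1.7))  # 0
```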

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be used for various classification problems such as spam detection, diabetes prediction, cancer detection, etc. [26].

Generally, logistic regression means binary logistic regression with a binary target variable, but target variables with more categories can also be predicted by it. Based on the number of categories, logistic regression can be divided into the following types:

3.6.4.1 Binary or Binomial

In this kind of classification, the dependent variable has only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.

3.6.4.2 Multinomial

In this kind of classification, the dependent variable can have 3 or more possible unordered types, i.e. types with no quantitative significance. For example, these variables may represent "Type A", "Type B", or "Type C".

3.6.4.3 Ordinal

In this kind of classification, the dependent variable can have 3 or more possible ordered types, i.e. types with quantitative significance. For example, these variables may represent "poor", "good", "very good", or "excellent", and each category can have a score such as 0, 1, 2, 3.

3.7 Performance Measurement

In this section I discuss the performance measurement of the models that were used to classify the data. There are various metrics that can be used to evaluate the performance of ML algorithms; here I cover the metrics used to evaluate predictions for classification problems.

3.7.1 Confusion Matrix

It is the easiest way to measure the performance of a classification problem where the output can consist of two or more types of classes. A confusion matrix is simply a table with two dimensions, "Actual" and "Predicted", whose cells are "True Positives (TP)", "True Negatives (TN)", "False Positives (FP)", and "False Negatives (FN)", as shown below:

Figure 3.5: Confusion Matrix.

The terms associated with the confusion matrix are explained as follows:

• True Positives (TP) − These are correctly predicted positive values: the actual class is yes and the predicted class is also yes. E.g. the actual class indicates that a passenger survived and the predicted class says the same.

• True Negatives (TN) − These are correctly predicted negative values: the actual class is no and the predicted class is also no. E.g. the actual class says a passenger did not survive and the predicted class says the same.

False positives and false negatives occur when the actual class contradicts the predicted class.

• False Positives (FP) − The actual class is no but the predicted class is yes. E.g. the actual class says a passenger did not survive, but the predicted class says the passenger survived.

• False Negatives (FN) − The actual class is yes but the predicted class is no. E.g. the actual class indicates that a passenger survived, but the predicted class says the passenger died [21].

We can use the confusion_matrix function of sklearn.metrics to compute the confusion matrix of our classification model.
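A minimal sketch of that call (the labels below are toy values, not the thesis data; rows of the result are actual classes and columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0]   # true class labels
y_predicted = [1, 0, 0, 1, 1]   # labels predicted by some classifier

cm = confusion_matrix(y_actual, y_predicted)
print(cm)
# For binary labels ordered [0, 1] the layout is:
# [[TN FP]
#  [FN TP]]
```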

3.7.2 Accuracy

Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total observations, i.e. the proportion of predictions a model got right, usually expressed as a percentage. Formally, it has the following definition:

Accuracy = (TruePositive + TrueNegative) / (TruePositive + TrueNegative + FalsePositive + FalseNegative)    (3.6)

3.7.3 Precision

Precision is the fraction of the sentences assigned to a particular class by the classifier that actually belong to that class.

Precision = TruePositives / (TruePositives + FalsePositives)    (3.7)

3.7.4 Recall

Recall is the fraction of the sentences in the test set actually belonging to a particular class that are correctly labeled by the classifier.

Recall = TruePositives / (TruePositives + FalseNegatives)    (3.8)

3.7.5 F1 Score

F1 Score is the weighted (harmonic) average of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. F1 is usually more useful than accuracy, especially with an uneven class distribution. Accuracy works best when false positives and false negatives have similar costs; if their costs are very different, it is better to look at both Precision and Recall [21].

F1 Score = 2 ∗ (Recall ∗ Precision) / (Recall + Precision)    (3.9)
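The three formulas above can be checked with a few lines of arithmetic (the counts below are made up for illustration):

```python
# Illustrative counts from a hypothetical confusion matrix
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)                             # Eq. 3.7
recall = tp / (tp + fn)                                # Eq. 3.8
f1 = 2 * (recall * precision) / (recall + precision)   # Eq. 3.9

print(precision)         # 0.8
print(round(recall, 4))  # 0.6667
print(round(f1, 4))      # 0.7273
```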

3.8 Prediction

Prediction is done through predictive analytics, which uses historical data to predict future events. Typically, historical data is used to build a mathematical model that captures the important trends. That predictive model is then applied to current data to predict what will happen next, or to suggest actions to take for optimal outcomes.

The model that provides the best accuracy score is saved as a .sav file. This saved model is used for performing further predictions on real-life data.
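The thesis does not reproduce the saving code, but persisting a fitted model to a .sav file is commonly done with Python's pickle module; a sketch under that assumption (the file name and toy data are illustrative):

```python
import pickle
from sklearn.naive_bayes import MultinomialNB

# Fit a small model on toy counts (illustrative only)
X, y = [[2, 0], [0, 3], [1, 0], [0, 2]], [0, 1, 0, 1]
model = MultinomialNB().fit(X, y)

# Persist the trained model to disk ...
with open('final_model.sav', 'wb') as f:
    pickle.dump(model, f)

# ... and reload it later for prediction on new data
with open('final_model.sav', 'rb') as f:
    loaded = pickle.load(f)

print(loaded.predict([[3, 0]]))  # same predictions as the original model
```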
Chapter 4

Experimental Analysis and Results

This chapter provides implementation details of the designed experiments described in Chapter 3. Moreover, it provides information about the performance evaluation analysis of the experiments conducted in this research work.

4.1 Data Collection and Annotation

For my work, a dataset was needed containing supporters' comments that convey their thoughts, attitudes, and opinions towards the Bangladesh national cricket team. The sources of my dataset are a verified Facebook page named 'bdcrictime.com' [17], which provides news and updates about cricket in Bangladesh, and the sports news on the Facebook page of Daily Prothom Alo [18], a popular newspaper of Bangladesh. I used a web crawler, a Python script, to collect all the required data (posts and comments) from the page links. The data was collected from Facebook into a text file.

The data collected via web crawling [25] had to be separated manually. In the text file all the data was arranged in separate rows. I then read through the data to find the irrelevant posts and comments and exclude them, in order to obtain a clean dataset. To train my models I needed tagged data, so a new column named 'label' was added to the text file to annotate the posts and comments, which were kept in the second column named 'sentence'. I manually tagged the data as positive, negative, or neutral, one entry at a time, and saved everything in a text file, which served as my main dataset.

My dataset was populated with 2501 labeled sentences, where each sentence contains approximately 3-100 Bengali words. This was the head of the dataset before anything was changed:

Label Sentence
neutral েটস্ট িসিরেজ িক লাল বেল েখলা হেব? নািক েগালািপ বেল।
negative সু জন,সািবব্র,ইমরুল, সার িলটন দাস সব ধরেনর িকৰ্েকট েথেক বাদ িদেত হেব।।
positive মাশরািফ বেলন, সািকব বেলন দু জন ই আমােদর গবর্।
negative বাংলােদেশর েকউ ৫০ রােনর িনেচ আউট হেল, তােক ১০ লক্ষ টাকা জিরমানা করা উিচত।
negative এক ম াচ ভােলা েখলেলই দশ ম াচ খারাপ েখলা েখলয়ারেক িনেয় হইচই করার িকছু ই েনই।

Table 4.1: Head of the Dataset

The dataset contains 1201 Negative comments, 947 Positive comments, and 353 Neutral com-
ments.

Figure 4.1: Simple Statistical Analysis of the dataset.



4.2 Reading Data

As mentioned before, I have a text file consisting of 2501 sentences with associated labels. A piece of Python script was written to read them line by line and store them in two lists named labels and sentences. These two lists were then combined into one dataframe with two columns named 'label' and 'sentence'.
Description of my dataframe is given below:

Figure 4.2: Description of the Dataframe.
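A sketch of that reading step, assuming each line of the text file holds a label and a sentence separated by a tab (the actual delimiter is not specified in the text, and the sample lines are placeholders):

```python
import pandas as pd

# Stand-in for lines read from the text file (tab-separated, illustrative)
lines = [
    "positive\tgreat win for the team",
    "negative\tpoor batting again",
    "neutral\twhen is the next match",
]

labels, sentences = [], []
for line in lines:
    label, sentence = line.strip().split("\t", 1)  # split on the first tab only
    labels.append(label)
    sentences.append(sentence)

# Combine the two lists into one dataframe with the two thesis columns
df = pd.DataFrame({"label": labels, "sentence": sentences})
print(df.shape)  # (3, 2)
```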

4.3 Data Preprocessing

After constructing the dataframe I performed various data preprocessing operations such as label encoding, removing extra characters, tokenization, removing stopwords, and word-to-vector transformation. After preprocessing, all the sentences were converted to vector form before being used in model evaluation.

As the first step of preprocessing, label encoding was performed, where all the labels were converted to ids as below:
Before Encoding After Encoding
neutral 0
positive 1
negative 2

Table 4.2: Label Encoding
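The mapping in Table 4.2 can be applied with a plain dictionary; note that a generic encoder that sorts labels alphabetically would produce a different mapping, so an explicit map matching the table is used in this sketch:

```python
# Explicit encoding matching Table 4.2
label_map = {"neutral": 0, "positive": 1, "negative": 2}

labels = ["neutral", "negative", "positive", "negative"]
encoded = [label_map[l] for l in labels]
print(encoded)  # [0, 2, 1, 2]
```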

Then I removed all the unnecessary characters from the sentences, so that they cannot have any impact on the data classification.

Figure 4.3: Removing Extra Characters.

After that, tokenization was performed, which split the sentences into words as below:

Figure 4.4: Tokenization.

A text file containing a list of 193 stopwords was used to remove the stopwords that have no impact on the meaning of the sentences. The stopwords were listed by analyzing the dataset: a piece of Python code reported the number of times each word appears in positive and negative sentences. The stopword removal was as below:

Figure 4.5: Stopwords Removal.
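Tokenization and stopword removal together can be sketched as below; English placeholder tokens stand in for the Bengali words, and the 193-word stopword list is assumed to have been loaded into a set:

```python
# Hypothetical stopword set standing in for the 193-word list
stopwords = {"the", "is", "a", "of"}

sentence = "the team is playing a great match"
tokens = sentence.split()  # simple whitespace tokenization
filtered = [w for w in tokens if w not in stopwords]
print(filtered)  # ['team', 'playing', 'great', 'match']
```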

4.4 Feature Extraction

Before being used for predictive modeling, text data requires special preparation. To serve as input to a machine learning algorithm, all the words need to be converted into integer or floating-point numbers, which is referred to as vectorization. A feature matrix was created in which each sentence was represented by a vector over the vocabulary. Term frequency-inverse document frequency (TF-IDF), one of the most powerful vectorization tools, was used for calculating the features; it measures relevance, not just the frequency of the words. In this work, n-grams were considered as the features while classifying the opinions. After performing vectorization, the head of the dataset was:

Figure 4.6: Vectorization.

Then 80% of the data was kept for training purposes and the rest was used for testing.
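The vectorization and split described above might look like the following sketch, using TfidfVectorizer with an n-gram range followed by an 80/20 split (the sentences are placeholders, not the thesis data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

sentences = ["good win", "bad loss", "great innings", "poor bowling", "good match"]
labels = [1, 2, 1, 2, 1]

# Unigrams and bigrams as features, weighted by TF-IDF
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

# 80% training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)
print(X_train.shape[0], X_test.shape[0])  # 4 1
```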

4.5 Training the Models

After splitting the data into test and train sets, the models have to be trained with X_train and y_train. Before that, the necessary models had to be imported from Scikit-learn (formerly scikits.learn, also known as sklearn), a free machine learning library for the Python programming language [23].

Figure 4.7: Importing necessary models.

4.5.1 Training Naive Bayes Classifiers

I trained the Multinomial, Bernoulli, and Gaussian Naive Bayes models with the model.fit() method as below:

Figure 4.8: Training Multinomial, Bernoulli, and Gaussian Naive Bayes Classifiers.

4.5.2 Training the SVM Classifier

Then, I trained the SVM classifier with a Linear kernel, also with the model.fit() method, as below:

Figure 4.9: Training the SVM classifier with different kernels.

4.5.3 Training the Logistic Regression Classifier

Then, I trained the Multinomial Logistic Regression classifier, also with the model.fit() method, as below:

Figure 4.10: Training the Logistic Regression classifier.



4.6 Results Calculation

As mentioned before, I trained the models with 80% of the data and let the trained models predict the class of the remaining 20% (the test data). Since the dataset was populated with 2501 sentences, 20% of them, i.e. 501 sentences, were used as the test data, and the trained models make predictions on these 501 sentences.

A method was defined to perform this task. Besides making predictions, this method calculates the accuracy of the model and constructs a confusion matrix of the predictions; finally, a heatmap of the confusion matrix is drawn. The definition of the method is shown in Figure 4.11.

Figure 4.11: Definition of the method for calculating accuracy and drawing the heatmap of the confusion matrix.

Another method was defined to print the classification report, shown in Figure 4.12.

Figure 4.12: Definition of the method for printing the classification report.

If we call the calculate_accuracy method with the name and model parameters for any model, we get the accuracy for that specific model, and a heatmap of the confusion matrix is also printed. The report method generates the classification report, which contains precision, recall, and F1 score. The outputs of calling both methods for the different models are discussed one by one on the following pages:
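A hedged sketch of what such helpers might look like (the names calculate_accuracy and report mirror the text, but the bodies are my reconstruction; the heatmap drawing is replaced here by printing the matrix):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB

def calculate_accuracy(name, model, X_test, y_test):
    # Predict the held-out data, report accuracy, and show the confusion matrix
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name} accuracy: {acc:.2%}")
    print(confusion_matrix(y_test, y_pred))
    return acc

def report(model, X_test, y_test):
    # Per-class precision, recall, and F1 score
    print(classification_report(y_test, model.predict(X_test)))

# Tiny usage example with a toy model (illustrative counts only)
toy = MultinomialNB().fit([[1, 0], [0, 1], [2, 0], [0, 2]], [0, 1, 0, 1])
acc = calculate_accuracy("ToyNB", toy, [[3, 0], [0, 3]], [0, 1])
report(toy, [[3, 0], [0, 3]], [0, 1])
```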

4.6.1 Multinomial Naive Bayes

Accuracy, confusion matrix and classification report for Multinomial Naive Bayes classification
are shown below:

Figure 4.13: Accuracy for Multinomial Naive Bayes classifier.

Figure 4.14: Confusion matrix for Multinomial Naive Bayes classifier.

Figure 4.15: Classification report for Multinomial Naive Bayes classifier.



4.6.2 Bernoulli Naive Bayes

Accuracy, confusion matrix and classification report for Bernoulli Naive Bayes classification
are shown below:

Figure 4.16: Accuracy for Bernoulli Naive Bayes classifier.

Figure 4.17: Confusion matrix for Bernoulli Naive Bayes classifier.

Figure 4.18: Classification report for Bernoulli Naive Bayes classifier.



4.6.3 Gaussian Naive Bayes

Accuracy, confusion matrix and classification report of Gaussian Naive Bayes classification are
shown below:

Figure 4.19: Accuracy of Gaussian Naive Bayes classification.

Figure 4.20: Confusion matrix of Gaussian Naive Bayes classification.

Figure 4.21: Classification report of Gaussian Naive Bayes classification.



4.6.4 Linear SVC

Accuracy, confusion matrix and classification report for Linear SVC are shown below:

Figure 4.22: Accuracy for Linear SVC.

Figure 4.23: Confusion matrix for Linear SVC.

Figure 4.24: Classification report for Linear SVC.



4.6.5 Logistic Regression

Accuracy, confusion matrix and classification report for Logistic Regression are shown below:

Figure 4.25: Accuracy for Logistic Regression.

Figure 4.26: Confusion matrix for Logistic Regression.

Figure 4.27: Classification report for Logistic Regression.



4.7 Results Comparison

The accuracies of the Multinomial, Bernoulli, and Gaussian Naive Bayes classifiers, the SVC classifier with a Linear kernel, and the Logistic Regression classifier are shown below:

Model Accuracy
Logistic Regression 83.03%
Linear SVC 80.44%
GaussainNB 79.44%
MultinomialNB 78.84%
BernoulliNB 74.85%

Table 4.3: Comparing accuracy of the models.

Gaussian Naive Bayes has an accuracy of 79.44%, the highest among the Naive Bayes classifiers. Multinomial and Bernoulli Naive Bayes have accuracies of 78.84% and 74.85% respectively. The SVC model with a Linear kernel was more accurate, with an accuracy of 80.44%. But

Figure 4.28: Comparison of accuracy among the models.



the best accuracy was achieved by Multinomial Logistic Regression: 83.03%. All the accuracy scores are quite good compared with some related works using the same classifiers [14][15][16].

This comprehensive study shows that Multinomial Logistic Regression gives a promising accuracy. Thus, I selected it as my final model.
Chapter 5

Conclusion and Future Directions

This chapter provides the concluding remarks of this thesis work, its limitations, and its future direction. In Section 5.1 the conclusion of this thesis work is provided, in Section 5.2 the limitations are discussed, and lastly in Section 5.3 the future direction of this thesis work is outlined.

5.1 Conclusion

In this work, I have analyzed the performance of sentiment analysis on Bangla textual data by
developing an automated system for detecting the polarity of public opinions on Bangladesh
cricket using multiple machine learning algorithms. Due to the lack of a standardized labeled dataset in this domain, reviews were collected from various Facebook pages using a custom crawler. After manual annotation, the data was preprocessed to remove noise and reduce the feature space. Following that, the data was tokenized and vectorized for classification. Multinomial, Bernoulli, and Gaussian Naïve Bayes (NB), Support Vector Machines (SVM) with a Linear kernel, and Multinomial Logistic Regression were used as classification algorithms. The experimental results showed that the Gaussian Naive Bayes classifier achieves a promising precision of 0.803, with recall and F1 score of 0.794 and 0.797 respectively. SVM with a linear kernel shows more consistency across precision, recall, and F1 score: 0.809, 0.804, and 0.802 respectively. But Multinomial Logistic Regression shows the best result, with precision, recall, and F1 score of 0.833, 0.83, and 0.825 respectively.

5.2 Limitation

Since the system considered a small dataset, the underlying semantic relationships among features are not captured properly. Semantic analysis of larger data is required in this domain to explore better linguistic knowledge for sentiment extraction [11]. The n-gram features (unigrams and bigrams) perform poorly in comparison with the baseline results in [24]; one reason for this could be the smaller training dataset used in this experiment.

5.3 Future Directions

This research shows promise for further development. An improvement can be made to the performance of this thesis work by enriching the dataset, and the use of proper natural language processing would greatly improve the system. As I am working on public opinions written in the Bengali language, spell-checking and Bengali parts-of-speech tagging would definitely make a positive impact on the performance of the work. Finally, a useful application can be developed for users by implementing these concepts.
Bibliography

[1] Yeasir Arefin Tusher and Md Rubel, "Popularity Assessment of Cricket Player Based on Bangla Text in Social Media", pages 1-2, 2019.

[2] Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using dis-
tant supervision. Technical report, Stanford.

[3] Pak, A., and Paroubek, P. 2010 (May). Twitter as a corpus for sentimentanalysis and opin-
ion mining. In N. C. C. Chair, K. Choukri, B. Maegaard, J.Mariani, J. Odijk, S. Piperidis,
M. Rosner, and D. Tapias (eds.), Proceedings of the Seventh Conference on International
Language Resources and Evaluation (LREC’10), Valletta, Malta, ELRA, pp.19–21. Euro-
pean Language Resources Association.

[4] Davidov, D., Tsur, O., and Rappoport, A. 2010a. Enhanced sentiment learning using Twit-
ter hashtags and smileys. In Proceedings of the 23rd International Conference on Compu-
tational Linguistics: Posters, COLING ’10, pp. 241–9. Stroudsburg, PA: Association for
Computational Linguistics.

[5] Kanayama H. and Nasukawa T., “Fully automatic lexicon expansion for domain-oriented
sentiment analysis”. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, Sydney, Australia, 2006.

[6] Kobayashi N., Inui K., Tateishi K., and Fukushima T., “Collecting evaluative expressions
for opinion extraction”. In Proceedings of IJCNLP 2004, pages 596–605, 2004.

[7] Suzuki Y., Takamura H., and Okumura M., “Application of semi-supervised learning to
evaluative expression classification”. In Proceedings of the 7th International Conference
on Intelligent Text Processing and Computational Linguistics, 2006.


[8] Takamura H., Inui T., and Okumura M., “Latent variable models for semantic orientations
of phrases”. In Proceedings of the 11th Meeting of the European Chapter of the Association
for Computational Linguistics, 2006.

[9] Hu Y., Duan J., Chen X., Pei B., and Lu R., “A new method for sentiment classification
in text retrieval”. In IJCNLP, pages 1–9, 2005.

[10] Zagibalov T. and Carroll J., “Automatic seed word selection for unsupervised sentiment
classification of chinese text”. In Proceedings of the Conference on Computational Lin-
guistics, 2008.

[11] N. Banik and M. Hasan Hafizur Rahman, “Evaluation of naïve bayes and support vector
machines on bangla textual movie reviews,” in 2018 International Conference on Bangla
Speech and Language Processing (ICBSLP), Sep. 2018, pp. 1–6.

[12] S. Abu Taher, K. Afsana Akhter, and K. M. Azharul Hasan, “N-gram based sentiment
mining for bangla text using support vector machine,” in 2018 International Conference
on Bangla. Speech and Language Processing (ICBSLP), Sep. 2018, pp. 1–5.

[13] N. Irtiza Tripto and M. Eunus Ali, “Detecting multilabel sentiment and emotions from
bangla youtube comments,” in 2018 International Conference on Bangla Speech and Lan-
guage Processing (ICBSLP), Sep. 2018, pp. 1–6.

[14] A. Hassan, N. Mohammed, and A. K. A. Azad, “Sentiment analysis on bangla and roman-
ized bangla text (brbt) using deep recurrent models,” 10 2016.

[15] Sazzed, Salim and Jayarathna, Sampath. ”A Sentiment Classification in Bengali and Ma-
chine Translated English Corpus”,2019 IEEE 20th International Conference on Informa-
tion Reuse and Integration for Data Science (IRI),pages 107–114, 2019,IEEE.

[16] S. Arafin Mahtab, N. Islam, and M. Mahfuzur Rahaman, "Sentiment analysis on bangladesh cricket with support vector machine", in 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), Sep. 2018, pp. 1-4.

[17] https://www.facebook.com/bdcrictime

[18] https://www.facebook.com/DailyProthomAlo

[19] https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_
with_python_classification_algorithms_naive_bayes.htm

[20] https://www.tutorialspoint.com/machine_learning_with_python/machine_
learning_with_python_classification_algorithms_support_vector_machine.htm

[21] https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-
performance-measures/

[22] Platt J., Microsoft Research. Microsoft Way, Redmond , WA 98052 USA.
http://www.research.microsoft.com/ jplatt

[23] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand
Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg,
Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Perrot, Édouard Duch-
esnay (2011). ”Scikit-learn: Machine Learning in Python”. Journal of Machine Learning
Research. 12: 2825–2830.

[24] Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using dis-
tant supervision. Technical report, Stanford.

[25] https://www.sciencedaily.com/terms/web_crawler.htm

[26] https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_
python_classification_algorithms_logistic_regression.htm
