May 2023
The authors declare that they are the sole authors of this thesis and that they have not used
any sources other than those listed in the bibliography and identified as references. They further
declare that they have not submitted this thesis at any other institution to obtain a degree.
Contact Information:
Author(s):
Prashuna Sai Surya Vishwitha Domadula
E-mail: prdo22@student.bth.se
University advisor:
Dr Prashant Goswami, Associate Professor
Department of Computer Science and Engineering
AI Artificial Intelligence
ML Machine Learning
Methods: This thesis employs a quantitative research method, with data analysed
using traditional machine learning. The labelled data set comes from Kaggle
(https://www.kaggle.com/datasets) and contains movie review information.
Models based on the lexicon-based approach and the BERT neural network are
trained on the chosen IMDb movie reviews data set. To discover which model
performs best at sentiment classification, the constructed models are assessed
on the test set using evaluation metrics such as accuracy, precision, recall
and F1 score.
Results: In the conducted experiments, the BERT neural network was the most
effective algorithm at classifying the IMDb movie reviews into positive and
negative sentiments. It achieved the highest accuracy score of 90.67%, followed
by the BoW model with an accuracy of 79.15% and the TF-IDF model with 78.98%.
The BERT model also had the best precision and recall, 0.88 and 0.92
respectively; the BoW model had a precision and recall of 0.79, and the TF-IDF
model a precision of 0.79 and a recall of 0.78. Likewise, the BERT model had
the highest F1 score, 0.88, followed by the BoW model with 0.79 and the TF-IDF
model with 0.78.
Conclusions: Of the two approaches evaluated, the lexicon-based approach and
the BERT transformer neural network, the BERT neural network is the more
effective, performing well on all of the measured performance criteria.
Keywords: Bag of Words(BoW), Deep Learning, IMDb Movie Reviews, Machine
Learning, Natural Language Processing(NLP), Sentiment Analysis, Term Frequency-
Inverse Document Frequency(TF-IDF).
Acknowledgments
We would like to thank everyone who supported us, directly or indirectly. Last
but not least, we want to thank our parents for their trust in us, without which
we would not have been able to complete this thesis.
Contents
Nomenclature
Abstract
Acknowledgments
1 Introduction
  1.1 Ethical, societal and sustainability aspects
  1.2 Aim and Objectives
    1.2.1 Aim
    1.2.2 Objectives
  1.3 Research questions
  1.4 Defining the scope of the thesis
  1.5 Outline
2 Background
  2.1 Sentiment Analysis
    2.1.1 Natural Language Processing
    2.1.2 Machine learning
  2.2 Performance Metrics
    2.2.1 Accuracy
    2.2.2 Precision
    2.2.3 Recall
    2.2.4 F1 score
    2.2.5 Epoch and accuracy curve
    2.2.6 Epoch and loss curve
3 Related Work
4 Method
  4.1 Literature review
  4.2 Experiment
    4.2.1 Software tools
    4.2.2 Data collection and visualization
    4.2.3 Removing HTML tags and noises in the text
    4.2.4 Removing special characters
    4.2.5 Word stemming
    4.2.6 Removing stop words
    4.2.7 Tokenization
    4.2.8 Word Embedding
    4.2.9 Data splitting: training, validation, and testing
  4.3 Classifiers
    4.3.1 Lexicon-based approach
    4.3.2 BERT neural network
  4.4 Sentiment Classification
5 Results and Analysis
6 Discussion
7 Conclusions and Future Work
References
List of Figures
2.1 Flowchart depicting various concepts from which we derived the methods employed in the thesis.
2.2 Working of the lexicon-based approach
2.3 Structure of a neural network
2.4 Working of a BERT neural network in classifying the IMDb movie reviews
5.1 A bar plot comparing the accuracy scores of the three models employed in this study.
5.2 Graph plotted using the BERT model to demonstrate the model's accuracy on validation data.
5.3 A bar plot comparing the precision scores of the three models employed in this study.
5.4 A bar plot comparing the recall scores of the three models employed in this study.
5.5 A bar plot comparing the F1 scores of the three models employed in this study.
6.1 Confusion matrix of the BoW model showing the number of classified true positives and true negatives on the IMDb movie reviews data set.
6.2 Confusion matrix of the TF-IDF model showing the number of classified true positives and true negatives on the IMDb movie reviews data set.
6.3 Confusion matrix of the BERT model showing the number of classified true positives and true negatives on the IMDb movie review data set.
6.4 A curve plot showing the comparison of performance metrics evaluated for the three classifier models used in the thesis.
List of Tables
5.1 Table showing the results of the literature survey conducted in our thesis.
5.2 Continuation of Table 5.1, showing results of the literature survey.
5.3 Continuation of Table 5.1, showing results of the literature survey.
5.4 Continuation of Table 5.1, showing results of the literature survey.
5.5 Summary of Table 5.1
5.6 Continuation of the summary of Table 5.1
5.7 Continuation of the summary of Table 5.1
5.8 Table of the number of reviews classified into positive and negative by each classifier model used in the study.
5.9 Table of accuracy scores for all classifier models in this study.
5.10 Table of precision scores for all classifier models in this study.
5.11 Table of recall values for all classifier models in this study.
5.12 Table of F1 scores for all classifier models in this study.
Chapter 1
Introduction
In the modern age, social media trends make it easy for people to be influenced,
and movies have a significant influence on people's lives. Their effect on the
public's view of subjects such as social security, politics and religion gives
them a significant cultural impact. They are also important sources of
entertainment, providing viewers with an emotionally stimulating experience,
and they can shape today's culture in music, fashion and other elements.
Generally speaking, films can teach viewers about topics that are important to
them, such as history, science and literature [1], and they can support
people's efforts to find an interest or career. By convincing individuals that
certain acts are appropriate or desirable, they exert a behavioural influence
and can ultimately change people's behaviour. In short, movies can have a big
impact on how people feel, think and act, so it is important to understand the
messages they convey. They can also reinforce societal norms and values,
helping to shape individuals' attitudes and beliefs.
Whenever a movie reaches its audience, its performance and the number of viewers
can be used to judge how good it is. Viewers hold many opinions, and different
people describe a movie's quality in different ways. One such form is a text
review of a movie, the opinion of an individual who has seen it [2]. IMDb is
one of the websites that hosts a huge selection of movie reviews, though there
are many more online platforms where individuals publish such reviews.
Information about films is found in the Internet Movie Database (IMDb) at
https://www.imdb.com/. Users can search through a large collection of works and
filter their results by parameters such as genre, release year and popularity.
IMDb incorporates user-generated information, such as ratings and reviews
written by viewers of a film, television programme or video game. Users can
rate titles on a scale of 1 to 10 and submit reviews to convey their ideas and
opinions.
People are interested in film reviews and ratings, and these evaluations are
essential for understanding how well films perform. A film's quantitative
success or failure may be measured by awarding it a particular number of stars,
but a collection of reviews can provide a deeper qualitative understanding of
the film's many features. There are various ways of submitting a review. The
first is a numerical rating composed of many ratings, and the second is a
written evaluation. Rating-based evaluations are brief, to the point, and
numbered from 1 to 10. Textual movie reviews highlight a film's virtues and
shortcomings, and a deeper study of a review shows whether the film matched the
reviewer's expectations [3].
Let us take as an example the recent movie "RRR" and consider two reviews from
the IMDb website.
As the movie reviews above show, a particular film can garner a range of
reactions: positive, negative or neutral. How, then, can one assess the overall
quality of a film? The most straightforward solution is to integrate the
findings of the various film reviews and award a grade of good or poor. This
can be done through "sentiment analysis". The natural language processing
technique of "sentiment analysis" [5], also referred to as "opinion mining",
can be used to identify or classify the emotional tone of a text. A movie's
performance, or its effect on a specific audience, can be assessed by
classifying its reviews, and the ratio of positive to negative reviews can be
analysed to determine how well the film business is doing.
In the modern digital landscape, individuals have an unparalleled opportunity
to voice their thoughts and share personal experiences through online
platforms. One notable aspect of this phenomenon is the ability to write and
publish movie reviews. Online movie reviews have become highly significant and
influential for a variety of reasons, as discussed earlier in this chapter.
The need to understand, mine and analyse this data has grown considerably as a
result of the massive amount of data exchanged and produced every second.
Machine learning methods and deep learning neural networks became key in the
big data era because conventional models were insufficient to obtain results
at this scale.
There are numerous methods for performing sentiment analysis, the most
prevalent of which is the classic method that depends on Natural Language
Processing (NLP) [6] techniques; however, it requires some manual feature
extraction. Because of the growing interest in social media analysis,
Artificial Intelligence (AI) [7] technologies connected to text analysis have
received more attention. Deep learning models can be trained on several layers
of labelled data using a neural network architecture, and they sometimes
perform better than humans. These methods eliminate the need for manual feature
extraction by extracting features directly from the data.
This thesis aims to perform a comparative sentiment analysis on textual IMDb
movie reviews. These text-based reviews must first be converted into a form a
machine (i.e. an ML classifier) can understand. By using a traditional NLP
technique called the lexicon-based approach and a deep learning model called
the BERT neural network, we achieve this classification. Different performance
evaluation metrics are then used to identify the most effective learning model
on the IMDb data set.
1.2.2 Objectives
The primary objectives of the thesis can be stated as follows:
• To split the data into training, validation and testing sets, and to build
and train the selected algorithms on the training data.
• To evaluate the trained models on the test set using performance metrics such
as accuracy, precision, recall, F1 score and the area under the accuracy and
loss curves, as discussed in section 2.2, and then, based on the evaluation's
findings, to identify the most efficient algorithm.
2. Which of the two, the lexicon-based approach or the BERT neural
network, is more accurate in conducting sentiment analysis?
1.5 Outline
• Chapter 1, the introduction, provides a basic overview of the topic at hand
and how sentiment analysis might help address the problem. In addition, the
various approaches we chose for implementing the idea, as well as an overview
of the scope of the thesis, are presented here.
• Chapter 2 comprises the background, including an introduction to the tools
and subjects utilised in the thesis to familiarise readers with them.
• The background is followed by chapter 3 on related work, which describes the
studies we reviewed as part of the literature study. These also helped us
identify the research gap addressed in the thesis.
• Chapter 4 is the method chapter, comprising the implementation portion of
the thesis work using the experimental methodology. It helps the reader
understand the procedure followed to obtain the intended outcomes.
• Then follows chapter 5, the results and analysis, which examines the
collected results and provides a full analysis of the outcomes.
• Finally, in chapter 7, conclusions and future work, we present findings based
on our experimental analysis, as well as possible extensions of the study.
Chapter 2
Background
This chapter explains the different techniques and themes used in this thesis.
Extensive definitions and explanations of each topic help readers better
understand the thesis work. Figure 2.1 below shows the mapping of the different
techniques employed in the study.
Figure 2.1: Flowchart depicting various concepts from which we derived the methods
employed in the thesis.
Extracting from a text the relevant data that conveys its underlying sentiment
or emotion can be done through the lexicon-based approach, the machine learning
approach, or a hybrid approach that integrates both. Sentiment analysis can be
carried out at the word or phrase level, the sentence level, or the document
level.
Types of sentiment analysis:
Sentiment analysis can be carried out in different ways for different uses. The fol-
lowing are the categories of sentiment analysis.
1. Fine-grained sentiment analysis: Beyond the commonly used categories of
positive, neutral and negative, this approach assigns a rating scale such
as 1 to 5 or 1 to 10 [9].
2. Emotion detection sentiment analysis: A more advanced approach for
recognizing feelings in text, this type of analysis aids in detecting and
comprehending people's emotions, such as anger, sadness, happiness,
frustration, fear and panic [10].
3. Aspect-based analysis: This kind of sentiment analysis focuses primarily
on a certain aspect of a service. Consider, for example, a corporation or
organisation with products and customers: aspect-based sentiment analysis
may assist organisations in automatically sorting and analysing client
data, and automating activities such as customer care duties helps acquire
substantial insights [11].
4. Intent-based sentiment analysis: Intent classification is the automatic
classification of textual material based on the customer's intent. An
intent classifier can automatically examine texts and reports and
categorize them [12].
1. Tokenization: It is the process of breaking a whole text into tokens which are
nothing but small individual phrases or words.
2. Parts of speech tagging: It is the process in which each word of a sentence
is labelled to its corresponding parts of speech.
3. Named entity recognition: It is the process of recognizing named entities
in a text and classifying them as people, places, etc.
4. Machine translation: It is the process of translating a text from one language
to another.
Lexicon-Based Approach :
The lexicon-based approach, also called the rule-based approach, primarily uses
a dictionary of words with a predefined set of sentiments [14]. A lexicon is a
set of features, each with an assigned sentiment value. It is essentially a
predetermined list of terms, the dictionary, in which each word is related to
several synonyms. "WordNet" and "SenticNet" are widely used lexicons. The
dictionary provides a sentiment score for each word, and these scores are
averaged over the words in a given document; from the average we can compute
whether the document carries positive or negative sentiment. The accuracy of
this method relies entirely on how comprehensive the lexicon is. One such
rule-based tool is the Valence Aware Dictionary for sEntiment Reasoning
(VADER) [15], which helps determine the polarity of a document by considering
the valence of its words, where valence is the magnitude of a word's positivity
or negativity.
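As a concrete illustration of the averaging scheme described above, the following is a minimal sketch of a lexicon-based scorer. The tiny hand-made dictionary and its valence values are invented for illustration; a real system would use a full lexicon such as WordNet, SenticNet or the VADER lexicon.

```python
# Toy sentiment lexicon; the words and valence scores are invented
# placeholders for a real lexicon such as VADER's.
LEXICON = {"wonderful": 3.0, "loved": 2.5, "enjoying": 1.5,
           "boring": -2.0, "awful": -3.0}

def lexicon_score(text):
    """Average the valence of the sentiment-bearing words in the text."""
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def classify(text):
    """Map the average valence to a sentiment label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For instance, `classify("i loved this wonderful film")` averages the valences of "loved" and "wonderful" and returns "positive"; a text containing no lexicon words scores 0 and is labelled neutral.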
Machine learning techniques are commonly grouped into:
– Classification
– Regression
– Clustering
– Association
A neural network is typically composed of an input layer, one or more hidden
layers, and an output layer. Layers are made up of nodes, which mimic neurons
in the human brain. The nodes in the input layer receive the input, the hidden
layer performs calculations using mathematical functions, and the output layer
produces the output [17]. The structure of a neural network is shown in
Figure 2.3 below.
BERT Model :
Bidirectional Encoder Representations from Transformers (BERT) [18] is a
natural language processing machine learning model developed by Google Research
in 2018. The BERT model is based on a deep learning architecture called the
transformer, in which every output is linked to every input: the weights
between inputs and outputs are produced automatically based on their
relationship, which is referred to as attention in NLP. The BERT model is
trained on large bodies of text, giving the architecture the ability to
understand language and learn the variability in data patterns across NLP
tasks. As the name suggests, the BERT model learns information from both the
left and the right side of a token during the training phase.
Generally, a transformer comprises an encoder to read the input text and a
decoder to predict the results. But as the main objective of the BERT model is
to build a language representation model, it has only the encoder. The BERT
encoder takes as input a series of tokens, which are transformed into vectors
to be processed by the neural network. There are many other transformer
choices, such as Hugging Face's DistilBERT, XLNet, GPT-2, etc. [19], but the
BERT model stands out among these and shows better performance on many NLP
tasks.
The BERT model generally works in two steps: pre-training and fine-tuning.
Pre-training involves training the model on unlabelled data sets; for
fine-tuning, BERT is initialised with the pre-trained parameters, which are
then fine-tuned using labelled data from the downstream tasks.
Figure 2.4: Working of a BERT neural network in classifying the IMDb movie reviews
• Accuracy
• Precision
• Recall
• F1 score
Before we define each performance metric let us look at the following terms we use
while defining the performance of a model.
• True Positive (TP) - The result when a model predicts a positive class correctly.
• True Negative (TN) - The result when a model predicts a negative class cor-
rectly.
• False Positive (FP) - The result when a model incorrectly predicts the
positive class for an instance that is actually negative.
• False Negative (FN) - The result when a model incorrectly predicts the
negative class for an instance that is actually positive.
2.2.1 Accuracy
The accuracy of a classification model can be defined as the ratio of total number
of correct predictions made to the total number of predictions. The equation for
accuracy can be given as [20],
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2.1)
2.2.2 Precision
Precision is defined as the ratio of correctly classified positive labels to
the total number of labels classified as positive. The equation for precision
can be given as [20],
Precision = TP / (TP + FP)    (2.2)
2.2.3 Recall
Recall is the ratio of true positives to the sum of true positives and false
negatives; in other words, it measures how many of the actual positives are
labelled correctly. The equation for recall is [20],
Recall = TP / (TP + FN)    (2.3)
2.2.4 F1 score
Another performance metric used is the F1 score, which summarizes precision and
recall in a single value. The F1 score is defined as the harmonic mean of
precision and recall, lying between 0 and 1. The equation for the F1 score
is [20],
F1 score = (2 · Precision · Recall) / (Precision + Recall)    (2.4)
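The four metrics can be computed directly from the confusion-matrix counts defined above. The sketch below is a straightforward transcription of equations (2.1)-(2.4); the example counts are made up for illustration and are not taken from the thesis experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts,
    following equations (2.1)-(2.4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up example counts for illustration only.
acc, prec, rec, f1 = classification_metrics(tp=88, tn=93, fp=12, fn=7)
```

With these counts, accuracy is (88 + 93) / 200 = 0.905 and precision is 88 / 100 = 0.88, matching a direct evaluation of equations (2.1) and (2.2).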
Chapter 3
Related Work
In 2019, Kusrini and Mochamad Mashur [21] compared model performance after
performing sentiment analysis on Twitter data using a lexicon-based approach
with polarity multiplication. The study's findings call for a model that can
manage large amounts of data and perform more accurately on it. The study also
demonstrated that the lexicon-based technique is a tried-and-true strategy,
typically used because it classifies text easily and with great flexibility.
In 2021, Dingyi Yu [22] created a sentiment analysis system for movie reviews
using TF-IDF, the Bag of Words (BoW) model and a Convolutional Neural Network
(CNN). The data set used for training and testing contained 25,000 movie
reviews. The model included methods such as L2 regularization and dropout to
lower the risk of over-fitting, and the final model achieved an average
accuracy of 80.62 percent with a standard deviation of 1.33. Clearly, the
model still has scope for improvement in data selection, text vectorization
and model optimization.
In 2013, Kamil Topal and Gultekin Ozsoyoglu [23] performed an emotion analysis
of IMDb movie reviews, detecting the emotion of each review in order to observe
the performance of movies. The study used a k-means clustering algorithm to
cluster movies according to the reviewers' emotions per dimension, and
suggested a model that can handle large amounts of data in order to map
emotions.
In 2018, Rachana Bandana [24] studied sentiment analysis of movie reviews using
heterogeneous features. The study used a hybrid approach, a model combining
machine learning and a lexicon-based approach, to determine the polarity of a
movie review. The conclusion drawn calls for a model that may use algorithms
beyond traditional machine learning for handling large data. In addition, deep
learning features such as Word2Vec, Doc2Paragraph and word embeddings were
applied to deep learning algorithms such as Recursive Neural Networks,
Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) to
obtain remarkable results.
In 2022, Sarika and Pavan Kumar [25] compared the Long Short-Term Memory (LSTM)
and Gated Recurrent Unit (GRU) recurrent neural network techniques for
sentiment analysis on IMDb movie reviews. Their analysis concluded that LSTM
was more accurate at predicting boundary values, whereas GRU predicted each
class similarly. Overall, GRU performed somewhat better than LSTM at predicting
multi-class movie review text.
Chapter 4
Method
This chapter presents the methodology that we employed in the thesis.
For RQ1, the research starts with a thorough analysis of the literature to find
the most popular algorithms for performing sentiment analysis; the most common
empirical methods include surveys, case studies and experiments. The selected
algorithms are then trained on the chosen data set to evaluate each model's
contribution to classifying the sentiment of the IMDb movie reviews.
For RQ2, we chose experimentation [26] as our research method, discussed in
section 4.2 below. The experimental approach requires the researcher to carry
out an experiment in a methodical way to obtain the desired results. The main
goal of this method is to apply and assess the chosen algorithms using the
evaluation methodologies at hand. The experiment used the hardware and software
tools covered in this chapter. The chosen IMDb movie reviews data set is used
to train the selected algorithms to categorize reviews as positive or negative
using sentiment analysis. Finally, we compare the algorithms on the performance
measures to determine which one is the most accurate and efficient.
1. Identifying the keywords that are related to our thesis, the main key concepts
we chose are, "sentiment analysis", "IMDb movie reviews", "opinion mining",
"text-based classification", "deep learning", "neural networks", "lexicon-based
approaches" and more.
2. Shortlisting all the studied and reviewed resources that are helpful in working
with the thesis.
4.2 Experiment
The flowchart below presents, step by step, the experimentation process [26]
involved in answering research question 2.
Figure 4.1: Flowchart showing various steps involved in performing the experiment
for addressing research question 2.
In outline, the experiment proceeds as follows. Open-source Python packages
such as matplotlib, pandas and scikit-learn are used to train the selected
algorithms for research question 2. The IMDb movie reviews data set is used to
conduct sentiment analysis, first with a lexicon-based methodology and then
with the BERT neural network model. Using methods such as tokenization and word
embeddings, we preprocess the data set, train the models on the training data,
and then perform sentiment analysis on the test data, categorizing the outcomes
as either positive or negative. Finally, we compare the two approaches and
determine which one is more effective using performance evaluation metrics such
as accuracy, precision, F1 score, the epoch-accuracy and epoch-loss curves, and
the area under those curves.
Hardware tools:
• OS: Windows 11
• Processor: Intel Core i7
• RAM: 16 GB
2. Pandas: Python's pandas library is a popular tool for data analysis and
manipulation. It offers many methods to efficiently handle and modify data,
coupled with high-performance data structures such as data frames and series.
Pandas is widely used in machine learning because of its capabilities in data
handling, preprocessing, exploration and analysis. In our thesis, we turned all
of the obtained data into a data frame for analysis and prediction using
pandas [28].
5. NLTK: The Natural Language Toolkit (NLTK) is a set of resources and tools
for working with human language data that are included in a Python package.
It is frequently employed in tasks involving computational linguistics and Nat-
ural Language Processing (NLP). This library is used for word embeddings,
text categorization, word stemming, text processing, and tokenization [29].
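As an illustration of the word-stemming step for which NLTK is used, the sketch below applies NLTK's Porter stemmer to a few words; the word list is ours, chosen to mirror the example review used later in this chapter.

```python
from nltk.stem import PorterStemmer

# Reduce inflected word forms to a common stem before counting/vectorizing,
# so that e.g. "enjoying" and "enjoyed" map to the same feature.
stemmer = PorterStemmer()
words = ["enjoying", "movies", "loved", "sitting"]
stems = [stemmer.stem(w) for w in words]
```

Note that stems need not be dictionary words: the Porter algorithm maps "movies" to "movi", which is fine for feature matching as long as all inflected forms agree.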
Figure 4.2: The imported data set is depicted in this image, with columns having
the id, review, and sentiment of the movies.
The IMDb movie reviews data set, which was imported, has 50,000 movie re-
views. The data set shown here in Figure 4.2 has 25,000 positive reviews and 25,000
negative reviews. Having an equal number of positive and negative reviews in the
data set offers several advantages for sentiment analysis performed by using machine
learning techniques. It ensures balanced training, preventing biases towards either
sentiment. The equal distribution facilitates accurate performance evaluation, allow-
ing for reliable comparisons of different models. It improves model generalization by
capturing underlying patterns for both sentiments. Additionally, it mitigates bias
and promotes fair sentiment analysis results. Overall, the balance in the data set
enhances the effectiveness and reliability of sentiment analysis models.
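The class balance described above is easy to verify with pandas once the reviews are in a data frame. The four-row frame below is a synthetic stand-in for the 50,000-review data set; the "review" and "sentiment" column names match the data set shown in Figure 4.2.

```python
import pandas as pd

# Synthetic stand-in for the imported IMDb data frame.
df = pd.DataFrame({
    "review": ["loved it", "so boring", "a wonderful film", "truly awful"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Count each sentiment label; a balanced set has equal counts.
counts = df["sentiment"].value_counts()
is_balanced = counts.get("positive", 0) == counts.get("negative", 0)
```

On the full data set the same `value_counts()` call would report 25,000 reviews per class.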
4.2.7 Tokenization
Tokenization is a data preprocessing technique that converts a piece of text
into smaller parts, such as words, phrases or other meaningful elements, called
tokens, which makes counting the words in the text easier. The proposed system
performs tokenization at the word level so as to consider the sentiment
polarity of each word.
An example of the tokenization of a review is shown below.
Review: I thought this was a wonderful way to spend time on a too-hot summer
weekend, sitting in the movie theatre enjoying myself. I really loved the film [30].
Tokenised review: 'i', 'thought', 'this', 'was', 'a', 'wonderful', 'way', 'to',
'spend', 'time', 'on', 'a', 'too', 'hot', 'summer', 'weekend', 'sitting', 'in',
'the', 'movie', 'theatre', 'enjoying', 'myself', 'i', 'really', 'loved', 'the',
'film'
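The word-level tokenization shown above can be sketched with a simple regular expression; this regex tokenizer is a lightweight stand-in for the NLTK tokenizer used in the thesis, keeping only lower-cased word characters and apostrophes.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("I really loved the film.")
```

As in the example above, punctuation is discarded and hyphenated words such as "too-hot" split into separate tokens ("too", "hot").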
Table 4.3: Table consisting of the number of reviews in training, validation and
testing data sets.
4.3 Classifiers
4.3.1 Lexicon-based approach:
As previously stated, sentiment analysis on movie reviews can be accomplished
in a variety of ways; the approach we have selected is the standard method
employing a lexicon-based system. The two techniques addressed here are the
BoW ("Bag of Words") model and the TF-IDF ("Term Frequency - Inverse Document
Frequency") model.
BoW model:
The bag-of-words model turns the data set into a matrix in which each review is
represented as a vector. Such a matrix is called a Document Term Matrix (DTM).
Here, in the IMDb data set, the rows of the matrix correspond to the reviews and
the columns correspond to the words of the reviews. These words are n-grams,
where an "n-gram" is a phrase of n consecutive words. In the proposed system, the
values of the DTM are counts: each cell holds the number of occurrences of the
word in the corresponding review.
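A minimal sketch of how such a DTM can be built, using only the standard library (in practice a vectorizer such as scikit-learn's CountVectorizer is common; this hand-rolled version is for illustration and assumes unigrams, i.e. n = 1):

```python
from collections import Counter

def document_term_matrix(reviews):
    # Build the shared vocabulary: every distinct word across all
    # reviews becomes one column of the matrix.
    vocab = sorted({word for review in reviews for word in review.split()})
    # Each row is one review; each cell counts how often that
    # vocabulary word occurs in the review.
    rows = []
    for review in reviews:
        counts = Counter(review.split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows

vocab, dtm = document_term_matrix(["good good movie", "bad movie"])
print(vocab)  # → ['bad', 'good', 'movie']
print(dtm)    # → [[0, 2, 1], [1, 0, 1]]
```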
TF-IDF model:
This model is used to convert text documents to matrices containing TF-IDF
features. Term frequency-inverse document frequency is a measure of how
important a word is within a document relative to the whole corpus.
These models are then trained on the imported data set to perform binary
classification, predicting the sentiment of each review and categorizing it as either
positive or negative.
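The TF-IDF weighting can be sketched as follows. This is a self-contained illustration using the classic tf × log(N/df) formula; the exact smoothing and normalisation used in the thesis pipeline (e.g. scikit-learn's TfidfVectorizer defaults) are not specified in the text, so this is an assumption:

```python
import math
from collections import Counter

def tf_idf(reviews):
    # Document frequency: in how many reviews does each word appear?
    n = len(reviews)
    df = Counter()
    for review in reviews:
        df.update(set(review.split()))
    # TF-IDF: term count in the review, scaled down for words that
    # appear in many reviews (idf = log(N / df)).
    matrix = []
    for review in reviews:
        tf = Counter(review.split())
        matrix.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return matrix

weights = tf_idf(["good movie", "bad movie"])
print(weights[0])  # 'movie' scores 0.0: it appears in every review
```

Words shared by every document get weight zero, which is exactly why TF-IDF highlights the distinctive, sentiment-bearing words of a review.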
Fine tuning:
The BERT neural network must be fine-tuned on the data set of labeled movie
reviews. Fine-tuning adjusts the weights of the model using back-propagation and
gradient descent in order to minimize the loss function. The loss function measures
the difference between the predicted sentiment and the actual sentiment of the
labeled data. The fine-tuning process is performed using a PyTorch optimizer.
Mathematically, the fine-tuning is represented as follows:
f(x_i; θ) = softmax(W_2 · ReLU(W_1 · h_i + b_1) + b_2) [32]
The terms of the equation are explained as follows:
1. The function f(x_i; θ) defines a neural network model that takes in input as a
movie review in the form of a sequence of tokens and uses a pre-defined BERT
model. Here in this study, we have used the bert-based-uncased model, to
obtain contextualized embedding of the input sequence i.e. a movie review.
2. The word embeddings are then passed through two ReLU-enabled linear trans-
formations, and a softmax function is applied to the output to obtain the
probability distribution (positive or negative sentiment) across classes.
3. The model parameters (θ) are trained by minimizing a loss function between the
anticipated probability distribution and the true labels of the movie reviews.
4. The h_i is the output of the BERT model’s final encoding layer for the pre-
processed text.
5. x_i is the input token sequence; W_1 and b_1 are the first fully connected
layer’s weights and biases, ReLU is the rectified linear activation function, and
W_2 and b_2 are the second fully connected layer’s weights and biases. Softmax
is the softmax activation function.
6. Finally, the BERT model makes use of two output classes, classifying both
positive and negative sentiment.
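The classification head in the equation above can be sketched in plain Python. The thesis implements this in PyTorch; plain lists are used here only to stay self-contained, the layer sizes are toy-sized (real BERT embeddings are 768-dimensional), and the weight values are illustrative, not trained:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    # Subtract the max for numerical stability before exponentiating.
    exps = [math.exp(x - max(v)) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, x, b):
    # One fully connected layer: W * x + b.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def classify(h):
    # h: BERT's final-layer embedding of the review (toy size here).
    # Two stacked linear layers with ReLU in between, then softmax
    # over the two sentiment classes (0 = negative, 1 = positive).
    W1, b1 = [[0.5, -0.2], [0.1, 0.3]], [0.0, 0.0]   # illustrative weights
    W2, b2 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
    return softmax(linear(W2, relu(linear(W1, h, b1)), b2))

probs = classify([0.8, 0.4])  # stand-in for a real embedding h_i
print(probs)  # probabilities over the two classes; they sum to 1
```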
During training, the network is fed inputs that are movie reviews from the IMDb
data set, along with their matching sentiment labels (1 for positive, 0 for negative).
The model then learns to adjust the weights of its connections to improve
prediction accuracy. Finally, when used for prediction, the network takes a
sentence as input and outputs a probability distribution across the two possible
sentiment labels (0 and 1). The label with the highest probability is then chosen as
the predicted sentiment.
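The weight adjustment described above, back-propagation with gradient descent driving the loss down, can be illustrated with a single logistic-regression weight. This is a toy stand-in for the PyTorch optimizer used in the thesis, not its actual training code:

```python
import math

def loss_and_grad(w, x, y):
    # Cross-entropy loss of a one-weight logistic model p = sigmoid(w*x)
    # against the true label y (1 = positive, 0 = negative), together
    # with its gradient dL/dw = (p - y) * x.
    p = 1.0 / (1.0 + math.exp(-w * x))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss, (p - y) * x

w, lr = 0.0, 0.5          # initial weight and learning rate
x, y = 2.0, 1             # one training example (feature, label)
for step in range(5):
    loss, grad = loss_and_grad(w, x, y)
    w -= lr * grad        # gradient-descent update
    print(f"step {step}: loss={loss:.4f}")  # loss shrinks each step
```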
Below table 4.4, is an example of the results of IMDb reviews being classified into
either positive or negative.
Table 4.4: Sentiment classification table showing examples of actual sentiment and
predicted sentiment of movie reviews using the lexicon-based and BERT neural net-
work classifier models.
Chapter 5
Results and Analysis
This chapter presents a full explanation of the findings achieved in selecting algo-
rithms for efficient sentiment analysis on IMDb movie reviews, as well as the results
obtained after training the selected algorithms on the imported data set. To perform
sentiment analysis on the IMDb movie review data, the Bag of Words (BoW) model,
the TF-IDF model and the BERT neural network model are chosen and used. To
evaluate the algorithms, we used performance metrics such as accuracy, precision,
recall and F1 score, as well as the epoch-loss and accuracy curves. Every result of
the experiment is computed on the test data set. The outcomes of the thesis study
are discussed in depth in the following sections of the chapter.
Table 5.1: Table showing the results of the literature survey conducted in our thesis.
Table 5.2: Continuation of table 5.1 showing results of the literature survey.
Table 5.3: Continuation of table 5.1 showing results of the literature survey.
Table 5.4: Continuation of table 5.1 showing results of the literature survey.
According to the findings gathered from the literature review, various research
papers employed different methods to conduct sentiment analysis. These
approaches are summarized in table 5.5 below.
Table 5.5: Summary of the methods used in the surveyed papers.

| Paper | Algorithm(s) | Approach type |
| (row continued from the previous page) | • Recurrent Neural Network (RNN) | |
| Sentiment analysis using Lexicon based approach | Lexicon-based model | Performing sentiment analysis using a traditional model. |
| Comparing LSTM and GRU for multiclass sentiment analysis of movie reviews | • Long Short Term Memory neural network (LSTM) • Gated Recurrent Unit (GRU) | Comparison of deep learning models. |
| Hybrid convolutional bidirectional recurrent neural network based sentiment analysis on movie reviews | • Convolutional Bidirectional Recurrent Neural Network (CBRNN) • Bidirectional Gated Recurrent Unit (BGRU) • Random Forest • Decision Tree | A hybrid approach. |
| Combining a rule based classifier with weakly supervised learning for Twitter sentiment analysis | Rule-based classifier, Naive Bayes | Hybrid approach. |
| Sentiment Information based Model for Chinese Text Sentiment Analysis | Long Short Term Memory neural network (LSTM) | Neural network model. |
| Comparative analysis of customer sentiments on competing brands using the Hybrid model approach | Naive Bayes | Hybrid approach. |
| BERT-IAN Model for Aspect-based Sentiment Analysis | Bidirectional Encoder Representations from Transformers (BERT) | Neural network model. |
| A comparative analysis of sentiment classification based on Deep and Traditional ensemble Machine Learning Models | Ensemble models | Comparison of different machine learning models. |
Table 5.8: Table of the number of reviews classified into positive and negative by
each classifier model used in the study.
5.2.1 Accuracy
By determining the accuracy of each model, this performance metric is used to
discover the most efficient of the selected algorithms. The accuracy scores obtained
for each model are shown in table 5.9 below.
Table 5.9: Table of accuracy scores for all classifier models in this study.
Figure 5.1: A bar plot comparing the accuracy scores of three models employed in
this study.
Observing the above comparison of accuracy scores, the BERT model has the
highest accuracy score of 90.67% among the selected models, followed by the BoW
model with 79.15% and the TF-IDF model with 78.98%.
Figure 5.2: Graph plotted using the BERT model to demonstrate the model’s accu-
racy using validation data.
5.2.2 Precision
This performance metric depicts all possible positive predictions classified by the
model. The selected models are evaluated considering this metric and the values of
the precision scores of each model are shown in table 5.2 below.
Table 5.10: Table of precision scores for all classifier models in this study.
Figure 5.3: A bar plot comparing the precision scores of three models employed in
this study.
Observing the above comparison of precision scores of the three models, the BERT
model again has the highest precision among the evaluated models, followed by the
BoW and TF-IDF models, which share the same value.
5.2.3 Recall
In order to find the optimized algorithm we use this performance metric. The below
table 5.3, shows the recall values of each model evaluated against the text data.
Table 5.11: Table of recall values for all classifier models in this study.
Figure 5.4: A bar plot comparing the recall scores of three models employed in this
study.
Here also, the BERT model has the best recall score of 0.92, followed by the BoW
model with a 0.79 recall score.
5.2.4 F1 score
F1 score is another performance metric that we have chosen in order to evaluate the
selected models based on their performance. Table 5.4, shows the values of the F1
score for each algorithm.
Table 5.12: Table of F1 scores for all classifier models in this study.
Figure 5.5: A bar plot comparing the F1 scores of three models employed in this
study.
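The four metrics reported in the tables above all follow from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). A minimal sketch, using illustrative counts rather than the study's actual confusion-matrix values:

```python
def metrics(tp, tn, fp, fn):
    # Accuracy: share of all predictions that were correct.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision: share of predicted positives that were truly positive.
    precision = tp / (tp + fp)
    # Recall: share of actual positives the model found.
    recall = tp / (tp + fn)
    # F1 score: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only, not the thesis's measured values.
acc, prec, rec, f1 = metrics(tp=46, tn=45, fp=6, fn=4)
print(f"accuracy={acc:.2f} precision={prec:.2f} "
      f"recall={rec:.2f} f1={f1:.2f}")
```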
Chapter 6
Discussion
The discussion chapter contains an overview of the thesis findings as well as their
contribution to answering the research questions. We also discuss aspects that may
contradict the findings.
BoW model: The matrix displayed in Figure 6.1 represents the results of binary
classification using the Bag-of-Words (BoW) model in our study. It indicates the
model’s ability to accurately predict whether a labeled review from the IMDb movie
reviews data set is positive or negative, belonging to either class 0 or class 1. The
matrix allows us to assess the model’s performance in correctly identifying the overall
number of positive and negative reviews.
Figure 6.1: Confusion matrix of the BoW model showing the classified number of
true positives and true negatives on the IMDb movie reviews data set.
TF-IDF model: The TF-IDF model was employed in our research to evaluate
whether a labeled review from the IMDb movie reviews data set is positive or
negative. The results of the model’s binary classification, either class 0 or class 1,
are shown in the matrix in Figure 6.2 below. The model’s accuracy in predicting
the total number of positives and negatives is evident.
Figure 6.2: Confusion matrix of the TF-IDF model showing the classified number
of true positives and true negatives on the IMDb movie reviews data set.
BERT model: In our study, the BERT model was used to determine whether a
labeled review from the IMDb movie reviews data set is positive or negative. The
matrix in Figure 6.3 below displays the outcomes of the model’s binary
classification, either class 0 or class 1. It is clear that the model was successful in
predicting nearly all of the positives and negatives.
Figure 6.3: Confusion matrix of the BERT model showing the classified number of
true positives and true negatives on the IMDb movie review data set.
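The matrices in Figures 6.1-6.3 tally predictions against true labels. A minimal sketch of how such a matrix is built for binary sentiment classes, with toy labels assumed for illustration:

```python
def confusion_matrix(y_true, y_pred):
    # 2x2 matrix m[i][j]: count of reviews with true class i
    # (0 = negative, 1 = positive) that the model predicted as class j.
    # Layout: [[TN, FP], [FN, TP]].
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy model predictions
m = confusion_matrix(y_true, y_pred)
print(m)  # → [[3, 1], [1, 3]]
```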
The three models were further compared on their precision, recall, and F1 scores.
The BERT neural network model demonstrated the highest precision score of
0.88, indicating its ability to accurately classify positive and negative sentiments. It
also exhibited a high recall score of 0.92, indicating that it successfully identified
a significant proportion of true positive instances. Additionally, the BERT model
achieved an F1 score of 0.88, which combines both precision and recall.
On the other hand, the lexicon-based approaches using the BoW and TF-IDF
models showed similar performance scores for precision, recall, and F1 scores, slightly
lower than the BERT model. This suggests that the lexicon-based approaches were
comparatively less effective in accurately predicting sentiment compared to the BERT
model.
Figure 6.4: A curve plot showing the comparison of performance metrics evaluated
for the three classifier models used in the thesis.
So, in response to RQ 2, our study clearly demonstrates that the BERT neural
network model outperformed traditional lexicon-based approaches in terms of accu-
racy and various performance metrics. The BERT model consistently showed higher
accuracy, precision, recall, and F1 scores, indicating its effectiveness in conducting
sentiment analysis. These findings confirm our hypothesis that deep learning models
like BERT can deliver superior results compared to lexicon-based methods.
The implications of our research are significant for sentiment analysis applications.
By utilizing the BERT model, analysts and businesses can achieve more precise and
reliable sentiment analysis outcomes. This is particularly advantageous in domains
where user opinions must be analysed at scale.
Validity threats:
"Internal validity" relates to the accuracy of the research. The data quality
is the major internal challenge of this study. During the data-gathering phase of
the investigation, possible threats to the internal validity of this theory may emerge.
While downloading the data set from the internet page, take care to ensure that all
of the text gets downloaded. If this is not done, an incorrect sentiment polarity will
arise, resulting in an erroneous understanding of the circumstance. Another risk is
that the algorithms which are employed may be incorrect. To address this threat,
available data were examined and a thorough literature study was conducted, in or-
der to select algorithms that perform well with the data that is currently accessible.
The experiment demonstrated that the algorithms performed as expected.
The term "external validity" relates to how well the thesis’s findings may be
applied in practice. The data set in this study comprises reviews written by people
on an internet platform, so the data reflects the outside world, and performing
sentiment analysis as described in this study can be beneficial to people. Another
issue is that the model may become out of date and no longer completely reflect
the actual situation. However, the procedures utilised in the study were
appropriate and successful, therefore this thesis might be applied in other
real-world circumstances.
Finally, the validity of the thesis conclusions depends on how correctly we were
able to select the model that performs sentiment analysis on IMDb movie reviews
with the highest accuracy. This relates to the performance measurements used to
validate the chosen models; in our study, we made sure to use appropriate metrics,
which are reported in the findings.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
In the modern era, movies and television have become the primary sources of en-
tertainment, deeply intertwined with people’s lives. Consequently, individuals are
increasingly interested in gaining insights about a film before watching it. To con-
tribute to this field, we conducted a sentiment analysis of IMDb movie reviews,
aiming to classify them as positive or negative. This dissertation focuses on com-
paring two approaches: the lexicon-based method and the BERT neural network
model.
Our experiment yielded successful results, revealing that the BERT neural net-
work model outperformed the lexicon-based method in terms of accuracy and effi-
ciency. To validate this conclusion, we evaluated the performance using metrics such
as accuracy, precision, recall, and F1 score.
The BERT neural network model, a cutting-edge natural language processing
(NLP) model, exhibited superior performance in sentiment analysis compared to
the lexicon-based method. BERT’s advantage lies in its ability to grasp contextual
information and understand the nuances of language. By being trained on extensive
text data, BERT learns to predict words in a sentence based on their surrounding
context, resulting in a highly accurate and robust model. On the other hand, the
lexicon-based method relies on predefined sentiment dictionaries or lexicons. While
it can yield reasonable results, it often struggles with contextual understanding and
fails to capture subtle language nuances. Lexicon-based approaches typically assign
sentiment scores to individual words or phrases and aggregate them to determine
the sentiment of the entire text. This simplistic approach may overlook the complex
interactions between words and their contextual meanings, leading to less accurate
sentiment analysis.
To evaluate the performance of the two approaches, we utilized various metrics,
including accuracy, precision, recall, and F1 score. Accuracy measures the overall
correctness of sentiment classification, while precision and recall assess the model’s
ability to correctly identify positive and negative sentiments. The F1 score combines
precision and recall, offering a comprehensive evaluation of the model’s performance.
Based on these evaluation metrics, we consistently found that the BERT neural
network model outperformed the lexicon-based method in sentiment analysis accu-
racy. Its proficiency in capturing contextual information and understanding language
intricacies enables it to achieve higher precision, recall, and F1 score values. Con-
sequently, the BERT model proves to be a more reliable and effective approach for
sentiment analysis of IMDb movie reviews.
7.2 Future Work
• For more precise classification, we can use live data that is immediately acquired
from online domains using web scraping techniques.
• For better outcomes, the proposed methodology can be extended and assessed
using a hybrid model that combines traditional lexicon-based methods with
deep learning models like neural networks.
References
[1] J. Robbins. Intensity impact of its escalation on people, society and the environ-
ment. In ISTAS 98. Wiring the World: The Impact of Information Technology
on Society. Proceedings of the 1998 International Symposium on Technology and
Society (Cat. No.98CH36152), pages 98–104, June 1998.
[2] An Han, Liu Hao, and Ren Jifan. An empirical study on inline impact factors of
reviews usefulness based on movie reviews. In 2016 13th International Confer-
ence on Service Systems and Service Management (ICSSSM), pages 1–5, June
2016. ISSN: 2161-1904.
[3] Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A. Smith. Movie Re-
views and Revenues: An Experiment in Text Regression. In Human Language
Technologies: The 2010 Annual Conference of the North American Chapter of
the Association for Computational Linguistics, pages 293–296, Los Angeles, Cal-
ifornia, June 2010. Association for Computational Linguistics.
[4] S. S. Rajamouli. RRR (Rise Roar Revolt). Film, India, March 2022. IMDb
ID: tt8178634.
[5] Subhra Balabantaray. Impact of Indian cinema on culture and creation of world
view among youth: A sociological analysis of Bollywood movies. Journal of
Public Affairs, September 2020.
[6] M. Kavitha, Bharat Bhushan Naib, Basetty Mallikarjuna, R. Kavitha, and
R. Srinivasan. Sentiment Analysis using NLP and Machine Learning Techniques
on Social Media Data. In 2022 2nd International Conference on Advance Com-
puting and Innovative Technologies in Engineering (ICACITE), pages 112–115,
April 2022.
[7] Deni Kurnianto Nugroho. US presidential election 2020 prediction based on
Twitter data using lexicon-based sentiment analysis. In 2021 11th International
Conference on Cloud Computing, Data Science & Engineering (Confluence),
pages 136–141, January 2021.
[8] Samar Assem and Sameh Alansary. Sentiment Analysis From Subjectivity to
(Im)Politeness Detection: Hate Speech From a Socio-Pragmatic Perspective.
In 2022 20th International Conference on Language Engineering (ESOLEC),
volume 20, pages 19–23, October 2022.
[9] Xiao-Hong Cai, Pei-Yu Liu, Zhi-Hao Wang, and Zhen-Fang Zhu. Fine-Grained
Sentiment Analysis Based on Sentiment Disambiguation. In 2016 8th Inter-
national Conference on Information Technology in Medicine and Education
(ITME), pages 557–561, December 2016. ISSN: 2474-3828.
45
[22] Dingyi Yu. Intelligent Analysis System of Movie Reviews Using Deep Learning
and Convolutional Neural Networks. In 2021 IEEE Conference on Telecom-
munications, Optics and Computer Science (TOCS), pages 617–621, December
2021.
[23] Kamil Topal and Gultekin Ozsoyoglu. Movie review analysis: Emotion analysis
of IMDb movie reviews. In 2016 IEEE/ACM International Conference on Ad-
vances in Social Networks Analysis and Mining (ASONAM), pages 1170–1176,
August 2016.
[24] Rachana Bandana. Sentiment Analysis of Movie Reviews Using Heterogeneous
Features. In 2018 2nd International Conference on Electronics, Materials En-
gineering & Nano-Technology (IEMENTech), pages 1–4, May 2018.
[25] Pawan Kumar Sarika. Comparing LSTM and GRU for Multiclass Sentiment
Analysis of Movie Reviews. 2020.
[26] V.R. Basili. The role of experimentation in software engineering: past, current,
and future. In Proceedings of IEEE 18th International Conference on Software
Engineering, pages 442–449, March 1996. ISSN: 0270-5257.
[27] Jeffrey W. Knopf. Doing a Literature Review. PS: Political Science & Politics,
39(1):127–132, January 2006. Publisher: Cambridge University Press.
[28] I. Stančin and A. Jović. An overview and comparison of free Python libraries for
data mining and big data analysis. In 2019 42nd International Convention on
Information and Communication Technology, Electronics and Microelectronics
(MIPRO), pages 977–982, May 2019. ISSN: 2623-8764.
[29] Xavier Schmitt, Sylvain Kubler, Jérémy Robert, Mike Papadakis, and Yves
LeTraon. A Replicable Comparison Study of NER Software: StanfordNLP,
NLTK, OpenNLP, SpaCy, Gate. In 2019 Sixth International Conference on
Social Networks Analysis, Management and Security (SNAMS), pages 338–343,
October 2019.
[30] IMDb Dataset of 50K Movie Reviews. Kaggle.
[31] Siwei Lai, Kang Liu, Shizhu He, and Jun Zhao. How to Generate a Good Word
Embedding. IEEE Intelligent Systems, 31(6):5–14, November 2016. Conference
Name: IEEE Intelligent Systems.
[32] Denis Rothman and Antonio Gulli. Transformers for Natural Language Pro-
cessing: Build, train, and fine-tune deep neural network architectures for NLP
with Python, PyTorch, TensorFlow, BERT, and GPT-3. Packt Publishing Ltd,
March 2022. Google-Books-ID: u9FjEAAAQBAJ.
[33] Sara Sabba, Nahla Chekired, Hana Katab, Nassira Chekkai, and Mohammed
Chalbi. Sentiment Analysis for IMDb Reviews Using Deep Learning Classifier.
In 2022 7th International Conference on Image and Signal Processing and their
Applications (ISPA), pages 1–6, May 2022.
[34] Cuk Tho, Yaya Heryadi, Iman Herwidiana Kartowisastro, and Widodo Budi-
harto. A Comparison of Lexicon-based and Transformer-based Sentiment Analy-
sis on Code-mixed of Low-Resource Languages. In 2021 1st International Con-