Bachelor of science
In
Under the Guidance of
May – 2022
CERTIFICATE
DEPARTMENT OF DATA SCIENCE AND ANALYTICS
BENGALURU – 560027
NLP SENTIMENT ANALYSIS ON MOVIE REVIEWS WITH
TOXIC COMMENT DETECTION
T.ROHITH 19BSR18022
Guided by:
DEPARTMENT OF DATA SCIENCE AND ANALYTICS
JAIN UNIVERSITY, JC ROAD CAMPUS
BENGALURU – 560027
I hereby declare that the project work on “NLP SENTIMENT ANALYSIS ON MOVIE
REVIEWS WITH TOXIC COMMENT DETECTION” has been submitted by me in
partial fulfilment of the requirements for the degree of M.Sc. in Data Science and Analytics. This is
my original work, and all information in this document has been obtained and presented in
accordance with academic rules and ethical conduct. The results presented in this report, or parts
of it, have not been presented for the award of any other degree.
I also declare that no chapter of this manuscript, in whole or in part, has been incorporated in this
report from any earlier work done by others or by me. However, extracts of any literature that
have been used for this report have been duly acknowledged, with details of such literature provided
in the references.
Name: T.ROHITH
USN: 19BSR18022
Signature:
ACKNOWLEDGEMENT
I take this opportunity to acknowledge the guidance received from my professor, my
college administration, and my family and friends during this exciting journey of
researching and working on the final year thesis on “NLP SENTIMENT ANALYSIS ON
MOVIE REVIEWS WITH TOXIC COMMENT DETECTION”. Completing this project
gives me immense satisfaction, and it would not have been possible without my advisors.
I am indebted to, and sincerely thank, my project guide, Mr. Kunal Dey, professor,
Department of Data Science and Analytics, Jain University, for his time, patience, and valuable
knowledge, and for leading and guiding me throughout this project.
I would like to convey my special thanks to Dr. Aarthi Sudarshan, professor and
HOD, Department of Data Science and Analytics, Jain University, for this opportunity.
Additionally, I extend my gratitude to all the teaching and non-teaching staff of the
Department of Data Science and Analytics, Jain University, JC Road campus, for their
encouragement.
I am also grateful to Dr. Asha Rajiv, principal, Jain University, JC Road campus, for
her continued support.
Last but not least, my family and friends have been there for me through everything,
and I take this opportunity to thank them for their bolstering presence.
T.ROHITH
ABSTRACT
TABLE OF CONTENTS
SL NO CHAPTER NAME
1. CHAPTER – 1
Introduction
What is sentiment analysis?
Problem Statement
Project Pipeline
2. CHAPTER – 2
Findings from the literature review
An overview of sentiment analysis technology
Natural language processing in Artificial intelligence
Text preprocessing, Practical Text Analytics
3. CHAPTER – 3
Data set description
Project description
Expanding contractions
4. CHAPTER – 4
Results
LIST OF FIGURES
CHAPTER—1. INTRODUCTION
I used natural language processing to implement sentiment analysis on the IMDb movie
review dataset, with the additional functionality of toxic comment detection. The purpose of
this research is to determine the underlying sentiment of a movie review from its
text, that is, to classify whether or not the reviewer enjoyed the
film. In this project, keyword spotting was used to evaluate the data. Although keyword
spotting is considered the most naive approach, its accessibility and economy make it
popular. This method sorts text into affect categories based on unambiguous affect words such as
joyful, sad, fearful, and bored. Vectorization and normalization techniques (stemming
and lemmatization) were implemented for data pre-processing. Several feature
extraction models were applied, including Bag of Words, TF-IDF, and N-gram. The
results of five machine learning models were compared: Naive Bayes, Random Forest, Support
Vector Machine (SVM), Logistic Regression, and Decision Tree classifiers.
The Toxic Comment Detection model determines whether or not the input text
is suitable. It uses a Logistic Regression classifier trained on
N-gram features.
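As a rough illustration of the toxic comment detection step described above, the sketch below pairs N-gram counts with a Logistic Regression classifier using scikit-learn. The example comments, their labels, the `ngram_range=(1, 2)` setting, and the helper name `is_suitable` are all assumptions made for illustration; they are not the project's data or configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, hand-written examples for illustration only (not the project's data).
comments = [
    "you are an idiot and a loser",
    "shut up you stupid fool",
    "i hate you, you are worthless",
    "what a stupid idiot",
    "thanks for the thoughtful review",
    "i really enjoyed this film",
    "great acting and a lovely story",
    "the plot was a little slow but fine",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = toxic, 0 = suitable

# Unigram and bigram counts feed a Logistic Regression classifier.
toxic_model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
toxic_model.fit(comments, labels)


def is_suitable(text: str) -> bool:
    """Return True when the model predicts the text is not toxic."""
    return bool(toxic_model.predict([text])[0] == 0)
```

A real detector would need a large labelled corpus; with eight sentences the model only recognises the words it has already seen.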
Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique
used to determine whether data is positive, negative or neutral. Sentiment analysis is often
performed on textual data to help businesses monitor brand and product sentiment
in customer feedback, and understand customer needs.
Sentiment analysis focuses on the polarity of a text (positive, negative, neutral), but it also
goes beyond polarity to detect specific feelings and emotions (angry, happy, sad, etc.),
urgency (urgent, not urgent) and even intentions (interested vs. not interested).
PROBLEM STATEMENT
PROJECT PIPELINE
5)Data Preprocessing
7)Transforming Dataset using TF-IDF Vectorizer
9)Model Building
10)Conclusion
• Goal –
• Identify the underlying sentiment of a movie review, find the state of mind of
the reviewer while writing the review, and understand whether the person was
“happy”, “sad”, “angry” and so on.
• Data Preprocessing –
• Removing HTML tags and symbols, removing stop words, stemming,
lemmatization, expanding contractions
• Classifiers –
• Decision Tree, Naive Bayes, Logistic Regression, Random Forest and Linear
Support Vector Machine
• Evaluation –
• Accuracy, F1 Score and Confusion Matrix
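The classifier and evaluation steps listed above can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions: the tiny train and test sets are hand-written placeholders rather than IMDb data, TF-IDF is used as the only feature extractor, and every hyperparameter is either a library default or an arbitrary choice.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder reviews for illustration only; 1 = positive, 0 = negative.
train_texts = [
    "a wonderful and moving film", "great cast and great story",
    "i loved every minute", "brilliant acting and direction",
    "a dull and boring film", "terrible story and weak cast",
    "i hated every minute", "awful acting and direction",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["a wonderful story", "great acting", "a boring story", "awful cast"]
test_labels = [1, 1, 0, 0]

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Linear SVM": LinearSVC(),
}

results = {}
for name, classifier in classifiers.items():
    # Each model gets its own TF-IDF vectorizer fitted on the training text only.
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    predicted = model.predict(test_texts)
    results[name] = {
        "accuracy": accuracy_score(test_labels, predicted),
        "f1": f1_score(test_labels, predicted),
        "confusion_matrix": confusion_matrix(test_labels, predicted, labels=[0, 1]),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f}, f1={scores['f1']:.2f}")
```

Scores on data this small say nothing about the real models; the loop only shows how the five classifiers can share one evaluation routine.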
Fig 01
In 2019, Ray and Chakrabarti [8] introduced a deep learning algorithm
for extracting features from text and analysing the user's sentiment
with respect to each feature. A seven-layer deep CNN was employed to tag
the features in opinionated sentences. To enhance the performance of the
sentiment scoring and feature extraction models, the authors combined the
deep learning methods with a set of rule-based models. The suggested method
was reported to achieve the best accuracy. In 2019, Zhao et al. [9]
proposed a novel image-text consistency driven multi-modal sentiment
evaluation model, which explored the correlation between text and image.
A multi-modal adaptive sentiment analysis model was then implemented.
Mid-level visual features were extracted with the traditional SentiBank
model and used to represent visual concepts; these were integrated with
other characteristics, such as social, textual, and visual features, to
build a machine learning model. The suggested model attained the
best performance when compared with traditional models.
Natural language processing in artificial intelligence:
Author(s): Brojo Kishore Mishra, Raghvendra Kumar (2020)
Journal: Computer Science, Engineering & Technology, Taylor and Francis
This book focuses on natural language processing, artificial intelligence, and
allied areas. Natural language processing enables communication between
people and computers, and automatic translation makes it easier to interact
with others around the world. The book discusses theoretical work and
advanced applications, approaches, and techniques for computational models
of information and how it is presented by language (artificial, human, or
natural) in other ways. It looks at intelligent natural language processing and
related models of thought, mental states, reasoning, and other cognitive
processes. It explores the difficult problems and challenges related to
partiality, underspecification, and context-dependency, which are signature
features of information in nature and natural languages.
CHAPTER—3. PROPOSED METHOD
DATASET DESCRIPTION
PROJECT DESCRIPTION
Data Pre-processing. Data collected in its raw form can be messy, so it should be cleaned
before any analysis is run on it. Punctuation marks and HTML elements
were removed using regular-expression pattern matching. For ease of processing, the
text was also converted to lower case.
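The cleaning step described above could look roughly like the sketch below, which uses Python's built-in `re` module. The function name `clean_text` and the exact patterns are assumptions; the report does not show the expressions that were actually used.

```python
import re


def clean_text(raw: str) -> str:
    """Strip HTML tags and punctuation, lower-case, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)          # drop HTML tags such as <br />
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # replace punctuation and symbols with spaces
    text = text.lower()                          # normalise case
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace


print(clean_text("A <b>GREAT</b> film!<br /><br />Loved it."))
# prints: a great film loved it
```

Because apostrophes are treated as punctuation here, a contraction such as "don't" would be split into "don t", which is one reason to expand contractions before this step.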
Fig3
EXPANDING CONTRACTIONS:
Removing Stop Words. Stop words are common words like "if," "but," "we,"
"he," "she," and "them," which can typically be omitted from a document
without changing its meaning. Removing them reduces noise in the features,
which can improve the model's performance.
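A minimal sketch of stop-word removal is shown below. The `STOP_WORDS` set is a short hand-written list chosen for illustration, not the list used in the project; in practice a fuller list, such as the English stop-word corpus shipped with NLTK, would normally take its place.

```python
# Illustrative stop-word list (an assumption, far shorter than a real one).
STOP_WORDS = {"if", "but", "we", "he", "she", "them", "the", "a", "an", "and", "was", "it"}


def remove_stop_words(text: str) -> list[str]:
    """Split lower-cased text on whitespace and drop tokens found in STOP_WORDS."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


print(remove_stop_words("We loved the film but she found it slow"))
# prints: ['loved', 'film', 'found', 'slow']
```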
Fig4
Normalization:
Normalization is the process of cleaning noise from unstructured text for sentiment
analysis. It is carried out here using the Natural Language Toolkit
(NLTK). An NLP model can be further improved by converting all the different
forms of a given word into one; this is called normalization.
Normalization can be done in two ways: stemming and lemmatization.
Stemming:
Stemming is the process of reducing an inflected word to its base form (stem).
It can be done using a variety of algorithms; Porter's algorithm is the most widely
used and has proven effective in practice.
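The short example below applies NLTK's `PorterStemmer` to a few words. The word list is an assumption chosen for illustration; the stemmer itself needs no extra corpus download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "watched", "movies", "reviews"]
stems = [stemmer.stem(word) for word in words]

print(stems)
# prints: ['run', 'watch', 'movi', 'review']
```

The stem "movi" shows that stemming strips suffixes by rule and does not always return a dictionary word.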
Fig5
Lemmatization:
The idea of lemmatization is similar to stemming: eliminate
inflections and map a word to its root form. In this process, however, each word
is transformed into its actual dictionary root (lemma). For example, the algorithm
would identify that the word ‘better’ is derived from the lemma ‘good’.
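To make the contrast with stemming concrete, the toy sketch below performs lemmatization by dictionary lookup. The table `LEMMA_TABLE` and the function `lemmatize` are hypothetical and cover only a handful of words; a real lemmatizer, such as NLTK's `WordNetLemmatizer`, consults a full lexical database (and needs the WordNet data downloaded first).

```python
# Hypothetical, hand-written lookup table for illustration only.
LEMMA_TABLE = {
    "better": "good",
    "best": "good",
    "movies": "movie",
    "was": "be",
    "went": "go",
}


def lemmatize(word: str) -> str:
    """Return the dictionary form of a word, or the lower-cased word if unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


print([lemmatize(w) for w in ["better", "Movies", "film"]])
# prints: ['good', 'movie', 'film']
```

No suffix-stripping rule could map "better" to "good", which is why lemmatization relies on a dictionary rather than on rules alone.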
Fig6
Feature Extraction methods used:
Bag of Words:
Bag of Words is a simple representation for extracting features from text so that
they can be used in machine learning algorithms. It has proved successful in
document classification and language modelling problems. The model describes the
occurrence of words within a document by counting how often each word appears.
In Bag of Words, the order of the
words is irrelevant; the model is only concerned with which words occur. The
vocabulary can be limited to different sizes, such as 500, 5000, or 50000 words.
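The counting described above can be sketched in plain Python as follows. The helper names `build_vocabulary` and `bag_of_words`, the two sample documents, and the vocabulary limit of 2 are assumptions for illustration; the limit plays the role of the 500, 5000, or 50000 word caps mentioned in the text.

```python
from collections import Counter


def build_vocabulary(documents: list[str], max_words: int) -> list[str]:
    """Keep the max_words most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in doc.split())
    return [word for word, _ in counts.most_common(max_words)]


def bag_of_words(document: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs; word order is ignored."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


docs = ["good film good cast", "bad film film"]
vocab = build_vocabulary(docs, max_words=2)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)    # prints: ['film', 'good']
print(vectors)  # prints: [[1, 2], [2, 0]]
```

Words outside the capped vocabulary ("cast" and "bad" here) are simply dropped from the vectors.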
N-Gram:
The N-gram model is an important model in language and speech
processing. An N-gram is a sequence of N consecutive words in a sentence,
where N is an integer giving the number of words
in the sequence. It allows an NLP program to capture combinations of words
rather than single words only. When N = 1, N = 2, and N = 3, it is referred to as a unigram, bigram,
and trigram respectively. N-grams are used because, unlike in Bag of Words,
the order in which the words appear is important. For example, it is a good
idea to consider a bigram like “New York” instead of splitting it into the
individual words “New” and “York”.
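A small sketch of N-gram generation follows. The function name `ngrams` and the sample sentence are assumptions for illustration; tokens are obtained by simple whitespace splitting.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive words in the text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I love New York"
print(ngrams(sentence, 1))  # prints: ['I', 'love', 'New', 'York']
print(ngrams(sentence, 2))  # prints: ['I love', 'love New', 'New York']
print(ngrams(sentence, 3))  # prints: ['I love New', 'love New York']
```

With N = 2 the pair "New York" survives as a single feature, which is the behaviour the example in the text calls for.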
TF-IDF:
The main goal of this model is to down-weight words that are
repeated in almost every document. The term frequency (TF), which
measures how often a word occurs in a given document, is given by the formula:
TF=(Total no. of times the word occurs)/(Total no. of words in document)
The inverse document frequency (IDF), which measures how rare
the word is across the whole collection of documents, is given by the formula:
IDF=log_e((Total no. of documents)/(No. of documents with that word in it))
The TF-IDF weight of a word is the product of the two: TF-IDF=TF*IDF
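The TF and IDF formulas above translate directly into the sketch below. The function names and the three sample documents are assumptions for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` apply a smoothed variant of IDF by default, so their numbers will differ from this plain version.

```python
import math


def term_frequency(word: str, document: str) -> float:
    """Occurrences of the word divided by the total number of words in the document."""
    tokens = document.split()
    return tokens.count(word) / len(tokens)


def inverse_document_frequency(word: str, documents: list[str]) -> float:
    """Natural log of (number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in documents if word in doc.split())
    # Assumes the word occurs in at least one document, otherwise this divides by zero.
    return math.log(len(documents) / containing)


def tf_idf(word: str, document: str, documents: list[str]) -> float:
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


docs = ["the film was great", "the film was dull", "the cast was great"]

print(tf_idf("the", docs[0], docs))             # 0.0, because "the" occurs in every document
print(round(tf_idf("dull", docs[1], docs), 4))  # 0.2747, because "dull" occurs in one document only
```

The word "the" receives a weight of zero because log_e(3/3) = 0, while the rarer word "dull" is weighted 0.25 * log_e(3).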
Machine Learning Models:
Difference in APPROACH/METHOD between your project and main
projects of your references:
Fig8 N-gram Accuracy Scores
Fig10 Bag of Words F1 scores
Fig11: N-gram F1 scores
RESULTS
Comparing the different models in the graphs above shows that
Logistic Regression has the highest accuracy, whereas the Decision Tree has the
lowest.
Similarly, when compared to the other models, Logistic Regression has a high
F1 score, while the Decision Tree has the lowest.
Fig13
CHAPTER—5 SYSTEM REQUIREMENTS
PC
Windows 11, 10, 8
Python latest version 3.10 or 3.9
Anaconda Prompt Shell or CMD to download required packages
Anaconda Navigator
GITHUB Desktop
PC
Processor- Intel(R) Core (TM) i5-6300U CPU 2.40GHz 2.50 GHz
4GB or 8GB of RAM
System Type-64bit Operating system
CHAPTER—7 CONCLUSION & ANALYSIS
Analysis
What did I do well?
When a review is marked as 'Dislike it', the algorithms classify it as negative. Rather than
reading individual words, a complete line could be read.
With the Decision Tree classifier, N-gram modeling can be optimized for faster results.
To take toxic comment detection to the next level, toxic comments might be
classified as toxic, obscene, threat, menace, insult, or identity hate.
CONCLUSION
Features NB LR RF DT LVSM
Fig15: Chart comparing N-gram, Bag of Words and TF-IDF features across the Naive Bayes, Logistic Regression, Random Forest, Decision Tree and Linear Support Vector Machine classifiers
Features NB LR RF DT LVSM
FIG18
FIG17: Chart comparing the NB, LR, RF, DT and LVSM classifiers
(1) https://cseweb.ucsd.edu/classes/wi15/cse255-a/reports/fa15/003.pdf1
(2) https://uksim.info/icaiet2014/CD/data/7910a212.pdf
(3) https://ieeexplore.ieee.org/document/7351837
(4) https://www.ijcsmc.com/docs/papers/June2016/V5I6201691.pdf
(5) https://arxiv.org/ftp/arxiv/papers/1902/1902.00679.pdf
(8) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng and Christopher
Potts (2011). "Learning Word Vectors for Sentiment Analysis".
(12) Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). "Thumbs up? Sentiment
Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP).
(17) Cambria, Erik; Schuller, Björn; Xia, Yunqing; Havasi, Catherine (2013). "New Avenues in
Opinion Mining and Sentiment Analysis". IEEE Intelligent Systems.
(18) Snyder, Benjamin; Barzilay, Regina (2007). "Multiple Aspect Ranking using the Good Grief
Algorithm". Proceedings of the Joint Human Language Technology/North American Chapter of the
ACL Conference.