NLP Sentiment Analysis On Movie Reviews With Toxic Comment Detection

Final year project report on
NLP SENTIMENT ANALYSIS ON MOVIE REVIEWS WITH TOXIC COMMENT DETECTION
Submitted to the Department of Data Science and Analytics
in partial fulfilment for the award of the degree of
Bachelor of Science
in
Data Science and Analytics

By
Name: T.ROHITH
USN: 19BSR18022
Under the Guidance of
Mr. Kunal Dey

Professor
DEPARTMENT OF DATA SCIENCE AND ANALYTICS

JAIN UNIVERSITY, JC ROAD CAMPUS
BENGALURU – 560027
May – 2022
CERTIFICATE
This is to certify that the Final Year Project entitled
NLP SENTIMENT ANALYSIS ON MOVIE REVIEWS WITH TOXIC COMMENT DETECTION
has been successfully completed by
T.ROHITH 19BSR18022
as a part of the 6th semester curriculum in Bachelor of Science in
Data Science and Analytics,
Jain University, JC Road Campus,
during the academic year 2021 – 2022
Guided by:
Mr. Kunal Dey
Professor
Dept. of DS, Jain University

Dr. Aarthi Sudarshan
Head of Department
Dept. of DS, Jain University

Examiners (name, signature with date):
Examiner 1:
Examiner 2:
DECLARATION OF AUTHORSHIP AND COMPLIANCE OF ACADEMIC ETHICS
I hereby declare that the project work on “NLP SENTIMENT ANALYSIS ON MOVIE
REVIEWS WITH TOXIC COMMENT DETECTION” has been submitted by me in
partial fulfilment of the requirements for the degree of B.Sc. in Data Science and Analytics. This is
my original work, and all information in this document has been obtained and presented in
accordance with academic rules and ethical conduct. The results presented in this report, or parts
of it, have not been presented for the award of any other degree.
I also declare that no chapter of this manuscript, in whole or in part, has been incorporated in this
report from any earlier work done by others or by me. However, extracts of any literature
used for this report have been duly acknowledged, with details of such literature provided
in the references.
Name: T.ROHITH
USN: 19BSR18022
Signature:
ACKNOWLEDGEMENT
I take this opportunity to acknowledge the guidance received from my professor, my
college administration, and my family and friends during this exciting journey of
researching and working on the final year thesis on “NLP SENTIMENT ANALYSIS ON
MOVIE REVIEWS WITH TOXIC COMMENT DETECTION”. Completing this project
gives me immense satisfaction, and it would not have been possible without my advisors.
I am indebted to, and sincerely thank, my project guide, Mr. Kunal Dey, Professor,
Department of Data Science and Analytics, Jain University, for his time, patience, and valuable
knowledge, and for leading and guiding me throughout this project.
I would like to convey my special thanks to Dr. Aarthi Sudarshan, Professor and
HOD, Department of Data Science and Analytics, Jain University, for this opportunity.
Additionally, I extend my gratitude to all the teaching and non-teaching staff of the
Department of Data Science and Analytics, Jain University, JC Road Campus, for their
encouragement.
I am also grateful to Dr. Asha Rajiv, Principal, Jain University, JC Road Campus, for
her continued support.
Last but not least, my family and friends have been there for me through everything,
and I take this opportunity to thank them for their bolstering presence.
T.ROHITH
ABSTRACT
Sentiment analysis is one of the most researched areas in Natural Language
Processing (NLP). The word ‘sentiment’ can be described as the meaning
of a word or sequence of words that is often associated with an emotion.
Sentiment analysis is the process of computationally recognizing and classifying
an opinion in a piece of text in order to find out the writer’s attitude
towards a topic, a product and so on. The main source of data for
sentiment analysis is social media, where
people express their viewpoints through likes, comments etc. Drawing out
the opinions of a group of people can be very helpful for reaching
conclusions that have business and economic implications. The free expression
of opinions on social media in order to hurt or abuse others is called
conversation toxicity, which must be addressed and removed
promptly. Sentiment analysis, also called opinion mining,
can be defined as investigating the views of people on different topics
and establishing results from that analysis.
Sentiment analysis refers to the natural language processing task of
determining whether a piece of text contains subjective information
and what subjective information it expresses, i.e., whether the attitude
behind the text is positive, negative or neutral. Understanding the
opinions behind user-generated content automatically is of great help for
commercial and political use, among others. The task can be conducted
at different levels, classifying the polarity of words, sentences or entire
documents.
TABLE OF CONTENTS
SL NO CHAPTER NAME
1. CHAPTER – 1
Introduction
What is sentiment analysis?
Problem Statement
Project Pipeline
2. CHAPTER – 2
Findings from the literature review
An overview of sentiment analysis technology
Natural language processing in artificial intelligence
Text preprocessing, Practical Text Analytics
3. CHAPTER – 3
Data set description
Project description
Expanding contractions
Feature Extraction methods used
Machine Learning Models
4. CHAPTER – 4 Results
5. CHAPTER – 5 System requirements
7. CHAPTER – 7 Conclusion & analysis
8. CHAPTER – 8 References
LIST OF FIGURES
Figure 1: Imported dependencies
Figure 2: A typical movie review
Figure 3: Text after cleaning
Figure 4: Text after removing stop words
Figure 5: Stemmed Text
Figure 6: Lemmatized Text
Figure 7: Bag of Words Accuracy scores
Figure 8: N-gram Accuracy scores
Figure 9: TF-IDF Accuracy scores
Figure 10: Bag of Words F1 scores
Figure 11: N-gram F1 scores
Figure 12: TF-IDF F1 scores
Figure 13: Table 1: Accuracy Scores (in %)
Figure 14: F1 Scores (in %)
Figure 15: Chart 1
Figure 16: Chart 2
Figure 17: Model Evaluation Results
Figure 18: Model Evaluation Results
CHAPTER—1. INTRODUCTION
I used natural language processing to implement sentiment analysis on the IMDb movie
review dataset, with the extra functionality of toxic comment detection. The purpose of
this research is to determine the underlying sentiment of a movie review from the
textual material available, that is, to classify whether a reviewer enjoyed the
film. In this project, keyword spotting was used to evaluate the data. Although keyword
spotting is considered the most naïve approach, its accessibility and economy make it
popular. This method assigns text to affect categories using unambiguous affect terms such as
joyful, sad, fearful, and bored. Vectorization and normalization techniques (stemming
and lemmatization) were implemented for data pre-processing. Three feature
extraction methods were applied: Bag of Words, TF-IDF, and N-gram. The
outcomes of five machine learning models were examined: Naive Bayes, Random Forest, Support
Vector Machine (SVM), Logistic Regression, and Decision Tree classifiers.
The Toxic Comment Detection model determines whether or not the input text
is suitable. It uses the Logistic Regression classifier with
N-gram feature extraction.
What is sentiment analysis?
Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique
used to determine whether data is positive, negative or neutral. Sentiment analysis is often
performed on textual data to help businesses monitor brand and product sentiment
in customer feedback and understand customer needs.
Sentiment analysis focuses on the polarity of a text (positive, negative, neutral), but it also
goes beyond polarity to detect specific feelings and emotions (angry, happy, sad, etc.),
urgency (urgent, not urgent) and even intentions (interested vs. not interested).
PROBLEM STATEMENT
• Identify the underlying sentiment of a movie review, find the state of
mind of the reviewer while providing the review and understand if the person
was “happy”, “sad”, “angry” and so on.
In this project, we attempt to construct a movie review
sentiment analysis model that will help overcome the difficulty of
recognising the feelings expressed in reviews.
I used natural language processing to implement sentiment analysis on the
IMDb movie review dataset, with the extra functionality of toxic comment
detection. The purpose of this research is to determine the underlying
sentiment of a movie review from the textual material available, that is,
to classify whether a reviewer enjoyed the film. In this project,
keyword spotting was used to evaluate the data.
PROJECT PIPELINE
The steps involved in the machine learning pipeline are:
1) Import Necessary Dependencies
2) Read and Load the Dataset
3) Exploratory Data Analysis
4) Data Visualization of Target Variables
5) Data Preprocessing
6) Splitting our data into Train and Test Subset
7) Transforming Dataset using TF-IDF Vectorizer
8) Function for Model Evaluation
9) Model Building
10) Conclusion
• Goal: identify the underlying sentiment of a movie review, find the state of mind of
the reviewer while providing the review and understand if the person was
“happy”, “sad”, “angry” and so on.
• Data Preprocessing: removing HTML tags and symbols, removing stop words, stemming,
lemmatization, expanding contractions
• Feature Extraction methods: 1) Bag of Words, 2) N-gram and 3) TF-IDF
• Classifiers: Decision Tree, Naive Bayes, Logistic Regression, Random Forest and Linear
Support Vector Machine
• Evaluation: Accuracy, F1 Score and Confusion Matrix
Fig 1: Imported dependencies
CHAPTER—2. LITERATURE REVIEW
Findings from the Literature Review
The initial work on this dataset was done by Stanford University scholars. Unsupervised learning
was used to cluster words with similar meanings and construct word vectors. They tested multiple
classification algorithms on these word vectors to determine the polarity of the reviews.
This approach is particularly useful in cases where the data has rich sentiment content and is prone
to subjectivity in the semantic affinity of the words and their intended meanings. Apart from the
foregoing, Bo Pang and Peter Turney have contributed significantly to polarity identification in
movie and product reviews. They have also worked on multi-class categorization of reviews
and prediction of the reviewer's rating of the movie or product.
An Overview of sentiment analysis technology:
Author(s): SAAD & YANG (2019)
In 2019, Saad and Yang sought to provide a comprehensive Twitter sentiment analysis
by combining ordinal regression and machine learning techniques. Pre-processing
was incorporated into the proposed paradigm.
Tweets were pre-processed as the initial step, and an effective feature was developed
using the feature extraction model.
SVR, RF, multinomial logistic regression (SoftMax), and DTs
were used in the analysis for
categorizing the sentiment. Furthermore, the Twitter dataset was
used to test the proposed model. The test results revealed that the proposed
model had the highest accuracy and that, when compared to other
approaches, DTs performed well. In 2018, Fang et al. proposed multi-strategy
sentiment analysis models with semantic fuzziness for problem resolution.
The outcomes demonstrated that the proposed model attained high
efficiency.
In 2019, Ray and Chakrabarti [8] introduced a deep learning algorithm
for extracting features from text and analysing the user's sentiment
with respect to each feature. A seven-layer deep CNN was employed for
tagging the features in opinionated sentences. In
order to enhance the performance of the sentiment scoring and feature extraction
models, the authors merged the deep learning
methods with a set of rule-based models. The authors reported that the
suggested method achieved the best accuracy. In 2019, Zhao et al. [9]
offered a novel image-text consistency driven multi-modal
sentiment evaluation model, which explored the correlation between the
text and the image. Later, a multi-modal adaptive sentiment analysis model was
implemented. Mid-level visual features were extracted using the traditional SentiBank
model, and these were
used to represent visual concepts, integrating
social, textual, and visual features in
a machine learning model. The suggested model attained the
best performance when compared with traditional models.
In 2019, Abdi et al. suggested a deep-learning-based technique for
categorizing the user opinions expressed in reviews. They noted that
a unified feature set
representative of sentiment shifter rules, word embedding, sentiment
knowledge, and linguistic and statistical knowledge had not been thoroughly
explored for sentiment analysis. The suggested model used
an RNN consisting of LSTM units to take advantage of sequential
processing and overcame many issues of conventional algorithms. In 2020,
Park et al. [18] designed a deep learning approach for improving
performance. To improve performance, two issues were
addressed: the content attention needed to be sophisticated enough to
merge many attention results non-linearly, and it needed to consider the whole
context in order to handle complex sentences. The test results showed
that the proposed model attained the best performance.
Natural language processing in artificial intelligence:
Author(s): Brojo Kishore Mishra, Raghvendra Kumar (2020)
Journal: Computer Science, Engineering & Technology, Taylor and Francis
This book focuses on natural language processing, artificial intelligence, and
allied areas. Natural language processing enables communication between
people and computers, as well as automatic translation that facilitates easy interaction
with others around the world. The book discusses theoretical work and
advanced applications, approaches, and techniques for computational models
of information and how it is presented by language (artificial, human, or
natural) in other ways. It looks at intelligent natural language processing and
related models of thought, mental states, reasoning, and other cognitive
processes. It explores the difficult problems and challenges related to
partiality, underspecification, and context-dependency, which are signature
features of information in nature and natural languages.
Text preprocessing, Practical Text Analytics:
Author(s): Murugan Anandarajan, Chelsey Hill & Thomas Nolan (2018)
Journal: Part of the Advances in Analytics and Data Science series, Springer, Cham
This chapter starts the process of preparing text data for analysis. It
introduces the choices that can be made to cleanse text data, including
tokenizing, standardizing and cleaning, removing stop words, and stemming.
The chapter also covers advanced topics in text preprocessing, such as
n-grams, part-of-speech tagging, and custom dictionaries. The text preprocessing
decisions influence the text document representation created for analysis.
CHAPTER—3. PROPOSED METHOD
DATASET DESCRIPTION
The Large Movie Review Dataset of IMDb (Internet Movie Database) reviews
contains a total of 50,000 reviews that have been pre-labelled with a “positive”
or “negative” sentiment class based on the review content provided by the
audience. The dataset has been split into 35,000 reviews for training
and 15,000 reviews for testing.
IMDb lets users rate movies on a scale of 1 to 10. To label these reviews, the
data curator has labelled any review with ≤ 4 stars as negative and any
review with ≥ 7 stars as positive. Reviews with 5- or 6-star ratings are
omitted.
A typical review from the dataset might look like this:
Fig 2: A typical movie review
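The labelling rule described in this section can be written as a small function. This is only a sketch of the rule as stated above; the function name `label_from_rating` is a hypothetical helper, not something taken from the dataset's tooling.

```python
from typing import Optional


def label_from_rating(stars: int) -> Optional[str]:
    """Map a 1-10 star rating to a sentiment label.

    Ratings of 4 or less are negative, ratings of 7 or more are positive,
    and ratings of 5 or 6 are left out of the dataset (returned as None).
    """
    if stars <= 4:
        return "negative"
    if stars >= 7:
        return "positive"
    return None  # 5- and 6-star reviews are omitted


for stars in (2, 5, 9):
    print(stars, label_from_rating(stars))
```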
PROJECT DESCRIPTION
Data Pre-processing. Data collected in its raw form can be messy. As a result, the data
should be cleaned before any sort of analytics is run on it. Punctuation marks and HTML elements
were removed using regular expressions (pattern matching). For ease of processing, the
text was also converted to lower case.
Fig 3: Text after cleaning
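The cleaning step described above can be sketched with Python's standard `re` and `html` modules. The exact patterns used in the project are not given in this report, so the regular expressions and the function name `clean_review` below are assumptions chosen for illustration.

```python
import html
import re


def clean_review(text: str) -> str:
    """Strip HTML tags and punctuation, then lower-case the text."""
    text = html.unescape(text)                    # turn entities such as &amp; into characters
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags such as <br />
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # replace punctuation and symbols with spaces
    text = re.sub(r"\s+", " ", text)              # collapse repeated whitespace
    return text.strip().lower()


sample = "This movie was <br /><br />GREAT!!! Loved it &amp; would watch again."
print(clean_review(sample))
```

Note that this pattern also removes apostrophes, so in a sketch like this one contractions would need to be expanded before punctuation is stripped.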
EXPANDING CONTRACTIONS:
Contractions are abbreviated versions of words or syllables. These shorter
forms are generated by deleting
key letters and sounds. In English contractions, one of the vowels is frequently
removed from the word. Two examples are "don't" for "do not" and "I'd" for "I would".
Converting each contraction to its expanded,
original form aids text standardization.
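One common way to expand contractions is a lookup table applied with a regular expression. The sketch below assumes a deliberately tiny, hypothetical mapping (`CONTRACTIONS`); a real project would need a much fuller list, and some contractions are ambiguous ("I'd" can also mean "I had").

```python
import re

# Hypothetical mini mapping for illustration only.
CONTRACTIONS = {
    "don't": "do not",
    "didn't": "did not",
    "can't": "cannot",
    "won't": "will not",
    "i'd": "i would",
    "it's": "it is",
}

# One alternation pattern covering every key, matched case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its (lower-case) expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I'd watch it again, but I don't think it's perfect."))
```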
Removing Stop Words. Stop words are common words like "if," "but," "we,"
"he," "she," and "them," which may typically be omitted from a document
without changing its semantics. Removing them reduces noise in the features,
which typically improves the model's performance.
Fig 4: Text after removing stop words
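Stop-word removal amounts to filtering tokens against a fixed list. The sketch below uses a small, hypothetical `STOP_WORDS` set so that it runs without any downloads; it is not the stop-word list used in the project.

```python
# Hypothetical mini stop-word list for illustration only.
STOP_WORDS = {
    "if", "but", "we", "he", "she", "them",
    "the", "a", "an", "and", "is", "was", "it", "of", "to",
}


def remove_stop_words(text: str) -> str:
    """Drop every whitespace-separated token that appears in STOP_WORDS."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


print(remove_stop_words("the plot was thin but she carried the film"))
```

Many standard stop-word lists also contain negations such as "not", so for sentiment analysis it is worth checking the list before applying it.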
Normalization:
Normalization is the process of cleaning noise from unstructured text for sentiment
analysis by converting all the different
forms of a given word into one, which can further enhance an NLP model.
It is carried out here using the Natural Language Toolkit (NLTK).
Normalization can be done in two ways: stemming and lemmatization.
Stemming:
Stemming is the process of reducing an inflected word to its base form. It can be
done using a variety of algorithms; Porter's algorithm is the most widely
used and has been found empirically effective.
Fig 5: Stemmed Text
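To show what reducing a word to a stem means in practice, here is a deliberately naive suffix-stripping sketch. It is not Porter's algorithm, which applies several ordered rule phases (NLTK provides it as `nltk.stem.PorterStemmer`); the suffix list and the function name `naive_stem` are assumptions made for this illustration.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["watching", "watched", "watches", "movies", "sadly", "bored"]
print([naive_stem(w) for w in words])
```

As the output shows, stems such as "movi" or "bor" need not be dictionary words, which is the main practical difference from lemmatization.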
Lemmatization:
The idea of lemmatization is similar to stemming: to eliminate
inflections and map a word to its root form. In this process, however, words are
transformed into their actual root (lemma), which is a valid dictionary word. For example, the algorithm would identify that the
word ‘better’ is derived from the lemma ‘good’.
Fig 6: Lemmatized Text
Feature Extraction methods used:
Bag of Words:
Bag of Words is a simple way of extracting features from text for
use in machine learning algorithms. It has been used successfully in
document classification and language modelling problems. The model describes the
occurrence of words within a document by counting how often each word occurs in
the text. In Bag of Words, the order of the
words is irrelevant; the model is only concerned with which words occur and how often. The
vocabulary size can be varied, for example 500, 5,000, or 50,000 words.
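A Bag of Words representation can be sketched with `collections.Counter`: build a vocabulary of the most frequent words, then count how often each vocabulary word occurs in a document. The function names and the `max_features` parameter below are hypothetical names chosen for this sketch, with `max_features` playing the role of the vocabulary size mentioned above.

```python
from collections import Counter


def build_vocabulary(docs, max_features):
    """Keep the max_features most frequent words, returned in sorted order."""
    counts = Counter(word for doc in docs for word in doc.split())
    return sorted(word for word, _ in counts.most_common(max_features))


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs; word order is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


docs = ["good movie good cast", "bad movie", "good plot bad ending"]
vocab = build_vocabulary(docs, max_features=3)
print(vocab)
for doc in docs:
    print(bag_of_words(doc, vocab), doc)
```

With this toy corpus the vocabulary is `bad`, `good`, `movie`, and the first document becomes the vector `[0, 2, 1]`.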
N-Gram:
An N-gram is a sequence of N words in a sentence, where N is an integer giving
the number of words in the sequence. The model is widely used in language and speech
processing, where it captures combinations of adjacent
words. When N = 1, N = 2, and N = 3, the sequence is referred to as a unigram, bigram,
and trigram respectively. N-grams are used because, unlike in Bag of Words,
the order in which the words appear is preserved. For example, it is a good
idea to consider bigrams like “New York” instead of splitting them into
individual words like “New” and “York”.
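Generating N-grams is a sliding window over the token list. The helper name `ngrams` below is a hypothetical choice for this sketch.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "i love new york".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams, including "new york"
print(ngrams(tokens, 3))  # trigrams
```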
Term Frequency, Inverse Document Frequency (TF-IDF):
The main goal of this method is to down-weight words that are
repeated across every document. The term frequency (TF), which
measures how often a word occurs in a given document, is given by the formula:
TF = (number of times the word occurs in the document) / (total number of words in the document)
The inverse document frequency (IDF), which measures how rare
the word is across the collection of documents, is given by the formula:
IDF = log_e((total number of documents) / (number of documents containing the word))
The TF-IDF score of a word in a document is the product TF × IDF.
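The two formulas above can be checked by hand on a toy corpus. This sketch implements them directly as written (unsmoothed); library implementations often use smoothed variants, so their values will differ. The function names and the corpus are assumptions for illustration.

```python
import math


def tf(term, doc_tokens):
    """Occurrences of the term divided by the number of words in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Natural log of (number of documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc_tokens in corpus if term in doc_tokens)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


corpus = [doc.split() for doc in ["good movie", "bad movie", "good movie cast"]]

# "movie" occurs in every document, so its IDF (and TF-IDF) is 0.
print(tf_idf("movie", corpus[0], corpus))
# "bad" occurs in one of three documents: 0.5 * ln(3)
print(round(tf_idf("bad", corpus[1], corpus), 4))
```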
Machine Learning Models:
Naïve Bayes Classifier:
Naïve Bayes classifiers are simple probabilistic classifiers based on Bayes'
theorem. They can perform well against more sophisticated
alternatives, particularly for small sample sizes. The classifier assumes that
the features are independent of one another within a class. It has proven to be fast, reliable and
accurate in many NLP applications.
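The independence assumption means the score for a class is just the class prior multiplied by one probability per word. The sketch below is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing, written from scratch on a toy dataset; the smoothing choice, function names and data are assumptions for illustration and not the project's implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(doc, model):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_words = sum(word_counts[label].values())
        for token in doc.split():
            if token not in vocab:
                continue                                  # ignore unseen words
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    "great film loved it",
    "wonderful acting great plot",
    "terrible film hated it",
    "boring plot awful acting",
]
labels = ["positive", "positive", "negative", "negative"]
model = train_nb(docs, labels)
print(predict_nb("great acting", model))
print(predict_nb("boring film", model))
```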
Random Forest Classifier:
A Random Forest classifier builds a large number of individual decision
trees that operate together as an ensemble. Decision trees are the building blocks of the
random forest model: each tree gives a class prediction, and the class
that receives the most votes is chosen as the model's
prediction. This method is expected to give good results because the trees protect
each other from their individual errors.
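The voting step can be illustrated in isolation. The sketch below assumes the individual tree predictions are already available as a list; how the trees are trained is not shown, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Three hypothetical trees disagree; the majority class wins.
print(majority_vote(["positive", "negative", "positive"]))
```

Even though one tree is wrong here, the ensemble prediction is still the majority class, which is the sense in which the trees compensate for each other's errors.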
Support Vector Machine Classifier:
A supervised machine learning algorithm that can be used for both
classification and regression problems, although it is mostly used for
classification. In this method, each item is plotted as
a point in n-dimensional space, where each feature is the value of a specific coordinate,
and classification is performed by finding the hyperplane that best separates the two classes.
Decision Tree Classifier:
The decision tree is a supervised learning technique in which data is split repeatedly according to a
certain criterion. A tree is characterized by two kinds of
entities, nodes and leaves. The data is partitioned at the decision nodes, and the leaves hold the outcomes.
The basic purpose of a decision tree algorithm is to build a model that can
determine the value of the target by applying rules that have
been derived from prior data.
Difference in APPROACH/METHOD between this project and the main
projects in the references:
We used more features and more classifiers to gain insight into
classifying movie reviews.
We used the LSVM and Decision Tree classifiers for comparing accuracy
scores.
We also improved the accuracy and F1 scores by tuning the
hyperparameters.
Difference in ACCURACY/PERFORMANCE between this
project and the main projects in the references:
In the reference paper: with an average accuracy of 89 percent, the Logistic
Regression model outperforms the other models across all feature representations.
In our project: the best performance comes from the Logistic Regression
model, which reaches an accuracy of 89.92 percent (~ 90%) with N-gram features.
Fig 7: Bag of Words Accuracy scores
Fig 8: N-gram Accuracy scores
Fig 9: TF-IDF Accuracy scores
Fig 10: Bag of Words F1 scores
Fig 11: N-gram F1 scores
Fig 12: TF-IDF F1 scores
RESULTS
Comparing the models in the graphs above,
Logistic Regression has the highest accuracy, whereas the Decision Tree has the
lowest.
Similarly, Logistic Regression has the highest
F1 score of the models compared, while the Decision Tree has the lowest.
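For reference, accuracy and F1 score can both be computed from the four cells of a binary confusion matrix. The sketch below uses the standard definitions on a small made-up set of labels; the data and the function name `accuracy_and_f1` are hypothetical and are not the project's results.

```python
def accuracy_and_f1(y_true, y_pred, positive="positive"):
    """Compute accuracy and F1 from true/false positives and negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Made-up labels for illustration: 2 TP, 1 FN, 1 FP, 2 TN.
y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "negative", "negative", "positive"]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(round(accuracy, 4), round(f1, 4))
```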
Toxic Comment Detection Results
Fig 13
CHAPTER—5 SYSTEM REQUIREMENTS
5.1. Software Requirements
PC
• Windows 11, 10 or 8
• Python 3.10 or 3.9
• Anaconda Prompt shell or CMD to download the required packages
• Anaconda Navigator
• GitHub Desktop
5.2. Hardware Requirements
PC
• Processor: Intel(R) Core(TM) i5-6300U CPU @ 2.40 GHz 2.50 GHz
• 4 GB or 8 GB of RAM
• System type: 64-bit operating system
CHAPTER—7 CONCLUSION & ANALYSIS
Analysis
What did I do well?
• Successfully implemented the Toxic Comment Detection.
• Improved data pre-processing with new normalization algorithms
(stemming and lemmatization) and contraction expansion.
• Tuned the regularization parameter (lambda), which
increased performance.
• On the console, user-defined input comments can be checked and labelled
as good or negative.
• Used a variety of classifiers and feature extraction approaches, examined the
best ones, and improved the model's accuracy by 2%.
What could I have done better?
• When a review is marked as 'Dislike it', the algorithms classify it as negative. Rather than
reading individual words, a complete line could be read.
• With the Decision Tree classifier, N-gram modeling can be optimized for faster results.
What is left for future work?
• The outputs of the above-mentioned models can be used to develop
recommender systems for application in the current market.
• To take toxic comment detection to the next level, toxic comments might be
classified as toxic, obscene, threat, menace, insult, or identity hate.
CONCLUSION
Model Evaluation Results
NB = Naïve Bayes, LR = Logistic Regression, RF = Random Forest, DT = Decision Tree, LSVM = Linear Support Vector Machine

Features        NB      LR      RF      DT      LSVM
N-gram          86.65   89.92   82.01   71.56   89.42
Bag of Words    80.77   86.16   83.56   69.77   85.57
TF-IDF          83.06   88.25   79.67   69.82   87.73
Fig 17: Model Evaluation Results

[Fig 15: Chart 1. Bar chart of the scores in the table above for each classifier, with one series each for N-gram, Bag of Words and TF-IDF]

Features        NB      LR      RF      DT      LSVM
N-gram          86.61   89.91   81.62   71.61   89.41
Bag of Words    80.76   86.16   83.62   69.69   85.89
TF-IDF          83.00   88.25   79.87   70.28   87.73
Fig 18: Model Evaluation Results

[Fig 16: Chart 2. Bar chart of the scores in the table above for each classifier, with one series each for N-gram, Bag of Words and TF-IDF]
On a large movie review dataset, we successfully developed sentiment analysis using
natural language processing, with the addition of toxic comment detection. The best
model is the Logistic Regression model, which has the highest accuracy and F1 scores
across all feature extraction representations. The Naive Bayes classifier combined with
N-gram or TF-IDF features can be used to further analyze the findings, but because N-gram is slow
to process, TF-IDF is the next best approach for optimum results. Based on its lowest
scores, the Decision Tree classifier is not suggested for this data.
CHAPTER—8 REFERENCES
(1) https://cseweb.ucsd.edu/classes/wi15/cse255-a/reports/fa15/003.pdf1
(2) https://uksim.info/icaiet2014/CD/data/7910a212.pdf
(3) https://ieeexplore.ieee.org/document/7351837
(4) https://www.ijcsmc.com/docs/papers/June2016/V5I6201691.pdf
(5) https://arxiv.org/ftp/arxiv/papers/1902/1902.00679.pdf
(6) Sentiment Analysis – Wikipedia – https://en.wikipedia.org/wiki/Sentiment_analysis
(7) Large Movie Review Dataset – http://ai.stanford.edu/~amaas/data/sentiment/
(8) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng and Christopher
Potts (2011). Learning Word Vectors for Sentiment Analysis
(9) Internet Movie Database – http://www.imdb.com/
(10) Cross Validation – Wikipedia – https://en.wikipedia.org/wiki/Crossvalidation_%28statistics%29
(11) NLTK Stopwords Corpus: http://www.nltk.org/book/ch02.html
(12) Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). "Thumbs up? Sentiment
Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP).
(13) Turney, Peter (2002). "Thumbs Up or Thumbs Down? Semantic Orientation Applied to
Unsupervised Classification of Reviews". Proceedings of the Association for Computational
Linguistics.
(14) Tumasjan, Andranik; O. Sprenger, Timm; G. Sandner, Philipp; M. Welpe, Isabell (2010).
"Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment".
Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media
(15) Natural Language Processing from Scratch –
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35671.pdf
(16) Scikit-learn API Reference: http://scikit-learn.org/stable/modules/classes.html
(17) Cambria, Erik; Schuller, Björn; Xia, Yunqing; Havasi, Catherine (2013). "New Avenues in
Opinion Mining and Sentiment Analysis". IEEE Intelligent Systems
(18) Snyder, Benjamin; Barzilay, Regina (2007). "Multiple Aspect Ranking using the Good Grief
Algorithm". Proceedings of the Joint Human Language Technology/North American Chapter of the
ACL Conference
