
SENTIMENT ANALYSIS ON MOVIE REVIEWS

Natural Language Processing UML602

Project Report

BE Third Year, COE

Submitted by:

101603120 Himanshu Dhiman


101603125 Himanshu Pandey

Submitted to:

Dr. Aashima Sharma

Computer Science and Engineering Department


TIET, Patiala
April, 2019
1. INTRODUCTION

Sentiment Analysis means analyzing the sentiment of a given text or document and categorizing it into a specific class, most commonly positive or negative. In this project the classification is carried out over exactly these two classes. By definition, sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment Analysis is also referred to as Opinion Mining, and it is most commonly applied to social media posts and customer review data.

1.1 Steps involved during sentiment analysis

Figure 1.1
1.2 Libraries used

Natural Language Toolkit (NLTK)

NLTK is a leading platform for building Python programs to work with human language data. It
provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along
with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing,
and semantic reasoning.
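As a quick illustration of these facilities, the hedged sketch below (not taken from the report's code) shows NLTK being used for tokenization, part-of-speech tagging, and stemming; it assumes the 'punkt' and 'averaged_perceptron_tagger' resources have already been downloaded with nltk.download().

```python
# Minimal illustration of NLTK's tokenization, tagging and stemming tools.
# Assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# have been run beforehand.
import nltk
from nltk.stem import PorterStemmer

tokens = nltk.word_tokenize("The movie was surprisingly good and beautifully shot.")
tagged = nltk.pos_tag(tokens)                        # part-of-speech tags
stems = [PorterStemmer().stem(t) for t in tokens]    # stemmed word forms

print(tagged)
print(stems)
```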

2. STEPS OF WORKING

In this project, NLTK’s movie_reviews corpus is used as the labeled training data. The movie_reviews corpus contains 2,000 movie reviews, each already labeled with its sentiment polarity; the two categories for classification are positive and negative. The reviews are classified using a supervised classification technique, in which the classifier is trained on labeled training data.
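The sketch below shows one common way of loading this corpus as (words, label) pairs; it is a minimal illustration rather than the project's exact code, and assumes the corpus has been downloaded with nltk.download('movie_reviews').

```python
# Load the labeled movie_reviews corpus as (list_of_words, category) pairs.
# Categories are 'pos' and 'neg'; there are 1,000 reviews of each.
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

print(len(documents))      # 2000 labeled reviews
print(documents[0][1])     # label of the first review
```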

The figure below depicts the workflow followed during training and testing of the model.

Figure 2.2
2.1 Pre-processing of data

Three different approaches are used to pre-process the data in order to achieve maximum training and testing accuracy.

2.1.1 Using 2000 most frequently occurring words:

1. Convert the movie review data into a useful format

2. Remove stopwords and punctuation

3. Create word features using the 2,000 most frequently occurring words

2.1.2 Bag of words feature

1. Create separate lists for positive and negative reviews

2. Shuffle both lists separately and include an equal number of reviews from each class

3. Train the classifier and test the model

2.1.3 n-gram feature

1. Create separate lists for positive and negative reviews

2. Define two feature-extraction functions:

bag_of_words: extracts only unigram features from the movie review words

bag_of_ngrams: extracts only bigram features from the movie review words

3. Define another function, bag_of_all_words, that combines both unigram and bigram features

4. Train the classifier and test the model

2.2 Training of model

The model is trained using NLTK’s Naïve Bayes Classifier, which is built into the module. It is a simple, fast classifier that performs well on small datasets. It is a probabilistic classifier based on applying Bayes’ theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event.
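In symbols, for a class c (positive or negative) and a feature set F = {f1, ..., fn}, Bayes' theorem gives P(c | F) = P(c) · P(F | c) / P(F), and the "naïve" independence assumption lets the classifier approximate P(F | c) as the product P(f1 | c) · ... · P(fn | c), choosing whichever class makes P(c | F) largest.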
2.3 Testing of model

The model accuracy is tested on training data as well as on custom data input by the user.

3. CODE

Pre-processing of data

The code shown below builds a frequency distribution of all the words in the corpus, removes stopwords and punctuation from the text, and adds the cleaned words to a new list.

Figure 3.1
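Since Figure 3.1 was included as a screenshot, the sketch below is only a possible reconstruction of this step; the variable names (all_words_clean, all_words_freq) are assumptions, and it requires the 'stopwords' and 'movie_reviews' NLTK resources.

```python
# Build a frequency distribution over the corpus vocabulary after removing
# stopwords and punctuation. Variable names are illustrative.
import string
from nltk import FreqDist
from nltk.corpus import movie_reviews, stopwords

stop_words = set(stopwords.words("english"))
punctuation = set(string.punctuation)

all_words_clean = [w.lower() for w in movie_reviews.words()
                   if w.lower() not in stop_words and w not in punctuation]

all_words_freq = FreqDist(all_words_clean)
print(all_words_freq.most_common(10))    # ten most frequent cleaned words
```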
Creating document feature using top-N occurring words

The code shown below creates the document features using the 2,000 most frequently occurring words, then trains the model with the Naïve Bayes classifier and prints the accuracy of the model.

Figure 3.2
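The original code appears only as a screenshot (Figure 3.2), so the following is a hedged reconstruction that reuses the documents and all_words_freq variables from the sketches above; the helper name document_features and the 100-review test split are assumptions.

```python
# Top-N feature extraction and training: one boolean feature per word in the
# 2,000 most frequent cleaned words, indicating whether the review contains it.
import random
from nltk import NaiveBayesClassifier, classify

word_features = [w for w, _ in all_words_freq.most_common(2000)]

def document_features(document_words):
    words = set(document_words)
    return {f"contains({w})": (w in words) for w in word_features}

random.shuffle(documents)
feature_sets = [(document_features(words), label) for words, label in documents]
train_set, test_set = feature_sets[100:], feature_sets[:100]

classifier = NaiveBayesClassifier.train(train_set)
print(classify.accuracy(classifier, test_set))
```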
Creating feature word using bag of words method

The code shown below separates the reviews into positive and negative lists, which keeps the two classes balanced when the data is later split, and then pre-processes the data.

Figure 3.3
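A possible version of this step is sketched below (Figure 3.3 was a screenshot); the list names pos_reviews and neg_reviews and the simple unigram bag_of_words helper are assumptions.

```python
# Keep positive and negative reviews in separate lists so the later
# train/test split can stay balanced between the two classes.
from nltk.corpus import movie_reviews

pos_reviews = [list(movie_reviews.words(fileid))
               for fileid in movie_reviews.fileids('pos')]
neg_reviews = [list(movie_reviews.words(fileid))
               for fileid in movie_reviews.fileids('neg')]

def bag_of_words(words):
    # Unigram bag-of-words features: every word maps to True.
    return {word: True for word in words}

pos_features = [(bag_of_words(review), 'pos') for review in pos_reviews]
neg_features = [(bag_of_words(review), 'neg') for review in neg_reviews]
```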
Bi-Gram Feature

In the bag of words feature extraction we used only unigrams. In the example below we use both unigram and bigram features, i.e. we deal with both single words and pairs of adjacent words.

Figure 3.4
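The function names bag_of_words, bag_of_ngrams and bag_of_all_words below follow section 2.1.3 of this report, but since Figure 3.4 was a screenshot, the bodies (including the use of nltk.ngrams and the stopword filtering details) are assumptions.

```python
# Combined unigram + bigram feature extraction.
import string
from nltk import ngrams
from nltk.corpus import stopwords

stopset = set(stopwords.words('english')) | set(string.punctuation)

def bag_of_words(words):
    # Unigram features.
    return {word: True for word in words}

def bag_of_ngrams(words, n=2):
    # Bigram features: each pair of adjacent words becomes one feature.
    return {gram: True for gram in ngrams(words, n)}

def bag_of_all_words(words):
    # Stopwords are removed only for the unigram part so that bigrams
    # keep the original word order intact.
    cleaned = [w for w in words if w.lower() not in stopset]
    features = bag_of_words(cleaned)
    features.update(bag_of_ngrams(words))
    return features
```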
Training the model

After pre-processing, the created feature sets are trained using NLTK’s Naïve Bayes classifier.

Figure 3.5

Figure 3.6
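Figures 3.5 and 3.6 were screenshots, so the sketch below is only one plausible version of the training and custom-review testing stage; it reuses pos_reviews, neg_reviews and bag_of_all_words from the earlier sketches, and the 80/20 split is an assumption.

```python
# Train NLTK's Naive Bayes classifier on the combined unigram + bigram
# features and test it on held-out reviews and on a custom review.
import random
from nltk import NaiveBayesClassifier, classify
from nltk.tokenize import word_tokenize

pos_features = [(bag_of_all_words(review), 'pos') for review in pos_reviews]
neg_features = [(bag_of_all_words(review), 'neg') for review in neg_reviews]

random.shuffle(pos_features)
random.shuffle(neg_features)

train_set = pos_features[200:] + neg_features[200:]   # 1,600 reviews
test_set = pos_features[:200] + neg_features[:200]    # 400 reviews

classifier = NaiveBayesClassifier.train(train_set)
print("Accuracy:", classify.accuracy(classifier, test_set))

# Classify a custom review typed in by the user.
custom_review = "The movie was a wonderful experience with brilliant acting."
custom_features = bag_of_all_words(word_tokenize(custom_review.lower()))
print("Predicted label:", classifier.classify(custom_features))
```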
4. Results

top-N most frequently occurring words –

Figure 4.1

We can see that custom negative reviews are categorized accurately, but for a custom positive review we get inaccurate results.

In the top-N feature approach, only the top 2,000 words were used in the feature set.

We combined the positive and negative reviews into a single list, shuffled it, and then split it into train and test sets.

This approach can result in an uneven distribution of positive and negative reviews across the train and test sets.
Bag of words Feature –

Figure 4.2

Using the bag of words feature we now get correct results on the custom test reviews, but the overall accuracy of the model drops to 70%.
Bi-gram Feature –

Figure 4.3

The accuracy of the classifier increased significantly when it was trained with the combined feature set (unigram + bigram).

Accuracy was 70% while using only unigram features.

Accuracy increased to 77% while using the combined (unigram + bigram) features.
5. Applications & Future Scope

5.1 Brand Monitoring - also called reputation management. A good reputation matters greatly today, since most consumers check social media and review sites before making a purchase decision.

5.2 Customer support - Social media channels are now a primary line of communication with customers, and whenever they are unhappy about something related to a brand, whether or not it is the brand's fault, they will call it out on Facebook, Twitter, or Instagram.

Such mentions can be flagged in a monitoring dashboard, and the support team should engage with them as soon as they appear.

People nowadays expect brands to respond on social media almost immediately; if the response is not quick enough, customers may simply move on to competitors instead of waiting for a reply.
