Technical Report · January 2020
DOI: 10.13140/RG.2.2.21026.96965


Sentiment Analysis of Hotel Reviews - Performance Evaluation of
Machine Learning Algorithms
732A92 - Text Mining

Saman Zahid-samza595

March 19, 2020

1 ABSTRACT
Customer reviews of hotels have become an important part of travel planning. People prefer to
book hotels that have a high number of positive reviews, and there are many sources where reviews
can be found to gain insight into a hotel's reputation. Customer reviews therefore also play an
important role for business owners seeking to improve their services.
In this project, sentiment analysis is performed on user reviews using three different classifiers:
"Naive Bayes", "Random Forest" and "Support Vector Machine". The performance of these
algorithms is assessed under two different parameter settings each. The reviews are classified with
"positive", "negative" or "average" labels.

Contents

1 ABSTRACT
2 Introduction
3 Theoretical Background on Classification Techniques
   3.1 Naive Bayes
   3.2 Support Vector Machine
   3.3 Random Forest
   3.4 TF-IDF
   3.5 Performance Metrics
       3.5.1 Accuracy
       3.5.2 Precision
       3.5.3 Recall
       3.5.4 F1-Score
4 Data
   4.1 Data Preprocessing
5 Method
6 Classification Result
7 Discussion
8 Conclusion
Reference

2 Introduction
Sentiment analysis, also known as "opinion mining" or "emotion AI", is used to extract and analyze
users' opinions, sentiments, emotions and responses on a given matter. Text mining with Natural
Language Processing (NLP) techniques is often used to analyze responses and reviews in order to
perform sentiment analysis.
With the advancement of technology and the increase in online social interaction, it has become
very important for any business to consider user reviews, as they play an important role in providing
the best possible service to customers. Business owners can use customer reviews to identify the
shortcomings customers point out in their services and make improvements accordingly. Customer
reviews also play a vital role in establishing a company's reputation.
In this project, I have taken review data on 1000 hotels. The purpose is to perform sentiment
analysis at three polarity levels, positive, negative and average, using text classification. I have
chosen three different algorithms, one probabilistic, one non-probabilistic discriminative and one
ensemble method, to check which performs best on this data.
Text classification is not only the process of building a classifier; it also involves the steps required
to clean the data and make it useful for analysis. The steps for text classification are:

Figure 1: Flow Diagram

3 Theoretical Background on Classification Techniques

In this section, the theoretical ideas behind the classification techniques used in this project are
explained.

3.1 Naive Bayes

The naive Bayes classifier uses Bayes' theorem for classification. Naive Bayes makes an independence
assumption: the effect of a predictor on a given class is independent of the other predictors.
From Bayes' theorem:

P(c|D) = P(D|c) * P(c) / P(D)

For the given setting, c = Class (polarity) and D = Review (text document), so:

P(Class|Review) = P(Review|Class) * P(Class) / P(Review)

Naive Bayes does not perform well when there are dependencies between the attributes, but for
text classification it sometimes gives remarkable results. In (Gindl, Weichselbraun, and Scharl
2010), it is highlighted that naive Bayes performs well for text classification because it takes
contextualization into account.
In this project, I have implemented the multinomial naive Bayes classifier, which is suitable for
classification with discrete predictors and for multi-class problems. Here there are 3 labels, and the
discrete predictors are word counts weighted using TF-IDF.
The maximum-posterior decision rule for naive Bayes is given as:

c_max = argmax_c P(D|c) P(c)

where c_max is the class label with the highest posterior probability.
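This step can be sketched with scikit-learn, the library used in this project; the tiny corpus and query below are illustrative only, not the project data.

```python
# Minimal sketch of multinomial naive Bayes on TF-IDF features.
# The example reviews and labels are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = ["great room and friendly staff",
           "dirty bathroom, never again",
           "it was okay"]
labels = ["positive", "negative", "average"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)   # discrete TF-IDF features

clf = MultinomialNB(alpha=1.0)          # alpha is the smoothing parameter
clf.fit(X, labels)

# The query shares "friendly", "staff", "great" with the positive review.
pred = clf.predict(vectorizer.transform(["friendly staff and great view"]))
print(pred[0])
```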

3.2 Support Vector Machine

A non-probabilistic discriminative classifier models the decision directly from the observed data,
learning how to classify from the given training examples. An SVM forms a decision boundary
separating vectors belonging to one class from those of the other classes. Using an optimization
procedure, the SVM maximizes the margin between the classes to keep them separated.
The SVM is a linear classification model but can also solve non-linear problems. SVMs tend to
perform well in text classification because they search for an optimal hyperplane. Text classification
with an SVM is done by representing each document as a vector whose dimension is the number of
unique keywords in the training data. The decision boundary is then the hyperplane that maximizes
the distance between the classes.
For this project, I have used an SVM with a "Linear" kernel and the multi-class strategy set to
"OVR", that is "One vs Rest", as the other multi-class method, "crammer_singer", is more expensive
and is reported not to produce better accuracy in practice. The regularization parameter is kept at
its default value of 1; in one setting all classes are given weight 1, and in the other the class weight
is set to "balanced", which adjusts the weights according to the class frequencies in y.
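The two SVM settings compared here can be sketched with scikit-learn's LinearSVC (one-vs-rest is its default multi-class strategy); the toy reviews below are illustrative only.

```python
# Sketch of the two linear-SVM settings: equal class weights vs. "balanced".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

reviews = ["lovely stay, clean rooms",
           "rude staff and noisy",
           "average breakfast, fine location"]
labels = ["positive", "negative", "average"]

X = TfidfVectorizer().fit_transform(reviews)

# C=1 is the default regularization; one-vs-rest is the default strategy.
svm_default = LinearSVC(C=1.0)
svm_balanced = LinearSVC(C=1.0, class_weight="balanced")  # reweight by label frequency
svm_default.fit(X, labels)
svm_balanced.fit(X, labels)
print(svm_balanced.predict(X).tolist())
```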

3.3 Random Forest

Random forest is an ensemble method that builds a number of decision trees and combines them
into a forest. While growing each tree, random forest randomly selects a subset of the features to
consider at each split. Each tree in the forest predicts a target class based on the selected features,
and each predicted class receives a vote; the class with the most votes is selected as the final
prediction. Random forest is often suggested as a strong classifier with good performance metrics,
and it is well suited to large amounts of data.
For text classification, data preprocessing has already been performed, that is, the features that
contribute most to the sentiment analysis have already been selected. The expectation for random
forest is therefore raised, as there is very little chance of selecting insignificant features while
growing the trees.
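The two forest sizes used in this project (100 and 2000 trees, Section 5) can be sketched as follows; the data is a made-up stand-in for the review features.

```python
# Sketch of the two random-forest settings compared in this project.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["spotless and quiet", "terrible service", "acceptable for one night"]
labels = ["positive", "negative", "average"]
X = TfidfVectorizer().fit_transform(reviews)

# min_samples_split=2 is the minimum number of samples needed to split a node.
rf_small = RandomForestClassifier(n_estimators=100, min_samples_split=2, random_state=0)
rf_large = RandomForestClassifier(n_estimators=2000, min_samples_split=2, random_state=0)
rf_small.fit(X, labels)
print(rf_small.predict(X)[0])
```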

3.4 TF-IDF
TF-IDF transforms the data into term frequency (TF) times inverse document frequency (IDF).
Term frequency, as the name suggests, is the frequency count of each word in a document. Inverse
document frequency is a weighting of that count: term frequency gives equal weight to all the words
in the document, while inverse document frequency gives less weight to commonly occurring words
such as "the", "those", "these", etc. The advantage of the TF-IDF transformation is that it extracts
the features of the documents that are significant for classification.
The IDF as used in this project is given as idf(t) = log((1 + n) / (1 + df)) + 1, where df is the
document frequency of term t and n is the total number of documents. The constant 1 is added to
both numerator and denominator to avoid division by zero, and tf-idf, as mentioned earlier, is
tfidf = tf * idf (Pedregosa et al. 2011).

3.5 Performance Metrics

The performance of a classifier cannot be evaluated by simply checking its accuracy; accuracy alone
does not guarantee the quality of an algorithm. The confusion matrix is an important tool for
determining the performance of a classifier. In this project I have used accuracy as well as precision,
recall and F1-score to determine the performance of each classifier (Sokolova and Lapalme 2009).
These metrics are defined using the elements of the confusion matrix.
A confusion matrix consists of true positive (TP), false positive (FP), true negative (TN) and false
negative (FN) values. A multi-class confusion matrix is a bit tricky to interpret. Its elements are
defined, per class, as:
True Positive: the diagonal value, that is, the intersection of the actual and predicted values for
the same class.
True Negative: the sum of all values except those in the corresponding row and column of that
class.
False Positive: the sum of all values in the corresponding column of that class, excluding TP.
False Negative: the sum of all values in the corresponding row of that class, excluding TP.
For multi-class classification, the stated metrics are given as follows:

3.5.1 Accuracy
Accuracy is the fraction of correct (true positive and true negative) predictions over the total
number of predictions:

Accuracy = Number of True Predictions / Total Predictions

3.5.2 Precision
Precision is the fraction of true positive predictions over the sum of true positive and false positive
predictions. It is given as:

Precision = TP / (TP + FP)

3.5.3 Recall
Recall is also known as sensitivity; sensitivity is the binary-classification term, while recall is more
general. Recall is the ratio of true positive predictions to the sum of true positive and false negative
labels. It is given as:

Recall = TP / (TP + FN)

3.5.4 F1-Score
It is the harmonic mean of precision and recall. It is given as:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
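The per-class TP/FP/FN/TN definitions above can be read off a multi-class confusion matrix directly; the 3x3 matrix below is made up for illustration.

```python
# Derive per-class TP, FP, FN, TN and the four metrics from a confusion matrix
# (rows: actual class, columns: predicted class).
import numpy as np

cm = np.array([[50,  3,  2],
               [ 4, 30,  6],
               [ 1,  5, 40]])

tp = np.diag(cm)                  # diagonal: correct predictions per class
fp = cm.sum(axis=0) - tp          # column sums minus the diagonal
fn = cm.sum(axis=1) - tp          # row sums minus the diagonal
tn = cm.sum() - (tp + fp + fn)    # everything outside that row and column

accuracy = tp.sum() / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 4))
```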

4 Data
The dataset for this project is taken from Kaggle; the data originally comes from "Datafiniti's
Business Database" ("Datafiniti Business Database," n.d.). The dataset initially contained 10000
rows and 26 columns. Each row contains the information related to a hotel, a review of the hotel,
its rating, and information about the reviewer. Many columns are irrelevant for this project, such
as the hotel's address, country, province and postal code, as well as the reviewer's name and
province, the source of the review (URL), and the dates the review was added and seen. I have
therefore kept only the most relevant columns, although some of the others could be used for other
kinds of analysis, such as finding the best hotels in each city.
Since the target of this project is to perform simple sentiment analysis using different classifiers,
regardless of any further categorization, only 6 columns are kept. Figure 2 gives an overview of the
data.

Figure 2: Data Overview

The ratings are converted into class labels:

Label = Positive,  if Rating > 3
        Average,   if Rating = 3
        Negative,  if Rating < 3
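This labeling rule can be sketched with pandas; the column name "rating" is an assumption about the dataset's schema, not taken from the report.

```python
# Map numeric ratings to the three sentiment labels defined above.
import pandas as pd

df = pd.DataFrame({"rating": [5, 3, 1, 4, 2]})  # illustrative ratings

def to_label(rating):
    if rating > 3:
        return "Positive"
    if rating == 3:
        return "Average"
    return "Negative"

df["label"] = df["rating"].apply(to_label)
print(df["label"].tolist())
```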

The following figure describes the actual distribution of labels in the complete dataset.

Figure 3: Distribution of Labels

Text classification is widely performed using many different techniques, and much research on
sentiment analysis is available, such as (Hegde and Padma 2017). The steps applied in this project
to perform sentiment analysis are discussed below:

4.1 Data Preprocessing


Data preprocessing is very important for text classification. It involves the following
steps:
1. Tokenization: Splitting text into tokens of words
2. Data Cleaning:
• Converting text into lowercase letters
• Removing punctuation
• Removing blank spaces
• Removing stop words
3. Feature Selection: Determining word frequency. This step is performed after splitting data
into training and test.
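The cleaning steps above can be sketched as a small function; the stop-word set here is a tiny stand-in for a full list (e.g. NLTK's), and the sample sentence is illustrative.

```python
# Minimal sketch of the preprocessing pipeline: lowercase, strip
# punctuation, tokenize on whitespace, and drop stop words.
import string

STOP_WORDS = {"the", "a", "an", "and", "was", "is", "it"}  # stand-in list

def preprocess(text):
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # punctuation
    tokens = text.split()                                              # tokenize
    return [t for t in tokens if t not in STOP_WORDS]                  # stop words

print(preprocess("The room was Clean, and the staff friendly!"))
```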

5 Method
For this project, the data is split into an 80% training set and a 20% test set. The classifiers are
trained on the training data and predictions are made on the test data.
Three different algorithms are used, each with 2 different settings. There are other parameters
that could be varied, but since every combination of parameters can influence a classifier's
performance greatly, I have decided to compare only 2 different settings for each algorithm.

1. Multinomial naive Bayes is applied with the default setting alpha = 1 and then with alpha
= 0.009.
2. Learning is performed using a "Linear SVM" classifier, that is, a support vector machine
with the kernel set to "Linear". Training is done once with the default settings, i.e. regularization
parameter C = 1 and class weight 1, and once with C = 1 but class weight "balanced" (see
Section 3.2).
3. Random forest is used with 100 and then 2000 trees, with the minimum number of samples
required to split an internal node set to 2.
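The overall experiment, the 80/20 split plus one setting of each classifier, can be sketched as follows; the duplicated stand-in corpus replaces the real review data, so the printed accuracies are not the report's results.

```python
# Sketch of the experiment loop: split, vectorize, fit each model, score.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

reviews = ["great stay"] * 8 + ["awful room"] * 8 + ["it was fine"] * 8
labels = ["positive"] * 8 + ["negative"] * 8 + ["average"] * 8

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=0, stratify=labels)

vec = TfidfVectorizer()
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

models = {
    "MNB (alpha=0.009)": MultinomialNB(alpha=0.009),
    "Linear SVM (balanced)": LinearSVC(C=1.0, class_weight="balanced"),
    "Random Forest (100 trees)": RandomForestClassifier(n_estimators=100, random_state=0),
}
accs = {}
for name, model in models.items():
    model.fit(Xtr, y_train)
    accs[name] = accuracy_score(y_test, model.predict(Xte))
print(accs)
```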

6 Classification Result
The performance of each algorithm with each setting is evaluated on the test data. Predictions
were made on the test set with each classifier in turn, and the performance metrics discussed in
Section 3.5 were calculated.
Figure 3 illustrates the confusion matrix heat maps for all the classifiers.

It can be observed that the only large difference appears in the case of the naive Bayes classifier
with alpha = 1: this value clearly does not provide proper smoothing, and one of the labels,
"average", is not predicted at all. For all the other classifiers, the difference between the two
parameter settings of the same classifier is very small.

Naive Bayes with alpha = 0.009 has the highest number of correctly predicted instances, 1674.
The SVM with class weight 1 ranks second with 1628 correct predictions out of 1995 test instances.
Random forest with default settings also performs very well, with 1625 correct predictions. It is
quite interesting that although naive Bayes with alpha = 1 misses one label entirely, it still makes
1529 correct predictions.
Further assessment can be made by looking at the following graph:

Figure 4: Performance Metrics

7 Discussion
In Figure 4, the performance of every classifier is plotted. For a clearer view of the values, the
table containing the performance metrics of each classifier is also included. It can be seen that in
terms of accuracy and precision, random forest (E, F) performs really well, but it yields very low
recall and F1-score. On the other hand, naive Bayes (B) has the highest accuracy and very good
precision, recall and F1-score.
In (Samal, Panda, and Behera 2017), a performance analysis of supervised learning algorithms
for sentiment analysis is carried out; the best algorithm there turned out to be the SVM with a
linear kernel (LinearSVC). As we can see, the linear SVM (C) performs really well here too, with
81.60% accuracy and 66.5% F1-score, slightly below multinomial naive Bayes. In another study
(Sharif, Hoque, and Hossain 2019), sentiment analysis of restaurant reviews is performed with
several different classifiers, and "Multinomial Naive Bayes" in combination with TF-IDF
outperformed all the other algorithms. In yet another study (Dey et al. 2016), two datasets (movie
reviews and hotel reviews) are used to analyze classifier performance; it was concluded that naive
Bayes performs well for movie reviews but not for hotel reviews.
In this project, the SVM and multinomial naive Bayes seem to perform well. One thing that should
be considered is that the data in this project is skewed: it contains more positively labeled text
than negative or average, which introduces bias. Another problem in the case of multinomial naive
Bayes, as highlighted in (Rennie et al. 2003), is the weight-magnitude error, which is why the
classifier only performs well with an appropriate alpha.

8 Conclusion
In this project, we have performed sentiment analysis on hotel reviews and analyzed the performance
of multinomial naive Bayes, SVM and random forest. It was concluded that, on the basis of
precision, recall, accuracy and F1-score, multinomial naive Bayes with appropriate parameter
settings performs best. It can be said that the simplicity of multinomial naive Bayes does not make
it less effective than any other algorithm. We can observe that both the SVM and multinomial
naive Bayes perform really well compared to the most complex of the three, random forest; the
small amount of data could be the reason for this.
In (Govindarajan 2014), a hybrid method, a combination of naive Bayes, SVM and a genetic
algorithm, is used for sentiment analysis. It would be interesting to apply such a hybrid method to
sentiment analysis of hotel reviews, as not much work has been done on hotel reviews with this
approach.

Reference
“Datafiniti Business Database.” n.d. https://datafiniti.co/products/business-data/.
Dey, Lopamudra, Sanjay Chakraborty, Anuraag Biswas, Beepa Bose, and Sweta Tiwari. 2016.
“Sentiment Analysis of Review Datasets Using Naive Bayes and K-Nn Classifier.” ArXiv
abs/1610.09982.
Gindl, Stefan, Albert Weichselbraun, and Arno Scharl. 2010. “Cross-Domain Contextualization of
Sentiment Lexicons.” In.
Govindarajan, M. 2014. "Sentiment Analysis of Restaurant Reviews Using Hybrid Classification
Method." In.
Hegde, Y., and S. Padma. 2017. "Sentiment Analysis Using Random Forest Ensemble for Mobile
Product Reviews in Kannada." In 2017 IEEE 7th International Advance Computing Conference
(IACC), 777–82. Los Alamitos, CA, USA: IEEE Computer Society. https://doi.org/10.1109/IACC.2017.0160.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al.
2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12:
2825–30.
Rennie, Jason D. M., Lawrence Shih, Jaime Teevan, and David R. Karger. 2003. "Tackling the
Poor Assumptions of Naive Bayes Text Classifiers." In Twentieth International Conference on
Machine Learning, 616–23. Washington, DC. https://www.aaai.org/Papers/ICML/2003/ICML03-081.pdf.
Samal, Biswaranjan, Mrutyunjaya Panda, and Anil Behera. 2017. “Performance Analysis of
Supervised Machine Learning Techniques for Sentiment Analysis.” In. https://doi.org/10.1109/
SSPS.2017.8071579.
Sharif, Omar, Mohammed Hoque, and Eftekhar Hossain. 2019. “Sentiment Analysis of Bengali
Texts on Online Restaurant Reviews Using Multinomial Naïve Bayes.” In, 1–6. https://doi.org/
10.1109/ICASERT.2019.8934655.
Sokolova, Marina, and Guy Lapalme. 2009. “A Systematic Analysis of Performance Measures
for Classification Tasks.” Information Processing & Management 45 (July): 427–37. https:
//doi.org/10.1016/j.ipm.2009.03.002.
