
Rating Prediction Based On Yelp’s User Reviews: A Hybrid Approach
Bhargav Ramprasad Panth (16340549), panthb@tcd.ie
Maaz Hasan (17319658), mhasan@tcd.ie
Onurcan Onder (17314870), ondero@tcd.ie
Some Aditya Mandal (17311198), mandals@tcd.ie

Abstract
The proliferation of social media has dramatically affected the behavior of the advertising industry.
Today, a remarkable proportion of social media platforms are used like advertising billboards.
There are many reasons behind the popularity of social media marketing, and one of the biggest is
that it allows users to write reviews and opinions about the presented product (food, place, etc.).
However, since these social media platforms do not provide a proper rating system for such
advertisements, further evaluation and analysis of that huge amount of data is difficult, or even
impossible, with basic techniques. In this paper, a hybrid approach is proposed to predict user
ratings with reasonably good accuracy.

Keywords— semantics, natural language processing, classification, machine learning, prediction,
naive bayes, logistic regression, support vector machine

1 Introduction
There has been interest in implementing more advanced rating prediction systems using different
techniques and approaches such as machine learning, text mining and semantic analysis. However, there
is still no commonly accepted accurate prediction system for this problem. Accordingly, this paper
proposes a hybrid rating prediction technique based on user reviews. As a part of this study, the
effectiveness of different techniques and approaches is analyzed and substantiated. Finally, an
effective hybrid system based on the observations is developed. In this system, stop words are removed
and stemming is applied for feature selection. Then a unique lexicon is created using the Yelp user
reviews dataset. Finally, three different machine learning classifiers are applied, evaluated and
analyzed.

2 Related Work
Various techniques and systems have been described in the literature on user rating prediction. Re-
searchers have been looking into ways of improving the accuracy of the predictions.
To this end, Channapragada and Shivaswamy (2015) developed a system that predicts the rating
of a business based on the user review. They used Linear Regression as the regression algorithm, and
Support Vector Machine (SVM) and Naive Bayes as the classification algorithms. They examined the
Yelp dataset and used it to evaluate their system. Even from visualizations of the dataset alone, they
were able to make some assumptions. For example, the data shows a consistent decrease in the rating
as the number of words in a review increases (Fig. 1). As a further pre-processing step, they
calculated polarity weights for the words, e.g., "worst" is the most negative word and "incredible"
is the most positive one. Finally, as a part of their project, they compared these three machine
learning algorithms with different feature combinations from the dataset. They used 80% of the data
for training and the rest for testing. Experiments showed that, since linear regression outputs values
that fall between two integers, its mean squared error was lower than that of the classification
algorithms. However, it was not meaningful to give a non-integer rating for a 5-star review.
In the comparison between the two classification algorithms, SVM always gave a slightly better result
with all the features, because the Naive Bayes algorithm falsely assumes conditional independence
between features.

Figure 1: Frequency Distribution of Length Review vs Average Rating, from Channapragada and
Shivaswamy (2015)

Wang (2015) used the Yelp user reviews and applied Sentiment Analysis. Sentiments were
predicted using Naive Bayes, multi-class SVM, and the Perceptron learning algorithm. In all of the
models used in that study, it was observed that removing stop words and common symbols and applying
stemming reduced the chances of multicollinearity and reduced the dimension of the feature space.
Significantly lower training error was achieved by doing so. The study concludes with a remark from
the author that the multi-class SVM and Naive Bayes algorithms were less accurate than the Perceptron
learning algorithm. The author also notes that adding regularization terms and tuning parameters
through cross-validation can improve performance on the test set.
Ganu et al. (2009) focused on identifying information structure and sentiment from free-form
text reviews to predict the rating. The authors extracted their corpus of over 50,000 restaurant
reviews from Citysearch New York. First, they analyzed the data to identify categories which are
specific to the restaurant reviews domain using 7-fold cross-validation. They were able to identify
the following six categories: Food, Service, Price, Ambience, Anecdotes, and Miscellaneous. To
classify the sentences into the above-mentioned categories and sentiment classes, they manually
annotated a training set of approximately 3,400 sentences with both category and sentiment
information. They trained and tested SVM classifiers on their manually annotated data (one classifier
for each topic and one for each sentiment type). Then they performed 7-fold cross validation with
accuracy, precision and recall metrics to observe the performance of their classification. They
performed an in-depth analysis of the corpus of 52,264 user reviews in order to study the relation
between the textual structure of the reviews and the metadata entered by the reviewers, such as the
star rating. Then, they compared the star rating with the sentiment annotation produced by their
classifier using the Pearson correlation coefficient. The coefficient ranges from -1 to 1, with -1
for perfect negative correlation, 1 for perfect positive correlation and 0 for no correlation.
Their results showed a positive correlation (0.45) between the star rating and the percentage of
positive sentences in the review, and a negative correlation (-0.48) between the star rating and the
percentage of negative sentences. For rating prediction, they experimented with different prediction
strategies and used the popular Mean Squared Error (MSE) metric to evaluate them. For example, in one
of their experiments, they based the computation of the text rating on the number of positive and
negative sentences in the review, either review-based, topic-based or rating-based. They further
used multivariate regression to model the user-provided star rating as the dependent variable, with
the sentence types, represented as (category, sentiment) pairs, as the independent variables. They
concluded that predicting the regression-based text ratings is more difficult than predicting the
sentiment-based text ratings and results in high MSE values.

3 Implementation & Design


3.1 Dataset & Preprocessing
The "Yelp Dataset Challenge" dataset was selected for this research. The Yelp dataset has been
published for the study of photo classification, graph mining, and natural language processing &
sentiment analysis. Accordingly, it contains many subsets. For this research, the review dataset,
which includes more than 5,200,000 user reviews spanning 11 metropolitan areas, is examined. A Python
script is implemented to parse the reviews JSON data file. During the parsing process, only star
ratings and text reviews are taken into consideration; all other information is ignored. The raw data
is stored in three different dictionaries on the basis of review, sentiments and stars.
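
The parsing step can be sketched roughly as follows. This is a minimal illustration rather than the
script used in this study: it assumes the review file holds one JSON object per line with "stars" and
"text" fields, and the function name parse_reviews and the integer keys are hypothetical.

```python
import json


def parse_reviews(lines):
    """Keep only the star rating and review text of each JSON record.

    Assumes one JSON object per line with "stars" and "text" fields;
    every other field is ignored.
    """
    reviews, sentiments, stars = {}, {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        key = len(reviews)  # hypothetical running integer key
        reviews[key] = record["text"]
        stars[key] = int(record["stars"])
    # `sentiments` is returned empty here; it is filled once a polarity
    # rule has been applied to the star ratings.
    return reviews, sentiments, stars


# Two made-up records standing in for lines of the real data file.
sample_lines = [
    '{"review_id": "a1", "stars": 5, "text": "Great food!", "useful": 0}',
    '{"review_id": "b2", "stars": 2, "text": "Slow service.", "useful": 1}',
]
reviews, sentiments, stars = parse_reviews(sample_lines)
```

For the full dataset, the same function could be fed an open file object, so that the reviews are read
one line at a time instead of being loaded into memory at once.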
In the data pre-processing phase, the entire text is converted into lowercase to reduce redundancy in
subsequent feature selection. Several regular expressions are then applied, followed by the removal of
punctuation and extra white space from the review text.
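
A minimal sketch of this cleaning step is given below. The exact regular expressions used in this
study are not listed, so the two patterns here are assumptions, as is the choice to replace
punctuation with a space rather than delete it.

```python
import re
import string

# Assumed patterns: any ASCII punctuation character, and any run of whitespace.
PUNCTUATION_RE = re.compile("[" + re.escape(string.punctuation) + "]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Lowercase a review, drop punctuation and normalise white space."""
    text = text.lower()
    text = PUNCTUATION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


cleaned = clean_text("  The FOOD was great!!!  Really,   really great. ")
print(cleaned)  # the food was great really really great
```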

3.2 Feature Selection


Several feature selection algorithms are implemented by building the feature dictionary from the
training data with some additional variations. The feature selection algorithms loop over the training
set word by word and build a lexicon that maps each word to its frequency of occurrence in the
training set. Instead of using the lexicon from Ding et al. (2008), a unique lexicon is built for this
research, taking the dataset and the study into account.
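
The lexicon construction can be illustrated with the short sketch below. It assumes the reviews have
already been cleaned and can be split on white space; the min_count threshold is a hypothetical
parameter added for illustration and is not part of the described method.

```python
from collections import Counter


def build_lexicon(training_texts, min_count=1):
    """Map each word of the training set to its frequency of occurrence."""
    counts = Counter()
    for text in training_texts:
        counts.update(text.split())  # loop over the training set word by word
    # min_count is a hypothetical cut-off for dropping very rare words.
    return {word: n for word, n in counts.items() if n >= min_count}


lexicon = build_lexicon(["great food great service", "bad food"])
print(lexicon)
```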
There are advantages to using an existing lexicon such as Bing Liu's, since there is no need to loop
over the dataset. Furthermore, the feature set consists exclusively of adjectives that carry sentiment
meaning. However, there is a notable disadvantage too. Since the features are not extracted from the
Yelp dataset, irrelevant features may be included while relevant features might not be selected. For
example, many words in the text reviews are misspelled but still contain sentiment information.
Using such a small feature set may cause the problem of high bias. Therefore, in the scope of this
research, a unique feature set is built based on the user text reviews. In addition, some variations
of our process are implemented: (1) no pre-processing or changes; (2) removing English stop words
(i.e. extremely common words) from the feature set using the stop word list available in the
Natural Language Toolkit (NLTK) corpus; (3) stemming (i.e. reducing a word to its stem/root form) to
remove repetitive features, using the Snowball Stemmer algorithm, which is built into NLTK.
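
The three variations can be sketched as one function with two switches. This is an illustration
under assumptions: the small STOP_WORDS set is a stand-in for NLTK's English stop word corpus (which
must be downloaded separately), and the function name select_features is hypothetical. The stemmer is
NLTK's SnowballStemmer.

```python
from nltk.stem.snowball import SnowballStemmer

# Stand-in for NLTK's English stop word list, shortened for illustration.
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "it", "to", "of"}
stemmer = SnowballStemmer("english")


def select_features(tokens, remove_stop_words=True, stem=True):
    """Apply optional stop word removal and stemming to a token list."""
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens


tokens = "the reviews were running late".split()
unchanged = select_features(tokens, remove_stop_words=False, stem=False)  # variation (1)
no_stop_words = select_features(tokens, stem=False)                        # variation (2)
stemmed = select_features(tokens)                                          # variation (3), after (2)
```

Stemming maps inflected forms such as "reviews" and "running" onto a shared stem, which is how it
removes repetitive features from the lexicon.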
According to Wang (2015), building the feature set from the Yelp dataset improves both precision and
recall significantly. However, looping over the training set to select relevant features can slow down
processing when the training data is large. On the other hand, when looping over a small training set,
the selected features might have high bias and may not be as effective on the entire Yelp dataset.
According to Wang (2015), a higher rating implies a more positive emotion from the user towards the
business. Accordingly, for the first basic sentiment analysis a simple rule is used: if the star
rating is greater than 3, the value 1.0 is assigned, which is interpreted as a "Positive" sentiment;
otherwise 0.0 is assigned for "Negative" sentiment. Cross validation is used and the algorithms are
executed on a sample of 100,000 reviews. The sample set is randomly split into training (70% of the
data) and test (the remaining 30%) sets.
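
The labelling rule and the random 70/30 split can be sketched as follows. The function names and the
fixed seed are assumptions made so the example is reproducible; they are not taken from the original
implementation.

```python
import random


def polarity_label(stars):
    """Return 1.0 (positive) for ratings above 3, otherwise 0.0 (negative)."""
    return 1.0 if stars > 3 else 0.0


def split_70_30(items, seed=0):
    """Randomly split items into 70% training and 30% test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is a hypothetical choice
    cut = (len(shuffled) * 7) // 10
    return shuffled[:cut], shuffled[cut:]


labels = [polarity_label(s) for s in [1, 2, 3, 4, 5]]
train_ids, test_ids = split_70_30(range(100))
```

Note that under this rule a 3-star review is labelled as negative.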

3.3 Machine Learning


Three different machine learning algorithms are implemented and examined: Naive Bayes, SVM and
Logistic Regression.
The Naive Bayes algorithm in the scikit-learn machine learning library is used to predict the star
ratings for the user reviews. Naive Bayes is traditionally used for text classification and has proved
well suited to it. In our Naive Bayes algorithm, a review is represented by a feature vector whose
length is equal to the number of words in the dictionary. In addition, a variation of Multinomial
Naive Bayes is implemented: instead of counting the frequency of occurrence of the words, 1 or 0
values are used to denote whether the word occurred or not. This is motivated by the belief that word
occurrence may matter more than frequency.
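
One way to realise the 0/1 representation with scikit-learn is sketched below. It is an assumption
that the binary features are produced with CountVectorizer(binary=True) and passed to MultinomialNB;
the toy reviews and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: 1 = positive, 0 = negative.
train_texts = [
    "great food great service",
    "awful food awful service",
    "great place",
    "awful place",
]
train_labels = [1, 0, 1, 0]

# binary=True records whether a word occurred (1) or not (0),
# instead of how many times it occurred.
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

predictions = model.predict(vectorizer.transform(["great", "awful"]))
print(list(predictions))
```

Setting binary=False in the vectorizer would restore the frequency-based representation, which makes
the two variants easy to compare on the same data.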
The second machine learning algorithm used in this research is Multi-Class SVM. Multi-Class SVM is a
generalization of SVM in which the labels are not binary but are drawn from a finite set of several
elements. The OneVsRestClassifier is used; this strategy consists of fitting one classifier per class.
For each classifier, the class is fitted against all the other classes. In addition to its
computational efficiency, one advantage of this approach is its interpretability. Since each class is
represented by one and only one classifier, it is possible to gain knowledge about the class by
inspecting its corresponding classifier.
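
The one-vs-rest strategy can be sketched with scikit-learn as shown below. The base estimator is
not specified in this paper, so LinearSVC is an assumption, and the toy reviews are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Invented toy data with three star classes.
train_texts = [
    "terrible rude awful",
    "awful dirty terrible",
    "okay average fine",
    "fine okay decent",
    "excellent amazing perfect",
    "amazing excellent friendly",
]
train_stars = [1, 1, 3, 3, 5, 5]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# One binary LinearSVC (an assumed base estimator) is fitted per class,
# each separating its class from all the others.
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(X_train, train_stars)

predicted = classifier.predict(vectorizer.transform(["excellent amazing meal"]))

# Inspect each per-class classifier: the word with the largest weight.
index_to_word = {i: w for w, i in vectorizer.vocabulary_.items()}
for star, estimator in zip(classifier.classes_, classifier.estimators_):
    weights = estimator.coef_.ravel()
    print(star, index_to_word[int(weights.argmax())])
```

The final loop illustrates the interpretability argument: each class has its own weight vector, so
the words that push a review towards that class can be read off directly.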
The third algorithm is Logistic Regression. Logistic Regression is a statistical method for analyzing
a dataset in which one or more independent variables determine an outcome. In its basic form the
outcome is measured with a dichotomous variable; in this case, the outcome has between two possible
values (polarity) and five possible values (star ratings). The goal was to find the best fitting model
to describe the relationship between the outcome of interest (the rating of a review) and a set of
independent variables (lexicon features).
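
A minimal sketch of the five-class case with scikit-learn follows. The toy reviews and the
max_iter setting are assumptions for illustration and do not reflect the configuration used in the
experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data: one review per star rating.
train_texts = [
    "terrible rude awful",
    "bad slow disappointing",
    "okay average fine",
    "good tasty friendly",
    "excellent amazing perfect",
]
train_stars = [1, 2, 3, 4, 5]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# max_iter is raised as a precaution against convergence warnings (assumed value).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_stars)

# One probability per star class for an unseen review.
probabilities = model.predict_proba(vectorizer.transform(["amazing friendly staff"]))
print(dict(zip(model.classes_, probabilities[0].round(3))))
```

For the polarity task the same code applies with the five star labels replaced by 0/1 labels.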

4 Evaluation & Results


4.1 Evaluation Metrics
To measure the performance of the rating prediction system and present the results, the Accuracy,
Precision, Recall and F1-Score evaluation metrics are used. The prediction is compared with the
metadata star rating to determine its correctness. Accuracy, Precision, Recall and F1-Score are
calculated respectively by the equations below:
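\[ \text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \]

\[ \text{Precision} = \frac{tp}{tp + fp} \]

\[ \text{Recall} = \frac{tp}{tp + fn} \]

\[ \text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]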

where tp, tn, fp and fn are the numbers of True Positives, True Negatives, False Positives, and False
Negatives respectively. Accuracy is the proportion of all predictions that are correct. Precision is
the proportion of instances predicted as positive that are actually positive, and Recall is the
proportion of actually positive instances that are predicted as positive.
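
As a worked illustration, the four metrics can be computed directly from the counts. The function
name and the example counts below are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F-measure from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(metrics)
```

With these counts, accuracy is 70/100 = 0.7, precision is 40/50 = 0.8, recall is 40/60 (about
0.667), and the F-measure is 80/110 (about 0.727).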
In the predictive analysis, a confusion matrix is used as another evaluation criterion. A confusion
matrix is a table with multiple rows and columns that reports the number of false positives, false
negatives, true positives, and true negatives. This allows a more detailed analysis than a mere
proportion of correct classifications such as accuracy.
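
A small sketch of how such a table can be built for the five star classes is given below. The
layout (rows for actual ratings, columns for predicted ratings) is an assumed convention, and the
ratings are invented.

```python
def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Invented ratings for seven reviews.
actual = [1, 2, 3, 4, 5, 5, 4]
predicted = [1, 3, 3, 5, 5, 4, 4]
matrix = confusion_matrix(actual, predicted, labels=[1, 2, 3, 4, 5])

for row in matrix:
    print(row)

# Correct predictions lie on the diagonal.
correct = sum(matrix[i][i] for i in range(len(matrix)))
```

In this toy example the off-diagonal counts sit between neighbouring ratings (4 and 5, 2 and 3),
which is the kind of pattern that a single accuracy figure would hide.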

4.2 Results
4.2.1 Naive Bayes
Multinomial Naive Bayes is evaluated on 100,000 instances. The results are reported with the
precision, recall and F1-score metrics. First, the polarity of the reviews is evaluated (Fig. 2). Then
the same methods are applied to 5 classes, which represent the 5 star ratings (Fig. 3). The results
are relatively high for the 2-class polarity evaluation. However, a significant decrease is observed
in the results for 5 classes. This decrease can be attributed to the fact that the lexicons of 4- and
5-star reviews are relatively close to each other, as are the lexicons of 1-, 2- and 3-star reviews.

Figure 2: Error Comparison of Feature Selection Algorithms using Naive Bayes

Figure 3: 5 Classes Evaluation using Naive Bayes

4.2.2 Support Vector Machine


A Support Vector Machine is a discriminative classifier formally defined by a separating hyperplane.
Given labeled training data, the algorithm outputs an optimal hyperplane which categorizes new
incoming instances (Fig. 4, Fig. 5).

Figure 4: Error Comparison of Feature Selection Algorithms using SVM

Figure 5: 5 Classes Evaluation using SVM

4.2.3 Logistic Regression


Logistic Regression, as described in Section 3.3, is evaluated on both the polarity task and the
5-class task (Fig. 6, Fig. 7).

Figure 6: Error Comparison of Feature Selection Algorithms using Logistic Regression

Figure 7: 5 Classes Evaluation using Logistic Regression

5 Conclusion
Various machine learning algorithms were applied to predict ratings from the reviews in the Yelp
dataset. The effectiveness of each algorithm was measured with the precision, recall and F1 metrics.
No significant improvement was noticed with the removal of features such as stop words in the case of
polarity classification. However, there was a noticeable improvement in the results in the case of
multiclass classification. Overall, we achieved an accuracy of 79 percent for polarity classification,
compared with 40 percent for multiclass classification. Naive Bayes was noted to be the best
performing algorithm.

References
Channapragada, S. and R. Shivaswamy (2015). Prediction of rating based on review text of yelp reviews.

Ding, X., B. Liu, and P. S. Yu (2008). A holistic lexicon-based approach to opinion mining. pp. 231–240.

Ganu, G., N. Elhadad, and A. Marian (2009). Beyond the stars: improving rating predictions using
review text content. In WebDB, Volume 9, pp. 1–6. Citeseer.

McCormick, C. (2016). Word2vec tutorial - the skip-gram model.
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/. Accessed: 08-03-2018.

Wang, J. (2015). Predicting yelp star ratings based on text analysis of user reviews.

Author Declaration for Group Assignments

Module Number: CS7IS4

Title of the Assignment: Group 5 Final Essay

Student Number: 16340549
Student Name: Bhargav Ramprasad Panth
Percentage Contribution: 30%
Nature of Contribution: Bhargav studied on machine-learning papers and applied Logistic
Regression to overall system. He wrote evaluation, conclusion sections and created the
necessary graphs.

Student Number: 17319658
Student Name: Maaz Hasan
Percentage Contribution: 10%
Nature of Contribution: For research purpose, I contributed to the literature review by reading
three research papers focusing on the methods for achieving the results pertaining to the
research question. I also implemented the Bag of Words model from NLP and Naive Bayes on the
yelp dataset to categorise the data as good or bad but the precision and recall score obtained,
I have implemented the stop words removal on the overall system. I also contributed for
drafting the final essay.

Student Number: 17314870
Student Name: Onurcan Onder
Percentage Contribution: 30%
Nature of Contribution: Onurcan studied on machine-learning papers and applied Support Vector
Machine to overall system. He wrote the abstract, introduction, related work sections and
fine-tuned the whole paper to make the paper suitable for formatting requirements. Also learnt
LaTeX tool.

Student Number: 17311198
Student Name: Some Aditya Mandal
Percentage Contribution: 30%
Nature of Contribution: Some studied on papers about data pre-processing, created a unique
lexicon and applied stemming. He also applied Naïve Bayes algorithm to overall system. He wrote
implementation section.

We have read and we understand the plagiarism provisions in the General Regulations
of the University Calendar for the current year, found at: http://www.tcd.ie/calendar

We have also completed the Online Tutorial on avoiding plagiarism ‘Ready, Steady,
Write’, located at http://tcd-ie.libguides.com/plagiarism/ready-steady-write

We declare that the assignment together with any supporting artefact is offered for
assessment as our original and unaided work, except in so far as any advice and/or
assistance from any other named person in preparing it and any reference material
used are duly and appropriately acknowledged. We declare that the percentage
contribution by each member as stated above has been agreed by all members of the
group and reflects the actual contribution of the group members.

Signed and Dated:

Bhargav Ramprasad Panth Maaz Hasan

Onurcan Onder Some Aditya Mandal