
Text Analytics on Wine Reviews

Lesson 12

20538 – Predictive Analytics for data-driven decision making

Prof. Luca Molteni


Index
• Collected Data
Data Source and Analysis Goals
Dataset Fields

• Data Preparation
Data import
Data cleaning
Bag of Words
Terms filtering

• Predictive models
Pinot Noir prediction
Pinot Noir vs Cabernet Sauvignon vs Syrah
Performance comparison of various machine learning algorithms

COLLECTED DATA

Data Source and Analysis Goals
The source for the original dataset is the Kaggle repository:
https://www.kaggle.com/zynicide/wine-reviews/data

The data was scraped from WineEnthusiast on November 22nd, 2017. Besides the review text, the scrape also collected the title of each review (from which the vintage year can be parsed), the taster's name and the taster's Twitter handle (which should eventually also help resolve the duplicate-entry issue)

The idea is to investigate whether we can build a predictive model that identifies the wine variety from a tasting description, much as a master sommelier does in a blind tasting

The first step in this journey was gathering data to train a model; the plan is to use machine learning to predict the wine variety from the words in the description/review. The model still won't be able to taste the wine, but in theory it can identify the variety from the kind of description a sommelier would give

Dataset Fields
The data consists of 13 fields:
• Points: the number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >= 80)
• Title: the title of the wine review, which often contains the vintage if we're interested in extracting that feature
• Variety: the type of grapes used to make the wine (e.g. Pinot Noir)
• Description: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
• Country: the country that the wine is from
• Province: the province or state that the wine is from
• Region 1: the wine-growing area in a province or state (e.g. Napa)
• Region 2: a more specific region within a wine-growing area (e.g. Rutherford inside the Napa Valley); this value can sometimes be blank
• Winery: the winery that made the wine
• Designation: the vineyard within the winery where the grapes that made the wine are from
• Price: the cost for a bottle of the wine
• Taster Name: name of the person who tasted and reviewed the wine
• Taster Twitter Handle: Twitter handle for the person who tasted and reviewed the wine

DATA PREPARATION

Data preparation (1) – data import
First of all, we import the CSV file and look at the distribution of the wine varieties.
Pinot Noir is the most frequent variety; considering only straight red wines, Cabernet Sauvignon and Syrah rank second and third in number of reviews.
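For readers who want to reproduce this step outside KNIME, below is a minimal pandas sketch; the file name winemag-data-130k-v2.csv and the variety column name are assumptions based on the Kaggle dataset page.

```python
# Minimal sketch of the data-import step in pandas.
# File name and "variety" column are assumptions from the Kaggle dataset page.
import pandas as pd

reviews = pd.read_csv("winemag-data-130k-v2.csv")

# Distribution of the wine varieties, most frequent first
print(reviews["variety"].value_counts().head(10))  # Pinot Noir should top the list
```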

Data preparation (2) – sample of available data
Below you can see a sample of the first rows of the collected data.
The description field contains the review text we will use to predict the variety, obviously after removing every direct reference to the variety itself, the winery and the region.

Data preparation (3) – data cleaning
In this phase, with the “Strings To Document” node we transform the description string into a KNIME Document column, which allows us to use KNIME-specific text-processing functions.

To clean the Document column we use the following nodes: Punctuation Erasure (removes all punctuation characters from the terms contained in the input documents), N Chars Filter (we retained only words at least 3 characters long), Number Filter (removes numbers from the text), Case Converter (converts all terms contained in the input documents to lower case), Stop Word Filter (filters out all terms contained in the specified stop-word list, such as “and”, “not”, …) and finally Snowball Stemmer (which applies a suffix-stripping grammar to reduce words to their stems).
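As a rough non-KNIME illustration of the same cleaning chain, here is a hedged Python sketch using NLTK (assuming the package is installed and its stopwords corpus is available); the token rules will not match KNIME's nodes one for one.

```python
# Sketch of the cleaning chain: punctuation removal, short-word and number
# filtering, lower-casing, stop-word removal, Snowball stemming.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def clean_description(text):
    text = re.sub(r"[^\w\s]", " ", text)          # Punctuation Erasure
    tokens = text.lower().split()                  # Case Converter + tokenize
    tokens = [t for t in tokens
              if len(t) >= 3                       # N Chars Filter (>= 3 chars)
              and not t.isdigit()                  # Number Filter
              and t not in stop_words]             # Stop Word Filter
    return [stemmer.stem(t) for t in tokens]       # Snowball Stemmer

print(clean_description("Aromas of ripe blackberry and 2 hints of pepper."))
```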

Data preparation (4) – data cleaning
Below, the Preprocessed Document column shows the effect of the document cleaning.

Data preparation (5) – filtered Bag of Words

In this phase, with the “Bag Of Words Creator” node, we create a Term column containing all the terms used in each document row; at the same time, each document row is duplicated as many times as the number of terms it contains, moving from 129,971 to 2,816,280 rows.

With the Row Filter node we then removed all terms with a frequency lower than 300.
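A hedged pandas sketch of the same two steps, reusing the `reviews` table and the `clean_description` helper from the earlier sketches:

```python
# One (document, term) row per term occurrence, then drop rare terms.
from collections import Counter

reviews["terms"] = reviews["description"].astype(str).map(clean_description)

# "Explode" to one row per (document, term), as the Bag Of Words Creator does
bow = reviews[["terms"]].explode("terms").rename(columns={"terms": "term"})
print(len(bow))  # roughly the 2.8M rows reported on the slides for the full dataset

# Keep only terms whose overall frequency is at least 300
term_freq = Counter(bow["term"])
bow = bow[bow["term"].map(term_freq) >= 300]
```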

Data preparation (6) – term vectorization
and wine variety filter
With the “Document Vector” node we transformed the Term column into a series of term dummies (one 0/1 column per term), also recovering the variety (Source/Wine) with the “Document Data Extractor” node.
With the Pinot-Cabernet-Syrah filter we then restricted the analysis to the three most relevant straight red wines (in terms of reviews).
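A hedged scikit-learn sketch of the vectorization and variety filter (reusing `reviews` and `clean_description`; note that `min_df` counts documents rather than total occurrences, so it only approximates the frequency filter above, and `get_feature_names_out` assumes a recent scikit-learn):

```python
# Binary term dummies plus the variety label, restricted to the three varieties.
from sklearn.feature_extraction.text import CountVectorizer

targets = ["Pinot Noir", "Cabernet Sauvignon", "Syrah"]
subset = reviews[reviews["variety"].isin(targets)].copy()

# binary=True yields 0/1 dummies; the analyzer reuses the cleaning chain above
vectorizer = CountVectorizer(analyzer=clean_description, binary=True, min_df=300)
X = vectorizer.fit_transform(subset["description"].astype(str))
y = subset["variety"].values
term_names = vectorizer.get_feature_names_out()
print(X.shape)
```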

Data preparation (7) – term frequency
and trivial relation filter
With the “Low Variance Filter” node we removed terms cited in fewer than 5.3% of the selected reviews: a 0/1 dummy with occurrence rate p has variance p(1-p), so a 5% variance lower bound corresponds to p of roughly 5.3%.
We furthermore decided to remove trivial terms, and terms trivially related to the dependent variable (the variety).
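A hedged sketch of this filter with scikit-learn's VarianceThreshold, reusing `X` and `term_names` from above; the list of trivial terms is illustrative only.

```python
# Keep only dummies with variance >= 0.05, i.e. terms cited in >= ~5.3% of reviews
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.05)
X_sel = selector.fit_transform(X)
kept_terms = term_names[selector.get_support()]

# Drop terms trivially related to the target (illustrative, non-exhaustive list of stems)
trivial = {"pinot", "noir", "cabernet", "sauvignon", "syrah"}
keep_idx = [i for i, t in enumerate(kept_terms) if t not in trivial]
X_sel = X_sel[:, keep_idx]
kept_terms = kept_terms[keep_idx]
```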

DATA ANALYSIS

Data analysis with Equal Size Sampling (1)
– tested models
The three categories of the dependent variable are not balanced in terms of size in the training dataset (see below), so a balancing procedure is needed before model estimation.
We started with the Equal Size Sampling option and then tested four models: Decision Tree, Random Forest, KNIME Gradient Boosting and H2O Gradient Boosting.
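A hedged pandas sketch of the Equal Size Sampling idea (undersampling every class to the size of the smallest one), assuming a training DataFrame `train` with a "variety" target column:

```python
# Undersample each class down to the size of the smallest one.
def equal_size_sample(train, target="variety", seed=42):
    n_min = train[target].value_counts().min()
    return train.groupby(target, group_keys=False).sample(n=n_min, random_state=seed)

# balanced = equal_size_sample(train)
```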

Data analysis with Equal Size Sampling (2)
– model performance
Apart from the Decision Tree, whose performance is clearly lower with an accuracy of 64.4%, the remaining models perform with roughly the same efficacy.
The top performer is the Random Forest model, with an accuracy slightly over 70% and a clearly higher sensitivity for the most frequent class (Pinot Noir, 75.7%).
The predictive performance for the remaining two classes is between 62% and 64%.

Data analysis with Equal Size Sampling (3)
– Random Forest model focus
Focusing on the best model, it is interesting to calculate the impact of the individual terms on the prediction.
We calculated the importance of each variable as the ratio between the number of splits on that variable and the number of times it was a split candidate, over the first three levels of the trees (we requested 300 trees).
Cassis, blackberry, strawberry, pepper, silky and cola are the 6 most relevant terms for prediction.
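The splits/candidates ratio is a statistic reported by the KNIME Random Forest Learner; scikit-learn does not expose it directly, so the hedged sketch below uses impurity-based feature importances as a rough substitute, assuming a balanced design matrix `X_bal`, target `y_bal` and the aligned `kept_terms` from the earlier sketches.

```python
# Rank terms by impurity-based importance from a 300-tree Random Forest.
# X_bal / y_bal are the balanced training data (assumed from the previous steps).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_bal, y_bal)

importance = (pd.Series(rf.feature_importances_, index=kept_terms)
                .sort_values(ascending=False))
print(importance.head(6))  # the slides report cassis, blackberry, strawberry, pepper, silky, cola
```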

Data analysis with Equal Size Sampling (4)
– Random Forest model focus

Looking at the direction of the impact, the terms cola and silky are more related to Pinot Noir, cassis to Cabernet, pepper to Syrah, and blackberry to both Cabernet and Syrah.

Data analysis with Equal Size Sampling (5)
RF focus – Tag Clouds of 30 most relevant terms
[Tag clouds of the 30 most relevant terms for Pinot Noir, Cabernet and Syrah]

Data analysis with SMOTE Sampling (1)
The previous results were obtained using the Equal Size Sampling node, in other words by undersampling the majority classes.
An alternative approach is to oversample the minority classes, and the SMOTE procedure is one of the most widely used: it oversamples by “perturbing” the data (interpolating between a minority observation and its nearest neighbours), so in a realistic way, rather than simply duplicating the minority-class observations.
We tried both Random Forest and XGBoost models with SMOTE, improving the classification performance (see next slide).
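A hedged sketch of the SMOTE step, assuming the imbalanced-learn package and a hypothetical `X_train`/`y_train`, `X_test`/`y_test` split built from the earlier steps:

```python
# Oversample the minority classes with SMOTE, then refit the Random Forest.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_res, y_res)
print(rf.score(X_test, y_test))  # held-out accuracy; an XGBoost model can be fitted the same way
```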

Data analysis with SMOTE Sampling (2)
– model performance

As we can easily see from the accuracy statistics, the overall ability to correctly predict the wine category is in this case around 74-75%, against the previous 69-70%.
The only remaining issue is the higher rate of false negatives (lower sensitivity) on the least represented class, Syrah.

Data analysis with SMOTE Sampling (3)
– Random Forest model focus
Focusing on the best model, it is again interesting to calculate the impact of the individual terms on the prediction.
Cassis, blackberry, strawberry, pepper, silky and cola are again the 6 most relevant terms for prediction, but in a different order compared with the previous chart.

Data analysis with NGrams (1)
– Random Forest model focus
It could be interesting to test the impact of word associations (n-grams) in addition to that of single terms.
For this reason we extracted the most frequent associations of 3 words (trigrams) using the flow below, merging the resulting table with the single-term vectorization.
In the table on the right we can see a subset of n-grams with their frequency; obviously, before using them as explanatory variables we removed the trivially related ones, such as “blend cabernet sauvignon” or “cabernet sauvignon merlot”.
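A hedged scikit-learn sketch of the trigram extraction, reusing `subset` and `clean_description`; the `min_df` floor is an assumed threshold, and the resulting matrix could be merged with the single-term dummies via scipy.sparse.hstack.

```python
# Count 3-word associations over the cleaned descriptions.
from sklearn.feature_extraction.text import CountVectorizer

trigram_vec = CountVectorizer(tokenizer=clean_description, token_pattern=None,
                              ngram_range=(3, 3), min_df=50)
X_tri = trigram_vec.fit_transform(subset["description"].astype(str))

# Most frequent trigrams; trivially related ones would be dropped by hand
freq = X_tri.sum(axis=0).A1
top = sorted(zip(trigram_vec.get_feature_names_out(), freq), key=lambda kv: -kv[1])
print(top[:20])
```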

Data analysis with NGrams (2)
– Random Forest model focus
In this case, however, the results of the Random Forest model do not improve when the n-grams are added under Equal Size Sampling balancing: the overall accuracy is again around 70%, with similar results for the other accuracy statistics.

