BEER RECOMMENDATION PROJECT
Proof of Concept
Reproducibility statement
To ensure the reproducibility of all my analyses, I have deposited the Jupyter notebook containing the code in the GitHub repository https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept. The same repository contains the raw input data, all result files, software requirements, the final report, a video presentation and a PowerPoint presentation. The environment setup is listed in the above-mentioned Final Report, chapter 1.2.1 Inventory of Resources.
1. Business Understanding
1.1 Business Objectives
1.1.1 Background
My husband likes to try a new beer each time he has the opportunity. I have volunteered to keep an Excel file and record beer names and ratings to make sure we don’t buy the same product twice. The current state is that we buy any beer that’s not in the Excel file and hope it’s enjoyable. If he likes it, he drinks it and the beer gets a rating of 5 or higher. If he doesn’t like it, the beer goes down the drain, is rated below 5, and he opens another one. So far, 9.5% of all beers bought were not consumed.
The project will be successful if fewer than 9.6% of new beers receive a rating lower than 5.
Currently, there are no apps to use as a benchmark for predicting beer ratings.
Contingency: Use an interval of two standard deviations away from the mean to train the model instead of three standard deviations, as one would choose for a large dataset. This will leave out more outliers and will decrease the risk of overfitting.
1.2.4 Terminology
Business terminology:
ABV – Alcohol By Volume: how much alcohol is contained in a given volume of an alcoholic beverage. Actual ABV in this
dataset starts at 0 (alcohol-free) and ends at 10.
Fermentation – place of fermentation in the barrel: top (for ale, alt, blonde, soda and wheat beer) or bottom (dark, keller,
lager, mix and pils beer). Warm fermentation is recorded as top, and cold fermentation as bottom.
Filtration – the process of removing sediments in the brewing process: filtered, unfiltered.
Flavor – additional flavors if any: fruit, herb, lemon, standard. If none is specified on the bottle, it is classified as standard.
Method – brewing method: industrial for big volumes, craft for small batches.
Rating – actual ratings recorded in the input file for each beer. Each rating reflects how much that particular beer was
enjoyed after a salty meal. The scale is from 1 (worst) to 10 (best). Beers that are exceptionally good or exceptionally bad
may receive ratings that are out of this range. Actual ratings are between -1 and 11.
Predicted rating – output predicted by the machine learning model after training on a subset of data.
Style – brewing style: ale, alt, blonde, dark, keller, lager, mix, pils, soda, wheat.
The time dedicated to this project is part of planned data science practice.
There are no data acquisition costs because the data collection is my own property.
The current hardware fits the project’s requirements without the need for further purchases.
The software needed was already installed, with the exception of H2O, which is available for free.
Benefits
The project brings value by decreasing the quantity of beer that is bought and then discarded. It helps with saving money, effort, and the environment:
Less money spent on beer.
Less time spent carrying unsatisfactory beer on the staircase.
Lower carbon footprint by reducing the quantity of discarded beer.
Project plan (columns: task, start date, end date, start day offset, duration in working days, resource):
1. Business Understanding
1.1 Business Objectives 01-09-20 04-09-20 0 4 Alina
1.2 Assess Situation 07-09-20 11-09-20 6 5 Alina
1.3 Determine Data Mining Goals 14-09-20 18-09-20 13 5 Alina
1.4 Produce Project Plan 21-09-20 25-09-20 20 5 Alina
Documentation 28-09-20 02-10-20 27 5 Alina
2. Data Understanding
2.1 Collect Initial Data – Initial Data Collection
Report 05-10-20 09-10-20 34 5 Alina
2.2 Describe Data – Data Description Report 12-10-20 23-10-20 41 10 Alina
2.3 Explore Data – Data Exploration Report 26-10-20 06-11-20 55 10 Alina
2.4 Verify Data Quality – Data Quality Report 09-11-20 13-11-20 69 5 Alina
Documentation 16-11-20 20-11-20 76 5 Alina
3. Data Preparation
3.1 Select Data – Rationale for Inclusion/Exclusion 23-11-20 04-12-20 83 10 Alina
3.2 Clean Data – Data Cleaning Report 07-12-20 18-12-20 97 10 Alina
3.3 Construct Data 04-01-21 15-01-21 125 10 Alina
3.4 Integrate Data 18-01-21 29-01-21 139 10 Alina
3.5 Format Data – Reformatted Data 01-02-21 01-02-21 153 1 Alina
3.6 Dataset Output 01-02-21 01-02-21 153 1 Alina
3.7 Dataset Description Report 02-02-21 12-02-21 154 9 Alina
Documentation 15-02-21 19-02-21 167 5 Alina
4. Modeling. Machine Learning with H2O
4.1 Initialize H2O 22-02-21 22-02-21 174 1 Alina
4.2 Select .csv Files 22-02-21 22-02-21 174 1 Alina
4.3 Import Datasets As Dictionaries Into H2O 23-02-21 05-03-21 175 9 Alina
4.4 Select Modeling Techniques 08-03-21 26-03-21 188 15 Alina
4.5 Generate Test Design 29-03-21 16-04-21 209 14 Alina
4.6 Build Model 19-04-21 14-05-21 230 20 Alina
4.7 Assess Models 17-05-21 28-05-21 258 10 Alina
Documentation 31-05-21 11-06-21 272 10 Alina
5. Evaluation
5.1 Evaluate Results 14-06-21 25-06-21 286 10 Alina
5.2 Review Process 28-06-21 02-07-21 300 5 Alina
5.3 Determine Next Steps 05-07-21 09-07-21 307 5 Alina
Documentation 12-07-21 16-07-21 314 5 Alina
6. Deployment
6.1 Plan Deployment 19-07-21 30-07-21 321 10 Alina
6.2 Plan Monitoring and Maintenance 02-08-21 06-08-21 335 5 Alina
6.3 Produce Final Report 09-08-21 27-08-21 342 15 Alina
6.4 Review Project 30-08-21 03-09-21 363 5 Alina
1.1 Business Objectives
1.2 Assess Situation
1.3 Determine Data Mining Goals
1.4 Produce Project Plan
Write Documentation
Tools:
▪ Pandas: library for data cleaning and transformation. Easy-to-read code that helps with understanding the flow.
▪ Seaborn: library for eye-friendly statistical graphs. Makes uncluttered visualizations with simple code.
Techniques:
2. Data Understanding
2.1 Collect Initial Data – Initial Data Collection Report
Note:
The dataset has 322 rows and 8 columns, and all cells are filled. There are no missing values.
81 beers are cut and pasted into a separate file so that they remain unseen by the machine learning model.
I will work with the remaining 241 beers, splitting them into three subsets for training, validating and testing the model.
Since I don’t know which attributes have a higher impact on ratings, I’ll keep them all for the initial analysis and remove unnecessary columns after seeing the results.
One column, Rating, stores beer ratings as a numeric variable of integer type.
Seven columns store data as object type in text form and need to be converted: Name, Method, Style, Flavor, Fermentation, Country and Split.
• Rating: beer ratings as integers between -1 and 11; the mean rating is 6.9.
• Name: beer names and their alcohol content as the last part of the text. Can be separated.
• Method: brewing method. Repetitive values, good candidates for conversion to category.
• Style: brewing style. Repetitive values, good candidates for conversion to category.
• Flavor: additional flavors. Repetitive values, good candidates for conversion to category.
• Fermentation: place of fermentation in the barrel. Repetitive values, good candidates for conversion to category.
• Country: numerous different string values. Needs further check before deciding on its transformation.
• Split: repetitive values that mark which observations will be used for training, validation and testing when building the model in H2O.
After removing duplicates, the dataset has 240 rows and 9 columns. The extra column is ABV (alcohol by volume) in float
format.
The rating interval for the 25th to 75th percentile of beers is 6 to 8 in most subgroups of categorical features. Out of the 240 beers, 9.6% have a rating lower than 5.
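As a rough illustration of the preparation steps described above (separating the alcohol content from the Name column, removing duplicates, converting repetitive text columns to categories), the sketch below uses pandas. The file path, the lower-casing of names, and the exact column handling are illustrative assumptions; the notebook code may differ.

import pandas as pd

# Load the raw collection (path is illustrative)
beer = pd.read_csv("beer_input/beer_raw.csv")

# Lower-case column names and all string values
beer = beer.rename(columns=str.lower)
text_cols = beer.select_dtypes("object").columns
beer[text_cols] = beer[text_cols].apply(lambda col: col.str.lower())

# The alcohol content is the last token of the beer name, e.g. "brand pils unfiltered 4.9"
beer["abv"] = beer["name"].str.rsplit(" ", n=1).str[1].astype(float)
beer["name"] = beer["name"].str.rsplit(" ", n=1).str[0]

# Remove duplicate beers and convert repetitive text columns to categories
beer = beer.drop_duplicates(subset="name").reset_index(drop=True)
for col in ["method", "style", "flavor", "fermentation", "split"]:
    beer[col] = beer[col].astype("category")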
style count mean std min 25% 50% 75% max
ale 20 6.6 2.3 1 5.8 7 8 10
alt 7 6.7 1 5 6.5 7 7 8
blonde 38 7.1 1.5 2 6 7 8 10
dark 17 7.6 1.7 4 6 8 9 11
keller 12 6.9 1.4 5 6 7 8 10
lager 38 6.9 1.7 3 6 7 8 9
mix 4 4 3.3 0 3 4 5 8
pils 53 6.8 1.4 3 6 7 8 10
soda 13 7 3 -1 6 8 9 10
wheat 38 6.9 1.7 1 6 7 8 9
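The per-style statistics above are the standard pandas describe() output of ratings grouped by style; a minimal sketch, assuming the cleaned frame from the earlier snippet:

# Per-style rating statistics, rounded as in the table above
style_stats = beer.groupby("style")["rating"].describe().round(1)
print(style_stats)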
There are 240 unique beer names. Each beer name starts with the brand, followed by characteristics like style, flavor, filtration and pasteurization, if applicable. All names end with a number that represents the alcohol content.
There are 17 unique countries, quite a lot for a dataset of 240 beers. I will determine whether each country has enough occurrences in the data selection section.
Result 1: The scatterplot shows no linear relationship and no pattern between ratings and alcohol content. However, the lemon-flavor subgroup with zero or low alcohol content has higher ratings compared to other groups. Because alcohol-free and low-alcohol beers are present in many flavors, out of which lemon stands out as higher rated, it is safe to say there is a relationship between lemon flavor and high ratings.
Figure 1. Relationship between ABV and rating by each categorical feature
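A plot along the lines of Figure 1 can be produced with seaborn; the sketch below assumes the cleaned frame and lower-cased column names from the earlier snippets.

import seaborn as sns
import matplotlib.pyplot as plt

# One scatterplot of rating vs. ABV per categorical feature, colored by subgroup
for feature in ["style", "flavor", "method", "fermentation"]:
    sns.scatterplot(data=beer, x="abv", y="rating", hue=feature)
    plt.title(f"Rating vs. ABV by {feature}")
    plt.show()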
2.3.2 Hypothesis 2
Some subgroups might have a low number of observations.
Result 2: The results are stored in a new column, “Occurrence”: “too few” if a subgroup is below the threshold, NaN for the rest. With a threshold of 5% of the total, the alt and mix styles and the herb flavor fall below it.
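A minimal sketch of how such a flag could be derived, assuming the frame and the 5% threshold named above (column and feature names are illustrative):

# Flag subgroups whose share of the data is below the threshold
threshold = 0.05
for feature in ["style", "flavor"]:
    share = beer[feature].map(beer[feature].value_counts(normalize=True))
    beer.loc[share < threshold, "occurrence"] = "too few"

Rows belonging to subgroups at or above the threshold keep NaN in the new column.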
2.3.3 Hypothesis 3
The Flavor column might have greater variation between the averages of its subgroups. If this is the case, Flavor would be a useful feature when building the machine learning model.
Result 3: Flavor subgroups show observable variation between their averages. The error bar is bigger for the herb subgroup.
2.3.4 Hypothesis 4
Lemon beers’ high average might not be due to outliers. The scatterplot between alcohol content and rating grouped by Flavor hints that this is due to all observations being higher rated, not because of outliers.
Result 4: The boxplot shows that the distribution of lemon beers reflects higher ratings overall compared to other subgroups. The KDE plot shows that ratings of lemon beers are so much higher than the rest that their median overlaps with their 75th percentile. This is the reason for the lineless box of the lemon subgroup in the boxplot.
2.4 Verify Data Quality – Data Quality Report
There are no missing values as shown in Initial Data Collection Report (section 2.1).
All keys and all string values are stored in lower-case because I have transformed all text in section 2.2.
2.4.2 Define noise in ratings
The dataset has 240 rows. In a Gaussian distribution, outliers are usually defined as values that are more than three standard deviations away from the mean. In small datasets, a narrower interval is preferred, for example two standard deviations away from the mean. A narrower interval avoids overfitting in case values are not representative of real life. I will use this criterion when cleaning the data in section 3.2.
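A minimal sketch of that criterion, assuming the cleaned frame from the earlier snippets; ratings outside mean ± 2 standard deviations are treated as noise:

# Keep only ratings within two standard deviations of the mean
mean, std = beer["rating"].mean(), beer["rating"].std()
beer_2std = beer[beer["rating"].between(mean - 2 * std, mean + 2 * std)]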
3. Data Preparation
3.1 Select Data – Rationale for Inclusion/Exclusion
Decide on a correlation metric to describe the relationship between ratings and alcohol content. Their relationship is non-linear, as was shown in the Data Exploration Report. Find out what type of distributions ratings and alcohol content have:
• If both distributions are Gaussian and the relationship is linear, use covariance or Pearson’s correlation coefficient.
• If one or both distributions are non-Gaussian and the relationship is non-linear but still monotonic, use Spearman’s correlation coefficient.
Result: The KDE curve of ratings is Gaussian, therefore there is an observable pattern. Only a few beers get extreme ratings, minimum -1 or maximum 11. Most ratings are grouped towards 7.
Result: The minimum alcohol content is 0, the maximum is 10. The KDE curve of alcohol content is not Gaussian because of alcohol-free beer, which accounts for 17% of the total. Still, it shows a pattern of three subgroups, each with its own Gaussian curve:
• alcohol-free beer
• mix of beer with lemonade, alcohol content between 0.5 and 3, curve peaked at 2.5
• beer without additions, alcohol content between 3 and 10, curve peaked at 5
Decision: Include the ABV (alcohol content) column in the machine learning model when training the regular beer subgroup. Ratings and alcohol content have a similar scale, so there is no need to normalize them.
Figure 7. KDE of alcohol content
Instead, I can calculate Spearman’s correlation coefficient because it works for both linear and non-linear relationships and the variables don’t need to have Gaussian distributions. However, Spearman’s correlation coefficient assumes a monotonic relationship, meaning that as one variable increases, the other consistently increases or consistently decreases.
I have shown in the Data Exploration Report (section 2.3) that ratings and alcohol content don’t have a linear relationship, but they can still have a non-linear one, so I’ll calculate Spearman’s correlation coefficient.
Result 1: Spearman’s correlation coefficient is -0.11, which means that either the variables don’t have a monotonic relationship, or they don’t have any kind of relationship. Since we already know that lemon beer with an alcohol content of 2.5 has higher ratings compared to the rest, we can say that there is a relationship, but it’s not monotonic.
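The coefficient itself is a one-liner with SciPy; a sketch assuming the frame from the earlier snippets:

from scipy.stats import spearmanr

# Spearman's rank correlation between alcohol content and rating
rho, p_value = spearmanr(beer["abv"], beer["rating"])
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")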
Action 2: Plot the kernel density estimation (KDE) of alcohol content and rating and see if the data can be grouped. Set the bandwidth to less than 1 because the distribution of alcohol content is not Gaussian.
Result 2: There are three zones, so it makes sense to split the dataset into three groups after all cleanup is done.
Decision: Before and after removing rating outliers, split the dataset into alcohol-free beer (0 ABV), light beer (0.5-3 ABV) and regular beer (ABV higher than 3).
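A minimal sketch of that split, using the subset names that appear later in this report (the helper function is illustrative):

# Split a frame into the three ABV groups named above
def split_by_abv(df):
    alc_free = df[df["abv"] == 0]
    light = df[df["abv"].between(0.5, 3)]
    regular = df[df["abv"] > 3]
    return alc_free, light, regular

alc_free_all, light_all, regular_all = split_by_abv(beer)
alc_free_2std, light_2std, regular_2std = split_by_abv(beer_2std)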
Figure 8. KDE of alcohol content and rating
3.1.4 Check whether the Country column has enough observations for each unique value
Action: Count the unique countries and how many occurrences each of them has. Show the results as percentages.
Results: There are 17 unique countries and 15 of them each have less than 5% of the total observations. With such a small dataset, this feature might lead to bias in a machine learning model because of the large number of subgroups combined with the low number of occurrences in each subgroup.
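A sketch of that check, assuming the frame from the earlier snippets:

# Share of observations per country, as a percentage
country_share = beer["country"].value_counts(normalize=True) * 100
print(country_share.round(1))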
3.1.5 Check which categorical features have high variation across subgroups
Action: Write functions to plot the averages and distributions of subgroups. Select features with high variation across subgroups.
Results: Boxplots and barplots show that Flavor and Style have high variation between the distributions and averages of their subgroups. Method and Fermentation have low variation across their subgroups. Error bars are bigger for subgroups with a low number of observations.
Decision: Include Style and Flavor in all machine learning models. Decide later for Method and Fermentation, after splitting the subsets, because even if a feature doesn’t show interesting subgroups in the original dataset, it might in the subsets split on the alcohol content criterion.
Figure 9. Average rating of beer subgroups
Results:
Decision: The goal of this project is to identify and eliminate beers that have extremely low ratings, so it might be possible for a model trained on these extreme ratings to perform better. I will build models on both the complete dataset and the less-than-2*std dataset, then compare predictions. That is why I will select the rows within 2*std of the mean only after the new columns are added.
Action 1: Define a list of words in various languages that translate as unfiltered. If at least one word from the list is found in the Name column, label the beer as unfiltered; otherwise label it as filtered.
Action 2: Define a list of words in various languages that translate as unpasteurized. If at least one word from the list is found in the Name column, label the beer as unpasteurized; otherwise label it as pasteurized.
Results: Two new columns – Filtration and Pasteurization – with entries derived from the Name column.
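A minimal sketch of this keyword labeling; the word lists here are short illustrative stand-ins for the multilingual lists described above:

import numpy as np

unfiltered_words = ["unfiltered", "nefiltrata", "hefe"]        # illustrative only
unpasteurized_words = ["unpasteurized", "nepasteurizata"]      # illustrative only

def label_from_name(names, words, positive, negative):
    # Label positive if any keyword appears in the beer name, negative otherwise
    pattern = "|".join(words)
    return np.where(names.str.contains(pattern, case=False), positive, negative)

beer["filtration"] = label_from_name(beer["name"], unfiltered_words, "unfiltered", "filtered")
beer["pasteurization"] = label_from_name(beer["name"], unpasteurized_words, "unpasteurized", "pasteurized")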
Create a column to bin alcohol content as categorical data and capture non-linearities in visualization and modeling. Name it “Perception”. Decide edges and labels according to the perceived features of the subgroups:
• drive – beer that can be consumed before driving, alcohol content between 0 and 0.5
• refresh – beer that quenches thirst, a mix of beer and lemonade or fruit juice, with alcohol content between 0.6 and 2.8
• weak – beer that is suitable neither for driving nor for quenching thirst; its taste is slightly diluted and its alcohol content is between 2.9 and 4.4
• tasty – beer with full taste and alcohol content between 4.5 and 5.5
• too strong – beer with strong taste and alcohol content 5.6 or higher
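A sketch of this binning with pandas, using bin edges that approximate the ranges listed above (assumed, not taken from the notebook):

# Bin ABV into the five perception labels listed above
bins = [-0.1, 0.5, 2.8, 4.4, 5.5, float("inf")]
labels = ["drive", "refresh", "weak", "tasty", "too strong"]
beer["perception"] = pd.cut(beer["abv"], bins=bins, labels=labels)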
Results: In the complete ratings range there are 41 alcohol-free, 30 light and 169 regular beers.
In the ratings within 2*std there are 38 alcohol-free, 30 light and 160 regular beers.
Figure 13. Regular beer ratings
3.4.4 Check feature variation across subgroups when split into the three subsets based on alcohol content
Subsets split from beer dataset containing all observations:
Alcohol_free_all dataset. Observable variation in subgroups of Style, Flavor, Fermentation, Filtration and Pasteurization.
Figure 15. Distribution of alc_free_all
Light_all dataset. Observable variation in subgroups of Style, Flavor, Fermentation, Filtration, Pasteurization and
Perception.
Regular_all dataset. Observable variation in subgroups of Style, Flavor, Pasteurization and Perception.
Figure 17. Distribution of regular_all
Subsets split from the beer_2std dataset, which contains only the observations within the normal range, without rating outliers:
Alcohol_free_2std dataset. Observable variation in subgroups of Style, Flavor, Fermentation, Filtration and
Pasteurization.
Figure 19. Distribution of alc_free_2std
Light_2std dataset. Observable variation in subgroups of Style, Flavor, Fermentation, Filtration, Pasteurization and
Perception.
Regular_2std dataset. Observable variation in subgroups of Style, Flavor, Pasteurization and Perception.
Figure 21. Distribution of regular_2std
3.7 Dataset Description Report
In all datasets, each row is a unique beer with corresponding observations in all columns.
For brevity, I will copy here only the descriptions of the categorical features selected for training each dataset. The ABV feature is numeric and its distribution was shown in section 3.1.2.
style count mean std min 25% 50% 75% max
lager 38 6.9 1.7 3 6 7 8 9
mix 4 4 3.3 0 3 4 5 8
pils 53 6.8 1.4 3 6 7 8 10
soda 13 7 3 -1 6 8 9 10
wheat 38 6.9 1.7 1 6 7 8 9
flavor count mean std min 25% 50% 75% max
standard 16 6 1.6 3 5 6 7 8
style count mean std min 25% 50% 75% max
lager 28 7 1.6 4 6 7 8 9
mix 2 2 2.8 0 1 2 3 4
pils 30 6.6 1.7 3 6 7 8 10
soda 2 4.5 7.8 -1 1.8 4.5 7.2 10
wheat 21 6.6 2 1 6 7 8 9
• Rows: 38
• Number of ratings lower than 5: 0
• Columns: 12
• Features: Style, Flavor
• Dimensions: 2
• Maximum combinations possible: 6*4 = 24
Conclusion: Possible combinations in each model are almost as many as the number of observations in each dataset.
Tool chosen: H2O – open-source machine learning platform.
• Pros: handles high-dimensional datasets with no need for categorical data encoding, generates numeric data output from categorical data input.
• Cons: small community of users that can help with technical questions.
Other tools considered, but not chosen: Scikit-Learn – open-source machine learning library.
• Pros: versatile library, large community of users that can help answering questions.
• Cons: needs encoding of categorical data, techniques generate the same kind of output as the provided input.
H2O steps:
• Initialize H2O
• Import data
• Model
• Export output.
Select the .csv files from the folder that stores the clean dataframes. Import them as a dictionary of dictionaries. Each key is the name of a dataset. The values are: the frame, its list of predictors, and the response column. The list of predictors for each frame is the one decided in section 3.7.2 Description of observations (rows).
Note: I tried to import from Excel and it didn’t work. If you run into the same situation, convert the Excel file to .csv, then import the .csv.
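A sketch of this import step; the folder path and the predictor-selection rule are assumptions, and the frames are assumed to keep the lower-cased column names used in the earlier snippets:

import glob
import os
import h2o

h2o.init()

# One dictionary entry per cleaned .csv file: frame, predictor list (x) and response column (y)
datasets = {}
for path in glob.glob("beer_output/clean/*.csv"):
    name = os.path.splitext(os.path.basename(path))[0]
    frame = h2o.import_file(path)
    predictors = [c for c in frame.columns if c not in ("rating", "name", "split")]
    datasets[name] = {"frame": frame, "x": predictors, "y": "rating"}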
4.4 Select Modeling Techniques
4.4.1 Modeling Techniques
Distributed Random Forest (DRF). Ensemble method that builds trees in parallel, each of them trained on a random subset of the data. The predictions of all trees are averaged to compute the final prediction, which helps avoid over-fitting.
Gradient Boosting Machine (GBM). Forward learning ensemble method that, unlike Distributed Random Forest (DRF), trains each tree on the full set. Trees are built sequentially, one after another, so that each tree learns from the results of the previous trees and the model gets refined. The final prediction combines the contributions of all the trees.
In an ideal dataset, the decision about how to split data between training and testing would be automated, random and representative.
Since I don’t have an ideal dataset, and the number of observations is rather small and prone to over-fitting, I decided to split the data manually. For this, I created a column in the original dataset and named it Split. I sorted observations by name and then by rating, and at each rating level I made sure to record “train” for 70% of them, “valid” for 15%, and “test” for 15%. Beers from the same brand were thus split into different groups. I made sure other features are split too, e.g. for a rating of 5, various styles and flavors appear in each split. The result is that the splits are diverse no matter how few observations there are in each of them.
Note: Usually, alcohol-free beer is labeled as such. When sorting beers by name, most of the alcohol-free ones were the first in the series for a particular brand, and the light ones came after them. Splitting each brand into as many different groups as possible, I made sure the first one is marked for “train”, the second one for “valid” and, if a third one existed, for “test”. This led the alcohol-free subgroup to have more observations in the “train” group compared to regular beer. This is also the reason why the “test” subgroup has more regular beer. Why did I do that? Because alcohol-free beer is under-represented compared to regular beer in the total dataset, and I wanted to make sure there are as many observations as possible for training this small subset.
Select rows for training, validation and test sets to make results reproducible.
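A sketch of that row selection on an H2O frame, assuming the dictionary built in the import step (the dataset name is illustrative):

# Subset one frame by the manual Split column
frame = datasets["regular_all"]["frame"]
train = frame[frame["split"] == "train", :]
valid = frame[frame["split"] == "valid", :]
test = frame[frame["split"] == "test", :]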
4.6.1 Parameter Settings
Distributed Random Forest
DRF instantiate model.
col_sample_rate_change_per_level = 1 (default)
col_sample_rate_per_tree = 1 (default)
fold_assignment = “random”
histogram_type = AUTO (default), bins from min to max in steps of (max-min)/N levels
max_depth = 20 (default)
min_rows = 1 (default)
nbins = 13, because the rating scale is an array of integers starting at -1 and ending at 11
nfolds = 4, because this is the maximum allowed relative to the total number of rows
ntrees = 50 (default)
seed = 12
DRF training
x = features (columns) decided as relevant for each dataset variant in section 3.7.2 Description of observations (rows)
y = Column Rating
training_frame = rows marked with “train” in column Split in section 4.2 Generate Test Design
validation_frame = rows marked with “valid” in column Split in section 4.2 Generate Test Design
DRF testing
test = rows marked with “test” in column Split in section 4.2 Generate Test Design
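A minimal sketch of the DRF instantiation and training with the parameters listed above, assuming the frames and dictionary from the earlier snippets (a sketch, not the exact notebook code):

from h2o.estimators import H2ORandomForestEstimator

drf = H2ORandomForestEstimator(
    fold_assignment="random",
    nbins=13,       # the rating scale covers the 13 integers from -1 to 11
    nfolds=4,       # maximum allowed relative to the number of rows
    ntrees=50,
    max_depth=20,
    min_rows=1,
    seed=12,
)
drf.train(x=datasets["regular_all"]["x"], y="rating",
          training_frame=train, validation_frame=valid)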
Gradient Boosting Machine
GBM instantiate model.
col_sample_rate = 1 (default)
col_sample_rate_change_per_level = 1 (default)
col_sample_rate_per_tree = 1 (default)
distribution = “gaussian”
fold_assignment = “random”
histogram_type = AUTO (default), bins from min to max in steps of (max-min)/N levels
learn_rate_annealing = 1 (default)
max_depth = 5 (default)
min_rows = 1, because the light beer dataset is too small for the default min_rows of 10
nbins = 13, because the rating scale is an array of integers starting at -1 and ending at 11
nfolds = 4, because this is the maximum allowed relative to the total number of rows
ntrees = 50 (default)
seed = 12
stopping_rounds = 0 (default), which means early stopping is disabled. If cross-validation is enabled, training stops anyway when there is no improvement.
GBM training
x = features (columns) decided as relevant for each dataset variant in section 3.7.2 Description of observations (rows)
y = Column Rating
training_frame = rows marked with “train” in column Split in section 4.2 Generate Test Design
validation_frame = rows marked with “valid” in column Split in section 4.2 Generate Test Design
GBM testing
test = rows marked with “test” in column Split in section 4.2 Generate Test Design
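The corresponding GBM sketch with the parameters listed above, under the same assumptions:

from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    distribution="gaussian",
    fold_assignment="random",
    nbins=13,
    nfolds=4,
    ntrees=50,
    max_depth=5,
    min_rows=1,     # the light beer dataset is too small for the default of 10
    seed=12,
)
gbm.train(x=datasets["regular_all"]["x"], y="rating",
          training_frame=train, validation_frame=valid)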
4.6.2 Define Models
Write a function to generate the DRF and GBM models.
• To help with readability down the road, store the elements of the dictionary under short variable names.
• Split the rows into training, validation and test sets. This makes reproducibility possible.
• Instantiate the model with the custom parameters defined in section 4.6.1 Parameter Settings.
• Train the model on the list of predictors (x) chosen for each dataset in section 3.7.2 Description of observations (rows). Specify the response column (y), which is Rating for all datasets. Training frame: all rows in the “train” split. Validation frame: all rows in the “valid” split.
• Print the model to show variable importance.
• Generate the prediction. It gets stored in an H2O frame with one column named “predict”.
• Calculate the model performance on the test set, store it in a dictionary and extract only the MSE values from it.
• Print the model name, its predictors and the MSE value.
• Add the prediction to the original H2O frame, convert it to a dataframe and export it as a .csv file to: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/beer_output/predictions
• Export the models to: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/beer_output/models
• Export variable importance to the same folder as the models: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/beer_output/models
• Export the dictionary of dataframes and their respective MSE values to: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/beer_output/metrics/mse/
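A condensed sketch of such a function, covering only the training, prediction and MSE steps (the export steps are omitted; names follow the earlier snippets):

def build_and_score(name, entry, estimator):
    # Train one model on a dataset entry and return its prediction frame and test MSE
    frame, x, y = entry["frame"], entry["x"], entry["y"]
    train = frame[frame["split"] == "train", :]
    valid = frame[frame["split"] == "valid", :]
    test = frame[frame["split"] == "test", :]

    estimator.train(x=x, y=y, training_frame=train, validation_frame=valid)
    prediction = estimator.predict(test)            # one column named "predict"
    mse = estimator.model_performance(test).mse()   # MSE on the test rows
    print(name, x, mse)
    return prediction, mse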
• MSE. Mean squared error. Square each error, then calculate their mean. Scale dependent.
• RMSE. Root mean squared error. Calculate the square root of the MSE. Scale dependent.
• RMSLE. Root mean squared logarithmic error. Not scale dependent. Useful if an under-prediction is worse than an over-prediction, but that is not the case here, quite the contrary.
• MAE. Mean absolute error. Take the absolute value of each error, then calculate their mean. Scale dependent.
MSE was extracted by the model-building function because that is the only point where MSE values are computed on the test sets.
The following statements are true for both DRF and GBM models:
The code for the graph visualization of the above statements is in section 5.1.1, where MSE and recall were plotted on the same figure.
5. Evaluation
5.1 Evaluate Results
5.1.1 Assessment of Data Mining Results w.r.t. Business Success Criteria
The business objective is to spend less money for the same quantity of consumed beer. To achieve this, a model has to
have a recall score as close to 1 as possible.
A recall score of 1 means that all beer with rating lower than 5 was identified by the model, and a score of 0 means that
none of the beer with rating lower than 5 was identified.
The recall score shows the decrease in low rated beer acquisition. For example, a score of 0.6 means that the model
correctly identifies 60% of the low rated beer. In this case, the number of low rated beers bought would be reduced by
60% compared to random purchases.
• True positives. Rating < 5 & Prediction < 5. Beer with real life rating lower than 5 predicted correctly. This means
beer rightfully not recommended by model and money saved.
• False negatives. Rating < 5 & Prediction >= 5. Beer with real life rating lower than 5 predicted incorrectly. This
means beer recommended by model would be bought and discarded.
• True negatives. Rating >= 5 & Prediction >= 5. Beer with real life rating 5 or higher predicted correctly. This means
beer recommended by model would be bought and consumed.
• False positives. Rating >= 5 & Prediction < 5. Beer with real life rating 5 or higher predicted incorrectly. This means
beer not recommended by model, but that would have been consumed in real life.
• Import predictions.
• Select false negatives.
• Calculate recall score.
• Merge MSE and recall in a single dataframe.
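A sketch of the recall calculation on one exported prediction file (the file path is illustrative):

import pandas as pd

# Share of truly low-rated beers (< 5) that the model also predicted below 5
pred = pd.read_csv("beer_output/predictions/regular_all_drf.csv")
actual_low = pred["rating"] < 5
predicted_low = pred["predict"] < 5
recall = (actual_low & predicted_low).sum() / actual_low.sum()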
This graph shows that a lower MSE doesn’t necessarily mean a better model.
Figure 22. MSE and recall score of models
5.1.1.6 Plot graph of recall scores split by model type and dataset range
The graph shows that DRF and GBM have similar results for the same dataset, except for the beer full-range dataset, where DRF performed better.
The graph also shows that full-range datasets generated better models than the datasets with ratings within 2*std.
This was to be expected because the goal of the project is to predict outliers, not the bulk of average ratings.
Decision: Keep outliers when testing the model on the unseen beer dataset.
Figure 24. Recall of models split by dataset range
Decisions:
1. Keep outliers in unseen dataset.
2. Split the unseen beer dataset according to alcohol content into alcohol-free, light and regular beer.
3. Use DRF models for predictions on the unseen dataset.
Approved models:
• alc_free_all_drf for unseen alcohol-free beer
• light_all_drf for unseen light beer
• regular_all_drf for unseen regular beer
Having too few observations in a subgroup didn’t necessarily lead to wrong predictions in that subgroup. For example, see the graph with false negatives split as enough vs. too few.
6. Deployment
6.1 Plan Deployment
Deployment Plan
In the unseen dataset that contains 81 beers:
6.1.1 Import and clean the unseen file following the same steps as with the seen data
Import the dataset from: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/unseen_data/unseen_input/beer_unseen_raw.csv
alc_free_unseen:
• Rows: 13
• Number of ratings lower than 5: 1
• Recall score: 0
• This means that out of the 13 beers in alc_free_unseen, only one had a rating lower than 5 and the model didn’t
detect it.
• The model generates a UserWarning stating that the Test/Validation dataset column “Style” has levels not trained on: alt. This means that the alt style wasn’t present in the training frame but was encountered in the unseen set on which I deployed the model. In this case the prediction takes into consideration only the Flavor feature.
Solutions:
A. Decide impact of small quantity of discarded beer from a similarly small group. Calculate cost/benefit of
pursuing other machine learning approaches.
B. Consume more alcohol-free beer and train/test model after more low ratings are collected.
light_unseen:
• Rows: 9
• Number of ratings lower than 5: 0
• Recall score: not computed
• This means that out of the 9 beers in light_unseen, none had a rating lower than 5, so the model didn’t have
anything to detect.
• The model generates a UserWarning stating that the Test/Validation dataset column “Flavor” has levels not trained on: herb. This means that the herb flavor wasn’t present in the training frame but was encountered in the unseen set on which I deployed the model. In this case the prediction takes into consideration only the Style and ABV (alcohol content) features.
Solutions:
A. Decide impact of small quantity of discarded beer from a similarly small group. Calculate cost/benefit of pursuing
other machine learning approaches.
B. Consume more light beer and train/test model after more low ratings are collected.
regular_unseen:
• Rows: 59
• Number of ratings lower than 5: 7
• Recall score: 0.28
• This means that out of the 59 beers in regular_unseen, seven had ratings lower than 5, and the model detected
28% of them. By using this already trained model, the number of discarded bottles of beer drops by 28%.
Solutions:
6.2 Plan Monitoring and Maintenance
Monitoring and Maintenance Plan
1. Record ratings of beer purchased because of model recommendations.
2. Every quarter calculate the percentage of beer that has been bought and discarded.
3. Compare percentage of discarded beer vs. recall score of the model applied.
4. Record the date when beer was rated.
5. Analyze change of beer preferences as time goes by.
6. Run the model yearly to adjust for changes in beer preferences.
Final Report
This document is the final report and was written based on the CRISP-DM template. It is stored at https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/Final_Report.pdf
Presentations
PowerPoint presentation at https://www.scribd.com/presentation/528726148/Data-Science-Beer-Recommendation-Machine-Learning-Summary
Key take-away: When comparing models, a lower MSE doesn’t necessarily mean a better model. For example, the regular_2std and beer_2std models had lower MSE compared to regular_all and beer_all, but also a lower recall score. And recall is the metric that best defines the success of the models because it helps achieve the business goal of throwing away less beer. This is why I have used only the recall score to assess the models.
Lessons learned:
1. If the goal is to predict (also) outliers, keep the outliers when training the model. Outlier removal works best for correctly predicting the bulk of values, which was not the case here.
2. Instead of building a model with the lowest possible MSE, in this project it was worth deciding on a custom metric (the recall score).
Redundant tasks:
• Outlier removal. All models performed better on datasets that kept the outliers. This is somewhat logical because the goal of this project was to correctly detect outliers.
• Derived attribute “Filtration” unused. “Perception” helped only in beer_all and beer_2std. Since only models for the split datasets are approved, “Perception” also remained unused.
For all shared code related to this work, check if you include (Status – Remarks):
• Specification of dependencies. – Yes
• Training code. – Yes
• Evaluation code. – Yes
• (Pre-)trained model(s). – Yes
• README file includes table of results accompanied by precise command to run to produce those results. – Yes
For all reported experimental results, check if you include (Status – Remarks):
• The range of hyper-parameters considered, method to select the best hyper-parameter configuration, and specification of all hyper-parameters used to generate results. – Yes – Recorded in model files
• The exact number of training and evaluation runs. – No
• A clear definition of the specific measure or statistics used to report results. – Yes
• A description of results with central tendency (e.g. mean) & variation (e.g. error bars). – No – Not applicable
• The average runtime for each result, or estimated energy cost. – No – Not in scope
• A description of the computing infrastructure used. – Yes
Checklist reproduced from: www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist-v2.0.pdf