
Contents

Reproducibility statement

1. Business Understanding
1.1 Business Objectives
1.2 Assess Situation
1.3 Determine Data Mining Goals
1.4 Produce Project Plan
2. Data Understanding
2.1 Collect Initial Data – Initial Data Collection Report
2.2 Describe Data – Data Description Report
2.3 Explore Data – Data Exploration Report
2.4 Verify Data Quality – Data Quality Report
3. Data Preparation
3.1 Select Data – Rationale for Inclusion/Exclusion
3.2 Clean Data – Data Cleaning Report
3.3 Construct Data
3.4 Integrate Data
3.5 Format Data – Reformatted Data
3.6 Dataset Output
3.7 Dataset Description Report
4. Modeling. Machine Learning with H2O
4.1 Initialize H2O
4.2 Select .csv Files
4.3 Import Datasets As Dictionaries Into H2O
4.4 Select Modeling Techniques
4.5 Generate Test Design
4.6 Build Model
4.7 Assess Models
5. Evaluation
5.1 Evaluate Results
5.2 Review Process
5.3 Determine Next Steps
6. Deployment
6.1 Plan Deployment
6.2 Plan Monitoring and Maintenance
6.3 Produce Final Report
6.4 Review Project

BEER RECOMMENDATION PROJECT
Proof of Concept

Copyright (C) 2020-2021 Alina Molnar


License CC BY-NC

Reproducibility statement
To ensure reproducibility of all my analyses, I have deposited the Jupyter notebook containing the code in the GitHub
repository https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept. The same repository
contains the raw input data, all results files, the software requirements, the final report, a video presentation, and a
PowerPoint presentation. The environment setup is listed in the above-mentioned final report, chapter 1.2.1 Inventory of Resources.

1. Business Understanding
1.1 Business Objectives
1.1.1 Background

My husband likes to try a new beer whenever he has the opportunity. I have volunteered to keep an Excel file recording
beer names and ratings to make sure we don’t buy the same product twice. Currently we buy any beer that’s not in the
Excel file and hope it’s enjoyable. If he likes it, he drinks it and the beer gets a rating of 5 or higher. If he doesn’t
like it, the beer goes down the drain, is rated below 5, and he opens another one. Until now, 9.5% of all beers bought were
not consumed.

1.1.2 Business Objectives


Spend less money on the same quantity of consumed beer.

1.1.3 Business Success Criteria


Compare the quantity of discarded beer before and after implementing the prediction algorithm. Currently 9.6% of all beers
are discarded because their rating is lower than 5.

The project will be successful if fewer than 9.6% of new beers receive a rating lower than 5.

Currently there are no apps to use as a benchmark for predicting beer ratings.

1.2 Assess Situation


1.2.1 Inventory of Resources
Data Scientist: Alina Molnar – author of input data file and all steps of data science project.
Data: Excel file with beer names and ratings according to husband’s preferences. The dataset has observations on 322
beers, and all cells are filled. There are no missing values.
Computing resources:
• Dell Latitude E6530
• CPU: Intel i7-2760QM, 2.40 GHz
• GPU: Intel HD Graphics 3000, NVIDIA NVS 4200M
• Operating system: Windows 10
Software toolchain:
• cmd
• Python 3.8
• Visual Studio Code 1.52.0
• Windows PowerShell 5.1.18362.1171
Supporting software:
• H2O 3.32.1.5
• IPython 7.17.0
• Jupyter 1.0.0
• Matplotlib 3.4.2
• Numpy 1.19.4
• Pandas 1.0.3
• Scipy 1.5.4
• Seaborn 0.11.0
Software environment:
{'commit_hash': '05c1664fb',
'commit_source': 'installation',
'default_encoding': 'utf-8',
'ipython_path': 'C:\\Users\\alina\\AppData\\Roaming\\Python\\Python38\\site-packages\\IPython',
'ipython_version': '7.17.0',
'os_name': 'nt',
'platform': 'Windows-10-10.0.19041-SP0',
'sys_executable': 'E:\\Python382\\python.exe',
'sys_platform': 'win32',
'sys_version': '3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)]'}

1.2.2 Requirements, Assumptions, and Constraints


Result: Machine learning model to predict ratings of beers before buying them.
Quality of result: The lower the number of beers discarded, the better the quality of the model.
Assumptions: All data is correct because I collected it myself and double-checked it when recording.
Constraints:
A. Dataset is quite small, 322 rows, so the training will be done on an even smaller subset.
B. Analyze patterns in each feature and across features, but not the evolution of patterns across time because dates
of observations were not recorded.
C. Predicted ratings will be modeled from categorical features. This is a cross between a classification and a regression
task: categorical predictors as input and a numeric variable as output.

1.2.3 Risks and Contingencies


Risk: It is possible that the model will be insufficiently trained to get useful results.

Contingency: Use an interval of two standard deviations away from the mean to train the model instead of three standard
deviations as one would choose in a big dataset. This will leave out more outliers and will decrease the risk of overfitting.

1.2.4 Terminology
Business terminology:

ABV – Alcohol By Volume: how much alcohol is contained in a given volume of an alcoholic beverage. Actual ABV in this
dataset starts at 0 (alcohol-free) and ends at 10.

Country – the country where the beer was produced.

Fermentation – place of fermentation in the barrel: top (for ale, alt, blonde, soda and wheat beer) or bottom (dark, keller,
lager, mix and pils beer). Warm fermentation is recorded as top, and cold fermentation as bottom.

Filtration – the process of removing sediments in the brewing process: filtered, unfiltered.

Flavor – additional flavors if any: fruit, herb, lemon, standard. If none is specified on the bottle, it is classified as standard.

Method – brewing method: industrial for big volumes, craft for small batches.

Pasteurization – conservation of beer by heat treatment: pasteurized, unpasteurized.

Rating – actual ratings recorded in the input file for each beer. Each rating reflects how much that particular beer was
enjoyed after a salty meal. The scale is from 1 (worst) to 10 (best). Beers that are exceptionally good or exceptionally bad
may receive ratings that are out of this range. Actual ratings are between -1 and 11.

Predicted rating – output predicted by the machine learning model after training on a subset of data.

Style – brewing style: ale, alt, blonde, dark, keller, lager, mix, pils, soda, wheat.

• Ale – beer fermented above 15°C in a top-fermentation process.


• Alt – top-fermentation beer brewed in Westphalia, Germany.
• Blonde – very pale ale with a clear golden color.
• Dark – dark colored lager beer.
• Keller – unfiltered and unpasteurized lager beer.
• Lager – beer brewed at cold temperatures in a bottom-fermentation process.
• Mix – beer mixed with cola, energy drinks or liquor.
• Pils – a type of pale lager.
• Soda – beverage that tastes like beer with or without alcohol, fermented with fruit or plant roots.
• Wheat – top-fermentation beer that contains wheat.

Split – the subset an observation belongs to in the machine learning model.

• Train – observations used for training.


• Valid – observations used for validation.
• Test – observations used for testing the model.

1.2.5 Costs and Benefits


Costs

The time dedicated to this project is part of my planned data science practice.
There are no data acquisition costs because the data collection is my own property.
The current hardware fits the project’s requirements without further purchases.
The software needed was already installed, with the exception of H2O, which is available for free.
Benefits

Bring value by decreasing the quantity of beer bought and discarded. This helps save money and effort and protects the
environment:
• Less money spent on beer.
• Less time spent carrying unsatisfactory beer on the staircase.
• Lower carbon footprint from a reduced quantity of discarded beer.

1.3 Determine Data Mining Goals


1.3.1 Data Mining Goals
Predict future ratings given past ratings.
Create a machine learning model that predicts beer ratings well enough to select beers that get a passing grade. Ratings
are between 1 (worst) and 10 (best), and a passing grade is 5 or higher.

1.3.2 Data Mining Success Criteria


Predictions generated by the machine learning model lead to less than 9.5% discarded beer.

1.4 Produce Project Plan


1.4.1 Project Plan

TASK NAME START DATE END DATE START ON DAY* DURATION* (WORKDAYS) TEAM MEMBER

1. Business Understanding
1.1 Business Objectives 01-09-20 04-09-20 0 4 Alina
1.2 Assess Situation 07-09-20 11-09-20 6 5 Alina
1.3 Determine Data Mining Goals 14-09-20 18-09-20 13 5 Alina
1.4 Produce Project Plan 21-09-20 25-09-20 20 5 Alina
Documentation 28-09-20 02-10-20 27 5 Alina
2. Data Understanding
2.1 Collect Initial Data – Initial Data Collection Report 05-10-20 09-10-20 34 5 Alina
2.2 Describe Data – Data Description Report 12-10-20 23-10-20 41 10 Alina

2.3 Explore Data – Data Exploration Report 26-10-20 06-11-20 55 10 Alina
2.4 Verify Data Quality – Data Quality Report 09-11-20 13-11-20 69 5 Alina
Documentation 16-11-20 20-11-20 76 5 Alina
3. Data Preparation
3.1 Select Data – Rationale for Inclusion/Exclusion 23-11-20 04-12-20 83 10 Alina
3.2 Clean Data – Data Cleaning Report 07-12-20 18-12-20 97 10 Alina
3.3 Construct Data 04-01-21 15-01-21 125 10 Alina
3.4 Integrate Data 18-01-21 29-01-21 139 10 Alina
3.5 Format Data – Reformatted Data 01-02-21 01-02-21 153 1 Alina
3.6 Dataset Output 01-02-21 01-02-21 153 1 Alina
3.7 Dataset Description Report 02-02-21 12-02-21 154 9 Alina
Documentation 15-02-21 19-02-21 167 5 Alina
4. Modeling. Machine Learning with H2O
4.1 Initialize H2O 22-02-21 22-02-21 174 1 Alina
4.2 Select .csv Files 22-02-21 22-02-21 174 1 Alina
4.3 Import Datasets As Dictionaries Into H2O 23-02-21 05-03-21 175 9 Alina
4.4 Select Modeling Techniques 08-03-21 26-03-21 188 15 Alina
4.5 Generate Test Design 29-03-21 16-04-21 209 14 Alina
4.6 Build Model 19-04-21 14-05-21 230 20 Alina
4.7 Assess Models 17-05-21 28-05-21 258 10 Alina
Documentation 31-05-21 11-06-21 272 10 Alina
5. Evaluation
5.1 Evaluate Results 14-06-21 25-06-21 286 10 Alina
5.2 Review Process 28-06-21 02-07-21 300 5 Alina
5.3 Determine Next Steps 05-07-21 09-07-21 307 5 Alina
Documentation 12-07-21 16-07-21 314 5 Alina
6. Deployment
6.1 Plan Deployment 19-07-21 30-07-21 321 10 Alina
6.2 Plan Monitoring and Maintenance 02-08-21 06-08-21 335 5 Alina
6.3 Produce Final Report 09-08-21 27-08-21 342 15 Alina
6.4 Review Project 30-08-21 03-09-21 363 5 Alina

[Gantt chart of the project plan: each task from 1.1 Business Objectives through 6.4 Review Project plotted as a bar against the axis “Days of the Project” (0–400).]

1.4.2 Initial Assessment of Tools and Techniques


Programming language: Python, because it is the language most used by data scientists.

Tools:

▪ H2O: machine learning platform.


▪ Matplotlib: library for visualization tasks. Has numerous customization options, from defining tick labels to
changing legend fonts.
▪ Numpy: library for array calculations and mathematical functions. Highly versatile for working with arrays.

▪ Pandas: library for data cleaning and transformation. Easy to read code that helps with understanding the flow.
▪ Seaborn: library for eye friendly statistical graphs. Makes uncluttered visualizations with simple code.
Techniques:

▪ Machine learning: create a model that predicts ratings.


▪ Supervised learning: learn from previous observations by training model on input and desired output of a subset
of data.
▪ Distributed Random Forest (DRF): ensemble method.
▪ Gradient Boosting Machine (GBM): forward learning ensemble method.

2. Data Understanding
2.1 Collect Initial Data – Initial Data Collection Report
Note:

1. Make a copy of the original file.


2. Store the original away from the project folder.
3. Do all work on the copy file.

Work on the copy file stored in the following location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/beer_input/beer_241.xlsx

Import file in Python as a Pandas dataframe.

The dataset has 322 rows and 8 columns, and all cells are filled. There are no missing values.
81 beers are cut and pasted into another file so that they remain unseen by the machine learning model.
I will work with the remaining 241 beers, splitting them into three subsets for training, validating, and testing the model.
Since I don’t know which attributes have a higher impact on ratings, I will keep them all for the initial analysis and remove
unnecessary columns after seeing the results.
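The hold-out step above can be sketched as follows. The frame here is a synthetic stand-in (invented names and ratings); in practice the real file would be loaded from the repository with pd.read_excel.

```python
import pandas as pd

# Hypothetical stand-in for the real Excel file; in practice:
# df = pd.read_excel("beer_input/beer_241.xlsx")
df = pd.DataFrame({"name": [f"beer {i} 5.0" for i in range(322)],
                   "rating": [7] * 322})

# Cut 81 beers into a separate frame so they stay unseen by the model;
# keep the remaining 241 for train/validation/test splitting.
unseen = df.sample(n=81, random_state=0)
working = df.drop(unseen.index)
```

The random_state is arbitrary; any fixed seed keeps the hold-out reproducible.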

2.2 Describe Data – Data Description Report


The dataset imported as a Pandas dataframe has 241 rows and 8 columns with no missing values.

One column, Rating, stores beer ratings as a numeric variable of integer type.

Seven columns store data as object type in a text form and need to be converted: Name, Method, Style, Flavor,
Fermentation, Country and Split.

• Rating: beer ratings as integers between -1 and 11; the mean rating is 6.9.
• Name: beer names with their alcohol content as the last part of the text. Can be separated.
• Method: brewing method. Repetitive values, a good candidate for conversion to category.
• Style: brewing style. Repetitive values, a good candidate for conversion to category.
• Flavor: additional flavors. Repetitive values, a good candidate for conversion to category.
• Fermentation: place of fermentation in the barrel. Repetitive values, a good candidate for conversion to category.

• Country: numerous different string values. Needs further check before deciding on its transformation.
• Split: repetitive values that mark which observations will be used for training, validation and test when building
model in H2O.

2.2.1 Data Preprocessing: Standardization


Actions:

• Convert all column labels to lowercase.


• Convert all column values to lowercase if they are strings.
• Convert Name column from object to string.
• Copy the alcohol content from the end of the name and store it as a separate column. Why copy instead of cut?
Because sometimes a brand is sold in two variants under the same name and characteristics, one containing
alcohol and the other alcohol-free. Keeping the alcohol content at the end of the name ensures a unique
identifier for each observation.
• Convert alcohol content to float.
• Convert Method, Style, Flavor and Fermentation columns from object types to category.
• Keep Split column as object because it’s used only by H2O which can handle it as is. Also, prior to using it in H2O,
analysis of categorical features in bulk based on their data type is easier if irrelevant columns are left out.
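The standardization actions above can be sketched roughly as follows; the rows and values are invented, and the column names follow the report.

```python
import pandas as pd

# Toy rows mirroring the report's columns (values are invented).
df = pd.DataFrame({
    "Name": ["Brand Pils Unfiltered 4.9", "Brand Pils 0.0"],
    "Method": ["Industrial", "Craft"],
    "Style": ["Pils", "Pils"],
    "Flavor": ["Standard", "Lemon"],
    "Fermentation": ["Bottom", "Bottom"],
})

df.columns = df.columns.str.lower()          # lowercase all column labels
for col in df.select_dtypes("object"):
    df[col] = df[col].str.lower()            # lowercase all string values
df["name"] = df["name"].astype("string")     # object -> string dtype

# Copy (not cut) the trailing alcohol content into its own column,
# so the full name stays a unique identifier.
df["abv"] = df["name"].str.extract(r"(\d+\.?\d*)$", expand=False).astype(float)

for col in ["method", "style", "flavor", "fermentation"]:
    df[col] = df[col].astype("category")     # object -> category dtype
```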

2.2.2 Data Preprocessing: Pre-Validation


Validation of uniqueness in beer names. Check for duplicate data, remove if found.

After removing duplicates, the dataset has 240 rows and 9 columns. The extra column is ABV (alcohol by volume) in float
format.

The 25th–75th percentile rating interval is 6 to 8 in most subgroups of the categorical features. Out of the 240 beers, 9.6% have
a rating lower than 5.

There is noise (extreme ratings) in most of the subgroups of each feature.
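A minimal sketch of the duplicate check, on invented rows (the real check reduced the dataset from 241 to 240 rows):

```python
import pandas as pd

# Invented names with one duplicated entry.
df = pd.DataFrame({"name": ["brand pils 4.8", "brand pils 4.8", "brand keller 5.3"],
                   "rating": [7, 7, 8]})

duplicates = df[df.duplicated("name", keep=False)]   # inspect before removing
df = df.drop_duplicates("name").reset_index(drop=True)
```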

2.2.3 Data Preprocessing: Numeric Analysis


Description of numeric variables:

metrics rating ABV


count 240 240
mean 6.9 4.1
std 1.8 2.2
min -1 0
25% 6 2.5
50% 7 4.9
75% 8 5.3
max 11 10

2.2.4 Data Preprocessing: Categorization


Description of categorical variables. Write function to avoid repetitive code.

method count mean std min 25% 50% 75% max


craft 57 7 1.9 1 6 7 8 11
industrial 183 6.9 1.7 -1 6 7 8 10

style count mean std min 25% 50% 75% max
ale 20 6.6 2.3 1 5.8 7 8 10
alt 7 6.7 1 5 6.5 7 7 8
blonde 38 7.1 1.5 2 6 7 8 10
dark 17 7.6 1.7 4 6 8 9 11
keller 12 6.9 1.4 5 6 7 8 10
lager 38 6.9 1.7 3 6 7 8 9
mix 4 4 3.3 0 3 4 5 8
pils 53 6.8 1.4 3 6 7 8 10
soda 13 7 3 -1 6 8 9 10
wheat 38 6.9 1.7 1 6 7 8 9

flavor count mean std min 25% 50% 75% max


fruit 24 7.1 2.1 0 6.8 7.5 8 10
herb 8 5.6 2.9 -1 5.8 6 7.2 8
lemon 28 7.9 0.9 6 7 8 8 10
standard 180 6.8 1.7 1 6 7 8 11

fermentation count mean std min 25% 50% 75% max


bottom 124 6.9 1.7 0 6 7 8 11
top 116 6.9 1.9 -1 6 7 8 10
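The helper function mentioned above might look like this sketch; the frame is a tiny invented example, and the function name describe_category is my own.

```python
import pandas as pd

def describe_category(df, col, target="rating"):
    """Per-subgroup summary statistics of the target, one call per
    categorical column, to avoid repeating the same describe() code."""
    return df.groupby(col, observed=True)[target].describe().round(1)

# Tiny invented frame just to show the output shape.
df = pd.DataFrame({"method": ["craft", "craft", "industrial", "industrial"],
                   "rating": [7, 8, 6, 7]})
summary = describe_category(df, "method")
```

Calling it for each of method, style, flavor, and fermentation reproduces tables like the ones above.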

2.2.5 Data Preprocessing: String Columns


Description of Name column:

There are 240 unique beer names. Each beer name starts with the brand, followed by characteristics like style, flavor,
filtration, and pasteurization, if applicable. All names end with a number that represents the alcohol content.

Description of Country column:

There are 17 unique countries, quite a lot for a dataset of 240 beers. I will determine if each country has enough
occurrences in the data selection section.

2.3 Explore Data – Data Exploration Report


2.3.1 Hypothesis 1
There might be a linear relationship between ratings and alcohol content.

Result 1: The scatterplot shows no linear relationship and no pattern between ratings and alcohol content. However, the
lemon-flavor subgroup with zero or low alcohol content has higher ratings than the other groups. Because alcohol-free
and low-alcohol beers are present across many flavors, out of which lemon stands out as higher rated, it is safe to say there is a
relationship between lemon flavor and high ratings.

Figure 1. Relationship between ABV and rating by each categorical feature

2.3.2 Hypothesis 2
Some subgroups might have a low number of observations.

Result 2: The results are stored in a new column, “Occurrence”: “too few” if below the threshold, NaN for the rest. With a
threshold of 5% of the total, the alt and mix styles and the herb flavor fall below it.
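One possible sketch of this occurrence check; the style counts are invented, and the column name follows the report.

```python
import pandas as pd

# Invented counts; with a 5% threshold, only "alt" falls below (1/21 ~ 4.8%).
df = pd.DataFrame({"style": ["pils"] * 12 + ["lager"] * 8 + ["alt"]})

# Share of the total for each row's subgroup, then flag the rare ones.
share = df["style"].map(df["style"].value_counts(normalize=True))
df["occurrence"] = share.lt(0.05).map({True: "too few", False: None})
```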

2.3.3 Hypothesis 3
Flavor column might have high variation between the averages of its subgroups. If this is the case, Flavor would be a
useful feature when building the machine learning model.

Result 3: Flavor subgroups have observable variation between their averages. The error bar is larger for the herb subgroup.

Figure 2. Average rating of Style subgroups

2.3.4 Hypothesis 4
Lemon beers’ high average might not be due to outliers. The scatterplot between alcohol content and rating grouped by
Flavor hints that this is due to all observations being rated higher, not to outliers.

Result 4: The boxplot shows that the distribution of lemon beers reflects higher ratings overall compared to other
subgroups. The KDE plot shows that ratings of lemon beers are so much higher than the rest that their median overlaps
with the 75th percentile. This is why the lemon subgroup’s box in the boxplot has no median line.

Figure 3. Distribution of ratings in Flavor subgroups

Figure 4. KDE of lemon beer ratings

2.4 Verify Data Quality – Data Quality Report
There are no missing values as shown in Initial Data Collection Report (section 2.1).
All keys and all string values are stored in lower-case because I have transformed all text in section 2.2.

2.4.1 Data coverage


2.4.1.1 Data coverage in numeric variables:
Unique values of ratings are -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. All values between the minimum and the maximum show up.
Looks right.
Unique values of alcohol content are 0.0, 2.0, 2.1, 2.4, 2.5, 2.9, 3.0, 3.7, 3.8, 4.0, 4.1, 4.2, 4.3, 4.5, 4.7, 4.75, 4.8, 4.9, 5.0,
5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6.0, 6.2, 6.3, 6.5, 6.7, 6.8, 7.0, 7.5, 8.2, 9.5, 10.0. These values look plausible, with
the interval between 4 and 6 more populated, as happens in real life. There is a gap between 0 and 2 that looks right
because there is usually no buying option between alcohol-free beers and beer-lemonade mixes.

2.4.1.2 Data coverage in categorical variables:


Values contained in attributes fit with the keys (column names) and are consistent across the whole dataset. The countplot
shows Style and Flavor columns have subgroups with few observations.
Size of subgroups is consistent with stores’ assortment:
• Method: industrial has more observations than craft.
• Style: most observations are pils, lager and blonde.
• Flavor: standard has more observations than fruit.
• Fermentation: similar counts of top and bottom fermented.

Figure 5. Count of observations in each subgroup of categorical columns

2.4.2 Define noise in ratings
The dataset has 240 rows. In a Gaussian distribution, outliers are usually defined as values more than three standard
deviations away from the mean. In small datasets a narrower interval is preferred, for example two standard
deviations away from the mean. A narrower interval avoids overfitting in case values are not representative of
real life. I will use this criterion when cleaning data in section 3.2.

2.4.3 Notes from data collector and file author


I trust the data to be accurate because I collected it with the purpose of analyzing it once I had enough
observations.
Data collected from labels is 100% correct: names, alcohol content, style and flavor.
Data that represents personal assessments (ratings) is 100% correct.
Data that was inferred from label is correct according to my knowledge. This includes style when it wasn’t mentioned on
the label, method and fermentation. I have searched online for each beer to find out its brewing method and its style.
Some of the producers do not have a website, and some have a website, but they don’t specify the beer style or the
fermentation method.
Data in Method: If there’s no data on label or online, presume it to be industrial method. Producers tend to advertise on
the label if their beer is craft, so a beer without this information is likely to be industrial brewed.
Data in Style: If there is no data on the label or online, presume lager style. This means that the lager subgroup might
have a few entries that are in fact not lager. Considering that the lager style has many entries, these lager-assumed beers
should have a low impact on the overall picture. The average rating for lager is 6.9, exactly the same as the average of the
full dataset.
Data in Fermentation: record it as top if style is ale, alt, blonde, soda or wheat, and bottom if style is dark, keller, lager,
mix or pils. Warm fermentation (for example in wheat beer) is recorded as top, and cold fermentation (for example lager
beer) is recorded as bottom.

3. Data Preparation
3.1 Select Data – Rationale for Inclusion/Exclusion
Decide on a correlation metric to describe the relationship between ratings and alcohol content. Their relationship is non-linear,
as shown in the Data Exploration Report. Find out what type of distributions ratings and alcohol content have:

• If both distributions are gaussian and relationship is linear, use covariance or Pearson’s correlation coefficient.
• If one or both distributions are non-gaussian and relationship is non-linear, but still monotonic use Spearman’s
correlation coefficient.

3.1.1 Check distribution of ratings


Action: Plot kernel density estimation (KDE) of ratings and see if there’s a pattern.

Result: The KDE curve of ratings is Gaussian, therefore an observable pattern. Only a few beers get extreme ratings, a minimum
of -1 or a maximum of 11. Most ratings are grouped around 7.

Decision: Include ratings column in machine learning model as target variable.


Figure 6. KDE of rating

3.1.2 Check distribution of alcohol content


Action: Plot kernel density estimation (KDE) of alcohol content and see if there’s a pattern. Commercial beer is either
regular, alcohol-free, or lemonade mix, and a smooth curve hides these groups. Set bandwidth lower than 1 to check if
groups show up.

Result: The minimum alcohol content is 0 and the maximum is 10. The KDE curve of alcohol content is not Gaussian because of
alcohol-free beer, which accounts for 17% of the total. Still, it shows a pattern of three subgroups, each with its own Gaussian
curve:

• alcohol-free beer
• mix of beer with lemonade, alcohol content between 0.5 and 3, curve peaked at 2.5
• beer without additions, alcohol content between 3 and 10, curve peaked at 5

Decision: Include ABV (alcohol content) column in machine learning model when training regular beer subgroup. Ratings
and alcohol content have similar scale, so there’s no need to normalize them.

Figure 7. KDE of alcohol content

3.1.3 Combined distribution of alcohol content and ratings


Action 1: I calculate neither the covariance nor Pearson’s correlation coefficient between ratings and alcohol
content because both require a Gaussian distribution of the variables involved. Ratings have a Gaussian distribution, but
alcohol content does not.

Instead, I can calculate Spearman’s correlation coefficient because it works for both linear and non-linear relationships and
the variables do not need Gaussian distributions. However, Spearman’s correlation coefficient assumes a monotonic
relationship, meaning that as one variable increases, the other consistently increases (or consistently decreases).

I have shown in the Data Exploration Report (section 2.3) that ratings and alcohol content don’t have a linear relationship,
but they still can have non-linear one, so I’ll calculate Spearman’s correlation coefficient.

Result 1: Spearman’s correlation coefficient is -0.11, which means that either the variables do not have a monotonic
relationship, or they have no relationship at all. Since we already know that lemon beer with an alcohol content
of 2.5 has higher ratings than the rest, we can say that there is a relationship, but it is not monotonic.
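The calculation can be sketched with scipy.stats.spearmanr (SciPy is already in the toolchain); the vectors here are invented, so the coefficient will differ from the report's -0.11.

```python
from scipy.stats import spearmanr

# Invented (abv, rating) pairs standing in for the real columns.
abv =    [0.0, 2.1, 2.5, 4.9, 5.0, 5.3, 7.5, 10.0]
rating = [6,   8,   9,   7,   6,   7,   5,   4]

# Rank-based correlation: no Gaussian assumption, works for any
# monotonic relationship, linear or not.
rho, p_value = spearmanr(abv, rating)
```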

Action 2: Plot kernel density estimation (KDE) of alcohol content and rating and see if data can be grouped. Set bandwidth
less than 1 because distribution of alcohol content is not gaussian.

Result 2: There are three zones, so it makes sense to split the dataset into three groups after all cleanup is done.

Decision: Before and after removing rating outliers, split dataset into alcohol-free beer (0 ABV), light beer (0.5-3 ABV) and
regular beer (ABV higher than 3).

Figure 8. KDE of alcohol content and rating

3.1.4 Check country column if it has enough observations for each unique value
Action: Count unique countries and how many occurrences each of them has. Show results as percentage.

Results: There are 17 unique countries, and 15 of them each have less than 5% of the total observations. With a small
dataset, this feature might introduce bias into a machine learning model because of the large number of subgroups combined
with the low number of occurrences in each subgroup.

Decision: Exclude country column from machine learning model.
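A sketch of the occurrence count with invented countries; the real column has 17 unique values over 240 rows.

```python
import pandas as pd

# Invented country values (21 rows, so singletons fall under 5%).
country = pd.Series(["germany"] * 10 + ["romania"] * 9 + ["belgium", "czechia"])

# Share of each unique country, as a percentage of all observations.
pct = country.value_counts(normalize=True) * 100
rare = pct[pct < 5]          # subgroups under the 5% threshold
```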

3.1.5 Check which categorical features have high variation across subgroups
Action: Write functions to plot averages and distribution of subgroups. Select features with high variation across
subgroups.

Results: Boxplots and barplots show that Flavor and Style have high variation between the distributions and averages of their
subgroups. Method and Fermentation have low variation across their subgroups. Error bars are larger for subgroups with a
low number of observations.

Decision: Include Style and Flavor in all machine learning models. Decide later for Method and Fermentation, after splitting the
subsets: even if a feature shows no interesting subgroups in the original dataset, it might in the subsets split
on the alcohol content criterion.
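The variation check can be sketched with pandas groupby; the sample rows below are hypothetical, and the spread of subgroup means stands in as a simple proxy for "variation across subgroups":

```python
import pandas as pd

# Hypothetical sample with columns described in the report.
beer = pd.DataFrame({
    "Flavor": ["fruit", "fruit", "lemon", "lemon", "standard", "standard"],
    "Method": ["craft", "industrial", "craft", "industrial", "craft", "industrial"],
    "Rating": [8, 7, 9, 8, 5, 5],
})

def subgroup_stats(df, feature, target="Rating"):
    """Mean, std and count of the target for each subgroup of a feature."""
    return df.groupby(feature)[target].agg(["mean", "std", "count"])

flavor_stats = subgroup_stats(beer, "Flavor")
method_stats = subgroup_stats(beer, "Method")

# Spread of subgroup means: a simple proxy for variation across subgroups.
flavor_spread = flavor_stats["mean"].max() - flavor_stats["mean"].min()
method_spread = method_stats["mean"].max() - method_stats["mean"].min()
```

In this toy sample Flavor shows a much larger spread than Method, mirroring the report's finding.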

Figure 9. Average rating of beer subgroups

Figure 10. Rating distribution of beer subgroups

3.2 Clean Data – Data Cleaning Report


Action: Calculate the standard deviation (std) of ratings. Define the range of normal ratings as within 2*std of the mean because the
dataset is small and has a Gaussian distribution. Usually a 3*std interval is preferred, but a broad interval might lead to
overfitting on a small dataset. Select the ratings outside the 2*std range as outliers.

Results:

• The mean rating is 6.90.
• The lower limit of normal ratings is 3.33 and the upper limit is 10.48.
• These are the rating outliers: -1, 0, 1, 1, 2, 3, 3, 3, 3, 3, 3, 11.
• There are 12 outliers out of 240 total rating observations.
• There are 228 observations with normal ratings.

Decision: The goal of this project is to identify and eliminate beers with extremely low ratings, so a model trained on these
extreme ratings might perform better. I will build models on both the complete dataset and the dataset restricted to ratings
within 2*std of the mean, then compare predictions. That is why I will select the rows within the 2*std range only after the new
columns are added.
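A minimal pandas sketch of the 2*std rule, on hypothetical ratings:

```python
import pandas as pd

# Hypothetical ratings; the real column has 240 observations.
ratings = pd.Series([-1, 0, 1, 1, 2, 3, 3, 5, 6, 7, 7, 7, 8, 8, 9, 11])

mean, std = ratings.mean(), ratings.std()
lower, upper = mean - 2 * std, mean + 2 * std   # range of normal ratings

# Ratings outside the 2*std range are outliers.
outliers = ratings[(ratings < lower) | (ratings > upper)]
normal = ratings[ratings.between(lower, upper)]
```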

3.3 Construct Data


3.3.1 Derived Attributes
Create new columns to store info about filtration and pasteurization status. Name them Filtration and Pasteurization, and
convert them to categorical type.

Action 1: Define a list of words in various languages that translate as unfiltered. If at least one word from the list is found
in the Name column, label the beer as unfiltered; otherwise label it as filtered.

Unfiltered keywords: “unfiltered”, “kellerbier”, “natur”, “naturtrubes”, “nefiltrata”, “nonfiltrata”.

Action 2: Define a list of words in various languages that translate as unpasteurized. If at least one word from the list is found
in the Name column, label the beer as unpasteurized; otherwise label it as pasteurized.

Unpasteurized keywords: “unpasteurized”, “kellerbier”, “natur”, “naturtrubes”, “nepasteurizata”, “nonpastorizzata”.

Results: Two new columns – Filtration and Pasteurization, with entries derived from the Name column.
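A sketch of the keyword labeling, assuming case-insensitive substring matching on the Name column; the sample names are hypothetical:

```python
import pandas as pd

# Keyword lists from the report.
UNFILTERED = ["unfiltered", "kellerbier", "natur", "naturtrubes",
              "nefiltrata", "nonfiltrata"]
UNPASTEURIZED = ["unpasteurized", "kellerbier", "natur", "naturtrubes",
                 "nepasteurizata", "nonpastorizzata"]

def label_from_name(names, keywords, positive, negative):
    """Label each name by whether it contains any of the keywords."""
    pattern = "|".join(keywords)
    hit = names.str.contains(pattern, case=False, regex=True)
    return hit.map({True: positive, False: negative}).astype("category")

beer = pd.DataFrame({"Name": ["Helles Kellerbier 5.2", "Pilsner 4.8",
                              "Bere Nefiltrata 5.0"]})
beer["Filtration"] = label_from_name(beer["Name"], UNFILTERED,
                                     "unfiltered", "filtered")
beer["Pasteurization"] = label_from_name(beer["Name"], UNPASTEURIZED,
                                         "unpasteurized", "pasteurized")
```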

3.3.2 Generated records


Check basic statistics for newly created attributes.

Create a column to bin alcohol content as categorical data and capture non-linearities in visualization and modeling. Name
it “Perception”. Decide the edges and labels according to the perceived features of the subgroups:

• drive – beer that can be consumed before driving, alcohol content between 0 and 0.5
• refresh – beer that quenches thirst, a mix of beer and lemonade or fruit juice, with alcohol content between 0.6
and 2.8
• weak – beer that is suitable neither for driving nor for quenching thirst; its taste is slightly diluted and its alcohol
content is between 2.9 and 4.4
• tasty – beer with full taste and alcohol content between 4.5 and 5.5
• too strong – beer with strong taste and alcohol content 5.6 or higher
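The binning can be sketched with pd.cut; the exact edge handling below (upper edge inclusive) is an assumption consistent with one-decimal ABV values:

```python
import numpy as np
import pandas as pd

# Upper edges 0.5, 2.8, 4.4 and 5.5 fall in the lower bin,
# matching the label ranges described above.
edges = [-np.inf, 0.5, 2.8, 4.4, 5.5, np.inf]
labels = ["drive", "refresh", "weak", "tasty", "too_strong"]

abv = pd.Series([0.0, 0.5, 2.5, 3.0, 5.0, 7.2])  # hypothetical values
perception = pd.cut(abv, bins=edges, labels=labels)
```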

3.4 Integrate Data


3.4.1 Select observations with ratings less than 2*std away from the mean
Action: Use limits calculated at 3.2, the lower limit of normal ratings is 3.33 and the upper limit is 10.48. Store result in a
dataframe called “beer_2std” and reset index.

Result: Dataframe without outliers has 228 rows.

3.4.2 Split dataset by alcohol content into three groups


Action: Split both datasets, the complete one and the 2*std one, by alcohol content into three groups: alcohol-
free beer (0-0.4 ABV), light beer (0.5-3 ABV), and regular beer (ABV higher than 3), as decided in 3.1.3.

Results: In the complete ratings range there are 41 alcohol-free, 30 light and 169 regular beers.
In the ratings within 2*std there are 38 alcohol-free, 30 light and 160 regular beers.
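A minimal sketch of the split, on hypothetical rows:

```python
import pandas as pd

beer = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "ABV": [0.0, 0.4, 2.5, 5.0],
})  # hypothetical rows

alc_free = beer[beer["ABV"] <= 0.4].reset_index(drop=True)   # 0-0.4 ABV
light = beer[beer["ABV"].between(0.5, 3)].reset_index(drop=True)  # 0.5-3 ABV
regular = beer[beer["ABV"] > 3].reset_index(drop=True)       # above 3 ABV
```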

3.4.3 Check distribution of ratings in subsets


Action: Plot the distribution of ratings in each subset to check whether their curves are Gaussian.
Result: KDE plots of the six subsets show that ratings keep their Gaussian curve even when split on the alcohol content criterion.

Figure 11. Alcohol-free beer ratings

Figure 12. Light beer ratings

Figure 13. Regular beer ratings

3.4.4 Check feature variation across subgroups when split into the three subsets based on alcohol content
Subsets split from beer dataset containing all observations:

Beer_all dataset. Observable variation in subgroups of Style, Flavor and Perception.

Figure 14. Distribution of beer_all

Alcohol_free_all dataset. Observable variation in subgroups of Style, Flavor, Fermentation, Filtration and Pasteurization.

Figure 15. Distribution of alc_free_all

Light_all dataset. Observable variation in subgroups of Style, Flavor, Fermentation, Filtration, Pasteurization and
Perception.

Figure 16. Distribution of light_all

Regular_all dataset. Observable variation in subgroups of Style, Flavor, Pasteurization and Perception.

Figure 17. Distribution of regular_all

Subsets split from the beer_2std dataset, which contains only observations within the normal range, without rating outliers:

Beer_2std dataset. Observable variation in subgroups of Style, Flavor and Perception.

Figure 18. Distribution of beer_2std

Alcohol_free_2std dataset. Observable variation in subgroups of Style, Flavor, Fermentation, Filtration and Pasteurization.

Figure 19. Distribution of alc_free_2std

Light_2std dataset. Observable variation in subgroups of Style, Flavor, Fermentation, Filtration, Pasteurization and
Perception.

Figure 20. Distribution of light_2std

Regular_2std dataset. Observable variation in subgroups of Style, Flavor, Pasteurization and Perception.

Figure 21. Distribution of regular_2std

3.5 Format Data – Reformatted Data


Sort the dataframe on rating, then on alcohol content. Other formatting was done as needed.

3.6 Dataset Output


Export the complete dataset and its three subsets:

• beer_all, location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/beer_output/clean_files/beer_all.csv
• alc_free_all, location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/beer_output/clean_files/alc_free_all.csv
• light_all, location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/beer_output/clean_files/light_all.csv
• regular_all, location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/beer_output/clean_files/regular_all.csv

Export the dataset without outliers and its three subsets:

• beer_2std, location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/beer_output/clean_files/beer_2std.csv
• alc_free_2std, location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/beer_output/clean_files/alc_free_2std.csv
• light_2std, location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/beer_output/clean_files/light_2std.csv
• regular_2std, location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/beer_output/clean_files/regular_2std.csv

3.7 Dataset Description Report
In all datasets, each row is a unique beer with corresponding observations in all columns.

3.7.1 Description of features (columns)


• Name: unique beer names and their alcohol content as the last part of the text. String type.
• Rating: beer ratings between -1 and 11 in the complete dataset. Numeric variable. Integer type.
• Style: brewing style. Categorical type: ale, alt, blonde, dark, keller, lager, mix, pils, soda, wheat. Partial overlap of
keller with unfiltered subgroup in Filtration and with unpasteurized subgroup in Pasteurization. Partial overlap of
Soda with subgroups of Flavor and with drive subgroup in Perception.
• Flavor: additional flavors. Categorical type: fruit, herb, lemon, standard. Partial overlap of fruit and lemon with
drive and refresh subgroups in Perception.
• Fermentation: where fermentation takes place in the vessel. Categorical type: top, bottom. Top overlaps with the ale, alt,
blonde, soda and wheat subgroups in Style. Bottom overlaps with the dark, keller, lager, mix and pils subgroups in
Style.
• Split: which observations will be used for training, validation and test when building models in H2O. Categorical type:
train, valid, test.
• ABV: alcohol content between 0 and 10. Numeric variable. Float type.
• Occurrence: whether a subgroup’s occurrence is above or below the 5% threshold. Categorical type: too few, enough. Too few overlaps with
alt and mix in Style, and with herb in Flavor.
• Filtration: the process of removing sediments in the brewing process. Categorical type: filtered, unfiltered. Partial
overlap of unfiltered with keller subgroup in Style.
• Pasteurization: conservation of beer by heat treatment. Categorical type: pasteurized, unpasteurized. Partial
overlap of unpasteurized with keller subgroup in Style.
• Perception: perceived features of alcohol content subgroups. Categorical type: drive, refresh, weak, tasty, too
strong. Partial overlap of drive and refresh with fruit and lemon subgroups in Flavor.
• Method: brewing method. Categorical type: craft, industrial.

3.7.2 Description of observations (rows)


Use function created in Describe Data (section 2.2) to print description of all categorical variables.

For brevity, I will copy here only the descriptions of the categorical features selected for training each dataset. The ABV feature is
numeric and its distribution was shown in section 3.1.2.

Dataset 1: beer_all dataset


• All beers (alcohol-free + light + regular) from the original dataset, including rating outliers
• Rows: 240
• Number of ratings lower than 5: 23
• Columns: 12
• Features for training model: Style, Flavor, Perception, ABV
• Dimensions: 4
• Maximum combinations possible: 10*4*5 = 200 without counting combinations with numerical ABV

style count mean std min 25% 50% 75% max


ale 20 6.6 2.3 1 5.8 7 8 10
alt 7 6.7 1 5 6.5 7 7 8
blonde 38 7.1 1.5 2 6 7 8 10
dark 17 7.6 1.7 4 6 8 9 11
keller 12 6.9 1.4 5 6 7 8 10

lager 38 6.9 1.7 3 6 7 8 9
mix 4 4 3.3 0 3 4 5 8
pils 53 6.8 1.4 3 6 7 8 10
soda 13 7 3 -1 6 8 9 10
wheat 38 6.9 1.7 1 6 7 8 9

flavor count mean std min 25% 50% 75% max


fruit 24 7.1 2.1 0 6.8 7.5 8 10
herb 8 5.6 2.9 -1 5.8 6 7.2 8
lemon 28 7.9 0.9 6 7 8 8 10
standard 180 6.8 1.7 1 6 7 8 11

perception count mean std min 25% 50% 75% max


drive 41 7 1.6 3 6 8 8 9
refresh 26 7.5 0.9 6 7 7 8 10
weak 13 6.2 2.8 -1 5 7 8 10
tasty 121 6.9 1.7 0 6 7 8 10
too_strong 39 6.7 2.2 1 6 7 8 11

Dataset 2: alcohol_free_all dataset


Because the dataset is small, I will choose only the predictors with the highest variation.

• Alcohol-free beers from the original dataset, including rating outliers


• Rows: 41
• Number of ratings lower than 5: 3
• Columns: 12
• Features: Style, Flavor
• Dimensions: 2
• Maximum combinations possible: 6*4 = 24

style count mean std min 25% 50% 75% max


ale 0 NaN NaN NaN NaN NaN NaN NaN
alt 0 NaN NaN NaN NaN NaN NaN NaN
blonde 1 8 NaN 8 8 8 8 8
dark 0 NaN NaN NaN NaN NaN NaN NaN
keller 3 6 1.7 5 5 5 6.5 8
lager 5 6 2.8 3 3 7 8 9
mix 0 NaN NaN NaN NaN NaN NaN NaN
pils 12 7 1.1 5 6.8 7 8 8
soda 11 7.5 1.9 3 6.5 8 9 9
wheat 9 7.4 1 6 7 8 8 9

flavor count mean std min 25% 50% 75% max


fruit 12 7.6 1.7 3 7 8 9 9
herb 3 7.3 1.2 6 7 8 8 8
lemon 10 8 0.7 7 8 8 8 9

standard 16 6 1.6 3 5 6 7 8

Dataset 3: light_all dataset


Because the dataset is small, I will choose only the predictors with the highest variation.

• Light beers from the original dataset, including rating outliers


• Rows: 30
• Number of ratings lower than 5: 1
• Columns: 12
• Features: Style, Flavor
• Dimensions: 2
• Maximum combinations possible: 6*3 = 18

style count mean std min 25% 50% 75% max


ale 0 NaN NaN NaN NaN NaN NaN NaN
alt 1 7 NaN 7 7 7 7 7
blonde 0 NaN NaN NaN NaN NaN NaN NaN
dark 0 NaN NaN NaN NaN NaN NaN NaN
keller 3 8.7 1.2 8 8 8 9 10
lager 5 7.8 0.8 7 7 8 8 9
mix 2 6 2.8 4 5 6 7 8
pils 11 7.3 0.5 7 7 7 7.5 8
soda 0 NaN NaN NaN NaN NaN NaN NaN
wheat 8 6.9 1.1 6 6 6.5 7.2 9

flavor count mean std min 25% 50% 75% max


fruit 8 6.9 0.8 6 6 7 7.2 8
herb 0 NaN NaN NaN NaN NaN NaN NaN
lemon 16 7.8 0.9 7 7 8 8 10
standard 6 6.5 1.4 4 6.2 7 7 8

Dataset 4: regular_all dataset


• Regular beers from the original dataset, including rating outliers
• Rows: 169
• Number of ratings lower than 5: 19
• Columns: 12
• Features: Style, Flavor, Pasteurization, ABV
• Dimensions: 4
• Maximum combinations possible: 10*4*2 = 80 without counting combinations with numerical ABV

style count mean std min 25% 50% 75% max


ale 20 6.6 2.3 1 5.8 7 8 10
alt 6 6.7 1 5 6.2 7 7 8
blonde 37 7.1 1.5 2 6 7 8 10
dark 17 7.6 1.7 4 6 8 9 11
keller 6 6.5 0.5 6 6 6.5 7 7

29
lager 28 7 1.6 4 6 7 8 9
mix 2 2 2.8 0 1 2 3 4
pils 30 6.6 1.7 3 6 7 8 10
soda 2 4.5 7.8 -1 1.8 4.5 7.2 10
wheat 21 6.6 2 1 6 7 8 9

flavor count mean std min 25% 50% 75% max


fruit 4 6.2 4.3 0 5.2 7.5 8.5 10
herb 5 4.6 3.2 -1 5 6 6 7
lemon 2 7.5 2.1 6 6.8 7.5 8.2 9
standard 158 6.9 1.8 1 6 7 8 11

pasteurization count mean std min 25% 50% 75% max


pasteurized 144 6.7 2 -1 6 7 8 10
unpasteurized 25 7.2 1.5 3 7 7 8 11

Dataset 5: beer_2std dataset


• Beers (alcohol-free_2std + light_2std + regular_2std) without rating outliers
• Rows: 228
• Number of ratings lower than 5: 12
• Columns: 12
• Features: Style, Flavor, Perception, ABV
• Dimensions: 4
• Maximum combinations possible: 10*4*5 = 200 without counting combinations with numerical ABV

style count mean std min 25% 50% 75% max


ale 18 7.2 1.7 4 7 7 8 10
alt 7 6.7 1 5 6.5 7 7 8
blonde 37 7.3 1.3 4 6 7 8 10
dark 16 7.4 1.5 4 6 8 8.2 10
keller 12 6.9 1.4 5 6 7 8 10
lager 36 7.2 1.5 4 6 7.5 8 9
mix 3 5.3 2.3 4 4 4 6 8
pils 52 6.9 1.3 4 6 7 8 10
soda 11 8.1 1.3 6 7.5 8 9 10
wheat 36 7.1 1.2 4 6 7 8 9

flavor count mean std min 25% 50% 75% max


fruit 22 7.6 1.1 6 7 8 8 10
herb 7 6.6 1.1 5 6 6 7.5 8
lemon 28 7.9 0.9 6 7 8 8 10
standard 171 7 1.4 4 6 7 8 10

perception count mean std min 25% 50% 75% max


drive 38 7.4 1.2 5 7 8 8 9
refresh 26 7.5 0.9 6 7 7 8 10
weak 12 6.8 1.9 4 5.8 7 8 10
tasty 117 7 1.4 4 6 7 8 10
too_strong 35 7 1.5 4 6 7 8 10

Dataset 6: alcohol-free_2std dataset


Because the dataset is small, I will choose only the predictors with the highest variation.

• Alcohol-free beers without rating outliers

• Rows: 38
• Number of ratings lower than 5: 0
• Columns: 12
• Features: Style, Flavor
• Dimensions: 2
• Maximum combinations possible: 6*4 = 24

style count mean std min 25% 50% 75% max


ale 0 NaN NaN NaN NaN NaN NaN NaN
alt 0 NaN NaN NaN NaN NaN NaN NaN
blonde 1 8 NaN 8 8 8 8 8
dark 0 NaN NaN NaN NaN NaN NaN NaN
keller 3 6 1.7 5 5 5 6.5 8
lager 3 8 1 7 7.5 8 8.5 9
mix 0 NaN NaN NaN NaN NaN NaN NaN
pils 12 7 1.1 5 6.8 7 8 8
soda 10 7.9 1.2 6 7.2 8 9 9
wheat 9 7.4 1 6 7 8 8 9

flavor count mean std min 25% 50% 75% max


fruit 11 8 1 6 7.5 8 9 9
herb 3 7.3 1.2 6 7 8 8 8
lemon 10 8 0.7 7 8 8 8 9
standard 14 6.4 1.2 5 5.2 6.5 7 8

Dataset 7: light_2std dataset


Because the dataset is small, I will choose only the predictors with the highest variation. The dataset is identical to light_all because there
were no outliers among the light beers.

• Light beers without rating outliers


• Rows: 30
• Number of ratings lower than 5: 1
• Columns: 12
• Features: Style, Flavor
• Dimensions: 2
• Maximum combinations possible: 6*3 = 18

style count mean std min 25% 50% 75% max


ale 0 NaN NaN NaN NaN NaN NaN NaN
alt 1 7 NaN 7 7 7 7 7
blonde 0 NaN NaN NaN NaN NaN NaN NaN
dark 0 NaN NaN NaN NaN NaN NaN NaN
keller 3 8.7 1.2 8 8 8 9 10
lager 5 7.8 0.8 7 7 8 8 9
mix 2 6 2.8 4 5 6 7 8
pils 11 7.3 0.5 7 7 7 7.5 8
soda 0 NaN NaN NaN NaN NaN NaN NaN
wheat 8 6.9 1.1 6 6 6.5 7.2 9

flavor count mean std min 25% 50% 75% max


fruit 8 6.9 0.8 6 6 7 7.2 8
herb 0 NaN NaN NaN NaN NaN NaN NaN
lemon 16 7.8 0.9 7 7 8 8 10
standard 6 6.5 1.4 4 6.2 7 7 8

Dataset 8: regular_2std dataset


• Regular beers without rating outliers
• Rows: 160
• Number of ratings lower than 5: 11
• Columns: 12
• Features: Style, Flavor, Pasteurization, ABV
• Dimensions: 4
• Maximum combinations possible: 10*4*2 = 80 without counting combinations with numerical ABV

style count mean std min 25% 50% 75% max


ale 18 7.2 1.7 4 7 7 8 10
alt 6 6.7 1 5 6.2 7 7 8
blonde 36 7.2 1.3 4 6 7 8 10
dark 16 7.4 1.5 4 6 8 8.2 10
keller 6 6.5 0.5 6 6 6.5 7 7
lager 28 7 1.6 4 6 7 8 9
mix 1 4 NaN 4 4 4 4 4
pils 29 6.7 1.5 4 6 7 8 10
soda 1 10 NaN 10 10 10 10 10
wheat 19 7.1 1.4 4 6 7 8 9

flavor count mean std min 25% 50% 75% max


fruit 3 8.3 1.5 7 7.5 8 9 10
herb 4 6 0.8 5 5.8 6 6.2 7
lemon 2 7.5 2.1 6 6.8 7.5 8.2 9
standard 151 7 1.5 4 6 7 8 10

pasteurization count mean std min 25% 50% 75% max


pasteurized 137 7 1.5 4 6 7 8 10
unpasteurized 23 7.2 1 6 7 7 8 9

Conclusion: The number of possible combinations in each model is almost as large as the number of observations in each dataset.

4. Modeling. Machine Learning with H2O

Tool chosen: H2O – open-source machine learning software.

• Pros: handles high-dimensional datasets with no need for categorical data encoding; generates numeric data
output from categorical data input.
• Cons: small community of users who can help with technical questions.

Other tools considered, but not chosen: Scikit-Learn – open-source machine learning library.

• Pros: versatile library; large community of users who can help answer questions.
• Cons: needs encoding of categorical data; techniques generate the same kind of output as the provided input.

H2O steps:

• Initialize H2O
• Import data
• Model
• Export output.

Select the .csv files from the folder that stores the clean dataframes. Import them as a dictionary of dictionaries. Each key is the name of a
dataset; the values are the frame, its list of predictors, and the response column. The list of predictors for each frame is the one
decided in section 3.7.2 Description of observations (rows).

Note: I tried to import from Excel and it didn’t work. If you run into the same situation, convert the Excel file to CSV, then import
the CSV.

4.1 Initialize H2O


The parameter nthreads=-1 means use all CPUs on the host.

4.2 Select .csv Files


Select clean files stored in location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-
Concept/tree/main/beer_output/clean_files

4.3 Import Datasets As Dictionaries Into H2O


Write a function to import multiple files at once.
Define the list of predictors for each dataset depending on its name.

4.4 Select Modeling Techniques
4.4.1 Modeling Techniques
Distributed Random Forest (DRF). Ensemble method that builds trees in parallel, each of them trained on a random subset of the data.
Each tree produces its own prediction, and the predictions of all trees are averaged to compute the final
result, which reduces over-fitting.

Gradient Boosting Machine (GBM). Forward-learning ensemble method that, unlike Distributed
Random Forest (DRF), trains each tree on the full dataset. Trees are built sequentially, one after another, so that each
tree learns from the errors of the previous trees and the model gets refined. The output is the combined prediction of all the
trees built.

4.4.2 Modeling Assumptions


Distributed Random Forest assumes data has no formal distribution. As shown in Data Exploration Report (section 2.3),
there is no linear relationship and no observable pattern between ratings and alcohol content, even if data is separated
into subgroups of each feature.

Gradient Boosting Machine assumes independence of observations.

4.5 Generate Test Design


The model will learn from the training set and will be assessed on the test set.

In an ideal dataset, decision about how to split data between training and testing would be automated, random and
representative.

Since I don’t have an ideal dataset and the number of observations is rather small and prone to over-fitting, I decided
to split the data manually. For this, I created a column in the original dataset and named it Split. I sorted
observations by name and then by rating, and at each rating value I made sure to record “train” in 70% of the rows, “valid” in 15%,
and “test” in 15%. Beers of the same brand were thus split into different groups. I made sure other features are
split too; e.g. for a rating of 5, various styles and flavors appear in each split. The result is that the splits are diverse no matter
how few observations each of them contains.

Note: Usually, alcohol-free beer is labeled as such. When sorting beers by name, most alcohol-free ones came first in the series
of a particular brand, and the light ones came after them. When splitting a brand into as many different groups as possible, I
made sure the first beer was marked for “train”, the second for “valid” and, if a third one existed, for “test”. This led
the alcohol-free subgroup to have more observations in the “train” group compared to regular beer, and it is also the reason
why the “test” subgroup has more regular beer. Why did I do that? Because alcohol-free beer is under-represented compared
to regular beer in the total dataset, and I wanted to make sure there are as many observations as possible for training in
this small subset.

4.6 Build Model


Decide on parameter settings and explain choice of setting.

Build function to avoid repetitive code for each dataset.

Select rows for training, validation and test sets to make results reproducible.

4.6.1 Parameter Settings
Distributed Random Forest
DRF instantiate model.

Parameters in alphabetical order for ease of search:

build_tree_one_node = True, because datasets are small

calibrate_model = False (default)

categorical_encoding = “Enum” which means 1 column per categorical feature

check_constant_response = True (default)

col_sample_rate_change_per_level = 1 (default)

col_sample_rate_per_tree = 1 (default)

fold_assignment = “random”

histogram_type = AUTO (default), bins from min to max in steps of (max-min)/N levels

ignore_const_cols = True (default)

keep_cross_validation_fold_assignment = False (default)

keep_cross_validation_models = True (default)

keep_cross_validation_predictions = False (default)

max_depth = 20 (default)

max_runtime_secs = 0 (default), and it means it’s unlimited

min_rows = 1 (default)

min_split_improvement = 1e-05 (default)

mtries = the length of the list of predictors for each dataset

nbins = 13, because the rating scale is an array of integers starting at -1 and ending at 11

nbins_top_level = 16, because it must be a power of 2, higher than nbins

nfolds = 4, because this is the maximum allowed relative to the total number of rows

ntrees = 50 (default)

sample_rate = 0.6320000291 (default), applied to rows (x-axis)

seed = 12

stopping_rounds = 0 (default) and it means it’s disabled

verbose = False (default)

DRF training

x = features (columns) decided as relevant for each dataset variant in section 3.7.2 Description of observations (rows)

y = Column Rating

training_frame = rows marked with “train” in column Split in section 4.5 Generate Test Design

validation_frame = rows marked with “valid” in column Split in section 4.5 Generate Test Design

model_id = name of dataset + ”_drf_model”

DRF testing

test = rows marked with “test” in column Split in section 4.5 Generate Test Design

DRF export model

get_genmodel_jar = False, because I don’t intend to import the model in Java.

Gradient Boosting Machine


GBM instantiate model.

Parameters in alphabetical order for ease of search:

balance_classes = False (default)

build_tree_one_node = True, because datasets are small

calibrate_model = False (default)

categorical_encoding = “Enum” which means 1 column per categorical feature

check_constant_response = True (default)

col_sample_rate = 1 (default)

col_sample_rate_change_per_level = 1 (default)

col_sample_rate_per_tree = 1 (default)

distribution = “gaussian”

fold_assignment = “random”

histogram_type = AUTO (default), bins from min to max in steps of (max-min)/N levels

ignore_const_cols = True (default)

keep_cross_validation_fold_assignment = False (default)

keep_cross_validation_models = True (default)

keep_cross_validation_predictions = False (default)

learn_rate = 0.1 (default)

learn_rate_annealing = 1 (default)

max_abs_leafnode_pred = 1.797693135e+308 (default)

max_depth = 5 (default)

max_runtime_secs = 0 (default), and it means it’s unlimited

min_rows = 1, because the light beer dataset is too small for the default min_rows of 10

min_split_improvement = 1e-05 (default)

nbins = 13, because the rating scale is an array of integers starting at -1 and ending at 11

nbins_top_level = 16, because it must be a power of 2, higher than nbins

nfolds = 4, because this is the maximum allowed relative to the total number of rows

ntrees = 50 (default)

pred_noise_bandwidth = 0 (default), and it means it’s disabled

sample_rate = 1 (default), applied to rows (x-axis)

score_tree_interval = 0 (default), and it means it’s disabled

seed = 12

stopping_rounds = 0 (default) and it means it’s disabled. If cross-validation is enabled, training stops anyway when there’s
no improvement.

stopping_tolerance = 0.001 (default).

verbose = False (default)

GBM training

x = features (columns) decided as relevant for each dataset variant in section 3.7.2 Description of observations (rows)

y = Column Rating

training_frame = rows marked with “train” in column Split in section 4.5 Generate Test Design

validation_frame = rows marked with “valid” in column Split in section 4.5 Generate Test Design

model_id = name of dataset + ”_gbm_model”

GBM testing

test = rows marked with “test” in column Split in section 4.5 Generate Test Design

GBM export model

get_genmodel_jar = False, because I don’t intend to import the model in Java.

4.6.2 Define Models
Write function to generate DRF and GBM models.

• To help with readability down the road, store elements of dictionary under short variable names.
• Split rows into training, validation and test sets. This makes reproducibility possible.
• Instantiate model with custom parameters defined in section 4.6.1 Parameter Settings.
• Train model on the list of predictors (x) chosen for each dataset in section 3.7.2 Description of observations (rows).
Specify the response column (y) which is Rating for all datasets. Training frame: all rows in “train” Split. Validation
frame: all rows in “valid” Split.
• Print model to show variable importance.
• Generate prediction. It gets stored in a H2O frame with one column named “predict”.
• Calculate model performance on test set, store it in a dictionary and extract only MSE values from it.
• Print model name, its predictors and MSE value.
• Add the prediction to the original H2O frame, convert it to a dataframe and export it as a .csv file to location:
https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/beer_output/predictions
• Export models to location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-
Concept/tree/main/beer_output/models
• Export variable importance to the same folder as models, location: https://github.com/alina-molnar/Beer-
Recommendation-Project-Proof-of-Concept/tree/main/beer_output/models
• Export dictionary of dataframes and their respective MSE values to location: https://github.com/alina-
molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/beer_output/metrics/mse/

4.6.3 Call Functions to Build Models


Define path to export output from models and then run functions.

4.7 Assess Models


Create a function that takes a file path and a pattern as input and outputs a dictionary containing the dataframes and their names. This
allows access by variable name.

Import variable importance files and print them.

The most important predictors are:

• In alc_free_all_drf, Flavor: 76%


• In light_all_drf, ABV: 73%
• In regular_all_drf, Style: 51%
• In beer_all_drf, Style: 49%

Built-in metrics of DRF and GBM in H2O are:

• MSE. Mean squared error. Square each error, then calculate their mean. Scale dependent.
• RMSE. Root mean square error. Calculate the squared root of MSE. Scale dependent.

• RMSLE. Root mean squared logarithmic error. Not scale dependent. Useful when an under-prediction is worse than
an over-prediction, but that is not the case here; quite the contrary.
• MAE. Mean absolute error. Take the absolute value of each error, then calculate their mean. Scale dependent.
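The scale-dependent metrics can be sketched in numpy; the ratings and predictions below are hypothetical:

```python
import numpy as np

def regression_metrics(actual, predicted):
    """MSE, RMSE and MAE on a test set (all scale dependent)."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    mse = np.mean(errors ** 2)          # mean of squared errors
    return {"mse": mse,
            "rmse": np.sqrt(mse),       # square root of MSE
            "mae": np.mean(np.abs(errors))}  # mean of absolute errors

# Hypothetical ratings and model predictions.
metrics = regression_metrics([7, 8, 5, 9], [6.5, 8.0, 6.0, 8.5])
```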

MSE of models on test sets.

MSE was extracted by the model-building function because that is the only point where MSE values are computed on the test sets.
The following statements are true for both DRF and GBM models:

• Light and Light_2std sets have the lowest MSE.


• Alcohol-free sets have slightly higher MSE than the light beer sets. Alcohol-free_2std had no observations with
ratings lower than 5 in the test set because they were all marked as outliers and removed in the cleaning process.
That is why MSE was not computed for Alcohol-free_2std.
• Regular and unsplit sets with outliers have the highest MSE. This was to be expected because Regular is the largest
subgroup of the unsplit Beer set. Regular_2std and Beer_2std have MSE values slightly lower than Regular and Beer.

Code for the graph visualization of the above statements is in section 5.1.1, where MSE and recall were plotted on the same
figure.

5. Evaluation
5.1 Evaluate Results
5.1.1 Assessment of Data Mining Results w.r.t. Business Success Criteria
The business objective is to spend less money for the same quantity of consumed beer. To achieve this, a model has to
have a recall score as close to 1 as possible.
A recall score of 1 means that all beer with rating lower than 5 was identified by the model, and a score of 0 means that
none of the beer with rating lower than 5 was identified.
The recall score shows the decrease in low rated beer acquisition. For example, a score of 0.6 means that the model
correctly identifies 60% of the low rated beer. In this case, the number of low rated beers bought would be reduced by
60% compared to random purchases.

Recall = True Positives / (True Positives + False Negatives)


or, in layman’s terms:
Recall = Detected Positives / All Real Positives

Business meaning of the outcome:

• True positives. Rating < 5 & Prediction < 5. Beer with real life rating lower than 5 predicted correctly. This means
beer rightfully not recommended by model and money saved.
• False negatives. Rating < 5 & Prediction >= 5. Beer with real life rating lower than 5 predicted incorrectly. This
means beer recommended by model would be bought and discarded.

• True negatives. Rating >= 5 & Prediction >= 5. Beer with real life rating 5 or higher predicted correctly. This means
beer recommended by model would be bought and consumed.
• False positives. Rating >= 5 & Prediction < 5. Beer with real life rating 5 or higher predicted incorrectly. This means
beer not recommended by model, but that would have been consumed in real life.

Steps for evaluating model real-life performance:

• Import predictions.
• Select false negatives.
• Calculate recall score.
• Merge MSE and recall in a single dataframe.
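The recall calculation in the steps above can be sketched as a small function. The predictions frame is hypothetical; the None return covers subsets with no real positives in the test set, like Alcohol-free_2std:

```python
import pandas as pd

def recall_low_rated(df, threshold=5):
    """Recall on the low-rated class: detected positives / all real positives."""
    positives = df["Rating"] < threshold                    # beers that should be rejected
    true_pos = (positives & (df["predict"] < threshold)).sum()
    false_neg = (positives & (df["predict"] >= threshold)).sum()
    if true_pos + false_neg == 0:
        return None                                         # no positives: recall undefined
    return true_pos / (true_pos + false_neg)

# Hypothetical predictions frame: 3 real positives, 2 of them detected.
preds = pd.DataFrame({"Rating": [3, 4, 4, 7, 8],
                      "predict": [3.2, 4.8, 6.1, 7.5, 7.9]})
recall = recall_low_rated(preds)
```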

5.1.1.1 Import predictions.


Import predictions dataframes generated by models, location: https://github.com/alina-molnar/Beer-Recommendation-
Project-Proof-of-Concept/tree/main/beer_output/predictions

5.1.1.2 Select false negatives.


Create function that inputs predictions dictionary and outputs false negatives dictionary. Export false negatives to
location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-
Concept/tree/main/beer_output/false_negatives

5.1.1.3 Calculate recall score.


Create function that inputs predictions dictionary and outputs dataframe of recall score of models. Export dataframe to
recall subfolder in metrics, location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-
Concept/tree/main/beer_output/metrics/recall

5.1.1.4 Merge MSE and recall in a single dataframe.


Create columns that show the model type and the dataset range, so that the graph can display separate groups. Export the result to location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/beer_output/metrics
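A sketch of the merge, assuming model names encode group, dataset range and algorithm as `<group>_<range>_<algo>` (the metric values below are made up; the real ones live in the metrics folder):

```python
import pandas as pd

recall = pd.DataFrame({"model": ["beer_all_drf", "beer_2std_gbm"],
                       "recall": [0.43, 0.16]})
mse = pd.DataFrame({"model": ["beer_all_drf", "beer_2std_gbm"],
                    "mse": [1.9, 1.2]})

# Merge both metrics on the model name, then derive grouping columns.
metrics = recall.merge(mse, on="model")
parts = metrics["model"].str.split("_")
metrics["model_type"] = parts.str[-1]      # e.g. "drf" / "gbm"
metrics["dataset_range"] = parts.str[-2]   # e.g. "all" / "2std"
```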

5.1.1.5 Plot graph of recall score and MSE of each model.


Recall score and MSE don’t have a linear relationship. MSE is lowest for the models with the highest recall score, which is to be expected. However, as the recall score decreases, MSE does not show a mirrored pattern and does not increase accordingly. Since recall is the most useful metric for assessing model performance against the business success criteria, I will use only the recall score when deciding which models are approved.

This graph shows that a lower MSE doesn’t necessarily mean a better model.

Figure 22. MSE and recall score of models

5.1.1.6 Plot graph of recall scores split by model type and dataset range.
The graph shows that DRF and GBM have similar results for the same dataset, except for the full-range beer dataset, where DRF performed better.

Decision: Use DRF models for predictions on the unseen dataset.

Figure 23. Recall of models split by type

The graph shows that the full-range datasets generated better models than the datasets with ratings more than 2 standard deviations from the mean removed.
This was to be expected, because the goal of the project is to predict outliers, not the bulk of average ratings.
Decision: Keep outliers when testing the model on the unseen beer dataset.

Figure 24. Recall of models split by dataset range

5.1.2 Approved Models


Action: Print recall scores of all models.
Results:
Beer unsplit by alcohol content: In the beer_all dataset, the DRF score is 0.43 and the GBM score is 0.39. In the beer_2std dataset, the DRF score is 0.25 and the GBM score is 0.16. This means that DRF has a slightly better score than GBM in both datasets. Models performed better on the dataset that didn’t have its outliers removed.
Alcohol-free beer: Both DRF and GBM score 0.66 in the alc_free_all dataset. However, the result may not be relevant because there were only 3 ratings lower than 5. In the alc_free_2std dataset recall was not calculated because there were no ratings lower than 5.
Light beer: Both DRF and GBM have a perfect score of 1 in the light_all and light_2std datasets. However, the result may not be relevant because there was only one rating lower than 5. Also, the light_all and light_2std datasets are identical because there were no outliers (no ratings lower than 3.33 or higher than 10.48).
Regular beer: Both DRF and GBM score 0.42 in the regular_all dataset. In the regular_2std dataset the score is 0.18 for both models. Just like the unsplit beer dataset, the regular dataset produced better models when outliers were not removed.

Decisions:
1. Keep outliers in unseen dataset.
2. Split unseen beer dataset according to alcohol content in alcohol-free, light and regular beer.
3. Use DRF models for predictions on the unseen dataset.

Approved models:
• alc_free_all_drf for unseen alcohol-free beer
• light_all_drf for unseen light beer
• regular_all_drf for unseen regular beer

5.2 Review Process


Review of Process
Review was done at every step on an iterative basis, thus generating the results in section 5.1. There was no need for additional testing or assessment, given the specifics of this project and its phase in the software development lifecycle (proof of concept). In short, after each code execution, the model outputs were reviewed.

Having too few observations in a subgroup didn’t necessarily lead to wrong predictions for that subgroup. For example, see the graph of false negatives split into enough vs. too few observations.

Figure 25. Occurrence of observations in false negatives

5.3 Determine Next Steps


I will proceed with deployment even though the number of ratings below the threshold is low in the alcohol-free and light datasets. Results can be improved gradually by adding new records to the input file. As the input data grows, the models will be better trained.

6. Deployment
6.1 Plan Deployment
Deployment Plan
The plan below applies to the unseen dataset, which contains 81 beers:

6.1.1 Import and clean unseen file following the same steps as with seen data
Import the dataset from location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/unseen_data/unseen_input/beer_unseen_raw.csv

Standardize appearance, convert columns, create columns, check for duplicates.
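Those cleaning steps can be sketched as follows; the column names, values and the derived-column cut-off are assumptions for illustration, not the real schema:

```python
import pandas as pd

# Hypothetical raw rows: same beer, with inconsistent whitespace and ABV as text.
raw = pd.DataFrame({"Name": [" Pilsner Urquell ", "Pilsner Urquell"],
                    "ABV": ["4.4", "4.4"]})

clean = raw.copy()
clean["Name"] = clean["Name"].str.strip()      # standardize appearance
clean["ABV"] = pd.to_numeric(clean["ABV"])     # convert columns
clean["Low_ABV"] = clean["ABV"] < 3.5          # create a derived column (cut-off assumed)
clean = clean.drop_duplicates()                # check for duplicates
```

After the whitespace is standardized the two rows become identical, so the duplicate check drops one of them.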

6.1.2 Create datasets based on alcohol content


Split the dataframe and export the resulting datasets to location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/unseen_data/unseen_output/unseen_clean
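The split could be sketched like this; the ABV cut-offs are assumptions for illustration, not the report's exact definitions of alcohol-free, light and regular:

```python
import pandas as pd

beers = pd.DataFrame({"Name": ["A", "B", "C"],
                      "ABV":  [0.0, 2.5, 5.0]})

# Partition by alcohol content (cut-offs assumed).
datasets = {
    "alc_free_unseen": beers[beers["ABV"] == 0.0],
    "light_unseen":    beers[(beers["ABV"] > 0.0) & (beers["ABV"] <= 3.5)],
    "regular_unseen":  beers[beers["ABV"] > 3.5],
}
# Each dataset would then be exported with to_csv to the unseen_clean folder.
```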

6.1.3 Apply ML models on unseen data


Import the unseen datasets as a dictionary of H2O frames. Write a function that takes the dictionary of unseen H2O frames, imports the already trained models, and outputs the predictions as pandas dataframes and CSV files.
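The dictionary plumbing for this step can be sketched as below. The real project loads H2O models and predicts on H2O frames; here `load_model` is a hypothetical stand-in (e.g. for `h2o.load_model` on a saved model path) and a toy model is used so the sketch stays runnable:

```python
import pandas as pd

def apply_models(unseen_frames, load_model):
    """Apply each already-trained model to its matching unseen dataset and
    return the predictions keyed by dataset name."""
    predictions = {}
    for name, frame in unseen_frames.items():
        model = load_model(name)              # stand-in for h2o.load_model(path)
        predictions[name] = model.predict(frame)
    return predictions

class MeanModel:
    """Toy stand-in model: predicts the column mean for every row."""
    def predict(self, frame):
        return pd.DataFrame({"prediction": [frame["ABV"].mean()] * len(frame)})

frames = {"regular_unseen": pd.DataFrame({"ABV": [4.0, 6.0]})}
predictions = apply_models(frames, lambda name: MeanModel())
```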

6.1.4 Generate and export predictions


Export predictions dataframes to location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/unseen_data/unseen_output/unseen_predictions

6.1.5 Select and export false negatives


Select false negatives and export them to location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/unseen_data/unseen_output/unseen_false_negatives

6.1.6 Calculate and export recall scores on unseen datasets


Calculate and export recall score to location: https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/tree/main/unseen_data/unseen_output/unseen_metrics

6.1.7 Evaluate results


alc_free_unseen:

• Rows: 13
• Number of ratings lower than 5: 1
• Recall score: 0
• This means that out of the 13 beers in alc_free_unseen, only one had a rating lower than 5 and the model didn’t
detect it.
• The model generates a UserWarning stating that Test/Validation dataset column “Style” has levels not trained on: alt. This means that the alt style wasn’t present in the training frame but was encountered in the unseen set on which I deployed the model. In this case the prediction takes into account only the Flavor feature.

Solutions:

A. Assess the impact of the small quantity of discarded beer from a similarly small group. Calculate the
cost/benefit of pursuing other machine learning approaches.
B. Consume more alcohol-free beer and train/test the model after more low ratings are collected.

light_unseen:

• Rows: 9
• Number of ratings lower than 5: 0
• Recall score: not computed

• This means that out of the 9 beers in light_unseen, none had a rating lower than 5, so the model didn’t have
anything to detect.
• The model generates a UserWarning stating that Test/Validation dataset column “Flavor” has levels not trained on: herb. This means that the herb flavor wasn’t present in the training frame but was encountered in the unseen set on which I deployed the model. In this case the prediction takes into account only the Style and ABV (alcohol content) features.

Solutions:

A. Assess the impact of the small quantity of discarded beer from a similarly small group. Calculate the cost/benefit
of pursuing other machine learning approaches.
B. Consume more light beer and train/test the model after more low ratings are collected.

regular_unseen:

• Rows: 59
• Number of ratings lower than 5: 7
• Recall score: 0.28
• This means that out of the 59 beers in regular_unseen, seven had ratings lower than 5, and the model detected
28% of them. By using this already trained model, the number of discarded bottles of beer drops by 28%.

Solutions:

A. Refine values of parameters set in section 4.6.1


B. Build models using other techniques. Examples:
1. NLP (natural language processing) models that analyze the relationship between the words in a beer’s name
and its rating.
2. Logistic regression to classify beers as low-rated vs. medium/high-rated, with the probability of being right, so
that the customer can decide how much risk to accept.
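Solution B.2 could start from a sketch like this one; the feature, data and risk cut-off are invented for illustration, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 1 = low-rated (< 5), 0 = medium/high rated; a single made-up feature.
X = np.array([[0.5], [2.0], [4.5], [5.5], [7.0], [9.0]])
y = np.array([1, 1, 0, 0, 0, 0])

clf = LogisticRegression().fit(X, y)

# Probability that a new beer is low-rated; the customer can pick a risk
# cut-off (e.g. avoid beers with more than a 30% chance of being low-rated).
risk = clf.predict_proba([[1.0]])[0, 1]
```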

Figure 26. Recall of predictions in unseen datasets

6.2 Plan Monitoring and Maintenance
Monitoring and Maintenance Plan
1. Record ratings of beer purchased because of model recommendations.
2. Every quarter calculate the percentage of beer that has been bought and discarded.
3. Compare percentage of discarded beer vs. recall score of the model applied.
4. Record the date when beer was rated.
5. Analyze change of beer preferences as time goes by.
6. Run the model yearly to adjust for changes in beer preferences.
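Steps 2–4 of the plan can be sketched with a purchase log like the one below; the dates, ratings and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical log of beers bought on model recommendations.
log = pd.DataFrame({
    "rated_on": pd.to_datetime(["2021-01-10", "2021-02-03",
                                "2021-04-12", "2021-05-20"]),
    "rating":   [4.0, 7.5, 3.0, 8.0],
})

log["discarded"] = log["rating"] < 5   # below the threshold -> thrown away

# Percentage of bought beer discarded, per quarter.
quarterly = log.set_index("rated_on").resample("Q")["discarded"].mean() * 100
```

The quarterly percentages can then be compared against the recall score of the applied model, as step 3 requires.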

6.3 Produce Final Report


Software requirements at https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/requirements.txt

Final Report
This document is the final report and was written based on the CRISP-DM template. It is stored at https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/Final_Report.pdf

Presentations
PowerPoint presentation at https://www.scribd.com/presentation/528726148/Data-Science-Beer-Recommendation-
Machine-Learning-Summary

PowerPoint presentation at https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/Summary_Presentation.pptx

Video presentation of code at https://www.youtube.com/watch?v=0J-50p2U_wQ

Video presentation of code at https://github.com/alina-molnar/Beer-Recommendation-Project-Proof-of-Concept/blob/main/Video_Code_Presentation.mp4

6.4 Review Project


Experience Documentation
This project was intended to prove there is a way to predict a numeric variable using one numeric feature and multiple
categorical features as input. The numeric feature and the numeric target had a non-linear and non-monotonic
relationship. The categorical features didn’t require previous encoding.

Key take-away: When comparing models, a lower MSE doesn’t necessarily mean a better model. For example, the regular_2std and beer_2std models had lower MSE than regular_all and beer_all, but also lower recall scores. And recall is the metric that best defines the success of the models, because it helps achieve the business goal of throwing away less beer. This is why I have used only the recall score to assess the models.

Lessons learned:
1. If the goal is to predict outliers, keep the outliers when training the model. Outlier removal works best for
correctly predicting the bulk of values, which was not the case here.
2. Instead of building the model with the lowest possible MSE, in this project it was worth choosing a custom
success metric (the recall score).

Tasks that had added value:


• Splitting beer dataset based on alcohol content because it allowed high recall scores for alcohol-free and light
beer.
• Analysis of subgroup occurrence. The graphs show that most false-negative predictions were generated for
subgroups that had enough observations. This is consistent with the observation that datasets that kept outliers
had higher recall scores than datasets without outliers.
• Derived attribute “Pasteurization” for regular beer.

Redundant tasks:
• Outlier removal. All models performed better on datasets that kept outliers. This is consistent with the
goal of this project, which was to correctly detect outliers.
• Derived attribute “Filtration” unused. “Perception” helped only in beer_all and beer_2std; since only models for
the split datasets were approved, “Perception” also remained unused.

Future opportunities to explore:


• Build a general pipeline with a function that analyzes variation and selects features automatically to build models.
• Use another tool for predictions.
• Analyze false negatives to find out if they present a pattern.

The Machine Learning Reproducibility Checklist (v2.0, Apr. 7, 2020)


For all models and algorithms presented, check if you include: Status Remarks
A clear description of the mathematical setting, algorithm, and/or model. Yes
A clear explanation of any assumptions. Yes
An analysis of the complexity (time, space, sample size) of any algorithm. No Not in scope

For any theoretical claim, check if you include: Status Remarks


A clear statement of the claim. No Not applicable
A complete proof of the claim. No Not applicable

For all datasets used, check if you include: Status Remarks


The relevant statistics, such as number of examples. Yes
The details of train / validation / test splits. Yes
An explanation of any data that were excluded, and all pre-processing steps. Yes
A link to a downloadable version of the dataset or simulation environment. Yes
For new data collected, a complete description of the data collection process, such as
instructions to annotators and methods for quality control. No Not applicable

For all shared code related to this work, check if you include: Status Remarks
Specification of dependencies. Yes
Training code. Yes
Evaluation code. Yes
(Pre-)trained model(s). Yes
README file includes table of results accompanied by precise command to run to
produce those results. Yes

For all reported experimental results, check if you include: Status Remarks
The range of hyper-parameters considered, method to select the best hyper-parameter
configuration, and specification of all hyper-parameters used to generate results. Yes
The exact number of training and evaluation runs. No Recorded in model files
A clear definition of the specific measure or statistics used to report results. Yes
A description of results with central tendency (e.g. mean) & variation (e.g. error bars). No Not applicable
The average runtime for each result, or estimated energy cost. No Not in scope
A description of the computing infrastructure used. Yes
Checklist reproduced from: www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist-v2.0.pdf

This paper is under Copyright (C) 2020-2021 Alina Molnar


License CC BY-NC
