
MLBABJ20-4

Group Project

A Study of Red & White Wines:


An Objective Critique of Quality

Student Details
Abhinav Premsekhar – H20063
Nikhil Kumar Upadhyaya – H20097
Tushar Rawat – H20118

October 12, 2021

XLRI – Xavier School of Management, Jamshedpur


MLBABJ20-4 Group Project A Study of Red & White Wines: An Objective Critique of Quality

Contents

1 Objective
2 Business Objective
3 Data Understanding
4 Description of Parameters
5 Data Preparation
6 Logistic Regression
7 Decision Tree
8 Random Forest
9 Neural Network
10 Conclusion
a. Appendix – Code


1 Objective

The objective of this study is to determine how the overall quality of a wine
depends on its measurable attributes, such as volatile acidity, citric acid,
residual sugar and chlorides.


2 Business Objective

Wine is an alcoholic beverage made from grapes, generally fermented without the
addition of sugars, acids, enzymes, water, or other nutrients. Yeast consumes the sugar
in the grapes and converts it to ethanol and carbon dioxide. Different varieties of
grapes and strains of yeasts produce different styles of wine. These variations result
from the complex interactions between the biochemical development of the grape, the
reactions involved in fermentation, the terroir, and the production process. This
study covers red and white wines. Red wine is made from dark-colored grape
varieties, and its production involves extracting color and flavor components from
the grape skin. When making white wine, the grape skins are removed before
fermentation, resulting in a clear juice that ultimately yields a transparent
white wine.

The ultimate aim of the analysis is to identify the factors that lead to a wine of
satisfactory quality, and how strongly quality depends on each of them. These
factors are physicochemical measurements such as alcohol %, chlorides, sulphates,
pH and density.

The final outcome is a model that captures which of these factors determine how
good a wine is. Once trained, the model predicts the quality of a wine when
provided with all the relevant measurements.


3 Data Understanding

The data consists of 6497 entries covering various red and white wines. The data
set contains the following fields:

• Wine Type: Broadly bucketed into red and white wine.
• Fixed Acidity: The acids in wine that do not evaporate readily.
• Volatile Acidity: The amount of acetic acid in the wine, which at high levels
can lead to an unpleasant, vinegar-like taste.
• Citric Acid: Often added to wines to increase acidity, complement a specific
flavor or prevent ferric hazes. It can be added to finished wines to give a
“fresh” flavor.
• Residual Sugar: The sugars left unfermented in a finished wine. Wines labeled
as “dry” contain up to about 10 g/L of residual sugar. Noticeably sweet wines
start at around 35 g/L and go up from there.
• Chlorides: Wine contains from 2 to 4 g/L of salts of mineral acids, along with
some organic acids. These may play a key role in a potential salty taste, with
chlorides being a major contributor to saltiness.
• Sulfur Dioxide: Prevents the wine from reacting with oxygen, which can cause
browning and off-odors (oxidation), and inhibits the growth of bacteria and
undesirable wild yeasts in the grape juice and wine. The data records it as
free and total sulfur dioxide.
• Density: Grape must before fermentation generally lies between 1.080 and
1.090, i.e. 8-9% denser than water, whereas finished wine is close to the
density of water. In this data set density ranges from 0.9871 to 1.0390.
• pH: The pH of a wine typically ranges from 3 to 4. Red wines with higher
acidity are more likely to be a bright ruby color, as the lower pH gives them
a red hue.
• Sulphates: Sulfites are a food preservative widely used in winemaking, thanks
to their ability to maintain the flavor and freshness of wine.
• Alcohol %: Alcohol by volume (ABV) is the global standard of measurement for
alcohol content. The range of ABV for unfortified wine is about 5.5% to 16%,
with an average of 11.6%.
• Quality: A wine's quality reflects the balance of acidity, tannins,
sugar/sweetness, alcohol and fruit. Wines that need several years of aging
reach this optimal balance only at maturity. In this data set quality is a
score from 3 to 9.


The sample of the data is as shown below:


4 Description of Parameters

The data has been distributed across various parameters, as discussed below:

Wine Type: Categorical variable.

Fixed Acidity:


Volatile Acidity:


Citric Acid:


Residual Sugar:


Chlorides:


Sulfur Dioxide:


Density:


pH:


Sulphates:


Alcohol %:


Quality:

The summary of the data is as given below:


> summary(wine)
   wine.type       fixed.acidity     volatile.acidity   citric.acid      residual.sugar
 Min.   :0.0000    Min.   : 3.800    Min.   :0.0800     Min.   :0.0000   Min.   : 0.600
 1st Qu.:0.0000    1st Qu.: 6.400    1st Qu.:0.2300     1st Qu.:0.2500   1st Qu.: 1.800
 Median :0.0000    Median : 7.000    Median :0.2900     Median :0.3100   Median : 3.000
 Mean   :0.2461    Mean   : 7.215    Mean   :0.3397     Mean   :0.3186   Mean   : 5.443
 3rd Qu.:0.0000    3rd Qu.: 7.700    3rd Qu.:0.4000     3rd Qu.:0.3900   3rd Qu.: 8.100
 Max.   :1.0000    Max.   :15.900    Max.   :1.5800     Max.   :1.6600   Max.   :65.800

   chlorides        free.sulfur.dioxide  total.sulfur.dioxide   density          pH
 Min.   :0.00900    Min.   :  1.00       Min.   :  6.0          Min.   :0.9871   Min.   :2.720
 1st Qu.:0.03800    1st Qu.: 17.00       1st Qu.: 77.0          1st Qu.:0.9923   1st Qu.:3.110
 Median :0.04700    Median : 29.00       Median :118.0          Median :0.9949   Median :3.210
 Mean   :0.05603    Mean   : 30.53       Mean   :115.7          Mean   :0.9947   Mean   :3.219
 3rd Qu.:0.06500    3rd Qu.: 41.00       3rd Qu.:156.0          3rd Qu.:0.9970   3rd Qu.:3.320
 Max.   :0.61100    Max.   :289.00       Max.   :440.0          Max.   :1.0390   Max.   :4.010

   sulphates        alcohol          quality
 Min.   :0.2200    Min.   : 8.00    Min.   :3.000
 1st Qu.:0.4300    1st Qu.: 9.50    1st Qu.:5.000
 Median :0.5100    Median :10.30    Median :6.000
 Mean   :0.5313    Mean   :10.49    Mean   :5.818
 3rd Qu.:0.6000    3rd Qu.:11.30    3rd Qu.:6.000
 Max.   :2.0000    Max.   :14.90    Max.   :9.000


5 Data Preparation

Initially we have 6497 observations with 15 variables (attributes).

• Wine Type and wine selected were converted into factor variables.
• The column “id” was removed as it is irrelevant to our model.
• The data set had no empty/null values, so imputation was not needed.
• The data point with a residual sugar of 65.8, an extreme outlier, was removed.
• All remaining outliers were removed using the IQR rule: values below
Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR were dropped.
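The IQR rule in the last bullet can be sketched in base R as follows. This is a minimal illustration, not the code used in the study: `iqr_keep` is a hypothetical helper name and the values are made-up toy numbers (only 65.8 echoes the outlier mentioned above).

```r
# Hypothetical helper: TRUE where a value lies inside the 1.5 * IQR fences
iqr_keep <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x >= q[1] - k * iqr & x <= q[2] + k * iqr
}

# Toy data (not the wine data set)
toy <- data.frame(
  residual.sugar = c(1.8, 2.1, 3.0, 5.4, 8.1, 2.6, 65.8),
  alcohol        = c(9.5, 10.3, 11.3, 10.5, 9.8, 12.0, 11.7)
)

# Keep a row only if every column is inside its own fences
rows_ok <- Reduce(`&`, lapply(toy, iqr_keep))
cleaned <- toy[rows_ok, ]
print(cleaned)
```

Filtering row-wise across all columns can discard a large share of the data when many attributes are skewed, so it is worth checking how many rows survive.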

Final data structure is as below.

Next, we check whether the data has a class imbalance problem. The cleaned data was
split into training and test sets in a 70:30 ratio, with the seed set to 11111.

There is no class imbalance, so neither undersampling nor oversampling is required.
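As a rough sketch of the two steps above, the class shares can be inspected with `prop.table`, and a stratified 70:30 split can be drawn in base R. The appendix uses `sample.split` from caTools instead; the label vector here is a made-up toy example.

```r
set.seed(11111)  # seed value taken from the text

# Toy binary target (made up): 1 = wine selected, 0 = not selected
y <- rep(c(0, 1), times = c(60, 40))

# Class shares; a heavily lopsided table would suggest resampling
class_share <- prop.table(table(y))
print(class_share)

# Stratified split: draw 70% of the row indices within each class
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), round(0.7 * length(i)))]
}))
is_train <- seq_along(y) %in% train_idx

print(table(y[is_train]))   # class counts in the training set
print(table(y[!is_train]))  # class counts in the test set
```

Sampling within each class keeps the class proportions of the training and test sets close to those of the full data.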


6 Logistic Regression

The logistic regression model is shown below:

The confusion matrix of the model is shown below:


The accuracy of the model is:
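For reference, accuracy is the share of test observations that fall on the diagonal of the confusion matrix. The sketch below shows the calculation on made-up labels; the numbers are illustrative only and are not the results of this model.

```r
# Made-up labels for illustration only
actual    <- factor(c(0, 0, 1, 1, 1, 0, 1, 0, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 1, 0, 0, 1, 0, 1, 1), levels = c(0, 1))

# Fixing the factor levels keeps the table square even if a class is never predicted
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions / all predictions
sensitivity <- cm["1", "1"] / sum(cm["1", ])    # share of actual 1s predicted as 1
specificity <- cm["0", "0"] / sum(cm["0", ])    # share of actual 0s predicted as 0
```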


7 Decision Tree

The decision tree model is shown below:

The confusion matrix of the model is shown below:


The ROC is shown below:

The accuracy of the model is:

The AUC value of the model is:
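The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks in base R. `auc_rank` is a hypothetical helper and the scores are made up; the appendix obtains the AUC through the ROCR package instead.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks are used for tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities and true labels
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
label <- c(1, 1, 0, 1, 0, 0)

auc <- auc_rank(score, label)
print(auc)
```

In this toy example 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.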


8 Random Forest

The random forest model is shown below:

The confusion matrix of the model is shown below:

The accuracy of the model is:


9 Neural Network

The neural network model is shown below:

The confusion matrix of the model is shown below:

The accuracy of the model is:


10 Conclusion

By looking into the details, we can see that good-quality wines have, on average,
higher alcohol levels, lower volatile acidity, higher sulphate levels and higher
residual sugar levels. Free sulfur dioxide, pH and residual sugar are the least
important criteria for determining quality, while alcohol content, volatile
acidity and sulphates are the most influential factors.
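A comparison of this kind can be made by averaging each attribute within the two classes. The sketch below shows only the mechanics, assuming a 0/1 `wine.selected` column as in the appendix. The values are made-up toy numbers, not results of this study.

```r
# Made-up toy values, NOT results from this study; wine.selected = 1 marks a good wine
toy <- data.frame(
  wine.selected    = c(1, 1, 1, 0, 0, 0),
  alcohol          = c(12.1, 11.4, 12.5, 9.6, 10.1, 9.4),
  volatile.acidity = c(0.25, 0.30, 0.28, 0.45, 0.52, 0.41)
)

# Mean of each attribute within each class
group_means <- aggregate(cbind(alcohol, volatile.acidity) ~ wine.selected,
                         data = toy, FUN = mean)
print(group_means)

# Difference in means (good minus other) for each attribute
diffs <- unlist(group_means[group_means$wine.selected == 1, -1]) -
  unlist(group_means[group_means$wine.selected == 0, -1])
print(diffs)
```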


a. Appendix – Code

# Packages used below
library(data.table)    # as.data.table and the dt[...] syntax
library(caTools)       # sample.split
library(rpart)         # decision tree
library(randomForest)  # random forest
library(neuralnet)     # neural network
library(ROCR)          # ROC curve and AUC

# Note: wine.selected is assumed to be coded as numeric 0/1 throughout.

# Read the CSV. The BOM-aware encoding is intended to keep the first column
# named "id" rather than the mis-encoded "ï..id".
Wine_Data = read.csv(file.choose(), header = TRUE, fileEncoding = "UTF-8-BOM")
Wine_Data = as.data.table(Wine_Data)
Wine_Data[, id := NULL]   # id is irrelevant to the model

summary(Wine_Data)
str(Wine_Data)

# Inspect rows above the upper IQR fence for volatile acidity
dp = Wine_Data$volatile.acidity
q1 = quantile(dp, 0.25)
q3 = quantile(dp, 0.75)
iqr = q3 - q1
Wine_Data[volatile.acidity >= q3 + 1.5 * iqr]

### Train-Test Split
set.seed(11111)
Split_Type = sample.split(Wine_Data$wine.selected, SplitRatio = 0.7)
Train_Data = subset(Wine_Data, Split_Type == TRUE)
Test_Data = subset(Wine_Data, Split_Type == FALSE)

# Class balance check (must run after Train_Data exists)
table(Train_Data$wine.selected)

### Training the model

#Logistic Regression
Model_LR = glm(wine.selected ~ ., data = Train_Data, family = "binomial")
LR_Prediction = predict(Model_LR, Test_Data[, !c("wine.selected"), with = FALSE],
                        type = "response")
LR_Prediction = ifelse(LR_Prediction < 0.5, 0, 1)   # 0.5 probability cut-off
LR_Confusion_Matrix = table(Test_Data$wine.selected, LR_Prediction)
LR_Accuracy = sum(diag(LR_Confusion_Matrix)) / sum(LR_Confusion_Matrix)
LR_Accuracy
summary(Model_LR)

#Decision Tree
Model_DT = rpart(wine.selected ~ ., data = Train_Data, method = "class")
DT_Prediction = predict(Model_DT, Test_Data, type = "class")
DT_Confusion_Matrix = table(Test_Data$wine.selected, DT_Prediction)
DT_Accuracy = sum(diag(DT_Confusion_Matrix)) / sum(DT_Confusion_Matrix)
DT_Accuracy

# ROC curve and AUC for the decision tree (only valid for a binary target)
probrf = predict(Model_DT, Test_Data, type = "prob")
head(probrf)
probrf1 <- probrf[, 2]   # probability of class 1

predrf <- prediction(probrf1, Test_Data$wine.selected)
rocrf = performance(predrf, "tpr", "fpr")
plot(rocrf)

aucrf <- performance(predrf, "auc")
aucrf@y.values

#Random forest
# The response must be a factor for randomForest to fit a classifier;
# with a numeric response it silently fits a regression forest instead.
RF_Train = copy(Train_Data)
RF_Train[, wine.selected := as.factor(wine.selected)]
Model_RF = randomForest(wine.selected ~ ., data = RF_Train)
RF_Prediction = predict(Model_RF, Test_Data, type = "class")   # class labels, no cut-off needed
RF_Confusion_Matrix = table(Test_Data$wine.selected, RF_Prediction)
RF_Accuracy = sum(diag(RF_Confusion_Matrix)) / sum(RF_Confusion_Matrix)
RF_Accuracy
importance(Model_RF)   # variable importance; summary() of a forest is not informative

#Neural Network
# model.matrix expands factor columns into numeric dummy variables
NN_Data = model.matrix(~ ., Wine_Data)
NN_Data = as.data.table(NN_Data)
NN_Data$`(Intercept)` = NULL
names(NN_Data) <- make.names(names(NN_Data))

set.seed(11111)
Split_Type = sample.split(NN_Data$wine.selected, SplitRatio = 0.7)
NN_Train_Data = subset(NN_Data, Split_Type == TRUE)
NN_Test_Data = subset(NN_Data, Split_Type == FALSE)

Model_NN = neuralnet(wine.selected ~ ., data = NN_Train_Data, hidden = 0,
                     threshold = 0.2, linear.output = TRUE)
NN_Prediction = predict(Model_NN, NN_Test_Data[, !c("wine.selected"), with = FALSE])
NN_Prediction = ifelse(NN_Prediction < 0.5, 0, 1)
# Compare against the labels of the same rows that were scored
NN_Confusion_Matrix = table(NN_Test_Data$wine.selected, NN_Prediction)
NN_Accuracy = sum(diag(NN_Confusion_Matrix)) / sum(NN_Confusion_Matrix)
NN_Accuracy
