
OCTOBER 3, 2020

SYNDICATE TASK #1
PREDICTIVE MODELLING – WINE QUALITY

SOPE DALLEY, ALISTER KING, MATTHEW LEWIS & RADEYAN SAZZAD


SYNDICATE 6
1. Explore the properties of the data, including:
a. Average of each variable
FA    VA     CA     RS    Ch      FSD   TSD   Density  pH    Sulphates  Alc   QS
8.32  0.529  0.271  2.54  0.0875  15.9  46.5  0.997    3.31  0.658      10.4  5.64
Table 1 - Mean of physicochemical properties

The averages vary substantially, reflecting the different units of measurement of the physicochemical properties.
These differences in scale will also be reflected in the coefficients of the regression model.

b. Histogram of the quality score variable

Figure 1 - Histogram of quality score

Many of the QS results lie between 5 and 7. Because of the low number of categories it is difficult to assess
normality; however, there appears to be some degree of skew towards the higher scores.
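The visual impression of skew can be backed up with a frequency count and a sample skewness coefficient. The following is a minimal sketch in Python (the figures in this report are R output); the scores are made-up illustrative values, not the wine data, and the function name `skewness` is hypothetical.

```python
from collections import Counter

def skewness(values):
    """Moment-based sample skewness: m3 / m2^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up quality scores for illustration only (NOT the wine data set).
scores = [3, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8]

# Text histogram: one '#' per observation at each score.
counts = Counter(scores)
for score in sorted(counts):
    print(score, "#" * counts[score])

print(f"skewness = {skewness(scores):.3f}")
```

A positive value indicates a longer tail towards the higher scores, a negative value a longer tail towards the lower scores, and a value near zero a roughly symmetric distribution.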

c. Relationship between the physicochemical properties and the wine quality score using the correlation
function.
     FA     VA      CA     RS       Ch      FSD      TSD     Density  pH       Sulphates  Alc
QS   0.124  -0.391  0.226  0.01373  -0.129  -0.0506  -0.185  -0.175   -0.0577  0.251      0.476
Table 2 - Correlation of physicochemical properties

The table shows both negative and positive correlations between QS and the physicochemical properties (PP).
Alcohol (Alc) has the largest correlation in magnitude (0.476), with VA the second largest (-0.391). These
correlation values do not by themselves indicate statistical significance, however. The table does provide insight
into which relationships might be worth investigating further, and into the direction of the relationship between
each PP and QS.
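Whether a given correlation is statistically significant depends on the sample size as well as on its magnitude. The sketch below, in Python for illustration only, computes a Pearson correlation and the usual t statistic for the null hypothesis of zero correlation. The function names and the toy data are hypothetical and are not taken from the wine data set.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t_stat(r, n):
    """t statistic for H0: true correlation = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Made-up toy values for illustration only (NOT the wine data set).
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0, 12.8]
quality = [5, 5, 6, 5, 7, 6]

r = pearson_r(alcohol, quality)
t = correlation_t_stat(r, len(alcohol))
print(f"r = {r:.3f}, t = {t:.2f} on {len(alcohol) - 2} degrees of freedom")
```

The t statistic is then compared against a t distribution with n - 2 degrees of freedom. With a large sample, even a modest correlation can be significant, which is why magnitude and significance need to be read separately.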

2. Regression models:
a. Standard

Figure 2 - R output of linear regression

b. Stepwise

Figure 3 - R output of stepwise regression


FA, CA and Density were not retained in the Stepwise model, indicating that they are not statistically significant
enough in their relationship with QS to be kept. All other properties were retained, which indicates a level of
confidence in their relationship with QS. Retention does not, however, imply anything about the magnitude of their
impact on QS.
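The general idea of stepwise selection can be sketched as backward elimination: start from the full model and repeatedly drop the variable whose removal most improves a selection criterion. The Python sketch below assumes AIC as the criterion, which is an assumption for illustration (the report does not state which criterion produced Figure 3). All function names and the toy data are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(bj * xj for bj, xj in zip(beta, design[i]))) ** 2
               for i in range(n))

def aic(columns, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(parameters)."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + 2 * (len(columns) + 1)

def backward_eliminate(data, y):
    """Drop one variable at a time while doing so lowers the AIC."""
    kept = list(data)
    best = aic([data[k] for k in kept], y)
    while kept:
        trials = [(aic([data[k] for k in kept if k != drop], y), drop) for drop in kept]
        score, drop = min(trials)
        if score >= best:
            break
        best = score
        kept.remove(drop)
    return kept

# Made-up toy data: y depends strongly on x1; "noise" is an unrelated column.
toy = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "noise": [1, 0, 1, 0, 0, 1, 0, 1]}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]
print("retained:", backward_eliminate(toy, y))
```

A variable being dropped by such a procedure means only that it did not improve the criterion given the other variables in the model; it says nothing about how large the effect of a retained variable is.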

c. Any potential nonlinear relationships.

Figure 4 - R output of nonlinear relationships

The four residual plots were chosen based on the t values and the correlation values in Table 2. Further PP were
also explored but displayed similar trends in their residual plots. The four plots in Figure 4 show residuals that
are largely randomly distributed, with no clear trend (e.g. quadratic). There are elements of clustering, however,
which appear to be a function of the scale of the variable (VA). The nonlinear terms in the regression output in
Figure 4 are both quadratic and interaction terms; however, only CA x Alc and FA x Density have p values < 0.1,
indicating that the remainder are not statistically significant at that level. Despite the substantial number of
nonlinear terms added, the R2 value increased by only 2.2% relative to the standard model.
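Quadratic and interaction terms are simply extra columns derived from the existing ones before the regression is fitted. A minimal Python sketch of that construction follows; the helper name `add_nonlinear_terms`, the column naming scheme and the single observation are all hypothetical, and the values are not taken from the wine data.

```python
def add_nonlinear_terms(row, squares, interactions):
    """Return a copy of `row` with squared and product columns appended."""
    out = dict(row)
    for name in squares:
        out[f"{name}^2"] = row[name] ** 2        # quadratic term
    for a, b in interactions:
        out[f"{a}x{b}"] = row[a] * row[b]        # interaction term
    return out

# One made-up observation for illustration only.
wine = {"CA": 0.30, "Alc": 10.0, "FA": 8.0, "Density": 0.997}

expanded = add_nonlinear_terms(
    wine,
    squares=["Alc"],
    interactions=[("CA", "Alc"), ("FA", "Density")],
)
for name, value in expanded.items():
    print(f"{name:12s} {value:.4f}")
```

Each added column costs one extra coefficient, so a small gain in fit has to be weighed against the larger number of terms.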

3. Obtain the predictions for wine quality score for the test set data. Compute the predictive accuracy metrics
for the test set. Comment on which model in your consideration set performs best.
            RMSE   MAE    MAPE  MASE   Adjusted r^2
Linear      0.635  0.488  8.70  0.735  0.353
Stepwise    0.636  0.489  8.70  0.735  0.354
Nonlinear   0.627  0.488  8.71  0.734  0.377
Table 3 - summary of predictive accuracy outputs

The three models have very similar predictive accuracy metrics. The MASE values for all three are within 0.2
percentage points of each other and, at roughly 0.735, indicate an MAE about 26.5% lower than the naïve case
(training mean). RMSE, MAE and MAPE are similarly close across the models.
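For reference, the four error metrics can be computed as in the Python sketch below. It assumes, as stated above, that the naïve benchmark for MASE is the training mean, and that MAPE is expressed in percent. The function name and the example numbers are hypothetical and are not the test-set values.

```python
import math

def accuracy_metrics(actual, predicted, train_mean):
    """RMSE, MAE, MAPE (in %) and MASE relative to a training-mean forecast."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    # Naive benchmark: always predict the mean of the training responses.
    naive_mae = sum(abs(a - train_mean) for a in actual) / n
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "MASE": mae / naive_mae}

# Made-up scores for illustration only (NOT the wine test set).
actual = [5, 6, 7, 5, 6]
predicted = [5.2, 5.7, 6.4, 5.4, 6.1]
metrics = accuracy_metrics(actual, predicted, train_mean=5.6)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A MASE below 1 means the model's MAE is smaller than that of the naïve forecast, and the shortfall from 1 is the proportional improvement.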

The Nonlinear model has the lowest RMSE and MASE, and its MAE matches that of the Linear model to three decimal
places, indicating that it has the lowest error overall. Only its MAPE is slightly higher than that of Linear and
Stepwise; because MAPE divides each error by the actual score, this can occur when errors fall on observations with
slightly smaller actual values (the denominator). Nonlinear also explains the most variance, with an adjusted R2 of
0.3772, about 0.02 higher than the Standard and Stepwise models. It should be noted that the Nonlinear model needs
9 additional interaction and quadratic terms to achieve this marginal gain over the Standard model, and 12 more
terms than the Stepwise model.
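Adjusted R2 is the relevant comparison here because it penalises each additional term. A short Python sketch of the standard formula follows; the sample size and R2 value in the example are hypothetical, since the size of the training set is not stated in this report.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical values for illustration: same raw R^2, different model sizes.
n_obs = 1000
for n_predictors in (8, 11, 20):
    value = adjusted_r2(0.40, n_obs, n_predictors)
    print(f"{n_predictors:2d} predictors -> adjusted R2 = {value:.4f}")
```

For the same raw R2, the model with more terms receives the lower adjusted R2, so a larger model has to earn its extra terms through a genuine improvement in fit.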

Typically, in a purely diagnostic environment, the Nonlinear model would be preferable on account of its lower
model error and higher explanatory power in R2. However, these better metrics should be viewed in light of how
meaningful the model is and the context in which it will be used.

As a useful predictive model, the Stepwise model is preferred because the insignificant factors (FA, CA, Density)
are removed for a minimal sacrifice in model error and predictive power. Eliminating the less significant factors
also reduces the likelihood of errors or false positives caused by weak factors skewing the predictions. The more
succinct model will also be more meaningful in a practical application when explaining the drivers of wine rating.

4. Investigate the best performing predictive model you have chosen. Discuss what your chosen model reveals
about the relationships between the wine quality score and the physicochemical properties.

Figure 5 - Stepwise output

RS, FSD, Sulphates and Alc all have positive coefficients, indicating that an increase in these properties raises
the modelled QS when the other properties are held constant. VA, Ch, TSD and pH have negative coefficients,
indicating that increases in these physicochemical properties detract from the modelled score, and vice versa. To
use Alc as an example, an increase in alcohol of 1 unit (1 percentage point) increases the modelled QS by 0.293,
and a decrease of 1 unit decreases it by 0.293. The properties with negative coefficients work in the opposite
direction. The standard error of each coefficient is used to assess the range of confidence around that coefficient
estimate.
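The Alc example can be worked through numerically. In the Python sketch below, the coefficient 0.293 is the value quoted above; the standard error of 0.02 is a hypothetical placeholder (the actual value would be read from Figure 5), and the interval uses a normal approximation with z = 1.96 as a simplifying assumption.

```python
def predicted_change(coefficient, delta):
    """Change in modelled QS for a change `delta` in one property, others fixed."""
    return coefficient * delta

def coefficient_interval(coefficient, std_error, z=1.96):
    """Approximate 95% interval for a coefficient (normal approximation)."""
    return coefficient - z * std_error, coefficient + z * std_error

ALC_COEF = 0.293   # from the Stepwise output discussed above
ALC_SE = 0.02      # HYPOTHETICAL standard error, for illustration only

print(f"+1 unit Alc  -> QS change {predicted_change(ALC_COEF, 1):+.3f}")
print(f"-1 unit Alc  -> QS change {predicted_change(ALC_COEF, -1):+.3f}")

low, high = coefficient_interval(ALC_COEF, ALC_SE)
print(f"approx. 95% interval for the Alc coefficient: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero is consistent with the coefficient being statistically significant at roughly the 5% level.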

In terms of statistical significance, all factors are significant at the p < 0.15 level, with Alc clearly the most
statistically significant: it has the largest t value in magnitude and the smallest p value. This means the
relationship between Alc and QS is the one most strongly supported by the data. In terms of magnitude, the
coefficients indicate how much QS changes for a one-unit change in each property; however, units and magnitudes
differ between the properties, so further analysis, such as confidence intervals around each physicochemical
property's coefficient, can be conducted to estimate the impact on the response variable (QS).
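One common way to compare effects across properties measured in different units is to standardise the coefficients, expressing each as the change in QS (in standard deviations) per one standard deviation change in the property. This is offered as one possible form of the further analysis, not something carried out in this report. In the Python sketch below, every coefficient and standard deviation is a hypothetical placeholder.

```python
def standardised_coefficient(coefficient, sd_x, sd_y):
    """Change in the response, in SDs, per one-SD change in the predictor."""
    return coefficient * sd_x / sd_y

# HYPOTHETICAL coefficients and standard deviations, for illustration only.
SD_QS = 0.8
examples = {
    # name: (raw coefficient, standard deviation of the property)
    "property_A": (0.30, 1.0),
    "property_B": (-1.00, 0.2),
}

for name, (coef, sd_x) in examples.items():
    beta = standardised_coefficient(coef, sd_x, SD_QS)
    print(f"{name}: raw = {coef:+.2f}, standardised = {beta:+.3f}")
```

In this made-up example, the property with the larger raw coefficient has the smaller standardised effect, which is why raw coefficients alone should not be used to rank the properties.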

5. What are the limitations of your analysis? Discuss.


This analysis uses the physicochemical properties of the wine to predict its quality score. However, judging the
quality of wine is a subjective process, and factors such as differences between testers, or the mood of a tester
on the day, can affect the score. Given this subjectivity, it is no surprise that the highest R2 value of the three
regression models is only 0.3772. Although this is considered a low R2 value, it appears reasonable: the quality
score is not purely dependent on the physicochemical properties, which can explain only 37.72% of the variation in
wine quality. To develop a more comprehensive model, further predictive factors would need to be investigated and
more comprehensive data gathered. One proposal is to isolate the scores from individual judges in order to assess
and better control for the individual biases discussed above.
