
Xin Zhang

Professor Wanli Qiao

STAT 515-P04

Dec 14, 2017

Final Project: Red Wine Quality

How do you evaluate the quality of red wine? If you interview people on the street, you won't
get a useful answer. Wine experts, however, judge a wine by its production process, vintage and
quality [1]. This paper focuses on the quality of red wine. Because I do not drink wine myself,
I use RStudio to evaluate red wine quality from data.

Data description

I got the data from Kaggle.com [2]. The size of the data set is 26 KB. It has 1,599
observations and 12 variables: 11 independent variables and one dependent variable.

The independent variables are based on physicochemical tests: fixed acidity, volatile acidity,
citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH,
sulphates and alcohol. The dependent variable is quality (a score between 0 and 10). The R
libraries used are MASS, ISLR, lattice, hexbin, ggplot2, car, splines and corrplot.


Data preparation

Why did I choose this data set? Data cleaning is time consuming, and the data set shapes the
analysis. This data set is already clean, so I can focus on the analysis. Otherwise, I would need
to drop unnecessary data with the subset function, change blank cells to NA, and remove the rows
containing NA with the na.omit function.
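
The cleaning steps described above could look roughly like the sketch below. The small table `raw` and all of its values are invented for illustration; they are not taken from the wine data.

```r
# Toy stand-in for a messy table; every value here is made up.
raw <- data.frame(
  id      = 1:5,
  alcohol = c("9.4", "", "10.5", "11.2", "9.8"),  # one blank cell
  pH      = c(3.51, 3.20, NA, 3.16, 3.30),        # one missing value
  quality = c(5, 5, 6, 6, 5),
  stringsAsFactors = FALSE
)

# Drop a column that is not needed for the analysis.
wine <- subset(raw, select = -id)

# Change blank cells to NA, then convert the column to numeric.
wine$alcohol[wine$alcohol == ""] <- NA
wine$alcohol <- as.numeric(wine$alcohol)

# Remove every row that still contains an NA.
clean <- na.omit(wine)
print(dim(clean))
```

When the data come from a file, read.csv can also do the blank-to-NA step at load time through its na.strings argument, for example na.strings = c("", "NA").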

Scatter Plot Matrix

I use this scatter plot matrix to visualize the relationship between the dependent variable and
the 11 independent variables. In this plot, it is difficult to see the trend of the data, so we
cannot tell which independent variables have a clear relationship with quality. I will therefore
improve the scatter plot matrix by adding a fitted trend line to each panel.

Improved Scatter Plot Matrix

I add a smoothed regression line (a loess fit, panel.loess in the Appendix) to each panel of the
scatter plot matrix. The trend of the data is shown by the red lines. The last column and the last
row both relate to quality. In the last row, we can see roughly linear relationships between
quality and some independent variables. For example, volatile acidity and alcohol are clearly
related to quality. However, other independent variables show no clear relationship with quality.
So, I will build a model to select the significant variables.
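
A visual impression like this can be backed with numbers by computing the correlation of each predictor with quality. The sketch below is only an illustration under stated assumptions: the data frame `wine` is simulated (it is not the Kaggle file), and the effect sizes in the simulation are invented.

```r
set.seed(1)
n <- 500

# Simulated stand-in data (NOT the real wine data): quality rises with
# alcohol, falls with volatile acidity, and ignores residual sugar.
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  residual.sugar   = rnorm(n, mean = 2.5, sd = 1)
)
wine$quality <- 5.6 + 0.4 * (wine$alcohol - 10.4) -
  1.2 * (wine$volatile.acidity - 0.53) + rnorm(n, sd = 0.6)

# Correlation of every predictor with quality, sorted by absolute size.
r <- cor(wine)[, "quality"]
r <- r[names(r) != "quality"]
print(round(r[order(abs(r), decreasing = TRUE)], 2))
```

With the real data, the same two lines that build `r` can be applied to the data frame df from the Appendix.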

7-Variable Model

This model includes 7 independent variables: volatile acidity, chlorides, free sulfur dioxide,
total sulfur dioxide, pH, sulphates and alcohol. First, the p-values of all 7 independent
variables are less than 0.05, which means each variable is significant at the 5% level. Second,
the p-value of the model as a whole (the F-test) is also less than 0.05, which means the model is
significant. Third, the multiple R-squared is 0.3595 and the adjusted R-squared is 0.3567, which
means the model explains about 36% of the variance in quality. The two values are close, so the
model is not padded with unnecessary predictors.
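
The two reported R-squared values can be checked against each other by hand. The sketch below applies the standard adjusted R-squared formula. It assumes n = 1599 observations (the row count of the public red wine file, an assumption here) and p = 7 predictors (the terms in the model1 formula in the Appendix).

```r
r2 <- 0.3595  # multiple R-squared reported in the text
n  <- 1599    # assumed number of observations
p  <- 7       # predictors in the model1 formula

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))  # 0.3567
```

Under these assumptions the result matches the adjusted value given in the text.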

Diagnostic Plots for Linear Regression Analysis

After running the regression, I use the diagnostic plots for linear regression analysis to check
whether the model works well for the data [3]. The diagnostic plots show the residuals in four
different ways, as described below.

I. Residuals vs Fitted

The Residuals vs Fitted plot shows whether the residuals have non-linear patterns [3]. If the
model doesn't capture a non-linear relationship between the predictor variables and quality, the
pattern shows up in this plot [3]. There is no distinctive pattern in this plot, which suggests
the linear form of the model is adequate. If the red line were shaped like a parabola, the model
would be missing a non-linear relationship. However, three outliers, No. 833, No. 1277 and
No. 653, do not fit the pattern very well. I will keep these three outliers in mind in the next
plots, because they might be potential problems.
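
The row numbers printed on the plot are the observations with the largest residuals (plot.lm labels three points by default through its id.n argument). A minimal sketch of how to list such rows directly is shown below. The data are simulated with three planted outliers, so the row numbers are illustrative and not those of the wine data.

```r
set.seed(42)
n <- 300

# Simulated stand-in data with three planted outliers (rows 17, 120, 250).
wine <- data.frame(alcohol = runif(n, 8.5, 14))
wine$quality <- 2 + 0.35 * wine$alcohol + rnorm(n, sd = 0.5)
planted <- c(17, 120, 250)
wine$quality[planted] <- wine$quality[planted] - 4

fit <- lm(quality ~ alcohol, data = wine)

# Rows with the three largest absolute residuals.
worst <- order(abs(resid(fit)), decreasing = TRUE)[1:3]
print(sort(worst))
```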

II. Normal Q-Q

The Normal Q-Q plot shows whether the residuals are normally distributed. In this plot, the
residuals follow the straight line well, which means the normality assumption is reasonably met.
No. 833, No. 1277 and No. 653 form a tail and do not fit the line very well. Although the points
do not form a perfectly straight line because of these three outliers, the model is still
acceptable.

III. Scale-Location

The Scale-Location plot is also called the Spread-Location plot. It shows whether the residuals
are spread equally along the range of fitted values, and so checks the assumption of equal
variance (homoscedasticity). The red line is not a perfectly horizontal line with equally spread
points, which means the assumption is not well met. The same outliers appear again: No. 833,
No. 1277 and No. 653.
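
A numeric check can complement the visual reading of this plot. The sketch below computes the quantity on the y-axis of the Scale-Location plot and a studentized Breusch-Pagan statistic by hand, using base R only. This test is not part of the original analysis; it is swapped in here as an illustration, and the data are simulated so that the noise grows with the predictor.

```r
set.seed(7)
n <- 400

# Simulated stand-in data whose noise grows with x (heteroscedastic).
x <- runif(n, 8.5, 14)
y <- 2 + 0.35 * x + rnorm(n, sd = 0.1 * (x - 7))
fit <- lm(y ~ x)

# Quantity plotted on the y-axis of the Scale-Location plot.
s <- sqrt(abs(rstandard(fit)))

# Studentized Breusch-Pagan statistic: n * R^2 from regressing the
# squared residuals on the predictor, compared with a chi-squared(1).
aux <- lm(resid(fit)^2 ~ x)
bp <- n * summary(aux)$r.squared
p_value <- pchisq(bp, df = 1, lower.tail = FALSE)
print(p_value < 0.05)
```

A small p-value points to unequal variance, which is the same conclusion a rising red line in the plot suggests.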

IV. Residuals vs Leverage


The Residuals vs Leverage plot helps to find influential cases. Not all outliers are influential
in linear regression analysis. This plot has the typical look of a case where the outliers are
not influential. The red dashed lines that mark Cook's distance are barely visible, and no point
lies beyond them, so No. 833, No. 1277 and No. 653 do not influence the result and are not a
problem for the model.
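
The same check can be made numerically with cooks.distance and hatvalues from base R. The sketch below uses simulated data, so the values are illustrative only. The 0.5 threshold is the first Cook's distance contour that plot.lm draws by default.

```r
set.seed(3)
n <- 300

# Simulated stand-in data without influential points.
wine <- data.frame(alcohol = runif(n, 8.5, 14))
wine$quality <- 2 + 0.35 * wine$alcohol + rnorm(n, sd = 0.5)
fit <- lm(quality ~ alcohol, data = wine)

cd  <- cooks.distance(fit)  # influence of each observation on the fit
lev <- hatvalues(fit)       # leverage of each observation

# Is any observation beyond the 0.5 Cook's distance contour?
print(any(cd > 0.5))
```

With the real data, cooks.distance(model1)[c(653, 833, 1277)] would give the values for the three labeled points.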

Conclusion

Many other factors can affect the quality of red wine. Some of them relate to smell and flavor
rather than the chemical properties that I used as independent variables. In this analysis, I
identified 7 significant factors: volatile acidity, chlorides, free sulfur dioxide, total sulfur
dioxide, pH, sulphates and alcohol. These factors have a statistically significant relationship
with the quality of red wine.

In this project, I used a scatter plot matrix, a linear regression model and diagnostic plots
for linear regression analysis to identify the significant variables for red wine quality. In a
future study, I could improve the project by increasing the number of independent variables or by
checking whether quality is related to price. Additionally, to increase the accuracy of the
model, I would like to try non-linear transformations.


References

[1] "All About Wine," 2017. [Online]. Available: http://www.all-about-wine.com/wine-evaluation.html. [Accessed 13 12 2017].

[2] "Red Wine Quality," Kaggle Inc, 2017. [Online]. Available: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009. [Accessed 13 12 2017].

[3] B. Kim, "Understanding Diagnostic Plots for Linear Regression Analysis," University of Virginia Library, 21 11 2015. [Online]. Available: http://data.library.virginia.edu/diagnostic-plots/. [Accessed 14 12 2017].

Appendix

df <- read.csv("winequality-red.csv")

library(MASS)
library(ISLR)
library(lattice)
library(hexbin)
library(ggplot2)
library(car) # vif() and qqPlot functions
library(splines)
library(corrplot)

# Custom ggplot2 theme. It is defined here but not used by the lattice plots below.
hw <- theme_gray() + theme(
  plot.title = element_text(hjust = 0.5),
  plot.subtitle = element_text(hjust = 0.5),
  plot.caption = element_text(hjust = -.5),

  # strip.text.y = element_blank(),
  strip.background = element_rect(fill = rgb(.9, .95, 1),
                                  colour = gray(.5), size = .2),

  panel.border = element_rect(fill = FALSE, colour = gray(.70)),
  panel.grid.minor.y = element_blank(),
  panel.grid.minor.x = element_blank(),
  panel.spacing.x = unit(0.10, "cm"),
  panel.spacing.y = unit(0.05, "cm"),

  # axis.ticks.y = element_blank()
  axis.ticks = element_blank(),
  axis.text = element_text(colour = "black"),
  axis.text.y = element_text(margin = margin(0, 3, 0, 3)),
  axis.text.x = element_text(margin = margin(-1, 0, 3, 0))
)

# Off-diagonal panels: hexagon-binned scatter plot on a reference grid.
offDiag <- function(x, y, ...) {
  panel.grid(h = -1, v = -1, ...)
  panel.hexbinplot(x, y, xbins = 15, ..., border = gray(.7),
                   trans = function(x) x^.5)
  # panel.loess(x, y, ..., lwd = 2, col = 'red')
}

# Diagonal panels: density curve of each variable, scaled to the panel height.
onDiag <- function(x, ...) {
  yrng <- current.panel.limits()$ylim
  d <- density(x, na.rm = TRUE)
  d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y))
  panel.lines(d, col = rgb(.83, .66, 1), lwd = 2)
  diag.panel.splom(x, ...)
}

# Scatter plot matrix without trend lines.
splom(df, as.matrix = TRUE,
      xlab = '', main = "Wine Analysis: Selected Variables",
      pscale = 0, varname.cex = 0.8, axis.text.cex = 0.6,
      axis.text.col = "purple", axis.text.font = 2,
      axis.line.tck = .5,
      panel = offDiag,
      diag.panel = onDiag
)

# Redefine the off-diagonal panel to add a red loess trend line.
offDiag <- function(x, y, ...) {
  panel.grid(h = -1, v = -1, ...)
  panel.hexbinplot(x, y, xbins = 15, ..., border = gray(.7),
                   trans = function(x) x^.5)
  panel.loess(x, y, ..., lwd = 2, col = 'red')
}

# Improved scatter plot matrix with trend lines.
splom(df, as.matrix = TRUE,
      xlab = '', main = "Wine Analysis: Selected Variables",
      pscale = 0, varname.cex = 0.8, axis.text.cex = 0.6,
      axis.text.col = "purple", axis.text.font = 2,
      axis.line.tck = .5,
      panel = offDiag,
      diag.panel = onDiag
)

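# Full model: quality regressed on all 11 predictors.
# Use the bare column name `quality` as the response so that `.` expands
# to the other 11 columns only.
model <- lm(quality ~ ., data = df)
summary(model)

# Reduced model with the 7 significant predictors.
model1 <- lm(quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
               total.sulfur.dioxide + pH + sulphates + alcohol, data = df)
summary(model1)

# Optional: uncomment to show the four diagnostic plots in one 2 x 2 grid.
# par(mfrow = c(2, 2))
plot(model1)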