
STAT 515-P04

How should the quality of red wine be evaluated? If you interview people on the street, you
won't get a useful answer. Experienced wine experts, however, judge a wine by its production
process, vintage and quality [1]. This paper focuses on the quality of red wine. For people who
do not drink like me,

Data description

I obtained the data from Kaggle.com [2]. The data set is 26 KB in size and has 1,599
observations and 12 variables: 11 independent variables and one dependent variable. The
independent variables come from physicochemical tests: fixed acidity, volatile acidity, citric
acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates
and alcohol. The dependent variable is quality (a score between 0 and 10). The libraries used
include MASS,

Data preparation

Why did I choose this data set? Data cleaning is time-consuming, and the data set shapes the
analysis. This data set is clean, so I can focus on the analysis. Otherwise, I would need to
remove the unnecessary data with the subset function, and blank cells would have to be changed
to NA and the NA should
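
The cleaning steps mentioned above could look roughly like the sketch below. It is only an illustration: the data frame `raw` and its values are made up, and dropping the rows that contain NA is my assumption about the final step.

```r
# Toy data frame standing in for an uncleaned export (all values are made up).
raw <- data.frame(
  alcohol = c("9.4", "", "10.5", "11.2"),
  pH      = c("3.51", "3.20", "", "3.16"),
  notes   = c("a", "b", "c", "d"),
  quality = c("5", "6", "5", "7"),
  stringsAsFactors = FALSE
)

# Keep only the columns needed for the analysis.
kept <- subset(raw, select = c(alcohol, pH, quality))

# Change blank cells to NA, then convert every column to numeric.
kept[kept == ""] <- NA
kept[] <- lapply(kept, as.numeric)

# Assumed final step: drop every row that still contains an NA.
clean <- na.omit(kept)
print(clean)
```

The wine data set needs none of this, which is why the analysis can start from `read.csv` directly.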

Scatter Plot Matrix

I use this scatter plot matrix to visualize the relationships between the dependent variable
and the 11 independent variables. In this plot it is difficult to see any trend in the data, and
we cannot tell which independent variables have a significant relationship with quality. Because
the relationships are not clear, I will redesign the plot by adding fitted trend lines.

Improved Scatter Plot Matrix

I add fitted trend lines to improve the scatter plot matrix; the red lines show the trend of
the data. The last column and the last row all involve quality. In the last row, we can see
roughly linear relationships between quality and some of the independent variables. For example,
volatile acidity and alcohol clearly show a relationship with quality. However, some independent
variables do not show a clear relationship with quality. So, I will
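
To illustrate the idea of a trend line in every panel, here is a minimal sketch on simulated data. The data frame `toy`, its coefficients and the helper name `trendPanel` are assumptions made for the example. It draws a straight least-squares line with `panel.lmline`, whereas the appendix code uses hexagon binning with `panel.loess`.

```r
library(lattice)

# Simulated stand-in for three of the wine variables (not the real data).
set.seed(1)
n <- 200
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- 3 + 0.3 * alcohol - 1.0 * volatile.acidity + rnorm(n, sd = 0.6)
toy <- data.frame(volatile.acidity, alcohol, quality)

# Hypothetical panel function: points plus a red least-squares line.
trendPanel <- function(x, y, ...) {
  panel.xyplot(x, y, col = gray(0.5), cex = 0.4)
  panel.lmline(x, y, col = "red", lwd = 2)
}

p <- splom(toy, as.matrix = TRUE, xlab = "", pscale = 0,
           varname.cex = 0.8, panel = trendPanel)
print(p)
```

With `as.matrix = TRUE` the panels are laid out like a matrix, so the last row shows quality on the vertical axis against each of the other variables.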

7 variables-Model

This model uses 7 independent variables: volatile acidity, chlorides, free sulfur dioxide,
total sulfur dioxide, pH, sulphates and alcohol. First, the p-values of all 7 independent
variables are less than 0.05, which means each variable is statistically significant. Second,
the p-value of the model as a whole is also less than 0.05, which means the model is
significant. Third, the Multiple R-squared is 0.3595 and the Adjusted R-squared is 0.3567, so
the model explains about 36% of the variance in quality. I consider the model good.
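
The selection logic described above (keep the predictors with p-values below 0.05, then check the overall p-value and R-squared) can be sketched as follows. The data are simulated, so the variable `noise.var` and all coefficients are assumptions for illustration and the printed numbers will not match the wine results.

```r
# Simulated data: three informative predictors and one unrelated column.
set.seed(42)
n <- 500
toy <- data.frame(
  alcohol          = rnorm(n, 10.4, 1.0),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  sulphates        = rnorm(n, 0.66, 0.17),
  noise.var        = rnorm(n)   # unrelated to quality by construction
)
toy$quality <- 2.5 + 0.3 * toy$alcohol - 1.1 * toy$volatile.acidity +
  0.8 * toy$sulphates + rnorm(n, sd = 0.65)

# Full model, then the p-value of each predictor (intercept row dropped).
full  <- lm(quality ~ ., data = toy)
pvals <- summary(full)$coefficients[-1, "Pr(>|t|)"]
keep  <- names(pvals)[pvals < 0.05]

# Reduced model with only the predictors below the 0.05 threshold.
reduced <- lm(reformulate(keep, response = "quality"), data = toy)
s <- summary(reduced)

# Overall p-value of the model from its F statistic.
f <- s$fstatistic
model_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

cat("kept predictors:", keep, "\n")
cat("R-squared:", round(s$r.squared, 4),
    " adjusted:", round(s$adj.r.squared, 4), "\n")
cat("overall p-value:", format.pval(model_p), "\n")
```

The adjusted R-squared is never larger than the multiple R-squared because it penalizes every extra predictor.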

After running the regression, I use the diagnostic plots for linear regression to check
whether the model works well for the data [3]. The diagnostic plots show the residuals in four
different ways, as described below.

I. Residuals vs Fitted

The Residuals vs Fitted plot shows whether the residuals have non-linear patterns [3]. If the
model does not capture a non-linear relationship between the predictor variables and quality,
the pattern shows up in this plot [3]. There is no distinctive pattern in this plot, which
indicates that the model is good; if the red line were shaped like a parabola, the model would
be missing a non-linear relationship. However, three outliers, No. 833, No. 1277 and No. 653,
do not fit the pattern very well. I will keep these three outliers in mind for the next plots,
since they might be potential problems.
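
A minimal sketch of this check on simulated data is given below. The variables `x` and `y` are made up. The point is only to show where the plot and the labelled observation numbers come from.

```r
# Simulated linear relationship (not the wine data).
set.seed(7)
n <- 300
x <- runif(n, 8, 14)
y <- 2 + 0.35 * x + rnorm(n, sd = 0.6)
fit <- lm(y ~ x)

# Residuals vs Fitted: the first of the plot.lm diagnostic plots.
plot(fit, which = 1)

# Row numbers of the three largest absolute residuals, i.e. the kind of
# cases that get a numeric label in the plot.
top3 <- order(abs(residuals(fit)), decreasing = TRUE)[1:3]
print(top3)
```

Because the simulated relationship is truly linear, the red smoother should stay close to the horizontal zero line.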

II. Normal Q-Q

The Normal Q-Q plot shows whether the residuals are normally distributed. In this plot, the
residuals follow the straight line well, which means the model is good. No. 833, No. 1277 and
No. 653 form a tail and do not fit the line very well. Because of these three outliers the
points do not form a perfectly straight line, but the model still holds up.
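
The same check can be done by hand, as in the sketch below on simulated data with normal errors. Using the correlation of the Q-Q points as a rough numeric summary is my own addition, not part of the original analysis.

```r
# Simulated model with normally distributed errors (not the wine data).
set.seed(11)
x <- rnorm(300)
y <- 1 + 2 * x + rnorm(300)
fit <- lm(y ~ x)

# Standardized residuals against theoretical normal quantiles.
r <- rstandard(fit)
qqnorm(r)
qqline(r, col = "red")

# Rough summary: correlation between theoretical and sample quantiles.
# Values close to 1 mean the points lie close to a straight line.
qq <- qqnorm(r, plot.it = FALSE)
cor_qq <- cor(qq$x, qq$y)
print(cor_qq)
```

Heavy tails, like the three labelled outliers, pull the ends of the point cloud away from the line and lower this correlation.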

III. Scale-Location

The Scale-Location plot is also called the Spread-Location plot. It shows whether the
residuals are spread equally along the range of the predictors, which checks the assumption of
equal variance (homoscedasticity). The red line is not a perfectly horizontal line with equally
spread points, which means the assumption is not well met. We can still find the same outliers,
No. 833,
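
To show what a violated equal-variance assumption looks like, the sketch below simulates data whose noise grows with the predictor and rebuilds the Scale-Location plot by hand. The data and the rank-correlation summary `trend` are assumptions for illustration.

```r
# Simulated data with unequal variance: the noise sd grows with x.
set.seed(3)
n <- 400
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.2 * x)
fit <- lm(y ~ x)

# Scale-Location by hand: sqrt(|standardized residuals|) vs fitted values.
s <- sqrt(abs(rstandard(fit)))
plot(fitted(fit), s,
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted(fit), s), col = "red", lwd = 2)

# A clearly positive rank correlation means the spread grows with the fit.
trend <- cor(fitted(fit), s, method = "spearman")
print(trend)
```

Under homoscedasticity the red line would be roughly horizontal and `trend` would be close to zero.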

IV. Residuals vs Leverage

The Residuals vs Leverage plot helps to find influential cases. Not all outliers are
influential in linear regression analysis. This plot has the typical look of a case where the
outliers are not influential: No. 833, No. 1277 and No. 653 do not influence the result,
because the red dashed Cook's distance lines are barely visible, which means the outliers do not
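
Influence can also be checked numerically, as in the sketch below on simulated data. The cutoff of 0.5 matches the first Cook's distance contour that `plot.lm` draws by default. Treating it as a decision threshold is a common convention, not a rule taken from the original analysis.

```r
# Simulated well-behaved regression (not the wine data).
set.seed(5)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Residuals vs Leverage: the fifth of the plot.lm diagnostic plots.
plot(fit, which = 5)

# Cook's distance and leverage for every observation.
cd  <- cooks.distance(fit)
lev <- hatvalues(fit)

# Cases beyond the 0.5 contour would be candidates for influential points.
influential <- which(cd > 0.5)
cat("largest Cook's distance:", round(max(cd), 4), "\n")
cat("cases above 0.5:", length(influential), "\n")
```

An outlier with a large residual but low leverage usually has a small Cook's distance, which is why not every outlier is influential.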

Conclusion

Many other factors can affect the quality of red wine. Some of them relate to smell and
flavor rather than to the chemical properties I used as independent variables. From the
analysis, I identified 7 significant factors: volatile acidity, chlorides, free sulfur dioxide,
total sulfur dioxide, pH, sulphates and alcohol. These factors can influence the quality of

In this project, I used a scatter plot matrix, a linear regression model and diagnostic plots
for linear regression to identify the variables that significantly affect red wine quality. In
future work, I can improve the project by increasing the number of independent variables or by
finding out whether quality is related to price. Additionally, to increase the accuracy of the
model, I would

References

[Accessed 13 12 2017].

[2] "Red Wine Quality," Kaggle Inc, 2017. [Online]. Available: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009. [Accessed 13 12 2017].

[3] B. Kim, "Understanding Diagnostic Plots for Linear Regression Analysis," University of Virginia Library, 21 11 2015. [Online]. Available: http://data.library.virginia.edu/diagnostic-plots/. [Accessed 14 12 2017].

Appendix

df <- read.csv("winequality-red.csv")

library(MASS)

library(ISLR)

library(lattice)

library(hexbin)

library(ggplot2)

library(car) # vif() and qqPlot functions

library(splines)

library(corrplot)

# ggplot2 theme settings. The opening line of this call is missing from the
# source, so the object name wineTheme is an assumption.
wineTheme <- theme(
  plot.title=element_text(hjust=0.5),
  plot.subtitle=element_text(hjust=0.5),
  plot.caption=element_text(hjust=-.5),
  # strip.text.y = element_blank(),
  strip.background=element_rect(fill=rgb(.9,.95,1),
    colour=gray(.5), size=.2),
  panel.border=element_rect(fill=NA,colour=gray(.70)),
  panel.grid.minor.y = element_blank(),
  panel.grid.minor.x = element_blank(),
  panel.spacing.x = unit(0.10,"cm"),
  panel.spacing.y = unit(0.05,"cm"),
  # axis.ticks.y= element_blank()
  axis.ticks=element_blank(),
  axis.text=element_text(colour="black"),
  axis.text.y=element_text(margin=margin(0,3,0,3)),
  axis.text.x=element_text(margin=margin(-1,0,3,0))
)

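# Off-diagonal panels: grid lines plus hexagon-binned counts (no trend line yet)
offDiag <- function(x,y,...){
  panel.grid(h=-1,v=-1,...)
  panel.hexbinplot(x,y,xbins=15,...,border=gray(.7),
    trans=function(x)x^.5)
  # panel.loess(x , y, ..., lwd=2,col='red')
}

# Diagonal panels: density curve scaled to the panel height, then the variable name
onDiag <- function(x, ...){
  yrng <- current.panel.limits()$ylim
  d <- density(x, na.rm=TRUE)
  d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y) )
  panel.lines(d,col=rgb(.83,.66,1),lwd=2)
  diag.panel.splom(x, ...)
}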

splom(df,as.matrix=TRUE,

xlab='',main="Wine Analysis: Selected Variables",

pscale=0, varname.cex=0.8,axis.text.cex=0.6,

axis.text.col="purple",axis.text.font=2,

axis.line.tck=.5,

panel=offDiag,

diag.panel = onDiag

)

offDiag <- function(x,y,...){

panel.grid(h=-1,v=-1,...)

panel.hexbinplot(x,y,xbins=15,...,border=gray(.7),

trans=function(x)x^.5)

panel.loess(x , y, ..., lwd=2,col='red')

}

splom(df, as.matrix=TRUE,

xlab='',main="Wine Analysis: Selected Variables",

pscale=0, varname.cex=0.8,axis.text.cex=0.6,

axis.text.col="purple",axis.text.font=2,

axis.line.tck=.5,

panel=offDiag,

diag.panel = onDiag

)

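# Full model: quality on all 11 predictors. The response is written as
# quality, not df$quality, so that "." does not add quality as a predictor.
model=lm(quality~.,data=df)
summary(model)

# Reduced model with the significant predictors only
model1=lm(quality~volatile.acidity+chlorides+free.sulfur.dioxide+total.sulfur.dioxide
  +pH+sulphates+alcohol,data=df)
summary(model1)

# par(mfrow=c(2,2))
plot(model1)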
