
STAT318 Data Mining

Dr. Blair Robertson


University of Canterbury, Christchurch, New Zealand

Semester 2, 2016

Some of the figures in this presentation are taken from An Introduction to
Statistical Learning, with applications in R (Springer, 2013) with permission
from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

B. Robertson, University of Canterbury

STAT318 Data Mining

1 / 26

Linear regression

Linear regression is a simple parametric approach to
supervised learning that assumes there is an approximately
linear relationship between the predictors X_1, X_2, \ldots, X_p and
the response Y.

Although true regression functions are never linear, linear
regression is an extremely useful and widely used method.


Linear regression: advertising data

[Figure: Sales plotted against TV, Radio, and Newspaper advertising budgets.]

Simple linear regression

In simple (one predictor) linear regression, we assume a model

    Y = \beta_0 + \beta_1 X + \epsilon,

where \beta_0 and \beta_1 are two unknown parameters and \epsilon is an
error term with E(\epsilon) = 0.

Given some parameter estimates \hat{\beta}_0 and \hat{\beta}_1, the prediction of
Y at X = x is given by

    \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.


Estimating the parameters: least squares approach


Let \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i be the prediction of Y at X = x_i, the
predictor value at the ith training observation. Then, the ith
residual is defined as

    e_i = y_i - \hat{y}_i,

where y_i is the response value at the ith training observation.

The least squares approach chooses \hat{\beta}_0 and \hat{\beta}_1 to minimize
the residual sum of squares (RSS)

    RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.


Advertising example

[Figure: Sales versus TV advertising budget with the least squares line.]

    \hat{y} = 7.03 + 0.0475x


Advertising example

[Figure: Contour plot of the RSS on the advertising data, using TV as the predictor.]

Estimating the parameters: least squares approach

Using some calculus, we can show that

    \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

and

    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},

where \bar{x} and \bar{y} are the sample means of x and y, respectively.
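The course computes these estimates in R; as a quick cross-check, the closed-form expressions above can be sketched in plain Python (the data below are made up purely for illustration):

```python
# Least squares estimates for simple linear regression, computed
# directly from the closed-form expressions above.

def simple_ls(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # beta1_hat = sum (x_i - xbar)(y_i - ybar) / sum (x_i - xbar)^2
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
         / sum((xi - xbar) ** 2 for xi in x)
    # beta0_hat = ybar - beta1_hat * xbar
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # roughly linear in x
b0, b1 = simple_ls(x, y)
```

With these made-up data the fit is roughly \hat{y} = 0.14 + 1.96x.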


Assessing the accuracy of the parameter estimates

[Figure: Simulated data with least squares fits; the true model (red) is Y = 2 + 3X + \epsilon, where \epsilon \sim Normal(0, 2).]



Assessing the accuracy of the parameter estimates


The standard errors for the parameter estimates are

    SE(\hat{\beta}_0) = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

and

    SE(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}},

where \sigma = \sqrt{V(\epsilon)}.

Usually \sigma is not known and needs to be estimated from data
using the residual standard error (RSE)

    RSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}}.
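These formulas can be sketched in Python as well (a toy illustration with made-up data; for simple regression p = 1, so the RSE denominator n - p - 1 is n - 2):

```python
import math

# Standard errors and the residual standard error for a simple
# linear regression fit, following the formulas above.

def ls_standard_errors(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))                    # estimates sigma
    se_b0 = rse * math.sqrt(1 / n + xbar ** 2 / sxx)
    se_b1 = rse / math.sqrt(sxx)
    return rse, se_b0, se_b1

rse, se_b0, se_b1 = ls_standard_errors([1, 2, 3, 4, 5],
                                       [2.1, 3.9, 6.2, 8.1, 9.8])
```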

Hypothesis testing
If \beta_1 = 0, then the simple linear model reduces to Y = \beta_0 + \epsilon,
and X is not associated with Y.

To test whether X is associated with Y, we perform a
hypothesis test:

    H_0: \beta_1 = 0 (there is no relationship between X and Y)
    H_A: \beta_1 \neq 0 (there is some relationship between X and Y)

If the null hypothesis is true (\beta_1 = 0), then

    t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}

will have a t-distribution with n - 2 degrees of freedom.



Results for the advertising data set

              Coefficient   Std. Error   t-statistic   p-value
Intercept     7.0325        0.4578       15.36         <0.0001
TV            0.0475        0.0027       17.67         <0.0001


Assessing the overall accuracy


Once we have established that there is some relationship
between X and Y, we want to quantify the extent to which
the linear model fits the data.

The residual standard error (RSE) provides an absolute
measure of lack of fit for the linear model, but it is not always
clear what a good RSE is.

An alternative measure of fit is R-squared (R^2),

    R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},

where TSS is the total sum of squares.
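In code, R^2 is a one-liner once the fitted values are in hand (a sketch with made-up numbers):

```python
# R^2 = 1 - RSS/TSS for a set of responses and fitted values.

def r_squared(y, y_hat):
    ybar = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return 1 - rss / tss

y     = [2.0, 4.0, 6.0, 8.0]       # observed responses
y_hat = [2.5, 3.5, 6.5, 7.5]       # fitted values from some model
r2 = r_squared(y, y_hat)           # here RSS = 1, TSS = 20, so R^2 = 0.95
```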


Results for the advertising data set

Quantity                          Value
Residual standard error (RSE)     3.26
R^2                               0.612

The R^2 statistic has an interpretive advantage over RSE
because it always lies between 0 and 1.

A good R^2 value usually depends on the application.


Multiple linear regression

In multiple linear regression, we assume a model

    Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon,

where \beta_0, \beta_1, \ldots, \beta_p are p + 1 unknown parameters and \epsilon is
an error term with E(\epsilon) = 0.

Given some parameter estimates \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, the prediction
of Y at X = x is given by

    \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \ldots + \hat{\beta}_p x_p.


Multiple linear regression


[Figure: A least squares plane fitted to observations of (X_1, X_2, Y).]

    \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2

Estimating the parameters: least squares approach


The parameters \beta_0, \beta_1, \ldots, \beta_p are estimated using the least
squares approach.

We choose \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p to minimize the sum of squared
residuals

    RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \ldots - \hat{\beta}_p x_{ip})^2.

We will calculate these parameter estimates using R.
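As a sketch of what R does under the hood, the same minimization can be done with numpy's `lstsq`, with a column of ones prepended so that \beta_0 is the intercept (simulated, noiseless data, so the true parameters are recovered exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, size=(n, 2))              # two predictors
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]          # true model, no noise

A = np.column_stack([np.ones(n), X])             # design matrix [1, x1, x2]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # minimizes ||y - A beta||^2
```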


Results for the advertising data

              Coefficient   Std. Error   t-statistic   p-value
Intercept     2.939         0.3119       9.42          <0.0001
TV            0.046         0.0014       32.81         <0.0001
Radio         0.189         0.0086       21.89         <0.0001
Newspaper     -0.001        0.0059       -0.18         0.8599


Is there a relationship between Y and X?

To test whether X is associated with Y, we perform a
hypothesis test:

    H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0 (there is no relationship)
    H_A: at least one \beta_j is non-zero (there is some relationship)

If the null hypothesis is true (no relationship), then

    F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}

will have an F-distribution with parameters p and n - p - 1.
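Dividing the numerator and denominator through by TSS and using R^2 = 1 - RSS/TSS shows the F-statistic can be computed from R^2 alone: F = (R^2/p) / ((1 - R^2)/(n - p - 1)). As a worked check in Python, with R^2 = 0.897, p = 3, and n = 200 (the advertising data's sample size, an assumption not stated on the slide):

```python
# F-statistic expressed in terms of R^2, the number of predictors p,
# and the sample size n.

def f_statistic(r2, n, p):
    return (r2 / p) / ((1 - r2) / (n - p - 1))

F = f_statistic(0.897, 200, 3)   # roughly 569, consistent with the reported 570
```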


Is the model a good fit?


Once we have established that there is some relationship
between the response and the predictors, we want to quantify
the extent to which the multiple linear model fits the data.

The residual standard error (RSE) and R^2 are commonly used.

For the advertising data we have:

Quantity                          Value
Residual standard error (RSE)     1.69
R^2                               0.897
F-statistic                       570


Extensions to the linear model

We can remove the additive assumption and allow for
interaction effects.

Consider the standard linear model with two predictors

    Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon.

An interaction term is included by adding a third predictor to
the standard model

    Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon.
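Computationally, the interaction is just the elementwise product X_1 X_2 added as one more column of the design matrix. A numpy sketch with simulated, noiseless data:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 5, size=100)
x2 = rng.uniform(0, 5, size=100)
y = 1.0 + 0.5 * x1 + 2.0 * x2 + 0.3 * x1 * x2    # model with interaction

# Design matrix [1, x1, x2, x1*x2]; least squares recovers all four betas.
A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
```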


Results for the advertising data

Consider the model

    Sales = \beta_0 + \beta_1 TV + \beta_2 Radio + \beta_3 (TV \times Radio) + \epsilon.

The results are:

              Coefficient   Std. Error   t-statistic   p-value
Intercept     6.7502        0.248        27.23         <0.0001
TV            0.0191        0.002        12.70         <0.0001
Radio         0.0289        0.009        3.24          0.0014
TV x Radio    0.0011        0.000        20.73         <0.0001


Extensions to the linear model

We can accommodate non-linear relationships using
polynomial regression.

Consider the simple linear model

    Y = \beta_0 + \beta_1 X + \epsilon.

Non-linear relationships can be captured by including powers
of X in the model. For example, a quadratic model is

    Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon.
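Polynomial regression is still linear regression: the model is linear in the parameters, and X^2 simply becomes another column of the design matrix. A numpy sketch with simulated, noiseless data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2                   # true quadratic, no noise

# Design matrix with columns [1, x, x^2]; ordinary least squares applies.
A = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
```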


Polynomial regression: Auto data

[Figure: Miles per gallon versus Horsepower for the Auto data, with linear, degree 2, and degree 5 polynomial fits.]

Results for the Auto data


The figure suggests that

    mpg = \beta_0 + \beta_1 Horsepower + \beta_2 Horsepower^2 + \epsilon

may fit the data better than a simple linear model.

The results are:

               Coefficient   Std. Error   t-statistic   p-value
Intercept      56.9001       1.8004       31.6          <0.0001
Horsepower     -0.4662       0.0311       -15.0         <0.0001
Horsepower^2   0.0012        0.0001       10.1          <0.0001


What we did not cover

Qualitative predictors need to be coded using dummy variables
for linear regression (R does this automatically for us).
Deciding on important variables.
Outliers and high leverage points.
Non-constant variance and correlation of error terms.
Collinearity.
