
# STAT318 Data Mining

## Dr. Blair Robertson

University of Canterbury, Christchurch, New Zealand

Semester 2, 2016

Some of the figures in this presentation are taken from *An Introduction to Statistical Learning, with applications in R* (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.


## Linear regression

Linear regression is a simple parametric approach to supervised learning that assumes there is an approximately linear relationship between the predictors $X_1, X_2, \ldots, X_p$ and the response $Y$.

Although true regression functions are never linear, linear regression is an extremely useful and widely used method.

*Figure: Sales plotted against the TV, Radio and Newspaper advertising budgets for the Advertising data.*

In simple (one predictor) linear regression, we assume a model
$$Y = \beta_0 + \beta_1 X + \epsilon,$$
where $\beta_0$ and $\beta_1$ are two unknown parameters and $\epsilon$ is an error term with $E(\epsilon) = 0$.

Given some parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, the prediction of $Y$ at $X = x$ is given by
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$
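A minimal sketch in R of what this model says: simulate data from it and form a prediction. The parameter values and sample size here are made up for illustration.

```r
# Simulate from Y = beta0 + beta1*X + eps and predict at X = x0.
# beta0 = 2, beta1 = 3 and n = 100 are illustrative assumptions.
set.seed(1)
beta0 <- 2; beta1 <- 3
x   <- runif(100, 0, 10)
eps <- rnorm(100)                # error term with E(eps) = 0
y   <- beta0 + beta1 * x + eps

x0    <- 5
y_hat <- beta0 + beta1 * x0     # prediction of Y at X = x0
```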

## Estimating the parameters: least squares approach

Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction of $Y$ at $X = x_i$, the predictor value at the $i$th training observation. Then, the $i$th residual is defined as
$$e_i = y_i - \hat{y}_i,$$
where $y_i$ is the response value at the $i$th training observation.

The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the residual sum of squares (RSS)
$$\text{RSS} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$

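A short sketch of the RSS as a function of candidate parameter values, reusing the simulated `(x, y)` from the sketch above:

```r
# RSS for candidate parameter values (b0, b1).
rss <- function(b0, b1) sum((y - b0 - b1 * x)^2)

rss(2, 3)   # near the true parameters: small RSS
rss(0, 0)   # a poor candidate: much larger RSS
```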

## Advertising example

*Figure: the least squares fit of Sales onto TV for the Advertising data, $\hat{y} = 7.03 + 0.0475x$.*

*Figure: contour plot of the RSS on the advertising data, using TV as the predictor.*

Using some calculus, we can show that
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
and
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively.
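These closed-form estimates are easy to check numerically against `lm()`, using the simulated data from the earlier sketch:

```r
# Closed-form least squares estimates, checked against lm().
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

fit <- lm(y ~ x)
coef(fit)   # should agree with c(beta0_hat, beta1_hat)
```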

*Figure: simulated data; the true model (red) is $Y = 2 + 3X + \epsilon$, where $\epsilon \sim \text{Normal}(0, \sigma^2)$.*

## Assessing the accuracy of the parameter estimates

The standard errors for the parameter estimates are
$$\text{SE}(\hat{\beta}_0) = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$
and
$$\text{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}},$$
where $\sigma = \sqrt{V(\epsilon)}$.

Usually $\sigma$ is not known and needs to be estimated from data using the residual standard error (RSE)
$$\text{RSE} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - p - 1}}.$$
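In R, `summary()` reports these quantities for a fitted model; a sketch, continuing with the fit from the earlier example:

```r
# RSE by hand (p = 1 predictor, so n - p - 1 = n - 2) and the
# standard errors reported by summary().
n   <- length(x)
rse <- sqrt(sum(residuals(fit)^2) / (n - 2))

summary(fit)$sigma                   # matches rse
coef(summary(fit))[, "Std. Error"]   # SE of the two estimates
```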

## Hypothesis testing

If $\beta_1 = 0$, then the simple linear model reduces to $Y = \beta_0 + \epsilon$, and $X$ is not associated with $Y$.

To test whether $X$ is associated with $Y$, we perform a hypothesis test:

$H_0$: $\beta_1 = 0$ (there is no relationship between $X$ and $Y$)

$H_A$: $\beta_1 \neq 0$ (there is some relationship between $X$ and $Y$)

If the null hypothesis is true ($\beta_1 = 0$), then
$$t = \frac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)}$$
will have a t-distribution with $n - 2$ degrees of freedom.
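The same t-statistic and two-sided p-value can be computed by hand and compared with the row `summary()` reports for the slope:

```r
# t-statistic and p-value for H0: beta1 = 0, using the earlier fit.
se1 <- coef(summary(fit))["x", "Std. Error"]
t   <- (coef(fit)["x"] - 0) / se1
2 * pt(abs(t), df = n - 2, lower.tail = FALSE)   # two-sided p-value
```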

Results for the simple linear regression of Sales onto TV for the Advertising data:

|           | Coefficient | Std. Error | t-statistic | p-value |
|-----------|-------------|------------|-------------|---------|
| Intercept | 7.0325      | 0.4578     | 15.36       | <0.0001 |
| TV        | 0.0475      | 0.0027     | 17.67       | <0.0001 |

## Assessing the overall accuracy

Once we have established that there is some relationship between $X$ and $Y$, we want to quantify the extent to which the linear model fits the data.

The residual standard error (RSE) provides an absolute measure of lack of fit for the linear model, but it is not always clear what a good RSE is.

An alternative measure of fit is R-squared ($R^2$),
$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2},$$
where TSS is the total sum of squares.
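Computing $R^2$ from its definition, checked against what `summary()` reports:

```r
# R-squared by hand, using the earlier simple regression fit.
rss_fit <- sum((y - fitted(fit))^2)   # RSS
tss     <- sum((y - mean(y))^2)       # TSS
1 - rss_fit / tss                     # matches summary(fit)$r.squared
```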

## Results for the advertising data set

| Quantity                      | Value |
|-------------------------------|-------|
| Residual standard error (RSE) | 3.26  |
| $R^2$                         | 0.612 |

The $R^2$ statistic has an interpretational advantage over the RSE because it always lies between 0 and 1.

What counts as a good $R^2$ value usually depends on the application.

## Multiple linear regression

In multiple linear regression, we assume a model
$$Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon,$$
where $\beta_0, \beta_1, \ldots, \beta_p$ are $p + 1$ unknown parameters and $\epsilon$ is an error term with $E(\epsilon) = 0$.

Given some parameter estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$, the prediction of $Y$ at $X = x$ is given by
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \ldots + \hat{\beta}_p x_p.$$

*Figure: the least squares plane $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$ fit to observations of $Y$, $X_1$ and $X_2$.*

## Estimating the parameters: least squares approach

The parameters $\beta_0, \beta_1, \ldots, \beta_p$ are estimated using the least squares approach.

We choose $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ to minimize the residual sum of squares
$$\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \ldots - \hat{\beta}_p x_{ip})^2.$$

Results for the multiple linear regression of Sales onto TV, Radio and Newspaper for the Advertising data:

|           | Coefficient | Std. Error | t-statistic | p-value |
|-----------|-------------|------------|-------------|---------|
| Intercept | 2.939       | 0.3119     | 9.42        | <0.0001 |
| TV        | 0.046       | 0.0014     | 32.81       | <0.0001 |
| Radio     | 0.189       | 0.0086     | 21.89       | <0.0001 |
| Newspaper | -0.001      | 0.0059     | -0.18       | 0.8599  |
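A table like this comes straight out of `lm()`; a sketch, assuming `Advertising.csv` has been downloaded from the book's website and its column names match those used on these slides:

```r
# Multiple linear regression on the Advertising data (assumes
# Advertising.csv is in the working directory with columns
# Sales, TV, Radio and Newspaper).
ads <- read.csv("Advertising.csv")
fit_mlr <- lm(Sales ~ TV + Radio + Newspaper, data = ads)
coef(summary(fit_mlr))   # coefficient table as above
```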

To test whether the predictors are associated with $Y$, we perform a hypothesis test:

$H_0$: $\beta_1 = \beta_2 = \ldots = \beta_p = 0$ (there is no relationship)

$H_A$: at least one $\beta_j$ is non-zero (there is some relationship)

If the null hypothesis is true (no relationship), then
$$F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)}$$
will have an F-distribution with $p$ and $n - p - 1$ degrees of freedom.
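The F-statistic can be computed from its definition and compared with the one `summary()` reports, continuing with the Advertising fit above:

```r
# Overall F-statistic by hand for the Advertising fit (p = 3).
rss_m <- sum(residuals(fit_mlr)^2)
tss_m <- sum((ads$Sales - mean(ads$Sales))^2)
p <- 3; n_obs <- nrow(ads)
((tss_m - rss_m) / p) / (rss_m / (n_obs - p - 1))

summary(fit_mlr)$fstatistic   # value, numdf, dendf
```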

## Is the model a good fit?

Once we have established that there is some relationship between the response and the predictors, we want to quantify the extent to which the multiple linear model fits the data.

The residual standard error (RSE) and $R^2$ are commonly used. For the advertising data we have:

| Quantity                      | Value |
|-------------------------------|-------|
| Residual standard error (RSE) | 1.69  |
| $R^2$                         | 0.897 |
| F-statistic                   | 570   |

We can remove the additive assumption and allow for interaction effects.

Consider the standard linear model with two predictors
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon.$$

An interaction term is included by adding a third predictor to the standard model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon.$$
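In R's formula syntax, `TV * Radio` expands to `TV + Radio + TV:Radio`, so an interaction model like the one on the next slide can be fit in one line (reusing the `ads` data frame from the earlier sketch):

```r
# Interaction model: main effects plus the TV-by-Radio interaction.
fit_int <- lm(Sales ~ TV * Radio, data = ads)
coef(summary(fit_int))
```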

Consider the model
$$\text{Sales} = \beta_0 + \beta_1\,\text{TV} + \beta_2\,\text{Radio} + \beta_3\,(\text{TV} \times \text{Radio}) + \epsilon.$$
The results are:

|            | Coefficient | Std. Error | t-statistic | p-value |
|------------|-------------|------------|-------------|---------|
| Intercept  | 6.7502      | 0.248      | 27.23       | <0.0001 |
| TV         | 0.0191      | 0.002      | 12.70       | <0.0001 |
| Radio      | 0.0289      | 0.009      | 3.24        | 0.0014  |
| TV×Radio   | 0.0011      | 0.000      | 20.73       | <0.0001 |

We can accommodate non-linear relationships using polynomial regression.

Consider the simple linear model
$$Y = \beta_0 + \beta_1 X + \epsilon.$$

Non-linear relationships can be captured by including powers of $X$ in the model. For example, a quadratic model is
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon.$$
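A quadratic fit in R, using the Auto data from the ISLR package (the data set behind the next slide); the power must be wrapped in `I()` so `^` is treated arithmetically, or `poly(horsepower, 2)` can be used instead:

```r
# Quadratic regression of mpg on horsepower for the Auto data.
library(ISLR)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
coef(summary(fit_quad))
```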

*Figure: mpg versus Horsepower for the Auto data, with linear, degree-2 and degree-5 polynomial fits.*

## Results for the auto data

The figure suggests that
$$\text{mpg} = \beta_0 + \beta_1\,\text{Horsepower} + \beta_2\,\text{Horsepower}^2 + \epsilon$$
may fit the data better than a simple linear model. The results are:

|                 | Coefficient | Std. Error | t-statistic | p-value |
|-----------------|-------------|------------|-------------|---------|
| Intercept       | 56.9001     | 1.8004     | 31.6        | <0.0001 |
| Horsepower      | -0.4662     | 0.0311     | -15.0       | <0.0001 |
| Horsepower$^2$  | 0.0012      | 0.0001     | 10.1        | <0.0001 |

Qualitative predictors need to be coded using dummy variables for linear regression (R does this automatically for us; see the sketch below).

Other considerations include:

- Deciding on important variables.
- Outliers and high leverage points.
- Non-constant variance and correlation of error terms.
- Collinearity.
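A small made-up example of the dummy coding R performs behind the scenes when a qualitative predictor is a factor:

```r
# R converts a factor into 0/1 dummy columns automatically.
region <- factor(c("East", "West", "South", "East", "West"))
model.matrix(~ region)   # baseline level (East) is absorbed by the intercept
```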