Course Notes by Darrell Aucoin on Applied Linear Models

Oct 13, 2013

© Attribution Non-Commercial (BY-NC)

Darrell Aucoin
Email: daucoin@uwaterloo.ca

Teacher: Dr. Leilei Zeng
Office: M3-4223
Office Hours: T & Th, 2:30-3:30 PM
Email: lzeng@uwaterloo.ca

University of Waterloo
Undergraduate Advisor: Diana Skrzydlo

TAs: Saad Khan (M3-3108, Space 1; Office Hours: Wednesday 4:00-5:00, Friday 1:30-2:30 in M3-3111), Yu Nakajima, Zi Tian, Jianfeng Zhang

October 8, 2013

Contents

1 Introduction 4
  1.1 Definitions 4

2 Review of Simple Linear Regression Model 6
  2.1 The Model 6
    2.1.1 Assumptions about $\epsilon$ (Gauss-Markov Assumptions) 6
      2.1.1.1 Assumption Implications 7
    2.1.2 Regression Parameters 7
  2.2 The Least-Squares Estimator (LSE) 7
  2.3 The Properties of $\hat\beta_0$ and $\hat\beta_1$ 8
    2.3.1 Consequence of LS Fitting 10
  2.4 The Estimation of $\sigma^2$ 11
  2.5 Confidence Intervals and Hypothesis Testing 14
    2.5.1 The t-test Statistic 14
  2.6 Value Prediction for Future Values 15
    2.6.1 Some properties of $\hat y_p$ 15
  2.7 Mean Prediction for Future Values 16
  2.8 Analysis of Variance (ANOVA) for Testing $H_0: \beta_1 = 0$ 16
    2.8.1 F-Distribution 18
    2.8.2 Terminologies of ANOVA 18
    2.8.3 Coefficient of Determination $R^2$ 18

3 Review of Random Vectors and Matrix Algebra 20
  3.1 Definitions 20
  3.2 Basic Properties 20
  3.3 Differentiating Over Linear and Quadratic Forms 21
  3.4 Some Useful Results on a Matrix 21
  3.5 Multivariate Normal Distribution 23

4 Multiple Linear Regression Model 24
  4.1 The Model 24
    4.1.1 Assumptions of Model 24
    4.1.2 Regression Coefficients $\beta_1, \ldots, \beta_p$ 24
  4.2 LSE of $\beta$ 25
    4.2.1 Properties of LSE $\hat\beta$ 26
    4.2.2 Some Useful Results 26
  4.3 Residuals Relationship with the Hat Matrix 27
  4.4 An Estimation of $\sigma^2$ 28
  4.5 Sampling Distribution of $\hat\beta$, $\hat\sigma^2$ under Normality 29
  4.6 Prediction 33
  4.7 ANOVA Table 34

5 Model and Model Assumptions 37
  5.1 Model and Model Assumptions 37
    5.1.1 Basic Model Assumptions 37
  5.2 Relationship Between Residuals and Random Errors 38
    5.2.1 Statistical Properties of r 39
  5.3 Residual Plot for Checking $E[\epsilon_i] = 0$ 39
    5.3.1 Residuals Versus $x_j$ 40
    5.3.2 Partial Residuals Versus $x_j$ 40
      5.3.2.1 Example 40
    5.3.3 Added-Variable Plots 41
  5.4 Residual Plots for Checking Constant Variance $Var(\epsilon_i) = \sigma^2$ 42
  5.5 Residual Plots for Checking Normality of $\epsilon_i$s 42
    5.5.1 Standardized Residual 43
  5.6 Residual Plots for Detecting Correlation in $\epsilon_i$s 43
    5.6.1 Consequence of Correlation in $\epsilon_i$? 44
    5.6.2 The Durbin-Watson Test 44

6 Model Evaluation: Data Transformation 46
  6.1 Box-Cox Transformation 46
    6.1.1 Remarks on Data Transformation 46
  6.2 Logarithmic Transformation 49
    6.2.1 Logarithmic Transformation of y Only 49
      6.2.1.1 Interpretation of $\beta_j$ 49
    6.2.2 Logarithmic Transformation of All Variables 50
      6.2.2.1 Interpretation of $\beta_j$ 50
    6.2.3 Logarithmic Transformation of y and Some $x_i$s 50
    6.2.4 95% CI for Transformed Estimate 50
      6.2.4.1 There are Two Ways to Get CI 50
  6.3 Transformation for Stabilizing Variance 51
  6.4 Some Remedies for Non-Linearity: Polynomial Regression 51

7 Model Evaluation: Outliers and Influential Cases 53
  7.1 Outlier 53
    7.1.1 How to Detect Outliers? 53
  7.2 Hat Matrix and Leverage 54
  7.3 Cook's Distance 55
    7.3.1 Cook's D Statistic 56
  7.4 Outliers and Influential Cases: Remove or Keep? 57

8 Model Building and Selection 58
  8.1 More Hypothesis Testing 58
    8.1.1 Testing Some But Not All $\beta$s 58
      8.1.1.1 Extra Sum of Squares Principle 59
      8.1.1.2 Alternative Formulas for $F_0$ 59
      8.1.1.3 ANOVA (Version 1) 60
      8.1.1.4 ANOVA (Version 2) (not including $\beta_0$) 60
    8.1.2 The General Linear Hypothesis 60
      8.1.2.1 The Test 61
  8.2 Categorical Predictors and Interaction Terms 61
    8.2.1 Binary Predictor 61
    8.2.2 Interaction Terms 62
    8.2.3 Categorical Predictor with More Than 2 Levels 62
      8.2.3.1 Dummy Variables 62
      8.2.3.2 Testing Overall Effect of a Categorical Predictor 63
  8.3 Modeling Interactions With Categorical Predictors 63
  8.4 The Principle of Marginality 64
  8.5 Variable Selection 64
    8.5.1 Backward Elimination 65
    8.5.2 Forward Selection 68
    8.5.3 Stepwise Regression 69
    8.5.4 All Subsets Regressions 70
      8.5.4.1 $R^2$ Comparison 70
      8.5.4.2 $R^2_{adj}$ Comparison 70
      8.5.4.3 Mallows $C_k$ Comparison 70
      8.5.4.4 AIC (Akaike's Information Criterion) 70

9 Multicollinearity in Regression Models 75
  9.1 Multicollinearity 75
  9.2 Consequence of Multicollinearity 76
  9.3 Detection of Multicollinearity Among $x_1, \ldots, x_p$ 76
    9.3.1 Formal Check of Multicollinearity: Variance Inflation Factors (VIF) 77
  9.4 Ridge Regression 77
    9.4.1 Minimize Subject to Constraints (Lagrange Multiplier Method) 77

Get the book Linear Models with R

References: Oxford Dictionary of Statistics, Regression Modeling

Chapter 1

Introduction

1.1 Definitions

Definition 1.1. Response Variable (y): The dependent or outcome variable in a study. This is the primary variable of interest in a study.

Example 1.1. Yield of a crop, performance of a stock, etc.

Definition 1.2. Explanatory Variable(s) ($x_i$): Also called the independent, antecedent, background, predictor, or controlled variable(s); these help predict the response variable.

Example 1.2. Type of fertilizer, temperature, average rainfall, quarterly returns, etc.

Definition 1.3. Regression: Regression deals with the functional relationship between a response (or outcome) variable y and one or more explanatory variables (or predictor variables) $x_1, x_2, \ldots, x_p$. A general expression for a regression model is

$$y = f(x_1, x_2, \ldots, x_p) + \epsilon$$

where

- the function $f(x_1, x_2, \ldots, x_p)$ represents the deterministic relationship between y and $x_1, x_2, \ldots, x_p$;
- $\epsilon$ represents unexplained variation in y due to other factors.

Remark 1.1. $y$ and $\epsilon$ are considered the only random variables in this model, with

$$Var(y) = Var(\epsilon) = \sigma^2$$

The $x_1, x_2, \ldots, x_p$ are considered to be deterministic.

Example. Examples of applications:

- Global climate (linking climate change to man-made activities): y = surface temperature; x's = greenhouse gases
- Finance: y = stock price index; x's = unemployment rate, money supply, etc.
- Economics: y = unemployment rate; x's = interest rate

Regression modeling can be used for:

- identifying important factors (or explanatory variables)
- estimation
- prediction

In Stat 231, you saw only the simplest form of the regression model

$$y = \beta_0 + \beta_1 x + \epsilon$$

where we have only one explanatory variable x, and the form of f(x) is assumed to be known as a linear function.

Example 1.3. Linear function

$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$

Example 1.4. Non-linear function

$$y = \beta_0 + \beta_1 \exp(\beta_2 x)$$

Here "linear" means linear in the parameters: if the derivative with respect to any $\beta$ still has that $\beta$ in it, then it is a non-linear function.

Stat 331 extends the discussion to p explanatory variables:

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon$$

Note. In this course we will use the model

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon$$

where $\beta_0, \beta_1, \ldots, \beta_p$ are constants in the linear function; we normally call them regression parameters (or coefficients). Note that the $\beta$s are unknown and are estimated from the data.
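As a concrete illustration of this model, here is a minimal Python (NumPy) sketch that fits $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$ by least squares; all data values are invented for illustration only.

```python
import numpy as np

# Hypothetical data: n = 6 observations, p = 2 explanatory variables.
X = np.array([[1.0, 2.0, 0.0],   # leading column of 1s multiplies beta_0
              [1.0, 1.0, 1.0],
              [1.0, 4.0, 3.0],
              [1.0, 3.0, 5.0],
              [1.0, 5.0, 2.0],
              [1.0, 6.0, 4.0]])
y = np.array([3.1, 4.0, 9.2, 12.1, 9.8, 13.0])

# Least-squares estimate of (beta_0, beta_1, beta_2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
residuals = y - fitted
print(beta_hat)
```

The normal equations force the residuals to be orthogonal to every column of X, so with an intercept column they sum to zero.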

Chapter 2

Review of Simple Linear Regression Model

2.1 The Model

Let y be the response variable and x be the only explanatory variable. The simple linear regression model is given by

$$y = \beta_0 + \beta_1 x + \epsilon$$

where $\beta_0 + \beta_1 x$ represents the systematic relationship and $\epsilon$ is random error. $\beta_0$ and $\beta_1$ are unknown regression parameters. $y$ and $\epsilon$ are considered random variables; $x_i$ is considered (in this course) a non-random variable.

Suppose we observe n pairs of values $\{(y_i, x_i) \mid i = 1, 2, \ldots, n\}$ on y and x from a random sample of subjects. Then for the $i$-th observation, we have

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

[Figure: scatter plot of Response (y) versus Medical Treatment (x), showing the observed points $(x_i, y_i)$, $i = 1, 2, 3, \ldots$, and the regression line $y = \beta_0 + \beta_1 x$.]

2.1.1 Assumptions about $\epsilon$ (Gauss-Markov Assumptions)

Formally, we make a number of assumptions about $\epsilon_1, \ldots, \epsilon_n$ (the Gauss-Markov assumptions, conditional on $x_i$):

1. $E[\epsilon_i] = 0$
2. $\epsilon_1, \ldots, \epsilon_n$ are statistically independent
3. $Var(\epsilon_i) = \sigma^2$
4. $\epsilon_i$ is normally distributed for $i = 1, \ldots, n$.

These four assumptions are often summarized as saying that $\epsilon_1, \ldots, \epsilon_n$ are independent and identically distributed (iid) $N(0, \sigma^2)$.

In particular, assumption 1 is needed to ensure that a linear relationship between y and x is appropriate.

2.1.1.1 Assumption Implications

Assumption 1 implies that

$$E[y_i] = E[y_i \mid x_i] = \beta_0 + \beta_1 x_i$$

Assumption 2 implies $y_1, \ldots, y_n$ are independent.

Assumption 3 implies $Var(y_i) = \sigma^2$ (constant over $x_i$).

Assumption 4 implies that $y_i$ is normally distributed.

So, equivalently, we can summarize that $y_1, \ldots, y_n$ are independent and normally distributed such that

$$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$$

2.1.2 Regression Parameters

The two unknown regression parameters $\beta_0$ and $\beta_1$:

- $\beta_0$ is the intercept;
- $\beta_1$ is the slope and is of primary interest:

$$\beta_1 = E[y \mid x = a + 1] - E[y \mid x = a]$$

If $\beta_1 = 0$, then $E[y \mid x] = \beta_0$.

2.2 The Least-Squares Estimator (LSE)

Suppose we let $\hat\beta_0$ and $\hat\beta_1$ be the chosen estimators for $\beta_0$ and $\beta_1$, respectively. The fitted value for $y_i$ from the regression line is

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$

Then the least-squares criterion chooses $\hat\beta_0$ and $\hat\beta_1$ to make the residuals $r_i = y_i - \hat y_i$ small. Specifically, the LSE of $\beta_0$ and $\beta_1$ are chosen to minimize the sum of squared residuals:

$$\min_{\beta_0, \beta_1} S(\beta_0, \beta_1) = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

The LSE of $\beta_0$ and $\beta_1$ are

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} \qquad (2.2.1)$$

or, equivalently, $\hat\beta_1 = s_{x,y}/s_{x,x}$.

Note. In this course we occasionally use $y_i$ to denote the random variable from the $i$-th subject of a sample and sometimes the value (number) actually observed. Similarly, $\hat\beta_0, \hat\beta_1$ will be used as estimators (random) and for particular estimates calculated from some data.
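The closed-form LSE formulas can be checked numerically against a generic least-squares solver; a minimal Python (NumPy) sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)  # true beta0 = 2, beta1 = 0.5

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))

beta1_hat = Sxy / Sxx                # slope, eq. (2.2.1)
beta0_hat = ybar - beta1_hat * xbar  # intercept

# Cross-check against NumPy's least-squares solver
X = np.column_stack([np.ones_like(x), x])
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta0_hat, beta1_hat)
```

Both routes solve the same minimization, so the two answers agree to numerical precision.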

2.3 The Properties of $\hat\beta_0$ and $\hat\beta_1$

We have the following properties of the LSE:

1. Unbiasedness:
$$E[\hat\beta_0] = \beta_0, \qquad E[\hat\beta_1] = \beta_1$$

2. The theoretical variances of $\hat\beta_0$ and $\hat\beta_1$:
$$Var(\hat\beta_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum (x_i - \bar x)^2} \right], \qquad Var(\hat\beta_1) = \frac{\sigma^2}{\sum (x_i - \bar x)^2}$$

3. Covariance:
$$Cov(\hat\beta_0, \hat\beta_1) = -\frac{\sigma^2 \bar x}{\sum (x_i - \bar x)^2}$$

Proof. To prove the results related to $\hat\beta_1$ we write

$$\hat\beta_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2} = \frac{\sum (x_i - \bar x)\, y_i}{\sum (x_i - \bar x)^2} - \bar y\, \frac{\sum (x_i - \bar x)}{\sum (x_i - \bar x)^2}$$

Since $\sum (x_i - \bar x) = 0$, the second term vanishes and

$$\hat\beta_1 = \sum c_i y_i, \qquad \text{where } c_i = \frac{x_i - \bar x}{\sum (x_i - \bar x)^2}$$

Hence, because $y_1, \ldots, y_n$ are independent variables, the expectation of $\hat\beta_1$ is

$$E[\hat\beta_1] = \sum c_i E[y_i] = \sum c_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum c_i + \beta_1 \sum c_i x_i$$

Since $\sum (x_i - \bar x) = 0$ implies $\sum c_i = 0$, we get $E[\hat\beta_1] = \beta_1 \sum c_i x_i$. Now

$$\sum c_i x_i = \frac{\sum (x_i - \bar x)\, x_i}{\sum (x_i - \bar x)^2} = \frac{\sum (x_i - \bar x)\big[(x_i - \bar x) + \bar x\big]}{\sum (x_i - \bar x)^2} = 1 + \bar x\, \frac{\sum (x_i - \bar x)}{\sum (x_i - \bar x)^2} = 1$$

so that $E[\hat\beta_1] = \beta_1$.

Similarly, since the $y_i$ are independent,

$$Var(\hat\beta_1) = Var\Big( \sum c_i y_i \Big) = \sum c_i^2\, Var(y_i) = \frac{\sum (x_i - \bar x)^2}{\big[\sum (x_i - \bar x)^2\big]^2}\, \sigma^2 = \frac{\sigma^2}{\sum (x_i - \bar x)^2}$$

Result:

$$\hat\beta_1 \sim N\left( \beta_1, \frac{\sigma^2}{\sum (x_i - \bar x)^2} \right)$$
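The unbiasedness and variance results can be checked by simulation; a minimal Python (NumPy) sketch with a fixed design and many replicated error draws:

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = np.arange(10.0)                  # fixed design, n = 10
Sxx = np.sum((x - x.mean()) ** 2)    # = 82.5 for x = 0, ..., 9

reps = 2000
b1 = np.empty(reps)
for k in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=x.size)
    b1[k] = np.sum((x - x.mean()) * (y - y.mean())) / Sxx

print(b1.mean())                      # should be near beta1 = 0.5
print(b1.var(), sigma**2 / Sxx)       # empirical vs theoretical variance
```

Over many replications the sample mean of $\hat\beta_1$ approaches $\beta_1$ and its sample variance approaches $\sigma^2/S_{xx}$.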

2.3.1 Consequence of LS Fitting

1. $\sum r_i = 0$
2. $\sum r_i x_i = 0$
3. $\sum r_i \hat y_i = 0$
4. The point $(\bar x, \bar y)$ is always on the fitted regression line.

Proof. In matrix form, with design matrix X (a column of 1s and the explanatory variables) and hat matrix $H = X(X^T X)^{-1} X^T$:

$$r = y - \hat y = (I - H)\, y$$

$$X^T r = X^T (I - H)\, y = X^T \big[ I - X (X^T X)^{-1} X^T \big] y = X^T y - X^T y = 0$$

which gives $\sum r_i = 0$ (first row of $X^T r$) and $\sum r_i x_i = 0$ (remaining rows). For property 3,

$$\hat y^T r = \sum_{i=1}^{n} \hat y_i r_i = \sum_{i=1}^{n} \big( \hat\beta_0 + \hat\beta_1 x_{1,i} + \cdots + \hat\beta_p x_{p,i} \big)\, r_i = \hat\beta_0 \sum_{i=1}^{n} r_i + \hat\beta_1 \sum_{i=1}^{n} x_{1,i} r_i + \cdots + \hat\beta_p \sum_{i=1}^{n} x_{p,i} r_i$$

and since $\sum r_i = 0$ and $\sum r_i x_i = 0$, it follows that $\hat y^T r = 0$.

For property 4: the fitted line is $\hat y = \hat\beta_0 + \hat\beta_1 x$ with $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$, so at $x = \bar x$,

$$\hat y = \bar y - \hat\beta_1 \bar x + \hat\beta_1 \bar x = \bar y$$
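These consequences of LS fitting can be verified directly with the hat matrix; a minimal Python (NumPy) sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 12)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
y_hat = H @ y
r = (np.eye(x.size) - H) @ y              # residuals r = (I - H) y

# The three orthogonality identities from 2.3.1
print(r.sum(), (r * x).sum(), (r * y_hat).sum())
```

Note also that H is idempotent ($H^2 = H$), and $\sum r_i = 0$ forces the mean of the fitted values to equal $\bar y$, i.e. $(\bar x, \bar y)$ lies on the fitted line.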

2.4 The Estimation of $\sigma^2$

Note 1. We can re-write the model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ as

$$\epsilon_i = y_i - \beta_0 - \beta_1 x_i$$

to emphasize the analogy with the residuals

$$r_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$$

We could say that the $r_i$ (which can be calculated) estimate the unobservable $\epsilon_i$. The basic idea is then to use the sample variance of $r_1, \ldots, r_n$ to estimate the unknown $Var(\epsilon_i) = \sigma^2$. The sample variance of $r_1, \ldots, r_n$,

$$\frac{1}{n-1} \sum_{i=1}^{n} (r_i - \bar r)^2,$$

is actually not unbiased:

$$E\left[ \frac{1}{n-1} \sum_{i=1}^{n} (r_i - \bar r)^2 \right] = \frac{n-2}{n-1}\, \sigma^2 \ne \sigma^2$$

The unbiased estimator of $\sigma^2$ is defined as

$$\hat\sigma^2 = S^2 = \frac{1}{n-2} \sum_{i=1}^{n} (r_i - \bar r)^2$$

CHAPTER 2. REVIEW OF SIMPLE LINEAR REGRESSION MODEL 12

Proof. Look this up in the assignment solutions

r

i

= y

i

y

r =

1

n

n

i=1

r

i

r = 0

(r

i

r)

2

=

(y

i

y)

2

E

_

(r

i

r)

2

_

=

E

_

(y

i

y)

2

_

E

_

(r

i

r)

2

_

=

E

_

_

y

i

1

x

i

_

2

_

0

= y

1

x =

E

_

(r

i

r)

2

_

=

E

_

_

y

i

_

y

1

x

_

1

x

i

_

2

_

E

_

(r

i

r)

2

_

=

E

_

_

y

i

y +

1

x

1

x

i

_

2

_

E

_

(r

i

r)

2

_

=

E

_

_

(y

i

y)

1

(x

i

x)

_

2

_

1

=

S

x,y

S

2

x

=

E

_

(r

i

r)

2

_

=

E

_

_

(y

i

y)

S

x,y

S

2

x

(x

i

x)

_

2

_

E

_

(r

i

r)

2

_

=

E

_

(y

i

y)

2

2

S

x,y

S

2

x

(x

i

x) (y

i

y) +

_

S

x,y

S

2

x

_

2

(x

i

x)

2

_

E

_

(r

i

r)

2

_

= E

_

(y

i

y)

2

_

2E

_

S

x,y

S

2

x

(x

i

x) (y

i

y)

_

+E

_

_

S

x,y

S

2

x

_

2

(x

i

x)

2

_

E

_

(r

i

r)

2

_

= E

_

S

2

y

2E

_

S

x,y

S

2

x

S

x,y

_

+E

_

_

S

x,y

S

2

x

_

2

S

2

x

_

E

_

(r

i

r)

2

_

= E

_

S

2

y

2E

_

S

2

x,y

S

2

x

_

+E

_

S

2

x,y

S

2

x

_

E

_

(r

i

r)

2

_

= E[S

y,y

] E

_

S

2

x,y

S

2

x

_

E

_

(r

i

r)

2

_

= E[S

y,y

]

E

_

S

2

x,y

S

2

x

CHAPTER 2. REVIEW OF SIMPLE LINEAR REGRESSION MODEL 13

E[S

y,y

] = E

_

y

2

i

n y

2

_

E[S

y,y

] =

E

_

y

2

i

nE

_

y

2

E[S

y,y

] =

_

V ar (y

i

) +E

2

[y

i

]

_

n

_

V ar ( y) +E

2

[ y]

_

E[S

y,y

] =

2

+E

2

[y

i

]

_

n

_

2

n

+E

2

[ y]

_

E[S

y,y

] = n

2

+

_

E

2

[y

i

]

_

2

nE

2

[ y]

E[S

y,y

] = (n 1)

2

+

_

E

2

[y

i

]

_

nE

2

[ y]

E[S

y,y

] = (n 1)

2

+

(

0

+

1

x

i

)

2

n(

0

+

1

x)

2

E[S

y,y

] = (n 1)

2

+

2

0

+

0

1

x

i

+

2

1

x

2

i

_

n

_

2

0

+

0

1

x +

2

1

x

2

_

E[S

y,y

] = (n 1)

2

+n

2

0

n

2

0

+

0

x

i

n

0

1

x +

2

1

x

2

i

n

2

1

x

2

E[S

y,y

] = (n 1)

2

+

0

x

i

n

1

n

x

i

+

2

1

x

2

i

n

2

1

x

2

E[S

y,y

] = (n 1)

2

+

2

1

_

x

2

i

x

2

_

E[S

y,y

] = (n 1)

2

+

2

1

S

x,x

E

_

S

2

x,y

= V ar (S

x,y

) +E

2

[S

x,y

]

E

_

S

2

x,y

= V ar

_

(x

i

x) (y

i

y)

_

+E

2

_

(x

i

x) (y

i

y)

_

E

_

S

2

x,y

= V ar

_

(x

i

x) (y

i

y)

_

+E

2

_

(x

i

x) (y

i

y)

_

E

_

S

2

x,y

= V ar

_

(x

i

x) (y

i

y)

_

+

_

(x

i

x) E[y

i

y]

_

2

E

_

S

2

x,y

= V ar

_

(x

i

x) y

i

y

(x

i

x)

_

+

_

(x

i

x) (

0

+

1

x

i

1

x)

_

2

(x

i

x) = 0 =

E

_

S

2

x,y

= V ar

_

(x

i

x) y

i

_

+

_

(x

i

x) (

1

x

i

1

x)

_

2

E

_

S

2

x,y

(x

i

x)

2

V ar (y

i

) +

2

1

_

(x

i

x)

2

_

2

E

_

S

2

x,y

= S

x,x

_

2

+

1

S

x,x

_

E

_

(r

i

r)

2

_

= E[S

y,y

]

E

_

S

2

x,y

S

x,x

E

_

(r

i

r)

2

_

= (n 1)

2

+

2

1

S

x,x

S

x,x

_

2

+

1

S

x,x

_

S

x,x

E

_

(r

i

r)

2

_

= (n 1)

2

+

2

1

S

x,x

2

+

1

S

x,x

_

E

_

(r

i

r)

2

_

= (n 2)

2

2.5 Confidence Intervals and Hypothesis Testing

Recall that $\hat\beta_1 \sim N(\beta_1, \sigma^2/S_{xx})$, so

$$\frac{\hat\beta_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0, 1)$$

By definition,

$$P\left( -1.96 < \frac{\hat\beta_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} < 1.96 \right) = 0.95$$

$$P\left( \hat\beta_1 - 1.96\, \frac{\sigma}{\sqrt{S_{xx}}} < \beta_1 < \hat\beta_1 + 1.96\, \frac{\sigma}{\sqrt{S_{xx}}} \right) = 0.95$$

so the 95% CI for $\beta_1$ is $\hat\beta_1 \pm 1.96\, \sigma/\sqrt{S_{xx}}$ when $\sigma^2$ is known.

In most practice, $\sigma^2$ is unknown. When we replace $\sigma^2$ by $S^2$, the unknown standard deviation $\sigma/\sqrt{S_{xx}}$ is replaced by the standard error

$$SE(\hat\beta_1) = \sqrt{\frac{S^2}{S_{xx}}}, \qquad \text{where } S^2 = \frac{1}{n-2} \sum r_i^2$$

The standardized $\hat\beta_1$ random variable becomes

$$\frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$$

which is no longer standard normal but has a t-distribution with $n-2$ degrees of freedom. A $100(1-\alpha)\%$ confidence interval for $\beta_1$ is

$$CI_{100(1-\alpha)\%}(\beta_1) = \hat\beta_1 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_1)$$

Hypothesis tests are derived and computed in a similar way. To test

$$H_0: \beta_1 = \beta_1^* \qquad \text{vs} \qquad H_a: \beta_1 \ne \beta_1^*$$

we use the t-statistic

$$t = \frac{\hat\beta_1 - \beta_1^*}{SE(\hat\beta_1)}$$

which has a $t_{n-2}$ distribution when $H_0$ is true.

2.5.1 The t-test Statistic

$$t = \frac{\hat\beta_1 - \beta_1^*}{SE(\hat\beta_1)} \sim t_{n-2} \qquad \text{if } H_0: \beta_1 = \beta_1^* \text{ is true}$$

[Figure: the $t_{n-2}$ density with two-sided rejection regions beyond $\pm t_{n-2,\alpha/2}$, each of area $\alpha/2$.]

Formally, if

$$|t| = \left| \frac{\hat\beta_1 - \beta_1^*}{SE(\hat\beta_1)} \right| > t_{n-2,\alpha/2}$$

there is evidence to reject $H_0: \beta_1 = \beta_1^*$ at significance level $\alpha$. Otherwise, we cannot reject $H_0$.
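The CI and t-test computations can be sketched in Python (NumPy) on simulated data; the critical value $t_{8, 0.025} = 2.306$ is hard-coded here to avoid a SciPy dependency.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 9, 10)                 # n = 10, so df = n - 2 = 8
y = 1.0 + 0.8 * x + rng.normal(0, 1.0, size=x.size)
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
r = y - (b0 + b1 * x)
S2 = np.sum(r ** 2) / (n - 2)             # unbiased estimate of sigma^2
se_b1 = np.sqrt(S2 / Sxx)                 # standard error of slope

t_crit = 2.306                            # t_{8, 0.025} (95%, two-sided)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

t_stat = b1 / se_b1                       # test of H0: beta_1 = 0
print(ci, t_stat)
```

The interval is centered at $\hat\beta_1$ with half-width $t_{n-2,\alpha/2}\,SE(\hat\beta_1)$.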

2.6 Value Prediction for Future Values

The fitted value

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$

refers to an x which is part of the sample data. Suppose instead we predict a single future value at a given $x = x_p$. The future value is given as

$$y_p = \beta_0 + \beta_1 x_p + \epsilon_p$$

where $\epsilon_p$ is the future error. Naturally, we replace $\epsilon_p$ by its expectation (zero) and use

$$\hat y_p = \hat\beta_0 + \hat\beta_1 x_p$$

to predict $y_p$.

2.6.1 Some properties of $\hat y_p$:

1. $E[y_p - \hat y_p] = 0$ (an unbiased prediction)

2. $Var(y_p - \hat y_p) = \left[ 1 + \dfrac{1}{n} + \dfrac{(x_p - \bar x)^2}{S_{xx}} \right] \sigma^2$

To see this, write

$$y_p - \hat y_p = \beta_0 + \beta_1 x_p + \epsilon_p - \hat\beta_0 - \hat\beta_1 x_p$$

Note that $\epsilon_p$ is independent of $\hat\beta_0$ and $\hat\beta_1$, since it is a future error unrelated to the data from which $\hat\beta_0$ and $\hat\beta_1$ are calculated. Hence

$$Var(y_p - \hat y_p) = Var(\epsilon_p) + Var(\hat\beta_0 + \hat\beta_1 x_p)$$

3. It can be shown that

$$\frac{y_p - \hat y_p}{SE(y_p - \hat y_p)} \sim t_{n-2}, \qquad SE(y_p - \hat y_p) = \sqrt{\left[ 1 + \frac{1}{n} + \frac{(x_p - \bar x)^2}{S_{xx}} \right] s^2}$$

where $S_{xx} = \sum (x_i - \bar x)^2$.

2.7 Mean Prediction for Future Values

Here we predict the mean of future response values at a given $x = x_p$. We will still use

$$\hat\mu_p = \hat\beta_0 + \hat\beta_1 x_p$$

as the predictor of the future mean

$$\mu_p = \beta_0 + \beta_1 x_p$$

The variance of the prediction error $Var(\hat\mu_p - \mu_p)$ is smaller than the variance of the prediction error of $\hat y_p$:

$$SE(\hat\mu_p - \mu_p) = \sqrt{\left[ \frac{1}{n} + \frac{(x_p - \bar x)^2}{S_{xx}} \right] S^2}$$

Note. Notice that

$$Var(\hat\mu_p - \mu_p) < Var(y_p - \hat y_p)$$
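The two standard errors just discussed (predicting a single future value versus predicting the mean response) can be compared numerically; a minimal Python (NumPy) sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 9, 10)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
S2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

x_p = 12.0                                # a new x, outside the sample
y_p_hat = b0 + b1 * x_p                   # point prediction (same for both)
se_pred = np.sqrt((1 + 1/n + (x_p - x.mean())**2 / Sxx) * S2)  # single value
se_mean = np.sqrt((    1/n + (x_p - x.mean())**2 / Sxx) * S2)  # mean response
print(y_p_hat, se_pred, se_mean)
```

The extra "1 +" in the first formula is the contribution of the future error $\epsilon_p$, so the two squared standard errors differ by exactly $S^2$.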

2.8 Analysis of Variance (ANOVA) for Testing $H_0: \beta_1 = 0$

The total variation among the $y_i$s is measured by

$$SST = \sum_{i=1}^{n} (y_i - \bar y)^2$$

If there is no variation (all $y_i$s are the same), then SST = 0; the bigger the SST, the more variation. We can re-write SST as

$$SST = \sum_{i=1}^{n} (y_i - \hat y_i + \hat y_i - \bar y)^2 = \underbrace{\sum (y_i - \hat y_i)^2}_{\sum r_i^2} + \sum (\hat y_i - \bar y)^2 + \underbrace{2 \sum (y_i - \hat y_i)(\hat y_i - \bar y)}_{= 0}$$

$$SST = SSE + SSR$$

where:

- SSE refers to the sum of squares of residuals. It measures the variability of the $y_i$s that is unexplained by the regression model.
- SSR refers to the sum of squares of regression. It measures the variability of the response that is accounted for by the regression model.

If $H_0: \beta_1 = 0$ is true, SSR should be relatively small compared to SSE. Our decision is to reject $H_0$ if the ratio of SSR to SSE is large.

Some distribution results (when $H_0$ is true):

$$\frac{SST}{\sigma^2} \sim \chi^2_{(n-1)}$$

To show this, recall that under $H_0$ the $y_1, \ldots, y_n$ are independent $N(\beta_0, \sigma^2)$, so

$$\sum \left( \frac{y_i - \beta_0}{\sigma} \right)^2 \sim \chi^2_{(n)}$$

By re-arrangement of SST,

$$SST = \sum_{i=1}^{n} (y_i - \beta_0 + \beta_0 - \bar y)^2 = \sum_{i=1}^{n} (y_i - \beta_0)^2 - n (\bar y - \beta_0)^2$$

so that

$$\underbrace{\frac{\sum_{i=1}^{n} (y_i - \beta_0)^2}{\sigma^2}}_{\chi^2_{(n)}} = \frac{SST}{\sigma^2} + \underbrace{\frac{n (\bar y - \beta_0)^2}{\sigma^2}}_{\chi^2_{(1)}}$$

1. From Cochran's Theorem, $SST/\sigma^2$ is independent of $n(\bar y - \beta_0)^2/\sigma^2$, and $SST/\sigma^2 \sim \chi^2_{(n-1)}$.

2. $SSR/\sigma^2 \sim \chi^2_{(1)}$: since $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$,

$$SSR = \sum (\hat y_i - \bar y)^2 = \sum (\hat\beta_0 + \hat\beta_1 x_i - \bar y)^2 = \sum \big( \bar y - \hat\beta_1 \bar x + \hat\beta_1 x_i - \bar y \big)^2 = \hat\beta_1^2 \sum (x_i - \bar x)^2 = \hat\beta_1^2 S_{xx}$$

Recall $\hat\beta_1 \sim N(\beta_1, \sigma^2/S_{xx})$, so

$$\left( \frac{\hat\beta_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \right)^2 \sim \chi^2_{(1)}$$

Under $H_0: \beta_1 = 0$,

$$\frac{\hat\beta_1^2}{\sigma^2/S_{xx}} = \frac{\hat\beta_1^2 S_{xx}}{\sigma^2} = \frac{SSR}{\sigma^2} \sim \chi^2_{(1)}$$

3. $SSE/\sigma^2 \sim \chi^2_{(n-2)}$: from

$$\underbrace{\frac{SST}{\sigma^2}}_{\chi^2_{(n-1)}} = \frac{SSE}{\sigma^2} + \underbrace{\frac{SSR}{\sigma^2}}_{\chi^2_{(1)}}$$

and Cochran's Theorem, $SSE/\sigma^2 \sim \chi^2_{(n-2)}$.

2.8.1 F-Distribution

Based on these results, we derive the F-statistic

$$F = \frac{(SSR/\sigma^2)/1}{(SSE/\sigma^2)/(n-2)} = \frac{SSR/1}{SSE/(n-2)} \sim F_{(1, n-2)}$$

It can be used for testing $H_0: \beta_1 = 0$; we reject $H_0$ at $\alpha$-level if

$$F > F_{\alpha, (1, n-2)}$$

Recall

$$t = \frac{\hat\beta_1}{SE(\hat\beta_1)} = \frac{\hat\beta_1}{\sqrt{s^2/S_{xx}}} \implies t^2 = \frac{\hat\beta_1^2 S_{xx}}{s^2} = F$$

and $t_{n-2}^2 = F_{(1, n-2)}$, but only for a single coefficient. The t-test and F-test for $H_0: \beta_1 = 0$ are therefore equivalent for SLR.

2.8.2 Terminologies of ANOVA

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares      | F           | p-value
Regression          | SSR            | 1                  | MSR = SSR/1       | F = MSR/MSE |
Residual            | SSE            | n - 2              | MSE = SSE/(n - 2) |             |
Total               | SST            | n - 1              |                   |             |

For p explanatory variables:

Source of Variation | Sum of Squares                          | Degrees of Freedom | Mean Squares          | F           | p-value
Regression          | SSR = $(\hat y - \bar y)^T (\hat y - \bar y)$ | (p + 1) - 1 = p | MSR = SSR/p          | F = MSR/MSE |
Residual            | SSE                                     | n - p - 1          | MSE = SSE/(n - p - 1) |             |
Total               | SST                                     | n - 1              |                       |             |

2.8.3 Coefficient of Determination $R^2$

$$R^2 = \frac{SSR}{SST}, \qquad 0 \le R^2 \le 1$$

It is a measure of goodness-of-fit of the regression model to the data. In the case of SLR,

$$R^2 = \frac{SSR}{SST} = \frac{\hat\beta_1^2 S_{xx}}{S_{yy}}$$

and since $\hat\beta_1 = S_{xy}/S_{xx}$,

$$R^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r^2$$

where $r$ is the sample correlation coefficient. $R^2$ is applicable to multiple regression, but $r^2$ is not.
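The SLR identity $R^2 = r^2$ is easy to confirm numerically; a minimal Python (NumPy) sketch:

```python
import numpy as np

rng = np.random.default_rng(13)
x = np.linspace(0, 9, 10)
y = 3.0 - 0.4 * x + rng.normal(0, 1.0, size=x.size)

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = Sxy / Sxx
y_hat = y.mean() + b1 * (x - x.mean())   # fitted values

R2 = np.sum((y_hat - y.mean()) ** 2) / Syy   # SSR / SST
r = Sxy / np.sqrt(Sxx * Syy)                  # sample correlation
print(R2, r**2)
```

The same r can also be obtained from np.corrcoef(x, y).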

Chapter 3

Review of Random Vectors and Matrix Algebra

3.1 Definitions

Definition 3.1. Vector of Variables: $Y = (y_1, \ldots, y_n)^T$ with

$$E[Y] = \begin{pmatrix} E[y_1] \\ \vdots \\ E[y_n] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} = \mu$$

$$Var(Y) = \Sigma = \begin{pmatrix} Var(y_1) & Cov(y_1, y_2) & \cdots & Cov(y_1, y_n) \\ Cov(y_2, y_1) & Var(y_2) & \cdots & Cov(y_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(y_n, y_1) & Cov(y_n, y_2) & \cdots & Var(y_n) \end{pmatrix}_{n \times n} = [\sigma_{i,j}]_{n \times n}$$

Equivalently,

$$Var(Y) = E\big[ (Y - E[Y])(Y - E[Y])^T \big]$$

If $y_1, \ldots, y_n$ are independent and identically distributed,

$$Var(Y) = \sigma^2 I$$

3.2 Basic Properties

Let $A = (a_{i,j})_{m \times n}$, $b = (b_1, b_2, \ldots, b_m)^T$, and $c = (c_1, c_2, \ldots, c_n)^T$ be constant. Then:

1. $E[Ay + b] = A\, E[y] + b$
2. $Var(y + c) = Var(y)$
3. $Var(Ay) = A\, Var(y)\, A^T$
4. $Var(Ay + b) = A\, Var(y)\, A^T$
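Properties 3 and 4 reduce to a single matrix product; a minimal Python (NumPy) sketch with a hypothetical covariance matrix and transformation:

```python
import numpy as np

# Hypothetical covariance matrix Sigma and constant matrix A
Sigma = np.array([[2.0, 1.0],
                  [1.0, 3.0]])
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
b = np.array([5.0, -1.0])

# Var(Ay + b) = A Var(y) A^T; adding the constant b changes nothing
V = A @ Sigma @ A.T
print(V)
```

Working this product out by hand gives [[7, 8], [8, 12]], and the result is symmetric, as every covariance matrix must be.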

3.3 Differentiating Over Linear and Quadratic Forms

1. For $f(y) = f(y_1, \ldots, y_n)$,
$$\frac{d}{dy} f = \left( \frac{\partial f}{\partial y_1}, \ldots, \frac{\partial f}{\partial y_n} \right)^T$$

2. For $f = c^T y = \sum_{i=1}^{n} c_i y_i$:
$$\frac{d}{dy} f = c$$

3. For $f = y^T A y$ where A is a symmetric matrix:
$$\frac{d}{dy} f = 2Ay$$

Example. For $f = y^T A y$,

$$f = \sum_i \sum_j a_{i,j}\, y_i y_j = \sum_i a_{i,i}\, y_i^2 + 2 \sum_{i < j} a_{i,j}\, y_i y_j$$

$$\frac{\partial}{\partial y_1} f = 2 a_{1,1}\, y_1 + 2 \sum_{1 < j} a_{1,j}\, y_j = 2 \sum_{j=1}^{n} a_{1,j}\, y_j$$

3.4 Some Useful Results on a Matrix

1. Trace,

tr (A

mm

) =

m

i=1

a

i,i

tr (B

mn

C

nm

) = tr (C

nm

B

mn

)

2. Rank of a Matrix

rank (A

mm

) = Number of linearly independent columns

3. Vectors (y

1

, . . . , y

m

) are linearly independent i

c

1

y

1

+ +c

m

y

m

= 0 = c

1

= = c

m

= 0

4. Orthogonal Vectors and Matrices

CHAPTER 3. REVIEW OF RANDOM VECTORS AND MATRIX ALGEBRA 22

(a) Two vectors are orthogonal: y

T

x = 0

(b) A

mm

is orthogonal (ortonormal) if

A

T

A = AA

T

= I = A

T

= A

1

5. Eigenvalues and eigenvectors. A vector $v_i$ is called an eigenvector of $A_{m \times m}$ if $\exists\, \lambda_i$ s.t.
$$A v_i = \lambda_i v_i, \qquad i = 1, 2, \ldots, k$$
where $\lambda_i$ is the corresponding eigenvalue.

6. Decomposition of a symmetric matrix ($A^T = A$). For a symmetric matrix $A_{m \times m}$, the eigenvalues $\lambda_1, \ldots, \lambda_m$ are real and $\exists$ an orthogonal matrix $P$ s.t.
$$A = P \Lambda P^T$$
where
$$\Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_m \end{pmatrix}$$
is the diagonal matrix with the eigenvalues on the diagonal, and $P = \left(v_1 \mid \cdots \mid v_m\right)$ is the matrix with the eigenvectors as its columns.

7. Idempotent matrix. $A_{m \times m}$ is idempotent if $A^2 = A$.

Result: If $A_{m \times m}$ is idempotent, then all of its eigenvalues are either 0 or 1.

Proof.
$$A v_i = \lambda_i v_i \implies A^2 v_i = \lambda_i A v_i \implies \lambda_i v_i = \lambda_i^2 v_i \implies \lambda_i \in \{0, 1\}$$

Result: If $A_{m \times m}$ is idempotent and symmetric, then $\exists$ an orthogonal matrix $P$ s.t. $A = P \Lambda P^T$ where
$$\Lambda = \begin{pmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & & 0 & \\ & & & & \ddots \end{pmatrix}$$
is diagonal with only 1s and 0s on the diagonal.

Proof. By the spectral decomposition, $A = P \Lambda P^T$ with real eigenvalues on the diagonal of $\Lambda$; since each $\lambda_i \in \{0, 1\}$, $\Lambda$ has only 1s and 0s, and
$$tr(A) = rank(A) = tr(\Lambda) = \text{the number of eigenvalues equal to } 1$$

3.5 Multivariate Normal Distribution

The random vector $y = (y_1, \ldots, y_n)^T$ follows a multivariate normal distribution with joint pdf

$$f(y) = \left(\frac{1}{2\pi}\right)^{n/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2} (y - \mu)^T \Sigma^{-1} (y - \mu)\right)$$

where $\mu = E[y]_{n \times 1} = (E[y_1], E[y_2], \ldots, E[y_n])^T$ and $\Sigma = Var(y)_{n \times n} = (\sigma_{i,j})_{n \times n}$. We write $y \sim MVN(\mu, \Sigma)$.

1. Marginal normality: if $y \sim MVN(\mu, \Sigma)$, then $y_i \sim N(\mu_i, \sigma_{i,i})$, where $\sigma_{i,i}$ is the $(i,i)$th element of $\Sigma$.

2. $y_1, \ldots, y_n$ are independent iff $\Sigma$ is diagonal.¹

3. If $y \sim MVN(\mu, \Sigma)$ and $Z = Ay$, then $Z \sim MVN\left(A\mu,\, A \Sigma A^T\right)$.

4. If $U \sim MVN(\mu, \Sigma)$, $y_1 = AU$ and $y_2 = BU$, then $y_1$ and $y_2$ are independent iff
$$Cov(y_1, y_2) = Cov(AU, BU) = A\,Var(U)\,B^T = A \Sigma B^T = 0$$

5. If $y_1, \ldots, y_n$ are iid $N(\mu, \sigma^2)$, then $y \sim MVN\left(\mu, \sigma^2 I\right)$.

6. If $y \sim MVN\left(0, \sigma^2 I\right)$, then
$$\frac{y^T y}{\sigma^2} \sim \chi^2_{(n)}$$

¹In general, if $y_i$ and $y_j$ are independent then $Cov(y_i, y_j) = 0$, but $Cov(y_i, y_j) = 0$ does not by itself imply that $y_i$ and $y_j$ are independent; under joint normality it does.

Chapter 4

Multiple Linear Regression Model

Suppose we are interested in the relationship between a type of air pollutant and lung function.

y: FEV1
x₁: level of a type of air pollutant
x₂: age
x₃: gender

4.1 The Model

The general model is of the form
$$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} + \epsilon_i$$
where $x_{i,1}, \ldots, x_{i,p}$ are $p$ explanatory variables and $\beta_1, \ldots, \beta_p$ are the regression coefficients associated with these explanatory variables respectively, $i = 1, \ldots, n$.

4.1.1 Assumptions of the Model

1. $E[\epsilon_i] = 0 \implies E[y_i] = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p}$
2. $Var(\epsilon_i) = \sigma^2 \implies Var(y_i) = \sigma^2$
3. $\epsilon_1, \ldots, \epsilon_n$ are independent $\implies y_1, \ldots, y_n$ are independent
4. A stronger assumption: $\epsilon_i \sim N(0, \sigma^2) \implies y_i \sim N\left(\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p},\; \sigma^2\right)$

4.1.2 Regression Coefficients $\beta_1, \ldots, \beta_p$

$\beta_j$: the average amount of increase (or decrease) in the response when the $j$th covariate $x_j$ increases (or decreases) by 1 unit while holding all other covariates fixed.

If we have 2 $\beta$s (besides the intercept), the fitted surface is a 2D plane.

$H_0: \beta_j = 0 \implies x_j$ is not linearly related to $y$, given all the other explanatory variables in the model.

In matrix form:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \beta_0 \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} + \beta_1 \begin{pmatrix} x_{1,1} \\ x_{2,1} \\ \vdots \\ x_{n,1} \end{pmatrix} + \beta_2 \begin{pmatrix} x_{1,2} \\ x_{2,2} \\ \vdots \\ x_{n,2} \end{pmatrix} + \cdots + \beta_p \begin{pmatrix} x_{1,p} \\ x_{2,p} \\ \vdots \\ x_{n,p} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

so that

$$y_{n \times 1} = \underbrace{\begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix}}_{X,\; n \times (p+1)} \underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{\beta,\; (p+1) \times 1} + \epsilon_{n \times 1}$$

$$y = X\beta + \epsilon$$

where $\epsilon \sim MVN\left(0, \sigma^2 I\right)$ and $y \sim MVN\left(X\beta, \sigma^2 I\right)$.

4.2 LSE of $\beta$

Least squares chooses $\hat\beta$ to make the $n \times 1$ residual vector $y - X\hat\beta$ small. Specifically, we minimize

$$S(\beta) = \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_{i,1} - \cdots - \beta_p x_{i,p}\right)^2 = (y - X\beta)^T (y - X\beta) = y^T y - y^T X \beta - \beta^T X^T y + \beta^T X^T X \beta$$

Since $\beta^T X^T y$ is a scalar, $y^T X \beta = \beta^T X^T y$, so

$$S(\beta) = y^T y - \underbrace{2 y^T X \beta}_{A} + \underbrace{\beta^T X^T X \beta}_{B}$$

Set $\frac{\partial}{\partial \beta} S(\beta) = 0$ to get¹

$$0 = \frac{\partial}{\partial \beta}\left(y^T y - 2 y^T X \beta + \beta^T X^T X \beta\right) = -2 X^T y + 2 X^T X \beta$$
$$\implies X^T y = X^T X \beta \implies \hat\beta = \left(X^T X\right)^{-1} X^T y$$

(We require $X^T X$ to be full rank.)

¹Using $\frac{\partial}{\partial \beta}\, \beta^T A \beta = 2A\beta$ for symmetric $A = A^T$, and $\frac{\partial}{\partial \beta}\, c^T \beta = c$.
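The closed form $\hat\beta = (X^T X)^{-1} X^T y$ can be checked numerically. The course's computing examples are in R, but as a quick illustrative sketch in Python/NumPy (all data below are simulated, not from the course):

```python
import numpy as np

# Minimal sketch of the least-squares estimator beta_hat = (X^T X)^{-1} X^T y.
# The design matrix and coefficients below are made-up illustrations.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1), with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations X^T X beta = X^T y (requires X^T X full rank).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem and should agree.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the normal equations directly is fine for a sketch; in practice `lstsq` (QR/SVD based) is numerically preferable when $X^T X$ is ill-conditioned.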

4.2.1 Properties of the LSE

1. $\hat\beta$ is unbiased:
$$E[\hat\beta] = E\left[\left(X^T X\right)^{-1} X^T y\right] = \left(X^T X\right)^{-1} X^T E[y] = \left(X^T X\right)^{-1} X^T X \beta = \beta$$

2. $Var(\hat\beta) = \sigma^2 \left(X^T X\right)^{-1}$:
$$Var(\hat\beta) = Var\left(\left(X^T X\right)^{-1} X^T y\right) = \left(X^T X\right)^{-1} X^T\, Var(y)\, X \left(X^T X\right)^{-1} = \sigma^2 \left(X^T X\right)^{-1} X^T X \left(X^T X\right)^{-1} = \sigma^2 \left(X^T X\right)^{-1}$$

Written out,
$$Var(\hat\beta) = \begin{pmatrix} Var(\hat\beta_0) & Cov(\hat\beta_0, \hat\beta_1) & \cdots & Cov(\hat\beta_0, \hat\beta_p) \\ Cov(\hat\beta_1, \hat\beta_0) & Var(\hat\beta_1) & \cdots & Cov(\hat\beta_1, \hat\beta_p) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(\hat\beta_p, \hat\beta_0) & Cov(\hat\beta_p, \hat\beta_1) & \cdots & Var(\hat\beta_p) \end{pmatrix}$$

4.2.2 Some Useful Results

Fitted values:
$$\hat y = X \hat\beta = X \left(X^T X\right)^{-1} X^T y$$

Let $H = X \left(X^T X\right)^{-1} X^T$, the hat matrix, so that $\hat y = H y$. The matrix $H$ is idempotent and symmetric $\implies$ $H$ is a projection matrix which projects $y$ onto $Col(X)$, the $(p+1)$-dimensional subspace spanned by linear combinations of the $p+1$ columns of $X$:

$$\hat y = proj_{Col(H)}\, y, \qquad Col(X) = Col(H), \qquad r = proj_{Col(I-H)}\, y$$

[Figure: a graphical representation of $y$ projected onto the column space of $X$, with $n = 3$, $p = 1$.]
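The projection properties of $H$ are easy to verify numerically. A small Python/NumPy sketch (toy data, not from the course, which uses R):

```python
import numpy as np

# Sketch: H = X (X^T X)^{-1} X^T is symmetric and idempotent, tr(H) = p + 1,
# and y_hat = H y projects y onto Col(X); r = (I - H) y is orthogonal to it.
rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # n x (p+1) with p = 1
H = X @ np.linalg.inv(X.T @ X) @ X.T

y = rng.normal(size=n)
y_hat = H @ y
r = (np.eye(n) - H) @ y  # residuals via the complementary projection
```

Symmetry, idempotency, the trace identity, and $X^T r = 0$ all hold up to floating-point error.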

4.3 Residuals' Relationship with the Hat Matrix

(Lecture of Oct 2, 2012.)

Theorem. $r = (I - H)\, y$.

Proof.
$$r = y - \hat y = y - X\left(X^T X\right)^{-1} X^T y = \left(I - X\left(X^T X\right)^{-1} X^T\right) y = (I - H)\, y$$

Theorem. $(I - H)$ is idempotent.

Proof.
$$(I - H)(I - H) = I - 2H + H^2 = I - 2H + H = I - H$$

Theorem. $\sum_i r_i = 0$, $\sum_i r_i x_{i,1} = 0$, ..., $\sum_i r_i x_{i,p} = 0$.

Proof.
$$X^T r = X^T \left(I - X\left(X^T X\right)^{-1} X^T\right) y = X^T y - X^T y = 0$$
and the rows of $X^T r = 0$ are exactly $\sum_i r_i = 0$, $\sum_i r_i x_{i,1} = 0$, ..., $\sum_i r_i x_{i,p} = 0$.

Theorem. $\sum_i r_i \hat y_i = \hat y^T r = 0$.

Proof.
$$\hat y^T r = \sum_{i=1}^n \hat y_i\, r_i = \sum_{i=1}^n \left(\hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_p x_{i,p}\right) r_i = \hat\beta_0 \sum_{i=1}^n r_i + \hat\beta_1 \sum_{i=1}^n x_{i,1} r_i + \cdots + \hat\beta_p \sum_{i=1}^n x_{i,p} r_i$$
Each of these sums is zero by the previous theorem ($\sum_i r_i = 0$, $\sum_i r_i x_{i,j} = 0$ for $j = 1, \ldots, p$), hence $\hat y^T r = 0$.

Theorem. $E[r] = 0$.

Proof.
$$E[r] = E\left[\left(I - X\left(X^T X\right)^{-1} X^T\right) y\right] = E[y] - X\left(X^T X\right)^{-1} X^T E[y] = X\beta - X\left(X^T X\right)^{-1} X^T X \beta = X\beta - X\beta = 0$$

Theorem. $Var(r) = \sigma^2 (I - H)$.

Proof.
$$Var(r) = Var\left(\underbrace{(I - H)}_{A}\, y\right) = (I - H)\, \sigma^2 I\, (I - H)^T = \sigma^2 (I - H)$$
using the symmetry and idempotency of $I - H$.

4.4 An Estimator of $\sigma^2$

Theorem. An unbiased estimator of $\sigma^2$ is
$$\hat\sigma^2 = \frac{1}{n - (p+1)} \sum_{i=1}^n r_i^2 = MSE$$

Proof.
$$E\left[\sum_{i=1}^n r_i^2\right] = E\left[r^T r\right] = E\left[tr\left(r^T r\right)\right] \qquad \text{(the trace of a scalar is the scalar itself)}$$
$$= E\left[tr\left(r r^T\right)\right] = tr\left(E\left[r r^T\right]\right)$$
Since $E[r] = 0$, $E\left[r r^T\right] = Var(r)$, so
$$E\left[\sum_{i=1}^n r_i^2\right] = tr\left(Var(r)\right) = tr\left(Var\left((I - H)\, y\right)\right) = tr\left((I - H)\, Var(y)\, (I - H)^T\right) = tr\left((I - H)\, Var(y)\right) = tr\left((I - H)\, \sigma^2\right) = (n - (p+1))\, \sigma^2$$
since $tr(I - H) = n - tr(H) = n - (p+1)$.
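The unbiasedness of MSE can also be seen by simulation: averaging $\hat\sigma^2$ over many replicates should recover the true $\sigma^2$. A hedged Python/NumPy sketch (design, coefficients, and $\sigma$ are all illustrative assumptions):

```python
import numpy as np

# Simulation check of E[SSE / (n - (p+1))] = sigma^2.
rng = np.random.default_rng(2)
n, p, sigma = 40, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
beta = np.arange(p + 1, dtype=float)  # arbitrary true coefficients

est = []
for _ in range(2000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    r = y - H @ y                       # residuals r = (I - H) y
    est.append(r @ r / (n - (p + 1)))   # sigma_hat^2 = MSE
mean_est = np.mean(est)                 # should be close to sigma^2 = 2.25
```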

4.5 Sampling Distributions of $\hat\beta$ and $\hat\sigma^2$ under Normality

We assume $y \sim MVN\left(X\beta, \sigma^2 I\right)$.

Theorem (Result 1).
$$\hat\beta \sim MVN\left(\beta,\; \sigma^2 \left(X^T X\right)^{-1}\right)$$
that is,
$$\begin{pmatrix} \hat\beta_0 \\ \vdots \\ \hat\beta_p \end{pmatrix}_{(p+1) \times 1} \sim MVN\left(\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix},\; \sigma^2 \left(X^T X\right)^{-1}_{(p+1) \times (p+1)}\right)$$
and marginally $\hat\beta_i \sim N\left(\beta_i,\; \sigma^2 v_{i,i}\right)$, where $v_{i,i}$ is the $(i,i)$th element of $\left(X^T X\right)^{-1}$.

Proof. $\hat\beta = \underbrace{\left(X^T X\right)^{-1} X^T}_{A} y$ is a linear transformation of $y$. We have $E[\hat\beta] = \beta$ and
$$Var(\hat\beta) = \left(X^T X\right)^{-1} X^T\, Var(y) \left(\left(X^T X\right)^{-1} X^T\right)^T = \left(X^T X\right)^{-1} X^T\, \sigma^2 I\, X \left(X^T X\right)^{-1} = \sigma^2 \left(X^T X\right)^{-1}$$
Since $y \sim MVN\left(X\beta, \sigma^2 I\right)$, a linear transformation of $y$ is also multivariate normal, hence $\hat\beta \sim MVN\left(\beta,\; \sigma^2 \left(X^T X\right)^{-1}\right)$.

Theorem (Result 2). $\hat\beta$ and $\hat\sigma^2$ are independent.

Proof. Since
$$\hat\sigma^2 = \frac{1}{n - (p+1)}\, r^T r$$
it is enough to show that $r$ and $\hat\beta$ are independent. Both are linear transformations of $y$:
$$r = \underbrace{\left(I - X\left(X^T X\right)^{-1} X^T\right)}_{A} y, \qquad \hat\beta = \underbrace{\left(X^T X\right)^{-1} X^T}_{B} y$$
so¹
$$Cov(r, \hat\beta) = A\, Cov(y, y)\, B^T = \left(I - X\left(X^T X\right)^{-1} X^T\right) \sigma^2 I\, X\left(X^T X\right)^{-1} = \sigma^2 \left(X\left(X^T X\right)^{-1} - X\left(X^T X\right)^{-1} X^T X \left(X^T X\right)^{-1}\right) = \sigma^2 \left(X\left(X^T X\right)^{-1} - X\left(X^T X\right)^{-1}\right) = 0$$
Under joint normality, zero covariance implies independence, so $r$ and $\hat\beta$ (and hence $\hat\sigma^2$ and $\hat\beta$) are independent.

¹$Cov(Ay, By) = E\left[A y\, y^T B^T\right] - E[Ay]\, E\left[y^T B^T\right] = A\left(E\left[y y^T\right] - E[y]\, E\left[y^T\right]\right) B^T = A\, Var(y)\, B^T$.

Theorem (Result 3).
$$\frac{(n - (p+1))\, \hat\sigma^2}{\sigma^2} \sim \chi^2_{n-(p+1)}$$

Proof. Note that we can re-write
$$\frac{(n - (p+1))\, \hat\sigma^2}{\sigma^2} = (n - (p+1))\, \frac{\frac{1}{n-(p+1)} \sum r_i^2}{\sigma^2} = \frac{\sum r_i^2}{\sigma^2} = \frac{r^T r}{\sigma^2} = \left(\frac{r}{\sigma}\right)^T \left(\frac{r}{\sigma}\right)$$

Recall $y \sim MVN\left(X\beta, \sigma^2 I\right)$; then
$$\frac{r}{\sigma} = \frac{(I - H)\, y}{\sigma} \sim MVN\left(0,\; I - H\right)$$

Since $I - H$ is symmetric and idempotent, there exists an orthogonal matrix $P$ such that
$$I - H = P \Lambda P^T, \qquad \Lambda = \begin{pmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & & 0 & \\ & & & & \ddots \end{pmatrix}$$
where the number of 1s is $tr(I - H) = n - (p+1)$.

Now, define a new random variable $z = P^T (r/\sigma)$. Then
$$Var(z) = P^T\, Var\!\left(\frac{r}{\sigma}\right) P = P^T (I - H)\, P = P^T \left(P \Lambda P^T\right) P = \Lambda$$
so $z \sim MVN(0, \Lambda)$: writing $z = \left(z_1, z_2, \ldots, z_{n-(p+1)}, \ldots, z_n\right)^T$, the first $n - (p+1)$ components $z_i$ are iid $N(0, 1)$ and the rest are degenerate at 0.

Therefore, since $r/\sigma = Pz$,
$$\frac{(n - (p+1))\, \hat\sigma^2}{\sigma^2} = (Pz)^T (Pz) = z^T P^T P z = z^T z = \sum_{i=1}^{n-(p+1)} z_i^2 \sim \chi^2_{n-(p+1)}$$

Theorem (Result 4). We can use a t-distribution to test each $\beta_i$:
$$\frac{\hat\beta_i - \beta_i}{\sqrt{\hat\sigma^2\, v_{i,i}}} \sim t_{n-p-1}, \qquad SE\left(\hat\beta_i\right) = \sqrt{v_{i,i}\, \hat\sigma^2}$$
where $v_{i,j}$ denotes the $(i,j)$th element of $\left(X^T X\right)^{-1}$, i.e.
$$Var(\hat\beta) = \sigma^2 \left(X^T X\right)^{-1} = \sigma^2 \begin{pmatrix} v_{0,0} & v_{0,1} & \cdots & v_{0,p} \\ v_{1,0} & v_{1,1} & \cdots & v_{1,p} \\ \vdots & \vdots & \ddots & \vdots \\ v_{p,0} & v_{p,1} & \cdots & v_{p,p} \end{pmatrix}$$

Proof. From Result 1, $\hat\beta_i \sim N\left(\beta_i, \sigma^2 v_{i,i}\right)$, so
$$\frac{\hat\beta_i - \beta_i}{\sqrt{\sigma^2 v_{i,i}}} \sim N(0, 1)$$
When $\sigma^2$ is unknown, we use $\hat\sigma^2 = \frac{1}{n-(p+1)} \sum r_i^2$ to estimate $\sigma^2$. From Results 2 and 3, $\hat\beta$ and $\hat\sigma^2$ are independent, and
$$\frac{(n - (p+1))\, \hat\sigma^2}{\sigma^2} \sim \chi^2_{n-(p+1)}$$
Since $X \sim N(0, 1)$ and $Y \sim \chi^2_{\nu}$ independent imply $X \big/ \sqrt{Y/\nu} \sim t_{\nu}$, we get
$$\frac{\hat\beta_i - \beta_i}{\sqrt{\sigma^2 v_{i,i}}} \Bigg/ \sqrt{\frac{(n-p-1)\, \hat\sigma^2 / \sigma^2}{n-p-1}} = \frac{\hat\beta_i - \beta_i}{\sqrt{\hat\sigma^2\, v_{i,i}}} \sim t_{n-p-1} \qquad (*)$$

This also implies that we use the standard error $SE\left(\hat\beta_i\right) = \sqrt{v_{i,i}\, \hat\sigma^2}$ to estimate the standard deviation $\sqrt{v_{i,i}\, \sigma^2}$. The quantity $(*)$ can be used to construct $100(1-\alpha)\%$ CIs and to test hypotheses $H_0: \beta_i = \beta_i^{(0)}$.

4.6 Prediction

(Lecture of Oct 4, 2012; see handout.)

Suppose we are interested in predicting $y$ for a given set of values of the explanatory variables $x_1, \ldots, x_p$. For example, consider the multiple regression model
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$$
with $y$ = FEV1, $x_1$ = level of a certain air pollutant, $x_2$ = age, $x_3$ = weight. We want to predict FEV1 for a new case with an arbitrary vector of explanatory variable values $a_p$, e.g. $a_p = (1, 10, 52, 170)^T$.

Note. Be cautious when extrapolating outside the ranges of the explanatory variables in the fitting data.

$$y_p = \beta_0 + \beta_1 \cdot 10 + \beta_2 \cdot 52 + \beta_3 \cdot 170 + \epsilon_p$$

We can estimate $y_p$ by using $\hat\beta$ (the LSE) to replace $\beta$ and setting $\epsilon_p = 0$:
$$\hat y_p = \hat\beta_0 + \hat\beta_1 \cdot 10 + \hat\beta_2 \cdot 52 + \hat\beta_3 \cdot 170 = a_p^T \hat\beta$$

To quantify the uncertainty in $\hat y_p$, we need $Var(y_p - \hat y_p)$:
$$Var(y_p - \hat y_p) = Var\left(a_p^T \beta + \epsilon_p - a_p^T \hat\beta\right) = Var\left(\epsilon_p - a_p^T \hat\beta\right) = Var(\epsilon_p) + Var\left(a_p^T \hat\beta\right)$$
(the new error $\epsilon_p$ is independent of $\hat\beta = \left(X^T X\right)^{-1} X^T y$, which depends only on the fitting data), so
$$Var(y_p - \hat y_p) = \sigma^2 + a_p^T\, Var(\hat\beta)\, a_p = \sigma^2 + a_p^T \left(X^T X\right)^{-1} \sigma^2\, a_p = \sigma^2 \left(1 + a_p^T \left(X^T X\right)^{-1} a_p\right)$$

As usual, we have to replace $\sigma^2$ by
$$\hat\sigma^2 = \frac{1}{n - p - 1} \sum r_i^2$$
which leads to the result that
$$\frac{y_p - \hat y_p}{\sqrt{\hat\sigma^2 \left(1 + a_p^T \left(X^T X\right)^{-1} a_p\right)}} \sim t_{n-p-1}$$
(homework: prove this) and thus

Theorem 4.1. The $100(1-\alpha)\%$ prediction interval for $y_p$ is
$$CI_{100(1-\alpha)\%}\left(\hat y_p\right) = \hat y_p \pm t_{n-p-1,\, \alpha/2} \sqrt{\hat\sigma^2 \left(1 + a_p^T \left(X^T X\right)^{-1} a_p\right)}$$

What if we want to predict the mean of the response at a given vector of values of the explanatory variables, $a_p$? Here
$$\mu_p = E[y_p] = a_p^T \beta, \qquad \hat\mu_p = a_p^T \hat\beta = \hat y_p$$
However,
$$\left[Var\left(\hat\mu_p - \mu_p\right) = Var\left(\hat\mu_p\right)\right] < \left[Var\left(y_p - \hat y_p\right)\right]$$
Why is this?

Theorem 4.2. The $100(1-\alpha)\%$ CI for $\mu_p$ is
$$CI_{100(1-\alpha)\%}\left(\mu_p\right) = \hat\mu_p \pm t_{n-p-1,\, \alpha/2} \sqrt{\hat\sigma^2\, a_p^T \left(X^T X\right)^{-1} a_p}$$

Proof.
$$Var\left(\hat\mu_p - \mu_p\right) = Var\left(a_p^T \hat\beta - a_p^T \beta\right) = Var\left(a_p^T \hat\beta\right) = a_p^T\, Var(\hat\beta)\, a_p = a_p^T \left(X^T X\right)^{-1} \sigma^2\, a_p = \sigma^2\, a_p^T \left(X^T X\right)^{-1} a_p$$
which is smaller than $Var\left(y_p - \hat y_p\right) = \sigma^2 \left(1 + a_p^T \left(X^T X\right)^{-1} a_p\right)$ by exactly $\sigma^2$, the irreducible variance of a single new observation; hence
$$\left[Var\left(\hat\mu_p - \mu_p\right) = Var\left(\hat\mu_p\right)\right] < \left[Var\left(y_p - \hat y_p\right)\right]$$
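The two variance formulas just derived differ by exactly $\sigma^2$, which can be checked numerically. A hedged Python/NumPy sketch (the design, coefficients, and new covariate vector are all illustrative, not the FEV1 data):

```python
import numpy as np

# Compare the two variance formulas from this section:
#   prediction: Var(y_p - y_hat_p)   = sigma^2 (1 + a^T (X^T X)^{-1} a)
#   mean:       Var(mu_hat_p - mu_p) = sigma^2 a^T (X^T X)^{-1} a
# The prediction variance exceeds the mean variance by exactly sigma^2.
rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(40, 60, n)])
y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat
sigma2_hat = r @ r / (n - X.shape[1])   # MSE with df = n - (p+1)

a = np.array([1.0, 5.0, 50.0])          # hypothetical new covariate vector a_p
var_pred = sigma2_hat * (1 + a @ XtX_inv @ a)
var_mean = sigma2_hat * (a @ XtX_inv @ a)
```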

4.7 ANOVA Table

(Lecture of Oct 9, 2012; see lecture handout.)

Consider the general model
$$y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \epsilon_i \qquad (*)$$
with LSE $\hat\beta$ and
$$SSE = \sum_{i=1}^n r_i^2 = r^T r = \left(y - X\hat\beta\right)^T \left(y - X\hat\beta\right)$$

Now consider the hypothesis
$$H_0:\; \beta_1 = \beta_2 = \cdots = \beta_p = 0$$
Under $H_0$, the general model $(*)$ reduces to
$$y_i = \beta_0 + \epsilon_i \qquad \text{(reduced model)}$$
The LSE of $\beta_0$ in the reduced model is $\hat\beta_0 = \bar y$, and
$$SSE\left(\hat\beta_0\right) = \sum_{i=1}^n \left(y_i - \hat y_i\right)^2 = \sum_{i=1}^n \left(y_i - \hat\beta_0\right)^2 = \sum_{i=1}^n \left(y_i - \bar y\right)^2 = SST$$

The difference between the reduced and full models,
$$SSE\left(\hat\beta_0\right) - SSE\left(\hat\beta\right) = SST - SSE\left(\hat\beta\right) = SSR$$
is the additional sum of squares from the $p$ explanatory variables. SSR tells us how much variability in the response is explained by the full model over and above the simple mean model:
$$SSR = \sum \left(y_i - \bar y\right)^2 - \left(y - X\hat\beta\right)^T \left(y - X\hat\beta\right) = y^T y - n\bar y^2 - \left[(I - H)\, y\right]^T \left[(I - H)\, y\right] = y^T y - n\bar y^2 - y^T (I - H)\, y = \hat\beta^T X^T X \hat\beta - n \bar y^2 \qquad (4.7.1)$$

Theorem 4.3. The F-test statistic
$$F = \frac{SSR / p}{SSE / (n - p - 1)} = \frac{\text{additional sum of squares} \,/\, p}{\text{sum of squares from full model} \,/\, (n - p - 1)}$$
is used to test
$$H_0:\; \beta_1 = \beta_2 = \cdots = \beta_p = 0 \qquad \text{vs.} \qquad H_a:\; \text{at least one } \beta_j \neq 0$$

What does a statistically significant F-ratio imply? It indicates that there is strong evidence against the claim that none of the explanatory variables have an influence on the response.

Source      df           Sum of Squares                                     Mean Squares           F
Regression  p            SSR = β̂ᵀXᵀXβ̂ − nȳ²                               MSR = SSR/p            F = MSR/MSE
Residual    n − p − 1    SSE = (y − Xβ̂)ᵀ(y − Xβ̂)                          MSE = SSE/(n − p − 1)
Total       n − 1        SST = (y − ȳ)ᵀ(y − ȳ)

The $R^2$ is an overall measurement of the goodness of fit of the model:
$$R^2 = \frac{SSR}{SST}$$

Problem. Does a large $R^2$ always mean that a significant relationship has been discovered? No: $R^2$ usually goes up as we add more explanatory variables to the model (even if they are not relevant). For instance, if $p + 1 = n$, then $R^2 = 1$.

Theorem 4.4 (Adjusted $R^2$). This is used to penalize for a large number of parameters:
$$R^2_{Adj} = 1 - \frac{n - 1}{n - p - 1} \left(1 - R^2\right)$$
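The ANOVA decomposition and the (adjusted) $R^2$ formulas above can be sketched numerically. A Python/NumPy illustration on simulated data (the course itself uses R; sizes and coefficients here are assumptions):

```python
import numpy as np

# ANOVA decomposition SST = SSR + SSE, the F statistic, and (adjusted) R^2,
# following the formulas in the table above.
rng = np.random.default_rng(4)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
ybar = y.mean()

SST = np.sum((y - ybar) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = SST - SSE                       # equals beta_hat^T X^T X beta_hat - n*ybar^2

F = (SSR / p) / (SSE / (n - p - 1))   # F statistic for H0: beta_1 = ... = beta_p = 0
R2 = SSR / SST
R2_adj = 1 - (n - 1) / (n - p - 1) * (1 - R2)
```

Note that the two expressions for SSR (difference of sums of squares, and equation (4.7.1)) agree.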

Chapter 5

Model and Model Assumptions

Model evaluation and residual analysis: given a particular data set and a specific model (with a set of assumptions), we can

- obtain the least squares fit,
- construct hypothesis tests and CIs,
- carry out estimation and prediction.

In practice, a more difficult task is to find a reasonable model for a set of data. We will focus on techniques based on analysis of residuals for model checking.

5.1 Model and Model Assumptions

Problem. What is a good model? A good model is one which is complex enough to provide a good fit to the data and yet simple enough to use (i.e. make predictions) well beyond the data.

5.1.1 Basic Model Assumptions

1. $E[\epsilon_i] = 0 \implies E[y_i] = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}$ (linearity)
2. $Var(\epsilon_i) = \sigma^2$, constant variance (homoscedasticity)
3. $\epsilon_1, \ldots, \epsilon_n$ are independent
4. $\epsilon_i \sim N\left(0, \sigma^2\right)$

Of course, we cannot observe or compute the errors in practice, so their properties cannot be evaluated directly. Rather, we look at the residuals $r_1, \ldots, r_n$ from the fitted model:
$$r_i = y_i - \hat y_i$$
If the residuals estimate the errors well, any pattern found in the residuals suggests that a similar relationship exists in the random errors.

5.2 Relationship Between Residuals and Random Errors

We can write:
$$r = y - X\hat\beta = y - \underbrace{X\left(X^T X\right)^{-1} X^T}_{H} y = (I - H)\, y = (I - H)(X\beta + \epsilon) = (I - H)\, X\beta + (I - H)\, \epsilon = 0 + (I - H)\, \epsilon = (I - H)\, \epsilon$$
Note:
$$(I - H)\, X\beta = (X - HX)\, \beta = (X - X)\, \beta = 0$$

The residuals will approximately equal the errors if $H$ is small relative to $I$. Since $H = (h_{i,j})_{n \times n}$ is a projection matrix and idempotent ($H = HH$), the $i$th diagonal element can be written as
$$h_{i,i} = (HH)_{i,i} = \sum_{j=1}^n h_{i,j}\, h_{j,i}$$
$H$ is also symmetric, $h_{i,j} = h_{j,i}$, so
$$h_{i,i} = h_{i,i}^2 + \sum_{j \neq i} h_{i,j}^2 \implies h_{i,i}\left(1 - h_{i,i}\right) = \sum_{j \neq i} h_{i,j}^2$$
The right hand side is a sum of squares, hence non-negative, and we see that
$$0 < h_{i,i} < 1$$
If the diagonal elements $h_{i,i}$ are small, then the off-diagonal elements are also small. Note:
$$tr(H) = \sum_i h_{i,i} = p + 1$$
(the trace of a matrix is equal to the sum of its eigenvalues).¹ Therefore the average of the diagonal elements is $\frac{p+1}{n}$. If we try to fit nearly as many parameters as there are observations, the $h_{i,i}$s cannot all be small relative to 1, and the residuals are poor estimates of the errors.

Note also that $Var(r) = \sigma^2(I - H) \neq Var(\epsilon)$, but if $H$ is small they are close; if $H$ is not small, there could be substantial correlations among the residuals, and patterns will be apparent even if the error assumptions hold.

¹Here $H$ is idempotent, so its eigenvalues are 0 or 1 and $tr(H) = rank(H) = rank(X) = p + 1$.

5.2.1 Statistical Properties of r

$$E[r] = 0 \qquad \text{(since } E[\epsilon] = 0\text{)}$$
$$Var(r) = Var\left((I - H)\, \epsilon\right) = (I - H)\, Var(\epsilon)\, (I - H)^T = \sigma^2 (I - H) \approx Var(\epsilon)$$

Summary. If the assumptions about $\epsilon$ hold and $H$ is small relative to $I$, then
$$r = (I - H)\, \epsilon, \qquad E[r] = 0, \qquad Var(r) = \sigma^2 (I - H) \approx \sigma^2 I$$
so $r \approx MVN\left(0, \sigma^2 I\right)$: the residuals should look approximately like a sample from an uncorrelated, mean-zero, constant-variance normal distribution.

5.3 Residual Plots for Checking $E[\epsilon_i] = 0$

(Lecture of Oct 16, 2012.)

Potentially the most important assumption for linear regression models is $E[\epsilon_i] = 0$. The likely causes for violation of this assumption are:

1. The effect of an explanatory variable on the response is not in fact linear (e.g. fitting a relationship linearly in $x$ when $E[y]$ is in fact linear in $x^2$).
2. Omission of some important explanatory variables.

We shall consider three types of plot for checking this assumption:

1. Residuals versus $x_j$, $j = 1, \ldots, p$
2. Partial residuals versus $x_j$, $j = 1, \ldots, p$
3. Added-variable plots

5.3.1 Residuals Versus $x_j$

Suppose we fit a multiple regression model; then
$$r_i = y_i - \hat y_i = y_i - \left(\hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_p x_{i,p}\right)$$
so the residuals have the linear effect of the $x$s removed from $y$. If $x_j$ does have a linear effect on $y$ (in other words, the model assumption $E[\epsilon_i] = 0$ is not violated), when we plot the raw residuals $r_1, \ldots, r_n$ against the $n$ values $x_{1,j}, \ldots, x_{n,j}$, we expect to see a random scatter for $j = 1, \ldots, p$. [Random scatter plot from handout.]

On the other hand, if we see any obvious non-random pattern, it suggests non-linearity, and we could adapt the way $x_j$ is modeled. [Non-random scatter plots from handout.] For example, we may require higher order terms (e.g. $x_k^2, x_k^3, \ldots$) in the model $\implies$ polynomial regression.

5.3.2 Partial Residuals Versus $x_j$

Plots of the raw residuals are sometimes difficult to interpret because we have to decide whether the scatter looks random or not. For the next type of plot considered, based on partial residuals, we have to judge whether the plot looks linear; this is often easier.

For each $x_j$, the partial residual $r_i^{(j)}$ is defined as
$$r_i^{(j)} = r_i + \hat\beta_j\, x_{i,j}, \qquad i = 1, \ldots, n$$
The estimated linear effect of $x_j$ is added back into the residuals.

For each $x_j$, when we plot $\left(r_1^{(j)}, \ldots, r_n^{(j)}\right)$ versus $\left(x_{1,j}, \ldots, x_{n,j}\right)$, we expect a linear trend if the model with a linear term in $x_j$ is adequate. [Typical partial residual plots from handout when the assumption is not violated, $j = 1, \ldots, p$.]

The partial residuals for $x_j$ attempt to correct $y$ for all other explanatory variables, so that the plot of $r_i^{(j)}$ against $x_{i,j}$ ($i = 1, \ldots, n$) shows the marginal effect of $x_j$. To see this, note that
$$r_i^{(j)} = r_i + \hat\beta_j\, x_{i,j} = y_i - \left(\hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_p x_{i,p}\right) + \hat\beta_j\, x_{i,j} = y_i - \left(\hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_{j-1} x_{i,j-1} + \hat\beta_{j+1} x_{i,j+1} + \cdots + \hat\beta_p x_{i,p}\right)$$

In simple linear regression, we can see the relationship between $y$ and $x$ simply by plotting the two variables. In a multiple regression situation, plotting $y$ versus each $x_j$ does not show the marginal effect of $x_j$: the $y$ values are also affected by the remaining explanatory variables. In the partial residual plot, we remove the estimated effect of the remaining explanatory variables, hence attempting to uncover the marginal effect of $x_j$ on $y$. We can then judge whether this marginal effect is linear or not.

In R, a function crPlots() in the car package has been made available to you to produce partial residual plots. See the example and attached R code.
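The partial-residual construction itself is only a few lines of arithmetic. A Python/NumPy sketch (toy data deliberately built so that $x_2$ enters nonlinearly; the course's own code is in R):

```python
import numpy as np

# Partial residuals r_i^(j) = r_i + beta_hat_j * x_{i,j}: start from the raw
# residuals and add back the estimated linear effect of x_j. Plotting
# partial_x2 against x2 would reveal the nonlinear marginal effect below.
rng = np.random.default_rng(5)
n = 40
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1 + 2 * x1 + 3 * x2**2 + rng.normal(scale=0.05, size=n)  # x2 effect is quadratic

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # fit the (misspecified) linear model
r = y - X @ beta_hat                          # raw residuals

partial_x2 = r + beta_hat[2] * x2             # partial residuals for x2
```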

5.3.2.1 Example

The value of a tree of a particular species is largely determined by the volume of timber in the trunk. Foresters want to value stands of trees without having to cut them down to determine their volumes. Here, we explore the relationship between volume and two measures that can be made without felling a tree: the girth (the tree diameter in inches at 4.5 feet above ground) and the height (in feet). The data are collected on 31 felled black cherry trees, and the three variables measured for each tree are: Volume, Girth, Height.

The initial examination of the data by plotting volume against girth and volume against height (see Figure 5.2.1) shows a close linear relationship between volume and girth and a less strong linear relationship between volume and height. Volume seems to become more variable as height increases. However, remember that these plots do not necessarily show the marginal effect of a particular explanatory variable (Girth or Height) on the response variable Volume. [Scatterplots from handout.]

The plots of residuals versus the explanatory variable Girth and residuals versus Height are presented in Figure 5.2.2. There is obviously a non-random pattern in the plot of residuals versus Girth (a quadratic trend?), which implies that the marginal effect of Girth on Volume may not be linear. Hence the assumption that $E[\epsilon_i] = 0$ is violated. How about the plot of residuals versus Height? Does the scatter look random or not? [Residual scatterplots from handout.]

The partial residual plots for Girth and Height in Figure 5.2.3 both show nonlinear trends. They are produced using the R function crPlots() from the car package. The dashed straight line is the fitted simple linear regression line obtained by using the partial residual as the response variable and Girth (or Height) as the explanatory variable. The smooth curve through the data points is constructed using the non-parametric smoothing method called locally weighted scatterplot smoothing (LOWESS). LOWESS is defined by a complex algorithm in which a local linear polynomial fit is used at each data point, depending on the points that fall within a specified neighborhood, so that it attempts to follow the points fairly smoothly, but not necessarily linearly. This makes it easier to spot nonlinear behavior. Both plots indicate deviation from the straight line, hence violation of the assumption that $E[\epsilon] = 0$. [Partial residual scatterplots from handout.]

Note that the LOWESS curve can be computed using the R function lowess(x, y, f), where

- x, y are vectors giving the coordinates of the points in the scatterplot;
- f is the smoother span: it gives the proportion of points in the plot which influence the smooth at each value. Larger values give more smoothness.

lowess() returns a list containing components x and y, which give the coordinates of the smooth. The smooth can be added to a plot of the original points with the function lines(). See the attached R code for how to add the LOWESS curve to the scatter plot in Figure 5.2.1, for example.

5.3.3 Added-Variable Plots

Sometimes, we might suspect that an important explanatory variable has not been included in the model. Consider the plot of residuals versus a new explanatory variable x (that currently is not in the model) [added-variable residual scatterplot from handout]; a systematic trend suggests that the addition of x may improve the model.

When deciding whether a new explanatory variable (that currently is not in the model) should be included, an added-variable plot turns out to be a more powerful graph. To produce the added-variable plot for a new explanatory variable x, we

- regress y on all the current explanatory variables x₁, ..., x_p and denote the residual vector r;
- regress x on all of x₁, ..., x_p and denote the residual vector t;
- plot the residual vector r versus t. Systematic patterns in the plot indicate that the variable x should be included.

In R, the car package provides a function avPlots(model, variable, ...) to produce added-variable plots, where

- model is the model object produced by lm(); or you can directly use avPlots(fit <- lm(y ~ x1 + x2 + ..., data), ...);
- variable is the name of the new explanatory variable that is not included in the lm() fitting. The observations of this new variable shall be included in your data set, though.

5.4 Residual Plots for Checking Constant Variance $Var(\epsilon_i) = \sigma^2$

Once we are satisfied about the assumption that $E[\epsilon_i] = 0$ — which is equivalent to saying that we are modeling $E(y)$ adequately as a linear function of the explanatory variables — we can move on to the assumption that $Var(\epsilon_i) = \sigma^2$ for $i = 1, \ldots, n$. This assumption states that the error variance, or equivalently $Var(y_i)$, is constant.

To detect non-constant variance (heteroscedasticity), a standard diagnostic is to plot the residuals against the fitted values: $r_i$ versus $\hat y_i$. We examine this plot to see if the residuals appear to have constant variability with respect to the fitted values. A fanning pattern [residuals versus $\hat y_i$ scatterplot from handout] suggests that the constant variance assumption is violated. A random scatter, on the other hand, supports the assumption.

Now back to the tree data. Figure 5.3.1 shows first the residuals versus fitted plot and then the absolute values of the residuals versus fitted plot. The latter folds over the bottom half of the first plot to increase the resolution for detecting non-constant variance. A pattern is evident in the first plot: the residuals appear to have some quadratic trend with the fitted value. This problem with the model goes back to the assumption that $E[\epsilon_i] = 0$ for $i = 1, \ldots, n$, as mentioned in Section 5.2.3. This plot cannot diagnose any problems with the assumptions about the variance of $y$ while there are still problems with the assumptions relating to modeling the mean of $y$. [Residuals versus $\hat y_i$ scatterplot from handout for the trees.]

5.5 Residual Plots for Checking Normality of

i

s

We probably do not want to worry about the normality assumption until the other, more serious assumptions

have been checked and xed.

If all assumptions are valid, including the normality assumptions, and we have sucient degrees of freedom

for the residuals, then the residuals should look approximately like a sample from a normal distribution.

Consider n = 5 residuals for simplicity, even though this is far too few for a useful plot. What would

we expect a sample of n = 5 independent standard normals to look like? It seems intuitively reasonable

that the third largest (i.e., the middle one) has expectation zero, the mean of the standard normal. That

is, the middle observation is expected to cut the normal distribution into two equal halves. Carrying this

idea further, we might expect the sample of ve normals, once ordered, to divide the normal distribution

into six approximately equal areas. Thus for n = 5, the choice for the area to the left of the i

th

ordered

observation are a

i

=

i

n+1

for i = 1, . . . , 5 (e.g. a

1

=

1

6

, a

2

=

2

6

, a

3

=

3

6

, a

4

=

4

6

, a

5

=

5

6

, a

6

=

6

6

), this gives the

equal areas we argued above. The values diving the standard normal distribution into these equal areas are

z

i

=

1

(a

i

) for i = 1, . . . , n, where is the cumulative distribution function for the standard normal. We

call z

i

expected standard normal order statistics because z

1

< z

2

< < z

n

.

This is the basis for a Q-Q (quantile-quantile) plot to check normality. The ordered residuals (ordered

r

i

s) are plotted against the expected standard normal order statistics (z

i

s). If the normality assumption is

correct, a Q-Q plot should show an approximately straight line.

The first plot in Figure 5.4.1 shows the Q-Q plot of a random sample of size n = 100 from a normal distribution, which shows an approximately straight line. The other three plots are based on random samples simulated from a lognormal distribution (an example of a skewed distribution), a Cauchy distribution (an example of a heavy-tailed distribution), and a uniform distribution (an example of a light-tailed distribution).

[Insert residuals vs. $\hat{y}_i$ scatterplot from handout]
[Insert residuals vs. $\hat{y}_i$ scatterplot from handout for trees]

CHAPTER 5. MODEL AND MODEL ASSUMPTIONS

When non-normality is found, the resolution depends on the type of problem found.

- For a light-tailed distribution, the consequences of non-normality are not serious and can reasonably be ignored.
- For skewed errors, a transformation of the response may solve the problem.
- For long-tailed errors, we might just accept the non-normality.

Raw residuals $r_1, r_2, r_3, r_4, r_5$; ordered residuals $r_{(1)} < r_{(2)} < r_{(3)} < r_{(4)} < r_{(5)}$.

\[ \Phi(z_i) = \Pr(Z < z_i) = a_i \iff z_i = \Phi^{-1}(a_i) \]

In R, the function qqnorm() can be used to produce the Q-Q plot of residuals (see attached R code). For example, for the tree data, you could use the commands

fit <- lm(Volume ~ Girth + Height, data=tree)
qqnorm(residuals(fit))

5.5.1 Standardized Residuals

\[ d_i = \frac{r_i}{\sqrt{\hat{\sigma}^2 (1 - h_{i,i})}}, \qquad i = 1, 2, \ldots, n \]

$d_1, d_2, \ldots, d_n$ are approximately iid $N(0, 1)$.

5.6 Residual Plots for Detecting Correlation in the $\varepsilon_i$'s

None of the diagnostic plots discussed so far has questioned the assumption that the random errors are uncorrelated. In general, checking this assumption is very difficult, if not impossible, by inspection of the data. Scrutiny of the data collection method is often all that one can do. For example, if the tree data include adjacent trees, one tree might shade its neighbor, leading to correlation. Care in selecting trees that are widely separated would make the assumption of uncorrelated errors more credible.

Only if there is some structure to the correlations that might exist do we have some basis for checking. For example, for temporally (or spatially) related data, it is often reasonable to suspect that observations close together in time (or in space) are the most likely to be correlated (e.g. daily stock price data, temperature data, etc.). It is then wise to check the uncorrelated-errors assumption.

If this assumption is violated, the first-order property of the least squares estimate $\hat{\beta}$ will not be affected (i.e. $E[\hat{\beta}] = \beta$), but the second-order, variance properties will be. As a matter of fact, a fairly small correlation between the errors may lead to the estimated variance of $\hat{\beta}$ being an order of magnitude wrong, hence there is potential for standard errors to be very wrong, with corresponding effects on confidence intervals, etc. The reason for this is that, although each correlation is small, there are many pairwise correlations contributing to the true variance.

[Insert pictures from notes plus Oct 18, 2012]
[Insert Q-Q plot from handout]
[Insert R code from handout Oct 16]
[Oct 18, 2012]
[Insert Oct 18 handout]

Graphical checks for correlation in the $\varepsilon_i$'s include plots of the residuals $r$ against time, and of $r_i$ against $r_{i-1}$. For example, consider the case where any two observations $t$ time units apart have correlation

\[ \rho_t = \operatorname{corr}(\varepsilon_i, \varepsilon_{i-t}) = \rho^t, \qquad t \geq 0 \]

where now $i$ indexes time, and $-1 < \rho < 1$. This is a so-called autocorrelation structure in Time Series Analysis (STAT 443). The plots in Figure 5.5.1 are generated from normal random variables $r_1, \ldots, r_n$ with an autocorrelation as above with $\rho = -0.9, 0, +0.9$. These plots give some idea of what we are looking for. Positive correlation is probably the most harmful because the computed standard errors will likely be too small, leading to confidence intervals that do not capture the true value with the stated probability.

5.6.1 Consequences of Correlation in the $\varepsilon_i$'s

The least squares estimate remains unbiased, $E[\hat{\beta}] = \beta$, but its variance changes:

\[ \operatorname{Var}(\hat{\beta}) = \operatorname{Var}\left( (X^T X)^{-1} X^T y \right) = (X^T X)^{-1} X^T \operatorname{Var}(y) \left( (X^T X)^{-1} X^T \right)^T = (X^T X)^{-1} X^T \operatorname{Var}(\varepsilon) \, X (X^T X)^{-1} \]

If $\operatorname{Var}(\varepsilon) = \sigma^2 I$, this reduces to the usual $\sigma^2 (X^T X)^{-1}$. With correlated errors, $\operatorname{Var}(\varepsilon)$ is not diagonal anymore:

\[ \operatorname{Var}(\varepsilon) = \begin{pmatrix} \sigma_1^2 & \sigma^2 \rho_{1,2} & \sigma^2 \rho_{1,3} & \cdots & \sigma^2 \rho_{1,n} \\ \sigma^2 \rho_{2,1} & \sigma_2^2 & \sigma^2 \rho_{2,3} & \cdots & \sigma^2 \rho_{2,n} \\ \vdots & & \ddots & & \vdots \\ \sigma^2 \rho_{n,1} & \sigma^2 \rho_{n,2} & \sigma^2 \rho_{n,3} & \cdots & \sigma_n^2 \end{pmatrix} \]

As a result, $\operatorname{Var}(\hat{\beta}_j)$ — and hence $SE(\hat{\beta}_j)$ — may be underestimated.

5.6.2 The Durbin-Watson Test

\[ H_0: \rho = 0 \quad \text{vs.} \quad H_a: \rho \neq 0 \qquad \text{(or sometimes one-sided: } H_a: \rho > 0 \text{ or } H_a: \rho < 0\text{)} \]

The Durbin-Watson test is a formal statistical test for the correlation structure mentioned above. It tests $H_0: \rho = 0$ versus $H_a: \rho \neq 0$. The Durbin-Watson test statistic is

\[ d = \frac{\sum_{i=2}^{n} (r_i - r_{i-1})^2}{\sum_{i=1}^{n} r_i^2} \]

It follows a linear combination of $\chi^2$ distributions if the null hypothesis is true (i.e. the random errors are not correlated). Due to the difficulties with the distribution of $d$, using the tables is complicated. The test can be implemented with the dwtest() function in the lmtest package in R.
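The statistic itself is simple to compute directly. A stdlib-Python sketch (illustrative only — in practice one uses R's dwtest(), which also supplies the p-value):

```python
def durbin_watson(r):
    """d = sum_{i=2..n} (r_i - r_{i-1})^2 / sum_{i=1..n} r_i^2."""
    num = sum((r[i] - r[i - 1]) ** 2 for i in range(1, len(r)))
    den = sum(ri ** 2 for ri in r)
    return num / den

# Smoothly trending (positively correlated) residuals give d near 0;
# rapidly alternating (negatively correlated) residuals push d toward 4;
# uncorrelated residuals give d near 2.
print(durbin_watson([1.0, 1.1, 1.2, 1.3]))    # small d (trending sequence)
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # d = 3.0 here; tends to 4 as n grows
```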


Example. Consider the low birth weight infant data, where we fit a multiple linear regression model using head circumference as the response variable and gestational age, birth weight, and mother's toxemia status as explanatory variables. Suppose that all the other important assumptions have been satisfied, and we want to test whether an autocorrelation structure exists in the random errors. We conduct the D-W test:

> library(lmtest)
> dwtest(headcirc ~ gestage + birthwt + toxemia, data=lowbwt)

Durbin-Watson test

data: headcirc ~ gestage + birthwt + toxemia
DW = 1.9726, p-value = 0.4267
alternative hypothesis: true autocorrelation is greater than 0

where the p-value indicates no evidence of correlation.

\[ d = \frac{\sum (r_i - r_{i-1})^2}{\sum r_i^2} \]

A value of $d$ far from 2 usually means a low p-value. If the p-value < significance level $\alpha$, then we reject $H_0$: we conclude that there is strong evidence that the random errors are correlated (we can also specify whether they are negatively or positively correlated).

R Code:

dwtest(fit, alternative = "two.sided")
# alternative is either "two.sided", "greater", or "less"

If the p-value > $\alpha$, we cannot reject $H_0$: we conclude that there is not enough evidence that there is autocorrelation among the random errors.

If autocorrelation is detected, there are several remedies.

- Add a missing explanatory variable. For example, if we model beer sales on a daily basis and omit daily maximum temperature as an explanatory variable, we may well see strings of positive residuals during spells of hot weather and negative residuals during poor weather systems.
- Differencing. It is often the case that the differences $D_i = y_i - y_{i-1}$ show less correlation. For instance, modeling the differences for a stock price amounts to building a model for how much the price changes from one period to the next.
- Take STAT 443 (Forecasting / Time Series Modeling) or STAT 936 (Longitudinal Data Analysis).

[Insert Figure 5.5.1: Normal random variables with autocorrelation structure]

Chapter 6

Model Evaluation: Data Transformation

We now know the techniques (based on analysis of the residuals) to diagnose problems with the assumptions on the random errors. In this chapter, we concentrate on techniques where the response variable and/or the explanatory variables are transformed so that the usual assumptions might look more reasonable.

6.1 Box-Cox Transformation

6.1.1 Remarks on Data Transformation

Box and Cox proposed a family of transformations that can be used with nonnegative responses y, and suggested that transformation of y can have several advantages:

1. The model in the original x variables fits better (reducing the need for quadratic terms, etc.).
2. The error variance is more constant.
3. The errors are more normal.

(Box and Cox, Journal of the Royal Statistical Society B, 1964)

Suppose $y_i$ is always positive for $i = 1, \ldots, n$. The Box-Cox approach is to transform y to $y^{\lambda}$. The procedure for choosing $\lambda$ is:

[Oct 23, 2012 — use handouts]

1. Choose several values for $\lambda$, typically in the range $[-1, 1]$.

2. For each $\lambda$, transform $y_i$ to

\[ Z_i = \begin{cases} y_i^{\lambda} & \lambda \neq 0 \\ \ln(y_i) & \lambda = 0 \end{cases} \]

3. Fit the regression

\[ Z_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i \]

and calculate

\[ MSE_{adj} = \begin{cases} \left( \dfrac{1}{\lambda \dot{y}^{\lambda - 1}} \right)^2 MSE & \lambda \neq 0 \\ \dot{y}^2 \, MSE & \lambda = 0 \end{cases} \]

where $\dot{y} = \left( \prod_{i=1}^{n} y_i \right)^{1/n}$ is called the geometric mean. The (unadjusted) mean squared error (MSE) obtained from fitting the model on the transformed scale ($Z_i$) is adjusted so that regressions with different scales for the response variable can be compared. The adjustment is related to the Jacobian arising in a density after a change of variable.

4. Choose the $\lambda$ that minimizes $MSE_{adj}$.
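The four steps above can be sketched in code. This is a minimal stdlib-Python illustration with a hypothetical one-predictor data set (the course itself uses boxcox() from R's MASS package); the data are constructed so that y grows roughly exponentially in x, so the log transform ($\lambda = 0$) should win:

```python
import math

def fit_sse(x, z):
    """Least-squares SSE for z = b0 + b1*x (simple linear regression)."""
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    b1 = sxz / sxx
    b0 = zbar - b1 * xbar
    return sum((zi - (b0 + b1 * xi)) ** 2 for zi, xi in zip(z, x))

def boxcox_mse_adj(x, y, lam):
    """Jacobian-adjusted MSE for the Box-Cox transform with parameter lam."""
    n = len(y)
    gm = math.exp(sum(math.log(yi) for yi in y) / n)  # geometric mean of y
    if lam == 0:
        z = [math.log(yi) for yi in y]
        scale = gm ** 2
    else:
        z = [yi ** lam for yi in y]
        scale = (1.0 / (lam * gm ** (lam - 1))) ** 2
    mse = fit_sse(x, z) / (n - 2)   # unadjusted MSE, df = n - p - 1 with p = 1
    return scale * mse

# Hypothetical data: y ~ exp(0.5 x) with small multiplicative wobble.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [math.exp(0.5 * xi) * (1 + 0.02 * (-1) ** i) for i, xi in enumerate(x)]
lams = [-1.0, -0.5, 0.0, 0.5, 1.0]
best = min(lams, key=lambda lam: boxcox_mse_adj(x, y, lam))
print(best)  # the log transform minimizes the adjusted MSE here
```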

Note. The Box-Cox transformation leads to the following, depending on the value we choose for $\lambda$:

1. $\lambda = 1$: no transformation.
2. $\lambda = -1$: the reciprocal transformation ($\frac{1}{y_i}$).
3. $\lambda = 0.5$: the square root transformation ($\sqrt{y_i}$).
4. $\lambda = 0$: the $\ln$ transformation with natural base e ($\ln(y_i)$).

In practice, there will be a range of $\lambda$ values that give reasonably small values of the scale-adjusted $MSE_{adj}$. From this range, we will want to choose a transformation that is convenient and provides a meaningful scale. Scientists and engineers often work on logarithmic scales ($\lambda = 0$), for example. In other applications, reciprocals ($\lambda = -1$) make sense.

However, notice:

- Once the transformation is selected, all subsequent estimation and tests are performed in terms of the transformed values.
- Transformations complicate the interpretation. Some transformations are easier to explain than others in some contexts.
- The graphical diagnostics do not provide a clear-cut decision rule. A natural criterion for assessing the necessity of a transformation is whether important substantive results differ qualitatively before and after.
- In multiple regression, the best solution may require transforming the x's. In this course, we focus on the Box-Cox transformation of the response variable. If the $\ln$ transformation is chosen, then we may consider the same $\ln$ transformation of all explanatory variables (the ln-ln model) if the improvement is substantial.

Box and Cox also showed how to generate a confidence interval for $\lambda$ and hence provide a range of reasonable values, from which we may pick a convenient value. If the confidence interval contains $\lambda = 1$, no transformation is usually required. Otherwise, a transformation convenient for the context is chosen from the values in the confidence interval (e.g. $\lambda = -1, 0, 0.5$). The method is based on the log-likelihood for the original response values as a function of $\lambda$, and we seek large values of the log-likelihood.

In R, we can use the function boxcox(<object>) from the MASS package to do the Box-Cox transformation analysis, where <object> is a model object created by an lm() fit. We now illustrate how to identify the appropriate transformation on the tree data to resolve the problem with the non-linearity. The following R code

> library(MASS)
> fit1 <- lm(Volume ~ Girth + Height, data=tree)
> boxcox(fit1, lambda=seq(-1, 1))

will produce Figure 6.1. The boxcox function computes the log-likelihood for a number of values of $\lambda$ and plots the curve in the figure. The values of $\lambda$ above the horizontal dotted line comprise an approximate 95% confidence interval. Here we see that values from about $-0.1$ to about $0.5$ seem reasonable. For convenience, we pick $\lambda = 0$, which implies a log transformation of the response variable, the Volume.

Now if we fit the model using the transformed response $y = \ln(\text{Volume})$ and check the plot of residuals versus Girth (see Figure 6.3), we will see less of a quadratic pattern and more random scatter compared to Figure 5.2.2.

For many applications, transformation of the explanatory variables is also useful, for example, transforming $x_j$ to $x_j^{\lambda}$. We consider applying the same transformation (the $\ln$ transformation) to all the explanatory variables. We call this the log-log model, and write

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]

where now

\[ y = \ln(\text{Volume}), \qquad x_1 = \ln(\text{Girth}), \qquad x_2 = \ln(\text{Height}) \]

The fitted log-log model is

> tree$y <- log(tree$Volume)
> tree$x1 <- log(tree$Girth)
> tree$x2 <- log(tree$Height)
> fit2 <- lm(y ~ x1 + x2, data = tree)
> summary(fit2)

Call:
lm(formula = y ~ x1 + x2, data = tree)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.63162    0.79979  -8.292 5.06e-09
x1           1.98265    0.07501  26.432  < 2e-16
x2           1.11712    0.20444   5.464 7.81e-06

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16

The plots of residuals against the log-transformed predictors in Figure 6.4 show random scatter patterns. The partial residual plots are displayed in Figure 6.5; they are linear to a very good approximation. The LOWESS curve wraps around the straight line fairly closely. In summary, then, all of these are consistent with the assumption that $E[\varepsilon_i] = 0$ for $i = 1, \ldots, n$.

[Insert Figure 6.2: Box-Cox transformation for the tree data]
[Insert Figure 6.3: Tree data with log-transformed Volume — residual versus predictor plots]
[Redo later]

Since the first-order assumption $E[\varepsilon_i] = 0$ appears to be reasonable for the log-log model fitted to the tree data, it is appropriate to plot the residuals versus the fitted values to check the constant-variance assumption. Figure 6.6(b) indicates no problems with this assumption.

We also see that $R^2$ has increased slightly from the original model's 0.948 to 0.978, with the same number of fitted explanatory variables. Furthermore, the t statistics for the slopes are now larger. The most compelling reason for favoring the log-log model, however, is that this model cannot predict negative volumes and gives much more sensible predictions than the original model.

6.2 Logarithmic Transformation

6.2.1 Logarithmic Transformation of y Only

In general, suppose we fit the model

\[ \ln(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon \]

On the original scale, this model becomes

\[ y = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon} = e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_p x_p} e^{\varepsilon} \]

where the explanatory variables have multiplicative effects on the response variable, each appearing through an exponential relationship. The error $e^{\varepsilon}$ is multiplicative as well.

6.2.1.1 Interpretation of $\beta_j$

Suppose $x_j = a$:

\[ E[y \mid x_j = a] = e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_j a} \cdots e^{\beta_p x_p}\, E[e^{\varepsilon}] \]

Now if $x_j = a + 1$ (with all other explanatory variables held fixed):

\[ E[y \mid x_j = a + 1] = e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_j (a+1)} \cdots e^{\beta_p x_p}\, E[e^{\varepsilon}] \]

Dividing,

\[ \frac{E[y \mid x_j = a + 1]}{E[y \mid x_j = a]} = e^{\beta_j} \quad \Longrightarrow \quad \frac{E[y \mid x_j = a + 1] - E[y \mid x_j = a]}{E[y \mid x_j = a]} = e^{\beta_j} - 1 \]

So $100\% \times (e^{\beta_j} - 1)$ is interpreted as the percentage change in the average value of the response variable per unit increase in the explanatory variable $x_j$, while holding all the other explanatory variables fixed.

\[ 100\% \times (e^{\beta_j} - 1) \equiv \text{average percentage change in } y \text{ per unit increase in } x_j \]
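A quick numeric check of this interpretation (Python sketch; b0 and b1 are hypothetical coefficients chosen for illustration, not fitted values from the notes):

```python
import math

# Hypothetical coefficients for the model ln(y) = b0 + b1*x1 + e.
b0, b1 = 0.5, 0.12

def mean_y(x1):
    # Conditional mean on the original scale. The common error factor
    # E[e^eps] is omitted because it cancels in the ratio below.
    return math.exp(b0 + b1 * x1)

a = 3.0
pct_change = (mean_y(a + 1) - mean_y(a)) / mean_y(a)
# Matches e^{b1} - 1 regardless of the baseline value a.
print(pct_change, math.exp(b1) - 1)
```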

[Insert Figure 6.4: Tree data with all variables log-transformed — residual versus predictor plots]
[Insert Figure 6.5: Tree data with all variables log-transformed — partial residual versus predictor plots]
[Insert Figure 6.6: Tree data — residual versus fitted value plots: (a) original data; (b) log-transformed data]


6.2.2 Logarithmic Transformation of All Variables

Suppose in general we fit the model

\[ \ln(y) = \beta_0 + \beta_1 \ln(x_1) + \cdots + \beta_p \ln(x_p) + \varepsilon \]

On the original scale of y:

\[ y = e^{\beta_0} e^{\beta_1 \ln x_1} \cdots e^{\beta_p \ln x_p} e^{\varepsilon} = e^{\beta_0} x_1^{\beta_1} \cdots x_p^{\beta_p} e^{\varepsilon} \]

Essentially, the explanatory variables now have multiplicative rather than additive effects on y, and each appears through a power relationship.

6.2.2.1 Interpretation of $\beta_j$

$100\% \times (e^{\beta_j \ln(1.01)} - 1)$ is the percentage change in the average value of the response variable per 1% increase in $x_j$:

\[ 100\% \times (e^{\beta_j \ln(1.01)} - 1) \equiv \text{average percentage change in } y \text{ per 1\% change in } x_j \]
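The 1%-change interpretation can be checked numerically as well (Python sketch; the slope 1.98 merely echoes the log-Girth coefficient fitted earlier and is used here as an illustrative constant):

```python
import math

# Hypothetical slope in a log-log model: ln(y) = b0 + b1*ln(x1) + e.
b1 = 1.98

def mean_y(x1, b0=0.0):
    return math.exp(b0) * x1 ** b1

x = 10.0
pct = (mean_y(1.01 * x) - mean_y(x)) / mean_y(x)
# Equals e^{b1 * ln(1.01)} - 1 = 1.01^{b1} - 1, independent of x.
print(100 * pct)
```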

6.2.3 Logarithmic Transformation of y and Some $x_i$'s

Consider the model with two explanatory variables

\[ \ln(y) = \beta_0 + \beta_1 \ln(x_1) + \beta_2 x_2 + \varepsilon \]

where $x_1$ is transformed but $x_2$ is not. On the original scale of y,

\[ y = e^{\beta_0} x_1^{\beta_1} e^{\beta_2 x_2} e^{\varepsilon} \]

Thus $x_1$ has a power relationship, while $x_2$ has an exponential effect. In general we can obtain a mixture of power and exponential multiplicative effects.

6.2.4 95% CI for a Transformed Estimate

Consider the log model, and a 95% CI for $y_p$ at a given vector of values $a_p$ for the explanatory variables:

\[ \widehat{\ln y_p} = a_p^T \hat{\beta} \quad \Longrightarrow \quad \hat{y}_p = e^{a_p^T \hat{\beta}} \]

Method 1: Find a 95% CI $[L, U]$ for $a_p^T \beta$; then a 95% CI for $y_p = e^{a_p^T \beta}$ is $\left[ e^L, e^U \right]$.

Method 2: Find $SE\left( e^{a_p^T \hat{\beta}} \right)$ based on the delta method; then a 95% CI for $y_p$ is

\[ e^{a_p^T \hat{\beta}} \pm t_{n-p-1,\,\alpha/2} \; SE\left( e^{a_p^T \hat{\beta}} \right) \]

The second method is more correct, but the first is easier.


6.3 Transformations for Stabilizing Variance

Consider the general model

\[ y_i = \underbrace{\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}}_{\mu_i} + \varepsilon_i, \qquad \text{i.e. } y_i = \mu_i + \varepsilon_i \]

where $\mu_i$ is the mean of the response. Furthermore, suppose that $y_i$ has non-constant variance

\[ \operatorname{Var}(y_i) = \mu_i^{\alpha} \sigma^2 \]

where $\sigma^2$ is a constant of proportionality between the variance of $y_i$ and (a power of) the mean of $y_i$. A plot of $r$ versus $\hat{y}$ will then show non-constant variance.

- If $\alpha > 0$, the variance increases with the mean.
- If $\alpha < 0$, the variance decreases with the mean.

Now we want to find a transformation $g(y_i)$ of $y_i$ such that $g(y_i)$ has constant variance. For this, we approximate $g(y_i)$ by a first-order Taylor series:

\[ g(y_i) \approx g(\mu_i) + (y_i - \mu_i)\, g'(\mu_i), \qquad g'(\mu_i) = \left[ \frac{d}{dy}\, g(y) \right]_{y = \mu_i} \]

Then

\[ \operatorname{Var}(g(y_i)) \approx \operatorname{Var}\left( (y_i - \mu_i)\, g'(\mu_i) \right) = \left[ g'(\mu_i) \right]^2 \operatorname{Var}(y_i) = \left[ g'(\mu_i) \right]^2 \mu_i^{\alpha} \sigma^2 \]

To stabilize the variance, we may choose the transformation $g(\cdot)$ such that

\[ \left[ g'(\mu_i) \right]^2 = \frac{1}{\mu_i^{\alpha}} \quad \Longrightarrow \quad g'(\mu_i) = \mu_i^{-\alpha/2} \]

Then choosing

\[ g(y_i) = \begin{cases} y_i^{1 - \alpha/2} & \alpha \neq 2 \\ \ln y_i & \alpha = 2 \end{cases} \]

does the trick and leads to $\operatorname{Var}(g(y_i)) \approx \sigma^2$.

This analysis does not tell us which function $g(\cdot)$ to choose, as we do not know $\alpha$ or the true form of $\operatorname{Var}(y_i)$. It does, however, explain why Box-Cox often chooses transformations $y_i^{\lambda}$ with $\lambda < 0$, or $\ln(y)$.
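A small seeded simulation (Python stdlib) illustrates the $\alpha = 2$ case, where the SD of y is proportional to its mean and the log transform stabilizes it; the data-generating model $y = \mu\, e^{\sigma Z}$ is hypothetical and chosen only to make the effect visible:

```python
import math
import random

random.seed(1)

def sample_sd(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

# Responses whose SD is roughly proportional to the mean (alpha = 2):
# y = mu * exp(sigma * Z). Then ln(y) has constant SD sigma at every mu.
sigma = 0.1
groups = {}
for mu in (1.0, 10.0, 100.0):
    ys = [mu * math.exp(sigma * random.gauss(0, 1)) for _ in range(2000)]
    groups[mu] = (sample_sd(ys), sample_sd([math.log(y) for y in ys]))

for mu, (sd_y, sd_log) in groups.items():
    print(mu, round(sd_y, 3), round(sd_log, 3))
# sd_y grows roughly 10x with each 10x jump in mu; sd_log stays near 0.1.
```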

6.4 Some Remedies for Non-Linearity: Polynomial Regression

Fit $y = \beta_0 + \beta_1 x_1 + \varepsilon$ and plot $r$ vs. $x$; a systematic pattern indicates non-linearity. Then include higher-order terms:

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \]
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon \]
\[ \vdots \]

Rule 1: If $x^n$ is in the expression, then $x^{n-1}$ should be in as well. In general, if a higher-order term is in, all lower-order terms should also be in.

Rule 2: We include a higher-order term only if the new model is much better.

Chapter 7

Model Evaluation: Outliers and Influential Cases

7.1 Outliers

Definition 7.1. Outlier: An outlier is a case with an unusual (extreme) value in y and/or the x's.

Consider the following cases (sketched in the lecture scatterplots):

- Case A is outlying in the covariate x, but not in y: the response is right on the model trajectory.
- Case B is not unusual with regards to x, but it is an outlier in y.
- Case C represents an outlier in x as well as in y.

7.1.1 How to Detect Outliers?

A simple diagnostic tool: graphs of the studentized residuals

\[ d_i = \frac{r_i}{\sqrt{\hat{\sigma}^2 (1 - h_{i,i})}} \]

where $h_{i,i}$ is the $(i, i)$ entry of the hat matrix $H = X (X^T X)^{-1} X^T$, and approximately $d_i \sim N(0, 1)$.

Large values of $d_i$ (e.g. $|d_i| > 2.5$) $\Rightarrow$ outlier in y.

[Sketch: studentized residuals $d_i$ plotted against case index (0-10), vertical scale from $-3.5$ to $3.5$, with reference bands for flagging extreme cases]

The real issue is not whether a case is an outlier or not: it is whether a case has a major influence on a given statistical procedure — in other words, whether keeping or removing the case will result in dramatically different results of the regression model (an influence on the fitted line $\hat{y}$ and on the estimate $\hat{\beta}$).

[Nov 1, 2012]

7.2 Hat Matrix and Leverage

Recall

\[ H = X (X^T X)^{-1} X^T = (h_{i,j})_{n \times n}, \qquad \hat{y} = H y \]

so the $i$th fitted value is

\[ \hat{y}_i = \sum_{j=1}^{n} h_{i,j}\, y_j = h_{i,i}\, y_i + \sum_{j \neq i} h_{i,j}\, y_j \]

The weight $h_{i,i}$ indicates the influence of $y_i$ on $\hat{y}_i$:

- If $h_{i,i}$ is large, then $h_{i,i}\, y_i$ dominates $\hat{y}_i$.
- $0 \leq h_{i,i} \leq 1$; if $h_{i,i} = 1$, then $\hat{y}_i \approx y_i$.

This implies that when $h_{i,i}$ is large, the fitted line will be forced to pass very close to the $i$th observation $(y_i, x_{i,1}, \ldots, x_{i,p})$. We say that case $i$ exerts high leverage on the fitted line.

Definition 7.2. Leverage: $h_{i,i}$ is called the leverage value of case $i$.

large $h_{i,i}$ $\Leftrightarrow$ high leverage $\Leftrightarrow$ influential on the fitted line

- The leverage $h_{i,i}$ is a function of the x's but not of y.
- The leverage $h_{i,i}$ is small for cases with $(x_{i,1}, \ldots, x_{i,p})$ near the centroid $(\bar{x}_1, \ldots, \bar{x}_p)$ determined by all cases, and large if $(x_{i,1}, \ldots, x_{i,p})$ is far away from the centroid.
- ($h_{i,i}$ is used to assess whether a case is unusual with regards to its covariates — the x dimension.)

Recall Section 5.2: $h_{i,i}(1 - h_{i,i}) = \sum_{j \neq i} h_{i,j}^2 \geq 0$, which gives $0 \leq h_{i,i} \leq 1$.


Example 7.1. Simple Linear Regression

\[ X^T X = \begin{pmatrix} n & n\bar{x} \\ n\bar{x} & \sum x_i^2 \end{pmatrix}, \qquad (X^T X)^{-1} = \frac{1}{n S_{xx}} \begin{pmatrix} \sum x_i^2 & -n\bar{x} \\ -n\bar{x} & n \end{pmatrix} \]

\begin{align*}
h_{i,i} &= \begin{pmatrix} 1 & x_i \end{pmatrix} (X^T X)^{-1} \begin{pmatrix} 1 \\ x_i \end{pmatrix} \\
&= \frac{1}{S_{xx}} \left( \tfrac{1}{n} \textstyle\sum x_i^2 - \bar{x} x_i - \bar{x} x_i + x_i^2 \right) \\
&= \frac{1}{S_{xx}} \left( \tfrac{1}{n} S_{xx} + \bar{x}^2 - 2\bar{x} x_i + x_i^2 \right) \\
&= \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}
\end{align*}

(using $\frac{1}{n}\sum x_i^2 = \frac{1}{n} S_{xx} + \bar{x}^2$). The leverage is smallest when $x_i = \bar{x}$, and it is large if $x_i$ is far from $\bar{x}$.

Rule: The average leverage in a model with $p + 1$ regression parameters is

\[ \bar{h} = \frac{p + 1}{n} \]

If for a case

\[ h_{i,i} > 2\bar{h} = \frac{2(p + 1)}{n} \]

then it is considered a high-leverage case.
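The closed form just derived can be checked numerically against the direct definition $h_{i,i} = x_i^T (X^T X)^{-1} x_i$ (Python sketch with a hypothetical covariate vector; the 2x2 inverse is written out by hand so only the standard library is needed):

```python
# Leverage in simple linear regression: compare the closed form
# h_ii = 1/n + (x_i - xbar)^2 / S_xx with the hat-matrix diagonal.
x = [1.0, 2.0, 3.0, 4.0, 10.0]   # hypothetical covariate; 10 is far out
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# X^T X = [[n, sum x], [sum x, sum x^2]], inverted via the 2x2 formula.
sx, sx2 = sum(x), sum(xi ** 2 for xi in x)
det = n * sx2 - sx * sx            # equals n * S_xx
inv = [[sx2 / det, -sx / det], [-sx / det, n / det]]

def h_direct(xi):
    # (1, xi) (X^T X)^{-1} (1, xi)^T
    v0 = inv[0][0] + inv[0][1] * xi
    v1 = inv[1][0] + inv[1][1] * xi
    return v0 + v1 * xi

for xi in x:
    closed = 1 / n + (xi - xbar) ** 2 / sxx
    assert abs(closed - h_direct(xi)) < 1e-12

# Leverages sum to p + 1 = 2, and the extreme point x = 10 has the largest.
print(sum(h_direct(xi) for xi in x))
```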

7.3 Cook's Distance

Definition 7.3. Cook's Distance: a measure of influence on $\hat{\beta}$.

Consider the model $y = X\beta + \varepsilon$ with

\[ \hat{\beta} = (X^T X)^{-1} X^T y \]

Suppose we delete the $i$th case and fit the model

\[ y_{(i)} = X_{(i)} \beta + \varepsilon_{(i)} \]

where $y_{(i)} = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n)^T$ is the $(n-1) \times 1$ response vector with the $i$th element removed, and $X_{(i)}$ is the $(n-1) \times (p+1)$ design matrix with the $i$th row removed; then

\[ \hat{\beta}_{(i)} = \left( X_{(i)}^T X_{(i)} \right)^{-1} X_{(i)}^T\, y_{(i)} \]

If the $i$th case is influential, we expect a big change in the estimate of $\beta$. The change $\hat{\beta} - \hat{\beta}_{(i)}$ is then a good measure of the influence of the $i$th case.

Note. $\hat{\beta} - \hat{\beta}_{(i)}$ is a vector; a large value in any component implies that the $i$th case is influential. Its overall magnitude, $(\hat{\beta} - \hat{\beta}_{(i)})^T (\hat{\beta} - \hat{\beta}_{(i)})$, should be adjusted by the variance of $\hat{\beta}$,

\[ \operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1} \]

7.3.1 Cook's D Statistic

\[ D_i = \frac{ \left( \hat{\beta} - \hat{\beta}_{(i)} \right)^T \left[ \hat{\sigma}^2 (X^T X)^{-1} \right]^{-1} \left( \hat{\beta} - \hat{\beta}_{(i)} \right) }{ p + 1 } = \frac{ \left( \hat{\beta} - \hat{\beta}_{(i)} \right)^T (X^T X) \left( \hat{\beta} - \hat{\beta}_{(i)} \right) }{ \hat{\sigma}^2 (p + 1) } \]

An identity:

\[ \hat{\beta} - \hat{\beta}_{(i)} = \frac{r_i}{1 - h_{i,i}}\, (X^T X)^{-1} x_i \]

where $x_i = (1, x_{i,1}, \ldots, x_{i,p})^T$ is the $i$th row of $X$. Substituting this into the expression,

\[ D_i = \frac{ r_i^2 \; x_i^T (X^T X)^{-1} x_i }{ (1 - h_{i,i})^2 \, \hat{\sigma}^2 (p + 1) } = \frac{ d_i^2 \; x_i^T (X^T X)^{-1} x_i }{ (1 - h_{i,i})(p + 1) } = \frac{ d_i^2 \, h_{i,i} }{ (1 - h_{i,i})(p + 1) } \]

$D_i$ measures the influence of the $i$th case on all fitted values and on the estimated $\beta$:

- If $h_{i,i}$ is large but $d_i$ is small $\Rightarrow$ $D_i$ is small.
- If $h_{i,i}$ is small but $d_i$ is large $\Rightarrow$ $D_i$ is small.
- $D_i$ is an overall measure of influence.

How large is large enough? The cut-off: if $D_i > 1$ (and sometimes $D_i > 0.5$) we should be concerned.
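The shortcut formula and the delete-one definition of $D_i$ agree exactly, which can be verified by brute force (Python sketch, stdlib only; simple linear regression on a small hypothetical data set whose last case sits far out in x):

```python
def ols(x, y):
    """Least-squares (b0, b1) for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
y = [1.2, 1.9, 3.2, 3.9, 5.1, 12.0]   # hypothetical; last case is extreme
n, p = len(x), 1
b0, b1 = ols(x, y)
r = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s2 = sum(ri ** 2 for ri in r) / (n - p - 1)       # sigma-hat squared (MSE)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverages
sx, sx2 = sum(x), sum(xi ** 2 for xi in x)

D_formula, D_def = [], []
for i in range(n):
    d2 = r[i] ** 2 / (s2 * (1 - h[i]))            # squared studentized resid
    D_formula.append(d2 * h[i] / ((1 - h[i]) * (p + 1)))
    c0, c1 = ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])   # refit without case i
    u, v = b0 - c0, b1 - c1
    # (bh - bh_(i))^T X^T X (bh - bh_(i)), with X^T X = [[n, sx], [sx, sx2]].
    quad = n * u * u + 2 * sx * u * v + sx2 * v * v
    D_def.append(quad / (s2 * (p + 1)))

print([round(a, 4) for a in D_formula])
```

The two lists match term by term, and the extreme case dominates the others.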


7.4 Outliers and Influential Cases: Remove or Keep?

- Correct the obvious errors due to data processing — it could be a data entry problem.
- Make a careful decision on whether to keep or remove them (before/after analysis). The target population may change due to the inclusion/exclusion of certain cases.
- Most investigators would hesitate to report rejecting $H_0$ if the removal of a case results in $H_0$ not being rejected.
- Robust method: weighted least squares.

In R: suppose we fit a model

fit <- lm(y ~ x1 + x2 + ... + xp, data = <data frame>)

To get Cook's distance D:

cookD <- cooks.distance(fit)

To get the leverages h_{i,i} (a vector containing the diagonal of the hat matrix H):

fitinf <- influence(fit)
fitinf$hat

To get the studentized residuals d_i:

fitsummary <- summary(fit)
s <- fitsummary$sigma            # s = sigma-hat = sqrt(MSE)
studr <- residuals(fit)/(sqrt(1 - fitinf$hat) * s)

Chapter 8

Model Building and Selection

8.1 More Hypothesis Testing

8.1.1 Testing Some But Not All $\beta$'s

Consider the general model

\[ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon \]

Partition the design matrix and parameter vector as

\[ X = \begin{pmatrix} X_A & X_B \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_A \\ \beta_B \end{pmatrix} \]

where $X_A$ ($n \times (p_A + 1)$) contains the intercept column and the first $p_A$ covariates, $X_B$ ($n \times p_B$) contains the remaining $p_B$ covariates, $\beta_A = (\beta_0, \ldots, \beta_{p_A})^T$ is $(p_A + 1) \times 1$, and $\beta_B = (\beta_{p_A + 1}, \ldots, \beta_p)^T$ is $p_B \times 1$.

Example 8.1.

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon \]

with $p + 1 = 5$ parameters partitioned as

\[ \beta_A = (\beta_0, \beta_1, \beta_2)^T, \qquad \beta_B = (\beta_3, \beta_4)^T \]

and

\[ X_A = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} \\ 1 & x_{2,1} & x_{2,2} \\ \vdots & \vdots & \vdots \\ 1 & x_{n,1} & x_{n,2} \end{pmatrix}, \qquad X_B = \begin{pmatrix} x_{1,3} & x_{1,4} \\ x_{2,3} & x_{2,4} \\ \vdots & \vdots \\ x_{n,3} & x_{n,4} \end{pmatrix} \]

Suppose we want to test

\[ H_0: \beta_B = 0 \;\;(\text{i.e. } \beta_3 = \beta_4 = 0) \quad \text{vs.} \quad H_a: \beta_B \neq 0 \;\;(\text{at least one} \neq 0) \]

We are not restricted to the last $p_B$ elements; these ideas apply to any $p_B$ elements.


8.1.1.1 Extra Sum of Squares Principle

A test follows from the change in the regression sum of squares between fitting model (1), just $\beta_A$ (the reduced model), and model (2), both $\beta_A$ and $\beta_B$ (the full model).

ANOVA Table for Testing Some $\beta$'s

Source                                          df                SS
Regression fitting $\beta_A$                    $(p_A + 1) - 1$   $SSR(\hat{\beta}_A)$
Regression fitting $\beta_B$ extra to $\beta_A$  $p_B$            $SSR(\hat{\beta}) - SSR(\hat{\beta}_A)$
Residual                                        $n - p - 1$       $SSE(\hat{\beta}) = r^T r$
Total                                           $n - 1$           $SST = y^T y - n\bar{y}^2$

(recall $p + 1 = (p_A + 1) + p_B$)

The idea is that if $H_0: \beta_B = 0$ is not true, the extra regression sum of squares contributed by including $\beta_B$ in the model should be large (relative to the MSE). Formally, if all model assumptions hold,

\[ F = \frac{ \left( SSR(\hat{\beta}) - SSR(\hat{\beta}_A) \right) / p_B }{ MSE } \;\overset{H_0}{\sim}\; F(p_B,\, n - p - 1) \]

If $F > F_{\alpha, (p_B,\, n-p-1)}$, then we reject $H_0$ at significance level $\alpha$. Otherwise, $H_0$ is not rejected.

Note that

\[ SST - SSR(\hat{\beta}) = SSE, \qquad SST - SSR(\hat{\beta}_A) = SSE_0 \]

where $SSE_0$ is the residual sum of squares leaving out $X_B$ (fitting the model subject to $H_0$). Then the difference is

\[ SSE_0 - SSE = SSR(\hat{\beta}) - SSR(\hat{\beta}_A) \]

Thus, if the extra regression sum of squares is small:

- The two models have similar residual sums of squares.
- The two models fit about the same.
- We choose the simpler model $\Rightarrow$ we do not reject $H_0$.

Mathematically, we can rearrange the formulas for F:

8.1.1.2 Alternative Formulas for F

\[ F = \frac{(SSE_0 - SSE)/p_B}{MSE} = \left( \frac{SSE_0}{SSE} - 1 \right) \frac{n - p - 1}{p_B} \]

where $SSE_0$ is the SSE of the reduced model.

[Nov 8, 2012]

Note that there are two versions of the ANOVA table:
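The two expressions for the extra-sum-of-squares F statistic are algebraically identical, which a quick numeric check confirms (Python; the sums of squares and dimensions here are hypothetical round numbers, not from any data set in the notes):

```python
def f_stat_direct(sse0, sse, p_b, n, p):
    """F = ((SSE0 - SSE)/p_B) / MSE, with MSE = SSE/(n - p - 1)."""
    mse = sse / (n - p - 1)
    return ((sse0 - sse) / p_b) / mse

def f_stat_ratio(sse0, sse, p_b, n, p):
    """F = (SSE0/SSE - 1) * (n - p - 1) / p_B."""
    return (sse0 / sse - 1) * (n - p - 1) / p_b

sse0, sse, p_b, n, p = 120.0, 100.0, 2, 30, 4
print(f_stat_direct(sse0, sse, p_b, n, p))  # (20/2)/(100/25) = 2.5
print(f_stat_ratio(sse0, sse, p_b, n, p))   # (1.2 - 1)*25/2 = 2.5
```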


8.1.1.3 ANOVA (Version 1)

Source of Variation   Degrees of Freedom   Sum of Squares
Regression            $(p + 1) - 1$        $SSR = (\hat{y} - \bar{y})^T (\hat{y} - \bar{y}) = \hat{y}^T \hat{y} - n\bar{y}^2$
Residual              $n - p - 1$          $SSE = y^T y - \hat{y}^T \hat{y} = r^T r$
Total                 $n - 1$              $SST = (y - \bar{y})^T (y - \bar{y}) = y^T y - n\bar{y}^2$

where $\bar{y} = (\bar{y}, \bar{y}, \ldots, \bar{y})^T$ as a vector, and

\[ SST = \sum (y_i - \bar{y})^2, \qquad SSR = \sum (\hat{y}_i - \bar{y})^2, \qquad SSE = \sum r_i^2 \]

8.1.1.4 ANOVA (Version 2) (not adjusting for $\beta_0$)

Source of Variation   Degrees of Freedom   Sum of Squares
Regression            $p + 1$              $SSR = \sum \hat{y}_i^2 = \hat{y}^T \hat{y}$
Residual              $n - p - 1$          $SSE = \sum r_i^2 = r^T r$
Total                 $n$                  $SST = y^T y$

8.1.2 The General Linear Hypothesis

To test a very general hypothesis concerning the regression coefficients,

\[ H_0: T_{c \times (p+1)}\, \beta = b_{c \times 1} \]

where $T$ is a $c \times (p+1)$ matrix of constants and $b$ is a $c \times 1$ vector of constants.

Example 8.2. For

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon \]

the null hypothesis $H_0: \beta_0 = 0 \;\&\; \beta_1 = \beta_2$ can be written as

\[ \underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 \end{pmatrix}}_{T} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \underbrace{\begin{pmatrix} 0 \\ 0 \end{pmatrix}}_{b} \]

Thus $H_0: T\beta = b$.

Example 8.3. To test

\[ H_0: \beta_2 = \beta_3 = 0 \;\Longrightarrow\; T = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \]

8.1.2.1 The Test

To test $H_0: T\beta = b$ in general:

1. Fit the regression with no constraints.
2. Compute SSE.
3. Fit the regression model subject to the constraints.
4. Compute the new $SSE_0$.
5. Compute the F-ratio

\[ F = \frac{(SSE_0 - SSE)/c}{SSE/(n - p - 1)} \]

where $c$ is the number of rows in the $T$ matrix.

6. If $F > F_{\alpha, (c,\, n-p-1)}$, then reject $H_0: T\beta = b$; otherwise do not reject $H_0$.

Example. Consider $H_0: \beta_2 = \beta_3 = 0$:

\[ \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \]
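Example 8.2's T and b can be written down and checked mechanically (Python sketch; the two $\beta$ vectors are hypothetical, one satisfying the constraints and one violating them):

```python
# T for H0: beta0 = 0 and beta1 = beta2 (Example 8.2).
T = [[1, 0, 0, 0],
     [0, 1, -1, 0]]
b = [0, 0]

def t_beta(T, beta):
    """Matrix-vector product T @ beta, row by row."""
    return [sum(t * v for t, v in zip(row, beta)) for row in T]

beta_null = [0.0, 2.0, 2.0, 5.0]   # satisfies both constraints
beta_alt = [1.0, 2.0, 3.0, 5.0]    # violates both constraints
print(t_beta(T, beta_null))  # equals b = [0, 0]
print(t_beta(T, beta_alt))   # nonzero, so H0 fails for this beta
```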

8.2 Categorical Predictors and Interaction Terms

8.2.1 Binary Predictor

Recall the low birth weight infant example:

- y: head circumference
- $x_1$: gestational age
- $x_2$: toxemia (1 = yes, 0 = no)

Consider the model

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]

with fitted model

\[ \hat{y} = 1.496 + 0.874\, x_1 - 1.412\, x_2 \]

Testing $\beta_2 = 0$ compares the two groups: the intercept is $\beta_0$ when $x_2 = 0$ and $\beta_0 + \beta_2$ when $x_2 = 1$, with a common slope $\beta_1$ — two parallel lines.

It is often not reasonable to assume the effects of the other explanatory variables are the same across different groups.


8.2.2 Interaction Terms

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \;\Longrightarrow\; \begin{cases} y = \beta_0 + \beta_1 x_1 + \varepsilon & x_2 = 0 \\ y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + \varepsilon & x_2 = 1 \end{cases} \]

By adding the interaction term, we allow $x_1$ to have a different effect on y depending on the value of $x_2$.

Hypothesis testing of the interaction term,

\[ H_0: \beta_3 = 0 \quad \text{vs.} \quad H_a: \beta_3 \neq 0 \]

tells us whether the effect differs between the groups.

8.2.3 Categorical Predictor with More Than 2 Levels

Example. y: prestige score of occupations. Explanatory variables:

- ($x_1$) education (in years)
- ($x_2$) income
- ($x_3$) type of occupation: blue collar, white collar, or professional

8.2.3.1 Dummy Variables

Dummy variables are basically binary indicators:

\[ D_1 = \begin{cases} 1 & \text{professional} \\ 0 & \text{otherwise} \end{cases} \qquad D_2 = \begin{cases} 1 & \text{white collar} \\ 0 & \text{otherwise} \end{cases} \]

Type of Occupation   D1   D2
prof                 1    0
w.c.                 0    1
b.c.                 0    0

A categorical explanatory variable with k levels can be represented by k - 1 dummy variables.
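The k - 1 dummy coding can be sketched as follows (Python; this mirrors what R's factor() plus lm() produce, with blue collar as the reference level):

```python
# Dummy coding for a 3-level factor: the reference level "bc" gets no
# dummy variable; "prof" and "wc" each get an indicator column.
LEVELS = ["prof", "wc"]

def dummies(occupation):
    """Return (D1, D2) for one occupation type."""
    return [1 if occupation == lev else 0 for lev in LEVELS]

for occ in ("prof", "wc", "bc"):
    print(occ, dummies(occ))
```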

The regression model

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 D_1 + \beta_4 D_2 + \varepsilon \;\Longrightarrow\; \begin{cases} y = (\beta_0 + \beta_3) + \beta_1 x_1 + \beta_2 x_2 + \varepsilon & \text{professional} \\ y = (\beta_0 + \beta_4) + \beta_1 x_1 + \beta_2 x_2 + \varepsilon & \text{white collar} \\ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon & \text{blue collar} \end{cases} \]

where

- $\beta_3$ represents the constant vertical distance between the parallel regression planes for professional and blue collar occupations;
- $\beta_4$ represents the constant vertical distance between the parallel regression planes for white collar and blue collar occupations.

To turn a vector into a vector of categorical indicators, R code:

<variable vector> = factor(<variable vector>)

This makes the variable into a factor, meaning that the linear regression code will create the necessary dummy variables as needed. Be careful with p-values for dummy variables: the p-value for each $\beta$ only compares that level against the baseline.

Testing individual hypotheses (t-tests):

\[ H_0: \beta_3 = 0 \;\text{ vs. }\; H_a: \beta_3 \neq 0 \qquad \text{or} \qquad H_0: \beta_4 = 0 \;\text{ vs. }\; H_a: \beta_4 \neq 0 \]

tests the difference between an experimental group (e.g. prof or w.c.) and the reference (b.c.) group.

8.2.3.2 Testing Overall Effect of a Categorical Predictor

H₀: β₃ = β₄ = 0
Hₐ: At least one ≠ 0

Model   Terms              df   SSE
1 (F)   x₁, x₂, D₁, D₂     93   4681.28
2 (R)   x₁, x₂             95   5272.44

F₀ = ((SSE_R − SSE_F)/2) / (SSE_F/93) ~ F(2, 93)

F₀ = 5.95 > F_{0.05}(2, 93) = 3.07

Therefore, we reject the null hypothesis and conclude that occupational type is overall significantly related to prestige score.

To change which level is used as the baseline in R:

R Code:

contrasts(<factor vector>) <- contr.treatment(<# levels>, base=<level as base>)

8.3 Modeling Interactions With Categorical Predictors

y_i = β₀ + β₁x_{i,1} + β₂x_{i,2} + β₃D_{i,1} + β₄D_{i,2}   (main effects)
    + β₅x_{i,1}D_{i,1} + β₆x_{i,1}D_{i,2}                  (edu × type)
    + β₇x_{i,2}D_{i,1} + β₈x_{i,2}D_{i,2}                  (income × type)
    + ε_i

This model can also be written as

  { y = (β₀ + β₃) + (β₁ + β₅)x₁ + (β₂ + β₇)x₂ + ε     Professional
  { y = (β₀ + β₄) + (β₁ + β₆)x₁ + (β₂ + β₈)x₂ + ε     White collar
  { y = β₀ + β₁x₁ + β₂x₂ + ε                           Blue collar

² Nov 13, 2012


where β₅, β₆ represent the effect of the interaction between education and occupation type, and β₇, β₈ represent the effect of the interaction between income and occupation type.

To test the significance of the interaction, e.g.

H₀: β₇ = β₈ = 0
Hₐ: At least one ≠ 0

Model   Terms                                      df   SSE
1 (F)   x₁, x₂, D₁, D₂, x₁D₁, x₁D₂, x₂D₁, x₂D₂     89   3552.624
2 (R)   x₁, x₂, D₁, D₂, x₁D₁, x₁D₂                 91   4504.982

We use the F-test:

F = ((SSE_R − SSE_F)/2) / (SSE_F/89) ~ F(2, 89)

F = 11.929 > F_{0.05}(2, 89) = 3.099

Hence we reject H₀ and conclude that there is significant evidence that the relationship between income and prestige score is different across occupation types.
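The partial F-statistic can be reproduced from the SSE table by plain arithmetic (numbers as quoted in the notes):

```python
def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """F-statistic for comparing nested linear models via their SSEs."""
    q = df_reduced - df_full             # number of coefficients tested
    return ((sse_reduced - sse_full) / q) / (sse_full / df_full)

# Full model df = 89, reduced model df = 91, from the table above.
f = partial_f(4504.982, 91, 3552.624, 89)
print(round(f, 3))  # 11.929, matching the F value in the notes
```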

8.4 The Principle of Marginality

y_i = β₀ + β₁x_{i,1} + β₂x_{i,2} + β₃D_{i,1} + β₄D_{i,2} + β₅x_{i,1}D_{i,1} + β₆x_{i,1}D_{i,2} + β₇x_{i,2}D_{i,1} + β₈x_{i,2}D_{i,2} + ε_i

D_{i,1}, D_{i,2} represent the categorical variable and must be tested as a group (F-test).

If a model includes a higher order term, then the lower order terms should also be included.

Examine higher order terms (interactions) first, then proceed to test, estimate and interpret main effects.

8.5 Variable Selection


Often, many explanatory variables are available. Investigators may have little idea of the driving factors and so will cast a wide net in data collection, hoping that analysis will identify the important variables. There are several reasons why we would like to include only the important variables:

The model becomes simpler and easier to understand (unimportant factors are eliminated).

Cost of prediction is reduced - fewer variables to measure.

Accuracy of predicting new y's may improve. In general, including unnecessary explanatory variables inflates the variances of predictions.

In this section, we look at some of the more popular algorithms for selecting explanatory variables:

forward selection
backward elimination
stepwise regression
criterion based all subsets regression

³ Nov 22, 2012
⁴ Nov 20, 2012, from handout

Note. Italicized methods can be automated.

We will use an example to illustrate how to implement these methods. We will also discuss under what circumstances these methods are appropriate to use.

Example 8.4. We illustrate the variable selection methods on data on the 50 states of the U.S.A. from the 1970s. We will take life expectancy as the response and the remaining variables as predictors:

State        state name
Population   population estimate of the state
Income       per capita income
Illiteracy   illiteracy as a percent of the population
Life_Exp     life expectancy in years
Murder       murder and non-negligent manslaughter rate per 100,000 population
Hs_Grad      percent high-school graduates
Frost        mean number of days with min temperature < 32 degrees in the capital city
Area         land area in square miles

8.5.1 Backward Elimination

1. Start with all p potential explanatory variables in the model

y = β₀ + β₁x₁ + ⋯ + βₚxₚ + ε

2. For each explanatory variable xⱼ, calculate the p-value (based on either the t-test or the F-test⁵) for testing

H₀: βⱼ = 0,   j = 1, . . . , p

3. If the largest p-value is greater than α, then drop the predictor with the largest p-value. If the largest p-value is smaller than α, then you cannot simplify the model further and you stop the algorithm.

4. Repeat steps 2 and 3 with the simplified model until all p-values for the remaining variables are less than the preset significance level α.

Note: α does not have to be 0.05 (a 0.05 to 0.2 cut-off may work best if prediction is the goal). One refers to it as "alpha to drop".

Example. Life expectancy data

⁵ F-test should be used when the explanatory variable is categorical


R code:

> data(state)

> statedata<-data.frame(state.x77)

> g<-lm(Life.Exp~., data=statedata)

> summary(g)

Call:

lm(formula = Life.Exp ~ ., data = statedata)

Residuals:

Min 1Q Median 3Q Max

-1.48895 -0.51232 -0.02747 0.57002 1.49447

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 ***

Population 5.180e-05 2.919e-05 1.775 0.0832 .

Income -2.180e-05 2.444e-04 -0.089 0.9293

Illiteracy 3.382e-02 3.663e-01 0.092 0.9269

Murder -3.011e-01 4.662e-02 -6.459 8.68e-08 ***

HS.Grad 4.893e-02 2.332e-02 2.098 0.0420 *

Frost -5.735e-03 3.143e-03 -1.825 0.0752 .

Area -7.383e-08 1.668e-06 -0.044 0.9649

Residual standard error: 0.7448 on 42 degrees of freedom

Multiple R-squared: 0.7362,Adjusted R-squared: 0.6922

F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10

We illustrate the backward method. At each stage we remove the predictor with the largest p-value over

0.05:


R Code:

> g<-update(g, .~.-Area)

> summary(g)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)

Population

Income

Illiteracy

Murder

HS.Grad

Frost

> g<-update(g, .~.-Illiteracy)

> summary(g)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)

Population

Income

Murder

HS.Grad

Frost

> g<-update(g, .~.-Income)

> summary(g)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)

Population

Murder

HS.Grad

Frost

> g<-update(g, .~.-Population)

> summary(g)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)

Murder

HS.Grad

Frost

Residual standard error: 0.7427 on 46 degrees of freedom

Multiple R-squared: 0.7127,Adjusted R-squared: 0.6939

F-statistic: 38.03 on 3 and 46 DF, p-value: 1.634e-12

Notice that the final removal of Population is a close call. R² = 0.736 for the full model, and it is only reduced slightly in the final model (R² = 0.713). Thus the removal of four predictors causes only a minor reduction in fit.

Note. The final model depends on the significance level α: the larger α is, the bigger the final model is.


Issue with backward elimination:

Once a predictor has been eliminated from the model, it never has a chance to re-enter the model, even if it becomes significant after other predictors are dropped.

For example,

R Code:

> summary(lm(Life.Exp~Illiteracy+Murder+Frost, data=statedata))

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)

Illiteracy

Murder

Frost

Residual standard error: 0.7911 on 46 degrees of freedom

Multiple R-squared: 0.6739,Adjusted R-squared: 0.6527

F-statistic: 31.69 on 3 and 46 DF, p-value: 2.915e-11

We see that Illiteracy does have some association with Life.Exp. It is true that replacing Illiteracy with HS.Grad gives a somewhat better fitting model, but it would be insufficient to conclude that Illiteracy is not a variable of interest.

8.5.2 Forward Selection

1. Fit p simple linear models, each with only a single explanatory variable vⱼ, j = 1, . . . , p. There are p t-test statistics and p-values for testing H₀: βⱼ = 0, j = 1, . . . , p. The most significant predictor is the one with the smallest p-value, denoted by v_k. If the smallest p-value > α, the algorithm stops and there is no need to include any variables. Otherwise, set x₁ = v_k and fit the model.

2. Start from model

y = β₀ + β₁x₁ + ε    (*)

Enter the remaining p − 1 predictors, one at a time, to fit p − 1 models

y = β₀ + β₁x₁ + β₂vⱼ + ε,   j = 1, . . . , p − 1

and let p_k denote the smallest p-value and v_k the most significant explanatory variable.

(a) If p_k > α: stop, and model (*) is the final model.

(b) If p_k < α: set x₂ = v_k and enter the corresponding explanatory variable into model (*) to update it as

y = β₀ + β₁x₁ + β₂x₂ + ε

3. Continue this algorithm until no new explanatory variables can be added.

The preset significance level α is called the "alpha to enter".

Example. Life expectancy data

The first variable to enter

R Code:

code

Result from fitting 7 simple linear models:

y_i = β₀ + β₁x_{i,j} + ε_i


The second variable to enter

R Code:

The third variable to enter

R Code:

The fourth variable to enter

R Code:

Cannot add any more explanatory variables at the preset significance level α = 0.05; stop.

Summary of forward selection steps:

Iteration   Variable to Enter   p-value (F-test)
1           Murder              2.260 × 10⁻¹¹
2           HS.Grad             0.009088
3           Frost               0.006988

The final model selected at significance level α = 0.05 includes the explanatory variables Murder, HS.Grad and Frost: the same final model as from the backward elimination method.

Issue with forward selection:

Once a predictor has entered the model, it remains in the model forever, even if it becomes non-significant after other predictors have been selected.

8.5.3 Stepwise Regression

It is a combination of the backward and forward methods. It addresses the situation where variables are added or removed early in the process and we want to change our mind about them later. The procedure depends on two alphas:

α₁: Alpha to enter
α₂: Alpha to drop

At each stage a variable may be added or removed, and there are several variations on exactly how this is done. For example:

1. Start as in forward selection using significance level α₁.

2. At each stage, once a predictor has entered the model, check all other predictors previously in the model for their significance. Drop the least significant predictor (the one with the largest p-value) if its p-value is greater than the preset significance level α₂.

3. Continue until no predictors can be added and no predictors can be removed.

Remark. With automatic methods (forward/backward/stepwise):

Because of the one-at-a-time nature of adding/removing variables, it is possible to miss the optimal model.

The procedures are not directly linked to the final objectives of prediction or explanation and so may not really help solve the problem of interest. It is important to keep in mind that model selection cannot be divorced from the underlying purpose of the investigation. Variable selection tends to amplify the statistical significance of the variables that stay in the model. Variables that are dropped can still be correlated with the response. It would be wrong to say these variables are unrelated to the response; it's just that they provide no additional explanatory effect beyond the variables already included in the model.

All "automatic" algorithms should be used with caution. When there is an appreciable degree of multicollinearity among the explanatory variables (as in most observational studies), the three methods may lead to quite different final models.

Some practical advice on the t-test and F-test in linear regression models:

To test hypotheses about a single coefficient, use the t-test.

To test hypotheses about several coefficients (e.g. testing the coefficients of several dummy variables), or more generally to compare nested models, use the F-test based on a comparison of SSEs (or SSRs).

8.5.4 All Subsets Regressions

Suppose we start with a regression model with p explanatory variables,

y_i = β₀ + β₁x_{i,1} + ⋯ + βₚx_{i,p} + ε_i

where each xⱼ may be included or left out. Thus there are 2ᵖ possible regressions (e.g. p = 10 gives 2¹⁰ = 1024 regressions). In principle, we can fit each regression and choose the "best" model based on some "fit" criterion.

Numerical criteria for model comparison:

8.5.4.1 R² Comparison

R-square (Multiple Correlation Coefficient):

R² = SSR / SST

It is always in favor of a larger model.

8.5.4.2 R²_adj Comparison

Adjusted R-square:

R²_adj = 1 − ((n − 1)/(n − p − 1)) (1 − R²)

where p is the number of explanatory variables in the model. A larger model may have a smaller R²_adj.
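As a sanity check, plugging the full-model numbers from the earlier R output (R² = 0.7362, n = 50, p = 7) into this formula recovers the adjusted R-squared that R reported:

```python
def adj_r2(r2, n, p):
    """Adjusted R-square for a model with p explanatory variables."""
    return 1 - ((n - 1) / (n - p - 1)) * (1 - r2)

print(round(adj_r2(0.7362, 50, 7), 4))  # 0.6922, as in the R output
```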

8.5.4.3 Mallows C_k Comparison

Mallows C_k: consider a smaller candidate model with k explanatory variables (k < p), and let SSE_k be the sum of squares of errors from fitting this model. Then

C_k = SSE_k / MSE_full − (n − 2(k + 1))

The idea is to compare the sum of squares of errors from the smaller candidate model with that from the full model.

A candidate model is good if C_k ≈ k + 1.

Look for the simplest model (with smallest k) for which C_k is close to k + 1.
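C_k can be recovered from the residual standard errors quoted earlier: the three-predictor model had residual SE 0.7427 on 46 df and the full model 0.7448 on 42 df, so SSE₃ = 0.7427² × 46 and MSE_full = 0.7448². The value agrees with the C₃ that regsubsets reports later in this section:

```python
def mallows_ck(sse_k, mse_full, n, k):
    """Mallows C_k; a good candidate model has C_k close to k + 1."""
    return sse_k / mse_full - (n - 2 * (k + 1))

n = 50
sse_3 = 0.7427 ** 2 * 46   # SSE of the 3-predictor model (Murder, HS.Grad, Frost)
mse_full = 0.7448 ** 2     # MSE of the full 7-predictor model

print(round(mallows_ck(sse_3, mse_full, n, 3), 2))  # 3.74
```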

8.5.4.4 AIC (Akaike's Information Criterion)

Under the linear regression model

y_i = β₀ + β₁x_{i,1} + ⋯ + βₚx_{i,p} + ε_i

we know

y_i ~ N(β₀ + β₁x_{i,1} + ⋯ + βₚx_{i,p}, σ²)

and the y_i's are independent.

The likelihood function is the product of the pdfs of the y_i:

L(β, σ²) = ∏_{i=1}^n f(y_i) = f(y₁, . . . , yₙ)

L(β, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp( −(y_i − β₀ − β₁x_{i,1} − ⋯ − βₚx_{i,p})² / (2σ²) )

l(β, σ²) = ln L(β, σ²)
         = Σ_{i=1}^n [ −(1/2) ln(2πσ²) − (y_i − β₀ − β₁x_{i,1} − ⋯ − βₚx_{i,p})² / (2σ²) ]
         = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − β₀ − β₁x_{i,1} − ⋯ − βₚx_{i,p})²

The LSE β̂ are the same as the MLE, so

l(β̂, σ²) = −(n/2) ln(2πσ²) − SSE/(2σ²)

Maximizing over σ²:

∂l(β̂, σ²)/∂σ² = −n/(2σ²) + SSE/(2(σ²)²)

0 = −n/(2σ̂²) + SSE/(2(σ̂²)²)   ⟹   σ̂² = SSE/n

l(β̂, σ̂²) = −(n/2) ln(2π) − (n/2) ln(SSE/n) − n/2
          = constant − (n/2) ln(SSE/n)

AIC (Akaike's Information Criterion):

AIC = −2 (max log-likelihood − (p + 1))

AIC = n ln(SSE/n) + 2(p + 1)

For the linear regression model, the maximum log-likelihood is

l(β̂, σ̂²) = −(n/2) ln(SSE/n) + constant

AIC is a penalized maximum log-likelihood.

Small AIC means a better model:

smaller AIC ⟺ larger max log-likelihood ⟺ better model

Note that for a model of a given size (here size refers to the number of explanatory variables included in the model), all the criteria above will select the model with the smallest sum of squares of residuals, SSE.
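The AIC formula is plain arithmetic on SSE. The sketch below (hypothetical SSE value) also makes the penalty visible: holding SSE fixed, each additional parameter raises AIC by exactly 2:

```python
from math import log

def aic(sse, n, p):
    """AIC = n*ln(SSE/n) + 2*(p+1), up to the additive constant
    dropped in the derivation above."""
    return n * log(sse / n) + 2 * (p + 1)

# Same fit, one extra explanatory variable -> AIC rises by 2.
print(aic(25.0, 50, 4) - aic(25.0, 50, 3))  # 2.0
```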


Example. Life expectancy data, all subsets regression:

R Code:

> library(leaps)

> data(state)

> statedata<-data.frame(state.x77)

> tmp<-regsubsets(Life.Exp~., data=statedata)

> summary(tmp)

Subset selection object

Call: regsubsets.formula(Life.Exp ~ ., data = statedata)

7 Variables (and intercept)

Forced in Forced Out

Population FALSE FALSE

Income FALSE FALSE

Illiteracy FALSE FALSE

Murder FALSE FALSE

HS.Grad FALSE FALSE

Frost FALSE FALSE

Area FALSE FALSE

1 subsets of each size up to 7

Selection Algorithm: exhaustive

Population Income Illiteracy Murder HS.Grad Frost Area

1 ( 1 ) " " " " " " "*" " " " " " "

2 ( 1 ) " " " " " " "*" "*" " " " "

3 ( 1 ) " " " " " " "*" "*" "*" " "

4 ( 1 ) "*" " " " " "*" "*" "*" " "

5 ( 1 ) "*" "*" " " "*" "*" "*" " "

6 ( 1 ) "*" "*" "*" "*" "*" "*" " "

7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"

The * means that the variable is included for that model


R Code:

> summary(tmp)$cp
[1] 16.126760  9.669894  3.739878  2.019659  4.008737  6.001959
[7]  8.000000
> summary(tmp)$adjr2
[1] 0.6015893 0.6484991 0.6939230 0.7125690 0.7061129 0.6993268
[7] 0.6921823
> par(mfrow=c(1,2))
> plot(2:8, summary(tmp)$cp, xlab="No. of Parameters", ylab="Ck statistic")
> abline(0,1)
> plot(2:8, summary(tmp)$adjr2, xlab="No. of Parameters", ylab="Adjusted R-square")

[Insert image from teacher's notes: plots of the Ck statistic and adjusted R-square against the number of parameters.]

Recall a candidate model is good when C_k ≈ k + 1. Here C₃ = 3.739878 is smaller than 4, and C₄ = 2.019659 is smaller than 5, so we choose between these two models. Also notice R²_adj = 0.7125690 (the four-predictor model) is the largest R²_adj.

According to the C_k criteria, the competition is between the three-predictor model (Murder, HS.Grad, Frost) and the four-predictor model that also includes Population. The choice is between the smaller model and the larger model, which fits a little better.

If the subset model (or candidate model) is adequate, then we expect

E[ SSE_k / (n − k − 1) ] ≈ σ²   ⟹   E[SSE_k] ≈ (n − k − 1)σ²

We also know that

E[ SSE / (n − p − 1) ] = σ²

therefore,

E[C_k] = E[ SSE_k / MSE − (n − 2(k + 1)) ] ≈ k + 1

According to the adjusted R² criteria, the four-predictor model (Population, Murder, HS.Grad, Frost) has the largest R²_adj.

Problem. Is the four-predictor model (Population, Frost, HS.Grad and Murder) the optimal model?

Model selection methods are sensitive to outliers/influential points:

Based on diagnostic statistics from fitting the full model, "Alaska" can be an influential point.

When "Alaska" is excluded from the analysis, Area now makes it into the model based on the R²_adj criteria.


R Code:

> tmp<-regsubsets(Life.Exp~., data=statedata[-2,])

> summary(tmp)

Selection Algorithm: exhaustive

Population Income Illiteracy Murder HS.Grad Frost Area

1 ( 1 ) " " " " " " "*" " " " " " "

2 ( 1 ) " " " " " " "*" "*" " " " "

3 ( 1 ) " " " " " " "*" "*" "*" " "

4 ( 1 ) "*" " " " " "*" "*" "*" " "

5 ( 1 ) "*" " " " " "*" "*" "*" "*"

6 ( 1 ) "*" "*" " " "*" "*" "*" "*"

7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"

> summary(tmp)$adjr2
[1] 0.5923260 0.6603281 0.6948855 0.7086703 0.7104405 0.7073027
[7] 0.7008899

The * means that the variable is included for that model. Without "Alaska", the 5-predictor model has the largest adjusted R² (0.7104405) and looks best.

Remark. Some final remarks:

Automatic variable selection methods should be used with caution.

Criterion-based best subsets methods typically involve a wider search and compare models in a preferable manner. We recommend this method in general.

There may be several suggested models which fit equally well. If they lead to quite different conclusions, then it is clear that the data cannot answer the question of interest unambiguously.

Chapter 9

Multicollinearity in Regression Models

9.1 Multicollinearity

Example. Pizza sales data:

y: Sales ($1000s)
x₁: Number of advertisements
x₂: Cost of advertisements ($100s)

Suppose we fit the model

y_i = β₀ + β₁x_{i,1} + β₂x_{i,2} + ε_i

and get the following results:

            β̂_i     SE(β̂_i)   t₀      p-value
Intercept   24.82    5.46       4.39    0.0007
x₁          0.66     0.54       1.23    0.2404
x₂          1.23     0.70       1.77    0.1000

R² = 0.7789, F-statistic: 22.899 on 2 and 13 df, p-value = 0.0001

The t-tests say that β₁, β₂ are not significant, but the F-test says at least one is significant.

What do we find?

R² = 0.7789: x₁ and x₂ together explain a large part (78%) of the variability in sales.

The F-statistic and p-value indicate that at least one of them is important.

We cannot reject H₀: β₁ = 0 when x₂ is in the model. Similarly, we cannot reject H₀: β₂ = 0 when x₁ is in the model.

In other words, if one of x₁ or x₂ is in the model, then the extra contribution of the other variable toward the regression is not important. The individual t-test indicates that you do not need one variable if you have already included the other.

This is because variables x₁ and x₂ are highly correlated. The two variables appear to express the same information, so there is no point in including both.

Definition. Collinearity: a linear relationship between two variables x_i and x_j, i ≠ j.

Definition. Multicollinearity: a linear relationship involving more than two x variables, e.g. x₁ ≈ x₂ + x₃.

CHAPTER 9. MULTICOLLINEARITY IN REGRESSION MODELS 76

9.2 Consequence of Multicollinearity

To understand what happens if there is an exact linear dependence, consider the design matrix

X = [ 1  x₁  ⋯  x_k  ⋯  x_p ]

where x_k = (x_{1,k}, . . . , x_{n,k})ᵀ is the (k + 1)-th column of X.

If one of the columns x_k is a linear combination of the other columns, say

x₁ = c₁·1 + c₂x₂ + ⋯ + c_p x_p

then

rank(X) < p + 1   ⟹   rank(XᵀX) < p + 1

hence |XᵀX| = 0, and (XᵀX)⁻¹ does not exist; we are not able to solve

β̂ = (XᵀX)⁻¹ Xᵀy

Under multicollinearity:

|XᵀX| ≈ 0 (small)

It is computationally unstable to compute β̂ = (XᵀX)⁻¹Xᵀy, sometimes resulting in:

Insignificance of important predictors
Opposite sign of β̂ from the expected relationship
Large S.E. and wide C.I.

9.3 Detection of Multicollinearity Among x₁, . . . , xₚ

First look at the pairwise sample correlation

r_{l,m} = Σ_{i=1}^n (x_{i,l} − x̄_l)(x_{i,m} − x̄_m) / √( Σ_{i=1}^n (x_{i,l} − x̄_l)² · Σ_{i=1}^n (x_{i,m} − x̄_m)² )

r_{l,m} measures the linear association between any two x variables, x_l and x_m:

−1 ≤ r_{l,m} ≤ 1,  with r = ±1 a perfect linear relationship and r = 0 not linearly related.

The matrix of pairwise correlations is

[ 1        r_{1,2}   r_{1,3}   ⋯   r_{1,p} ]
[ r_{2,1}  1         r_{2,3}   ⋯   r_{2,p} ]
[ r_{3,1}  r_{3,2}   1         ⋯   r_{3,p} ]
[ ⋮        ⋮         ⋮         ⋱   ⋮       ]
[ r_{p,1}  r_{p,2}   r_{p,3}   ⋯   1       ]

|r_{l,m}| ≈ 1  ⟹  x_l, x_m are strongly linearly related.

¹ Nov 29, 2012
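The sample correlation formula translates directly to code (pure-Python sketch):

```python
from math import sqrt

def corr(x, y):
    """Pairwise sample correlation r between two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

print(corr([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0: perfect positive linear relation
print(corr([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0: perfect negative linear relation
```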


9.3.1 Formal Check of Multicollinearity: Variance Inflation Factors (VIF)

x_k is regressed (x_k is used as the response) on the remaining p − 1 x's:

x_{i,k} = β₀ + β₁x_{i,1} + ⋯ + β_{k−1}x_{i,k−1} + β_{k+1}x_{i,k+1} + ⋯ + βₚx_{i,p} + ε_i,   for k = 1, . . . , p

The resulting

R²_k = SSR_k / SST

is a measure of how strongly x_k is linearly related to the rest of the x's:

R²_k = 1 ⟹ perfectly linearly related
R²_k = 0 ⟹ not linearly related

VIF_k = 1 / (1 − R²_k)   (≥ 1),   k = 1, . . . , p

The general consensus is that if:

VIF_k > 10: strong evidence of multicollinearity
VIF_k ∈ [5, 10]: some evidence of multicollinearity
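The VIF computation and the rule of thumb above can be sketched as:

```python
def vif(r2_k):
    """Variance inflation factor from the auxiliary regression's R^2."""
    return 1.0 / (1.0 - r2_k)

def verdict(v):
    """Rule-of-thumb interpretation used in these notes."""
    if v > 10:
        return "strong evidence of multicollinearity"
    if v >= 5:
        return "some evidence of multicollinearity"
    return "no strong evidence"

print(round(vif(0.95), 1), "->", verdict(vif(0.95)))  # 20.0 -> strong evidence
print(round(vif(0.50), 1), "->", verdict(vif(0.50)))  # 2.0 -> no strong evidence
```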

9.4 Ridge Regression

Ridge regression is used when the design matrix X is multicollinear and the usual least squares estimate of β appears to be unstable.

LSE β̂: minimize (y − Xβ)ᵀ(y − Xβ)

For |XᵀX| ≈ 0, ridge regression makes the assumption that the regression coefficients are not likely to be very large. Suppose we place some upper bound on β:

Σ_{j=1}^p βⱼ² = βᵀβ < c

9.4.1 Minimize Subject to Constraints (Lagrange Multiplier Method)

Minimizing

(y − Xβ)ᵀ(y − Xβ) + c Σ_{j=1}^p βⱼ²

The 2nd term is a penalty depending on Σ_{j=1}^p βⱼ².

Note. c is just a constant, therefore we will just change it to λ.

Ridge Regression: minimize

(y − Xβ)ᵀ(y − Xβ) + λ Σ_{j=1}^p βⱼ²

In statistics, this is called shrinkage: you are shrinking Σ_{j=1}^p βⱼ² towards 0.

λ is a shrinkage parameter that you have to choose.


The ridge regression solution:

∂/∂β [ (y − Xβ)ᵀ(y − Xβ) + λβᵀβ ] = 0
2XᵀXβ − 2Xᵀy + 2λβ = 0
XᵀXβ − Xᵀy + λβ = 0
(XᵀX + λI)β = Xᵀy
β̂_λ = (XᵀX + λI)⁻¹ Xᵀy

Note. β̂_λ is biased for β (the LSE β̂ is unbiased).

Choose λ such that:

Bias is small
|XᵀX + λI| ≠ 0
Variance is not large
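A minimal numerical sketch of the ridge solution, with hypothetical data (two nearly collinear predictors, no intercept). The 2×2 system (XᵀX + λI)β = Xᵀy is solved with Cramer's rule so no linear-algebra library is needed; as λ grows, the coefficient norm shrinks, illustrating the trade-off above:

```python
def ridge_2d(x1, x2, y, lam):
    """Solve (X'X + lam*I) beta = X'y for two predictors via Cramer's rule."""
    a = sum(v * v for v in x1) + lam         # (X'X)[0,0] + lam
    b = sum(u * v for u, v in zip(x1, x2))   # (X'X)[0,1] = (X'X)[1,0]
    c = sum(v * v for v in x2) + lam         # (X'X)[1,1] + lam
    d = sum(u * v for u, v in zip(x1, y))    # (X'y)[0]
    e = sum(u * v for u, v in zip(x2, y))    # (X'y)[1]
    det = a * c - b * b
    return (d * c - b * e) / det, (a * e - b * d) / det

# Hypothetical data: x2 is almost a copy of x1 (near collinearity).
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.01, 1.99, 3.02, 3.98]
y = [2.0, 4.1, 5.9, 8.1]

for lam in (0.01, 1.0, 10.0):
    b1, b2 = ridge_2d(x1, x2, y, lam)
    print(lam, round(b1 * b1 + b2 * b2, 4))  # squared norm shrinks as lam grows
```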
