
Chapter 6

Multiple Regression: Tests of Hypotheses and Confidence Intervals

6.1 Test of Overall Regression

• Denote $\beta_1 = (\beta_1, \beta_2, \cdots, \beta_p)^T$ and $\beta = (\beta_0, \beta_1^T)^T$. Let $X = (1, X_1)$ and
  $X^* = (1, X_c)$, where $X_c = [I - \frac{1}{n}J]X_1$.

  Suppose $y = X\beta + e = X^* \binom{\alpha}{\beta_1} + e$. We wish to test $H_0: \beta_1 = 0$.

  Let $H_c = X_c (X_c^T X_c)^{-1} X_c^T$. Then $H_c$ can be rewritten as $H_c = H - \frac{1}{n} J_n$. Also note that
  $H_c 1 = 0$.

  – Partition of sum of squares:

    $$y^T \left( I_n - \tfrac{1}{n} J_n \right) y
      = y^T \left( H - \tfrac{1}{n} J_n \right) y + y^T (I - H) y
      = y^T H_c y + y^T \left( I - \tfrac{1}{n} J_n - H_c \right) y
      = SSR + SSE$$

• Theorem 8.1a:

  (i) $H_c (I_n - \frac{1}{n} J_n) = H_c$

  (ii) $H_c$ is idempotent with rank $p$

  (iii) $I - H = I - \frac{1}{n} J_n - H_c$ is idempotent with rank $n - p - 1$



  (iv) $H_c (I_n - \frac{1}{n} J_n - H_c) = (H - \frac{1}{n} J_n)(I_n - H) = 0$
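The identities in Theorem 8.1a are easy to check numerically. The following is a minimal R sketch on simulated data (the sample size, covariates, and coefficients are arbitrary choices for illustration, not from the notes):

# Sketch: verify Hc = H - (1/n)Jn and the SSR/SSE partition on simulated data
set.seed(1)
n <- 30; p <- 3
X1 <- matrix(rnorm(n * p), n, p)              # covariates without the intercept
y  <- drop(2 + X1 %*% c(1, -1, 0.5) + rnorm(n))
X  <- cbind(1, X1)                            # design matrix with intercept
J  <- matrix(1, n, n)
Xc <- (diag(n) - J/n) %*% X1                  # centered covariates
H  <- X %*% solve(t(X) %*% X) %*% t(X)
Hc <- Xc %*% solve(t(Xc) %*% Xc) %*% t(Xc)
max(abs(Hc - (H - J/n)))                      # ~ 0, so Hc = H - (1/n)Jn
SSR <- drop(t(y) %*% Hc %*% y)
SSE <- drop(t(y) %*% (diag(n) - H) %*% y)
SST <- drop(t(y) %*% (diag(n) - J/n) %*% y)
c(SSR + SSE, SST)                             # the two values agree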

• Theorem 8.1b: If $y \sim N_n(X\beta, \sigma^2 I)$ and the fitted model is $E(y \mid X) = X\beta$, then

  $$\frac{SSR}{\sigma^2} \sim \chi^2(p, \lambda), \quad \text{where } \lambda = \frac{1}{2\sigma^2} \beta_1^T X_c^T X_c \beta_1,
  \qquad \text{and} \qquad \frac{SSE}{\sigma^2} \sim \chi^2(n - p - 1).$$

  (Proof) In the previous chapter, we showed that $\lambda = \frac{1}{2\sigma^2} \beta^T X^T (H - \frac{1}{n} J_n) X \beta$.

  Now, writing the model in terms of the centered covariate matrix, $H - \frac{1}{n} J_n = H_c = X_c (X_c^T X_c)^{-1} X_c^T$, and

  $$\begin{aligned}
  \lambda &= \frac{1}{2\sigma^2} \beta^T X^T X_c (X_c^T X_c)^{-1} X_c^T X \beta \\
  &= \frac{1}{2\sigma^2} (\alpha, \beta_1^T)\, (1, X_c)^T X_c (X_c^T X_c)^{-1} X_c^T (1, X_c) \binom{\alpha}{\beta_1} \\
  &= \frac{1}{2\sigma^2} \beta_1^T X_c^T X_c \beta_1,
  \end{aligned}$$

  where the last equality uses $X_c^T 1 = 0$.

• Theorem 8.1c. If y is Nn ( X β, σ2 I ), then SSR and SSE are independent.

• Theorem 8.1d. If $y$ is $N_n(X\beta, \sigma^2 I)$, the distribution of $F = \dfrac{SSR/p}{SSE/(n - p - 1)}$ is as follows:

  (i) If $H_0: \beta_1 = 0$ is false, then $F$ is distributed as $F(p, n - p - 1, \lambda)$, where
      $\lambda = \frac{1}{2\sigma^2} \beta_1^T X_c^T X_c \beta_1$.

  (ii) If $H_0: \beta_1 = 0$ is true, then $\lambda = 0$ and $F$ is distributed as $F(p, n - p - 1)$.

6.2 Test on a subset of the β

• Consider y = X β + e = X1 β 1 + X2 β 2 + e. We are interested in testing H0 : β 2 =


0 where the length of β 2 is q.

Let H1 = X1 ( X1T X1 )−1 X1T .

  $$\begin{aligned}
  SST &= y^T \left( I_n - \tfrac{1}{n} J_n \right) y \\
      &= y^T (I - H) y + y^T (H - H_1) y + y^T \left( H_1 - \tfrac{1}{n} J_n \right) y \\
      &= SSE + SS(\beta_2 \mid \beta_1) + SSR(\text{reduced})
  \end{aligned}$$

  $SSR(\text{reduced})$ is the sum of squares due to fitting $X_1$ only in the reduced (null) model
  $y = X_1 \beta_1^* + e^*$. Thus, $SS(\beta_2 \mid \beta_1)$ is the "extra" regression sum of squares due to $\beta_2$ after
  adjusting for $\beta_1$, and can be expressed as $SSR(\text{full}) - SSR(\text{reduced})$.

• Theorem 8.2a. The matrix H − H1 = X ( X T X )−1 X T − X1 ( X1T X1 )−1 X1T is idem-


potent with rank q, where q is the number of elements in β 2 .

(Proof) We know H and H1 are idempotent. How about HH1 ? First observe
that HX = X ( X T X )−1 X T X = X or X = [ X ( X T X )−1 X T ] X.

Partitioning X on the left side and the last X on the right side, we obtain

( X1 , X2 ) = [ X ( X T X )−1 X T ]( X1 , X2 ) = [ X ( X T X )−1 X T X1 , X ( X T X )−1 X T X2 ]

Thus, X1 = [ X ( X T X )−1 X T ] X1 and X2 = [ X ( X T X )−1 X T ] X2 .

  Also, $[X (X^T X)^{-1} X^T] X_1 (X_1^T X_1)^{-1} X_1^T = H_1$, i.e., $H H_1 = H_1$ and $H_1 H = H_1$, from which we
  can establish that $H - H_1$ is idempotent. It follows that
  $\mathrm{rank}(H - H_1) = tr(H - H_1) = tr(H) - tr(H_1) = (p + 1) - (p + 1 - q) = q$.

• Theorem 8.2b: If $y$ is $N_n(X\beta, \sigma^2 I)$, then

  (i) $y^T (I - H) y / \sigma^2$ is $\chi^2(n - p - 1)$.

  (ii) $y^T (H - H_1) y / \sigma^2$ is $\chi^2(q, \lambda_1)$, where
       $\lambda_1 = \beta_2^T \left( X_2^T X_2 - X_2^T X_1 (X_1^T X_1)^{-1} X_1^T X_2 \right) \beta_2 / 2\sigma^2$.

  (iii) $y^T (I - H) y$ and $y^T (H - H_1) y$ are independent.

  (Proof) We showed (i) in the last chapter. For (iii), $(I - H)(H - H_1) = 0$. For (ii), applying the result
  from the last chapter, the noncentrality parameter is $\lambda_1 = \frac{1}{2\sigma^2} \beta^T X^T (H - H_1) X \beta$.

  Recall $H X_1 = X_1$ and $H X_2 = X_2$. Then,

  $$\begin{aligned}
  \beta^T X^T (H - H_1) X \beta
  &= (\beta_1^T X_1^T + \beta_2^T X_2^T)(H - H_1)(X_1 \beta_1 + X_2 \beta_2) \\
  &= (\beta_1^T X_1^T + \beta_2^T X_2^T - \beta_1^T X_1^T - \beta_2^T X_2^T H_1)(X_1 \beta_1 + X_2 \beta_2) \\
  &= \beta_2^T X_2^T X_1 \beta_1 - \beta_2^T X_2^T H_1 X_1 \beta_1 + \beta_2^T X_2^T X_2 \beta_2 - \beta_2^T X_2^T H_1 X_2 \beta_2 \\
  &= \beta_2^T \left( X_2^T X_2 - X_2^T X_1 (X_1^T X_1)^{-1} X_1^T X_2 \right) \beta_2,
  \end{aligned}$$

  where the first two terms in the third line cancel because $H_1 X_1 = X_1$.

• Theorem 8.2c: If $y$ is $N_n(X\beta, \sigma^2 I)$, consider

  $$F = \frac{y^T (H - H_1) y / q}{y^T (I - H) y / (n - p - 1)}
      = \frac{SS(\beta_2 \mid \beta_1)/q}{SSE/(n - p - 1)}
      = \frac{(\hat{\beta}^T X^T y - \hat{\beta}_1^{*T} X_1^T y)/q}{(y^T y - \hat{\beta}^T X^T y)/(n - p - 1)}.$$

  (i) If $H_0: \beta_2 = 0$ is false, then $F$ is distributed as $F(q, n - p - 1, \lambda_1)$, where
      $\lambda_1 = \frac{1}{2\sigma^2} \beta_2^T [X_2^T X_2 - X_2^T X_1 (X_1^T X_1)^{-1} X_1^T X_2] \beta_2$.

  (ii) If $H_0: \beta_2 = 0$ is true, then $\lambda_1 = 0$ and $F$ is distributed as $F(q, n - p - 1)$.

• Theorem 8.2d: If the model is partitioned as above,

  $$SS(\beta_2 \mid \beta_1) = y^T (H - H_1) y = \hat{\beta}_2^T [X_2^T X_2 - X_2^T X_1 (X_1^T X_1)^{-1} X_1^T X_2] \hat{\beta}_2.$$
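Theorem 8.2d can be checked numerically. Below is a small sketch on simulated data (dimensions and coefficients are made up for illustration), comparing $y^T(H - H_1)y$ with the quadratic form in $\hat{\beta}_2$:

# Sketch: SS(beta2|beta1) = y'(H - H1)y = b2hat' [X2'X2 - X2'X1 (X1'X1)^{-1} X1'X2] b2hat
set.seed(2)
n <- 40
X1 <- cbind(1, rnorm(n))                      # intercept plus one covariate
X2 <- cbind(rnorm(n), rnorm(n))               # q = 2 additional covariates
X  <- cbind(X1, X2)
y  <- drop(X %*% c(1, 2, 0.5, -0.5) + rnorm(n))
H  <- X  %*% solve(t(X)  %*% X)  %*% t(X)
H1 <- X1 %*% solve(t(X1) %*% X1) %*% t(X1)
bhat  <- solve(t(X) %*% X, t(X) %*% y)
b2hat <- bhat[3:4, , drop = FALSE]
M <- t(X2) %*% X2 - t(X2) %*% X1 %*% solve(t(X1) %*% X1) %*% t(X1) %*% X2
c(t(y) %*% (H - H1) %*% y, t(b2hat) %*% M %*% b2hat)   # the two values agree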

6.3 The general linear hypothesis test for H0 : Cβ = 0

• Motivation: a bio-equivalence test for a newly formulated medication claiming equal efficacy. For
  example, when we are interested in testing $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4$, it can be written as
  $H_0: C\beta = 0$, where

  $$C = \begin{pmatrix} 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 & -1 \end{pmatrix}$$

  with $\beta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)^T$.

• Theorem 8.4a. If $y$ is $N_n(X\beta, \sigma^2 I)$ and $C$ is a $q \times (p+1)$ matrix of rank $q \le p + 1$, then
  (i) $C\hat{\beta}$ is $N_q[C\beta,\ \sigma^2 C (X^T X)^{-1} C^T]$.

  (ii) $SSH/\sigma^2 = (C\hat{\beta})^T [C (X^T X)^{-1} C^T]^{-1} (C\hat{\beta}) / \sigma^2$ is $\chi^2(q, \lambda)$,
       where $\lambda = (C\beta)^T [C (X^T X)^{-1} C^T]^{-1} (C\beta) / 2\sigma^2$.

  (iii) $SSE/\sigma^2 = y^T [I - X (X^T X)^{-1} X^T] y / \sigma^2$ is $\chi^2(n - p - 1)$.

  (iv) $SSH$ and $SSE$ are independent,

  where $SSH$ is the sum of squares due to $C\beta$ (i.e. due to the hypothesis).

  (Proof) We show (iv) (independence). Write

  $$(C\hat{\beta})^T [C (X^T X)^{-1} C^T]^{-1} C\hat{\beta}
    = y^T X (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} X^T y
    = y^T A y.$$

  Independence can be established by showing $A(I - H) = 0$. Indeed,

  $$AH = X (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} X^T X (X^T X)^{-1} X^T = A,$$

  so $A(I - H) = A - AH = 0$.

The F test for $H_0: C\beta = 0$ versus $H_1: C\beta \neq 0$ is given in the following theorem.

• Theorem 8.4b: If $y$ is $N_n(X\beta, \sigma^2 I)$, consider

  $$F = \frac{SSH/q}{SSE/(n - p - 1)}
      = \frac{(C\hat{\beta})^T [C (X^T X)^{-1} C^T]^{-1} (C\hat{\beta}) / q}{SSE/(n - p - 1)},$$

  where $C$ is a $q \times (p+1)$ matrix of rank $q \le p + 1$. Then, the distribution of $F$ is as follows:

  (i) If $H_0: C\beta = 0$ is false, then $F$ is distributed as $F(q, n - p - 1, \lambda)$, where
      $\lambda = \frac{1}{2\sigma^2} (C\beta)^T [C (X^T X)^{-1} C^T]^{-1} (C\beta)$.

  (ii) If $H_0: C\beta = 0$ is true, then $\lambda = 0$ and $F$ is distributed as $F(q, n - p - 1)$.



• The F test for $H_0: C\beta = 0$ in the above theorem is called the general linear hypothesis test. The
  degrees of freedom $q$ is the number of linear combinations in $C\beta$.

• $SSH$ can be written as $(C\hat{\beta} - 0)^T [C (X^T X)^{-1} C^T]^{-1} (C\hat{\beta} - 0)$, which is a squared distance
  between $C\hat{\beta}$ and its value $C\beta = 0$ under the null. Intuitively, if $C\hat{\beta}$ is very different from $0$,
  the numerator of $F$ tends to be large.

• Estimation of $\beta$ under the constraint $C\beta = 0$.

  Let $\hat{\beta}_c$ be the least squares estimator under the constraint $C\beta = 0$. Then we can show

  $$\hat{\beta}_c = \hat{\beta} - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C \hat{\beta}.$$

  This constrained optimization problem can be solved by introducing a Lagrange multiplier, i.e. by
  minimizing $U(\beta, \lambda) = (y - X\beta)^T (y - X\beta) + \lambda^T (C\beta - 0)$. Setting the derivatives to zero,

  $$\frac{\partial U(\beta, \lambda)}{\partial \lambda} = C\beta = 0, \qquad
    \frac{\partial U(\beta, \lambda)}{\partial \beta} = -2 X^T y + 2 X^T X \beta + C^T \lambda = 0,$$

  so that

  $$\hat{\beta}_c = (X^T X)^{-1} X^T y - \tfrac{1}{2} (X^T X)^{-1} C^T \lambda.$$

  Since $C \hat{\beta}_c = 0$,

  $$C (X^T X)^{-1} X^T y - \tfrac{1}{2} C (X^T X)^{-1} C^T \lambda = 0,$$

  yielding

  $$\lambda = 2 [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} X^T y.$$

  Thus

  $$\begin{aligned}
  \hat{\beta}_c &= \hat{\beta} - \tfrac{1}{2} \cdot 2\, (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} X^T y \\
  &= \hat{\beta} - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C \hat{\beta} \\
  &= (I - B) \hat{\beta},
  \end{aligned}$$

  where $B = (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C$. Note that $B^2 = B$ and $(I - B)^2 = I - B$. Also,
  $\hat{\beta}_c$ and $C\hat{\beta}$ are independent, since $\mathrm{cov}(\hat{\beta}_c, C\hat{\beta}) = \sigma^2 (I - B)(X^T X)^{-1} C^T = 0$.
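As a small illustration (not part of the original notes), the constrained estimator $(I - B)\hat{\beta}$ can be computed directly and checked against the constraint; the data below are simulated, and the $C$ matrix is the one used in the motivation of this section:

# Sketch: constrained least squares beta_c = (I - B) beta_hat, with C beta_c = 0
set.seed(3)
n <- 50
X <- cbind(1, matrix(rnorm(n * 4), n, 4))     # intercept + 4 covariates
y <- drop(X %*% c(1, 2, 2, 2, 2) + rnorm(n))  # true coefficients satisfy the constraint
C <- rbind(c(0, 1, -1,  0,  0),
           c(0, 0,  1, -1,  0),
           c(0, 0,  0,  1, -1))               # H0: beta1 = beta2 = beta3 = beta4
XtXi   <- solve(t(X) %*% X)
bhat   <- XtXi %*% t(X) %*% y
B      <- XtXi %*% t(C) %*% solve(C %*% XtXi %*% t(C)) %*% C
bhat.c <- (diag(5) - B) %*% bhat
round(C %*% bhat.c, 10)                       # ~ 0: the constraint holds exactly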

• Theorem 8.4e: The mean vector and covariance matrix of $\hat{\beta}_c$ are

  (i) $E(\hat{\beta}_c) = \beta - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C\beta$

  (ii) $\mathrm{cov}(\hat{\beta}_c) = \sigma^2 (X^T X)^{-1} - \sigma^2 (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1}$

  (Proof) We show the covariance expression.

  $$\begin{aligned}
  \mathrm{cov}(\hat{\beta}_c) &= (I - B)\, \sigma^2 (X^T X)^{-1} (I - B)^T \\
  &= \sigma^2 \left( I - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C \right) (X^T X)^{-1}
     \left( I - C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} \right) \\
  &= \sigma^2 \left\{ (X^T X)^{-1} - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} \right\}.
  \end{aligned}$$

• Sum of squares decomposition:

  The sum of squares due to regression can be decomposed into two parts: the sum of squares due to
  the hypothesis (SSH) and the sum of squares due to the remaining regression after adjusting for the
  hypothesis. Indeed, we can show

  $$y^T H_X y = (C\hat{\beta})^T \{ C (X^T X)^{-1} C^T \}^{-1} C\hat{\beta}
      + \sigma^2 \hat{\beta}_c^T \{ \mathrm{var}(\hat{\beta}_c) \}^{-} \hat{\beta}_c.$$

  To show the above result, consider the singular value decomposition of $I - B$. Recall that
  $\hat{\beta}_c = (I - B)\hat{\beta}$, where $B = (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C$, and

  $$(I - B) = U D^* P^T = [U_1, U_2] \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}
      \begin{pmatrix} P_1^T \\ P_2^T \end{pmatrix} = U_1 D P_1^T,$$

  where $P_1$ and $U_1$ are $(p+1) \times (p+1-q)$ matrices and $D$ is a $(p+1-q) \times (p+1-q)$ matrix.

  Let $a = \begin{pmatrix} C \\ P_1^T \end{pmatrix} \beta = Q\beta$. Then, the regression model $y = X\beta + e$ can be
  expressed as $y = X Q^{-1} Q \beta + e = W a + e$.

  Then, we have

  $$y^T H_X y = y^T H_W y = \sigma^2 \hat{a}^T \mathrm{var}(\hat{a})^{-1} \hat{a}
      = \sigma^2 (Q\hat{\beta})^T \{ Q\, \mathrm{var}(\hat{\beta})\, Q^T \}^{-1} (Q\hat{\beta})$$
  $$= \left( (C\hat{\beta})^T,\ (P_1^T \hat{\beta})^T \right)
      \begin{pmatrix} C (X^T X)^{-1} C^T & C (X^T X)^{-1} P_1 \\
                      P_1^T (X^T X)^{-1} C^T & P_1^T (X^T X)^{-1} P_1 \end{pmatrix}^{-1}
      \begin{pmatrix} C\hat{\beta} \\ P_1^T \hat{\beta} \end{pmatrix}.$$

  In the above expression, we can show that the off-diagonal blocks vanish, i.e.
  $C (X^T X)^{-1} P_1 = \left( P_1^T (X^T X)^{-1} C^T \right)^T = 0$, from the following:
(a) ( I − B)( X T X )−1 C T = 0 since

B ( X T X ) −1 C T = ( X T X ) −1 C T [ C ( X T X ) −1 C T ] −1 C ( X T X ) −1 C T = ( X T X ) −1 C T .
(b) From (a), we have

( I − B)( X T X )−1 C T = 0

⇐⇒ U1 DP1T ( X T X )−1 C T = 0

⇐⇒ U1T U1 DP1T ( X T X )−1 C T = 0

⇐⇒ D −1 DP1T ( X T X )−1 C T = 0

⇐⇒ P1T ( X T X )−1 C T = 0.

  Then,

  $$y^T H_X y = (C\hat{\beta})^T \{ C (X^T X)^{-1} C^T \}^{-1} C\hat{\beta}
      + (P_1^T \hat{\beta})^T \{ P_1^T (X^T X)^{-1} P_1 \}^{-1} P_1^T \hat{\beta}.$$

  Now we show that $(P_1^T \hat{\beta})^T \{ P_1^T (X^T X)^{-1} P_1 \}^{-1} P_1^T \hat{\beta}
      = \sigma^2 \hat{\beta}_c^T \{ \mathrm{var}(\hat{\beta}_c) \}^{-} \hat{\beta}_c$.

  $$\begin{aligned}
  \sigma^2 \hat{\beta}_c^T \{ \mathrm{var}(\hat{\beta}_c) \}^{-} \hat{\beta}_c
  &= \sigma^2 \hat{\beta}^T (I - B)^T \{ (I - B)\, \mathrm{var}(\hat{\beta})\, (I - B)^T \}^{-} (I - B) \hat{\beta} \\
  &= \sigma^2 \hat{\beta}^T P_1 D U_1^T \{ U_1 D P_1^T \mathrm{var}(\hat{\beta}) P_1 D U_1^T \}^{-} U_1 D P_1^T \hat{\beta} \\
  &= \sigma^2 \hat{\beta}^T P_1 D U_1^T \left[ U_1 \{ D P_1^T \mathrm{var}(\hat{\beta}) P_1 D \}^{-1} U_1^T \right] U_1 D P_1^T \hat{\beta} \\
  &= \sigma^2 \hat{\beta}^T P_1 \{ P_1^T \mathrm{var}(\hat{\beta}) P_1 \}^{-1} P_1^T \hat{\beta} \\
  &= (P_1^T \hat{\beta})^T \{ P_1^T (X^T X)^{-1} P_1 \}^{-1} P_1^T \hat{\beta}.
  \end{aligned}$$

  (Note that $(P A P^T)^- = P A^- P^T$ when $P^T P = I$. In addition, when $A$ is invertible,
  $(P A P^T)^- = P A^{-1} P^T$.)

• When is $\beta_1^*$ ($\beta_1$ estimated under $C\beta = 0$) of interest instead of $\beta_1$? For example, a public health
  policy maker's point of view vs. a clinician's point of view.

6.4 Simultaneous Inference

• Familywise error rate: When we conduct multiple hypothesis tests, each with its own Type I error,
  the probability that all decisions are correct can be small. To address this problem, one may control
  the probability of falsely rejecting at least one hypothesis when all null hypotheses are true.

• Bonferroni correction: the Bonferroni approach can be applied to $k$ pre-specified hypotheses.

  Suppose that we carry out the $k$ tests of $H_{0j}: \beta_j = 0$, $j = 1, 2, \cdots, k$. Let $E_j$ be the event that
  the $j$th test rejects $H_{0j}$ when it is true, with $P(E_j) = \alpha_j$. The overall error rate $\alpha_f$ can be defined
  as $\alpha_f = P(\text{reject at least one } H_{0j} \text{ when all } H_{0j} \text{ are true}) = P(E_1 \text{ or } E_2 \cdots \text{ or } E_k)$. Then

  $$\alpha_f = P(E_1 \cup E_2 \cup \cdots \cup E_k) \le \sum_{j=1}^{k} P(E_j) = \sum_{j=1}^{k} \alpha_j.$$

  Let $\sum_{j=1}^{k} \alpha_j = \alpha$ be the desired overall error rate. One choice is $\alpha_j = \alpha/k$.

• Scheffé's method: Scheffé's method works for testing any linear combination of $\beta$.

  For $H_0: a^T \beta = 0$, for any $a \in \mathbb{R}^{p+1}$, we can consider the following F statistic:

  $$F_a = \frac{(a^T \hat{\beta})^T \left( a^T (X^T X)^{-1} a \right)^{-1} a^T \hat{\beta}}{s^2}
        = \frac{(a^T \hat{\beta})^T (a^T \hat{\beta})}{s^2\, a^T (X^T X)^{-1} a},$$

  where $s^2 = SSE/(n - p - 1)$.

  Scheffé's method is based on $\max_a F_a$.

  For a particular choice of $a$, we know that

  $$P\left\{ \frac{|a^T \hat{\beta} - a^T \beta|}{\sqrt{a^T S a}} \ge t_{n-p-1}(\alpha/2) \right\} = \alpha,
  \quad \text{or equivalently} \quad
  P\left\{ \frac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a} \ge t^2_{n-p-1}(\alpha/2) \right\} = \alpha,$$

  where $S = \widehat{\mathrm{var}}(\hat{\beta})$.

  Now we want a single cutoff $c$ that works simultaneously for every choice of $a$. Note that if

  $$P\left\{ \max_a \frac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a} \ge c \right\} = \alpha$$

  is satisfied, then, for any $a$,

  $$P\left\{ \frac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a} \ge c \right\} \le \alpha.$$

  Then, what is $\max_a \dfrac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a}$? By the Cauchy–Schwarz inequality,

  $$\max_a \frac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a} = (\hat{\beta} - \beta)^T S^{-1} (\hat{\beta} - \beta),$$

  and the maximum occurs when $a \propto S^{-1}(\hat{\beta} - \beta)$.

• Theorem 8.5. For $\beta \in \mathbb{R}^{p+1}$:

  (i) The maximum value of $F_a = \dfrac{(a^T \hat{\beta})^2}{s^2\, a^T (X^T X)^{-1} a}$ is

  $$\max_a \frac{(a^T \hat{\beta})^2}{s^2\, a^T (X^T X)^{-1} a} = \frac{\hat{\beta}^T X^T X \hat{\beta}}{s^2}.$$

  (ii) If $y$ is $N_n(X\beta, \sigma^2 I)$, then under $H_0: \beta = 0$, $\hat{\beta}^T X^T X \hat{\beta} / \{(p+1) s^2\}$ is distributed as
  $F(p+1, n-p-1)$. Thus $\max_a \dfrac{(a^T \hat{\beta})^2}{s^2\, a^T (X^T X)^{-1} a\, (p+1)}$ is distributed as
  $F(p+1, n-p-1)$ under $H_0$.

• Simultaneous intervals

  Using the results obtained in this section, one can construct simultaneous intervals for $\beta$.

  – The Bonferroni intervals for $\beta_1, \cdots, \beta_p$ are $\hat{\beta}_j \pm t_{n-p-1}(\alpha/2p)\, s \sqrt{g_{jj}}$, where $g_{jj}$ is
    the diagonal entry of $(X^T X)^{-1}$ corresponding to $\beta_j$. This implies that

    $$P\left( \forall j,\ \beta_j \in \left( \hat{\beta}_j - t_{n-p-1}(\alpha/2p)\, s \sqrt{g_{jj}},\
        \hat{\beta}_j + t_{n-p-1}(\alpha/2p)\, s \sqrt{g_{jj}} \right) \right) \ge 1 - \alpha.$$

  – Scheffé's intervals for all possible linear functions $a^T \beta$ are
    $a^T \hat{\beta} \pm s \sqrt{(p+1) F_{p+1, n-p-1}(\alpha)\, a^T (X^T X)^{-1} a}$, which implies that

    $$P\left( \forall a \in \mathbb{R}^{p+1},\ a^T \beta \in \left( a^T \hat{\beta} - s \sqrt{(p+1) F_{p+1, n-p-1}(\alpha)\, a^T (X^T X)^{-1} a},\right.\right.$$
    $$\left.\left. a^T \hat{\beta} + s \sqrt{(p+1) F_{p+1, n-p-1}(\alpha)\, a^T (X^T X)^{-1} a} \right) \right) \ge 1 - \alpha.$$
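As a rough sketch (simulated data, not from the notes), both sets of intervals can be computed from a fitted lm object, using summary(fit)$cov.unscaled for $(X^T X)^{-1}$:

# Sketch: Bonferroni and Scheffe intervals for the slopes of a fitted model
set.seed(4)
n <- 40; p <- 3
X1  <- matrix(rnorm(n * p), n, p)
y   <- drop(1 + X1 %*% c(1, 0, -1) + rnorm(n))
dat <- data.frame(y = y, X1)
fit <- lm(y ~ ., data = dat)
s    <- summary(fit)$sigma
G    <- summary(fit)$cov.unscaled              # (X'X)^{-1}
bhat <- coef(fit)
alpha <- 0.05
tb  <- qt(1 - alpha/(2 * p), df = n - p - 1)             # Bonferroni critical value
sc  <- sqrt((p + 1) * qf(1 - alpha, p + 1, n - p - 1))   # Scheffe critical value
se  <- s * sqrt(diag(G)[-1])                             # standard errors of the p slopes
cbind(bonf.lo = bhat[-1] - tb * se, bonf.hi = bhat[-1] + tb * se,
      sch.lo  = bhat[-1] - sc * se, sch.hi  = bhat[-1] + sc * se)

The Scheffé intervals are wider for these particular coordinates, but they cover every linear combination $a^T\beta$ simultaneously.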

6.5 Numerical Examples

• 8.37: gas vapor example (We repeat the data description here.)

  When gasoline is pumped into the tank of a car, vapors are vented into the atmosphere. An
  experiment was conducted to determine whether y, the amount of vapor, can be predicted using the
  following four variables based on initial conditions of the tank and the dispensed gasoline:

  x1 = tank temperature (°F)

  x2 = gasoline temperature (°F)

  x3 = vapor pressure in tank (psi)

  x4 = vapor pressure of gasoline (psi)

  – Test the overall regression hypothesis $H_0: \beta_1 = 0$.



fit = lm(y ~ ., data = gas)
SST = sum((gas$y - mean(gas$y))^2)
SSR = sum((fit$fitted - mean(gas$y))^2)
SSE = sum((gas$y - fit$fitted)^2)
F = (SSR/4)/(SSE/(32-4-1))
F

[1] 84.54

summary(fit)$f

value numdf dendf
84.54  4.00 27.00

1-pf(F, 4, 27) # p-value

[1] 7.327e-15

  – Test $H_0: \beta_1 = \beta_3 = 0$.

X = model.matrix(fit)
fit2 = lm(y ~ x2 + x4, data = gas)
X1 = model.matrix(fit2)
H = X %*% solve(t(X) %*% X) %*% t(X)
H1 = X1 %*% solve(t(X1) %*% X1) %*% t(X1)
SS.beta2.beta1 = t(gas$y) %*% (H - H1) %*% gas$y
SSE = t(gas$y) %*% (diag(dim(X)[1]) - H) %*% gas$y
F = (SS.beta2.beta1/2)/(SSE/(32-4-1))
F

[,1]
[1,] 2.493

1-pf(F, 2,27) # p-value

[,1]

[1,] 0.1015

#Alternatively
anova(fit2,fit)

Analysis of Variance Table

Model 1: y ~ x2 + x4
Model 2: y ~ x1 + x2 + x3 + x4
Res.Df RSS Df Sum of Sq F Pr(>F)

1 29 238
2 27 201 2 37.2 2.49 0.1

  – Test $H_0: \beta_1 = \beta_2 = 12\beta_3 = 12\beta_4$.

C=matrix(c(0,1,-1,0,0,0,0,1,-12,0,0,0,0,1,-1),nrow=3,byrow=F)
C

[,1] [,2] [,3] [,4] [,5]

[1,] 0 1 -1 0 0
[2,] 0 0 1 -12 0
[3,] 0 0 0 1 -1

hat.beta=solve(t(X)%*%X)%*%t(X)%*%gas$y

SSH=t(C%*%hat.beta)%*%solve(C%*%solve(t(X)%*%X)%*%t(C))%*%(C%*%hat.beta)
F=(SSH/3)/(SSE/(32-4-1))
F

[,1]
[1,] 10.57

1-pf(F, 3,27) # p-value

[,1]
[1,] 8.99e-05
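As an aside (not in the original notes), the same general linear hypothesis F test can be obtained with the car package, assuming it is installed and the objects fit and C from above are in the workspace:

# Sketch: general linear hypothesis test via car (assumes install.packages("car") has been run)
library(car)
linearHypothesis(fit, C)   # F test of H0: C beta = 0; should reproduce F = 10.57 on (3, 27) df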
Chapter 7

Multiple Regression: Model Validation and Diagnostics

In this chapter we consider various approaches to checking the model and the attendant assumptions
for adequacy and validity.

7.1 Residuals

• Assumptions: $E(e) = 0$, $\mathrm{cov}(e) = \sigma^2 I$. Note that $e$ is unobservable. Instead, we observe the
  residuals, which can be written as

  $$\hat{e} = y - X\hat{\beta} = (I - H)y = (I - H)(X\beta + e) = (I - H)e, \quad \text{since } HX = X.$$

  $HX = X$ implies $H1 = 1$ and $HX_j = X_j$ when $X_j$ is a column of $X$, i.e. $X = [1, X_1, \cdots, X_p]$.

• Properties of residuals:

  $E(\hat{e}) = 0$,

  $\mathrm{cov}(\hat{e}) = \sigma^2 (I - H)$,

  $\mathrm{cov}(\hat{e}, \hat{y}) = 0$,

  $\sum_{i=1}^{n} \hat{e}_i = 0$,

  $\hat{e}^T y = y^T (I - H) y = SSE$,

  $\hat{e}^T \hat{y} = y^T (I - H) H y = 0$,

  $\hat{e}^T X = y^T (I - H) X = 0$.

  If the model is correctly specified, plots of $\hat{e}$ versus $\hat{y}$, or of $\hat{e}$ versus $X$ (or any $X_j$), should
  not reveal any patterns.
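These properties are easy to verify numerically; the following is a minimal sketch on simulated data (variable names and sample size are arbitrary):

# Sketch: numerical check of the residual properties
set.seed(6)
n  <- 25
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
e   <- resid(fit)
X   <- model.matrix(fit)
c(sum(e),                         # = 0
  sum(e * fitted(fit)),           # e'yhat = 0
  max(abs(t(X) %*% e)))           # e'X = 0 for every column of X
c(sum(e * y), sum(e^2))           # e'y = SSE, so the two values agree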

7.2 Hat matrix

• Theorem 9.2. Let $H = ((h_{ij}))$.

  (i) $\frac{1}{n} \le h_{ii} \le 1$ for $i = 1, \cdots, n$

  (ii) $-\frac{1}{2} \le h_{ij} \le \frac{1}{2}$ for $i \neq j$

  (iii) $h_{ii} = \frac{1}{n} + (x_{ci} - \bar{x})(X_c^T X_c)^{-1}(x_{ci} - \bar{x})^T$,

  where $x_{ci} = (x_{i1}, x_{i2}, \cdots, x_{ip})$, $\bar{x} = (\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_p)$, and $(x_{ci} - \bar{x})$ is the $i$th row of the
  centered matrix $X_c$. [Also note that $h_{ii} = x_i (X^T X)^{-1} x_i^T$, where $x_i = (1, x_{i1}, x_{i2}, \cdots, x_{ip})$.]

  (iv) $tr(H) = \sum_i h_{ii} = p + 1$

  (Proof) Recall that

  $$H = \frac{1}{n} J + H_c = \frac{1}{n} J + X_c (X_c^T X_c)^{-1} X_c^T. \tag{7.1}$$

  (i) The lower bound follows from (7.1). The upper bound follows from $H = H^2$, i.e.,

  $$h_{ii} = (h_{i1}, h_{i2}, \cdots, h_{in})(h_{i1}, h_{i2}, \cdots, h_{in})^T = \sum_{j=1}^{n} h_{ij}^2 = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2. \tag{7.2}$$

  Dividing (7.2) by $h_{ii}\ (\ge 1/n)$, we obtain $1 = h_{ii} + \frac{\sum_{j \neq i} h_{ij}^2}{h_{ii}}$, which implies $h_{ii} \le 1$.

  (ii) can be shown using (7.2). (Check yourself.)

  (iii) follows from (7.1).



7.3 Leverage points

• Recall that $H1 = 1$, so each row (and, by symmetry, each column) of $H$ sums to one:
  $\sum_{j=1}^{n} h_{ij} = \sum_{i=1}^{n} h_{ij} = 1$.

  Since $\hat{y} = Hy$,

  $$\hat{y}_i = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j.$$

  If the $i$th observation heavily influences its own fitted value (i.e. $y_i \approx \hat{y}_i$), this suggests that $h_{ii}$
  is close to 1. Recall that $1/n \le h_{ii} \le 1$. Such an observation is called a leverage point. Note that
  leverage points can be identified from $X$ alone (without $y$), since they are determined through $H$.
we can identify through H.

• Another view of leverage points (1):

  The second term of

  $$h_{ii} = \frac{1}{n} + (x_{ci} - \bar{x})(X_c^T X_c)^{-1}(x_{ci} - \bar{x})^T$$

  is an estimated Mahalanobis distance and provides a good measure of the relative distance of each
  $x_{ci}$ from the center of the points, as represented by $\bar{x}$. (e.g. the hospital rating example)
Hospital rating example)

• Another view of leverage points (2):

  Recall that $\mathrm{var}(\hat{e}_i) = \sigma^2 (1 - h_{ii})$. When $h_{ii} \approx 1$, $\mathrm{var}(\hat{e}_i) \approx 0$, which implies
  $0 \approx \hat{e}_i = y_i - \hat{y}_i$.
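A small sketch (simulated data; the leverage cutoff 2(p+1)/n is a common rule of thumb, not from the notes) illustrating the decomposition of $h_{ii}$ and the identification of a leverage point:

# Sketch: hat values, their Mahalanobis decomposition, and tr(H) = p + 1
set.seed(7)
n <- 30; p <- 2
X1 <- matrix(rnorm(n * p), n, p)
X1[1, ] <- c(6, 6)                       # push one point far from the center of the x's
y   <- drop(1 + X1 %*% c(1, -1) + rnorm(n))
fit <- lm(y ~ X1)
h   <- hatvalues(fit)
Xc  <- scale(X1, center = TRUE, scale = FALSE)
mahal <- diag(Xc %*% solve(t(Xc) %*% Xc) %*% t(Xc))
max(abs(h - (1/n + mahal)))              # ~ 0: h_ii = 1/n + Mahalanobis term
sum(h)                                   # = p + 1 = 3
which(h > 2 * (p + 1) / n)               # observation 1 is flagged (rule-of-thumb cutoff)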

7.4 Outliers

Since the residuals do not all have the same variance, it is desirable to scale them. There are two
common methods of scaling.

• (Internally) Studentized residual

  Write $\hat{e}_i \equiv e_i$. (What is the problem with using $e_i/\hat{\sigma}$ for diagnostics?) Define the studentized
  residual as

  $$r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}},$$

  where $\hat{\sigma}^2 = SSE/(n - p - 1)$.

  In simple regression,

  $$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}
  \qquad \text{and} \qquad
  r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}}}.$$

  – Properties:

    (i) $\sum_i r_i \neq 0$

    (ii) $E(r_i) = 0$

    (iii) $\mathrm{var}(r_i) = 1$

    (iv) $\mathrm{cov}(r_i, r_j) = \mathrm{corr}(e_i, e_j) = -\dfrac{h_{ij}}{\sqrt{1 - h_{ii}}\sqrt{1 - h_{jj}}}$

  – Note that $r_i$ is not distributed as $t$ (why not?). In fact, if $y \sim N_n(X\beta, \sigma^2 I_n)$,
    $\dfrac{r_i^2}{n - p - 1} \sim \mathrm{Beta}(1/2, (n - p - 2)/2)$. (Also, $e_i^2/SSE \le 1 - h_{ii}$ (Problem 9.4), and hence
    $-\sqrt{n - p - 1} \le r_i \le \sqrt{n - p - 1}$.)

• R-studentized residuals (externally studentized residuals):

  R-studentized residuals are calculated by replacing $\hat{\sigma}$ with the standard error computed from the
  $n - 1$ observations remaining after omitting the $i$th observation:

  $$t_i = \frac{e_i}{\hat{\sigma}_{(i)} \sqrt{1 - h_{ii}}}$$

  is distributed as $t(n - p - 2)$, where $\hat{\sigma}^2_{(i)} = \sum_{j \neq i} (y_j - x_j \hat{\beta}_{(i)})^2 / (n - p - 2)$.

  (Proof) First we express $\hat{\sigma}^2_{(i)}$ in terms of a recognizable quadratic form. To do that we will derive

  $$(n - p - 2)\, \hat{\sigma}^2_{(i)} = (n - p - 1)\, \hat{\sigma}^2 - \frac{e_i^2}{1 - h_{ii}}.$$

  To proceed, let $x_i$ be the $i$th row of $X$. Then

  $$\begin{aligned}
  \hat{\beta}_{(i)} &= (X_{(i)}^T X_{(i)})^{-1} X_{(i)}^T y_{(i)}
    = (X^T X - x_i^T x_i)^{-1} (X^T y - x_i^T y_i) \\
  &= \left\{ (X^T X)^{-1} + \frac{(X^T X)^{-1} x_i^T x_i (X^T X)^{-1}}{1 - h_{ii}} \right\} (X^T y - x_i^T y_i),
  \end{aligned}$$

  where the last equality uses the fact that $(A - uv^T)^{-1} = A^{-1} + \dfrac{A^{-1} u v^T A^{-1}}{1 - v^T A^{-1} u}$.

  Thus

  $$\hat{\beta}_{(i)} = \hat{\beta} - (X^T X)^{-1} x_i^T y_i + \frac{(X^T X)^{-1} x_i^T x_i (X^T X)^{-1} (X^T y - x_i^T y_i)}{1 - h_{ii}},$$

  which simplifies to $\hat{\beta}_{(i)} = \hat{\beta} - \dfrac{(X^T X)^{-1} x_i^T e_i}{1 - h_{ii}}$. It is convenient to note that
  $y_i - x_i \hat{\beta}_{(i)} \equiv e_{(i)} = e_i + \dfrac{h_{ii} e_i}{1 - h_{ii}} = \dfrac{e_i}{1 - h_{ii}}$. (This in turn yields
  $\hat{\beta}_{(i)} = \hat{\beta} - (X^T X)^{-1} x_i^T e_{(i)}$.)

  $$\begin{aligned}
  (n - p - 2)\, \hat{\sigma}^2_{(i)}
  &= \sum_{j=1}^{n} (y_j - x_j \hat{\beta}_{(i)})^2 - (y_i - x_i \hat{\beta}_{(i)})^2 \\
  &= \sum_{j=1}^{n} \left( e_j + \frac{h_{ji} e_i}{1 - h_{ii}} \right)^2 - \left( \frac{e_i}{1 - h_{ii}} \right)^2 \qquad (7.3) \\
  &= \sum_{j=1}^{n} e_j^2 - \frac{e_i^2}{1 - h_{ii}}
   = (n - p - 1)\, \hat{\sigma}^2 - \frac{e_i^2}{1 - h_{ii}}.
  \end{aligned}$$

  (Here we use $H\hat{e} = H(I - H)y = 0$, so $\sum_{j=1}^{n} e_j h_{ji} = 0$, together with $\sum_{j=1}^{n} h_{ji}^2 = h_{ii}$.)

  It remains to show

  (i) $\dfrac{1}{\sigma^2} (n - p - 2)\, \hat{\sigma}^2_{(i)} \sim \chi^2(n - p - 2)$

  (ii) $(n - p - 2)\, \hat{\sigma}^2_{(i)} \perp \dfrac{e_i^2}{1 - h_{ii}}$
  (iii) $\dfrac{1}{\sigma^2} \dfrac{e_i^2}{1 - h_{ii}} \sim \chi^2(1)$.

  We show (iii) first. Note that $e_i = u_i (I - H) y$, where $u_i$ is a row vector of 0's except for a 1 in the
  $i$th position. Let $K_i = u_i^T u_i$. Then

  $$\frac{e_i^2}{1 - h_{ii}} = y^T (I - H) K_i (I - H) y / (1 - h_{ii}). \tag{7.4}$$

  We can show that $L = (I - H) K_i (I - H) / (1 - h_{ii})$ is idempotent using $K_i (I - H) K_i = (1 - h_{ii}) K_i$.
  With $tr(L) = \mathrm{rank}(L) = 1$, we can show (iii).

  To prove (ii), we can write from (7.3) and (7.4) that

  $$(n - p - 2)\, \hat{\sigma}^2_{(i)} = y^T (I - H) y - y^T L y.$$

  Independence follows since $L(I - H - L) = L(I - H) - L = 0$.

  For (i), we can check that $I - H - L$ is idempotent with $\mathrm{rank}(I - H - L) = n - p - 2$.

  – From this result, one can construct a $t$-test based on $t_i$ for testing $H_0: \theta = 0$ in the mean-shift
    outlier model $E(y_i \mid x_i) = x_i \beta + \theta$. Since $n$ such tests will be made, a correction for multiple
    comparisons should be considered.
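A minimal sketch of this outlier test (simulated data with one planted outlier; the shift size and cutoff are arbitrary choices):

# Sketch: mean-shift outlier test via externally studentized residuals + Bonferroni correction
set.seed(8)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[5] <- y[5] + 6                              # plant an outlier at observation 5
fit  <- lm(y ~ x)
ti   <- rstudent(fit)                         # externally studentized residuals
pval <- 2 * pt(-abs(ti), df = n - 3)          # df = n - p - 2 with p = 1
which(p.adjust(pval, method = "bonferroni") < 0.05)   # typically flags observation 5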

7.5 Prediction sum of squares (PRESS)

The PREdiction Sum of Squares (PRESS) is defined as

$$PRESS \equiv \sum_{i=1}^{n} (y_i - \hat{y}_{i(i)})^2,$$

where $\hat{y}_{i(i)} = x_i \hat{\beta}_{(i)}$ and $\hat{\beta}_{(i)}$ is the estimate of $\beta$ using the data without the $i$th observation.

PRESS can also be expressed as

$$PRESS = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2.$$

• The first expression requires fitting the regression $n$ times, while the second expression requires
  fitting it only once.

• A scaled residual $e_i/(1 - h_{ii})$ that corresponds to a large value of $h_{ii}$ contributes more to PRESS.
  For a given dataset, PRESS may be a better measure than SSE of how well the model will predict
  future observations (why?). When the objective is prediction, one can choose a model with small
  PRESS among candidate models.
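The equivalence of the two expressions can be checked directly; below is a sketch on simulated data comparing the n leave-one-out refits with the single-fit shortcut:

# Sketch: PRESS by leave-one-out refitting vs. the shortcut sum((e_i/(1 - h_ii))^2)
set.seed(9)
n <- 30
x <- rnorm(n)
y <- 1 + x + rnorm(n)
dat <- data.frame(y = y, x = x)
fit <- lm(y ~ x, data = dat)
press.loo <- sum(sapply(seq_len(n), function(i) {
  fi <- lm(y ~ x, data = dat[-i, ])                        # refit without observation i
  (dat$y[i] - predict(fi, newdata = dat[i, ]))^2
}))
press.short <- sum((resid(fit) / (1 - hatvalues(fit)))^2)  # single fit
c(press.loo, press.short)                                  # identical up to rounding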

7.6 Influential observations

• Cook's distance

  $$D_i = (\hat{y} - \hat{y}_{(i)})^T (\hat{y} - \hat{y}_{(i)}) / \{(p+1)\hat{\sigma}^2\}
        = (\hat{\beta} - \hat{\beta}_{(i)})^T (X^T X)(\hat{\beta} - \hat{\beta}_{(i)}) / \{(p+1)\hat{\sigma}^2\}.$$

  Substituting $\hat{\beta}_{(i)} = \hat{\beta} - \dfrac{(X^T X)^{-1} x_i^T e_i}{1 - h_{ii}}$ into $D_i$,

  $$D_i = \frac{1}{(p+1)\hat{\sigma}^2} \cdot \frac{e_i^2}{(1 - h_{ii})^2}\, x_i (X^T X)^{-1} x_i^T
        = \frac{1}{p+1} \left( \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}} \right)^2 \frac{h_{ii}}{1 - h_{ii}}
        = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}}.$$

  – $D_i$ represents a Mahalanobis distance between $\hat{\beta}$ and $\hat{\beta}_{(i)}$. Points with $D_i > 1$ are considered
    influential points.
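The closed-form expression can be checked against R's built-in cooks.distance(); a small sketch with simulated data:

# Sketch: Cook's distance from r_i^2 h_ii / {(p + 1)(1 - h_ii)} vs. cooks.distance()
set.seed(10)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
ri  <- rstandard(fit)                 # internally studentized residuals
h   <- hatvalues(fit)
p   <- 2                              # two covariates, so p + 1 = 3 parameters
D   <- ri^2 * h / ((p + 1) * (1 - h))
max(abs(D - cooks.distance(fit)))     # ~ 0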

• DFFITS

  $$DFFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{\hat{\sigma}^2_{(i)} h_{ii}}},
  \qquad \text{where} \qquad
  \hat{y}_i - \hat{y}_{i(i)} = x_i (\hat{\beta} - \hat{\beta}_{(i)}) = \frac{h_{ii} e_i}{1 - h_{ii}}.$$

  Thus

  $$DFFITS_i = \frac{1}{\sqrt{\hat{\sigma}^2_{(i)} h_{ii}}} \cdot \frac{h_{ii} e_i}{1 - h_{ii}}
  = \frac{e_i}{\hat{\sigma}_{(i)} \sqrt{1 - h_{ii}}} \sqrt{\frac{h_{ii}}{1 - h_{ii}}}
  = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}},$$

  where $t_i$ is the R-studentized residual.

• DFBETAS:

  $$DFBETAS_{ji} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{\hat{\sigma}^2_{(i)} c_{jj}}},$$

  where $c_{jj}$ denotes the $j$th diagonal element of $(X^T X)^{-1}$.

  Alternatively,

  $$DFBETAS_{ji} = \frac{a_{ji}}{\sqrt{c_{jj}}} \cdot \frac{t_i}{\sqrt{1 - h_{ii}}},$$

  where $a_{ji}$ is the $(j, i)$th element of $A = (X^T X)^{-1} X^T$.

  We can derive the second expression from the first. Since

  $$\hat{\beta} - \hat{\beta}_{(i)} = \frac{(X^T X)^{-1} x_i^T e_i}{1 - h_{ii}} = \frac{a_i e_i}{1 - h_{ii}},$$

  where $a_i$ is the $i$th column of the matrix $A$, we have $\hat{\beta}_j - \hat{\beta}_{j(i)} = \dfrac{a_{ji} e_i}{1 - h_{ii}}$. Therefore

  $$DFBETAS_{ji} = \frac{a_{ji} e_i}{1 - h_{ii}} \cdot \frac{1}{\sqrt{\hat{\sigma}^2_{(i)} c_{jj}}}
  = \frac{a_{ji}}{\sqrt{c_{jj}}} \cdot \frac{t_i}{\sqrt{1 - h_{ii}}}.$$

  When $|DFBETAS_{ji}| > \dfrac{2}{\sqrt{n}}$, the $i$th data point is considered influential.
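The DFFITS and DFBETAS formulas above can likewise be checked against R's dffits() and dfbetas(); a sketch on simulated data:

# Sketch: DFFITS_i = t_i sqrt(h_ii/(1 - h_ii)) and the DFBETAS formula vs. built-in functions
set.seed(11)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
ti  <- rstudent(fit); h <- hatvalues(fit)
max(abs(dffits(fit) - ti * sqrt(h / (1 - h))))        # ~ 0
X   <- model.matrix(fit)
A   <- solve(t(X) %*% X) %*% t(X)                     # A = (X'X)^{-1} X'
cjj <- diag(solve(t(X) %*% X))
man <- t(A / sqrt(cjj)) * (ti / sqrt(1 - h))          # manual DFBETAS, an n x (p+1) matrix
max(abs(dfbetas(fit) - man))                          # ~ 0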

7.7 Numerical Examples

• 9.10 Gas vapor example

  – Plot $\hat{e}_i$, $r_i$, and $t_i$ versus $\hat{y}_i$, and $h_{ii}$ versus observation number.

#hat matrix

hii=hat(model.matrix(fit))
#usual residual
ei=gas$y-fit$fitted

#studentized residual
ri=rstandard(fit)
#R studentized residual (external)

ti=rstudent(fit)

par(mfrow = c(2, 2),cex.main=1.5,cex.lab=1.5,cex.axis=1.5,

mar=c(4,4,2,1))
plot(fit$fitted, ei, xlab=expression( hat(y)[i]),
ylab=expression(e[i]))

abline(h=0,col='gray')
plot(fit$fitted, ri, xlab=expression( hat(y)[i]),
ylab=expression(r[i]))

abline(h=0,col='gray')
plot(fit$fitted, ti, xlab=expression( hat(y)[i]),

ylab=expression(t[i]))
abline(h=0,col='gray')
plot(1:length(hii), hii, xlab='observation number',

ylab=expression(h[ii]), ylim=c(0,1))

  [Figure: diagnostic plots for the gas vapor fit — $e_i$, $r_i$, and $t_i$ versus $\hat{y}_i$, and $h_{ii}$ versus
  observation number.]

  – Influence measures

#cooks distance
cooks.distance(fit)

#DFFITS
dffits(fit)

#DFBETAS
dfbetas(fit)
