
Chapter 6

Multiple Regression: Tests of Hypotheses and Confidence Intervals

6.1 Test of Overall Regression

• Denote $\beta_1 = (\beta_1, \beta_2, \cdots, \beta_p)^T$ and $\beta = (\beta_0, \beta_1^T)^T$. Let $X = (1, X_1)$ and
  $X^* = (1, X_c)$, where $X_c = [I - \frac{1}{n}J]X_1$.

  Suppose $y = X\beta + e = X^* \binom{\alpha}{\beta_1} + e$. We wish to test $H_0: \beta_1 = 0$.

  Let $H_c = X_c (X_c^T X_c)^{-1} X_c^T$. Then $H_c$ can be rewritten as $H_c = H - \frac{1}{n} J_n$. Also note that
  $H_c 1 = 0$.

  – Partition of sum of squares:

    $$y^T \left( I_n - \tfrac{1}{n} J_n \right) y
      = y^T \left( H - \tfrac{1}{n} J_n \right) y + y^T (I - H) y
      = y^T H_c y + y^T \left( I - \tfrac{1}{n} J_n - H_c \right) y
      = SSR + SSE$$

• Theorem 8.1a:

  (i) $H_c (I_n - \frac{1}{n} J_n) = H_c$

  (ii) $H_c$ is idempotent with rank $p$

  (iii) $I - H = I - \frac{1}{n} J_n - H_c$ is idempotent with rank $n - p - 1$



  (iv) $H_c (I_n - \frac{1}{n} J_n - H_c) = (H - \frac{1}{n} J_n)(I_n - H) = 0$
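The identities in Theorem 8.1a are easy to check numerically. The following is a minimal R sketch on simulated data (the sample size, covariates, and coefficients are arbitrary choices for illustration, not from the notes):

# Sketch: verify Hc = H - (1/n)Jn and the SSR/SSE partition on simulated data
set.seed(1)
n <- 30; p <- 3
X1 <- matrix(rnorm(n * p), n, p)              # covariates without the intercept
y  <- drop(2 + X1 %*% c(1, -1, 0.5) + rnorm(n))
X  <- cbind(1, X1)                            # design matrix with intercept
J  <- matrix(1, n, n)
Xc <- (diag(n) - J/n) %*% X1                  # centered covariates
H  <- X %*% solve(t(X) %*% X) %*% t(X)
Hc <- Xc %*% solve(t(Xc) %*% Xc) %*% t(Xc)
max(abs(Hc - (H - J/n)))                      # ~ 0, so Hc = H - (1/n)Jn
SSR <- drop(t(y) %*% Hc %*% y)
SSE <- drop(t(y) %*% (diag(n) - H) %*% y)
SST <- drop(t(y) %*% (diag(n) - J/n) %*% y)
c(SSR + SSE, SST)                             # the two values agree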

• Theorem 8.1b: If $y \sim N_n(X\beta, \sigma^2 I)$ and the fitted model is $E(y \mid X) = X\beta$, then

  $$\frac{SSR}{\sigma^2} \sim \chi^2(p, \lambda), \quad \text{where } \lambda = \frac{1}{2\sigma^2} \beta_1^T X_c^T X_c \beta_1,
  \qquad \text{and} \qquad \frac{SSE}{\sigma^2} \sim \chi^2(n - p - 1).$$

  (Proof) In the previous chapter, we showed that $\lambda = \frac{1}{2\sigma^2} \beta^T X^T (H - \frac{1}{n} J_n) X \beta$.

  Now, writing the model in terms of the centered covariate matrix, $H - \frac{1}{n} J_n = H_c = X_c (X_c^T X_c)^{-1} X_c^T$, and

  $$\begin{aligned}
  \lambda &= \frac{1}{2\sigma^2} \beta^T X^T X_c (X_c^T X_c)^{-1} X_c^T X \beta \\
  &= \frac{1}{2\sigma^2} (\alpha, \beta_1^T)\, (1, X_c)^T X_c (X_c^T X_c)^{-1} X_c^T (1, X_c) \binom{\alpha}{\beta_1} \\
  &= \frac{1}{2\sigma^2} \beta_1^T X_c^T X_c \beta_1,
  \end{aligned}$$

  where the last equality uses $X_c^T 1 = 0$.

• Theorem 8.1c. If y is Nn ( X β, σ2 I ), then SSR and SSE are independent.

• Theorem 8.1d. If $y$ is $N_n(X\beta, \sigma^2 I)$, the distribution of $F = \dfrac{SSR/p}{SSE/(n - p - 1)}$ is as follows:

  (i) If $H_0: \beta_1 = 0$ is false, then $F$ is distributed as $F(p, n - p - 1, \lambda)$, where
      $\lambda = \frac{1}{2\sigma^2} \beta_1^T X_c^T X_c \beta_1$.

  (ii) If $H_0: \beta_1 = 0$ is true, then $\lambda = 0$ and $F$ is distributed as $F(p, n - p - 1)$.

6.2 Test on a subset of the β

• Consider y = X β + e = X1 β 1 + X2 β 2 + e. We are interested in testing H0 : β 2 =


0 where the length of β 2 is q.

Let H1 = X1 ( X1T X1 )−1 X1T .

  $$\begin{aligned}
  SST &= y^T \left( I_n - \tfrac{1}{n} J_n \right) y \\
      &= y^T (I - H) y + y^T (H - H_1) y + y^T \left( H_1 - \tfrac{1}{n} J_n \right) y \\
      &= SSE + SS(\beta_2 \mid \beta_1) + SSR(\text{reduced})
  \end{aligned}$$

  $SSR(\text{reduced})$ is the sum of squares due to fitting $X_1$ only in the reduced (null) model
  $y = X_1 \beta_1^* + e^*$. Thus, $SS(\beta_2 \mid \beta_1)$ is the "extra" regression sum of squares due to $\beta_2$ after
  adjusting for $\beta_1$, and can be expressed as $SSR(\text{full}) - SSR(\text{reduced})$.

• Theorem 8.2a. The matrix H − H1 = X ( X T X )−1 X T − X1 ( X1T X1 )−1 X1T is idem-


potent with rank q, where q is the number of elements in β 2 .

(Proof) We know H and H1 are idempotent. How about HH1 ? First observe
that HX = X ( X T X )−1 X T X = X or X = [ X ( X T X )−1 X T ] X.

Partitioning X on the left side and the last X on the right side, we obtain

( X1 , X2 ) = [ X ( X T X )−1 X T ]( X1 , X2 ) = [ X ( X T X )−1 X T X1 , X ( X T X )−1 X T X2 ]

Thus, X1 = [ X ( X T X )−1 X T ] X1 and X2 = [ X ( X T X )−1 X T ] X2 .

  Also, $[X (X^T X)^{-1} X^T] X_1 (X_1^T X_1)^{-1} X_1^T = H_1$, i.e., $H H_1 = H_1$ and $H_1 H = H_1$, from which we
  can establish that $H - H_1$ is idempotent. It follows that
  $\mathrm{rank}(H - H_1) = tr(H - H_1) = tr(H) - tr(H_1) = (p + 1) - (p + 1 - q) = q$.

• Theorem 8.2b: If $y$ is $N_n(X\beta, \sigma^2 I)$, then

  (i) $y^T (I - H) y / \sigma^2$ is $\chi^2(n - p - 1)$.

  (ii) $y^T (H - H_1) y / \sigma^2$ is $\chi^2(q, \lambda_1)$, where
       $\lambda_1 = \beta_2^T \left( X_2^T X_2 - X_2^T X_1 (X_1^T X_1)^{-1} X_1^T X_2 \right) \beta_2 / 2\sigma^2$.

  (iii) $y^T (I - H) y$ and $y^T (H - H_1) y$ are independent.

  (Proof) We showed (i) in the last chapter. For (iii), $(I - H)(H - H_1) = 0$. For (ii), applying the result
  from the last chapter, the noncentrality parameter is $\lambda_1 = \frac{1}{2\sigma^2} \beta^T X^T (H - H_1) X \beta$.

  Recall $H X_1 = X_1$ and $H X_2 = X_2$. Then,

  $$\begin{aligned}
  \beta^T X^T (H - H_1) X \beta
  &= (\beta_1^T X_1^T + \beta_2^T X_2^T)(H - H_1)(X_1 \beta_1 + X_2 \beta_2) \\
  &= (\beta_1^T X_1^T + \beta_2^T X_2^T - \beta_1^T X_1^T - \beta_2^T X_2^T H_1)(X_1 \beta_1 + X_2 \beta_2) \\
  &= \beta_2^T X_2^T X_1 \beta_1 - \beta_2^T X_2^T H_1 X_1 \beta_1 + \beta_2^T X_2^T X_2 \beta_2 - \beta_2^T X_2^T H_1 X_2 \beta_2 \\
  &= \beta_2^T \left( X_2^T X_2 - X_2^T X_1 (X_1^T X_1)^{-1} X_1^T X_2 \right) \beta_2,
  \end{aligned}$$

  where the first two terms in the third line cancel because $H_1 X_1 = X_1$.

• Theorem 8.2c: If $y$ is $N_n(X\beta, \sigma^2 I)$, consider

  $$F = \frac{y^T (H - H_1) y / q}{y^T (I - H) y / (n - p - 1)}
      = \frac{SS(\beta_2 \mid \beta_1)/q}{SSE/(n - p - 1)}
      = \frac{(\hat{\beta}^T X^T y - \hat{\beta}_1^{*T} X_1^T y)/q}{(y^T y - \hat{\beta}^T X^T y)/(n - p - 1)}.$$

  (i) If $H_0: \beta_2 = 0$ is false, then $F$ is distributed as $F(q, n - p - 1, \lambda_1)$, where
      $\lambda_1 = \frac{1}{2\sigma^2} \beta_2^T [X_2^T X_2 - X_2^T X_1 (X_1^T X_1)^{-1} X_1^T X_2] \beta_2$.

  (ii) If $H_0: \beta_2 = 0$ is true, then $\lambda_1 = 0$ and $F$ is distributed as $F(q, n - p - 1)$.

• Theorem 8.2d: If the model is partitioned as above,

  $$SS(\beta_2 \mid \beta_1) = y^T (H - H_1) y = \hat{\beta}_2^T [X_2^T X_2 - X_2^T X_1 (X_1^T X_1)^{-1} X_1^T X_2] \hat{\beta}_2.$$
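Theorem 8.2d can be checked numerically. Below is a small sketch on simulated data (dimensions and coefficients are made up for illustration), comparing $y^T(H - H_1)y$ with the quadratic form in $\hat{\beta}_2$:

# Sketch: SS(beta2|beta1) = y'(H - H1)y = b2hat' [X2'X2 - X2'X1 (X1'X1)^{-1} X1'X2] b2hat
set.seed(2)
n <- 40
X1 <- cbind(1, rnorm(n))                      # intercept plus one covariate
X2 <- cbind(rnorm(n), rnorm(n))               # q = 2 additional covariates
X  <- cbind(X1, X2)
y  <- drop(X %*% c(1, 2, 0.5, -0.5) + rnorm(n))
H  <- X  %*% solve(t(X)  %*% X)  %*% t(X)
H1 <- X1 %*% solve(t(X1) %*% X1) %*% t(X1)
bhat  <- solve(t(X) %*% X, t(X) %*% y)
b2hat <- bhat[3:4, , drop = FALSE]
M <- t(X2) %*% X2 - t(X2) %*% X1 %*% solve(t(X1) %*% X1) %*% t(X1) %*% X2
c(t(y) %*% (H - H1) %*% y, t(b2hat) %*% M %*% b2hat)   # the two values agree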

6.3 The general linear hypothesis test for H0 : Cβ = 0

• Motivation: a bio-equivalence test for a newly formulated medication claiming equal efficacy. For
  example, when we are interested in testing $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4$, it can be written as
  $H_0: C\beta = 0$, where

  $$C = \begin{pmatrix} 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 & -1 \end{pmatrix}$$

  with $\beta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)^T$.

• Theorem 8.4a. If $y$ is $N_n(X\beta, \sigma^2 I)$ and $C$ is a $q \times (p+1)$ matrix of rank $q \le p + 1$, then
  (i) $C\hat{\beta}$ is $N_q[C\beta,\ \sigma^2 C (X^T X)^{-1} C^T]$.

  (ii) $SSH/\sigma^2 = (C\hat{\beta})^T [C (X^T X)^{-1} C^T]^{-1} (C\hat{\beta}) / \sigma^2$ is $\chi^2(q, \lambda)$,
       where $\lambda = (C\beta)^T [C (X^T X)^{-1} C^T]^{-1} (C\beta) / 2\sigma^2$.

  (iii) $SSE/\sigma^2 = y^T [I - X (X^T X)^{-1} X^T] y / \sigma^2$ is $\chi^2(n - p - 1)$.

  (iv) $SSH$ and $SSE$ are independent,

  where $SSH$ is the sum of squares due to $C\beta$ (i.e. due to the hypothesis).

  (Proof) We show (iv) (independence). Write

  $$(C\hat{\beta})^T [C (X^T X)^{-1} C^T]^{-1} C\hat{\beta}
    = y^T X (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} X^T y
    = y^T A y.$$

  Independence can be established by showing $A(I - H) = 0$. Indeed,

  $$AH = X (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} X^T X (X^T X)^{-1} X^T = A,$$

  so $A(I - H) = A - AH = 0$.

The F test for $H_0: C\beta = 0$ versus $H_1: C\beta \neq 0$ is given in the following theorem.

• Theorem 8.4b: If $y$ is $N_n(X\beta, \sigma^2 I)$, consider

  $$F = \frac{SSH/q}{SSE/(n - p - 1)}
      = \frac{(C\hat{\beta})^T [C (X^T X)^{-1} C^T]^{-1} (C\hat{\beta}) / q}{SSE/(n - p - 1)},$$

  where $C$ is a $q \times (p+1)$ matrix of rank $q \le p + 1$. Then, the distribution of $F$ is as follows:

  (i) If $H_0: C\beta = 0$ is false, then $F$ is distributed as $F(q, n - p - 1, \lambda)$, where
      $\lambda = \frac{1}{2\sigma^2} (C\beta)^T [C (X^T X)^{-1} C^T]^{-1} (C\beta)$.

  (ii) If $H_0: C\beta = 0$ is true, then $\lambda = 0$ and $F$ is distributed as $F(q, n - p - 1)$.



• The F test for $H_0: C\beta = 0$ in the above theorem is called the general linear hypothesis test. The
  degrees of freedom $q$ is the number of linear combinations in $C\beta$.

• $SSH$ can be written as $(C\hat{\beta} - 0)^T [C (X^T X)^{-1} C^T]^{-1} (C\hat{\beta} - 0)$, which is a squared distance
  between $C\hat{\beta}$ and its value $C\beta = 0$ under the null. Intuitively, if $C\hat{\beta}$ is very different from $0$,
  the numerator of $F$ tends to be large.

• Estimation of $\beta$ under the constraint $C\beta = 0$.

  Let $\hat{\beta}_c$ be the least squares estimator under the constraint $C\beta = 0$. Then we can show

  $$\hat{\beta}_c = \hat{\beta} - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C \hat{\beta}.$$

  This constrained optimization problem can be solved by introducing a Lagrange multiplier, i.e. by
  minimizing $U(\beta, \lambda) = (y - X\beta)^T (y - X\beta) + \lambda^T (C\beta - 0)$. Setting the derivatives to zero,

  $$\frac{\partial U(\beta, \lambda)}{\partial \lambda} = C\beta = 0, \qquad
    \frac{\partial U(\beta, \lambda)}{\partial \beta} = -2 X^T y + 2 X^T X \beta + C^T \lambda = 0,$$

  so that

  $$\hat{\beta}_c = (X^T X)^{-1} X^T y - \tfrac{1}{2} (X^T X)^{-1} C^T \lambda.$$

  Since $C \hat{\beta}_c = 0$,

  $$C (X^T X)^{-1} X^T y - \tfrac{1}{2} C (X^T X)^{-1} C^T \lambda = 0,$$

  yielding

  $$\lambda = 2 [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} X^T y.$$

  Thus

  $$\begin{aligned}
  \hat{\beta}_c &= \hat{\beta} - \tfrac{1}{2} \cdot 2\, (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} X^T y \\
  &= \hat{\beta} - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C \hat{\beta} \\
  &= (I - B) \hat{\beta},
  \end{aligned}$$

  where $B = (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C$. Note that $B^2 = B$ and $(I - B)^2 = I - B$. Also,
  $\hat{\beta}_c$ and $C\hat{\beta}$ are independent, since $\mathrm{cov}(\hat{\beta}_c, C\hat{\beta}) = \sigma^2 (I - B)(X^T X)^{-1} C^T = 0$.
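As a small illustration (not part of the original notes), the constrained estimator $(I - B)\hat{\beta}$ can be computed directly and checked against the constraint; the data below are simulated, and the $C$ matrix is the one used in the motivation of this section:

# Sketch: constrained least squares beta_c = (I - B) beta_hat, with C beta_c = 0
set.seed(3)
n <- 50
X <- cbind(1, matrix(rnorm(n * 4), n, 4))     # intercept + 4 covariates
y <- drop(X %*% c(1, 2, 2, 2, 2) + rnorm(n))  # true coefficients satisfy the constraint
C <- rbind(c(0, 1, -1,  0,  0),
           c(0, 0,  1, -1,  0),
           c(0, 0,  0,  1, -1))               # H0: beta1 = beta2 = beta3 = beta4
XtXi   <- solve(t(X) %*% X)
bhat   <- XtXi %*% t(X) %*% y
B      <- XtXi %*% t(C) %*% solve(C %*% XtXi %*% t(C)) %*% C
bhat.c <- (diag(5) - B) %*% bhat
round(C %*% bhat.c, 10)                       # ~ 0: the constraint holds exactly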

• Theorem 8.4e: The mean vector and covariance matrix of $\hat{\beta}_c$ are

  (i) $E(\hat{\beta}_c) = \beta - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C\beta$

  (ii) $\mathrm{cov}(\hat{\beta}_c) = \sigma^2 (X^T X)^{-1} - \sigma^2 (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1}$

  (Proof) We show the covariance expression.

  $$\begin{aligned}
  \mathrm{cov}(\hat{\beta}_c) &= (I - B)\, \sigma^2 (X^T X)^{-1} (I - B)^T \\
  &= \sigma^2 \left( I - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C \right) (X^T X)^{-1}
     \left( I - C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} \right) \\
  &= \sigma^2 \left\{ (X^T X)^{-1} - (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C (X^T X)^{-1} \right\}.
  \end{aligned}$$

• Sum of squares decomposition:

  The sum of squares due to regression can be decomposed into two parts: the sum of squares due to
  the hypothesis (SSH) and the sum of squares due to the remaining regression after adjusting for the
  hypothesis. Indeed, we can show

  $$y^T H_X y = (C\hat{\beta})^T \{ C (X^T X)^{-1} C^T \}^{-1} C\hat{\beta}
      + \sigma^2 \hat{\beta}_c^T \{ \mathrm{var}(\hat{\beta}_c) \}^{-} \hat{\beta}_c.$$

  To show the above result, consider the singular value decomposition of $I - B$. Recall that
  $\hat{\beta}_c = (I - B)\hat{\beta}$, where $B = (X^T X)^{-1} C^T [C (X^T X)^{-1} C^T]^{-1} C$, and

  $$(I - B) = U D^* P^T = [U_1, U_2] \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}
      \begin{pmatrix} P_1^T \\ P_2^T \end{pmatrix} = U_1 D P_1^T,$$

  where $P_1$ and $U_1$ are $(p+1) \times (p+1-q)$ matrices and $D$ is a $(p+1-q) \times (p+1-q)$ matrix.

  Let $a = \begin{pmatrix} C \\ P_1^T \end{pmatrix} \beta = Q\beta$. Then, the regression model $y = X\beta + e$ can be
  expressed as $y = X Q^{-1} Q \beta + e = W a + e$.

  Then, we have

  $$y^T H_X y = y^T H_W y = \sigma^2 \hat{a}^T \mathrm{var}(\hat{a})^{-1} \hat{a}
      = \sigma^2 (Q\hat{\beta})^T \{ Q\, \mathrm{var}(\hat{\beta})\, Q^T \}^{-1} (Q\hat{\beta})$$
  $$= \left( (C\hat{\beta})^T,\ (P_1^T \hat{\beta})^T \right)
      \begin{pmatrix} C (X^T X)^{-1} C^T & C (X^T X)^{-1} P_1 \\
                      P_1^T (X^T X)^{-1} C^T & P_1^T (X^T X)^{-1} P_1 \end{pmatrix}^{-1}
      \begin{pmatrix} C\hat{\beta} \\ P_1^T \hat{\beta} \end{pmatrix}.$$

  In the above expression, we can show that the off-diagonal blocks vanish, i.e.
  $C (X^T X)^{-1} P_1 = \left( P_1^T (X^T X)^{-1} C^T \right)^T = 0$, from the following:
(a) ( I − B)( X T X )−1 C T = 0 since

B ( X T X ) −1 C T = ( X T X ) −1 C T [ C ( X T X ) −1 C T ] −1 C ( X T X ) −1 C T = ( X T X ) −1 C T .
(b) From (a), we have

( I − B)( X T X )−1 C T = 0

⇐⇒ U1 DP1T ( X T X )−1 C T = 0

⇐⇒ U1T U1 DP1T ( X T X )−1 C T = 0

⇐⇒ D −1 DP1T ( X T X )−1 C T = 0

⇐⇒ P1T ( X T X )−1 C T = 0.

  Then,

  $$y^T H_X y = (C\hat{\beta})^T \{ C (X^T X)^{-1} C^T \}^{-1} C\hat{\beta}
      + (P_1^T \hat{\beta})^T \{ P_1^T (X^T X)^{-1} P_1 \}^{-1} P_1^T \hat{\beta}.$$

  Now we show that $(P_1^T \hat{\beta})^T \{ P_1^T (X^T X)^{-1} P_1 \}^{-1} P_1^T \hat{\beta}
      = \sigma^2 \hat{\beta}_c^T \{ \mathrm{var}(\hat{\beta}_c) \}^{-} \hat{\beta}_c$.

  $$\begin{aligned}
  \sigma^2 \hat{\beta}_c^T \{ \mathrm{var}(\hat{\beta}_c) \}^{-} \hat{\beta}_c
  &= \sigma^2 \hat{\beta}^T (I - B)^T \{ (I - B)\, \mathrm{var}(\hat{\beta})\, (I - B)^T \}^{-} (I - B) \hat{\beta} \\
  &= \sigma^2 \hat{\beta}^T P_1 D U_1^T \{ U_1 D P_1^T \mathrm{var}(\hat{\beta}) P_1 D U_1^T \}^{-} U_1 D P_1^T \hat{\beta} \\
  &= \sigma^2 \hat{\beta}^T P_1 D U_1^T \left[ U_1 \{ D P_1^T \mathrm{var}(\hat{\beta}) P_1 D \}^{-1} U_1^T \right] U_1 D P_1^T \hat{\beta} \\
  &= \sigma^2 \hat{\beta}^T P_1 \{ P_1^T \mathrm{var}(\hat{\beta}) P_1 \}^{-1} P_1^T \hat{\beta} \\
  &= (P_1^T \hat{\beta})^T \{ P_1^T (X^T X)^{-1} P_1 \}^{-1} P_1^T \hat{\beta}.
  \end{aligned}$$

  (Note that $(P A P^T)^- = P A^- P^T$ when $P^T P = I$. In addition, when $A$ is invertible,
  $(P A P^T)^- = P A^{-1} P^T$.)

• When is $\beta_1^*$ ($\beta_1$ estimated under $C\beta = 0$) of interest instead of $\beta_1$? For example, a public health
  policy maker's point of view vs. a clinician's point of view.

6.4 Simultaneous Inference

• Familywise error rate: When we conduct multiple hypothesis tests, each with its own Type I error,
  the probability that all decisions are correct can be small. To address this problem, one may control
  the probability of falsely rejecting at least one hypothesis when all null hypotheses are true.

• Bonferroni correction: the Bonferroni approach can be applied to $k$ pre-specified hypotheses.

  Suppose that we carry out the $k$ tests of $H_{0j}: \beta_j = 0$, $j = 1, 2, \cdots, k$. Let $E_j$ be the event that
  the $j$th test rejects $H_{0j}$ when it is true, with $P(E_j) = \alpha_j$. The overall error rate $\alpha_f$ can be defined
  as $\alpha_f = P(\text{reject at least one } H_{0j} \text{ when all } H_{0j} \text{ are true}) = P(E_1 \text{ or } E_2 \cdots \text{ or } E_k)$. Then

  $$\alpha_f = P(E_1 \cup E_2 \cup \cdots \cup E_k) \le \sum_{j=1}^{k} P(E_j) = \sum_{j=1}^{k} \alpha_j.$$

  Let $\sum_{j=1}^{k} \alpha_j = \alpha$ be the desired overall error rate. One choice is $\alpha_j = \alpha/k$.

• Scheffé's method: Scheffé's method works for testing any linear combination of $\beta$.

  For $H_0: a^T \beta = 0$, for any $a \in \mathbb{R}^{p+1}$, we can consider the following F statistic:

  $$F_a = \frac{(a^T \hat{\beta})^T \left( a^T (X^T X)^{-1} a \right)^{-1} a^T \hat{\beta}}{s^2}
        = \frac{(a^T \hat{\beta})^T (a^T \hat{\beta})}{s^2\, a^T (X^T X)^{-1} a},$$

  where $s^2 = SSE/(n - p - 1)$.

  Scheffé's method is based on $\max_a F_a$.

  For a particular choice of $a$, we know that

  $$P\left\{ \frac{|a^T \hat{\beta} - a^T \beta|}{\sqrt{a^T S a}} \ge t_{n-p-1}(\alpha/2) \right\} = \alpha,
  \quad \text{or equivalently} \quad
  P\left\{ \frac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a} \ge t^2_{n-p-1}(\alpha/2) \right\} = \alpha,$$

  where $S = \widehat{\mathrm{var}}(\hat{\beta})$.

  Now we want a single cutoff $c$ that works simultaneously for every choice of $a$. Note that if

  $$P\left\{ \max_a \frac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a} \ge c \right\} = \alpha$$

  is satisfied, then, for any $a$,

  $$P\left\{ \frac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a} \ge c \right\} \le \alpha.$$

  Then, what is $\max_a \dfrac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a}$? By the Cauchy–Schwarz inequality,

  $$\max_a \frac{(a^T \hat{\beta} - a^T \beta)^2}{a^T S a} = (\hat{\beta} - \beta)^T S^{-1} (\hat{\beta} - \beta),$$

  and the maximum occurs when $a \propto S^{-1}(\hat{\beta} - \beta)$.

• Theorem 8.5. For $\beta \in \mathbb{R}^{p+1}$:

  (i) The maximum value of $F_a = \dfrac{(a^T \hat{\beta})^2}{s^2\, a^T (X^T X)^{-1} a}$ is

  $$\max_a \frac{(a^T \hat{\beta})^2}{s^2\, a^T (X^T X)^{-1} a} = \frac{\hat{\beta}^T X^T X \hat{\beta}}{s^2}.$$

  (ii) If $y$ is $N_n(X\beta, \sigma^2 I)$, then under $H_0: \beta = 0$, $\hat{\beta}^T X^T X \hat{\beta} / \{(p+1) s^2\}$ is distributed as
  $F(p+1, n-p-1)$. Thus $\max_a \dfrac{(a^T \hat{\beta})^2}{s^2\, a^T (X^T X)^{-1} a\, (p+1)}$ is distributed as
  $F(p+1, n-p-1)$ under $H_0$.

• Simultaneous intervals

  Using the results obtained in this section, one can construct simultaneous intervals for $\beta$.

  – The Bonferroni intervals for $\beta_1, \cdots, \beta_p$ are $\hat{\beta}_j \pm t_{n-p-1}(\alpha/2p)\, s \sqrt{g_{jj}}$, where $g_{jj}$ is
    the diagonal entry of $(X^T X)^{-1}$ corresponding to $\beta_j$. This implies that

    $$P\left( \forall j,\ \beta_j \in \left( \hat{\beta}_j - t_{n-p-1}(\alpha/2p)\, s \sqrt{g_{jj}},\
        \hat{\beta}_j + t_{n-p-1}(\alpha/2p)\, s \sqrt{g_{jj}} \right) \right) \ge 1 - \alpha.$$

  – Scheffé's intervals for all possible linear functions $a^T \beta$ are
    $a^T \hat{\beta} \pm s \sqrt{(p+1) F_{p+1, n-p-1}(\alpha)\, a^T (X^T X)^{-1} a}$, which implies that

    $$P\left( \forall a \in \mathbb{R}^{p+1},\ a^T \beta \in \left( a^T \hat{\beta} - s \sqrt{(p+1) F_{p+1, n-p-1}(\alpha)\, a^T (X^T X)^{-1} a},\right.\right.$$
    $$\left.\left. a^T \hat{\beta} + s \sqrt{(p+1) F_{p+1, n-p-1}(\alpha)\, a^T (X^T X)^{-1} a} \right) \right) \ge 1 - \alpha.$$
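As a rough sketch (simulated data, not from the notes), both sets of intervals can be computed from a fitted lm object, using summary(fit)$cov.unscaled for $(X^T X)^{-1}$:

# Sketch: Bonferroni and Scheffe intervals for the slopes of a fitted model
set.seed(4)
n <- 40; p <- 3
X1  <- matrix(rnorm(n * p), n, p)
y   <- drop(1 + X1 %*% c(1, 0, -1) + rnorm(n))
dat <- data.frame(y = y, X1)
fit <- lm(y ~ ., data = dat)
s    <- summary(fit)$sigma
G    <- summary(fit)$cov.unscaled              # (X'X)^{-1}
bhat <- coef(fit)
alpha <- 0.05
tb  <- qt(1 - alpha/(2 * p), df = n - p - 1)             # Bonferroni critical value
sc  <- sqrt((p + 1) * qf(1 - alpha, p + 1, n - p - 1))   # Scheffe critical value
se  <- s * sqrt(diag(G)[-1])                             # standard errors of the p slopes
cbind(bonf.lo = bhat[-1] - tb * se, bonf.hi = bhat[-1] + tb * se,
      sch.lo  = bhat[-1] - sc * se, sch.hi  = bhat[-1] + sc * se)

The Scheffé intervals are wider for these particular coordinates, but they cover every linear combination $a^T\beta$ simultaneously.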

6.5 Numerical Examples

• 8.37: gas vapor example (We repeat the data description here.)

  When gasoline is pumped into the tank of a car, vapors are vented into the atmosphere. An
  experiment was conducted to determine whether y, the amount of vapor, can be predicted using the
  following four variables based on initial conditions of the tank and the dispensed gasoline:

  x1 = tank temperature (°F)

  x2 = gasoline temperature (°F)

  x3 = vapor pressure in tank (psi)

  x4 = vapor pressure of gasoline (psi)

  – Test the overall regression hypothesis $H_0: \beta_1 = 0$.



fit = lm(y ~ ., data = gas)
SST = sum((gas$y - mean(gas$y))^2)
SSR = sum((fit$fitted - mean(gas$y))^2)
SSE = sum((gas$y - fit$fitted)^2)
F = (SSR/4)/(SSE/(32-4-1))
F

[1] 84.54

summary(fit)$f

value numdf dendf
84.54  4.00 27.00

1-pf(F, 4, 27) # p-value

[1] 7.327e-15

  – Test $H_0: \beta_1 = \beta_3 = 0$.

X = model.matrix(fit)
fit2 = lm(y ~ x2 + x4, data = gas)
X1 = model.matrix(fit2)
H = X %*% solve(t(X) %*% X) %*% t(X)
H1 = X1 %*% solve(t(X1) %*% X1) %*% t(X1)
SS.beta2.beta1 = t(gas$y) %*% (H - H1) %*% gas$y
SSE = t(gas$y) %*% (diag(dim(X)[1]) - H) %*% gas$y
F = (SS.beta2.beta1/2)/(SSE/(32-4-1))
F

[,1]
[1,] 2.493

1-pf(F, 2,27) # p-value

[,1]

[1,] 0.1015

#Alternatively
anova(fit2,fit)

Analysis of Variance Table

Model 1: y ~ x2 + x4
Model 2: y ~ x1 + x2 + x3 + x4
Res.Df RSS Df Sum of Sq F Pr(>F)

1 29 238
2 27 201 2 37.2 2.49 0.1

  – Test $H_0: \beta_1 = \beta_2 = 12\beta_3 = 12\beta_4$.

C=matrix(c(0,1,-1,0,0,0,0,1,-12,0,0,0,0,1,-1),nrow=3,byrow=F)
C

[,1] [,2] [,3] [,4] [,5]

[1,] 0 1 -1 0 0
[2,] 0 0 1 -12 0
[3,] 0 0 0 1 -1

hat.beta=solve(t(X)%*%X)%*%t(X)%*%gas$y

SSH=t(C%*%hat.beta)%*%solve(C%*%solve(t(X)%*%X)%*%t(C))%*%(C%*%hat.beta)
F=(SSH/3)/(SSE/(32-4-1))
F

[,1]
[1,] 10.57

1-pf(F, 3,27) # p-value

[,1]
[1,] 8.99e-05
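As an aside (not in the original notes), the same general linear hypothesis F test can be obtained with the car package, assuming it is installed and the objects fit and C from above are in the workspace:

# Sketch: general linear hypothesis test via car (assumes install.packages("car") has been run)
library(car)
linearHypothesis(fit, C)   # F test of H0: C beta = 0; should reproduce F = 10.57 on (3, 27) df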
Chapter 7

Multiple Regression: Model Validation and Diagnostics

In this chapter we consider various approaches to checking the model and the attendant assumptions
for adequacy and validity.

7.1 Residuals

• Assumptions: $E(e) = 0$, $\mathrm{cov}(e) = \sigma^2 I$. Note that $e$ is unobservable. Instead, we observe the
  residuals, which can be written as

  $$\hat{e} = y - X\hat{\beta} = (I - H)y = (I - H)(X\beta + e) = (I - H)e, \quad \text{since } HX = X.$$

  $HX = X$ implies $H1 = 1$ and $HX_j = X_j$ when $X_j$ is a column of $X$, i.e. $X = [1, X_1, \cdots, X_p]$.

• Properties of residuals:

  $E(\hat{e}) = 0$,

  $\mathrm{cov}(\hat{e}) = \sigma^2 (I - H)$,

  $\mathrm{cov}(\hat{e}, \hat{y}) = 0$,

  $\sum_{i=1}^{n} \hat{e}_i = 0$,

  $\hat{e}^T y = y^T (I - H) y = SSE$,

  $\hat{e}^T \hat{y} = y^T (I - H) H y = 0$,

  $\hat{e}^T X = y^T (I - H) X = 0$.

  If the model is correctly specified, plots of $\hat{e}$ versus $\hat{y}$, or of $\hat{e}$ versus $X$ (or any $X_j$), should
  not reveal any patterns.
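These properties are easy to verify numerically; the following is a minimal sketch on simulated data (variable names and sample size are arbitrary):

# Sketch: numerical check of the residual properties
set.seed(6)
n  <- 25
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
e   <- resid(fit)
X   <- model.matrix(fit)
c(sum(e),                         # = 0
  sum(e * fitted(fit)),           # e'yhat = 0
  max(abs(t(X) %*% e)))           # e'X = 0 for every column of X
c(sum(e * y), sum(e^2))           # e'y = SSE, so the two values agree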

7.2 Hat matrix

• Theorem 9.2. Let $H = ((h_{ij}))$.

  (i) $\frac{1}{n} \le h_{ii} \le 1$ for $i = 1, \cdots, n$

  (ii) $-\frac{1}{2} \le h_{ij} \le \frac{1}{2}$ for $i \neq j$

  (iii) $h_{ii} = \frac{1}{n} + (x_{ci} - \bar{x})(X_c^T X_c)^{-1}(x_{ci} - \bar{x})^T$,

  where $x_{ci} = (x_{i1}, x_{i2}, \cdots, x_{ip})$, $\bar{x} = (\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_p)$, and $(x_{ci} - \bar{x})$ is the $i$th row of the
  centered matrix $X_c$. [Also note that $h_{ii} = x_i (X^T X)^{-1} x_i^T$, where $x_i = (1, x_{i1}, x_{i2}, \cdots, x_{ip})$.]

  (iv) $tr(H) = \sum_i h_{ii} = p + 1$

  (Proof) Recall that

  $$H = \frac{1}{n} J + H_c = \frac{1}{n} J + X_c (X_c^T X_c)^{-1} X_c^T. \tag{7.1}$$

  (i) The lower bound follows from (7.1). The upper bound follows from $H = H^2$, i.e.,

  $$h_{ii} = (h_{i1}, h_{i2}, \cdots, h_{in})(h_{i1}, h_{i2}, \cdots, h_{in})^T = \sum_{j=1}^{n} h_{ij}^2 = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2. \tag{7.2}$$

  Dividing (7.2) by $h_{ii}\ (\ge 1/n)$, we obtain $1 = h_{ii} + \frac{\sum_{j \neq i} h_{ij}^2}{h_{ii}}$, which implies $h_{ii} \le 1$.

  (ii) can be shown using (7.2). (Check yourself.)

  (iii) follows from (7.1).



7.3 Leverage points

• Recall that $H1 = 1$, so each row (and, by symmetry, each column) of $H$ sums to one:
  $\sum_{j=1}^{n} h_{ij} = \sum_{i=1}^{n} h_{ij} = 1$.

  Since $\hat{y} = Hy$,

  $$\hat{y}_i = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j.$$

  If the $i$th observation heavily influences its own fitted value (i.e. $y_i \approx \hat{y}_i$), this suggests that $h_{ii}$
  is close to 1. Recall that $1/n \le h_{ii} \le 1$. Such an observation is called a leverage point. Note that
  leverage points can be identified from $X$ alone (without $y$), since they are determined through $H$.
we can identify through H.

• Another view of leverage points (1):

  The second term of

  $$h_{ii} = \frac{1}{n} + (x_{ci} - \bar{x})(X_c^T X_c)^{-1}(x_{ci} - \bar{x})^T$$

  is an estimated Mahalanobis distance and provides a good measure of the relative distance of each
  $x_{ci}$ from the center of the points, as represented by $\bar{x}$. (e.g. the hospital rating example)
Hospital rating example)

• Another view of leverage points (2):

  Recall that $\mathrm{var}(\hat{e}_i) = \sigma^2 (1 - h_{ii})$. When $h_{ii} \approx 1$, $\mathrm{var}(\hat{e}_i) \approx 0$, which implies
  $0 \approx \hat{e}_i = y_i - \hat{y}_i$.
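A small sketch (simulated data; the leverage cutoff 2(p+1)/n is a common rule of thumb, not from the notes) illustrating the decomposition of $h_{ii}$ and the identification of a leverage point:

# Sketch: hat values, their Mahalanobis decomposition, and tr(H) = p + 1
set.seed(7)
n <- 30; p <- 2
X1 <- matrix(rnorm(n * p), n, p)
X1[1, ] <- c(6, 6)                       # push one point far from the center of the x's
y   <- drop(1 + X1 %*% c(1, -1) + rnorm(n))
fit <- lm(y ~ X1)
h   <- hatvalues(fit)
Xc  <- scale(X1, center = TRUE, scale = FALSE)
mahal <- diag(Xc %*% solve(t(Xc) %*% Xc) %*% t(Xc))
max(abs(h - (1/n + mahal)))              # ~ 0: h_ii = 1/n + Mahalanobis term
sum(h)                                   # = p + 1 = 3
which(h > 2 * (p + 1) / n)               # observation 1 is flagged (rule-of-thumb cutoff)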

7.4 Outliers

Since the residuals do not all have the same variance, it is desirable to scale them. There are two
common methods of scaling.

• (Internally) Studentized residual

  Write $\hat{e}_i \equiv e_i$. (What is the problem with using $e_i/\hat{\sigma}$ for diagnostics?) Define the studentized
  residual as

  $$r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}},$$

  where $\hat{\sigma}^2 = SSE/(n - p - 1)$.

  In simple regression,

  $$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}
  \qquad \text{and} \qquad
  r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}}}.$$

  – Properties:

    (i) $\sum_i r_i \neq 0$

    (ii) $E(r_i) = 0$

    (iii) $\mathrm{var}(r_i) = 1$

    (iv) $\mathrm{cov}(r_i, r_j) = \mathrm{corr}(e_i, e_j) = -\dfrac{h_{ij}}{\sqrt{1 - h_{ii}}\sqrt{1 - h_{jj}}}$

  – Note that $r_i$ is not distributed as $t$ (why not?). In fact, if $y \sim N_n(X\beta, \sigma^2 I_n)$,
    $\dfrac{r_i^2}{n - p - 1} \sim \mathrm{Beta}(1/2, (n - p - 2)/2)$. (Also, $e_i^2/SSE \le 1 - h_{ii}$ (Problem 9.4), and hence
    $-\sqrt{n - p - 1} \le r_i \le \sqrt{n - p - 1}$.)

• R-studentized residuals (externally studentized residuals):

  R-studentized residuals are calculated by replacing $\hat{\sigma}$ with the standard error computed from the
  $n - 1$ observations remaining after omitting the $i$th observation:

  $$t_i = \frac{e_i}{\hat{\sigma}_{(i)} \sqrt{1 - h_{ii}}}$$

  is distributed as $t(n - p - 2)$, where $\hat{\sigma}^2_{(i)} = \sum_{j \neq i} (y_j - x_j \hat{\beta}_{(i)})^2 / (n - p - 2)$.

  (Proof) First we express $\hat{\sigma}^2_{(i)}$ in terms of a recognizable quadratic form. To do that we will derive

  $$(n - p - 2)\, \hat{\sigma}^2_{(i)} = (n - p - 1)\, \hat{\sigma}^2 - \frac{e_i^2}{1 - h_{ii}}.$$

  To proceed, let $x_i$ be the $i$th row of $X$. Then

  $$\begin{aligned}
  \hat{\beta}_{(i)} &= (X_{(i)}^T X_{(i)})^{-1} X_{(i)}^T y_{(i)}
    = (X^T X - x_i^T x_i)^{-1} (X^T y - x_i^T y_i) \\
  &= \left\{ (X^T X)^{-1} + \frac{(X^T X)^{-1} x_i^T x_i (X^T X)^{-1}}{1 - h_{ii}} \right\} (X^T y - x_i^T y_i),
  \end{aligned}$$

  where the last equality uses the fact that $(A - uv^T)^{-1} = A^{-1} + \dfrac{A^{-1} u v^T A^{-1}}{1 - v^T A^{-1} u}$.

  Thus

  $$\hat{\beta}_{(i)} = \hat{\beta} - (X^T X)^{-1} x_i^T y_i + \frac{(X^T X)^{-1} x_i^T x_i (X^T X)^{-1} (X^T y - x_i^T y_i)}{1 - h_{ii}},$$

  which simplifies to $\hat{\beta}_{(i)} = \hat{\beta} - \dfrac{(X^T X)^{-1} x_i^T e_i}{1 - h_{ii}}$. It is convenient to note that
  $y_i - x_i \hat{\beta}_{(i)} \equiv e_{(i)} = e_i + \dfrac{h_{ii} e_i}{1 - h_{ii}} = \dfrac{e_i}{1 - h_{ii}}$. (This in turn yields
  $\hat{\beta}_{(i)} = \hat{\beta} - (X^T X)^{-1} x_i^T e_{(i)}$.)

  $$\begin{aligned}
  (n - p - 2)\, \hat{\sigma}^2_{(i)}
  &= \sum_{j=1}^{n} (y_j - x_j \hat{\beta}_{(i)})^2 - (y_i - x_i \hat{\beta}_{(i)})^2 \\
  &= \sum_{j=1}^{n} \left( e_j + \frac{h_{ji} e_i}{1 - h_{ii}} \right)^2 - \left( \frac{e_i}{1 - h_{ii}} \right)^2 \qquad (7.3) \\
  &= \sum_{j=1}^{n} e_j^2 - \frac{e_i^2}{1 - h_{ii}}
   = (n - p - 1)\, \hat{\sigma}^2 - \frac{e_i^2}{1 - h_{ii}}.
  \end{aligned}$$

  (Here we use $H\hat{e} = H(I - H)y = 0$, so $\sum_{j=1}^{n} e_j h_{ji} = 0$, together with $\sum_{j=1}^{n} h_{ji}^2 = h_{ii}$.)

  It remains to show

  (i) $\dfrac{1}{\sigma^2} (n - p - 2)\, \hat{\sigma}^2_{(i)} \sim \chi^2(n - p - 2)$

  (ii) $(n - p - 2)\, \hat{\sigma}^2_{(i)} \perp \dfrac{e_i^2}{1 - h_{ii}}$
  (iii) $\dfrac{1}{\sigma^2} \dfrac{e_i^2}{1 - h_{ii}} \sim \chi^2(1)$.

  We show (iii) first. Note that $e_i = u_i (I - H) y$, where $u_i$ is a row vector of 0's except for a 1 in the
  $i$th position. Let $K_i = u_i^T u_i$. Then

  $$\frac{e_i^2}{1 - h_{ii}} = y^T (I - H) K_i (I - H) y / (1 - h_{ii}). \tag{7.4}$$

  We can show that $L = (I - H) K_i (I - H) / (1 - h_{ii})$ is idempotent using $K_i (I - H) K_i = (1 - h_{ii}) K_i$.
  With $tr(L) = \mathrm{rank}(L) = 1$, we can show (iii).

  To prove (ii), we can write from (7.3) and (7.4) that

  $$(n - p - 2)\, \hat{\sigma}^2_{(i)} = y^T (I - H) y - y^T L y.$$

  Independence follows since $L(I - H - L) = L(I - H) - L = 0$.

  For (i), we can check that $I - H - L$ is idempotent with $\mathrm{rank}(I - H - L) = n - p - 2$.

  – From this result, one can construct a $t$-test based on $t_i$ for testing $H_0: \theta = 0$ in the mean-shift
    outlier model $E(y_i \mid x_i) = x_i \beta + \theta$. Since $n$ such tests will be made, a correction for multiple
    comparisons should be considered.
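A minimal sketch of this outlier test (simulated data with one planted outlier; the shift size and cutoff are arbitrary choices):

# Sketch: mean-shift outlier test via externally studentized residuals + Bonferroni correction
set.seed(8)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[5] <- y[5] + 6                              # plant an outlier at observation 5
fit  <- lm(y ~ x)
ti   <- rstudent(fit)                         # externally studentized residuals
pval <- 2 * pt(-abs(ti), df = n - 3)          # df = n - p - 2 with p = 1
which(p.adjust(pval, method = "bonferroni") < 0.05)   # typically flags observation 5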

7.5 Prediction sum of squares (PRESS)

The PREdiction Sum of Squares (PRESS) is defined as

$$PRESS \equiv \sum_{i=1}^{n} (y_i - \hat{y}_{i(i)})^2,$$

where $\hat{y}_{i(i)} = x_i \hat{\beta}_{(i)}$ and $\hat{\beta}_{(i)}$ is the estimate of $\beta$ using the data without the $i$th observation.

PRESS can also be expressed as

$$PRESS = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2.$$

• The first expression requires fitting the regression $n$ times, while the second expression requires
  fitting it only once.

• A scaled residual $e_i/(1 - h_{ii})$ that corresponds to a large value of $h_{ii}$ contributes more to PRESS.
  For a given dataset, PRESS may be a better measure than SSE of how well the model will predict
  future observations (why?). When the objective is prediction, one can choose a model with small
  PRESS among candidate models.
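The equivalence of the two expressions can be checked directly; below is a sketch on simulated data comparing the n leave-one-out refits with the single-fit shortcut:

# Sketch: PRESS by leave-one-out refitting vs. the shortcut sum((e_i/(1 - h_ii))^2)
set.seed(9)
n <- 30
x <- rnorm(n)
y <- 1 + x + rnorm(n)
dat <- data.frame(y = y, x = x)
fit <- lm(y ~ x, data = dat)
press.loo <- sum(sapply(seq_len(n), function(i) {
  fi <- lm(y ~ x, data = dat[-i, ])                        # refit without observation i
  (dat$y[i] - predict(fi, newdata = dat[i, ]))^2
}))
press.short <- sum((resid(fit) / (1 - hatvalues(fit)))^2)  # single fit
c(press.loo, press.short)                                  # identical up to rounding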

7.6 Influential observations

• Cook's distance

  $$D_i = (\hat{y} - \hat{y}_{(i)})^T (\hat{y} - \hat{y}_{(i)}) / \{(p+1)\hat{\sigma}^2\}
        = (\hat{\beta} - \hat{\beta}_{(i)})^T (X^T X)(\hat{\beta} - \hat{\beta}_{(i)}) / \{(p+1)\hat{\sigma}^2\}.$$

  Substituting $\hat{\beta}_{(i)} = \hat{\beta} - \dfrac{(X^T X)^{-1} x_i^T e_i}{1 - h_{ii}}$ into $D_i$,

  $$D_i = \frac{1}{(p+1)\hat{\sigma}^2} \cdot \frac{e_i^2}{(1 - h_{ii})^2}\, x_i (X^T X)^{-1} x_i^T
        = \frac{1}{p+1} \left( \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}} \right)^2 \frac{h_{ii}}{1 - h_{ii}}
        = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}}.$$

  – $D_i$ represents a Mahalanobis distance between $\hat{\beta}$ and $\hat{\beta}_{(i)}$. Points with $D_i > 1$ are considered
    influential points.
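The closed-form expression can be checked against R's built-in cooks.distance(); a small sketch with simulated data:

# Sketch: Cook's distance from r_i^2 h_ii / {(p + 1)(1 - h_ii)} vs. cooks.distance()
set.seed(10)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
ri  <- rstandard(fit)                 # internally studentized residuals
h   <- hatvalues(fit)
p   <- 2                              # two covariates, so p + 1 = 3 parameters
D   <- ri^2 * h / ((p + 1) * (1 - h))
max(abs(D - cooks.distance(fit)))     # ~ 0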

• DFFITS

  $$DFFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{\hat{\sigma}^2_{(i)} h_{ii}}},
  \qquad \text{where} \qquad
  \hat{y}_i - \hat{y}_{i(i)} = x_i (\hat{\beta} - \hat{\beta}_{(i)}) = \frac{h_{ii} e_i}{1 - h_{ii}}.$$

  Thus

  $$DFFITS_i = \frac{1}{\sqrt{\hat{\sigma}^2_{(i)} h_{ii}}} \cdot \frac{h_{ii} e_i}{1 - h_{ii}}
  = \frac{e_i}{\hat{\sigma}_{(i)} \sqrt{1 - h_{ii}}} \sqrt{\frac{h_{ii}}{1 - h_{ii}}}
  = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}},$$

  where $t_i$ is the R-studentized residual.

• DFBETAS:

  $$DFBETAS_{ji} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{\hat{\sigma}^2_{(i)} c_{jj}}},$$

  where $c_{jj}$ denotes the $j$th diagonal element of $(X^T X)^{-1}$.

  Alternatively,

  $$DFBETAS_{ji} = \frac{a_{ji}}{\sqrt{c_{jj}}} \cdot \frac{t_i}{\sqrt{1 - h_{ii}}},$$

  where $a_{ji}$ is the $(j, i)$th element of $A = (X^T X)^{-1} X^T$.

  We can derive the second expression from the first. Since

  $$\hat{\beta} - \hat{\beta}_{(i)} = \frac{(X^T X)^{-1} x_i^T e_i}{1 - h_{ii}} = \frac{a_i e_i}{1 - h_{ii}},$$

  where $a_i$ is the $i$th column of the matrix $A$, we have $\hat{\beta}_j - \hat{\beta}_{j(i)} = \dfrac{a_{ji} e_i}{1 - h_{ii}}$. Therefore

  $$DFBETAS_{ji} = \frac{a_{ji} e_i}{1 - h_{ii}} \cdot \frac{1}{\sqrt{\hat{\sigma}^2_{(i)} c_{jj}}}
  = \frac{a_{ji}}{\sqrt{c_{jj}}} \cdot \frac{t_i}{\sqrt{1 - h_{ii}}}.$$

  When $|DFBETAS_{ji}| > \dfrac{2}{\sqrt{n}}$, the $i$th data point is considered influential.
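The DFFITS and DFBETAS formulas above can likewise be checked against R's dffits() and dfbetas(); a sketch on simulated data:

# Sketch: DFFITS_i = t_i sqrt(h_ii/(1 - h_ii)) and the DFBETAS formula vs. built-in functions
set.seed(11)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
ti  <- rstudent(fit); h <- hatvalues(fit)
max(abs(dffits(fit) - ti * sqrt(h / (1 - h))))        # ~ 0
X   <- model.matrix(fit)
A   <- solve(t(X) %*% X) %*% t(X)                     # A = (X'X)^{-1} X'
cjj <- diag(solve(t(X) %*% X))
man <- t(A / sqrt(cjj)) * (ti / sqrt(1 - h))          # manual DFBETAS, an n x (p+1) matrix
max(abs(dfbetas(fit) - man))                          # ~ 0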

7.7 Numerical Examples

• 9.10 Gas vapor example

  – Plot $\hat{e}_i$, $r_i$, and $t_i$ versus $\hat{y}_i$, and $h_{ii}$ versus observation number.

#hat matrix

hii=hat(model.matrix(fit))
#usual residual
ei=gas$y-fit$fitted

#studentized residual
ri=rstandard(fit)
#R studentized residual (external)

ti=rstudent(fit)

par(mfrow = c(2, 2),cex.main=1.5,cex.lab=1.5,cex.axis=1.5,

mar=c(4,4,2,1))
plot(fit$fitted, ei, xlab=expression( hat(y)[i]),
ylab=expression(e[i]))

abline(h=0,col='gray')
plot(fit$fitted, ri, xlab=expression( hat(y)[i]),
ylab=expression(r[i]))

abline(h=0,col='gray')
plot(fit$fitted, ti, xlab=expression( hat(y)[i]),

ylab=expression(t[i]))
abline(h=0,col='gray')
plot(1:length(hii), hii, xlab='observation number',

ylab=expression(h[ii]), ylim=c(0,1))

  [Figure: diagnostic plots for the gas vapor fit — $e_i$, $r_i$, and $t_i$ versus $\hat{y}_i$, and $h_{ii}$ versus
  observation number.]

  – Influence measures

#cooks distance
cooks.distance(fit)

#DFFITS
dffits(fit)

#DFBETAS
dfbetas(fit)
