
Chapter 7: Multiple Regression: Estimation


Notes for MATH 668 based on Linear Models in Statistics by Alvin C. Rencher and G. Bruce
Schaalje, second edition, Wiley, 2008.
February 13, 2018

7.2 The Model


Definition 7.2.1 (p.137): The multiple linear regression model with n observations can be written as

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + … + β_k x_ik + ε_i ,     i = 1, 2, … , n

with the following assumptions:


E(ε) = 0 ,     cov(ε) = σ²I.

Here, β 0 , β 1 , … , β k are fixed but unknown model parameters representing the (partial) regression coefficients.
In this chapter, we assume the x ij ’s are not random.
Also, y is the dependent or response vector, x_j = (x_1j, … , x_nj)⊤ is the j-th independent, predictor, or explanatory variable, and ε is a random vector.


The multiple linear regression model can be expressed in matrix form as

y = Xβ + ε

where

y = (y_1, … , y_n)⊤ ,  β = (β_0, β_1, … , β_k)⊤ ,  ε = (ε_1, … , ε_n)⊤ ,  and X is the n × (k + 1) matrix whose i-th row is (1, x_i1, … , x_ik):

    ⎛ 1   x_11  ⋯  x_1k ⎞
X = ⎜ 1   x_21  ⋯  x_2k ⎟
    ⎜ ⋮    ⋮    ⋱   ⋮   ⎟
    ⎝ 1   x_n1  ⋯  x_nk ⎠

The matrix X is often referred to as the design matrix.
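As a quick illustration (not from the text; the data and names below are made up), the design matrix can be formed in R by binding a column of ones to the predictor columns, and a response can be simulated directly from the matrix form of the model.

set.seed(1)
n=10
x1=rnorm(n); x2=rnorm(n)        # two hypothetical predictors
X=cbind(1,x1,x2)                # design matrix: column of ones plus predictors
beta=c(2,1,-0.5)                # chosen "true" coefficients for the sketch
eps=rnorm(n)                    # errors with E(eps)=0 and cov(eps)=sigma^2 I (sigma=1)
y=X%*%beta+eps                  # y = X beta + eps in matrix form
dim(X)                          # n x (k+1) = 10 x 3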

7.3 Estimation of β and σ²


Definition 7.3.1 (p.142): The least squares estimate of β is a vector β̂ which minimizes the least squares function

Q(b) = (y − Xb)⊤(y − Xb) = ∑_{i=1}^n (y_i − b_0 − b_1 x_i1 − b_2 x_i2 − … − b_k x_ik)².

Theorem 7.3.1: If n ≥ p and X is an n × p full rank matrix, then X⊤X is a p × p full rank matrix.
Proof: Since X is full rank, its columns are linearly independent (that is, Xu = 0 implies u = 0).

Now, we will show that the columns of X⊤X are linearly independent.

For any p-dimensional vector v = (v_1, … , v_p)⊤, v⊤v = v_1² + … + v_p² equals 0 if and only if v = 0.

Suppose (X ⊤ X)u = 0 . Let v = Xu so that

v⊤v = u⊤X⊤Xu = u⊤0 = 0.

Then v = 0 , or equivalently, Xu = 0 . Since X is full rank, this implies that u = 0 . Thus, X ⊤ Xu = 0 implies u = 0 .

Recall some vector differentiation rules from Chapter 2:

∂[c⊤x]/∂x = ∂[x⊤c]/∂x = c,

∂[x⊤Cx]/∂x = 2Cx   if C is symmetric.

Theorem 7.3.2 (p.142): Suppose we observe data (x_i1, … , x_ik, y_i) for i = 1, … , n from a multiple linear regression model where n > k. If X has rank k + 1 (it is full rank), then the least squares estimate of β is

β̂ = (X⊤X)⁻¹X⊤y.

Proof: Since

Q(b) = (y − Xb)⊤(y − Xb)
     = y⊤y − b⊤X⊤y − y⊤Xb + b⊤X⊤Xb
     = y⊤y − y⊤Xb − y⊤Xb + b⊤X⊤Xb
     = y⊤y − 2y⊤Xb + b⊤X⊤Xb,

we have

∂Q/∂b = 0 − 2(y⊤X)⊤ + 2X⊤Xb
      = −2X⊤y + 2X⊤Xb.

Setting ∂Q/∂b = 0 and denoting the solution as β̂, we write

−2X⊤y + 2X⊤Xβ̂ = 0
2X⊤Xβ̂ = 2X⊤y
X⊤Xβ̂ = X⊤y
β̂ = (X⊤X)⁻¹X⊤y    (X⊤X is invertible since X is full rank).

To show that β̂ minimizes Q, consider

Q(b) = (y − Xβ̂ + Xβ̂ − Xb)⊤(y − Xβ̂ + Xβ̂ − Xb)
     = (y − Xβ̂)⊤(y − Xβ̂) − 2(y − Xβ̂)⊤(Xβ̂ − Xb) + (Xβ̂ − Xb)⊤(Xβ̂ − Xb)
     = (y − Xβ̂)⊤(y − Xβ̂) − 2(X(β̂ − b))⊤(y − Xβ̂) + (X(β̂ − b))⊤(X(β̂ − b))
     = (y − Xβ̂)⊤(y − Xβ̂) − 2(β̂ − b)⊤X⊤(y − Xβ̂) + (β̂ − b)⊤X⊤X(β̂ − b)
     = (y − Xβ̂)⊤(y − Xβ̂) − 2(β̂ − b)⊤(X⊤y − X⊤Xβ̂) + (β̂ − b)⊤X⊤X(β̂ − b).

Since −2(X⊤y − X⊤Xβ̂) = 0,

Q(b) = (y − Xβ̂)⊤(y − Xβ̂) + v⊤v

where v = X(β̂ − b). Now, v⊤v ≥ 0 with equality if and only if v = X(β̂ − b) = 0. This occurs if and only if b = β̂ since X is full rank. Thus, Q(b) ≥ Q(β̂) with equality if and only if b = β̂.
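As a quick numerical check (a sketch with simulated data; all names are made up), the solution of the normal equations X⊤Xb = X⊤y matches the coefficients reported by lm, and perturbing β̂ can only increase Q.

set.seed(2)
n=20
x=rnorm(n)
X=cbind(1,x)
y=1+2*x+rnorm(n)
Q=function(b) sum((y-X%*%b)^2)          # the least squares function Q(b)
beta.hat=solve(t(X)%*%X,t(X)%*%y)       # solve the normal equations
cbind(beta.hat,coef(lm(y~x)))           # the two columns agree
Q(beta.hat)<=Q(beta.hat+c(0.1,-0.05))   # TRUE: any perturbation increases Q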
Theorem 7.3.3 (p.145): If X is full rank, then E(β̂) = β and cov(β̂) = σ²(X⊤X)⁻¹.

Proof:

E(β̂) = E[(X⊤X)⁻¹X⊤y]
      = (X⊤X)⁻¹X⊤E(y)
      = (X⊤X)⁻¹X⊤Xβ = β

and

cov(β̂) = cov[(X⊤X)⁻¹X⊤y]
        = (X⊤X)⁻¹X⊤ cov(y) ((X⊤X)⁻¹X⊤)⊤
        = (X⊤X)⁻¹X⊤ cov(y) X((X⊤X)⁻¹)⊤
        = (X⊤X)⁻¹X⊤ cov(y) X(X⊤X)⁻¹
        = (X⊤X)⁻¹X⊤(σ²I)X(X⊤X)⁻¹
        = σ²(X⊤X)⁻¹X⊤X(X⊤X)⁻¹
        = σ²(X⊤X)⁻¹.
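As an illustrative check of Theorem 7.3.3 (a simulation sketch only; σ is known here because we generate the data ourselves), the empirical covariance matrix of β̂ over repeated samples with a fixed X is close to σ²(X⊤X)⁻¹.

set.seed(3)
n=15; sigma=2
x=runif(n,0,10)
X=cbind(1,x)                            # fixed design matrix
XtXinv=solve(t(X)%*%X)
B=replicate(5000,{
  y=drop(X%*%c(1,0.5))+rnorm(n,sd=sigma)
  drop(XtXinv%*%t(X)%*%y)               # beta.hat for this simulated sample
})
cov(t(B))                               # empirical covariance of beta.hat
sigma^2*XtXinv                          # theoretical covariance sigma^2 (X'X)^{-1}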

Theorem 7.3.4 (p.150): Let s² = Q(β̂)/(n − k − 1). If X is full rank, then E(s²) = σ² for the multiple linear regression model.
Proof: Since

Q(β̂) = (y − Xβ̂)⊤(y − Xβ̂)
      = y⊤y − 2β̂⊤X⊤y + β̂⊤X⊤Xβ̂
      = y⊤y − 2β̂⊤X⊤y + β̂⊤X⊤y
      = y⊤y − β̂⊤X⊤y
      = y⊤y − y⊤X(X⊤X)⁻¹X⊤y
      = y⊤y − y⊤Hy
      = y⊤(I − H)y

where H = X(X⊤X)⁻¹X⊤, Theorem 5.2.1 implies that

E[Q(β̂)] = tr((I_n − H)(σ²I)) + (Xβ)⊤(I − H)(Xβ)
         = σ² tr(I_n − H) + β⊤X⊤(I − H)Xβ
         = σ² tr(I_n − H) + β⊤(X⊤X − X⊤X(X⊤X)⁻¹X⊤X)β
         = σ² tr(I_n − H) + β⊤(X⊤X − X⊤X)β
         = σ² tr(I_n − H)
         = σ²{tr(I_n) − tr(H)}.

Since H² = X(X⊤X)⁻¹X⊤X(X⊤X)⁻¹X⊤ = X(X⊤X)⁻¹X⊤ = H so that H is idempotent, Theorem 2.11.1 implies that

tr(H) = tr(X(X⊤X)⁻¹X⊤)
      = tr(X⊤X(X⊤X)⁻¹)
      = tr(I_{k+1})
      = k + 1.

So, we have

E[Q(β̂)] = σ²{n − (k + 1)} = σ²(n − k − 1),

and it follows that E(s²) = E[Q(β̂)]/(n − k − 1) = σ².
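The unbiasedness of s² can also be checked by simulation (a sketch with made-up data; σ² is chosen by us so the target is known).

set.seed(4)
n=12; sigma=1.5
x1=rnorm(n); x2=rnorm(n)
X=cbind(1,x1,x2)                        # k = 2, so the divisor is n - k - 1 = 9
s2=replicate(10000,{
  y=drop(X%*%c(1,2,-1))+rnorm(n,sd=sigma)
  fit=lm.fit(X,y)
  sum(fit$residuals^2)/(n-3)            # s^2 = Q(beta.hat)/(n - k - 1)
})
mean(s2)                                # close to sigma^2 = 2.25
sigma^2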
The next result is known as the Gauss-Markov Theorem.
Theorem 7.3.5 (p.148): For the multiple linear regression model, the least squares estimate is the best linear unbiased estimate (BLUE). That is, suppose we are interested in estimating a⊤β where a is a (k + 1)-dimensional vector of constants, and we consider estimates of the form c⊤y. If E(c⊤y) = a⊤β for all β, then var(c⊤y) ≥ var(a⊤β̂).
Proof: Let d = X⊤c. Since a⊤β = E(c⊤y) = c⊤E(y) = c⊤Xβ = d⊤β for all β, it follows that a = d (this can be shown since β = (1, 0, … , 0)⊤ implies a_1 = d_1, … , β = (0, … , 0, 1)⊤ implies a_{k+1} = d_{k+1}).

Since I − H is idempotent and symmetric, we see that

c⊤(I − H)c = c⊤(I − H)(I − H)c
            = c⊤(I − H)⊤(I − H)c
            = ((I − H)c)⊤(I − H)c ≥ 0.

Since a = d = X⊤c, c⊤(I − H)c ≥ 0 implies that

var(a⊤β̂) = a⊤cov(β̂)a
          = σ²a⊤(X⊤X)⁻¹a
          = σ²c⊤X(X⊤X)⁻¹X⊤c
          = σ²c⊤Hc
          ≤ σ²c⊤Hc + σ²c⊤(I − H)c
          = σ²c⊤Ic
          = c⊤(σ²I)c
          = c⊤cov(y)c
          = var(c⊤y).
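To see the theorem in action (a simulation sketch, not from the text), compare the OLS slope with another linear unbiased estimator of β_1, the two-point estimator (y_n − y_1)/(x_n − x_1). Both are of the form c⊤y and both are unbiased, but the least squares slope has the smaller variance.

set.seed(5)
n=10
x=1:n                                   # fixed design
sims=replicate(20000,{
  y=3+0.7*x+rnorm(n)
  ols=cov(x,y)/var(x)                   # least squares slope (a linear function of y)
  twopt=(y[n]-y[1])/(x[n]-x[1])         # another linear unbiased estimator of the slope
  c(ols,twopt)
})
rowMeans(sims)                          # both are close to 0.7 (unbiased)
apply(sims,1,var)                       # the OLS variance is the smaller one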

7.4 Geometry of Least-Squares


R Example 7.4.1: Consider the simple linear regression model for the following data set:

(3, 4), (−1, 2), (1, 0)

First, let’s find the least squares estimate based on the formula in Theorem 7.3.2. To do this, we start by creating the design
matrix and the response vector.

x=c(3,-1,1)
X=cbind(1,x);X

## x
## [1,] 1 3
## [2,] 1 -1
## [3,] 1 1

y=c(4,2,0);y

## [1] 4 2 0

Now, we use the formula for the least squares estimate.

beta.hat=solve(t(X)%*%X)%*%t(X)%*%y; beta.hat

## [,1]
## 1.5
## x 0.5

So, β̂ = (1.5, 0.5)⊤ and the regression line for modeling y based on x is

ŷ = 1.5 + 0.5x.

Graphically, if we look at the rows of X and y, the least squares line minimizes the sum of squares of vertical distances of the data
points to the line as illustrated below.

plot(x,y,pch=19)
abline(beta.hat,col="blue")
y.hat=X%*%beta.hat
for (i in 1:3)
points(c(x[i],x[i]),c(y[i],y.hat[i]),type="l",col="red")


There is another geometrical interpretation if we look at the columns of X and y. First, we plot the 3-dimensional vectors from the columns of X in red and find the subspace of R³ spanned by these two vectors.

require(rgl)

## Loading required package: rgl

plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col="#AAAAAA")
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col="#AAAAAA")
}
segments3d(c(0,1),c(0,1),c(0,1),col="#CC0000",lwd=5)
segments3d(c(0,x[1]),c(0,x[2]),c(0,x[3]),col="#FF0000",lwd=5)


Now, we plot the 3-dimensional response vector y = (4, 2, 0)⊤.

segments3d(c(0,y[1]),c(0,y[2]),c(0,y[3]),col="#0000FF",lwd=5)

Now, we project the response vector onto the subspace spanned by the columns of the design matrix. The fitted vector ŷ = Xβ̂ = (3, 1, 2)⊤ and the residual vector ε̂ = y − ŷ = (1, 1, −2)⊤ are shown below.


segments3d(c(0,y.hat[1]),c(0,y.hat[2]),c(0,y.hat[3]),col="#FF00FF",lwd=5)
segments3d(c(y[1],y.hat[1]),c(y[2],y.hat[2]),c(y[3],y.hat[3]),col="#00FF00",lwd=5)

7.6 Normal Model


Definition 7.6.1 (p.157): The normal multiple linear regression model with n observations can be written as

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + … + β_k x_ik + ε_i ,     i = 1, 2, … , n

with the following assumption:

ε ∼ N_n(0, σ²I).

From Theorem 4.2.1, the likelihood function for β and σ² is

L(β, σ²) = (2π)^(−n/2) (det(σ²I))^(−1/2) e^(−(y−Xβ)⊤(σ²I)⁻¹(y−Xβ)/2)
         = (2πσ²)^(−n/2) e^(−(y−Xβ)⊤(y−Xβ)/(2σ²)).

The log-likelihood function can then be written as

ℓ(β, σ²) = ln L(β, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − Q(β)/(2σ²).

Theorem 7.6.1 (p.158): For the normal multiple linear regression model where n > k and X has rank k + 1 (it is full rank), the maximum likelihood estimator of (β, σ²) is (β̂, σ̂²) where

β̂ = (X⊤X)⁻¹X⊤y   and   σ̂² = Q(β̂)/n = (y − Xβ̂)⊤(y − Xβ̂)/n.

Proof: Differentiating ℓ(β, σ²) with respect to β and σ², we obtain

∂ℓ/∂β = −(1/(2σ²)) ∂Q/∂β

and

∂ℓ/∂σ² = −n/(2σ²) + Q(β)/(2(σ²)²).


Setting the first equation equal to 0 and solving, we see that

−(1/(2σ²)) ∂Q/∂β = 0
∂Q/∂β = 0
β̂ = (X⊤X)⁻¹X⊤y.

Substituting β = β̂ and setting the second equation equal to 0 and solving, we obtain

−n/(2σ̂²) + Q(β̂)/(2(σ̂²)²) = 0
n/(2σ̂²) = Q(β̂)/(2(σ̂²)²)
σ̂² = Q(β̂)/n.

To show that (β̂, σ̂²) maximizes ℓ, let ℓ* be the profile log-likelihood defined by

ℓ*(σ²) = ℓ(β̂, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − Q(β̂)/(2σ²).

Since

dℓ*/d(σ²) = −n/(2σ²) + Q(β̂)/(2(σ²)²)
          = −(n/(2(σ²)²)) (σ² − σ̂²)

is positive when σ² < σ̂² and negative when σ² > σ̂², ℓ*(σ²) is maximized at σ² = σ̂². Finally, we see that

ℓ(β̂, σ̂²) = ℓ*(σ̂²) ≥ ℓ*(σ²) = ℓ(β̂, σ²) ≥ ℓ(β, σ²)

for all β and σ².
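As a quick check in R (a sketch with simulated data; the names are illustrative), σ̂² = Q(β̂)/n differs from s² only by the factor (n − k − 1)/n, and plugging σ̂² into the log-likelihood above reproduces the maximized log-likelihood that logLik reports for an lm fit.

set.seed(6)
n=25
x=rnorm(n)
y=1+2*x+rnorm(n,sd=1.3)
fit=lm(y~x)
Qhat=sum(resid(fit)^2)                  # Q(beta.hat)
sigma2.mle=Qhat/n                       # MLE of sigma^2 (divides by n)
s2=Qhat/(n-2)                           # unbiased estimate (divides by n - k - 1)
-n/2*log(2*pi)-n/2*log(sigma2.mle)-Qhat/(2*sigma2.mle)
logLik(fit)                             # same maximized log-likelihood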


Theorem 7.6.2 (p.159): For the normal multiple linear regression model where n > k + 1 and X has rank k + 1 (it is full rank), the maximum likelihood estimator has the following properties:

(a) β̂ ∼ N_{k+1}(β, σ²(X⊤X)⁻¹),

(b) nσ̂²/σ² = (n − k − 1)s²/σ² ∼ χ²(n − k − 1),

(c) β̂ and σ̂² are independent.

Proof: (a) Theorem 4.4.1 implies that β̂ = (X⊤X)⁻¹X⊤y follows a (k + 1)-dimensional normal distribution with mean vector

E[(X⊤X)⁻¹X⊤y] = (X⊤X)⁻¹X⊤E(y) = (X⊤X)⁻¹X⊤Xβ = β

and covariance matrix

cov[(X⊤X)⁻¹X⊤y] = (X⊤X)⁻¹X⊤ cov(y) ((X⊤X)⁻¹X⊤)⊤
                = (X⊤X)⁻¹X⊤ cov(y) X(X⊤X)⁻¹
                = (X⊤X)⁻¹X⊤(σ²I)X(X⊤X)⁻¹
                = σ²(X⊤X)⁻¹X⊤X(X⊤X)⁻¹
                = σ²(X⊤X)⁻¹.

(b) Letting H = X(X⊤X)⁻¹X⊤, Theorem 5.5.1 implies that

nσ̂²/σ² = (y − Xβ̂)⊤(y − Xβ̂)/σ²
        = y⊤(I − H)(I − H)y/σ²
        = (1/σ²) y⊤(I − H)y ∼ χ²(n − k − 1)

since ((1/σ²)(I − H))(σ²I) = I − H is idempotent with rank n − k − 1.

(c) By Theorem 5.6.1, β̂ = (X⊤X)⁻¹X⊤y and nσ̂²/σ² = (1/σ²) y⊤(I − H)y are independent since

(X⊤X)⁻¹X⊤(σ²I)((1/σ²)(I − H)) = (X⊤X)⁻¹X⊤(I − H)
                               = (X⊤X)⁻¹X⊤ − (X⊤X)⁻¹X⊤X(X⊤X)⁻¹X⊤
                               = (X⊤X)⁻¹X⊤ − (X⊤X)⁻¹X⊤
                               = O_{k+1,n}.

Then β̂ and σ̂² = (σ²/n)(nσ̂²/σ²) are independent (see Theorem L3.5 from MATH 667).
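A simulation sketch (made-up data; not from the text) is consistent with parts (b) and (c): across repeated samples, (n − k − 1)s²/σ² has the mean and variance of a χ²(n − k − 1) variable, and it is uncorrelated with β̂.

set.seed(7)
n=10; sigma=2
x=rnorm(n)
X=cbind(1,x)                            # k = 1, so n - k - 1 = 8
sims=replicate(20000,{
  y=drop(X%*%c(0,1))+rnorm(n,sd=sigma)
  fit=lm.fit(X,y)
  c(unname(fit$coefficients[2]),        # beta.hat_1
    sum(fit$residuals^2)/sigma^2)       # (n - k - 1) s^2 / sigma^2
})
mean(sims[2,]); var(sims[2,])           # approx 8 and 16, the chi-square mean and variance
cor(sims[1,],sims[2,])                  # approx 0, consistent with independence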

Theorem 7.6.3 (p.160): For the normal multiple linear regression model, (β̂, s²) is the uniform minimum variance unbiased estimator (UMVUE) of (β, σ²).

7.5 The Model in Centered Form


The multiple linear regression model can be reparametrized in centered form by writing

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + … + β_k x_ik + ε_i
    = α + β_1(x_i1 − x̄_1) + β_2(x_i2 − x̄_2) + … + β_k(x_ik − x̄_k) + ε_i

where x̄_j = (1/n) ∑_{i=1}^n x_ij for j = 1, … , k and

α = β_0 + β_1 x̄_1 + … + β_k x̄_k.

In matrix form, this can be written as

y = (j, X_c) (α, β_1⊤)⊤ + ε

where

β_1 = (β_1, … , β_k)⊤

is the vector of regression coefficients with the intercept removed,

      ⎛ x_11  ⋯  x_1k ⎞
X_1 = ⎜ x_21  ⋯  x_2k ⎟
      ⎜  ⋮    ⋱   ⋮   ⎟
      ⎝ x_n1  ⋯  x_nk ⎠

is the design matrix with the first column of ones removed, and

                          ⎛ x_11 − x̄_1  ⋯  x_1k − x̄_k ⎞
X_c = (I − (1/n)J) X_1 =  ⎜ x_21 − x̄_1  ⋯  x_2k − x̄_k ⎟
                          ⎜      ⋮        ⋱       ⋮      ⎟
                          ⎝ x_n1 − x̄_1  ⋯  x_nk − x̄_k ⎠

is the centered design matrix.


Then α̂ = ȳ and

β̂_1 = (X_c⊤X_c)⁻¹X_c⊤y.

If we denote the sample variance of the j-th column of X_c as

s_j² = (1/(n − 1)) ∑_{i=1}^n (x_ij − x̄_j)²,

the sample covariance of the j-th and g-th columns of X_c as

s_jg = (1/(n − 1)) ∑_{i=1}^n (x_ij − x̄_j)(x_ig − x̄_g),

and the sample covariance of y and the j-th column of X_c as

s_yj = (1/(n − 1)) ∑_{i=1}^n (y_i − ȳ)(x_ij − x̄_j),

then

β̂_1 = S_xx⁻¹ s_yx    and    β̂_0 = α̂ − β̂_1⊤x̄ = ȳ − s_yx⊤ S_xx⁻¹ x̄

where

       ⎛ s_1²  s_12  ⋯  s_1k ⎞
S_xx = ⎜ s_21  s_2²  ⋯  s_2k ⎟ = (1/(n − 1)) X_c⊤X_c ,
       ⎜  ⋮     ⋮    ⋱    ⋮  ⎟
       ⎝ s_k1  s_k2  ⋯  s_k² ⎠

s_yx = (s_y1, s_y2, … , s_yk)⊤ = (1/(n − 1)) X_c⊤y ,   and   x̄ = (x̄_1, x̄_2, … , x̄_k)⊤.
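A short R sketch (simulated data; the names are made up) shows the centered-form computation: β̂_1 comes from S_xx and s_yx, which R computes with cov, and β̂_0 is recovered from the means.

set.seed(8)
n=30
X1=cbind(x1=rnorm(n),x2=runif(n))       # predictor columns (no column of ones)
y=2+1.5*X1[,1]-0.8*X1[,2]+rnorm(n)
Sxx=cov(X1)                             # (1/(n-1)) Xc' Xc
syx=cov(X1,y)                           # (1/(n-1)) Xc' y
beta1.hat=solve(Sxx,syx)                # Sxx^{-1} syx
beta0.hat=mean(y)-crossprod(beta1.hat,colMeans(X1))  # ybar - syx' Sxx^{-1} xbar
c(beta0.hat,beta1.hat)
coef(lm(y~X1))                          # same estimates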

7.7 R² in Fixed-x Regression


Here, we generalize some of the notation from Section 6.4.

The predicted value of y at x is ŷ = x⊤β̂.
The vector of fitted values of y is ŷ = Xβ̂.
The residual vector is ε̂ = y − ŷ.

Evaluating the least squares function at β̂, we obtain the residual sum of squares (or error sum of squares) denoted by

SSE = Q(β̂) = ε̂⊤ε̂ = (y − Xβ̂)⊤(y − Xβ̂) = y⊤y − β̂⊤X⊤y.

The corrected total sum of squares SST = ∑_{i=1}^n (y_i − ȳ)² = y⊤y − nȳ² can be decomposed into two parts

SST = SSE + SSR

where SSR = ∑_{i=1}^n (ŷ_i − ȳ)² = β̂⊤X⊤y − nȳ² = β̂_1⊤X_c⊤y.

The coefficient of determination R² = SSR/SST is the proportion of variation explained by the regression model compared with that explained only using the mean. This gives a measure of how well the model fits the data.
R is sometimes referred to as the multiple correlation coefficient.
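The decomposition and R² are easy to verify numerically (a sketch with simulated data; names are illustrative).

set.seed(9)
n=40
x1=rnorm(n); x2=rnorm(n)
y=1+2*x1-x2+rnorm(n,sd=2)
fit=lm(y~x1+x2)
SSE=sum(resid(fit)^2)                   # residual (error) sum of squares
SST=sum((y-mean(y))^2)                  # corrected total sum of squares
SSR=sum((fitted(fit)-mean(y))^2)        # regression sum of squares
c(SST,SSE+SSR)                          # SST = SSE + SSR
c(SSR/SST,summary(fit)$r.squared)       # same value of R^2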

7.8 Generalized Least-Squares: cov(y) = σ²V

Suppose the multiple linear regression model holds except that cov(ε) = σ²V where V is a known symmetric positive definite matrix but σ² is unknown.
Definition 7.8.1 (p.164): The generalized least squares estimate of β is the vector β̂ which minimizes the generalized least squares function

Q_V(b) = (y − Xb)⊤V⁻¹(y − Xb)
       = ∑_{i=1}^n ∑_{j=1}^n w_ij (y_i − b_0 − b_1 x_i1 − … − b_k x_ik)(y_j − b_0 − b_1 x_j1 − … − b_k x_jk)

where w_ij is the element in the i-th row and j-th column of V⁻¹.

Theorem 7.8.1 (p.165): Given observed data (x_i1, … , x_ik, y_i) for i = 1, … , n from a multiple linear regression model where n > k and X has rank k + 1 (it is full rank), the generalized least squares estimate of β is

β̂ = (X⊤V⁻¹X)⁻¹X⊤V⁻¹y.

The following properties hold:

β̂ is the best linear unbiased estimate (BLUE) of β,
cov(β̂) = σ²(X⊤V⁻¹X)⁻¹, and
s² = Q_V(β̂)/(n − k − 1) is an unbiased estimate of σ².

Proof: The results follow from the results in Section 7.3 with y replaced by V −1/2 y and X replaced by V −1/2 X.
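A short sketch (simulated data with a V chosen arbitrarily for illustration) computes the generalized least squares estimate from the formula and, equivalently, by ordinary least squares after transforming by V^(−1/2); V is diagonal here only to keep V^(−1/2) easy to write down.

set.seed(10)
n=20
x=rnorm(n)
X=cbind(1,x)
V=diag(seq(0.5,2,length.out=n))         # a known positive definite V (here diagonal)
y=drop(X%*%c(1,2))+rnorm(n,sd=sqrt(diag(V)))   # cov(eps) = sigma^2 V with sigma = 1
Vinv=solve(V)
beta.gls=solve(t(X)%*%Vinv%*%X,t(X)%*%Vinv%*%y)      # GLS formula
Vinvhalf=diag(1/sqrt(diag(V)))          # V^{-1/2} (easy because V is diagonal)
beta.trans=lm.fit(Vinvhalf%*%X,drop(Vinvhalf%*%y))$coefficients
cbind(beta.gls,beta.trans)              # identical estimates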

7.9 Model Misspecification


In this section, we consider what happens if the true multiple linear regression model is

y = Xβ + ε = X_1β_1 + X_2β_2 + ε ,   where X = (X_1, X_2) and β = (β_1⊤, β_2⊤)⊤,

but we incorrectly leave out the variables in the columns of X_2 and use the model

y = X_1β_1* + ε*.

Theorem 7.9.1 (p.170): Suppose X_1⊤X_1 is invertible. If y = X_1β_1 + X_2β_2 + ε where E(ε) = 0 and cov(ε) = σ²I, and β̂_1* = (X_1⊤X_1)⁻¹X_1⊤y is the least-squares estimator of β_1 in the misspecified model without X_2, then

(a) E(β̂_1*) = β_1 + Aβ_2 where A = (X_1⊤X_1)⁻¹X_1⊤X_2 is the alias matrix,

(b) cov(β̂_1*) = σ²(X_1⊤X_1)⁻¹.
Proof: For (a), we compute

E(β̂_1*) = (X_1⊤X_1)⁻¹X_1⊤E(y)
         = (X_1⊤X_1)⁻¹X_1⊤(X_1β_1 + X_2β_2)
         = (X_1⊤X_1)⁻¹X_1⊤X_1β_1 + (X_1⊤X_1)⁻¹X_1⊤X_2β_2
         = β_1 + (X_1⊤X_1)⁻¹X_1⊤X_2β_2 = β_1 + Aβ_2.

For (b), we compute

cov(β̂_1*) = (X_1⊤X_1)⁻¹X_1⊤ cov(y) X_1(X_1⊤X_1)⁻¹
           = (X_1⊤X_1)⁻¹X_1⊤(σ²I)X_1(X_1⊤X_1)⁻¹
           = σ²(X_1⊤X_1)⁻¹X_1⊤X_1(X_1⊤X_1)⁻¹
           = σ²(X_1⊤X_1)⁻¹.

Theorem 7.9.2 (p.172): Suppose X_1 is an n × (p + 1) matrix and X_1⊤X_1 is invertible. Let

s_1² = (y − X_1β̂_1*)⊤(y − X_1β̂_1*)/(n − p − 1)

be the estimator of σ² based only on X_1 where β̂_1* = (X_1⊤X_1)⁻¹X_1⊤y. If y = X_1β_1 + X_2β_2 + ε where E(ε) = 0 and cov(ε) = σ²I, then

E(s_1²) = σ² + [β_2⊤X_2⊤(I − X_1(X_1⊤X_1)⁻¹X_1⊤)X_2β_2] / (n − p − 1).

Proof: The result holds since Theorem 5.2.1 implies that

E[(y − X_1β̂_1*)⊤(y − X_1β̂_1*)]
  = E[(y − X_1(X_1⊤X_1)⁻¹X_1⊤y)⊤(y − X_1(X_1⊤X_1)⁻¹X_1⊤y)]
  = E[y⊤(I_n − X_1(X_1⊤X_1)⁻¹X_1⊤)(I_n − X_1(X_1⊤X_1)⁻¹X_1⊤)y]
  = E[y⊤(I_n − X_1(X_1⊤X_1)⁻¹X_1⊤)y]
  = tr((I_n − X_1(X_1⊤X_1)⁻¹X_1⊤)(σ²I)) + β⊤X⊤(I_n − X_1(X_1⊤X_1)⁻¹X_1⊤)Xβ
  = σ²(tr(I_n) − tr(X_1(X_1⊤X_1)⁻¹X_1⊤)) + β⊤X⊤(I_n − X_1(X_1⊤X_1)⁻¹X_1⊤)Xβ
  = σ²(tr(I_n) − tr(X_1⊤X_1(X_1⊤X_1)⁻¹)) + β⊤X⊤(I_n − X_1(X_1⊤X_1)⁻¹X_1⊤)Xβ
  = σ²(n − p − 1) + (X_1β_1 + X_2β_2)⊤(I_n − X_1(X_1⊤X_1)⁻¹X_1⊤)(X_1β_1 + X_2β_2)
  = σ²(n − p − 1) + β_2⊤X_2⊤(I_n − X_1(X_1⊤X_1)⁻¹X_1⊤)X_2β_2.

Example 7.9.1: Suppose y_i = β_0 + β_1 x_i + β_2 x_i² + ε_i for i = 1, … , 4 where x_i = i are known values, the regression parameters β_0, β_1, and β_2 are unknown, and ε_1, … , ε_4 are independent and identically distributed random variables with mean 0 and unknown variance σ².

Now, suppose that the x² term is excluded from the model, and least squares estimation is used to model the y's based on an intercept term and the x's (that is, suppose that we incorrectly use a linear model and find values β̂_0* and β̂_1* which minimize ∑_{i=1}^4 (y_i − β_0* − β_1* x_i)²). What is the bias of using s_1² = (1/2) ∑_{i=1}^4 (y_i − β̂_0* − β̂_1* x_i)² to estimate σ²?

Answer: Here the design matrix is

                          ⎛ 1  1 ⎞            ⎛ 1  ⎞
X = (X_1, X_2) with X_1 = ⎜ 1  2 ⎟  and X_2 = ⎜ 4  ⎟ ,
                          ⎜ 1  3 ⎟            ⎜ 9  ⎟
                          ⎝ 1  4 ⎠            ⎝ 16 ⎠

so

X_1⊤X_1 = ⎛ 4   10 ⎞ ,   X_2⊤X_2 = 354,   and   X_1⊤X_2 = ⎛ 30  ⎞ .
          ⎝ 10  30 ⎠                                       ⎝ 100 ⎠

Then (X_1⊤X_1)⁻¹ = (1/20) ⎛ 30  −10 ⎞ = ⎛ 1.5   −0.5 ⎞ , so
                          ⎝ −10   4 ⎠   ⎝ −0.5   0.2 ⎠

X_2⊤X_1(X_1⊤X_1)⁻¹X_1⊤X_2 = (30, 100) ⎛ 1.5   −0.5 ⎞ ⎛ 30  ⎞ = 350
                                      ⎝ −0.5   0.2 ⎠ ⎝ 100 ⎠

and it follows that E(s_1²) − σ² = β_2²(354 − 350)/(4 − 1 − 1) = 2β_2².
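The matrix arithmetic in this answer can be reproduced directly in R (a small sketch; only the numbers already given above are used).

x=1:4
X1=cbind(1,x)                           # intercept and x columns
X2=x^2                                  # the omitted quadratic column
H1=X1%*%solve(t(X1)%*%X1)%*%t(X1)       # projection onto the columns of X1
drop(t(X2)%*%(diag(4)-H1)%*%X2)         # 354 - 350 = 4
drop(t(X2)%*%(diag(4)-H1)%*%X2)/(4-1-1) # 2, so E(s1^2) - sigma^2 = 2 beta2^2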

For the next result, we will need Theorem 2.5.3:

If A = ⎛ A_11  A_12 ⎞ is an n × n nonsingular matrix, then
       ⎝ A_21  A_22 ⎠

A⁻¹ = ⎛ A_11⁻¹ + A_11⁻¹A_12 B⁻¹ A_21 A_11⁻¹    −A_11⁻¹A_12 B⁻¹ ⎞
      ⎝ −B⁻¹A_21A_11⁻¹                          B⁻¹            ⎠

where B = A_22 − A_21 A_11⁻¹ A_12.
Theorem 7.9.3 (p.171): Suppose X_1⊤X_1 is invertible. If β̂ = (X⊤X)⁻¹X⊤y where X = (X_1, X_2) and β̂ = (β̂_1⊤, β̂_2⊤)⊤, and β̂_1* = (X_1⊤X_1)⁻¹X_1⊤y, then

cov(β̂_1) − cov(β̂_1*) = σ²AB⁻¹A⊤

where A = (X_1⊤X_1)⁻¹X_1⊤X_2 and B = X_2⊤X_2 − X_2⊤X_1A. Furthermore, AB⁻¹A⊤ is positive definite or positive semidefinite, so var(β̂_j) ≥ var(β̂_j*) for all j.
Proof: Since

cov(β̂) = σ²(X⊤X)⁻¹

        = σ² ⎛ X_1⊤X_1   X_1⊤X_2 ⎞ ⁻¹
             ⎝ X_2⊤X_1   X_2⊤X_2 ⎠

        = σ² ⎛ (X_1⊤X_1)⁻¹ + (X_1⊤X_1)⁻¹X_1⊤X_2 B⁻¹ X_2⊤X_1(X_1⊤X_1)⁻¹    −(X_1⊤X_1)⁻¹X_1⊤X_2 B⁻¹ ⎞
             ⎝ −B⁻¹X_2⊤X_1(X_1⊤X_1)⁻¹                                       B⁻¹                    ⎠

        = σ² ⎛ (X_1⊤X_1)⁻¹ + AB⁻¹A⊤    −AB⁻¹ ⎞
             ⎝ −B⁻¹A⊤                   B⁻¹   ⎠

where B = X_2⊤X_2 − X_2⊤X_1(X_1⊤X_1)⁻¹X_1⊤X_2 = X_2⊤X_2 − X_2⊤X_1A, we have

cov(β̂_1) = σ²((X_1⊤X_1)⁻¹ + AB⁻¹A⊤).

Thus, cov(β̂_1) − cov(β̂_1*) = σ²AB⁻¹A⊤. This matrix is positive definite or positive semidefinite since, for any v with the dimension of β_1, setting u = B⁻¹A⊤v (and using the symmetry of B),

v⊤AB⁻¹A⊤v = u⊤Bu
           = u⊤X_2⊤(I − X_1(X_1⊤X_1)⁻¹X_1⊤)X_2u
           = u⊤X_2⊤(I − X_1(X_1⊤X_1)⁻¹X_1⊤)⊤(I − X_1(X_1⊤X_1)⁻¹X_1⊤)X_2u
           = w⊤w ≥ 0

where w = (I − X_1(X_1⊤X_1)⁻¹X_1⊤)X_2u.

7.10 Orthogonalization

Notice that if the columns of X_1 are orthogonal to the columns of X_2 (that is, X_1⊤X_2 = O), then the least squares estimates of β_1 and β_2 based on fitting models separately with only X_1 and X_2, respectively, are the same as in the full model based on X = (X_1, X_2).

Note that X_1⊤X_2 = O implies A = O and B = X_2⊤X_2. Then we see that

⎛ β̂_1 ⎞  =  (X⊤X)⁻¹X⊤y
⎝ β̂_2 ⎠
          =  ⎛ X_1⊤X_1   X_1⊤X_2 ⎞ ⁻¹ ⎛ X_1⊤y ⎞
             ⎝ X_2⊤X_1   X_2⊤X_2 ⎠    ⎝ X_2⊤y ⎠
          =  ⎛ (X_1⊤X_1)⁻¹ + AB⁻¹A⊤    −AB⁻¹ ⎞ ⎛ X_1⊤y ⎞
             ⎝ −B⁻¹A⊤                   B⁻¹   ⎠ ⎝ X_2⊤y ⎠
          =  ⎛ (X_1⊤X_1)⁻¹     O           ⎞ ⎛ X_1⊤y ⎞
             ⎝ O               (X_2⊤X_2)⁻¹ ⎠ ⎝ X_2⊤y ⎠
          =  ⎛ (X_1⊤X_1)⁻¹X_1⊤y ⎞ ,
             ⎝ (X_2⊤X_2)⁻¹X_2⊤y ⎠

which are exactly the least squares estimates from the two separate fits.

Orthogonalization also provides a method for updating the coefficient estimates for a regression model of y based only on X_1 to obtain the coefficient estimates for the full model of y based on X = (X_1, X_2), even when the columns of X_2 are not orthogonal to the columns of X_1.

Here are the steps (a short R sketch implementing them follows the derivation below):

Step 1: Regress y on X_1 to obtain β̂_1* = (X_1⊤X_1)⁻¹X_1⊤y and residuals ε̂_1* = y − ŷ_1 where ŷ_1 = X_1β̂_1*.

Step 2: Simultaneously regress each column of X_2 on X_1, where each column of (X_1⊤X_1)⁻¹X_1⊤X_2 = A gives the coefficients for the corresponding column. Then the residuals are X_{2·1} = X_2 − X_1A.

Step 3: Regress y − ŷ_1 on X_{2·1} to obtain β̂_2. The fitted vector of this regression is ŷ − ŷ_1, which gives

ŷ − ŷ_1 = X_{2·1}β̂_2
ŷ − ŷ_1 = (X_2 − X_1A)β̂_2
ŷ = ŷ_1 − X_1Aβ̂_2 + X_2β̂_2
ŷ = X_1β̂_1* − X_1Aβ̂_2 + X_2β̂_2
ŷ = X_1(β̂_1* − Aβ̂_2) + X_2β̂_2,

so β̂_1 = β̂_1* − Aβ̂_2.
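Here is a short R sketch (simulated data; all names are hypothetical) implementing Steps 1 through 3 for a general split X = (X_1, X_2) and checking that the updated coefficients match the full-model least squares fit.

set.seed(11)
n=50
X1=cbind(1,rnorm(n))                    # columns already in the model (includes intercept)
X2=cbind(rnorm(n),rnorm(n))             # columns to be added
y=drop(cbind(X1,X2)%*%c(1,2,-1,0.5))+rnorm(n)
# Step 1: regress y on X1
beta1.star=solve(t(X1)%*%X1,t(X1)%*%y)
res1=y-drop(X1%*%beta1.star)
# Step 2: regress each column of X2 on X1; the residuals form X2.1
A=solve(t(X1)%*%X1,t(X1)%*%X2)          # alias matrix
X2.1=X2-X1%*%A
# Step 3: regress the Step 1 residuals on X2.1
beta2.hat=solve(t(X2.1)%*%X2.1,t(X2.1)%*%res1)
beta1.hat=beta1.star-A%*%beta2.hat      # updated coefficients for X1
rbind(c(beta1.hat,beta2.hat),
      coef(lm(y~cbind(X1,X2)-1)))       # full-model fit gives the same values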

R Example 7.10.1: Let’s see how the orthogonal method works geometrically for the data in R Example 7.4.1.

First, we regress y = (4, 2, 0)⊤ on j = (1, 1, 1)⊤ to obtain

β̂_1* = j⊤y / j⊤j = 6/3 = 2

so the projection of y onto j is ŷ_1 = 2j and its residual is

y − ŷ_1 = (2, 0, −2)⊤

as illustrated below.

x=c(3,-1,1)
y=c(4,2,0)
j=rep(1,3)
y.hat.1=mean(y)*j
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col=ifelse(s==0,"#000000","#AAAAAA"))
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col=ifelse(t==0,"#000000","#AAAAAA"))
}
segments3d(c(0,1),c(0,1),c(0,1),col="#CC0000",lwd=5)
segments3d(c(0,y[1]),c(0,y[2]),c(0,y[3]),col="#0000FF",lwd=5)
segments3d(c(y[1],y.hat.1[1]),c(y[2],y.hat.1[2]),c(y[3],y.hat.1[3]),col="#FF8800",lwd=5)

Then we find a vector in the subspace spanned by x = (3, −1, 1)⊤ and j = (1, 1, 1)⊤ orthogonal to j by computing the residual after regressing x on j. The coefficient estimate of this regression is

A = j⊤x / j⊤j = 3/3 = 1

so the projection of x onto j is

x̂ = 1·j = j

and its residual is

X_{2·1} = x − x̂ = (2, −2, 0)⊤

as illustrated below.

x.hat=mean(x)*j
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col=ifelse(s==0,"#000000","#AAAAAA"))
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col=ifelse(t==0,"#000000","#AAAAAA"))
}
segments3d(c(0,1),c(0,1),c(0,1),col="#CC0000",lwd=5)
segments3d(c(0,x[1]),c(0,x[2]),c(0,x[3]),col="#FF0000",lwd=5)
segments3d(c(x[1],x.hat[1]),c(x[2],x.hat[2]),c(x[3],x.hat[3]),col="#00FFFF",lwd=5)

Then we regress the residual y − ŷ_1 = (2, 0, −2)⊤ on x − x̂ = (2, −2, 0)⊤ and obtain the regression coefficient

β̂_2 = (y − ŷ_1)⊤(x − x̂) / ((x − x̂)⊤(x − x̂)) = (4 + 0 + 0)/(4 + 4 + 0) = 1/2.

The projection of y − ŷ_1 onto x − x̂ is

ŷ − ŷ_1 = (1/2)(x − x̂).

Its projection and its residual

(y − ŷ_1) − (ŷ − ŷ_1) = y − ŷ = (1, 1, −2)⊤

are illustrated below.

beta.hat.2=sum((y-y.hat.1)*(x-x.hat))/sum((x-x.hat)^2)
plot3d(c(0,1),c(0,1),c(0,1),type="n",xlab="",ylab="",zlab="",lwd=5,xlim=c(0,3),ylim=c(0,3),zlim=c(0,3),box=FALSE,axes=FALSE)
m=4
for (s in -m:m) for (t in -m:m){
segments3d(c(3*s-m,3*s+m),c(-s-m,-s+m),c(s-m,s+m),col=ifelse(s==0,"#000000","#AAAAAA"))
segments3d(c(t-3*m,t+3*m),c(t+m,t-m),c(t-m,t+m),col=ifelse(t==0,"#000000","#AAAAAA"))
}
segments3d(c(y[1],y.hat.1[1]),c(y[2],y.hat.1[2]),c(y[3],y.hat.1[3]),col="#FF8800",lwd=5)
segments3d(c(mean(x)+x[1],mean(x)+x.hat[1]),c(mean(x)+x[2],mean(x)+x.hat[2]),c(mean(x)+x[3],mean(x)+x.hat[3]),col="#00FFFF",lwd=5)
segments3d(c(mean(y)+beta.hat.2*(x[1]-mean(x)),y[1]),c(mean(y)+beta.hat.2*(x[2]-mean(x)),y[2]),c(mean(y)+beta.hat.2*(x[3]-mean(x)),y[3]),col="#00FF00",lwd=5)

Finally, we obtain the regression coefficients for the full model:

ŷ − ŷ_1 = (1/2)(x − x̂)
ŷ = ŷ_1 + (1/2)(x − x̂)
ŷ = 2j + (1/2)(x − j)
ŷ = 2j + (1/2)x − (1/2)j
ŷ = (3/2)j + (1/2)x,

and we see that ŷ = (3, 1, 2)⊤.


R Example 7.10.2: The following code shows the computation time for various methods used to compute the estimates of
the regression coefficients in R Example 7.10.1.

x=c(3,-1,1)
X=cbind(1,x)
y=c(4,2,0)

### METHOD 1: Use lm function.


# Start the clock
R=10000
ptm <- proc.time()
# Run code R times
for (i in 1:R){
beta.hat=lm(y~x)$coef
}
# Stop the clock
run=proc.time() - ptm
run/R

## user system elapsed


## 0.000664 0.000000 0.000664

### METHOD 2: Use multiple linear regression formula for beta.hat.


# Start the clock
R=100000
ptm <- proc.time()
# Run code R times
for (i in 1:R){
beta.hat=solve(t(X)%*%X)%*%t(X)%*%y
}
# Stop the clock
run=proc.time() - ptm
run/R

## user system elapsed


## 3.43e-05 0.00e+00 3.44e-05

### METHOD 3: Use orthogonalization/simple linear regression method.


# Start the clock
R=1000000
ptm <- proc.time()
# Run code R times
for (i in 1:R){
m=mean(x)
xc=x-m
beta2.hat=sum(xc*y)/sum(xc*xc)
beta1.hat=mean(y)-beta2.hat*m
}
# Stop the clock
run=proc.time() - ptm
run/R

## user system elapsed


## 9.54e-06 0.00e+00 9.55e-06

In R, we can also use C functions to make computations more efficient. Here is a method to perform linear regression based on least squares using the LAPACK routine dgels.


#include <R.h>
#include <stdlib.h>   /* for malloc and free */

extern int dgels_(char *trans, int *m, int *n, int *nrhs, double *a, int *lda, double *b, int *ldb, double *work, int *lwork, int *info);

void ls(double *X, double *y, int *n, int *p){
/* documentation for dgels_ at
   http://www.netlib.org/lapack/explore-html/d7/d3b/group__double_g_esolve_ga225c8efde208eaf246882df48e590eac.html
*/
double *wrk;
char trans='N';
int intone=1;
int info;
int lw=(*p)+16*(*n);
wrk=malloc(lw*sizeof(double));
dgels_(&trans,n,p,&intone,X,n,y,n,wrk,&lw,&info);
free(wrk);
}

dyn.load("ls.dll")
ourls = function(X,y){
n=as.integer(nrow(X))
p=as.integer(ncol(X))
out=.C("ls",as.double(X),beta=as.double(y),n,p)
out$beta[1:p]
}
ourls(X,y)

## [1] 1.5 0.5

Now, let’s compare its speed with the other methods.

### METHOD 4: Use linear regression method from C.


# Start the clock
R=1000000
ptm <- proc.time()
# Run code R times
for (i in 1:R){
beta.hat=ourls(X,y)
}
# Stop the clock
run=proc.time() - ptm
run/R

## user system elapsed


## 5.97e-06 0.00e+00 5.97e-06
