
Linear Methods for Regression

Sun Baoluo

Department of Statistics and Data Science

Spring 2023


Review of linear regression model

Predictors (also called independent variables, explanatory variables, covariates, or regressors) are denoted by x1, ..., xp.
The response (also called the dependent variable or outcome variable) is denoted by Y.
The linear regression model of Y on x1, ..., xp is

Y = β0 + β1 x1 + ... + βp xp + ε,

where E(ε) = 0 and var(ε) = σ². (Pay attention to the model assumptions.)


With n observations, we arrange the data in a table as follows:

observation   constant   x1     ...   xp     Y
1             1          x1,1   ...   x1,p   Y1
2             1          x2,1   ...   x2,p   Y2
⋮             ⋮          ⋮            ⋮      ⋮
n             1          xn,1   ...   xn,p   Yn


The model can also be formulated as

Y1 = β0 + β1 x1,1 + ... + βp x1,p + ε1,
Y2 = β0 + β1 x2,1 + ... + βp x2,p + ε2,
⋮
Yn = β0 + β1 xn,1 + ... + βp xn,p + εn,

or

Yi = β0 + β1 xi,1 + ... + βp xi,p + εi,   i = 1, ..., n.


Assumptions (for making valid statistical inference):

(A) The predictors xi are usually assumed to be non-random. In econometrics they are allowed to be random, but are then usually assumed to be uncorrelated with ε.
(B) Eε1 = Eε2 = ... = Eεn = 0;
(C) ε1, ..., εn are independent;
(D) Var(ε1) = ... = Var(εn) = σ².



By writing Xi = (1, xi,1, ..., xi,p)⊤ and

Y = (Y1, Y2, ..., Yn)⊤,   E = (ε1, ε2, ..., εn)⊤,   β = (β0, β1, ..., βp)⊤,

and letting X be the n × (p + 1) design matrix whose i-th row is Xi⊤,

X = [ 1  x1,1  ...  x1,p ]
    [ 1  x2,1  ...  x2,p ]
    [ ⋮    ⋮          ⋮  ]
    [ 1  xn,1  ...  xn,p ]

the model can be written as

Y = Xβ + E                                             (1.1)

and the corresponding assumptions can be written as

EE = 0,   Var(E) = σ²I,

where I is an n × n identity matrix.

Least Squares Estimation


The least squares method estimates β by minimizing

∑_{i=1}^n {Yi − (β0 + β1 xi,1 + ... + βp xi,p)}²

with respect to β. Note that

∑_{i=1}^n {Yi − (β0 + β1 xi,1 + ... + βp xi,p)}²
  = ∑_{i=1}^n {Yi − Xi⊤β}²
  = (Y − Xβ)⊤(Y − Xβ).

The least squares estimate (LSE) is

β̂ = (β̂0, ..., β̂p)⊤ = (X⊤X)⁻¹X⊤Y.                       (1.2)
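A minimal R sketch (simulated data; all names illustrative) computing the closed-form LSE (1.2) directly and checking it against lm():

## A minimal sketch: closed-form LSE vs. lm() on simulated data.
set.seed(1)
n <- 100; p <- 3
x <- matrix(rnorm(n * p), n, p)                 # predictors x1, ..., xp
beta <- c(1, 0.5, -2, 0)                        # true (beta0, beta1, ..., betap)
y <- drop(beta[1] + x %*% beta[-1] + rnorm(n))  # Y = X beta + error
X <- cbind(1, x)                                # design matrix with constant column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # (X'X)^{-1} X'Y
cbind(drop(beta_hat), coef(lm(y ~ x)))          # the two columns agree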



[Figure 3.1 (Elements of Statistical Learning, 2nd Ed., Hastie, Tibshirani & Friedman 2009, Chap. 3): Linear least squares fitting with X ∈ IR². We seek the linear function of X that minimizes the sum of squared residuals from Y.]


Proof: Let

Q(β0, β1, ..., βp) = ∑_{i=1}^n {Yi − (β0 + β1 xi,1 + ... + βp xi,p)}².

The LSE must satisfy

∂Q(β0, β1, ..., βp)/∂β0 = 0,
∂Q(β0, β1, ..., βp)/∂β1 = 0,
⋮
∂Q(β0, β1, ..., βp)/∂βp = 0,

which by simple calculation is



−2 ∑_{i=1}^n {Yi − (β0 + β1 xi,1 + ... + βp xi,p)} = 0,
−2 ∑_{i=1}^n {Yi − (β0 + β1 xi,1 + ... + βp xi,p)} xi,1 = 0,
⋮
−2 ∑_{i=1}^n {Yi − (β0 + β1 xi,1 + ... + βp xi,p)} xi,p = 0.

In matrix form,

X⊤Xβ = X⊤Y.

The solution is¹

β̂ = (X⊤X)⁻¹X⊤Y.

¹ Note that X⊤X must be of full rank, i.e. rank p + 1, in order to have an inverse. Its rank is the same as the rank of X.

Random vector

A random vector is a column vector of random variables,

Z = (Z1, Z2, ..., Zp)⊤.

Expectation of a random vector:

E{Z} = (E{Z1}, E{Z2}, ..., E{Zp})⊤.

For random vectors Z and Y,

E(Z + Y) = EZ + EY.


Variance-covariance matrix of a random vector:

Var{Z} = E{[Z − E{Z}][Z − E{Z}]⊤}

       = [ Var{Z1}       Cov{Z1, Z2}   ...   Cov{Z1, Zp} ]
         [ Cov{Z2, Z1}   Var{Z2}       ...   Cov{Z2, Zp} ]
         [ ⋮             ⋮                   ⋮           ]
         [ Cov{Zp, Z1}   Cov{Zp, Z2}   ...   Var{Zp}     ]


In the linear regression model the errors are independent. If Var(εi) = σi², i = 1, ..., n, then

Var{E} = [ σ1²  0    ...  0   ]
         [ 0    σ2²  ...  0   ]
         [ ⋮    ⋮         ⋮   ]
         [ 0    0    ...  σn² ]

If Var(εi) = σ², i = 1, ..., n, then Var{E} = σ²I, where I is an n × n identity matrix.


For a constant matrix A, let

W = AZ.

Then

E{W} = A E{Z},
Var{W} = Var{AZ} = A Var{Z} A⊤.

If c is a constant vector, then

E(c + AZ) = c + A E{Z}

and

Var(c + AZ) = Var(AZ) = A Var{Z} A⊤.
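These identities are easy to verify numerically. A minimal R sketch (all values illustrative) checking E(c + AZ) = c + A E{Z} and Var(c + AZ) = A Var{Z} A⊤ by Monte Carlo:

## A minimal sketch: Monte Carlo check of the mean and variance rules.
set.seed(2)
A <- matrix(c(1, 2, 0, -1), 2, 2)
c0 <- c(3, -3)
Z <- matrix(rnorm(2 * 1e5), ncol = 2)   # 100000 draws of Z ~ N(0, I_2)
W <- t(c0 + A %*% t(Z))                 # each row of W is c0 + A z
colMeans(W)                             # approx. c0, since E{Z} = 0
cov(W)                                  # approx. A %*% t(A), since Var{Z} = I
A %*% t(A)                              # the theoretical variance for comparison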

Multivariate Normal distribution

Suppose Z = (ξ1, ..., ξp)⊤ is a random vector with

EZ = b,   Var(Z) = Σ.

If for any p × 1 constant vector ℓ,

ℓ⊤Z follows a normal distribution,

then we say Z follows a multivariate normal distribution², denoted by

Z ∼ N(b, Σ).

² The probability density function is

f(x1, ..., xk) = (2π)^{−k/2} |Σ|^{−1/2} exp{−(1/2)(x − µ)⊤Σ⁻¹(x − µ)}.


Lemma
If Z ∼ N(b, Σ), then for any constant vector ℓ and scalar c,

ℓ⊤Z + c ∼ N(ℓ⊤b + c, ℓ⊤Σℓ).


Example
Suppose (Z1, ..., Zp)⊤ ∼ N(b, Σ), where

b = (b1, b2, ..., bp)⊤,   Σ = [ σ11  σ12  ...  σ1p ]
                              [ σ21  σ22  ...  σ2p ]
                              [ ⋮    ⋮          ⋮  ]
                              [ σp1  σp2  ...  σpp ]

Then

Zk ∼ N(bk, σkk),
Zk + Zj ∼ N(bk + bj, σkk + 2σjk + σjj),
Zk − Zj ∼ N(bk − bj, σkk − 2σjk + σjj).


Central Limit Theorem: Suppose X1, ..., Xn are IID (independent and identically distributed) with EXi = µ and Cov(Xi) = Σ. Then

(1/√n) ∑_{i=1}^n (Xi − µ) → N(0, Σ)

in distribution as n → ∞.



Law of Large Numbers: Suppose X1, ..., Xn are IID with EXi = µ and Cov(Xi) = Σ. Then

(1/n) ∑_{i=1}^n Xi → E(X1) = µ

and

(1/n) ∑_{i=1}^n (Xi − µ)(Xi − µ)⊤ → Σ

in probability as n → ∞. Roughly speaking,

(1/n) ∑_{i=1}^n (Xi − µ)(Xi − µ)⊤ ≈ Σ

when n is large enough.


Example
In the notation above, we have

X⊤X = ∑_{i=1}^n Xi Xi⊤,

and if the Xi are IID with mean µ and covariance matrix Σ, then by the law of large numbers

(1/n) X⊤X → E(X1 X1⊤) = Var(X1) + µµ⊤ = Σ + µµ⊤

in probability.
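A minimal R sketch (distribution and parameters chosen for illustration) of this convergence:

## A minimal sketch: (1/n) X'X approaches Sigma + mu mu' for IID rows.
set.seed(3)
n <- 1e5
mu <- c(1, 2)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
U <- chol(Sigma)                                         # Sigma = U'U
X <- sweep(matrix(rnorm(n * 2), n, 2) %*% U, 2, mu, "+") # rows ~ N(mu, Sigma)
t(X) %*% X / n                                           # empirical (1/n) X'X
Sigma + mu %*% t(mu)                                     # the limit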

Inference of Linear Regression Model


β̂ is unbiased:

β̂ = (X⊤X)⁻¹X⊤Y = (X⊤X)⁻¹X⊤(Xβ + E) = β + (X⊤X)⁻¹X⊤E.

Thus

Eβ̂ = β + E{(X⊤X)⁻¹X⊤E} = β + (X⊤X)⁻¹X⊤EE = β.

Fitted values:

Ŷ = X(X⊤X)⁻¹X⊤Y,

i.e. Ŷ is a linear function of Y.

σ² can be estimated by σ̂² = RSS/(n − p − 1), where

RSS = ∑_{i=1}^n {Yi − Xi⊤β̂}²

is the residual sum of squares.
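A minimal R sketch (simulated data; names illustrative) computing RSS and σ̂² and checking against the estimate reported by lm():

## A minimal sketch: RSS and sigma-hat^2 vs. lm()'s estimate.
set.seed(4)
n <- 100; p <- 3
x <- matrix(rnorm(n * p), n, p)
y <- drop(1 + x %*% c(0.5, -2, 0) + rnorm(n))
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)            # RSS = sum_i (Yi - Xi' beta_hat)^2
sigma2_hat <- rss / (n - p - 1)
c(sigma2_hat, summary(fit)$sigma^2)     # the two values agree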



The distribution of β̂ = (β̂0, ..., β̂p)⊤: If

E ∼ N(0, σ²I)                                          (3.1)

where I is the n × n identity matrix, then

β̂ − β ∼ N(0, (X⊤X)⁻¹σ²).                               (3.2)

We often need to test H0: βk = 0 (in order to see whether we can remove variable xk from the model). This is because

β̂k − βk ∼ N(0, c_{k+1,k+1} σ²),

where c_{k+1,k+1} is the (k+1, k+1) diagonal entry of (X⊤X)⁻¹ (the index is k+1 because β0 occupies the first position).


√(c_{k+1,k+1} σ̂²) is called the standard error of β̂k;
tk = β̂k / √(c_{k+1,k+1} σ̂²) is called the t-statistic for β̂k;
P(|T| > |tk|) is the p-value for β̂k, where T ∼ t(n − p − 1).
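A minimal R sketch (simulated data) computing the standard errors, t-statistics and p-values from these formulas; the results match the coefficient table of summary(lm(y ~ x)):

## A minimal sketch: SEs, t-statistics and p-values from the formulas above.
set.seed(5)
n <- 100; p <- 2
x <- matrix(rnorm(n * p), n, p)
y <- drop(1 + 0.8 * x[, 1] + rnorm(n))            # the true beta2 is 0
X <- cbind(1, x)
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% y))
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (n - p - 1)
se <- sqrt(diag(solve(t(X) %*% X)) * sigma2_hat)  # sqrt(c_{jj} sigma-hat^2)
t_stat <- beta_hat / se
p_val <- 2 * pt(abs(t_stat), df = n - p - 1, lower.tail = FALSE)
cbind(beta_hat, se, t_stat, p_val)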


At significance level 0.05 (when n is large):

If |β̂k| ≤ 1.96 √(c_{k+1,k+1} σ̂²), we accept H0: βk = 0;
If |β̂k| > 1.96 √(c_{k+1,k+1} σ̂²), we reject H0: βk = 0;

or equivalently:

If the t-value |tk| = |β̂k / √(c_{k+1,k+1} σ̂²)| ≤ 1.96, we accept H0: βk = 0;
If the t-value |tk| = |β̂k / √(c_{k+1,k+1} σ̂²)| > 1.96, we reject H0: βk = 0.


Example
For (data A), there are 5 predictors x1, ..., x5 and one response Y. We fit the model

Y = β0 + β1 x1 + ... + β5 x5 + ε.

The estimated model is

Y    = 0.20068 − 0.35520 x1 + 0.98745 x2 − 0.22444 x3 − 0.73906 x4 + 0.06922 x5
(SE)   (0.1446)  (0.1553)     (0.1335)    (0.1473)     (0.1184)     (0.1461)

R code for the calculation (code01A.R)
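The file code01A.R itself is not reproduced here; a minimal sketch of what it might contain (the data file name and column names are assumptions):

## A minimal sketch of code01A.R (file and column names hypothetical).
dat <- read.table("data01A.dat", header = TRUE)   # assumed: Y, x1, ..., x5
fit <- lm(Y ~ x1 + x2 + x3 + x4 + x5, data = dat)
summary(fit)    # coefficient estimates, SEs, t-values and p-values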

Confidence interval for the regression function


For a subject with predictor X = (1, x1, ..., xp)⊤, the regression function

EY = β0 + β1 x1 + ... + βp xp

is a function of X. The estimator of EY is

ÊY = β̂0 + β̂1 x1 + ... + β̂p xp = X⊤β̂ = ŷ.

By Lemma 1, we have

ÊY ∼ N(EY, σ² X⊤(X⊤X)⁻¹X).

The 95% confidence interval (CI) for EY is

[ ÊY − 1.96 σ √(X⊤(X⊤X)⁻¹X) ,  ÊY + 1.96 σ √(X⊤(X⊤X)⁻¹X) ]

(lower bound, upper bound), or approximately

[ ÊY − 1.96 σ̂ √(X⊤(X⊤X)⁻¹X) ,  ÊY + 1.96 σ̂ √(X⊤(X⊤X)⁻¹X) ].

Least Squares Estimation


[Figure 3.2 (Elements of Statistical Learning, 2nd Ed., Hastie, Tibshirani & Friedman 2009, Chap. 3): The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection ŷ represents the vector of the least squares predictions.]
Confidence interval for the regression function

R function: lm(...)

Example continued. Suppose we have a new data set (data B), in which the predictors are observed for 5 subjects. Based on the above formula, for the second subject we predict the expected value of Y as

ÊY = −0.4839,

and its (i.e. EY's) 95% CI is [−0.9732, 0.0053]. Or, by R; see code (code01B.R).
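A minimal sketch of how such a CI can be obtained with predict() (the data file names and variable names are assumptions; cf. code01B.R):

## A minimal sketch: CI for EY at a new observation (file names hypothetical).
datA <- read.table("data01A.dat", header = TRUE)  # training data A
datB <- read.table("data01B.dat", header = TRUE)  # new predictors, data B
fit <- lm(Y ~ x1 + x2 + x3 + x4 + x5, data = datA)
predict(fit, newdata = datB[2, ], interval = "confidence", level = 0.95)
## returns the point estimate of EY with its lower and upper 95% bounds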

Confidence band

A special case is p = 1, which is more relevant to the discussion later. The predicted expectation of Y is

ÊY = β̂0 + β̂1 x.

The 95% CI for EY is approximately

[ ÊY − 1.96 σ̂ √(X⊤(X⊤X)⁻¹X) ,  ÊY + 1.96 σ̂ √(X⊤(X⊤X)⁻¹X) ],

where X = (1, x)⊤.

The predicted expectation of Y and the bounds of the CI are all functions of x. We can draw these functions and obtain the confidence band.


For (data C), there are 50 observations, each with predictor X and response Y. We fit a linear regression model:

Yi = β0 + β1 xi + εi.

Using R, we get the regression function and its confidence band as shown in the figure below; see code (code01C.R).
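A minimal sketch of drawing such a band (data simulated here; code01C.R works with the actual data C, which is not shown):

## A minimal sketch: confidence band for simple linear regression.
set.seed(6)
x <- runif(50, 1, 5)
y <- 0.2 + 0.9 * x + rnorm(50, sd = 0.6)
fit <- lm(y ~ x)
grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
band <- predict(fit, newdata = grid, interval = "confidence")
plot(x, y)
lines(grid$x, band[, "fit"])            # estimated regression line
lines(grid$x, band[, "lwr"], lty = 2)   # lower 95% band
lines(grid$x, band[, "upr"], lty = 2)   # upper 95% band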


[Figure: Estimated regression function (y against x) with its 95% confidence band.]


Subset selection

Prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking coefficients or setting some of them to zero, so that the variance of the predicted values is reduced (at the expense of larger bias).
Interpretation: it is often of interest to identify a smaller subset of the features that gives the "big picture".


Best-subset selection

Best subset regression finds, for each k ∈ {0, 1, 2, ..., p}, the subset of features of size k whose least squares fit gives the smallest residual sum of squares

∑_{i=1}^n {Yi − Ŷi}²,

where Ŷi is the fitted value under the size-k submodel.
An efficient algorithm, the leaps and bounds procedure (Furnival and Wilson, 1974), makes this feasible for p as large as 30 or 40; see the sketch after this list.
The best-subset RSS curve is necessarily decreasing in k, and so cannot itself be used to select the subset size k.
The choice of k involves a bias-variance tradeoff. We will discuss strategies for choosing k later, under model selection.
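A minimal sketch of best-subset selection using the leaps package (data simulated; names illustrative):

## A minimal sketch of best-subset selection with leaps
## (in the simulated data only 3 of the 8 predictors matter).
library(leaps)
set.seed(7)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p); colnames(x) <- paste0("x", 1:p)
y <- drop(1 + 2 * x[, 1] - x[, 3] + 0.5 * x[, 5] + rnorm(n))
best <- regsubsets(x, y, nvmax = p)   # best subset of each size k = 1, ..., p
summary(best)$rss                     # RSS of the best model of each size
summary(best)$which                   # which predictors each best model uses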


Number of candidate models: for p predictors x1, ..., xp, there are 2^p candidate models:

model 0:        Y = a0 + ε
model 1:        Y = a1,0 + a1,1 x1 + ε
⋮
model p:        Y = ap,0 + ap,1 xp + ε
model p + 1:    Y = a(p+1),0 + a(p+1),1 x1 + a(p+1),2 x2 + ε
⋮
model 2^p − 1:  Y = ϕ0 + ϕ1 x1 + ... + ϕp xp + ε.

Suppose we have n samples. Consider any sub-model

(A): Y = β0 + β1 x′1 + ... + βq x′q + ε,

or

(A): Yi = β0 + β1 x′i1 + ... + βq x′iq + εi,   i = 1, ..., n,

where {x′1, ..., x′q} ⊂ {x1, ..., xp}. We can define its RSS by

RSS(A) = ∑_{i=1}^n {Yi − ŶA,i}²,

where ŶA,i is the fitted value of Yi under model (A).

For any two models A and B, if A is a sub-model (also called a simplified model) of B, then

RSS(A) ≥ RSS(B).

Why? Consider a simple case, where

(A): yi = β0 + β1 xi1 + εi,   (B): yi = β0 + β1 xi1 + β2 xi2 + εi.

Then

RSS(B) = min_{β0,β1,β2} ∑_{i=1}^n {Yi − β0 − β1 xi1 − β2 xi2}²
       ≤ min_{β0,β1, β2=0} ∑_{i=1}^n {Yi − β0 − β1 xi1 − β2 xi2}²
       = min_{β0,β1} ∑_{i=1}^n {Yi − β0 − β1 xi1}² = RSS(A).

For example, using this data (data01B01.dat) with the different models

model q:  Y = β0 + β1 x1 + ... + βq xq + ε,

the fitted errors, i.e. the RSS values, are shown below; see (code01Ba.R).


[Figure: RSS plotted against the number of predictors q.]
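A minimal sketch of how such an RSS-versus-q curve can be computed (cf. code01Ba.R; the file structure and column names Y, x1, ..., x10 are assumptions):

## A minimal sketch: RSS of the nested models q = 1, ..., 10.
dat <- read.table("data01B01.dat", header = TRUE)
rss <- sapply(1:10, function(q) {
  f <- reformulate(paste0("x", 1:q), response = "Y")  # Y ~ x1 + ... + xq
  sum(residuals(lm(f, data = dat))^2)                 # RSS of model q
})
plot(1:10, rss, type = "b", xlab = "number of predictors q", ylab = "RSS")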


Forward- and backward-stepwise selection

Forward selection starts with the intercept and then sequentially adds to the model the predictor that most improves the fit.
Forward selection is a greedy algorithm, evaluating a nested sequence of models, which may be sub-optimal since it does not evaluate all possible subsets. Nonetheless, there are several reasons for its use (see the sketch after this list):
Computational: for large p, forward-stepwise selection is still computationally tractable (even when p > n).
Statistical: stepwise selection is a more constrained approach, which yields lower variance at the possible expense of larger bias.
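A minimal sketch of forward-stepwise selection with step(), which adds predictors greedily by AIC (data simulated; names illustrative):

## A minimal sketch: forward-stepwise selection via step().
set.seed(8)
n <- 100
dat <- data.frame(matrix(rnorm(n * 5), n, 5))
names(dat) <- paste0("x", 1:5)
dat$Y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
null_fit <- lm(Y ~ 1, data = dat)       # start from the intercept-only model
step(null_fit, scope = ~ x1 + x2 + x3 + x4 + x5, direction = "forward")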


Shrinkage methods

Subset selection procedures are discrete processes and often exhibit high variance.
We will introduce shrinkage methods, which are more continuous in nature and do not suffer as much from high variability.
