Linear Methods for Regression
Sun Baoluo
Spring 2023
The linear regression model is
\[
Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,
\]
or
\[
Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i, \quad i = 1, \ldots, n.
\]
In matrix form,
\[
\mathbf{Y} = \mathbf{X}\beta + \mathcal{E}, \tag{1.1}
\]
with
\[
E(\mathcal{E}) = 0, \qquad \mathrm{Var}(\mathcal{E}) = \sigma^2 \mathbf{I},
\]
where \(\mathbf{I}\) is an \(n \times n\) identity matrix.
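For concreteness, a minimal R sketch that simulates data from this model; the sample size, coefficients, and noise level are arbitrary illustration values:

## Simulate from Y = X beta + E with E ~ (0, sigma^2 I).
## n, beta, and sigma are arbitrary illustration values.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
beta  <- c(1, 2, -0.5)          # (beta0, beta1, beta2)
sigma <- 1
X <- cbind(1, x1, x2)           # design matrix with intercept column
Y <- X %*% beta + rnorm(n, sd = sigma)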
[Figure: scatter of observations plotted against the two predictors X1 and X2.]
The least-squares estimator minimizes
\[
Q(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p})\}^2.
\]
Setting its partial derivatives to zero,
\[
\frac{\partial Q(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_0} = 0, \quad
\frac{\partial Q(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_1} = 0, \quad \ldots, \quad
\frac{\partial Q(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_p} = 0,
\]
gives the equations
\[
\begin{aligned}
-2\sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p})\} &= 0,\\
-2\sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p})\}\,x_{i,1} &= 0,\\
&\;\;\vdots\\
-2\sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p})\}\,x_{i,p} &= 0.
\end{aligned}
\]
In matrix form, these are the normal equations
\[
\mathbf{X}^\top \mathbf{X} \beta = \mathbf{X}^\top \mathbf{Y}.
\]
The solution is
\[
\hat\beta = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}.
\]
Note that \(\mathbf{X}^\top \mathbf{X}\) must be of full rank for the inverse to exist, i.e. its rank must be \(p+1\); its rank is the same as that of \(\mathbf{X}\).
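Continuing the simulation sketch above, the closed-form solution can be checked against R's lm():

## Closed-form least squares via the normal equations.
## X and Y are from the simulation sketch above.
beta_hat <- solve(crossprod(X), crossprod(X, Y))   # (X'X)^{-1} X'Y

## lm() should give (numerically) the same coefficients.
fit <- lm(Y ~ x1 + x2)
cbind(normal_eq = drop(beta_hat), lm = coef(fit))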
Random vector
A random vector is written as
\[
\mathbf{Z} = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_p \end{pmatrix}
\quad \text{or} \quad
\mathbf{Z} = (Z_1, Z_2, \cdots, Z_p)^\top.
\]
For random vectors,
\[
E(\mathbf{Z} + \mathbf{Y}) = E\mathbf{Z} + E\mathbf{Y}.
\]
For a linear transformation
\[
\mathbf{W} = A\mathbf{Z},
\]
we have
\[
E\{\mathbf{W}\} = A E\{\mathbf{Z}\}, \qquad
\mathrm{Var}\{\mathbf{W}\} = \mathrm{Var}\{A\mathbf{Z}\} = A\,\mathrm{Var}\{\mathbf{Z}\}\,A^\top.
\]
If \(c\) is a constant vector, then
\[
E(c + A\mathbf{Z}) = c + A E\{\mathbf{Z}\}
\]
and
\[
\mathrm{Var}(c + A\mathbf{Z}) = \mathrm{Var}(A\mathbf{Z}) = A\,\mathrm{Var}\{\mathbf{Z}\}\,A^\top.
\]
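A small Monte Carlo check of the variance identity in R; the matrices A and Sigma below are arbitrary illustration values:

## Check Var(AZ) = A Var(Z) A' by simulation.
set.seed(2)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)   # arbitrary covariance
A     <- matrix(c(1, 0, 1, -1), 2, 2)      # arbitrary transformation
L     <- chol(Sigma)                       # Sigma = L'L, L upper triangular
Z     <- matrix(rnorm(2 * 1e5), ncol = 2) %*% L   # rows ~ (0, Sigma)
W     <- Z %*% t(A)                        # each row of W is A z_i
cov(W)                  # empirical Var(W)
A %*% Sigma %*% t(A)    # theoretical A Sigma A'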
Write
\[
E\mathbf{Z} = b, \qquad \mathrm{Var}(\mathbf{Z}) = \Sigma.
\]
If \(\mathbf{Z}\) follows the multivariate normal distribution, we write
\[
\mathbf{Z} \sim N(b, \Sigma).
\]
The probability density function is
\[
f_{\mathbf{Z}}(z_1, \ldots, z_p) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (z - b)^\top \Sigma^{-1} (z - b) \right\}.
\]
Lemma
If \(\mathbf{Z} \sim N(b, \Sigma)\), then for any constant vector \(\ell\) and scalar \(c\),
\[
\ell^\top \mathbf{Z} + c \sim N(\ell^\top b + c,\; \ell^\top \Sigma \ell).
\]
Example
Suppose \((Z_1, \ldots, Z_p)^\top \sim N(b, \Sigma)\), where
\[
b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \ldots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \ldots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \ldots & \sigma_{pp}
\end{pmatrix}.
\]
Then
\[
\begin{aligned}
Z_k &\sim N(b_k, \sigma_{kk}),\\
Z_k + Z_j &\sim N(b_k + b_j,\; \sigma_{kk} + 2\sigma_{jk} + \sigma_{jj}),\\
Z_k - Z_j &\sim N(b_k - b_j,\; \sigma_{kk} - 2\sigma_{jk} + \sigma_{jj}).
\end{aligned}
\]
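A brief simulation of the \(Z_k - Z_j\) case; it assumes the MASS package (shipped with R) for multivariate normal draws:

## Z_k - Z_j ~ N(b_k - b_j, sigma_kk - 2 sigma_jk + sigma_jj);
## compare simulated moments with the formula (k = 1, j = 2 here).
library(MASS)  # for mvrnorm()
set.seed(3)
b     <- c(1, 2)
Sigma <- matrix(c(3, 1, 1, 2), 2, 2)
Z     <- mvrnorm(1e5, mu = b, Sigma = Sigma)
D     <- Z[, 1] - Z[, 2]
c(mean = mean(D), var = var(D))                        # simulated
c(mean = b[1] - b[2],
  var  = Sigma[1, 1] - 2 * Sigma[1, 2] + Sigma[2, 2])  # theoretical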
If \(X_1, \ldots, X_n\) are i.i.d. with mean \(\mu\) and variance \(\Sigma\), then
\[
\sqrt{n}\,(\bar{X} - \mu) \to N(0, \Sigma)
\]
in distribution as \(n \to \infty\), and
\[
\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)(X_i - \mu)^\top \to \Sigma.
\]
Example
In the notation above, we have
\[
\mathbf{X}^\top \mathbf{X} = \sum_{i=1}^{n} X_i X_i^\top.
\]
Thus, by the law of large numbers, \(n^{-1} \mathbf{X}^\top \mathbf{X} \to E(X_i X_i^\top)\) as \(n \to \infty\).
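A one-line check of this identity on the simulated design matrix from the earlier sketch:

## X'X equals the sum of the outer products of the rows of X.
sum_outer <- Reduce(`+`, lapply(seq_len(nrow(X)), function(i) tcrossprod(X[i, ])))
max(abs(crossprod(X) - sum_outer))   # ~ 0 up to numerical error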
Fitted values
\[
\hat{\mathbf{Y}} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y},
\]
i.e. \(\hat{\mathbf{Y}}\) is a linear function of \(\mathbf{Y}\).
\(\sigma^2\) can be estimated by \(\hat\sigma^2 = \mathrm{RSS}/(n - p - 1)\), where
\[
\mathrm{RSS} = \sum_{i=1}^{n} \{Y_i - X_i^\top \hat\beta\}^2.
\]
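Continuing the simulated fit from above, these quantities can be computed by hand and checked against lm():

## Hat matrix, fitted values, RSS, and sigma-hat, computed by hand.
H      <- X %*% solve(crossprod(X)) %*% t(X)   # projection ("hat") matrix
Y_hat  <- H %*% Y
RSS    <- sum((Y - Y_hat)^2)
sigma2 <- RSS / (nrow(X) - ncol(X))            # n - p - 1 = n - ncol(X)
c(by_hand = sqrt(sigma2), lm = summary(fit)$sigma)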
Assume further that
\[
\mathcal{E} \sim N(0, \sigma^2 \mathbf{I}). \tag{3.1}
\]
Let \(c_{k+1,k+1}\) denote the \((k+1)\)-th diagonal entry of \((\mathbf{X}^\top \mathbf{X})^{-1}\). Then:
\(\sqrt{c_{k+1,k+1} \hat\sigma^2}\) is called the standard error of \(\hat\beta_k\);
\(t_k = \hat\beta_k / \sqrt{c_{k+1,k+1} \hat\sigma^2}\) is called the t-statistic for \(\hat\beta_k\);
\(P(|T| > |t_k|)\) is the p-value for \(\hat\beta_k\), where \(T \sim t(n - p - 1)\).
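A sketch that reproduces the coefficient table by hand, using the quantities computed in the earlier snippets:

## Coefficient table by hand: SE, t, and p-value for each beta_k.
C  <- solve(crossprod(X))                       # (X'X)^{-1}
se <- sqrt(diag(C) * sigma2)                    # standard errors
tk <- drop(beta_hat) / se                       # t-statistics
pk <- 2 * pt(-abs(tk), df = nrow(X) - ncol(X))  # two-sided p-values
cbind(est = drop(beta_hat), se = se, t = tk, p = pk)
summary(fit)$coefficients                       # should match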
Example
For (data A), there are 5 predictors \(x_1, \ldots, x_5\) and one response \(Y\); we fit the model
\[
Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_5 x_5 + \varepsilon
\]
or approximately
\[
\Big[\underbrace{\widehat{EY} - 1.96\,\hat\sigma \sqrt{X^\top (\mathbf{X}^\top \mathbf{X})^{-1} X}}_{\text{lower bound}},\;
\underbrace{\widehat{EY} + 1.96\,\hat\sigma \sqrt{X^\top (\mathbf{X}^\top \mathbf{X})^{-1} X}}_{\text{upper bound}}\Big].
\]
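A minimal R sketch of this interval at a new design point; x0 below is an arbitrary illustration point, and beta_hat, sigma2, C come from the earlier snippets:

## 95% CI for E(Y) at a new point x0, from the formula above.
x0   <- c(1, 0.5, -1)            # (intercept, x1, x2), arbitrary values
EY   <- drop(x0 %*% beta_hat)    # point estimate X'beta_hat
half <- 1.96 * sqrt(sigma2) * sqrt(drop(t(x0) %*% C %*% x0))
c(lower = EY - half, fit = EY, upper = EY + half)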
[Figure 3.2: The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2; the projection ŷ represents the vector of the least squares predictions.]
R function: lm(...)
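For instance, a minimal usage sketch (the data frame dat reuses the simulated variables from above):

## Fit Y ~ x1 + x2 on a data frame and inspect the usual summaries.
dat <- data.frame(Y = drop(Y), x1 = x1, x2 = x2)
fit <- lm(Y ~ x1 + x2, data = dat)
summary(fit)     # coefficients, SEs, t-statistics, p-values
coef(fit)        # estimated beta
fitted(fit)      # Y-hat
residuals(fit)   # Y minus Y-hat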
Example continued. Suppose we have a new data set (data B), in which the predictors are observed for 5 subjects. Based on the above formula, for the second subject we predict the expected value of \(Y\) as
\[
\widehat{EY} = -0.4839,
\]
and its (i.e. \(EY\)'s) 95% CI is \([-0.9732,\ 0.0053]\).
Or, by R; see code (code01B.R).
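The same interval can be obtained with predict(); since data B itself is not shown here, the data frame newB below is a placeholder with the right column names:

## Confidence interval for E(Y) at new observations via predict().
## newB stands in for data B (5 new subjects, predictors assumed).
newB <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
predict(fit, newdata = newB, interval = "confidence", level = 0.95)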
Confidence band
For (data C), there are 50 observations, each with predictor \(X\) and response \(Y\). We fit a linear regression model:
\[
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.
\]
[Figure: Confidence band. Scatter plot of y against x for the 50 observations.]
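A sketch of how such a band can be drawn in base R; the simulated x and y below stand in for data C:

## Fit a simple linear regression and draw a pointwise 95% confidence band.
set.seed(4)
x  <- runif(50, 1, 5)                   # stand-in for data C
y  <- 1 + 0.8 * x + rnorm(50, sd = 0.5)
fm <- lm(y ~ x)
grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
band <- predict(fm, newdata = grid, interval = "confidence")
plot(x, y)
matlines(grid$x, band, lty = c(1, 2, 2), col = c(1, 2, 2))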
Subset selection
Best-subset selection
With \(p\) predictors there are \(2^p\) candidate models:
\[
\begin{aligned}
\text{model } 0 &: Y = a_0 + \varepsilon\\
\text{model } 1 &: Y = a_{1,0} + a_{1,1} x_1 + \varepsilon\\
&\;\;\vdots\\
\text{model } p &: Y = a_{p,0} + a_{p,1} x_p + \varepsilon\\
\text{model } p+1 &: Y = a_{p+1,0} + a_{p+1,1} x_1 + a_{p+1,2} x_2 + \varepsilon\\
&\;\;\vdots\\
\text{model } 2^p - 1 &: Y = \phi_0 + \phi_1 x_1 + \cdots + \phi_p x_p + \varepsilon.
\end{aligned}
\]
For any two models A and B, if A is a sub-model (also called a simplified model) of B, then
\[
\mathrm{RSS}(A) \ge \mathrm{RSS}(B).
\]
Why? Consider a simple case, where
\[
(A): y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i, \qquad
(B): y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i.
\]
Then
\[
\begin{aligned}
\mathrm{RSS}(B) &= \min_{\beta_0, \beta_1, \beta_2} \sum_{i=1}^{n} \{Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}\}^2\\
&\le \min_{\beta_0, \beta_1,\, \beta_2 = 0} \sum_{i=1}^{n} \{Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}\}^2\\
&= \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \{Y_i - \beta_0 - \beta_1 x_{i1}\}^2 = \mathrm{RSS}(A).
\end{aligned}
\]
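A quick R illustration that a sub-model can never fit better; the simulated data below are arbitrary:

## RSS of a sub-model is never smaller than RSS of the larger model.
set.seed(5)
u1 <- rnorm(60); u2 <- rnorm(60)
yu <- 1 + u1 + rnorm(60)                     # u2 is pure noise here
rssA <- sum(residuals(lm(yu ~ u1))^2)        # sub-model A
rssB <- sum(residuals(lm(yu ~ u1 + u2))^2)   # larger model B
c(rssA = rssA, rssB = rssB)
rssA >= rssB   # TRUE: the sub-model cannot fit better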
For example, using the data (data01B01.dat) and fitting the sequence of models
\[
\text{model } q: \; Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q + \varepsilon,
\]
the fitted errors, i.e. RSS, are shown below; see (code01Ba.R).
[Figure: RSS plotted against the number of predictors q.]
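The RSS sequence in the figure can be reproduced along these lines; the exact format of data01B01.dat is not shown here, so the read step and column names are assumptions:

## RSS of nested models with the first q predictors, q = 1, ..., p.
## Assumes data01B01.dat loads as a data frame with response Y and
## predictor columns x1, ..., xp (format assumed for illustration).
dat <- read.table("data01B01.dat", header = TRUE)
p   <- ncol(dat) - 1
rss <- sapply(1:p, function(q) {
  f <- reformulate(paste0("x", 1:q), response = "Y")  # Y ~ x1 + ... + xq
  sum(residuals(lm(f, data = dat))^2)
})
plot(1:p, rss, type = "b", xlab = "number of predictors q", ylab = "RSS")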
Shrinkage methods