Linear Methods for Regression
Sun Baoluo
Spring 2023
The linear regression model is
\[
Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,
\]
or
\[
Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i, \quad i = 1, \ldots, n.
\]
In matrix form,
\[
\mathbf{Y} = \mathbf{X}\beta + \mathcal{E}, \tag{1.1}
\]
with
\[
E(\mathcal{E}) = 0, \qquad \mathrm{Var}(\mathcal{E}) = \sigma^2 \mathbf{I},
\]
where \(\mathbf{I}\) is an \(n \times n\) identity matrix.
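For concreteness, a minimal R sketch that simulates data from this model; the sample size, coefficients, and noise level are arbitrary illustration values:

## Simulate from Y = X beta + E with E ~ (0, sigma^2 I).
## n, beta, and sigma are arbitrary illustration values.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
beta  <- c(1, 2, -0.5)          # (beta0, beta1, beta2)
sigma <- 1
X <- cbind(1, x1, x2)           # design matrix with intercept column
Y <- X %*% beta + rnorm(n, sd = sigma)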
[Figure: scatter of observations plotted against the two predictors X1 and X2.]
The least-squares estimator minimizes
\[
Q(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p})\}^2.
\]
Setting its partial derivatives to zero,
\[
\frac{\partial Q(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_0} = 0, \quad
\frac{\partial Q(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_1} = 0, \quad \ldots, \quad
\frac{\partial Q(\beta_0, \beta_1, \ldots, \beta_p)}{\partial \beta_p} = 0,
\]
gives the equations
\[
\begin{aligned}
-2\sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p})\} &= 0,\\
-2\sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p})\}\,x_{i,1} &= 0,\\
&\;\;\vdots\\
-2\sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p})\}\,x_{i,p} &= 0.
\end{aligned}
\]
In matrix form, these are the normal equations
\[
\mathbf{X}^\top \mathbf{X} \beta = \mathbf{X}^\top \mathbf{Y}.
\]
The solution is
\[
\hat\beta = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}.
\]
Note that \(\mathbf{X}^\top \mathbf{X}\) must be of full rank for the inverse to exist, i.e. its rank must be \(p+1\); its rank is the same as that of \(\mathbf{X}\).
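Continuing the simulation sketch above, the closed-form solution can be checked against R's lm():

## Closed-form least squares via the normal equations.
## X and Y are from the simulation sketch above.
beta_hat <- solve(crossprod(X), crossprod(X, Y))   # (X'X)^{-1} X'Y

## lm() should give (numerically) the same coefficients.
fit <- lm(Y ~ x1 + x2)
cbind(normal_eq = drop(beta_hat), lm = coef(fit))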
Random vector
A random vector is written as
\[
\mathbf{Z} = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_p \end{pmatrix}
\quad \text{or} \quad
\mathbf{Z} = (Z_1, Z_2, \cdots, Z_p)^\top.
\]
For random vectors,
\[
E(\mathbf{Z} + \mathbf{Y}) = E\mathbf{Z} + E\mathbf{Y}.
\]
For a linear transformation
\[
\mathbf{W} = A\mathbf{Z},
\]
we have
\[
E\{\mathbf{W}\} = A E\{\mathbf{Z}\}, \qquad
\mathrm{Var}\{\mathbf{W}\} = \mathrm{Var}\{A\mathbf{Z}\} = A\,\mathrm{Var}\{\mathbf{Z}\}\,A^\top.
\]
If \(c\) is a constant vector, then
\[
E(c + A\mathbf{Z}) = c + A E\{\mathbf{Z}\}
\]
and
\[
\mathrm{Var}(c + A\mathbf{Z}) = \mathrm{Var}(A\mathbf{Z}) = A\,\mathrm{Var}\{\mathbf{Z}\}\,A^\top.
\]
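A small Monte Carlo check of the variance identity in R; the matrices A and Sigma below are arbitrary illustration values:

## Check Var(AZ) = A Var(Z) A' by simulation.
set.seed(2)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)   # arbitrary covariance
A     <- matrix(c(1, 0, 1, -1), 2, 2)      # arbitrary transformation
L     <- chol(Sigma)                       # Sigma = L'L, L upper triangular
Z     <- matrix(rnorm(2 * 1e5), ncol = 2) %*% L   # rows ~ (0, Sigma)
W     <- Z %*% t(A)                        # each row of W is A z_i
cov(W)                  # empirical Var(W)
A %*% Sigma %*% t(A)    # theoretical A Sigma A'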
Write
\[
E\mathbf{Z} = b, \qquad \mathrm{Var}(\mathbf{Z}) = \Sigma.
\]
If \(\mathbf{Z}\) follows the multivariate normal distribution, we write
\[
\mathbf{Z} \sim N(b, \Sigma).
\]
The probability density function is
\[
f_{\mathbf{Z}}(z_1, \ldots, z_p) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (z - b)^\top \Sigma^{-1} (z - b) \right\}.
\]
Lemma
If \(\mathbf{Z} \sim N(b, \Sigma)\), then for any constant vector \(\ell\) and scalar \(c\),
\[
\ell^\top \mathbf{Z} + c \sim N(\ell^\top b + c,\; \ell^\top \Sigma \ell).
\]
Example
Suppose \((Z_1, \ldots, Z_p)^\top \sim N(b, \Sigma)\), where
\[
b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \ldots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \ldots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \ldots & \sigma_{pp}
\end{pmatrix}.
\]
Then
\[
\begin{aligned}
Z_k &\sim N(b_k, \sigma_{kk}),\\
Z_k + Z_j &\sim N(b_k + b_j,\; \sigma_{kk} + 2\sigma_{jk} + \sigma_{jj}),\\
Z_k - Z_j &\sim N(b_k - b_j,\; \sigma_{kk} - 2\sigma_{jk} + \sigma_{jj}).
\end{aligned}
\]
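A brief simulation of the \(Z_k - Z_j\) case; it assumes the MASS package (shipped with R) for multivariate normal draws:

## Z_k - Z_j ~ N(b_k - b_j, sigma_kk - 2 sigma_jk + sigma_jj);
## compare simulated moments with the formula (k = 1, j = 2 here).
library(MASS)  # for mvrnorm()
set.seed(3)
b     <- c(1, 2)
Sigma <- matrix(c(3, 1, 1, 2), 2, 2)
Z     <- mvrnorm(1e5, mu = b, Sigma = Sigma)
D     <- Z[, 1] - Z[, 2]
c(mean = mean(D), var = var(D))                        # simulated
c(mean = b[1] - b[2],
  var  = Sigma[1, 1] - 2 * Sigma[1, 2] + Sigma[2, 2])  # theoretical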
If \(X_1, \ldots, X_n\) are i.i.d. with mean \(\mu\) and variance \(\Sigma\), then
\[
\sqrt{n}\,(\bar{X} - \mu) \to N(0, \Sigma)
\]
in distribution as \(n \to \infty\), and
\[
\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)(X_i - \mu)^\top \to \Sigma.
\]
Example
In the notation above, we have
\[
\mathbf{X}^\top \mathbf{X} = \sum_{i=1}^{n} X_i X_i^\top.
\]
Thus, by the law of large numbers, \(n^{-1} \mathbf{X}^\top \mathbf{X} \to E(X_i X_i^\top)\) as \(n \to \infty\).
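A one-line check of this identity on the simulated design matrix from the earlier sketch:

## X'X equals the sum of the outer products of the rows of X.
sum_outer <- Reduce(`+`, lapply(seq_len(nrow(X)), function(i) tcrossprod(X[i, ])))
max(abs(crossprod(X) - sum_outer))   # ~ 0 up to numerical error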
Fitted values
\[
\hat{\mathbf{Y}} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y},
\]
i.e. \(\hat{\mathbf{Y}}\) is a linear function of \(\mathbf{Y}\).
\(\sigma^2\) can be estimated by \(\hat\sigma^2 = \mathrm{RSS}/(n - p - 1)\), where
\[
\mathrm{RSS} = \sum_{i=1}^{n} \{Y_i - X_i^\top \hat\beta\}^2.
\]
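Continuing the simulated fit from above, these quantities can be computed by hand and checked against lm():

## Hat matrix, fitted values, RSS, and sigma-hat, computed by hand.
H      <- X %*% solve(crossprod(X)) %*% t(X)   # projection ("hat") matrix
Y_hat  <- H %*% Y
RSS    <- sum((Y - Y_hat)^2)
sigma2 <- RSS / (nrow(X) - ncol(X))            # n - p - 1 = n - ncol(X)
c(by_hand = sqrt(sigma2), lm = summary(fit)$sigma)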
Assume further that
\[
\mathcal{E} \sim N(0, \sigma^2 \mathbf{I}). \tag{3.1}
\]
Let \(c_{k+1,k+1}\) denote the \((k+1)\)-th diagonal entry of \((\mathbf{X}^\top \mathbf{X})^{-1}\). Then:
\(\sqrt{c_{k+1,k+1} \hat\sigma^2}\) is called the standard error of \(\hat\beta_k\);
\(t_k = \hat\beta_k / \sqrt{c_{k+1,k+1} \hat\sigma^2}\) is called the t-statistic for \(\hat\beta_k\);
\(P(|T| > |t_k|)\) is the p-value for \(\hat\beta_k\), where \(T \sim t(n - p - 1)\).
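A sketch that reproduces the coefficient table by hand, using the quantities computed in the earlier snippets:

## Coefficient table by hand: SE, t, and p-value for each beta_k.
C  <- solve(crossprod(X))                       # (X'X)^{-1}
se <- sqrt(diag(C) * sigma2)                    # standard errors
tk <- drop(beta_hat) / se                       # t-statistics
pk <- 2 * pt(-abs(tk), df = nrow(X) - ncol(X))  # two-sided p-values
cbind(est = drop(beta_hat), se = se, t = tk, p = pk)
summary(fit)$coefficients                       # should match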
Example
For (data A), there are 5 predictors \(x_1, \ldots, x_5\) and one response \(Y\); we fit the model
\[
Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_5 x_5 + \varepsilon
\]
or approximately
\[
\Big[\underbrace{\widehat{EY} - 1.96\,\hat\sigma \sqrt{X^\top (\mathbf{X}^\top \mathbf{X})^{-1} X}}_{\text{lower bound}},\;
\underbrace{\widehat{EY} + 1.96\,\hat\sigma \sqrt{X^\top (\mathbf{X}^\top \mathbf{X})^{-1} X}}_{\text{upper bound}}\Big].
\]
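A minimal R sketch of this interval at a new design point; x0 below is an arbitrary illustration point, and beta_hat, sigma2, C come from the earlier snippets:

## 95% CI for E(Y) at a new point x0, from the formula above.
x0   <- c(1, 0.5, -1)            # (intercept, x1, x2), arbitrary values
EY   <- drop(x0 %*% beta_hat)    # point estimate X'beta_hat
half <- 1.96 * sqrt(sigma2) * sqrt(drop(t(x0) %*% C %*% x0))
c(lower = EY - half, fit = EY, upper = EY + half)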
[Figure 3.2: The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2; the projection ŷ represents the vector of the least squares predictions.]
R function: lm(...)
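For instance, a minimal usage sketch (the data frame dat reuses the simulated variables from above):

## Fit Y ~ x1 + x2 on a data frame and inspect the usual summaries.
dat <- data.frame(Y = drop(Y), x1 = x1, x2 = x2)
fit <- lm(Y ~ x1 + x2, data = dat)
summary(fit)     # coefficients, SEs, t-statistics, p-values
coef(fit)        # estimated beta
fitted(fit)      # Y-hat
residuals(fit)   # Y minus Y-hat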
Example continued. Suppose we have a new data set (data B), in which the predictors are observed for 5 subjects. Based on the above formula, for the second subject we predict the expected value of \(Y\) as
\[
\widehat{EY} = -0.4839,
\]
and its (i.e. \(EY\)'s) 95% CI is \([-0.9732,\ 0.0053]\).
Or, by R; see code (code01B.R).
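The same interval can be obtained with predict(); since data B itself is not shown here, the data frame newB below is a placeholder with the right column names:

## Confidence interval for E(Y) at new observations via predict().
## newB stands in for data B (5 new subjects, predictors assumed).
newB <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
predict(fit, newdata = newB, interval = "confidence", level = 0.95)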
Confidence band
For (data C), there are 50 observations, each with predictor \(X\) and response \(Y\). We fit a linear regression model:
\[
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.
\]
[Figure: Confidence band. Scatter plot of y against x for the 50 observations.]
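A sketch of how such a band can be drawn in base R; the simulated x and y below stand in for data C:

## Fit a simple linear regression and draw a pointwise 95% confidence band.
set.seed(4)
x  <- runif(50, 1, 5)                   # stand-in for data C
y  <- 1 + 0.8 * x + rnorm(50, sd = 0.5)
fm <- lm(y ~ x)
grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
band <- predict(fm, newdata = grid, interval = "confidence")
plot(x, y)
matlines(grid$x, band, lty = c(1, 2, 2), col = c(1, 2, 2))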
Subset selection
Best-subset selection
With \(p\) predictors there are \(2^p\) candidate models:
\[
\begin{aligned}
\text{model } 0 &: Y = a_0 + \varepsilon\\
\text{model } 1 &: Y = a_{1,0} + a_{1,1} x_1 + \varepsilon\\
&\;\;\vdots\\
\text{model } p &: Y = a_{p,0} + a_{p,1} x_p + \varepsilon\\
\text{model } p+1 &: Y = a_{p+1,0} + a_{p+1,1} x_1 + a_{p+1,2} x_2 + \varepsilon\\
&\;\;\vdots\\
\text{model } 2^p - 1 &: Y = \phi_0 + \phi_1 x_1 + \cdots + \phi_p x_p + \varepsilon.
\end{aligned}
\]
For any two models A and B, if A is a sub-model (also called a simplified model) of B, then
\[
\mathrm{RSS}(A) \ge \mathrm{RSS}(B).
\]
Why? Consider a simple case, where
\[
(A): y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i, \qquad
(B): y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i.
\]
Then
\[
\begin{aligned}
\mathrm{RSS}(B) &= \min_{\beta_0, \beta_1, \beta_2} \sum_{i=1}^{n} \{Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}\}^2\\
&\le \min_{\beta_0, \beta_1,\, \beta_2 = 0} \sum_{i=1}^{n} \{Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}\}^2\\
&= \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \{Y_i - \beta_0 - \beta_1 x_{i1}\}^2 = \mathrm{RSS}(A).
\end{aligned}
\]
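A quick R illustration that a sub-model can never fit better; the simulated data below are arbitrary:

## RSS of a sub-model is never smaller than RSS of the larger model.
set.seed(5)
u1 <- rnorm(60); u2 <- rnorm(60)
yu <- 1 + u1 + rnorm(60)                     # u2 is pure noise here
rssA <- sum(residuals(lm(yu ~ u1))^2)        # sub-model A
rssB <- sum(residuals(lm(yu ~ u1 + u2))^2)   # larger model B
c(rssA = rssA, rssB = rssB)
rssA >= rssB   # TRUE: the sub-model cannot fit better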
For example, using the data (data01B01.dat) and fitting the sequence of models
\[
\text{model } q: \; Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q + \varepsilon,
\]
the fitted errors, i.e. RSS, are shown below; see (code01Ba.R).
[Figure: RSS plotted against the number of predictors q.]
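The RSS sequence in the figure can be reproduced along these lines; the exact format of data01B01.dat is not shown here, so the read step and column names are assumptions:

## RSS of nested models with the first q predictors, q = 1, ..., p.
## Assumes data01B01.dat loads as a data frame with response Y and
## predictor columns x1, ..., xp (format assumed for illustration).
dat <- read.table("data01B01.dat", header = TRUE)
p   <- ncol(dat) - 1
rss <- sapply(1:p, function(q) {
  f <- reformulate(paste0("x", 1:q), response = "Y")  # Y ~ x1 + ... + xq
  sum(residuals(lm(f, data = dat))^2)
})
plot(1:p, rss, type = "b", xlab = "number of predictors q", ylab = "RSS")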
Shrinkage methods