Regularized regression models
Sun Baoluo
Spring 2023
Ridge regression
Y = β1 x1 + ... + βp xp + ε
A simple way to fix the invertibility issue is to add a diagonal matrix to
X⊤X, i.e. to work with X⊤X + λI, where λ ≥ 0 and I is a p × p identity
matrix. The ridge regression estimator is then (provided the inverse exists,
which is guaranteed for λ > 0)

β̂ridge = (X⊤X + λI)⁻¹X⊤Y.
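A minimal R sketch of this closed form on simulated data (the design matrix,
the true coefficients, and the value of λ below are illustrative, not taken
from the course data):

## Ridge estimator via the closed form (X'X + lambda*I)^{-1} X'Y
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)          # simulated design matrix
beta <- c(1, -0.5, 0, 0, 2)              # illustrative true coefficients
Y <- drop(X %*% beta + rnorm(n))

ridge_est <- function(X, Y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
}
lambda <- 2                              # illustrative tuning parameter
cbind(LSE = drop(ridge_est(X, Y, 0)), ridge = drop(ridge_est(X, Y, lambda)))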
Ridge regression

[Figure: ridge coefficient estimates ("beta", roughly −0.4 to 0.4) shown in
three panels, plotted against the tuning parameter on different horizontal
scales.]
The bias is

Bias(β̂ridge) = E(β̂ridge) − β = −λ(X⊤X + λI)⁻¹β.

Recall that Bias(β̂LSE) = 0.
Variance-covariance matrix of β̂ridge: if Var(ε) = σ²In, then (prove it)

Var(β̂ridge) = σ²(X⊤X + λI)⁻¹X⊤X(X⊤X + λI)⁻¹.

Recall that

Var(β̂LSE) = σ²(X⊤X)⁻¹.
Hence

∥Bias(β̂LSE)∥ ≤ ∥Bias(β̂ridge)∥

but

Var(β̂LSE) ≥ Var(β̂ridge)

in the sense that Var(β̂LSE) − Var(β̂ridge) is nonnegative definite.
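These formulas can be checked numerically. The sketch below (the same
illustrative simulated design as before, with σ² = 1 assumed) evaluates the
bias and both covariance matrices and verifies the variance ordering:

## Bias and variance-covariance of the ridge estimator; sigma^2 = 1 assumed
set.seed(1); n <- 50; p <- 5; sigma2 <- 1
X <- matrix(rnorm(n * p), n, p); beta <- c(1, -0.5, 0, 0, 2); lambda <- 2

XtX <- crossprod(X)
A   <- solve(XtX + lambda * diag(p))      # (X'X + lambda I)^{-1}
bias_ridge <- -lambda * A %*% beta        # Bias(beta_ridge)
var_ridge  <- sigma2 * A %*% XtX %*% A    # Var(beta_ridge)
var_lse    <- sigma2 * solve(XtX)         # Var(beta_LSE)

## all eigenvalues of Var(beta_LSE) - Var(beta_ridge) should be >= 0
min(eigen(var_lse - var_ridge, symmetric = TRUE)$values)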
To compare the estimators overall, consider

E∥µ̂ − µ∥², or equivalently E{(µ̂ − µ)⊤(µ̂ − µ)},

denoted by MSE(µ̂).
Lemma

E∥µ̂ − µ∥² = trace[E{(µ̂ − µ)(µ̂ − µ)⊤}]

Indeed, (µ̂ − µ)(µ̂ − µ)⊤ is the p × p matrix

  (µ̂1 − µ1)²             (µ̂1 − µ1)(µ̂2 − µ2)   ...   (µ̂1 − µ1)(µ̂p − µp)
  (µ̂2 − µ2)(µ̂1 − µ1)     (µ̂2 − µ2)²            ...   (µ̂2 − µ2)(µ̂p − µp)
       ...                      ...              ...         ...
  (µ̂p − µp)(µ̂1 − µ1)     (µ̂p − µp)(µ̂2 − µ2)   ...   (µ̂p − µp)²

whose diagonal entries sum to ∥µ̂ − µ∥²; taking expectations gives

E{∥µ̂ − µ∥²} = tr[E{(µ̂ − µ)(µ̂ − µ)⊤}].
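A two-line numeric illustration of the identity ∥v∥² = trace(vv⊤) that drives
the lemma (the vector v below is arbitrary and stands in for µ̂ − µ):

v <- c(0.3, -1.2, 0.8)        # stands in for muhat - mu
sum(v^2)                      # ||v||^2
sum(diag(v %*% t(v)))         # trace of the outer product: the same value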
Lemma

MSE = Variance + ∥Bias∥², i.e.

E∥µ̂ − µ∥² = trace{Var(µ̂)} + ∥Bias∥²,

or, in matrix form,

E{(µ̂ − µ)(µ̂ − µ)⊤} = Var(µ̂) + Bias Bias⊤.

Proof: write µ̂ − µ = (µ̂ − Eµ̂) + Bias. Expanding the outer product and taking
expectations, the cross terms vanish since E(µ̂ − Eµ̂) = 0, which gives the
matrix identity; taking the trace and using the previous lemma gives the
scalar version.
Theorem

There exists a λ > 0 such that MSE(β̂ridge) < MSE(β̂LSE).
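A short R illustration of the theorem using the decomposition above:
MSE(λ) = trace{Var(β̂ridge)} + ∥Bias(β̂ridge)∥² is evaluated on a grid (the
design, coefficients, and σ² = 1 are the same illustrative choices as before);
for this simulated example the minimum falls below the value at λ = 0, i.e.
below MSE(β̂LSE):

## MSE(lambda) = trace{Var} + ||Bias||^2 for the ridge estimator, over a grid
set.seed(1); n <- 50; p <- 5; sigma2 <- 1
X <- matrix(rnorm(n * p), n, p); beta <- c(1, -0.5, 0, 0, 2)
XtX <- crossprod(X)

mse_ridge <- function(lambda) {
  A <- solve(XtX + lambda * diag(p))
  sum(diag(sigma2 * A %*% XtX %*% A)) + sum((lambda * A %*% beta)^2)
}
lambdas <- seq(0, 10, by = 0.1)
mse <- sapply(lambdas, mse_ridge)
c(at_zero = mse[1], minimum = min(mse), best_lambda = lambdas[which.min(mse)])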
For ridge regression the fitted values are linear in Y:

Ŷ = Xβ̂ridge = X(X⊤X + λI)⁻¹X⊤Y = SY

(S is called the smoother of Y). Smoothers put the hat on Y; S is also called
the hat matrix. In ordinary least squares, recall that the hat matrix is

S = X(X⊤X)⁻¹X⊤.
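As a sketch, the two smoother matrices and their traces can be computed
directly (same illustrative simulated data as above):

## Hat/smoother matrices for least squares and ridge, and their traces
set.seed(1); n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p); Y <- drop(X %*% c(1, -0.5, 0, 0, 2) + rnorm(n))
lambda <- 2

S_ols   <- X %*% solve(crossprod(X)) %*% t(X)                     # X (X'X)^{-1} X'
S_ridge <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)  # X (X'X + lambda I)^{-1} X'
Yhat    <- S_ridge %*% Y                                          # Yhat = S Y

sum(diag(S_ols))      # = p for ordinary least squares
sum(diag(S_ridge))    # < p once lambda > 0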
Penalty approach

Penalty approach: recall that the LS estimator minimizes

Σᵢ₌₁ⁿ (Yi − Xi⊤β)².

Ridge regression instead minimizes the penalized criterion

Σᵢ₌₁ⁿ (Yi − Xi⊤β)² + λ∥β∥²,

whose minimizer is exactly β̂ridge = (X⊤X + λI)⁻¹X⊤Y.
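A hedged sketch of the penalty view in R: the penalized criterion is minimized
numerically with optim() and compared with the closed-form ridge solution
(simulated data again; agreement is up to the optimizer's tolerance):

## Minimize sum (Yi - Xi'beta)^2 + lambda ||beta||^2 numerically and compare
## with the closed-form ridge estimator
set.seed(1); n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p); Y <- drop(X %*% c(1, -0.5, 0, 0, 2) + rnorm(n))
lambda <- 2

penalized_rss <- function(b) sum((Y - X %*% b)^2) + lambda * sum(b^2)
fit_num <- optim(rep(0, p), penalized_rss, method = "BFGS")

beta_closed <- drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, Y)))
cbind(numerical = fit_num$par, closed_form = beta_closed)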
Constraint approach

Alternatively, minimize

∥Y − Xβ∥², subject to ∥β∥² ≤ t.

Denote the resulting estimator by β̂ridge[t].
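The two formulations correspond: as λ increases, ∥β̂ridge(λ)∥² decreases, so
each constraint level t is matched by some λ. A small numerical illustration
of this correspondence (same simulated data; an illustration, not a proof):

## ||beta_ridge(lambda)||^2 shrinks as lambda grows
set.seed(1); n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p); Y <- drop(X %*% c(1, -0.5, 0, 0, 2) + rnorm(n))

sapply(c(0, 0.5, 2, 10, 50), function(lambda) {
  b <- solve(crossprod(X) + lambda * diag(p), crossprod(X, Y))
  sum(b^2)                     # squared norm of the ridge solution
})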
Delete-one-out expression

We have

β̂LSE = (X⊤X)⁻¹X⊤Y = (Σᵢ₌₁ⁿ XiXi⊤)⁻¹ Σᵢ₌₁ⁿ XiYi

for the model Yi = Xi⊤β + εi, i = 1, 2, ..., n, where Xi⊤ denotes the i-th
row of X.
¹Here, trace(S) is the trace of the matrix S, i.e., the sum of the elements on
its main diagonal.
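One way to make the delete-one-out idea concrete in R: leave-one-out CV for
ridge computed by brute-force refitting, and again via the standard
linear-smoother identity Yi − Ŷ(−i),i = (Yi − Ŷi)/(1 − Sii), which avoids
refitting. (That identity is a standard fact for smoothers of this form; it is
stated here as background, not as the slide's own derivation.)

## Leave-one-out CV for ridge: brute force vs. the smoother-matrix shortcut
set.seed(1); n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p); Y <- drop(X %*% c(1, -0.5, 0, 0, 2) + rnorm(n))
lambda <- 2

ridge <- function(X, Y, lambda)
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))

## brute force: refit n times, leaving out one observation each time
cv_brute <- mean(sapply(seq_len(n), function(i) {
  b <- ridge(X[-i, , drop = FALSE], Y[-i], lambda)
  (Y[i] - drop(X[i, ] %*% b))^2
}))

## shortcut through S = X (X'X + lambda I)^{-1} X'
S <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
cv_short <- mean(((Y - drop(S %*% Y)) / (1 - diag(S)))^2)

c(cv_brute, cv_short)          # equal up to rounding error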
Selection of λ

Example: for the diabetes data, the CV value changes with λ as shown below
(R code for the calculation: code02Ac.R).
[Figure: cross-validation value ("cv") plotted against λ ("lambda", from 0 to
10) for the diabetes data.]
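The file code02Ac.R is not reproduced here. As one practical alternative (an
assumption, not the course's own code), ridge with cross-validated λ can be
run through the glmnet package, where alpha = 0 selects the ridge penalty;
note that glmnet scales and parameterizes λ differently from the closed-form
expression above, so the numerical values are not directly comparable. The
data below are a simulated stand-in for the diabetes data:

## Ridge with k-fold cross-validation via glmnet (alpha = 0 gives the ridge penalty)
library(glmnet)
set.seed(1); n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X %*% rnorm(p) + rnorm(n))     # simulated stand-in for the diabetes data

cvfit <- cv.glmnet(X, Y, alpha = 0)
plot(cvfit)                              # CV curve against log(lambda)
cvfit$lambda.min                         # lambda minimizing the CV error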
Selection of λ

Example (cell classification based on genes): for the leukemia gene expression
data DMleukemia1.dat, there are 25 cells with 250 genes. The cells come from
two types, denoted by 1 and 0 (or +1 and −1), respectively.

By CV, the ridge parameter λ = 2.48832e−06 is selected, as shown below.
[Figure: log(cv) (roughly −30 to −10) plotted against log(lambda) for the
leukemia data.]
Selection of λ
The estimation path
[Figure: estimation path of the ridge coefficients ("BETA", roughly −0.02 to
0.02) as the tuning parameter varies.]
Selection of λ

[Figure: true ("true") and fitted ("classified") class labels, taking values
−1 and 1, plotted against the predicted values.]
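A hedged sketch of how such a classification can be produced with ridge
regression on ±1 labels: fit on the labels, then classify each cell by the
sign of its predicted value. The data below are simulated with the same
dimensions as DMleukemia1.dat (25 cells, 250 genes); this is an in-sample
illustration only, not a reproduction of the course analysis:

## Ridge regression on +/-1 labels; classify by the sign of the prediction
set.seed(1); n <- 25; p <- 250
true_class <- rep(c(-1, 1), length.out = n)
X <- matrix(rnorm(n * p), n, p)
X[, 1:10] <- X[, 1:10] + true_class      # the first 10 "genes" carry the class signal
lambda <- 1                              # illustrative; CV would choose it in practice

beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, true_class))
pred <- drop(X %*% beta_ridge)           # predicted values
classified <- ifelse(pred > 0, 1, -1)    # classify by sign
table(true = true_class, classified = classified)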