
Regularized regression models

Sun Baoluo

Dept. of Statistics and Data Science

Spring 2023


Ridge regression

The “badly conditioned linear regression problem” (Hoerl and Kennard, 1970) has long been important in statistics and computer science. The issue is intrinsic to high-dimensional data, such as most genetic data.
The technique of ridge regression (RR) is one of the most popular and best-performing (Frank and Friedman, 1993) alternatives to ordinary least squares (LS).


Ridge regression

The basic requirement for least squares estimation (LSE) is that the inverse (X⊤X)−1 exists.
There are at least two cases in which the inverse does not exist: (1) p > n, and (2) collinearity, i.e. the columns of X are linearly dependent.
Hereafter in this chapter, both X and Y are centered and standardized, i.e., each has mean zero and variance one.


Ridge regression

Y = β1 x1 + ... + βp xp + ε

A simple way to resolve the invertibility issue is to add a diagonal matrix to X⊤X, i.e. to use X⊤X + λI, where I is the p × p identity matrix. The ridge regression estimator is then

β̂ridge = (X⊤X + λI)−1 X⊤Y

where λ > 0 is a tuning parameter that needs to be chosen (how? any ideas?). To keep the notation clear, denote the LS estimator by

β̂LSE = (X⊤X)−1 X⊤Y

(provided it exists).
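For concreteness, here is a minimal R sketch of the closed-form ridge estimator above, assuming a centered and standardized design matrix X and response y (the function name ridge_fit and the simulated data are ours):

# Closed-form ridge estimator: (X'X + lambda I)^{-1} X'Y
# Assumes X (n x p) and y have already been centered and standardized.
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}

# Small illustration on simulated data
set.seed(1)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- scale(X %*% rep(0.5, p) + rnorm(n))
ridge_fit(X, y, lambda = 1)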


Ridge regression

Notice that the solution is indexed by the parameter λ.

So for each λ, we have a solution.
Hence, λ traces out a path of solutions.
λ is the shrinkage parameter.
λ controls the size of the coefficients.
λ controls the amount of regularization.
As λ ↓ 0, we obtain the least squares solution.
As λ ↑ ∞, β̂ridge → 0.


Ridge regression

n = 442 diabetes patients were measured on 10 baseline variables x1, ..., x10 (see diabetes.csv). The response y is a measure of disease progression one year after baseline. After centering and standardization,

y = β1 x1 + ... + β10 x10 + ε

We obtain the coefficient paths shown below (R code: code02Aa.R); a sketch of the computation follows.
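A minimal sketch of how such a coefficient path can be computed (our own illustration, not necessarily the contents of code02Aa.R; it assumes X and y hold the standardized diabetes design matrix and response):

# Ridge coefficient paths over a grid of lambda values
lambdas <- exp(seq(-5, 10, length.out = 100))
path <- sapply(lambdas, function(l)
  solve(crossprod(X) + l * diag(ncol(X)), crossprod(X, y)))
# One curve per coefficient, plotted against log(lambda)
matplot(log(lambdas), t(path), type = "l",
        xlab = "log(lambda)", ylab = "beta")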


Ridge regression

[Figure: ridge coefficient paths for the diabetes data. Three panels plot beta (roughly −0.4 to 0.4) against log(lambda), log(1/lambda), and ||beta||^2, respectively.]


Properties of ridge regression

β̂ridge is not an unbiased estimator:

E β̂ridge = E{(X⊤X + λI)−1 X⊤Y}
= E{(X⊤X + λI)−1 X⊤(Xβ + ε)}
= (X⊤X + λI)−1 X⊤Xβ + E{(X⊤X + λI)−1 X⊤ε}
= (X⊤X + λI)−1 X⊤Xβ
= β − λ(X⊤X + λI)−1 β,

since E(ε) = 0 and (X⊤X + λI)−1 X⊤X = I − λ(X⊤X + λI)−1.


Properties of ridge regression

The bias is

Bias(β̂ridge) = E β̂ridge − β = −λ(X⊤X + λI)−1 β.

Recall Bias(β̂LSE) = 0.
Variance-covariance matrix of β̂ridge: if Var(ε) = σ²In, then (prove it)

Var(β̂ridge) = (X⊤X + λI)−1 X⊤X (X⊤X + λI)−1 σ²

Recall that

Var(β̂LSE) = (X⊤X)−1 σ²


Properties of ridge regression

Is ridge regression better than LSE?

|Bias(β̂LSE)| ≤ |Bias(β̂ridge)|

but

Var(β̂LSE) ≥ Var(β̂ridge)

(the second inequality holds in the sense that Var(β̂LSE) − Var(β̂ridge) is positive semi-definite for λ > 0).


Properties of ridge regression

How do we evaluate an estimator of a parameter? By its MSE. For any parameter, say µ = (µ1, ..., µp)⊤, the mean squared error (MSE) of an estimator µ̂ is defined as

E∥µ̂ − µ∥²

or, equivalently,

E{(µ̂ − µ)⊤(µ̂ − µ)},

and is denoted by MSE(µ̂).


Properties of ridge regression

Lemma
E∥µ̂ − µ∥2 = trace[E{(µ̂ − µ)(µ̂ − µ)⊤ }]


Properties of ridge regression


Proof: Write

µ̂ − µ = (µ̂1 − µ1, µ̂2 − µ2, ..., µ̂p − µp)⊤.

Thus

∥µ̂ − µ∥² = (µ̂1 − µ1)² + (µ̂2 − µ2)² + · · · + (µ̂p − µp)²,

and (µ̂ − µ)(µ̂ − µ)⊤ is the p × p matrix whose (j, k) entry is (µ̂j − µj)(µ̂k − µk); in particular, its diagonal entries are (µ̂1 − µ1)², (µ̂2 − µ2)², ..., (µ̂p − µp)².


Properties of ridge regression

Comparing the diagonals, it is easy to see that

∥µ̂ − µ∥² = trace{(µ̂ − µ)(µ̂ − µ)⊤},

and taking expectations,

E∥µ̂ − µ∥² = trace[E{(µ̂ − µ)(µ̂ − µ)⊤}].


Properties of ridge regression

Lemma
MSE = Variance + ∥Bias∥²,
i.e.
E∥µ̂ − µ∥² = trace{Var(µ̂)} + ∥Bias∥²
or
E{(µ̂ − µ)(µ̂ − µ)⊤} = Var(µ̂) + Bias Bias⊤


Properties of ridge regression

Proof:

E{(µ̂ − µ)(µ̂ − µ)⊤} = E{(µ̂ − E µ̂ + E µ̂ − µ)(µ̂ − E µ̂ + E µ̂ − µ)⊤}
= E{(µ̂ − E µ̂)(µ̂ − E µ̂)⊤} + E{(µ̂ − E µ̂)(E µ̂ − µ)⊤}
+ E{(E µ̂ − µ)(µ̂ − E µ̂)⊤} + E{(E µ̂ − µ)(E µ̂ − µ)⊤}
= E{(µ̂ − E µ̂)(µ̂ − E µ̂)⊤} + (E µ̂ − µ)(E µ̂ − µ)⊤
= Var(µ̂) + Bias Bias⊤,

where the two cross terms vanish because E(µ̂ − E µ̂) = 0.


Properties of ridge regression

Theorem
There exists a λ ≥ 0, such that

E∥β̂ridge − β∥2 ≤ E∥β̂LSE − β∥2
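A small Monte Carlo sketch illustrating the theorem (the design, the coefficient vector, and the fixed λ are our own choices):

# Compare E||beta_hat - beta||^2 for LSE and ridge by simulation
set.seed(2)
n <- 50; p <- 10; beta <- rep(1, p); sigma <- 3; lambda <- 5
X <- scale(matrix(rnorm(n * p), n, p) + rnorm(n))  # correlated columns
mse_lse <- mse_ridge <- 0
for (r in 1:500) {
  y <- X %*% beta + sigma * rnorm(n)
  b_lse   <- solve(crossprod(X), crossprod(X, y))
  b_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
  mse_lse   <- mse_lse   + sum((b_lse   - beta)^2) / 500
  mse_ridge <- mse_ridge + sum((b_ridge - beta)^2) / 500
}
c(LSE = mse_lse, ridge = mse_ridge)  # ridge is typically smaller here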


Properties of ridge regression

Smoother matrices and effective degrees of freedom

Often the fitted values can be written as

Ŷ = SY

and S is called a smoother (of Y).
Because smoothers put the hat on Y, S is also called the hat matrix.
In ordinary least squares, recall that the hat matrix is

S = X(X⊤X)−1X⊤

If rank(X) = p, we have trace(S) = p, the number of degrees of freedom used in the model.


Properties of ridge regression

In ridge regression, the fits are given by:

Ŷ = X(X⊤ X + λIp )−1 X⊤ Y

The smoother or hat matrix in ridge regression takes the form:

S = X(X⊤ X + λIp )−1 X⊤

So the effective degrees of freedom in ridge regression are


given by:

df (S) = trace[X(X⊤ X + λIp )−1 X⊤ ] < p

when λ > 0. Thus λ effectively penalizes (reduces) the degrees of freedom of the regression, as computed in the sketch below.
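A minimal sketch that computes the effective degrees of freedom as a function of λ (assuming X is the standardized design matrix):

# Effective degrees of freedom: trace of the ridge smoother matrix
ridge_df <- function(X, lambda) {
  p <- ncol(X)
  S <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
  sum(diag(S))
}
# df decreases from p (at lambda = 0, provided X'X is invertible) towards 0
sapply(c(0, 0.1, 1, 10, 100), function(l) ridge_df(X, l))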


Penalty approach
Penalty approach: Recall that LS estimation minimizes

Σ_{i=1}^n (Yi − Xi⊤β)².

To penalize the size of β, we can instead estimate β by minimizing

Σ_{i=1}^n (Yi − Xi⊤β)² + λ(β1² + ... + βp²),

or, in matrix form,

∥Y − Xβ∥² + λ∥β∥².

Setting the derivative with respect to β to zero gives −2X⊤(Y − Xβ) + 2λβ = 0, i.e. (X⊤X + λI)β = X⊤Y, so the minimizer is

β̂ridge = (X⊤X + λI)−1 X⊤Y.

Denote it by β̂ridge(λ). A numerical check is sketched below.
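A quick numerical check of this claim (a sketch, assuming X, y, and lambda are already defined): the penalized criterion is minimized by the closed-form estimator.

# Minimize ||Y - X beta||^2 + lambda ||beta||^2 numerically and compare to the closed form
obj <- function(b) sum((y - X %*% b)^2) + lambda * sum(b^2)
b_num    <- optim(rep(0, ncol(X)), obj, method = "BFGS")$par
b_closed <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
max(abs(b_num - b_closed))  # should be essentially zero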

Constraint approach

If the βj are unconstrained, they can explode and are susceptible to very high variance. Thus, we need to regularize the coefficients.
We might impose the ridge constraint: minimize

Σ_{i=1}^n (Yi − Xi⊤β)², subject to (s.t.) Σ_{j=1}^p βj² ≤ t,

or

∥Y − Xβ∥², s.t. ∥β∥² ≤ t.

Denote the estimator by β̂ridge[t].


Constraint approach

Here, t is also called a tuning parameter:

if t = 0, then β̂ridge[t] = 0;
if t ≥ ∥β̂LSE∥² = t0, then β̂ridge[t] = β̂LSE.
Correspondingly, as λ → ∞, β̂ridge(λ) → 0;
and if λ = 0, then β̂ridge(λ) = β̂LSE.


Constraint approach

The penalty approach and the constraint approach are equivalent, and t and λ have an inverse relationship:

β̂ridge[t] = β̂ridge(λ), if t = ∥β̂ridge(λ)∥².


Delete-one-out expression
We have

β̂LSE = (X⊤X)−1X⊤Y = (Σ_{i=1}^n Xi Xi⊤)−1 Σ_{i=1}^n Xi Yi.

Therefore, if observation (Xj, Yj) is removed, the delete-one-out LS estimator is

β̂LSE^{(j)} = (Σ_{i≠j} Xi Xi⊤)−1 Σ_{i≠j} Xi Yi.

Similarly, the delete-one-out (delete the j-th observation) ridge estimator is

β̂ridge^{(j)} = (Σ_{i≠j} Xi Xi⊤ + λI)−1 Σ_{i≠j} Xi Yi.


Selection of tuning parameter

Suppose we have observations (X1, Y1), ..., (Xn, Yn) and consider the linear regression model

Yi = Xi⊤β + εi, i = 1, 2, ..., n.

For each λ, we have a different ridge regression estimator for the model:

β̂ridge(λ) = (X⊤X + λI)−1X⊤Y.

The best λ is the one that minimizes the MSE,

MSE(λ) = E∥β̂ridge(λ) − β∥².


Selection of λ via Cross-Validation (CV)


We select a wide range of possible λ values, say [0, c]. For each fixed λ in [0, c], the CV criterion is computed as follows. For each j,

β̂ridge^{(j)}(λ) = (Σ_{i≠j} Xi Xi⊤ + λI)−1 Σ_{i≠j} Xi Yi.

The prediction error for (Xj, Yj) is

errj(λ) = (Yj − Xj⊤ β̂ridge^{(j)}(λ))².

The CV value is then

CV(λ) = n−1 Σ_{j=1}^n errj(λ).

The best λ is the minimizer of CV(λ); a sketch of the computation follows.
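A minimal leave-one-out CV sketch, assuming X and y hold the standardized data (the grid of λ values is our own choice):

# Leave-one-out cross-validation for the ridge tuning parameter
loocv_ridge <- function(X, y, lambda) {
  n <- nrow(X); p <- ncol(X)
  errs <- sapply(1:n, function(j) {
    Xj  <- X[-j, , drop = FALSE]
    b_j <- solve(crossprod(Xj) + lambda * diag(p), crossprod(Xj, y[-j]))
    (y[j] - drop(X[j, ] %*% b_j))^2
  })
  mean(errs)
}
lambdas <- seq(0.01, 10, length.out = 50)
cv_vals <- sapply(lambdas, function(l) loocv_ridge(X, y, l))
lambdas[which.min(cv_vals)]  # CV-selected lambda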


Selection of λ via GCV

For a given λ, ridge regression fits the observations as

Ŷ = Xβ̂ridge = X(X⊤X + λI)−1X⊤Y = SY,

where S = X(X⊤X + λI)−1X⊤.
The GCV criterion is then

GCV(λ) = n−1(Y − Ŷ)⊤(Y − Ŷ) / (1 − trace(S)/n)².

The best λ is the minimizer of GCV(λ).¹ A sketch of the computation follows the footnote.

¹ Here, trace(S) is the trace of the matrix S, i.e. the sum of the elements on its main diagonal.
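A minimal GCV sketch under the same assumptions as the CV example above (reusing the lambdas grid):

# Generalized cross-validation for the ridge tuning parameter
gcv_ridge <- function(X, y, lambda) {
  n <- nrow(X); p <- ncol(X)
  S <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
  resid <- y - S %*% y
  (sum(resid^2) / n) / (1 - sum(diag(S)) / n)^2
}
gcv_vals <- sapply(lambdas, function(l) gcv_ridge(X, y, l))
lambdas[which.min(gcv_vals)]  # GCV-selected lambda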

Selection of λ via Other methods

Question: state the selection procedures based on AIC, BIC, and 5-fold CV.
Direct selection of the tuning parameter t in the constraint approach is not convenient.


Selection of λ
Example: for the diabetes data, the CV value changes with λ as shown below (R code for the calculation: code02Ac.R).

[Figure: CV(λ) plotted against λ ∈ [0, 10] for the diabetes data; the CV values lie roughly between 0.5024 and 0.5027.]


Selection of λ
Example (cell classification based on gene expression): consider the leukemia gene expression data DMleukemia1.dat. There are 25 cells with 250 genes each; they come from two types of cells, labeled 1 and 0 (or, equivalently, +1 and −1).
By CV, the ridge parameter λ = 2.48832e−06 is selected, as shown below.


[Figure: log(CV) plotted against log(λ) for the leukemia data; log(CV) ranges roughly from −30 to −10 over log(λ) from −25 to −5.]


Selection of λ
The estimation path

[Figure: ridge coefficient paths (BETA, roughly between −0.02 and 0.02) plotted over the tuning-parameter range 0.000 to 0.025 for the leukemia data.]
(Except at t = 0, none of the coefficients is exactly zero.)



Selection of λ

To check the model fitted by ridge regression, a new experiment was carried out and the resulting data, DMleukemia2.dat, were used as testing data.
The prediction error for the new data set based on ridge regression is 0.1568467. The figure below shows that a very accurate classification is obtained for the new data (R code: code02Ad.R); a prediction sketch follows.
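A minimal sketch of this test-set prediction (our own illustration, not necessarily the contents of code02Ad.R; it assumes Xtrain/ytrain and Xtest/ytest hold the standardized training and testing data with ±1 class labels):

# Fit ridge on the training data with the CV-selected lambda, then predict on the test data.
# For p > n it is convenient to use the identity (X'X + lambda I)^{-1} X' = X'(X X' + lambda I)^{-1}.
lambda_cv <- 2.48832e-06
b_hat <- t(Xtrain) %*% solve(tcrossprod(Xtrain) + lambda_cv * diag(nrow(Xtrain)), ytrain)
pred  <- Xtest %*% b_hat
mean((ytest - pred)^2)                  # test prediction error
pred_class <- ifelse(pred > 0, 1, -1)   # classify by the sign of the prediction (±1 labels assumed)
table(truth = ytest, predicted = pred_class)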


Selection of λ

[Figure: true and classified classes (±1) plotted against the predicted values for the test data.]
