
Regularized regression models

Sun Baoluo

Dept. of Statistics and Data Science

Spring 2023


Ridge regression

The “badly conditioned linear regression problem” (Hoerl and Kennard, 1970) has long been important in statistics and computer science. The issue is intrinsic to high-dimensional data, such as most genetic data.
The technique of ridge regression (RR) is one of the most popular and best-performing (Frank and Friedman, 1993) alternatives to ordinary least squares (LS).


Ridge regression

The basic requirement for least squares estimation (LSE) is that the inverse (X⊤X)−1 exists.
There are at least two cases in which the inverse does not exist: (1) p > n, and (2) collinearity, i.e. the columns of X are linearly dependent.
Hereafter in this chapter, both X and Y are centered and standardized, i.e., each has mean zero and variance one.


Ridge regression

Y = β1 x1 + ... + βp xp + ε

A simple way to resolve the invertibility issue is to add a diagonal matrix to X⊤X, i.e. to use X⊤X + λI, where I is the p × p identity matrix. The ridge regression estimator is then

β̂ridge = (X⊤X + λI)−1 X⊤Y

where λ > 0 is a tuning parameter that needs to be chosen (how? any ideas?). To keep the notation clear, denote the LS estimator by

β̂LSE = (X⊤X)−1 X⊤Y

(provided it exists).
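For concreteness, here is a minimal R sketch of the closed-form ridge estimator above, assuming a centered and standardized design matrix X and response y (the function name ridge_fit and the simulated data are ours):

# Closed-form ridge estimator: (X'X + lambda I)^{-1} X'Y
# Assumes X (n x p) and y have already been centered and standardized.
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}

# Small illustration on simulated data
set.seed(1)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- scale(X %*% rep(0.5, p) + rnorm(n))
ridge_fit(X, y, lambda = 1)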


Ridge regression

Notice that the solution is indexed by the parameter λ.

So for each λ, we have a solution.
Hence, λ traces out a path of solutions.
λ is the shrinkage parameter.
λ controls the size of the coefficients.
λ controls the amount of regularization.
As λ ↓ 0, we obtain the least squares solution.
As λ ↑ ∞, β̂ridge → 0.


Ridge regression

n = 442 diabetes patients were measured on 10 baseline variables x1, ..., x10 (see diabetes.csv). The response y is a measure of disease progression one year after baseline. After centering and standardization,

y = β1 x1 + ... + β10 x10 + ε

We obtain the coefficient paths shown below (R code: code02Aa.R); a sketch of the computation follows.
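A minimal sketch of how such a coefficient path can be computed (our own illustration, not necessarily the contents of code02Aa.R; it assumes X and y hold the standardized diabetes design matrix and response):

# Ridge coefficient paths over a grid of lambda values
lambdas <- exp(seq(-5, 10, length.out = 100))
path <- sapply(lambdas, function(l)
  solve(crossprod(X) + l * diag(ncol(X)), crossprod(X, y)))
# One curve per coefficient, plotted against log(lambda)
matplot(log(lambdas), t(path), type = "l",
        xlab = "log(lambda)", ylab = "beta")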


Ridge regression

[Figure: ridge coefficient paths for the diabetes data. Three panels plot beta (roughly −0.4 to 0.4) against log(lambda), log(1/lambda), and ||beta||^2, respectively.]


Properties of ridge regression

β̂ridge is not an unbiased estimator:

E β̂ridge = E{(X⊤X + λI)−1 X⊤Y}
= E{(X⊤X + λI)−1 X⊤(Xβ + ε)}
= (X⊤X + λI)−1 X⊤Xβ + E{(X⊤X + λI)−1 X⊤ε}
= (X⊤X + λI)−1 X⊤Xβ
= β − λ(X⊤X + λI)−1 β,

since E(ε) = 0 and (X⊤X + λI)−1 X⊤X = I − λ(X⊤X + λI)−1.


Properties of ridge regression

The bias is

Bias(β̂ridge) = E β̂ridge − β = −λ(X⊤X + λI)−1 β.

Recall Bias(β̂LSE) = 0.
Variance-covariance matrix of β̂ridge: if Var(ε) = σ²In, then (prove it)

Var(β̂ridge) = (X⊤X + λI)−1 X⊤X (X⊤X + λI)−1 σ²

Recall that

Var(β̂LSE) = (X⊤X)−1 σ²


Properties of ridge regression

Is ridge regression better than LSE?

|Bias(β̂LSE)| ≤ |Bias(β̂ridge)|

but

Var(β̂LSE) ≥ Var(β̂ridge)

(the second inequality holds in the sense that Var(β̂LSE) − Var(β̂ridge) is positive semi-definite for λ > 0).


Properties of ridge regression

How do we evaluate an estimator of a parameter? By its MSE. For any parameter, say µ = (µ1, ..., µp)⊤, the mean squared error (MSE) of an estimator µ̂ is defined as

E∥µ̂ − µ∥²

or, equivalently,

E{(µ̂ − µ)⊤(µ̂ − µ)},

and is denoted by MSE(µ̂).


Properties of ridge regression

Lemma
E∥µ̂ − µ∥2 = trace[E{(µ̂ − µ)(µ̂ − µ)⊤ }]


Properties of ridge regression


Proof: Write

µ̂ − µ = (µ̂1 − µ1, µ̂2 − µ2, ..., µ̂p − µp)⊤.

Thus

∥µ̂ − µ∥² = (µ̂1 − µ1)² + (µ̂2 − µ2)² + · · · + (µ̂p − µp)²,

and (µ̂ − µ)(µ̂ − µ)⊤ is the p × p matrix whose (j, k) entry is (µ̂j − µj)(µ̂k − µk); in particular, its diagonal entries are (µ̂1 − µ1)², (µ̂2 − µ2)², ..., (µ̂p − µp)².


Properties of ridge regression

Comparing the diagonals, it is easy to see that

∥µ̂ − µ∥² = trace{(µ̂ − µ)(µ̂ − µ)⊤},

and taking expectations,

E∥µ̂ − µ∥² = trace[E{(µ̂ − µ)(µ̂ − µ)⊤}].


Properties of ridge regression

Lemma
MSE = Variance + ∥Bias∥²,
i.e.
E∥µ̂ − µ∥² = trace{Var(µ̂)} + ∥Bias∥²
or
E{(µ̂ − µ)(µ̂ − µ)⊤} = Var(µ̂) + Bias Bias⊤


Properties of ridge regression

Proof:

E{(µ̂ − µ)(µ̂ − µ)⊤} = E{(µ̂ − E µ̂ + E µ̂ − µ)(µ̂ − E µ̂ + E µ̂ − µ)⊤}
= E{(µ̂ − E µ̂)(µ̂ − E µ̂)⊤} + E{(µ̂ − E µ̂)(E µ̂ − µ)⊤}
+ E{(E µ̂ − µ)(µ̂ − E µ̂)⊤} + E{(E µ̂ − µ)(E µ̂ − µ)⊤}
= E{(µ̂ − E µ̂)(µ̂ − E µ̂)⊤} + (E µ̂ − µ)(E µ̂ − µ)⊤
= Var(µ̂) + Bias Bias⊤,

where the two cross terms vanish because E(µ̂ − E µ̂) = 0.


Properties of ridge regression

Theorem
There exists a λ ≥ 0, such that

E∥β̂ridge − β∥2 ≤ E∥β̂LSE − β∥2
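A small Monte Carlo sketch illustrating the theorem (the design, the coefficient vector, and the fixed λ are our own choices):

# Compare E||beta_hat - beta||^2 for LSE and ridge by simulation
set.seed(2)
n <- 50; p <- 10; beta <- rep(1, p); sigma <- 3; lambda <- 5
X <- scale(matrix(rnorm(n * p), n, p) + rnorm(n))  # correlated columns
mse_lse <- mse_ridge <- 0
for (r in 1:500) {
  y <- X %*% beta + sigma * rnorm(n)
  b_lse   <- solve(crossprod(X), crossprod(X, y))
  b_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
  mse_lse   <- mse_lse   + sum((b_lse   - beta)^2) / 500
  mse_ridge <- mse_ridge + sum((b_ridge - beta)^2) / 500
}
c(LSE = mse_lse, ridge = mse_ridge)  # ridge is typically smaller here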


Properties of ridge regression

Smoother matrices and effective degrees of freedom

Often the fitted values can be written as

Ŷ = SY

and S is called a smoother (of Y).
Because smoothers put the hat on Y, S is also called the hat matrix.
In ordinary least squares, recall that the hat matrix is

S = X(X⊤X)−1X⊤

If rank(X) = p, we have trace(S) = p, the number of degrees of freedom used in the model.


Properties of ridge regression

In ridge regression, the fits are given by:

Ŷ = X(X⊤ X + λIp )−1 X⊤ Y

The smoother or hat matrix in ridge regression takes the form:

S = X(X⊤ X + λIp )−1 X⊤

So the effective degrees of freedom in ridge regression are


given by:

df (S) = trace[X(X⊤ X + λIp )−1 X⊤ ] < p

when λ > 0. Thus λ effectively penalizes (reduces) the degrees of freedom of the regression, as computed in the sketch below.
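A minimal sketch that computes the effective degrees of freedom as a function of λ (assuming X is the standardized design matrix):

# Effective degrees of freedom: trace of the ridge smoother matrix
ridge_df <- function(X, lambda) {
  p <- ncol(X)
  S <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
  sum(diag(S))
}
# df decreases from p (at lambda = 0, provided X'X is invertible) towards 0
sapply(c(0, 0.1, 1, 10, 100), function(l) ridge_df(X, l))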


Penalty approach
Penalty approach: Recall that LS estimation minimizes

Σ_{i=1}^n (Yi − Xi⊤β)².

To penalize the size of β, we can instead estimate β by minimizing

Σ_{i=1}^n (Yi − Xi⊤β)² + λ(β1² + ... + βp²),

or, in matrix form,

∥Y − Xβ∥² + λ∥β∥².

Setting the derivative with respect to β to zero gives −2X⊤(Y − Xβ) + 2λβ = 0, i.e. (X⊤X + λI)β = X⊤Y, so the minimizer is

β̂ridge = (X⊤X + λI)−1 X⊤Y.

Denote it by β̂ridge(λ). A numerical check is sketched below.
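A quick numerical check of this claim (a sketch, assuming X, y, and lambda are already defined): the penalized criterion is minimized by the closed-form estimator.

# Minimize ||Y - X beta||^2 + lambda ||beta||^2 numerically and compare to the closed form
obj <- function(b) sum((y - X %*% b)^2) + lambda * sum(b^2)
b_num    <- optim(rep(0, ncol(X)), obj, method = "BFGS")$par
b_closed <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
max(abs(b_num - b_closed))  # should be essentially zero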

Constraint approach

If the βj are unconstrained, they can explode and are susceptible to very high variance. Thus, we need to regularize the coefficients.
We might impose the ridge constraint: minimize

Σ_{i=1}^n (Yi − Xi⊤β)², subject to (s.t.) Σ_{j=1}^p βj² ≤ t,

or

∥Y − Xβ∥², s.t. ∥β∥² ≤ t.

Denote the estimator by β̂ridge[t].


Constraint approach

Here, t is also called a tuning parameter:

if t = 0, then β̂ridge[t] = 0;
if t ≥ ∥β̂LSE∥² = t0, then β̂ridge[t] = β̂LSE.
Correspondingly, as λ → ∞, β̂ridge(λ) → 0;
and if λ = 0, then β̂ridge(λ) = β̂LSE.


Constraint approach

The penalty approach and the constraint approach are equivalent, and t and λ have an inverse relationship:

β̂ridge[t] = β̂ridge(λ), if t = ∥β̂ridge(λ)∥².


Delete-one-out expression
We have

β̂LSE = (X⊤X)−1X⊤Y = (Σ_{i=1}^n Xi Xi⊤)−1 Σ_{i=1}^n Xi Yi.

Therefore, if observation (Xj, Yj) is removed, the delete-one-out LS estimator is

β̂LSE^{(j)} = (Σ_{i≠j} Xi Xi⊤)−1 Σ_{i≠j} Xi Yi.

Similarly, the delete-one-out (delete the j-th observation) ridge estimator is

β̂ridge^{(j)} = (Σ_{i≠j} Xi Xi⊤ + λI)−1 Σ_{i≠j} Xi Yi.


Selection of tuning parameter

Suppose we have observations (X1, Y1), ..., (Xn, Yn) and consider the linear regression model

Yi = Xi⊤β + εi, i = 1, 2, ..., n.

For each λ, we have a different ridge regression estimator for the model:

β̂ridge(λ) = (X⊤X + λI)−1X⊤Y.

The best λ is the one that minimizes the MSE,

MSE(λ) = E∥β̂ridge(λ) − β∥².


Selection of λ via Cross-Validation (CV)


We select a wide range of possible λ values, say [0, c]. For each fixed λ in [0, c], the CV criterion is computed as follows. For each j,

β̂ridge^{(j)}(λ) = (Σ_{i≠j} Xi Xi⊤ + λI)−1 Σ_{i≠j} Xi Yi.

The prediction error for (Xj, Yj) is

errj(λ) = (Yj − Xj⊤ β̂ridge^{(j)}(λ))².

The CV value is then

CV(λ) = n−1 Σ_{j=1}^n errj(λ).

The best λ is the minimizer of CV(λ); a sketch of the computation follows.
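A minimal leave-one-out CV sketch, assuming X and y hold the standardized data (the grid of λ values is our own choice):

# Leave-one-out cross-validation for the ridge tuning parameter
loocv_ridge <- function(X, y, lambda) {
  n <- nrow(X); p <- ncol(X)
  errs <- sapply(1:n, function(j) {
    Xj  <- X[-j, , drop = FALSE]
    b_j <- solve(crossprod(Xj) + lambda * diag(p), crossprod(Xj, y[-j]))
    (y[j] - drop(X[j, ] %*% b_j))^2
  })
  mean(errs)
}
lambdas <- seq(0.01, 10, length.out = 50)
cv_vals <- sapply(lambdas, function(l) loocv_ridge(X, y, l))
lambdas[which.min(cv_vals)]  # CV-selected lambda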


Selection of λ via GCV

For a given λ, ridge regression fits the observations as

Ŷ = Xβ̂ridge = X(X⊤X + λI)−1X⊤Y = SY,

where S = X(X⊤X + λI)−1X⊤.
The GCV criterion is then

GCV(λ) = n−1(Y − Ŷ)⊤(Y − Ŷ) / (1 − trace(S)/n)².

The best λ is the minimizer of GCV(λ).¹ A sketch of the computation follows the footnote.

¹ Here, trace(S) is the trace of the matrix S, i.e. the sum of the elements on its main diagonal.
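A minimal GCV sketch under the same assumptions as the CV example above (reusing the lambdas grid):

# Generalized cross-validation for the ridge tuning parameter
gcv_ridge <- function(X, y, lambda) {
  n <- nrow(X); p <- ncol(X)
  S <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
  resid <- y - S %*% y
  (sum(resid^2) / n) / (1 - sum(diag(S)) / n)^2
}
gcv_vals <- sapply(lambdas, function(l) gcv_ridge(X, y, l))
lambdas[which.min(gcv_vals)]  # GCV-selected lambda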

Selection of λ via Other methods

Question: state the selection procedures based on AIC, BIC, and 5-fold CV.
Direct selection of the tuning parameter t in the constraint approach is not convenient.


Selection of λ
Example: for the diabetes data, the CV value changes with λ as shown below (R code for the calculation: code02Ac.R).

[Figure: CV(λ) plotted against λ ∈ [0, 10] for the diabetes data; the CV values lie roughly between 0.5024 and 0.5027.]


Selection of λ
Example (cell classification based on gene expression): consider the leukemia gene expression data DMleukemia1.dat. There are 25 cells with 250 genes each; they come from two types of cells, labeled 1 and 0 (or, equivalently, +1 and −1).
By CV, the ridge parameter λ = 2.48832e−06 is selected, as shown below.


[Figure: log(CV) plotted against log(λ) for the leukemia data; log(CV) ranges roughly from −30 to −10 over log(λ) from −25 to −5.]


Selection of λ
The estimation path

[Figure: ridge coefficient paths (BETA, roughly between −0.02 and 0.02) plotted over the tuning-parameter range 0.000 to 0.025 for the leukemia data.]
(Except at t = 0, none of the coefficients is exactly zero.)



Selection of λ

To check the model fitted by ridge regression, a new experiment was carried out and the resulting data, DMleukemia2.dat, were used as testing data.
The prediction error for the new data set based on ridge regression is 0.1568467. The figure below shows that a very accurate classification is obtained for the new data (R code: code02Ad.R); a prediction sketch follows.
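A minimal sketch of this test-set prediction (our own illustration, not necessarily the contents of code02Ad.R; it assumes Xtrain/ytrain and Xtest/ytest hold the standardized training and testing data with ±1 class labels):

# Fit ridge on the training data with the CV-selected lambda, then predict on the test data.
# For p > n it is convenient to use the identity (X'X + lambda I)^{-1} X' = X'(X X' + lambda I)^{-1}.
lambda_cv <- 2.48832e-06
b_hat <- t(Xtrain) %*% solve(tcrossprod(Xtrain) + lambda_cv * diag(nrow(Xtrain)), ytrain)
pred  <- Xtest %*% b_hat
mean((ytest - pred)^2)                  # test prediction error
pred_class <- ifelse(pred > 0, 1, -1)   # classify by the sign of the prediction (±1 labels assumed)
table(truth = ytest, predicted = pred_class)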


Selection of λ

[Figure: true and classified classes (±1) plotted against the predicted values for the test data.]
