
Lecture 8a: Regularization

CS109A Introduction to Data Science


Pavlos Protopapas, Kevin Rader and Chris Tanner
ANNOUNCEMENTS

• Religious Holidays: please contact us if this affects your HW due dates.


• For 209 students:
– please submit 209 HW separately from 109 HW in different assignments on Canvas.
– A-sec this week: optional; it covers the 2nd part from last week (Bayesian perspective).



ANNOUNCEMENTS

• Milestone 1: Due this Friday, Oct 4 (see Canvas instructions).


• Study break: Wednesday 6-8pm, location TBD.
• Recall Course Requirements: “You are expected to have programming
experience at the level of CS 50 or above, and statistics knowledge at the
level of Stat 100 or above (Stat 110 recommended).”



Bias vs Variance
Left: 2000 best-fit straight lines, each fitted on a different 20-point training set.
Right: the corresponding best-fit models using a degree-10 polynomial.



Bias vs Variance

Left: Linear regression coefficients

Right: Polynomial regression (degree 10) coefficients



Lecture Outline

Regularization: LASSO and Ridge


Geometric Interpretation



Regularization: An Overview

The idea of regularization revolves around modifying the loss function L; in particular, we add a regularization term that penalizes some specified properties of the model parameters:

$$L_{\text{reg}}(\boldsymbol{\beta}) = L(\boldsymbol{\beta}) + \lambda R(\boldsymbol{\beta}),$$

where λ is a scalar that gives the weight (or importance) of the regularization term.
Fitting the model using the modified loss function L_reg results in model parameters with desirable properties (specified by R).



LASSO Regression

Since we wish to discourage extreme values in the model parameters, we need to choose a regularization term that penalizes parameter magnitudes. For our loss function, we will again use the MSE.
Together, our regularized loss function is:

$$L_{\text{LASSO}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \boldsymbol{\beta}^T \boldsymbol{x}_i|^2 + \lambda \sum_{j=1}^{J} |\beta_j|$$

Note that $\sum_{j=1}^{J} |\beta_j|$ is the $\ell_1$ norm of the vector β.

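As a concrete illustration (not from the slides), here is a minimal scikit-learn sketch of LASSO on synthetic data; the data and the penalty strength are hypothetical choices, and sklearn's `alpha` corresponds to λ up to the library's own scaling of the MSE term:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 100 observations, 10 predictors, only the first two truly matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0]) + rng.normal(size=100)

# alpha plays the role of lambda (sklearn minimizes (1/2n)*MSE + alpha*||beta||_1)
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
print(lasso.coef_)  # several coefficients are driven exactly to zero
```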


Ridge Regression

Alternatively, we can choose a regularization term that penalizes the squares of the parameter magnitudes. Then, our regularized loss function is:

$$L_{\text{Ridge}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \boldsymbol{\beta}^T \boldsymbol{x}_i|^2 + \lambda \sum_{j=1}^{J} \beta_j^2$$

Note that $\sum_{j=1}^{J} \beta_j^2$ is the square of the $\ell_2$ norm of the vector β.

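The corresponding ridge sketch, reusing the hypothetical X and y from the LASSO example above (again, sklearn's `alpha` plays the role of λ up to the library's own scaling):

```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.5)
ridge.fit(X, y)
print(ridge.coef_)  # coefficients shrink toward zero but are not exactly zero
```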


Choosing λ

In both ridge and LASSO regression, we see that the larger our choice of the regularization parameter λ, the more heavily we penalize large values in β:
• If λ is close to zero, we recover the MSE, i.e. ridge and LASSO regression are just ordinary regression.
• If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant and the regularization term will force β̂_ridge and β̂_LASSO to be close to zero.
To avoid ad-hoc choices, we should select λ using cross-validation.

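A small sketch of this behavior, reusing the same hypothetical X and y: as λ (sklearn's `alpha`) grows, ridge coefficients shrink smoothly while LASSO zeroes more of them out.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

for alpha in [0.01, 0.1, 1, 10, 100]:
    ridge_coef = Ridge(alpha=alpha).fit(X, y).coef_
    lasso_coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:>6}: max |ridge coef| = {np.abs(ridge_coef).max():.3f}, "
          f"nonzero lasso coefs = {np.count_nonzero(lasso_coef)}")
```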


Ridge, LASSO - Computational complexity

Solution to ridge regression (it has a closed form; here λ absorbs whatever constant scaling is applied to the MSE term):

$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\boldsymbol{X}^T\boldsymbol{X} + \lambda \boldsymbol{I})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$

The solution to the LASSO regression has no conventional analytical form, as the $\ell_1$ norm has no derivative at 0. We can, however, use the concept of the subdifferential or subgradient to find a manageable expression. See A-sec 2 for details.

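A minimal NumPy sketch of the ridge closed form (the helper name is mine, the data are the hypothetical X and y from above, and whether to penalize an intercept or how exactly λ is scaled are choices the sketch does not settle):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y without forming the inverse explicitly."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ridge = ridge_closed_form(X, y, lam=1.0)
print(beta_ridge)
```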


Regularization Parameter with a Validation Set

The solution of the ridge/LASSO regression involves three steps:

• Select a value of λ.
• Find the minimum of the ridge/LASSO regression loss function for that λ (using the closed-form formula for ridge) and record the MSE on the validation set.
• Find the λ that gives the smallest validation MSE.



The Geometry of Regularization (LASSO)
$$\hat{\boldsymbol{\beta}}_{\text{LASSO}} = \underset{\boldsymbol{\beta}}{\operatorname{argmin}}\ L_{\text{LASSO}}(\boldsymbol{\beta}), \qquad L_{\text{LASSO}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \boldsymbol{\beta}^T \boldsymbol{x}_i|^2 + \lambda \sum_{j=1}^{J} |\beta_j|$$

[Figure: the (β1, β2) plane showing the diamond-shaped constraint contour λ Σ_j |β̂_j| = C from the penalty term, and the elliptical contours MSE = D of the loss term centered at the unregularized estimate β̂_MSE.]


The Geometry of Regularization (LASSO)
$$\hat{\boldsymbol{\beta}}_{\text{LASSO}} = \underset{\boldsymbol{\beta}}{\operatorname{argmin}}\ L_{\text{LASSO}}(\boldsymbol{\beta}), \qquad L_{\text{LASSO}}(\boldsymbol{\beta}) = \underbrace{\frac{1}{n}\sum_{i=1}^{n} |y_i - \boldsymbol{\beta}^T \boldsymbol{x}_i|^2}_{L_{\text{MSE}}(\boldsymbol{\beta})} + \underbrace{\lambda \sum_{j=1}^{J} |\beta_j|}_{L_{\ell_1}}$$


The Geometry of Regularization (LASSO)

[Figure: the diamond-shaped LASSO constraint λ Σ_j |β̂_j| = C and the elliptical MSE = D contours centered at β̂_MSE, shown together in the (β1, β2) plane.]
The Geometry of Regularization (LASSO)

[Figure: two (β1, β2) panels, each showing a diamond-shaped LASSO constraint of size C together with MSE = D contours centered at β̂_MSE; the contours can first touch the constraint at a corner of the diamond, which sets one of the coefficients exactly to zero.]


The Geometry of Regularization (Ridge)
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \underset{\boldsymbol{\beta}}{\operatorname{argmin}}\ L_{\text{Ridge}}(\boldsymbol{\beta}), \qquad L_{\text{Ridge}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \boldsymbol{\beta}^T \boldsymbol{x}_i|^2 + \lambda \sum_{j=1}^{J} \beta_j^2$$

[Figure: the (β1, β2) plane showing the circular constraint contour λ Σ_j β̂_j² = C from the penalty term, and the elliptical contours MSE = D of the loss term centered at β̂_MSE.]


The Geometry of Regularization (Ridge)

[Figure: the circular ridge constraint of size C and the MSE = D contours centered at β̂_MSE in the (β1, β2) plane; the ridge estimate lies where the contours first touch the circle.]


The Geometry of Regularization

[Figure: two (β1, β2) panels comparing the regularization constraint (of size C) with the MSE = D contours centered at β̂_MSE.]


Examples



Ridge visualized

Ridge estimator

Left: the ridge estimator is where the constraint and the loss intersect.
Right: the values of the coefficients decrease as λ increases, but they are not nullified.



LASSO visualized

Lasso estimator

Left: the Lasso estimator tends to zero out parameters, as the OLS loss can easily intersect the constraint on one of the axes.
Right: the values of the coefficients decrease as λ increases, and are nullified quickly.
Ridge regularization with only validation: step by step

1. Split the data into train, validation, and test sets.
2. For each candidate value of λ:
   a. Determine the β̂ that minimizes the ridge loss, L_Ridge, for that λ using the train data (via the closed-form ridge formula).
   b. Record the MSE using the validation data.
3. Select the λ that minimizes the loss on the validation data, λ*.
4. Refit the model using both the train and validation data, resulting in β̂_{λ*}.
5. Report the MSE or R² on the test set given the refitted β̂_{λ*}.

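A minimal sketch of this procedure, reusing the hypothetical X, y and the `ridge_closed_form` helper from the earlier sketches; the split proportions and the λ grid are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Step 1: split into train (60%), validation (20%), and test (20%)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

lambdas = [0.01, 0.1, 1, 10, 100]
val_mse = []
for lam in lambdas:                                            # step 2
    beta = ridge_closed_form(X_train, y_train, lam)            # 2a: fit on train via the closed form
    val_mse.append(mean_squared_error(y_val, X_val @ beta))    # 2b: record validation MSE

best_lam = lambdas[int(np.argmin(val_mse))]                    # step 3: best lambda on validation
X_fit = np.vstack([X_train, X_val])                            # step 4: refit on train + validation
y_fit = np.concatenate([y_train, y_val])
beta_final = ridge_closed_form(X_fit, y_fit, best_lam)
print(mean_squared_error(y_test, X_test @ beta_final))         # step 5: report test MSE
```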


Ridge regularization with validation only: step by step

[Figure illustrating the train/validation procedure.]


Lasso regularization with validation only: step by step

1. Split the data into train, validation, and test sets.
2. For each candidate value of λ:
   A. Determine the β̂ that minimizes the LASSO loss, L_LASSO, for that λ using the train data. This is done using a solver.
   B. Record the MSE using the validation data.
3. Select the λ that minimizes the loss on the validation data, λ*.
4. Refit the model using both the train and validation data, resulting in β̂_{λ*}.
5. Report the MSE or R² on the test set given the refitted β̂_{λ*}.

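The same loop as the ridge sketch above, but with scikit-learn's Lasso, whose coordinate-descent solver takes the place of a closed-form formula (continuing with the train/validation/test variables and λ grid defined there):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

val_mse = []
for lam in lambdas:
    lasso = Lasso(alpha=lam).fit(X_train, y_train)             # fit via a numerical solver
    val_mse.append(mean_squared_error(y_val, lasso.predict(X_val)))

best_lam = lambdas[int(np.argmin(val_mse))]
lasso_final = Lasso(alpha=best_lam).fit(X_fit, y_fit)          # refit on train + validation
print(mean_squared_error(y_test, lasso_final.predict(X_test))) # report test MSE
```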


Ridge regularization with CV: step by step

1. Remove the test set from the data.
2. Split the rest of the data into K folds.
3. For k in 1, …, K:
   For each candidate value of λ:
   A. Determine the β̂ that minimizes the ridge loss, L_Ridge, for that λ using the training data of the fold.
   B. Record the MSE using the validation data of the fold.
   At this point we have a 2-D matrix of validation MSEs: rows are for different k, and columns are for different λ values.
4. Average the MSE for each λ over the K folds.
5. Find the λ that minimizes the averaged MSE, resulting in λ*.
6. Refit the model using the full training data, resulting in β̂_{λ*}.
7. Report the MSE or R² on the test set given the refitted β̂_{λ*}.

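A minimal sketch of the cross-validation loop, again reusing the hypothetical X, y and the `ridge_closed_form` helper; K and the λ grid are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error

# Step 1: hold out a test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lambdas = [0.01, 0.1, 1, 10, 100]
kf = KFold(n_splits=5, shuffle=True, random_state=0)           # step 2: K folds

# Steps 3A/3B: fill a (K x number-of-lambdas) matrix of validation MSEs
mse = np.zeros((kf.get_n_splits(), len(lambdas)))
for k, (tr, va) in enumerate(kf.split(X_rest)):
    for j, lam in enumerate(lambdas):
        beta = ridge_closed_form(X_rest[tr], y_rest[tr], lam)
        mse[k, j] = mean_squared_error(y_rest[va], X_rest[va] @ beta)

best_lam = lambdas[int(np.argmin(mse.mean(axis=0)))]           # steps 4-5: average over folds, pick lambda
beta_final = ridge_closed_form(X_rest, y_rest, best_lam)       # step 6: refit on the full training data
print(mean_squared_error(y_test, X_test @ beta_final))         # step 7: report test MSE
```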


Ridge regularization with validation only: step by step

[Figure illustrating the procedure.]


Variable Selection as Regularization

Since LASSO regression tends to produce zero estimates for a number of model parameters (we say that LASSO solutions are sparse), we consider LASSO to be a method for variable selection.

Many prefer using LASSO for variable selection (as well as for suppressing extreme parameter values) rather than stepwise selection, as LASSO avoids the statistical problems that arise in stepwise selection.

Question: What are the pros and cons of the two approaches?

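A small sketch of reading the selected variables off a LASSO fit (using the hypothetical X, y from the earlier sketches; the exact zeros are what make this variable selection rather than mere shrinkage):

```python
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)    # indices of predictors with nonzero coefficients
print(f"LASSO kept {selected.size} of {X.shape[1]} predictors:", selected)
```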


Quiz time: Lecture 8

