
Supervised Learning:

Regularization
Regularization - Introduction
• Overfitting: If we have too many features, the learned hypothesis may fit
the training set very well, but fail to generalize to new examples.



Regularization - Introduction

The same problem occurs in logistic regression, whose hypothesis is $h_\theta(x) = g(\theta^T x)$, where $g$ is the sigmoid function.



Regularization – address overfitting

Features for predicting house price: size of house, no. of bedrooms, no. of floors, age of house, average income in neighborhood, kitchen size, and so on.

(Figure: Price plotted against Size.)



Regularization – address overfitting
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm.
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the parameters.
― Works well when we have a lot of features, each of which contributes a little to predicting $y$.



Regularization – bias-variance tradeoff
• Bias measures how far the model's predictions are, on average, from the true values; a high-bias model under-fits and does poorly even on the training data.
• Variance measures how sensitive the model is to the particular training set, seen in practice as the gap between its performance on the training data and on the test data; a high-variance model over-fits.
• In an ideal situation, we want both bias and variance to be low, which is where regularization comes in (see the sketch below).
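
A minimal sketch of the tradeoff, assuming a synthetic 1-D regression problem (the data, noise level, and polynomial degrees are illustrative choices, not from the slides): a low-degree fit has high bias, while a very high-degree fit has low bias but high variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 3, size=30).reshape(-1, 1)
y_train = np.sin(2 * x_train).ravel() + rng.normal(scale=0.2, size=30)
x_test = rng.uniform(0, 3, size=100).reshape(-1, 1)
y_test = np.sin(2 * x_test).ravel() + rng.normal(scale=0.2, size=100)

for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    # Expected pattern: degree 1 under-fits (both errors high, i.e. high bias);
    # degree 12 over-fits (train error far below test error, i.e. high variance).
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```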

Regularization – cost function

(Figures: Price vs. Size of house, shown twice: a simple fit and a higher-order, wiggly fit of the same data.)

Suppose we penalize the higher-order parameters (e.g. $\theta_3$ and $\theta_4$) and make them really small. The minimizing hypothesis is then essentially quadratic, so the curve is smoother and less prone to overfitting.
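
A minimal sketch of this idea in Python (the penalty weight, the feature set-up, and the parameter values are illustrative assumptions, not from the slides): adding a large penalty on selected coefficients to the squared-error cost means any minimizer has to keep those coefficients close to zero.

```python
import numpy as np

def penalized_cost(theta, X, y, penalty_idx, penalty_weight=1000.0):
    """Squared-error cost plus a large extra penalty on selected coefficients."""
    m = len(y)
    squared_error = np.sum((X @ theta - y) ** 2) / (2 * m)
    penalty = penalty_weight * np.sum(theta[penalty_idx] ** 2)
    return squared_error + penalty

# Illustrative use: penalize the cubic and quartic coefficients (theta_3, theta_4),
# pushing any minimizer toward an (almost) quadratic hypothesis.
x = np.linspace(0, 2, 20)
X = np.column_stack([x**0, x, x**2, x**3, x**4])   # polynomial design matrix
y = 1.0 + 2.0 * x - 0.5 * x**2                     # quadratic ground truth
theta = np.array([1.0, 2.0, -0.5, 0.3, 0.1])
print(penalized_cost(theta, X, y, penalty_idx=[3, 4]))
```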



Regularization – cost function
Small values for the parameters $\theta_0, \theta_1, \ldots, \theta_n$
― “Simpler” hypothesis
― Less prone to overfitting
Housing example:
― Features: $x_1, x_2, \ldots, x_n$
― Parameters: $\theta_0, \theta_1, \ldots, \theta_n$



Regularized Linear regression
In regularized linear regression, we choose $\theta$ to minimize

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

What if $\lambda$ is set to an extremely large value (perhaps too large for our problem)?
- Algorithm works fine; setting $\lambda$ to be very large can’t hurt it.
- Algorithm fails to eliminate overfitting.
- Algorithm results in underfitting (fails to fit even the training data well).
- Gradient descent will fail to converge.
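
To check the third option numerically, here is a hedged sketch using scikit-learn's Ridge, whose alpha parameter plays the role of $\lambda$ (the synthetic data is an illustrative assumption): with an extremely large penalty, all $\theta_j$ for $j \ge 1$ are driven toward zero, leaving essentially the constant $\theta_0$, and the model under-fits even the training data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=50)

for alpha in (0.01, 1.0, 1e10):   # alpha plays the role of lambda
    model = Ridge(alpha=alpha).fit(X, y)
    # With alpha=1e10 the coefficients collapse toward 0, leaving only the
    # intercept, so even the training R^2 drops to about 0 (underfitting).
    print(f"alpha={alpha:g}  coef={np.round(model.coef_, 3)}  train R^2={model.score(X, y):.3f}")
```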



Regularized Linear regression
• Gradient descent update rule

Repeat {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_0^{(i)}$
  $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \quad (j = 1, 2, \ldots, n)$
}

Note that $\theta_0$ is not regularized. For $j \ge 1$ the update can be rewritten as $\theta_j := \theta_j\bigl(1 - \alpha\tfrac{\lambda}{m}\bigr) - \alpha \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}$, so each iteration shrinks $\theta_j$ slightly before the usual gradient step.
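
A minimal numpy sketch of this update rule (the data, learning rate, and iteration count are illustrative choices), assuming a linear hypothesis $h_\theta(x) = \theta^T x$ with a leading bias feature $x_0 = 1$:

```python
import numpy as np

def regularized_gradient_descent(X, y, lam=1.0, alpha=0.01, iters=1000):
    """Gradient descent for regularized linear regression.

    X is assumed to have a leading column of ones (the bias feature x_0),
    which is deliberately excluded from the regularization penalty.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        error = X @ theta - y                  # h_theta(x^(i)) - y^(i) for all i
        grad = (X.T @ error) / m               # unregularized gradient
        grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1
        theta -= alpha * grad
    return theta

# Illustrative usage on a tiny synthetic problem.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=40)
X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
y = 2.0 + 1.5 * x + rng.normal(scale=0.3, size=40)
print(regularized_gradient_descent(X, y, lam=0.5, alpha=0.05, iters=5000))
```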



Regularized Logistic regression

(Figure: training examples and a decision boundary in the $x_1$–$x_2$ plane.)

Cost function:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[y^{(i)}\log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$


Regularized Logistic regression
Gradient descent update rule

Repeat {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_0^{(i)}$
  $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \quad (j = 1, 2, \ldots, n)$
}

The update looks identical to the one for regularized linear regression, but here $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$, so it is a different algorithm.
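
A hedged sketch with scikit-learn's LogisticRegression (the data is an illustrative assumption; note that scikit-learn expresses the penalty through C, the inverse of the regularization strength, roughly $C \approx 1/\lambda$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Smaller C means stronger L2 regularization (larger lambda), so coefficients shrink.
for C in (100.0, 1.0, 0.01):
    clf = LogisticRegression(penalty="l2", C=C, max_iter=1000).fit(X, y)
    print(f"C={C:g}  coef={np.round(clf.coef_[0], 3)}")
```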

Regularization
• Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function:

$$\text{Loss} = \sum_{i=1}^{m}\bigl(y_i - \hat{y}_i\bigr)^2 + \lambda\sum_{j=1}^{n}\lvert\beta_j\rvert$$

• If lambda is zero we get back Ordinary Least Squares, whereas a very large value will drive the coefficients to zero and the model will under-fit.

• The key difference between these techniques is that Lasso shrinks the coefficients of the less important features all the way to zero, removing some features altogether. This makes it well suited to feature selection when we have a huge number of features (see the sketch below).
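
A hedged sketch with scikit-learn's Lasso (the synthetic data is an illustrative assumption; alpha plays the role of lambda): with a moderate penalty, the coefficients of the uninformative features are driven exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
# Only the first two features actually matter; the other six are noise.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)   # alpha plays the role of lambda
print(np.round(lasso.coef_, 3))      # the noise features' coefficients come out as exactly 0
```
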
Regularization
A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.
• Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function:

$$\text{Loss} = \sum_{i=1}^{m}\bigl(y_i - \hat{y}_i\bigr)^2 + \lambda\sum_{j=1}^{n}\beta_j^2$$

Here the second term, $\lambda\sum_j \beta_j^2$, is the L2 regularization element.

• If lambda is zero we get back Ordinary Least Squares.
• If lambda is very large it adds too much weight and leads to under-fitting, so how lambda is chosen matters. With a well-chosen lambda, this technique works very well to avoid over-fitting (see the sketch below).
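
For contrast with the Lasso sketch above, a hedged Ridge example on the same kind of synthetic data (again an illustrative assumption): the coefficients of the noise features shrink but never become exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)   # alpha plays the role of lambda
print(np.round(ridge.coef_, 3))       # every coefficient shrinks, but none is exactly 0
```
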
• As we can see from the formulas for L1 and L2 regularization, L1 regularization adds a penalty term to the cost function using the absolute values of the weights ($W_j$), while L2 regularization adds the squared values of the weights ($W_j$).

• L1 regularization helps with feature selection by eliminating features that are not important. This is helpful when the number of features is large.

• With L1 regularization, the estimate is pulled toward the median of the data (minimizing an absolute-error term yields the median).

• With L2 regularization, the loss minimized in the gradient step pulls the estimate toward the mean of the data (minimizing a squared-error term yields the mean).
• L1 is Lasso:
• It penalizes your model's beta coefficients and can even make them zero, that is, it can help you remove variables as well.
• This is particularly advisable when you have a large number of variables in your dataset and notice that a few of them are not really helping the model much.

• L2 is Ridge:
• It will not eliminate any variables, but it assigns weights (think of them as importance) to the variables, so that your model keeps all the variables while giving more importance to the ones that matter most, in order to build a good model.



Regularization
• L1 Regularization aka Lasso Regularization:
• This adds regularization terms to the model that are a function of the absolute values of the parameter coefficients.
• The parameter coefficients can be driven to zero during the regularization process, so this technique can be used for feature selection and for generating a more parsimonious model.
• L2 Regularization aka Ridge Regularization:
• This adds regularization terms to the model that are a function of the squares of the parameter coefficients. The coefficients can approach zero but never become exactly zero, so no features are removed.
• Combination of the above two, such as Elastic Nets:
• This adds regularization terms to the model that combine both the L1 and L2 penalties (see the sketch after this list).
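
A hedged sketch with scikit-learn's ElasticNet (illustrative synthetic data; l1_ratio controls the mix between the L1 and L2 penalties, and alpha the overall strength):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# l1_ratio=0.5 mixes the L1 and L2 penalties equally; alpha scales the overall strength.
enet = ElasticNet(alpha=0.3, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 3))   # typically some coefficients are zeroed (L1 effect) and the rest shrink (L2 effect)
```
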
Thank you
