
Supervised Learning:

Regularization
Regularization - Introduction
• Overfitting: If we have too many features, the learned hypothesis may fit
the training set very well, but fail to generalize to new examples.



Regularization - Introduction

The same problem occurs in logistic regression, whose hypothesis is $h_\theta(x) = g(\theta^T x)$, where $g$ is the sigmoid function.



Regularization – address overfitting

Features for predicting house price: size of house, no. of bedrooms, no. of floors, age of house, average income in neighborhood, kitchen size, and so on.

(Figure: Price plotted against Size.)



Regularization – address overfitting
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm.
2. Regularization.
― Keep all the features, but reduce the magnitude/values of the parameters.
― Works well when we have a lot of features, each of which contributes a little to predicting $y$.



Regularization – bias-variance tradeoff
• Bias measures how far the model's predictions are, on average, from the true values; a high-bias model under-fits and does poorly even on the training data.
• Variance measures how sensitive the model is to the particular training set, seen in practice as the gap between its performance on the training data and on the test data; a high-variance model over-fits.
• In an ideal situation, we want both bias and variance to be low, which is where regularization comes in (see the sketch below).
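
A minimal sketch of the tradeoff, assuming a synthetic 1-D regression problem (the data, noise level, and polynomial degrees are illustrative choices, not from the slides): a low-degree fit has high bias, while a very high-degree fit has low bias but high variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 3, size=30).reshape(-1, 1)
y_train = np.sin(2 * x_train).ravel() + rng.normal(scale=0.2, size=30)
x_test = rng.uniform(0, 3, size=100).reshape(-1, 1)
y_test = np.sin(2 * x_test).ravel() + rng.normal(scale=0.2, size=100)

for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    # Expected pattern: degree 1 under-fits (both errors high, i.e. high bias);
    # degree 12 over-fits (train error far below test error, i.e. high variance).
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```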

Regularization – cost function

(Figures: Price vs. Size of house, shown twice: a simple fit and a higher-order, wiggly fit of the same data.)

Suppose we penalize the higher-order parameters (e.g. $\theta_3$ and $\theta_4$) and make them really small. The minimizing hypothesis is then essentially quadratic, so the curve is smoother and less prone to overfitting.
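
A minimal sketch of this idea in Python (the penalty weight, the feature set-up, and the parameter values are illustrative assumptions, not from the slides): adding a large penalty on selected coefficients to the squared-error cost means any minimizer has to keep those coefficients close to zero.

```python
import numpy as np

def penalized_cost(theta, X, y, penalty_idx, penalty_weight=1000.0):
    """Squared-error cost plus a large extra penalty on selected coefficients."""
    m = len(y)
    squared_error = np.sum((X @ theta - y) ** 2) / (2 * m)
    penalty = penalty_weight * np.sum(theta[penalty_idx] ** 2)
    return squared_error + penalty

# Illustrative use: penalize the cubic and quartic coefficients (theta_3, theta_4),
# pushing any minimizer toward an (almost) quadratic hypothesis.
x = np.linspace(0, 2, 20)
X = np.column_stack([x**0, x, x**2, x**3, x**4])   # polynomial design matrix
y = 1.0 + 2.0 * x - 0.5 * x**2                     # quadratic ground truth
theta = np.array([1.0, 2.0, -0.5, 0.3, 0.1])
print(penalized_cost(theta, X, y, penalty_idx=[3, 4]))
```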



Regularization – cost function
Small values for the parameters $\theta_0, \theta_1, \ldots, \theta_n$
― “Simpler” hypothesis
― Less prone to overfitting
Housing example:
― Features: $x_1, x_2, \ldots, x_n$
― Parameters: $\theta_0, \theta_1, \ldots, \theta_n$



Regularized Linear regression
In regularized linear regression, we choose $\theta$ to minimize

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

What if $\lambda$ is set to an extremely large value (perhaps too large for our problem)?
- Algorithm works fine; setting $\lambda$ to be very large can’t hurt it.
- Algorithm fails to eliminate overfitting.
- Algorithm results in underfitting (fails to fit even the training data well).
- Gradient descent will fail to converge.
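
To check the third option numerically, here is a hedged sketch using scikit-learn's Ridge, whose alpha parameter plays the role of $\lambda$ (the synthetic data is an illustrative assumption): with an extremely large penalty, all $\theta_j$ for $j \ge 1$ are driven toward zero, leaving essentially the constant $\theta_0$, and the model under-fits even the training data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=50)

for alpha in (0.01, 1.0, 1e10):   # alpha plays the role of lambda
    model = Ridge(alpha=alpha).fit(X, y)
    # With alpha=1e10 the coefficients collapse toward 0, leaving only the
    # intercept, so even the training R^2 drops to about 0 (underfitting).
    print(f"alpha={alpha:g}  coef={np.round(model.coef_, 3)}  train R^2={model.score(X, y):.3f}")
```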



Regularized Linear regression
• Gradient descent update rule

Repeat {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_0^{(i)}$
  $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \quad (j = 1, 2, \ldots, n)$
}

Note that $\theta_0$ is not regularized. For $j \ge 1$ the update can be rewritten as $\theta_j := \theta_j\bigl(1 - \alpha\tfrac{\lambda}{m}\bigr) - \alpha \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}$, so each iteration shrinks $\theta_j$ slightly before the usual gradient step.
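
A minimal numpy sketch of this update rule (the data, learning rate, and iteration count are illustrative choices), assuming a linear hypothesis $h_\theta(x) = \theta^T x$ with a leading bias feature $x_0 = 1$:

```python
import numpy as np

def regularized_gradient_descent(X, y, lam=1.0, alpha=0.01, iters=1000):
    """Gradient descent for regularized linear regression.

    X is assumed to have a leading column of ones (the bias feature x_0),
    which is deliberately excluded from the regularization penalty.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        error = X @ theta - y                  # h_theta(x^(i)) - y^(i) for all i
        grad = (X.T @ error) / m               # unregularized gradient
        grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1
        theta -= alpha * grad
    return theta

# Illustrative usage on a tiny synthetic problem.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=40)
X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
y = 2.0 + 1.5 * x + rng.normal(scale=0.3, size=40)
print(regularized_gradient_descent(X, y, lam=0.5, alpha=0.05, iters=5000))
```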



Regularized Logistic regression

(Figure: training examples and a decision boundary in the $x_1$–$x_2$ plane.)

Cost function:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[y^{(i)}\log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$


Regularized Logistic regression
Gradient descent update rule

Repeat {
  $\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_0^{(i)}$
  $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \quad (j = 1, 2, \ldots, n)$
}

The update looks identical to the one for regularized linear regression, but here $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$, so it is a different algorithm.
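
A hedged sketch with scikit-learn's LogisticRegression (the data is an illustrative assumption; note that scikit-learn expresses the penalty through C, the inverse of the regularization strength, roughly $C \approx 1/\lambda$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Smaller C means stronger L2 regularization (larger lambda), so coefficients shrink.
for C in (100.0, 1.0, 0.01):
    clf = LogisticRegression(penalty="l2", C=C, max_iter=1000).fit(X, y)
    print(f"C={C:g}  coef={np.round(clf.coef_[0], 3)}")
```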

Regularization
• Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function:

$$\text{Loss} = \sum_{i=1}^{m}\bigl(y_i - \hat{y}_i\bigr)^2 + \lambda\sum_{j=1}^{n}\lvert\beta_j\rvert$$

• If lambda is zero we get back Ordinary Least Squares, whereas a very large value will drive the coefficients to zero and the model will under-fit.

• The key difference between these techniques is that Lasso shrinks the coefficients of the less important features all the way to zero, removing some features altogether. This makes it well suited to feature selection when we have a huge number of features (see the sketch below).
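
A hedged sketch with scikit-learn's Lasso (the synthetic data is an illustrative assumption; alpha plays the role of lambda): with a moderate penalty, the coefficients of the uninformative features are driven exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
# Only the first two features actually matter; the other six are noise.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)   # alpha plays the role of lambda
print(np.round(lasso.coef_, 3))      # the noise features' coefficients come out as exactly 0
```
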
Regularization
A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.
• Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function:

$$\text{Loss} = \sum_{i=1}^{m}\bigl(y_i - \hat{y}_i\bigr)^2 + \lambda\sum_{j=1}^{n}\beta_j^2$$

Here the second term, $\lambda\sum_j \beta_j^2$, is the L2 regularization element.

• If lambda is zero we get back Ordinary Least Squares.
• If lambda is very large it adds too much weight and leads to under-fitting, so how lambda is chosen matters. With a well-chosen lambda, this technique works very well to avoid over-fitting (see the sketch below).
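
For contrast with the Lasso sketch above, a hedged Ridge example on the same kind of synthetic data (again an illustrative assumption): the coefficients of the noise features shrink but never become exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)   # alpha plays the role of lambda
print(np.round(ridge.coef_, 3))       # every coefficient shrinks, but none is exactly 0
```
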
• As we can see from the formulas for L1 and L2 regularization, L1 regularization adds a penalty term to the cost function using the absolute values of the weights ($W_j$), while L2 regularization adds the squared values of the weights ($W_j$).

• L1 regularization helps with feature selection by eliminating features that are not important. This is helpful when the number of features is large.

• With L1 regularization, the estimate is pulled toward the median of the data (minimizing an absolute-error term yields the median).

• With L2 regularization, the loss minimized in the gradient step pulls the estimate toward the mean of the data (minimizing a squared-error term yields the mean).
• L1 is Lasso:
• It penalizes your model's beta coefficients and can even make them zero, that is, it can help you remove variables as well.
• This is particularly advisable when you have a large number of variables in your dataset and notice that a few of them are not really helping the model much.

• L2 is Ridge:
• It will not eliminate any variables, but it assigns weights (think of them as importance) to the variables, so that your model keeps all the variables while giving more importance to the ones that matter most, in order to build a good model.



Regularization
• L1 Regularization aka Lasso Regularization:
• This adds regularization terms to the model that are a function of the absolute values of the parameter coefficients.
• The parameter coefficients can be driven to zero during the regularization process, so this technique can be used for feature selection and for generating a more parsimonious model.
• L2 Regularization aka Ridge Regularization:
• This adds regularization terms to the model that are a function of the squares of the parameter coefficients. The coefficients can approach zero but never become exactly zero, so no features are removed.
• Combination of the above two, such as Elastic Nets:
• This adds regularization terms to the model that combine both the L1 and L2 penalties (see the sketch after this list).
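
A hedged sketch with scikit-learn's ElasticNet (illustrative synthetic data; l1_ratio controls the mix between the L1 and L2 penalties, and alpha the overall strength):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# l1_ratio=0.5 mixes the L1 and L2 penalties equally; alpha scales the overall strength.
enet = ElasticNet(alpha=0.3, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 3))   # typically some coefficients are zeroed (L1 effect) and the rest shrink (L2 effect)
```
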
Thank you
