
Introduction to

Machine Learning
Dr. Muhammad Amjad Iqbal
Associate Professor
University of Central Punjab, Lahore.
amjad.iqbal@ucp.edu.pk

https://sites.google.com/a/ucp.edu.pk/mai/iml/
Slides adapted from Prof. Dr. Andrew Ng (Stanford) and Dr. Humayoun
Regularization

The problem of overfitting
• So far we have seen a few learning algorithms (e.g. linear regression and
  logistic regression).
• They work well for many applications, but can suffer from the problem of
  overfitting.

Overfitting with linear regression
Example: Linear regression (housing prices)

[Figure: three Price vs. Size plots, showing a straight-line fit that underfits, a quadratic fit that works well, and a high-order polynomial fit that overfits]

Overfitting: If we have too many features, the learned hypothesis may fit the
training set very well (training cost J(θ) ≈ 0), but fail to generalize to new
examples (e.g. fail to predict prices for houses it has not seen).
The hypothesis has too many free parameters and is too variable, and we do not
have enough data to constrain it, so it does not give us a good hypothesis.
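
As an illustration (not part of the original slides), the short Octave sketch below fits a low-degree and a high-degree polynomial to a handful of made-up (size, price) points; the data values and degree choices are assumptions chosen only to show the overfitting behaviour:

% Sketch: low- vs. high-degree polynomial fits on a tiny, made-up dataset.
x = [1.0 1.5 2.0 2.5 3.0 3.5]';     % house sizes (arbitrary units)
y = [2.1 2.9 3.2 4.4 4.9 5.1]';     % prices (arbitrary units)

p1 = polyfit(x, y, 1);              % degree-1 fit: simple, may slightly underfit
p5 = polyfit(x, y, 5);              % degree-5 fit: passes through every point, overfits

xs = linspace(min(x), max(x), 100);
plot(x, y, 'o', xs, polyval(p1, xs), '-', xs, polyval(p5, xs), '--');
legend('training data', 'degree 1', 'degree 5');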
Example: Logistic regression

[Figure: three decision boundaries in the (x1, x2) plane, a straight line that underfits, a smooth curve that fits well, and a highly contorted boundary that overfits]

(g = sigmoid function)
Addressing overfitting:

Example: predicting house price from many features, e.g.
― size of house
― no. of bedrooms
― no. of floors
― age of house
― average income in neighborhood
― kitchen size
― ...

[Figure: Price vs. Size, showing an overfit high-order polynomial]

• Plotting the hypothesis is one way to decide whether overfitting occurs or not.
• But with lots of features and little data we cannot visualize it, and therefore it is:
  ― Hard to select the degree of the polynomial.
  ― Hard to decide which features to keep and which to drop.
Addressing overfitting:

Options:
1. Reduce the number of features (but this means losing information).
   ― Manually select which features to keep.
   ― Model selection algorithm (later in the course).
2. Regularization.
   ― Keep all the features, but reduce the magnitude/values of the parameters θj.
   ― Works well when we have a lot of features, each of which contributes a bit
     to predicting y.
Cost function

Intuition

[Figure: two Price vs. Size-of-house plots, a quadratic hypothesis θ0 + θ1x + θ2x² that fits well, and a quartic hypothesis θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴ that overfits]

Suppose we penalize θ3 and θ4 and make them really small, e.g. by minimizing

  \min_{\theta} \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2

The only way to keep this cost small is to have θ3 ≈ 0 and θ4 ≈ 0, so the
quartic hypothesis effectively collapses to a quadratic, and we end up with a
simpler fit that generalizes better.
Regularization.
Small values for the parameters θ0, θ1, ..., θn
― “Simpler” hypothesis
― Less prone to overfitting

Housing example:
― Features: x1, x2, ..., xn
― Parameters: θ0, θ1, ..., θn

Unlike the polynomial example, we don't know in advance which are the high-order
(or otherwise unimportant) terms, so how do we pick the ones that need to be
shrunk? With regularization, we take the cost function and modify it to shrink
all the parameters.

By convention we don't penalize θ0: the regularization sum runs from θ1 onwards.
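
For reference, the regularized linear regression cost function being described here is the standard one:

  J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]

Here λ is the regularization parameter (discussed below), and the second sum starts at j = 1, so θ0 is not penalized.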


Regularization.

• Using the regularized objective (i.e. the cost function with the
  regularization term), we get a much smoother curve that still fits the data
  and gives a much better hypothesis.

[Figure: Price vs. Size of house, with the regularized fit shown as a smooth curve instead of a wiggly overfit polynomial]

λ is the regularization parameter.
It controls a trade-off between our two goals:
1) We want to fit the training set well.
2) We want to keep the parameters small.
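
As a small sketch (not from the original slides), this is one way such a regularized cost could be computed in Octave; the function name and the assumption that X already contains a leading column of ones are mine:

% Sketch: regularized linear regression cost.
% X is m x (n+1) with a leading column of ones, y is m x 1, theta is (n+1) x 1.
function J = regularizedCost(X, y, theta, lambda)
  m = length(y);
  errors  = X * theta - y;                       % h_theta(x) - y for every example
  penalty = lambda * sum(theta(2:end) .^ 2);     % theta(1) (i.e. theta_0) is not penalized
  J = (sum(errors .^ 2) + penalty) / (2 * m);
end

With lambda = 0 this reduces to the unregularized cost from the earlier lectures; a larger lambda trades training-set fit for smaller parameters.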
In regularized linear regression, we choose θ to minimize the regularized cost
J(θ) given above.

What if λ is set to an extremely large value (perhaps too large for our problem,
say λ = 10^10)?
- Algorithm works fine; setting λ to be very large can't hurt it.
- Algorithm fails to eliminate overfitting.
- Algorithm results in underfitting (fails to fit even the training data well).
- Gradient descent will fail to converge.
Answer: with such a large λ, the penalty term dominates the cost, so minimizing
J(θ) drives θ1, θ2, ..., θn all towards 0. The hypothesis becomes

  h_\theta(x) \approx \theta_0

i.e. a flat, constant line: the algorithm underfits and fails to fit even the
training data well.

[Figure: Price vs. Size of house, where the hypothesis for a very large λ is essentially a horizontal line through the data]
Regularized linear regression

Regularized linear regression
Gradient descent

Repeat {
  \theta_0 := \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}
  \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]    (j = 1, 2, ..., n)   (regularized)
}

The θ0 update is the same as before, since θ0 is not penalized; for j ≥ 1 the
bracketed quantity is \frac{\partial}{\partial \theta_j} J(\theta) for the
regularized cost, which adds the extra term \frac{\lambda}{m} \theta_j.

The regularized update can also be written as

  \theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

Interesting term: (1 - αλ/m). Usually the learning rate α is small and m is
large, so this factor is just slightly less than 1 (e.g. 0.99); each iteration
first shrinks θj a little and then performs the usual gradient descent update.
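
A minimal Octave sketch of this update rule (not from the slides; the function name, fixed iteration count, and variable names are assumptions):

% Sketch: batch gradient descent for regularized linear regression.
% X is m x (n+1) with a leading column of ones, y is m x 1, theta is (n+1) x 1.
function theta = gradientDescentReg(X, y, theta, alpha, lambda, num_iters)
  m = length(y);
  for iter = 1:num_iters
    grad = (X' * (X * theta - y)) / m;      % unregularized gradient, (n+1) x 1
    reg  = (lambda / m) * theta;            % regularization term for every theta_j ...
    reg(1) = 0;                             % ... except theta_0 (stored in theta(1))
    theta = theta - alpha * (grad + reg);   % simultaneous update of all parameters
  end
end

Setting lambda = 0 recovers ordinary gradient descent for linear regression.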
Normal equation

Without regularization:

  \theta = \left( X^T X \right)^{-1} X^T y

With regularization, minimizing J(θ) in closed form gives

  \theta = \left( X^T X + \lambda M \right)^{-1} X^T y

where M is the (n+1) x (n+1) identity matrix with its top-left entry set to 0,
so that θ0 is not penalized.

Non-invertibility (optional/advanced).
Suppose m ≤ n (#examples ≤ #features).
Then X^T X is non-invertible / singular.
If λ > 0, the matrix X^T X + λM is invertible, so regularization also takes care
of the non-invertibility issue.
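
A hedged Octave sketch of this closed-form solution (the function name is an assumption):

% Sketch: regularized normal equation.
% X is m x (n+1) with a leading column of ones, y is m x 1.
function theta = normalEqnReg(X, y, lambda)
  p = size(X, 2);                             % p = n + 1 parameters
  M = eye(p);
  M(1, 1) = 0;                                % do not regularize the intercept theta_0
  theta = (X' * X + lambda * M) \ (X' * y);   % solve the regularized linear system
end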
Regularized logistic regression

Regularized logistic regression.

[Figure: a very wiggly decision boundary in the (x1, x2) plane, produced by a high-order polynomial hypothesis that overfits]

Cost function:

  J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
Gradient descent

Repeat {
  \theta_0 := \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}
  \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]    (j = 1, 2, ..., n)   (regularized)
}

This looks identical to the update for regularized linear regression, but here
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} (the sigmoid applied to θᵀx), so it
is a different algorithm.
Advanced optimization

function [jVal, gradient] = costFunction(theta)
  jVal = [ code to compute J(theta) ];
  gradient(1) = [ code to compute the partial derivative of J w.r.t. theta_0 ];
  gradient(2) = [ code to compute the partial derivative of J w.r.t. theta_1 ];
  gradient(3) = [ code to compute the partial derivative of J w.r.t. theta_2 ];
  ...
  gradient(n+1) = [ code to compute the partial derivative of J w.r.t. theta_n ];
end

(Octave indexes from 1, so gradient(1) corresponds to θ0 and gradient(n+1) to θn.)
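
As a concrete (hedged) sketch, this is how the template might be filled in for regularized logistic regression and passed to Octave's fminunc; the function name and variable names are assumptions:

% Sketch: regularized logistic regression cost and gradient for fminunc.
% X is m x (n+1) with a leading column of ones, y is m x 1 with 0/1 labels.
function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                 % sigmoid hypothesis h_theta(x)
  reg_theta = [0; theta(2:end)];                  % exclude theta_0 from the penalty
  jVal = (-y' * log(h) - (1 - y)' * log(1 - h)) / m ...
         + (lambda / (2 * m)) * sum(reg_theta .^ 2);
  gradient = (X' * (h - y)) / m + (lambda / m) * reg_theta;
end

% Example usage (X, y, lambda assumed to be defined elsewhere):
% options = optimset('GradObj', 'on', 'MaxIter', 400);
% initialTheta = zeros(size(X, 2), 1);
% [optTheta, finalCost, exitFlag] = ...
%     fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options);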

