Regression
Linear Regression
• Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear method
for regression.
• Linear regression finds the parameters w and b that minimize the mean squared error between
predictions and the true regression targets, y, on the training set.
• The mean squared error is the average of the squared differences between the predictions and the
true values.
• With higher-dimensional datasets, linear models become more powerful, and there is a higher
chance of overfitting.
• We should try to find a model that allows us to control complexity (a fitting sketch follows below).
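As a concrete illustration of the bullets above, here is a minimal OLS fitting sketch. It assumes Python with NumPy and scikit-learn (the text names no library) and uses made-up housing-style numbers:

```python
# Minimal OLS sketch. scikit-learn and the toy numbers are assumptions;
# the text itself names no library. Finds w and b minimizing the MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.array([[2104.], [1416.], [1534.], [852.]])  # size in feet^2
y = np.array([460., 232., 315., 178.])             # price in $1000's

model = LinearRegression().fit(X, y)
print("w (slope):", model.coef_, "b (intercept):", model.intercept_)
print("training MSE:", mean_squared_error(y, model.predict(X)))
```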
Linear regression with one variable: Model representation
[Figure: Housing Prices (Portland, OR): price ($) vs. size (feet²)]
Supervised learning: the "right answer" is given for each example in the data. Regression problem: predict a real-valued output.
Training set of housing prices (Portland, OR):

| Size in feet² (x) | Price ($) in 1000's (y) |
|---|---|
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |
| … | … |
Notation:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
[Diagram: Training set → Learning algorithm → hypothesis $h$; the size of a house ($x$) is fed into $h$, which outputs the estimated price ($y$)]
$h$ maps from $x$'s to $y$'s. Linear regression with one variable is also called univariate linear regression.
Linear regression with one variable: Cost function
Training set (m = 47):

| Size in feet² (x) | Price ($) in 1000's (y) |
|---|---|
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |
| … | … |
Hypothesis: $h_w(x) = w_0 + w_1 x$
$w_0, w_1$: parameters
How to choose $w_0, w_1$?
[Figure: three example hypotheses $h_w(x)$ for different choices of the parameters $w_0, w_1$]
Parameters: $w_0, w_1$
Cost function: $J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2$
Goal: $\min_{w_0, w_1} J(w_0, w_1)$
[Figure, repeated for several values of $w_1$: left, $h_w(x)$ for fixed $w_1$ (a function of $x$); right, the cost $J(w_1)$ (a function of the parameter $w_1$). Each choice of $w_1$ gives one line in the left plot and one point in the right plot; $J(w_1)$ is smallest where the line fits the data best.]
Linear regression with one variable: Cost function intuition II
Hypothesis: $h_w(x) = w_0 + w_1 x$
Parameters: $w_0, w_1$
Cost function: $J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2$
Goal: $\min_{w_0, w_1} J(w_0, w_1)$
[Figure, repeated for several choices of $(w_0, w_1)$: left, $h_w(x)$ on the housing data (price vs. size in feet²) for fixed $w_0, w_1$ (a function of $x$); right, $J(w_0, w_1)$ as a surface/contour plot (a function of the parameters). Each line on the left corresponds to one point on the right.]
Linear regression with one variable: Gradient descent
Have some function $J(w_0, w_1)$.
Want $\min_{w_0, w_1} J(w_0, w_1)$.
Outline:
• Start with some $w_0, w_1$.
• Keep changing $w_0, w_1$ to reduce $J(w_0, w_1)$, until we hopefully end up at a minimum.
[Figure: two surface plots of $J(w_0, w_1)$; starting gradient descent from different initial points can lead to different local minima]
Gradient descent algorithm:
repeat until convergence { $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w_0, w_1)$, for $j = 0$ and $j = 1$ }
Here $\alpha$ is the learning rate, and the derivative term is the slope of $J$ at the current value of $w_j$.
Gradient descent can converge to a local minimum even with the learning rate $\alpha$ fixed: as we approach a local minimum, the derivative shrinks, so gradient descent automatically takes smaller steps. There is no need to decrease $\alpha$ over time.
Linear regression with one variable: Gradient descent for linear regression
Applying the gradient descent algorithm to the linear regression model, the derivatives work out to:
repeat until convergence {
$w_0 := w_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)$
$w_1 := w_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right) x^{(i)}$
} (update $w_0$ and $w_1$ simultaneously)
[Figure: for linear regression, $J(w_0, w_1)$ is a convex bowl, so gradient descent reaches the global minimum]
[Figure, one frame per gradient descent step: left, the current fit $h_w(x)$ (for fixed $w_0, w_1$, a function of $x$); right, the corresponding point descending the contour plot of $J(w_0, w_1)$ (a function of the parameters)]
"Batch" Gradient Descent: each step of gradient descent uses all $m$ training examples.
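A minimal sketch of these batch updates for the univariate model, in Python/NumPy (the language, learning rate, and toy data are illustrative assumptions, not from the slides):

```python
# Batch gradient descent for univariate linear regression (NumPy; the
# learning rate and toy data are illustrative). Each step uses all m examples.
import numpy as np

x = np.array([2104., 1416., 1534., 852.]) / 1000.0  # size in 1000s of feet^2
y = np.array([460., 232., 315., 178.])              # price in $1000's
m = len(x)

w0, w1 = 0.0, 0.0   # start with some (w0, w1)
alpha = 0.3         # learning rate

for _ in range(1000):
    h = w0 + w1 * x                           # current hypothesis h_w(x)
    grad0 = (1.0 / m) * np.sum(h - y)         # dJ/dw0
    grad1 = (1.0 / m) * np.sum((h - y) * x)   # dJ/dw1
    w0, w1 = w0 - alpha * grad0, w1 - alpha * grad1  # simultaneous update

print(f"h_w(x) = {w0:.1f} + {w1:.1f} * x")
```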
Multiple features (variables)
Previously, with a single feature (size):
| Size (feet²) (x) | Price ($1000) (y) |
|---|---|
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |
| … | … |
Now, with multiple features (variables):

| Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000) |
|---|---|---|---|---|
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 3 | 2 | 30 | 315 |
| 852 | 2 | 1 | 36 | 178 |
| … | … | … | … | … |
Notation:
$n$ = number of features
$x^{(i)}$ = input (features) of the $i$-th training example
$x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
For convenience of notation, define $x_0 = 1$, so each input becomes a $1 \times (n+1)$ vector $x = (x_0, x_1, \dots, x_n)$.
Multivariate linear regression.
Linear Regression with multiple variables
Hypothesis: $h_w(x) = w_0 x_0 + w_1 x_1 + \dots + w_n x_n = w^T x$ (with $x_0 = 1$)
Parameters: $w = (w_0, w_1, \dots, w_n)$
Cost function: $J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2$
Gradient descent:
Repeat { $w_j := w_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$ }
(simultaneously update $w_j$ for $j = 0, \dots, n$)
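The same update, vectorized over all $n+1$ parameters; again a sketch in NumPy with assumed toy data, scaling the features first so gradient descent behaves well:

```python
# Vectorized gradient descent for multivariate linear regression (NumPy,
# toy data). Features are scaled first, then x0 = 1 is prepended.
import numpy as np

# size (feet^2), bedrooms, floors, age -> price ($1000's)
X = np.array([[2104, 5, 1, 45],
              [1416, 3, 2, 40],
              [1534, 3, 2, 30],
              [ 852, 2, 1, 36]], dtype=float)
y = np.array([460., 232., 315., 178.])

X = (X - X.mean(axis=0)) / X.std(axis=0)      # feature scaling
X = np.hstack([np.ones((X.shape[0], 1)), X])  # add the x0 = 1 column
m = X.shape[0]

w = np.zeros(X.shape[1])
alpha = 0.1
for _ in range(2000):
    grad = (1.0 / m) * X.T @ (X @ w - y)  # (1/m) sum (h_w(x)-y) * x_j, all j
    w -= alpha * grad                     # simultaneous update of every w_j

print("w =", w)
```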
Linear Regression with multiple variables: Gradient descent in practice I (Feature Scaling)
Feature Scaling
Idea: make sure features are on a similar scale.
E.g. $x_1$ = size (0-2000 feet²), $x_2$ = number of bedrooms (1-5). With such different ranges, the contours of $J(w)$ are long and thin, and gradient descent converges slowly.
Get every feature into approximately a $-1 \le x_j \le 1$ range.
Mean normalization: replace $x_j$ with $x_j - \mu_j$ so features have approximately zero mean (do not apply to $x_0 = 1$). In full: $x_j := (x_j - \mu_j)/s_j$, where $\mu_j$ is the mean of feature $j$ and $s_j$ is its range or standard deviation (see the sketch below).
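A sketch of mean normalization in NumPy (the data, and the choice of range vs. standard deviation for $s_j$, are illustrative):

```python
# Mean normalization sketch (NumPy): x_j := (x_j - mu_j) / s_j, skipping x0.
import numpy as np

X = np.array([[2104, 5],   # size (feet^2), number of bedrooms
              [1416, 3],
              [1534, 3],
              [ 852, 2]], dtype=float)

mu = X.mean(axis=0)                # per-feature mean mu_j
s = X.max(axis=0) - X.min(axis=0)  # per-feature range (std dev also works)
X_scaled = (X - mu) / s

print(X_scaled)  # columns now have mean 0 and lie roughly within [-1, 1]
```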
Linear Regression with multiple variables: Gradient descent in practice II (Learning rate)
Making sure gradient descent is working correctly: plot $J(w)$ against the number of iterations; $J(w)$ should decrease after every iteration.
[Figure: $J(w)$ vs. no. of iterations (0-400), decreasing and flattening out as gradient descent converges]
Example automatic convergence test: declare convergence if $J(w)$ decreases by less than some small threshold (e.g. $10^{-3}$) in one iteration.
If $J(w)$ is increasing or oscillating, gradient descent is not working: use a smaller $\alpha$.
To choose $\alpha$, try a range of values spaced roughly 3x apart, e.g. 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1 (see the sketch below).
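A sketch of such a learning-rate sweep, assuming NumPy and a tiny synthetic dataset; it records $J(w)$ per iteration and applies the example convergence test:

```python
# Learning-rate sweep sketch (NumPy, synthetic data): run gradient descent
# with several values of alpha and track J(w) per iteration.
import numpy as np

def cost(X, y, w):
    m = len(y)
    return (1.0 / (2 * m)) * np.sum((X @ w - y) ** 2)

def gd_history(X, y, alpha, iters=400):
    m, n = X.shape
    w = np.zeros(n)
    history = [cost(X, y, w)]
    for _ in range(iters):
        w -= alpha * (1.0 / m) * X.T @ (X @ w - y)
        history.append(cost(X, y, w))
    return history

X = np.array([[1., 0.5], [1., 1.2], [1., 2.0], [1., 0.9]])  # x0 = 1, scaled x1
y = np.array([1.0, 2.1, 3.9, 1.8])

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3):
    h = gd_history(X, y, alpha)
    # example convergence test: J decreased by less than 1e-3 in one iteration
    converged = abs(h[-1] - h[-2]) < 1e-3
    print(f"alpha={alpha:<6} final J={h[-1]:.5f} converged={converged}")
# If J(w) ever increases from iteration to iteration, alpha is too large.
```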
Linear Regression with multiple variables: Features and polynomial regression
Housing prices prediction.
Polynomial regression: fit a nonlinear curve with the machinery of linear regression by treating powers of a feature as extra features, e.g. $h_w(x) = w_0 + w_1 x + w_2 x^2$ (quadratic) or $h_w(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ (cubic), where $x$ is the size.
[Figure: price (y) vs. size (x) with a polynomial fit]
Choice of features: other transformations can also serve as features, e.g. $\sqrt{x}$. Feature scaling becomes important here, since $x$, $x^2$, and $x^3$ have very different ranges.
[Figure: price (y) vs. size (x) under alternative feature choices]
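A polynomial-regression sketch, assuming scikit-learn's PolynomialFeatures to build the $x, x^2, x^3$ columns (the library choice and data are illustrative):

```python
# Polynomial regression sketch: build x, x^2, x^3 as features, scale them,
# and fit an ordinary linear model. scikit-learn is an assumed library choice.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[852.], [1416.], [1534.], [2104.]])  # size
y = np.array([178., 232., 315., 460.])             # price ($1000's)

cubic = PolynomialFeatures(degree=3, include_bias=False)
X_poly = cubic.fit_transform(x)  # columns: x, x^2, x^3
# Scaling matters here: x, x^2, and x^3 have wildly different ranges.
X_poly = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)

model = LinearRegression().fit(X_poly, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```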
Linear Regression with multiple variables: Normal equation
Gradient descent: repeat $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$ (for every $j$) until convergence.
Normal equation: solve for $w$ analytically in one step, by setting every partial derivative $\frac{\partial}{\partial w_j} J(w)$ to 0.
Examples ($m = 4$ training examples, $n = 4$ features, with $x_0 = 1$ prepended):

| $x_0$ | Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000) |
|---|---|---|---|---|---|
| 1 | 2104 | 5 | 1 | 45 | 460 |
| 1 | 1416 | 3 | 2 | 40 | 232 |
| 1 | 1534 | 3 | 2 | 30 | 315 |
| 1 | 852 | 2 | 1 | 36 | 178 |

Stack the inputs (one example per row) into the $m \times (n+1)$ design matrix $X$ and the targets into the $m$-vector $y$.
Then $w = (X^T X)^{-1} X^T y$, where $(X^T X)^{-1}$ is the inverse of the matrix $X^T X$.
Octave: pinv(X'*X)*X'*y
With $m$ training examples and $n$ features:

| Gradient Descent | Normal Equation |
|---|---|
| Need to choose $\alpha$. | No need to choose $\alpha$. |
| Needs many iterations. | Don't need to iterate. |
| Works well even when $n$ is large. | Need to compute $(X^T X)^{-1}$; slow if $n$ is very large. |
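A NumPy rendering of the Octave one-liner above, on the small example table (illustrative data):

```python
# Normal equation in NumPy, mirroring the Octave one-liner pinv(X'*X)*X'*y:
# solve for w directly, with no iterations and no learning rate.
import numpy as np

X = np.array([[1, 2104, 5, 1, 45],   # first column is x0 = 1
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460., 232., 315., 178.])

# pinv also copes when X'X is singular (here m < n+1), returning a
# minimum-norm least-squares solution.
w = np.linalg.pinv(X.T @ X) @ X.T @ y
print("w =", w)
```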
Ridge Regression (L2 Regularization)
• We also want the magnitude of the coefficients to be as small as possible (small slopes).
• Intuitively, this means each feature should have as little effect on the outcome as possible, while still predicting well.
• This constraint is an example of what is called regularization: explicitly restricting a model to avoid overfitting.
• Ridge adds the penalty $\alpha \sum_j w_j^2$ (the squared L2 norm of the coefficients) to the least-squares objective; the regularization parameter alpha controls how strongly the coefficients are shrunk (see the sketch below).
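A Ridge sketch, assuming scikit-learn (the text's parameter name alpha matches scikit-learn's, but the library itself is an assumption):

```python
# Ridge regression sketch (scikit-learn assumed): alpha sets the strength
# of the L2 penalty; larger alpha shrinks the coefficients more.
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[2104, 5, 1, 45],
              [1416, 3, 2, 40],
              [1534, 3, 2, 30],
              [ 852, 2, 1, 36]], dtype=float)
y = np.array([460., 232., 315., 178.])

for alpha in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:<5} coefficients={ridge.coef_}")
```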
Lasso (Least Absolute Shrinkage and Selection Operator) (L1 Regularization)
• The lasso instead penalizes the L1 norm of the coefficients, $\alpha \sum_j |w_j|$.
• A consequence of the L1 penalty is that some coefficients become exactly zero, so the lasso performs automatic feature selection (see the sketch below).
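A matching Lasso sketch under the same scikit-learn assumption:

```python
# Lasso sketch (scikit-learn assumed): the L1 penalty can drive some
# coefficients exactly to zero, i.e. automatic feature selection.
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[2104, 5, 1, 45],
              [1416, 3, 2, 40],
              [1534, 3, 2, 30],
              [ 852, 2, 1, 36]], dtype=float)
y = np.array([460., 232., 315., 178.])

lasso = Lasso(alpha=1.0, max_iter=100000).fit(X, y)
print("coefficients:", lasso.coef_)  # some entries may be exactly 0
print("features used:", int(np.sum(lasso.coef_ != 0)))
```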
Logistic Regression
• For a binary classification problem, where $y \in \{0, 1\}$.
• Hypothesis: $h_w(x) = g(w^T x)$, where $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid (logistic) function, so $0 \le h_w(x) \le 1$.
• The task is to select parameters $w$ to fit the data.
[Figure: the sigmoid $g(z)$, rising from 0, crossing 0.5 at $z = 0$, and approaching 1]
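A one-function sketch of the sigmoid (plain NumPy, illustrative inputs):

```python
# Sigmoid (logistic) function: g(z) = 1 / (1 + e^(-z)) maps any real z
# into (0, 1), so h_w(x) = g(w^T x) can be read as a probability.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.00005, 0.5, 0.99995]
```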
• $h_w(x)$ = estimated probability that $y = 1$ on input $x$: $h_w(x) = P(y = 1 \mid x; w)$.
• For example, if $h_w(x) = 0.7$ for a patient's tumor features, the model estimates a 70% chance of the tumor being malignant.
Logistic Regression: Decision Boundary
Predict $y = 1$ if $w^T x \ge 0$, i.e. $g(w^T x) \ge 0.5$. For example, with $w_0 = -3$, $w_1 = w_2 = 1$ the boundary is the line $x_1 + x_2 = 3$: a linear decision boundary obtained from the logistic regression model.
[Figure: $x_2$ vs. $x_1$ with the line $x_1 + x_2 = 3$ separating the two classes]
Logistic Regression: Decision Boundary (Nonlinear Case)
Higher-order features give nonlinear boundaries, e.g. $h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2)$.
Predict $y = 1$ if $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 \ge 0$. For example, with $w = (-1, 0, 0, 1, 1)$ this means predicting $y = 1$ if $x_1^2 + x_2^2 \ge 1$.
[Figure: $x_2$ vs. $x_1$, both ranging over $[-1, 1]$; the unit circle $x_1^2 + x_2^2 = 1$ separates $y = 1$ (outside) from $y = 0$ (inside)]
Logistic Regression: Cost Function
The cost is what a learning algorithm (hypothesis) has to pay if its prediction is $h_w(x)$ when the actual label is $y$.
Linear regression model: $J(w) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(h_w(x^{(i)}) - y^{(i)}\right)^2$ (MSE: mean squared error), i.e. $\mathrm{Cost}(h_w(x), y) = \frac{1}{2}\left(h_w(x) - y\right)^2$.
If we use the same cost function for logistic regression, whose hypothesis is a nonlinear function, it will result in a nonconvex $J(w)$.
Gradient descent, $w^{(\text{new})} = w^{(\text{old})} - \alpha \nabla J(w^{(\text{old})})$, is not guaranteed to reach the global minimum of such a nonconvex $J(w)$, so we need a cost function that makes $J(w)$ convex.
Case $y = 1$: $\mathrm{Cost}(h_w(x), y) = -\log(h_w(x))$.
• Cost = 0 if $y = 1$ and $h_w(x) = 1$.
• But as $h_w(x) \to 0$, Cost $\to \infty$.
• That is, if $h_w(x) = 0$, the model asserts $P(y = 1 \mid x; w) = 0$; if in fact $y = 1$, we penalize the learning algorithm by a very large cost.
[Figure: Cost vs. $h_w(x) \in (0, 1]$ for $y = 1$]
Logistic Regression: Cost Function for Logistic Regression
Case $y = 0$: $\mathrm{Cost}(h_w(x), y) = -\log(1 - h_w(x))$.
Cost = 0 if $y = 0$ and $h_w(x) = 0$; as $h_w(x) \to 1$, Cost $\to \infty$.
[Figure: Cost vs. $h_w(x) \in [0, 1)$ for $y = 0$]
$J(w) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}\left(h_w(x^{(i)}), y^{(i)}\right)$
with $\mathrm{Cost}(h_w(x), y) = -\log(h_w(x))$ if $y = 1$ and $\mathrm{Cost}(h_w(x), y) = -\log(1 - h_w(x))$ if $y = 0$.
Since $y \in \{0, 1\}$, the two cases compress into one line:
$\mathrm{Cost}(h_w(x), y) = -y \log(h_w(x)) - (1 - y)\log(1 - h_w(x))$
so that
$J(w) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)} \log h_w(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_w(x^{(i)})\right)\right]$
• This cost function is derived in statistics from the idea of maximum likelihood estimation, which helps to efficiently find parameters for different models; unlike the squared-error cost, it makes $J(w)$ convex.
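A sketch of evaluating this $J(w)$ in NumPy (toy data; the clipping guard is an implementation detail added here to avoid log(0)):

```python
# Evaluating the logistic cost J(w) (NumPy, toy data): the one-line
# cross-entropy form, averaged over the m training examples.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, eps=1e-12):
    h = sigmoid(X @ w)
    h = np.clip(h, eps, 1 - eps)  # guard added here to avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1., 0.5], [1., 2.3], [1., 2.9], [1., 0.1]])  # x0 = 1
y = np.array([0., 1., 1., 0.])
print(logistic_cost(X, y, np.zeros(2)))  # log(2) ~ 0.6931 at w = 0
```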
• To fit the parameters $w$: $\min_w J(w)$. But how? Use gradient descent:
Repeat { $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w) = w_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$ }
(simultaneously update all $w_j$). The update rule looks identical to the one for linear regression, but here $h_w(x) = g(w^T x)$ is the sigmoid hypothesis (see the sketch below).
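A sketch answering the "but how?": batch gradient descent on the logistic cost, in NumPy with assumed toy data:

```python
# Batch gradient descent on the logistic cost (NumPy, toy data). The update
# has the same form as linear regression's, but h_w(x) = sigmoid(w^T x).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1., 0.5], [1., 2.3], [1., 2.9], [1., 0.1]])  # x0 = 1
y = np.array([0., 1., 1., 0.])
m = X.shape[0]

w = np.zeros(X.shape[1])
alpha = 0.5
for _ in range(5000):
    h = sigmoid(X @ w)
    w -= alpha * (1.0 / m) * X.T @ (h - y)  # w_j := w_j - a/m sum (h-y) x_j

print("w =", w)
print("P(y=1|x):", sigmoid(X @ w))  # low for the y=0 rows, high for y=1
```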