
Linear Regression

Machine Learning and Pattern Recognition


(Largely based on slides from Andrew Ng)

Prof. Sandra Avila


Institute of Computing (IC/Unicamp)

MC886/MO444, August 11, 2017


House Price Prediction

$ 70 000
https://www.youtube.com/watch?v=IpGxLWOIZy4

$ 160 000

???

[Figure: house prices — Price (in $ 10 000’s) vs. Size (feet²)]
Linear Regression

[Figure: the same data, Price (in $ 10 000’s) vs. Size (feet²), with a straight line fitted through the training points (×)]
Today’s Agenda
● Linear Regression with One Variable
○ Model Representation
○ Cost Function
○ Gradient Descent

● Linear Regression with Multiple Variables
○ Gradient Descent for Multiple Variables
○ Feature Scaling
○ Learning Rate
Model Representation
https://www.kaggle.com/harlfoxem/housesalesprediction
[Figure: Housing Prices — Price (in 1000’s of dollars) vs. Size (feet²), scatter of the training data over roughly 0–5000 feet²]

Supervised Learning: given the “right answer” for each example in the data.

Regression Problem: predict real-valued output.
Training set of housing prices:

  Size in feet² (x)      Price ($) in 1000’s (y)
  2104                   460
  1416                   232
  1534                   315
  852                    178
  ...                    ...

Notation:
m = number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
(x^(i), y^(i)) = the i-th training example
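To make the notation concrete, here is a minimal sketch in Python/NumPy (not part of the original slides) that stores this training set as arrays; the variable names x, y and m follow the notation above.

import numpy as np

# Training set of housing prices (from the table above).
x = np.array([2104, 1416, 1534, 852])   # size in feet^2 (input feature)
y = np.array([460, 232, 315, 178])      # price in $1000's (target)

m = len(x)                              # m = number of training examples
print(m)                                # -> 4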
Training set  →  Learning algorithm  →  h (hypothesis)

Size of house (x)  →  h  →  Estimated price

h maps from x’s to y’s. How do we represent h?

h_θ(x) = θ₀ + θ₁·x

[Figure: training data (×) in the x–y plane with the straight line h_θ(x) drawn through it]

Linear regression with one variable.
Univariate linear regression.
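As a small illustration of this model, a hedged Python sketch of the hypothesis (the function name h and the example parameter values are mine, chosen only for demonstration):

def h(x, theta0, theta1):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Example: with theta0 = 50 and theta1 = 0.06 (values that appear later in the slides),
# a 1416 ft^2 house is estimated at 50 + 0.06 * 1416 = 134.96, i.e. about $135k.
print(h(1416, 50, 0.06))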
Cost Function
Training Set:

  Size in feet² (x)      Price ($) in 1000’s (y)
  2104                   460
  1416                   232
  1534                   315
  852                    178
  ...                    ...

Hypothesis:  h_θ(x) = θ₀ + θ₁·x

θᵢ’s: parameters of the model.

How do we choose the θᵢ’s?
h_θ(x) = θ₀ + θ₁·x

[Figure: three example hypotheses plotted for x, y ∈ [0, 3] —
  θ₀ = 1.5, θ₁ = 0:    a horizontal line at height 1.5;
  θ₀ = 0,   θ₁ = 0.5:  a line through the origin with slope 0.5;
  θ₀ = 1,   θ₁ = 0.5:  a line with intercept 1 and slope 0.5]
[Figure: training data (×) with a candidate line h_θ(x) = θ₀ + θ₁·x drawn through it]

Idea: Choose θ₀, θ₁ so that h_θ(x) is close to y for our training examples (x, y).

minimize over θ₀, θ₁:   (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

J(θ₀, θ₁) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Cost function (Squared error function).
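A direct translation of the squared-error cost into Python (a sketch; the function name compute_cost is my own, and the example θ values are arbitrary):

import numpy as np

def compute_cost(x, y, theta0, theta1):
    """J(theta0, theta1) = (1 / 2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(x)
    predictions = theta0 + theta1 * x          # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Cost of an arbitrary candidate line on the housing training set shown earlier.
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(compute_cost(x, y, 0.0, 0.2))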
Cost Function
Intuition I

Hypothesis:      h_θ(x) = θ₀ + θ₁·x
Parameters:      θ₀, θ₁
Cost Function:   J(θ₀, θ₁) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Goal:            minimize over θ₀, θ₁ of J(θ₀, θ₁)

Simplified version (set θ₀ = 0):

Hypothesis:      h_θ(x) = θ₁·x
Parameters:      θ₁
Cost Function:   J(θ₁) = (1/2m) · Σ_{i=1}^{m} (θ₁·x^(i) − y^(i))²
Goal:            minimize over θ₁ of J(θ₁)
h_θ(x)                                              J(θ₁)
(for fixed θ₁, this is a function of x)             (function of the parameter θ₁)

[Figure: left — a toy training set of three points with the line h_θ(x) = θ₁·x drawn for θ₁ = 1, θ₁ = 0.5 and θ₁ = 0;
 right — the corresponding cost values plotted against θ₁ (with J(1) = 0 for the perfectly fitting line), tracing out a bowl-shaped curve whose minimum is at θ₁ = 1]
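The bowl shape of J(θ₁) can be reproduced numerically. This Python sketch assumes the three toy points (1, 1), (2, 2), (3, 3) suggested by the plot; those exact coordinates are my assumption:

import numpy as np

# Toy training set assumed from the plot: (1, 1), (2, 2), (3, 3).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def J(theta1):
    """Cost of the simplified hypothesis h_theta(x) = theta1 * x."""
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for theta1 in [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"J({theta1}) = {J(theta1):.3f}")   # J(1.0) = 0, larger on either side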
Cost Function
Intuition II
h_θ(x)                                              J(θ₀, θ₁)
(for fixed θ₀, θ₁, this is a function of x)         (function of the parameters θ₀, θ₁)

[Figure: left — housing data, Price ($) in 1000’s vs. Size (feet²), with example lines such as h_θ(x) = 50 + 0.06·x (θ₀ = 50, θ₁ = 0.06);
 right — J(θ₀, θ₁) plotted over the (θ₀, θ₁) plane as a contour plot; each line on the left corresponds to a single point (θ₀, θ₁) on the right]
Gradient Descent

Have some function J(θ₀, θ₁).
Want min over θ₀, θ₁ of J(θ₀, θ₁).

Outline:
● Start with some θ₀, θ₁
● Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum
Gradient Descent algorithm

repeat until convergence {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)        (simultaneously update j = 0 and j = 1)
}

α is the learning rate; ∂/∂θⱼ J(θ₀, θ₁) is the derivative term.

Correct (simultaneous update):               Incorrect:
temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)            temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)
temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)            θ₀ := temp0
θ₀ := temp0                                  temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)
θ₁ := temp1                                  θ₁ := temp1
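A small Python sketch of one simultaneous update. The partial derivatives are passed in as functions, and the example cost J(θ₀, θ₁) = θ₀² + θ₁² is made up purely to have something concrete to minimize:

def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One simultaneous update of (theta0, theta1).

    dJ_dtheta0 and dJ_dtheta1 are functions returning the partial
    derivatives of J at the current parameters (assumed given).
    """
    # Compute both new values from the *old* parameters first...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...then assign, so theta0 is not overwritten before it is used for theta1.
    return temp0, temp1

# Example with J(theta0, theta1) = theta0**2 + theta1**2 (gradients 2*theta0, 2*theta1).
t0, t1 = 3.0, -2.0
for _ in range(5):
    t0, t1 = gradient_descent_step(t0, t1, 0.1,
                                   lambda a, b: 2 * a,
                                   lambda a, b: 2 * b)
print(t0, t1)   # both shrink toward the minimum at (0, 0)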
Gradient descent intuition with a single parameter θ₁ ∈ ℝ:

θ₁ := θ₁ − α · d/dθ₁ J(θ₁)

[Figure: J(θ₁) as a bowl-shaped curve, with the update illustrated at points on either side of the minimum]

If the slope d/dθ₁ J(θ₁) ≥ 0 (current point to the right of the minimum):
    θ₁ := θ₁ − α · (positive number), so θ₁ decreases and moves toward the minimum.

If the slope d/dθ₁ J(θ₁) ≤ 0 (current point to the left of the minimum):
    θ₁ := θ₁ − α · (negative number), so θ₁ increases and again moves toward the minimum.
If α is too small, gradient descent can be slow.

[Figure: J(θ₁) with many tiny gradient descent steps creeping toward the minimum]

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

[Figure: J(θ₁) with steps that jump back and forth across the minimum, moving farther away each time]
What will one step of gradient descent, θ₁ := θ₁ − α · d/dθ₁ J(θ₁), do if θ₁ is already at a local optimum?

At a local optimum the slope is zero: d/dθ₁ J(θ₁) = 0, so the update is θ₁ := θ₁ − α · 0 and θ₁ is left unchanged.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.

As we approach a local minimum, the derivative d/dθ₁ J(θ₁) shrinks, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.

[Figure: successive gradient descent steps on J(θ₁) that get smaller and smaller as they approach the minimum]
Gradient Descent algorithm:

repeat until convergence {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)        (for j = 0 and j = 1)
}

Linear Regression Model:

h_θ(x) = θ₀ + θ₁·x
J(θ₀, θ₁) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Working out the derivative term:

∂/∂θⱼ J(θ₀, θ₁) = ∂/∂θⱼ [ (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² ]
                = ∂/∂θⱼ [ (1/2m) · Σ_{i=1}^{m} (θ₀ + θ₁·x^(i) − y^(i))² ]

j = 0:   ∂/∂θ₀ J(θ₀, θ₁) = (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
j = 1:   ∂/∂θ₁ J(θ₀, θ₁) = (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
Gradient Descent algorithm (for linear regression):

repeat until convergence {
    θ₀ := θ₀ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ₁ := θ₁ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
(update θ₀ and θ₁ simultaneously)
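Putting the two update rules together, a minimal batch gradient descent for univariate linear regression in Python (a sketch; the function name is mine, and I rescale the size feature to 1000’s of feet² only so that a single fixed α behaves well, anticipating the feature-scaling discussion later):

import numpy as np

def gradient_descent(x, y, alpha=0.3, num_iters=2000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x.

    Each iteration uses *all* m training examples ("batch"), exactly as in
    the update rules above. Returns the learned (theta0, theta1).
    """
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = theta0 + theta1 * x - y                   # h_theta(x^(i)) - y^(i)
        temp0 = theta0 - alpha * np.sum(error) / m        # simultaneous update
        temp1 = theta1 - alpha * np.sum(error * x) / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Housing data from the slides, with size rescaled to 1000's of feet^2
# (my own choice, just to keep one fixed alpha numerically well behaved).
x = np.array([2104, 1416, 1534, 852]) / 1000.0
y = np.array([460.0, 232.0, 315.0, 178.0])
print(gradient_descent(x, y))   # fitted intercept and slope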
h_θ(x)                                              J(θ₀, θ₁)
(for fixed θ₀, θ₁, this is a function of x)         (function of the parameters θ₀, θ₁)

[Figure: a sequence of gradient descent steps on the housing data — on the right, the point (θ₀, θ₁) moves step by step across the contour plot of J(θ₀, θ₁) toward its minimum; on the left, the corresponding line h_θ(x) (Price $ in 1000’s vs. Size in feet²) fits the training data better and better at each step]
“Batch” Gradient Descent

“Batch”: each step of gradient descent uses all the training examples.
Linear Regression
with multiple variables
Multiple Variables (Features)

  Size in feet² (x)      Price ($) in 1000’s (y)
  2104                   460
  1416                   232
  1534                   315
  852                    178
  ...                    ...

h_θ(x) = θ₀ + θ₁·x
Multiple Variables (Features)

  Size in feet²   Number of       Number of     Age of home     Price ($) in
  (x₁)            bedrooms (x₂)   floors (x₃)   in years (x₄)   1000’s (y)
  2104            5               1             45              460
  1416            3               2             40              232
  1534            3               2             30              315
  852             2               2             36              178
  ...             ...             ...           ...             ...

Notation:
n = number of features
x^(i) = the input (features) of the i-th training example
xⱼ^(i) = the value of feature j in the i-th training example
Hypothesis

Previously:  h_θ(x) = θ₀ + θ₁·x

With multiple features:  h_θ(x) = θ₀ + θ₁·x₁ + θ₂·x₂ + θ₃·x₃ + θ₄·x₄

E.g.  h_θ(x) = 80 + 0.1·x₁ + 10·x₂ + 3·x₃ − 2·x₄
h_θ(x) = θ₀ + θ₁·x₁ + θ₂·x₂ + ... + θₙ·xₙ

For convenience of notation, define x₀ = 1. Then

x = [x₀, x₁, ..., xₙ]ᵀ ∈ ℝⁿ⁺¹        θ = [θ₀, θ₁, ..., θₙ]ᵀ ∈ ℝⁿ⁺¹

h_θ(x) = θᵀx

Multivariate linear regression.

Hypothesis:      h_θ(x) = θᵀx = θ₀·x₀ + θ₁·x₁ + ... + θₙ·xₙ
Parameters:      θ₀, θ₁, ..., θₙ (collected in the vector θ)
Cost Function:   J(θ) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Gradient Descent:
repeat {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ)
}   (simultaneously update θⱼ for every j = 0, ..., n)
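With x₀ = 1 prepended, the hypothesis is a single dot product. A short Python sketch (the concrete θ values are the illustrative coefficients from the example above, not learned values):

import numpy as np

# One training example from the table: size, bedrooms, floors, age.
features = np.array([2104.0, 5.0, 1.0, 45.0])

# Prepend x0 = 1 so that theta0 is multiplied by 1.
x = np.concatenate(([1.0], features))            # x in R^(n+1)

# Parameter vector theta in R^(n+1), using the illustrative coefficients above.
theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])

print(theta @ x)                                 # h_theta(x) = theta^T x = 253.4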


Gradient Descent

Previously (n = 1):
repeat {
    θ₀ := θ₀ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ₁ := θ₁ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
    (simultaneously update θ₀, θ₁)
}

New algorithm (n ≥ 1):
repeat {
    θⱼ := θⱼ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · xⱼ^(i)
    (simultaneously update θⱼ for j = 0, ..., n)
}

Written out for the first few parameters (with x₀^(i) = 1):
    θ₀ := θ₀ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x₀^(i)
    θ₁ := θ₁ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x₁^(i)
    θ₂ := θ₂ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x₂^(i)
    ...
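The update of every θⱼ can be written as one matrix operation. A Python sketch under the assumption that the features are already on similar scales (the function name and the toy data are mine):

import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for h_theta(x) = theta^T x with n >= 1 features.

    X has shape (m, n+1) with a leading column of ones (x0 = 1); every
    theta_j is updated simultaneously from the same prediction errors.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        error = X @ theta - y                    # h_theta(x^(i)) - y^(i), all i at once
        theta = theta - alpha * (X.T @ error) / m
    return theta

# Tiny illustrative problem: two already-scaled features plus the x0 = 1 column.
X = np.array([[1.0, 0.5, 0.2],
              [1.0, 0.3, 0.8],
              [1.0, 0.9, 0.4],
              [1.0, 0.1, 0.6]])
y = np.array([2.0, 1.5, 2.8, 1.1])
print(gradient_descent_multi(X, y))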
Feature Scaling

Idea: Make sure features are on a similar scale. When features differ wildly in scale, the contours of J(θ) become long and thin and gradient descent can take many steps to reach the minimum; after scaling, it converges much faster.

E.g.  x₁ = size (0−2000 feet²)
      x₂ = number of bedrooms (1−5)

Rescale:
      x₁ = size (feet²) / 2000
      x₂ = number of bedrooms / 5

Get every feature into approximately a −1 ≤ xᵢ ≤ 1 range.
Mean Normalization

Replace xᵢ with xᵢ − μᵢ to make features have approximately zero mean (do not apply to x₀ = 1).

E.g.  x₁ = (size − 1000) / 2000             −0.5 ≤ x₁ ≤ 0.5
      x₂ = (#bedrooms − 2.5) / 5            −0.5 ≤ x₂ ≤ 0.5

In general:  xᵢ := (xᵢ − μᵢ) / sᵢ
where μᵢ is the mean of feature i on the training set and sᵢ is its range (max − min) or its standard deviation.
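A Python sketch of mean normalization, taking sᵢ to be the feature’s range, one of the two options above (the helper name mean_normalize is mine):

import numpy as np

def mean_normalize(X):
    """Scale each feature column to roughly [-0.5, 0.5]: (x - mean) / range.

    Apply this to the raw features only, *not* to the x0 = 1 column.
    """
    mu = X.mean(axis=0)                       # mean of each feature
    s = X.max(axis=0) - X.min(axis=0)         # range (max - min) of each feature
    return (X - mu) / s, mu, s

# Illustrative raw features: size in feet^2 and number of bedrooms.
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
X_norm, mu, s = mean_normalize(X)
print(X_norm)      # each column now has zero mean and spans about one unit

# The same mu and s must be reused to scale any new example at prediction time.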
Learning Rate
Gradient Descent

● “Debugging”: How to make sure gradient descent is working correctly.
● How to choose the learning rate α.
Making sure gradient descent is working correctly.

[Figure: J(θ) plotted against the number of iterations (0, 100, 200, 300, ...); J(θ) should decrease on every iteration and flatten out as gradient descent converges]

Example automatic convergence test: declare convergence if J(θ) decreases by less than 10⁻³ in one iteration.
Making sure gradient descent is working correctly.

[Figure: plots of J(θ) against the number of iterations in which J(θ) is increasing, or repeatedly going up and down — gradient descent is not working; use a smaller α]

- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
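A Python sketch of this “debugging” recipe: record J(θ) after every iteration and stop once the decrease falls below the 10⁻³ threshold from the slide (the function name and the toy data are mine):

import numpy as np

def gradient_descent_with_history(X, y, alpha, max_iters=10000, tol=1e-3):
    """Gradient descent that records J(theta) at every iteration.

    Stops when J(theta) decreases by less than `tol` in one iteration
    (the automatic convergence test above). A *negative* decrease, i.e.
    J(theta) going up, also triggers the stop and signals that alpha
    is too large.
    """
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(max_iters):
        error = X @ theta - y
        cost = np.sum(error ** 2) / (2 * m)
        if history and history[-1] - cost < tol:
            break
        history.append(cost)
        theta = theta - alpha * (X.T @ error) / m
    return theta, history

# Tiny usage example on made-up, already-scaled data; plotting `history`
# against the iteration number gives the J(theta) curve discussed above.
X = np.array([[1.0, 0.2], [1.0, 0.7], [1.0, 0.5]])
y = np.array([1.0, 2.0, 1.6])
theta, history = gradient_descent_with_history(X, y, alpha=0.3)
print(len(history), theta)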
Summary

- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.

To choose α, try
…, 0.001, …, 0.01, …, 0.1, …, 1, …
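The suggested grid of α values can be tried directly. A Python sketch (the toy data and the intermediate α values filling in the ellipses are my own choices):

import numpy as np

# Made-up, already-scaled training data for the demonstration.
X = np.array([[1.0, 0.2], [1.0, 0.7], [1.0, 0.5], [1.0, 0.9]])
y = np.array([1.0, 2.0, 1.6, 2.5])
m = len(y)

def final_cost(alpha, num_iters=200):
    """Run a fixed number of gradient descent steps and return the final J(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        error = X @ theta - y
        theta = theta - alpha * (X.T @ error) / m
    return np.sum((X @ theta - y) ** 2) / (2 * m)

# Grid in the spirit of the slide, with intermediate values of my choosing;
# the alpha that drives J(theta) lowest within the same budget is the better pick.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    print(f"alpha = {alpha:<6} final J = {final_cost(alpha):.6f}")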
References

Machine Learning Books

● Hands-On Machine Learning with Scikit-Learn and TensorFlow, Chap. 2 & 4


● Pattern Recognition and Machine Learning, Chap. 3
● Machine Learning: a Probabilistic Perspective, Chap. 7

Machine Learning Courses

● https://www.coursera.org/learn/machine-learning, Week 1 & 2
