
Linear Regression

Machine Learning and Pattern Recognition


(Largely based on slides from Andrew Ng)

Prof. Sandra Avila


Institute of Computing (IC/Unicamp)

MC886/MO444, August 11, 2017


House Price Prediction

$ 70 000
https://www.youtube.com/watch?v=IpGxLWOIZy4

$ 160 000

???

[Figure: house prices — Price (in $ 10 000’s) vs. Size (feet²)]
Linear Regression

[Figure: the same data, Price (in $ 10 000’s) vs. Size (feet²), with a straight line fitted through the training points (×)]
Today’s Agenda
● Linear Regression with One Variable
○ Model Representation
○ Cost Function
○ Gradient Descent

● Linear Regression with Multiple Variables
○ Gradient Descent for Multiple Variables
○ Feature Scaling
○ Learning Rate
Model Representation
https://www.kaggle.com/harlfoxem/housesalesprediction
[Figure: Housing Prices — Price (in 1000’s of dollars) vs. Size (feet²), scatter of the training data over roughly 0–5000 feet²]

Supervised Learning: given the “right answer” for each example in the data.

Regression Problem: predict real-valued output.
Training set of housing prices:

  Size in feet² (x)      Price ($) in 1000’s (y)
  2104                   460
  1416                   232
  1534                   315
  852                    178
  ...                    ...

Notation:
m = number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
(x^(i), y^(i)) = the i-th training example
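To make the notation concrete, here is a minimal sketch in Python/NumPy (not part of the original slides) that stores this training set as arrays; the variable names x, y and m follow the notation above.

import numpy as np

# Training set of housing prices (from the table above).
x = np.array([2104, 1416, 1534, 852])   # size in feet^2 (input feature)
y = np.array([460, 232, 315, 178])      # price in $1000's (target)

m = len(x)                              # m = number of training examples
print(m)                                # -> 4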
Training set  →  Learning algorithm  →  h (hypothesis)

Size of house (x)  →  h  →  Estimated price

h maps from x’s to y’s. How do we represent h?

h_θ(x) = θ₀ + θ₁·x

[Figure: training data (×) in the x–y plane with the straight line h_θ(x) drawn through it]

Linear regression with one variable.
Univariate linear regression.
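As a small illustration of this model, a hedged Python sketch of the hypothesis (the function name h and the example parameter values are mine, chosen only for demonstration):

def h(x, theta0, theta1):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Example: with theta0 = 50 and theta1 = 0.06 (values that appear later in the slides),
# a 1416 ft^2 house is estimated at 50 + 0.06 * 1416 = 134.96, i.e. about $135k.
print(h(1416, 50, 0.06))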
Cost Function
Training Set:

  Size in feet² (x)      Price ($) in 1000’s (y)
  2104                   460
  1416                   232
  1534                   315
  852                    178
  ...                    ...

Hypothesis:  h_θ(x) = θ₀ + θ₁·x

θᵢ’s: parameters of the model.

How do we choose the θᵢ’s?
h_θ(x) = θ₀ + θ₁·x

[Figure: three example hypotheses plotted for x, y ∈ [0, 3] —
  θ₀ = 1.5, θ₁ = 0:    a horizontal line at height 1.5;
  θ₀ = 0,   θ₁ = 0.5:  a line through the origin with slope 0.5;
  θ₀ = 1,   θ₁ = 0.5:  a line with intercept 1 and slope 0.5]
[Figure: training data (×) with a candidate line h_θ(x) = θ₀ + θ₁·x drawn through it]

Idea: Choose θ₀, θ₁ so that h_θ(x) is close to y for our training examples (x, y).

minimize over θ₀, θ₁:   (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

J(θ₀, θ₁) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Cost function (Squared error function).
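A direct translation of the squared-error cost into Python (a sketch; the function name compute_cost is my own, and the example θ values are arbitrary):

import numpy as np

def compute_cost(x, y, theta0, theta1):
    """J(theta0, theta1) = (1 / 2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(x)
    predictions = theta0 + theta1 * x          # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Cost of an arbitrary candidate line on the housing training set shown earlier.
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(compute_cost(x, y, 0.0, 0.2))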
Cost Function
Intuition I

Hypothesis:      h_θ(x) = θ₀ + θ₁·x
Parameters:      θ₀, θ₁
Cost Function:   J(θ₀, θ₁) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Goal:            minimize over θ₀, θ₁ of J(θ₀, θ₁)

Simplified version (set θ₀ = 0):

Hypothesis:      h_θ(x) = θ₁·x
Parameters:      θ₁
Cost Function:   J(θ₁) = (1/2m) · Σ_{i=1}^{m} (θ₁·x^(i) − y^(i))²
Goal:            minimize over θ₁ of J(θ₁)
h_θ(x)                                              J(θ₁)
(for fixed θ₁, this is a function of x)             (function of the parameter θ₁)

[Figure: left — a toy training set of three points with the line h_θ(x) = θ₁·x drawn for θ₁ = 1, θ₁ = 0.5 and θ₁ = 0;
 right — the corresponding cost values plotted against θ₁ (with J(1) = 0 for the perfectly fitting line), tracing out a bowl-shaped curve whose minimum is at θ₁ = 1]
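The bowl shape of J(θ₁) can be reproduced numerically. This Python sketch assumes the three toy points (1, 1), (2, 2), (3, 3) suggested by the plot; those exact coordinates are my assumption:

import numpy as np

# Toy training set assumed from the plot: (1, 1), (2, 2), (3, 3).
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def J(theta1):
    """Cost of the simplified hypothesis h_theta(x) = theta1 * x."""
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for theta1 in [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"J({theta1}) = {J(theta1):.3f}")   # J(1.0) = 0, larger on either side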
Cost Function
Intuition II
h_θ(x)                                              J(θ₀, θ₁)
(for fixed θ₀, θ₁, this is a function of x)         (function of the parameters θ₀, θ₁)

[Figure: left — housing data, Price ($) in 1000’s vs. Size (feet²), with example lines such as h_θ(x) = 50 + 0.06·x (θ₀ = 50, θ₁ = 0.06);
 right — J(θ₀, θ₁) plotted over the (θ₀, θ₁) plane as a contour plot; each line on the left corresponds to a single point (θ₀, θ₁) on the right]
Gradient Descent

Have some function J(θ₀, θ₁).
Want min over θ₀, θ₁ of J(θ₀, θ₁).

Outline:
● Start with some θ₀, θ₁
● Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum
Gradient Descent algorithm

repeat until convergence {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)        (simultaneously update j = 0 and j = 1)
}

α is the learning rate; ∂/∂θⱼ J(θ₀, θ₁) is the derivative term.

Correct (simultaneous update):               Incorrect:
temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)            temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)
temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)            θ₀ := temp0
θ₀ := temp0                                  temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)
θ₁ := temp1                                  θ₁ := temp1
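A small Python sketch of one simultaneous update. The partial derivatives are passed in as functions, and the example cost J(θ₀, θ₁) = θ₀² + θ₁² is made up purely to have something concrete to minimize:

def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One simultaneous update of (theta0, theta1).

    dJ_dtheta0 and dJ_dtheta1 are functions returning the partial
    derivatives of J at the current parameters (assumed given).
    """
    # Compute both new values from the *old* parameters first...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...then assign, so theta0 is not overwritten before it is used for theta1.
    return temp0, temp1

# Example with J(theta0, theta1) = theta0**2 + theta1**2 (gradients 2*theta0, 2*theta1).
t0, t1 = 3.0, -2.0
for _ in range(5):
    t0, t1 = gradient_descent_step(t0, t1, 0.1,
                                   lambda a, b: 2 * a,
                                   lambda a, b: 2 * b)
print(t0, t1)   # both shrink toward the minimum at (0, 0)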
Gradient descent intuition with a single parameter θ₁ ∈ ℝ:

θ₁ := θ₁ − α · d/dθ₁ J(θ₁)

[Figure: J(θ₁) as a bowl-shaped curve, with the update illustrated at points on either side of the minimum]

If the slope d/dθ₁ J(θ₁) ≥ 0 (current point to the right of the minimum):
    θ₁ := θ₁ − α · (positive number), so θ₁ decreases and moves toward the minimum.

If the slope d/dθ₁ J(θ₁) ≤ 0 (current point to the left of the minimum):
    θ₁ := θ₁ − α · (negative number), so θ₁ increases and again moves toward the minimum.
If α is too small, gradient descent can be slow.

[Figure: J(θ₁) with many tiny gradient descent steps creeping toward the minimum]

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

[Figure: J(θ₁) with steps that jump back and forth across the minimum, moving farther away each time]
What will one step of gradient descent, θ₁ := θ₁ − α · d/dθ₁ J(θ₁), do if θ₁ is already at a local optimum?

At a local optimum the slope is zero: d/dθ₁ J(θ₁) = 0, so the update is θ₁ := θ₁ − α · 0 and θ₁ is left unchanged.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.

As we approach a local minimum, the derivative d/dθ₁ J(θ₁) shrinks, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.

[Figure: successive gradient descent steps on J(θ₁) that get smaller and smaller as they approach the minimum]
Gradient Descent algorithm:

repeat until convergence {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)        (for j = 0 and j = 1)
}

Linear Regression Model:

h_θ(x) = θ₀ + θ₁·x
J(θ₀, θ₁) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Working out the derivative term:

∂/∂θⱼ J(θ₀, θ₁) = ∂/∂θⱼ [ (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² ]
                = ∂/∂θⱼ [ (1/2m) · Σ_{i=1}^{m} (θ₀ + θ₁·x^(i) − y^(i))² ]

j = 0:   ∂/∂θ₀ J(θ₀, θ₁) = (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
j = 1:   ∂/∂θ₁ J(θ₀, θ₁) = (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
Gradient Descent algorithm (for linear regression):

repeat until convergence {
    θ₀ := θ₀ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ₁ := θ₁ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
(update θ₀ and θ₁ simultaneously)
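Putting the two update rules together, a minimal batch gradient descent for univariate linear regression in Python (a sketch; the function name is mine, and I rescale the size feature to 1000’s of feet² only so that a single fixed α behaves well, anticipating the feature-scaling discussion later):

import numpy as np

def gradient_descent(x, y, alpha=0.3, num_iters=2000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x.

    Each iteration uses *all* m training examples ("batch"), exactly as in
    the update rules above. Returns the learned (theta0, theta1).
    """
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = theta0 + theta1 * x - y                   # h_theta(x^(i)) - y^(i)
        temp0 = theta0 - alpha * np.sum(error) / m        # simultaneous update
        temp1 = theta1 - alpha * np.sum(error * x) / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Housing data from the slides, with size rescaled to 1000's of feet^2
# (my own choice, just to keep one fixed alpha numerically well behaved).
x = np.array([2104, 1416, 1534, 852]) / 1000.0
y = np.array([460.0, 232.0, 315.0, 178.0])
print(gradient_descent(x, y))   # fitted intercept and slope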
h_θ(x)                                              J(θ₀, θ₁)
(for fixed θ₀, θ₁, this is a function of x)         (function of the parameters θ₀, θ₁)

[Figure: a sequence of gradient descent steps on the housing data — on the right, the point (θ₀, θ₁) moves step by step across the contour plot of J(θ₀, θ₁) toward its minimum; on the left, the corresponding line h_θ(x) (Price $ in 1000’s vs. Size in feet²) fits the training data better and better at each step]
“Batch” Gradient Descent

“Batch”: each step of gradient descent uses all the training examples.
Linear Regression
with multiple variables
Multiple Variables (Features)

  Size in feet² (x)      Price ($) in 1000’s (y)
  2104                   460
  1416                   232
  1534                   315
  852                    178
  ...                    ...

h_θ(x) = θ₀ + θ₁·x
Multiple Variables (Features)

  Size in feet²   Number of       Number of     Age of home     Price ($) in
  (x₁)            bedrooms (x₂)   floors (x₃)   in years (x₄)   1000’s (y)
  2104            5               1             45              460
  1416            3               2             40              232
  1534            3               2             30              315
  852             2               2             36              178
  ...             ...             ...           ...             ...

Notation:
n = number of features
x^(i) = the input (features) of the i-th training example
xⱼ^(i) = the value of feature j in the i-th training example
Hypothesis

Previously:  h_θ(x) = θ₀ + θ₁·x

With multiple features:  h_θ(x) = θ₀ + θ₁·x₁ + θ₂·x₂ + θ₃·x₃ + θ₄·x₄

E.g.  h_θ(x) = 80 + 0.1·x₁ + 10·x₂ + 3·x₃ − 2·x₄
h_θ(x) = θ₀ + θ₁·x₁ + θ₂·x₂ + ... + θₙ·xₙ

For convenience of notation, define x₀ = 1. Then

x = [x₀, x₁, ..., xₙ]ᵀ ∈ ℝⁿ⁺¹        θ = [θ₀, θ₁, ..., θₙ]ᵀ ∈ ℝⁿ⁺¹

h_θ(x) = θᵀx

Multivariate linear regression.

Hypothesis:      h_θ(x) = θᵀx = θ₀·x₀ + θ₁·x₁ + ... + θₙ·xₙ
Parameters:      θ₀, θ₁, ..., θₙ (collected in the vector θ)
Cost Function:   J(θ) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Gradient Descent:
repeat {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ)
}   (simultaneously update θⱼ for every j = 0, ..., n)
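With x₀ = 1 prepended, the hypothesis is a single dot product. A short Python sketch (the concrete θ values are the illustrative coefficients from the example above, not learned values):

import numpy as np

# One training example from the table: size, bedrooms, floors, age.
features = np.array([2104.0, 5.0, 1.0, 45.0])

# Prepend x0 = 1 so that theta0 is multiplied by 1.
x = np.concatenate(([1.0], features))            # x in R^(n+1)

# Parameter vector theta in R^(n+1), using the illustrative coefficients above.
theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])

print(theta @ x)                                 # h_theta(x) = theta^T x = 253.4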


Gradient Descent

Previously (n = 1):
repeat {
    θ₀ := θ₀ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ₁ := θ₁ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
    (simultaneously update θ₀, θ₁)
}

New algorithm (n ≥ 1):
repeat {
    θⱼ := θⱼ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · xⱼ^(i)
    (simultaneously update θⱼ for j = 0, ..., n)
}

Written out for the first few parameters (with x₀^(i) = 1):
    θ₀ := θ₀ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x₀^(i)
    θ₁ := θ₁ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x₁^(i)
    θ₂ := θ₂ − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x₂^(i)
    ...
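The update of every θⱼ can be written as one matrix operation. A Python sketch under the assumption that the features are already on similar scales (the function name and the toy data are mine):

import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for h_theta(x) = theta^T x with n >= 1 features.

    X has shape (m, n+1) with a leading column of ones (x0 = 1); every
    theta_j is updated simultaneously from the same prediction errors.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        error = X @ theta - y                    # h_theta(x^(i)) - y^(i), all i at once
        theta = theta - alpha * (X.T @ error) / m
    return theta

# Tiny illustrative problem: two already-scaled features plus the x0 = 1 column.
X = np.array([[1.0, 0.5, 0.2],
              [1.0, 0.3, 0.8],
              [1.0, 0.9, 0.4],
              [1.0, 0.1, 0.6]])
y = np.array([2.0, 1.5, 2.8, 1.1])
print(gradient_descent_multi(X, y))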
Feature Scaling

Idea: Make sure features are on a similar scale. When features differ wildly in scale, the contours of J(θ) become long and thin and gradient descent can take many steps to reach the minimum; after scaling, it converges much faster.

E.g.  x₁ = size (0−2000 feet²)
      x₂ = number of bedrooms (1−5)

Rescale:
      x₁ = size (feet²) / 2000
      x₂ = number of bedrooms / 5

Get every feature into approximately a −1 ≤ xᵢ ≤ 1 range.
Mean Normalization

Replace xᵢ with xᵢ − μᵢ to make features have approximately zero mean (do not apply to x₀ = 1).

E.g.  x₁ = (size − 1000) / 2000             −0.5 ≤ x₁ ≤ 0.5
      x₂ = (#bedrooms − 2.5) / 5            −0.5 ≤ x₂ ≤ 0.5

In general:  xᵢ := (xᵢ − μᵢ) / sᵢ
where μᵢ is the mean of feature i on the training set and sᵢ is its range (max − min) or its standard deviation.
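A Python sketch of mean normalization, taking sᵢ to be the feature’s range, one of the two options above (the helper name mean_normalize is mine):

import numpy as np

def mean_normalize(X):
    """Scale each feature column to roughly [-0.5, 0.5]: (x - mean) / range.

    Apply this to the raw features only, *not* to the x0 = 1 column.
    """
    mu = X.mean(axis=0)                       # mean of each feature
    s = X.max(axis=0) - X.min(axis=0)         # range (max - min) of each feature
    return (X - mu) / s, mu, s

# Illustrative raw features: size in feet^2 and number of bedrooms.
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
X_norm, mu, s = mean_normalize(X)
print(X_norm)      # each column now has zero mean and spans about one unit

# The same mu and s must be reused to scale any new example at prediction time.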
Learning Rate
Gradient Descent

● “Debugging”: How to make sure gradient descent is working correctly.
● How to choose the learning rate α.
Making sure gradient descent is working correctly.

[Figure: J(θ) plotted against the number of iterations (0, 100, 200, 300, ...); J(θ) should decrease on every iteration and flatten out as gradient descent converges]

Example automatic convergence test: declare convergence if J(θ) decreases by less than 10⁻³ in one iteration.
Making sure gradient descent is working correctly.

[Figure: plots of J(θ) against the number of iterations in which J(θ) is increasing, or repeatedly going up and down — gradient descent is not working; use a smaller α]

- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
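A Python sketch of this “debugging” recipe: record J(θ) after every iteration and stop once the decrease falls below the 10⁻³ threshold from the slide (the function name and the toy data are mine):

import numpy as np

def gradient_descent_with_history(X, y, alpha, max_iters=10000, tol=1e-3):
    """Gradient descent that records J(theta) at every iteration.

    Stops when J(theta) decreases by less than `tol` in one iteration
    (the automatic convergence test above). A *negative* decrease, i.e.
    J(theta) going up, also triggers the stop and signals that alpha
    is too large.
    """
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(max_iters):
        error = X @ theta - y
        cost = np.sum(error ** 2) / (2 * m)
        if history and history[-1] - cost < tol:
            break
        history.append(cost)
        theta = theta - alpha * (X.T @ error) / m
    return theta, history

# Tiny usage example on made-up, already-scaled data; plotting `history`
# against the iteration number gives the J(theta) curve discussed above.
X = np.array([[1.0, 0.2], [1.0, 0.7], [1.0, 0.5]])
y = np.array([1.0, 2.0, 1.6])
theta, history = gradient_descent_with_history(X, y, alpha=0.3)
print(len(history), theta)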
Summary

- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.

To choose α, try
…, 0.001, …, 0.01, …, 0.1, …, 1, …
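The suggested grid of α values can be tried directly. A Python sketch (the toy data and the intermediate α values filling in the ellipses are my own choices):

import numpy as np

# Made-up, already-scaled training data for the demonstration.
X = np.array([[1.0, 0.2], [1.0, 0.7], [1.0, 0.5], [1.0, 0.9]])
y = np.array([1.0, 2.0, 1.6, 2.5])
m = len(y)

def final_cost(alpha, num_iters=200):
    """Run a fixed number of gradient descent steps and return the final J(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        error = X @ theta - y
        theta = theta - alpha * (X.T @ error) / m
    return np.sum((X @ theta - y) ** 2) / (2 * m)

# Grid in the spirit of the slide, with intermediate values of my choosing;
# the alpha that drives J(theta) lowest within the same budget is the better pick.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    print(f"alpha = {alpha:<6} final J = {final_cost(alpha):.6f}")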
References

Machine Learning Books

● Hands-On Machine Learning with Scikit-Learn and TensorFlow, Chap. 2 & 4


● Pattern Recognition and Machine Learning, Chap. 3
● Machine Learning: a Probabilistic Perspective, Chap. 7

Machine Learning Courses

● https://www.coursera.org/learn/machine-learning, Week 1 & 2
