
Linear Regression with multiple variables
Multiple features
Machine Learning
Multiple features (variables).

Size (feet²)    Price ($1000)
2104            460
1416            232
1534            315
852             178
...             ...

Multiple features (variables).

Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
852            2                    1                  36                    178
...            ...                  ...                ...                   ...
Notation:
  n = number of features
  x^(i) = input (features) of the i-th training example
  x_j^(i) = value of feature j in the i-th training example
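
For example, with the table above: n = 4, x^(2) = (1416, 3, 2, 40) is the input of the second training example, and x_3^(2) = 2 is the value of the third feature (number of floors) in that example.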

Hypothesis:
Previously (one variable): h_θ(x) = θ_0 + θ_1 x
Now (multiple variables): h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n
E.g. h_θ(x) = θ_0 + θ_1 (size) + θ_2 (# bedrooms) + θ_3 (# floors) + θ_4 (age of home)

For convenience of notation, define x_0 = 1.
Then x = (x_0, x_1, ..., x_n) and θ = (θ_0, θ_1, ..., θ_n) are both (n+1)-dimensional vectors, and the hypothesis can be written compactly as
    h_θ(x) = θ^T x
where θ^T is a 1 by (n+1) matrix (a row vector).

Multivariate linear regression.
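
A minimal NumPy sketch of this vectorized hypothesis, using the housing features from the table above; the array names (X_raw, theta) are illustrative, not part of the original slides.

```python
import numpy as np

# Features from the table: size, bedrooms, floors, age
X_raw = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [ 852, 2, 1, 36],
], dtype=float)

# Prepend x_0 = 1 to every example so theta_0 acts as the intercept
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # shape (m, n+1)

theta = np.zeros(X.shape[1])                            # (n+1)-dimensional vector

# Hypothesis for all m examples at once: h = X . theta
h = X @ theta
print(h)    # one prediction per training example (in $1000s)
```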
Linear Regression with multiple variables
Gradient descent for multiple variables
Machine Learning
Hypothesis: h_θ(x) = θ^T x = θ_0 x_0 + θ_1 x_1 + ... + θ_n x_n   (with x_0 = 1)
Parameters: θ = (θ_0, θ_1, ..., θ_n), an (n+1)-dimensional vector
Cost function:
    J(θ) = (1/2m) Σ_{i=1}^{m} ( h_θ(x^(i)) - y^(i) )²

Gradient descent:
    Repeat until convergence {
        θ_j := θ_j - α ∂J(θ)/∂θ_j
             = θ_j - α (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) - y^(i) ) x_j^(i)
    }
    (simultaneously update θ_j for every j = 0, 1, ..., n)
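
A minimal NumPy sketch of this update rule, assuming a design matrix X with the x_0 = 1 column (as in the earlier sketch) and a target vector y of prices; the function name gradient_descent is illustrative.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X : (m, n+1) design matrix with a leading column of ones
    y : (m,) vector of targets
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        error = X @ theta - y                 # h_theta(x) - y for all examples
        gradient = (X.T @ error) / m          # partial derivatives for every theta_j
        theta = theta - alpha * gradient      # simultaneous update of all parameters
    return theta
```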
Linear Regression with multiple variables
Gradient descent in practice I: Feature Scaling
Machine Learning
Feature Scaling (a practical trick for making gradient descent work well)
Idea: Make sure features are on a similar scale.
E.g. x_1 = size (0-2000 feet²)
     x_2 = number of bedrooms (1-5)

With two features on such different scales, the contours of the cost function J(θ) take on a very skewed, elliptical shape, and if you run gradient descent on this cost function, it may end up oscillating and taking a long time before it finally finds its way to the global minimum. A useful thing to do is to scale the features, e.g.
     x_1 = size (feet²) / 2000
     x_2 = number of bedrooms / 5
so that both lie roughly in the range 0 ≤ x_i ≤ 1. The contours then look much more like circles, and gradient descent can find a much more direct path to the global minimum.
Feature Scaling
Get every feature into approximately a -1 ≤ x_i ≤ 1 range.


Mean normalization
Replace x_i with x_i - μ_i to make features have approximately zero mean (do not apply this to x_0 = 1).
More generally:
    x_i := (x_i - μ_i) / s_i
where μ_i is the average value of feature i in the training set and s_i is the range of values (max - min).
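
A minimal sketch of mean normalization in NumPy, assuming X_raw holds the raw feature columns (without the x_0 = 1 column) and using the range (max - min) as s_i, one of the choices described above.

```python
import numpy as np

def mean_normalize(X_raw):
    """Scale each feature column into roughly the range -1 <= x_i <= 1.

    Returns the scaled features plus (mu, s) so the same transform
    can be applied to new examples at prediction time.
    """
    mu = X_raw.mean(axis=0)                       # average value of each feature
    s = X_raw.max(axis=0) - X_raw.min(axis=0)     # range (max - min) of each feature
    X_scaled = (X_raw - mu) / s
    return X_scaled, mu, s
```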
Linear Regression with multiple variables
Gradient descent in practice II: Learning rate
Machine Learning
Gradient descent

•  "Debugging": How to make sure gradient descent is working correctly.
•  How to choose the learning rate α.

Making sure gradient descent is working correctly.

[Figure: plot of J(θ) versus the number of iterations (0, 100, 200, 300, 400). If gradient descent is working correctly, J(θ) should decrease after every iteration and flatten out as it converges.]

Example automatic convergence test:
Declare convergence if J(θ) decreases by less than some small threshold ε (e.g. 10^-3) in one iteration.
In practice, deciding this threshold can be hard, and the number of iterations gradient descent takes to converge depends on the application.
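
A minimal sketch of this monitoring in NumPy, assuming the cost function J(θ) defined earlier; the helper names and the 10^-3 threshold mirror the example convergence test above.

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum of squared errors."""
    m = len(y)
    err = X @ theta - y
    return (err @ err) / (2 * m)

def gradient_descent_monitored(X, y, alpha=0.01, max_iters=10000, eps=1e-3):
    """Run gradient descent, recording J(theta) so convergence can be checked."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    J_history = [cost(X, y, theta)]
    for _ in range(max_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        J_history.append(cost(X, y, theta))
        # Automatic convergence test: stop once J decreases by less than eps
        delta = J_history[-2] - J_history[-1]
        if 0 <= delta < eps:
            break
    return theta, J_history
```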
Making sure gradient descent is working correctly.

[Figure: plots of J(θ) versus the number of iterations in which J(θ) keeps increasing, or repeatedly goes up and down.] If J(θ) is increasing or oscillating like this, gradient descent is not working; use a smaller α.

•  For sufficiently small α, J(θ) should decrease on every iteration.
•  But if α is too small, gradient descent can be slow to converge.
Summary:
•  If α is too small: slow convergence.
•  If α is too large: J(θ) may not decrease on every iteration; may not converge. (Slow convergence is also possible.)

To choose α, try a range of values such as
    ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
and pick the largest value for which J(θ) still decreases rapidly on every iteration.
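A minimal sketch of such a sweep, reusing the hypothetical gradient_descent_monitored helper from the earlier sketch and a design matrix X and target vector y already in scope; the α values mirror the list above, and on unscaled features the larger values may well diverge, which is exactly what this comparison is meant to reveal.

```python
# Try a range of learning rates and compare how J(theta) behaves for each.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    theta, J_history = gradient_descent_monitored(X, y, alpha=alpha, max_iters=400)
    trend = "decreasing" if J_history[-1] < J_history[0] else "NOT decreasing"
    print(f"alpha={alpha:<6} final J={J_history[-1]:.4f} ({trend})")
```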
Linear Regression with multiple variables
Features and polynomial regression
Machine Learning
Housing prices prediction
    h_θ(x) = θ_0 + θ_1 (frontage) + θ_2 (depth)
Instead of using frontage and depth as two separate features, you can define a new feature, the land area x = frontage × depth, and use h_θ(x) = θ_0 + θ_1 x. Sometimes, by defining new features, you might actually get a better model.
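
A minimal sketch of this feature construction in NumPy; the frontage and depth values are hypothetical, purely for illustration.

```python
import numpy as np

frontage = np.array([60.0, 40.0, 55.0])   # hypothetical lot frontages (feet)
depth = np.array([100.0, 80.0, 120.0])    # hypothetical lot depths (feet)

area = frontage * depth                    # new feature: land area (feet^2)
X = np.column_stack([np.ones(len(area)), area])   # design matrix with x_0 = 1
```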
Polynomial regression

[Figure: housing price (y) versus size (x). It doesn't look like a straight line fits this data very well.]

Quadratic model: h_θ(x) = θ_0 + θ_1 x + θ_2 x²
Cubic model:     h_θ(x) = θ_0 + θ_1 x + θ_2 x² + θ_3 x³

A cubic model can be fit with multivariate linear regression by choosing
    x_1 = (size), x_2 = (size)², x_3 = (size)³
so that h_θ(x) = θ_0 + θ_1 (size) + θ_2 (size)² + θ_3 (size)³.
With features defined this way, feature scaling is important: size, (size)², and (size)³ take on very different ranges of values.
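
A minimal sketch of building these polynomial features in NumPy, reusing the hypothetical mean_normalize helper from the feature-scaling sketch; the sizes come from the earlier table.

```python
import numpy as np

size = np.array([2104.0, 1416.0, 1534.0, 852.0])   # sizes from the earlier table

# Build polynomial features: size, size^2, size^3
X_poly = np.column_stack([size, size**2, size**3])

# Feature scaling matters here: the columns span wildly different ranges
X_scaled, mu, s = mean_normalize(X_poly)

# Add the x_0 = 1 column and reuse ordinary multivariate linear regression
X = np.hstack([np.ones((X_scaled.shape[0], 1)), X_scaled])
```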

Choice of features

[Figure: housing price (y) versus size (x).]

Besides polynomial terms, other choices of features are possible; for example,
    h_θ(x) = θ_0 + θ_1 (size) + θ_2 √(size)
which keeps increasing as the size grows rather than eventually curving back down.
Linear Regression with multiple variables
Normal equation
Machine Learning

Gradient descent: an iterative algorithm that takes many steps, i.e. multiple iterations of gradient descent, to converge to the global minimum.

Normal equation: a method to solve for θ analytically. For some linear regression problems, the normal equation will give us a much better way to solve for the optimal value of the parameters θ.
Intuition: if θ is 1-dimensional (θ ∈ ℝ, just a scalar value), then J(θ) is a quadratic function of θ:
    J(θ) = aθ² + bθ + c
The way to minimize this quadratic function is to set its derivative equal to zero,
    d/dθ J(θ) = 0,
and solve for θ.

In general, θ is an (n+1)-dimensional vector and
    J(θ_0, θ_1, ..., θ_n) = (1/2m) Σ_{i=1}^{m} ( h_θ(x^(i)) - y^(i) )².
Set the partial derivatives to zero,
    ∂J(θ)/∂θ_j = 0   (for every j),
and solve for θ_0, θ_1, ..., θ_n.
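
Carrying that calculation out in matrix form (with the design matrix X and target vector y defined on the next slide) leads to a closed-form solution. A sketch of the derivation, assuming the same J(θ) as above:

    ∇_θ J(θ) = (1/m) X^T (Xθ - y) = 0
    ⟹ X^T X θ = X^T y
    ⟹ θ = (X^T X)^(-1) X^T y   (the "normal equation")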
Examples: m = 4.
Add an extra column for the feature x_0 = 1:

x_0   Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
1     2104           5                    1                  45                    460
1     1416           3                    2                  40                    232
1     1534           3                    2                  30                    315
1     852            2                    1                  36                    178

X = the design matrix containing all the input features (including x_0), an m x (n+1)-dimensional matrix
y = the vector of all the target values, an m-dimensional vector

The normal equation then gives θ = (X^T X)^(-1) X^T y.
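
A minimal NumPy sketch of this computation on the table above; because there are only m = 4 examples for n + 1 = 5 parameters, X^T X is singular here, so the pseudo-inverse is used in place of the explicit inverse.

```python
import numpy as np

X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1,  852, 2, 1, 36],
], dtype=float)                          # m x (n+1) design matrix
y = np.array([460.0, 232.0, 315.0, 178.0])   # m-dimensional vector of prices ($1000)

# Normal equation: theta = (X^T X)^(-1) X^T y.
# With m = 4 examples and n + 1 = 5 parameters, X^T X is singular,
# so the pseudo-inverse still gives a least-squares solution.
theta = np.linalg.pinv(X) @ y
print(theta)                  # optimal parameters; predictions are X @ theta
```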

m training examples, n features.

Gradient descent:
•  Need to choose α.
•  Needs many iterations.
•  Works well even when n is large.

Normal equation:
•  No need to choose α.
•  Don't need to iterate.
•  No need to do feature scaling.
•  Need to compute (X^T X)^(-1), which is roughly an O(n³) operation.
•  Slow if n is very large.

To summarize, so long as the number of features is not too large, the normal equation gives us a great alternative method for solving for the parameters θ. Concretely, so long as the number of features is less than about 1,000, the normal equation method can be used rather than gradient descent.

As we get to more complex learning algorithms, for example classification algorithms like logistic regression, the normal equation method actually does not work for those more sophisticated learning algorithms, and we will have to resort to gradient descent.

So gradient descent is a very useful algorithm to know, both for linear regression with a large number of features and for some of the other algorithms, because for them the normal equation method just doesn't apply and doesn't work. But for this specific model of linear regression, the normal equation can give you an alternative that can be much faster than gradient descent.

So, depending on the details of the problem and how many features you have, both of these algorithms are well worth knowing about.
