
Lecture Notes: CHE5151:OCP 2/8/2020

Linear regression with one variable: Model representation

Training set of Manipal housing prices:

    Size in feet^2 (x)      Price (Rs) in 20,000's (y)
    2104                    460
    1416                    232
    1534                    315
    852                     178
    ...                     ...

Notation:
    m               = number of training examples
    x's             = "input" variable / features
    y's             = "output" variable / "target" variable
    (x, y)          = one training example
    (x^(i), y^(i))  = i-th training example

This is a supervised learning problem: the "right answer" y is given for each example in the data.
    Regression problem: predict a real-valued output (here, the price).
    Classification problem: predict a discrete-valued output.
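The later code sketches in these notes can hold the listed examples in two NumPy arrays (an illustrative setup, not part of the original notes):

    import numpy as np

    x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # size in feet^2
    y = np.array([460.0, 232.0, 315.0, 178.0])      # price, in 20,000's of Rupees
    m = len(y)                                       # number of training examples (here, the 4 listed)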


From the training set of Manipal housing prices above we learn a hypothesis h: given the size of a house x, h returns the estimated value of the price,

    h_\theta(x) = \theta_0 + \theta_1 x        (shorthand: h(x))

Hypothesis:   h_θ(x) = θ0 + θ1·x

h maps from x's to (estimated) y's, and the θ_i's are the parameters of the model. How do we choose the θ_i's?
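A tiny Python sketch of the hypothesis as a function (names and parameter values are illustrative, not from the notes):

    def h(x, theta0, theta1):
        # Hypothesis for linear regression with one variable: estimated y for input x.
        return theta0 + theta1 * x

    print(h(1416, theta0=0.0, theta1=0.2))   # example call with arbitrary parameter values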

h ( x)   0  1 x Quiz: Consider the plot below of hθ(x)=θ0+θ1x . What are θ0 and θ1 ?

 θ0=0,θ1=1

 θ0=0.5,θ1=1

 θ0=1,θ1=0.5

 θ0=1,θ1=1


Choose θ0, θ1 to minimize

    \min_{\theta_0, \theta_1} \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 ,
    \quad \text{where } h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)} .

Linear regression with one variable: Cost function intuition I

Cost function (squared error function):

    J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

Idea: choose θ0, θ1 so that h_θ(x) is close to y for our training examples (x, y); that is, minimize J(θ0, θ1) over θ0, θ1.

Full problem:
    Hypothesis:      h_θ(x) = θ0 + θ1·x
    Parameters:      θ0, θ1
    Cost function:   J(θ0, θ1) as above
    Goal:            minimize J(θ0, θ1) over θ0, θ1

Simplified problem (for a given data set, fix θ0 = 0):
    Hypothesis:      h_θ(x) = θ1·x
    Parameter:       θ1
    Cost function:   J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
    Goal:            minimize J(θ1) over θ1


1= 1
J(1)=?
1 m
J (1 )   (h ( x(i ) )  y (i ) )2
2m i 1 J (1)  0 1= 0.5 1= 0
=
1 m 1
 (1.( x(i ) )  y (i ) )2  2  3 (02  02  02 )  02 J(1)=? J(1)=?
2m i 1

Quiz: Suppose we have a training set with m = 3 examples, plotted below (plot not reproduced). Our hypothesis representation is h_θ(x) = θ1·x, with parameter θ1. The cost function is

    J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 .

What is J(0)?

    • 0
    • 1/6
    • 1
    • 14/6   √


Linear regression with one variable: Cost function intuition II

    Hypothesis:      h_θ(x) = θ0 + θ1·x
    Parameters:      θ0, θ1
    Cost function:   J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
    Goal:            minimize J(θ0, θ1) over θ0, θ1


Linear regression with one variable: Gradient descent

Have some function J(θ0, θ1)   (or, in general, J(θ0, θ1, ..., θn)).
Want:   \min_{\theta_0, \theta_1} J(\theta_0, \theta_1)   (or \min_{\theta_0, \ldots, \theta_n} J(\theta_0, \theta_1, \ldots, \theta_n)).

Outline:
    • Start with some θ0, θ1 (or θ0, θ1, ..., θn).
    • Keep changing θ0, θ1 (or θ0, ..., θn) to reduce J, until we hopefully end up at a minimum.


Gradient descent algorithm:

    repeat until convergence {
        \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)      (for j = 0 and j = 1)
    }

Correct (simultaneous update):
    temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
    temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
    \theta_0 := temp0
    \theta_1 := temp1

Incorrect (non-simultaneous update):
    temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
    \theta_0 := temp0
    temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
    \theta_1 := temp1
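A minimal Python sketch of the difference (the callables dJ0 and dJ1 stand in for the two partial derivatives; all names are illustrative, not from the notes):

    def step_correct(theta0, theta1, alpha, dJ0, dJ1):
        # Simultaneous update: both partials are evaluated at the OLD (theta0, theta1)
        # before either parameter is overwritten.
        temp0 = theta0 - alpha * dJ0(theta0, theta1)
        temp1 = theta1 - alpha * dJ1(theta0, theta1)
        return temp0, temp1

    def step_incorrect(theta0, theta1, alpha, dJ0, dJ1):
        # Non-simultaneous update: theta0 is overwritten first, so the second partial
        # is evaluated at a mixed point (new theta0, old theta1).
        theta0 = theta0 - alpha * dJ0(theta0, theta1)
        theta1 = theta1 - alpha * dJ1(theta0, theta1)
        return theta0, theta1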

Quiz: Gradient descent algorithm

    repeat until convergence {
        \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)      (simultaneously update j = 0 and j = 1)
    }

Let's consider the simpler case with a single parameter:

    \min_{\theta_1} J(\theta_1)


Gradient descent algorithm: intuition for the derivative term

Gradient descent algorithm: intuition for the learning rate

    \theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)

If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
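A small numeric sketch of both failure modes, using the illustrative convex function J(θ1) = θ1^2 (not from the notes), whose derivative is 2·θ1:

    def run_gd(theta1, alpha, steps=5):
        # Gradient descent on J(theta1) = theta1**2, so dJ/dtheta1 = 2 * theta1.
        trace = [theta1]
        for _ in range(steps):
            theta1 = theta1 - alpha * 2 * theta1
            trace.append(theta1)
        return trace

    print(run_gd(1.0, alpha=0.01))   # too small: values creep toward 0 very slowly
    print(run_gd(1.0, alpha=0.5))    # reasonable: reaches the minimum at 0 immediately
    print(run_gd(1.0, alpha=1.1))    # too large: |theta1| grows every step (diverges)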

Quiz: Suppose θ1 is at a local optimum of J(θ1), such as shown in the figure (not reproduced here). What will one step of gradient descent, θ1 := θ1 − α·dJ(θ1)/dθ1, do?

    • Leave θ1 unchanged
    • Change θ1 in a random direction
    • Move θ1 in the direction of the global minimum of J(θ1)
    • Decrease θ1


Gradient descent can converge to a local minimum even with the learning rate α kept fixed: as we approach a local minimum, the derivative term gets smaller, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.

Linear regression with one variable: Gradient descent for linear regression

Applying gradient descent to the linear regression model requires the partial derivatives of J:

    \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
        = \frac{\partial}{\partial \theta_j} \, \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
        = \frac{\partial}{\partial \theta_j} \, \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2

    j = 0:   \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
    j = 1:   \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}


Gradient descent algorithm for linear regression (with the derivatives substituted in):

    repeat until convergence {
        \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
        \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}
    }
    (update θ0 and θ1 simultaneously)

The squared-error cost J(θ0, θ1) for linear regression is a convex ("bowl-shaped") function, so it has a single global minimum.
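A compact Python/NumPy sketch of this loop (the function name, default step size, and iteration count are illustrative choices, not from the notes):

    import numpy as np

    def gradient_descent(x, y, alpha=0.01, num_iters=1500):
        # Batch gradient descent for h_theta(x) = theta0 + theta1 * x.
        m = len(y)
        theta0, theta1 = 0.0, 0.0                    # e.g. start from (0, 0)
        for _ in range(num_iters):
            errors = theta0 + theta1 * x - y         # h_theta(x^(i)) - y^(i), for all i at once
            grad0 = errors.sum() / m                 # dJ/dtheta0
            grad1 = (errors * x).sum() / m           # dJ/dtheta1
            # Simultaneous update: both gradients are computed before either theta changes.
            theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        return theta0, theta1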


"Batch" gradient descent: each step of gradient descent uses all m training examples, i.e. the sum

    \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

in the update rule runs over the entire training set.

Quiz: Which of the following are true statements? Select all that apply.

    • To make gradient descent converge, we must slowly decrease α over time.
    • Gradient descent is guaranteed to find the global minimum for any function J(θ0, θ1).
    • Gradient descent can converge even if α is kept fixed. (But α cannot be too large, or else it may fail to converge.)
    • For the specific choice of cost function J(θ0, θ1) used in linear regression, there are no local optima (other than the global optimum).


Matrix-vector multiplication:

Given the house sizes x = 2104, 1416, 1534, 852 (in feet^2) and the hypothesis h_θ(x) = -40 + 0.25·x, all four predicted prices can be computed in a single matrix-vector product: build a design matrix whose i-th row is [1, x^(i)] and multiply it by the parameter vector [θ0, θ1]^T = [-40, 0.25]^T.
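A NumPy sketch of that product (theta follows the hypothesis above; variable names are illustrative):

    import numpy as np

    sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])
    X = np.column_stack([np.ones_like(sizes), sizes])   # design matrix: row i is [1, x^(i)]
    theta = np.array([-40.0, 0.25])                     # [theta0, theta1]

    predictions = X @ theta                             # one matrix-vector product gives all predictions
    print(predictions)                                  # [486.  314.  343.5 173. ]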

Matrix-matrix multiplication:


Vectorization

Vectorization example. With features x_0, x_1, ..., x_n (taking x_0 = 1), the hypothesis is

    h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x .

Un-vectorized implementation (update each parameter in its own statement):

    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}
    \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}
    \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}
    ...

Vectorized implementation (one update for the whole parameter vector):

    \theta := \theta - \alpha \, \delta ,   \quad \text{where} \quad \delta = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} ,

with θ, δ, and each x^(i) being (n+1)-dimensional vectors, and α a scalar.
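A NumPy sketch of the two implementations (X is an m-by-(n+1) design matrix whose first column is all ones; all names are illustrative, not from the notes):

    import numpy as np

    def update_unvectorized(theta, X, y, alpha):
        # Loop over the parameters j = 0..n and update each one with its own sum.
        m, n_plus_1 = X.shape
        errors = X @ theta - y                    # h_theta(x^(i)) - y^(i) for every example
        new_theta = theta.copy()
        for j in range(n_plus_1):
            new_theta[j] = theta[j] - alpha * (errors * X[:, j]).sum() / m
        return new_theta

    def update_vectorized(theta, X, y, alpha):
        # Single vector update: theta := theta - alpha * delta.
        delta = X.T @ (X @ theta - y) / len(y)    # delta = (1/m) * sum_i (h(x^(i)) - y^(i)) x^(i)
        return theta - alpha * delta

Both functions compute the same update; the vectorized form is shorter and typically much faster.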
