
Lecture Notes: CHE5151:OCP 2/8/2020

Linear regression with one variable: Model representation

Training set of Manipal housing prices:

    Size in feet^2 (x)      Price (Rs) in 20,000's (y)
    2104                    460
    1416                    232
    1534                    315
    852                     178
    ...                     ...

Notation:
    m               = number of training examples
    x's             = "input" variable / features
    y's             = "output" variable / "target" variable
    (x, y)          = one training example
    (x^(i), y^(i))  = i-th training example

This is a supervised learning problem: the "right answer" y is given for each example in the data.
    Regression problem: predict a real-valued output (here, the price).
    Classification problem: predict a discrete-valued output.
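The later code sketches in these notes can hold the listed examples in two NumPy arrays (an illustrative setup, not part of the original notes):

    import numpy as np

    x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # size in feet^2
    y = np.array([460.0, 232.0, 315.0, 178.0])      # price, in 20,000's of Rupees
    m = len(y)                                       # number of training examples (here, the 4 listed)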


From the training set of Manipal housing prices above we learn a hypothesis h: given the size of a house x, h returns the estimated value of the price,

    h_\theta(x) = \theta_0 + \theta_1 x        (shorthand: h(x))

Hypothesis:   h_θ(x) = θ0 + θ1·x

h maps from x's to (estimated) y's, and the θ_i's are the parameters of the model. How do we choose the θ_i's?
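A tiny Python sketch of the hypothesis as a function (names and parameter values are illustrative, not from the notes):

    def h(x, theta0, theta1):
        # Hypothesis for linear regression with one variable: estimated y for input x.
        return theta0 + theta1 * x

    print(h(1416, theta0=0.0, theta1=0.2))   # example call with arbitrary parameter values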

h ( x)   0  1 x Quiz: Consider the plot below of hθ(x)=θ0+θ1x . What are θ0 and θ1 ?

 θ0=0,θ1=1

 θ0=0.5,θ1=1

 θ0=1,θ1=0.5

 θ0=1,θ1=1


Choose θ0, θ1 to minimize

    \min_{\theta_0, \theta_1} \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 ,
    \quad \text{where } h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)} .

Linear regression with one variable: Cost function intuition I

Cost function (squared error function):

    J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

Idea: choose θ0, θ1 so that h_θ(x) is close to y for our training examples (x, y); that is, minimize J(θ0, θ1) over θ0, θ1.

Full problem:
    Hypothesis:      h_θ(x) = θ0 + θ1·x
    Parameters:      θ0, θ1
    Cost function:   J(θ0, θ1) as above
    Goal:            minimize J(θ0, θ1) over θ0, θ1

Simplified problem (for a given data set, fix θ0 = 0):
    Hypothesis:      h_θ(x) = θ1·x
    Parameter:       θ1
    Cost function:   J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
    Goal:            minimize J(θ1) over θ1


1= 1
J(1)=?
1 m
J (1 )   (h ( x(i ) )  y (i ) )2
2m i 1 J (1)  0 1= 0.5 1= 0
=
1 m 1
 (1.( x(i ) )  y (i ) )2  2  3 (02  02  02 )  02 J(1)=? J(1)=?
2m i 1

Quiz: Suppose we have a training set with m = 3 examples, plotted below (plot not reproduced). Our hypothesis representation is h_θ(x) = θ1·x, with parameter θ1. The cost function is

    J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 .

What is J(0)?

    • 0
    • 1/6
    • 1
    • 14/6   √


Linear regression with one variable: Cost function intuition II

    Hypothesis:      h_θ(x) = θ0 + θ1·x
    Parameters:      θ0, θ1
    Cost function:   J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
    Goal:            minimize J(θ0, θ1) over θ0, θ1


Linear regression with one variable: Gradient descent

Have some function J(θ0, θ1)   (or, in general, J(θ0, θ1, ..., θn)).
Want:   \min_{\theta_0, \theta_1} J(\theta_0, \theta_1)   (or \min_{\theta_0, \ldots, \theta_n} J(\theta_0, \theta_1, \ldots, \theta_n)).

Outline:
    • Start with some θ0, θ1 (or θ0, θ1, ..., θn).
    • Keep changing θ0, θ1 (or θ0, ..., θn) to reduce J, until we hopefully end up at a minimum.


Gradient descent algorithm:

    repeat until convergence {
        \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)      (for j = 0 and j = 1)
    }

Correct (simultaneous update):
    temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
    temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
    \theta_0 := temp0
    \theta_1 := temp1

Incorrect (non-simultaneous update):
    temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
    \theta_0 := temp0
    temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
    \theta_1 := temp1
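A minimal Python sketch of the difference (the callables dJ0 and dJ1 stand in for the two partial derivatives; all names are illustrative, not from the notes):

    def step_correct(theta0, theta1, alpha, dJ0, dJ1):
        # Simultaneous update: both partials are evaluated at the OLD (theta0, theta1)
        # before either parameter is overwritten.
        temp0 = theta0 - alpha * dJ0(theta0, theta1)
        temp1 = theta1 - alpha * dJ1(theta0, theta1)
        return temp0, temp1

    def step_incorrect(theta0, theta1, alpha, dJ0, dJ1):
        # Non-simultaneous update: theta0 is overwritten first, so the second partial
        # is evaluated at a mixed point (new theta0, old theta1).
        theta0 = theta0 - alpha * dJ0(theta0, theta1)
        theta1 = theta1 - alpha * dJ1(theta0, theta1)
        return theta0, theta1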

Quiz: Gradient descent algorithm

    repeat until convergence {
        \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)      (simultaneously update j = 0 and j = 1)
    }

Let's consider the simpler case with a single parameter:

    \min_{\theta_1} J(\theta_1)


Gradient descent algorithm: intuition for the derivative term

Gradient descent algorithm: intuition for the learning rate

    \theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)

If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
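A small numeric sketch of both failure modes, using the illustrative convex function J(θ1) = θ1^2 (not from the notes), whose derivative is 2·θ1:

    def run_gd(theta1, alpha, steps=5):
        # Gradient descent on J(theta1) = theta1**2, so dJ/dtheta1 = 2 * theta1.
        trace = [theta1]
        for _ in range(steps):
            theta1 = theta1 - alpha * 2 * theta1
            trace.append(theta1)
        return trace

    print(run_gd(1.0, alpha=0.01))   # too small: values creep toward 0 very slowly
    print(run_gd(1.0, alpha=0.5))    # reasonable: reaches the minimum at 0 immediately
    print(run_gd(1.0, alpha=1.1))    # too large: |theta1| grows every step (diverges)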

Quiz: Suppose θ1 is at a local optimum of J(θ1), such as shown in the figure (not reproduced here). What will one step of gradient descent, θ1 := θ1 − α·dJ(θ1)/dθ1, do?

    • Leave θ1 unchanged
    • Change θ1 in a random direction
    • Move θ1 in the direction of the global minimum of J(θ1)
    • Decrease θ1


Gradient descent can converge to a local minimum even with the learning rate α kept fixed: as we approach a local minimum, the derivative term gets smaller, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.

Linear regression with one variable: Gradient descent for linear regression

Applying gradient descent to the linear regression model requires the partial derivatives of J:

    \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
        = \frac{\partial}{\partial \theta_j} \, \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
        = \frac{\partial}{\partial \theta_j} \, \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2

    j = 0:   \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
    j = 1:   \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}


Gradient descent algorithm for linear regression (with the derivatives substituted in):

    repeat until convergence {
        \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
        \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}
    }
    (update θ0 and θ1 simultaneously)

The squared-error cost J(θ0, θ1) for linear regression is a convex ("bowl-shaped") function, so it has a single global minimum.
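A compact Python/NumPy sketch of this loop (the function name, default step size, and iteration count are illustrative choices, not from the notes):

    import numpy as np

    def gradient_descent(x, y, alpha=0.01, num_iters=1500):
        # Batch gradient descent for h_theta(x) = theta0 + theta1 * x.
        m = len(y)
        theta0, theta1 = 0.0, 0.0                    # e.g. start from (0, 0)
        for _ in range(num_iters):
            errors = theta0 + theta1 * x - y         # h_theta(x^(i)) - y^(i), for all i at once
            grad0 = errors.sum() / m                 # dJ/dtheta0
            grad1 = (errors * x).sum() / m           # dJ/dtheta1
            # Simultaneous update: both gradients are computed before either theta changes.
            theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        return theta0, theta1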


"Batch" gradient descent: each step of gradient descent uses all m training examples, i.e. the sum

    \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

in the update rule runs over the entire training set.

Quiz: Which of the following are true statements? Select all that apply.

    • To make gradient descent converge, we must slowly decrease α over time.
    • Gradient descent is guaranteed to find the global minimum for any function J(θ0, θ1).
    • Gradient descent can converge even if α is kept fixed. (But α cannot be too large, or else it may fail to converge.)
    • For the specific choice of cost function J(θ0, θ1) used in linear regression, there are no local optima (other than the global optimum).


Matrix-vector multiplication:

Given the house sizes x = 2104, 1416, 1534, 852 (in feet^2) and the hypothesis h_θ(x) = -40 + 0.25·x, all four predicted prices can be computed in a single matrix-vector product: build a design matrix whose i-th row is [1, x^(i)] and multiply it by the parameter vector [θ0, θ1]^T = [-40, 0.25]^T.
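A NumPy sketch of that product (theta follows the hypothesis above; variable names are illustrative):

    import numpy as np

    sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])
    X = np.column_stack([np.ones_like(sizes), sizes])   # design matrix: row i is [1, x^(i)]
    theta = np.array([-40.0, 0.25])                     # [theta0, theta1]

    predictions = X @ theta                             # one matrix-vector product gives all predictions
    print(predictions)                                  # [486.  314.  343.5 173. ]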

Matrix-matrix multiplication:


Vectorization

Vectorization example. With features x_0, x_1, ..., x_n (taking x_0 = 1), the hypothesis is

    h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x .

Un-vectorized implementation (update each parameter in its own statement):

    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}
    \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}
    \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}
    ...

Vectorized implementation (one update for the whole parameter vector):

    \theta := \theta - \alpha \, \delta ,   \quad \text{where} \quad \delta = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} ,

with θ, δ, and each x^(i) being (n+1)-dimensional vectors, and α a scalar.
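A NumPy sketch of the two implementations (X is an m-by-(n+1) design matrix whose first column is all ones; all names are illustrative, not from the notes):

    import numpy as np

    def update_unvectorized(theta, X, y, alpha):
        # Loop over the parameters j = 0..n and update each one with its own sum.
        m, n_plus_1 = X.shape
        errors = X @ theta - y                    # h_theta(x^(i)) - y^(i) for every example
        new_theta = theta.copy()
        for j in range(n_plus_1):
            new_theta[j] = theta[j] - alpha * (errors * X[:, j]).sum() / m
        return new_theta

    def update_vectorized(theta, X, y, alpha):
        # Single vector update: theta := theta - alpha * delta.
        delta = X.T @ (X @ theta - y) / len(y)    # delta = (1/m) * sum_i (h(x^(i)) - y^(i)) x^(i)
        return theta - alpha * delta

Both functions compute the same update; the vectorized form is shorter and typically much faster.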
