
Gradient Descent Learning Algorithm

- Goal is to decrease the overall error (or other objective function) each time a weight is changed
- Total sum-squared error is one possible objective function: \( E = \sum_i (t_i - z_i)^2 \)
- Seek a weight-changing algorithm such that the error gradient is negative
- If such a formula can be found, then we have a gradient descent learning algorithm

  E E E 
Gradient E[ w]   , ,..., 
 w0 w1 wn 

Training rule : wi   E[ w]
E
i.e., wi  
wi
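The training rule above can be sketched directly in code. A minimal illustration in plain Python; the simple quadratic objective \( E(w) = \sum_i w_i^2 \) is an assumption chosen only so the gradient is easy to write down:

```python
# Minimal gradient descent sketch: repeatedly move each weight against
# the error gradient, w_i <- w_i - eta * dE/dw_i.
# E(w) = sum(w_i^2) is an illustrative stand-in objective (assumed).

def grad_E(w):
    # Gradient of E(w) = sum(w_i^2) is [2*w_0, ..., 2*w_n].
    return [2.0 * wi for wi in w]

def gradient_descent_step(w, eta):
    g = grad_E(w)
    return [wi - eta * gi for wi, gi in zip(w, g)]

w = [1.0, -2.0, 0.5]
for _ in range(100):
    w = gradient_descent_step(w, eta=0.1)
print(w)  # all weights driven toward the minimum at 0
```

Each step scales every weight by \( 1 - 2\eta \), so with a sufficiently small learning rate the error decreases monotonically, which is the property the slide is after.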
Linear unit gradient descent training rule

Guaranteed to converge to the minimum squared error hypothesis:
- given a sufficiently small learning rate \( \eta \)
- even when the training data contains noise
- even when the training data is not separable by H

\[
\Delta \vec{w} = -\eta \, \nabla E(\vec{w}), \qquad \Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}
\]

For a linear activation function:
\[
\frac{\partial E}{\partial w_i}
= \frac{\partial}{\partial w_i} \, \frac{1}{2} \sum_{x \in D} \big(t(x) - o(x)\big)^2
= \frac{1}{2} \sum_{x \in D} 2 \big(t(x) - o(x)\big) \frac{\partial}{\partial w_i} \big(t(x) - o(x)\big)
\]
\[
= \sum_{x \in D} \big(t(x) - o(x)\big) \frac{\partial}{\partial w_i} \big(t(x) - \vec{w} \cdot \vec{x}\big)
\]
\[
\frac{\partial E}{\partial w_i} = -\sum_{x \in D} \big(t(x) - o(x)\big)\, x_i
\]
Measuring error for linear units

- Output function: \( o(\vec{x}) = \vec{w} \cdot \vec{x} \)
- Error measure:
\[
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
\]
where \( d \) ranges over the training data \( D \), \( t_d \) is the target value, and \( o_d \) is the linear unit output.
Gradient Descent vs. Perceptrons

- Perceptron Rule & Threshold Units
  – Learner converges on an answer ONLY IF the data is linearly separable
  – Can't assign proper error to parent nodes
- Gradient Descent
  – Minimizes error even if the examples are not linearly separable
  – Works for multi-layer networks
- But… linear units only make linear decision surfaces (can't learn XOR even with many layers)
  – Cannot use the step function, as it isn't differentiable…
  – SO WE NEED NON-LINEAR ACTIVATION FUNCTIONS

Non-Linear activation functions

Important relationship between sigmoid and tanh
(try the proofs)

\[
\frac{1}{1 + e^{-x}} = \frac{1 + \tanh(x/2)}{2}
\]
\[
\tanh(x) = \frac{2}{1 + e^{-2x}} - 1
\]
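Both identities can be checked numerically before attempting the proofs; a small sketch (the function name `sigmoid` is my own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Numerically check the two identities at a spread of points:
#   sigmoid(x) = (1 + tanh(x/2)) / 2
#   tanh(x)    = 2 / (1 + exp(-2x)) - 1   (i.e. 2*sigmoid(2x) - 1)
for x in [-3.0, -0.5, 0.0, 1.0, 4.0]:
    assert abs(sigmoid(x) - (1 + math.tanh(x / 2)) / 2) < 1e-12
    assert abs(math.tanh(x) - (2 / (1 + math.exp(-2 * x)) - 1)) < 1e-12
print("identities hold")
```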
Perceptron rule vs. GD rule (Delta rule)

- Perceptron rule (target − thresholded output) is guaranteed to converge to a separating hyperplane if the problem is linearly separable. Otherwise it may not converge and could get caught in a cycle.
- Single-layer Delta rule is guaranteed to have only one global minimum. Thus it will converge to the best SSE solution whether the problem is linearly separable or not.
  – Could have a higher misclassification rate than with the perceptron rule, and a less intuitive decision surface
  – Stopping criteria: for these models, stop when no longer making progress
  – i.e., when you have gone a few epochs with no significant improvement/change between epochs (including oscillations)
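The epoch-level stopping criterion described above can be sketched as follows; the `tol` and `patience` values and the one-weight toy dataset are assumptions for illustration, not from the slides:

```python
# Epoch loop with the "no significant progress for a few epochs"
# stopping rule. tol and patience are assumed values.

def sse_epoch(w, data, eta):
    # One incremental delta-rule pass; returns updated weights and
    # the SSE accumulated over the pass.
    sse = 0.0
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        sse += 0.5 * (t - o) ** 2
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w, sse

def train_until_converged(w, data, eta, tol=1e-6, patience=3,
                          max_epochs=10000):
    best = float("inf")
    stale = 0
    for epoch in range(max_epochs):
        w, sse = sse_epoch(w, data, eta)
        if best - sse > tol:   # significant improvement this epoch
            best, stale = sse, 0
        else:                  # no real change (includes oscillation)
            stale += 1
            if stale >= patience:
                break
    return w, epoch

data = [([1.0], 2.0)]          # toy: fit a single weight to target 2
w, epochs_run = train_until_converged([0.0], data, eta=0.1)
print(w, epochs_run)
```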
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient \( \nabla E_D[\vec{w}] \), where \( E_D[\vec{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \)
2. \( \vec{w} \leftarrow \vec{w} - \eta \, \nabla E_D[\vec{w}] \)

Incremental mode Gradient Descent:
Do until satisfied:
For each training example \( d \) in \( D \):
1. Compute the gradient \( \nabla E_d[\vec{w}] \), where \( E_d[\vec{w}] = \frac{1}{2} (t_d - o_d)^2 \)
2. \( \vec{w} \leftarrow \vec{w} - \eta \, \nabla E_d[\vec{w}] \)

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if \( \eta \) is made small enough.
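The two modes can be contrasted in code for a single linear unit: batch mode sums the gradient over all of \( D \) before one update, while incremental mode updates after every example. The toy dataset and learning rate are assumptions:

```python
# Batch vs. incremental (stochastic) gradient descent for one linear
# unit, both minimizing E = 1/2 * sum_d (t_d - o_d)^2.

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_step(w, data, eta):
    # One update using the gradient summed over the whole training set.
    grad = [0.0] * len(w)
    for x, t in data:
        err = t - predict(w, x)
        for i, xi in enumerate(x):
            grad[i] -= err * xi   # dE_D/dw_i = -sum_d (t_d - o_d) x_{i,d}
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def incremental_epoch(w, data, eta):
    # One pass that updates after every single example.
    for x, t in data:
        err = t - predict(w, x)
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 3.0)]  # toy targets, assumed
w_batch = [0.0, 0.0]
w_inc = [0.0, 0.0]
for _ in range(500):
    w_batch = batch_step(w_batch, data, eta=0.05)
    w_inc = incremental_epoch(w_inc, data, eta=0.05)
print(w_batch, w_inc)  # both approach [1.0, 3.0]
```

With this small learning rate the two weight trajectories end up essentially identical, illustrating the approximation claim above.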

Error Gradient for a Sigmoid Unit

\[
\frac{\partial E}{\partial w_i}
= \frac{\partial}{\partial w_i} \, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
= \frac{1}{2} \sum_{d} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_{d} (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)
= -\sum_{d} (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}
\]

But we know:
\[
o_d = \sigma(net_d) \;\Rightarrow\; \frac{\partial o_d}{\partial net_d} = o_d (1 - o_d)
\]
\[
net_d = \vec{w} \cdot \vec{x}_d \;\Rightarrow\; \frac{\partial net_d}{\partial w_i} = x_{i,d}
\]

So:
\[
\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d) \, o_d (1 - o_d) \, x_{i,d}
\]
Sigmoid function

Continuous and differentiable, and its derivative is easy to compute.

[Figure: plot of the sigmoid function and its derivative]