Goal is to decrease the overall error (or other objective function) each time a weight is changed
Total sum-squared error is one possible objective function: $E = \sum_i (t_i - z_i)^2$
Seek a weight-changing algorithm that moves the weights along the negative error gradient
If such a formula can be found, then we have a gradient descent learning algorithm
Gradient: $\nabla E[\vec{w}] = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$

Training rule: $\Delta \vec{w} = -\eta \nabla E[\vec{w}]$

i.e., $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
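As a minimal sketch of this training rule, the update $\Delta w = -\eta\, \partial E/\partial w$ can be applied to a toy one-dimensional objective ($E(w) = w^2$, chosen here purely for illustration):

```python
import numpy as np

def gradient_descent_step(w, grad_E, eta=0.1):
    """One gradient descent update: Delta w = -eta * gradient of E at w."""
    return w - eta * grad_E(w)

# Toy objective E(w) = w^2, whose gradient is 2w (illustrative choice).
w = np.array([4.0])
for _ in range(50):
    w = gradient_descent_step(w, lambda v: 2 * v, eta=0.1)
# w is now close to the minimum at 0
```

Each step multiplies the weight by $(1 - 2\eta)$, so with a sufficiently small learning rate the error shrinks monotonically.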
Linear unit gradient descent training rule
Guaranteed to converge to the minimum squared error hypothesis:
– Given a sufficiently small learning rate $\eta$
– Even when training data contains noise
– Even when training data is not separable by $H$

$\Delta \vec{w} = -\eta \nabla E[\vec{w}]$

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{x \in D} (t(x) - o(x))^2 = \frac{1}{2} \sum_{x \in D} 2\,(t(x) - o(x)) \frac{\partial}{\partial w_i}(t(x) - o(x))$

$= \sum_{x \in D} (t(x) - o(x)) \frac{\partial}{\partial w_i}(t(x) - \vec{w} \cdot \vec{x})$

$\frac{\partial E}{\partial w_i} = -\sum_{x \in D} (t(x) - o(x))\, x_i$
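The resulting batch update, $\Delta w_i = \eta \sum_{x \in D} (t(x) - o(x))\,x_i$, can be sketched in NumPy; the toy data and learning rate below are hypothetical:

```python
import numpy as np

def linear_unit_batch_update(w, X, t, eta=0.005):
    """One batch delta-rule step for a linear unit o(x) = w . x:
    Delta w = eta * sum over x in D of (t(x) - o(x)) * x  (negative gradient)."""
    o = X @ w                       # linear unit outputs for all examples
    return w + eta * X.T @ (t - o)  # move opposite the error gradient

# Toy data generated from a known weight vector (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
t = X @ true_w
w = np.zeros(2)
for _ in range(500):
    w = linear_unit_batch_update(w, X, t)
# w converges toward true_w
```

Because the data here are noiseless and generated by a linear unit, repeated updates with a small $\eta$ recover the generating weights.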
Measuring error for linear units
Output function: $o(\vec{x}) = \vec{w} \cdot \vec{x}$
Error measure: $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
where $t_d$ is the training-data target value and $o_d$ is the linear unit output for example $d$
Gradient Descent vs. Perceptrons
Perceptron rule & threshold units
– Learner converges on an answer ONLY IF data is linearly separable
– Can’t assign proper error to parent nodes
Gradient descent
– Minimizes error even if examples are not linearly separable
– Works for multi-layer networks
– But… linear units only make linear decision surfaces (can’t learn XOR even with many layers)
– Cannot use the step function, as it isn’t differentiable…
Important relationship between sigmoid and tanh
(try the proofs)

$\frac{1}{1 + e^{-x}} = \frac{1 + \tanh(x/2)}{2}$

$\tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
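Both identities are easy to check numerically; a sketch using NumPy (the grid of test points is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 101)
# Identity 1: sigma(x) = (1 + tanh(x/2)) / 2
check1 = np.allclose(sigmoid(xs), (1 + np.tanh(xs / 2)) / 2)
# Identity 2: tanh(x) = 2 / (1 + e^(-2x)) - 1
check2 = np.allclose(np.tanh(xs), 2 / (1 + np.exp(-2 * xs)) - 1)
print(check1, check2)  # True True
```

A numerical check is no proof, but it quickly catches sign or scaling errors when deriving the identities by hand.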
Perceptron rule vs. GD rule (Delta rule)
The perceptron rule (target − thresholded output) is guaranteed to converge to a separating hyperplane if the problem is linearly separable. Otherwise it may not converge and could get stuck in a cycle.
The single-layer delta rule is guaranteed to have only one global minimum. Thus it will converge to the best SSE solution whether or not the problem is linearly separable.
– Could have a higher misclassification rate than the perceptron rule, and a less intuitive decision surface
– Stopping criteria: for these models, stop when no longer making progress
– i.e., when you have gone a few epochs with no significant improvement/change between epochs (including oscillations)
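This stopping criterion can be sketched as a small training loop; the `patience` and `tol` names and thresholds below are hypothetical, not from the slides:

```python
def train_until_converged(w, step, error, tol=1e-4, patience=5, max_epochs=10_000):
    """Run epochs of `step` until error stops improving.

    Stops after `patience` consecutive epochs whose improvement over the
    best error so far is below `tol` (hypothetical names and values).
    """
    best = float("inf")
    stale = 0
    for _ in range(max_epochs):
        w = step(w)
        e = error(w)
        if best - e > tol:      # significant progress this epoch
            best, stale = e, 0
        else:                   # no significant change (or oscillation)
            stale += 1
            if stale >= patience:
                break
    return w

# Toy one-dimensional example: each epoch halves w, error is w^2.
w_final = train_until_converged(1.0, lambda w: 0.5 * w, lambda w: w * w)
```

Tracking the best error so far, rather than the previous epoch's error, keeps oscillating runs from counting as progress.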
Incremental (Stochastic) Gradient Descent
Batch mode gradient descent:
Do until satisfied:
1. Compute the gradient $\nabla E_D[\vec{w}]$, where $E_D[\vec{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$
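The two modes can be contrasted in a short sketch: batch mode sums the gradient over all of $D$ before updating, while incremental (stochastic) mode updates after each example. The toy data and learning rates are hypothetical:

```python
import numpy as np

def batch_epoch(w, X, t, eta=0.005):
    """Batch mode: one update from the gradient over the whole set D."""
    return w + eta * X.T @ (t - X @ w)

def incremental_epoch(w, X, t, eta=0.01):
    """Incremental (stochastic) mode: one update per training example d."""
    for x_d, t_d in zip(X, t):
        w = w + eta * (t_d - x_d @ w) * x_d
    return w

# Noiseless toy data from a known linear target (hypothetical example).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
target_w = np.array([1.5, -0.5])
t = X @ target_w
wb = wi = np.zeros(2)
for _ in range(300):
    wb = batch_epoch(wb, X, t)
    wi = incremental_epoch(wi, X, t)
# both wb and wi approach target_w
```

On this noiseless linear problem both modes reach the same minimum; stochastic updates typically need a smaller step per example but make progress within each epoch.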
Error Gradient for a Sigmoid Unit

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \frac{1}{2} \sum_{d} 2\,(t_d - o_d) \frac{\partial}{\partial w_i}(t_d - o_d)$

$= \sum_{d} (t_d - o_d) \left(-\frac{\partial o_d}{\partial w_i}\right) = -\sum_{d} (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}$

But we know:
$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)$
$\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$

So:
$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}$
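The final formula translates directly into code; below is a sketch, with a toy check (hypothetical data and step size) that one gradient step decreases the sum-squared error:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_unit_gradient(w, X, t):
    """dE/dw_i = -sum over d of (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}."""
    o = sigmoid(X @ w)
    return -X.T @ ((t - o) * o * (1 - o))

def sse(w, X, t):
    """Sum-squared error E(w) = 1/2 * sum over d of (t_d - o_d)^2."""
    o = sigmoid(X @ w)
    return 0.5 * np.sum((t - o) ** 2)

# Toy binary targets from a hypothetical linear rule.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
t = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = np.zeros(3)
e0 = sse(w, X, t)
w = w - 0.1 * sigmoid_unit_gradient(w, X, t)
e1 = sse(w, X, t)
print(e1 < e0)  # True
```

The extra $o_d(1 - o_d)$ factor, absent from the linear unit rule, comes from the sigmoid's derivative and shrinks updates for saturated units.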
Sigmoid function
Continuous and differentiable; its derivative is easy to compute.
[Figure: plot of the sigmoid derivative]
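The closed form of that derivative, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, can be checked against a central finite difference; a sketch (the grid and step size are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-4.0, 4.0, 81)
# Closed-form derivative: sigma'(x) = sigma(x) * (1 - sigma(x)).
analytic = sigmoid(xs) * (1 - sigmoid(xs))
# Central finite difference approximation of the derivative.
h = 1e-6
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)
match = np.allclose(analytic, numeric, atol=1e-8)
print(match)  # True
```

The derivative peaks at $0.25$ when $x = 0$ and decays toward zero in both tails, which is what makes saturated sigmoid units learn slowly.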