Backprop
Fall 2006
LMS / Widrow-Hoff Rule
[Figure: a single linear unit with inputs x_i, weights w_i, and output y]

Δw_i = −η (y − d) x_i
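A minimal NumPy sketch of this update (the learning rate η = 0.1 and the example data are illustrative, not from the slides):

```python
import numpy as np

def lms_step(w, x, d, eta=0.1):
    """One LMS / Widrow-Hoff update: w_i <- w_i - eta * (y - d) * x_i."""
    y = np.dot(w, x)                 # linear unit output
    return w - eta * (y - d) * x, y

# Illustrative usage: learn d = 2*x1 - x2 from random examples.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(1000):
    x = rng.normal(size=2)
    w, _ = lms_step(w, x, 2 * x[0] - x[1])
print(w)   # approaches [2, -1]
```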
With Linear Units, Multiple Layers Don't Add Anything
[Figure: a two-layer linear network x → V → U → y]

y = U (V x) = (U V) x,  where V is a 3×4 matrix, U is a 2×3 matrix, and the product U V is a single 2×4 matrix.
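A quick NumPy check of this point, with random matrices of the dimensions shown (values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(2, 3))   # second layer of linear weights
V = rng.normal(size=(3, 4))   # first layer of linear weights
x = rng.normal(size=4)

# Two layers of linear units compute the same map as the single 2x4 matrix U @ V.
print(np.allclose(U @ (V @ x), (U @ V) @ x))   # True
```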
1 layer of trainable weights: separating hyperplane

2 layers of trainable weights

3 layers of trainable weights: composition of polygons (convex regions)
How Do We Train A Multi-Layer Network?
[Figure: multi-layer network; the output unit's error is d − y, but the error at the hidden units is unknown (Error = ???)]

E = ½ ∑_p (d_p − y_p)²

Δw_ij = −η ∂E/∂w_ij
Deriving the LMS or “Delta” Rule As Gradient Descent Learning
[Figure: a linear unit with inputs x_i, weights w_i, and output y]

y = ∑_i w_i x_i

E = ½ ∑_p (d_p − y_p)²,   dE/dy = y − d

∂E/∂w_i = (dE/dy) · (∂y/∂w_i) = (y − d) x_i

Δw_i = −η ∂E/∂w_i = −η (y − d) x_i
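The derivative ∂E/∂w_i = (y − d) x_i can be sanity-checked against a finite-difference estimate; a small sketch for a single training pattern (values arbitrary):

```python
import numpy as np

def error(w, x, d):
    y = np.dot(w, x)              # linear unit
    return 0.5 * (d - y) ** 2     # squared error, one pattern

rng = np.random.default_rng(1)
w, x, d = rng.normal(size=3), rng.normal(size=3), 0.7

analytic = (np.dot(w, x) - d) * x    # the delta-rule gradient derived above

eps = 1e-6
numeric = np.array([
    (error(w + eps * np.eye(3)[i], x, d) - error(w - eps * np.eye(3)[i], x, d)) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric))   # True
```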
net_j = ∑_i w_ij y_i

g(x) = tanh(x),   g′(x) = 1/cosh²(x)
Gradient Descent with Nonlinear Units
[Figure: a unit with inputs x_i, weights w_i, and output y = tanh(∑_i w_i x_i)]

y = g(net) = tanh(∑_i w_i x_i)

dE/dy = y − d,   dy/dnet = 1/cosh²(net),   ∂net/∂w_i = x_i

∂E/∂w_i = (dE/dy) · (dy/dnet) · (∂net/∂w_i) = ((y − d) / cosh²(∑_i w_i x_i)) · x_i
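The same gradient written out in NumPy for a single tanh unit (weights, inputs, and learning rate are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
w, x, d = rng.normal(size=3), rng.normal(size=3), 0.5

net = np.dot(w, x)
y = np.tanh(net)

grad = (y - d) / np.cosh(net) ** 2 * x   # dE/dw_i = (y - d) / cosh^2(net) * x_i
w = w - 0.1 * grad                       # one gradient-descent step (eta = 0.1)
```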
Now We Can Use The Chain Rule
[Figure: unit y_i feeds hidden unit y_j through weight w_ij; y_j feeds output unit y_k through weight w_jk]

Output units:
∂E/∂y_k = y_k − d_k
δ_k = ∂E/∂net_k = (y_k − d_k) · g′(net_k)
∂E/∂w_jk = (∂E/∂net_k) · (∂net_k/∂w_jk) = (∂E/∂net_k) · y_j

Hidden units:
∂E/∂y_j = ∑_k (∂E/∂net_k) · (∂net_k/∂y_j)
δ_j = ∂E/∂net_j = (∂E/∂y_j) · g′(net_j)
∂E/∂w_ij = (∂E/∂net_j) · y_i
Weight Updates
∂E/∂w_jk = (∂E/∂net_k) · (∂net_k/∂w_jk) = δ_k · y_j

∂E/∂w_ij = (∂E/∂net_j) · (∂net_j/∂w_ij) = δ_j · y_i

Δw_jk = −η · ∂E/∂w_jk,    Δw_ij = −η · ∂E/∂w_ij
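Putting the δ_k, δ_j, and weight-update formulas together for one hidden layer of tanh units; a minimal sketch (layer sizes, learning rate, and training data are illustrative, and biases are omitted):

```python
import numpy as np

def g(x):  return np.tanh(x)
def gp(x): return 1.0 / np.cosh(x) ** 2        # g'(x)

rng = np.random.default_rng(3)
W_ij = rng.normal(scale=0.1, size=(4, 3))      # input  -> hidden weights w_ij
W_jk = rng.normal(scale=0.1, size=(3, 2))      # hidden -> output weights w_jk
eta = 0.05

x = np.array([0.5, -1.0, 0.3, 0.8])            # one input pattern y_i
d = np.array([0.6, -0.4])                      # its target d_k

for step in range(200):
    # forward pass
    net_j = x @ W_ij;   y_j = g(net_j)
    net_k = y_j @ W_jk; y_k = g(net_k)
    # backward pass: the deltas from the slides
    delta_k = (y_k - d) * gp(net_k)            # delta_k = (y_k - d_k) g'(net_k)
    delta_j = (W_jk @ delta_k) * gp(net_j)     # delta_j = g'(net_j) * sum_k delta_k w_jk
    # weight updates: Delta w = -eta * delta * presynaptic activity
    W_jk -= eta * np.outer(y_j, delta_k)
    W_ij -= eta * np.outer(x, delta_j)

print(0.5 * np.sum((d - g(g(x @ W_ij) @ W_jk)) ** 2))   # error approaches zero
```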
Function Approximation
[Figure: a target function f(x) approximated by composing tanh bumps of the form tanh(w_0 + w_1 x)]

3n+1 free parameters for n hidden units
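Where the 3n+1 count comes from: a 1-input, n-hidden-unit, 1-output tanh network has n input weights, n hidden biases, n output weights, and one output bias. A hypothetical example in NumPy:

```python
import numpy as np

n = 5                                   # hidden units
rng = np.random.default_rng(4)
w1 = rng.normal(size=n)                 # input weights   (n)
b1 = rng.normal(size=n)                 # hidden biases   (n)
w2 = rng.normal(size=n)                 # output weights  (n)
b2 = 0.0                                # output bias     (1)  -> 3n + 1 parameters

def f_hat(x):
    # each hidden unit contributes one scaled tanh step; pairs of steps form bumps
    return np.dot(w2, np.tanh(w1 * x + b1)) + b2

print(3 * n + 1, "free parameters")     # 16 for n = 5
print(f_hat(0.3))
```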
Encoder Problem
[Figure: encoder problem; the patterns A–E plotted by their hidden unit activations, with decision boundaries labeled “OR” and “AND-NOT” over the inputs x1, x2]
Improving Backprop Performance
● Avoiding local minima
● Keep derivatives from going to zero
● For classifiers, use reachable targets
● Compensate for error attenuation in deep layers
● Compensate for fan-in effects
● Use momentum to speed learning
● Reduce learning rate when weights oscillate
● Use small initial random weights and small initial learning rate to avoid “herd effect”
● Cross-entropy error measure
Avoiding Local Minima
Flat Spots
[Figure: plots of g(x) and its derivative g′(x)]
Fan-In Affects Learning Rate
[Figure: a network in which unit y_k has a fan-in of 4 and unit y_j has a fan-in of 625]

One learning step for y_k changes 4 parameters.
One learning step for y_j changes 625 parameters: big change in net_j results!
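One common way to compensate for fan-in effects (an illustrative choice, not something these slides prescribe) is to scale each unit's learning rate by 1/fan-in, so that a learning step changes net_j by a comparable amount whatever the fan-in:

```python
import numpy as np

rng = np.random.default_rng(5)
eta, delta_j = 0.1, 0.5                    # base learning rate and an error signal at the unit

for fan_in in (4, 625):
    y_i = rng.normal(size=fan_in)          # presynaptic activities
    dw_plain  = -eta * delta_j * y_i                # plain update of all incoming weights
    dw_scaled = -(eta / fan_in) * delta_j * y_i     # fan-in-compensated update
    # resulting change in net_j = sum_i dw_ij * y_i
    print(fan_in, np.dot(dw_plain, y_i), np.dot(dw_scaled, y_i))
# plain: the change in net_j grows roughly with fan-in; scaled: it stays comparable
```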
Momentum
Δw_ij(t) = −η · ∂E/∂w_ij(t) + α · Δw_ij(t−1)
Example: for the error surface E = x² + 20y² (so ∂E/∂x = 2x and ∂E/∂y = 40y), plain gradient descent oscillates across the steep y direction; momentum damps the oscillation while accelerating progress along the shallow x direction.
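A sketch of the momentum update on that surface (the values of η, α, and the starting point are illustrative):

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, 40 * y])    # dE/dx = 2x, dE/dy = 40y

eta, alpha = 0.04, 0.9
p  = np.array([5.0, 1.0])               # starting point
dp = np.zeros(2)                        # previous weight change, Delta w(t-1)

for t in range(300):
    dp = -eta * grad(p) + alpha * dp    # Delta w(t) = -eta dE/dw + alpha Delta w(t-1)
    p  = p + dp

print(p)   # converges to the minimum at (0, 0)
```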
Weights Can Oscillate If Learning Rate Set Too High
Solution: calculate the cosine of the angle between successive weight vectors.

cos θ = (Δw(t) · Δw(t−1)) / (‖Δw(t)‖ · ‖Δw(t−1)‖)

Δw(t) = −η · ∂E/∂w + α · Δw(t−1)
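One simple way to use that cosine (halving η when successive updates point in opposing directions is an illustrative rule, not one the slide prescribes):

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def grad(p):                            # the same elongated surface, E = x^2 + 20y^2
    x, y = p
    return np.array([2 * x, 40 * y])

eta = 0.06                              # deliberately too large for the steep y direction
p = np.array([5.0, 1.0])
prev_dw = None

for t in range(400):
    dw = -eta * grad(p)
    if prev_dw is not None and cosine(dw, prev_dw) < 0.0:
        eta *= 0.5                      # successive updates oppose each other: oscillation
    p = p + dw
    prev_dw = dw

print(eta, p)   # the learning rate shrinks until the oscillation stops; p approaches (0, 0)
```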
The “Herd Effect” (Fahlman)
Cross-Entropy Error Measure
● Alternative to sum-squared error for binary outputs; diverges when the network gets an output completely wrong.

E = ∑_p [ d_p log(d_p / y_p) + (1 − d_p) log((1 − d_p) / (1 − y_p)) ]
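A small numeric illustration of that divergence, using the formula above with the convention 0·log 0 = 0 (targets and outputs below are arbitrary):

```python
import numpy as np

def cross_entropy(d, y):
    # E = sum_p [ d_p log(d_p / y_p) + (1 - d_p) log((1 - d_p) / (1 - y_p)) ]
    e = 0.0
    for dp, yp in zip(d, y):
        if dp > 0:
            e += dp * np.log(dp / yp)
        if dp < 1:
            e += (1 - dp) * np.log((1 - dp) / (1 - yp))
    return e

d = [1.0, 0.0]
print(cross_entropy(d, [0.9, 0.1]))        # small error
print(cross_entropy(d, [0.5, 0.5]))        # larger error
print(cross_entropy(d, [1e-6, 1 - 1e-6]))  # outputs completely wrong: error blows up
```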
How Many Layers Do We Need?
Early Application of Backprop:
From DECtalk to NETtalk
DECtalk was a text-to-speech program that drove a
Votrax speech synthesizer board.
NETtalk Learns to Read
Test of Generalization Performance
Initial training: 1000 words, with 120 hidden units.
Testing set was a 20,012-word dictionary.
Random weight perturbations in [−.5, +.5] had little effect.
Relearning After Damage
Analogy to rapid recovery of language in stroke patients?
Was NETtalk Really Competitive?