Backprop PDF

Backpropagation Learning
15-486/782: Artificial Neural Networks

David S. Touretzky
Fall 2006
LMS / Widrow-Hoff Rule
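$$\Delta w_i = -\eta\,(y - d)\,x_i$$
Here $y$ is the unit's output, $d$ the desired output, $x_i$ the $i$-th input, $w_i$ its weight, and $\eta$ the learning rate.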
Works fine for a single layer of trainable weights.
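As a rough illustration of the rule above, here is a minimal sketch of online LMS training for one linear unit. The function name `lms_train`, the toy data, the learning rate, and the epoch count are all assumptions chosen for the example; they are not taken from the slides.

```python
import random

def lms_train(samples, eta=0.05, epochs=200, seed=0):
    """Train a single linear unit y = sum_i w_i * x_i with the LMS rule."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
    for _ in range(epochs):
        for x, d in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            # Delta w_i = -eta * (y - d) * x_i
            w = [wi - eta * (y - d) * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical toy data generated by d = 2*x1 - x2 + 0.5.
# The third input is a constant 1 so that its weight acts as a bias.
samples = [((x1, x2, 1.0), 2 * x1 - x2 + 0.5)
           for x1 in (-1.0, 0.0, 1.0)
           for x2 in (-1.0, 0.0, 1.0)]
w = lms_train(samples)
print(w)
```

Because the toy targets are exactly linear in the inputs, the learned weights should approach 2, -1, and 0.5.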

What about multi-layer networks?
With Linear Units, Multiple Layers
Don't Add Anything
$$y = U \times (V x) = (U \times V)\,x$$
U: 2×3 matrix; V: 3×4 matrix; U×V: 2×4 matrix.

Linear operators are closed under composition.

Equivalent to a single layer of weights W=U×V
But with non-linear units, extra layers add computational power.

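The collapse of two linear layers into one can be checked numerically. The sketch below uses small integer matrices with the shapes from the slide (U is 2×3, V is 3×4); the particular entries and the helper names `matmul` and `matvec` are made up for illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, x):
    """Product of a matrix and a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Hypothetical weights: U is 2x3, V is 3x4, x has 4 components.
U = [[1, 2, 0],
     [0, -1, 3]]
V = [[1, 0, 2, 1],
     [0, 1, 1, 0],
     [2, 0, 0, 1]]
x = [1, 2, 3, 4]

two_layer = matvec(U, matvec(V, x))   # y = U (V x)
W = matmul(U, V)                      # single 2x4 weight matrix
one_layer = matvec(W, x)              # y = W x
print(two_layer, one_layer)
```

Both computations give the same output vector, so the hidden linear layer adds nothing a single weight matrix could not do.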
What Can be Done with
Non-Linear (e.g., Threshold) Units?
1 layer of trainable weights: separating hyperplane
2 layers of trainable weights: convex polygon region
3 layers of trainable weights: composition of polygons: convex regions

How Do We Train A
Multi-Layer Network?
Output unit y: Error = d − y
Hidden units: Error = ???
Can't use the perceptron training algorithm because we don't know the “correct” outputs for hidden units.

How Do We Train A
Multi-Layer Network?
Define sum-squared error:
$$E = \frac{1}{2} \sum_p (d^p - y^p)^2$$
Use gradient descent error minimization:
$$\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}}$$
Works if the nonlinear transfer function is differentiable.

Deriving the LMS or “Delta” Rule
As Gradient Descent Learning
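$$y = \sum_i w_i x_i$$
$$E = \frac{1}{2} \sum_p (d^p - y^p)^2, \qquad \frac{dE}{dy} = y - d$$
$$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{\partial y}{\partial w_i} = (y - d)\,x_i$$
$$\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} = -\eta\,(y - d)\,x_i$$
How do we extend this to two layers?
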
Switch to Smooth Nonlinear Units
$$\mathit{net}_j = \sum_i w_{ij}\,y_i$$
$$y_j = g(\mathit{net}_j)$$
g must be differentiable.
Common choices for g:
$$g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) = g(x)\cdot(1 - g(x))$$
$$g(x) = \tanh(x), \qquad g'(x) = 1/\cosh^2(x)$$

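The two derivative formulas can be sanity-checked against a central finite difference. This is only a sketch; the helper name `numeric_derivative` and the step size are assumptions made for the check.

```python
import math

def logistic(x):
    """g(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_prime(x):
    """g'(x) = g(x) * (1 - g(x))."""
    g = logistic(x)
    return g * (1.0 - g)

def tanh_prime(x):
    """Derivative of tanh: 1 / cosh^2(x)."""
    return 1.0 / math.cosh(x) ** 2

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 0.5, 3.0):
    print(x,
          logistic_prime(x), numeric_derivative(logistic, x),
          tanh_prime(x), numeric_derivative(math.tanh, x))
```

The analytic and numeric columns should agree to several decimal places at every test point.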
Gradient Descent with Nonlinear Units
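$$y = g(\mathit{net}) = \tanh\Big(\sum_i w_i x_i\Big)$$
$$\frac{dE}{dy} = (y - d), \qquad \frac{dy}{d\,\mathit{net}} = 1/\cosh^2(\mathit{net}), \qquad \frac{\partial\,\mathit{net}}{\partial w_i} = x_i$$
$$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{dy}{d\,\mathit{net}} \cdot \frac{\partial\,\mathit{net}}{\partial w_i} = \frac{(y - d)}{\cosh^2\big(\sum_i w_i x_i\big)} \cdot x_i$$
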
Now We Can Use The Chain Rule
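For output unit k:
$$\frac{\partial E}{\partial y_k} = (y_k - d_k)$$
$$\delta_k = \frac{\partial E}{\partial\,\mathit{net}_k} = (y_k - d_k) \cdot g'(\mathit{net}_k)$$
$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial\,\mathit{net}_k} \cdot \frac{\partial\,\mathit{net}_k}{\partial w_{jk}} = \frac{\partial E}{\partial\,\mathit{net}_k} \cdot y_j$$
For hidden unit j:
$$\frac{\partial E}{\partial y_j} = \sum_k \left( \frac{\partial E}{\partial\,\mathit{net}_k} \cdot \frac{\partial\,\mathit{net}_k}{\partial y_j} \right)$$
$$\delta_j = \frac{\partial E}{\partial\,\mathit{net}_j} = \frac{\partial E}{\partial y_j} \cdot g'(\mathit{net}_j)$$
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial\,\mathit{net}_j} \cdot y_i$$
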
Weight Updates
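$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial\,\mathit{net}_k} \cdot \frac{\partial\,\mathit{net}_k}{\partial w_{jk}} = \delta_k \cdot y_j$$
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial\,\mathit{net}_j} \cdot \frac{\partial\,\mathit{net}_j}{\partial w_{ij}} = \delta_j \cdot y_i$$
$$\Delta w_{jk} = -\eta \cdot \frac{\partial E}{\partial w_{jk}}, \qquad \Delta w_{ij} = -\eta \cdot \frac{\partial E}{\partial w_{ij}}$$
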
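The chain-rule quantities and weight updates can be put together in a small sketch for a network with one hidden layer of tanh units. Several choices here are assumptions rather than anything the slides specify: tanh for both layers, no bias weights, a single training pattern, and the function names and example weights themselves.

```python
import math

def g_prime(net):
    """Derivative of tanh: 1 / cosh^2(net)."""
    return 1.0 / math.cosh(net) ** 2

def forward(w_ij, w_jk, y_i):
    """w_ij[i][j] links input i to hidden j; w_jk[j][k] links hidden j to output k."""
    net_j = [sum(w_ij[i][j] * y_i[i] for i in range(len(y_i)))
             for j in range(len(w_jk))]
    y_j = [math.tanh(n) for n in net_j]
    net_k = [sum(w_jk[j][k] * y_j[j] for j in range(len(y_j)))
             for k in range(len(w_jk[0]))]
    y_k = [math.tanh(n) for n in net_k]
    return net_j, y_j, net_k, y_k

def error(w_ij, w_jk, y_i, d):
    """Sum-squared error for one pattern: E = 1/2 * sum_k (d_k - y_k)^2."""
    y_k = forward(w_ij, w_jk, y_i)[3]
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y_k))

def backprop(w_ij, w_jk, y_i, d):
    """Return (dE/dw_ij, dE/dw_jk) using the delta terms."""
    net_j, y_j, net_k, y_k = forward(w_ij, w_jk, y_i)
    # delta_k = (y_k - d_k) * g'(net_k)
    delta_k = [(yk - dk) * g_prime(nk) for yk, dk, nk in zip(y_k, d, net_k)]
    # dE/dy_j = sum_k delta_k * w_jk, since d net_k / d y_j = w_jk
    dE_dy_j = [sum(delta_k[k] * w_jk[j][k] for k in range(len(delta_k)))
               for j in range(len(y_j))]
    # delta_j = dE/dy_j * g'(net_j)
    delta_j = [dE_dy_j[j] * g_prime(net_j[j]) for j in range(len(y_j))]
    grad_jk = [[delta_k[k] * y_j[j] for k in range(len(delta_k))]
               for j in range(len(y_j))]
    grad_ij = [[delta_j[j] * y_i[i] for j in range(len(delta_j))]
               for i in range(len(y_i))]
    return grad_ij, grad_jk

def gradient_step(w, grad, eta):
    """Apply Delta w = -eta * dE/dw to every weight."""
    return [[wv - eta * gv for wv, gv in zip(w_row, g_row)]
            for w_row, g_row in zip(w, grad)]

# Hypothetical 3-2-1 network and one training pattern.
w_ij = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]
w_jk = [[0.7], [-0.3]]
y_i, d = [1.0, 0.5, -1.0], [0.9]

grad_ij, grad_jk = backprop(w_ij, w_jk, y_i, d)
new_w_ij = gradient_step(w_ij, grad_ij, eta=0.1)
new_w_jk = gradient_step(w_jk, grad_jk, eta=0.1)
print(error(w_ij, w_jk, y_i, d), error(new_w_ij, new_w_jk, y_i, d))
```

With a small learning rate, the error after the step should be lower than before it.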
Function Approximation
[Figure: f(x) plotted against x, built up from bumps.]
Bumps from which we compose f(x): each hidden unit computes $\tanh(w_0 + w_1 x)$.
3n+1 free parameters for n hidden units.

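One way to picture a bump is as the difference of two shifted tanh units, each of the form tanh(w0 + w1·x). The sketch below is an illustration under assumptions: the `center`, `width`, and `steepness` parameters are invented for the example, and the parameter count presumably reads as two input-side weights plus one output weight per hidden unit, plus one output bias.

```python
import math

def bump(x, center=0.0, width=1.0, steepness=4.0):
    """Difference of two tanh units; close to 1 near `center`, close to 0 far away."""
    # Rising edge: tanh(w0 + w1*x) with w1 = steepness, w0 = -steepness*(center - width)
    left = math.tanh(steepness * (x - (center - width)))
    # Falling edge: the same form, shifted to center + width
    right = math.tanh(steepness * (x - (center + width)))
    return 0.5 * (left - right)

def free_parameters(n_hidden):
    """Assumed count: (w0, w1, output weight) per hidden unit, plus one output bias."""
    return 3 * n_hidden + 1

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, round(bump(x), 4))
print(free_parameters(2))
```

The bump here uses two hidden units, so under the assumed counting it has 7 free parameters.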
Encoder Problem
Input patterns: 1 bit on out of N.
Output pattern: same as input.
Only 2 hidden units: bottleneck!
[Figure: hidden-unit space, with axes Hidden Unit 1 and Hidden Unit 2.]

5-2-5 Encoder Problem
Pattern   Training pattern   Hidden code
A         0 0 0 0 1          2, 0
B         0 0 0 1 0          0, 2
C         0 0 1 0 0          1, −1
D         0 1 0 0 0          −1, 1
E         1 0 0 0 0          −1, 0
[Figure: patterns A to E plotted in hidden-unit space (axes Hidden Unit 1 and Hidden Unit 2), with one hidden unit's linear decision boundary drawn.]

Solving XOR
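x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0
Two solutions:
$(x_1 \wedge \bar{x}_2) \vee (\bar{x}_1 \wedge x_2)$
$(x_1 \vee x_2) \wedge \overline{(x_1 \wedge x_2)}$
Try the bpxor demo. Which solution does it use?
[Figure: decision boundaries in the (x1, x2) plane, labeled “OR” and “AND-NOT”.]
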
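Both XOR solutions can be wired by hand with threshold units, which makes the two sets of decision boundaries concrete. The weights and thresholds below are hand-picked assumptions, not values learned by backprop or taken from the bpxor demo.

```python
def step(net):
    """Threshold unit: fires when its net input is positive."""
    return 1 if net > 0 else 0

def xor_or_of_ands(x1, x2):
    """(x1 AND NOT x2) OR (NOT x1 AND x2)."""
    h1 = step(x1 - x2 - 0.5)        # x1 AND NOT x2
    h2 = step(x2 - x1 - 0.5)        # x2 AND NOT x1
    return step(h1 + h2 - 0.5)      # OR of the two hidden units

def xor_or_and_not(x1, x2):
    """(x1 OR x2) AND NOT (x1 AND x2)."""
    h_or = step(x1 + x2 - 0.5)      # x1 OR x2
    h_and = step(x1 + x2 - 1.5)     # x1 AND x2
    return step(h_or - h_and - 0.5) # OR but not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_or_of_ands(x1, x2), xor_or_and_not(x1, x2))
```

Each network uses two hidden units, so either one fits a 2-2-1 architecture.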
Improving Backprop Performance
● Avoiding local minima
● Keep derivatives from going to zero
● For classifiers, use reachable targets
● Compensate for error attenuation in deep layers
● Compensate for fan-in effects
● Use momentum to speed learning
● Reduce learning rate when weights oscillate
● Use small initial random weights and small initial
learning rate to avoid “herd effect”
● Cross-entropy error measure
Avoiding Local Minima
One problem with backprop is that the error surface is no longer bowl-shaped.
Gradient descent can get trapped in local minima.
In practice, this does not usually prevent learning.
“Noise” can get us out of local minima:
● Stochastic update (one pattern at a time).
● Add noise to training data, weights, or activations.
● Large learning rates can be a source of noise due to overshooting.

Flat Spots
If weights become large, net_j becomes large, and the derivative of g() goes to zero.
[Figure: g(x) and g'(x), with the flat spot marked.]
Fahlman's trick: add a small constant to g'(x) to keep the derivative from going to zero. Typical value is 0.1.

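A minimal sketch of the flat-spot fix, assuming a logistic unit: the derivative used in the backward pass gets a constant offset added. The function name `fahlman_prime` is made up for this example; the 0.1 default is the typical value quoted above.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_prime(x):
    """Plain derivative g'(x) = g(x) * (1 - g(x)); vanishes for large |x|."""
    g = logistic(x)
    return g * (1.0 - g)

def fahlman_prime(x, offset=0.1):
    """Derivative with a small constant added so it never reaches zero."""
    return logistic_prime(x) + offset

for net in (0.0, 5.0, 20.0):
    print(net, logistic_prime(net), fahlman_prime(net))
```

At a net input of 20 the plain derivative is effectively zero, while the offset version still passes an error signal back.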
Reachable Targets for Classifiers
Targets of 0 and 1 are the asymptotic values of the logistic function (−1 and +1 for tanh), so the unit can never reach them.
Weights get large as the algorithm tries to force each output unit to reach its asymptotic value.
Trying to get a “correct” output from 0.95 up to 1.0 wastes time and resources that should be concentrated elsewhere.
Solution: use “reachable targets” of 0.1 and 0.9 instead of 0/1. And don't penalize the network for overshooting these targets.

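One possible way to code the reachable-targets idea for a logistic output unit is sketched below. The function name and the convention of returning (y − target) as the error term are assumptions for the sake of the example.

```python
def reachable_target_error(y, d, lo=0.1, hi=0.9):
    """Error term (y - target) with 0/1 labels mapped to reachable targets.

    Returns 0 when the output has already overshot its target, so the
    network is not penalized for being "more correct" than required.
    """
    if d == 1:
        return 0.0 if y >= hi else y - hi
    return 0.0 if y <= lo else y - lo

print(reachable_target_error(0.95, 1))  # overshoot past 0.9: no error
print(reachable_target_error(0.50, 1))  # still below target: negative error
print(reachable_target_error(0.05, 0))  # overshoot below 0.1: no error
print(reachable_target_error(0.30, 0))  # still above target: positive error
```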
Error Signal Attenuation
The error signal δ gets attenuated as it moves backward through multiple layers.
So different layers learn at different rates.
Input-to-hidden weights learn more slowly than hidden-to-output weights.
Solution: have different learning rates η for different layers.

Fan-In Affects Learning Rate
[Figure: network diagram with layers labeled 625, 4, and 20.]
One learning step for y_k changes 4 parameters.
One learning step for y_j changes 625 parameters: big change in net_j results!
Solution: scale learning rate by fan-in.

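The slide does not say exactly how to scale. One common reading, assumed here, is to divide a base learning rate by each unit's fan-in so that a single step changes net_j by a comparable amount regardless of how many incoming weights the unit has. The function name and base rate are hypothetical.

```python
def fan_in_scaled_rate(base_eta, fan_in):
    """Per-unit learning rate, assuming the scaling is division by fan-in."""
    return base_eta / fan_in

base_eta = 0.1  # hypothetical base learning rate
eta_output = fan_in_scaled_rate(base_eta, 4)    # y_k has 4 incoming weights
eta_hidden = fan_in_scaled_rate(base_eta, 625)  # y_j has 625 incoming weights
print(eta_output, eta_hidden)
```

Under this assumption the unit with 625 inputs takes much smaller steps per weight than the unit with 4 inputs.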
Momentum
Learning is slow if the learning rate is set too low.
Gradient may be steep in some directions but shallow in others.
Solution: add a momentum term α.
$$\Delta w_{ij}(t) = -\eta\,\frac{\partial E}{\partial w_{ij}(t)} + \alpha \cdot \Delta w_{ij}(t-1)$$
Typical value for α is 0.5.
If the direction of the gradient remains constant, the algorithm will take increasingly large steps.

Momentum Demo
Hertz, Krogh & Palmer figs. 5.10 and 6.3: gradient descent on a quadratic error surface E (no neural net involved):
$$E = x^2 + 20 y^2$$
$$\frac{\partial E}{\partial x} = 2x, \qquad \frac{\partial E}{\partial y} = 40y$$
Initial [x, y] = [−1, 1] or [1, 1]

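The demo surface is simple enough to try directly. This sketch runs gradient descent on E = x² + 20y² with and without a momentum term; the learning rate of 0.02 and the 50-step budget are assumptions chosen for illustration, not the settings used in the Hertz, Krogh & Palmer figures.

```python
def error(x, y):
    """Quadratic error surface E = x^2 + 20*y^2."""
    return x ** 2 + 20.0 * y ** 2

def descend(eta, alpha, start, steps):
    """Gradient descent with momentum: dw(t) = -eta * grad + alpha * dw(t-1)."""
    x, y = start
    dx_prev, dy_prev = 0.0, 0.0
    for _ in range(steps):
        dx = -eta * 2.0 * x + alpha * dx_prev    # dE/dx = 2x
        dy = -eta * 40.0 * y + alpha * dy_prev   # dE/dy = 40y
        x, y = x + dx, y + dy
        dx_prev, dy_prev = dx, dy
    return x, y

plain = descend(eta=0.02, alpha=0.0, start=(-1.0, 1.0), steps=50)
with_momentum = descend(eta=0.02, alpha=0.5, start=(-1.0, 1.0), steps=50)
print(error(*plain), error(*with_momentum))
```

The shallow x direction limits plain gradient descent, while the momentum run ends with a lower error for the same number of steps.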
Weights Can Oscillate If
Learning Rate Set Too High
Solution: calculate the cosine of the angle between successive weight-change vectors.
$$\cos\theta = \frac{\Delta\vec{w}(t) \cdot \Delta\vec{w}(t-1)}{\|\Delta\vec{w}(t)\| \cdot \|\Delta\vec{w}(t-1)\|}$$
If cosine close to 1, things are going well.
If cosine < 0.95, reduce the learning rate.
If cosine < 0, we're oscillating: cancel the momentum.
$$\Delta\vec{w}(t) = -\eta\,\frac{\partial E}{\partial \vec{w}} + \alpha \cdot \Delta\vec{w}(t-1)$$

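The cosine test can be sketched as a small helper that adjusts the learning rate and momentum after each step. The shrink factor of 0.5, the decision to also reduce the learning rate when the cosine is negative, and the function names are assumptions; the slide only gives the two thresholds.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two weight-change vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: treat a zero-length vector as "no disagreement".
    return dot / norms if norms > 0.0 else 1.0

def adapt(eta, alpha, dw_t, dw_prev, shrink=0.5):
    """Return (eta, alpha) adjusted by the cosine heuristic."""
    c = cosine(dw_t, dw_prev)
    if c < 0.95:
        eta *= shrink   # directions disagree: reduce the learning rate
    if c < 0.0:
        alpha = 0.0     # oscillating: cancel the momentum
    return eta, alpha

print(adapt(0.1, 0.5, [1.0, 0.0], [1.0, 0.0]))   # same direction
print(adapt(0.1, 0.5, [1.0, 0.0], [1.0, 1.0]))   # 45 degrees apart
print(adapt(0.1, 0.5, [1.0, 0.0], [-1.0, 0.0]))  # opposite directions
```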
The “Herd Effect” (Fahlman)
Hidden units all move in the same direction at once, instead of spreading out to divide and conquer.
Solution: use initial random weights, not too large (to avoid flat spots), to encourage units to diversify.
Use a small initial learning rate to give units time to sort out their “specialization” before taking large steps in weight space.
Add hidden units one at a time. (Cascor algorithm.)

Cross-Entropy Error Measure
● Alternative to sum-squared error for binary outputs; diverges when the network gets an output completely wrong.
$$E = \sum_p \left[ d^p \log\frac{d^p}{y^p} + (1 - d^p) \log\frac{1 - d^p}{1 - y^p} \right]$$
● Can produce faster learning for some types of problems.
● Can learn some problems where sum-squared error gets stuck in a local minimum, because it heavily penalizes “very wrong” outputs.

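To see how differently the two measures treat a badly wrong output, here is a small sketch of both. It assumes outputs lie strictly between 0 and 1 and takes the usual convention that a term with a zero coefficient contributes nothing; the function names are made up for the example.

```python
import math

def cross_entropy(targets, outputs):
    """E = sum_p [ d log(d/y) + (1-d) log((1-d)/(1-y)) ], skipping zero-weight terms."""
    total = 0.0
    for d, y in zip(targets, outputs):
        if d > 0.0:
            total += d * math.log(d / y)
        if d < 1.0:
            total += (1.0 - d) * math.log((1.0 - d) / (1.0 - y))
    return total

def sum_squared(targets, outputs):
    """E = 1/2 * sum_p (d - y)^2."""
    return 0.5 * sum((d - y) ** 2 for d, y in zip(targets, outputs))

for y in (0.9, 0.5, 0.1, 0.001):
    print(y, sum_squared([1.0], [y]), cross_entropy([1.0], [y]))
```

As the output for a target of 1 approaches 0, the sum-squared error levels off near 0.5 while the cross-entropy keeps growing.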
How Many Layers Do We Need?
Two layers of weights suffice to compute any “reasonable” function.
But it may require a lot of hidden units!
Why does it work out this way?
Lapedes & Farmer: any reasonable function can be approximated by a linear combination of localized “bumps” that are each nonzero over a small region.
These bumps can be constructed by a network with two layers of weights.

Early Application of Backprop:
From DECtalk to NETtalk
DECtalk was a text-to-speech program that drove a Votrax speech synthesizer board.
Contained 700 rules for English pronunciation, plus a large dictionary of exceptions.
Developed over several years by a team of linguists and programmers.

NETtalk Learns to Read
In 1987, Sejnowski & Rosenberg made national news when they used backprop to “teach” a neural network to “read aloud”.
Output: 23 phonetic feature units plus 3 for stress and syllable boundaries.
Hidden layer: 0-120 units.
Input: 7 letter window containing 7×29 = 203 units.
Training the network with 10,000 weights took 24 hours on a VAX-780 computer. (Today it would take a few minutes.)

Why Was NETtalk Interesting?
No explicit rules. No exception dictionary. Trained in less than a day. Programmers now obsolete!
NETtalk went through “developmental stages” as it learned to read. Analogous to child development?
● CV alternation: “babbling”
● word boundaries recognized: “pseudo-words”
● many words intelligible
● understandable text
(play audio)
Graceful response to “damage” (some weights deleted, or noise added). Rapid recovery with retraining. Analogous to human stroke patients?

Learning Curves for 0-120 Hidden Units
Training set was a 1000 word dictionary corpus; many irregular words.
No hiddens: 82% best guess.
120 hiddens: 98% best guess.
Errors in the “no hidden units” case were often inappropriate.
Hidden units allow for more contextual influence by recognizing higher order features in the input.

Test of Generalization Performance
Initial training: 1000 words, with 120 hidden units.
Testing set was a 20,012 word dictionary.
No additional training: 77% best guess, 28% perfect match.
After 5 training passes: 90% best guess, 48% perfect match.
[Figure caption: Regular rule c->[k] learned earlier than irregular rule c->[s].]

Effects of Damage
Std. dev. of the original, undamaged weights was 1.2.
Random weight perturbations in [−0.5, +0.5] had little effect.
So each weight must convey only a few bits of information.

Relearning After Damage
Relearning was about 10 times faster to achieve similar performance.
Analogy to rapid recovery of language in stroke patients?

Was NETtalk Really Competitive?
Couldn't handle words with context-dependent pronunciations (“lead”) or stresses (“survey”).
Couldn't handle grammatical structure, e.g., questions vs. declarative sentences.
Lacked clever contextual tricks, such as:
● “he dove” vs. “the dove”
● “Dr. Smith” vs. “51 Rodeo Dr.”
But not bad for a seven letter window!
