You are on page 1of 37

Backpropagation Learning

15-486/782: Artificial Neural Networks


David S. Touretzky

Fall 2006

1
LMS / Widrow-Hoff Rule
y

 w i = − y−d x i

wi
xi

Works fine for a single layer of trainable weights.


What about multi-layer networks?

2
With Linear Units, Multiple Layers
Don't Add Anything

y

 U: 2×3 matrix
 = U× V x
y U×V  x
  = 
2×4
 V: 3×4 matrix

x

Linear operators are closed under composition.


Equivalent to a single layer of weights W=U×V

But with non-linear units, extra layers add


computational power. 3
What Can be Done with
Non-Linear (e.g., Threshold) Units?

1 layer of
trainable
weights

separating hyperplane

4
2 layers of
trainable
weights

convex polygon region

5
3 layers of
trainable
weights

composition of polygons:
convex regions

6
How Do We Train A
Multi-Layer Network?

y Error = d-y

Error = ???

Can't use perceptron training algorithm because


we don't know the 'correct' outputs for hidden units.
7
How Do We Train A
Multi-Layer Network?
y

Define sum-squared error:

1
E = ∑ d p− y p 2
2
p

Use gradient descent error minimization:

∂E
 w ij = −
∂ w ij

Works if the nonlinear transfer function is differentiable.

8
Deriving the LMS or “Delta” Rule
As Gradient Descent Learning
y = ∑ wi xi
i

E = 1 ∑ dp−y p 2 dE
2 p
= y−d
dy
y
∂E dE ∂ y
= ⋅ = y−dx i
∂ wi d y ∂ wi

wi
∂E
xi  w i = − = −y−dx i
∂ wi

How do we extend this to two layers?


9
Switch to Smooth Nonlinear Units

net j = ∑ w ij yi
i

y j = gnet j  g must be differentiable

Common choices for g:


1
g x =
1e−x
g 'x = gx⋅1−gx

g x=tanhx
g 'x=1/cosh 2 x
10
Gradient Descent with Nonlinear Units

wi
xi tanh(wixi) y
y=g net=tanh
 ∑
i
wi xi

dE dy 2 ∂ net
=y−d, =1/cosh net, =x i
dy d net ∂ wi

∂E d E d y ∂ net
= ⋅ ⋅
∂ wi d y d net ∂ w i

 
2
= y−d/cosh ∑ w i x i ⋅x i
i

11
Now We Can Use The Chain Rule

∂E
=  y k −d k 
∂ yk
yk
∂E
k = =  y k −d k ⋅g 'net k 
∂ net k
w jk ∂E ∂ E ∂ net k ∂E
= ⋅ = ⋅y
∂ w jk ∂ net k ∂ w jk ∂ net k j
yj

w ij
∂E
∂yj
= ∑
k
 ∂ E ∂ net k

∂ net k ∂ y j 
∂E ∂E
j = = ⋅g 'net j 
∂ net j ∂yj
yi ∂E ∂E
= ⋅y i
∂ w ij ∂ net j

12
Weight Updates

∂E ∂E ∂ net k
= ⋅ =  k⋅y j
∂ w jk ∂ net k ∂ w jk

∂E ∂ E ∂ net j
= ⋅ =  j⋅y i
∂ w ij ∂ net j ∂ w ij

∂E ∂E
 w jk = −⋅  w ij = −⋅
∂ w jk ∂ w ij

13
Function Approximation
y

1
Bumps from
which we
compose
1 1 1 1 f(x)
tanhw0w 1 x

x
3n+1 free parameters for n hidden units
14
Encoder Problem

Hidden
Unit 2

Input patterns: 1 bit on out of N. Hidden


Output pattern: same as input. Unit 1
Only 2 hidden units: bottleneck! 15
5-2-5 Encoder Problem
Training patterns: Hidden code:
A: 0 0 0 0 1 2, 0
B: 0 0 0 1 0 0, 2
C: 0 0 1 0 0 1,−1
D: 0 1 0 0 0 −1, 1
E: 1 0 0 0 0 −1, 0

A
Hidden
Unit 2 E B

D C One hidden unit's


linear decision boundary
Hidden Unit 1 16
Solving XOR
x1 x2 y
0 0 0 Two solutions: Try the bpxor demo.
0 1 1 x 1 x2∨x 1 x 2 Which solution
1 0 1
x 1∨x2 ∧x 1∧x 2 does it use?
1 1 0

decision boundaries

“OR” “AND-NOT”
x2

x1 17
Improving Backprop Performance
● Avoiding local minima
● Keep derivatives from going to zero
● For classifiers, use reachable targets
● Compensate for error attenuation in deep layers
● Compensate for fan-in effects
● Use momentum to speed learning
● Reduce learning rate when weights oscillate
● Use small initial random weights and small initial
learning rate to avoid “herd effect”
● Cross-entropy error measure
18
Avoiding Local Minima

One problem with backprop is that the error surface


is no longer bowl-shaped.

Gradient descent can get trapped in local minima.


In practice, this does not usually prevent learning.

“Noise” can get us out of local minima:


Stochastic update (one pattern at a time).
Add noise to training data, weights, or activations.
Large learning rates can be a source of noise due to
overshooting.

19
Flat Spots

If weights become large, netj becomes large,


derivative of g() goes to zero.
flat spot

g(x) g'(x)

Fahlman's trick: add a small constant to g'(x) to


keep the derivative from going to zero. Typical
value is 0.1. 20
Reachable Targets for Classifiers

Targets of 0 and 1 are unreachable by the logistic or


tanh functions.

Weights get large as the algorithm tries to force


each output unit to reach its asymptotic value.
Trying to get a “correct” output from 0.95 up to 1.0
wastes time and resources that should be
concentrated elsewhere.

Solution: use “reachable targets” of 0.1 and 0.9


instead of 0/1. And don't penalize the network for
overshooting these targets.
21
Error Signal Attenuation

The error signal  gets attenuated as it moves


backward through multiple layers.

So different layers learn at different rates.

Input-to-hidden weights learn more slowly than


hidden-to-output weights.

Solution: have different learning rates  for


different layers.

22
Fan-In Affects Learning Rate

20
One learning step for yk
changes 4 parameters.
4
One learning step for yj
changes 625 parameters:
big change in netj results!
625

Solution: scale learning rate by fan-in.

23
Momentum

Learning is slow if the learning rate is set too low.


Gradient may be steep in some directions but
shallow in others.
Solution: add a momentum term .

∂E
 w ij t = −  ⋅ w ij t−1
∂ w ij t

Typical value for  is 0.5.


If the direction of the gradient remains constant,
the algorithm will take increasingly large steps.
24
Momentum Demo

Hertz, Krogh & Palmer figs. 5.10 and 6.3: gradient


descent on a quadratic error surface E (no neural
net) involved:
E = x 2  20 y 2

∂E ∂E
= 2x , = 40y
∂x ∂y

Initial [x , y ]=[−1,1] or [1,1]

25
Weights Can Oscillate If
Learning Rate Set Too High
Solution: calculate the cosine of the angle between
successive weight vectors.
 w t ⋅ 
  w t−1
cos  =
 w t∥⋅∥
∥  w t−1∥

If cosine close to 1, things are going well.


If cosine < 0.95, reduce the learning rate.
If cosine < 0, we're oscillating: cancel the
momentum.

∂E
 w t = −  ⋅ w t−1
∂w
26
The “Herd Effect” (Fahlman)

Hidden units all move in the same direction at once,


instead of spreading out to divide and conquer.

Solution: use initial random weights, not too large


(to avoid flat spots), to encourage units to diversify.

Use a small initial learning rate to give units time to


sort out their “specialization” before taking large
steps in weight space.

Add hidden units one at a time. (Cascor algorithm.)

27
Cross-Entropy Error Measure
●Alternative to sum-squared error for binary
outputs; diverges when the network gets an output
completely wrong.

[ ]
p p
p d p 1−d
E = ∑ d log p  1−d  log
y 1−y
p
p

●Can produce faster learning for some types of


problems.
●Can learn some problems where sum-squared
error gets stuck in a local minimum, because it
heavily penalizes “very wrong” outputs.

28
How Many Layers Do We Need?

Two layers of weights suffice to compute any


“reasonable” function.
But it may require a lot of hidden units!

Why does it work out this way?


Lapedes & Farmer: any reasonable function can be
approximated by a linear combination of localized
“bumps” that are each nonzero over a small region.
These bumps can be constructed by a network with
two layers of weights.

29
Early Application of Backprop:
From DECtalk to NETtalk
DECtalk was a text-to-speech program that drove a
Votrax speech synthesizer board.

Contained 700 rules for English pronunciation, plus


a large dictionary of exceptions.

Developed over several years by a team of linguists


and programmers.

30
NETtalk Learns to Read

In 1987, Sejnowski & Rosenberg made national


news when they used backprop to “teach” a neural
network to “read aloud”.

Output: 23 phonetic feature units


plus 3 for stress, syll. boundaries.

Hidden layer: 0-120 units.

Input: 7 letter window containing


7x29 = 206 units.

Training the network with 10,000 weights took 24


hours on a VAX-780 computer. (Today it would take
a few minutes.)
31
Why Was NETtalk Interesting?

No explicit rules. No exception dictionary. Trained


in less than a day. Programmers now obsolete!
NETtalk went through “developmental stages” as it
learned to read. Analogous to child development?
CV alternation: “babbling”
word boundaries recognized: “pseudo-words”
many words intelligible
understandable text
(play audio)

Graceful response to “damage” (some weights


deleted, or noise added.) Rapid recovery with
retraining. Analagous to human stroke patients?
32
Learning Curves for 0-120 Hidden Units

Training set was a 1000 word


dictionary corpus; many
irregular words.
No hiddens: 82% best guess.
120 hiddens: 98% best guess.
Errors in the no “hidden units”
case were often inappropriate.
Hidden units allow for more
contextual influence by
recognizing higher order
features in the input.

33
Test of Generalization Performance
Initial training: 1000 words,
with 120 hidden units.
Testing set was a 20,012
word dictionary.

No additional training: Regular rule c->[k]


77% best guess learned earlier than
28% perfect match irregular rule c->[s]

After 5 training passes:


90% best guess
48% perfect match
34
Effects of Damage
Std. dev. of the
original, undamaged
weights was 1.2

Random weight
perturbations in
[-.5,+.5] had little
effect.

So each weight must


convey only a few
bits of information.

35
Relearning After Damage

Relearning was about


10 times faster to
achieve similar
performance.

Analogy to rapid
recovery of language
in stroke patients?

36
Was NETtalk Really Competitive?

Couldn't handle words with context-dependent


pronunciations (“lead”) or stresses (“survey”).

Couldn't handle grammatical structure, e.g.,


questions vs. declarative sentences.

Lacked clever contextual tricks, such as:


“he dove” vs. “the dove”
“Dr. Smith” vs. “51 Rodeo Dr.”

But not bad for a seven letter window!


37

You might also like