CH 4
[Read Ch. 4]
[Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]
Threshold units
Gradient descent
Multilayer networks
Backpropagation
Hidden layer representations
Example: Face Recognition
Advanced topics
Connectionist Models
Consider humans:
Neuron switching time ~ 0.001 second
Number of neurons ~ 10^10
Connections per neuron ~ 10^4-5
Scene recognition time ~ 0.1 second
100 inference steps doesn't seem like enough
→ much parallel computation
[Figure: steering network — 30x32 sensor input retina, 4 hidden units, 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right]
Perceptron
[Figure: perceptron — inputs x_1 … x_n with weights w_1 … w_n, plus a bias input x_0 = 1 with weight w_0, feeding a threshold unit]

o = 1 if Σ_{i=0}^{n} w_i x_i > 0, −1 otherwise

i.e.,

o(x_1, …, x_n) = 1 if w_0 + w_1 x_1 + ⋯ + w_n x_n > 0, −1 otherwise.
[Figure: decision surface in the x_1–x_2 plane — (a) a linearly separable set of examples; (b) a set that is not linearly separable]
Perceptron training rule:

w_i ← w_i + Δw_i

where Δw_i = η (t − o) x_i

Where:
t = c(~x) is target value
o is perceptron output
η is small constant (e.g., 0.1) called learning rate
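A minimal sketch of this rule in Python, assuming a bias input x_0 = 1 folded into the weight vector; the AND-style dataset, η = 0.1, and fixed epoch count are illustrative choices:

```python
# Sketch of the perceptron training rule w_i <- w_i + eta*(t - o)*x_i.
# The AND dataset and eta = 0.1 are illustrative assumptions.

def perceptron_output(w, x):
    # o = 1 if sum_i w_i * x_i > 0, else -1
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(examples, eta=0.1, epochs=20):
    n = len(examples[0][0])
    w = [0.0] * n                      # weights, including w0 for the bias
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(w, x)
            # update rule: w_i <- w_i + eta * (t - o) * x_i
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable target: logical AND (each x starts with the bias x0 = 1)
examples = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w = train_perceptron(examples)
print([perceptron_output(w, x) for x, _ in examples])  # -> [-1, -1, -1, 1]
```

Because the examples are linearly separable and η is small, the rule converges to weights that classify all four cases correctly.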
Gradient Descent
To understand, consider simpler linear unit, where
o = w_0 + w_1 x_1 + ⋯ + w_n x_n

Let's learn w_i's that minimize the squared error

E[~w] ≡ (1/2) Σ_{d∈D} (t_d − o_d)²

Where D is set of training examples
Gradient Descent
[Figure: error surface E[~w] over w_0 and w_1 — a paraboloid with a single global minimum]
Gradient

∇E[~w] ≡ [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]

Training rule:

Δ~w = −η ∇E[~w]

i.e.,

Δw_i = −η ∂E/∂w_i
Gradient Descent
∂E/∂w_i = ∂/∂w_i (1/2) Σ_d (t_d − o_d)²

= (1/2) Σ_d ∂/∂w_i (t_d − o_d)²

= (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)

= Σ_d (t_d − o_d) ∂/∂w_i (t_d − ~w · ~x_d)

∂E/∂w_i = Σ_d (t_d − o_d)(−x_{i,d})
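The final identity can be checked numerically against a central-difference derivative; the two-example dataset and weight vector below are illustrative assumptions:

```python
# Verify dE/dw_i = sum_d (t_d - o_d)(-x_{i,d}) for the linear unit o = w . x,
# where E[w] = 1/2 * sum_d (t_d - o_d)^2.
# The dataset and weight vector are illustrative assumptions.

def E(w, data):
    # E[w] = 1/2 * sum over examples of (t - o)^2
    return 0.5 * sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2
                     for x, t in data)

def analytic_grad(w, data, i):
    # sum_d (t_d - o_d) * (-x_{i,d})
    return sum((t - sum(wi * xi for wi, xi in zip(w, x))) * (-x[i])
               for x, t in data)

data = [([1.0, 2.0], 1.0), ([1.0, -1.0], -1.0)]   # pairs (x_d, t_d)
w = [0.3, -0.2]
eps = 1e-6
for i in range(2):
    w_plus = list(w); w_plus[i] += eps
    w_minus = list(w); w_minus[i] -= eps
    numeric = (E(w_plus, data) - E(w_minus, data)) / (2 * eps)
    print(i, analytic_grad(w, data, i), numeric)   # the two values agree
```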
Gradient Descent

Gradient-Descent(training_examples, η)

Each training example is a pair of the form ⟨~x, t⟩, where ~x is the vector of input values, and t is the target output value. η is the learning rate (e.g., 0.05).

Initialize each w_i to some small random value
Until the termination condition is met, Do
  Initialize each Δw_i to zero.
  For each ⟨~x, t⟩ in training_examples, Do
    Input the instance ~x to the unit and compute the output o
    For each linear unit weight w_i, Do
      Δw_i ← Δw_i + η (t − o) x_i
  For each linear unit weight w_i, Do
    w_i ← w_i + Δw_i
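A direct sketch of the routine above for the linear unit; the target function t = 2x − 1, the η value, and the fixed-epoch termination condition are illustrative assumptions:

```python
# Sketch of the Gradient-Descent routine for a linear unit o = w . x
# (bias folded in as x0 = 1). Dataset, eta, and epoch count are assumptions.
import random

def gradient_descent(training_examples, eta=0.05, epochs=500):
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # small random init
    for _ in range(epochs):                # stand-in termination condition
        delta = [0.0] * n                  # initialize each Delta w_i to 0
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))     # linear unit output
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]         # accumulate Dw_i
        w = [wi + dwi for wi, dwi in zip(w, delta)]      # w_i <- w_i + Dw_i
    return w

# Target: t = 2*x - 1, learnable exactly by a linear unit with bias
examples = [([1.0, x], 2 * x - 1) for x in [-1.0, 0.0, 1.0, 2.0]]
w = gradient_descent(examples)
print(w)   # approaches [-1.0, 2.0]
```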
Summary
Perceptron training rule guaranteed to succeed if
  Training examples are linearly separable
  Sufficiently small learning rate η

Linear unit training rule uses gradient descent
  Guaranteed to converge to hypothesis with minimum squared error
  Given sufficiently small learning rate η
  Even when training data contains noise
  Even when training data not separable by H
Batch mode Gradient Descent:
Do until satisfied
  1. Compute the gradient ∇E_D[~w]
  2. ~w ← ~w − η ∇E_D[~w]

Incremental mode Gradient Descent:
Do until satisfied
  For each training example d in D
    1. Compute the gradient ∇E_d[~w]
    2. ~w ← ~w − η ∇E_d[~w]

E_D[~w] ≡ (1/2) Σ_{d∈D} (t_d − o_d)²

E_d[~w] ≡ (1/2) (t_d − o_d)²

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η made small enough
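In code, the only change from batch mode is updating ~w after every example instead of once per pass; the dataset and η here are illustrative assumptions:

```python
# Incremental (stochastic) mode: update w after every example d using the
# gradient of E_d rather than E_D. Dataset and eta are illustrative choices.
import random

def incremental_gradient_descent(training_examples, eta=0.02, epochs=500):
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            # w <- w - eta * grad E_d[w], where grad E_d = (t - o)(-x)
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Same realizable target as batch mode: t = 2*x - 1 with bias input x0 = 1
examples = [([1.0, x], 2 * x - 1) for x in [-1.0, 0.0, 1.0, 2.0]]
w = incremental_gradient_descent(examples)
print(w)   # also approaches [-1.0, 2.0]
```

Because the target is exactly realizable, the per-example updates settle on the same minimum-error weights as batch mode here.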
[Figure: network with formant inputs F1 and F2, distinguishing the vowel sounds in 'head', 'hid', 'who'd', 'hood']
Sigmoid Unit
[Figure: sigmoid unit — inputs x_1 … x_n with weights w_1 … w_n, plus a bias input x_0 = 1 with weight w_0]

net = Σ_{i=0}^{n} w_i x_i

o = σ(net) = 1 / (1 + e^{−net})
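A small sketch of σ, plus a numerical check of the property that makes it convenient for gradient descent: dσ/dnet = σ(net)(1 − σ(net)), which is where the o(1 − o) factors in the backpropagation rules come from. The sample net value is an illustrative choice:

```python
import math

def sigmoid(net):
    # o = sigma(net) = 1 / (1 + e^(-net))
    return 1.0 / (1.0 + math.exp(-net))

# Check d sigma/d net = sigma(net) * (1 - sigma(net)) numerically
# at an illustrative point net = 0.7:
net = 0.7
o = sigmoid(net)
eps = 1e-6
numeric = (sigmoid(net + eps) - sigmoid(net - eps)) / (2 * eps)
print(o * (1 - o), numeric)   # the two values agree
```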
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k
       δ_k ← o_k (1 − o_k) (t_k − o_k)
    3. For each hidden unit h
       δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_{h,k} δ_k
    4. Update each network weight w_{i,j}
       w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}
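The δ rules above can be verified on a tiny network by comparing the backpropagated gradient against a numerical derivative of E = (1/2)(t − o)². The 2-2-1 architecture, weight values, and training example below are all illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 2-2-1 network with bias inputs; all numbers are illustrative.
w_hidden = [[0.1, 0.2, -0.1],   # weights into hidden unit 0 (bias, x1, x2)
            [-0.2, 0.1, 0.3]]   # weights into hidden unit 1
w_out = [0.2, -0.3, 0.1]        # weights into the output unit (bias, h0, h1)
x, t = [1.0, 0.5, -0.4], 1.0    # one training example (x[0] = 1 is the bias)

def forward(wh, wo, x):
    h = [1.0] + [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in wh]
    o = sigmoid(sum(wi * hi for wi, hi in zip(wo, h)))
    return h, o

h, o = forward(w_hidden, w_out, x)

# Backpropagated errors, exactly as in steps 2-3 above:
delta_o = o * (1 - o) * (t - o)
delta_h = [h[j + 1] * (1 - h[j + 1]) * w_out[j + 1] * delta_o
           for j in range(2)]

# Backprop claims -dE/dw_{i,j} = delta_j * x_i for E = 1/2 (t - o)^2.
# Check this for one hidden weight (input x1 into hidden unit 0):
eps = 1e-6
w2 = [row[:] for row in w_hidden]
w2[0][1] += eps
_, o_plus = forward(w2, w_out, x)
w2[0][1] -= 2 * eps
_, o_minus = forward(w2, w_out, x)
numeric = -(0.5 * (t - o_plus) ** 2 - 0.5 * (t - o_minus) ** 2) / (2 * eps)
print(delta_h[0] * x[1], numeric)   # the two values agree
```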
More on Backpropagation
Gradient descent over entire network weight vector

Easily generalized to arbitrary directed graphs

Will find a local, not necessarily global error minimum
  In practice, often works well (can run multiple times)

Often include weight momentum α:
  Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n − 1)

Minimizes error over training examples
  Will it generalize well to subsequent examples?

Training can take thousands of iterations → slow!

Using network after training is very fast
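The momentum update can be sketched on a one-dimensional quadratic error, where its effect is easy to see; η = 0.1, α = 0.9, and E(w) = w² are illustrative choices:

```python
# Sketch of the momentum update Dw(n) = eta * (gradient step) + alpha * Dw(n-1)
# on the 1-D quadratic error E(w) = w^2. eta and alpha are illustrative.

eta, alpha = 0.1, 0.9
w, dw_prev = 5.0, 0.0
for n in range(200):
    grad = 2 * w                        # dE/dw for E(w) = w^2
    dw = -eta * grad + alpha * dw_prev  # keep a fraction alpha of the last step
    w += dw
    dw_prev = dw
print(w)   # settles near the minimum w = 0
```

The α Δw(n − 1) term lets successive steps build up speed along consistent gradient directions, which can help roll through small local irregularities.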
A target function:

Input      →  Output
10000000   →  10000000
01000000   →  01000000
00100000   →  00100000
00010000   →  00010000
00001000   →  00001000
00000100   →  00000100
00000010   →  00000010
00000001   →  00000001
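If the hidden layer has three units (as in the classic version of this example), it can carry this identity function because 3 bits are enough to distinguish the 8 one-hot patterns. A sketch of one such encoding; a trained network's hidden values tend to approximate a code like this, though not necessarily with exact 0/1 values:

```python
# One possible 3-unit hidden encoding for the 8 one-hot input patterns above:
# the index of the active input written in binary. Illustrative only --
# a trained network may discover a different (permuted, approximate) code.

inputs = ["".join("1" if i == j else "0" for i in range(8)) for j in range(8)]
codes = {p: format(j, "03b") for j, p in enumerate(inputs)}
for p in inputs:
    print(p, "->", codes[p])
```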
[Figure: network connecting the eight inputs to the eight outputs through a hidden layer]
Training

[Figure: sum of squared errors for each output unit vs. training iterations (0–2500)]
Training

[Figure: training progress over iterations 0–2500]
Training

[Figure: weights from inputs to one hidden unit vs. training iterations (0–2500)]
Convergence of Backpropagation
Gradient descent to some local minimum
  Perhaps not global minimum...
  Add momentum
  Stochastic gradient descent
  Train multiple nets with different initial weights

Nature of convergence
  Initialize weights near zero
  Therefore, initial networks near-linear
  Increasingly non-linear functions possible as training progresses
[Figure: error vs. number of weight updates (0–20,000)]

[Figure: error vs. number of weight updates (0–6,000)]
[Figure: network with 30x32 inputs]

[Figure: learned weights for the 30x32-input network]
Recurrent Networks
[Figure: recurrent network — output y(t + 1) computed from input x(t) and context units c(t); shown alongside the same network unfolded in time through x(t − 1), c(t − 1), y(t − 1), x(t − 2), c(t − 2)]