
Back-Propagation Algorithm

 Perceptron
 Gradient Descent
Multi-layered neural network
 Back-Propagation
 More on Back-Propagation
 Examples

Inner-product
net = \langle \vec{w}, \vec{x} \rangle = \|\vec{w}\| \, \|\vec{x}\| \cos(\theta)

net = \sum_{i=1}^{n} w_i x_i

A measure of the projection of one vector onto another

Activation function
o = f(net) = f\left(\sum_{i=1}^{n} w_i x_i\right)

f(x) := \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ -1 & \text{if } x < 0 \end{cases}

f(x) := \theta(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}

f(x) := \begin{cases} 1 & \text{if } x \geq 0.5 \\ x & \text{if } -0.5 < x < 0.5 \\ 0 & \text{if } x \leq -0.5 \end{cases}

Sigmoid function:

f(x) := \sigma(x) = \frac{1}{1 + e^{-a x}}
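As a minimal sketch (assuming NumPy; the slope parameter a is the one in the formula above), the sigmoid and its derivative could be written as:

```python
import numpy as np

def sigmoid(x, a=1.0):
    """Logistic activation: f(x) = 1 / (1 + exp(-a*x))."""
    return 1.0 / (1.0 + np.exp(-a * x))

def sigmoid_prime(x, a=1.0):
    """Derivative a * f(x) * (1 - f(x)), needed later for gradient descent."""
    s = sigmoid(x, a)
    return a * s * (1.0 - s)
```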

Gradient Descent
To understand gradient descent, consider a simpler linear unit, where

o = \sum_{i=0}^{n} w_i x_i

Let's learn the w_i that minimize the squared error over the training set D = \{(x_1, t_1), (x_2, t_2), \ldots, (x_d, t_d), \ldots, (x_m, t_m)\} (t for target)

Error for different hypotheses, for w_0 and w_1 (2 dimensions)

We want to move the weight vector in the direction that decreases E:

w_i = w_i + \Delta w_i \qquad \vec{w} = \vec{w} + \Delta \vec{w}

Differentiating E

Update rule for gradient descent:

\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id}
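A minimal sketch of this batch rule for a single linear unit (hypothetical NumPy code; X is an m-by-n matrix whose rows are the inputs x_d, t the vector of targets, eta the learning rate η):

```python
import numpy as np

def batch_gradient_descent_step(w, X, t, eta=0.05):
    """One batch update: Delta w_i = eta * sum over d of (t_d - o_d) * x_id."""
    o = X @ w                      # outputs of the linear unit for all examples
    delta_w = eta * X.T @ (t - o)  # error-weighted inputs, summed over d in D
    return w + delta_w
```

Repeating this step moves w toward a minimum of the squared error E.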

Stochastic Approximation to Gradient Descent

\Delta w_i = \eta (t - o) x_i

The gradient descent training rule sums the updates over all training examples in D
Stochastic gradient descent approximates this by updating the weights incrementally
The error is calculated for each example
Known as the delta rule or LMS (least mean-square) weight update (sketched below)
The Adaline rule, used for adaptive filters, Widrow and Hoff (1960)
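By contrast with the batch rule above, the incremental delta-rule update processes one example at a time; a sketch with the same hypothetical names:

```python
import numpy as np

def delta_rule_step(w, x, t, eta=0.05):
    """LMS / delta-rule update for a single example: Delta w_i = eta * (t - o) * x_i."""
    o = np.dot(w, x)   # output of the linear unit for this example
    return w + eta * (t - o) * x
```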

XOR problem and Perceptron
By Minsky and Papert in the mid-1960s

Multi-layer Networks
The limitations of the simple perceptron do not apply to feed-forward networks with intermediate ("hidden") nonlinear units
A network with just one hidden layer can represent any Boolean function
The great power of multi-layer networks was realized long ago
But it was only in the eighties that it was shown how to make them learn

Multiple layers of cascaded linear units still produce only linear functions
We therefore look for networks capable of representing nonlinear functions
Units should use nonlinear activation functions
Examples of nonlinear activation functions

XOR-example

 Back-propagation is a learning algorithm for
multi-layer neural networks
 It was invented independently several times
Bryson and Ho [1969]
 Werbos [1974]
 Parker [1985]
 Rumelhart et al. [1986]

Parallel Distributed Processing - Vol. 1: Foundations
David E. Rumelhart, James L. McClelland and the PDP Research Group

"What makes people smarter than computers? These volumes by a pioneering neurocomputing..."

Back-propagation
The algorithm gives a prescription for changing the weights w_{ij} in any feed-forward network to learn a training set of input-output pairs \{x^d, t^d\}

We consider a simple two-layer network

[Figure: a two-layer network with five inputs x_1, \ldots, x_5 (the x_k), a layer of hidden units, and output units]

Given the pattern x^d, the hidden unit j receives a net input

net_j^d = \sum_{k=1}^{5} w_{jk} x_k^d

and produces the output

V_j^d = f(net_j^d) = f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)

Output unit i thus receives

net_i^d = \sum_{j=1}^{3} W_{ij} V_j^d = \sum_{j=1}^{3} \left( W_{ij} \, f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right) \right)

and produces the final output

o_i^d = f(net_i^d) = f\left(\sum_{j=1}^{3} W_{ij} V_j^d\right) = f\left(\sum_{j=1}^{3} \left( W_{ij} \, f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right) \right)\right)
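For concreteness, a sketch of this forward pass in NumPy (assuming the 5-3-2 example network implied by the sums above, with w the 3x5 input-to-hidden matrix, W the 2x3 hidden-to-output matrix, and a sigmoid f; all names are illustrative):

```python
import numpy as np

def f(net, a=1.0):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-a * net))

def forward(x, w, W):
    """Forward pass: x has shape (5,), w is (3, 5) input-to-hidden, W is (2, 3) hidden-to-output."""
    net_hidden = w @ x      # net_j = sum_k w_jk * x_k
    V = f(net_hidden)       # hidden outputs V_j
    net_out = W @ V         # net_i = sum_j W_ij * V_j
    o = f(net_out)          # final outputs o_i
    return net_hidden, V, net_out, o
```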
Our usual error function:

For l outputs and m input-output pairs \{x^d, t^d\}:

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{l} (t_i^d - o_i^d)^2

In our example (two outputs) E becomes

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d)^2

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left( t_i^d - f\left(\sum_{j=1}^{3} W_{ij} \, f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right) \right)^2

E[\vec{w}] is differentiable given that f is differentiable
Gradient descent can therefore be applied

For hidden-to-output connections the gradient descent rule gives:

\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = -\eta \sum_{d=1}^{m} (t_i^d - o_i^d) \, f'(net_i^d) \cdot (-V_j^d)

\Delta W_{ij} = \eta \sum_{d=1}^{m} (t_i^d - o_i^d) \, f'(net_i^d) \, V_j^d

Defining \delta_i^d = f'(net_i^d)(t_i^d - o_i^d), this becomes

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d \, V_j^d
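As a sketch (hypothetical NumPy code continuing the forward pass above, here for the contribution of a single pattern; the slide sums these contributions over all patterns d):

```python
import numpy as np

def f_prime(net, a=1.0):
    """Derivative of the sigmoid: a * f(net) * (1 - f(net))."""
    s = 1.0 / (1.0 + np.exp(-a * net))
    return a * s * (1.0 - s)

def hidden_to_output_update(W, V, net_out, o, t, eta=0.05):
    """delta_i = f'(net_i) * (t_i - o_i);  Delta W_ij = eta * delta_i * V_j."""
    delta_out = f_prime(net_out) * (t - o)     # output deltas, shape (2,)
    W_new = W + eta * np.outer(delta_out, V)   # hidden-to-output update
    return W_new, delta_out
```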

For the input-to-hidden connections w_{jk} we must differentiate with respect to w_{jk}
Using the chain rule we obtain

\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d} \cdot \frac{\partial V_j^d}{\partial w_{jk}}

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d) \, f'(net_i^d) \, W_{ij} \, f'(net_j^d) \cdot x_k^d

With \delta_i^d = f'(net_i^d)(t_i^d - o_i^d) as before,

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \delta_i^d \, W_{ij} \, f'(net_j^d) \cdot x_k^d

\delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij} \delta_i^d

\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d \cdot x_k^d

In summary:

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d \, V_j^d

\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d \cdot x_k^d

We obtain updates of the same form, but with a different definition of δ
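Continuing the same hypothetical sketch, the hidden deltas and the input-to-hidden update for a single pattern (sigmoid derivative written inline, with slope 1):

```python
import numpy as np

def input_to_hidden_update(w, W, x, net_hidden, delta_out, eta=0.05):
    """delta_j = f'(net_j) * sum_i W_ij * delta_i;  Delta w_jk = eta * delta_j * x_k."""
    s = 1.0 / (1.0 + np.exp(-net_hidden))             # sigmoid at the hidden nets
    delta_hidden = s * (1.0 - s) * (W.T @ delta_out)  # errors propagated backward through W
    w_new = w + eta * np.outer(delta_hidden, x)       # input-to-hidden update
    return w_new, delta_hidden
```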

In general, with an arbitrary number of layers, the back-propagation update rule always has the form

\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{output} \cdot V_{input}

where "output" and "input" refer to the two ends of the connection concerned
V stands for the appropriate input (hidden-unit output or real input x^d)
δ depends on the layer concerned

The equation

\delta_j^d = f'(net_j^d) \sum_{i} W_{ij} \delta_i^d

allows us to determine the δ for a given hidden unit V_j in terms of the δ's of the output units o_i it feeds
The coefficients W_{ij} are the usual forward weights, but the errors δ are propagated backward
Hence the name back-propagation

We have to use a nonlinear differentiable activation function
Examples:

f(x) = \sigma(x) = \frac{1}{1 + e^{-\beta x}}
f'(x) = \sigma'(x) = \beta \cdot \sigma(x) \cdot (1 - \sigma(x))

f(x) = \tanh(\beta \cdot x)
f'(x) = \beta \cdot (1 - f(x)^2)
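The sigmoid pair was sketched earlier; a corresponding minimal sketch for the tanh pair (assuming NumPy, with beta the slope parameter above):

```python
import numpy as np

def tanh_act(x, beta=1.0):
    """f(x) = tanh(beta * x)."""
    return np.tanh(beta * x)

def tanh_prime(x, beta=1.0):
    """f'(x) = beta * (1 - f(x)^2)."""
    return beta * (1.0 - np.tanh(beta * x) ** 2)
```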
Consider a network with M layers m = 1, 2, \ldots, M
V_i^m denotes the output of the ith unit in the mth layer
V_i^0 is a synonym for x_i, the ith input
The index m refers to layers, not to patterns
w_{ij}^m denotes the connection from V_j^{m-1} to V_i^m

Stochastic Back-Propagation
Algorithm (mostly used)
1. Initialize the weights to small random values
2. Choose a pattern x^d and apply it to the input layer: V_k^0 = x_k^d for all k
3. Propagate the signal forward through the network:
   V_i^m = f(net_i^m) = f\left(\sum_j w_{ij}^m V_j^{m-1}\right)
4. Compute the deltas for the output layer:
   \delta_i^M = f'(net_i^M)(t_i^d - V_i^M)
5. Compute the deltas for the preceding layers, for m = M, M-1, \ldots, 2:
   \delta_i^{m-1} = f'(net_i^{m-1}) \sum_j w_{ji}^m \delta_j^m
6. Update all connections:
   \Delta w_{ij}^m = \eta \, \delta_i^m V_j^{m-1}, \qquad w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}
7. Go to step 2 and repeat for the next pattern
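A compact sketch of these seven steps for a network with one hidden layer (hypothetical NumPy code; the layer size, learning rate, and epoch count are illustrative choices, not values from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, n_hidden=3, eta=0.1, epochs=1000, seed=0):
    """Stochastic back-propagation for one hidden layer.
    X: (m, n_in) input patterns, T: (m, n_out) targets."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # 1. initialize the weights to small random values
    w = rng.normal(scale=0.1, size=(n_hidden, n_in))   # input-to-hidden
    W = rng.normal(scale=0.1, size=(n_out, n_hidden))  # hidden-to-output
    for _ in range(epochs):
        for d in rng.permutation(len(X)):              # 2. choose a pattern
            x, t = X[d], T[d]
            # 3. propagate the signal forward
            net_h = w @ x
            V = sigmoid(net_h)
            net_o = W @ V
            o = sigmoid(net_o)
            # 4. deltas for the output layer (f'(net) = f(1-f) for the sigmoid)
            delta_o = o * (1 - o) * (t - o)
            # 5. deltas for the preceding (hidden) layer
            delta_h = V * (1 - V) * (W.T @ delta_o)
            # 6. update all connections
            W += eta * np.outer(delta_o, V)
            w += eta * np.outer(delta_h, x)
    return w, W
```

Called as, e.g., train_backprop(X, T) on the XOR patterns from the earlier slide; whether and how fast it converges depends on the learning rate and the random initialization.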

More on Back-Propagation
 Gradient descent over entire network
weight vector
 Easily generalized to arbitrary directed
graphs
 Will find a local, not necessarily global
error minimum
 In practice, often works well (can run
multiple times)

Gradient descent can be very slow if η is too small, and can oscillate widely if η is too large
Often a weight momentum term α is included:

\Delta w_{pq}(t+1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha \, \Delta w_{pq}(t)

The momentum parameter α is chosen between 0 and 1; 0.9 is a good value
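A minimal sketch of this momentum update (hypothetical names: grad is the current gradient dE/dw_pq, prev_dw the previous weight change):

```python
def momentum_update(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t)."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw  # new weights and the change to remember for the next step
```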

Minimizes the error over the training examples
Will it generalize well?
Training can take thousands of iterations; it is slow!

Using the network after training is very fast

Convergence of Back-
propagation
 Gradient descent to some local minimum
 Perhaps not global minimum...
 Add momentum
 Stochastic gradient descent
 Train multiple nets with different initial weights

 Nature of convergence
 Initialize weights near zero
Therefore, initial networks are near-linear
 Increasingly non-linear functions possible as training
progresses

Expressive Capabilities of
ANNs
Boolean functions:
Every Boolean function can be represented by a network with a single hidden layer
but it might require exponentially many (in the number of inputs) hidden units
Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

NETtalk, Sejnowski et al. 1987

Prediction

 Perceptron
 Gradient Descent
Multi-layered neural network
 Back-Propagation
 More on Back-Propagation
 Examples

 RBF Networks, Support Vector Machines

