
Back-Propagation Algorithm

 Perceptron
 Gradient Descent
Multi-layered neural network
 Back-Propagation
 More on Back-Propagation
 Examples

Inner-product
net = \langle \vec{w}, \vec{x} \rangle = \|\vec{w}\| \, \|\vec{x}\| \cos(\theta)

net = \sum_{i=1}^{n} w_i x_i

A measure of the projection of one vector onto another

Activation function
o = f(net) = f\left(\sum_{i=1}^{n} w_i x_i\right)

f(x) := \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ -1 & \text{if } x < 0 \end{cases}

f(x) := \theta(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}

f(x) := \begin{cases} 1 & \text{if } x \geq 0.5 \\ x & \text{if } -0.5 < x < 0.5 \\ 0 & \text{if } x \leq -0.5 \end{cases}

Sigmoid function:

f(x) := \sigma(x) = \frac{1}{1 + e^{-a x}}
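As a minimal sketch (assuming NumPy; the slope parameter a is the one in the formula above), the sigmoid and its derivative could be written as:

```python
import numpy as np

def sigmoid(x, a=1.0):
    """Logistic activation: f(x) = 1 / (1 + exp(-a*x))."""
    return 1.0 / (1.0 + np.exp(-a * x))

def sigmoid_prime(x, a=1.0):
    """Derivative a * f(x) * (1 - f(x)), needed later for gradient descent."""
    s = sigmoid(x, a)
    return a * s * (1.0 - s)
```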

Gradient Descent
To understand gradient descent, consider a simpler linear unit, where

o = \sum_{i=0}^{n} w_i x_i

Let's learn the w_i that minimize the squared error over the training set D = \{(x_1, t_1), (x_2, t_2), \ldots, (x_d, t_d), \ldots, (x_m, t_m)\} (t for target)

Error for different hypotheses, for w_0 and w_1 (2 dimensions)

We want to move the weight vector in the direction that decreases E:

w_i = w_i + \Delta w_i \qquad \vec{w} = \vec{w} + \Delta \vec{w}

Differentiating E

Update rule for gradient descent:

\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id}
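A minimal sketch of this batch rule for a single linear unit (hypothetical NumPy code; X is an m-by-n matrix whose rows are the inputs x_d, t the vector of targets, eta the learning rate η):

```python
import numpy as np

def batch_gradient_descent_step(w, X, t, eta=0.05):
    """One batch update: Delta w_i = eta * sum over d of (t_d - o_d) * x_id."""
    o = X @ w                      # outputs of the linear unit for all examples
    delta_w = eta * X.T @ (t - o)  # error-weighted inputs, summed over d in D
    return w + delta_w
```

Repeating this step moves w toward a minimum of the squared error E.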

Stochastic Approximation to Gradient Descent

\Delta w_i = \eta (t - o) x_i

The gradient descent training rule sums the updates over all training examples in D
Stochastic gradient descent approximates this by updating the weights incrementally
The error is calculated for each example
Known as the delta rule or LMS (least mean-square) weight update (sketched below)
The Adaline rule, used for adaptive filters, Widrow and Hoff (1960)
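By contrast with the batch rule above, the incremental delta-rule update processes one example at a time; a sketch with the same hypothetical names:

```python
import numpy as np

def delta_rule_step(w, x, t, eta=0.05):
    """LMS / delta-rule update for a single example: Delta w_i = eta * (t - o) * x_i."""
    o = np.dot(w, x)   # output of the linear unit for this example
    return w + eta * (t - o) * x
```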

XOR problem and Perceptron
By Minsky and Papert in the mid-1960s

Multi-layer Networks
The limitations of the simple perceptron do not apply to feed-forward networks with intermediate ("hidden") nonlinear units
A network with just one hidden layer can represent any Boolean function
The great power of multi-layer networks was realized long ago
But it was only in the eighties that it was shown how to make them learn

Multiple layers of cascaded linear units still produce only linear functions
We therefore look for networks capable of representing nonlinear functions
Units should use nonlinear activation functions
Examples of nonlinear activation functions

XOR-example

 Back-propagation is a learning algorithm for
multi-layer neural networks
 It was invented independently several times
Bryson and Ho [1969]
 Werbos [1974]
 Parker [1985]
 Rumelhart et al. [1986]

Parallel Distributed Processing - Vol. 1: Foundations
David E. Rumelhart, James L. McClelland and the PDP Research Group

"What makes people smarter than computers? These volumes by a pioneering neurocomputing..."

Back-propagation
The algorithm gives a prescription for changing the weights w_{ij} in any feed-forward network to learn a training set of input-output pairs \{x^d, t^d\}

We consider a simple two-layer network

[Figure: a two-layer network with five inputs x_1, \ldots, x_5 (the x_k), a layer of hidden units, and output units]

Given the pattern x^d, the hidden unit j receives a net input

net_j^d = \sum_{k=1}^{5} w_{jk} x_k^d

and produces the output

V_j^d = f(net_j^d) = f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)

Output unit i thus receives

net_i^d = \sum_{j=1}^{3} W_{ij} V_j^d = \sum_{j=1}^{3} \left( W_{ij} \, f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right) \right)

and produces the final output

o_i^d = f(net_i^d) = f\left(\sum_{j=1}^{3} W_{ij} V_j^d\right) = f\left(\sum_{j=1}^{3} \left( W_{ij} \, f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right) \right)\right)
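For concreteness, a sketch of this forward pass in NumPy (assuming the 5-3-2 example network implied by the sums above, with w the 3x5 input-to-hidden matrix, W the 2x3 hidden-to-output matrix, and a sigmoid f; all names are illustrative):

```python
import numpy as np

def f(net, a=1.0):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-a * net))

def forward(x, w, W):
    """Forward pass: x has shape (5,), w is (3, 5) input-to-hidden, W is (2, 3) hidden-to-output."""
    net_hidden = w @ x      # net_j = sum_k w_jk * x_k
    V = f(net_hidden)       # hidden outputs V_j
    net_out = W @ V         # net_i = sum_j W_ij * V_j
    o = f(net_out)          # final outputs o_i
    return net_hidden, V, net_out, o
```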
Our usual error function:

For l outputs and m input-output pairs \{x^d, t^d\}:

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{l} (t_i^d - o_i^d)^2

In our example (two outputs) E becomes

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d)^2

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left( t_i^d - f\left(\sum_{j=1}^{3} W_{ij} \, f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right) \right)^2

E[\vec{w}] is differentiable given that f is differentiable
Gradient descent can therefore be applied

For hidden-to-output connections the gradient descent rule gives:

\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = -\eta \sum_{d=1}^{m} (t_i^d - o_i^d) \, f'(net_i^d) \cdot (-V_j^d)

\Delta W_{ij} = \eta \sum_{d=1}^{m} (t_i^d - o_i^d) \, f'(net_i^d) \, V_j^d

Defining \delta_i^d = f'(net_i^d)(t_i^d - o_i^d), this becomes

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d \, V_j^d
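As a sketch (hypothetical NumPy code continuing the forward pass above, here for the contribution of a single pattern; the slide sums these contributions over all patterns d):

```python
import numpy as np

def f_prime(net, a=1.0):
    """Derivative of the sigmoid: a * f(net) * (1 - f(net))."""
    s = 1.0 / (1.0 + np.exp(-a * net))
    return a * s * (1.0 - s)

def hidden_to_output_update(W, V, net_out, o, t, eta=0.05):
    """delta_i = f'(net_i) * (t_i - o_i);  Delta W_ij = eta * delta_i * V_j."""
    delta_out = f_prime(net_out) * (t - o)     # output deltas, shape (2,)
    W_new = W + eta * np.outer(delta_out, V)   # hidden-to-output update
    return W_new, delta_out
```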

For the input-to-hidden connections w_{jk} we must differentiate with respect to w_{jk}
Using the chain rule we obtain

\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d} \cdot \frac{\partial V_j^d}{\partial w_{jk}}

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d) \, f'(net_i^d) \, W_{ij} \, f'(net_j^d) \cdot x_k^d

With \delta_i^d = f'(net_i^d)(t_i^d - o_i^d) as before,

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \delta_i^d \, W_{ij} \, f'(net_j^d) \cdot x_k^d

\delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij} \delta_i^d

\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d \cdot x_k^d

In summary:

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d \, V_j^d

\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d \cdot x_k^d

We obtain updates of the same form, but with a different definition of δ
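Continuing the same hypothetical sketch, the hidden deltas and the input-to-hidden update for a single pattern (sigmoid derivative written inline, with slope 1):

```python
import numpy as np

def input_to_hidden_update(w, W, x, net_hidden, delta_out, eta=0.05):
    """delta_j = f'(net_j) * sum_i W_ij * delta_i;  Delta w_jk = eta * delta_j * x_k."""
    s = 1.0 / (1.0 + np.exp(-net_hidden))             # sigmoid at the hidden nets
    delta_hidden = s * (1.0 - s) * (W.T @ delta_out)  # errors propagated backward through W
    w_new = w + eta * np.outer(delta_hidden, x)       # input-to-hidden update
    return w_new, delta_hidden
```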

In general, with an arbitrary number of layers, the back-propagation update rule always has the form

\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{output} \cdot V_{input}

where "output" and "input" refer to the two ends of the connection concerned
V stands for the appropriate input (hidden-unit output or real input x^d)
δ depends on the layer concerned

The equation

\delta_j^d = f'(net_j^d) \sum_{i} W_{ij} \delta_i^d

allows us to determine the δ for a given hidden unit V_j in terms of the δ's of the output units o_i it feeds
The coefficients W_{ij} are the usual forward weights, but the errors δ are propagated backward
Hence the name back-propagation

We have to use a nonlinear differentiable activation function
Examples:

f(x) = \sigma(x) = \frac{1}{1 + e^{-\beta x}}
f'(x) = \sigma'(x) = \beta \cdot \sigma(x) \cdot (1 - \sigma(x))

f(x) = \tanh(\beta \cdot x)
f'(x) = \beta \cdot (1 - f(x)^2)
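The sigmoid pair was sketched earlier; a corresponding minimal sketch for the tanh pair (assuming NumPy, with beta the slope parameter above):

```python
import numpy as np

def tanh_act(x, beta=1.0):
    """f(x) = tanh(beta * x)."""
    return np.tanh(beta * x)

def tanh_prime(x, beta=1.0):
    """f'(x) = beta * (1 - f(x)^2)."""
    return beta * (1.0 - np.tanh(beta * x) ** 2)
```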
Consider a network with M layers m = 1, 2, \ldots, M
V_i^m denotes the output of the ith unit in the mth layer
V_i^0 is a synonym for x_i, the ith input
The index m refers to layers, not to patterns
w_{ij}^m denotes the connection from V_j^{m-1} to V_i^m

Stochastic Back-Propagation
Algorithm (mostly used)
1. Initialize the weights to small random values
2. Choose a pattern x^d and apply it to the input layer: V_k^0 = x_k^d for all k
3. Propagate the signal forward through the network:
   V_i^m = f(net_i^m) = f\left(\sum_j w_{ij}^m V_j^{m-1}\right)
4. Compute the deltas for the output layer:
   \delta_i^M = f'(net_i^M)(t_i^d - V_i^M)
5. Compute the deltas for the preceding layers, for m = M, M-1, \ldots, 2:
   \delta_i^{m-1} = f'(net_i^{m-1}) \sum_j w_{ji}^m \delta_j^m
6. Update all connections:
   \Delta w_{ij}^m = \eta \, \delta_i^m V_j^{m-1}, \qquad w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}
7. Go to step 2 and repeat for the next pattern
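A compact sketch of these seven steps for a network with one hidden layer (hypothetical NumPy code; the layer size, learning rate, and epoch count are illustrative choices, not values from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, n_hidden=3, eta=0.1, epochs=1000, seed=0):
    """Stochastic back-propagation for one hidden layer.
    X: (m, n_in) input patterns, T: (m, n_out) targets."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # 1. initialize the weights to small random values
    w = rng.normal(scale=0.1, size=(n_hidden, n_in))   # input-to-hidden
    W = rng.normal(scale=0.1, size=(n_out, n_hidden))  # hidden-to-output
    for _ in range(epochs):
        for d in rng.permutation(len(X)):              # 2. choose a pattern
            x, t = X[d], T[d]
            # 3. propagate the signal forward
            net_h = w @ x
            V = sigmoid(net_h)
            net_o = W @ V
            o = sigmoid(net_o)
            # 4. deltas for the output layer (f'(net) = f(1-f) for the sigmoid)
            delta_o = o * (1 - o) * (t - o)
            # 5. deltas for the preceding (hidden) layer
            delta_h = V * (1 - V) * (W.T @ delta_o)
            # 6. update all connections
            W += eta * np.outer(delta_o, V)
            w += eta * np.outer(delta_h, x)
    return w, W
```

Called as, e.g., train_backprop(X, T) on the XOR patterns from the earlier slide; whether and how fast it converges depends on the learning rate and the random initialization.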

More on Back-Propagation
 Gradient descent over entire network
weight vector
 Easily generalized to arbitrary directed
graphs
 Will find a local, not necessarily global
error minimum
 In practice, often works well (can run
multiple times)

Gradient descent can be very slow if η is too small, and can oscillate widely if η is too large
Often a weight momentum term α is included:

\Delta w_{pq}(t+1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha \, \Delta w_{pq}(t)

The momentum parameter α is chosen between 0 and 1; 0.9 is a good value
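A minimal sketch of this momentum update (hypothetical names: grad is the current gradient dE/dw_pq, prev_dw the previous weight change):

```python
def momentum_update(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t)."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw  # new weights and the change to remember for the next step
```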

Minimizes the error over the training examples
Will it generalize well?
Training can take thousands of iterations; it is slow!

Using the network after training is very fast

Convergence of Back-
propagation
 Gradient descent to some local minimum
 Perhaps not global minimum...
 Add momentum
 Stochastic gradient descent
 Train multiple nets with different initial weights

 Nature of convergence
 Initialize weights near zero
Therefore, initial networks are near-linear
 Increasingly non-linear functions possible as training
progresses

Expressive Capabilities of
ANNs
Boolean functions:
Every Boolean function can be represented by a network with a single hidden layer
but it might require exponentially many (in the number of inputs) hidden units
Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

NETtalk, Sejnowski et al. 1987

Prediction

 Perceptron
 Gradient Descent
Multi-layered neural network
 Back-Propagation
 More on Back-Propagation
 Examples

 RBF Networks, Support Vector Machines

