
Variations on

Backpropagation

1
• The basic backpropagation algorithm is too slow for most practical applications; it may take days or weeks of computer time.
• We demonstrate why the backpropagation algorithm is slow to converge.
• We saw that steepest descent is the slowest minimization method.
• The conjugate gradient algorithm and Newton's method generally provide faster convergence.

2
Variations
• Heuristic modifications
– Momentum
– Variable learning rate
• Standard numerical optimization
– Conjugate gradient
– Newton's method (Levenberg-Marquardt)
3
Drawbacks of BP
• We saw that the LMS algorithm is guaranteed to converge to a solution that minimizes the mean squared error, provided the learning rate is not too large.
• Steepest descent backpropagation (SDBP) is a generalization of the LMS algorithm.
• A multilayer nonlinear network has many local minimum points, and the curvature can vary widely in different regions of the parameter space.
4
Performance Surface Example
Network architecture and nominal function (figure); parameter values:
$w^1_{1,1} = 10 \quad w^1_{2,1} = 10 \quad b^1_1 = -5 \quad b^1_2 = 5$
$w^2_{1,1} = 1 \quad w^2_{1,2} = 1 \quad b^2 = -1$
5
Squared Error vs. $w^1_{1,1}$ and $w^2_{1,1}$

The curvature varies drastically over the parameter space, so it is difficult to choose an appropriate learning rate for the steepest descent algorithm.
6
Squared Error vs. $w^1_{1,1}$ and $b^1_1$
(nominal values: $w^1_{1,1} = 10$, $b^1_1 = -5$)
Flat regions of the performance surface should not be unexpected, given the sigmoid transfer functions used by the networks. The sigmoid is very flat for large inputs.
7
Squared Error vs. $b^1_1$ and $b^1_2$
(nominal values: $b^1_1 = -5$, $b^1_2 = 5$)
It is because of this characteristic of neural networks that we do not set the initial weights and biases to zero.
8
Steepest descent trajectories on the performance surface:
a: converges to the optimal solution, but the convergence is slow.
b: converges to a local minimum ($w^1_{1,1} = 0.88$, $w^2_{1,1} = 38.6$).
10
11
Learning Rate Too Large

Demonstrations: nnd12sd1, nnd12sd2
12
Momentum
Filter
$y(k) = \gamma\, y(k-1) + (1-\gamma)\, w(k), \qquad 0 \le \gamma < 1$

Example
$w(k) = 1 + \sin\!\left(\dfrac{2\pi k}{16}\right)$
13
Observations
• The oscillation of the filter output is less than
the oscillation in the filter input (low pass
filter).
• As γ is increased the oscillation in the filter
output is reduced.
• The average filter output is the same as the average filter input, although as γ is increased the filter output is slower to respond.
• To summarize, the filter tends to reduce the
amount of oscillation, while still tracking the
average value.
14
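A minimal Python sketch of the first-order filter applied to the sinusoidal example above; the sequence length, the two γ values, and the choice of starting the filter at the first input are illustrative assumptions, not from the slides:

```python
import numpy as np

def smooth(w, gamma):
    """First-order low-pass filter: y(k) = gamma*y(k-1) + (1-gamma)*w(k)."""
    y = np.zeros_like(w)
    y[0] = w[0]                        # arbitrary initialization choice for the sketch
    for k in range(1, len(w)):
        y[k] = gamma * y[k - 1] + (1 - gamma) * w[k]
    return y

k = np.arange(64)
w = 1 + np.sin(2 * np.pi * k / 16)     # the example input from the slide

for gamma in (0.9, 0.98):              # larger gamma: less oscillation, slower response
    y = smooth(w, gamma)
    print(f"gamma={gamma}: output mean {y.mean():.3f}, "
          f"peak-to-peak {y.max() - y.min():.3f} "
          f"(input peak-to-peak {w.max() - w.min():.3f})")
```

Running it shows the behavior listed in the observations: the output oscillates less than the input, and the larger γ is, the smaller the oscillation but the slower the response.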
Momentum Backpropagation

Steepest Descent Backpropagation (SDBP)
$\Delta \mathbf{W}^m(k) = -\alpha\, \mathbf{s}^m (\mathbf{a}^{m-1})^T$
$\Delta \mathbf{b}^m(k) = -\alpha\, \mathbf{s}^m$

Momentum Backpropagation (MOBP)
$\Delta \mathbf{W}^m(k) = \gamma\, \Delta \mathbf{W}^m(k-1) - (1-\gamma)\,\alpha\, \mathbf{s}^m (\mathbf{a}^{m-1})^T$
$\Delta \mathbf{b}^m(k) = \gamma\, \Delta \mathbf{b}^m(k-1) - (1-\gamma)\,\alpha\, \mathbf{s}^m$

(Trajectory plot: squared error contours over $w^1_{1,1}$ and $w^2_{1,1}$, with $\gamma = 0.8$.)
15
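As an illustration of the two update rules for one layer, here is a small numpy sketch; the layer sizes, α, and γ are arbitrary example values, and $\mathbf{s}^m$ and $\mathbf{a}^{m-1}$ would normally come from backpropagation rather than random numbers:

```python
import numpy as np

alpha, gamma = 0.1, 0.8            # learning rate and momentum coefficient (example values)
s = np.random.randn(3, 1)          # sensitivity vector s^m for a layer with 3 neurons
a_prev = np.random.randn(2, 1)     # output a^{m-1} of the previous layer (2 neurons)

# SDBP increments
dW_sd = -alpha * s @ a_prev.T
db_sd = -alpha * s

# MOBP increments: a low-pass-filtered version of the SDBP increments
dW_old = np.zeros((3, 2))          # previous increment Delta W^m(k-1)
db_old = np.zeros((3, 1))          # previous increment Delta b^m(k-1)
dW_mo = gamma * dW_old - (1 - gamma) * alpha * s @ a_prev.T
db_mo = gamma * db_old - (1 - gamma) * alpha * s
```

The MOBP increment is exactly the SDBP increment passed through the momentum filter of the previous slide, which is why it smooths oscillations in the trajectory.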
• This example uses the batching form of MOBP, in which the parameters are updated only after the entire training set has been presented.
• The same initial condition and learning rate were used as in the previous example, in which the algorithm was not stable.
• The algorithm is now stable, and it tends to accelerate convergence when the trajectory is moving in a consistent direction.

Demonstration: nnd12mo
16
Variable Learning Rate (VLBP)
• If the squared error (over the entire training set) increases by more than some set percentage z after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor r (0 < r < 1), and the momentum coefficient γ is set to zero.
• If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor h > 1. If γ has previously been set to zero, it is reset to its original value.
• If the squared error increases by less than z, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
17
Example

h = 1.05
r = 0.7
z = 4%

Demonstration: nnd12vl
18
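Using the example constants above, here is a sketch of how the three VLBP rules might be coded around a generic weight update; `grad_step` and `sse_fn` are hypothetical callables standing in for the network update and the squared-error computation, and the initial α and γ values are illustrative:

```python
eta, rho, zeta = 1.05, 0.7, 0.04        # h, r, z from the example slide
alpha_0, gamma_0 = 0.05, 0.8            # initial learning rate and momentum (illustrative)

def vlbp_step(x, alpha, gamma, sse, grad_step, sse_fn):
    """One variable-learning-rate update.

    grad_step(x, alpha, gamma) returns a candidate parameter vector and
    sse_fn(x) returns the squared error over the whole training set;
    both are assumed to be supplied by the caller."""
    x_new = grad_step(x, alpha, gamma)
    sse_new = sse_fn(x_new)

    if sse_new > sse * (1 + zeta):                      # error grew by more than z
        return x, alpha * rho, 0.0, sse                 # discard step, shrink alpha, zero momentum
    elif sse_new < sse:                                 # error decreased
        return x_new, alpha * eta, gamma_0, sse_new     # accept step, grow alpha, restore momentum
    else:                                               # error grew, but by less than z
        return x_new, alpha, gamma, sse_new             # accept step, leave alpha and gamma alone
```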
(Plots: convergence characteristics of variable learning rate; squared error and learning rate α vs. iteration.)
19
Conjugate Gradient
• We saw that SD is the simplest optimization method, but it is often slow to converge.
• Newton's method is much faster, but it requires that the Hessian matrix and its inverse be calculated.
• The conjugate gradient algorithm is a compromise: it does not require the calculation of second derivatives, yet it still has the quadratic convergence property.
• We now describe how the conjugate gradient algorithm can be used to train multilayer networks.
• This algorithm is called Conjugate Gradient Backpropagation (CGBP).
20
Review of CG Algorithm
1. The first search direction is steepest descent:
$\mathbf{p}_0 = -\mathbf{g}_0, \qquad \mathbf{g}_k \equiv \nabla F(\mathbf{x})\Big|_{\mathbf{x}=\mathbf{x}_k}$
2. Take a step, choosing the learning rate to minimize the function along the search direction:
$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\, \mathbf{p}_k$
3. Select the next search direction according to:
$\mathbf{p}_k = -\mathbf{g}_k + \beta_k\, \mathbf{p}_{k-1}$
where
$\beta_k = \dfrac{\Delta\mathbf{g}_{k-1}^T\, \mathbf{g}_k}{\Delta\mathbf{g}_{k-1}^T\, \mathbf{p}_{k-1}} \quad\text{or}\quad \beta_k = \dfrac{\mathbf{g}_k^T\, \mathbf{g}_k}{\mathbf{g}_{k-1}^T\, \mathbf{g}_{k-1}} \quad\text{or}\quad \beta_k = \dfrac{\Delta\mathbf{g}_{k-1}^T\, \mathbf{g}_k}{\mathbf{g}_{k-1}^T\, \mathbf{g}_{k-1}}$
21
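As a minimal numerical illustration (the gradient vectors here are made-up numbers, not from any network), one direction update with the three β choices might look like this in Python:

```python
import numpy as np

g_prev = np.array([0.8, -0.4])               # g_{k-1} (example values)
g_k    = np.array([0.3,  0.5])               # g_k
p_prev = -g_prev                             # previous search direction (p_0 = -g_0)
dg     = g_k - g_prev                        # delta g_{k-1}

beta_hs = (dg @ g_k) / (dg @ p_prev)         # Hestenes-Stiefel form
beta_fr = (g_k @ g_k) / (g_prev @ g_prev)    # Fletcher-Reeves form
beta_pr = (dg @ g_k) / (g_prev @ g_prev)     # Polak-Ribiere form

p_k = -g_k + beta_fr * p_prev                # new search direction (using one of the betas)
print(beta_hs, beta_fr, beta_pr, p_k)
```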
• This cannot be applied directly to neural network training, because the performance index is not quadratic.
– We cannot use
$\alpha_k = -\dfrac{\nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}_k}\; \mathbf{p}_k}{\mathbf{p}_k^T\, \nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}\, \mathbf{p}_k} = -\dfrac{\mathbf{g}_k^T\, \mathbf{p}_k}{\mathbf{p}_k^T\, \mathbf{A}_k\, \mathbf{p}_k}$
to minimize the function along a line.
– The exact minimum will not normally be reached in a finite number of steps, and therefore the algorithm will need to be reset after some set number of iterations.
• Locating the minimum of a function along a line:
– Interval location
– Interval reduction
22
Interval Location

23
Interval Reduction

24
Golden Section Search
τ = 0.618
Set c_1 = a_1 + (1 − τ)(b_1 − a_1), F_c = F(c_1)
    d_1 = b_1 − (1 − τ)(b_1 − a_1), F_d = F(d_1)
For k = 1, 2, ... repeat
    If F_c < F_d then
        Set a_{k+1} = a_k ; b_{k+1} = d_k ; d_{k+1} = c_k
            c_{k+1} = a_{k+1} + (1 − τ)(b_{k+1} − a_{k+1})
            F_d = F_c ; F_c = F(c_{k+1})
    else
        Set a_{k+1} = c_k ; b_{k+1} = b_k ; c_{k+1} = d_k
            d_{k+1} = b_{k+1} − (1 − τ)(b_{k+1} − a_{k+1})
            F_c = F_d ; F_d = F(d_{k+1})
    end
end until b_{k+1} − a_{k+1} < tolerance
25
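A direct Python transcription of the pseudocode above; the quadratic test function and the tolerance in the usage line are illustrative, and in CGBP the argument would be the step length along the current search direction:

```python
def golden_section(F, a, b, tol=1e-4):
    """Golden section search for the minimum of F on the interval [a, b]."""
    tau = 0.618
    c = a + (1 - tau) * (b - a); Fc = F(c)
    d = b - (1 - tau) * (b - a); Fd = F(d)
    while b - a >= tol:
        if Fc < Fd:                      # minimum lies in [a, d]
            b, d = d, c
            c = a + (1 - tau) * (b - a)
            Fd, Fc = Fc, F(c)
        else:                            # minimum lies in [c, b]
            a, c = c, d
            d = b - (1 - tau) * (b - a)
            Fc, Fd = Fd, F(d)
    return (a + b) / 2

# Example: minimum of a simple quadratic along a line
x_min = golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
print(round(x_min, 3))                   # approximately 2.0
```

Each pass shrinks the interval by the factor τ = 0.618 while reusing one of the two interior function values, so only one new function evaluation is needed per iteration.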
• For quadratic functions the algorithm will converge to the minimum in at most n iterations, where n is the number of parameters; this normally does not happen for multilayer networks.
• The development of the CG algorithm does not indicate what search direction to use once a cycle of n iterations has been completed.
• The simplest method is to reset the search direction to the steepest descent direction after n iterations.
• In the following function approximation example we use the BP algorithm to compute the gradient and the CG algorithm to determine the weight updates. This is a batch-mode algorithm.
26
Conjugate Gradient BP (CGBP)

Demonstrations: nnd12ls, nnd12cg
27
Newton’s Method
$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{A}_k^{-1}\mathbf{g}_k$
where
$\mathbf{A}_k \equiv \nabla^2 F(\mathbf{x})\Big|_{\mathbf{x}=\mathbf{x}_k}, \qquad \mathbf{g}_k \equiv \nabla F(\mathbf{x})\Big|_{\mathbf{x}=\mathbf{x}_k}$

If the performance index is a sum-of-squares function:
$F(\mathbf{x}) = \sum_{i=1}^{N} v_i^2(\mathbf{x}) = \mathbf{v}^T(\mathbf{x})\,\mathbf{v}(\mathbf{x})$

then the jth element of the gradient is
$[\nabla F(\mathbf{x})]_j = \dfrac{\partial F(\mathbf{x})}{\partial x_j} = 2\sum_{i=1}^{N} v_i(\mathbf{x})\,\dfrac{\partial v_i(\mathbf{x})}{\partial x_j}$
28
Matrix Form
The gradient can be written in matrix form:
$\nabla F(\mathbf{x}) = 2\,\mathbf{J}^T(\mathbf{x})\,\mathbf{v}(\mathbf{x})$
where J is the Jacobian matrix:
$\mathbf{J}(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial v_1(\mathbf{x})}{\partial x_1} & \dfrac{\partial v_1(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial v_1(\mathbf{x})}{\partial x_n} \\ \dfrac{\partial v_2(\mathbf{x})}{\partial x_1} & \dfrac{\partial v_2(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial v_2(\mathbf{x})}{\partial x_n} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial v_N(\mathbf{x})}{\partial x_1} & \dfrac{\partial v_N(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial v_N(\mathbf{x})}{\partial x_n} \end{bmatrix} \qquad (N \times n)$
29
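As a quick numerical sanity check of the relationship $\nabla F(\mathbf{x}) = 2\,\mathbf{J}^T(\mathbf{x})\,\mathbf{v}(\mathbf{x})$, the two residual functions below are made up purely for illustration:

```python
import numpy as np

def v_fn(x):                       # two made-up residuals in two parameters
    return np.array([x[0] ** 2 - x[1], 3.0 * x[0] * x[1] - 1.0])

def J_fn(x):                       # analytic Jacobian, dv_i/dx_j
    return np.array([[2.0 * x[0], -1.0],
                     [3.0 * x[1], 3.0 * x[0]]])

x = np.array([0.7, -0.2])
grad_analytic = 2.0 * J_fn(x).T @ v_fn(x)

F = lambda x: v_fn(x) @ v_fn(x)    # F(x) = sum_i v_i(x)^2
eps = 1e-6                         # central-difference approximation of grad F
grad_fd = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.round(grad_analytic, 6), np.round(grad_fd, 6))   # the two should agree
```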
Now we want to find the Hessian matrix:
$[\nabla^2 F(\mathbf{x})]_{k,j} = \dfrac{\partial^2 F(\mathbf{x})}{\partial x_k\, \partial x_j} = 2\sum_{i=1}^{N}\left\{ \dfrac{\partial v_i(\mathbf{x})}{\partial x_k}\,\dfrac{\partial v_i(\mathbf{x})}{\partial x_j} + v_i(\mathbf{x})\,\dfrac{\partial^2 v_i(\mathbf{x})}{\partial x_k\, \partial x_j} \right\}$

In matrix form:
$\nabla^2 F(\mathbf{x}) = 2\,\mathbf{J}^T(\mathbf{x})\,\mathbf{J}(\mathbf{x}) + 2\,\mathbf{S}(\mathbf{x})$
where
$\mathbf{S}(\mathbf{x}) = \sum_{i=1}^{N} v_i(\mathbf{x})\, \nabla^2 v_i(\mathbf{x})$
30
Gauss-Newton Method
Approximate the Hessian matrix as:
$\nabla^2 F(\mathbf{x}) \approx 2\,\mathbf{J}^T(\mathbf{x})\,\mathbf{J}(\mathbf{x})$
(if we assume that S(x) is small)

We also had:
$\nabla F(\mathbf{x}) = 2\,\mathbf{J}^T(\mathbf{x})\,\mathbf{v}(\mathbf{x}), \qquad \mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{A}_k^{-1}\mathbf{g}_k$

Newton's method then becomes the Gauss-Newton method:
$\mathbf{x}_{k+1} = \mathbf{x}_k - \left[2\,\mathbf{J}^T(\mathbf{x}_k)\,\mathbf{J}(\mathbf{x}_k)\right]^{-1} 2\,\mathbf{J}^T(\mathbf{x}_k)\,\mathbf{v}(\mathbf{x}_k) = \mathbf{x}_k - \left[\mathbf{J}^T(\mathbf{x}_k)\,\mathbf{J}(\mathbf{x}_k)\right]^{-1} \mathbf{J}^T(\mathbf{x}_k)\,\mathbf{v}(\mathbf{x}_k)$
31
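As a small, self-contained illustration (not from the slides), here is a numpy sketch of Gauss-Newton iterations on a toy exponential-fitting problem; the data, starting point, and iteration count are made up for the example:

```python
import numpy as np

def gauss_newton_step(x, v, J):
    """x_{k+1} = x_k - (J^T J)^{-1} J^T v(x_k)."""
    return x - np.linalg.solve(J.T @ J, J.T @ v)

# Toy least-squares problem: fit y ~ x1 * exp(x2 * t); residuals v_i(x) = x1*exp(x2*t_i) - y_i
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.6, 2.7, 4.4])
x = np.array([0.5, 0.3])                               # initial guess for [x1, x2]

for _ in range(10):
    model = x[0] * np.exp(x[1] * t)
    v = model - y                                      # residual vector v(x)
    J = np.column_stack([np.exp(x[1] * t),             # dv_i/dx1
                         x[0] * t * np.exp(x[1] * t)]) # dv_i/dx2
    x = gauss_newton_step(x, v, J)

print(np.round(x, 3))                                  # should approach roughly [1.0, 0.5]
```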
Levenberg-Marquardt
Gauss-Newton approximates the Hessian by:
$\mathbf{H} = \mathbf{J}^T\mathbf{J}$

This matrix may be singular, but it can be made invertible as follows:
$\mathbf{G} = \mathbf{H} + \mu\,\mathbf{I}$

If the eigenvalues and eigenvectors of H are $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and $\{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n\}$, then the eigenvalues of G satisfy
$\mathbf{G}\mathbf{z}_i = (\mathbf{H} + \mu\mathbf{I})\,\mathbf{z}_i = \mathbf{H}\mathbf{z}_i + \mu\mathbf{z}_i = \lambda_i\mathbf{z}_i + \mu\mathbf{z}_i = (\lambda_i + \mu)\,\mathbf{z}_i$

G can be made positive definite by increasing μ until $\lambda_i + \mu > 0$ for all i, which gives the Levenberg-Marquardt update:
$\mathbf{x}_{k+1} = \mathbf{x}_k - \left[\mathbf{J}^T(\mathbf{x}_k)\,\mathbf{J}(\mathbf{x}_k) + \mu_k\,\mathbf{I}\right]^{-1} \mathbf{J}^T(\mathbf{x}_k)\,\mathbf{v}(\mathbf{x}_k)$
32
Adjustment of μ_k
As μ_k → 0, LM becomes Gauss-Newton:
$\mathbf{x}_{k+1} = \mathbf{x}_k - \left[\mathbf{J}^T(\mathbf{x}_k)\,\mathbf{J}(\mathbf{x}_k)\right]^{-1} \mathbf{J}^T(\mathbf{x}_k)\,\mathbf{v}(\mathbf{x}_k)$

As μ_k → ∞, LM becomes steepest descent with a small learning rate:
$\mathbf{x}_{k+1} \approx \mathbf{x}_k - \dfrac{1}{\mu_k}\,\mathbf{J}^T(\mathbf{x}_k)\,\mathbf{v}(\mathbf{x}_k) = \mathbf{x}_k - \dfrac{1}{2\mu_k}\,\nabla F(\mathbf{x})$

Therefore, begin with a small μ_k to use Gauss-Newton and speed convergence. If a step does not yield a smaller F(x), then repeat the step with an increased μ_k until F(x) is decreased. F(x) must decrease eventually, since we will be taking a very small step in the steepest descent direction.
33
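The two limits can be seen numerically. In the sketch below the Jacobian and residual vector are random made-up numbers, used only to compare the LM step against the Gauss-Newton step and the negative gradient direction:

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.normal(size=(5, 3))                 # made-up Jacobian and residual vector
v = rng.normal(size=5)

def lm_step(mu):
    """-(J^T J + mu I)^{-1} J^T v"""
    return -np.linalg.solve(J.T @ J + mu * np.eye(3), J.T @ v)

gn_step = -np.linalg.solve(J.T @ J, J.T @ v)      # Gauss-Newton step
neg_grad = -J.T @ v                               # negative gradient direction (up to the factor 2)

print("Gauss-Newton step:       ", np.round(gn_step, 4))
print("LM step, mu = 1e-8:      ", np.round(lm_step(1e-8), 4))        # ~ Gauss-Newton step
print("LM step, mu = 1e8, x mu: ", np.round(lm_step(1e8) * 1e8, 4))   # ~ -J^T v (steepest descent)
print("-J^T v:                  ", np.round(neg_grad, 4))
```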
Application to Multilayer Network
The performance index for the multilayer network is:
$F(\mathbf{x}) = \sum_{q=1}^{Q} (\mathbf{t}_q - \mathbf{a}_q)^T(\mathbf{t}_q - \mathbf{a}_q) = \sum_{q=1}^{Q} \mathbf{e}_q^T \mathbf{e}_q = \sum_{q=1}^{Q}\sum_{j=1}^{S^M} (e_{j,q})^2 = \sum_{i=1}^{N} (v_i)^2$
where $e_{j,q}$ is the jth element of the error for the qth input/target pair (each input is assumed to occur with equal probability).
This is similar to the performance index for which LM was designed.
In standard BP we compute the derivatives of the squared errors with respect to the weights and biases. To create the matrix J we need to compute the derivatives of the errors themselves.
34
The error vector is:
$\mathbf{v}^T = \begin{bmatrix} v_1 & v_2 & \cdots & v_N \end{bmatrix} = \begin{bmatrix} e_{1,1} & e_{2,1} & \cdots & e_{S^M,1} & e_{1,2} & \cdots & e_{S^M,Q} \end{bmatrix}$

The parameter vector is:
$\mathbf{x}^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} = \begin{bmatrix} w^1_{1,1} & w^1_{1,2} & \cdots & w^1_{S^1,R} & b^1_1 & \cdots & b^1_{S^1} & w^2_{1,1} & \cdots & b^M_{S^M} \end{bmatrix}$

The dimensions of the two vectors are:
$N = Q \times S^M, \qquad n = S^1(R+1) + S^2(S^1+1) + \cdots + S^M(S^{M-1}+1)$

If we make these substitutions into the Jacobian matrix for multilayer network training, we have:
35
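For example, the earlier performance-surface example used a two-layer 1-2-1 network ($R = 1$, $S^1 = 2$, $S^2 = 1$, $M = 2$), so
$n = S^1(R+1) + S^2(S^1+1) = 2 \cdot 2 + 1 \cdot 3 = 7$ parameters,
and with Q training pairs there are $N = Q \times S^M = Q$ error terms, one row of the Jacobian per pair.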
Jacobian Matrix
$\mathbf{J}(\mathbf{x}) = \begin{bmatrix}
\dfrac{\partial e_{1,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,1}}{\partial b^1_1} & \cdots \\[8pt]
\dfrac{\partial e_{2,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{2,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{2,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{2,1}}{\partial b^1_1} & \cdots \\[4pt]
\vdots & \vdots & & \vdots & \vdots & \\[4pt]
\dfrac{\partial e_{S^M,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{S^M,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{S^M,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{S^M,1}}{\partial b^1_1} & \cdots \\[8pt]
\dfrac{\partial e_{1,2}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,2}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,2}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,2}}{\partial b^1_1} & \cdots \\[4pt]
\vdots & \vdots & & \vdots & \vdots &
\end{bmatrix} \qquad (N \times n)$
36
Computing the Jacobian
SDBP computes terms like:
$\dfrac{\partial \hat{F}(\mathbf{x})}{\partial x_l} = \dfrac{\partial\, \mathbf{e}_q^T \mathbf{e}_q}{\partial x_l}$
using the chain rule:
$\dfrac{\partial \hat{F}}{\partial w^m_{i,j}} = \dfrac{\partial \hat{F}}{\partial n^m_i} \cdot \dfrac{\partial n^m_i}{\partial w^m_{i,j}}$
where the sensitivity
$s^m_i \equiv \dfrac{\partial \hat{F}}{\partial n^m_i}$
is computed using backpropagation.

For the Jacobian we need to compute terms like:
$[\mathbf{J}]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial x_l}$
37
Marquardt Sensitivity
If we define a Marquardt sensitivity:
$\tilde{s}^m_{i,h} \equiv \dfrac{\partial v_h}{\partial n^m_{i,q}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}}, \qquad h = (q-1)S^M + k$

we can compute the Jacobian elements as follows.

For a weight:
$[\mathbf{J}]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial w^m_{i,j}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde{s}^m_{i,h} \cdot \dfrac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde{s}^m_{i,h}\; a^{m-1}_{j,q}$

For a bias:
$[\mathbf{J}]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial b^m_i} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde{s}^m_{i,h}$
38
Computing the Sensitivities
Initialization (at the output layer):
$\tilde{s}^M_{i,h} = \dfrac{\partial v_h}{\partial n^M_{i,q}} = \dfrac{\partial e_{k,q}}{\partial n^M_{i,q}} = \dfrac{\partial (t_{k,q} - a^M_{k,q})}{\partial n^M_{i,q}} = -\dfrac{\partial a^M_{k,q}}{\partial n^M_{i,q}}$

$\tilde{s}^M_{i,h} = \begin{cases} -\dot{f}^M(n^M_{i,q}) & \text{for } i = k \\[4pt] 0 & \text{for } i \ne k \end{cases}$

Therefore, when the input $\mathbf{p}_q$ has been applied to the network and the corresponding network output $\mathbf{a}^M_q$ has been computed, the LMBP algorithm is initialized with
$\tilde{\mathbf{S}}^M_q = -\dot{\mathbf{F}}^M(\mathbf{n}^M_q)$
39
where
$\dot{\mathbf{F}}^m(\mathbf{n}^m) = \begin{bmatrix} \dot{f}^m(n^m_1) & 0 & \cdots & 0 \\ 0 & \dot{f}^m(n^m_2) & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dot{f}^m(n^m_{S^m}) \end{bmatrix}$

Each column of the matrix $\tilde{\mathbf{S}}^M_q$ must be backpropagated through the network using the following equation (from Chapter 11) to produce one row of the Jacobian matrix:
$\mathbf{s}^m = \dot{\mathbf{F}}^m(\mathbf{n}^m)\,(\mathbf{W}^{m+1})^T\,\mathbf{s}^{m+1}$
40
The columns can also be backpropagated together using
$\tilde{\mathbf{S}}^m_q = \dot{\mathbf{F}}^m(\mathbf{n}^m_q)\,(\mathbf{W}^{m+1})^T\,\tilde{\mathbf{S}}^{m+1}_q$

The total Marquardt sensitivity matrices for each layer are then created by augmenting the matrices computed for each input:
$\tilde{\mathbf{S}}^m = \begin{bmatrix} \tilde{\mathbf{S}}^m_1 & \tilde{\mathbf{S}}^m_2 & \cdots & \tilde{\mathbf{S}}^m_Q \end{bmatrix}$

Note that for each input we backpropagate $S^M$ sensitivity vectors, because we compute the derivatives of each individual error rather than the derivative of the sum of squares of the errors. For every input we have $S^M$ errors, and for each error there will be one row of the Jacobian matrix.
41
• After the sensitivities have been backpropagated, the Jacobian matrix is computed using:

$[\mathbf{J}]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial w^m_{i,j}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde{s}^m_{i,h}\; a^{m-1}_{j,q} \qquad \text{(weight)}$

$[\mathbf{J}]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial b^m_i} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde{s}^m_{i,h} \qquad \text{(bias)}$
42
LMBP (summarized)
• Step 1: Present all inputs to the network and compute the corresponding network outputs and errors. Compute the sum of squared errors over all inputs:
$\mathbf{e}_q = \mathbf{t}_q - \mathbf{a}^M_q$
$F(\mathbf{x}) = \sum_{q=1}^{Q} (\mathbf{t}_q - \mathbf{a}_q)^T(\mathbf{t}_q - \mathbf{a}_q) = \sum_{q=1}^{Q} \mathbf{e}_q^T \mathbf{e}_q = \sum_{q=1}^{Q}\sum_{j=1}^{S^M} (e_{j,q})^2 = \sum_{i=1}^{N} (v_i)^2$

• Step 2: Compute the Jacobian matrix. Calculate the sensitivities with the backpropagation algorithm, after initializing. Augment the individual matrices into the Marquardt sensitivities. Compute the elements of the Jacobian matrix.
43
~M M
S q   F (n q )
M

 m
~m m 1 T ~ m 1 , m = M – 1   2  1
Sq  F (n q )( W ) Sq
m

Sm  S
 1
m
Sm
2 ... m
SQ 

m m
v h ek  q  ek q  n i q m n i q m m–1
 J h l = -------- = ---------
--- = ------------  ------------ = s̃ i h  ------------ = s̃ i h  a j q
x l wi j
m m
 n i q w i j
m
wi j
m

m m
vh  ek q e k q n i q m  n i q m
J h  l -- = -------m----- 
= -------- = ---------- ---------m--- = s̃ i h  ----------
-- = s̃ i h
 xl b i
m
 ni q b i bi
m

44
• Step 3: Solve the following equation to obtain the change in the weights:
$\Delta\mathbf{x}_k = -\left[\mathbf{J}^T(\mathbf{x}_k)\,\mathbf{J}(\mathbf{x}_k) + \mu_k\,\mathbf{I}\right]^{-1}\mathbf{J}^T(\mathbf{x}_k)\,\mathbf{v}(\mathbf{x}_k), \qquad \mathbf{x}_{k+1} = \mathbf{x}_k + \Delta\mathbf{x}_k$

• Step 4: Recompute the sum of squared errors with the new weights. If this new sum of squares is smaller than that computed in step 1, then divide $\mu_k$ by the factor $\vartheta$, update the weights, and go back to step 1. If the sum of squares is not reduced, then multiply $\mu_k$ by $\vartheta$ and go back to step 3.

The algorithm is assumed to have converged when the norm of the gradient is less than some predetermined value, or when the sum of squares has been reduced to some error goal.

See P12.5 for a numerical illustration of the Jacobian computation.
45
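To tie the four steps together, here is a small Python/numpy sketch of LMBP for a two-layer 1-2-1 network. It assumes a logsig hidden layer and a linear output layer, samples its training data from the nominal function of the earlier performance-surface example, and uses an illustrative μ schedule (initial μ = 0.01, factor of 10); none of these specific choices come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
logsig = lambda n: 1.0 / (1.0 + np.exp(-n))

# Training set: sample the nominal function of the performance-surface example.
W1n = np.array([[10.0], [10.0]]); b1n = np.array([[-5.0], [5.0]])
W2n = np.array([[1.0, 1.0]]);     b2n = np.array([[-1.0]])
p = np.linspace(-2.0, 2.0, 21)[None, :]          # 1 x Q inputs
t = W2n @ logsig(W1n @ p + b1n) + b2n            # 1 x Q targets

def unpack(x):
    return (x[0:2].reshape(2, 1), x[2:4].reshape(2, 1),     # W1 (2x1), b1 (2x1)
            x[4:6].reshape(1, 2), x[6:7].reshape(1, 1))     # W2 (1x2), b2 (1x1)

def residuals_and_jacobian(x):
    """Steps 1 and 2: errors v and Jacobian J via Marquardt sensitivities."""
    W1, b1, W2, b2 = unpack(x)
    n1 = W1 @ p + b1; a1 = logsig(n1)            # hidden layer, 2 x Q
    a2 = W2 @ a1 + b2                            # linear output, 1 x Q
    v = (t - a2).ravel()                         # e_q = t_q - a_q (one error per pair, S^M = 1)
    s2 = -np.ones_like(a2)                       # S~2 = -F'(n2) = -1 for a linear output layer
    s1 = (a1 * (1.0 - a1)) * (W2.T @ s2)         # S~1 = F'(n1) (W2)^T S~2 (logsig derivative)
    # One row of J per error; columns ordered [W1; b1; W2; b2].
    J = np.column_stack([s1[0] * p[0], s1[1] * p[0],    # d e / d w1_{i,1} = s~1_i * p
                         s1[0],        s1[1],           # d e / d b1_i     = s~1_i
                         s2[0] * a1[0], s2[0] * a1[1],  # d e / d w2_{1,j} = s~2 * a1_j
                         s2[0]])                        # d e / d b2       = s~2
    return v, J

x = rng.normal(scale=0.5, size=7)                # small random start (not zero)
mu = 0.01
for _ in range(50):
    v, J = residuals_and_jacobian(x)
    sse = v @ v
    for _ in range(30):                          # steps 3 and 4: adjust mu until F decreases
        dx = -np.linalg.solve(J.T @ J + mu * np.eye(7), J.T @ v)
        v_new, _ = residuals_and_jacobian(x + dx)
        if v_new @ v_new < sse:
            x, mu = x + dx, mu / 10.0            # accept the step, reduce mu
            break
        mu *= 10.0                               # reject the step, raise mu, retry

v_final, _ = residuals_and_jacobian(x)
print("final SSE:", round(float(v_final @ v_final), 6))  # usually small; may be a local minimum
```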
Example LMBP Step

Black arrow: small μ_k (Gauss-Newton direction)
Blue arrow: large μ_k (steepest descent direction)
Blue curve: LM trajectory for intermediate μ_k

46
LMBP Trajectory
Demonstrations: nnd12ms, nnd12m
Storage requirement: n × n for the Hessian matrix

47
