Backpropagation
• The basic backpropagation algorithm is too slow for
most practical applications. It may take days or weeks
of computer time.
• We demonstrate why the backpropagation algorithm
is slow in converging.
• We saw that steepest descent is the slowest of these
minimization methods.
• The conjugate gradient algorithm and Newton's
method generally provide faster convergence.
Variations
• Heuristic modifications
– Momentum
– Variable learning rate
• Standard numerical optimization techniques
– Conjugate gradient
– Newton's method (Levenberg-Marquardt)
Drawbacks of BP
• We saw that the LMS algorithm is guaranteed to
converge to a solution that minimizes the mean
squared error, so long as the learning rate is not too
large.
• The mean squared error surface of a multilayer network
is not quadratic: it can contain flat regions and local
minima, so no such guarantee holds for backpropagation.
Parameter Values
$w^1_{1,1} = 10$, $w^1_{2,1} = 10$, $b^1_1 = -5$, $b^1_2 = 5$
$w^2_{1,1} = 1$, $w^2_{1,2} = 1$, $b^2 = -1$
[Figure: squared error vs. $w^1_{1,1}$ and $w^2_{1,1}$, with the remaining parameters held at their nominal values $w^1_{1,1} = 10$, $b^1_1 = -5$, etc.]
Flat regions of the performance surface should not be unexpected, given the
sigmoid transfer functions used by the network. The sigmoid is very flat for
large inputs.
[Figure: squared error vs. $b^1_1$ and $b^1_2$]
nnd12sd1
nnd12sd2
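As a rough illustration of where these flat regions come from, the sketch below rebuilds the squared-error surface: it treats the response of a 1-2-1 network at the nominal parameters above as the target function, then sweeps $w^1_{1,1}$ and $w^2_{1,1}$. The log-sigmoid hidden layer and linear output layer are assumptions here, since the slides list only the parameter values.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

# Nominal parameters from the slide (1-2-1 network).
W1 = np.array([[10.0], [10.0]]); b1 = np.array([[-5.0], [5.0]])
W2 = np.array([[1.0, 1.0]]);     b2 = np.array([[-1.0]])

def net(p, W1, b1, W2, b2):
    # Hidden layer: log-sigmoid; output layer: linear (assumed architecture).
    a1 = logsig(W1 @ p + b1)
    return W2 @ a1 + b2

# Targets: the nominal network's own response over the input range.
p = np.linspace(-2, 2, 41).reshape(1, -1)
t = net(p, W1, b1, W2, b2)

# Sweep w1_{1,1} and w2_{1,1}; keep the other parameters at their nominal values.
w1_vals = np.linspace(-5, 15, 81)
w2_vals = np.linspace(-5, 15, 81)
E = np.zeros((len(w1_vals), len(w2_vals)))
for i, w1 in enumerate(w1_vals):
    for j, w2 in enumerate(w2_vals):
        W1t, W2t = W1.copy(), W2.copy()
        W1t[0, 0] = w1
        W2t[0, 0] = w2
        e = t - net(p, W1t, b1, W2t, b2)
        E[i, j] = np.sum(e**2)   # squared error over the training set

# E is nearly constant wherever the sigmoid saturates (large |n|),
# which is the flat-region behaviour described above.
print(E.min(), E.max())
```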
Momentum
Filter
$y(k) = \gamma\, y(k-1) + (1-\gamma)\, w(k)$,  $0 \le \gamma < 1$
Example
$w(k) = 1 + \sin\!\left(\dfrac{2\pi k}{16}\right)$
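A minimal sketch of this first-order smoothing filter applied to the example input; the γ values chosen here are illustrative:

```python
import numpy as np

def smooth(w, gamma):
    """First-order low-pass filter: y(k) = gamma*y(k-1) + (1-gamma)*w(k)."""
    y, prev = [], 0.0
    for wk in w:
        prev = gamma * prev + (1 - gamma) * wk
        y.append(prev)
    return np.array(y)

k = np.arange(0, 200)
w = 1 + np.sin(2 * np.pi * k / 16)        # example input from the slide

for gamma in (0.9, 0.98):                 # illustrative values
    y = smooth(w, gamma)
    ripple = y[-16:].max() - y[-16:].min()   # oscillation over one input period
    # Larger gamma -> smaller ripple, but slower response toward the input mean.
    print(f"gamma={gamma}: mean={y[-64:].mean():.3f}, ripple={ripple:.3f}")
```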
Observations
• The oscillation of the filter output is less than
the oscillation of the filter input (it acts as a low-pass
filter).
• As γ is increased, the oscillation in the filter
output is reduced.
• The average filter output is the same as the
average filter input, although as γ is increased
the filter output is slower to respond.
• To summarize, the filter tends to reduce the
amount of oscillation while still tracking the
average value.
Momentum Backpropagation (MOBP)
$\Delta W^m(k) = \gamma\, \Delta W^m(k-1) - (1-\gamma)\,\alpha\, s^m (a^{m-1})^T$
$\Delta b^m(k) = \gamma\, \Delta b^m(k-1) - (1-\gamma)\,\alpha\, s^m$
($\gamma = 0.8$ in the example trajectory)
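A sketch of the MOBP update for a single layer, written as it might appear inside a backpropagation loop; the sensitivity s and previous-layer output a_prev are assumed to come from the usual backprop pass:

```python
import numpy as np

def mobp_update(W, b, dW_prev, db_prev, s, a_prev, alpha=0.1, gamma=0.8):
    """One MOBP step for layer m.

    s        : sensitivity s^m for this layer (column vector)
    a_prev   : output a^(m-1) of the previous layer (column vector)
    dW_prev, db_prev : parameter changes from the previous iteration
    """
    dW = gamma * dW_prev - (1 - gamma) * alpha * (s @ a_prev.T)
    db = gamma * db_prev - (1 - gamma) * alpha * s
    return W + dW, b + db, dW, db
```

With γ = 0 this reduces to the standard steepest-descent update; γ = 0.8 is the value used in the example above.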
• The batching form of MOBP is used, in which the
parameters are updated only after the entire
example set has been presented.
• The same initial condition and learning rate
were used as in the previous example, in
which the algorithm was not stable.
• The algorithm is now stable, and momentum tends
to accelerate convergence when the trajectory is
moving in a consistent direction.
nnd12mo
Variable Learning Rate (VLBP)
• If the squared error (over the entire training set) increases by
more than some set percentage ζ after a weight update, then
the weight update is discarded, the learning rate is multiplied
by some factor ρ (0 < ρ < 1), and the momentum coefficient γ is
set to zero.
• If the squared error decreases after a weight update, then the
weight update is accepted and the learning rate is multiplied
by some factor η > 1. If γ has previously been set to zero, it is
reset to its original value.
• If the squared error increases by less than ζ, then the weight
update is accepted, but the learning rate and the momentum
coefficient are unchanged.
Example
η = 1.05
ρ = 0.7
ζ = 4%
nnd12vl
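A sketch of the learning-rate adjustment logic described above, using the example settings η = 1.05, ρ = 0.7, ζ = 4% as defaults; the surrounding weight-update code is assumed to exist elsewhere:

```python
def vlbp_adjust(err_new, err_old, alpha, gamma, gamma_0,
                eta=1.05, rho=0.7, zeta=0.04):
    """Decide whether to accept a tentative weight update and adjust alpha/gamma.

    Returns (accept, alpha, gamma).
    """
    if err_new > err_old * (1 + zeta):
        # Error grew by more than zeta: discard the step, shrink alpha,
        # and temporarily turn momentum off.
        return False, alpha * rho, 0.0
    elif err_new < err_old:
        # Error decreased: accept the step, grow alpha, restore momentum if it was off.
        return True, alpha * eta, gamma_0 if gamma == 0.0 else gamma
    else:
        # Error grew, but by less than zeta: accept, keep alpha and gamma unchanged.
        return True, alpha, gamma
```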
[Figure: squared error and learning rate α versus iteration for the VLBP example]
Review of CG Algorithm
1. The first search direction is steepest descent:
$p_0 = -g_0$, where $g_k \equiv \nabla F(x)\big|_{x = x_k}$
2. Take a step, choosing the learning rate to minimize the
function along the search direction:
$x_{k+1} = x_k + \alpha_k p_k$
3. Select the next search direction as a conjugate direction:
$p_k = -g_k + \beta_k p_{k-1}$
4. If the algorithm has not converged, return to step 2.
Interval Reduction
Golden Section Search
τ = 0.618
Set c_1 = a_1 + (1-τ)(b_1 - a_1), Fc = F(c_1)
    d_1 = b_1 - (1-τ)(b_1 - a_1), Fd = F(d_1)
For k = 1, 2, ... repeat
    If Fc < Fd then
        Set a_{k+1} = a_k; b_{k+1} = d_k; d_{k+1} = c_k
            c_{k+1} = a_{k+1} + (1-τ)(b_{k+1} - a_{k+1})
            Fd = Fc; Fc = F(c_{k+1})
    else
        Set a_{k+1} = c_k; b_{k+1} = b_k; c_{k+1} = d_k
            d_{k+1} = b_{k+1} - (1-τ)(b_{k+1} - a_{k+1})
            Fc = Fd; Fd = F(d_{k+1})
    end
end until b_{k+1} - a_{k+1} < tolerance
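The same procedure as a runnable Python sketch; the tolerance and the test function are illustrative:

```python
def golden_section(F, a, b, tol=1e-4):
    """Reduce [a, b] to an interval of width < tol containing a minimizer of F."""
    tau = 0.618
    c = a + (1 - tau) * (b - a); Fc = F(c)
    d = b - (1 - tau) * (b - a); Fd = F(d)
    while b - a >= tol:
        if Fc < Fd:
            # Minimum lies in [a, d]: reuse c as the new interior point d.
            b, d, Fd = d, c, Fc
            c = a + (1 - tau) * (b - a); Fc = F(c)
        else:
            # Minimum lies in [c, b]: reuse d as the new interior point c.
            a, c, Fc = c, d, Fd
            d = b - (1 - tau) * (b - a); Fd = F(d)
    return (a + b) / 2

# Example: minimize a simple quadratic along an interval.
print(golden_section(lambda x: (x - 1.3) ** 2, 0.0, 3.0))   # ~1.3
```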
• For quadratic functions the algorithm will converge
to the minimum in at most n iterations (where n is the
number of parameters); this normally does not happen for
multilayer networks.
• The development of the CG algorithm does not
indicate what search direction to use once a cycle of n
iterations has been completed.
• The simplest method is to reset the search direction to
the steepest descent direction after n iterations.
• In the following function approximation example we
use the BP algorithm to compute the gradient and the
CG algorithm to determine the weight updates. This
is a batch mode algorithm.
Conjugate Gradient BP (CGBP)
nnd12ls
nnd12cg
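Pulling together the review steps, the line search, and the restart rule, here is a sketch of a CGBP-style outer loop; the Polak-Ribiére formula for β is an assumed (common) choice, and grad and line_min stand for the backprop gradient computation and the interval-reduction line search:

```python
import numpy as np

def conjugate_gradient(grad, x0, line_min, n, max_iter=200, tol=1e-6):
    """Minimize F given its gradient, a 1-D line minimizer, and n parameters."""
    x = x0.copy()
    g = grad(x)
    p = -g                                   # step 1: start with steepest descent
    for k in range(max_iter):
        alpha = line_min(x, p)               # step 2: minimize F along p (e.g. golden section)
        x = x + alpha * p
        g_new = grad(x)
        if np.linalg.norm(g_new) < tol:
            break
        if (k + 1) % n == 0:
            p = -g_new                       # restart: reset to steepest descent every n iterations
        else:
            beta = (g_new - g) @ g_new / (g @ g)   # Polak-Ribiere (assumed choice)
            p = -g_new + beta * p            # step 3: new conjugate search direction
        g = g_new
    return x
```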
Newton's Method
$x_{k+1} = x_k - A_k^{-1} g_k$
$A_k \equiv \nabla^2 F(x)\big|_{x=x_k}$,  $g_k \equiv \nabla F(x)\big|_{x=x_k}$

If $F(x)$ is a sum of squares, $F(x) = \sum_{i=1}^{N} v_i^2(x)$, define the Jacobian matrix:

$J(x) = \begin{bmatrix}
\dfrac{\partial v_1(x)}{\partial x_1} & \dfrac{\partial v_1(x)}{\partial x_2} & \cdots & \dfrac{\partial v_1(x)}{\partial x_n} \\
\dfrac{\partial v_2(x)}{\partial x_1} & \dfrac{\partial v_2(x)}{\partial x_2} & \cdots & \dfrac{\partial v_2(x)}{\partial x_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial v_N(x)}{\partial x_1} & \dfrac{\partial v_N(x)}{\partial x_2} & \cdots & \dfrac{\partial v_N(x)}{\partial x_n}
\end{bmatrix}$   ($N \times n$)
Now we want to find the Hessian matrix. Its elements are
$[\nabla^2 F(x)]_{k,j} = \dfrac{\partial^2 F(x)}{\partial x_k\, \partial x_j} = 2 \sum_{i=1}^{N} \left\{ \dfrac{\partial v_i(x)}{\partial x_k}\, \dfrac{\partial v_i(x)}{\partial x_j} + v_i(x)\, \dfrac{\partial^2 v_i(x)}{\partial x_k\, \partial x_j} \right\}$
and the elements of the gradient are
$[\nabla F(x)]_j = \dfrac{\partial F(x)}{\partial x_j} = 2 \sum_{i=1}^{N} v_i(x)\, \dfrac{\partial v_i(x)}{\partial x_j}$
In matrix form:
$\nabla^2 F(x) = 2 J^T(x) J(x) + 2 S(x)$,  where  $S(x) = \sum_{i=1}^{N} v_i(x)\, \nabla^2 v_i(x)$
Gauss-Newton Method
Approximate the Hessian matrix as:
$\nabla^2 F(x) \approx 2 J^T(x) J(x)$
and use the gradient:
$\nabla F(x) = 2 J^T(x)\, v(x)$
We had:  $x_{k+1} = x_k - A_k^{-1} g_k$
Newton's method then becomes the Gauss-Newton method:
$x_{k+1} = x_k - \left[ 2 J^T(x_k) J(x_k) \right]^{-1} 2 J^T(x_k)\, v(x_k) = x_k - \left[ J^T(x_k) J(x_k) \right]^{-1} J^T(x_k)\, v(x_k)$
Levenberg-Marquardt
Gauss-Newton approximates the Hessian by:
$H = J^T J$
The Levenberg-Marquardt modification adds a multiple of the identity, $G = H + \mu I$. The eigenvalues of $G$:
$G z_i = [H + \mu I] z_i = H z_i + \mu z_i = \lambda_i z_i + \mu z_i = (\lambda_i + \mu) z_i$
so $G$ can always be made positive definite by increasing $\mu$. The update becomes:
$x_{k+1} = x_k - \left[ J^T(x_k) J(x_k) + \mu_k I \right]^{-1} J^T(x_k)\, v(x_k)$
As $\mu_k$ is increased, this approaches steepest descent with a small learning rate:
$x_{k+1} \approx x_k - \dfrac{1}{\mu_k} J^T(x_k)\, v(x_k) = x_k - \dfrac{1}{2\mu_k} \nabla F(x)$
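A minimal sketch of a single LM step, showing how μ moves the update between the Gauss-Newton step (small μ) and a short steepest-descent step (large μ); J and v are the Jacobian and error vector defined below:

```python
import numpy as np

def lm_step(J, v, mu):
    """Change in x from x_{k+1} = x_k - [J^T J + mu*I]^{-1} J^T v."""
    n = J.shape[1]
    A = J.T @ J + mu * np.eye(n)
    return -np.linalg.solve(A, J.T @ v)

# Small mu -> close to the Gauss-Newton step; large mu -> roughly -(1/mu) J^T v.
J = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.3]])
v = np.array([0.2, -0.1, 0.4])
print(lm_step(J, v, 1e-6))
print(lm_step(J, v, 1e3))
```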
Application to multilayer network training: if each input/target pair occurs with equal probability, the mean squared error is proportional to the sum of squared errors
$F(x) = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) = \sum_{q=1}^{Q} e_q^T e_q = \sum_{q=1}^{Q} \sum_{j=1}^{S^M} (e_{j,q})^2 = \sum_{i=1}^{N} (v_i)^2$
where $e_{j,q}$ is the $j$th element of the error for the $q$th input/target pair. This has the sum-of-squares form of performance index for which LM was designed.
In standard BP we compute the derivatives of the squared errors with respect to the weights and biases. To create the matrix $J$ we need to compute the derivatives of the errors themselves.
The error vector is:
$v^T = [\, v_1 \;\; v_2 \;\; \cdots \;\; v_N \,] = [\, e_{1,1} \;\; e_{2,1} \;\; \cdots \;\; e_{S^M,1} \;\; e_{1,2} \;\; \cdots \;\; e_{S^M,Q} \,]$
The parameter vector is:
$x^T = [\, x_1 \;\; x_2 \;\; \cdots \;\; x_n \,] = [\, w^1_{1,1} \;\; w^1_{1,2} \;\; \cdots \;\; w^1_{S^1,R} \;\; b^1_1 \;\; \cdots \;\; b^1_{S^1} \;\; w^2_{1,1} \;\; \cdots \;\; b^M_{S^M} \,]$
The Jacobian matrix is:
$J(x) = \begin{bmatrix}
\dfrac{\partial e_{1,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,1}}{\partial b^1_1} & \cdots \\
\dfrac{\partial e_{2,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{2,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{2,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{2,1}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots & \\
\dfrac{\partial e_{S^M,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{S^M,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{S^M,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{S^M,1}}{\partial b^1_1} & \cdots \\
\dfrac{\partial e_{1,2}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,2}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,2}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,2}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots &
\end{bmatrix}$   ($N \times n$)
Computing the Jacobian
SDBP computes terms like:
$\dfrac{\partial \hat{F}(x)}{\partial x_l} = \dfrac{\partial\, e_q^T e_q}{\partial x_l}$
using the chain rule:
$\dfrac{\partial \hat{F}}{\partial w^m_{i,j}} = \dfrac{\partial \hat{F}}{\partial n^m_i} \cdot \dfrac{\partial n^m_i}{\partial w^m_{i,j}}$
where the sensitivity
$s^m_i \equiv \dfrac{\partial \hat{F}}{\partial n^m_i}$
is computed using backpropagation.
For the Jacobian we need the derivatives of the individual errors rather than of the squared error. Defining the Marquardt sensitivity $\tilde{s}^m_{i,h} \equiv \dfrac{\partial v_h}{\partial n^m_{i,q}}$, the element corresponding to a bias is:
$[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial b^m_i} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde{s}^m_{i,h} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde{s}^m_{i,h}$
Computing the Sensitivities
Initialization:
$\tilde{s}^M_{i,h} = \dfrac{\partial v_h}{\partial n^M_{i,q}} = \dfrac{\partial e_{k,q}}{\partial n^M_{i,q}} = \dfrac{\partial (t_{k,q} - a^M_{k,q})}{\partial n^M_{i,q}} = -\dfrac{\partial a^M_{k,q}}{\partial n^M_{i,q}}$

$\tilde{s}^M_{i,h} = \begin{cases} -\dot{f}^M(n^M_{i,q}) & \text{for } i = k \\ 0 & \text{for } i \neq k \end{cases}$

Therefore, when the input $p_q$ has been applied to the
network and the corresponding network output $a^M_q$
has been computed, the LMBP is initialized with
$\tilde{S}^M_q = -\dot{F}^M(n^M_q)$
where
$\dot{F}^m(n^m) = \begin{bmatrix}
\dot{f}^m(n^m_1) & 0 & \cdots & 0 \\
0 & \dot{f}^m(n^m_2) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \dot{f}^m(n^m_{S^m})
\end{bmatrix}$

Each column of the matrix $\tilde{S}^M_q$ must be
backpropagated through the network using the
following equation (Ch. 11) to produce one row of the
Jacobian matrix:
$s^m = \dot{F}^m(n^m)\, (W^{m+1})^T\, s^{m+1}$
The columns can also be backpropagated together using
$\tilde{S}^m_q = \dot{F}^m(n^m_q)\, (W^{m+1})^T\, \tilde{S}^{m+1}_q$
The Marquardt sensitivity matrices for the whole training set are obtained by augmenting the matrices computed for each input:
$\tilde{S}^m = [\, \tilde{S}^m_1 \;|\; \tilde{S}^m_2 \;|\; \cdots \;|\; \tilde{S}^m_Q \,]$
The elements of the Jacobian are then computed. For a weight:
$[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial w^m_{i,j}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde{s}^m_{i,h} \cdot a^{m-1}_{j,q}$
For a bias:
$[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial b^m_i} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde{s}^m_{i,h}$
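To make the bookkeeping concrete, here is a sketch that assembles J for a two-layer network using the Marquardt sensitivities above. The tan-sigmoid hidden layer and linear output layer are assumptions, and the column ordering follows the parameter vector defined earlier (layer-1 weights, layer-1 biases, layer-2 weights, layer-2 biases):

```python
import numpy as np

def lm_jacobian(P, T, W1, b1, W2, b2):
    """Jacobian of the error vector v w.r.t. x = [W1(:); b1; W2(:); b2].

    P: R x Q inputs, T: S2 x Q targets.
    Rows of J are ordered e_{1,1}, ..., e_{S2,1}, e_{1,2}, ...
    """
    S1, R = W1.shape
    S2 = W2.shape[0]
    Q = P.shape[1]
    n = S1 * R + S1 + S2 * S1 + S2
    J = np.zeros((Q * S2, n))

    for q in range(Q):
        a0 = P[:, [q]]
        n1 = W1 @ a0 + b1
        a1 = np.tanh(n1)                     # hidden layer (assumed tansig)
        a2 = W2 @ a1 + b2                    # output layer (assumed linear)

        # Marquardt sensitivities: one column per error element e_{k,q}.
        S2t = -np.eye(S2)                                    # S~^2_q = -F'^2(n^2_q)
        S1t = np.diag((1 - a1**2).ravel()) @ W2.T @ S2t      # S~^1_q = F'^1 (W^2)^T S~^2_q

        for k in range(S2):
            h = q * S2 + k                   # row index for e_{k,q}
            J[h, :] = np.concatenate([
                np.outer(S1t[:, k], a0.ravel()).ravel(),     # d e / d W1: s~^1_{i,h} a^0_{j,q}
                S1t[:, k],                                    # d e / d b1: s~^1_{i,h}
                np.outer(S2t[:, k], a1.ravel()).ravel(),     # d e / d W2: s~^2_{i,h} a^1_{j,q}
                S2t[:, k],                                    # d e / d b2: s~^2_{i,h}
            ])
    return J
```

A finite-difference check of a few entries of J on a small random network is a quick way to validate the ordering conventions.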
LMBP (summarized)
• Present all inputs to the network and compute the
corresponding network outputs and the errors, $e_q = t_q - a^M_q$.
Compute the sum of squared errors over all inputs:
$F(x) = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) = \sum_{q=1}^{Q} e_q^T e_q = \sum_{q=1}^{Q} \sum_{j=1}^{S^M} (e_{j,q})^2 = \sum_{i=1}^{N} (v_i)^2$
• Compute the Jacobian matrix. Backpropagate the Marquardt sensitivities with
$\tilde{S}^m_q = \dot{F}^m(n^m_q)\, (W^{m+1})^T\, \tilde{S}^{m+1}_q$,  $m = M-1, \ldots, 2, 1$
augment the individual matrices into $\tilde{S}^m = [\, \tilde{S}^m_1 \;|\; \tilde{S}^m_2 \;|\; \cdots \;|\; \tilde{S}^m_Q \,]$, and compute the elements of the Jacobian:
$[J]_{h,l} = \dfrac{\partial e_{k,q}}{\partial w^m_{i,j}} = \tilde{s}^m_{i,h}\, a^{m-1}_{j,q}$ (weights),   $[J]_{h,l} = \dfrac{\partial e_{k,q}}{\partial b^m_i} = \tilde{s}^m_{i,h}$ (biases)
• Solve the following equation to obtain the change in the weights:
$\Delta x_k = x_{k+1} - x_k = -\left[ J^T(x_k) J(x_k) + \mu_k I \right]^{-1} J^T(x_k)\, v(x_k)$
• Recompute the sum of squared errors at $x_k + \Delta x_k$. If it is smaller than the value computed in the first step, divide $\mu_k$ by some factor $\vartheta > 1$, set $x_{k+1} = x_k + \Delta x_k$, and return to the first step. If it is not smaller, multiply $\mu_k$ by $\vartheta$ and recompute $\Delta x_k$.
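A sketch of the outer LMBP loop with the usual μ adjustment; the factor ϑ = 10 and the initial μ are common defaults rather than values fixed by the slides, and compute_errors / compute_jacobian stand for the steps summarized above:

```python
import numpy as np

def lmbp_train(x0, compute_errors, compute_jacobian,
               mu=0.01, theta=10.0, max_iter=100, tol=1e-6):
    """Levenberg-Marquardt training loop (batch)."""
    x = x0.copy()
    v = compute_errors(x)                   # step 1: errors over all inputs
    sse = v @ v
    for _ in range(max_iter):
        J = compute_jacobian(x)             # step 2: Jacobian via Marquardt sensitivities
        g = J.T @ v
        if np.linalg.norm(g) < tol:
            break
        while True:
            # step 3: solve for the tentative change in the parameters
            dx = -np.linalg.solve(J.T @ J + mu * np.eye(len(x)), g)
            v_new = compute_errors(x + dx)
            sse_new = v_new @ v_new
            if sse_new < sse:               # step 4: accept the step, decrease mu
                x, v, sse, mu = x + dx, v_new, sse_new, mu / theta
                break
            mu *= theta                     # reject: increase mu and retry
            if mu > 1e10:
                return x
    return x
```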
LMBP Trajectory
nnd12ms
nnd12m
Storage requirement: the n × n approximate Hessian matrix $J^T J$ must be stored.