You are on page 1of 4

ANALYSIS

T H E M I S A D J U S T M E N T OF BP
AND AN IMPROVED ALGORITHM

OF

Yang Jia

and

NETWORK

Yang Dali

Beijing University of Posts and Telecommunications


P.O.Box 145, Beijing 100088, P.R.CHINA

Abstract -- This paper analyzes the misadjustment


of t h e conventional BP network introduced by t h e
momentum factor a and i t s disadvantageous effect on
the network, and proposes a n improved algorithm with
stochastic attenuation momentum factor a ( k ) . Analysis
and simulations show t h a t t h e momentum misadjustment
is a nonnegligible factor to affect on t h e convergence
precision of
BP
network.
Compared
with
the
conventional BP algorithm, the improved algorithm can
effectively cancel t h e negative effect of a on network
and improve t h e convergence performance of the
network.

I. INTRODUCTION
The multilayer feedforward neural network models
with e r r o r Back-Propagation (BP) algorithm have been
widely researched and applied in some fields [ l , 21. A s
a generalization of the L M S algorithm, the backpropagation algorithm uses a gradient research
technique to minimize a cost function equal to the
mean square difference of t h e desired and the actually
net output. To reduce t h e e r r o r bursting during t h e
learning and to speed up convergence of the
algorithm, Rumelhart, Hinlton and Willims suggested to
add a momentum term to the weight update expression
of the BE algorithm in 1986 [3]. Thus, the weight
update expression of the conventional BP algorithm is
written as

size parameter p is often considered as a main factor


to affect the convergence performance of the BP
algorithm. Whereas t h e negative effect of
the
momentum factor dl on the convergence performance of
BP algorithm is often overlooked in t h e analysis. A
comprehensive analysis for t h e momentum factor a has
been made in t h e paper [41 as t h e MLMS adaptive
algorithm, in which the factor dl is a constant scalar.
For a adaptive filter, maybe the negative effect of a
is not serious to most applications. However, for a
multilayer neural network with a great number of t h e
adaptive units connected each other, t h e effect of a on
network should be paid attention.
The purpose of this work is to clear the negative
effect of t h e constant factor a on the multilayer
feedforward neural network with BP algorithm and to
present a n improved algorithm to overcome t h e effect.

11. ANALYSIS O F THE MOMENTUM MISADJUSTMENT

For the conventional BP network with t h e momentum


weight update given by equation (l), we may consider
each neural net as a standard adaptive unit. According
to the analysis result of t h e paper [41, the
misadjustment in each neural net may expressed as

Where M O denotes the misadjustment without t h e


momentum term o r the misadjustment of t h e standard
W d j ( k + l ) =W,,(k) +p6,(k)Oj(k)
+aAW,,(k)
L M S algorithm [5]. Because a is bounded 0 < a < 1 in
applications of BP network, it is clear from (2) t h a t
the larger the a is, the larger t h e misadjustment is.
Where I, is a step size to control convergence rate and
Though a larger a is contributory to speeding u p the
the stead-state performance of the algorithm, O.(k) is
convergence, the faster convergence is gained at t h e
1
the input signal, a and AW..(k)=(Wij(k)-Wij(k-l)) cost of convergence precision descent. For analysis
1)
express the momentum factor and the momentum,
simplicity, we consider a three-layer BP network
respectively.
Observe
that
the
weight
update
shown in Fig. 1. The momentum misadjustment of the
expression given by (1) corresponds to a secondsnetwork can be described as t h e noise sources which
order adaptive algorithm in t h a t two previous weight
a r e added to only t h e nodes in the hidden layer and
vectors are combined at each iteration to obtain the
the output layer of BP network. Thus, the output of
updated weight vectors. A variety of experiments have
the hidden layer node is expressed as
shown that the algorithm with t h e momentum update is
(3)
effective to s u p p r e s s the e r r o r bursting in t h e weight
,n,
W y ] x y + N ? ( U ) ) j=l,
space and to speed up the smoothing convergence of
t h e algorithm.
In the property analysis of the BP network, t h e s t e p
Where N j ( a ) denotes the noise introduced by t h e

(1)

...

0-7803-1254-6/93$03.000 1993 IEEE

2592

momentum factor
Clearly, t h e noise is a p a r t of t h e
input signal of t h e network o u t p u t layer. Accordingly,
we write t h e outputs of network as

a noise along with t h e useful signal P(*). A s a result,


f o r the o u t p u t layer of the network, we have
4

yE)-i~(&W $ ! ~ > ~ + N ;(a )1

(10)

= P ( X g ) + g ( X y , N y ( a ), X ; , N :
Substituting ( 4 ) into (31, yields

mal,

Where P(*) is the output of


additional momentum term and

network

(a))

without the

. . ., n3
(5)

I n the above equations, t h e nonlinear function


the typical sigmoid function defined by

is

f ( x )=- l +1e - =
Observe t h a t if equations ( 3 ) and ( 5 ) are directly
expressed by the form of equation (61, the analysis for
this question will become extremely complex. To avoid
the difficulty to analyze the complex nonlinear sigmoid
function, we u s e the following quadratic interpolation
function P(x) to approximate the sigmoid function
given by equation ( 6 ) , i.e.

x< -2 ln3

(0.1

(7)

x>21n3

0.9

I t may be demonstrated t h a t t h e interpolation e r r o r


between R(x) and f ( x ) satisfies R(x)=f(x)-P(x)< 10.
Consequently, we replace the sigmoid function f(x) by
the interpolation function P(x) and rewrite t h e
equation (3) as

Where P(*) denotes the output of t h e hidden layer


without the momentum term and g(*) is t h e additional
o u t p u t introduced by the momentum factor a . P(*) and
g(*) a r e defined, respectively, as follows

P ( x ~ =a+m?ic ( x y ) 2

Which expresses the additional o u t p u t introduced by


the momentum factor. Clearly, the effect of q(*), as a
noise source, on the network can not be overlooked in
the analysis of the BP algorithm.

111. AN IMPROVED ALGORITHM

We see from t h e above analysis t h a t t h e momentum


noises in t h e hidden and the o u t p u t nodes of BP
network depend on the value of the momentum factor
a. Consequently, t h e key to cancel the momentum
misadjustment noise is to cancel the effect of t h e
momentum factor a . However, t h e factor a is a
contradictory element in the algorithm convergence
process, t h a t is to say t h a t i t is of t h e smoothing
convergence role, and at the same time, introduces
the momentum noise to the network. If we consider to
cancel t h e negative effect of a on the BP network only
by simply setting a=O when the algorithm converges to
a certain extent, then t h e smoothing role of a will also
despair along with a. Therefore, a n improved algorithm
should take into account both to cancel the momentum
noise and to remain the speeding u p as well as the
smoothing convergence.
Based on the ideal, we give such a n improved
algorithm t h a t the momentum factor a ( k ) in this
algorithm is designed as the stochastic variable and
the function of the output error. As a result, t h e
stochastic factor a ( k ) will attenuate to zero as the
algorithm converges, and the momentum misadjustment
introduced by the constant a will disappear.
The weight update expression for the improved
algorithm is given by

,NY( a ) = N > ~() a ) ( b * ~ ( 2 X > ~ + N( ay )) )

g(x>2

(12)

Wfj(k+l)=Wfj(k)+ p b j (k)0,(k)+ a j (k)Awl, (k)

(9)

Where a=0.5, b=0.3/ln3, c=O.O5/(ln3). I t is clear from


( 8 ) t h a t the momentum misadjustment g(*) in the
hidden layer of the network will input to next layer as

I t can be demonstrated t h a t t h e momentum and t h e


gradient in t h e algorithm have t h e following orthogonal

2593

relationship

Where the subscript T denotes t h e vector transpose.


For the mean-square-error
(mse) function with a
quadratic o r a n approximate quadratic performance
surface, the output e r r o r can be expressed as

Thus, we have

a@=

u~pw)w'

bl

~ ~ ~ ~ ~ ( ~ x a ~ ~ - l > o - l ) )
tz(&)=iWT(k)RryLk)
+PIy(lk)+C

m-l

(14)

*-!E

Where R is a correlation matrix of the input signal for


each neural net, P is t h e cross-correlation vector of
t h e desired response and t h e input signal, and

t-1

c Ym)

m- 1

C=E[d(k)I2.By differentiating (14) with respect to the


weight vector W, it i s easy to obtain the following
expression

We see from (21) t h a t t h e momentum factor a ( k ) is a


function of the gradient instead of a constant, so t h a t
the varying factor added to the c u r r e n t momentum
ve(W~k+l))=vE(w~~))+~w~k+x)-W~k))
(15)
could play t h e role to accelerate and to smooth the
descent process towards t h e minimum during learning.
I t is note t h a t the role of speeding u p the
From the above expressions, we have the following
convergence will despair as the a ( k ) is little than zero
relation
during iteration[4]. Therefor, to keep the a ( k ) is great
than zero all time during adaptation, w e should make
ti,(k)OI'(A>Rb. W&k- 1)
(I6)
the compromise between the bursting compensation and
'P
A $(k- 1)RA W& 1)
the speeding up convergence, t h a t is

Now, rewriting Eq.(15) with k replaced by k-1, yields


m=1

RaW~~-I)=V(W~k))-V(W~~-l))
(17)
Substituting ( 1 7 ) into (16) and making use of the
relation given by (13), we obtain

B Y the definition of the


following recursion

momentum, we have the

A W&k)=p[t3,(&-l)Ubk-l)+

+akk-1)6,(&-2)0&&-2)

+a,(k-l>a,(k-2)...a~i2)8,(1)0~1)

We see from (22) t h a t a ( k ) is a stochastic attenuation


variable greater than zero during iteration, and
satisfies the condition of speeding u p convergence.
The attenuation property of a ( k ) can be seen from t h e

Again, it is clear from (23) t h a t the algorithm


misadjustment has nothing to do with the momentum
factor a ( k ) . Thereby, we conclude t h a t t h e momentum
misadjustment noise in B P network can be canceled
when a ( k ) given by (22) is used.

(Y9)

IV. SIMULATION

+a,(k-l) ... aLl)AWY(O)]

For each element of t h e momentum AW..(k) and for


11
l a ( k ) l < 1 and B < l , t h e r e exits

A s comparison of two algorithms, we present a


three-layer B P network to perform XOR function. The
learning c u r v e s shown in Fig.3a and Fig.3b illustrate
that the improved algorithm marked by I1 has higher
convergence precision than the conventional BP
algorithm with a=0.6 and a=0.7 marked by I. The
corresponding simulation results a r e given in Table 1.

2594

Xi

1 TH I

I
Iteratio, Input

d(k)

BP algortthm ( a . 0 . 7 )

Improved algorithm

EE

KE

XI x2

Output y

4981

0. 0. 0.0

0.00008 0.630X10-s

0.14583

4992

0. 0. 0.0

0.00005 0.6309X10-e

0.11821

0.6942XlO-

4993

1.

0. 1.0

0.99995

0.6309X10-m

0.96283

0.7751

4994

1.

1.

0.00008

0.2110X10-

0.17034

0.7751K10-

4895

0. 0. 0.0

0.00003

0.2llOXlO-*

0.13101

0.3146X10-p

4998

I.

0.98994

0.1323 &lo-

0.96416

0.7751 KIO-a

4997

0.

I.

1.0

0.99995

0.1323XLO-*

0.84592

4998

1.

1.

0.0

0 . m

0.~ ~ X I O0 .-i 4~m

4989

0.

0.

0.0

0.00008 0.1747X10-

0.07643

0.7485X10-*

5000

1.

0.

1.0

0.99994

0.94592

0.6841X10-a

Output y

Xi

X:

NI

Na

Fig.1 The effect of the momentum adjustment


on network for conventional B P algorithm.

W,;

0.

0.0

1.0

0.2160X10-

0.6941~~10-1

lo-

0.3146XlO-*
0.2140~10-a

Na

NI

Fig.2 The improved algorithm to cancel the


momentum nrisad justment.

0.60
0.50
0.40 0.30 0.205:
0.10 0.00 -0. I0 -

L
REFERENCES

lBQ(3 2(3m 3(3m 4 m 5(3(30

Fig.3a The learning curves of two algorithms with


p=O.1 (with a=0.6 for conventional R P algorithm).

0.40

0.30
w

0.20

[ l l R. P. Lippman, An introduction to computing with


neural net,IEEE ASSP Mag., Vo1.4,pp.4-22,Apr.,1987.
[2] B.Widrow and M.A.Lehr,30 years of adaptive neural
networks:perceptron,madaline,mdbackpropagation,
Proc.IEEE, Vo1.8,No.9,Sept.199OIpp 1415-1441.
[3] D.E.Rumelhart, G.E.Hinton,and R.J.Williams,learning
representations b y e r r o r propagation,Parallel
distributed processing, Vol. l,pp.318-362, Cambridge,
MA: M.I.T.Press,l986.
[4] S.Roy and John J.Shynk,Analysis of the momentum
LMS algorithm, IEEE Trans. Acoust.,Speech,Signal
Processing,Vol. ASSP-38,pp. 2088-2098, Dec.1990.
[SI R.Widrow,J.M.Mcmool, M.G.Larimore, and C.R.Johnson,
Jr., Stationary and nonstationary learning
characteristics of LMS adaptive filter, Proc.IEEE,
101.64, pp.1157-1162, 1976.

0.10
0.00
-0.10

1000 2000 3000 4000 5000

Fig.3b The learning c u r v e s of two algorithms with


p=O.l (with a=0.7 for conventional BP algorithm).

2595

You might also like