Back Propagation Neural Networks: Substance Use & Misuse February 1998
Massimo Buscema
INTRODUCTION
Theoretical Principles
c. The system can “easily” learn other tasks which are similar to the ones it
has already learned, and can then make “generalizations”.
d. The relations that have stabilized among the system’s units during the
learning of the several tasks are the only memory of the system itself.
e. Many of the system’s units are of a discriminant or subconceptual type.
They can be of three logical types:
• Input units: the sensors through which the system receives the
stimulation of the surrounding environment;
• Output units: the units through which the system expresses its
behaviours as an interpretation of the received Inputs;
• Hidden units: they are the internal units of the network’s system. They
receive the Input either from the other Hidden units or from the Input
units. Their behaviour works as Output for other Hidden units or for the
Output units. They are the units that provide the internal representation,
through which the system interprets the Input which it receives, with
respect to the Output that it will produce.
f. The relations between the system’s units are uni-oriented; that is, unit A,
by activating itself in a certain way, activates unit B in proportion to the
amount by which A activated itself and to the strength of their
connection; but not vice versa.
This means that in BP, relations have the shape of fuzzy rules; for
example: the more unit \( u_i \) activates itself, the more it activates unit \( u_j \),
in proportion to the weight \( w_{ij} \) of their connection.

Figure 1. [Two units, \( u_i \) and \( u_j \), connected by the weight \( w_{ij} \).]
Figure 2. [Example network with weights \( w_{3,5} = 0.36 \), \( w_{4,5} = 0.54 \), \( w_{3,6} = 0.22 \), \( w_{4,6} = 0.20 \); Output units \( O_1 \), \( O_2 \) with Targets 0 and 1.]
(1) \( u_j = f(\mathrm{Net}_j) = \dfrac{1}{1 + e^{-\sum_i w_{ji} u_i}} \)
One must add to this equation the threshold of the unit, described as the
inclination of the unit to activate or inhibit itself.
This means that:
(1a) \( u_j = f(\mathrm{Net}_j) = \dfrac{1}{1 + e^{-\left( \sum_i w_{ji} u_i + \theta_j \right)}} \)
where \( \theta_j \) is the Bias of unit \( u_j \); this is the degree of sensitivity with which the
unit \( u_j \) answers to the perturbation it receives from the Net Input. The Bias is
the opposite of the threshold, and it behaves as a weight generated by a fixed
Input of unitary value.
Equation (1a) represents the forward algorithm of BPs, assuming as
default, the sigmoidal function.
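As an illustration (a minimal sketch, not from the paper; names are illustrative), the forward step of equation (1a) can be coded as:

```python
import math

def forward_unit(inputs, weights, bias):
    """Forward step of eq. (1a): u_j = 1 / (1 + e^-(sum_i w_ji * u_i + theta_j))."""
    net = sum(w * u for w, u in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))

# With zero Net Input the unit sits at the sigmoid's inflection point:
print(forward_unit([0.0, 0.0], [0.3, -0.2], 0.0))  # -> 0.5
```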
It is necessary to calculate the dynamic of the back propagation or
correction algorithm.
The mathematical basis of this algorithm had already been identified by
Rosenblatt in 1956, through the so-called Delta Rule, a procedure that
allowed the weights between the network units to be corrected upward or
downward, on the basis of the difference between the actual Output and the
desired Target.
Nevertheless, the Delta Rule allows correcting only those weights that
connect the Output units with those of the layer just below. In a 3-layer
network, it doesn’t allow one to know how, at each cycle, the weights that
connect the Input units with the Hidden units should be modified.
Let’s examine in detail the Delta Rule.
The coefficient of error in this procedure is calculated by considering the
difference between the actual Output and the desired one (the Target),
and multiplying this difference by the derivative of the activation state of
the actual Output with respect to the Net Input of that Output.
Therefore, since \( \partial u_j / \partial \mathrm{Net}_j = u_j (1 - u_j) \), the error coefficient \( \delta_{\mathrm{out}_j} \) will be:

(2) \( \delta_{\mathrm{out}_j} = (t_j - u_j) \cdot u_j (1 - u_j) \)

(3) \( \Delta w_{ji} \propto -\dfrac{\partial E_p}{\partial w_{ji}} \)
a) \( E_p = \frac{1}{2} \sum_k (t_{pk} - u_{pk})^2 = \frac{1}{2} \sum_k \big( t_{pk} - f_k(\mathrm{Net}_{pk}) \big)^2 = \frac{1}{2} \sum_k \Big( t_{pk} - f_k\big( \textstyle\sum_j w_{kj} u_{pj} + \theta_k \big) \Big)^2 \)
And then:
b) \( \dfrac{\partial \mathrm{Net}_{pk}}{\partial w_{kj}} = \dfrac{\partial}{\partial w_{kj}} \Big( \sum_j w_{kj} u_{pj} + \theta_k \Big) = u_{pj} \)

c) \( -\dfrac{\partial E_p}{\partial w_{kj}} = (t_{pk} - u_{pk}) \cdot f'_k(\mathrm{Net}_{pk}) \cdot u_{pj} \)
At this point, one can say that the quantity to be added to or subtracted
from weight \( w_{ji} \) is decided by the value \( \delta_{\mathrm{out}_j} \), with respect to the activation
state of unit \( u_i \) (the unit connected to \( u_j \) through weight \( w_{ji} \)), and in relation
to the coefficient r. This is the correction rate that one wants to adopt (when
r = 1, the value added to or subtracted from weight \( w_{ji} \) is exactly the one
calculated by the whole procedure).

(4) \( \Delta w_{ji} = r \cdot \delta_{\mathrm{out}_j} \cdot u_i \)

The value \( \Delta w_{ji} \) can be negative or positive. It represents the
“quantum” to be added to or subtracted from the previous value of weight \( w_{ji} \).
Then:

(5) \( w_{ji(n+1)} = w_{ji(n)} + \Delta w_{ji} \)
The Delta Rule, then, represented by equation (2), allows one to carry out
the weights’ correction only for very limited networks. For multilayer BP,
that is, with 1 or more layers of Hidden units, the Delta Rule, as it is, is
insufficient; for example:
Figure 3. [A 3-layer network: the Input units \( u_k \) connect through weights \( w_{ik} \) (correction unknown: “?”) to the Hidden units \( u_i \), which connect through weights \( w_{ji} \) (correctable by the DELTA RULE) to the Output units \( u_j \).]
(d) \( -\dfrac{\partial E_p}{\partial w_{kj}} = -\dfrac{\partial}{\partial w_{kj}} \left( \frac{1}{2} \sum_k (t_{pk} - u_{pk})^2 \right) \)
a. Forward Algorithm

1) \( \mathrm{Net}_j = \sum_i u_i w_{ji} + \theta_j \)
2) \( u_j = f(\mathrm{Net}_j) \)

b. Correction Algorithm

1) \( \delta_{\mathrm{out}_j} = (t_j - u_j) \cdot f'(u_j) \)
2) \( \Delta w_{ji} = r \cdot \delta_{\mathrm{out}_j} \cdot u_i \); \( \Delta w_{ik} = r \cdot \delta_{\mathrm{hidden}_i} \cdot u_k \)
3) \( w_{ji(n+1)} = w_{ji(n)} + \Delta w_{ji} \)
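The complete cycle can be sketched in code. This is a hedged illustration built from the paper’s equations ((1a), (2), (9), (4), (5)); the 2-2-1 architecture, the rate r = 0.5, and the single training pattern are arbitrary choices:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, W1, b1, W2, b2, rate=0.5):
    """One forward + correction cycle of the algorithm summarized above."""
    # Forward, eq. (1a): Input -> Hidden -> Output
    h = [sigmoid(sum(W1[i][k] * x[k] for k in range(len(x))) + b1[i])
         for i in range(len(W1))]
    o = [sigmoid(sum(W2[j][i] * h[i] for i in range(len(h))) + b2[j])
         for j in range(len(W2))]
    # Output deltas, eq. (2)
    d_out = [(t - u) * u * (1 - u) for t, u in zip(target, o)]
    # Hidden deltas, eq. (9)
    d_hid = [h[i] * (1 - h[i]) * sum(d_out[j] * W2[j][i] for j in range(len(o)))
             for i in range(len(h))]
    # Weight and Bias updates, eqs. (4)-(5)
    for j in range(len(o)):
        for i in range(len(h)):
            W2[j][i] += rate * d_out[j] * h[i]
        b2[j] += rate * d_out[j]
    for i in range(len(h)):
        for k in range(len(x)):
            W1[i][k] += rate * d_hid[i] * x[k]
        b1[i] += rate * d_hid[i]
    return sum((t - u) ** 2 for t, u in zip(target, o)) / 2  # pattern error

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
W2 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(1)]
b1, b2 = [0.0, 0.0], [0.0]
errors = [train_step([1.0, 0.0], [1.0], W1, b1, W2, b2) for _ in range(200)]
```

Repeating the cycle drives the pattern error down, which is the behaviour the equations above describe.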
[Figure: a unit with an afferent connection of weight \( w_{ik} = +2.0 \) and a threshold of −2.0.]
This is why, at every unit, the Net Input is calculated by adding the Bias
value to the sum of the products between the units and the weights afferent
to the reference unit:
(6) \( u_i = f(\mathrm{Net}_i) = f\Big( \sum_{j=1}^{N} u_j w_{ij} + \mathrm{Bias}_i \Big) \)
Given the Bias’s structural and functional nature, its dynamic for
each network unit can be considered similar to that of any weight. Therefore,
it is subject to the same learning algorithm to which the weights’ matrix is
subjected.
This means that every BP can be provided with dynamic-threshold
units. Each unit will dynamically learn its threshold, in relation to the type of
experiences that the whole ANN is carrying out.
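The equivalence described earlier (the Bias behaving as a weight on a fixed Input of unitary value) can be checked with a small sketch; the function names are illustrative:

```python
def net_input(inputs, weights, bias):
    # Bias added separately, as in eq. (6)
    return sum(u * w for u, w in zip(inputs, weights)) + bias

def net_input_bias_as_weight(inputs, weights, bias):
    # Append a fixed input of 1.0 whose weight is the Bias itself.
    return sum(u * w for u, w in zip(inputs + [1.0], weights + [bias]))

print(net_input([0.2, 0.8], [0.5, -0.3], 0.1)
      == net_input_bias_as_weight([0.2, 0.8], [0.5, -0.3], 0.1))  # -> True
```

Treating the Bias as one more weight is what allows it to be trained by the same learning algorithm as the weights’ matrix.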
This means that if the updating of the weights through the Back
Propagation is given by the following equations:
(9) \( \delta_{\mathrm{hidden}_j} = f'(u_j) \sum_{i=1}^{N} \delta_{\mathrm{out}_i} w_{ij} \)

(10) \( w_{jk(n+1)} = w_{jk(n)} + \delta_{\mathrm{hidden}_j} \cdot u_k \cdot \mathrm{Rate} \)
for Hidden units, then equations (11) and (12) will determine the Bias
updating of the Output layer and of any Hidden layer:
The Momentum
Experience has shown that the more the learning coefficients of
equations (7) and (8) are reduced, the lower the probability that the network
traps itself in local minima.
At the same time, the smaller the learning coefficient, the slower the
learning.
An attempt has been made to resolve this dilemma through the
introduction of a new parameter, the Momentum (McClelland, 1988:135).
The Momentum is a parameter through which the network reinforces the
change of each of its connections in the general direction of descent of the
paraboloid which has already emerged during its previous updating
process. The reason for this is the eventual cancellation of the
oscillations produced by the steepest-descent algorithm.
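Since equation (7) is not reproduced in this excerpt, the sketch below uses the standard Momentum formulation (a fraction of the previous change added to the current one); the coefficient 0.9 is an arbitrary example:

```python
def update_with_momentum(grad_step, prev_delta, momentum=0.9):
    """Current correction reinforced along the direction already taken."""
    return grad_step + momentum * prev_delta

# Oscillating raw steps largely cancel, while steady steps accumulate:
delta = 0.0
for raw in [0.1, -0.1, 0.1, -0.1]:
    delta = update_with_momentum(raw, delta)
print(abs(delta) < 0.1)  # -> True: the oscillation has been damped
```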
With the introduction of the Momentum, equation (7) becomes:
Until now, the transfer function implicitly used in the previous equations
has been the sigmoidal one.
Nevertheless, equation (6) wasn’t specific on this matter:
(6) \( u_i = f(\mathrm{Net}_i) = f\Big( \sum_{j=1}^{N} u_j w_{ij} + \mathrm{Bias}_i \Big) \)
If we mean that Net is the Net Input to a unit, and that f ( ) is the equation
of the Sigmoid, equation (6) will be rewritten in the following way:
(13) \( u_i = \dfrac{1}{1 + e^{-\mathrm{Net}_i}} \qquad 0 < u_i < 1 \)

e = 2.718281828459

Through this equation, the activation value of \( u_i \) varies between 0 and 1
according to a semilinear function which has its inflection point at 0.5.
In fact: if \( \mathrm{Net}_i = 0 \), then \( u_i = 0.5 \).
Therefore, its derivative will be: \( f'(\mathrm{Net}_i) = u_i (1 - u_i) \).
This function is the most widespread in Back Propagation ANNs, and some
authors have tried to make it more complete through the introduction of
other parameters; for example:
(13a) \( u_i = \dfrac{1}{1 + e^{-\mathrm{Net}_i / T}} \)

(13b) \( u_i = \dfrac{\mathrm{coef}}{\mathrm{coef} + e^{-\mathrm{Net}_i}} \)
where for coef > 1 the unit sensitivity to the Net Input value increases, while
for coef < 1 this sensitivity decreases.
This equation can be useful in order to avoid the situation, in some
problems, where the Hidden units are all saturated towards the value 0.0 or
towards the value 1.0, without being able to codify the differences among the
different Input models (for a closer examination, see Buscema, 1993:38).
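A sketch of the temperature-scaled sigmoid of equation (13a); the numeric values are illustrative:

```python
import math

def sigmoid_with_temperature(net, T=1.0):
    # eq. (13a): T < 1 sharpens the unit's response, T > 1 flattens it
    return 1.0 / (1.0 + math.exp(-net / T))

# For the same positive Net Input, a lower temperature gives a stronger response:
print(sigmoid_with_temperature(1.0, T=0.5)
      > sigmoid_with_temperature(1.0, T=2.0))  # -> True
```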
(14) \( u_i = \dfrac{e^{\mathrm{Net}_i} - e^{-\mathrm{Net}_i}}{e^{\mathrm{Net}_i} + e^{-\mathrm{Net}_i}} \qquad -1 < u_i < +1 \)
In this case, the values of \( u_i \) vary between −1 and +1; therefore, at the
moment of the weights’ correction, the derivative of this function,
\( (1.0 + u_i)(1.0 - u_i) \), has to replace the derivative of the sigmoidal
function, \( u_i (1 - u_i) \).
The equation of the Hyperbolic Tangent is less “soft” than the sigmoidal
function. This means that it can be less useful with problems which expect a
fuzzy Target, or with Input vectors having no decisive differences.
When the ANN can be prevented from falling into local minima, the
Hyperbolic Tangent is preferable to the Sigmoid for learning speed on
binary problems (where the Input and Target vectors are strings of 0s and 1s) and
when strong corrections are desired for values near the limits of −1 and +1.
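A sketch comparing the Hyperbolic Tangent of equation (14) with the standard library form, together with the two derivatives used at correction time:

```python
import math

def tanh_unit(net):
    # eq. (14): (e^net - e^-net) / (e^net + e^-net)
    return (math.exp(net) - math.exp(-net)) / (math.exp(net) + math.exp(-net))

def d_tanh(u):
    # derivative expressed through the activation: (1.0 + u)(1.0 - u)
    return (1.0 + u) * (1.0 - u)

def d_sigmoid(u):
    # sigmoid derivative expressed through the activation: u(1 - u)
    return u * (1.0 - u)

print(abs(tanh_unit(0.7) - math.tanh(0.7)) < 1e-12)  # -> True
```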
(15) \( u_i = \frac{1}{2} + \frac{1}{\pi}\, \mathrm{Arctg}(\mathrm{Net}_i) \)
METHODOLOGICAL DEVELOPMENTS
The SoftMax
This is a function utilized in the learning process of BP Networks for
classification problems.
In order to resolve a problem of this type, the desired Output (Target) is
usually represented through a 1-of-N code. Each class is
represented by an Output unit; then, in the case of N classes, the Target will
be constituted by a vector of N Output nodes,
where only the unit that corresponds to the desired class has a Target value
different from 0 (that is, 1).
There are 2 problems with this coding:
1. The 1-of-N code implies dependence among the values of the Output units:
when one of the units has a value equal to 1, the other units have value 0.
Nevertheless, there is no assumption of such dependence in the error of a
trained back propagation network.
2. If there are many classes, the Network can find a “reasonable” solution
by flattening the whole Output vector. This will give an RMS of \( 1/\sqrt{k} \)
(where k is equal to the class number).
This solution is very simple to find. For example, if we use the sigmoid
as the transfer function and all the weights of the Output layer are large and
negative, all the Output units will have Outputs equal to 0. If the class
number is not too big, setting a very low learning rate will make it difficult
for the network to converge on this artificial solution.
In order to obtain a satisfying solution to these 2 problems, it has been
recommended that the “SoftMax activation function” be used for the layer of
Output units (Bridle, 1989).
The SoftMax function is a refined version of the competitive 1-of-N
function, and it has some convenient mathematical properties:
\( y_k = \dfrac{e^{I_k}}{\sum_{l=1}^{k} e^{I_l}} \)

\( J = -\sum_{j=1}^{k} d_j \ln(y_j) \)
Let’s calculate the error value returned to the Output unit j in the
following way:

\( \dfrac{\partial J}{\partial I_j} = \sum_{k} \dfrac{\partial J}{\partial y_k} \dfrac{\partial y_k}{\partial I_j} \)

\( \dfrac{\partial y_k}{\partial I_j} = y_k (\delta_{kj} - y_j) \)

Therefore:

\( \dfrac{\partial J}{\partial I_j} = -\sum_{k} \dfrac{d_k}{y_k}\, y_k (\delta_{kj} - y_j) = -d_j + y_j \sum_{k} d_k = y_j - d_j \)

where: \( \delta_{kj} = \begin{cases} 1 & \text{if } k = j \\ 0 & \text{if } k \neq j \end{cases} \)

(using \( \sum_k d_k = 1 \) for a 1-of-N Target).
This means that the back propagation’s standard algorithm can be used
with the SoftMax function for the back propagation of the real error.
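A sketch of the SoftMax layer and of the simplified gradient \( \partial J / \partial I_j = y_j - d_j \) (the max-subtraction in the code is a standard numerical-stability trick, not part of the formula):

```python
import math

def softmax(I):
    # Subtracting max(I) keeps exp() from overflowing; the result is unchanged.
    m = max(I)
    e = [math.exp(x - m) for x in I]
    s = sum(e)
    return [v / s for v in e]

def grad_cross_entropy(y, d):
    # For J = -sum_j d_j ln(y_j), the gradient w.r.t. the Net Inputs is y - d.
    return [yj - dj for yj, dj in zip(y, d)]

y = softmax([1.0, 2.0, 3.0])
g = grad_cross_entropy(y, [0.0, 0.0, 1.0])
print(abs(sum(y) - 1.0) < 1e-12)  # -> True: SoftMax outputs sum to 1
```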
This means that, to the value of the upstream unit \( u_i \) of the connection, the
error that it has registered in the previous cycles is added in proportion k (k > 0;
for k = 0, it is the normal back propagation of equation (8)).
NeuralWare (1993) has implemented this technique, calling it “Fast
Propagation” and claiming advantages in terms of rapidity (for
comparisons, see M. Buscema, 1994, Squashing Theory, Armando, Rome, p.
102; while for a closer examination of this technique, see Samad 1988 and
1989).
Semeion’s SelfMomentum
(17) \( \mathrm{SelfMomentum}_{ij} = \Delta w_{ij(n-1)} \cdot \delta_{\mathrm{out}_{i(n)}} \cdot \dfrac{1}{0.5 + |w_{ij}|} \)

with \( |\delta_{\mathrm{out}_{i(n)}}| \le 1 \), the \( \Delta w_{ij(n-1)} \) varying in the interval [−0.5, 0.5], and the weights
\( w_{ij} \) in the interval [−1.0, 1.0].
The ANM allows every neuron, during the learning process, to develop
its own individuality, just as every connection (weight) does.
Each of the ANM’s units expresses this individuality through the parameter of its
specific temperature. The individuation of a distinct temperature for each
of the ANN’s units implies that each unit’s incremental change of temperature is:

(19) \( \Delta T_i^{[s]} = -\varepsilon\, \dfrac{\partial E}{\partial T_i^{[s]}} \)
where: \( \Delta T_i^{[s]} \) = increase or decrease of the Temperature of the i-th unit of layer [s];
\( \varepsilon \) = learning coefficient of the units (the same concept as the
weights’ Rate); \( T_i^{[s]} \) = actual Temperature of the i-th unit of layer [s];
E = Global Error (see equation (25));
and finally:
(21) \( \dfrac{\partial f}{\partial T_i^{[s]}} = -\dfrac{\mathrm{Net}_i^{[s]} + \theta_i^{[s]}}{\big( T_i^{[s]} \big)^{2}} \cdot \dfrac{e^{-\frac{\mathrm{Net}_i^{[s]} + \theta_i^{[s]}}{T_i^{[s]}}}}{\Big( 1 + e^{-\frac{\mathrm{Net}_i^{[s]} + \theta_i^{[s]}}{T_i^{[s]}}} \Big)^{2}} \)
The increment \( \Delta T_i^{[s]} \), obtained for a certain unit, is added to the
Temperature of that unit, and the result is then used as its new temperature in
the transfer equation that changes the Net Input of that unit into its
activation value:
(23) \( u_i^{[s]} = \dfrac{1}{1 + e^{-\frac{\mathrm{Net}_i^{[s]} + \theta_i^{[s]}}{T_i^{[s]}}}} \)
In these simulations, Tawel set the learning coefficient for the
temperature to 0.1, and the one for the weights to 0.1; he allowed a total temperature
variation between +0.01 and +2.00; he randomly initialized the units’
temperatures in a range between +0.9 and +1.1, while the
weights were randomized in a range between −2.00 and +2.00.
Tawel is right, from both the epistemological and the biological point of
view: the units of an ANN can’t behave as passive elements. Perhaps Tawel
is also right when he states that the ANN’s units must participate in the learning
process in a way analogous to that in which the synapses (the weights’ matrix)
participate in it.
Probably, Tawel is also right when he hypothesizes the units’ learning
algorithm by instituting a proportion between the change of a parameter,
Ti[ s] and the negative of the Global Error’s derivative of ANN in respect to
the parameter Ti[ s] of that unit (equation (19)).
We doubt that the parameter Ti[ s] influence corresponds to the unit
temperature and then the parameter Ti[ s] updated through Ti[ s] must not, for
each unit, affect the sigmoid form. And then, that equation (23) is not
adequate to the problem. This inadequacy has different reasons:
a. The experimental results have shown that the ANM model obtains
contradictory results according to the type of problem; and in the cases in
which it obtains better results, these are clearly inferior to those obtained
through another conception of the temperature concept (see the section on
“Freezing”).
b. It is difficult to believe that the learning of each unit is reduced to
determining the form of its sigmoid.
Semeion’s Freezing
(24) \( T_p = f(\mathrm{Error}_p) = f\Big( \frac{1}{2} \sum_{i=1}^{N} (t_i - u_i)^2 \Big) \)

(25) \( T_p = f(\mathrm{Error}_p) = 1 - \dfrac{1}{1 + \mathrm{Error}_p} \)  where p = Pattern

from which:

(26) \( u_{pi} = \dfrac{1}{1 + e^{-\frac{\mathrm{Net}_{pi} + \theta_{pi}}{T_p}}} \)
The Freezing presupposes that all units have the same temperature in
respect to a Model. This temperature is determined by the Error that that
Model has produced.
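A sketch of the Freezing equations (25) and (26); note that a zero Error would give \( T_p = 0 \), so the sketch guards against that degenerate case:

```python
import math

def pattern_temperature(error_p):
    # eq. (25): T_p = 1 - 1/(1 + Error_p); larger pattern error -> higher temperature
    return 1.0 - 1.0 / (1.0 + error_p)

def frozen_activation(net, theta, T_p):
    # eq. (26): one shared pattern temperature T_p rescales every unit's sigmoid
    assert T_p > 0, "T_p = 0 (zero error) would make the exponent undefined"
    return 1.0 / (1.0 + math.exp(-(net + theta) / T_p))

print(pattern_temperature(1.0))  # -> 0.5
```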
Furthermore, the Freezing presupposes that the different Models [...]

The Quick Propagation

Quick Propagation was developed at Carnegie Mellon University, with the aim of
improving the convergence rate in BP networks (Fahlman, 1988).
In order to find an optimal solution in the shortest time, it is not possible
to proceed in the weights’ space following the gradient with very short,
infinitesimal steps. What is realistically required is to proceed towards the
minimum with the largest possible steps, while avoiding the phenomenon of
bouncing around the minimum without ever reaching it.
Furthermore, under the hypotheses that the changes carried out on one
coefficient do not have much influence on the gradient \( \partial E / \partial w \) pertinent to another
coefficient, and that E is approximately a quadratic function of each
coefficient’s value (in the case of linear units it would be exactly quadratic),
function), it can be considered that the knowledge of the three memorized
values defines, on a Cartesian plane of coordinates w, E, a parabola that
graphically represents the dependence law of E from the considered
coefficient w. The learning law of the Quick Propagation consists, then, in
choosing a variation of this connection coefficient so that it assumes the
value corresponding to the vertex abscissa of the parabola, where E assumes
the minimum value possible in that moment.
Quick Propagation resolves this by combining two traditional
approaches:
a. dynamic regulation of the learning rate (globally or separately for each
weight), based on the historical succession of the learning rate up to the
current value;
b. calculation of the second derivative of the error with respect to each weight.
The Quick Propagation is more a heuristic method than a formal
second-order one, and it is loosely based on Newton’s method.
The parabola is determined by calculating, for each weight, the slopes of
the previous and the current error, and the change of the weight’s value
between the points at which these slopes are measured.
The correction algorithm then allows one to reach the minimum of this
parabola directly.
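The parabola-vertex jump can be sketched as follows, using the classic Quickprop step (Fahlman, 1988): the new change is \( \Delta w_{(n)} = \frac{S_{(n)}}{S_{(n-1)} - S_{(n)}} \cdot \Delta w_{(n-1)} \), where S is the measured error slope \( \partial E / \partial w \):

```python
def quickprop_step(slope_prev, slope_curr, delta_prev):
    """Jump to the vertex of the parabola fitted through two measured slopes."""
    denom = slope_prev - slope_curr
    if denom == 0.0:          # degenerate parabola: no usable curvature
        return 0.0
    return delta_prev * slope_curr / denom

# For E = w^2 (slope 2w), after moving w from 3 to 2 (delta_prev = -1),
# one step lands exactly on the minimum at w = 0:
print(quickprop_step(6.0, 4.0, -1.0))  # -> -2.0
```

On an exactly quadratic error surface the vertex jump finds the minimum in a single step, which is why the method works well on “clean” problems.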
Generally, the Quick Propagation performs very well in respect
to the other learning techniques. It often competes with the Extended-
Delta-Bar-Delta method and the Conjugate Gradients method, particularly
on problems with “clean data”, that is, data which are not noisy.
Through the Quick Propagation algorithm, all the weights are
updated independently; in particular, for each weight \( w_{ij(n)} \), the following are memorized:
c. the value of \( \Delta w_{ij(n)} \) calculated at the end of the previous cycle, \( \Delta w_{ij(n-1)} \).
At the end of the present cycle, the coefficient variation \( \Delta w_{ij(n)} \) is
computed from the slope \( \partial E_{(n)} / \partial w_{ij(n)} \).
determine the contribution to the error due to the particular weight \( w_{ij} \) of the
network. Furthermore, whereas in the standard Delta Rule the Rate is constant for
the weights’ updating, in the DBD a variable Rate \( \rho_{ij(n)} \) is assigned to each
of the network’s weights. Then, the updating of every weight’s Rate will be given by the
following equations:
\( \rho_{ij(n+1)} = \rho_{ij(n)} + \Delta\rho_{ij(n)} \)

if \( \bar{\delta}_{(n-1)} \cdot \delta_{(n)} > 0 \) then \( \Delta\rho_{ij(n)} = \kappa \)
if \( \bar{\delta}_{(n-1)} \cdot \delta_{(n)} < 0 \) then \( \Delta\rho_{ij(n)} = -\varphi\, \rho_{ij(n)} \)
otherwise \( \Delta\rho_{ij(n)} = 0 \)

It has to be noted that DBD increases the learning rates linearly, but
decreases them geometrically. In order to obtain a good convergence, \( \kappa \) must
be less than \( \varphi \). It is advisable to set \( \kappa = 0.001 \) and \( \varphi = 0.005 \).
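The DBD rate update can be sketched as follows, with the advised κ = 0.001 and φ = 0.005 (the symbol names are an assumption, since they are garbled in the source):

```python
def dbd_rate_update(rate, bar_delta_prev, delta_curr, k=0.001, phi=0.005):
    """Delta-Bar-Delta: linear increase when successive gradients agree,
    geometric decrease when they disagree."""
    s = bar_delta_prev * delta_curr
    if s > 0:
        return rate + k           # linear increase
    if s < 0:
        return rate * (1 - phi)   # geometric decrease
    return rate
```

Repeated disagreement thus shrinks the rate multiplicatively, while agreement only nudges it up by a constant.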
In 1990, Minai and Williams (Minai, 1990) designed the Extended-
Delta-Bar-Delta (EDBD), starting from some considerations about Jacobs’s
DBD:
a. the standard DBD does not utilize the Momentum;
b. small, constant, linear increases (\( \kappa \)) can modify the learning rate,
determining limits in the weights’ space which make the network’s
convergence difficult;
c. the geometric decrease sometimes isn’t fast enough to
avoid limits in the weights’ space.
and then, for the learning rate \( \rho \) and the Momentum \( \mu \) of each weight:

if \( \bar{\delta}_{(n-1)} \cdot \delta_{(n)} > 0 \) then \( \Delta\rho_{ij(n)} = \kappa_{\rho}\, e^{-\gamma_{\rho} |\bar{\delta}_{(n)}|} \)
if \( \bar{\delta}_{(n-1)} \cdot \delta_{(n)} < 0 \) then \( \Delta\rho_{ij(n)} = -\varphi_{\rho}\, \rho_{ij(n)} \)
otherwise \( \Delta\rho_{ij(n)} = 0 \)

if \( \bar{\delta}_{(n-1)} \cdot \delta_{(n)} > 0 \) then \( \Delta\mu_{ij(n)} = \kappa_{\mu}\, e^{-\gamma_{\mu} |\bar{\delta}_{(n)}|} \)
if \( \bar{\delta}_{(n-1)} \cdot \delta_{(n)} < 0 \) then \( \Delta\mu_{ij(n)} = -\varphi_{\mu}\, \mu_{ij(n)} \)
otherwise \( \Delta\mu_{ij(n)} = 0 \)

\( \rho_{(n)} \le \rho_{\max} \qquad \mu_{(n)} \le \mu_{\max} \)

where \( \rho_{\max} \) and \( \mu_{\max} \) are the upper limits of the two coefficients.
(1) \( x_i \rightarrow x_i \) (identity)

[Figure: plots of the transfer functions, each on the interval [0, 1].]
One of the aims of every learning process consists in allowing the layer
of Hidden units to codify the Input vector in a viable way.
In order to orient this codification, it was decided to reinforce with a
larger correction rate those connections between Input and Hidden units in
which the value of the latter is more similar to the Input value from
which the connection comes (M. Buscema, 1995, October: experiments at
Semeion).

* Equation (4) has been proposed by Semeion.
This means that the Hidden units modulate their fan-in connections in
relation to their archetypicity with respect to the nodes of the Input vector and in
relation to the derivative of the error in Output.
We have defined this learning law as Jung’s Rule (JR), only by virtue of
the forcing that the Hidden units undergo in order to codify an “archetype” of
the Input vector.
From the algebraic point of view, the JR appears as:
(28) \( \mathrm{JR}_{ij} = \exp\left( -\dfrac{(u_i - u_j)^2}{\mathrm{NumInput}} \right) \)
The JR has offered good results, especially for the prediction of temporal
series. The JR has shown itself able to correctly foresee between 40% and
50% of the sudden tendency changes of a value. Of course, the use of JR
implies a number of Hidden units higher than usual, because the
learning is subjected to a higher number of constraints.
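Equation (28), as reconstructed here (the minus sign in the exponent is inferred from the rule’s intent: the more similar the values, the larger the correction factor):

```python
import math

def jung_rule_factor(u_hidden, u_input, num_input):
    # eq. (28): the factor peaks at 1.0 when hidden and input values coincide
    return math.exp(-((u_hidden - u_input) ** 2) / num_input)

print(jung_rule_factor(0.5, 0.5, 4))  # -> 1.0
```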
Freud’s Rule (FR) expects the fan-out connections of each Input node to
be corrected the more, the farther that node is from the Input node which
represents the actual time of the record (t0).
The equation that implements this criterion is very simple:
(30) \( F_i = 1 - \dfrac{P_i - 1}{\mathrm{NumInput}} \)

where: \( P_i \) = Position of the Input Node (read from the left side) from which
the connection comes; NumInput = Total Number of the Input Nodes.
For a Hidden unit j, the weights coming from the 4 Input nodes would be
corrected with the following factors:

t−3 (i = 1): Rate · (1 − (1 − 1)/4) = Rate · 1.00
t−2 (i = 2): Rate · (1 − (2 − 1)/4) = Rate · 0.75
t−1 (i = 3): Rate · (1 − (3 − 1)/4) = Rate · 0.50
t0  (i = 4): Rate · (1 − (4 − 1)/4) = Rate · 0.25
This means that all the Input values of the temporal series end up affecting
the learning process, but their influence grows linearly as their
recency decreases.
Through the FR, the learning process is much more laborious than the one
occurring with the traditional Back Propagation. Furthermore, the
Input vector of the ANN becomes a layer of nodes whose topographic position
is meaningful for the learning itself. This is evidenced at the
end of the learning by carrying out the sensitivity analysis on the resulting
weights matrix. This analysis, in fact, shows a specific distribution of the Input nodes
on the Output which is not linearly connected to the position of the Input node
in the vector; the most “recent” Input nodes aren’t generally those most
important for the Output.
Furthermore, on the predictive level, the FR has allowed a meaningful
reduction of the prediction errors on tendency changes (critical points). In our
experiments on 100 critical points, the FR correctly predicted between 45%
and 65% of them, whereas the traditional Back Propagation, on the same
temporal series, correctly predicted between 15% and 25%.
Nevertheless, the FR makes somewhat more errors on the predictions that
maintain the same tendency: sometimes it invents tendency changes that
don’t exist. Having provided the ANN with a “proto-unconscious” device, it was
to be expected that it would “rave” or go “crazy”. But globally, over the
prediction of all types of tendency, the FR has proved more
efficacious than the other Back Propagations.
REFERENCES
AA.VV., 1991: AA.VV., Advances in Neural Information Processing Systems, vol. 3,
Morgan Kaufmann, San Mateo, CA, 1991.
ANDERSON, 1988: J. A. Anderson and E. Rosenfeld (eds.), Neurocomputing:
Foundations of Research, The MIT Press, Cambridge, Massachusetts, London,
England, 1988.
BRIDLE, 1989: J. S. Bridle, Probabilistic Interpretation of Feedforward
Classification Network Outputs, with Relationships to Statistical Pattern
Recognition, in F. Fogelman-Soulié and J. Hérault (eds.), Neurocomputing:
Algorithms, Architectures, Springer-Verlag, New York, 1989.
BUSCEMA, 1993: M. Buscema and G. Massini, Il Modello MQ, Collana Semeion,
Armando, Rome, 1993. [The MQ Model: Neural Networks and Interpersonal
Perception, Semeion Collection by Armando Publisher].