The human brain is known to operate under a radically different computational paradigm. While conventional computers use a fast and complex central
processor with explicit program instructions and locally addressable memory,
the human brain uses a parallel collection of slow and simple processing elements (neurons), densely connected by weights (synapses) whose strengths
are modified with experience. This paradigm directly supports the integration of multiple constraints and provides a distributed form of memory that exhibits associative behaviour.
2.1 Artificial Neural Network (ANN)
Artificial Neural Networks (ANNs) are non-parametric prediction tools that can be used for a host of pattern classification and speech recognition purposes. An ANN is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process [Haykin 2003]. An ANN is characterized by
1. its pattern of connections between the neurons, called its architecture,
2. its method of determining the weights on the connections, called its training or learning algorithm, and
3. its activation function.
2.1.1 Model of a Neuron
A neuron k may be described by the following pair of equations:

u_k = \sum_{j=1}^{m} w_{kj}\, x_j    (2.1)

and

y_k = \varphi(u_k + b_k)    (2.2)

where x_1, x_2, \ldots, x_m are the input signals, w_{k1}, w_{k2}, \ldots, w_{km} are the synaptic weights of neuron k, b_k is the bias, \varphi(\cdot) is the activation function and y_k is the output signal of the neuron. The use of the bias b_k has the effect of applying an affine transformation to the output u_k of the linear combiner in the model of Figure 2.2, as shown by

v_k = u_k + b_k    (2.3)

Equivalently, the combination of Eqs. 2.1-2.3 may be written as

v_k = \sum_{j=0}^{m} w_{kj}\, x_j    (2.4)

and

y_k = \varphi(v_k)    (2.5)

where a new synapse has been added with input

x_0 = +1    (2.6)

and weight

w_{k0} = b_k    (2.7)
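As a concrete illustration of Eqs. 2.1-2.7, the following minimal Python sketch evaluates a single neuron; the logistic activation and the numerical values are illustrative assumptions, not choices made in this work.

import numpy as np

def neuron_output(x, w, b, phi=lambda v: 1.0 / (1.0 + np.exp(-v))):
    """Output y_k of a single neuron (Eqs. 2.1-2.5); phi is assumed logistic here."""
    u = np.dot(w, x)      # linear combiner output u_k (Eq. 2.1)
    v = u + b             # affine-shifted local field v_k = u_k + b_k (Eq. 2.3)
    return phi(v)         # y_k = phi(v_k) (Eq. 2.5)

# Folding the bias into a fixed input x_0 = +1 with weight w_k0 = b_k
# (Eqs. 2.6-2.7) gives the same output.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
print(neuron_output(x, w, b))
print(neuron_output(np.r_[1.0, x], np.r_[b, w], 0.0))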
2.1.2 Learning Algorithms
2.1.3 Network Architecture
O_x = g([w] \cdot [x] + b)    (2.8)

where [x] is the input vector, [w] is the associated weight vector, b is a bias value and g(\cdot) is the activation function. Such a setup, namely the perceptron, will be able to classify only linearly separable data. An MLP, in contrast, consists of several layers of neurons. The output of an MLP with one hidden layer is given as:
O_x = \sum_{i=1}^{N} \alpha_i\, g([w]_i \cdot [x] + b_i)    (2.9)

where \alpha_i is the weight value between the i-th hidden neuron and the output.
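A minimal sketch of Eq. 2.9 for a one-hidden-layer MLP follows; the tanh hidden activation, the random weights and the input dimension are assumptions made only for illustration.

import numpy as np

def mlp_output(x, W_hidden, b_hidden, alpha, g=np.tanh):
    """Output O_x of an MLP with one hidden layer of N neurons (Eq. 2.9)."""
    hidden = g(W_hidden @ x + b_hidden)   # g([w]_i . [x] + b_i) for each hidden neuron i
    return float(np.dot(alpha, hidden))   # sum_i alpha_i * g(...)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # input vector [x]
W_hidden = rng.normal(size=(8, 4))        # hidden weight vectors [w]_i
b_hidden = rng.normal(size=8)             # hidden biases b_i
alpha = rng.normal(size=8)                # hidden-to-output weights alpha_i
print(mlp_output(x, W_hidden, b_hidden, alpha))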
2.2
y_{kj} = F_k(x_j)    (2.11)
2.3
The ANN in supervised mode is trained using (error) Back Propagation (BP), depending upon which the connecting weights between the layers are updated. This adaptive updating of the ANN is continued till the performance goal is met. Training the ANN is done in two broad passes: a forward pass and a backward calculation, with error determination and connecting-weight updating in between. The batch training method is adopted as it accelerates the speed of training and the rate of convergence of the MSE to the desired value [Haykin 2003]. The steps are as below:
m_j = f\Big(\sum_{i=1}^{L} w_{ji}^{h}\, p_i + \theta_j^{h}\Big)    (2.12)
MSE = \frac{1}{2M} \sum_{j=1}^{M} \sum_{n=1}^{L} e_{jn}^{2}    (2.16)
\delta_k^{o} = e_k\, f'(net_k^{o})    (2.17)

\delta_j^{h} = f'(net_j^{h}) \sum_{k} \delta_k^{o}\, w_{kj}^{o}    (2.18)

w_{kj}^{o}(t + 1) = w_{kj}^{o}(t) + \eta\, \delta_k^{o}\, m_j    (2.19)
where \eta is the learning rate (0 < \eta < 1). For faster convergence a momentum term (\alpha) may be added as:
w_{kj}^{o}(t + 1) = w_{kj}^{o}(t) + \eta\, \delta_k^{o}\, m_j + \alpha\,[w_{kj}^{o}(t) - w_{kj}^{o}(t - 1)]    (2.20)
One cycle through the complete training set forms one epoch. The above steps are repeated till the MSE meets the performance criterion, and the number of epochs elapsed is counted while doing so.
A few methods used for MLP training include:
Gradient Descent BP (GDBP),
Gradient Descent with Momentum BP (GDMBP),
Gradient Descent with Adaptive Learning Rate BP (GDALRBP), and
Gradient Descent with Adaptive Learning Rate and Momentum BP (GDALMBP).
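As a concrete illustration of the GDMBP variant listed above, the following minimal sketch trains a one-hidden-layer MLP in batch mode with gradient descent plus momentum, following Eqs. 2.16 and 2.20; the sigmoid activations, the XOR data and the hyper-parameter values are assumptions of the sketch, not settings used in this work.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_gdm_batch(P, T, hidden=8, eta=0.5, alpha=0.9, goal=1e-3, max_epochs=10000):
    """Batch gradient-descent-with-momentum BP for a one-hidden-layer MLP."""
    rng = np.random.default_rng(0)
    Wh = rng.uniform(-0.5, 0.5, (P.shape[1], hidden))   # input-to-hidden weights
    Wo = rng.uniform(-0.5, 0.5, (hidden, T.shape[1]))   # hidden-to-output weights
    bh, bo = np.zeros(hidden), np.zeros(T.shape[1])
    dWh_prev, dWo_prev = np.zeros_like(Wh), np.zeros_like(Wo)

    for _ in range(max_epochs):
        m = sigmoid(P @ Wh + bh)                   # forward pass: hidden outputs
        y = sigmoid(m @ Wo + bo)                   # network outputs
        e = T - y                                  # errors e_jn
        mse = np.sum(e ** 2) / (2 * P.shape[0])    # Eq. 2.16
        if mse < goal:
            break
        delta_o = e * y * (1 - y)                  # output-layer local gradients
        delta_h = (delta_o @ Wo.T) * m * (1 - m)   # hidden-layer local gradients
        dWo = eta * (m.T @ delta_o) + alpha * dWo_prev   # cf. Eq. 2.20
        dWh = eta * (P.T @ delta_h) + alpha * dWh_prev
        Wo += dWo
        Wh += dWh
        bo += eta * delta_o.sum(axis=0)
        bh += eta * delta_h.sum(axis=0)
        dWo_prev, dWh_prev = dWo, dWh
    return Wh, Wo, mse

# Example: the XOR problem.
P = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
print("final MSE:", train_gdm_batch(P, T)[2])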
Here, while the MSE approaches the convergence goal, training of the MLPs suffers if Corr(p_{m,i}(j), p_{m,i}(j+1)) is high and Corr(p_{m,i}(j), p_{m,i+1}(j)) is high. This is due to the requirements of the (error) back propagation algorithm, as suggested by equations 2.17 and 2.18. Therefore, the formulation of any feature set should always keep this fact within sight.
2.4 Recurrent Neural Network (RNN)
Recurrent Neural Networks (RNNs) are ANNs with one or more feedback loops. The feedback can be of a local or global kind. The RNN may be considered to be an MLP having local or global feedback in a variety of forms. It may have feedback from the output neurons of the MLP to the input layer. Yet another possible form of global feedback is from the hidden neurons of the ANN to the input layer [Haykin 2003].
The architectural layout of recurrent networks takes many different forms. Four specific network architectures are commonly used, each of which has a specific form of global feedback.
1. Input-Output Recurrent Model: The architecture of a generic recurrent network that follows naturally from a multilayer perceptron is shown in Figure 2.5. The model has a single input that is applied to a tapped-delay-line memory of q units. It has a single output that is fed back to the input via another tapped-delay-line memory, also of q units. The contents of these two tapped-delay-line memories are used to feed the input layer of the multilayer perceptron. The present value of the model input is denoted by u(n), and the corresponding value of the model output is denoted by y(n + 1); that is, the output is ahead of the input by one time unit [Haykin 2003]. Thus, the signal vector applied to the input layer of the multilayer perceptron consists of a data window made up as follows:
Present and past values of the input, namely u(n), u(n-1), \ldots, u(n-q+1), which represent exogenous inputs originating from outside the network.
Delayed values of the output, namely y(n), y(n-1), \ldots, y(n-q+1), on which the model output y(n+1) is regressed.
The dynamic behavior of the NARX model is described by

y(n + 1) = F\big(y(n), \ldots, y(n - q + 1), u(n), \ldots, u(n - q + 1)\big)    (2.23)
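The following minimal sketch shows how the regression vector of Eq. 2.23 is assembled from the two tapped-delay lines and passed to a mapping F; the stand-in MLP, its random weights and the choice q = 3 are illustrative assumptions only.

import numpy as np

def narx_predict(u_hist, y_hist, q, F):
    """One-step NARX prediction y(n+1) = F(y(n),...,y(n-q+1), u(n),...,u(n-q+1))."""
    # Data window fed to the MLP input layer: q delayed outputs and q delayed inputs.
    window = np.concatenate([y_hist[-q:][::-1], u_hist[-q:][::-1]])
    return F(window)

# Illustrative stand-in for the trained MLP realizing F (random weights, tanh hidden layer).
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(5, 6)), rng.normal(size=5)
F = lambda z: float(W2 @ np.tanh(W1 @ z))

u_hist = np.sin(0.3 * np.arange(10))   # exogenous input sequence u(n)
y_hist = np.zeros(10)                  # previously fed-back model outputs
print(narx_predict(u_hist, y_hist, q=3, F=F))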
2.4.1 Learning Algorithms
The basic algorithms for RNN training can be summarized as below:
(a) Back-Propagation Through Time: The back-propagation-through-time (BPTT) algorithm for training of a recurrent network is an extension of the standard back-propagation algorithm.
x(n), \qquad w_j, \qquad j = 1, 2, \ldots, q    (2.24)

U_j(n) = x^{T}(n)\, w_j, \qquad j = 1, 2, \ldots, q    (2.25)
d(n) = h_n(x(n))    (2.29)

\hat{x}(n + 1) = \hat{x}(n) + K(n)\,\alpha(n)    (2.32)
Here F(n) and H(n) are the Jacobians, in the notation used by Singhal and Wu (1989), of the components of f and h_n with respect to the state variables, evaluated at the previous state estimate, and

\alpha(n) = d(n) - h_n(\hat{x}(n))    (2.34)
Inserting process noise q(n) into the EKF has been claimed to improve the algorithm's numerical stability and to avoid getting stuck in poor local minima (Puskorius and Feldkamp 1994) [Haykin 2003] [Haykin 2002] [Jaeger 2005]. The EKF is a second-order gradient descent algorithm, in that it uses curvature information of the (squared) error surface. As a consequence of exploiting curvature, for linear noise-free systems the Kalman filter can converge in a single step, which makes RNN training faster [Haykin 2003] [Haykin 2002] [Jaeger 2005].
The EKF requires the derivatives H(n) of the network outputs with respect to the weights, evaluated at the current weight estimate. These derivatives can be computed exactly as in the RTRL algorithm, at cost O(N^4).
Decoupled Extended Kalman Filter (DEKF) and RNN Learning: Over and above the calculation of H(n), the update of P(n) is the computationally most expensive step in the EKF, requiring O(LN^2) operations. By configuring the network with decoupled subnetworks, a block-diagonal P(n) can be obtained, which results in a considerable reduction in computations, as shown by Feldkamp et al. (1998) [Jaeger 2005]. This gives rise to the Decoupled Extended Kalman Filter (DEKF) algorithm, which can be summarized as below [Haykin 2003]:
Initialization:
i. Set the synaptic weights of the RNN to small values selected from a uniform distribution.
ii. Form the diagonal elements of the covariance matrix Q(n) (process noise between 10^{-6} and 10^{-2}).
iii. Set K(1, 0) = \delta^{-1} I, where \delta is a small positive constant.
Computation: For n = 1, 2, \ldots compute

\Gamma(n) = \Big[\sum_{i=1}^{g} C_i(n)\, K_i(n, n-1)\, C_i^{T}(n) + R(n)\Big]^{-1}    (2.43)

G_i(n) = K_i(n, n-1)\, C_i^{T}(n)\, \Gamma(n)    (2.44)
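To make the Kalman-filter view of weight estimation concrete, the sketch below performs one global (non-decoupled) EKF update of a weight vector; the finite-difference Jacobian and all matrix shapes are assumptions of the sketch, whereas the text computes the derivatives exactly via RTRL.

import numpy as np

def ekf_weight_update(w, P, x, d, net_out, Q, R):
    """One global EKF update of the network weight vector w.

    w       : current weight estimate, shape (N,)
    P       : weight error covariance, shape (N, N)
    x, d    : current input pattern and desired output, d of shape (L,)
    net_out : function net_out(w, x) -> network output of shape (L,)
    Q, R    : process and measurement noise covariances
    """
    # Jacobian H of the outputs w.r.t. the weights, here by finite differences
    # (an assumption; RTRL would provide these derivatives exactly).
    eps = 1e-6
    y = net_out(w, x)
    H = np.zeros((y.size, w.size))
    for i in range(w.size):
        dw = np.zeros_like(w)
        dw[i] = eps
        H[:, i] = (net_out(w + dw, x) - y) / eps

    alpha = d - y                          # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    w_new = w + K @ alpha                  # weight update
    P_new = P - K @ H @ P + Q              # covariance update with process noise Q
    return w_new, P_new

In the decoupled variant the same computation is carried out per weight group, so that P(n) stays block diagonal and the update cost is reduced accordingly.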
The overall input to the network consists of the delayed external inputs s(k), the delayed, fed-back outputs y(k) and the bias input (1 + j), and is given by

I(k) = [s(k-1), \ldots, s(k-p), 1 + j, y_1(k-1), \ldots, y_N(k-1)]^{T}    (2.45)

I_n(k) = I_n^{r}(k) + j I_n^{i}(k), \qquad n = 1, \ldots, p + N + 1    (2.46)

net_l(k) = \sum_{n=1}^{p+N+1} w_{l,n}(k)\, I_n(k)    (2.48)
e_l(k) = d_l(k) - y_l(k)    (2.52)
where d(k) is the teaching signal. For real-time applications the cost function of the recurrent network is given by (Widrow et al., 1975) [Goh 2004],
E(k) = \frac{1}{2}\sum_{l=1}^{N} |e_l(k)|^{2} = \frac{1}{2}\sum_{l=1}^{N} e_l(k)\, e_l^{*}(k) = \frac{1}{2}\sum_{l=1}^{N} \big[(e_l^{r})^{2} + (e_l^{i})^{2}\big]    (2.53)
where (\cdot)^{*} denotes the complex conjugate. Notice that E(k) is a real-valued function, and we are required to derive the gradient of E(k) with respect to both the real and imaginary parts of the complex weights as
\nabla_{w_{s,t}} E(k) = \frac{\partial E(k)}{\partial w_{s,t}^{r}} + j\,\frac{\partial E(k)}{\partial w_{s,t}^{i}}, \qquad 1 \le l, s \le N, \quad 1 \le t \le p + N + 1    (2.54)

\frac{\partial E(k)}{\partial w_{s,t}^{i}(k)} = \frac{\partial E}{\partial u_l}\,\frac{\partial u_l(k)}{\partial w_{s,t}^{i}(k)} + \frac{\partial E}{\partial v_l}\,\frac{\partial v_l(k)}{\partial w_{s,t}^{i}(k)}, \qquad 1 \le l, s \le N, \quad 1 \le t \le p + N + 1    (2.57)
The factors \frac{\partial y_l(k)}{\partial w_{s,t}^{r}(k)} = \frac{\partial u_l(k)}{\partial w_{s,t}^{r}(k)} + j\,\frac{\partial v_l(k)}{\partial w_{s,t}^{r}(k)} and \frac{\partial y_l(k)}{\partial w_{s,t}^{i}(k)} = \frac{\partial u_l(k)}{\partial w_{s,t}^{i}(k)} + j\,\frac{\partial v_l(k)}{\partial w_{s,t}^{i}(k)} are measures of the sensitivity of the output of the l-th unit at time k to a small variation in the value of w_{s,t}(k). These sensitivities can be evaluated as [Goh 2004]
\frac{\partial u_l(k)}{\partial w_{s,t}^{r}(k)} = \frac{\partial u_l}{\partial \sigma_l}\cdot\frac{\partial \sigma_l}{\partial w_{s,t}^{r}(k)} + \frac{\partial u_l}{\partial \tau_l}\cdot\frac{\partial \tau_l}{\partial w_{s,t}^{r}(k)}    (2.58)

\frac{\partial u_l(k)}{\partial w_{s,t}^{i}(k)} = \frac{\partial u_l}{\partial \sigma_l}\cdot\frac{\partial \sigma_l}{\partial w_{s,t}^{i}(k)} + \frac{\partial u_l}{\partial \tau_l}\cdot\frac{\partial \tau_l}{\partial w_{s,t}^{i}(k)}    (2.59)

\frac{\partial v_l(k)}{\partial w_{s,t}^{r}(k)} = \frac{\partial v_l}{\partial \sigma_l}\cdot\frac{\partial \sigma_l}{\partial w_{s,t}^{r}(k)} + \frac{\partial v_l}{\partial \tau_l}\cdot\frac{\partial \tau_l}{\partial w_{s,t}^{r}(k)}    (2.60)

\frac{\partial v_l(k)}{\partial w_{s,t}^{i}(k)} = \frac{\partial v_l}{\partial \sigma_l}\cdot\frac{\partial \sigma_l}{\partial w_{s,t}^{i}(k)} + \frac{\partial v_l}{\partial \tau_l}\cdot\frac{\partial \tau_l}{\partial w_{s,t}^{i}(k)}    (2.61)

where \sigma_l and \tau_l denote the real and imaginary parts of net_l(k).
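For concreteness, the minimal sketch below evaluates the cost of Eq. 2.53 and the combined real/imaginary gradient of Eq. 2.54 for a single fully connected complex-valued layer; the split tanh activation and the single-layer setting are assumptions of the sketch rather than the exact formulation of [Goh 2004].

import numpy as np

def complex_cost_and_grad(w, I_k, d_k):
    """Cost E(k) and gradient w.r.t. real and imaginary weight parts for one
    complex-valued layer with a split tanh activation (an assumed choice)."""
    net = w @ I_k                                  # net_l(k) = sum_n w_{l,n} I_n(k)
    u, v = np.tanh(net.real), np.tanh(net.imag)    # y_l(k) = u_l + j v_l
    e = d_k - (u + 1j * v)                         # e_l(k)
    E = 0.5 * np.sum(np.abs(e) ** 2)               # Eq. 2.53

    du, dv = 1.0 - u ** 2, 1.0 - v ** 2            # tanh derivatives of u_l and v_l
    g = -(e.real * du) - 1j * (e.imag * dv)        # dE/dnet, real and imaginary parts
    grad = np.outer(g, np.conj(I_k))               # dE/dw^r + j dE/dw^i (cf. Eq. 2.54)
    return E, grad

rng = np.random.default_rng(7)
w = rng.normal(size=(3, 5)) + 1j * rng.normal(size=(3, 5))
I_k = rng.normal(size=5) + 1j * rng.normal(size=5)
d_k = rng.normal(size=3) + 1j * rng.normal(size=3)
print(complex_cost_and_grad(w, I_k, d_k)[0])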
2.5
2.5.1
f_k(x) = \frac{1}{(2\pi\sigma^{2})^{d/2}}\,\frac{1}{N_k}\sum_{i=1}^{N_k} \exp\Big[-\frac{(x - x_{ki})^{T}(x - x_{ki})}{2\sigma^{2}}\Big]    (2.63)
\sum_{i=1}^{N_k} \exp\Big[-\frac{(x - x_{ki})^{T}(x - x_{ki})}{2\sigma^{2}}\Big] > \sum_{i=1}^{N_j} \exp\Big[-\frac{(x - x_{ji})^{T}(x - x_{ji})}{2\sigma^{2}}\Big] \qquad \text{for all } j \ne k    (2.64)
which is derived using prior probabilities calculated as the relative frequency of the examples in each class:

P_k = N_k / N    (2.65)

where N is the number of all training examples and N_k is the number of examples in class k. The Parzen windows classifier uses the entire training set of examples to perform classification, that is, it requires storage of the training set in the computer memory. The speed of computation is proportional to the training set size [P. V. N. Rao 2005] [Kumar 2009] [Haykin 2003].
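A minimal sketch of the Parzen-window classifier of Eqs. 2.63-2.65 follows; the toy two-class data and the value of the smoothing parameter sigma are assumptions made purely for illustration.

import numpy as np

def parzen_classify(x, X_train, labels, sigma=0.5):
    """Assign x to the class whose prior-weighted Parzen density estimate is largest
    (Eqs. 2.63-2.65)."""
    classes = np.unique(labels)
    N, d = X_train.shape
    best_class, best_score = None, -np.inf
    for k in classes:
        Xk = X_train[labels == k]                    # the N_k examples of class k
        sq = np.sum((Xk - x) ** 2, axis=1)           # (x - x_ki)^T (x - x_ki)
        f_k = np.exp(-sq / (2 * sigma ** 2)).sum()   # kernel sum of Eq. 2.63
        f_k /= (len(Xk) * (2 * np.pi * sigma ** 2) ** (d / 2))
        score = (len(Xk) / N) * f_k                  # prior P_k = N_k / N times f_k(x)
        if score > best_score:
            best_class, best_score = k, score
    return best_class

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(parzen_classify(np.array([3.5, 3.5]), X, y))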
As mentioned earlier, the probabilistic neural network is a direct continuation of the work on Bayes classifiers. The probabilistic neural network (PNN) learns to approximate the pdf of the training examples. More precisely, the PNN is interpreted as a function which approximates the probability density of the underlying example distribution [P. V. N. Rao 2005] [Kumar 2009] [Haykin 2003]. The PNN consists of nodes allocated in three layers after the inputs: there is one pattern node for each training example. Each pattern node forms a product of the weight vector and the given example for classification, where the weights entering a node are taken from a particular training example. After that, the product is passed through the activation function:
\exp\Big[\frac{x^{T} w_{ki} - 1}{\sigma^{2}}\Big]    (2.66)
Each summation node receives the outputs from pattern nodes associated with a given class:
\sum_{i=1}^{N_k} \exp\Big[\frac{x^{T} w_{ki} - 1}{\sigma^{2}}\Big]    (2.67)
The output nodes are binary neurons that produce the classification
decision
\sum_{i=1}^{N_k} \exp\Big[\frac{x^{T} w_{ki} - 1}{\sigma^{2}}\Big] > \sum_{i=1}^{N_j} \exp\Big[\frac{x^{T} w_{ji} - 1}{\sigma^{2}}\Big]    (2.68)
The only factor that needs to be selected for training is the smoothing factor, that is, the deviation of the Gaussian functions:
(a) Too small deviations cause a very spiky approximation which cannot generalize well;
(b) Too large deviations smooth out details.
An appropriate deviation is chosen by experiment [P. V. N. Rao 2005]
[Kumar 2009] [Haykin 2003].
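The three-layer structure just described can be sketched as follows; the unit-length normalization of the stored patterns and the toy data are assumptions of the sketch (Eqs. 2.66-2.68).

import numpy as np

def pnn_classify(x, patterns, labels, sigma=0.3):
    """PNN decision: pattern layer (Eq. 2.66), per-class summation layer (Eq. 2.67)
    and a winner-take-all output layer (Eq. 2.68)."""
    x = x / np.linalg.norm(x)                                  # inputs assumed normalized
    activations = np.exp((patterns @ x - 1.0) / sigma ** 2)    # one pattern node per example
    classes = np.unique(labels)
    sums = np.array([activations[labels == k].sum() for k in classes])
    return classes[np.argmax(sums)]

rng = np.random.default_rng(3)
A = rng.normal([1.0, 0.0], 0.1, (15, 2))
B = rng.normal([0.0, 1.0], 0.1, (15, 2))
patterns = np.vstack([A, B])
patterns /= np.linalg.norm(patterns, axis=1, keepdims=True)    # unit-length pattern weights
labels = np.array([0] * 15 + [1] * 15)
print(pnn_classify(np.array([0.9, 0.1]), patterns, labels))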
2.6 Self Organizing Map (SOM)
Self-organizing feature maps (SOFMs) are a type of neural network developed by Kohonen. Self Organizing Maps (Figure 5.2) reduce dimensions by producing a map, usually of one or two dimensions, that plots
the similarities of the data by grouping the similar data items together.
SOMs accomplish two things. They reduce dimensions and display similarities. An input variable is presented to the entire network of neurons
during training. The neuron whose weight most closely matches that of
the input variable is called the winning neuron, and this neuron's weight is modified to more closely match the incoming vector. The SOM uses
an unsupervised learning method to map high dimensional data into a
1-D, 2-D, or 3-D data space, subject to a topological ordering constraint.
A major advantage is that the clustering produced by the SOM retains
the underlying structure of the input space, while the dimensionality
of the space is reduced. As a result, a neuron map is obtained with
weights encoding the stationary probability density function p(X) of the
input pattern vectors [T. Kohonen 2000].
\|x(t) - w_m(t)\| = \min_j \|x(t) - w_j(t)\|    (2.69)

where w_m is the winning neuron. For the weight vector of unit i the SOM weight update rule is given as

w_i(t + 1) = w_i(t) + N_{mi}(t)\,[x(t) - w_i(t)]    (2.70)
where t is the time, x(t) is an input vector, and N_{mi}(t) is the neighborhood kernel around the winner unit m. The neighborhood function in this case is a Gaussian given by
N_{mi}(t) = \alpha(t)\,\exp\Big[-\frac{\|r_i - r_m\|^{2}}{2\sigma^{2}(t)}\Big]    (2.71)
where r_m and r_i are the position vectors of the winning node and of the neighborhood nodes, respectively. The learning-rate factor 0 < \alpha(t) < 1 decreases monotonically with time, and \sigma(t) corresponds to the width of the neighborhood function, also decreasing monotonically with time. Thus, the winning node undergoes the most change, while the neighborhood nodes furthest away from the winner undergo the least change.
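A minimal sketch of one incremental SOM update (Eqs. 2.69-2.71) on a small two-dimensional map follows; the 5 x 5 grid, the learning-rate value and the neighborhood width are illustrative assumptions.

import numpy as np

def som_update(W, grid, x, alpha, sigma):
    """One incremental SOM step: find the winner, then move every unit toward x
    in proportion to the Gaussian neighborhood kernel (Eqs. 2.69-2.71)."""
    m = np.argmin(np.linalg.norm(W - x, axis=1))       # winning unit (Eq. 2.69)
    d2 = np.sum((grid - grid[m]) ** 2, axis=1)         # squared map distances ||r_i - r_m||^2
    N_mi = alpha * np.exp(-d2 / (2 * sigma ** 2))      # neighborhood kernel (Eq. 2.71)
    W += N_mi[:, None] * (x - W)                       # weight update rule (Eq. 2.70)
    return W, m

rng = np.random.default_rng(4)
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
W = rng.random((25, 3))                                # 5 x 5 map of 3-D weight vectors
W, winner = som_update(W, grid, rng.random(3), alpha=0.5, sigma=1.0)
print("winning unit:", winner)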
Batch training: The incremental process defined above can be replaced by a batch computation version which is significantly faster [5].
The batch training algorithm is also iterative, but instead of presenting a single data vector to the map at a time, the whole data
set is given to the map before any adjustments are made. In each
training step, the data set is partitioned such that each data vector
belongs to the neighborhood set of the map unit to which it is closest, the Voronoi set [E. Alhoneiemi 1999]. The sum of the vectors in each Voronoi set V_i is

S_i(t) = \sum_{j=1}^{n_{V_i}} x_j    (2.72)
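The sketch below carries out one batch-SOM iteration under common assumptions: every data vector is assigned to its Voronoi set, the sums S_i(t) of Eq. 2.72 are accumulated, and the new weights are formed as neighborhood-weighted means of these sums; the Gaussian neighborhood, the map size and the random data are illustrative.

import numpy as np

def batch_som_step(W, grid, X, sigma):
    """One iteration of batch SOM training over the whole data set X."""
    units = W.shape[0]
    # Partition the data: best-matching unit (Voronoi set) for every data vector.
    bmu = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
    # Voronoi-set sums S_i(t) (Eq. 2.72) and member counts n_Vi.
    S, n = np.zeros_like(W), np.zeros(units)
    np.add.at(S, bmu, X)
    np.add.at(n, bmu, 1)
    # Gaussian neighborhood between every pair of map units.
    d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    H = np.exp(-d2 / (2 * sigma ** 2))
    # New weights: neighborhood-weighted average of the Voronoi sums.
    return (H @ S) / (H @ n)[:, None]

rng = np.random.default_rng(5)
grid = np.array([[i, j] for i in range(4) for j in range(4)], dtype=float)
W = rng.random((16, 2))
X = rng.random((200, 2))
print(batch_som_step(W, grid, X, sigma=1.0).shape)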
W_J^{T} X_k = \max_j \{W_j^{T} X_k\}, \qquad j = 1, \ldots, m    (2.74)
Alternatively, one might select the winner based on a Euclidean distance measure. Here the distance is measured between the present input X_k and the weight vectors W_j, and the winning neuron index J satisfies

\|X_k - W_J\| = \min_j \{\|X_k - W_j\|\}    (2.76)
It is misleading to think that these methods of selecting the winning neuron are entirely distinct [323]. To see this, assume that the weight equinorm property holds for all weight vectors:

\|W_1\| = \|W_2\| = \cdots = \|W_m\|    (2.77)
Expanding \|X_k - W_j\|^{2} = \|X_k\|^{2} - 2\,W_j^{T} X_k + \|W_j\|^{2} then shows that, under this assumption, selecting the maximally correlated weight vector is identical to the criterion given in Eq. 2.76. Neuron J wins
if its weight vector correlates maximally with the impinging input.
In the case of ART 1 F2 neurons we have seen that a binary
choice network requires neuronal excitatory self-feedback, lateral
inhibition and a faster-than-linear signal function. In computer
simulations, however, one may simply choose the neuron index with
maximum activity or minimum distance.
The weights are then updated at each time step as

w_{ij}^{k+1} = w_{ij}^{k} + s_j^{k}\,(x_i^{k} - w_{ij}^{k})

where s_j^{k} = 1 only for j = J and is zero otherwise, for a hard competitive field.
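A minimal sketch of this hard competitive step follows: the winner J is chosen by the Euclidean criterion of Eq. 2.76 and only its weight vector is moved toward the input; the learning-rate factor eta is an added assumption of the sketch, since the update above moves the winner fully onto the input when s_J^k = 1.

import numpy as np

def competitive_step(W, x, eta=0.1):
    """One hard competitive-learning step (winner-take-all update)."""
    J = np.argmin(np.linalg.norm(W - x, axis=1))   # winner by minimum distance (Eq. 2.76)
    s = np.zeros(len(W))
    s[J] = 1.0                                     # s_j^k = 1 only for j = J
    W += eta * s[:, None] * (x - W)                # w_ij^{k+1} = w_ij^k + eta s_j^k (x_i^k - w_ij^k)
    return W, J

rng = np.random.default_rng(6)
W = rng.random((5, 3))
W, J = competitive_step(W, rng.random(3))
print("winning unit:", J)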
With these ideas in hand we now study three important classes of
neural network models that have their foundation in competitive
learning.