
Chapter 2

Artificial Neural Network (ANN): Basic Considerations

The human brain is known to operate under a computational paradigm radically different from that of conventional computers. While conventional computers use a fast and complex central processor with explicit program instructions and locally addressable memory, the human brain uses a parallel collection of slow and simple processing elements (neurons), densely connected by weights (synapses) whose strengths are modified with experience. This organization directly supports the integration of multiple constraints and provides a distributed form of memory that exhibits associative behaviour.

2.1 Basic Considerations of Artificial Neural Network (ANN)

Artificial Neural Networks (ANNs) are non-parametric prediction tools that can be used for a host of pattern classification and speech recognition purposes. An ANN is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system: a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process [Haykin 2003]. An ANN is characterized by
1. its pattern of connections between the neurons, called its architecture,
2. its method of determining the weights on the connections, called its training or learning algorithm, and
3. its activation function.


Figure 2.1: A generic biological neuron

2.1.1 The Biological Model of ANN

There is a close analogy between the structure of a biological neuron (i.e. a brain or nerve cell) and the processing element of the ANN [Haykin 2003]. Figure 2.1 shows a generic biological neuron, together with axons from other neurons. The fundamental information-processing unit in the operation of neural networks is the McCulloch-Pitts neuron (1943). The block diagram of Figure 2.2 shows the model of a neuron, which forms the basis for designing ANNs. From this model the internal activity of a neuron k can be shown to be
u_k = \sum_{j=1}^{m} w_{kj} x_j    (2.1)

and

y_k = \varphi(u_k + b_k)    (2.2)

where x_1, x_2, ..., x_m are the input signals, w_{k1}, w_{k2}, ..., w_{km} are the synaptic weights of neuron k, b_k is the bias, \varphi(.) is the activation function and y_k is the output signal of the neuron. The use of the bias b_k has the effect of applying an affine transformation to the output u_k of the linear combiner in the model of Figure 2.2, as shown by

v_k = u_k + b_k    (2.3)


Figure 2.2: A nonlinear model of a neuron


Combining Eqs. 2.1 and 2.3, we may formulate

v_k = \sum_{j=0}^{m} w_{kj} x_j    (2.4)

and

y_k = \varphi(v_k)    (2.5)

In Eq. 2.4 a new synapse has been added. Its input is

x_0 = +1    (2.6)

and its weight is

w_{k0} = b_k    (2.7)


The activation function \varphi(.) acts as a squashing function, such that the output of a neuron in a neural network lies between certain values (usually 0 and 1, or -1 and 1). Three basic types of activation functions are the threshold function, the piece-wise linear function and the sigmoidal function [Haykin 2003].
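As a concrete illustration of Eqs. 2.1-2.5, the following minimal sketch (variable names and values are invented for illustration) computes the induced local field of a single neuron and applies each of the three basic activation functions:

```python
import numpy as np

def neuron_output(x, w, b, phi):
    """Single-neuron model of Eqs. 2.1-2.5: v_k = sum_j w_kj * x_j + b_k, y_k = phi(v_k)."""
    v = np.dot(w, x) + b          # linear combiner plus bias (Eqs. 2.1 and 2.3)
    return phi(v)                 # squashing activation (Eq. 2.5)

# The three basic activation functions named in the text
threshold = lambda v: np.where(v >= 0, 1.0, 0.0)   # threshold function
piecewise = lambda v: np.clip(v, -1.0, 1.0)        # piece-wise linear function
sigmoid   = lambda v: 1.0 / (1.0 + np.exp(-v))     # sigmoidal function

x = np.array([0.5, -1.2, 3.0])   # input signals x_1 ... x_m
w = np.array([0.4, 0.1, -0.2])   # synaptic weights w_k1 ... w_km
b = 0.1                          # bias b_k
print(neuron_output(x, w, b, sigmoid))
```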

2.1.2 Learning Algorithms

Learning algorithms define an architecture-dependent procedure for encoding pattern information into the weights so as to generate internal models. Learning proceeds by modifying connection strengths [Kumar 2009]. Learning is data driven. The data might take the form of a set of input-output patterns derived from a possibly unknown probability distribution. Here the output pattern might specify a desired system response for a given input pattern, and the issue of learning would involve approximating the unknown function as described by the given data. Alternatively, the data might comprise patterns that naturally cluster into some number of unknown classes, and the learning problem might involve generating a suitable classification of the samples [Kumar 2009]. The nature of the learning problem thus demarcates learning algorithms into two categories: supervised and unsupervised learning.
In supervised learning, the data available for learning comprise a set of discrete samples {X_k, D_k}, k = 1, ..., Q, drawn from the pattern space, where each sample relates an input vector X_k \in R^n to an output vector D_k \in R^p. The set of samples describes the behaviour of an unknown function f : R^n \to R^p which is to be characterized.
The alternative way of learning is to simply provide the system with an input X_k and allow it to self-organize its parameters, the weights of the network, to generate internal prototypes of the sample vectors. This is called unsupervised learning. An unsupervised learning system attempts to represent the entire data set by employing a small number of prototypical vectors, enough to allow the system to retain a desired level of discrimination between samples. As new samples continuously buffet the system, the prototypes will be in a state of constant flux. There is no teaching input. This kind of learning is often called adaptive vector quantization and is essentially unsupervised in nature. In unsupervised systems, the weights of the ANN undergo a process of self-organization to create clusters of patterns represented by codebook vectors. Learning in an unsupervised system is often driven by a competitive-cooperative process in which individual neurons compete and cooperate with each other to update their weights based on the present input. Only winning neurons or clusters of neurons learn. Unsupervised learning systems embody complicated time dynamics. Learning in such systems is usually implemented in the form of differential equations that are designed to work with information available only locally to a synapse.
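The two data regimes can be contrasted in a few lines of code. The sketch below is illustrative only (the arrays, sizes and learning rate are invented); it shows the paired data consumed by supervised learning and a single adaptive-vector-quantization step in which only the winning prototype learns:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, n, p = 100, 4, 2
X = rng.normal(size=(Q, n))   # input patterns X_k in R^n
D = rng.normal(size=(Q, p))   # teacher-supplied targets D_k in R^p (supervised case)

# Unsupervised case: no targets; a handful of prototype (codebook) vectors
prototypes = X[rng.choice(Q, 5, replace=False)].copy()

x = X[0]                                                    # present one input
winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))  # competition
prototypes[winner] += 0.1 * (x - prototypes[winner])        # only the winner learns
```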

2.1.3 Network Architecture

The manner in which the neurons of a neural network are structured is intimately linked with the learning algorithm used to train the network. The learning algorithm used in the design of a neural network may therefore be said to be structured. Depending on the learning rules used, three fundamental classes of neural network architectures are defined, as described below:


Figure 2.3: Single-Layer Feedforward Networks


1. Single-Layer Feedforward Networks: In a layered neural network the neurons are organized in the form of layers. In this simplest form of layered network, there is an input layer of source nodes that projects onto an output layer of neurons, but not vice versa. In other words, this network is strictly of a feedforward or acyclic type. A single-layer network with R input elements and S neurons is shown in figure 2.3. Such a network is called a single-layer network, with the designation "single-layer" referring to the output layer of computation neurons [Kumar 2009] [Haykin 2003].
2. Multilayer Feedforward Networks: The second class of feedforward neural network distinguishes itself by the presence of one or more hidden layers, whose computation nodes are correspondingly called hidden neurons or hidden units. The function of the hidden neurons is to intervene between the external input and the network output in some useful manner. By adding one or more hidden layers, the network is enabled to extract higher-order statistics. The ability of hidden neurons to extract higher-order statistics is particularly valuable when the size of the input layer is large [Haykin 2003]. The source nodes in the input layer of the network supply the respective elements of the activation pattern, which constitute the input signals applied to the neurons (computation nodes) in the second layer. The output signals of the second layer are used as inputs to the third layer, and so on for the rest of the network [Haykin 2003].


Figure 2.4: Multilayer Feedforward Networks


The set of output signals of the neurons in the output (final) layer of the network constitutes the overall response of the network to the activation pattern supplied by the source nodes in the input (first) layer. The architectural graph in figure 2.4 shows a representation of a simple 3-layer feedforward artificial neural network with four inputs, five hidden nodes, and one output [Kumar 2009]. The MLP is the product of several researchers: Frank Rosenblatt (1958), H. D. Block (1962) and M. L. Minsky with S. A. Papert (1988). Backpropagation, the training algorithm, was discovered independently by several researchers (Rumelhart et al. (1986) and also McClelland and Rumelhart (1988)).
A simple perceptron is a single McCulloch-Pitts neuron trained by the perceptron algorithm. Its output is given as:
O_x = g([w] \cdot [x] + b)    (2.8)

where [x] is the input vector, [w] is the associated weight vector, b is a bias value and g(.) is the activation function. Such a setup, namely the perceptron, is able to classify only linearly separable data. An MLP, in contrast, consists of several layers of neurons. The output of an MLP with one hidden layer is given as:

O_x = \sum_{i=1}^{N} \alpha_i \, g([w]_i \cdot [x] + b_i)    (2.9)

where \alpha_i is the weight connecting the i-th hidden neuron to the output.


Figure 2.5: Recurrent Neural Networks


The process of adjusting the weights and biases of a perceptron or MLP is known as training. The perceptron algorithm for training simple perceptrons consists of comparing the output of the perceptron with an associated target value. The most common training algorithm for MLPs is error backpropagation, which entails a backward propagation of the error correction through each neuron in the network (a short sketch of the perceptron and MLP outputs follows this list).
3. Recurrent Neural Network: A recurrent neural network distinguishes itself from a feedforward neural network in that it has at least one feedback loop [Haykin 2003]. A recurrent network may consist of a single layer of neurons with each neuron feeding its output signal back to the inputs of all the other neurons, as illustrated in the architectural graph of figure 2.5. Section 2.4 provides a brief discussion on Recurrent Neural Networks.
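As a minimal sketch of Eqs. 2.8 and 2.9 (the shapes, seed and random weights here are invented for illustration), the perceptron and one-hidden-layer MLP outputs can be computed as follows:

```python
import numpy as np

def perceptron(x, w, b, g=np.sign):
    """Simple perceptron of Eq. 2.8: O_x = g(w . x + b)."""
    return g(np.dot(w, x) + b)

def mlp_one_hidden(x, W, b, alpha, g=np.tanh):
    """One-hidden-layer MLP of Eq. 2.9: O_x = sum_i alpha_i * g(w_i . x + b_i).
    W is an (N, n) matrix whose rows are the hidden weight vectors [w]_i."""
    hidden = g(W @ x + b)         # activations of the N hidden neurons
    return np.dot(alpha, hidden)  # weighted combination to the output

rng = np.random.default_rng(0)
x = rng.normal(size=3)
print(perceptron(x, rng.normal(size=3), 0.1))
print(mlp_one_hidden(x, rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=5)))
```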

2.2 Prediction and Classification using Artificial Neural Network (ANN)s

The accuracy of a prediction and estimation process depends to a large extent on the accuracy of the underlying classification. Neural networks have been among the preferred classifiers for their ability to provide optimal solutions to a wide class of arbitrary classification problems [Haykin 2003]. Multi Layered Perceptrons (MLPs) trained with (error) Back Propagation (BP) in particular have found widespread acceptance in several classification, recognition and estimation applications. The reason is that MLPs implement linear discriminants, but in a space where the inputs have been mapped nonlinearly. The strength of this classification derives from the fact that MLPs admit fairly simple algorithms in which the form of the nonlinearity can be learned from training data [Haykin 2003]. The classification process involves defining a decision rule so as to provide boundaries separating one class of data from another. It means placing the j-th input vector x(j) in the proper class among the M outputs of a classifier.
Classification by a neural network involves training it so as to provide an optimum decision rule for classifying all the outputs of the network. The training should minimize the risk functional [Haykin 2003]:

R = \frac{1}{2N} \sum_{j=1}^{N} \| d_j - F(x_j) \|^2    (2.10)

where d_j is the desired output pattern for the prototype vector x_j, \|\cdot\| is the Euclidean norm of the enclosed vector and N is the total number of samples presented to the network in training. The decision rule therefore can be given by the output of the network:

y_k^j = F_k(x_j)    (2.11)

for the j-th input vector x_j.


The MLPs provide global approximations to the non-linear mapping from the input to the output layer. As a result, MLPs are able to generalize in regions of the input space where little or no data are available [Haykin 2003]. The size of the network is an important consideration from both the performance and the computational points of view.
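The risk functional of Eq. 2.10 and the decision rule of Eq. 2.11 can be sketched directly. This is a hedged illustration (the function names and array shapes are assumptions):

```python
import numpy as np

def empirical_risk(D, Y):
    """Risk functional of Eq. 2.10: R = (1/2N) * sum_j ||d_j - F(x_j)||^2.
    D holds the desired patterns d_j, Y the network outputs F(x_j)."""
    N = len(D)
    return np.sum(np.linalg.norm(D - Y, axis=1) ** 2) / (2 * N)

def decide(Y):
    """Decision rule in the spirit of Eq. 2.11: assign each input to the
    class whose output neuron responds most strongly."""
    return np.argmax(Y, axis=1)
```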

2.3 Application of Error Back Propagation for ANN training

In supervised mode the ANN is trained using (error) Back Propagation (BP), by which the connecting weights between the layers are updated. This adaptive updating of the ANN is continued till the performance goal is met. Training the ANN is done in two broad passes: a forward pass and a backward calculation, with error determination and connecting-weight updating in between. The batch training method is adopted as it accelerates the speed of training and the rate of convergence of the MSE to the desired value [Haykin 2003]. The steps are as below:


Initialization: Initialize the weight matrix W with random values between [-1, 1] if a tan-sigmoid function is used as the activation function, and between [0, 1] if a log-sigmoid function is used. W is a C x P matrix, where P is the length of the feature vector used for each of the C classes.
Presentation of training samples: The input is p_m = [p_{m1}, p_{m2}, ..., p_{mL}]. The desired output is d_m = [d_{m1}, d_{m2}, ..., d_{mL}]. Compute the values of the hidden nodes as:

net^h_{mj} = \sum_{i=1}^{L} w^h_{ji} p_{mi} + \theta^h_j    (2.12)

Calculate the output from the hidden layer as

o^h_{mj} = f^h_j(net^h_{mj})    (2.13)

where f(x) = \frac{1}{1 + e^{-x}} or f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, depending upon the choice of the activation function. Calculate the values of the output nodes as:

o^o_{mk} = f^o_k(net^o_{mk})    (2.14)

Forward Computation: Compute the errors:

e_{jn} = d_{jn} - o_{jn}    (2.15)

Calculate the mean square error (MSE) as:

MSE = \frac{\sum_{n=1}^{M} \sum_{j=1}^{L} e_{jn}^2}{2M}    (2.16)

The error terms for the output layer are:

\delta^o_{mk} = o^o_{mk}(1 - o^o_{mk}) e_{mk}    (2.17)

The error terms for the hidden layer are:

\delta^h_{mk} = o^h_{mk}(1 - o^h_{mk}) \sum_j \delta^o_{mj} w^o_{jk}    (2.18)

Weight Update:
Between the output and hidden layers:

w^o_{kj}(t+1) = w^o_{kj}(t) + \eta \delta^o_{mk} o^h_{mj}    (2.19)

where \eta is the learning rate (0 < \eta < 1). For faster convergence a momentum term (\alpha) may be added as:

w^o_{kj}(t+1) = w^o_{kj}(t) + \eta \delta^o_{mk} o^h_{mj} + \alpha (w^o_{kj}(t) - w^o_{kj}(t-1))    (2.20)

Between the hidden layer and the input layer:

w^h_{ji}(t+1) = w^h_{ji}(t) + \eta \delta^h_{mj} p_i    (2.21)

A momentum term may be added as:

w^h_{ji}(t+1) = w^h_{ji}(t) + \eta \delta^h_{mj} p_i + \alpha (w^h_{ji}(t) - w^h_{ji}(t-1))    (2.22)

One cycle through the complete training set forms one epoch. The above is repeated till the MSE meets the performance criterion, counting the number of epochs elapsed. A few methods used for MLP training include:
Gradient Descent BP (GDBP),
Gradient Descent with Momentum BP (GDMBP),
Gradient Descent with Adaptive Learning Rate BP (GDALRBP), and
Gradient Descent with Adaptive Learning Rate and Momentum BP (GDALMBP).
Here, while the MSE approaches the convergence goal, training of the MLPs suffers if Corr(p_{m,i}(j), p_{m,i}(j+1)) is high or Corr(p_{m,i}(j), p_{m,i+1}(j)) is high, i.e. if adjacent feature components are strongly correlated. This is due to the requirements of the (error) back propagation algorithm, as suggested by equations 2.17 and 2.18. The formulation of any feature set should therefore always keep this fact within sight.
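A compact batch implementation of the above steps might look as follows. This is a hedged sketch, not the exact formulation of this work: biases are carried as separate vectors rather than folded into W, and the toy data set (a logical-OR truth table) is invented for illustration:

```python
import numpy as np

def logsig(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mlp(P, D, hidden, eta=0.5, alpha=0.8, goal=1e-3, max_epochs=5000, seed=0):
    """Minimal batch error back-propagation for a one-hidden-layer MLP,
    following Eqs. 2.12-2.22 with log-sigmoid units and weights in [0, 1]."""
    rng = np.random.default_rng(seed)
    M, L = P.shape                        # M training samples of length L
    C = D.shape[1]                        # C output nodes
    Wh = rng.uniform(0, 1, (hidden, L)); bh = np.zeros(hidden)
    Wo = rng.uniform(0, 1, (C, hidden)); bo = np.zeros(C)
    dWh = np.zeros_like(Wh); dWo = np.zeros_like(Wo)
    for epoch in range(max_epochs):
        Oh = logsig(P @ Wh.T + bh)        # hidden outputs, Eqs. 2.12-2.13
        Oo = logsig(Oh @ Wo.T + bo)       # network outputs, Eq. 2.14
        E = D - Oo                        # errors, Eq. 2.15
        if np.sum(E ** 2) / (2 * M) < goal:   # MSE, Eq. 2.16
            break
        do = Oo * (1 - Oo) * E            # output-layer error terms, Eq. 2.17
        dh = Oh * (1 - Oh) * (do @ Wo)    # hidden-layer error terms, Eq. 2.18
        dWo = eta * do.T @ Oh + alpha * dWo   # Eqs. 2.19-2.20 (with momentum)
        dWh = eta * dh.T @ P + alpha * dWh    # Eqs. 2.21-2.22 (with momentum)
        Wo += dWo; bo += eta * do.sum(axis=0)
        Wh += dWh; bh += eta * dh.sum(axis=0)
    return Wh, bh, Wo, bo, epoch

# Toy batch-training run on a linearly separable truth table (logical OR)
P = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
D = np.array([[0], [1], [1], [1]], float)
Wh, bh, Wo, bo, epochs = train_mlp(P, D, hidden=3)
```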

2.4 Recurrent Neural Network

Recurrent Neural Networks (RNNs) are ANNs with one or more feedback loops. The feedback can be of a local or global kind. An RNN may be considered to be an MLP with local or global feedback in a variety of forms: it may have feedback from the output neurons of the MLP to the input layer, and yet another possible form of global feedback is from the hidden neurons of the ANN to the input layer [Haykin 2003].
The architectural layout of a recurrent network takes many different forms. Four specific network architectures are commonly used, each of which has a specific form of global feedback; two of them are described below.
1. Input-Output Recurrent Model: The architecture of a generic recurrent network that follows naturally from a multilayer perceptron is shown in figure 2.5. The model has a single input that is applied to a tapped-delay-line memory of q units, and a single output that is fed back to the input via another tapped-delay-line memory, also of q units. The contents of these two tapped-delay-line memories are used to feed the input layer of the multilayer perceptron. The present value of the model input is denoted by u(n), and the corresponding value of the model output is denoted by y(n+1); that is, the output is ahead of the input by one time unit [Haykin 2003]. Thus, the signal vector applied to the input layer of the multilayer perceptron consists of a data window made up as follows (a small sketch of this data window follows these two models):
Present and past values of the input, namely u(n), u(n-1), ..., u(n-q+1), which represent exogenous inputs originating from outside the network.
Delayed values of the output, namely y(n), y(n-1), ..., y(n-q+1), on which the model output y(n+1) is regressed.
The dynamic behavior of this model, known as the NARX model, is described by

y(n+1) = F(y(n), ..., y(n-q+1), u(n), ..., u(n-q+1))    (2.23)

where F is a nonlinear function of its arguments.


2. State-Space Model: The block diagram of another generic recurrent network, called a state-space model, is shown in Figure 2.6. The hidden neurons define the state of the network. The output of the hidden layer is fed back to the input layer via a bank of unit delays. The input layer consists of a concatenation of feedback nodes and source nodes. The network is connected to the external environment via the source nodes.


Figure 2.6: State-Space Model

Figure 2.7: Second Order Recurrent Network


The number of unit delays used to feed the output of the hidden layer back to the input layer determines the order of the model. Let the m-by-1 vector u(n) denote the input vector, and the q-by-1 vector x(n) denote the output of the hidden layer at time n [Haykin 2003].
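As referenced above, the following minimal sketch (the helper name and toy series are invented for illustration) shows how the two tapped-delay-line memories of the input-output recurrent model assemble the data window of Eq. 2.23:

```python
import numpy as np

def narx_window(u, y, n, q):
    """Input vector for the NARX model of Eq. 2.23: present and past inputs
    u(n), ..., u(n-q+1) concatenated with delayed outputs y(n), ..., y(n-q+1)."""
    exo = [u[n - i] for i in range(q)]   # exogenous inputs from outside the network
    fed = [y[n - i] for i in range(q)]   # fed-back outputs on which y(n+1) is regressed
    return np.array(exo + fed)

# an MLP F(.) applied to this window would produce the prediction y(n+1)
u = np.sin(0.1 * np.arange(50))
y = np.zeros(50)
print(narx_window(u, y, n=10, q=3))
```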

2.4.1 Learning Algorithms

The basic algorithms for RNN training can be summarized as below:
(a) Back-Propagation Through Time: The back-propagation-through-time (BPTT) algorithm for training a recurrent network is an extension of the standard back-propagation algorithm. It may be derived by unfolding the temporal operation of the network into a layered feedforward network, the topology of which grows by one layer at every time step.
Let N denote a recurrent network required to learn a temporal task, starting from time n0 all the way up to time n. Let N* denote the feedforward network that results from unfolding the temporal operation of the recurrent network N. The unfolded network N* is related to the original network N as follows:
i. For each time step in the interval (n0, n], the network N* has a layer containing K neurons, where K is the number of neurons contained in the network N.
ii. In every layer of the network N* there is a copy of each neuron in the network N.
iii. For each time step l \in [n0, n], the synaptic connection from neuron i in layer l to neuron j in layer l+1 of the network N* is a copy of the synaptic connection from neuron i to neuron j in the network N.
(b) Real-Time Recurrent Learning: Another learning method, referred to as real-time recurrent learning (RTRL), derives its name from the fact that adjustments are made to the synaptic weights of a fully connected recurrent network in real time, that is, while the network continues to perform its signal processing function (Williams and Zipser, 1989). Figure 2.7 shows the layout of such an RNN. The network has two distinct layers: a concatenated input-feedback layer and a processing layer of computation nodes. Correspondingly, the synaptic connections of the network are made up of feedforward and feedback connections. The RTRL algorithm can be summarized as follows.
Parameters:
m = dimensionality of the input space, q = dimensionality of the state space, p = dimensionality of the output space, and w_j = synaptic weight vector of neuron j.
Initialization:
i. Set the synaptic weights of the algorithm to small values selected from a uniform distribution.
ii. Set the initial value of the state vector x(0) = 0.
iii. Set \Lambda_j(0) = 0 for j = 1, 2, ..., q.
Computations: Compute for n = 0, 1, 2, ...:

\Lambda_j(n+1) = \Phi(n)[W_a(n)\Lambda_j(n) + U_j(n)]
e(n) = d(n) - C x(n)
\Delta w_j(n) = \eta [C \Lambda_j(n)]^T e(n)

Three new matrices \Lambda_j(n), U_j(n) and \Phi(n) are introduced, as given in the following description:
i. \Lambda_j(n) is a q-by-(q+m+1) matrix defined as the partial derivative of the state vector x(n) with respect to the weight vector w_j:

\Lambda_j(n) = \frac{\partial x(n)}{\partial w_j},  j = 1, 2, ..., q    (2.24)

ii. U_j(n) is a q-by-(q+m+1) matrix whose rows are all zero, except for the j-th row, which is equal to the transpose of the input vector \xi(n):

U_j(n) = \begin{bmatrix} 0 \\ \xi^T(n) \\ 0 \end{bmatrix} \leftarrow j-th row,  j = 1, 2, ..., q    (2.25)

iii. \Phi(n) is a q-by-q diagonal matrix whose k-th diagonal element is the partial derivative of the activation function with respect to its argument, evaluated at w_k^T \xi(n):

\Phi(n) = diag(\varphi'(w_1^T \xi(n)), ..., \varphi'(w_j^T \xi(n)), ..., \varphi'(w_q^T \xi(n)))    (2.26)
(c) Decoupled Extended Kalman Filtering (DEKF) Algorithm: Despite its relative simplicity (as compared to recurrent back-propagation), it has been demonstrated that one of the difficulties of using direct gradient-descent algorithms for training RNNs is the problem of vanishing gradients. Bengio, Simard, and Frasconi (1994) showed that the problem of vanishing gradients is the essential reason that gradient-descent methods are often slow to converge [Haykin 2003]. A solution to this problem has been derived by considering the training of the RNN to be an optimum filtering problem, akin to the Kalman filter, such that each updated estimate of the state is computed from the previous estimate and the data currently available; only the previous data requires storage. This algorithm is derived from the Extended Kalman Filter (EKF) algorithm. The EKF is a state estimation technique for nonlinear systems derived by linearizing the well-known linear-systems Kalman filter around the current state estimate. A simple case may be considered: a time-discrete system with additive input and no observation noise,

x(n+1) = f(x(n)) + q(n)    (2.27)

d(n) = h_n(x(n))    (2.28)

where x(n) is the system's internal state vector, f is the system's state update function (linear in the original Kalman filter), q(n) is external input to the system (an uncorrelated Gaussian white noise process, which can also be considered process noise), d(n) is the system's output, and h_n is a time-dependent observation function (also linear in the original Kalman filter). At time n = 0, the system state x(0) is guessed by a multidimensional normal distribution with mean \hat{x}(0) and covariance matrix P(0). The system is observed until time n through d(0), ..., d(n). The task addressed by the extended Kalman filter is to give an estimate \hat{x}(n+1) of the true state x(n+1), given the initial state guess and all previous output observations. This task is solved by the following two time-update (Eqs. 2.29 to 2.30) and three measurement-update computations (Eqs. 2.31 to 2.33):

x^*(n) = f(\hat{x}(n))    (2.29)

P^*(n) = F(n) P(n-1) F(n)^T + Q(n)    (2.30)

K(n) = P^*(n) H(n) [H(n)^T P^*(n) H(n)]^{-1}    (2.31)

\hat{x}(n+1) = x^*(n) + K(n) \xi(n)    (2.32)

P(n+1) = P^*(n) - K(n) H(n)^T P^*(n)    (2.33)

Here F(n) and H(n) are the Jacobians, as per the notation used by Singhal and Wu (1989), of the components of f and h_n with respect to the state variables, evaluated at the previous state estimate, and

\xi(n) = d(n) - h_n(\hat{x}(n))    (2.34)

\xi(n) is the error (the difference between the observed output and the output calculated from the state estimate \hat{x}(n)), P(n) is an estimate of the conditional error covariance matrix, Q(n) is the (diagonal) covariance matrix of the process noise, and the time updates x^*(n), P^*(n) of the state estimate and state-error covariance estimate are obtained from extrapolating the previous estimates with the known dynamics f. Information contained in d(n) enters the measurement update in the form of \xi(n) and is accumulated in the Kalman gain K(n). In the case of classical (linear, stationary) Kalman filtering, F(n) and H(n) are constant, and the state estimates converge to the true conditional mean state. For nonlinear f, h_n, this is not generally true, and the use of extended Kalman filters leads only to locally optimal state estimates [Jaeger 2005].
Application of EKF for RNN Training: Assume that an RNN perfectly reproduces the input-output training data, which has a time-series dependence:

u(n) = [u_1(n), ..., u_N(n)]^T, d(n) = [d_1(n), ..., d_N(n)]^T, n = 1, 2, ..., T    (2.35)

The output d(n) of the RNN is expressed by passing the weights w and the inputs u up to time n through a function h:

d(n) = h[w, u(0), u(1), ..., u(n)]    (2.36)

Assuming the presence of uncorrelated Gaussian noise q(n), the network weight update may be expressed as

w(n+1) = w(n) + q(n)    (2.37)

d(n) = h_n(w(n))    (2.38)

The measurement updates become

K(n) = P(n) H(n) [H(n)^T P(n) H(n)]^{-1}    (2.39)

\hat{w}(n+1) = \hat{w}(n) + K(n) \xi(n)    (2.40)

P(n+1) = P(n) - K(n) H(n)^T P(n) + Q(n)    (2.41)

A learning rate \eta, to compensate for initially bad estimates of P(n), can be introduced into the Kalman gain update equation:

K(n) = P(n) H(n) [(1/\eta) I + H(n)^T P(n) H(n)]^{-1}    (2.42)

Inserting process noise q(n) into the EKF has been claimed to improve the algorithm's numerical stability and to avoid getting stuck in poor local minima (Puskorius and Feldkamp 1994) [Haykin 2003] [Haykin 2002] [Jaeger 2005]. The EKF is a second-order gradient-descent algorithm, in that it uses curvature information of the (squared) error surface. As a consequence of exploiting curvature, for linear noise-free systems the Kalman filter can converge in a single step, which makes RNN training faster [Haykin 2003] [Haykin 2002] [Jaeger 2005].
The EKF requires the derivatives H(n) of the network outputs w.r.t. the weights, evaluated at the current weight estimate. These derivatives can be exactly computed as in the RTRL algorithm, at cost O(N^4) (a small numerical sketch of the weight updates of Eqs. 2.40-2.42 is given after the complexity summary at the end of this subsection).
Decoupled Extended Kalman Filter (DEKF) and RNN Learning: Over and above the calculation of H(n), the update of P(n) is computationally the most expensive step in the EKF; it requires O(LN^2) operations. By configuring the network with decoupled subnetworks, a block-diagonal P(n) can be obtained, which results in a considerable reduction in computation, as shown by Feldkamp et al. (1998) [Jaeger 2005]. This gives rise to the Decoupled Extended Kalman Filter (DEKF) algorithm, which can be summarized as below [Haykin 2003]:
Initialization:
i. Set the synaptic weights of the RNN to small values selected from a uniform distribution.
ii. Form the diagonal elements of the covariance matrix Q(n) (process noise, between 10^{-6} and 10^{-2}).
iii. Set K(1, 0) = \delta^{-1} I, where \delta is a small positive constant.
Computation: For n = 1, 2, ... compute

\Gamma(n) = [\sum_{i=1}^{g} C_i(n) K_i(n, n-1) C_i^T(n) + R(n)]^{-1}    (2.43)

G_i(n) = K_i(n, n-1) C_i^T(n) \Gamma(n)    (2.44)

(d) Complex-Valued RTRL Algorithm: Figure 2.8 shows a complex-valued fully connected recurrent neural network (CFRNN), which consists of N neurons with p external inputs. The network has two distinct layers: the external input-feedback layer and a layer of processing elements. Let y_l(k) denote the complex-valued output of each neuron, l = 1, ..., N, at time index k, and s(k) the (1 x p) external complex-valued input vector.


Figure 2.8: Fully connected RNN

The overall input to the network I(k) represents the concatenation of the vectors y(k), s(k) and the bias input (1+j), and is given by

I(k) = [s(k-1), ..., s(k-p), 1+j, y_1(k-1), ..., y_N(k-1)]^T    (2.45)
     = I_n^r(k) + j I_n^i(k),  n = 1, ..., p+N+1    (2.46)

where j = \sqrt{-1}, (.)^T denotes the vector transpose operator, and the superscripts (.)^r and (.)^i denote, respectively, the real and imaginary parts of a complex number. The complex-valued weight matrix of the network is denoted by W, where for the l-th neuron the weights form a (p+F+1) x 1 dimensional weight vector W_l = [w_{l,1}, ..., w_{l,p+F+1}]^T, where F is the number of feedback connections. The feedback connections represent the delayed output signals of the CFRNN. In the case of figure 2.8, we have F = N.
The output of each neuron can be expressed as

y_l(k) = \Phi(net_l(k)),  l = 1, ..., N    (2.47)

where

net_l(k) = \sum_{n=1}^{p+N+1} w_{l,n}(k) I_n(k)    (2.48)


is the net input to the l-th node at time index k. For simplicity, we state that

y_l(k) = \Phi^r(net_l(k)) + j\Phi^i(net_l(k)) = u_l(k) + j v_l(k)    (2.49)

net_l(k) = \sigma_l(k) + j \tau_l(k)    (2.50)

where \Phi is a complex nonlinear activation function [Goh 2004].


The output error consists of its real and imaginary parts and is defined as

e_l(k) = d_l(k) - y_l(k) = e_l^r(k) + j e_l^i(k)    (2.51)

e_l^r(k) = d_l^r(k) - u_l(k),  e_l^i(k) = d_l^i(k) - v_l(k)    (2.52)

where d_l(k) is the teaching signal. For real-time applications the cost function of the recurrent network is given by (Widrow et al., 1975) [Goh 2004]

E(k) = \frac{1}{2}\sum_{l=1}^{N} |e_l(k)|^2 = \frac{1}{2}\sum_{l=1}^{N} e_l(k) e_l^*(k) = \frac{1}{2}\sum_{l=1}^{N} [(e_l^r)^2 + (e_l^i)^2]    (2.53)
where (.)^* denotes the complex conjugate. Notice that E(k) is a real-valued function, and we are required to derive the gradient of E(k) with respect to both the real and imaginary parts of the complex weights as

\nabla_{w_{s,t}} E(k) = \frac{\partial E(k)}{\partial w_{s,t}^r} + j \frac{\partial E(k)}{\partial w_{s,t}^i},  1 \le l, s \le N,  1 \le t \le p+N+1    (2.54)

The CRTRL algorithm minimizes the cost function E(k) by recursively altering the weight coefficients based on gradient descent, given by

w_{s,t}(k+1) = w_{s,t}(k) + \Delta w_{s,t}(k) = w_{s,t}(k) - \eta \nabla_{w_{s,t}} E(k)|_{w_{s,t} = w_{s,t}(k)}    (2.55)

where \eta is the learning rate, a small, positive constant. Calculating the gradient of the cost function with respect to the real part of the complex weight gives

\frac{\partial E(k)}{\partial w_{s,t}^r(k)} = \frac{\partial E}{\partial u_l}\left(\frac{\partial u_l(k)}{\partial w_{s,t}^r(k)}\right) + \frac{\partial E}{\partial v_l}\left(\frac{\partial v_l(k)}{\partial w_{s,t}^r(k)}\right),  1 \le l, s \le N,  1 \le t \le p+N+1    (2.56)
Similarly, the partial derivative of the cost function with respect to the imaginary part of the complex weight yields

Table 2.1: Computational complexity of RNN training algorithms

Sl No.  Algorithm  Computational Complexity
1       BPTT       Time and storage: O(WL + SL)
2       RTRL       Time: O(W S^2 L), Storage: O(WS)
3       DEKF       Time: O(p^2 W + p \sum_{i=1}^{g} k_i^2), Storage: O(\sum_{i=1}^{g} k_i^2)
                   (p is the no. of outputs, g is the no. of groups, k_i is the no. of neurons in group i)
4       CRTRL      Time: O(N^4), Storage: O(N^3)

\frac{\partial E(k)}{\partial w_{s,t}^i(k)} = \frac{\partial E}{\partial u_l}\left(\frac{\partial u_l(k)}{\partial w_{s,t}^i(k)}\right) + \frac{\partial E}{\partial v_l}\left(\frac{\partial v_l(k)}{\partial w_{s,t}^i(k)}\right),  1 \le l, s \le N,  1 \le t \le p+N+1    (2.57)
The factors \frac{\partial y_l(k)}{\partial w_{s,t}^r(k)} = \frac{\partial u_l(k)}{\partial w_{s,t}^r(k)} + j\frac{\partial v_l(k)}{\partial w_{s,t}^r(k)} and \frac{\partial y_l(k)}{\partial w_{s,t}^i(k)} = \frac{\partial u_l(k)}{\partial w_{s,t}^i(k)} + j\frac{\partial v_l(k)}{\partial w_{s,t}^i(k)} are measures of the sensitivity of the output of the l-th unit at time k to a small variation in the value of w_{s,t}(k). These sensitivities can be evaluated as [Goh 2004]

\frac{\partial u_l(k)}{\partial w_{s,t}^r(k)} = \frac{\partial u_l}{\partial \sigma_l} \cdot \frac{\partial \sigma_l}{\partial w_{s,t}^r(k)} + \frac{\partial u_l}{\partial \tau_l} \cdot \frac{\partial \tau_l}{\partial w_{s,t}^r(k)}    (2.58)

\frac{\partial u_l(k)}{\partial w_{s,t}^i(k)} = \frac{\partial u_l}{\partial \sigma_l} \cdot \frac{\partial \sigma_l}{\partial w_{s,t}^i(k)} + \frac{\partial u_l}{\partial \tau_l} \cdot \frac{\partial \tau_l}{\partial w_{s,t}^i(k)}    (2.59)

\frac{\partial v_l(k)}{\partial w_{s,t}^r(k)} = \frac{\partial v_l}{\partial \sigma_l} \cdot \frac{\partial \sigma_l}{\partial w_{s,t}^r(k)} + \frac{\partial v_l}{\partial \tau_l} \cdot \frac{\partial \tau_l}{\partial w_{s,t}^r(k)}    (2.60)

\frac{\partial v_l(k)}{\partial w_{s,t}^i(k)} = \frac{\partial v_l}{\partial \sigma_l} \cdot \frac{\partial \sigma_l}{\partial w_{s,t}^i(k)} + \frac{\partial v_l}{\partial \tau_l} \cdot \frac{\partial \tau_l}{\partial w_{s,t}^i(k)}    (2.61)

(e) Computational Complexity of the BPTT, RTRL, DEKF and CRTRL Algorithms: Let S be the number of states, W the number of synaptic weights and L the length of the training sequence. The computational complexity [Haykin 2003] of these algorithms may be summarized as in Table 2.1.
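As referenced earlier, the following is a minimal numerical sketch of the EKF weight updates of Eqs. 2.40-2.42. It is illustrative only: the Jacobian H(n) is approximated here by finite differences rather than computed via RTRL, and the function names and toy linear model are invented:

```python
import numpy as np

def ekf_weight_update(w, P, d, h, eta=1.0, q=1e-4, eps=1e-6):
    """One EKF measurement update for network weights (Eqs. 2.40-2.42).
    h(w) returns the network output for the current input; H is the
    W x p Jacobian of h w.r.t. the weights, here by finite differences."""
    y = h(w)
    W, p = w.size, y.size
    H = np.zeros((W, p))
    for i in range(W):
        dw = np.zeros(W); dw[i] = eps
        H[i] = (h(w + dw) - y) / eps
    xi = d - y                                                # innovation, cf. Eq. 2.34
    K = P @ H @ np.linalg.inv(np.eye(p) / eta + H.T @ P @ H)  # gain with learning rate, Eq. 2.42
    w_new = w + K @ xi                                        # weight update, Eq. 2.40
    P_new = P - K @ H.T @ P + q * np.eye(W)                   # covariance with process noise, Eq. 2.41
    return w_new, P_new

# toy usage: estimate the weights of a one-output linear "network" h(w) = x . w
x = np.array([0.5, -1.0, 2.0])
h = lambda w: np.array([x @ w])
w, P = np.zeros(3), np.eye(3)
for _ in range(20):
    w, P = ekf_weight_update(w, P, d=np.array([1.5]), h=h)
```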

2.5 Probabilistic Neural Network (PNN)

The Probabilistic Neural Network, introduced by Donald Specht in 1988, is a feed-forward network with a one-pass training algorithm, used for classification and mapping of data. It is based on well-established statistical principles derived from the Bayes decision strategy and non-parametric kernel-based estimators of probability density functions. The basic operation performed by the PNN is an estimation of the probability density function of the features of each class from the provided training samples using a Gaussian kernel. These estimated densities are then used in a Bayes decision rule to perform the classification. An advantage of the PNN is that it is guaranteed to approach the Bayes-optimal decision surface provided that the class probability density functions are smooth and continuous [E. Elenberg].
The PNN is closely related to the Parzen-window pdf estimator. A PNN consists of several sub-networks, each of which is a Parzen-window pdf estimator for one of the classes. The input nodes are the set of measurements. The second layer consists of the Gaussian functions formed using the given set of data points as centers. The third layer performs an average operation of the outputs from the second layer for each class. The fourth layer performs a vote, selecting the largest value; the associated class label is then determined.
The PNN is found to be an excellent pattern classifier, outperforming other classifiers including backpropagation. However, it is not robust with respect to affine transformations of feature space, and this can lead to poor performance on certain data [N. Uma Maheswari 2009].
A Probabilistic Neural Network (PNN) is defined as an implementation of a statistical algorithm called kernel discriminant analysis, in which the operations are organized into a multilayered feed-forward network with four layers: input layer, pattern layer, summation layer and output layer. A PNN is predominantly a classifier, since it can map any input pattern to a number of classifications. The main advantages that distinguish the PNN are: a fast training process, an inherently parallel structure, guaranteed convergence to an optimal classifier as the size of the representative training set increases, and the ability to add or remove training samples without extensive retraining. Accordingly, a PNN learns more quickly than many neural network models and has had success on a variety of applications. Based on these facts and advantages, the PNN can be viewed as a supervised neural network that is capable of being used in system classification and pattern recognition [P. V. N. Rao 2005].


Figure 2.9: Architecture of Probabilistic Neural Network (PNN)

2.5.1 Architecture of a PNN Network

All PNN networks have four layers, as shown in Figure 2.9:

(a) Input layer: There is one neuron in the input layer for each predictor variable. In the case of categorical variables, N-1 neurons are used, where N is the number of categories. The input neurons (or processing before the input layer) standardize the range of the values by subtracting the median and dividing by the interquartile range. The input neurons then feed the values to each of the neurons in the hidden layer [P. V. N. Rao 2005] [Kumar 2009].
(b) Hidden layer: This layer has one neuron for each case in the training data set. The neuron stores the values of the predictor variables for the case along with the target value. When presented with the x vector of input values from the input layer, a hidden neuron computes the Euclidean distance of the test case from the neuron's center point and then applies the RBF kernel function using the sigma values. The resulting value is passed to the neurons


in the pattern layer [Haykin 2003] [Kumar 2009].


(c) Pattern layer or Summation layer: The next layer in the network is different for PNN networks and for GRNN networks. For
PNN networks there is one pattern neuron for each category of the
target variable. The actual target category of each training case is
stored with each hidden neuron; the weighted value coming out of
a hidden neuron is fed only to the pattern neuron that corresponds
to the hidden neurons category. The pattern neurons add the values for the class they represent.Hence, it is a weighted vote for that
category [P. V. N. Rao 2005] [Kumar 2009].
(d) Decision layer: The decision layer is different for PNN network.
For PNN networks, the decision layer compares the weighted votes
for each target category accumulated in the pattern layer and uses
the largest vote to predict the target category [P. V. N. Rao 2005]
[Kumar 2009].

2.5.2 Derivation of the Probabilistic Neural Network from the Parzen Windows Classifier
The development of the probabilistic neural network relies on Parzen windows classifiers. The Parzen windows method is a non-parametric procedure that synthesizes an estimate of a probability density function (pdf) by superposition of a number of windows, replicas of a function (often the Gaussian). The Parzen windows classifier takes a classification decision after calculating the probability density function of each class using the given training examples [P. V. N. Rao 2005] [Kumar 2009] [Haykin 2003] [Yegnanarayana 2003]. The multicategory classifier decision is expressed as follows:

p_k f_k > p_j f_j,  for all j \ne k    (2.62)
where p_k is the prior probability of occurrence of examples from class k and f_k is the estimated pdf of class k. The calculation of the pdf is performed with the following algorithm: add up the values of the d-dimensional Gaussians, evaluated at each training example, and scale the sum to produce the estimated probability density [P. V. N. Rao 2005] [Kumar 2009] [Haykin 2003]:

f_k(x) = \frac{1}{(2\pi)^{d/2} \sigma^d} \frac{1}{N_k} \sum_{i=1}^{N_k} \exp\left[-\frac{(x - x_{ki})^T (x - x_{ki})}{2\sigma^2}\right]    (2.63)

where x_{ki} is the d-dimensional i-th example from class k. The function f_k(x) is a sum of small multivariate Gaussian probability distributions centered at each training example. Using probability distributions allows generalization beyond the provided examples. The index k in f_k(x) indicates a possible difference in spread between the distributions of the classes. As the number of the training examples and their Gaussians increases, the estimated pdf approaches the true pdf of the training set [P. V. N. Rao 2005] [Kumar 2009] [Haykin 2003]. The classification decision is taken according to the inequality:

\sum_{i=1}^{N_k} \exp\left[-\frac{(x - x_{ki})^T (x - x_{ki})}{2\sigma^2}\right] > \sum_{i=1}^{N_j} \exp\left[-\frac{(x - x_{ji})^T (x - x_{ji})}{2\sigma^2}\right],  for all j \ne k    (2.64)
which is derived using prior probabilities calculated as the relative frequency of the examples in each class:

P_k = \frac{N_k}{N}    (2.65)

where N is the number of all training examples and N_k is the number of examples in class k. The Parzen windows classifier uses the entire training set of examples to perform classification; that is, it requires storage of the training set in the computer memory. The speed of computation is proportional to the training set size [P. V. N. Rao 2005] [Kumar 2009] [Haykin 2003].
As mentioned earlier, the probabilistic neural network is a direct continuation of the work on Bayes classifiers. The probabilistic neural network (PNN) learns to approximate the pdf of the training examples. More precisely, the PNN is interpreted as a function which approximates the probability density of the underlying distribution of the examples [P. V. N. Rao 2005] [Kumar 2009] [Haykin 2003]. The PNN consists of nodes allocated in three layers after the inputs. There is one pattern node for each training example. Each pattern node forms a product of the weight vector and the given example for classification, where the weights entering a node are from a particular example. After that, the product is passed through the activation function:

\exp\left[\frac{x^T w_{ki} - 1}{\sigma^2}\right]    (2.66)


Each summation node receives the outputs from the pattern nodes associated with a given class:

\sum_{i=1}^{N_k} \exp\left[\frac{x^T w_{ki} - 1}{\sigma^2}\right]    (2.67)

The output nodes are binary neurons that produce the classification decision

\sum_{i=1}^{N_k} \exp\left[\frac{x^T w_{ki} - 1}{\sigma^2}\right] > \sum_{i=1}^{N_j} \exp\left[\frac{x^T w_{ji} - 1}{\sigma^2}\right],  for all j \ne k    (2.68)
The only factor that needs to be selected for training is the smoothing factor, that is, the deviation \sigma of the Gaussian functions:
(a) too small a deviation causes a very spiky approximation which cannot generalize well;
(b) too large a deviation smooths out details.
An appropriate deviation is chosen by experiment [P. V. N. Rao 2005] [Kumar 2009] [Haykin 2003].
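The whole decision pipeline of Eqs. 2.63-2.68 reduces to a few lines of code. The following is a hedged sketch (the function name, toy data and smoothing factor are invented for illustration):

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma=1.0):
    """Parzen-window / PNN decision of Eqs. 2.63-2.64: sum a Gaussian kernel
    over each class's training examples and report the class with the
    largest kernel sum (the output-node vote of Eq. 2.68)."""
    scores = {}
    for k in np.unique(train_y):
        Xk = train_X[train_y == k]                        # pattern nodes of class k
        d2 = np.sum((Xk - x) ** 2, axis=1)                # (x - x_ki)^T (x - x_ki)
        scores[k] = np.sum(np.exp(-d2 / (2 * sigma**2)))  # summation node
    return max(scores, key=scores.get)                    # decision layer

# toy usage on two Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(pnn_classify(np.array([3.5, 4.2]), X, y, sigma=0.5))
```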

2.6 Self Organizing Feature Map (SOFM)

Self-organizing feature maps (SOFMs) are a type of neural network developed by Kohonen. Self-organizing maps (Figure 2.10) reduce dimensions by producing a map, usually of one or two dimensions, that plots the similarities of the data by grouping similar data items together. SOMs thus accomplish two things: they reduce dimensions and display similarities. An input vector is presented to the entire network of neurons during training. The neuron whose weight most closely matches that of the input vector is called the winning neuron, and this neuron's weight is modified to more closely match the incoming vector. The SOM uses an unsupervised learning method to map high-dimensional data into a 1-D, 2-D, or 3-D data space, subject to a topological ordering constraint. A major advantage is that the clustering produced by the SOM retains the underlying structure of the input space, while the dimensionality of the space is reduced. As a result, a neuron map is obtained with weights encoding the stationary probability density function p(X) of the input pattern vectors [T. Kohonen 2000].

Network Topology: The SOM consists of two main parts, the input layer and the output map; there is no hidden layer. The dimensionality of the input layer is not restricted, while the output map has dimensionality 1-D, 2-D, or 3-D. An example of a SOM is shown in Figure 2.10, and a planar array of neurons with its neighborhoods is shown in Figure 2.11.
The winning output node is determined by a similarity measure: either the Euclidean distance or the dot product between the two vectors. Here the Euclidean distance is considered. The norm adopted is

\|x - w_m\| = \min_i \|x - w_i\|    (2.69)

where w_m is the weight vector of the winning neuron m. For the weight vector of unit i the SOM weight update rule is given as

w_i(t+1) = w_i(t) + N_{mi}(t)[x(t) - w_i(t)]    (2.70)

where t is the time, x(t) is an input vector, and N_{mi}(t) is the neighborhood kernel around the winner unit m. The neighborhood function in this case is a Gaussian given by

N_{mi}(t) = \alpha(t) \exp\left[-\frac{\|r_i - r_m\|^2}{2\sigma^2(t)}\right]    (2.71)

where r_m and r_i are the position vectors of the winning node and of the neighborhood nodes, respectively. The learning-rate factor 0 < \alpha(t) < 1 decreases monotonically with time, and \sigma(t), which corresponds to the width of the neighborhood function, also decreases monotonically with time. Thus, the winning node undergoes the most change, while the neighborhood nodes furthest away from the winner undergo the least change (a training sketch based on these equations is given at the end of this section, after the quality-measures note).
Batch training: The incremental process defined above can be replaced by a batch computation version which is significantly faster [E. Alhoneiemi 1999]. The batch training algorithm is also iterative, but instead of presenting a single data vector to the map at a time, the whole data set is given to the map before any adjustments are made. In each training step, the data set is partitioned such that each data vector belongs to the neighborhood set of the map unit to which it is closest, the Voronoi set [E. Alhoneiemi 1999]. The sum of the vectors

Figure 2.10: Self Organizing Feature Map


in each Voronoi set is calculated as:

S_i(t) = \sum_{j=1}^{n_{V_i}} x_j    (2.72)

where n_{V_i} is the number of samples in the Voronoi set of unit i. The new values of the weight vectors are then calculated as:

w_i(t+1) = \frac{\sum_{j=1}^{m} N_{ij}(t) S_j(t)}{\sum_{j=1}^{m} n_{V_j} N_{ij}(t)}    (2.73)

where m is the number of map units and n_{V_j} is the number of samples falling into the Voronoi set V_j.
Quality Measures: Although the issue of SOM quality is not a
simple one, two evaluation criteria, resolution and topology preservation, are commonly used [Kaski 1997]. Other methods are available in [Kaski 1997] and [Bauer 1992].
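As referenced above, an incremental SOM training loop built on Eqs. 2.69-2.71 can be sketched as follows (the decay schedules, map size and toy data are assumptions made for illustration):

```python
import numpy as np

def train_som(X, map_shape=(8, 8), epochs=30, a0=0.5, s0=3.0, seed=0):
    """Minimal incremental SOM training after Eqs. 2.69-2.71: find the winner
    by Euclidean distance, then pull it and its map neighbours toward the
    input with a shrinking Gaussian neighborhood kernel."""
    rng = np.random.default_rng(seed)
    rows, cols = map_shape
    W = rng.random((rows * cols, X.shape[1]))        # weight vectors w_i
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(epochs):
        alpha = a0 * (1 - t / epochs)                # decaying learning rate
        sigma = s0 * (1 - t / epochs) + 0.5          # decaying neighborhood width
        for x in rng.permutation(X):
            m = np.argmin(np.linalg.norm(W - x, axis=1))   # winner, Eq. 2.69
            d2 = np.sum((grid - grid[m]) ** 2, axis=1)     # map-space distances
            N = alpha * np.exp(-d2 / (2 * sigma ** 2))     # kernel, Eq. 2.71
            W += N[:, None] * (x - W)                      # update, Eq. 2.70
    return W

# toy usage: map 3-D vectors onto a 2-D grid
X = np.random.default_rng(1).random((200, 3))
W = train_som(X)
```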

2.6.1 Competitive Learning and Self Organizing Feature Map (SOFM)


In competitive learning the neurons in a layer work on the principle of "winner takes all".


Figure 2.11: Topological neighborhood of Self Organizing Feature Map


There is a competition among the neurons to find the closest match to the input sample vector. While the competition is on, the connection weights are updated, and the neuron with the closest similarity is assigned the input sample. SOFMs (or SOMs) use competitive learning to perform a host of tasks, which include pattern matching, recognition and feature mapping [T. Kohonen 2000]. The following are among the methods used for training and weight updating of SOMs as part of competitive learning [Kumar 2009]:
(a) Inner Product and
(b) Euclidean Distance Based Competition.
A detailed account of the two methods may be given as below [Haykin 2003] [Kumar 2009]:
(a) Inner Product and Euclidean Distance Based Competition: The training process of the SOM is linked with the creation of a codebook, which provides the means of finding the best match among the set of input patterns given to it. There are two ways in which the best-matching codebook vector can be found. The first way is to employ an inner-product criterion: select the best reference vector by choosing the neuron in the competitive layer that receives the maximum activation. This means that for the current input vector X_k, we compute all neuron activations

y_j = X_k^T W_j,  j = 1, ..., m    (2.74)

and the winning neuron index, J, satisfies

y_J = \max_j \{X_k^T W_j\}    (2.75)

Alternatively, one might select the winner based on a Euclidean distance measure. Here the distance is measured between the present input X_k and the weight vectors W_j, and the winning neuron index J satisfies

\|X_k - W_J\| = \min_j \{\|X_k - W_j\|\}    (2.76)
It is misleading to think that these two methods of selecting the winning neuron are entirely distinct [323]. To see this, assume that the weight equinorm property holds for all weight vectors:

\|W_1\| = \|W_2\| = ... = \|W_m\|    (2.77)

We now rework the condition given in Eq. 2.76 as follows [323]:

\|X_k - W_J\|^2 = \min_j \{\|X_k - W_j\|^2\}    (2.78)

(X_k - W_J)^T (X_k - W_J) = \min_j \{(X_k - W_j)^T (X_k - W_j)\}    (2.79)

\|X_k\|^2 - 2X_k^T W_J + \|W_J\|^2 = \min_j \{\|X_k\|^2 - 2X_k^T W_j + \|W_j\|^2\}    (2.80)

-2X_k^T W_J + \|W_J\|^2 = \min_j \{-2X_k^T W_j + \|W_j\|^2\}    (2.81)

We now subtract \|W_J\|^2 from both sides of Eq. 2.81 and invoke the weight equinorm assumption. This yields:

-2X_k^T W_J = \min_j \{-2X_k^T W_j\}    (2.82)

or,

X_k^T W_J = \max_j \{X_k^T W_j\}    (2.83)

which is identical to the criterion given in Eq. 2.75: neuron J wins if its weight vector correlates maximally with the impinging input. In the case of ART 1 F2 neurons, we have seen that a binary choice network requires neuronal excitatory self-feedback, lateral inhibition and a faster-than-linear signal function. In computer simulations, however, one may simply choose the neuron index with maximum activity or minimum distance.

(b) A Generalized Competitive Learning Law: Competitive learning requires that the weight vector of the winning neuron be made to correlate more with the input vector. This is done by perturbation of only the winning weight vector W_J = (w_{1J}, ..., w_{nJ})^T towards the input vector. The scalar implementation of this learning law in difference form is presented below:

w_{iJ}^{k+1} = w_{iJ}^{k} + \eta x_i^k,  i = 1, ..., n    (2.84)

However, as we have seen in such forms of learning, the weights grow without bound and some form of normalization is required. By normalization we mean that the instar weights should add up to unity: \sum_{i=1}^{n} w_{iJ} = 1. This can be ensured by normalizing the inputs within the equation and incorporating an extra weight-subtraction term:

w_{iJ}^{k+1} = w_{iJ}^{k} + \eta \left( \frac{x_i^k}{\sum_j x_j^k} - w_{iJ}^{k} \right),  i = 1, ..., n    (2.85)

This equation leads to total weight normalization. To see this, we simply add all n equations (Eq. 2.85) to get the total weight update equation:

\bar{W}_J^{(k+1)} = \bar{W}_J^{(k)} + \eta (1 - \bar{W}_J^{(k)})    (2.86)

where \bar{W}_J^{(k)} = \sum_{i=1}^{n} w_{iJ}^k. It is clear from Eq. 2.86 that the weight sum eventually approaches a point on the unit hypersphere, S^n.
If the inputs are already normalized, then normalization is not required within the law and the weight update is direct:

w_{iJ}^{k+1} = w_{iJ}^{k} + \eta (x_i^k - w_{iJ}^{k}),  i = 1, ..., n    (2.87)

which effectively moves the J-th instar W_J in the direction of the input vector X_k.
We have written this equation for a winning neuron with index J. In general, one may re-write this equation for all neurons in the field as:

w_{ij}^{k+1} = w_{ij}^{k} + \eta (x_i^k - w_{ij}^{k}),  i = 1, ..., n,  j = J
w_{ij}^{k+1} = w_{ij}^{k},  i = 1, ..., n,  j \ne J    (2.88)

where J = \arg\max_j \{y_j^k\}. Based on this, one can generalize further and write out the following standard competitive form in discrete time:

w_{ij}^{k+1} = w_{ij}^{k} + \eta s_j^k (x_i^k - w_{ij}^{k}),  i = 1, ..., n,  j = 1, ..., m    (2.89)

where s_j^k = 1 only for j = J and is zero otherwise, for a hard competitive field (a small sketch of this hard competitive update closes the section).
With these ideas in hand we now study three important classes of
neural network models that have their foundation in competitive
learning.
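As referenced above, a single hard competitive-learning step can be sketched as follows (the learning rate, seed and array shapes are invented for illustration); the assertion also checks the equinorm equivalence of Eqs. 2.77-2.83:

```python
import numpy as np

def competitive_step(W, x, eta=0.1):
    """One hard competitive-learning step after Eqs. 2.88-2.89: the winner J
    moves toward the input; all other weight vectors stay unchanged."""
    J = np.argmin(np.linalg.norm(W - x, axis=1))   # Euclidean winner, Eq. 2.76
    s = np.zeros(len(W)); s[J] = 1.0               # s_j = 1 only for j = J
    return W + eta * s[:, None] * (x - W)          # Eq. 2.89

# With equal-norm weight rows, the inner-product winner (Eq. 2.75)
# coincides with the Euclidean winner (Eq. 2.76), per Eqs. 2.77-2.83.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)      # enforce the equinorm property
x = rng.normal(size=3)
assert np.argmax(W @ x) == np.argmin(np.linalg.norm(W - x, axis=1))
W = competitive_step(W, x)
```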
