Module 2 Notes
MODULE 2
2.1 PERCEPTRON LEARNING AND NON-SEPARABLE SETS
When the set of training patterns is not linearly separable, then for any set of weights, 𝑊𝑘, there will exist
some training vector, 𝑋𝑘, such that 𝑊𝑘 misclassifies 𝑋𝑘. Consequently, the Perceptron learning algorithm
will continue to make weight changes indefinitely.
Theorem 2.1 Given a finite set of training patterns X, there exists a number M such that if we run the
Perceptron learning algorithm beginning with any initial set of weights, 𝑊1 , then any weight vector 𝑊𝑘
produced in the course of the algorithm will satisfy
‖𝑊𝑘 ‖ ≤‖𝑊1 ‖ + 𝑀
For a given problem, the set of weights that Perceptron learning visits is bounded. From the point of view of
the present discussion, the following corollaries are important:
Corollary 2.1 If, in a finite set of training patterns X, each pattern 𝑋𝑘 has integer (or rational) components 𝑥𝑖𝑘 ,
then the Perceptron learning algorithm will visit a finite set of distinct weight vectors {𝑊𝑘 }.
At present, there is no known good bound on the number of weight vectors that the Perceptron learning
algorithm can visit. The following corollary provides us with a test for non-separability.
Corollary 2.2 For a finite set of training patterns X, with individual patterns 𝑋𝑘 ; having integer (or rational)
components 𝑥𝑖𝑘 , the Perceptron learning algorithm will, in finite time:
1. produce a weight vector that correctly classifies all training patterns iff X is linearly separable, or
2. leave and revisit a specific weight vector iff X is linearly non-separable.
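Corollary 2.2 suggests a practical test: run Perceptron learning on integer-valued patterns, record the weight vectors visited, and watch for a revisit. A minimal sketch in Python (the bipolar training sets and the update rule used below are illustrative assumptions, not taken from the text):

```python
import numpy as np

def perceptron_separability_test(X, d, max_epochs=10000):
    """Run Perceptron learning on integer patterns X (each row includes
    the bias component) with bipolar targets d.  Returns 'separable' if
    some weight vector classifies everything correctly, 'non-separable'
    if a previously visited weight vector is revisited (Corollary 2.2)."""
    W = np.zeros(X.shape[1], dtype=int)
    visited = {tuple(W)}
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, d):
            if np.sign(W @ x) != t:        # misclassified (sign(0)=0 counts)
                W = W + t * x              # standard Perceptron update
                errors += 1
                key = tuple(W)
                if key in visited:         # left and revisited => non-separable
                    return 'non-separable'
                visited.add(key)
        if errors == 0:
            return 'separable'
    return 'undecided'

# Bipolar patterns with a leading +1 bias component (made-up test sets)
X = np.array([[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1]])
d_and = np.array([1, -1, -1, -1])          # AND: linearly separable
d_xor = np.array([-1, 1, 1, -1])           # XOR: not linearly separable
```

On the bipolar AND problem the algorithm converges, while on XOR it revisits a weight vector, signalling non-separability.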
In Perceptron learning, the prime objective was to achieve a linear separation of input patterns by attempting
to correct the classification error on a misclassified pattern in each iteration. The restriction there was that the
desired outputs had to be binary or bipolar (𝑑𝑘, 𝑠𝑘 ∈ {0, 1} or 𝑑𝑘, 𝑠𝑘 ∈ {−1, 1}), and that the pattern sets in
question be linearly separable.
Consider a training set of the form T = {(𝑋𝑘, 𝑑𝑘)}, 𝑑𝑘 ∈ ℝ and 𝑋𝑘 ∈ ℝⁿ⁺¹. To allow the desired output
to vary smoothly or continuously over some interval, the neuronal signal function is changed from binary
threshold to linear. The signal then equals the net activation of the neuron:
𝑠𝑘 = 𝑋𝑘ᵀ𝑊𝑘 ------------------(1)
The linear error 𝑒𝑘 due to a presented training pair (𝑋𝑘 , 𝑑𝑘 ), is the difference between the desired output 𝑑𝑘
and the neuronal signal 𝑠𝑘 :
𝑒𝑘 = 𝑑𝑘 − 𝑠𝑘 ------------------(2)
Substituting for 𝑠𝑘 , we have
𝑒𝑘 = 𝑑𝑘 − 𝑋𝑘𝑇 𝑊𝑘 ---------------(3)
Incorporating a linear error measure into the weight update procedure yields the α-least mean squared
(α-LMS) learning algorithm. The α-LMS algorithm applied to a single adaptive linear neuron shown in Fig.
2.1 embodies the minimal disturbance principle: it incorporates new information into the weight vector while
disturbing the information embedded by past learning to a minimal extent. The α-LMS algorithm has the
recursive update equation
𝑊𝑘+1 = 𝑊𝑘 + 𝜂𝑒𝑘 𝑋𝑘 ⁄ ‖𝑋𝑘‖² --------------------------------(4)
Here the weight vector 𝑊𝑘 is modified by the product of the scaled error and the normalized input vector.
Δ𝑊𝑘 = 𝜂𝑒𝑘 𝑋𝑘 ⁄ ‖𝑋𝑘‖² = (𝜂 ⁄ ‖𝑋𝑘‖) 𝑒𝑘 (𝑋𝑘 ⁄ ‖𝑋𝑘‖) = 𝜂̂𝑘 𝑒𝑘 𝑋̂𝑘 -------------------------(5)
where 𝑋̂𝑘 is a unit vector in the direction of 𝑋𝑘, η is the learning rate, and 𝜂̂𝑘 = 𝜂 ⁄ ‖𝑋𝑘‖ is a pattern-normalized
learning rate.
The weights are changed in the direction of 𝑋𝑘 as in Perceptron learning except that a unit vector 𝑋̂𝑘
is used in place of the original vector 𝑋𝑘. The learning rate is scaled from iteration to iteration by the magnitude
‖𝑋𝑘 ‖, of the applied vector. This makes the algorithm self-normalizing in the sense that larger magnitude
vectors do not dominate the weight update process.
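The α-LMS update of Eq. (4) can be sketched directly; the pattern, target and learning rate below are made-up values:

```python
import numpy as np

def alpha_lms_step(W, x, d, eta=0.5):
    """One alpha-LMS update: W <- W + eta * e * x / ||x||^2  (Eq. 4)."""
    e = d - x @ W                      # linear error, Eq. (3)
    return W + eta * e * x / (x @ x)

# A single step changes the error on the presented pattern by a
# factor of eta: e_new = (1 - eta) * e_old.
W = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])         # made-up input pattern
d = 2.0                                # made-up desired output
e_old = d - x @ W
W = alpha_lms_step(W, x, d, eta=0.5)
e_new = d - x @ W
```

Note that the division by ‖x‖² is exactly the self-normalization described above: large-magnitude patterns produce proportionally smaller weight changes.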
To understand how the update procedure works, note that the change in error for the pattern 𝑋𝑘 depends
on the extent of change in the weight vector, 𝑊𝑘:
Δ𝑒𝑘 = Δ(𝑑𝑘 − 𝑋𝑘ᵀ𝑊𝑘) = −𝑋𝑘ᵀΔ𝑊𝑘
However, substituting Δ𝑊𝑘 from Eq. (4),
Δ𝑒𝑘 = −𝑋𝑘ᵀ(𝜂𝑒𝑘 𝑋𝑘 ⁄ ‖𝑋𝑘‖²) = −𝜂𝑒𝑘
The error correction is proportional to the error itself, and each iteration reduces the error by a factor of η. The
choice of η controls the stability and speed of convergence. In general, stability is ensured if 0 < η < 2.
As patterns are presented sequentially and weights adapted in accordance with Eq. (4), the error
corresponding to a pattern gets reduced by a factor η. Although this may actually increase the error on some
other pattern, it is to be expected that after a sufficient number of weight updates the error would tend to
stabilize at a value that represents a minimum error of classification over all patterns presented.
It is important to understand the geometrical picture that emerges from recursive application of Eq. (4)
to a set of training patterns. Each weight update pushes the weight vector in the direction of the current pattern
in an attempt to reduce the error 𝑒𝑘 . This is illustrated in Fig. 2.2.
The extent of weight update is inversely proportional to the magnitude of the applied pattern.
Therefore, larger magnitude vectors induce a smaller effective weight change than smaller magnitude vectors.
There is an exception, though, for the case when input patterns have bipolar ±1 components. For a bipolar
pattern, ‖𝑋𝑘‖² (= n) is the same for all patterns and the constant factor 1⁄𝑛 could as well be absorbed into η.
Thus, the scaled learning rate 𝜂̂𝑘 remains fixed from iteration to iteration.
Fig 2.2 Weight Update takes place in the direction of Input Vector
For the case of two-valued inputs, if the patterns are assumed to be binary, then no adaptation occurs
in weights that are presented with a 0 input. On the other hand, with bipolar ±1 inputs all weights adapt in
every iteration. Convergence with bipolar vectors, thus, tends to be faster than the case of binary patterns. For
this reason, bipolar input patterns are preferred when two-valued patterns need to be handled.
𝑊𝑘+1 = 𝑊𝑘 + 𝜂(𝑑̂𝑘 − 𝑊𝑘ᵀ𝑋̂𝑘)𝑋̂𝑘
     = 𝑊𝑘 + 𝜂𝑒̂𝑘𝑋̂𝑘
where 𝑑̂𝑘 = 𝑑𝑘 ⁄ ‖𝑋𝑘‖ represents the normalized desired value for the normalized input pattern 𝑋̂𝑘,
and 𝑒̂𝑘 = 𝑑̂𝑘 − 𝑊𝑘ᵀ𝑋̂𝑘 is the error computed using the normalized training set.
where 𝑄ᵀ𝑉 = 𝑉′ rotates the V-space into a space with the eigenvectors of R as a basis. The MSE gradient in
V-space can be computed as
∇𝜀 = 𝑅𝑉
which defines the family of vectors in V-space.
Fig 2.3 A projection of the error function on the ε–𝒘𝒌𝒊 plane shows a point 𝒘̂𝒊 where the weight 𝒘𝒌𝒊 minimizes the error
This is a very useful observation, for it allows us to conclude that since the long-term average of ∇̃𝜀𝑘
approaches ∇𝜀, we can safely use ∇̃𝜀𝑘 as an unbiased estimate. That is what makes μ-LMS work! It follows
that since ∇̃𝜀𝑘 approaches ∇𝜀 in the long run, one could keep collecting ∇̃𝜀𝑘 for a sufficiently large number of
iterations (while keeping the weights fixed), and then make a weight change collectively for all those iterations
together. If the data set is finite (deterministic), then one can compute ∇𝜀 accurately by first collecting the
different ∇̃𝜀𝑘 gradients over all training patterns 𝑋𝑘 for the same set of weights. This accurate measure of the
gradient could then be used to change the weights. In this situation μ-LMS is identical to the steepest descent
algorithm.
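The equivalence with steepest descent on a finite data set can be checked numerically. A sketch, assuming the instantaneous squared-error measure 𝜀𝑘 = ½𝑒𝑘² and made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))           # made-up training patterns
d = rng.normal(size=20)                # made-up desired outputs
W = np.zeros(3)                        # weights held fixed while collecting

def inst_grad(W, x, t):
    """Instantaneous gradient of eps_k = 0.5*e_k^2 w.r.t. W, i.e. the
    single-pattern gradient that mu-LMS uses: -e_k * x."""
    return -(t - x @ W) * x

# Batch (steepest-descent) gradient: average of the instantaneous
# gradients collected over ALL patterns at the SAME weight vector.
batch_grad = np.mean([inst_grad(W, x, t) for x, t in zip(X, d)], axis=0)

# The same thing written in closed form from the MSE over the data set
exact_grad = -(X.T @ (d - X @ W)) / len(d)
```

The two expressions agree exactly, which is the sense in which accumulating ∇̃𝜀𝑘 at fixed weights recovers ∇𝜀.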
However, even if the data set is deterministic, we still use ∇̃𝜀𝑘 to update the weights. After all, if the
data set becomes large, collection of all the gradients becomes expensive in terms of storage. It is much easier
to just go ahead and use ∇̃𝜀𝑘. Be clear about the approximation made: we are estimating the true gradient
(which should be computed from E[𝜀𝑘]) by a gradient computed from the instantaneous sample error 𝜀𝑘.
In the deterministic case, we can justify this as follows: if the learning rate η is kept small, the weight
change in each iteration will be small and consequently the weight vector W will remain "somewhat
constant" over Q iterations, where Q is the number of patterns in the training set. Of course, this is provided
that Q is a small number. To see this, observe the total weight change 𝚫W over Q iterations from the kth
iteration:
𝚫W = −η(∇̃𝜀𝑘 + ∇̃𝜀𝑘+1 + ⋯ + ∇̃𝜀𝑘+𝑄−1) ≈ −ηQ∇𝜀
where ε denotes the mean-square error. Thus, the weight updates follow the true gradient on average.
The field of adaptive signal processing has benefitted immensely from the simple linear neuron (what Widrow
calls the adaptive linear combiner (ALC)) and the LMS algorithm. The ALC forms an integral component of
adaptive filters.
Digital signals are usually generated by sampling continuous time functions followed by analog to-
digital conversion. These signals are generally filtered using tapped delay line filters as shown in Fig. 2.4. A
sampled input is delayed through a series of delay elements. These n samples (including the current one) are
input to the ALC which generates an output by computing the inner product 𝑦𝑘 = 𝑋𝑘ᵀ𝑊𝑘, where
X = (𝑥𝑘, 𝑥𝑘−1, …, 𝑥𝑘−𝑛+1) and W = (𝑤1, …, 𝑤𝑛).
The filtered output is simply this inner product: a linear combination of current and past signal
samples. The LMS procedure is employed to adjust the weights over time, so that the output matches the
desired response. The ALC has been used extensively in applications such as adaptive equalization and noise
cancelling.
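The tapped-delay-line computation can be sketched as follows (the signal samples and filter weights are made-up values):

```python
import numpy as np

def alc_output(signal, W, k):
    """Output of the adaptive linear combiner at time k: the inner
    product of the weight vector with the n most recent samples
    x_k, x_{k-1}, ..., x_{k-n+1} held in the tapped delay line."""
    n = len(W)
    # Delay-line contents, zero-padded before the start of the signal
    X_k = np.array([signal[k - i] if k - i >= 0 else 0.0 for i in range(n)])
    return X_k @ W

signal = [1.0, 2.0, 3.0, 4.0]          # made-up sampled input
W = np.array([0.5, 0.25, 0.25])        # made-up filter weights
y3 = alc_output(signal, W, 3)          # 0.5*4 + 0.25*3 + 0.25*2 = 3.25
```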
A common problem in signal processing is the removal of noise 𝑛0 from a signal s. The goal is to pass
the signal and remove the noise. The adaptive noise cancelling approach employs an adaptive filter as its
integral component. This is shown in Fig. 2.5. This adaptive noise cancelation approach can be used only if a
reference signal is available that contains a noise component 𝑛1 that is correlated with the noise 𝑛0 . The
adaptive noise canceler subtracts the filtered reference signal from the noisy input, thereby making the output
of the canceler an error signal.
A simple argument shows that the filter can indeed adapt to cancel the noise rather easily. If we assume
that s, 𝑛0, 𝑛1 and y are statistically independent and stationary with zero means, the analysis becomes tractable.
For, with the canceller output 𝜀 = s + 𝑛0 − y,
E[𝜀²] = E[s²] + E[(𝑛0 − y)²]
since s is uncorrelated with (𝑛0 − y). Minimizing the output power E[𝜀²] therefore minimizes E[(𝑛0 − y)²],
and since 𝜀 − s = 𝑛0 − y, the mean squared error between the canceller output and the signal
E[(𝜀 − s)²] = E[(𝑛0 − y)²]
is also minimized. LMS adaptation of the filter causes the output ε to be the best least squares estimate of the
input signal s since the noise gets subtracted out.
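A minimal simulation of the canceller, assuming a single-weight LMS filter and a reference noise 𝑛1 whose scaled copy corrupts the signal (all values made up):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
s = np.sin(0.05 * np.arange(T))        # signal we want to recover
n1 = rng.normal(size=T)                # reference noise input n1
n0 = 0.8 * n1                          # noise n0 corrupting the signal,
                                       # correlated with the reference
primary = s + n0                       # primary input: signal + noise

w, eta = 0.0, 0.002                    # single-weight adaptive filter
out = np.empty(T)
for k in range(T):
    y = w * n1[k]                      # filtered reference signal
    eps = primary[k] - y               # canceller output = error signal
    w += 2 * eta * eps * n1[k]         # LMS weight update
    out[k] = eps
```

After adaptation the weight settles near the true noise gain (0.8 here), so the filtered reference cancels 𝑛0 and the canceller output tracks the clean signal s.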
The monitoring of fetal heart rates using electrocardiograms is an important application domain of
adaptive interference cancelling. A major problem is background noise interference due to muscle movement
with an amplitude which matches that of the fetal heartbeat. Another major source of interference is the
heartbeat of the mother, which has an amplitude much greater than that of the fetus (almost 2-10 times).
A series of experiments were conducted that targeted the cancelling of these interfering signals from
the fetal ECG. A set of four chest leads were used to provide a clear recording of the maternal heartbeat which
served as the reference input. A single abdominal lead comprising a mixture of the maternal and fetal ECG
signals served as the primary input. Chest leads provide a clear recording of the maternal ECG which is the
reference input. This reference input is adaptively filtered and subtracted from the fetal ECG signal. All signals
are filtered and digitized.
Fig 2.6(a) shows the reference input at a sampling rate of 256 Hz and a filtering bandwidth from 3 to
35 Hz. The abdominal lead recording in Fig. 2.6(b) shows the primary input where the maternal and fetal
heartbeats are mixed and vaguely discernible. Figure 2.6(c) shows the noise-cancelled output where the
maternal signal has been suppressed and the fetal signal is clearly visible.
Figure 2.7(a) shows the reference input now at a sampling rate of 512 Hz and a filtering bandwidth
from 0.3 to 75 Hz. The abdominal lead recording in Fig. 2.7(b) shows the primary input where the maternal
and fetal heartbeats are impossible to discern from one another. A strong 60 Hz interference can be seen in
the ECG along with baseline drift. The reference input was sufficient to cancel the noise as can be seen in Fig.
2.7(c) which shows the noise-cancelled output where the maternal and 60 Hz noise signals have been removed
and the fetal signal is once again clearly visible.
Fig 2.6 Recordings from the fetal ECG experiment at a bandwidth 3-35 Hz, sampling rate 256 Hz.
Fig 2.7 Recordings from the fetal ECG experiment at a bandwidth 0.3-75 Hz, sampling rate 512 Hz.
The primary application of TLNs is in data classification: they can successfully classify linearly
separable data sets. On the other hand, linear neurons perform some kind of a least squares fit of a given data
set, fitting linear functions to approximate non-linear ones. The computational capabilities of these single
neuron systems are limited by the nature of the signal function, and by the lack of a layered architecture.
Layering drastically increases the computational power of the system. A layered network of linear neurons
does not provide any additional computational capability. This is because a multi-layered linear neuron
network is equivalent to a single layer linear neuron network.
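This equivalence is easy to verify: composing two linear layers yields the single weight matrix W2·W1. A sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(4, 3))   # first (hidden) linear layer, made-up sizes
W2 = rng.normal(size=(2, 4))   # second (output) linear layer
x = rng.normal(size=3)

# Two linear layers applied in sequence ...
two_layer = W2 @ (W1 @ x)
# ... equal one linear layer whose weight matrix is the product W2 @ W1
one_layer = (W2 @ W1) @ x
```

Since the composite map is again linear, no amount of linear layering escapes the representational limits of a single linear layer; the sigmoidal hidden units of the next section are what add power.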
Figure 2.8 portrays the generic architecture of a multi-layered neural network. Here the input layer has
n linear neurons that receive real valued external inputs in the form of an n- dimensional vector in ℝ𝑛 . This
layer also includes an additional bias neuron (assigned an index 0) that receives no external input but generates
a +1 signal that feeds all bias connections of the neurons of the hidden layer. Similarly, the hidden layer has
q sigmoidal neurons that receive signals from the input layer. A bias neuron has been additionally included in
the hidden layer to generate a +1 signal for bias connections of the output layer neurons. The output layer
comprises p sigmoidal neurons. Neurons in different layers compute their signals, layer by layer, in a strictly
feedforward fashion. Network signals that emanate from the output layer of neurons comprise a p-dimensional
vector of real numbers. Through a sequence of internal transformations, the neural network maps an input
vector in ℝ𝑛 (the input space) to an output vector in ℝ𝑝 (the output space).
• Input layer neurons are linear, whereas neurons in the hidden and output layers have sigmoidal signal
function.
• Vectors and scalar variables will be respectively subscripted and superscripted by the iteration index
k.
• Network is homogeneous in the sense that all neurons use similar signal functions.
o For the neurons in the input layer,
S(x) = x
• The BP algorithm operates by sequentially presenting patterns drawn from a training set to a
predefined network architecture.
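A forward pass through the architecture of Fig. 2.8 can be sketched as follows (the layer sizes and the choice of the logistic signal function are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, Wh, Wo):
    """Forward pass of the n -> q -> p network of Fig. 2.8.
    Wh is q x (n+1) and Wo is p x (q+1); column 0 of each matrix holds
    the bias weights fed by the +1 bias neurons."""
    z = np.concatenate(([1.0], x))         # input-layer signals with bias
    h = sigmoid(Wh @ z)                    # hidden-layer sigmoidal signals
    zh = np.concatenate(([1.0], h))        # hidden signals with bias
    return sigmoid(Wo @ zh)                # output-layer sigmoidal signals

n, q, p = 3, 4, 2                          # made-up layer sizes
rng = np.random.default_rng(3)
y = forward(rng.normal(size=n),
            rng.normal(size=(q, n + 1)),
            rng.normal(size=(p, q + 1)))
```

The result is the p-dimensional output vector; with logistic units every component lies strictly inside (0, 1), which is why desired outputs must be kept inside that interval (see the data pre-processing discussion below).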
o For the sigmoidal neurons in the hidden and output layers, a common choice of signal function is
S(x) = a tanh(λx)
o Alternatively, an incorrect choice of weights might lead to network saturation where weight changes
are almost negligible over a large number of consecutive epochs.
o This may be incorrectly interpreted as a local minimum because weights might begin to change after
a large number of epochs.
o When neurons saturate, the signal values are close to the extremes 0 or 1 and the signal derivatives are
infinitesimally small.
o Since the weight changes are proportional to the signal derivative, saturated neurons generate weight
changes that are negligibly small.
o This is a major problem, because if the neuron outputs are incorrect (such as being close to 1 for a
desired output close to 0) these small weight changes will allow the neuron to escape from incorrect
saturation only after a very long time.
o Randomization of network weights can help avoid these problems.
o Use an offset in the computation of the signal slope: S′ = 0.1 + S(1 − S). This helps avoid having the
logistic neurons getting prematurely saturated.
o Discourage large weights: a weight decay term is introduced, and the classic BP algorithm is
modified so that each weight is scaled down by a factor marginally less than one before the usual
weight change is applied.
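The two heuristics above can be sketched as follows; the decay constant is an assumed value:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def slope_with_offset(S):
    """Offset signal slope S' = 0.1 + S(1 - S): stays bounded away from
    zero even when the neuron saturates (S near 0 or 1), so saturated
    neurons can still generate usable weight changes."""
    return 0.1 + S * (1.0 - S)

def decayed_update(w, delta_w, gamma=1e-4):
    """Weight-decay-modified BP step: shrink the weight slightly before
    applying the usual BP change delta_w (gamma is an assumed decay)."""
    return (1.0 - gamma) * w + delta_w

S_sat = logistic(20.0)                 # a deeply saturated signal, ~1.0
```

Without the offset, the slope S(1 − S) of a saturated neuron is essentially zero; with it, the effective slope never falls below 0.1.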
• Limitations to the performance of a neural network (or for that matter any machine learning technique in
general) can be partly attributed to the quality of data employed for training.
• Real world data sets are generally incomplete: they may lack feature values or may have missing features.
• Data is also often noisy, containing errors or outliers, and may be inconsistent.
• Alternatively, the data set might have many more features than are necessary for solving the classification
problem at hand, especially in application domains such as bioinformatics where features (such as
microarray gene expression values) can run in thousands, whereas only a mere handful of them usually
turn out to be essential.
• Further, feature values may have scales that are different from one another by orders of magnitude.
• It is therefore necessary to perform some kind of pre-processing on the data prior to using it for training
and testing purposes.
The main issues involved in data pre-processing, which are applicable to data mining methods in general and
neural networks in particular, are described below.
1. Data Cleaning:
• If a class label, target value, or a large number of attribute values are missing, it may be appropriate
to ignore the data entry.
• For a missing attribute value, one can use the mean of that attribute, averaged over the entire data
set, or averaged over the data belonging to that class of data (for a classification problem).
• Alternatively, one might use the most probable value to fill in the missing value as determined by
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
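Mean imputation of missing attribute values can be sketched as follows (the data matrix is made up, with NaN marking the missing entries):

```python
import numpy as np

def impute_with_mean(X):
    """Replace NaN entries of each column (attribute) with that
    attribute's mean over the observed values."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmean(col)   # nanmean ignores the NaNs
    return X

X = np.array([[1.0,    2.0],
              [np.nan, 4.0],
              [3.0,    np.nan]])
X_clean = impute_with_mean(X)          # NaNs -> column means 2.0 and 3.0
```

Class-conditional imputation, as mentioned above, would simply apply the same idea within each class of a classification data set.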
3. Data Scaling:
• The main issue in data transformation is normalization where one scales attribute values to fall
within a specified range.
• A common way of doing this is to employ the minimum and maximum values of an attribute
to perform the scaling.
• An attribute 𝑋𝑖 which is known to lie in the range [𝑥𝑖^min, 𝑥𝑖^max] will be scaled to 𝑥𝑖′ ∈ [0, 1] by using
the transformation 𝑥𝑖′ = (𝑥𝑖 − 𝑥𝑖^min) ⁄ (𝑥𝑖^max − 𝑥𝑖^min).
• Scaling is also done by using the mean and standard deviation of the data attribute when the
minimum and maximum values are unknown or when there are outliers.
• The scaling is done using the transformation: 𝑥𝑖′ = (𝑥𝑖 − 𝑥̅𝑖) ⁄ 𝜎𝑥𝑖.
• One obvious point concerns the range of values of the inputs and outputs of training data. Although
there is no requirement for the input to be in the range [0, 1], it is often beneficial to scale the inputs
appropriately (for example into the range [-1, 1]) since this can speed up training and reduce
chances of getting stuck in local minima.
• Desired outputs should lie well within the neuronal signal range.
• For example, given a logistic signal function which lies in the interval (0,1) the desired outputs of
patterns in the entire training set should lie in an interval [0 + ∈, 1 - ∈] where ∈ > 0 is some small
number.
• Desired values of 0 and 1 would cause the weights to grow increasingly large, because generating
these limiting output values requires an activation of −∞ or +∞, which can only be approached by
increasing the magnitudes of the weights.
• In addition, the algorithm will obviously not converge if desired outputs lie outside the achievable
interval (0,1).
• Alternatively, one can try to adjust the signal function to accommodate the ranges of the target
values, although this may require some adjustments at the algorithmic level.
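The two scaling transformations described above can be sketched as follows (the attribute values are made up):

```python
import numpy as np

def min_max_scale(x):
    """Scale an attribute into [0, 1] using its minimum and maximum:
    x' = (x - x_min) / (x_max - x_min)."""
    return (x - x.min()) / (x.max() - x.min())

def z_score_scale(x):
    """Scale using the mean and standard deviation, useful when the
    min/max are unknown or the attribute contains outliers."""
    return (x - x.mean()) / x.std()

x = np.array([10.0, 20.0, 40.0])       # made-up attribute values
mm = min_max_scale(x)                  # lies in [0, 1]
zs = z_score_scale(x)                  # zero mean, unit standard deviation
```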
4. Reduction of Features:
• Feature reduction is of paramount importance when the number of features is very large (as in the
case of bioinformatics problems where the number of features runs into thousands) and also when
there are redundant features.
• A very common approach to feature set reduction is principal components analysis and linear
discriminant analysis.
• Feature set reduction can also be built into the learning mechanism of the network which can
automatically select the more important features while reducing or removing the influence of the
lesser important ones.
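A principal components reduction can be sketched with a plain SVD; the data set below is made up, with five features but only two underlying sources of variation:

```python
import numpy as np

def pca_reduce(X, m):
    """Project the rows of X onto the m principal components (the
    directions of largest variance), reducing the feature count to m."""
    Xc = X - X.mean(axis=0)                    # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T                       # scores on the top m components

rng = np.random.default_rng(4)
Z = rng.normal(size=(50, 2))                   # two hidden source variables
# Five redundant features, all linear combinations of the two sources
X = np.column_stack([Z[:, 0], Z[:, 1], Z[:, 0] + Z[:, 1],
                     Z[:, 0] - Z[:, 1], 2 * Z[:, 0]])
X2 = pca_reduce(X, 2)                          # 5 features reduced to 2
```

Because the five features are redundant, the two retained components carry essentially all of the variance.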
• If the learning rates are small enough, then the algorithm converges to the closest local minimum.
• Very small learning rates can lead to long training times.
• In addition, if the network learning is non-uniform and the learning is stopped before the network is
trained to an error minimum, some weights will have reached their final 'optimal' values while others
may not have. In such a situation, the network might perform well on some patterns and poorly on
others.
• If the error function can be approximated by a quadratic, then the following observation can be made
o An optimal learning rate will reach the error minimum in a single learning step.
o Rates that are lower will take longer to converge to the same solution.
o Rates that are larger but less than twice the optimal learning rate will converge to the error
minimum but only after much oscillation.
o Learning rates that are larger than twice the optimal value will diverge from the solution.
• There are a number of algorithms that attempt to adjust this learning rate somewhat optimally in order
to speed up conventional backpropagation.
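The four learning-rate regimes listed above can be checked on a one-dimensional quadratic error ε(w) = ½aw², for which the optimal rate is 1/a (the curvature a, the starting weight and the step counts below are made-up values):

```python
def gd_distance(eta, a=2.0, w0=1.0, steps=50):
    """Distance from the minimum (at w* = 0) of eps(w) = 0.5*a*w**2
    after gradient-descent steps w <- w - eta * a * w."""
    w = w0
    for _ in range(steps):
        w -= eta * a * w               # gradient of eps is a*w
    return abs(w)

eta_opt = 1.0 / 2.0                    # optimal rate for a = 2
```

Each step scales the distance to the minimum by |1 − ηa|, which makes the four regimes immediate: zero at η = ηopt, shrinking for η < 2ηopt (with oscillation above ηopt), growing beyond 2ηopt.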
o Given a suitable network architecture for the approximation problem at hand, the error ε would gradually
reduce from epoch to epoch.
o Eventually, the error stabilizes at some minimum, and there is no way to ascertain whether the minimum
reached in this way is local or global.
o In practice, various criteria can be employed to decide whether training should stop, each with its
own merits. These are described below.
1. In the simplest case, one can consider the absolute value of squared error averaged over one epoch,
𝜀𝑎𝑣 , and compare this value with a threshold value called the training tolerance. The value of this
threshold is typically 0.01 or may go as low as 0.0001, depending upon the application.
2. A related criterion employs the absolute rate of change of the mean squared error per epoch. Once
again, the threshold for this rate of change may be 0.1 to 0.001 per epoch with smaller values being
possible.
3. A different criterion employs the error gradient based on the rationale that as one approaches a local
or global minimum the magnitude of the gradient shrinks towards 0. One can stop the training process
if the Euclidean norm of the error gradient falls below a sufficiently small threshold. This criterion
requires computation of the gradient at the end of each epoch.
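The three stopping criteria can be combined into a single check; the threshold values below follow the ranges quoted above:

```python
def should_stop(epoch_errors, grad_norm,
                tol=0.01, rate_tol=0.001, grad_tol=1e-4):
    """Combine the three criteria: (1) small average epoch error,
    (2) small change in average error between epochs, (3) small
    Euclidean norm of the error gradient."""
    if epoch_errors and epoch_errors[-1] < tol:
        return True
    if len(epoch_errors) >= 2 and \
            abs(epoch_errors[-1] - epoch_errors[-2]) < rate_tol:
        return True
    return grad_norm < grad_tol

# With error still high, still falling, and a large gradient: keep going
keep_training = should_stop([0.5, 0.2], grad_norm=0.3)
```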
2.4.7 Regularization
Over-training of a feedforward neural network can lead to over-fitting of data and consequently poor
predictive performance due to loss of generalization ability of the network.
Artificial neural networks need to be regularized during the training process to avoid such situations.
➢ Weight Decay:
▪ Over-fitted networks with a high degree of curvature are likely to contain weights with unusually
large magnitudes.
▪ Weight decay is a simple regularization technique which penalizes large magnitudes of weights by
including into the error function a penalty term that grows with weight magnitudes.
▪ For example, the sum of the squares of the weights (including the biases) of the entire network can
be multiplied by a decay constant which decides the extent to which the penalty term affects the
error function.
▪ With this, the learning process tends to favour lower magnitudes of weights and thus helps keep
the operating range of activations of neurons in the linear regime of the sigmoid.
➢ Cross-validation
▪ Cross-validation is a very effective approach when the number of samples in the data set is small
and splitting into the three subsets as discussed above is not feasible.
▪ A common approach is to use leave-out-one (LOO) cross-validation.
▪ In this, Q partitions of the data set are made (Q being the total number of training samples).
▪ In each partition, the network is trained on Q -1 samples, and tested on the one single sample that
is left out.
▪ This process is repeated such that each sample is used once for testing.
▪ Another variant called 10-fold cross-validation is also very commonly used when the number of
samples is larger (say more than 100).
▪ In this, the data set is partitioned randomly into ten different training-testing subsets, for example
with an 80%-training, 20%-test split.
▪ The network is trained on a training subset of one partition and then tested on the test subset of that
partition. This is repeated for all ten partitions.
▪ The final test error is the average of the ten test errors obtained.
▪ Cross-validation can be used to determine an appropriate network architecture as discussed in the
next subsection.
• Both the generalization and approximation ability of a feedforward neural network are closely related to
the architecture of the network (which determines the number of weights or free parameters in the network)
and the size of the training set.
• It is possible to have a situation where there are too many connections in the network and too few training
examples.
• In such a situation, the network might 'memorize' the training examples only too well, and may fail to
generalize properly because the number of training examples is insufficient to appropriately pin down all
the connection values in the network.
• In such a case, the network gets overtrained and loses its ability to generalize or interpolate correctly.
• The real problem is to find a network architecture that is capable of approximation and generalization
simultaneously.