Module 2 Notes
MODULE 2
2.1 PERCEPTRON LEARNING AND NON-SEPARABLE SETS
When the set of training patterns is not linearly separable, then for any set of weights, 𝑊𝑘, there will exist
some training vector, 𝑋𝑘, such that 𝑊𝑘 misclassifies 𝑋𝑘. Consequently, the Perceptron learning algorithm
will continue to make weight changes indefinitely.
Theorem 2.1 Given a finite set of training patterns X, there exists a number M such that if we run the
Perceptron learning algorithm beginning with any initial set of weights, 𝑊1 , then any weight vector 𝑊𝑘
produced in the course of the algorithm will satisfy
‖𝑊𝑘 ‖ ≤‖𝑊1 ‖ + 𝑀
For a given problem, the set of weights that Perceptron learning visits is bounded. From the point of view of
the present discussion, the following corollaries are important:
Corollary 2.1 If, in a finite set of training patterns X, each pattern 𝑋𝑘 has integer (or rational) components 𝑥𝑖𝑘 ,
then the Perceptron learning algorithm will visit a finite set of distinct weight vectors {𝑊𝑘 }.
At present, there is no known good bound on the number of weight vectors that the Perceptron learning
algorithm can visit. The following corollary provides us with a test for non-separability.
Corollary 2.2 For a finite set of training patterns X, with individual patterns 𝑋𝑘 ; having integer (or rational)
components 𝑥𝑖𝑘 , the Perceptron learning algorithm will, in finite time:
1. produce a weight vector that correctly classifies all training patterns iff X is linearly separable, or
2. leave and revisit a specific weight vector iff X is linearly non-separable.
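Corollary 2.2 suggests a practical test: run Perceptron learning on integer-valued patterns, record the weight vectors visited, and watch for a revisit. A minimal sketch in Python (the bipolar training sets and the update rule used below are illustrative assumptions, not taken from the text):

```python
import numpy as np

def perceptron_separability_test(X, d, max_epochs=10000):
    """Run Perceptron learning on integer patterns X (each row includes
    the bias component) with bipolar targets d.  Returns 'separable' if
    some weight vector classifies everything correctly, 'non-separable'
    if a previously visited weight vector is revisited (Corollary 2.2)."""
    W = np.zeros(X.shape[1], dtype=int)
    visited = {tuple(W)}
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, d):
            if np.sign(W @ x) != t:        # misclassified (sign(0)=0 counts)
                W = W + t * x              # standard Perceptron update
                errors += 1
                key = tuple(W)
                if key in visited:         # left and revisited => non-separable
                    return 'non-separable'
                visited.add(key)
        if errors == 0:
            return 'separable'
    return 'undecided'

# Bipolar patterns with a leading +1 bias component (made-up test sets)
X = np.array([[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1]])
d_and = np.array([1, -1, -1, -1])          # AND: linearly separable
d_xor = np.array([-1, 1, 1, -1])           # XOR: not linearly separable
```

On the bipolar AND problem the algorithm converges, while on XOR it revisits a weight vector, signalling non-separability.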
In Perceptron learning, the prime objective was to achieve a linear separation of input patterns by attempting
to correct the classification error on a misclassified pattern in each iteration. The restriction there was that the
desired outputs had to be binary or bipolar (𝑑𝑘, 𝑠𝑘 ∈ {0, 1} or 𝑑𝑘, 𝑠𝑘 ∈ {−1, 1}), and that the pattern sets in
question be linearly separable.
Consider a training set of the form T = {(𝑋𝑘, 𝑑𝑘)}, 𝑑𝑘 ∈ ℝ and 𝑋𝑘 ∈ ℝⁿ⁺¹. To allow the desired output
to vary smoothly or continuously over some interval, the neuronal signal function is changed from binary
threshold to linear. The signal then equals the net activation of the neuron:
𝑠𝑘 = 𝑋𝑘ᵀ𝑊𝑘 ------------------(1)
The linear error 𝑒𝑘 due to a presented training pair (𝑋𝑘 , 𝑑𝑘 ), is the difference between the desired output 𝑑𝑘
and the neuronal signal 𝑠𝑘 :
𝑒𝑘 = 𝑑𝑘 − 𝑠𝑘 ------------------(2)
Substituting for 𝑠𝑘 , we have
𝑒𝑘 = 𝑑𝑘 − 𝑋𝑘𝑇 𝑊𝑘 ---------------(3)
Incorporating a linear error measure into the weight update procedure yields the α-least mean squared
(α-LMS) learning algorithm. The α-LMS algorithm applied to a single adaptive linear neuron shown in Fig.
2.1 embodies the minimal disturbance principle: it incorporates new information into the weight vector while
disturbing the information embedded by past learning to a minimal extent. The α-LMS algorithm has the
recursive update equation
𝑊𝑘+1 = 𝑊𝑘 + 𝜂𝑒𝑘 𝑋𝑘 ⁄ ‖𝑋𝑘‖² --------------------------------(4)
Here the weight vector 𝑊𝑘 is modified by the product of the scaled error and the normalized input vector.
Δ𝑊𝑘 = 𝜂𝑒𝑘 𝑋𝑘 ⁄ ‖𝑋𝑘‖² = (𝜂 ⁄ ‖𝑋𝑘‖) 𝑒𝑘 (𝑋𝑘 ⁄ ‖𝑋𝑘‖) = 𝜂̂𝑘 𝑒𝑘 𝑋̂𝑘 -------------------------(5)
where 𝑋̂𝑘 is a unit vector in the direction of 𝑋𝑘, η is the learning rate, and 𝜂̂𝑘 = 𝜂 ⁄ ‖𝑋𝑘‖ is a pattern-normalized
learning rate.
The weights are changed in the direction of 𝑋𝑘 as in Perceptron learning except that a unit vector 𝑋̂𝑘
is used in place of the original vector 𝑋𝑘. The learning rate is scaled from iteration to iteration by the magnitude
‖𝑋𝑘 ‖, of the applied vector. This makes the algorithm self-normalizing in the sense that larger magnitude
vectors do not dominate the weight update process.
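The α-LMS update of Eq. (4) can be sketched directly; the pattern, target and learning rate below are made-up values:

```python
import numpy as np

def alpha_lms_step(W, x, d, eta=0.5):
    """One alpha-LMS update: W <- W + eta * e * x / ||x||^2  (Eq. 4)."""
    e = d - x @ W                      # linear error, Eq. (3)
    return W + eta * e * x / (x @ x)

# A single step changes the error on the presented pattern by a
# factor of eta: e_new = (1 - eta) * e_old.
W = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])         # made-up input pattern
d = 2.0                                # made-up desired output
e_old = d - x @ W
W = alpha_lms_step(W, x, d, eta=0.5)
e_new = d - x @ W
```

Note that the division by ‖x‖² is exactly the self-normalization described above: large-magnitude patterns produce proportionally smaller weight changes.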
To understand how the update procedure works, note that the change in error for the pattern 𝑋𝑘 depends
on the extent of change in the weight vector, 𝑊𝑘:
Δ𝑒𝑘 = Δ(𝑑𝑘 − 𝑋𝑘ᵀ𝑊𝑘) = −𝑋𝑘ᵀΔ𝑊𝑘
However, substituting Δ𝑊𝑘 from Eq. (4),
Δ𝑒𝑘 = −𝑋𝑘ᵀ(𝜂𝑒𝑘 𝑋𝑘 ⁄ ‖𝑋𝑘‖²) = −𝜂𝑒𝑘
The error correction is proportional to the error itself, and each iteration reduces the error by a factor of η. The
choice of η controls the stability and speed of convergence. In general, stability is ensured if 0 < η < 2.
As patterns are presented sequentially and weights adapted in accordance with Eq. (4), the error
corresponding to a pattern gets reduced by a factor η. Although this may actually increase the error on some
other pattern, it is to be expected that after a sufficient number of weight updates the error would tend to
stabilize at a value that represents a minimum error of classification over all patterns presented.
It is important to understand the geometrical picture that emerges from recursive application of Eq. (4)
to a set of training patterns. Each weight update pushes the weight vector in the direction of the current pattern
in an attempt to reduce the error 𝑒𝑘 . This is illustrated in Fig. 2.2.
The extent of weight update is inversely proportional to the magnitude of the applied pattern.
Therefore, larger magnitude vectors induce a smaller effective weight change than smaller magnitude vectors.
There is an exception, though, for the case when input patterns have bipolar ±1 components. For a bipolar
pattern, ‖𝑋𝑘‖² (= n) is the same for all patterns and the constant factor 1⁄𝑛 could as well be absorbed into η.
Thus, the scaled learning rate 𝜂̂𝑘 remains fixed from iteration to iteration.
Fig 2.2 Weight Update takes place in the direction of Input Vector
For the case of two-valued inputs, if the patterns are assumed to be binary, then no adaptation occurs
in weights that are presented with a 0 input. On the other hand, with bipolar ±1 inputs all weights adapt in
every iteration. Convergence with bipolar vectors, thus, tends to be faster than the case of binary patterns. For
this reason, bipolar input patterns are preferred when two-valued patterns need to be handled.
𝑊𝑘+1 = 𝑊𝑘 + 𝜂(𝑑̂𝑘 − 𝑊𝑘ᵀ𝑋̂𝑘)𝑋̂𝑘
     = 𝑊𝑘 + 𝜂𝑒̂𝑘𝑋̂𝑘
where 𝑑̂𝑘 = 𝑑𝑘 ⁄ ‖𝑋𝑘‖ represents the normalized desired value for the normalized input pattern 𝑋̂𝑘,
and 𝑒̂𝑘 = 𝑑̂𝑘 − 𝑊𝑘ᵀ𝑋̂𝑘 is the error computed using the normalized training set.
where 𝑄ᵀ𝑉 = 𝑉′ rotates the V-space into a space with the eigenvectors of R as a basis. The MSE gradient in
V-space can be computed as
∇𝜀 = 𝑅𝑉
which defines the family of vectors in V-space.
Fig 2.3 A projection of the error function on the ε–𝒘𝒌𝒊 plane shows a point 𝒘̂𝒊 where the weight 𝒘𝒌𝒊 minimizes the error
This is a very useful observation, for it allows us to conclude that since the long-term average of ∇̃𝜀𝑘
approaches ∇𝜀, we can safely use ∇̃𝜀𝑘 as an unbiased estimate. That is what makes μ-LMS work! It follows
that since ∇̃𝜀𝑘 approaches ∇𝜀 in the long run, one could keep collecting ∇̃𝜀𝑘 for a sufficiently large number of
iterations (while keeping the weights fixed), and then make a weight change collectively for all those iterations
together. If the data set is finite (deterministic), then one can compute ∇𝜀 accurately by first collecting the
different ∇̃𝜀𝑘 gradients over all training patterns 𝑋𝑘 for the same set of weights. This accurate measure of the
gradient could then be used to change the weights. In this situation μ-LMS is identical to the steepest descent
algorithm.
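The equivalence with steepest descent on a finite data set can be checked numerically. A sketch, assuming the instantaneous squared-error measure 𝜀𝑘 = ½𝑒𝑘² and made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))           # made-up training patterns
d = rng.normal(size=20)                # made-up desired outputs
W = np.zeros(3)                        # weights held fixed while collecting

def inst_grad(W, x, t):
    """Instantaneous gradient of eps_k = 0.5*e_k^2 w.r.t. W, i.e. the
    single-pattern gradient that mu-LMS uses: -e_k * x."""
    return -(t - x @ W) * x

# Batch (steepest-descent) gradient: average of the instantaneous
# gradients collected over ALL patterns at the SAME weight vector.
batch_grad = np.mean([inst_grad(W, x, t) for x, t in zip(X, d)], axis=0)

# The same thing written in closed form from the MSE over the data set
exact_grad = -(X.T @ (d - X @ W)) / len(d)
```

The two expressions agree exactly, which is the sense in which accumulating ∇̃𝜀𝑘 at fixed weights recovers ∇𝜀.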
However, even if the data set is deterministic, we still use ∇̃𝜀𝑘 to update the weights. After all, if the
data set becomes large, collection of all the gradients becomes expensive in terms of storage. It is much easier
to just go ahead and use ∇̃𝜀𝑘. Be clear about the approximation made: we are estimating the true gradient
(which should be computed from E[𝜀𝑘]) by a gradient computed from the instantaneous sample error 𝜀𝑘.
In the deterministic case, we can justify this as follows: if the learning rate η is kept small, the weight
change in each iteration will be small and consequently the weight vector W will remain "somewhat
constant" over Q iterations, where Q is the number of patterns in the training set. Of course, this is provided
that Q is a small number. To see this, observe the total weight change 𝚫W over Q iterations from the kth
iteration:
𝚫W = −η(∇̃𝜀𝑘 + ∇̃𝜀𝑘+1 + ⋯ + ∇̃𝜀𝑘+𝑄−1) ≈ −ηQ∇𝜀
where ε denotes the mean-square error. Thus, the weight updates follow the true gradient on average.
The field of adaptive signal processing has benefitted immensely from the simple linear neuron (what Widrow
calls the adaptive linear combiner (ALC)) and the LMS algorithm. The ALC forms an integral component of
adaptive filters.
Digital signals are usually generated by sampling continuous time functions followed by analog to-
digital conversion. These signals are generally filtered using tapped delay line filters as shown in Fig. 2.4. A
sampled input is delayed through a series of delay elements. These n samples (including the current one) are
input to the ALC which generates an output by computing the inner product 𝑦𝑘 = 𝑋𝑘ᵀ𝑊𝑘, where
X = (𝑥𝑘, 𝑥𝑘−1, …, 𝑥𝑘−𝑛+1) and W = (𝑤1, …, 𝑤𝑛).
The filtered output is simply this inner product: a linear combination of current and past signal
samples. The LMS procedure is employed to adjust the weights over time, so that the output matches the
desired response. The ALC has been used extensively in applications such as adaptive equalization and noise
cancelling.
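The tapped-delay-line computation can be sketched as follows (the signal samples and filter weights are made-up values):

```python
import numpy as np

def alc_output(signal, W, k):
    """Output of the adaptive linear combiner at time k: the inner
    product of the weight vector with the n most recent samples
    x_k, x_{k-1}, ..., x_{k-n+1} held in the tapped delay line."""
    n = len(W)
    # Delay-line contents, zero-padded before the start of the signal
    X_k = np.array([signal[k - i] if k - i >= 0 else 0.0 for i in range(n)])
    return X_k @ W

signal = [1.0, 2.0, 3.0, 4.0]          # made-up sampled input
W = np.array([0.5, 0.25, 0.25])        # made-up filter weights
y3 = alc_output(signal, W, 3)          # 0.5*4 + 0.25*3 + 0.25*2 = 3.25
```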
A common problem in signal processing is the removal of noise 𝑛0 from a signal s. The goal is to pass
the signal and remove the noise. The adaptive noise cancelling approach employs an adaptive filter as its
integral component. This is shown in Fig. 2.5. This adaptive noise cancelation approach can be used only if a
reference signal is available that contains a noise component 𝑛1 that is correlated with the noise 𝑛0 . The
adaptive noise canceler subtracts the filtered reference signal from the noisy input, thereby making the output
of the canceler an error signal.
A simple argument shows that the filter can indeed adapt to cancel the noise rather easily. If we assume
that s, 𝑛0, 𝑛1 and y are statistically independent and stationary with zero means, the analysis becomes tractable.
For, with the canceller output 𝜀 = s + 𝑛0 − y,
E[𝜀²] = E[s²] + E[(𝑛0 − y)²]
since s is uncorrelated with (𝑛0 − y). Minimizing the output power E[𝜀²] therefore minimizes E[(𝑛0 − y)²],
and since 𝜀 − s = 𝑛0 − y, the mean squared error between the canceller output and the signal
E[(𝜀 − s)²] = E[(𝑛0 − y)²]
is also minimized. LMS adaptation of the filter causes the output ε to be the best least squares estimate of the
input signal s since the noise gets subtracted out.
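A minimal simulation of the canceller, assuming a single-weight LMS filter and a reference noise 𝑛1 whose scaled copy corrupts the signal (all values made up):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
s = np.sin(0.05 * np.arange(T))        # signal we want to recover
n1 = rng.normal(size=T)                # reference noise input n1
n0 = 0.8 * n1                          # noise n0 corrupting the signal,
                                       # correlated with the reference
primary = s + n0                       # primary input: signal + noise

w, eta = 0.0, 0.002                    # single-weight adaptive filter
out = np.empty(T)
for k in range(T):
    y = w * n1[k]                      # filtered reference signal
    eps = primary[k] - y               # canceller output = error signal
    w += 2 * eta * eps * n1[k]         # LMS weight update
    out[k] = eps
```

After adaptation the weight settles near the true noise gain (0.8 here), so the filtered reference cancels 𝑛0 and the canceller output tracks the clean signal s.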
The monitoring of fetal heart rates using electrocardiograms is an important application domain of
adaptive interference cancelling. A major problem is background noise interference due to muscle movement
with an amplitude which matches that of the fetal heartbeat. Another major source of interference is the
heartbeat of the mother, which has an amplitude much greater than that of the fetus (almost 2-10 times).
A series of experiments were conducted that targeted the cancelling of these interfering signals from
the fetal ECG. A set of four chest leads were used to provide a clear recording of the maternal heartbeat which
served as the reference input. A single abdominal lead comprising a mixture of the maternal and fetal ECG
signals served as the primary input. Chest leads provide a clear recording of the maternal ECG which is the
reference input. This reference input is adaptively filtered and subtracted from the fetal ECG signal. All signals
are filtered and digitized.
Fig 2.6(a) shows the reference input at a sampling rate of 256 Hz and a filtering bandwidth from 3 to
35 Hz. The abdominal lead recording in Fig. 2.6(b) shows the primary input where the maternal and fetal
heartbeats are mixed and vaguely discernible. Figure 2.6(c) shows the noise-cancelled output where the
maternal signal has been suppressed and the fetal signal is clearly visible.
Figure 2.7(a) shows the reference input now at a sampling rate of 512 Hz and a filtering bandwidth
from 0.3 to 75 Hz. The abdominal lead recording in Fig. 2.7(b) shows the primary input where the maternal
and fetal heartbeats are impossible to discern from one another. A strong 60 Hz interference can be seen in
the ECG along with baseline drift. The reference input was sufficient to cancel the noise as can be seen in Fig.
2.7(c) which shows the noise-cancelled output where the maternal and 60 Hz noise signals have been removed
and the fetal signal is once again clearly visible.
Fig 2.6 Recordings from the fetal ECG experiment at a bandwidth 3-35 Hz, sampling rate 256 Hz.
Fig 2.7 Recordings from the fetal ECG experiment at a bandwidth 0.3-75 Hz, sampling rate 512 Hz.
The primary application of TLNs is in data classification: they can successfully classify linearly
separable data sets. On the other hand, linear neurons perform some kind of a least squares fit of a given data
set, fitting linear functions to approximate non-linear ones. The computational capabilities of these single
neuron systems are limited by the nature of the signal function, and by the lack of a layered architecture.
Layering drastically increases the computational power of the system. A layered network of linear neurons
does not provide any additional computational capability. This is because a multi-layered linear neuron
network is equivalent to a single layer linear neuron network.
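This equivalence is easy to verify: composing two linear layers yields the single weight matrix W2·W1. A sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(4, 3))   # first (hidden) linear layer, made-up sizes
W2 = rng.normal(size=(2, 4))   # second (output) linear layer
x = rng.normal(size=3)

# Two linear layers applied in sequence ...
two_layer = W2 @ (W1 @ x)
# ... equal one linear layer whose weight matrix is the product W2 @ W1
one_layer = (W2 @ W1) @ x
```

Since the composite map is again linear, no amount of linear layering escapes the representational limits of a single linear layer; the sigmoidal hidden units of the next section are what add power.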
Figure 2.8 portrays the generic architecture of a multi-layered neural network. Here the input layer has
n linear neurons that receive real valued external inputs in the form of an n- dimensional vector in ℝ𝑛 . This
layer also includes an additional bias neuron (assigned an index 0) that receives no external input but generates
a +1 signal that feeds all bias connections of the neurons of the hidden layer. Similarly, the hidden layer has
q sigmoidal neurons that receive signals from the input layer. A bias neuron has been additionally included in
the hidden layer to generate a +1 signal for bias connections of the output layer neurons. The output layer
comprises p sigmoidal neurons. Neurons in different layers compute their signals, layer by layer, in a strictly
feedforward fashion. Network signals that emanate from the output layer of neurons comprise a p-dimensional
vector of real numbers. Through a sequence of internal transformations, the neural network maps an input
vector in ℝ𝑛 (the input space) to an output vector in ℝ𝑝 (the output space).
• Input layer neurons are linear, whereas neurons in the hidden and output layers have sigmoidal signal
function.
• Vectors and scalar variables will be respectively subscripted and superscripted by the iteration index
k.
• Network is homogeneous in the sense that all neurons use similar signal functions.
o For the neurons in the input layer,
S(x) = x
• The BP algorithm operates by sequentially presenting patterns drawn from a training set to a
predefined network architecture.
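A forward pass through the architecture of Fig. 2.8 can be sketched as follows (the layer sizes and the choice of the logistic signal function are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, Wh, Wo):
    """Forward pass of the n -> q -> p network of Fig. 2.8.
    Wh is q x (n+1) and Wo is p x (q+1); column 0 of each matrix holds
    the bias weights fed by the +1 bias neurons."""
    z = np.concatenate(([1.0], x))         # input-layer signals with bias
    h = sigmoid(Wh @ z)                    # hidden-layer sigmoidal signals
    zh = np.concatenate(([1.0], h))        # hidden signals with bias
    return sigmoid(Wo @ zh)                # output-layer sigmoidal signals

n, q, p = 3, 4, 2                          # made-up layer sizes
rng = np.random.default_rng(3)
y = forward(rng.normal(size=n),
            rng.normal(size=(q, n + 1)),
            rng.normal(size=(p, q + 1)))
```

The result is the p-dimensional output vector; with logistic units every component lies strictly inside (0, 1), which is why desired outputs must be kept inside that interval (see the data pre-processing discussion below).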
o For the sigmoidal neurons in the hidden and output layers, a common choice of signal function is
S(x) = a tanh(λx)
o Alternatively, an incorrect choice of weights might lead to network saturation where weight changes
are almost negligible over a large number of consecutive epochs.
o This may be incorrectly interpreted as a local minimum because weights might begin to change after
a large number of epochs.
o When neurons saturate, the signal values are close to the extremes 0 or 1 and the signal derivatives are
infinitesimally small.
o Since the weight changes are proportional to the signal derivative, saturated neurons generate weight
changes that are negligibly small.
o This is a major problem, because if the neuron outputs are incorrect (such as being close to 1 for a
desired output close to 0) these small weight changes will allow the neuron to escape from incorrect
saturation only after a very long time.
o Randomization of network weights can help avoid these problems.
o Use an offset in the computation of the signal slope: S′ = 0.1 + S(1 − S). This helps avoid having the
logistic neurons getting prematurely saturated.
o Discourage large weights: a weight decay term is introduced, and the classic BP algorithm is
modified so that each weight is scaled down by a factor marginally less than one before the usual
weight change is applied.
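The two heuristics above can be sketched as follows; the decay constant is an assumed value:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def slope_with_offset(S):
    """Offset signal slope S' = 0.1 + S(1 - S): stays bounded away from
    zero even when the neuron saturates (S near 0 or 1), so saturated
    neurons can still generate usable weight changes."""
    return 0.1 + S * (1.0 - S)

def decayed_update(w, delta_w, gamma=1e-4):
    """Weight-decay-modified BP step: shrink the weight slightly before
    applying the usual BP change delta_w (gamma is an assumed decay)."""
    return (1.0 - gamma) * w + delta_w

S_sat = logistic(20.0)                 # a deeply saturated signal, ~1.0
```

Without the offset, the slope S(1 − S) of a saturated neuron is essentially zero; with it, the effective slope never falls below 0.1.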
• Limitations to the performance of a neural network (or for that matter any machine learning technique in
general) can be partly attributed to the quality of data employed for training.
• Real world data sets are generally incomplete: they may lack feature values or may have missing features.
• Data is also often noisy, containing errors or outliers, and may be inconsistent.
• Alternatively, the data set might have many more features than are necessary for solving the classification
problem at hand, especially in application domains such as bioinformatics where features (such as
microarray gene expression values) can run in thousands, whereas only a mere handful of them usually
turn out to be essential.
• Further, feature values may have scales that are different from one another by orders of magnitude.
• It is therefore necessary to perform some kind of pre-processing on the data prior to using it for training
and testing purposes.
The main issues involved in data pre-processing, which are applicable to data mining methods in general and
neural networks in particular, are described below.
1. Data Cleaning:
• If a class label, target value, or a large number of attribute values are missing, it may be appropriate
to ignore the data entry.
• For a missing attribute value, one can use the mean of that attribute, averaged over the entire data
set, or averaged over the data belonging to that class of data (for a classification problem).
• Alternatively, one might use the most probable value to fill in the missing value as determined by
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
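Mean imputation of missing attribute values can be sketched as follows (the data matrix is made up, with NaN marking the missing entries):

```python
import numpy as np

def impute_with_mean(X):
    """Replace NaN entries of each column (attribute) with that
    attribute's mean over the observed values."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmean(col)   # nanmean ignores the NaNs
    return X

X = np.array([[1.0,    2.0],
              [np.nan, 4.0],
              [3.0,    np.nan]])
X_clean = impute_with_mean(X)          # NaNs -> column means 2.0 and 3.0
```

Class-conditional imputation, as mentioned above, would simply apply the same idea within each class of a classification data set.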
3. Data Scaling:
• The main issue in data transformation is normalization where one scales attribute values to fall
within a specified range.
• A common way of doing this is to employ the minimum and maximum values of an attribute
to perform the scaling.
• An attribute 𝑋𝑖 which is known to lie in the range [𝑥𝑖^min, 𝑥𝑖^max] will be scaled to 𝑥𝑖′ ∈ [0, 1] by using
the transformation 𝑥𝑖′ = (𝑥𝑖 − 𝑥𝑖^min) ⁄ (𝑥𝑖^max − 𝑥𝑖^min).
• Scaling is also done by using the mean and standard deviation of the data attribute when the
minimum and maximum values are unknown or when there are outliers.
• The scaling is done using the transformation: 𝑥𝑖′ = (𝑥𝑖 − 𝑥̅𝑖) ⁄ 𝜎𝑥𝑖.
• One obvious point concerns the range of values of the inputs and outputs of training data. Although
there is no requirement for the input to be in the range [0, 1], it is often beneficial to scale the inputs
appropriately (for example into the range [-1, 1]) since this can speed up training and reduce
chances of getting stuck in local minima.
• Desired outputs should lie well within the neuronal signal range.
• For example, given a logistic signal function which lies in the interval (0,1) the desired outputs of
patterns in the entire training set should lie in an interval [0 + ∈, 1 - ∈] where ∈ > 0 is some small
number.
• Desired values of 0 and 1 would cause the weights to grow increasingly large, because generating
these limiting output values requires an activation of −∞ or +∞, which can only be approached by
increasing the magnitudes of the weights.
• In addition, the algorithm will obviously not converge if desired outputs lie outside the achievable
interval (0,1).
• Alternatively, one can try to adjust the signal function to accommodate the ranges of the target
values, although this may require some adjustments at the algorithmic level.
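The two scaling transformations described above can be sketched as follows (the attribute values are made up):

```python
import numpy as np

def min_max_scale(x):
    """Scale an attribute into [0, 1] using its minimum and maximum:
    x' = (x - x_min) / (x_max - x_min)."""
    return (x - x.min()) / (x.max() - x.min())

def z_score_scale(x):
    """Scale using the mean and standard deviation, useful when the
    min/max are unknown or the attribute contains outliers."""
    return (x - x.mean()) / x.std()

x = np.array([10.0, 20.0, 40.0])       # made-up attribute values
mm = min_max_scale(x)                  # lies in [0, 1]
zs = z_score_scale(x)                  # zero mean, unit standard deviation
```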
4. Reduction of Features:
• Feature reduction is of paramount importance when the number of features is very large (as in the
case of bioinformatics problems where the number of features runs into thousands) and also when
there are redundant features.
• A very common approach to feature set reduction is principal components analysis and linear
discriminant analysis.
• Feature set reduction can also be built into the learning mechanism of the network which can
automatically select the more important features while reducing or removing the influence of the
lesser important ones.
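A principal components reduction can be sketched with a plain SVD; the data set below is made up, with five features but only two underlying sources of variation:

```python
import numpy as np

def pca_reduce(X, m):
    """Project the rows of X onto the m principal components (the
    directions of largest variance), reducing the feature count to m."""
    Xc = X - X.mean(axis=0)                    # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T                       # scores on the top m components

rng = np.random.default_rng(4)
Z = rng.normal(size=(50, 2))                   # two hidden source variables
# Five redundant features, all linear combinations of the two sources
X = np.column_stack([Z[:, 0], Z[:, 1], Z[:, 0] + Z[:, 1],
                     Z[:, 0] - Z[:, 1], 2 * Z[:, 0]])
X2 = pca_reduce(X, 2)                          # 5 features reduced to 2
```

Because the five features are redundant, the two retained components carry essentially all of the variance.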
• If the learning rates are small enough, then the algorithm converges to the closest local minimum.
• Very small learning rates can lead to long training times.
• In addition, if the network learning is non-uniform and the learning is stopped before the network is
trained to an error minimum, some weights will have reached their final 'optimal' values while others
may not have. In such a situation, the network might perform well on some patterns and poorly on
others.
• If the error function can be approximated by a quadratic, then the following observation can be made
o An optimal learning rate will reach the error minimum in a single learning step.
o Rates that are lower will take longer to converge to the same solution.
o Rates that are larger but less than twice the optimal learning rate will converge to the error
minimum but only after much oscillation.
o Learning rates that are larger than twice the optimal value will diverge from the solution.
• There are a number of algorithms that attempt to adjust this learning rate somewhat optimally in order
to speed up conventional backpropagation.
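The four learning-rate regimes listed above can be checked on a one-dimensional quadratic error ε(w) = ½aw², for which the optimal rate is 1/a (the curvature a, the starting weight and the step counts below are made-up values):

```python
def gd_distance(eta, a=2.0, w0=1.0, steps=50):
    """Distance from the minimum (at w* = 0) of eps(w) = 0.5*a*w**2
    after gradient-descent steps w <- w - eta * a * w."""
    w = w0
    for _ in range(steps):
        w -= eta * a * w               # gradient of eps is a*w
    return abs(w)

eta_opt = 1.0 / 2.0                    # optimal rate for a = 2
```

Each step scales the distance to the minimum by |1 − ηa|, which makes the four regimes immediate: zero at η = ηopt, shrinking for η < 2ηopt (with oscillation above ηopt), growing beyond 2ηopt.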
o Given a suitable network architecture for the approximation problem at hand, the error ε would gradually
reduce from epoch to epoch.
o Eventually, the error stabilizes at some minimum, and there is no way to ascertain whether the minimum
reached in this way is local or global.
o In practice, various criteria can be employed to decide whether training should stop, each with its
own merits. These are described below.
1. In the simplest case, one can consider the absolute value of squared error averaged over one epoch,
𝜀𝑎𝑣 , and compare this value with a threshold value called the training tolerance. The value of this
threshold is typically 0.01 or may go as low as 0.0001, depending upon the application.
2. A related criterion employs the absolute rate of change of the mean squared error per epoch. Once
again, the threshold for this rate of change may be 0.1 to 0.001 per epoch with smaller values being
possible.
3. A different criterion employs the error gradient based on the rationale that as one approaches a local
or global minimum the magnitude of the gradient shrinks towards 0. One can stop the training process
if the Euclidean norm of the error gradient falls below a sufficiently small threshold. This criterion
requires computation of the gradient at the end of each epoch.
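The three stopping criteria can be combined into a single check; the threshold values below follow the ranges quoted above:

```python
def should_stop(epoch_errors, grad_norm,
                tol=0.01, rate_tol=0.001, grad_tol=1e-4):
    """Combine the three criteria: (1) small average epoch error,
    (2) small change in average error between epochs, (3) small
    Euclidean norm of the error gradient."""
    if epoch_errors and epoch_errors[-1] < tol:
        return True
    if len(epoch_errors) >= 2 and \
            abs(epoch_errors[-1] - epoch_errors[-2]) < rate_tol:
        return True
    return grad_norm < grad_tol

# With error still high, still falling, and a large gradient: keep going
keep_training = should_stop([0.5, 0.2], grad_norm=0.3)
```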
2.4.7 Regularization
Over-training of a feedforward neural network can lead to over-fitting of data and consequently poor
predictive performance due to loss of generalization ability of the network.
Artificial neural networks need to be regularized during the training process to avoid such situations.
➢ Weight Decay:
▪ Over-fitted networks with a high degree of curvature are likely to contain weights with unusually
large magnitudes.
▪ Weight decay is a simple regularization technique which penalizes large magnitudes of weights by
including into the error function a penalty term that grows with weight magnitudes.
▪ For example, the sum of the squares of the weights (including the biases) of the entire network can
be multiplied by a decay constant which decides the extent to which the penalty term affects the
error function.
▪ With this, the learning process tends to favour lower magnitudes of weights and thus helps keep
the operating range of activations of neurons in the linear regime of the sigmoid.
➢ Cross-validation
▪ Cross-validation is a very effective approach when the number of samples in the data set is small
and splitting into the three subsets as discussed above is not feasible.
▪ A common approach is to use leave-out-one (LOO) cross-validation.
▪ In this, Q partitions of the data set are made (Q being the total number of training samples).
▪ In each partition, the network is trained on Q -1 samples, and tested on the one single sample that
is left out.
▪ This process is repeated such that each sample is used once for testing.
▪ Another variant called 10-fold cross-validation is also very commonly used when the number of
samples is larger (say more than 100).
▪ In this, the data set is partitioned randomly into ten different training-testing subsets, for example
with an 80%-training, 20%-test split.
▪ The network is trained on a training subset of one partition and then tested on the test subset of that
partition. This is repeated for all ten partitions.
▪ The final test error is the average of the ten test errors obtained.
▪ Cross-validation can be used to determine an appropriate network architecture as discussed in the
next subsection.
• Both the generalization and approximation ability of a feedforward neural network are closely related to
the architecture of the network (which determines the number of weights or free parameters in the network)
and the size of the training set.
• It is possible to have a situation where there are too many connections in the network and too few training
examples.
• In such a situation, the network might 'memorize' the training examples only too well, and may fail to
generalize properly because the number of training examples is insufficient to appropriately pin down all
the connection values in the network.
• In such a case, the network gets overtrained and loses its ability to generalize or interpolate correctly.
• The real problem is to find a network architecture that is capable of approximation and generalization
simultaneously.