• Heteroassociative Networks
Pattern Classification
• Pattern is an object, process or event
• Class is a set of patterns having same attributes/features
• Classes are assigned to each pattern.
• It is required to capture the implicit relation among the patterns of the same class.
• When a test pattern is given, the corresponding output class label is retrieved.
• The test patterns belonging to a class are not the same as the patterns used in the
training, although they may originate from the same source.
• Pattern classification problems are said to belong to the category of supervised learning.
• Classification can be considered as a special class of heteroassociation.
Pattern Mapping
• A set of input patterns and the corresponding output patterns are given
• The objective is to capture the implicit relationship between the input and
output patterns
• When a test input pattern is given, the corresponding output pattern is
retrieved.
• The system performs some kind of generalization as opposed to memorizing
the information.
• The test patterns are not the same as the training patterns, although they
may originate from the same source.
Temporal Patterns
• Human beings are able to capture effortlessly the dynamic features
present in a sequence of patterns.
• This happens in any dynamic scene situation as in a movie on a
television.
• It requires handling multiple static patterns simultaneously.
• It looks for changes in the features of the subpatterns in adjacent
pattern pairs.
ANN
• An ANN mimics a small part of the brain to perform a specific task, rather than
modelling the total human brain, which is impossible.
• Properties of an ANN:
• Nonlinearity: an ANN is an interconnection of nonlinear neurons, so the
nonlinearity is distributed over the whole network.
• i/p and o/p mapping
• Learning ability
• Supervised learning
• Supervised learning: many inputs are given, outputs are fetched, and training
continues until the correct output is achieved.
• A teacher tells the network the desired output for each set of inputs.
• The free parameters are adjusted so that the inputs produce the desired
o/p.
• The difference between the actual output and the desired output is used to
correct the free parameters.
• The pattern is thus recognized with the help of a teacher.
Unsupervised learning
Biological Neural Networks
• The features of the biological neural network are attributed to its structure
and function.
• The fundamental unit of the network is called a neuron or a nerve cell.
• It consists of a cell body or soma where the cell nucleus is located.
• Tree like nerve fibres called dendrites are associated with the cell body.
These dendrites receive signals from other neurons.
• Extending from the cell body is a single long fibre called the axon, which
eventually branches into strands and substrands connecting to many other
neurons at the synaptic junctions, or synapses.
• The receiving ends of these junctions on other cells can be found both
on the dendrites and on the cell bodies themselves.
• The axon of a typical neuron leads to a few thousand synapses
associated with other neurons.
• The transmission of a signal from one cell to another at a synapse is a complex chemical process in which
specific transmitter substances are released from the sending side of the junction.
• The effect is to raise or lower the electrical potential inside the body of the receiving cell. If this potential
reaches a threshold, an electrical activity in the form of short pulses is generated.
• When this happens, the cell is said to have fired. These electrical signals of fixed strength and duration are
sent down the axon.
• Generally the electrical activity is confined to the interior of a neuron, whereas the chemical mechanism
operates at the synapses.
• The dendrites serve as receptors for signals from other neurons, whereas the axon serves as transmitter for
the generated neural activity to other nerve cells (inter-neuron) or to muscle fibres (motor neuron).
• A third type of neuron, which receives information from muscles or sensory organs, such as the eye or ear, is
called a receptor neuron.
• The size of the cell body of a typical neuron is approximately in the range 10-80 µm
• The dendrites and axons have diameters of the order of a few µm.
• The gap at the synaptic junction is about 200 nm wide.
• The total length of a neuron varies from 0.01 mm for internal neurons in the human
brain up to 1 m for neurons in the limbs.
• In the state of inactivity the interior of the neuron, the protoplasm, is negatively charged
against the surrounding neural liquid containing positive Sodium (Na+) ions.
• The resulting resting potential of about - 70 mV is supported by the action of the cell
membrane, which is impenetrable for the positive Sodium ions.
• This causes a deficiency of positive ions in the protoplasm.
• Signals arriving from the synaptic connections may result in a temporary depolarization
of the resting potential.
• As the potential increases above - 60 mV, the membrane loses its impermeability
against Na+ ions, which enter into the protoplasm and reduce the potential
difference.
• This change in the membrane potential causes the neuron to discharge signal.
Then the neuron is said to have fired.
• After that, membrane gradually recovers its original properties and regenerates the
resting potential over a period of several milliseconds.
• During this period, the neuron remains incapable of further excitation.
• The discharge, which initially occurs in the cell body, propagates as a signal along
the axon to the synapses.
• The intensity of the signal is encoded in the frequency of the sequence of pulses of
activity, which can range from about 1 to 100 per second.
• The speed of propagation of the discharge signal in the cells of the
human brain is about 0.5-2 m/s.
• The discharge signal travelling along the axon stops at the synapses, as
there exists no conducting link to the next neuron.
• Transmission of the signal across the synaptic gap is mostly effected by
chemical activity.
• When the signal arrives at the presynaptic nerve terminal, special
substances called neurotransmitters are produced in tiny amounts.
• The neurotransmitter molecules travel across the synaptic junction
reaching the postsynaptic neuron within about 0.5 ms.
• These substances modify the conductance of the postsynaptic
membrane for certain ions, causing a polarization/depolarization of the
postsynaptic potential.
• If the induced polarization potential is positive, the synapse is termed
excitatory, because the influence of the synapse tends to activate the
postsynaptic neuron.
• If the polarization potential is negative, the synapse is called inhibitory,
since it counteracts excitation of the neuron.
• All the synaptic endings of an axon are either of an excitatory or an
inhibitory nature.
• The cell body of a neuron acts as a summing device due to the net depolarizing effect of its input signals.
• This net effect decays within 5-10 ms.
• If several signals arrive within such a period, their excitatory effects accumulate.
• When the total magnitude of the depolarization potential in the cell body exceeds the critical threshold (~10
mV), the neuron fires.
• The activity of a given synapse depends on the rate of the arriving signals.
• An active synapse, which repeatedly triggers the activation of its postsynaptic neuron, will grow in strength,
while others will gradually weaken. Thus the strength of a synaptic connection gets modified continuously.
• This mechanism of synaptic plasticity in the structure of neural connectivity, known as Hebb's rule, appears to
play a dominant role in the complex process of learning.
• Different types of neurons having different size and degree of branching of their dendritic
trees, the length of their axons, and other structural details exist.
• The complexity of the human central nervous system is due to the vast number of the
neurons and their mutual connections.
• In the human cortex every neuron receives a converging input from about 10⁴ synapses.
• Each cell feeds its output into many hundreds of other neurons.
• The total number of neurons in the human cortex is near 10¹¹, distributed at a roughly
constant density of about 15 × 10⁴ neurons per mm².
• There exists a total of about 10¹⁵ synaptic connections in the human brain, the majority of
which develop during the first few months after birth.
ANN: Models of Neuron
• In 1943 Warren McCulloch and Walter Pitts proposed a model of
computing element, called McCulloch-Pitts neuron.
• It performs a weighted sum of the inputs to the element followed by a
threshold logic operation.
• The main drawback of this model of computation is that the weights
are fixed and hence the model could not learn from examples.
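The weighted sum followed by threshold logic can be sketched as follows; the AND-gate weights and threshold are illustrative values fixed by hand, since this model cannot learn them:

```python
def mcculloch_pitts(inputs, weights, threshold):
    """McCulloch-Pitts neuron: weighted sum of the inputs followed by threshold logic."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# A 2-input AND gate with hand-picked, fixed weights and threshold
assert mcculloch_pitts([1, 1], [1, 1], threshold=2) == 1
assert mcculloch_pitts([1, 0], [1, 1], threshold=2) == 0
```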
Binary classification
Binary classification is the task of classifying the elements of a given set into two groups (e.g. classifying
whether an image depicts a cat or a dog) based on a prescribed rule.
Linear Separability
• Rosenblatt’s perceptron can handle only classification tasks for linearly separable classes.
• The decision function depends linearly on the inputs ai, hence the name Linear Classifier.
• The neuron takes an extra constant input θ, associated to the synaptic weight also known as the
bias.
• The bias is simply the negative of the activation threshold.
• The synaptic weights wₖ are not restricted to unity, thus allowing some inputs to have more
influence onto the neuron’s output than others.
• They are not restricted to be strictly positive either. Some of the inputs can hence have an
inhibitory influence.
• The algorithm converges to a useful set of synaptic weights.
• Assume that the mᵗʰ example aₘ belongs to class sₘ = 0 and that the perceptron correctly predicts ŷₘ = 0. In
this case, the weight correction is given by
Δwᵢ = η · 0 · aᵢ, since ε = b − s = 0,
so we do not change the weights. The same applies to the bias.
• Similarly, if the mᵗʰ example aₘ belongs to class sₘ = 1 and the perceptron correctly predicts ŷₘ = 1, then the
weight correction is Δwᵢ = 0. The same applies again for the bias.
• Assume now that the mᵗʰ example aₘ belongs to class sₘ = 0 and that the perceptron wrongly predicts ŷₘ = 1.
In this case, the weight correction is given by
Δwᵢ = η · (−1) · aᵢ, since ε = b − s = 0 − 1 = −1.
• Finally, if the mᵗʰ example aₘ belongs to class sₘ = 1 and the perceptron wrongly predicts ŷₘ = 0, the weight
correction is
Δwᵢ = η · (+1) · aᵢ, since ε = b − s = 1 − 0 = 1.
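The four cases above collapse into a single error-driven rule, Δwᵢ = η ε aᵢ with ε = b − s. A minimal sketch (the function name and the step activation are my own choices):

```python
def perceptron_update(w, bias, a, b, eta=0.1):
    """One step of the error-driven rule: eps = b - s, dw_i = eta * eps * a_i."""
    s = 1 if sum(wi * ai for wi, ai in zip(w, a)) + bias > 0 else 0  # predicted class
    eps = b - s                                    # 0, +1 or -1: the four cases above
    w = [wi + eta * eps * ai for wi, ai in zip(w, a)]
    bias = bias + eta * eps                        # the bias follows the same rule
    return w, bias
```

When the prediction is correct, ε = 0 and nothing changes; when it is wrong, the weights move toward (or away from) the input pattern.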
The normalized weight vector associated with the jᵗʰ unit in F2 is: w_j = (w_j1, w_j2, ..., w_jM)ᵀ
The normalized input vector a at the F1 layer is: a = (a₁, a₂, ..., a_M)ᵀ
Whenever the input is given to F1, the jᵗʰ unit of F2 will be activated to the maximum extent.
Thus the operation of an instar can be viewed as content addressing the memory.
Let us consider two layers F1 and F2 with M and N processing
units, respectively.
The outstar network structure provides connections from the jᵗʰ unit
in the F2 layer to all the units in the F1 layer.
It has a fan-out geometry.
The weight vector for the connections from the jth unit in F2 approaches the activity pattern in F1, when
an input vector a is presented at F1.
During recall, whenever the unit j is activated, the signal pattern (s_j w_1j, s_j w_2j, ..., s_j w_Mj) is transmitted to
F1, where s_j is the output of the jᵗʰ unit.
This signal pattern then produces the original activity pattern corresponding to the input vector a,
although the input is absent.
Thus the operation can be viewed as memory addressing the contents.
When all the connections from the units in F1 to F2 are made, we obtain a heteroassociative
network.
This network can be viewed as a group of instars, if the flow is from Fl to F2.
On the other hand, if the flow is from F2 to F1, then the network can be viewed as a group of outstars.
When the flow is bidirectional, we get a bidirectional associative memory, where either of the layers
can be used as input and output.
If the two layers F1 and F2 coincide and the weights are symmetric, i.e. w_ij = w_ji, i ≠ j, then we
obtain an autoassociative memory in which each unit is connected to every other unit and to itself.
Adaline
• Adaptive Linear Element (ADALINE) is a computing model proposed
by Widrow.
• The main distinction between the Rosenblatt's perceptron model and
the Widrow's Adaline model is that:
• In the Adaline the analog activation value (x) is compared with the
target output (b).
• In other words, the output is a linear function of the activation value
(x).
Activation: x = Σ_{i=1}^{M} w_i a_i − θ
Output: s = f(x) = x
Error: δ = b − s = b − x
Weight change: Δw_i = η δ a_i
Update: w_{n+1} = w_n + Δw
• where η is the learning rate parameter.
• This weight update rule minimizes the mean squared error δ² = (b − x)²,
averaged over all inputs.
• Hence it is called Least Mean Squared (LMS) error learning law.
• This law is derived using the negative gradient of the error surface in the
weight space.
• Hence it is also known as a gradient descent algorithm.
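The LMS law above can be sketched as a training loop (the sample format, learning rate, and epoch count are illustrative choices):

```python
import random

def adaline_train(samples, M, eta=0.05, epochs=300):
    """LMS training: x = w.a - theta, s = x, delta = b - x, dw_i = eta*delta*a_i."""
    w = [random.uniform(-0.1, 0.1) for _ in range(M)]
    theta = 0.0
    for _ in range(epochs):
        for a, b in samples:
            x = sum(wi * ai for wi, ai in zip(w, a)) - theta  # linear activation
            delta = b - x                                     # error on the analog value
            w = [wi + eta * delta * ai for wi, ai in zip(w, a)]
            theta = theta - eta * delta                       # threshold follows the same rule
    return w, theta
```

Note that the error is computed on the analog activation x, not on a thresholded output, which is the key difference from the perceptron rule.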
Learning
• Synaptic dynamics is attributed to learning in a biological neural network.
• The synaptic weights are adjusted to learn the pattern information in the input
samples.
• Learning is a slow process, and the samples containing a pattern may have to be
presented to the network several times before the pattern information is captured
by the weights of the network.
• A large number of samples are normally needed for the network to learn the
pattern implicit in the samples.
• Pattern information is distributed across all the weights, and it is difficult to relate
the weights directly to the training samples.
• The only way to demonstrate the evidence of learning pattern information is
that, given another sample from the same pattern source, the network would
classify the new sample into the pattern class of the earlier trained samples.
• Another interesting feature of learning is that the pattern information is
slowly acquired by the network from the training samples, and the training
samples themselves are never stored in the network.
• The adjustment of the synaptic weights is represented by a set of learning
equations, which describe the synaptic dynamics of the network.
Requirements of learning laws
• The learning law should lead to convergence of weights.
• The learning or training time for capturing the pattern information from samples should be as small as
possible.
• The weights should be adjusted based on each sample containing the pattern information.
• Learning should use only the local information as far as possible. That is, the change in the weight on a
connecting link between two units should depend on the states of these two units only.
• It is possible to implement the learning law in parallel for all the weights for speeding up the learning process.
• Learning should be able to capture complex nonlinear mapping between input-output pattern pairs
• Learning should be able to capture as many patterns as possible into the network. That is, the pattern
information storage capacity should be as large as possible for a given network.
Categories of learning
• Supervised, reinforcement and unsupervised
• Off-line and on-line
• Deterministic, stochastic and fuzzy
• Discrete and continuous
• Criteria for learning
Types of Learning
Hebbian learning
• The Hebbian learning law (with a passive decay term) can be written as
ẇ_ij(t) = −w_ij(t) s_i + s_i s_j
where s_i = f_i(x_i(t)) is the output signal of the unit i, and s_j = f_j(x_j(t)) is the
output signal of the unit j.
• For a given input vector a, the output from each unit i is computed
using the weighted sum.
• Then the weight vector leading to the kᵗʰ unit is adjusted as follows:
w_{n+1} = w_n + Δw
• The final weight vector tends to represent a group of input vectors.
• This is a case of unsupervised learning.
• The values of the weight vectors are initialized to random values prior to
learning.
Outstar Learning Law
where the kth unit is the only active unit in the input layer.
The vector b = (bl, b2, ..., bM) is the desired response from the layer of M units.
The outstar learning is a supervised learning law, and it is used with a network of instars to capture
the characteristics of the input and output patterns for data compression.
In implementation, the weight vectors are initialized to zero.
Learning Mechanism
The nearest-neighbour rule assigns a test point to the class of the stored pattern that minimizes the distance:
min_i d(x_i, x_test)
• A test point lying near an outlier will be wrongly classified.
• So we consider a variant of this method.
• Let us consider a neighbourhood of k points.
• This method is known as the k-nearest-neighbour method.
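A minimal sketch of the k-nearest-neighbour variant (Euclidean distance and majority voting are the usual choices; the names and data are my own):

```python
from collections import Counter

def knn_classify(train, x_test, k=3):
    """Majority vote among the k training points nearest to x_test."""
    dist = lambda p, q: sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5
    nearest = sorted(train, key=lambda item: dist(item[0], x_test))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# An outlier of class 'B' sits near the test point; k = 3 outvotes it
train = [((0, 0), 'A'), ((0, 1), 'A'), ((1, 0), 'A'),
         ((5, 5), 'B'), ((5, 6), 'B'), ((0.4, 0.4), 'B')]
```

With k = 1 the test point (0.3, 0.3) is captured by the outlier; with k = 3 the two nearest 'A' points outvote it.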
Hebbian learning
• Hebb was a neurophysiologist.
• Hebb's postulate: if a presynaptic cell A repeatedly takes part in firing a postsynaptic cell B,
metabolic changes occur so that the efficiency of cell A in firing cell B increases.
• The process is very much time dependent: the activation of cells A and B must be synchronous.
The output of neuron k is the weighted sum
y_k = Σ_{j=1}^{M} w_kj x_j
and the Hebbian weight change is Δw_kj = η y_k x_j, so for a constant x_j the plot of Δw_kj against y_k is a straight line of slope η x_j.
• For a constant x_j we get a positive y_k, which contributes to Δw_kj.
• Covariance hypothesis (motivated by studies of the hippocampus): use deviations from the mean activities,
Δw_kj = η (y_k − ȳ)(x_j − x̄)
where x̄ and ȳ are time-averaged values of x_j and y_k.
• Competitive learning uses feedforward (excitatory) connections and feedback (inhibitory) connections.
• y_k = 1 if v_k > v_j for all j, j ≠ k; y_k = 0 otherwise
• Σ_j w_kj = 1 for all k
• v_j = Σ_i w_ji x_i
• Δw_kj = η (x_j − w_kj) if k is the winner; 0 otherwise,
subject to Σ_j w_kj = 1 for all k
• If w_kj increases in one iteration, in the next iteration it is most likely that k will again be the winner and
w_kj will increase.
• However, the increment will be less than in the previous iteration due to the above equation.
• w_kj is going to follow x_j.
• This is unsupervised learning.
• For different patterns, different neurons will be the winner.
w_k = (w_k1, w_k2, ..., w_kn)
x = (x_1, x_2, ..., x_n)
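One step of the winner-take-all rule above can be sketched as follows (a plain dot-product activation is assumed):

```python
def competitive_step(W, x, eta=0.1):
    """Winner-take-all: only the unit with the largest activation updates its weights."""
    v = [sum(wkj * xj for wkj, xj in zip(wk, x)) for wk in W]   # v_k = sum_j w_kj x_j
    k = max(range(len(W)), key=lambda i: v[i])                  # winning unit
    W[k] = [wkj + eta * (xj - wkj) for wkj, xj in zip(W[k], x)] # w_kj moves toward x_j
    return k, W
```

If the input components and each weight row both sum to 1, the update leaves Σ_j w_kj = 1 unchanged, consistent with the constraint above, since the total change is η(Σ_j x_j − Σ_j w_kj) = 0.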
Boltzmann learning
• Stochastic method, based on statistical mechanics.
• The neurons constitute a recurrent structure (feedback interconnection).
• Each neuron operates in binary mode (+1/−1).
• Energy function:
E = −(1/2) Σ_j Σ_{k, j≠k} w_kj x_j x_k
• Example with w_kj = 1:
E = −1/2 · (1) · (−1) · (1) = 1/2
E = −1/2 · (−1) · (−1) · (1) = −1/2
• Some neurons take part in the (visible) output layer,
whereas some neurons belong to the hidden layer.
• All the neurons are associated with some state +1/−1.
• Pick up any neuron, flip its state, and calculate the change in energy ΔE.
• The probability that a neuron will flip its state from x_k to −x_k is as follows:
P(x_k → −x_k) = 1 / (1 + exp(−ΔE_k / T))
• Δw_kj = η (ρ⁺_kj − ρ⁻_kj), j ≠ k
where ρ⁺_kj and ρ⁻_kj are the correlations in the clamped and free-running phases, respectively.
• Given the synaptic-weight vector w for the whole network, the probability that the visible neurons
are in state x_k is P(X_k = x_k).
• The log-likelihood function L(w) is formulated as follows:
L(w) = log Π P(X_k = x_k)
• Its gradient is
∂L(w)/∂w_kj = (1/T)(ρ⁺_kj − ρ⁻_kj), j ≠ k
• We may use gradient ascent to achieve that goal by writing
Δw_kj = ε ∂L(w)/∂w_kj = η (ρ⁺_kj − ρ⁻_kj), j ≠ k, with η = ε/T
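The energy function and flip probability can be sketched numerically; the two-neuron weight matrix is illustrative, and the energy here evaluates the double sum literally, so each (j, k) pair is counted in both orders:

```python
import math

def energy(x, W):
    """E = -1/2 * sum_j sum_{k != j} w_kj x_j x_k for a bipolar state vector x."""
    n = len(x)
    return -0.5 * sum(W[k][j] * x[j] * x[k]
                      for k in range(n) for j in range(n) if j != k)

def flip_probability(dE_k, T):
    """P(x_k -> -x_k) = 1 / (1 + exp(-dE_k / T))."""
    return 1.0 / (1.0 + math.exp(-dE_k / T))

# Two neurons with w_kj = 1; compare the energy of the states [1, -1] and [1, 1]
W = [[0.0, 1.0], [1.0, 0.0]]
```

At ΔE_k = 0 the flip probability is exactly 1/2, and the temperature T controls how sharply the probability saturates away from that point.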
Perceptron Convergence Algorithm
Variables and Parameters:
x(n) = (m + 1)-by-1 input vector
= [+1, x1(n), x2(n), ..., xm(n)]𝑇
w(n) = (m + 1)-by-1 weight vector
= [b, w1(n), w2(n), ..., wm(n)]ᵀ, where b is the bias
y(n) = actual response (quantized)
d(n) = desired response
η= learning-rate parameter, a positive constant less than unity
For the perceptron to function properly, the two classes c1 and c2 must be linearly
separable.
This means that the patterns to be classified must be sufficiently separated from each
other to ensure that the decision surface consists of a hyperplane.
• Fig. (a) is the example of a two-dimensional perceptron.
• The two classes c1 and c2 are sufficiently separated from each other to draw a
hyperplane (in this case, a straight line) as the decision boundary.
• If, however, the two classes c1 and c2 are allowed to move too close to each other,
as in Fig. (b), they become nonlinearly separable, a situation that is beyond the
computing capability of the perceptron.
Suppose the input variables of the perceptron originate from two linearly separable
classes. Let h1 be the subspace of training vectors x1(1), x1(2), ... that belong to
class c1, and let h2 be the subspace of training vectors x2(1), x2(2), ... that belong to
class c2.
The union of h1 and h2 is the complete space denoted by h.
• Given the sets of vectors h1 and h2 to train the classifier, the training process
involves the adjustment of the weight vector w in such a way that the two classes
c1 and c2 are linearly separable.
• That is, there exists a weight vector w such that we may state
𝑤 𝑇 𝑥 > 0 for every input vector x belonging to class c1
𝑤 𝑇 𝑥 ≤ 0 for every input vector x belonging to class c2
• In the second line, we have arbitrarily chosen to say that the input vector x belongs
to class c2 if 𝑤 𝑇 𝑥 = 0 .
• Given the subsets of training vectors h1 and h2, the training problem for the
perceptron is then to find a weight vector w such that the two inequalities are
satisfied.
• The algorithm for adapting the weight vector of the elementary perceptron may now be
formulated as follows:
1. If the nth member of the training set, x(n), is correctly classified by the weight vector
w(n) computed at the nth iteration of the algorithm, no correction is made to the weight
vector of the perceptron in accordance with the rule:
w(n+1)=w(n) if 𝑤 𝑇 𝑥 > 0 and x(n) belonging to class c1
w(n+1)=w(n) if 𝑤 𝑇 𝑥 ≤ 0 and x(n) belonging to class c2
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule
w(n+1)=w(n)-η(n)x(n) if 𝑤 𝑇 𝑥 > 0 and x(n) belonging to class c2
w(n+1)=w(n)+η(n)x(n) if 𝑤 𝑇 𝑥 ≤ 0 and x(n) belonging to class c1
where the learning-rate parameter η(n) controls the adjustment applied to the weight vector
at iteration n.
• If η(n)=η > 0, where η is a constant independent of the iteration number n, then we
have a fixed-increment adaptation rule for the perceptron.
• We first prove the convergence for η =1. The value of η is unimportant, so long as
it is positive. A value η ≠ 1 merely scales the pattern vectors without affecting
their separability.
• Initial condition: assume w(0) = 0. Suppose that wᵀ(n)x(n) ≤ 0 for n = 1, 2, ...,
while the input vector x(n) belongs to the subset h1; that is, the perceptron incorrectly
classifies the vectors x(1), x(2), ....
• For a solution vector w_o, define
α = min_{x(n) ∈ h1} w_oᵀ x(n)
Eq (2) is clearly in conflict with the earlier result of Eq. (1) for sufficiently
large values of n. We can state that n cannot be larger than some value nmax
for which Eqs. (1) and (2) are both satisfied with the equality sign. That is,
nmax is the solution of the equation
• Now η(n) = 1 for all n and w(0) = 0, and a solution wo exists. The rule for
adapting the synaptic weights of the perceptron must terminate after at most nmax
iterations.
• Surprisingly, this statement, proved for hypothesis h1, also holds for hypothesis
h2.
• We may now state the fixed-increment convergence theorem for the perceptron
(Rosenblatt, 1962).
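The fixed-increment adaptation rule of steps 1 and 2 can be sketched as a loop; the data format with the leading +1 component is as defined above, and the epoch cap is my own safeguard:

```python
def train_perceptron(h1, h2, eta=1.0, max_epochs=100):
    """Fixed-increment rule: add eta*x for a misclassified c1 vector,
    subtract eta*x for a misclassified c2 vector. Each x starts with +1 (bias input)."""
    w = [0.0] * len(h1[0])                 # initial condition w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x in h1:                       # class c1: require w^T x > 0
            if sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + eta * xi for wi, xi in zip(w, x)]; errors += 1
        for x in h2:                       # class c2: require w^T x <= 0
            if sum(wi * xi for wi, xi in zip(w, x)) > 0:
                w = [wi - eta * xi for wi, xi in zip(w, x)]; errors += 1
        if errors == 0:                    # converged: all patterns classified
            break
    return w

# A small linearly separable example (hypothetical data)
h1 = [[1, 2, 2], [1, 3, 1]]
h2 = [[1, -1, -1], [1, 0, -2]]
w = train_perceptron(h1, h2)
```

For linearly separable data the theorem guarantees the loop terminates after finitely many corrections, regardless of the positive value of η.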
(Figure: a single-layer linear associator; inputs x_k1, x_k2, ..., x_km feed outputs y_k1, y_k2, ..., y_km through weights w_ji(k).)
The output of the jᵗʰ unit for the kᵗʰ pattern pair is
y_kj = Σ_{i=1}^{m} w_ji(k) x_ki
In vector form,
y_kj = [w_j1(k), w_j2(k), ..., w_jm(k)] [x_k1, x_k2, ..., x_km]ᵀ
We can write m such equations, collected in matrix form:
[y_k1, y_k2, ..., y_km]ᵀ = W(k) [x_k1, x_k2, ..., x_km]ᵀ
where W(k) is the m × m matrix with elements w_ji(k).
The memory matrix is built incrementally: M_k = M_{k−1} + W(k)
How to estimate the weights
Correlation matrix memory
M = Σ_{k=1}^{q} y_k x_kᵀ
where y_k is an m × 1 vector, x_kᵀ is a 1 × m vector, and each outer product y_k x_kᵀ
(and hence M) is an m × m matrix.
Reformulating,
M = [y₁, y₂, ..., y_q] [x₁, x₂, ..., x_q]ᵀ = Y Xᵀ
where the columns of Y are the stored responses and the columns of X are the key vectors.
• Recursively,
M_k = M_{k−1} + y_k x_kᵀ, k = 1, 2, ..., q
(The recursion can be drawn as a signal-flow graph with a unit delay z⁻¹I in the feedback path.)
• If we feed x_j, the corresponding output is y = M x_j.
• If y = y_j, the recall is perfect.
• This requires each key vector to be normalized to unit energy:
‖x_k‖ = (x_kᵀ x_k)^{1/2} = 1
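A numeric sketch of M = YXᵀ and perfect recall with orthonormal (unit-energy, mutually orthogonal) keys; the key and response values are illustrative:

```python
import numpy as np

# Keys as columns of X (orthonormal), desired responses as columns of Y
X = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2)   # unit-energy, mutually orthogonal keys
Y = np.array([[2.0, 0.0],
              [1.0, 3.0]])                 # responses y_1, y_2

M = Y @ X.T                                # M = sum_k y_k x_k^T (sum of outer products)

y = M @ X[:, 0]                            # feed the first key x_1
```

Because M x_j = Σ_k y_k (x_kᵀ x_j) and the keys are orthonormal, every cross term vanishes and recall is exact.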
∇C = [∂C/∂w₁, ∂C/∂w₂, ..., ∂C/∂w_n]ᵀ
• Local iterative descent: starting from an initial weight vector w(0), generate a
sequence w(n), n = 1, 2, ....
• The change in the cost function is
ΔC(w(n)) = C(w(n+1)) − C(w(n)) ≈ gᵀ(n) Δw(n) + (1/2) Δwᵀ(n) H(n) Δw(n)
where g(n) is the gradient vector and H(n) is the Hessian matrix.
Hessian matrix:
H(n) = ∇²C(w(n)) is the n × n matrix of second partial derivatives with (i, j)ᵗʰ element ∂²C/∂w_i∂w_j:
⎡ ∂²C/∂w₁²      ∂²C/∂w₁∂w₂   ...  ∂²C/∂w₁∂w_n ⎤
⎢ ∂²C/∂w₂∂w₁   ∂²C/∂w₂²     ...  ∂²C/∂w₂∂w_n ⎥
⎢ ...                                          ⎥
⎣ ∂²C/∂w_n∂w₁  ∂²C/∂w_n∂w₂  ...  ∂²C/∂w_n²   ⎦
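For steepest descent, Δw(n) = −η g(n), so the first-order term gᵀ(n)Δw(n) = −η‖g(n)‖² is always non-positive. A minimal sketch on a quadratic cost (the cost function and rate are illustrative):

```python
def gradient_descent(grad, w, eta=0.1, steps=100):
    """Steepest descent: w(n+1) = w(n) - eta * g(n)."""
    for _ in range(steps):
        g = grad(w)                               # gradient at the current point
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Quadratic cost C(w) = (w1 - 1)^2 + (w2 + 2)^2, gradient [2(w1 - 1), 2(w2 + 2)]
w_min = gradient_descent(lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)], [0.0, 0.0])
```

On this convex cost the iterates contract geometrically toward the minimizer (1, −2).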
Associative memories
• Linear associative memories
Heteroassociative
Autoassociative
Hopfield network
Boltzmann machine
• Bidirectional associative memories
• Multidirectional associative memories
• Temporal associative memories
Associative Memory
• Pattern storage is an obvious pattern recognition task that is realized
using an ANN.
• This is a memory function, where the network is expected to store the
pattern information (not data).
• The patterns to be stored may be of spatial type or spatio-temporal
(pattern sequence) type.
• Typically, an ANN behaves like an associative memory, in which a
pattern is associated with another pattern, or with itself.
• This is in contrast with a random access memory, which maps an
address to data.
• An ANN functions as a content-addressable memory, where data is mapped onto an address.
• The pattern information is stored in the weight matrix of a feedback neural network.
• The stable states of the network represent the stored patterns, which can be recalled by
providing an external stimulus.
• If the weight matrix stores the given patterns, then it becomes an autoassociative memory.
• If the weight matrix stores the association between a pair of patterns, the network becomes
a bidirectional associative memory. This is called heteroassociation between the two
patterns.
• If the weight matrix stores multiple associations among several (> 2) patterns, then the
network becomes a multidirectional associative memory.
• If the weights store the associations between adjacent pairs of patterns in a sequence of
patterns, then the network is called a temporal associative memory.
Some desirable characteristics of associative memories are:
• The network should have a large capacity, i.e., ability to store a large number of
patterns or pattern associations.
• The network should be fault tolerant in the sense that damage to a few units or
connections should not affect the performance in recall significantly.
• The network should be able to recall the stored pattern or the desired associated
pattern even if the input pattern is distorted or noisy.
• The network performance as an associative memory should degrade only
gracefully due to damage to some units or connections, or due to noise or
distortion in the input.
• The network should be flexible to accommodate new patterns or associations
(within the limits of its capacity) and to be able to eliminate unnecessary
patterns or associations.
Bidirectional Associative Memory (BAM)
• The objective is to store a set of pattern pairs in such a way that
any stored pattern pair can be recalled by giving either of the
patterns as input.
• The network is a two-layer heteroassociative neural network that
encodes binary or bipolar pattern pairs (a_l, b_l) using Hebbian
learning.
• It can learn on-line and it operates in discrete time steps.
• The BAM weight matrix from the first layer to the second layer is
given by
W = Σ_{l=1}^{L} a_l b_lᵀ
The associated bipolar patterns of the input and output layers are
a_l ∈ {−1, +1}^M, b_l ∈ {−1, +1}^N, l = 1, ..., L
The associated binary patterns of the input and output layers are
p_l ∈ {0, 1}^M, q_l ∈ {0, 1}^N
BAM
The bipolar patterns are the corresponding bipolar representation of the binary values.
The weight matrix from the second layer to the first layer is given by
Wᵀ = Σ_{l=1}^{L} b_l a_lᵀ
The activation equations for the bipolar case are as follows:
b_j(m+1) = 1 if y_j > 0; b_j(m) if y_j = 0; −1 if y_j < 0, where y_j = Σ_{i=1}^{M} w_ji a_i(m)
a_i(m+1) = 1 if x_i > 0; a_i(m) if x_i = 0; −1 if x_i < 0, where x_i = Σ_{j=1}^{N} w_ij b_j(m)
The four possible state transitions of a unit are:
a_i = 1 → a_i = −1: change in a_i = −2
a_i = 1 → a_i = 1: change in a_i = 0
a_i = −1 → a_i = −1: change in a_i = 0
a_i = −1 → a_i = 1: change in a_i = 2
The change in energy due to a change Δa_i in a_i, and similarly due to a change Δb_j in b_j, can then be computed for each case.
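A sketch of BAM storage and bidirectional recall for one bipolar pair; the stored pair, the noisy key, and the synchronous update order are illustrative choices:

```python
import numpy as np

def bam_recall(W, a, n_iters=5):
    """Alternate b = sgn(W^T a) and a = sgn(W b); on a zero sum, keep the previous state."""
    b = np.ones(W.shape[1], dtype=int)
    for _ in range(n_iters):
        y = W.T @ a
        b = np.where(y > 0, 1, np.where(y < 0, -1, b))   # activation rule for b_j
        x = W @ b
        a = np.where(x > 0, 1, np.where(x < 0, -1, a))   # activation rule for a_i
    return a, b

# Store one bipolar pair with the Hebbian outer product W = a1 b1^T
a1 = np.array([1, -1, 1, -1])
b1 = np.array([1, 1, -1])
W = np.outer(a1, b1)

a_rec, b_rec = bam_recall(W, np.array([1, -1, 1, 1]))    # noisy key: last bit flipped
```

Starting from the corrupted version of a₁, one back-and-forth pass restores the stored pair, illustrating recall from either side.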
Cover's theorem:
A pattern classification problem is more likely to be
linearly separable in a higher-dimensional space than
in a lower-dimensional space.
• Consider a set H of N patterns (x₁, x₂, ..., x_N).
• Each pattern can be assigned to either H1 or H2.
• For each x ∈ H, define a vector of real-valued functions
φᵢ(x), i = 1, 2, ..., m₁
• These m₁ functions map each pattern x into the vector φ(x) = [φ₁(x), ..., φ_{m₁}(x)]ᵀ.
• The space spanned by
{φᵢ(x)}, i = 1, ..., m₁, is the hidden space (feature space).
• m₁ must be as high as possible.
• H1 and H2 are φ-separable if there exists an m₁-dimensional weight vector w such that
wᵀφ(x) > 0 for x ∈ H1
wᵀφ(x) < 0 for x ∈ H2
• wᵀφ(x) = 0 is the separating surface (a hyperplane in the feature space).
• The inverse image of this hyperplane in the x-space,
{x : wᵀφ(x) = 0}
gives a hypersurface, i.e. a nonlinearly separable decision function.
• A linear combination of r-wise products of the pattern coordinates gives the
hypersurface:
Σ_{0 ≤ i₁ ≤ i₂ ≤ ... ≤ m₀} a_{i₁i₂...i_r} x_{i₁} x_{i₂} ... x_{i_r} = 0, r < m₀ (the number of inputs)
• The products x_{i₁} x_{i₂} ... x_{i_r} are called dichotomies; C(m₀, r) such products are possible.
• The mapping φ from the input space to the hidden space is nonlinear.
• Mapping to a higher dimension improves separability, but a larger number of hidden units increases
the cost.
• X-OR problem: choose Gaussian basis functions
φ₁(x) = exp(−‖x − t₁‖²), with t₁ = (0, 0)
φ₂(x) = exp(−‖x − t₂‖²), with t₂ = (1, 1)
• The four input points (0,0), (0,1), (1,0), (1,1) are not linearly separable in the (x₁, x₂)
plane, but their images in the (φ₁, φ₂) plane are φ-separable: (0,0) and (1,1) map to
(1, e⁻²) and (e⁻², 1), while (0,1) and (1,0) both map to (e⁻¹, e⁻¹), so a straight line
separates the two classes.
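The XOR mapping can be checked numerically; the separating line φ₁ + φ₂ = 0.95 is one hand-picked choice:

```python
import math

def phi(x, t):
    """Gaussian basis function: exp(-||x - t||^2)."""
    return math.exp(-sum((xi - ti) ** 2 for xi, ti in zip(x, t)))

t1, t2 = (0.0, 0.0), (1.0, 1.0)
xor_points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def classify(x):
    """Linear decision in the (phi1, phi2) plane; the threshold 0.95 is hand-picked."""
    return 0 if phi(x, t1) + phi(x, t2) > 0.95 else 1
```

Class-0 points score φ₁ + φ₂ = 1 + e⁻² ≈ 1.135, class-1 points 2e⁻¹ ≈ 0.736, so a single line in the feature plane solves a problem no line in the input plane can.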
• Consider an m₀-dimensional input space and a sequence of random patterns X₁, X₂, ..., X_N.
• All possible dichotomies of H = {xᵢ}, i = 1, ..., N, are equiprobable.
• As N increases, what is the probability of separability?
• Let P(N, m₁) be the probability that a particular dichotomy picked at random is φ-separable, where the
class of separating surfaces chosen has m₁ degrees of freedom:
P(N, m₁) = 2^(1−N) Σ_{m=0}^{m₁−1} C(N−1, m) for N > m₁ − 1
P(N, m₁) = 1 for N ≤ m₁ − 1
• The higher we make the dimension m₁ of the hidden space, the closer the probability P(N, m₁) will be
to unity.
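P(N, m₁) can be evaluated directly; in particular, at N = 2m₁ it equals exactly 1/2:

```python
from math import comb

def P(N, m1):
    """Cover's probability that a random dichotomy of N patterns is separable
    with m1 degrees of freedom."""
    if N <= m1 - 1:
        return 1.0
    return 2.0 ** (1 - N) * sum(comb(N - 1, m) for m in range(m1))
```

The value 1/2 at N = 2m₁ is exactly the statement that the median of the largest separable N is 2m₁.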
• Let x₁, x₂, ..., x_N be a sequence of random patterns (vectors), and let N be a
random variable defined as the largest integer such that this sequence is
separable with m₁ degrees of freedom.
• Then the probability that N = n is given by
P(N = n) = P(n, m₁) − P(n + 1, m₁) = (1/2)ⁿ C(n − 1, m₁ − 1), n = 0, 1, 2, ...
• The expectation of the random variable N and its median are, respectively,
E(N) = 2m₁ and median(N) = 2m₁.
• The expected maximum number of randomly assigned patterns (vectors)
that are linearly separable in a space of dimensionality m₁ is therefore equal to 2m₁.
• For an interpretation of this result, recall the definition of a negative binomial
distribution: it equals the probability that k failures precede the rᵗʰ success
in a long, repeated sequence of Bernoulli trials.
• In such a probabilistic experiment there are only two possible outcomes for each
trial, success or failure, and their probabilities remain the same throughout the
experiment; let p and q denote the probabilities of success and failure, respectively,
with p + q = 1.
• P(N = n) above has the form of a negative binomial distribution.
• An RBF network performs a fitting task in a multidimensional space using its hidden layer.
• The hidden-layer units form basis functions that map the input to the hidden
layer:
h_j = φ_j(‖a − μ_j‖ / σ_j)
• The output of the kᵗʰ unit in the output layer of the network is given by
s_k = Σ_{j=0}^{J} w_kj h_j
where j = 1, 2, ..., J indexes the hidden units and h₀ = −1 is the output of the bias unit, so that w_k0
corresponds to the bias on the kᵗʰ output unit.
• The nonlinear basis function φ_j(·) of the jᵗʰ hidden unit is a function
of the normalized radial distance between the input vector
a = (a₁, a₂, ..., a_M)ᵀ and the weight (centre) vector μ_j = (μ_j1, μ_j2, ..., μ_jM)ᵀ.
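A forward-pass sketch; the Gaussian form of φ_j and the 1/2 factor in the exponent are assumptions, since the notes leave φ_j generic:

```python
import math

def rbf_forward(a, centres, sigmas, W):
    """h_j = phi(||a - mu_j|| / sigma_j) with a Gaussian phi (an assumption);
    h_0 = -1 is the bias unit, and the k-th output is s_k = sum_j w_kj h_j."""
    h = [-1.0]                                               # bias unit h_0 = -1
    for mu, sig in zip(centres, sigmas):
        r = math.sqrt(sum((ai - mi) ** 2 for ai, mi in zip(a, mu))) / sig
        h.append(math.exp(-0.5 * r * r))                     # Gaussian basis response
    return [sum(wkj * hj for wkj, hj in zip(wk, h)) for wk in W]
```

Each row of W holds (w_k0, w_k1, ..., w_kJ) for one output unit; inputs near a centre μ_j drive h_j toward 1, while distant inputs drive it toward 0.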