
Neural Network

Dr. Banani Basu


Assistant Professor, Dept of ECE
NIT Silchar
Why Neural Networks?
• The high computational capacity of modern computers supports NN implementation
• Wide applicability in various fields
What is NN?
• The human brain consists of a network of neurons
• A massively interconnected n/w
• Parallel connections
• Performs recognition and perception tasks
• Very fast
• An IC gate switches in about 1 ns, while a neuron takes about 1 ms to compute, so an individual neuron is much slower.
• However, the brain as a whole performs such tasks very fast because of its massive parallelism.
• The NN is massively parallel: about 10 billion neurons and 60 trillion interconnections
Introduction
• Artificial neural networks are motivated by our quest to solve
natural tasks by exploiting the developments in computer technology.

• The developments in Artificial Intelligence (AI) appear promising, but
when applied to real-world intelligent tasks such as speech, vision
and natural language processing, the AI techniques show their
inadequacies.

• They become highly task specific. AI techniques need a clear
specification and mapping of the given problem into a form suitable
for the techniques to be applicable.
• For example, in order to apply heuristic search methods, one needs to
map the problem as a search problem.
• The computing models inspired by biological neural networks may
provide new directions to solve problems arising in natural tasks.
• A NN would extract the relevant features from the input data and
perform a pattern recognition task by learning from examples, without
explicitly stating the rules for performing the task.
• We show that computing in intelligent tasks requires a distinction
between pattern and data.
• The main difference between human and machine intelligence is that humans
perceive everything as a pattern, whereas for a machine everything is data.
• Even humans tend to perceive a pattern in data like telephone numbers, bank
account numbers, car numbers, etc.
• Consider the name of a person written in a handwritten cursive script: even
though the individual patterns for each letter may not be evident, the name is
understood due to the visual hints provided in the written script.
• Likewise, speech is understood even though the patterns corresponding to the
individual sounds may be distorted, sometimes to unrecognizable extents.
• Human beings are capable of making mental patterns in their biological neural
network from an input data given in the form of numbers, text, picture, sounds,
etc., using their sensory mechanisms of vision, sound, touch, smell and taste.
• These mental patterns are formed even when the data is noisy or deformed due to variations
such as translation, rotation and scaling.
• Thus the pattern nature of storage automatically gives robustness and fault tolerance to the
human system during recall.
• If there is no pattern, then it is very difficult for a human being to remember and reproduce
the data later.
Pattern and Data
• Applications associated with speech, vision and natural language
processing have dominated the attention of designers.
• The most difficult issues in these cases are to derive the description of
the pattern in data in terms of symbols and to derive a set of rules
representing the knowledge of the problem domain.
• Even with highly complex algorithms, we cannot derive the
pattern description and knowledge completely for a given problem.
Pattern Recognition Tasks
• Human beings store and recall information in the form of patterns,
whereas machines do so in the form of data.
• Human beings are able to perform the pattern recognition tasks very
naturally and effortlessly.
• No simple algorithms exist to implement these tasks on a machine.
• The identification of these pattern recognition tasks is the basis of
the artificial neural network models
Pattern Association
• Pattern association involves storing a set of input-output pattern pairs (training data) and
when a test pattern is presented, the stored pattern corresponding to the test pattern is
recalled.
• This is purely a memory function for pattern pairs.
• It recalls the correct pattern even though the test pattern is noisy or incomplete.
• Human memory associates
 similar items,
 contrary/opposite items,
 items close in proximity,
 items close in succession (e.g., in a song), etc.
Types of Associative Networks
• Autoassociative Networks
• Heteroassociative Networks

Let xk denote a key pattern applied to an associative memory and yk denote a memorized pattern.
The key pattern xk acts as a stimulus that not only determines the storage location of the
memorized pattern yk, but also holds the key for its retrieval.
Pattern Classification
• Pattern is an object, process or event
• Class is a set of patterns having same attributes/features
• Classes are assigned to each pattern.
• It is required to capture the implicit relation among the patterns of the same class.
• When a test pattern is given, the corresponding output class label is retrieved.
• The test patterns belonging to a class are not the same as the patterns used in the
training, although they may originate from the same source.
• Pattern classification problems are said to belong to the category of supervised learning.
• Classification can be considered as a special class of heteroassociation.
Pattern Mapping
• A set of input patterns and the corresponding output patterns are given
• The objective is to capture the implicit relationship between the input and
output patterns
• When a test input pattern is given, the corresponding output pattern is
retrieved.
• The system performs some kind of generalization as opposed to memorizing
the information.
• The test patterns are not the same as the training patterns, although they
may originate from the same source.
Temporal Patterns
• Human beings are able to capture effortlessly the dynamic features
present in a sequence of patterns.
• This happens in any dynamic scene situation as in a movie on a
television.
• It requires handling multiple static patterns simultaneously
• It looks for changes in the features in the subpatterns in adjacent
pattern pairs.
ANN
• Mimics a small part of the brain to perform a specific task, rather than
modelling the total human brain, which is impossible
• ANN characteristics
• Nonlinearity
• Interconnection of nonlinear neurons
• Distributed nonlinearity
• i/p to o/p mapping
• Learning
• Supervised learning: many inputs are given and outputs fetched, and training
continues until the correct output is achieved
• A teacher tells the network the desired output for a set of inputs
• The free parameters are adjusted so that the network maps the i/p to the desired
o/p
• The difference between the desired o/p and the actual o/p is used to correct the free
parameters
• Learning ability
• The pattern is recognized with the help of a teacher
Unsupervised learning
Biological Neural Networks
• The features of the biological neural network are attributed to its structure
and function.
• The fundamental unit of the network is called a neuron or a nerve cell.
• It consists of a cell body or soma where the cell nucleus is located.
• Tree like nerve fibres called dendrites are associated with the cell body.
These dendrites receive signals from other neurons.
• Extending from the cell body is a single long fibre called the axon, which
eventually branches into strands and substrands connecting to many other
neurons at the synaptic junctions, or synapses.
• The receiving ends of these junctions on other cells can be found both
on the dendrites and on the cell bodies themselves.
• The axon of a typical neuron leads to a few thousand synapses
associated with other neurons.
• The transmission of a signal from one cell to another at a synapse is a complex chemical process in which
specific transmitter substances are released from the sending side of the junction.
• The effect is to raise or lower the electrical potential inside the body of the receiving cell. If this potential
reaches a threshold, an electrical activity in the form of short pulses is generated.
• When this happens, the cell is said to have fired. These electrical signals of fixed strength and duration are
sent down the axon.
• Generally the electrical activity is confined to the interior of a neuron, whereas the chemical mechanism
operates at the synapses.
• The dendrites serve as receptors for signals from other neurons, whereas the axon serves as transmitter for
the generated neural activity to other nerve cells (inter-neuron) or to muscle fibres (motor neuron).
• A third type of neuron, which receives information from muscles or sensory organs, such as the eye or ear, is
called a receptor neuron.
• The size of the cell body of a typical neuron is approximately in the range 10-80 µm
• The dendrites and axons have diameters of the order of a few µm.
• The gap at the synaptic junction is about 200 nm wide.
• The total length of a neuron varies from 0.01 mm for internal neurons in the human
brain up to 1 m for neurons in the limbs.
• In the state of inactivity the interior of the neuron, the protoplasm, is negatively charged
against the surrounding neural liquid containing positive Sodium (Na+) ions.
• The resulting resting potential of about - 70 mV is supported by the action of the cell
membrane, which is impenetrable for the positive Sodium ions.
• This causes a deficiency of positive ions in the protoplasm.
• Signals arriving from the synaptic connections may result in a temporary depolarization
of the resting potential.
• As the potential increases above - 60 mV, the membrane loses its impermeability
against Na+ ions, which enter into the protoplasm and reduce the potential
difference.
• This change in the membrane potential causes the neuron to discharge a signal.
The neuron is then said to have fired.
• After that, membrane gradually recovers its original properties and regenerates the
resting potential over a period of several milliseconds.
• During this period, the neuron remains incapable of further excitation.
• The discharge, which initially occurs in the cell body, propagates as a signal along
the axon to the synapses.
• The intensity of the signal is encoded in the frequency of the sequence of pulses of
activity, which can range from about 1 to 100 per second.
• The speed of propagation of the discharge signal in the cells of the
human brain is about 0.5-2 m/s.
• The discharge signal travelling along the axon stops at the synapses, as
there exists no conducting link to the next neuron.
• Transmission of the signal across the synaptic gap is mostly effected by
chemical activity.
• When the signal arrives at the presynaptic nerve terminal, special
substances called neurotransmitters are produced in tiny amounts.
• The neurotransmitter molecules travel across the synaptic junction
reaching the postsynaptic neuron within about 0.5 ms.
• These substances modify the conductance of the postsynaptic
membrane for certain ions, causing a polarization/depolarization of the
postsynaptic potential.
• If the induced polarization potential is positive, the synapse is termed
excitatory, because the influence of the synapse tends to activate the
postsynaptic neuron.
• If the polarization potential is negative, the synapse is called inhibitory,
since it counteracts excitation of the neuron.
• All the synaptic endings of an axon are either of an excitatory or an
inhibitory nature.
• The cell body of a neuron acts as a summing device due to the net depolarizing effect of its input signals.
• This net effect decays with 5-10 ms.
• If several signals arrive within such a period, their excitatory effects accumulate.
• When the total magnitude of the depolarization potential in the cell body exceeds the critical threshold (~10
mV), the neuron fires.
• The activity of a given synapse depends on the rate of the arriving signals.
• An active synapse, which repeatedly triggers the activation of its postsynaptic neuron, will grow in strength,
while others will gradually weaken. Thus the strength of a synaptic connection gets modified continuously.
• This mechanism of synaptic plasticity in the structure of neural connectivity, known as Hebb's rule, appears to
play a dominant role in the complex process of learning.
• Different types of neurons exist, differing in the size and degree of branching of their
dendritic trees, the length of their axons, and other structural details.

• The complexity of the human central nervous system is due to the vast number of
neurons and their mutual connections.

• In the human cortex every neuron receives a converging input from about 10^4 synapses.

• Each cell feeds its output into many hundreds of other neurons.

• The total number of neurons in the human cortex is near 10^11, distributed at a
roughly constant density of about 15 x 10^4 neurons per mm².

• There exist a total of about 10^15 synaptic connections in the human brain, the majority of
which develop during the first few months after birth.
ANN: Models of Neuron
• In 1943 Warren McCulloch and Walter Pitts proposed a model of
computing element, called McCulloch-Pitts neuron.
• It performs a weighted sum of the inputs to the element followed by a
threshold logic operation.
• The main drawback of this model of computation is that the weights
are fixed and hence the model could not learn from examples.

• One could test with random weights and change them by hand when the output is not correct, but the model itself cannot learn.


McCulloch-Pitts Model
• In the McCulloch-Pitts (MP) model the activation (x) is given by a
weighted sum of its M input values (a_i) and a bias term (θ)
• The output signal (s) is typically a nonlinear function f(x) of the activation
value x.
• The following equations describe the operation of an MP model:

Activation: x = Σ_{i=1}^{M} w_i a_i − θ
Output: s = f(x)

Commonly used nonlinear activation function: threshold logic.
State table (excitatory input, inhibitory input, previous state Sn-1, next state Sn):
Excitatory Inhibitory Sn-1 Sn
1 0 1 1
1 0 0 1
0 1 0 0
0 1 1 1
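As a concrete illustration of the MP neuron described above (weighted sum followed by threshold logic), here is a minimal Python sketch. The AND-gate weights and threshold are illustrative choices, not from the slides; the MP model itself provides no rule for learning them.

```python
# Minimal sketch of a McCulloch-Pitts neuron: a weighted sum of the
# inputs followed by a hard threshold. Weights are fixed by hand,
# since the MP model has no learning rule.

def mp_neuron(inputs, weights, theta):
    """Fire (return 1) when the weighted sum reaches the threshold theta."""
    activation = sum(w * a for w, a in zip(weights, inputs))
    return 1 if activation >= theta else 0

# Example: a 2-input AND gate realised with unit weights and theta = 2.
and_weights = [1, 1]
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, mp_neuron([a, b], and_weights, theta=2))
```

Only the (1, 1) input reaches the threshold, so the unit computes logical AND; changing theta to 1 would give OR with the same weights.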
Rosenblatt's perceptron model
• Almost fifteen years after McCulloch & Pitts model Rosenblatt's came up with the
perceptron model.
• Here, an artificial neuron consists of outputs from sensory units connected to a
fixed set of association units, the outputs of which are fed to an MP neuron.
• The association units perform predetermined manipulations on their inputs.
• The main deviation from the MP model is that learning (i.e., adjustment of
weights) is incorporated in the operation of the unit.
• The desired or target output (b) is compared with the actual binary output (s), and
the error (e) is used to adjust the weights.
• The following equations describe the operation of the perceptron model of a
neuron:
Activation: x = Σ_{i=1}^{M} w_i a_i − θ
Output: s = f(x)
Error: ε = b − s
Weight change: Δw_i = η ε a_i
w_{n+1} = w_n + Δw
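The four perceptron equations above can be traced in a few lines of Python. The weights, input and learning rate below are illustrative numbers, not from the slides.

```python
# One step of Rosenblatt's perceptron rule, following the equations above:
# activation x = sum(w_i * a_i) - theta, thresholded output s = f(x),
# error eps = b - s, and weight change dw_i = eta * eps * a_i.

def perceptron_step(w, theta, a, b, eta=0.1):
    x = sum(wi * ai for wi, ai in zip(w, a)) - theta   # activation
    s = 1 if x >= 0 else 0                             # threshold output
    eps = b - s                                        # error
    w_new = [wi + eta * eps * ai for wi, ai in zip(w, a)]
    return w_new, s, eps

# Illustrative numbers: the unit outputs 0 but the target is 1,
# so each weight moves by eta * 1 * a_i.
w2, s, eps = perceptron_step([0.2, -0.4], theta=0.1, a=[1, 1], b=1)
```

Here the misclassification (s = 0, b = 1) produces eps = 1, so both weights increase by 0.1, nudging the activation toward the correct side of the threshold.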

Rosenblatt's perceptron model


In Rosenblatt’s model artificial neurons could actually learn from data.
It is a supervised learning algorithm
This modified MP neuron model enables the artificial neuron to figure out the correct weights directly from
training data by itself.

Binary classification
Binary classification is the task of classifying the elements of a given set into two groups (e.g. classifying
whether an image depicts a cat or a dog) based on a prescribed rule.
Linear Separability
• Rosenblatt’s perceptron can handle only classification tasks for linearly separable classes.
• The decision function depends linearly on the inputs ai, hence the name Linear Classifier.
• The neuron takes an extra constant input, associated with the synaptic weight θ, also known as the
bias.
• The bias is simply the negative of the activation threshold.
• The synaptic weights wₖ are not restricted to unity, thus allowing some inputs to have more
influence onto the neuron’s output than others.
• They are not restricted to be strictly positive either. Some of the inputs can hence have an
inhibitory influence.
• The algorithm converges to a useful set of synaptic weights.
• Assume that the mᵗʰ example aₘ belongs to class sₘ = 0 and that the perceptron correctly predicts sₘ = 0. In
this case, the weight correction is given by

Δw = η · 0 · aᵢ, since ε = b − s = 0

so we do not change the weights. The same applies to the bias.

• Similarly, if the mᵗʰ example aₘ belongs to class sₘ = 1 and the perceptron correctly predicts sₘ = 1, then the
weight correction is Δw = 0. The same applies to the bias.

• Assume now that the mᵗʰ example aₘ belongs to class sₘ = 0 and that the perceptron wrongly predicts sₘ = 1.
In this case, the weight correction is given by

Δw = η · (−1) · aᵢ, since ε = b − s = (0 − 1) = −1

while the bias is updated as θ = θ − 1.

• Finally, if the mᵗʰ example aₘ belongs to class sₘ = 1 and the perceptron wrongly predicts sₘ = 0, the weight
correction is

Δw = η · (1) · aᵢ, since ε = b − s = (1 − 0) = 1

while the bias is updated as θ = θ + 1.


Topology
• Artificial neural networks are useful only when the processing units are organized in a suitable
manner to accomplish a given pattern recognition task.
• The arrangement of the processing units, connections, and pattern input & output is referred to as
topology.
• Artificial neural networks are normally organized into layers of processing units.
• The units of a layer are similar as they all have the same activation dynamics and output function.
• Connections can be made either from the units of one layer to the units of another layer (interlayer
connections) or among the units within the layer (intralayer connections) or both interlayer and
intralayer connections.
• Further, the connections across the layers and among the units within a layer can be organized
either in a feedforward manner or in a feedback manner.
• In a feedback network the same processing unit may be visited more than once.
Let us consider two layers F1 and F2 with M and N processing
units, respectively.
The instar network structure provides connections to the jth unit in
the F2 layer from all the units in the F1 layer.
It has a fan-in geometry.

The normalized weight vector associated with the jth unit in F2 is: w_j = [w_j1, w_j2, ..., w_jM]^T
The normalized input vector a at the F1 layer is: a = [a_1, a_2, ..., a_M]^T

Thus the activation w_j^T a = Σ_{i=1}^{M} w_ji a_i of the jth unit in the F2 layer will approach its maximum value.

Whenever the input is given to F1, the jth unit of F2 will be activated to the maximum extent.
Thus the operation of an instar can be viewed as content addressing the memory.
Let us consider two layers F1 and F2 with M and N processing
units, respectively.
The outstar network structure provides connections from the jth unit
in the F2 layer to all the units in the F1 layer.
It has a fan-out geometry.

The weight vector for the connections from the jth unit in F2 approaches the activity pattern in F1, when
an input vector a is presented at F1.
During recall, whenever the unit j is activated, the signal pattern (s_j w_1j, s_j w_2j, ..., s_j w_Mj) is transmitted to
F1, where s_j is the output of the jth unit.
This signal pattern then reproduces the original activity pattern corresponding to the input vector a,
even though the input is absent.
Thus the operation can be viewed as memory addressing the contents.

When all the connections from the units in F1 to F2 are made, we obtain a heteroassociation
network.
This network can be viewed as a group of instars, if the flow is from F1 to F2.
On the other hand, if the flow is from F2 to F1, then the network can be viewed as a group of outstars.
When the flow is bidirectional, we get a bidirectional associative memory, where either of the layers
can be used as input and output.
If the two layers F1 and F2 coincide and the weights are symmetric, i.e. w_ij = w_ji, i ≠ j, then we
obtain an autoassociative memory in which each unit is connected to every other unit and to itself.
Adaline
• Adaptive Linear Element (ADALINE) is a computing model proposed
by Widrow.
• The main distinction between the Rosenblatt's perceptron model and
the Widrow's Adaline model is that:
• In the Adaline the analog activation value (x) is compared with the
target output (b).
• In other words, the output is a linear function of the activation value
(x).
ADALINE Neuron Model

Activation: x = Σ_{i=1}^{M} w_i a_i − θ
Output: s = f(x) = x
Error: δ = b − s = (b − x)
Weight change: Δw_i = η δ a_i
w_{n+1} = w_n + Δw
• where η is the learning rate parameter.
• This weight update rule minimizes the mean squared error δ² = (b − x)²,
averaged over all inputs.
• Hence it is called Least Mean Squared (LMS) error learning law.
• This law is derived using the negative gradient of the error surface in the
weight space.
• Hence it is also known as a gradient descent algorithm.
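The LMS rule above can be sketched as a short training loop. The error uses the analog activation x, not a thresholded output, which is the Adaline's defining feature. The toy data, learning rate and epoch count below are illustrative choices, not from the slides.

```python
# Sketch of the Adaline / LMS rule above: delta = b - x is computed
# from the analog activation x, and the weights (and bias theta) move
# down the gradient of the squared error.

def adaline_train(samples, eta=0.05, epochs=200):
    """samples: list of (inputs, target); returns learned weights w and bias theta."""
    m = len(samples[0][0])
    w, theta = [0.0] * m, 0.0
    for _ in range(epochs):
        for a, b in samples:
            x = sum(wi * ai for wi, ai in zip(w, a)) - theta  # analog activation
            delta = b - x                                     # LMS error
            w = [wi + eta * delta * ai for wi, ai in zip(w, a)]
            theta -= eta * delta      # theta enters the activation with weight -1
    return w, theta

# Targets follow b = 2*a1 - a2 exactly, so LMS should recover
# w close to [2, -1] and theta close to 0.
data = [([1, 0], 2), ([0, 1], -1), ([1, 1], 1), ([2, 1], 3)]
w, theta = adaline_train(data)
```

Because the targets are a noiseless linear function of the inputs, the gradient descent converges to the exact solution; with noisy targets it would settle at the least-mean-squares compromise instead.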
Learning
• Synaptic dynamics is attributed to learning in a biological neural network.
• The synaptic weights are adjusted to learn the pattern information in the input
samples.
• Learning is a slow process, and the samples containing a pattern may have to be
presented to the network several times before the pattern information is captured
by the weights of the network.
• A large number of samples are normally needed for the network to learn the
pattern implicit in the samples.
• Pattern information is distributed across all the weights, and it is difficult to relate
the weights directly to the training samples.
• The only way to demonstrate the evidence of learning pattern information is
that, given another sample from the same pattern source, the network would
classify the new sample into the pattern class of the earlier trained samples.
• Another interesting feature of learning is that the pattern information is
slowly acquired by the network from the training samples, and the training
samples themselves are never stored in the network.
• The adjustment of the synaptic weights is represented by a set of learning
equations, which describe the synaptic dynamics of the network.
Requirements of learning laws
• The learning law should lead to convergence of weights.
• The learning or training time for capturing the pattern information from samples should be as small as
possible.
• The weights should be adjusted based on each sample containing the pattern information.
• Learning should use only the local information as far as possible. That is, the change in the weight on a
connecting link between two units should depend on the states of these two units only.
• It should be possible to implement the learning law in parallel for all the weights to speed up the learning process.
• Learning should be able to capture complex nonlinear mapping between input-output pattern pairs
• Learning should be able to capture as many patterns as possible into the network. That is, the pattern
information storage capacity should be as large as possible for a given network.
Categories of learning
• Supervised, reinforcement and unsupervised
• Off-line and on-line
• Deterministic, stochastic and fuzzy
• Discrete and continuous
• Criteria for learning
Types of Learning
Hebbian learning
• Basic Hebbian learning
• Differential Hebbian learning
• Stochastic versions

Competitive learning - learning without a teacher
• Linear competitive learning
• Differential competitive learning
• Linear differential competitive learning
• Stochastic versions

Error correction learning - learning with a teacher
• Perceptron learning
• Delta learning
• LMS learning

Reinforcement learning - learning with a critic
• Fixed credit assignment
• Probabilistic credit assignment
• Temporal credit assignment

Stochastic learning
• In multilayer perceptron
• In Boltzmann machine

Other learning methods
• Sparse coding
• Min-max learning
• Principal component learning
• Drive-reinforcement learning
Hebbian learning
• The change in the synaptic strength is proportional to the correlation
between the firing of the post- and pre-synaptic neurons.

Topology for Hebbian learning

• s_i s_j is the product of the post-synaptic and pre-synaptic neuronal
variables for the ith unit.
• These variables could be activation values (s_i s_j = x_i(t) x_j(t)), or output
signals from two units (s_i s_j = f_i(x_i(t)) f_j(x_j(t))), or some other
parameters related to the post-synaptic and pre-synaptic activity.
• Let the respective mean values of s_i and s_j be (s̄_i, s̄_j); then the
resulting correlation term (s_i − s̄_i)(s_j − s̄_j) is called the covariance
correlation term.
Competitive Learning
• Learning laws which modulate the difference between the output
signal and the synaptic weight.
• The general form of competitive learning is given by the following
expression for the synaptic dynamics:

ẇ_ij(t) = −w_ij(t) s_i + s_i s_j

where s_i = f_i(x_i(t)) is the output signal of unit i, and s_j = f_j(x_j(t)) is the
output signal of unit j.

The above expression is similar to the deterministic Hebbian learning. However,
the term −s_i w_ij(t) is nonlinear, whereas it was linear in the Hebbian case.
Here adjustment of weights takes place only when there is a nonzero post-
synaptic signal s_i. If s_i = 0, then the synaptic weights do not change.
It is also interesting to note that, unlike in the Hebbian case, in the competitive
learning case the system does not forget the past learning when the post-synaptic
signal is zero.
In the Hebbian case, for s_i = 0, ẇ_ij(t) = −w_ij(t), which results in forgetting the
knowledge already acquired.
• The signals from the units in the output layer compete with each other, leaving eventually one of the units
(say i) as the winner.
• The weights leading to this unit are adjusted according to the learning law.
• This is also called the 'winner take-all' situation, since only one unit in the output layer will have a nonzero
output eventually.
• The corresponding weights wij from all the input units (j) are adjusted to match the input vector.
• While the Hebbian learning is generally distributed, i.e., all the weights are adjusted for every input pattern,
the competitive learning is not distributed.
• In fact the input vectors leading to the same winning unit in the competitive layer will produce a weight
vector for that unit which is an average of all the corresponding input vectors.
• If the input units are linear, i.e., s_i = x_i, then the resulting learning is called linear competitive learning.

All the inputs are connected to each of the units in the output layer in a
feedforward manner.

For a given input vector a, the output from each unit i is computed
using the weighted sum.

The unit k that gives the maximum output is identified.

Then the weight vector leading to the kth unit is adjusted as follows:

Δw_kj = η (a_j − w_kj),  w_{n+1} = w_n + Δw
• The final weight vector tends to represent a group of input vectors.
• This is a case of unsupervised learning.
• The values of the weight vectors are initialized to random values prior to
learning.
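The steps above (feedforward sums, identify the winner, move only the winner's weight vector toward the input) can be sketched as follows. The data, unit count and the practical choice of initializing the weight vectors from the first few input vectors (rather than random values, to avoid dead units in this tiny demo) are illustrative assumptions.

```python
# Winner-take-all competitive learning as described above: each output
# unit computes a weighted sum, the unit with the maximum sum wins, and
# only the winner's weights move toward the input: dw = eta * (a - w_k).

def competitive_train(inputs, n_units=2, eta=0.3, epochs=30):
    # initialize each unit's weights from one of the first input vectors
    # (an illustrative practical choice; the slides initialize randomly)
    w = [list(inputs[i]) for i in range(n_units)]
    for _ in range(epochs):
        for a in inputs:
            sums = [sum(wi * ai for wi, ai in zip(wk, a)) for wk in w]
            k = sums.index(max(sums))                           # winning unit
            w[k] = [wk + eta * (ai - wk) for wk, ai in zip(w[k], a)]
    return w

# Two obvious clusters; each unit's weight vector settles near the mean
# of the input vectors for which it wins, as the slides state.
data = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
w = competitive_train(data)
```

After training, unit 0's weights sit near (0.95, 0.05) and unit 1's near (0.05, 0.95): each weight vector has become an average of the inputs that unit won, which is exactly the unsupervised clustering behaviour described above.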
Outstar Learning Law

In this law the weights are adjusted so as to capture the desired
output pattern characteristics.

The weights are adjusted for the case where the kth unit is the only active unit in the input layer.

The vector b = (b_1, b_2, ..., b_M) is the desired response from the layer of M units.
The outstar learning is a supervised learning law, and it is used with a network of instars to capture
the characteristics of the input and output patterns for data compression.
In implementation, the weight vectors are initialized to zero.
Learning Mechanism

• Error correcting learning
• Memory based learning
• Hebbian learning
• Competitive learning
• Boltzmann learning

Error correcting learning

e_k(n) = d_k(n) − y_k(n)

Minimization of the cost function

E = (1/2) Σ_k e_k²(n)

Δw_kj(n) = η e_k(n) x_j(n)
w_kj(n+1) = w_kj(n) + Δw_kj(n)

Gradient descent algorithm

Referred to as the delta rule / Widrow-Hoff rule / gradient descent method
Memory based learning

• All past examples {(x_i, d_i)}, i = 1, ..., n are stored.
• Measure the Euclidean distance between x_test and each stored x_i.
• x'_n ∈ {x_1, x_2, ..., x_n} is the closest to x_test if

min_i d(x_i, x_test) = d(x'_n, x_test)

[Figure: a cluster of class-1 points containing a single outlier; a test point lying near the outlier will be wrongly classified.]

• A test point lying near the outlier will be wrongly classified.
• So we consider a variant of this method.
• Let us consider a neighbourhood of the k nearest points.
• This method is known as the k-nearest-neighbour method.
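The memory-based idea above, including the outlier problem and the k-nearest-neighbour fix, can be sketched directly. The training points and the outlier position are illustrative.

```python
# Memory-based (k-nearest-neighbour) classification: store all training
# pairs, then label a test point by majority vote among its k closest
# stored points, so a single outlier cannot decide the label on its own.
from collections import Counter
import math

def knn_classify(train, x_test, k=3):
    """train: list of (vector, label); returns the majority label of the k nearest."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], x_test))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# A class-0 cluster, a class-1 cluster, and one class-0 outlier
# sitting inside the class-1 cluster.
train = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0),
         ([5, 5], 1), ([5, 6], 1), ([6, 5], 1),
         ([5.5, 5.5], 0)]  # the outlier
# With k=1 the nearest stored point is the outlier, so the test point is
# wrongly classified; with k=3 the two class-1 neighbours outvote it.
print(knn_classify(train, [5.4, 5.4], k=1))
print(knn_classify(train, [5.4, 5.4], k=3))
```

This reproduces the slide's argument: the 1-nearest rule is fooled by the outlier, while the k-neighbourhood vote recovers the correct class.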

Hebbian learning
• Hebb was a neurophysiologist.

[Figure: presynaptic neuron A connected to postsynaptic neuron B.]

If cell A repeatedly fires cell B, then metabolic changes occur so that the efficiency of A in firing B increases.
Here the synaptic weight of the connection is adapted in such a way that next time cell A has a better probability of firing cell B.

The process is strongly time dependent: the activation of cells A and B must be synchronous.
It is local: the change is not influenced by other neurons.
It is strongly interactive.

• Positive correlation strengthens the synaptic weight
• Uncorrelated or negatively correlated activity weakens the synapse
• Classification of synaptic modification:
Hebbian: the synapse increases its strength to strengthen positive correlation
Anti-Hebbian: the synapse decreases its strength to strengthen negative correlation
Non-Hebbian: does not involve a Hebbian mechanism
Mathematical model
Δw_kj(n) = F(y_k(n), x_j(n))
• Hebb's hypothesis:
Δw_kj(n) = η y_k(n) x_j(n)
• This is called the activity product rule.

[Figure: Δw_kj versus y_k for constant x_j is a straight line through the origin with slope η x_j.]

According to the algorithm,

y_k = Σ_{j=1}^{M} w_kj x_j

• For a constant x_j we get a positive y_k, which contributes to Δw_kj

w_kj(n+1) = w_kj(n) + Δw_kj(n)

• This again increases w_kj, which in turn increases y_k.
• Thus the weight exhibits exponential growth, which requires saturation of learning.
• The hypothesis is therefore modified to the covariance hypothesis:

x̄: time-averaged value of x_j (constant)
ȳ: time-averaged value of y_k

Δw_kj = η (y_k − ȳ)(x_j − x̄)

Δw_kj increases when x_j > x̄ and y_k > ȳ, or x_j < x̄ and y_k < ȳ
Δw_kj decreases (stabilizing) when x_j > x̄ and y_k < ȳ, or x_j < x̄ and y_k > ȳ

[Figure: Δw_kj versus y_k under the covariance hypothesis is a straight line (y = mx + c) with slope η(x_j − x̄), crossing zero at y_k = ȳ; the maximum depression point −η ȳ(x_j − x̄) occurs at y_k = 0. Such depression is observed in the hippocampus.]
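The exponential growth argued above can be checked numerically. The single-synapse setup, learning rate and input value below are illustrative.

```python
# Numerical illustration of the activity product rule above,
# dw = eta * y * x with y = w * x for a single linear input:
# the weight multiplies by (1 + eta * x * x) each step, so it grows
# exponentially, which is why the covariance form
# dw = eta * (y - y_bar) * (x - x_bar) is preferred.

def hebb_updates(w, x, eta=0.5, steps=5):
    """Repeatedly apply the plain Hebbian rule to one synapse."""
    history = [w]
    for _ in range(steps):
        y = w * x            # single-input linear unit
        w = w + eta * y * x  # activity product rule
        history.append(w)
    return history

hist = hebb_updates(w=1.0, x=1.0)
# With eta = 0.5 and x = 1, the weight multiplies by 1.5 every step.
```

After five steps the weight has grown to 1.5^5 = 7.59; without a saturation mechanism or the covariance correction, nothing stops this growth.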
Competitive learning
• Supports several inputs and several outputs
• Competition among the output units
• Feedback connections weaken the competitors

[Figure: feedforward excitatory connections from the inputs; lateral feedback connections among the output units.]

y_k = 1 if v_k > v_j for all j, j ≠ k; y_k = 0 otherwise

Σ_j w_kj = 1 for all k

v_j = Σ_i w_ji x_i

Δw_kj = η (x_j − w_kj) if k is the winner; 0 otherwise

subject to Σ_j w_kj = 1 for all k

• If w_kj increases in one iteration, then in the next iteration it is most likely that k will again be the winner and
w_kj will increase again.
• However, the increment will be smaller than in the previous iteration due to the constraint above.
• w_kj is going to follow x_j.
• Unsupervised learning.
• For different patterns, different neurons will be the winners.

w_k = (w_k1, w_k2, ..., w_kn)
x = (x_1, x_2, ..., x_n)
Boltzmann learning
• Stochastic method rooted in statistical mechanics
• The neurons constitute a recurrent structure (feedback interconnections)
• Operated in binary mode: each neuron state is +1 or −1
• Energy function:

E = −(1/2) Σ_j Σ_k w_kj x_j x_k,  j ≠ k

Worked example with w_kj = 1:
E = −(1/2)·(1)·(−1)·(1) = 1/2 for a pair of opposite states
E = −(1/2)·(−1)·(−1)·(1) = −1/2 for a pair of like states

• Some neurons take part in the output (visible) layer
• Whereas some neurons belong to the hidden layer

[Figure: a small recurrent network of +1/−1 neurons with visible and hidden units.]

• All the neurons are associated with some state +1/−1.
• Pick up any neuron, flip its state, and calculate E.
• The probability that a neuron will flip its state from x_k → −x_k is

P(x_k → −x_k) = 1 / (1 + exp(−ΔE_k / T))

where ΔE_k is the change of energy due to the flip and T is a (pseudo-)temperature constant.

Observations:
• States of low energy have a higher probability of occurrence than
states of high energy.
• As the temperature T is reduced, the probability is concentrated on a
smaller subset of low-energy states.
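The energy function and flip probability above can be made concrete for a tiny two-neuron net. The weights are illustrative; as an assumption, ΔE_k is taken here as the energy *drop* produced by the flip, the sign convention under which energy-lowering flips are the more probable ones, consistent with the observation that low-energy states occur more often.

```python
# Tiny numeric sketch of the Boltzmann energy and flip probability above,
# for a 2-neuron recurrent net with states +/-1. Weights are illustrative.
import math

def energy(w, x):
    """E = -1/2 * sum over j != k of w[k][j] * x[j] * x[k]."""
    n = len(x)
    return -0.5 * sum(w[k][j] * x[j] * x[k]
                      for k in range(n) for j in range(n) if j != k)

def flip_probability(w, x, k, T=1.0):
    """P(x_k -> -x_k) = 1 / (1 + exp(-dE_k / T))."""
    flipped = list(x)
    flipped[k] = -flipped[k]
    # assumed sign convention: dE_k is the energy drop caused by the flip
    dE = energy(w, x) - energy(w, flipped)
    return 1.0 / (1.0 + math.exp(-dE / T))

w = [[0, 1], [1, 0]]                        # symmetric, no self-connection
p_stay = flip_probability(w, [1, 1], 0)     # aligned state: low energy
p_move = flip_probability(w, [-1, 1], 0)    # anti-aligned state: high energy
```

Flipping out of the low-energy aligned state is unlikely (probability below 0.5), while flipping out of the high-energy state is likely; lowering T pushes these probabilities further toward 0 and 1, concentrating the system in low-energy states.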
Boltzmann Neuron

Visible neurons are available at the output; hidden neurons are always free.
A free neuron can change its state; a clamped neuron cannot change its state.

ρ⁺_kj = correlation between neuron j and neuron k in the clamped condition (positive phase)
ρ⁻_kj = correlation between neuron j and neuron k in the free-running condition (negative phase)
Both correlations range over −1 to +1.

Boltzmann learning rule:

Δw_kj = η (ρ⁺_kj − ρ⁻_kj),  j ≠ k

• Given the synaptic-weight vector w for the whole network, the probability that the visible neurons
are in state x_k is P(X_k = x_k).
• The log-likelihood function L(w) is formulated as follows:

L(w) = log Π P(X_k = x_k)

The goal of Boltzmann learning is to maximize the log-likelihood function L(w):

∂L(w)/∂w_kj = (1/T)(ρ⁺_kj − ρ⁻_kj),  j ≠ k

We may use gradient ascent to achieve that goal by writing

Δw_kj = ε ∂L(w)/∂w_kj = η (ρ⁺_kj − ρ⁻_kj),  j ≠ k,  where η = ε/T
Perceptron Convergence Algorithm
Variables and Parameters:
x(n) = (m + 1)-by-1 input vector
= [+1, x1(n), x2(n), ..., xm(n)]𝑇
w(n) = (m + 1)-by-1 weight vector
= [b(n), w1(n), w2(n), ..., wm(n)]𝑇 , where b = bias
y(n) = actual response (quantized)
d(n) = desired response
η= learning-rate parameter, a positive constant less than unity

1. Initialization. Set w(0) = 0. Then perform the following computations for


time-step n = 1, 2, ....
2. Activation. At time-step n, activate the perceptron by applying continuous-valued
input vector x(n) and desired response d(n).
3. Computation of Actual Response. Compute the actual response of the perceptron as
𝑦 𝑛 = sgn[𝑤 𝑇 𝑛 𝑥(𝑛)]
where sgn(·) is the signum function.
4. Adaptation of Weight Vector. Update the weight vector of the perceptron to obtain
w(n + 1) = w(n) + η[d(n) - y(n)]x(n)

where 𝑑(𝑛) = +1 if x(n) belongs to class c1, and −1 if x(n) belongs to class c2

5. Continuation. Increment time step n by one and go back to step 2.
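Steps 1–5 can be sketched directly in code. This is a minimal illustration (the toy data set and the stopping rule of a fixed epoch budget are assumptions for the demo, not part of the algorithm statement):

```python
def sgn(v):
    # signum with sgn(0) = -1, matching "w^T x <= 0 -> class c2"
    return 1 if v > 0 else -1

def train_perceptron(samples, eta=1.0, epochs=100):
    """Fixed-increment perceptron learning (steps 1-5 above).
    Each sample is (x, d) with x a feature list and d = +1/-1.
    A +1 bias input is prepended, so w[0] plays the role of the bias."""
    m = len(samples[0][0])
    w = [0.0] * (m + 1)                       # step 1: w(0) = 0
    for _ in range(epochs):
        errors = 0
        for x, d in samples:                  # step 2: activation
            xa = [1.0] + list(x)
            y = sgn(sum(wi * xi for wi, xi in zip(w, xa)))   # step 3
            if y != d:                        # step 4: adapt on error
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:                       # no mistakes: converged
            break
    return w

# Linearly separable toy data (AND-like): class c1 = +1, c2 = -1
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = train_perceptron(data)
```

Since the two classes here are linearly separable, the convergence theorem below guarantees the loop terminates with a separating weight vector.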


A pair of (a) linearly separable patterns. (b) non-
linearly separable patterns.
Equivalent signal-flow graph of the
perceptron

For the perceptron to function properly, the two classes c1 and c2 must be linearly
separable.
This means that the patterns to be classified must be sufficiently separated from each
other to ensure that the decision surface consists of a hyperplane.
• Fig. (a) is the example of a two-dimensional perceptron.
• The two classes c1 and c2 are sufficiently separated from each other to draw a
hyperplane (in this case, a straight line) as the decision boundary.
• If, however, the two classes c1 and c2 are allowed to move too close to each other,
as in Fig. (b), they become nonlinearly separable, a situation that is beyond the
computing capability of the perceptron.
Suppose the input variables of the perceptron originate from two linearly separable
classes. Let h1 be the subspace of training vectors x1(1), x1(2), ... that belong to
class c1, and let h2 be the subspace of training vectors x2(1), x2(2), ... that belong to
class c2.
The union of h1 and h2 is the complete space denoted by h.
• Given the sets of vectors h1 and h2 to train the classifier, the training process
involves the adjustment of the weight vector w in such a way that the two classes
c1 and c2 are separated by the hyperplane 𝑤 𝑇 𝑥 = 0.
• That is, there exists a weight vector w such that we may state
𝑤 𝑇 𝑥 > 0 for every input vector x belonging to class c1
𝑤 𝑇 𝑥 ≤ 0 for every input vector x belonging to class c2
• In the second line, we have arbitrarily chosen to say that the input vector x belongs
to class c2 if 𝑤 𝑇 𝑥 = 0 .
• Given the subsets of training vectors h1 and h2, the training problem for the
perceptron is then to find a weight vector w such that the two inequalities are
satisfied.
• The algorithm for adapting the weight vector of the elementary perceptron may now be
formulated as follows:
1. If the nth member of the training set, x(n), is correctly classified by the weight vector
w(n) computed at the nth iteration of the algorithm, no correction is made to the weight
vector of the perceptron in accordance with the rule:
w(n+1) = w(n) if 𝑤 𝑇 (𝑛)𝑥(𝑛) > 0 and x(n) belongs to class c1
w(n+1) = w(n) if 𝑤 𝑇 (𝑛)𝑥(𝑛) ≤ 0 and x(n) belongs to class c2
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule
w(n+1) = w(n) - η(n)x(n) if 𝑤 𝑇 (𝑛)𝑥(𝑛) > 0 and x(n) belongs to class c2
w(n+1) = w(n) + η(n)x(n) if 𝑤 𝑇 (𝑛)𝑥(𝑛) ≤ 0 and x(n) belongs to class c1
where the learning-rate parameter η(n) controls the adjustment applied to the weight vector
at iteration n.
• If η(n)=η > 0, where η is a constant independent of the iteration number n, then we
have a fixed-increment adaptation rule for the perceptron.
• We first prove the convergence for η =1. The value of η is unimportant, so long as
it is positive. A value η ≠ 1 merely scales the pattern vectors without affecting
their separability.
• Initial condition: assume w(0) = 0. Suppose that 𝑤 𝑇 (𝑛)𝑥(𝑛) ≤ 0 for n = 1, 2, ...,
while the input vector x(n) belongs to the subset h1. That is, the perceptron incorrectly
classifies the vectors x(1), x(2), ..., so we make the correction

w(n + 1) = w(n) + x(n) for x(n) belonging to class c1 (η = 1)
w(n + 1) = x(1) + x(2) + ... + x(n), since w(0) = 0
• Since the classes c1 and c2 are assumed to be linearly separable, there exists a
solution wo for which 𝑤𝑜𝑇 𝑥(𝑛) > 0 for the vectors x(1), ..., x(n) belonging to the
subset h1.
• For a fixed solution wo, we may define a positive number α as

    α = min 𝑤𝑜𝑇 𝑥(𝑛)  over  x(n) ∈ h1

Multiplying both sides of w(n + 1) = x(1) + x(2) + ... + x(n) by 𝑤𝑜𝑇 gives

    𝑤𝑜𝑇 w(n + 1) = 𝑤𝑜𝑇 x(1) + ... + 𝑤𝑜𝑇 x(n) ≥ nα

Applying the Cauchy–Schwarz inequality, ‖𝑤𝑜‖² ‖w(n + 1)‖² ≥ [𝑤𝑜𝑇 w(n + 1)]², where ‖·‖
denotes the Euclidean norm of the enclosed argument vector, then yields

    ‖w(n + 1)‖² ≥ n²α²/‖𝑤𝑜‖²     (1)

On the other hand, taking the squared norm of w(k + 1) = w(k) + x(k) and using
𝑤 𝑇 (𝑘)𝑥(𝑘) ≤ 0, we get ‖w(k + 1)‖² ≤ ‖w(k)‖² + ‖x(k)‖². Summing over k with w(0) = 0,

    ‖w(n + 1)‖² ≤ Σ_{k=1}^{n} ‖x(k)‖² ≤ nβ,   where β = max ‖x(k)‖² over x(k) ∈ h1     (2)

Eq. (2) is clearly in conflict with the earlier result of Eq. (1) for sufficiently
large values of n. We can state that n cannot be larger than some value nmax
for which Eqs. (1) and (2) are both satisfied with the equality sign. That is,
nmax is the solution of the equation

    n²max α²/‖𝑤𝑜‖² = nmax β   ⟹   nmax = β‖𝑤𝑜‖²/α²
• Thus, with η(n) = 1 for all n, w(0) = 0, and a solution vector wo in existence, the rule
for adapting the synaptic weights of the perceptron must terminate after at most nmax
iterations.
• This statement, proved here for the subset h1, holds equally for the subset h2.
• We may now state the fixed-increment convergence theorem for the perceptron as
follows (Rosenblatt, 1962):

Let the subsets of training vectors h1 and h2 be linearly separable.


Let the inputs presented to the perceptron originate from these two subsets.
Then the perceptron converges after some no iterations, in the sense that
w(no) = w(no + 1) = w(no + 2) = …
is a solution vector, where 𝑛𝑜 ≤ 𝑛𝑚𝑎𝑥 .
• The use of an initial value w(0) different from the null condition results in a
decrease or increase in the number of iterations required to converge, depending
on how w(0) relates to the solution wo.
• Regardless of the value assigned to w(0), the perceptron is assured of
convergence.
Associative Memory
• Retrieval of data based on content/pattern
• Memory may be short-term or long-term; the memory considered here is distributive
• An input activity (stimulus) pattern stimulates the network and provides an output
activity (response) pattern, e.g., recognizing a face from a memorized pattern
• Stimulus/response patterns are data vectors; the information content in the stimulus
determines its address for retrieval
• High degree of resistance to noise
• A high degree of interaction between the stored patterns causes errors in the recall
process
• Learning capacity is limited; loading the memory beyond a certain level causes errors
in the recall process
Associative memory is used for the storage of patterns.
It stores the association between pattern pairs (xk , yk ), where k = 1, 2, ..., q indexes
the pattern pair. The network consists of an input layer, an output layer, and the
synaptic weights wji (k) connecting them.
For a pattern pair 𝑥𝑘 = [𝑥𝑘1 , 𝑥𝑘2 , … , 𝑥𝑘𝑚 ]𝑇 , 𝑦𝑘 = [𝑦𝑘1 , 𝑦𝑘2 , … , 𝑦𝑘𝑚 ]𝑇 ,

    𝑦𝑘𝑗 = Σ_{i=1}^{m} 𝑤𝑗𝑖 (𝑘)𝑥𝑘𝑖 = [𝑤𝑗1 (𝑘), 𝑤𝑗2 (𝑘), … , 𝑤𝑗𝑚 (𝑘)] [𝑥𝑘1 , 𝑥𝑘2 , … , 𝑥𝑘𝑚 ]𝑇

We can write m such equations, one per output unit, and stack them in matrix form:

    [𝑦𝑘1 , 𝑦𝑘2 , … , 𝑦𝑘𝑚 ]𝑇 = 𝑊(𝑘) [𝑥𝑘1 , 𝑥𝑘2 , … , 𝑥𝑘𝑚 ]𝑇 ,   i.e.   𝑦𝑘 = 𝑊(𝑘)𝑥𝑘

where 𝑊(𝑘) is the m × m matrix with entries 𝑤𝑗𝑖 (𝑘).

This gives the association of xk with yk , where k = 1, 2, ..., q.

For every pattern pair there exists a unique weight matrix W(k); the memory stores the
weights associated with each pattern:

    𝑦𝑘𝑗 = Σ_{i=1}^{m} 𝑤𝑗𝑖 (𝑘)𝑥𝑘𝑖

The m × m memory matrix accumulates the experience from all pattern pairs:

    𝑀 = Σ_{k=1}^{q} 𝑊(𝑘)

𝑀 indicates the total experience gained.

The above equation can be written in the recursive (incremental) form

    𝑀𝑘 = 𝑀𝑘−1 + 𝑊(𝑘),   k = 1, 2, ..., q,   with 𝑀0 = 0
How to estimate the weights?
Correlation matrix memory:

    𝑀̂ = Σ_{k=1}^{q} 𝑦𝑘 𝑥𝑘𝑇     (outer product)

where 𝑦𝑘 is an m × 1 vector, 𝑥𝑘𝑇 is a 1 × m vector, and 𝑀̂ is an m × m matrix.
Reformulating in block form,

    𝑀̂ = 𝑌𝑋 𝑇 ,   𝑌 = [𝑦1 , 𝑦2 , … , 𝑦𝑞 ],   𝑋 = [𝑥1 , 𝑥2 , … , 𝑥𝑞 ]

• Recursively,

    𝑀̂𝑘 = 𝑀̂𝑘−1 + 𝑦𝑘 𝑥𝑘𝑇 ,   k = 1, 2, ..., q

(Signal-flow graph of the recursion: the outer product 𝑦𝑘 𝑥𝑘𝑇 is added to the delayed
memory 𝑀̂𝑘−1 , through a unit-delay branch 𝑧 −1 𝐼, to form 𝑀̂𝑘 .)
• If we feed a key vector xj , the corresponding output is

    𝑦 = 𝑀̂ 𝑥𝑗

If 𝑦 = 𝑦𝑗 , the recall is perfect. Substituting the memory matrix,

    𝑦 = Σ_{k=1}^{q} (𝑦𝑘 𝑥𝑘𝑇 ) 𝑥𝑗

• Since 𝑥𝑘𝑇 𝑥𝑗 is a scalar, we can rearrange to write

    𝑦 = Σ_{k=1}^{q} (𝑥𝑘𝑇 𝑥𝑗 ) 𝑦𝑘 = (𝑥𝑗𝑇 𝑥𝑗 ) 𝑦𝑗 + Σ_{k=1, k≠j}^{q} (𝑥𝑘𝑇 𝑥𝑗 ) 𝑦𝑘
(𝑥𝑗𝑇 𝑥𝑗 ) = 𝐸𝑗 , the energy of pattern j.
Normalization example: for x = (1, 2, 3, 4, 5), max(x) = 5, so
x/max(x) = (1/5, 2/5, 3/5, 4/5, 1).
If all the patterns are normalized to unit energy (𝑥𝑗𝑇 𝑥𝑗 = 1),

    𝑦 = 𝑦𝑗 + Σ_{k=1, k≠j}^{q} (𝑥𝑘𝑇 𝑥𝑗 ) 𝑦𝑘 = 𝑦𝑗 + 𝑣𝑗

where 𝑣𝑗 is the noise (cross-talk) term arising from the other stored patterns.


    cos(𝑥𝑘 , 𝑥𝑗 ) = 𝑥𝑘𝑇 𝑥𝑗 / (‖𝑥𝑘 ‖ ‖𝑥𝑗 ‖)

With unit-energy patterns, ‖𝑥𝑘 ‖ = (𝑥𝑘𝑇 𝑥𝑘 )^{1/2} = 𝐸𝑘^{1/2} = 1, so

    cos(𝑥𝑗 , 𝑥𝑘 ) = 𝑥𝑗𝑇 𝑥𝑘

• If the key vectors form an orthonormal set,

    𝑥𝑗𝑇 𝑥𝑘 = 1 for k = j,  and 0 for k ≠ j

the noise term vanishes and the recall is exact.
• The largest number of patterns that can be reliably stored equals the rank r of the
memory matrix M.
• For a matrix of dimension l by m, r ≤ min(l, m), so reliable recall is limited to at
most min(l, m) patterns.
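A minimal sketch of the correlation-matrix memory and its recall step, using orthonormal keys (standard basis vectors) so the recall is exact. The stored response values are made-up illustrative numbers:

```python
def outer(y, x):
    # outer product y x^T, as a list of rows
    return [[yi * xj for xj in x] for yi in y]

def build_memory(pairs):
    # M_hat = sum_k y_k x_k^T, built with the recursion M_k = M_{k-1} + y_k x_k^T
    m = len(pairs[0][0])
    M = [[0.0] * m for _ in range(m)]
    for x, y in pairs:
        P = outer(y, x)
        M = [[mij + pij for mij, pij in zip(mr, pr)] for mr, pr in zip(M, P)]
    return M

def recall(M, x):
    # y = M_hat x
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

# orthonormal keys -> the cross-talk term vanishes and recall is perfect
pairs = [([1.0, 0.0, 0.0], [0.2, 0.4, 0.6]),
         ([0.0, 1.0, 0.0], [0.9, 0.1, 0.5])]
M = build_memory(pairs)
```

With non-orthogonal keys the same code exhibits the noise term 𝑣𝑗 derived above: the recalled vector is the stored response plus contributions from the other patterns.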
The set of input/output pattern pairs is

    𝑥𝑘𝑒𝑦 = {𝑥1 , 𝑥2 , … , 𝑥𝑞 },   𝑦𝑚𝑒𝑚 = {𝑦1 , 𝑦2 , … , 𝑦𝑞 }

If 𝑥𝑘𝑇 𝑥𝑗 ≥ 𝛾 for k ≠ j, where 𝛾 is large enough, the cross-talk dominates and 𝑦𝑗
cannot be retrieved.

Let 𝑥𝑗 = 𝑥𝑜 + 𝑣, where 𝑣 is a stochastic (noise) vector and (𝑥𝑜 , 𝑦𝑜 ) is a pattern
pair not seen before.
Unconstrained optimization technique
• Error signal: e = d − y
• The cost function C(𝑤) is a function of the free parameters 𝑤
• C(𝑤) must be continuously differentiable
• Objective: find 𝑤 = 𝑤 ∗ for which 𝐶(𝑤 ∗ ) ≤ 𝐶(𝑤) for all w; 𝑤 ∗ is the optimal solution
More precisely, the necessary condition for optimality is

    𝛻𝐶(𝑤 ∗ ) = 0

where 𝛻 = [∂/∂𝑤1 , ∂/∂𝑤2 , … , ∂/∂𝑤𝑛 ]𝑇 and 𝛻𝐶 = [∂𝐶/∂𝑤1 , ∂𝐶/∂𝑤2 , … , ∂𝐶/∂𝑤𝑛 ]𝑇
• Local iterative descent
• 𝑤(0) = initial weight, used to generate the sequence 𝑤(𝑛), n = 1, 2, ...
• If the algorithm converges, 𝐶(𝑤(𝑛 + 1)) < 𝐶(𝑤(𝑛))
• Apply the gradient descent method, i.e., go downhill
• Steepest descent:

    𝑔 = 𝛻𝐶(𝑤)
    𝑤(𝑛 + 1) = 𝑤(𝑛) − η𝑔(𝑛)
    Δ𝑤(𝑛) = 𝑤(𝑛 + 1) − 𝑤(𝑛) = −η𝑔(𝑛)

First-order Taylor series expansion:

    𝐶(𝑤(𝑛 + 1)) ≈ 𝐶(𝑤(𝑛)) + 𝑔𝑇 (𝑛)Δ𝑤(𝑛) = 𝐶(𝑤(𝑛)) − η‖𝑔(𝑛)‖²

• The selection of 𝜂 must be judicious. If 𝜂 is too small, the response is overdamped:
convergence is smooth but slow.
• If 𝜂 is too large, the response is underdamped: the trajectory zigzags (oscillates)
about the solution.
• If 𝜂 exceeds a critical value, the solution becomes unstable and diverges.
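A minimal steepest-descent sketch on an assumed quadratic cost C(w) = w1² + 10·w2² (the cost, step size, and starting point are illustrative choices, not from the slides):

```python
def grad(w):
    # gradient of C(w) = w[0]**2 + 10*w[1]**2
    return [2.0 * w[0], 20.0 * w[1]]

def steepest_descent(w, eta, steps):
    for _ in range(steps):
        g = grad(w)
        # w(n+1) = w(n) - eta * g(n)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# with eta = 0.04 both components contract each step and approach the minimum (0, 0);
# raising eta above 0.1 makes the w2 update factor exceed 1 in magnitude and diverge
w = steepest_descent([1.0, 1.0], eta=0.04, steps=200)
```

Trying eta = 0.11 in the same sketch shows the unstable regime described above: the w2 component flips sign and grows on every iteration.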
• Newton's method for unconstrained optimization
• Using the second-order Taylor series expansion of the cost function,

    Δ𝐶(𝑤(𝑛)) = 𝐶(𝑤(𝑛 + 1)) − 𝐶(𝑤(𝑛)) ≈ 𝑔𝑇 (𝑛)Δ𝑤(𝑛) + (1/2)Δ𝑤𝑇 (𝑛)𝐻(𝑛)Δ𝑤(𝑛)

• 𝐻(𝑛) = 𝛻²𝐶(𝑤(𝑛)) is the Hessian matrix of second partial derivatives, with entries

    [𝐻]𝑖𝑗 = ∂²𝐶/∂𝑤𝑖 ∂𝑤𝑗 ,   i, j = 1, 2, ..., n

• Minimizing Δ𝐶 with respect to Δ𝑤 gives the Newton update Δ𝑤(𝑛) = −𝐻⁻¹(𝑛)𝑔(𝑛).
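A sketch of a Newton step on the same kind of assumed quadratic cost, C(w) = w1² + 10·w2². For a quadratic, the Hessian is constant and a single Newton step lands on the minimum:

```python
def newton_step(w):
    g = [2.0 * w[0], 20.0 * w[1]]        # gradient of C(w) = w1^2 + 10*w2^2
    H_inv = [[0.5, 0.0], [0.0, 0.05]]    # inverse of the Hessian H = diag(2, 20)
    # Newton update: dw = -H^{-1} g
    dw = [-(H_inv[i][0] * g[0] + H_inv[i][1] * g[1]) for i in range(2)]
    return [wi + dwi for wi, dwi in zip(w, dw)]

w = newton_step([3.0, -7.0])   # a single step reaches the minimum at (0, 0)
```

Compare this with the steepest-descent sketch earlier, which needs many small steps on the same cost because the curvature differs strongly between the two coordinates.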
Associative memories
• Linear associative memories
Heteroassociative
Autoassociative
Hopfield network
Boltzmann machine
• Bidirectional associative memories
• Multidirectional associative memories
• Temporal associative memories
Associative Memory
• Pattern storage is an obvious pattern recognition task that is realized
using an ANN.
• This is a memory function, where the network is expected to store the
pattern information (not data).
• The patterns to be stored may be of spatial type or spatio-temporal
(pattern sequence) type.
• Typically, an ANN behaves like an associative memory, in which a
pattern is associated with another pattern, or with itself.
• This is in contrast with the random access memory which maps an
address to a data.
• ANN functions as a content addressable memory where data is mapped onto an address.

• The pattern information is stored in the weight matrix of a feedback neural network.

• The stable states of the network represent the stored patterns, which can be recalled by
providing an external stimulus.

• If the weight matrix stores the given patterns, then it becomes an autoassociative memory.

• If the weight matrix stores the association between a pair of patterns, the network becomes
a bidirectional associative memory. This is called heteroassociation between the two
patterns.

• If the weight matrix stores multiple associations among several (> 2) patterns, then the
network becomes a multidirectional associative memory.

• If the weights store the associations between adjacent pairs of patterns in a sequence of
patterns, then the network is called a temporal associative memory.
Some desirable characteristics of associative memories are:
• The network should have a large capacity, i.e., ability to store a large number of
patterns or pattern associations.
• The network should be fault tolerant in the sense that damage to a few units or
connections should not affect the performance in recall significantly.
• The network should be able to recall the stored pattern or the desired associated
pattern even if the input pattern is distorted or noisy.
• The network performance as an associative memory should degrade only
gracefully due to damage to some units or connections, or due to noise or
distortion in the input.
• The network should be flexible to accommodate new patterns or associations
(within the limits of its capacity) and to be able to eliminate unnecessary
patterns or associations.
Bidirectional Associative Memory (BAM)
• The objective is to store a set of pattern pairs in such a way that
any stored pattern pair can be recalled by giving either of the
patterns as input.
• The network is a two-layer heteroassociative neural network that
encodes binary or bipolar pattern pairs (al , bl ) using Hebbian
learning.
• It can learn on-line and it operates in discrete time steps.
• The associated bipolar patterns of the input and output layers are

    𝑎𝑙 ∈ {−1, +1}^𝑀 ,  𝑏𝑙 ∈ {−1, +1}^𝑁 ,  l = 1, ..., L

where L is the total number of training pattern pairs.
• The BAM weight matrix from the first layer to the second layer is given by

    𝑾 = Σ_{l=1}^{L} 𝒂𝒍 𝒃𝒍𝑻

• The associated binary patterns of the input and output layers are

    𝑝𝑙 ∈ {0, 1}^𝑀 ,  𝑞𝑙 ∈ {0, 1}^𝑁
The corresponding bipolar representation of the binary values is

    𝑎𝑙𝑖 = 2𝑝𝑙𝑖 − 1 ,  𝑏𝑙𝑖 = 2𝑞𝑙𝑖 − 1

The weight matrix from the second layer to the first layer is given by

    𝑾𝑻 = Σ_{l=1}^{L} 𝒃𝒍 𝒂𝒍𝑻
The activation equations for the bipolar case are as follows:

    𝑏𝑗 (m + 1) = +1 if 𝑦𝑗 > 0;  𝑏𝑗 (m) if 𝑦𝑗 = 0;  −1 if 𝑦𝑗 < 0,   where 𝑦𝑗 = Σ_{i=1}^{M} 𝑤𝑗𝑖 𝑎𝑖 (m)

    𝑎𝑖 (m + 1) = +1 if 𝑥𝑖 > 0;  𝑎𝑖 (m) if 𝑥𝑖 = 0;  −1 if 𝑥𝑖 < 0,   where 𝑥𝑖 = Σ_{j=1}^{N} 𝑤𝑖𝑗 𝑏𝑗 (m)

𝑎(𝑚) = [𝑎1 (𝑚), 𝑎2 (𝑚), … , 𝑎𝑀 (𝑚)]𝑇 is the output of the first layer at the mth iteration, and

𝑏(𝑚) = [𝑏1 (𝑚), 𝑏2 (𝑚), … , 𝑏𝑁 (𝑚)]𝑇 is the output of the second layer at the mth iteration.
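The activation equations above can be sketched as a small bipolar BAM. For the demo the weight matrix is stored as W = Σ b_l a_l^T (i.e., the transposed convention, so b = sgn(W a) directly); the two pattern pairs are made-up examples:

```python
def sgn_keep(v, prev):
    # threshold rule: keep the previous state when the net input is 0
    return 1 if v > 0 else (-1 if v < 0 else prev)

def bam_weights(pairs):
    # W[j][i] = sum_l b_l[j] * a_l[i]  (N x M)
    M, N = len(pairs[0][0]), len(pairs[0][1])
    return [[sum(b[j] * a[i] for a, b in pairs) for i in range(M)]
            for j in range(N)]

def bam_recall(W, a, iters=5):
    # alternate forward/backward passes until (here: a fixed number of) updates
    N, M = len(W), len(W[0])
    b = [1] * N
    for _ in range(iters):
        b = [sgn_keep(sum(W[j][i] * a[i] for i in range(M)), b[j]) for j in range(N)]
        a = [sgn_keep(sum(W[j][i] * b[j] for j in range(N)), a[i]) for i in range(M)]
    return a, b

pairs = [([1, -1, 1, -1], [1, 1, -1]), ([-1, -1, 1, 1], [-1, 1, 1])]
W = bam_weights(pairs)
a, b = bam_recall(W, [1, -1, 1, -1])
```

Feeding either member of a stored pair settles the network at that pair, which is the equilibrium behaviour described next.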
• For recall, ai(0), i = 1,2, ..., M, is applied to the first layer and the
activation equations are used in the forward and backward passes
several times until equilibrium is reached.
• The stable values bj(m), j = 1,2, ..., N are read out as the pattern
associated with the given input.
• Likewise the pattern at the first layer can be recalled given the
pattern at the second layer.
• The updates in the BAM are synchronous in the sense that the units
in each layer are updated simultaneously.
• BAM can be shown to be unconditionally stable using the Lyapunov
energy function

    E(a, b) = −𝑎𝑇 𝑊𝑏

For bipolar units, the possible state transitions of a unit are:
    ai : +1 → −1,  change in ai = −2
    ai : +1 → +1,  change in ai = 0
    ai : −1 → −1,  change in ai = 0
    ai : −1 → +1,  change in ai = +2

The change in energy due to a change ∆𝑎𝑖 in ai is ∆E = −∆𝑎𝑖 𝑥𝑖 , and the change due to
∆𝑏𝑗 in bj is ∆E = −∆𝑏𝑗 𝑦𝑗 . Since the activation rule changes a unit only in the
direction of the sign of its net input, ∆E ≤ 0 in every case.

This signifies that the energy either decreases or remains the same across iterations.
Thus the BAM reaches a stable state for any weight matrix derived from the given
patterns.
• The BAM is limited to binary or bipolar valued pattern pairs.
• The upper limit on the number of pattern pairs (L) that can be stored is
min (M,N).
• The performance of BAM depends on the nature of the pattern pairs
and their number.
• As the number of pattern pairs increases, the probability of error in
recall will also increase.
• The error in the recall will be large if the memory is filled to its
capacity.
• Extensions of the discrete BAM have been proposed to deal with
analog pattern pairs in continuous time.
• The resulting network is called Adaptive BAM (ABAM).
• In this case the pattern pairs are encoded using Hebbian learning with
a passive decay term in learning.
• For recall of the patterns, the additive model of the activation
dynamics is used for units in each layer separately.
• According to the ABAM theorem the memory is globally stable.
Multidirectional Associative Memory
• The bidirectional associative memory concept can be generalized to
store associations among more than two patterns.
• A multiple-association memory of this kind is called a multidirectional
associative memory (MAM).
• Let the MAM store associations among three patterns (al , bl , cl )
• To associate three patterns we require three layers of units, denoted
A, B, and C
• The dimensions of the three vectors al , bl , and cl are N1, N2, and N3,
respectively.
• The weight matrices are formed pairwise, as in the BAM, one for each pair of
layers: for example 𝑊𝐴𝐵 = Σ_{l} 𝑎𝑙 𝑏𝑙𝑇 for the layer pair (A, B), and similarly
for (B, C) and (C, A).
The outputs cj(m+1) and aj(m+1) are likewise computed. Each unit in a layer is updated
independently and synchronously based on the net input from units in the other two layers.
The updating is performed until a multidirectionally stable state is reached.
Temporal Associative Memory (TAM)
• The BAM can be used to store a sequence of temporal pattern vectors, and
recall the sequence of patterns.
• The basic idea is that the adjacent overlapping pattern pairs are to be stored in a
BAM.
• Let a1, a2, ..., aL be a sequence of L patterns, each with a dimensionality of M.
Then (a1, a2), (a2, a3), ..., (al, al+1), ..., (aL-1, aL) and (aL, a1) form the
pattern pairs to be stored in the BAM.
• Note that the last pattern in the sequence is paired with the first pattern.
• The weight matrix in the forward direction is given by

    𝑊 = Σ_{l=1}^{L} 𝑎𝑙 𝑎𝑙+1𝑇 ,   with 𝑎𝐿+1 = 𝑎1

The weight matrix for the reverse direction is given by the transpose of the forward
weight matrix, 𝑊 𝑇 .
The recall steps are exactly the same as for BAM. When stable conditions are reached, then it
is possible to recall the entire sequence of patterns from any one pattern in the sequence.
The TAM has the same kind of limitations as those of BAM in its error performance in recall
and also in its capacity for storing a given length (L) of a sequence of patterns.
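A temporal associative memory can be sketched as follows. Adjacent patterns in the sequence are stored as BAM pairs with wrap-around; for the demo the weights are stored as W = Σ a_{l+1} a_l^T, so a single forward pass sgn(W·a_l) steps the sequence. The example patterns are made up (mutually orthogonal, which keeps the recall exact):

```python
def sgn(v):
    return 1 if v > 0 else -1

def tam_weights(seq):
    # W[j][i] = sum_l a_{l+1}[j] * a_l[i], with wrap-around a_{L+1} = a_1
    M, L = len(seq[0]), len(seq)
    return [[sum(seq[(l + 1) % L][j] * seq[l][i] for l in range(L))
             for i in range(M)] for j in range(M)]

def step(W, a):
    # one forward pass: recalls the next pattern in the stored sequence
    return [sgn(sum(wji * ai for wji, ai in zip(row, a))) for row in W]

seq = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]
W = tam_weights(seq)
# stepping repeatedly cycles through the sequence, wrapping from the last
# pattern back to the first
nxt = step(W, seq[0])
```

As with the BAM, longer or more correlated sequences raise the cross-talk and eventually break the recall, which is the capacity limitation noted above.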
Pattern Mapping
• The multilayer feed forward neural network with error backpropagation learning was primarily developed to
overcome the limitation of a single layer perceptron for classification of hard problems (nonlinearly separable
classes) and to overcome the problem of training a multilayer perceptron (due to hard-limiting output
function) for these hard problems.
• In backpropagation network the objective is to capture (in the weights) the complex nonlinear hypersurfaces
separating the classes.
• The complexity of the surface is determined by the number of hidden units in the network. Any classification
problem specified by the training set of examples can be solved using a network with sufficient number of
hidden units.
• In such a case, the problem is more of a pattern association type than of a classification type, with no
restrictions on the associated patterns as in the case of a linear associative network.
• In a classification problem the input patterns belonging to a class are expected to have some common features
which are different for patterns belonging to another class.
• The idea is that, for a pattern belonging to any of the trained classes, the network is supposed to give the
correct classification.
Radial Basis Function (RBF) Networks
• The perceptron model cannot always guarantee the proper output
• Training pattern: trains the network
• Test pattern: verification
• Generalization is a curve-fitting problem, complicated by outliers
• Overfitting
• An outlier will not lie on a smooth fitted curve; however, if we force the model to
fit it, generalization will be affected.
• Generalization is affected by:
  • the size of the training set
  • the architecture of the network
  • the physical complexity of the problem
• If the architecture of the network is fixed, then the required size of the training
set is

    N = O(w/𝜖)

where w = number of free parameters and 𝜖 = fraction of permissible classification
errors; a low 𝜖 implies a higher value of N.
Cover's theorem
• The multilayer perceptron treats nonlinearly separable classes as a
surface-fitting problem
• With respect to the hidden-layer output, the problem becomes linearly separable,
due to the nonlinear mapping performed in the hidden layer
• The architecture of a radial basis function network consists of at least
one hidden layer with nonlinear units, followed by an output
layer with linear units.
• The input-space dimension determines the hidden space: the input is mapped into a
hidden space of higher dimension, where a linearly separating hyperplane exists;
reaching that space requires a nonlinear mapping.

Cover's theorem:
A pattern classification problem is more likely to be
linearly separable in a high-dimensional space than it is
in a low-dimensional space.
• Consider a set H of N patterns (x1, x2, x3, …, xN)
• Each pattern is assigned to one of two classes, H1 or H2
• For each 𝑥 ∈ 𝐻, define a vector of real-valued functions

    φ(𝑥) = [φ𝑖 (𝑥) | i = 1, 2, …, m1]

With these m1 functions, each pattern x is mapped into the space spanned by
{φ𝑖 (𝑥)}, i = 1, …, m1, called the hidden space (feature space). m1 should be made
as high as practical.
H1 and H2 are φ-separable if there exists an m1-dimensional weight vector 𝑤 such that

    𝑤 𝑇 φ(𝑥) > 0,  𝑥 ∈ 𝐻1
    𝑤 𝑇 φ(𝑥) < 0,  𝑥 ∈ 𝐻2

• 𝑤 𝑇 φ(𝑥) = 0 is the separating surface (a hyperplane in the φ-space)
• Its inverse image in the x-space,

    {𝑥: 𝑤 𝑇 φ(𝑥) = 0}

gives a hypersurface, i.e., a nonlinear separating function.
• A linear combination of r-wise products of the pattern coordinates gives the
hypersurface:

    Σ 𝑎𝑖1 𝑖2 …𝑖𝑟 𝑥𝑖1 𝑥𝑖2 … 𝑥𝑖𝑟 = 0,   0 ≤ i1 ≤ i2 ≤ … ≤ m0,   r < m0 (the number of inputs)

• The products 𝑥𝑖1 𝑥𝑖2 … 𝑥𝑖𝑟 are the monomial features whose separating surfaces
realize the dichotomies (binary partitions) of the pattern set
• m0Cr such products are possible
• φ(𝑥) is a nonlinear mapping from the input space to the hidden space
• Mapping to a higher dimension helps, but a larger number of hidden units increases
the cost
• X-OR problem: define Gaussian hidden functions

    𝜑𝑖 (𝑥) = 𝑒^(−‖𝑥−𝑡𝑖 ‖²)

    𝜑1 (𝑥) = 𝑒^(−‖𝑥−𝑡1 ‖²) ,  t1 = (0, 0)
    𝜑2 (𝑥) = 𝑒^(−‖𝑥−𝑡2 ‖²) ,  t2 = (1, 1)

• In the (x1, x2) input space the four corners (0,0), (0,1), (1,0), (1,1) are not
linearly separable. Their images in the (φ1, φ2) space are: (0,0) and (1,1) map to
opposite ends of the plane, while (0,1) and (1,0) map to the same intermediate point,
so a single straight line achieves φ-separability.
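The φ-space images of the four corners can be computed directly, with the centres t1 = (0,0) and t2 = (1,1) as in the example above:

```python
import math

def phi(x, t):
    # Gaussian basis function: exp(-||x - t||^2)
    d2 = (x[0] - t[0]) ** 2 + (x[1] - t[1]) ** 2
    return math.exp(-d2)

t1, t2 = (0.0, 0.0), (1.0, 1.0)
images = {x: (phi(x, t1), phi(x, t2))
          for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
# (0,0) -> (1.000, 0.135); (1,1) -> (0.135, 1.000);
# (0,1) and (1,0) both -> (0.368, 0.368)
```

In φ-space, the XOR-true corners (0,1) and (1,0) have φ1 + φ2 ≈ 0.736 while the XOR-false corners have φ1 + φ2 ≥ 1.135, so the line φ1 + φ2 = 0.9 separates the classes.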
• Consider an m0-dimensional input space
• Let 𝑋1 , 𝑋2 , … , 𝑋𝑁 be a sequence of random patterns
• All possible dichotomies of H = {𝑥𝑖 }, i = 1, ..., N, are assumed equiprobable
• As N increases, what is the probability of separability?
• Let P(N, m1) be the probability that a particular dichotomy picked at random is
φ-separable, where the class of separating surfaces chosen has m1 degrees of freedom.
Then

    P(N, m1) = 2^(1−N) Σ_{m=0}^{m1−1} C(N − 1, m)   for N > m1 − 1
    P(N, m1) = 1                                    for N ≤ m1 − 1

• The higher we make the dimension m1 of the hidden space, the closer the probability
P(N, m1) will be to unity.
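The separability probability above is easy to evaluate numerically, which also verifies the median property stated next (P(2·m1, m1) = 1/2 exactly):

```python
from math import comb

def p_separable(N, m1):
    # Cover's probability that N random patterns are phi-separable
    # with m1 degrees of freedom
    if N <= m1 - 1:
        return 1.0
    return 2.0 ** (1 - N) * sum(comb(N - 1, m) for m in range(m1))

# few patterns: always separable; N = 2*m1: exactly 1/2; many patterns: near 0
values = [p_separable(N, 3) for N in (2, 6, 20)]
```

The sharp drop of P(N, m1) around N = 2·m1 is the "separating capacity" interpretation developed on the following slides.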
• Let x1, x2, ..., xN be a sequence of random patterns (vectors). Let N be a
random variable defined as the largest integer such that this sequence is
φ-separable, where the class of separating surfaces has m1 degrees of freedom.
• Then the probability that N = n is given by

    P(N = n) = P(n, m1) − P(n + 1, m1) = (1/2)^n C(n − 1, m1 − 1),   n = 0, 1, 2, ...

• The expectation of the random variable N and its median are, respectively,
E(N) = 2m1 and median(N) = 2m1.
• Thus the expected maximum number of randomly assigned patterns (vectors)
that are linearly separable in a space of dimensionality m1 is equal to 2m1.
• For an interpretation of this result, recall the definition of a negative binomial
distribution, which equals the probability that k failures precede the rth success
in a long, repeated sequence of Bernoulli trials.
• In such a probabilistic experiment there are only two possible outcomes for each
trial, success or failure, and their probabilities remain the same throughout the
experiment. Let p and q denote the probabilities of success and failure,
respectively, with p + q = 1.
• The negative binomial distribution is defined by

    f(k; r, p) = C(r + k − 1, k) p^r q^k
• The design of an RBF network amounts to fitting a multidimensional surface through
the hidden layer
• The hidden-layer units form a set of basis functions that map the input to the
hidden layer:

    ℎ𝑗 = 𝜑𝑗 (‖𝑎 − 𝜇𝑗 ‖/𝜎𝑗 )

• The output of the kth unit in the (linear) output layer of the network is then the
weighted sum

    𝑠𝑘 = Σ_{j=0}^{J} 𝑤𝑘𝑗 ℎ𝑗

where j = 1, 2, ..., J indexes the hidden units and h0 = −1 is the output of the bias
unit, so that wk0 corresponds to the bias on the kth output unit.
• The nonlinear basis function 𝜑𝑗 (·) of the jth hidden unit is a function
of the normalized radial distance between the input vector
𝑎 = (𝑎1 , 𝑎2 , … , 𝑎𝑀 )𝑇 and the centre vector 𝜇𝑗 = (𝜇𝑗1 , 𝜇𝑗2 , … , 𝜇𝑗𝑀 )𝑇