Module 5 Notes
1. SELF ORGANIZATION
• The design of self-organizing networks focuses on neural networks that are capable of extracting useful
information from the environment within which they operate.
• The primary purpose of self-organization lies in the discovery of significant patterns or invariants of
the environment without the intervention of a teaching input.
• An important aspect of the implementation of such a network is that all adaptation must be based on
information that is available locally to the synapse: the pre- and postsynaptic neuron signals and
activations.
• Self-organization must lead eventually to a state of knowledge that provides useful information
concerning the environment from which patterns are drawn.
• Self-organizing systems are based on the following three principles:
o Adaptation in synapses is self-reinforcing
o LTM dynamics are based on competition
o LTM dynamics involve cooperation as well
Consider the linear neuron shown in Fig. (1a). The activation y and the signal s are computed in the
usual way:
𝑠 = 𝑦 = ∑_{i=1}^{n} 𝑤_i 𝑥_i = 𝑋^T 𝑊 − − − − − − − (2)
The simplest discrete time update rule that employs the Hebbian rule is then
𝑤_i^{k+1} = 𝑤_i^k + 𝛼 𝑥_i^k 𝑠_k − − − − − − − −(6)
where 𝛼 is the learning rate. In vector form, 𝑊^{k+1} = 𝑊^k + 𝛼 𝑠_k 𝑋_k, which is the classical Hebbian form.
The magnitude of this update depends on the inner product of 𝑋_k and 𝑊_k, which is the activation. But activations
measure similarities, and so inputs which are more similar to the weight vector receive larger
updates. The essential points to take home are:
• For any given weight vector, the effective learning rate depends upon the pattern being
presented.
• Learning law Eq. (10) essentially perturbs the weight vector in the direction of 𝑋_k by an amount
proportional to the signal 𝑠_k, or the activation 𝑦_k, since the signal of the linear neuron is simply
its activation.
• One can interpret the Hebb learning scheme of Eq. (8) as adding the input vector to the weight
vector in direct proportion to the similarity between the two.
Further, frequently occurring inputs will exercise greater influence during learning. This is so because
such vectors will tend to bias the weight vector increasingly towards themselves, and on subsequent
presentations will thus generate larger signal values, which will lead to even larger weight updates in
those directions. The neuron weight gradually biases itself towards the direction in the input space that
has frequently occurring patterns. This dominant direction is that of the eigenvector corresponding to
the largest eigenvalue of the correlation matrix of the input vector stream.
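This behaviour can be sketched numerically. The 2-D input stream, learning rate, and seed below are illustrative choices, not from the notes; the sketch shows the Hebb-trained weight vector aligning with the maximal eigendirection of R while its magnitude grows without bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D input stream whose correlation matrix has a
# clearly dominant eigendirection (both components track the same signal t).
t = rng.standard_normal(2000)
X = np.stack([t + 0.1 * rng.standard_normal(2000),
              t + 0.1 * rng.standard_normal(2000)], axis=1)

R = X.T @ X / len(X)                       # correlation matrix of the stream
eigvals, eigvecs = np.linalg.eigh(R)
eta_max = eigvecs[:, np.argmax(eigvals)]   # maximal eigendirection

w = rng.standard_normal(2)
alpha = 0.01
for x in X:
    s = x @ w                # linear neuron: the signal equals the activation
    w = w + alpha * s * x    # plain Hebbian update, Eq. (6)

cos_angle = abs(w @ eta_max) / np.linalg.norm(w)   # direction: aligned with eta_max
w_norm = np.linalg.norm(w)                         # magnitude: grows without bound
```

The direction is useful (cos of the angle to the maximal eigenvector approaches 1), but the norm explodes, which is exactly the problem the normalized update discussed below addresses.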
Let 𝑊̂ denote the equilibrium weight vector. This is the vector towards the neighbourhood of which
weight vectors converge after sufficient iterations elapse. In other words, the weight vector converges
towards a neighbourhood of the equilibrium weight vector and its trajectory remains confined to a
neighbourhood of 𝑊̂. In the equilibrium condition weight changes must average to zero. With this
definition and using Eq. (15) we can write:
𝐸[Δ𝑊_k] = 0 = 𝛼𝑹𝑊̂ − − − − − − − −(16)
or
𝑹𝑊̂ = 0 = 𝜆_null 𝑊̂ − − − − − − − − − (17)
which shows that 𝑊̂ is an eigenvector of R corresponding to the degenerate eigenvalue 𝜆_null = 0. The
equilibrium weight vector lies in the kernel of R. Since R is positive semi-definite, some of its
eigenvalues (say p of them) will be zero, and the others (say m of them) positive.
In general, any weight vector can be expressed in terms of the eigenvectors as:
𝑊 = ∑_{i=1}^{m} 𝛽_i 𝜂_i + 𝑊_null − − − − − −(18)
𝑊 = ∑_{i=1}^{m} 𝛽_i 𝜂_i + ∑_{j=1}^{p} 𝛾_j 𝜂_j′ − − − − − −(19)
where 𝑊_null is the component of W in the null subspace, and 𝜂_i and 𝜂_j′ denote the eigenvectors that
correspond to non-zero and zero eigenvalues respectively, with 𝛽_i and 𝛾_j being the respective
components of W in those directions. We will make use of this decomposition in a moment.
Consider a small perturbation 𝜖 about the equilibrium:
𝑊_k = 𝑊̂ + 𝜖 − − − − − − − (20)
Therefore, from equation (15),
𝐸[Δ𝑊_k] = 𝛼𝑅(𝑊̂ + 𝜖) − − − − − (21)
= 𝛼𝑅𝑊̂ + 𝛼𝑅𝜖 − − − −(22)
= 𝛼𝑅𝜖 − − − − − − − (23)
since 𝑅𝑊̂ = 0. Expressing 𝜖 using the eigen decomposition from equation (19) yields
𝜖 = ∑_{i=1}^{m} 𝛽_i 𝜂_i + ∑_{j=1}^{p} 𝛾_j 𝜂_j′ − − − − − −(24)
𝐸[Δ𝑊_k] = 𝛼 ∑_{i=1}^{m} 𝛽_i 𝜆_i 𝜂_i − − − − − − − − − − − − − (27)
where 𝜆_i denotes the i-th eigenvalue, and the second term vanishes since 𝑅𝜂_j′ = 𝜆_null 𝜂_j′ = 0, 𝜂_j′ being
in the kernel of R. Equation (27) clearly shows that with a small perturbation, the weight changes occur
in directions away from that of 𝑊̂, towards eigenvectors corresponding to non-zero eigenvalues.
Therefore, 𝑊̂ represents an unstable equilibrium. The dominant direction of movement is the one
corresponding to the largest eigenvalue, and this component grows in time. The weight vector
magnitude will therefore grow indefinitely and its direction approaches the eigenvector corresponding
to the largest eigenvalue, a direction referred to as the maximal eigenvector direction or maximal
eigendirection. Although the directionality of this final weight vector is useful, its unbounded
magnitude can create problems.
Where 𝑠_k = 𝑦_k = 𝑋_k^T 𝑊_k is the signal of the neuron at time instant k. Using a normalized weight
update procedure, equation (30) modifies to:
𝑤_i^{k+1} = (𝑤_i^k + 𝛼 𝑠_k 𝑥_i^k) / √( ∑_{j=1}^{n} (𝑤_j^k + 𝛼 𝑠_k 𝑥_j^k)² ) − − − − − − − −(31)
This learning equation essentially states that the new weights, computed from the standard Hebb
procedure, are normalized by the magnitude of the regular Hebb-updated weight vector. In an equilibrium
condition the weight changes should average to zero. First the expected weight change conditional on
𝑊_k is computed using equation (28):
𝐸[Δ𝑊_k] = 𝐸[𝑠_k 𝑋_k − 𝑠_k² 𝑊_k] − − − − − − − − − (32)
= 𝐸[(𝑋_k^T 𝑊_k) 𝑋_k − (𝑋_k^T 𝑊_k)(𝑋_k^T 𝑊_k) 𝑊_k]
= 𝐸[𝑋_k 𝑋_k^T 𝑊_k − (𝑊_k^T 𝑋_k)(𝑋_k^T 𝑊_k) 𝑊_k]
= 𝐸[𝑋_k 𝑋_k^T] 𝑊_k − (𝑊_k^T 𝐸[𝑋_k 𝑋_k^T] 𝑊_k) 𝑊_k
= 𝑅𝑊_k − (𝑊_k^T 𝑅𝑊_k) 𝑊_k − − − − − − − − − − − −(33)
Setting 𝐸[Δ𝑊_k] = 0 yields the equilibrium weight vector 𝑊̂:
𝐸[Δ𝑊_k] = 0 = 𝑹𝑊̂ − (𝑊̂^T 𝑹𝑊̂) 𝑊̂ − − − − − − − − − (34)
or
𝑹𝑊̂ = (𝑊̂^T 𝑹𝑊̂) 𝑊̂ = 𝜆𝑊̂ − − − − − − − − − − − (35)
where we define
𝜆 = (𝑊̂^T 𝑹𝑊̂) = (𝑊̂^T 𝜆𝑊̂) = 𝜆 𝑊̂^T 𝑊̂ = 𝜆‖𝑊̂‖² − − − − − − − (36)
so that ‖𝑊̂‖² = 1: the equilibrium weight vector is a unit-norm eigenvector of R.
where the Kronecker delta function 𝛿_ij is introduced because of the orthogonality of the eigenvectors.
For j ≠ i, equation (44) reduces to
𝜂_j^T 𝐸[Δ𝑊] = (𝜆_j − 𝜆_i) 𝜂_j^T 𝜖 − − − − − −(45)
Equation (45) clearly shows that the perturbation component along 𝜂_j must grow if 𝜆_j > 𝜆_i. In
other words, 𝜂_i is an unstable direction if 𝜆_i is not the maximum eigenvalue. Conversely, if 𝜆_i = 𝜆_max,
then 𝜆_j − 𝜆_i < 0 ∀ j ≠ i and the weight perturbation component in all directions 𝜂_j, j ≠ i, decays.
The maximal eigendirection is the only stable direction and the weight vector searches this
direction out, eventually settling down into a neighbourhood of the maximal eigenvector.
Oja's algorithm is summarized in Table 1.
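A minimal numerical sketch of the per-sample Oja update Δ𝑊_k = 𝛼 𝑠_k (𝑋_k − 𝑠_k 𝑊_k), whose expectation is exactly Eq. (32); the data stream, learning rate, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2-D stream with one dominant direction.
t = rng.standard_normal(5000)
X = np.stack([t, 0.5 * t + 0.2 * rng.standard_normal(5000)], axis=1)

R = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(R)
eta_max = eigvecs[:, np.argmax(eigvals)]   # maximal eigendirection of R

w = rng.standard_normal(2)
alpha = 0.01
for x in X:
    s = x @ w
    w = w + alpha * s * (x - s * w)   # Oja update: Hebb term plus signal-scaled decay

cos_angle = abs(w @ eta_max) / np.linalg.norm(w)
w_norm = np.linalg.norm(w)            # settles near 1: a unit eigenvector
```

Unlike the plain Hebbian rule, the norm stays close to one while the direction locks onto the maximal eigendirection.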
Any input vector X can be expanded in the eigenvector basis as
𝑋 = ∑_{i=1}^{n} 𝛾_i 𝜂_i − − − − − − − − − (46)
assuming non-null eigenvalues. It is now straightforward to reduce the feature dimensions required
for proper representation of data by neglecting those coefficients in Eq. (46) that have very small
variances (small eigenvalues), while retaining those coefficients with large variances (large
eigenvalues).
As mentioned, the eigenvalues can be ordered without any loss of generality from largest to
smallest, 𝜆_1 ≥ 𝜆_2 ≥ ⋯ ≥ 𝜆_n. By selecting m components to represent the n-dimensional vector we are
essentially saying that we can represent X by an approximate vector 𝑋̃,
𝑋̃ = ∑_{i=1}^{m} 𝛾_i 𝜂_i, m < n − − − − − −(47)
with an error
𝑒 = 𝑋 − 𝑋̃ = ∑_{i=1}^{n} 𝛾_i 𝜂_i − ∑_{i=1}^{m} 𝛾_i 𝜂_i − − − − − − − −(48)
= ∑_{i=m+1}^{n} 𝛾_i 𝜂_i − − − − − − − − − −(49)
It can be shown that the variances along the eigendirections are the eigenvalues of R. Hence the variance
of the error, 𝜎_e², is
𝜎_e² = ∑_{i=m+1}^{n} 𝜆_i − − − − − − − − − (50)
It then follows that the closer the neglected eigenvalues are to zero the more effective is the
approximation or dimensionality reduction.
In summary, to reduce dimension and perform subspace decomposition:
• Analyze the correlation matrix R of the data stream to find its eigenvectors and eigenvalues.
• Project the data onto the eigendirections.
• Discard the n − m components corresponding to the n − m smallest eigenvalues.
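The three-step recipe can be checked numerically; the synthetic 4-D data below is an illustrative stand-in for a real data stream:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative zero-mean 4-D data with decreasing variances along mixed directions.
n = 4000
Z = rng.standard_normal((n, 4)) * np.array([3.0, 1.5, 0.5, 0.1])
Qmix, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal mixing
X = Z @ Qmix.T

# Step 1: eigen-analyze the correlation matrix R of the data.
R = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                     # lambda_1 >= ... >= lambda_n
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 2: project the data onto the eigendirections.
gamma = X @ eigvecs

# Step 3: discard the n - m components with the smallest eigenvalues.
m = 2
X_hat = gamma[:, :m] @ eigvecs[:, :m].T

err = X - X_hat
sigma_e2 = np.mean(np.sum(err ** 2, axis=1))   # empirical error variance
lam_discarded = eigvals[m:].sum()              # Eq. (50): sum of discarded eigenvalues
```

The empirical error variance matches the sum of the discarded eigenvalues, confirming Eq. (50).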
Our linear neuron based on Oja's update rule extracts the maximal eigendirection or the first principal
component of the data stream. Rather interestingly, an m node linear neuron network that accepts n
dimensional inputs can extract the first m principal components if the following learning rule is adopted
for weight update:
Δ𝑤_{ij}^k = 𝛼 𝑠_i^k ( 𝑥_j^k − ∑_{l=1}^{i} 𝑤_{lj}^k 𝑠_l^k ) − − − − − − − −(51)
This rule is called Sanger's learning rule. Sanger's rule reduces to Oja's rule for a single neuron, and
it is therefore clear that this unit will search out the first eigenvector or first principal component of the
input data stream. It can be shown that using Sanger's rule, the weight vectors of the m units converge
to the first m eigenvectors, which correspond to the eigenvalues 𝜆_1 ≥ 𝜆_2 ≥ ⋯ ≥ 𝜆_m:
𝑊_i → ±𝜂_i − − − − − − − − − − − −(52)
where 𝜂_i is a normalized eigenvector of the correlation matrix R corresponding to the i-th largest eigenvalue
𝜆_i.
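Sanger's rule can be sketched compactly in matrix form, where a lower-triangular deflation term subtracts each unit's projection onto itself and the units above it; the stream, rates, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative 4-D stream with two dominant directions.
n_samples, n, m = 20000, 4, 2
Z = rng.standard_normal((n_samples, n)) * np.array([2.0, 1.2, 0.3, 0.1])
Qmix, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = Z @ Qmix.T

W = 0.1 * rng.standard_normal((m, n))   # row i is the weight vector of unit i
alpha = 0.002
for x in X:
    y = W @ x
    # Sanger's rule: Oja-like decay against each unit itself and the
    # units above it (a Gram-Schmidt-like deflation).
    W += alpha * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

R = X.T @ X / n_samples
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
top = eigvecs[:, order[:m]]

# Each row of W should approach +/- the corresponding eigenvector, Eq. (52).
align = [abs(W[i] @ top[:, i]) / np.linalg.norm(W[i]) for i in range(m)]
```

The first row behaves exactly like a single Oja unit; the deflation term steers the second row to the second principal component.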
Adaptation Law 1: A simple passive decay of weights proportional to the signal, and a reinforcement
proportional to the external input:
𝑊̇ = −𝛼𝑠𝑊 + 𝛽𝑋 − − − − − − − − − −(54)
Adaptation Law 2: The standard Hebbian form of adaptation with signal driven passive weight decay:
𝑊̇ = −𝛼𝑠𝑊 + 𝛽𝑠𝑋 − − − − − − − − − −(55)
If the coefficients α and β are equal to some constant k, then equation (55) reduces to
𝑊̇ = 𝑘𝑠(𝑋 − 𝑊) − − − − − − − − − −(56)
Therefore, for finite ‖𝑋̅‖ and ‖𝑊‖, the weight vector direction converges asymptotically to the
direction of 𝑋̅.
Also note that, since
‖𝑊‖ (d‖𝑊‖/dt) = 𝑠(𝛽 − 𝛼‖𝑊‖²) − − − − − − − − − (71)
the magnitude of the weight vector asymptotically approaches
‖𝑊‖ = √(𝛽/𝛼) − − − − − − − − − − − − − (72)
and therefore the asymptotic equilibrium weight vector 𝑊̂ can be written as
𝑊̂ = 𝑊(∞) = √(𝛽/𝛼) · 𝑋̅/‖𝑋̅‖ − − − − − − − − − (73)
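Equations (56) through (70) are not reproduced in these notes, so the sketch below integrates an assumed adaptation law of the form 𝑊̇ = 𝛽𝑠𝑋 − 𝛼𝑠²𝑊 (Hebbian reinforcement with signal-squared decay), chosen to be consistent with Eqs. (71) to (73); the constants and input statistics are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

alpha, beta, dt = 0.5, 2.0, 0.005
X_bar = np.array([1.2, 1.6])            # mean input, with norm 2

W = np.array([0.5, 0.5])                # start with a positive signal
for _ in range(20000):
    X = X_bar + 0.05 * rng.standard_normal(2)
    s = X @ W
    # Euler step of W' = beta*s*X - alpha*s^2*W: an assumed signal-squared
    # decay form consistent with Eqs. (71)-(73), not spelled out in the notes.
    W = W + dt * (beta * s * X - alpha * s ** 2 * W)

w_norm = np.linalg.norm(W)              # approaches sqrt(beta/alpha) = 2, Eq. (72)
cos_angle = (W @ X_bar) / (w_norm * np.linalg.norm(X_bar))   # direction of X_bar
```

The simulated equilibrium matches Eq. (73): magnitude √(β/α) and direction 𝑋̅/‖𝑋̅‖.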
From whence,
𝑹𝑊̂ = (𝛼 𝑋̅^T 𝑊̂ / 𝛽) 𝑊̂ − − − − − − − − − −(79)
= 𝜆𝑊̂ − − − − − − − − − − − − − (80)
where we have defined 𝜆 = 𝛼 𝑋̅^T 𝑊̂ / 𝛽. Clearly, eigenvectors of R are fixed-point solutions for W.
d/dt ( 𝜂_i^T 𝑊 / (‖𝜂_i‖‖𝑊‖) ) = 𝐸[ 𝜂_i^T 𝑊̇ / (‖𝜂_i‖‖𝑊‖) − (𝜂_i^T 𝑊 / (‖𝜂_i‖‖𝑊‖²)) (d‖𝑊‖/dt) ] − −(83)
= 𝐸[ 𝜂_i^T 𝑊̇ / (‖𝜂_i‖‖𝑊‖) − 𝜂_i^T 𝑊 𝑊^T 𝑊̇ / (‖𝜂_i‖‖𝑊‖³) ] − − − −(84)
⋮
= (𝛽 𝜂_i^T 𝑊 / (‖𝜂_i‖‖𝑊‖)) ( 𝜆_i − (𝑊^T 𝑹𝑊)/‖𝑊‖² ) − − − − − − − −(88)
= 𝛽 cos 𝜃_i ( 𝜆_i − (𝑊^T 𝑹𝑊)/‖𝑊‖² ) − − − − − − − −(89)
It follows from the Rayleigh quotient that the parenthetic term in Eq. (89) is guaranteed to be
positive only for 𝜆𝑖 = 𝜆𝑚𝑎𝑥 , which means that for the eigenvector 𝜂𝑚𝑎𝑥 the angle 𝜃𝑚𝑎𝑥 between
W and 𝜂𝑚𝑎𝑥 monotonically tends to zero as learning proceeds.
5. VECTOR QUANTIZATION
Vector quantization (VQ) is an important application of competitive learning which was originally
developed for information compression applications, and is now routinely employed to store and transmit
speech or vision data. In VQ, codebook vectors 𝑊𝐼 are required to be placed into the signal space in a way
that minimizes the average expected quantization error:
𝐸 = ∫ ‖𝑋 − 𝑊_J‖² 𝑃(𝑋) 𝑑𝑋 − − − − − − − − − (90)
where X ∈ ℝ^n is a signal vector. It is worth emphasizing that the winning codebook vector 𝑊_J is a
function of the input as well as of all codebook vectors 𝑊_i, i = 1, …, n: ‖𝑋 − 𝑊_J‖ = min_i {‖𝑋 − 𝑊_i‖}. VQ
algorithms exploit input vector structure to divide the input space into different regions, which will
ultimately be identified as clusters, classes or decision regions. Compression is then achieved by
identifying signal vectors in a region of the input space by a codebook vector. Then, all vectors that fall
within a region are represented by the codebook vector for that region. Thus, the name vector quantization.
Here, cluster neurons compete for activation from concatenated input patterns {𝑍𝑘 }. The unsupervised
VQ system compares the current sample vector 𝑍𝑘 with the C quantizing weight vectors 𝑊𝑗(𝑘) (weight
vector 𝑊_j at time instant k), and neuron J wins based on a standard Euclidean distance competition:
‖𝑊_J(k) − 𝑍_k‖ = min_j ‖𝑊_j(k) − 𝑍_k‖ − − − − − −(91)
where the discrete time index k has been introduced because weight vectors will change in accordance
with a competitive learning law. The winning neuron J learns the input pattern in accordance with
standard competitive learning in vector form,
𝑊_J(k+1) = 𝑊_J(k) + 𝜂_k (𝑍_k − 𝑊_J(k)) − − − − − − − (92)
The learning coefficient 𝜂_k should decrease gradually towards zero. For example, we could use the
form 𝜂_k = 𝜂_0 [1 − k/(2Q)] for an initial learning rate 𝜂_0 and Q training samples. This would make η
decrease linearly from 𝜂_0 to zero over 2Q iterations. The winning vector 𝑊_J learns; the losing vectors
do not learn.
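The competition of Eq. (91) and the winner-only update of Eq. (92) can be sketched as follows; the three-cluster data and the one-sample-per-cluster seeding are illustrative shortcuts, not part of the notes' algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative signal source: three well-separated 2-D clusters.
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
clusters = [c + 0.3 * rng.standard_normal((200, 2)) for c in centers]
W = np.array([cl[0] for cl in clusters])   # seed one codebook vector per cluster
Z = np.concatenate(clusters)
rng.shuffle(Z)

Q = len(Z)
eta0 = 0.5
for k, z in enumerate(Z):
    J = int(np.argmin(np.linalg.norm(W - z, axis=1)))   # Euclidean winner, Eq. (91)
    eta_k = eta0 * (1 - k / (2 * Q))                    # linearly decaying rate
    W[J] += eta_k * (z - W[J])                          # only the winner learns, Eq. (92)

# Each codebook vector should end up near one cluster centre.
dists = np.array([np.linalg.norm(W - c, axis=1).min() for c in centers])
```

After one pass, each codebook vector sits near one cluster centre, so each region of the input space is represented by a single vector.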
In practice, the components of the data samples {𝑍_k} are scaled so that all features have
equal weight in the distance measure. This scaling can be embedded within the distance computation
as shown below:
(𝑊_J(k) − 𝑍_k)^T Ω (𝑊_J(k) − 𝑍_k) = min_j { (𝑊_j(k) − 𝑍_k)^T Ω (𝑊_j(k) − 𝑍_k) } − − − − − − − (93)
Where,
Ω = diag(𝑤_1², …, 𝑤_{n+m}²) − − − − − − − −(94)
is a data dependent scaling matrix computed from the maximum and minimum feature values in each
dimension. This normalized Euclidean distance calculation ensures that no one variable dominates the
choice of the winner. In other words, no feature dominates and the classification is feature scale
invariant.
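A small sketch of the scaled-distance competition of Eqs. (93) and (94); the data and codebook values are hypothetical, and Ω is built here from inverse squared feature ranges as one example of a data-dependent scaling computed from per-dimension max and min values:

```python
import numpy as np

# Hypothetical data with two features on very different scales.
Z = np.array([[1000.0, 1.0],
              [2000.0, 2.0],
              [1500.0, 1.5]])

# Hypothetical codebook vectors.
W = np.array([[1100.0, 2.0],
              [1900.0, 1.0]])

# Omega, Eq. (94): inverse squared feature ranges from the data's max and min.
ranges = Z.max(axis=0) - Z.min(axis=0)
Omega = np.diag(1.0 / ranges ** 2)

def winner(z, W, Omega):
    d = W - z
    # Quadratic-form distance of Eq. (93) for every codebook vector.
    return int(np.argmin(np.einsum('ij,jk,ik->i', d, Omega, d)))

z = np.array([1400.0, 1.05])
plain = int(np.argmin(np.linalg.norm(W - z, axis=1)))   # dominated by feature 1
scaled = winner(z, W, Omega)                            # weighs both features
```

With plain Euclidean distance the large-scale first feature decides the winner alone; with Ω both features contribute comparably and the winner changes.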
Table 2 summarizes the adaptive vector quantization (AVQ) algorithm
Table 2: Operational Summary of AVQ Algorithm
• A good example is that of biological vision where three-dimensional visual images are routinely
mapped onto a two-dimensional retina and information is preserved in a way that permits perfect
visualization of a three-dimensional world.
• Interestingly, specialized sensory areas of the cortex respond to the available spectrum of real-
world signals in an ordered fashion.
• For example, the tonotopic map in the auditory cortex is perfectly ordered according to frequency.
• Early evidence for computational maps comes from the studies of Hubel and Wiesel on the primary
visual cortex of cats and monkeys.
• They discovered two kinds of maps:
o cortical maps that selectively responded to a preferred line orientation, and
o maps of ocular dominance that indicated the relative strength of excitatory influence on
each eye.
• In other words, there is ample neurobiological evidence for the formation of a hierarchy of visual,
somatosensory and auditory maps in the human brain.
• Primary maps are processed into secondary and tertiary maps through a sequence of temporal
processing that retains fine grained topological ordering as present in the original sensory signals.
• The main point is that in pattern recognition and information processing, an important step is the
identification or extraction of a set of features (invariants) which concentrate the essential
information of the input pattern set.
• This leads to economy of representation which results in significant savings in storage and
transmission bandwidth requirements.
• By virtue of their nature, computational maps provide a technique for efficient and dynamic
processing of information.
• To understand the concept of topology preservation, consider an m-neuron neural network where
the i-th neuron produces a signal 𝑠_i^k in response to the input pattern 𝐼_k ∈ ℝ^n.
• Assume that input vectors {𝐼𝑘 }𝑄𝑘=1 can be ordered according to some distance metric or in some
topological way: 𝐼1 ℛ 𝐼2 ℛ 𝐼3 … . ., where ℛ is some ordering relation.
• Then the network produces a one-dimensional topology-preserving map if for 𝑖_1 > 𝑖_2 > 𝑖_3 > …,
𝑠_{i_1}^1 = max_j {𝑠_j^1}, 𝑠_{i_2}^2 = max_j {𝑠_j^2}, 𝑠_{i_3}^3 = max_j {𝑠_j^3}, …
• a similar weight update procedure be employed on many adjacent neurons which comprise
topologically related subsets; and
• The resulting adjustment be such that it enhances the responses to the same or to a similar
input that occurs subsequently.
Figure 3: A Kohonen network typically comprises a two-dimensional planar field of neurons that have Mexican
hat lateral interconnections to support soft-competition amongst neurons
In the words of Kohonen, "An intriguing result from this sort of spatially correlated learning is that the
weight vectors tend to attain values that are ordered along the axes of the network."
Assume that the input vector 𝑋_k = (𝑥_1^k, …, 𝑥_n^k)^T ∈ ℝ^n (ℝ^n being the pattern space in
question) is presented to an m × m field of neurons. Due to the planar nature of the field, each neuron
is identified by the double row-column index ij, i, j = 1, …, m. The ij-th neuron has an incoming
weight vector 𝑊_ij(k) = (𝑤_{1,ij}^k, …, 𝑤_{n,ij}^k)^T ∈ ℝ^n at time instant k.
• The first step is to find the best matching weight vector 𝑊𝐼 𝐽(𝑘) for the present input, and to
thereby identify a neighbourhood 𝒩𝐼 𝐽 around the winning neuron.
• One can find the best matching weight vector by comparing the inner products 𝑋𝑘𝑇 𝑊𝑖𝑗(𝑘) of the
impinging input 𝑋𝑘 , with each weight vector 𝑊𝑖𝑗(𝑘) .
• The winning neuron is the one that has the largest inner product.
• Equivalently, with normalized weight vectors, the maximum inner product criterion reduces to
a minimum Euclidean distance criterion: the winning neuron is the one that minimizes the
distance ‖𝑋𝑘 − 𝑊𝑖𝑗(𝑘) ‖.
• Kohonen suggests using the latter since it is more general and applies to natural signals in
metric vector spaces:
‖𝑋_k − 𝑊_IJ(k)‖ = min_{i,j} {‖𝑋_k − 𝑊_ij(k)‖} − − − − − − − − − − − −(98)
• The SOFM algorithm assumes that the planar field of neurons has a Mexican hat interaction
that allows the identification of a neuron cluster around the winning neuron.
• The width of the neighbourhood is a function of time: as epochs of training elapse, the
neighbourhood shrinks.
• In the continuous version this shrinkage of neighbourhood width can be implemented by
gradually decreasing the positive lateral feedback and increasing the negative lateral feedback
in the Mexican hat function.
• The use of 𝒩𝐼 𝐽 simulates the quick formation of an activity cluster.
• Once a winning cluster has been identified, weights of neurons within the cluster are updated.
Adaptation in SOFM's takes place according to the second generalized law of adaptation:
𝑤̇𝑙,𝑖𝑗 = 𝜂𝑥𝑙 𝑠𝑖𝑗 − 𝛾(𝑠𝑖𝑗 )𝑤𝑙,𝑖𝑗 − − − − − − − (99)
which incorporates learning according to the Hebbian hypothesis but with a forgetting term scaled by
some function 𝛾(𝑠𝑖𝑗 ) of the neuronal signal 𝑠𝑖𝑗 . In the simplest case, 𝛾(𝑠𝑖𝑗 ) may be chosen to be linear
𝛾(𝑠𝑖𝑗 ) = 𝛽 𝑠𝑖𝑗 − − − − − − − − − (100)
Which yields,
𝑤̇𝑙,𝑖𝑗 = 𝜂𝑥𝑙 𝑠𝑖𝑗 − 𝛽𝑠𝑖𝑗 𝑤𝑙,𝑖𝑗 − − − − − −(101)
Choosing 𝜂 = 𝛽, without any loss of generality, further simplifies equation (101):
𝑤̇𝑙,𝑖𝑗 = 𝜂𝑠𝑖𝑗 (𝑥𝑙 − 𝑤𝑙,𝑖𝑗 ) − − − − − −(102)
The signal function can be chosen to be binary in nature: in accordance with this, neurons within 𝒩_IJ
fire and others do not. This is simply stated as follows:
1 𝑖𝑓 𝑖𝑗 ∈ 𝒩𝐼 𝐽
𝑠𝑖𝑗 = { − − − − − − − −(103)
0 𝑖𝑓 𝑖𝑗 ∉ 𝒩𝐼 𝐽
With this signal behaviour equation (102) simplifies to:
𝜂(𝑥𝑙 − 𝑤𝑙,𝑖𝑗 ) 𝑖𝑓 𝑖𝑗 ∈ 𝒩𝐼 𝐽
𝑤̇𝑙,𝑖𝑗 = { − − − − − − − −(104)
0 𝑖𝑓 𝑖𝑗 ∉ 𝒩𝐼 𝐽
In vector notation the SOFM adaptation law can be rewritten as
𝜂(𝑋 − 𝑊𝑖𝑗 ) 𝑖𝑓 𝑖𝑗 ∈ 𝒩𝐼 𝐽
𝑊̇𝑖𝑗 = { − − − − − − − −(105)
0 𝑖𝑓 𝑖𝑗 ∉ 𝒩𝐼 𝐽
Weight vectors of winning clusters update themselves to resemble the input vector X to a greater
extent. In general, the learning rate 𝜂 and the neighbourhood 𝒩𝐼 𝐽 are both functions of time, each
having a contracting nature.
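The complete training loop can be sketched as follows. A Gaussian neighbourhood function is used here as a common smooth stand-in for the shrinking binary neighbourhood 𝒩_IJ of Eq. (103); the grid size, decay schedules, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

m = 8                                    # m x m planar field of neurons
grid = np.array([(i, j) for i in range(m) for j in range(m)], dtype=float)
W = rng.random((m * m, 2))               # weight vectors live in the input space

X = rng.random((5000, 2))                # inputs drawn uniformly from the unit square
n_steps = len(X)
eta0, eta_end = 0.5, 0.02
sigma0, sigma_end = m / 2.0, 0.5         # neighbourhood width shrinks over time

for k, x in enumerate(X):
    frac = k / n_steps
    eta = eta0 * (eta_end / eta0) ** frac
    sigma = sigma0 * (sigma_end / sigma0) ** frac
    # Winner: minimum Euclidean distance, Eq. (98).
    IJ = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    # Gaussian neighbourhood on the grid, standing in for the binary N_IJ.
    h = np.exp(-np.sum((grid - grid[IJ]) ** 2, axis=1) / (2.0 * sigma ** 2))
    # Soft version of the update law, Eq. (105).
    W += eta * h[:, None] * (x - W)

# Topology check: grid-adjacent neurons should hold nearby weight vectors.
adj = np.mean([np.linalg.norm(W[i * m + j] - W[i * m + j + 1])
               for i in range(m) for j in range(m - 1)])
```

After training, the weight vectors spread over the unit square and neighbouring neurons on the grid hold nearby weight vectors, which is the topology-preservation property.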
Table 4 presents the operational summary of the self-organizing feature map algorithm.
Another way to visualize a feature map is by labelling each of the neurons with the test pattern that
has maximally excited the neuron in question. In other words, the pattern in question becomes the
best stimulus for that neuron. This procedure closely resembles the way sites in the brain get labelled
by stimulus features that maximally excite neurons at that site. Such a labelling procedure will
eventually produce a well-ordered partition of the map such that groups or clusters of neurons will
respond maximally to the same class of pattern. Such a topographic map captures similarity relationships
and can be used for pattern classification.
Example: The iris data set comprises 150 patterns of feature measurements made on the three species
of the Iris flower: iris setosa, iris versicolor, and iris virginica. The four features measured are the
petal length, petal width, sepal length and sepal width. There are 50 patterns of each of the three
classes. Figure 4 plots the 6 possible projections of the 4-dimensional feature vectors. This helps give
an idea of the range of values of each feature.
The iris data set is presented to a Kohonen map, and the map is labelled after training. Since
the SOFM algorithm uses Euclidean distances, the scales of the individual variables play an important role
in determining what the final map will look like. If the range of values of one variable is much larger
than that of the other variables, then that variable will probably dominate the organization of the
SOFM. Therefore, components of the data are usually normalized prior to presentation so that each
variable has unit variance.
After training the map, assuming an 8 × 8 grid of neurons, the map is labelled. The best matching
neuron of the map is found for each pattern of the data set. This neuron is the one with the minimum
distance between its weight vector and the pattern under consideration. The species label
corresponding to the pattern is assigned to the neuron. Each neuron is finally assigned the label that
corresponds to the species whose patterns elicited a maximal response with the highest frequency.
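The labelling procedure itself can be sketched with a hypothetical trained 2 × 2 map and a two-class synthetic data set standing in for the iris data and the 8 × 8 grid:

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical trained 2x2 map: one weight vector per neuron.
W = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Synthetic labelled patterns: class 0 near (0,0), class 1 near (1,1).
data = np.concatenate([rng.normal([0.0, 0.0], 0.15, size=(30, 2)),
                       rng.normal([1.0, 1.0], 0.15, size=(30, 2))])
labels = np.array([0] * 30 + [1] * 30)

# Tally, for each neuron, how often each class makes it the best match.
votes = np.zeros((len(W), 2), dtype=int)
for z, c in zip(data, labels):
    bmu = int(np.argmin(np.linalg.norm(W - z, axis=1)))   # best matching neuron
    votes[bmu, c] += 1

# Label each neuron with the class that excited it with the highest
# frequency; neurons that never won remain unlabelled (-1).
neuron_labels = np.where(votes.sum(axis=1) > 0, votes.argmax(axis=1), -1)
```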
units. Phonetic units are called phonemes. These are linguistic abstractions that do not directly
correspond to distinct acoustic segments. They have varying lengths, may partly overlap, and
their acoustic appearance varies according to their context as well as from speaker to speaker.
The primary objective of speech recognition is to transcribe carefully spoken utterances into
phoneme sequences and to transform them into orthographically correct written text. Kohonen's
neural phonetic typewriter represents a classic example of an automated speech recognition system that uses
one of two neural network modules: the Self-organizing Feature Map (SOFM) or Learning Vector
Quantization (LVQ).
Either of these methods is used to classify speech segments into phoneme classes time
synchronously. The results are decoded into phoneme strings using Hidden Markov Models (HMM),
a statistical methodology for representing and recognizing stochastic time series. The ANN approach
differs from the conventional HMM methodology by using ANNs as a component to emphasize
discrimination between phonemes, whereas the aim of HMM is representation.
The neural phonetic typewriter was designed to work with the Finnish language, which has 21
phonemes that correspond almost uniquely with graphemes. The equipment was developed at the
Helsinki University of Technology. It is built around a coprocessor board that has the capacity to
perform the classification of short-time speech spectra by the SOM, as well as to apply symbolic
postprocessing to correct for coarticulation errors. It was able to transcribe unlimited Finnish and
Japanese dictation into text at a letter accuracy exceeding 92 per cent. The "neural" model applied in
this design is a special supervised SOM.
Input vectors comprise a 15-component short-time spectrum vector 𝑋𝑠 . Preprocessing of the
speech signal to generate this vector involves a number of steps. These include recording on a noise
cancelling microphone; analog-to-digital conversion; fast Fourier transform; low pass filtering and
then grouping the spectral channels into a 15-component real valued pattern vector. The final step of
preprocessing involves subtraction of the average from all components and normalization of the
resulting vector to constant length. Each input vector is tagged to a class using an output vector 𝑋𝑢 .
This is a binary vector with its components assigned 0 or 1 to indicate different phoneme classes.
During recognition, 𝑋𝑢 is not considered. A problem faced in discrimination is that the distributions
of the spectra of different classes overlap. It is here that neural networks can effectively define class
separating boundaries that generalize well.
In many applications, there is little or no information about the input distribution or the size of the
data set. In such cases, it is difficult to determine a priori the number of neurons to use in the self-organizing
feature map, or for that matter even in classical vector quantization. An incremental clustering algorithm
called Growing Neural Gas (GNG) makes no assumptions on the number of neurons used by the network.
Growing Neural Gas (GNG) is an unsupervised incremental clustering algorithm that can perform
dimensionality reduction while discovering a topological map of an input data distribution. GNG
incrementally creates a network of nodes that quantize an input space described by a distribution of points
in ℝ𝑛 , which might be drawn from an unknown probability density function P(X). Position vectors of
GNG nodes in the input space act as codebook vectors for data clusters in ℝ𝑛 . GNG can also be used to
discover structures that closely reflect the topology of the input distribution. An appealing aspect of GNG
is that it is adaptive in the sense that even if the input distribution changes slowly over time, GNG nodes
will gradually migrate so as to cover the new distribution.
𝑋𝑟 = 0.5(𝑋𝑝 + 𝑋𝑞 )
d. Insert edges connecting the new node r with nodes p and q, and remove the original
edge between p and q.
e. Decrease the error variables of p and q by a factor α. Initialize the local error variable of r to the
new value of the error variable of p.
10. Decrease all node error variables by a factor β.
11. If a stopping criterion (e.g., network size or some performance measure) is not yet fulfilled,
return to Step 1.
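Steps d, e, and 10 above can be sketched on a toy graph. The dict-based state, the concrete numbers, and the multiplicative reading of "decrease by a factor" (following Fritzke's original GNG formulation) are illustrative assumptions:

```python
import numpy as np

# Toy state: three nodes with position vectors, accumulated errors, and edges.
nodes = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 0.0]), 2: np.array([0.0, 2.0])}
errors = {0: 5.0, 1: 3.0, 2: 1.0}
edges = {(0, 1), (0, 2)}

alpha_gng = 0.5    # error-decay factor for p and q (step e)
beta_gng = 0.995   # global error-decay factor (step 10)

# p: node with the largest accumulated error; q: its neighbour with largest error.
p = max(errors, key=errors.get)
neighbours = [b for (a, b) in edges if a == p] + [a for (a, b) in edges if b == p]
q = max(neighbours, key=errors.get)

# Insert node r halfway between p and q: X_r = 0.5 (X_p + X_q).
r = max(nodes) + 1
nodes[r] = 0.5 * (nodes[p] + nodes[q])

# Step d: connect r to p and q, and drop the direct p-q edge.
edges.discard((p, q)); edges.discard((q, p))
edges |= {(p, r), (q, r)}

# Step e: decay the errors of p and q, then seed r's error from p's new value.
errors[p] *= alpha_gng
errors[q] *= alpha_gng
errors[r] = errors[p]

# Step 10: decay every node's error variable.
for key in errors:
    errors[key] *= beta_gng
```

Inserting the new node where the accumulated quantization error is largest is what lets GNG grow capacity exactly where the current codebook represents the data worst.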