Module 5 Notes
1. SELF ORGANIZATION
• The design of self-organizing networks focuses on neural networks that are capable of extracting useful
information from the environment within which they operate.
• The primary purpose of self-organization lies in the discovery of significant patterns or invariants of
the environment without the intervention of a teaching input.
• An important aspect of the implementation of such a network is that all adaptation must be based on
information that is available locally to the synapse: the pre- and postsynaptic neuron signals and
activations.
• Self-organization must lead eventually to a state of knowledge that provides useful information
concerning the environment from which patterns are drawn.
• Self-organizing systems are based on the following three principles:
o Adaptation in synapses is self-reinforcing
o LTM dynamics are based on competition
o LTM dynamics involve cooperation as well
Consider the linear neuron shown in Fig. (1a). The activation y and the signal s are computed in the
usual way:
𝑠 = 𝑦 = ∑_{i=1}^{n} 𝑤_i 𝑥_i = 𝑋^T 𝑊 − − − − − − − (2)
The simplest discrete time update rule that employs the Hebbian rule is then
𝑤_i^{k+1} = 𝑤_i^k + 𝛼 𝑥_i^k 𝑠_k − − − − − − − −(6)
where 𝛼 is the learning rate. In vector form, 𝑊^{k+1} = 𝑊^k + 𝛼 𝑠_k 𝑋_k, which is the classical Hebbian form.
The magnitude of this update depends on the inner product of 𝑋_k and 𝑊_k, which is the activation. But activations
measure similarities, and so inputs which are more similar to the weight vector receive larger
updates. The essential points to take home are:
• For any given weight vector, the effective learning rate depends upon the pattern being
presented.
• Learning law Eq. (10) essentially perturbs the weight vector in the direction of 𝑋_k by an amount
proportional to the signal 𝑠_k, or the activation 𝑦_k, since the signal of the linear neuron is simply
its activation.
• One can interpret the Hebb learning scheme of Eq. (8) as adding the input vector to the weight
vector in direct proportion to the similarity between the two.
Further, frequently occurring inputs will exercise greater influence during learning. This is so because
such vectors will tend to bias the weight vector increasingly towards themselves, and on subsequent
presentations will thus generate larger signal values, which will lead to even larger weight updates in
those directions. The neuron weight gradually biases itself towards the direction in the input space that
has frequently occurring patterns. This dominant direction is that of the eigenvector corresponding to
the largest eigenvalue of the correlation matrix of the input vector stream.
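This behaviour can be sketched numerically. The 2-D input stream, learning rate, and seed below are illustrative choices, not from the notes; the sketch shows the Hebb-trained weight vector aligning with the maximal eigendirection of R while its magnitude grows without bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D input stream whose correlation matrix has a
# clearly dominant eigendirection (both components track the same signal t).
t = rng.standard_normal(2000)
X = np.stack([t + 0.1 * rng.standard_normal(2000),
              t + 0.1 * rng.standard_normal(2000)], axis=1)

R = X.T @ X / len(X)                       # correlation matrix of the stream
eigvals, eigvecs = np.linalg.eigh(R)
eta_max = eigvecs[:, np.argmax(eigvals)]   # maximal eigendirection

w = rng.standard_normal(2)
alpha = 0.01
for x in X:
    s = x @ w                # linear neuron: the signal equals the activation
    w = w + alpha * s * x    # plain Hebbian update, Eq. (6)

cos_angle = abs(w @ eta_max) / np.linalg.norm(w)   # direction: aligned with eta_max
w_norm = np.linalg.norm(w)                         # magnitude: grows without bound
```

The direction is useful (cos of the angle to the maximal eigenvector approaches 1), but the norm explodes, which is exactly the problem the normalized update discussed below addresses.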
Let 𝑊̂ denote the equilibrium weight vector. This is the vector towards the neighbourhood of which
weight vectors converge after sufficient iterations elapse. In other words, the weight vector converges
towards a neighbourhood of the equilibrium weight vector and its trajectory remains confined to a
neighbourhood of 𝑊̂. In the equilibrium condition weight changes must average to zero. With this
definition and using Eq. (15) we can write:
𝐸[Δ𝑊_k] = 0 = 𝛼𝑹𝑊̂ − − − − − − − −(16)
or
𝑹𝑊̂ = 0 = 𝜆_null 𝑊̂ − − − − − − − − − (17)
which shows that 𝑊̂ is an eigenvector of R corresponding to the degenerate eigenvalue 𝜆_null = 0. The
equilibrium weight vector lies in the kernel of R. Since R is positive semi-definite, some of its
eigenvalues (say p of them) will be zero, and the others (say m of them) positive.
In general, any weight vector can be expressed in terms of the eigenvectors as:
𝑊 = ∑_{i=1}^{m} 𝛽_i 𝜂_i + 𝑊_null − − − − − −(18)
𝑊 = ∑_{i=1}^{m} 𝛽_i 𝜂_i + ∑_{j=1}^{p} 𝛾_j 𝜂_j′ − − − − − −(19)
where 𝑊_null is the component of W in the null subspace, and 𝜂_i and 𝜂_j′ denote the eigenvectors that
correspond to non-zero and zero eigenvalues respectively, with 𝛽_i and 𝛾_j being the respective
components of W in those directions. We will make use of this decomposition in a moment.
Consider a small perturbation 𝜖 about the equilibrium:
𝑊_k = 𝑊̂ + 𝜖 − − − − − − − (20)
Therefore, from equation (15),
𝐸[Δ𝑊_k] = 𝛼𝑅(𝑊̂ + 𝜖) − − − − − (21)
= 𝛼𝑅𝑊̂ + 𝛼𝑅𝜖 − − − −(22)
= 𝛼𝑅𝜖 − − − − − − − (23)
since 𝑅𝑊̂ = 0. Expressing 𝜖 using the eigen decomposition from equation (19) yields
𝜖 = ∑_{i=1}^{m} 𝛽_i 𝜂_i + ∑_{j=1}^{p} 𝛾_j 𝜂_j′ − − − − − −(24)
𝐸[Δ𝑊_k] = 𝛼 ∑_{i=1}^{m} 𝛽_i 𝜆_i 𝜂_i − − − − − − − − − − − − − (27)
where 𝜆_i denotes the i-th eigenvalue, and the second term vanishes since 𝑅𝜂_j′ = 𝜆_null 𝜂_j′ = 0, 𝜂_j′ being
in the kernel of R. Equation (27) clearly shows that with a small perturbation, the weight changes occur
in directions away from that of 𝑊̂, towards eigenvectors corresponding to non-zero eigenvalues.
Therefore, 𝑊̂ represents an unstable equilibrium. The dominant direction of movement is the one
corresponding to the largest eigenvalue, and this component grows in time. The weight vector
magnitude will therefore grow indefinitely and its direction approaches the eigenvector corresponding
to the largest eigenvalue, a direction referred to as the maximal eigenvector direction or maximal
eigendirection. Although the directionality of this final weight vector is useful, its unbounded
magnitude can create problems.
Where 𝑠_k = 𝑦_k = 𝑋_k^T 𝑊_k is the signal of the neuron at time instant k. Using a normalized weight
update procedure, equation (30) modifies to:
𝑤_i^{k+1} = (𝑤_i^k + 𝛼 𝑠_k 𝑥_i^k) / √( ∑_{j=1}^{n} (𝑤_j^k + 𝛼 𝑠_k 𝑥_j^k)² ) − − − − − − − −(31)
This learning equation essentially states that the new weights, computed from the standard Hebb
procedure, are normalized by the magnitude of the regular Hebb-updated weight vector. In an equilibrium
condition the weight changes should average to zero. First the expected weight change conditional on
𝑊_k is computed using equation (28):
𝐸[Δ𝑊_k] = 𝐸[𝑠_k 𝑋_k − 𝑠_k² 𝑊_k] − − − − − − − − − (32)
= 𝐸[(𝑋_k^T 𝑊_k) 𝑋_k − (𝑋_k^T 𝑊_k)(𝑋_k^T 𝑊_k) 𝑊_k]
= 𝐸[𝑋_k 𝑋_k^T 𝑊_k − (𝑊_k^T 𝑋_k)(𝑋_k^T 𝑊_k) 𝑊_k]
= 𝐸[𝑋_k 𝑋_k^T] 𝑊_k − (𝑊_k^T 𝐸[𝑋_k 𝑋_k^T] 𝑊_k) 𝑊_k
= 𝑅𝑊_k − (𝑊_k^T 𝑅𝑊_k) 𝑊_k − − − − − − − − − − − −(33)
Setting 𝐸[Δ𝑊_k] = 0 yields the equilibrium weight vector 𝑊̂:
𝐸[Δ𝑊_k] = 0 = 𝑹𝑊̂ − (𝑊̂^T 𝑹𝑊̂) 𝑊̂ − − − − − − − − − (34)
or
𝑹𝑊̂ = (𝑊̂^T 𝑹𝑊̂) 𝑊̂ = 𝜆𝑊̂ − − − − − − − − − − − (35)
where we define
𝜆 = (𝑊̂^T 𝑹𝑊̂) = (𝑊̂^T 𝜆𝑊̂) = 𝜆 𝑊̂^T 𝑊̂ = 𝜆‖𝑊̂‖² − − − − − − − (36)
so that ‖𝑊̂‖² = 1: the equilibrium weight vector is a unit-norm eigenvector of R.
where the Kronecker delta function 𝛿_ij is introduced because of the orthogonality of the eigenvectors.
For j ≠ i, equation (44) reduces to
𝜂_j^T 𝐸[Δ𝑊] = (𝜆_j − 𝜆_i) 𝜂_j^T 𝜖 − − − − − −(45)
Equation (45) clearly shows that the perturbation component along 𝜂_j must grow if 𝜆_j > 𝜆_i. In
other words, 𝜂_i is an unstable direction if 𝜆_i is not the maximum eigenvalue. Conversely, if 𝜆_i = 𝜆_max,
then 𝜆_j − 𝜆_i < 0 ∀ j ≠ i and the weight perturbation component in all directions 𝜂_j, j ≠ i, decays.
The maximal eigendirection is the only stable direction and the weight vector searches this
direction out, eventually settling down into a neighbourhood of the maximal eigenvector.
Oja's algorithm is summarized in Table 1.
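A minimal numerical sketch of the per-sample Oja update Δ𝑊_k = 𝛼 𝑠_k (𝑋_k − 𝑠_k 𝑊_k), whose expectation is exactly Eq. (32); the data stream, learning rate, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2-D stream with one dominant direction.
t = rng.standard_normal(5000)
X = np.stack([t, 0.5 * t + 0.2 * rng.standard_normal(5000)], axis=1)

R = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(R)
eta_max = eigvecs[:, np.argmax(eigvals)]   # maximal eigendirection of R

w = rng.standard_normal(2)
alpha = 0.01
for x in X:
    s = x @ w
    w = w + alpha * s * (x - s * w)   # Oja update: Hebb term plus signal-scaled decay

cos_angle = abs(w @ eta_max) / np.linalg.norm(w)
w_norm = np.linalg.norm(w)            # settles near 1: a unit eigenvector
```

Unlike the plain Hebbian rule, the norm stays close to one while the direction locks onto the maximal eigendirection.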
Any input vector X can be expanded in the eigenvector basis as
𝑋 = ∑_{i=1}^{n} 𝛾_i 𝜂_i − − − − − − − − − (46)
assuming non-null eigenvalues. It is now straightforward to reduce the feature dimensions required
for proper representation of data by neglecting those coefficients in Eq. (46) that have very small
variances (small eigenvalues), while retaining those coefficients with large variances (large
eigenvalues).
As mentioned, the eigenvalues can be ordered without any loss of generality from largest to
smallest, 𝜆_1 ≥ 𝜆_2 ≥ ⋯ ≥ 𝜆_n. By selecting m components to represent the n-dimensional vector we are
essentially saying that we can represent X by an approximate vector 𝑋̃,
𝑋̃ = ∑_{i=1}^{m} 𝛾_i 𝜂_i, m < n − − − − − −(47)
with an error
𝑒 = 𝑋 − 𝑋̃ = ∑_{i=1}^{n} 𝛾_i 𝜂_i − ∑_{i=1}^{m} 𝛾_i 𝜂_i − − − − − − − −(48)
= ∑_{i=m+1}^{n} 𝛾_i 𝜂_i − − − − − − − − − −(49)
It can be shown that the variances along the eigendirections are the eigenvalues of R. Hence the variance
of the error, 𝜎_e², is
𝜎_e² = ∑_{i=m+1}^{n} 𝜆_i − − − − − − − − − (50)
It then follows that the closer the neglected eigenvalues are to zero the more effective is the
approximation or dimensionality reduction.
In summary, to reduce dimension and perform subspace decomposition:
• Analyze the correlation matrix R of the data stream to find its eigenvectors and eigenvalues.
• Project the data onto the eigendirections.
• Discard the n − m components corresponding to the n − m smallest eigenvalues.
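The three-step recipe can be checked numerically; the synthetic 4-D data below is an illustrative stand-in for a real data stream:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative zero-mean 4-D data with decreasing variances along mixed directions.
n = 4000
Z = rng.standard_normal((n, 4)) * np.array([3.0, 1.5, 0.5, 0.1])
Qmix, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal mixing
X = Z @ Qmix.T

# Step 1: eigen-analyze the correlation matrix R of the data.
R = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                     # lambda_1 >= ... >= lambda_n
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 2: project the data onto the eigendirections.
gamma = X @ eigvecs

# Step 3: discard the n - m components with the smallest eigenvalues.
m = 2
X_hat = gamma[:, :m] @ eigvecs[:, :m].T

err = X - X_hat
sigma_e2 = np.mean(np.sum(err ** 2, axis=1))   # empirical error variance
lam_discarded = eigvals[m:].sum()              # Eq. (50): sum of discarded eigenvalues
```

The empirical error variance matches the sum of the discarded eigenvalues, confirming Eq. (50).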
Our linear neuron based on Oja's update rule extracts the maximal eigendirection or the first principal
component of the data stream. Rather interestingly, an m node linear neuron network that accepts n
dimensional inputs can extract the first m principal components if the following learning rule is adopted
for weight update:
Δ𝑤_{ij}^k = 𝛼 𝑠_i^k ( 𝑥_j^k − ∑_{l=1}^{i} 𝑤_{lj}^k 𝑠_l^k ) − − − − − − − −(51)
This rule is called Sanger's learning rule. Sanger's rule reduces to Oja's rule for a single neuron, and
it is therefore clear that this unit will search out the first eigenvector or first principal component of the
input data stream. It can be shown that using Sanger's rule, the weight vectors of the m units converge
to the first m eigenvectors, which correspond to the eigenvalues 𝜆_1 ≥ 𝜆_2 ≥ ⋯ ≥ 𝜆_m:
𝑊_i → ±𝜂_i − − − − − − − − − − − −(52)
where 𝜂_i is a normalized eigenvector of the correlation matrix R corresponding to the i-th largest eigenvalue
𝜆_i.
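Sanger's rule can be sketched compactly in matrix form, where a lower-triangular deflation term subtracts each unit's projection onto itself and the units above it; the stream, rates, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative 4-D stream with two dominant directions.
n_samples, n, m = 20000, 4, 2
Z = rng.standard_normal((n_samples, n)) * np.array([2.0, 1.2, 0.3, 0.1])
Qmix, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = Z @ Qmix.T

W = 0.1 * rng.standard_normal((m, n))   # row i is the weight vector of unit i
alpha = 0.002
for x in X:
    y = W @ x
    # Sanger's rule: Oja-like decay against each unit itself and the
    # units above it (a Gram-Schmidt-like deflation).
    W += alpha * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

R = X.T @ X / n_samples
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
top = eigvecs[:, order[:m]]

# Each row of W should approach +/- the corresponding eigenvector, Eq. (52).
align = [abs(W[i] @ top[:, i]) / np.linalg.norm(W[i]) for i in range(m)]
```

The first row behaves exactly like a single Oja unit; the deflation term steers the second row to the second principal component.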
Adaptation Law 1: A simple passive decay of weights proportional to the signal, and a reinforcement
proportional to the external input:
𝑊̇ = −𝛼𝑠𝑊 + 𝛽𝑋 − − − − − − − − − −(54)
Adaptation Law 2: The standard Hebbian form of adaptation with signal driven passive weight decay:
𝑊̇ = −𝛼𝑠𝑊 + 𝛽𝑠𝑋 − − − − − − − − − −(55)
If the coefficients α and β are equal to some constant k, then equation (55) reduces to
𝑊̇ = 𝑘𝑠(𝑋 − 𝑊) − − − − − − − − − −(56)
Therefore, for finite ‖𝑋̅‖ and ‖𝑊‖, the weight vector direction converges asymptotically to the
direction of 𝑋̅.
Also note that, since
‖𝑊‖ (d‖𝑊‖/dt) = 𝑠(𝛽 − 𝛼‖𝑊‖²) − − − − − − − − − (71)
the magnitude of the weight vector asymptotically approaches
‖𝑊‖ = √(𝛽/𝛼) − − − − − − − − − − − − − (72)
and therefore the asymptotic equilibrium weight vector 𝑊̂ can be written as
𝑊̂ = 𝑊(∞) = √(𝛽/𝛼) · 𝑋̅/‖𝑋̅‖ − − − − − − − − − (73)
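Equations (56) through (70) are not reproduced in these notes, so the sketch below integrates an assumed adaptation law of the form 𝑊̇ = 𝛽𝑠𝑋 − 𝛼𝑠²𝑊 (Hebbian reinforcement with signal-squared decay), chosen to be consistent with Eqs. (71) to (73); the constants and input statistics are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

alpha, beta, dt = 0.5, 2.0, 0.005
X_bar = np.array([1.2, 1.6])            # mean input, with norm 2

W = np.array([0.5, 0.5])                # start with a positive signal
for _ in range(20000):
    X = X_bar + 0.05 * rng.standard_normal(2)
    s = X @ W
    # Euler step of W' = beta*s*X - alpha*s^2*W: an assumed signal-squared
    # decay form consistent with Eqs. (71)-(73), not spelled out in the notes.
    W = W + dt * (beta * s * X - alpha * s ** 2 * W)

w_norm = np.linalg.norm(W)              # approaches sqrt(beta/alpha) = 2, Eq. (72)
cos_angle = (W @ X_bar) / (w_norm * np.linalg.norm(X_bar))   # direction of X_bar
```

The simulated equilibrium matches Eq. (73): magnitude √(β/α) and direction 𝑋̅/‖𝑋̅‖.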
From whence,
𝑹𝑊̂ = (𝛼 𝑋̅^T 𝑊̂ / 𝛽) 𝑊̂ − − − − − − − − − −(79)
= 𝜆𝑊̂ − − − − − − − − − − − − − (80)
where we have defined 𝜆 = 𝛼 𝑋̅^T 𝑊̂ / 𝛽. Clearly, eigenvectors of R are fixed-point solutions for W.
d/dt ( 𝜂_i^T 𝑊 / (‖𝜂_i‖‖𝑊‖) ) = 𝐸[ 𝜂_i^T 𝑊̇ / (‖𝜂_i‖‖𝑊‖) − (𝜂_i^T 𝑊 / (‖𝜂_i‖‖𝑊‖²)) (d‖𝑊‖/dt) ] − −(83)
= 𝐸[ 𝜂_i^T 𝑊̇ / (‖𝜂_i‖‖𝑊‖) − 𝜂_i^T 𝑊 𝑊^T 𝑊̇ / (‖𝜂_i‖‖𝑊‖³) ] − − − −(84)
⋮
= (𝛽 𝜂_i^T 𝑊 / (‖𝜂_i‖‖𝑊‖)) ( 𝜆_i − (𝑊^T 𝑹𝑊)/‖𝑊‖² ) − − − − − − − −(88)
= 𝛽 cos 𝜃_i ( 𝜆_i − (𝑊^T 𝑹𝑊)/‖𝑊‖² ) − − − − − − − −(89)
It follows from the Rayleigh quotient that the parenthetic term in Eq. (89) is guaranteed to be
positive only for 𝜆𝑖 = 𝜆𝑚𝑎𝑥 , which means that for the eigenvector 𝜂𝑚𝑎𝑥 the angle 𝜃𝑚𝑎𝑥 between
W and 𝜂𝑚𝑎𝑥 monotonically tends to zero as learning proceeds.
5. VECTOR QUANTIZATION
Vector quantization (VQ) is an important application of competitive learning which was originally
developed for information compression applications, and is now routinely employed to store and transmit
speech or vision data. In VQ, codebook vectors 𝑊𝐼 are required to be placed into the signal space in a way
that minimizes the average expected quantization error:
𝐸 = ∫ ‖𝑋 − 𝑊_J‖² 𝑃(𝑋) 𝑑𝑋 − − − − − − − − − (90)
where X ∈ ℝ^n is a signal vector. It is worth emphasizing that the winning codebook vector 𝑊_J is a
function of the input as well as of all codebook vectors 𝑊_i, i = 1, …, n: ‖𝑋 − 𝑊_J‖ = min_i {‖𝑋 − 𝑊_i‖}. VQ
algorithms exploit input vector structure to divide the input space into different regions, which will
ultimately be identified as clusters, classes or decision regions. Compression is then achieved by
identifying signal vectors in a region of the input space by a codebook vector. Then, all vectors that fall
within a region are represented by the codebook vector for that region. Thus, the name vector quantization.
Here, cluster neurons compete for activation from concatenated input patterns {𝑍𝑘 }. The unsupervised
VQ system compares the current sample vector 𝑍𝑘 with the C quantizing weight vectors 𝑊𝑗(𝑘) (weight
vector 𝑊_j at time instant k), and neuron J wins based on a standard Euclidean distance competition:
‖𝑊_J(k) − 𝑍_k‖ = min_j ‖𝑊_j(k) − 𝑍_k‖ − − − − − −(91)
where the discrete time index k has been introduced because weight vectors will change in accordance
with a competitive learning law. The winning neuron J learns the input pattern in accordance with
standard competitive learning in vector form,
𝑊_J(k+1) = 𝑊_J(k) + 𝜂_k (𝑍_k − 𝑊_J(k)) − − − − − − − (92)
The learning coefficient 𝜂_k should decrease gradually towards zero. For example, we could use the
form 𝜂_k = 𝜂_0 [1 − k/(2Q)] for an initial learning rate 𝜂_0 and Q training samples. This would make η
decrease linearly from 𝜂_0 to zero over 2Q iterations. The winning vector 𝑊_J learns; the losing vectors
do not learn.
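The competition of Eq. (91) and the winner-only update of Eq. (92) can be sketched as follows; the three-cluster data and the one-sample-per-cluster seeding are illustrative shortcuts, not part of the notes' algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative signal source: three well-separated 2-D clusters.
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
clusters = [c + 0.3 * rng.standard_normal((200, 2)) for c in centers]
W = np.array([cl[0] for cl in clusters])   # seed one codebook vector per cluster
Z = np.concatenate(clusters)
rng.shuffle(Z)

Q = len(Z)
eta0 = 0.5
for k, z in enumerate(Z):
    J = int(np.argmin(np.linalg.norm(W - z, axis=1)))   # Euclidean winner, Eq. (91)
    eta_k = eta0 * (1 - k / (2 * Q))                    # linearly decaying rate
    W[J] += eta_k * (z - W[J])                          # only the winner learns, Eq. (92)

# Each codebook vector should end up near one cluster centre.
dists = np.array([np.linalg.norm(W - c, axis=1).min() for c in centers])
```

After one pass, each codebook vector sits near one cluster centre, so each region of the input space is represented by a single vector.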
In practice, the components of the data samples {𝑍_k} are scaled so that all features have
equal weight in the distance measure. This scaling can be embedded within the distance computation
as shown below:
(𝑊_J(k) − 𝑍_k)^T Ω (𝑊_J(k) − 𝑍_k) = min_j { (𝑊_j(k) − 𝑍_k)^T Ω (𝑊_j(k) − 𝑍_k) } − − − − − − − (93)
Where,
Ω = diag(𝑤_1², …, 𝑤_{n+m}²) − − − − − − − −(94)
is a data dependent scaling matrix computed from the maximum and minimum feature values in each
dimension. This normalized Euclidean distance calculation ensures that no one variable dominates the
choice of the winner. In other words, no feature dominates and the classification is feature scale
invariant.
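A small sketch of the scaled-distance competition of Eqs. (93) and (94); the data and codebook values are hypothetical, and Ω is built here from inverse squared feature ranges as one example of a data-dependent scaling computed from per-dimension max and min values:

```python
import numpy as np

# Hypothetical data with two features on very different scales.
Z = np.array([[1000.0, 1.0],
              [2000.0, 2.0],
              [1500.0, 1.5]])

# Hypothetical codebook vectors.
W = np.array([[1100.0, 2.0],
              [1900.0, 1.0]])

# Omega, Eq. (94): inverse squared feature ranges from the data's max and min.
ranges = Z.max(axis=0) - Z.min(axis=0)
Omega = np.diag(1.0 / ranges ** 2)

def winner(z, W, Omega):
    d = W - z
    # Quadratic-form distance of Eq. (93) for every codebook vector.
    return int(np.argmin(np.einsum('ij,jk,ik->i', d, Omega, d)))

z = np.array([1400.0, 1.05])
plain = int(np.argmin(np.linalg.norm(W - z, axis=1)))   # dominated by feature 1
scaled = winner(z, W, Omega)                            # weighs both features
```

With plain Euclidean distance the large-scale first feature decides the winner alone; with Ω both features contribute comparably and the winner changes.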
Table 2 summarizes the adaptive vector quantization (AVQ) algorithm
Table 2: Operational Summary of AVQ Algorithm
• A good example is that of biological vision where three-dimensional visual images are routinely
mapped onto a two-dimensional retina and information is preserved in a way that permits perfect
visualization of a three-dimensional world.
• Interestingly, specialized sensory areas of the cortex respond to the available spectrum of real-
world signals in an ordered fashion.
• For example, the tonotopic map in the auditory cortex is perfectly ordered according to frequency.
• Early evidence for computational maps comes from the studies of Hubel and Wiesel on the primary
visual cortex of cats and monkeys.
• They discovered two kinds of maps:
o cortical maps that selectively responded to a preferred line orientation, and
o maps of ocular dominance that indicated the relative strength of excitatory influence on
each eye.
• In other words, there is ample neurobiological evidence for the formation of a hierarchy of visual,
somatosensory and auditory maps in the human brain.
• Primary maps are processed into secondary and tertiary maps through a sequence of temporal
processing that retains fine grained topological ordering as present in the original sensory signals.
• The main point is that in pattern recognition and information processing, an important step is the
identification or extraction of a set of features (invariants) which concentrate the essential
information of the input pattern set.
• This leads to economy of representation which results in significant savings in storage and
transmission bandwidth requirements.
• By virtue of their nature, computational maps provide a technique for efficient and dynamic
processing of information.
• To understand the concept of topology preservation, consider an m-neuron neural network where
the i-th neuron produces a signal 𝑠_i^k in response to the input pattern 𝐼_k ∈ ℝ^n.
• Assume that input vectors {𝐼𝑘 }𝑄𝑘=1 can be ordered according to some distance metric or in some
topological way: 𝐼1 ℛ 𝐼2 ℛ 𝐼3 … . ., where ℛ is some ordering relation.
• Then the network produces a one-dimensional topology-preserving map if for 𝑖_1 > 𝑖_2 > 𝑖_3 > …,
𝑠_{i_1}^1 = max_j {𝑠_j^1}, 𝑠_{i_2}^2 = max_j {𝑠_j^2}, 𝑠_{i_3}^3 = max_j {𝑠_j^3}, …
• a similar weight update procedure be employed on many adjacent neurons which comprise
topologically related subsets; and
• The resulting adjustment be such that it enhances the responses to the same or to a similar
input that occurs subsequently.
Figure 3: A Kohonen network typically comprises a two-dimensional planar field of neurons that have Mexican
hat lateral interconnections to support soft-competition amongst neurons
In the words of Kohonen, "An intriguing result from this sort of spatially correlated learning is that the
weight vectors tend to attain values that are ordered along the axes of the network."
Assume that the input vector 𝑋_k = (𝑥_1^k, …, 𝑥_n^k)^T ∈ ℝ^n (ℝ^n being the pattern space in
question) is presented to an m × m field of neurons. Due to the planar nature of the field, each neuron
is identified by the double row-column index ij, i, j = 1, …, m. The ij-th neuron has an incoming
weight vector 𝑊_ij(k) = (𝑤_{1,ij}^k, …, 𝑤_{n,ij}^k)^T ∈ ℝ^n at time instant k.
• The first step is to find the best matching weight vector 𝑊𝐼 𝐽(𝑘) for the present input, and to
thereby identify a neighbourhood 𝒩𝐼 𝐽 around the winning neuron.
• One can find the best matching weight vector by comparing the inner products 𝑋𝑘𝑇 𝑊𝑖𝑗(𝑘) of the
impinging input 𝑋𝑘 , with each weight vector 𝑊𝑖𝑗(𝑘) .
• The winning neuron is the one that has the largest inner product.
• Equivalently, with normalized weight vectors, the maximum inner product criterion reduces to
a minimum Euclidean distance criterion: the winning neuron is the one that minimizes the
distance ‖𝑋𝑘 − 𝑊𝑖𝑗(𝑘) ‖.
• Kohonen suggests using the latter since it is more general and applies to natural signals in
metric vector spaces:
‖𝑋_k − 𝑊_IJ(k)‖ = min_{i,j} {‖𝑋_k − 𝑊_ij(k)‖} − − − − − − − − − − − −(98)
• The SOFM algorithm assumes that the planar field of neurons has a Mexican hat interaction
that allows the identification of a neuron cluster around the winning neuron.
• The width of the neighbourhood is a function of time: as epochs of training elapse, the
neighbourhood shrinks.
• In the continuous version this shrinkage of neighbourhood width can be implemented by
gradually decreasing the positive lateral feedback and increasing the negative lateral feedback
in the Mexican hat function.
• The use of 𝒩𝐼 𝐽 simulates the quick formation of an activity cluster.
• Once a winning cluster has been identified, weights of neurons within the cluster are updated.
Adaptation in SOFM's takes place according to the second generalized law of adaptation:
𝑤̇𝑙,𝑖𝑗 = 𝜂𝑥𝑙 𝑠𝑖𝑗 − 𝛾(𝑠𝑖𝑗 )𝑤𝑙,𝑖𝑗 − − − − − − − (99)
which incorporates learning according to the Hebbian hypothesis but with a forgetting term scaled by
some function 𝛾(𝑠𝑖𝑗 ) of the neuronal signal 𝑠𝑖𝑗 . In the simplest case, 𝛾(𝑠𝑖𝑗 ) may be chosen to be linear
𝛾(𝑠𝑖𝑗 ) = 𝛽 𝑠𝑖𝑗 − − − − − − − − − (100)
Which yields,
𝑤̇𝑙,𝑖𝑗 = 𝜂𝑥𝑙 𝑠𝑖𝑗 − 𝛽𝑠𝑖𝑗 𝑤𝑙,𝑖𝑗 − − − − − −(101)
Choosing 𝜂 = 𝛽, without any loss of generality, further simplifies equation (101):
𝑤̇𝑙,𝑖𝑗 = 𝜂𝑠𝑖𝑗 (𝑥𝑙 − 𝑤𝑙,𝑖𝑗 ) − − − − − −(102)
The signal function can be chosen to be binary in nature: in accordance with this, neurons within 𝒩_IJ
fire and others do not. This is simply stated as follows:
1 𝑖𝑓 𝑖𝑗 ∈ 𝒩𝐼 𝐽
𝑠𝑖𝑗 = { − − − − − − − −(103)
0 𝑖𝑓 𝑖𝑗 ∉ 𝒩𝐼 𝐽
With this signal behaviour equation (102) simplifies to:
𝜂(𝑥𝑙 − 𝑤𝑙,𝑖𝑗 ) 𝑖𝑓 𝑖𝑗 ∈ 𝒩𝐼 𝐽
𝑤̇𝑙,𝑖𝑗 = { − − − − − − − −(104)
0 𝑖𝑓 𝑖𝑗 ∉ 𝒩𝐼 𝐽
In vector notation the SOFM adaptation law can be rewritten as
𝜂(𝑋 − 𝑊𝑖𝑗 ) 𝑖𝑓 𝑖𝑗 ∈ 𝒩𝐼 𝐽
𝑊̇𝑖𝑗 = { − − − − − − − −(105)
0 𝑖𝑓 𝑖𝑗 ∉ 𝒩𝐼 𝐽
Weight vectors of winning clusters update themselves to resemble the input vector X to a greater
extent. In general, the learning rate 𝜂 and the neighbourhood 𝒩𝐼 𝐽 are both functions of time, each
having a contracting nature.
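The complete training loop can be sketched as follows. A Gaussian neighbourhood function is used here as a common smooth stand-in for the shrinking binary neighbourhood 𝒩_IJ of Eq. (103); the grid size, decay schedules, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

m = 8                                    # m x m planar field of neurons
grid = np.array([(i, j) for i in range(m) for j in range(m)], dtype=float)
W = rng.random((m * m, 2))               # weight vectors live in the input space

X = rng.random((5000, 2))                # inputs drawn uniformly from the unit square
n_steps = len(X)
eta0, eta_end = 0.5, 0.02
sigma0, sigma_end = m / 2.0, 0.5         # neighbourhood width shrinks over time

for k, x in enumerate(X):
    frac = k / n_steps
    eta = eta0 * (eta_end / eta0) ** frac
    sigma = sigma0 * (sigma_end / sigma0) ** frac
    # Winner: minimum Euclidean distance, Eq. (98).
    IJ = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    # Gaussian neighbourhood on the grid, standing in for the binary N_IJ.
    h = np.exp(-np.sum((grid - grid[IJ]) ** 2, axis=1) / (2.0 * sigma ** 2))
    # Soft version of the update law, Eq. (105).
    W += eta * h[:, None] * (x - W)

# Topology check: grid-adjacent neurons should hold nearby weight vectors.
adj = np.mean([np.linalg.norm(W[i * m + j] - W[i * m + j + 1])
               for i in range(m) for j in range(m - 1)])
```

After training, the weight vectors spread over the unit square and neighbouring neurons on the grid hold nearby weight vectors, which is the topology-preservation property.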
Table 4 presents the operational summary of the self-organizing feature map algorithm.
Another way to visualize a feature map is by labelling each of the neurons with the test pattern that
has maximally excited the neuron in question. In other words, the pattern in question becomes the
best stimulus for that neuron. This procedure closely resembles the way sites in the brain get labelled
by stimulus features that maximally excite neurons at that site. Such a labelling procedure will
eventually produce a well-ordered partition of the map such that groups or clusters of neurons will
respond maximally to the same class of pattern. Such a topographic map captures similarity relationships
and can be used for pattern classification.
Example: The iris data set comprises 150 patterns of feature measurements made on the three species
of the Iris flower: iris setosa, iris versicolor, and iris virginica. The four features measured are the
petal length, petal width, sepal length and sepal width. There are 50 patterns of each of the three
classes. Figure 4 plots the 6 possible projections of the 4-dimensional feature vectors. This helps give
an idea of the range of values of each feature.
The iris data set is presented to a Kohonen map, and the map is labelled after training. Since
the SOFM algorithm uses Euclidean distances, the scales of the individual variables play an important role
in determining what the final map will look like. If the range of values of one variable is much larger
than that of the other variables, then that variable will probably dominate the organization of the
SOFM. Therefore, components of the data are usually normalized prior to presentation so that each
variable has unit variance.
After training the map, assuming an 8 × 8 grid of neurons, the map is labelled. The best matching
neuron of the map is found for each pattern of the data set. This neuron is the one with the minimum
distance between its weight vector and the pattern under consideration. The species label
corresponding to the pattern is assigned to the neuron. Each neuron is finally assigned the label that
corresponds to the species whose patterns elicited a maximal response with the highest frequency.
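The labelling procedure itself can be sketched with a hypothetical trained 2 × 2 map and a two-class synthetic data set standing in for the iris data and the 8 × 8 grid:

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical trained 2x2 map: one weight vector per neuron.
W = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Synthetic labelled patterns: class 0 near (0,0), class 1 near (1,1).
data = np.concatenate([rng.normal([0.0, 0.0], 0.15, size=(30, 2)),
                       rng.normal([1.0, 1.0], 0.15, size=(30, 2))])
labels = np.array([0] * 30 + [1] * 30)

# Tally, for each neuron, how often each class makes it the best match.
votes = np.zeros((len(W), 2), dtype=int)
for z, c in zip(data, labels):
    bmu = int(np.argmin(np.linalg.norm(W - z, axis=1)))   # best matching neuron
    votes[bmu, c] += 1

# Label each neuron with the class that excited it with the highest
# frequency; neurons that never won remain unlabelled (-1).
neuron_labels = np.where(votes.sum(axis=1) > 0, votes.argmax(axis=1), -1)
```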
units. Phonetic units are called phonemes. These are linguistic abstractions that do not directly
correspond to distinct acoustic segments. They have varying lengths, may partly overlap, and
their acoustic appearance varies according to their context as well as from speaker to speaker.
The primary objective of speech recognition is to transcribe carefully spoken utterances into
phoneme sequences and to transform them into orthographically correct written text. Kohonen's
neural phonetic typewriter represents a classic example of an automated speech recognition system that uses
one of two neural network modules: the Self-organizing Feature Map (SOFM) or Learning Vector
Quantization (LVQ).
Either of these methods is used to classify speech segments into phoneme classes time
synchronously. The results are decoded into phoneme strings using Hidden Markov Models (HMM),
a statistical methodology for representing and recognizing stochastic time series. The ANN approach
differs from the conventional HMM methodology by using ANNs as a component to emphasize
discrimination between phonemes, whereas the aim of HMM is representation.
The neural phonetic typewriter was designed to work with the Finnish language, which has 21
phonemes that correspond almost uniquely with graphemes. The equipment was developed at the
Helsinki University of Technology. It is built around a coprocessor board that has the capacity to
perform the classification of short-time speech spectra by the SOM, as well as to apply symbolic
postprocessing to correct for coarticulation errors. It was able to transcribe unlimited Finnish and
Japanese dictation into text at a letter accuracy exceeding 92 per cent. The "neural" model applied in
this design is a special supervised SOM.
Input vectors comprise a 15-component short-time spectrum vector 𝑋𝑠 . Preprocessing of the
speech signal to generate this vector involves a number of steps. These include recording on a noise
cancelling microphone; analog-to-digital conversion; fast Fourier transform; low pass filtering and
then grouping the spectral channels into a 15-component real valued pattern vector. The final step of
preprocessing involves subtraction of the average from all components and normalization of the
resulting vector to constant length. Each input vector is tagged to a class using an output vector 𝑋𝑢 .
This is a binary vector with its components assigned 0 or 1 to indicate different phoneme classes.
During recognition, 𝑋𝑢 is not considered. A problem faced in discrimination is that the distributions
of the spectra of different classes overlap. It is here that neural networks can effectively define class
separating boundaries that generalize well.
In many applications, there is little or no information about the input distribution or the size of the
data set. In such cases, it is difficult to determine a priori the number of neurons to use in the self-organizing
feature map, or for that matter even in classical vector quantization. An incremental clustering algorithm
called Growing Neural Gas (GNG) makes no assumptions on the number of neurons used by the network.
Growing Neural Gas (GNG) is an unsupervised incremental clustering algorithm that can perform
dimensionality reduction while discovering a topological map of an input data distribution. GNG
incrementally creates a network of nodes that quantize an input space described by a distribution of points
in ℝ𝑛 , which might be drawn from an unknown probability density function P(X). Position vectors of
GNG nodes in the input space act as codebook vectors for data clusters in ℝ𝑛 . GNG can also be used to
discover structures that closely reflect the topology of the input distribution. An appealing aspect of GNG
is that it is adaptive in the sense that even if the input distribution changes slowly over time, GNG nodes
will gradually migrate so as to cover the new distribution.
𝑋𝑟 = 0.5(𝑋𝑝 + 𝑋𝑞 )
d. Insert edges connecting the new node r with nodes p and q, and remove the original
edge between p and q.
e. Decrease the error variables of p and q by a factor α. Initialize the local error variable of r to the
new value of the error variable of p.
10. Decrease all node error variables by a factor β.
11. If a stopping criterion (e.g., network size or some performance measure) is not yet fulfilled,
return to Step 1.
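Steps d, e, and 10 above can be sketched on a toy graph. The dict-based state, the concrete numbers, and the multiplicative reading of "decrease by a factor" (following Fritzke's original GNG formulation) are illustrative assumptions:

```python
import numpy as np

# Toy state: three nodes with position vectors, accumulated errors, and edges.
nodes = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 0.0]), 2: np.array([0.0, 2.0])}
errors = {0: 5.0, 1: 3.0, 2: 1.0}
edges = {(0, 1), (0, 2)}

alpha_gng = 0.5    # error-decay factor for p and q (step e)
beta_gng = 0.995   # global error-decay factor (step 10)

# p: node with the largest accumulated error; q: its neighbour with largest error.
p = max(errors, key=errors.get)
neighbours = [b for (a, b) in edges if a == p] + [a for (a, b) in edges if b == p]
q = max(neighbours, key=errors.get)

# Insert node r halfway between p and q: X_r = 0.5 (X_p + X_q).
r = max(nodes) + 1
nodes[r] = 0.5 * (nodes[p] + nodes[q])

# Step d: connect r to p and q, and drop the direct p-q edge.
edges.discard((p, q)); edges.discard((q, p))
edges |= {(p, r), (q, r)}

# Step e: decay the errors of p and q, then seed r's error from p's new value.
errors[p] *= alpha_gng
errors[q] *= alpha_gng
errors[r] = errors[p]

# Step 10: decay every node's error variable.
for key in errors:
    errors[key] *= beta_gng
```

Inserting the new node where the accumulated quantization error is largest is what lets GNG grow capacity exactly where the current codebook represents the data worst.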