
The Self-organizing Map

Invited Paper

Among the architectures and algorithms suggested for artificial neural networks, the Self-Organizing Map has the special property of effectively creating spatially organized "internal representations" of various features of input signals and their abstractions. One novel result is that the self-organization process can also discover semantic relationships in sentences. In this respect the resulting maps very closely resemble the topographically organized maps found in the cortices of the more developed animal brains. After supervised fine tuning of its weight vectors, the Self-Organizing Map has been particularly successful in various pattern recognition tasks involving very noisy signals. In particular, these maps have been used in practical speech recognition, and work is in progress on their application to robotics, process control, telecommunications, etc. This paper contains a survey of several basic facts and results.

I. INTRODUCTION

A. On the Role of the Self-Organizing Map Among Neural Network Models

The network architectures and signal processes used to model nervous systems can roughly be divided into three categories, each based on a different philosophy. Feedforward networks [94] transform sets of input signals into sets of output signals. The desired input-output transformation is usually determined by external, supervised adjustment of the system parameters. In feedback networks [27], the input information defines the initial activity state of a feedback system, and after state transitions the asymptotic final state is identified as the outcome of the computation. In the third category, neighboring cells in a neural network compete in their activities by means of mutual lateral interactions, and develop adaptively into specific detectors of different signal patterns. In this category learning is called competitive, unsupervised, or self-organizing.

The Self-Organizing Map discussed in this paper belongs to the last category. It is a sheet-like artificial neural network, the cells of which become specifically tuned to various input signal patterns or classes of patterns through an unsupervised learning process. In the basic version, only one cell or local group of cells at a time gives the active response to the current input. The locations of the responses tend to become ordered as if some meaningful coordinate system for different input features were being created over the network. The spatial location or coordinates of a cell in the network then correspond to a particular domain of input signal patterns. Each cell or local cell group acts like a separate decoder for the same input. It is thus the presence or absence of an active response at that location, and not so much the exact input-output signal transformation or magnitude of the response, that provides an interpretation of the input information.

The Self-Organizing Map was intended as a viable alternative to more traditional neural network architectures. It is possible to ask just how "neural" the map is. Its analytical description has already been developed further in the technical than in the biological direction. But the learning results achieved seem very natural, at least indicating that the adaptive processes at work in the map may be similar to those encountered in the brain. There may therefore be sufficient justification for calling these maps "neural networks" in the same sense as their traditional rivals.

Self-Organizing Maps, or systems consisting of several map modules, have been used for tasks similar to those to which other, more traditional neural networks have been applied: pattern recognition, robotics, process control, and even processing of semantic information. The spatial segregation of different responses and their organization into topologically related subsets results in a high degree of efficiency in typical neural network operations.

Although the largest map we have used in practical applications has only contained about 1000 cells, its learning speed, especially when using computational shortcuts, can be increased to orders of magnitude greater than that of many other neural networks. Thus much larger maps than those used so far are quite feasible, although it also seems that practical applications favor hierarchical systems made up of many smaller maps.

It may be appropriate to observe here that if the maps are used for pattern recognition, their classification accuracy can be multiplied if the cells are fine-tuned using supervised learning principles (cf. Sec. III).

Although the Self-Organizing Map principle was introduced in early 1981, no complete review has appeared in compact form, except perhaps in [44], which does not contain the latest results. I have therefore tried to collect a variety of basic material in the present paper.

Manuscript received August 18, 1989; revised March 15, 1990. The author is with the Department of Computer Science, Helsinki University of Technology, 02150 Espoo, and the Academy of Finland, 00550 Helsinki, Finland. IEEE Log Number 9038826.

B. Brain Maps

As much as a hundred years ago, a quite detailed topographical organization of the brain, and especially of the cerebral cortex, could be deduced from functional deficits and behavioral impairments induced by various kinds of lesion, or by hemorrhages, tumors, or malformations. Different regions in the brain thereby seemed to be dedicated to specific tasks. One modern systematic technique for causing controllable, reversible simulated lesions is to stimulate a particular site with small electric currents, thereby eventually inducing both excitatory and inhibitory effects and disturbing the assumed local function [75]. If such a spatially confined stimulus then disrupts a specific cognitive ability such as naming of objects, it gives at least some indication that this site is essential to that task.

One straightforward method for locating a response is to record the electric potential or train of neural impulses associated with it. Many detailed mappings, especially from the primary sensory and associative areas of the brain, have been made using various electrophysiological recording techniques.

Direct evidence for any localization of brain functions can also be obtained using modern imaging techniques that display the strength and spatial distribution of neural responses simultaneously over a large area, with a spatial resolution of a few millimeters. The two principal methods which use radioactive tracers are positron emission tomography (PET) [80] and autoradiography of the brain through very narrow collimators (gamma camera). PET reveals changes in oxygen uptake and phosphate metabolism. The gamma camera method directly detects changes in cerebral blood flow. Both phenomena correlate with local neural activity, but they are unable to monitor rapid phenomena. In magnetoencephalography (MEG), the low magnetic field caused by electrical neural responses is detected, and by computing its sources, quite rapid neural responses can be directly analyzed, with a spatial resolution of a few millimeters. The main drawback of MEG is that only current dipoles parallel to the surface of the skull are detectable; and since the dipoles are oriented perpendicular to the cortex, only the sulci can be studied with this method. A review of experimental techniques and results relating to these studies can be found in [32].

After a large number of such observations, a fairly detailed organizational view of the brain has evolved [32]. Especially in higher animals, the various cortices in the cell mass seem to contain many kinds of "map" [33], such that a particular location of the neural response in the map often directly corresponds to a specific modality and quality of sensory signal. The field of vision is mapped "quasiconformally" onto the primary visual cortex. Some of the maps, especially those in the primary sensory areas, are ordered according to some feature dimensions of the sensory signals; for instance, in the visual areas there are line-orientation and color maps [11], [116], and in the auditory cortex there are the so-called tonotopic maps [91], [103], [104], which represent pitches of tones in terms of the cortical distance, as well as other auditory maps [98]. One of the sensory maps is the somatotopic map [29], [30], which contains a representation of the body, i.e., the skin surface. Adjacent to it is a motor map [70] that is topographically almost identically organized. Its cells mediate voluntary control actions on muscles. Similar maps exist in other parts of the brain [97]. On the higher levels the maps are usually unordered, or at most the order is a kind of ultrametric topological order that is not easily interpretable. There are also singular cells that respond to rather complex patterns, such as the human face [2], [3].

Some maps represent quite abstract qualities of sensory and other experiences. For instance, in the word-processing areas, neural responses seem to be organized according to categories and semantic values of words ([6], [15], [65], [67], [106]-[109], and [114]; cf. also Sec. V).

It thus seems as if the internal representations of information in the brain are generally organized spatially. Although there is only partial biological evidence for this, enough data are already available to justify further theoretical studies of this principle. Artificial self-organizing maps and brain maps thus have many features in common, and, what is even more intriguing, we now fully understand the processes by which such artificial maps can be formed adaptively and completely automatically.

C. Early Work on Competitive Learning

The basic idea underlying what is called competitive learning is roughly as follows. Assume a sequence of statistical samples of a vectorial observable x = x(t) ∈ R^n, where t is the time coordinate, and a set of variable reference vectors {m_i(t): m_i ∈ R^n, i = 1, 2, ..., k}. Assume that the m_i(0) have been initialized in some proper way; random selection will often suffice. If x(t) can somehow be simultaneously compared with each m_i(t) at each successive instant of time, taken here to be an integer t = 1, 2, 3, ..., then the best-matching m_i(t) is to be updated to match even more closely the current x(t). If the comparison is based on some distance measure d(x, m_i), altering m_i must be such that, if i = c is the index of the best-matching reference vector, then d(x, m_c) is decreased, and all the other reference vectors m_i, with i ≠ c, are left intact. In this way the different reference vectors tend to become specifically "tuned" to different domains of the input variable x. It will be shown below that if p is the probability density function of the samples x, then the m_i tend to be located in the input space R^n in such a way that they approximate p in the sense of some minimal residual error.

Vector Quantization (VQ) (cf., e.g., [19], [54], [58]) is a classical method that produces an approximation to a continuous probability density function p(x) of the vectorial input variable x using a finite number of codebook vectors m_i, i = 1, 2, ..., k. Once the "codebook" is chosen, the approximation of x involves finding the reference vector m_c closest to x. One kind of optimal placement of the m_i minimizes E, the expected rth power of the reconstruction error:

E = ∫ ||x - m_c||^r p(x) dx    (1)

where dx is the volume differential in the x space, and the index c = c(x) of the best-matching codebook vector ("winner") is a function of the input vector x:

||x - m_c|| = min_i {||x - m_i||}.    (2)

In general, no closed-form solution for the optimal placement of the m_i is possible, and iterative approximation schemes must be used.



It has been pointed out in [14], [64], and [115] that (1) defines a placement of the codebook vectors into the signal space such that their point density function is an approximation to [p(x)]^(n/(n+r)), where n is the dimensionality of x and the m_i. We usually consider the case r = 2. In most practical applications n >> r, and then the optimal VQ can be shown to approximate p(x).

Using the square-error criterion (r = 2), it can also be shown that the following stepwise "delta rule," in the discrete-time formalism (t = 0, 1, 2, ...), defines the optimal values asymptotically. Let m_c = m_c(t) be the codebook vector closest to x = x(t) in the Euclidean metric. The steepest-descent gradient-step optimization of E in the m_i space yields the sequence

m_c(t+1) = m_c(t) + α(t)[x(t) - m_c(t)],
m_i(t+1) = m_i(t)   for i ≠ c,    (3)

with α(t) being a suitable, monotonically decreasing sequence of scalar-valued gain coefficients, 0 < α(t) < 1. This is then the simplest analytical description of competitive learning.
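As an illustration of the winner search (2) and the update rule (3), a minimal competitive-learning pass might be sketched as follows in Python/NumPy. This is not code from the paper; the function name, the linear gain schedule, and the initialization by random samples are illustrative assumptions.

import numpy as np

def competitive_learning(X, k, alpha0=0.5, seed=0):
    """Simple competitive learning, cf. (2)-(3).

    X      : (T, n) array of input samples x(t)
    k      : number of reference vectors m_i
    alpha0 : initial gain; alpha(t) decreases linearly toward zero
    """
    rng = np.random.default_rng(seed)
    T, n = X.shape
    # Initialize the m_i "in some proper way"; random samples suffice.
    m = X[rng.choice(T, size=k, replace=False)].copy()

    for t in range(T):
        x = X[t]
        # Winner c: the best-matching (closest) reference vector, eq. (2).
        c = np.argmin(np.linalg.norm(m - x, axis=1))
        # Monotonically decreasing scalar gain, 0 < alpha(t) < 1.
        alpha = alpha0 * (1.0 - t / T)
        # Update only the winner, eq. (3); all other m_i are left intact.
        m[c] += alpha * (x - m[c])
    return m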
In general, if we express the dissimilarity of x and m_i in terms of a general distance function d(x, m_i), we have first to identify the "winner" m_c such that

d(x, m_c) = min_i {d(x, m_i)}.    (4)

After that an updating rule should be used such that d decreases monotonically: the correction Δm_c of m_c must be such that

[grad_{m_c} d(x, m_c)]^T Δm_c < 0.    (5)

If (1) is used for signal approximation, it often turns out to be more economical to first observe a number of training samples x(t), which are "classified" (labeled) on the basis of (2) according to the closest codebook vectors m_i, and then to perform the updating operation in a single step. For the new codebook vector m_i, the average is taken of those x(t) that were identified with codebook vector i. This algorithm, termed the k-means algorithm, is widely used in digital telecommunications engineering [58].
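The batch variant just described can be sketched in the same style; again this is an illustrative reconstruction, and the fixed number of iterations n_iter is an assumption rather than a value from the text.

import numpy as np

def kmeans_codebook(X, m, n_iter=10):
    """Batch ("k-means") update of codebook vectors.

    Each pass labels every training sample x(t) with its closest codebook
    vector, cf. (2), and then replaces each codebook vector by the average
    of the samples assigned to it.
    """
    for _ in range(n_iter):
        # Label every sample with the index of its nearest codebook vector.
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # Replace each codebook vector by the mean of its samples.
        for i in range(len(m)):
            members = X[labels == i]
            if len(members) > 0:
                m[i] = members.mean(axis=0)
    return m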
The m_i(t), in the above processes, actually develop into a set of feature-sensitive detectors. Feature-sensitive cells are also known to be common in the brain. Neural modelers like Nass and Cooper [72], Perez et al. [79], and Grossberg [21] have been able to suggest how such feature-sensitive cells can emerge from simplified membrane equations of model neurons.

In the above process and its biophysical counterparts, all the cells act independently. Therefore the order in which they are assigned to the different domains of input signals is more or less haphazard, depending most strongly on the initial values of the m_i(0). In fact, in 1973, v.d. Malsburg [59] had already published a computer simulation in which he demonstrated local ordering of feature-sensitive cells, such that in small subsets of cells, roughly corresponding to the so-called columns of the cortex, the cells were tuned more closely than were more remote cells. Later, Amari [1] formulated and analyzed the corresponding system of differential equations, relating them to spatially continuous two-dimensional media. Such continuous layers interacted in the lateral direction; the arrangement was called a nerve field. The above studies are of great theoretical importance because they involve a self-organizing tendency. The ordering power they demonstrated was, however, still weak, as nerve-field type equations only describe this tendency as a marginal effect. In spite of numerous attempts, no "maps" of practical importance could be produced; ordering was either restricted to a one-dimensional case, or confined to small parcelled areas of the network [60], [77], [78], [99], [100], [110], [111].

Indeed it later transpired that the system equations have to involve much stronger, idealized self-organizing effects, and that the organizing effect has to be maximized in every possible way before useful global maps can be created. The present author, in early 1981, was experimenting with various architectures and system equations, and found a process description [34]-[36] that seemed generally to produce globally well-organized maps. Because all the other system models known at that time only yielded results that were significantly more "brittle" with respect to the selection of parameters and to success in achieving the desired results, we may skip them here and concentrate on the computationally optimized algorithm known as the Self-Organizing Map algorithm.

II. AN ALGORITHM THAT ORDERS RESPONSES SPATIALLY

Readers who are not yet familiar with the Self-Organizing Maps may benefit from a quick look at Figs. 5 and 6 or Fig. 9 to find out what spatial ordering of output responses means.

The Self-Organizing Map algorithm that I shall now describe has evolved during a long series of computer experiments. The background to this research has been expounded in [44]. While the purpose of each detail of the final equations may be clear in concrete simulations, it has proved extremely difficult, in spite of numerous attempts, to express the dynamic properties of this process in mathematical theorems. Strict mathematical analysis only exists for simplified cases, and even these are too lengthy to be reviewed here: cf. [7], [8], [24], [57], [83], [86], [89], [90]. It is therefore hoped that the simulation experiments and practical applications reported below in Secs. II-C, II-E, IV, V, and VI will suffice to convince the reader of the utility of this algorithm.

It may also be necessary to emphasize again that for practical purposes we are trying to extract or explain the self-organizing function in its purest, most effective form, whereas in genuine biological networks this tendency may be more or less disguised by other functions. It may thus be conceivable, as has been verified by numerous simulation experiments, that the two essential effects leading to spatially organized maps are: 1) spatial concentration of the network activity on the cell (or its neighborhood) that is best tuned to the present input, and 2) further sensitization or tuning of the best-matching cell and its topological neighbors to the present input.

A. Selection of the Best-Matching Cell

Consider the two-dimensional network of cells depicted in Fig. 1. Their arrangement can be hexagonal, rectangular, etc. Let (in matrix notation) x = [x_1, x_2, ..., x_n]^T ∈ R^n be the input vector that, for simplicity and computational efficiency, is assumed to be connected in parallel to all the neurons i in this network.


Fig. 1. Cell arrangement for the map and definition of variables.

Fig. 2. Examples of topological neighborhood N_c(t), where t_1 < t_2 < t_3.

(We have also shown that subsets of the same input signals can be connected at random to the cells; cf. the "tonotopic map" discussed in [35] and [44]. Experiments are in progress in which the input connections to the cells can be made in a cascade.) The weight vector of cell i shall henceforth be denoted by m_i = [m_i1, m_i2, ..., m_in]^T ∈ R^n.

The simplest analytical measure for the match of x with the m_i may be the inner product x^T m_i. If, however, the self-organizing algorithm is to be used for, say, natural signal patterns relating to metric vector spaces, a better and more convenient (cf. the adaptation law below) matching criterion may be used, based on the Euclidean distances between x and the m_i. The minimum distance defines the "winner" m_c (cf. (2)). A shortcut algorithm to find m_c has been presented in [49].

Comment: Definition of the input vector x as an ordered set of signal values is only possible if the interrelation between the signals is simple. In many practical problems, such as image analysis (cf. Discussion, Sec. VII), it will generally be necessary to use some kind of preprocessing to extract a set of invariant features for the components of x.

B. Adaptation (Updating) of the Weight Vectors

It is crucial to the formation of ordered maps that the cells doing the learning are not affected independently of each other (cf. competitive learning in Sec. I-C), but as topologically related subsets, on each of which a similar kind of correction is imposed. During the process, such selected subsets will then encompass different cells. The net corrections at each cell will thus tend to be smoothed out in the long run. An even more intriguing result of this sort of spatially correlated learning is that the weight vectors tend to attain values that are ordered along the axes of the network.

In biophysically inspired neural network models, correlated learning by spatially neighboring cells can be implemented using various kinds of lateral feedback connection and other lateral interactions. In the present process we want to enforce lateral interaction directly in a general form, for arbitrary underlying network structures, by defining a neighborhood set N_c around cell c. At each learning step, all the cells within N_c are updated, whereas cells outside N_c are left intact. This neighborhood is centered around that cell for which the best match with input x is found:

||x - m_c|| = min_i {||x - m_i||}.    (2')

The width or radius of N_c can be time-variable; in fact, for good global ordering, it has experimentally turned out to be advantageous to let N_c be very wide in the beginning and shrink monotonically with time (Fig. 2). The explanation for this may be that a wide initial N_c, corresponding to a coarse spatial resolution in the learning process, first induces a rough global order in the m_i values, after which narrowing of N_c improves the spatial resolution of the map; the acquired global order, however, is not destroyed later on. It is even possible to end the process with N_c = {c}, that is, finally updating the best-matching unit ("winner") only, in which case the process is reduced to simple competitive learning. Before this, however, the "topological order" of the map would have to be formed.

The updating process (in discrete-time notation) may read

m_i(t+1) = m_i(t) + α(t)[x(t) - m_i(t)]   if i ∈ N_c(t),
m_i(t+1) = m_i(t)                         if i ∉ N_c(t),    (6)

where α(t) is a scalar-valued "adaptation gain," 0 < α(t) < 1. It is related to a similar gain used in stochastic approximation processes [49], [92], and as in those methods, α(t) should decrease with time.

An alternative notation is to introduce a scalar "kernel" function h_ci = h_ci(t),

m_i(t+1) = m_i(t) + h_ci(t)[x(t) - m_i(t)],    (7)

whereby, above, h_ci(t) = α(t) within N_c and h_ci(t) = 0 outside N_c. On the other hand, the definition of h_ci can also be more general; a biological lateral interaction often has the shape of a "bell curve." Denoting the coordinates of cells c and i by the vectors r_c and r_i, respectively, a proper form for h_ci might be

h_ci = h_0 exp(-||r_c - r_i||^2 / σ^2),    (8)

with h_0 = h_0(t) and σ = σ(t) as suitable decreasing functions of time.
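One learning step according to (2'), (7), and (8) may be sketched as follows. This is an illustration, not the author's implementation; a rectangular lattice is assumed, and the instantaneous values of h_0(t) and σ(t) are passed in as plain arguments.

import numpy as np

def som_step(m, grid, x, h0, sigma):
    """One Self-Organizing Map update step, cf. (2'), (7), and (8).

    m    : (k, n) weight vectors m_i
    grid : (k, 2) array of cell coordinates r_i in the lattice
    x    : (n,) input vector
    h0, sigma : current values of the decreasing functions h_0(t), sigma(t)
    """
    # Best-matching cell ("winner"), eq. (2').
    c = np.argmin(np.linalg.norm(m - x, axis=1))
    # Bell-curve neighborhood kernel h_ci, eq. (8).
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)
    h = h0 * np.exp(-d2 / sigma ** 2)
    # Update every cell in proportion to its kernel value, eq. (7).
    m += h[:, None] * (x - m)
    return m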
C. Demonstrations of the Ordering Process

The first computer simulations presented here are intended to illustrate the effect that the weight vectors tend to approximate to the density function of the input vectors in an orderly fashion. In these examples, the input vectors were chosen to be two-dimensional for visual display purposes, and their probability density function was arbitrarily selected to be uniform over the area demarcated by the borderlines (square or triangle). Outside the frame the density was zero. The vectors x(t) were drawn from this density function independently and at random, after which they caused adaptive changes in the weight vectors m_i.

The m_i vectors appear as points in the same coordinate system as that in which the x(t) are represented; in order to indicate to which unit each m_i value belongs, the points

corresponding to the m_i vectors have been connected by a lattice of lines conforming to the topology of the processing-unit array. In other words, a line connecting two weight vectors m_i and m_j is only used to indicate that the corresponding units i and j are adjacent in the array. In Fig. 3 the arrangement of the cells is rectangular (square), whereas in Fig. 4 the cells are interconnected in a linear chain.

Fig. 3. Weight vectors during the ordering process, two-dimensional array.

Fig. 4. Weight vectors during the ordering process, one-dimensional array.

Examples of intermediate phases during the self-organizing process are given in Figs. 3 and 4. The initial values m_i(0) were selected at random from a certain domain of values.

As stated above, in Fig. 3 the array was two-dimensional. The results, however, are particularly interesting if the distribution and the array have different dimensionalities: Fig. 4 illustrates a case in which the distribution of x is two-dimensional, but the array is one-dimensional (a linear row of cells). The weight vectors of linear arrays tend to approximate higher-dimensional distributions by Peano curves. A two-dimensional network representing three-dimensional "bodies" (uniform density function) is shown in Fig. 5.

Fig. 5. Representation of three-dimensional (uniform) density functions by two-dimensional maps.

In practical applications, the input and the weight vectors are usually high-dimensional; e.g., in speech recognition, the dimensionality n may be 15 to 100.

Since no factor present defines a particular orientation in the output map, the latter can be realized in the process in any mirror or point-symmetric inversion, depending mainly on the initial values m_i(0). If a particular orientation is to be favored, the easiest way to achieve this result is by an asymmetric choice of the initial values m_i(0).

D. Some Practical Hints for the Application of the Algorithm

When applying the map algorithm, (2) or (2') and (6) alternate. Input x is usually a random variable with a density function p(x), from which the successive values x(t) are drawn. In real-world observations, such as speech recognition, the x(t) can simply be successive samples of the input observables in their natural order of occurrence.

The process may be started by choosing arbitrary, even random, initial values for the m_i(0), the only restriction being that they should be different.

We shall give numerical examples of efficient process parameters with the simulation examples. It may also be helpful to emphasize the following general conditions.



1) Since learning is a stochastic process, the final statistical accuracy of the mapping depends on the number of steps, which must be reasonably large; there is no way to circumvent this requirement. A "rule of thumb" is that, for good statistical accuracy, the number of steps must be at least 500 times the number of network units. On the other hand, the number of components in x has no effect on the number of iteration steps, and if hardware neural computers are used, a very high dimensionality of input is allowed. Typically we have used up to 100 000 steps in our simulations, but for "fast learning," e.g., in speech recognition, 10 000 steps and even less may sometimes be enough. Note that the algorithm is computationally extremely light. If only a small number of samples are available, they must be recycled for the desired number of steps.

2) For approximately the first 1000 steps, α(t) should start with a value that is close to unity, thereafter decreasing monotonically. An accurate rule is not important: α = α(t) can be linear, exponential, or inversely proportional to t. For instance, α(t) = 0.9(1 - t/1000) may be a reasonable choice. The ordering of the m_i occurs during this initial period, while the remaining steps are only needed for the fine adjustment of the map. After the ordering phase, α = α(t) should attain small values (e.g., of the order of or less than 0.01) over a long period. Neither is it crucial whether the law for α(t) decreases linearly or exponentially during the final phase.

3) Special caution is required in the choice of N_c = N_c(t). If the neighborhood is too small to start with, the map will not be ordered globally. Instead, various kinds of mosaic-like parcellations of the map are seen, between which the ordering direction changes discontinuously. This phenomenon can be avoided by starting with a fairly wide N_c = N_c(0) and letting it shrink with time. The initial radius of N_c can even be more than half the diameter of the network! During the first 1000 steps or so, when the proper ordering takes place, and α = α(t) is fairly large, the radius of N_c can shrink linearly to, say, one unit; during the fine-adjustment phase, N_c can still contain the nearest neighbors of cell c.

Parallel implementations of the algorithm have been discussed in [25] and [61].
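The general conditions 1)-3) above can be collected into a minimal training loop. The following sketch is my own illustration under those hints, not code from the paper: it uses the linear gain α(t) = 0.9(1 - t/1000) from condition 2), a small constant gain during the fine-adjustment phase, and a set-type neighborhood N_c(t) on a rectangular lattice whose radius shrinks linearly from half the network diameter to one unit during the first 1000 steps, as suggested in condition 3).

import numpy as np

def train_som(X, rows, cols, n_steps=100_000, seed=0):
    """Minimal SOM training loop following the practical hints above."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    m = rng.random((rows * cols, n))          # random, mutually different m_i(0)

    r0 = max(rows, cols) / 2.0                # initial radius: half the diameter
    for t in range(n_steps):
        x = X[rng.integers(len(X))]           # draw x(t) at random from the data
        c = np.argmin(np.linalg.norm(m - x, axis=1))

        if t < 1000:                          # ordering phase
            alpha = 0.9 * (1.0 - t / 1000.0)
            radius = r0 + (1.0 - r0) * t / 1000.0
        else:                                 # fine-adjustment phase
            alpha = 0.01
            radius = 1.0

        # N_c(t): all cells whose lattice distance to the winner is <= radius.
        in_nc = np.linalg.norm(grid - grid[c], axis=1) <= radius
        m[in_nc] += alpha * (x - m[in_nc])    # eq. (6)
    return m, grid

For the two-dimensional demonstrations of Sec. II-C, X could simply be a large set of points drawn uniformly from the unit square.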

E. Example: Taxonomy (Hierarchical Clustering) of Abstract Data

Although the more practical applications of the Self-Organizing Maps are available, for example, in pattern recognition and robotics, it may be interesting to apply this principle first to abstract data vectors consisting of hypothetical attributes or characteristics. We will look at an example with implicitly defined (hierarchical) structures in the primary data, which the map algorithm is then able to reveal. Although this system is a single-level network, it can produce a hierarchical representation of the relations between the primary data.

The central result in self-organization is that if the input signals have a well-defined probability density function, then the weight vectors of the cells try to imitate it, however complex its form. It is even possible to perform a kind of numerical taxonomy on this model. Because there are no restrictions on the semantic content of the input signals, they can be regarded as arbitrary attributes, with discrete or continuous values. In Table 1, 32 items, each with five hypothetical attributes, are recorded in a data matrix. (This example is completely artificial.) Each of the columns represents one item, and for later inspection the items are labeled "A" through "6", although these labels were not referred to during the learning.

The attribute values (a_1, a_2, ..., a_5) constitute the pattern vector x, which acts as a set of signal values at the inputs to the network of the type shown in Fig. 1. During training, the vectors x were selected from Table 1 at random. Sampling and adaptation were continued iteratively until the asymptotic state could be considered stationary. Such a "learned" network was then calibrated using the items from Table 1, labeling the best-matching map cells according to the different calibration items. Such a labeled map is shown in Fig. 6. It can be seen that the "images" of different items are related according to a taxonomic graph where the different branches are visible. For comparison, Fig. 7 illustrates the minimal spanning tree (where the most closely similar pairs of items are linked) describing the similarity relations of the items in Table 1.

Fig. 6. Self-organized map of the data matrix of Table 1.

Fig. 7. Minimal spanning tree corresponding to Table 1.

Table 1  Input Data Matrix

Item   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6
a_1    1 2 3 4 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
a_2    0 0 0 0 0 1 2 3 4 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
a_3    0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 3 3 3 3 6 6 6 6 6 6 6 6 6 6
a_4    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 1 2 3 4 2 2 2 2 2 2
a_5    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6
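The data of Table 1 are easy to encode directly, and the calibration step described above (labeling the best-matching cell of each item after training) reduces to a nearest-neighbor search. The sketch below is illustrative; the trained weight matrix m and lattice coordinates grid are assumed to come from a training loop such as the one sketched in Sec. II-D.

import numpy as np

# The 32 items of Table 1: columns "A"..."Z", "1"..."6", five attributes each.
labels = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ123456")
table1 = np.array([
    [1,2,3,4,5,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3],
    [0,0,0,0,0,1,2,3,4,5,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3],
    [0,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,3,3,3,3,6,6,6,6,6,6,6,6,6,6],
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,4,1,2,3,4,2,2,2,2,2,2],
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6],
], dtype=float)
X = table1.T                      # one five-attribute pattern vector per item

def calibrate(m, X, labels):
    """Label the best-matching cell of a trained map for every item."""
    tags = {}
    for x, lab in zip(X, labels):
        c = int(np.argmin(np.linalg.norm(m - x, axis=1)))
        tags.setdefault(c, []).append(lab)    # one cell may answer to two items
    return tags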



The system parameters in this process were:

α = α(t): During the first 1000 steps, α decreased linearly with time from 0.5 to 0.04 (the initial value could have been closer to unity, say, 0.9). During the subsequent 10 000 steps, α decreased from 0.04 to zero linearly with time.

N_c = N_c(t): The lattice was hexagonal, 7 by 10 units, and during the first 1000 steps the radius of N_c decreased linearly with time from the value six (encompassing the majority of cells in the network) to one (encompassing neuron c and its six neighbors), thereafter keeping the value one.

F. Another Variant of the Algorithm

One further remark may be necessary. It has sometimes been suggested that x be normalized before it is used in the algorithm. Normalization is not necessary in principle, but it may improve numerical accuracy because the resulting reference vectors then tend to have the same dynamic range.

Another aspect, as mentioned above, is that it is also possible to apply a general distance measure in the matching; then, however, the matching and updating laws should be mutually compatible with respect to the same metric. For instance, if the inner-product measure of similarity were applied, the learning equations should read:

x^T(t) m_c(t) = max_i {x^T(t) m_i(t)},    (9)

m_i(t+1) = [m_i(t) + α'(t)x(t)] / ||m_i(t) + α'(t)x(t)||   if i ∈ N_c(t),
m_i(t+1) = m_i(t)                                          if i ∉ N_c(t),    (10)

and 0 < α'(t) < ∞; for instance, α'(t) = 100/t. This process normalizes the reference vectors at each step. The normalization computations slow down the training algorithm significantly. On the other hand, the linear matching criterion applied during recognition is very simple and fast, and amenable to many kinds of simple analog computation, both electronic and optical.

III. FINE TUNING OF THE MAP BY LEARNING VECTOR QUANTIZATION (LVQ) METHODS

If the Self-Organizing Map is to be used as a pattern classifier in which the cells or their responses are grouped into subsets, each of which corresponds to a discrete class of patterns, then the problem becomes a decision process and must be handled differently. The original Map, like any classical Vector Quantization (VQ) method (cf. Sec. I-D), is mainly intended to approximate input signal values, or their probability density function, by quantized "codebook" vectors that are localized in the input space so as to minimize a quantization error functional (cf. Sec. III-A below). On the other hand, if the signal sets are to be classified into a finite number of categories, then several codebook vectors are usually made to represent each class, and their identity within the classes is no longer important. In fact, only decisions made at class borders count. It is then possible, as shown below, to define effective values for the codebook vectors such that they directly define near-optimal decision borders between the classes, even in the sense of classical Bayesian decision theory. These strategies and learning algorithms were introduced by the present author [38], [43], [45] and called Learning Vector Quantization (LVQ).

A. Type One Learning Vector Quantization (LVQ1)

If several codebook vectors m_i are assigned to each class, and each of them is labeled with the corresponding class symbol, the class regions in the x space are defined by simple nearest-neighbor comparison of x with the m_i; the label of the closest m_i defines the classification of x.

To define the optimal placement of the m_i in an iterative learning process, initial values for them must first be set using any classical VQ method or by the Self-Organizing Map algorithm. The initial values in both cases roughly correspond to the overall statistical density function p(x) of the input. The next phase is to determine the labels of the codebook vectors, by presenting a number of input vectors with known classification, and assigning the cells to different classes by majority voting, according to the frequency with which each m_i is closest to the calibration vectors of a particular class.

The classification accuracy is improved if the m_i are updated according to the following algorithm [41], [43]-[45]. The idea is to pull codebook vectors away from the decision surfaces in order to demarcate the class borders more accurately. Let m_c be the codebook vector closest to x in the Euclidean metric (cf. (2), (2')); this then also defines the classification of x. Apply training vectors x whose classification is known. Update the m_i = m_i(t) as follows:

m_c(t+1) = m_c(t) + α(t)[x(t) - m_c(t)]   if x is classified correctly,
m_c(t+1) = m_c(t) - α(t)[x(t) - m_c(t)]   if the classification of x is incorrect,
m_i(t+1) = m_i(t)                         for i ≠ c.    (11)

Here α(t) is a scalar gain (0 < α(t) < 1), which decreases monotonically in time, as in the earlier formulas. Since this is a fine-tuning method, one should start with a fairly small value, say α(0) = 0.01 or 0.02, and let it decrease to zero, say, in 100 000 steps.
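Written out in code, (11) might read as follows. This is an illustrative sketch; the sampling order and the linearly decreasing gain are assumptions consistent with, but not prescribed by, the text.

import numpy as np

def lvq1(m, m_labels, X, y, alpha0=0.02, n_steps=100_000, seed=0):
    """LVQ1 fine tuning of labeled codebook vectors, cf. (11).

    m        : (k, n) codebook vectors, already initialized (e.g., by the SOM)
    m_labels : (k,) class label of each codebook vector
    X, y     : training vectors and their known classes
    """
    rng = np.random.default_rng(seed)
    for t in range(n_steps):
        alpha = alpha0 * (1.0 - t / n_steps)            # decreases to zero
        j = rng.integers(len(X))
        x = X[j]
        c = np.argmin(np.linalg.norm(m - x, axis=1))    # nearest codebook vector
        if m_labels[c] == y[j]:                         # x classified correctly
            m[c] += alpha * (x - m[c])
        else:                                           # classification incorrect
            m[c] -= alpha * (x - m[c])
    return m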
This algorithm tends to reduce the point density of the m_i around the Bayesian decision surfaces. This can be deduced as follows. The minus sign in the second equation may be interpreted as defining corrections in the same direction as if (10) were used for the class to which m_c belongs, but with the probability density function of the neighboring (overlapping) class subtracted from that of m_c. In other words, we would perform a classical Vector Quantization of the function |p(x|C_i)P(C_i) - p(x|C_j)P(C_j)|, where C_i and C_j are the neighboring classes, p(x|C_i) is the conditional probability density function of samples x belonging to class C_i, and P(C_i) is the a priori probability of occurrence of the class C_i samples. The difference between the density functions of the neighboring classes, by definition, drops to zero at the Bayes border, inducing the above "depletion layer" of the codebook vectors.

After training, the m_i will have acquired values such that classification using the "nearest neighbor" principle, by



comparing x with the m_i, already rather closely coincides with that of the Bayes classifier. Figure 8 represents an illustrative example in which x is two-dimensional, and the probability density functions of the classes substantially overlap. The decision surface defined by this classifier seems to be near-optimal, although piecewise linear, and the classification accuracy in this rather difficult example is within a fraction of a percent of that achieved with the Bayes classifier. For practical applications of the LVQ1, cf. [12], [76]. A rigorous mathematical discussion of the LVQ1, and suggestions to improve its stability, have been presented in [51].

Fig. 8. An illustrative example in which x is two-dimensional and the probability density functions of the classes substantially overlap. (a) The probability density function of x = [x_1, x_2]^T is represented here by its samples, the small dots. The superposition of two symmetric Gaussian density functions corresponding to two different classes C_1 and C_2, with their centroids shown by the white and the dark cross, respectively, is shown. Solid curve: the theoretically optimal Bayes decision surface. (b) Large black dots: reference vectors of class C_1. Open circles: reference vectors of class C_2. Solid curve: decision surface in the Learning Vector Quantization. Broken curve: Bayes decision surface.

B. Type Two Learning Vector Quantization (LVQ2)

The previous algorithm can easily be modified to comply even better with Bayes' decision-making philosophy [43]-[45]. Assume that two codebook vectors m_i and m_j that belong to different classes and are closest neighbors in the vector space are initially in a wrong position. The (incorrect) discrimination surface, however, is always defined as the midplane of m_i and m_j. Let us define a symmetric window of nonzero width around the midplane, and stipulate that corrections to m_i and m_j shall only be made if x falls into the window on the wrong side of the midplane (cf. Fig. 9):

m_i(t+1) = m_i(t) - α(t)[x(t) - m_i(t)],
m_j(t+1) = m_j(t) + α(t)[x(t) - m_j(t)],
   if C_i is the nearest class, but x belongs to C_j ≠ C_i, where C_j is the next-to-nearest class ("runner-up"); furthermore, x must fall into the "window." In all the other cases,
m_k(t+1) = m_k(t).    (12)

Fig. 9. Illustration of the "window" used in the LVQ2 and LVQ3 algorithms. The curves represent class distributions of x samples.

If the corrections are made according to (12), it is easy to see that for vectors falling into the window, the corrections of both m_i and m_j, on average, have such a direction that the midplane moves towards the crossing surface of the class distributions, and thus asymptotically coincides with the Bayes decision border.

The optimal width of the window must be determined experimentally, and it depends on the number of available samples. With a relatively small number of training samples, a width of 10 to 20% of the difference between m_i and m_j seems to be proper.

One question concerns the practical definition of the "window." If we are working in a high-dimensional signal space, it seems reasonable to define the "window" in terms of relative distances d_i and d_j from m_i and m_j, respectively, having constant ratios. In this way the borders of the "window" are Apollonian hyperspheres. The vector x is defined to lie in the "window" if

min (d_i/d_j, d_j/d_i) > s.    (13)

If w is the relative width of the window at its narrowest point, then s = (1 - w)/(1 + w). The optimal size of the window depends on the number of available training samples. If we had a large number of samples, a narrow window would guarantee the most accurate location of the border; but for good statistical accuracy the number of samples falling into the window must be sufficient, too, so a 20% window seems a good compromise, at least in the experiments reported below.

For reasons explained in the next section, the classification accuracy of the LVQ2 is first improved when the decision surface is shifted towards the Bayes limit; after that, however, the m_i continue "drifting away." Therefore this algorithm ought to be applied for a relatively short time only, say, starting with α = 0.02 and letting it decrease to zero in at most 10 000 steps.
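One LVQ2 step, combining the window test (13) with the paired correction (12), may be sketched as follows. The names are mine, and the relative window width w = 0.2 corresponds to the 20% window suggested above.

import numpy as np

def lvq2_step(m, m_labels, x, y, alpha, w=0.2):
    """One LVQ2 step, cf. (12) and (13)."""
    d = np.linalg.norm(m - x, axis=1)
    i, j = np.argsort(d)[:2]            # nearest and next-to-nearest ("runner-up")
    s = (1.0 - w) / (1.0 + w)
    in_window = min(d[i] / d[j], d[j] / d[i]) > s
    # Corrections only if the nearest vector is of the wrong class, the
    # runner-up is of the correct class, and x falls into the window.
    if in_window and m_labels[i] != y and m_labels[j] == y:
        m[i] -= alpha * (x - m[i])      # push the wrong-class vector away
        m[j] += alpha * (x - m[j])      # pull the correct-class vector closer
    return m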



C. Type Three Learning Vector Quantization (LVQ3)

The LVQ2 algorithm was based on the idea of differentially shifting the decision borders towards the Bayes limits, while no attention was paid to what might happen to the location of the m_i in the long run if this process were continued. Thus, although researchers have reported good results, some have had problems, too. It turns out that at least two different kinds of detrimental effect must be taken into account. First, because corrections are proportional to the difference of x and m_i, or x and m_j, the correction on m_j (correct class) is of larger magnitude than that on m_i (wrong class); this results in monotonically decreasing distances ||m_i - m_j||. One remedy is to compensate for this effect, approximately at least, by accepting all the training vectors from the "window," the only condition being that one of m_i and m_j must belong to the correct class, and the other to the incorrect class. The second problem arises from the fact that if the process in (12) is continued, it may lead to another asymptotic equilibrium of the m_i that is no longer optimal. Therefore it seems necessary to include corrections that ensure that the m_i continue approximating the class distributions, at least roughly. Combining these ideas, we now obtain an improved algorithm that may be called LVQ3:

m_i(t+1) = m_i(t) - α(t)[x(t) - m_i(t)],
m_j(t+1) = m_j(t) + α(t)[x(t) - m_j(t)],
   where m_i and m_j are the two closest codebook vectors to x, and x and m_j belong to the same class, while x and m_i belong to different classes; furthermore, x must fall into the "window";
m_k(t+1) = m_k(t) + εα(t)[x(t) - m_k(t)]
   for k ∈ {i, j}, if x, m_i, and m_j belong to the same class.    (14)

In a series of experiments, applicable values for ε between 0.1 and 0.5 were found. The optimal value of ε seems to depend on the size of the window, being smaller for narrower windows. This algorithm seems to be self-stabilizing, i.e., the optimal placement of the m_i does not change in continued learning.

Notice that whereas in LVQ1 only one of the m_i values was changed at each step, LVQ2 and LVQ3 change two codebook vectors simultaneously.

IV. APPLICATION OF THE MAP TO SPEECH RECOGNITION

When artificial neural networks are to be used for a practical pattern recognition application such as speech recognition, the first task is to make clear whether it is desirable to perform the complete chain of processing operations starting, e.g., from the pre-analysis of the microphone signal and leading on to some form of linguistic encoding of speech using "all-neural" operations, or whether "neural networks" should only be applied at the most critical stage, whereby the rest of the processing operations can be implemented on standard computing equipment. This choice mainly depends on whether the objective is commercial or academic.

Another issue is whether the aim is to demonstrate the ultimate capabilities of "neural networks" in the analysis of dynamical speech information, or whether it is only to replace some of the traditional "spectral" and "vector space" pattern recognition algorithms by highly adaptive, learning "neural network" principles.

In a speech recognizer, a proper place for artificial neural networks is in the phonemic recognition stage, where exacting statistical analysis is needed. It should be remembered that if phonemes, i.e., classes of different phonological realizations of vowels and consonants, are selected for the basic phonetic units, then account has to be taken of their transformations due to coarticulation effects. In other words, the spectral properties of the phonemes are changed in the context or frame of other phonemes. In an "all-neural" speech recognizer it may not be necessary to distinguish or consider phonemes at all, because interpretation of speech is then regarded as an integral, implicit process. Introduction of the phoneme concept already implies that the system must be able to identify phonemes automatically in one form or another and to label the corresponding time intervals. Correction of coarticulation effects may then already be implemented in the acoustic analysis itself, by regarding the speech states as Markov processes, and analyzing the state transitions statistically [53]. A different approach altogether is first to apply some vector quantization classification, whereby the speech waveform is only labeled by class symbols of stationary phonemes, as if no coarticulation effects were being taken into account. Corrections can then be made afterwards in a separate postprocessing stage, in symbolic form. We have used the latter approach.

We have implemented a practical "phonetic typewriter" for unlimited speech input using the Self-Organizing Map to spot and recognize phonemes in continuous speech (Finnish and Japanese) [42], [46], [48]. The "network" was fine-tuned for optimal decision accuracy by the Learning Vector Quantization. After that, in the postprocessing stage we applied a self-learning grammar that corrects the majority of coarticulation errors and derives its numerous transformation rules automatically from given examples. This principle, termed "Dynamically Expanding Context" [37], [40], actually belongs to the category of learning Artificial Intelligence methods, and thus falls outside the scope of this article (cf. Sec. IV-D below).

A. Acoustic Preprocessing of the Speech Signal

It is known that biological sensory organs such as the inner ear are usually able to adapt to signal transients in a fast, nonlinear way. Nonetheless, we decided to apply conventional frequency analysis to the preprocessing of speech. The main reason for this was that digital Fourier analysis is both accurate and fast, and the fundamentals of digital filtering are well understood. Deviations from physiological reality are not essential, since the self-organizing neural network can accept many alternative kinds of preprocessing and can compensate for minor imperfections.

The technical details of the acoustic preprocessing stage are briefly as follows: 1) 5.3-kHz low-pass switched-capacitor filter, 2) 12-bit A/D converter with 13.02-kHz sampling rate, 3) 256-point FFT formed every 9.83 ms using a Hamming window, 4) logarithmization and smoothing of the power spectrum, 5) combination of spectral channels from the frequency range 200 Hz-5 kHz into a 15-component pattern vector,


6) subtraction of the average from the components, and 7) normalization of the pattern vectors. Except for steps 1) and 2), an integrated-circuit signal processor, the TMS32010, is used for the computation.
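For illustration, steps 3)-7) of this pipeline can be approximated in a few lines of NumPy. This is a sketch, not the implementation used in the system; in particular, the uniform grouping of FFT channels into 15 components below is an assumed choice, since the paper does not state exactly how the channels between 200 Hz and 5 kHz were combined.

import numpy as np

FS = 13020.0            # 13.02-kHz sampling rate
NFFT = 256              # FFT length; one frame every 9.83 ms

def spectral_pattern(frame):
    """Turn one 256-sample speech frame into a 15-component pattern vector."""
    spec = np.fft.rfft(frame * np.hamming(NFFT))          # step 3: windowed FFT
    logp = np.log(np.abs(spec) ** 2 + 1e-12)              # step 4: log power
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)
    band = (freqs >= 200.0) & (freqs <= 5000.0)            # step 5: 200 Hz - 5 kHz
    groups = np.array_split(logp[band], 15)                # 15 grouped channels
    x = np.array([g.mean() for g in groups])
    x -= x.mean()                                           # step 6: remove average
    return x / np.linalg.norm(x)                            # step 7: normalize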
B. Phoneme Map

The simplest type of speech map formed by self-organization is the static phoneme map. There are 21 phonemes in Finnish: /u, o, a, æ, ø, y, e, i, s, m, n, t, l, r, j, v, h, d, k, p, ŋ/. For their representation we used short-time spectra as the input patterns x(t). The spectra were evaluated every 9.83 ms. They were computed by the 256-point FFT, from which a 15-component spectral vector was formed by grouping of the channels. In the present study all the spectral samples, even those from the transitory regions, were employed and presented to the algorithm in the natural order of their utterance. During learning, the spectra were not segmented or labeled in any way: any features present in the speech waveform contributed to the self-organized map. After adaptation, the map was calibrated using known stationary phonemes (Fig. 10). The map resembles the well-known formant maps used in phonetics; the main difference is that in our maps complete spectra, not just two of their resonant frequencies as in formant maps, are used to define the mapping.

Fig. 10. An example of a phoneme map. Natural Finnish speech was processed by a model of the inner ear which performs its frequency analysis. The resulting signals were then connected to an artificial neural network, the cells of which are shown in this picture as circles. The cells were tuned automatically, without any supervision or extra information given, to the acoustic units of speech known as phonemes. The cells are labeled by the symbols of those phonemes to which they "learned" to give responses; most cells give a unique answer, whereas the double labels show which cells respond to two phonemes.

Recognition of discrete phonemes is a decision-making process in which the final accuracy only depends on the rate of misclassification errors. It is therefore necessary to try to minimize them using a decision-controlled (supervised) learning scheme with a training set of speech spectra of known classification.

In practice, for a new speaker, it will be sufficient to dictate 200 to 300 words, which are then analyzed by an automatic segmentation method. The latter picks up the training spectra that are applied in the supervised learning algorithm. The finite set of training spectra (of the order of 2000) must be repeated in the algorithm either cyclically or in a random permutation. LVQ1, LVQ2, or LVQ3 can be used as the fine-tuning algorithm. A map created for a typical (standard) speaker can then be modified for a new speaker very quickly, using 100 more dictated words and LVQ fine tuning only.

C. Specific Problems with Transient Phonemes

Generally, the spectral properties of consonants behave more dynamically than those of vowels. Especially in the case of stop consonants, it seems to be better to pay attention to the plosive burst and the transient region between the consonant and the subsequent vowel in order to identify the consonant. In our system transient information is coded in additional "satellite" maps (called transient maps), and they are trained, using transient spectral samples alone, to describe the dynamic features with higher resolution [48]. Our system was in fact developed in two versions: one for Finnish and one for Japanese. In the Japanese version, four transient maps have been constructed to distinguish the following cases:

1) voiceless stops /k, p, t/ and the glottal stop (vowel at the beginning of utterance),
2) voiceless stops /k, p, t/ without comparison with the glottal stop,
3) voiced stops /b, d, g/,
4) nasals /m, n, ŋ/.

Only one transient map has been adopted for the Finnish version, making the distinction between /k, p, t/ and the glottal stop. (/b/ and /g/ do not exist in original Finnish.)

D. Compensation for Coarticulation Effects Using the "Dynamically Expanding Context"

Because of coarticulation effects, i.e., transformation of the speech spectra due to neighboring phonemes, systematic errors appear in phonemic transcriptions. For instance, the Finnish word "hauki" (meaning pike) is almost invariably recognized as the phoneme string /haouki/ by our acoustic processor. It may then be suggested that if a transformation rule /aou/ → /au/ is introduced, this error will be corrected. It might also be imagined that it is possible to list and take into account all such variations. However, there may be hundreds of different frames or contexts of neighboring phonemes in which a particular phoneme may occur, and in many cases such empirical rules are contradictory; they are only statistically correct. The frames may



also be erroneous. In order to find an optimal and very large It may be of interest here to mention other results, inde-
system of rules, the Dynamically Expanding Context gram- pendent of ours. McDermott and Katagiri [28], [66] have car-
mar mentioned abovewas applied [37], [40]. Its rules or pro- ried out experiments o n all the Japanese phonemes, and
ductionscan be used t o transform erroneous symbol strings report that LVQ2 gave consistently higher accuracies than
into correct ones, and even into orthographic text. Backpropagation Time Delay Neural Networks [I051and was
Because the correction rules are made accessible in faster in learning.
memory using a softwarecontent-addressingmethod (hash
coding), they can be applied very quickly, such that the V. SEMANTIC MAP
overall operation of the grammar, even with 15 000 rules,
Demonstrations such as those reported above have indi-
is almost in real time. This algorithm is able to correct u p
cated that the Self-organizing Map i s indeed able to extract
to 70% of the errors left by the phoneme map recognizer.
E. Performance of the "Phonetic Typewriter"

In order to get some idea of the accuracy of the map algorithm, we first show a comparative benchmarking of five different methods, namely, classification of manually selected phonemic spectra by the classical parametric Bayes classifier, the well-known k-Nearest-Neighbor method (kNN), and LVQ1, LVQ2, and LVQ3.

In this partial experiment, the spectral samples were of Finnish phonemes (divided into 19 classes and using 15 frequency channels for the spectral decomposition). There were 1550 training vectors, and 1550 statistically independent vectors that were only used for testing. The error percentages are given in Table 2.

Table 2  Speech Recognition Experiments with Error Percentages for Independent Test Data

             Parametric Bayes    kNN     LVQ1    LVQ2    LVQ3
    Test 1        12.1           12.0    10.2     9.8     9.6
    Test 2        13.8           12.1    13.2    12.0    11.5

Note that the parametric Bayes classifier is not even theoretically the best because it assumes that the class samples are normally distributed. We have not been able to find any method, theoretical or heuristic, that classifies speech spectra better than LVQ1, LVQ2, or LVQ3.
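For reference, the LVQ1 column of Table 2 rests on the basic update in which the nearest codebook vector is attracted toward a correctly classified training sample and repelled from a misclassified one. The sketch below shows that step in its standard textbook form; the NumPy layout, the learning-rate value, and the function names are assumptions of the sketch, not the benchmark code.

    import numpy as np

    def lvq1_step(codebook, labels, x, y, alpha=0.05):
        """One LVQ1 update: codebook is (M, d), labels is (M,), x is a sample of class y."""
        c = np.argmin(np.linalg.norm(codebook - x, axis=1))   # nearest codebook vector
        sign = 1.0 if labels[c] == y else -1.0                 # attract if classes agree, repel otherwise
        codebook[c] += sign * alpha * (x - codebook[c])
        return codebook

    def lvq_classify(codebook, labels, x):
        """Classification by the label of the nearest codebook vector."""
        return labels[np.argmin(np.linalg.norm(codebook - x, axis=1))]

LVQ2 and LVQ3, which give the best figures in Table 2, refine this rule by updating two nearest codebook vectors at a time under an additional window condition.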
In its complete form, the "Phonetic Typewriter" has been tested on several Finnish and Japanese speakers over a long period. To someone familiar with practical speech recognizers it will be clear that it would be meaningless to evaluate and compare different test runs statistically; the results obtained in each isolated test depend so much on the experimental situation and the content of text, the status and tuning of the equipment, as well as on the physical condition of the speaker. The number of tests performed over many years is also too large to be discussed fully here. Let it suffice to mention that the accuracy of spotting and recognizing phonemes in arbitrary continuous speech typically varies between 80 and 90% (depending on the automatic segmentation and recognition of any phoneme), and this figure depends on the speaker and the text. After compensation for coarticulation effects and editing the text into orthographic form, the accuracy, in terms of correctness of any letter, is of the order of 92 to 97%, again depending on the speaker and the text.

The Phonetic Typewriter has already been implemented in several hardware versions using signal processor chips. The latest versions operate in genuine real time with continuous dictation.

It may be of interest here to mention other results, independent of ours. McDermott and Katagiri [28], [66] have carried out experiments on all the Japanese phonemes, and report that LVQ2 gave consistently higher accuracies than Backpropagation Time Delay Neural Networks [105] and was faster in learning.

V. SEMANTIC MAP

Demonstrations such as those reported above have indicated that the Self-Organizing Map is indeed able to extract abstract information from multidimensional primary signals, and to represent it as a location, say, in a two-dimensional network. Although this is already a step towards generalization and symbolism, it must be admitted that the extraction of features from geometrically or physically relatable data elements is still a very concrete task, in principle at least.

The operation of the brain at the higher levels relies heavily on abstract concepts, symbolism, and language. It is an old notion that the deepest semantic elements of any language should also be physiologically represented in the neural realms. There is now new physiological evidence for linguistic units being locatable in the human brain [6], [15].

In attempting to devise Neural Network models for linguistic representations, the first difficulty is encountered when trying to find metric distance relations between symbolic items. Unlike with primary sensory signal patterns, for which similarity is easily derivable from their mutual distances in the vector spaces in which they are represented, it cannot be assumed that encodings of symbols in general have any relationship with the observable characteristics of the corresponding items. How could it then be possible to represent the "logical similarity" of pairs of items, and to map such items topographically? The answer lies in the fact that the symbol, during the learning process, is presented in context, i.e., in conjunction with the encodings of a set of other concurrent items. In linguistic representations context might mean a few adjacent words. Similarity between items would then be reflected through the similarity of the contexts. Note that for ordered sets of arbitrary encodings, invariant similarity can be expressed, e.g., in terms of the number of items they have in common. On the other hand, it may be evident that the meaning (semantics) of a symbolic encoding is only derivable from the conditional probabilities of its occurrences with other encodings, independent of the type of encoding [68].

However, in the learning process, the literal encodings of the symbols must be memorized, too. Let vector x_s represent the symbolic expression of an item, and x_c the representation of its context. The simplest neural model then assumes that x_s and x_c are connected to the same neural units, i.e., the representation (pattern) vector x of the item is formed as a concatenation of x_s and x_c:

    x = [x_s^T, x_c^T]^T.                                   (14)

In other words, the symbol part and the context part form a vectorial sum of two orthogonal components.

The core idea underlying symbol maps is that the two parts are weighted properly such that the norm of the context part predominates over that of the symbol part during the self-organizing process; the topographical mapping then mainly reflects the metric relationships of the sets of associated encodings. But since the inputs for symbolic signals are also active all the time, memory traces of them are formed in the corresponding inputs of those cells in the map that have been selected (or actually enforced) by the context part. If then, during recognition of input information, the context signals are missing or are weaker, the (same) map units are selected solely on the basis of the symbol part. In this way the symbols become encoded into a spatial order reflecting their logical (or semantic) similarities.
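A minimal sketch of this weighting is given below: the symbol part is scaled to a small norm a (the experiment described next uses a = 0.2) and the context part to unit norm before the two are concatenated as in (14). The function name and the use of NumPy are assumptions made only for illustration.

    import numpy as np

    def make_pattern(x_s, x_c, a=0.2):
        """Concatenate a scaled symbol part and a context part as in (14)."""
        x_s = a * x_s / np.linalg.norm(x_s)     # symbol part, norm a
        x_c = x_c / np.linalg.norm(x_c)         # context part, unit norm, dominates the metric
        return np.concatenate([x_s, x_c])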
In the following, I shall demonstrate this idea, which was originated by H. Ritter, using a simple language [84]. The simplest definition of the context of a word is to take all those words (together with their serial order) that occur in a certain "window" around the selected word. For simplicity, we shall imagine that the content of each "window" can somehow be presented to the x_c input ports of the neural system. We are not interested here in any particular means for the conversion of, say, temporal signal patterns into parallel ones (this task could be done using paths with different delays, eigenstates that depend on sequences, or any other mechanisms implementable in short-term memory).

The vocabulary used in this experiment is listed in Fig. 11(a) and comprises nouns, verbs, and adverbs. Each word class has further categorial subdivisions, such as names of persons, animals, and inanimate objects. To study semantic relationships in their purest form, it must be stipulated that the semantic meaning be not inferable from any patterns used for the encoding of the individual words, but only from the context in which the words occur (i.e., combinations of words). To this end each word was encoded by a random vector of unit length (here, seven-dimensional).

Fig. 11. Outline of vocabulary used in this experiment. (a) List of used words (nouns, verbs, and adverbs) with their category numbers: Bob/Jim/Mary 1, horse/dog/cat 2, beer/water 3, meat/bread 4, runs/walks 5, works/speaks 6, visits/phones 7, buys/sells 8, likes/hates 9, drinks/eats 10/11, much/little 12, fast/slowly 13, often/seldom 14, well/poorly 15. (b) Sentence patterns, given as triples of the category numbers above (1-5-12, 1-5-13, 1-5-14, 1-6-12, 1-6-13, 1-6-14, 1-6-15, 1-7-14, and so on up to 2-11-14). (c) Some examples of generated three-word sentences: Jim speaks well, Mary likes Jim, Jim eats often, Mary buys meat, dog drinks fast, horse hates meat, Jim eats seldom, Bob buys meat, cat walks slowly, Jim eats bread, cat hates Jim, Bob sells beer.

A sequence of randomly generated meaningful three-word sentences was used as the input data to the self-organizing process. Meaningful sentence patterns had therefore first to be constructed on the basis of word categories (Fig. 11(b)). Each explicit sentence was then constructed by randomly substituting the numbers in a randomly selected sentence pattern from Fig. 11(b) by words with compatible numbering in Fig. 11(a). A total of 498 different three-word sentences are possible, a few of which are exemplified in Fig. 11(c). These sentences were concatenated into a single continuous string, S.

The context of a word in this string was restricted to the pair of words formed by its immediate predecessor and successor in S (ignoring any sentence borders; i.e., words from adjacent sentences in S are uncorrelated, and act like random noise in that field). The code vectors of the predecessor/successor-pair forming the context to a word were concatenated into a single 14-dimensional code vector x_c. In this simple demonstration we thus only took into account the context provided by the immediately adjacent textual environment of each word occurrence. Even this restricted context already contains interesting semantic relationships.

In our computer experiments it turned out that instead of presenting each phrase separately to the algorithm, a much more efficient learning strategy is first to consider each word in its average context over a set of possible "windows". The (mean) context of a word was thus first defined as the average over 10 000 sentences of all code vectors of predecessor/successor-pairs surrounding that particular word. The resulting thirty 14-dimensional "average word contexts", normalized to unit length, assumed the role of the "context fields" x_c in (14). Each "context field" was combined with a 7-dimensional "symbol field" x_s, consisting of the code vector for the word itself, but scaled to length a. The parameter a determines the relative influence of the symbol part x_s in comparison to the context part x_c and was set to 0.2.

For the simulation, a planar, rectangular lattice of 10 by 15 cells was used. The initial weight vectors of the cells were chosen randomly, so that no initial order was present. Updating was based on (7) and (8). The learning step size was h_0 = 0.8 and the radius σ(t) of the adjustment zone (cf. (8)) was gradually decreased from an initial value σ_i = 4 to a final value σ_f = 0.5 according to the law σ(t) = σ_i (σ_f/σ_i)^(t/t_max). Here t counts the number of adaptation steps.
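The shrinking adjustment zone can be written down directly from this law; the following helper is a small sketch of that schedule (the function and argument names are illustrative only).

    def radius(t, t_max=2000, sigma_i=4.0, sigma_f=0.5):
        """Exponentially shrinking neighborhood radius: sigma_i * (sigma_f/sigma_i)**(t/t_max)."""
        return sigma_i * (sigma_f / sigma_i) ** (t / t_max)

    # e.g., radius(0) == 4.0, radius(2000) == 0.5, radius(1000) is roughly 1.41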
After t_max = 2000 input presentations the responses of the neurons to presentation of the symbol parts alone were tested. In Fig. 12, the symbolic label is written to that site at which the symbol signal x = [x_s, 0]^T gave the maximum response. We clearly see that the contexts "channel" the word items to memory positions whose arrangement reflects both grammatical and semantic relationships. Words of the same type, i.e., nouns, verbs, and adverbs, are segregated into separate, large domains. Each of these domains is further organized according to similarities on the semantic level. Adverbs with opposite meaning tend to be close to each other, because sentences differing in one word only are regarded as semantically correlated, and the words that are different then usually have the opposite meaning. The groupings of the verbs correspond to differences in the ways they can co-occur with adverbs, persons, animals, and nonanimate objects such as, e.g., food.

It could be argued that the structures resulting in the map were artificially created by a preplanned choice of the sentence patterns allowed as input. This is not the case, however, since it is easy to check that the categorial sentence patterns in Fig. 11(b) almost completely exhaust the possibilities for forming semantically meaningful three-word sentences.
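The labeling step just described is essentially a winner search with the context field set to zero. Below is a minimal sketch of such a calibration pass, assuming the trained weights are stored as one row per map cell and interpreting "response" as the inner product between the input and a weight vector; all names (label_map, word_codes, W) are illustrative assumptions.

    import numpy as np

    def label_map(W, word_codes, a=0.2, dim_c=14):
        """Assign each word to the map cell giving the largest response to x = [a*x_s, 0]."""
        labels = {}
        for word, x_s in word_codes.items():
            x = np.concatenate([a * x_s, np.zeros(dim_c)])   # symbol part only, context zeroed
            winner = int(np.argmax(W @ x))                   # response interpreted as inner product
            labels[word] = winner                            # index of the best-responding cell
        return labels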

Fig. 12. "Semantic map" obtained on a network of 10 x 15 cells after 2000 presentations of word-context pairs derived from 10 000 random sentences of the kind shown in Fig. 11(c). Nouns, verbs, and adverbs are segregated into different domains. Within each domain a further grouping according to aspects of meaning is discernible.

VI. SURVEY OF PRACTICAL APPLICATIONS OF THE MAP

In addition to numerous more abstract simulations, theoretical developments, and "toy examples," the following practical problem areas have been suggested for the Self-Organizing Map or the LVQ algorithms. In some of them concrete work is already in progress:

- statistical pattern recognition, especially recognition of speech [42], [48];
- control of robot arms, and other problems in robotics [17], [18], [63], [85], [87], [88];
- control of industrial processes, especially diffusion processes in the production of semiconductor substrates [62], [102];
- automatic synthesis of digital systems [23];
- adaptive devices for various telecommunications tasks [3], [4], [47];
- image compression [71];
- radar classification of sea-ice [76];
- optimization problems [2];
- sentence understanding [95];
- application of expertise in conceptual domains [96]; and even
- classification of insect courtship songs [73].

Of these, the application to speech recognition has the longest tradition in demonstrating the power of the map method when dealing with difficult stochastic signals. My personal expectations are, however, that the greatest industrial potential of this method may lie in process control and telecommunications.

On the other hand, it is a little surprising that so few applications of the maps to computer vision are being studied. This does not mean that the problems of vision are not important. It is rather that automatic analysis and extraction of visual features, without a heuristic or analytical approach, has turned out to be an extremely difficult problem. Biological and artificial vision probably require very complicated hierarchical systems using many stages (e.g., several different maps) [55], [56]. One unclarified problem is how the maps should be interconnected, e.g., whether special nonlinear interfaces are needed [82]; and in hierarchical systems, adaptive normalization of input (cf. [31]) also seems necessary. Only a few isolated problems, such as texture analysis that is under study in our laboratory, might be amenable to the basic method as such.

VII. DISCUSSION

It was stated in Secs. I and III that it is not advisable to use the Self-Organizing Map for classification problems because decision accuracy can be significantly increased if fine tuning such as LVQ is used. Another important notion, that not only concerns the maps but most of the other neural network models as well, is that it would often be absurd to use primary signal elements, such as temporal samples of speech waveform or pixels of an image, for the components of x directly. This is especially true if the input patterns are fine-structured, like line drawings. It is not possible to achieve any invariances in perception unless the primary information is first transformed, using, e.g., various convolutions with, say, Gabor functions [13], or other, possibly nonlinear functionals of the image field [81], as components of x. Which particular choice of functionals should be used for preprocessing in a particular task is a very difficult and delicate question, and cannot be discussed here.
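As one hypothetical example of such a transformation, the sketch below projects an image onto a small bank of Gabor functions and uses the projections as components of x. It is only an illustration of the idea; the particular parameters, the real-valued Gabor form, and the function names are assumptions, and many other functionals could serve equally well.

    import numpy as np

    def gabor(shape, theta, wavelength, sigma):
        """A real-valued Gabor function sampled on an image-sized grid."""
        h, w = shape
        y, x = np.mgrid[:h, :w]
        x = x - w / 2.0
        y = y - h / 2.0
        xr = x * np.cos(theta) + y * np.sin(theta)          # rotated coordinates
        yr = -x * np.sin(theta) + y * np.cos(theta)
        return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

    def gabor_features(image, n_orientations=4, wavelength=8.0, sigma=4.0):
        """Inner products of the image with a small bank of Gabor functions; these form x."""
        thetas = np.arange(n_orientations) * np.pi / n_orientations
        return np.array([np.sum(image * gabor(image.shape, t, wavelength, sigma))
                         for t in thetas])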
One question concerns the maximum capacity achievable in the maps. Is it possible to increase their size, to eventually use them for data storage in large knowledge data bases? It can at least be stated that the brain's maps are not particularly extensive; they mainly seem to provide for efficient encoding of a particular subset of signals to enhance the operation and capacity of associative memory [44]. If more extensive systems are required, it might be more efficient to develop hierarchical structures of abstract representations.

The hardware used for the maps has so far only consisted of co-processor boards (cf., e.g., [42], [48]). If the simple algorithm is to be directly built into special hardware, one of its essential operations will be the global extremum selector, for which conventional parallel computing hardware is available [39]. Analog "winner-take-all" circuits can also be used [20], [21], [52]. Another question is whether the learning operations ought to be performed on the board, or whether fixed values for weights could be loaded into the cells. Note that in the latter case the function of the cells can be very simple, like that of the conventional formal neurons. One beneficial property of the maps is that their parameters usually stabilize out into a narrow dynamic range, and the accuracy requirements are then modest. In this case even integer arithmetic operations can provide for sufficient accuracy.
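To illustrate these last remarks, the following sketch freezes a trained weight matrix, quantizes it to 8-bit integers, and performs the global extremum (winner) search in integer arithmetic only. The scaling scheme and all names are assumptions of the sketch and do not describe any of the hardware cited above.

    import numpy as np

    def quantize_weights(W, bits=8):
        """Map real-valued weights into signed integers of the given width."""
        scale = (2 ** (bits - 1) - 1) / np.max(np.abs(W))
        return np.round(W * scale).astype(np.int32), scale

    def integer_winner(Wq, x, scale):
        """Winner search using only integer squared distances."""
        xq = np.round(x * scale).astype(np.int32)
        d2 = np.sum((Wq - xq) ** 2, axis=1)      # integer arithmetic throughout
        return int(np.argmin(d2))

In practice the winner found this way usually agrees with the one given by the original real-valued weights, which is the sense in which modest, fixed-point precision can suffice for a loaded, non-learning map.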
What is the most significant difference between the Self-Organizing Map and other contemporary neural-model approaches? Most of the latter strongly emphasize the aspect of distributed processing, and only consider spatial organization of the processing units as a secondary aspect. The map principle, on the other hand, is in some ways complementary to this idea. The intrinsic potential of this particular self-organizing process for creating a localized, structured arrangement of representations in the basic network module is emphasized.

Actually, we should not talk of the localization of a "function": it is only the response that is localized. I am
thus not opposed to the view that neural networks are distributed systems. The massive interconnects that underlie all neural processing are certainly spread over the network; their effects, on the other hand, may be "focused" on local sites.

It seems inevitable, however, that any complex processing task requires organization of information into separate parts. Distributed processing models in general underrate this issue. Consequently, many models that process features of input data without structuring exhibit slow convergence and poor generalization ability, usually ensuing from ignorance of the localization of the adaptive processes.

On the lower perceptual levels, localization of responses in topographically organized maps was demonstrated long ago, and it is known that such maps need not be prespecified in detail, but can instead organize themselves on the basis of the statistics of the incoming signals. Such maps have already been applied with success in many complex pattern recognition and robot control tasks.

On the higher levels of representation, relationships between items seem to be based on more subtle roles in their occurrence, and are less apparent from their immediate intrinsic properties. Nonetheless it has also been shown recently that even with a simple modeling assumption of semantic roles, topological self-organization of semantic data will take place. To describe the role of an item, it is sufficient that the input data are presented together with a sufficient amount of context. This then controls the adaptation process.

In the practical application that we have studied most carefully, viz. speech recognition, a statistical accuracy of phonemic recognition has been achieved that is clearly equal to or better than the results produced by more conventional methods, even when the latter are based on analysis of signal dynamics [28], [66].

It should be emphasized that the map method is not restricted to the use of any particular form of preprocessing, such as amplitude spectra in speech recognition, or even to phonemes as basic phonological units. For instance, analogous maps may be formed for diphones, syllables, or demisyllables, and other spectral representations such as linear prediction coding (LPC) coefficients or cepstra may be used as the input information to the maps.

Although the basic one-level map, as demonstrated in Sec. II-E, has already been shown to be capable of creating hierarchical (ultrametric) representations of structured data distributions, it might be expected that the real potential of the map lies in a genuine hierarchical or otherwise structured system that consists of several interconnected map modules. In a more natural system, such modules might also correspond to contiguous areas in a single large sheet, where each area receives a different kind of external input, as in the different areas in the cortex. In that case, the borders between the modules might be diffuse. The problem of hierarchical maps, however, has turned out to be very difficult. One of the particular difficulties arises if the inputs to a cell come from very different sources; it then seems inevitable that for the comparison of input patterns, an asymmetrical distance function, in which the signal components are provided with adaptive tensorial weights, must be applied [31]. Another aspect concerns the interfaces of modules in a hierarchical map system: the signals merging from different modules may have to be combined nonlinearly [82]. On the other hand, it has already been demonstrated that the map, or the LVQ algorithms, can be used as a preprocessing stage for other models [26], [69], [101]. In the Counterpropagation Network of Hecht-Nielsen [22], competitive learning is neatly integrated into a hierarchical system as a special layer. Combinations of maps have also been studied in [74], [112], and [113].

It should be noted that slightly different self-organization ideas have recently been suggested [5], [10], [16]. They, however, fall outside the scope of this article.

One of the strongest original motives for starting the development of (artificial) neural networks was their use as learning systems that might effectively be able to utilize the vast capacities of active circuits that can be manufactured using semiconductor or optical technologies. It is therefore a little surprising that most of the theoretical research on and simulations of neural networks have been restricted to relatively small networks containing only a few tens to a few thousands of nodes (let alone parallel networks for preprocessing images). The main problem with most circuits seems to be slow convergence of learning, which again indicates that the best learning mechanisms are yet to be found.

REFERENCES

[1] S.-I. Amari, "Topographic organization of nerve fields," Bull. Math. Biology, vol. 42, pp. 339-364, 1980.
[2] B. Angeniol, C. de la Croix Vaubois, and J.-Y. Le Texier, "Self-organizing feature maps and the travelling salesman problem," Neural Networks, vol. 1, pp. 289-293, 1988.
[3] N. Ansari and Y. Chen, "Dynamic digital satellite communication network management by self-organization," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-567-II-570.
[4] D. S. Bradburn, "Reducing transmission error effects using a self-organizing network," Proc. Int. Joint Conf. on Neural Networks, IJCNN 89 (Washington, D.C., 1989) pp. II-531-II-537.
[5] D. J. Burr, "An improved elastic net method for the traveling salesman problem," Proc. IEEE Int. Conf. on Neural Networks, ICNN-88 (San Diego, Cal., 1988) pp. I-69-I-76.
[6] A. Caramazza, "Some aspects of language processing revealed through the analysis of acquired aphasia: The lexical system," Ann. Rev. Neurosci., vol. 11, pp. 395-421, 1988.
[7] M. Cottrell and J.-C. Fort, "A stochastic model of retinotopy: A self-organizing process," Biol. Cybern., vol. 53, pp. 405-411, 1986.
[8] -, "Etude d'un processus d'auto-organisation," Ann. Inst. Henri Poincare, vol. 23, pp. 1-20, 1987.
[9] A. R. Damasio, H. Damasio, and G. W. Van Hoesen, "Prosopagnosia: Anatomic basis and behavioral mechanisms," Neurology, vol. 24, pp. 89-93, 1975.
[10] R. Durbin and D. Willshaw, "An analogue approach to the travelling salesman problem using an elastic net method," Nature, vol. 326, pp. 689-691, 1987.
[11] D. Essen, "Functional organization of primate visual cortex," in Cerebral Cortex, vol. 3, A. Peters and E. G. Jones, Eds. New York: Plenum Press, 1985, pp. 259-329.
[12] I. Fuller and A. Farsaie, "Invariant target recognition using feature extraction," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-595-II-598.
[13] D. Gabor, "Theory of communication," J. IEE, vol. 93, pp. 429-459, 1946.
[14] A. Gersho, "On the structure of vector quantizers," IEEE Trans. Inform. Theory, vol. IT-25, no. 4, pp. 373-380, July 1979.
[15] H. Goodglass, A. Wingfield, M. R. Hyde, and J. C. Theurkauf, "Category specific dissociations in naming and recognition by aphasic patients," Cortex, vol. 22, pp. 87-102, 1986.
[16] I. Grabec, "Self-organization based on the second maximum entropy principle," First IEE Int. Conf. on Artificial Neural Networks, Conference Publication No. 313 (London, 1989) pp. 12-16.
[17] D. H. Graf and W. R. LaLonde, "A neural controller for collision-free movement of general robot manipulators," Proc. IEEE Int. Conf. on Neural Networks, ICNN-88 (San Diego, Cal., 1988) pp. I-77-I-84.
[18] D. H. Graf and W. R. LaLonde, "Neuroplanners for hand/eye coordination," Proc. Int. Joint Conf. on Neural Networks, IJCNN 89 (Washington, DC, 1989) pp. II-543-II-548.
[19] R. M. Gray, "Vector quantization," IEEE ASSP Mag., vol. 1, pp. 4-29, 1984.
[20] S. Grossberg, "On the development of feature detectors in the visual cortex with applications to learning and reaction-diffusion systems," Biol. Cybern., vol. 21, pp. 145-159, 1976.
[21] -, "Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors; II. Feedback, expectation, olfaction, illusions," Biol. Cybern., vol. 23, pp. 121-134 and 187-202, 1976.
[22] R. Hecht-Nielsen, "Applications of counterpropagation network," Neural Networks, vol. 1, pp. 131-139, 1988.
[23] A. Hemani and A. Postula, "Scheduling by self organisation," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-543-II-546.
[24] R. E. Hodges and C.-H. Wu, "A method to establish an autonomous self-organizing feature map," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. I-517-I-520.
[25] R. E. Hodges, C.-H. Wu, and C.-J. Wang, "Parallelizing the self-organizing feature map on multi-processor systems," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-141-II-144.
[26] R. M. Holdaway, "Enhancing supervised learning algorithms via self-organization," Proc. Int. Joint Conf. on Neural Networks, IJCNN 89 (Washington, D.C., 1989) pp. II-523-II-529.
[27] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Natl. Acad. Sci. USA, vol. 79, pp. 2554-2558, 1982.
[28] H. Iwamida, S. Katagiri, E. McDermott, and Y. Tohkura, "A hybrid speech recognition system using HMMs with an LVQ-trained codebook," ATR Technical Report TR-A-0061, ATR Auditory and Visual Perception Research Laboratories, 1989.
[29] J. H. Kaas, R. J. Nelson, M. Sur, C. S. Lin, and M. M. Merzenich, "Multiple representations of the body within the primary somatosensory cortex of primates," Science, vol. 204, pp. 521-523, 1979.
[30] J. H. Kaas, M. M. Merzenich, and H. P. Killackey, "The reorganization of somatosensory cortex following peripheral nerve damage in adult and developing mammals," Annual Rev. Neurosci., vol. 6, pp. 325-356, 1983.
[31] J. Kangas, T. Kohonen, and J. Laaksonen, "Variants of self-organizing maps," IEEE Trans. Neural Networks, vol. 1, pp. 93-99, 1990.
[32] A. Kertesz, Ed., Localization in Neuropsychology. New York, N.Y.: Academic Press, 1983.
[33] E. I. Knudsen, S. du Lac, and S. D. Esterly, "Computational maps in the brain," Ann. Rev. Neurosci., vol. 10, pp. 41-65, 1987.
[34] T. Kohonen, "Automatic formation of topological maps of patterns in a self-organizing system," Proc. 2nd Scandinavian Conf. on Image Analysis (Espoo, Finland, 1981) pp. 214-220.
[35] -, "Self-organized formation of topologically correct feature maps," Biol. Cybern., vol. 43, pp. 59-69, 1982.
[36] -, "Clustering, taxonomy, and topological maps of patterns," Proc. Sixth Int. Conf. on Pattern Recognition (Munich, Germany, 1982) pp. 114-128.
[37] -, "Dynamically expanding context, with application to the correction of symbol strings in the recognition of continuous speech," Proc. Eighth Int. Conf. on Pattern Recognition (Paris, France, 1986) pp. 1148-1151.
[38] -, "Learning Vector Quantization," Helsinki University of Technology, Laboratory of Computer and Information Science, Report TKK-F-A-601, 1986.
[39] -, Content-Addressable Memories, 2nd ed. Berlin, Heidelberg, Germany: Springer-Verlag, 1987.
[40] -, "Self-learning inference rules by dynamically expanding context," Proc. IEEE First Ann. Int. Conf. on Neural Networks (San Diego, CA, 1987) pp. II-3-II-9.
[41] -, "An introduction to neural networks," Neural Networks, vol. 1, pp. 3-16, 1988.
[42] -, "The 'neural' phonetic typewriter," Computer, vol. 21, pp. 11-22, March 1988.
[43] -, "Learning vector quantization," Neural Networks, vol. 1, suppl. 1, p. 303, 1988.
[44] -, Self-Organization and Associative Memory, 3rd ed. Berlin, Heidelberg, Germany: Springer-Verlag, 1989.
[45] T. Kohonen, G. Barna, and R. Chrisley, "Statistical pattern recognition with neural networks: Benchmarking studies," Proc. IEEE Int. Conf. on Neural Networks, ICNN-88 (San Diego, Cal., 1988) pp. I-61-I-68.
[46] T. Kohonen, K. Makisara, and T. Saramaki, "Phonotopic maps - insightful representation of phonological features for speech recognition," Proc. Seventh Int. Conf. on Pattern Recognition (Montreal, Canada, 1984) pp. 182-185.
[47] T. Kohonen, K. Raivio, O. Simula, O. Venta, and J. Henriksson, "An adaptive discrete-signal detector based on self-organizing maps," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-249-II-252.
[48] T. Kohonen, K. Torkkola, M. Shozakai, J. Kangas, and O. Venta, "Microprocessor implementation of a large vocabulary speech recognizer and phonetic typewriter for Finnish and Japanese," Proc. European Conference on Speech Technology (Edinburgh, 1987) pp. 377-380.
[49] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York, Berlin: Springer-Verlag, 1978.
[50] J. Lampinen and E. Oja, "Fast self-organization by the probing algorithm," Proc. Int. Joint Conf. on Neural Networks, IJCNN 89 (Washington, DC, 1989) pp. II-503-II-507.
[51] A. LaVigna, "Nonparametric classification using learning vector quantization," Ph.D. Thesis, University of Maryland, 1989.
[52] J. Lazzaro, S. Ryckebusch, M. A. Mahowald, and C. A. Mead, "Winner-take-all networks of O(N) complexity," in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann Publishers, 1989.
[53] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition," Bell Syst. Tech. J., pp. 1035-1073, Apr. 1983.
[54] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantization," IEEE Trans. Communication, vol. COM-28, pp. 84-95, 1980.
[55] S. P. Luttrell, "Self-organizing multilayer topographic mappings," Proc. IEEE Int. Conf. on Neural Networks, ICNN-88 (San Diego, CA, 1988) pp. I-93-I-100.
[56] -, "Hierarchical self-organizing networks," First IEE Int. Conf. on Artificial Neural Networks, Conference Publication No. 313 (London, 1989) pp. 2-6.
[57] -, "Self-organization: A derivation from first principles of a class of learning algorithms," Proc. Int. Joint Conf. on Neural Networks, IJCNN 89 (Washington, DC, 1989) pp. II-495-II-498.
[58] J. Makhoul, S. Roucos, and H. Gish, "Vector quantization in speech coding," Proc. IEEE, vol. 73, pp. 1551-1588, 1985.
[59] Ch. v.d. Malsburg, "Self-organization of orientation sensitive cells in the striate cortex," Kybernetik, vol. 14, pp. 85-100, 1973.
[60] Ch. v.d. Malsburg and D. J. Willshaw, "How to label nerve cells so that they can interconnect in an ordered fashion," Proc. Natl. Acad. Sci. USA, vol. 74, pp. 5176-5178, 1977.
[61] R. Mann and S. Haykin, "A parallel implementation of Kohonen feature maps on the Warp systolic computer," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-84-II-87.
[62] K. M. Marks and K. F. Goser, "Analysis of VLSI process data based on self-organizing feature maps," Proc. Neuro-Nimes'88 (Nimes, France, 1988) pp. 337-347.
[63] T. Martinetz, H. J. Ritter, and K. J. Schulten, "Three-dimensional neural net for learning visuomotor coordination of a robot arm," IEEE Trans. Neural Networks, vol. 1, pp. 131-136, 1990.
[64] J. Max, "Quantizing for minimum distortion," IRE Trans. Inform. Theory, vol. IT-6, no. 2, pp. 7-12, Mar. 1960.
[65] R. A. McCarthy and E. K. Warrington, "Evidence for modality specific meaning systems in the brain," Nature, vol. 334, pp. 428-430, 1988.
[66] E. McDermott and S. Katagiri, "Shift-invariant, multi-category phoneme recognition using Kohonen's LVQ2," Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP 89 (Glasgow, Scotland) pp. 81-84.
[67] P. McKenna and E. K. Warrington, "Category-specific naming preservation: A single case study," J. Neurol. Neurosurg. Psychiatry, vol. 41, pp. 571-574, 1978.
[68] R. Miikkulainen and M. G. Dyer, "Forming global representations with extended backpropagation," Proc. IEEE Int. Conf. on Neural Networks, ICNN 88 (San Diego, CA, 1988) pp. 285-292.
[69] P. Morasso, "Neural models of cursive script handwriting," Proc. Int. Joint Conf. on Neural Networks, IJCNN 89 (Washington, DC, 1989) pp. II-539-II-542.
[70] J. T. Murphy, H. C. Kwan, W. A. MacKay, and Y. C. Wong, "Spatial organization of precentral cortex in awake primates, III, Input-output coupling," J. Neurophysiol., vol. 41, pp. 1132-1139, 1977.
[71] N. M. Nasrabadi and Y. Feng, "Vector quantization of images based upon the Kohonen Self-Organizing Feature Maps," Proc. IEEE Int. Conf. on Neural Networks, ICNN-88 (San Diego, CA, 1988) pp. I-101-I-108.
[72] M. M. Nass and L. N. Cooper, "A theory for the development of feature detecting cells in visual cortex," Biol. Cybern., vol. 19, pp. 1-18, 1975.
[73] E. K. Neumann, D. A. Wheeler, J. W. Burnside, A. S. Bernstein, and J. C. Hall, "A technique for the classification and analysis of insect courtship song," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-257-II-262.
[74] Y. Nishikawa, H. Kita, and A. Kawamura, "A neural network which divides and learns environments," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. I-684-I-687.
[75] G. A. Ojemann, "Brain organization for language from the perspective of electrical stimulation mapping," Behav. Brain Sci., vol. 2, pp. 189-230, 1983.
[76] J. Orlando, R. Mann, and S. Haykin, "Radar classification of sea-ice using traditional and neural classifiers," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-263-II-266.
[77] K. J. Overton and M. A. Arbib, "The branch arrow model of the formation of retinotectal connections," Biol. Cybern., vol. 45, pp. 157-175, 1982.
[78] K. J. Pearson, L. H. Finkel, and G. M. Edelman, "Plasticity in the organization of adult cerebral maps: a computer simulation based on neuronal group selection," J. Neurosci., vol. 12, pp. 4209-4223, 1987.
[79] R. Perez, L. Glass, and R. J. Shlaer, "Development of specificity in cat visual cortex," J. Math. Biol., vol. 1, pp. 275-288, 1975.
[80] S. E. Petersen, P. T. Fox, M. I. Posner, M. Mintun, and M. E. Raichle, "Positron emission tomographic studies of the cortical anatomy of single-word processing," Nature, vol. 331, pp. 585-589, 1988.
[81] M. Porat and Y. Y. Zeevi, "The generalized Gabor scheme of image representation in biological and machine vision," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-10, pp. 452-468, 1988.
[82] H. Ritter, "Combining self-organizing maps," Proc. Int. Joint Conf. on Neural Networks, IJCNN 89 (Washington, DC, 1989) pp. II-499-II-502.
[83] -, "Asymptotic level density for a class of vector quantization processes," Helsinki University of Technology, Lab. of Computer and Information Science, Report A9, 1989.
[84] H. Ritter and T. Kohonen, "Self-organizing semantic maps," Biol. Cybern., vol. 61, pp. 241-254, 1989.
[85] H. J. Ritter, T. M. Martinetz, and K. J. Schulten, "Topology conserving maps for learning visuo-motor-coordination," Neural Networks, vol. 2, pp. 159-168, 1989.
[86] H. Ritter and K. Schulten, "On the stationary state of Kohonen's self-organizing sensory mapping," Biol. Cybern., vol. 54, pp. 99-106, 1986.
[87] -, "Topology conserving mappings for learning motor tasks," Proc. Neural Networks for Computing, AIP Conference (Snowbird, Utah, 1986) pp. 376-380.
[88] -, "Extending Kohonen's self-organizing mapping algorithm to learn ballistic movements," NATO ASI Series, vol. F41, pp. 393-406, 1988.
[89] -, "Kohonen's self-organizing maps: exploring their computational capabilities," Proc. IEEE Int. Conf. on Neural Networks, ICNN 88 (San Diego, CA, 1988) pp. I-109-I-116.
[90] -, "Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability and dimension selection," Biol. Cybern., vol. 60, pp. 59-71, 1989.
[91] R. A. Reale and T. J. Imig, "Tonotopic organization in auditory cortex of the cat," J. Comp. Neurol., vol. 192, pp. 265-291, 1980.
[92] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, pp. 400-407, 1951.
[93] E. T. Rolls, "Neurons in the cortex of the temporal lobe and in the amygdala of the monkey with responses selective for faces," Hum. Neurobiol., vol. 3, pp. 209-222, 1984.
[94] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP research group, Eds. Cambridge, Mass.: MIT Press, 1986, pp. 318-362.
[95] J. K. Samarabandu and O. E. Jakubowicz, "Principles of sequential feature maps in multi-level problems," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-683-II-686.
[96] P. G. Schyns, "Expertise acquisition through concepts refinement in a self-organizing architecture," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. I-236-I-239.
[97] D. L. Sparks and J. S. Nelson, "Sensory and motor maps in the mammalian superior colliculus," TINS, vol. 10, pp. 312-317, 1987.
[98] N. Suga and W. E. O'Neill, "Neural axis representing target range in the auditory cortex of the mustache bat," Science, vol. 206, pp. 351-353, 1979.
[99] N. V. Swindale, "A model for the formation of ocular dominance stripes," Proc. R. Soc., vol. B 208, pp. 243-264, 1980.
[100] A. Takeuchi and S. Amari, "Formation of topographic maps and columnar microstructures," Biol. Cybern., vol. 35, pp. 63-72, 1979.
[101] T. Tanaka, M. Naka, and K. Yoshida, "Improved back-propagation combined with LVQ," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. I-731-I-734.
[102] V. Tryba, K. M. Marks, U. Ruckert, and K. Goser, "Selbstorganisierende Karten als lernende klassifizierende Speicher," ITG Fachbericht, vol. 102, pp. 407-419, 1988.
[103] A. R. Tunturi, "Physiological determination of the arrangement of the afferent connections to the middle ectosylvian auditory area in the dog," Am. J. Physiol., vol. 162, pp. 489-502, 1950.
[104] -, "The auditory cortex of the dog," Am. J. Physiol., vol. 168, pp. 712-717, 1952.
[105] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-37, pp. 328-339, 1989.
[106] E. K. Warrington, "The selective impairment of semantic memory," Q. J. Exp. Psychol., vol. 27, pp. 635-657, 1975.
[107] E. K. Warrington and R. A. McCarthy, "Category specific access dysphasia," Brain, vol. 106, pp. 859-878, 1983.
[108] -, "Categories of knowledge," Brain, vol. 110, pp. 1273-1296, 1987.
[109] E. K. Warrington and T. Shallice, "Category-specific impairments," Brain, vol. 107, pp. 829-854, 1984.
[110] D. J. Willshaw and Ch. v.d. Malsburg, "How patterned neural connections can be set up by self-organization," Proc. R. Soc. London, vol. B 194, pp. 431-445, 1976.
[111] -, "A marker induction mechanism for the establishment of ordered neural mappings: its application to the retinotectal problem," Proc. R. Soc. London, vol. B 287, pp. 203-243, 1979.
[112] L. Xu and E. Oja, "Vector pair correspondence by a simplified counter-propagation model: a twin topographic map," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. II-531-II-534.
[113] -, "Adding top-down expectation into the learning procedure of self-organizing maps," Proc. Int. Joint Conf. on Neural Networks, IJCNN-90-WASH-DC (Washington, DC, 1990) pp. I-735-I-738.
[114] A. Yamadori and M. L. Albert, "Word category aphasia," Cortex, vol. 9, pp. 112-125, 1973.
[115] P. L. Zador, "Asymptotic quantization error of continuous signals and the quantization dimension," IEEE Trans. Inform. Theory, vol. IT-28, pp. 139-149, March 1982.
[116] S. Zeki, "The representation of colours in the cerebral cortex," Nature, vol. 284, pp. 412-418, 1980.

Teuvo Kohonen (Senior Member, IEEE) received the D.Eng. degree in physics from Helsinki University of Technology, Espoo, Finland, in 1962.

He is a professor on the Faculty of Information Sciences of Helsinki University of Technology, and is also a research professor of the Academy of Finland. His scientific interests are neural computers and pattern recognition. He is the author of four textbooks, of which Content-Addressable Memories (Springer, 1987) and Self-Organization and Associative Memory (Springer, 1989) are best known.

Dr. Kohonen is a member of the Finnish Academy of Sciences, and of the Finnish Academy of Engineering Sciences. He has served as the first vice chairman of the International Association for Pattern Recognition, and as a Governing Board Member of the International Neural Network Society. He was awarded the Eemil Aaltonen Prize in 1983 and the Cultural Prize of Finnish Commercial Television in 1984.
