ACTA POLYTECHNICA SCANDINAVICA
MATHEMATICS, COMPUTING AND MANAGEMENT IN ENGINEERING SERIES No. 82
SAMUEL KASKI
Helsinki University of Technology
Neural Networks Research Centre
Rakentajanaukio 2C
FIN-02150 Espoo, Finland
Thesis for the degree of Doctor of Technology to be presented with due permission for public examination and criticism in Auditorium F1 of the Helsinki University of Technology on the 21st of March, 1997, at 12 o'clock noon.
Helsinki University of Technology
Department of Computer Science and Engineering
Laboratory of Computer and Information Science
ESPOO 1997
ABSTRACT
Finding structures in vast multidimensional data sets, be they measurement data,
statistics, or textual documents, is difficult and time-consuming. Interesting,
novel relations between the data items may be hidden in the data. The self-
organizing map (SOM) algorithm of Kohonen can be used to aid the exploration:
the structures in the data sets can be illustrated on special map displays.
In this work, the methodology of using SOMs for exploratory data analysis or
data mining is reviewed and developed further. The properties of the maps are
compared with the properties of related methods intended for visualizing high-
dimensional multivariate data sets. In a set of case studies the SOM algorithm
is applied to analyzing electroencephalograms, to illustrating structures of the
standard of living in the world, and to organizing full-text document collections.
Measures are proposed for evaluating the quality of different types of maps in representing a given data set, and for measuring the robustness of the illustrations the maps produce. The same measures may also be used for comparing the knowledge that different maps represent.
Feature extraction must in general be tailored to the application, as is done
in the case studies. There exists, however, an algorithm called the adaptive-
subspace self-organizing map, recently developed by Kohonen, which may be of
help. It extracts invariant features automatically from a data set. The algorithm
is here characterized in terms of an objective function, and demonstrated to be
able to identify input patterns subject to different transformations. Moreover, it
could also aid in feature exploration: the kernels that the algorithm creates to
achieve invariance can be illustrated on map displays similar to those that are
used for illustrating the data sets.
© All rights reserved. No part of the publication may be reproduced, stored in a retrieval
system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the author.
PREFACE
This work has been carried out in the Neural Networks Research Centre, Helsinki
University of Technology, during 1993-1996. While completing the thesis I have
participated in several projects of the centre, under the supervision of Academy
Professor Teuvo Kohonen. I wish to thank him for guidance and for providing the
facilities and financial support necessary for the studies. Most important, however, have been the numerous discussions that have had a fundamental effect on the work, not to mention the contributions of Professor Kohonen in the projects
themselves and in co-authoring the majority of the presented publications.
I also wish to thank the other co-workers in the projects that have been in-
cluded in the thesis, Timo Honkela, Harri Lappalainen, Sirkka-Liisa Joutsiniemi,
and Krista Lagus, and also the rest of the personnel of the Neural Networks Re-
search Centre and the Laboratory of Computer and Information Science. Special
thanks for comments on the manuscript are due to Timo Honkela, Krista Lagus,
Janne Sinkkonen, and Professors Teuvo Kohonen and Erkki Oja.
Numerous other people with whom I have worked and lived in other projects,
some of which are hopefully lifelong, deserve many thanks as well. Those ac-
knowledgments relate to some completely different stories, however.
The financial support of the Academy of Finland and the Jenny and Antti
Wihuri Foundation is gratefully acknowledged.
CONTENTS

Preface
List of publications
The author's contribution
List of symbols and abbreviations
1 Introduction
2 Methods for exploratory data analysis
2.1 Visualization of high-dimensional data items
2.2 Clustering methods
2.3 Projection methods
2.3.1 Linear projection methods
2.3.2 Nonlinear projection methods
2.4 Self-organizing maps
2.4.1 The self-organizing map algorithm
2.4.2 Properties useful in exploring data
2.4.3 Mathematical characterizations
2.4.4 Some variants
2.4.5 Notes on statistical accuracy
2.5 Relations and differences between SOM and MDS
3 Stages of SOM-based exploratory data analysis
3.1 Preprocessing
3.2 Computation of the maps
3.3 Choosing good maps
3.4 Interpretation, evaluation, and use of the maps
4 Case studies
4.1 Multichannel EEG signal
4.2 Statistical tables
4.3 Full-text document collections
4.3.1 Recent developments
5 Further developments
5.1 Feature exploration with the adaptive-subspace SOM
5.2 Comparison of knowledge areas
6 Conclusion
References
LIST OF PUBLICATIONS
The thesis consists of an introduction and the following publications:
1. Joutsiniemi, S.-L., Kaski, S., and Larsen, T. A. (1995) Self-organizing map in
recognition of topographic patterns of EEG spectra. IEEE Transactions on
Biomedical Engineering, 42:1062–1068.
2. Kaski, S. and Kohonen, T. (1996) Exploratory data analysis by the Self-
organizing map: Structures of welfare and poverty in the world. In Refenes,
A.-P. N., Abu-Mostafa, Y., Moody, J., and Weigend, A., editors, Neural Net-
works in Financial Engineering. Proceedings of the Third International Con-
ference on Neural Networks in the Capital Markets, pages 498–507. World Scientific, Singapore.
3. Kaski, S., Honkela, T., Lagus, K., and Kohonen, T. (1996) Creating an order
in digital libraries with self-organizing maps. In Proceedings of WCNN'96,
World Congress on Neural Networks, pages 814–817. Lawrence Erlbaum and
INNS Press, Mahwah, NJ.
4. Kohonen, T., Kaski, S., Lagus, K., and Honkela, T. (1996) Very large two-level
SOM for the browsing of newsgroups. In von der Malsburg, C., von Seelen, W.,
Vorbrüggen, J. C., and Sendhoff, B., editors, Proceedings of ICANN'96, International Conference on Artificial Neural Networks, Lecture Notes in Computer Science, vol. 1112, pages 269–274. Springer, Berlin.
5. Lagus, K., Honkela, T., Kaski, S., and Kohonen, T. (1996) Self-organizing
maps of document collections: a new approach to interactive exploration.
In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second
International Conference on Knowledge Discovery & Data Mining, pages 238–243. AAAI Press, Menlo Park, CA.
6. Kaski, S. (1997) Computationally efficient approximation of a probabilis-
tic model for document representation in the WEBSOM full-text analysis
method. Accepted to Neural Processing Letters.
7. Kaski, S. and Lagus, K. (1996) Comparing self-organizing maps. In von der
Malsburg, C., von Seelen, W., Vorbrüggen, J. C., and Sendhoff, B., editors, Proceedings of ICANN'96, International Conference on Artificial Neural Networks, Lecture Notes in Computer Science, vol. 1112, pages 809–814.
Springer, Berlin.
8. Kohonen, T., Kaski, S., and Lappalainen, H. Self-organized formation of various invariant-feature filters in the Adaptive-Subspace SOM. Accepted to Neu-
ral Computation.
LIST OF SYMBOLS AND ABBREVIATIONS

$c = c(\mathbf{x})$ : index of the centroid (or model vector) that is closest to $\mathbf{x}$
$K$ : number of cluster centroids (and reference vectors)
$\mathbf{x}'_k$ : projection of $\mathbf{x}_k$
$d(k, l)$ : distance between $\mathbf{x}_k$ and $\mathbf{x}_l$
$d'(k, l)$ : distance between $\mathbf{x}'_k$ and $\mathbf{x}'_l$
$N_i$ : number of data items in cluster $V_i$
$\mathbf{n}_i = \frac{1}{N_i} \sum_{\mathbf{x}_k \in V_i} \mathbf{x}_k$ : centroid of $V_i$
1 INTRODUCTION
It is relatively easy to give answers to well-specified questions about the statistical nature of a well-understood data set, such as: how large should an aeroplane cockpit be to accommodate 95% of all potential pilots? In this example the data may be assumed to be normally distributed, and it is straightforward to estimate the threshold. The more data is available, the more accurate an answer can be given.
If, on the other hand, the data is not well understood and the problem is not well specified, an increase in the amount of data may even have the opposite effect. This holds for multivariate data in particular. If the goal is simply to make sense of a data set, to generate sensible hypotheses, or to find some interesting novel patterns, it paradoxically seems that the more data is available, the more difficult it is to understand the data set. The structures are hidden among the large amounts of multivariate data. When exploring a data set for new insights, only methods that discover and effectively illustrate the structures in the data can be of help. Such methods, applied to large data sets, are the topic of this work.
A data-driven search for statistical insights and models is traditionally called
exploratory data analysis (Hoaglin, 1982; Jain and Dubes, 1988; Tukey, 1977;
Velleman and Hoaglin, 1981) in statistical literature. The process of making sta-
tistical inferences often consists of an exploratory, data-driven phase, followed by
a confirmatory phase in which the reproducibility of the results is investigated.
There thus exists a wealth of applications in which data sets need to be summa-
rized to gain insight into them; the goal in this work is to present a data set in a
form that is easily understandable but that at the same time preserves as much
of the essential information in the data as possible.
Exploratory data analysis methods can be used as tools in knowledge discovery in databases (KDD) (Fayyad, 1996; Fayyad et al., 1996a; Fayyad et al., 1996c; Simoudis, 1996). In this relatively recently established field the emphasis is on the whole interactive process of knowledge discovery, that is, the discovery of novel patterns or structures in the data. The process consists of a multitude of steps, starting from setting up the goals and ending with evaluating the results, and possibly reformulating the goals based on the results. Data mining¹ is one step in the discovery process, a step in which suitable tools from many other disciplines, including exploratory data analysis, are used to find interesting patterns in the data. Depending on the goals of the data mining process, essentially any kinds of pattern recognition (Devijver and Kittler, 1982; Fu, 1974; Fukunaga, 1972; Therrien, 1989; Schalkoff, 1992), machine learning (Forsyth, 1989; Langley, 1996; Michalski, 1983), and multivariate analysis (Cooley and Lohnes, 1971; Hair, Jr. et al., 1984; Kendall, 1975) algorithms may be useful; for recent examples, cf. Fayyad et al. (1996b). An essential novelty in the field then lies in emphasizing the discovery of previously unknown structures from vast databases, and in emphasizing the importance of considering the whole process.

¹Not all authors use the same terminology; data mining can also be used to refer to the whole process, even as a synonym of KDD.
In this work, exploratory data analysis methods that illustrate the structures in data sets are applied to large databases. The tool in this endeavor will be the self-organizing map (SOM) (Kohonen, 1982; Kohonen, 1995c). Some properties that distinguish the SOM from the other data mining tools are that it is numerical instead of symbolic, nonparametric, and capable of learning without supervision. The numerical nature of the method enables it to treat numerical statistical data naturally, and to represent graded relationships. Because the method does not require supervision and is nonparametric, used here in the sense that no assumptions about the distribution of the data need to be made, it may even find quite unexpected structures in the data.
In this thesis the relation of the SOM to some other data visualization and clustering methods is first analyzed in Section 2. Then, recipes on how to use the SOM in exploratory data analysis are given in Section 3. The areas of application that are treated in the Publications are introduced in Section 4, and finally two recent developments in the methodology are discussed in Section 5.
Figure 1: A ten-dimensional data item visualized using four different methods: (a) a profile of the component values, (b) a star in which the length of each ray emanating from the center illustrates one component, (c) Andrews' curve, and (d) a facial caricature.
Such visualization methods can also be used to illustrate various kinds of summaries of the data set, like the cluster centroids that are introduced below, or the reference vectors of a self-organizing map.
In hierarchical clustering, clusters are formed stepwise: either small clusters are merged together (agglomerative methods), or a large cluster is split (divisive methods). The end result of the algorithm is a tree of clusters called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.
Partitional clustering, on the other hand, attempts to directly decompose the
data set into a set of disjoint clusters. The criterion function that the clustering
algorithm tries to minimize may emphasize the local structure of the data, as
by assigning clusters to peaks in the probability density function, or the global
structure. Typically the global criteria involve minimizing some measure of dis-
similarity in the samples within each cluster, while maximizing the dissimilarity
of different clusters.
A commonly used partitional clustering method, K-means clustering
(MacQueen, 1967), will be discussed in some detail since it is closely related to
the SOM algorithm. In K-means clustering the criterion function is the average
squared distance of the data items $\mathbf{x}_k$ from their nearest cluster centroids,

$$E_K = \sum_k \|\mathbf{x}_k - \mathbf{m}_{c(\mathbf{x}_k)}\|^2, \qquad (1)$$

where $c(\mathbf{x}_k)$ is the index of the centroid that is closest to $\mathbf{x}_k$. One possible algorithm for minimizing the cost function begins by initializing a set of $K$ cluster centroids and then adjusts them iteratively: each data item is assigned to its nearest centroid, and each centroid is moved to the mean of the items assigned to it. The initialization of the cluster centroids may also be crucial; some clusters may even be left empty if their centroids lie initially far from the distribution of the data.
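As a concrete illustration of Equation 1, the following minimal Python sketch alternates the two standard steps; the random initialization on data items and the fixed iteration count are illustrative assumptions, not recommendations from the text.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Minimize E_K = sum_k ||x_k - m_{c(x_k)}||^2 (Eq. 1); a sketch."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)].copy()  # centroids
    for _ in range(n_iter):
        # Assignment step: c(x_k) = index of the closest centroid.
        c = np.linalg.norm(X[:, None] - m[None, :], axis=2).argmin(axis=1)
        # Update step: move each centroid to the mean of its items.
        for i in range(K):
            if np.any(c == i):  # clusters far from the data may stay empty
                m[i] = X[c == i].mean(axis=0)
    return m, c
```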
Clustering can be used to reduce the amount of data and to induce a catego-
rization. In exploratory data analysis, however, the categories have only limited
value as such. The clusters should be illustrated somehow to aid in understanding
what they are like. For example in the case of the K-means algorithm the cen-
troids that represent the clusters are still high-dimensional, and some additional
illustration methods are needed for visualizing them.
Figure 2: A dataset projected linearly onto the two-dimensional subspace obtained with PCA.
Each 39-dimensional data item describes different aspects of the welfare and poverty of one country. The data set, consisting of 77 countries and used also in Publication 2, was taken from
the World Development Report published by the World Bank (1992). Missing data values were
neglected when computing the principal components, and zeroed when forming the projections.
A key to the abbreviated country names is given in the Appendix.
After an interesting projection has been found, the search can be restarted from the beginning to reveal more of the structure of the data set.
2.3.2 Nonlinear projection methods
PCA cannot take into account nonlinear structures, structures consisting of ar-
bitrarily shaped clusters or curved manifolds, since it describes the data in terms
of a linear subspace. Projection pursuit tries to express some nonlinearities, but
if the data set is high-dimensional and highly nonlinear it may be difficult to
visualize it with linear projections onto a low-dimensional display even if the
projection angle is chosen carefully.
Several approaches have been proposed for reproducing nonlinear higher-
dimensional structures on a lower-dimensional display. The most common meth-
ods allocate a representation for each data point in the lower-dimensional space
and try to optimize these representations so that the distances between them
would be as similar as possible to the original distances of the corresponding
data items. The methods differ in how the different distances are weighted and
how the representations are optimized.
Multidimensional scaling (MDS) refers to a group of methods that is widely
used especially in behavioral, econometric, and social sciences to analyze subjec-
tive evaluations of pairwise similarities of entities, such as commercial products in
a market survey. The starting point of MDS is a matrix consisting of the pairwise
dissimilarities of the entities. In this thesis only distances between pattern vectors
in a Euclidean space will be considered, but in MDS the dissimilarities need not
be distances in the mathematically strict sense. In fact, MDS is perhaps most
often used for creating a space where the entities can be represented as vectors,
based on some evaluation of the dissimilarities of the entities.
The goal in this thesis is not merely to create a space which would represent
the relations of the data faithfully, but also to reduce the dimensionality of the
data set to a suciently small value to allow visual inspection of the set. The
MDS methods can be used to fulfill this goal as well.
There exists a multitude of variants of MDS with slightly dierent cost func-
tions and optimization algorithms. The first MDS for metric data was developed
in the 1930s (historical treatments and introductions to MDS have been provided
by, for example, Kruskal and Wish, 1978; de Leeuw and Heiser, 1982; Wish and
Carroll, 1982; Young, 1985), and later generalized for analyzing nonmetric data
and even the common structure in several dissimilarity matrices corresponding
to, for instance, evaluations made by dierent individuals.
The algorithms designed for analyzing a single dissimilarity matrix, and which
can thus be used for reducing the dimensionality of a data set, can be broadly
divided into two basic types, metric and nonmetric MDS.
In the original metric MDS (Torgerson, 1952; cf. Young and Householder,
1938) the distances between the data items have been given, and a configuration
of points that would give rise to the distances is sought. Often a linear projec-
tion onto a subspace obtained with PCA is used. The key idea of the method,
to approximate the original set of distances with distances corresponding to a
conguration of points in a Euclidean space can, however, also be used for con-
structing a nonlinear projection method. If each item $\mathbf{x}_k$ is represented with a lower-dimensional, say, two-dimensional data vector $\mathbf{x}'_k$, then the goal of the projection is to optimize the representations so that the distances between the items in the two-dimensional space will be as close to the original distances as possible. If the distance between $\mathbf{x}_k$ and $\mathbf{x}_l$ is denoted by $d(k,l)$ and the distance between $\mathbf{x}'_k$ and $\mathbf{x}'_l$ by $d'(k,l)$, the metric MDS tries to minimize the cost function

$$E_M = \sum_{k \neq l} [d(k,l) - d'(k,l)]^2. \qquad (2)$$
A perfect reproduction of the Euclidean distances may not always be the best possible goal, especially if the components of the data vectors are expressed on an ordinal scale. Then only the rank order of the distances between the vectors is meaningful, not the exact values. The projection should try to match the rank order of the distances in the two-dimensional output space to the rank order in the original space. The best possible rank ordering for a given configuration of points can be guaranteed by introducing a monotonically increasing function $f$ that acts on the original distances, and always maps the distances to such values that best preserve the rank order. Nonmetric MDS (Kruskal, 1964; Shepard, 1962) uses such a function (Kruskal and Wish, 1978), whereby the normalized cost function becomes

$$E_N = \frac{\sum_{k \neq l} [f(d(k,l)) - d'(k,l)]^2}{\sum_{k \neq l} d'(k,l)^2}. \qquad (3)$$

Both the configuration of points and the function $f$ are then optimized to minimize Equation 3.
Although nonmetric MDS was motivated by the need to treat ordinal-scale data, it can of course also be used if the inputs are presented as pattern vectors in a Euclidean space. The projection then only tries to preserve the order of the distances between the data vectors, not their absolute values. A
demonstration of nonmetric MDS, applied in a dimension reduction task, is given
in Figure 3.
Figure 3: A nonlinear projection constructed using nonmetric MDS. The data set is the same as in Figure 2. Missing data values were treated by the following simple method, which has been demonstrated to produce good results at least in the pattern recognition context (Dixon, 1979). When computing the distance between a pair of data items, only the (squared) differences between component values that are available are computed. The rest of the differences are then set to the average of the computed differences.
Another nonlinear projection method, Sammon's mapping (Sammon, Jr., 1969), is closely related to the metric MDS version described above. It, too, tries to optimize a cost function that describes how well the pairwise distances in a data set are preserved,

$$E_S = \sum_{k \neq l} \frac{[d(k,l) - d'(k,l)]^2}{d(k,l)}. \qquad (4)$$
The only difference between Sammon's mapping and the (nonlinear) metric MDS (Eq. 2) is that the errors in distance preservation are normalized with the distance in the original space. Because of the normalization, the preservation of small distances will be emphasized.
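A gradient-descent sketch of minimizing Equation 4 in Python follows; the random initialization, the fixed learning rate, and the plain first-order updates are simplifying assumptions (Sammon's original algorithm uses a diagonal second-order rule).

```python
import numpy as np

def sammon(X, n_iter=500, lr=0.05, seed=0, eps=1e-9):
    """Gradient descent on E_S = sum_{k!=l} (d - d')^2 / d (Eq. 4)."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # d(k, l)
    D = np.maximum(D, eps)                                # guard duplicates
    Y = rng.normal(scale=1e-2, size=(len(X), 2))          # projections x'_k
    for _ in range(n_iter):
        Dy = np.linalg.norm(Y[:, None] - Y[None, :], axis=2)  # d'(k, l)
        np.fill_diagonal(Dy, 1.0)                         # avoid 0 division
        W = (D - Dy) / (D * Dy)
        np.fill_diagonal(W, 0.0)
        # Gradient of the cost with respect to each projected point y_k.
        grad = -4.0 * (W[:, :, None] * (Y[:, None] - Y[None, :])).sum(axis=1)
        Y -= lr * grad                                    # move downhill
    return Y
```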
A demonstration of Sammon's mapping is presented in Figure 4.
Figure 4: Sammon's mapping of the data set which has been projected using PCA in Figure 2
and nonmetric MDS in Figure 3. Missing data values were treated in the same manner as in
forming the nonmetric MDS.
Principal curves. Principal curves (Hastie and Stuetzle, 1989) are smooth curves that are defined by the property that each point of the curve is the average of
all data points that project to it, i.e., for which that point is the closest point
on the curve. Intuitively speaking, the curves pass through the center of the
data set. Principal curves are generalizations of principal components extracted
using PCA in the sense that a linear principal curve is a principal component;
the connections between the two methods are delineated more carefully in the
original article. Although the extracted structures are called principal curves, the generalization to surfaces seems relatively straightforward; the resulting algorithms will, however, become computationally more intensive.
The conception of continuous principal curves may aid in understanding how principal components could be sensibly generalized. To be useful in practical computations, however, the curves must be discretized. It has turned out (Mulier and Cherkassky, 1995; Ritter et al., 1992) that discretized principal curves are essentially³ equivalent to SOMs, which were introduced before the principal curves of Hastie and Stuetzle (1989). It thus seems that the conception of principal curves is most useful in providing one possible viewpoint on the properties of the SOM algorithm.
Other methods. A problem with the nonlinear MDS methods is that they are
computationally very intensive for large data sets. The computational complexity
can be reduced, however, by restricting attention to a subset of the distances
between the data items. When placing a point on a plane its distance from
two other points of the plane can be set exactly. This property is used in the
triangulation method (Lee et al., 1977). Points are mapped sequentially onto the
plane, and the distance of the new item to the two nearest items already mapped
is preserved. Alternatively, the distance to the nearest item and a reference point
that is common to all items may be preserved. The points are mapped in such
an order that all of the nearest-neighbor distances in the original space will be
preserved. The triangulation can be computed quickly, compared to the MDS methods, but since it only tries to preserve a small fraction of the distances, the projection may be difficult to interpret for large data sets. The method may, however, be useful in connection with Sammon's mapping (Biswas et al., 1981).
The dimensionality of data sets can also be reduced with the aid of autoas-
sociative neural networks that represent their inputs using a smaller number of
variables than there are dimensions in the input data. Such networks try to re-
construct their inputs as faithfully as possible, and the representation of the data
items constructed into the network can be used as the reduced-dimensional ex-
pression of the data. Some linear and nonlinear associative memories have been
introduced by Kohonen (1984). The representations formed into the hidden layer
³The algorithm Hastie and Stuetzle (1989) proposed for finding discretized principal curves resembles the batch version (Kohonen, 1995c) of the SOM algorithm, although the details are different.
of a multilayer perceptron have also been used for the dimension reduction task
(DeMers and Cottrell, 1993; Garrido et al., 1995). A special version of the multi-
layer perceptrons, a replicator neural network (Hecht-Nielsen, 1995) has even been
shown capable of representing its inputs in terms of their natural coordinates.
This occurs for a somewhat idealized model when the inherent dimensionality $q$ of the data is known. Assessing the practical value of these networks would, however, require a separate study that would compare both the quality of the results and the computational requirements for a network having a practical size.
ing of the multilayer perceptrons with the backpropagation algorithm (cf., e.g.,
Rumelhart et al., 1986) is known to be very slow (Haykin, 1994), but it is possible
that some alternative learning algorithms would be more feasible.
The neighborhood kernel $h_{ij}$ is a decreasing function of the distance of the units from the winning unit on the map lattice. If the locations of units $i$ and $j$ on the map grid are denoted by the two-dimensional vectors $\mathbf{r}_i$ and $\mathbf{r}_j$, respectively, the kernel can be written as $h_{ij} = h(\|\mathbf{r}_i - \mathbf{r}_j\|)$. The winning unit $c$ is the unit whose reference vector is closest to the input $\mathbf{x}$,

$$c = \arg\min_i \|\mathbf{x} - \mathbf{m}_i\|. \qquad (5)$$

During the learning process at time $t$ the reference vectors are changed iteratively according to the following adaptation rule, where $\mathbf{x}(t)$ is the input at time $t$:

$$\mathbf{m}_i(t+1) = \mathbf{m}_i(t) + h_{ci}(t)\,[\mathbf{x}(t) - \mathbf{m}_i(t)]. \qquad (6)$$
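To fix ideas, here is a minimal online SOM in Python combining the winner search (Eq. 5) with the adaptation rule (Eq. 6); the Gaussian kernel and the linearly decreasing learning-rate and width schedules are common illustrative choices, not the specific schedules used in the Publications.

```python
import numpy as np

def train_som(X, rows=13, cols=9, n_iter=20000, seed=0):
    """Online SOM sketch: winner search (Eq. 5) and adaptation (Eq. 6)."""
    rng = np.random.default_rng(seed)
    r = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    m = X[rng.choice(len(X), rows * cols)].astype(float)  # init on data items
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        c = np.argmin(((x - m) ** 2).sum(axis=1))         # winner, Eq. 5
        frac = 1.0 - t / n_iter
        alpha = 0.5 * frac                                # learning rate
        sigma = 1.0 + (max(rows, cols) / 2.0) * frac      # kernel width
        h = alpha * np.exp(-((r - r[c]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        m += h[:, None] * (x - m)                         # adaptation, Eq. 6
    return m.reshape(rows, cols, -1), r
```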
Ordered display. The ordered nature of the regression justies the use of the
map as a display for data items. When the items are mapped to those units on the
map that have the closest reference vectors, nearby units will have similar data
items mapped onto them. Such an ordered display of the data items facilitates
understanding of the structures in the data set. Kohonen (1981) was the rst to
propose using such displays to illustrate a data set.
The same display can be used for displaying several other kinds of information.
One clear advantage of always using the same display is that as the analysts grow
more familiar with the map, they can interpret new information displayed on it
faster and more easily.
For example, the map display can be used as an ordered groundwork on which
the original data variables, components of the data vectors, can be displayed in
their natural order. Such displays have been demonstrated in Publication 2. The
variables become smoothed locally on the display, which helps in gaining insight into the distributions of their values in the data set. Such displays are much more illustrative than, for instance, raw linearly organized statistical tables. It might also be useful to display the residuals, i.e., the average differences of the variables from their smoothed values.
Visualization of clusters. The same ordered display can be used for illustrating the clustering density in different regions of the data space. The density of the reference vectors of an organized map will reflect the density of the input samples (Kohonen, 1995c; Ritter, 1991). In clustered areas the reference vectors will be close to each other, and in the empty space between the clusters they will be more sparse. Thus, the cluster structure in the data set can be made visible by displaying the distances between the reference vectors of neighboring units (Kraaijveld et al., 1992; Kraaijveld et al., 1995; Ultsch, 1993b; Ultsch and Siemon, 1990).
The cluster display may be constructed as follows (Iivarinen et al., 1994). The
distance between each pair of reference vectors is computed and scaled so that
the distances fit between a given minimum and maximum value, after optionally
removing outliers. On the map display each scaled distance value determines
the gray level or color of the point that is in the middle of the corresponding
map units. The gray level values in the points corresponding to the map units
themselves are set to the average of some of the nearest distance values (on a
hexagonal grid, e.g., to the average of three of the six distances toward the lower-
right corner). After these values have been set up, they can be visualized as such
on the display, or smoothed spatially.
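A simplified sketch of such a display follows; a rectangular four-neighbor grid is assumed for brevity (the thesis uses a hexagonal grid), and outlier removal and spatial smoothing are omitted. The input m can be the (rows, cols, dim) array returned by the training sketch above.

```python
import numpy as np

def distance_display(m):
    """Gray-level matrix: each unit's value is the mean distance from its
    reference vector to those of its grid neighbors, scaled into [0, 1]."""
    rows, cols, _ = m.shape
    u = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            d = [np.linalg.norm(m[i, j] - m[a, b])
                 for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                 if 0 <= a < rows and 0 <= b < cols]
            u[i, j] = np.mean(d)                 # average neighbor distance
    return (u - u.min()) / (u.max() - u.min() + 1e-12)
```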
The resulting cluster diagram is very general in the sense that nothing needs
to be assumed about the shapes of the clusters. Most of the clustering algorithms
prefer clusters of certain shapes (Jain and Dubes, 1988).
A demonstration of a display constructed using the SOM is presented in Figure 5.
Figure 5: A map display constructed using the SOM algorithm. The overall order of the
countries seems to correspond fairly closely to the Sammon's mapping of the same data set
(Fig. 4). The most prominent clustering structures are also visible in both displays. Details on
how the map was constructed are presented in Publication 2. The size of the map was 13 by 9
units.
For a finite data set the SOM algorithm can be characterized in terms of a (local) cost function,

$$E = \sum_k \sum_i h_{c(\mathbf{x}_k)\,i}\, \|\mathbf{x}_k - \mathbf{m}_i\|^2, \qquad (7)$$

where the index $c$ depends on $\mathbf{x}_k$ and the reference vectors $\mathbf{m}_i$ (cf. Eq. 5). This dependence implies that the index may change when a gradient descent step is taken. Locally, if the index $c = c(\mathbf{x}_k)$ does not change for any $\mathbf{x}_k$, the gradient step is valid, however.
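Equation 7 can be evaluated directly for a finite data set; the sketch below uses a Gaussian neighborhood kernel of fixed width, an illustrative choice. Here m is the flat (K, dim) array of reference vectors and r the (K, 2) grid coordinates (a reshaped result of the training sketch works: m.reshape(-1, m.shape[-1])).

```python
import numpy as np

def som_cost(X, m, r, sigma):
    """E = sum_k sum_i h_{c(x_k) i} ||x_k - m_i||^2 (Eq. 7); a sketch."""
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # ||x_k - m_i||^2
    c = d2.argmin(axis=1)                                    # winners (Eq. 5)
    g2 = ((r[c][:, None, :] - r[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-g2 / (2.0 * sigma ** 2))                     # h_{c(x_k) i}
    return float((h * d2).sum())
```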
Whereas in K-means clustering the number $K$ of centroids should be chosen according to the number of clusters there are in the data, in the SOM the number of reference vectors can be chosen to be much larger, irrespective of the number of clusters. The cluster structures will then become visible on the special displays that were discussed in Section 2.4.2.
Each point on a principal curve is the average of all the points that project
to it. The curve is thus formed of conditional expectations of the data. For
discrete distributions it is not sensible to define such curves, but it is possible to construct practical algorithms if the data is smeared spatially. In the SOM a similar smearing is performed by the neighborhood kernel in the adaptation process (cf. Eq. 6). It is well known in the SOM literature that each reference vector represents local conditional expectations of the data items; the batch map algorithm (Kohonen, 1995c) is essentially a manifestation of this idea. The principal curves or manifolds are thus essentially continuous counterparts of the SOM.
The principal curves also have another characterization which, based on the
previous discussion, may be used as one source in providing an intuitive under-
standing of the SOM algorithm. The goodness of a curve in representing a data
distribution can be measured, for example, by the average (squared) distance of
the data points from the curve, just as the goodness of the K-means algorithm was measured by the average (squared) distance of the data points from the nearest
cluster. The principal curves are the critical points of this measure, i.e., they are
extremal with respect to small, smooth variations (Hastie and Stuetzle, 1989).
This implies that any smooth curve which corresponds to a minimum of the
average distance from the data items is a principal curve.
In the decomposition it has been assumed that $\sum_j h_{ij} = 1$ for all $i$, which holds exactly for toroidal maps when the kernel $h$ has the same shape for all $i$, and also away from the borders on a non-toroidal map. The first term of the decomposition measures how far each reference vector $\mathbf{m}_i$ lies from $\mathbf{n}_i$, the centroid of the cluster defined by $\mathbf{m}_i$. At a fixed point of the SOM algorithm

$$\mathbf{m}_i = \frac{\sum_k h_{c(\mathbf{x}_k)\,i}\, \mathbf{x}_k}{\sum_k h_{c(\mathbf{x}_k)\,i}},$$

which is the closer to $\mathbf{n}_i$ the more the neighborhood kernel $h_{ci}$ is centered around $i$. To minimize the second term, the units that are close to each other on the map should have similar reference vectors, since there the value of the neighborhood kernel is large. Units that lie farther away on the map may, on the other hand, have quite dissimilar reference vectors, since the neighborhood kernel is small and the distance thus does not make a large contribution to the error.
Each of the computational speedups could potentially be useful, but the comparison of the methods would require a careful study. Most of the truly effective speedups are suboptimal in the sense that they only approximate the original SOM, and a careful study is therefore needed to guarantee that the results are satisfactory for a given application.
Bishop et al. (1996a, 1996b) have recently introduced a latent-variable density
model called Generative Topographic Mapping (GTM), which is closely related
to the SOM and to principal curves. In latent variable models it is assumed that
the data set can be explained using a small set of latent variables. In principle the
latent variable space could be continuous, but in GTM a discrete grid like the grid
of the SOM is used for computational reasons. A probability distribution on the
latent grid is postulated to generate a probability distribution in the input space
through a parameterized model. Given an input data set, the goal of the GTM is then to fit the model, in the maximum likelihood sense, to the data set. This is done using the expectation-maximization (EM) algorithm (Dempster et al., 1977), by iteratively estimating first the probability distribution on the latent grid and then maximizing the likelihood of the input data, given the distribution. The GTM is claimed to have several advantageous properties when compared with the SOM, and no significant disadvantages (Bishop et al., 1996b). It may be useful to study these properties more closely to get a deeper understanding of the relations and relative merits of the methods.
The GTM has an objective function, which may facilitate theoretical analy-
ses of the method. The SOM does not have an objective function if the input
data distribution is continuous (Erwin et al., 1992). In practical applications the
data sets are always nite, however, and the SOM does have a (local) objective
function, Equation 7.
The methods differ in that the GTM does not have a neighborhood function of the kind that governs the smoothness of the mapping in the SOM algorithm. In GTM the smoothness is governed by the widths of certain basis functions that are used in generating the probability distribution in the input space. The only essential difference regarding the neighborhood then seems to be that in the SOM the width of the neighborhood kernel decreases in time according to certain well-established schedules (cf., e.g., Kohonen, 1995c; Kohonen et al., 1996a), whereas in GTM the basis functions remain fixed.
One possible way of defining topology preservation is to require that the mapping from the lower-dimensional output space to the input space is smooth and continuous. According to this definition at least a continuous version of GTM can be interpreted to be topology preserving. Since even smooth, continuous mappings can be very twisted, however, a measure of the regularity of the mapping is also needed. Most of the attempts to measure the topology preservation of SOMs have in fact aimed at measuring the goodness (or regularity) of the mapping; some approaches will be discussed in Section 3.3. Measures that help in choosing basis functions that provide suitably regular mappings would probably be needed for the GTM as well.
Estimates of statistical accuracy that are based directly on the total number of parameters in the SOM may be overly conservative.
If the goal of data analysis is merely to visualize a given data set for exploratory purposes, as in Publication 2, and if no new samples need to be mapped later, then the map can be taught to represent the given input data set with a sufficient statistical accuracy by sampling with replacement even from a small data set during the computation of the map. Then, of course, there is no guarantee that the map will generalize well to new data. The SOM is useful, however, even for maps having more units than there are data items. Even if the map passed through all data items, it would do so in an ordered manner, and the distances between close-by data item pairs could be visualized on the map displays using the methods discussed in Section 2.4.2.
A brief introduction to one application of the SOM may serve to illustrate
this point. A one-dimensional circular SOM can in effect be used for finding an approximate solution to the traveling salesman problem (Angéniol et al., 1988; Budinich, 1996): find the shortest route that passes through all of the data items
once. The map forms a path that follows the data distribution, i.e., extends close
to each data item. A two-dimensional map follows the data in a similar manner.
It forms an elastic network that can follow the data the more closely the more
there are map units. (Note, however, that the smoothness or regularity of the
mapping can be controlled independently of the map size by changing the width
of the neighborhood kernel.)
Setting the number of nodes approximately equal to the number of the in-
put samples seems to be a useful rule-of-thumb for many applications when the
data sets are relatively small. More extensive empirical investigations should be
done, however, for example by cross-validating the results utilizing either the cost
function (Eq. 7) or the measures presented in Publication 7.
2.5 Relations and differences between SOM and MDS

Goals. The MDS methods try to preserve the metric (or, in the case of nonmetric MDS, the global ordering relations) of the original space, whereas the SOM tries to preserve the topology, i.e., the local neighborhood relations. Sammon's mapping lies in the middle: it tries to preserve the metric, but considers preservation of local structures more important than the other MDS methods do.
It may be hard to define topology preservation rigorously for finite discrete
structures, but it may still be possible to construct viable practical measures of
how well the neighborhood relations are preserved. Such measures are discussed
in more detail in Section 3.3 and in Publication 7.
The difference between methods that preserve distances and methods that preserve topology can perhaps be clarified through a hypothetical experiment with
a data set consisting of a curved two-dimensional surface in a three-dimensional
space. Distance-preserving methods require three dimensions to describe the
structure of the data adequately. Topology-preserving methods like the SOM, on
the other hand, only need two dimensions. The elastic network that the map
forms can follow the curved surface in the data space.
Computational properties. It is common to all MDS methods that they
do not construct an explicit mapping function, but instead the projections of
all samples are computed in a simultaneous optimization process. Thus, new
samples cannot be projected without recomputing the whole projection, at least
not accurately.
At least in the case of Sammon's mapping this problem may be alleviated
somewhat by computing the projection with a multilayer perceptron-type network
which learns to minimize the cost function. This can be done by using a learning
rule that is analogous to the error backpropagation algorithm (Mao and Jain,
1995). The method does not necessarily reduce the computational complexity of
the original algorithm since the backpropagation learning rule is computationally
very intensive, but it generalizes the mapping to new samples. An alternative
approach is to use a radial basis function (RBF) network to construct the distance-
preserving mapping (Webb, 1995); this could be a promising alternative to the
original MDS methods.
The optimization of the cost functions of the MDS methods requires comparisons between all pairs of vectors, whereby their computational complexity is of the order of $N^2$, i.e., $O(N^2)$, where $N$ is the number of data items. The computational complexity of the SOM is of the order of $K^2$, where $K$ is the number of map units: each learning step requires $O(K)$ computations, and to achieve a sufficient statistical accuracy the number of iterations should be at least some multiple of $K$.
If the size of the map is of the order of the number of input samples as in
Publication 2, the computational complexities of the methods are of the same
order of magnitude. If reduction of complexity is needed, the resolution of the
SOM can be reduced by using a smaller map. It is also possible to use faster
versions of the algorithm, discussed in Section 2.4.4, or a parallel implementation.
The cost function of the metric MDS can be modified to concentrate on local structures by weighting each term with a suitably chosen function $F$ of the distance,

$$E_M' = \sum_{k \neq l} [d(k,l) - d'(k,l)]^2\, F(d'(k,l)). \qquad (10)$$
3 STAGES OF SOM-BASED EXPLORATORY DATA ANALYSIS

This section gives practical recipes for the successive stages of SOM-based exploratory data analysis; similar treatments have been presented earlier (Kaski and Kohonen, 1995; Kohonen, 1995c).
The basic steps in applying the SOM are quite standard although, as usual in exploratory data analysis, care must be taken in choosing the inputs and in evaluating the results to avoid misleading conclusions. The stages of the analysis can be iterated interactively by modifying the composition of the data set on the basis of insight gained during the previous round.
The first step in the analysis consists of the choice of the data set; although this step seems evident, its importance cannot be stressed too much. No matter how good the methodology is, the results will fundamentally depend on the quality and suitability of the data. Even good methods may only aid in the process of gaining insight into the nature of the data.
3.1 Preprocessing
Feature extraction, the choice of suitable representations for the data items, is
a key step in the analysis. All unsupervised methods merely illustrate some
structures in the data set, and the structures are ultimately determined by the
features chosen to represent the data items. The usefulness of dierent prepro-
cessing methods depends strongly on the application. Therefore a comprehensive
treatment would be an immense task, and only the basic approaches used in the
case studies can be introduced here.
If the data comes from a process of which there exists some a priori knowledge,
this information should of course be used for choosing the features. This is the
case with the EEG signal (Publication 1); it is known that the frequency content
of the signal depends, for instance, on the vigilance of the individual. The better
the features can be tailored to reect the requirements of the task the better the
results will be. In general, however, the task of tailoring requires considerable
expertise both in the application area and in the data analysis methodology. Some
experiments with a method that is potentially useful as an automatic feature
extraction stage are reported in Publication 8.
Sophisticated feature extraction methods are also required in the case of min-
ing symbolic information like text, for which no evident automatically available
semantic features exist. Similarity relations computed based on the forms of
the words would convey almost no information about the meaning and use of
the words. Contextual information can, however, be used for constructing useful
similarity relationships for textual data (Ritter and Kohonen, 1989). A system
that utilizes representations of the average context of words to represent the
words, and suitably processed word category histograms to represent documents
is discussed in Publications 3, 4, 5, and 6.
Besides the choice of the features, their scaling must also be chosen before
applying the SOM algorithm. If there exists knowledge of the relative impor-
tance of the components of the data items, the corresponding dimensions of the
input space can be scaled according to this information. The importance may
in some applications be estimated automatically by, for example, an entropy-
based criterion (Publication 4). If no such criterion is available the variance of
all components may be scaled to an equal value as was done in Publication 2.
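A minimal sketch of the variance scaling mentioned above (equalizing the variance of every component when no importance estimate is available):

```python
import numpy as np

def scale_components(X):
    """Scale each component of the data to unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0        # leave constant components unchanged
    return X / std
```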
The importance of choosing a set of indicators that describes the phenomenon
of interest and nothing else, and proper scaling of the indicators cannot be over-
stressed. Even the best analysis methods cannot overcome all mistakes made at
this stage.
In the first approach the relations between the reference vectors and the relations between the corresponding units on the map lattice are compared. For example, it can be measured how often the line connecting the reference vectors of neighboring units is dissected by the Voronoi region of some unit other than the endpoints (Zrehen, 1993). The Voronoi region of unit $i$ is defined to be the set consisting of the points in the input space that are closer to $\mathbf{m}_i$ than to any other reference vector.
An alternative approach for measuring topology preservation is to use input
samples to determine how continuous the mapping from the input space to the
map grid is. If the two reference vectors that are closest to a given data item are
not neighbors on the map lattice, then the mapping is not continuous near that
item.
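This continuity check can be turned into a simple summary statistic; the following sketch (often called the topographic error, and not the combined goodness measure of Publication 7) counts the samples whose two closest reference vectors are not grid neighbors:

```python
import numpy as np

def topographic_error(X, m, r):
    """Fraction of samples x whose two nearest reference vectors do not
    belong to neighboring map units; m is (K, dim), r is (K, 2) grid
    coordinates. Units are taken to be neighbors when their grid
    coordinates differ by at most one in each direction (an assumption)."""
    errors = 0
    for x in X:
        c1, c2 = np.argsort(np.linalg.norm(m - x, axis=1))[:2]
        if np.abs(r[c1] - r[c2]).max() > 1:
            errors += 1
    return errors / len(X)
```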
Neither of these approaches takes into account the accuracy of the map in
representing the data, the rst term in the decomposition of the cost function
(Eq. 9). In Publication 7 it is proposed that the goodness of the SOM could
be measured by using a sum of the accuracy and a suitably chosen index of the
topology preservation (cf. Fig. 6).
Figure 6: One method for measuring the goodness of SOMs consists of computing, for a repre-
sentative set of input samples x, the distance from x first to the closest reference
thereafter to the second-closest reference vector along the map. The index of goodness is the
average of these distances. If the mapping from the input space to the map grid is continuous
near x the distances are generally smaller than when the mapping is discontinuous. The dots
denote reference vectors of a one-dimensional map in a two-dimensional input space. Reference
vectors of neighboring map units have been connected with thin lines. The thick line indicates
how the distance is computed. For two-dimensional maps the distance is computed using the
shortest path along the map.
A related measure has been proposed by Minamimoto et al. (1995) for analyzing the topological structure of the data space. There are two essential differences between the measures, however. First, Minamimoto et al. combine qualitatively different measures using a linear combination. The coefficients of the combination may then be difficult to choose, whereas in Publication 7 the measures that are combined measure distances in the same space. Second, Minamimoto et al. also combine the number of reference vectors in the measure. We, however, considered the size of the map only indirectly, to the extent that it contributes to the preservation of the topology.
The best way to use the measure of goodness for choosing a map, assuming there are enough computational resources, seems to be to teach a set of maps with each choice of the map size and the neighborhood kernel. The map that minimizes the cost function (Eq. 7) is then chosen from each set. Of the resulting set of best maps, the final map is chosen according to the measure of goodness presented in Publication 7.

Some additional notes, for which there was no space in Publication 7, are given below:

Note 1: Sparse data. If the data set is sparse, consisting only of a few samples per map unit, it becomes more difficult to measure topology preservation. A judgment of the continuity of the mapping based on a sum over the data samples may become inaccurate. It may then be useful to consider, along with the distance between the nearest and second-nearest reference vector as was done in Publication 7, the distance to the third-closest reference vector etc., weighted suitably.
4 CASE STUDIES
This thesis consists of three case studies in addition to some methodological devel-
opments. Although the case studies serve here as demonstrations of the use of the
SOM in three different kinds of analysis tasks, an equally important motivation
behind the studies has been the need for the actual applications themselves.
In the full-text analysis the documents were encoded as weighted word histograms: the $j$-th component of the vector representing a document is

$$a_j = \sum_k p_k(j)\, n_k, \qquad (11)$$

where $n_k$ is the number of occurrences of word $k$ in the document and the coefficients $p_k(j)$ determine the contribution of word $k$ to component $j$.
A problem with this encoding method is that if the vocabulary is very large, the dimensionality of the vector is also high. In the Publications this problem was solved by eliminating some of the most common and some of the rarest words, and by clustering the words into word categories. (Actually an extended version of Equation 11 was used, where probabilistic information about the similarity of use of different words was incorporated into the coefficients $p_k(j)$.)
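A sketch of the basic encoding of Equation 11 follows; the variable names and the dense coefficient matrix P holding the $p_k(j)$ are illustrative assumptions (the Publications prune the vocabulary and use word categories to keep the dimensionality manageable).

```python
import numpy as np

def encode_document(words, vocab, P):
    """Weighted word histogram (cf. Eq. 11): a_j = sum_k p_k(j) n_k.
    vocab maps word -> index k; P has shape (n_words, n_components)."""
    n = np.zeros(len(vocab))
    for w in words:
        if w in vocab:              # pruned words are simply skipped
            n[vocab[w]] += 1.0
    return n @ P                    # one component a_j per column of P
```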
5 FURTHER DEVELOPMENTS
5.1 Feature exploration with the adaptive-subspace SOM
Feature extraction is perhaps the most difficult task of the whole enterprise of data
exploration. The components of the data vectors should be selected and prepro-
cessed so that the relations between the representations of the data items would
correspond to meaningful relations between the items. Considerable expertise in
the application area may be required for building suitable feature extractors.
The adaptive-subspace SOM (ASSOM) (Kohonen, 1995a; Kohonen, 1995b;
Kohonen, 1995c; Kohonen, 1996) is a step towards a more general-purpose feature
extractor: it extracts invariant features from its input, features that are invariant
to the particular transformation that has operated on the inputs.
The ASSOM could act as a learning feature extraction stage that produces invariant representations to be explored by the SOM, or it could be used for feature exploration in itself. The ASSOM extracts the invariant features with filters that are formed automatically, based on short sequences of input samples, during the learning process. Insight into the processes that have produced the data set might then be gained by visualizing the resulting filters. The visualization is particularly easy since the representations of the ASSOM are ordered just as in the basic SOM.
In Publication 8 the study of the ASSOM algorithm is continued; both the the-
oretical treatment and the scope of the experimental simulations are broadened.
It is demonstrated that features invariant to different kinds of transformations can be extracted, and that if several transformations have been present in the training data, albeit at different times, areas emerge in which all the units have become specialized to be invariant to one of the transformations.
Figure 7: The nonlinearity of the SOM is taken into account by defining the distance between map units to be the distance along the elastic network that the map forms in the input space. On the left, the reference vectors of a two-dimensional map in a two-dimensional input space have been denoted by dots. Neighboring reference vectors have been connected with thin lines. The distance between units k and l is drawn with the thick line. Depending on the values of the reference vectors, the path along which the distance is computed need not be the shortest path on the map grid, as shown on the right.
The measure of the (dis)similarity of two maps, a metric for SOMs, can then be
constructed by comparing the relations of pairs of data items on the two maps.
6 CONCLUSION
In this thesis methodologies have been established for applying self-organizing
maps in the exploratory analysis of large data sets. The methods have been
demonstrated in three different kinds of case studies: continuous-valued dense data (EEG signals), continuous-valued sparse data (indicators of the standard of living of different countries), and discrete-valued data (full-text document collec-
tions).
The self-organizing maps illustrate structures in the data in a different manner than, for example, multidimensional scaling, a more traditional multivariate data analysis methodology. The SOM algorithm concentrates on preserving the neighborhood relations in the data instead of trying to preserve the distances between the data items. Comparisons between methods having different goals
must eventually be based on their practical applicability. Here the SOM has
been shown to provide a viable alternative.
REFERENCES
Alander, J. T., Frisk, M., Holmström, L., Hämäläinen, A., and Tuominen, J. (1991) Process error detection using self-organizing feature maps. In Kohonen, T., Mäkisara, K., Simula, O., and Kangas, J., editors, Artificial Neural Networks. Proceedings of ICANN'91, International Conference on Artificial Neural Networks, volume II, pages 1229–1232. North-Holland, Amsterdam.

Amari, S. (1980) Topographic organization of nerve fields. Bulletin of Mathematical Biology, 42:339–364.

Anderberg, M. R. (1973) Cluster Analysis for Applications. Academic Press, New York, NY.

Andrews, D. F. (1972) Plots of high-dimensional data. Biometrics, 28:125–136.

Angéniol, B., de La Croix Vaubois, G., and Le Texier, J.-Y. (1988) Self-organizing feature maps and the traveling salesman problem. Neural Networks, 1:289–293.

Back, B., Sere, K., and Vanharanta, H. (1996) Data mining accounting numbers using self-organizing maps. In Alander, J., Honkela, T., and Jakobsson, M., editors, Proceedings of STeP'96, Finnish Artificial Intelligence Conference, pages 35–47. Finnish Artificial Intelligence Society, Vaasa, Finland.

Bellman, R. E. (1961) Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ.

Bezdek, J. C. and Pal, N. R. (1995) An index of topological preservation for feature extraction. Pattern Recognition, 28:381–391.

Bishop, C. M., Svensén, M., and Williams, C. K. I. (1996a) EM optimization of latent-variable models. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 465–471. The MIT Press, Cambridge, MA.

Bishop, C. M., Svensén, M., and Williams, C. K. I. (1996b) GTM: a principled alternative to the self-organizing map. In von der Malsburg, C., von Seelen, W., Vorbrüggen, J. C., and Sendhoff, B., editors, Proceedings of ICANN'96, International Conference on Artificial Neural Networks, Lecture Notes in Computer Science, vol. 1112, pages 165–170. Springer, Berlin.

Biswas, G., Jain, A. K., and Dubes, R. C. (1981) Evaluation of projection algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3:701–708.
Lee, R. C. T., Slagle, J. R., and Blum, H. (1977) A triangulation method for the sequential mapping of points from N-space to two-space. IEEE Transactions on Computers, 26:288–292.
Lloyd, S. P. (1957) Least squares quantization in PCM. Unpublished memorandum, Bell Laboratories. (Published in IEEE Transactions on Information Theory, 28:129–137, 1982).

Lopes da Silva, F. H., Storm van Leeuwen, W., and Rémond, A., editors (1986) Handbook of Electroencephalography and Clinical Neurophysiology. Volume 2: Clinical Applications of Computer Analysis of EEG and other Neurophysiological Signals. Elsevier, Amsterdam.

Luttrell, S. P. (1988) Self-organizing multilayer topographic mappings. In Proceedings of ICNN'88, IEEE International Conference on Neural Networks, volume I, pages 93–100. IEEE Service Center, Piscataway, NJ.

Luttrell, S. P. (1989) Hierarchical vector quantization. IEE Proceedings, 136:405–413.

MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations. In Le Cam, L. M. and Neyman, J., editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Volume I: Statistics, pages 281–297. University of California Press, Berkeley and Los Angeles, CA.

Makhoul, J., Roucos, S., and Gish, H. (1985) Vector quantization in speech coding. Proceedings of the IEEE, 73:1551–1588.

Mao, J. and Jain, A. K. (1995) Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks, 6:296–317.

Martín-del-Brío, B. and Serrano-Cinca, C. (1993) Self-organizing neural networks for the analysis and representation of data: some financial cases. Neural Computing & Applications, 1:193–206.
Pérez, R., Glass, L., and Shlaer, R. J. (1975) Development of specificity in cat visual cortex. Journal of Mathematical Biology, 1:275–288.

Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, Great Britain.

Ritter, H. (1991) Asymptotic level density for a class of vector quantization processes. IEEE Transactions on Neural Networks, 2:173–175.

Ritter, H. and Kohonen, T. (1989) Self-organizing semantic maps. Biological Cybernetics, 61:241–254.

Ritter, H., Martinetz, T., and Schulten, K. (1992) Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, Reading, MA.

Ritter, H. and Schulten, K. (1988) Kohonen's self-organizing maps: exploring their computational capabilities. In Proceedings of ICNN'88, IEEE International Conference on Neural Networks, volume I, pages 109–116. IEEE Service Center, Piscataway, NJ.

Rubner, J. and Tavan, P. (1989) A self-organizing network for principal component analysis. Europhysics Letters, 10:693–698.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986) Learning internal representations by error propagation. In Rumelhart, D. E., McClelland, J. L., and the PDP Research Group, editors, Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 1: Foundations, pages 318–362. The MIT Press, Cambridge, MA.

Salton, G. and McGill, M. J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY.

Samad, T. and Harp, S. A. (1992) Self-organization with partial data. Network: Computation in Neural Systems, 3:205–212.

Sammon, Jr., J. W. (1969) A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18:401–409.

Schalkoff, R. J. (1992) Pattern Recognition: Statistical, Structural and Neural Approaches. Wiley, New York, NY.

Sedgewick, R. (1988) Algorithms. Addison-Wesley, Reading, MA, 2nd edition.

Serrano-Cinca, C. (1996) Self-organizing neural networks for financial diagnosis. To appear in Decision Support Systems.

Shepard, R. N. (1962) The analysis of proximities: multidimensional scaling with an unknown distance function. Psychometrika, 27:125–140; 219–246.
Simoudis, E. (1996) Reality check for data mining. IEEE Expert, October, pages 26–33.

Sneath, P. H. A. and Sokal, R. R. (1973) Numerical Taxonomy. Freeman, San Francisco, CA.

Swindale, N. W. (1980) A model for the formation of ocular dominance stripes. Proceedings of the Royal Society of London, B, 208:243–264.

Therrien, C. W. (1989) Decision, Estimation, and Classification. An Introduction to Pattern Recognition and Related Topics. Wiley, New York, NY.

Torgerson, W. S. (1952) Multidimensional scaling: I. Theory and method. Psychometrika, 17:401–419.

Tryba, V. and Goser, K. (1991) Self-organizing feature maps for process control in chemistry. In Artificial Neural Networks. Proceedings of ICANN'91, International Conference on Artificial Neural Networks, volume I, pages 847–852. North-Holland, Amsterdam.

Tryon, R. C. and Bailey, D. E. (1973) Cluster Analysis. McGraw-Hill, New York, NY.

Tukey, J. W. (1977) Exploratory Data Analysis. Addison-Wesley, Reading, MA.

Ultsch, A. (1993a) Knowledge extraction from self-organizing neural networks. In Opitz, O., Lausen, B., and Klar, R., editors, Information and Classification, pages 301–306. Springer-Verlag, Berlin.

Ultsch, A. (1993b) Self-organizing neural networks for visualization and classification. In Opitz, O., Lausen, B., and Klar, R., editors, Information and Classification, pages 307–313. Springer-Verlag, Berlin.

Ultsch, A. and Siemon, H. P. (1990) Kohonen's self organizing feature maps for exploratory data analysis. In Proceedings of ICNN'90, International Neural Network Conference, pages 305–308. Kluwer, Dordrecht.

Varfis, A. and Versino, C. (1992) Clustering of socio-economic data with Kohonen maps. Neural Network World, 2:813–834.

Velleman, P. F. and Hoaglin, D. C. (1981) Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury Press, Boston, MA.

von der Malsburg, C. (1973) Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14:85–100.

Webb, A. R. (1995) Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recognition, 28:753–759.