
Proceedings of the Matlab DSP Conference 1999, Espoo, Finland, November 16–17, pp. 35–40, 1999.

Self-organizing map in Matlab: the SOM Toolbox

Juha Vesanto, Johan Himberg, Esa Alhoniemi and Juha Parhankangas

Laboratory of Computer and Information Science, Helsinki University of Technology, Finland

Abstract

The Self-Organizing Map (SOM) is a vector quantization method which places the prototype vectors on a regular low-dimensional grid in an ordered fashion. This makes the SOM a powerful visualization tool. The SOM Toolbox is an implementation of the SOM and its visualization in the Matlab 5 computing environment. In this article, the SOM Toolbox and its usage are briefly presented. Its performance in terms of computational load is also evaluated and compared to a corresponding C program.

1. General

This article presents the second version of the SOM Toolbox, hereafter simply called the Toolbox, for the Matlab 5 computing environment by MathWorks, Inc. The SOM acronym stands for Self-Organizing Map (also called Self-Organizing Feature Map or Kohonen map), a popular neural network based on unsupervised learning [1]. The Toolbox contains functions for the creation, visualization and analysis of Self-Organizing Maps. The Toolbox is available free of charge under the GNU General Public License from http://www.cis.hut.fi/projects/somtoolbox.

The Toolbox was born out of the need for a good, easy-to-use implementation of the SOM in Matlab for research purposes. In particular, the researchers responsible for the Toolbox work in the field of data mining, and therefore the Toolbox is oriented in that direction in the form of powerful visualization functions. However, people doing other kinds of research with the SOM will probably find it useful as well, especially if they have not yet made a SOM implementation of their own in the Matlab environment. Since much effort has been put into making the Toolbox relatively easy to use, it can also be used for educational purposes.

The Toolbox (the basic package together with contributed functions) can be used to preprocess data, initialize and train SOMs using a range of different topologies, visualize SOMs in various ways, and analyze the properties of the SOMs and data, e.g. SOM quality, clusters on the map and correlations between variables. With data mining in mind, the Toolbox and the SOM in general are best suited for data understanding or survey, although they can also be used for classification and modeling.

2. Self-organizing map

A SOM consists of neurons organized on a regular low-dimensional grid, see Figure 1. Each neuron is a d-dimensional weight vector (prototype vector, codebook vector) where d is equal to the dimension of the input vectors. The neurons are connected to adjacent neurons by a neighborhood relation, which dictates the topology, or structure, of the map. In the Toolbox, the topology is divided into two factors: the local lattice structure (hexagonal or rectangular, see Figure 1) and the global map shape (sheet, cylinder or toroid).

Figure 1. Neighborhoods (0, 1 and 2) of the centermost unit: hexagonal lattice on the left, rectangular on the right. The innermost polygon corresponds to the 0-neighborhood, the next to the 1-neighborhood and the outermost to the 2-neighborhood.

The SOM can be thought of as a net which is spread over the data cloud. The SOM training algorithm moves the weight vectors so that they span the data cloud and so that the map is organized: neighboring neurons on the grid get similar weight vectors. Two variants of the SOM training algorithm have been implemented in the Toolbox. In the traditional sequential training, samples are presented to the map one at a time, and the algorithm gradually moves the weight vectors towards them, as shown in Figure 2. In batch training, the data set is presented to the SOM as a whole, and the new weight vectors are weighted averages of the data vectors. Both algorithms are iterative, but the batch version is much faster in Matlab since matrix operations can be utilized efficiently.
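In Toolbox terms, the two training variants correspond to the functions som_seqtrain and som_batchtrain, which are discussed further in Section 3. As a rough sketch of how they are invoked on a data set sD (a data struct or plain data matrix, see Section 4) — the basic calling forms below, as well as the initialization function som_lininit, do not appear verbatim in this paper and should be checked against the Toolbox documentation:

% initialize a map on the data (linear initialization, see Section 4.4)
sM = som_lininit(sD);

% batch training: the whole data set is used on every iteration
sM = som_batchtrain(sM, sD);

% sequential training: samples are presented one at a time
% (much slower in Matlab, as noted above)
sM = som_seqtrain(sM, sD);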
For a more complete description of the SOM and its implementation in Matlab, please refer to the book by Kohonen [1] and to the SOM Toolbox documentation.

Figure 2. Updating the best matching unit (BMU) and its neighbors towards the input sample marked with x. The solid and dashed lines correspond to the situation before and after updating, respectively.

3. Performance

The Toolbox can be downloaded for free from http://www.cis.hut.fi/projects/somtoolbox. It requires no other toolboxes, just the basic functions of Matlab (version 5.1 or later). The total disk space required for the Toolbox itself is less than 1 MB. The documentation takes a few MBs more.

The performance tests were made on a machine with 3 GB of memory and eight 250 MHz R10000 CPUs (one of which was used by the test process) running the IRIX 6.5 operating system. Some tests were also performed on a workstation with a single 350 MHz Pentium II CPU, 128 MB of memory and the Linux operating system. The Matlab version in both environments was 5.3.

The purpose of the performance tests was only to evaluate the computational load of the algorithms. No attempt was made to compare the quality of the resulting mappings, primarily because there is no uniformly recognized "correct" method to evaluate it. The tests were performed with data sets and maps of different sizes, and three training functions: som_batchtrain, som_seqtrain and som_sompaktrain, the last of which calls the C program vsom to perform the actual training. This program is part of SOM_PAK [3], which is a free software package implementing the SOM algorithm in ANSI C.

Some typical computing times are shown in Table 1. As a general result, som_batchtrain was clearly the fastest. In IRIX it was up to 20 times faster than som_seqtrain and up to 8 times faster than som_sompaktrain; the median speedups were 6-fold and 3-fold, respectively. som_batchtrain was especially fast with larger data sets, while with a small data set and a large map it was actually slower. However, the latter case is very atypical and can thus be ignored. In Linux, the smaller amount of memory clearly came into play: the margin between batch training and the other training functions was halved.

The number of data samples clearly had a linear effect on the computational load. On the other hand, the number of map units seemed to have a quadratic effect, at least with som_batchtrain. Of course, an increase in input dimension also increased the computing times: about two- to threefold as the input dimension increased from 10 to 50. The most surprising result of the performance test was that, especially with large data sets and maps, som_batchtrain outperformed the C program (vsom used by som_sompaktrain). The reason is probably that in SOM_PAK, distances between map units on the grid are always calculated anew when needed, whereas in the SOM Toolbox all of these, as well as many other required matrices, are calculated beforehand.

Indeed, the major deficiency of the SOM Toolbox, and especially of the batch training algorithm, is its memory consumption. A rough lower bound estimate of the amount of memory used by som_batchtrain is 8(5(m+n)d + 3m²) bytes, where m is the number of map units, n is the number of data samples and d is the input space dimension. For a [3000 x 10] data matrix and 300 map units the amount of memory required is still moderate, on the order of 3.5 MB. But for a [30000 x 50] data matrix and 3000 map units, the memory requirement is more than 280 MB, the majority of which comes from the last term of the expression. The sequential algorithm is less extreme, requiring only one half or one third of this. SOM_PAK requires much less memory, about 20 MB for the [30000 x 50] case, and can operate with buffered data.

Table 1. Typical computing times. Data set size is given as [n x d], where n is the number of data samples and d is the input dimension.

  data size     map units   batch     seq       sompak
  IRIX
  [300x10]      30          0.2 s     3.1 s     0.9 s
  [3000x10]     300         7 s       54 s      17 s
  [30000x10]    1000        5 min     19 min    9 min
  [30000x50]    3000        27 min    5.7 h     75 min
  Linux
  [300x10]      30          0.3 s     2.7 s     1.9 s
  [3000x10]     300         24 s      76 s      26 s
  [30000x10]    1000        13 min    40 min    15 min
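The memory figures quoted above follow directly from the lower bound formula. As a quick check, evaluating 8(5(m+n)d + 3m²) for the two cases mentioned in the text gives roughly 3.5 MB and 282 MB:

% memory lower bound of som_batchtrain for [3000 x 10] data, 300 map units
m = 300; n = 3000; d = 10;
8*(5*(m+n)*d + 3*m^2) / 1e6     % roughly 3.5 MB

% [30000 x 50] data, 3000 map units: dominated by the 3*m^2 term
m = 3000; n = 30000; d = 50;
8*(5*(m+n)*d + 3*m^2) / 1e6     % roughly 282 MB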
4. Use of SOM Toolbox

4.1. Data format

The kind of data that can be processed with the Toolbox is so-called spreadsheet or table data. Each row of the table is one data sample and the columns of the table are the variables of the data set. The variables might be the properties of an object, or a set of measurements taken at a specific time. The important thing is that every sample has the same set of variables. Some of the values may be missing, but the majority should be there. The table representation is a very common data format. If the available data does not conform to these specifications, it can usually be transformed so that it does.

The Toolbox can handle both numeric and categorical data, but only the former is utilized in the SOM algorithm. In the Toolbox, categorical data can be inserted into labels associated with each data sample. They can be considered as post-it notes attached to each sample: the user can check them later to see the meaning of some specific sample, but the training algorithm ignores them. The function som_autolabel can be used to handle categorical variables. If the categorical variables need to be utilized in training the SOM, they can be converted into numerical variables using, e.g., mapping or 1-of-n coding [4].

Note that for a variable to be "numeric", the numeric representation must be meaningful: values 1, 2 and 4 corresponding to objects A, B and C should really mean that (in terms of this variable) B is between A and C, and that the distance between B and A is smaller than the distance between B and C. Identification numbers, error codes, etc. rarely have such meaning, and they should be handled as categorical data.

4.2. Construction of data sets

First, the data has to be brought into Matlab using, for example, the standard Matlab functions load and fscanf. In addition, the Toolbox has the function som_read_data, which can be used to read ASCII data files:

sD = som_read_data('data.txt');

The data is usually put into a so-called data struct, which is a Matlab struct defined in the Toolbox to group information related to a data set. It has fields for the numerical data (.data) and strings (.labels), as well as for information about the data set and the individual variables. The Toolbox utilizes many other structs as well, for example a map struct which holds all information related to a SOM. A numerical matrix can be converted into a data struct with sD = som_data_struct(D). If the data consists only of numerical values, it is not actually necessary to use data structs at all: most functions accept plain numerical matrices as well. However, if there are categorical variables, data structs have to be used. The categorical variables are converted to strings and put into the .labels field of the data struct as a cell array of strings.
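As an illustration of the above, a numerical matrix with an accompanying categorical variable could be packed into a data struct roughly as follows. This is only a sketch with hypothetical example data (the variables D, class_names and class_id are invented for illustration), and the exact layout of the data struct fields should be checked against the Toolbox documentation:

D = rand(100, 4);                          % 100 samples, 4 numerical variables
sD = som_data_struct(D);                   % wrap the matrix into a data struct

% hypothetical categorical variable: one class index per sample
class_names = {'A', 'B', 'C'};
class_id = ceil(3 * rand(100, 1));

% categorical values go into .labels as a cell array of strings
sD.labels = reshape(class_names(class_id), 100, 1);

% alternatively, 1-of-n coding turns the categorical variable into
% numerical columns that can be used in training
C = zeros(100, 3);
C(sub2ind(size(C), (1:100)', class_id)) = 1;
D2 = [D C];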
4.3. Data preprocessing

Data preprocessing in general can be just about anything: simple transformations or normalizations performed on single variables, filters, or calculation of new variables from existing ones. In the Toolbox, only the first of these is implemented as part of the package. Specifically, the function som_normalize can be used to perform linear and logarithmic scalings and histogram equalizations of the numerical variables (the .data field). There is also a graphical user interface tool for preprocessing data, see Figure 3.

Figure 3. Data set preprocessing tool.

Scaling of the variables is of special importance in the Toolbox, since the SOM algorithm uses the Euclidean metric to measure distances between vectors. If one variable has values in the range [0,...,1000] and another in the range [0,...,1], the former will almost completely dominate the map organization because of its greater impact on the measured distances. Typically, one would want the variables to be equally important. The standard way to achieve this is to linearly scale all variables so that their variances are equal to one.

One of the advantages of using data structs instead of simple data matrices is that the structs retain information about the normalizations in the field .comp_norm. Using the function som_denormalize, one can reverse the normalization to get the values back in the original scale: sD = som_denormalize(sD). One can also repeat exactly the same normalizations on other data sets.

All normalizations are single-variable transformations. One can apply one kind of normalization to one variable and another type to another variable. Multiple normalizations, one after the other, can also be applied to each variable. For example, consider a data set sD with three numerical variables. The user could apply a histogram equalization to the first variable, a logarithmic scaling to the third variable, and finally a linear scaling to unit variance to all three variables:

sD = som_normalize(sD,'histD',1);
sD = som_normalize(sD,'log',3);
sD = som_normalize(sD,'var',1:3);

The data does not necessarily have to be preprocessed at all before creating a SOM from it. However, in most real tasks preprocessing is important, perhaps even the most important part of the whole process [4].
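To see why the unit-variance scaling matters, consider a small sketch with two variables whose ranges differ by three orders of magnitude, as in the example above. Before scaling, the squared Euclidean distances are dominated almost entirely by the first variable; after each column is linearly scaled to zero mean and unit variance (which is essentially what the 'var' normalization does), both variables contribute comparably. Only basic Matlab functions are used here:

% two variables with very different ranges
D = [1000*rand(100,1), rand(100,1)];

% squared distance between the first two samples, dominated by column 1
d_raw = sum((D(1,:) - D(2,:)).^2)

% linear scaling of each column to zero mean and unit variance
Dn = (D - repmat(mean(D), 100, 1)) ./ repmat(std(D), 100, 1);
d_scaled = sum((Dn(1,:) - Dn(2,:)).^2)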
4.4. Initialization and training

There are two initialization algorithms (random and linear) and two training algorithms (sequential and batch) implemented in the Toolbox. By default, linear initialization and the batch training algorithm are used. The simplest way to initialize and train a SOM is to use the function som_make, which does both using automatically selected parameters:

sM = som_make(sD);

The training is done in two phases: rough training with a large (initial) neighborhood radius and a large (initial) learning rate, and fine-tuning with a small radius and learning rate. If tighter control over the training parameters is desired, the respective initialization and training functions, e.g. som_batchtrain, can be used directly. There is also a graphical user interface tool for initializing and training SOMs, see Figure 4.

Figure 4. SOM initialization and training tool.
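As a sketch of such direct use of the initialization and training functions, the two-phase scheme described above could look roughly like the following. The argument names 'radius' and 'trainlen', the radius values and the training lengths are our assumptions for illustration and should be checked against the help texts of the functions:

sM = som_lininit(sD);                                % linear initialization

% rough training: large initial neighborhood radius
% ('radius' and 'trainlen' argument names assumed; see the function help)
sM = som_batchtrain(sM, sD, 'radius', [5 1], 'trainlen', 10);

% fine-tuning: small radius (with som_seqtrain, also a small learning rate)
sM = som_batchtrain(sM, sD, 'radius', [1 1], 'trainlen', 40);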
4.5. Visualization and analysis

There are a variety of methods to visualize the SOM. In the Toolbox, the basic tool is the function som_show. It can be used to show the U-matrix and the component planes of the SOM:

som_show(sM);

The U-matrix visualizes distances between neighboring map units, and thus shows the cluster structure of the map: high values of the U-matrix indicate a cluster border, while uniform areas of low values indicate the clusters themselves. Each component plane shows the values of one variable in each map unit. On top of these visualizations, additional information can be shown: labels, data histograms and trajectories.
With the function som_vis, much more advanced visualizations are possible. The function is based on the idea that the visualization of a data set simply consists of a set of objects, each with a unique position, color and shape. In addition, connections between objects, for example neighborhood relations, can be shown using lines. With som_vis the user is able to assign arbitrary values to each of these properties. For example, the x-, y- and z-coordinates, object size and color can each stand for one variable, thus enabling the simultaneous visualization of five variables. The different options are:
- the position of an object can be 2- or 3-dimensional
- the color of an object can be freely selected from the RGB cube, although typically indexed color is used
- the shape of an object can be any of the Matlab plot markers ('.', '+', etc.), a pie chart, a bar chart, a plot or even an arbitrarily shaped polygon, typically a rectangle or hexagon
- lines between objects can have arbitrary color, width and any of the Matlab line modes, e.g. '-'
- in addition to the objects, associated labels can be shown

For quantitative analysis of the SOM there are at the moment only a few tools. The function som_quality supplies two quality measures for a SOM: average quantization error and topographic error. However, using low level functions like som_neighborhood, som_bmus and som_unit_dists, it is easy to implement new analysis functions. Much research is being done in this area, and many new analysis functions will be added to the Toolbox in the future, for example tools for clustering and for analyzing the properties of the clusters. New visualization functions for making projections and for specific visualization tasks will also be added to the Toolbox.
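In code, the quality measures are obtained with a single call. The two-output form below matches the description above, but the output order is our assumption and should be verified from the help text of som_quality:

% average quantization error and topographic error of a trained map
% (output order assumed; check the help of som_quality)
[qe, te] = som_quality(sM, sD);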
4.6. Example

Here is a simple example of the usage of the Toolbox to make and visualize a SOM of a data set. As the example data, the well-known Iris data set is used [5]. This data set consists of four measurements from 150 Iris flowers: 50 Iris-setosa, 50 Iris-versicolor and 50 Iris-virginica. The measurements are the length and width of the sepal and petal leaves. The data is in an ASCII file, the first few lines of which are shown below. The first line contains the names of the variables. Each of the following lines gives one data sample, beginning with the numerical variables and followed by labels.

#n sepallen sepalwid petallen petalwid
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
...

The data set is loaded into Matlab and normalized. Before normalization, an initial statistical look at the data set would be in order, for example using variable-wise histograms. This information would provide an initial idea of what the data is about, and would indicate how the variables should be preprocessed. In this example, the variance normalization is used. After the data set is ready, a SOM is trained. Since the data set had labels, the map is also labeled using som_autolabel. After this, the SOM is visualized using som_show. The U-matrix is shown along with all four component planes. The labels of each map unit are also shown on an empty grid using som_addlabels. The values of the components are denormalized so that the values shown on the colorbar are in the original value range. The visualizations are shown in Figure 5.

%% make the data
sD = som_read_data('iris.data');
sD = som_normalize(sD,'var');

%% make the SOM
sM = som_make(sD,'munits',30);
sM = som_autolabel(sM,sD,'vote');

%% basic visualization
som_show(sM,'umat','all','comp',1:4,...
    'empty','Labels','norm','d');
som_addlabels(sM,1,6);

From the U-matrix it is easy to see that the top three rows of the SOM form a very clear cluster. By looking at the labels, it is immediately seen that this corresponds to the Setosa subspecies. The two other subspecies, Versicolor and Virginica, form the other cluster. The U-matrix shows no clear separation between them, but from the labels it seems that they correspond to two different parts of the cluster. From the component planes it can be seen that petal length and petal width are very closely related to each other. Some correlation also exists between them and sepal length. The Setosa subspecies exhibits small petals and short but wide sepals. The separating factor between Versicolor and Virginica is that the latter has bigger leaves.

Figure 5. Visualization of the SOM of Iris data: U-matrix on top left, then the component planes, and the map unit labels on bottom right. The six figures are linked by position: in each figure, the hexagon in a certain position corresponds to the same map unit. In the U-matrix, additional hexagons exist between all pairs of neighboring map units. For example, the map unit in the top left corner has low values for sepal length, petal length and petal width, and a relatively high value for sepal width. The label associated with the map unit is 'se' (Setosa), and from the U-matrix it can be seen that the unit is very close to its neighbors.
Component planes are very convenient when one has to visualize a lot of information at once. However, when only a few variables are of interest, scatter plots are much more efficient. Figures 6 and 7 show two scatter plots made using the som_grid function. Figure 6 shows the PCA projection of both the data and the map grid, and Figure 7 visualizes all four variables of the SOM plus the subspecies information using three coordinates, marker size and marker color.
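A projection figure of this kind could be produced roughly as follows. This is only a sketch: the pcaproj helper, the 'Coord' argument of som_grid and the .codebook field name are assumptions that should be checked against the Toolbox documentation:

% project the prototype vectors onto the plane of the two principal components
P = pcaproj(sM.codebook, 2);

% draw the map grid at the projected coordinates;
% neighboring map units are connected with lines
som_grid(sM, 'Coord', P);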
Figure 6. Projection of the Iris data set to the subspace spanned by its two eigenvectors with the greatest eigenvalues. The three subspecies have been plotted using different markers. The SOM grid has been projected to the same subspace; neighboring map units are connected with lines. Labels associated with the map units are also shown.

Figure 7. The four variables and the subspecies information from the SOM. Three coordinates and marker size show the four variables. Marker color gives the subspecies: black for Setosa, dark gray for Versicolor and light gray for Virginica.

5. Conclusions

In this paper, the SOM Toolbox has been briefly introduced. The SOM is an excellent tool for the visualization of high-dimensional data [6]. As such it is most suitable for the data understanding phase of the knowledge discovery process, although it can be used for data preparation, modeling and classification as well.

In future work, our research will concentrate on the quantitative analysis of SOM mappings, especially the analysis of clusters and their properties. New functions and graphical user interface tools will be added to the Toolbox to increase its usefulness in data mining. Outside contributions to the Toolbox are also welcome.

It is our hope that the SOM Toolbox promotes the utilization of the SOM algorithm – in research as well as in industry – by making its best features more readily accessible.

Acknowledgements

This work has been partially carried out in the 'Adaptive and Intelligent Systems Applications' technology program of the Technology Development Center of Finland, and in the EU-financed Brite/Euram project 'Application of Neural Network Based Models for Optimization of the Rolling Process' (NEUROLL). We would like to thank Mr. Mika Pollari for implementing the initialization and training GUI.

References

[1] Kohonen T. Self-Organizing Maps. Springer, Berlin, 1995.
[2] Vesanto J., Alhoniemi E., Himberg J., Kiviluoto K., Parviainen J. Self-Organizing Map for Data Mining in MATLAB: the SOM Toolbox. Simulation News Europe 1999;25:54.
[3] Kohonen T., Hynninen J., Kangas J., Laaksonen J. SOM_PAK: The Self-Organizing Map Program Package. Technical Report A31, Helsinki University of Technology, 1996, http://www.cis.hut.fi/nnrc/nnrc-programs.html
[4] Pyle D. Data Preparation for Data Mining. Morgan Kaufmann Publishers, San Francisco, 1999.
[5] Anderson E. The Irises of the Gaspe Peninsula. Bull. American Iris Society 1935;59:2-5.
[6] Vesanto J. SOM-Based Visualization Methods. Intelligent Data Analysis 1999;3:111-126.

Address for correspondence

Juha Vesanto
Helsinki University of Technology
P.O. Box 5400, FIN-02015 HUT, Finland
Juha.Vesanto@hut.fi
http://www.cis.hut.fi/projects/somtoolbox
