Internship Report: Otto-von-Guericke-University of Magdeburg
Internship Report
Author:
Georg Ruß
July 5, 2005
Academic Supervisors:
A/Prof Saman Halgamuge, Md Azharul Karim
University of Melbourne
Department of Mechanical and Manufacturing Engineering
Parkville, 3052, Victoria, Australia
Acknowledgements
I’d like to say thanks to all the people at the Mechatronics Research Group at the
University of Melbourne, Department of Mechanical and Manufacturing Engineering;
you’ve made my six months of training time worthwhile and mind-opening. Special thanks
go to my academic supervisors A/Prof. Saman Halgamuge, Md Azharul Karim and
Prof. Rudolf Kruse from the University of Magdeburg, Germany; you have provided
the necessary ideas and steps to make this research project a success. My very personal
thanks are directed towards Genevieve Coath, for proof-reading not only this report, but
also for being a true friend, taking care of WalTher and providing emotional support
whilst being far, far away from Germany. The 2nd and 4th assumptions from the last
sentence are also valid for Jessica Christel, being on the other side of the world. Last,
but not least, I’d like to thank my parents and family for supporting my stay in Australia
financially and emotionally.
Contents
List of Figures
1 Introduction
1.1 Purpose of this document
1.2 Motivation
1.3 Given task
1.4 Manufacturing data
1.4.1 Proposed solution / Document structure
2 Basic techniques
2.1 Short introduction
2.2 SOM
2.2.1 Structure
2.2.2 Initialisation
2.2.3 Training
2.2.4 Important properties
2.2.5 Visualisation
2.3 GSOM
2.3.1 Structure and Initialisation
2.3.2 Phases
2.3.3 Maps
2.4 GSOM implementation
4 Simulation Results
4.1 Results for old data
4.1.1 Sampled data
4.1.2 Complete data
4.2 Results for new data
4.2.1 Complete data
4.2.2 Effects of preprocessing
C Resulting maps
List of Figures
4.6 Comparison of old and new data CQ, with SF=0.8 and 0.9
C.1 Hits maps, sampled old data, SF=0.7, GP=1 and 8, SP1=SP2=0
C.2 Hits maps, complete old data, SF=0.7, GP=1 and 8, SP1=SP2=0
C.3 Hits maps, complete old data, SF=0.8, GP=1 and 8, SP1=SP2=0
C.4 Hits maps, complete old data, SF=0.9, GP=1 and 8, SP1=SP2=0
C.5 Hits maps, complete new data, SF=0.8, GP=1 and 8, SP1=SP2=0
C.6 Hits maps, complete new data, SF=0.9, GP=1 and 8, SP1=SP2=0
C.7 SOM growing process, SF=0.1...0.9, GP=1, SP1=SP2=0
List of Tables
List of Abbreviations
Chapter 1
Introduction
1.2 Motivation
Imagine working at a large semiconductor company and being responsible for quality
control at the end of multiple assembly and production lines. Since there is no fuzzy
classification of chips, that is, there is only a working product or non-working ‘waste’,
you are definitely interested in discovering where potentially faulty steps are located
and how much they contribute to the final product. The only input data
you possess to perform your work consists of an enormous number of values from the
assembly line meters; colleagues have invested tedious working hours to find out
manually whether the final product was a working unit or a defective one.
Those colleagues, however, relied on their knowledge about the production
process and did not target relationships inside the data which might have led to faulty
products. Your task as a data analyst is to detach from the physical background and
instead to work on the given data exclusively. Right from the beginning, based on the
high dimensionality of the data, you are sure not to use statistical, descriptive methods
to find distinguishing attributes. Since neural network learning algorithms are known to
operate well under these input data conditions, you decide to use the extension of the
Self-Organising Map, the Growing Self-Organising Map, to try to find the (presumed)
distinguishing attributes (= possibly faulty production line points).
1 The words ‘good’ and ‘bad’ have been used to describe the two possible classes throughout the
author’s training time at the University of Melbourne; they do not constitute any valuation and could
have been chosen arbitrarily. As mentioned in the introduction, this report also serves as documentation
for my Australian supervisor and is, in that regard, consistent with his thesis.
Chapter 2
Basic techniques
Artificial neural networks are adaptive models that can learn from input data and
generalise (and thereby simplify) what they have learned. They extract the essential
characteristics from the numerical data as opposed to memorising all of it. This offers a
convenient way of reducing the amount of data (neural networks have been used to process
millions of inputs [5]) as well as of forming an implicit model without having to manually
build a traditional, physical or logical model of the underlying phenomenon. In contrast to
traditional models, which are theory-rich and data-poor, neural networks are data-rich
and theory-poor: little or no a-priori knowledge of the problem is present, and certain
properties are hard to prove but easy to accept based on empirical observations.
Self-Organising Maps
Self-Organising Maps have been widely used in data mining – or knowledge exploration –
to visualise high-dimensional data and to reduce its dimension to just a few representative
prototypes. It is therefore crucial to sustain the highest possible accuracy during the
data mining process. Many variants extending the conventional SOM’s capabilities were
proposed (one of which is the Growing Self-Organising Map) to allow more flexibility
and adaptiveness by introducing controllable and consecutive growth during the training
process. This chapter will explain the SOM and GSOM algorithms in detail to provide
the necessary basics for understanding the rest of this report.
2.2 SOM
The Self-Organising Map developed by Prof Teuvo Kohonen [6] is one of the most popular
neural network models. The SOM algorithm is based on unsupervised competitive
learning, which makes the training entirely data-driven and lets the neurons on the map
compete with each other. Supervised algorithms like the Multi-Layer Perceptron or
Support Vector Machines require the target values for each data vector to be known in
advance, whereas the SOM does not have this limitation. The SOM has been applied to
a large variety of tasks ranging from process modeling [3] through large textual document
collections [4] to multimedia feature extraction [9].
The important distinction from Vector Quantisation techniques is that the neurons
are organised on a regular grid and that, along with the selected neuron (the
Best-Matching Unit), its neighbors are also updated, by means of which the SOM performs
an ordering of the neurons. In this respect the SOM is a scaling method projecting data
from a high-dimensional input space onto a typically two-dimensional map (see figure
2.1). Hence, similar input vectors get mapped to neighboring neurons on the output
map.
Figure 2.1: Neighborhood preserving mapping from input layer to two-dimensional map
2.2.1 Structure
A SOM is formed of neurons located on a usually 2-dimensional grid having a rectangular
or hexagonal topology. Each neuron of the map is represented by an n-dimensional
weight vector $m_i = [m_{i1}, \dots, m_{in}]^T$, where n is equal to the dimension of the
input vectors. Higher dimensional grids might be used but their visualisation is not as
straightforward as it is for two-dimensional maps.
Figure 2.2: Equidistant SOM neighborhoods on a rectangular grid with (a) Euclidean,
(b) city-block, (c) chessboard distance. [7]
The number of neurons naturally determines the granularity of the resulting mapping,
which in turn influences the accuracy and the generalisation capability of the SOM.
2.2.2 Initialisation
In this section’s basic SOM algorithm the number of neurons and the topological re-
lationship are fixed from the beginning, as opposed to the GSOM algorithm explained
in Section 2.3. The number of neurons should usually be selected as large as possible
with the neighborhood size affecting the smoothness and generalisation capability of the
mapping. The mapping does not suffer considerably even when the number of neurons
exceeds the number of given input vectors; selecting the neighborhood size appropriately
seems to be much more important. As the size of the map increases to e.g. thousands
of neurons, the training phase becomes computationally almost infeasible for practical
applications; in this respect the SOM does not differ much from the GSOM
algorithm, whose (exemplary) computation times are given in detail in Chapter 4.
An initialisation of the weight vectors has to be provided before starting the training
phase. Since the SOM is robust with regard to the choice of the initialisation, it will
eventually converge, although a proper choice of initial values can save some computational
effort. Two often-used initialisation procedures are the following:
• random init: weight vectors are initialised with small random values,
• sample init: weight vectors are initialised with random samples drawn from the
input data set.
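The two initialisation procedures can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used for this report; the function names, the `scale` bound for "small" random values and the fixed seeds are assumptions made for the example.

```python
import random

def random_init(n_neurons, dim, scale=0.1, seed=0):
    """Random init: every weight is a small random value in [0, scale).

    `scale` is an assumed bound for what counts as "small".
    """
    rng = random.Random(seed)
    return [[rng.random() * scale for _ in range(dim)] for _ in range(n_neurons)]

def sample_init(n_neurons, data, seed=0):
    """Sample init: every weight vector is a copy of a randomly drawn input vector."""
    rng = random.Random(seed)
    return [list(rng.choice(data)) for _ in range(n_neurons)]

# Hypothetical, already normalised two-dimensional input data.
data = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
w_random = random_init(4, 2)
w_sample = sample_init(4, data)
```

With sample init the weight vectors start inside the data distribution, which is one plausible reason why it can save training effort compared with random init.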
2.2.3 Training
In each training step one sample vector x from the input data set is chosen (in order
of appearance or randomly) and a similarity measure is calculated between it and all
the weight vectors of the map’s neurons. The Best-Matching Unit (BMU), denoted as
c, is the unit whose weight vector possesses greatest similarity with respect to the input
sample x. The distance measure used to define the similarity is typically a Euclidean
distance; formally the BMU is the neuron for which
\[ \|x - m_c\| = \min_i \{ \|x - m_i\| \} \tag{2.1} \]
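The BMU search of Equation 2.1 can be sketched in a few lines. This is an illustrative example assuming the Euclidean distance and weight vectors stored as plain Python lists; the names `euclidean` and `best_matching_unit` are chosen for the example only.

```python
import math

def euclidean(x, m):
    """Euclidean distance ||x - m|| between input vector x and weight vector m."""
    return math.sqrt(sum((xj - mj) ** 2 for xj, mj in zip(x, m)))

def best_matching_unit(x, weights):
    """Return the index c of the weight vector closest to x (Equation 2.1)."""
    return min(range(len(weights)), key=lambda i: euclidean(x, weights[i]))

# Hypothetical map with three neurons in a two-dimensional input space.
weights = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.9]]
c = best_matching_unit([0.1, 0.8], weights)
```

Every training step performs this linear scan over all neurons, which is why the map size has a direct influence on the training time.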
Figure 2.3: Updating the BMU and its neighbors towards the input sample [10]
The update rule for changing the respective weight vectors of unit i of the SOM is:
• Bubble: constant among the neighborhood of the winner unit and zero elsewhere
(see figure 2.4 (a)),
• Gaussian: $\exp\left(-\frac{\|r_c - r_i\|^2}{2\sigma^2(t)}\right)$ (see figure 2.4 (b)).
Figure 2.4: Commonly used neighborhood functions [10]
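The two neighborhood functions can be written down directly. The sketch below assumes that `grid_dist` is the distance $\|r_c - r_i\|$ between the grid positions of the winner c and unit i, and that `radius` and `sigma` are the current neighborhood radius; the function names are illustrative.

```python
import math

def bubble(grid_dist, radius):
    """Bubble kernel: constant (1) inside the neighborhood radius, zero elsewhere."""
    return 1.0 if grid_dist <= radius else 0.0

def gaussian(grid_dist, sigma):
    """Gaussian kernel: exp(-||r_c - r_i||^2 / (2 * sigma^2))."""
    return math.exp(-(grid_dist ** 2) / (2.0 * sigma ** 2))

# Kernel values for units at grid distance 0..4 from the winner.
bubble_values = [bubble(d, radius=2) for d in range(5)]
gaussian_values = [gaussian(d, sigma=2.0) for d in range(5)]
```

The Gaussian kernel requires an exponential per unit and never becomes exactly zero, so every unit of the map is touched in each update, whereas the bubble kernel only affects units within the radius.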
Using the Gaussian neighborhood function yields slightly better results but is also
computationally heavier. When the SOM is applied, the neighborhood radius is
normally larger at first (resulting in fast adaptation of the map to the inputs) and is
continuously decreased throughout the training process.
Like the neighborhood radius, the learning rate α(t) is a function decreasing over
time. Two common forms are:
• a function inversely proportional to time: $\alpha(t) = \frac{A}{t + B}$,
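As a small worked example of the inverse-time form, the following sketch evaluates α(t) for a few training steps. The constants A = 50 and B = 100 are arbitrary values chosen for the illustration and are not taken from the simulations in this report.

```python
def inverse_time_learning_rate(t, a, b):
    """Learning rate alpha(t) = A / (t + B), decreasing with the training step t."""
    return a / (t + b)

# Assumed constants: A = 50, B = 100, so that alpha(0) = 0.5.
rates = [inverse_time_learning_rate(t, a=50.0, b=100.0) for t in (0, 100, 900)]
```

The ratio A/B fixes the starting learning rate, while B controls how quickly the rate decays.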
Voronoi regions
The SOM partitions the input space into convex Voronoi regions, with each neuron being
responsible (in analogy to the biological case) for one of these regions. The Voronoi region
of a neuron i is the union over all the vectors x to which it is closest:
The reference vector of the respective Voronoi region is placed according to the local
expectation of the data weighted by the neighborhood kernel:
\[ m = \frac{\int h_{ci}\, p(x)\, x \, dx}{\int h_{ci}\, p(x)\, dx} \tag{2.5} \]
In searching for good reference vectors and ordering them on a regular grid at the same
time, the SOM combines the properties of data projection and vector quantisation tech-
niques. The grid can be thought of as a 2-dimensional elastic network following the
original data’s distribution. However, the SOM does not try to preserve the distances
directly but instead focuses on representing the topology of the input data which makes
it inherently useful for visualisation of large data sets.
There is an important tradeoff to be made between the two competing goals of (exact)
quantisation and topology preservation; this can be controlled by setting the radius of
the neighborhood kernel appropriately. Obviously, the SOM reduces to a plain vector
quantisation algorithm when the neighborhood radius is set to zero.
Naturally, even the SOM suffers from incomplete or otherwise faulty data fed into it.
However, outliers in the input data can be easily detected, as an outlier’s distance
from the other vectors in the unit it was mapped to is large. In practice, we often
have to deal with data that has missing values, represented
by vectors with missing components. This problem can be overcome by leaving out
the missing component from the distance calculation during the training and updating
process; if there is still a sufficient number of complete vectors to train the map with, the
missing values can even be filled in with e.g. mean values from the unit the vector got
mapped to. Note that ‘errors in data’ does not refer to errors introduced during the
preprocessing; if the input data is biased or has other systematic or inherent errors,
those cannot be detected and might simply result in bad maps.
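The handling of missing values described above can be sketched as follows. The example assumes that a missing component is represented by `None`; both function names are illustrative and the mean-based fill is only one of the possible choices mentioned in the text.

```python
import math

def masked_distance(x, m):
    """Euclidean distance using only the components of x that are present."""
    terms = [(xj - mj) ** 2 for xj, mj in zip(x, m) if xj is not None]
    return math.sqrt(sum(terms))

def fill_missing(x, unit_vectors):
    """Fill missing components of x with the mean over the vectors mapped to the same unit."""
    filled = list(x)
    for j, xj in enumerate(x):
        if xj is None:
            known = [v[j] for v in unit_vectors if v[j] is not None]
            if known:  # leave the gap if no vector in the unit has this component
                filled[j] = sum(known) / len(known)
    return filled

# Hypothetical input vector with a missing second component.
x = [0.2, None, 0.9]
d = masked_distance(x, [0.2, 0.5, 0.5])
x_filled = fill_missing(x, [[0.1, 0.4, 0.8], [0.3, 0.6, 1.0]])
```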
2.2.5 Visualisation
There are a variety of different visualisation techniques once the SOM has been com-
puted, two of which are relevant to the current problem and therefore mentioned here.
The reference vectors of the SOM can be visualised via the component plane represen-
tation [3]. The computed SOM can be thought of as multi-tiered with the components
of the vectors describing horizontal layers themselves and the reference vectors being
orthogonal to these layers. In this planar, sliced representation it is easy to (a) see the
distribution of the component values, and (b) recognise correlations between compo-
nents. This technique might be used in a later stage of the current solution to the task
from Section 1.3.
Clusters
The main techniques used for evaluating the results in Chapter 4 were to (a) visualise the
computed map and manually look for distinguishable clusters and (b) introduce a quality
measure as a further indicator of the map’s grade. Clusters are groups of vectors which
are close to each other, relative to their distance to other vectors on the map. Therefore,
there are different maps (through different criteria being visualised) to be shown and the
ones to be used here are either the distance map or the hits map. Examples for types of
maps can be found in Section 2.3.3 and their manual evaluation in Chapter 4.
2.3 GSOM
The characteristic feature distinguishing neural maps from other neural network
paradigms and other regular vector quantisers is the preservation of neighborhoods. This
desirable feature obviously depends on the choice of the output topology: the proper di-
mensionality of the output space is usually not known a-priori, but yet has to be specified
prior to learning in the SOM algorithm. This knot has to be cut to be able to optimise
the neighborhood preservation.
Of course, one could easily compute structures in a brute-force approach, starting
from different map sizes and evaluating the degree of neighborhood preservation to choose
the best map possible. Yet, this strategy is computationally heavy and does not neces-
sarily yield the map with the best possible neighborhood preservation. Therefore, in [2]
a new algorithm was proposed which extends the SOM in a very straightforward way. A
more advanced description can be found in [8] or in [1].
Since the GSOM is an extension to the SOM, the latter’s basic principles (see Section
2.2.1 and 2.2.4) also hold for the GSOM. For the sake of clarity this section is structured
similarly to the SOM section.
New variables will be introduced:
The mentioned Growth Threshold (GT) derives from the SF as in Equation 2.6 (where
D stands for the data dimensionality) and is used internally as a threshold value for
initiating node generation: a high GT means less spread and a low GT leads to a larger,
more spread-out map. Note that the outer boundaries of the spread factor (0 and 1)
cannot be included: ln(0) is undefined, and ln(1) = 0 renders the growth threshold
useless, since the computed error of the map would exceed it in every step and the map
would therefore grow continuously.
GT = −D × ln(SF ) (2.6)
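Equation 2.6 is easy to evaluate numerically. The sketch below uses D = 808, an assumption based on the dataset dimensions quoted later in this report, and three spread factors from the simulated range; the function name is illustrative.

```python
import math

def growth_threshold(dim, spread_factor):
    """Growth threshold GT = -D * ln(SF) for a spread factor in the open interval ]0, 1[."""
    if not 0.0 < spread_factor < 1.0:
        raise ValueError("spread factor must lie in the open interval ]0, 1[")
    return -dim * math.log(spread_factor)

# A larger spread factor gives a smaller threshold and therefore a larger map.
for sf in (0.1, 0.5, 0.9):
    print(sf, round(growth_threshold(808, sf), 1))
```

Because GT scales linearly with D, the same spread factor can be reused for datasets of different dimensionality.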
2.3.2 Phases
Growing phase
As with the SOM, the set of neuron vectors $W_i$ in the GSOM can be considered as
a vector quantisation of the input space so that each neuron i is used to represent a
Voronoi region $V_i$ (see Section 2.2.4). A winner neuron is found as per the algorithm
listed below. If a neuron contributes significantly to the total error of the network then
its Voronoi region is said to be underrepresented by the assigned neuron. Hence, a new
neuron is generated in the immediate neighborhood to achieve a better representation
of the region, or, if growth is not possible, the error gets distributed to the neighboring
neurons, raising their error and therefore their growth probability.
The (simplified) GSOM algorithm for the growing phase is as follows:
• Determine the winner neuron i, using the chosen distance measure (e.g. Euclidean).
• If the total error of neuron i is larger than the growth threshold: grow, if i is a boundary
neuron; or distribute weights to neighbors if i is a nonboundary neuron.
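The decision made in the second step can be sketched as below. This is a strongly simplified illustration and not the implementation used here: the data structures, the even split of the winner's error among its neighbors and the reset of the winner's error afterwards are all assumptions, since the text only states that the error is distributed to the neighboring neurons.

```python
def growing_step(errors, neighbors, is_boundary, winner, distance, gt):
    """One simplified growing-phase decision for an already determined winner.

    errors      -- dict: neuron id -> accumulated error
    neighbors   -- dict: neuron id -> list of adjacent neuron ids
    is_boundary -- dict: neuron id -> True if the neuron has a free grid position
    Returns "grow", "distribute" or "none".
    """
    errors[winner] += distance
    if errors[winner] <= gt:
        return "none"
    if is_boundary[winner]:
        return "grow"  # the caller would add a new neuron next to the winner
    # Assumption: the winner's error is split evenly among its neighbors ...
    share = errors[winner] / len(neighbors[winner])
    for n in neighbors[winner]:
        errors[n] += share
    errors[winner] = 0.0  # ... and the winner's own error is reset.
    return "distribute"

# Hypothetical chain of three neurons; only the two ends are boundary neurons.
errors = {0: 0.0, 1: 0.0, 2: 0.0}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
is_boundary = {0: True, 1: False, 2: True}
first = growing_step(errors, neighbors, is_boundary, winner=1, distance=3.0, gt=2.0)
second = growing_step(errors, neighbors, is_boundary, winner=0, distance=1.0, gt=2.0)
```

In this example the inner neuron first pushes its error outwards, which raises the error of the boundary neurons until one of them exceeds the threshold and triggers growth.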
Smoothing phases
The smoothing phases follow the growing phase. The growing phase stops when new
neuron growth gets saturated, i.e. when the frequency of new neuron additions falls
below a certain threshold. Once the growing phase is complete, the weight adaptation
is continued at a lower adaptation rate. No new neurons are added during this phase,
whose purpose is to smooth out any existing quantisation error, particularly in the
neurons grown in the later stages of the growing phase.
During the smoothing phases, inputs to the network are the same as those of the
growing phase. The starting learning rate (LR) in this phase is smaller than in the
growing phase since the weight values should not fluctuate too much without converg-
ing. The input data vectors are repeatedly presented to the network until convergence
is achieved or a certain specified number of inputs have been presented. The smoothing
phase is stopped if the error values of the neurons in the map fall below a threshold. Mul-
tiple smoothing phases with different (consecutively decreasing) LR and neighborhood
sizes can be applied.
2.3.3 Maps
Average map The left map in Figure 2.5 shows similarities between neighboring weight
vectors. For each weight vector the distance to each of the neighboring weight vec-
tors is calculated, those values are averaged and a color from a black-and-white
palette is assigned according to the determined value. White or light grey colors
show similarities (low distances) whereas black and darker colors show dissimilarities
(high distances). If the distances are accumulated rather than averaged,
the result is the distance map, which is described next.
Distance map The right map in Figure 2.5 exemplifies a distance map, where the
distances that each neuron has to its neighboring neurons are accumulated and
depicted with a color palette ranging from blue (low distance) through green (medium
distance) to red/brown (high distance). This map can help in finding borders
between existing clusters, since cluster borders are small regions where distances
are relatively high compared to the rest of the map or the inside of a single cluster.
Error map The error map, to be seen on the left of Figure 2.6, shows the accumulated
error for every neuron at the current stage of training. A low or zero error is
illustrated as a violet or dark blue color, whereas rising errors are shown in red/brown
through to greenish colors for neurons with the highest error. Since the map grows from a
small initialisation stage to larger sizes, the error is expected to propagate towards
the borders of the map, where neurons with highest error cause map growth to
better represent that particular region of the input space.
Hits map This map, shown on the right of Figure 2.6, is the type of map that will
mainly be worked with when running simulations. It shows the number of inputs
which were mapped to the individual neurons during the GSOM training phase.
The color palette ranges from blue (no mappings) through shades of green to red (high
number of mappings). In the depicted case, one would expect a high
number of mappings in the middle cluster (neurons 1, 3, 5, 6, 7, 15), a medium
number assigned to neuron 62 (bottom left) and a cluster with fewer mappings
(neurons 60 and 64), with the rest of the input references scattered throughout the
remaining neurons.
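The quantities behind the hits map and the average and distance maps can be sketched as follows. The sketch assumes that the BMU index of every input vector is already known and that the neighbor relation is given as a dictionary; the color assignment is left out and the function names are illustrative.

```python
import math
from collections import Counter

def hits_map(bmu_indices):
    """Hits map: number of input vectors mapped to each neuron."""
    return Counter(bmu_indices)

def neighbor_distance(i, weights, neighbors, average=True):
    """Average map value (average=True) or distance map value (average=False) of neuron i."""
    dists = [math.dist(weights[i], weights[n]) for n in neighbors[i]]
    total = sum(dists)
    return total / len(dists) if average else total

# Hypothetical example: three inputs mapped to neurons 0, 0 and 2.
hits = hits_map([0, 0, 2])
weights = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
neighbors = {0: [1, 2]}
avg_value = neighbor_distance(0, weights, neighbors, average=True)
acc_value = neighbor_distance(0, weights, neighbors, average=False)
```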
Chapter 3
Simulation and Evaluation Framework
Having explained the basic techniques in Chapter 2, this chapter deals with the generated
simulation framework and introduces a quality measure for evaluating the simulation
results. However, since preprocessing of data is inherently important, it will be explained
first.
As can be seen from table 1.1 (original data) there are certain types of data fields that
are irrelevant to the data mining process, such as dates and times of data acquisition.
There are also constant fields, which bear no distinguishing information at all.
These can safely be pruned to reduce the subsequent computational load without
influencing the data mining result.
Categorical data has to be expanded to suit the assumption of equal distances among the
data. For each column of categorical values, the number of different attribute values n is
calculated; subsequently, this column is replaced by n columns (one for each category)
which are filled with 1’s where the column’s category matches the original category and
with 0’s otherwise. This preprocessing step can dramatically change the dimensions of
the input data and might also lead to sparsely populated input tables. See table 3.2 for
a comprehensive example.
Category         A  B  C  D  E
A                1  0  0  0  0
B                0  1  0  0  0
C          ⇒     0  0  1  0  0
A                1  0  0  0  0
D                0  0  0  1  0
E                0  0  0  0  1
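The expansion shown in the table can be reproduced with a short sketch. It assumes that the categorical column is given as a list of strings and that the new columns are ordered alphabetically; the function name is illustrative.

```python
def expand_categorical(column):
    """Replace one categorical column by n indicator columns, one per category."""
    categories = sorted(set(column))
    rows = [[1 if value == cat else 0 for cat in categories] for value in column]
    return categories, rows

# The example column from the table above.
categories, rows = expand_categorical(["A", "B", "C", "A", "D", "E"])
```

Each row contains exactly one 1, so all categories are equally distant from each other, which is the assumption the expansion is meant to satisfy.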
Elimination of outliers
Among the original dataset in Table 1.1 it can be seen that there are certain outliers in
the data which are most probably caused by incorrect measurements from the production
lines (e.g. faulty meters or errors during transmission) and which manifest themselves by
being several orders of magnitude different from the remaining majority of the data.
Assuming that the normalisation is carried out in the usual way, even a single outlier is
sufficient to skew the ensuing data mining process. A more intuitive explanation can be
obtained from Figures 3.1 and 3.2, where the immediate effect of the outliers is obvious:
the majority of values is scaled to nearly 0 and the few outliers to 1. This renders the
data mining process nearly useless, since the precision of floating-point arithmetic is limited.
Finding the outliers was done manually as follows:
It shall be noted that the distribution of “good” and “bad” input vector references
did not undergo a major change after outliers were eliminated; the percentage of “good”
product references changed from 85.92% to 86.04% whereas the percentage of bad prod-
uct references was reduced by the respective amount.
Normalisation
Every data attribute was scaled to [0, 1] by using Equation 3.1, where $x_{min}$ and $x_{max}$
denote the minimum and maximum value of the attribute, respectively.
\[ x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{3.1} \]
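The following sketch applies Equation 3.1 to one attribute and illustrates why outliers have to be removed first. The attribute values, including the outlier of 1E+08, are made up for the example.

```python
def min_max_scale(values):
    """Scale the values of one attribute to [0, 1] as in Equation 3.1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant attribute: prune it before normalising")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute with and without a single extreme outlier.
with_outlier = min_max_scale([1.0, 2.0, 3.0, 1e8])
without_outlier = min_max_scale([1.0, 2.0, 3.0])
```

With the outlier present, the three regular values end up within about 2E-08 of each other, whereas they are spread over the whole interval once the outlier is removed.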
1
For each of the numerical attributes under consideration in this context, one or more outliers could
be found that were at least three orders of magnitude different from the rest of the attribute values.
This method works in this special case because of the attributes’ having a distribution similar to the
one shown in Figure 3.1.
Expansion of categorical data was performed in the same manner as in the original
preprocessing chain (Table 3.2).
Assumptions
As mentioned in Section 1.3 the given task is to apply the GSOM algorithm to produce
an automatic classification (clustering) of high-dimensional input data. The desired
clustering is known in advance from the data distribution and would be a binary clustering
at best, i.e. separating the “good” from the “bad” products on the GSOM. The data
available for developing the benchmarking method are (a) the original data’s statistical
values and its distribution between “good” and “bad”, and (b) the data from the
generated map, i.e. the distribution of input vector references among the map’s neurons.
The benchmark (named ‘clustering quality’) should yield a number between 0 and 1
as a quality measure of the input vector separation among the neurons. 0 means no
difference in data distribution, i.e. no clustering at all. 1 means a high probability of good
clustering, albeit visual map exploration is still necessary. The developed benchmark is
supposed to aid the user in evaluating the result, therefore a value of 1 does not guarantee
good clustering but can instead be seen as an indicator of potentially good clustering. A
simple example for a high benchmark rating in combination with an obviously bad map
could be where a benchmark value of 1 exists for an overtrained, non-clustered map with
as many neurons as input vectors and every input vector being assigned to exactly one
neuron. The algorithm’s block diagram is depicted in Figure 3.3.
A good separation of input vectors is achieved if every neuron contains only input
vectors from one class (i.e. “good” or “bad”) or no input vectors at all, in which case the
benchmark should yield 1. Bad separation (or no separation) should be benchmarked
with a 0.
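The condition for a good separation can be checked with a short sketch. Note that this is only the limiting case in which the benchmark should yield 1, not the clustering quality measure itself; the representation of the map as a list of (good, bad) hit counts per neuron is an assumption made for the example.

```python
def is_perfectly_separated(hits_per_neuron):
    """True if every neuron holds input vectors from at most one class.

    hits_per_neuron -- list of (n_good, n_bad) pairs, one pair per neuron.
    """
    return all(good == 0 or bad == 0 for good, bad in hits_per_neuron)

# Hypothetical maps: the first separates both classes, the second mixes them.
separated = is_perfectly_separated([(5, 0), (0, 3), (0, 0)])
mixed = is_perfectly_separated([(5, 1), (0, 3)])
```

As in the overtrained example above, a map with one input vector per neuron also passes this check, which is why visual exploration remains necessary.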
Conventions
Calculation
From Equation 3.2 it can easily be seen that the maximum CQ value of 1 will be achieved
if good and bad product references are mapped to separate clusters. On the other hand,
the minimum benchmark value of 0 will be returned if the data partitioning on the map is
the same as that amongst the input dataset. Colloquially expressed, CQ gives a measure
of how far the clustering on the map deviates from the original data distribution.
Testing
Two simple test datasets were generated to show the correctness of the developed
clustering quality measure (and, simultaneously, the functionality of the GSOM
implementation used). As the interest lies in assessing binary clustering quality, two
samples which represent the original dataset’s two different classes were taken from the
normalised dataset, slightly shortened (fewer attributes), and replicated 50 and 500
times, respectively. The resulting test datasets have dimensions of 59x100 and 59x1000
and, due to the synthetic generation, consist of two easily distinguishable classes. These
two datasets were then fed into the GSOM and showed the correctness of CQ. The
generated hits maps can be found in Figure 3.4. Both maps were computed with the
parameter set [SF=0.5, GP=3, SP1=SP2=0] and show the desired CQ value of 1 since
both classes are perfectly separated on the map (one class per red cluster).
Figure 3.4: Generated maps with test dataset, left: 59x100, right: 59x1000
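The construction of the two test datasets can be sketched as follows. The two sample vectors are placeholders, since the actual samples taken from the normalised dataset are not reproduced here; only the number of attributes (59) and the replication factors (50 and 500) are taken from the text.

```python
def replicate_two_classes(sample_a, sample_b, copies):
    """Build a dataset consisting of `copies` copies of each of the two samples."""
    return [list(sample_a) for _ in range(copies)] + [list(sample_b) for _ in range(copies)]

# Placeholder samples with 59 attributes each (the real values are not shown here).
good_sample = [0.1] * 59
bad_sample = [0.9] * 59
small_set = replicate_two_classes(good_sample, bad_sample, 50)   # 59 x 100
large_set = replicate_two_classes(good_sample, bad_sample, 500)  # 59 x 1000
```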
Number of inputs: As the most important factor determining the result of the GSOM
algorithm, the number and dimension of inputs can be chosen arbitrarily, although
the data has to be normalised beforehand. Special care has to be taken regarding
the size of the input data since the running time of the GSOM algorithm grows
exponentially with the size of the input. Simulations were run on (a) sampled
808x1638-, (b) 808x16380-, and (c) 808x15980-dimensional datasets.
Spread factor: This parameter determines the final size of the map by providing the
user with a convenient, high-level way to influence the map’s internal growth [8].
A larger spread factor means a more spread-out and therefore larger
final GSOM. Its range is ]0, 1[, and it was varied from 0.1 to 0.995 during simulations.
It also strongly influences the length of the training period.
Topology: Either hexagonal or rectangular, as explained in Chapter 2. A hexagonal
grid is used by default and was left unaltered, as it offers a balanced tradeoff
between neighborhood size and quantisation of the input space.
Kernel: The actual implementation provided the Gaussian kernel as well as the ‘bubble’
kernel. Again, see Chapter 2 for details. The Gaussian kernel is used in most of
the scientific literature and therefore went unaltered.
Similarity measure: Inside the GSOM algorithm, this measure is used in the calculation
of the Best-Matching Unit. The distance between the prototype and every
weight vector of the map is calculated using the chosen similarity measure and
the unit with the smallest distance is denoted as BMU. As the Euclidean distance
measure is best suited to most of the GSOM tasks, it was used here. Another
implemented method was Pearson’s correlation. Depending on the complexity of
the similarity measure, its choice can also have an impact on computation time.
2 ‘tested’: the effects of the respective parameter were evaluated in preliminary experiments
3 ‘adapted’: the parameter has been chosen for a systematic adaptation in later simulations
4 default options are denoted in emphasised font
Options per algorithm phase: learning length, learning rate, neighborhood size
• learning length:
During the course of the GSOM calculation, this parameter sets how many
times the input data is fed into the algorithm. As an integer multiplier, it
determines the final size of the map and therefore also the running time of
the algorithm. It was varied from 1 to 100 during preliminary experiments
and from 1 to 10 during the actual simulations. However, in the current task
the second and third phases had no actual effect on mappings and therefore
only increased computation time dramatically.
• learning rate:
This parameter is used while updating the winner’s and its neighbors’ weight
vectors and can be used to control the speed of the map’s adaptation to inputs.
It went unchanged from its defaults of 0.5 (1st phase), 0.1 (2nd phase), 0.01
(3rd phase). A large value is desired at first to keep the map flexible towards
inputs, whereas smaller values are used during the smoothing phases.
• neighborhood size:
Controls how many layers of neighboring neurons are affected by a weight
change of the winner unit. Decreases during the course of the algorithm from
3 (1st phase) to 1 (3rd phase).
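The options described above (similarity measure, kernel, learning rate, neighborhood size) can be illustrated with a minimal sketch of one training step. This is not the report's implementation: the class and method names are hypothetical, the update rule is the usual SOM form w ← w + α·h·(x − w), and using the neighborhood size directly as the width of the Gaussian kernel is an assumption made only for this example.

```java
import java.util.Arrays;

/** Illustrative sketch only; all names are hypothetical, not the report's code. */
final class SomStepSketch {

    /** Euclidean distance between input vector x and weight vector w. */
    static double euclidean(double[] x, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - w[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Dissimilarity 1 - r from Pearson's correlation r (assumes non-constant vectors). */
    static double pearsonDistance(double[] x, double[] w) {
        double meanX = Arrays.stream(x).average().orElse(0.0);
        double meanW = Arrays.stream(w).average().orElse(0.0);
        double sxw = 0.0, sxx = 0.0, sww = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dw = w[i] - meanW;
            sxw += dx * dw;
            sxx += dx * dx;
            sww += dw * dw;
        }
        return 1.0 - sxw / Math.sqrt(sxx * sww);
    }

    /** Index of the unit whose weight vector has the smallest Euclidean distance to x. */
    static int bestMatchingUnit(double[] x, double[][] weights) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < weights.length; i++) {
            double d = euclidean(x, weights[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    /** Gaussian kernel: 1 at the winner, decaying with grid distance (sigma is assumed). */
    static double gaussian(double gridDistance, double sigma) {
        return Math.exp(-(gridDistance * gridDistance) / (2.0 * sigma * sigma));
    }

    /** 'Bubble' kernel: full update inside the radius, none outside. */
    static double bubble(double gridDistance, double radius) {
        return gridDistance <= radius ? 1.0 : 0.0;
    }

    /** Moves w towards x in place: w += learningRate * kernel * (x - w). */
    static void update(double[] w, double[] x, double learningRate, double kernel) {
        for (int i = 0; i < w.length; i++) {
            w[i] += learningRate * kernel * (x[i] - w[i]);
        }
    }

    public static void main(String[] args) {
        double[][] weights = {{0.0, 0.0}, {1.0, 1.0}, {3.0, 3.0}};
        double[] x = {0.9, 1.2};
        int bmu = bestMatchingUnit(x, weights);
        // First-phase defaults from the text: learning rate 0.5, neighborhood size 3.
        update(weights[bmu], x, 0.5, gaussian(0.0, 3.0));
        System.out.println("BMU index: " + bmu
                + ", updated weights: " + Arrays.toString(weights[bmu]));
    }
}
```

A large learning rate moves the winner far towards the input, which matches the remark that large values keep the map flexible, while the small rates of the later phases only smooth the map.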
3.4.1 Ideas
In [8] three entangled ideas are proposed on how to proceed with the simulations after
having preprocessed the data. An additional, fourth, idea incorporates the generated
quality measure.
Certainly the most basic approach to adapting the map to the input data is to vary the map’s
size by varying the spread factor appropriately. Since SF takes values from the interval
]0,1[ (see Equation 2.6), it is useful to start with small spread factors and increase them
successively in discrete steps. The user should then be able to examine the generated
maps and check for existing clusters of data. This approach was pursued systematically,
among others.
[Figure: original data and the sampled data sets 1, 2, ..., n−1, n]
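A stepwise sweep over the spread factor could be sketched as follows. The sketch assumes that Equation 2.6 has the form given in [8], GT = −D · ln(SF), where D is the input dimension and GT the growth threshold; the class name, method names and the value of D in the example are hypothetical.

```java
/** Illustrative sketch only; names and the example dimension are hypothetical. */
final class SpreadFactorSweep {

    /** Growth threshold, assuming the definition from [8]: GT = -D * ln(SF). */
    static double growthThreshold(int dimension, double spreadFactor) {
        if (spreadFactor <= 0.0 || spreadFactor >= 1.0) {
            throw new IllegalArgumentException("SF must lie in the open interval ]0,1[");
        }
        return -dimension * Math.log(spreadFactor);
    }

    /** Spread factors from start to end (inclusive) in fixed steps. */
    static double[] schedule(double start, double end, double step) {
        // Compute the count once and index from the start to avoid accumulating
        // floating-point error in a running sum.
        int n = (int) Math.floor((end - start) / step + 1e-9) + 1;
        double[] sf = new double[n];
        for (int i = 0; i < n; i++) {
            sf[i] = start + i * step;
        }
        return sf;
    }

    public static void main(String[] args) {
        int dimension = 10; // hypothetical input dimension
        for (double sf : schedule(0.1, 0.9, 0.1)) {
            System.out.printf("SF=%.1f  GT=%.3f%n", sf, growthThreshold(dimension, sf));
        }
    }
}
```

Under this assumption a larger SF gives a smaller growth threshold, so nodes are added sooner and the map grows larger, which is consistent with the larger maps reported for higher spread factors.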
It may be necessary, and insightful, to study the effect that removing several attributes
from the input dataset has on the generated maps. This will generate
and/or confirm knowledge about possible non-contributing attributes. Preliminary work
on this approach has already been done; Figures 3.6 and 3.7 show promising
discriminatory distances for several attributes, whereas numerous other attributes
seem to be less distinguishing. The effect of removing outliers can also be
recognised along the y-axis, which shows the difference between the average values of
the ‘good’ and ‘bad’ classes; before preprocessing, a few outliers had skewed this
difference.
Unfortunately, the manufacturer required that no attributes be left out
of the data mining process, hence this approach has not been pursued any further.
The process of systematically growing maps with varying spread factors is best suited
to being automated for saving computation time, using the dimensionality-independent
spread factor to name and compare different maps. However, since most of the maps took
several hours and even days to compute, the time for evaluating these maps manually
and starting new simulations with different parameters could be neglected, especially
Figure 3.6: Distances between logarithmic attribute averages, before eliminating outliers
(x-axis: attribute number, y-axis: distance)
Figure 3.7: Distances between logarithmic attribute averages, after eliminating outliers
(x-axis: attribute number, y-axis: distance)
To assess the objective quality of the generated maps easily, the benchmark formula
introduced in Section 3.3 was applied to the maps; it expresses the quality of the clustering,
compared to the known original data distribution, as a single percentage.
Complete data Simulations with the same parameters as for the sampled data were
run on the complete dataset. Additional simulations were run with constant spread
factors of 0.8 and 0.9, varying the learning length of the growing phase
from 1 to 10. The quality measure was applied to the latter
simulations.
Complete data As with the sampling approach, prohibitive time restrictions prevented
running simulations in the same extensive manner. For comparable results,
spread factors of 0.8 and 0.9 were chosen and the learning length was varied in analogy
to the later simulations run on the old data. However, to assess the influence of
different spread factors on the map growth, additional simulations were run with
SF varying from 0.1 to 0.9.
Chapter 4
Simulation Results
This chapter explains the obtained simulation results. The results themselves are
graphed in condensed form at the end of this chapter, whereas the underlying numerical
data as well as additional maps can be found in the appendix.
vector mappings at all (which is what we are focused on), these phases were neglected
in subsequent computations.
Two exemplary hits maps are depicted in Figure C.1 and show no apparent clustering
at all, but instead exhibit a behaviour of aligning mappings regularly throughout the map
grid.
Results very similar to those for the sampled data above were obtained for the complete
dataset; they are graphed in Figure 4.2. Additionally, it can be seen that the sampling
approach would save a large amount of computation time. Unfortunately, the longest
simulation (denoted with ‘*’ in Table B.2) could not be finished due to administrative
restrictions at the simulation laboratory. By that time, however, it had already been
running for 15 consecutive days, which illustrates the inherent computational complexity
of the GSOM algorithm.
Unexpectedly, the training error TE exhibits abnormal behaviour, which will
manifest again in subsequent results; QE shows the expected decline as the number
of growing phases rises.
To keep the resulting maps comparable to the ones depicted for sampled data, maps
resulting from the same parameter set are shown in Figure C.2. These appear
to be much better than the ones for sampled data, but closer examination of the data
distribution reveals that no clustering has been performed by the GSOM algorithm.
Having finished simulations with SF=0.7, and having established that varying the SP1
and SP2 lengths does not change the mapping of input vectors to map neurons, only SF
and the number of GP were changed systematically; the obtained results are depicted
in Figure 4.3. Numerical results can be found in Table B.3, from which it can also be
seen that an increase in SF from 0.8 to 0.9 roughly doubles computation times and map
sizes. QE behaves as expected, declining steadily, and is therefore not graphed; TE again
shows non-converging behaviour with both spread factors used.
The current parameter set is also the first to feature the quality measure CQ introduced in
Chapter 3; it shows rising clustering quality towards higher SF and GP (Figure 4.3,
right). As explained in the definition of this quality measure, a higher CQ does
not guarantee a higher subjective clustering quality, but it can indicate which parameters
to change to achieve better results or, to put it another way, serve to compare maps
generated with different parameters.
Four maps for the current parameter set can be found in Figures C.3 and C.4. The
depicted maps seem to be of higher quality than the ones in the last section; they also
yield a higher CQ value, but they are still quite far from showing perfect separation
of the two classes.
Again, the left graph of Figure 4.4 shows non-converging behaviour of TE throughout
the different parameter sets. On the other hand, CQ reliably rises with higher SF and
towards longer GP, as can be seen in the right graph of Figure 4.4. Four maps with the same
parameters as in preceding sections are depicted in Figures C.5 and C.6; direct
comparison to the maps generated from the old dataset shows significant subjective
improvements for those maps generated with low spread factors, whereas the aforementioned
‘regular alignment of inputs’ on the grid can be seen on the right of Figure C.6.
SF = [0.1 ... 0.9], GP = 1
Figure 4.5 shows the effects of leaving GP constant at a low value and varying SF from
0.1 to 0.9. TE as well as QE should decline with larger SF, but TE seems to
alternate whereas QE roughly behaves as expected. Similarly unexpected results were
obtained for CQ (right graph), which fluctuates erratically although it was expected
to rise steadily.
The current parameter set is well suited to demonstrating the growth process of
the GSOM with different spread factors; an example of the hits maps for SF = 0.1 ... 0.9
is depicted in Figure C.7. The faster the map grows, the faster different input clusters
become recognisable on the self-organising map.
[Plots, captions not recoverable: time [s] and map size, and TE and QE, against the number
of growing phases (1 to 10); TE and CQ against the number of growing phases (1 to 10),
with curves for SF 0.8 and SF 0.9]
Figure 4.5: Complete new data with GP = 1 and SF = [0.1 ... 0.9], TE/QE and CQ
(x-axis: spread factor)
Figure 4.6: Comparison of old and new data CQ, with SF = 0.8 and 0.9
(panels: clustering quality for SF = 0.8 and for SF = 0.9, old and new data; x-axis: number
of growing phases; y-axis: CQ)
Chapter 5
Summary and Conclusion
[Diagram, labels only: pruning, elimination of outliers, sampling; parameters;
maps (avr, dist, hits, err), subjective; statistics (TE, QE, size), objective]
Preprocessing: It starts, of course, with the data itself of which we have only little or
no previous knowledge, a fact that is to be remedied by the ensuing data mining.
Thorough preprocessing has been performed upon the dataset, including pruning,
the elimination of outliers, normalisation and expansion of categorical data. A
straightforward sampling approach to reduce inherent computational complexity
has been implemented and tested.
Processing: This step accounts for 95% of the overall computing time involved in
this work. Important GSOM parameters have been identified systematically and have later
been applied according to a self-generated schedule. The algorithm’s output consists, on
the one hand, of the generated maps, which are neighborhood-preserving projections of
the high-dimensional input space onto two-dimensional maps. On the other hand, useful
measurements (TE, QE, map size) are generated during the computation of the maps.
Furthermore, CQ, a measure aiding in evaluating the maps, has been introduced,
justified, implemented, and tested.
Evaluation: The evaluation of the results obtained from the GSOM algorithm completes
the data mining process. It is up to the user to draw conclusions from
the data, but the user’s subjective skill in evaluating the maps is supported by
(semi-)objective measures such as TE, QE and CQ from the processing step. Above
all, however, the evaluation step needs one thing: experience.
5.1.2 Conclusion
Summary: Overall, it can justifiably be assumed that the GSOM algorithm is able
to deal with high-dimensional datasets and can be used to find distinguishing
attributes. Given the extensive set of simulations which were run systematically
and the numerous approaches to dealing with the given data, it can also be said
that the resulting maps are a good starting point for follow-up improvements.
Limitations: Since the results are similar throughout the different parameter sets, the
error is probably not situated inside the processing part of Figure 5.1. Having
taken expert human knowledge into consideration in the evaluation step, this part
can also be taken as correct. Therefore, with the evident improvement in CQ due
exclusively to different preprocessing steps (Figure 4.6) kept in mind, the most
probable cause of not fully satisfactory results seems to be the preprocessing part.
5.1.3 Recommendations
Even with the results being rather far from satisfactory, a number of recommendations
can be given for future GSOM simulations with the manufacturing dataset
used throughout this report. Running numerous, computationally heavy experiments
cannot be avoided with an algorithm whose demands are inherently near-exponential.
Therefore, all of the recommendations below target the reduction of computation time,
enabling the user to run more experiments in the same amount of time.
Reduce dimension of inputs: Since the running time is inherently dependent on the
number and dimension of inputs, try to reduce both the number of inputs and the number
of attributes. This can be done by sampling and pruning, as demonstrated in Section 3.1.
Use adequate parameter sets: Start with small spread factors and short growing
phases. Continually raise SF and GP until sufficient map sizes have been achieved.
For the manufacturing dataset, spread factors between 0.7 and 0.9 and the number
of growing phases between 1 and 5 could be shown to be adequate. Contingent,
questionable benefits of larger parameter sets do not pay off in terms of computa-
tion time and, more importantly, subjective map quality.
Get powerful computing equipment: For the full manufacturing dataset under
consideration, in conjunction with the introduced Java implementation of the GSOM
algorithm, try to get hold of at least 1024 MB of RAM and a fast Intel Pentium-IV
(or similar) CPU.
uniform weights. This might lead to an overvaluation of the categorical attributes which
have been expanded.
Hence, it is proposed to introduce some changes to the distance calculation in the
GSOM algorithm and add a weighting scheme as follows:
Introduce weight vector for attributes: This weight vector w will contain weights
for every attribute: 1 for numerical attributes and 1/n for each attribute of an
expanded category, where n is the number of distinct attributes of the respective
category.
Change calculation of BMU: Equation 2.1 will be changed into the following version,
reducing the excess influence of categorical attributes by weighting them:
\[ \| w \ast x - m_c \| = \min_{i} \{ \| w \ast x - m_i \| \} \qquad (5.1) \]
Table 5.1 extends the example from Table 3.2 and introduces the weights as an additional
vector.
Category            A     B     C     D     E
1         weight   1/5   1/5   1/5   1/5   1/5
A                   1     0     0     0     0
B                   0     1     0     0     0
C           ⇒       0     0     1     0     0
A                   1     0     0     0     0
D                   0     0     0     1     0
E                   0     0     0     0     1
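The proposed weighting can be made concrete with a small sketch. It follows Equation 5.1 literally, taking w ∗ x as the element-wise product of weight vector and input; the class name, method names and the layout (numerical attributes first, then one expanded category) are assumptions made for illustration and are not taken from the GSOM implementation.

```java
/** Illustrative sketch of the proposed weighting; all names are hypothetical. */
final class WeightedBmuSketch {

    /** Expands a category index (0-based) into n binary attributes, as in Table 5.1. */
    static double[] expand(int categoryIndex, int n) {
        double[] expanded = new double[n];
        expanded[categoryIndex] = 1.0;
        return expanded;
    }

    /** Weight 1 for each numerical attribute, 1/n for each of the n expanded attributes. */
    static double[] attributeWeights(int numericCount, int n) {
        double[] w = new double[numericCount + n];
        for (int i = 0; i < w.length; i++) {
            w[i] = (i < numericCount) ? 1.0 : 1.0 / n;
        }
        return w;
    }

    /** Norm of (w * x - m), with w * x taken element-wise, as written in Equation 5.1. */
    static double weightedDistance(double[] w, double[] x, double[] m) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = w[i] * x[i] - m[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index c of the prototype m_c that minimises the weighted distance to x. */
    static int weightedBmu(double[] w, double[] x, double[][] prototypes) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < prototypes.length; i++) {
            double d = weightedDistance(w, x, prototypes[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One numerical attribute followed by a category with five distinct values.
        double[] w = attributeWeights(1, 5);
        double[] category = expand(2, 5); // category 'C' of A..E
        double[] x = new double[6];
        x[0] = 0.4;
        System.arraycopy(category, 0, x, 1, 5);
        double[][] prototypes = {
            {0.9, 0.0, 0.0, 0.0, 0.0, 0.0},
            {0.4, 0.0, 0.0, 0.2, 0.0, 0.0}
        };
        System.out.println("weighted BMU index: " + weightedBmu(w, x, prototypes));
    }
}
```

With these weights the five expanded attributes of one category together contribute at most as much as a single numerical attribute, which is the overvaluation the proposal aims to remove.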
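Appendix A
Java source code for sampling
{
    // Fragment: ilayer, indices, rand and sample_size are declared outside the
    // lines reproduced here.
    System.out.println("DrawSampleSet, value of n" + sample_size);
    PrintWriter pw = null;
    try
    {
        // Build the output name by inserting "_sample_<n>" before the last four
        // characters of the input file name.
        String oldFile = ilayer.getFilename();
        String newFile = oldFile.substring( 0, oldFile.length()-4 ) +
            "_sample_" + sample_size + oldFile.substring( oldFile.length()-4 );
        System.out.println("new filename ought to be: " + newFile);
        pw = new PrintWriter( new FileOutputStream(newFile) );
    }
    catch (Exception e)
    {
        e.printStackTrace();
        return; // pw is still null here; continuing would throw a NullPointerException
    }
    pw.println();
    int c = 0;
    // Draw up to sample_size rows at random, without replacement.
    while (indices.size() > 0 && c < sample_size)
    {
        int randIdx = rand.nextInt(indices.size());
        int sampleIdx = Integer.parseInt((String)indices.elementAt(randIdx));
        float[] sample = ilayer.getRow(sampleIdx);
        // One row per line: tab-separated attribute values, then the class name.
        for (int i = 0; i < sample.length; i++)
            pw.print( sample[i] + "\t" );
        pw.println( ilayer.getClassName(sampleIdx) );
        indices.removeElementAt( randIdx ); // each row is drawn at most once
        c++;
    }
    pw.close();
}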
{
    if ( ilayer == null ) return;
    Vector indices = new Vector();
    Random rand = new Random( System.currentTimeMillis() );
    PrintWriter pw = null;
    int rem = ilayer.length % number_of_partitions;
    int length_minus_rem = ilayer.length-rem;
    int samples_per_part = (int)(length_minus_rem/number_of_partitions);
    int c=0;
    try
    {
        String oldFile = ilayer.getFilename();
        for (int j = 0; j < number_of_partitions; j++)
        {
            String newFile = oldFile.substring( 0, oldFile.length()-4 )
                + "_sample" +j + oldFile.substring( oldFile.length()-4 );
            pw = new PrintWriter( new FileOutputStream(newFile) );
            /* If last sample file is to be written,
               include ALL remaining samples in it.
               Realized in this case by altering the initial
               value of the counter c in while loop below */
            if ((number_of_partitions-j)==1)
                c=rem*(-1);
            else
                c=0;
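Since the listing above breaks off before the writing loop, the partitioning idea it describes (equal-sized random partitions, with the last one absorbing the remainder) can be sketched in self-contained form. This is an illustration under assumptions, not a reconstruction of the missing lines: it works on row indices only, writes no files, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; not the report's missing code. */
final class PartitionSketch {

    /**
     * Splits the row indices 0..rowCount-1 into `parts` random, disjoint partitions.
     * Every partition holds rowCount / parts indices; the last one additionally
     * receives the remainder, mirroring the comment in the listing above.
     */
    static List<List<Integer>> partition(int rowCount, int parts, long seed) {
        if (parts <= 0 || parts > rowCount) {
            throw new IllegalArgumentException("parts must be between 1 and rowCount");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            indices.add(i);
        }
        // Shuffling once and cutting the list is equivalent to repeatedly drawing
        // random indices without replacement.
        Collections.shuffle(indices, new Random(seed));

        int perPart = rowCount / parts;
        List<List<Integer>> result = new ArrayList<>();
        for (int j = 0; j < parts; j++) {
            int from = j * perPart;
            int to = (j == parts - 1) ? rowCount : from + perPart;
            result.add(new ArrayList<>(indices.subList(from, to)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = partition(10, 3, System.currentTimeMillis());
        for (int j = 0; j < parts.size(); j++) {
            System.out.println("partition " + j + ": " + parts.get(j));
        }
    }
}
```

Passing the seed explicitly, rather than always seeding from the clock as the listing does, makes a given partitioning reproducible across runs.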
Appendix B
Tabular simulation results
Table B.1: Simulation results for sampled data, SF=0.7, varying GP/SP1/SP2
Table B.2: Simulation results for old complete data, SF=0.7, varying GP/SP1/SP2
Table B.3: Simulation results for old complete data, SF=0.8 and 0.9, GP=[1 ... 10]
Table B.4: Simulation results for new complete data, SF=0.8 and 0.9, GP=[1 ... 10]
Table B.5: Simulation results for new complete data, SF=[0.1 ... 0.9], GP=1
Appendix C
Resulting maps
Figure C.1: Hits maps, sampled old data, SF=0.7, GP=1 and 8, SP1=SP2=0
Figure C.2: Hits maps, complete old data, SF=0.7, GP=1 and 8, SP1=SP2=0
Figure C.3: Hits maps, complete old data, SF=0.8, GP=1 and 8, SP1=SP2=0
Figure C.4: Hits maps, complete old data, SF=0.9, GP=1 and 8, SP1=SP2=0
Figure C.5: Hits maps, complete new data, SF=0.8, GP=1 and 8, SP1=SP2=0
Figure C.6: Hits maps, complete new data, SF=0.9, GP=1 and 8, SP1=SP2=0
Declaration of Independent Work
I hereby declare that I have produced the present work independently and using only
permitted aids.
Bibliography
[1] Damminda Alahakoon. Controlling the spread of dynamic self organising maps.
Neural Computing and Applications, 13:168–174, 2004.
[2] Th. Villmann and H.-U. Bauer. Growing a hypercubical output space in a
self-organizing feature map. Technical Report TR-95-030, International Computer
Science Institute, 1947 Center St., Suite 600, Berkeley, California 95704-1198, July
1995.
[3] Jaakko Hollmén. Process modeling using the self-organizing map. Master’s thesis,
Helsinki University of Technology, February 1996.
[4] Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen. WEBSOM—self-
organizing maps of document collections. In Proceedings of WSOM’97, Workshop on
Self-Organizing Maps, Espoo, Finland, June 4-6, pages 310–315. Helsinki University
of Technology, Neural Networks Research Centre, Espoo, Finland, 1997.
[5] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela.
Self organization of a massive document collection, 2000.
[6] Teuvo Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information
Sciences. Springer, Berlin, Heidelberg, New York, third extended edition, 1995,
1997, 2001.
[7] M. Koskela. Content-based image retrieval with self-organizing maps. Master’s
thesis, Helsinki University of Technology, 1999.
[8] Damminda Alahakoon, Saman K. Halgamuge, and Bala Srinivasan. Dynamic
self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions
on Neural Networks, 11(3):601–614, May 2000.
[9] Markus Törmä. Self-organizing neural networks in feature extraction. In
International Archives of Photogrammetry and Remote Sensing, ISPRS XVIIIth Congress,
Vienna, Austria, 1996.
[10] Juha Vesanto. Data mining techniques based on the self-organizing map. Master’s
thesis, Helsinki University of Technology, May 1997.