
Chemometrics and Intelligent Laboratory Systems 118 (2012) 24–32


journal homepage: www.elsevier.com/locate/chemolab

Short Communication

A MATLAB toolbox for Self Organizing Maps and supervised neural network
learning strategies
Davide Ballabio a,*, Mahdi Vasighi b
a Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, University of Milano-Bicocca, Milano, Italy
b Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran

Article info

Article history:
Received 17 April 2012
Received in revised form 5 July 2012
Accepted 14 July 2012
Available online 22 July 2012
Keywords:
Self Organizing Maps
Supervised pattern recognition
Artificial Neural Networks
MATLAB
Kohonen maps

Abstract
Kohonen maps and Counterpropagation Neural Networks are two of the most popular learning strategies based on Artificial Neural Networks. Kohonen Maps (or Self Organizing Maps) are self-organizing systems suited to unsupervised rather than supervised problems, while Counterpropagation Artificial Neural Networks are very similar to Kohonen maps, but an output layer is added to the Kohonen layer in order to handle supervised modelling. Recently, modifications of Counterpropagation Artificial Neural Networks have led to new supervised neural network strategies, such as Supervised Kohonen Networks and XY-fused Networks.
In this paper, the Kohonen and CP-ANN toolbox for MATLAB is described. This is a collection of modules for calculating Kohonen maps and derived methods for supervised classification, such as Counterpropagation Artificial Neural Networks, Supervised Kohonen Networks and XY-fused Networks. The toolbox comprises a graphical user interface (GUI), which allows the calculations to be carried out in an easy-to-use graphical environment. It aims to be useful for both beginners and advanced users of MATLAB. The use of the toolbox is discussed here with a practical example.
© 2012 Elsevier B.V. All rights reserved.

1. Introduction
Kohonen maps (or Self Organizing Maps, SOMs) are one of the most popular learning strategies among the several Artificial Neural Network algorithms proposed in the literature [1]. They are increasingly used for several different tasks and nowadays can be considered an important tool in multivariate statistics [2]. Kohonen maps are self-organizing systems able to solve unsupervised rather than supervised problems. As a consequence, methods based on the Kohonen approach but combining characteristics from both supervised and unsupervised learning have been introduced. Counterpropagation Artificial Neural Networks (CP-ANNs) are very similar to Kohonen maps, except that an output layer is added to the Kohonen layer [3]. When dealing with classification issues, CP-ANNs are generally efficacious methods for modelling classes separated by non-linear boundaries. Recently, modifications to CP-ANNs led to the introduction of new supervised neural network strategies, such as Supervised Kohonen Networks (SKNs) and XY-fused Networks (XY-Fs) [4].
As a consequence of the increasing success of Self Organizing Maps, some toolboxes for calculating supervised and unsupervised SOMs were proposed in the literature [5–8]. The Kohonen and CP-ANN toolbox for MATLAB was originally developed in order to calculate unsupervised Kohonen maps and supervised classification models by means of CP-ANNs in an easy-to-use graphical user interface (GUI) environment [9]. Recently, several new features and algorithms (SKNs, XY-Fs, batch training, optimization of network settings by means of Genetic Algorithms) were introduced in the toolbox. This work presents the latest version of the Kohonen and CP-ANN toolbox, which is a collection of MATLAB modules freely available via Internet (http://www.disat.unimib.it/chm) along with examples and a comprehensive user manual released as HTML files.

* Corresponding author at: Dept. of Environmental Sciences, University of Milano-Bicocca, P.zza della Scienza, 1 - 20126 Milano, Italy. Tel.: +39 02 6448 2801; fax: +39 02 6448 2839.
E-mail address: davide.ballabio@unimib.it (D. Ballabio).
0169-7439/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2012.07.005

2. Methodological background
2.1. Notation
Scalars are indicated by italic lower-case characters (e.g. $x_{ij}$) and vectors by bold lower-case characters (e.g. $\mathbf{x}$). Two-dimensional arrays (matrices) are denoted as $\mathbf{X}$ ($I \times J$), where $I$ is the number of samples and $J$ the number of variables. The $ij$-th element of the data matrix $\mathbf{X}$ is denoted as $x_{ij}$ and represents the value of the $j$-th variable for the $i$-th sample.
2.2. Kohonen maps
The toolbox was developed following the algorithm described in the paper by Zupan, Novič and Ruisánchez [10]. Only a brief description of Kohonen maps is given, since all the details can be found in the quoted paper.
The Kohonen map is usually a square toroidal space that consists of a grid of $N^2$ neurons, where $N$ is the number of neurons for each side of the space (Fig. 1a). Given a multivariate dataset composed of $I$ samples described by $J$ experimental variables, each neuron is associated with $J$ weights, that is, it contains as many elements (weights) as the number of variables. The weights of each neuron are initialized between 0 and 1 and updated on the basis of the $I$ samples, for a certain number of times (termed training epochs). Kohonen maps can be trained by means of sequential or batch training algorithms [1].
When sequential training is adopted, in each training epoch samples are randomly introduced in the network, one at a time. For each sample ($\mathbf{x}_i$), the most similar neuron (i.e. the winning neuron) is selected on the basis of the minimum Euclidean distance. Then, the weights of the $r$-th neuron ($\mathbf{w}_r$) are changed as a function of the difference between their values and the values of the sample; this correction ($\Delta \mathbf{w}_r$) is scaled according to the topological distance from the winning neuron ($d_{ri}$):

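$$\Delta \mathbf{w}_r = \eta \left(1 - \frac{d_{ri}}{d_{\max} + 1}\right)\left(\mathbf{x}_i - \mathbf{w}_r^{\mathrm{old}}\right) \qquad (1)$$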

where $\eta$ is the learning rate and $d_{\max}$ the size of the considered neighbourhood, which decreases during the training phase. The topological distance $d_{ri}$ is defined as the number of neurons between the considered neuron $r$ and the winning neuron. The learning rate changes during the training phase, as follows:


 
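$$\eta = \left(\eta^{\mathrm{start}} - \eta^{\mathrm{final}}\right)\left(1 - \frac{t}{t_{\mathrm{tot}}}\right) + \eta^{\mathrm{final}} \qquad (2)$$
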

where $t$ is the number of the current training epoch, $t_{\mathrm{tot}}$ is the total number of training epochs, and $\eta^{\mathrm{start}}$ and $\eta^{\mathrm{final}}$ are the learning rates at the beginning and at the end of the training, respectively.
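
The sequential scheme of Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration, not the toolbox's MATLAB code: the function names, the dictionary-based map and the use of a square-ring (Chebyshev) distance as topological distance on a toroidal map of square neurons are assumptions made for this sketch, and $d_{\max}$ is simply passed in by the caller rather than shrunk automatically.

```python
import random


def learning_rate(t, t_tot, eta_start=0.5, eta_final=0.01):
    """Eq. (2): learning rate at epoch t, decreasing linearly to eta_final."""
    return (eta_start - eta_final) * (1 - t / t_tot) + eta_final


def ring_distance(a, b, n):
    """Topological distance on an n x n toroidal map of square neurons.

    Assumption of this sketch: the distance is the index of the square
    ring around the winner, with wrap-around on both axes.
    """
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return max(min(dr, n - dr), min(dc, n - dc))


def winning_neuron(weights, x):
    """Grid position whose weights have minimum Euclidean distance to x."""
    return min(weights, key=lambda pos: sum((xj - wj) ** 2
                                            for xj, wj in zip(x, weights[pos])))


def sequential_epoch(weights, samples, n, t, t_tot, d_max, rng):
    """One epoch of Eq. (1); weights maps (row, col) -> list of J weights."""
    eta = learning_rate(t, t_tot)
    order = list(range(len(samples)))
    rng.shuffle(order)  # samples are presented in random order
    for i in order:
        x = samples[i]
        win = winning_neuron(weights, x)
        for pos in list(weights):
            d = ring_distance(pos, win, n)
            if d > d_max:
                continue  # neuron lies outside the current neighbourhood
            factor = eta * (1 - d / (d_max + 1))
            weights[pos] = [w + factor * (xj - w)
                            for w, xj in zip(weights[pos], x)]
    return weights
```

In this sketch the winner itself ($d_{ri} = 0$) receives the full correction $\eta$, while neurons at the edge of the neighbourhood receive $\eta / (d_{\max} + 1)$.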
When the batch training is used, the whole set of samples is
presented to the network and winner neurons are found; after this,
the weights are calculated on the basis of the effect of all the samples,
at the same time:
$$\mathbf{w}_r = \frac{\sum_{i=1}^{I} u_{ir}\, \mathbf{x}_i}{\sum_{i=1}^{I} u_{ir}} \qquad (3)$$

where $\mathbf{w}_r$ are the updated weights of the $r$-th neuron, $\mathbf{x}_i$ is the $i$-th sample, and $u_{ir}$ is the weighting factor of the winning neuron related to the $i$-th sample with respect to neuron $r$:

$$u_{ir} = \eta \left(1 - \frac{d_{ri}}{d_{\max} + 1}\right) \qquad (4)$$

where $\eta$, $d_{\max}$ and $d_{ri}$ are defined as before (see Eq. (1)).
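
A minimal Python sketch of the batch update of Eqs. (3) and (4) follows. As above, it is an illustration under stated assumptions rather than the toolbox implementation: the names are hypothetical, the topological distance is taken as a square-ring distance on a toroidal map, samples whose winner lies farther than $d_{\max}$ from a neuron are given zero weight, and a neuron that receives no contribution keeps its old weights.

```python
def ring_distance(a, b, n):
    """Square-ring distance on an n x n toroidal map (assumed definition)."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return max(min(dr, n - dr), min(dc, n - dc))


def winning_neuron(weights, x):
    """Grid position whose weights have minimum Euclidean distance to x."""
    return min(weights, key=lambda pos: sum((xj - wj) ** 2
                                            for xj, wj in zip(x, weights[pos])))


def batch_epoch(weights, samples, n, eta, d_max):
    """Eqs. (3)-(4): each neuron becomes a weighted mean of all samples."""
    # Winners are found once, with the weights from the previous epoch.
    winners = [winning_neuron(weights, x) for x in samples]
    new_weights = {}
    for pos, w_old in weights.items():
        numerator = [0.0] * len(w_old)
        denominator = 0.0
        for x, win in zip(samples, winners):
            d = ring_distance(pos, win, n)
            u = eta * (1 - d / (d_max + 1)) if d <= d_max else 0.0  # Eq. (4)
            denominator += u
            numerator = [acc + u * xj for acc, xj in zip(numerator, x)]
        if denominator > 0:
            new_weights[pos] = [acc / denominator for acc in numerator]
        else:
            new_weights[pos] = list(w_old)  # no sample reaches this neuron
    return new_weights
```

Since $\eta$ multiplies every $u_{ir}$, it appears in both the numerator and the denominator of Eq. (3) and cancels in this sketch; the result is driven by the neighbourhood term alone.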
At the end of the network training, samples are placed in the most similar neurons of the Kohonen map; in this way, the data structure can be visualized and the role of experimental variables in defining the data structure can be elucidated by looking at the Kohonen weights.

Fig. 1. Structures of Kohonen maps and related methods (CP-ANNs, SKNs, and XY-Fs) for a generic dataset constituted by $J$ variables and $G$ classes. Notation in the figure refers to notation used in the text: $x_{ij}$ represents the value of the $j$-th variable for the $i$-th sample, $w_{rj}$ represents the value of the $j$-th Kohonen weight for the $r$-th neuron, $c_{ig}$ represents the membership of the $i$-th sample to the $g$-th class expressed with a binary code, and $y_{rg}$ represents the value of the $g$-th output weight for the $r$-th neuron.


2.3. Counterpropagation Artificial Neural Networks


Counterpropagation Artificial Neural Networks (CP-ANNs) are modelling methods which combine features from both supervised and unsupervised learning [10]. CP-ANNs consist of two layers, a Kohonen layer and an output layer, whose neurons have as many weights as the number of classes to be modelled (Fig. 1b). The class vector is used to define a matrix $\mathbf{C}$, with $I$ rows and $G$ columns, where $I$ is the number of samples and $G$ the total number of classes; each entry $c_{ig}$ of $\mathbf{C}$ represents the membership of the $i$-th sample to the $g$-th class expressed with a binary code (0 or 1).
When sequential training is adopted, the weights of the $r$-th neuron in the output layer ($\mathbf{y}_r$) are updated in a supervised manner on the basis of the winning neuron selected in the Kohonen layer. Considering the class of each sample $i$, the update is calculated as follows:

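$$\Delta \mathbf{y}_r = \eta \left(1 - \frac{d_{ri}}{d_{\max} + 1}\right)\left(\mathbf{c}_i - \mathbf{y}_r^{\mathrm{old}}\right) \qquad (5)$$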

where $d_{ri}$ is the topological distance between the considered neuron $r$ and the winning neuron selected in the Kohonen layer; $\mathbf{c}_i$ is the $i$-th row of the unfolded class matrix $\mathbf{C}$, that is, a $G$-dimensional binary vector representing the class membership of the $i$-th sample.
On the other hand, if batch training is used, the weights of the output layer are changed following the same algorithm shown in the previous paragraph (see Eqs. (3) and (4)).
At the end of the network training, each neuron of the Kohonen
layer can be assigned to a class on the basis of the output weights
and all the samples placed in that neuron are automatically assigned
to the corresponding class.
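
The supervised part of a CP-ANN can be sketched as follows: unfolding the class vector into the binary matrix $\mathbf{C}$, applying Eq. (5) to one output neuron, and assigning a neuron to a class. The Python below is illustrative only. The function names are hypothetical, and assigning each neuron to the class with the largest output weight is an assumption of this sketch (the toolbox offers several assignment criteria).

```python
def class_matrix(labels, n_classes):
    """Unfold integer labels 1..G into the binary matrix C (I rows, G columns)."""
    return [[1.0 if g == label else 0.0 for g in range(1, n_classes + 1)]
            for label in labels]


def update_output_weights(y_old, c, eta, d, d_max):
    """Eq. (5) for one neuron at topological distance d from the winner."""
    factor = eta * (1 - d / (d_max + 1))
    return [y + factor * (cg - y) for y, cg in zip(y_old, c)]


def assign_neuron(y):
    """Class (1-based) with the largest output weight (assumed criterion)."""
    return max(range(len(y)), key=lambda g: y[g]) + 1


# Example: a winning neuron (d = 0) updated with a sample of class 2.
C = class_matrix([1, 2, 2], n_classes=2)
y_new = update_output_weights([0.5, 0.5], C[1], eta=0.5, d=0, d_max=1)
```

After this single update the output weights move halfway towards the class vector of the sample, so the neuron would be assigned to class 2 under the assumed criterion.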
2.4. XY-fused Networks
XY-fused Networks (XY-Fs) are supervised neural networks for building classification models derived from Kohonen Maps (Fig. 1c). In XY-fused Networks, the winning neuron is selected by calculating Euclidean distances between a) the sample ($\mathbf{x}_i$) and the weights of the Kohonen layer, and b) the class membership vector ($\mathbf{c}_i$) and the weights of the output layer. These two Euclidean distances are then combined to form a fused similarity, which is used to find the winning neuron. The influence of distances calculated on the Kohonen layer decreases linearly during the training epochs, while the influence of distances calculated on the output layer increases. Details on XY-fused Networks can be found in the paper by Melssen, Wehrens and Buydens [4].
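
The fused selection of the winning neuron can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the published algorithm: the two distances are combined as a plain weighted sum, no normalisation of the distances is applied, and the start and end values of the Kohonen-layer influence (0.75 and 0.5) are illustrative assumptions; see [4] for the actual formulation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def kohonen_influence(t, t_tot, start=0.75, end=0.5):
    """Weight of the Kohonen-layer distance, decreasing linearly with epoch t.

    The start and end values are hypothetical, chosen only for illustration.
    """
    return start + (end - start) * t / t_tot


def fused_winner(kohonen_w, output_w, x, c, alpha):
    """Neuron minimising alpha * d(x, w_r) + (1 - alpha) * d(c, y_r)."""
    return min(kohonen_w,
               key=lambda r: alpha * euclidean(x, kohonen_w[r])
               + (1 - alpha) * euclidean(c, output_w[r]))


# Two neurons: "A" is closer in the variable space, "B" matches the class.
kohonen_w = {"A": [0.0], "B": [0.2]}
output_w = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
```

With `alpha = 1.0` only the Kohonen layer counts and neuron "A" wins for the sample `x = [0.05]`, `c = [0.0, 1.0]`; with `alpha = 0.5` the class information dominates and "B" wins.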
2.5. Supervised Kohonen Networks (SKNs)
Like CP-ANNs and XY-Fs, Supervised Kohonen Networks (SKNs) are supervised neural networks derived from Kohonen Maps and used to calculate classification models (Fig. 1d). In Supervised Kohonen Networks, Kohonen and output layers are glued together to give a combined layer that is updated according to the training scheme of Kohonen maps. Each sample ($\mathbf{x}_i$) and its corresponding class vector ($\mathbf{c}_i$) are combined and act as input for the network. In order to achieve classification models with good predictive performance, $\mathbf{x}_i$ and $\mathbf{c}_i$ must be scaled properly. Therefore, a scaling coefficient for $\mathbf{c}_i$ is introduced for tuning the influence of the class vector in the model calculation. Details on SKNs can be found in the paper by Melssen, Wehrens and Buydens [4].
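
As a small illustration of the combined input, the Python sketch below concatenates a sample with its scaled class vector. The function name and the plain multiplication by the scaling coefficient are assumptions of this sketch; the combined vector would then be treated like any other input of a Kohonen map.

```python
def skn_input(x, c, scalar=1.0):
    """Combined SKN input: variables followed by the scaled class vector."""
    return list(x) + [scalar * cg for cg in c]


# A sample with two variables belonging to class 2 (of two classes).
combined = skn_input([0.2, 0.7], [0.0, 1.0], scalar=2.0)
```

A larger coefficient makes the class part weigh more in the Euclidean distance used to find the winning neuron, while a smaller one lets the variables dominate.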
3. Main features of the Kohonen and CP-ANN toolbox
The toolbox was initially developed under MATLAB 6.5 (Mathworks), but it is compatible with the latest releases of MATLAB. The collection of functions and algorithms is provided as MATLAB source files, with no requirement for third-party utilities beyond the standard MATLAB installation. The files just need to be copied into a folder. The model calculation can be performed both via the MATLAB command window and via a graphical user interface, which enables the user to perform all the analysis steps.
3.1. Input data
Data must be structured as a numerical matrix with dimensions $I \times J$, where $I$ is the number of samples and $J$ the number of variables. When dealing with supervised classification, the class vector must be prepared as a column numerical vector ($I \times 1$), where the $i$-th element of this vector represents the class label of the $i$-th sample. If $G$ classes are present, class labels must be integer numbers ranging from 1 to $G$. Note that 0 values are not allowed as class labels.
Data sets with missing values can be handled by the toolbox. Basically, missing values (and the corresponding values of the neuron weights) are not considered when calculating Euclidean distances to find the closest neuron and when updating the neuron weights.
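
The treatment of missing values described above can be sketched in Python as follows. The encoding of missing values as NaN and the function names are assumptions made for this illustration; the sketch only shows the idea of skipping missing entries in both the distance and the weight update.

```python
import math


def masked_squared_distance(x, w):
    """Squared Euclidean distance over the variables that are not missing in x."""
    return sum((xj - wj) ** 2 for xj, wj in zip(x, w) if not math.isnan(xj))


def masked_update(w, x, factor):
    """Move w towards x by `factor`, leaving weights of missing variables unchanged."""
    return [wj if math.isnan(xj) else wj + factor * (xj - wj)
            for wj, xj in zip(w, x)]


# Sample with the second variable missing (NaN is an assumed encoding).
sample = [1.0, math.nan]
neuron = [0.0, 5.0]
```

Here the distance between `sample` and `neuron` is computed on the first variable only, and an update leaves the second weight at its previous value.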
3.2. Network settings
Kohonen Maps have adaptable parameters that must be chosen prior to calculation. Network settings can be defined in the GUI or via the MATLAB command window by means of the som_setting function and can be stored in a MATLAB data structure. Each field of this structure defines a specific setting for the network. All the available settings are listed in Table 1.

Table 1
Network settings available in the toolbox.

| Setting | Description | Possible values | Default |
|---|---|---|---|
| net_type | Type of neural network | Kohonen, cpann, skn, xyf | NaN |
| nsize | Number of neurons for each side of the map | Any integer number greater than zero | NaN |
| epochs | Number of training epochs | Any integer number greater than zero | NaN |
| topol | Topology condition | Square, hexagonal | square |
| bound | Boundary condition | Toroidal, normal | toroidal |
| training | Training algorithm | Batch, sequential | batch |
| init | Initialization of weights | Random, eigen | random |
| a_max | Initial learning rate | Any real number between 0.9 and 0.1 | 0.5 |
| a_min | Final learning rate | Any real number between 0 and the initial learning rate | 0.01 |
| scaling | Data scaling (prior to automatic range scaling) | None, centering, variance scaling, auto scaling | none |
| absolute_range | Type of automatic range scaling | Classical, absolute | classical |
| ass_meth | Neuron assignment criterion (only for CP-ANNs, SKNs and XY-Fs) | Four different criteria (1, 2, 3 or 4) | 1 |
| scalar | Scaling coefficient for tuning the effect of class vector (only for SKNs) | Any real number greater than 0 | 1 |

The network size (nsize) defines the number of neurons for each side of the map. If the number of neurons for each side is set to $N$, the total number of neurons will be $N^2$. The number of epochs (epochs) is the number of times each sample is introduced in the network. The boundary condition (bound) defines whether the space of the Kohonen Map is normal or toroidal. The topology condition (topol) defines the shape of each neuron (square or hexagonal). The training algorithm can be defined by the field training. Sequential or batch training algorithms are available.
Learning rates ($\eta^{\mathrm{start}}$ and $\eta^{\mathrm{final}}$) can be modified by changing the values in a_max and a_min, respectively. Values of $\eta^{\mathrm{start}}$ and $\eta^{\mathrm{final}}$ are set by default at 0.5 and 0.01, respectively, as suggested in the literature [10]. When dealing with supervised classification, the user can also define a criterion for assigning neurons to the classes on the basis of their output weights (ass_meth) [9]. Initialization of Kohonen weights can be defined by the field init. Kohonen weights can be initialized either randomly (between 0.1 and 0.9) or on the basis of the eigenvectors corresponding to the two largest principal components of the dataset [1]. In this second case, weights are always initialized to the same values. Therefore, when the initialization of Kohonen weights is based on the eigenvectors and coupled with the batch training algorithm, the final weights are always the same, since both random initialization and random introduction of samples into the Kohonen map are avoided. When dealing with Supervised Kohonen Networks (SKNs), the scaling coefficient for tuning the effect of the class vector can be defined in the scalar field. This scaling coefficient is set by default at 1.
Regarding data scaling, it must be noted that variables are always range scaled between 0 and 1, in order to be comparable with the network weights [10]. The range scaling can be performed separately on each column (variable) of the dataset or by using the maximum and minimum values of the entire dataset (absolute_range). This second option can be used when all the variables are defined on the same scale, such as for profiles and spectral data. Moreover, the user can define different methods of data scaling in the setting structure (scaling), to be applied prior to the automatic range scaling.

Table 2
MATLAB routines of the toolbox related to the calculation of Kohonen maps and their main outputs. For each routine, outputs are collected as fields of a unique MATLAB structure.

| MATLAB routine | Description | Outputs | Description |
|---|---|---|---|
| model_kohonen | Fitting of Kohonen maps | | Kohonen weights stored in a 3-way data matrix with dimensions N × N × J, where N is the number of neurons on each side of the map and J is the number of variables |
| | | settings | Settings used for building the model |
| | | scal | Structure with scaling parameters |
| | | top_map | Coordinates of the samples in the Kohonen top map |
| pred_kohonen | Prediction with Kohonen maps | top_map | Coordinates of the predicted samples in the Kohonen top map |

3.3. Optimization of the network architecture by means of Genetic Algorithms
Kohonen Maps require an optimization step in order to choose the most suitable network architecture. When dealing with classification models, CP-ANNs, SKNs, and XY-Fs require the selection of appropriate numbers of neurons and training epochs in order to make accurate predictions. The relationship between architecture and network performance cannot be easily established and depends on many parameters, such as the number of samples and their distribution in the data space. Searching for the best architecture is usually performed by heuristic methods, and one of the major disadvantages of these multivariate statistical models is probably related to the network optimization, since this procedure suffers from some arbitrariness and can be time-expensive in some cases. Recently, a new strategy for the selection of the optimal number of neurons and training epochs was proposed [11]. This strategy exploits the ability of Genetic Algorithms to optimize network parameters [12–15]. Details on this approach can be found in the quoted paper. This strategy for optimizing the network architecture has been introduced in the toolbox and can be run both via the graphical user interface and in the MATLAB command window. Once the optimization has been performed, the optimization results can be easily saved, loaded and analyzed in the graphical user interface. Details on how to perform optimization are given in the section describing the illustrative example of analysis.

Table 3
MATLAB routines of the toolbox related to the calculation of CP-ANNs, SKNs and XY-Fs and their main outputs. For each routine, outputs are collected as fields of a unique MATLAB structure.

| MATLAB routine | Description | Outputs | Description |
|---|---|---|---|
| model_cpann, model_skn, model_xyf | Fitting of CP-ANN, SKN and XY-F models | | Kohonen weights stored in a 3-way data matrix with dimensions N × N × J, where N is the number of neurons on each side of the map and J is the number of variables |
| | | W_out | Output weights stored in a 3-way data matrix with dimensions N × N × G, where N is the number of neurons on each side of the map and G is the number of classes |
| | | neuron_ass | Vector with neuron assignments |
| | | settings | Settings used for building the model |
| | | scal | Structure with scaling parameters |
| | | class_true | True class vector |
| | | class_calc | Calculated class vector |
| | | class_weights | Output weights associated to samples |
| | | top_map | Coordinates of the samples in the Kohonen top map |
| | | class_param | Structure containing classification parameters (confusion matrix, error rate, specificity, sensitivity and precision) |
| cv_cpann, cv_skn, cv_xyf | Cross-validation of CP-ANN, SKN and XY-F models | settings | Settings used for cross validating the model |
| | | class_true | True class vector |
| | | class_pred | Class vector calculated in cross validation |
| | | class_weights | Output weights associated to samples in cross validation |
| | | class_param | Structure containing cross validated classification parameters |
| pred_cpann, pred_skn, pred_xyf | Prediction with CP-ANN, SKN and XY-F models | class_pred | Predicted class vector |
| | | class_weights | Output weights associated to samples in prediction |
| | | top_map | Coordinates of the predicted samples in the Kohonen top map |


Fig. 2. Kohonen and CP-ANN toolbox: main graphical interface.

3.4. Calculating models


Once data have been prepared and settings have been defined, the user can easily calculate the Kohonen network by using the model_kohonen function via the MATLAB command window. The output of the routine is a structure with several fields containing all the results (Table 2).
Supervised classification models can be calculated by using CP-ANNs, SKNs, or XY-Fs via the MATLAB command window. The MATLAB functions associated with these methods are listed in Table 3. The output of these functions is a structure, where results concerning the output layer and indices describing classification performance are stored together with the results concerning the Kohonen layer (Table 3). In particular, the output weights are stored in a three-way data matrix with dimensions $N \times N \times G$, where $G$ is the number of modelled classes. The assignment of each neuron is saved as well as the consequent assignment of each sample placed in the neuron. Finally, the confusion matrix is provided. This is a square matrix with dimensions $G \times G$ where each entry $n_{gk}$ represents the number of samples belonging to class $g$ and assigned to class $k$. The best-known classification indices, such as error rate, non-error rate, specificity, sensitivity, precision and ratio of not assigned samples, are derived from the confusion matrix [16].
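
How such indices follow from the confusion matrix can be sketched in Python. The code is illustrative and independent of the toolbox: the function names are hypothetical, every sample is assumed to be assigned to a class, each class is assumed to have at least one true and one predicted sample, and the error rate is computed here as the fraction of misclassified samples, which is one common definition (the toolbox may define it differently, for instance from the class sensitivities).

```python
def confusion_matrix(true, pred, n_classes):
    """G x G matrix whose entry [g-1][k-1] counts samples of class g assigned to k."""
    n = [[0] * n_classes for _ in range(n_classes)]
    for g, k in zip(true, pred):
        n[g - 1][k - 1] += 1
    return n


def class_indices(n, g):
    """Sensitivity, specificity and precision of class g (1-based)."""
    i = g - 1
    tp = n[i][i]
    fn = sum(n[i]) - tp                      # class g assigned elsewhere
    fp = sum(row[i] for row in n) - tp       # other classes assigned to g
    tn = sum(map(sum, n)) - tp - fn - fp
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision": tp / (tp + fp)}


def error_rate(n):
    """Fraction of misclassified samples (assumed definition)."""
    correct = sum(n[i][i] for i in range(len(n)))
    return 1 - correct / sum(map(sum, n))


n = confusion_matrix([1, 1, 1, 2, 2], [1, 1, 2, 2, 2], n_classes=2)
```

For this toy example one sample of class 1 is assigned to class 2, so class 1 has a sensitivity of 2/3 while its specificity and precision are both 1.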
Cross-validation can be performed by means of the functions listed in Table 3, by choosing the number of cancellation groups and the cross-validation method for separating the samples into cancellation groups (venetian blinds or contiguous blocks). The output of this routine is a MATLAB structure containing the confusion matrix and the derived classification indices calculated in cross-validation.
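
The two ways of separating samples into cancellation groups can be illustrated with a short Python sketch. The function names and the exact handling of group sizes that do not divide evenly are assumptions of this sketch, not a description of the toolbox code.

```python
def venetian_blinds(n_samples, n_groups):
    """Group index per sample: samples are dealt to groups in turn (0, 1, 2, 0, ...)."""
    return [i % n_groups for i in range(n_samples)]


def contiguous_blocks(n_samples, n_groups):
    """Group index per sample: consecutive samples share the same group."""
    return [i * n_groups // n_samples for i in range(n_samples)]


# Seven samples split into three cancellation groups.
blinds = venetian_blinds(7, 3)
blocks = contiguous_blocks(7, 3)
```

With venetian blinds the groups are interleaved along the sample order, whereas contiguous blocks keep neighbouring samples together; the choice matters when the samples are sorted, for example by class.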
Unknown or test samples can be predicted by using an existing model: new samples are compared with the trained Kohonen weights, placed in the closest neuron and assigned to the corresponding class. This calculation can be made in the toolbox by means of the functions listed in Table 3, which return a structure containing the class assignment of the new samples.

Fig. 3. Kohonen and CP-ANN toolbox: interactive graphical interface for visualizing the Kohonen top map.

Fig. 4. Kohonen and CP-ANN toolbox: interactive graphical interface for visualizing the optimization results.

3.5. Calculating models via the graphical user interface
The following command must be executed at the MATLAB prompt to run the graphical interface (Fig. 2):
>> model_gui

The user can load data, sample and variable labels, and the class vector when dealing with supervised classification, either from the MATLAB workspace or from MATLAB files. Then, in addition to basic operations, such as looking at the data, plotting variable means and sample profiles, all the calculation steps described in the previous paragraphs can be easily performed in the graphical interface.
Optimization of the network structure to choose the optimal number of epochs and neurons can be performed directly in a dedicated window. Once the user has decided how to set the network, settings and parameters for cross-validation can be defined in a dedicated window, where basic and advanced settings are separated in order to help practitioners who are not familiar with SOMs.
Once a model has been calculated, results of the optimization step, models, settings and cross-validation results can be exported to the MATLAB workspace. Saved models can be easily loaded in the toolbox for future analyses; likewise, new samples can be loaded and predictions can be calculated on the basis of previously calculated models.
When dealing with supervised classification, the user can graphically evaluate indices for classification diagnostics (confusion matrix, error rate, non-error rate, specificity, sensitivity, purity) and analyze ROC (Receiver Operating Characteristic) curves. These are graphical tools for the analysis of classification results and describe the degree of separation of classes. ROC curves are graphical plots of 1 − Specificity (also known as False Positive Rate, FPR) and Sensitivity (also known as True Positive Rate, TPR) as x and y axes, respectively, for a binary classification system as its discrimination threshold is changed. In this toolbox, ROC curves are separately calculated for each class, by changing the assignment threshold over the output weights from 0 to 1.
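
The construction of a ROC curve for one class can be sketched in Python as follows. This is an illustrative reading of the description above: the function name is hypothetical, a sample is counted as assigned to the class when its output weight for that class is greater than or equal to the threshold (an assumption), and both the class and its complement are assumed to contain at least one sample.

```python
def roc_points(scores, is_class, thresholds):
    """(FPR, TPR) pairs for one class as the assignment threshold changes.

    scores:   output weight of the class for each sample
    is_class: True where the sample really belongs to the class
    """
    positives = sum(is_class)
    negatives = len(is_class) - positives
    points = []
    for thr in thresholds:
        tp = sum(1 for s, y in zip(scores, is_class) if y and s >= thr)
        fp = sum(1 for s, y in zip(scores, is_class) if not y and s >= thr)
        points.append((fp / negatives, tp / positives))
    return points


# Four samples, the first two belonging to the class under study.
curve = roc_points([0.9, 0.8, 0.3, 0.1], [True, True, False, False],
                   thresholds=[0.0, 0.5, 1.0])
```

In this toy case the two classes are perfectly separated by their output weights, so the curve passes through the point (FPR = 0, TPR = 1).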

3.6. Visualizing results via the graphical user interface


Once the model has been calculated, the Kohonen top map can be visualized in the toolbox graphical interface (Fig. 3). The Kohonen top map represents the space defined by the neurons where the samples are placed and allows visual investigation of the data structure by analyzing the sample positions and their relationships. Samples are visualized by randomly scattering their positions within each neuron space; by means of the "update" button, it is possible to move the sample positions within the neuron. Samples can be labelled with different strings: identification numbers, class labels in the case of supervised classification, or user-defined labels. Moreover, the map can be shifted if the chosen boundary condition is toroidal, in order to optimize the map visualization.
The influence of variables in describing data can be evaluated by coloring neurons on the basis of the Kohonen weights by means of the "Display weights" list. In this way, neurons are colored from white (weight equal to zero, minimum value) to black (weight equal to 1, maximum value). Therefore, one can evaluate whether the considered variable has a direct relationship with the sample distribution in the space of the top map. Moreover, both Kohonen and output weights of a selected neuron can be displayed by means of the "get neuron weights" button.
However, the analysis of the Kohonen top map only allows plotting all the weights for a specific neuron or all the neurons for a specific weight; that is, all the available information cannot be plotted simultaneously. When dealing with complex data, high dimensional spaces are common; in these cases, it is not easy to interpret the data with a simple visual approach.
For this reason, the toolbox allows the calculation of Principal Component Analysis (PCA) on the Kohonen weights, in order to investigate the relationships between variables and classes in a global way and not one variable at a time [17]. A GUI for calculating PCA on the Kohonen weights is provided in the toolbox. Details on its use are given in the section describing the illustrative example of analysis.

Table 4
Example of analysis: some of the indices calculated by the toolbox and used for classification diagnostics. Error rate, non-error rate, specificity, sensitivity and precision obtained in fitting, cross-validation (10 cancellation groups) and on the external test set of samples are shown.

| Classification parameter | Fitting | Cross-validation | External test set |
|---|---|---|---|
| Non-error rate | 0.97 | 0.97 | 0.96 |
| Error rate | 0.03 | 0.03 | 0.04 |
| Precision of class 1 | 0.99 | 0.99 | 0.98 |
| Precision of class 2 | 0.94 | 0.94 | 0.91 |
| Sensitivity of class 1 | 0.97 | 0.97 | 0.95 |
| Sensitivity of class 2 | 0.98 | 0.98 | 0.97 |
| Specificity of class 1 | 0.98 | 0.98 | 0.97 |
| Specificity of class 2 | 0.97 | 0.97 | 0.95 |

Fig. 5. Example of analysis: a) variable profile for each class produced by the toolbox. In this plot, the average of the Kohonen weights of each variable calculated on the neurons assigned to each class is shown; b) plot of ROC curves produced by the toolbox.

4. Illustrative example: classification of multivariate data


This example uses the Breast Cancer dataset, which is a real benchmark dataset for classification [18]. The dataset consists of 699 samples divided into 2 classes, class 1 as Benign (458 samples) and class 2 as Malignant (241 samples). Samples are described by 9 variables (Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses) which take on discrete values in the range 1–10. Kohonen maps are not directly treated here since they are implicitly calculated as the Kohonen layer of CP-ANNs.
25% of the samples were randomly extracted and used as external test samples while maintaining the class proportions, that is, the number of test samples of each class was proportional to the number of training samples of that class. Training samples were used to optimize the network architecture and to build and cross-validate the CP-ANN classification model. External test samples were used only to evaluate the predictive ability of the final CP-ANN model.
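
A class-proportional random extraction of test samples can be sketched in Python as follows. The function name, the fixed seed and the rounding of the per-class test size are assumptions of this sketch; the exact split used in the example is not reproduced here.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Indices of training and test samples, keeping the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    test = []
    for indices in by_class.values():
        rng.shuffle(indices)  # random extraction within each class
        n_test = round(test_fraction * len(indices))
        test.extend(indices[:n_test])
    test_set = set(test)
    train = [i for i in range(len(labels)) if i not in test_set]
    return train, sorted(test)


# Class sizes of the dataset described above: 458 benign, 241 malignant.
labels = [1] * 458 + [2] * 241
train_idx, test_idx = stratified_split(labels)
```

Because the extraction is done class by class, the ratio between benign and malignant samples is approximately the same in the training and test sets.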

4.1. Selection of the optimal numbers of neurons and epochs


The optimal numbers of neurons and epochs were calculated by means of Genetic Algorithms, as previously explained. Optimization results can be easily analyzed in the graphical user interface (Fig. 4). Each bubble represents a network architecture. The size of each bubble is proportional to the network size, that is, the number of neurons. The color of the bubbles is proportional to the number of epochs, that is, the darker the bubble, the higher the number of epochs used to train the network. This plot enables qualitative interpretation of the results: architectures placed in the upper right part of the plot are appropriate, since they are characterized by high relative frequencies of selection by Genetic Algorithms and high predictive performance [11]. As a consequence, the architectures placed at the top right limit of the plot can be considered the most suitable ones, such as the architecture marked in red in Fig. 4, representing a neural network optimised with 4 × 4 neurons and 250 epochs. The list of all the represented architectures with their number of neurons, epochs, frequency of selection in the GA runs and average of the fitness function can be seen by clicking the "view results in table" button. By clicking the "select" button, it is possible to select a specific bubble (architecture) in the plot and see its corresponding numbers of epochs and neurons, frequency of selection and value of the fitness function.
4.2. Calculation of the classication model
Fig. 6. Example of analysis: a) Kohonen top map produced by the toolbox. In the top map, each sample is labelled on the basis of its class. Each neuron is colored with a gray scale on
the basis of the Kohonen weight of variable 2 (uniformity of cell size): white corresponds to a Kohonen weight equal to 0, black to a Kohonen weight equal to 1; b) profile of Kohonen
weights of one of the neurons where samples of class 2 (malignant) were placed.

On the basis of the optimization results obtained by means of Genetic Algorithms, the numbers of neurons and epochs were set to
4 × 4 and 250, respectively. Table 4 shows the classification indices
calculated in the toolbox. The classification performances
refer both to fitting and cross-validation, executed with 10 cancellation groups selected by venetian blinds. These classification indices
can be accessed by clicking on the "classification results" button in
the toolbox main form, as well as the plot of Kohonen weight averages for each class (class profile) and ROC curves. In these plots, it is
possible to see that a) class 2 (Malignant) is characterised by higher
values on all the considered variables (Fig. 5a); b) the degree of separation between the two classes is high in the ROC curves (Fig. 5b). Finally, the model can be saved in the MATLAB workspace and later
loaded in the toolbox to predict new sets of samples. This was done
on the external test samples of the data set under analysis. Table 4
also shows the classification indices calculated on the external test set.
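As an illustration of the cross-validation scheme mentioned above, the following sketch shows how samples are commonly assigned to cancellation groups with venetian blinds: sample i goes to group i modulo the number of groups, so each group collects every n-th sample. The function names are hypothetical, and the toolbox's own implementation may differ in detail.

```python
from typing import Iterator, List, Tuple


def venetian_blinds(n_samples: int, n_groups: int) -> List[int]:
    """Return the cancellation group of each sample (sample i -> i mod n_groups)."""
    if n_groups < 2 or n_groups > n_samples:
        raise ValueError("n_groups must be between 2 and n_samples")
    return [i % n_groups for i in range(n_samples)]


def cv_splits(n_samples: int, n_groups: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, left-out indices) for each cancellation group."""
    groups = venetian_blinds(n_samples, n_groups)
    for g in range(n_groups):
        train = [i for i, gi in enumerate(groups) if gi != g]
        left_out = [i for i, gi in enumerate(groups) if gi == g]
        yield train, left_out


# Toy example: 25 samples and 10 cancellation groups.
splits = list(cv_splits(25, 10))
print(splits[0][1])  # indices left out in the first group
```

Because the assignment depends on sample order, venetian blinds are usually applied to data that are not sorted in a way that would place all samples of one class in the same group.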
4.3. Interpreting the results with the graphical interface
The classification indices provided by the toolbox can help the
user to evaluate the overall classification performance, but it is also important to gain insight into the model by interpreting the relationships between samples and
variables. This can be done by analyzing the Kohonen
top map, where samples are projected in order to evaluate the data
structure, while variable importance can be analyzed by coloring
the neurons on the basis of the neuron weights, which always
lie between 0 and 1.
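The two operations described above can be sketched as follows. The sketch assumes that a sample is placed in the neuron whose weight vector is closest in Euclidean distance, and that a weight of 0 is drawn white and a weight of 1 black. The function names and the toy 2 × 2 map are hypothetical and are not taken from the toolbox.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def winning_neuron(sample: List[float], weights: Dict[Position, List[float]]) -> Position:
    """Return the map position whose weight vector is closest to the sample."""
    def squared_distance(w: List[float]) -> float:
        return sum((s - wi) ** 2 for s, wi in zip(sample, w))
    return min(weights, key=lambda pos: squared_distance(weights[pos]))


def gray_level(weight: float) -> int:
    """Map a weight in [0, 1] to an 8-bit gray value: 0 -> 255 (white), 1 -> 0 (black)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return round(255 * (1.0 - weight))


# Invented 2 x 2 map with two variables per neuron, for illustration only.
toy_weights = {
    (0, 0): [0.1, 0.2],
    (0, 1): [0.9, 0.8],
    (1, 0): [0.2, 0.7],
    (1, 1): [0.6, 0.4],
}
position = winning_neuron([0.85, 0.90], toy_weights)

# Gray value of every neuron when the map is colored by the second variable.
colors = {pos: gray_level(w[1]) for pos, w in toy_weights.items()}
print(position, colors)
```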
As an example, the top map of the calculated model (4 × 4 neurons
and 250 epochs) is shown in Fig. 6a. In the top map, each sample is
labelled on the basis of its class, while neurons are colored on the
basis of the Kohonen weight of variable 2 (Uniformity of Cell Size),
going from low values (white) to high values (black). It is reasonably
easy to see that variable 2 discriminates between samples belonging to class 1
(Benign) and class 2 (Malignant), the latter being placed in neurons with
higher weights. In addition, the user can plot all the Kohonen
and output weights of a selected neuron. The profile of Kohonen
weights of one of the neurons where class 2 samples are placed is
shown in Fig. 6b. However, inspecting one variable or one neuron at a time does not give a comprehensive
insight into the relationships between variables and samples. For this
reason, a tool for calculating PCA on the Kohonen weights is provided
in the graphical user interface of the toolbox. Fig. 7 shows the score and
loading plots of the first two components, which together explain
74% of the total information. In the score
plot (Fig. 7a), each point represents a neuron of the previous CP-ANN model. Each neuron is colored with a gray scale on the basis of
the output weight of class 2: the larger the value of the output weight,
the higher the probability that the neuron belongs to class 2 and the
darker the color. The majority of neurons assigned to class 2 are
placed on the left side of the score plot. Thus, by comparing score and
loading plots, one can evaluate how variables characterize classes. All
variables are placed on the left of the loading plot (Fig. 7b); the variables are therefore directly correlated with samples belonging to class 2
(Malignant), that is, samples of class 2 are characterized by higher
values of all the considered variables.
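The idea of running PCA on the Kohonen weights can be sketched with a small standard-library example, in which each row of the data matrix is a neuron and each column is a variable. The sketch extracts only the first principal component by power iteration on the covariance matrix of the mean-centered weights. The function names and weight values are invented for illustration, and the preprocessing and algorithm used by the toolbox may differ.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def column_center(rows: Matrix) -> Matrix:
    """Subtract the column mean from every entry."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered: Matrix) -> Matrix:
    """Sample covariance matrix (variables x variables) of centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov: Matrix, iterations: int = 500) -> Tuple[List[float], float]:
    """Leading eigenvector (loadings) and eigenvalue by power iteration.

    Assumes the uniform starting vector is not orthogonal to the leading
    eigenvector, which holds for the toy data below.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigenvalue


# Invented Kohonen weights: 4 neurons (rows) x 2 variables (columns).
kohonen_weights = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]

centered = column_center(kohonen_weights)
cov = covariance(centered)
loadings, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
scores = [sum(x * l for x, l in zip(row, loadings)) for row in centered]
print(loadings, explained, scores)
```

In this toy case both variables have loadings of the same sign on the first component, so neurons with high scores have high weights on both variables. This is the kind of joint reading of score and loading plots described above. The sign of a principal component is arbitrary, so the side of the plot on which a class appears has no meaning in itself.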
5. Independent testing
Dr. Federico Marini, at the Chemistry Department, Università di
Roma "La Sapienza", P.le Aldo Moro 5, I-00185 Rome, Italy, informed
that he has tested the described software and found that it appears
to function as the Authors described.
6. Conclusion
The Kohonen and CP-ANN toolbox for MATLAB is a collection of
modules for calculating Self Organizing Maps (Kohonen maps) and derived methods for supervised classification, such as Counterpropagation
Artificial Neural Networks (CP-ANNs), Supervised Kohonen Networks
(SKNs) and XY-fused Networks (XY-Fs).

Fig. 7. Example of analysis: a) score plot of the first two principal components calculated on the Kohonen weights. Each neuron is colored with a gray scale on the basis of the
output weight corresponding to class 2 (malignant): white corresponds to an output
weight equal to 0, black to an output weight equal to 1; b) loading plot of the first two
principal components calculated on the Kohonen weights. Each variable is labelled
with its identification number.

The toolbox is regularly updated and is freely available via the Internet
from the Milano Chemometrics and QSAR Research Group website
(http://www.disat.unimib.it/chm). It aims to be useful for both beginners and advanced users of MATLAB. For this reason, examples and a
comprehensive user manual are provided with the toolbox.
The toolbox comprises a graphical user interface (GUI), which allows calculations to be carried out in an easy-to-use graphical environment. In the
GUI, all the analysis steps (data loading, model settings, optimization,
calculation, cross-validation, prediction and results visualization) can
be easily performed.

References
[1] T. Kohonen, Self-Organization and Associative Memory, Springer Verlag, Berlin,
1988.
[2] F. Marini, Analytica Chimica Acta 635 (2009) 121–131.
[3] J. Zupan, M. Novic, J. Gasteiger, Chemometrics and Intelligent Laboratory Systems
27 (1995) 175–187.
[4] W. Melssen, R. Wehrens, L. Buydens, Chemometrics and Intelligent Laboratory
Systems 83 (2006) 99–113.
[5] J. Vesanto, J. Himberg, E. Alhoniemi, J. Parhankangas, SOM Toolbox for Matlab 5,
Technical Report A57, Helsinki University of Technology, 2000.
[6] M. Schmuker, F. Schwarte, A. Brück, E. Proschak, E. Tanrikulu, A. Givehchi, K.
Scheiffele, G. Schneider, Journal of Molecular Modeling 13 (2007) 225–228.
[7] I. Kuzmanovski, M. Novic, Chemometrics and Intelligent Laboratory Systems 90
(2008) 84–91.
[8] J. Aires-de-Sousa, Chemometrics and Intelligent Laboratory Systems 61 (2002)
167–173.
[9] D. Ballabio, V. Consonni, R. Todeschini, Chemometrics and Intelligent Laboratory
Systems 98 (2009) 115–122.
[10] J. Zupan, M. Novic, I. Ruisánchez, Chemometrics and Intelligent Laboratory Systems
38 (1997) 1–23.
[11] D. Ballabio, M. Vasighi, V. Consonni, M. Kompany-Zareh, Chemometrics and Intelligent
Laboratory Systems 105 (2011) 56–64.
[12] I. Kuzmanovski, S. Dimitrovska-Lazova, S. Aleksovska, Analytica Chimica Acta 595
(2007) 182–189.
[13] I. Kuzmanovski, M. Trpkovska, B. Soptrajanov, Journal of Molecular Structure
744–747 (2005) 833–838.
[14] D. Polani, Kohonen Maps, In: On the optimisation of self-organising maps by genetic
algorithms, Elsevier, Amsterdam, 1999.
[15] I. Kuzmanovski, M. Novic, M. Trpkovska, Analytica Chimica Acta 642 (2009)
142–147.
[16] D. Ballabio, R. Todeschini, Infrared Spectroscopy for Food Quality Analysis and
Control, In: Multivariate Classification for Qualitative Analysis, Elsevier, 2008.
[17] D. Ballabio, R. Kokkinofta, R. Todeschini, C.R. Theocharis, Chemometrics and Intelligent
Laboratory Systems 87 (2007) 78–84.
[18] W.H. Wolberg, O.L. Mangasarian, Proceedings of the National Academy of Sciences
of the United States of America 87 (1990) 9193–9196.