
Chemometrics and Intelligent Laboratory Systems 118 (2012) 24–32


A MATLAB toolbox for Self Organizing Maps and supervised neural network learning strategies
Davide Ballabio a,*, Mahdi Vasighi b

a Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, University of Milano-Bicocca, Milano, Italy
b Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran

ARTICLE INFO

Article history:
Received 17 April 2012
Received in revised form 5 July 2012
Accepted 14 July 2012
Available online 22 July 2012

Keywords:
Self Organizing Maps
Supervised pattern recognition
Artificial Neural Networks
MATLAB
Kohonen maps

ABSTRACT

Kohonen maps and Counterpropagation Neural Networks are two of the most popular learning strategies based on Artificial Neural Networks. Kohonen maps (or Self Organizing Maps) are self-organizing systems capable of solving unsupervised rather than supervised problems, while Counterpropagation Artificial Neural Networks are very similar to Kohonen maps, with an output layer added to the Kohonen layer in order to handle supervised modelling. Recently, modifications of Counterpropagation Artificial Neural Networks led to the introduction of new supervised neural network strategies, such as Supervised Kohonen Networks and XY-fused Networks. In this paper, the Kohonen and CP-ANN toolbox for MATLAB is described. This is a collection of modules for calculating Kohonen maps and derived methods for supervised classification, such as Counterpropagation Artificial Neural Networks, Supervised Kohonen Networks and XY-fused Networks. The toolbox comprises a graphical user interface (GUI), which allows the calculations to be performed in an easy-to-use graphical environment. It aims to be useful for both beginners and advanced users of MATLAB. The use of the toolbox is illustrated here with a practical example.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Kohonen maps (or Self Organizing Maps, SOMs) are one of the most popular learning strategies among the several Artificial Neural Network algorithms proposed in the literature [1]. Their use is increasing across several different tasks, and nowadays they can be considered an important tool in multivariate statistics [2]. Kohonen maps are self-organizing systems able to solve unsupervised rather than supervised problems. As a consequence, methods based on the Kohonen approach but combining characteristics of both supervised and unsupervised learning have been introduced. Counterpropagation Artificial Neural Networks (CP-ANNs) are very similar to Kohonen maps, an output layer being added to the Kohonen layer [3]. When dealing with classification problems, CP-ANNs are generally efficacious methods for modelling classes separated by non-linear boundaries. Recently, modifications to CP-ANNs led to the introduction of new supervised neural network strategies, such as Supervised Kohonen Networks (SKNs) and XY-fused Networks (XY-Fs) [4]. As a consequence of the increasing success of Self Organizing Maps, several toolboxes for calculating supervised and unsupervised SOMs have been proposed in the literature [5–8].
* Corresponding author at: Dept. of Environmental Sciences, University of Milano-Bicocca, P.zza della Scienza 1, 20126 Milano, Italy. Tel.: +39 02 6448 2801; fax: +39 02 6448 2839. E-mail address: davide.ballabio@unimib.it (D. Ballabio). doi:10.1016/j.chemolab.2012.07.005

The Kohonen and CP-ANN toolbox for MATLAB was originally developed to calculate unsupervised Kohonen maps and supervised classification models by means of CP-ANNs in an easy-to-use graphical user interface (GUI) environment [9]. Recently, several new features and algorithms (SKNs, XY-Fs, batch training, optimization of the network settings by means of Genetic Algorithms) were introduced into the toolbox. This work presents the latest version of the Kohonen and CP-ANN toolbox, which is a collection of MATLAB modules freely available via the Internet (http://www.disat.unimib.it/chm), along with examples and a comprehensive user manual released as HTML files.

2. Methodological background

2.1. Notation

Scalars are indicated by italic lower-case characters (e.g. xij) and vectors by bold lower-case characters (e.g. x). Two-dimensional arrays (matrices) are denoted as X (I × J), where I is the number of samples and J the number of variables. The ij-th element of the data matrix X is denoted xij and represents the value of the j-th variable for the i-th sample.

2.2. Kohonen maps

The toolbox was developed following the algorithm described in the paper from Zupan, Novic and Ruisánchez [10]. Only a brief description of Kohonen maps is given here, since all the details can be found in the cited paper.
The Kohonen map is usually characterized as a squared toroidal space consisting of a grid of N² neurons, where N is the number of neurons on each side of the space (Fig. 1a). Given a multivariate dataset composed of I samples described by J experimental variables, each neuron is associated with J weights, that is, it contains as many elements (weights) as the number of variables. The weights of each neuron are initialized between 0 and 1 and updated on the basis of the I samples for a certain number of iterations (termed training epochs). Kohonen maps can be trained by means of sequential or batch training algorithms [1].

When sequential training is adopted, in each training epoch the samples are randomly introduced into the network, one at a time. For each sample (xi), the most similar neuron (i.e. the winning neuron) is selected on the basis of the minimum Euclidean distance. Then, the weights of the r-th neuron (wr) are changed as a function of the difference between their values and the values of the sample; this correction (Δwr) is scaled according to the topological distance from the winning neuron (dri):

$$\Delta\mathbf{w}_r = \eta(t)\left(1 - \frac{d_{ri}}{d_{max}+1}\right)\left(\mathbf{x}_i - \mathbf{w}_r^{old}\right) \qquad (1)$$

where η(t) is the learning rate and dmax is the size of the considered neighbourhood, which decreases during the training phase. The topological distance dri is defined as the number of neurons between the considered neuron r and the winning neuron. The learning rate changes during the training phase as follows:

$$\eta(t) = \left(\eta_{start} - \eta_{final}\right)\left(1 - \frac{t}{t_{tot}}\right) + \eta_{final} \qquad (2)$$

where t is the number of the current training epoch, ttot is the total number of training epochs, and ηstart and ηfinal are the learning rates at the beginning and at the end of the training, respectively.

When batch training is used, the whole set of samples is presented to the network and the winning neurons are found; after this, the weights are updated on the basis of the effect of all the samples at the same time:

$$\mathbf{w}_r = \frac{\sum_{i=1}^{I} u_{ir}\,\mathbf{x}_i}{\sum_{i=1}^{I} u_{ir}} \qquad (3)$$

where wr are the updated weights of the r-th neuron, xi is the i-th sample, and uir is the weighting factor of the winning neuron related to the i-th sample with respect to neuron r:

$$u_{ir} = 1 - \frac{d_{ri}}{d_{max}+1} \qquad (4)$$

where dmax and dri are defined as before (see Eq. (1)). At the end of the network training, the samples are placed in the most similar neurons of the Kohonen map; in this way, the data structure can be visualized, and the role of the experimental variables in defining that structure can be elucidated by looking at the Kohonen weights.
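To make the sequential update concrete, the following MATLAB fragment sketches one training epoch according to Eqs. (1) and (2). It is a minimal illustration under stated assumptions, not the toolbox's internal implementation: X is assumed to be the range-scaled I × J data matrix, W an N × N × J weight array, eta_start, eta_final, t, t_tot and dmax predefined scalars, and grid_dist a hypothetical helper returning the topological distances of all neurons from the winning neuron on the (toroidal) grid.

% One sequential training epoch (minimal sketch of Eqs. (1)-(2)).
eta = (eta_start - eta_final)*(1 - t/t_tot) + eta_final;   % Eq. (2)
for i = randperm(size(X,1))                      % samples in random order
    xi = X(i,:);                                 % current sample
    D = sum((reshape(W,[],size(X,2)) - xi).^2, 2);   % squared Euclidean distances
    [~, win] = min(D);                           % winning neuron (linear index)
    d = grid_dist(win, size(W,1));               % hypothetical: N x N topological distances
    f = max(1 - d/(dmax + 1), 0);                % neighbourhood correction of Eq. (1)
    for j = 1:size(X,2)
        W(:,:,j) = W(:,:,j) + eta*f.*(xi(j) - W(:,:,j));   % weight update
    end
end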

Fig. 1. Structures of Kohonen maps and related methods (CP-ANNs, SKNs, and XY-Fs) for a generic dataset constituted by J variables and G classes. Notation in the figure refers to the notation used in the text: xij represents the value of the j-th variable for the i-th sample, wrj represents the value of the j-th Kohonen weight for the r-th neuron, cig represents the membership of the i-th sample to the g-th class expressed with a binary code, and yrg represents the value of the g-th output weight for the r-th neuron.

2.3. Counterpropagation Artificial Neural Networks

Counterpropagation Artificial Neural Networks (CP-ANNs) are modelling methods which combine features from both supervised and unsupervised learning [10]. CP-ANNs consist of two layers, a Kohonen layer and an output layer, whose neurons have as many weights as the number of classes to be modelled (Fig. 1b). The class vector is used to define a matrix C, with I rows and G columns, where I is the number of samples and G the total number of classes; each entry cig of C represents the membership of the i-th sample to the g-th class, expressed with a binary code (0 or 1). When sequential training is adopted, the weights of the r-th neuron in the output layer (yr) are updated in a supervised manner on the basis of the winning neuron selected in the Kohonen layer. Considering the class of each sample i, the update is calculated as follows:

$$\Delta\mathbf{y}_r = \eta(t)\left(1 - \frac{d_{ri}}{d_{max}+1}\right)\left(\mathbf{c}_i - \mathbf{y}_r^{old}\right) \qquad (5)$$

where dri is the topological distance between the considered neuron r and the winning neuron selected in the Kohonen layer, and ci is the i-th row of the class matrix C, that is, a G-dimensional binary vector representing the class membership of the i-th sample. If batch training is used instead, the weights of the output layer are changed following the same algorithm shown in the previous paragraph (see Eqs. (3) and (4)). At the end of the network training, each neuron of the Kohonen layer can be assigned to a class on the basis of the output weights, and all the samples placed in that neuron are automatically assigned to the corresponding class.

2.4. XY-fused Networks

XY-fused Networks (XY-Fs) are supervised neural networks for building classification models derived from Kohonen maps (Fig. 1c). In XY-fused Networks, the winning neuron is selected by calculating Euclidean distances between a) the sample (xi) and the weights of the Kohonen layer, and b) the class membership vector (ci) and the weights of the output layer. These two Euclidean distances are then combined into a fused similarity, which is used to find the winning neuron. The influence of the distances calculated on the Kohonen layer decreases linearly during the training epochs, while the influence of the distances calculated on the output layer increases. Details on XY-fused Networks can be found in the paper from Melssen, Wehrens and Buydens [4].

2.5. Supervised Kohonen Networks (SKNs)

Like CP-ANNs and XY-Fs, Supervised Kohonen Networks (SKNs) are supervised neural networks derived from Kohonen maps and used to calculate classification models (Fig. 1d). In Supervised Kohonen Networks, the Kohonen and output layers are glued together to give a combined layer that is updated according to the training scheme of Kohonen maps. Each sample (xi) and its corresponding class vector (ci) are combined together and act as input for the network. In order to achieve classification models with good predictive performance, xi and ci must be scaled properly. Therefore, a scaling coefficient for ci is introduced for tuning the influence of the class vector in the model calculation. Details on SKNs can be found in the paper from Melssen, Wehrens and Buydens [4].

3. Main features of the Kohonen and CP-ANN toolbox

The toolbox was initially developed under MATLAB 6.5 (Mathworks), but it is compatible with the latest releases of MATLAB. The collection of functions and algorithms is provided as MATLAB source files, with no requirement for any third-party utilities beyond the standard MATLAB installation. The files just need to be copied into a folder. Model calculation can be performed both via the MATLAB command window and via a graphical user interface, which enables the user to perform all the analysis steps.

3.1. Input data

Data must be structured as a numerical matrix with dimensions I × J, where I is the number of samples and J the number of variables. When dealing with supervised classification, the class vector must be prepared as a column numerical vector (I × 1), where the i-th element represents the class label of the i-th sample. If G classes are present, class labels must be integer numbers ranging from 1 to G. Note that 0 values are not allowed as class labels. Datasets with missing values can be handled by the toolbox: missing values (and the corresponding neuron weights) are simply not considered when calculating the Euclidean distances used to find the closest neuron and when updating the neuron weights.

3.2. Network settings

Kohonen maps have adaptable parameters that must be chosen prior to calculation. Network settings can be defined in the GUI or via the MATLAB command window by means of the som_setting function, and are stored in a MATLAB data structure; each field of this structure defines a specific setting for the network. All the available settings are listed in Table 1. The network size (nsize) defines the number of neurons on each side of the map: if the number of neurons on each side is set to N, the total number of neurons will be N². The number of epochs (epochs) is the number of times each sample is introduced into the network.
Table 1
Network settings available in the toolbox.

Setting | Description | Possible values | Default
net_type | Type of neural network | kohonen, cpann, skn, xyf | NaN
nsize | Number of neurons for each side of the map | any integer number greater than zero | NaN
epochs | Number of training epochs | any integer number greater than zero | NaN
topol | Topology condition | square, hexagonal | square
bound | Boundary condition | toroidal, normal | toroidal
training | Training algorithm | batch, sequential | batch
init | Initialization of weights | random, eigen | random
a_max | Initial learning rate | any real number between 0.9 and 0.1 | 0.5
a_min | Final learning rate | any real number between 0 and the initial learning rate | 0.01
scaling | Data scaling (prior to automatic range scaling) | none, centering, variance scaling, auto scaling | none
absolute_range | Type of automatic range scaling | classical, absolute | classical
ass_meth | Neuron assignment criterion (only for CP-ANNs, SKNs and XY-Fs) | four different criteria (1, 2, 3 or 4) | 1
scalar | Scaling coefficient for tuning the effect of the class vector (only for SKNs) | any real number greater than 0 | 1

Table 2
MATLAB routines of the toolbox related to the calculation of Kohonen maps and their main outputs. For each routine, outputs are collected as fields of a unique MATLAB structure.

model_kohonen (fitting of Kohonen maps):
  W: Kohonen weights stored in a three-way data matrix with dimensions N × N × J, where N is the number of neurons on each side of the map and J is the number of variables
  settings: settings used for building the model
  scal: structure with scaling parameters
  top_map: coordinates of the samples in the Kohonen top map

pred_kohonen (prediction with Kohonen maps):
  top_map: coordinates of the predicted samples in the Kohonen top map

The boundary condition (bound) defines whether the space of the Kohonen map is normal or toroidal. The topology condition (topol) defines the shape of each neuron (square or hexagonal). The training algorithm can be defined by the field training; sequential and batch training algorithms are available. The learning rates (ηstart and ηfinal) can be modified by changing the values of a_max and a_min, respectively. The values of ηstart and ηfinal are set by default to 0.5 and 0.01, respectively, as suggested in the literature [10]. When dealing with supervised classification, the user can also define a criterion for assigning neurons to the classes on the basis of their output weights (ass_meth) [9]. The initialization of the Kohonen weights can be defined by the field init: weights can be initialized either randomly (between 0.1 and 0.9) or on the basis of the eigenvectors corresponding to the two largest principal components of the dataset [1]. In this second case, weights are always initialized to the same values. Therefore, when the eigenvector-based initialization is coupled with the batch training algorithm, the final weights are always the same, since both random initialization and random introduction of samples into the Kohonen map are avoided. When dealing with Supervised Kohonen Networks (SKNs), the scaling coefficient for tuning the effect of the class vector can be defined in the scalar field; this coefficient is set by default to 1. Regarding data scaling, it must be noted that variables are always range scaled between 0 and 1, in order to be comparable with the network weights [10]. The range scaling can be performed separately on each column (variable) of the dataset or by using the maximum and minimum values of the entire dataset (absolute_range); this second option can be used when all the variables are defined on the same scale, as for profiles and spectral data. Moreover, the user can define different methods of data scaling in the settings structure (scaling), to be applied prior to the automatic range scaling.
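As an illustration of the command-line workflow described above, the fragment below creates a settings structure and adjusts some of the fields listed in Table 1, then fits a Kohonen map. The function names som_setting and model_kohonen come from the text and Table 2, but their exact signatures are assumptions to be checked against the user manual.

% Hypothetical command-line session (signatures assumed):
settings = som_setting;               % create a default settings structure
settings.net_type = 'kohonen';        % fields as listed in Table 1
settings.nsize    = 5;                % 5 x 5 = 25 neurons
settings.epochs   = 100;              % number of training epochs
settings.bound    = 'toroidal';       % boundary condition
settings.training = 'batch';          % batch training algorithm
settings.init     = 'eigen';          % reproducible weight initialization

model = model_kohonen(X, settings);   % fit the map (outputs as in Table 2)
model.top_map                         % coordinates of the samples in the top map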

3.3. Optimization of the network architecture by means of Genetic Algorithms

Kohonen maps require an optimization step in order to choose the most suitable network architecture. When dealing with classification models, CP-ANNs, SKNs and XY-Fs require the selection of appropriate numbers of neurons and training epochs in order to make accurate predictions. The relationship between architecture and network performance cannot be established easily, since it depends on many factors, such as the number of samples and their distribution in the data space. The search for the best architecture is usually performed by heuristic methods, and in fact one of the major disadvantages of these multivariate models is probably the network optimization, since this procedure suffers from some arbitrariness and can be time-consuming. Recently, a new strategy for the selection of the optimal number of neurons and training epochs was proposed [11]. This strategy exploits the ability of Genetic Algorithms to optimize network parameters [12–15]; details on this approach can be found in the cited paper. This strategy for optimizing the network architecture has been introduced in the toolbox and can be run both via the graphical user interface and in the MATLAB command window. Once the optimization has been performed, the results can be easily saved, loaded and analyzed in the graphical user interface. Details on how to perform the optimization are given in the section describing the illustrative example of analysis.
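The optimization strategy itself is described in [11]; purely to convey the idea, the conceptual fragment below shows how a Genetic Algorithm might search over candidate architectures, with each individual encoding a (neurons, epochs) pair scored by cross-validation. All names here are hypothetical; this is not the toolbox API.

% Conceptual GA over network architectures (all names hypothetical).
% cv_score(n, e) stands for a cross-validated performance measure of a
% network with n x n neurons trained for e epochs.
pop = [randi([3 10],20,1), randi([50 500],20,1)];    % [neurons, epochs]
for gen = 1:30
    fit = zeros(size(pop,1),1);
    for k = 1:size(pop,1)
        fit(k) = cv_score(pop(k,1), pop(k,2));        % evaluate each individual
    end
    [~, idx] = sort(fit, 'descend');
    parents  = pop(idx(1:10),:);                      % selection of the fittest
    children = [parents(randperm(10),1), parents(randperm(10),2)];  % crossover
    children = children + round(randn(10,2).*[1 25]); % mutation
    children = max(children, [2 10]);                 % keep values sensible
    pop = [parents; children];                        % next generation
end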

Table 3
MATLAB routines of the toolbox related to the calculation of CP-ANNs, SKNs and XY-Fs and their main outputs. For each routine, outputs are collected as fields of a unique MATLAB structure.

model_cpann, model_skn, model_xyf (fitting of CP-ANN, SKN and XY-F models):
  W: Kohonen weights stored in a three-way data matrix with dimensions N × N × J, where N is the number of neurons on each side of the map and J is the number of variables
  W_out: output weights stored in a three-way data matrix with dimensions N × N × G, where G is the number of classes
  neuron_ass: vector with neuron assignments
  settings: settings used for building the model
  scal: structure with scaling parameters
  class_true: true class vector
  class_calc: calculated class vector
  class_weights: output weights associated to the samples
  top_map: coordinates of the samples in the Kohonen top map
  class_param: structure containing classification parameters (confusion matrix, error rate, specificity, sensitivity and precision)

cv_cpann, cv_skn, cv_xyf (cross-validation of CP-ANN, SKN and XY-F models):
  settings: settings used for cross-validating the model
  class_true: true class vector
  class_pred: class vector calculated in cross-validation
  class_weights: output weights associated to the samples in cross-validation
  class_param: structure containing cross-validated classification parameters

pred_cpann, pred_skn, pred_xyf (prediction with CP-ANN, SKN and XY-F models):
  class_pred: predicted class vector
  class_weights: output weights associated to the samples in prediction
  top_map: coordinates of the predicted samples in the Kohonen top map

Fig. 2. Kohonen and CP-ANN toolbox: main graphical interface.

3.4. Calculating models

Once the data have been prepared and the settings have been defined, the user can easily calculate a Kohonen network by using the model_kohonen function via the MATLAB command window. The output of the routine is a structure with several fields containing all the results (Table 2). Supervised classification models can be calculated with CP-ANNs, SKNs or XY-Fs via the MATLAB command window; the MATLAB functions associated to these methods are listed in Table 3. The output of these functions is a structure, where the results concerning the output layer and the indices describing the classification performance are stored together with the results concerning the Kohonen layer (Table 3). In particular, the output weights are stored in a three-way data matrix with dimensions N × N × G, where G is the number of modelled classes. The assignment of each neuron is saved, as well as the consequent assignment of each sample placed in that neuron. Finally, the confusion matrix is provided. This is a square matrix with dimensions G × G, where each entry ngk represents the number of samples belonging to class g and assigned to class k. The best-known classification indices, such as error rate, non-error rate, specificity, sensitivity, precision and the ratio of not-assigned samples, are derived from the confusion matrix [16]. Cross-validation can be performed by means of the functions listed in Table 3, by choosing the number of cancellation groups and the cross-validation method for separating the samples into cancellation groups (venetian blinds or contiguous blocks). The output of these routines is a MATLAB structure containing the confusion matrix and the derived classification indices calculated in cross-validation. Unknown or test samples can be predicted by using an existing model: new samples are compared with the trained Kohonen weights, placed in the closest neuron and assigned to the corresponding class. This calculation can be made in the toolbox by means of the functions listed in Table 3, which return a structure containing the class assignments of the new samples.
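A hypothetical command-line session for a supervised model might look as follows. The function names are those of Table 3, while the argument lists and the cross-validation options are assumptions to be checked against the user manual.

% Hypothetical supervised workflow (signatures assumed):
settings = som_setting;                 % as in Section 3.2
settings.net_type = 'cpann';
settings.nsize    = 5;
settings.epochs   = 100;

model = model_cpann(X, class, settings);        % fit the CP-ANN (Table 3)
model.class_param                               % classification indices in fitting

cv = cv_cpann(X, class, settings, 'vene', 10);  % 10 venetian-blind cancellation groups (assumed syntax)
cv.class_param                                  % cross-validated indices

pred = pred_cpann(Xtest, model);                % predict test samples
pred.class_pred                                 % predicted class vector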

Fig. 3. Kohonen and CP-ANN toolbox: interactive graphical interface for visualizing the Kohonen top map.

Fig. 4. Kohonen and CP-ANN toolbox: interactive graphical interface for visualizing the optimization results.

3.5. Calculating models via the graphical user interface

The following command must be executed at the MATLAB prompt to run the graphical interface (Fig. 2):

>> model_gui

The user can load the data, the sample and variable labels, and (when dealing with supervised classification) the class vector, both from the MATLAB workspace and from MATLAB files. Then, in addition to basic operations, such as looking at the data and plotting variable means and sample profiles, all the calculation steps described in the previous paragraphs can be easily performed in the graphical interface. Optimization of the network structure, to choose the optimal number of epochs and neurons, can be performed directly in a dedicated window. Once the user has decided how to set up the network, settings and parameters for cross-validation can be defined in a dedicated window, where basic and advanced settings are separated in order to facilitate practitioners who are not experienced with SOMs. Once a model has been calculated, the results of the optimization step, models, settings and cross-validation results can be exported to the MATLAB workspace. Saved models can be easily loaded back into the toolbox for future analyses, and new samples can be loaded and predicted on the basis of previously calculated models. When dealing with supervised classification, the user can graphically evaluate indices for classification diagnostics (confusion matrix, error rate, non-error rate, specificity, sensitivity, purity) and analyze ROC (Receiver Operating Characteristic) curves. These are graphical tools for the analysis of classification results and describe the degree of separation of the classes. ROC curves plot 1 − Specificity (also known as the False Positive Rate, FPR) on the x axis against Sensitivity (also known as the True Positive Rate, TPR) on the y axis for a binary classification system as its discrimination threshold is changed. In this toolbox, ROC curves are calculated separately for each class, by changing the threshold of assignment over the output weights from 0 to 1.
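The fragment below sketches how such a per-class ROC curve can be traced from the output weights, sweeping the assignment threshold from 0 to 1 as just described; the variable names are assumptions for the example.

% Per-class ROC sketch: y holds the class-g output weights associated
% to the I samples, isg is true where a sample truly belongs to class g.
thr = linspace(0, 1, 101);
tpr = zeros(size(thr)); fpr = zeros(size(thr));
for k = 1:numel(thr)
    assigned = y >= thr(k);                      % assign to class g at this threshold
    tpr(k) = sum(assigned & isg) / sum(isg);     % sensitivity (TPR)
    fpr(k) = sum(assigned & ~isg) / sum(~isg);   % 1 - specificity (FPR)
end
plot(fpr, tpr), xlabel('1 - Specificity'), ylabel('Sensitivity')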

3.6. Visualizing results via the graphical user interface

Once the model has been calculated, the Kohonen top map can be visualized in the toolbox graphical interface (Fig. 3). The Kohonen top map represents the space defined by the neurons where the samples are placed, and allows visual investigation of the data structure by analyzing the sample positions and their relationships. Samples are visualized by randomly scattering their positions within each neuron space; by means of the update button it is possible to move the sample positions within the neuron. Samples can be labelled with different strings: identification numbers, class labels (in the case of supervised classification), or user-defined labels. Moreover, the map can be shifted if the chosen boundary condition is toroidal, in order to optimize the map visualization. The influence of the variables in describing the data can be evaluated by coloring the neurons on the basis of the Kohonen weights, by means of the Display weights list. In this way, neurons are colored from white (weight equal to 0, minimum value) to black (weight equal to 1, maximum value). Therefore, one can evaluate whether the considered variable has a direct relationship with the sample distribution in the space of the top map. Moreover, both the Kohonen and the output weights of a selected neuron can be displayed by means of the get neuron weights button. However, the analysis of the Kohonen top map only allows plotting all the weights for a specific neuron or all the neurons for a specific weight; that is, all the available information cannot be plotted at the same time. When dealing with complex data, high-dimensional spaces are common; in these cases, it is not easy to interpret the data with a simple visual approach.

Table 4
Example of analysis: some of the indices calculated by the toolbox and used for classification diagnostics. Error rate, non-error rate, specificity, sensitivity and precision obtained in fitting, in cross-validation (10 cancellation groups) and on the external test set are shown.

Classification parameter | Fitting | Cross-validation | External test set
Non-error rate | 0.97 | 0.97 | 0.96
Error rate | 0.03 | 0.03 | 0.04
Precision of class 1 | 0.99 | 0.99 | 0.98
Precision of class 2 | 0.94 | 0.94 | 0.91
Sensitivity of class 1 | 0.97 | 0.97 | 0.95
Sensitivity of class 2 | 0.98 | 0.98 | 0.97
Specificity of class 1 | 0.98 | 0.98 | 0.97
Specificity of class 2 | 0.97 | 0.97 | 0.95

Fig. 5. Example of analysis: a) variable profile for each class produced by the toolbox. In this plot, the average of the Kohonen weights of each variable, calculated on the neurons assigned to each class, is shown; b) plot of ROC curves produced by the toolbox.

For this reason, the toolbox allows the calculation of Principal Component Analysis (PCA) on the Kohonen weights, in order to investigate the relationships between variables and classes in a global way, rather than one variable at a time [17]. A GUI for calculating PCA on the Kohonen weights is provided in the toolbox. Details on its use are given in the section describing the illustrative example of analysis.
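A compact way to picture this analysis is to unfold the N × N × J array of Kohonen weights into an N² × J matrix, with one row per neuron, and submit it to an ordinary PCA. The sketch below uses the W field returned by the fitting functions (Tables 2 and 3) and standard MATLAB calls; it is an illustration of the idea, not the toolbox's internal code.

% PCA on the Kohonen weights (minimal sketch).
Wu = reshape(model.W, [], size(model.W,3));   % N^2 x J, one row per neuron
Wc = Wu - mean(Wu,1);                         % mean-center the columns
[U,S,V] = svd(Wc, 'econ');                    % PCA via singular value decomposition
scores   = U*S;                               % neuron coordinates (score plot)
loadings = V;                                 % variable loadings (loading plot)
explvar  = 100*diag(S).^2/sum(diag(S).^2);    % explained variance per PC (%)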

4. Illustrative example: classification of multivariate data

This example uses the Breast Cancer dataset, a well-known benchmark dataset for classification [18]. The dataset consists of 699 samples divided into 2 classes: class 1, Benign (458 samples), and class 2, Malignant (241 samples). Samples are described by 9 variables (Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses) which take discrete values in the range 1–10. Kohonen maps are not directly treated here, since they are implicitly calculated as the Kohonen layer of CP-ANNs. 25% of the samples were randomly extracted and used as external test samples, maintaining the class proportions, that is, the number of test samples of each class was proportional to the number of training samples of that class (a minimal sketch of such a split is given below). The training samples were used to optimize the network architecture and to build and cross-validate the CP-ANN classification model. The external test samples were used only to evaluate the predictive ability of the final CP-ANN model.
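The stratified hold-out split just described can be reproduced with a few lines of standard MATLAB; the variable names below (X, class) are assumptions for the example.

% Stratified 25% hold-out split (minimal sketch).
rng(0);                                        % reproducible selection
test = false(numel(class),1);
for g = 1:max(class)                           % class by class
    idx = find(class == g);
    idx = idx(randperm(numel(idx)));           % shuffle within the class
    test(idx(1:round(0.25*numel(idx)))) = true;    % 25% to the test set
end
Xtrain = X(~test,:); ctrain = class(~test);    % training set
Xtest  = X(test,:);  ctest  = class(test);     % external test set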

4.1. Selection of the optimal numbers of neurons and epochs

The optimal numbers of neurons and epochs were determined by means of Genetic Algorithms, as previously explained. The optimization results can be easily analyzed in the graphical user interface (Fig. 4). Each bubble represents a network architecture. The size of each bubble is proportional to the network size, that is, the number of neurons. The color of the bubbles is proportional to the number of epochs: the darker the bubble, the higher the number of epochs used to train the network. This plot enables a qualitative interpretation of the results: architectures placed in the upper right part of the plot are appropriate, since they are characterized by high relative frequencies of selection by the Genetic Algorithms and high predictive performance [11]. As a consequence, the architectures placed at the top right limit of the plot can be considered the most suitable ones, such as the architecture marked in red in Fig. 4, representing a neural network with 4 × 4 neurons trained for 250 epochs. The list of all the represented architectures, with their numbers of neurons and epochs, frequency of selection in the GA runs and average fitness, can be seen by clicking the view results in table button. By clicking the select button, it is possible to select a specific bubble (architecture) in the plot and see its corresponding numbers of epochs and neurons, frequency of selection and fitness value.

4.2. Calculation of the classification model

On the basis of the optimization results obtained by means of Genetic Algorithms, the numbers of neurons and epochs were set to 4 × 4 and 250, respectively.

Fig. 6. Example of analysis: a) Kohonen top map produced by the toolbox. In the top map, each sample is labelled on the basis of its class. Each neuron is colored with a gray scale on the basis of the Kohonen weights of variable 2 (uniformity of cell size): white corresponds to a Kohonen weight equal to 0, black to Kohonen weights equal to 1; b) profile of the Kohonen weights of one of the neurons where samples of class 2 (malignant) were placed.

In Table 4, the classification indices calculated by the toolbox are shown. The classification performances refer both to fitting and to cross-validation, performed with 10 cancellation groups selected by venetian blinds. These classification indices can be accessed by clicking the classification results button in the toolbox main form, as can the plot of the Kohonen weight averages for each class (class profile) and the ROC curves. In these plots, it is possible to see that a) class 2 (Malignant) is characterized by higher values on all the considered variables (Fig. 5a), and b) the degree of separation between the two classes is high in the ROC curves (Fig. 5b). Finally, the model can be saved in the MATLAB workspace and later loaded into the toolbox to predict new sets of samples. This was done on the external test samples of the dataset under analysis. In Table 4, the classification indices calculated on the external test set are shown.

4.3. Interpreting the results with the graphical interface

The classification indices provided by the toolbox can help the user to evaluate the overall classification performance, but it is also important to gain insight into the model by interpreting the relationships between samples and variables. This can be done by analyzing the Kohonen top map, where samples are projected in order to evaluate the data structure, while the importance of the variables can be analyzed by coloring the neurons on the basis of the neuron weights, which are always comprised between 0 and 1. As an example, the top map of the calculated model (4 × 4 neurons and 250 epochs) is shown in Fig. 6a. In the top map, each sample is labelled on the basis of its class, while neurons are colored on the basis of the Kohonen weight of variable 2 (Uniformity of Cell Size), going from low values (white) to high values (black). It is reasonably easy to see that variable 2 discriminates samples belonging to class 1 (Benign) from those of class 2 (Malignant), the latter being placed in neurons with higher weights. In addition, the user can plot all the Kohonen and output weights of a selected neuron; the profile of the Kohonen weights of one of the neurons where class 2 samples are placed is shown in Fig. 6b. However, in this way it is not possible to gain a comprehensive insight into the relationships between variables and samples. For this reason, a tool for calculating PCA on the Kohonen weights is provided in the graphical user interface of the toolbox. In Fig. 7, the score and loading plots of the first two components (together explaining 74% of the total information) are shown. In the score plot (Fig. 7a), each point represents a neuron of the previous CP-ANN model. Each neuron is colored with a gray scale on the basis of the output weight of class 2: the larger the value of the output weight, the higher the probability that the neuron belongs to class 2 and the darker the color. The majority of the neurons assigned to class 2 are placed on the left side of the score plot. Thus, comparing the score and loading plots, one can evaluate how the variables characterize the classes. All the variables are placed on the left of the loading plot (Fig. 7b); they are therefore directly correlated with samples belonging to class 2 (Malignant), that is, samples of class 2 are characterized by higher values of all the considered variables.

5. Independent testing

Dr. Federico Marini, at the Chemistry Department, Università di Roma La Sapienza, P.le Aldo Moro 5, I-00185 Rome, Italy, has tested the described software and found that it appears to function as the Authors described.

6. Conclusion

The Kohonen and CP-ANN toolbox for MATLAB is a collection of modules for calculating Self Organizing Maps (Kohonen maps) and derived methods for supervised classification, such as Counterpropagation Artificial Neural Networks (CP-ANNs), Supervised Kohonen Networks (SKNs) and XY-fused Networks (XY-Fs).

Fig. 7. Example of analysis: a) score plot of the first two principal components calculated on the Kohonen weights. Each neuron is colored with a gray scale on the basis of the output weight corresponding to class 2 (malignant): white corresponds to an output weight equal to 0, black to output weights equal to 1; b) loading plot of the first two principal components calculated on the Kohonen weights. Each variable is labelled with its identification number.

The toolbox is regularly updated and is freely available via the Internet from the Milano Chemometrics and QSAR Research Group website (http://www.disat.unimib.it/chm). It aims to be useful for both beginners and advanced users of MATLAB; for this reason, examples and a comprehensive user manual are provided with the toolbox. The toolbox comprises a graphical user interface (GUI), which allows the calculations to be performed in an easy-to-use graphical environment. In the GUI, all the analysis steps (data loading, model settings, optimization, calculation, cross-validation, prediction and results visualization) can be easily performed.

References

[1] T. Kohonen, Self-Organization and Associative Memory, Springer Verlag, Berlin, 1988.
[2] F. Marini, Analytica Chimica Acta 635 (2009) 121–131.
[3] J. Zupan, M. Novic, J. Gasteiger, Chemometrics and Intelligent Laboratory Systems 27 (1995) 175–187.
[4] W. Melssen, R. Wehrens, L. Buydens, Chemometrics and Intelligent Laboratory Systems 83 (2006) 99–113.
[5] J. Vesanto, J. Himberg, E. Alhoniemi, J. Parhankangas, SOM Toolbox for Matlab 5, Technical Report A57, Helsinki University of Technology, 2000.
[6] M. Schmuker, F. Schwarte, A. Brück, E. Proschak, E. Tanrikulu, A. Givehchi, K. Scheiffele, G. Schneider, Journal of Molecular Modeling 13 (2007) 225–228.
[7] I. Kuzmanovski, M. Novic, Chemometrics and Intelligent Laboratory Systems 90 (2008) 84–91.
[8] J. Aires-de-Sousa, Chemometrics and Intelligent Laboratory Systems 61 (2002) 167–173.
[9] D. Ballabio, V. Consonni, R. Todeschini, Chemometrics and Intelligent Laboratory Systems 98 (2009) 115–122.
[10] J. Zupan, M. Novic, I. Ruisánchez, Chemometrics and Intelligent Laboratory Systems 38 (1997) 1–23.
[11] D. Ballabio, M. Vasighi, V. Consonni, M. Kompany-Zareh, Chemometrics and Intelligent Laboratory Systems 105 (2011) 56–64.
[12] I. Kuzmanovski, S. Dimitrovska-Lazova, S. Aleksovska, Analytica Chimica Acta 595 (2007) 182–189.
[13] I. Kuzmanovski, M. Trpkovska, B. Soptrajanov, Journal of Molecular Structure 744–747 (2005) 833–838.
[14] D. Polani, On the optimisation of self-organising maps by genetic algorithms, in: Kohonen Maps, Elsevier, Amsterdam, 1999.
[15] I. Kuzmanovski, M. Novic, M. Trpkovska, Analytica Chimica Acta 642 (2009) 142–147.
[16] D. Ballabio, R. Todeschini, Multivariate classification for qualitative analysis, in: Infrared Spectroscopy for Food Quality Analysis and Control, Elsevier, 2008.
[17] D. Ballabio, R. Kokkinofta, R. Todeschini, C.R. Theocharis, Chemometrics and Intelligent Laboratory Systems 87 (2007) 78–84.
[18] W.H. Wolberg, O.L. Mangasarian, Proceedings of the National Academy of Sciences of the United States of America 87 (1990) 9193–9196.
