
Chemometrics and Intelligent Laboratory Systems 149 (2015) 1–9


Software Description

A MATLAB toolbox for Principal Component Analysis and unsupervised exploration of data structure
Davide Ballabio ⁎
Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano–Bicocca, Milano, Italy

Article history:
Received 8 September 2015
Received in revised form 16 September 2015
Accepted 3 October 2015
Available online 19 October 2015

Keywords: Principal Component Analysis; Rank analysis; Cluster Analysis; Multidimensional Scaling; MATLAB

Abstract: Principal Component Analysis is a multivariate method to project data in a reduced hyperspace, defined by orthogonal principal components, which are linear combinations of the original variables. In this way, the data dimension can be reduced, noise can be excluded from the subsequent analysis, and data interpretation is therefore greatly facilitated. For these reasons, Principal Component Analysis is nowadays the most common chemometric strategy for unsupervised exploratory data analysis.

In this paper, the PCA toolbox for MATLAB is described. This is a collection of modules for calculating Principal Component Analysis, as well as Cluster Analysis and Multidimensional Scaling, two other well-known multivariate methods for unsupervised data exploration. The toolbox is freely available via the Internet and comprises a graphical user interface (GUI), which allows the calculations to be carried out in an easy-to-use graphical environment. It aims to be useful for both beginners and advanced users. The use of the toolbox is discussed here with an appropriate practical example.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Principal Component Analysis (PCA) is a well-known chemometric technique for exploratory data analysis; it basically projects data in a reduced hyperspace, defined by orthogonal principal components [1,2]. These are linear combinations of the original variables, with the first principal component having the largest variance, the second principal component having the second largest variance, and so on. It is thus possible to select a number of significant components, so that the data dimension is reduced by preserving the systematic variation in the data, retained in the first selected components, while noise, being represented in the last components, is excluded. Therefore, PCA enhances and facilitates the exploration and interpretation of multivariate datasets.

In addition to PCA, two other common chemometric strategies for unsupervised data analysis are Cluster Analysis and Multidimensional Scaling. Cluster Analysis differs from PCA in that its goal is to detect similarities between samples and define groups in the data [3], while Multidimensional Scaling (MDS) takes into account the mutual relationships of sample distances to reproduce the data structure encoded in the distance (similarity) matrix in a low-dimensional space [4].

This work deals with the presentation of the PCA toolbox for MATLAB, which is a collection of MATLAB modules freely available via the Internet from the Milano Chemometrics and QSAR Research Group website [5]. The toolbox was developed in order to calculate PCA, Cluster Analysis, and MDS in an easy-to-use graphical user interface (GUI) environment. It does not require an experienced user, but a basic knowledge of the underlying methods is necessary to correctly interpret the results.

The PCA toolbox for MATLAB provides comprehensive results of PCA, beyond the usual outputs, as well as different methods to estimate the optimal number of significant components. Therefore, the originality of this manuscript is not related to the methods implemented in the toolbox, but to the fact that the entire workflow can be carried out by means of a graphical user interface (GUI). There is no need to type instructions at the MATLAB command line, and all the steps of the analysis (data loading, univariate data screening, component selection, model calculation, model analysis, projection of new samples) can be handled in the GUI with an easy-to-use interface. This is an important aspect, since toolboxes and software often lack a graphical interface, and this can lead users (especially beginners in chemometrics or MATLAB) not to use them, even if the underlying model is a basic multivariate method. Moreover, some already available toolboxes are black boxes without an apparent detailed description of the options related to the underlying models, while a comprehensive help is provided with the PCA toolbox for MATLAB, describing the theory, the options, and examples of the calculated models.

In the first part of the paper, the theory of the methods included in the toolbox is briefly overviewed. Then, the MATLAB modules and their features are described, and finally, the results obtained on a real chemical dataset are shown as an example of application.

⁎ Dept. of Earth and Environmental Sciences, University of Milano–Bicocca, P.zza della Scienza 1, 20126 Milano, Italy. Tel.: +39 02 64482818. E-mail address: davide.ballabio@unimib.it.

http://dx.doi.org/10.1016/j.chemolab.2015.10.003

2. Methodological background

2.1. Notation

Scalars are indicated by italic lower-case characters (e.g. xij) and vectors by bold lower-case characters (e.g. x). Two-dimensional arrays (matrices) are denoted as X (I × J), where I is the number of samples and J the number of variables. The ij-th element of the data matrix X is denoted as xij and represents the value of the j-th variable for the i-th sample.

2.2. Principal Component Analysis

The theory of PCA is only briefly described here, since details can be found in the literature [6,7,1,2,8]; particular attention is given to the definition of the outputs given by the toolbox.

PCA determines a set of orthogonal vectors, called principal components, which are defined by a linear combination of the original variables and ordered by the amount of variance explained in the component directions. The coefficients of the variables that define the principal components are stored in the loading matrix. Given a set of I samples and J variables, arranged in a two-dimensional matrix X (I × J), the loadings are calculated by singular value decomposition (SVD) of the covariance matrix C:

$$\mathbf{C} = \frac{\mathbf{X}^{T}\mathbf{X}}{I-1} = \mathbf{L}\mathbf{S}^{2}\mathbf{L}^{T} = \mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^{T} \qquad (1)$$

where Z (J × J) is an orthogonal matrix, S (J × J) is a diagonal matrix with the nonzero singular values on its diagonal, L (J × J) is the loading matrix, which collects on each j-th column (eigenvector) the coefficients of the J variables defining the j-th principal component, and Λ is the diagonal matrix which contains the nonnegative eigenvalues of decreasing magnitude (λ1 ≥ λ2 ≥ … ≥ λJ ≥ 0). Each eigenvalue encodes the variation related to the corresponding component. From the eigenvalues, the explained variance (EV) and cumulative explained variance (CEV) associated to each m-th component can be calculated:

$$EV_m = \frac{\lambda_m}{\sum_{j=1}^{J} \lambda_j} \qquad (2)$$

$$CEV_m = \frac{\sum_{j=1}^{m} \lambda_j}{\sum_{j=1}^{J} \lambda_j} \qquad (3)$$

Considering that only M significant components are retained in the PCA model, the dimension of the loading matrix L decreases from (J × J) to (J × M), and samples are thus projected into the lower-dimensional space defined by the significant principal components in the following way:

$$\mathbf{T} = \mathbf{X}\mathbf{L} \qquad (4)$$

where T (I × M) is the score matrix, which collects on each m-th column the coordinates of the I samples on the m-th principal component.

Additional statistics can be calculated in order to analyse how each sample conforms to the PCA model. In particular, the sum of normalised squared scores, known as Hotelling's T2 statistic, is a measure of the variation of each sample within the PCA model [6]:

$$T_i^2 = \sum_{m=1}^{M} \frac{t_{im}^2}{\lambda_m} \qquad (5)$$

where T2i and tim are the Hotelling's T2 value and the score value of the i-th sample on the m-th component, respectively. High values of Hotelling's T2 indicate samples far from the PCA model centre. The upper confidence limit for Hotelling's T2 is calculated as follows [6]:

$$T_{I,M}^2 = \frac{M(I-1)}{I-M} F_{M,I-M,\alpha} \qquad (6)$$

where F(M, I−M, α) is the F inverse cumulative distribution function for the corresponding probability α. The Hotelling's T2 contribution for a specific sample indicates the variables which caused the sample to have extreme score values and is calculated with the following equation [9]:

$$Tcont_{ij} = \sum_{m=1}^{M} \frac{t_{im} l_{jm}}{\sqrt{\lambda_m}} \qquad (7)$$

where Tcontij is the Hotelling's T2 contribution of the j-th variable on the i-th sample, tim is the score value of the i-th sample on the m-th component, and ljm is the loading of the j-th variable on the m-th component.

The lack of fit statistic of PCA models can be provided by the Q residuals. Assuming that the PCA model is an approximation:

$$\mathbf{X} = \mathbf{T}\mathbf{L}^{T} + \mathbf{E} \qquad (8)$$

then Q for the i-th sample (Qi) is calculated as the sum of squares of the i-th row of E (ei) [6]:

$$Q_i = \mathbf{e}_i \mathbf{e}_i^{T} \qquad (9)$$

The Q statistic indicates how well each sample conforms to the PCA model and is a measure of the PCA residuals; ei represents the Q contributions, which define how much each variable contributes to the overall Q statistic for the sample. Finally, confidence limits can be calculated for Q at a given number of retained significant components (QM) [10]:

$$Q_M = \Theta_1 \left[ \frac{c_{\alpha}\sqrt{2\Theta_2 h_0^2}}{\Theta_1} + 1 + \frac{\Theta_2 h_0 (h_0 - 1)}{\Theta_1^2} \right]^{1/h_0} \qquad (10)$$

where c(α) is the standard normal deviate corresponding to the probability α,

$$\Theta_k = \sum_{j=M+1}^{J} \lambda_j^k \qquad (11)$$

and

$$h_0 = 1 - \frac{2\Theta_1\Theta_3}{3\Theta_2^2} \qquad (12)$$

2.3. Selection of the optimal number of principal components

The selection of components has several benefits, since the influence of variation related to noise is minimised and the interpretation is significantly supported by reducing the data dimension. Several approaches and indices to designate an optimal number of principal components have been proposed in the literature [1,11,12]. The PCA toolbox for MATLAB comprises the following methods and criteria, based both on eigenvalues and on cross-validation procedures.

Besides the scree test [13,6], two basic methods based on eigenvalues are the Average Eigenvalue Criterion (AEC, also known as Kaiser's criterion) and the Corrected Average Eigenvalue Criterion (CAEC) [14]. AEC accepts as significant only components with an eigenvalue larger than the average eigenvalue, while CAEC is the same as AEC, but simply decreases the rejection threshold by multiplying the average eigenvalue by 0.7. Note that when data are autoscaled, the average eigenvalue is equal to 1.
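To make the definitions above concrete, the following minimal MATLAB sketch computes the quantities of Eqs. (1)–(12), together with the AEC count, on an arbitrary example matrix. This is illustrative code written for this description under the stated assumptions (random example data, M = 3 retained components), not the toolbox's own implementation; finv and norminv require the Statistics Toolbox.

    % Minimal sketch of the PCA statistics of Section 2.2 (Eqs. 1-12).
    % Illustrative code, not the toolbox implementation.
    I = 50; J = 8; M = 3; alpha = 0.95;
    X  = randn(I, J);                         % example data matrix (I x J)
    Xc = bsxfun(@minus, X, mean(X));          % mean centring

    C = (Xc' * Xc) / (I - 1);                 % covariance matrix (Eq. 1)
    [Z, S2, ~] = svd(C);                      % SVD of the covariance matrix
    lambda = diag(S2);                        % eigenvalues, decreasing
    L = Z(:, 1:M);                            % loadings of the M retained PCs
    T = Xc * L;                               % scores (Eq. 4)

    EV  = lambda / sum(lambda);               % explained variance (Eq. 2)
    CEV = cumsum(lambda) / sum(lambda);       % cumulative explained variance (Eq. 3)

    T2    = sum(bsxfun(@rdivide, T.^2, lambda(1:M)'), 2);     % Hotelling's T2 (Eq. 5)
    T2lim = M * (I - 1) / (I - M) * finv(alpha, M, I - M);    % T2 limit (Eq. 6)
    Tcont = T * diag(1 ./ sqrt(lambda(1:M))) * L';            % T2 contributions (Eq. 7)

    E = Xc - T * L';                          % residual matrix (Eq. 8)
    Q = sum(E.^2, 2);                         % Q residuals per sample (Eq. 9)

    theta = @(k) sum(lambda(M+1:J) .^ k);     % Eq. 11
    h0 = 1 - 2 * theta(1) * theta(3) / (3 * theta(2)^2);      % Eq. 12
    ca = norminv(alpha);                      % standard normal deviate
    Qlim = theta(1) * (ca * sqrt(2 * theta(2) * h0^2) / theta(1) ...
           + 1 + theta(2) * h0 * (h0 - 1) / theta(1)^2) ^ (1 / h0);  % Eq. 10

    nAEC = sum(lambda > mean(lambda));        % AEC: components above the average eigenvalue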

Malinowski proposed two eigenvalue-based indices: the Imbedded Error (IE) and the Malinowski Indicator Function (IND) [15], which are defined as

$$IE_M = \left( \frac{M}{IJ} \sum_{m=M+1}^{J} \frac{\lambda_m}{J-M} \right)^{1/2} \qquad (13)$$

$$IND_M = \frac{\left( \sum_{m=M+1}^{J} \frac{\lambda_m}{I(J-M)} \right)^{1/2}}{(J-M)^2} \qquad (14)$$

These indicator functions were specifically proposed to deal with spectroscopic data. They are based on the assumption that the error is random and identically distributed in the data, and thus the eigenvalues associated with the residual error of PCA should be approximately equal. Both IE and IND are calculated as a function of the number of selected principal components, and the minimum of the function should indicate the optimal number of components to be retained in the model.

Another strategy to select significant components is based on the multivariate K correlation index, which is a multivariate approach to quantify the correlation content of a data matrix [16]. Since the K index ranges between zero (all variables are orthogonal) and 1 (all variables are perfectly correlated), a linear function (KL) and a non-linear power function (KP) can be derived from the K correlation index as follows:

$$KL = \mathrm{int}\left[ 1 + (J-1)(1-K) \right] \qquad (15)$$

$$KP = \mathrm{int}\left[ J^{(1-K)} \right] \qquad (16)$$

where K is the K correlation index and int indicates the nearest upper integer. Both functions equal 1 when K equals 1 (all J variables are mutually correlated, so one component is retained) and equal J when K equals 0 (all J variables are orthogonal). KL gives the maximum number of theoretically significant principal components, under the assumption that the information in the data is linearly distributed, while KP estimates the minimum number of theoretically significant components, under the assumption that the information in the data decreases more steeply.

Finally, another option is to estimate the optimal number of components by means of cross-validation procedures [17]. Samples are therefore divided into a number of cross-validation groups; the PCA model is then built on all but one of the groups and used to estimate the variables of the left-out samples. One variable at a time is removed and considered as missing data; the missing variable is predicted from the model and the sample observation excluding that one variable [18]. The residuals for this reconstruction are calculated as the root mean squared error in cross-validation (RMSECV) and analysed as a function of the number of components included in the PCA model. When components describing only small noise variation are added, the residuals are expected to increase.

2.4. Cluster Analysis and Multidimensional Scaling

Cluster Analysis defines groups (clusters) in the data on the basis of sample similarities [3]. Similarities among samples are estimated by means of distances: similar samples are characterised by small distances, and the opposite holds for dissimilar samples. In particular, hierarchical clustering methods use several linkage approaches to quantify the distances between groups of samples: distances are thus used to pair samples into binary clusters, and the new clusters are then grouped into larger clusters until a hierarchical dendrogram is obtained. The dendrogram encodes the clustering structure of the data, and the partition of samples is obtained by cutting the dendrogram at the desired level of similarity.

Multidimensional Scaling (MDS) also elaborates the distance (or similarity) matrix representing the internal similarity/diversity relationships of samples [4]. With respect to Cluster Analysis, MDS takes into account the mutual relationships of sample distances by reproducing the data structure encoded in the distance (similarity) matrix in a low-dimensional space. Therefore, given that MDS finds a low-dimensional representation to match a distance matrix, a scatter plot of samples in the reduced dimensional space provides a visual representation of the original data structure.

3. Main features of the PCA toolbox for MATLAB

The collection of functions and algorithms included in the toolbox is provided as MATLAB source files, with no requirement for any third-party utilities beyond the MATLAB installation. The files just need to be copied into a folder. The MATLAB Statistics Toolbox is needed to compute Cluster Analysis and Multidimensional Scaling. The toolbox was built on MATLAB 2014 and tested on previous versions back to MATLAB 2010. The model calculation can be performed both via the MATLAB command window and via a graphical user interface, which enables the user to perform all the analysis steps.

3.1. Input data

Data must be structured as a numerical matrix with dimensions I × J, where I is the number of samples and J the number of variables. If available, a qualitative (class) or quantitative response vector can be loaded. Obviously, the response vector does not influence the calculation of the PCA model, but it can be useful to enhance the interpretation of results. In fact, samples can be coloured in the score plot on the basis of their response values, and this can help the user to better identify trends in the sample distribution. The response vector must be loaded as a numerical column vector (I × 1), where the i-th element of this vector represents the response value of the i-th sample. When dealing with qualitative responses, if G classes are present, class labels must be defined as integer numbers ranging from 1 to G. Sample and variable labels can be loaded as cell array vectors.

3.2. Calculating models

Once data have been prepared, the user can easily calculate PCA, MDS, or Cluster Analysis via the MATLAB command window. The MATLAB functions associated with these methods are listed in Table 1; a hedged usage sketch is given after Table 1. The output of these functions is a MATLAB structure array, where results are stored together with the settings used for the calculation, as described in Table 1.

The "pca_compsel" routine can be used to calculate indices and parameters for the estimation of the optimal number of significant components to be retained. In particular, the numbers of components suggested by the different procedures (AEC, CAEC, KL, and KP), as well as the indices calculated as a function of the number of components (eigenvalues, explained and cumulative variance, RMSECV, IND, IE), are stored in the output structure array. As previously described, when dealing with the selection of principal components, one option is to calculate the root mean squared error in cross-validation (RMSECV). This can be performed by choosing the number of cross-validation groups and the cross-validation method for dividing the samples into groups (venetian blinds or contiguous blocks). In order to clarify the difference between the two approaches, the following example is given (a MATLAB sketch reproducing it follows this paragraph). Let the dataset be composed of 9 samples, divided into 3 cross-validation groups (each therefore constituted by 3 samples). The venetian blinds approach distributes the samples in the following groups: [1,0,0,1,0,0,1,0,0], [0,1,0,0,1,0,0,1,0], and [0,0,1,0,0,1,0,0,1]. On the contrary, the data split in contiguous blocks would be [1,1,1,0,0,0,0,0,0], [0,0,0,1,1,1,0,0,0], and [0,0,0,0,0,0,1,1,1]. Obviously, the choice of the suitable type of cross-validation depends on how the samples are listed in the dataset.
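The group assignments of this example can be reproduced with a few lines of MATLAB. This is an illustrative sketch with arbitrary variable names, not the toolbox's internal splitting code:

    n = 9; g = 3;                             % 9 samples, 3 cross-validation groups
    venetian   = mod((0:n-1)', g) + 1;        % venetian blinds: 1,2,3,1,2,3,1,2,3
    contiguous = ceil((1:n)' * g / n);        % contiguous blocks: 1,1,1,2,2,2,3,3,3
    % membership of the first group, in the 0/1 notation used above
    first_venetian   = (venetian == 1)'       % -> [1 0 0 1 0 0 1 0 0]
    first_contiguous = (contiguous == 1)'     % -> [1 1 1 0 0 0 0 0 0]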

Table 1
MATLAB routines of the toolbox related to the calculation of models and their main outputs. For each routine, outputs are collected as fields of a single MATLAB structure array; I is the number of samples, J the number of variables, M the number of retained principal components.

Routine name    Description        Outputs    Description
pca_model       PCA                E          Eigenvalues (M × 1)
                                   exp_var    Explained variance % (M × 1)
                                   cum_var    Cumulative explained variance % (M × 1)
                                   T          Scores (I × M)
                                   L          Loadings (J × M)
                                   Thot       Hotelling's T2 (I × 1)
                                   Tcont      Hotelling's T2 contributions (I × J)
                                   Qres       Q residuals (I × 1)
                                   Qcont      Q residuals contributions (I × J)
                                   set        Structure with the settings used to calculate PCA (scaling parameters, Hotelling's T2 and Q confidence limits)
cluster_model   Cluster Analysis   L          Linkage results
                                   D          Distance matrix (I × I)
                                   set        Structure with the settings used to calculate hierarchical Cluster Analysis (scaling parameters, type of distance, type of linkage)
mds_model       MDS                T          Configuration matrix (I × MDS dimensions)
                                   D          Distance matrix (I × I)
                                   set        Structure with the settings used to calculate MDS (scaling parameters, type of distance)
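A plausible command-window session is sketched below. The output field names (T, L, Thot, Qres, ...) are those documented in Table 1, while the input arguments shown here are assumptions made for illustration only; the actual signatures of pca_model, cluster_model, and mds_model are documented in the toolbox help.

    % Hedged sketch of command-window use; field names follow Table 1,
    % while the input arguments are ASSUMED for illustration only
    % (check the toolbox help for the actual signatures).
    X = randn(50, 8);                         % example data matrix (I x J)
    res = pca_model(X, 3, 'auto');            % assumed call: 3 PCs, autoscaling
    scores   = res.T;                         % scores (I x M), see Table 1
    loadings = res.L;                         % loadings (J x M)
    t2 = res.Thot;                            % Hotelling's T2 (I x 1)
    q  = res.Qres;                            % Q residuals (I x 1)
    % cluster_model and mds_model return structures with the fields
    % listed in Table 1 (e.g. the distance matrix D); their arguments
    % are likewise documented in the toolbox help.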

Finally, when dealing with PCA, new samples can be projected into the PC space by using an existing model. This calculation can be made in the toolbox by means of the function "pca_project", which returns a structure array containing the scores, Hotelling's T2, Hotelling's T2 contributions, Q residuals, and Q residuals contributions of the new projected samples.

3.3. Calculating models via the graphical user interface

The following command must be executed at the MATLAB prompt to run the graphical interface of the toolbox (Fig. 1):

>> pca_gui

The user can load data, labels of samples and variables, and a qualitative or quantitative response vector (if available), either from the MATLAB workspace or from MATLAB files.

Some basic graphical tools are provided to give an initial insight into the data. The preliminary data screening gives the user the possibility to graphically evaluate and analyse the distributions of variables and therefore select an adequate data scaling procedure. In particular, profiles of samples and profiles of variable averages can be analysed on both the raw and pre-treated data as a function of different scaling procedures (autoscaling, mean centring, range scaling between 0 and 1). Moreover, boxplots, histograms, and biplots of variables can be drawn. If a qualitative response (class) is loaded, then averages, boxplots, and histograms can also be drawn for each specific class of samples. Similarly, these tools can be used at the end of the PCA analysis in order to decode and corroborate the data trends and information contained in the principal components.

In addition to these basic screening procedures, all the calculation steps described in the previous paragraph can be performed in the graphical interface.

The selection of the significant components to be retained in the PCA model can be evaluated directly in a proper window, where the user can initially define the data scaling and cross-validation procedures and then analyse the results graphically. Eigenvalues, explained variance, cumulative explained variance, RMSECV, IE, and IND can be plotted as a function of the components included in the model. Since some of these indices can show great variation, especially for the highest numbers of components, the number of components shown in the plot can be changed and reduced. When plotting eigenvalues, two horizontal lines are drawn; these correspond to the average eigenvalue and to the average eigenvalue multiplied by 0.7, which are the thresholds suggested by Kaiser for the AEC and CAEC selection criteria [14]. Moreover, the numbers of components suggested by the eigenvalue-based methods (AEC, CAEC, KL, KP) are labelled.

Once a model has been calculated, results can be saved in the MATLAB workspace. Saved models can be easily loaded in the toolbox for future analyses. When dealing with PCA, new samples can be loaded and projected into PCA models which were previously calculated on different sets of samples.
Fig. 1. PCA toolbox for MATLAB: main graphical interface.



Fig. 2. PCA toolbox for MATLAB: interactive graphical interface for visualising results of Principal Component Analysis.

3.4. Visualising results via the graphical user interface

PCA results can be visualised in the toolbox graphical interface. The numerical values of the eigenvalues and of the explained and cumulative variances can be both investigated and plotted as a function of the retained components.

Scores and loadings can be analysed in a proper window (Fig. 2), where several options are given to the user in order to explore the results. The score plot represents the sample coordinates in the PC space and allows visual investigation of the data structure by analysing sample positions and their relationships. Samples can be labelled with different strings (identification numbers or user-defined labels) and coloured if a response was previously loaded. When dealing with qualitative responses, samples are coloured on the basis of their experimental class, while when dealing with quantitative responses, samples are coloured in a grey scale, ranging from white (minimum response value) to black (maximum response value). Besides scores, the Hotelling's T2 and Q residuals of samples can be plotted together in the so-called influence plot (together with their 95% confidence levels), where it is possible to easily identify outliers. When dealing with outliers or particular trends in the data, it is useful to have a quick idea of the reasons why these samples show extreme behaviours. Therefore, the user can select a specific sample on the score or influence plot by positioning the cursor with the mouse; this operation opens a new figure where the variable profiles of the raw and scaled data of that sample, as well as its Hotelling's T2 contributions and Q contributions, are plotted. On the other hand, the coefficients of the variables in defining the principal components can be evaluated in the loading plot, where variables can be labelled with identification numbers or user-defined labels.

Fig. 3. Example of analysis: profile of averages of variables for each class on the a) raw data and b) autoscaled data.
When dealing with hierarchical cluster analysis, the dendrogram encoding the clustering structure of the data can be visualised in the graphical interface. The partition of samples can thus be obtained and saved by defining the number of clusters; the dendrogram will therefore be cut and coloured at the similarity level which gives the chosen
number of clusters. Finally, the scatter plot of samples in the reduced di-
mensional space provided by MDS can be visualised. As for the PCA
score plot, samples can be labelled and coloured on the basis of a qual-
itative or quantitative response and the user can graphically select a
specific sample to get the variable profiles on the raw and scaled data
of that sample.
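The type of computation performed behind these clustering and MDS views can be illustrated with standard Statistics Toolbox functions. The sketch below is not the toolbox's GUI code; it only shows, under assumed example data, the distance/linkage/dendrogram and classical MDS steps described in Section 2.4:

    % Illustrative sketch (standard Statistics Toolbox calls, not the
    % toolbox GUI) of hierarchical clustering and MDS on a distance matrix.
    X = randn(30, 5);                     % example autoscaled data
    D = pdist(X);                         % pairwise sample distances
    Z = linkage(D, 'average');            % hierarchical clustering (average linkage)
    dendrogram(Z);                        % dendrogram of the clustering structure
    labels = cluster(Z, 'maxclust', 3);   % partition by cutting into 3 clusters
    Y = cmdscale(squareform(D));          % classical MDS configuration
    figure; scatter(Y(:,1), Y(:,2));      % scatter plot of the first two MDS dimensions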

4. Illustrative example

In the following paragraphs, an example of the application of the PCA toolbox for MATLAB is given on the aquatic toxicity dataset, which is provided together with the toolbox. Note that this example is intended just to highlight the main features of the PCA toolbox.

The aquatic toxicity dataset was used to calibrate a Quantitative Structure-Activity Relationship (QSAR) model to predict acute aquatic toxicity towards Daphnia magna [13]. It consists of 545 organic molecules (samples), each described by 8 molecular descriptors (variables). Molecules were originally divided into training (436) and external test (109) sets with a random selection. The training set is here used to calculate the PCA model, while the test molecules are projected into the model to validate the data distribution. An experimental quantitative response (LC50) is associated to each molecule; LC50 is the concentration that causes death in 50% of test Daphnia magna over a test duration of 48 hours. Lethal concentrations were first converted to molarity and then transformed to a logarithmic scale (−log mol/L). Moreover, samples were divided into two experimental classes: class 1 includes samples with LC50 values lower than 4, while class 2 includes samples with LC50 values higher than 4. Note that this threshold does not have a toxicological meaning, but it was used here just in order to exemplify the use of the toolbox.

4.1. Data screening

Among the basic tools for preliminary data screening available in the toolbox, an inspection of the variable distributions can be easily performed by plotting the profiles of variable averages on the raw and scaled data. Fig. 3 is the result of the "view > plot profiles" toolbox menu and shows the averages of the variables on the raw and autoscaled data, calculated on each of the two experimental classes and plotted as a function of the variables. An evaluation of the profiles clearly indicates that the first two variables have different scales on the raw data, and therefore autoscaling can be chosen in the subsequent modelling steps as a proper procedure for data scaling. Another option to get a similar overview would be the analysis of boxplots or histograms of each variable in the "view > plot univariate stat" menu.

4.2. Selection of the optimal number of components

Criteria and indices to evaluate the optimal number of principal components can be calculated and easily analysed in the graphical user interface. Fig. 4 shows some of the plots which can be obtained with the "calculate > optimal components for PCA" menu. In particular, eigenvalues, cumulative explained variance, and RMSECV are plotted as a function of the components included in the PCA models, which were calculated by autoscaling the toxicity dataset and cross-validated with 5 groups divided in venetian blinds. Note that, since the data are autoscaled, the horizontal line corresponding to the average eigenvalue has a value equal to 1. The Malinowski indices (IE and IND) are not analysed here, since they were specifically proposed to deal with spectroscopic data. Looking at the plots of Fig. 4, the selection of three components can be a suitable compromise between data reduction and preservation of information. In fact, the vertical lines on the eigenvalue plot identify the components suggested by the eigenvalue-based methods, and in this case, the AEC, CAEC, and KP criteria all suggest three components as the most suitable solution (Fig. 4a). Thus, only these components have eigenvalues greater than 0.7; this can be checked by looking at the numerical values by means of the "view eigenvalues" button (Table 2). On the other hand, the first three components explain almost 80% of the variation in the data, as shown in the cumulative explained variance plot (Fig. 4b), while the residuals in cross-validation significantly increase with the selection of more than three components (Fig. 4c). A small numerical check of the eigenvalue-based criteria on these values is sketched after Table 2.

Fig. 4. Example of analysis: plots of a) eigenvalues, b) cumulative explained variance, and c) RMSECV as a function of components included in the PCA model; vertical lines on the eigenvalue plot identify components suggested by eigenvalue-based methods (AEC, CAEC, and KP), horizontal lines correspond to the average eigenvalue (red) and the average eigenvalue multiplied by 0.7 (blue).

Table 2
Example of analysis: eigenvalues, explained variance (%), and cumulative explained variance (%) associated to the principal components.

Principal component   Eigenvalue   Explained variance %   Cumulative explained variance %
PC1                   3.6855       46.07                   46.07
PC2                   1.5137       18.92                   64.99
PC3                   1.0681       13.35                   78.34
PC4                   0.68167       8.52                   86.86
PC5                   0.61763       7.72                   94.58
PC6                   0.25285       3.16                   97.74
PC7                   0.10159       1.27                   99.01
PC8                   0.00720       0.09                  100.00
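The AEC and CAEC decisions can be verified directly from the eigenvalues of Table 2 with two lines of MATLAB; this is an illustrative check, not toolbox code. Note that with autoscaled data the average eigenvalue equals 1 up to rounding:

    ev = [3.6855 1.5137 1.0681 0.68167 0.61763 0.25285 0.10159 0.00720];
    nAEC  = sum(ev > mean(ev))        % AEC:  eigenvalues above the mean -> 3
    nCAEC = sum(ev > 0.7 * mean(ev))  % CAEC: threshold lowered by 0.7  -> 3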

Fig. 5. Example of analysis: score plot of the first two principal components: a) each sample is coloured with a grey scale on the basis of its experimental quantitative response (LC50);
b) each sample is coloured on the basis of its experimental class.

4.3. PCA calculation and interpretation with the graphical interface

As a consequence of the workflow described so far, the data can be autoscaled and three components can be retained when calculating PCA on the training molecules of the aquatic toxicity dataset ("calculate > PCA" toolbox menu). Results can thus be easily accessed by means of the "results > view scores and loadings" menu, which opens a specific window where scores, loadings, and other statistics are available to analyse the data structure.

In Fig. 5, the score plot of the first two components (explaining together 65% of the total amount of information) is shown. In Fig. 5a, each sample is coloured with a grey scale on the basis of its experimental quantitative response (LC50): the larger the value of LC50, the darker the colour. In Fig. 5b, the option of colouring samples on the basis of their experimental class is shown. Despite an evident overlap of the classes, which is quite common in QSAR data due to experimental variability, both plots highlight a potential trend of separation. In fact, there is a systematic distribution of samples with respect to both the qualitative and quantitative experimental values, and LC50 mainly increases along the second component.

Thus, one can evaluate how variables characterise the classes by comparing the score and loading plots (Fig. 6). Since MLOGP (the octanol-water partition coefficient) is characterised by a negative loading on both the first and second principal components, this variable would be expected to have an apparent weight in describing molecular toxicity and therefore in discriminating the analysed experimental classes.

At this stage, the basic univariate tools provided with the toolbox can be used to support the data trends revealed by PCA. With the "view > plot univariate stat" menu, it is therefore possible to analyse the sample distribution along each variable by means of histograms, boxplots, or biplots. The distribution of the samples of the two experimental toxicity classes along MLOGP is shown in the histograms of Fig. 7 and confirms what was previously estimated in the PCA score and loading plots.

Another option provided in the PCA result window of the toolbox is the plot of Hotelling's T2 against the Q residuals of the samples, also known as the influence plot (Fig. 8). There are two clear outliers (samples 371 and 373) beyond the 95% confidence limits. Since it is preferable to analyse outliers before removing them from the data, sample 373 was selected by clicking the button "view sample", which enables the positioning of the cursor on the plot with the mouse. This operation produces the variable profiles of the raw and scaled data, as well as the Hotelling's T2 and Q contribution plots, as shown in Fig. 9. These tools can greatly enhance the outlier investigation; in particular, looking at the T2 contributions, we can point out the variables which mostly contribute to the high Hotelling's T2 value of sample 373 and consequently to its extreme score values (TPSA, SAacc, MLOGP, nN and H-050). It is therefore possible to understand how these variables contribute: for example, the negative contribution of MLOGP is due to the extremely low value of MLOGP for sample 373. The same conclusion could also be derived from the opposite position of sample 373 in the score plot of the first two components (upper right, Fig. 5) with respect to the position of MLOGP in the corresponding loading plot (bottom left, Fig. 6). Similar analyses can be carried out by looking at the Q residuals (Fig. 9d), where it is evident that sample 373 has high residual variation on the variables H-050 and nN.

Fig. 6. Example of analysis: loading plot of the first two principal components; each variable is identified with its label.

Fig. 7. Example of analysis: histograms of MLOGP for the two experimental toxicity classes.

Fig. 8. Example of analysis: plot of Hotelling's T2 against the Q residuals of samples; extreme samples are labelled with their identification number; red lines correspond to the 95% confidence limits; samples are coloured with a grey scale on the basis of their experimental quantitative response (LC50).

Finally, the model can be saved in the MATLAB workspace and later loaded in the toolbox to predict new sets of samples. This was done on the 109 test samples of the dataset under analysis. The test samples were therefore projected into the PCA space and their distribution analysed by looking again at the score and influence plots (Fig. 10), where training and test samples are shown with different marks. As an option, experimental responses could also be loaded for the test samples; in this case, the same colours would be used for both training and test samples to discriminate the different experimental values.

5. Independent testing

Prof. Michelle Sergent, at the Département de Chimie, Aix Marseille Université, LISA EA4672, 13397, Marseille Cedex 20, France, has informed us that she tested the described software and found that it appears to function as the authors described.

6. Conclusion

The PCA toolbox for MATLAB is a collection of modules for calculating Principal Component Analysis, Cluster Analysis, and Multidimensional Scaling for the unsupervised analysis of multivariate datasets. The toolbox is freely available via the Internet from the Milano Chemometrics and QSAR Research Group website [5]. It aims to be useful for both beginners and advanced users of MATLAB and chemometrics. For this reason, examples and an extensive user manual are provided with the toolbox.

The toolbox comprises a graphical user interface (GUI), which allows the calculations to be carried out in an easy-to-use graphical environment. In the GUI, all the analysis steps (data loading, model settings, principal component selection, calculation, results visualisation, and projection of new samples in the PCA model) can be easily performed.

Fig. 9. Example of analysis: variable profiles of a) autoscaled data and b) raw data of sample 373 and its c) Hotelling's T2 contributions and d) Q contributions.

Fig. 10. Example of analysis: a) score plot of the first two principal components and b) Hotelling's T2 and Q residuals plot with training samples (white circles) and projected test samples
(black asterisks).

Conflict of interest

The author declares that there are no conflicts of interest.

References

[1] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.
[2] W.J. Krzanowski, Principles of Multivariate Analysis, Clarendon Press, Oxford, 2000.
[3] D.L. Massart, L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, 1983.
[4] G.A.F. Seber, Multivariate Observations, John Wiley & Sons, Hoboken, NJ, USA, 2008.
[5] Weblink to download the PCA toolbox at the Milano Chemometrics and QSAR Research Group website, http://michem.disat.unimib.it/chm/download/pcainfo.htm.
[6] R. Bro, A. Smilde, Anal. Methods 6 (2014) 2812–2831.
[7] J.E. Jackson, A User's Guide to Principal Components, Wiley, New York, 1991.
[8] S. Wold, K.H. Esbensen, P. Geladi, Chemom. Intell. Lab. Syst. 2 (1987) 37–52.
[9] B. Wise, N. Gallagher, R. Bro, J. Shaver, PLS Toolbox 3.0, Manson, WA, 2003.
[10] J.E. Jackson, G.S. Mudholkar, Technometrics 21 (1979) 341–349.
[11] M. Meloun, J. Capek, P. Miksik, R.G. Brereton, Anal. Chim. Acta 423 (2000) 51–68.
[12] M. Wasim, R.G. Brereton, Chemom. Intell. Lab. Syst. 72 (2004) 133–151.
[13] R.B. Cattell, Multivar. Behav. Res. 1 (1966) 245–276.
[14] H.F. Kaiser, Educ. Psychol. Meas. 20 (1960) 141–151.
[15] E.R. Malinowski, D.G. Howery, Factor Analysis in Chemistry, Wiley, New York, 1980.
[16] R. Todeschini, Anal. Chim. Acta 348 (1997) 419–430.
[17] B. Wise, N.L. Ricker, in: K. Najim, E. Dufour (Eds.), IFAC Symposium on Advanced Control of Chemical Processes, Toulouse, France, 1991, pp. 125–130.
[18] R. Bro, K. Kjeldahl, A. Smilde, H.A.L. Kiers, Anal. Bioanal. Chem. 390 (2008) 1241–1251.
