
JOURNAL OF CHEMOMETRICS
J. Chemometrics 2007; 21: 427–439
Published online 4 September 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/cem.1086

A randomization test for PLS component selection


Susanne Wiklund1*, David Nilsson1, Lennart Eriksson2, Michael Sjöström1, Svante Wold1 and Klaas Faber3

1 Research Group for Chemometrics, Department of Chemistry, Umeå University, S-901 87 Umeå, Sweden
2 Umetrics, Tvistevägen 48, P.O. Box 7960, S-907 19 Umeå, Sweden
3 Chemometry Consultancy, Rubensstraat 7, 6717 VD Ede, The Netherlands

*Correspondence to: S. Wiklund, Research Group for Chemometrics, Department of Chemistry, Umeå University, S-901 87 Umeå, Sweden. E-mail: susanne.wiklund@chem.umu.se

Received 7 August 2006; Revised 29 November 2006; Accepted 12 March 2007

During the last two decades, a number of methods have been developed and evaluated for selecting the optimal number of components in a PLS model. In this paper, a new method is introduced that is based on a randomization test. The advantage of using a randomization test is that, in contrast to cross validation (CV), it requires no exclusion of data, thus avoiding problems related to data exclusion, for example in designed experiments. The method is tested using simulated data sets for which the true dimensionality is clearly defined, and it is also compared to regularly used methods for 10 real data sets. The randomization test works as a good statistical selection tool in combination with other selection rules. It also works as an indicator of when the data require a pre-treatment. Copyright © 2007 John Wiley & Sons, Ltd.

KEYWORDS: randomization test; permutation test; component selection; factor selection; latent variable selection;
partial least squares

1. INTRODUCTION

Partial least squares projections to latent structures (PLS) is a commonly used regression method for the calibration of multivariate data [1]. The selection of the number of PLS components (also known as latent variables or factors) is a crucial step in reaching the optimal model. Retaining too few components implies that the calibration data are under-fitted and there is still information left that can be modeled. Choosing too many components leads to over-fitting. This means that, although the calibration data are well described, the model will have inferior predictive ability for future objects due to spuriously incorporated noise.

In 1996, Denham [2] noted that 'Unfortunately, the statistical behavior of PLS regression is still not well understood and this has meant that it has been extremely difficult to perform the usual inferential tasks associated with modeling, such as choosing an optimal model, assessing uncertainty in coefficient estimates and producing prediction intervals for future responses. As a result, a number of rules for choosing an optimal model have been suggested with varying degrees of statistical validity.' During the last decade, this statement has not lost any of its validity. Indeed, developing [3–12] as well as comparing [2,6,13,14] PLS component selection rules have been, and apparently continue to be, subjects of active research in chemometrics.

We present a new randomization test that intends to provide objective guidance for PLS component selection. This type of test is well known in statistics. Fisher [15] introduced the principle in 1935, which is why it is often called Fisher's randomization test. Randomization tests have received considerable attention in the chemometrics-oriented literature, where they have been proposed for testing the significance of models (e.g. obtained after extensive variable selection) [16–19], testing the equality of performance of two prediction methods [20,21] and also for selecting the number of components in principal component analysis (PCA) [22] and PLS [23]. It is important to note that the currently proposed method intends to improve on the one introduced by Sheridan et al. [23]. To demonstrate this point, simulated data sets are used for which the true dimensionality is clearly defined.

The objective of this study is to compare the new randomization test with three regularly used methods, that is, cross validation (CV) [24,25], leverage correction [26–29] and the size of the eigenvalue [30]. The results for the regularly used methods are obtained through two chemometric software programs, namely SIMCA 11.0 (Umeå, Sweden) [30] and Unscrambler v8.0.5 (Oslo, Norway) [31].

Two final notes seem to be in order. First, we restrict ourselves to 'conventional' implementations of CV. We are fully aware that Monte Carlo CV [32–34] has recently been introduced in chemometrics for component [35–37], as well as variable [19,37–40], selection. Typical for Monte Carlo CV is the deletion of rather large segments, often even (much) larger than half the calibration set.




The results of a simulation study indicate that the method may outperform 'conventional' implementations for small and medium-sized calibration sets [35]. However, Gourvénec et al. [36] noted that a major concern remains with this method, namely whether making a calibration model with so few objects will be representative for the structure of the data. Moreover, since our goal is to compare a new randomization test with regularly used methods, we consider its inclusion out of the scope of this paper. Second, other statistical tests have been proposed [3,4,9,11]. We prefer not to include them either. Instead, they are discussed in Section 2.7.

2. PLS COMPONENT SELECTION

This section presents a description of the selection rules under study. Although there are many ways to select the rank of a model, the optimal one is to use a large external validation set that is not involved in the model building stage. Unfortunately, this requires highly redundant data sets; hence the model must usually be validated the best alternative way. Mean-centered models are assumed throughout, but this does not affect the generality of the presentation.

2.1. CV

CV amounts to a number of rounds of setting calibration objects apart, for which predictions are obtained using the model that describes the remainder of the objects [24,25]. In this study we restrict ourselves to two common variants of CV, namely leave-one-out (LOO) CV, where single calibration objects are set apart, and so-called r-fold CV [41], where the calibration objects are divided into r segments, here r = 7. For these variants all objects are left out once. The root mean square error of CV (RMSECV) is calculated as

$$\mathrm{RMSECV} = \sqrt{N^{-1}\sum_{i=1}^{N}\bigl(y_{\mathrm{obs},i} - y_{\mathrm{pred},i}^{A}\bigr)^{2}} = \sqrt{\frac{\mathrm{PRESS}}{N}} \qquad (1)$$

where N is the number of objects in the calibration set, y_obs,i is the observed response for object i and y^A_pred,i is the associated prediction with A components. The summation under the radical sign is the prediction error sum of squares (PRESS). The division by N to obtain an average measure of prediction uncertainty is motivated by CV not consuming degrees of freedom, unlike leverage correction (see below).

Ideally, the model with the lowest RMSECV is considered to be the optimal one. In practice, however, a clear global minimum, achievable for a large external validation set, is often not obtained and one has to resort to softer rules such as 'the first local minimum' or 'the beginning of a plateau'. Furthermore, there are various ways of implementing the exclusion procedure in CV, which may considerably impact the actual CV performance. The results may also depend on whether mean centering is applied prior to PLS modeling only, or for each deletion step in CV [42]. Consequently, CV, like most statistics based on real data, is inherently somewhat uncertain. However, as seen in the results section, the prediction error (RMSECV) has almost identical values for models over a range of numbers of components. The important issue is that CV or the randomization test correctly finds this range; the actual value of the number of components is immaterial as long as the prediction error is close to its minimum.
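To make Equation (1) concrete, the following is a minimal sketch under stated assumptions: scikit-learn's PLSRegression as the PLS engine and its LeaveOneOut/KFold splitters for the two CV variants. It illustrates the calculation only; the results in this paper were obtained with SIMCA, Unscrambler and Matlab.

```python
# Minimal RMSECV sketch for Equation (1); assumes scikit-learn.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, LeaveOneOut

def rmsecv(X, y, n_components, splitter):
    """sqrt(PRESS / N): every object is predicted while left out of the fit."""
    press = 0.0
    for train, test in splitter.split(X):
        # scale=False keeps the model mean-centered only, as assumed in the
        # text; the centering is redone for each deletion step.
        model = PLSRegression(n_components=n_components, scale=False)
        model.fit(X[train], y[train])
        press += np.sum((y[test] - model.predict(X[test]).ravel()) ** 2)
    return np.sqrt(press / len(y))

# Usage: LOO CV and sevenfold CV over a range of model ranks.
# for A in range(1, 11):
#     print(A, rmsecv(X, y, A, LeaveOneOut()),
#           rmsecv(X, y, A, KFold(n_splits=7, shuffle=True, random_state=0)))
```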
2.2. Leverage correction

Leverage correction is a simplified method used for validation and component selection [26–29]. The goal of leverage correction is to convert a fit residual into a prediction residual, without performing any actual predictions as in CV. The leverage is a measure of the influence of an object on the PLS model and is closely related to the Mahalanobis distance and Hotelling's T². The leverage of object i using A components is calculated as

$$h_{i}^{A} = \frac{1}{N} + \sum_{a=1}^{A}\frac{t_{ia}^{2}}{\mathbf{t}_{a}^{\mathrm{T}}\mathbf{t}_{a}} \qquad (2)$$

where t_a denotes the orthogonal score vector of component a (a = 1, ..., A) and t_ia is the object score value. It is noted that mean centering is included in the leverage calculation as the N⁻¹ term. The leverage-corrected fit residual is obtained as

$$f_{i,\mathrm{LC}}^{A} = \frac{y_{\mathrm{obs},i} - y_{\mathrm{fit},i}^{A}}{1 - h_{i}^{A}} \qquad (3)$$

where y^A_fit,i is the fitted response for object i with A components. The root mean square error of leverage correction (RMSELC) then follows as

$$\mathrm{RMSELC} = \sqrt{\mathrm{df}^{-1}\sum_{i=1}^{N}\bigl(f_{i,\mathrm{LC}}^{A}\bigr)^{2}} \qquad (4)$$

where df stands for degrees of freedom. This conversion of fit residuals to predictive ones is correct for multiple regression models with orthogonal variables, including principal component regression (PCR), but inappropriate for PLS. As a general principle, degrees of freedom are consumed for each piece of information that enters the model building stage. Consequently, PLS consumes more degrees of freedom than top-down PCR since the response values are also deployed. This may explain why Lorber and Kowalski [28] found leverage correction to work better for top-down PCR than for PLS. Consequently, when applied to a PLS model, leverage correction should be regarded only as a 'quick-and-dirty' alternative to the more time-consuming CV. It is noted that van der Voet [43] has proposed more rigorous degrees of freedom. Unfortunately, their calculation requires CV; hence their utility is limited in the current context.
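Equations (2)–(4) can be sketched directly from the PLS score matrix. Scikit-learn is again an assumption, and so is the choice df = N − A − 1 (fitted components plus the mean): the text leaves the degrees of freedom for PLS unspecified.

```python
# Leverage-corrected error sketch, Equations (2)-(4); assumes scikit-learn.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rmselc(X, y, n_components):
    N = len(y)
    model = PLSRegression(n_components=n_components, scale=False).fit(X, y)
    T = model.x_scores_                                          # N x A scores
    h = 1.0 / N + np.sum(T**2 / np.sum(T**2, axis=0), axis=1)    # Eq. (2)
    f_lc = (y - model.predict(X).ravel()) / (1.0 - h)            # Eq. (3)
    df = N - n_components - 1     # assumed; see van der Voet [43] for rigor
    return np.sqrt(np.sum(f_lc**2) / df)                         # Eq. (4)
```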
2.3. Size of the eigenvalue

The eigenvalue is related to the amount of variation explained by a component. It is calculated in SIMCA as [30]

$$\lambda(A) = \min(N, K)\,\frac{\mathrm{SSX}(A)}{\mathrm{SSX}(0)} \qquad (5)$$

where min(N, K) is either the number of objects (N) or variables (K) in the X-data matrix, SSX(A) is the sum of squares explained by component A and SSX(0) is the total sum of squares that is initially present in the X-data (A = 0). This formula enables one to see how many objects (N) or variables (K) are explained by a component. For example, PLS is often used to calibrate X-data for which the number of objects is (much) smaller than the number of variables. In that case, the eigenvalue will show how many objects are explained by the component. The relative amount of variation explained by a model follows as the ratio of the sum of the SSX(A)s and SSX(0). It should be noted that the eigenvalue is less popular for PLS component selection than CV or leverage correction, although it constitutes the basis for many component selection rules in PCA [44].

2.4. SIMCA

1. A component is significant according to Rule 1 when the amount of predicted (CV) variance Q² > limit. With a PLS model, Limit = 0 for models with more than 100 objects and Limit = 0.05 for models with 100 objects or less.
2. A component is significant according to Rule 2 when at least one Y-variable for a PLS model has Q²_V > limit. Q²_V is the fraction from one Y-variable, that is Q²_V = 1 − PRESS/SS_y, where SS_y is the sum of squares for one response vector y. The total from all responses is Q².

A PLS component is not significant if:

1. Rule 1 or 2 is not fulfilled.
2. The data have an insufficient number of degrees of freedom after the previous component, that is, if (N − A) or (K − A) = 0.
3. The explained variance for Y (PLS) is less than 1% and no single Y-variable has more than 3% (PLS) explained variance.
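Rule 1 reduces to a one-liner once PRESS is available; taking SS_y as the sum of squares of the response residuals entering the component is our reading of the text:

```python
# SIMCA Rule 1 sketch: Q2 = 1 - PRESS/SSy against the object-count limit.
def rule1_significant(press, ss_y, n_objects):
    q2 = 1.0 - press / ss_y
    limit = 0.0 if n_objects > 100 else 0.05
    return q2 > limit
```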
2.5. Unscrambler

For Unscrambler [31], the selection follows the software recommendation for CV and leverage correction. This recommendation is based on the minimization of the following criterion [45]:

$$\mathrm{Crit}(A) = \mathrm{RMSEV}(A) + c \cdot A \cdot \mathrm{RMSEV}(0) \qquad (6)$$

where RMSEV stands for the root mean square error of validation (irrespective of how this has been estimated), the number of components is given in parentheses and c denotes a 'punish factor' for adding components. RMSEV(0) symbolizes the validation error (RMSECV or RMSELC) when the model consists of the center only (zero components), and is therefore expected to resemble the spread in the validation responses. The second term on the right-hand side constitutes a penalty for adding components. When RMSEV is approximately equal for two models, the smaller one is favored by criterion (6). The purpose is therefore to add robustness to the component selection. It is important to note, however, that the optimum 'punish factor' depends on the data. Martens and Dardenne [45] give the value c = 0.05. Westad and Martens [46] mention that 'The "punish factor" may be tuned to a smaller value for models with low explained y-variance [relatively high RMSEP (RMSEP stands for root mean square error of prediction, a synonym for RMSEV)]. Three per cent seems to be a good ad hoc value, although setting it lower will not affect the model performance in general'. However, the value c = 0.01, as deployed by Høy et al. [47], may lead to severe under-fitting [48]. Finally, Lorber and Kowalski [28] have found that CV and leverage correction perform similarly for top-down PCR, but not for PLS; hence one should not expect a single value to work well for both validation methods when PLS is used.
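Equation (6) applies to any validation-error sequence; a sketch with the punish factor c as a parameter (c = 0.05 per Martens and Dardenne [45]):

```python
# Unscrambler-style recommendation, Equation (6): pick A minimizing Crit(A).
import numpy as np

def recommend_rank(rmsev, c=0.05):
    """rmsev[A] is the validation error with A components (A = 0, 1, ...)."""
    rmsev = np.asarray(rmsev, dtype=float)
    crit = rmsev + c * np.arange(len(rmsev)) * rmsev[0]
    return int(np.argmin(crit))
```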
2.6. The proposed randomization test

2.6.1. Why a statistical test?
PLS component selection is most frequently carried out in practice using some sort of validation. However, each validation-based selection rule has drawbacks. CV and leverage correction handle the available data economically but, like any data-based statistic, give interval results and hence sometimes yield either an under-fit or an over-fit, that is, they reach the minimum RMSEV for a lower or higher model rank than would be achieved using an infinitely large independent validation set. In addition, the estimated RMSEs carry a considerable uncertainty, as noted by Martens and Dardenne [45]. When extreme outliers are removed, the mean squared error (the square of RMSEV) is distributed approximately proportional to a χ²-variable [49]. As a result, the relative uncertainty can be estimated as $\sqrt{1/(2N)}$, in which N is the number of objects. This estimate can be used in the selection rule given by Breiman et al. [41], which starts with a candidate model and works backwards, see Figure 1. However, this procedure does not account for the fact that the models are nested; hence subsequent RMSEs are correlated.

In summary, it is not clear with CV-based selection rules how the uncertainty in the input data translates into an uncertainty in the output number of PLS components. Here we try to develop a distribution-based statistical test that may provide more objective guidance.

2.6.2. The procedure for successive PLS components
The proposed method assesses the statistical significance of each individual component that enters the model. Theoretical approaches to achieve this goal (using a t- or F-test) have been put forth, but they are all based on unrealistic assumptions about the data, for example the absence of spectral noise (see Section 2.7). A pragmatic data-driven approach is therefore called for. The randomization test is entirely data-driven and therefore ideally suited for avoiding unrealistic assumptions. For an excellent description of this methodology, see van der Voet [20]. The rationale behind the randomization test in the context of regression modeling is illustrated in Figure 2. Randomization amounts to permuting indices. For that reason, the randomization test is often referred to as a permutation test. In quantitative structure-activity relationship (QSAR) applications it is known as 'Y-scrambling'. Clearly, a complete scrambling of the elements of Y while keeping the corresponding numbers in X fixed destroys any relationship that might exist between the X- and Y-variables. Randomization therefore yields PLS regression models that should reflect the absence of a real association between the X- and Y-variables, in other words: insignificant models. However, in practice any scrambling of the Y-data still leaves some correlation between the scrambled and original data, which needs to be taken into account. For each of these random models, a test statistic, T, is
Figure 1. Gas oil data set: RMSEV as a function of PLS components from (a) sevenfold CV (84 objects) and (b) independent validation set (155 objects). The RMSEV estimate for the five-component candidate model (*) is adorned with uncertainty limits (---). The selection rule given by Breiman et al. [41] checks whether smaller models yield an RMSEV estimate that is within the uncertainty limits associated with the candidate model. For the current example data set, the candidate model would be kept.

calculated. We have opted for the covariance between the t-score and the Y-values because it is a natural measure of association, thus T = (t′y)/N. Clearly, the value for a test statistic obtained after randomization should be indistinguishable from a chance fluctuation, apart from the small remaining correlation with the original Y. For this reason, it will be referred to as a 'noise value'. Repeating this calculation a number of times generates a histogram for the null-distribution, that is, the distribution that holds when the component is due to chance (the null-hypothesis, H₀). Next, a critical value is derived from the null-distribution as the value exceeded by a certain percentage of noise values (say 5 or 10%). Finally, the statistic obtained for the original data (the value under test) is compared with the critical value. The (only) difference with a 'conventional' statistical test is that the critical value follows as a percentage point of a data-driven histogram of noise values instead of a (fixed) theoretical distribution that is tabulated, for example t or F. For illustrative examples, see Figure 3.

2.6.3. Computational details
Step-by-step explanation. The procedure consists of the following steps:

1. Initialize, that is (1) set the minimum model dimensionality for the X- and Y-data, A, to zero, (2) optionally remove the mean from the data, yielding centered data sets X₀ and y₀ (the 0th residuals) and (3) set the maximum number of components to be considered to A_max.
2. Increase the model dimensionality, A, by one. Compute the residual data sets as $\mathbf{X}_A = \mathbf{X}_{A-1} - \mathbf{t}_A\mathbf{p}_A^{\mathrm{T}}$ and $\mathbf{y}_A = \mathbf{y}_{A-1} - \mathbf{t}_A q_A$, where t is the score vector, and p and q are the PLS loadings for the X- and Y-data, respectively.
3. Compute the value under test, T_A = cov(t, y_A), from X_A and y_A.
4. Generate P permutations of the rows of y_A, resulting in y_{A,p} (p = 1, ..., P). Compute for X_A and each y_{A,p} the noise value T_{A,p} from the first PLS component.
5. If A < A_max, return to Step 2.

Figure 2. Generating the distribution under the null-hypothesis (H₀) by building a series of PLS models after pairing up the observations for predictor (X) and response (Y) variables at random. Any result obtained by PLS modeling after randomization must be due to chance. Consequently, the statistical significance of the value under test follows from a comparison with the corresponding randomization results.
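As a concrete illustration of Steps 1–5, here is a minimal sketch under stated assumptions: scikit-learn's PLSRegression serves as the PLS engine (the authors' own implementation is a Matlab program, see Section 3.3), and the risk is reported as the raw fraction of noise values that exceed the value under test; the inverse Gaussian smoothing described below refines that estimate.

```python
# Sketch of the conditional randomization test (COMODITE), Steps 1-5.
# Assumes scikit-learn; reports the raw fraction of noise values >= T_A.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def first_component_statistic(X, y):
    """T = |t'y| / N for the first PLS component of (X, y)."""
    pls = PLSRegression(n_components=1, scale=False).fit(X, y)
    t = pls.x_scores_[:, 0]
    return abs(t @ y) / len(y)

def randomization_test(X, y, a_max=10, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    Xa = X - X.mean(axis=0)              # Step 1: centered 0th residuals
    ya = y - y.mean()
    risks = []
    for _ in range(a_max):               # Steps 2-5, one component at a time
        pls = PLSRegression(n_components=1, scale=False).fit(Xa, ya)
        t = pls.x_scores_[:, 0]
        p = pls.x_loadings_[:, 0]
        q = pls.y_loadings_[0, 0]
        T = abs(t @ ya) / len(ya)        # Step 3: value under test
        noise = np.array([first_component_statistic(Xa, rng.permutation(ya))
                          for _ in range(n_perm)])    # Step 4: noise values
        risks.append(float(np.mean(noise >= T)))
        Xa = Xa - np.outer(t, p)         # Step 2: deflate BOTH blocks;
        ya = ya - t * q                  # y-only updating is the Sheridan
    return risks                         # variant criticized in the text
```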


Figure 3. Gas oil data set: comparison of the histogram of noise values and the value under test (---) for PLS components (a) 3 and (b) 6. For component 3, 3.3% of the noise values exceed the value under test. This component is therefore significant, depending on the confidence level deployed. Component 6 is clearly insignificant, because the value under test is exceeded by 36% of the noise values.

It is emphasized that the calculation of the test statistic T_A is conducted conditional on previous components having been (fully) extracted from the data. Since this is a crucial aspect of the procedure, it is called conditional model dimensionality test (COMODITE); sometimes this is called a partial model dimensionality test [30]. The currently proposed randomization test intends to improve on the one introduced by Sheridan et al. [23]. Characteristic for their method is that only Y-residuals are updated in Step 2. This incomplete updating is essentially correct for the PLS model under test, for which the predictor and response variables are 'correctly' paired up, but not after randomization. Because previous components are not fully removed, the noise values are severely overestimated. As a result, the Sheridan test will have a tendency to under-fit, which will be illustrated using simulated data.

How many permutations are needed? It is common in bootstrapping to use 1000 'draws'; this value has been used here as well, in light of the close analogy between randomization tests and bootstrapping [50].

Histogram of noise values. Quite often, the test statistic T_A (far) exceeds all noise values for low-numbered components. Then a smooth approximation of the histogram may provide a more sensible estimate of the associated risk of over-fitting than the conventional value, 1/P. Because the test statistic is proportional to a correlation, it is tempting to use Fisher's Z-transform [51] (or Hotelling's transform [52] for few objects, i.e. ≤ 25) to convert it to a normally distributed variable. This approach fails for PLS (results not shown), because the two vectors involved in the calculation (t-score and response) are functionally related. We have therefore opted for an empirical approach and selected the so-called inverse Gaussian distribution, because it is often suited for modeling positive and/or positively skewed data, see Chhikara and Folks [53]. The inverse Gaussian probability density function is given by

$$g(x; \mu, \gamma) = \sqrt{\frac{\gamma}{2\pi x^{3}}}\,\exp\!\left(-\frac{\gamma(x-\mu)^{2}}{2\mu^{2}x}\right), \qquad x > 0;\ \mu, \gamma > 0 \qquad (7)$$

where x represents the data (noise values) to be modeled, and μ and γ are location and shape parameters. It holds that

$$\mu = \mathrm{mean}, \qquad \gamma = \frac{(\mathrm{mean})^{3}}{\mathrm{variance}} \qquad (8)$$

where mean and variance denote the true values. From these relationships, one obtains the method of moments estimators for μ and γ as

$$\hat{\mu} = \bar{x}, \qquad \hat{\gamma} = \frac{\bar{x}^{3}}{\frac{1}{P-1}\sum_{p=1}^{P}(x_{p} - \bar{x})^{2}} \qquad (9)$$

where x̄ denotes the observed mean and P is the number of data points (permutations). A better fit is, however, obtained by the maximum likelihood estimators (not shown),

$$\hat{\mu} = \bar{x}, \qquad \hat{\gamma} = P\Bigg/\sum_{p=1}^{P}\left(\frac{1}{x_{p}} - \frac{1}{\bar{x}}\right) \qquad (10)$$

The calculation of the risk involves evaluating a percentage point of the inverse Gaussian cumulative distribution function (cdf),

$$G(T_{A}; \hat{\mu}, \hat{\gamma}) = \Phi\!\left(\sqrt{\frac{\hat{\gamma}}{T_{A}}}\left(\frac{T_{A}}{\hat{\mu}} - 1\right)\right) + \exp\!\left(\frac{2\hat{\gamma}}{\hat{\mu}}\right)\Phi\!\left(-\sqrt{\frac{\hat{\gamma}}{T_{A}}}\left(\frac{T_{A}}{\hat{\mu}} + 1\right)\right) \qquad (11)$$

where Φ(·) denotes the standard normal cdf. Implementing this expression 'at face value' may lead to numerical problems because the second term on the right-hand side involves multiplying a large number, exp(·), which may cause an overflow, by a very small number, Φ(·). We have therefore implemented the solution proposed by Dennis et al. [54].
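For the smoothing step, a sketch that fits the inverse Gaussian by maximum likelihood, Equation (10), and evaluates the tail risk. Scipy is an assumption here: its invgauss(mu = μ/γ, scale = γ) parameterization corresponds to IG(μ, γ) of Equation (7), and its survival function sidesteps the exp(·)·Φ(·) overflow for which the paper adopts the Dennis et al. [54] solution.

```python
# Inverse Gaussian smoothing of the noise-value histogram, Eqs (9)-(11).
import numpy as np
from scipy.stats import invgauss

def ig_risk(noise_values, T):
    """Estimated risk of over-fitting: 1 - G(T) under the fitted IG."""
    x = np.asarray(noise_values, dtype=float)
    mu_hat = x.mean()                                      # Eq. (10)
    gamma_hat = len(x) / np.sum(1.0 / x - 1.0 / mu_hat)    # Eq. (10), MLE
    return invgauss.sf(T, mu_hat / gamma_hat, scale=gamma_hat)
```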

Figure 4. Gas oil data set: comparison of the histogram of noise values with inverse Gaussian fit (—) and the value under test (---) for PLS component 5 for (a) full (N = 84) and (b) reduced (N = 28) calibration set; corresponding scatter plots of test statistic against r² for (c) full and (d) reduced calibration set.

Finally, we have often observed that the tail is overestimated, see Figure 4 for an illustrative example; hence the fit leads to conservative estimates of the risk of over-fitting (often still much smaller than 1/P, but more realistic).

Scatter plot. Lindgren et al. [18] have noted that the usual histograms convey limited information. Chance correlations between the randomized and original response vectors tend to increase with decreasing number of objects. This useful piece of information is lacking from the histogram representation. However, it is easily accommodated in a scatter plot where one axis (e.g. the ordinate) is reserved for the test statistic and the other one (e.g. the abscissa) for the correlation, see Wold and Eriksson [55] for a related plot. Figure 4 presents a comparison with the usual histograms. It is observed that by reducing the number of calibration objects by a factor of three, the risk of over-fitting increases from 2 × 10⁻³ to 3.6%. Simultaneously, chance correlations increase, but certainly not to the point where one would start to worry about the validity of the estimated risk. Thus, it appears that the scatter plot representation enables one to make a visual link between uncertainty in the input data (e.g. is the calibration set large and diverse enough?) and uncertainty in aspects of the output model.

2.7. Alternative statistical tests

Haaland and Thomas [3] and Osten [4] proposed F-tests, which have been criticized by van der Voet [20]. More recently, Holcomb et al. [9] proposed another F-test. However, this test assumes full-rank errorless predictor variables, which is extremely restrictive in practice. Finally, Lazraq and Cléroux [11] proposed a t-test. However, this test tacitly assumes the scores and response vector to be independent, which is clearly violated for PLS. The test statistic therefore fails to follow the hypothesized t-distribution under the null-hypothesis. The problems with this test are illustrated by these authors' recommendation to test at the unusual level of 30%.

3. EXPERIMENTAL

3.1. Simulated data

Testing the claim that the currently proposed randomization test improves on the one introduced by Sheridan et al. [23] requires data sets for which the true dimensionality is clearly defined. We therefore conducted a small Monte Carlo simulation study where a three-component data set was constructed as follows. First, Y-data for a calibration and a prediction set, 50 samples in each set, were generated from a uniform distribution between 0 and 1. Next, the corresponding X-data were constructed by multiplying the 'noiseless' Y-data and the profiles depicted in Figure 5 (K = 100). Finally, normally distributed noise was added to the 'noiseless' X- and Y-data with standard deviations 0.01 and 0.05, respectively. For both randomization tests, an identical initialization of the pseudo-random number generators is deployed to enable a fair comparison.
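A sketch of this simulation design under our reading of the text: three uniform response columns, X built as the product of the noiseless Y (50 × 3) with three profiles over K = 100 variables, plus normal noise. The smooth Gaussian-bump profiles below are hypothetical stand-ins for those depicted in Figure 5.

```python
# Simulated three-component data, following Section 3.1; the profiles are
# illustrative substitutes for the ones shown in Figure 5.
import numpy as np

def make_dataset(n=50, k=100, seed=0):
    rng = np.random.default_rng(seed)
    axis = np.linspace(0.0, 1.0, k)
    profiles = np.stack([np.exp(-0.5 * ((axis - c) / 0.12) ** 2)
                         for c in (0.25, 0.50, 0.75)])   # 3 x K
    Y = rng.uniform(0.0, 1.0, size=(n, 3))               # noiseless responses
    X = Y @ profiles                                     # rank-3 structure
    X += rng.normal(scale=0.01, size=X.shape)            # spectral noise
    Y_noisy = Y + rng.normal(scale=0.05, size=Y.shape)   # response noise
    return X, Y_noisy
```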
3.2. Real data sets

The PLS component selection rules are tested on 10 real data sets, namely four near-infrared (NIR), two red/green/blue

(RGB) imaging, two nuclear magnetic resonance (NMR) and two QSAR applications. Since most of these data sets are detailed elsewhere [56–64], only a short description follows. Table I gives an overview and also indicates how four data sets are further divided into a calibration and a validation set in order to present a final test of the different component selection rules. The data sets can be downloaded from http://www.chemometrics.se.

Figure 5. Simulated profiles for Y-variable 1 (—), 2 (– – –) and 3 (---).

3.2.1. Carra
This data set contains NIR measurements (1100–2500 nm) collected from carrageenan powders with varying compositions. In the original data, the carrageenan is described by the relative amounts of five carrageenan types; here only a single response, kappa, is modeled. The data set is divided into a calibration and a validation set (see Table I).

3.2.2. Water
This data set contains Vis-NIR measurements (350–2500 nm) made at the surface of a wood plug using a LabSpex® Pro portable NIR spectrometer, model number A108310. Wood plugs were collected from cut spruce logs in the northern part of Sweden in 2001 (June 5–July 30) using a drill. The size of each sample was 12 mm in diameter and the length varies between 15 and 30 mm. The water content was measured on each plug, creating the response vector. The data set is divided into a calibration and a validation set (see Table I).

3.2.3. ISO-brightness
This data set contains Vis-NIR measurements (400–2500 nm) collected from samples of ground spruce wood. The response variable is the ISO-brightness value obtained from the peroxide bleached thermo mechanical pulp refined from the measured wood.

3.2.4. Gas oil
This data set contains NIR measurements of gas oil samples (4900–9000 cm⁻¹). The property of interest is the hydrogen content determined by NMR. The data set is divided into a calibration and a validation set (see Table I).

3.2.5. RGB
This data set contains histograms that are derived from RGB images of bark from two wood species, namely Norway spruce and Scots pine. The image pixels are binned according to their intensities, thus providing 2⁸ = 256 bins for each color channel and a total of 768 bins for each image. These 768 bins are used as predictors while the wood species class (spruce or pine) is the response variable.

3.2.6. Wavelet
This data set contains RGB images (300 × 600 pixels) of pine and spruce bark, compressed into 5000 wavelet coefficients. The response variable is the wood species. The data set is divided into a calibration and a validation set (see Table I).

3.2.7. Internodes
This data set contains ¹H HR/MAS NMR (high resolution magic angle spinning NMR) spectra from transgenic and wild type poplar trees. Each NMR spectrum was reduced to 657 spectral integral regions of equal width (0.02 ppm). All objects were aligned and normalized prior to analysis. The response vector is derived from the internodes of the trees, creating a growth or time related response. Only the transgenic type takes part in the modeling.

3.2.8. Poplar class
For the second NMR data set, the two classes create the response vector.

Table I. Overview of the real data sets

Data set         Type    Response           Objects (total)  Calibration  Validation  X-variables  References
Carra            NIR     Kappa              128              102          26           699         56
Water            NIR     Water content      647              431          216         2151         57
ISO-brightness   NIR     ISO-brightness      42               42          —           1050         58
Gas oil          NIR     Hydrogen content   239               84          155         2128         59
RGB              Image   Wood species       146              146          —            768         —
Wavelet          Image   Wood species       146               98          48          5000         60
Internodes       NMR     Internodes          24               24          —            655         61
Poplar class     NMR     Poplar class        45               45          —            655         61
Kelder           QSAR    Log RA              55               55          —             24         62, 63
Hexapep          QSAR    BA2                 16               16          —             18         64


3.2.9. Kelder
In this data set, Kelder and Greven described a Free–Wilson analysis of the behavior-modifying activities of ACTH-related peptides. The data contain 55 objects and 24 variables.

3.2.10. Hexapep
This data set contains hexapeptides that were synthesized according to a molecular design and then tested with regard to the biological activity BA2. The Z-scaled amino acids were used to characterize the hexapeptide sequence, giving 18 predictor variables. The response variable is BA2.

3.3. Calculations

All calculations concerning the randomization test were carried out using Matlab version 7.0 (The MathWorks, Natick, MA). A copy of the program is available on request.
4. RESULTS AND DISCUSSION

4.1. Simulated data

External validation using the (independent) prediction set yields a minimum RMSEP for the three-component PLS model for all three Y-variables (Table II). Consistent with this result, the currently proposed randomization test yields small significance levels for the first three PLS components, whereas the fourth PLS component is clearly insignificant by the test. By contrast, the test introduced by Sheridan et al. [23] assesses only the first component to be significant. As indicated in Subsection 'Step-by-step explanation', this test tends to under-fit because contributions from earlier components are not properly removed during the randomizations. Consequently, the 'noise values' are severely overestimated.

4.2. Real data sets

Table III gives an overview of the selected model ranks using SIMCA, Unscrambler and the present randomization test, while Table IV details the results for the randomization test (1000 randomizations). All data sets are mean-centered prior to PLS modeling. Some data sets required further pre-treatment such as standard normal variate (SNV) scaling, multiplicative signal correction (MSC) and unit variance scaling (UVS). Table V presents root mean square errors (RMSEs) obtained for the independent validation set.

Table II. Prediction and randomization test results for simulated data sets

                       Y1                             Y2                             Y3
PLS         RMSEP    α (%)      α (%)      RMSEP    α (%)      α (%)      RMSEP    α (%)      α (%)
component            current    Sheridan            current    Sheridan            current    Sheridan
1           0.239    3 × 10⁻⁴   3 × 10⁻⁴   0.224    3 × 10⁻³   3 × 10⁻³   0.245    8 × 10⁻³   8 × 10⁻³
2           0.085    0.25       100        0.131    0.03       100        0.052    4 × 10⁻³   100
3           0.049*   0.04       100        0.050*   4 × 10⁻³   100        0.049*   0.50       100
4           0.054    87.2       100        0.055    58.7       100        0.061    29.3       100

The models with minimum RMSEP are marked with *. The symbols are explained in the text.

Table III. Selected number of PLS components using LOO CV, CV with r = 7 segments, size of the eigenvalue, rules of significance, leverage correction and the randomization test

                                       SIMCA                                    Unscrambler
                Pre-treatment   CV     CV                   Rules of      CV     CV       Leverage    Randomization
Data set        (+ centering)   (LOO)  (r = 7)  Eigenvalue  significance  (LOO)  (r = 7)  correction  test

Carra           —               6      6        2           6             6      6        6           6–9
                SNV             5      8        4           5             5      5        5           8–10
Water           —               8      8        3           9             6      8        11          9
ISO-brightness  —               6      3–6      1           6             7      5        18          3–6
                MSC             5      5        2           5             5      5        18          3–5
Gas oil         —               2      2        3           2             2      2        2           5
RGB             —               3      3        8           5             5      6        10          4
Wavelet         —               7–10   7–10     4           6             5      4        6           10
Internodes      —               1      1        3           1             1      1        6           1
Poplar class    —               4      4        3           5             4      4        7           4
Kelder          —               2      2        2           2             3      5        3           2
Hexapep         UVS             3      —        2           3             11     —        4           4

For more detailed results from the randomization test, see Table IV.


Table IV. Risk of over-fitting (in %) for individual components, estimated from 1000 randomizations

                Pre-treatment                               PLS component
Data set        (+ centering)  1        2        3        4        5        6        7        8        9        10

Carra           —              21       0.06     1×10⁻³   0.23     0.32     2×10⁻³   30       0.03     3.1      11
                SNV            3×10⁻³   0.07     8×10⁻³   6×10⁻³   2×10⁻⁶   1×10⁻³   0.50     0.07     7.5      0.01
Water           —              2×10⁻⁵   1×10⁻⁶   4×10⁻⁶   2×10⁻⁵   0.13     6×10⁻⁴   1×10⁻⁴   6×10⁻⁵   7×10⁻³   16
ISO-brightness  —              24       0.22     4.1      44       26       0.80     41       73       24       68
                MSC            7.9      1.4      2.0      16       0.12     44       37       40       60       29
Gas oil         —              9×10⁻⁴   0.02     3.3      6×10⁻⁴   2×10⁻³   36       43       38       11       16
RGB             —              7×10⁻⁷   7×10⁻⁵   0.70     3×10⁻³   37       28       37       28       20       87
Wavelet         —              1×10⁻⁷   1×10⁻³   0.01     1×10⁻⁷   4×10⁻⁶   7×10⁻⁵   4×10⁻³   0.09     6×10⁻⁴   0.03
Internodes      —              6×10⁻³   20       11       37       41       13       4.6      21       13       2.6
Poplar class    —              0.01     0.11     0.30     1.4      11       8.8      2.9      8.5      19       12
Kelder          —              2×10⁻⁶   0.70     24       33       91       100      100      100      100      100
Hexapep         UVS            32       0.36     5.1      2.2      58       82       88       95       80       98

Table V. RMSEV for the independent validation set

                Pre-treatment                       Number of PLS components
Data set        (+ centering)  1      2      3      4      5      6      7      8      9      10

Carra           —              18.5   14.9   7.30   7.04   6.04   3.21   3.11   2.35   1.79   1.85
                SNV            14.2   7.29   6.75   5.86   3.22   2.39   1.68   1.52   1.16   1.36
Water           —              9.17   6.99   6.70   6.07   6.04   5.47   5.39   5.43   5.44   5.31
Gas oil         —              0.18   0.10   0.085  0.055  0.045  0.039  0.039  0.037  0.039  0.038
Wavelet         —              0.27   0.24   0.22   0.23   0.23   0.23   0.22   0.22   0.22   0.22
4.2.1. Detailed

Carra

- Spectra require pre-treatment due to offset variations that are unrelated to response values. SNV-scaling appears to work well over the entire wavelength range (Figure 6).
- SNV-scaling leads to smaller RMSEV for CV (not shown) and the independent validation set (Table V).
- With the exception of the eigenvalue criterion, there is general agreement between SIMCA and Unscrambler when spectra are not pre-treated (Table III).
- Erratic behavior for the randomization test when spectra are not pre-treated. The first component is insignificant by the test (Table IV), whereas successive components clearly are. In other words, the components do not enter the model in the natural order of decreasing relevance. Note that the first component has the same effect as SNV on RMSEV for the independent validation set (Table V). It follows that this behavior can be rationalized and is therefore indicative of a data pre-treatment that needs to be done prior to PLS modeling.
- The randomization test gives support for 8 or 10 components after SNV-scaling.

Water

- Agreement between SIMCA and Unscrambler CV (Table III). Disagreement of CV with the eigenvalue criterion and leverage correction.
- The randomization test gives clear support for nine components, in agreement with SIMCA's rules of significance.

ISO-brightness

- With the exception of the eigenvalue criterion and leverage correction, good agreement between SIMCA and Unscrambler before and after pre-treatment with MSC (Table III).
- Before pre-treatment with MSC, components 1, 3 and 4 are not significant according to SIMCA's rules of significance, but 2, 5 and 6 are; this is in fairly good agreement with the randomization test, where components 2, 3 and 6 are significant. After pre-treatment with MSC, component 1 is not significant according to the rules, which is corroborated by the randomization test.

Figure 6. Carra data set: (a) raw and (b) SNV-scaled calibration set spectra (1100–2300 nm) in arbitrary signal units.


Gas oil

- Good agreement between the SIMCA and Unscrambler selection rules, while the randomization test stands out (Table III). The decrease of RMSECV upon going from two to five components (Figure 1a) is incorrectly found to be insignificant by the SIMCA and Unscrambler selection rules. Caution to avoid over-fitting has led to under-fitting: RMSEV for the independent validation set increases from 0.045 to 0.10 when dropping from five to two components (Table V).
- Nonmonotonic behavior for the randomization test: the estimated tail probability for component 3 is 3.3% (Figure 3a), while successive components 4 and 5 reach much smaller values (see e.g. Figure 4 for component 5). Component 6 is clearly insignificant by the test (Figure 3b). The reason for this nonmonotonic behavior may be the presence of tiny nonlinear substructures in the spectra corresponding to high response values. Indicative is that high response values are underestimated by their predictions: clearly so for the two-, and to a lesser extent for the five-dimensional model (Figure 7).

RGB

- Disagreement between the SIMCA and Unscrambler selection rules (Table III).
- The randomization test clearly supports four components. Such an uninterrupted series of highly significant components can be considered as ideal behavior of the test.

Wavelet

- The wavelet transformed data set is a little tricky since RMSECV never attains a local minimum. This is due to the fact that X represents compressed image data and, because of the complex structure, many latent variables correlate to Y. Y itself is a discrete, noiseless variable. As a result, both CV and the randomization test will find many significant components. Therefore we chose to display an interval of 7–10 components for SIMCA's CV (Table III). This result is in good agreement with the randomization test (Table IV). It also agrees fairly well with the minimum RMSEV obtained between 3 and 10 components for the independent validation set (Table V).

Internodes

- With the exception of the eigenvalue criterion and leverage correction, all SIMCA and Unscrambler selection rules suggest that only the first component is predictive.
- The randomization test clearly supports only the first component.

Poplar class

- Some disagreement between CV and the other SIMCA and Unscrambler selection rules.
- The randomization test supports four components, in agreement with CV.

Kelder

- Disagreement between the SIMCA and Unscrambler selection rules (Table III).
- The randomization test clearly supports two components, in agreement with the SIMCA selection rules.

Hexapep

- Since this data set originates from a design in X-space, only the 'mildest' form of CV is used, namely LOO CV.
- Disagreement between the SIMCA and Unscrambler selection rules (Table III).
- Erratic behavior for CV (Figure 8). The purpose of CV is to generate prediction residuals from the calibration set. Consequently, the average cross-validated error (RMSECV) should not differ much from the average fit error, especially for under-fitting models. The failure of CV can be explained from the designed character of the data. In a strict sense, CV can only be used if the data constitute a random sample from a population. In a loose sense, the effect of a design can be negligible for practical purposes, but only if the calibration set is large enough, which apparently is not the case here (N = 16).
- Erratic behavior for the randomization test. The first component is insignificant by the test (Table IV), whereas three successive components are, depending on the level of confidence. This is in agreement with SIMCA's rules of significance, which also claim that the first component is insignificant, but the second and third are significant.
- The Unscrambler recommends four components based on applying criterion (6) to the RMSEV from leverage correction. This happens to agree with the randomization test. However, the associated RMSEV is twice the value achieved for the global minimum at eight components (1.42 vs. 0.71). In this case, the user is confronted with a difficult choice: follow the recommendation based on criterion (6), which depends on an ad hoc value for c, or accept the rank that yields the global minimum? The results of the randomization test are easier to interpret.
Figure 7. Gas oil data set: prediction results for the independent validation set obtained using (a) two and (b) five PLS components (NIR prediction against reference value).

Figure 8. Hexapep data set: RMSE as a function of PLS components from fit, LOO CV and leverage correction.


4.2.2. General

The suggested optimum number of components varies between the two implementations of CV and the other selection rules available in SIMCA and Unscrambler, as well as the randomization test (Table III). Although this disagreement may be confusing for the less experienced user, it is important to remember that an exact number does not exist, but rather an appropriate interval. It is comforting that CV as well as the randomization test seem to find this interval in a reliable way, as indicated by the results. Picard and Cook [32] note that 'In the analysis of large data sets, an appropriate way to proceed is often not immediately apparent. Consequently, some aspects of exploratory data analysis naturally arise. Competing models may be temporarily entertained and the final choice of a predictive equation is influenced by many factors, including the personal experiences and prejudices of the investigator'. In other words, imprecise results only present an immediate problem if a definite choice is desired. Otherwise, one could simply keep several competing models and put them all to the test on future data. This future prediction-testing phase would then sometimes lead to the final model, a sound and safe strategy. However, for the common selection rules studied in this work, one observes that competing models may have differing numbers of components. In these cases, it may be hard to defend why certain models are considered as competitors. By contrast, the randomization test leads to models for which the disagreeing number of components has a simple interpretation: a higher degree of confidence leads to smaller models and vice versa. Actually, this is the situation for any statistical test, including all the ones looked at here; it often needs to be supplemented with a rule that encourages a simpler model before a more complicated one. Finally, it is emphasized that the proposed randomization test need not be used in stand-alone mode. It can also be used in an interactive fashion to guide the component selection according to another rule, for example, CV.
5. CONCLUSIONS

- The simulation results illustrate that the currently proposed randomization test improves on the one introduced by Sheridan et al. [23].
- The randomization test assumes the components to enter the model in a natural order, that is, according to their relevance for describing the response variable. After all, this property is a theoretical advantage of PLS over methods like PCR [65]. The ideal natural order is not always attained for practical data. It can, for example, be distorted by an improper pre-treatment of the predictor variables. Fortunately, the resulting erratic behavior can point at useful pre-treatment methods. Likewise, the test appears to be sensitive to nonlinearities in the data.
- The proposed randomization test, possibly in combination with other PLS component selection rules, will make the choice of the right model easier for the user.
- A comparison with the recently introduced Monte Carlo CV [35–37] is an interesting topic for future research. Likewise, a more systematic evaluation of the test for designed data seems to be warranted.
- One area where CV works poorly, both for PLS and PCR, is design of experiments, where exclusion of data has large consequences for modeling. Here the randomization test should have merit.
Acknowledgements
Chris Brown and Alejandro Olivieri are thanked for supplying Matlab code for evaluating the inverse Gaussian fit.

REFERENCES

1. Wold S, Eriksson L, Sjöström M. PLS in chemistry. In The Encyclopedia of Computational Chemistry, Schleyer PVR, Allinger NL, Clark T, Gasteiger J, Kollman PA, Schaefer HF III, Schreiner PR (eds). Wiley: Chichester, UK, 1999; 2006–2020.
2. Denham MC. Choosing the number of factors in partial least squares regression: estimating and minimizing the mean squared error of prediction. J. Chemometr. 2000; 14: 351–361.
3. Haaland DM, Thomas EV. Partial least-squares methods for spectral analyses. 1. Relation to other quantitative calibration methods and the extraction of qualitative information. Anal. Chem. 1988; 60: 1193–1202.
4. Osten DW. Selection of optimal regression models via cross-validation. J. Chemometr. 1988; 2: 39–48.
5. Wakeling IN, Morris JJ. A test of significance for partial least squares regression. J. Chemometr. 1993; 7: 291–304.
6. Höskuldsson A. Dimension of linear models. Chemometr. Intell. Lab. Syst. 1996; 32: 37–55.
7. Messick NJ, Kalivas JH, Lang PM. Selecting factors for partial least squares. Microchem. J. 1997; 55: 200–207.
8. Faber NM, Kowalski BR. Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares. J. Chemometr. 1997; 11: 181–238.

9. Holcomb TR, Hjalmarsson H, Morari M, Tyler ML. Significance regression: a statistical approach to partial least squares. J. Chemometr. 1997; 11: 283–309.
10. Liu D, Shah SL, Grant Fisher D. Choice of latent explanatory variables: a multiobjective optimization approach. J. Chemometr. 2000; 14: 79–92.
11. Lazraq A, Cléroux R. The PLS multivariate regression model: testing the significance of successive PLS components. J. Chemometr. 2001; 15: 523–536.
12. Green RL, Kalivas JH. Graphical diagnostics for regression model determinations with consideration of the bias/variance trade-off. Chemometr. Intell. Lab. Syst. 2002; 60: 173–188.
13. Li B, Morris J, Martin EB. Model selection for partial least squares regression. Chemometr. Intell. Lab. Syst. 2002; 64: 79–89.
14. Forina M, Lanteri S, Cerrato Oliveros MC, Pizarro Millan C. Selection of useful predictors in multivariate calibration. Anal. Bioanal. Chem. 2004; 380: 397–418.
15. Fisher RA. The Design of Experiments. Oliver and Boyd: Edinburgh, 1935.
16. Klopman G, Kalos AN. Causality in structure-activity studies. J. Comput. Chem. 1985; 6: 492–506.
17. Thioulouse J, Lobry JR. Co-inertia analysis of amino-acid physico-chemical properties and protein composition with the ADE package. Comput. Appl. Biosci. 1995; 11: 321–329.
18. Lindgren F, Hansen B, Karcher W, Sjöström M, Eriksson L. Model validation by permutation tests: applications to variable selection. J. Chemometr. 1996; 10: 521–532.
19. Baumann K, Stiefl N. Validation tools for variable subset regression. J. Comput.-Aided Mol. Des. 2004; 18: 549–562.
20. van der Voet H. Comparing the predictive accuracy of models using a simple randomization test. Chemometr. Intell. Lab. Syst. 1994; 25: 313–323.
21. van der Voet H. Corrigendum to 'Comparing the predictive accuracy of models using a simple randomization test'. Chemometr. Intell. Lab. Syst. 1995; 28: 315.
22. Dijksterhuis GB, Heiser WJ. The role of permutation tests in exploratory multivariate data analysis. Food Qual. Prefer. 1995; 6: 263–270.
23. Sheridan RP, Nachbar RB, Bush BL. Extending the trend vector: the trend matrix and sample-based partial least squares. J. Comput.-Aided Mol. Des. 1994; 8: 323–340.
24. Stone M. Cross-validatory choice and assessment of statistical predictions. J. Roy. Stat. Soc. B 1974; 36: 111–133.
25. Wold S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics 1978; 20: 397–405.
26. Allen DM. The prediction sum of squares as a criterion for selecting prediction variables. Technical Report No. 23, Dept. of Statistics, University of Kentucky: Lexington, KY, 1971.
27. Martens H, Næs T. Multivariate calibration by data compression. In Near-Infrared Technology in the Agricultural and Food Industries, Williams P, Norris K (eds). American Cereal Association: St Paul, MN, 1987; 57–87.
28. Lorber A, Kowalski BR. Alternatives to cross-validatory estimation of the number of factors in multivariate calibration. Appl. Spectrosc. 1990; 44: 1464–1470.
29. Marbach R, Heise HM. Calibration modeling by partial least-squares and principal component regression and its optimization using an improved leverage correction for prediction testing. Chemometr. Intell. Lab. Syst. 1990; 9: 45–63.
30. SIMCA-P and SIMCA-P+ 10, User Guide. Umetrics AB: Umeå, Sweden, 2002.
31. Unscrambler for Windows, User's Guide. CAMO AS: Trondheim, Norway, 1996.
32. Picard RR, Cook RD. Cross-validation of regression models. J. Amer. Stat. Assoc. 1984; 79: 575–583.
33. Shao J. Linear model selection by cross-validation. J. Amer. Stat. Assoc. 1993; 88: 486–494.
34. Zhang P. Model selection via multifold cross validation. Ann. Stat. 1993; 21: 299–313.
35. Xu Q-S, Liang Y-Z. Monte Carlo cross validation. Chemometr. Intell. Lab. Syst. 2001; 56: 1–11.
36. Gourvénec S, Fernández Pierna JA, Massart DL, Rutledge DN. An evaluation of the PoLiSh smoothed regression and the Monte Carlo cross-validation for the determination of the complexity of a PLS model. Chemometr. Intell. Lab. Syst. 2003; 68: 41–51.
37. Xu Q-S, Liang Y-Z, Du Y-P. Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration. J. Chemometr. 2004; 18: 112–120.
38. Baumann K, Albert H, von Korff M. A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part I. Search algorithm, theory and simulations. J. Chemometr. 2002; 16: 339–350.
39. Baumann K, von Korff M, Albert H. A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part II. Practical applications. J. Chemometr. 2002; 16: 351–360.
40. Baumann K. Cross-validation as the objective function for variable selection techniques. Trends Anal. Chem. 2003; 22: 395–406.
41. Breiman L, Friedman JH, Olshen RA, Stone C. Classification and Regression Trees. Wadsworth: Belmont, CA, 1984.
42. Reeves JB III, Delwiche SR. SAS® partial least squares regression for analysis of spectroscopic data. J. Near Infrared Spectrosc. 2003; 11: 415–431.
43. van der Voet H. Pseudo-degrees of freedom for complex predictive models: the example of partial least squares. J. Chemometr. 1999; 13: 195–208.
44. Jackson JE. A User's Guide to Principal Components. Wiley: New York, 1991.
45. Martens HA, Dardenne P. Validation and verification of regression in small data sets. Chemometr. Intell. Lab. Syst. 1998; 44: 99–121.
46. Westad F, Martens H. Variable selection in near infrared spectroscopy based on significance testing in partial least squares regression. J. Near Infrared Spectrosc. 2000; 8: 117–124.
47. Høy M, Steen K, Martens H. Review of partial least squares regression prediction error in Unscrambler. Chemometr. Intell. Lab. Syst. 1998; 44: 123–133.
48. Faber NM. Comparison of two recently proposed expressions for partial least squares regression prediction error. Chemometr. Intell. Lab. Syst. 2000; 52: 123–134.
49. Faber NM. Estimating the uncertainty in estimates of root mean square error of prediction: application to determining the size of an adequate test set in multivariate calibration. Chemometr. Intell. Lab. Syst. 1999; 49: 79–89.
50. Efron B. Bootstrap methods: another look at the jackknife. Ann. Stat. 1979; 7: 1–26.
51. Fisher RA. Statistical Methods for Research Workers. Oliver and Boyd: Edinburgh, 1925.
52. Hotelling H. New light on the correlation coefficient and its transform. J. Roy. Stat. Soc. B 1953; 15: 193–232.
53. Chhikara RS, Folks JL. The Inverse Gaussian Distribution: Theory, Methodology, and Applications. Marcel Dekker: New York, 1989.
54. Dennis B, Munholland PL, Scott JM. Estimation of growth and extinction parameters for endangered species. Ecol. Monogr. 1991; 61: 115–143.


55. Wold S, Eriksson L. Statistical validation of QSAR results. In Chemometric Methods in Molecular Design, van de Waterbeemd H (ed.). VCH Publishers: Weinheim, 1995; 309–318.
56. Dyrby M, Petersen RV, Larsen J, Rudolf B, Nørgaard L, Engelsen SB. Towards on-line monitoring of the composition of commercial carrageenan powders. Carbohydr. Polym. 2004; 57: 337–348.
57. Persson E, Sjöström M, Sundblad L-G, Wiklund S, Wilhemsson L. Fresh timber – a challenge to forestry and mensuration. Resultat 2002; 8: 2–4.
58. Nilsson D, Edlund U, Sjöström M, Agnemo R. Prediction of thermo mechanical pulp brightness using NIR spectroscopy on wood raw material. Paperi Ja Puu (Pap. Timber) 2005; 87: 102–109.
59. Fernández Pierna JA, Jin L, Wahl F, Faber NM, Massart DL. Estimation of partial least squares regression (PLSR) prediction uncertainty when the reference values carry a sizeable measurement error. Chemometr. Intell. Lab. Syst. 2003; 65: 281–291.
60. Nilsson D, Edlund U. Pine and spruce roundwood species classification using multivariate image analysis on bark. Holzforschung 2005; 59: 689–695.
61. Wiklund S, Karlsson M, Antti H, Johnels D, Sjöström M, Wingsle G, Edlund U. A new metabonomic strategy for analyzing the growth process in poplar tree. Plant Biotechnol. J. 2005; 3: 353–362.
62. Kelder J, Greven HM. A quantitative study on the relationship between structure and behavioural activity of peptides related to ACTH. Rec. Trav. Chim. Pays-Bas 1979; 98: 168–172.
63. Kubinyi H. Evolutionary variable selection in regression and PLS analyses. J. Chemometr. 1996; 10: 119–133.
64. Eriksson L, Johansson E, Kettaneh-Wold N, Wold S. Multi- and Megavariate Data Analysis. Principles and Applications. Umetrics AB: Umeå, Sweden, 2002.
65. Helland IS. On the structure of partial least squares regression. Commun. Stat. – Simul. 1988; 17: 581–607.
