During the last two decades, a number of methods have been developed and evaluated for selecting
the optimal number of components in a PLS model. In this paper, a new method is introduced that is
based on a randomization test. The advantage of using a randomization test is that in contrast to cross
validation (CV), it requires no exclusion of data, thus avoiding problems related to data exclusion, for
example in designed experiments. The method is tested using simulated data sets for which the true
dimensionality is clearly defined and also compared to regularly used methods for 10 real data sets.
The randomization test works as a good statistical selection tool in combination with other selection
rules. It also works as an indicator when the data require a pre-treatment. Copyright © 2007 John Wiley & Sons, Ltd.
KEYWORDS: randomization test; permutation test; component selection; factor selection; latent variable selection;
partial least squares
study indicate that the method may outperform 'conventional' implementations for small and medium-sized calibration sets [35]. However, Gourvénec et al. [36] noted that a major concern remains with this method, namely whether making a calibration model with so few objects will be representative for the structure of the data. Moreover, since our goal is to compare a new randomization test with regularly used methods, we consider its inclusion out of the scope of this paper. Second, other statistical tests have been proposed [3,4,9,11]. We prefer not to include them either. Instead, they are discussed in Section 2.7.

2. PLS COMPONENT SELECTION

This section presents a description of the selection rules under study. Although there are many ways to select the rank of a model, the optimal one is to use a large external validation set that is not involved in the model building stage. Unfortunately, this requires highly redundant data sets; hence the model must usually be validated in the best alternative way. Mean-centered models are assumed throughout, but this does not affect the generality of the presentation.

…value of the number of components is immaterial as long as the prediction error is close to its minimum.

2.2. Leverage correction

Leverage correction is a simplified method used for validation and component selection [26–29]. The goal of leverage correction is to convert a fit residual into a prediction residual, without performing any actual predictions as in CV. The leverage is a measure of the influence of an object on the PLS model and is closely related to the Mahalanobis distance and Hotelling's T². The leverage of object i using A components is calculated as

h_i^A = 1/N + \sum_{a=1}^{A} t_{ia}^2 / (t_a^T t_a)    (2)

where t_a denotes the orthogonal score vector of component a (a = 1, …, A) and t_{ia} is the object score value. It is noted that mean centering is included in the leverage calculation as the N^{-1} term. The leverage-corrected fit residual is obtained as

f_{i,LC}^A = (y_{obs,i} − y_{fit,i}^A) / (1 − h_i^A)    (3)
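As an illustration, Equations (2) and (3) can be sketched in a few lines of plain Python (lists rather than the authors' Matlab; all variable names are illustrative):

```python
def leverage(i, t_scores, n_objects):
    """Leverage of object i for an A-component model, Eq. (2):
    h_i^A = 1/N + sum_a t_ia^2 / (t_a' t_a)."""
    h = 1.0 / n_objects                       # mean-centering term
    for t_a in t_scores:                      # one score vector per component
        h += t_a[i] ** 2 / sum(v * v for v in t_a)
    return h

def leverage_corrected_residual(i, y_obs, y_fit, t_scores):
    """Convert the fit residual of object i into an approximate
    prediction residual, Eq. (3): f = (y_obs - y_fit) / (1 - h)."""
    h = leverage(i, t_scores, len(y_obs))
    return (y_obs[i] - y_fit[i]) / (1.0 - h)
```

For a one-component model with score vector t = (1, −1, 0, 0) and N = 4, object 1 has h = 1/4 + 1/2 = 0.75, so its fit residual is inflated by 1/(1 − 0.75) = 4.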
case, the eigenvalue will show how many objects are explained by the component. The relative amount of variation explained by a model follows as the ratio of the sum of the SSX(A)s and SSX(0). It should be noted that the eigenvalue is less popular for PLS component selection than CV or leverage correction, although it constitutes the basis for many component selection rules in PCA [44].

2.4. SIMCA

1. A component is significant according to Rule 1 when the amount of predicted (CV) variance Q² > limit. With a PLS model, limit = 0 for models with more than 100 objects and limit = 0.05 for models with 100 objects or less.
2. A component is significant according to Rule 2 when at least one Y-variable for a PLS model has Q²_V > limit. Q²_V is the fraction from one Y-variable, that is Q²_V = 1 − PRESS/SS_y, where SS_y is the sum of squares for one response vector y. The total from all responses is Q².

A PLS component is not significant if:

1. Rule 1 or 2 is not fulfilled.
2. The data have an insufficient number of degrees of freedom after the previous component, that is if (N − A) or (K − A) = 0.
3. The explained variance for Y (PLS) is less than 1% and no single Y-variable has more than 3% (PLS) explained variance.

2.5. Unscrambler

For Unscrambler [31], the selection follows as the software recommendation for CV and leverage correction. This recommendation is based on the minimization of the following criterion [45]:

Crit(A) = RMSEV(A) + c · A · RMSEV(0)    (6)

where RMSEV stands for the root mean square error of validation (irrespective of how this has been estimated), the number of components is given in parentheses and c denotes a 'punish factor' for adding components. RMSEV(0) symbolizes the validation error (RMSECV or RMSELC) when the model consists of the center only (zero components), and is therefore expected to resemble the spread in the validation responses. The second term on the right-hand side constitutes a penalty for adding components. When RMSEV is approximately equal for two models, the smaller one is favored by criterion (6). The purpose is therefore to add robustness to the component selection. It is important to note, however, that the optimum 'punish factor' depends on the data. Martens and Dardenne [45] give the value c = 0.05. Westad and Martens [46] mention that 'The "punish factor" may be tuned to a smaller value for models with low explained y-variance [relatively high RMSEP (RMSEP stands for root mean square error of prediction, a synonym for RMSEV)]. Three per cent seems to be a good ad hoc value, although setting it lower will not affect the model performance in general'. However, the value c = 0.01, as deployed by Høy et al. [47], may lead to severe under-fitting [48]. Finally, Lorber and Kowalski [28] have found that CV and leverage correction perform similarly for top-down PCR, but not for PLS; hence one should not expect a single value to work well for both validation methods when PLS is used.

2.6. The proposed randomization test

2.6.1. Why a statistical test?
PLS component selection is most frequently carried out in practice using some sort of validation. However, each validation-based selection rule has drawbacks. CV and leverage correction handle the available data economically but, like any data-based statistical test, give interval results and hence sometimes lead to either an under-fit or an over-fit, that is, they reach the minimum RMSEV for a lower or higher model rank than would be achieved using an infinitely large independent validation set. In addition, the estimated RMSEs carry a considerable uncertainty, as noted by Martens and Dardenne [45]. When extreme outliers are removed, the mean squared error (the square of RMSEV) is distributed approximately proportional to a χ²-variable [49]. As a result, the relative uncertainty can be estimated as √(1/(2N)), in which N is the number of objects. This estimate can be used in the selection rule given by Breiman et al. [41], which starts with a candidate model and works backwards, see Figure 1. However, this procedure does not account for the fact that the models are nested; hence subsequent RMSEs are correlated.

In summary, it is not clear with CV-based selection rules how the uncertainty in the input data translates into an uncertainty in the output number of PLS components. Here we try to develop a distribution-based statistical test that may provide more objective guidance.

2.6.2. The procedure for successive PLS components
The proposed method assesses the statistical significance of each individual component that enters the model. Theoretical approaches to achieve this goal (using a t- or F-test) have been put forth, but they are all based on unrealistic assumptions about the data, for example the absence of spectral noise (see Section 2.7). A pragmatic data-driven approach is therefore called for. The randomization test is entirely data-driven and therefore ideally suited for avoiding unrealistic assumptions. For an excellent description of this methodology, see van der Voet [20]. The rationale behind the randomization test in the context of regression modeling is illustrated in Figure 2. Randomization amounts to permuting indices. For that reason, the randomization test is often referred to as a permutation test. In quantitative structure-activity relationship (QSAR) applications it is known as 'Y-scrambling'. Clearly, a complete scrambling of the elements of Y while keeping the corresponding numbers in X fixed destroys any relationship that might exist between the X- and Y-variables. Randomization therefore yields PLS regression models that should reflect the absence of a real association between the X- and Y-variables, in other words: insignificant models. However, in practice any scrambling of the Y-data still leaves some correlation between the scrambled and original data, which needs to be taken into account. For each of these random models, a test statistic, T, is
Copyright © 2007 John Wiley & Sons, Ltd. J. Chemometrics 2007; 21: 427–439. DOI: 10.1002/cem
430 S. Wiklund et al.
[Figure 1. Gas oil data set: RMSEV as a function of PLS components from (a) sevenfold CV (84 objects) and (b) independent validation set (155 objects). The RMSEV estimate for the five-component candidate model (*) is adorned with uncertainty limits (---). The selection rule given by Breiman et al. [41] checks whether smaller models yield an RMSEV estimate that is within the uncertainty limits associated with the candidate model. For the current example data set, the candidate model would be kept.]
calculated. We have opted for the covariance between the t-score and the Y-values because it is a natural measure of association, thus T = (t′y)/N. Clearly, the value for a test statistic obtained after randomization should be indistinguishable from a chance fluctuation, except for the small remaining correlation to the original Y. For this reason, it will be referred to as a 'noise value'. Repeating this calculation a number of times generates a histogram for the null-distribution, that is, the distribution that holds when the component is due to chance, the null-hypothesis (H₀). Next, a critical value is derived from the null-distribution as the value exceeded by a certain percentage of the noise values (say 5 or 10%). Finally, the statistic obtained for the original data, the value under test, is compared with the critical value. The (only) difference with a 'conventional' statistical test is that the critical value follows as a percentage point of a data-driven histogram of noise values instead of a (fixed) theoretical distribution that is tabulated, for example t or F. For illustrative examples, see Figure 3.

2.6.3. Computational details
Step-by-step explanation. The procedure consists of the following steps:

1. Initialize, that is (1) set the minimum model dimensionality for the X- and Y-data, A, to zero, (2) optionally remove the mean from the data, yielding centered data sets X₀ and y₀ (the 0th residuals) and (3) set the maximum number of components to be considered to A_max.
2. Increase the model dimensionality, A, by one. Compute the residual data sets as X_A = X_{A−1} − t_A p_A^T and y_A = y_{A−1} − t_A q_A, where t is the score vector, and p and q are the PLS loadings for the X- and Y-data, respectively.
[Figure 2. Rationale behind the randomization test: the elements of the ordered y-vector are permuted (ordered indices to permuted indices) to give randomized y-vectors; regressing these against the fixed X matrix by PLS yields the distribution of the test statistic under H₀.]
[Figure 3. Gas oil data set: comparison of the histogram of noise values and the value under test (---) for PLS components (a) 3 and (b) 6. For component 3, 3.3% of the noise values exceed the value under test. This component is therefore significant, depending on the confidence level deployed. Component 6 is clearly insignificant, because the value under test is exceeded by 36% of the noise values.]
3. Compute the value under test, T_A (the covariance of the t-score and y_A), from X_A and y_A.
4. Generate P permutations of the rows of y_A, resulting in y_{A,p} (p = 1, …, P). Compute for X_A and y_{A,p} (p = 1, …, P) the noise value, T_{A,p} (p = 1, …, P), from the first PLS component.
5. If A < A_max, return to Step 2.

…Chhikara and Folks [53]. The inverse Gaussian probability density function is given by

g(x; μ, γ) = √(γ/(2πx³)) · exp(−γ(x − μ)²/(2μ²x)),    x > 0; μ, γ > 0    (7)

where x represents the data (noise values) to be modeled, and μ and γ are location and shape parameters. It holds that
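Steps 1–5 can be sketched in plain Python for a single response. This is only an illustration, not the authors' Matlab implementation: the first component is extracted NIPALS-style (w ∝ Xᵀy, t = Xw), the statistic is the covariance T = (t′y)/N of Section 2.6.2, and the exceedance count is taken one-sided as a simplifying choice:

```python
import random

def first_score(X, y):
    """First PLS1 score vector: w = X'y normalized, t = Xw."""
    K, N = len(X[0]), len(y)
    w = [sum(X[i][k] * y[i] for i in range(N)) for k in range(K)]
    norm = sum(v * v for v in w) ** 0.5 or 1.0
    w = [v / norm for v in w]
    return [sum(X[i][k] * w[k] for k in range(K)) for i in range(N)]

def statistic(t, y):
    """Test statistic T = (t'y)/N."""
    return sum(ti * yi for ti, yi in zip(t, y)) / len(y)

def deflate(X, y, t):
    """Step 2: X <- X - t p', y <- y - t q, with p = X't/(t't), q = t'y/(t't)."""
    N, K = len(y), len(X[0])
    tt = sum(v * v for v in t)
    p = [sum(X[i][k] * t[i] for i in range(N)) / tt for k in range(K)]
    q = sum(t[i] * y[i] for i in range(N)) / tt
    return ([[X[i][k] - t[i] * p[k] for k in range(K)] for i in range(N)],
            [y[i] - t[i] * q for i in range(N)])

def randomization_test(X, y, a_max, n_perm=200, seed=0):
    """Estimated risk of over-fitting per component: the fraction of
    noise values T_{A,p} (first component of X_A against a permuted
    y_A) that exceed the value under test T_A (Steps 2-5)."""
    rng = random.Random(seed)
    risks = []
    for _ in range(a_max):
        t = first_score(X, y)
        T = statistic(t, y)
        exceed = 0
        for _ in range(n_perm):
            yp = y[:]
            rng.shuffle(yp)                       # scramble y, keep X fixed
            if statistic(first_score(X, yp), yp) > T:
                exceed += 1
        risks.append(exceed / n_perm)
        X, y = deflate(X, y, t)                   # remove the accepted component
    return risks
```

For centered data with one strong component, the first estimated risk comes out near zero, in line with the very small tail probabilities reported for the leading components in Table IV.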
[Figure 4. Gas oil data set: comparison of the histogram of noise values with inverse Gaussian fit (—) and the value under test (---) for PLS component 5 for (a) full (N = 84) and (b) reduced (N = 28) calibration set; corresponding scatter plots of test statistic against r² for (c) full and (d) reduced calibration set.]
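A small sketch of Equation (7), together with one standard way to fit it, the closed-form maximum-likelihood estimates for the inverse Gaussian. The paper does not state how μ and γ were estimated, so the fitting step here is an assumption:

```python
import math

def inv_gauss_pdf(x, mu, gam):
    """Inverse Gaussian density, Eq. (7):
    g(x; mu, gam) = sqrt(gam / (2*pi*x^3)) * exp(-gam*(x - mu)^2 / (2*mu^2*x))."""
    if x <= 0:
        return 0.0
    return math.sqrt(gam / (2.0 * math.pi * x ** 3)) * math.exp(
        -gam * (x - mu) ** 2 / (2.0 * mu ** 2 * x))

def inv_gauss_fit(xs):
    """Closed-form maximum-likelihood estimates for the inverse Gaussian:
    mu = sample mean, 1/gam = mean(1/x) - 1/mu."""
    mu = sum(xs) / len(xs)
    inv_gam = sum(1.0 / x for x in xs) / len(xs) - 1.0 / mu
    return mu, 1.0 / inv_gam
```

At x = μ the exponential term vanishes, so the density reduces to √(γ/(2πμ³)), which gives a quick sanity check on an implementation.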
Finally, we have often observed that the tail is overestimated, see Figure 4 for an illustrative example; hence the fit leads to conservative estimates of the risk of over-fitting (often still much smaller than 1/P, but more realistic).

Scatter plot. Lindgren et al. [18] have noted that the usual histograms convey limited information. Chance correlations between the randomized and original response vectors tend to increase with decreasing number of objects. This useful piece of information is lacking from the histogram representation. However, it is easily accommodated in a scatter plot where one axis (e.g. the ordinate) is reserved for the test statistic and the other one (e.g. the abscissa) for the correlation, see Wold and Eriksson [55] for a related plot. Figure 4 presents a comparison with the usual histograms. It is observed that by reducing the number of calibration objects by a factor of three, the risk of over-fitting increases from 2 × 10⁻³% to 3.6%. Simultaneously, chance correlations increase, but certainly not to the point where one would start to worry about the validity of the estimated risk. Thus, it appears that the scatter plot representation enables one to make a visual link between uncertainty in the input data (e.g. is the calibration set large and diverse enough?) and uncertainty in aspects of the output model.

2.7. Alternative statistical tests

Haaland and Thomas [3] and Osten [4] proposed F-tests, which have been criticized by van der Voet [20]. More recently, Holcomb et al. [9] proposed another F-test. However, this test assumes full-rank errorless predictor variables, which is extremely restrictive in practice. Finally, Lazraq and Cléroux [11] proposed a t-test. However, this test tacitly assumes the scores and response vector to be independent, which is clearly violated for PLS. The test statistic therefore fails to follow the hypothesized t-distribution under the null-hypothesis. The problems with this test are illustrated by these authors' recommendation to test at the unusual level of 30%.

3. EXPERIMENTAL

3.1. Simulated data

Testing the claim that the currently proposed randomization test improves on the one introduced by Sheridan et al. [23] requires data sets for which the true dimensionality is clearly defined. We therefore conducted a small Monte Carlo simulation study where a three-component data set was constructed as follows. First, Y-data for a calibration and prediction set, 50 samples in each set, were generated from a uniform distribution between 0 and 1. Next, the corresponding X-data were constructed by multiplying the 'noiseless' Y-data and the profiles depicted in Figure 5 (K = 100). Finally, normally distributed noise was added to the 'noiseless' X- and Y-data with standard deviations 0.01 and 0.05, respectively. For both randomization tests, an identical initialization of the pseudo-random number generators is deployed to enable a fair comparison.

3.2. Real data sets

The PLS component selection rules are tested on 10 real data sets, namely four near-infrared (NIR), two red/green/blue
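The construction of Section 3.1 above can be sketched as follows. The three Gaussian-shaped profiles are stand-ins for those of Figure 5, which is not reproduced here, and the function name and peak positions are illustrative:

```python
import math
import random

def simulate(n=50, k=100, noise_x=0.01, noise_y=0.05, seed=1):
    """Three-component data as in Section 3.1: three noiseless
    responses y_j ~ U(0, 1), X = sum_j y_j * profile_j, then Gaussian
    noise added to both X (sd 0.01) and Y (sd 0.05)."""
    rng = random.Random(seed)
    # three smooth, overlapping 'spectral' profiles (illustrative shapes)
    centers = [25, 50, 75]
    profiles = [[math.exp(-((v - c) / 10.0) ** 2) for v in range(k)]
                for c in centers]
    X, Y = [], []
    for _ in range(n):
        y = [rng.random() for _ in range(3)]           # noiseless responses
        X.append([sum(y[j] * profiles[j][v] for j in range(3))
                  + rng.gauss(0.0, noise_x) for v in range(k)])
        Y.append([yj + rng.gauss(0.0, noise_y) for yj in y])
    return X, Y
```

By construction the noiseless X-data have rank three, so a sound selection rule applied to such data should stop at three components.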
3.2.3. ISO-brightness
This data set contains Vis-NIR measurements (400–2500 nm)
collected from samples of ground spruce wood. The
response variable is the ISO-brightness value obtained from
the peroxide-bleached thermomechanical pulp refined from the measured wood.
3.2.9. Kelder
In this data set Kelder and Greven described a Free-Wilson analysis of the behavior-modifying activities of ACTH-related peptides. The data contain 55 objects and 24 variables.

3.2.10. Hexapep
This data set contains hexapeptides that were synthesized according to a molecular design and then tested with regard to the biological activity BA2. The Z-scaled amino acids were used to characterize the hexapeptide sequence, giving 18 predictor variables. The response variable is BA2.

3.3. Calculations
All calculations concerning the randomization test were carried out using Matlab version 7.0 (The MathWorks, Natick, MA). A copy of the program is available on request.

…the test. By contrast, the test introduced by Sheridan et al. [23] assesses only the first component to be significant. As indicated in Subsection 'Step-by-step explanation', this test tends to under-fit because contributions from earlier components are not properly removed during the randomizations. Consequently, the 'noise values' are severely overestimated.

4.2. Real data sets
Table III gives an overview of the selected model ranks using SIMCA, Unscrambler and the present randomization test, while Table IV details the results for the randomization test (1000 randomizations). All data sets are mean-centered prior to PLS modeling. Some data sets required further pre-treatment such as standard normal variate (SNV) scaling, multiplicative signal correction (MSC) and unit variance scaling (UVS). Table V presents root mean square errors (RMSEs) obtained for the independent validation set.
[Table II. Prediction and randomization test results for simulated data sets. For each response (Y1, Y2 and Y3), the columns give, per PLS component, the RMSEP and the estimated risk a (%) according to the current and the Sheridan test. The models with minimum RMSEP are indicated in bold. The symbols are explained in the text.]
Table III. Selected number of PLS components using LOO CV, CV with r = 7 segments, size of the eigenvalue, rules of significance, leverage correction and the randomization test

                                 SIMCA                              Unscrambler
Data set        Pre-treatment    LOO CV  CV(7)  Eigenvalue  Signif. LOO CV  CV(7)  Leverage   Rand. test
Carra           —                6       6      2           6       6       6      6          6–9
Carra           SNV              5       8      4           5       5       5      5          8–10
Water           —                8       8      3           9       6       8      11         9
ISO-brightness  —                6       3–6    1           6       7       5      18         3–6
ISO-brightness  MSC              5       5      2           5       5       5      18         3–5
Gas oil         —                2       2      3           2       2       2      2          5
RGB             —                3       3      8           5       5       6      10         4
Wavelet         —                7–10    7–10   4           6       5       4      6          10
Internodes      —                1       1      3           1       1       1      6          1
Poplar class    —                4       4      3           5       4       4      7          4
Kelder          —                2       2      2           2       3       5      3          2
Hexapep         UVS              3       —      2           3       11      —      4          4

For more detailed results from the randomization test, see Table IV.
Table IV. Risk of over-fitting (in %) for individual components, estimated from 1000 randomizations

                                       PLS component
Data set        Pre-treatment          1       2       3       4       5       6       7       8       9       10
                (+ centering)
Carra           —                      21      0.06    1×10⁻³  0.23    0.32    2×10⁻³  30      0.03    3.1     11
Carra           SNV                    3×10⁻³  0.07    8×10⁻³  6×10⁻³  2×10⁻⁶  1×10⁻³  0.50    0.07    7.5     0.01
Water           —                      2×10⁻⁵  1×10⁻⁶  4×10⁻⁶  2×10⁻⁵  0.13    6×10⁻⁴  1×10⁻⁴  6×10⁻⁵  7×10⁻³  16
ISO-brightness  —                      24      0.22    4.1     44      26      0.80    41      73      24      68
ISO-brightness  MSC                    7.9     1.4     2.0     16      0.12    44      37      40      60      29
Gas oil         —                      9×10⁻⁴  0.02    3.3     6×10⁻⁴  2×10⁻³  36      43      38      11      16
RGB             —                      7×10⁻⁷  7×10⁻⁵  0.70    3×10⁻³  37      28      37      28      20      87
Wavelet         —                      1×10⁻⁷  1×10⁻³  0.01    1×10⁻⁷  4×10⁻⁶  7×10⁻⁵  4×10⁻³  0.09    6×10⁻⁴  0.03
Internodes      —                      6×10⁻³  20      11      37      41      13      4.6     21      13      2.6
Poplar class    —                      0.01    0.11    0.30    1.4     11      8.8     2.9     8.5     19      12
Kelder          —                      2×10⁻⁶  0.70    24      33      91      100     100     100     100     100
Hexapep         UVS                    32      0.36    5.1     2.2     58      82      88      95      80      98
Table V. Root mean square errors for the independent validation set as a function of the number of PLS components (1–10)

Data set        Pre-treatment          1       2       3       4       5       6       7       8       9       10
Carra           —                      18.5    14.9    7.30    7.04    6.04    3.21    3.11    2.35    1.79    1.85
Carra           SNV                    14.2    7.29    6.75    5.86    3.22    2.39    1.68    1.52    1.16    1.36
Water           —                      9.17    6.99    6.70    6.07    6.04    5.47    5.39    5.43    5.44    5.31
Gas oil         —                      0.18    0.10    0.085   0.055   0.045   0.039   0.039   0.037   0.039   0.038
Wavelet         —                      0.27    0.24    0.22    0.23    0.23    0.23    0.22    0.22    0.22    0.22
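The RMSE entries of Table V follow the usual definition over the validation objects; as a minimal sketch:

```python
def rmse(y_true, y_pred):
    """Root mean square error over a validation set:
    sqrt(mean((y_true - y_pred)^2))."""
    n = len(y_true)
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n) ** 0.5
```

The same function, applied to CV-held-out predictions or to leverage-corrected residuals, yields RMSECV and RMSELC, respectively.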
[Figure 6. Carra data set: (a) raw and (b) SNV-scaled calibration set spectra in arbitrary signal units.]
…but 2, 5 and 6 are; this is in fairly good agreement with the randomization test, where components 2, 3 and 6 are significant. After pre-treatment with MSC, component 1 is not significant according to the rules, which is corroborated by the randomization test.

Gas oil

Good agreement between SIMCA and Unscrambler selection rules, while the randomization test stands out (Table III). The decrease of RMSECV upon going from two to five components (Figure 1a) is incorrectly found to be insignificant by the SIMCA and Unscrambler selection rules. Caution to avoid over-fitting has led to under-fitting: RMSEV for the independent validation set increases from 0.045 to 0.10 when dropping components from 5 to 2 (Table V).

Nonmonotonic behavior for the randomization test: the estimated tail probability for component 3 is 3.3% (Figure 3a), while successive components 4 and 5 reach much smaller values (see e.g. Figure 4 for component 5). Component 6 is clearly insignificant by the test (Figure 3b). The reason for this nonmonotonic behavior may be the presence of tiny nonlinear substructures in the spectra corresponding to high response values. Indicative is that high response values are underestimated by their prediction: clearly so for the two-, and to a lesser extent for the five-dimensional model (Figure 7).

…Y. Y itself is a discrete, noiseless variable. As a result, both CV and the randomization test will find many significant components. Therefore we chose to display an interval of 7–10 components for SIMCA's CV (Table III). This result is in good agreement with the randomization test (Table IV). It also agrees fairly well with the minimum RMSEV obtained between 3 and 10 components for the independent validation set (Table V).

Internodes

With the exception of the eigenvalue criterion and leverage correction, all SIMCA and Unscrambler selection rules suggest that only the first component is predictive. The randomization test clearly supports only the first component.

Poplar class

Some disagreement between CV and other SIMCA and Unscrambler selection rules. The randomization test supports four components, in agreement with CV.

Kelder

Disagreement between SIMCA and Unscrambler selection rules (Table III). The randomization test clearly supports two components, in agreement with the SIMCA selection rules.

RGB
[Figure 7. Gas oil data set: prediction results (NIR prediction versus reference value) for the independent validation set obtained using (a) two and (b) five PLS components.]

[Figure 8. Hexapep data set: RMSE as a function of PLS components from fit, LOO CV and leverage correction.]
…(RMSECV) should not differ much from the average fit error, especially for under-fitting models. The failure of CV can be explained from the designed character of the data. In a strict sense, CV can only be used if the data constitute a random sample from a population. In a loose sense, the effect of a design can be negligible for practical purposes, but only if the calibration set is large enough, which apparently is not the case here (N = 16).

Erratic behavior for the randomization test. The first component is insignificant by the test (Table IV), whereas three successive components are, depending on the level of confidence. This is in agreement with SIMCA's rules of significance, which also claim that the first component is insignificant but the second and third are significant.

The Unscrambler recommends four components based on applying criterion (6) to the RMSEV from leverage correction. This happens to agree with the randomization test. However, the associated RMSEV is twice the value achieved for the global minimum at eight components (1.42 vs. 0.71). In this case, the user is confronted with a difficult choice: follow the recommendation based on criterion (6), which depends on an ad hoc value for c, or accept the rank that yields the global minimum? The results of the randomization test are easier to interpret.

4.2.2. General
The suggested optimum number of components varies between the two implementations of CV and the other selection rules available in SIMCA and Unscrambler, as well as the randomization test (Table III). Although this disagreement may be confusing for the less experienced user, it is important to remember that an exact number does not exist, but rather an appropriate interval. It is comforting that CV as well as the randomization test seem to find this interval in a reliable way, as indicated by the results. Picard and Cook [32] note that 'In the analysis of large data sets, an appropriate way to proceed is often not immediately apparent. Consequently, some aspects of exploratory data analysis naturally arise. Competing models may be temporarily entertained and the final choice of a predictive equation is influenced by many factors, including the personal experiences and prejudices of the investigator'. In other words, imprecise results only present an immediate problem if a definite choice is desired. Otherwise, one could simply keep several competing models and put them all to the test on future data. This future prediction-testing phase would then sometimes lead to the final model, a sound and safe strategy. However, for the common selection rules studied in this work, one observes that competing models may have differing numbers of components. In these cases, it may be hard to defend why certain models are considered as competitors. By contrast, the randomization test leads to models for which the disagreeing number of components has a simple interpretation. Clearly, a higher degree of confidence leads to smaller models and vice versa. Actually, this is the situation for any statistical test, including all the ones looked at here; it often needs to be supplemented with a rule that encourages a simpler model before a more complicated one. Finally, it is emphasized that the proposed randomization test need not be used in stand-alone mode. It can also be used in an interactive fashion to guide the component selection according to another rule, for example, CV.

5. CONCLUSIONS

- The simulation results illustrate that the currently proposed randomization test improves on the one introduced by Sheridan et al. [23].
- The randomization test assumes the components to enter the model in a natural order, that is, according to their relevance for describing the response variable. After all, this property is a theoretical advantage of PLS over methods like PCR [65]. The ideal natural order is not always attained for practical data. It can, for example, be distorted by an improper pre-treatment of the predictor variables. Fortunately, the resulting erratic behavior can point at useful pre-treatment methods. Likewise, the test appears to be sensitive to nonlinearities in the data.
- The proposed randomization test, possibly in combination with other PLS component selection rules, will make the choice of the right model easier for the user.
- A comparison with the recently introduced Monte Carlo CV [35–37] is an interesting topic for future research. Likewise, a more systematic evaluation of the test for designed data seems to be warranted.
- One area where CV works poorly, both for PLS and PCR, is design of experiments, where exclusion of data has large consequences for modeling. Here the randomization test should have merit.

Acknowledgements
Chris Brown and Alejandro Olivieri are thanked for supplying Matlab code for evaluating the inverse Gaussian fit.

REFERENCES

1. Wold S, Eriksson L, Sjöström M. PLS in chemistry. In The Encyclopedia of Computational Chemistry, Schleyer PVR, Allinger NL, Clark T, Gasteiger J, Kollman PA, Schaefer HF, III, Schreiner PR (eds). Wiley: Chichester, UK, 1999; 2006–2020.
2. Denham MC. Choosing the number of factors in partial least squares regression: estimating and minimizing the mean squared error of prediction. J. Chemometr. 2000; 14: 351–361.
3. Haaland DM, Thomas EV. Partial least-squares methods for spectral analyses. 1. Relation to other quantitative calibration methods and the extraction of qualitative information. Anal. Chem. 1988; 60: 1193–1202.
4. Osten DW. Selection of optimal regression models via cross-validation. J. Chemometr. 1988; 2: 39–48.
5. Wakeling IN, Morris JJ. A test of significance for partial least squares regression. J. Chemometr. 1993; 7: 291–304.
6. Höskuldsson A. Dimension of linear models. Chemometr. Intell. Lab. Syst. 1996; 32: 37–55.
7. Messick NJ, Kalivas JH, Lang PM. Selecting factors for partial least squares. Microchem. J. 1997; 55: 200–207.
8. Faber NM, Kowalski BR. Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares. J. Chemometr. 1997; 11: 181–238.
9. Holcomb TR, Hjalmarsson H, Morari M, Tyler ML. Significance regression: a statistical approach to partial least squares. J. Chemometr. 1997; 11: 283–309.
10. Liu D, Shah SL, Grant Fisher D. Choice of latent explanatory variables: a multiobjective optimization approach. J. Chemometr. 2000; 14: 79–92.
11. Lazraq A, Cléroux R. The PLS multivariate regression model: testing the significance of successive PLS components. J. Chemometr. 2001; 15: 523–536.
12. Green RL, Kalivas JH. Graphical diagnostics for regression model determinations with consideration of the bias/variance trade-off. Chemometr. Intell. Lab. Syst. 2002; 60: 173–188.
13. Li B, Morris J, Martin EB. Model selection for partial least squares regression. Chemometr. Intell. Lab. Syst. 2002; 64: 79–89.
14. Forina M, Lanteri S, Cerrato Oliveros MC, Pizarro Millan C. Selection of useful predictors in multivariate calibration. Anal. Bioanal. Chem. 2004; 380: 397–418.
15. Fisher RA. The Design of Experiments. Oliver and Boyd: Edinburgh, 1935.
16. Klopman G, Kalos AN. Causality in structure-activity studies. J. Comput. Chem. 1985; 6: 492–506.
17. Thioulouse J, Lobry JR. Co-inertia analysis of amino-acid physico-chemical properties and protein composition with the ADE package. Comput. Appl. Biosci. 1995; 11: 321–329.
18. Lindgren F, Hansen B, Karcher W, Sjöström M, Eriksson L. Model validation by permutation tests: applications to variable selection. J. Chemometr. 1996; 10: 521–532.
19. Baumann K, Stiefl N. Validation tools for variable subset regression. J. Comput.-Aided Mol. Des. 2004; 18: 549–562.
20. van der Voet H. Comparing the predictive accuracy of models using a simple randomization test. Chemometr. Intell. Lab. Syst. 1994; 25: 313–323.
21. van der Voet H. Corrigendum to 'comparing the predictive accuracy of models using a simple randomization test'. Chemometr. Intell. Lab. Syst. 1995; 28: 315.
22. Dijksterhuis GB, Heiser WJ. The role of permutation tests in exploratory multivariate data analysis. Food Qual.
32. Picard RR, Cook RD. Cross-validation of regression models. J. Amer. Stat. Assoc. 1984; 79: 575–583.
33. Shao J. Linear model selection by cross-validation. J. Amer. Stat. Assoc. 1993; 88: 486–494.
34. Zhang P. Model selection via multifold cross validation. Ann. Stat. 1993; 21: 299–313.
35. Xu Q-S, Liang Y-Z. Monte Carlo cross validation. Chemometr. Intell. Lab. Syst. 2001; 56: 1–11.
36. Gourvénec S, Fernández Pierna JA, Massart DL, Rutledge DN. An evaluation of the PoLiSh smoothed regression and the Monte Carlo cross-validation for the determination of the complexity of a PLS model. Chemometr. Intell. Lab. Syst. 2003; 68: 41–51.
37. Xu Q-S, Liang Y-Z, Du Y-P. Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration. J. Chemometr. 2004; 18: 112–120.
38. Baumann K, Albert H, von Korff M. A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part I. Search algorithm, theory and simulations. J. Chemometr. 2002; 16: 339–350.
39. Baumann K, von Korff M, Albert H. A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part II. Practical applications. J. Chemometr. 2002; 16: 351–360.
40. Baumann K. Cross-validation as the objective function for variable selection techniques. Trends Anal. Chem. 2003; 22: 395–406.
41. Breiman L, Friedman JH, Olshen RA, Stone C. Classification and Regression Trees. Wadsworth: Belmont, CA, 1984.
42. Reeves JB, III, Delwiche SR. SAS® partial least squares regression for analysis of spectroscopic data. J. Near Infrared Spectrosc. 2003; 11: 415–431.
43. van der Voet H. Pseudo-degrees of freedom for complex predictive models: the example of partial least squares. J. Chemometr. 1999; 13: 195–208.
44. Jackson JE. A User's Guide to Principal Components. Wiley: New York, 1991.
45. Martens HA, Dardenne P. Validation and verification of regression in small data sets. Chemometr. Intell. Lab. Syst.
Prefer. 1995; 6: 263–270. 1998; 44: 99–121.
23. Sheridan RP, Nachbar RB, Bush BL. Extending the trend 46. Westad F, Martens H. Variable selection in near infrared
vector: the trend matrix and sample-based partial least spectroscopy based on significance testing in partial least
squares. J. Comput.-Aided Mol. Des. 1994; 8: 323–340. squares regression. J. Near Infrared Spectrosc. 2000; 8:
24. Stone M. Cross-validatory choice and assessment of 117–124.
statistical predictions. J. Roy. Stat. Soc. B 1974; 36: 47. Høy M, Steen K, Martens H. Review of partial least
111–133. squares regression prediction error in Unscrambler. Che-
25. Wold S. Cross-validatory estimation of the number of mometr. Intell. Lab. Syst. 1998; 44: 123–133.
components in factor and principal components models. 48. Faber NM. Comparison of two recently proposed
Technometrics 1978; 20: 397–405. expressions for partial least squares regression predic-
26. Allen DM. The prediction sum of squares as a criterion for tion error. Chemometr. Intell. Lab. Syst. 2000; 52: 123–134.
selecting prediction variables. Technical Report No. 23, 49. Faber NM. Estimating the uncertainty in estimates of
Dept. of Statistics, University of Kentucky: Lexington, root mean square error of prediction: application to
KY, 1971. determining the size of an adequate test set in multi-
27. Martens H, Næs T. Multivariate calibration by data variate calibration. Chemometr. Intell. Lab. Syst. 1999; 49:
compression. In Near-Infrared Technology in the Agricul- 79–89.
tural and Food Industries, Williams P, Norris K (eds). 50. Efron B. Bootstrap methods: another look at the jack
American Cereal Association: St Paul, MN, 1987; 57–87. knife. Ann. Stat. 1979; 7: 1–26.
28. Lorber A, Kowalski BR. Alternatives to cross-validatory 51. Fisher RA. Statistical Methods for Research Workers. Oliver
estimation of the number of factors in multivariate cali- and Boyd: Edinburgh, 1925.
bration. Appl. Spectrosc. 1990; 44: 1464–1470. 52. Hotelling H. New light on the correlation coefficient and
29. Marbach R, Heise HM. Calibration modeling by partial its transform. J. Roy. Stat. Soc. B 1953; 15: 193–232.
least-squares and principal component regression and its 53. Chhikara RS, Folks JL. The Inverse Gaussian Distribution:
optimization using an improved leverage correction for Theory, Methodology, and Applications. Marcel Dekker:
prediction testing. Chemometr. Intell. Lab. Syst. 1990; 9: New York, 1989.
45–63. 54. Dennis B, Munholland PL, Scott JM. Estimation of
30. SIMCA-P and SIMCA Pþ 10, User Guide, Umetrics AB: growth and extinction parameters for endangered
Umeå, Sweden, 2002. species. Ecol. Monogr. 1991; 61: 115–143.
31. Unscrambler for Windows, User’s guide. CAMO AS: Trond- 55. Wold S, Eriksson L. Statistical validation of QSAR
heim, Norway, 1996. results. In Chemometric Methods in Molecular Design,
Copyright # 2007 John Wiley & Sons, Ltd. J. Chemometrics 2007; 21: 427–439
DOI: 10.1002/cem
PLS component selection 439
van de Waterbeemd H (ed.). VCH Publishers: Weinheim, 60. Nilsson D, Edlund U. Pine and spruce roundwood
1995; 309–318. species classification using multivariate image analysis
56. Dyrby M, Petersen RV, Larsen J, Rudolf B, Nørgaard L, on bark. Holzforschung 2005; 59: 689–695.
Engelsen SB. Towards on-line monitoring of the compo- 61. Wiklund S, Karlsson M, Antti H, Johnels D, Sjöström M,
sition of commercial carrageenan powders. Carbohydr. Wingsle G, Edlund U. A new metabonomic strategy for
Polym. 2004; 57: 337–348. analyzing the growth process in poplar tree. Plant
57. Persson E, Sjöström M, Sundblad L-G, Wiklund S, Wil- Biotechnol. J. 2005; 3: 353–362.
hemsson L. Fresh timber—a challenge to forestry and 62. Kelder J, Greven HM. A quantitative study on the
mensuration. Resultat 2002; 8: 2–4. relationship between structure and behavi activity of
58. Nilsson D, Edlund U, Sjöström M, Agnemo R. Prediction peptides related to ACTH. Rec. Trav. Chim. Pays-Bas
of thermo mechanical pulp brightness using NIR spec- 1979; 98: 168–172.
troscopy on wood raw material. Paperi Ja Puu (Pap. 63. Kubinyi H. Evolutionary variable selection in regression
Timber) 2005; 87: 102–109. and PLS analyses. J. Chemometr. 1996; 10: 119–133.
59. Fernández Pierna JA, Jin L, Wahl F, Faber NM, Massart 64. Eriksson L, Johansson E, Kettaneh-Wold N, Wold S.
DL. Estimation of partial least squares regression (PLSR) Multi- and Megavariate Data analysis. Principles and Appli-
prediction uncertainty when the reference values carry a cations. Umetrics AB: Umeå, Sweden, 2002.
sizeable measurement error. Chemometr. Intell. Lab. Syst. 65. Helland IS. On the structure of partial least squares
2003; 65: 281–291. regression. Commun. Stat.—Simula. 1988; 17: 581–607.
Copyright # 2007 John Wiley & Sons, Ltd. J. Chemometrics 2007; 21: 427–439
DOI: 10.1002/cem