Uncertainty Estimation for Multivariate Regression Coefficients
Abstract
Five methods are compared for assessing the uncertainty in multivariate regression coefficients, namely, an approximate
variance expression and four resampling methods (jack-knife, bootstrapping objects, bootstrapping residuals, and noise
addition). The comparison is carried out for simulated as well as real near-infrared data. The calibration methods considered are
ordinary least squares (simulated data), partial least squares regression, and principal component regression (real data). The
results suggest that the approximate variance expression is a viable alternative to resampling.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Multivariate calibration; Regression vector; Uncertainty estimation; Resampling; Jack-knife; Bootstrap; Monte Carlo simulation;
OLS; PLSR; PCR; NIR
170 N.M. Faber / Chemometrics and Intelligent Laboratory Systems 64 (2002) 169–179
fidence limits on estimated model parameters, whereas Carrol et al. [6] focus on bias estimation. It was used by Derks et al. [7] to assess the uncertainty in the output of artificial neural networks (ANNs), while Duewer et al. [8] and Dable and Booksh [9] reported successful application to pseudo-rank estimation. The versatility of noise addition is further illustrated by the work of del Río et al. [10], who used the method to validate expression-based prediction intervals in linear regression with errors on both axes.

The purpose of this study is to investigate the relative merits of various resampling methods and an approximate variance expression. The resampling methods under investigation are the jack-knife, bootstrapping objects, bootstrapping residuals, and noise addition. The different approaches are compared for simulated as well as real near-infrared (NIR) data. The simulated data are modeled using ordinary least squares (OLS). This allows one to study the methods under idealized circumstances. For example, the approximate variance expression specializes to a well-known exact one under these circumstances. It is believed that these simulations yield insight that can be used to better interpret the results obtained for real NIR data modeled by PLSR or PCR.

2. Theory

The multiple linear regression model is assumed, i.e.,

y = Xb + e    (1)

where y (I × 1) is the true predictand (property of interest); X (I × J) is the true predictor matrix (e.g., spectra); b (J × 1) is the true regression vector; e (I × 1) is a vector of residuals; and I and J denote the number of training samples and predictor variables (e.g., wavelengths), respectively. The actual modeling may be based on realizations of y and X that are corrupted by non-negligible measurement errors. However, the presence of measurement errors is not indicated by additional notation to simplify the presentation.

Underlying all resampling methods is the assumption that the resampled entity is independently identically distributed (iid). The validity of this assumption depends, among others, on the experimental set-up. For a correct treatment of resampling methods, it is therefore useful to distinguish between two fundamentally different experimental set-ups, namely random and controlled calibration [11]. In the first case, the training set predictor variables (rows of X) are randomly observed, whereas in the latter, they are fixed by design. A design leads to exact relationships in the training set data because measurements are taken at special points. As a result, a certain iid assumption will be violated. Clustering of the data, which is often the case in quantitative structure-activity relationship (QSAR) work, may have similar consequences. Score plots are convenient for visualizing relationships in the data.

2.2. Estimation of regression coefficients

The OLS estimate for b is:

b_OLS = (X^T X)^{-1} X^T y    (2)

where the superscripts '-1' and 'T' denote matrix inversion and transposition, respectively. For the OLS solution to exist, X must be of full column rank.

The F-factor PLSR estimate for b can be expressed under the SIMPLS formalism [12] as:

b_PLSR = (R R^T) X^T y    (3)

where R (J × F) is a matrix of weights; when applied to the predictor variables, one obtains the scores as T = XR. Eq. (3) includes Eq. (2) as a special case because the full-factor PLSR model reproduces the OLS solution. The PLSR estimate also exists if X is rank-deficient (as long as the number of factors does not exceed the rank of X) or even if J > I, which is the 'underdetermined' case often encountered in spectroscopy. Similar to Eq. (3), the PCR estimate is given by:

b_PCR = (V Λ^{-1} V^T) X^T y    (4)

where V and Λ contain a subset of the eigenvectors and eigenvalues of X^T X, respectively. Eq. (4) can be brought in the form of Eq. (3) by a simple rescaling of the eigenvectors, i.e., R = V Λ^{-1/2}.
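As an illustration, Eqs. (2) and (4) can be sketched in a few lines of NumPy. The code below is not from the paper: the data are random and the number of factors F is an arbitrary choice. It checks numerically that the rescaling R = V Λ^{-1/2} casts the PCR estimate in the common form b = R R^T X^T y, and that the full-factor model reproduces the OLS solution.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 24, 6                    # 24 objects, 6 wavelengths, as in Section 3.1
X = rng.standard_normal((I, J))
y = X @ rng.standard_normal(J) + 0.1 * rng.standard_normal(I)

# Eq. (2): OLS, requires X of full column rank
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Eq. (4): PCR retaining the F largest eigenpairs of X^T X
F = 4
evals, V = np.linalg.eigh(X.T @ X)       # eigh returns ascending eigenvalues
idx = np.argsort(evals)[::-1][:F]
V_F, L_F = V[:, idx], evals[idx]
b_pcr = V_F @ np.diag(1.0 / L_F) @ V_F.T @ X.T @ y

# Rescaling R = V_F Λ_F^{-1/2} brings Eq. (4) into the form of Eq. (3)
R = V_F @ np.diag(1.0 / np.sqrt(L_F))
assert np.allclose(R @ R.T @ X.T @ y, b_pcr)

# The full-factor model reproduces the OLS solution
idx_all = np.argsort(evals)[::-1]
V_a, L_a = V[:, idx_all], evals[idx_all]
assert np.allclose(V_a @ np.diag(1.0 / L_a) @ V_a.T @ X.T @ y, b_ols)
```

A SIMPLS implementation would construct R sequentially from y as well as X; the PCR-based R above merely illustrates the shared algebraic form.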
2.3. Uncertainty in regression coefficients

In the current study, the uncertainty in the coefficient estimates is quantified by a standard error (square root of a variance):

σ(b_j) = [V(b)]_jj^{1/2},    j = 1, ..., J    (5)

where V(·) symbolizes the covariance matrix of a vector quantity and b stands for b_OLS, b_PLSR, or b_PCR, respectively. It is important to note that the standard error fully accounts for the uncertainty only when OLS is applied in connection with errorless predictors (e.g., spectra). With measurement errors in X, the OLS solution will be biased [13]. The relative importance of this bias depends on the size of the errors in X. Often, the signal-to-noise ratio is rather high for spectroscopic data (≫ 10), so that we can safely neglect this bias. The situation is further complicated when using PLSR or PCR because these methods owe much of their popularity to the bias-variance trade-off. It is well known that the number of factors is selected as a compromise between bias (too few factors) and variance (too many factors). However, a successful bias-variance trade-off implies the bias to be relatively unimportant.

2.3.1. Approximate formula

An approximate covariance matrix of the OLS regression coefficients is given by:

V(b_OLS) = MSEC (X^T X)^{-1}    (6)

where MSEC denotes the mean squared error of calibration estimated as:

MSEC = Σ_{i=1}^{I} (y_i − y_fit,i)^2 / (I − df)    (7)

in which y_i is the predictand for the ith training sample; y_fit,i is the corresponding fitted value; and df denotes the degrees of freedom consumed by the model parameters. For OLS, each parameter takes away a degree of freedom from the data, likewise a potential intercept. Eq. (6) is approximate [13], unless X is without error.

De Jong [12] noticed that the analogy between Eqs. (2) and (3) suggests that RR^T is proportional to an approximate covariance matrix for the PLSR coefficients. Faber and Kowalski [3] further worked out this observation for PLSR as well as PCR (see their Sections 3.3.3, 3.3.4, and 3.3.5). Recalling that the notation RR^T also applies to OLS (full-factor PLSR) and PCR (R = V Λ^{-1/2}) yields:

V(b) = MSEC (R R^T)    (8)

where b stands for b_OLS, b_PLSR, or b_PCR, and R is estimated using the appropriate method.

Two comments seem to be in order. First, the correct estimation of MSEC requires an adequate number of degrees of freedom. As OLS, PCR consumes a single degree of freedom for each factor when the factors are chosen without reference to the predictand vector (e.g., in the order of their corresponding eigenvalues). By contrast, the appropriate number of degrees of freedom for PLSR is not a trivial matter, since the construction of the factors includes the predictand vector. The rigorous study of van der Voet [14] has clearly established that the conventional number, i.e., a single degree of freedom for each factor, is too small for the early factors and too large for the latter ones. A sound alternative can be calculated using the results of leave-one-out cross-validation (see Eq. (26) in Ref. [14]). Second, MSEC in Eq. (8) may contain a bias term (see Denham [15] for more details). Thus, Eq. (8) yields a mean squared error in the regression coefficients, rather than a variance. Because we have assumed bias to be relatively small (successful bias-variance trade-off), this distinction will not be made explicit to simplify the presentation.

It is important to note that the validity of the approximate variance formulas (6) and (8) does not depend on the experimental set-up leading to the data (random or fixed calibration). Clearly, the variance in the parameter estimates depends on the data only, not on how they are obtained. The 'design' of the training set is reflected in the matrix (X^T X)^{-1} or RR^T, which determines the amount of error propagation. Obviously, error propagation is minimized by a proper design, but in many applications, e.g., when dealing with natural produce, it is impossible to construct a design. The main assumption underlying Eqs. (6) and (8) is that the noise in y and X is adequately accounted for by MSEC.
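A minimal sketch of the approximate formula for the OLS case (Eqs. (5)-(7)); the simulated data and the variable names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J = 24, 6
X = rng.standard_normal((I, J))
y = X @ rng.standard_normal(J) + 0.1 * rng.standard_normal(I)

b = np.linalg.solve(X.T @ X, X.T @ y)   # Eq. (2)
y_fit = X @ b

# Eq. (7): one degree of freedom per OLS parameter (no intercept fitted here)
df = J
msec = np.sum((y - y_fit) ** 2) / (I - df)

# Eq. (6); for a full-factor model this equals MSEC * R R^T of Eq. (8)
V_b = msec * np.linalg.inv(X.T @ X)

# Eq. (5): one standard error per coefficient
se = np.sqrt(np.diag(V_b))
```

For PLSR or PCR, one would replace (X^T X)^{-1} by R R^T with the appropriate R, and adjust df as discussed above.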
2.3.2. Jack-knife

The jack-knife generates reduced data sets by deleting objects, i.e., an element of y and the corresponding row of X (cf. Fig. 1a and b for a univariate model):

y_{-i} = (y_1, ..., y_{i-1}, y_{i+1}, ..., y_I)^T,    i = 1, ..., I
X_{-i} = (x_1^T, ..., x_{i-1}^T, x_{i+1}^T, ..., x_I^T)^T,    i = 1, ..., I.    (9)

For these reduced data sets, coefficient vectors (b_{-i}) are estimated. Combining these estimates with the estimate of the entire data set (b) yields so-called pseudo-values [4]:

b_i^pseudo = I b − (I − 1) b_{-i},    i = 1, ..., I.    (10)

The desired covariance matrix follows from the spread in the pseudo-values as:

V(b) = [1 / (I(I − 1))] Σ_{i=1}^{I} (b_i^pseudo − b̄^pseudo)(b_i^pseudo − b̄^pseudo)^T    (11)

where b̄^pseudo denotes the average of the pseudo-values. When comparing Eq. (11) with the common expression for a covariance matrix (see Eq. (14) below), the additional division by I is noteworthy. The reason for this additional division is that the covariance matrix for the mean of the pseudo-values estimates the desired covariance matrix [4]. Martens and Martens [16] introduced a modification of the jack-knife that yields similar results.

It is seen that this procedure does not make any assumption about the noise in y or X. The resampled entities are (y_i, x_i)-pairs, so they should form a random sample from some multivariate distribution. This implies that the data should not be designed or, maybe even worse, clustered.

2.3.3. Bootstrapping objects

Two modes of bootstrapping are commonly distinguished, namely bootstrapping objects (also known as bootstrapping pairs or observations) and bootstrapping residuals (see Chapter 9 in Ref. [4]). The method working with residuals will be explained in Section 2.3.4. Bootstrapping objects proceeds as follows. New data sets are generated by randomly drawing objects with replacement (cf. Fig. 1a and c for a univariate model):

(y_i^b, x_i^b) = (y_{n_i^b}, x_{n_i^b}),    i = 1, ..., I;  b = 1, ..., B    (12)

where

n_i^b = int[U(0–1) · I] + 1,    i = 1, ..., I;  b = 1, ..., B    (13)

in which int[·] symbolizes the integer part of the associated number and U(0–1) is a random number that is uniformly distributed between zero and unity. The use of random numbers effectively makes bootstrapping a Monte Carlo simulation technique. The procedure is repeated B times, where B should be selected large enough to yield precise estimates for the desired standard error. Eq. (13) ensures that for each draw, any of the (y_i, x_i)-pairs is selected with probability I^{-1}. As a result, some of the objects will be present more than once, whereas others are not selected at all. The desired covariance matrix follows from the common formula of a covariance matrix of independent vectors:

V(b) = [1 / (B − 1)] Σ_{b=1}^{B} (b_b − b̄)(b_b − b̄)^T    (14)

where b̄ denotes the average of the bootstrapped values of b. Bootstrapping objects is similar to the jack-knife in the sense that it does not make any assumption about the noise in y or X, but the (y_i, x_i)-pairs should form a random sample from some multivariate distribution (random calibration).
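The jack-knife (Eqs. (9)-(11)) and bootstrapping objects (Eqs. (12)-(14)) can be sketched as follows for the OLS case. The data are illustrative; zero-based integer indexing replaces the explicit index notation of Eq. (13):

```python
import numpy as np

def ols(X, y):
    """OLS estimate of Eq. (2)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(2)
I, J = 24, 6
X = rng.standard_normal((I, J))
y = X @ rng.standard_normal(J) + 0.1 * rng.standard_normal(I)
b = ols(X, y)

# Jack-knife: delete one object at a time and form pseudo-values (Eqs. (9)-(10))
pseudo = np.array([I * b - (I - 1) * ols(np.delete(X, i, axis=0),
                                         np.delete(y, i))
                   for i in range(I)])
d = pseudo - pseudo.mean(axis=0)
V_jack = d.T @ d / (I * (I - 1))         # Eq. (11), note the extra division by I

# Bootstrapping objects: redraw (y_i, x_i)-pairs with replacement (Eqs. (12)-(13))
B = 1000
b_boot = np.array([ols(X[idx], y[idx])
                   for idx in rng.integers(0, I, size=(B, I))])
d = b_boot - b_boot.mean(axis=0)
V_boot = d.T @ d / (B - 1)               # Eq. (14)

se_jack = np.sqrt(np.diag(V_jack))
se_boot = np.sqrt(np.diag(V_boot))
```

For PLSR or PCR one would substitute the appropriate estimator for `ols`, refitting (including factor extraction) on every reduced or redrawn data set.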
Fig. 1. Illustration of resampling methods for univariate straight-line fit of single x versus y. (a) Original data points (o), model (—), fitted points (•), and residuals (···). (b) Original data points (o) and model (—) for objects {1,2,4,5} selected by jack-knife. (c) Original data points (o) and model (—) for objects {4,4,4,5,2} selected by bootstrap. (d) Fitted points for original data (•) and model (—) for objects obtained by adding bootstrap-selected residuals {3,2,3,1,4} to the fitted points. Resampling objects works directly with the original data points (o), whereas resampling residuals works with the fitted points (•) and estimated residuals (– – –).
2.3.4. Bootstrapping residuals

Bootstrapping residuals also uses replicate data sets to estimate the uncertainty. First, residuals are calculated as:

e_i = (y_i − y_fit,i) / (1 − df/I)^{1/2},    i = 1, ..., I.    (15)

The 'raw' residual in the numerator is corrected for degrees of freedom because the difference between observed and fitted data is consistently smaller than the deviation from the expected values (E[y_i]; i = 1, ..., I). Alternatively, one could adjust the 'raw' residuals by means of the associated leverage. Next, new residual vectors are generated by randomly drawing residuals with replacement. Finally, new data sets are constructed by adding these new residual vectors to the fitted predictand vector (cf. Fig. 1a and d for a univariate model):

y_i^b = y_fit,i + e_{n_i^b},    i = 1, ..., I;  b = 1, ..., B    (16)

where n_i^b is as defined in Eq. (13). The procedure is repeated B times and the desired covariance matrix follows from Eq. (14). Unlike jack-knifing and bootstrapping objects, resampling residuals alters the data similarly as the original perturbation would (e.g., measurement noise). The procedure assumes that the order in which the residuals are drawn is immaterial (exchangeability), which is the case when the noise is iid. As discussed by Efron and Tibshirani [4], bootstrapping residuals yields better results for the classic linear regression model (full column rank X) than bootstrapping objects when this condition is met. However, since the results depend critically on this iid condition [4], bootstrapping objects is the preferred mode [17].

Finally, it is noted that the (y_i, x_i)-pairs need not form a random sample from some multivariate distribution. The reason for this is that the simulations are performed conditional on the model. This conditioning effectively fixes the x_i, so that it is immaterial whether the data are designed (fixed calibration) or not (random calibration).

2.3.5. Add noise to original data

Similar to Eq. (16), N new data sets are generated according to:

y_i^n = y_i + MSEC^{1/2} F(0,1),    i = 1, ..., I;  n = 1, ..., N    (17)

where F(0,1) symbolizes a random number generated from a distribution with mean zero and standard deviation unity. The covariance matrix follows from the equivalent of Eq. (14).

Noise addition and bootstrapping residuals have in common that the (y_i, x_i)-pairs need not form a random sample. However, the noise addition method is more versatile because it can deal with heteroskedastic and correlated noise.

3. Experimental

3.1. Simulated NIR data

Fearn [18] published a NIR data set that was collected for the prediction of protein content in ground wheat samples. Because wheat samples cannot be designed, the experimental set-up conforms to random calibration. Hence, no resampling methods can be excluded from the start. The reference values were obtained using the Kjeldahl method. The training set consists of 24 objects (I = 24). The NIR spectra are digitized at six wavelengths in the range 1680–2310 nm (J = 6). This data set has been used extensively in the chemometrics literature for method testing (see Refs. [3,15] and references therein). The simulated data are generated by the following two steps:

1. The 'true' y, X, and b in Eq. (1) are the OLS fit of the experimental y, the experimental X, and b_OLS, respectively.
2. 'Experimental' realizations of y and X are constructed by artificially adding noise to y and X. The noise added to y is iid with standard deviation 0.2% (m/m), which is the estimated uncertainty of the Kjeldahl method [19]. The noise added to X is either iid or proportional. The standard deviation of the iid noise takes the values 0%, 0.25%, 0.5%, 0.75%, and 1% of the maximum value of X. Similarly, the proportional noise has standard deviation 0%, 0.25%, 0.5%, 0.75%, and 1% of the associated value of X. It is believed that the level of the noise in X is unrealistically high for certain spectroscopies (e.g., NIR), but it may be adequate for testing the validity of uncertainty estimates.

A single 'experimental' realization suffices to calculate estimates of standard error in the regression coefficients. However, these estimates contain an uncertainty themselves. Clearly, an unavoidable source of the uncertainty in error estimates is that some 'experimental' realizations are noisier than others are by chance alone. To quantify the total uncertainty, the error estimates are calculated for 100 independent 'experimental' realizations. Bootstrapping and noise addition are based on 1000 replicates constructed for a single 'experimental' realization (B = N = 1000). This large number is chosen to ensure that the uncertainty of the error estimates is mainly determined by the variability among the 'experimental' realizations [20].

The 'ideal' estimate of standard error is obtained from the spread in the regression vectors obtained for 1000 'experimental' realizations. Obviously, one cannot calculate this 'ideal' estimate in practice because only few realizations are available—most often just a
Fig. 3. Standard errors in PLSR coefficients for oxygenate data (×10^{-2}): formula-based versus the ones obtained by bootstrapping objects (top) and bootstrapping residuals (bottom).

Fig. 4. Reliabilities of PLSR coefficients for oxygenate data: bootstrapping objects (top) and bootstrapping residuals (bottom). The dotted line (···) indicates the value corresponding to a 90% confidence interval including zero.
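The two residual-based procedures, bootstrapping residuals (Eqs. (15) and (16)) and noise addition (Eq. (17)), can be sketched in NumPy as follows. The data are illustrative, B = 1000 mirrors Section 3.1, and a normal distribution is one admissible choice for F(0,1):

```python
import numpy as np

def ols(X, y):
    """OLS estimate of Eq. (2)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(3)
I, J, B = 24, 6, 1000
X = rng.standard_normal((I, J))
y = X @ rng.standard_normal(J) + 0.1 * rng.standard_normal(I)

b = ols(X, y)
y_fit = X @ b
df = J
e = (y - y_fit) / np.sqrt(1.0 - df / I)      # Eq. (15): df-corrected residuals
msec = np.sum((y - y_fit) ** 2) / (I - df)   # Eq. (7)

def cov_from_draws(bs):
    """Eq. (14): covariance of a set of resampled coefficient vectors."""
    d = bs - bs.mean(axis=0)
    return d.T @ d / (len(bs) - 1)

# Bootstrapping residuals: resample e with replacement, add to fitted y (Eq. (16))
b_res = np.array([ols(X, y_fit + e[rng.integers(0, I, I)]) for _ in range(B)])

# Noise addition: add fresh noise of size MSEC^{1/2} to the observed y (Eq. (17))
b_noise = np.array([ols(X, y + np.sqrt(msec) * rng.standard_normal(I))
                    for _ in range(B)])

se_res = np.sqrt(np.diag(cov_from_draws(b_res)))
se_noise = np.sqrt(np.diag(cov_from_draws(b_noise)))
```

Heteroskedastic or correlated noise would be handled in the noise-addition variant by drawing from the corresponding multivariate distribution instead of iid standard normals.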
detailed knowledge about the measurement noise, which is not always available. By contrast, Eq. (8) requires only the MSEC as input.

Acknowledgements

The National Institute of Standards and Technology (NIST) is thanked for making the oxygenate data available for this study. The critical remarks by Frank Schreutelkamp, Age Smilde, and two reviewers are appreciated by the author.

References

[1] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.G.M. Vandeginste, C. Sterna, Anal. Chem. 68 (1996) 3851.
[2] R. Wehrens, W.E. Van der Linden, J. Chemom. 11 (1997) 157.
[3] K. Faber, B.R. Kowalski, J. Chemom. 11 (1997) 181.
[4] B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, London, 1993.
[5] W.H. Press, B.P. Flannery, S.A. Teukolski, W.T. Vetterling, Numerical Recipes. The Art of Scientific Computing, Cambridge Univ. Press, Cambridge, 1988 (Section 14.5).
[6] R.J. Carrol, D. Ruppert, L.A. Stefanski, Measurement Error in Nonlinear Models, Chapman and Hall, London, 1995 (Chap. 4).
[7] E.P.P.A. Derks, M.S. Sánchez Pastor, L.M.C. Buydens, Chemometr. Intell. Lab. Syst. 28 (1995) 49.
[8] D.L. Duewer, B.R. Kowalski, J.L. Fasching, Anal. Chem. 48 (1976) 2002.
[9] B.K. Dable, K.S. Booksh, J. Chemom. 15 (2001) 591.
[10] F.J. del Río, J. Riu, F.X. Rius, J. Chemom. 15 (2001) 773.
[11] P.J. Brown, Measurement, Regression, and Calibration, Clarendon Press, Oxford, 1993 (Chap. 5).
[12] S. de Jong, Chemometr. Intell. Lab. Syst. 18 (1993) 251.
[13] S.D. Hodges, P.G. Moore, Appl. Stat. 21 (1972) 185.
[14] H. van der Voet, J. Chemom. 13 (1999) 195.
[15] M.C. Denham, J. Chemom. 14 (2000) 351.
[16] H. Martens, M. Martens, Food Qual. Prefer. 11 (2000) 5.
[17] R. Wehrens, H. Putter, L.M.C. Buydens, Chemometr. Intell. Lab. Syst. 54 (2000) 35.
[18] T. Fearn, Appl. Stat. 32 (1983) 73.
[19] H. Martens, T. Naes, Multivariate Calibration, Wiley, Chichester, 1989.
[20] J.S. Alper, R.I. Gelb, Talanta 40 (1993) 355.
[21] H.R. Keller, J. Röttele, H. Bartels, Anal. Chem. 66 (1994) 937.
[22] N.M. Faber, D.L. Duewer, S.J. Choquette, T.L. Green, S.N. Chesler, Anal. Chem. 70 (1998) 2972.
[23] R. Boqué, M.S. Larrechi, F.X. Rius, Chemometr. Intell. Lab. Syst. 45 (1999) 397.
[24] J.A. Fernández Pierna, L. Jin, F. Wahl, N.M. Faber, D.L. Massart, Chemometr. Intell. Lab. Syst., accepted for publication.
[25] A.C. Olivieri, J. Chemom. 16 (2002) 207.
[26] N.M. Faber, Anal. Chim. Acta 439 (2001) 193.