
Technometrics
Publication details, including instructions for authors and subscription information:
http://amstat.tandfonline.com/loi/utch20

A Statistical View of Some Chemometrics Regression Tools

Ildiko E. Frank (a) and Jerome H. Friedman (b)

(a) Jerll, Inc., Stanford, CA 94305
(b) Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305

Version of record first published: 12 Mar 2012.

To cite this article: Ildiko E. Frank & Jerome H. Friedman (1993): A Statistical View of Some Chemometrics Regression Tools, Technometrics, 35:2, 109-135.

To link to this article: http://dx.doi.org/10.1080/00401706.1993.10485033

© 1993 American Statistical Association and the American Society for Quality Control

A Statistical View of Some Chemometrics Regression Tools

Ildiko E. Frank
Jerll, Inc.
Stanford, CA 94305

Jerome H. Friedman
Department of Statistics and Stanford Linear Accelerator Center
Stanford University
Stanford, CA 94305

Chemometrics is a field of chemistry that studies the application of statistical methods to chemical data analysis. In addition to borrowing many techniques from the statistics and engineering literatures, chemometrics itself has given rise to several new data-analytical methods. This article examines two methods commonly used in chemometrics for predictive modeling, partial least squares and principal components regression, from a statistical perspective. The goal is to try to understand their apparent successes, to identify the situations in which they can be expected to work well, and to compare them with other statistical methods intended for those situations. These methods include ordinary least squares, variable subset selection, and ridge regression.

KEY WORDS: Multiple response regression; Partial least squares; Principal components regression; Ridge regression; Variable subset selection.

1. INTRODUCTION

Statistical methodology has been successfully applied to many types of chemical problems for some time. For example, experimental design techniques have had a strong impact on understanding and improving industrial chemical processes. Recently the field of chemometrics has emerged with a focus on analyzing observational data originating mostly from organic and analytical chemistry, food research, and environmental studies. These data tend to be characterized by many measured variables on each of a few observations. Often the number of such variables p greatly exceeds the observation count N. There is generally a high degree of collinearity among the variables, which are often (but not always) digitizations of analog signals.

Many of the tools employed by chemometricians are the same as those used in other fields that produce and analyze observational data and are more or less well known to statisticians. These tools include data exploration through principal components and cluster analysis, as well as modern computer graphics. Predictive modeling (regression and classification) is also an important goal in most applications. In this area, however, chemometricians have invented their own techniques based on heuristic reasoning and intuitive ideas, and there is a growing body of empirical evidence that they perform well in many situations. The most popular regression method in chemometrics is partial least squares (PLS) (H. Wold 1975) and, to a somewhat lesser extent, principal components regression (PCR) (Massy 1965). Although PLS is heavily promoted (and used) by chemometricians, it is largely unknown to statisticians. PCR is known to, but seldom recommended by, statisticians. [The Journal of Chemometrics (John Wiley) and Chemometrics and Intelligent Laboratory Systems (Elsevier) contain many articles on regression applications to chemical problems using PCR and PLS. See also Martens and Naes (1989).]

The original ideas motivating PLS and PCR were entirely heuristic, and their statistical properties remain largely a mystery. There has been some recent progress with respect to PLS (Helland 1988; Lorber, Wangen, and Kowalski 1987; Phatak, Reilly, and Penlidis 1991; Stone and Brooks 1990). The purpose of this article is to view these procedures from a statistical perspective, attempting to gain some insight as to when and why they can be expected to work well. In situations for which they do perform well, they are compared to standard statistical methodology intended for those situations. These include ordinary least squares (OLS) regression, variable subset selection (VSS) methods, and ridge regression (RR) (Hoerl and Kennard 1970). The goal is to bring all of these methods together into a common framework to attempt to shed some light on their similarities and differences. The characteristics of PLS in particular have so far eluded theoretical understanding. This has led to unsubstantiated claims concerning its performance relative to other regression procedures, such as that it makes fewer assumptions concerning the nature of the data. Simply not understanding the nature of the assumptions being made does not mean that they do not exist.

Space limitations force us to limit our discussion here to methods that so far have seen the most use in practice. There are many other suggested approaches [e.g., latent root regression (Hawkins 1973; Webster, Gunst, and Mason 1974), intermediate least squares (Frank 1987), James-Stein shrinkage (James and Stein 1961), and various Bayes and empirical Bayes methods] that, although potentially promising, have not yet seen wide applications.

1.1 Summary Conclusions

RR, PCR, and PLS are seen in Section 3 to operate in a similar fashion. Their principal goal is to shrink the solution coefficient vector away from the OLS solution toward directions in the predictor-variable space of larger sample spread. Section 3.1 provides a Bayesian motivation for this under a prior distribution that provides no information concerning the direction of the true coefficient vector: all directions are equally likely to be encountered. Shrinkage away from low spread directions is seen to control the variance of the estimate. Section 3.2 examines the relative shrinkage structure of these three methods in detail. PCR and PLS are seen to shrink more heavily away from the low spread directions than RR, which provides the optimal shrinkage (among linear estimators) for an equidirection prior. Thus PCR and PLS make the assumption that the truth is likely to have particular preferential alignments with the high spread directions of the predictor-variable (sample) distribution. A somewhat surprising result is that PLS (in addition) places increased probability mass on the true coefficient vector aligning with the Kth principal component direction, where K is the number of PLS components used, in fact expanding the OLS solution in that direction. The solutions and hence the performance of RR, PCR, and PLS tend to be quite similar in most situations, largely because they are applied to problems involving high collinearity in which variance tends to dominate the bias, especially in the directions of small predictor spread, causing all three methods to shrink heavily along those directions. In the presence of more symmetric designs, larger differences between them might well emerge.

The most popular method of regression regularization used in statistics, VSS, is seen in Section 4 to make quite different assumptions. It is shown to correspond to a limiting case of a Bayesian procedure in which the prior probability distribution places all mass on the original predictor-variable (coordinate) axes. This leads to the assumption that the response is likely to be influenced by a few of the predictor variables but leaves unspecified which ones. It will therefore tend to work best in situations characterized by true coefficient vectors with components consisting of a very few (relatively) large (absolute) values.

Section 5 presents a simulation study comparing the performance of OLS, RR, PCR, PLS, and VSS in a variety of situations. In all of the situations studied, RR dominated the other methods, closely followed by PLS and PCR, in that order. VSS provided distinctly inferior performance to these but still considerably better than OLS, which usually performed quite badly.

Section 6 examines multiple-response regression, investigating the circumstances under which considering all of the responses together as a group might lead to better performance than a sequence of separate regressions of each response individually on the predictors. Two-block multiresponse PLS is analyzed. It is seen to bias the solution coefficient vectors away from low spread directions in the predictor-variable space (as would a sequence of separate PLS regressions) but also toward directions in the predictor space that preferentially predict the high spread directions in the response-variable space. An (empirical) Bayesian motivation for this behavior is developed by considering a joint prior on all of the (true) coefficient vectors that provides information on the degree of similarity of the dependence of the responses on the predictors (through the response correlation structure) but no information as to the particular nature of those dependences. This leads to a multiple-response analog of RR that exhibits similar behavior to that of two-block PLS. The two procedures are compared in a small simulation study in which multiresponse ridge slightly outperformed two-block PLS. Surprisingly, however, neither did dramatically better than the corresponding uniresponse procedures applied separately to the individual responses, even though the situations were designed to be most favorable to the multiresponse methods.

Section 7 discusses the invariance properties of these regression procedures. Only OLS is equivariant under all nonsingular affine (linear) transformations (rotation and/or scaling) of the variable axes. RR, PCR, and PLS are equivariant under rotation but not scaling. VSS is equivariant under scaling but not rotation. These properties are seen to follow from the nature of the (informal) priors and loss structures associated with the respective procedures.

Finally, Section 8 provides a short discussion of interpretability issues.


2. REGRESSION

Regression analysis on observational data forms a major part of chemometric studies. As in statistics, the goal is to model the predictive relationships of a set of q response variables y = {y_1 ... y_q} on a set of p predictor variables x = {x_1 ... x_p} given a set of N (training) observations

(y_i, x_i)^T = (y_{1i} ... y_{qi}, x_{1i} ... x_{pi})^T,  i = 1, ..., N,   (1)

on which all of the variables have been measured. This model is then used both as a descriptive statistic for interpreting the data and as a prediction rule for estimating likely values of the response variables when only values of the predictor variables are available. The structural form of the predictive relationship is taken to be linear:

y_j = a_{j0} + Σ_{k=1}^p a_{jk} x_k,  j = 1, ..., q.   (2)

The problem then is to use the training data (1) to estimate the values of the coefficients {a_{jk}}_{k=0}^p appearing in Model (2).

In nearly all chemometric analyses, the variables are standardized ("autoscaled"):

y_j ← (y_j − ȳ_j)/[ave(y_j − ȳ_j)²]^{1/2}
x_k ← (x_k − x̄_k)/[ave(x_k − x̄_k)²]^{1/2},   (3)

with

ȳ_j = ave(y_j),  x̄_k = ave(x_k),   (4)

where the averages are taken over the training data (1); that is,

ave(η) = (1/N) Σ_{i=1}^N η_i,

where η is the quantity being averaged. (This notational convention will be used throughout the article.) The analysis is then applied to the standardized variables and the resulting solutions transformed back to reference the original locations and scales of the variables. The regression methods discussed later are always assumed to include constant terms (2), thus making them invariant with respect to the variable locations, so that translating them to all have zero means is simply a matter of convenience (or numerics). Most of these methods are not, however, invariant to the relative scaling of the variables, so that choosing them to all have the same scale is a deliberate choice on the part of the user. A different choice would give rise to different estimated models. This is discussed further in Section 7.

After autoscaling the training data, the regression models (2) (on the training data) can be expressed as

y_j = a_j^T x,  j = 1, ..., q,   (5)

with the jth coefficient vector being a_j^T = (a_{j1} ... a_{jp}), or in matrix notation

y = Ax   (6)

with the q × p matrix of regression coefficients being

A = [a_{jk}].   (7)

The dominant regression methods used in chemometrics are PCR and PLS. The corresponding methods most used by statisticians (in practice) are OLS, RR, and VSS. The goal of this article is to compare and contrast these methods in an attempt to identify their similarities and differences. The next section starts with brief descriptions of PCR, PLS, and RR. (It is assumed that the reader is familiar with OLS and the various implementations of VSS.) We consider first the case of only one response variable (q = 1), since most of their similarities and differences emerge in this simplified setting. Multivariate regression (q > 1) is discussed in Section 6.

2.1 Principal Components Regression

PCR (Massy 1965) has been in the statistical literature for some time, although it has seen relatively little use compared to OLS and VSS. It begins with the training-sample covariance matrix of the predictor variables

V = ave(xx^T)   (8)

and its eigenvector decomposition

V = Σ_{k=1}^p e_k² v_k v_k^T.   (9)

Here {e_k²}_1^p are the eigenvalues of V arranged in descending order (e_1² ≥ e_2² ≥ … ≥ e_p²) and {v_k}_1^p their corresponding eigenvectors. PCR produces a sequence of regression models {ŷ_0 … ŷ_R} with

ŷ_K = Σ_{k=1}^K [ave(y v_k^T x)/e_k²] v_k^T x,  K = 1, ..., R,   (10)

with R being the rank of V (number of nonzero e_k²). The Kth model (10) is just the OLS regression of y on the "variables" {z_k = v_k^T x}_1^K, with the convention that for K = 0 the model is just the response mean, ŷ_0 = 0 (3). The goal of PCR is to choose the particular model ŷ_{K*} with the lowest prediction mean squared error

K* = argmin_{0≤K≤R} ave*(y − ŷ_K)²,   (11)

where ave* denotes the average over future data, not part of the training sample. The quantity K can thus be considered a metaparameter of the procedure whose value is to be estimated from the training data through some model-selection procedure.
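The PCR sequence (8)-(10), together with its selection by cross-validation [Eq. (12) below], is easy to sketch in code. The following is a minimal numpy illustration of ours, not code from the article: it assumes autoscaled data, and for brevity it autoscales once on the full sample rather than re-scaling within each cross-validation fold, as a more careful implementation would.

```python
import numpy as np

def autoscale(X, y):
    """Standardize ("autoscale") each variable to mean 0 and variance 1, as in (3)-(4)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # population std matches the ave() convention
    ys = (y - y.mean()) / y.std()
    return Xs, ys

def pcr_sequence(X, y):
    """Return the PCR coefficient vectors a_K for K = 0, 1, ..., R on autoscaled data, per (8)-(10)."""
    N, p = X.shape
    V = X.T @ X / N                          # V = ave(x x^T), eq. (8)
    evals, evecs = np.linalg.eigh(V)
    order = np.argsort(evals)[::-1]          # eigenvalues e_k^2 in descending order, eq. (9)
    e2, v = evals[order], evecs[:, order]
    s = X.T @ y / N                          # s = ave(y x)
    R = int(np.sum(e2 > 1e-12 * e2[0]))      # rank of V
    coefs = [np.zeros(p)]                    # K = 0: the model is just the response mean (0 here)
    a = np.zeros(p)
    for k in range(R):                       # add the univariate regression on z_k = v_k^T x
        a = a + ((v[:, k] @ s) / e2[k]) * v[:, k]
        coefs.append(a.copy())
    return coefs

def choose_K_by_cv(X, y):
    """Ordinary leave-one-out cross-validation, as in (12), over the PCR sequence."""
    N = X.shape[0]
    press = None
    for i in range(N):
        keep = np.arange(N) != i
        errs = np.array([(y[i] - X[i] @ a) ** 2 for a in pcr_sequence(X[keep], y[keep])])
        press = errs if press is None else press[:len(errs)] + errs[:len(press)]
    return int(np.argmin(press))

# toy usage on synthetic collinear data (two latent factors drive eight predictors)
rng = np.random.default_rng(0)
N, p = 50, 8
latent = rng.normal(size=(N, 2))
X = latent @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=N)
Xs, ys = autoscale(X, y)
print("selected number of components K:", choose_K_by_cv(Xs, ys))
```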


In chemometrics, model selection is nearly always done through ordinary cross-validation (CV) (Stone 1974),

K̂ = argmin_{0≤K≤R} Σ_{i=1}^N (y_i − ŷ_{K,i})²,   (12)

where ŷ_{K,i} is the Kth model (10) computed on the training sample with the ith observation removed. There are many other model-selection criteria in the statistics literature [e.g., generalized cross-validation (Craven and Wahba 1979), minimum descriptive length (Rissanen 1983), Bayesian information criterion (Schwarz 1978), Mallows's Cp (Mallows 1973), etc.] that can also be used. (A discussion of their relative merits is outside the scope of this article.)

2.2 Partial Least Squares Regression

PLS was introduced by Wold (H. Wold 1975) and has been heavily promoted in the chemometrics literature as an alternative to OLS in the poorly conditioned or ill-conditioned problems encountered there. It was presented in algorithmic form as a modification of the NIPALS algorithm (H. Wold 1966) for computing principal components. Like PCR, PLS produces a sequence of models {ŷ_K}_1^R [R = rank V (8)] and estimates which one is best through CV (12). The particular set of models constituting the (ordered) sequence are, however, different from those produced by PCR. Wold's PLS algorithm is presented in Table 1. [To simplify the description, random-variable notation is adopted; that is, a single symbol is used to represent the collection of values (scalar or vector) of the corresponding quantity over the data, and the observation index is omitted. This convention is used throughout the article.]

Table 1. Wold's PLS Algorithm

(1) Initialize: y_0 ← y; x_0 ← x; ŷ_0 ← 0
(2) For K = 1 to p do:
(3)   w_K = ave(y_{K−1} x_{K−1})
(4)   z_K = w_K^T x_{K−1}
(5)   r_K = [ave(y_{K−1} z_K)/ave(z_K²)] z_K
(6)   ŷ_K = ŷ_{K−1} + r_K
(7)   y_K = y_{K−1} − r_K
(8)   x_K = x_{K−1} − [ave(z_K x_{K−1})/ave(z_K²)] z_K
(9)   if ave(x_K^T x_K) = 0 then Exit
(10) end For

At each step K (For loop pass, lines 2-10) the y residuals from the previous step (y_{K−1}) are partially regressed on the x residuals from the previous step (x_{K−1}). In the beginning (line 1) these residuals are initialized to the original (standardized) data. The partial regression consists of computing the covariance vector w_K (line 3) and then using it to form a linear combination z_K of the x residuals (line 4). The y residuals are then regressed on this linear combination (line 5), and the result is added to the model (line 6) and subtracted from the current y residuals to form the new y residuals y_K (line 7) for the next step. New x residuals (x_K) are then computed (line 8) by subtracting from x_{K−1} its projection on z_K. The test (line 9) will cause the algorithm to terminate after R steps, where R is the rank of V (8).

This PLS algorithm produces a sequence of models ŷ_K (line 1 and line 6) on successive passes through the For loop. The one (ŷ_K̂) that minimizes the CV score (12) is selected as the PLS solution. Note that straightforward application of many of the competing model-selection criteria is not appropriate here since, unlike PCR and RR, PLS is not a linear modeling procedure; that is, the response values {y_i}_1^N enter nonlinearly into the model estimates {ŷ_i}_1^N.

The algorithm in Table 1 is the one first proposed by Wold that defined PLS regression. Since its introduction, several different algorithms have been proposed that lead to the same sequence of models {ŷ_K}_1^R (e.g., see Naes and Martens 1985; Wold, Ruhe, Wold, and Dunn 1984). Perhaps the most elegant formulation (Helland 1988) is shown in Table 2. Table 2 shows that the Kth PLS model ŷ_K can be obtained by an OLS regression (line 5) of the response y on the K linear combinations {z_k = (V^{k−1}s)^T x}_1^K.

Table 2. Helland's PLS Algorithm

(1) V = ave(xx^T)
(2) s = ave(yx)
(3) For K = 1 to R do:
(4)   s_K = V^{K−1} s
(5)   ŷ_K = OLS[y on {s_k^T x}_1^K]
(6) end For
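Table 1 translates almost line for line into code. The sketch below is our own rendering of the reconstructed algorithm, not code from the article; it assumes autoscaled inputs and the ave(·) = (1/N)Σ convention used throughout.

```python
import numpy as np

def pls_wold(X, y, max_components=None):
    """Wold's PLS algorithm (Table 1) on autoscaled data.

    Returns the fitted values yhat_K for K = 1, 2, ..., one array per pass
    through the For loop, mirroring lines (1)-(10) of the table.
    """
    N, p = X.shape
    y_res = y.copy()                       # line 1: y_0 <- y
    X_res = X.copy()                       #         x_0 <- x
    yhat = np.zeros(N)                     #         yhat_0 <- 0
    fits = []
    for K in range(max_components or p):   # line 2
        w = X_res.T @ y_res / N            # line 3: w_K = ave(y_{K-1} x_{K-1})
        z = X_res @ w                      # line 4: z_K = w_K^T x_{K-1}
        zz = z @ z
        if zz <= 1e-12:                    # line 9's rank test, checked here to avoid dividing by 0
            break
        r = ((y_res @ z) / zz) * z         # line 5: regress the y residuals on z_K
        yhat = yhat + r                    # line 6: add the result to the model
        fits.append(yhat.copy())
        y_res = y_res - r                  # line 7: new y residuals
        X_res = X_res - np.outer(z, (z @ X_res) / zz)   # line 8: remove the projection on z_K
    return fits                            # line 10: end For; pick K by the CV score (12)
```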
Table 1. Weld’s PLS Algorithm I, = (V + AI)-%, (14)

(1) Initialize: y0 + y; X0 +-x; V0 +- 0


(2) For K = 1 top do:
(3) w, = ave(yKm,xK-,) Table 2. Helland’s PLS Algorithm
(4) 2, = WiX&,
(5) r, = [ave(v,-,z,)lave(z~)lz, (1) V = ave(xxT)
(6) PK=fK-,+rK (2) s = ave(yx)
(7) yK = yKml - r, (3) For K = 1 to R do:
(8) x, = xx-, - [ave(z,x,- ,)lave(zX)lz, (4) s, = VK-‘s
(9) if ave(xEJ = 0 then Exit (5) VK = OLS[y on {s,Tx}T]
(IO) end For (6) end For

TECHNOMETRICS, MAY 1993, VOL. 35, NO. 2
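Equations (13)-(15) amount to a single linear solve once V and s are in hand. The following sketch is ours (not the article's), again assuming autoscaled data; λ would be chosen by CV or another model-selection criterion, as discussed below.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Ridge solution a_lam = (V + lam * I)^{-1} s of (13)-(15) on autoscaled data."""
    N, p = X.shape
    V = X.T @ X / N              # V = ave(x x^T), eq. (8)
    s = X.T @ y / N              # s = ave(y x),   eq. (15)
    return np.linalg.solve(V + lam * np.eye(p), s)
```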


The inverse of the (possibly) ill-conditioned predictor-variable covariance matrix V is thus stabilized by adding to V a multiple of I. The degree of stabilization is regulated by the value of the "ridge" parameter λ > 0. A value of λ = ∞ results in the model being the response mean ŷ = 0, whereas λ = 0 gives rise to the unregularized OLS estimates. A value for λ in any particular situation is generally obtained by considering it to be a meta parameter of the procedure and estimating it through some model-selection procedure such as CV. Since here the response values {y_i}_1^N do enter linearly in the model estimates {ŷ_i}_1^N, any of the competing model-selection criteria can also be straightforwardly applied (see Golub, Heath, and Wahba 1979).

3. A COMPARISON OF PCR, PLS, AND RR

From their preceding algorithmic descriptions, it might appear that PCR, PLS, and RR are very different procedures leading to quite different model estimates. In this section we provide a heuristic comparison that suggests that they are, in fact, quite similar, in that they are all attempting to achieve the same operational goal in slightly different ways. That goal is to bias the solution coefficient vector a (5) away from directions for which the projected sample predictor variables have small spread; that is,

var(a^T x/|a|) = ave(a^T x/|a|)² = small,   (16)

where the average is over the training sample.

This comparison consists of regarding the regression procedure as a two-step process, as in VSS (Stone and Brooks 1990): first a K-dimensional subspace of p-dimensional Euclidean space is defined, and then the regression is performed under the restriction that the coefficient vector a lies in that subspace,

a = Σ_{k=1}^K a_k c_k,   (17)

where the unit vectors {c_k}_1^K span the prescribed subspace with c_k^T c_k = 1. The regression procedures can be compared by the way in which they define the subspace {c_k}_1^K and the manner in which the (constrained) regression is performed.

First, consider OLS in this setup. Here the subspace is defined by the (single) unit vector that maximizes the sample correlation (squared) between the response and the corresponding linear combination of the predictor variables,

c_OLS = argmax_{c^T c = 1} corr²(y, c^T x);   (18)

the OLS solution is then a simple least squares regression of y on c_OLS^T x,

ŷ_OLS = [ave(y c_OLS^T x)/ave(c_OLS^T x)²] c_OLS^T x.   (19)

RR can also be cast into this framework. As in OLS, the subspace is defined by a single unit vector, but the criterion that defines that vector is somewhat different:

c_RR = argmax_{c^T c = 1} corr²(y, c^T x) · var(c^T x)/[var(c^T x) + λ],   (20)

where λ is the ridge parameter [(13)-(14)]. The ridge solution is then taken to be a (shrinking) ridge regression of y on c_RR^T x with the same value for the ridge parameter,

ŷ_RR = [ave(y c_RR^T x)/(ave(c_RR^T x)² + λ)] c_RR^T x.   (21)

(See Appendix.)

PCR defines a sequence of K-dimensional subspaces, each spanned by the first K eigenvectors (9) of V (8). Thus each c_k (1 ≤ k ≤ R) is the solution

c_k(PCR) = argmax_{{c^T V c_l = 0}_{l=1}^{k−1}, c^T c = 1} var(c^T x).   (22)

The first constraint in (22) (V orthogonality) ensures that the linear combinations associated with the different solution vectors are uncorrelated over the training sample,

corr(c_k^T x, c_l^T x) = 0,  k ≠ l.   (23)

As a consequence of this and the criterion (22), they also turn out to be orthogonal, c_k^T c_l = 0, k ≠ l. The Kth PCR model is given by a least squares regression of the response on the K linear combinations {c_k^T x}_1^K. Since they are uncorrelated (23), this reduces to the sum of univariate regressions on each one (10).

PLS regression also produces a sequence of K-dimensional subspaces spanned by successive unit vectors, and then the Kth PLS solution is obtained by a least squares fit of the response onto the corresponding K linear combinations in a strategy similar to PCR. The only difference from PCR is in the criterion used to define the vectors that span the K-dimensional subspace and hence the corresponding linear combinations. The criterion that gives rise to PLS (Stone and Brooks 1990) is

c_k(PLS) = argmax_{{c^T V c_l = 0}_{l=1}^{k−1}, c^T c = 1} corr²(y, c^T x) var(c^T x).   (24)

As with PCR, the vectors c_k(PLS) are constrained to be mutually V orthogonal so that the corresponding linear combinations are uncorrelated over the training sample (23).
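The criteria (18), (20), and (24) can be checked numerically. The closed-form maximizing directions used below (c_OLS ∝ V^{−1}s, c_RR ∝ (V + λI)^{−1}s, and the first PLS direction ∝ s) are standard consequences of those criteria for autoscaled data rather than statements made in the text above, and the whole sketch is our own illustration: each closed-form direction should score at least as high on its own criterion as any random unit vector.

```python
import numpy as np

def crit_ols(c, V, s):                     # corr^2(y, c^T x) up to the var(y) = 1 normalization, eq. (18)
    return (c @ s) ** 2 / (c @ V @ c)

def crit_rr(c, V, s, lam):                 # corr^2(y, c^T x) var(c^T x) / [var(c^T x) + lam], eq. (20)
    return (c @ s) ** 2 / (c @ V @ c + lam)

def crit_pls(c, V, s):                     # corr^2(y, c^T x) var(c^T x), eq. (24)
    return (c @ s) ** 2

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
N, p, lam = 200, 5, 0.5
latent = rng.normal(size=(N, 2))           # collinear predictors driven by two latent factors
X = latent @ rng.normal(size=(2, p)) + 0.3 * rng.normal(size=(N, p))
X = (X - X.mean(0)) / X.std(0)
y = X @ rng.normal(size=p) + rng.normal(size=N)
y = (y - y.mean()) / y.std()
V, s = X.T @ X / N, X.T @ y / N

c_ols = unit(np.linalg.solve(V, s))
c_rr = unit(np.linalg.solve(V + lam * np.eye(p), s))
c_pls1 = unit(s)

random_dirs = [unit(rng.normal(size=p)) for _ in range(2000)]
checks = [
    ("c_OLS  ~ V^{-1} s",           c_ols,  lambda c: crit_ols(c, V, s)),
    ("c_RR   ~ (V + lam I)^{-1} s", c_rr,   lambda c: crit_rr(c, V, s, lam)),
    ("first PLS direction ~ s",     c_pls1, lambda c: crit_pls(c, V, s)),
]
for name, c, crit in checks:
    print(name, crit(c) >= max(crit(cr) for cr in random_dirs))
```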


This causes the K-dimensional least squares fit to be equivalent to the sum of K univariate regressions on each linear combination separately, as with PCR. Unlike PCR, however, the {c_k(PLS)}_1^K are not orthogonal owing to the different criterion (24) used to obtain them.

The OLS criterion (18) is invariant to the scale of the linear combination c^T x and gives an unbiased estimate of the coefficient vector and hence the regression model [(18)-(19)]. The criteria associated with RR (20), PCR (22), and PLS (24) all involve the scale of c^T x through its sample variance, thereby producing biased estimates. The effect of this bias is to pull the solution coefficient vector away from the OLS solution toward directions in which the projected data (predictors) have larger spread. The degree of this bias is regulated by the value of the model-selection parameter.

For RR, setting λ = 0 [(20)-(21)] yields the unbiased OLS solution, whereas λ > 0 introduces increasing bias toward larger values of var(c^T x) (20) and increased shrinkage of the length of the solution coefficient vector (21). For small values of λ, the former effect is the most pronounced; for example, for λ > 0 the RR solution will have no projection in any subspace for which var(c^T x) = 0, and very little projection on subspaces for which it is small.

In PCR, the degree of bias is controlled by the value of K, the dimension of the constraining subspace spanned by {c_k(PCR)}_1^K (22), that is, the number of components K used (10). If K = R [rank of V (8)], one obtains an unbiased OLS solution. For K < R, bias is introduced. The smaller the value of K, the larger the bias. As with RR, the effect of this bias is to draw the solution toward larger values of var(c^T x), where c is a unit vector in the direction of the solution coefficient vector a (5) (c = a/|a|). This is because constraining c to lie in the subspace spanned by the first K eigenvectors of V [(8)-(9)] places a lower bound on the sample variance of c^T x,

var(c^T x) ≥ e_K².   (25)

Since the eigenvectors (and hence the subspaces) are ordered on decreasing values of e_k², increasing K has the effect of easing this restriction, thereby reducing the bias.

For PLS, the situation is similar to that of PCR. The degree of bias is regulated by K, the number of components used. For K = R, an unbiased OLS solution is produced. Decreasing K generally increases the degree of bias. An exception to this occurs when V = I (totally uncorrelated predictor variables), in which case an unbiased OLS solution is reached for K = 1 and remains the same for all K (though for K ≥ 2 the regressions are singular, all of the regressors being identical). This can be seen from the PLS criterion (24). In this case, var(c^T x) = 1 for all c, and the PLS criterion reduces to that for OLS (18). With this exception, the effect of decreasing K is to attract the solution coefficient vector toward larger values of var(c^T x) as in PCR. For a given K, however, the degree of this attraction depends jointly on the covariance structure of the predictor variables and the OLS solution, which in turn depends on the sample response values. This fact is often presented as an argument in favor of PLS over PCR. Unlike PCR, there is no sharp lower bound on var(c^T x) for a given K. The behavior of PLS compared to PCR for changing K is examined in more detail in Section 3.2.

3.1 Bayesian Motivation

Inspection of the criteria used by RR (20), PCR (22), and PLS (24) shows that they all can be viewed as applying a penalty to the OLS criterion (18), where the penalty increases as var(c^T x) decreases. A natural question to ask is: Under what circumstances should this lead to improved performance over OLS? It is well known (James and Stein 1961) that OLS is inadmissible in that one can always achieve a lower mean squared estimation error with biased estimates. The important question is: When can these estimators substantially improve performance and which one can do it best?

Some insight into these questions can be provided by considering a (highly) idealized situation. Suppose that in reality

y = α^T x + ε   (26)

for some (true) coefficient vector α, where ε is an additive (iid) homoscedastic error with zero expectation and variance σ²,

E(ε) = 0,  E(ε²) = σ².   (27)

Since all of the estimators being considered here are equivariant with respect to rotations in the predictor-variable space (after standardization), we will consider (for convenience) the coordinate system in which the predictor variables are uncorrelated; that is,

V = diag(e_1² … e_p²).   (28)

Let a be an estimate of α (26); that is,

ŷ(x) = a^T x   (29)

for a given point x in the predictor space (not necessarily one of the training-sample points). Consider training samples for which the (sample) predictor covariance matrix V has the eigenvalues (28). The mean squared error (MSE) of prediction at x is

MSE[ŷ(x)] = E_ε[α^T x − a^T x]²,   (30)

with the expected value over the distribution of the errors ε (26).
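A small Monte Carlo check of this setup may help fix ideas. It is our own illustration (the numbers p, N, σ, and the eigenvalues are arbitrary choices): with uncorrelated predictors of variance e_j², the OLS coefficient along direction j fluctuates around α_j with sampling variance close to σ²/(N e_j²), the quantity that appears as the variance term of (35) below, so it is the low-spread directions that are estimated with high variance.

```python
import numpy as np

rng = np.random.default_rng(2)
p, N, sigma = 4, 100, 0.5
e2 = np.array([4.0, 1.0, 0.25, 0.04])          # eigenvalues of V = diag(e_1^2 ... e_p^2), eq. (28)
alpha = np.ones(p)                             # a fixed "true" coefficient vector, eq. (26)

reps = 4000
a_hat = np.empty((reps, p))
for r in range(reps):
    X = rng.normal(size=(N, p)) * np.sqrt(e2)  # predictors with population covariance diag(e2)
    y = X @ alpha + sigma * rng.normal(size=N) # eq. (26)-(27)
    a_hat[r] = np.linalg.solve(X.T @ X, X.T @ y)

print("empirical var(a_hat_j): ", a_hat.var(axis=0).round(4))
print("sigma^2 / (N e_j^2):    ", (sigma ** 2 / (N * e2)).round(4))
```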


Since α (the truth) is unknown, the MSE (at x) for any particular estimator is also unknown. One can, however, consider various (prior) probability distributions π on α and compare the properties of different estimators when the relative probabilities of encountering situations for which a particular α (26) occurs is given by that distribution. For a given π, the mean squared prediction error averaged over the situations it represents is

E_π E_ε[α^T x − a^T x]².   (31)

A simple and relatively unrestrictive prior probability distribution is one that considers all coefficient vector directions α/|α| equally likely; that is, the prior distribution depends only on the norm |α|² = α^T α,

π(α) = π(α^T α).   (32)

For this exercise, we will consider simple linear shrinkage estimates of the form

a_j = f_j â_j,  j = 1, ..., p,   (33)

where â_j is the OLS estimate and the {f_j}_1^p are shrinkage factors taken to be independent of the sample response values. In this case, the mean squared prediction error becomes [(33) and (31)]

MSE[ŷ(x)] = E_ε[Σ_{j=1}^p (α_j − f_j â_j) x_j]².   (34)

Averaging over α using the probability distribution given by (32) [taking advantage of the fact that E_π(αα^T) = I · E_π|α|²/p, with I being the identity matrix] yields

MSE_π[ŷ(x)] = Σ_{j=1}^p [(1 − f_j)² E_π|α|²/p + f_j² σ²/(N e_j²)] x_j².   (35)

Here (35) E_π|α|² is the expected value of the squared length of the coefficient vector α under the prior (32), p is the number of predictor variables, σ² is the variance of the error term [(26)-(27)], N is the training-sample size, and {e_j²}_1^p are the eigenvalues of the (sample) predictor-variable covariance matrix (28), which in this case are the sample variances of the predictor variables due to our choice of coordinate system (28).

The two terms within the brackets (35) that contribute to the MSE at x have separate interpretations. The first term depends on (the distribution of the) truth (α) and is independent of the error variance or the predictor-variable distribution. It represents the bias (squared) of the estimate. The second term is independent of the nature of the true coefficient vector α and depends only on the experimental situation (error variance and predictor-design sample). It is the variance of the estimate. Setting {f_j = 1}_1^p (33) yields the least squares estimates, which are unbiased but have variance given by the second term in (35). Reducing any (or all) of the {f_j}_1^p to a value less than 1 causes an increase in bias [first term (35)] but decreases the variance [second term (35)]. This is the usual bias-variance trade-off encountered in nearly all estimation settings. [Setting any (or all) of the {f_j}_1^p to a value greater than 1 increases both the bias squared and the variance.]

This expression (35) for the MSE (in a simplified setting) illustrates the important fact that justifies the qualitative behavior of RR, PCR, and PLS discussed previously, namely, the shrinking of the solution coefficient vector away from directions of low (sample) variance in the predictor-variable space. One sees from the second term in (35) that the contribution to the variance of the model estimate from a given (eigen) direction (x_j) is inversely proportional to the sample predictor variance e_j² associated with that direction. Directions with small spread in the predictor variables give rise to high variance in the model estimate.

The values of {f_j}_1^p that minimize the MSE (35) are

f_j* = e_j²/(e_j² + λ),  j = 1, ..., p,   (36)

with

λ = p(σ²/E_π|α|²)/N.   (37)

The quantity λ [(36)-(37)] is the number of (predictor) variables times the square of the noise-to-signal ratio, divided by the training-sample size. Combining (33), (36), and (37) gives the optimal (minimal MSE) linear shrinkage estimates

a_j = â_j · e_j²/(e_j² + λ),  j = 1, ..., p.   (38)

One sees that the unbiased OLS estimates {â_j}_1^p are differentially shrunk, with the relative amount of shrinkage increasing with decreasing predictor-variable spread e_j. The amount of differential shrinkage is controlled by the quantity λ (37): the larger the value of λ, the more differential shrinkage, as well as more overall global shrinkage. The value of λ in turn is given by the inverse product of the signal/noise squared and the training-sample size.

It is important to note that this high relative shrinkage in directions of small spread in the (sample) predictor-design distribution enters only to control the variance and not because of any prior belief that the true coefficient vector α (26) is likely to align with the high spread directions of the predictor design. The prior distribution on α, π(α) (32), that leads to this result (38) places equal mass on all directions α/|α| and by definition has no preferred directions for the truth.
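For completeness, the route from (33) to (35)-(38) can be written out in a few lines; the derivation below is our reconstruction of the standard bias-variance argument, not text from the article.

```latex
% In the coordinate system (28), \hat\alpha_j = \mathrm{ave}(y x_j)/e_j^2, so under (26)-(27)
\hat\alpha_j = \alpha_j + \frac{\mathrm{ave}(\varepsilon x_j)}{e_j^2},
\qquad
E_\varepsilon \hat\alpha_j = \alpha_j,
\qquad
\mathrm{var}_\varepsilon(\hat\alpha_j) = \frac{\sigma^2 e_j^2}{N e_j^4} = \frac{\sigma^2}{N e_j^2}.
% For the linear shrinkage estimate a_j = f_j \hat\alpha_j (33), direction j contributes
E_\pi E_\varepsilon\bigl[(\alpha_j - f_j\hat\alpha_j)\,x_j\bigr]^2
  = \Bigl[(1-f_j)^2\,E_\pi\alpha_j^2 + f_j^2\,\tfrac{\sigma^2}{N e_j^2}\Bigr]\,x_j^2 ,
% and the equidirection prior (32) gives E_\pi\alpha_j^2 = E_\pi|\alpha|^2/p, which summed over j
% is exactly (35).  Minimizing the bracket over f_j yields (36)-(37):
f_j^{\ast} = \frac{E_\pi\alpha_j^2}{E_\pi\alpha_j^2 + \sigma^2/(N e_j^2)}
           = \frac{e_j^2}{e_j^2 + \lambda},
\qquad
\lambda = \frac{p\,\sigma^2}{N\,E_\pi|\alpha|^2}.
```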


Therefore, one can at least qualitatively conclude that the common property of RR, PCR, and PLS of shrinking their solutions away from low spread directions mainly serves to reduce the variance of their estimates, and this is what gives them generally superior performance to OLS. The results given by (35), (37), and (38) indicate that their degree of improvement (over OLS) will increase with decreasing signal-to-noise ratio and training-sample size and increasing collinearity as reflected by the disparity in the eigenvalues (28) of the predictor-variable covariance matrix [(8)-(9)].

It is well known that (38) is just RR as expressed in the coordinate system defined by the eigenvectors of the sample predictor-variable covariance matrix [(8)-(9)]. Thus these results show (again well known) that RR is a linear shrinkage estimator that is optimal (in the sense of MSE) among all linear shrinkage estimators for the prior π(α) assumed here (32) and λ (37) known. PCR is also a linear shrinkage estimator,

a_j(PCR) = â_j · I(e_j² − e_K²),   (39)

where K is the number of components used and the second factor I(·) takes the value 1 for nonnegative argument values and 0 otherwise. Thus RR dominates PCR for an equidirection prior (32). PLS is not a linear shrinkage estimator, so RR cannot be shown to dominate PLS through this argument.

3.2 Shrinking Structure

One way to attempt to gain some insight into the relative properties of RR, PCR, and PLS is to examine their respective shrinkage structures in various situations. This can be done by expanding their solutions in terms of the eigenvectors of the predictor-sample covariance matrix [(8)-(9)] and the OLS estimate â:

a(RR : PCR : PLS) = Σ_{j=1}^p f_j(RR : PCR : PLS) â_j v_j.   (40)

Here â_j is the projection of the OLS solution on v_j (the jth eigenvector of V),

â_j = ave(y v_j^T x)/e_j²,   (41)

and {f_j(·)}_1^p can be regarded as a set of factors along each of these eigendirections that scale the OLS solution for each of the respective methods. As shown in (36) and (39), f_j(RR) = e_j²/(e_j² + λ) and

f_j(PCR) = 1,  e_j² ≥ e_K²
         = 0,  e_j² < e_K²,   (42)

both of which are linear in that they do not involve the sample response values {y_i}_1^N.

The corresponding scale factors for PLS are not linear in the response values. For a K-component solution, they can be expressed as

f_j(PLS) = Σ_{k=1}^K η_k e_j^{2k},   (43)

where the vector η = {η_k}_1^K is given by η = W^{−1} w, with the K components of the vector w being

w_k = Σ_{j=1}^p â_j² e_j^{2(k+1)},

and the elements of the K × K matrix W given by

W_{kl} = Σ_{j=1}^p â_j² e_j^{2(k+l+1)}.

They depend on the number of components K used and the eigenstructure {e_j²}_1^p (as do the factors for RR and PCR), but not in a simple way. They also depend on the OLS solution {â_j}_1^p, which in turn depends on the response values {y_i}_1^N. The PLS scale factors are seen to be independent of the length of the OLS solution |â|², depending only on the relative values of {â_j}_1^p. Note that for all of the methods studied here the estimates (for a given value of the meta parameter) depend on the data only through the vector of OLS estimates {â_j}_1^p and the eigenvalues of the predictor-covariance matrix {e_j²}_1^p.

Although the scale factors for PLS (43) cannot be expressed by a simple formula (as can those for RR and PCR), they can be computed for given values of K, {e_j²}_1^p, and {â_j}_1^p and compared to those of RR and PCR [(36) and (42)] for corresponding situations. This is done in Figures 1-4, for p = 10. In each figure, the scale factors f_1(PLS) through f_10(PLS) are plotted (in order, solid line) for the first six (K = 1, ..., 6) component PLS models. Each of the four figures represents a different situation in terms of the relative values of {e_j²}_1^p and {â_j}_1^p. Also plotted in each frame for comparison are the corresponding shrinkage factors for RR (dashed line) and PCR (dotted line) for that situation, normalized so that they give the same overall shrinkage (sh = |a|/|â|); that is, for RR the ridge parameter λ (36) is chosen so that the length of the RR solution vector is the same as that for PLS (|a_RR| = |a_PLS|). In the case of PCR, the number of components was chosen so that the respective solution lengths were as close as possible (|a_PCR| ≈ |a_PLS|). The three numbers in each frame give the number of PLS components, the corresponding shrinkage factor (sh = |a|/|â|), and the ridge parameter (λ) that provides that overall shrinkage.


Figure 1. Scale Factors for PLS (solid), RR (dashed), and PCR (dotted) for Neutral Least Squares Solution and High Collinearity. Shown in each frame are the number of PLS components (upper entry), overall shrinkage (middle entry), and corresponding ridge parameter (lower entry).

The four situations represented in Figures 1-4 are as follows: {â_j = 1}_1^p, {e_j² ∝ 1/j²}_1^p (neutral â's, high collinearity); {â_j = 1}_1^p, {e_j² ∝ 1/j}_1^p (neutral â's, moderate collinearity); {â_j = 1/j}_1^p, {e_j² ∝ 1/j²}_1^p (favorable â's, high collinearity); and {â_j = j}_1^p, {e_j² ∝ 1/j²}_1^p (unfavorable â's, high collinearity).
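The scale-factor spectra in Figures 1-4 can be recomputed directly from (36), (42), and (43) together with the accompanying displays for w and W. The sketch below is ours; in particular, the algebraic form of the PLS factors follows our reconstruction of (43) above, so it should be read as an illustration rather than a transcription of the authors' code.

```python
import numpy as np

def f_rr(e2, lam):
    """RR scale factors (36): e_j^2 / (e_j^2 + lambda)."""
    return e2 / (e2 + lam)

def f_pcr(e2, K):
    """PCR scale factors (42): 1 on the K largest eigendirections, 0 elsewhere."""
    return (e2 >= np.sort(e2)[::-1][K - 1]).astype(float)

def f_pls(e2, a_ols, K):
    """PLS scale factors (43): f_j = sum_k eta_k e_j^(2k) with eta = W^{-1} w."""
    ks = np.arange(1, K + 1)
    w = np.array([np.sum(a_ols ** 2 * e2 ** (k + 1)) for k in ks])
    W = np.array([[np.sum(a_ols ** 2 * e2 ** (k + l + 1)) for l in ks] for k in ks])
    eta = np.linalg.solve(W, w)            # note: W becomes numerically ill-conditioned for larger K
    return np.array([np.sum(eta * ej2 ** ks) for ej2 in e2])

# the "neutral a-hat's, high collinearity" situation of Figure 1: a_hat_j = 1, e_j^2 proportional to 1/j^2
p = 10
j = np.arange(1, p + 1)
e2, a_ols = 1.0 / j ** 2, np.ones(p)
for K in (1, 2, 3, 4):
    print("K =", K, "f(PLS) =", np.round(f_pls(e2, a_ols, K), 2))
print("f(RR, lam=0.02) =", np.round(f_rr(e2, 0.02), 2))
print("f(PCR, K=2)     =", f_pcr(e2, 2))
```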


Figure 2. Scale Factors for PLS (solid), RR (dashed), and PCR (dotted) for Neutral Least Squares Solution and Moderate Collinearity. The entries in each frame correspond to those in Figure 1.

In Figure 1, the OLS solution is taken to project equally on all eigendirections (neutral) and the eigenvalue structure is taken to be highly peaked toward the larger values (high collinearity). The one-component PLS model (K = 1, upper left frame) is seen to dramatically shrink the OLS coefficients for the smallest eigendirections.


Figure 3. Scale Factors for PLS (solid), RR (dashed), and PCR (dotted) for Favorable Least Squares Solution and High Collinearity. The entries in each frame correspond to those in Figure 1.

It slightly "expands" the OLS coefficient for the largest (first) eigendirection, f_1(PLS) > 1. The overall shrinkage is substantial; the length of the K = 1 PLS solution coefficient vector is about 35% of that for the OLS solution. For the same overall shrinkage, the relative shrinkage of RR tracks that of PLS but is somewhat more moderate.


Figure 4. Scale Factors for PLS (solid), RR (dashed), and PCR (dotted) for Unfavorable Least Squares Solution and High Collinearity. The entries in each frame correspond to those in Figure 1.

This is a consistent trend throughout all situations (Figs. 1-4). For PCR, a two-component model (K = 2) gives roughly the same overall shrinkage as the K = 1 PLS solution. Again this is a trend throughout all situations in that one gets roughly the same overall shrinkage for K_PCR ≈ 2K_PLS.


As the number of PLS components is increased (left to right, top to bottom frames) the overall shrinkage applied to the OLS solution is reduced and the relative shrinkage applied to each eigendirection becomes more moderate. For K = 6, the PLS solution is very nearly the same as the OLS solution, {f_j(PLS) ≈ 1}_1^p, even though it only becomes exactly so for K = 10. Again this feature is present throughout all situations (Figs. 1-4).

An interesting aspect of the PLS solution is that (unlike RR and PCR) it not only shrinks the OLS solution in some eigendirections (f_j ≤ 1) but expands it in others (f_j > 1). For a K-component PLS solution, the OLS solution is expanded in the subspace defined by the eigendirections associated with the eigenvalues closest to the Kth eigenvalue. Directions associated with somewhat larger eigenvalues tend to be slightly shrunk, and those with smaller eigenvalues are substantially shrunk. Again this behavior is exhibited throughout all of the situations studied here. The expression for the mean squared prediction error (35) suggests that, at least for linear estimators, using any f_j > 1 can be highly detrimental because it increases both the bias squared and the variance of the model estimate. This suggests that the performance of PLS might be improved by using modified scale factors {f̃_j(PLS)}_1^p, where f̃_j(PLS) ← min(f_j(PLS), 1), although this is not certain since PLS is not linear and (35) was derived assuming linear estimates. It would, in any case, largely remove the preference of PLS for (true) coefficient vectors that align with the eigendirections whose eigenvalues are close to the Kth eigenvalue.

The situation represented in Figure 2 has the same (neutral) OLS solution but less collinearity. The qualitative behavior of the PLS, RR, and PCR scale factors is seen to be the same as that depicted in Figure 1. The principal difference is that PLS applies less shrinkage for the same number of components and (nearly) reaches the OLS solution for K = 4. Note that for no collinearity (all eigenvalues equal) PLS produces the OLS solution with the first component (K = 1).

Figures 3 and 4 examine the high collinearity situation for different OLS solutions. In Figure 3, the OLS solution is taken to be aligned with the major axes of the predictor design. The relative PLS shrinkage for different eigendirections for this favorable case is seen to be similar to that for the neutral case depicted in Figure 1. The overall shrinkage is much less, however, owing to the favorable orientation of the OLS solution. Figure 4 represents the contrasting situation in which the OLS solution is (unfavorably) aligned in orthogonal directions to the major axes of the predictor design. Here one sees qualitatively similar relative behavior as before, with a bit more exaggeration. Due to the unfavorable alignment of the OLS solution, the overall shrinkage here is quite considerable. Still the OLS solution is nearly reached by the K = 6-component PLS solution.

3.2.1 Discussion. Although the study represented by Figures 1-4 is hardly exhaustive, some tentative conclusions can be drawn. The qualitative behavior of RR, PCR, and PLS as deduced from (20), (22), and (24) is confirmed. They all penalize the solution coefficient vector a for projecting onto the low-variance subspace of the predictor design [i.e., ave(a^T x)² = small]. For PLS and PCR, the strength of the penalty decreases as the number of components K increases. For RR, the strength of the penalty increases for increasing values of the ridge parameter λ. For RR, the strength of this penalty is monotonically increasing for directions of decreasing sample variance. For PCR, it is a sharp threshold function, whereas for PLS it is relatively smooth but not monotonic. All three methods are shrinkage estimators in that the length of their solution coefficient vector is less than that of the OLS solution. RR and PCR are strictly shrinking estimators in that in any projection the length of their solution is less than (or equal to) that of the OLS solution. This is not the case for PLS. It has preferred directions in which it increases the projected length of the OLS solution. For a K-component PLS solution, the projected length is expanded in the subspace of eigendirections associated with eigenvalues close to the Kth eigenvalue.

In all situations depicted in Figures 1-4, PLS used fewer components to achieve the same overall shrinkage as PCR, generally about half as many components. PLS closely reached the OLS solution with about five to six components, whereas PCR requires all ten components. This property has been empirically observed for some time and is often stated as an argument in favor of the superiority of PLS over PCR; one can fit the data at hand to the same degree of closeness with fewer components, thereby producing more parsimonious models. The issue of parsimony is a bit nebulous here, since the result of any method that fits linear models (29) is a single component (direction), namely, that associated with the solution coefficient vector a. One can decompose a arbitrarily into sums of any number of (up to p) other vectors and thus change its parsimony at will. For the same number of components, PCR applies more shrinkage than PLS and thus attempts to fit the data at hand less closely, thereby using fewer degrees of freedom to obtain the fit.


In the situations studied here (Figs. 1-4) it appears that PLS is using twice the number of degrees of freedom per component as PCR, but this will depend on the structure of the predictor-sample covariance matrix. (For all eigenvalues equal, PLS uses p df for a one-component model.) Thus fitting the data with fewer (or more) components (in and of itself) has no bearing on the quality (future prediction error) of an estimator.

Another argument often made in favor of PLS over PCR is that PCR only uses the predictor sample to choose its components, whereas PLS uses the response values as well. This argument is not unrelated to the one discussed previously. By using the response values to help determine its components, PLS uses more degrees of freedom per component and thus can fit the training data to a higher degree of accuracy than PCR with the same number of components. As a consequence, a K-component PLS solution will have less bias than the corresponding K-component PCR solution. It will, however, have greater variance, and since the mean squared prediction error is the sum of the two (bias squared plus variance) it is not clear which solution would be better in any given situation. In any case, either method is free to choose its own number of components (bias-variance trade-off) through model selection (CV). Both PLS and PCR span a full (but not the same) spectrum of models from the most biased (sample mean) to the least biased (OLS solution). The fact that PLS tends to balance this trade-off with fewer components is (in general) neither an advantage nor a disadvantage.

For all of the situations considered in Figures 1-4, PLS and PCR are seen to more strongly penalize for small ave(a^T x)² than RR for the same degree of overall shrinkage |a|/|â|. The RR penalty (36) was derived to be optimal under the assumption that the (true) coefficient vector α (26) has no preferred alignment with respect to the predictor-variable distribution; all directions are equally likely (32). Thus the set of situations that favor PLS and PCR would involve α's that have small projections on the subspace spanned by the eigenvectors corresponding to the smallest eigenvalues. For example, an (improper) prior for a K-component PCR would place zero mass on any coefficient vector α for which

Σ_{j=K+1}^p (α^T v_j)² > 0

and equal mass on all others. Here {v_j}_{K+1}^p are the eigenvectors of the sample predictor-variable covariance matrix [(8)-(9)] associated with the smallest p − K eigenvalues.

Judging from Figures 1-4, a corresponding prior distribution for PLS (if it could be cast in a Bayesian framework) would be more complicated. As with PCR, a prior for a K-component PLS solution would put low (but nonzero) mass on coefficient vectors that heavily project onto the smallest eigendirections. It would, however, put highest mass on those that project heavily onto the space spanned by the eigenvectors associated with eigenvalues close to e_K² and moderate to high mass on the larger eigendirections.

In Figures 1-4, the scale factors for RR, PCR, and PLS were compared for the same amount of overall shrinkage (|a|/|â|). In any particular problem, there is no reason that application of these three methods would result in exactly the same overall shrinkage of the OLS solution, although they are not likely to be dramatically different. The respective scale factors were normalized in this way so that insight could be gained through the relative shape of their scale-factor spectra.

3.3 Power Ridge Regression

If one actually had a prior belief that the true coefficient vector α (26) is likely to be aligned with the larger eigendirections of the predictor-sample covariance matrix V (8), PCR or PLS might be preferred over RR. Another approach would be to directly reflect such a belief in the choice of a prior distribution π(α) for the true coefficient vector α (26). This prior would not be spherically symmetric (32) but would involve a more general quadratic form in α,

π(α) = π(α^T A^{−1} α).   (45)

The (positive definite) matrix A would be chosen to emphasize directions for α/|α| that align with the larger eigendirections of V (8). One such possibility is to choose A to be proportional to V^δ,

A = γ² V^δ,   (46)

where the proportionality constant

γ² = E_π|α|²/tr(V^δ)   (47)

is chosen to explicitly involve the expected value of |α|² [numerator (47)] under π(α) (45), and the denominator (47) is the trace of the matrix V^δ. The optimal linear shrinkage estimator (33) under this prior [(45)-(47)] is

a = (V + λV^{−δ})^{−1} ave(yx)   (48)

with

λ = σ²/(Nγ²).   (49)

Here σ² is the variance of the noise [(26)-(27)] and N is the training-sample size. This procedure [(48)-(49)] is known as power ridge regression (Hoerl and Kennard 1975; Sommers 1964). The corresponding (solution) shrinkage factors (33) in the principal component representation are

f_j* = e_j^{2(δ+1)}/[e_j^{2(δ+1)} + λ].   (50)
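Power ridge regression changes only the matrix added to V. A sketch of (48)-(50), ours rather than the article's, assuming autoscaled data and a full-rank V:

```python
import numpy as np

def power_ridge(X, y, delta, lam):
    """Power ridge regression (48): a = (V + lam * V^{-delta})^{-1} ave(yx).

    delta = 0 recovers ordinary RR (14), delta = -1 shrinks every OLS coefficient by the same
    factor (James-Stein-type shrinkage), and delta > 0 favors the larger eigendirections of V.
    lam plays the role of (49).
    """
    N, p = X.shape
    V = X.T @ X / N
    s = X.T @ y / N
    evals, evecs = np.linalg.eigh(V)
    V_neg_delta = evecs @ np.diag(evals ** (-delta)) @ evecs.T   # V^{-delta} via its eigendecomposition
    return np.linalg.solve(V + lam * V_neg_delta, s)
```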


STATISTICAL VIEW OF CHEMOMETRICS REGRESSION TOOLS 123

(49)] is known as power ridge regression (Hoer1 and Table 3. Ratio of Actual to Optimal Expected Squared
Error Loss When the Parameter δ = δ′ Is Used With Power Ridge Regression and the True Value Characterizing the Prior Distribution π(α) Is δ = δ*

            δ* = -1    δ* = 0    δ* = 1    δ* = 2
δ′ = -1     1.00       3.57      6.87      9.43
δ′ = 0      1.37       1.00      1.10      1.27
δ′ = 1      1.58       1.20      1.00      1.08
δ′ = 2      1.76       1.81      1.15      1.00

Kennard 1975; Sommers 1964). The corresponding (solution) shrinkage factors (33) in the principal component representation are

f_j = e_j^{2(δ+1)} / (e_j^{2(δ+1)} + λ).  (50)

The prior parameter δ [(46)-(48)] regulates the degree to which the true coefficient vector α (26) is supposed to align with the major axes of the predictor-variable distribution. The value δ = 0 gives rise to RR (36) and corresponds to no preferred alignment. Setting δ > 0 expresses a preference for alignment with the larger eigendirections, corresponding (approximately) to PCR and PLS, whereas δ < 0 places increased probability on the smaller eigendirections. The value δ = -1 gives rise to James-Stein (James and Stein 1961) shrinkage, in which the least squares solution coefficients are each shrunk by the same (overall) factor. If a value for δ were unspecified, one could regard it as an additional meta parameter of the procedure (along with λ) and choose both values (jointly) to minimize a model-selection criterion such as CV (12). Whether this will lead to better performance than one of the existing competing methods (RR, PCR, PLS) is an open question that is the topic of current research.

One important issue is robustness of the procedure to the choice of a value for δ. Suppose that the true coefficient vector α (26) occurred with relative probability π(α | δ = δ*) [(45)-(47)] but a different value, δ = δ′, was chosen for power ridge regression [(48)-(50)]. A natural question is: How much accuracy is sacrificed in such a situation for different (joint) values of (δ*, δ′)? This is examined in Table 3 for a situation characterized by p = 20 predictor variables, N = 40 training observations, signal E(αᵀx)² = 1, noise σ = .3, and predictor-variable covariance matrix eigenvalues {e_i² = i²}. Shown in Table 3 are the ratios of actual to optimal expected squared error loss when δ = δ′ (vertical) is assumed and δ = δ* (horizontal) is the true parameter characterizing π(α) [(45)-(47)].

One sees from Table 3 that choosing δ′ = 0 (RR) is the most robust choice (over these situations). James-Stein shrinkage (δ′ = -1) is exceedingly dangerous except when δ* = -1, causing preferential alignment with the smaller eigendirections. For all entries in which δ* and δ′ are nonnegative, choosing δ′ < δ* is better than vice versa. The evidence presented in Figures 1-4 indicates that PCR and PLS more strongly penalize the smaller eigendirections than RR, thereby more closely corresponding to δ′ > 0. The results presented in Table 3 then suggest that RR (δ′ = 0) might be the most robust choice if the nature of the alignment of the true coefficient vector α (26) with respect to the predictor-variable distribution is unknown.
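To make the interplay of δ and λ in (50) concrete, the following small numerical sketch (our own illustration; the function and variable names are not from the text) evaluates the power-ridge shrinkage factors along each principal-component direction. With δ = -1 every direction is shrunk by the same factor (James-Stein), δ = 0 reproduces the RR factors (36), and δ > 0 penalizes the small-eigenvalue directions more heavily, mimicking PCR and PLS.

```python
import numpy as np

def power_ridge_shrinkage(eigvals, delta, lam):
    """Shrinkage factors (50): f_j = e_j^(2(delta+1)) / (e_j^(2(delta+1)) + lambda)."""
    power = eigvals ** (2.0 * (delta + 1.0))
    return power / (power + lam)

# toy eigenvalues e_j of the predictor covariance matrix (hypothetical values)
e = np.sqrt(np.array([4.0, 1.0, 0.25]))
for delta in (-1.0, 0.0, 1.0):          # James-Stein, RR, and a PCR/PLS-like prior
    print(delta, np.round(power_ridge_shrinkage(e, delta, lam=0.5), 3))
```

For δ = -1 all three factors are identical, while for δ = 1 the factor along the smallest eigendirection is driven toward zero, which is the qualitative behavior discussed above.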

4. VARIABLE SUBSET SELECTION

VSS is the most popular method of regression regularization used in statistics. The basic goal is to choose a small subset of the predictor variables that yields the most accurate model when the regression is restricted to that subset. A sequence of subsets, indexed by the number of variables K constituting each one, is considered. For a given K the subset of that cardinality giving rise to the best OLS fit to the data is selected ("all subsets regression"). Sometimes forward/backward stepwise procedures are employed to approximate this strategy with less computation. The subset cardinality K is considered to be a meta parameter of the procedure whose value is chosen through some model-selection scheme, such as CV (12). Other model-selection methods (intended for linear modeling) are also often employed, but their use is not strictly correct, since VSS is not a linear modeling method for a given value of its meta parameter K; the particular variables constituting each selected subset are heavily influenced by the response values {y_i}, so that they enter into the estimates {ŷ_i} in a highly nonlinear fashion (see Breiman 1989).

To try to gain some insight into the relationship between VSS and the procedures considered previously (RR, PCR, and PLS), we again consider the (highly) idealized situation [(26)-(27)] in a Bayesian framework:

Pr(model | data) = Pr(data | model) Pr(model) / Pr(data),  (51)

where the left side ("posterior") is the quantity to be maximized, the first factor on the right side is the likelihood ℒ, the second factor is the prior π(α), and the denominator is a constant (given the data). If we further assume Gaussian errors ε ~ N(0, σ²), the likelihood becomes ℒ(α) ∝ exp[-(N/2σ²) ave(y - αᵀx)²], and maximizing (51) is equivalent to minimizing the (negative) log-posterior

ave(y - αᵀx)² - 2 log π(α),  (52)

where π(α) is the (prior) relative probability of encountering a (true) coefficient vector α (26). This is a penalized least squares problem with penalty -2 log π(α).

In Section 3, we saw that choosing an equidirection prior (32) leads to procedures that shrink the coefficient vector estimate a away from directions in the predictor-variable space for which ave(aᵀx)² is small, to control the variance of the estimate. The prior that leads to RR is

-2 log π_RR(α) = λ αᵀα.  (53)

Informal "priors" leading to PCR and PLS were seen (Figs. 1-4) to involve some preferential alignment of α with respect to the eigendirections {v_j} (9) of the predictor covariance matrix (8).

To study VSS, consider a generalization of (53) to

-2 log π(α) = λ Σ_{j=1}^p |α_j|^γ,  (54)

where λ > 0 (as before) regulates the strength of the penalty and γ > 0 is an additional meta parameter that controls the degree of preference for the true coefficient vector α (26) to align with the original variable {x_j} axis directions in the predictor space. A value γ = 2 yields a rotationally invariant penalty expressing no preference for any particular direction, leading to RR. For γ ≠ 2, (54) is not rotationally invariant, leading to a prior that places excess mass on particular orientations of α with respect to the (original variable) coordinate axes.

Figure 5 shows contours of equal value for (54) [and thus for π(α)] for several values of γ (p = 2). One sees that γ > 2 results in a prior that supposes that the true coefficient vector is more likely to be aligned in directions oblique to the variable axes, whereas for γ < 2 it is more likely to be aligned with the axes. The parameter γ can be viewed as the degree to which the prior probability is concentrated along the favored directions. A value γ = ∞ places maximum concentration along the diagonals, which is in fact not very strong. On the other hand, γ → 0 places the entire prior mass in the directions of the coordinate axes.

Figure 5. Contours of Equal Value for the Generalized Ridge Penalty for Different Values of γ.

The situation γ → 0 corresponds to (all subsets) VSS. In this case, the sum in (54) simply counts the number of nonzero coefficients (variables that enter), and the strength parameter λ can be viewed as a penalty or cost for each one, controlling the number that do enter. Since the penalty term expresses no preference for particular variables, the "best" subset will be chosen through the minimization of the least squares term, ave(y - αᵀx)², of the combined criterion (52).

This discussion reveals that a prior that leads to VSS being optimal is very different from the ones that lead to RR, PCR, and PLS. It places the entire prior probability mass on the original variable axes, expressing the (prior) belief that only a few of the predictor variables are likely to have high relative influence on the response, but provides no information as to which ones. It will therefore work best to the extent that this tends to be the case. On the other hand, RR, PCR, and PLS are controlled by a prior belief that many variables together collectively affect the response, with no small subset of them standing out.

Expressions (52) and (54) reveal that VSS and RR can be viewed as two points (γ = 0 and γ = 2, respectively) on a continuum of possible regression-modeling procedures (indexed by γ). Choosing either procedure corresponds to selecting one of these two points. For a given situation (data set), there is no a priori reason to suspect that the best value of γ might be restricted to only these two choices. It is possible that an optimal value for γ may be located at another point in the continuum (0 < γ ≤ ∞). An alternative might be to use a model-selection criterion (say CV) to jointly estimate optimal values of λ and γ to be used in the regression, thereby greatly expanding the class of modeling procedures. It is an open question as to whether such an approach will actually lead to improved performance; this is the subject of our current research (with Leo Breiman). Note that this approach is different from those that use Bayesian methods to directly compute model-selection criteria for different variable subsets (e.g., see Lindley 1968; Mitchell and Beauchamp 1988).
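The following sketch (ours, not part of the original study; scipy's general-purpose Nelder-Mead optimizer simply stands in for a proper algorithm, and the penalty is not convex for γ < 1, so this is illustrative only) minimizes the penalized criterion implied by (52) and (54) for a few values of γ on simulated data. With γ = 2 the shrinkage is spread over all coefficients, while small γ pushes the coefficients of irrelevant predictors toward zero, approximating subset selection.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.standard_normal((N, p))
y = X @ np.array([3.0, 0.0, 0.0]) + 0.5 * rng.standard_normal(N)   # only x1 matters

def criterion(alpha, lam, gamma):
    """Penalized least squares: ave(y - alpha'x)^2 + lam * sum |alpha_j|^gamma, cf. (52) and (54)."""
    resid = y - X @ alpha
    return np.mean(resid ** 2) + lam * np.sum(np.abs(alpha) ** gamma)

for gamma in (2.0, 1.0, 0.1):   # ridge-like, lasso-like, near subset selection
    fit = minimize(criterion, x0=np.ones(p), args=(0.5, gamma), method="Nelder-Mead")
    print(gamma, np.round(fit.x, 3))
```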

5. A COMPARATIVE MONTE CARLO STUDY OF OLS, RR, PCR, PLS, AND VSS

This section presents a summary of results from a set of Monte Carlo experiments comparing the relative performance of OLS, RR, PCR, PLS, and VSS that were described in more detail by Frank (1989). The five methods were compared for 36 different situations. In all situations, the training-sample size was N = 50. The situations were differentiated by the number of predictor variables (p = 5, 40, 100), the structure of the (population) predictor-variable correlation matrix (independent: all off-diagonal elements 0; highly collinear: all off-diagonal elements .9), the true regression coefficient vector α (26) (equal, {α_j = 1}, or unequal, {α_j = j²}), and the signal-to-noise ratio [(26)-(27)] ([var(αᵀx)]^{1/2}/σ = 7, 3, 1). A full 3 × 2 × 2 × 3 factorial design on the chosen levels for these four factors yields the 36 situations studied here.

For each situation, 100 repetitions of the following procedure were performed:

1. Randomly generate N = 50 training observations with a joint Gaussian distribution (with the specified population correlation matrix) for the predictors, using (26) for the response, with ε drawn from a Gaussian with the specified σ² (27).
2. Apply OLS, RR, PCR, PLS, and VSS (forward stepwise) to the training sample using CV (12) for model selection.
3. Generate N_T = 100 independent "test" observations from the same prescription as in 1.
4. Compute the average squared prediction error (PSE) for the model selected for each method over these test observations:

PSE = (1/N_T) Σ_{i=1}^{N_T} [y_i - a_0 - aᵀx_i]²,  (55)

where (a_0, a) is the solution transformed back to the original (unstandardized) representation.

The computed PSE values for each method were averaged over the 100 replications of this procedure.
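A minimal sketch of a single replication of the four-step procedure above, using ridge regression with a fixed ridge parameter in place of the CV-selected fits of the five methods (all names and settings here are our own choices, not those of the actual study):

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, N_test, sigma = 5, 50, 100, 0.5
alpha = np.ones(p)                           # the "equal" true coefficient vector
R = np.full((p, p), 0.9) + 0.1 * np.eye(p)   # highly collinear correlation matrix

def draw(n):
    X = rng.multivariate_normal(np.zeros(p), R, size=n)
    y = X @ alpha + sigma * rng.standard_normal(n)
    return X, y

Xtr, ytr = draw(N)                                        # step 1
xm, ym = Xtr.mean(axis=0), ytr.mean()
lam = 1.0                                                 # fixed ridge parameter stands in for CV (step 2)
a = np.linalg.solve((Xtr - xm).T @ (Xtr - xm) + lam * np.eye(p), (Xtr - xm).T @ (ytr - ym))
a0 = ym - xm @ a
Xte, yte = draw(N_test)                                   # step 3
PSE = np.mean((yte - a0 - Xte @ a) ** 2)                  # step 4, eq. (55)
print(PSE)
```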

Figures 6-10 present a graphical summary of selected results from this simulation study. [Complete results in both graphical and tabular form are in the work of Frank (1989).] The summaries are in the form of distances in a 36-dimensional Euclidean space. The average PSE (55) in each of the 36 situations forms the axes of this space. There are six points in the space, each defined by the 36 simultaneous values of average PSE for OLS, RR, PCR, PLS, VSS, and the true (known) coefficient vector, ŷ_true = αᵀx (26). The quantities plotted in Figures 6-10 are the Euclidean distances (bar height) of each of the first five points (OLS, RR, PCR, PLS, and VSS) from the sixth point, which represents the performance using the "true" underlying coefficient vector as the regression model in each situation. Thus smaller values indicate better performance.

Figure 6 shows these distances in the full 36-dimensional space, which characterizes average performance over all 36 situations. Figures 7-10 show the distances in various subspaces characterized by slicing (conditioning) on specific values of some of the design variables. These represent the respective average performances conditioned on these particular values.

Figure 6. Distances of OLS, RR, PCR, PLS, and VSS From the Performance of the True Coefficient Vector, Averaged Over All 36 Situations.

One sees from Figure 6 that (not surprisingly) OLS gives the worst performance overall. RR is seen to provide the best average overall performance, closely followed by PLS and PCR. Stepwise VSS gives distinctly inferior overall performance to the other biased procedures but is still considerably better than OLS. Figure 7 shows that the biased methods improve very little on OLS in the well-conditioned (p = 5, N = 50) case, but as the conditioning of the problem becomes increasingly worse (p = 40, 100), their performance degrades substantially less than OLS, thereby providing increasing improvement over it. Figure 8 shows that the biased methods provide dramatic improvement (over OLS) in the highly collinear situations.

Figure 7. Performance Comparisons Conditioned on the p = 5, 40, and 100 Variable Situations.
Figure 8. Performance Comparisons Conditioned on Low and High Collinearity Situations.

The results shown in Figure 9 represent something of a surprise. From the discussion in Section 4, one might have expected VSS to provide dramatically improved performance in the situations corresponding to (highly) unequal (true) coefficient values for the respective variables. For the situations studied here, {α_j = j²}, this did not turn out to be the case. All of the other biased methods dominated VSS for this case. Moreover, the performance of RR, PCR, and PLS did not seem to degrade for the unequal coefficient case. Since (stepwise) VSS must surely dominate the other methods if only few enough variables contribute to the response dependence, it would appear that the structure provided by {α_j = j²} is not sharp enough to cause this phenomenon to set in.

Figure 9. Performance Comparisons Conditioned on the Structure of the True-Coefficients Vector: Equal and Unequal Coefficients.

Figure 10 contains few surprises. (Remember that bar height is proportional to distance from the performance of the true model, which itself degrades with decreasing signal-to-noise ratio.) Higher signal-to-noise ratio seems to help OLS and VSS more than the other biased methods. This may be because their performance degrades less than OLS and VSS as the noise increases.

Figure 10. Performance Comparisons Conditioned on High, Medium, and Low Signal-to-Noise Ratio.

For the situations covered by this simulation study, one can conclude that all of the biased methods (RR, PCR, PLS, and VSS) provide substantial improvement over OLS. In the well-determined case, the improvement was not significant. In all situations, RR dominated all of the other methods studied. PLS usually did almost as well as RR and usually outperformed PCR, but not by very much. Surprisingly, VSS provided distinctly inferior performance to the other biased methods except in the well-conditioned case, in which all methods gave nearly the same performance. Although not discussed here, the performance ranking of these five methods was the same in terms of accuracy of estimation of the individual regression coefficients (see Frank 1989) as for the model prediction error shown here. Not surprisingly, the prediction error improves with increasing observation to variable ratio, increasing collinearity, and

increasing signal-to-noise ratio. A bit surprising is the fact that performance seemed to be indifferent to the structure of the true coefficient values.

The results of this simulation study are in accord with the qualitative results derived from the discussion in Section 3.2.1, namely, that RR, PCR, and PLS have similar properties and give similar performance. (Although not shown here, the actual solutions given by the three methods on the same data are usually quite similar.) One can speculate on the reasons why the performance ranking RR > PLS > PCR came out as it did. PCR might be troubled by its use of a sharp threshold in defining its shrinkage factors (42), whereas RR and PLS more smoothly shrink along the respective eigendirections [(36) and Figs. 1-4]. This may be (somewhat) mitigated by linearly interpolating the PCR solution between adjacent components to produce a more continuous shrinkage (Marquardt 1970). PLS may give up some performance edge to RR because it is not strictly shrinking (some f_j > 1), which likely degrades its performance at least by a little bit.

The performance differential between RR, PCR, and PLS is seen here not to be great. One would not sacrifice much average accuracy over a lifetime of using one of them to the exclusion of the other two. Still, one may see no reason to sacrifice any, in which case this study would indicate RR as the method of choice. The discussion in Section 3.2.1 and the simulation results presented here suggest that claims as to the distinct superiority of any one of these three techniques would require substantial verification.

The situation is different with regard to OLS and VSS. Although these are the oldest and most widely used techniques in the statistical community, the results presented here suggest that there might be much to be gained by considering one of the more modern methods (RR, PCR, or PLS) as well.

6. MULTIVARIATE REGRESSION

We now consider the general case in which more than one variable is regarded as a response (q > 1) [(1)-(7)] and a predictive relationship is to be modeled between each one {y_i} and the complement set of variables, designated as predictors. The OLS solution to this (multivariate) problem is a separate (q = 1) uniresponse OLS regression of each y_i on the predictor variables x, without regard to their commonality. The various biased regression methods (RR, PCR, PLS, VSS) could be applied to this problem by simply replacing each such uniresponse OLS regression with a corresponding biased (q = 1) regression, in accordance with this strategy. The discussion of the previous sections indicates that this would result in substantial performance gains in many situations.

This approach is not the one advocated for PLS (H. Wold 1984). With PLS, the response variables y = {y_i} and the predictors x = {x_j} are separately collected together into groups ("blocks"), which are then treated in a common manner more or less symmetrically. Table 4 shows Wold's two-block algorithm that defines multiple-response PLS regression.

Table 4. Wold's Two-Block PLS Algorithm

(1)  Initialize: y_0 ← y; x_0 ← x; ŷ_0 ← 0
(2)  For K = 1 to p do:
(3)     uᵀ ← (1, 0, . . . , 0)
(4)     Loop (until convergence)
(5)        w_K = ave[(uᵀy_{K-1}) x_{K-1}]
(6)        u = ave[(w_Kᵀx_{K-1}) y_{K-1}]
(7)     end Loop
(8)     z_K = w_Kᵀx_{K-1}
(9)     r_K = [ave(y_{K-1} z_K)/ave(z_K²)] z_K
(10)    ŷ_K = ŷ_{K-1} + r_K
(11)    y_K = y_{K-1} - r_K
(12)    x_K = x_{K-1} - [ave(z_K x_{K-1})/ave(z_K²)] z_K
(13)    if ave(x_Kᵀx_K) = 0 then Exit
(14) end For

If one were to develop a direct extension of Wold's (q = 1) PLS algorithm (Table 1) according to the strategy used by OLS (q separate uniresponse regressions), line 3 of Table 1 would be replaced by the calculation of a separate covariance vector w_{Ki} for each separate response residual y_{K-1,i} on each separate x residual x_{K-1,i}, w_{Ki} = ave(y_{K-1,i} x_{K-1,i}) (i = 1, q). These would then be used to update q separate models ŷ_{K,i} (line 6), as well as q separate new y residuals, y_{Ki} (line 7), and x residuals, x_{Ki} (line 8).

Examination of Table 4 reveals a different strategy. A single covariance vector w_K is computed for all responses by the inner loop (lines 3-7), which is then used to update all of the models ŷ_K (line 10) and the response residuals to obtain y_K (line 11). A single set of x residuals x_K is maintained by this algorithm using the single covariance vector w_K (line 12), as in the uniresponse PLS algorithm (Table 1, line 8). The inner loop (lines 4-7) is an iterative algorithm for finding linear combinations of the response residuals uᵀy_{K-1} and the predictor residuals w_Kᵀx_{K-1} that have maximal joint covariance. This algorithm starts with an arbitrary coefficient vector u (line 3). After convergence of the inner loop, the resulting x-residual linear combination covariance vector w_K is then used for all updates.
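A direct transcription of Table 4 into matrix form might look as follows (a sketch under our own naming; ave(·) is the sample average over the N training observations, and the normalization of u inside the inner loop is added only for numerical stability; it does not change the fitted sequence because the updates in lines (9)-(12) are invariant to the scale of z_K).

```python
import numpy as np

def two_block_pls(X, Y, n_components, tol=1e-10, max_iter=500):
    """Sketch of the two-block PLS algorithm of Table 4. X: (N, p), Y: (N, q), both centered."""
    N = X.shape[0]
    Xr, Yr = X.copy(), Y.copy()           # x and y residuals
    Yhat = np.zeros_like(Y)               # accumulated fit
    fits = []
    for _ in range(n_components):
        u = np.zeros(Y.shape[1]); u[0] = 1.0              # line (3)
        for _ in range(max_iter):                          # lines (4)-(7)
            w = Xr.T @ (Yr @ u) / N                        # (5): w_K = ave[(u'y) x]
            u_new = Yr.T @ (Xr @ w) / N                    # (6): u   = ave[(w'x) y]
            u_new /= np.linalg.norm(u_new)                 # stability only
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        w = Xr.T @ (Yr @ u) / N                            # final w_K for the converged u
        z = Xr @ w                                         # (8): scores z_K
        b = (Yr.T @ z) / (z @ z)                           # (9): regress each y residual on z_K
        r = np.outer(z, b)
        Yhat = Yhat + r                                    # (10)
        Yr = Yr - r                                        # (11)
        Xr = Xr - np.outer(z, (Xr.T @ z) / (z @ z))        # (12): deflate the x residuals
        fits.append(Yhat.copy())
        if np.allclose(Xr, 0.0):                           # (13)
            break
    return fits
```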

This two-block multiple-response PLS algorithm produces R models [R = rank of V (8)] for each response, {ŷ_{Kj}} (K = 1, R; j = 1, q), spanning a full spectrum of solutions from the sample means {ŷ_j = 0} for K = 0 to the OLS solutions for K = R. The number of components K is considered a meta parameter of the procedure to be selected through CV,

K̂ = argmin_{0 ≤ K ≤ R} Σ_{l=1}^N Σ_{j=1}^q [y_{jl} - ŷ_{Kj\l}]²,  (56)

where y_{jl} is the value of the jth response for the lth training observation and ŷ_{Kj\l} is the K-component model for the jth response computed with the lth observation deleted from the training sample. Note that the same number of components K is used for each of the response models.

As with the uniresponse PLS algorithm (Table 1), this two-block algorithm (Table 4) defining multiresponse PLS does not reveal a great deal of insight as to its goal. One can gain more insight by following the prescription outlined in the beginning of Section 3, that is, to consider the regression procedure as a two-step process. First, a K-dimensional subspace of p-dimensional Euclidean space is defined as being spanned by the unit vectors {c_k}, and then q OLS regressions are performed under the constraints that the solution coefficient vectors {a_j} (5) lie in that subspace,

a_j = Σ_{k=1}^K a_{kj} c_k.  (57)

A regression procedure is then prescribed by defining the ordered sequence of unit vectors {c_K} that span the successive subspaces, 1 ≤ K ≤ R. Defining each of these unit vectors to be the solution to

c_K = argmax_{cᵀc = 1, cᵀVc_k = 0 (k < K)} max_u {var(uᵀy) corr²[(uᵀy), (cᵀx)] var(cᵀx)}  (58)

gives (in this framework) the same sequence of models {ŷ_K} as the algorithm in Table 4 defining two-block PLS regression. As with the uniresponse (q = 1) PLS criterion (24), the constraints on {c_k} require them to be unit vectors and to be V orthogonal so that the corresponding linear combinations are uncorrelated (23).

The multiresponse PLS criterion (58) bears some similarity to that for single-response PLS (24). It can be viewed as a penalized canonical correlation criterion. Using the middle factor corr²[(uᵀy), (cᵀx)] alone for the criterion would give rise to standard canonical correlation analysis, producing a sequence of uncorrelated linear combinations {c_Kᵀx} that maximally predict the corresponding optimal linear combinations (uᵀy) of the responses. The (unbiased) canonical correlation criterion (middle factor) is invariant to the scales of the corresponding linear combinations uᵀy and cᵀx. The complete PLS criterion (58) is seen to include two additional factors [var(uᵀy) and var(cᵀx)] that serve as penalties to bias the solutions away from low spread directions in both the x and y spaces. The penalty imposed on the predictor-variable linear combination coefficient vector c is the same as that used for single-response PLS (24). The discussion in Section 3.1 indicates that this mainly serves to control the variance of the estimated model. The introduction of the y-space penalty factor, along with optimizing with respect to its associated linear combination coefficient vector u, serves to place an additional penalty on the x-linear combination coefficient vectors {c_k} that define the sequence of PLS models {ŷ_{Kj}}; they are not only biased away from low (data) spread directions in the predictor-variable space but also toward x directions that preferentially predict the high spread directions in the response-variable space.
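For the first component (K = 1) the product in (58) equals the squared sample covariance cov²(uᵀy, cᵀx), so c_1 (and the accompanying u) can be read off from the leading singular vectors of the sample cross-covariance matrix. A short sketch, with hypothetical names and simulated centered data:

```python
import numpy as np

def first_pls_direction(X, Y):
    """First two-block PLS direction: maximize cov^2(u'y, c'x) over unit vectors, eq. (58) with K = 1."""
    N = X.shape[0]
    S_yx = Y.T @ X / N                   # sample cross-covariance (q x p); data assumed centered
    U, s, Vt = np.linalg.svd(S_yx)
    return Vt[0, :], U[:, 0]             # c_1 (predictor side) and u (response side)

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6)); X -= X.mean(0)
Y = X[:, :2] @ rng.standard_normal((2, 3)) + 0.1 * rng.standard_normal((40, 3)); Y -= Y.mean(0)
c1, u1 = first_pls_direction(X, Y)
print(np.round(c1, 3))
```

Subsequent directions additionally carry the V-orthogonality constraints of (58), which the iterative deflation in Table 4 enforces implicitly.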

6.1 Bayesian Motivation

A natural question to ask is: To what extent (if any) should this multiresponse PLS strategy (58) improve performance over that of simply ignoring the response-space covariance structure and performing q separate (single-response) regressions of each y_i on the predictors x, using PLS (24) or one of the other competing biased regression techniques (RR, PCR)? One way to gain some insight into this is to adopt an (idealized empirical) Bayesian framework (as in Sec. 3.1) and see what (joint) prior on the (true) coefficient vectors {α_i},

π(α_1, . . . , α_q),  (59)

would lead to such a strategy being a good one. One can then judge the appropriateness of such a prior.

In the case of single-response regression (Sec. 3.1), we saw that a prior distribution that placed no preference on any coefficient vector direction α/|α| (32) gave rise to the preferential shrinkage of the corresponding estimate a/|a| away from directions of low predictor spread (36) common to RR, PCR, and PLS. In particular, a Gaussian prior (53) (with Gaussian errors) leads to the optimality of RR. Consider a general (mean 0) joint Gaussian prior (59),

π(α_1, . . . , α_q) ∝ exp(-½ Σ α_{ik} α_{jl} [Γ^{-1}]_{ijkl}),  (60)

where the sum is over all indices (1 ≤ i ≤ q, 1 ≤ j ≤ q, 1 ≤ k ≤ p, 1 ≤ l ≤ p). The covariance structure of such a prior distribution (60) is given by the (q × q × p × p) array Γ with elements Γ_{ijkl}; namely,

E_{α_1, . . . , α_q}(α_{ik} α_{jl}) = Γ_{ijkl}.  (61)

As in (32) and (53), we choose this covariance structure to have no preferred directions in the predictor-variable space, but not necessarily in the response-variable space. This corresponds to (with some abuse of notation)

Γ_{ijkl} = Γ_{ij} δ_{kl},  (62)

with δ_{kl} = 1 if k = l and δ_{kl} = 0 otherwise. The corresponding resulting prior [(60) and (62)] provides information [through Γ_{ij} (62)] on the degree of similarity of the dependence of y_i and y_j on the predictors x but no information as to the nature of that x dependence. A relatively large positive value for Γ_{ij} suggests that y_i and y_j have highly similar dependencies on x, whereas a large negative value indicates highly opposite dependencies. A relatively small value indicates dissimilar dependencies of y_i and y_j on the predictors. To further idealize the situation, suppose that

y_i = α_iᵀx + ε_i,  i = 1, q,  (63)

with the errors ε = {ε_i} having a joint Gaussian distribution

ε ~ N(0, Σ),  (64)

and, in addition, the error covariance is a multiple of the identity matrix,

Σ = σ²I.  (65)

If Σ were known, one could rotate and scale the y-space coordinates so that (65) is obtained in the transformed coordinate system. Otherwise (65) remains a simplifying assumption. Under these assumptions [(60)-(65)], the following generalization of RR to multiple responses is optimal (smallest MSE):

A(RR) = argmin_A {ave[(y - Ax)ᵀ(y - Ax)] + (σ²/N) ‖Γ^{-1/2}A‖²}.  (66)

Here A is a (q × p) matrix of regression coefficients (7), Γ is the (q × q) "prior" matrix (62), σ² is the (common) error variance (65), and ‖A‖² is the Frobenius norm

‖A‖² = Σ_{i=1}^q Σ_{j=1}^p A_{ij}².  (67)

[For a different Bayesian approach to combining regression equations on the same predictor variables, see Lindley and Smith (1972).]

If the elements of the matrix Γ (62) are unknown, one can take an "empirical" Bayesian approach and estimate them from the (training) data. Assuming (60)-(65), one has

ave_x E_{ε, α_1, . . . , α_q}(y_i y_j) = Γ_{ij} tr(V) + σ² δ_{ij},  (68)

where the left side is the expected value of (y_i y_j) over both the error [(64)-(65)] and the coefficient prior [(60) and (62)] distributions, then averaged over the predictor-training sample. The quantity tr(V) is the trace of the predictor-sample covariance matrix V (8). If the data are standardized [(3)-(4)], then

tr(V) = p.  (69)

Let W be the (q × q) sample covariance matrix of the response variables,

W_{ij} = ave(y_i y_j).  (70)

Then from (68) an "estimate" for the elements of the matrix Γ would be

Γ̂ = (W - σ²I)/p,  (71)

which could then be used in conjunction with criterion (66) to obtain the resulting estimate A(RR) (given σ²). The common error variance σ² remains unknown and can be regarded as a meta parameter of the procedure to be estimated (from the training sample) through CV:

σ̂² = argmin_{σ²} Σ_{k=1}^N ‖y_k - A_{\k}(RR | σ²) x_k‖²,  (72)

where A_{\k}(RR | σ²) is the coefficient matrix A(RR) estimated from (66) and (71) with the kth observation deleted from the training sample.

Insight into the nature of solutions provided by (66) and (71) can be enhanced by rotating in the x and y spaces to their respective principal component representations using orthonormal rotation matrices U_x and U_y such that

V = U_xᵀE²U_x,   W = U_yᵀH²U_y,  (73)

with E² and H² being diagonal matrices constituting the respective (ordered) eigenvalues

E² = diag(e_1² . . . e_p²),   H² = diag(h_1² . . . h_q²).  (74)

In this coordinate system, solutions to (66) and (71) simplify to

A_{ij}(RR) = â_{ij} · g_{ij}²/(g_{ij}² + pσ²/N),   i = 1, q; j = 1, p,  (75)

with â_{ij} being the OLS coefficient estimates (in the PP coordinate systems) and

g_{ij}² = e_j²(h_i² - σ²)_+.  (76)

Here the subscript "+" indicates the positive part of the argument,

[η]_+ = η  if η > 0
      = 0  otherwise.  (77)

This RR solution for multiple responses [(75)-(76)] bears considerable resemblance to that for single-response regression (38) in that each coefficient estimate is obtained by (differentially) shrinking the corresponding (unbiased) OLS estimates. Here (for a given value σ²) the relative shrinkage is controlled both by e_j² (corresponding x-direction sample spread) and h_i² (corresponding y-direction sample spread) in a more or less symmetric way through their product (76). A smaller value for either results in more shrinkage. The overall result is to bias the coefficient vector estimates (7) simultaneously away from low sample spread directions in both spaces. The overall degree of this bias is controlled by the value of σ² [the variance of the noise (65)]. The larger its value, the more bias is introduced.

The solution [(75)-(76)] can be recast as

A_{ij}(RR) = â_{ij} · e_j²/(e_j² + λ_i),  (78)

with

λ_i = pσ²/[N(h_i² - σ²)].  (79)

Comparing (78) to (38) shows that this multiresponse RR simply applies separate (uniresponse) RR's to each principal component linear combination of the responses {y_i(PP)}, with

y(PP) = U_y y  (80)

(73), using separate ridge parameters {λ_i} for each one. As in single-response RR (37), the ridge parameters (79) are related to the (inverse) signal-to-noise ratio.

Since the {y_i(PP)} are uncorrelated, they represent a natural response set on which to perform separate regressions. The basic difference between this approach [(66), (71), (78), (79)] and one in which totally separate RR's are used is that the latter would separately estimate its own ridge parameter [for each y_i(PP)] through model selection (say CV), thereby giving rise to q meta parameters {λ_i} to be estimated for the entire procedure. The method previously developed [(66), (71), (78), (79)] attempts to estimate all {λ_i} with a single meta parameter, σ², selected through CV. This is made possible through the assumption embodied in (65). To the extent that (65) represents a good approximation, this should give rise to better performance. If not, totally separate RR's on each y_i(PP) may work better.

6.2 Discussion

The assumptions that lead to the {y_i(PP)} (80) as being the natural coordinates for the single-response regressions are (60) and (62), through the results (68) and (71). Informally, these (quite reasonably) state that the degree of similarity of the dependence of a pair of responses (y_i, y_j) on the predictors is reflected in their correlation. A large positive (or negative) correlation between y_i and y_j means that the corresponding (true) coefficient vectors α_i and α_j should be closely related; that is, α_i ≈ α_j (or α_i ≈ -α_j). Small correlations imply no special relationship. This information is incorporated into the regression procedure by using the empirical response correlational structure to estimate the transformation to linear combinations of the responses {y_i(PP)} that are uncorrelated (no relationship between any of the coefficient vectors), in which separate independent regressions are then performed.

These results suggest that, unless the original response variables happen to be uncorrelated, there is profit to be gained in considering them together rather than simply performing separate regressions on the original responses. This is accomplished by doing the separate regressions on their principal component linear combinations {y_i(PP)} (80). For OLS, this, of course, has no effect, but for the shrinking procedures (RR, PCR, and PLS) this can make quite a difference.

The qualitative behavior of two-block multiresponse PLS (Table 4), as reflected in (57)-(58), is seen to be captured also in multiresponse RR [(66) and (71)], as reflected in (75)-(76), namely, simultaneous shrinkage of the coefficient vector estimates away from low (sample) spread directions in both the x and y spaces. This fact then serves to justify this strategy on the part of the two-block PLS algorithm under the same assumptions that lead to multiresponse RR [(66) and (71)]. The principal assumption is that the respective response errors {ε_i} (63) are independent between the responses and all have approximately the same variance {σ_i² = σ²} (65). To the extent that this tends to be the case, the low spread directions in the y space will be dominated more by the noise than the high spread directions, and biasing the estimates away from these low spread directions will reduce the variance of the estimates. If the error covariance matrix Σ (64) is not well approximated by (65), then the two-block PLS strategy (Table 4) might be counterproductive, and a series of uniresponse PLS regressions (Table 1) of each of the response principal component linear combinations y_i(PP) (80) separately on the predictors could be (much) more effective. The same is, of course, also true for the respective versions of RR.
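A sketch of this alternative strategy, with totally separate ridge regressions run on the response principal components and the result rotated back to the original response coordinates (all function names, and the fixed ridge parameters standing in for CV, are our own):

```python
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def separate_ridge_on_response_pcs(X, Y, lams):
    """Regress the response principal components y(PP) separately, then rotate back."""
    W = np.cov(Y, rowvar=False)               # sample response covariance (q x q)
    h2, Uy = np.linalg.eigh(W)                # h2: response eigenvalues, Uy: eigenvectors
    Ypp = Y @ Uy                              # rotated (uncorrelated) responses
    A_pp = np.column_stack([ridge(X, Ypp[:, i], lams[i]) for i in range(Y.shape[1])])
    return A_pp @ Uy.T                        # (p x q) coefficients in the original response coordinates

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 8)); X -= X.mean(0)
Y = X[:, :3] @ rng.standard_normal((3, 4)) + 0.3 * rng.standard_normal((50, 4)); Y -= Y.mean(0)
A = separate_ridge_on_response_pcs(X, Y, lams=[1.0, 1.0, 5.0, 5.0])
print(A.shape)
```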

As noted previously, if Σ (64) were known, it could be used to derive a transformation (rotation and scaling) of the y-coordinate system so that (65) was obtained in the transformed coordinate system. The analysis (two-block PLS or multiresponse RR) would then be performed in the transformed system and the inverse transform applied to the resulting solutions. Such a transformation can be derived by decomposing Σ into the product

Σ = RᵀR  (81)

and taking z = Ry as the new responses.

The case of Σ (64) unknown can be directly treated in the context of OLS (Box and Draper 1965). Here the residual covariance matrix is used as an estimate of Σ,

Σ̂(A) = ave[(y - Ax)(y - Ax)ᵀ].  (82)

Since this estimate depends on the estimated coefficient matrix A (which in turn depends on Σ̂), an iterative algorithm is required. Using (82), the multiresponse (negative) log-likelihood [assuming Gaussian errors (64)] can be shown (see Bates and Watts 1988, p. 138) to reduce to -L(A) = log det[Σ̂(A)]. This is minimized with respect to the coefficient matrix A, using (iterative) numerical optimization techniques, to obtain the estimate. It is an open question as to whether an analog of this approach can be developed for biased regression procedures such as RR, PCR, or PLS.

6.3 Monte Carlo Study

We end this section by presenting results of a small Monte Carlo study comparing multivariate RR [(66) and (71)] with two-block PLS (Table 4) in several situations. We also compare both multivariate methods to that of applying separate univariate (q = 1) regressions on each (original) response separately. The situations are characterized by the respective eigenstructures of the (population) predictor- and response-variable covariance matrices [(8) and (70)], the signal-to-noise ratio, and the alignment of the true coefficient vectors {α_i} (63) with the eigenstructure of the (population) predictor covariance matrix.

For the first study, there are p = 64 predictor variables, q = 4 response variables, and N = 40 training observations. The study consisted of 100 replications of the following procedure. First, N = 40 training observations were generated with the p = 64 predictors having a joint (population) Gaussian distribution with the specified covariance matrix. The corresponding q = 4 response variables were obtained from (63), with the {ε_i} generated from a Gaussian distribution with the (same) specified variance σ². The true coefficient vectors {α_i} (63) were each independently generated from π(α) [(45)-(47)] under the constraint that the (population) response covariance matrix be the one specified. Several values of the prior parameter δ were used. After each of the models were obtained {using CV [(56) and (72)]}, 1,000 new observations were generated according to the same prescription and the average squared prediction error evaluated with them.

Table 5 compares (in terms of MSE) multivariate RR [(66) and (71)] (upper entry) with two-block PLS (lower entry) for (population) predictor covariance matrix eigenvalues {e_i² = 1/i²} and response covariance matrix eigenvalues {h_i² = 1/i²} (74). The rows correspond to different signal-to-noise ratios and the columns to different prior parameters δ, reflecting differing alignment of the true coefficient vectors {α_i} (63) with the predictor (population) distribution eigendirections. One sees that for δ = 0 (equidirection prior) RR does a bit better than PLS. For δ = 1 (moderate alignment) performance is nearly identical, whereas for δ = 10 (very heavy alignment) PLS has a slight advantage. These results hold for all signal-to-noise ratios.

Table 5. Mean Squared Prediction Error of Multivariate RR (upper entry) and Two-Block PLS (lower entry) for Several Signal-to-Noise Ratios S/N (rows) and Different Prior Parameter Values δ (columns) for a Highly Collinear Situation

S/N        δ = 0    δ = 1    δ = 10
10         .22      .15      .14
           .24      .14      .12
5          .35      .28      .26
           .38      .27      .24
1          .68      .61      .60
           .72      .63      .59

Table 6 presents a similar set of results for the same situation except with less collinearity in both spaces: {e_i² = 1/i} and {h_i² = 1/i}. Here overall performance is worse for both methods, but their respective relative performance is similar to that reflected in Table 5.

Table 6. Mean Squared Prediction Error of Multivariate RR (upper entry) and Two-Block PLS (lower entry) for Several Signal-to-Noise Ratios S/N (rows) and Different Prior Parameter Values δ (columns) for Moderate Collinearity

S/N        δ = 0    δ = 1    δ = 10
10         .44      .27      .18
           .47      .26      .15
5          .57      .41      .32
           .62      .39      .27
1          .84      .73      .67
           .92      .74      .62

These results lend further support to the conclusion that PLS assumes a prior distribution on the true coefficient vectors {α_i} (63) that preferentially aligns them with the larger eigendirections of the predictor covariance matrix (δ > 0).

Table 7 compares the multivariate RR [(66) and (71)] and two-block PLS (Table 4) procedures with the corresponding strategies of applying q separate uniresponse (q = 1) regressions on the original responses. Here the situation is the same as that of Table 5 except that there are q = 8 responses and the comparison is made only for δ = 0 [(45)-(47)]. The relative relationship between multivariate RR and two-block PLS is seen to be the same as that reflected in Table 5 (first column). Each multiresponse method outperforms its corresponding univariate method, but by a surprisingly small amount. In fact, separate RR's do as well as two-block PLS. These results are especially surprising since the situation represented here is set up to provide optimal advantage for the multivariate procedures. Thus even in this optimal setting separate regressions do almost as well as their multiresponse counterparts. This result seems to run counter to the preceding discussion, in which it appeared that using the additional information provided by y-space correlational structure ought to help improve performance. This might well be the case if the population correlations were known. The simulation results indicate that having to estimate them from the data induces enough uncertainty to substantially mitigate this potential advantage, at least for the cases studied here.

Table 7. Mean Squared Prediction Error of Multivariate RR and Two-Block PLS Along With That of Their Corresponding (separate) Uniresponse Procedures for Several Signal-to-Noise Ratios

S/N    Multi-ridge    Uni-ridge    Two-block PLS    Uni-PLS
10     .23            .25          .25              .27
5      .36            .39          .39              .44
1      .68            .74          .73              .79

NOTE: S/N (rows), and prior parameter δ = 0.

Overall, the performance of multivariate RR [(66) and (71)] and two-block PLS (Table 4) are comparable. The RR procedure has the advantage of requiring about three times less computation, however.

7. VARIABLE SCALING

OLS is equivariant with respect to rotation and scaling of the variable axes; that is, if one were to apply any (nonsingular) affine (linear rotation and/or scaling) transformation to the variable axes, perform the (OLS) analysis in the transformed system, and then apply the inverse transformation to the solution, the result would be the same as if the analysis were done in the original coordinate system. None of the biased regression procedures discussed here (RR, PCR, PLS, or VSS) enjoy this affine equivariance property. Applying such transformations on the variables can change the analysis and its result. RR, PCR, and PLS are equivariant under (rigid) rotations of the coordinates. This property allowed us to study them in the sample principal component representations in which the (transformed) covariance matrices were diagonal. They are not, however, equivariant to transformations that change the scales of the coordinates. VSS is equivariant under scaling of the variables but not under rotations. All of these procedures are equivariant under translation of (the origin of) the coordinate systems.

In Section 3 we saw that the basic regularization provided by RR, PCR, and PLS was to shrink their solutions away from directions of small spread in the predictor space. This is not an affine invariant concept. If an original predictor variable x_l has a (relatively) small scale compared to the other predictor variables, var(x_l) < var(x_k) (k ≠ l), then the coordinate axis represented by this variable represents a direction of small spread in the predictor space and the solution will be biased away from involving this variable. Standardizing (autoscaling) the variables [(3)-(4)] to all have the same scale represents a deliberate choice on the part of the user to make all variables equally influential in the analysis. If it were known (a priori) that some variables ought to be more influential than others, this information could be incorporated by adjusting their relative scales to reflect that importance.

Lack of affine equivariance with respect to the predictor variables can be understood in the Bayesian framework adopted in Section 3.1. For RR, the prior (32) leading to its optimality is invariant under rotations; that is, if one were to apply a (rigid) rotation characterized by an orthonormal matrix U (UUᵀ = UᵀU = I),

α′ = Uα,  (83)

then

π(α′ᵀα′) = π(αᵀUᵀUα) = π(αᵀα),  (84)

and the prior is unchanged, resulting in rotational equivariance. A more general prior would be

π(α) = π(αᵀA⁻¹α),  (85)

where A is a p × p positive definite matrix. All rigid rotations (84) involve taking A = I, the identity matrix. This makes all directions for α (the truth) equally likely, using the original coordinate scales to define the metric.
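These two statements are easy to check numerically. The sketch below (our own illustration) verifies that the ridge solution commutes with a rigid rotation of the predictor axes but not with a rescaling of one predictor:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, lam = 60, 4, 1.0
X = rng.standard_normal((N, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.standard_normal(N)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b = ridge(X, y, lam)

Q, _ = np.linalg.qr(rng.standard_normal((p, p)))   # a rigid rotation of the predictor axes
b_rot = ridge(X @ Q, y, lam)
print(np.allclose(Q @ b_rot, b))                   # True: rotation equivariant

D = np.diag([10.0, 1.0, 1.0, 1.0])                 # rescale the first predictor
b_scl = ridge(X @ D, y, lam)
print(np.allclose(D @ b_scl, b))                   # False: not scale equivariant
```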

Taking A to represent a more general quadratic form in α (85) imposes a specific prior belief on the relative importance of various directions in the predictor space (again using a metric defined by the original variable scales). In particular, choosing A to be diagonal,

A = diag(s_1² . . . s_p²),  (86)

alters the prior belief of the relative importance of the original predictor variables (coordinates). The particular choice s_j² = var(x_j) (j = 1, p) in (86) imposes the belief that all predictor variables have equal (a priori) importance, leading to the (data) scale invariant penalty

-2 log π(α) = λ Σ_{j=1}^p var(x_j) α_j²

for RR (still using the original variable scales to form the metric). This is equivalent to changing the metric by standardizing the variables [(3)-(4)] and then using A = I with respect to one's new metric.

The similarity of PCR and PLS to RR extends to this property as well. Standardizing the variables so that all have the same scale imposes the prior belief that all of the predictor variables ought to be equally important. A different choice for the relative scales would reflect a different prior belief on their relative importance.

In Section 4 we saw that a prior leading to VSS places all of its mass on certain preferred directions in the predictor-variable space, namely, the coordinate axes [(54), γ → 0]. Changing the definition of the coordinate axes (preferred directions) through a rotation clearly alters such a prior, causing VSS to not be equivariant under rotations. As γ → 0, the (VSS prior) penalty (54) simply counts the number of nonzero coefficients and thus does not involve the variable scales. This causes VSS to be equivariant under predictor-variable scaling.

Since one would not expect (or want) a procedure to be invariant to the user's imposed prior beliefs as reflected in the chosen prior π(α), it is no surprise that the regularized regression procedures RR, PCR, PLS, and VSS are not affine equivariant in the predictor space. [See Smith and Campbell (1980), and associated comments, for a spirited discussion of this issue.]

Changing the scales of the response variables in multiple-response regression (Sec. 6) has a similar effect but for a different reason. Changing their relative scales changes their relative influence on the solution. This change, however, is reflected through the loss criterion rather than prior belief. The squared-error loss criterion is

L = Σ_{i=1}^q E(y_i - ŷ_i)².  (87)

A more general (squared-error) loss criterion would be

L_M = E[(y - ŷ)ᵀM⁻¹(y - ŷ)],  (88)

with M some positive definite matrix chosen (by the user) to reflect the (relative) preference of accurately predicting certain linear combinations of the responses. Choosing M to be a diagonal matrix,

M = diag(m_1 . . . m_q),  (89)

chooses the response variables themselves to reference the preferred linear combinations (axis directions). In particular, the choice M = I causes their relative importance to be proportional to their sample variance, whereas the choice

m_i = var(y_i),  i = 1, q,  (90)

causes them to have equal influence on the loss criterion (88).

For OLS, a choice for M is irrelevant, since this procedure chooses {ŷ_i} such that each E(y_i - ŷ_i)² (i = 1, q) is minimized separately, without regard for the other responses. Performing separate biased regressions on each of the individual original responses has a similar effect in that M is irrelevant; the result is the same regardless of a choice for M. This is not, however, the case for the biased procedures that operate collectively on the responses, such as two-block PLS (Table 4) or multiresponse RR [(66) and (71)]. It is also not the case if the biased procedures are (separately) applied to the response principal component linear combinations {y_i(PP)} (80), as suggested in Section 6.2. (An exception occurs when the chosen values of the regularization parameters turn out to give rise to unbiased OLS.)

Standardizing the response variables [(3)-(4)] and using M = I (88) in the transformed system is equivalent to using (88), (89), and (90) in the original coordinate system, thereby making all original responses (but not their linear combinations) equally important (influential) in deriving the biased regression models for different levels of bias. If this is not what is wanted (i.e., it is important to accurately predict some responses more than others), then this desire can be incorporated into a choice for M (88) or, equivalently, a choice for the relative scales of each response (if M is diagonal), or of their linear combinations (if M is not diagonal).
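As a small illustration of the role of M (our own example, not from the text), the loss (88) evaluated with M = I is dominated by the response with the largest scale when all responses carry the same relative error, whereas the choice (90) gives every response equal weight:

```python
import numpy as np

def weighted_loss(Y, Yhat, M):
    """Average of (y - yhat)' M^{-1} (y - yhat) over observations, as in (88)."""
    R = Y - Yhat
    return np.mean(np.einsum("ij,jk,ik->i", R, np.linalg.inv(M), R))

rng = np.random.default_rng(5)
Y = rng.standard_normal((100, 3)) * np.array([10.0, 1.0, 0.1])   # responses on very different scales
Yhat = 0.9 * Y                                                    # the same 10% relative error everywhere

print(weighted_loss(Y, Yhat, np.eye(3)))                 # dominated by the large-scale response
print(weighted_loss(Y, Yhat, np.diag(Y.var(axis=0))))    # each response contributes equally, eq. (90)
```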

8. INTERPRETATION

In the preceding sections, we have compared the various regression methods from the point of view of prediction. This is because prediction error provides an objective criterion (once all definitions and assumptions have been stated) less subject to philosophical or emotional argument. As is well known, the goal of a regression analysis is often not solely prediction but also description; one uses the computed regression equation(s) as a descriptive statistic to attempt to interpret the predictive relationships derived from the data. The loss structure for this enterprise is difficult to specify and depends on the experience and skill of the user in relation to the method used.

It is common to interpret the solution coefficients on the (standardized) original variables as a measure of the strength of the predictive relationship between the response(s) and the respective predictors. In this case accuracy of estimation of these coefficients is a relevant goal. As noted in Section 5, the relative ranking of the methods studied there on coefficient accuracy was the same as that for prediction (see Frank 1989). Interpretation is also often aided by the simplicity or parsimony of the representation of the result. This concept is somewhat subjective, depending on the user's experience. In statistics, parsimony is often taken to refer to the number of (original) predictor variables that "enter" the regression equation, that is, the number with nonzero coefficients. The smaller this number, the more parsimonious and interpretable is the result. This leads to VSS as the method of choice, since it attempts to reduce mean squared (prediction) error by constraining coefficients to be 0. Moreover, it is often the original variables (as opposed to their linear combinations) that are most easily related to the system under study that produced the data.

It is well known that, in the presence of extreme collinearity, interpretation of individual regression coefficients as relating to the strength of the respective partial predictive relationships is dangerous. In chemometrics applications, the number of predictor variables often (greatly) exceeds the number of observations. Thus there are many exact (as well as possibly many approximate) collinearities among the predictors. This has led chemometricians to attempt to interpret the solution in terms of various linear combinations of the predictors rather than the individual predictor variables themselves. (This approach is somewhat similar to the use of factor-analytic methods in the social sciences.) The linear combinations associated with the principal component directions are a natural set to consider for this purpose, since they represent a set of uncorrelated "variables" that are mutually orthogonal (with respect to the standardized predictors) and satisfy a simple optimality criterion (22). Moreover, principal components analysis has long been in use and is a well-studied method for describing and condensing multivariate data.

The PLS procedure also produces a set of uncorrelated (but not orthogonal) linear combinations. It is often (subjectively) argued that these are a more "natural" set with which to interpret regression solutions because the criterion [(24) and (58)] by which they are defined involves the data response as well as predictor values. Linear combinations with low response correlation will tend to appear later in the PLS sequence unless their (data) variance is very large. One consequence of this is that a solution regression coefficient vector â can generally be approximated to the same degree of accuracy by its projection on the space spanned by fewer PLS components than principal components. As noted in Section 3.2.1, however, this parsimony argument is not compelling, since any vector â can be completely represented in a subspace of dimension 1, namely, that defined by a unit vector proportional to it.

The choice of a set of coordinates in which to interpret a regression solution is largely independent of the method by which the solution was obtained. One is not required to use a solution obtained through PCR or PLS to interpret it in terms of their respective components. One could interpret a regression equation(s) obtained by either OLS, VSS, RR, PCR, or PLS in terms of the original predictor variables, the principal components, or the PLS linear combinations (or all three). Prediction and interpretation are separate issues, the former being amenable to (more or less) objective analysis but the latter always depending on subjective criteria associated with a particular analyst.

ACKNOWLEDGMENTS

This article was prepared in part while one of us (JHF) was visiting the Statistics Group at AT&T Bell Laboratories, Murray Hill, New Jersey. We acknowledge their generous support and especially thank Trevor Hastie and Colin Mallows for valuable discussions.

APPENDIX: PROOF OF (20) AND (21)

For convenience, center the data so that E(y) = E(x) = 0. The RR solution â_RR is given by (13). Let aᵀa = f² so that a = fc, with cᵀc = 1. Then, given c, the solution to (13) for f, f(c), is

f(c) = argmin_f [ave(y - f cᵀx)² + λf²]
     = ave(y cᵀx)/[ave(cᵀx)² + λ],  (A.1)

and the ridge solution is (21) with

c_RR = argmin_{cᵀc = 1} {ave[y - f(c) cᵀx]² + λf²(c)}.  (A.2)

Substituting (A.1) for f(c) in (A.2) and simplifying gives

c_RR = argmin_{cᵀc = 1} {ave(y²) - ave²(y cᵀx)/[ave(cᵀx)² + λ]}

or, equivalently,

c_RR = argmax_{cᵀc = 1} ave²(y cᵀx)/{ave(y²)[ave(cᵀx)² + λ]}
     = argmax_{cᵀc = 1} {ave²(y cᵀx)/[ave(y²) ave(cᵀx)²]} · {ave(cᵀx)²/[ave(cᵀx)² + λ]}.

If the data are uncentered, then mean values would have to be subtracted from all quantities, giving (20).

[Received December 1991. Revised September 1992.]

REFERENCES

Bates, D. M., and Watts, D. G. (1988), Nonlinear Regression Analysis, New York: John Wiley.
Box, G. E. P., and Draper, N. R. (1965), "Bayesian Estimation of Common Parameters From Several Responses," Biometrika, 52, 355-365.
Breiman, L. (1989), "Submodel Selection and Evaluation in Regression I. The x-Fixed Case and Little Bootstrap," Technical Report 169, University of California, Berkeley, Dept. of Statistics.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions. Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-validation," Numerische Mathematik, 31, 377-403.
Frank, I. E. (1987), "Intermediate Least Squares Regression Method," Chemometrics and Intelligent Laboratory Systems, 1, 233-242.
Frank, I. E. (1989), "Comparative Monte Carlo Study of Biased Regression Techniques," Technical Report LCS 105, Stanford University, Dept. of Statistics.
Golub, G. H., Heath, M., and Wahba, G. (1979), "Generalized Cross-validation as a Method for Choosing a Good Ridge Parameter," Technometrics, 21, 215-224.
Hawkins, D. M. (1973), "On the Investigation of Alternative Regressions by Principal Components Analysis," Applied Statistics, 22, 275-286.
Helland, I. S. (1988), "On the Structure of Partial Least Squares Regression," Communications in Statistics-Simulation and Computation, 17, 581-607.
Hoerl, A. E., and Kennard, R. W. (1970), "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics, 12, 55-67.
Hoerl, A. E., and Kennard, R. W. (1975), "A Note on a Power Generalization of Ridge Regression," Technometrics, 17, 269.
James, W., and Stein, C. (1961), "Estimation With Quadratic Loss," in Proceedings of the Fourth Berkeley Symposium (Vol. 1), ed. J. Neyman, Berkeley: University of California Press, pp. 361-379.
Lindley, D. V. (1968), "The Choice of Variables in Multiple Regression" (with discussion), Journal of the Royal Statistical Society, Ser. B, 30, 31-66.
Lindley, D. V., and Smith, A. F. M. (1972), "Bayes Estimates for the Linear Model" (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 1-40.
Lorber, A., Wangen, L. E., and Kowalski, B. R. (1987), "A Theoretical Foundation for the PLS Algorithm," Journal of Chemometrics, 1, 19-31.
Mallows, C. L. (1973), "Some Comments on Cp," Technometrics, 15, 661-667.
Marquardt, D. W. (1970), "Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation," Technometrics, 12, 591-612.
Martens, H., and Naes, T. (1989), Multivariate Calibration, New York: John Wiley.
Massy, W. F. (1965), "Principal Components Regression in Exploratory Statistical Research," Journal of the American Statistical Association, 60, 234-246.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression" (with discussion), Journal of the American Statistical Association, 83, 1023-1037.
Naes, T., and Martens, H. (1985), "Comparison of Prediction Methods for Multicollinear Data," Communications in Statistics-Simulation and Computation, 14, 545-576.
Phatak, A., Reilly, P. M., and Penlidis, A. (1991), "The Geometry of 2-block Partial Least Squares," Technical Report, University of Waterloo, Dept. of Chemical Engineering.
Rissanen, J. (1983), "A Universal Prior for Integers and Estimation by Minimum Description Length," The Annals of Statistics, 11, 416-431.
Schwartz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.
Smith, G., and Campbell, F. (1980), "A Critique of Some Ridge Regression Methods" (with discussion), Journal of the American Statistical Association, 75, 74-103.
Sommers, R. W. (1964), "Sound Application of Regression Analysis in Chemical Engineering," unpublished paper presented at the American Institute of Chemical Engineers Symposium on Avoiding Pitfalls in Engineering Applications of Statistical Methods, Memphis, TN.
Stone, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
Stone, M., and Brooks, R. J. (1990), "Continuum Regression: Cross-validated Sequentially Constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares and Principal Components Regression" (with discussion), Journal of the Royal Statistical Society, Ser. B, 52, 237-269.
Webster, J. T., Gunst, R. F., and Mason, R. L. (1974), "Latent Root Regression Analysis," Technometrics, 16, 513-522.
Wold, H. (1966), "Estimation of Principal Components and Related Models by Iterative Least Squares," in Multivariate Analysis, ed. P. R. Krishnaiah, New York: Academic Press, pp. 391-420.
Wold, H. (1975), "Soft Modeling by Latent Variables; the Nonlinear Iterative Partial Least Squares Approach," in Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett, ed. J. Gani, London: Academic Press.
Wold, H. (1984), "PLS Regression," in Encyclopedia of Statistical Sciences (Vol. 6), eds. N. L. Johnson and S. Kotz, New York: John Wiley, pp. 581-591.
Wold, S., Ruhe, A., Wold, H., and Dunn, W. J. (1984), "The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses," SIAM Journal on Scientific and Statistical Computing, 5, 735-743.
