Technometrics
Publication details, including instructions for authors and subscription information:
http://amstat.tandfonline.com/loi/utch20
To cite this article: Ildiko E. Frank & Jerome H. Friedman (1993): A Statistical View of Some Chemometrics Regression Tools, Technometrics, 35:2, 109-135
This article may be used for research, teaching, and private study purposes. Any substantial or systematic
reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to
anyone is expressly forbidden.
The publisher does not give any warranty express or implied or make any representation that the contents
will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses
should be independently verified with primary sources. The publisher shall not be liable for any loss,
actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising
directly or indirectly in connection with or arising out of the use of this material.
© 1993 American Statistical Association and the American Society for Quality Control. TECHNOMETRICS, MAY 1993, VOL. 35, NO. 2
spective. The goal is to try to understand their apparent successes and in what situations they
can be expected to work well and to compare them with other statistical methods intended
for those situations. These methods include ordinary least squares, variable subset selection,
and ridge regression.
KEY WORDS: Multiple response regression; Partial least squares; Principal components
regression; Ridge regression; Variable subset selection.
ILDIKO E. FRANK AND JEROME H. FRIEDMAN
cedures, such as that it makes fewer assumptions concerning the nature of the data. Simply not understanding the nature of the assumptions being made does not mean that they do not exist.

Space limitations force us to limit our discussion here to methods that so far have seen the most use in practice. There are many other suggested approaches [e.g., latent root regression (Hawkins 1973; Webster, Gunst, and Mason 1974), intermediate least squares (Frank 1987), James-Stein shrinkage (James and Stein 1961), and various Bayes and empirical Bayes methods] that, although potentially promising, have not yet seen wide application.

1.1 Summary Conclusions

RR, PCR, and PLS are seen in Section 3 to operate in a similar fashion. Their principal goal is to shrink the solution coefficient vector away from the OLS solution toward directions in the predictor-variable space of larger sample spread. Section 3.1 provides a Bayesian motivation for this under a prior distribution that provides no information concerning the direction of the true coefficient vector: all directions are equally likely to be encountered. Shrinkage away from low spread directions is seen to control the variance of the estimate. Section 3.2 examines the relative shrinkage structure of these three methods in detail. PCR and PLS are seen to shrink more heavily away from the low spread directions than RR, which provides the optimal shrinkage (among linear estimators) for an equidirection prior. Thus PCR and PLS make the assumption that the truth is likely to have particular preferential alignments with the high spread directions of the predictor-variable (sample) distribution. A somewhat surprising result is that PLS (in addition) places increased probability mass on the true coefficient vector aligning with the Kth principal component direction, where K is the number of PLS components used, in fact expanding the OLS solution in that direction. The solutions, and hence the performance, of RR, PCR, and PLS tend to be quite similar in most situations, largely because they are applied to problems involving high collinearity in which variance tends to dominate the bias, especially in the directions of small predictor spread, causing all three methods to shrink heavily along those directions. In the presence of more symmetric designs, larger differences between them might well emerge.

The most popular method of regression regularization used in statistics, VSS, is seen in Section 4 to make quite different assumptions. It is shown to correspond to a limiting case of a Bayesian procedure in which the prior probability distribution places all mass on the original predictor-variable (coordinate) axes. This leads to the assumption that the response is likely to be influenced by a few of the predictor variables but leaves unspecified which ones. It will therefore tend to work best in situations characterized by true coefficient vectors with components consisting of a very few (relatively) large (absolute) values.

Section 5 presents a simulation study comparing the performance of OLS, RR, PCR, PLS, and VSS in a variety of situations. In all of the situations studied, RR dominated the other methods, closely followed by PLS and PCR, in that order. VSS provided distinctly inferior performance to these but still considerably better than OLS, which usually performed quite badly.

Section 6 examines multiple-response regression, investigating the circumstances under which considering all of the responses together as a group might lead to better performance than a sequence of separate regressions of each response individually on the predictors. Two-block multiresponse PLS is analyzed. It is seen to bias the solution coefficient vectors away from low spread directions in the predictor-variable space (as would a sequence of separate PLS regressions) but also toward directions in the predictor space that preferentially predict the high spread directions in the response-variable space. An (empirical) Bayesian motivation for this behavior is developed by considering a joint prior on all of the (true) coefficient vectors that provides information on the degree of similarity of the dependence of the responses on the predictors (through the response correlation structure) but no information as to the particular nature of those dependences. This leads to a multiple-response analog of RR that exhibits similar behavior to that of two-block PLS. The two procedures are compared in a small simulation study in which multiresponse ridge slightly outperformed two-block PLS. Surprisingly, however, neither did dramatically better than the corresponding uniresponse procedures applied separately to the individual responses, even though the situations were designed to be most favorable to the multiresponse methods.

Section 7 discusses the invariance properties of these regression procedures. Only OLS is equivariant under all nonsingular affine (linear-rotation and/or scaling) transformations of the variable axes. RR, PCR, and PLS are equivariant under rotation but not scaling. VSS is equivariant under scaling but not rotation. These properties are seen to follow from the nature of the (informal) priors and loss structures associated with the respective procedures.

Finally, Section 8 provides a short discussion of interpretability issues.
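As a rough numerical illustration of the conclusion that RR dominates OLS under collinearity (the design below is an assumption of mine, not the Section 5 simulation, which is not reproduced in this excerpt), one can compare OLS and ridge regression on a highly collinear design:

```python
import numpy as np

# Assumed setup: N = 40 observations, p = 10 predictors lying close to a
# two-dimensional subspace (high collinearity), fixed ridge parameter lam.
rng = np.random.default_rng(0)

def one_trial(rng, N=40, p=10, lam=1.0):
    B = rng.normal(size=(2, p))                        # latent two-factor structure
    draw = lambda n: rng.normal(size=(n, 2)) @ B + 0.05 * rng.normal(size=(n, p))
    X, Xtest = draw(N), draw(500)
    a = rng.normal(size=p)                             # true coefficient vector
    y = X @ a + rng.normal(size=N)
    ytest = Xtest @ a                                  # noise-free targets for test MSE
    a_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    a_rr = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return (np.mean((Xtest @ a_ols - ytest) ** 2),
            np.mean((Xtest @ a_rr - ytest) ** 2))

mse = np.array([one_trial(rng) for _ in range(30)]).mean(axis=0)
print("OLS MSE %.3f  ridge MSE %.3f" % tuple(mse))     # ridge should come out well ahead
```

The OLS fit is destabilized by the near-singular directions of the design, while the ridge penalty shrinks along exactly those directions; averaged over replications the ridge test error is reliably smaller here.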
2. REGRESSION

Regression analysis on observational data forms a major part of chemometric studies. As in statistics, the goal is to model the predictive relationships of a set of q response variables y = {y_1 ... y_q} on a set of p predictor variables x = {x_1 ... x_p}, given a set of N (training) observations

(y_i, x_i)^T = (y_{1i} ... y_{qi}, x_{1i} ... x_{pi})^T   (1)

on which all of the variables have been measured. This model is then used both as a descriptive statistic for interpreting the data and as a prediction rule for estimating likely values of the response variables when only values of the predictor variables are available. The structural form of the predictive relationship is taken to be linear:

y_j = a_{j0} + Σ_{k=1}^p a_{jk} x_k,  j = 1, ..., q.   (2)

The problem then is to use the training data (1) to estimate the values of the coefficients {a_{jk}} appearing in Model (2).

In nearly all chemometric analyses, the variables are standardized ("autoscaled"):

y_j ← (y_j − ȳ_j)/[ave(y_j − ȳ_j)²]^{1/2}
x_k ← (x_k − x̄_k)/[ave(x_k − x̄_k)²]^{1/2},   (3)

with

ȳ_j = ave(y_j)
x̄_k = ave(x_k),   (4)

where the averages are taken over the training data (1); that is,

ave(η) = (1/N) Σ_{i=1}^N η_i,

where η is the quantity being averaged. (This notational convention will be used throughout the article.) The analysis is then applied to the standardized variables and the resulting solutions transformed back to reference the original locations and scales of the variables. The regression methods discussed later are always assumed to include constant terms (2), thus making them invariant with respect to the variable locations, so that translating them to all have zero means is simply a matter of convenience (or numerics). Most of these methods are not, however, invariant to the relative scaling of the variables, so that choosing them to all have the same scale is a deliberate choice on the part of the user. A different choice would give rise to different estimated models. This is discussed further in Section 7.

After autoscaling the training data, the regression models (2) (on the training data) can be expressed as

y_j = a_j^T x,  j = 1, ..., q,   (5)

with the jth coefficient vector being a_j^T = (a_{j1} ... a_{jp}), or in matrix notation

y = Ax   (6)

with the q × p matrix of regression coefficients being

A = [a_{jk}].   (7)

The dominant regression methods used in chemometrics are PCR and PLS. The corresponding methods most used by statisticians (in practice) are OLS, RR, and VSS. The goal of this article is to compare and contrast these methods in an attempt to identify their similarities and differences. The next section starts with brief descriptions of PCR, PLS, and RR. (It is assumed that the reader is familiar with OLS and the various implementations of VSS.) We consider first the case of only one response variable (q = 1), since most of their similarities and differences emerge in this simplified setting. Multivariate regression (q > 1) is discussed in Section 6.

2.1 Principal Components Regression

PCR (Massy 1965) has been in the statistical literature for some time, although it has seen relatively little use compared to OLS and VSS. It begins with the training-sample covariance matrix of the predictor variables

V = ave(xx^T)   (8)

and its eigenvector decomposition

V = Σ_{k=1}^p e_k² v_k v_k^T.   (9)

Here {e_k²}_1^p are the eigenvalues of V arranged in descending order (e_1² ≥ e_2² ≥ ... ≥ e_p²) and {v_k}_1^p their corresponding eigenvectors. PCR produces a sequence of regression models {ŷ_0 ... ŷ_R} with

ŷ_K = Σ_{k=1}^K [ave(y v_k^T x)/e_k²] v_k^T x,  K = 1, ..., R,   (10)

with R being the rank of V (number of nonzero e_k²). The Kth model (10) is just the OLS regression of y on the "variables" {z_k = v_k^T x}_1^K, with the convention that for K = 0 the model is just the response mean, ȳ = 0 (3). The goal of PCR is to choose the particular model ŷ_K with the lowest prediction mean squared error

K* = argmin_{0≤K≤R} avE(y − ŷ_K)²,   (11)

where avE is the average over future data, not part of the training sample. The quantity K can thus be
considered a metaparameter of the procedure whose value is to be estimated from the training data through some model-selection procedure. In chemometrics, model selection is nearly always done through ordinary cross-validation (CV) (Stone 1974),

K̂ = argmin_{0≤K≤R} Σ_{i=1}^N (y_i − ŷ_{K,i})²,   (12)

where ŷ_{K,i} is the Kth model (10) computed on the training sample with the ith observation removed. There are many other model-selection criteria in the statistics literature [e.g., generalized cross-validation (Craven and Wahba 1979), minimum description length (Rissanen 1983), Bayesian information criterion (Schwarz 1978), Mallows's Cp (Mallows 1973), etc.] that can also be used. (A discussion of their

regression consists of computing the covariance vector w_K (line 3) and then using it to form a linear combination z_K of the x residuals (line 4). The y residuals are then regressed on this linear combination (line 5), and the result is added to the model (line 6) and subtracted from the current y residuals to form the new y residuals y_K (line 7) for the next step. New x residuals (x_K) are then computed (line 8) by subtracting from x_{K−1} its projection on z_K. The test (line 9) will cause the algorithm to terminate after R steps, where R is the rank of V (8).

This PLS algorithm produces a sequence of models ŷ_K (line 1 and line 6) on successive passes through the For loop. The one (ŷ_K̂) that minimizes the CV score (12) is selected as the PLS solution. Note that straightforward application of many of the competing model-selection criteria is not appropriate here since,
any particular situation is generally obtained by considering it to be a metaparameter of the procedure and estimating it through some model-selection procedure such as CV. Since here the response values {y_i}_1^N do enter linearly in the model estimates {ŷ_i}_1^N, any of the competing model-selection criteria can

where λ is the ridge parameter [(13)-(14)]. The ridge solution is then taken to be a (shrinking) ridge regression of y on c_RR^T x with the same value for the ridge parameter,

ŷ_RR = [ave(y c_RR^T x)/(ave[(c_RR^T x)²] + λ)] c_RR^T x,   (21)
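The PLS fitting loop described earlier (lines 3-9 of the paper's algorithm listing, which is not itself reproduced in this excerpt) can be sketched as follows; the helper name `pls_path` and the assumption of centered (autoscaled) inputs are mine:

```python
import numpy as np

def pls_path(X, y, Kmax):
    """One-response PLS by deflation: returns the in-sample fits yhat_1..yhat_K."""
    N = len(y)
    Xk, yk = X.copy(), y.copy()                     # current x and y residuals
    yhat, fits = np.zeros(N), []
    for _ in range(Kmax):
        w = Xk.T @ yk / N                           # covariance vector (line 3)
        z = Xk @ w                                  # linear combination of x residuals (line 4)
        b = (z @ yk) / (z @ z)                      # regress y residuals on z (line 5)
        yhat = yhat + b * z                         # add to the model (line 6)
        yk = yk - b * z                             # new y residuals (line 7)
        Xk = Xk - np.outer(z, Xk.T @ z) / (z @ z)   # new x residuals: remove projection on z (line 8)
        fits.append(yhat.copy())
        if np.allclose(Xk, 0.0):                    # rank of V reached: terminate (line 9)
            break
    return fits

# Tiny check on assumed data: with K = rank(X) components, PLS reproduces the OLS fit.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
X = X - X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)
y = y - y.mean()
fits = pls_path(X, y, 4)
yhat_ols = X @ np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(fits[-1], yhat_ols))              # → True
```

The deflation makes each new linear combination uncorrelated with the previous ones over the training sample, so the accumulated fit is the least squares fit on the extracted components.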
where the average is over the training sample.

This comparison consists of regarding the regression procedure as a two-step process, as in VSS (Stone and Brooks 1990); first a K-dimensional subspace of p-dimensional Euclidean space is defined, and then the regression is performed under the restriction that the coefficient vector a lies in that subspace:

a = Σ_{k=1}^K a_k c_k,   (17)

where the unit vectors {c_k}_1^K span the prescribed subspace with c_k^T c_k = 1. The regression procedures can be compared by the way in which they define the subspace {c_k}_1^K and the manner in which the (constrained) regression is performed.

First, consider OLS in this setup. Here the subspace is defined by the (single) unit vector that maximizes the sample correlation (squared) between the response and the corresponding linear combination of the predictor variables,

c_OLS = argmax_{c^T c = 1} corr²(y, c^T x);   (18)

As a consequence of this and the criterion (22), they also turn out to be orthogonal, c_k^T c_l = 0, k ≠ l. The Kth PCR model is given by a least squares regression of the response on the K linear combinations {c_k^T x}_1^K. Since they are uncorrelated (23), this reduces to the sum of univariate regressions on each one (10).

PLS regression also produces a sequence of K-dimensional subspaces spanned by successive unit vectors, and then the Kth PLS solution is obtained by a least squares fit of the response onto the corresponding K linear combinations, in a strategy similar to PCR. The only difference from PCR is in the criterion used to define the vectors that span the K-dimensional subspace and hence the corresponding linear combinations. The criterion that gives rise to PLS (Stone and Brooks 1990) is

c_k(PLS) = argmax_{c^T c = 1, {c^T V c_l = 0}_{l=1}^{k−1}} corr²(y, c^T x) var(c^T x).   (24)

As with PCR, the vectors c_k(PLS) are constrained to be mutually V-orthogonal so that the corresponding linear combinations are uncorrelated over the train-
ing sample (23). This causes the K-dimensional least squares fit to be equivalent to the sum of K univariate regressions on each linear combination separately, as with PCR. Unlike PCR, however, the {c_k(PLS)}_1^K are not orthogonal, owing to the different criterion (24) used to obtain them.

The OLS criterion (18) is invariant to the scale of the linear combination c^T x and gives an unbiased estimate of the coefficient vector and hence the regression model [(18)-(19)]. The criteria associated with RR (20), PCR (22), and PLS (24) all involve the scale of c^T x through its sample variance, thereby producing biased estimates. The effect of this bias is to pull the solution coefficient vector away from the OLS solution toward directions in which the projected data (predictors) have larger spread. The degree of this bias is regulated by the value of the

of the regressors being identical). This can be seen from the PLS criterion (24). In this case, var(c^T x) = 1 for all c, and the PLS criterion reduces to that for OLS (18). With this exception, the effect of decreasing K is to attract the solution coefficient vector toward larger values of var(c^T x), as in PCR. For a given K, however, the degree of this attraction depends jointly on the covariance structure of the predictor variables and the OLS solution, which in turn depends on the sample response values. This fact is often presented as an argument in favor of PLS over PCR. Unlike PCR, there is no sharp lower bound on var(c_K^T x) for a given K. The behavior of PLS compared to PCR for changing K is examined in more detail in Section 3.2.

3.1 Bayesian Motivation
with the expected value over the distribution of the errors ε (26). Since α (the truth) is unknown, the MSE (at x) for any particular estimator is also unknown. One can, however, consider various (prior) probability distributions π on α and compare the properties of different estimators when the relative probabilities of encountering situations for which a particular α (26) occurs are given by that distribution. For a given π, the mean squared prediction error averaged over the situations it represents is

E_π E_ε [α^T x − α̂^T x]².   (31)

A simple and relatively unrestrictive prior probability distribution is one that considers all coefficient-vector directions α/|α| equally likely; that is, the prior distribution depends only on its norm |α|² = α^T α,

π(α) = π(α^T α).   (32)

Averaging over α using the probability distribution given by (32) [taking advantage of the fact that E_π(αα^T) = I · E_π|α|²/p, with I being the identity matrix] yields

MSE[ŷ(x)] = Σ_{j=1}^p [(1 − f_j)² E_π|α|²/p + f_j² σ²/(N e_j²)] x_j².   (35)

Here (35) E_π|α|² is the expected value of the length of the coefficient vector α under the prior (32), p is the number of predictor variables, σ² is the variance of the error term [(26)-(27)], N is the training-sample size, and {e_j²}_1^p are the eigenvalues of the (sample) predictor-variable covariance matrix (28), which in this case are the sample variances of the predictor variables due to our choice of coordinate system (28).

The two terms within the brackets (35) that contribute to the MSE at x have separate interpretations. The first term depends on (the distribution of the) truth (α) and is independent of the error variance or the predictor-variable distribution. It represents the bias (squared) of the estimate. The second term is independent of the nature of the true coefficient vector α and depends only on the experimental situation (error variance and predictor-design sample). It is the variance of the estimate. Setting {f_j = 1}_1^p (33) yields the least squares estimates, which are unbiased but have variance given by the second term in (35). Reducing any (or all) of the {f_j}_1^p to a value less than 1 causes an increase in bias [first term (35)] but decreases the variance [second term (35)]. This is the usual bias-variance trade-off encountered in nearly all estimation settings. [Setting any (or all) of the {f_j}_1^p to a value greater than 1 increases both the bias and the variance.]

This expression (35) for the MSE (in a simplified setting) illustrates the important fact that justifies the qualitative behavior of RR, PCR, and PLS discussed previously, namely, the shrinking of the solution coefficient vector away from directions of low (sample) variance in the predictor-variable space. One sees from the second term in (35) that the contribution of the jth direction to the variance is proportional to 1/e_j². Minimizing (35) separately over each f_j gives

f_j = e_j²/(e_j² + λ),   (36)

with

λ = p σ²/(N E_π|α|²).   (37)

The quantity λ [(36)-(37)] is the number of (predictor) variables times the square of the noise-to-signal ratio, divided by the training-sample size. Combining (33), (36), and (37) gives the optimal (minimal MSE) linear shrinkage estimates

ã_j = α̂_j · e_j²/(e_j² + λ),  j = 1, ..., p.   (38)

One sees that the unbiased OLS estimates {α̂_j}_1^p are differentially shrunk, with the relative amount of shrinkage increasing with decreasing predictor-variable spread e_j. The amount of differential shrinkage is controlled by the quantity λ (37): The larger the value of λ, the more differential shrinkage, as well as more overall global shrinkage. The value of λ in turn is given by the inverse product of the signal/noise squared and the training-sample size.
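The optimality of this shrinkage can be checked numerically. The sketch below (all values of p, N, σ², and the prior scale E_π|α|² are assumed for illustration) draws α from a spherical prior, perturbs it with the OLS sampling noise σ²/(N e_j²), and compares the resulting MSE of the unshrunk (f_j = 1) and optimally shrunk (38) estimates with the prediction of (35) at a point with all x_j = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, sigma2, A = 10, 25, 1.0, 4.0            # assumed: dimensions, noise variance, E|alpha|^2
e2 = 1.0 / np.arange(1, p + 1) ** 2           # assumed collinear eigenvalue spectrum
lam = p * sigma2 / (N * A)                    # eq. (37)
f = e2 / (e2 + lam)                           # eqs. (36)/(38): optimal shrinkage factors

reps = 20000
alpha = rng.normal(0.0, np.sqrt(A / p), size=(reps, p))        # spherical prior (32)
noise = rng.normal(0.0, np.sqrt(sigma2 / (N * e2)), size=(reps, p))
ahat = alpha + noise                                           # OLS estimates in the eigenbasis

mse_ols = np.mean((ahat - alpha) ** 2, axis=0).sum()           # f_j = 1 (unbiased)
mse_shrunk = np.mean((f * ahat - alpha) ** 2, axis=0).sum()    # f_j from (38)
theory = ((1 - f) ** 2 * A / p + f ** 2 * sigma2 / (N * e2)).sum()  # eq. (35) with x_j = 1
print(round(mse_ols, 2), round(mse_shrunk, 2), round(theory, 2))
```

With this strongly collinear spectrum the variance term dominates the OLS risk, and the shrunk estimates recover most of it at a small cost in bias, closely matching the theoretical value from (35).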
α/|α| and by definition has no preferred directions for the truth. Therefore, one can at least qualitatively conclude that the common property of RR, PCR, and PLS of shrinking their solutions away from low spread directions mainly serves to reduce the variance of their estimates, and this is what gives them generally superior performance to OLS. The results given by (35), (37), and (38) indicate that their degree of improvement (over OLS) will increase with decreasing signal-to-noise ratio and training-sample size and increasing collinearity, as reflected by the disparity in the eigenvalues (28) of the predictor-variable covariance matrix [(8)-(9)].

It is well known that (38) is just RR as expressed in the coordinate system defined by the eigenvectors of the sample predictor-variable covariance matrix [(8)-(9)]. Thus these results show (again well known) that RR provides the minimum MSE among linear shrinkage estimators for the prior π(α) assumed here (32) and λ (37) known. PCR is also a linear shrinkage estimator,

a_j(PCR) = α̂_j · I(e_j² − e_K²),   (39)

where K is the number of components used and the second factor I(·) takes the value 1 for nonnegative argument values and 0 otherwise. Thus RR dominates PCR for an equidirection prior (32). PLS is not a linear shrinkage estimator, so RR cannot be shown to dominate PLS through this argument.

3.2 Shrinking Structure

One way to attempt to gain some insight into the relative properties of RR, PCR, and PLS is to examine their respective shrinkage structures in various situations. This can be done by expanding their solutions in terms of the eigenvectors of the predictor-sample covariance matrix [(8)-(9)] and the OLS estimate α̂:

a(RR : PCR : PLS) = Σ_{j=1}^p f_j(RR : PCR : PLS) α̂_j v_j.   (40)

Here α̂_j is the projection of the OLS solution on v_j (the jth eigenvector of V),

α̂_j = ave(y v_j^T x)/e_j²,   (41)

and {f_j(·)}_1^p can be regarded as a set of factors along each of these eigendirections that scale the OLS solution for each of the respective methods. As shown in (36) and (39), f_j(RR) = e_j²/(e_j² + λ) and

f_j(PCR) = 1,  e_j² ≥ e_K²
         = 0,  e_j² < e_K²,   (42)

both of which are linear in that they do not involve the sample response values {y_i}_1^N.

The corresponding scale factors for PLS are not linear in the response values. For a K-component solution, they can be expressed as

f_j(PLS) = Σ_{k=1}^K g_k e_j^{2k},   (43)

where the vector g = {g_k}_1^K is given by g = W^{-1} w, with the K components of the vector w being

w_k = Σ_{j=1}^p α̂_j² e_j^{2(k+1)},

and the elements of the K × K matrix W given by

W_{kl} = Σ_{j=1}^p α̂_j² e_j^{2(k+l+1)}.

They depend on the number of components K used and the eigenstructure {e_j²}_1^p (as do the factors for RR and PCR), but not in a simple way. They also depend on the OLS solution {α̂_j}_1^p, which in turn depends on the response values {y_i}_1^N. The PLS scale factors are seen to be independent of the length of the OLS solution |α̂|², depending only on the relative values of {α̂_j}_1^p. Note that for all of the methods studied here the estimates (for a given value of the metaparameter) depend on the data only through the vector of OLS estimates {α̂_j}_1^p and the eigenvalues of the predictor-covariance matrix {e_j²}_1^p.

Although the scale factors for PLS (43) cannot be expressed by a simple formula (as can those for RR and PCR), they can be computed for given values of K, {e_j²}_1^p, and {α̂_j}_1^p and compared to those of RR and PCR [(36) and (42)] for corresponding situations. This is done in Figures 1-4, for p = 10. In each figure, the scale factors f_1(PLS) ... f_10(PLS) are plotted (in order, solid line) for the first six (K = 1, ..., 6) component PLS models. Each of the four figures represents a different situation in terms of the relative values of {e_j²}_1^p and {α̂_j}_1^p. Also plotted in each frame for comparison are the corresponding shrinkage factors for RR (dashed line) and PCR (dotted line) for that situation, normalized so that they give the same overall shrinkage (sh = |a|/|α̂|); that is, for RR the ridge parameter λ (36) is chosen so that the length of the RR solution vector is the same as that for PLS (|a_RR| = |a_PLS|). In the case of PCR, the number of components was chosen so that the respective solution lengths were as close as possible (|a_PCR| ≈ |a_PLS|). The three numbers in each frame give the number of PLS components, the corresponding shrinkage factor (sh = |a|/|α̂|), and the ridge parameter (λ) that provides that overall shrink-
Figure 1. Scale Factors for PLS (solid), RR (dashed), and PCR (dotted) for Neutral Least Squares Solution and High Collinearity. Shown in each frame are the number of PLS components (upper entry), overall shrinkage (middle entry), and corresponding ridge parameter (lower entry). [Plot panels not reproduced in this excerpt.]
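The PLS factors entering these figures can be computed from (43). The following sketch (synthetic data of my own choosing) builds w, W, and g = W⁻¹w and cross-checks the resulting f_j(PLS) against a K-component PLS solution computed directly, using the standard characterization of PLS as least squares over the Krylov span of {s, Vs, V²s, ...} with s = ave(yx); that characterization is background knowledge, not established in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, K = 200, 6, 2
X = rng.normal(size=(N, p)) * np.sqrt(1.0 / np.arange(1, p + 1))  # decaying spread
X = X - X.mean(axis=0)
y = X @ rng.normal(size=p) + rng.normal(size=N)
y = y - y.mean()

V = X.T @ X / N                                   # sample covariance (8)
s = X.T @ y / N                                   # ave(y x)
evals, vecs = np.linalg.eigh(V)
order = np.argsort(evals)[::-1]
e2, v = evals[order], vecs[:, order]              # eigenvalues e_j^2 (descending) and v_j
ahat = (v.T @ s) / e2                             # OLS projections, eq. (41)

# f_j(PLS) from (43): f_j = sum_k g_k e_j^(2k), with g = W^{-1} w
w = np.array([np.sum(ahat**2 * e2**(k + 1)) for k in range(1, K + 1)])
W = np.array([[np.sum(ahat**2 * e2**(k + l + 1)) for l in range(1, K + 1)]
              for k in range(1, K + 1)])
g = np.linalg.solve(W, w)
f_formula = sum(g[k - 1] * e2**k for k in range(1, K + 1))

# Direct K-component PLS fit: least squares over the Krylov basis {s, Vs, ...}
B = np.column_stack([np.linalg.matrix_power(V, k) @ s for k in range(K)])
t = np.linalg.lstsq(X @ B, y, rcond=None)[0]
a_pls = B @ t
f_direct = (v.T @ a_pls) / ahat                   # factors recovered from the solution, eq. (40)
print(np.round(f_formula, 6))
print(np.round(f_direct, 6))                      # the two sets of factors agree
```

The agreement follows because the normal equations of the Krylov least squares fit are exactly the system W g = w when written in the eigenbasis of V.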
age. The four situations represented in Figures 1-4 are as follows: {α̂_j = 1}_1^p, {e_j² ∼ 1/j²}_1^p (neutral α̂'s, high collinearity); {α̂_j = 1}_1^p, {e_j² ∼ 1/j}_1^p (neutral α̂'s, moderate collinearity); {α̂_j = 1/j}_1^p, {e_j² ∼ 1/j²}_1^p (favorable α̂'s, high collinearity); and {α̂_j = j}_1^p, {e_j² ∼ 1/j²}_1^p (unfavorable α̂'s, high collinearity).
Figure 2. Scale Factors for PLS (solid), RR (dashed), and PCR (dotted) for Neutral Least Squares Solution and Moderate Collinearity. The entries in each frame correspond to those in Figure 1. [Plot panels not reproduced in this excerpt.]
In Figure 1, the OLS solution is taken to project equally on all eigendirections (neutral), and the eigenvalue structure is taken to be highly peaked toward the larger values (high collinearity). The one-component PLS model (K = 1, upper left frame) is seen to dramatically shrink the OLS coefficients for
Figure 3. Scale Factors for PLS (solid), RR (dashed), and PCR (dotted) for Favorable Least Squares Solution and High Collinearity. The entries in each frame correspond to those in Figure 1. [Plot panels not reproduced in this excerpt.]
the smallest eigendirections. It slightly "expands" the OLS coefficient for the largest (first) eigendirection, f_1(PLS) > 1. The overall shrinkage is substantial; the length of the K = 1 PLS solution coefficient vector is about 35% of that for the OLS solution. For the same overall shrinkage, the relative shrink-
Figure 4. Scale Factors for PLS (solid), RR (dashed), and PCR (dotted) for Unfavorable Least Squares Solution and High Collinearity. The entries in each frame correspond to those in Figure 1. [Plot panels not reproduced in this excerpt.]
age of RR tracks that of PLS but is somewhat more moderate. This is a consistent trend throughout all situations (Figs. 1-4). For PCR, a two-component model (K = 2) gives roughly the same overall shrinkage as the K = 1 PLS solution. Again this is a trend throughout all situations, in that one gets roughly the
same overall shrinkage for K_PCR = 2K_PLS. As the number of PLS components is increased (left to right, top to bottom frames), the overall shrinkage applied to the OLS solution is reduced and the relative shrinkage applied to each eigendirection becomes more moderate. For K = 6, the PLS solution is very nearly the same as the OLS solution, {f_j(PLS) ≈ 1}_1^p, even though it only becomes exactly so for K = 10. Again this feature is present throughout all situations (Figs. 1-4).

An interesting aspect of the PLS solution is that (unlike RR and PCR) it not only shrinks the OLS solution in some eigendirections (f_j ≤ 1) but expands it in others (f_j > 1). For a K-component PLS solution, the OLS solution is expanded in the subspace defined by the eigendirections associated with the eigenvalues closest to the Kth eigenvalue. Directions associated with somewhat larger eigenvalues tend to be slightly shrunk, and those with smaller eigenvalues are substantially shrunk. Again this behavior is exhibited throughout all of the situations studied here. The expression for the mean squared prediction error (35) suggests that, at least for linear estimators, using any f_j > 1 can be highly detrimental because it increases both the bias squared and the variance of the model estimate. This suggests that the performance of PLS might be improved by using modified scale factors {f̃_j(PLS)}_1^p, where f̃_j(PLS) = min(f_j(PLS), 1), although this is not certain since PLS is not linear and (35) was derived assuming linear estimates. It would, in any case, largely remove the preference of PLS for (true) coefficient vectors that align with the eigendirections whose eigenvalues are close to the Kth eigenvalue.

The situation represented in Figure 2 has the same (neutral) OLS solution but less collinearity. The qualitative behavior of the PLS, RR, and PCR scale factors is seen to be the same as that depicted in Figure 1. The principal difference is that PLS applies less shrinkage for the same number of components and (nearly) reaches the OLS solution for K = 4. Note that for no collinearity (all eigenvalues equal) PLS produces the OLS solution with the first component (K = 1).

Figures 3 and 4 examine the high collinearity situation for different OLS solutions. In Figure 3, the OLS solution is taken to be aligned with the major axes of the predictor design. The relative PLS shrinkage for different eigendirections for this favorable case is seen to be similar to that for the neutral case depicted in Figure 1. The overall shrinkage is much less, however, owing to the favorable orientation of the OLS solution. Figure 4 represents the contrasting situation in which the OLS solution is (unfavorably) aligned in orthogonal directions to the major axes of the predictor design. Here one sees qualitatively similar relative behavior as before, with a bit more exaggeration. Due to the unfavorable alignment of the OLS solution, the overall shrinkage here is quite considerable. Still, the OLS solution is nearly reached by the K = 6 component PLS solution.

3.2.1 Discussion. Although the study represented by Figures 1-4 is hardly exhaustive, some tentative conclusions can be drawn. The qualitative behavior of RR, PCR, and PLS as deduced from (20), (22), and (24) is confirmed. They all penalize the solution coefficient vector a for projecting onto the low-variance subspace of the predictor design [i.e., ave(a^T x)² small]. For PLS and PCR, the strength of the penalty decreases as the number of components K increases. For RR, the strength of the penalty increases for increasing values of the ridge parameter λ. For RR, the strength of this penalty is monotonically increasing for directions of decreasing sample variance. For PCR, it is a sharp threshold function, whereas for PLS it is relatively smooth but not monotonic. All three methods are shrinkage estimators in that the length of their solution coefficient vector is less than that of the OLS solution. RR and PCR are strictly shrinking estimators in that in any projection the length of their solution is less than (or equal to) that of the OLS solution. This is not the case for PLS. It has preferred directions in which it increases the projected length of the OLS solution. For a K-component PLS solution, the projected length is expanded in the subspace of eigendirections associated with eigenvalues close to the Kth eigenvalue.

In all situations depicted in Figures 1-4, PLS used fewer components to achieve the same overall shrinkage as PCR, generally about half as many components. PLS closely reached the OLS solution with about five to six components, whereas PCR requires all ten components. This property has been empirically observed for some time and is often stated as an argument in favor of the superiority of PLS over PCR; one can fit the data at hand to the same degree of closeness with fewer components, thereby producing more parsimonious models. The issue of parsimony is a bit nebulous here, since the result of any method that fits linear models (29) is a single component (direction), namely, that associated with the solution coefficient vector a. One can decompose a arbitrarily into sums of any number of (up to p) other vectors and thus change its parsimony at will. For the same number of components, PCR applies more shrinkage than PLS and thus attempts to fit the data at hand less closely, thereby using fewer degrees of
freedom to obtain the fit. In the situations studied here (Figs. 1-4) it appears that PLS is using twice the number of degrees of freedom per component as PCR, but this will depend on the structure of the predictor-sample covariance matrix. (For all eigenvalues equal, PLS uses p df for a one-component model.) Thus fitting the data with fewer (or more) components (in and of itself) has no bearing on the quality (future prediction error) of an estimator.

Another argument often made in favor of PLS over PCR is that PCR only uses the predictor sample to choose its components, whereas PLS uses the response values as well. This argument is not unrelated to the one discussed previously. By using the response values to help determine its components, PLS uses more degrees of freedom per component and thus can fit the training data to a higher degree of accuracy than PCR with the same number of components. As a consequence, a K-component PLS solution will have less bias than the corresponding K-component PCR solution. It will, however, have greater variance, and since the mean squared prediction error is the sum of the two (bias squared plus variance) it is not clear which solution would be better in any given situation. In any case, either method is free to choose its own number of components (bias-variance trade-off) through model selection (CV). Both PLS and PCR span a full (but not the same) spectrum of models from the most biased (sample mean) to the least biased (OLS solution). The fact that PLS tends to balance this trade-off with fewer components is (in general) neither an advantage nor disadvantage.

For all of the situations considered in Figures 1-4, PLS and PCR are seen to more strongly penalize for small ave(α^T x)^2 than RR for the same degree of overall shrinkage |a|/|â|. The RR penalty (36) was derived to be optimal under the assumption that the (true) coefficient vector α (26) has no preferred alignment with respect to the predictor-variable distribution; all directions are equally likely (32). Thus the set of situations that favor PLS and PCR would involve α's that have small projections on the subspace spanned by the eigenvectors corresponding to the smallest eigenvalues. For example, an (improper) prior for a K-component PCR would place zero mass on any coefficient vector α for which

and equal mass on all others. Here {v_j}_{K+1}^p are the eigenvectors of the sample predictor-variable covariance matrix [(8)-(9)] associated with the smallest p − K eigenvalues.

Downloaded by [University of Birmingham] at 09:58 12 April 2013

Judging from Figures 1-4, a corresponding prior distribution for PLS (if it could be cast in a Bayesian framework) would be more complicated. As with PCR, a prior for a K-component PLS solution would put low (but nonzero) mass on coefficient vectors that heavily project onto the smallest eigendirections. It would, however, put highest mass on those that project heavily onto the space spanned by the eigenvectors associated with eigenvalues close to the Kth eigenvalue, and moderate to high mass on the larger eigendirections.

In Figures 1-4, the scale factors for RR, PCR, and PLS were compared for the same amount of overall shrinkage (|a|/|â|). In any particular problem, there is no reason that application of these three methods would result in exactly the same overall shrinkage of the OLS solution, although they are not likely to be dramatically different. The respective scale factors were normalized in this way so that insight could be gained through the relative shape of their scale-factor spectra.

3.3 Power Ridge Regression

If one actually had a prior belief that the true coefficient vector α (26) is likely to be aligned with the larger eigendirections of the predictor-sample covariance matrix V (8), PCR or PLS might be preferred over RR. Another approach would be to directly reflect such a belief in the choice of a prior distribution π for the true coefficient vector α (26). This prior would not be spherically symmetric (32) but would involve a more general quadratic form in α,

    π(α) = π(α^T A α).  (45)

The (positive definite) matrix A would be chosen to emphasize directions for α/|α| that align with the larger eigendirections of V (8). One such possibility is to choose A to be proportional to V^δ,

    A = β^2 V^δ,  (46)

where the proportionality constant

    β^2 = E_π|α|^2 / tr(V^δ)  (47)

is chosen to explicitly involve the expected value of |α|^2 [numerator (47)] under π(α) (45), and the denominator (47) is the trace of the matrix V^δ. The optimal linear shrinkage estimator (33) under this prior [(45)-(47)] is

    â = (V + λV^{-δ})^{-1} ave(yx)  (48)

with

    λ = σ^2/(Nβ^2).  (49)

Here σ^2 is the variance of the noise [(26)-(27)] and N is the training-sample size. This procedure [(48)-(49)] is known as power ridge regression (Hoerl and Kennard 1975; Sommers 1964). The corresponding (solution) shrinkage factors (33) in the principal-component representation are

    f_j^δ = e_j^{2(δ+1)} / (e_j^{2(δ+1)} + λ).  (50)

Table 3. Ratio of Actual to Optimal Expected Squared Error Loss When the Parameter δ = δ' Is Used With Power Ridge Regression and the True Value Characterizing the Prior Distribution π(α) Is δ = δ*

    δ'    -1    0    1    2
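As a quick numerical illustration of the shrinkage factors (50), the sketch below (the eigenvalues, λ, and function name are made up for illustration; they are not from the paper) shows how δ interpolates between uniform shrinkage and progressively heavier penalization of low-variance directions:

```python
import numpy as np

# Sketch of the power-ridge shrinkage factors (50); eigenvalues and
# lambda below are illustrative values only.
def power_ridge_factors(e2, lam, delta):
    """f_j = e_j^{2(delta+1)} / (e_j^{2(delta+1)} + lambda)."""
    s = e2 ** (delta + 1.0)
    return s / (s + lam)

e2 = np.array([4.0, 1.0, 0.25, 0.04])   # sample eigenvalues e_j^2
for delta in (-1, 0, 1, 2):
    print(delta, np.round(power_ridge_factors(e2, lam=0.5, delta=delta), 3))
```

With δ = −1 every direction is shrunk by the same factor 1/(1 + λ); δ = 0 recovers the ordinary RR factors e_j^2/(e_j^2 + λ); larger δ shrinks the small-eigenvalue directions harder.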
Informal "priors" leading to PCR and PLS were seen (Figs. 1-4) to involve some preferential alignment of α with respect to the eigendirections {v_j}_1^p (9) of the predictor covariance matrix (8).

To study VSS, consider a generalization of (53) to

    -2 log π(α) = λ Σ_{j=1}^p |α_j|^γ,  (54)

where λ > 0 (as before) regulates the strength of the penalty and γ > 0 is an additional meta parameter that controls the degree of preference for the true coefficient vector α (26) to align with the original-variable {x_j}_1^p axis directions in the predictor space. A value γ = 2 yields a rotationally invariant penalty expressing no preference for any particular direction, leading to RR. For γ ≠ 2, (54) is not rotationally invariant, leading to a prior that places excess mass on particular orientations of α with respect to the (original variable) coordinate axes.

Figure 5. Contours of Equal Value for the Generalized Ridge Penalty for Different Values of γ.

Figure 5 shows contours of equal value for (54) [and thus for π(α)] for several values of γ (p = 2). One sees that γ > 2 results in a prior that supposes that the true coefficient vector is more likely to be aligned in directions oblique to the variable axes, whereas for γ < 2 it is more likely to be aligned with the axes. The parameter γ can be viewed as the degree to which the prior probability is concentrated along the favored directions. A value γ = ∞ places maximum concentration along the diagonals, which is in fact not very strong. On the other hand, γ → 0 places the entire prior mass in the directions of the coordinate axes.

The situation γ → 0 corresponds to (all-subsets) VSS. In this case, the sum in (54) simply counts the number of nonzero coefficients (variables that enter), and the strength parameter λ can be viewed as a penalty or cost for each one, controlling the number that do enter. Since the penalty term expresses no preference for particular variables, the "best" subset will be chosen through the minimization of the least-squares term, ave(y − α^T x)^2, of the combined criterion (52).

This discussion reveals that a prior that leads to VSS being optimal is very different from the ones that lead to RR, PCR, and PLS. It places the entire prior probability mass on the original variable axes, expressing the (prior) belief that only a few of the predictor variables are likely to have high relative influence on the response, but provides no information as to which ones. It will therefore work best to the extent that this tends to be the case. On the other hand, RR, PCR, and PLS are controlled by a prior belief that many variables together collectively affect the response with no small subset of them standing out.

Expressions (52) and (54) reveal that VSS and RR can be viewed as two points (γ = 0 and γ = 2, respectively) on a continuum of possible regression-modeling procedures (indexed by γ). Choosing either procedure corresponds to selecting from one of these two points. For a given situation (data set), there is no a priori reason to suspect that the best value of γ might be restricted to only these two choices. It is possible that an optimal value for γ may be located at another point in the continuum (0 < γ ≤ ∞). An alternative might be to use a model-selection criterion (say CV) to jointly estimate optimal values of λ and γ to be used in the regression, thereby greatly expanding the class of modeling procedures. It is an
open question as to whether such an approach will actually lead to improved performance; this is the subject of our current research (with Leo Breiman). Note that this approach is different from those that use Bayesian methods to directly compute model-selection criteria for different variable subsets (e.g., see Lindley 1968; Mitchell and Beauchamp 1988).

5. A COMPARATIVE MONTE CARLO STUDY OF OLS, RR, PCR, PLS, AND VSS

This section presents a summary of results from a set of Monte Carlo experiments comparing the relative performance of OLS, RR, PCR, PLS, and VSS that were described in more detail by Frank (1989). The five methods were compared for 36 different situations. In all situations, the training-sample size

Average PSE (55) in each of the 36 situations are the axes for this space. There are six points in the space, each defined by the 36 simultaneous values of average PSE for OLS, RR, PCR, PLS, VSS, and the true (known) coefficient vector, y_true = α^T x (26). The quantities plotted in Figures 6-10 are the Euclidean distances (bar height) of each of the first five points (OLS, RR, PCR, PLS, and VSS) from the sixth point, which represents the performance using the "true" underlying coefficient vector as the regression model in each situation. Thus smaller values indicate better performance.

Figure 6 shows these distances in the full 36-dimensional space, which characterizes average performance over all 36 situations. Figures 7-10 show the distances in various subspaces characterized by slicing (conditioning) on specific values of some of
Figure 7. Performance Comparisons Conditioned on the p = 5, 40, and 100 Variable Situations.

Figure 9. Performance Comparisons Conditioned on the Structure of the True-Coefficients Vector: Equal and Unequal Coefficients.
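The bar heights in Figures 6-10 are Euclidean distances in the space of average-PSE values. The following sketch shows the computation only; the numbers are random stand-ins, not the study's results:

```python
import numpy as np

# Hypothetical average PSE over 36 situations for each method and for
# the true-coefficient model; random stand-ins, NOT the study's output.
rng = np.random.default_rng(0)
methods = ("OLS", "RR", "PCR", "PLS", "VSS")
avg_pse = {m: rng.uniform(0.2, 0.8, size=36) for m in methods}
true_pse = rng.uniform(0.1, 0.3, size=36)

# Bar height = Euclidean distance from the true-model performance point;
# smaller is better, exactly as in the figures.
heights = {m: float(np.linalg.norm(avg_pse[m] - true_pse)) for m in methods}
for m in methods:
    print(f"{m}: {heights[m]:.3f}")
```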
might have expected VSS to provide dramatically improved performance in the situations corresponding to (highly) unequal (true) coefficient values for the respective variables. For the situations studied here, {α_j = j^2}_1^p, this did not turn out to be the case. All of the other biased methods dominated VSS for this case. Moreover, the performance of RR, PCR, and PLS did not seem to degrade for the unequal coefficient case. Since (stepwise) VSS must surely dominate the other methods if few enough variables only contribute to the response dependence, it would appear that the structure provided by {α_j = j^2}_1^p is not sharp enough to cause this phenomenon to set in.

Figure 10 contains few surprises. (Remember that bar height is proportional to distance from the performance of the true model, which itself degrades with decreasing signal-to-noise ratio.) Higher signal-to-noise ratio seems to help OLS and VSS more than the other biased methods. This may be because their performance degrades less than OLS and VSS as the noise increases.

For the situations covered by this simulation study, one can conclude that all of the biased methods (RR, PCR, PLS, and VSS) provide substantial improvement over OLS. In the well-determined case, the improvement was not significant. In all situations, RR dominated all of the other methods studied. PLS usually did almost as well as RR and usually outperformed PCR, but not by very much. Surprisingly, VSS provided distinctly inferior performance to the other biased methods except in the well-conditioned case in which all methods gave nearly the same performance. Although not discussed here, the performance ranking of these five methods was the same in terms of accuracy of estimation of the individual regression coefficients (see Frank 1989) as for the model prediction error shown here. Not surprisingly, the prediction error improves with increasing observation-to-variable ratio, increasing collinearity, and

Figure 8. Performance Comparisons Conditioned on Low and High Collinearity Situations.

Figure 10. Performance Comparisons Conditioned on High, Medium, and Low Signal-to-Noise Ratio.
increasing signal-to-noise ratio. A bit surprising is the fact that performance seemed to be indifferent to the structure of the true coefficient values.

The results of this simulation study are in accord with the qualitative results derived from the discussion in Section 3.2.1, namely, that RR, PCR, and PLS have similar properties and give similar performance. (Although not shown here, the actual solutions given by the three methods on the same data are usually quite similar.) One can speculate on the reasons why the performance ranking RR > PLS > PCR came out as it did. PCR might be troubled by its use of a sharp threshold in defining its shrinkage factors (42), whereas RR and PLS more smoothly shrink along the respective eigendirections [(36) and Figs. 1-4]. This may be (somewhat) mitigated by linearly interpolating the PCR solution between adjacent components to produce a more continuous shrinkage (Marquardt 1970). PLS may give up some performance edge to RR because it is not strictly shrinking (some f̂ > 1), which likely degrades its performance at least by a little bit.

The performance differential between RR, PCR, and PLS is seen here not to be great. One would not sacrifice much average accuracy over a lifetime of using one of them to the exclusion of the other two. Still one may see no reason to sacrifice any, in which case this study would indicate RR as the method of choice. The discussion in Section 3.2.1 and the simulation results presented here suggest that claims as to the distinct superiority of any one of these three techniques would require substantial verification.

The situation is different with regard to OLS and VSS. Although these are the oldest and most widely used techniques in the statistical community, the results presented here suggest that there might be much to be gained by considering one of the more modern methods (RR, PCR, or PLS) as well.

6. MULTIVARIATE REGRESSION

We now consider the general case in which more than one variable is regarded as a response (q > 1) [(1)-(7)] and a predictive relationship is to be modeled between each one {y_i}_1^q and the complement set of variables, designated as predictors. The OLS solution to this (multivariate) problem is a separate (q = 1) uniresponse OLS regression of each y_i on the predictor variables x, without regard to their commonality. The various biased regression methods (RR, PCR, PLS, VSS) could be applied to this problem by simply replacing each such uniresponse OLS regression with a corresponding biased (q = 1) regression, in accordance with this strategy. The discussion of the previous sections indicates that this would result in substantial performance gains in many situations.

Table 4. Wold's Two-Block PLS Algorithm

(1)  Initialize: y_0 ← y; x_0 ← x; ŷ_0 ← 0
(2)  For K = 1 to p do:
(3)    u^T ← (1, 0, ..., 0)
(4)    Loop (until convergence)
(5)      w_K = ave[(u^T y_{K-1}) x_{K-1}]
(6)      u = ave[(w_K^T x_{K-1}) y_{K-1}]
(7)    end Loop
(8)    z_K = w_K^T x_{K-1}
(9)    r_K = [ave(y_{K-1} z_K)/ave(z_K^2)] z_K
(10)   ŷ_K = ŷ_{K-1} + r_K
(11)   y_K = y_{K-1} − r_K
(12)   x_K = x_{K-1} − [ave(z_K x_{K-1})/ave(z_K^2)] z_K
(13)   if ave(x_K^T x_K) = 0 then Exit
(14) end For

This approach is not the one advocated for PLS (H. Wold 1984). With PLS, the response variables y = {y_i}_1^q and the predictors x = {x_j}_1^p are separately collected together into groups ("blocks") which are then treated in a common manner more or less symmetrically. Table 4 shows Wold's two-block algorithm that defines multiple-response PLS regression. If one were to develop a direct extension of Wold's (q = 1) PLS algorithm (Table 1) according to the strategy used by OLS (q separate uniresponse regressions), line 3 of Table 1 would be replaced by the calculation of a separate covariance vector w_{Ki} for each separate response residual y_{K-1,i} on each separate x residual x_{K-1,i}, w_{Ki} = ave(y_{K-1,i} x_{K-1,i}) (i = 1, q). These would then be used to update q separate models ŷ_{K,i} (line 6), as well as q separate new y residuals y_{Ki} (line 7) and x residuals x_{Ki} (line 8).

Examination of Table 4 reveals a different strategy. A single covariance vector w_K is computed for all responses by the inner loop (lines 3-7), which is then used to update all of the models ŷ_K (line 10) and the response residuals to obtain y_K (line 11). A single set of x residuals x_K is maintained by this algorithm using the single covariance vector w_K (line 12), as in the uniresponse PLS algorithm (Table 1, line 8). The inner loop (lines 4-7) is an iterative algorithm for finding linear combinations of the response residuals u^T y_{K-1} and the predictor residuals w_K^T x_{K-1} that have maximal joint covariance. This algorithm starts with an arbitrary coefficient vector u (line 3). After convergence of the inner loop, the resulting x-residual linear-combination covariance vector w_K is then used for all updates.

This two-block multiple-response PLS algorithm produces R models [R = rank of V (8)] for each response {ŷ_Kj} (K = 1, R; j = 1, q), spanning a full spectrum of solutions from the sample means {ŷ_j = 0}_1^q for K = 0 to the OLS solutions for K = R. The number of
components K is considered a meta parameter of the procedure to be selected through CV,

    K̂ = argmin_{0≤K≤R} Σ_{l=1}^N Σ_{j=1}^q [y_{jl} − ŷ_{Kj\l}(x_l)]^2,  (56)

where y_{jl} is the value of the jth response for the lth training observation and ŷ_{Kj\l} is the K-component model for the jth response computed with the lth observation deleted from the training sample. Note that the same number of components K is used for each of the response models.

As with the uniresponse PLS algorithm (Table 1), this two-block algorithm (Table 4) defining multiresponse PLS does not reveal a great deal of insight as to its goal. One can gain more insight by following the prescription outlined in the beginning of Section

and var(α^T x)] that serve as penalties to bias the solutions away from low-spread directions in both the x and y spaces. The penalty imposed on the predictor-variable linear-combination coefficient vector c is the same as that used for single-response PLS (24). The discussion in Section 3.1 indicates that this mainly serves to control the variance of the estimated model. The introduction of the y-space penalty factor, along with optimizing with respect to its associated linear-combination coefficient vector u, serves to place an additional penalty on the x-linear-combination coefficient vectors {c_k}_1^K that define the sequence of PLS models {ŷ_Kj}; they are not only biased away from low (data) spread directions in the predictor-variable space but also toward x directions that preferentially predict the high-spread directions in the response-variable space.
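A direct numpy transcription of the Table 4 iteration may make the updates concrete. This is an illustrative sketch, not the authors' code: the unit normalization of u inside the inner loop is an addition for numerical stability (only the direction of u matters to the updates), and the data are assumed column-centered.

```python
import numpy as np

def two_block_pls(X, Y, K, tol=1e-10):
    """Sketch of the two-block PLS iteration of Table 4.
    X is N x p, Y is N x q, both column-centered; returns the fitted
    responses after K components."""
    N = X.shape[0]
    Xk, Yk = X.astype(float).copy(), Y.astype(float).copy()
    Yhat = np.zeros_like(Yk)                          # (1) y-hat_0 = 0
    for _ in range(K):                                # (2) K components
        u = np.zeros(Y.shape[1]); u[0] = 1.0          # (3) u = (1, 0, ..., 0)
        for _ in range(500):                          # (4) inner loop
            w = Xk.T @ (Yk @ u) / N                   # (5) w = ave[(u'y) x]
            u_new = Yk.T @ (Xk @ w) / N               # (6) u = ave[(w'x) y]
            u_new /= np.linalg.norm(u_new) or 1.0     # stability only
            if np.linalg.norm(u_new - u) < tol:       # (7) until converged
                u = u_new
                break
            u = u_new
        z = Xk @ w                                    # (8) score z = w'x
        r = np.outer(z, Yk.T @ z / (z @ z))           # (9) fitted increments
        Yhat += r                                     # (10) accumulate model
        Yk -= r                                       # (11) y residuals
        Xk -= np.outer(z, Xk.T @ z / (z @ z))         # (12) x residuals
        if np.allclose(Xk, 0.0):                      # (13) rank exhausted
            break
    return Yhat
```

With noiseless responses that are exactly linear in x, running the loop out to K = p reproduces the exact (OLS) fit, consistent with the spectrum of solutions ending at OLS for K = R.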
This corresponds to (with some abuse of notation)

    Γ_{ijkl} = Γ_{ij} δ_{kl},  (62)

with δ_{kl} = 1 if k = l and δ_{kl} = 0 otherwise. The corresponding resulting prior [(60) and (62)] provides information [through Γ_{ij} (62)] on the degree of similarity of the dependence of y_i and y_j on the predictors x but no information as to the nature of that x dependence. A relatively large positive value for Γ_{ij} suggests that y_i and y_j have highly similar dependencies on x, whereas a large negative value indicates highly opposite dependencies. A relatively small value indicates dissimilar dependencies of y_i and y_j on the predictors. To further idealize the situation, suppose that

    y_i = α_i^T x + ε_i,  i = 1, q,  (63)

with the errors ε = {ε_i}_1^q having a joint Gaussian distribution

    ε ~ N(0, Σ),  (64)

and, in addition, the error covariance is a multiple of the identity matrix

    Σ = σ^2 I.  (65)

If Σ were known, one could rotate and scale the y-space coordinates so that (65) is obtained in the transformed coordinate system. Otherwise (65) remains a simplifying assumption. Under these assumptions [(60)-(65)], the following generalization of RR to multiple responses is optimal (smallest MSE):

    Â(RR) = argmin_A { ave[(y − Ax)^T (y − Ax)] + (σ^2/N) Σ_{i,j=1}^q (Γ^{-1})_{ij} A_i^T A_j }.  (66)

[(60) and (62)] distributions, then averaged over the predictor-training sample. The quantity tr(V) is the trace of the predictor-sample covariance matrix V (8). If the data are standardized [(3)-(4)], then

    tr(V) = p.  (69)

Let W be the (q × q) sample covariance matrix of the response variables,

    W_{ij} = ave(y_i y_j).  (70)

Then from (68) an "estimate" for the elements of the matrix Γ would be

    Γ̂ = (W − σ̂^2 I)/p,  (71)

which could then be used in conjunction with criterion (66) to obtain the resulting estimate Â(RR) (given σ^2). The common error variance σ^2 remains unknown and can be regarded as a meta parameter of the procedure to be estimated (from the training sample) through CV:

    σ̂^2 = argmin_{σ^2} Σ_{k=1}^N ||y_k − Â_{\k}(RR|σ^2) x_k||^2,  (72)

where Â_{\k}(RR|σ^2) is the coefficient matrix Â(RR) estimated from (66) and (71) with the kth observation deleted from the training sample.

Insight into the nature of solutions provided by (66) and (71) can be enhanced by rotating in the x and y spaces to their respective principal-component representations using orthonormal rotation matrices U_x and U_y such that

    V = U_x^T E^2 U_x,
    W = U_y^T H^2 U_y.  (73)
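The estimate (70)-(71) is simple to compute. A sketch, assuming column-centered responses and a given σ^2 (the function name is ours, for illustration only):

```python
import numpy as np

# Sketch of (70)-(71): W is the sample response covariance and
# Gamma-hat = (W - sigma^2 I)/p; responses assumed column-centered.
def gamma_hat(Y, sigma2, p):
    N, q = Y.shape
    W = (Y.T @ Y) / N                   # (70): W_ij = ave(y_i y_j)
    return (W - sigma2 * np.eye(q)) / p

Y = np.array([[1.0, 1.0], [-1.0, -1.0]])   # toy centered responses
print(gamma_hat(Y, sigma2=0.5, p=2))
```

A large off-diagonal entry of Γ̂ relative to the diagonal signals that the corresponding pair of responses shares a similar dependence on x, which is what criterion (66) then exploits.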
This RR solution for multiple responses [(75)-(76)] bears considerable resemblance to that for single-response regression (38) in that each coefficient estimate is obtained by (differentially) shrinking the corresponding (unbiased) OLS estimates. Here (for a given value σ^2) the relative shrinkage is controlled both by e_j^2 (corresponding x-direction sample spread) and h_i^2 (corresponding y-direction sample spread) in a more or less symmetric way through their product (76). A smaller value for either results in more shrinkage. The overall result is to bias the coefficient vector estimates (7) simultaneously away from low sample-spread directions in both spaces. The overall degree of this bias is controlled by the value of σ^2 [the variance of the noise (65)]. The larger its value, the more bias is introduced.

The solution [(75)-(76)] can be recast as

that the degree of similarity of the dependence of a pair of responses (y_i, y_j) on the predictors is reflected in their correlation. A large positive (or negative) correlation between y_i and y_j means that the corresponding (true) coefficient vectors α_i and α_j should be closely related; that is, α_i ≈ α_j (or α_i ≈ −α_j). Small correlations imply no special relationship. This information is incorporated into the regression procedure by using the empirical response correlational structure to estimate the transformation to linear combinations of the responses that are uncorrelated (no relationship between any of the coefficient vectors), in which separate independent regressions are then performed.

These results suggest that, unless the original response variables happen to be uncorrelated, there is profit to be gained in considering them together rather
then be performed in the transformed system and the inverse transform applied to the resulting solutions. Such a transformation can be derived by decomposing Σ into the product

    Σ = R^T R  (81)

and taking z = Ry as the new responses.

The case of Σ (64) unknown can be directly treated in the context of OLS (Box and Draper 1965). Here the residual covariance matrix is used as an estimate of Σ,

    Σ̂(A) = ave[(y − Ax)(y − Ax)^T].  (82)

Since this estimate depends on the estimated coefficient matrix A (which in turn depends on Σ̂), an iterative algorithm is required. Using (82), the multiresponse (negative) log-likelihood [assuming Gaussian errors (64)] can be shown (see Bates and Watts 1988, p. 138) to reduce to −L(A) = log det[Σ̂(A)]. This is minimized with respect to the coefficient matrix A, using (iterative) numerical optimization techniques, to obtain the estimate. It is an open question as to whether an analog of this approach can be developed for biased regression procedures such as RR, PCR, or PLS.

6.3 Monte Carlo Study

We end this section by presenting results of a small Monte Carlo study comparing multivariate RR [(66) and (71)] with two-block PLS (Table 4) in several situations. We also compare both multivariate methods to that of applying separate univariate (q = 1) regressions on each (original) response separately. The situations are characterized by the respective eigenstructures of the (population) predictor- and response-variable covariance matrices [(8) and (70)], signal-to-noise ratio, and alignment of the true coefficient vectors {α_i} (63) with the eigenstructure of the (population) predictor covariance matrix.

For the first study, there are p = 64 predictor variables, q = 4 response variables, and N = 40 training observations. The study consisted of 100 replications of the following procedure. First, N = 40 training observations were generated with the p = 64 predictors having a joint (population) Gaussian distribution with the specified covariance matrix. The corresponding q = 4 response variables were obtained from (63) with the {ε_i}_1^q generated from a Gaussian distribution with the (same) specified variance σ^2. The true coefficient vectors {α_i}_1^q (63) were each independently generated from π(α) [(45)-(47)] under the constraint that the (population) response covariance matrix be the one specified. Several values of the prior parameter δ were used. After each of the models were obtained {using CV [(56) and (72)]}, 1,000 new observations were generated according to the same prescription and the average squared prediction error evaluated with them.

Table 5. Mean Squared Prediction Error of Multivariate RR (upper entry) and Two-Block PLS (lower entry) for Several Signal-to-Noise Ratios S/N (rows) and Different Prior Parameter Values δ (columns) for a Highly Collinear Situation

                 δ
    S/N      0      1      10
    10      .22    .15    .14
            .24    .14    .12
    5       .35    .28    .26
            .38    .27    .24
    1       .68    .61    .60
            .72    .63    .59

Table 5 compares (in terms of MSE) multivariate RR [(66) and (71)] (upper entry) with two-block PLS (lower entry) for (population) predictor covariance matrix eigenvalues {e_i^2 = 1/i^2}_1^p and response covariance matrix eigenvalues {h_i^2 = 1/i^2}_1^q (74). The rows correspond to different signal-to-noise ratios and the columns to different prior parameters δ, reflecting differing alignment of the true coefficient vectors {α_i} (63) with the predictor (population) distribution eigendirections. One sees that for δ = 0 (equidirection prior) RR does a bit better than PLS. For δ = 1 (moderate alignment) performance is nearly identical, whereas for δ = 10 (very heavy alignment) PLS has a slight advantage. These results hold for all signal-to-noise ratios.

Table 6. Mean Squared Prediction Error of Multivariate RR (upper entry) and Two-Block PLS (lower entry) for Several Signal-to-Noise Ratios S/N (rows) and Different Prior Parameter Values δ (columns) for Moderate Collinearity

                 δ
    S/N      0      1      10
    10      .44    .27    .18
            .47    .26    .15
    5       .57    .41    .32
            .62    .39    .27
    1       .84    .73    .67
            .92    .74    .62

Table 6 presents a similar set of results for the same situation except with less collinearity in both spaces: {e_i^2 = 1/i}_1^p and {h_i^2 = 1/i}_1^q. Here overall performance is worse for both methods, but their respective relative performance is similar to that reflected in Table 5. These results lend further support to the conclusion that PLS assumes a prior distri-
bution on the true coefficient vectors {α_i} (63) that preferentially aligns them with the larger eigendirections of the predictor covariance matrix (δ > 0).

Table 7 compares the multivariate RR [(66) and (71)] and two-block PLS (Table 4) procedures with

Table 7. Mean Squared Prediction Error of Multivariate RR and Two-Block PLS Along With That of Their Corresponding (separate) Uniresponse Procedures for Several Signal-to-Noise Ratios

    S/N    Multi-ridge    Uni-ridge    Two-block PLS    Uni-PLS
    10        .23            .25            .25            .27
    5         .36            .39            .39            .44
    1         .68            .74            .73            .79

NOTE: S/N (rows), and prior parameter δ = 0.

of the biased regression procedures discussed here (RR, PCR, PLS, or VSS) enjoy this affine equivariance property. Applying such transformations on the variables can change the analysis and its result. RR, PCR, and PLS are equivariant under (rigid) rotations of the coordinates. This property allowed us to study them in the sample principal-component representations in which the (transformed) covariance matrices were diagonal. They are not, however, equivariant to transformations that change the scales of the coordinates. VSS is equivariant under scaling of the variables but not under rotations. All of these procedures are equivariant under translation of (the origin of) the coordinate systems.

In Section 3 we saw that the basic regularization provided by RR, PCR, and PLS was to shrink their solutions away from directions of small spread in the
philosophical or emotional argument. As is well known, the goal of a regression analysis is often not solely prediction but also description; one uses the computed regression equation(s) as a descriptive statistic to attempt to interpret the predictive relationships derived from the data. The loss structure for this enterprise is difficult to specify and depends on the experience and skill of the user in relation to the method used.

It is common to interpret the solution coefficients on the (standardized) original variables as a measure of strength of the predictive relationship between the response(s) and the respective predictors. In this case accuracy of estimation of these coefficients is a relevant goal. As noted in Section 5, the relative ranking of the methods studied there on coefficient accuracy was the same as that for prediction (see Frank 1989). Interpretation is also often aided by the simplicity or parsimony of the representation of the result. This concept is somewhat subjective, depending on the user's experience. In statistics, parsimony is often taken to refer to the number of (original) predictor variables that "enter" the regression equation, that is, the number with nonzero coefficients. The smaller this number, the more parsimonious and interpretable is the result. This leads to VSS as the method of choice, since it attempts to reduce mean squared (prediction) error by constraining coefficients to be 0. Moreover, it is often the original variables (as opposed to their linear combinations) that are most easily related to the system under study that produced the data.

It is well known that, in the presence of extreme collinearity, interpretation of individual regression coefficients as relating to the strength of the respective partial predictive relationships is dangerous. In chemometrics applications, the number of predictor variables often (greatly) exceeds the number of observations. Thus there are many exact (as well as possibly many approximate) collinearities among the predictors. This has led chemometricians to attempt to interpret the solution in terms of various linear combinations of the predictors rather than the individual predictor variables themselves. (This approach is somewhat similar to the use of factor-analytic methods in the social sciences.) The linear combinations associated with the principal component directions are a natural set to consider for this purpose, since they represent a set of uncorrelated "variables" that are mutually orthogonal (with respect to the standardized predictors) and satisfy a simple optimality criterion (22). Moreover, principal components analysis has long been in use and is a well-studied method for describing and condensing multivariate data.

The PLS procedure also produces a set of uncorrelated (but not orthogonal) linear combinations. It is often (subjectively) argued that these are a more "natural" set in which to interpret regression solutions because the criterion [(24) and (58)] by which they are defined involves the data response as well as the predictor values. Linear combinations with low response correlation will tend to appear later in the PLS sequence unless their (data) variance is very large. One consequence of this is that a solution regression coefficient vector â can generally be approximated to the same degree of accuracy by its projection on the space spanned by fewer PLS components than principal components. As noted in Section 3.2.1, however, this parsimony argument is not compelling, since any vector â can be completely represented in a subspace of dimension 1, namely, that defined by a unit vector proportional to it.

The choice of a set of coordinates in which to interpret a regression solution is largely independent of the method by which the solution was obtained. One is not required to use a solution obtained through PCR or PLS to interpret it in terms of their respective components. One could interpret a regression equation(s) obtained by OLS, VSS, RR, PCR, or PLS in terms of the original predictor variables, the principal components, or the PLS linear combinations (or all three). Prediction and interpretation are separate issues, the former being amenable to (more or less) objective analysis but the latter always depending on subjective criteria associated with a particular analyst.

ACKNOWLEDGMENTS

This article was prepared in part while one of us (JHF) was visiting the Statistics Group at AT&T Bell Laboratories, Murray Hill, New Jersey. We acknowledge their generous support and especially thank Trevor Hastie and Colin Mallows for valuable discussions.

APPENDIX: PROOF OF (20) AND (21)

For convenience, center the data so that E(y) = E(x) = 0. The RR solution â_RR is given by (13). Let aᵀa = f² so that a = fc, with cᵀc = 1. Then, given c, the solution to (13) for f, f(c), is

    f(c) = argmin_f [ave(y − f cᵀx)² + λf²]
         = ave(y cᵀx) / [ave(cᵀx)² + λ],    (A.1)

and the ridge solution is (21) with

    c_RR = argmin_{cᵀc=1} {ave[y − f(c) cᵀx]² + λ f²(c)}.    (A.2)

Substituting (A.1) for f(c) in (A.2) and simplifying
gives

    c_RR = argmin_{cᵀc=1} {ave(y²) − ave²(y cᵀx) / [ave(cᵀx)² + λ]}

or, equivalently,

    c_RR = argmax_{cᵀc=1} ave²(y cᵀx) / {ave(y²)[ave(cᵀx)² + λ]}
         = argmax_{cᵀc=1} [ave²(y cᵀx) / (ave(y²) ave(cᵀx)²)] · [ave(cᵀx)² / (ave(cᵀx)² + λ)].

If the data are uncentered, then mean values would have to be subtracted from all quantities, giving (20).

[Received December 1991. Revised September 1992.]

REFERENCES

"for the Linear Model" (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 1-40.
Lorber, A., Wangen, L. E., and Kowalski, B. R. (1987), "A Theoretical Foundation for the PLS Algorithm," Journal of Chemometrics, 1, 19-31.
Mallows, C. L. (1973), "Some Comments on Cp," Technometrics, 15, 661-667.
Marquardt, D. W. (1970), "Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation," Technometrics, 12, 591-612.
Martens, H., and Naes, T. (1989), Multivariate Calibration, New York: John Wiley.
Massy, W. F. (1965), "Principal Components Regression in Exploratory Statistical Research," Journal of the American Statistical Association, 60, 234-246.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression" (with discussion), Journal of the American Statistical Association, 83, 1023-1037.
Naes, T., and Martens, H. (1985), "Comparison of Prediction Methods for Multicollinear Data," Communications in Statis-