Linear Models for Regression and Model Selection
We minimise with respect to the parameter β ∈ R^(p+1) the criterion:

  ∑_{i=1}^n (Yi − β0 − β1 Xi1 − ⋯ − βp Xip)² = ‖Y − Xβ‖²
                                              = (Y − Xβ)′(Y − Xβ)
                                              = Y′Y − 2β′X′Y + β′X′Xβ.

Differentiating the last expression, we obtain the "normal equations":

  2(X′Y − X′Xβ) = 0.

The solution is indeed a minimiser of the criterion, since the Hessian 2X′X is positive semi-definite (the criterion is convex).

We make the additional assumption that the matrix X′X is invertible, which is equivalent to the matrix X having full rank p + 1, so that there is no collinearity between the columns of X (the variables). Under this assumption, the estimator of β is given by:

  β̂ = (X′X)⁻¹ X′Y

and the predicted values of Y are:

  Ŷ = Xβ̂ = X(X′X)⁻¹X′Y = HY,

where H = X(X′X)⁻¹X′ is called the "hat matrix", since it puts a "hat" on Y. Geometrically, it is the matrix of the orthogonal projection in Rⁿ onto the subspace Vect(X) spanned by the columns of X.

Remark. — We have assumed that X′X is invertible, which means that the columns of X are linearly independent. If this is not the case, the map β ↦ Xβ is not injective, hence the model is not identifiable and β is not uniquely defined. Nevertheless, even in this case, the predicted values Ŷ are still defined as the projection of Y onto the space spanned by the columns of X, even though there is no unique β̂ such that Ŷ = Xβ̂. In practice, if X′X is not invertible (which is necessarily the case in high dimension, when the number of variables p is larger than the number of observations n, since p vectors of Rⁿ are then linearly dependent), we have to remove variables from the model, or to consider other approaches to reduce the dimension (Ridge, Lasso, PLS, ...) that we will develop in the next chapters.

We define the vector of residuals as:

  e = Y − Ŷ = Y − Xβ̂ = (I − H)Y.

This is the orthogonal projection of Y onto the subspace Vect(X)⊥ of Rⁿ. The variance σ² is estimated by

  σ̂² = ‖Y − Xβ̂‖² / (n − p − 1) = ‖e‖² / (n − p − 1).

2.3 Properties of the least square estimator

THEOREM 1. — Assuming that

  Y = Xβ + ε

with ε ∼ N_n(0, σ² I_n), β̂ is a Gaussian vector:

  β̂ ∼ N_(p+1)(β, σ²(X′X)⁻¹).

In particular, the components of β̂ are Gaussian variables:

  β̂j ∼ N(βj, σ²(X′X)⁻¹_jj).

Moreover,

  (n − (p + 1)) σ̂²/σ² ∼ χ²_(n−(p+1)),

and σ̂² is independent of β̂.

Exercise. — Prove Theorem 1.

β̂ is a linear estimator of β (it is a linear transformation of the observation Y) and it is unbiased. One can wonder whether it has some optimality property. This is indeed the case: the next theorem, the Gauss-Markov theorem, is very famous in statistics. It asserts that the least square estimator β̂ has the smallest variance among all linear unbiased estimators of β.
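As a quick numerical check (a sketch of ours, not part of the notes; the simulated data and names are invented), the identities above can be verified directly with numpy:

```python
import numpy as np

# Solve the normal equations 2(X'Y - X'X beta) = 0 on simulated data, and
# check the hat-matrix identities Y_hat = H Y and e = (I - H) Y stated above.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column: intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)        # beta_hat = (X'X)^{-1} X'Y
H = X @ np.linalg.solve(XtX, X.T)               # hat matrix H = X (X'X)^{-1} X'
Y_hat = X @ beta_hat
e = Y - Y_hat                                   # residuals e = (I - H) Y
sigma2_hat = e @ e / (n - p - 1)                # unbiased variance estimate

assert np.allclose(Y_hat, H @ Y)                # H "puts a hat" on Y
assert np.allclose(H @ H, H)                    # H is a projection: H^2 = H
assert np.allclose(X.T @ e, 0)                  # residuals orthogonal to Vect(X)
```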
THEOREM 2. — Let A and B be two matrices. We say that A ⪯ B if B − A is positive semi-definite. Let β̃ be a linear unbiased estimator of β, with variance-covariance matrix V. Then σ²(X′X)⁻¹ ⪯ V.

We can also consider the response of the model at a new point X₀ = (1, X₀₁, . . . , X₀p):

  Y₀ = β0 + β1 X₀₁ + β2 X₀₂ + ⋯ + βp X₀p + ε₀.

We consider Model (1), Y = Xβ + ε, where β ∈ Rᵖ, and another model, called Model (0):

  Y = X̃θ + ε,

where θ ∈ Rˡ with l < p.

DEFINITION 3. — We define

  V = {Xβ, β ∈ Rᵖ}

and

  W = {X̃θ, θ ∈ Rˡ}.

We say that Model (0) is a submodel of Model (1) if W is a linear subspace of V.

The F-statistic is defined by:

  F = (‖Xβ̂ − X̃θ̂‖² / (p − l)) / (‖Y − Xβ̂‖² / (n − p)).

An alternative way to write the F-statistic is:

  F = ((SSR₀ − SSR₁) / (p − l)) / (SSR₁ / (n − p)),

where SSR₀ and SSR₁ respectively denote the residual sums of squares under Model (0) and Model (1).

Exercise. — Prove that, under the null hypothesis H₀, the F-statistic follows a Fisher distribution with parameters (p − l, n − p).

The numerator of the F-statistic involves ‖Ŷ₀ − Ŷ₁‖², where Ŷ₀ and Ŷ₁ correspond respectively to the predicted values under the sub-model and under the full model. This quantity is small under the null hypothesis, when the sub-model is valid, and becomes larger under the alternative. Hence, the null hypothesis is rejected for large values of F; namely, for a level-α test, when

  F > f_(p−l, n−p, 1−α),

where f_(p, q, 1−α) is the (1 − α) quantile of the Fisher distribution with parameters (p, q). Statistical software provides the p-value of the test.

One can then check the following points on the residuals:

• The linear model is valid: there is no tendency in the residuals;

• Detection of possible outliers with the Cook's distance;

• Normality of the residuals (if this assumption was used to provide confidence/prediction intervals or tests).

This is rather classical for linear regression, and we focus here on the detection of possible high collinearities between the regressors, since collinearity has an impact on the variance of our estimators. Indeed, we have seen that the variance-covariance matrix of β̂ is σ²(X′X)⁻¹.
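The two equivalent forms of the F-statistic can be illustrated numerically (our own sketch on simulated data, not from the notes; `rss` is a hypothetical helper name):

```python
import numpy as np

# F = ((SSR0 - SSR1)/(p - l)) / (SSR1/(n - p)) for a sub-model vs. full model.
def rss(X, Y):
    """Residual sum of squares of the least-square fit of Y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    r = Y - X @ beta
    return r @ r

rng = np.random.default_rng(1)
n = 100
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])  # p = 5 columns
X_sub = X_full[:, :3]                                            # l = 3 columns
Y = X_sub @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)      # sub-model is true

p, l = X_full.shape[1], X_sub.shape[1]
ssr0, ssr1 = rss(X_sub, Y), rss(X_full, Y)
F = ((ssr0 - ssr1) / (p - l)) / (ssr1 / (n - p))
# Under H0 (the sub-model holds), F follows a Fisher F(p - l, n - p)
# distribution, so moderate values of F are expected on these data.
```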
When the matrix X is ill-conditioned, which means that the determinant of X′X is close to 0, some components of β̂ will have high variances. It is therefore important to detect and remedy these situations, by removing some variables from the model or by introducing some constraints on the parameters to reduce the variance of the estimators.

VIF

Most statistical softwares propose collinearity diagnostics. The most classical one is the Variance Inflation Factor (VIF):

  Vj = 1 / (1 − Rj²),

where Rj² is the determination coefficient of the regression of the variable Xʲ on the other explanatory variables; Rj also represents the cosine of the angle in Rⁿ between Xʲ and the linear subspace generated by the variables {X¹, . . . , Xʲ⁻¹, Xʲ⁺¹, . . . , Xᵖ}. The more Xʲ is "linearly" linked with the other variables, the closer Rj² is to 1; one can show that the variance of the estimator of βj is large in this case. This variance is minimal when Xʲ is orthogonal to the subspace generated by the other variables.

Condition number

We consider the covariance matrix R between the regressors. We denote λ₁ ≥ . . . ≥ λp the ordered eigenvalues of R. If the smallest eigenvalues are close to 0, the inversion of the matrix R is difficult and numerical problems arise. In this case, some components of the least square estimator β̂ will have high variances. The condition number of the matrix R is defined as the ratio

  κ = λ₁ / λp

between the largest and the smallest eigenvalues of R. If this ratio is large, then the problem is ill-conditioned. The condition number is a global indicator of collinearities, while the VIF allows one to identify which variables are problematic.

3 Determination coefficient and Model selection

3.1 R² and adjusted R²

We define respectively the total, explained and residual sums of squares by

  SST = ∑_{i=1}^n (Yi − Ȳ)² = ‖Y − Ȳ1‖²,

  SSE = ∑_{i=1}^n (Ŷi − Ȳ)² = ‖Ŷ − Ȳ1‖²,

  SSR = ∑_{i=1}^n (Ŷi − Yi)² = ‖Y − Ŷ‖² = ‖e‖².

Since, by Pythagoras' theorem,

  ‖Y − Ȳ1‖² = ‖Y − Ŷ‖² + ‖Ŷ − Ȳ1‖²,

we have the following identity:

  SST = SSR + SSE.

We define the determination coefficient R² by:

  R² = SSE / SST = 1 − SSR / SST.

Note that 0 ≤ R² ≤ 1. The model is well adjusted to the n training data if the residual sum of squares SSR is close to 0 or, equivalently, if the determination coefficient R² is close to 1. Hence, a first intuition is that a "good" model is a model for which R² is close to 1. This is in fact not true, as shown by the following pedagogical example of polynomial regression. Suppose that we have a training sample (Xi, Yi)_{1≤i≤n}, where Xi ∈ [0, 1] and Yi ∈ R, and that we adjust polynomials on these data:

  Yi = β0 + β1 Xi + β2 Xi² + ⋯ + βk Xiᵏ + εi.
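The decomposition SST = SSR + SSE and the two expressions of R² can be checked numerically (an illustrative sketch of ours, on simulated data with an intercept, which the identity requires):

```python
import numpy as np

# Verify SST = SSR + SSE and R^2 = SSE/SST = 1 - SSR/SST on a toy fit.
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta
Ybar = Y.mean()

SST = np.sum((Y - Ybar) ** 2)       # total sum of squares
SSE = np.sum((Y_hat - Ybar) ** 2)   # explained sum of squares
SSR = np.sum((Y - Y_hat) ** 2)      # residual sum of squares

assert np.isclose(SST, SSR + SSE)   # Pythagoras' theorem
R2 = SSE / SST
assert np.isclose(R2, 1 - SSR / SST)
```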
When k increases, the model becomes more and more complex, hence ‖Y − Ŷ‖² decreases and R² increases, as shown in Figures 1 and 2.

[Figures 1 and 2: fits on the same training sample of a simple linear regression and of polynomials of degree 2, 5 and 10.]

In particular, a polynomial of degree n − 1 (which has n coefficients) can pass through all the training points. Of course this model is not the best one: it has a very high variance and is a typical case of overfitting. When the degree of the polynomial increases, the bias of our estimators decreases, but the variance increases. The best model is the one that realizes the best trade-off between the bias term and the variance term.

The definition of the adjusted determination coefficient R₀² takes into account the complexity of the model, represented here by its number of coefficients (k + 1 for a polynomial of degree k), and penalizes more complex models. One can choose, between several models, the one which maximizes the adjusted R². In the previous example, we would choose a polynomial of degree 3 with this criterion.

A good model thus achieves a good compromise between a good adjustment to the data (small bias) and a small variance; an unbiased estimator is not necessarily the best one in this sense. We will prefer a biased model if this allows to reduce drastically the variance, for instance by:

• reducing the number of explanatory variables, and by the same way simplifying the model (variable selection or Lasso penalization).
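The pedagogical example can be reproduced in a few lines (a sketch of ours; the data-generating curve is invented, and the adjusted R² is computed here with the usual formula 1 − (1 − R²)(n − 1)/(n − k − 1), which the notes do not spell out):

```python
import numpy as np

# Fit polynomials of increasing degree k: R^2 always increases with k,
# while the adjusted R^2 can decrease again for overly complex models.
rng = np.random.default_rng(3)
n = 15
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)  # hypothetical data

def r2_poly(k):
    X = np.vander(x, k + 1)                     # columns x^k, ..., x, 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)
    return 1 - ssr / np.sum((y - y.mean()) ** 2)

r2 = [r2_poly(k) for k in range(1, 11)]
adj = [1 - (1 - r) * (n - 1) / (n - k - 1) for k, r in zip(range(1, 11), r2)]
assert all(b >= a - 1e-8 for a, b in zip(r2, r2[1:]))  # R^2 never decreases with k
```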
"These data come from a study that examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy."

It is a data frame with 97 rows and 9 columns.

Source: Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F., Redwine, E.A. and Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. Radical prostatectomy treated patients. Journal of Urology, 141(5), 1076–1083.

The data frame has the following components:

lcavol   log(cancer volume)
lweight  log(prostate weight)
age      age
lbph     log(benign prostatic hyperplasia amount)
svi      seminal vesicle invasion (0, 1)
lcp      log(capsular penetration)
gleason  Gleason score of the tumor (6, 7, 8, 9)
pgg45    percentage Gleason scores 4 or 5
lpsa     log(prostate specific antigen)

We denote by Y the variable to explain (lpsa), and by X¹, . . . , Xᵖ the explanatory variables (lcavol, lweight, gleason, ...). The variables are quantitative (lcavol, lweight, ...) or qualitative (gleason, svi). We consider the linear model:

  Yi = β0 + β1 Xi1 + β2 Xi2 + ⋯ + βp Xip + εi,  1 ≤ i ≤ n.

For the qualitative variables, we consider indicator functions of the different levels of the factor, and introduce some constraints for identifiability. By default in R, the smallest level of the factor (0 for svi and 6 for gleason) is taken as the reference. This is an analysis of covariance model (mixing quantitative and qualitative variables).

Exercise. — Write the matrix X of the model with the default constraints of R, and with the constraint that the sum of the coefficients associated to each modality of the qualitative variables is equal to 0 (contr.sum in R).

For the least square estimation, with the default constraints of R, we obtain the following results:

Coefficients  Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)   0.913313   0.840836    1.086    0.28043
lcavol        0.569989   0.090100    6.326    1.09e-08 ***
lweight       0.468783   0.169610    2.764    0.00699 **
age          -0.021749   0.011361   -1.914    0.05890 .
lbph          0.099685   0.058984    1.690    0.09464 .
svi1          0.745877   0.247398    3.015    0.00338 **
lcp          -0.125111   0.095591   -1.309    0.19408
gleason7      0.267601   0.219419    1.220    0.22596
gleason8      0.496798   0.769268    0.646    0.52012
gleason9     -0.056230   0.500196   -0.112    0.91076
pgg45         0.004990   0.004672    1.068    0.28847

Residual standard error: 0.7048 on 86 degrees of freedom
Multiple R-squared: 0.666, Adjusted R-squared: 0.6272

Exercise. — Prove that the Student test of H₀: βj = 0 is equivalent to the Fisher test for the same hypothesis.

4 Variable selection

As we have seen, the least square estimator is not fully satisfactory: it has low bias but generally high variance. In the previous example, several variables seem to be non-significant, and we may obtain better results by removing those variables from the model. Moreover, a model with a small number of variables is more interesting for interpretation, keeping only the variables that have the strongest effects on the variable to explain. There are several ways to do that.

Assume we want to select a subset of variables among all possible subsets taken from the input variables. Each subset defines a model, and we want to select the "best model". We have seen that maximizing the R² is not a good criterion, since this always leads to selecting the full model. It is more interesting to select the model maximizing the adjusted determination coefficient R₀². Many other penalized criteria have been introduced for variable selection,
such as Mallows' Cp criterion or the BIC criterion. In both cases, it corresponds to the minimization of the least square criterion plus a penalty term, depending on the number k of parameters in the model m that is considered:

  Crit(m) = ∑_{i=1}^n (Yi − Ŷi)² + pen(k).

Mallows' Cp criterion is

  Crit_Cp(m) = ∑_{i=1}^n (Yi − Ŷi)² + 2kσ²,

and the BIC criterion penalizes the dimension of the model more, through an additional logarithmic term:

  Crit_BIC(m) = ∑_{i=1}^n (Yi − Ŷi)² + log(n) kσ².

The aim is to select the model (among all possible subsets) that minimizes one of those criteria. On the example of the polynomial models, we obtain the results summarized in Figure 3.

[Figure 3: Mallows' Cp as a function of the degree of the polynomial, together with the fit of the selected model. Selected model: polynomial of degree 3.]

Nevertheless, the number of subsets of a set of p variables is 2ᵖ, and it is impossible (as soon as p > 30) to explore all the models to minimize the criterion. Fast algorithms have been developed to explore a subsample of the models in a clever way. These are the backward, forward and stepwise algorithms.

Backward/Forward Algorithms:

• Forward selection: We start from the constant model (only the intercept, no explanatory variable), and we sequentially add the variable that reduces the criterion the most.

• Backward selection: Same principle, but starting from the full model and removing one variable at each step in order to reduce the criterion.

• Stepwise selection: This is a mixed algorithm, adding or removing one variable at each step in order to reduce the criterion in the best way.

All those algorithms stop when the criterion can no longer be reduced. Let us see some applications of those algorithms on the prostate cancer data.

Stepwise Algorithm

We apply the stepAIC algorithm, with the option both of the software R, in order to select a subset of variables; we present here an intermediate result:

Step: AIC=-60.79
lpsa ~ lcavol + lweight + age + lbph + svi + pgg45

            Df  Sum of Sq     RSS     AIC
- pgg45      1     0.6590  45.526 -61.374
<none>                     44.867 -60.788
+ lcp        1     0.6623  44.204 -60.231
- age        1     1.2649  46.132 -60.092
- lbph       1     1.6465  46.513 -59.293
+ gleason    3     1.2918  43.575 -57.622
- lweight    1     3.5646  48.431 -55.373
- svi        1     4.2503  49.117 -54.009
- lcavol     1    25.4190  70.286 -19.248

Step: AIC=-61.37
lpsa ~ lcavol + lweight + age + lbph + svi
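The greedy schemes above can be sketched in a few lines (our own illustration on simulated data, not the course's R code; `bic` is a hypothetical helper, and σ² is assumed known here for simplicity):

```python
import numpy as np

# Forward selection: starting from the intercept-only model, greedily add the
# variable that most decreases Crit_BIC(m) = RSS + log(n) * k * sigma^2,
# and stop when no addition reduces the criterion.
rng = np.random.default_rng(4)
n, p = 80, 6
Z = rng.normal(size=(n, p))
Y = 2 * Z[:, 0] - 1.5 * Z[:, 3] + rng.normal(size=n)  # only variables 0 and 3 matter
sigma2 = 1.0  # assumed known; in practice estimated from the full model

def bic(cols):
    X = np.column_stack([np.ones(n)] + [Z[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss = np.sum((Y - X @ beta) ** 2)
    return rss + np.log(n) * X.shape[1] * sigma2

selected, best = [], bic([])
while True:
    candidates = [(bic(selected + [j]), j) for j in range(p) if j not in selected]
    if not candidates:
        break
    crit, j = min(candidates)
    if crit >= best:
        break  # no variable reduces the criterion any further
    selected.append(j)
    best = crit

# The two truly active variables are picked up:
assert {0, 3} <= set(selected)
```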
The algorithm stops when adding or removing a variable no longer allows reducing the criterion. After 4 iterations, we get the final model.

5 Ridge regression

DEFINITION 4. — The ridge estimator of β̃ in the model

  Y = X̃β̃ + ε

is defined as the minimiser β̂_ridge of the penalized least square criterion

  ‖Y − X̃β̃‖² + λ ∑_{j=1}^p β̃j².

The ridge regression is equivalent to the least square estimation under the constraint that the l2-norm of the vector β is not too large:

  β̂_ridge = argmin_{β ∈ Rᵖ, ‖β‖² ≤ t} ‖Y − X̃β‖².

The ridge regression keeps all the parameters but, by introducing constraints on the values of the βj's, it avoids too large values for the estimated parameters, which reduces the variance.

In Figure 4, we see the results obtained by the ridge method for several values of the tuning parameter λ = l on the polynomial regression example. Increasing the penalty leads to more regular solutions: the bias increases and the variance decreases. We have overfitting when the penalty is equal to 0 and under-fitting when the penalty is too large.

[Figure 4: ridge fits on the polynomial regression example for l = 10⁻⁴, 0.1, 10 and 10⁴.]

For each regularization method, the choice of the parameter λ is crucial and determinant for the model selection. We see in Figure 5 the regularization path, showing the profiles of the estimated parameters when the tuning parameter λ increases.

[Figure 5: regularization path of the ridge estimators on the prostate cancer data: coefficients of lcavol, lweight, age, lbph, svi1, lcp and pgg45 as functions of λ.]

Most softwares use cross-validation to select the tuning parameter of the penalty. The principle is the following:

– For each block I of a partition of the data into sub-samples, we compute the estimator β̂⁽⁻ᴵ⁾ on the data of the other sub-samples.

– We test the performances of this estimator on the data that have not been used to build it, that is, those of the I-th sub-sample.
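A closed form follows from the penalized criterion: β̂_ridge(λ) = (X′X + λI)⁻¹X′Y. The shrinkage effect can be seen numerically (a sketch of ours on simulated data):

```python
import numpy as np

# Ridge estimator beta(lam) = (X'X + lam I)^{-1} X'Y: increasing lam
# shrinks the coefficients towards 0 (variance decreases, bias grows).
rng = np.random.default_rng(5)
n, p = 60, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

norms = [np.linalg.norm(ridge(lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
assert all(b <= a + 1e-12 for a, b in zip(norms, norms[1:]))  # monotone shrinkage
```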
6 Lasso

The Lasso was introduced by Tibshirani (1996) (J. Royal. Statist. Soc. B, Vol. 58, No. 1, pages 267–288, see [9]). It corresponds to the minimization of a least square criterion plus an l1 penalty term (and no longer an l2 penalization as in the ridge regression). We denote ‖β‖₁ = ∑_{j=1}^p |βj|.

DEFINITION 6. — The Lasso estimator of β in the model

  Y = Xβ + ε

is defined by:

  β̂_Lasso = argmin_{β ∈ R^(p+1)} { ∑_{i=1}^n (Yi − ∑_{j=0}^p Xi⁽ʲ⁾ βj)² + λ ∑_{j=1}^p |βj| },

where λ is a nonnegative tuning parameter.

We can show that this is equivalent to the minimization problem:

  β̂_L = argmin_{β ∈ Rᵖ, ‖β‖₁ ≤ t} ‖Y − Xβ‖².

The obtained estimator corresponds to a soft thresholding of the least square estimator: the coefficients β̂j are replaced by φλ(β̂j), where

  φλ : x ↦ sign(x)(|x| − λ)₊.

Exercise. — Prove Proposition 7.

Another formulation

The Lasso is equivalent to the minimization of the criterion

  Crit(β) = ∑_{i=1}^n (Yi − β0 − β1 Xi⁽¹⁾ − β2 Xi⁽²⁾ − ⋯ − βp Xi⁽ᵖ⁾)²

under the constraint ∑_{j=1}^p |βj| ≤ t, for some t > 0. The statistical software R introduces a constraint expressed by a relative bound for ∑_{j=1}^p |βj|.
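Since the Lasso criterion is convex, a standard way to compute the estimator is cyclic coordinate descent, each coordinate update being a soft thresholding of the partial residual correlation (this is the approach used, e.g., by the R package glmnet). The sketch below is ours, not the course's code; the λ/2 threshold comes from the squared-loss scaling, under the assumption that the columns of X have norm 1:

```python
import numpy as np

def soft_threshold(x, t):
    """phi_t(x) = sign(x) (|x| - t)_+ : the soft-thresholding map."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Simulated data with norm-1 columns (assumption used by the update below).
rng = np.random.default_rng(6)
n, p = 100, 8
X = rng.normal(size=(n, p))
X /= np.linalg.norm(X, axis=0)
Y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=n)

lam = 0.05
beta = np.zeros(p)
for _ in range(200):                 # cyclic sweeps until (practical) convergence
    for j in range(p):
        # partial residual correlation R_j = X_j'(Y - sum_{k != j} X_k beta_k)
        r_j = X[:, j] @ (Y - X @ beta) + beta[j]
        beta[j] = soft_threshold(r_j, lam / 2)

# The two active variables are recovered, up to a small shrinkage bias:
assert abs(beta[0] - 3) < 0.5 and abs(beta[1] + 2) < 0.5
```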
[Figure 7: Regularization paths of the LASSO when the penalty decreases (coefficients of lcavol, lweight, age, lbph, svi1, lcp, gleason7, gleason9 and pgg45).]

When F is non-differentiable, we introduce the subdifferential ∂F of F, defined by:

  ω ∈ ∂F(x) ⟺ F(y) ≥ F(x) + ⟨ω, y − x⟩ for all y.

The subdifferential is monotone:

  ⟨ωx − ωy, x − y⟩ ≥ 0  ∀ωx ∈ ∂F(x), ∀ωy ∈ ∂F(y).

Indeed,

  F(y) ≥ F(x) + ⟨ωx, y − x⟩,
  F(x) ≥ F(y) + ⟨ωy, x − y⟩,

and summing these two inequalities gives the result.

LEMMA 13. — Let h : β ↦ β*Aβ, where A is a symmetric matrix. Then ∇h(β) = 2Aβ. Let g : β ↦ β*z = z*β = ⟨z, β⟩, where z ∈ Rᵖ. Then ∇g(β) = z.

In the orthogonal setting, the Lasso estimator corresponds to a soft thresholding of the ordinary least square estimator β̂j = Xj*Y.

Non-orthogonal setting

• If λ ≥ 2 sup_j |Xj*Y|, then β̂λ = 0.

• If λ < 2 sup_j |Xj*Y|, then, denoting X_m̂λ the submatrix obtained from X by keeping only the columns belonging to m̂λ, we have the following equation:

  X*_m̂λ X_m̂λ (β̂λ)_m̂λ = X*_m̂λ Y − (λ/2) sign((β̂λ)_m̂λ).

Computing the Lasso estimator

The criterion β ↦ L(β) = ‖Y − Xβ‖² + λ‖β‖₁ is convex. Hence a simple and efficient approach to minimize this function is to alternate minimization over each coordinate of β. If we assume that the columns of X have norm 1, then

  ∂L/∂βj (β) = −2Xj*(Y − Xβ) + λ βj/|βj|,  ∀βj ≠ 0.

• Iterate until convergence:

  ∀j = 1, . . . , p,  βj = Rj (1 − λ/(2|Rj|))₊,

where Rj = Xj*(Y − ∑_{k≠j} Xk βk).

This algorithm converges to the Lasso estimator thanks to the convexity of L. It is implemented in the R package glmnet. Due to its parsimonious solutions, the Lasso is widely used to select variables in high-dimensional settings (when p > n). Verzelen (2012) [10] has shown important results on the limitations of the Lasso in ultra-high dimension.

7 Elastic Net

Elastic Net is a method that combines Ridge and Lasso regression, by introducing simultaneously the l1 and l2 penalties. The criterion to minimize is

  ∑_{i=1}^n (Yi − β0 − β1 Xi⁽¹⁾ − β2 Xi⁽²⁾ − ⋯ − βp Xi⁽ᵖ⁾)² + λ ( α ∑_{j=1}^p |βj| + (1 − α) ∑_{j=1}^p βj² ).
• For α = 1, we recover the Lasso.

• For α = 0, we recover the Ridge regression.

In this case, we have two tuning parameters to calibrate by cross-validation.

8 Principal Components Regression and Partial Least Square regression

8.1 Principal Component Regression (PCR)

We denote by Z⁽¹⁾, . . . , Z⁽ᵖ⁾ the principal components associated to the variables X⁽¹⁾, . . . , X⁽ᵖ⁾:

• Z⁽¹⁾ is the linear combination of X⁽¹⁾, . . . , X⁽ᵖ⁾ of the form ∑_{j=1}^p αj X⁽ʲ⁾, with ∑ αj² = 1, that has maximal variance.

• Z⁽ᵐ⁾ is the linear combination of X⁽¹⁾, . . . , X⁽ᵖ⁾ of the form ∑_{j=1}^p α_{j,m} X⁽ʲ⁾, with ∑ α_{j,m}² = 1, that has maximal variance and is orthogonal to Z⁽¹⁾, . . . , Z⁽ᵐ⁻¹⁾.

The Principal Component Regression (PCR) consists in considering a predictor of the form:

  Ŷ_PCR = ∑_{m=1}^M θ̂m Z⁽ᵐ⁾

with

  θ̂m = ⟨Z⁽ᵐ⁾, Y⟩ / ‖Z⁽ᵐ⁾‖².

Comments:

• If M = p, we keep all the variables and we recover the ordinary least square (OLS) estimator.

• If one can obtain a good prediction with M < p, then we have reduced the number of variables, hence the dimension.

• Nevertheless, interpretation is not always easy: even if the variables are interpretable, the principal components (which correspond to linear combinations of the variables) are generally difficult to interpret.

• This method is quite similar to the Ridge regression, which shrinks the coefficients of the principal components. Here, we set to 0 the coefficients of the principal components of order greater than M.

• The first principal components are not necessarily well correlated with the variable to explain Y; this is the reason why the PLS regression has been introduced.

8.2 Partial Least Square (PLS) regression

The principle of this method is to make a regression on linear combinations of the variables X⁽ʲ⁾ that are highly correlated with Y.

• We assume that Y has been centered, and that the variables X⁽ʲ⁾ are also centered and normalized (with norm 1).

• The first PLS component is defined by:

  W⁽¹⁾ = ∑_{j=1}^p ⟨Y, X⁽ʲ⁾⟩ X⁽ʲ⁾.

• The prediction associated to this first component is:

  Ŷ¹ = (⟨Y, W⁽¹⁾⟩ / ‖W⁽¹⁾‖²) W⁽¹⁾.

Note that if the matrix X is orthogonal, this estimator corresponds to the ordinary least square (OLS) estimator, and in this case, the following steps of the PLS regression are useless.
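The PCR predictor of 8.1 can be sketched as follows (our own illustration on simulated data; the components are obtained here through an SVD of the centered design matrix):

```python
import numpy as np

# Principal Component Regression: regress Y on the first M principal
# components Z^(m), with theta_m = <Z^(m), Y> / ||Z^(m)||^2.
rng = np.random.default_rng(7)
n, p, M = 50, 6, 2
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                               # centered variables
Y = X @ rng.normal(size=p) + rng.normal(size=n)
Y = Y - Y.mean()                                  # centered response

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                      # principal components (orthogonal)

theta = (Z.T @ Y) / np.sum(Z ** 2, axis=0)        # theta_m for every component
Y_pcr = Z[:, :M] @ theta[:M]                      # predictor with M components

# With M = p, PCR coincides with the OLS fitted values:
beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(Z @ theta, X @ beta_ols)
```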
• In order to obtain the following directions, we orthogonalize the variables X⁽ʲ⁾ with respect to the first PLS component W⁽¹⁾: we subtract the projection in the direction given by W⁽¹⁾ and we normalize the variables thus obtained.

• We compute the second PLS component W⁽²⁾ in the same way as the first one, replacing the variables X⁽ʲ⁾ by the new variables.

• We iterate this process, orthogonalizing at each step the variables with respect to the PLS components.

The algorithm is the following:

• Ŷ⁰ = Ȳ and X⁽ʲ⁾,⁰ = X⁽ʲ⁾. For m = 1, . . . , p:

• W⁽ᵐ⁾ = ∑_{j=1}^p ⟨Y, X⁽ʲ,ᵐ⁻¹⁾⟩ X⁽ʲ,ᵐ⁻¹⁾.

• Ŷᵐ = Ŷᵐ⁻¹ + (⟨Y, W⁽ᵐ⁾⟩ / ‖W⁽ᵐ⁾‖²) W⁽ᵐ⁾.

• ∀j = 1, . . . , p:  X⁽ʲ⁾,ᵐ = (X⁽ʲ⁾,ᵐ⁻¹ − Π_{W⁽ᵐ⁾}(X⁽ʲ⁾,ᵐ⁻¹)) / ‖X⁽ʲ⁾,ᵐ⁻¹ − Π_{W⁽ᵐ⁾}(X⁽ʲ⁾,ᵐ⁻¹)‖.

• The predictor Ŷᵖ obtained at step p corresponds to the ordinary least square estimator.

Comments:

• This method is useless if the variables X⁽ʲ⁾ are orthogonal.

• When the variables X⁽ʲ⁾ are correlated, the PCR and PLS methods present the advantage of working with new variables that are orthogonal.

• The choice of the number of PCR or PLS components can be done by cross-validation.

• In general, the PLS method leads to more parsimonious representations than the PCR method.

• The PLS regression leads to a reduction of the dimension. If p is large, this is particularly interesting, but can lead to problems of interpretation.

• There exists a sparse version, sparse PLS (inspired from the Lasso method), for which we consider linear combinations of the initial variables X⁽ʲ⁾ with only a few non-zero coefficients, hence keeping only a few variables, which makes the interpretation easier.

Application of the PLS regression

Application to the prostate cancer data:

Data:
X dimension: 97 10
Y dimension: 97 1
Number of components considered: 10

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
CV           1.160    1.066    1.031   0.8344   0.7683   0.7501   0.7634   0.7672
adjCV        1.160    1.064    1.075   0.8291   0.7641   0.7451   0.7584   0.7618
       8 comps  9 comps  10 comps
CV      0.7704   0.7758    0.7730
adjCV   0.7641   0.7695    0.7671

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X       93.47    98.49    99.66    99.75    99.93    99.96    99.97    99.98
lpsa    18.20    26.98    57.02    63.09    64.38    66.15    66.46    66.59

[Figure 8: Estimation of the error by cross-validation as a function of the number of PLS components (RMSEP).]
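The PLS algorithm described above can be sketched directly (a numpy illustration of ours on simulated data, not the course's R session), including the stated fact that the predictor at step p coincides with the OLS fit:

```python
import numpy as np

# PLS on centered data with norm-1, iteratively re-normalized variables:
#   W^(m)    = sum_j <Y, X^(j,m-1)> X^(j,m-1)
#   Yhat^m   = Yhat^(m-1) + (<Y, W^(m)> / ||W^(m)||^2) W^(m)
#   X^(j,m)  = normalized residual of X^(j,m-1) after projection on W^(m).
rng = np.random.default_rng(9)
n, p = 40, 3
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
Yc = Y - Y.mean()

Xm = X.copy()
Y_hat = np.zeros(n)
for m in range(p):
    W = Xm @ (Xm.T @ Yc)                         # W^(m)
    Y_hat += (Yc @ W) / (W @ W) * W              # update the prediction
    proj = np.outer(W, W @ Xm) / (W @ W)         # projection of each X^(j) on W
    Xm = Xm - proj
    norms = np.linalg.norm(Xm, axis=0)
    Xm = Xm / np.where(norms > 1e-12, norms, 1)  # renormalize (guard zero columns)

# After p steps the PLS prediction coincides with the OLS fitted values:
beta_ols, *_ = np.linalg.lstsq(X, Yc, rcond=None)
assert np.allclose(Y_hat, X @ beta_ols, atol=1e-6)
```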
[Figures: regression coefficients of the fitted model for each variable (lcav, lwei, age, lbph, svi1, lcp, gl7, gl8, gl9, pg45), and predicted versus measured lpsa for the model with 6 PLS components.]

References

[1] P.J. Brown, T. Fearn, and M. Vannucci. Bayesian wavelet regression on curves with applications to a spectroscopic calibration problem. Journal of the American Statistical Association, 96:398–408, 2001.

… biscuits and biscuit doughs. J. Sci. Food Agric., 35:99–105, 1984.

[5] Nicole Krämer, Anne-Laure Boulesteix, and Gerhard Tutz. Penalized partial least squares with applications to b-spline transformations and functional data. Chemometrics and Intelligent Laboratory Systems, 94:60–69, 2008.

[6] C.L. Mallows. Some comments on Cp. Technometrics, 15:661–675, 1973.

… sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society B, 52:237–269, 1990.

[9] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58:267–288, 1996.