Statistical Science
2015, Vol. 30, No. 4, 533–558
DOI: 10.1214/15-STS527
© Institute of Mathematical Statistics, 2015
arXiv:1408.4026v2 [stat.ME] 10 Dec 2015

High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi

Ruben Dezeure, Peter Bühlmann, Lukas Meier and Nicolai Meinshausen

Abstract. We present a (selective) review of recent frequentist high-dimensional inference methods for constructing p-values and confidence intervals in linear and generalized linear models. We include a broad, comparative empirical study which complements the viewpoint from statistical methodology and theory. Furthermore, we introduce and illustrate the R-package hdi which easily allows the use of different methods and supports reproducibility.

Key words and phrases: Clustering, confidence interval, generalized linear model, high-dimensional statistical inference, linear model, multiple testing, p-value, R-software.

Ruben Dezeure is a Ph.D. student, Peter Bühlmann is Professor, Lukas Meier is Senior Scientist and Nicolai Meinshausen is Professor, Seminar for Statistics, ETH Zürich, CH-8092 Zürich, Switzerland (e-mail: dezeure@stat.math.ethz.ch; buhlmann@stat.math.ethz.ch; meier@stat.math.ethz.ch; meinshausen@stat.math.ethz.ch). This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2015, Vol. 30, No. 4, 533–558; the reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

Over the last 15 years, a lot of progress has been achieved in high-dimensional statistics, where the number of parameters can be much larger than the sample size, covering (nearly) optimal point estimation, efficient computation and applications in many different areas; see, for example, the books by Hastie, Tibshirani and Friedman (2009), Bühlmann and van de Geer (2011) or the review article by Fan and Lv (2010). The core task of statistical inference accounting for uncertainty, in terms of frequentist confidence intervals and hypothesis testing, is much less developed. Recently, a few methods for assigning p-values and constructing confidence intervals have been suggested (Wasserman and Roeder, 2009; Meinshausen, Meier and Bühlmann, 2009; Bühlmann, 2013; Zhang and Zhang, 2014; Lockhart et al., 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014; Meinshausen, 2015).

The current paper has three main pillars: (i) a (selective) review of the development in frequentist high-dimensional inference methods for p-values and confidence regions; (ii) the first broad, comparative empirical study among different methods, mainly for linear models: since the methods are mathematically justified under noncheckable and sometimes noncomparable assumptions, a thorough simulation study should lead to additional insights about the reliability and performance of the various procedures; (iii) the R-package hdi (high-dimensional inference), which makes it easy to use many of the different methods for inference in high-dimensional generalized linear models. In addition, we include a recent line of methodology that allows the detection of significant groups of highly correlated variables which could not be inferred as individually significant single variables (Meinshausen (2015)). The review and exposition in Bühlmann, Kalisch and Meier (2014) is vaguely related to points (i) and (iii) above, but focuses much more on an application-oriented viewpoint and covers much less statistical methodology, theory and computational detail.

Our comparative study, point (ii) mentioned above, exhibits interesting results indicating that more "stable" procedures based on Ridge-estimation or random sample splitting with subsequent aggregation are somewhat more reliable for type I error control than asymptotically power-optimal methods. Such results cannot be obtained by comparing the underlying assumptions of the different methods, since these assumptions are often too crude and far from necessary. As expected, we are unable to pinpoint a method which is (nearly) best in all considered scenarios. In view of this, we also want to offer a collection of useful methods for the community, in terms of our R-package hdi mentioned in point (iii) above.

2. INFERENCE FOR LINEAR MODELS

We consider first a high-dimensional linear model, while extensions are discussed in Section 3:

(2.1)  Y = Xβ^0 + ε,

with n × p fixed or random design matrix X, and n × 1 response and error vectors Y and ε, respectively. The errors are assumed to be independent of X (for random design) with i.i.d. entries having E[ε_i] = 0. We allow for high-dimensional settings where p ≫ n. In the further development, the active set or set of relevant variables

S_0 = {j; β_j^0 ≠ 0, j = 1, . . . , p},

as well as its cardinality s_0 = |S_0|, are important quantities. The main goals of this section are the construction of confidence intervals and p-values for individual regression parameters β_j^0 (j = 1, . . . , p) and the corresponding multiple testing adjustment. The former is a highly nonstandard problem in high-dimensional settings, while for the latter we can use standard, well-known techniques. When considering both goals simultaneously, though, one can develop more powerful multiple testing adjustments.

The Lasso (Tibshirani (1996)) is among the most popular procedures for estimating the unknown parameter β^0 in a high-dimensional linear model. It exhibits desirable or sometimes even optimal properties for point estimation, such as prediction of Xβ^0 or of a new response Y_new, estimation in terms of ‖β̂ − β^0‖_q for q = 1, 2, and variable selection or screening; see, for example, the book of Bühlmann and van de Geer (2011). For assigning uncertainties in terms of confidence intervals or hypothesis tests, however, the plain Lasso seems inappropriate. It is very difficult to characterize the distribution of the estimator in the high-dimensional setting; Knight and Fu (2000) derive asymptotic results for fixed dimension as the sample size n → ∞, and already in such simple situations the asymptotic distribution of the Lasso has point mass at zero. Because of this noncontinuity of the distribution, standard bootstrapping and subsampling schemes are delicate to apply, and uniform convergence to the limit seems hard to achieve. The latter means that the estimator is exposed to undesirable super-efficiency problems, as illustrated in Section 2.5. All the problems mentioned are expected to apply not only to the Lasso but to other sparse estimators as well.

In high-dimensional settings and for a general fixed design X, the regression parameter is not identifiable. However, when making some restrictions on the design, one can ensure that the regression vector is identifiable. The so-called compatibility condition on the design X (van de Geer (2007)) is a rather weak assumption (van de Geer and Bühlmann (2009)) which guarantees identifiability and oracle (near) optimality results for the Lasso. For the sake of completeness, the compatibility condition is described in Appendix A.1. When assuming the compatibility condition with constant φ_0^2 (φ_0^2 is close to zero for rather ill-posed designs and sufficiently larger than zero for well-posed designs), the Lasso has the following property: for Gaussian errors and if λ ≍ √(log(p)/n), we have with high probability that

(2.2)  ‖β̂ − β^0‖_1 ≤ 4 s_0 λ / φ_0^2.

Thus, if s_0 ≪ √(n/log(p)) and φ_0^2 ≥ M > 0, we have ‖β̂ − β^0‖_1 → 0 and, hence, the parameter β^0 is identifiable.

Another often used assumption, although not necessary by any means, is the so-called beta-min assumption:

(2.3)  min_{j∈S_0} |β_j^0| ≥ β_min,

for some choice of constant β_min > 0. The result in (2.2) immediately implies the screening property: if β_min > 4 s_0 λ / φ_0^2, then

(2.4)  Ŝ = {j; β̂_j ≠ 0} ⊇ S_0.

Thus, the screening property holds when assuming the compatibility and beta-min conditions. The power of the screening property is a massive dimensionality reduction (in the original variables) because |Ŝ| ≤ min(n, p); thus, if p ≫ n, the selected set Ŝ is much smaller than the full set of p variables. Unfortunately, the required conditions are overly restrictive and exact variable screening seems rather unrealistic in practical applications (Bühlmann and Mandozzi (2014)).
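To make the screening property (2.4) concrete, the following is a minimal sketch (not part of the paper's software) of an empirical check with the glmnet package, using a simulated design, an active set of size s_0 = 3 and a tuning parameter of order √(log(p)/n); the constant in front of that rate is a heuristic choice for the sketch.

    # Sketch: empirical check of Lasso screening, Shat containing S0, as in (2.4)
    library(glmnet)
    set.seed(1)
    n <- 100; p <- 500; s0 <- 3
    x <- matrix(rnorm(n * p), n, p)
    beta0 <- c(rep(2, s0), rep(0, p - s0))         # active set S0 = {1, 2, 3}
    y <- as.numeric(x %*% beta0 + rnorm(n))
    lambda <- 2 * sqrt(log(p) / n)                 # order sqrt(log(p)/n); factor 2 is heuristic
    fit <- glmnet(x, y, lambda = lambda)
    Shat <- which(as.numeric(coef(fit))[-1] != 0)  # selected variables (intercept dropped)
    all(1:s0 %in% Shat)                            # TRUE if the screening property holds here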
2.1 Different Methods

We describe here three different methods for the construction of statistical hypothesis tests or confidence intervals. Alternative procedures are presented in Sections 2.3 and 2.5.

2.1.1 Multi sample-splitting. A generic way of deriving p-values in hypothesis testing is to split the sample with indices {1, . . . , n} into two equal halves denoted by I_1 and I_2, that is, I_r ⊂ {1, . . . , n} (r = 1, 2) with |I_1| = ⌊n/2⌋, |I_2| = n − ⌊n/2⌋, I_1 ∩ I_2 = ∅ and I_1 ∪ I_2 = {1, . . . , n}. The idea is to use the first half I_1 for variable selection and the second half I_2, with the reduced set of selected variables (from I_1), for statistical inference in terms of p-values. Such a sample-splitting procedure avoids the over-optimism of using the data twice, for selection and for inference after selection (without taking the effect of selection into account).

Consider a method for variable selection based on the first half of the sample:

Ŝ(I_1) ⊂ {1, . . . , p}.

A prime example is the Lasso, which selects all the variables whose corresponding estimated regression coefficients are different from zero. We then use the second half of the sample I_2 for constructing p-values, based on the selected variables Ŝ(I_1). If the cardinality |Ŝ(I_1)| ≤ n/2 ≤ |I_2|, we can run ordinary least squares estimation using the subsample I_2 and the selected variables Ŝ(I_1); that is, we regress Y_{I_2} on X_{I_2}^{(Ŝ(I_1))}, where the sub-index denotes the sample half and the super-index the selected variables, respectively. Thereby, we implicitly assume that the matrix X_{I_2}^{(Ŝ(I_1))} has full rank |Ŝ(I_1)|. From such a procedure, we obtain p-values P_{t-test,j} for testing H_{0,j}: β_j^0 = 0, for j ∈ Ŝ(I_1), from the classical t-tests, assuming Gaussian errors or relying on asymptotic justification by the central limit theorem. To be more precise, we define (raw) p-values

P_raw,j = P_{t-test,j} based on (Y_{I_2}, X_{I_2}^{(Ŝ(I_1))}) if j ∈ Ŝ(I_1),  and  P_raw,j = 1 if j ∉ Ŝ(I_1).

An interesting feature of such a sample-splitting procedure is the adjustment for multiple testing. For example, if we wish to control the familywise error rate over all considered hypotheses H_{0,j} (j = 1, . . . , p), a naive approach would employ a Bonferroni–Holm correction over the p tests. This is not necessary: we only need to control over the |Ŝ(I_1)| tests considered in I_2. Therefore, a Bonferroni-corrected p-value for H_{0,j} is given by

P_corr,j = min(P_raw,j · |Ŝ(I_1)|, 1).

In high-dimensional scenarios, p ≫ n > ⌊n/2⌋ ≥ |Ŝ(I_1)|, where the latter inequality is an implicit assumption which holds for the Lasso (under weak assumptions), and thus the correction factor employed here is rather small. Such corrected p-values control the familywise error rate in multiple testing when assuming the screening property in (2.4) for the selector Ŝ = Ŝ(I_1) based on the first half I_1 only, exactly as stated in Fact 1 below. The reason is that the screening property ensures that the reduced model is a correct model, and hence the result is not surprising. In practice, the screening property typically does not hold exactly, but it is not a necessary condition for constructing valid p-values (Bühlmann and Mandozzi (2014)).

The idea of sample-splitting and subsequent statistical inference is implicitly contained in Wasserman and Roeder (2009). We summarize the whole procedure as follows:

Single sample-splitting for multiple testing of H_{0,j} among j = 1, . . . , p:

1. Split (partition) the sample {1, . . . , n} = I_1 ∪ I_2 with I_1 ∩ I_2 = ∅, |I_1| = ⌊n/2⌋ and |I_2| = n − ⌊n/2⌋.
2. Using I_1 only, select the variables Ŝ ⊆ {1, . . . , p}. Assume or enforce that |Ŝ| ≤ |I_1| = ⌊n/2⌋ ≤ |I_2|.
3. Denote the design matrix with the selected set of variables by X^{(Ŝ)}. Based on I_2 with data (Y_{I_2}, X_{I_2}^{(Ŝ)}), compute p-values P_raw,j for H_{0,j}, for j ∈ Ŝ, from classical least squares estimation [i.e., the t-test, which can be used since |Ŝ(I_1)| ≤ |I_2|]. For j ∉ Ŝ, assign P_raw,j = 1.
4. Correct the p-values for multiple testing: consider

   P_corr,j = min(P_raw,j · |Ŝ|, 1),

   which is an adjusted p-value for H_{0,j}, controlling the familywise error rate.
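As an illustration only (not the hdi implementation), the single-split procedure can be sketched with glmnet for the screening step and ordinary least squares t-tests on the second half; x and y are as in the earlier sketch, and the crude cap on |Ŝ| is a device so that the OLS fit with intercept is well defined.

    # Sketch of the single sample-splitting p-values (steps 1-4 above)
    library(glmnet)
    single.split.pvals <- function(x, y) {
      n <- nrow(x); p <- ncol(x)
      I1 <- sample(1:n, floor(n / 2)); I2 <- setdiff(1:n, I1)
      # step 2: Lasso screening on I1 (lambda chosen by cross-validation)
      sel <- cv.glmnet(x[I1, ], y[I1])
      Shat <- which(as.numeric(coef(sel, s = "lambda.min"))[-1] != 0)
      Shat <- Shat[seq_len(min(length(Shat), floor(n / 2) - 1))]  # crude cap, |Shat| < |I2|
      if (length(Shat) == 0) return(rep(1, p))
      # step 3: classical t-tests on I2, using the selected variables only
      # (assumes the reduced design has full rank, as in (A2) below)
      fit <- lm(y[I2] ~ x[I2, Shat, drop = FALSE])
      praw <- rep(1, p)
      praw[Shat] <- summary(fit)$coefficients[-1, 4]              # Pr(>|t|), intercept dropped
      # step 4: Bonferroni correction by the number of selected variables
      pmin(praw * length(Shat), 1)
    }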
A major problem of the single sample-splitting method is its sensitivity with respect to the choice of the split of the sample: different sample splits lead to wildly different p-values. We call this undesirable phenomenon a p-value lottery, and Figure 1 provides an illustration.

Fig. 1. Histogram of the p-values P_corr,j for a single covariable in the riboflavin data set, when doing 50 different (random) sample splits. The figure is taken from Bühlmann, Kalisch and Meier (2014).

To overcome the "p-value lottery," we can run the sample-splitting method B times, with B large. Thus, we obtain a collection of p-values for the jth hypothesis H_{0,j}:

P_corr,j^[1], . . . , P_corr,j^[B]  (j = 1, . . . , p).

The task is now to aggregate them into a single p-value. Because of the dependence among {P_corr,j^[b]; b = 1, . . . , B} (all the different half-samples are part of the same full sample), an appropriate aggregation needs to be developed. A simple solution is to use an empirical γ-quantile with 0 < γ < 1:

Q_j(γ) = min(empirical γ-quantile of {P_corr,j^[b]/γ; b = 1, . . . , B}, 1).

For example, with γ = 1/2, this amounts to taking the sample median of {P_corr,j^[b]; b = 1, . . . , B} and multiplying it by the factor 2. A somewhat more sophisticated approach is to choose the best and properly scaled γ-quantile in the range (γ_min, 1) (e.g., γ_min = 0.05), leading to the aggregated p-value

(2.5)  P_j = min((1 − log(γ_min)) · inf_{γ∈(γ_min,1)} Q_j(γ), 1)  (j = 1, . . . , p).

Thereby, the factor (1 − log(γ_min)) is the price to be paid for searching for the best γ ∈ (γ_min, 1). This Multi sample-splitting procedure has been proposed and analyzed in Meinshausen, Meier and Bühlmann (2009), and we summarize it below. Before doing so, we remark that the aggregation of dependent p-values as described above is a general principle, as described in Appendix A.1.
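A minimal sketch of this aggregation for a single hypothesis is given below (assuming a vector pvals.B of B corrected p-values from B random splits); the infimum over γ is approximated by a grid search, which is a simplification for the sketch.

    # Sketch of the quantile aggregation (2.5) for one hypothesis H0,j
    aggregate.pvals <- function(pvals.B, gamma.min = 0.05) {
      gammas <- seq(gamma.min, 1, by = 0.01)
      Qj <- sapply(gammas, function(g) min(quantile(pvals.B / g, g), 1))
      min((1 - log(gamma.min)) * min(Qj), 1)
    }
    # special case gamma = 1/2: min(2 * median(pvals.B), 1)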
Multi sample-splitting for multiple testing of H_{0,j} among j = 1, . . . , p:

1. Apply the single sample-splitting procedure B times, leading to p-values {P_corr,j^[b]; b = 1, . . . , B}. Typical choices are B = 50 or B = 100.
2. Aggregate these p-values as in (2.5), leading to P_j, which are adjusted p-values for H_{0,j} (j = 1, . . . , p), controlling the familywise error rate.

The Multi sample-splitting method enjoys the property that the resulting p-values are approximately reproducible and not subject to a "p-value lottery" anymore, and it controls the familywise error rate under the following assumptions:

(A1) The screening property as in (2.4) for the first half of the sample: P[Ŝ(I_1) ⊇ S_0] ≥ 1 − δ for some 0 < δ < 1.

(A2) The reduced design matrix for the second half of the sample satisfies rank(X_{I_2}^{(Ŝ(I_1))}) = |Ŝ(I_1)|.

Fact 1 [Meinshausen, Meier and Bühlmann (2009)]. Consider a linear model as in (2.1) with fixed design X and Gaussian errors. Assume (A1)–(A2). Then, for a significance level 0 < α < 1 and denoting by B the number of sample splits,

P[⋃_{j∈S_0^c} {P_j ≤ α}] ≤ α + Bδ,

that is, the familywise error rate (FWER) is controlled up to the additional (small) value Bδ.

A proof is given in Meinshausen, Meier and Bühlmann (2009). We note that the Multi sample-splitting method can be used in conjunction with any reasonable, sparse variable screening method fulfilling (A1) for very small δ > 0 and (A2); it does not necessarily rely on the Lasso for variable screening. See also Section 2.1.6. Assumption (A2) typically holds for the Lasso, which satisfies |Ŝ(I_1)| ≤ |I_1| = ⌊n/2⌋ ≤ |I_2| = n − ⌊n/2⌋.

The screening property (A1). The screening property (A1) with very small δ > 0 is not a necessary condition for constructing valid p-values and can be replaced by a zonal assumption requiring the following: there is a gap between large and small regression coefficients, and there are not too many small nonzero regression coefficients (Bühlmann and Mandozzi (2014)). Still, such a zonal assumption makes a requirement about the unknown β^0 and the absolute values of its components: but this is the essence of the question in hypothesis testing, namely to infer whether coefficients are sufficiently different from zero, and one would like to carry out such a test without an assumption on the true values.

The Lasso satisfies (A1) with δ → 0 when assuming the compatibility condition (A.1) on the design X, the sparsity assumption s_0 = o(√(n/log(p))) [or s_0 = o(n/log(p)) when requiring a restricted eigenvalue assumption] and a beta-min condition (2.3), as shown in (2.4). Other procedures also exhibit the screening property, such as the adaptive Lasso (Zou (2006)), analyzed in detail in van de Geer, Bühlmann and Zhou (2011), or methods with concave regularization penalty such as SCAD (Fan and Li (2001)) or MC+ (Zhang (2010)). As criticized above, the required beta-min assumption should be avoided when constructing a hypothesis test about the unknown components of β^0.

Fact 1 has a corresponding asymptotic formulation where the dimension p = p_n and the model depend on the sample size n: if (A1) is replaced by lim_{n→∞} P[Ŝ(I_{1;n}) ⊇ S_{0;n}] = 1, then for any fixed number B, lim sup_{n→∞} P[⋃_{j∈S_{0;n}^c} {P_j ≤ α}] ≤ α. In such an asymptotic setting, the Gaussian assumption in Fact 1 can be relaxed by invoking the central limit theorem (for the low-dimensional part).

The Multi sample-splitting method is very generic: it can be used for many other models, and its basic assumptions are an approximate screening property (2.4) and that the cardinality |Ŝ(I_1)| < |I_2|, so that we only have to deal with a fairly low-dimensional inference problem. See, for example, Section 3 for GLMs. An extension for testing group hypotheses of the form H_{0,G}: β_j = 0 for all j ∈ G is indicated in Section 4.1.

Confidence intervals can be constructed based on the duality with the p-values from equation (2.5). A procedure is described in detail in Appendix A.2. The idea for inverting the p-value method is to apply a bisection scheme with one point inside and one point outside of the confidence interval; to verify whether a point is inside the aggregated confidence interval, one looks at the fraction of confidence intervals from the splits which cover the point.

2.1.2 Regularized projection: De-sparsifying the Lasso. We describe here a method, first introduced by Zhang and Zhang (2014), which does not require any assumption about β^0 except for sparsity.

It is instructive to give a motivation starting from the low-dimensional setting where p < n and rank(X) = p. The jth component of the ordinary least squares estimator, β̂_OLS;j, can be obtained as follows. Do an OLS regression of X^(j) versus all other variables X^(−j) and denote the corresponding residuals by Z^(j). Then

(2.6)  β̂_OLS;j = Y^T Z^(j) / (X^(j))^T Z^(j),

that is, β̂_OLS;j can be obtained by a linear projection. In a high-dimensional setting, the residuals Z^(j) would be equal to zero and the projection is ill-posed.

For the high-dimensional case with p > n, the idea is to pursue a regularized projection. Instead of ordinary least squares regression, we use a Lasso regression of X^(j) versus X^(−j) with corresponding residual vector Z^(j): such a penalized regression involves a regularization parameter λ_j for the Lasso, and hence Z^(j) = Z^(j)(λ_j). As in (2.6), we immediately obtain (for any vector Z^(j))

(2.7)  Y^T Z^(j) / (X^(j))^T Z^(j) = β_j^0 + Σ_{k≠j} P_{jk} β_k^0 + ε^T Z^(j) / (X^(j))^T Z^(j),
       P_{jk} = (X^(k))^T Z^(j) / (X^(j))^T Z^(j).

We note that in the low-dimensional case, with Z^(j) being the residuals from ordinary least squares, P_{jk} = 0 due to orthogonality. When using the Lasso residuals for Z^(j), we do not have exact orthogonality and a bias arises. Thus, we make a bias correction by plugging in the Lasso estimator β̂ (of the regression of Y versus X): the bias-corrected estimator is

(2.8)  b̂_j = Y^T Z^(j) / (X^(j))^T Z^(j) − Σ_{k≠j} P_{jk} β̂_k.

Using (2.7), we obtain

√n (b̂_j − β_j^0) = n^{−1/2} ε^T Z^(j) / (n^{−1} (X^(j))^T Z^(j)) + Σ_{k≠j} √n P_{jk} (β_k^0 − β̂_k).

The first term on the right-hand side has a Gaussian distribution when assuming Gaussian errors; otherwise, it has an asymptotic Gaussian distribution assuming E|ε_i|^{2+κ} < ∞ for some κ > 0 (which suffices for the Lyapunov CLT). We will argue in Appendix A.1 that the second term is negligible under the following assumptions:
(B1) The design matrix X has compatibility constant bounded away from zero, and the sparsity is s_0 = o(√n / log(p)).

(B2) The rows of X are fixed realizations of i.i.d. random vectors ∼ N_p(0, Σ), and the minimal eigenvalue of Σ is bounded away from zero.

(B3) The inverse Σ^{−1} is row-sparse with s_j = Σ_{k≠j} I((Σ^{−1})_{jk} ≠ 0) = o(n/log(p)).

Fact 2 (Zhang and Zhang (2014); van de Geer et al., 2014). Consider a linear model as in (2.1) with fixed design and Gaussian errors. Assume (B1), (B2) and (B3) (or an ℓ_1-sparsity assumption on the rows of Σ^{−1}). Then

√n σ_ε^{−1} (b̂ − β^0) = W + ∆,  W ∼ N_p(0, Ω),
Ω_{jk} = n (Z^(j))^T Z^(k) / ([(X^(j))^T Z^(j)][(X^(k))^T Z^(k)]),
‖∆‖_∞ = o_P(1).

[We note that this statement holds with probability tending to one, with respect to the variables X ∼ N_p(0, Σ) as assumed in (B2).]

The asymptotic implications of Fact 2 are as follows:

σ_ε^{−1} Ω_{jj}^{−1/2} √n (b̂_j − β_j^0) ⇒ N(0, 1),

from which we can immediately construct a confidence interval or hypothesis test by plugging in an estimate σ̂_ε, as briefly discussed in Section 2.1.4. From a theoretical perspective, it is more elegant to use the square root Lasso (Belloni, Chernozhukov and Wang (2011)) for the construction of Z^(j); then one can drop (B3) [or the ℓ_1-sparsity version of (B3)] (van de Geer (2014)). In fact, all that we then need is formula (2.12):

‖β̂ − β^0‖_1 = o_P(1/√log(p)).

From a practical perspective, it seems to make essentially no difference whether one takes the square root or the plain Lasso for the construction of the Z^(j)'s.

More generally than the statements in Fact 2, the following holds assuming (B1)–(B3) (van de Geer et al. (2014)): the asymptotic variance σ_ε^2 Ω_{jj} reaches the Cramér–Rao lower bound, which equals σ_ε^2 (Σ^{−1})_{jj} [and is bounded away from zero, due to (B2)], and the estimator b̂_j is efficient in the sense of semiparametric inference. Furthermore, the convergence in Fact 2 is uniform over the subset of the parameter space where the number of nonzero coefficients ‖β^0‖_0 is small; therefore, we obtain honest confidence intervals and tests. In particular, both of these results say that the complications of post-model selection (Leeb and Pötscher (2003)) do not arise, and yet b̂_j is optimal for the construction of confidence intervals for a single coefficient β_j^0.

From a practical perspective, we need to choose the regularization parameters λ (for the Lasso regression of Y versus X) and λ_j [for the nodewise Lasso regressions (Meinshausen and Bühlmann (2006)) of X^(j) versus all other variables X^(−j)]. Regarding the former, we advocate a choice by cross-validation; for the latter, we favor a proposal with a smaller λ_j than the one from CV, and the details are described in Appendix A.1.

Furthermore, for a group G ⊆ {1, . . . , p}, we can test a group hypothesis H_{0,G}: β_j^0 = 0 for all j ∈ G by considering the test statistic

max_{j∈G} σ_ε^{−1} Ω_{jj}^{−1/2} √n |b̂_j| ⇒ max_{j∈G} Ω_{jj}^{−1/2} |W_j|,

where the limit on the right-hand side occurs if the null hypothesis H_{0,G} holds true. The distribution of max_{j∈G} Ω_{jj}^{−1/2} |W_j| can be easily simulated from dependent Gaussian random variables. We also remark that sum-type statistics for large groups cannot be treated easily because Σ_{j∈G} |∆_j| might get out of control.
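The following is a minimal sketch (not the hdi implementation) of the desparsified Lasso estimate (2.8) and its standard error for a single coordinate j, using cross-validated glmnet fits both for the nodewise regression and for the bias correction; the paper's preferred smaller choice of λ_j (Appendix A.1) is not reproduced here, and x and y are assumed to be centered so that intercepts can be ignored.

    # Sketch of the desparsified Lasso (2.8) for one coordinate j
    library(glmnet)
    desparsify.one <- function(x, y, j, sigma.eps) {
      # nodewise Lasso: residuals Z^(j) of x[, j] regressed on the other columns
      node <- cv.glmnet(x[, -j], x[, j])
      Zj <- as.numeric(x[, j] - predict(node, x[, -j], s = "lambda.min"))
      # Lasso of y on x, used only for the bias correction term
      fit <- cv.glmnet(x, y)
      beta.hat <- as.numeric(coef(fit, s = "lambda.min"))[-1]
      denom <- sum(x[, j] * Zj)
      bj <- (sum(y * Zj) - sum((x[, -j] %*% beta.hat[-j]) * Zj)) / denom   # (2.8)
      se <- sigma.eps * sqrt(sum(Zj^2)) / abs(denom)        # sigma * sqrt(Omega_jj / n)
      c(estimate = bj, p.value = 2 * pnorm(abs(bj / se), lower.tail = FALSE))
    }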
2.1.3 Ridge projection and bias correction. Related to the desparsified Lasso estimator b̂ in (2.8) is an approach based on Ridge estimation. We sketch here the main properties and refer to Bühlmann (2013) for a detailed treatment.

Consider

β̂_Ridge = (n^{−1} X^T X + λI)^{−1} n^{−1} X^T Y.

A major source of bias occurring in Ridge estimation when p > n comes from the fact that the Ridge estimator is estimating a projected parameter

θ^0 = P_R β^0,  P_R = X^T (XX^T)^− X,

where (XX^T)^− denotes a generalized inverse of XX^T. The minor bias for θ^0 then satisfies

max_j |E[β̂_Ridge;j] − θ_j^0| ≤ λ ‖θ^0‖_2 λ_min≠0(Σ̂)^{−1},

where λ_min≠0(Σ̂) denotes the minimal nonzero eigenvalue of Σ̂ (Shao and Deng (2012)). This quantity can be made small by choosing λ small. Therefore, for λ ↘ 0+ and assuming Gaussian errors, we have that

(2.9)  σ_ε^{−1} (β̂_Ridge − θ^0) ≈ W,  W ∼ N_p(0, Ω_R),

where Ω_R = (Σ̂ + λ)^{−1} Σ̂ (Σ̂ + λ)^{−1} / n. Since

θ_j^0 / P_R;jj = β_j^0 + Σ_{k≠j} (P_R;jk / P_R;jj) β_k^0,

the major bias for β_j^0 can be estimated and corrected with

Σ_{k≠j} (P_R;jk / P_R;jj) β̂_k,

where β̂ is the ordinary Lasso. Thus, we construct a bias-corrected Ridge estimator, which addresses the potentially substantial difference between θ^0 and the target β^0:

(2.10)  b̂_R;j = β̂_Ridge;j / P_R;jj − Σ_{k≠j} (P_R;jk / P_R;jj) β̂_k,  j = 1, . . . , p.

Based on (2.9), we derive in Appendix A.1 that

(2.11)  σ_ε^{−1} Ω_R;jj^{−1/2} (b̂_R;j − β_j^0) ≈ Ω_R;jj^{−1/2} W_j / P_R;jj + σ_ε^{−1} Ω_R;jj^{−1/2} ∆_R;j,  W ∼ N_p(0, Ω_R),
        |∆_R;j| ≤ ∆_Rbound;j := max_{k≠j} |P_R;jk / P_R;jj| (log(p)/n)^{1/2−ξ},

with the typical choice ξ = 0.05. Sufficient conditions for deriving (2.11) are assumption (B1) and that the sparsity satisfies s_0 = O((n/log(p))^ξ) for ξ as above.

Unlike in Fact 2, the term ∆_R;j is typically not negligible, and we correct the Gaussian part in (2.11) by the upper bound ∆_Rbound;j. For example, for testing H_{0,j}: β_j^0 = 0, we use the upper bound for the p-value

2(1 − Φ(σ_ε^{−1} Ω_R;jj^{−1/2} |P_R;jj| (|b̂_R;j| − ∆_Rbound;j)_+)).

Similarly, for two-sided confidence intervals with coverage 1 − α we use

[b̂_R;j − c_j, b̂_R;j + c_j],  c_j = ∆_Rbound;j + σ_ε Ω_R;jj^{1/2} / |P_R;jj| · Φ^{−1}(1 − α/2).

For testing a group hypothesis for G ⊆ {1, . . . , p}, H_{0,G}: β_j^0 = 0 for all j ∈ G, we can proceed similarly as at the end of Section 2.1.2: under the null hypothesis H_{0,G}, the statistic σ_ε^{−1} max_{j∈G} Ω_R;jj^{−1/2} |b̂_R;j| has a distribution which is approximately stochastically upper bounded by

max_{j∈G} (Ω_R;jj^{−1/2} |W_j| / |P_R;jj| + σ_ε^{−1} Ω_R;jj^{−1/2} |∆_R;j|);

see also (2.11). When invoking an upper bound ∆_Rbound;j ≥ |∆_R;j| as in (2.11), we can easily simulate this distribution from dependent Gaussian random variables, which in turn can be used to construct a p-value; we refer for further details to Bühlmann (2013).
Bühlmann (2013).
2.1.4 Additional issues: Estimation of the error variance and multiple testing correction. Unlike the Multi sample-splitting procedure in Section 2.1.1, the desparsified Lasso and the Ridge projection method outlined in Sections 2.1.2–2.1.3 require plugging in an estimate of σ_ε and adjusting for multiple testing. The scaled Lasso (Sun and Zhang (2012)) leads to a consistent estimate of the error variance: it is a fully automatic method which does not need any specification of a tuning parameter. In Reid, Tibshirani and Friedman (2013), an empirical comparison of various estimators suggests that the estimator based on the residual sum of squares of a cross-validated Lasso solution often yields good finite-sample performance.

Regarding the adjustment when doing many tests for individual regression parameters or groups thereof, one can use any valid standard method to correct the p-values from the desparsified Lasso or Ridge projection method. The prime examples are the Bonferroni–Holm procedure for controlling the familywise error rate and the method from Benjamini and Yekutieli (2001) for controlling the false discovery rate. An approach for familywise error control which explicitly takes into account the dependence among the multiple hypotheses is proposed in Bühlmann (2013), based on simulations of dependent Gaussian random variables.
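As an illustration of the second error-variance estimator mentioned above, a minimal sketch based on the residual sum of squares of a cross-validated Lasso fit (in the spirit of Reid, Tibshirani and Friedman (2013)) is given below; the degrees-of-freedom correction by the number of selected variables is one common variant and not necessarily the exact choice made in the paper or in hdi.

    # Sketch: error standard deviation from a cross-validated Lasso fit
    library(glmnet)
    sigma.cv.lasso <- function(x, y) {
      n <- nrow(x)
      fit <- cv.glmnet(x, y)
      beta.hat <- as.numeric(coef(fit, s = "lambda.min"))
      s.hat <- sum(beta.hat[-1] != 0)                        # number of selected variables
      resid <- y - as.numeric(predict(fit, x, s = "lambda.min"))
      sqrt(sum(resid^2) / max(n - s.hat, 1))                 # RSS / (n - |Shat|)
    }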
2.1.5 Conceptual differences between the methods. We briefly outline here the conceptual differences, while Section 2.5 presents empirical results.

The Multi sample-splitting method is very generic and, in the spirit of Breiman's appeal for stability (Breiman, 1996a, 1996b), it enjoys some kind of stability due to the multiple sample splits and the aggregation; see also the discussion in Sections 2.1.6 and 2.4. The disadvantage is that, in the worst case, the method needs a beta-min or a weaker zonal assumption on the underlying regression parameters: this is somewhat unpleasant, since a significance test should find out whether a regression coefficient is sufficiently large or not.
Neither the desparsified Lasso nor the Ridge projection procedure makes any assumption on the underlying regression coefficients except sparsity. The former is most powerful and asymptotically optimal if the design is generated from a population distribution whose inverse covariance matrix is sparse. Furthermore, the convergence is uniform over all sparse regression vectors and, hence, the method yields honest confidence regions or tests. The Ridge projection method does not require any assumption on the fixed design but does not reach the asymptotic Cramér–Rao efficiency bound. Its construction, with the additional correction term in (A.3), leads to reliable type I error control at the cost of power.

In terms of computation, the Multi sample-splitting and Ridge projection methods are substantially less demanding than the desparsified Lasso.

2.1.6 Other sparse methods than the Lasso. All the methods described above are used "in default mode" in conjunction with the Lasso (see also Section 2.2). This is not necessary, and other estimators can be used.

For the Multi sample-splitting procedure, assumptions (A1) with δ → 0 and (A2) are sufficient for asymptotic correctness; see Fact 1. These assumptions hold for many reasonable sparse estimators when requiring a beta-min assumption and some sort of identifiability condition, such as the restricted eigenvalue or the compatibility condition on the design matrix X; see also the discussion after Fact 1. It is unclear whether one could gain substantially by using a different screening method than the Lasso. In fact, the Lasso has been found empirically to perform rather well for screening in comparison to the elastic net (Zou and Hastie (2005)), marginal correlation screening (Fan and Lv (2008)) or thresholded Ridge regression; see Bühlmann and Mandozzi (2014).

For the desparsified Lasso, the error of the estimated bias correction can be controlled using a bound for ‖β̂ − β^0‖_1. If we require (B2) and (B3) [or an ℓ_1-sparsity assumption instead of (B3)], the estimation error in the bias correction, based on an estimator β̂ in (2.8), is asymptotically negligible if

(2.12)  ‖β̂ − β^0‖_1 = o_P(1/√log(p)).

This bound is implied by (B1) and (B2) for the Lasso, but other estimators exhibit this bound as well, as mentioned below. When using such another estimator, the wording "desparsified Lasso" does not make sense anymore. Furthermore, when using the square root Lasso for the construction of Z^(j), we only need (2.12) to obtain asymptotic normality with the √n convergence rate (van de Geer (2014)).

For the Ridge projection method, a bound for ‖β̂ − β^0‖_1 is again the only assumption needed for the procedure to be asymptotically valid. Thus, for the corresponding bias correction, methods other than the Lasso can be used.

We briefly mention a few other methods for which we have reasons to believe that (A1) with very small δ > 0 and (A2), or the bound in (2.12), hold: the adaptive Lasso (Zou (2006)), analyzed in greater detail in van de Geer, Bühlmann and Zhou (2011); the MC+ procedure with its high-dimensional mathematical analysis (Zhang (2010)); or methods with concave regularization penalty such as SCAD (Fan and Li (2001)), analyzed in broader generality and detail in Fan, Xue and Zou (2014). If the assumptions (A1) with small δ > 0 and (A2) fail for the Multi sample-splitting method, the multiple sample splits still allow one to check the stability of the p-values P_corr,j^[b] across b (i.e., across sample splits). If the variable screening is unstable, many of the P_corr,j^[b] (across b) will be equal to 1; therefore, the aggregation only has a tendency to produce small p-values if most of them, each from a sample split, are stable and small. See also Mandozzi and Bühlmann (2015), Section 5. In connection with the desparsified method, a failure of the single sufficient condition in (2.12), when using, for example, the square root Lasso for the construction of the Z^(j)'s, might result in a too large bias. In the absence of resampling or Multi sample splitting, it seems difficult to diagnose such a failure (of the desparsified or Ridge projection method) with real data.

2.2 hdi for Linear Models

In the R-package hdi, available on R-Forge (Meier, Meinshausen and Dezeure (2014)), we provide implementations of the Multi sample-splitting, the Ridge projection and the desparsified Lasso method. Using the R functions is straightforward:

> outMssplit <- multi.split(x = x, y = y)
> outRidge   <- ridge.proj(x = x, y = y)
> outLasso   <- lasso.proj(x = x, y = y)
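For concreteness, the calls above can be run on simulated data as in the following toy example (a sketch; any real data set with a numeric matrix x and response vector y, such as the riboflavin data shipped with hdi, could be used instead):

> library(hdi)
> set.seed(2)
> n <- 100; p <- 200
> x <- matrix(rnorm(n * p), n, p)
> y <- 2 * x[, 1] + x[, 2] + rnorm(n)
> outRidge   <- ridge.proj(x = x, y = y)
> outLasso   <- lasso.proj(x = x, y = y)
> outMssplit <- multi.split(x = x, y = y)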
For users who are very familiar with the procedures, we provide flexible options. For example, one can easily use an alternative model selection or another "classical" fitting procedure via the arguments model.selector and classical.fit in multi.split. The default options should be satisfactory for standard usage.

All procedures return p-values and confidence intervals. The Ridge and desparsified Lasso methods return both single-testing p-values and multiple-testing-corrected p-values, unlike the Multi sample-splitting procedure, which only returns multiple-testing-corrected p-values. The confidence intervals are for individual parameters only (corresponding to single hypothesis testing).

The single-testing p-values and the multiple-testing-corrected p-values are extracted from the fit as follows:

> outRidge$pval
> outRidge$pval.corr

By default, the corrected p-values pval.corr control the familywise error rate.

Confidence intervals are obtained through the usual confint interface. Below we extract the 95% confidence intervals for those p-values that are smaller than 0.05:

> confint(outMssplit,
          parm = which(outMssplit$pval.corr <= 0.05),
          level = 0.95)

Because the desparsified Lasso method is quite computationally intensive, we provide the option to parallelize it over a user-specified number of cores.

We refer to the manual of the package for more detailed information.

2.3 Other Methods

Recently, other procedures have been suggested for the construction of p-values and confidence intervals. Residual-type bootstrap approaches are proposed and analyzed in Chatterjee and Lahiri (2013) and Liu and Yu (2013). A problem with these approaches is the nonuniform convergence to a limiting distribution and the exposure to the super-efficiency phenomenon: if the true parameter equals zero, a confidence region might be the singleton {0} (due to a finite amount of bootstrap resampling), while for nonzero true parameter values the coverage might be very poor or the confidence interval might be very wide.

The covariance test (Lockhart et al. (2014)) is another proposal which relies on the solution path of the Lasso and provides p-values for conditional tests that all relevant variables enter the Lasso solution path first. It is related to post-selection inference, mentioned in Section 7.1.

In Javanmard and Montanari (2014), a procedure was proposed that is very similar to the one described in Section 2.1.2, the only difference being that Z is picked as the solution of a convex program rather than using the Lasso. The method aims to relax the sparsity assumption (B3) on the design.

A conservative Group-bound method, which needs no regularity assumption on the design (for example, no compatibility assumption (A.1)), has been proposed by Meinshausen (2015). The method has the capacity to automatically determine whether a regression coefficient is identifiable or not, and this makes the procedure very robust against ill-posed designs. The main motivation of the method is the testing of groups of correlated variables, and we discuss it in more detail in Section 4.1.

While all the methods mentioned above are considered in a comparative simulation study in Section 2.5, we mention here some others. The idea of estimating a low-dimensional component of a high-dimensional parameter is also worked out in Belloni et al. (2012) and Belloni, Chernozhukov and Kato (2015), bearing connections to the approach of desparsifying the Lasso. Based on stability selection (Meinshausen and Bühlmann (2010)), Shah and Samworth (2013) propose a version which leads to p-values for testing individual regression parameters. Furthermore, there are new and interesting proposals for controlling the false discovery rate in a "direct way" (Bogdan et al., 2013, 2014; Barber and Candès (2015)).

2.4 Main Assumptions and Violations

We discuss here some of the main assumptions, potential violations and corresponding implications calling for caution when aiming at confirmatory conclusions.

Linear model assumption. The first assumption is that the linear (or some other) model is correct. This might be rather unrealistic, and thus it is important to know how to interpret the output of software or of a certain method. Consider a nonlinear regression model

random design:  Y_0 = f^0(X_0) + η_0,
fixed design:   Y = f^0(X) + η,
where, with some slight abuse of notation, f^0(X) = (f^0(X_1), . . . , f^0(X_n))^T. For the random design model, we assume that η_0 is independent of X_0, E[η_0] = 0, E[f^0(X_0)] = 0, E[X_0] = 0, and that the data are n i.i.d. realizations of (X_0, Y_0); for the fixed design model, the n × 1 random vector η has i.i.d. components with E[η_i] = 0. For the random design model, we consider

(2.13)  Y_0 = (β^0)^T X_0 + ε_0,
        ε_0 = f^0(X_0) − (β^0)^T X_0 + η_0,
        β^0 = argmin_β E[(f^0(X_0) − β^T X_0)^2]

[where the latter is unique if Cov(X_0) is positive definite]. We note that E[ε_0 | X_0] ≠ 0 while E[ε_0] = 0 and, therefore, the inference should be unconditional on X and is to be interpreted for the projected parameter β^0 in (2.13). Furthermore, for correct asymptotic inference for the projected parameter β^0, a modified estimator of the asymptotic variance of the estimator is needed; then both the Multi sample-splitting and the desparsified Lasso are asymptotically correct (assuming conditions similar to those for a correctly specified model). The Multi sample-splitting method is well suited to the random design case because sample splitting (a resampling-type operation) copes well with i.i.d. data. This is in contrast to fixed design, where the data are not i.i.d. and the Multi sample-splitting method for a misspecified linear model typically does not work anymore. The details are given in Bühlmann and van de Geer (2015).

For a fixed design model with rank(X) = n, we can always write

Y = Xβ^0 + ε,  ε = η,

for many solutions β^0. To ensure that the inference is valid, one should consider a sparse β^0, for example, the basis pursuit solution from compressed sensing (Candes and Tao (2006)) as one among many solutions. Thus, inference should be interpreted for a sparse solution β^0, in the sense that a confidence interval for the jth component would cover this jth component of all sufficiently sparse solutions β^0. For the high-dimensional fixed design case, there is no misspecification with respect to linearity of the model; misspecification might happen, though, if there is no solution β^0 which fulfills a required sparsity condition. The details are again given in Bühlmann and van de Geer (2015).

The assumption of constant error variance might not hold. We note that in the random design case of a nonlinear model as above, the error in (2.13) has nonconstant variance when conditioning on X but, unconditionally, the noise is homoscedastic. Thus, as outlined, the inference for a random design linear model is asymptotically valid (unconditionally on X) even though the conditional error distribution given X has nonconstant variance.

Compatibility or incoherence-type assumption. The methods in Section 2.1 require an identifiability assumption such as the compatibility condition on the design matrix X described in (A.1). The procedure in Section 4.1 does not require such an assumption: if a component of the regression parameter is not identifiable, the method will not claim significance. Hence, such a method offers some robustness against nonidentifiability.

Sparsity. All the described methods require some sparsity assumption on the parameter vector β^0 [if the model is misspecified, this concerns the parameter β^0 as in (2.13) or the basis pursuit solution]; see the discussion of (A1) after Fact 1 or assumption (B1). Such sparsity assumptions can be somewhat relaxed to weak sparsity in terms of ‖β^0‖_r for some 0 < r < 1, allowing that many or all regression parameters are nonzero but sufficiently small (cf. van de Geer (2015); Bühlmann and van de Geer, 2015).

When the truth (or the linear approximation of the true model) is nonsparse, the methods are expected to break down. With the Multi sample-splitting procedure, however, a violation of sparsity might be detected, since for nonsparse problems a sparse variable screening method will typically be unstable, with the consequence that the resulting aggregated p-values are typically not small; see also Section 2.1.6.

Finally, we note that for the desparsified Lasso, the sparsity assumption (B3) or its weaker version can be dropped when using the square root Lasso; see the discussion after Fact 2.

Hidden variables. The problem of hidden variables is most prominent in the area of causal inference (cf. Pearl (2000)). In the presence of hidden variables, the presented techniques need to be adapted, adopting ideas from, for example, the framework of EM-type estimation (cf. Dempster, Laird and Rubin (1977)), low-rank methods (cf. Chandrasekaran, Parrilo and Willsky (2012)) or the FCI technique from causal inference (cf. Spirtes, Glymour and Scheines (2000)).
2.5 A Broad Comparison

We compare a variety of methods on the basis of multiple-testing-corrected p-values and single-testing confidence intervals. The methods we look at are the multiple sample-splitting method MS-Split (Section 2.1.1), the desparsified Lasso method Lasso-Pro (Section 2.1.2), the Ridge projection method Ridge (Section 2.1.3), the covariance test Covtest (Section 2.3), the method by Javanmard and Montanari Jm2013 (Section 2.3) and the two bootstrap procedures mentioned in Section 2.3 [Res-Boot corresponds to Chatterjee and Lahiri (2013) and liuyu to Liu and Yu (2013)].

2.5.1 Specific details for the methods. For the estimation of the error variance for the Ridge projection or the desparsified Lasso method, the scaled Lasso is used, as mentioned in Section 2.1.4.

For the choice of the tuning parameters of the nodewise Lasso regressions (discussed in Section 2.1.2), we look at the two alternatives of using either cross-validation or our more favored alternative procedure (denoted by Z&Z) discussed in Appendix A.1.

We do not look at the bootstrap procedures in connection with multiple testing adjustment, because the required number of bootstrap samples grows out of proportion to go far enough into the tails of the distribution; some additional importance sampling might help to address such issues.

Regarding the covariance test, the procedure does not directly provide p-values for the hypotheses we are interested in. For the sake of comparison, though, we use the interpretation as in Bühlmann, Meier and van de Geer (2014). This interpretation does not have a theoretical justification behind it and functions more as a heuristic. Thus, the results of the covariance test procedure should be interpreted with caution.

For the method Jm2013, we used our own implementation instead of the code provided by the authors. The reason is that we had already implemented our own version when we discovered that code was available, and our own version was (orders of magnitude) better in terms of error control. Posed with the dilemma of a fair comparison, we stuck to the best-performing alternative.

2.5.2 Data used. For the empirical results, simulated design matrices as well as design matrices from real data are used. The simulated design matrices are generated ∼ N_p(0, Σ) with covariance matrix Σ of one of the following three types:

Toeplitz:   Σ_{j,k} = 0.9^{|j−k|},
Exp.decay:  (Σ^{−1})_{j,k} = 0.4^{|j−k|/5},
Equi.corr:  Σ_{j,k} ≡ 0.8 for all j ≠ k,  Σ_{j,j} ≡ 1 for all j.

The sample size and dimension are fixed at n = 100 and p = 500, respectively. We note that the Toeplitz type has a banded inverse Σ^{−1} and, vice versa, the Exp.decay type exhibits a banded Σ. The design matrix RealX, from real gene expression data of Bacillus subtilis (n = 71, p = 4088), was kindly provided by DSM (Switzerland) and is publicly available (Bühlmann, Kalisch and Meier (2014)). To make the problem somewhat comparable in difficulty to the simulated designs, the number of variables is reduced to p = 500 by taking the variables with the highest empirical variance.

The cardinality of the active set is picked to be one of two levels, s_0 ∈ {3, 15}. For each of the active set sizes, we look at 6 different ways of picking the sizes of the nonzero coefficients:

Randomly generated:  U(0, 2), U(0, 4), U(−2, 2),
A fixed value:       1, 2 or 10.

The positions of the nonzero coefficients, as columns of the design X, are picked at random. Results where the nonzero coefficients were positioned to be the first s_0 columns of X can be found in the supplemental article (Dezeure et al. (2015)).

Once we have the design matrix X and the coefficient vector β^0, the responses Y are generated according to the linear model equation with ε ∼ N(0, 1).
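The following is a small sketch of this data-generating mechanism for one of the settings (Toeplitz design, s_0 = 3, coefficients from U(0, 2)); it is meant only to make the setup concrete and uses MASS::mvrnorm for the multivariate normal design.

    # Sketch: one simulation setting of Section 2.5.2
    library(MASS)
    n <- 100; p <- 500; s0 <- 3
    Sigma.toeplitz <- toeplitz(0.9^(0:(p - 1)))                      # Toeplitz covariance
    Sigma.expdecay <- solve(toeplitz(0.4^((0:(p - 1)) / 5)))         # Exp.decay: inverse is specified
    Sigma.equicorr <- matrix(0.8, p, p); diag(Sigma.equicorr) <- 1   # Equi.corr
    x <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma.toeplitz)
    S0 <- sample(1:p, s0)                                            # random active set positions
    beta0 <- rep(0, p); beta0[S0] <- runif(s0, 0, 2)                 # U(0, 2) coefficients
    y <- as.numeric(x %*% beta0 + rnorm(n))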
2.5.3 p-values. We investigate multiple-testing-corrected p-values for two-sided testing of the null hypotheses H_{0,j}: β_j^0 = 0 for j = 1, . . . , p. We report the power and the familywise error rate (FWER) for each method:

Power = Σ_{j∈S_0} P[H_{0,j} is rejected] / s_0,
FWER = P[∃ j ∈ S_0^c: H_{0,j} is rejected].

We calculate empirical versions of these quantities based on fitting 100 simulated responses Y coming from newly generated errors ε.
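In code, the empirical versions can be computed from a matrix of corrected p-values as in the following sketch; pcorr is a hypothetical 100 × p matrix with one row per simulated response, S0 is the active set, and the significance level is α = 0.05.

    # Sketch: empirical power and FWER from corrected p-values
    reject <- pcorr <= 0.05
    power  <- mean(colMeans(reject[, S0, drop = FALSE]))          # average rejection rate over S0
    fwer   <- mean(apply(reject[, -S0, drop = FALSE], 1, any))    # any false rejection per replicate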
Fig. 2. Familywise error rate (FWER), average number of false positives [AVG(V)] and power for multiple testing based on various methods for a linear model. The desired control level for the FWER is α = 0.05. The average number of false positives AVG(V) for each method is shown in the middle. The design matrix is of type Toeplitz, with the active set size being s_0 = 3 (top) and s_0 = 15 (bottom).

For every combination of design type, active set size and coefficient type, we obtain 50 data points of the empirical versions of "Power" and "FWER," from 50 independent simulations. Thereby, each data point has a newly generated X, β^0 (if not fixed) and active set positions S_0 ⊆ {1, . . . , p}; thus, the 50 data points indicate the variability with respect to these three quantities in the data generation (for the same covariance model of the design, the same model for the regression parameter and its active set positions). The data points are grouped in plots by design type and active set size.

We also report the average number of false positives AVG(V) over all data points per method, next to the FWER plot.

The results, illustrating the performance of the various methods, can be found in Figures 2, 3, 4 and 5.

2.5.4 Confidence intervals. We investigate confidence intervals for one particular setup: the Toeplitz design, active set size s_0 = 3 and coefficients β_j^0 ∼ U[0, 2] (j ∈ S_0). The active set positions are chosen to be the first s_0 columns of X. The results shown correspond to a single data point in the p-value results.

In Figure 6, 100 confidence intervals are plotted for each coefficient and each method. These confidence intervals are the results of fitting 100 different responses Y resulting from newly generated ε error terms.

Fig. 3. See the caption of Figure 2, with the only difference being the type of design matrix. In this plot, the design matrix type is Exp.decay.
Fig. 4. See the caption of Figure 2, with the only difference being the type of design matrix. In this plot, the design matrix type is Equi.corr.

For the Multi sample-splitting method from Section 2.1.1, if a variable was not selected often enough in the sample splits, there is not enough information to draw a confidence interval for it. This is represented in the plot by drawing a confidence interval only when this was not the case. If the (uncheckable) beta-min condition (2.3) were fulfilled, we would know that those confidence intervals cover zero.

For the bootstrapping methods, an invisible confidence interval is the result of the coefficient being set to zero in all bootstrap iterations.

2.5.5 Summarizing the empirical results. As a first observation, the impact of the sparsity of the problem on performance cannot be denied. The power clearly gets worse for s_0 = 15 for the Toeplitz and Exp.decay setups. The FWER becomes too high for quite a few methods for s_0 = 15 in the cases of Equi.corr and RealX.

For the sparsity s_0 = 3, the Ridge projection method manages to control the FWER as desired for all setups. In the case of s_0 = 15, it is the Multi sample-splitting method that comes out best in comparison to the other methods. Generally speaking, good error control tends to be associated with lower power, which is not too surprising since we are dealing with the trade-off between type I and type II errors. The desparsified Lasso method turns out to be a less conservative alternative with not perfect but reasonable FWER control as long as the problem is sparse enough (s_0 = 3). The method has a slightly too high FWER for the Equi.corr and RealX setups, but a FWER around 0.05 for the Toeplitz and Exp.decay designs. Using the Z&Z tuning procedure helps the error control, as can be seen most clearly in the Equi.corr setup.

Fig. 5. See the caption of Figure 2, with the only difference being the type of design matrix. In this plot, the design matrix type is RealX.
Fig. 6. Confidence intervals and their coverage rates for 100 realizations of a linear model with fixed design of dimensions n = 100, p = 500. The design matrix was of type Toeplitz and the active set was of size s_0 = 3. The nonzero coefficients were chosen by sampling once from the uniform distribution U[0, 2]. For each method, 18 coefficients are shown from left to right, with the 100 estimated 95%-confidence intervals drawn for each coefficient. The first 3 coefficients are the nonzero coefficients in descending order of value. The other 15 coefficients, to the right of the first 3, were chosen to be those coefficients with the worst coverage. The size of each coefficient is illustrated by the height of a black horizontal bar. To illustrate the coverage of the confidence intervals, each confidence interval is colored either red or black depending on the inclusion of the true coefficient in the interval: black means the true coefficient was covered by the interval. The numbers written above the coefficients are the numbers of confidence intervals, out of 100, that covered the truth. All confidence intervals are on the same scale, so that one can easily see which methods have wider confidence intervals. To summarize the coverage for all zero coefficients S_0^c (including those not shown in the plot), the rounded average coverage of those coefficients is given to the right of all coefficients.

The results for the simulations where the positions of the nonzero coefficients were not randomly chosen, presented in the supplemental article (Dezeure et al. (2015)), largely give the same picture. In comparison to the results presented before, the Toeplitz setup is easier while the Exp.decay setup is more challenging. The Equi.corr results are very similar to the ones from before, which is to be expected from the covariance structure.

Looking at the confidence interval results, it is clear that the confidence intervals of the Multi sample-splitting method and the Ridge projection method are wider than the rest. For the bootstrapping methods, the super-efficiency phenomenon mentioned in Section 2.3 is visible. Important to note here is that the smallest nonzero coefficient, the third column, has very poor coverage from these methods.

We can conclude that the coverage of the zero coefficients is decent for all methods and that the coverage of the nonzero coefficients is in line with the error rates for the p-values.

Confidence interval results for many other setup combinations are provided in the supplemental article (Dezeure et al. (2015)). The observations are to a large extent the same.

3. GENERALIZED LINEAR MODELS

Consider a generalized linear model

Y_1, . . . , Y_n independent,
g(E[Y_i | X_i = x]) = µ^0 + Σ_{j=1}^p β_j^0 x^(j),

where g(·) is a real-valued, known link function. As before, the goal is to construct confidence intervals and statistical tests for the unknown parameters β_1^0, . . . , β_p^0, and maybe µ^0 as well.

3.1 Methods

The Multi sample-splitting method can be modified for GLMs in an obvious way: the variable screening step using the first half of the data can be based on the ℓ_1-norm regularized MLE, and p-values and confidence intervals using the second half of the sample are constructed from the asymptotic distribution of the (low-dimensional) MLE. Multiple testing correction and aggregation of the p-values from multiple sample splits are done exactly as for linear models in Section 2.1.1.

A desparsified Lasso estimator for GLMs can be constructed as follows (van de Geer et al. (2014)):

A desparsified Lasso estimator for GLMs can be constructed as follows (van de Geer et al. (2014)): The ℓ1-norm regularized MLE θ̂ for the parameters θ^0 = (μ^0, β^0) is desparsified with a method based on the Karush–Kuhn–Tucker (KKT) conditions for θ̂, leading to an estimator with an asymptotic Gaussian distribution. The Gaussian distribution can then be used to construct confidence intervals and hypothesis tests.

3.2 Weighted Squared Error Approach

The problem can be simplified in such a way that we can apply the approaches for the linear model from Section 2. This can be done for all types of generalized linear models (as shown in Appendix A.3), but we restrict ourselves in this section to the specific case of logistic regression. Logistic regression is usually fitted by applying the iteratively reweighted least squares (IRLS) algorithm where at every iteration one solves a weighted least squares problem (Hastie, Tibshirani and Friedman (2009)).
The idea is now to apply a standard ℓ1-penalized fitting of the model, build up the weighted least squares problem at the ℓ1-solution and then apply our linear model methods on this problem.
We use the notation π̂_i, i = 1, ..., n for the estimated probability of the binary outcome. π̂ is the vector of these probabilities.
From Hastie, Tibshirani and Friedman (2009), the adjusted response variable becomes

Y_adj = Xβ̂ + W^{-1}(Y − π̂),

and the weighted least squares problem is

β̂_new = argmin_β (Y_adj − Xβ)^T W (Y_adj − Xβ),

with weights

W = diag(π̂_1(1 − π̂_1), ..., π̂_n(1 − π̂_n)).

We rewrite Y_w = √W Y_adj and X_w = √W X to get

β̂_new = argmin_β (Y_w − X_w β)^T (Y_w − X_w β).

The linear model methods can now be applied to Y_w and X_w, thereby the estimate σ̂_ε has to be set to the value 1. We note that in the low-dimensional case, the resulting p-values (with unregularized residuals Z_j) are very similar to the p-values provided by the standard R-function glm.
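As a concrete illustration, the following is a minimal sketch of this construction, assuming glmnet for the ℓ1-penalized logistic fit; the variable names, and the use of the sigma argument of lasso.proj to fix σ̂_ε = 1, are assumptions for illustration rather than a prescribed interface.

# Sketch of the weighted squared error construction for logistic regression;
# 'x' is n x p and 'y' is a 0/1 response.
library(glmnet)
library(hdi)

fit      <- cv.glmnet(x, y, family = "binomial")
beta.hat <- as.numeric(coef(fit, s = "lambda.min"))  # l1-penalized fit (incl. intercept)
eta.hat  <- as.numeric(cbind(1, x) %*% beta.hat)     # linear predictor X beta-hat
pi.hat   <- 1 / (1 + exp(-eta.hat))                  # estimated probabilities pi-hat

w     <- pi.hat * (1 - pi.hat)                       # diagonal of W
y.adj <- eta.hat + (y - pi.hat) / w                  # adjusted response Y_adj
y.w   <- sqrt(w) * y.adj                             # Y_w = sqrt(W) Y_adj
x.w   <- sqrt(w) * x                                 # X_w = sqrt(W) X (row-wise scaling)

# Any linear model method from Section 2 can now be used on (x.w, y.w) with the
# error standard deviation fixed at 1, for example (if the installed hdi version
# exposes a sigma argument):
out <- lasso.proj(x = x.w, y = y.w, sigma = 1)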
3.3 Small Empirical Comparison

We provide a small empirical comparison of the methods mentioned in Sections 3.1 and 3.2. When applying the linear model procedures, we use the naming from Section 2.5. The new GLM-specific methods from Section 3.1 are referred to by their linear model names with a capital G added to them.
For simulating the data, we use a subset of the variations presented in Section 2.5.2. We only look at Toeplitz and Equi.corr and an active set size of s_0 = 3. The number of variables is fixed at p = 500, but the sample size is varied n ∈ {100, 200, 400}. The coefficients were randomly generated:

Randomly generated: U(0, 1), U(0, 2), U(0, 4).

The nonzero coefficient positions are chosen randomly in one case and fixed as the first s_0 columns of X in the other.
For every combination (of type of design, type of coefficients, sample size and coefficient positions), 100 responses Y are simulated to calculate empirical versions of the "Power" and "FWER" described in Section 2.5.3. In contrast to the p-value results from Section 2.5.3, there is only one resulting data point per setup combination (i.e., no additional replication with new random covariates, random coefficients and random active set). For each method, there are 18 data points, corresponding to 18 settings, in each plot. The results can be found in Figure 7.
Both the modified GLM methods as well as the weighted squared error approach work adequately. The Equi.corr setup does prove to be challenging for Lasso-ProG.

3.4 hdi for Generalized Linear Models

In the hdi R-package (Meier, Meinshausen and Dezeure (2014)) we also provide the option to use the Ridge projection method and the desparsified Lasso method with the weighted squared error approach.
We provide the option to specify the family of the response Y as done in the R-package glmnet:

> outRidge <- ridge.proj(x = x, y = y,
                         family = "binomial")
> outLasso <- lasso.proj(x = x, y = y,
                         family = "binomial")

p-values and confidence intervals are extracted in the exact same way as for the linear model case; see Section 2.2.
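For instance, multiplicity-corrected p-values and confidence intervals can be pulled from these fits just as in the linear model case; the two calls below only illustrate that interface and mirror the usage shown later in Section 6.

> which(outRidge$pval.corr <= 0.05)            # significant variables after correction
> confint(outRidge, parm = 1:3, level = 0.95)  # confidence intervals, first 3 predictors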
Fig. 7. Familywise error rate (FWER) and power for multiple testing based on various methods for logistic regression. The
desired control level for the FWER is α = 0.05. The design matrix is of type Toeplitz in the top plot and Equi.corr in the
bottom plot. If the method name contains a capital G, it is the modified glm version, otherwise the linear model methods are
using the weighted squared error approach.

4. HIERARCHICAL INFERENCE IN THE PRESENCE OF HIGHLY CORRELATED VARIABLES

The previous sections and methods assume in some form or another that the effects are strong enough to enable accurate estimation of the contribution of individual variables.
Variables are often highly correlated for high-dimensional data. Working with a small sample size, it is impossible to attribute any effect to an individual variable if the correlation between a block of variables is too high. Confidence intervals for individual variables are then very wide and uninformative. Asking for confidence intervals for individual variables thus leads to poor power of all procedures considered so far. Perhaps even worse, under high correlation between variables the coverage of some procedures will also be unreliable as the necessary conditions for correct coverage (such as the compatibility assumption) are violated.
In such a scenario, the individual effects are not granular enough to be resolved. However, it might yet still be possible to attribute an effect to a group of variables. The groups can arise naturally due to a specific structure of the problem, such as in applications of the group Lasso (Yuan and Lin (2006)).
Perhaps more often, the groups are derived via hierarchical clustering (Hartigan (1975)), using the correlation structure or some other distance between the variables. The main idea is as follows. A hierarchy T is a set of clusters or groups {C_k; k} with C_k ⊆ {1, ..., p}. The root node (cluster) contains all variables {1, ..., p}. For any two clusters C_k, C_ℓ, either one cluster is a subset of the other or they have an empty intersection. Usually, a hierarchical clustering has an additional notion of a level such that, on each level, the corresponding clusters build a partition of {1, ..., p}. We consider a hierarchy T and first test the root node cluster C_0 = {1, ..., p} with hypothesis H_{0,C_0}: β_1 = β_2 = ··· = β_p = 0. If this hypothesis is rejected, we test the next clusters C_k in the hierarchy (all clusters whose supersets are the root node cluster C_0 only): the corresponding cluster hypotheses are H_{0,C_k}: β_j = 0 for all j ∈ C_k. For the hypotheses which can be rejected, we consider all smaller clusters whose only supersets are clusters which have been rejected by the method before, and we continue to go down the tree hierarchy until no more cluster hypotheses can be rejected.
With the hierarchical scheme in place, we still need a test for the null hypothesis H_{0,C} of a cluster of variables. The tests have different properties. For example, whether a multiplicity adjustment is necessary will depend on the chosen test. We will describe below some methods that are useful for testing the effect of a group of variables and which can be used in such a hierarchical approach. The nice and interesting feature of the procedures is that they adapt automatically to the level of the hierarchical tree: if a signal of a small cluster of variables is strong, and if that cluster is sufficiently uncorrelated from all other variables or clusters, the cluster will be detected as significant. Vice-versa, if the signal is weak or if the cluster has too high a correlation with other variables or clusters, the cluster will not become significant. For example, a single variable cannot be detected as significant if it has too much correlation to other variables or clusters.
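The following is a schematic R sketch of the top-down traversal just described; test.cluster (returning a p-value for a cluster null hypothesis) and children (returning the children of a cluster in the tree) are placeholders to be supplied by the user and are not functions of the hdi package.

# Schematic sketch of the top-down hierarchical testing scheme.
hierarchical.test <- function(root, test.cluster, children, alpha = 0.05) {
  rejected <- list()
  queue <- list(root)                         # start at the root cluster {1, ..., p}
  while (length(queue) > 0) {
    cl <- queue[[1]]; queue <- queue[-1]
    if (test.cluster(cl) <= alpha) {          # reject H_{0,C}: beta_j = 0 for all j in C
      rejected <- c(rejected, list(cl))
      queue <- c(queue, children(cl))         # descend only below rejected clusters
    }
  }
  rejected                                    # all clusters rejected on the way down
}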
4.1 Group-Bound Confidence Intervals Without Design Assumptions

The Group-bound proposed in Meinshausen (2015) gives confidence intervals for the ℓ1-norm ‖β^0_{C_k}‖_1 of a group C_k ⊆ {1, ..., p} of variables. If the lower bound of the 1 − α confidence interval is larger than 0, then the null hypothesis β^0_{C_k} ≡ 0 can be rejected for this group. The method combines a few properties:

(i) The confidence intervals are valid without an assumption like the compatibility condition (A.1). In general, they are conservative, but if the compatibility condition holds, they have good "power" properties (in terms of length) as well.
(ii) The test is hierarchical. If a set of variables can be rejected, all supersets will also be rejected. And vice-versa, if a group of variables cannot be rejected, none of its subsets can be rejected.
(iii) The estimation accuracy has an optimal detection rate under the so-called group effect compatibility condition, which is weaker than the compatibility condition necessary to detect the effect of individual variables.
(iv) The power of the test is unaffected by adding highly or even perfectly correlated variables in C_k to the group. The compatibility condition would fail to yield a nontrivial bound, but the group effect compatibility condition is unaffected by the addition of perfectly correlated variables to a group.

The price to pay for the assumption-free nature of the bound is a weaker power than with previously discussed approaches when the goal is to detect the effect of individual variables. However, for groups of highly correlated variables, the approach can be much more powerful than simply testing all variables in the group.
We remark that previously developed tests can be adapted to the context of hierarchical testing of groups with hierarchical adjustment for familywise error control (Meinshausen (2008)); for the Multi sample-splitting method, this is described next.

4.2 Hierarchical Multi Sample-Splitting

The Multi sample-splitting method (Section 2.1.1) can be adapted to the context of hierarchical testing of groups by using hierarchical adjustment of familywise error control (Meinshausen (2008)). When testing a cluster hypothesis H_{0,C}, one can use a modified form of the partial F-test for high-dimensional settings; and the multiple testing adjustment due to the multiple cluster hypotheses considered can be taken care of by a hierarchical adjustment scheme proposed in Meinshausen (2008). A detailed description of the method, denoted here by Hier. MS-Split, together with theoretical guarantees is given in Mandozzi and Bühlmann (2015).

4.3 Simultaneous Inference with the Ridge or Desparsified Lasso Method

Simultaneous inference for all possible groups can be achieved by considering p-values P_j of individual hypotheses H_{0,j}: β_j^0 = 0 (j = 1, ..., p) and adjusting them for simultaneous coverage, namely, P_adjusted,j = P_j · p. The individual p-values P_j can be obtained by the Ridge or desparsified Lasso method in Section 2.
We can then test any group hypothesis H_{0,G}: β_j^0 = 0 for all j ∈ G by simply looking whether min_{j∈G} P_adjusted,j ≤ α, and we can consider as many group hypotheses as we want without any further multiple testing adjustment.
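A minimal illustration of this simultaneous group test based on the individual p-values from, for example, ridge.proj; the explicit adjustment P_adjusted,j = P_j · p is spelled out here as a sketch, rather than using hdi's own groupTest interface.

# Sketch of the simultaneous group test via adjusted individual p-values.
library(hdi)
out   <- ridge.proj(x = x, y = y)
p.adj <- pmin(out$pval * ncol(x), 1)   # P_adjusted,j = P_j * p, capped at 1
G     <- 1:20                          # any group of interest
min(p.adj[G]) <= 0.05                  # TRUE: reject H_{0,G} at level alpha = 0.05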
4.4 Illustrations

A semi-real data example is shown in Figure 8, where the predictor variables are taken from the Riboflavin data set (Bühlmann, Kalisch and Meier (2014)) (n = 71, p = 4088) and the coefficient vector is taken to have entries 0, except for 2 clusters of highly correlated variables. In example 1, the clusters both have size 3 with nonzero coefficient sizes equal to 1 for all the variables in the clusters and Gaussian noise level σ = 0.1. In example 2, the clusters are bigger and have different sizes 11 and 21; the coefficient sizes for all the variables in the clusters are again 1, but the Gaussian noise level here is chosen to be σ = 0.5.
In the first example, 6 out of the 6 relevant variables are discovered as individually significant by the Lasso-Pro, Ridge and MS-Split methods (as outlined in Sections 2.1.1–2.1.2), after adjusting for multiplicity.
In the second example, the methods cannot reject the single variables individually any longer. The results for the Group-bound estimator are shown in the right column. The Group-bound can reject a group of 4 and 31 variables in the first example, each containing a true cluster of 3 variables.
Fig. 8. A visualization of the hierarchical testing scheme as described in the beginning of Section 4, for the examples described
in Section 4.4. One moves top-down through the output of a hierarchical clustering scheme, starting at the root node. For each
cluster encountered, the null hypothesis that all the coefficients of that particular cluster are 0 is tested. A rejection is visualized
by a red semi-transparent circle at a vertical position that corresponds to the size of the cluster. The chosen significance level
was α = 0.05. The children of significant clusters in the hierarchy are connected by a black line. The process is repeated by
testing the null hypotheses for all those children clusters until no more hypotheses could be rejected. The ordering of the
hierarchy in the horizontal direction has no meaning and was chosen for a clean separation of children hierarchies. The
hierarchical clustering and orderings are the same for all 6 plots since the design matrix was the same. Two different examples
were looked at (corresponding to top and bottom row, resp.) and four different methods were applied to these examples. The
desparsified Lasso and the Ridge method gave identical results and were grouped in the two plots on the left, while results from
the hierarchical Multi sample-splitting method are presented in the middle column and the results for the Group-bound method
are shown in the right column. In example 1, the responses were simulated with 2 clusters of highly correlated variables of
size 3 having coefficients different from zero. In example 2, the responses were simulated with 2 clusters of highly correlated
variables of sizes 11 and 21 having coefficients different from zero. More details about the examples can be found in Section
4.4.

The method can also detect a group of 2 variables (a subset of the cluster of 4) which contains 2 out of the 3 highly correlated variables. In the second example, a group of 34 variables is rejected with the Group-bound estimator, containing 16 of the group of 21 important variables. The smallest group of variables containing the cluster of 21 that the method can detect is of size 360. It can thus be detected that the variables jointly have a substantial effect even though the null hypothesis cannot be rejected for any variable individually. The hierarchical Multi sample-splitting method (outlined in Section 4.2) manages to detect the same clusters as the Group-bound method. It even goes one step further by detecting a smaller subcluster.
We also consider the following simulation model. The type of design matrix was chosen to be such that the population covariance matrix Σ is a block-diagonal matrix with blocks of dimension 20 × 20 being of the same type as Σ for Equi.corr (see Section 2.5.2) with off-diagonal ρ instead of 0.8. The dimensions of the problem were chosen to be p = 500 variables and n = 100 samples, with noise level σ = 1. There were only 3 nonzero coefficients, chosen with three different signal levels U[0, 2], U[0, 4] and U[0, 8] being used for the simulations.
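A sketch of this data-generating mechanism (Block Equi.corr design with 20 × 20 blocks, p = 500, n = 100, σ = 1 and 3 nonzero coefficients) could look as follows; the value ρ = 0.5, the use of MASS::mvrnorm and the signal level U[0, 2] are illustrative choices.

# Sketch of the Block Equi.corr simulation setup described above.
library(MASS)

p <- 500; n <- 100; block.size <- 20; rho <- 0.5; s0 <- 3
block <- matrix(rho, block.size, block.size); diag(block) <- 1
Sigma <- kronecker(diag(p / block.size), block)   # block-diagonal covariance matrix

x    <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
beta <- rep(0, p)
beta[sample(p, s0)] <- runif(s0, 0, 2)            # 3 nonzero coefficients, here U[0, 2]
y <- as.numeric(x %*% beta + rnorm(n, sd = 1))    # noise level sigma = 1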
Fig. 9. The power for the rejection of the group-hypothesis of all variables (top) and the power for the rejection of the group-hypothesis of the variables in blocks highly correlated with S_0 variables (bottom). The design matrix used is of type Block Equi.corr, which is similar to the Equi.corr setup in that Σ is block diagonal with blocks (of size 20 × 20) being the Σ of Equi.corr. The power is plotted as a function of the correlations in the blocks, quantified by ρ. The Ridge-based method loses power as the correlation between variables increases, while the Group-bound, Hier. MS-Split and Lasso-Pro methods can maintain power close to 1 for both measures of power.

Aside from varying the signal level, we studied the two cases where either all the nonzero coefficients were contained in one single highly correlated block or each of those variables was in a different block. We look at 3 different measures of power. One can define the power as the fraction of the 100 repeated simulations in which the method managed to reject the group of all variables G = {1, ..., p}. This is shown at the top in Figure 9. Alternatively, one can look at the rejection rate of the hypothesis for the group G that contains all variables in the highly correlated blocks that contain a variable from S_0. This is the plot at the bottom in Figure 9. Finally, one can look at the rejection rate of the hypothesis where the group G contains only the variables in S_0 (of size 3 in this case). The type I error we define to be the fraction of the simulations in which the method rejected the group hypothesis H_{0,S_0^c}, where all regression coefficients are equal to zero. These last two measures are presented in Figure 10.

Fig. 10. The power for the rejection of the group-hypothesis of all S_0 variables (top) and the type I error rate corresponding to the rejection of the group-hypothesis of all S_0^c variables (bottom) for the design matrix of type Block Equi.corr when changing the correlation ρ between variables. The design matrix type is described in detail in the caption of Figure 9 and in the text. The desparsified Lasso, Hier. MS-Split and the Ridge-based method lose power as the correlation between variables increases, while the Group-bound cannot reject the small group of variables S_0 (3 in this case). The desparsified Lasso and MS-Split methods also exceed the nominal type I error rate for high correlations (as the design assumptions break down), whereas the Ridge-based method and the Group-bound are both within the nominal 5% error rate for every correlation strength.

The power of the Ridge-based method (Bühlmann (2013)) drops substantially for high correlations. The power of the Group-bound stays close to 1 at the level of the highly correlated groups (Block-power) and above (Power G = {1, ..., p}) throughout the entire range of correlation values. The Lasso-Pro and MS-Split perform well here as well. The power of the Group-bound is 0 when attempting to reject the small groups H_{0,S_0}. The type I error rate is supposed to be controlled at level α = 0.05 with all three methods. However, the Lasso-Pro and the hierarchical MS-Split methods fail to control the error rates, with the type I error rate even approaching 1 for large values of the correlation. The Group-bound and Ridge-based estimator have, in contrast, a type I error rate close to 0 for all values of the correlation. For highly correlated groups of variables, trying to detect the effect of individual variables has thus two inherent dangers. The power to detect interesting
groups of variables might be very low. And the assumptions for the methods might be violated, which invalidates the type I error control. The assumption-free Group-bound method provides a powerful test for the group effects even if variables are perfectly correlated, but suffers in power, relatively speaking, when variables are not highly correlated.

4.5 hdi for Hierarchical Inference

An implementation of the Group-bound method is provided in the hdi R-package (Meier, Meinshausen and Dezeure (2014)).
For specific groups, one can provide a vector or a list of vectors where the elements of the vector specify the desired columns of X to be tested for. The following code tests the group hypothesis if the group contains all variables:

> group <- 1:ncol(x)
> outGroupBound <- groupBound(x = x, y = y,
                              group = group, alpha = 0.05)
> rejection <- outGroupBound > 0

Note that one needs to specify the significance level α.
One can also let the method itself apply the hierarchical clustering scheme as described at the beginning of Section 4. This works as follows:

> outClusterGroupBound <- clusterGroupBound(x = x,
                                            y = y, alpha = 0.05)

The output contains all clusters that were tested for significance in the element members. The corresponding lower bounds are contained in lowerBound.
To extract the significant clusters, one can do

> significant.cluster.numbers <- which(
    outClusterGroupBound$lowerBound > 0)
> significant.clusters <- outClusterGroupBound$members[[
    significant.cluster.numbers]]

Figures in the style of Figure 8 can be achieved by using the function plot on outClusterGroupBound.
Note that one can specify the distance matrix used for the hierarchical clustering, as done for hclust.
To test group hypotheses H_{0,G} for the Ridge and desparsified Lasso method as described in Section 4.3, one uses the output from the original single parameter fit, as illustrated for the group of all variables:

> outRidge <- ridge.proj(x = x, y = y)
> outLasso <- lasso.proj(x = x, y = y)
> group <- 1:ncol(x)
> outRidge$groupTest(group)
> outLasso$groupTest(group)

To apply a hierarchical clustering scheme as done in clusterGroupBound, one calls clusterGroupTest:

> outRidge$clusterGroupTest(alpha = 0.95)

To summarize, the R-package provides functions to test individual groups as well as to test according to a hierarchical clustering scheme for the methods Group-bound, Ridge and desparsified Lasso. An implementation of the hierarchical Multi sample-splitting method is not provided at this point in time.

5. STABILITY SELECTION AND ILLUSTRATION WITH HDI

Stability selection (Meinshausen and Bühlmann (2010)) is another methodology to guard against false positive selections, by controlling the expected number of false positives E[V]. The focus is on selection of a single or a group of variables in a regression model, or on a selection of more general discrete structures such as graphs or clusters. For example, for a linear model in (2.1) and with a selection of single variables, stability selection provides a subset of variables Ŝ_stable such that for V = |Ŝ_stable ∩ S_0^c| we have that E[V] ≤ M, where M is a prespecified number.
For selection of single variables in a regression model, the method does not need a beta-min assumption, but the theoretical analysis of stability selection for controlling E[V] relies on a restrictive exchangeability condition (which, e.g., is ensured by a restrictive condition on the design matrix). This exchangeability condition seems far from necessary though (Meinshausen and Bühlmann (2010)). A refinement of stability selection is given in Shah and Samworth (2013).
An implementation of the stability selection procedure is available in the hdi R-package. It is called in a very similar way as the other methods. If we want to control, for example, E[V] ≤ 1, we use
> outStability <- stability(x = x, y = y, EV = 1)

The "stable" predictors are available in the element select.
The default model selection algorithm is the Lasso (the first q variables entering the Lasso paths). The option model.selector allows to apply a user defined model selection function.

6. R WORKFLOW EXAMPLE

We go through a possible R workflow based on the Riboflavin data set (Bühlmann, Kalisch and Meier (2014)) and methods provided in the hdi R-package:

> library(hdi)
> data(riboflavin)

We assume a linear model and we would like to investigate which effects are statistically significant on a significance level of α = 0.05. Moreover, we want to construct the corresponding confidence intervals.
We start by looking at the individual variables. We want a conservative approach and, based on the results from Section 2.5, we choose the Ridge projection method for its good error control:

> outRidge <- ridge.proj(x = riboflavin$x,
                         y = riboflavin$y)

We investigate if any of the multiple testing corrected p-values are smaller than our chosen significance level:

> any(outRidge$pval.corr <= 0.05)
[1] FALSE

We calculate the 95% confidence intervals for the first 3 predictors:

> confint(outRidge, parm = 1:3, level = 0.95)
              lower    upper
AADK_at  -0.8848403 1.541988
AAPA_at  -1.4107374 1.228205
ABFA_at  -1.3942909 1.408472

Disappointed with the lack of significance for testing individual variables, we want to investigate if we can find a significant group instead. From the procedure proposed for the Ridge method in Section 4, we know that if the Ridge method can not find any significant individual variables, it would not find a significant group either.
We apply the Group-bound method with its clustering option to try to find a significant group:

> outClusterGroupBound <- clusterGroupBound(
    x = riboflavin$x, y = riboflavin$y, alpha = 0.05)
> significant.cluster.numbers <- which(
    outClusterGroupBound$lowerBound > 0)
> significant.clusters <- outClusterGroupBound$members[[
    significant.cluster.numbers]]
> str(significant.clusters)
 num [1:4088] 1 2 3 4 5 6 7 8 9 10 ...

Only a single group, being the root node of the clustering tree, is found significant.
These results are in line with the results achievable in earlier studies of the same data set in Bühlmann, Kalisch and Meier (2014) and van de Geer et al. (2014).

7. CONCLUDING REMARKS

We present a (selective) overview of recent developments in frequentist high-dimensional inference for constructing confidence intervals and assigning p-values for the parameters in linear and generalized linear models. We include some methods which are able to detect significant groups of highly correlated variables which cannot be individually detected as single variables. We complement the methodology and theory viewpoints with a broad empirical study. The latter indicates that more "stable" procedures based on Ridge estimation or sample splitting with subsequent aggregation might be more reliable for type I error control, at the price of losing power; asymptotically, power-optimal methods perform nicely in well-posed scenarios but are more exposed to fail for error control in more difficult settings where the design or degree of sparsity are more ill-posed. We introduce the R-package hdi which allows the user to choose from a collection of frequentist inference methods and eases reproducible research.

7.1 Post-Selection and Sample Splitting Inference

Since the main assumptions outlined in Section 2.4 might be unrealistic in practice, one can consider a different route.
The view and "POSI" (Post-Selection Inference) method by Berk et al. (2013) makes inferential statements which are protected against all possible submodels and, therefore, the procedure is not exposed to the issue of having selected an "inappropriate" submodel. The way in which Berk et al. (2013) deal with misspecification of the (e.g., linear) model is closely related to addressing this issue with the Multi sample splitting or desparsified Lasso method; see Section 2.4 and Bühlmann and van de Geer (2015). The method by Berk et al. (2013) is conservative, as it protects against any possible submodel, and it is not feasible yet for high-dimensional problems. Wasserman (2014) briefly describes the "HARNESS" (High-dimensional Agnostic Regression Not Employing Structure or Sparsity) procedure: it is based on single data splitting and making inference for the selected submodel from the first half of the data. When giving up on the goal to infer the true or best approximating parameter β^0 in (2.13), one can drop many of the main assumptions which are needed for high-dimensional inference.
The "HARNESS" is related to post-selection inference where the inefficiency of sample splitting is avoided. Some recent work includes exact post-selection inference, where the full data is used for selection and inference: it aims to avoid the potential inefficiency of single sample splitting and to be less conservative than "POSI", thereby restricting the focus to a class of selection procedures which are determined by affine inequalities, including the Lasso and least angle regression (Lee et al. (2013); Taylor et al., 2014; Fithian, Sun and Taylor, 2014). Under some conditions, the issue of selective inference can be addressed by using an adjustment factor (Benjamini and Yekutieli (2005)): this could be done by adjusting the output of our high-dimensional inference procedures, for example, from the hdi R-package.

APPENDIX

A.1 Additional Definitions and Descriptions

Compatibility condition (Bühlmann and van de Geer (2011), page 106). Consider a fixed design matrix X. We define the following:
The compatibility condition holds if for some φ_0 > 0 and all β satisfying ‖β_{S_0^c}‖_1 ≤ 3‖β_{S_0}‖_1,

(A.1)  ‖β_{S_0}‖_1^2 ≤ β^T Σ̂ β s_0/φ_0^2,   Σ̂ = n^{-1} X^T X.

Here β_A denotes the components {β_j; j ∈ A} where A ⊆ {1, ..., p}. The number φ_0 is called the compatibility constant.

Aggregation of dependent p-values. Aggregation of dependent p-values can be generically done as follows.

Lemma 1 [Implicitly contained in Meinshausen, Meier and Bühlmann (2009)]. Assume that we have B p-values P^(1), ..., P^(B) for testing a null-hypothesis H_0, that is, for every b ∈ {1, ..., B} and any 0 < α < 1, P_{H_0}[P^(b) ≤ α] ≤ α. Consider for any 0 < γ < 1 the empirical γ-quantile

Q(γ) = min(empirical γ-quantile{P^(1)/γ, ..., P^(B)/γ}, 1),

and the minimum value of Q(γ), suitably corrected with a factor, over the range (γ_min, 1) for some positive (small) 0 < γ_min < 1:

P = min((1 − log(γ_min)) min_{γ∈(γ_min,1)} Q(γ), 1).

Then, both Q(γ) [for any fixed γ ∈ (0, 1)] and P are conservative p-values satisfying for any 0 < α < 1, P_{H_0}[Q(γ) ≤ α] ≤ α or P_{H_0}[P ≤ α] ≤ α, respectively.
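A small R sketch of this aggregation rule; the grid over γ and the empirical-quantile convention are implementation choices for illustration and are not taken from hdi.

# Sketch of the p-value aggregation of Lemma 1; 'pvals' holds B dependent
# p-values P^(1), ..., P^(B) for the same null hypothesis.
aggregate.pvals <- function(pvals, gamma.min = 0.05) {
  gammas <- seq(gamma.min, 1, length.out = 100)          # grid over (gamma.min, 1)
  Q <- sapply(gammas, function(g)
    min(1, quantile(pvals / g, probs = g, type = 1)))    # Q(gamma)
  min(1, (1 - log(gamma.min)) * min(Q))                  # P = min((1 - log gamma.min) min_gamma Q(gamma), 1)
}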
Bounding the error of the estimated bias correction in the desparsified Lasso. We will argue now why the error from the bias correction

Σ_{k≠j} √n P_{jk}(β̂_k − β_k^0)

is negligible. From the KKT conditions when using the Lasso of X^(j) versus X^(−j), we have (Bühlmann and van de Geer, 2011, cf. Lemma 2.1)

(A.2)  max_{k≠j} 2|n^{-1}(X^(k))^T Z^(j)| ≤ λ_j.

Therefore,

|√n Σ_{k≠j} P_{jk}(β̂_k − β_k^0)|
  ≤ √n max_{k≠j} |P_{jk}| ‖β̂ − β^0‖_1
  ≤ 2√n λ_j ‖β̂ − β^0‖_1 (n^{-1}(X^(j))^T Z^(j))^{-1}.

Assuming sparsity and the compatibility condition (A.1), and when choosing λ_j ≍ √(log(p)/n), one can show that (n^{-1}(X^(j))^T Z^(j))^{-1} = O_P(1) and
‖β̂ − β^0‖_1 = O_P(s_0 √(log(p)/n)) [for the latter, see (2.2)]. Therefore,

|√n Σ_{k≠j} P_{jk}(β̂_k − β_k^0)| ≤ O_P(√n s_0 √(log(p)/n) λ_j) = O_P(s_0 log(p) n^{-1/2}),

where the last bound follows by assuming λ_j ≍ √(log(p)/n). Thus, if s_0 ≪ n^{1/2}/log(p), the error from bias correction is asymptotically negligible.

Choice of λ_j for desparsified Lasso. We see from (A.2) that the numerator of the error in the bias correction term (i.e., the P_{jk}'s) is decreasing as λ_j ↘ 0; for controlling the denominator, λ_j should not be too small to ensure that the denominator [i.e., n^{-1}(X^(j))^T Z^(j)] behaves reasonably (staying away from zero) for a fairly large range of λ_j.
Therefore, the strategy is as follows:

1. Compute a Lasso regression of X^(j) versus all other variables X^(−j) using CV, and the corresponding residual vector is denoted by Z^(j).
2. Compute ‖Z^(j)‖_2^2/((X^(j))^T Z^(j))^2, which is the asymptotic variance of b̂_j/σ_ε, assuming that the error in the bias correction is negligible.
3. Increase the variance by 25%, that is, V_j = 1.25 ‖Z^(j)‖_2^2/((X^(j))^T Z^(j))^2.
4. Search for the smallest λ_j such that the corresponding residual vector Z^(j)(λ_j) satisfies

‖Z^(j)(λ_j)‖_2^2/((X^(j))^T Z^(j)(λ_j))^2 ≤ V_j.

This procedure is similar to the choice of λ_j advocated in Zhang and Zhang (2014).
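A sketch of this strategy using glmnet for the nodewise Lasso of X^(j) on X^(−j); the helper name and the reuse of the cross-validated lambda path are illustrative assumptions.

# Sketch of the lambda_j selection strategy described in steps 1-4 above.
library(glmnet)

choose.lambda.j <- function(x, j, inflate = 1.25) {
  xj  <- x[, j]; xmj <- x[, -j, drop = FALSE]
  cvf <- cv.glmnet(xmj, xj)                              # step 1: nodewise Lasso with CV
  z   <- xj - as.numeric(predict(cvf, xmj, s = "lambda.min"))
  v.cv  <- sum(z^2) / sum(xj * z)^2                      # step 2: asymptotic variance of b_j / sigma_eps
  v.max <- inflate * v.cv                                # step 3: increase the variance by 25%
  for (lam in sort(cvf$lambda)) {                        # step 4: smallest lambda_j meeting the bound
    zl <- xj - as.numeric(predict(cvf$glmnet.fit, xmj, s = lam))
    if (sum(zl^2) / sum(xj * zl)^2 <= v.max) return(lam)
  }
  cvf$lambda.min
}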
Bounding the error of bias correction for the Ridge projection. The goal is to derive the formula (2.11). Based on (2.9), we have

σ_ε^{-1} Ω_{R;jj}^{-1/2}(b̂_{R;j} − β_j^0) ≈ Ω_{R;jj}^{-1/2} W_j/P_{R;jj} + σ_ε^{-1} Ω_{R;jj}^{-1/2} Δ_{R;j},   W ∼ N_p(0, Ω_R),
|Δ_{R;j}| ≤ max_{k≠j} |P_{R;jk}/P_{R;jj}| ‖β̂ − β^0‖_1.

In relation to the result in Fact 2 for the desparsified Lasso, the problem here is that the behaviors of max_{k≠j} |P_{R;jj}^{-1} P_{R;jk}| and of the diagonal elements Ω_{R;jj} are hard to control, but, fortunately, these quantities are fixed and observed for fixed design X.
By invoking the compatibility constant for the design X, we obtain the bound ‖β̂ − β^0‖_1 ≤ 4s_0 λ/φ_0^2 in (2.2) and, therefore, we can upper-bound

|Δ_{R;j}| ≤ 4s_0 λ/φ_0^2 max_{k≠j} |P_{R;jk}/P_{R;jj}|.

Asymptotically, for Gaussian errors, we have with high probability

(A.3)  |Δ_{R;j}| = O(s_0 √(log(p)/n) max_{k≠j} |P_{R;jk}/P_{R;jj}|)
              ≤ O((log(p)/n)^{1/2−ξ} max_{k≠j} |P_{R;jk}/P_{R;jj}|),

where the last inequality holds due to assuming s_0 = O((n/log(p))^ξ) for some 0 < ξ < 1/2. In practice, we use the bound from (A.3) in the form

Δ_{Rbound;j} := max_{k≠j} |P_{R;jk}/P_{R;jj}| (log(p)/n)^{1/2−ξ},

with the typical choice ξ = 0.05.

A.2 Confidence Intervals for Multi Sample-Splitting

We construct confidence intervals that satisfy the duality with the p-values from equation (2.5), and, thus, they are corrected already for multiplicity:

(1 − α)% CI
  = Those values c for which the p-value ≥ α for testing the null hypothesis H_{0,j}: β_j = c,
  = Those c for which the p-value resulting from the p-value aggregation procedure is ≥ α,
  = {c | P_j ≥ α},
  = {c | (1 − log γ_min) inf_{γ∈(γ_min,1)} Q_j(γ) ≥ α},
  = {c | ∀γ ∈ (γ_min, 1): (1 − log γ_min) Q_j(γ) ≥ α},
  = {c | ∀γ ∈ (γ_min, 1): min(1, emp. γ-quantile(P^[b]_corr;j)/γ) ≥ α/(1 − log γ_min)},
  = {c | ∀γ ∈ (γ_min, 1): emp. γ-quantile(P^[b]_corr;j)/γ ≥ α/(1 − log γ_min)},
  = {c | ∀γ ∈ (γ_min, 1): emp. γ-quantile(P^[b]_corr;j) ≥ αγ/(1 − log γ_min)}.

We will use the notation γ^[b] for the position of P^[b]_corr;j in the ordering by increasing the value of the corrected p-values P^[i]_corr;j, divided by B.
We can now rewrite our former expression in a form explicitly using our information from every sample split:

(1 − α)% CI
  = {c | ∀b = 1, ..., B: (γ^[b] ≤ γ_min) ∨ (P^[b]_corr;j ≥ αγ^[b]/(1 − log γ_min))}
  = {c | ∀b = 1, ..., B: (γ^[b] ≤ γ_min) ∨ (c ∈ the (1 − αγ^[b]/((1 − log γ_min)|Ŝ^[b]|)) · 100% CI for split b)}.

For single testing (not adjusted for multiplicity), the corresponding confidence interval becomes

(1 − α)% CI
  = {c | ∀b = 1, ..., B: (γ^[b] ≤ γ_min) ∨ (c ∈ the (1 − αγ^[b]/(1 − log γ_min)) · 100% CI for split b)}.

If one has starting points with one being in the confidence interval and the other one outside of it, one can apply the bisection method to find the bound in between these points.
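A tiny sketch of that bisection search; in.ci is a placeholder for a function that returns TRUE when a candidate value c lies in the confidence interval (i.e., is not rejected).

# Sketch of the bisection search for one confidence bound.
find.bound <- function(in.ci, inside, outside, tol = 1e-6) {
  while (abs(outside - inside) > tol) {
    mid <- (inside + outside) / 2
    if (in.ci(mid)) inside <- mid else outside <- mid   # keep the bracketing interval
  }
  (inside + outside) / 2
}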
A.3 Weighted Squared Error Approach for General GLM

We describe the approach presented in Section 3.2 in a more general way. One algorithm for fitting generalized linear models is to calculate the maximum likelihood estimates β̂ by applying iterative weighted least squares (McCullagh and Nelder (1983)).
As in Section 3.2, the idea is now to apply a standard ℓ1-penalized fitting of the model, then build up the weighted least squares problem at the ℓ1-solution and apply our linear model methods on this problem.
From McCullagh and Nelder (1983), using the notation ẑ_i = g^{-1}((Xβ̂)_i), i = 1, ..., n, the adjusted response variable becomes

Y_{i,adj} = (Xβ̂)_i + (Y_i − ẑ_i) ∂g(z)/∂z |_{z=ẑ_i},   i = 1, ..., n.

We then get a weighted least squares problem

β̂_new = argmin_β (Y_adj − Xβ)^T W (Y_adj − Xβ),

with weights

W^{-1} = diag((∂g(z)/∂z |_{z=ẑ_1})^2 V(ẑ_1), ..., (∂g(z)/∂z |_{z=ẑ_n})^2 V(ẑ_n)),

with variance function V(z).
The variance function V(z) is related to the variance of the response Y. To more clearly define this relation, we assume that the response Y has a distribution of the form described in McCullagh and Nelder (1983):

f_Y(y; θ, φ) = exp[(yθ − b(θ))/a(φ) + c(y, φ)],

with known functions a(·), b(·) and c(·); θ is the canonical parameter and φ is the dispersion parameter.
As defined in McCullagh and Nelder (1983), the variance function is then related to the variance of the response in the following way:

Var(Y) = b''(θ)a(φ) = V(g^{-1}(Xβ^0))a(φ).

We rewrite Y_w = √W Y_adj and X_w = √W X to get

β̂_new = argmin_β (Y_w − X_w β)^T (Y_w − X_w β).
(1983)). β̂new = argminβ (Yw − Xw β)T (Yw − Xw β).
The linear model methods can now be applied to Y_w and X_w, thereby the estimate σ̂_ε has to be set to the value 1.

ACKNOWLEDGMENTS

We would like to thank some reviewers for insightful and constructive comments.

SUPPLEMENTARY MATERIAL

Supplement to "High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi" (DOI: 10.1214/15-STS527SUPP; .pdf). The supplemental article contains additional empirical results.

REFERENCES

Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist. 43 2055–2085. MR3375876
Belloni, A., Chernozhukov, V. and Kato, K. (2015). Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems. Biometrika 102 77–94. MR3335097
Belloni, A., Chernozhukov, V. and Wang, L. (2011). Square-root Lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98 791–806. MR2860324
Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429. MR3001131
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 1165–1188. MR1869245
Benjamini, Y. and Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Amer. Statist. Assoc. 100 71–93. MR2156820
Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Ann. Statist. 41 802–837. MR3099122
Bogdan, M., van den Berg, E., Su, W. and Candès, E. (2013). Statistical estimation and testing via the sorted l1 norm. Preprint. Available at arXiv:1310.1969.
Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E. (2014). SLOPE—adaptive variable selection via convex optimization. Preprint. Available at arXiv:1407.3824.
Breiman, L. (1996a). Bagging predictors. Mach. Learn. 24 123–140.
Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. Ann. Statist. 24 2350–2383. MR1425957
Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19 1212–1242. MR3102549
Bühlmann, P., Kalisch, M. and Meier, L. (2014). High-dimensional statistics with a view towards applications in biology. Annual Review of Statistics and Its Applications 1 255–278.
Bühlmann, P. and Mandozzi, J. (2014). High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput. Statist. 29 407–430. MR3261821
Bühlmann, P., Meier, L. and van de Geer, S. (2014). Discussion: "A significance test for the Lasso". Ann. Statist. 42 469–477. MR3210971
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg. MR2807761
Bühlmann, P. and van de Geer, S. (2015). High-dimensional inference in misspecified linear models. Electron. J. Stat. 9 1449–1473. MR3367666
Candes, E. J. and Tao, T. (2006). Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory 52 5406–5425. MR2300700
Chandrasekaran, V., Parrilo, P. A. and Willsky, A. S. (2012). Latent variable graphical model selection via convex optimization. Ann. Statist. 40 1935–1967. MR3059067
Chatterjee, A. and Lahiri, S. N. (2013). Rates of convergence of the adaptive LASSO estimators to the oracle distribution and higher order refinements by the bootstrap. Ann. Statist. 41 1232–1259. MR3113809
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. Stat. Methodol. 39 1–38. MR0501537
Dezeure, R., Bühlmann, P., Meier, L. and Meinshausen, N. (2015). Supplement to "High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi." DOI:10.1214/15-STS527SUPP.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360. MR1946581
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 849–911. MR2530322
Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101–148. MR2640659
Fan, J., Xue, L. and Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Ann. Statist. 42 819–849. MR3210988
Fithian, W., Sun, D. and Taylor, J. (2014). Optimal inference after model selection. Preprint. Available at arXiv:1410.2597.
Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York. MR0405726
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294
Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909. MR3277152
Knight, K. and Fu, W. (2000). Asymptotics for Lasso-type estimators. Ann. Statist. 28 1356–1378. MR1805787
Lee, J., Sun, D., Sun, Y. and Taylor, J. (2013). Exact post-selection inference, with application to the Lasso. Preprint. Available at arXiv:1311.6238.
Leeb, H. and Pötscher, B. M. (2003). The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations. Econometric Theory 19 100–142. MR1965844
Liu, H. and Yu, B. (2013). Asymptotic properties of Lasso + mLS and Lasso + Ridge in sparse high-dimensional linear regression. Electron. J. Stat. 7 3124–3169. MR3151764
Lockhart, R., Taylor, J., Tibshirani, R. J. and Tibshirani, R. (2014). A significance test for the Lasso. Ann. Statist. 42 413–468. MR3210970
Mandozzi, J. and Bühlmann, P. (2015). Hierarchical testing in the high-dimensional setting with correlated variables. J. Amer. Statist. Assoc. To appear. DOI:10.1080/01621459.2015.1007209. Available at arXiv:1312.5556.
McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models, 2nd ed. Chapman & Hall, London. MR0727836
Meier, L., Meinshausen, N. and Dezeure, R. (2014). hdi: High-Dimensional Inference. R package version 0.1-2.
Meinshausen, N. (2008). Hierarchical testing of variable importance. Biometrika 95 265–278. MR2521583
Meinshausen, N. (2015). Group-bound: Confidence intervals for groups of variables in sparse high-dimensional regression without assumptions on the design. J. R. Stat. Soc. Ser. B. Stat. Methodol. To appear. DOI:10.1111/rssb.12094. Available at arXiv:1309.3489.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436–1462. MR2278363
Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 417–473. MR2758523
Meinshausen, N., Meier, L. and Bühlmann, P. (2009). p-values for high-dimensional regression. J. Amer. Statist. Assoc. 104 1671–1681. MR2750584
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge Univ. Press, Cambridge. MR1744773
Reid, S., Tibshirani, R. and Friedman, J. (2013). A study of error variance estimation in Lasso regression. Preprint. Available at arXiv:1311.5274.
Shah, R. D. and Samworth, R. J. (2013). Variable selection with error control: Another look at stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 55–80. MR3008271
Shao, J. and Deng, X. (2012). Estimation in high-dimensional linear models with deterministic design matrices. Ann. Statist. 40 812–831. MR2933667
Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction, and Search, 2nd ed. MIT Press, Cambridge, MA. MR1815675
Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika 99 879–898. MR2999166
Taylor, J., Lockhart, R., Tibshirani, R. and Tibshirani, R. (2014). Exact post-selection inference for forward stepwise and least angle regression. Preprint. Available at arXiv:1401.3889.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 58 267–288. MR1379242
van de Geer, S. (2007). The deterministic Lasso. In JSM Proceedings 140. American Statistical Association, Alexandria, VA.
van de Geer, S. (2014). Statistical theory for high-dimensional models. Preprint. Available at arXiv:1409.8557.
van de Geer, S. (2015). χ2-confidence sets in high-dimensional regression. Preprint. Available at arXiv:1502.07131.
van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3 1360–1392. MR2576316
van de Geer, S., Bühlmann, P. and Zhou, S. (2011). The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J. Stat. 5 688–749. MR2820636
van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202. MR3224285
Wasserman, L. (2014). Discussion: "A significance test for the Lasso". Ann. Statist. 42 501–508. MR3210975
Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. Ann. Statist. 37 2178–2201. MR2543689
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 49–67. MR2212574
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942. MR2604701
Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242. MR3153940
Zou, H. (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320. MR2137327