Professional Documents
Culture Documents
11 STS356
11 STS356
584
MULTIPLE TESTING FOR EXPLORATORY RESEARCH 585
next phase of validation. The open-minded nature of FWER-based methods control the probability of
exploratory research can be described by three charac- making any false rejection at a prespecified rate. These
teristics: exploratory research is mild, flexible and post are the archetypical methods for confirmatory analy-
hoc. We explain these three terms below, contrasting sis. Such methods are clearly not mild, and they are
them with the more familiar characteristics of confir- not post hoc, as all data analysis decisions have to be
matory research. made before seeing the data. They can be argued to
An inferential procedure is mild if it allows some be flexible in a limited sense: it is possible to refrain
false positives among the selected hypotheses. This is from rejecting some of the rejected hypotheses with-
the most obvious characteristic of exploratory research. out violating control of the familywise error, but it is
Mildness is reasonable because false positives are ex- not possible to reject any hypotheses that were not se-
pected to be detected and removed in later validation lected by the procedure. A variant of familywise error,
experiments. Confirmatory research, in contrast, being k-FWER, has been formulated that controls the prob-
the final phase of the research cycle, is not mild but ability of making at least k ≥ 1 false rejections (Ro-
strict. mano and Wolf, 2007). Depending on k, methods with
An inferential procedure is flexible if it does not pre- this error rate are mild and are flexible in the same
scribe to the researcher which precise hypotheses to se- limited way as FWER itself is. Still, k-FWER-based
lect or not to select. For example, if the procedure ranks methods have so far only attracted theoretical interest
the hypotheses from most to least promising, but the as in these methods value of k may not be chosen post
researcher detects a common theme in the hypotheses hoc, and nobody knows how to choose k a priori in an
ranked second, third and fourth, he or she can choose to applied setting. A recent permutation method of Mein-
follow up on these three hypotheses and disregard the shausen (2006) can be seen as a method that controls
hypothesis that ranked first. In fact, the researcher may k-FWER simultaneously for all values of k, and conse-
also choose to follow up on the hypothesis that ranked
quently allows post hoc selection of k. This method is
last, if that fits the same theme. Such freedom, “picking
mild, post hoc, and quite flexible, although it does not
and choosing,” is an important part of the hypothesis-
allow a fully arbitrary selection of the set of rejected
generating aspect of exploratory research. In confirma-
hypotheses.
tory research, in contrast, selection of an interesting
False Discovery Rate (Benjamini and Hochberg,
and coherent collection of hypotheses has been done
1995) methods control the expected proportion of
prior to the experiment, so that flexible selection is not
falsely rejected hypotheses among the rejected hy-
necessary.
Finally, an inferential procedure is post hoc if it al- potheses. Such methods are not very well suited for tra-
lows all choices that are inherent to the procedure to be ditional confirmatory research and take a step toward
made after seeing the data. Specifically, how mild the exploratory research. FDR-based methods are certainly
procedure should be, and which precise set of hypothe- mild compared to FWER-based methods. However,
ses to select does not have to be chosen beforehand, but they are not post hoc, as the set of rejected hypotheses
may be chosen on the basis of the data. This is probably is completely determined after setting the FDR thresh-
the most distinguishing feature of exploratory research. old. Moreover, FDR-based methods are not flexible: as
The decision which inferences, and how many, to fol- shown by Finner and Roters (2001), and illustrated in
low up is often based on a mixture of considerations; a practical example by Marenne et al. (2009), selecting
these considerations are usually not purely statistical, a subset from the hypotheses that the FDR-controlling
and are often difficult to make explicit. In contrast, in procedure rejects may increase the false discovery rate
pure confirmatory research all choices regarding the above the prespecified level, just like, of course, select-
testing procedure have to be set in stone before data ing a superset can. Many variants of FDR have been
collection. proposed (e.g., Storey, 2002; Efron et al., 2001; Van
An ideal multiple hypothesis testing procedure for Der Laan, Dudoit and Pollard, 2004), but none of these
exploratory research should sanction a mild, post hoc has the desired three characteristics of the ideal multi-
and flexible approach. Multiple testing procedures gen- ple testing procedure for exploratory inference. Meth-
erally do not fulfil these criteria. The main present dis- ods have been formulated for selective inference (Ben-
tinction is between multiple testing methods based on jamini and Yekutieli, 2005), but these still do not allow
the familywise error (FWER), and variants, and meth- the full flexibility of exploratory selection.
ods based on the false discovery rate (FDR), and vari- In this paper we present an approach to multiple test-
ants of that. ing that does allow mild, flexible and post hoc infer-
586 J. J. GOEMAN AND A. SOLARI
ence. By the nature of the requirements of being flex- Hochberg, 2000; Langaas, Lindqvist and Ferkingstad,
ible and post hoc, such a procedure cannot prescribe 2005; Meinshausen and Bühlmann, 2005; Jin and Cai,
what hypotheses to reject, but can only advise. This 2007). The procedure outlined in this paper automati-
reverses the traditional roles of the user and the proce- cally gives a confidence set for the quantity π0 , because
dure in multiple testing. Rather than, as in FWER- or the collection of all hypotheses is one of the possible
FDR-based methods, to let the user choose the quality sets of rejected hypotheses that the user can choose to
criterion, and to let the procedure return the collection follow up, and the number of false rejections in that set
of rejected hypotheses, the user chooses the collection is exactly π0 .
of rejected hypotheses freely, and the multiple testing The outline of this paper is as follows. In the next
procedure returns the associated quality criterion. In section, we review the closed testing procedure and the
our view, the task of a multiple testing procedure in role of the concept of consonance in that procedure.
the exploratory context is not to dictate what to reject, We argue that non-consonant closed testing procedures
but to quantify the risk taken, in terms of the poten- have been underrated, and illustrate the type of addi-
tial number of false rejections, of following up on any tional inference that is possible from a non-consonant
specific set of hypotheses, chosen freely. closed testing procedure, but typically neglected, be-
This reversal of roles can be achieved while avoiding fore we argue how these additional inferences can be
the pitfall of proposing yet another variant of FWER or used to construct a confidence set. Section 3 applies the
FDR; it can be done simply by combining the famil- approach to selection of variables in a multiple regres-
iar concept of the confidence set, the discrete version sion model. Section 4 explores computational issues
of the confidence interval, with the well-known closed related to closed testing procedures and proposes sit-
testing procedure (Marcus, Peritz and Gabriel, 1976), uations in which shortcuts can be found. Finally, Sec-
widely recognized as a fundamental principle of multi- tion 5 looks at estimation of the number of correctly
ple testing. What we will show is that the closed testing rejected hypotheses. Software to perform the methods
procedure can be used to construct exact simultaneous described in this paper is available in the cherry pack-
confidence sets for the number of false rejections in- age, downloadable from CRAN.
curred when rejecting any specific set of hypotheses,
measuring the risk of following up on this particular set 2. NON-CONSONANT CLOSED TESTING
of hypotheses. Because the confidence sets are simul-
taneous over all possible sets of rejected hypotheses, The closed testing procedure (Marcus, Peritz and
the user is free to optimize, making the procedure valid Gabriel, 1976) is well known as a cornerstone of fam-
even under post hoc selection of the rejected set. ilywise error control. In this section we show how
The approach we propose is constrained by the re- closed testing may also be used to construct confidence
quirement that the number of hypotheses potentially sets for the number of falsely rejected hypotheses.
to be followed up is finite and that these hypotheses First we introduce some notation. Let H1 , . . . , Hn be
can be listed a priori. While this requirement rules out the collection of hypotheses of interest, the elementary
the most open-minded and unstructured applications of hypotheses, out of which we want to select hypothe-
exploratory research, many exploratory problems are ses to follow up. Some of these hypotheses are true; let
structured enough to fit the framework. T ⊆ {1, . . . , n} denote the unknown indices of true hy-
Our proposed procedure has strong links to k-FWER potheses. To use a closed testing procedure, we must
methods. In fact, the constructed confidence sets can consider not only the elementary hypotheses,but also
be seen as controlling the k-FWER, but simultaneously all intersection hypotheses of the form HI = i∈I Hi ,
for all values of k, thus sanctioning post hoc selection where I ⊆ {1, . . . , n}, I = ∅. Figure 1 illustrates the
of k and removing the requirement of selecting k a pri- intersection hypotheses formed by three hypotheses
ori, which traditionally plagues k-FWER-based meth- H1 , H2 and H3 in the form of a graph, with arrows
ods. Through this, our method links to the approach of denoting subset relationships (ignore the crosses for
Meinshausen (2006); we come back to this link in Sec- now).
tion 4.2. An intersection hypothesis HI is true whenever all
Another interesting link is with methods that have Hi , i ∈ I , are true, that is, whenever I ⊆ T . Let the
appeared in recent years for estimating π0 , the number closure C be the collection of all nonempty subsets of
of true hypotheses among the collection of all hypothe- the index set {1, . . . , n}. Each element of C corresponds
ses (Schweder and Spjøtvoll, 1982; Benjamini and to an intersection hypothesis, some of which are true.
MULTIPLE TESTING FOR EXPLORATORY RESEARCH 587
fidence that each of the seven sets contains at least It is interesting to investigate the price of post hoc
one truly relevant covariate. It is tempting to say that, selection relative to a priori selection. It is immedi-
beside waist, forearm must be relevant, since it is in- ate from the procedure that reducing the tested set of
cluded in all defining sets except the first. This is not hypotheses to a set R a priori is at least as powerful
warranted, however, as the sets are also consistent with as, and likely more powerful than, testing a larger set
alternative truths, such as that both shoulder and thigh and selecting the same set R post hoc. Post hoc se-
are relevant variables. What we can conclude, is that if lection will generally result in wider confidence sets
we select, for example, the set than a priori selection: this is the price to be paid for
the risk of overfit caused by post hoc selection. In
(2) R = {waist, forearm, calf, thigh} the example this price is surprisingly small. If the set
we have selected at least two relevant variables. Fur- R = {waist, forearm, height, thigh} would have been
thermore, we can also directly conclude that waist is a defined a priori as the set of hypotheses of interest,
relevant variable. treating the remaining covariates’ regression coeffi-
Coming back to the set R = {waist, forearm, height, cients as nuisance parameters, the confidence set of
thigh} selected by the forward–backward procedure, φ(R) improves from {1, 2, 3, 4} to {2, 3, 4}. The con-
we can find all 15 intersection hypotheses of the four fidence set for φ(R) for the set R defined in (2) does
hypotheses in R and check whether they were rejected not change.
by the closed testing procedure. We find that R ∈ X ,
but {forearm, height, thigh} ∈ / X , so that tα (R) = 3. 4. SHORTCUTS
Therefore, we can say with confidence that the selected In its standard form, application of a closed testing
set R contains one truly relevant hypothesis, but not procedure requires 2n − 1 tests to be performed. Smart
that it contains more than one. From this result, it is bookkeeping can reduce this number somewhat, espe-
clear that the p-values given for the selected model in cially if some intersection hypotheses high up in the hi-
Table 2 are highly untrustworthy. erarchy turn out non-significant, because it can be used
To find out how many of the original 10 hypothe- that if I ∈/ X , then immediately J ∈ / X for every J ⊂ I ,
ses are relevant, we take R to be the full set of 10 hy- which saves calculation of some of the tests. Still, even
potheses, and we calculate tα (R) = 8 for this set. Ap- with such tricks and with high computational power,
parently, we can conclude that there are at least two co- the closed testing procedure becomes computationally
variates among these 10 that are determinants of mass. intractable in its general form for a number of hypothe-
The smallest set that contains at least two relevant co- ses around 20–30, depending on the computational ef-
variates is the set (2). fort needed for each single test.
It should be noted that the set selected by variable se- If a large number of hypotheses is to be investigated,
lection procedure should not generally be expected to it is, therefore, convenient if the local tests can be cho-
be optimal from a confidence set perspective, because sen in such a way that not all these tests need to be
the perspectives of the two procedures are quite dif- calculated. Methods for avoiding calculation of some
ferent. This is best illustrated by thinking of a dataset of the hypothesis tests in the closed testing procedure
in which there are two covariates which are both highly are known as shortcuts. The literature on shortcuts in
correlated with each other, and with the response. Vari- the closed testing procedure has been focused mainly
able selection algorithms will always choose one of on consonant procedures, and on finding the rejected
the two variables, disregarding the second one as su- individual hypotheses (Grechanovsky and Hochberg,
perfluous given the first. The confidence set approach, 1999; Zaykin et al., 2002; Hommel, Bretz and Maurer,
however, will emphasize the uncertainty of the choice 2007; Bittman et al., 2009; Brannath and Bretz, 2010).
between the two variables, and will not reject any in- In this section, we loosely extend the concept of short-
tersection hypothesis that involves only one of the two cuts to non-consonant procedures, and discuss ways
covariates. To have confidence that at least one truly of finding tα (R) in a computationally easy way for
relevant covariate is included, both of the highly corre- specific choices of the local test, namely those based
lated variables must be selected. This reflects a differ- on Fisher combinations, on Simes’ inequality, and on
ence in emphasis between the two approaches: variable sums of normally distributed test statistics. We also
selection selects optimal sets, whereas the confidence demonstrate how the permutation-based procedure of
set approach quantifies the uncertainty inherent in the Meinshausen (2006) fits into the closed testing frame-
selection process. work. Finally, we touch upon the possible use of other
MULTIPLE TESTING FOR EXPLORATORY RESEARCH 591
has tα (R) < k, for example taking R as the set cor- est among the p-values {pj }j ∈I of the elementary hy-
responding to the i smallest p-values, choosing i as potheses with indices in I , and cim , 1 ≤ m ≤ n, 1 ≤
the largest value such that tα (R) < k still holds. The i ≤ m, are appropriately chosen critical values. With-
graph of Figure 2 simultaneously shows the numbers out loss of generality we can take cim ≤ cjm if i ≤ j .
of rejections allowed with k = 1, 2, 3, 4, . . . , which are We call local tests of the form (3) Simes type local
given by i = 0, 3, 6, 7, . . . . A major advantage of our tests because if we choose
approach over traditional k-FWER control approaches iα
(4) cim = ,
is that control is simultaneous over all rejected sets, m
and therefore over all choices of k. The procedure thus the test based on (3) is a valid α-level test of HI
bridges the gap between weak FWER control, related by Simes’ (1986) inequality. Simes’ inequality holds
to n-FWER, and strong 1-FWER control. Furthermore, whenever p-values of true null hypotheses are indepen-
rather than choosing k in advance, its value may be dent, but also under more general conditions, as inves-
picked after seeing the data without destroying the as- tigated by Sarkar (1998). In particular it holds for p-
sociated control property. The link between k-FWER values from identically distributed, nonnegatively cor-
and our approach is not limited to local tests based on related test statistics.
Fisher combinations. A variant of Simes’ inequality has been proposed by
An interesting feature of using Fisher’s method Hommel (1983). This variant uses critical values
in combination with the confidence set approach is iα
that the method may prove the presence of false null (5) cim = ,
Km · m
MULTIPLE TESTING FOR EXPLORATORY RESEARCH 593
−1
where Km = m v=1 v . Unlike the one based on discovery rate algorithm are very similar. For the ex-
Simes’ inequality, the local test defined by these criti- ample data of Section 4.1, the Simes local test leads to
cal values is of the correct level α whatever the depen- no rejections, which is consistent with finding no rejec-
dence structure of the original p-values. tions with the procedure of Benjamini and Hochberg
Local tests of the form (3) do not generally allow (1995).
shortcuts for the calculation of tα (R), but two useful Permutation testing can be a powerful tool to take
shortcuts are available if the critical values are chosen into account the joint distribution of the p-values.
in such a way that Useful shortcuts in a closed testing procedure with
permutation-based local tests can be constructed from
(6) cil ≤ cim whenever l ≥ m.
the work of Meinshausen and Bühlmann (2005) and
The first shortcut this condition allows is the general Meinshausen (2006). These authors describe a permu-
shortcut described in Appendix A, the conditions of tation-based way to find critical values ki , i = 1, . . . , n,
which are fulfilled whenever (6) holds. The second such that the probability under the complete null hy-
shortcut is even faster to calculate, but is less general: pothesis that p(i) ≤ ki for at least one i is bounded
it holds for rejected sets of the form R = {i : pi ≤ q} by α. The same method may in principle also be used
only. Let p(i) , i = 1, . . . , n, be short for p(i)
I with to find corresponding permutation critical values kiI
I = {1, . . . , n}. For R of the form mentioned, we have for every intersection hypothesis HI , I ∈ C , and there-
the shortcut fore a local test for every intersection hypothesis HI ;
a closed testing procedure can be made on the basis of
(7) fα (R) > max{Sr : 1 ≤ r ≤ #R},
these tests. However, unless the number of hypotheses
where Sr = max{s ≥ 0 : p(r) ≤ cr−s n }. The value of S
r is limited, this will be extremely time-consuming, and
can be interpreted as the number of more stringent criti- it would lead to a closed testing procedure for which
n , . . . , cn by which the p-value p
cal values cr−1 shortcuts are not available. A way out of this dilemma
1 (r) over-
shoots its mark crn . The number of false hypotheses is can be found by remarking that, by construction of the
larger than the greatest such overshoot of the ordered permutation critical values, we have kiI ≥ ki for every
p-values in the set R. The shortcut (7) is useful for i and I . Therefore, a valid, though conservative, local
making plots such as the one in Figure 2. We prove test may be constructed by simply using a procedure
this shortcut in Appendix B. A slightly more powerful of the form (3) with cim = ki for every 1 ≤ m ≤ n.
variant of the shortcut (7) is available if we have This local test fulfils the condition (6) and therefore
m−w admits shortcuts. With this choice of a local test, the
(8) cim ≤ ci−w for every 1 ≤ w < i. confidence set approach to multiple testing can be used
In this case, we have the same shortcut as (7), but with for every collection of test statistics for which permu-
n−s
Sr = max{s ≥ 0 : p(r) ≤ cr−s }. The proof of this state- tation is possible, opening up the possibility to use
ment is analogous to the proof for (7) and is also given permutation-based closed testing in genomics research.
in Appendix B. It is easy but tedious to show that the Meinshausen (2006) constructed simultaneous con-
Simes critical values (4) and (5) conform to (6) and that fidence bands for the number of falsely rejected hy-
the critical values (4) also conform to the stronger (8), potheses for rejected sets of the form R = {i : pi ≤ q},
so that the shortcuts may be used for these choices of similar to Figure 2, based on the permutation critical
the local test (see also Benjamini and Heller, 2008). values k1 , . . . , kn he found. Even though Meinshausen
As a side note, we remark that the critical values did not use closed testing, these confidence bands are
(4) and (5) are the same as the critical values used identical to the confidence bounds that would be ob-
in the false discovery rate controlling algorithms of tained when using the local tests (3) with cim = ki
Benjamini and Hochberg (1995) and Benjamini and in combination with the shortcut (7). By exploiting
Yekutieli (2001), respectively. The correspondence be- the shortcut of Appendix A rather than this simpler
tween the critical values creates a connection between shortcut, it becomes possible to extend Meinshausen’s
the corresponding methods. The set that has been re- method to be able to find confidence bounds for τ (R)
jected by the false discovery rate algorithm always has for sets R not of the form R = {i : pi ≤ q}. Alter-
fα (R) > 0 based on the closed testing procedure with natively, for a very small number of tests, the full
the corresponding local test. Note that the assumptions permutation-based closed testing procedure may be
underlying each local test and its corresponding false used, which could be more powerful.
594 J. J. GOEMAN AND A. SOLARI
4.3 Normally Distributed Test Statistics and Solari (2010), the focus level procedure of Goeman
and Mansmann (2008), and the gatekeeping method of
Workable local tests may also be constructed on the
Edwards and Madsen (2007). These procedures allow
basis of normally distributed scores. Consider the sit-
familywise error inference on a collection of hypothe-
uation that we have scores z1 , . . . , zn for each hypoth-
ses comprising the elementary hypotheses and a selec-
esis H1 , . . . , Hn , respectively, which are standard nor- tion from the 2n − 1 intersection hypotheses, and may
mally distributed if their respective null hypothesis is produce non-consonant rejections on these intersection
true, and we would reject Hi one-sidedly when zi is hypotheses. The results of these procedures may be
large. This situation occurs quite frequently in prac- used as a basis for constructing confidence intervals in
tice, at least asymptotically, for example if we do many the same way as the results of the closed testing proce-
one-sided binomial z-tests. A sensible
choice for a test dure were used in Section 2.
statistic for a local test is ZI = i∈I zi . Consider first
the case in which the scores of true null hypotheses 5. ESTIMATION
are independent. In that case ZI is normally distributed
with mean 0 and√variance #I , and we may reject HI In addition to the confidence interval, it can some-
whenever ZI ≥ #I · (1 − α), where is the stan- times be informative to have a point estimate of the
dard normal distribution function. If the scores are not number of true null hypotheses among a set of in-
independent but only jointly normally distributed, we terest. Estimation of the number of true null hy-
have the following, more conservative result. In that potheses has been a subject of recent interest in the
case ZI is normally distributed with mean 0 but un- context of genomic data analysis, and several au-
known variance. Let be the correlation matrix of thors (Schweder and Spjøtvoll, 1982; Benjamini and
{zi }i∈I , then the variance of ZI is given by 1T 1, Hochberg, 2000; Langaas, Lindqvist and Ferkingstad,
2005; Meinshausen and Bühlmann, 2005; Jin and Cai,
where 1 is a vector of ones of length #I . This vari-
2007) have proposed methods for estimating τ (R), al-
ance is bounded by #I times the largest eigenvalue of
though for R = {1, . . . , n} only. The quantity τ (R) for
, and therefore by (#I )2 . It follows that for α ≤ 1/2,
R = {1, . . . , n} is commonly referred to as π0 .
we may reject HI whenever ZI ≥ #I · (1 − α). This
The confidence intervals of the previous sections are
type of test was used by Van De Wiel, Berkhof and Van
easily adapted to produce a point estimate of τ (R) for
Wieringen (2009).
any set R. We propose to use the value t1/2 (R) as an
Both tests are exchangeable and lead to easy short-
estimate, the upper bound of the confidence interval,
cuts in the sense of Appendix A. In practice, the test
calculated at the significance level α = 1/2. This es-
for the non-independent case can be highly conserva-
timate can be seen as a conservative median estimate
tive if used for small values of α, unless the scores are of the true quantity τ (R): by the properties of tα (R)
strongly positively correlated. One case to note, how- derived in the previous sections, t1/2 (R) exceeds the
ever, is the case that α = 1/2, when the critical value value of τ (R) with a probability that is bounded above
is 0 for both the independent and the general situation, by 1/2. Furthermore, this property holds simultane-
negating the conservativeness of the latter. This situa- ously for all R by the simultaneity of the confidence
tion is relevant for the method of Section 5. interval on which it is based, which makes the defining
4.4 Other Types of Shortcuts property of the estimate robust against selection of R.
The estimate can be used to get an impression where
Shortcuts of the form described in the appendices the “midpoint” of the confidence interval is. Applying
can only be used within a restricted class of local tests the procedure to the physical dataset of Section 3 at
that is calculated as an exchangeable function of per- α = 1/2, we find the following defining rejections:
hypothesis statistics. Other types of shortcuts may be
devised for other classes of local tests in the future. {waist}
A very different way to construct confidence inter- {forearm}
vals of τ (R) while avoiding calculation of the com- {height}
plete closed testing procedure is to use a different mul- {chest, calf, thigh}
tiple testing procedure that still allows non-consonant {neck, calf, thigh}
rejection of some intersection hypotheses. Examples of {thigh, head}
such procedures are the tree-based testing procedure The estimated number of true null hypotheses among
of Meinshausen (2008), recently improved by Goeman all 10 hypotheses, for which the 95% confidence set
MULTIPLE TESTING FOR EXPLORATORY RESEARCH 595
was {0, . . . , 8}, is calculated as 6, which points to an es- researcher as relevant and interesting may have arisen
timated number of four relevant variables in the regres- due to chance, and turn out to be false positives in
sion. For the set R = {waist, forearm, height, thigh} follow-up experiments. To protect a researcher against
of four variables selected by the stepwise procedure, too many disappointments of this type, it is important
the number of truly relevant variables is estimated to make a realistic assessment of the risk taken when
at 3 (the 95% confidence set for this quantity was following up on a certain collection of hypotheses.
{1, 2, 3, 4}). In contrast, the set (2) which included with In this paper, we have presented an approach to mul-
95% confidence at least two relevant variables, also tiple testing that is especially designed for the require-
contains an estimated number of two relevant vari- ments of exploratory research, and which reverses the
ables. The smallest rejected set that contains an es- way that multiple testing methods are typically used.
timated number of four relevant variables is the set Rather than letting the user decide on the error rate, and
R = {waist, forearm, height, thigh, head}. We can re- the procedure on the rejections, we let the user decide
port this optimized set without fear of overfit, because on the rejections, and the procedure on the error rate.
the property that the number of truly relevant covariates Our approach does not rely on the definition of any new
is overestimated with probability at most 1/2 holds si- error rates, and has not even required the design of a
multaneously for all rejected sets. new algorithm. The approach uses the classical con-
In the adverse event data of Section 4.1 the number cept of the simultaneous confidence set, together with
of true null hypotheses among the 16 hypotheses is es- the equally classical closed testing procedure, although
timated at two using a Fisher local test at α = 1/2. In both in a novel way.
this case, all rejections turn out to be consonant: re- The end result of the procedure is a collection of
jecting the 14 hypotheses with smallest p-values leads confidence sets for the number of falsely rejected hy-
to an estimated number of 0 false discoveries. If we potheses for all possible choices of the rejected set.
use Simes rather than Fisher for the local test, we even The most important property of these confidence sets
obtain an estimated number of 0 true null hypotheses is that they are simultaneous. This simultaneity pro-
among all 16 hypotheses. tects the user of the procedure against overoptimism
We warn against using the estimate of the number
resulting from post hoc selection of the rejected set,
of falsely rejected hypotheses by itself, without the
and removes many of the problems traditionally asso-
associated confidence interval. To see the danger of
ciated with cherry-picking from a large set.
this, consider the simplest “multiple testing problem”
Finally, the approach is very general, and the limits
in which only a single null hypothesis is tested. The es-
on its useability are mostly computational. The most
timation procedure of this section would estimate this
important assumption we make is that the number of
hypothesis as true whenever the p-value is greater than
hypotheses potentially to be followed up is finite, and
1/2, and as false whenever it is smaller than or equal to
that these hypotheses may be enumerated before start-
1/2. This seems generally too lenient a conclusion to
be a viable strategy, although it may be useful in some ing the experiment. Aside from that, the ability of the
highly exploratory and risk-seeking settings. In these closed testing procedure to work with any choice of
situations, the special status of α = 1/2 in the shortcut a local test makes that procedure very flexible. Only
of Section 4.3 may be of interest. if the number of hypotheses becomes large, computa-
tional issues limit the choice of local tests to those for
6. CONCLUSION which shortcuts are available. The shortcuts described
in this paper already cover a wide range of applica-
All exploratory research is essentially picking and tion areas. More and improved shortcuts are likely to
choosing. From a large number of potential hypothe- be found in the future.
ses to follow up, the researcher selects for further in-
vestigation those hypotheses or sets of hypotheses that APPENDIX A: SHORTCUTS FOR EXCHANGEABLE
stand out in the researcher’s eyes. This selection is LOCAL TESTS
made in complete freedom. The notion that any statis-
tical method would dictate what the researcher should We present a fairly general method for constructing
find interesting is contrary to the spirit of exploratory shortcuts in the closed testing procedure which can be
research. used for finding tα (R) and are appropriate for the meth-
However, a well-known risk of picking and choos- ods in Section 4. This shortcut delineates a class of lo-
ing is overfit, “cherry-picking.” Patterns that strike the cal tests for which tα (R) can be calculated for any R
596 J. J. GOEMAN AND A. SOLARI
by calculating only n2 , rather than 2n tests. We give Consequently, K ∈ X for every K ⊆ R with #K ≥ r −
the shortcut for a p-value-based method. The shortcut s, so that tα (R) < r − s, and (1) follows. To obtain
for methods based on other scores (e.g., Section 4.3) is the final statement (7), remark that fα (R) ≥ fα (S) for
completely analogous. every R ⊇ S, and apply the bound (1) on all S ⊂ R of
Assume that the local test is exchangeable, that is, the form specified.
rejection of HI , I ∈ C , only depends on the set PI = Analogously, if (8) holds, choose K and J as above,
{pi }i∈I of raw p-values, and not on the collection I n−s
and let s̃ = #(R \ J ) ≤ s. Then p(r) ≤ cr−s implies that
itself. Let δ be the function that maps from a set of n−s̃
p-values P to rejection, δ(P ) = 1 if the collection
J
p(r−s̃) ≤ p(r) ≤ cr−s
n−s
≤ cr−s̃ ≤ cr−s̃
#J
,
P = PI would lead to rejection of HI , and δ(P ) = 0 noting, in the last inequality, that #J ≤ n − s̃ and that
otherwise. Further, suppose that (8) implies (6). From this result (1) and (7) follow as
(9) δ({p1 , . . . , pk }) ≥ δ({q1 , . . . , qk }) above.
whenever p1 ≤ q1 , . . . , pk ≤ qk , and that
REFERENCES
(10) δ(q ∪ P ) ≥ δ(P )
B ENJAMINI , Y. and H ELLER , R. (2008). Screening for partial con-
whenever q ≤ min(p ∈ P ). junction hypotheses. Biometrics 64 1215–1222. MR2522270
If these assumptions hold, it can be shown that for B ENJAMINI , Y. and H OCHBERG , Y. (1995). Controlling the false
any s < #R, discovery rate: A practical and powerful approach to multiple
testing. J. Roy. Statist. Soc. Ser. B 57 289–300. MR1325392
s+1 ∪ Q̄j ) = 1 for every j ∈ {0, . . . , mR }
(11) δ(QR R
B ENJAMINI , Y. and H OCHBERG , Y. (2000). On the adaptive con-
trol of the false discovery rate in multiple testing with indepen-
implies tα (R) ≤ s. Here, QR s+1 is the set of the s + 1 dent statistics. Journal of Educational and Behavioral Statistics
largest p-values of hypotheses in R; Q̄R j is the set of
25 60–83.
B ENJAMINI , Y. and Y EKUTIELI , D. (2001). The control of the
the j largest p-values of hypotheses not in R, and mR
false discovery rate in multiple testing under dependency. Ann.
is the number of p-values not in R that are larger than Statist. 29 1165–1188. MR1869245
the smallest p-value in QR s+1 . B ENJAMINI , Y. and Y EKUTIELI , D. (2005). False discovery rate-
To show this, note that by assumptions (9) and (10), adjusted multiple confidence intervals for selected parameters.
equation (11) implies that J. Amer. Statist. Assoc. 100 71–81. MR2156820
B ITTMAN , R. M., ROMANO , J. P., VALLARINO , C. and
s+1 ∪ PI ) = 1
δ(QR W OLF, M. (2009). Optimal testing of multiple hypotheses with
common effect direction. Biometrika 96 399–410. MR2507151
for every I ∈ C , and that therefore, by assumption (9), B RANNATH , W. and B RETZ , F. (2010). Shortcuts for locally con-
δ(PJ ∪ PI ) = 1 sonant closed test procedures. J. Amer. Statist. Assoc. 105 660–
669. MR2724850
for every I ∈ C and for every J ⊆ R for which #J = E DWARDS , D. and M ADSEN , J. (2007). Constructing multiple test
s + 1. Consequently, J ∈ X for every J ⊆ R for which procedures for partially ordered hypothesis sets. Stat. Med. 26
5116–5124. MR2412460
#J = s + 1, so that tα (R) ≤ s by definition.
E FRON , B., T IBSHIRANI , R., S TOREY, J. D. and T USHER , V.
(2001). Empirical Bayes analysis of a microarray experiment.
APPENDIX B: SHORTCUTS FOR SIMES-TYPE J. Amer. Statist. Assoc. 96 1151–1160. MR1946571
LOCAL TESTS F INNER , H. and ROTERS , M. (2001). On the false discovery rate
and expected type I errors. Biom. J. 43 985–1005. MR1878272
Next, we prove the shortcut (7) for Simes-type local G OEMAN , J. J. and M ANSMANN , U. (2008). Multiple testing on
tests. Let R = {i : pi ≤ q} be a rejected set, and assume the directed acyclic graph of gene ontology. Bioinformatics 24
that condition (6) holds. 537–544.
First, let r = #R, and remark that p(r) ≤ cr−s n , for G OEMAN , J. J. and S OLARI , A. (2010). The sequential rejection
some s ≥ 0, implies that principle of familywise error control. Ann. Statist. 38 3782–
3810. MR2766868
(1) fα (R) > s. G RECHANOVSKY, E. and H OCHBERG , Y. (1999). Closed proce-
dures are better and often admit a shortcut. J. Statist. Plann.
To see why this is true, choose any K ⊆ R with #K ≥ Inference 76 79–91. MR1673341
r − s and any J ⊇ K. Remark that p(r) ≤ cr−s
n implies H ERSON , J. (2009). Data and Safety Monitoring Committees in
that Clinical Trials. CRC Press, Boca Raton, FL.
H OMMEL , G. (1983). Tests of the overall hypothesis for arbitrary
J
p(r−s) ≤ p(r−s)
K
≤ p(r) ≤ cr−s
n
≤ cr−s
#J
. dependence structures. Biometrical J. 25 423–430. MR0735888
MULTIPLE TESTING FOR EXPLORATORY RESEARCH 597
H OMMEL , G., B RETZ , F. and M AURER , W. (2007). Powerful sociations under general dependence structures. Biometrika 92
short-cuts for multiple testing procedures with special ref- 893–907. MR2234193
erence to gatekeeping strategies. Stat. Med. 26 4063–4073. ROMANO , J. P. and W OLF, M. (2007). Control of generalized
MR2405792 error rates in multiple testing. Ann. Statist. 35 1378–1408.
H UANG , Y. and H SU , J. C. (2007). Hochberg’s step-up method: MR2351090
Cutting corners off Holm’s step-down method. Biometrika 94 S ARKAR , S. K. (1998). Some probability inequalities for ordered
965–975. MR2416802 MTP2 random variables: A proof of the Simes conjecture. Ann.
J IN , J. and C AI , T. T. (2007). Estimating the null and the propor- Statist. 26 494–504. MR1626047
tional of nonnull effects in large-scale multiple comparisons. S CHWEDER , T. and S PJØTVOLL , E. (1982). Plots of p-values to
J. Amer. Statist. Assoc. 102 495–506. MR2325113 evaluate many tests simultaneously. Biometrika 69 493–502.
L ANGAAS , M., L INDQVIST, B. H. and F ERKINGSTAD , E. (2005). Š IDÁK , Z. (1967). Rectangular confidence regions for the means
Estimating the proportion of true null hypotheses, with appli- of multivariate normal distributions. J. Amer. Statist. Assoc. 62
cation to DNA microarray data. J. R. Stat. Soc. Ser. B Stat. 626–633. MR0216666
Methodol. 67 555–572. MR2168204
S IMES , R. J. (1986). An improved Bonferroni procedure for mul-
L ARNER , M. (1996). Mass and its relationship to physical mea-
tiple tests of significance. Biometrika 73 751–754. MR0897872
surements. MS305 Data Project, Dept. Mathematics, Univ.
S TOREY, J. D. (2002). A direct approach to false discovery rates.
Queensland, Australia.
J. R. Stat. Soc. Ser. B Stat. Methodol. 64 479–498. MR1924302
M ARCUS , R., P ERITZ , E. and G ABRIEL , K. R. (1976). On closed
T UKEY, J. (1980). We need both exploratory and confirmatory.
testing procedures with special reference to ordered analysis of
variance. Biometrika 63 655–660. MR0468056 Amer. Statist. 34 23–25.
M ARENNE , G., DALMASSO , C., P ERDRY, H., G ÉNIN , E. and VAN D E W IEL , M., B ERKHOF, J. and VAN W IERINGEN , W.
B ROËT, P. (2009). Impaired performance of FDR-based strate- (2009). Testing the prediction error difference between 2 pre-
gies in whole-genome association studies when SNPs are ex- dictors. Biostatistics 10 550–560.
cluded prior to the analysis. Genetic Epidemiology 33 45–53. VAN DER L AAN , M. J., D UDOIT, S. and P OLLARD , K. S. (2004).
M EINSHAUSEN , N. (2006). False discovery control for multiple Augmentation procedures for control of the generalized family-
tests of association under general dependence. Scand. J. Statist. wise error rate and tail probabilities for the proportion of false
33 227–237. MR2279639 positives. Stat. Appl. Genet. Mol. Biol. 3 Art. 15, 27 pp. (elec-
M EINSHAUSEN , N. (2008). Hierarchical testing of variable impor- tronic). MR2101464
tance. Biometrika 95 265–278. MR2521583 Z AYKIN , D., Z HIVOTOVSKY, L. A., W ESTFALL , P. and W EIR , B.
M EINSHAUSEN , N. and B ÜHLMANN , P. (2005). Lower bounds for (2002). Truncated product method for combining p-values. Ge-
the number of false null hypotheses for multiple testing of as- netic Epidemiology 22 170–185.