
J. R. Statist. Soc. B (2002) 64, Part 1, pp. 63–77

Familywise robustness criteria for multiple-comparison procedures

Burt Holland
Temple University, Philadelphia, USA

and Siu Hung Cheung


Chinese University of Hong Kong, Shatin, People's Republic of China

[Received March 2000. Final revision July 2001]

Summary. A criticism of multiple-comparison procedures is that the family of inferences over which
an error rate is controlled is often arbitrarily selected, yet the conclusion may depend heavily on the
choice of the family. Such ambiguity is most likely in large exploratory studies requiring numerous
simultaneous inferences. In ambiguous situations it is desirable that results of multiple-comparison
procedures depend little on the chosen family. To assess this, we propose several familywise robustness
criteria to evaluate such procedures, and we find some of their properties theoretically and
by simulation. Procedures that control the false discovery rate seem to be familywise robust.
Keywords: False discovery rate; Familywise error rate; Familywise robustness; Simultaneous
inference; Type A robustness; Type R robustness

1. Introduction
Multiple-comparisons techniques are beginning to be used for considerably larger families of
related inferences than was typical in the past. For example, Drigalenko and Elston (1997)
and Yekutieli and Benjamini (1999) described situations where we would want to conduct
hundreds of simultaneous inferences. In such large studies there is an increased chance that a
particular inference can logically appear in any of several inference families of varying sizes
so a choice of the containing family may be arbitrary or ambiguous. Under such conditions,
insensitivity to the family specification may be a desirable property for a multiple-comparison
procedure (MCP). For example, in a multicentre clinical trial, the inferences at a particular
centre may be the family of interest to the centre's manager, whereas the overall trial manager
is more likely to focus on the union of inferences at all centres. See Westfall and Young
(1993) and Westfall et al. (1999) for some guidelines for the selection of an appropriate
family.
In this paper we attempt a systematic investigation to compare MCPs in regard to altering
the inferential family. We limit ourselves here to a consideration of Bonferroni-type procedures
that use p-values from univariate distributions of test statistics. By MCP we mean a
procedure that controls the tendency for the number of type I errors in a family of inferences
to increase as the family size increases, using either the familywise error rate FWE or the false
discovery rate FDR criterion. We do not regard a per comparison procedure that ignores

Address for correspondence: Burt Holland Department of Statistics, Temple University, Philadelphia, PA 19122-
2585, USA.
E-mail: bholland@sbm.temple.edu

© 2002 Royal Statistical Society 1369–7412/02/64063



multiplicity to be a special type of MCP. Since we focus on how the probability of rejection or
acceptance of a particular hypothesis changes as the size of its containing family varies, we
propose and investigate measures that are tailored for this purpose rather than measures of
power. These new measures are intended to supplement, but not to supplant, the use of
power as a criterion for the choice of a procedure.
The new measures can be used to evaluate and compare any MCPs, even ones yet to be
developed. Among the procedures that we have evaluated with these measures are two that
control the false discovery rate rather than the traditional familywise error rate, since our
motivation for undertaking this investigation stems from papers that compared the power of
FDR and FWE controlling procedures.
Williams et al. (1999) compared the means of test scores for 2 years for each of a family of
m = 45 populations, controlling for multiplicity by using each of several MCPs. Then they
subdivided this family into successively smaller families of sizes m = 15, 12, 6 motivated by
logically reasonable partitionings of the original family. They noted that the FDR controlling
procedure of Benjamini and Hochberg (1995) described below tended to be far more
consistent than procedures that controlled FWE in terms of whether a particular hypothesis
was rejected, as the family in which this hypothesis was located changed in size. A widely
circulated early draft of Williams et al. (1999) was apparently the first mention of this
property of FDR control.
An analysis of a well-known data set in Benjamini and Yekutieli (2001) demonstrated that
MCPs can differ appreciably in their sensitivities to changes in the family definition.
The false discovery rate was proposed by Benjamini and Hochberg (1995) as an alternative
to FWE as a criterion for simultaneous type I error control. They also gave an easily
understood procedure (that we refer to as BH) for controlling FDR at a designated value, say
q*. FDR is defined as the expected value of V/R, where R is the number of rejected null
hypotheses and V is the number of rejected null hypotheses that are true, noting that V/R is
taken to be 0 when R = 0. FDR control may be more appropriate than FWE control in
certain exploratory situations, and situations that cannot be classified as either exploratory or
confirmatory such as when an overall conclusion for a family of inferences is not governed by
the conclusion for one particular inference in the family. Our examples in Section 5 may fall
into one of these categories. FWE is the preferred error criterion for confirmatory studies.
Benjamini and Hochberg (1995) proved that their procedure controls FDR at the
designated level if the test statistics are independent, and that it is considerably more
powerful than any of the better improved FWE controlling Bonferroni procedures. Although
for most MCPs the power to reject false hypotheses decreases rapidly as the family size
increases, Benjamini and Hochberg (1995) provided simulation evidence that for independent
normally distributed test statistics the power of their procedure declines much more slowly
than that of either the Hochberg (1988) or Bonferroni procedures. Benjamini and Yekutieli
(2001) showed that the BH procedure controls FDR for test statistics which are positive
regression dependent on each one from a subset of true null hypotheses.
Another promising FDR controlling MCP is the adaptive procedure of Benjamini and
Hochberg (2000). This procedure involves using the data to provide a `prior' estimate of the
number of true null hypotheses in the family. They demonstrate by simulation that, if the test
statistics are independent, this procedure controls FDR and it is substantially more powerful
than the BH procedure. However, it has the unattractive property of potentially allowing
the rejection of a hypothesis having a p-value in excess of the designated FDR. Therefore,
we shall only consider a modified form of this adaptive procedure that does not reject a
hypothesis unless its p-value is less than the designated FDR. The algorithm for implementing
this procedure is relatively complicated. We use adaptive to refer to this modification.
The Hochberg (1988) improved Bonferroni procedure, henceforth referred to as the
Hochberg procedure, is perhaps the most popular improved Bonferroni MCP. It is easier to
understand and implement, and only very slightly less powerful than other FWE controlling
Bonferroni variants. Hochberg (1988) proved that his procedure controls FWE, henceforth
denoted by α, when the test statistics are independent. Sarkar and Chang (1997) proved that
the Hochberg procedure controls FWE when the distribution of the test statistics has
exchangeable positive dependence, and Sarkar (1998) proved that the Hochberg procedure
controls FWE when the distribution satisfies the ordered MTP2 condition.
The new measures are defined in the following section after the presentation of some
notation. Section 3 presents some theoretical properties of these measures. Measures for
assessing familywise robustness are defined in Section 4. Additional explorations by
simulation appear in Section 5, and we conclude with a discussion in Section 6. The proofs of
theorems 3 and 4 are contained in Appendix A.

2. Measures of robustness to changing family size


We begin with some notation and definitions. Let Ωn = {H1, H2, ..., Hn} be a family of n
hypotheses and let ωm = {H1, H2, ..., Hm} be a subset of Ωn, where m < n, i.e. n is the size
of the larger family and m is the size of the smaller family. We consider a particular
hypothesis Hi ∈ ωm ⊂ Ωn and study how the result of testing it changes if we either enlarge
the family of hypotheses from ωm to Ωn or contract the family from Ωn to ωm.
If the testing result (reject or accept) for Hi is the same regardless of whether Hi ∈ ωm or
Ωn, we have made a family size consistent decision. If the testing result differs with the family
selected, the decision is family size inconsistent. A testing procedure that tends to make
family size consistent decisions is henceforth referred to as familywise robust.
Table 1 summarizes conditional decisions from testing Hi in one of the families given the
result of testing it in the other family and defines additional terminology in its third column.
We provide some further notation that will be used in the statement and proofs of our
theorems.
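\[
\begin{aligned}
p_{i,rn|rm} &= P(H_i \text{ is rejected in } \Omega_n \mid H_i \text{ is rejected in } \omega_m),\\
p_{i,an|am} &= P(H_i \text{ is accepted in } \Omega_n \mid H_i \text{ is accepted in } \omega_m),\\
p_{i,rm|rn} &= P(H_i \text{ is rejected in } \omega_m \mid H_i \text{ is rejected in } \Omega_n),\\
p_{i,am|an} &= P(H_i \text{ is accepted in } \omega_m \mid H_i \text{ is accepted in } \Omega_n).
\end{aligned}
\]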
We next de®ne four types of perfect familywise robustness.

Table 1. Conditional decisions from testing Hi

Decision              Given that Hi is    Description of decision

Hi rejected in Ωn     Rejected in ωm      Type R enlargement familywise robust
Hi accepted in Ωn     Accepted in ωm      Type A enlargement familywise robust
Hi rejected in ωm     Rejected in Ωn      Type R contraction familywise robust
Hi accepted in ωm     Accepted in Ωn      Type A contraction familywise robust

(a) If pi,rn|rm = 1 for all i (1 ≤ i ≤ m < n) and for all ωm ⊂ Ωn, the testing procedure is said
to be perfectly type R enlargement familywise robust.
(b) If pi,an|am = 1 for all i (1 ≤ i ≤ m < n) and for all ωm ⊂ Ωn, the testing procedure is
said to be perfectly type A enlargement familywise robust.
(c) If pi,rm|rn = 1 for all i (1 ≤ i ≤ m < n) and for all ωm ⊂ Ωn, the testing procedure is said
to be perfectly type R contraction familywise robust.
(d) If pi,am|an = 1 for all i (1 ≤ i ≤ m < n) and for all ωm ⊂ Ωn, the testing procedure is
said to be perfectly type A contraction familywise robust.
In using the familywise robustness concept to choose between competing procedures, how do
the four measures compare in importance? Since an incorrect rejection of a hypothesis is
generally considered a more serious error than its incorrect acceptance, consistency in rejection
decisions, measured by type R familywise robustness, is usually more important than type A
familywise robustness. Family size enlargement seems more likely than family size contraction.
For example, when an analyst decides that the proposed family of inferences is insufficient to
reach a decision, he or she may choose to enlarge the experiment and hence the family. Thus the
single most important of the four familywise robustness measures is usually type R enlargement
familywise robustness. A situation where family size contraction would be relevant is where the
analyst has personal responsibility for only one segment of a large study, and hence wishes to
focus her attention on the subfamily of inferences pertaining to that segment.

3. Some theoretical properties of the familywise robustness measures


The following result relates the condition of perfectly type A enlargement familywise
robustness to the probabilities that are associated with the other three types. The proof
follows easily from elementary probability arguments.
Theorem 1. If an MCP is perfectly type A enlargement familywise robust, then
(a) the MCP is perfectly type R contraction familywise robust,
(b) pi,rn|rm = P(Hi is rejected in Ωn)/P(Hi is rejected in ωm) and
(c) pi,am|an = P(Hi is accepted in ωm)/P(Hi is accepted in Ωn).
When the conditions of this theorem hold, pi,rn|rm may be interpreted as the relative change in
the probability of rejection of Hi as the family size increases, and pi,am|an is interpretable as the
relative change in the probability of acceptance of Hi as the family size decreases.
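The elementary argument behind theorem 1 can be sketched as follows. This is our reconstruction rather than the authors' proof; the event symbols R_n, R_m (H_i rejected in the larger and smaller family) and A_n, A_m (their complements) are notation introduced only for this sketch, and all conditioning events are assumed to have positive probability.

```latex
% Assumed notation: R_n, R_m = {H_i rejected in \Omega_n, \omega_m};  A_n, A_m = complements.
\begin{align*}
p_{i,an|am} = 1
  &\iff P(A_m \cap R_n) = 0 \\
  &\implies P(R_n \cap R_m) = P(R_n)
   \quad\text{and}\quad P(A_m \cap A_n) = P(A_m), \\[4pt]
\text{(a)}\quad p_{i,rm|rn} &= \frac{P(R_m \cap R_n)}{P(R_n)} = 1, \\
\text{(b)}\quad p_{i,rn|rm} &= \frac{P(R_n \cap R_m)}{P(R_m)} = \frac{P(R_n)}{P(R_m)}, \\
\text{(c)}\quad p_{i,am|an} &= \frac{P(A_m \cap A_n)}{P(A_n)} = \frac{P(A_m)}{P(A_n)}.
\end{align*}
```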
Similarly to theorem 1, perfect type R enlargement familywise robustness is equivalent to
perfect type A contraction familywise robustness. These dual relationships between family
size enlargement and family size contraction hold approximately if there is near perfect
familywise robustness. For example, suppose that pi,rn|rm is close to 1. This means that it is
unusual for a hypothesis to be rejected in the smaller family but accepted in the larger family.
Thus there will be few hypotheses that are accepted in the larger family but rejected in the
smaller family and therefore most hypotheses accepted in the larger family will also be
accepted in the smaller family, i.e. pi,am|an is close to 1. Similarly, if pi,rm|rn is close to 1 then so
also is pi,an|am. Also, since it is usually easier to accept a hypothesis in a large family than the
same hypothesis in a smaller family, we typically expect to find that pi,an|am > pi,am|an.
The next theorem requires a few preliminary de®nitions.
Definition 1. Let D be the class of step-down MCPs that satisfies the following two
conditions.
(a) There is a monotonic increasing sequence of constants {αj,n} such that the hypotheses
corresponding to P(1), P(2), ..., P(j) are rejected when P(h) ≤ αh,n for h = 1, ..., j.
(b) For any hypothesis which is in both the smaller and the larger families, having rank i
among p-values in Ωn and rank i* among p-values in ωm,
αi,n − αi*,m ≤ 0.
Definition 2. Let U be the class of step-up MCPs that satisfies the following two conditions.
(a) There is a monotonic increasing sequence of constants {αj,n} such that the hypotheses
corresponding to P(j), P(j+1), ..., P(n) are retained when P(h) > αh,n for h = j, ..., n.
(b) The monotonic sequence {αi,n} for testing hypotheses in Ωn and the monotonic
sequence {αj,m} for testing hypotheses in ωm satisfy αn−i+1,n − αm−i+1,m ≤ 0 for all
i = 1, ..., m.
Theorem 2. All MCPs that are in either U or D as defined above are perfectly type A
enlargement familywise robust.
The proof appears in Holland and Cheung (2000), as do corollaries to this theorem which
establish that the procedures of Bonferroni, Holm (1979), Hommel (1988), Hochberg (1988)
and Rom (1990) are all perfectly type A enlargement familywise robust.
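The type A enlargement property asserted for the Hochberg procedure can be checked empirically. The sketch below is an illustration in Python, not the authors' code; the function names, the mixture used to generate p-values and the family sizes m = 4, n = 10 are assumptions chosen for the example. It uses the usual step-up rule of Hochberg (1988), rejecting the k hypotheses with the smallest p-values, where k is the largest rank i with P(i) ≤ α/(n − i + 1), and counts the cases in which a hypothesis accepted in the smaller family is rejected in the larger one.

```python
import random


def hochberg_reject(pvals, alpha=0.05):
    """Hochberg (1988) step-up rule: return a list of reject flags."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices by ascending p-value
    k = 0
    for rank in range(n, 0, -1):  # start from the largest p-value
        if pvals[order[rank - 1]] <= alpha / (n - rank + 1):
            k = rank  # reject the k smallest p-values
            break
    reject = [False] * n
    for r in range(k):
        reject[order[r]] = True
    return reject


def count_type_a_enlargement_violations(trials=2000, m=4, n=10, alpha=0.05, seed=1):
    """Count hypotheses accepted in the first m but rejected in all n (illustrative)."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        # Assumed mixture: roughly half the p-values are small, so rejections occur
        pvals = [rng.random() * (0.02 if rng.random() < 0.5 else 1.0) for _ in range(n)]
        small = hochberg_reject(pvals[:m], alpha)
        large = hochberg_reject(pvals, alpha)
        for i in range(m):
            if not small[i] and large[i]:
                violations += 1
    return violations


violations = count_type_a_enlargement_violations()
```

If the procedure is perfectly type A enlargement familywise robust, `violations` should be zero for any choice of p-values, not only for the mixture assumed here.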
The BH procedure is not perfectly type A enlargement familywise robust. We illustrate
this with a simple numerical example. Suppose that m = 2, n = 3 and q* = α = 0.05. Then
{α1,m, α2,m} = {0.025, 0.05} and {α1,n, α2,n, α3,n} = {0.01667, 0.03333, 0.05}. Assume that the
hypotheses in the larger family Ω3 are H1, H2 and H3 with p-values 0.01, 0.03 and 0.06
respectively. Using the BH procedure, H1 and H2 are rejected and H3 is retained. But, in a
smaller family ω2 which contains only the two hypotheses with p-values 0.03 and 0.06
respectively, the BH procedure accepts both hypotheses. Hence, H2 with p-value 0.03 is
rejected in the larger family but accepted in the smaller family. This counter-example
demonstrates that the BH procedure is not perfectly type A enlargement familywise robust. It
can also be demonstrated that the adaptive procedure is not perfectly type A enlargement
familywise robust.
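The counter-example can be reproduced in a few lines. The following is a minimal sketch of the BH step-up rule (reject the k smallest p-values, where k is the largest rank i with P(i) ≤ i q*/n); the function name `bh_reject` is ours and the code is an illustration, not the authors' implementation.

```python
def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: return a list of reject flags."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices by ascending p-value
    k = 0
    for rank in range(1, n + 1):
        if pvals[order[rank - 1]] <= rank * q / n:
            k = rank  # keep the largest rank meeting its critical value
    reject = [False] * n
    for r in range(k):
        reject[order[r]] = True
    return reject


larger = [0.01, 0.03, 0.06]   # p-values of H1, H2, H3 in the larger family
smaller = [0.03, 0.06]        # p-values of H2, H3 in the smaller family

in_larger = bh_reject(larger)    # H1 and H2 rejected, H3 retained
in_smaller = bh_reject(smaller)  # both hypotheses retained
```

H2 is flagged in `in_larger` but not in `in_smaller`, which is the family size inconsistent decision described in the text.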
In the light of theorem 2, many common MCPs are perfectly type A enlargement familywise
robust, and hence also perfectly type R contraction familywise robust. Therefore, to compare
the robustness of such MCPs to changing the family, it suffices to focus on the other two types
of robustness. This is investigated by simulation in Section 5 later, along with all types of
familywise robustness for the BH procedure and adaptive procedure. Although the BH
procedure is not perfectly type A enlargement familywise robust, for all testing situations,
parameter configurations and sample sizes studied, we found that it is almost perfectly type A
enlargement familywise robust in the sense that the estimate of pi,an|am never fell below 0.94
and typically exceeded 0.999. On the basis of our work with the two examples reported
below, the adaptive procedure also is almost perfectly type A enlargement familywise robust;
the minimum p̂i,an|am we saw was 0.95. We conjecture that a similar near perfectly familywise
robust situation for the BH and adaptive procedures will extend to other testing situations
as well.
The proof of the following theorem appears in Appendix A.
Theorem 3. For a true null hypothesis, both type A enlargement and type A contraction
familywise robustness are bounded below by 1 − α for the Hochberg procedure, and by 1 − q* for
both the BH procedure and the adaptive procedure.
This theorem indicates that, if a true hypothesis is accepted, it is very likely to continue to
be accepted if its family is either enlarged or contracted. In fact, as can be seen from the
proof, this theorem applies to all procedures that require acceptance of any hypothesis having
a p-value greater than c, where c = q* for a false discovery rate controlling procedure or α for
a familywise error rate controlling procedure.
We have seen that several procedures are perfectly type A enlargement familywise robust.
Are any procedures perfectly type R enlargement familywise robust? The answer is yes. For
the family consisting of a set of contrasts among the means of a finite number of populations,
the Scheffé procedure has all four types of perfect familywise robustness because the Scheffé
procedure controls FWE in an infinite-sized family containing Ωn. Although the Scheffé
procedure is not a particularly powerful MCP, it does have this nice familywise robustness
property.
Theorem 4. For any procedure in either U or D as well as for the BH and adaptive
procedures, if m is fixed and the test statistics are independent and identically distributed,
type R enlargement familywise robustness goes to zero as n → ∞.
The proof is contained in Appendix A. Theorem 4 indicates that, for all procedures
discussed in this paper, the type R enlargement familywise robustness goes to zero as the size
of the larger family goes to ∞. However, theorem 4 says nothing about the rate of decrease
and so this is investigated via simulation in Section 5.
The type R familywise robustness criteria are different from power in that they focus
directly on the possibility of changing the size of the family in which the hypotheses are
situated. For example, in simulations not displayed in this paper, we found that the
Bonferroni procedure consistently had slightly higher estimated pi,rn|rm and pi,rm|rn than the
Hochberg procedure, whereas the latter has greater power than the former.

4. Measures for assessing familywise robustness


For assessing a procedure's type R enlargement familywise robustness we propose the
measure
\[ P_{rn|rm} = \frac{1}{m} \sum_{i=1}^{m} p_{i,rn|rm}, \]

the probability that a hypothesis is rejected in Ωn given that it is rejected in ωm, averaged over
all m hypotheses in the smaller family ωm. We refer to this new measure as the procedure's
type R enlargement familywise robustness.
Similarly, for assessing a procedure's type R contraction familywise robustness, we
propose the measure
\[ P_{rm|rn} = \frac{1}{m} \sum_{i=1}^{m} p_{i,rm|rn}, \]

the probability that a hypothesis is rejected in ωm given that it is rejected in Ωn, averaged over
all m hypotheses in the smaller family ωm. We refer to this new measure as the procedure's
type R contraction familywise robustness. Type A enlargement or contraction familywise
robustness of an MCP is defined in analogous fashion.
These measures deal with family averages of familywise robustness rather than the
individual familywise robustness introduced in Section 2. We call these averages familywise
robustness as well. If the family of statistics is interchangeable, then average familywise
robustness should reflect individual familywise robustness.

5. Simulation results for two examples


In a simulation study with r replications, define Mi,m as the number of times Hi is rejected in
the subfamily ωm and Ni,m,n as the number of times Hi is rejected in both Ωn and ωm. It is
natural to estimate pi,rn|rm with

\[ \hat{p}_{i,rn|rm} = N_{i,m,n} / M_{i,m} \]

and Prn|rm with

\[ \hat{P}_{rn|rm} = \frac{1}{m} \sum_{i=1}^{m} \hat{p}_{i,rn|rm} = \frac{1}{m} \sum_{i=1}^{m} \frac{N_{i,m,n}}{M_{i,m}}. \]

It is straightforward to show that p̂i,rn|rm and P̂rn|rm are unbiased estimators of pi,rn|rm and
Prn|rm respectively. Unbiased estimators of pi,an|am, pi,rm|rn and pi,am|an are defined similarly.
If all the null hypotheses are true, it is easy to see that for the Bonferroni procedure
Prn|rm = m/n. For all other MCPs, and for the Bonferroni procedure when not all null
hypotheses are true, Prn|rm depends on the interrelationships between the hypotheses and on
the effect sizes, and must be estimated by simulation. For each of two testing situations where
a `large' family size is plausible, we computed and compared P̂rn|rm and the estimators for the
other three types of familywise robustness, for the BH, adaptive and Hochberg procedures,
for a variety of configurations of parameters and (m, n).
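The Bonferroni value m/n quoted in this paragraph follows from a one-line calculation. The sketch below assumes, beyond what the text states, that each null p-value (written p_i here, our notation) is uniformly distributed on (0, 1); no independence between the tests is needed for this step.

```latex
% Bonferroni rejects H_i in \Omega_n iff p_i <= \alpha/n and in \omega_m iff p_i <= \alpha/m.
% Because \alpha/n < \alpha/m, rejection in \Omega_n implies rejection in \omega_m.
\begin{align*}
p_{i,rn|rm}
  &= \frac{P(p_i \le \alpha/n,\; p_i \le \alpha/m)}{P(p_i \le \alpha/m)}
   = \frac{P(p_i \le \alpha/n)}{P(p_i \le \alpha/m)}
   = \frac{\alpha/n}{\alpha/m}
   = \frac{m}{n}, \\
P_{rn|rm}
  &= \frac{1}{m} \sum_{i=1}^{m} p_{i,rn|rm} = \frac{m}{n}.
\end{align*}
```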
Throughout this study our simulations, programmed in Fortran, consisted of 2 × 10^6
replications.
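To make the estimator concrete, the following Python sketch (an illustration, not the authors' Fortran program) estimates Prn|rm by Monte Carlo for the Bonferroni procedure when all null hypotheses are true, the one case in which the text gives the exact value m/n. The function names, the use of independent uniform p-values, the family sizes and the much smaller number of replications are all assumptions made for the example.

```python
import random


def bonferroni_reject(pvals, alpha=0.05):
    """Bonferroni rule: reject H_i when its p-value is at most alpha / family size."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]


def estimate_P_rn_given_rm(procedure, m, n, reps, alpha=0.05, seed=0):
    """Monte Carlo estimate of type R enlargement familywise robustness."""
    rng = random.Random(seed)
    M = [0] * m  # M[i]: times H_i is rejected in the smaller family
    N = [0] * m  # N[i]: times H_i is rejected in both families
    for _ in range(reps):
        pvals = [rng.random() for _ in range(n)]  # all nulls true: uniform p-values
        small = procedure(pvals[:m], alpha)       # smaller family = first m hypotheses
        large = procedure(pvals, alpha)
        for i in range(m):
            if small[i]:
                M[i] += 1
                if large[i]:
                    N[i] += 1
    # Average N/M over hypotheses that were rejected at least once in the smaller
    # family (a guard we add to avoid division by zero).
    ratios = [N[i] / M[i] for i in range(m) if M[i] > 0]
    return sum(ratios) / len(ratios)


estimate = estimate_P_rn_given_rm(bonferroni_reject, m=4, n=8, reps=100_000)
```

With m = 4 and n = 8 the estimate should lie close to m/n = 0.5. Any other procedure with the same calling convention can be passed in place of `bonferroni_reject`.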
We could not estimate the standard error of our estimated familywise robustness
parameters with the usual binomial formula since both the numerator and the denominator
of the estimated parameters are random. Instead we used a jackknife approach for this
purpose. Pseudovalues were calculated by systematically eliminating 5% of each set of
simulated estimates, calculating the pseudovalue from the remaining 95% of the estimates
and repeating to form 20 pseudovalues from which the standard errors were estimated.
Such estimates were calculated for each estimated familywise robustness parameter used to
construct each of our figures. The maximum jackknife estimated standard errors for any
estimated familywise robustness parameters in the following examples were 0.0021 and
0.0017 respectively.
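One way to read the jackknife description is as a delete-a-group jackknife with 20 blocks of 5% of the replications each. The sketch below follows that reading; it is our interpretation, not the authors' code. The pseudovalue formula g·θ̂ − (g − 1)·θ̂(−j), the per-replication 0/1 data layout and the synthetic data are assumptions.

```python
import math
import random
import statistics


def grouped_jackknife_se(num, den, groups=20):
    """Ratio estimate sum(num)/sum(den) and its delete-a-group jackknife SE.

    num[k], den[k] are per-replication 0/1 indicators (hypothetical layout):
    den[k] = 1 if H_i was rejected in the smaller family in replication k,
    num[k] = 1 if it was rejected in both families.
    """
    r = len(den)
    if r % groups != 0:
        raise ValueError("number of replications must be a multiple of groups")
    size = r // groups
    total_num, total_den = sum(num), sum(den)
    theta = total_num / total_den  # estimate from all replications

    pseudovalues = []
    for g in range(groups):
        lo, hi = g * size, (g + 1) * size
        # Estimate recomputed with one block (5% when groups=20) left out
        theta_minus = (total_num - sum(num[lo:hi])) / (total_den - sum(den[lo:hi]))
        pseudovalues.append(groups * theta - (groups - 1) * theta_minus)

    se = statistics.stdev(pseudovalues) / math.sqrt(groups)
    return theta, se


# Synthetic indicators, for illustration only (not the paper's data)
rng = random.Random(42)
den = [1 if rng.random() < 0.3 else 0 for _ in range(20_000)]
num = [d if rng.random() < 0.5 else 0 for d in den]
theta, se = grouped_jackknife_se(num, den)
```

Because both `sum(num)` and `sum(den)` are random, the binomial standard error formula does not apply to their ratio, which is the motivation given in the text for a resampling approach.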
In the following subsections we present selected plots of P̂rn|rm, P̂rm|rn and P̂am|an for a
variety of parameter configurations for two examples. P̂an|am is not shown because in all
cases considered these estimates were extremely close to 1. When considered as functions
of the larger family size n for a fixed value of the smaller family size m, it is desirable
that the graphs be as close to 1 as possible, and decay to 0 as slowly as possible as n
increases.
The three MCPs considered were the Hochberg, BH and adaptive procedures. We have
done comparable calculations for the FWE controlling procedures of Bonferroni, Hommel
(1988) and Rom (1990), but their estimated familywise robustness parameters were always
very close to those of the Hochberg procedure and so are not reported here.
Throughout our simulation work, both the BH procedure and the adaptive procedure
controlled FDR at the designated 0.05 level. Although these procedures are not designed to
control FWE at the same level as set for FDR, we noted that they usually did maintain such
control. However, on occasion, the estimated FWEs exceeded 0.05, with the estimated FWE
of the adaptive procedure always exceeding that of the BH procedure. The maximum
estimated FWE rates that we observed for any parameter configuration in our examples were
0.15 and 0.39 for the BH and adaptive procedures respectively.
Neither of the two examples to follow is obviously purely exploratory or confirmatory, so
both FWE and FDR are possible error control criteria.

5.1. Example 1
The first testing family that we considered is motivated by the multitude of F-tests in large
analysis-of-variance tables. Studies using the BH procedure to control for multiplicity in
such situations were presented by Basford and Tukey (1997) and Williams et al. (1999). We
simultaneously test whether each of n ratios of variances of normal populations having
common denominator variance, i.e. σi²/σ0², i = 1, ..., n, equals 1 against the alternative that
each variance ratio exceeds 1. Under the null hypotheses, each simulated sample statistic has
an Fνi,ν0-distribution. Strictly speaking, these ratio statistics are not independent, but with the
selected degrees of freedom their mutual dependence due to the common denominator is very
slight.
The simulations examined the following configurations of parameters and sample sizes.
Half the tests in the family had δ = σi²/σ0² = 1 and the other half of the family had δ = 1, 3, 5.
We considered m = 4, 10, 16, n = m+2, m+4, m+6, ..., m+32, and throughout the experiment
ν0 = 20, νi = 1, i = 1, ..., n, and α = q* = 0.05.
Fig. 1 contains plots of the estimated type R enlargement familywise robustness against
the larger family's size n for the three MCPs, for each of three values of the effect size δ, and
three values for the size of the smaller family, m. For all family size combinations, this
familywise robustness is highest for the BH procedure and adaptive procedure, and the extent
of their superiority over the Hochberg procedure increases with both δ and n. Moreover, as n
increases, the familywise robustness decreases much more slowly for the BH procedure and

Fig. 1. Estimated type R enlargement familywise robustness for example 1 (with selected m and δ): solid line, BH procedure; dotted line, Hochberg procedure; dashed line, adaptive procedure

adaptive procedure than for the Hochberg procedure, particularly if the evidence against a
particular null hypothesis is strong (i.e. δ is large). This means that the probability that the
BH procedure or adaptive procedure will reject a false null hypothesis does not greatly
depend on the size of the family in which the hypothesis resides, so long as the size of this
family is not too small.
As expected from theorem 1, very similar conclusions to these for estimated type A
familywise robustness for family size contraction are displayed in Fig. 2.
As a consequence of theorems 1 and 2, the type R contraction familywise robustness of
the Hochberg procedure is always 1. We found that for the parameters indicated the type R
contraction familywise robustness of the BH procedure always exceeded that of the
adaptive procedure, and the latter was typically at least 0.9. The differences between the
procedures increased with the size of the smaller family. This ordering of procedures for
type R contraction is the opposite to that in Fig. 1. The explanation stems from a dual
relationship between enlargement and contraction familywise robustness. If a procedure has
low type R enlargement familywise robustness, it rejects only a small proportion of those
hypotheses in the larger family which have been rejected in the smaller family. To be
rejected in the larger family, such hypotheses must have a very small p-value. So, given that
a hypothesis is rejected in the large family, it must have a small p-value and hence it is very
likely to be rejected also in the smaller family, i.e. the procedure has high type R
contraction familywise robustness.
Our overall conclusion for this example is that the BH procedure is the best of the three
procedures. These are the reasons.
(a) The BH procedure is clearly better than the Hochberg and not much worse than the
adaptive procedures, in Figs 1 and 2. (Note that the figures have different vertical
scalings.)
(b) The BH procedure is better than the adaptive and not much worse than the Hochberg
procedures in terms of type R contraction familywise robustness.

Fig. 2. Estimated type A contraction familywise robustness for example 1 (with selected m and δ): solid line, BH procedure; dotted line, Hochberg procedure; dashed line, adaptive procedure

(c) The BH procedure has been theoretically shown to control FDR in some situations
with dependent statistics, whereas the current evidence for the FDR control of the
adaptive procedure is limited to some situations with independent test statistics.

5.2. Example 2
In the second example we considered a family of independent t-tests, as might arise
when comparable clinical trial inferences are performed at each of several centres. Such
a study with a desire to control for multiplicity is described in Westfall and Young
(1993), page 5.
We fixed m = 8 and took n = 12, 16, 20, ..., 80. For our investigation of family size
enlargement, we started with n = 12 and systematically added four additional hypotheses at a
time to Ωn. For our investigation of family size contraction, we started with n = 80 and
systematically subtracted four hypotheses at a time from Ωn until reaching n = 12. In altering
Ωn, we maintained its proportion of each variety of size of effect.
Each simulation run began by generating independent random samples of size 10 from
each of two four-variate normal distributions with unit variances. We ran two-sided t-tests of
equality of the two population means, choosing one sample for this test from each of the two
populations. For the tests of true null hypotheses, both population means were 0. For tests of
false null hypotheses, one population mean was 0 and the other δ, the value of which is
discussed below. This was repeated for each new simulation run, with independence of
samples between runs.
We used the three values of δ shown in each row of Figs 3 and 4. In the first row all null
hypotheses are true. The configurations in the second and third rows were selected to ensure that
the proportion of p-values less than 0.05 for each of the two types of test statistic are close to
0.2 and 0.4 respectively. We considered three choices of the number T of true hypotheses in
each family: n/2, n/4 and 4.
Fig. 3 contains estimates for type R enlargement familywise robustness. As the non-
centrality increases and/or the proportion of true hypotheses in the family decreases, the

Fig. 3. Estimated type R enlargement familywise robustness for example 2 (with m = 8 and selected T and δ): solid line, BH procedure; dotted line, Hochberg procedure; dashed line, adaptive procedure

Fig. 4. Estimated type R contraction familywise robustness for example 2 (with m = 8 and selected T and δ): solid line, BH procedure; dotted line, Hochberg procedure; dashed line, adaptive procedure

adaptive procedure becomes increasingly superior to the other two procedures,


and the BH procedure becomes increasingly superior to the Hochberg procedure.
Moreover, the familywise robustness of the adaptive and BH procedures each decrease only
slightly, if at all, as n increases above 32. For the case n = 80 with 76 of these hypotheses false
with a large δ, the type R enlargement familywise robustness estimates of the
adaptive, BH and Hochberg procedures are respectively 0.9573, 0.6978 and 0.1562.
Fig. 4 contains estimates for type R contraction familywise robustness. Generally, the
Hochberg procedure has the highest familywise robustness followed by the BH procedure
and adaptive procedure in that order. The distinctions between the procedures increase as
either the number of true hypotheses in each family decreases or the non-centralities of the
false hypotheses increase. However, the familywise robustness for the BH procedure never
fell below about 0.87, whereas familywise robustness estimates as low as 0.4591 were
observed for the adaptive procedure. The explanations for the difference in findings for type
R enlargement and type R contraction parallel those provided above for example 1.
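
Estimates of this kind can be approximated by direct simulation. The following Python sketch is illustrative only and is not the simulation code used for this paper. It assumes, following the conditional probabilities used in Appendix A, that type R enlargement familywise robustness is P(H_i rejected in the larger family | H_i rejected in the smaller family) and that type R contraction reverses the conditioning. The one-sided normal test statistics, the layout of true and false hypotheses, the choice of H_i and all function names are hypothetical, and the adaptive procedure is omitted.

```python
import random
from statistics import NormalDist


def step_up_reject(pvals, threshold):
    """Generic step-up rule: find the largest rank k whose ordered p-value
    satisfies p_(k) <= threshold(k, n), then reject the k smallest p-values."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= threshold(rank, n):
            k = rank
    rejected = [False] * n
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg procedure: thresholds k * q / n."""
    return step_up_reject(pvals, lambda k, n: k * q / n)


def hochberg_reject(pvals, alpha=0.05):
    """Hochberg procedure: thresholds alpha / (n - k + 1)."""
    return step_up_reject(pvals, lambda k, n: alpha / (n - k + 1))


def type_r_robustness(reject, m=8, n=32, n_true=4, delta=3.0, reps=2000, seed=1):
    """Monte Carlo estimates of (enlargement, contraction) type R robustness
    for the false hypothesis at index 0, which belongs to both families.

    Assumed layout (hypothetical): the first n - n_true hypotheses are false
    with mean delta, the rest are true, and the smaller family consists of
    the first m hypotheses of the larger one."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    means = [delta] * (n - n_true) + [0.0] * n_true
    in_small = in_large = in_both = 0
    for _ in range(reps):
        # One-sided p-values from unit-variance normal test statistics.
        pvals = [1.0 - std_normal.cdf(rng.gauss(mu, 1.0)) for mu in means]
        rej_small = reject(pvals[:m])[0]
        rej_large = reject(pvals)[0]
        in_small += rej_small
        in_large += rej_large
        in_both += rej_small and rej_large
    enlargement = in_both / in_small if in_small else float("nan")
    contraction = in_both / in_large if in_large else float("nan")
    return enlargement, contraction


if __name__ == "__main__":
    for name, proc in (("BH", bh_reject), ("Hochberg", hochberg_reject)):
        enl, con = type_r_robustness(proc)
        print(f"{name}: enlargement {enl:.3f}, contraction {con:.3f}")
```

Both procedures share one step-up routine and differ only in their thresholds, so any divergence in the estimates is attributable to the critical constants alone. The numbers produced depend on the assumed layout and should not be compared directly with the figures.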
The type A contraction familywise robustness estimates were similar to the type R family
size enlargement estimates, as expected from theorem 1. The adaptive procedure has the
greatest familywise robustness followed by the BH procedure and then the Hochberg
procedure. The distinctions between the procedures increase as the proportion of true
hypotheses in the family decreases and the non-centralities of the false hypotheses increase. In
all situations, the familywise robustness estimates of the adaptive and BH procedures depend
only slightly on n if n ≥ 32.
The conclusions from the simulation for this example are consistent with those from the
previous example. Except for the type R contraction results, the familywise robustness
superiority of the BH procedure over the Hochberg procedure increases with
(a) the family size (or the discrepancy between the sizes of the smaller and larger families),
(b) the proportion of false null hypotheses in the family and
(c) the extent of departures from the null hypotheses among those null hypotheses that are
false.

Although the adaptive procedure has slightly better familywise robustness estimates than
those of the BH procedure for type R enlargement and type A contraction, it is worse for the
other two types of familywise robustness. Since the adaptive procedure has not been
theoretically shown to control FDR, and the only evidence for such control is simulation
based for the independent case, we believe that the familywise robustness concept does not
support the general preference for the adaptive over the BH procedure at this time.
In addition to these examples, we examined a family of independent χ²-statistics. A simulation
study for this situation, which we do not report here, yielded familywise robustness
conclusions that are virtually identical with those in example 2 and Figs 3 and 4.

6. Conclusions
In the context of multiple hypothesis testing, we have introduced new measures of the
robustness of a multiple-comparison procedure for assessing the extent to which the
acceptances and rejections of particular hypotheses will vary as the family in which these
hypotheses are located is enlarged or contracted. These measures are appropriate for deciding
between competing procedures in situations where the choice of the testing family is
equivocal. The family size consistency measures are additions to the list of desirable criteria
for an MCP to have, beyond the mandatory control of either FWE or FDR, and good
performance with respect to various concepts of power for MCPs.
As an example of the use of familywise robustness, we have studied some FDR and FWE
controlling multiple-comparison procedures. We can tentatively state, on the basis of the
performance of the new measures for the examples studied, that the BH procedure is an MCP
which is well worth considering in multiple-testing situations where the FDR error concept
makes sense, the family size is uncertain and the BH procedure is known to control FDR.
The modified adaptive procedure that we considered shows promise, but its FDR control
under dependence needs to be further studied before it can be recommended for use with
dependent tests.

Acknowledgements
We thank Yoav Benjamini, Charles Dunnett, Richard Heiberger, Yosef Hochberg, Harvey
Keselman, Juliet Shaffer, James Troendle and Peter Westfall for helpful comments on early
drafts of this paper. We are also indebted to the Joint Editor and referees for helping us to
shape the final version. The work of Burt Holland was supported in part by a Temple
University research and study leave grant.

Appendix A
A.1. Proof of theorem 3
In general, for any events $A$ and $B$, $P(A \mid B) \geq P(A \text{ and } B)$. Therefore, both type A enlargement and type
A contraction familywise robustness are greater than or equal to

\[
P(H_i \text{ is accepted in } \Omega_n \text{ and } H_i \text{ is accepted in } \omega_m) \geq P(P_i \geq c),
\]

where $P_i$ is the p-value associated with $H_i$. This inequality follows from the fact that, for all procedures
considered in this paper, a hypothesis will not be rejected if its p-value exceeds the familywise error rate $\alpha$
for FWE procedures or $q^*$ for FDR procedures. If $H_i$ is true, $P(P_i \geq c) = 1 - c$ and the proof of theorem
3 is completed.
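
The key step of this argument, that a hypothesis whose p-value exceeds the nominal level is accepted in both families, can be checked numerically. The Python sketch below is an illustration under stated assumptions and is not part of the proof: it uses the BH procedure with $c$ taken to be the FDR level q, places a true hypothesis with a uniform p-value at index 0, and gives the remaining hypotheses hypothetical p-values concentrated near 0. The function names and the p-value generator are invented for the example.

```python
import random


def bh_reject(pvals, q):
    """Benjamini-Hochberg step-up procedure: reject the k smallest p-values,
    where k is the largest rank with p_(k) <= k * q / n."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / n:
            k = rank
    rejected = [False] * n
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


def accepted_in_both_vs_bound(q=0.05, m=8, n=32, reps=5000, seed=7):
    """Return (fraction of runs in which the true hypothesis at index 0 is
    accepted in both families, fraction of runs in which its p-value exceeds q)."""
    rng = random.Random(seed)
    accepted_both = 0
    above_level = 0
    for _ in range(reps):
        # Index 0 is a true null (uniform p-value); the others are
        # hypothetical false nulls with p-values pushed towards 0.
        pvals = [rng.random()] + [rng.random() ** 4 for _ in range(n - 1)]
        small = bh_reject(pvals[:m], q)   # smaller family: first m hypotheses
        large = bh_reject(pvals, q)       # larger family: all n hypotheses
        if not small[0] and not large[0]:
            accepted_both += 1
        if pvals[0] > q:
            above_level += 1
    return accepted_both / reps, above_level / reps


if __name__ == "__main__":
    both, bound = accepted_in_both_vs_bound()
    print(f"accepted in both: {both:.4f}, P(p > q) estimate: {bound:.4f}")
```

Because every BH threshold k q / n is at most q, each run with a p-value above q is also a run in which the hypothesis is accepted in both families, so the first fraction can never fall below the second.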

A.2. Proof of theorem 4


The proof of theorem 4 requires separate approaches according to whether the MCP is a member
of the step-down or step-up class. We begin with the step-up class. For any events $A$ and $B$,
$P(A \mid B) = P(A \text{ and } B)/P(B) \leq P(A)/P(B)$. Therefore,

\[
\text{type R enlargement familywise robustness} \leq \frac{P(H_i \text{ is rejected in } \Omega_n)}{P(H_i \text{ is rejected in } \omega_m)}.
\]

The denominator is a bounded constant greater than 0. We shall show that the numerator goes to 0
when $n \to \infty$. Let the hypotheses be $H_1, H_2, \ldots, H_n$ with test statistics $T_1, \ldots, T_n$. Without loss of
generality, assume that all the null hypotheses $H_i\colon \mu_i = 0$ are to be tested against one-sided alternatives
$H_i'\colon \mu_i > 0$. The step-up procedures compare the sequence of constants $c_1 \leq c_2 \leq \ldots \leq c_n$ with the
ordered test statistics $T_{(1)} \leq T_{(2)} \leq \ldots \leq T_{(n)}$. Define the event $E = (T_1, \ldots, T_r) < (c_r, \ldots, c_r)$ which
represents $T_{(1)} < c_1, T_{(2)} < c_2, \ldots, T_{(r)} < c_r$. Since the test statistics are independent and identically
distributed,

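\[
\begin{aligned}
P(H_i \text{ is rejected in } \Omega_n) &= P(H_n \text{ is rejected in } \Omega_n) \\
&= \sum_{r=0}^{n-1} \binom{n-1}{r} P(\text{accept } H_1, \ldots, H_r;\ \text{reject } H_{r+1}, \ldots, H_{n-1}, H_n) && \text{(A.1)} \\
&= \sum_{r=0}^{n-1} \binom{n-1}{r} P\{E;\ \min(T_{r+1}, \ldots, T_n) \geq c_{r+1}\} \\
&= \sum_{r=0}^{n-1} \binom{n-1}{r} P(E) \prod_{j=r+1}^{n} P(T_j \geq c_{r+1}) \\
&= \sum_{r=0}^{n-1} \binom{n-1}{r} P(E)\, \alpha_{r+1}^{n-r}, && \text{(A.2)}
\end{aligned}
\]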

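where $P(T_i \geq c_r) = \alpha_r$ and $1 > \alpha_1 \geq \alpha_2 \geq \ldots \geq \alpha_n > 0$. Next, choose $\varepsilon > 0$. Then, for sufficiently large $n$,
a fixed integer $N < n$ can be found such that $0 < \alpha_N \leq \varepsilon$. Decompose equation (A.2) into the
following two pieces:

\[
\begin{aligned}
&\sum_{r=0}^{N} \binom{n-1}{r} P(E)\, \alpha_{r+1}^{n-r} + \sum_{r=N+1}^{n-1} \binom{n-1}{r} P(E)\, \alpha_{r+1}^{n-r} \\
&\qquad \leq \sum_{r=0}^{N} \binom{n-1}{r} \alpha_{r+1}^{n-r} + \sum_{r=N+1}^{n-1} \binom{n-1}{r} P(E)\, \alpha_{r+1}^{n-r} \\
&\qquad \leq \sum_{r=0}^{N} \binom{n-1}{r} \alpha_{1}^{n-r} + \sum_{r=N+1}^{n-1} \binom{n-1}{r} P(E)\, \alpha_{r+1}^{n-r}. && \text{(A.3)}
\end{aligned}
\]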

Suppose that $n \to \infty$. The first term in inequality (A.3) goes to 0 since $\alpha_1 < 1$. The second term
equals

\[
\sum_{r=N+1}^{n-1} \binom{n-1}{r} P(E)\, \alpha_{r+1}^{n-r-1} \alpha_{r+1}
\leq \varepsilon \sum_{r=N+1}^{n-1} \binom{n-1}{r} P(E)\, \alpha_{r+1}^{n-r-1}
\leq \varepsilon
\]

because

\[
\sum_{r=N+1}^{n-1} \binom{n-1}{r} P(E)\, \alpha_{r+1}^{n-r-1}
\]

represents the probability of accepting at least $N + 1$ hypotheses when testing a family of $n - 1$ hypotheses
using critical constants $c_1, \ldots, c_{n-1}$, and thus is less than 1. Hence, for large $n$, $P(H_n \text{ is rejected in } \Omega_n) \leq \varepsilon$,
where $\varepsilon$ is an arbitrary constant greater than 0. Therefore, $P(H_n \text{ is rejected in } \Omega_n)$ goes to 0 as
$n \to \infty$.
This completes the proof for step-up MCPs. We now turn to step-down MCPs.
For step-down procedures, let $K$ be the event $(T_{r+1}, \ldots, T_n) \geq (c_{r+1}, \ldots, c_n)$. Then equation (A.1)
becomes

\[
\begin{aligned}
\sum_{r=0}^{n-1} \binom{n-1}{r} P\{\max(T_1, \ldots, T_r) < c_r;\ K\}
&= \sum_{r=0}^{n-1} \binom{n-1}{r} \prod_{i=1}^{r} P(T_i < c_r)\, P(K) \\
&= \sum_{r=0}^{n-1} \binom{n-1}{r} (1 - \alpha_r)^r P(K). && \text{(A.4)}
\end{aligned}
\]

Next, choose $\varepsilon > 0$. Then, for sufficiently large $n$, a fixed integer $N < n$ can be found such that
$0 < \alpha_N \leq \varepsilon$. Decompose equation (A.4) into the following two pieces:

\[
\sum_{r=0}^{N} \binom{n-1}{r} (1 - \alpha_r)^r P(K) + \sum_{r=N+1}^{n-1} \binom{n-1}{r} (1 - \alpha_r)^r P(K). \qquad \text{(A.5)}
\]

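The first term is less than or equal to

\[
\begin{aligned}
\sum_{r=0}^{N} \binom{n-1}{r} P(K)
&\leq \sum_{r=0}^{N} \binom{n-1}{r} P\{(T_{r+1}, \ldots, T_n) \geq (c_{r+1}, \ldots, c_{r+1})\} \\
&= \sum_{r=0}^{N} \binom{n-1}{r} \alpha_{r+1}^{n-r},
\end{aligned}
\]

which goes to 0 as $n \to \infty$. The second term of expression (A.5) is less than or equal to

\[
\begin{aligned}
&\sum_{r=N+1}^{n-1} \binom{n-1}{r} (1 - \alpha_r)^r P\{(T_{r+1}, \ldots, T_{n-1}) \geq (c_{r+1}, \ldots, c_{n-1});\ T_n \geq c_{r+1}\} \\
&\qquad = \sum_{r=N+1}^{n-1} \binom{n-1}{r} (1 - \alpha_r)^r P\{(T_{r+1}, \ldots, T_{n-1}) \geq (c_{r+1}, \ldots, c_{n-1})\}\, P(T_n \geq c_{r+1}) \\
&\qquad \leq \varepsilon \sum_{r=N+1}^{n-1} \binom{n-1}{r} (1 - \alpha_r)^r P\{(T_{r+1}, \ldots, T_{n-1}) \geq (c_{r+1}, \ldots, c_{n-1})\}.
\end{aligned}
\]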

Following logic similar to that used for the step-up case, this goes to 0 as $n \to \infty$. Hence the proof is
complete.
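
The behaviour of the numerator $P(H_n \text{ is rejected in } \Omega_n)$ can be illustrated by simulation. The Python sketch below is a rough illustration and not part of the proof. It assumes the simplest instance of the independent and identically distributed setting, in which every hypothesis is true and the p-values are uniform, and it uses the Hochberg procedure as the step-up example and the Holm procedure as the step-down example, both written in terms of p-values with thresholds $\alpha/(n - k + 1)$. The function names, sample sizes and seed are arbitrary choices made for the example.

```python
import random


def hochberg_rejects_last(pvals, alpha):
    """Step-up (Hochberg): k is the largest rank with p_(k) <= alpha / (n - k + 1);
    the k smallest p-values are rejected. Returns True if the last hypothesis
    in the list is among them."""
    n = len(pvals)
    ordered = sorted(pvals)
    k = 0
    for rank, p in enumerate(ordered, start=1):
        if p <= alpha / (n - rank + 1):
            k = rank
    return k > 0 and pvals[-1] <= ordered[k - 1]


def holm_rejects_last(pvals, alpha):
    """Step-down (Holm): move up from the smallest p-value and stop at the first
    rank with p_(k) > alpha / (n - k + 1); earlier ranks are rejected. Returns
    True if the last hypothesis in the list is among them."""
    n = len(pvals)
    ordered = sorted(pvals)
    k = 0
    for rank, p in enumerate(ordered, start=1):
        if p > alpha / (n - rank + 1):
            break
        k = rank
    return k > 0 and pvals[-1] <= ordered[k - 1]


def rejection_rate(rejects_last, n, alpha=0.05, reps=10000, seed=3):
    """Monte Carlo estimate of P(H_n is rejected) in a family of n true,
    independent hypotheses with uniform p-values."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        pvals = [rng.random() for _ in range(n)]
        hits += rejects_last(pvals, alpha)
    return hits / reps


if __name__ == "__main__":
    for n in (4, 16, 64):
        up = rejection_rate(hochberg_rejects_last, n)
        down = rejection_rate(holm_rejects_last, n)
        print(f"n = {n:3d}: Hochberg {up:.4f}, Holm {down:.4f}")
```

Under these assumptions the estimated rejection probability of a fixed hypothesis shrinks as the family grows, which is the behaviour that the bounds (A.3) and (A.5) establish in general.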

References
Basford, K. E. and Tukey, J. W. (1997) Graphical profiles as an aid to understanding plant breeding experiments.
J. Statist. Planng Inf., 57, 93–107.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to
multiple testing. J. R. Statist. Soc. B, 57, 289–300.
Benjamini, Y. and Hochberg, Y. (2000) On the adaptive control of the false discovery rate in multiple testing with
independent statistics. J. Educ. Behav. Statist., 25, 60–83.
Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency.
Ann. Statist., 29, 1152–1175.
Drigalenko, E. I. and Elston, R. C. (1997) False discoveries in genome scanning. Genet. Epidem., 14, 779–784.
Hochberg, Y. (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–803.
Holland, B. and Cheung, S. H. (2000) Family-size robustness criteria for multiple comparison procedures. Technical
Report 2–2000. Department of Statistics, Temple University, Philadelphia.
Holm, S. (1979) A simple sequentially rejective multiple test procedure. Scand. J. Statist., 6, 65–70.
Hommel, G. (1988) A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75,
783–786.
Rom, D. (1990) A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika, 77,
663–665.
Sarkar, S. K. (1998) Some probability inequalities for ordered MTP2 random variables: a proof of the Simes
conjecture. Ann. Statist., 26, 494–504.
Sarkar, S. K. and Chang, C. K. (1997) The Simes method for multiple hypothesis testing with positively dependent
test statistics. J. Am. Statist. Ass., 92, 1601–1608.
Westfall, P. H., Tobias, R. D., Rom, D., Wolfinger, R. D. and Hochberg, Y. (1999) Multiple Comparisons and
Multiple Tests using the SAS System. Cary: SAS Institute.
Westfall, P. H. and Young, S. S. (1993) Resampling-based Multiple Testing. New York: Wiley.
Williams, V. S. L., Jones, L. V. and Tukey, J. W. (1999) Controlling error in multiple comparisons, with examples
from state-to-state differences in educational achievement. J. Educ. Behav. Statist., 24, 42–69.
Yekutieli, D. and Benjamini, Y. (1999) Resampling based false discovery rate controlling multiple test procedures for
correlated test statistics. J. Statist. Planng Inf., 82, 171–196.
