
Austral. J. Statist., 32(2), 1990, 177-190

ON BOOTSTRAP HYPOTHESIS TESTING

NICHOLAS I. FISHER1 AND PETER HALL2

CSIRO and The Australian National University


Summary

We describe methods for constructing bootstrap hypothesis tests, illustrating our approach using analysis of variance. The importance of pivotalness is discussed: pivotal statistics usually result in improved accuracy of level. We note that hypothesis tests and confidence intervals call for different methods of resampling, so as to ensure that accurate critical point estimates are obtained in the former case even when the data fail to comply with the null hypothesis. Our main points are illustrated by a simulation study and application to three real data sets.

Key words: Analysis of variance; Behrens-Fisher problem; bootstrap; hypothesis test; level error; Monte Carlo test; pivotal; resample.

1. Introduction
Our aim in this paper is to describe the construction of bootstrap

hypothesis tests, illustrated using the example of analysis of variance. The notion of a general bootstrap hypothesis test dates back to Efron (1979), and appears also in Beran (1988) and Hinkley (1988), but has not been developed to anywhere near the extent of bootstrap ideas for confidence intervals. While there are close links between bootstrap methods for testing and for interval estimation, there are important, explicit differences which call for a specialised treatment of the bootstrap testing problem. We contend that, as in the case of confidence intervals, an important feature which bootstrap tests should have is that they be based on asymptotically pivotal statistics. This leads us to rule out certain test statistics which feature prominently in some accounts of analysis of variance (e.g., Dijkstra & Werter, 1981). Pivotalness is not universally accepted as a desirable feature of confidence intervals; see for example the discussion of Hall (1988). However, its acceptance is increasing. Its virtue for tests is similar to that for confidence intervals: it results in tests with more accurate levels. With a sample of size n, tests based on pivotal statistics often result in level errors of O(n^{-2}), compared with only O(n^{-1}) for tests based on non-pivotal statistics. Our insistence on using pivotal statistics results in our advocating different test statistics in homoscedastic and heteroscedastic problems. The analysis of variance example is an ideal vehicle for bringing out this fundamental point.

One major feature distinguishing hypothesis tests from confidence intervals is that when testing, it is important to have accurate estimates of critical points even when the data fail to satisfy the null hypothesis. This means that the data must be transformed appropriately, so that any failure to comply with the null hypothesis is not reflected in reduced accuracy of level. Put another way, we want the bootstrap distribution of our test statistic to be invariant under a change from the null hypothesis to any of the members of the set of alternative hypotheses which the test is designed to detect. Therefore, we argue, bootstrap analysis proceeds rather differently in testing and confidence interval problems.

Our approach to bootstrap hypothesis testing for one-way analysis of variance is outlined in Section 2. In the case of homoscedastic problems, our test statistic is the classic F-ratio introduced by R.A. Fisher, although of course it does not have Fisher's F distribution in the nonparametric context which we treat. For heteroscedastic problems we use a statistic proposed by James (1951).

1 CSIRO, Division of Mathematics and Statistics, Sydney.
2 Statistics Research Section, School of Mathematical Sciences, The Australian National University, Canberra.

Acknowledgements. The hospitality of SFAL is noted with gratitude.

Received February 1989; revised April 1989.
An alternative statistic suggested by Brown & Forsythe (1974a, b) was not investigated because it fails to be asymptotically pivotal. Section 3 extends our results to two-way analysis of variance, Section 4 summarizes the results of a simulation study and gives applications to real data, and Section 5 discusses further generalizations and extensions.
2. Methodology for One-way Analysis of Variance

This section motivates the general approach to bootstrap hypothesis testing by making explicit its application to the comparison of means of several samples.


Subsection 2.1 introduces two test statistics, one classical and the other more suited to heteroscedastic problems. Subsection 2.2 develops a resampling scheme tailored to the heteroscedastic case, and Subsection 2.3 presents an Edgeworth expansion argument showing that the more classical of the two statistics from Subsection 2.1 will tend to perform poorly in heteroscedastic problems. Finally, Subsection 2.4 presents new resampling schemes, and outlines of asymptotic theory, for homoscedastic problems.
2.1 Test Statistics

Assume that a random sample {X_ij, 1 ≤ j ≤ n_i} is gathered from population Π_i, for 1 ≤ i ≤ r. The i-th population has mean μ_i and variance σ_i^2. Without making distributional assumptions about the Π_i's (indeed, often without supposing that the Π_i's have common variance) we wish to test the null hypothesis

    H_0 : μ_1 = μ_2 = ⋯ = μ_r .
Put n = Σ_i n_i, X̄_i· = n_i^{-1} Σ_j X_ij and X̄·· = n^{-1} Σ_i Σ_j X_ij. There are several possible test statistics, of which just two are

    T_1 = { Σ_i n_i (X̄_i· − X̄··)^2 } / { (n − r)^{-1} Σ_i Σ_j (X_ij − X̄_i·)^2 }

and

    T_2 = Σ_i w_i (X̄_i· − X̃)^2 ,   where  w_i = n_i / s_i^2 ,  s_i^2 = (n_i − 1)^{-1} Σ_j (X_ij − X̄_i·)^2  and  X̃ = Σ_i w_i X̄_i· / Σ_i w_i .

Under the model that each Π_i is Normal N(μ, σ^2), where neither μ nor σ^2 depends on i, (r − 1)^{-1} T_1 is distributed as F_{r−1, n−r}. The statistic T_1 is standard in this context, indeed in most circumstances where homoscedasticity is a realistic assumption. On the other hand, if each Π_i is Normal N(μ, σ_i^2) where the σ_i's are unequal, T_1 has a distribution depending on the σ_i's in a complex manner. In this situation the distribution of T_2 is that of

    Σ_{i=1}^{r} (n_i − 1) ( Z_i − n^{-1} n_i^{1/2} Σ_{j=1}^{r} n_j^{1/2} Z_j )^2 / W_i ,

where Z_i is N(0, 1), W_i is χ^2_{n_i−1}, and these variables are stochastically independent. The statistic T_2 therefore has the advantage that its null distribution does not depend on the unknown σ_j's. However, in cases of homoscedasticity it gives rise to a less powerful test. It dates back to


James (1951), and has been considered by Beran (1988) in the bootstrap case with r = 2. Now drop the assumption that Π_i is Normal, supposing only that it has mean μ and variance σ_i^2. As the sample sizes n_i increase, the distribution of T_2 converges to a limit which of course does not depend on the σ_i's. On the other hand, the limit distribution of T_1 does depend on the σ_i's. Therefore T_2, unlike T_1, is an asymptotically pivotal statistic, and is well-suited to bootstrap methods; see Subsection 2.3 for details. For discussions of the importance of pivotalness to bootstrap resampling, see Hartigan (1986), Beran (1987) and Hall (1988).
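To make the contrast between the two statistics concrete, they can be sketched in code. This is a sketch, not the paper's own material: the function names are ours, `t1_fratio` is the classical F-ratio described above, and the precision-weighted James-type form used in `t2_studentized` is our assumption about the exact form of T_2.

```python
import numpy as np

def t1_fratio(samples):
    # Classical one-way F-ratio: between-sample sum of squares over the
    # pooled (divisor n - r) within-sample variance estimate, so that
    # (r - 1)^{-1} T1 has Fisher's F distribution under homoscedastic normality.
    r = len(samples)
    n = sum(len(x) for x in samples)
    grand = np.concatenate(samples).mean()
    between = sum(len(x) * (x.mean() - grand) ** 2 for x in samples)
    within = sum(((x - x.mean()) ** 2).sum() for x in samples) / (n - r)
    return between / within

def t2_studentized(samples):
    # James/Welch-type statistic: each sample mean's squared deviation from a
    # precision-weighted grand mean is scaled by that sample's own variance,
    # so the statistic is asymptotically pivotal under heteroscedasticity.
    w = np.array([len(x) / x.var(ddof=1) for x in samples])
    means = np.array([x.mean() for x in samples])
    grand = (w * means).sum() / w.sum()
    return (w * (means - grand) ** 2).sum()
```

For equal sample variances and equal sample sizes the two statistics coincide; they separate exactly when the group variances differ.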
2.2 Bootstrap Resampling in Heteroscedastic Problems

The idea is to approximate the distribution of the test statistic under the null hypothesis, and is a little different from the approach in confidence interval construction. Our bootstrap method is based on a very simple idea, which we now describe. Put Y_ij = X_ij − μ_i, define Ȳ_i· and Ȳ·· in the obvious manner, and let T_01 and T_02 denote the versions of T_1 and T_2 computed from the Y_ij's. The distributions of T_01 and T_02 are invariant under choices of the μ_i's. When H_0 is true, these distributions are identical to those of T_1 and T_2, respectively. This fact suggests the following procedure. Draw a resample X_i* = {X*_i1, ..., X*_in_i}, with replacement, from the sample X_i = {X_i1, ..., X_in_i}. Naturally, each sample is resampled independently of the others. Put Y*_ij = X*_ij − X̄_i·, define Ȳ*_i· = n_i^{-1} Σ_j Y*_ij and Ȳ*·· = n^{-1} Σ_i Σ_j Y*_ij, and let T*_01 and T*_02 denote the versions of T_01 and T_02 computed from the Y*_ij's. Our object is to approximate the distributions of T_01 and T_02, under H_0, by those of T*_01 and T*_02 respectively. Let X = ∪_i X_i denote the entire sample. By repeated resampling we may approximate as closely as desired the distributions of T*_01 and T*_02,


conditional on X. Thus, we may compute the bootstrap critical point t̂_0i, defined by

    P(T*_0i ≤ t̂_0i | X) = 1 − α ,    (1)

for i = 1 and 2. An approximate α-level test of H_0 is to reject H_0 if T_i > t̂_0i. We argue in the next subsection that in heteroscedastic problems, the level-error of this test is smaller when i = 2 than when i = 1.
2.3 Choice of T_2 over T_1 in Heteroscedastic Problems

We claim that the bootstrap approximation of the null distribution of T_01 is accurate only to order n^{-1/2}, whereas that of T_02 is accurate to order n^{-3/2}. To see why, let F(·; σ_1^2, ..., σ_r^2) denote the asymptotic distribution function of T_01. It may be proved by Edgeworth expansion that

    P(T_01 ≤ z) = F(z; σ_1^2, ..., σ_r^2) + n^{-1} p_1(z) + n^{-2} p_2(z) + ⋯ ,    (2)

where the functions p_1, p_2, ... depend on σ_1, ..., σ_r and on other cumulants of the populations Π_i, but not on n. Terms of order n^{-j/2} for odd j do not enter the expansion, because the hypothesis test is two-sided; Barndorff-Nielsen & Hall (1988) show how to prove this type of result. Let p̂_j denote the version of p_j when all population cumulants are replaced by their sample counterparts, and put σ̂_i^2 = n_i^{-1} Σ_j (X_ij − X̄_i·)^2. Then

    P(T*_01 ≤ z | X) = F(z; σ̂_1^2, ..., σ̂_r^2) + n^{-1} p̂_1(z) + ⋯
                     = F(z; σ̂_1^2, ..., σ̂_r^2) + n^{-1} p_1(z) + O_p(n^{-3/2}) ,    (3)

the last relation holding since p̂_j = p_j + O_p(n^{-1/2}). However, the fact that σ̂_i^2 is distant O_p(n^{-1/2}) from σ_i^2 means that F(z; σ̂_1^2, ..., σ̂_r^2) is O_p(n^{-1/2}) from F(z; σ_1^2, ..., σ_r^2). Consequently, comparing (2) and (3), we see that P(T*_01 ≤ z | X) is O_p(n^{-1/2}) from P(T_01 ≤ z), assuming H_0, as claimed.

If we repeat this argument for T_02 instead of T_01, we find that the first term on the right-hand side of (2) does not depend on any of σ_1^2, ..., σ_r^2. Comparing versions of (2) and (3) in this case we conclude that

    P(T*_02 ≤ z | X) = P(T_02 ≤ z) + O_p(n^{-3/2}) .

Therefore the error in the bootstrap approximation to the null distribution of T_02 is of order n^{-3/2}, not n^{-1/2}. In fact, the level-error of a bootstrap hypothesis test based on T_02 is O(n^{-2}), not O(n^{-3/2}), due to inherent symmetry of general two-sided


testing problems; see for example Barndorff-Nielsen & Hall (1988). On the other hand, the level-error of a test based on T_01 is only O(n^{-1}). It could be reduced to O(n^{-2}) by bootstrap iteration (e.g., Hall, 1986; Beran, 1987; Hall & Martin, 1988b), but the numerical expense of that procedure can be considerable.
2.4 Homoscedastic Problems

Treatment of homoscedastic problems should be different from that described above. Firstly, it is clear that when the σ_i's are identical, T_1 is a more appropriate test statistic than T_2, because it uses all the data to estimate scale in each population. Secondly, the method of resampling should take account of the assumption of homoscedasticity. We shall confine attention throughout to tests based on T_1, and discuss two different resampling schemes. The first scheme is appropriate when the σ_i's are identical but the populations Π_i may have differing shapes, and the second scheme when the Π_i's are the same except possibly for their means μ_i. The object here is to highlight how the method of resampling should be modified according to assumptions; of course, it is rare in practice to encounter the first of these models, and there are standard, non-bootstrap methods for handling the second model. Put Z_ij = (X_ij − μ_i)/σ_i, define Z̄_i· and Z̄·· in the obvious manner, and let T_11 denote the version of T_1 computed from the Z_ij's.

The distribution of T_11 is invariant under translations and re-scalings of the data, and is identical to that of T_1 when H_0 is true and σ_1 = σ_2 = ⋯ = σ_r. This observation motivates the following resampling scheme. Let X_i* be as in Subsection 2.2; put Z*_ij = (X*_ij − X̄_i·)/σ̂_i, where σ̂_i^2 = n_i^{-1} Σ_j (X_ij − X̄_i·)^2; define Z̄*_i· = n_i^{-1} Σ_j Z*_ij and Z̄*·· = n^{-1} Σ_i Σ_j Z*_ij; and let T*_11 denote the version of T_11 computed from the Z*_ij's. If σ_1 = σ_2 = ⋯ = σ_r then an approximate α-level test of H_0 is to reject H_0 if T_1 > t̂_11, where t̂_11 is defined by

    P(T*_11 ≤ t̂_11 | X) = 1 − α .


Arguing as in Subsection 2.3 we may show that the error in the bootstrap approximation to the null distribution of T_1 is O_p(n^{-3/2}), and that


the level-error of the test is O(n^{-2}). To appreciate the basis for the resampling schemes, observe that conditional on X_i the resample Z_i* = {Z*_ij, 1 ≤ j ≤ n_i} is drawn from a population whose mean and variance do not depend on i. The population from which Z*_ij comes does, however, depend on i. To treat the case where populations are assumed distributed identically about their means, let W denote the set of all n values of (X_ij − X̄_i·)/σ̂_i, and let W* = {W*_ij, 1 ≤ j ≤ n_i and 1 ≤ i ≤ r} be an n-sample drawn with replacement from W. Define W̄*_i· and W̄*·· in the obvious manner, and put T*_21 equal to the version of T_11 computed from the W*_ij's.

m T

An approximate α-level test of H_0 is to reject H_0 if T_1 > t̂_21, where t̂_21 is defined by

    P(T*_21 ≤ t̂_21 | X) = 1 − α .

Assuming that populations are identically distributed about their means, this bootstrap approximation to the null distribution of T_1 is in error by O_p(n^{-3/2}), and the level-error of the test is O(n^{-2}).
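The second homoscedastic scheme can be sketched as follows; the function name is ours, and σ̂_i here is taken to be the divisor-n_i standard deviation, our reading of the paper's σ̂_i. The standardized residuals (X_ij − X̄_i·)/σ̂_i from all samples are pooled into a single set W, and every bootstrap observation is drawn from that common pool.

```python
import numpy as np

def resample_pooled(samples, rng=None):
    # Homoscedastic, identically-shaped-populations scheme: pool the
    # per-sample standardized residuals into one set W, then draw each
    # bootstrap sample (of the original sizes) from the common pool.
    rng = np.random.default_rng(rng)
    pool = np.concatenate([(x - x.mean()) / x.std() for x in samples])
    return [rng.choice(pool, size=len(x), replace=True) for x in samples]
```

Because every resampled observation comes from the same pool, the resampled populations are identical across i, mirroring the assumption under which this scheme is valid.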
3. Methodology for Two-way Analysis of Variance

In the presence of replication it is feasible to conduct two-way analysis of variance without assuming homoscedasticity. Then, an analogue of the procedure developed for one-way analysis in Subsections 2.2 and 2.3 may be employed. However, for simplicity of exposition we shall content ourselves here with a resampling scheme tailored to the non-replicated, homoscedastic case; of course, standard ANOVA procedures are usually adequate for this situation. Our model is

    X_ij = μ + α_i + β_j + ε_ij ,    1 ≤ i ≤ r and 1 ≤ j ≤ s ,

where Σ_i α_i = Σ_j β_j = 0 and the ε_ij's are independent and identically distributed. The hypothesis under test is

    H_0 : α_1 = α_2 = ⋯ = α_r = 0 .

Put n = rs, X̄_i· = s^{-1} Σ_j X_ij, X̄·_j = r^{-1} Σ_i X_ij, and X̄·· = n^{-1} Σ_i Σ_j X_ij. Then our test statistic is

    T = s Σ_i (X̄_i· − X̄··)^2 / [ {(r − 1)(s − 1)}^{-1} Σ_i Σ_j (X_ij − X̄_i· − X̄·_j + X̄··)^2 ] .


TABLE 1
Level accuracy when r = 2

                          Sample sizes (n_1, n_2)
                      (15,15)    (20,20)    (15,25)
Test based on T_1
Std devs σ_1,σ_2
  1,1                   2.2        3.5        3.8
  1,3                   7.3        7.8        6.4
  3,1                   7.3        7.8        9.6
  1,6                   9.2        7.4        6.6
  6,1                   9.2        7.4        9.4
Test based on T_2
  1,1                   1.6        2.1        3.1
  1,3                   5.8        5.6        3.9
  3,1                   5.8        5.6        6.6
  1,6                   7.4        6.7        6.5
  6,1                   7.4        6.7        8.6

A resampling scheme for approximating the null distribution of T may be developed as follows. Put W = {X_ij − X̄_i· − X̄·_j + X̄·· , 1 ≤ i ≤ r and 1 ≤ j ≤ s}. Let W* = {W*_ij, 1 ≤ i ≤ r and 1 ≤ j ≤ s} denote an n-sample drawn with replacement from W. Define W̄*_i· = s^{-1} Σ_j W*_ij, W̄*·_j = r^{-1} Σ_i W*_ij, W̄*·· = n^{-1} Σ_i Σ_j W*_ij, and let T* denote the version of T computed from the W*_ij's. An approximate α-level test of H_0 is to reject H_0 if T > t̂, where t̂ is defined by

    P(T* ≤ t̂ | X) = 1 − α .

It may be shown that the approximation error P(T ≤ z) − P(T* ≤ z | X) equals O_p(n^{-3/2}), and that the level error of the test is O(n^{-2}).
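The two-way scheme can be sketched in code. This is illustrative only: `twoway_bootstrap` is our name, and the F-ratio form used for the row-effect statistic (between-row sum of squares over the interaction mean square) is our assumption about the exact form of T.

```python
import numpy as np

def twoway_bootstrap(X, B=200, alpha=0.05, rng=None):
    # X is an r x s table. Resample from the doubly-centred residuals
    # W_ij = X_ij - Xbar_i. - Xbar_.j + Xbar.., which satisfy H0, and
    # recompute the row-effect statistic on each resampled table.
    rng = np.random.default_rng(rng)
    r, s = X.shape

    def stat(A):
        rows = A.mean(axis=1, keepdims=True)    # row means, r x 1
        cols = A.mean(axis=0, keepdims=True)    # column means, 1 x s
        grand = A.mean()
        resid = A - rows - cols + grand         # interaction residuals
        between = s * ((rows - grand) ** 2).sum()
        return between / ((resid ** 2).sum() / ((r - 1) * (s - 1)))

    W = (X - X.mean(axis=1, keepdims=True)
           - X.mean(axis=0, keepdims=True) + X.mean()).ravel()
    t_obs = stat(X)
    t_star = [stat(rng.choice(W, size=(r, s), replace=True)) for _ in range(B)]
    crit = float(np.quantile(t_star, 1.0 - alpha))
    return t_obs > crit, t_obs, crit
```

Double centring removes row and column effects from the resampling pool, so the bootstrap tables mimic the null model however large the observed row effects are.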
4. Monte Carlo Results, and Data Applications

4.1 Simulation Study

To illustrate our point numerically we simulated application of our tests to skew, heteroscedastic data in the context of one-way ANOVA. Assume that the i-th population follows the model (5), for 1 ≤ i ≤ r, in which the ε_ij's were simulated as independent exponential variables with unit mean. Under H_0, μ_i does not depend on i and may be taken as zero. Tables 1 and 2 depict actual levels, in percentages, each


TABLE 2
Level accuracy when r = 3

                          Sample sizes (n_1, n_2, n_3)
                      (15,15,15)   (20,20,20)   (15,20,25)
Test based on T_1
Std devs σ_1,σ_2,σ_3
  1,1,1                  2.3          1.4          2.9
  1,3,6                  8.0          6.7          7.4
  6,3,1                  8.0          6.7          8.6
  1,5,10                 7.1          7.2          6.2
  10,5,1                 7.1          7.2          9.9
Test based on T_2
  1,1,1                  3.5          2.6          3.3
  1,3,6                  5.0          4.6          3.6
  6,3,1                  8.0          6.7          3.9
  1,5,10                 4.0          5.0          4.0
  10,5,1                 4.0          5.0          5.3

estimated from 1,000 trials with B = 200 bootstrap resamples, of nominal 5% bootstrap tests based on either T_1 or T_2. The tests were made at three sample sizes and various standard deviations, and for r = 2 (Table 1) and r = 3 (Table 2). The data were exponential, distributed according to model (5) with μ_1 = μ_2 = 0. Recall that under conditions of heteroscedasticity, only T_2 provides a test which is asymptotically pivotal. Our argument in Section 2 predicted that for this reason, and when heteroscedasticity is present, tests based on T_2 should have greater level accuracy than tests based on T_1. Our simulation study confirms that this tends to be the case. Only in the case of fixed σ_i does the test based on T_1 offer greater level accuracy than the test based on T_2, and then only for the case r = 2. For symmetric errors this phenomenon is not so clearly evident, because of the higher level accuracy of both tests. These results are part of a larger study, exploring a range of issues in bootstrap resampling. Also included in the study was an investigation of the power of the bootstrap procedures, which generally indicated satisfactory performance. Table 3 depicts actual powers, each estimated from 1,000 trials, of bootstrap tests based on either T_1 or T_2. Data were exponential, distributed according to model (5) with r = 3 samples and μ_1 = 0, μ_2 = 1, μ_3 = 2, σ_1 = 5, σ_2 = 3, σ_3 = 1; the nominal level of each test was 5%.
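A stripped-down version of such a level study can be sketched as follows. This is not the authors' code: the centred-exponential error model is our assumption about the form of model (5), the statistic is supplied by the caller, and realistic runs would use the 1,000 trials and B = 200 resamples quoted above rather than the small defaults here.

```python
import numpy as np

def estimate_level(ns, sds, stat, trials=200, B=100, alpha=0.05, seed=0):
    # Monte Carlo estimate of the actual level: generate heteroscedastic
    # samples satisfying H0 (centred exponential errors, so all means are
    # zero), run the bootstrap test on each trial, and report the
    # proportion of rejections.
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(trials):
        samples = [sd * (rng.exponential(1.0, n) - 1.0)
                   for n, sd in zip(ns, sds)]
        t_obs = stat(samples)
        t_star = []
        for _ in range(B):
            # Centre each resample at its own sample's mean, as in Section 2.
            res = [rng.choice(x, size=len(x), replace=True) - x.mean()
                   for x in samples]
            t_star.append(stat(res))
        rejections += t_obs > np.quantile(t_star, 1.0 - alpha)
    return rejections / trials
```

An estimated level near the nominal 5% indicates good level accuracy; the tables above report exactly this quantity, in percentages.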
4.2 Application to Data Sets

Some comparison between bootstrap testing procedures and more classical procedures, in situations where the latter are appropriate, can be made by looking at their relative performances on particular data sets.


TABLE 3
Power when r = 3

Sample sizes          Power of test       Power of test
(n_1, n_2, n_3)       based on T_1        based on T_2
(10,10,10)               24.8%               20.1%
(20,20,20)               35.9%               42.2%
(30,30,30)               45.6%               61.0%
We have investigated a range of published examples in which the means of two or more samples were compared, in the presence of heterogeneous variances. The conclusion drawn in each case was the same for the bootstrap tests (based on either T_1 or T_2) as it was for the parametric test. A few of these are reported below.

Example 1. Tippett (1952, Table 5.3) analysed the following data (% muscle glycogen) on the effect of insulin on rabbits:

Control: 0.19, 0.18, 0.21, 0.30, 0.66, 0.42, 0.08, 0.12, 0.30, 0.27
Treated: 0.15, 0.13, 0.00*, 0.07, 0.27, 0.24, 0.19, 0.04, 0.08, 0.20, 0.12
* below detectable level

n_1 = 10, n_2 = 11; X̄_1 = 0.273, X̄_2 = 0.135; s_1 = 0.168, s_2 = 0.085. The usual 2-sample t-test for equality of means yields a P-value of about 0.027. Tippett noted the possibility that the variances may be different, but concluded nevertheless that the means were probably different, because of Welch's (1938) observation that the t-test is not unduly affected when the sample sizes are equal (here they are almost equal). The two bootstrap tests, using 200 resamples, yield significance probabilities of 0.056 and 0.044 for T_1 and T_2 respectively, in agreement with Tippett's assessment.

Example 2. Goulden (1952, Example 4.2) analysed the following data on protein determinations in wheat samples from two different provinces in Canada, abstracted from a large survey.

Sample 1: 15.1, 14.3, 11.5, 14.5, 15.4, 12.5, 14.6, 16.6
Sample 2: 12.2, 12.5, 11.2, 12.6, 11.0, 11.6, 12.0, 12.5, 11.8, 12.4, 11.5, 12.0, 11.6, 12.7

n_1 = 8, n_2 = 14; X̄_1 = 14.31, X̄_2 = 11.97; s_1 = 0.57, s_2 = 0.14.


Using an approximation to the significance level of the statistic T_1 due to Cochran & Cox (1957, p.101), Goulden found that the significance probability associated with a test of identical means was between 0.01 and 0.05. The standard Welch (1938) procedure gives a two-sided P-value of 0.00425. Using 1,000 resamples, the significance probabilities for bootstrap tests based on T_1 and T_2 were 0.0055 and 0.0025 respectively.

Example 3. Snedecor & Cochran (1976, Example 10.12.1) reported the following data comprising the number of days survived by mice inoculated with three strains of typhoid organism (numbers in parentheses indicate multiplicities):

Sample 1 (n_1 = 31): 2(6), 3(4), 4(9), 5(8), 6(3), 7
Sample 2 (n_2 = 60): 2, 3(3), 4(3), 5(6), 6(6), 7(14), 8(11), 9(4), 10(6), 11(2), 12(3), 13
Sample 3 (n_3 = 133): 2(3), 3(5), 4(5), 5(8), 6(19), 7(23), 8(22), 9(14), 10(14), 11(7), 12(8), 13(4), 14(1)

X̄_1 = 4.03, X̄_2 = 7.37, X̄_3 = 7.80; s_1 = 1.38, s_2 = 2.42, s_3 = 2.58.

The standard F-test for equal means (given merely as an exercise in Example 10.12.2) yields an F-value of 31.0, for 2 and 221 degrees of freedom. Snedecor & Cochran noted that the unequal sample sizes and possible heterogeneity of variances should be borne in mind in analysing these data. For bootstrap resampling, there is a different potential problem, namely the large number of replicated values in each sample. Were the sample sizes smaller, one might wish to resample from smoothed data (e.g., the data have been rounded to the nearest day, so a random uniform U[−1/2, 1/2] variable could be added to each resampled value X*_ij; see Fisher & Hall, 1990). However, no difficulties were encountered in this instance. The bootstrap tests based on T_1 and T_2 resulted in extremely small P-values, with the actual value of each statistic exceeding the largest of the 1,000 bootstrap values. (We have analysed the data directly, here, for comparative purposes; it would certainly be beneficial to consider a preliminary transformation of the data to stabilize the variance.)
5. Discussion

We have described bootstrap analysis of variance tests which are nonparametric, in the sense that they involve no assumptions about any of


the parent populations. This has been achieved by resampling from data values. Should the underlying populations be specified up to some vector λ of parameters, we would estimate λ at λ̂, say, and resample from the population with λ set equal to λ̂. For example, suppose we assume that the i-th population is Normal N(μ_i, σ_i^2). We estimate μ_i at X̄_i·, estimate σ_i^2 at s_i^2 = (n_i − 1)^{-1} Σ_j (X_ij − X̄_i·)^2, draw {X*_i1, ..., X*_in_i} at random from an N(X̄_i·, s_i^2) population, and put Y*_ij = X*_ij − X̄_i· exactly as before. Of course, this is equivalent to drawing Y*_ij from an N(0, s_i^2) population. In this circumstance T_02 and T*_02 have identical distributions, and so the bootstrap produces a test which is exact, except for inaccuracies arising from doing only a finite amount of resampling. The bootstrap is equivalent to constructing exact statistical tables by Monte Carlo methods. Likewise, the parametric bootstrap applied to homoscedastic Normal data, and using the statistic T_1 or T_11, also produces an exact test. Since the distributions of T_1 and T_11 are known (they are Fisher's F), on this occasion the bootstrap is doing no more than reconstruct existing tables by simulation. For T_2 the exact distribution is not tabulated, although it is known in principle. James (1951) devised clever and practical approximations to this distribution.

In the case of one-way analysis of variance with r = 2, there is no difficulty in testing one-sided alternative hypotheses such as H_1 : μ_1 < μ_2. Appropriate test statistics U_1 and U_2 are studentized versions of the difference between sample means, standardized by the pooled variance estimate and by separate sample variance estimates, in homoscedastic and heteroscedastic cases respectively. For a test in nonparametric circumstances, follow the resampling procedures recommended in Section 2, obtaining resampled data values {Y*_ij} for heteroscedastic problems, {Z*_ij} for homoscedastic problems with different populations, and {W*_ij} for homoscedastic problems with identical populations. Use the Y*_ij's in place of the X_ij's to construct U*_2 using the formula for U_2, and either the Z*_ij's or the W*_ij's in place of the X_ij's to construct U*_1 using the formula for U_1. Approximate the distribution of U_i by that of U*_i conditional on X, and estimate critical points accordingly. In the heteroscedastic case, this solution of the two-sample testing problem is related


to the nonparametric bootstrap solution of the Behrens-Fisher problem (Hall & Martin, 1988a), but with the important difference that the resampling scheme now centres the difference between means under both null and alternative hypotheses. These arguments, and those in earlier sections, may be adapted in an obvious way to test hypotheses involving contrasts, for example to test the goodness of fit of a set of contrasts or to make multiple comparisons. The philosophy is exactly that of the previous paragraph: resample as described in earlier sections, and use the resampled data values in place of the X_ij's to construct a bootstrap statistic whose distribution, conditional on X, approximates the distribution of the test statistic under both null and alternative hypotheses.

All our statistics use sample standard deviations to scale differences between means. This is more out of a desire to keep reasonably close to traditional methods than out of necessity. Alternative robust scale estimates, such as the interquartile range, could be employed. Likewise, our bootstrap-based testing philosophy applies to statistics based on sums of absolute differences rather than sums of squares. A test for homoscedasticity, such as Levene's test (Levene, 1960), may be used as a precursor to any analysis of variance. It too has a bootstrap version.

Finally, three points should be made. Firstly, bootstrapping should not proceed independently of other aspects of good data analysis. For example, attention to outliers and use of appropriate transformations may improve both the validity of the model and the efficacy of the bootstrap method. Secondly, we have employed the analysis of variance example merely to illustrate the main issues which arise in bootstrap hypothesis testing. Thirdly, the duality between testing and interval estimation opens up the possibility of constructing confidence intervals indirectly from bootstrap hypothesis tests. This aspect is particularly interesting in cases where standard bootstrap interval methods do not perform well, and will be discussed elsewhere.
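The one-sided two-sample procedure discussed above can be sketched as follows. This is our illustration, not the paper's code: the Welch-type form of U_2 is an assumption, and the resampling centres each sample at its own mean, as in Section 2, so that the bootstrap distribution is valid under both null and alternative hypotheses.

```python
import numpy as np

def one_sided_test(x1, x2, alpha=0.05, B=200, rng=None):
    # One-sided bootstrap test of H0: mu1 = mu2 against H1: mu1 < mu2 in the
    # heteroscedastic two-sample case, using a Welch-type studentized
    # difference of means (our assumed form of U2).
    rng = np.random.default_rng(rng)

    def u2(a, b):
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        return (b.mean() - a.mean()) / se

    u_obs = u2(x1, x2)
    u_star = np.empty(B)
    for i in range(B):
        # Centre each resample at its own sample's mean (Y*_ij = X*_ij - Xbar_i).
        a = rng.choice(x1, size=len(x1), replace=True) - x1.mean()
        b = rng.choice(x2, size=len(x2), replace=True) - x2.mean()
        u_star[i] = u2(a, b)
    return u_obs > np.quantile(u_star, 1.0 - alpha)
```

Because the test is one-sided, only the upper bootstrap quantile is used; the two-sided symmetry argument of Subsection 2.3, and its O(n^{-2}) level error, does not apply here.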

References
BARNDORFF-NIELSEN, O.E. & HALL, P. (1988). On the level-error after Bartlett adjustment of the likelihood ratio statistic. Biometrika 75, 374-378.
BERAN, R. (1987). Prepivoting to reduce level error of confidence sets. Biometrika 74, 457-468.
BERAN, R. (1988). Prepivoting test statistics: a bootstrap view of asymptotic refinements. J. Amer. Statist. Assoc. 83, 682-697.
BROWN, M.B. & FORSYTHE, A.B. (1974a). The small sample behaviour of some statistics which test the equality of several means. Technometrics 16, 129-132.
BROWN, M.B. & FORSYTHE, A.B. (1974b). The ANOVA and multiple comparisons for data with heterogeneous variances. Biometrics 30, 719-724.
COCHRAN, W.G. & COX, G.M. (1957). Experimental Designs. Second edition. New York: John Wiley.
DIJKSTRA, J.B. & WERTER, P.S.P.J. (1981). Testing the equality of several means when the population variances are unequal. Commun. Statist. - Simula. Computa. B10, 557-569.
EFRON, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1-26.
FISHER, N.I. & HALL, P. (1990). Bootstrap algorithm for small samples. J. Statist. Comput. Simul., to appear.
GOULDEN, C.H. (1952). Methods of Statistical Analysis. Second edition. New York: John Wiley.
HALL, P. (1986). On the bootstrap and confidence intervals. Ann. Statist. 14, 1431-1452.
HALL, P. (1988). Theoretical comparison of bootstrap confidence intervals. (With discussion.) Ann. Statist. 16, 927-953.
HALL, P. & MARTIN, M.A. (1988a). On the bootstrap and two-sample problems. Austral. J. Statist. 30A, 179-192.
HALL, P. & MARTIN, M.A. (1988b). On bootstrap resampling and iteration. Biometrika 75, 661-671.
HARTIGAN, J.A. (1986). Contribution to discussion. Statist. Sci. 1, 75-77.
HINKLEY, D.V. (1988). Bootstrap methods. J. Roy. Statist. Soc. Ser. B 50, 321-337.
JAMES, G.S. (1951). The comparison of several groups of observations when the ratios of the population variances are unknown. Biometrika 38, 324-329.
LEVENE, H. (1960). Robust tests for equality of variances. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, eds I. Olkin et al., pp. 278-292. Stanford, Calif.: Stanford University Press.
SNEDECOR, G.W. & COCHRAN, W.G. (1976). Statistical Methods. Sixth edition. Ames, Iowa: The Iowa State University Press.
TIPPETT, L.H.C. (1952). The Methods of Statistics. Fourth edition. New York: John Wiley.
WELCH, B.L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika 29, 350-362.
