JEMBE 01761
Bryan F. J. Manly
Centre for Applications of Statistics and Mathematics, University of Otago, Dunedin, New Zealand
(Revision received 6 January 1992; accepted 17 January 1992)
Abstract: Bootstrapping of a pilot sample to aid in sample size determination has been advocated by Bros
& Cowell (1987). In this note it is pointed out that the apparent limitation with this method to assessing
samples up to half the size of the pilot sample is easily overcome by sampling the pilot sample with
replacement rather than without replacement. The use of the modified procedure is discussed in the context
of two-sample and multi-sample comparisons. Evidence is presented which shows that for normally
distributed data even pilot samples as small as 20 will give satisfactory results, but rather larger pilot samples
may be required for extremely non-normal distributions.
INTRODUCTION
In a recent paper in this journal, Bros & Cowell (1987) discussed the problem of
deciding on the appropriate sample size to use in a study to compare two treatments
in terms of the difference between the population means that can be detected and the
effort required to take samples. As part of their discussion, they suggested that
bootstrapping a pilot sample can be used to assess the variation inherent in using
samples of different sizes, but noted that this runs into difficulties with larger sample
sizes because the number of possible samples of size n that can be drawn from a pilot
sample of size P becomes small as n approaches P.
In fact, one of the key aspects of bootstrapping is that samples are taken with
replacement rather than without replacement (Efron, 1979; Efron & Tibshirani, 1986).
Hence, a pilot sample of size P can be bootstrap-sampled to produce samples of any
size, including sizes that are > P (Bickel & Freedman, 1981). This means that if the
method of sample size determination proposed by Bros & Cowell (1987) is modified
so that samples are taken with replacement, then it becomes much more useful.
Indeed, with samples from the highly non-normal distributions that are often encoun-
tered in biological studies the method has the potential to be the most useful available
approach for deciding on sample sizes.
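The point about sampling with replacement can be illustrated with a minimal sketch using only the Python standard library. The function name `bootstrap_sample`, the seed, and the small pilot sample are illustrative assumptions, not part of the original note.

```python
import random

def bootstrap_sample(pilot, n, rng):
    """Draw n values from the pilot sample WITH replacement; n may exceed len(pilot)."""
    return rng.choices(pilot, k=n)

rng = random.Random(1)             # fixed seed only so that the sketch is repeatable
pilot = [0, 1, 1, 2, 3, 3, 4, 5]   # hypothetical pilot sample, P = 8
larger = bootstrap_sample(pilot, 20, rng)   # n = 20 > P = 8 causes no difficulty
print(len(larger), set(larger) <= set(pilot))

# Sampling WITHOUT replacement cannot produce a sample larger than the pilot sample.
try:
    rng.sample(pilot, 20)
except ValueError as err:
    print("without replacement fails:", err)
```

With replacement, every resampled value still comes from the pilot sample, but the sample size is no longer bounded by P.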
Correspondence address: B. F.J. Manly, Department of Mathematics and Statistics, University of Otago,
PO Box 56, Dunedin, New Zealand.
190 B.F.J. MANLY
To clarify the procedure that is now proposed, consider the example data that were
used by Bros & Cowell (1987), as shown in Table I. These data are assumed to be a
pilot sample of P = 30 cores from a benthic community, with the observation considered
being the number of species observed in a core. Suppose that it is required to determine
the sample size n that should be used in an experiment that involves comparing the mean
$\bar{x}_T$ of a sample from a treated population with the mean $\bar{x}_0$ of a sample of the same size
from a control population. Furthermore, assume that the comparison between the
means should be in terms of:
(i) a 95% confidence interval $\bar{x}_T - \bar{x}_0 \pm t(0.05, 2n-2)\,S\sqrt{2/n}$, where $t(0.05, 2n-2)$
is the absolute value from the t-distribution with 2n - 2 df that is exceeded with
P = 0.05, and S is the usual pooled within-sample standard deviation; and
(ii) whether or not the statistic $t = (\bar{x}_T - \bar{x}_0)/\{S\sqrt{2/n}\}$ is significantly different from
0 on a two-sided test.
In this context, the proposed procedure for assessing the use
of samples of size n is as follows:
(a) Take a large number B (say 1000) of pairs of samples of size n from the pilot sample
with replacement.
(b) Add a range of effects $\delta$ to the values in the second sample in each pair to represent
treatment effects (say $\delta = 0$, $0.25\tau$, $0.5\tau$, $1\tau$, and $2\tau$, where $\tau$ is the standard deviation
of the pilot sample).
(c) Find the difference $\bar{x}_T - \bar{x}_0$ between the mean of the treated and untreated sample
for each pair of samples, and hence estimate the mean difference that is exceeded
with P = 0.05.
(d) For the means with $\delta = 0$, calculate $t = (\bar{x}_T - \bar{x}_0)/\{S\sqrt{2/n}\}$ for each pair of
samples, and hence estimate the absolute value of t that is exceeded with P = 0.05.
(e) For each value of $\delta$ determine which of the B bootstrap t-statistics are significantly
different from 0 in comparison with the usual t-table, and hence estimate the
probability of getting a significant test result for this value of $\delta$ using the observed
proportion significant.
TABLE I
Pilot sample used by Bros & Cowell (1987) to illustrate their discussion. This is assumed to be counts of
number of bivalve species in 30 cores taken from an intertidal mudflat.
Number of species   Number of cores
0                   1
1                   2
2                   7
3                   10
4                   7
5                   3
Total               30
SAMPLE SIZE DETERMINATION 191
The idea is then to use the estimates obtained in (c) to (e) as estimates of what would be
obtained from resampling the population that the pilot sample came from rather than
the pilot sample itself. In fact, this procedure works very well in practice, as is indicated
by the small simulation experiment that is described in the next section of this paper.
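Steps (a) to (e) can be sketched as follows, using only the Python standard library and the frequencies of Table I. This is an illustration under stated assumptions rather than the program used for the results in this note: the function names, the seed, the reading of "exceeded with P = 0.05" as an empirical upper 5% point, and the handling of pairs with zero pooled variance are all assumptions. The tabulated critical value (2.306 for 8 df) is supplied by the caller.

```python
import math
import random
import statistics

# Pilot sample rebuilt from the Table I frequencies (species count -> number of cores).
FREQ = {0: 1, 1: 2, 2: 7, 3: 10, 4: 7, 5: 3}
PILOT = [x for x, f in FREQ.items() for _ in range(f)]

def t_stat(control, treated):
    """Return (mean difference, pooled-SD two-sample t) for equal-sized samples."""
    n = len(control)
    diff = statistics.mean(treated) - statistics.mean(control)
    pooled_var = (statistics.variance(control) + statistics.variance(treated)) / 2
    if pooled_var == 0:
        return diff, math.nan          # t is undefined if both samples are constant
    return diff, diff / math.sqrt(pooled_var * 2 / n)

def bootstrap_design(pilot, n, multipliers, t_crit, B=1000, seed=0):
    """Steps (a)-(e); t_crit is the tabulated two-sided 5% value for 2n - 2 df."""
    rng = random.Random(seed)
    tau = statistics.stdev(pilot)      # pilot sample standard deviation
    # (a) B pairs of samples of size n drawn with replacement, reused for every effect
    pairs = [(rng.choices(pilot, k=n), rng.choices(pilot, k=n)) for _ in range(B)]
    out = {}
    for m in multipliers:
        # (b) effect defined as a multiple m of the pilot standard deviation
        stats = [t_stat(c, [y + m * tau for y in t]) for c, t in pairs]
        diffs = sorted(d for d, _ in stats)
        abs_t = sorted(abs(t) for _, t in stats if not math.isnan(t))
        out[m] = {
            # (c) mean difference exceeded with P = 0.05 (empirical upper 5% point)
            "diff_upper_5pct": diffs[int(0.95 * len(diffs))],
            # (d) |t| exceeded with P = 0.05; the m = 0 entry is the bootstrap critical value
            "abs_t_upper_5pct": abs_t[int(0.95 * len(abs_t))],
            # (e) proportion of the B pairs significant against the t-table value
            "power": sum(t > t_crit for t in abs_t) / B,
        }
    return out

results = bootstrap_design(PILOT, n=5, multipliers=[0, 0.25, 0.5, 1, 2], t_crit=2.306)
for m, r in results.items():
    print(m, r)
```

Because the same B pairs are reused for every effect, the estimated mean difference for a given multiplier is simply the null value shifted by that multiple of the pilot standard deviation.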
Two extra comments can be made about this procedure. First, it can be noted that
the same B pairs of bootstrap samples can be used for each value of $\delta$ to reduce the
computing required. Second, it is crucial to define effects in terms of multiples of
standard deviation. Simulations indicate that the procedure works relatively poorly if
an attempt is made to define effects absolutely since the level of variation from pilot
sample to pilot sample is not controlled for. In fact, this is hardly surprising. For
example, suppose that a population has a SD of 1.0 and a pilot sample taken from this
population has a SD of 0.5. Then, it would be unreasonable to expect to estimate the
distribution of any statistic that depends on the population standard deviation with any
accuracy by bootstrapping. However, the distribution of $t = (\bar{x}_T - \bar{x}_0)/\{S\sqrt{2/n}\}$ may
still be well determined in the null hypothesis case by bootstrapping since it is a function
of the pilot sample standard deviation rather than the population standard deviation.
Furthermore, the distribution of t may be well determined by bootstrapping if the null
hypothesis is not true providing that the expected value of the difference $\bar{x}_T - \bar{x}_0$ is a
function of the pilot sample standard deviation rather than the population standard
deviation.
There is a general principle here that has been discussed at greater length by Hall &
Wilson (1991): bootstrapping should use pivotal statistics whenever possible.
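A small numerical check may help to show why effects defined as multiples of the standard deviation behave well. In the sketch below (Python standard library; the pilot values, seeds, and function names are hypothetical), stretching the pilot sample by a constant factor leaves the bootstrap t-statistic unchanged when the effect is a multiple of the pilot standard deviation, so the result does not depend on how well the pilot sample happens to estimate the population spread.

```python
import math
import random
import statistics

def pooled_t(control, treated):
    """Two-sample t-statistic with pooled standard deviation, equal sample sizes."""
    n = len(control)
    s2 = (statistics.variance(control) + statistics.variance(treated)) / 2
    return (statistics.mean(treated) - statistics.mean(control)) / math.sqrt(2 * s2 / n)

def one_bootstrap_t(sample, multiplier, seed, n=5):
    """One bootstrap pair; the effect added is `multiplier` pilot standard deviations."""
    r = random.Random(seed)
    tau = statistics.stdev(sample)
    control = [sample[r.randrange(len(sample))] for _ in range(n)]
    treated = [sample[r.randrange(len(sample))] + multiplier * tau for _ in range(n)]
    return pooled_t(control, treated)

rng = random.Random(3)
pilot = [rng.gauss(10, 2) for _ in range(30)]   # hypothetical pilot sample
stretched = [2.5 * x for x in pilot]            # same shape, spread 2.5 times larger

t_a = one_bootstrap_t(pilot, 1.0, seed=7)
t_b = one_bootstrap_t(stretched, 1.0, seed=7)   # same resampled positions as t_a
print(t_a, t_b)   # agree up to floating-point rounding
```

If the effect were instead a fixed absolute amount, the two t-statistics would in general differ, which is the difficulty described above.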
Suppose that the two populations being compared have the following true distributions
of species counts, where P(i) means the probability of observing i species:
Population 1: P(0) = 0.033, P(1) = 0.067, P(2) = 0.233, P(3) = 0.333, P(4) = 0.233,
P(5) = 0.100
Population 2: P(1) = 0.033, P(2) = 0.067, P(3) = 0.233, P(4) = 0.333, P(5) = 0.233,
P(6) = 0.100
Then, the expected frequencies in a sample of 30 from the first population are equal to
the observed frequencies for Bros & Cowell's (1987) pilot sample (Table I), while the
distribution for Population 2 is the distribution for Population 1 but with all the values
shifted by 1. As a result, the mean values for Populations 1 and 2 (2.97 and 3.97,
respectively) differ by exactly 1, while the SD values are both 1.20.
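The quoted means and standard deviations can be checked with a short calculation. The sketch below assumes that the probabilities are exactly the Table I frequencies divided by 30 (the values printed above being these rounded to three decimals); the variable and function names are illustrative.

```python
from fractions import Fraction
import math

freq = {0: 1, 1: 2, 2: 7, 3: 10, 4: 7, 5: 3}        # Table I frequencies
pop1 = {x: Fraction(f, 30) for x, f in freq.items()}  # Population 1
pop2 = {x + 1: p for x, p in pop1.items()}            # Population 2: shifted by 1

def mean_sd(dist):
    """Mean and standard deviation of a discrete distribution {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return float(mu), math.sqrt(var)

m1, s1 = mean_sd(pop1)
m2, s2 = mean_sd(pop2)
print(round(m1, 2), round(m2, 2), round(s1, 2), round(s2, 2))
```

Under this assumption the means round to 2.97 and 3.97 and both standard deviations round to 1.20, in agreement with the text.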
Now, suppose that a pilot sample of size 30 is taken from Population 1, and bootstrap
sampled to see how well a sample of size 5 from each population can estimate the mean
difference between the two populations, and to determine the probability that the
difference between the two sample means will be significantly large. Three specific
questions can be asked. First, how does the distribution of mean differences obtained
[Figure: panel titled "Cumulative distribution of sample mean differences"; vertical axis 0 to 1000; curves labelled "samples from population" and "pilot sample".]
[Figure: panel titled "Cumulative distribution of t"; vertical axis 0 to 1000; horizontal axis "t-statistic", -2.0 to 7.0; curves labelled "samples from population" and "pilot sample".]
Fig. 2. Cumulative distribution of t-statistic for comparing two samples of size 5 as estimated by bootstrapping
and by resampling populations.
Finally, it can be noted that 192 of the 1000 bootstrap pairs of samples gave an
absolute t-statistic greater than the critical value of 2.31 that is obtained from the t-table
for a two-sided test at the 5% level of significance, with 8 df. Hence, the bootstrap
estimate of the power of the t-test with sample sizes of 5 is 0.192. Since this is
a simple proportion, the SE value associated with the estimate is
$\sqrt{0.192(1 - 0.192)/1000} = 0.012$, and an approximate 95% confidence interval
for the true power (the estimate $\pm$ 1.96 SE) is 0.168-0.216. This can be compared with
the result found by taking 1000 pairs of samples from the true populations, which
resulted in an estimated power of 0.202, with 95% confidence interval 0.177-0.227.
Note that the confidence limit obtained from bootstrap-sampling is approximate for
three reasons. First, the standard error of the estimated power is only an estimate of
the true standard error. Second, the normal approximation to the binomial distribution
of the proportion is being assumed. Third, it is being assumed that any effects due to
the particular pilot sample being used are negligible.
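The standard error and approximate confidence interval for an estimated power follow from the binomial proportion formula used above. A minimal sketch, with the function name `power_ci` as an illustrative assumption:

```python
import math

def power_ci(significant, B, z=1.96):
    """Estimated power, its standard error, and a normal-approximation 95% interval."""
    p = significant / B
    se = math.sqrt(p * (1 - p) / B)
    return p, se, (p - z * se, p + z * se)

p, se, (lo, hi) = power_ci(192, 1000)   # 192 significant results out of 1000 pairs
print(f"power {p:.3f}, SE {se:.3f}, interval {lo:.3f}-{hi:.3f}")
```

Calling `power_ci(202, 1000)` gives the corresponding interval for the population-sampling estimate.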
This example indicates that for the populations being considered bootstrap-sampling
of a pilot sample of 30 gives very nearly the same result as sampling the populations,
at least for samples of size 5. In fact, a more extensive study (Manly, 1991) shows that
the same good agreement is found for sample sizes of 3-40 for these populations, and
presumably for larger sample sizes as well. (A comparison between bootstrap- and
population-sampling for sample sizes of 2 was complicated by the fact that the
probability of getting an estimated SD of 0 is not negligible.) The more extensive study
involved examining the results obtained with 100 different pilot samples and it was
found that the good agreement shown in Figs. 1 and 2 for bootstrap- and population-
sampling was not the result of a lucky choice of pilot sample.
Simulations of bootstrapping for two-sample comparisons have also been carried out
for other types of population. These simulations show that adequate pilot sample sizes
depend on the shape of the distributions being sampled. For distributions that are close
to normal (such as that shown in Table I), a pilot sample of size 20 gives good results,
for distributions that are distinctly non-normal (e.g., exponential distributions) a pilot
sample of size 40 gives good results, and for the extremely non-normal distributions that
often occur in a biological context (e.g., counts of very clustered organisms, with a high
proportion of zeroes) a pilot sample size in the hundreds may be necessary.
The simulations also indicate that large pilot sample sizes are required for extremely
non-normal distributions mainly to get good estimates of power values in the range
0.1-0.9. However, a pilot sample size of the order of 100 should be sufficient to
approximate the distribution of the t-statistic when the null hypothesis is true.
The bootstrap procedure can be used equally well for the design of studies to compare
several sample means. For example, consider an experiment that may be of interest to
government regulatory agencies for assessing the effect of discharges from a sewage
treatment plant on the growth of organisms. This involves measuring growth rates for
n replicates for each of several levels of discharge consisting of a certain percentage of
treated sewage mixed with seawater. Thus, the treatments might be 0% sewage (the
control), 5% sewage, 10% sewage, 25% sewage, 50% sewage, and 75% sewage. The
results are then analysed by comparing the mean for each of the treatments involving
sewage with the control mean using Dunnett's (1964) method, so that there is a P of
0.05 of declaring one or more results significant when the sewage has no effect.
In this situation, there might be some concern about using Dunnett's table for critical
values of t-statistics if the distribution of growth rates appears to be non-normal since
the critical values were computed on the assumption of a normal distribution. Also,
tables are not currently available for determining the power associated with different
numbers of replicates.
However, the bootstrap procedure can be used in an obvious way to estimate the
properties of different sample sizes. Thus, if a pilot sample with standard deviation $\tau$
is available, then the following procedure can be repeated a large number of times (e.g.,
1000 times) to estimate the critical values for t-statistics to use for tests, and to estimate
power values:
(a) Take six bootstrap samples of size n from the pilot sample to give a sample for each
treatment level.
(b) Add P% of a sewage effect $\delta\tau$ to the observations with this percentage of sewage,
for a suitable range of "sewage" effects $\delta$ in terms of the population standard
deviation (using the same bootstrap samples for each value of $\delta$).
(c) Calculate t-statistics of the form $t_i = (\bar{x}_i - \bar{x}_0)/\{S\sqrt{2/n}\}$ to compare the means of
the treated samples with the control mean, where S is the pooled sample standard
deviation. Hence, determine which differences, if any, are significant using
Dunnett's table of critical values.
From the results obtained in this way it is straightforward to estimate the distribution
of the t-statistics when the null hypothesis of no sewage effect is true, and hence estimate
the critical value such that the probability of it being exceeded by any of the $t_i$ values
is 0.05. This value can then be used in place of the critical value tabulated by Dunnett
if this seems necessary. It is also straightforward to estimate the probability of getting
a significant result for any particular percentage of sewage with different levels of effect.
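The multi-sample procedure can be sketched in the same way. The code below is an illustration under several assumptions that are not taken from the original note: the pilot sample of growth rates is simulated, the critical value is taken as the empirical upper 5% point of the largest absolute $t_i$ under the null hypothesis (a two-sided reading), and significance under an effect is judged against that bootstrap critical value rather than Dunnett's table. Function names and seeds are hypothetical.

```python
import math
import random
import statistics

PERCENT = [0, 5, 10, 25, 50, 75]   # sewage levels from the text; index 0 is the control

def control_comparison_ts(samples):
    """t_i = (mean_i - mean_0) / (S * sqrt(2/n)), with S pooled over all samples."""
    n = len(samples[0])
    pooled_var = sum(statistics.variance(s) for s in samples) / len(samples)
    denom = math.sqrt(2 * pooled_var / n)
    m0 = statistics.fmean(samples[0])
    return [(statistics.fmean(s) - m0) / denom for s in samples[1:]]

def simulate(pilot, n, delta, B=1000, seed=0):
    """Return B lists of t-statistics; P% sewage receives P% of the effect delta * tau."""
    rng = random.Random(seed)      # same seed -> same bootstrap samples for every delta
    tau = statistics.stdev(pilot)
    runs = []
    for _ in range(B):
        samples = [[x + (p / 100) * delta * tau for x in rng.choices(pilot, k=n)]
                   for p in PERCENT]
        runs.append(control_comparison_ts(samples))
    return runs

pilot_rng = random.Random(11)
pilot = [pilot_rng.gauss(50, 10) for _ in range(40)]   # hypothetical growth rates

null_runs = simulate(pilot, n=5, delta=0.0)
null_max = sorted(max(abs(t) for t in ts) for ts in null_runs)
crit = null_max[int(0.95 * len(null_max))]             # bootstrap critical value

alt_runs = simulate(pilot, n=5, delta=3.0)
power_by_level = {p: sum(abs(ts[i]) > crit for ts in alt_runs) / len(alt_runs)
                  for i, p in enumerate(PERCENT[1:])}
print(round(crit, 2), power_by_level)
```

The dose-response assumption enters only through the factor `p / 100` in `simulate`, so a different relationship could be substituted there.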
Simulations described in detail elsewhere (Manly, 1991) show that this procedure
works extremely well with normally distributed data and a pilot sample of > 20, while
for moderately non-normal data a larger pilot sample of 40 will be satisfactory. How-
ever, for extremely non-normal data a pilot sample size in the hundreds may be needed
to estimate power values in the range 0.1-0.9. Essentially, the conditions on the pilot
sample size are the same as for the two-sample comparison.
Note that the particular assumption made about the dose-response effect in (b)
above, that P% sewage gives P% of the maximum response, is not an inherent part of
the bootstrap procedure and could be modified easily.
DISCUSSION
It might be argued that it will often be difficult to take a pilot sample that is large
enough for sample design. But, as pointed out by Bros & Cowell (1987), it may be
possible to combine observations from several different samples after first adjusting the
values in each sample to a common mean, and possibly a common standard deviation.
Bootstrapping can then be carried out on the pooled sample.
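One way in which such pooling might be done is sketched below. The choice of targets is an assumption made for illustration only: each sample is centred on the grand mean of all observations and, optionally, rescaled to the pooled within-sample standard deviation. The function name and the example values are hypothetical.

```python
import math
import statistics

def pool_samples(samples, common_sd=False):
    """Adjust each sample to a common mean (optionally a common SD) and pool the values."""
    all_values = [x for s in samples for x in s]
    target_mean = statistics.fmean(all_values)
    df = sum(len(s) - 1 for s in samples)
    target_sd = math.sqrt(sum((len(s) - 1) * statistics.variance(s) for s in samples) / df)
    pooled = []
    for s in samples:
        m = statistics.fmean(s)
        scale = target_sd / statistics.stdev(s) if common_sd else 1.0
        pooled.extend(target_mean + scale * (x - m) for x in s)
    return pooled

site_a = [1, 2, 3, 4]       # hypothetical samples from two occasions or sites
site_b = [10, 14, 18]
pooled = pool_samples([site_a, site_b], common_sd=True)
print([round(x, 2) for x in pooled])
```

The pooled list can then be used as the pilot sample in the bootstrap procedures described earlier.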
Another potential problem is that different treatments may give different standard
deviations for observations, rather than just a change in the mean. A solution in that
case is to include a change in the standard deviation in the treatment effect that is applied
in the bootstrapping. It will then still be possible to estimate the distributions of test
statistics, power, etc. In fact, the great value of the bootstrap procedure is that this type
of effect is easy to allow for.
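As a sketch of how a change in standard deviation might be included in the treatment effect, the hypothetical function below shifts a bootstrap sample and also multiplies its spread about the sample mean by a chosen ratio. Scaling about the sample mean rather than the pilot mean is an assumption of this illustration.

```python
import statistics

def apply_treatment(sample, shift, sd_ratio=1.0):
    """Shift the mean by `shift` and multiply the spread about the mean by `sd_ratio`."""
    m = statistics.fmean(sample)
    return [m + shift + sd_ratio * (x - m) for x in sample]

original = [1, 2, 3, 4, 5]                       # hypothetical bootstrap sample
treated = apply_treatment(original, shift=2.0, sd_ratio=1.5)
print(statistics.fmean(treated), statistics.stdev(treated) / statistics.stdev(original))
```

Replacing the simple addition of an effect in the earlier procedures with a function of this kind leaves the rest of the bootstrap calculation unchanged.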
An obvious third drawback with bootstrapping is the computing involved or, what
may be more important, the need for a special computer program to be written to do
the calculations. Nothing much can be said about this, except to note that with small
samples it is possible to do the calculations using a spreadsheet program like
LOTUS 123 on a microcomputer.
REFERENCES
Bickel, P.J. & D.A. Freedman, 1981. Some asymptotic theory for the bootstrap. Ann. Statist., Vol. 9,
pp. 1196-1217.
Bros, W.E. & B.C. Cowell, 1987. A technique for optimizing sample size (replication). J. Exp. Mar. Biol.
Ecol., Vol. 114, pp. 63-71.
Dunnett, C.W., 1964. New tables for multiple comparisons with a control. Biometrics, Vol. 20, pp. 482-491.
Efron, B., 1979. Bootstrap methods: another look at the jackknife. Ann. Statist., Vol. 7, pp. 1-25.
Efron, B. & R. Tibshirani, 1986. Bootstrap methods for standard errors, confidence intervals, and other
methods of statistical accuracy. Statist. Sci., Vol. 1, pp. 54-77.
Hall, P. & S.R. Wilson, 1991. Two guidelines for bootstrap hypothesis testing. Biometrics, Vol. 47,
pp. 757-762.
Manly, B.F.J., 1991. On bootstrapping for sample design. Res. Rep. 1, 1991, Department of Mathematics
and Statistics, University of Otago, Dunedin, New Zealand.