
J. Exp. Mar. Biol. Ecol., 158 (1992) 189-196
© 1992 Elsevier Science Publishers B.V. All rights reserved 0022-0981/92/$05.00

JEMBE 01761

Bootstrapping for determining sample sizes in biological studies

Bryan F. J. Manly
Centre for Applications of Statistics and Mathematics, University of Otago, Dunedin, New Zealand
(Revision received 6 January 1992; accepted 17 January 1992)

Abstract: Bootstrapping of a pilot sample to aid in sample size determination has been advocated by Bros
& Cowell (1987). In this note it is pointed out that the apparent limitation with this method to assessing
samples up to half the size of the pilot sample is easily overcome by sampling the pilot sample with
replacement rather than without replacement. The use of the modified procedure is discussed in the context
of two-sample and multi-sample comparisons. Evidence is presented which shows that for normally
distributed data even pilot samples as small as 20 will give satisfactory results, but rather larger pilot samples
may be required for extremely non-normal distributions.

Key words: Bootstrapping; Experimental design; Number of replicates; Sample size

INTRODUCTION

In a recent paper in this journal, Bros & Cowell (1987) discussed the problem of
deciding on the appropriate sample size to use in a study to compare two treatments
in terms of the difference between the population means that can be detected and the
effort required to take samples. As part of their discussion, they suggested that
bootstrapping a pilot sample can be used to assess the variation inherent in using
samples of different sizes, but noted that this runs into difficulties with larger sample
sizes because the number of possible samples of size n that can be drawn from a pilot
sample of size P becomes small as n approaches P.
In fact, one of the key aspects of bootstrapping is that samples are taken with
replacement rather than without replacement (Efron, 1979; Efron & Tibshirani, 1986).
Hence, a pilot sample of size P can be bootstrap-sampled to produce samples of any
size, including sizes that are > P (Bickel & Freedman, 1981). This means that if the
method of sample size determination proposed by Bros & Cowell (1987) is modified
so that samples are taken with replacement, then it becomes much more useful.
Indeed, with samples from the highly non-normal distributions that are often encoun-
tered in biological studies the method has the potential to be the most useful available
approach for deciding on sample sizes.

Correspondence address: B. F. J. Manly, Department of Mathematics and Statistics, University of Otago, PO Box 56, Dunedin, New Zealand.

THE PROCEDURE FOR A TWO-SAMPLE COMPARISON

To clarify the procedure that is now proposed, consider the example data that were
used by Bros & Cowell (1987), as shown in Table I. These data are assumed to be a
pilot sample of P = 30 cores from a benthic community, with the observation considered
being the number of species observed in a core. Suppose that it is required to determine
the sample size n that should be used in an experiment that involves comparing the mean
x̄_T of a sample from a treated population with the mean x̄_C of a sample of the same size
from a control population. Furthermore, assume that the comparison between the
means should be in terms of:
(i) a 95% confidence interval x̄_T − x̄_C ± t(0.05, 2n − 2)·S·√(2/n), where t(0.05, 2n − 2)
is the absolute value from the t-distribution with 2n − 2 df that is exceeded with
P = 0.05, and S is the usual pooled within-sample standard deviation; and
(ii) whether or not the statistic t = (x̄_T − x̄_C)/{S·√(2/n)} is significantly different from
0 on a two-sided test.
In this context, the proposed procedure for assessing the use of samples of size n is as follows:
(a) Take a large number B (say 1000) of pairs of samples of size n from the pilot sample
with replacement.
(b) Add a range of effects δ to the values in the second sample in each pair to represent
treatment effects (say δ = 0, 0.25s, 0.5s, 1s, and 2s, where s is the standard deviation
of the pilot sample).
(c) Find the difference x̄_T − x̄_C between the mean of the treated and untreated sample
for each pair of samples, and hence estimate the mean difference that is exceeded
with P = 0.05.
(d) For the means with δ = 0, calculate t = (x̄_T − x̄_C)/{S·√(2/n)} for each pair of
samples, and hence estimate the absolute value of t that is exceeded with P = 0.05.
(e) For each value of δ determine which of the B bootstrap t-statistics are significantly
different from 0 in comparison with the usual t-table, and hence estimate the
probability of getting a significant test result for this value of δ using the observed
proportion significant. (A computational sketch of this procedure is given below.)
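
The following is a minimal sketch of steps (a), (b) and (e) in Python, assuming the Table I pilot sample, B = 1000 pairs of bootstrap samples, a candidate sample size of n = 5, and effects expressed as multiples of the pilot-sample standard deviation. The function name bootstrap_power and all numerical settings are illustrative choices, not part of the original method description.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Pilot sample: numbers of species in 30 cores (Table I).
pilot = np.repeat([0, 1, 2, 3, 4, 5], [1, 2, 7, 10, 7, 3])

def bootstrap_power(pilot, n, delta, B=1000, alpha=0.05):
    # Estimate the probability of a significant two-sided t-test with samples of
    # size n per group when an effect delta is added to the "treated" sample.
    crit = stats.t.ppf(1 - alpha / 2, df=2 * n - 2)   # critical value from the t-table
    n_significant = 0
    for _ in range(B):
        control = rng.choice(pilot, size=n, replace=True)          # sampling WITH replacement
        treated = rng.choice(pilot, size=n, replace=True) + delta  # add the treatment effect
        s2 = (np.var(control, ddof=1) + np.var(treated, ddof=1)) / 2  # pooled variance
        t = (treated.mean() - control.mean()) / np.sqrt(2 * s2 / n)
        n_significant += abs(t) > crit
    return n_significant / B

s = pilot.std(ddof=1)                      # pilot-sample standard deviation
n = 5                                      # candidate sample size per group
for mult in [0.0, 0.25, 0.5, 1.0, 2.0]:    # effects as multiples of s
    print(mult, bootstrap_power(pilot, n, mult * s))

Steps (c) and (d) follow in the same way by recording the B mean differences and t-values rather than only the proportion of significant results.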

"I'A I,H,i'] I

Pilot sample used by Bros & Cowcll (1987) to illustrate their discussion. This is assumed to be counts of
number of bivalve species in 30 cores taken from an intertidal mudflat.

Number of species Frequency

0 1
1 2
2 7
3 10
4 7
5 3
Total 30

The idea is then to use the estimates obtained in (c) to (e) as estimates of what would be
obtained from resampling the population that the pilot sample came from rather than
the pilot sample itself. In fact, this procedure works very well in practice, as is indicated
by the small simulation experiment that is described in the next section of this paper.
Two extra comments can be made about this procedure. First, it can be noted that
the same B pairs of bootstrap samples can be used for each value of δ to reduce the
computing required. Second, it is crucial to define effects in terms of multiples of the
standard deviation. Simulations indicate that the procedure works relatively poorly if
an attempt is made to define effects absolutely, since the level of variation from pilot
sample to pilot sample is not controlled for. In fact, this is hardly surprising. For
example, suppose that a population has a SD of 1.0 and a pilot sample taken from this
population has a SD of 0.5. Then it would be unreasonable to expect to estimate the
distribution of any statistic that depends on the population standard deviation with any
accuracy by bootstrapping. However, the distribution of t = (x̄_T − x̄_C)/{S·√(2/n)} may
still be well determined in the null hypothesis case by bootstrapping, since it is a function
of the pilot sample standard deviation rather than the population standard deviation.
Furthermore, the distribution of t may be well determined by bootstrapping if the null
hypothesis is not true, provided that the expected value of the difference x̄_T − x̄_C is a
function of the pilot sample standard deviation rather than the population standard
deviation.
There is a general principle here that has been discussed at greater length by Hall &
Wilson (1991): bootstrapping should use pivotal statistics whenever possible.

A TWO-SAMPLE SIMULATION EXPERIMENT

Suppose that the two populations being compared have the following true distribu-
tions of species counts, where P(i) means the probability of observing i species:
Population 1: P(0) = 0.033, P(1) = 0.067, P(2) = 0.233, P(3) = 0.333, P(4) = 0.233,
P(5) = 0.100
Population 2: P(1) = 0.033, P(2) = 0.067, P(3) = 0.233, P(4) = 0.333, P(5) = 0.233,
P(6) = 0.100
Then, the expected frequencies in a sample of 30 from the first population are equal to
the observed frequencies for Bros & Cowell's (1987) pilot sample (Table I), while the
distribution for Population 2 is the distribution for Population 1 but with all the values
shifted up by 1. As a result, the mean values for Populations 1 and 2 (2.97 and 3.97, respec-
tively) differ by exactly 1, while the SD values are both 1.20.
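
As a quick illustrative check only, the stated means and standard deviations can be reproduced directly from the two discrete distributions (the probabilities as printed sum to 0.999, so they are renormalised here):

import numpy as np

values = {1: np.arange(0, 6), 2: np.arange(1, 7)}     # possible species counts
probs = np.array([0.033, 0.067, 0.233, 0.333, 0.233, 0.100])
probs = probs / probs.sum()                           # renormalise the printed probabilities

for pop, v in values.items():
    mean = np.sum(v * probs)
    sd = np.sqrt(np.sum(probs * (v - mean) ** 2))
    print(pop, round(mean, 2), round(sd, 2))          # gives roughly 2.97/3.97 and 1.20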
Now, suppose that a pilot sample of size 30 is taken from Population 1, and bootstrap
sampled to see how well a sample of size 5 from each population can estimate the mean
difference between the two populations, and to determine the probability that the
difference between the two sample means will be significantly large. Three specific
questions can be asked. First, how does the distribution of mean differences obtained
from bootstrap-sampling compare with the distribution of mean differences obtained by
resampling the original populations? Second, how well does the distribution of
t = (x̄_T − x̄_C)/{S·√(2/n)} that is estimated by bootstrapping agree with the distribution
of this statistic that is obtained by resampling the original populations? Third, how does
the power to detect the mean difference between the two populations that is estimated
by bootstrapping compare with the actual power?
The pilot sample of size 30 drawn from Population 1 gave the values: 0, with
frequency 1; 1, with frequency 2; 2-4, each with frequency 8; and 5, with frequency 3.
This pilot sample was bootstrap-sampled 1000 times as described in the previous
section (i.e., with 1000 pairs of samples) with values in treated samples increased by 1
to mirror the difference between the control and treated population. The resulting
estimated cumulative distribution of the difference between the sample means is shown
in Fig. 1. This figure also shows the distribution of the mean difference that was obtained
by resampling the two populations. Clearly, the distribution from bootstrap-sampling
and the distribution from population-sampling are very similar.
Fig. 2 shows how the bootstrap distribution of the statistic t = (x̄_T − x̄_C)/{S·√(2/n)}
(with a treatment effect of 1 imposed) compares with the distribution that is obtained
by sampling the original populations. Again, the two distributions agree very closely.

[Figure 1 here: cumulative distributions of sample mean differences for samples from the populations and samples from the pilot sample; x-axis, sample mean difference; y-axis, cumulative frequency out of 1000.]

Fig. 1. Cumulative distribution of the difference between means of samples of size 5 taken from two populations
as estimated by bootstrapping a pilot sample of 30, compared with the distribution found from resampling the
populations. The two populations have the same standard deviation but a mean difference of 1. Both methods for
estimating differences used 1000 pairs of samples.

[Figure 2 here: cumulative distributions of the t-statistic for samples from the populations and samples from the pilot sample; x-axis, t-statistic; y-axis, cumulative frequency out of 1000.]

Fig. 2. Cumulative distribution of the t-statistic for comparing two samples of size 5 as estimated by bootstrap-
ping and by resampling the populations.

Finally, it can be noted that 192 of the 1000 bootstrap pairs of samples gave an
absolute t-statistic greater than the critical value of 2.31 that is obtained from the t-table
for a two-sided test at the 5% level of significance, with 8 df. Hence, the boot-
strap estimate of the power of the t-test with sample sizes of 5 is 0.192. Since this is
a simple proportion, the SE value associated with the estimate is
√{0.192(1 − 0.192)/1000} = 0.012, and an approximate 95% confidence interval
for the true power (the estimate ± 1.96 SE) is 0.168-0.216. This can be compared with
the result found by taking 1000 pairs of samples from the true populations, which
resulted in an estimated power of 0.202, with 95% confidence interval 0.177-0.227.
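
The standard error and confidence interval quoted above follow from treating the estimated power as a binomial proportion; a minimal sketch of the calculation is given here (the function name power_ci is illustrative):

import math

def power_ci(k, B, z=1.96):
    # Approximate 95% confidence interval for a power estimated as a simple
    # proportion: k significant results out of B bootstrap pairs of samples.
    p = k / B
    se = math.sqrt(p * (1 - p) / B)
    return p, p - z * se, p + z * se

print(power_ci(192, 1000))   # about (0.192, 0.168, 0.216), as in the text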
Note that the confidence limit obtained from bootstrap-sampling is approximate for
three reasons. First, the standard error of the estimated power is only an estimate of
the true standard error. Second, the normal approximation to the binomial distribution
of the proportion is being assumed. Third, it is being assumed that any effects due to
the particular pilot sample being used are negligible.
This example indicates that for the populations being considered bootstrap-sampling
of a pilot sample of 30 gives very nearly the same result as sampling the populations,
at least for samples of size 5. In fact, a more extensive study (Manly, 1991) shows that
the same good agreement is found for sample sizes of 3-40 for these populations, and
presumably for larger sample sizes as well. (A comparison between bootstrap- and
population-sampling for sample sizes of 2 was complicated by the fact that the
probability of getting an estimated SD of 0 is not negligible.) The more extensive study
involved examining the results obtained with 100 different pilot samples and it was
found that the good agreement shown in Figs. 1 and 2 for bootstrap- and population-
sampling was not the result of a lucky choice of pilot sample.
Simulations of bootstrapping for two-sample comparisons have also been carried out
for other types of population. These simulations show that adequate pilot sample sizes
depend on the shape of the distributions being sampled. For distributions that are close
to normal (such as that shown in Table I), a pilot sample of size 20 gives good results;
for distributions that are distinctly non-normal (e.g., exponential distributions) a pilot
sample of size 40 gives good results; and for the extremely non-normal distributions that
often occur in a biological context (e.g., counts of very clustered organisms, with a high
proportion of zeroes) a pilot sample size in the hundreds may be necessary.
The simulations also indicate that large pilot sample sizes are required for extremely
non-normal distributions mainly to get good estimates of power values in the range
0.1-0.9. However, a pilot sample size of the order of 100 should be sufficient to
approximate the distribution of the t-statistic when the null hypothesis is true.

THE COMPARISON OF SEVERAL SAMPLES

The bootstrap procedure can be used equally well for the design of studies to compare
several sample means. For example, consider an experiment that may be of interest to
government regulatory agencies for assessing the effect of discharges from a sewage
treatment plant on the growth of organisms. This involves measuring growth rates for
n replicates for each of several levels of discharge consisting of a certain percentage of
treated sewage mixed with seawater. Thus, the treatments might be 0% sewage (the
control), 5% sewage, 10% sewage, 25% sewage, 50% sewage, and 75% sewage. The
results are then analysed by comparing the mean for each of the treatments involving
sewage with the control mean using Dunnett's (1964) method, so that there is a probability
of 0.05 of declaring one or more results significant when the sewage has no effect.
In this situation, there might be some concern about using Dunnett's table for critical
values of t-statistics if the distribution of growth rates appears to be non-normal since
the critical values were computed on the assumption of a normal distribution. Also,
tables are not currently available for determining the power associated with different
numbers of replicates.
However, the bootstrap procedure can be used in an obvious way to estimate the
properties of different sample sizes. Thus, if a pilot sample with standard deviation s
is available, then the following procedure can be repeated a large number of times (e.g.,
1000 times) to estimate the critical values for t-statistics to use for tests, and to estimate
power values:
(a) Take six bootstrap samples of size n from the pilot sample to give a sample for each
treatment level.
SAMPLE SIZE DETERMINATION 195

(b) Add P% of a sewage effect δ to the observations for the sample with P% sewage,
for a suitable range of "sewage" effects δ expressed in terms of the standard
deviation s (using the same bootstrap samples for each value of δ).
(c) Calculate t-statistics of the form t_i = (x̄_i − x̄_0)/{S·√(2/n)} to compare the means of
the treated samples with the control mean, where S is the pooled sample standard
deviation. Hence, determine which differences, if any, are significant using
Dunnett's table of critical values.
From the results obtained in this way it is straightforward to estimate the distribution
of the t-statistics when the null hypothesis of no sewage effect is true, and hence estimate
the critical value such that the probability of it being exceeded by any of the ti values
is 0.05. This value can then be used in place of the critical value tabulated by Dunnett
if this seems necessary. It is also straightforward to estimate the probability of getting
a significant result for any particular percentage of sewage with different levels of effect.
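
A minimal sketch of this multi-sample version is given below, assuming the Table I counts as the pilot sample, a linear dose-response (P% sewage receives P% of the maximum effect, expressed as a multiple of the pilot-sample standard deviation), and illustrative settings throughout; the function name max_abs_t is not from the original description. The power computed here is the probability that at least one treated level is declared significantly different from the control; probabilities for individual levels can be extracted in the same way.

import numpy as np

rng = np.random.default_rng(2)

# Illustrative pilot sample (here the Table I counts) and treatment levels.
pilot = np.repeat([0, 1, 2, 3, 4, 5], [1, 2, 7, 10, 7, 3])
percents = np.array([0, 5, 10, 25, 50, 75])      # % sewage; the first level is the control

def max_abs_t(pilot, n, delta, B=1000):
    # For each of B bootstrap replicates, draw one sample of size n per treatment
    # level, add P% of the maximum effect delta*s to the sample with P% sewage,
    # and return the B values of max|t_i| over the five treated-versus-control tests.
    s_pilot = pilot.std(ddof=1)
    out = np.empty(B)
    for b in range(B):
        samples = [rng.choice(pilot, size=n, replace=True) + delta * s_pilot * p / 100
                   for p in percents]
        s = np.sqrt(np.mean([np.var(x, ddof=1) for x in samples]))  # pooled within-sample SD
        x0 = samples[0].mean()                                       # control mean
        t = [(x.mean() - x0) / (s * np.sqrt(2 / n)) for x in samples[1:]]
        out[b] = np.max(np.abs(t))
    return out

n = 5
crit = np.quantile(max_abs_t(pilot, n, delta=0.0), 0.95)   # bootstrap critical value (null case)
power = np.mean(max_abs_t(pilot, n, delta=1.0) > crit)     # estimated power for a maximum effect of 1*s
print(crit, power)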
Simulations described in detail elsewhere (Manly, 1991) show that this procedure
works extremely well with normally distributed data and a pilot sample of 20 or more, while
for moderately non-normal data a larger pilot sample of 40 will be satisfactory. How-
ever, for extremely non-normal data a pilot sample size in the hundreds may be needed
to estimate power values in the range 0.1-0.9. Essentially, the conditions on the pilot
sample size are the same as for the two-sample comparison.
Note that the particular assumption made about the dose-response effect in (b)
above, that P% sewage gives P% of the maximum response, is not an inherent part of
the bootstrap procedure and could be modified easily.

DISCUSSION

It might be argued that it will often be difficult to take a pilot sample that is large
enough for sample design. But, as pointed out by Bros & Cowell (1987), it may be
possible to combine observations from several different samples after first adjusting the
values in each sample to a common mean, and possibly a common standard deviation.
Bootstrapping can then be carried out on the pooled sample.
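
As an illustration of this pooling step, the following sketch shifts each available sample to the grand mean and, optionally, rescales each to a common standard deviation before concatenating; the function name pool_pilot_samples and the choice of the grand mean as the common mean are illustrative assumptions.

import numpy as np

def pool_pilot_samples(samples, common_sd=None):
    # Shift each sample so that all have the grand mean; optionally rescale each
    # sample so that all have the standard deviation common_sd.
    grand_mean = np.mean(np.concatenate(samples))
    adjusted = []
    for x in samples:
        y = x - x.mean() + grand_mean
        if common_sd is not None:
            y = (y - grand_mean) * (common_sd / x.std(ddof=1)) + grand_mean
        adjusted.append(y)
    return np.concatenate(adjusted)

The pooled array returned in this way can then be bootstrap-sampled exactly as a single pilot sample would be.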
Another potential problem is that different treatments may give different standard
deviations for observations, rather than just a change in the mean. A solution in that
case is to include a change in the standard deviation in the treatment effect that is applied
in the bootstrapping. It will then still be possible to estimate the distributions of test
statistics, power, etc. In fact, the great value of the bootstrap procedure is that this type
of effect is easy to allow for.
An obvious third drawback with bootstrapping is the computing involved or, what
may be more important, the need for a special computer program to be written to do
the calculations. Nothing much can be said about this, except to note that with small
samples it is possible to do the calculations using a spreadsheet program like
LOTUS 123 on a microcomputer.

Before concluding, it is appropriate to sound a note of caution. Bootstrapping does
not always work and it needs to be tested before it is accepted as reliable for a new
application. This raises the question of how researchers should proceed if, for example,
they have a pilot sample of size 100 from a distribution that seems to be very non-normal
and want to know whether this is large enough to use for assessing sample size
requirements for a further study.
What can then be done, at the expense of some computing, is double bootstrapping,
which amounts essentially to carrying out simulations in the manner described in this
paper. The available pilot sample can be taken as the population being sampled. This
is sampled with replacement to obtain (say) 100 further pilot samples of size 100, and
each of these pilot samples is bootstrap sampled (say) 1000 times to produce bootstrap
estimates of the characteristics of samples of different sizes n. The estimates obtained
can then be compared with what is obtained from resampling the original pilot sample.
If the results are similar for all 100 pilot samples then it seems that the original pilot
sample of size 100 is probably satisfactory for bootstrapping. However, if the bootstrap
estimates vary considerably from pilot sample to pilot sample, then it seems that a pilot
sample of size 100 is not satisfactory and the results obtained from the original pilot
sample must be treated with some reservations. They may still, of course, be better than
anything else that is available.
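
A minimal sketch of this double-bootstrap check follows, with the bootstrap estimate of interest passed in as a function (for example, the illustrative bootstrap_power function sketched earlier, with a fixed n and effect). The name double_bootstrap is an assumption, and the default of 100 pseudo pilot samples matches the numbers used in the text.

import numpy as np

def double_bootstrap(pilot, estimate_fn, n_pilots=100, seed=3):
    # Treat the available pilot sample as the population: draw n_pilots further
    # "pilot samples" of the same size with replacement, and apply the bootstrap
    # estimate of interest (for example, an estimated power) to each of them.
    rng = np.random.default_rng(seed)
    estimates = [estimate_fn(rng.choice(pilot, size=len(pilot), replace=True))
                 for _ in range(n_pilots)]
    return np.array(estimates)

# If the estimates vary little from pseudo pilot sample to pseudo pilot sample, the
# original pilot sample is probably large enough; if they vary considerably, results
# from the original pilot sample should be treated with reservations.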

REFERENCES

Bickel, P.J. & D.A. Freedman, 1981. Some asymptotic theory for the bootstrap. Ann. Statist., Vol. 9,
pp. 1196-1217.
Bros, W.E. & B.C. Cowell, 1987. A technique for optimizing sample size (replication). J. Exp. Mar. Biol.
Ecol., Vol. 114, pp. 63-71.
Dunnett, C.W., 1964. New tables for multiple comparisons with a control. Biometrics, Vol. 20, pp. 482-491.
Efron, B., 1979. Bootstrap methods: another look at the jackknife. Ann. Statist., Vol. 7, pp. 1-26.
Efron, B. & R. Tibshirani, 1986. Bootstrap methods for standard errors, confidence intervals, and other
measures of statistical accuracy. Statist. Sci., Vol. 1, pp. 54-77.
Hall, P. & S.R. Wilson, 1991. Two guidelines for bootstrap hypothesis testing. Biometrics, Vol. 47,
pp. 757-762.
Manly, B.F.J., 1991. On bootstrapping for sample design. Res. Rep. 1, 1991, Department of Mathematics
and Statistics, University of Otago, Dunedin, New Zealand.
