An Empirical Test of Bootstrapping As A Method For Assessing Confidence in Phylogenetic Analysis

Syst. Biol.
42(2):182-192, 1993
AN EMPIRICAL TEST OF BOOTSTRAPPING AS A

METHOD FOR ASSESSING CONFIDENCE IN
PHYLOGENETIC ANALYSIS
DAVID M. HILLIS AND JAMES J. BULL

Department of Zoology, The University of Texas, Austin, Texas 78712, USA
Abstract.—Bootstrapping is a common method for assessing confidence in phylogenetic anal-

yses. Although bootstrapping was first applied in phylogenetics to assess the repeatability of a
given result, bootstrap results are commonly interpreted as a measure of the probability that a
phylogenetic estimate represents the true phylogeny. Here we use computer simulations and a
laboratory-generated phylogeny to test bootstrapping results of parsimony analyses, both as
measures of repeatability (i.e., the probability of repeating a result given a new sample of
characters) and accuracy (i.e., the probability that a result represents the true phylogeny). Our
results indicate that any given bootstrap proportion provides an unbiased but highly imprecise
measure of repeatability, unless the actual probability of replicating the relevant result is nearly
one. The imprecision of the estimate is great enough to render the estimate virtually useless as
a measure of repeatability. Under conditions thought to be typical of most phylogenetic analyses,
however, bootstrap proportions in majority-rule consensus trees provide biased but highly con-
Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

servative estimates of the probability of correctly inferring the corresponding clades. Specifically,
under conditions of equal rates of change, symmetric phylogenies, and internodal change of
<20% of the characters, bootstrap proportions of >70% usually correspond to a probability of
>95% that the corresponding clade is real. However, under conditions of very high rates of
internodal change (approaching randomization of the characters among taxa) or highly unequal
rates of change among taxa, bootstrap proportions >50% are overestimates of accuracy. [Boot-
strapping; accuracy; repeatability; phylogeny; parsimony; precision; statistical analyses; simu-
lations.]
Felsenstein (1985) proposed using the phylogeny, but the phylogeny that would
statistical test of bootstrapping (Efron, 1979, be estimated on repeated sampling of many
1982,1987) to estimate confidence limits of characters from the underlying pool of
internal branches in phylogenetic analy- characters," many workers treat bootstrap
ses. He suggested that characters in a ma- results as statements about the probability
trix of taxa x characters can be sampled that a particular clade is a real historical
with replacement (bootstrapped) to create group. However, at best bootstrap esti-
many new matrices of the same size as the mates should provide an indication only
original, each of which can be analyzed to of the degree of support of a particular
find the best-fit tree (e.g., the shortest tree technique for a particular clade. In cases in
in the case of parsimony analysis). Felsen- which the given technique is positively
stein suggested that the results from nu- misleading about phylogeny (e.g., Felsen-
merous bootstrap replicates can then be stein, 1978), we may be quite confident that
combined in a majority-rule consensus tree the technique supports a clade, even
to assess confidence in particular internal though the probability that the clade is real
branches of the tree. An investigator can may be quite low.
then evaluate whether a particular tech- The first goal of this study was to ex-
nique provides significant support of a par- amine the effectiveness of bootstrapping
ticular a priori phylogenetic hypothesis. in assessing the probability that a given
The use of bootstrapping is increasing phylogeny would be obtained upon re-
in systematic studies. Although Felsen- peated sampling of different sets of char-
stein (1985) suggested that "bootstrapping acters from the underlying distribution of
provides us with a confidence interval characters. The second goal was to examine
within which is contained not the true the relationship, if any, between bootstrap
182
1993 BOOTSTRAPPING IN PHYLOGENETIC ANALYSIS 183
n estimates of phylogeny
Precision
n estimates of phylogeny
An infinite set of
pseudosamples

FIGURE 1. The relationships among a true phylogeny, samples of characters drawn from the taxa, bootstrap
pseudosamples drawn from an initial sample, and the concepts of precision, accuracy, and repeatability.
proportions and the probability of obtain- from an infinite set of pseudosamples. The
ing a true clade in parsimony analyses un- sampling variance of bootstrap propor-
der a variety of tree topologies and rates tions follows the binomial distribution,
of change. such that a2 = P(l - P)/n, where P is the
bootstrap proportion and n is the number
REPEATABILITY, ACCURACY, AND PRECISION of replications (Hedges, 1992). Hedges
Before evaluating the performance of (1992) called this relationship "accuracy"
bootstrapping, it is necessary to distin- with respect to phylogenetic analyses, but
guish among the terms "repeatability," this is not within the usual meaning of this
"accuracy," and "precision." The meaning term (and he no longer follows this ter-
of these terms as they relate to bootstrap- minology, e.g., Hedges and Maxon, 1993).
ping is illustrated in Figure 1. Given an Instead, we use accuracy to refer to the
actual phylogeny of organisms, one could probability that a specified group is con-
sample a set of characters among the taxa tained in the true phylogeny. Although
and then estimate the phylogeny from this the use of bootstrap proportions as a mea-
sample. Bootstrapping is then used to cre- sure of accuracy has not been explicitly
ate a set of pseudosamples by drawing defended in print, they are commonly
characters from this initial sample with re- treated as such in the framework of hy-
placement. Phylogenetic analyses of these pothesis testing. In his original paper, Fel-
pseudosamples support a set of relation- senstein (1985) suggested that the boot-
ships for various clades; potential clades strap proportions could be used as a
may appear in all, some, or none of the measure of what we will call repeatability:
analyses based on the pseudosamples. We the probability that a specified group will
will refer to the proportions at which each be found in an analysis of an independent
clade is represented in these analyses as sample of characters. If a group of interest
the bootstrap -proportions (sometimes mis- is found in a phylogenetic analysis, an es-
leadingly called "bootstrap P values"). The timate of the repeatability of this result
precision of bootstrapping is the degree to would tell us how likely we would be to
which bootstrap proportions based on a find the same result in an identical analysis
finite set of pseudosamples are expected to of a new sample of characters. Thus, as
match the values that would be obtained used in this paper, precision relates to the
184 SYSTEMATIC BIOLOGY VOL. 42
2 3 4 5 6 7 8 9 1992). Hedges (1992) recommended per-

b c d e f g h i forming 400-2,000 bootstrap replications
"if one wishes the expectation to be that
1 m
j k
the 95% confidence range is ±1% of the
n 0
BP" (where BP = bootstrap P value, or
bootstrap proportion of our terminology).
However, to what does the "95% confi-
(a)
dence range" apply? The findings of Hedg-
es relate not to accuracy or repeatability
but to the precision of bootstrapping: the
1 2 3 4 5 6 7 8 9
number of pseudosamples needed to ob-
a b c d e f g h i tain an estimate (of specified precision) of
the bootstrap proportions that would be
obtained after an infinite number of pseu-
dosamples. However, if the bootstrap pro-
portions are not good estimates of accu-
racy, repeatability, or some other useful

measure, it makes little sense to worry too
much about precision. A highly precise es-
timate of an inaccurate measure is of little
importance. Therefore, we will focus on
the interpretation of the bootstrap propor-
tions: how do they perform as measures of
(b) either accuracy or repeatability?
METHODS
To evaluate bootstrap proportions as
measures of accuracy and repeatability,
knowledge of the true phylogeny and abil-
ity to take true replicate samples are nec-
essary. We have used two systems to meet
FIGURE 2. Simulated phylogenies. The letters cor- these requirements: simulations and a lab-
respond to branch lengths shown in Tables 1 and 2. oratory-generated phylogeny of viruses.
(a) Phylogeny for simulations 1-9 for nine taxa (Table
1). (b) Phylogeny for simulation 10 for nine taxa (Ta- Simulations
ble 1). (c) Phylogeny for simulations 11-21 for four
taxa (Table 2). Ten nine-taxon phylogenies were sim-
ulated 100 times each (Figs. 2a, 2b; Table
1), and 11 four-taxon phylogenies were
correspondence between multiple sets of simulated 1,000 or 10,000 times each (Fig.
bootstrap pseudosamples taken from the 2c; Table 2). In each simulation, the ances-
same initial sample, accuracy is the prob- tral node was assigned a DNA sequence
ability that a given result represents the with equal proportions of all four nucle-
true phylogeny, and repeatability is the otides, from 50 to 1,000 nucleotides in
probability that a given result will be found length. Branch lengths of the trees were
again using a subsequent sample of char- assigned values between 1 and 10 units
acters. inclusive, with a unit equal to an assigned
Of these concepts, only the precision of rate of change. The rate of change along a
bootstrapping has been examined to date unit branch length (R) was varied from 4%
with respect to phylogenetic analysis; the to 75%; therefore, the instantaneous rates
precision of any given bootstrap analysis of change (r) varied from 5.5% to infinity,
has a simple relationship to the number of because R = 0.75(1 - e~r). The ratio of
pseudosamples (see above and Hedges, transitions: transversions was varied from
TABLE 1. Relative branch lengths and rates of mutation for simulated phylogenies of nine taxa. These
simulations were each replicated 100 times, and 100 bootstrap pseudoreplicates were generated for each
replicate.
Simu- Mutation rate

lation To- Branch lengths
num- polo- Transi- Trans-
ber gya a b c d e f g h i j k 1 m n 0 tions versions
1 a 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.020 0.020
2 a 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.050 0.050
3 a 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.075 0.075
4 a 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.100 0.100
5 a 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.050 0.005
6 a 3 1 3 1 3 1 3 1 3 1 1 1 1 1 1 0.050 0.005
7 a 3 1 5 1 5 1 5 1 5 1 1 1 1 1 1 0.050 0.005
8 a 3 1 7 1 7 1 7 1 7 1 1 1 1 1 1 0.050 0.005
9 a 3 1 10 1 10 1 10 1 10 1 1 1 1 1 1 0.050 0.005
10 b 8 7 6 5 4 3 2 1 1 1 1 1 1 1 1 0.050 0.005
1
See Figure 2.

1:2 (equal probability of all base changes) restriction sites were mapped, of which all
to 10:1 (strong bias in favor of transitions). of the deletions and 199 restriction sites
In three sets of simulations (numbers 15- were variable among the lineages. Previ-
17, Table 2), the rate of change and the ous study (Hillis et al., 1992) has shown
transition: transversion ratio were set to that standard methods of tree construction
produce random DNA sequences for each using this matrix of restriction sites pro-
taxon (i.e., no phylogenetic signal). The duce the correct (known) branching rela-
21,000 simulated phylogenies were gen- tionships.
erated with a program written in C (TREES,
available upon request) and compiled on Bootstrapping
a Sun SPARC station 1 + . Each of the 21,000 matrices from the sim-
ulations was analyzed by bootstrapping
Bacteriophage Phytogeny with 100 replicates using the software
Generation of the phylogeny of 17 bac-
teriophage was described by Hillis et al.
(1992). The design of the experiment was TABLE 2. Simulations based on four-taxon trees
identical to the topology in Figure 2a, with with equal branch lengths (Fig. 2c). These simulations
relative branch lengths equal to those in were each replicated 1,000 times, except for simula-
simulations 1-4 (Table 1). The ancestral tion 21, which was replicated 10,000 times, and 100
bootstrap pseudoreplicates were drawn from each
taxon was wild-type bacteriophage T7 (with replicate.
a total genome of 39,937 base pairs [bp] of
DNA). The descendant lineages were Mutation rate
grown in the presence of a mutagen (ni- Simulation Tran- Trans- No.
trosoguanidine), which produces an excess number sitions versions characters
of GC -> AT changes, although all com- 11 0.05 0.05 50
binations of mutations are possible. The 12 0.10 0.10 50
phage at each node was obtained from a 13 0.15 0.15 50
single plaque to ensure that all descen- 14 0.20 0.20 50
15 0.25 0.50 50
dants were derived from the same ancestor 16 0.25 0.50 200
(see Hillis et al., 1992, for further details). 17 0.25 0.50 1,000
The entire genomes of the phages at each 18 0.05 0.05 100
node of the phylogeny were mapped for 19 0.05 0.05 200
34 restriction enzymes (with 4-, 5-, and 6-bp 20 0.05 0.05 1,000
21 0.15 0.30 50
recognition sites). Three deletions and 290
(a) was used in the bootstrap runs to ensure

that all most-parsimonious trees were
found. We also found the most-parsimo-
nious trees for each matrix, so that the
bootstrap proportions could be compared
with the probability of finding the same
a. result through repeated sampling of char-
to acters from the underlying distribution
(repeatability). In addition, the collection
of bootstrap estimates was compared with
the known phylogenies to study the rela-
tionship between bootstrap proportions
and the probability of correctly estimating
phylogeny (accuracy).
To produce data matrices of phage re-
striction-site characters that were compa-
rable to the simulations, we jackknifed

(b) (sampled characters without replacement)
the complete matrix from the T7 phage
Q. study (Hillis et al., 1992) to produce 500
(0 subsamples of 50 characters each. Each of
these matrices was then analyzed by boot-
strapping as above so that the bootstrap
proportions could be compared with the
probability of obtaining a true clade.
Percentage of tree A
BOOTSTRAP ESTIMATES AS MEASURES OF
REPEATABILITY
FIGURE 3. Bootstrap proportions as estimates of re-
peatability in phylogenetic analysis, (a) Results of 1,089 How well do bootstrap proportions rep-
bootstrap analyses (100 pseudoreplicates each) on resent the proportions that would be ob-
1,089 actual replicates from simulation 21 (Table 2). tained from the independent samples that
Results shown are the proportions of the initial tree
(tree A, which is the correct tree) found in each of the bootstrap aspires to represent? This
the 1,089 analyses, (b) Proportions of solution A (the question was evaluated directly. Bootstrap
correct tree) in 100 samples of 100 actual replicates of proportions were calculated for a single
simulation 21. The probability of estimating tree A model phylogeny (simulation 21, Table 2).
from any given replicate is approximately 70%; sam-
ples of 100 actual replicates produce estimates of this Each bootstrap proportion was based on
value of 60-80%. In contrast, estimates of repeatability 100 pseudoreplicates of a data matrix. The
based on 100 bootstrap pseudoreplicates range from distribution of this bootstrap proportion
0% to 100%, depending on the initial sample exam- statistic is shown in Figure 3a. For com-
ined.
parison, the distribution of the analogous
sample proportion was also calculated. A
sample proportion is the fraction of correct
package PAUP 3.0q (Swofford, 1990). With reconstructions from 100 independent data
100 replicates, the sample variance of the matrices (without any pseudoreplication).
bootstrap proportions (BP) ranges from a The distribution of the sample proportion
maximum of 0.0025 (at 50% BP) to a min- is shown in Figure 3b.
imum of 0 (at 0 and 100% BP) (Hedges, Comparison of these two distributions
1992). Because the mean bootstrap propor- reveals that the process of bootstrap resam-
tions reported in this study are based on pling is not the same as repeated, inde-
> 100 replicates each, the standard error of pendent sampling of data. The means of
the means is no greater than 0.005 for any both distributions are similar, but the vari-
value. The branch-and-bound algorithm ance of the bootstrap distribution is much
greater. The larger variance of the boot-

strap proportion arises because of the bias
inherent in the sample used for the 100
pseudoreplicates in each bootstrap pro-
portion. Bootstrapping as a general statis-
tical procedure provides a biased estimate
of the mean, and consequently, bootstrap-
ping is typically used to estimate the vari- Bootstrap proportions
ance. Yet in the present case, which is one 2.2

of binomial sampling, the variance is not 2 O 4% a
n 10%
independent of the mean, further compli- 1.8'
A 15%
cating the use of bootstrapping. 1.6' O 20% A
What do these distributions indicate 1.2 0

o 9
about repeatability? From our definition, 1
O
o
B
.8 ft
the repeatability of recovering the correct .6' a O
tree is approximately 70% (the mean), and

both distributions give similar values. (b)

Bootstrap proportions
However, the variance of the bootstrap
FIGURE 4. (a) Relationship between bootstrap pro-
proportion is so large that any single value portions and the probability of the corresponding
is unreliable: we could easily obtain boot- clade being correct at various rates of internodal
strap proportions near 0 or 100. Use of a change (shown in inset) in nine-taxon simulations
single bootstrap proportion as a measure (1-4). The diagonal line indicates direct correspon-
dence between x and y axes, (b) The average number
of repeatability is therefore unreliable in of clades found within given bootstrap proportions
this case. in simulations 1-4 (Table 1).
We repeated similar analyses for each of
the simulations in Tables 1 and 2, and the
bootstrap proportions are unreliable esti- However, if a technique only provides an
mates of repeatability unless repeatability unambiguous answer when it is likely to
is near 100% (in which case both distri- be correct, then the technique may be ac-
butions cluster near the right-hand side of curate even though the result is obtained
the scale). Unfortunately, a value of 100% in only a small percentage of trials. Al-
in a bootstrap analysis is not informative though bootstrapping was introduced as a
about repeatability because such an esti- measure of repeatability, bootstrap results
mate is likely to arise when the true re- commonly are interpreted as a measure of
peatability value is considerably lower (e.g., accuracy (e.g., in a framework of hypoth-
see Fig. 3). When repeatability is apprecia- esis testing). Therefore, we examined the
bly less than 100%, the size of the initial possibility of a relationship between boot-
sample of characters has little effect on the strap proportions and the probability of
quality of the estimates. However, increas- obtaining a correct clade.
ing sample size does tend to force the true The relationship between bootstrap pro-
repeatability toward 100%. portions and probability of the corre-
sponding clade being correct is shown for
REPEATABILITY VERSUS ACCURACY simulations 1-4 in Figure 4a. This figure
Figure 3 indicates that bootstrap pro- shows that estimated internal branches
portions are not good estimates of the re- with bootstrap proportions above 70% rep-
peatability of a given phylogenetic anal- resent true clades over 95% of the time (for
ysis. However, under most circumstances the conditions tested in these simulations).
systematists are interested in accuracy more In fact, the bootstrap proportions are lower
than repeatability. A phylogenetic analysis than the probability of being correct for
may be highly repeatable but always pro- all estimates above 50% in these simula-
duce the wrong answer (e.g., the conditions. Although the probability of char-
tions described by Felsenstein, 1978). acter change (over the tested range of 4-
true clade in all these simulations, and

>95% of the estimated clades with boot-
strap confidence limits above 70% were
correct (Fig. 4a). Therefore, under these
conditions, bootstrapping provides a bi-
ased but highly conservative estimate of
accuracy.
Simulations are sometimes criticized be-
cause of the numerous simplifying as-
FIGURE 5. Relationship between bootstrap pro-
portions and the probability of the corresponding sumptions they must incorporate (Hillis et
clade being correct for the laboratory-generated phy- al., 1993). Experimental phylogenies of ac-
logeny of nine taxa derived from bacteriophage T7. tual organisms, generated in the laboratory
The diagonal line indicates direct correspondence be- so that the actual phylogeny is known, are
tween x and y axes. a step closer to reality. We have generated
such a phylogeny (using bacteriophage T7)
20% change between nodes) has little effect for conditions similar to simulations 1-4
on the relationship between these two (Hillis et al., 1992). We jackknifed the com-

variables, it does affect the likelihood of plete data matrix for this actual phylogeny
obtaining a bootstrap proportion within a to produce sample matrices of 50 characters
given range (Fig. 4b). For instance, one is each; these matrices provide a comparison
twice as likely to obtain a clade with a boot- for the simulations. Figure 5 shows that
strap proportion of at least 90% if the prob- the relationship between bootstrap pro-
ability of change between nodes is 10% portions (bootstrapping done on the sam-
than if it is 4% or 20% (Fig. 4b). Nonethe- ple matrices) and the probability of an es-
less, almost every internal branch with a timated branch being correct is virtually
bootstrap proportion of >80% defined a the same for the T7 phylogeny as it is in
TOO
10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100
(a) Bootstrap proportions (c) Bootstrap proportions
o correct tree o correct tree

a wrong tree 1 • wrong tree 1
A wrong tree 2 A wrong tree 2
10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100
Bootstrap proportions Bootstrap proportions
(b) (d)
FIGURE 6. Results from simulations 11 (a, b) and 15 (c, d) with four taxa. The relationship between bootstrap
proportions and accuracy (for simulation 11) is shown in (a); the shape of this curve can be understood by
comparing the bootstrap results for each of the three possible topologies in (b). In contrast, simulations in
which the characters are randomized among taxa by high rates of change produce bootstrap proportions in
which any tree is equally likely to be supported (c, d).
20 30 40 50 60 70 80 90
(a) Bootstrap proportions
FIGURE 8. The effect of increasing size of the data

matrix on the relationship between bootstrap pro-
portions and the probability of the corresponding
clade being correct (from simulations 11, 18, and 19,
all with four taxa).
tions are highly skewed in opposite direc-

tions, so that almost all the observations at
the high end of the scale (>70%) corre-
spond to the correct tree, and almost all

FIGURE 7. Bootstrap results from simulation 19 (Ta- the observations at the low end of the scale
ble 2) with four taxa. With a relatively large number (<30%) correspond to one of the wrong
of characters and appropriate rates of internodal trees. Compare this situation with that in
change, all clades with bootstrap proportions above Figure 6c: if characters are randomized
50% are correct (a). This occurs because of the non-
overlapping ranges of bootstrap proportions for cor-
among taxa by high rates of change, all
rect and incorrect trees (b). bootstrap frequencies are equally likely to
represent any of the three topologies (Fig.
6c), although the number of trees with high
the simulations. Almost every internal bootstrap proportions will be relatively low
branch with a bootstrap proportion above (Fig. 6d). At appropriate rates of change
70% defines a true clade, whereas fewer (simulations 18-20, 100-1,000 characters),
than 10% of the estimated branches with an increase in the number of informative
bootstrap proportions below 30% are cor- characters decreases the chances of a cor-
rect. rect tree having a low bootstrap proportion
The shape of the curves in Figures 4 and or of a wrong tree having a high bootstrap
5 can be understood by examining the fre- proportion (Figs. 7, 8). In simulation 20
quency distributions of bootstrap propor- (1,000 characters, 10% change between
tions for correct and incorrect internal nodes), the correct tree was found in every
branches. In the simulations with four taxa bootstrap replicate on all 1,000 matrices.
(Table 2), there are only three possible un- The results presented in Figures 4-8 in-
rooted trees (each with a unique internal dicate that when conditions are favorable
branch). In Figure 6, the results of two four- for phylogenetic analysis (i.e., equal and
taxon simulations are contrasted: simula- appropriate rates of change and symmet-
tion 11, in which the rate of change is ap- rical phylogenies), bootstrap proportions
propriate for phylogenetic analysis, and are highly conservative measures of the
simulation 15, in which the sequences are probability that the corresponding clade is
randomized among taxa by a high rate of true. To investigate the conditions under
change. In Figure 6a, the plot of bootstrap which bootstrapping might positively re-
proportions against probability of a branch inforce misleading results, we further in-
being correct takes on a shape similar to vestigated high rates of change, highly un-
that for nine-taxon trees (Figs. 4, 5). In Fig- equal rates of change, and asymmetrical
ure 6b, the bootstrap frequencies for each tree topologies.
of the three possible trees (one correct and Although high rates of change do reduce
two incorrect) are shown. These distribu- the gap between the probability of cor-
O 10%
O20%
60
8 A 30%
© 40%
Bootstrap proportions Bootstrap proportions
FIGURE 9. Relationship between bootstrap pro-

portions and the probability of the corresponding o
clade being correct, as affected by different rates (10-
40%) of internodal change. As internodal change ap-
D
e O o
proaches 40% of the characters, the bootstrap pro- a a

•
A
portions approach the probability that the corre-
sponding clade is correct. * a
22 A
I •
o
A
—,
a
*
rectly resolving a node and bootstrap pro- (b)
portions (Fig. 9), the conservative nature Bootstrap proportions

of the bootstrap proportions is substantial FIGURE 10. Relationship between bootstrap pro-
throughout the range of rates typical of portions and the probability of the corresponding
most phylogenetic studies. Only as inter- clade being correct, as affected by unequal rates of
change among lineages (from simulations with nine
nodal change approaches 40% are boot- taxa). (a) When the ratio of short to long branches is
strap proportions close to the probability less than 1:5 (simulations 5 and 6), bootstrap propor-
of correctly resolving the corresponding tions are conservative estimates of accuracy. At a ratio
branch, at least in the four-taxon simula- of 1:5 (simulation 7), the bootstrap proportions are a
fairly accurate indicator of accuracy; above a ratio of
tions (Fig. 9). At such high rates of change, 1:5 (simulations 8 and 9), bootstrap results can be
it is difficult to detect a significant amount positively misleading. However, under the mislead-
of phylogenetic signal (Hillis and Huel- ing conditions, the likelihood of finding any clade
senbeck, 1992), and phylogenetic analyses with a high bootstrap proportion is very low (b).
are rarely attempted with such rapidly
evolving sequences.
Trees that contain long, undivided is likely to be represented in a high pro-
branches interspersed with short branches portion of bootstrap replicates (Fig. 10b).
are particularly difficult to reconstruct (Fel- For instance, among the 11,099 bootstrap
senstein, 1978). Such trees may exist if rates proportions generated in simulation 9,
of evolution vary greatly among lineages none were >85% and only 17 were >65%.
or because of the timing of cladogenic Asymmetric phylogenies (Fig. 2b, sim-
events. In simulations 5-9, we examined ulation 10) also reduce the gap between
the effects of various rates of change among bootstrap proportions and the probability
terminal lineages. The results of the boot- of correctly resolving the corresponding
strap analyses of these simulations are branch (Fig. 11). Nonetheless, the boot-
shown in Figure 10. For the nine-taxon to- strap proportions (above 50%) for even a
pology tested (Fig. 2a; Table 1), bootstrap- completely asymmetrical phylogeny are
ping proportions above 50% are consis- below the probability of correctly infer-
tently conservative measures of accuracy ring the existence of a clade when rates of
when the ratio of the short: long terminal change are equal.
branches is less extreme than 1:5 (Fig. 10).
At more extreme ratios for the short: long CONCLUSIONS AND RECOMMENDATIONS
branches, parsimony analysis becomes The factors that affect the performance
misleading, and branches with high boot- of bootstrapping depend on the use to
strap proportions are highly unlikely to be which it is applied. If one wants only a
correct. However, under these extreme precise measure of a bootstrap proportion,
conditions, no clade (either correct or not) with no connection between that value and
any measure of repeatability or accuracy,

then only the number of bootstrap itera-
tions is of concern (Hedges, 1992). How-
ever, the statistical meaning of such a mea-
sure is obscure: the measure may be precise,
but what is it measuring? If one is inter-
ested in the repeatability of a given result
(i.e., what would happen if someone were Bootstrap proportions
to draw a new set of characters that are
evolving in the same way as the first set), FIGURE 11. Relationship between bootstrap results
and the probability of the corresponding clade being
then bootstrap proportions are rarely use- correct for a completely asymmetrical topology (sim-
ful. With an infinite number of bootstrap ulation 10, with nine taxa). For such a topology, the
iterations on a given data set, one would bootstrap proportions are still conservative measures
obtain a perfectly precise but highly in- of reliability but less so than for symmetrical topol-
accurate estimate of repeatability. The pro- ogies (contrast with Fig. 4).
portions from bootstrap pseudoreplica-
tions are likely to differ dramatically from on the assumption that the characters ex-

proportions calculated by actual replica- amined are evolving independently, so
tions. Finally, if one is interested in the some effort is usually expended to ensure
probability that a recovered group repre- that this assumption has been met (e.g.,
sents a true clade, then at least the follow- Wheeler and Honeycutt, 1988).
ing variables should be taken into account: Zharkikh and Li (1992a, 1992b) have re-
(1) number of characters, (2) number of cently studied the four-taxon case (with
taxa, (3) number of bootstrap iterations and without a molecular clock) using an-
performed, (4) rate of change, (5) tree to- alytical and simulation approaches. They
pology, (6) position of the group of interest studied the relationship between accuracy
within the tree, (7) variance of rates of and bootstrap proportions in parsimony
change among lineages, (8) independence and neighbor-joining analyses and also
of characters, and (9) method of phyloge- found the bias that we report here for both
netic inference. Under a wide variety of types of analyses. For the four-taxon case,
conditions for the first seven of these they recommended that bootstrap propor-
variables using parsimony analysis of tions could be used in a conservative as-
independently evolving characters, boot- sessment of accuracy as long as rates of
strapping proportions above 50% are con- evolution are not so variable that the phy-
sistently much lower than the probability logenetic method is inconsistent. Other
that the corresponding branch is correct. previous studies of bootstrapping are also
Because bootstrap results are typically pre- consistent with our finding that bootstrap
sented in the form of a 50% majority-rule proportions provide conservative esti-
consensus tree, it is safe to assume that the mates of accuracy under many conditions
bootstrap values are underestimates of (e.g., Penny and Hendy, 1986). In respond-
phylogenetic accuracy unless (1) rates of ing to our paper, Felsenstein and Kishino
change are highly unequal, (2) rates of (1993) addressed the issue of bias in boot-
change are high enough to randomize strap analyses and concluded that this bias
characters with respect to history, or (3) a may be a more general phenomenon, as-
systematic bias exists in the data set (such sociated with placing a probability value
as lack of independence among the char- on a prespecified hypothesis, rather than
acters). The first two of these three con- a characteristic that is limited to bootstrap-
ditions can be detected using existing ping.
methods (e.g., Allard and Miyamoto, 1992; If bootstrap analyses provide biased es-
Hillis and Huelsenbeck, 1992). Systematic timates of accuracy, is it worth undertaking
biases may be harder to address for some these analyses at all? Bootstrapping may
data sets, but all phylogenetic analyses rest provide a relative ranking of the degree of
support in a particular analysis for the var- BULL, J. J., C. W. CUNNINGHAM, I. J. MOLINEUX, M. R.
ious recovered clades. Sanderson (1989) BADGETT, AND D. M. HILLIS. In press. Experimental
molecular evolution of bacteriophage T7. Evolu-
concluded that these rankings are superior tion.
to other measures, such as number of char- EFRON, B. 1979. Bootstrapping methods: Another look
acters supporting a given branch. The at the jackknife. Ann. Stat. 7:1-26.
strong positive relationship between high EFRON, B. 1982. The jackknife, the bootstrap, and
other resampling plans. Conf. Board Math. Sci. Soc.
bootstrap proportions and phylogenetic Ind. Appl. Math. 38:1-92.
accuracy does indicate a use for bootstrap- EFRON, B. 1987. Better bootstrap confidence inter-
ping. However, bootstrap results should vals. J. Am. Stat. Assoc. 82:171-185.
not be interpreted directly as estimates of FELSENSTEIN, J. 1978. Cases in which parsimony or
either repeatability or accuracy under most compatibility methods will be positively mislead-
ing. Syst. Zool. 27:401-410.
conditions. They are poor estimates of re- FELSENSTEIN, J. 1985. Confidence limits on phylog-
peatability and are usually very conser- enies: An approach using the bootstrap. Evolution
vative estimates of accuracy. Under many 39:783-791.
conditions, bootstrapping can be used as a FELSENSTEIN, J., AND H. KISHINO. 1993. Is there some-
thing wrong with the bootstrap on phylogenies? A
highly conservative measure of accuracy, reply to Hillis and Bull. Syst. Biol. 42:193-200.
but the magnitude of bias will differ from HEDGES, S. B. 1992. The number of replications need-

branch to branch and study to study. The ed for accurate estimation of the bootstrap P value
values cannot be directly compared among in phylogenetic studies. Mol. Biol. Evol. 9:366-369.
HEDGES, S. B., AND L. R. MAXON. 1993. A molecular
studies. perspective on lissamphibian phylogeny. Herpetol.
It may be possible to use retrospective Monogr. 7 (in press).
simulation studies to calibrate bootstrap HILLIS, D. M., J. J. BULL, M. E. WHITE, M. R. BADGETT,
proportions so that these proportions can AND I. J. MOLINEUX. 1992. Experimental phylo-
be converted into acceptable estimates of genetics: Generation of a known phylogeny. Sci-
ence 255:589-592.
accuracy. Such studies would use a model HILLIS, D. M., J. J. BULL, M. E. WHITE, M. R. BADGETT,
phylogeny that is estimated from the ini- AND I. J. MOLINEUX. 1993. Experimental approach-
tial sample to conduct simulations, which es to phylogenetic analysis. Syst. Biol. 42:90-92.
would be used in turn to calibrate the boot- HILLIS, D. M., AND J. HUELSENBECK. 1992. Signal,
strap proportions. It remains to be seen if noise, and reliability in molecular phylogenetic
analyses. J. Hered. 83:189-195.
biases in the initial phylogenetic estimate PENNY, D., AND M. HENDY. 1986. Estimating the re-
would adversely affect such retrospective liability of evolutionary trees. Mol. Biol. Evol. 3:403-
simulations (=parametric bootstrapping; 417.
see Bull et al., in press). SANDERSON, M. J. 1989. Confidence limits on phy-
logenies: The bootstrap revisited. Cladistics 5:113-
129.
ACKNOWLEDGMENTS SWOFFORD, D. L. 1990. PAUP: Phylogenetic analysis
We thank Michael Donoghue, Joe Felsenstein, Ar- using parsimony, version 3.0. Illinois Natural His-
gye Hillis, Wen-Hsiung Li, David Maddison, Wayne tory Survey, Champaign.
Maddison, David Penny, Michael Sanderson, and An- WHEELER, W. C, AND R. L. HONEYCUTT. 1988. Paired
drey Zharkikh for commenting on various versions sequence difference in ribosomal RNAs: Evolution
of this manuscript and Wen-Hsiung Li and Andrey and phylogenetic implications. Mol. Biol. Evol. 5:90-
Zharkikh for sending us advance copies of their work 96.
on bootstrapping. This work was supported by NSF ZHARKIKH, A., AND W.-H. LI. 1992a. Statistical prop-
grants DEB 9221052 and BSR 9106746. erties of bootstrap estimation of phylogenetic vari-
ability from nucleotide sequences: I. Four taxa with
a molecular clock. Mol. Biol. Evol. 9:1119-1147.
REFERENCES ZHARKIKH, A., AND W.-H. Li. 1992b. Statistical prop-
erties of bootstrap estimation of phylogenetic vari-
ALLARD, M. W., AND M. M. MIYAMOTO. 1992. Testing ability from nucleotide sequences: II. Four taxa
phylogenetic approaches with empirical data, as without a molecular clock. J. Mol. Evol. 35:356-366.
illustrated with the parsimony method. Mol. Biol.
Evol. 9:778-786. Received 10 July 1992; accepted 7 January 1993

An Empirical Test of Bootstrapping As A Method For Assessing Confidence in Phylogenetic Analysis

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

An Empirical Test of Bootstrapping As A Method For Assessing Confidence in Phylogenetic Analysis

Uploaded by

Copyright:

Available Formats

Syst. Biol.

AN EMPIRICAL TEST OF BOOTSTRAPPING AS A

DAVID M. HILLIS AND JAMES J. BULL

Abstract.—Bootstrapping is a common method for assessing confidence in phylogenetic anal-

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

2 3 4 5 6 7 8 9 1992). Hedges (1992) recommended per-

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

Simu- Mutation rate

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

(a) was used in the bootstrap runs to ensure

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

greater. The larger variance of the boot-

ance. Yet in the present case, which is one 2.2

What do these distributions indicate 1.2 0

tree is approximately 70% (the mean), and

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

true clade in all these simulations, and

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

o correct tree o correct tree

FIGURE 8. The effect of increasing size of the data

tions are highly skewed in opposite direc-

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

Bootstrap proportions Bootstrap proportions

FIGURE 9. Relationship between bootstrap pro-

proaches 40% of the characters, the bootstrap pro- a a

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

any measure of repeatability or accuracy,

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

Downloaded from sysbio.oxfordjournals.org by David Gernandt on August 9, 2011

You might also like