Professional Documents
Culture Documents
783-791
JOSEPH FELSENSTEIN
Department of Genetics SK-50, University of Washington, Seattle, WA 98195
Abstract. - The recently-developed statistical method known as the "bootstrap" can be used
to place confidence intervals on phylogenies. It involves resampling points from one's own
data, with replacement, to create a series ofbootstrap samples ofthe same size as the original
data. Each of these is analyzed, and the variation among the resulting estimates taken to
indicate the size of the error involved in making estimates from the original data. In the
case of phylogenies, it is argued that the proper method of resampling is to keep all of the
original species while sampling characters with replacement, under the assumption that the
characters have been independently drawn by the systematist and have evolved indepen-
dently. Majority-rule consensus trees can be used to construct a phylogeny showing all of
the inferred monophyletic groups that occurred in a majority of the bootstrap samples. If
a group shows up 95% of the time or more, the evidence for it is taken to be statistically
significant. Existing computer programs can be used to analyze different bootstrap samples
by using weights on the characters, the weight of a character being how many times it was
drawn in bootstrap sampling. When all characters are perfectly compatible, as envisioned
by Hennig, bootstrap sampling becomes unnecessary; the bootstrap method would show
significant evidence for a group if it is defined by three or more characters.
It is rare that any attempt is made to a confidence interval by finding all trees
put a confidence interval on an estimate that cannot be rejected in comparison
of a phylogeny. Most methods for infer- with the best supported tree. I have re-
ring phylogenies yield one or a few trees, cently extended Cavender's analysis to
and their users rarely go beyond exam- the case of a molecular clock with three
ining the variation among trees that are species, obtaining, in that case, confi-
tied with the best tree under whatever dence limits that were somewhat smaller
criterion is being employed. There is no than Cavender's (Felsenstein, 1985). I
reason to believe that this practice con- have also recently reviewed the appli-
stitutes an adequate exploration of the cation of statistics to inferring phyloge-
size of the confidence limits on the esti- nies (Felsenstein, 1983a); that paper may
mate. be consulted for earlier references on sta-
A few authors have explored the ques- tistical estimation of phylogenies.
tion of confidence limits on phylogenies. An important recent statistical method
The pioneer in doing so is Cavender is the bootstrap (Efron, 1979), a relative
(1978, 1981) who examined the confi- of the jackknife. Like the jackknife, it is
dence limits for a four-species case, in a method of resampling one's own data
terms of how many steps worse a tree to infer the variability of the estimate.
must be than the most parsimonious tree This paper will explore the use of the
to be significantly worse. His results were bootstrap in inferring phylogenies, where
a bit disconcerting: when inferences were it leads to a practical method for placing
based on twenty characters, a tree would confidence intervals on the estimates.
have to be 9 steps worse to be signifi-
cantly worse. This implies that the con- The Bootstrap
fidence intervals would be quite large. A straightforward statistical exposition
Templeton (1983) has constructed a of the bootstrap is given by Efron and
test of whether one tree is significantly Gong (1983), and a readable elementary
better supported than another. In prin- treatment is that by Diaconis and Efron
ciple such a test could be used to delimit (1983). The basic idea of the bootstrap
783
784 JOSEPH FELSENSTEIN
be analyzed to obtain an estimate of the they are both present in the true tree. But
phylogeny. We then have r phylogenies. at least they cannot be contradictory: each
Each of these is a complicated multi- being present on at least 95% of the
variate entity that has a tree topology and bootstrap estimated trees, they must co-
may also have branch lengths. Defining occur on at least one ofthe trees and must
a confidence interval and summarizing it thus be either nested or disjoint.
in a useable form is far from a simple The same argument has been used by
matter. Margush and McMorris (1981) to define
In bootstrapping, confidence limits on "majority rule" consensus trees. These
a statistic are frequently constructed by are trees composed of all those subsets
the percentile method, which involves that appear in a majority of a collection
simply taking (for a 95% confidence in- oftrees. By the argument just given, these
terval) the empirical upper and lower subsets must define a tree, since no two
2.5% points of the distribution of boot- of them can conflict. If we take the set of
strap estimates of the statistic. Consider phylogenies that result from analyzing a
testing whether the probability of heads series of bootstrap samples and make a
of a tossed coin exceeds 0.50. If we did majority-rule consensus tree, recording
not know about the binomial distribu- on it how often each subset appears, we
tion and decided instead to use the boot- will obtain a tree that can be used to de-
strap, a one-sided confidence interval on fine at a glance confidence sets for any
the probability of heads could be con- rejection probability below 50%. The
structed by finding the empirical lower majority-rule consensus tree itself can be
5% point ofthe distribution of bootstrap considered to be an overall bootstrap es-
estimates. The set ofvalues less than 0.50 timate of the phylogeny.
would therefore be rejected if values of In cases where we are using a statisti-
the estimated probability of heads that cally well-founded method, such as max-
small or smaller occurred less than 5% imum likelihood estimation, we would
of the time among the bootstrap esti- hope that the bootstrap method and the
mates. curvature ofthe likelihood surface would
The approach used here starts with the give similar indications of which parts of
assumption that the systematist is pri- the phylogeny were well estimated and
marily interested in whether some par- which not. Where the method ofinferring
ticular group is monophyletic. A rooted phylogenies is one with undesirable sta-
tree is a series of statements asserting tistical properties such as inconsistency,
monophyly of a series of nested or dis- the bootstrap does not correct for these.
joint sets of species. Suppose that we are For example, clustering by overall sim-
interested in a subset S of species and ilarity makes an inconsistent estimate of
wish to know whether there is significant the phylogeny if rates ofevolution in dif-
support in the data for the assertion that ferent lineages differ by more than a cer-
this group is monophyletic. We can reject tain amount. Parsimony methods are
the alternatives to the subset S if they subject to the same problem, but require
occur in less than 5% of the bootstrap greater inequalities of evolutionary rate
estimates. to be inconsistent. For an elementary dis-
We thus wish to search for all subsets cussion of these phenomena, see my re-
S of species that occur on 95% or more cent review article (Felsenstein, 1983b).
of the bootstrap estimates. Each of these Bootstrapping provides us with a confi-
subsets may be considered to be sup- dence interval within which is contained
ported (in the sense that its alternatives not the true phylogeny, but the phylogeny
are rejected), although those confidence that would be estimated on repeated
statements are not joint confidence state- sampling of many characters from the
ments: if two subsets are each supported underlying pool of characters. As such it
at the 95% level, we might have as little may be misleading if the method used to
as 90% confidence in the statement that infer phylogenies is inconsistent.
BOOTSTRAPPING PHYLOGENIES 787
TABLE l. Fossil horse data of Cam in and Sokal (1965). The states of each character are in a linear series.
-1, 0, 1, 2, ... , with the ancestral state being O. The data are also shown in binary recoded form in
which the nine multistate characters have been recoded into 20 binary factors. The first line ofthat table
indicates the correspondence between the original and recoded characters. Bootstrap sampling ofcharacters
should be done before any recoding into binary factors.
Binary Factors
Name Characters 11112 22333 44566 77889
dropped from the analysis. A weight of gram package PHYLIP, available free
w means that the character is counted as from me (see the Appendix below).
if present w times, so that each change of
state in the character is counted as if it An Example
were w changes of state. Table 1 shows the fossil horse data giv-
This automatically accomplishes the en by Camin and Sokal (1965 pp. 321-
duplication and deletion of characters 322) as a computational example. The
without the necessity of recopying the full list of species and references for the
data matrix. Different bootstrap samples original data are given by Camin and 80-
could be fed into the programs by doing kal (1965). The data set has ten species
computer runs with different weights. The and nine multistate characters. Mesohip-
weights are generated by starting with pus has been taken as the outgroup, as it
weights ofzero for all characters. We then was in Camin and Sokal's paper.
sample n characters at random with re- Figure 1 shows the results of running
placement (using a table ofrandom num- a branch-and-bound program that finds
bers, for example). Each time a character all most parsimonious trees according to
is drawn, its weight is increased by one, the Wagner parsimony criterion. There
so that in the end its weight counts the are ten most parsimonious trees. The left
number of times it was sampled. Here tree in Figure 1 shows nine of them: the
are five of the weight vectors that were two empty circles with three descendants
generated when bootstrap sampling was represent not trifurcations, but points at
done on 20 characters: which the tree can be resolved into any
of three bifurcating topologies. All nine
21100212120121012010
01001510031020011211 possible combinations of these are in the
list of most parsimonious trees. The tenth
01031211201012100121 tree is the one shown at the right ofFigure
12130421030000100101
1. All of these trees require 29 changes
20010100211211121211
of character state.
To do bootstrap sampling, one would Ifwe were to take the variation among
generate a vector ofweights, run the phy- the most parsimonious trees as providing
logeny estimation program with those an adequate indication ofthe uncertainty
weights, generate another vector of in our estimate of the phylogeny, we
weights, run the program with those, and would conclude that four monophyletic
so on. The process is fairly tedious, al- groups were defined, as these groups show
though with microcomputers it need not up in all ten of the most parsimonious
be expensive. In the fossil horse example trees. They are:
below, I have used 50 bootstrap samples.
(Plioh ipp us, Merychippus secundus,
This might strike a statistician as too few,
Calippus)
but a systematist as too many. The more
(Nannippus, Neohipparion, Merychip-
samples are taken, the more accurate an pus)
idea we will have of which groups are
(Plioh ipp us, Merychippus secundus,
likely to be monophyletic. Even with a
Calippus, Na nnippus, Neohippa-
small number of bootstrap samples we
rion, Merychippus)
will quickly get a feel for which parts of (all but Mesohippus and Archaeohip-
our estimate of the phylogeny are well pus)
supported and which not.
A computer program that carries out When we carry out bootstrap sampling
bootstrap sampling and computes the ofcolumns from the left-hand part ofTa-
majority rule consensus tree is available ble 1 and analyze 50 bootstrap replicates,
for the case of discrete characters ana- we get the results shown in Figure 2. Next
lyzed by the parsimony and compatibil- to each branch of the tree is shown the
ity methods. It is contained in the pro- number of times that the bootstrap es-
BOOTSTRAPPING PHYLOGENIES 789
FIG. 1. All most parsimonious trees for the fossil horse data in Table 1 when phylogenies are evaluated
by the Wagner parsimony criterion. There are ten most parsimonious trees in all. Nine of these can be
generated by resolving each of the trifurcations in the left tree into all three possible bifurcations. The
tenth is shown in the right tree. All abbreviations are first three letters of names in Table 1, except MSE =
Merychippus secundus.