This action might not be possible to undo. Are you sure you want to continue?

**Bootstraps and Testing Trees
**

Joe Felsenstein Department of Genome Sciences University of Washington, Seattle email: joe@gs.washington.edu

−2620 −2625

ln L

−2630 −2635 −2640

5

10

20

50

100

200

Transition / transversion ratio

Likelihood curve (and interval) of Ts/Tn ratio

A

v1 v 6

B

v 2

C

D

v 4

E

v5

**Constraints for a clock
**

v1 = v 2 v4 = v5

v3 v 8 v7

v1 + v = v 3 6 v +v = v +v 3 4 7 8

Constraints on a tree for a clock

−204

ln Likelihood

A −205 x

C

B

A x

B

C

B −206 x

C

A

0.10

0

0.10

0.20

x Likelihood surface (in x) for three clocklike trees

Mouse Bovine

Tree I

Gibbon Orang Gorilla Chimp Human

Mouse Bovine

Tree II

Gibbon Orang Gorilla Chimp Human

Two trees to be tested using KHT test

site Tree

1

2

3

4

5

6

231

232

ln L

−1405.61 −1408.80

I II Diff

−2.971 −4.483 −5.673 −5.883 −2.691 −8.003 −2.983 −4.494 −5.685 −5.898 −2.700 −7.572

... ... ...

−2.971 −2.691 −2.987 −2.705

+0.012 +0.111 +0.013 +0.015 +0.010 −0.431

+0.012 +0.010

+3.19

Table of diﬀerences in log-likelihood by site

−0.50

0.0

0.50

1.0

1.5

2.0

Difference in log likelihood at site

Histogram of ∆LnL among sites (Hasegawa 232-site data)

Paired sites tests

• Winning sites test (Prager and Wilson, 1988). Do a sign test on the signs of the diﬀerences. • z test (me, 1993 in PHYLIP documentation). Assume diﬀerences are normal, do z test of whether mean (hence sum) diﬀerence is signiﬁcant. • t test. Swoﬀord et. al., 1996: do a t test (paired) • Wilcoxon ranked sums test (Templeton, 1983). • RELL test (Kishino and Hasegawa, 1989 per my suggestion). Bootstrap resample sites, get distribution of diﬀerence of totals.

In this example ...

• Winning sites test. 3.279 × 10−9

160 of 232 sites favor tree I. P <

• z test. Diﬀerence of log-likelihood totals is 0.948104 standard deviations from 0, P = 0.343077. Not signiﬁcant. • t test. Same as z test for this large a number of sites. • Wilcoxon ranked sums test. Rank sum is 4.82805 standard deviations below its expected value, P = 0.000001378765 • RELL test. 8,326 out of 10,000 samples have a positive sum, P = 0.3348 (two-sided)

for each parameter value, find data values (unshaded) that account for 95% of the probability

data value

parameter value

then, given a data value, the parameters that are in the 95% confidence region are those for which that data value is in the unshaded region

A 3−species clocklike tree with Jukes−Cantor model Possible data patterns

A

B

C

( x, y, z stand for different bases) ABC xxx xxy xyx yxx xyz we will ignore all outcomes except these three

Expected frequencies of xxy, xyx, yxx: (p ,p ,p ) 1 2 3 (p, q, q)

p 3 p 1

p 2

(q, p, q)

(q, q, p)

**Test of 3−species Tree with a Clock
**

(Felsenstein, 1985)

(informative characters) possible data

or tin gt ree 5 3 10 p su po rti 5 ng 0

A B C

su

pp

e1 0 tre

10

0

5 10 supporting tree 2

confidence region

10

statistic: number of steps different between best and next best tree

0

Chars

S (0.05) 4 5 4 5 4 5 6 7

4 5 6 7 8 9−13 14−20 21−29

5

5

10

0

0

5

10

estimate ofq

q (unknown) true value of

empirical distribution of sample Bootstrap replicates

(unknown) true distribution

Distribution of estimates of parameters

Bootstrap sampling from a distribution (a mixture of two normals) to estimate the variance of the mean

Bootstrap sampling

**To infer the error in a quantity, θ, estimated from a sample of points x1, x2, . . . , xn we can
**

• •

Do the following R times (R = 1000 or so) Draw a “bootstrap sample” by sampling n times with replacement from the sample. Call these x∗, x∗, . . . , x∗ . Note n 1 2 that some of the original points are represented more than once in the bootstrap sample, some once, some not at all.

ˆ∗ Estimate θ from the bootstrap sample, call this θk (k = 1, 2, . . . , R)

•

•

When all R bootstrap samples have been done, the ˆ∗ distribution of θi estimates the distribution one would get if one were able to draw repeated samples of n points from the unknown true distribution.

Original Data

sites

sequences

Bootstrap sample #1

Estimate of the tree

sites

sequences

sample same number of sites, with replacement

Bootstrap sample #2

sequences

sites

Bootstrap estimate of the tree, #1

sample same number of sites, with replacement

(and so on)

Bootstrap estimate of the tree, #2

Bootstrap sampling of phylogenies

The sites are assumed to have evolved independently given the tree. They are the entities that are sampled (the xi). The trees play the role of the parameter. One ends up with a cloud of R sampled trees. To summarize this cloud, we ask, for each branch in the tree, how frequently it appears among the cloud of trees. We make a tree that summarizes this for all the most frequently occurring branches. This is the majority rule consensus tree of the bootstrap estimates of the tree.

Trees:

E A C F B D E C A B D F

E

A F D B

C

E

A D

F B C

E

C A D

F B

How many times each partition of species is found: AE | BCDF ACE | BDF ACEF | BD AC | BDEF AEF | BCD ADEF | BC ABDF | EC ABCE | DF 3 3 1 1 1 2 1 3 Majority−rule consensus tree of the unrooted trees: E C

60 60

B

60

D

A

F

**Bovine Mouse Squir Monk
**

35 72 42 74 80

Chimp Human Gorilla Orang Gibbon

84

77 99 99 100

Rhesus Mac Jpn Macaq Crab−E.Mac BarbMacaq

49

Tarsier Lemur

An example of bootstrap sampling of trees 232 nucleotide, 14-species mitochondrial D-loop data set Analyzed by parsimony, 100 bootstrap replicates

Potential problems with the bootstrap 1. Sites may not evolve independently 2. Sites may not come from a common distribution (but can consider them sampled from a mixture of possible distributions) 3. If do not know which branch is of interest at the outset, a “multiple-tests” problem means P values are overstated 4. P values are biased (too conservative) 5. Bootstrapping does not correct biases in phylogeny methods

True value of mean

Distribution of individual values of x

True distribution of sample means

Estimated distributions of sample means

"Topology" II

0

"Topology" I

A model showing the bias in bootstrap P vales

note that the true P is more extreme than the average of the P’s

P

estimate of the "phylogeny"

topology II

0

topology I

the true mean

Illustration of the source of the bias

1.00

0.80

Average P

0.60

0.40

0.20

0.00

0.00 0.20 0.40 0.60 0.80 1.00

True P

Extent of the bias in the example

Probability of correct topology

1.00

2.0

0.80

1.0

0.60

0.1

0.40 0.20 0.00 0.00 0.50 1.00

P value

Probability of being correct and variance of prior

Other resampling methods • Delete-half jackknife. replacement. Sample a random 50% of the sites, without

• Delete-1/e jackknife (Farris et. al. 1996) (too little deletion from a statistical viewpoint). • Reweighting characters by choosing weights from an exponential distribution. • In fact, reweighting them by any exchangeable weights having coeﬃcient of variation of 1 • Parametric bootstrap – simulate data sets of this size assuming the estimate of the tree is the truth • (to correct for correlation among adjacent sites) (K¨nsch, 1989) Blocku bootstrapping – sample n/b blocks of b adjacent sites.

**Bovine Mouse Squir Monk
**

80

Chimp Human Gorilla Orang Gibbon

84

69 50 32 72

80 98 99 100

Rhesus Mac Jpn Macaq Crab−E.Mac BarbMacaq

59

Tarsier Lemur

Delete-half jackknife P values (compare with bootstrap)

**Exact computation of the effects of deletion fraction for the jackknife
**

n 1 n 2

(suppose 1 and 2 are conflicting groups)

n characters

m 1

m 2

n(1−δ) characters

We can compute for various n’s the probabilities of getting more evidence for group 1 than for group 2 A typical result is for n = 10, n = 8, n = 100 : 2 1 Jackknife Bootstrap Prob( m >m ) 1 2 Prob( m >m ) 1 2 Prob( m >m ) 1 2 + 1 Prob( m = m ) 2 1 2 0.6384 0.7230 0.6807 δ = 1/2 0.5923 0.7587 0.6755 δ = 1/e 0.6441 0.8040 0.7240

**The Parametric Bootstrap (Efron, 1985)
**

Suppose we have independent observations drawn from a known distribution: and a parameter, θ , calculated from this.

x , x , x , ... 1 2 3

x

n

^ θ

To infer the variability of θ ^ Use the current estimate, θ Use the distribution that has that as its true parameter

x *, x *, x * ... , 1 2 3 x * n x * n x * n

^ θ ^ ^

1

sample R data sets from that distribution, each having the same sample size as the original sample

x *, x *, x * ... , 1 2 3

. . .

θ2

x *, x *, x * ... , 1 2 3

. . .

θ3

x *, x *, x * ... , 1 2 3

x * n

i as the estimate of the distribution from which

and take the distribution of the

^ θ

^ θ

R

θ is drawn

θ

A resampling approach to distributions of the likelihood ratio statistics

Goldman (1993) suggests that, in cases where we may wonder whether the Likelihood Ratio Test statistic really has its desired χ2 distribution we can: • Take our best estimate of the tree • Simulate on it the evolution of data sets of the same size • For each replicate, calculate the LRT statistic • Use this as the distribution and see where the actual LRT value lies in it (e.g.: in the upper 5%?) This, of course, is a parametric bootstrap.

computer simulation

data set #1

estimation of tree

T 1

estimate of tree original data

data set #2

T

2

data set #3

T

3

data set #100

T

100

References Bremer, K. 1988. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42: 795-803. [Bremer support] Cavender, J. A. 1978. Taxonomy with conﬁdence. Mathematical Biosciences 40: 271-280. [Pioneering paper on conﬁdence intervals on trees] Efron, B. 1979. Bootstrap methods: another look at the jackknife. Annals of Statistics 7: 1-26. [The original bootstrap paper] Efron, B. 1985. Bootstrap conﬁdence intervals for a class of parametric problems. Biometrika 72: 45-58. [The parametric bootstrap] Farris, J. S., V. A. Albert, M. Kallersj¨, D. Lipscomb, and A. G. Kluge. 1996. o Parsimony jackkniﬁng outperforms neighbor-joining. Cladistics 12: 99-124. [The delete-1/e jackknife for phylogenies] Felsenstein, J. 1981b. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17: 368-376. [Mentions possibility of likelihood ratio tests]

Felsenstein, J. 1985a. Conﬁdence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783-791. [The bootstrap ﬁrst applied to phylogenies] Felsenstein, J. 1985b. Conﬁdence limits on phylogenies with a molecular clock. Systematic Zoology 34: 152-161. Felsenstein, J. and H. Kishino. 1993. Is there something wrong with the bootstrap on phylogenies? A reply to Hillis and Bull. Systematic Biology 42: 193-200. [A more detailed exposition of the bias of P values in a normal case] Fisher, R. A. 1922. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, A 222: 309368. [Fisher’s great likelihood paper, with mention of asymptotic variances of MLE’s] Goldman, N. 1993. Statistical tests of models of DNA substitution. Journal of Molecular Evolution 36: 182-98. [Parametric bootstrapping for testing models]

Harshman, J. 1994. The eﬀect of irrelevant characters on bootstrap values. Systematic Zoology 43: 419-424. [Not much eﬀect on parsimony whether or not you include invariant characters when bootstrapping] Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29: 170-179. [The KHT test] Hasegawa, M., H. Kishino. 1989. Conﬁdence limits on the maximumlikelihood estimate of the hominoid tree from mitochondrial-DNA sequences. Evolution 43: 672-677 [The KHT test] Hasegawa, M. and H. Kishino. 1994. Accuracies of the simple methods for estimating the bootstrap probability of a maximum-likelihood tree. Molecular Biology and Evolution 11: 142-145. [RELL probabilities] Hillis, D. M. and J. J. Bull. 1993. An empirical test of bootstrapping as a method for assessing conﬁdence in phylogenetic analysis. Systematic Biology 42: 182-192. [Bias in P values seen in a large simulation study]

Huelsenbeck, J. P. and B. Rannala. 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276: 227-232 (11 April) [Review of hyothesis testing with trees] Huelsenbeck, J. P. and K. A. Crandall. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annual Review of Ecology and Systematics 28: 437-466. [Review] Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29: 170-179. [KHT test with likelihoods] Kishino, H. T. Miyata and M. Hasegawa. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. Journal of Molecular Evolution 31: 151-160. K¨nsch, H. R. 1989. The jackknife and the bootstrap for general stationary u observations. Annals of Statistics 17: 1217-1241. [The block-bootstrap] Margush, T. and F. R. McMorris. 1981. Consensus n-trees. Bulletin of

Mathematical Biology 43: 239-244i. [Majority-rule consensus trees] Mueller, L. D. and F. J. Ayala. 1982. Estimation and interpretation of genetic distance in empirical studies. Genetical Research 40: 127-137. [Suggest conventional jackknife to assess variance of branch length.] Penny, D. and M. D. Hendy. 1985. Testing methods of evolutionary tree construction. Cladistics 1: 266-278. [Use jackknife resampling to assess accuracy of tree reconstruction, independently of my use of the bootstrap] Prager, E. M. and A. C. Wilson. 1988. Ancient origin of lactalbumin from lysozyme: analysis of DNA and amino acid sequences. Journal of Molecular Evolution 27: 326-335. [winning-sites test] Sanderson, M. J. 1995. Objections to bootstrapping phylogenies: a critique. Systematic Biology 44: 299-320. [Good but he accepts a few criticisms I would not have accepted] Shimodaira, H. and M. Hasegawa. 1999. Multiple comparisons of loglikelihoods with applications to phylogenetic inference. Molecular Biology

and Evolution 16: 1114-1116. [Correction of KHT test for multiple hypothesis] Sitnikova, T., A. Rzhetsky, and M. Nei. 1995. Interior-branch and bootstrap tests of phylogenetic trees. Molecular Biology and Evolution 12: 319-333. [The interior-branch test] Templeton, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37: 221-224. [The ﬁrst paper on the KHT test] Wu, C. F. J. 1986. Jackknife, bootstrap and other resampling plans in regression analysis. Annals of Statistics 14: 1261-1295. [The delete-half jackknife] Zharkikh, A., and W.-H. Li. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Molecular Biology and Evolution 9: 11191147. [Discovery and explanation of bias in P values]

This Microsoft-free presentation prepared with • PDFLaTeX (mathematical typesetting and PDF preparation) • Free Pascal Compiler (calculating curves) • GNU Plotutils (plotting curves) • Idraw (drawing program to modify plots and draw ﬁgures) • Adobe Acrobat Reader (to display the PDF in full-screen mode) • Linux (operating system)

- C.R. Davis. 2009. Review of Beautiful minds: the parallel lives of great apes and dolphins, by M. Bearzi and C.B. Stanford. Journal of Mammalogy 90:1271-1272.by Clay Davis
- Altruismby Isaias Morales
- Chomsky (2005) the Biolinguistic Perspective After 50 Yearsby LetíciaSampaio
- 7 Geological Oceanographyby culturejunkie

- the-theory-of-biological-evolution-and-islam1.pdf
- Kurzweil Excerpt
- Protoplasma
- Phylogenetic Analysis
- 185548
- Thoughts on Evolution
- bmet
- Darwinism refuted
- The concept of evolution
- Cosmic Mindreach Downsizing Darwin
- Unit 13
- CRSQ Fall 09 Bergman.pdf
- Evidence combination from an evolutionary game theory perspective
- Philosphy
- Adaptation
- 159-677-1-PB
- Towards a General Theory of Selection.
- The Case for Thinking Evolutionarily
- Cavanagh 2008 Game Enables Users to Guide Evolution on Screen
- compModelsOfEvolOfLangAndLangs
- Evolution
- The long tail of the Singularity
- Scott-phillips - The Social Evolution of Language and the Language of Social Evolution
- Article BriefHistoryOfTRIZ
- C.R. Davis. 2009. Review of Beautiful minds
- Altruism
- Chomsky (2005) the Biolinguistic Perspective After 50 Years
- 7 Geological Oceanography
- ANTHOR PATHIC CIRCUIT
- Evolutionary History of the Grasses-Kellogg 2001
- Felsenstein 2

Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

We've moved you to where you read on your other device.

Get the full title to continue

Get the full title to continue reading from where you left off, or restart the preview.

scribd