Reubyn William Chong
Dr. Sayan Mukherjee
STAT 113: Statistics for Engineers
April 28, 2011

Statistical Inference Methods for Determining Phylogenetic Trees

Phylogenetics is a branch of evolutionary biology that deals with the relatedness of species and the formation of evolutionary lineages. Apart from determining the ways in which species have diverged in evolutionary time, phylogenies can be used to create a history of the development of morphological traits, show migration patterns of organisms, and predict the emergence of new disease strains. Phylogenetic tree creation is a complex process involving large amounts of data; therefore it is common for computer programs to use statistical tests to calculate the tree of best fit. The construction of phylogenetic trees requires statistical inference on a vast amount of morphological or genomic data. Taxa with similar phenotypic characteristics or DNA sequence homology will likely be closely related within the same monophyletic group. Basic evolutionary models used to create a tree of life include distance methods and parsimony. Distance methods use nucleotide base differences within genetic information to design a tree whose branch lengths are proportional to the number of base differences. The method of parsimony requires that the tree best fitted to the data be the one that assumes the fewest changes in morphological features or genetic mutations during its evolutionary history. The most parsimonious tree can be tested through a variety of approaches; among the most common are maximum likelihood methods and Bayesian Markov Chain Monte Carlo (BMCMC) inference methods.
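The distance idea described above can be illustrated with a short Python sketch. It only counts base differences between aligned sequences, which is the raw input a distance method would use; it does not build a tree. The sequences and the names `base_differences` and `distance_matrix` are hypothetical and chosen for illustration.

```python
from itertools import combinations

# Hypothetical aligned sequences of equal length (not data from this paper).
sequences = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTTC",
    "taxon_C": "ACGAACGTTG",
    "taxon_D": "TCGAACCTTG",
}


def base_differences(seq1, seq2):
    """Count the positions at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def distance_matrix(seqs):
    """Map every pair of taxon names to its number of base differences."""
    return {
        (name1, name2): base_differences(seqs[name1], seqs[name2])
        for name1, name2 in combinations(sorted(seqs), 2)
    }


distances = distance_matrix(sequences)
# The pair with the fewest differences is the first candidate to be joined
# by a short branch in a distance-based tree.
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

A full distance method would go on to cluster the taxa from this matrix; that step is left out here to keep the sketch small.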

The maximum likelihood approach scores a phylogenetic tree by the probability of the observed data, that is, a certain set of traits for each taxon or single-nucleotide polymorphisms (SNPs) in DNA sequences between taxa, given that tree. In this process, every possible tree is created based on the evolutionary data. The likelihood of each of these trees is calculated by multiplying the likelihood of each node by the product of the likelihoods of its branches.

[Equation 1: Likelihood Calculation for a Path Within a Tree]

The tree with the highest likelihood is deemed the best via the maximum likelihood method. The method has several advantages: it works well with data from distantly related sequences, and it allows the use of a variety of evolutionary models of tree construction. However, this approach has shortcomings in phylogenetics because of the rigorous computations it requires. First, a lack of data makes tree creation difficult. Second, the method also fails when inference of larger trees is involved.

Often in phylogenetic inference, without a large amount of data, it is impossible to determine the population distribution from the sample. When this is the case, a technique known as bootstrap estimation is used. A new sample is drawn independently and identically, with replacement, from the small pool of data. The randomness of this new sample provides a more representative distribution. Bootstrapping is also useful for assessing the validity of a node or branch in a given tree: the frequency at which the part of the tree of interest occurs within the bootstrap samples determines its appropriateness. When a branch occurs in over fifty percent of bootstrapped samples, it is likely that the branch belongs to the true phylogenetic tree.
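The bootstrap procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of any particular program: the alignment is hypothetical, and the function `groups_together` is a simplified stand-in for tree building that only asks whether two taxa are closer to each other than to every other taxon in a replicate.

```python
import random

# Hypothetical aligned sequences (not data from this paper).
alignment = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "ACGTACGTACGTACGAACGT",
    "C": "ACCTAGGTACTTACGAATGT",
    "D": "TCCTAGGTTCTTACGAATGA",
}


def hamming(seq1, seq2):
    """Number of positions at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))


def resample_columns(aln, rng):
    """Build one bootstrap replicate by drawing columns with replacement."""
    length = len(next(iter(aln.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in aln.items()}


def groups_together(aln, x, y):
    """Simplified stand-in for tree building (an assumption of this sketch):
    x and y count as a branch when they are strictly closer to each other
    than either is to any other taxon."""
    d_xy = hamming(aln[x], aln[y])
    others = [name for name in aln if name not in (x, y)]
    return all(
        d_xy < hamming(aln[x], o_seq) and d_xy < hamming(aln[y], o_seq)
        for o_seq in (aln[o] for o in others)
    )


def bootstrap_support(aln, x, y, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which x and y group together."""
    rng = random.Random(seed)
    hits = sum(
        groups_together(resample_columns(aln, rng), x, y)
        for _ in range(replicates)
    )
    return hits / replicates


support = bootstrap_support(alignment, "A", "B")
print(f"bootstrap support for (A,B): {support:.2f}")
```

Under the fifty-percent rule stated above, a support value greater than 0.5 would be read as evidence that the (A,B) branch belongs to the true tree. In practice each replicate would be passed to a real tree-building method rather than to the nearest-neighbor check used here.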

Bayesian inference is a more recently established statistical approach to phylogenies. It can give a better depiction of the best tree since prior information is used. This prior information, or distribution of the goodness of a tree, can create a more accurate posterior distribution from which the expectation value can be found.

Equation 2: Bayes Theorem for Posterior Distribution of Trees

Pr(Tree | Data) = Pr(Data | Tree) × Pr(Tree) / Pr(Data)

where Pr(Tree) is the prior.

The Bayes formula involves an integration of each tree over all different branch lengths and node placements, followed by a summation over all possible combinations of trees. Since an analytical solution is impossible for large trees, numerical approximations are used instead. The most common of these is the Markov Chain Monte Carlo (MCMC) method. The MCMC approach attempts to iteratively test possible trees for their fit. First, a branch is moved randomly on the tree, or "perturbed". If this change improves the tree, the change is kept. If the change is not statistically preferred, the new tree is scrapped and other branches are moved around instead. Through the sampling of possible trees, a posterior distribution is created based on the frequency with which each tree is visited in the stochastic perturbation process. The more often a tree is sampled, the greater its probability within the posterior distribution.

The Bayesian method is commonly used for phylogenetic inference. The disadvantage of this method is that use of a prior distribution can lead to biased inference. Inference of larger phylogenetic trees can also be problematic: Bayesian and MCMC methods can converge on a wrong tree during the "tree perturbation" process. This problem arises from complicated data, such as multiple base substitutions at a single SNP. The problem can be solved by running the perturbation chains for different lengths of time.
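The perturbation process described above can be sketched with a toy Metropolis sampler in Python. This is an illustration under stated assumptions: the three candidate topologies, their likelihood values, and the flat prior are hypothetical numbers, and the acceptance step uses the standard Metropolis rule, in which a worse tree is still accepted with probability equal to the posterior ratio rather than always being scrapped.

```python
import random
from collections import Counter

# Hypothetical values for illustration only (not results from this paper).
likelihood = {"((A,B),C)": 0.020, "((A,C),B)": 0.005, "((B,C),A)": 0.005}
prior = {tree: 1 / 3 for tree in likelihood}  # flat prior, Pr(Tree)


def unnormalized_posterior(tree):
    """Pr(Data | Tree) * Pr(Tree); the constant Pr(Data) cancels in ratios."""
    return likelihood[tree] * prior[tree]


def sample_trees(steps=20000, seed=1):
    """Run one Metropolis chain and return each tree's visit frequency."""
    rng = random.Random(seed)
    trees = list(likelihood)
    current = trees[-1]  # deliberately start from a poorly supported tree
    visits = Counter()
    for _ in range(steps):
        # "Perturb": propose one of the other topologies uniformly at random.
        proposal = rng.choice([t for t in trees if t != current])
        ratio = unnormalized_posterior(proposal) / unnormalized_posterior(current)
        # Always keep an improvement; keep a worse tree with probability ratio.
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] += 1
    return {tree: visits[tree] / steps for tree in trees}


posterior = sample_trees()
best_tree = max(posterior, key=posterior.get)
for tree, freq in posterior.items():
    print(f"{tree}: visited {freq:.3f} of the time")
```

The visit frequencies approximate the posterior distribution without ever computing Pr(Data), which is the point of the numerical approach. Real programs propose changes to branch placements and lengths on much larger trees; the three-topology state space here is only meant to make the accept-or-reject step visible.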

In conclusion, phylogenetic inference is a complicated, computationally intensive process that utilizes many statistical tests and evolutionary models. Distance methods and parsimony are good strategies for simple trees, but formal statistical computer programs are required for trees built from more massive data. Maximum likelihood methods are computationally intensive but are generally reliable as long as the tree is not too large. Bootstrap estimation is often used in phylogenetics in order to create a tree sample representative of the population distribution of all possible tree combinations. Bayesian methods, which work for large data sets, involve the creation of tree perturbation chains in order to assemble a posterior distribution. Maximum likelihood methods could fix the convergence problem, given that the tree sample is not too large. Therefore, for large tree inference, bootstrap estimation, used to create smaller samples, is performed on the posterior distribution in order for MLE tests to be conducted.

