More Phylogenetic Methods


Bayesian Methods
Molecular Clocks
Gotchas in Phylogenetics
Bayesian Methods
Utilize Bayes' Theorem
Attempts to calculate probabilities of
hypotheses being correct
Combines prior probabilities of
hypotheses with data to calculate
posterior probabilities
A Problem for Bayes' Theorem
1% of people in Louisiana develop Barney
syndrome.
80% of Louisianans with Barney syndrome turn
purple
9.6% of Louisianans without Barney syndrome also
turn purple (false positives).
A Louisianan turns purple. What is the probability
that she actually has Barney syndrome?
Bayes' Theorem
$H_i$ is a hypothesis
$P(H_i)$ is the a priori probability of $H_i$
$D$ is the results (data)
$P(H_i \mid D)$ is the posterior probability of $H_i$
$P(D \mid H_i)$ is the likelihood

$$P(H_i \mid D) = \frac{P(D \mid H_i)\,P(H_i)}{\sum_{\text{all hypotheses } j} P(D \mid H_j)\,P(H_j)}$$
Prior Probabilities
H1: individual has Barney syndrome
P(H1) = 0.01
H2: individual doesn't have Barney syndrome
P(H2) = 1 - 0.01 = 0.99
P(D|H1) = 0.8 (80% w/ BS turn purple)
P(D|H2) = 0.096 (9.6% w/o BS turn purple)
P(H1|D) = 0.078 (she probably doesn't have BS)

$$P(H_1 \mid D) = \frac{P(D \mid H_1)\,P(H_1)}{\sum_{j} P(D \mid H_j)\,P(H_j)} = \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.096 \times 0.99} = \frac{0.008}{0.10304} \approx 0.078$$
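The Barney syndrome arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function name `posterior` is an illustrative choice, not something from the slides.

```python
def posterior(priors, likelihoods):
    """Return P(H_i | D) for every hypothesis using Bayes' theorem."""
    # Numerator terms P(D | H_i) * P(H_i), one per hypothesis.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: sum of the numerator terms over all hypotheses.
    evidence = sum(joint)
    return [j / evidence for j in joint]

priors = [0.01, 0.99]        # P(H1), P(H2) from the slide
likelihoods = [0.8, 0.096]   # P(D | H1), P(D | H2) from the slide
post = posterior(priors, likelihoods)
print(round(post[0], 3))     # 0.078
```

The posterior is small because the 9.6% of unaffected people who turn purple greatly outnumber the 80% of affected people who do.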
Venn Diagram of Barney Syndrome
(areas not to scale)
[Figure: two overlapping sets, "Turn purple" and "Have Barney Syndrome"]
Bayesian Inference of Phylogeny
Pr[Tree | Data] is posterior probability
Pr[Data | Tree] is likelihood
Pr[Tree] is prior probability; usually all
trees are treated as equally likely
Calculation of Posterior Probability
Sum over all trees
For each tree, integrate all possible
combinations of branch lengths and
substitution model parameters
impossible to do analytically
numerical methods used for
approximation
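Written out, the calculation described above takes roughly the following form. The symbols are assumptions introduced here for illustration: $X$ for the data, $\tau_i$ for a tree, $v$ for branch lengths, and $\theta$ for substitution model parameters.

```latex
\Pr[\tau_i \mid X] =
  \frac{\Pr[X \mid \tau_i]\,\Pr[\tau_i]}
       {\sum_{j} \Pr[X \mid \tau_j]\,\Pr[\tau_j]},
\qquad
\Pr[X \mid \tau_i] =
  \int_{v}\int_{\theta} \Pr[X \mid \tau_i, v, \theta]\; p(v, \theta)\; d\theta\, dv
```

The sum in the denominator runs over all trees, and each term hides an integral over branch lengths and model parameters, which is why numerical approximation is needed.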
Markov Chain Monte Carlo (MCMC) Method
Construct a Markov chain with parameters of
model as state space and stationary
distribution that is posterior probability
distribution of parameters
Two steps:
1) a new tree is proposed by stochastically
perturbing current tree
2) new tree is either accepted or rejected with
probability described by Metropolis-Hastings
The Metropolis-Hastings Algorithm
1) Propose a new parameter value by perturbing the current
value in some pre-defined way.
2) Run the resulting new model through the forward
algorithm and measure how well the new model predicts the
data.
If the new data fit is an improvement, retain the new
model and make it the current model. Go to 1.
If the new data fit is poorer, retain the new model anyway
with probability proportional to the ratio $L(\text{new}) / L(\text{old})$;
otherwise keep the current model. Go to 1.
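The accept/reject rule can be sketched on a toy problem that is much simpler than a tree search. The example below is an assumption made for illustration: it estimates a single parameter (the probability of heads for a coin) with a flat prior and a symmetric proposal, so the acceptance ratio reduces to the likelihood ratio L(new)/L(old). The function names and the data (7 heads in 10 flips) are hypothetical.

```python
import math
import random

def log_likelihood(p, heads, flips):
    """Binomial log-likelihood (up to a constant) for heads probability p."""
    if not 0.0 < p < 1.0:
        return float("-inf")  # values outside (0, 1) are impossible
    return heads * math.log(p) + (flips - heads) * math.log(1.0 - p)

def metropolis(heads, flips, steps=20000, step_size=0.1, seed=1):
    rng = random.Random(seed)
    current = 0.5
    current_ll = log_likelihood(current, heads, flips)
    samples = []
    for _ in range(steps):
        # Step 1: propose a new value by perturbing the current one.
        proposal = current + rng.uniform(-step_size, step_size)
        # Step 2: measure how well the proposal predicts the data.
        proposal_ll = log_likelihood(proposal, heads, flips)
        log_ratio = proposal_ll - current_ll
        # Accept improvements always; accept poorer fits with
        # probability L(new) / L(old); otherwise keep the current value.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_ll = proposal, proposal_ll
        samples.append(current)
    return samples

samples = metropolis(heads=7, flips=10)
burned = samples[2000:]  # discard early samples as burn-in
mean_p = sum(burned) / len(burned)
print(round(mean_p, 2))
```

With a flat prior the exact posterior here is Beta(8, 4), whose mean is 8/12, so the sample mean should land near 0.67.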
The Bayesian Approach Does Not
Require Finding the Most Probable Tree
Proportion of time any tree is visited by
the MC is an approximation of the
posterior probability of tree
Instead of searching for optimal tree,
trees sampled according to posterior
probabilities
Paths of 3 chains in MCMC, showing sampling in regions
of high probability and mixing of chains
From a Sample of Trees with
Posterior Probabilities
Sample can be used to construct a
consensus tree
Posterior probabilities of each internal
node can be estimated from the sample
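Estimating node support from a sample can be sketched as simple counting. The representation below is an assumption for illustration: each sampled tree is given as the set of clades (groups of taxa) it contains, and the five-tree sample is hypothetical.

```python
from collections import Counter

def clade_support(sampled_trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter()
    for clades in sampled_trees:
        counts.update(set(clades))  # count each clade once per tree
    n = len(sampled_trees)
    return {clade: c / n for clade, c in counts.items()}

# Hypothetical MCMC sample of rooted trees on taxa A, B, C, D,
# each tree listed by its non-trivial internal clades.
AB, CD = frozenset("AB"), frozenset("CD")
AC, ABC = frozenset("AC"), frozenset("ABC")
sample = [{AB, CD}, {AB, CD}, {AB, ABC}, {AC, ABC}, {AB, CD}]

support = clade_support(sample)
# Clades found in more than half of the sample form a majority-rule consensus.
majority = {clade for clade, p in support.items() if p > 0.5}
print(support[AB], support[CD])  # 0.8 0.6
```

Real analyses read the sampled trees from program output and discard burn-in first; the counting step is the same idea.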
Summary
Phylogenies (trees) are relationships of descent
A gene tree is not the same as a species tree
Phylogenetic inference is based on either clustering
algorithms or optimality criteria
Clustering algorithms work fast
Finding optimal trees is slow
The criterion of parsimony is based on a
philosophical principle
The criterion of maximum likelihood has a firm
statistical foundation
Bayesian posterior probabilities provide a more
powerful approach to phylogenetic inference
Molecular Clocks
[Figure: Molecular clock for cytochrome c. x-axis: time since divergence (millions of years), 0 to 1200; y-axis: amino acid substitutions (per 100 residues) in cytochrome c, 0 to 60. Labeled comparisons: Yeast vs mould; Angiosperms vs animals; Insects vs vertebrates; Fish vs land vertebrates; Amphibians vs birds and mammals; Birds vs mammals; Mammals vs reptiles; Birds vs reptiles]
Zuckerkandl and Pauling 1965
Amino acids have multiple overlapping properties
(charge, bulk, etc.)
The role of a particular amino acid in a protein can
usually be filled by other amino acids
Most observed amino acid replacements are
selectively neutral
Adaptive evolution of proteins may not require many
replacements
For a given class of proteins, rates of amino acid
replacements are approximately constant
Implicit Assumptions of Molecular Clock
1) Most amino acid replacements are neutral
2) Rate of mutation same for different lineages
3) Number of mutations depends on absolute
time rather than number of generations
Neutral Theory and Molecular Clock
1) Most amino acid replacements are neutral
2) Rate of mutation same for different lineages
3) Number of mutations depends on absolute
time rather than number of generations
Evidence of a molecular clock supports the
neutral theory; exceptions are explained
away by violations of 2) or 3)
Stochastic vs. Metronomic Clocks
Molecular clocks are not expected to tick at
a constant rate (metronomic)
Stochastic clocks: ticks occur at random
times, but with constant probabilities
Purely random ticks would follow Poisson
distribution
Expect Poisson variance in number of ticks
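A quick check of the Poisson expectation is the ratio of variance to mean in substitution counts among lineages of equal age, which should be near 1 for a purely Poisson clock. The sketch below uses hypothetical counts, and the function name is an illustrative choice.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by mean; about 1 under a Poisson process."""
    return variance(counts) / mean(counts)

# Hypothetical numbers of substitutions in six lineages of equal age.
counts = [12, 9, 15, 11, 8, 14]
R = dispersion_index(counts)
print(round(R, 3))  # 0.652
```

Values well above 1 would suggest more rate variation among lineages than random ticking alone explains.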
Margoliash, Sarich and Wilson
Relative Rate test
Compare rates in lineages leading to A and B
Needs an outgroup (C)
d is estimate of difference in number of
substitutions from outgroup to each lineage
(A and B)
Margoliash, Sarich and Wilson Relative Rate test
$$V(d) = V(AC) + V(BC) - 2\,V(OC)$$
$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$$
Variance of differences for relative rate test
Depends on model of amino acid
replacement (and model parameters)
Requires:
appropriate model is known
rate variation among sites fits a known
distribution (such as the gamma distribution)
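Once distances and their variances have been estimated under some model, the test itself is a short calculation using V(d) = V(AC) + V(BC) - 2V(OC). The sketch below makes several assumptions not stated on the slides: O is taken to be the common ancestor of A and B, d divided by its standard error is treated as approximately standard normal, and all input values are hypothetical.

```python
import math

def relative_rate_z(k_ac, k_bc, v_ac, v_bc, v_oc):
    """Return d = K_AC - K_BC and its z-score.

    k_ac, k_bc: estimated distances from A and from B to outgroup C.
    v_ac, v_bc, v_oc: variances of the AC, BC and OC distance estimates.
    """
    d = k_ac - k_bc
    # The shared path from O to C contributes covariance, hence the -2 V(OC).
    var_d = v_ac + v_bc - 2.0 * v_oc
    if var_d <= 0:
        raise ValueError("variance of d must be positive")
    return d, d / math.sqrt(var_d)

d, z = relative_rate_z(k_ac=0.30, k_bc=0.24,
                       v_ac=0.0009, v_bc=0.0008, v_oc=0.0004)
print(round(d, 2), round(z, 2))  # 0.06 2.0
```

Under the normal approximation, |z| greater than about 1.96 would reject equal rates at the 5% level.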
Tajima's 1D Method
Simpler, but less powerful
Doesn't require model of sequence
evolution
Only 1 degree of freedom
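As it is usually described, Tajima's 1D test counts sites where only lineage A differs from the other two sequences (m1) and sites where only lineage B differs (m2), then compares them with a chi-square statistic on 1 degree of freedom. The sketch below assumes aligned, ungapped sequences of equal length; the sequences are hypothetical.

```python
import math

def tajima_1d(seq_a, seq_b, outgroup):
    """Return (m1, m2, chi-square, p-value) for Tajima's 1D test."""
    m1 = m2 = 0
    for a, b, c in zip(seq_a, seq_b, outgroup):
        if a != b:
            if b == c:
                m1 += 1  # change unique to lineage A
            elif a == c:
                m2 += 1  # change unique to lineage B
    if m1 + m2 == 0:
        raise ValueError("no informative sites")
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return m1, m2, chi2, p_value

m1, m2, chi2, p_value = tajima_1d(
    seq_a="CCCCCCAAAAAA",
    seq_b="AAAAAAGAAAAC",
    outgroup="AAAAAAAAAAAA",
)
print(m1, m2, chi2, round(p_value, 3))  # 6 2 2.0 0.157
```

No substitution model is needed because only counts of site patterns enter the statistic.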
Some Results from Relative Rate
Tests
Rates in mice and rats are similar
Rates in rodents higher than primates
Rates in humans lower than other
African apes and monkeys
Possible Causes of "molecular clock"
rate variation
Replication dependent factors
generation time
DNA repair efficiency
Replication independent factors
Metabolic rate (DNA damage and
synthesis)
Body size
From Ayala, 1997
Molecular Clocks are Not Constant
Rate varies among genes
Rate varies among lineages
Without Molecular Clock Assumption
Amount of sequence divergence
between taxa is product of rate and time
Branch lengths based on sequence
divergence confound rate and time
Most methods of phylogenetic inference
make no assumptions about rate (no
clock)
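The confounding of rate and time is easy to see numerically. In the sketch below, the expected divergence between two lineages is taken as 2 x rate x time (both lineages accumulate changes since the split); the rates and times are hypothetical.

```python
def pairwise_divergence(rate, time):
    """Expected substitutions per site between two lineages.

    rate: substitutions per site per million years (per lineage).
    time: millions of years since the lineages split.
    """
    return 2.0 * rate * time

fast_recent = pairwise_divergence(rate=0.02, time=5.0)
slow_old = pairwise_divergence(rate=0.005, time=20.0)
print(round(fast_recent, 3), round(slow_old, 3))  # 0.2 0.2
```

A fast, recent split and a slow, old split give the same divergence, so sequence data alone cannot separate the two without a clock assumption or calibration.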
Problems with assuming no clock
Maximum likelihood and Bayesian analyses
are free to assign any rate to a branch if it
increases likelihood
Unrealistic extremes of rate may be allowed
Branch lengths cannot be used to estimate
time
Trees are not ultrametric (terminal nodes not
equal distance from root)
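Whether a tree is ultrametric can be checked by comparing root-to-tip distances. The sketch below assumes a hypothetical tuple format, `(name, branch_length, children)`, chosen only for this example.

```python
def tip_depths(node, depth=0.0):
    """Map each terminal node name to its distance from the root."""
    name, length, children = node
    depth += length
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(tip_depths(child, depth))
    return depths

def is_ultrametric(root, tol=1e-9):
    """True if all terminal nodes are the same distance from the root."""
    depths = list(tip_depths(root).values())
    return max(depths) - min(depths) <= tol

clock_tree = ("root", 0.0, [
    ("AB", 1.0, [("A", 2.0, []), ("B", 2.0, [])]),
    ("C", 3.0, []),
])
no_clock_tree = ("root", 0.0, [
    ("AB", 1.0, [("A", 2.0, []), ("B", 5.0, [])]),
    ("C", 3.0, []),
])
print(is_ultrametric(clock_tree), is_ultrametric(no_clock_tree))  # True False
```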
Ultrametric Distances
An ultrametric tree is automatically rooted!
Relaxed Molecular Clocks:
An Alternative to Clock vs. No-Clock Dilemma
Most commonly used in Bayesian
analyses
Set limits or priors on distribution of
clock rates
In Bayesian analysis, estimate both rate
and time
Advantages of
Relaxed Clock Models
May be more likely to find correct
phylogeny (no clock is a bad prior)
Allows estimation of divergence times if
tree can be calibrated
Ultrametric tree can be rooted
without an outgroup!
Choices of Relaxed Clock Priors
Autocorrelated rates implies a large
component of rate variation is inherited
Lognormal distribution of rates implies
rates change continuously along branch
(more change for longer branches)
Exponential distribution of rates implies
rates change at nodes, independent of
branch lengths
Gotchas in Phylogenetics
Orthologous vs. Paralogous genes
Rates of sequence evolution and
phylogenetic inference
Long Branch Attraction
Mitochondrial DNA and phylogenetic
inference
The Paralogy Problem
Orthologs descended from a single
ancestral gene (i.e., via speciation)
(what we want)
Paralogs descended from separate
products of gene duplication
(what we don't want)
Many genes are duplicated (multigene
families)
Non-orthologous genes (paralogs)
Signs of Paralogous Relationships
Paralogs may differ in number and
position of introns
Differ in functionally conserved residues
(e.g. active sites of enzymes)
Placement on the phylogeny may
indicate paralogy (requires that the ortholog is
included in the analysis)
Enzyme substrate specificity
Paralogous Vicilin Genes
Problems associated with rates of
sequence evolution
Too fast or too slow
Rates uneven across lineages: long
branch attraction
Rate of Sequence Evolution is
too Fast or too Slow
Genes that evolve too slowly don't provide
enough phylogenetically informative
characters
Genes that evolve too quickly become
saturated with changes at variable positions
Slowly evolving genes are best for deep
nodes, rapidly evolving genes for shallow
nodes
Saturation of COI for decapod sequences
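Saturation can be illustrated with the Jukes-Cantor model, an assumption made here since the slides do not name a model. Under it, the expected proportion of sites that differ between two sequences is p = 0.75 (1 - exp(-4d/3)), where d is the true number of substitutions per site.

```python
import math

def jc_observed_difference(d):
    """Expected proportion of differing sites under Jukes-Cantor."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    # Observed differences level off near 0.75 as true distance grows.
    print(d, round(jc_observed_difference(d), 3))
```

Once the curve flattens, additional substitutions no longer change the observed difference, so fast genes carry little information about deep nodes.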
How to pick genes with optimal rates
No simple method
Need to consider distribution of rates across
sites as well as average rate
Some genes have wide distributions of rates,
others narrow
Two genes can have the same average rate
across sites, but very different distributions
A gene with a mix of very fast sites and very slow
sites may be saturated at the fast sites, and
uninformative at the slow sites
Townsend 2007 Measure of
phylogenetic informativeness
Based on probability of resolution of soft
polytomies
Considers distribution of rates across
sites (estimated by maximum likelihood)
PI also depends on depth of node in
tree
Profiles of PI for 4 different genes
Townsend 2007
Relative PI for 4 genes
(Townsend 2007)
PI Profiles for arginine kinase vs.
mitochondrial COI
(Mahon and Neigel 2008)
RPI for Arginine Kinase vs COI
(Mahon and Neigel, 2008)
Long Branch Attraction (LBA)
Branches with high rates of sequence
evolution are incorrectly joined
Can occur with any method (parsimony,
likelihood, Bayesian)
Often exacerbated by other factors that
violate the model of sequence evolution
How to tell if LBA is really a problem:
Just because long branches are
connected doesn't mean LBA is at fault
Use simulations to examine
specific cases
Model of sequence evolution + tree
This can be done in Mesquite
See if methods less sensitive to LBA
yield different results
III. Inadequate model of sequence
evolution
a central problem in molecular
phylogenetics
making progress, but hard to predict
when the problem will be solved
worth paying attention to this area even
if you are only interested in doing
systematics
Model assumptions that may be
Inadequate
distribution of rates across sites constant
among branches (violated by heterotachy)
stationarity: nucleotide (or amino acid)
proportions are at equilibrium, uniform across
branches
relative rates of nucleotide/amino acid
substitution same across sites
sites evolve independently
Heterogeneity and Convergence in
Composition
a non-phylogenetic signal
caused by variation in nucleotide or amino
acid composition among taxa (stationarity
violated)
sequences similar in composition will tend to
match character states by chance more than
expected
composition is not just AT/GC composition,
but frequencies of all four nucleotides on a
single strand
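A rough screen for compositional heterogeneity is to tabulate the four nucleotide frequencies per taxon and compute a chi-square statistic for homogeneity across taxa. The sketch below is one simple way to do this, not a method taken from the slides. The sequences are hypothetical, and the statistic treats sites as independent and ignores phylogeny, so it is only a first look.

```python
from collections import Counter

BASES = "ACGT"

def base_counts(seq):
    """Counts of A, C, G, T on a single strand (other symbols ignored)."""
    return Counter(b for b in seq.upper() if b in BASES)

def base_frequencies(seq):
    counts = base_counts(seq)
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}

def composition_chi2(seqs):
    """Chi-square statistic for homogeneity of base counts across taxa."""
    table = {name: base_counts(s) for name, s in seqs.items()}
    column_totals = {b: sum(t[b] for t in table.values()) for b in BASES}
    grand_total = sum(column_totals.values())
    chi2 = 0.0
    for counts in table.values():
        row_total = sum(counts[b] for b in BASES)
        for b in BASES:
            expected = row_total * column_totals[b] / grand_total
            if expected > 0:
                chi2 += (counts[b] - expected) ** 2 / expected
    return chi2

seqs = {
    "taxon1": "AATTAATTAT",
    "taxon2": "GGCCGGCCGC",
    "taxon3": "AATTATTAGC",
}
print(base_frequencies(seqs["taxon1"]))
print(round(composition_chi2(seqs), 2))
```

The statistic would be compared against a chi-square distribution with 3 x (number of taxa - 1) degrees of freedom.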
Mitochondrial genes:
exhibit strong strand-specific composition
biases
a consequence of how mitochondrial
genomes replicate
reversals in biases have occurred in some
lineages
pronounced convergence may occur among
unrelated taxa with similar biases
Some questionable findings from mtDNA-
based phylogenetic analyses of arthropods
Paraphyletic Crustacea: Malacostraca
sister to Hexapoda rather than
Branchiopoda
Paraphyletic Hexapoda: Insecta sister
to Crustacea instead of Collembola
Paraphyletic Chelicerata and Myriapoda
Myriapoda sister to Chelicerata
Hassanin et al 2005
Identified several reversals of strand-
specific bias in metazoan mitochondrial
DNA evolution
Taxa with reversed biases tend to be
joined in phylogenetic analyses, even
when unrelated
MP tree for mtDNA alignments
of six protein-coding genes
MrBayes GTR+I+G tree for
mtDNA alignments of six
protein-coding genes
How to fix the problem
Throw out taxa with reversed biases
Recode data to neutralize effect of
bias
Code purines as R and pyrimidines as Y at
degenerate codon positions for codons
of similar amino acids
GTR+I+G model for 1st and 2nd positions
I+G two-state model for 3rd positions
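RY recoding can be sketched as a simple character substitution. The example below is a simplification and an assumption: it recodes every third codon position in an in-frame sequence, whereas the scheme described above is more selective about which positions are recoded. The function name is hypothetical.

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}

def ry_recode_third_positions(seq):
    """Recode third codon positions as R/Y; assumes seq starts in frame."""
    recoded = []
    for i, base in enumerate(seq.upper()):
        if i % 3 == 2:
            # Gaps and ambiguity codes are left unchanged.
            recoded.append(RY.get(base, base))
        else:
            recoded.append(base)
    return "".join(recoded)

print(ry_recode_third_positions("ATGCCAGTT"))  # ATRCCRGTY
```

After recoding, transitions (A/G or C/T changes) at those positions become invisible, which removes much of the signal driven by compositional bias.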
MrBayes tree with neutral
transitions excluded
Some fixes that don't always work
Translate DNA sequences to amino
acid sequences
LogDet transformation calculates a
distance matrix that can recover the true
tree when stationarity is violated.
From Foster and Hickey, 1999
Phylogenies based on DNA sequences of 12
protein-coding mitochondrial genes
From Foster and Hickey, 1999
Phylogenies based on amino-acid sequences of
mitochondrial proteins
Tree with LogDet distances
