Phylogeny Lars Arvestad

Introduction to phylogeny
Lars Arvestad
arve@nada.su.se, lars.arvestad@scilifelab.se
Goals of this lecture
● Awareness of techniques for phylogeny inference
● Understand value of reliability of trees
● Interpret results
● Some basic applications

Charles Darwin, 1809 – 1882
Ernst Haeckel, 1834 – 1919

The computational problem
● Input: Aligned sequences (matrix)
● Intended output: A binary tree, representing the evolutionary history of the input sequences.
[Figure label: Carl Woese’s]
Benton and Donoghue, MBE, 2006
Our view of conifers et al
Modern view
● Gene family with interesting function
● Many genes per species
● Carefully selected set of species
Features
● One clade marked: its members share a motif (“KLEEK”)
● Numbers: Reliability
Pretty ≠ good
Quick tree at NCBI
• Bad method?
• Doubtful alignment
• Good start?
Visualizing evolution
(with Dendroscope)
Branch lengths
● Length b proportional to average number of mutations per site
● b = tλ, time × speed of evolution
● λ is the rate of evolution
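A small numeric illustration of b = tλ. The function name and all numbers are made up for this sketch and do not come from the lecture; the point is only that different combinations of time and rate give the same branch length.

```python
def branch_length(time, rate):
    """Expected number of mutations per site: b = t * lambda."""
    return time * rate

# Hypothetical numbers: 50 million years at 2e-9 mutations per site per year.
b = branch_length(50e6, 2e-9)
print(b)  # about 0.1 mutations per site

# Twice the time at half the rate gives the same branch length, so t and
# lambda cannot be told apart from b alone without extra information.
same_b = branch_length(100e6, 1e-9)
print(abs(b - same_b) < 1e-12)
```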
Molecular clock?
● Dubious assumption: Evolution has a constant rate
● Works for some data
– Pseudogenes?
– Introns?
– Reasonable in 3rd codon position
Molecular clock?
Methods for inferring evolution
● Parsimony
≈ simplicity, frugality. Which is the simplest tree?
● Distance based
Two steps: Estimate pairwise distances, then puzzle together the most reasonable tree.
● Probability based
Find the most likely tree.
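The first step of a distance-based method can be sketched as follows. This is a minimal illustration, assuming the simplest possible distance (the fraction of differing sites, often called the p-distance) and a toy alignment that is not from the lecture. The function names are hypothetical. Real tools usually apply a model-based correction, since the raw fraction underestimates the distance between diverged sequences, and the second step (building the tree from the distances) is not shown.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Fraction of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }

# Toy alignment, made up for this sketch.
alignment = {
    "s1": "ACGTACGT",
    "s2": "ACGTACCT",
    "s3": "GTGTAAGT",
}
for pair, d in distance_matrix(alignment).items():
    print(pair, round(d, 3))
```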
State of the art
Probability-based methods
● What tree should we choose?
● The most likely one
● Variant 1: Maximum likelihood (ML), given alignment D, choose T maximizing Pr(D|T).
Software: RAxML, PhyML, FastTree
● Variant 2: Markov Chain Monte Carlo (MCMC), systematically consider different trees, to estimate the distribution Pr(T|D).
Software: MrBayes, BEAST, PrIME
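The maximum likelihood idea can be shown on the smallest possible case. The sketch below is not a tree search: it assumes only two aligned sequences and the Jukes-Cantor model (an assumption of this example, not something the lecture specifies), and picks the distance d that maximizes the probability of the data. The function names and the counts are hypothetical. Programs such as RAxML optimize the tree topology and many parameters at once; this only illustrates "choose the parameter value that makes the data most probable".

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d (mutations per site) for two aligned
    sequences under the Jukes-Cantor model."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # Pr(a site differs)
    n_same = n_sites - n_diff
    return (n_diff * math.log(p_diff / 3.0)    # one specific different base
            + n_same * math.log(1.0 - p_diff)  # same base at both ends
            + n_sites * math.log(0.25))        # uniform base frequencies

def ml_distance_grid(n_sites, n_diff, step=0.0001, d_max=2.0):
    """Brute-force search: try every d on a grid and keep the most likely one."""
    candidates = [step * i for i in range(1, int(d_max / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Hypothetical data: 200 aligned sites, 30 of them differ.
n_sites, n_diff = 200, 30
d_grid = ml_distance_grid(n_sites, n_diff)

# For this simple model the maximum is also known in closed form.
d_closed = -0.75 * math.log(1.0 - 4.0 * n_diff / (3.0 * n_sites))
print(round(d_grid, 3), round(d_closed, 3))
```

The grid search and the closed-form estimate agree up to the grid resolution, which is a useful sanity check on the likelihood function.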
Properties of ML and MCMC
Advantages
● Strong connection to model
● Detailed models, e.g.

– rate variation over sites
– rate variation over edges
– positive selection
– calibrate to geological time
● Probabilities are easy to understand
Disadvantages
● Slow?
● How much do you need to compute? (MCMC in particular)
Reliability
● For “common” variables: confidence intervals, standard deviations, etc.
● For phylogenies: Study statistical support for an edge.
– Bootstrap for distance methods, ML, (parsimony)
– Probabilities for MCMC (and ML)
● The split concept: a partitioning of leaves.
● Each edge defines a split.
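The split concept can be made concrete with a short sketch. It assumes a tree written as nested tuples of leaf names (a representation chosen for this example) and lists the split defined by each internal edge. Edges leading to a single leaf give trivial splits and are skipped. The function names are hypothetical.

```python
def leaves(tree):
    """Set of leaf names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Nontrivial splits of the tree, each as a frozenset of its two leaf sets."""
    all_leaves = leaves(tree)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        below = leaves(node)          # leaves on one side of the edge above node
        other = all_leaves - below    # leaves on the other side
        # Keep only splits with at least two leaves on each side.
        if len(below) >= 2 and len(other) >= 2:
            found.add(frozenset([below, other]))
        stack.extend(node)
    return found

tree = ((("A", "B"), "C"), ("D", "E"))
for split in splits(tree):
    print(" | ".join("".join(sorted(side)) for side in split))
```

For this five-leaf tree there are two internal edges and therefore two nontrivial splits: AB against CDE, and ABC against DE.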

MCMC support values
Numbers are estimated probabilities
Bootstrap
● Idea: A reliable phylogeny can also be inferred from a subset of the data.
● Therefore: Try estimating the phylogeny from parts of the data. Which subtrees are persistent?
● Definition: A replicate of a multialignment is obtained by sampling columns with replacement.
Simple alignment:    A replicate:
ACGTACGT             GAGAATACT
AC--ACCT             -ACAA-AC-
ACG-AGGT             GAGAA-AG-
GTGTAAGT             GGGGGTGAT
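Column sampling with replacement can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that a replicate has as many columns as the original alignment, which is the usual convention. Whole columns are drawn, so the rows stay aligned with each other.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement; all rows use the same picks."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in alignment]

alignment = ["ACGTACGT", "AC--ACCT", "ACG-AGGT", "GTGTAAGT"]
rng = random.Random(1)  # fixed seed so that the run is repeatable
for row in bootstrap_replicate(alignment, rng):
    print(row)
```

Every column of the replicate is a column of the original alignment; some original columns appear several times and others not at all.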
Algorithm for bootstrap
In: Alignment L and parameter k (≥ 100)
Out: A consensus tree
1. Create k replicates L1 to Lk from L.
2. For each replicate Li, estimate phylogeny Ti.
3. For each Ti, gather its splits in the set Si.
4. Compute consensus tree for those splits in S1 to Sk that are found ≥ k/2 times.
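Steps 3 and 4 can be sketched as below, assuming that the replicate trees have already been estimated by some external program (step 2) and are given as nested tuples. The four trees are made up, k is kept tiny for readability, and all function names are hypothetical. The sketch stops at the supported splits with their frequencies; assembling them into a consensus tree is not shown.

```python
from collections import Counter

def leaves(tree):
    """Set of leaf names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Nontrivial splits of one tree (step 3)."""
    all_leaves = leaves(tree)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        below = leaves(node)
        other = all_leaves - below
        if len(below) >= 2 and len(other) >= 2:
            found.add(frozenset([below, other]))
        stack.extend(node)
    return found

def supported_splits(trees):
    """Splits found in at least k/2 of the k trees, with their frequency (step 4)."""
    k = len(trees)
    counts = Counter(s for tree in trees for s in splits(tree))
    return {s: c / k for s, c in counts.items() if c >= k / 2}

# Hypothetical replicate trees T1..T4.
replicate_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), "D"), ("C", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
for split, support in supported_splits(replicate_trees).items():
    sides = " | ".join("".join(sorted(side)) for side in split)
    print(sides, support)
```

The frequencies are the bootstrap support values that end up as numbers on the edges of the consensus tree.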
Bootstrap example
● Rule of thumb: ≥ 70 % support is reliable
● Bootstrap evaluates edges, not clades!
Rooting of phylogenies
● What is the root of the tree?
● Better: What is the start of evolution in our data?
● Or: What is the oldest point?
● Midpoint rooting: Lousy.
● Commonly accepted: Rooting by outgroup
● An outgroup is one or more sequences distant from all others.
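Rooting by outgroup amounts to placing the root on the edge that leads to the outgroup. A minimal sketch, assuming an unrooted tree stored as an adjacency list and a single outgroup leaf; the node names "x" and "y" and the function name are hypothetical.

```python
def root_by_outgroup(adjacency, outgroup):
    """Return a rooted tree (nested tuples) with the root on the outgroup's edge."""
    (neighbor,) = adjacency[outgroup]  # a leaf has exactly one neighbor

    def subtree(node, parent):
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node  # a leaf
        return tuple(subtree(child, node) for child in children)

    return (outgroup, subtree(neighbor, outgroup))

# Unrooted tree with ingroup leaves A, B, C and outgroup leaf O.
adjacency = {
    "A": ["x"], "B": ["x"], "C": ["y"], "O": ["y"],
    "x": ["A", "B", "y"],
    "y": ["C", "O", "x"],
}
rooted = root_by_outgroup(adjacency, "O")
print(rooted)  # ('O', ('C', ('A', 'B')))
```

The ingroup subtree then shows the branching order relative to the outgroup: here C branches off first, then A and B.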
Rooting of conifers et al
Animals used as outgroup to plants
About outgroups
● Outgroup must be unquestioned
● Outgroup must be a homolog!
● Very diverged outgroup is a bad outgroup
● More than one sequence in outgroup may detect problems
Side effects of outgroups
● One extra sequence can reduce the quality of the whole phylogeny
● Thus, outgroup sequences are “dangerous”
● Two outgroup sequences are better than one

Gene trees and species trees
● A reconciliation explains how one tree depends on another.
● Reconciliations decide which gene node is a speciation
Orthology and other terms
● Orthologs: Two genes descending from one speciation.
● Paralogs: Two genes descending from one duplication.
● Xenologs: Two genes related through a lateral transfer (or “horizontal transfer”): a gene has “switched species”.
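One classical way to decide which gene nodes are speciations is to map every gene node to the lowest species-tree node that covers its species (an LCA mapping) and call a node a duplication when it maps to the same place as one of its children. The sketch below assumes this approach, nested-tuple trees, and made-up gene and species names; it is not necessarily what the programs mentioned in this lecture do, and it ignores lateral transfers.

```python
def leaf_set(tree):
    """Set of leaf names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def lca(species_tree, species):
    """Lowest node of the species tree whose leaves include all given species."""
    node = species_tree
    while not isinstance(node, str):
        inside = [child for child in node if species <= leaf_set(child)]
        if not inside:
            return node
        node = inside[0]
    return node

def reconcile(gene_tree, species_tree, species_of, events=None):
    """Label internal gene nodes; returns (species node, list of (gene node, kind))."""
    if events is None:
        events = []
    if isinstance(gene_tree, str):
        return species_of[gene_tree], events
    child_maps = [reconcile(child, species_tree, species_of, events)[0]
                  for child in gene_tree]
    covered = frozenset().union(*(leaf_set(m) for m in child_maps))
    here = lca(species_tree, covered)
    # Same species node as a child: the gene was duplicated within that lineage.
    kind = "duplication" if here in child_maps else "speciation"
    events.append((gene_tree, kind))
    return here, events

# Hypothetical example: two gene copies each in human and mouse, one in chicken.
S = (("human", "mouse"), "chicken")
G = ((("h1", "m1"), ("h2", "m2")), "c1")
species_of = {"h1": "human", "h2": "human", "m1": "mouse",
              "m2": "mouse", "c1": "chicken"}
_, events = reconcile(G, S, species_of)
for node, kind in events:
    print(kind, node)
```

In this example h1 and m1 meet at a speciation node and are orthologs, while h1 and h2 meet at the duplication node and are paralogs.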
Reconciling a gene tree
1. Estimate a gene tree G
2. Retrieve a species tree S
3. Run a program for estimating the reconciliation of G to S
What could possibly go wrong?
You want to reconcile while inferring the gene tree
http://code.google.com/p/jprime/
Phylogeny quality
Final comments
● Homologs are a requirement!
● Alignments are important: Look at them!
● It is OK to remove “noise” from an alignment. Use domains if needed.
● It is good to use complementary methods
● We discussed models of evolution. They can be compared. ML is good for this.
