
Introduction to phylogeny
Lars Arvestad
arve@nada.su.se, lars.arvestad@scilifelab.se
Goals of this lecture

● Awareness of techniques for phylogeny inference

● Understand the value of assessing the reliability of trees

● Interpret results

● Some basic applications


Charles Darwin, 1809 – 1882

Ernst Haeckel, 1834 – 1919


The computational problem

● Input: Aligned sequences (matrix)

● Intended output: A binary tree, representing the evolutionary history of the input sequences.
Carl Woese's
Benton and Donoghue, MBE, 2006
Our view of conifers et al
Modern view
● Gene family with interesting function

● Many genes per species

● Carefully selected set of species

Features

● One clade marked: those share a motif (“KLEEK”)

● Numbers: Reliability
Pretty ≠ good

Quick tree at NCBI
• Bad method?
• Doubtful alignment
• Good start?
Visualizing evolution
(with Dendroscope)
Branch lengths
● Length b is proportional to the average number of mutations per site

● b = tλ, time × speed of evolution

● λ is the rate of evolution
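
The relation b = tλ can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name and the numbers (50 million years, 2e-9 mutations per site per year) are hypothetical and not taken from the lecture.

```python
def branch_length(time, rate):
    """Expected number of mutations per site along a branch: b = t * lambda."""
    return time * rate


# Hypothetical values, chosen only for illustration:
# 50 million years at 2e-9 mutations per site per year.
b = branch_length(50e6, 2e-9)

# Twice the time at half the rate gives the same branch length, so a tree
# with branch lengths alone cannot separate time from rate.
b_alt = branch_length(100e6, 1e-9)

print(b, b_alt)
```

Both calls evaluate to roughly 0.1 mutations per site. This is why dating a tree needs an extra assumption about λ, such as the molecular clock.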

Molecular clock?
● Dubious assumption: Evolution has a constant rate
● Works for some data
– Pseudogenes?
– Introns?
– Reasonable in 3rd codon position
Methods for inferring evolution
● Parsimony
≈ simplicity, greed. Which is the simplest tree?

● Distance based
Two steps: Estimate pairwise distances, then puzzle together the most reasonable tree.

● Probability based
Find the most likely tree.
State of the art
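
The first step of a distance-based method, estimating pairwise distances, can be sketched as follows. This is a minimal sketch under stated assumptions: the Jukes-Cantor correction is one common choice that the slides do not name, gapped columns are simply skipped, and the sequence names `s1` to `s4` are hypothetical.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance (an assumed model); defined only for p < 0.75."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Corrected distance for every pair of named sequences."""
    names = sorted(seqs)
    return {(s, t): jukes_cantor(p_distance(seqs[s], seqs[t]))
            for s, t in combinations(names, 2)}


# Hypothetical names for a small aligned example.
alignment = {"s1": "ACGTACGT", "s2": "AC--ACCT", "s3": "ACG-AGGT", "s4": "GTGTAAGT"}
for pair, d in distance_matrix(alignment).items():
    print(pair, round(d, 3))
```

The second step, puzzling the tree together from the matrix (for example with neighbor joining), is not shown here.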
Probability-based methods

● What tree should we choose?

● The most likely one


● Variant 1: Maximum likelihood (ML): given alignment D, choose T maximizing Pr(D|T).
Software: RAxML, PhyML, FastTree

● Variant 2: Markov Chain Monte Carlo (MCMC): systematically consider different trees, to estimate the distribution Pr(T|D).
Software: MrBayes, BEAST, PrIME
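
The two variants are linked by Bayes' theorem, which may be easier to see written out. The prior Pr(T) over trees is an assumption introduced here for the formula; the slides do not specify one. Branch lengths and model parameters are suppressed for brevity.

```latex
\hat{T}_{\mathrm{ML}} = \arg\max_{T} \Pr(D \mid T),
\qquad
\Pr(T \mid D) = \frac{\Pr(D \mid T)\,\Pr(T)}{\sum_{T'} \Pr(D \mid T')\,\Pr(T')}
```

The sum in the denominator runs over all trees, which is far too many to enumerate. MCMC sidesteps this by sampling trees in proportion to Pr(T|D) instead of computing the sum.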
Properties of ML and MCMC
Advantages
● Strong connection to model

● Detailed models, e.g.


– rate variation over sites
– rate variation over edges
– positive selection
– calibrate to geological time

● Probabilities are easy to understand

Disadvantages
● Slow?
● How much do you need to compute? (MCMC in particular)
Reliability
● For “common” variables: confidence intervals, standard deviation, etc.

● For phylogenies: Study statistical support for an edge.

– Bootstrap for distance methods, ML, (parsimony)


– Probabilities for MCMC (and ML)

● The split concept: a partitioning of leaves.

● Each edge defines a split.
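
To make the split concept concrete, here is a small sketch that lists the splits of a tree. The nested-tuple tree representation and the function names are assumptions made for this example. Trivial splits (one leaf against all others) are left out, since every tree on the same leaves has them.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the nontrivial splits of a tree, one per internal edge."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        # Skip the root (other side empty) and single-leaf sides.
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


# Hypothetical five-leaf tree.
tree = ((("A", "B"), "C"), ("D", "E"))
for split in splits(tree):
    print(" | ".join("".join(sorted(part)) for part in split))
```

A split is stored as an unordered pair of leaf sets, so the result does not depend on where the tree happens to be rooted.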


MCMC support values
Numbers are estimated probabilities
Bootstrap

● Idea: A reliable phylogeny can also be inferred from a subset of the data.

● Therefore: Try estimating the phylogeny from parts of the data. What subtrees are persistent?

● Definition: A replicate of a multialignment is obtained by sampling columns with replacement.
Simple alignment: A replicate:
ACGTACGT GAGAATACT
AC--ACCT -ACAA-AC-
ACG-AGGT GAGAA-AG-
GTGTAAGT GGGGGTGAT
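
Column sampling with replacement can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the usual convention that a replicate has as many columns as the original alignment.

```python
import random


def bootstrap_replicate(alignment, rng=random):
    """Sample alignment columns with replacement, keeping the original width."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    # Draw column indices with replacement; the same column may appear several times.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in alignment]


alignment = ["ACGTACGT", "AC--ACCT", "ACG-AGGT", "GTGTAAGT"]
for row in bootstrap_replicate(alignment, random.Random(1)):
    print(row)
```

Whole columns are sampled, so the rows stay aligned with each other: every column of the replicate is a column of the original.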
Algorithm for bootstrap

In: Alignment L and parameter k (≥ 100)


Out: A consensus tree
1. Create k replicates L1 to Lk from L.

2. For each replicate Li, estimate phylogeny Ti.

3. For each Ti, gather its splits in the set Si.

4. Compute consensus tree for those splits in S1 to Sk that are found ≥ k/2 times.
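
Steps 3 and 4 can be sketched as below, assuming the replicate trees T1 to Tk have already been estimated by some method of your choice. The nested-tuple tree format and all function names are assumptions for this example. The sketch stops at the list of supported splits and does not assemble the consensus tree itself.

```python
from collections import Counter


def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the nontrivial splits of a tree, one per internal edge."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def supported_splits(trees):
    """Return {split: frequency} for splits found in at least half of the trees."""
    k = len(trees)
    counts = Counter()
    for tree in trees:
        counts.update(splits(tree))  # step 3: the set Si for tree Ti
    return {s: n / k for s, n in counts.items() if n >= k / 2}  # step 4


# Three hypothetical replicate trees; a real run would use k >= 100.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
for split, freq in supported_splits([t1, t1, t2]).items():
    print(sorted("".join(sorted(p)) for p in split), round(freq, 2))
```

The frequencies are the bootstrap support values that get written on the edges. Two splits that each occur in strictly more than half of the trees must occur together in at least one tree, so they can always be combined into a single consensus tree. Splits at exactly k/2 do not have that guarantee.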
Bootstrap example

● Rule of thumb: ≥70 % support is reliable

● Bootstrap evaluates edges, not clades!
Rooting of phylogenies

● What is the root of the tree?

● Better: What is the start of evolution in our data?

● Or: What is the oldest point?

● Midpoint rooting: Lousy.

● Commonly accepted: Rooting by outgroup

● An outgroup is one or more sequences distant from all others.
Rooting of conifers et al
Animals used as outgroup to plants
About outgroups

● Outgroup must be unquestioned

● Outgroup must be a homolog!

● A very diverged outgroup is a bad outgroup

● More than one sequence in the outgroup may detect problems
Side effects of outgroups

● One extra sequence can reduce the quality of the whole phylogeny

● Thus, outgroup sequences are “dangerous”

● Two outgroup sequences are better than one


Genes and species trees

● A reconciliation explains how one tree depends on another.

● Reconciliations decide which gene node is a speciation
Orthology and other terms

● Orthologs: Two genes descending from one speciation.

● Paralogs: Two genes descending from one duplication.

● Xenologs: A lateral transfer (or “horizontal transfer”), a gene has “switched species”.
Reconciling a gene tree

1. Estimate a gene tree G

2. Retrieve a species tree S


3. Run program for reconciling G to S

What could possibly go wrong?

You want to reconcile while inferring the gene tree
http://code.google.com/p/jprime/
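
To illustrate what a reconciliation program computes, here is a minimal sketch of the classic LCA mapping, which labels each gene node as a speciation or a duplication. It is not the method used by the program linked above. The nested-tuple trees, the gene names, and the `species_of` lookup are all hypothetical, and lateral transfers are not handled.

```python
def clades(tree):
    """Return a list of leaf sets, one per node, with the node's own set last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    below = [clades(child) for child in tree]
    own = frozenset().union(*(c[-1] for c in below))
    return [s for c in below for s in c] + [own]


def reconcile(gene_tree, species_tree, species_of):
    """Label each internal gene node 'speciation' or 'duplication' by LCA mapping."""
    species_clades = clades(species_tree)

    def lca(species_set):
        # The smallest species clade containing all the given species.
        return min((c for c in species_clades if species_set <= c), key=len)

    events = []

    def visit(node):
        # Returns the species-tree node (as a leaf set) that this gene node maps to.
        if isinstance(node, str):
            return frozenset([species_of[node]])
        child_maps = [visit(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        # A node mapping to the same place as one of its children is a duplication.
        kind = "duplication" if here in child_maps else "speciation"
        events.append((node, kind))
        return here

    visit(gene_tree)
    return events


S = (("human", "mouse"), "chicken")
G = ((("h1", "m1"), ("h2", "m2")), "c1")
species_of = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse", "c1": "chicken"}
for node, kind in reconcile(G, S, species_of):
    print(kind, node)
```

In this example h1 and m1 meet at a speciation node, so they are orthologs. h1 and h2 meet at the duplication node, so they are paralogs. The labels are only as good as the gene tree G: an error in G can show up as a spurious duplication.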
Phylogeny quality
Final comments

● Homologs are a requirement!
● Alignments are important: Look at them!
● It is OK to remove “noise” from an alignment. Use domains if needed.
● It is good to use complementary methods
● We discussed models of evolution. They can be compared. ML is good for this.
