You are on page 1of 49

What is Molecular Phylogenetics

Phylogenetics is the study of evolutionary relationships

Example - relationship among species


crocodiles
primates
rodents birds
birds lizards
crocodiles
snakes
marsupials snakes rodents
primates
lizards marsupials
A Brief History of Molecular Phlogenetics
1900s
Immunochemical studies: cross-reactions stronger for closely related
organisms
Nuttall (1902) - apes are closest relatives to humans

1960s - 1970s
Protein sequencing methods, electrophoresis, DNA hybridization and PCR
contributed to a boom in molecular phylogeny

late 1970s to present


Discoveries using molecular phylogeny:
- Endosymbiosis - Margulis, 1978
- Divergence of phyla and kingdom - Woese, 1987
- Many Tree of Life projects completed or underway
Molecular data vs. Morphology / Physiology

 Strictly heritable entities  Can be influenced by


environmental factors
 Data is unambiguous  Ambiguous modifiers: “reduced”,
“slightly elongated”, “somewhat
flattened”
 Regular & predictable evolution  Unpredictable evolution

 Quantitative analyses  Qualitative argumentation

 Ease of homology assesment  Homology difficult to assess

 Relationship of distantly related  Only close relationships can be


organisms can be inferred confidently inferred

 Abundant and easily generated with  Problems when working with micro-
PCR and sequencing organisms and where visible
morphology is lacking
Phylogenetic concepts:
Interpreting a Phylogeny
Sequence A

Sequence B
• Physical position in tree
Sequence C is not meaningful
• Swiveling can only be
Sequence D done at the nodes
• Only tree structure
matters
Sequence E
Present
Time
Phylogenetic concepts:
Interpreting a Phylogeny
Sequence A

Sequence B
• Physical position in tree
Sequence E is not meaningful
• Swiveling can only be
Sequence D done at the nodes
• Only tree structure
matters
Sequence C
Present
Time
Tree Terminology
- Relationships are illustrated by a phylogenetic tree / dendrogram
- The branching pattern is call the tree’s topology
- Trees can be represented in several forms:

Rectangular cladogram

Slanted cladogram
Tree Terminology
- Relationships are illustrated by a phylogenetic tree / dendrogram
- The branching pattern is call the tree’s topology
- Trees can be represented in several forms:

Circular cladogram
Phenogram
Curve-O-Gram
Eurogram
Tree Terminology
Operational taxonomic units (OTU) / Taxa

Internal nodes A

C
Terminal nodes
D
Sisters
Root E

Branches
Polytomy
Tree Terminology
Rooted vs. unrooted trees
D
A B
A E
B
C
D
Root E
C
F
F

Rooted trees: Has a root that denotes common ancestry


Unrooted trees: Only specifies the degree of kinship among taxa but
not the evolutionary path
Tree Terminology
Scaled vs. unscaled trees

A
B
C
D
E
F

Scaled trees: Branch lengths are proportional to the number of


nucleotide/amino acid changes that occurred on that branch (usually a
scale is included).
Unscaled trees: Branch lengths are not proportional to the number of
nucleotide/amino acid changes (usually used to illustrate evolutionary
relationships only).
Tree Terminology
Monophyletic vs. paraphyletic

Saturnite 1 Jupiterian 32
Saturnite 2 Jupiterian 5
Saturnite 3 Jupiterian 67
Martian 1 Human 11
Martian 3 Jupiterian 8
Martian 2 Human 3

Monophyletic groups: All taxa within the group are derived from a
single common ancestor and members form a natural clade.
Paraphyletic groups: The common ancestor is shared by other taxon in
the group and members do not form a natural clade.
Methods in Phylogenetic Reconstruction

 Distance
 Maximum Parsimony
 Maximum Likelihood
Bayesian

* All algorithms are calculated using available software,


eg. PAUP, PHYLIP, McClade, Mr. Bayes etc.
Comparison of Methods
Distance Maximum Maximum likelihood
parsimony
Uses only pairwise Uses only shared Uses all data
distances derived characters

Minimizes distance Minimizes total Maximizes tree likelihood


between nearest distance given specific parameter
neighbors values
Very fast Slow Very slow
Easily trapped in local Assumptions fail Highly dependent on
optima when evolution is assumed evolution
rapid model
Good for generating Best option when Good for very small data
tentative tree, or tractable (<30 taxa, sets and for testing trees
choosing among homoplasy rare) built using other methods
multiple trees
Methods in Phylogenetic Reconstruction

Distance
• Using a sequence alignment, pairwise distances are calculated
• Creates a distance matrix
• A phylogenetic tree is calculated with clustering algorithms, using the
distance matrix.
• Examples of clustering algorithms include the Unweighted Pair Group
Method using Arithmetic averages (UPGMA) and Neighbor Joining
clustering.
A A A

B B B

C C

D
Methods in Phylogenetic Reconstruction

Maximum Parsimony
• All possible trees are determined for each position of the sequence
alignment
• Each tree is given a score based on the number of evolutionary step
needed to produce said tree
• The most parsimonious tree is the one that has the fewest evolutionary
changes for all sequences to be derived from a common ancestor
• Usually several equally parsimonious trees result from a single run.
Maximum parsimony: exhaustive stepwise addition
B C

Step 1
A

B D B D C B C
C D
Step 2
A A A

E
D E D D E
B B B
C C C

A A A …………………
Step 3
Methods in Phylogenetic Reconstruction

Maximum Likelihood
• Creates all possible trees like Maximum Parsimony method but
instead of retaining trees with shortest evolutionary steps……
• Employs a model of evolution whereby different rates of
transition/transversion ration can be used
• Each tree generated is calculated for the probability that it reflects
each position of the sequence data.
• Calculation is repeated for all nucleotide sites
• Finally, the tree with the best probability is shown as the maximum
likelihood tree - usually only a single tree remains
• It is a more realistic tree estimation because it does not assume equal
transition-transversion ratio for all branches.
How confident are we about the inferred phylogeny?

? rat
human
?
turtle
? fruit fly
? oak
duckweed
Bootstrapping
The Bootstrap

• Computational method to estimate the confidence level of a certain


phylogenetic tree.
Pseudo sample 1
001122234556667
Sample
rat GGAAGGGGCTTTTTA
0123456789 human GGTTGGGGCTTTTTA
rat GAGGCTTATC turtle GGTTGGGCCCCTTTA
human GTGGCTTATC fruitfly CCTTCCCGCCCTTTT
turtle GTGCCCTATG oak AATTCCCGCTTCCCT
fruitfly CTCGCCTTTG duckweed AATTCCCCCTTCCCC
oak ATCGCTCTTG
duckweed ATCCCTCCGG
Pseudo sample 2
445556777888899
rat CCTTTTAAATTTTCC
rat human CCTTTTAAATTTTCC
turtle CCCCCTAAATTTTGG
human
fruitfly CCCCCTTTTTTTTGG
turtle
oak CCTTTCTTTTTTTGG
fruit fly
duckweed CCTTTCCCCGGGGGG
oak
duckweed
Many more replicates
Inferred tree (between 100 - 1000)
Bootstrap values

100 rat
65 human
turtle
0
fruit fly
55 oak
duckweed

• Values are in percentages


• Conventional practice: only values 60-100% are shown
Some Discoveries Made Using
Molecular Phylogenetics
Universal Tree of Life
• Using rRNA sequences
• Able to study the
relationships of uncultivated
organisms, obtained from a
hot spring in Yellowstone
National Park

Barns et al., 1996


Endosymbiosis: Origin of the Mitochondrion
and Chloroplast

-Purple
Bacteria Other bacteria

Chloroplasts
Mitochondria

Root Cyanobacteria
Eukaryotes

Archaea

Mitochondria and chloroplasts are derived from the -purple bacteria


and the cyanobacteria respectively, via separate endosymbiotic events.
Relationships within species: HIV subtypes

Rwanda A
Ivory Coast
Italy Uganda
B U.S.
U.S. India Rwanda
U.K. C
Ethiopia
Uganda
S. Africa
Uganda
D Netherlands
Tanzania

Russia
Romania F G
Taiwan
Cameroon Brazil Netherlands
Problems and Errors in Phylogenetic Reconstruction

• Inherent strengths and weaknesses in different tree-making


methodologies (see the Holder & Lewis reading for this week).
• More is better: Errors in inferred phylogeny may be caused by small data
sets and/or limited sampling.
• Unsuitable sequences: those undergoing rapid nucleotide changes or
slow to zero changes overtime may skew phylogenetic estimations
• Mutations: Duplications, inversions, insertions, deletions etc. can give
inaccurate signals
• Genomic hotspots: small regions of rapid evolution are not easily detected

• Homoplasy: nucleotide changes that are similar but occurred independently


in separate lineages are mistakenly assumed as inherited changes

• Sample contamination / mislabeling: always a possibility when working with


large data sets
Part of sequence alignment of Mitochondrial Small Sub-Unit rRNA
Full gene is length ~950
11 Primate species with mouse as outgroup

* 20 * 40 * 60 *
Mouse : CUCACCAUCUCUUGCUAAUUCAGCCUAUAUACCGCCAUCUUCAGCAAACCCUAAAAAGG-UAUUAAAGUAAGCAAAAGA : 78
Lemur : CUCACCACUUCUUGCUAAUUCAACUUAUAUACCGCCAUCCCCAGCAAACCCUAUUAAGGCCC-CAAAGUAAGCAAAAAC : 78
Tarsier : CUUACCACCUCUUGCUAAUUCAGUCUAUAUACCGCCAUCUUCAGCAAACCCUAAUAAAGGUUUUAAAGUAAGCACAAGU : 79
SakiMonkey : CUUACCACCUCUUGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCUA-UAAUGACAGUAAAGUAAGCACAAGU : 76
Marmoset : CUCACCACGUCUAGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCCU-UAAUGAUUGUAAAGUAAGCAGAAGU : 76
Baboon : CCCACCCUCUCUUGCU----UAGUCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACGAAGUGAGCGCAAAU : 75
Gibbon : CUCACCAUCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACAAAGGCUAUAAAGUAAGCACAAAC : 75
Orangutan : CUCACCACCCCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCCACGAAGUAAGCGCAAAC : 75
Gorilla : CUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACGAAGGCCACAAAGUAAGCACAAGU : 75
PygmyChimp : CUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGU : 75
Chimp : CUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGU : 75
Human : CUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACAAAGUAAGCGCAAGU : 75
CucACC cuCUuGCu cAgccUaUAUACCGCCAUCuuCAGCAAACcCu A G aAAGUaAGC AA
Distance Matrix Methods
Part of the Jukes-Cantor Distance Matrix for the Primates example

Baboon Gibbon Orang Gorilla


PygmyCh. Chimp Human
Baboon 0.00000 0.18463 0.19997
0.18485 0.17872 0.18213 0.17651
Gibbon 0.18463 0.00000 0.13232
0.11614 0.11901 0.11368 0.11478
Orang 0.19997 0.13232 0.00000
0.09647 Mouse-Primates
0.09767 ~ 0.09974
0.3 0.09615
Gorilla 0.18485 0.11614 0.09647
Use as input to clustering methods
0.00000 0.04124 0.04669 0.04111
PygmyChimp 0.17872 0.11901 0.09767
Follow a clustering procedure:
1. Join closest 2 clusters
2. Recalculate distances between all the clusters
3. Repeat 1 and 2 until all species are connected in a single
cluster.

Initially each species is a cluster on its own. The clusters get bigger
during the process, until they are all connected to a single tree.

There are many slightly different clustering methods. The two most
important are:
 UPGMA (Unweighted Pair Group Method with Arithmetic
Mean)
 Neighbour-Joining
UPGMA - The distance between cluster X and cluster Y is
defined as the average of pairwise distances dAB where A is in
X and B is in Y. Mouse
Lemur
Tarsier
SakiMonkey
Marmoset
Baboon
Gibbon
Orangutan
Gorilla
Human
PygmyChimp
Chimp
0.1

• assumes a molecular clock


•distance = twice node height
• forces distances to be ultrametric (for any three species, the
two largest distances are equal)
• produces rooted tree (in this case root is incorrect but
topology is otherwise correct)
Additive trees – example with just three species

1
n
2
3

1, 2 and 3 are known species.


We know d12, d13, and d23.
Want to add an internal node n to the tree such that distances between
species are equal to the sum of the lengths of the branches along the path
between the species.
d1n + d2n = d12 ; d1n + d3n = d13 ; d2n + d3n = d23 ;
Solve simultaneous equations
d1n = (d12 + d13 – d23)/2;
d2n = (d12 + d23 – d13)/2;
d3n = (d13 + d23 – d12)/2;
Neighbour-Joining method

i
n n
j
k

Take two neighbouring nodes i and j and replace by a single new node n.
din + dnk = dik ; djn + dnk = djk ; din + djn = dij ;
therefore dnk = (dik + djk - dij)/2 ; applies for every k
define 1 1 but ...
ri   dik rj   d jk
N 2 k N 2 k
Let din = (dij + ri - rj)/2 ; djn = (dij + rj - ri)/2 .

Rule: choose i and j for which Dij is smallest, where Dij = dij - ri -
rj .
NJ method produces an Unrooted, Additive Tree
Additive means distance between species = distance summed along internal
branches Orangutan
Gorilla PygmyChimp
Gibbon
Chimp
Human
Marmoset
The tree has been rooted
Baboon
using the Mouse as
outgroup
SakiMonkey

Mouse
0.1 Lemur
47
Mouse Tarsier
Tarsier SakiMonkey
Lemur 100
Marmoset
100 Baboon
100 Gibbon
100 Orang
99 Gorilla
100 Human
0.1 84 PygmyChimp
100
Chimp
Bootstrapping
real sequences:
Human CAACAGAGGC TTACGACCCC TTATTTACC
Chimp .......... .C........ .........
Gorilla T......... .C..A..... .........
Orang utan T..T..G.C. CC..A..... .........
Gibbon ...T...... .CGAA...T. ..GC.....
resample columns at random:
Human C C G etc.
Chimp . . .
Gorilla . . A
Orang utan . T A
Gibbon . T A

Generate many random samples. Calculate tree for each one. These trees
will be slightly different because the input sequences were slightly different.
For each clade, the bootstrap value is the percentage of the trees for which
this group of species falls together in the tree. High bootstrap values
(>70%) indicate reliable trees. Lower percentages indicate that there is
insufficient information in the sequences to be sure about the resulting tree.
Why are trees never exact?

Ideal World :
Take a known tree. Simulate evolution along that tree according to a known
evolutionary model. Generate a set of sequences for the tips of the tree.
Use correct evolutionary model to generate distance matrix. Regenerate Tree.
You will not get exactly back to the original tree because substitution events are a
stochastic process (finite size effects).

Real World:
You don’t know the correct model of evolution.
There may be variability of rates between sites in the sequence, between species
in the tree, or between times during evolution.
Saturation of mutations will occur for widely separated species.
Evolutionary radiations may have occurred in short periods of time.
Recombination or horizontal transfer may have occurred.
Criteria for Judging Additive Trees
There are N(N-1)/2 pairwise distances between N species.
There are only 2N-3 branch lengths on an unrooted tree, therefore in general
the data do not fit exactly to an additive tree (cf. N-1 node heights in an
ultrametric tree).

NJ:
• Produces an additive tree that approximates to the data matrix, but there is
no criterion for defining the best tree.
• Algorithm for tree construction is well-defined and very rapid.

Fitch-Margoliash method:
E
d ij d ij 
tree 2

• Define measure of non-additivity i, j dij2

• Choose tree topology and branch lengths that minimise E.


• Algorithm for tree construction is not defined. Requires a slow heuristic tree
search method.
Searching Tree Space
Require a way of generating trees to be tested by Additivity criterion,
Maximum Likelihood, or Parsimony.
No. of distinct tree topologies
Nearest neighbour
interchange N Unrooted (UN) Rooted (RN)
3 1 3
4 3 15
5 15 105
6 105 945
7 945 10395
UN = (2N-5)UN-1 RN =
Subtree pruning (2N-3)RN-1
and regrafting Conclusion: there are huge numbers of trees,
even for relatively small numbers of species.
Therefore you cannot look at them all.
The Maximum Likelihood Criterion
Calculate the likelihood of observing the data on a given the tree.
Choose tree for which the likelihood is the highest.
x x

t1 t1 t2
t2

A y z
G
L( x)  PxA (t1 ) PxG (t2 ) L( x)   L( y ) Pxy (t1 ) L( z ) Pxz (t2 )
y z

Can calculate total likelihood for the site recursively.


Likelihood is a function of tree topology, branch lengths, and
parameters in the substitution rate matrix. All of these can be
optimized.
The Parsimony Criterion
Try to explain the data in the simplest possible way with the fewest arbitrary
assumptions
Used initially with morphological characters.
Suppose C and D possess a character (1) that is absent in A and B (0)

2 + 3
1
+
* + +
A(0) B(0) C(1) A(0) C(1) B(0) A(0) C(1) B(0) D(1)
D(1) D(1)
In 1, the character evolves only once (+)
In 2, the character evolves once (+) and is lost once (*)
In 3, the character evolves twice independently
The first is the simplest explanation, therefore tree 1 is to be preferred
by the parsimony criterion.
Parsimony with molecular data

E(G) A(T) C(T)


A(T)
* *
*
B(T) B(T) D(G)
D(G) E(G)
C(T)
1. Requires one mutation 2. Requires two
mutations
By parsimony, 1 is to be preferred to 2.
A(T) E(G) A(T) C(T)
*
*
B(T) D(T) B(T) D(T)
C(T) E(G)

This site is non-informative. Whatever the arrangement of species, only one


mutation is required. To be informative, a site must have at least two bases
present at least twice.
The best tree is the one that minimizes the overall number of mutations at all
sites.
Parsimony and Maximum Likelihood
• There is an efficient algorithm to calculate the parsimony score
for a given topology, therefore parsimony is faster than ML.
• Parsimony is an approximation to ML when mutations are rare
events.
• Weighted parsimony schemes can be used to treat most of the
different evolutionary models used with ML.
• Parsimony throws away information from non-informative sites
that is informative in ML and distance matrix methods.
• Parsimony gives little information about branch lengths.
• Parsimony is inconsistent in certain cases (Felsenstein zone),
and suffers badly from long branch attraction.

Personal conclusion : if you’ve got a decent computer, you’re better off with
ML.
Markov Chain Monte Carlo method: MCMC
We don’t want a single tree, we want to sample over all tree-space with a
weighting determined by the likelihood.
Sample of many high-likelihood trees – not just top of highest peak.

Run a long simulation that moves from one tree to another


Move set = parameter changes + branch length changes + topology
changes.
L(data | treei )
Bayes’ theorem: L (tree | data) 
 L(data | treek )
i

Average observable quantities over the sample of trees.


Posterior clade probabilities are similar to bootstraps, but they have a
different meaning.
Topology probabilities for the primates according to
MCMCProb. Cum. Tree topology
1. 0.316 0.316 ((L,T),O) + (G,(C,H)) = NJ
2. 0.196 0.512 ((L,T),O) + ((G,C),H)
3. 0.145 0.657 (L,(T,O)) + (G,(C,H))
4. 0.121 0.778 (L,(T,O)) + ((G,C),H)
5. 0.113 0.891 (T,(L,O)) + ((G,C),H)
6. 0.101 0.992 (T,(L,O)) + (G,(C,H))
7. 0.007 0.999 (T,(L,O)) + ((G,H),C)
8. 0.001 0.999 ((L,T),O) + ((G,H),C)
9. 0.001 1.000 (L,(T,O)) + ((G,H),C)

Only 9 topologies arose (= 3 x 3). The two uncertain points in the tree are
almost independent.
The NJ tree has maximum posterior probability even though it was not the ML
tree.
The ML tree has the second highest posterior probability.
?? Broad, Low peaks versus Narrow, High peaks.
Clade probabilities
compared
MCMC NJ Bootstrap
Clades in the top tree
(Human,Chimp) 56% 84%
(Lemur,Tarsier) 51% 46%

Clades not in the top tree


(Gorilla,Chimp) 43% 9%
(Tarsier,Other Pri.) 27% 29%
(Lemur,Other Pri.) 18% 24%
(Human,Gorilla) 1% 7%

You might also like