Professional Documents
Culture Documents
1960s - 1970s
Protein sequencing methods, electrophoresis, DNA hybridization and PCR
contributed to a boom in molecular phylogeny
Abundant and easily generated with Problems when working with micro-
PCR and sequencing organisms and where visible
morphology is lacking
Phylogenetic concepts:
Interpreting a Phylogeny
Sequence A
Sequence B
• Physical position in tree
Sequence C is not meaningful
• Swiveling can only be
Sequence D done at the nodes
• Only tree structure
matters
Sequence E
Present
Time
Phylogenetic concepts:
Interpreting a Phylogeny
Sequence A
Sequence B
• Physical position in tree
Sequence E is not meaningful
• Swiveling can only be
Sequence D done at the nodes
• Only tree structure
matters
Sequence C
Present
Time
Tree Terminology
- Relationships are illustrated by a phylogenetic tree / dendrogram
- The branching pattern is call the tree’s topology
- Trees can be represented in several forms:
Rectangular cladogram
Slanted cladogram
Tree Terminology
- Relationships are illustrated by a phylogenetic tree / dendrogram
- The branching pattern is call the tree’s topology
- Trees can be represented in several forms:
Circular cladogram
Phenogram
Curve-O-Gram
Eurogram
Tree Terminology
Operational taxonomic units (OTU) / Taxa
Internal nodes A
C
Terminal nodes
D
Sisters
Root E
Branches
Polytomy
Tree Terminology
Rooted vs. unrooted trees
D
A B
A E
B
C
D
Root E
C
F
F
A
B
C
D
E
F
Saturnite 1 Jupiterian 32
Saturnite 2 Jupiterian 5
Saturnite 3 Jupiterian 67
Martian 1 Human 11
Martian 3 Jupiterian 8
Martian 2 Human 3
Monophyletic groups: All taxa within the group are derived from a
single common ancestor and members form a natural clade.
Paraphyletic groups: The common ancestor is shared by other taxon in
the group and members do not form a natural clade.
Methods in Phylogenetic Reconstruction
Distance
Maximum Parsimony
Maximum Likelihood
Bayesian
Distance
• Using a sequence alignment, pairwise distances are calculated
• Creates a distance matrix
• A phylogenetic tree is calculated with clustering algorithms, using the
distance matrix.
• Examples of clustering algorithms include the Unweighted Pair Group
Method using Arithmetic averages (UPGMA) and Neighbor Joining
clustering.
A A A
B B B
C C
D
Methods in Phylogenetic Reconstruction
Maximum Parsimony
• All possible trees are determined for each position of the sequence
alignment
• Each tree is given a score based on the number of evolutionary step
needed to produce said tree
• The most parsimonious tree is the one that has the fewest evolutionary
changes for all sequences to be derived from a common ancestor
• Usually several equally parsimonious trees result from a single run.
Maximum parsimony: exhaustive stepwise addition
B C
Step 1
A
B D B D C B C
C D
Step 2
A A A
E
D E D D E
B B B
C C C
A A A …………………
Step 3
Methods in Phylogenetic Reconstruction
Maximum Likelihood
• Creates all possible trees like Maximum Parsimony method but
instead of retaining trees with shortest evolutionary steps……
• Employs a model of evolution whereby different rates of
transition/transversion ration can be used
• Each tree generated is calculated for the probability that it reflects
each position of the sequence data.
• Calculation is repeated for all nucleotide sites
• Finally, the tree with the best probability is shown as the maximum
likelihood tree - usually only a single tree remains
• It is a more realistic tree estimation because it does not assume equal
transition-transversion ratio for all branches.
How confident are we about the inferred phylogeny?
? rat
human
?
turtle
? fruit fly
? oak
duckweed
Bootstrapping
The Bootstrap
100 rat
65 human
turtle
0
fruit fly
55 oak
duckweed
-Purple
Bacteria Other bacteria
Chloroplasts
Mitochondria
Root Cyanobacteria
Eukaryotes
Archaea
Rwanda A
Ivory Coast
Italy Uganda
B U.S.
U.S. India Rwanda
U.K. C
Ethiopia
Uganda
S. Africa
Uganda
D Netherlands
Tanzania
Russia
Romania F G
Taiwan
Cameroon Brazil Netherlands
Problems and Errors in Phylogenetic Reconstruction
* 20 * 40 * 60 *
Mouse : CUCACCAUCUCUUGCUAAUUCAGCCUAUAUACCGCCAUCUUCAGCAAACCCUAAAAAGG-UAUUAAAGUAAGCAAAAGA : 78
Lemur : CUCACCACUUCUUGCUAAUUCAACUUAUAUACCGCCAUCCCCAGCAAACCCUAUUAAGGCCC-CAAAGUAAGCAAAAAC : 78
Tarsier : CUUACCACCUCUUGCUAAUUCAGUCUAUAUACCGCCAUCUUCAGCAAACCCUAAUAAAGGUUUUAAAGUAAGCACAAGU : 79
SakiMonkey : CUUACCACCUCUUGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCUA-UAAUGACAGUAAAGUAAGCACAAGU : 76
Marmoset : CUCACCACGUCUAGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCCU-UAAUGAUUGUAAAGUAAGCAGAAGU : 76
Baboon : CCCACCCUCUCUUGCU----UAGUCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACGAAGUGAGCGCAAAU : 75
Gibbon : CUCACCAUCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACAAAGGCUAUAAAGUAAGCACAAAC : 75
Orangutan : CUCACCACCCCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCCACGAAGUAAGCGCAAAC : 75
Gorilla : CUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACGAAGGCCACAAAGUAAGCACAAGU : 75
PygmyChimp : CUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGU : 75
Chimp : CUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGU : 75
Human : CUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACAAAGUAAGCGCAAGU : 75
CucACC cuCUuGCu cAgccUaUAUACCGCCAUCuuCAGCAAACcCu A G aAAGUaAGC AA
Distance Matrix Methods
Part of the Jukes-Cantor Distance Matrix for the Primates example
Initially each species is a cluster on its own. The clusters get bigger
during the process, until they are all connected to a single tree.
There are many slightly different clustering methods. The two most
important are:
UPGMA (Unweighted Pair Group Method with Arithmetic
Mean)
Neighbour-Joining
UPGMA - The distance between cluster X and cluster Y is
defined as the average of pairwise distances dAB where A is in
X and B is in Y. Mouse
Lemur
Tarsier
SakiMonkey
Marmoset
Baboon
Gibbon
Orangutan
Gorilla
Human
PygmyChimp
Chimp
0.1
1
n
2
3
i
n n
j
k
Take two neighbouring nodes i and j and replace by a single new node n.
din + dnk = dik ; djn + dnk = djk ; din + djn = dij ;
therefore dnk = (dik + djk - dij)/2 ; applies for every k
define 1 1 but ...
ri dik rj d jk
N 2 k N 2 k
Let din = (dij + ri - rj)/2 ; djn = (dij + rj - ri)/2 .
Rule: choose i and j for which Dij is smallest, where Dij = dij - ri -
rj .
NJ method produces an Unrooted, Additive Tree
Additive means distance between species = distance summed along internal
branches Orangutan
Gorilla PygmyChimp
Gibbon
Chimp
Human
Marmoset
The tree has been rooted
Baboon
using the Mouse as
outgroup
SakiMonkey
Mouse
0.1 Lemur
47
Mouse Tarsier
Tarsier SakiMonkey
Lemur 100
Marmoset
100 Baboon
100 Gibbon
100 Orang
99 Gorilla
100 Human
0.1 84 PygmyChimp
100
Chimp
Bootstrapping
real sequences:
Human CAACAGAGGC TTACGACCCC TTATTTACC
Chimp .......... .C........ .........
Gorilla T......... .C..A..... .........
Orang utan T..T..G.C. CC..A..... .........
Gibbon ...T...... .CGAA...T. ..GC.....
resample columns at random:
Human C C G etc.
Chimp . . .
Gorilla . . A
Orang utan . T A
Gibbon . T A
Generate many random samples. Calculate tree for each one. These trees
will be slightly different because the input sequences were slightly different.
For each clade, the bootstrap value is the percentage of the trees for which
this group of species falls together in the tree. High bootstrap values
(>70%) indicate reliable trees. Lower percentages indicate that there is
insufficient information in the sequences to be sure about the resulting tree.
Why are trees never exact?
Ideal World :
Take a known tree. Simulate evolution along that tree according to a known
evolutionary model. Generate a set of sequences for the tips of the tree.
Use correct evolutionary model to generate distance matrix. Regenerate Tree.
You will not get exactly back to the original tree because substitution events are a
stochastic process (finite size effects).
Real World:
You don’t know the correct model of evolution.
There may be variability of rates between sites in the sequence, between species
in the tree, or between times during evolution.
Saturation of mutations will occur for widely separated species.
Evolutionary radiations may have occurred in short periods of time.
Recombination or horizontal transfer may have occurred.
Criteria for Judging Additive Trees
There are N(N-1)/2 pairwise distances between N species.
There are only 2N-3 branch lengths on an unrooted tree, therefore in general
the data do not fit exactly to an additive tree (cf. N-1 node heights in an
ultrametric tree).
NJ:
• Produces an additive tree that approximates to the data matrix, but there is
no criterion for defining the best tree.
• Algorithm for tree construction is well-defined and very rapid.
Fitch-Margoliash method:
E
d ij d ij
tree 2
t1 t1 t2
t2
A y z
G
L( x) PxA (t1 ) PxG (t2 ) L( x) L( y ) Pxy (t1 ) L( z ) Pxz (t2 )
y z
2 + 3
1
+
* + +
A(0) B(0) C(1) A(0) C(1) B(0) A(0) C(1) B(0) D(1)
D(1) D(1)
In 1, the character evolves only once (+)
In 2, the character evolves once (+) and is lost once (*)
In 3, the character evolves twice independently
The first is the simplest explanation, therefore tree 1 is to be preferred
by the parsimony criterion.
Parsimony with molecular data
Personal conclusion : if you’ve got a decent computer, you’re better off with
ML.
Markov Chain Monte Carlo method: MCMC
We don’t want a single tree, we want to sample over all tree-space with a
weighting determined by the likelihood.
Sample of many high-likelihood trees – not just top of highest peak.
Only 9 topologies arose (= 3 x 3). The two uncertain points in the tree are
almost independent.
The NJ tree has maximum posterior probability even though it was not the ML
tree.
The ML tree has the second highest posterior probability.
?? Broad, Low peaks versus Narrow, High peaks.
Clade probabilities
compared
MCMC NJ Bootstrap
Clades in the top tree
(Human,Chimp) 56% 84%
(Lemur,Tarsier) 51% 46%