Lecture 8. Phylogenetic Tree Reconstruction: The Chinese University of Hong Kong CSCI3220 Algorithms For Bioinformatics

Lecture 8.
Phylogenetic Tree
Reconstruction
The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics
Lecture outline
1. Phylogenetic tree reconstruction
– Problem definition
2. Sequence-based methods
– Maximum parsimony
– Maximum likelihood
3. Distance-based methods
– UPGMA
– Neighbor-joining
Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 2
Part 1
PHYLOGENETIC TREE
RECONSTRUCTION
Classification of species
Domains and
kingdoms
The animal kingdom

Image credit: Wikipedia, http://ridge.icu.ac.jp/gen-ed/classif-gifs/animal-class-example.gif
Taxonomy
Image credit: Wikipedia, http://www2.estrellamountain.edu/faculty/farabee/biobk/BioBookDivers_class.html
Relating biological objects
• How were the hierarchies
determined?
– Species: traditionally by
morphological and
behavioral similarities, or
paleontological evidences
– Bacterial strains: by
physical, chemical and
biological properties
• Question: Which features
should be used first?
Image credit: http://www2.estrellamountain.edu/faculty/farabee/biobk/BioBookDivers_class.html
Phylogeny
• A systematic and objective way to construct
these trees is by comparing DNA/protein
sequences
• In this lecture, we study trees that relate
objects that are sufficiently different
– Different species
– Different strains/populations of a species
• Our goal is to reconstruct the actual
evolutionary relationships based on observable
sequences
Assumptions
• Basic assumptions behind phylogenetic trees:
1. The current sequences share a common ancestor
2. All were mutated from the common ancestor
3. Mutations are rare. Therefore if the DNA of A and B are
more similar than both A and C as well as B and C, likely
C was separated from A and B before their separation
Common ancestor of A, B and C
Common ancestor of A and B

Time
A B C
Terminology
• A tree is an acyclic graph with
nodes connected by edges
Branch length
• A phylogenetic tree is a binary
tree with sequences (nodes)
connected by branches (edges)
– Leaf nodes are the observed
Root
sequences A leaf
An internal node
– Internal nodes are the node
unobserved ancestral
sequences A branch
– Branch lengths may represent
evolutionary distances
Image credit: Hershberg et al., Genome Biology 8:R164 (2007) , http://www.jdrf.ca/
Rooted and unrooted trees
• Sometimes it is not very clear where the common
ancestor should be put
– We can have a tree without root – an unrooted tree
Image credit: Lowery et al., Oncogene 24(2):248-259, (2005)
Phylogenetic tree reconstruction
• General problem:
– Given a set of DNA/protein sequences
– Find a phylogenetic tree such that it likely corresponds
to the actual historical evolutionary events, involving:
• Order of separation events (how the nodes are connected)
• Ancestral sequences (what sequences the internal nodes have)
• Branch lengths (how much time it has been since the
separation)
There are various ways to evaluate how likely a tree is correct.
We will study them in this lecture
– “Re”-construction: The tree was defined by history. We
only try to reconstruct it from the observed sequences
What sequence to use?
• If we are studying a gene
– DNA/protein sequence of the gene
• If we want to know the relationship between
different species
– Whole genome (may not be feasible)
– Some genes that are essential and single-copied
• Ribosomal RNA
Complexity of problem
• Finding the “best” tree is a hard problem
• How many tree topologies (i.e., ignore branch lengths and
left-right order) are there for a set of k sequences?
• For rooted trees: k Num. of rooted tree topologies
2 1
3 3
– k=2: 1 possible tree topology 4 15
– k=3: 3 possible branches to add #3 5

6
105
945
– k=4: 5 possible branches to add #4, and so on 7

8
10,395
135,135
– Therefore number of tree topologies is 9 2,027,025

10 34,459,425
1  3  5  ...  (2k-3) 11 654,729,075
12 13,749,310,575
• Exponential 13 316,234,143,225
14 7,905,853,580,625
15 213,458,046,676,875
16 6,190,283,353,629,370
17 191,898,783,962,511,000
18 6,332,659,870,762,850,000
19 221,643,095,476,700,000,000
1 2 1 2 3 1 3 2 1 3 2 20 8,200,794,532,637,890,000,000
Complexity of problem k Num. of rooted tree topologies Num. of unrooted tree topologies
2 1 1
• Similarly, for unrooted trees, 3

4 15
3 1
3
5 105 15
– k=2: 1 possible tree topology 6 945 105
– k=3: 1 possible branch to add #3 7

8
10,395
135,135
945
10,395
– k=4: 3 possible branches to add #4 9

10
2,027,025
34,459,425
135,135
2,027,025
– k=5: 5 possible branches to add #5 11 654,729,075 34,459,425
12 13,749,310,575 654,729,075
– There number of tree topologies is 13 316,234,143,225 13,749,310,575
14 7,905,853,580,625 316,234,143,225
1  3  5  ...  (2k-5) 15 213,458,046,676,875 7,905,853,580,625
16 6,190,283,353,629,370 213,458,046,676,875
17 191,898,783,962,511,000 6,190,283,353,629,370
18 6,332,659,870,762,850,000 191,898,783,962,511,000
19 221,643,095,476,700,000,000 6,332,659,870,762,850,000
20 8,200,794,532,637,890,000,000 221,643,095,476,700,000,000
1 1 1 4 1 1 4
3 3 3 3
2 2 2 2 4 2
Solving the problem: Ideas
• What do you do when you encounter a
computationally hard problem?
– Define an easier version of the problem
• By making certain assumptions
– Design smart algorithms/data structures to avoid
redundant calculations
– Use heuristics to solve it, not necessarily getting
the optimal solution
Phylogenetic tree reconstruction methods
• Two main types of methods:
– Sequence-based: need the sequences
• Parsimony methods (easier problems, smart algorithms)
• Probabilistic methods (easier problems, smart algorithms)
– Maximum likelihood
– Bayesian
• ...
– Distance-based: only depends on the distances between
the sequences
• UPGMA (heuristics)
• Neighbor joining (heuristics)
• ...
– We will study some of these algorithms
The Newick format
• Use brackets and comma to group two sub-trees
• Use colon to indicate distance to parent, if available
• End with a semicolon
Graphical representation: Newick:
1.00 ((((Homo:0.21,Pongo:0.21):0.28,
Galago
Macaca:0.49):0.13,Ateles:0.62):
0.62 0.38,Galago:1.00);
Ateles
0.49 Remarks:
Macaca • For an unrooted tree, one simple way to
0.38
represent it using the Newick format is to
0.21 root the tree arbitrarily
0.13 Pongo
• You can name internal nodes by giving the
0.28 label after the close bracket (e.g.,
Homo
0.21 (Homo:0.21, Pongo:0.21)HP:0.28
Image credit: http://www.zoology.ubc.ca/~schluter/zoo502stats/Rtips.phylogeny.html
Part 2a
SEQUENCE-BASED METHODS:
MAXIMUM PARSIMONY
Maximum parsimony
• Assumption: A tree is likely to be true if it involves few mutations
• Rationale:
– Mutations are rare
– “Occam’s razor”: The simplest explanation is likely the correct one
• “Large parsimony” problem:
– Given a set of sequences
– Find a rooted tree topology of the sequences and the ancestral sequences of
the tree
– Such that the total number of mutations along the branches is minimized
– NP hard: Currently no polynomial time algorithm is known
• “Small parsimony” problem:
– Given a set of sequences and a rooted tree topology of the sequences
– Find the ancestral sequences
– Such that the total number of mutations along the branches is minimized
• We will focus on the small parsimony problem
Small parsimony example
• We will consider one single site G
– By assuming that sites are independent, we only GC G
need an algorithm for one site GA
– Will show an example with more sites later C G
CA GT
• In the upper tree on the right, the number of A C G T A
mutations is 4
– Is it the minimum (i.e., most parsimonious solution)?
– For this tree topology, the minimum number of mutations
is 3. There are three sets of ancestral states that result in
this number of mutations, shown in the three trees below
A A A
A A A
AG AT
A G A T A A
AC GT AC TG AC AG AT
A C G T A A C G T A A C G T A
Small parsimony problem
• How to assign ancestral states so that the total
number of mutations is minimized?
• Ideas: For a given node,
– If both children have the same state, probably
good to adopt the state
– If the two children have different states, probably
good to adopt one of them
– Delay the decision of the exact choice until the
parent has also expressed a preference
The algorithm: simple version
• Fitch’s algorithm: If you only need some solutions p
– For each internal node i with parent p and children l and r, we will i
determine its preference set Si and its final character Ci that
would minimize the total number of mutations r
l
– Steps:
1. For each leaf node i, set Si to the character of the sequence
2. Upward phase: For each internal node i,
if (Sl  Sr)={} // l and r do not agree: take both sets
Si := Sl  Sr
else // l and r agree on something: take
the agreed part
Si := Sl  Sr
3. Downward phase: First pick any Croot from Sroot. Then for each other
internal node i,
if Cp  Si // p agrees with i on something: take it
Ci := Cp
else // p disagrees with i: use i’s own
preferences
An example Preference set
Final character chosen
A
A,G,T
Upward phase
A,C G,T
A C G T A A C G T A
Downward phase
(2 choices)
A A
A
A,G,T
OR A
A
A,C G
G,T A T
A C G T A A C G T A
Why does it work?
• Proof by induction
– When there are two leaves, there are only two
cases:
• They have the same character
– Actual minimum number of mutations: 0
– The algorithm gives the same number A
• They have different characters A A A A
– Actual minimum number of mutations: 1

– The algorithm also gives the same number A
A C A C
Therefore the algorithm is optimal
C
A C
Why does it work?
• Assume the algorithm is able to minimize
the number of mutations for trees with k or
fewer leaves
• Now for a tree with k+1 leaves, root
– It consists of a root connected to two sub-trees
with roots l and r, both with k or fewer leaves l r
– Two cases:
... ... ... ...
• If Sl  Sr  {}, the algorithm gives a solution with ml
Minimum Minimum
+ mr mutations, which is optimal due to the
number of number of
induction hypothesis mutations: ml mutations: mr
• If Sl  Sr = {}, the algorithm gives a solution with ml
+ mr + 1 mutations, which is also optimal since one
extra mutation must be introduced between the
root and one of its children
The algorithm: extended version
• If you need all solutions p
– Steps: i
1. For each leaf node i, set Si to the character of the sequence
2. Upward phase (same as before): For each internal node i, l r
if (Sl  Sr)={} // l and r do not agree: take both sets
Si := Sl  Sr
else // l and r agree on something: take it
Si := Sl  Sr
3. Downward phase: First pick Croot from Sroot. Then for each other
internal node i (different strategy -- majority vote): we will choose Ci
from the characters that exist in the largest number of sets among
{Cp}, Sl and Sr. Also, whenever there are multiple choices, we choose
each in turn to enumerate all optimal solutions.
– We can prove that this algorithm gives all optimal solutions
– A special case of Sankoff’s dynamic programming algorithm
Revisiting the same example
A
A,G,T
Upward phase
A,C G,T
A C G T A A C G T A
Downward phase
(3 choices)
Found by
Algorithm 2 but
not Algorithm 1
A A A
A OR A OR A
A G A T A A
A C G T A A C G T A A C G T A
A more complex example
A,C,G
G
Upward phase A,G
A,C A
A C A A G G A C A A G G
Downward phase
(6 choices)
A
A,C,G A
A,C,G C
A,C,G
G
A G G
A
A,G G
A,G G
A,G
A
A,C A A
A,C A C
A,C A
A C A A G G A C A A G G A C A A G G
G
A,C,G G
A,C,G G
A,C,G
G G G
G
A,G G
A,G G
A,G
A
A,C A C
A,C A G
A,C A
A C A A G G A C A A G G A C A A G G
Multiple sites
• In a real situation, we need to deal with
sequences that contain more than one site
• We simply apply the above algorithm to the
different sites independently
– As we assume that different sites mutate
independently
Example
[G]
[C,T]
[A,G]
Upward phase
[C]
AC GC GT AC GC GT
Downward phase
[G] [G]
GC
[C,T] GT
[C,T]
[A,G] [A,G]
GC
[C] OR GC
[C]
AC GC GT AC GC GT
• Minimum: 1 substitution for position 1, 1 substitution for position 2

• Maximum parsimony: 2 trees that can achieve this minimum
From small parsimony to large parsimony
• Some heuristic methods try different tree
topologies. For each one, solve the small
parsimony problem. Then compare and find
the best.
– A standard “searching problem” in AI
– Need ways to:
• Determine the first trees
• Determine the next trees based on the current tree
• Avoid trapping in a local optimum
– There are many methods proposed for these tasks
Part 2b
SEQUENCE-BASED METHODS:
MAXIMUM LIKELIHOOD
Evolutionary distance
• Suppose we have an alignment of two sequences. TTAGG
At a site, one sequence has an A and one has a C. TTCGG
– Assume that the sequences have a common ancestor TT?GG
– What did the common ancestor have at that site?
– We don’t know. TTAGG TTCGG
– Let’s say A. How many mutations have happened?
TTAGG
• Could be one (A  C)
• Could be more (A  G  C, A  T  C, A  C  A, etc.)
TTAGG TTCGG
• We want a way to define the “evolutionary
distance” between two observed sequences TTAGG
– According to the number of mutations happened or
the time since their divergence TTAGG TTGGG
– We need to first define a mutation model
TTCGG
Technical assumptions
• These assumptions are usually not true, but they
will greatly simplify the problem:
– Sites are independent
– Mutation rates are the same for different sites and at
different time in the history
– Given current state, future states do not depend on
past states
• Markov property!
• More complex models that require fewer strong
assumptions exist. We only study the simple
models here
The Jukes-Cantor model
• Proposed by Jukes and Cantor in 1969
• Equal rate of substitution, , to the other three bases
in one unit of time
– Assume there is at most one mutation within one unit of
time – We can always make the unit smaller to ensure this
1-3  1-3
A  G Notes: In this lecture,
1. We do not consider indels

2. A “substitution” is a point

    mutation actually
happened, while a

mismatch in an alignment
 could be caused by one or
C  T more substitutions
1-3  1-3
Illustration of the Jukes-Cantor model
• Suppose at time 0, site 1 of a sequence was A
• At time 1:
– There is a probability of 1-3 that the site was A
– There is a probability of  that the site was C
– There is a probability of  that the site was G
– There is a probability of  that the site was T
• At time 2, what is the probability that the site is A?
– Two possibilities:
1. At time 1, the site was A, and there was no mutation from time 1 to time 2
[probability: (1-3)2]
2. At time 1, the site was C, G or T, and there was a mutation to A from time 1
to time 2 [probability: 32]
– Therefore, the total probability that the site is A at time 2 is (1-3)2 +
32
Recursive formulas
• Denote PXY(t) as the probability that for a base that
was X at time 0, it is Y at time t, for any X and Y
– Here “site”, “base”, “nucleotide” all mean the same thing
– PAA(1) = 1 - 3
– PAA(2) = (1 - 3)2 + 32
– In general,
PAA(t+1) = (1 - 3)PAA(t) + [1 - PAA(t)]
– Similarly:
• PXX(t+1) = (1 - 3)PXX(t) + [1 - PXX(t)] for any X
• PXY(t+1) = [1 - PXX(t+1)] / 3 for any X and
Y
Solving the equation
• (Suggested by former student Cao Jianquan when he
took BMEG3102)
– PAA(t) = (1 - 3)PAA(t-1) + [1 - PAA(t-1)]
= (1 - 4)PAA(t-1) + 
 PAA(t) - 1/4 = (1 - 4)PAA(t-1) +  - 1/4
= (1 - 4)PAA(t-1) - 1/4 (1 - 4)
= (1 - 4)(PAA(t-1) - 1/4) Key: The underlined
= (1 - 4) (PAA(t-2) - 1/4)
2
parts have exactly
= ... the same form
= (1 - 4)t-1(PAA(1) - 1/4)
= (1 - 4)t-1(1 - 3 - 1/4)
= (1 - 4)t-1(3/4 - 3)
= 3/4 (1 - 4)t
 PAA(t) = 3/4 (1 - 4)t + 1/4
Maximum likelihood
• Likelihood: Probability of producing the
observed data by a model given the model
parameters, Pr(X|)
– X: Observed data
• The input sequences, assumed aligned
• Again, we consider one single site here. The likelihood for
the whole sequences is the product of the likelihood of
individual sites since they are assumed independent
– : Model parameters (see next page)
• Maximum likelihood: Find value of  such that
Pr(X|) is maximized
Model parameters
• There are different possibilities
– In all cases, X is the input sequences
• Big likelihood problem
– : tree topology, mutation rates and divergence times
– Very difficult
• Small likelihood problem
– Tree topology is given
– : mutation rates and divergence times
– There are effective heuristic solutions that usually
(but not always) produce optimal results
Computing likelihood
• Suppose we are given the followings, as
shown in the figure: g:G
– Tree topology
teg tfg
– Observed data, X = {a:G, b:G, c:T, d:G}
– Ancestral sequences e:G f:G
– Parameters,  = {<mutation rates>, tae, tbe, tcf, tdf,
tae tbe tcf tdf
teg, tfg}
• Likelihood =Pr(g:G) a:G b:G c:T d:G
Pr(e:G|g:G, teg) Pr(f:G|g:G, tfg)
Pr(a:G|e:G, tae) Pr(b:G|e:G, tbe) Node labels
Pr(c:T|f:G, tcf) Pr(d:G|f:G, tdf) Observed sequences
:Ancestral sequences
– We have learned how to compute these Divergence times
conditional probabilities for the Jukes-Cantor
model
– Noticed the Markov property used here?
Computing likelihood
• In the small likelihood problem, we are only given the tree
topology, but not the ancestral sequences – Then how to
compute likelihood? g:?
• Need to try them all (summation of 43 = 64 terms): Likelihood =
Pr(g:A)
teg tfg
Pr(e:A|g:A, teg) Pr(f:A|g:A, tfg)
Pr(a:G|e:A, tae) Pr(b:G|e:A, tbe) e:? f:?
Pr(c:T|f:A, tcf)Pr(d:G|f:A, tdf)
+ tae tbe tcf tdf
Pr(g:C)
Pr(e:A|g:C, teg) Pr(f:A|g:C, tfg) a:G b:G c:T d:G
Pr(a:G|e:A, tae) Pr(b:G|e:A, tbe)
Pr(c:T|f:A, tcf)Pr(d:G|f:A, tdf)
+ Possible ancestral states:
...
e A A A A A T
+
Pr(g:T) f A A A A C .. T
Pr(e:T|g:T, teg) Pr(f:T|g:T, tfg) .
Pr(a:G|e:T, tae) Pr(b:G|e:T, tbe)
Pr(c:T|f:T, tcf)Pr(d:G|f:T, tdf) g A C G T A T
Computational efficiency
• How much computation is needed?
• For our example:
– 3 internal node  43 = 64 possible sets of ancestral states
– For each set of ancestral states, we need to multiply 7 terms
(because there are 7 nodes in the tree)
• In general:
– If there are n input sequences, there are n-1 internal nodes  4n-1
possible sets of ancestral states
– For each set of ancestral states, we need to multiply n+n-1 = 2n-1
terms
• Impractical to perform this exponential number of operations
– You must know how to deal with this situation by now –
dynamic programming!
Computing likelihood efficiently
• An important observation: once the root
of a sub-tree is determined, the likelihood
of this sub-tree does not depend on other
nodes in the whole tree
• E.g., once node e is decided to take
character A, the likelihood of the sub-tree
involving nodes a, b and e is g:?
Pr(e:A|g, teg) teg tfg
Pr(a:G|e:A, tae)Pr(b:G|e:A, tbe) e:A f:?

– If the character at node g does not change,
tae tbe tcf tdf
the value of the above expression will not
change no matter what character node f a:G b:G c:T d:G
takes.
– Therefore this value can be re-used
Computing likelihood efficiently
• Define table V, where entry V(i,x) is the likelihood of the sub-tree rooted at i
when the parent of i takes character x
– Likelihood =
Pr(g:A) V(e,A) V(f,A) +
Pr(g:C) V(e,C) V(f,C) + g:?
Pr(g:G) V(e,G) V(f,G) +
teg tfg
Pr(g:T) V(e,T) V(f,T)
– V(e, A) =
e:? f:?
Pr(e:A|g:A,teg) V(a,A) V(b,A) +
Pr(e:C|g:A,teg) V(a,C) V(b,C) + tae tbe tcf tdf
Pr(e:G|g:A,teg) V(a,G) V(b,G) +
Pr(e:T|g:A,teg) V(a,T) V(b,T) a:G b:G c:T d:G
– V(a,A) = Pr(a:G|e:A, tae)
– V(a,C) = Pr(a:G|e:C, tae)
– ...
• Table V contains O(n) entries. Computing the value for each entry requires a
constant number of operations  Linear time overall
Example e:?
tde tce
• Let’s assume:
– All four bases are equally likely at the root
d:?
– The Jukes-Cantor mutation model is correct
– Mutation rate per unit time, =0.1 tad tbd
– Each branch in the tree represents one unit of time
• V(i,x) is the likelihood of the sub-tree rooted at i when the a:A b:C c:G
parent of i takes character x
V(i, x) x=A x=C x=G x=T
i=a 0.7 0.1 0.1 0.1
i=b 0.1 0.7 0.1 0.1
i=c 0.1 0.1 0.7 0.1
i=d 0.7(0.7)(0.1) + 0.1(0.7)(0.1) + 0.1(0.7)(0.1) + 0.1(0.7)(0.1) +
0.1(0.1)(0.7) + 0.7(0.1)(0.7) + 0.1(0.1)(0.7) + 0.1(0.1)(0.7) +
0.1(0.1)(0.1) + 0.1(0.1)(0.1) + 0.7(0.1)(0.1) + 0.1(0.1)(0.1) +
0.1(0.1)(0.1) 0.1(0.1)(0.1) 0.1(0.1)(0.1) 0.7(0.1)(0.1)
= 0.058 = 0.058 = 0.022 = 0.022
• Overall likelihood: 0.25(0.058)(0.1) + 0.25(0.058)(0.1) +

0.25(0.022)(0.7) + 0.25(0.022)(0.1) = 0.0073
Solving the small likelihood problem
• Then how to find the optimal parameter values?
– Start with a random estimate of 
– Apply a “hill climbing” algorithm
• Change the value of a parameter so that the likelihood is increased
• Repeat it for each parameter in turn, for multiple iterations
• Will reach maximum if there is a single “peak” – This is true in
many real situations, though theoretically cases can be constructed
in which this it not true
(For simplicity, assume ={, tab} here)
New estimate
Likelihood Current estimate
tab
a
Image source: http://www.absoluteastronomy.com/topics/Hill_climbing
Part 3a
DISTANCE-BASED METHODS:
UPGMA
Motivation
• In the previous sequence-based algorithms, the exact
sequences are used when reconstructing the
phylogenetic trees
• In a distance-based method, only the pairwise
distances between the sequences are considered
– Good if the sequences are long, and we care only about the
tree structure but not the ancestral sequences
– The distances can be computed by methods based on
sequence alignment
– Once the pairwise distances have been computed, the
original sequences will not be used
UPGMA
• Unweighted Pair Group Method with Arithmetic Mean
• Algorithm:
1. Compute the distance between each pair of sequences
2. Treat each sequence as a cluster by itself
3. Merge the two closest clusters. The distance between two
clusters is the average distance between all their sequences
(except that d(Ci, Ci)=0):
1
d൫Ci,Cj ൯= ෍ dሺ𝑟,𝑠ሻ
ȁCi ȁหCj ห
𝑟∈C i ,𝑠∈C j
(Notice that d(r, s) is the distance between r and s in the input distance
matrix)
4. Repeat 2 and 3 until only one cluster remains
Note: Here node labels are
Example sequence names, not the
A B C D E actual characters/bases
A 0 8 4 6 8 A,C B D E
B 8 0 8 8 4 A,C 0 8 6 8 A,C B,E D
C 4 8 0 6 8 {A}, B 8 0 8 4 A,C 0 8 6
D 6 8 6 0 8 {C} D 6 8 0 8 {B},{E} B,E 8 0 8
E 8 4 8 8 0 E 8 4 8 0 D 6 8 0
A B C D E A C B D E A C B E D
2 2 2 2 2 2
A,C,D B,E
Note: In this case,
{A,C}, A,C,D 0 8 A,B,C,D,E
{A,C,D}, {B,E} • The tree is unique
{D} B,E 8 0 A,B,C,D,E 0
• Sum of branch lengths
between two
sequences equals their
A C D B E A C D B E input distance
2 2 • All leaves are on the
3 2 2 2 2
3
2 2
same horizontal line
1 1 2 Do we always have these
1 properties?
Uniqueness
• Not always unique, also not always possible to
put all leaf nodes on a line:
A
A B C
3 B C
2 2 1 2
1
A,B C
A,B 0 5 A,B,C
A B C {A}, C 5 0 {A,B},{C} A,B,C 0
A 0 4 6 {B}
B 4 0 4
C 6 4 0 A B,C A,B,C
{B},{C} {A},{B,C}
A 0 5 A,B,C 0
A B C B,C 5 0
A B C C
2 2 A B 3
2 1
1
Branch lengths
• Not always possible to assign branch lengths
according to distances:
A B C
A 0 4 8 A B,C
B 4 0 2 A 0 6 A,B,C
{B},{C} {A},{B,C}
C 8 2 0 B,C 6 0 A,B,C 0
A B C A B C B C
1 1 A 1 1
3 3
Here the branch lengths only reflect

the cluster distances, not the
sequence distances
Ultrametric distances
• Additive tree: A tree where the length between two nodes is
the total length of the branches between them
– For example, when the length represents the number of observed +
unobserved substitutions
• Desirable properties of the input distance matrix: A B C
i. d(x, y)  0 A B C D E
A 0 4 6
B 4 0 4
ii. d(x, y) = 0 if x = y A 0 8 4 6 8
C 6 4 0
B 8 0 8 8 4
iii. d(x, y) = d(y, x)
C 4 8 0 6 8 True: i, ii, iii and iv
iv. d(x, y) + d(y, z)  d(x, z) D 6 8 6 0 8 A B C
v. d(x, y)  max{d(x, z), d(y, z)} E 8 4 8 8 0 A 0 4 8
B 4 0 2
• We can get an additive tree True: i, ii, iii, iv and v C 8 2 0
and with all leaf nodes on the True: i, ii and iii
same line (i.e., an ultrametric tree) if i, ii, iii, iv and v are true
Last update: 1-Dec-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 54
Additive trees
• When a tree is additive, the branch lengths can be
determined by solving simultaneous equations
A B C
A 0 4 6 A,B C
B 4 0 4 A,B 0 5 A,B,C
C 6 4 0 {A}, C 5 0 A,B,C 0
{A,B},{C}
{B}
A
A B C A B C
3 B C
2 2 1
x 2
1 y
• For non-additive trees, we will d(A,x) + d(B,x) = d(A,B) = 4

d(A,x) + d(x,y) + d(C,y) = d(A,C) = 6 (2)
(1)
learn how to assign “reasonable” d(B,x) + d(x,y) + d(C,y) = d(B,C) = 4 (3)

[(1) – (2) + (3)] / 2:
distances by neighbor joining d(B,x) = 1
d(A,x) = 3
d(x,y) + d(C,y) = 3 [e.g., d(x,y) = 1, d(C,y) = 2]
Part 3b
DISTANCE-BASED METHODS:
NEIGHBOR JOINING
Neighbor joining
• In UPGMA, each time we merge the two closest clusters
according to
1
their distance (criterion #1):
d൫Ci,Cj ൯= ෍ dሺ𝑟,𝑠ሻ
ȁCi ȁหCj ห
𝑟∈C i ,𝑠∈C j
• Would be good to choose the pair that is also far away

from other clusters (criterion #2), measured by:
uሺCiሻ=෍ d൫Ci, Cj൯
j
• In the Neighbor Joining algorithm, the two clusters to

merge is the pair that minimizes
Qሺi,jሻ= ሺr − 2ሻd൫C ,C ൯− uሺC ሻ− u൫C ൯ i j i j
, where r is the current number of
clusters (and Q(i, i)  0 for all i)
– The formula considers both criteria, while the (r-2) factor is to
balance the relative weights of them
Neighbor joining
• The algorithm:
1. Start with each sequence as a cluster. All of them are connected
to a hub, forming a star.
2. Find clusters i and j connected to the hub where Q(i, j) is
minimum among all cluster pairs
3. Insert a new internal node Ck
• Connect it to Ci, Cj and the hub
d൫Ci,Cj൯ uሺCiሻ−u൫Cj൯
+
• Assign length to the edge2 C iCk (r is the number of clusters before the
2ሺr−2ሻ
merge)
d൫Ci,Cj൯ u൫Cj൯−uሺCiሻ
• Assign length to the edge2 C + C
2ሺr−2ሻ
j k
• For each node Cl, d(Ck, Cl) = [d(Ci, Cl) + d(Cj, Cl) – d(Ci, Cj)] / 2
(Notice that all d(Cx, Cy) values are from the previous step.)
4. Repeat 2 and 3 until all branch lengths are assigned
• The final result will be an unrooted tree
Reason for the branch length assignment
• If distances are additive: Length:
d(Ci, Cj)
– x + z = [u(ci) – d(Ci, Cj)] / (r – 2)
– y + z = [u(cj) – d(Ci, Cj)] / (r – 2)
Ci
– x + y = d(Ci, Cj) Cj
x
–  [(x + z) – (y + z) + (x + y)] / 2 = y
x= d൫Ci,Cj൯ uሺCi ሻ−u൫Cj൯
+
2 2ሺr−2ሻ
Average length: Ck Average length:
– and [(y + z) – (x + z) + (x + y)] / 2 = [u(ci) – d(Ci, Cj)] / [u(cj) – d(Ci, Cj)] /
y= d൫Ci,Cj൯ u൫Cj ൯−uሺCiሻ (r – 2) z (r – 2)
+
2 2ሺr−2ሻ
• If distances are not additive, the assigned Everything

distances are still usually reasonable else
• d(Ck, Cl) = [(d(Ci, Cl) + d(Cj, Cl) – d(Ci, Cj)] / 2
Example Q(i,j) = (r-2)d(Ci,Cj) – u(Ci) – u(Cj)
d A B C D u Q A B C D
A B
A 0 8 4 6 A 18 A 0 -26 -28 -26
B 8 0 8 8 B 24 B -26 0 -26 -28
C 4 8 0 6 C 18 C -28 -26 0 -26
Distance between A D 6 8 6 0 D 20 D -26 -28 -26 0
and the new node:
D C
d(A,C)/2 + [u(A) –
u(C)] / [2(r-2)] = 4/2
A C d A,C B D u Q A,C B D
+ (18-18) / [2(2)] = 2
2 2 A,C 0 6 4 A,C 10 A,C 0 -18 -18
{A}, {C} B 6 0 8 B 14 B -18 0 -18
D 4 8 0 D 12 D -18 -18 0
D B
A C d A,B,C D u
2 2 A,B,C 0 3 A,B,C 3
1
{A,C}, {B} D 3 0 D 3
3 5
D In the last step, we simply
B remove the hub and write
down the distance
Last update: 8-Dec-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 60
Comparing the results
• UPGMA: (with one more node, E)
A C D B E
2 2 2 2
3
1 2
1
• Neighbor Joining:
A C
2 2
1
3 5
D
B
Which method to use?
• No definite answer
– There are different camps
• In general it is good to use methods that
– Do not require strong assumptions
– Are robust (do not produce drastically different results when
the inputs are just slightly changed)
• Build multiple trees using different parameters, then combine
• Build trees with different subsets of sequences, then combine
• Use probabilistic methods
– Are computationally efficient
• There are many other algorithms that we did not cover,
including those that consider mutation models.
Rooting an unrooted tree
• How to find the
root of an
unrooted tree?
– Usually by using
an “out group”,
something that
should be
separated first
– There are some
other methods
Image credit: Wikipedia, http://blog.ohinternet.com/wp-content/uploads/2011/03/fugu.jpg ,
http://www.currentprotocols.com/protocol/bi0601
Remarks
• Different types of DNA have different mutation rates
– Gene tree vs. species tree
• Some DNA is not inherited from the usual path
– E.g., bacteria can acquire new DNA by taking up from
plasmid (horizontal gene transfer)
– Need a general phylogenetic graph that allows multiple
parents and loops
• Phylogenetic tree reconstruction would benefit from
having an accurate multiple sequence alignment,
and vice versa
– Some methods perform the two iteratively
Epilogue
CASE STUDY, SUMMARY AND

FURTHER READINGS
Case study: Unexpected classifications
• In the old days, biologists classified species
based on their high-level features
– If a species possesses features that make the
organisms similar to different types of species, it
could be difficult to classify
– When molecular features (e.g., DNA sequences)
become available, they can be used to classify
species in a systematic way
• Some previous classifications were found to be
inconsistent with molecular evidence
• Example 1: Mammals
– Bats look like birds, dolphins look like fish, but both are actually
mammals
• Kingdom: Animalia (animals)
Superphylum: Deuterostomia
Phylum: Chordata
Subphylum: Vertebrata (animals with backbones)
Infraphylum: Gnathostomata (jawed vertebrates)
Class: Chondrichthyes (cartilaginous fish)
Superclass: Osteichthyes (bony fish)
Superclass: Tetrapoda (four-limbed vertebrates)
Class: Aves (birds)
Class Mammalia (mammals)
Image source: Wikipedia
• Example 2: The three domains
• All species on earth belong to one of
the three domains Halobacteria sp. strain NRC-1, an
archaaeon
– Archaea
• Single-celled, no nucleus
• Usually live in places with extreme
conditions (e.g., high temperature or salinity Escherichia coli, a beacterium
– “extremophiles”)
– Bacteria
• Single-celled, no nucleus
– Eukaryote
• Many are multi-celled, with nucleus
Image source: Wikipedia Various eukaryotic species
• It seems reasonable to assume that eukaryotes
separated from the other two first
• However, based on the sequence of ribosomal RNAs,
something so important that evolve slowly, archaea
are closer to eukaryotes than bacteria
Image source: Wikipedia
Summary
• Phylogenetic trees capture separation events
and when they happened
• Two main types of tree reconstruction
methods:
– Sequence-based
• Maximum parsimony
• Maximum likelihood
– Distance-based
• UPGMA
• Neighbor joining
Further readings
• Chapter 3 of the book “Fundamentals of Molecular
Evolution (Second Edition)” by Dan Graur and Wen-Hsiung
Li. Sinauer Associates, Inc., 2000
– Other mutation models
– Models for coding sequences
• Chapter 4 of the same book
– More on rates and patterns of nucleotide substitution
• Chapter 7 of Algorithms in Bioinformatics: A Practical
Introduction
– More details of the algorithms
– Complexity analysis
– Free slides available

Lecture 8. Phylogenetic Tree Reconstruction: The Chinese University of Hong Kong CSCI3220 Algorithms For Bioinformatics

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture 8. Phylogenetic Tree Reconstruction: The Chinese University of Hong Kong CSCI3220 Algorithms For Bioinformatics

Uploaded by

Copyright:

Available Formats

Lecture 8.

The animal kingdom

Image credit: Wikipedia, http://www2.estrellamountain.edu/faculty/farabee/biobk/BioBookDivers_class.html

Common ancestor of A, B and C

Common ancestor of A and B

Image credit: Lowery et al., Oncogene 24(2):248-259, (2005)

– k=3: 3 possible branches to add #3 5

– k=4: 5 possible branches to add #4, and so on 7

– Therefore number of tree topologies is 9 2,027,025

• Similarly, for unrooted trees, 3

– k=3: 1 possible branch to add #3 7

– k=4: 3 possible branches to add #4 9

Final character chosen

• They have different characters A A A A

– Actual minimum number of mutations: 1

• Minimum: 1 substitution for position 1, 1 substitution for position 2

Pr(e:A|g, teg) teg tfg

Pr(a:G|e:A, tae)Pr(b:G|e:A, tbe) e:A f:?

• Overall likelihood: 0.25(0.058)(0.1) + 0.25(0.058)(0.1) +

Here the branch lengths only reflect

• For non-additive trees, we will d(A,x) + d(B,x) = d(A,B) = 4

learn how to assign “reasonable” d(B,x) + d(x,y) + d(C,y) = d(B,C) = 4 (3)

• Would be good to choose the pair that is also far away

• In the Neighbor Joining algorithm, the two clusters to

• If distances are not additive, the assigned Everything

CASE STUDY, SUMMARY AND

Image source: Wikipedia

Image source: Wikipedia

You might also like