You are on page 1of 71

Lecture 8.

Phylogenetic Tree
Reconstruction
The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics
Lecture outline
1. Phylogenetic tree reconstruction
– Problem definition
2. Sequence-based methods
– Maximum parsimony
– Maximum likelihood
3. Distance-based methods
– UPGMA
– Neighbor-joining

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 2
Part 1

PHYLOGENETIC TREE
RECONSTRUCTION
Classification of species

Domains and
kingdoms

The animal kingdom


Image credit: Wikipedia, http://ridge.icu.ac.jp/gen-ed/classif-gifs/animal-class-example.gif

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 4
Taxonomy

Image credit: Wikipedia, http://www2.estrellamountain.edu/faculty/farabee/biobk/BioBookDivers_class.html

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 5
Relating biological objects
• How were the hierarchies
determined?
– Species: traditionally by
morphological and
behavioral similarities, or
paleontological evidences
– Bacterial strains: by
physical, chemical and
biological properties
• Question: Which features
should be used first?
Image credit: http://www2.estrellamountain.edu/faculty/farabee/biobk/BioBookDivers_class.html

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 6
Phylogeny
• A systematic and objective way to construct
these trees is by comparing DNA/protein
sequences
• In this lecture, we study trees that relate
objects that are sufficiently different
– Different species
– Different strains/populations of a species
• Our goal is to reconstruct the actual
evolutionary relationships based on observable
sequences
Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 7
Assumptions
• Basic assumptions behind phylogenetic trees:
1. The current sequences share a common ancestor
2. All were mutated from the common ancestor
3. Mutations are rare. Therefore if the DNA of A and B are
more similar than both A and C as well as B and C, likely
C was separated from A and B before their separation

Common ancestor of A, B and C

Common ancestor of A and B


Time

A B C

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 8
Terminology
• A tree is an acyclic graph with
nodes connected by edges
Branch length
• A phylogenetic tree is a binary
tree with sequences (nodes)
connected by branches (edges)
– Leaf nodes are the observed
Root
sequences A leaf
An internal node
– Internal nodes are the node

unobserved ancestral
sequences A branch
– Branch lengths may represent
evolutionary distances
Image credit: Hershberg et al., Genome Biology 8:R164 (2007) , http://www.jdrf.ca/

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 9
Rooted and unrooted trees
• Sometimes it is not very clear where the common
ancestor should be put
– We can have a tree without root – an unrooted tree

Image credit: Lowery et al., Oncogene 24(2):248-259, (2005)

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 10
Phylogenetic tree reconstruction
• General problem:
– Given a set of DNA/protein sequences
– Find a phylogenetic tree such that it likely corresponds
to the actual historical evolutionary events, involving:
• Order of separation events (how the nodes are connected)
• Ancestral sequences (what sequences the internal nodes have)
• Branch lengths (how much time it has been since the
separation)
There are various ways to evaluate how likely a tree is correct.
We will study them in this lecture
– “Re”-construction: The tree was defined by history. We
only try to reconstruct it from the observed sequences

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 11
What sequence to use?
• If we are studying a gene
– DNA/protein sequence of the gene
• If we want to know the relationship between
different species
– Whole genome (may not be feasible)
– Some genes that are essential and single-copied
• Ribosomal RNA

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 12
Complexity of problem
• Finding the “best” tree is a hard problem
• How many tree topologies (i.e., ignore branch lengths and
left-right order) are there for a set of k sequences?
• For rooted trees: k Num. of rooted tree topologies
2 1
3 3
– k=2: 1 possible tree topology 4 15

– k=3: 3 possible branches to add #3 5


6
105
945

– k=4: 5 possible branches to add #4, and so on 7


8
10,395
135,135

– Therefore number of tree topologies is 9 2,027,025


10 34,459,425
1  3  5  ...  (2k-3) 11 654,729,075
12 13,749,310,575
• Exponential 13 316,234,143,225
14 7,905,853,580,625
15 213,458,046,676,875
16 6,190,283,353,629,370
17 191,898,783,962,511,000
18 6,332,659,870,762,850,000
19 221,643,095,476,700,000,000
1 2 1 2 3 1 3 2 1 3 2 20 8,200,794,532,637,890,000,000

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 13
Complexity of problem k Num. of rooted tree topologies Num. of unrooted tree topologies
2 1 1

• Similarly, for unrooted trees, 3


4 15
3 1
3
5 105 15
– k=2: 1 possible tree topology 6 945 105

– k=3: 1 possible branch to add #3 7


8
10,395
135,135
945
10,395

– k=4: 3 possible branches to add #4 9


10
2,027,025
34,459,425
135,135
2,027,025
– k=5: 5 possible branches to add #5 11 654,729,075 34,459,425
12 13,749,310,575 654,729,075
– There number of tree topologies is 13 316,234,143,225 13,749,310,575
14 7,905,853,580,625 316,234,143,225
1  3  5  ...  (2k-5) 15 213,458,046,676,875 7,905,853,580,625
16 6,190,283,353,629,370 213,458,046,676,875
17 191,898,783,962,511,000 6,190,283,353,629,370
18 6,332,659,870,762,850,000 191,898,783,962,511,000
19 221,643,095,476,700,000,000 6,332,659,870,762,850,000
20 8,200,794,532,637,890,000,000 221,643,095,476,700,000,000

1 1 1 4 1 1 4

3 3 3 3

2 2 2 2 4 2

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 14
Solving the problem: Ideas
• What do you do when you encounter a
computationally hard problem?
– Define an easier version of the problem
• By making certain assumptions
– Design smart algorithms/data structures to avoid
redundant calculations
– Use heuristics to solve it, not necessarily getting
the optimal solution

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 15
Phylogenetic tree reconstruction methods
• Two main types of methods:
– Sequence-based: need the sequences
• Parsimony methods (easier problems, smart algorithms)
• Probabilistic methods (easier problems, smart algorithms)
– Maximum likelihood
– Bayesian
• ...
– Distance-based: only depends on the distances between
the sequences
• UPGMA (heuristics)
• Neighbor joining (heuristics)
• ...
– We will study some of these algorithms

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 16
The Newick format
• Use brackets and comma to group two sub-trees
• Use colon to indicate distance to parent, if available
• End with a semicolon
Graphical representation: Newick:

1.00 ((((Homo:0.21,Pongo:0.21):0.28,
Galago
Macaca:0.49):0.13,Ateles:0.62):
0.62 0.38,Galago:1.00);
Ateles
0.49 Remarks:
Macaca • For an unrooted tree, one simple way to
0.38
represent it using the Newick format is to
0.21 root the tree arbitrarily
0.13 Pongo
• You can name internal nodes by giving the
0.28 label after the close bracket (e.g.,
Homo
0.21 (Homo:0.21, Pongo:0.21)HP:0.28
Image credit: http://www.zoology.ubc.ca/~schluter/zoo502stats/Rtips.phylogeny.html

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 17
Part 2a

SEQUENCE-BASED METHODS:
MAXIMUM PARSIMONY
Maximum parsimony
• Assumption: A tree is likely to be true if it involves few mutations
• Rationale:
– Mutations are rare
– “Occam’s razor”: The simplest explanation is likely the correct one
• “Large parsimony” problem:
– Given a set of sequences
– Find a rooted tree topology of the sequences and the ancestral sequences of
the tree
– Such that the total number of mutations along the branches is minimized
– NP hard: Currently no polynomial time algorithm is known
• “Small parsimony” problem:
– Given a set of sequences and a rooted tree topology of the sequences
– Find the ancestral sequences
– Such that the total number of mutations along the branches is minimized
• We will focus on the small parsimony problem

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 19
Small parsimony example
• We will consider one single site G
– By assuming that sites are independent, we only GC G
need an algorithm for one site GA
– Will show an example with more sites later C G
CA GT
• In the upper tree on the right, the number of A C G T A
mutations is 4
– Is it the minimum (i.e., most parsimonious solution)?
– For this tree topology, the minimum number of mutations
is 3. There are three sets of ancestral states that result in
this number of mutations, shown in the three trees below

A A A

A A A
AG AT
A G A T A A
AC GT AC TG AC AG AT
A C G T A A C G T A A C G T A

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 20
Small parsimony problem
• How to assign ancestral states so that the total
number of mutations is minimized?
• Ideas: For a given node,
– If both children have the same state, probably
good to adopt the state
– If the two children have different states, probably
good to adopt one of them
– Delay the decision of the exact choice until the
parent has also expressed a preference

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 21
The algorithm: simple version
• Fitch’s algorithm: If you only need some solutions p
– For each internal node i with parent p and children l and r, we will i
determine its preference set Si and its final character Ci that
would minimize the total number of mutations r
l
– Steps:
1. For each leaf node i, set Si to the character of the sequence
2. Upward phase: For each internal node i,
if (Sl  Sr)={} // l and r do not agree: take both sets
Si := Sl  Sr
else // l and r agree on something: take
the agreed part
Si := Sl  Sr
3. Downward phase: First pick any Croot from Sroot. Then for each other
internal node i,
if Cp  Si // p agrees with i on something: take it
Ci := Cp
else // p disagrees with i: use i’s own
Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 22
preferences
An example Preference set

Final character chosen

A
A,G,T

Upward phase
A,C G,T

A C G T A A C G T A

Downward phase
(2 choices)

A A

A
A,G,T
OR A

A
A,C G
G,T A T

A C G T A A C G T A

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 23
Why does it work?
• Proof by induction
– When there are two leaves, there are only two
cases:
• They have the same character
– Actual minimum number of mutations: 0
– The algorithm gives the same number A

• They have different characters A A A A

– Actual minimum number of mutations: 1


– The algorithm also gives the same number A
A C A C
Therefore the algorithm is optimal
C
A C

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 24
Why does it work?
• Assume the algorithm is able to minimize
the number of mutations for trees with k or
fewer leaves
• Now for a tree with k+1 leaves, root
– It consists of a root connected to two sub-trees
with roots l and r, both with k or fewer leaves l r
– Two cases:
... ... ... ...
• If Sl  Sr  {}, the algorithm gives a solution with ml
Minimum Minimum
+ mr mutations, which is optimal due to the
number of number of
induction hypothesis mutations: ml mutations: mr
• If Sl  Sr = {}, the algorithm gives a solution with ml
+ mr + 1 mutations, which is also optimal since one
extra mutation must be introduced between the
root and one of its children

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 25
The algorithm: extended version
• If you need all solutions p
– Steps: i
1. For each leaf node i, set Si to the character of the sequence
2. Upward phase (same as before): For each internal node i, l r
if (Sl  Sr)={} // l and r do not agree: take both sets
Si := Sl  Sr
else // l and r agree on something: take it
Si := Sl  Sr
3. Downward phase: First pick Croot from Sroot. Then for each other
internal node i (different strategy -- majority vote): we will choose Ci
from the characters that exist in the largest number of sets among
{Cp}, Sl and Sr. Also, whenever there are multiple choices, we choose
each in turn to enumerate all optimal solutions.
– We can prove that this algorithm gives all optimal solutions
– A special case of Sankoff’s dynamic programming algorithm

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 26
Revisiting the same example
A
A,G,T

Upward phase
A,C G,T

A C G T A A C G T A

Downward phase
(3 choices)
Found by
Algorithm 2 but
not Algorithm 1

A A A

A OR A OR A

A G A T A A

A C G T A A C G T A A C G T A

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 27
A more complex example
A,C,G

G
Upward phase A,G
A,C A
A C A A G G A C A A G G
Downward phase
(6 choices)

A
A,C,G A
A,C,G C
A,C,G

G
A G G
A
A,G G
A,G G
A,G
A
A,C A A
A,C A C
A,C A
A C A A G G A C A A G G A C A A G G

G
A,C,G G
A,C,G G
A,C,G

G G G
G
A,G G
A,G G
A,G
A
A,C A C
A,C A G
A,C A
A C A A G G A C A A G G A C A A G G

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 28
Multiple sites
• In a real situation, we need to deal with
sequences that contain more than one site
• We simply apply the above algorithm to the
different sites independently
– As we assume that different sites mutate
independently

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 29
Example
[G]
[C,T]

[A,G]
Upward phase
[C]

AC GC GT AC GC GT

Downward phase

[G] [G]
GC
[C,T] GT
[C,T]

[A,G] [A,G]
GC
[C] OR GC
[C]

AC GC GT AC GC GT

• Minimum: 1 substitution for position 1, 1 substitution for position 2


• Maximum parsimony: 2 trees that can achieve this minimum

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 30
From small parsimony to large parsimony
• Some heuristic methods try different tree
topologies. For each one, solve the small
parsimony problem. Then compare and find
the best.
– A standard “searching problem” in AI
– Need ways to:
• Determine the first trees
• Determine the next trees based on the current tree
• Avoid trapping in a local optimum
– There are many methods proposed for these tasks

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 31
Part 2b

SEQUENCE-BASED METHODS:
MAXIMUM LIKELIHOOD
Evolutionary distance
• Suppose we have an alignment of two sequences. TTAGG
At a site, one sequence has an A and one has a C. TTCGG
– Assume that the sequences have a common ancestor TT?GG
– What did the common ancestor have at that site?
– We don’t know. TTAGG TTCGG
– Let’s say A. How many mutations have happened?
TTAGG
• Could be one (A  C)
• Could be more (A  G  C, A  T  C, A  C  A, etc.)
TTAGG TTCGG
• We want a way to define the “evolutionary
distance” between two observed sequences TTAGG
– According to the number of mutations happened or
the time since their divergence TTAGG TTGGG
– We need to first define a mutation model
TTCGG

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 33
Technical assumptions
• These assumptions are usually not true, but they
will greatly simplify the problem:
– Sites are independent
– Mutation rates are the same for different sites and at
different time in the history
– Given current state, future states do not depend on
past states
• Markov property!
• More complex models that require fewer strong
assumptions exist. We only study the simple
models here
Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 34
The Jukes-Cantor model
• Proposed by Jukes and Cantor in 1969
• Equal rate of substitution, , to the other three bases
in one unit of time
– Assume there is at most one mutation within one unit of
time – We can always make the unit smaller to ensure this
1-3  1-3
A  G Notes: In this lecture,
1. We do not consider indels

2. A “substitution” is a point

    mutation actually
happened, while a

mismatch in an alignment
 could be caused by one or
C  T more substitutions
1-3  1-3

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 35
Illustration of the Jukes-Cantor model
• Suppose at time 0, site 1 of a sequence was A
• At time 1:
– There is a probability of 1-3 that the site was A
– There is a probability of  that the site was C
– There is a probability of  that the site was G
– There is a probability of  that the site was T
• At time 2, what is the probability that the site is A?
– Two possibilities:
1. At time 1, the site was A, and there was no mutation from time 1 to time 2
[probability: (1-3)2]
2. At time 1, the site was C, G or T, and there was a mutation to A from time 1
to time 2 [probability: 32]
– Therefore, the total probability that the site is A at time 2 is (1-3)2 +
32

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 36
Recursive formulas
• Denote PXY(t) as the probability that for a base that
was X at time 0, it is Y at time t, for any X and Y
– Here “site”, “base”, “nucleotide” all mean the same thing
– PAA(1) = 1 - 3
– PAA(2) = (1 - 3)2 + 32
– In general,
PAA(t+1) = (1 - 3)PAA(t) + [1 - PAA(t)]
– Similarly:
• PXX(t+1) = (1 - 3)PXX(t) + [1 - PXX(t)] for any X
• PXY(t+1) = [1 - PXX(t+1)] / 3 for any X and
Y

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 37
Solving the equation
• (Suggested by former student Cao Jianquan when he
took BMEG3102)
– PAA(t) = (1 - 3)PAA(t-1) + [1 - PAA(t-1)]
= (1 - 4)PAA(t-1) + 
 PAA(t) - 1/4 = (1 - 4)PAA(t-1) +  - 1/4
= (1 - 4)PAA(t-1) - 1/4 (1 - 4)
= (1 - 4)(PAA(t-1) - 1/4) Key: The underlined
= (1 - 4) (PAA(t-2) - 1/4)
2
parts have exactly
= ... the same form
= (1 - 4)t-1(PAA(1) - 1/4)
= (1 - 4)t-1(1 - 3 - 1/4)
= (1 - 4)t-1(3/4 - 3)
= 3/4 (1 - 4)t
 PAA(t) = 3/4 (1 - 4)t + 1/4

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 38
Maximum likelihood
• Likelihood: Probability of producing the
observed data by a model given the model
parameters, Pr(X|)
– X: Observed data
• The input sequences, assumed aligned
• Again, we consider one single site here. The likelihood for
the whole sequences is the product of the likelihood of
individual sites since they are assumed independent
– : Model parameters (see next page)
• Maximum likelihood: Find value of  such that
Pr(X|) is maximized
Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 39
Model parameters
• There are different possibilities
– In all cases, X is the input sequences
• Big likelihood problem
– : tree topology, mutation rates and divergence times
– Very difficult
• Small likelihood problem
– Tree topology is given
– : mutation rates and divergence times
– There are effective heuristic solutions that usually
(but not always) produce optimal results

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 40
Computing likelihood
• Suppose we are given the followings, as
shown in the figure: g:G
– Tree topology
teg tfg
– Observed data, X = {a:G, b:G, c:T, d:G}
– Ancestral sequences e:G f:G
– Parameters,  = {<mutation rates>, tae, tbe, tcf, tdf,
tae tbe tcf tdf
teg, tfg}
• Likelihood =Pr(g:G) a:G b:G c:T d:G
Pr(e:G|g:G, teg) Pr(f:G|g:G, tfg)
Pr(a:G|e:G, tae) Pr(b:G|e:G, tbe) Node labels
Pr(c:T|f:G, tcf) Pr(d:G|f:G, tdf) Observed sequences
:Ancestral sequences
– We have learned how to compute these Divergence times
conditional probabilities for the Jukes-Cantor
model
– Noticed the Markov property used here?

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 41
Computing likelihood
• In the small likelihood problem, we are only given the tree
topology, but not the ancestral sequences – Then how to
compute likelihood? g:?
• Need to try them all (summation of 43 = 64 terms): Likelihood =
Pr(g:A)
teg tfg
Pr(e:A|g:A, teg) Pr(f:A|g:A, tfg)
Pr(a:G|e:A, tae) Pr(b:G|e:A, tbe) e:? f:?
Pr(c:T|f:A, tcf)Pr(d:G|f:A, tdf)
+ tae tbe tcf tdf
Pr(g:C)
Pr(e:A|g:C, teg) Pr(f:A|g:C, tfg) a:G b:G c:T d:G
Pr(a:G|e:A, tae) Pr(b:G|e:A, tbe)
Pr(c:T|f:A, tcf)Pr(d:G|f:A, tdf)
+ Possible ancestral states:
...
e A A A A A T
+
Pr(g:T) f A A A A C .. T
Pr(e:T|g:T, teg) Pr(f:T|g:T, tfg) .
Pr(a:G|e:T, tae) Pr(b:G|e:T, tbe)
Pr(c:T|f:T, tcf)Pr(d:G|f:T, tdf) g A C G T A T

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 42
Computational efficiency
• How much computation is needed?
• For our example:
– 3 internal node  43 = 64 possible sets of ancestral states
– For each set of ancestral states, we need to multiply 7 terms
(because there are 7 nodes in the tree)
• In general:
– If there are n input sequences, there are n-1 internal nodes  4n-1
possible sets of ancestral states
– For each set of ancestral states, we need to multiply n+n-1 = 2n-1
terms
• Impractical to perform this exponential number of operations
– You must know how to deal with this situation by now –
dynamic programming!

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 43
Computing likelihood efficiently
• An important observation: once the root
of a sub-tree is determined, the likelihood
of this sub-tree does not depend on other
nodes in the whole tree
• E.g., once node e is decided to take
character A, the likelihood of the sub-tree
involving nodes a, b and e is g:?

Pr(e:A|g, teg) teg tfg

Pr(a:G|e:A, tae)Pr(b:G|e:A, tbe) e:A f:?


– If the character at node g does not change,
tae tbe tcf tdf
the value of the above expression will not
change no matter what character node f a:G b:G c:T d:G
takes.
– Therefore this value can be re-used

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 44
Computing likelihood efficiently
• Define table V, where entry V(i,x) is the likelihood of the sub-tree rooted at i
when the parent of i takes character x
– Likelihood =
Pr(g:A) V(e,A) V(f,A) +
Pr(g:C) V(e,C) V(f,C) + g:?
Pr(g:G) V(e,G) V(f,G) +
teg tfg
Pr(g:T) V(e,T) V(f,T)
– V(e, A) =
e:? f:?
Pr(e:A|g:A,teg) V(a,A) V(b,A) +
Pr(e:C|g:A,teg) V(a,C) V(b,C) + tae tbe tcf tdf
Pr(e:G|g:A,teg) V(a,G) V(b,G) +
Pr(e:T|g:A,teg) V(a,T) V(b,T) a:G b:G c:T d:G
– V(a,A) = Pr(a:G|e:A, tae)
– V(a,C) = Pr(a:G|e:C, tae)
– ...
• Table V contains O(n) entries. Computing the value for each entry requires a
constant number of operations  Linear time overall

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 45
Example e:?
tde tce
• Let’s assume:
– All four bases are equally likely at the root
d:?
– The Jukes-Cantor mutation model is correct
– Mutation rate per unit time, =0.1 tad tbd
– Each branch in the tree represents one unit of time
• V(i,x) is the likelihood of the sub-tree rooted at i when the a:A b:C c:G
parent of i takes character x
V(i, x) x=A x=C x=G x=T
i=a 0.7 0.1 0.1 0.1
i=b 0.1 0.7 0.1 0.1
i=c 0.1 0.1 0.7 0.1
i=d 0.7(0.7)(0.1) + 0.1(0.7)(0.1) + 0.1(0.7)(0.1) + 0.1(0.7)(0.1) +
0.1(0.1)(0.7) + 0.7(0.1)(0.7) + 0.1(0.1)(0.7) + 0.1(0.1)(0.7) +
0.1(0.1)(0.1) + 0.1(0.1)(0.1) + 0.7(0.1)(0.1) + 0.1(0.1)(0.1) +
0.1(0.1)(0.1) 0.1(0.1)(0.1) 0.1(0.1)(0.1) 0.7(0.1)(0.1)
= 0.058 = 0.058 = 0.022 = 0.022

• Overall likelihood: 0.25(0.058)(0.1) + 0.25(0.058)(0.1) +


0.25(0.022)(0.7) + 0.25(0.022)(0.1) = 0.0073

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 46
Solving the small likelihood problem
• Then how to find the optimal parameter values?
– Start with a random estimate of 
– Apply a “hill climbing” algorithm
• Change the value of a parameter so that the likelihood is increased
• Repeat it for each parameter in turn, for multiple iterations
• Will reach maximum if there is a single “peak” – This is true in
many real situations, though theoretically cases can be constructed
in which this it not true
(For simplicity, assume ={, tab} here)
New estimate
Likelihood Current estimate

tab

a
Image source: http://www.absoluteastronomy.com/topics/Hill_climbing

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 47
Part 3a

DISTANCE-BASED METHODS:
UPGMA
Motivation
• In the previous sequence-based algorithms, the exact
sequences are used when reconstructing the
phylogenetic trees
• In a distance-based method, only the pairwise
distances between the sequences are considered
– Good if the sequences are long, and we care only about the
tree structure but not the ancestral sequences
– The distances can be computed by methods based on
sequence alignment
– Once the pairwise distances have been computed, the
original sequences will not be used

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 49
UPGMA
• Unweighted Pair Group Method with Arithmetic Mean
• Algorithm:
1. Compute the distance between each pair of sequences
2. Treat each sequence as a cluster by itself
3. Merge the two closest clusters. The distance between two
clusters is the average distance between all their sequences
(except that d(Ci, Ci)=0):
1
d൫Ci,Cj ൯= ෍ dሺ𝑟,𝑠ሻ
ȁCi ȁหCj ห
𝑟∈C i ,𝑠∈C j
(Notice that d(r, s) is the distance between r and s in the input distance
matrix)
4. Repeat 2 and 3 until only one cluster remains

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 50
Note: Here node labels are
Example sequence names, not the
A B C D E actual characters/bases
A 0 8 4 6 8 A,C B D E
B 8 0 8 8 4 A,C 0 8 6 8 A,C B,E D
C 4 8 0 6 8 {A}, B 8 0 8 4 A,C 0 8 6
D 6 8 6 0 8 {C} D 6 8 0 8 {B},{E} B,E 8 0 8
E 8 4 8 8 0 E 8 4 8 0 D 6 8 0

A B C D E A C B D E A C B E D
2 2 2 2 2 2

A,C,D B,E
Note: In this case,
{A,C}, A,C,D 0 8 A,B,C,D,E
{A,C,D}, {B,E} • The tree is unique
{D} B,E 8 0 A,B,C,D,E 0
• Sum of branch lengths
between two
sequences equals their
A C D B E A C D B E input distance
2 2 • All leaves are on the
3 2 2 2 2
3
2 2
same horizontal line
1 1 2 Do we always have these
1 properties?

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 51
Uniqueness
• Not always unique, also not always possible to
put all leaf nodes on a line:
A
A B C
3 B C
2 2 1 2
1
A,B C
A,B 0 5 A,B,C
A B C {A}, C 5 0 {A,B},{C} A,B,C 0
A 0 4 6 {B}
B 4 0 4
C 6 4 0 A B,C A,B,C
{B},{C} {A},{B,C}
A 0 5 A,B,C 0
A B C B,C 5 0

A B C C
2 2 A B 3
2 1
1
Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 52
Branch lengths
• Not always possible to assign branch lengths
according to distances:
A B C
A 0 4 8 A B,C
B 4 0 2 A 0 6 A,B,C
{B},{C} {A},{B,C}
C 8 2 0 B,C 6 0 A,B,C 0

A B C A B C B C
1 1 A 1 1

3 3

Here the branch lengths only reflect


the cluster distances, not the
sequence distances

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 53
Ultrametric distances
• Additive tree: A tree where the length between two nodes is
the total length of the branches between them
– For example, when the length represents the number of observed +
unobserved substitutions
• Desirable properties of the input distance matrix: A B C

i. d(x, y)  0 A B C D E
A 0 4 6
B 4 0 4
ii. d(x, y) = 0 if x = y A 0 8 4 6 8
C 6 4 0
B 8 0 8 8 4
iii. d(x, y) = d(y, x)
C 4 8 0 6 8 True: i, ii, iii and iv
iv. d(x, y) + d(y, z)  d(x, z) D 6 8 6 0 8 A B C
v. d(x, y)  max{d(x, z), d(y, z)} E 8 4 8 8 0 A 0 4 8
B 4 0 2
• We can get an additive tree True: i, ii, iii, iv and v C 8 2 0
and with all leaf nodes on the True: i, ii and iii
same line (i.e., an ultrametric tree) if i, ii, iii, iv and v are true

Last update: 1-Dec-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 54
Additive trees
• When a tree is additive, the branch lengths can be
determined by solving simultaneous equations
A B C
A 0 4 6 A,B C
B 4 0 4 A,B 0 5 A,B,C
C 6 4 0 {A}, C 5 0 A,B,C 0
{A,B},{C}
{B}
A
A B C A B C
3 B C
2 2 1
x 2
1 y

• For non-additive trees, we will d(A,x) + d(B,x) = d(A,B) = 4


d(A,x) + d(x,y) + d(C,y) = d(A,C) = 6 (2)
(1)

learn how to assign “reasonable” d(B,x) + d(x,y) + d(C,y) = d(B,C) = 4 (3)


[(1) – (2) + (3)] / 2:
distances by neighbor joining d(B,x) = 1
d(A,x) = 3
d(x,y) + d(C,y) = 3 [e.g., d(x,y) = 1, d(C,y) = 2]

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 55
Part 3b

DISTANCE-BASED METHODS:
NEIGHBOR JOINING
Neighbor joining
• In UPGMA, each time we merge the two closest clusters
according to
1
their distance (criterion #1):
d൫Ci,Cj ൯= ෍ dሺ𝑟,𝑠ሻ
ȁCi ȁหCj ห
𝑟∈C i ,𝑠∈C j

• Would be good to choose the pair that is also far away


from other clusters (criterion #2), measured by:
uሺCiሻ=෍ d൫Ci, Cj൯
j

• In the Neighbor Joining algorithm, the two clusters to


merge is the pair that minimizes
Qሺi,jሻ= ሺr − 2ሻd൫C ,C ൯− uሺC ሻ− u൫C ൯ i j i j
, where r is the current number of
clusters (and Q(i, i)  0 for all i)
– The formula considers both criteria, while the (r-2) factor is to
balance the relative weights of them

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 57
Neighbor joining
• The algorithm:
1. Start with each sequence as a cluster. All of them are connected
to a hub, forming a star.
2. Find clusters i and j connected to the hub where Q(i, j) is
minimum among all cluster pairs
3. Insert a new internal node Ck
• Connect it to Ci, Cj and the hub
d൫Ci,Cj൯ uሺCiሻ−u൫Cj൯
+
• Assign length to the edge2 C iCk (r is the number of clusters before the
2ሺr−2ሻ

merge)
d൫Ci,Cj൯ u൫Cj൯−uሺCiሻ
• Assign length to the edge2 C + C
2ሺr−2ሻ
j k

• For each node Cl, d(Ck, Cl) = [d(Ci, Cl) + d(Cj, Cl) – d(Ci, Cj)] / 2
(Notice that all d(Cx, Cy) values are from the previous step.)
4. Repeat 2 and 3 until all branch lengths are assigned
• The final result will be an unrooted tree

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 58
Reason for the branch length assignment
• If distances are additive: Length:
d(Ci, Cj)
– x + z = [u(ci) – d(Ci, Cj)] / (r – 2)
– y + z = [u(cj) – d(Ci, Cj)] / (r – 2)
Ci
– x + y = d(Ci, Cj) Cj
x
–  [(x + z) – (y + z) + (x + y)] / 2 = y
x= d൫Ci,Cj൯ uሺCi ሻ−u൫Cj൯
+
2 2ሺr−2ሻ
Average length: Ck Average length:
– and [(y + z) – (x + z) + (x + y)] / 2 = [u(ci) – d(Ci, Cj)] / [u(cj) – d(Ci, Cj)] /
y= d൫Ci,Cj൯ u൫Cj ൯−uሺCiሻ (r – 2) z (r – 2)
+
2 2ሺr−2ሻ

• If distances are not additive, the assigned Everything


distances are still usually reasonable else
• d(Ck, Cl) = [(d(Ci, Cl) + d(Cj, Cl) – d(Ci, Cj)] / 2

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 59
Example Q(i,j) = (r-2)d(Ci,Cj) – u(Ci) – u(Cj)
d A B C D u Q A B C D
A B
A 0 8 4 6 A 18 A 0 -26 -28 -26
B 8 0 8 8 B 24 B -26 0 -26 -28
C 4 8 0 6 C 18 C -28 -26 0 -26
Distance between A D 6 8 6 0 D 20 D -26 -28 -26 0
and the new node:
D C
d(A,C)/2 + [u(A) –
u(C)] / [2(r-2)] = 4/2
A C d A,C B D u Q A,C B D
+ (18-18) / [2(2)] = 2
2 2 A,C 0 6 4 A,C 10 A,C 0 -18 -18
{A}, {C} B 6 0 8 B 14 B -18 0 -18
D 4 8 0 D 12 D -18 -18 0
D B

A C d A,B,C D u
2 2 A,B,C 0 3 A,B,C 3
1
{A,C}, {B} D 3 0 D 3
3 5
D In the last step, we simply
B remove the hub and write
down the distance

Last update: 8-Dec-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 60
Comparing the results
• UPGMA: (with one more node, E)
A C D B E
2 2 2 2
3
1 2
1

• Neighbor Joining:
A C
2 2
1

3 5
D
B

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 61
Which method to use?
• No definite answer
– There are different camps
• In general it is good to use methods that
– Do not require strong assumptions
– Are robust (do not produce drastically different results when
the inputs are just slightly changed)
• Build multiple trees using different parameters, then combine
• Build trees with different subsets of sequences, then combine
• Use probabilistic methods
– Are computationally efficient
• There are many other algorithms that we did not cover,
including those that consider mutation models.

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 62
Rooting an unrooted tree
• How to find the
root of an
unrooted tree?
– Usually by using
an “out group”,
something that
should be
separated first
– There are some
other methods
Image credit: Wikipedia, http://blog.ohinternet.com/wp-content/uploads/2011/03/fugu.jpg ,
http://www.currentprotocols.com/protocol/bi0601

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 63
Remarks
• Different types of DNA have different mutation rates
– Gene tree vs. species tree
• Some DNA is not inherited from the usual path
– E.g., bacteria can acquire new DNA by taking up from
plasmid (horizontal gene transfer)
– Need a general phylogenetic graph that allows multiple
parents and loops
• Phylogenetic tree reconstruction would benefit from
having an accurate multiple sequence alignment,
and vice versa
– Some methods perform the two iteratively

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 64
Epilogue

CASE STUDY, SUMMARY AND


FURTHER READINGS
Case study: Unexpected classifications
• In the old days, biologists classified species
based on their high-level features
– If a species possesses features that make the
organisms similar to different types of species, it
could be difficult to classify
– When molecular features (e.g., DNA sequences)
become available, they can be used to classify
species in a systematic way
• Some previous classifications were found to be
inconsistent with molecular evidence

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 66
Case study: Unexpected classifications
• Example 1: Mammals
– Bats look like birds, dolphins look like fish, but both are actually
mammals
• Kingdom: Animalia (animals)
Superphylum: Deuterostomia
Phylum: Chordata
Subphylum: Vertebrata (animals with backbones)
Infraphylum: Gnathostomata (jawed vertebrates)
Class: Chondrichthyes (cartilaginous fish)
Superclass: Osteichthyes (bony fish)
Superclass: Tetrapoda (four-limbed vertebrates)
Class: Aves (birds)
Class Mammalia (mammals)

Image source: Wikipedia

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 67
Case study: Unexpected classifications
• Example 2: The three domains
• All species on earth belong to one of
the three domains Halobacteria sp. strain NRC-1, an
archaaeon

– Archaea
• Single-celled, no nucleus
• Usually live in places with extreme
conditions (e.g., high temperature or salinity Escherichia coli, a beacterium
– “extremophiles”)
– Bacteria
• Single-celled, no nucleus
– Eukaryote
• Many are multi-celled, with nucleus
Image source: Wikipedia Various eukaryotic species

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 68
Case study: Unexpected classifications
• It seems reasonable to assume that eukaryotes
separated from the other two first
• However, based on the sequence of ribosomal RNAs,
something so important that evolve slowly, archaea
are closer to eukaryotes than bacteria

Image source: Wikipedia

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 69
Summary
• Phylogenetic trees capture separation events
and when they happened
• Two main types of tree reconstruction
methods:
– Sequence-based
• Maximum parsimony
• Maximum likelihood
– Distance-based
• UPGMA
• Neighbor joining

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 70
Further readings
• Chapter 3 of the book “Fundamentals of Molecular
Evolution (Second Edition)” by Dan Graur and Wen-Hsiung
Li. Sinauer Associates, Inc., 2000
– Other mutation models
– Models for coding sequences
• Chapter 4 of the same book
– More on rates and patterns of nucleotide substitution
• Chapter 7 of Algorithms in Bioinformatics: A Practical
Introduction
– More details of the algorithms
– Complexity analysis
– Free slides available

Last update: 15-Aug-2018 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2018 71

You might also like