
Bioinformatics (BIO213)

Session 16
Molecular Phylogeny
• Molecular phylogeny is the study of the evolutionary relationships
among organisms or molecules using the techniques of molecular
biology.
• Any phylogenetic tree provides two pieces of information:
• Topology of a tree: defines the relationships of the proteins represented in the
tree.
• The branch lengths: reflect the degree of relatedness of the objects in the
tree.
Topologies and branches of trees
• A phylogenetic tree is a graph composed of branches and nodes.
• Branch (edge) connects any two nodes.
• Nodes (taxonomic units) represent the protein sequence of an
organism.
• An operational taxonomic unit (OTU) is an extant taxon present at an external
node or leaf.
Rooted and Unrooted tree

A rooted tree shows ancestry: the evolutionary relationship between the OTUs and their common ancestor.
An unrooted tree shows only the relatedness of the organisms; it is useful when you are trying to understand the
conservation or diversity of the sequences.
An unrooted tree can be converted to a rooted tree, which requires an outgroup.
Types of rooted trees
• Cladograms: branch lengths have no meaning
• Phylograms: branch lengths represent evolutionary change
• Ultrametric trees: branch lengths represent time, and the distance from
the root to every leaf is the same
Enumerating Trees

• The number of possible trees to describe the relationships of a dozen
protein sequences is staggeringly large.
• There is only one “true” tree representing the evolutionary path by which
molecular sequences (or even species) evolved.
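• For 2 OTUs, there is only one tree possible. For 3 taxa, it is possible to
construct either one unrooted tree or three different rooted trees.
• The number of unrooted bifurcating trees for n OTUs (n ≥ 3):
N_U = (2n - 5)! / [2^(n - 3) (n - 3)!]
• The number of rooted bifurcating trees for n OTUs (n ≥ 2):
N_R = (2n - 3)! / [2^(n - 2) (n - 2)!]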
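As a quick check on these counts, the sketch below evaluates the standard closed-form expressions for strictly bifurcating trees: (2n - 5)! / [2^(n - 3) (n - 3)!] unrooted trees and (2n - 3)! / [2^(n - 2) (n - 2)!] rooted trees. The function names are illustrative and not part of the lecture.

```python
from math import factorial


def unrooted_trees(n: int) -> int:
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n: int) -> int:
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 4, 5, 12):
    print(n, unrooted_trees(n), rooted_trees(n))
```

For n = 12 this gives 654,729,075 unrooted and 13,749,310,575 rooted trees, which is why exhaustive searches stop being practical at about a dozen sequences.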


Rooted vs unrooted tree
• Rooted tree: directed to a unique node
• (2 * number of leaves) - 1 nodes
• (2 * number of leaves) - 2 branches
• Unrooted tree: shows the relatedness of the leaves without
assuming ancestry at all
• (2 * number of leaves) - 2 nodes
• (2 * number of leaves) - 3 branches
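The node and branch counts above can be written as a small helper. This is a minimal sketch that assumes a strictly bifurcating tree; the function name `tree_size` is hypothetical.

```python
def tree_size(n_leaves: int, rooted: bool) -> tuple[int, int]:
    """Return (nodes, branches) for a strictly bifurcating tree."""
    if rooted:
        return 2 * n_leaves - 1, 2 * n_leaves - 2
    return 2 * n_leaves - 2, 2 * n_leaves - 3


print(tree_size(6, rooted=True))   # rooted tree with 6 leaves
print(tree_size(6, rooted=False))  # unrooted tree with 6 leaves
```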
Tree search strategies
• An exhaustive search examines all possible trees and selects the one with the best value of an
optimality criterion, such as the shortest overall sum of branch lengths.
• The practical limit of an exhaustive search is around 12 sequences, for which there are over
6.5 × 10^8 possible unrooted trees and 1.3 × 10^10 rooted trees.

Branch-and-bound method:
• Provides an exact algorithm for identifying the optimal tree (or trees) without performing
an exhaustive search.
• By considering the tree in each group having the shortest branch lengths, it is possible to
efficiently identify candidates for the optimal tree(s).
• An exhaustive search is not necessary for trees (or subtrees) that have a worse score than
the current candidate for the optimal tree.
• The name of this method refers to a bound that is reached once the search process has identified a
subtree with a suboptimal score.
How does a heuristic search algorithm work?
• The algorithm looks for the tree with the shortest total branch length for a
dataset of sequences (i.e., the most parsimonious tree).
• This search occurs without evaluating all possible trees; instead, it
performs a series of rearrangements of the topology.
• Each time, the algorithm sieves through subtrees and, once a tree with
a particular score is obtained, discards all trees for which
rearrangements are unlikely to improve on that score.
• The algorithm iteratively establishes the upper limit of the score and
chooses the final tree.

Several variants of this heuristic are available.
An example is the tree bisection and reconnection (TBR) method.
A species tree and a protein (or gene) tree can
have a complex relationship
• Speciation is the process by which two new species are created from a
single ancestral species by reproductive isolation (eukaryotes).

• Phylogenetic analysis of a specific group of proteins is complicated by the
fact that a gene duplication could have preceded or followed the speciation event.

• In essentially all phylogenetic analyses, the extant proteins (OTUs) are
sequences from organisms that are alive today.

• It is necessary to reconstruct the history of the protein family as well as
the history of each species.

Li et al, 2000
A gene tree differs from a species tree in
two ways
(1) The divergence of two genes from two species may have predated
the speciation event, which causes overestimation of branch lengths.
(2) The topology of the gene tree may differ from that of the species tree.
In particular, it may be difficult to reconstruct a species tree from a
gene tree.
• A molecular clock may be applied to a gene tree in order to date the
time of gene divergence, but it cannot be assumed that this is also the
time that speciation occurred.

Evolutionary changes occur in a clock-like fashion

Li et al, 2000
Gene tree is not reliable to estimate species
relationships (despite molecular clock)
• Reconstructing a phylogenetic tree based upon a single protein
(or gene) can therefore give complicated results.

• Thus, trees are constructed from a variety of distinct gene
families in order to assess the relationships of different species.

• Another strategy is to generate concatenated protein (or DNA)
sequences.
Phylogenetic tree of eukaryotes
by concatenating EF‐1α, actin, α‐
tubulin, and β‐tubulin.
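The concatenation strategy can be sketched as joining per-gene alignments end to end for each taxon. The sequences and taxon names below are hypothetical toy data, and the helper assumes every alignment contains the same taxa; real analyses also need a policy for missing genes.

```python
def concatenate_alignments(alignments):
    """Join per-gene alignments end to end, taxon by taxon."""
    taxa = set(alignments[0])
    for aln in alignments:
        if set(aln) != taxa:
            raise ValueError("every alignment must contain the same taxa")
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError("sequences within an alignment must have equal length")
    return {t: "".join(aln[t] for aln in alignments) for t in sorted(taxa)}


# Hypothetical aligned fragments of two proteins from three species.
gene1 = {"sp1": "MKV-L", "sp2": "MKVAL", "sp3": "MRV-L"}
gene2 = {"sp1": "GDT", "sp2": "GET", "sp3": "GDS"}

supermatrix = concatenate_alignments([gene1, gene2])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
```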
Five stages of phylogenetic analysis
• Molecular phylogenetic analyses can be divided into five
stages:
(1) selection of sequences for analysis
(2) multiple sequence alignment of homologous protein or nucleic
acid sequences
(3) specification of a statistical model of nucleotide or amino acid
evolution
(4) tree building
(5) tree evaluation.
Tree building methods
• Four principal methods of building trees:
1. Distance-based
2. Maximum parsimony
3. Maximum likelihood
4. Bayesian inference
Distance-based methods
Distance-based methods of phylogeny are computationally fast, so
they are particularly useful for analyses of a large number of
sequences.

• UPGMA: Unweighted Pair Group Method with Arithmetic Mean

• NJ: Neighbor-Joining

These methods use some distance metric, such as the number of
amino acid changes between the sequences, or a distance score.
Distance-based methods
• Calculate all the distances between leaves (taxa)
• Based on the distance, construct a tree
• Good for continuous characters
• Not very accurate
Properties of distance metric
(1) the distance from a point to itself must be zero, that is, D(x, x) = 0
(2) the distance from point x to y must equal the distance from y to x,
that is, D(x, y) = D(y, x)
(3) the triangle inequality must apply in that D(x, y) ≤ D(x, z) + D(z, y).
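The three properties can be checked mechanically on a small distance table. This is a minimal sketch; the two tables are hypothetical examples, and `is_metric` is an illustrative name.

```python
from itertools import product


def is_metric(points, dist):
    """Check identity, symmetry and the triangle inequality on a distance table."""
    for x in points:
        if dist[x][x] != 0:                      # property (1)
            return False
    for x, y in product(points, repeat=2):
        if dist[x][y] != dist[y][x]:             # property (2)
            return False
    for x, y, z in product(points, repeat=3):
        if dist[x][y] > dist[x][z] + dist[z][y]:  # property (3)
            return False
    return True


points = ["x", "y", "z"]
good = {"x": {"x": 0, "y": 5, "z": 4},
        "y": {"x": 5, "y": 0, "z": 7},
        "z": {"x": 4, "y": 7, "z": 0}}
# Same table, but D(x, y) = 12 exceeds D(x, z) + D(z, y) = 11.
bad = {"x": {"x": 0, "y": 12, "z": 4},
       "y": {"x": 12, "y": 0, "z": 7},
       "z": {"x": 4, "y": 7, "z": 0}}

print(is_metric(points, good), is_metric(points, bad))
```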
UPGMA: Unweighted Pair Group Method
with Arithmetic Mean
• Originally developed for numerical taxonomy in 1958 by Sokal
and Michener
• Simplest algorithm for tree construction, so it is fast
How to construct a tree with UPGMA?
• Prepare a distance matrix
• Repeat step 1 and step 2 until there are only two clusters
• Step 1: Cluster a pair of leaves (taxa) by shortest distance
• Step 2: Recalculate a new average distance with the new
cluster and other taxa, and make a new distance matrix
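The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: the four-taxon distance matrix is hypothetical, new distances are size-weighted averages over cluster members, and the loop also joins the last two clusters so that it ends with a single root.

```python
from itertools import combinations


def upgma(taxa, dist):
    """Cluster taxa by UPGMA; dist maps frozenset({a, b}) to a distance."""
    clusters = {t: 1 for t in taxa}  # cluster label -> number of leaves
    d = dict(dist)
    merges = []                      # (cluster a, cluster b, node height)
    while len(clusters) > 1:
        # Step 1: pick the pair of clusters with the shortest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        new = (a, b)
        # Step 2: average distances from the new cluster to every other cluster.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
        merges.append((a, b, height))
    return next(iter(clusters)), merges


# Hypothetical ultrametric distances between four taxa.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(k): v for k, v in pairs.items()}

tree, merges = upgma(["A", "B", "C", "D"], dist)
print(tree)
for a, b, height in merges:
    print(a, b, height)
```

The node height is half the distance between the two joined clusters, which reflects the UPGMA assumption that both lineages have diverged by the same amount.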
• NJ method
• Overview of character-based and contingency-based methods
Bioinformatics (BIO213)

Session 17
Distance matrix
• Distance matrices can be generated using different methods.
• MEGA can compute a distance matrix.
• You can also do it manually.

• Let's try with some examples.
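One simple manual approach is the p-distance: the proportion of aligned sites at which two sequences differ. The sketch below assumes the sequences are already aligned and skips sites where either sequence has a gap; the three sequences are hypothetical toy data.

```python
from itertools import combinations


def p_distance(s1: str, s2: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)


# Hypothetical aligned DNA sequences.
seqs = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTTCGTAC",
    "seq3": "ACGAACGTCC",
}

matrix = {(a, b): p_distance(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}
for pair, value in matrix.items():
    print(pair, value)
```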


Neighbor joining method
• Neighbor-joining (Saitou and Nei, 1987) 
[Photographs: Masatoshi Nei, Wen-Hsiung Li, Huai-Kuang Tsai, Krishna Swamy]


Neighbor joining method
• It does not require that all lineages have diverged by equal amounts.
• The method is especially suited for datasets comprising lineages with
largely varying rates of evolution.
• It can be used in combination with methods that allow correction for
superimposed substitutions.
• Advantages
• is fast and thus suited for large datasets and for bootstrap analysis
• permits lineages with largely different branch lengths
• permits correction for multiple substitutions
• Disadvantages
• sequence information is reduced
• gives only one possible tree
• strongly dependent on the model of evolution used.
Example
• Suppose we have the following tree:

• B and D have accumulated mutations at a higher rate than A, so the
three-point criterion is violated and the UPGMA method cannot be used.
• Three-point criterion:
• For any three taxa A, B and C: dist(AC) <= max(dist(AB), dist(BC))
• Equivalently, the two greatest of the three distances are equal; UPGMA
assumes that the evolutionary rate is the same for all branches.
• In such a case, the neighbor-joining method is one of the
recommended methods.
Testing the three-point criterion for this tree
• Since the divergence of A and B, B has accumulated mutations
at a much higher rate than A.
• The three-point criterion: dist(BD) <= max(dist(BA), dist(AD))
• 10 <= max(5, 7)?
• False
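The same test can be run over every triple of taxa. This sketch uses the pairwise distances given for this example tree and flags a triple when its two greatest distances are not equal; the helper name is illustrative.

```python
from itertools import combinations

taxa = "ABCDEF"
# Lower-triangle distances for the example tree (row taxon vs. A, B, C, ...).
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
        "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
d = {}
for t, values in rows.items():
    for u, v in zip(taxa, values):
        d[t, u] = d[u, t] = v


def violates_three_point(x, y, z):
    """True if the two greatest of the three pairwise distances differ."""
    low, mid, high = sorted((d[x, y], d[x, z], d[y, z]))
    return mid != high


# The single inequality tested on the slide: 10 <= max(5, 7)
print(d["B", "D"] <= max(d["B", "A"], d["A", "D"]))

violations = [t for t in combinations(taxa, 3) if violates_three_point(*t)]
print(len(violations), "of", len(list(combinations(taxa, 3))), "triples violate the criterion")
```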
Construct the distance matrix for this tree

     A    B    C    D    E
B    5
C    4    7
D    7   10    7
E    6    9    6    5
F    8   11    8    9    8

We have in total 6 OTUs (N=6).


Step 1: Calculate the net divergence r(i)
for each OTU from all other OTUs

• r(A) = 5 + 4 + 7 + 6 + 8 = 30
• r(B) = 5 + 7 + 10 + 9 + 11 = 42
• r(C) = 4 + 7 + 7 + 6 + 8 = 32
• r(D) = 7 + 10 + 7 + 5 + 9 = 38
• r(E) = 6 + 9 + 6 + 5 + 8 = 34
• r(F) = 8 + 11 + 8 + 9 + 8 = 44
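The net divergences can be reproduced by summing each OTU's distances to all other OTUs. A minimal sketch using the example distance matrix; the variable names are illustrative.

```python
taxa = "ABCDEF"
# Lower-triangle distances of the example (row taxon vs. A, B, C, ...).
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
        "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
d = {}
for t, values in rows.items():
    for u, v in zip(taxa, values):
        d[t, u] = d[u, t] = v

# Net divergence r(i): sum of distances from OTU i to every other OTU.
r = {i: sum(d[i, j] for j in taxa if j != i) for i in taxa}
print(r)
```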
Step 2: Calculate a modified matrix for each pair of
OTUs using the formula:
M(ij) = d(ij) - [r(i) + r(j)] / (N - 2)

• For the case of the pair A, B, with r(A) = 30, r(B) = 42 and N = 6:
• M(AB) = 5 - (30 + 42) / 4 = -13

Calculate M(AC), M(AD) and M(AE) in the same way.
Modified matrix

     A       B       C       D       E
B  -13
C  -11.5   -11.5
D  -10     -10     -10.5
E  -10     -10     -10.5   -13
F  -10.5   -10.5   -11     -11.5   -11.5
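The modified matrix can be recomputed from the distances and net divergences with M(ij) = d(ij) - [r(i) + r(j)] / (N - 2). The sketch below also lists the pairs with the smallest M value; the variable names are illustrative.

```python
from itertools import combinations

taxa = "ABCDEF"
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
        "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
d = {}
for t, values in rows.items():
    for u, v in zip(taxa, values):
        d[t, u] = d[u, t] = v

N = len(taxa)
r = {i: sum(d[i, j] for j in taxa if j != i) for i in taxa}

# Rate-corrected distance for every pair of OTUs.
M = {(i, j): d[i, j] - (r[i] + r[j]) / (N - 2) for i, j in combinations(taxa, 2)}

smallest = min(M.values())
neighbors = [pair for pair, value in M.items() if value == smallest]
print(smallest, neighbors)
```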
Step 3: Choose as neighbors those two OTUs for
which Mij is the smallest.
• These are the pairs (A, B) and (D, E), both with M = -13.
• Let's take A and B as neighbors and we form a new node called U.
• Now we calculate the branch length from the internal node U to the
external OTUs A and B.

• S(AU) = d(AB) / 2 + [r(A) - r(B)] / [2(N - 2)]
• S(AU) = 5 / 2 + (30 - 42) / 8 = 1
• S(BU) = d(AB) - S(AU)
• S(BU) = 5 - 1 = 4
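The branch lengths from the new node U to A and B follow from S(AU) = d(AB) / 2 + [r(A) - r(B)] / [2(N - 2)] and S(BU) = d(AB) - S(AU). A minimal sketch with the example data; `branch_lengths` is an illustrative name.

```python
taxa = "ABCDEF"
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
        "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
d = {}
for t, values in rows.items():
    for u, v in zip(taxa, values):
        d[t, u] = d[u, t] = v

N = len(taxa)
r = {i: sum(d[i, j] for j in taxa if j != i) for i in taxa}


def branch_lengths(i, j):
    """Branch lengths from the new internal node to neighbors i and j."""
    s_i = d[i, j] / 2 + (r[i] - r[j]) / (2 * (N - 2))
    return s_i, d[i, j] - s_i


s_au, s_bu = branch_lengths("A", "B")
print(s_au, s_bu)
```

The two lengths differ because B has a larger net divergence than A, which is how neighbor-joining accommodates unequal rates.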
Step 4: Define new distances from U to each other
terminal node:
• d(CU) = (d(AC) + d(BC) - d(AB)) / 2 = (4 + 7 - 5) / 2 = 3
• d(DU) = (d(AD) + d(BD) - d(AB)) / 2 = (7 + 10 - 5) / 2 = 6
• d(EU) = (d(AE) + d(BE) - d(AB)) / 2 = (6 + 9 - 5) / 2 = 5
• d(FU) = (d(AF) + d(BF) - d(AB)) / 2 = (8 + 11 - 5) / 2 = 7
N = N - 1 = 5
The entire procedure is repeated starting at step 1
• Create a new matrix:

     U    C    D    E
C    3
D    6    7
E    5    6    5
F    7    8    9    8
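The reduction to the new matrix can be sketched as a function that replaces the joined neighbors with the node U, using d(kU) = (d(ik) + d(jk) - d(ij)) / 2 for every remaining OTU k. The function name `join_neighbors` is hypothetical.

```python
taxa = "ABCDEF"
rows = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
        "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
d = {}
for t, values in rows.items():
    for u, v in zip(taxa, values):
        d[t, u] = d[u, t] = v


def join_neighbors(d, taxa, i, j, new="U"):
    """Replace neighbors i and j with a new node and return the reduced matrix."""
    rest = [k for k in taxa if k not in (i, j)]
    # Distances among the untouched OTUs are carried over unchanged.
    reduced = {(a, b): d[a, b] for a in rest for b in rest if a != b}
    for k in rest:
        reduced[new, k] = reduced[k, new] = (d[i, k] + d[j, k] - d[i, j]) / 2
    return [new] + rest, reduced


new_taxa, new_d = join_neighbors(d, taxa, "A", "B")
print(new_taxa)
for k in new_taxa[1:]:
    print("U", k, new_d["U", k])
```

Repeating steps 1 to 4 on `new_taxa` and `new_d` continues the procedure with N = 5.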
