
Phylogenetic Analysis

based on two talks, by

Caro-Beth Stewart, Ph.D.

Department of Biological Sciences


University at Albany, SUNY
c.stewart@albany.edu

and Tal Pupko, Ph.D.


Faculty of Life Science
Tel-Aviv University
talp@post.tau.ac.il

Basedonlecturesby
What is phylogenetic analysis and why
should we perform it?

Phylogenetic analysis has two major components:

1. Phylogeny inference or tree building:
   the inference of the branching orders, and
   ultimately the evolutionary relationships,
   between taxa (entities such as genes,
   populations, species, etc.)
2. Character and rate analysis:
   using phylogenies as analytical frameworks
   for rigorous understanding of the evolution of
   various traits or conditions of interest

Common Phylogenetic Tree Terminology

[Figure: rooted tree with five taxa, A through E, labeled with the terms below]

Terminal Nodes: represent the TAXA (genes, populations, species, etc.)
used to infer the phylogeny
Branches or Lineages
Internal Nodes or Divergence Points: represent hypothetical
ancestors of the taxa
Ancestral Node or ROOT of the Tree

Phylogenetic trees diagram the evolutionary
relationships between the taxa
[Figure: rooted tree with tips, from top to bottom: Taxon B, Taxon C, Taxon A, Taxon D, Taxon E]

Vertical dimension: No meaning to the spacing between the taxa, or to
the order in which they appear from top to bottom.

Horizontal dimension: This dimension either can have no scale (for cladograms),
can be proportional to genetic distance or amount of change
(for phylograms or additive trees), or can be proportional
to time (for ultrametric trees or true evolutionary trees).

((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses


This says that B and C are more closely related to each other than either is to A,
and that A, B, and C form a clade that is a sister group to the clade composed of
D and E. If the tree has a time scale, then D and E are the most closely related.
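
The nested-parentheses notation can be read mechanically. The following is a minimal sketch, assuming labels only (no branch lengths or support values); the function names `parse_nested` and `clades` are illustrative and not taken from any phylogenetics package.

```python
def parse_nested(text):
    """Parse a label-only nested-parentheses tree into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
            return tuple(children)
        start = pos                       # otherwise read a taxon label
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def clades(tree):
    """Return the set of taxa below every internal node, innermost first."""
    found = []

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        members = frozenset().union(*(leaves(child) for child in n))
        found.append(members)
        return members

    leaves(tree)
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(sorted(clade))
```

Each printed group is a clade of the tree: B with C, then A with B and C, then D with E, and finally all five taxa.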
A few examples of what can be inferred
from phylogenetic trees built from DNA
or protein sequence data:

Which species are the closest living relatives of modern humans?
Did the infamous Florida Dentist infect his patients with HIV?
What were the origins of specific transposable elements?
Plus countless others...

Which species are the closest living
relatives of modern humans?

[Figure, left: molecular tree with tips Humans, Chimpanzees, Bonobos, Gorillas, Orangutans; time axis from 14 to 0 MYA]

Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA
hybridization all show that bonobos and chimpanzees are related
more closely to humans than either is to gorillas.

[Figure, right: pre-molecular tree with tips Gorillas, Chimpanzees, Bonobos, Orangutans, Humans; time axis from 15-30 to 0 MYA]

The pre-molecular view was that the great apes (chimpanzees, gorillas
and orangutans) formed a clade separate from humans, and that humans
diverged from the apes at least 15-30 MYA.
Did the Florida Dentist infect his patients with HIV?

Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local
HIV-infected People:

[Figure: tree with tips, from top to bottom: DENTIST, Patient C, Patient A, Patient G, Patient B, Patient E, Patient A, DENTIST, Local control 2, Local control 3, Patient F, Local control 9, Local control 35, Local control 3, Patient D]

Yes: The HIV sequences from these patients (A, B, C, E, and G) fall within
the clade of HIV sequences found in the dentist.
No: Patient F and Patient D.
From Ou et al. (1992) and Page & Holmes (1998)
A few examples of what can be learned
from character analysis using phylogenies
as analytical frameworks:

When did specific episodes of positive Darwinian selection occur during
evolutionary history?
Which genetic changes are unique to the human lineage?
What was the most likely geographical location of the common ancestor of
the African apes and humans?
Plus countless others...
The number of unrooted trees increases in a greater
than exponential manner with number of taxa
[Figure: unrooted trees drawn for 2, 3, 4, 5, and 6 taxa (A through F)]

(2N - 5)!! = # unrooted trees for N taxa
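
How fast the double-factorial formula grows is easy to check numerically. The sketch below evaluates (2N - 5)!! for a few values of N; the function names are illustrative, and the formula is applied only for N >= 3.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (or 2)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_taxa):
    """Number of unrooted, fully bifurcating trees for n_taxa >= 3."""
    if n_taxa < 3:
        raise ValueError("the formula is used here for 3 or more taxa")
    return double_factorial(2 * n_taxa - 5)


for n in (3, 4, 5, 6, 10, 20):
    print(n, num_unrooted_trees(n))
```

Going from 4 to 6 taxa raises the count from 3 to 105, and 10 taxa already give more than two million trees.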

Inferring evolutionary relationships between
the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the
string at the root and tug on it until the ends of the string (the taxa) fall
opposite the root:

[Figure: an unrooted tree of taxa A, B, C, and D with a root position marked, and the rooted tree obtained from it]

Note that in this rooted tree, taxon A is no more closely related to taxon B
than it is to C or D.

Now, try it again with the root at another position:

[Figure: the same unrooted tree with the root marked at a different position, and the rooted tree obtained from it]

Note that in this rooted tree, taxon A is most closely related to taxon B,
and together they are equally distantly related to taxa C and D.

An unrooted, four-taxon tree theoretically can be rooted in five
different places to produce five different rooted trees

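[Figure: the unrooted tree 1, joining taxa A, B, C, and D, with its five branches numbered 1 to 5 as possible root positions]

Order of the tips, top to bottom, in each resulting rooted tree:

Rooted tree 1a   Rooted tree 1b   Rooted tree 1c   Rooted tree 1d   Rooted tree 1e
      B                A                A                C                D
      A                B                B                D                C
      C                C                C                A                A
      D                D                D                B                B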
These trees show five different evolutionary relationships among the taxa!

There are two major ways to root trees:
By outgroup:
Uses taxa (the outgroup) that are known to fall outside of the group of
interest (the ingroup). Requires some prior knowledge about the
relationships among the taxa. The outgroup can either be species (e.g.,
birds to root a mammalian tree) or previous gene duplicates (e.g.,
α-globins to root β-globins).

By midpoint or distance:
Roots the tree at the midway point between the two most distant taxa in
the tree, as determined by branch lengths. Assumes that the taxa are
evolving in a clock-like manner. This assumption is built into some of the
distance-based tree building methods.

[Figure: unrooted tree of taxa A, B, C, and D with branch lengths 10, 3, 2, 2, and 5]
d (A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
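
The midpoint calculation can be sketched in a few lines. The slide's figure did not survive extraction, so the topology below is an assumption: A (length 10) and B (2) join at a hypothetical internal node "x", C (2) and D (5) join at a hypothetical node "y", and the internal branch x-y has length 3. That assignment reproduces d(A,D) = 10 + 3 + 5 = 18.

```python
from itertools import combinations

# Assumed tree; "x" and "y" are hypothetical internal nodes.
EDGES = {("A", "x"): 10, ("B", "x"): 2, ("x", "y"): 3, ("C", "y"): 2, ("D", "y"): 5}


def adjacency(edges):
    adj = {}
    for (u, v), length in edges.items():
        adj.setdefault(u, []).append((v, length))
        adj.setdefault(v, []).append((u, length))
    return adj


def find_path(adj, start, goal, seen=None):
    """Return the (from, to, length) steps leading from start to goal."""
    seen = seen or {start}
    if start == goal:
        return []
    for nxt, length in adj[start]:
        if nxt not in seen:
            rest = find_path(adj, nxt, goal, seen | {nxt})
            if rest is not None:
                return [(start, nxt, length)] + rest
    return None


def midpoint_root(edges, taxa):
    """Find the two most distant taxa and the point halfway between them."""
    adj = adjacency(edges)
    paths = {pair: find_path(adj, *pair) for pair in combinations(taxa, 2)}
    pair = max(paths, key=lambda p: sum(step[2] for step in paths[p]))
    total = sum(step[2] for step in paths[pair])
    walked = 0
    for u, v, length in paths[pair]:
        if walked + length >= total / 2:
            # the root sits on branch (u, v), this far from node u
            return pair, total, (u, v), total / 2 - walked
        walked += length


print(midpoint_root(EDGES, ["A", "B", "C", "D"]))
```

Under the assumed topology the root falls on the branch leading to A, 9 units from A and 1 unit from the node it shares with B.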
Each unrooted tree theoretically can be rooted
anywhere along any of its branches

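[Figure: unrooted trees for 4, 5, and 6 taxa (A through F), with every branch a possible root position]

(# unrooted trees) x (# branches per unrooted tree, 2N - 3) = # rooted trees

(2N - 3)!! = # rooted trees for N taxa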
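
The rooted-tree count follows from the unrooted one because an unrooted bifurcating tree with N taxa has 2N - 3 branches, and a root can be placed on any of them. Written out as a short derivation, with two worked values:

```latex
\[
\underbrace{(2N-5)!!}_{\text{unrooted trees}} \times
\underbrace{(2N-3)}_{\text{branches per tree}}
= (2N-3)!! \quad \text{rooted trees}
\]
\[
N = 4:\ 3 \times 5 = 15, \qquad N = 5:\ 15 \times 7 = 105
\]
```

The N = 4 case matches the five rootings of a single four-taxon unrooted tree, repeated for each of the three unrooted trees.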


Molecular phylogenetic tree building methods:
Are mathematical and/or statistical methods for inferring the divergence
order of taxa, as well as the lengths of the branches that connect them.
There are many phylogenetic methods available today, each having
strengths and weaknesses. Most can be classified as follows:

                        COMPUTATIONAL METHOD
DATA TYPE       Optimality criterion      Clustering algorithm

Characters      PARSIMONY
                MAXIMUM LIKELIHOOD

Distances       MINIMUM EVOLUTION         UPGMA
                LEAST SQUARES             NEIGHBOR-JOINING
Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA
or protein sequences, directly during tree inference.
Taxa Characters
Species A ATGGCTATTCTTATAGTACG
Species B ATCGCTAGTCTTATATTACA
Species C TTCACTAGACCTGTGGTCCA
Species D TTGACCAGACCTGTGGTCCG
Species E TTGACCAGTTCTCTAGTTCG

Distance-based methods: Transform the sequence data into pairwise
distances (dissimilarities), and then use the matrix during tree building.

              A      B      C      D      E
Species A    ----   0.20   0.50   0.45   0.40
Species B    0.23   ----   0.40   0.55   0.50
Species C    0.87   0.59   ----   0.15   0.40
Species D    0.73   1.12   0.17   ----   0.25
Species E    0.59   0.89   0.61   0.31   ----

Example 1 (above the diagonal): Uncorrected "p" distance
(= observed percent sequence difference)
Example 2 (below the diagonal): Kimura 2-parameter distance
(estimate of the true number of substitutions between taxa)
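
Both kinds of distance can be recomputed from the aligned sequences on this slide. The sketch below uses the standard Kimura 2-parameter formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the observed proportions of transitions and transversions; the function names are illustrative.

```python
import math

SEQS = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def p_distance(s1, s2):
    """Uncorrected proportion of sites that differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def k2p_distance(s1, s2):
    """Kimura 2-parameter estimate of substitutions per site."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    p = transitions / len(s1)
    q = transversions / len(s1)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


for x, y in [("A", "B"), ("A", "C"), ("C", "D")]:
    print(x, y, round(p_distance(SEQS[x], SEQS[y]), 2),
          round(k2p_distance(SEQS[x], SEQS[y]), 2))
```

The corrected distance is always at least as large as the uncorrected one, and the gap widens for more divergent pairs because multiple substitutions at the same site are hidden from direct counting.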
Computational methods for finding optimal trees:

Exact algorithms: "Guarantee" to find the optimal or "best" tree for the
method of choice. Two types used in tree building:
Exhaustive search: Evaluates all possible unrooted
trees, choosing the one with the best score for the method.
Branch-and-bound search: Eliminates the parts of the
search tree that only contain suboptimal solutions.

Heuristic algorithms: Approximate or quick-and-dirty methods that attempt
to find the optimal tree for the method of choice, but cannot guarantee to
do so. Heuristic searches often operate by hill-climbing methods.

Exact searches become increasingly difficult, and
eventually impossible, as the number of taxa increases:

[Figure: unrooted trees drawn for 2, 3, 4, 5, and 6 taxa (A through F)]

(2N - 5)!! = # unrooted trees for N taxa

Heuristic search algorithms are input order dependent and can get
stuck in local minima or maxima.

[Figure: two score landscapes, one searched for the GLOBAL MINIMUM and one for the GLOBAL MAXIMUM; a single search ends in a local minimum or local maximum]

Rerunning heuristic searches using different input orders of taxa can help
find global minima or maxima.

[Figure: the same two landscapes, with repeated searches reaching the GLOBAL MINIMUM and the GLOBAL MAXIMUM]
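
The effect of restarting can be illustrated on a toy landscape. In this sketch the scores are hypothetical, positions along a line stand in for trees, and the starting position stands in for the input order of taxa. A real tree search moves through tree rearrangements, so this shows only the hill-climbing idea, not a phylogenetic search.

```python
# Hypothetical scores: a local maximum (5) at index 2, the global maximum (12) at index 7.
LANDSCAPE = [1, 3, 5, 4, 2, 6, 9, 12, 8, 3]


def hill_climb(scores, start):
    """Move to the better neighbor until no neighbor improves the score."""
    pos = start
    while True:
        neighbors = [i for i in (pos - 1, pos + 1) if 0 <= i < len(scores)]
        best = max(neighbors, key=lambda i: scores[i])
        if scores[best] <= scores[pos]:
            return pos            # stuck: a local (possibly global) maximum
        pos = best


def restart_search(scores, starts):
    """Run hill climbing from several starts and keep the best end point."""
    return max((hill_climb(scores, s) for s in starts), key=lambda i: scores[i])


print(hill_climb(LANDSCAPE, 0))               # ends on the local maximum
print(restart_search(LANDSCAPE, [0, 4, 9]))   # restarts reach the global maximum
```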

Parsimony methods:
Optimality criterion: The most-parsimonious tree is the one that
requires the fewest evolutionary events (e.g., nucleotide
substitutions, amino acid replacements) to explain the sequences.

Advantages:
Are simple, intuitive, and logical (many possible by pencil-and-paper).
Can be used on molecular and non-molecular (e.g., morphological) data.
Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy)
Can be used for character (can infer the exact substitutions) and rate analysis.
Can be used to infer the sequences of the extinct (hypothetical) ancestors.

Disadvantages:
Are simple, intuitive, and logical (derived from Medieval logic, not statistics!)
Can be fooled by high levels of homoplasy (the same change arising independently in different lineages).
Can become positively misleading in the Felsenstein Zone.

[See Stewart (1993) for a simple explanation of parsimony analysis, and Swofford
et al. (1996) for a detailed explanation of various parsimony methods.]
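
Counting the events a tree requires can be sketched with Fitch's algorithm for a single character. This is one common way to score a fixed tree under parsimony; the slides do not say which variant they have in mind. The character used is the first column of the five-species alignment shown earlier, and trees are written as nested pairs.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes)."""
    if isinstance(tree, str):                 # a taxon: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                                # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1   # one change needed


# First alignment column: Species A and B have "A"; Species C, D, and E have "T".
SITE_1 = {"A": "A", "B": "A", "C": "T", "D": "T", "E": "T"}

tree_1 = (("A", ("B", "C")), ("D", "E"))
tree_2 = (("A", "B"), ("C", ("D", "E")))
print(fitch(tree_1, SITE_1)[1], fitch(tree_2, SITE_1)[1])
```

For this one site the second tree needs fewer changes than the first. A full parsimony score sums such counts over every site in the alignment.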
Branch and Bound

Tal Pupko, Tel-Aviv University

There are many trees...

We cannot go over all the trees. We will try to find
a way to find the best tree.
There are approximate solutions. But what if we
want to make sure we find the global maximum?

There is a way that is more efficient than just going over all
possible trees. It is called BRANCH AND BOUND
and is a general technique in computer science
that can be applied to phylogeny.

BRANCH AND BOUND

To exemplify the BRANCH AND BOUND (BNB)
method, we will use an example not connected to
evolution. Later, when the general BNB method is
understood, we will see how to apply this method
to finding the maximum parsimony (MP) tree. We will present the
traveling salesperson path problem (TSP).

THE TSP PROBLEM
(especially adapted to Israel).
A guard has to visit n check-points whose location
on a map is known. The problem is to find the
shortest path that goes through all points exactly
once (no need to come back to starting point).

Naïve approach: (say for 5 points). You have 5
starting points. For each such starting point you
have 4 next steps. For each such combination of
starting point and first step, you have 3 possible
second steps, etc. All together we have 5*4*3*2*1 = 5!
possible solutions.
THE TSP TREE

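[Figure: search tree of partial paths. The root branches to the five possible starting points 1 2 3 4 5; each starting point branches to the four remaining points; each of those branches to the three points still unvisited, and so on until every point has been visited. Only one branch is drawn in full at each level.]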

THE SHP NAÏVE APPROACH

Each solution can be represented as a permutation:

(1,2,3,4,5)
(1,2,3,5,4)
(1,2,4,3,5)
(1,2,4,5,3)
(1,2,5,3,4)

We can go over the list and find the one giving the
best score (the shortest path).
THE SHP NAÏVE APPROACH

However, for 15 points, for example, there are
15! = 1,307,674,368,000 possible solutions.

The number of solutions grows too fast with the
number of points for this to be practical.
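
A minimal sketch of the naïve approach follows. The check-point coordinates are hypothetical, and distances are taken as straight-line (Euclidean), which the slides do not specify.

```python
import math
from itertools import permutations

# Hypothetical check-point coordinates (five points on a line).
POINTS = [(0, 0), (3, 0), (1, 0), (4, 0), (2, 0)]


def path_length(points, order):
    """Total length of the path that visits the points in the given order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def shortest_path_naive(points):
    """Score every permutation and keep the shortest one."""
    orders = permutations(range(len(points)))
    best = min(orders, key=lambda order: path_length(points, order))
    return best, path_length(points, best)


order, length = shortest_path_naive(POINTS)
print(order, length)
print(math.factorial(len(POINTS)), "orders scored;", math.factorial(15), "for 15 points")
```

With five points the loop scores 120 orders almost instantly. The same loop over 15 points would have to score more than a trillion.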

A TSP GREEDY HEURISTIC

Start from a random point. Go to the closest point.
From there, go to the closest point not yet visited, and so on.

This approach doesn't work so well
(but a reasonably close heuristic, based on simulated
annealing, will be presented in a couple of lectures).
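
A small sketch shows how the greedy rule can go wrong. The coordinates are hypothetical, distances are Euclidean, and the start is passed in explicitly rather than drawn at random so that the run is repeatable.

```python
import math

# Hypothetical check-points on a line, at x = 0, 1, -2, and 3.
POINTS = [(0, 0), (1, 0), (-2, 0), (3, 0)]


def greedy_path(points, start):
    """Nearest-neighbor heuristic: always walk to the closest unvisited point."""
    path = [start]
    unvisited = set(range(len(points))) - {start}
    while unvisited:
        here = points[path[-1]]
        nearest = min(unvisited, key=lambda i: math.dist(here, points[i]))
        path.append(nearest)
        unvisited.remove(nearest)
    length = sum(math.dist(points[a], points[b]) for a, b in zip(path, path[1:]))
    return path, length


# Sweeping along the line from x = -2 to x = 3 costs 5; greedy from x = 0 does worse.
print(greedy_path(POINTS, start=0))
```

Starting at x = 0, the heuristic heads right first and must then double back across the whole line, so its path is longer than the simple left-to-right sweep.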

BNB SOLUTION TO SHP

[Figure: the TSP search tree of partial paths, with one partial path marked part-way down the tree]

Shortest path found so far = 15
Score here already 16: no point in expanding the rest of the subtree
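
The pruning rule in the figure can be sketched as a depth-first search that abandons a partial path as soon as its length reaches the best complete path found so far. The coordinates are hypothetical and distances are Euclidean; the bound used is simply the length of the partial path, which is valid because distances are never negative.

```python
import math

# Hypothetical check-point coordinates (five points on a line).
POINTS = [(0, 0), (3, 0), (1, 0), (4, 0), (2, 0)]


def bnb_shortest_path(points):
    """Branch and bound over partial paths; returns (length, path)."""
    best_len, best_path = math.inf, None

    def extend(path, length, remaining):
        nonlocal best_len, best_path
        if length >= best_len:        # bound: this subtree cannot beat the record
            return
        if not remaining:             # complete path that beats the record
            best_len, best_path = length, path
            return
        for nxt in sorted(remaining): # branch: try every unvisited point next
            step = math.dist(points[path[-1]], points[nxt]) if path else 0.0
            extend(path + [nxt], length + step, remaining - {nxt})

    extend([], 0.0, frozenset(range(len(points))))
    return best_len, best_path


print(bnb_shortest_path(POINTS))
```

The result is guaranteed to be optimal, as with the exhaustive search, but many partial paths are discarded without ever being completed.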
Back to finding the MP tree

Finding the MP tree is NP-Hard (as we will see shortly)

BNB helps, though it is still exponential

The MP search tree
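[Figure: the search starts from the single unrooted tree of taxa 1, 2, and 3. Taxon 4 is added to each of its 3 branches in turn (first case shown: 4 is added to branch 1), giving three 4-taxon trees.]

5 is added to branch 2.
There are 5 branches in each 4-taxon tree, so taxon 5 can be added in 5 places.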
The MP search tree

[Figure: the search tree annotated with parsimony scores. The first child of the root is the tree in which 4 is added to branch 1.]

3-taxon tree:     30
4-taxon trees:    43                55                39
5-taxon trees:    52 54 52 53 58    61 56 59 61 69    53 51 42 47 47

MP-BNB

Adding a taxon can never lower the parsimony score, so the score of a
partial tree is a lower bound on the scores of all the trees below it.
Searching from left to right:

The five trees under 43 are scored. Best (minimum) value = 52.
The subtree under 55 is skipped: 55 > 52, so no tree below it can beat the record.
The five trees under 39 are scored. Best record = 52, then 51, then 42.

Total # trees visited: 14
Best TREE: MP score = 42
Order of Evaluation Matters
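Evaluate all 3 four-taxon trees first (30 -> 43, 55, 39), then search the
subtree with the lowest score (39) first: 53 51 42 47 47.
The bound after searching this subtree will be 42, so the subtrees under
43 and 55 are never expanded.

Total trees visited: 9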
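
The two visit counts can be reproduced with a short sketch. The scores are the ones shown on the slides. The counting convention is an assumption chosen to match the slide totals: a tree counts as visited once its score has been computed, and all children of a node are scored when that node is expanded.

```python
# Search tree from the slides: each node is (score, children).
TREE = (30, [
    (43, [(52, []), (54, []), (52, []), (53, []), (58, [])]),
    (55, [(61, []), (56, []), (59, []), (61, []), (69, [])]),
    (39, [(53, []), (51, []), (42, []), (47, []), (47, [])]),
])


def branch_and_bound(root, best_first=False):
    """Return (best complete-tree score, number of trees scored)."""
    best = float("inf")
    visited = 1                                  # the root tree is scored

    def expand(node):
        nonlocal best, visited
        _, children = node
        visited += len(children)                 # score every child
        order = sorted(children, key=lambda c: c[0]) if best_first else children
        for score, grandchildren in order:
            if not grandchildren:                # complete tree: update the record
                best = min(best, score)
            elif score < best:                   # partial tree may still win
                expand((score, grandchildren))
            # otherwise the whole subtree is pruned

    expand(root)
    return best, visited


print(branch_and_bound(TREE))                    # left-to-right order
print(branch_and_bound(TREE, best_first=True))   # most promising subtree first
```

Both orders find the same best score. Searching the lowest-scoring subtree first sets a tight bound early, so more of the tree is pruned.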

And Now

Maximum Parsimony is
Computationally Intractable

Felsenstein's Dynamic Programming
Algorithm for tiny maximum likelihood

and more, time permitting

