
Phylogenetics

S2/109
Systematic
Systematics is used to determine the evolutionary history and
relationships among organisms, drawing on a wide variety of
sources including paleontology, morphology, and molecular biology.

• Evolutionary (Synthetic)
• Phenetics (Numerical Taxonomy)
• Phylogenetic (Cladistic)
Phylogenetics is the study of evolutionary relationships and
determines how a family might have been
derived during evolution.
Objective:
To discover all of the branching relationships
in the tree and the branch lengths.
S3/109
Why do we need to study phylogenies?

• To know the origin of organisms (i.e. How we evolved )

IE68 - biological databases - phylogeny


S4/109
Clusters of Orthologous Groups
• COGs are generated by comparing predicted and known proteins in all
completely sequenced microbial genomes to infer sets of orthologs.
• Each COG consists of a group of proteins found to be orthologous across at least
three lineages and likely corresponds to an ancient conserved domain.
• It provides a fast alternative for describing the functional characteristics of one
microbe or a community of microbes because the database is significantly smaller.
• The COG database currently used in CloVR (Cloud Virtual Resource) is
composed of 144k proteins and over 4800 COGs.
• Each COG has a specific functional description:
• Cellular Processes And Signaling
• Information Storage And Processing
• Metabolism

S5/109
Determination of COGs for a set of proteins
• Input: a single text file containing the amino acid sequences of the proteins in FASTA format.
NOTE - All proteins must belong to a single species, but may come from different strains.
• There are two approaches:
1. The first approach is simple.
i. Run BLASTp of the query sequences against the COG database available at the following
URL: ftp://ftp.ncbi.nlm.nih.gov/pub/kristensen/thousandgenomespogs/blastdb/
ii. Then look for the best match for each query and see which COG group it belongs to.
2. The second approach is more complex:
i. Install the COG software (COGsoft.201204.tar) from the following link:
URL: ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/
ii. Then run an all-against-all PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool).
iii. Then convert the data into a format acceptable to COGnitor by using the different modules of the COG software.
iv. Every step in this approach takes time.

S6/109
Synteny
Synteny is the deduction that two or
more genomic regions are derived
from a single ancestral genomic
region and lie in such proximity that they
may be subject to linkage.

• Two types of synteny


1. SYNTENIC CORRELATION
• It is a measure of genomic conservation.
2. SYNTENIC ASSOCIATION
• It measures the proportion of errors made in assigning a gene to a chromosome in one species that can be
eliminated by knowing which chromosome the gene belongs to in the other.

S7/109
SyMAP (Synteny Mapping and Analysis Program)

• It is a software package for
  • detecting,
  • displaying, and
  • querying
  syntenic relationships between sequenced chromosomes.

• SyMAP can run as a standalone desktop application or from the web.

S8/109
Phylogenetic trees

S9/109
Trees terminology
• A tree is composed of nodes and branches, with the outer branches
representing the taxa.
• Taxa (singular: taxon): here, the sequences being compared.
• Operational taxonomic unit (OTU): the samples selected
to be used in a study.
• Node: a branch point in a tree
• Branch: defines the relationship between the taxa in terms
of descent and ancestry
• Branch length: represents the number of character changes
that have occurred in the branch.
• Topology: the branching pattern of the tree
• Root: the common ancestor of all taxa
• Clade: a group of two or more taxa that includes both their
common ancestor and all its descendants.

S10/109
Relationship between Hierarchy and Phylogeny

S11/109
Trees

Classification based on

• Rooting: Rooted, Unrooted
• Branching: Bifurcating, Multifurcating
• Groups: Monophyletic, Polyphyletic, Paraphyletic
• Construction of tree: Phenetic, Cladistic
• Distance: Cladogram, Additive, Ultrametric
S12/109
Classification of tree based on Rooting
ROOTED TREE
• It has a node that is identified as the root, from which all other
nodes ultimately descend.
• It has a direction that corresponds to evolutionary time:
• the closer a node is to the root of the tree, the older it is in time.
• It allows us to define ancestor-descendant relationships between
nodes.

UNROOTED TREES
• They lack a root, and hence do not specify the direction of
evolution.
• They say nothing about ancestors and descendants.
• Sequences that are adjacent on an unrooted tree need not
be evolutionarily closely related.
S13/109
Unrooted Trees
• Most phylogenetic methods produce unrooted trees, so there are two ways to
root an unrooted tree:
• 1. Outgroup Method
• 2. Molecular Clock Hypothesis

Outgroup Method
• It analyses a group of sequences known a priori to be external to the group under study.

Molecular Clock Hypothesis
• All lineages are supposed to have evolved at the same speed since
divergence from their common ancestor.
• The root is at the point equidistant from all tree leaves.

S14/109
Rooting a Tree
STEP 1 - Establish a direction for time (tree-building algorithms are completely time-reversible).
STEP 2 - Include one or more sequences, known as an “outgroup”, that are known to be more
distantly related to all the others than the others are to each other.
STEP 3 - Place the root on the branch that connects the outgroup to the rest of the
sequences, halfway between them.

S15/109
Possible evolutionary trees

Distinct trees for 8 taxa
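
The number of distinct bifurcating trees grows very quickly with the number of taxa. A minimal Python sketch, assuming the standard counting formulas (2n-5)!! for unrooted trees and (2n-3)!! for rooted trees; the function names are illustrative and not part of the slides:

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa: int) -> int:
    """Number of distinct rooted bifurcating trees: (2n - 3)!! for n >= 2."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (4, 8, 10):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For 8 taxa this gives 10,395 unrooted and 135,135 rooted trees, which is why exhaustive tree searches quickly become impractical.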


S16/109
Classification of tree based on Branching
Bifurcating
Exactly two descendants per interior node

Multifurcating
More than two descendants per interior node

S17/109
Classification of tree based on Groups
Monophyletic group
• Includes an ancestor and all of its descendants
• Example: Taxon A, B and C share a common ancestor

Paraphyletic group
• Includes an ancestor and some, but not all, of its descendants
• Example: Taxon A is highly derived and looks very different from B, C, and the ancestor

Polyphyletic group
• Includes two convergent descendants but not their common ancestor
• Example: Taxon A and C share similar traits through convergent evolution
S18/109
S19/109
Classification of tree based on construction
Phenetic methods
• They construct phenograms by considering the current states of
characters without regard to the evolutionary history that brought the
species to their current phenotypes
• Phenograms are based on overall similarity

Cladistic methods
• They construct cladograms that rely on assumptions about ancestral
relationships as well as on current data
• Cladograms are based on character evolution, with characters like
• eye color (blue, brown, green)
• nucleotide bases (A, C, T, G)
• codons (ACC, CGT, GAT, etc.)

S20/109
Phenetics

Phenetics (overall similarity)

S21/109
Classification of tree based on distance
CLADOGRAM
• It shows the branching pattern only.
• Branch lengths have no meaning but sometimes differ for artistic effect.

PHYLOGRAM
• Also known as an Additive or Metric tree.
• Branch lengths are proportional to evolutionary distance.

ULTRAMETRIC TREE
• Branch lengths are proportional to time.
• This is the molecular clock model, which implies that evolution occurs
at a constant rate in all species.

S22/109
“Splits / partitions”
• A split arises when a specific internal branch
is removed: the taxa fall into two groups.
• When you list the species on one side of the split, the
other side is automatically defined.

Generating a Set of Splits
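
As an illustration, the splits of a small tree can be enumerated programmatically. The sketch below is only one possible approach; it assumes the tree is written as nested Python tuples with taxon names as strings, a representation chosen here for brevity and not taken from the slides:

```python
def leaves(node):
    """Return the set of taxon names found below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Return the non-trivial splits of a tree given as nested tuples.

    Each split is a frozenset holding the two groups of taxa obtained
    by removing one internal branch.
    """
    all_taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        # Keep only splits with at least two taxa on each side.
        if len(side) >= 2 and len(other) >= 2:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


if __name__ == "__main__":
    example = (("A", "B"), ("C", "D"), "E")
    for split in splits(example):
        print(" | ".join(",".join(sorted(group)) for group in split))
```

For the five-taxon example the two internal branches give the splits AB | CDE and CD | ABE.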

S23/109
Phylogenetics approach

S24/109
Step 1 : Sequence analysis

Step 2: Multiple sequence alignment

Step 3: Phylogenetic analysis

S25/109
Phylogenetic analysis
1. Data selection.
2. Data comparison
3. Selection of a data model
4. Selection of an evolutionary model
5. Tree-building
6. Tree evaluation

S26/109
1. Data selection
Parameters to be considered:
• Input data must be homologous!
• Similar distribution
• Number of character states, size of the dataset
Type of data:
• Morphological characters
• Physiological characters
• Gene order
• Sequence data (nucleotide and amino acid)

2. Data comparison
• Choose a suitable alignment method
MSA methods:
• ClustalW (very fast)
• Muscle (very fast)
• MAFFT (fast)
• Probcons
• T-coffee
NOTE - Highly diverged sequences:
• Domain/family predictions
• Structures
S27/109
3. Selection of a data model
Parameters to be considered:
• Each position in the alignment should be homologous!
• Missing data (in some OTUs)
• Number of characters
Two categories:
• Numerical data
1. Distance between objects, i.e. evolutionary distance between two species
• Character data
1. Each character has a finite number of states
2. E.g. number of legs = 1, 2, 4

4. Selection of an evolutionary model
Parameters to be considered:
1. Frequencies of aa exchange during evolution
2. Presence of invariable sites
• Phylogenetic tree-building presumes different evolutionary models
• The model chosen influences the outcome of the analysis and should be
considered in the interpretation of the analysis results.
S28/109
5. Tree Building methods

S29/109
Distance based
General principle:
Sequence alignment
↓ (1)
Matrix of evolutionary distances between sequence pairs
↓ (2)
(unrooted) tree

STEP 1 - Compute distances

STEP 2 - Tree-building by
• Fitch-Margoliash (FM)
• Neighbor Joining (NJ)
• Unweighted pair-group method using arithmetic averages (UPGMA)
S30/109
Step 1: Compute distances
Measure for the extent of sequence divergence:
p distance: $\hat{p} = n_d / n$
$\hat{p}$ = proportion (p distance)
$n_d$ = number of aa differences
$n$ = number of aa used

• Relationship of p with t (time)
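
The p distance can be computed directly from two aligned sequences. A minimal sketch, assuming the sequences are already aligned to equal length and that positions containing a gap character `-` are skipped (the gap handling is an assumption, not stated on the slide):

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: ignore any position where either sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    n_d = sum(1 for a, b in pairs if a != b)  # number of differences
    return n_d / len(pairs)                   # p = n_d / n


if __name__ == "__main__":
    print(p_distance("ACGTACGT", "ACGTTCGA"))  # 2 differences over 8 sites
```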


S31/109
Step 2: Tree-building

S32/109
Fitch-Margoliash Method
• This method generates an unrooted additive tree:
• The branch lengths are unequal; they are proportional to
evolutionary distance.
• The method works on small, 3-branch trees.
STEP 1 - determine the branch lengths and the position of the
internal node from the distances.
[Figure: STEP 1 to STEP 3 of the Fitch-Margoliash construction]
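
For a 3-branch tree, the three branch lengths follow directly from the three pairwise distances. The sketch below shows this three-point calculation only, not the full Fitch-Margoliash procedure; the function name and the example distances are illustrative:

```python
def three_point_branch_lengths(d_ab: float, d_ac: float, d_bc: float):
    """Branch lengths (a, b, c) from the internal node to taxa A, B and C.

    On an additive 3-branch tree: d_ab = a + b, d_ac = a + c, d_bc = b + c.
    Solving these three equations gives the expressions below.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


if __name__ == "__main__":
    print(three_point_branch_lengths(7, 13, 8))  # (6.0, 1.0, 7.0)
```

With more than three taxa, the method repeatedly treats the closest pair as A and B and the average of the remaining taxa as C.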

S33/109
Neighbor Joining (NJ)
Principles:
• It is a bottom-up clustering method.
• Tree topology and branch lengths are estimated from a genetic distance
matrix
• Neighbours are defined as taxa connected by a single node in an unrooted
tree.
• Closest neighbours are successively joined by a new node until the tree is
resolved.
• It results in a single unrooted tree with branch length estimates, which needs
to be rooted by the outgroup method.
• NJ is a fast method, even for hundreds of sequences.

S34/109
The Neighbor-Joining Method: algorithm
• Start from a star topology and progressively construct a tree as follows:
• Step 1: Use the distances d measured between the N sequences.
• Step 2: For all pairs i and j, consider the tree topology in which i and j are joined, and compute
$S_{i,j}$, the sum of all “best” branch lengths. (Saitou and Nei have found a simple
way to compute $S_{i,j}$.)
• Step 3: Retain the pair (i,j) with the smallest $S_{i,j}$ value. Group i and j in the tree.
• Step 4: Compute new distances d between N-1 objects: the pair (i,j) and the N-2
remaining sequences.
$d_{(i,j),k} = (d_{i,k} + d_{j,k}) / 2$
• Step 5: Return to step 1 as long as N ≥ 4. When N = 3, an (unrooted) tree is
obtained.

S35/109
Example
• We are using only 4 taxa.
• Step 1 is to calculate the neighbor distances Q, using the following equation.

$Q_{ij} = (N-2)\,d_{ij} - \sum_{k=1}^{N} d_{ik} - \sum_{k=1}^{N} d_{jk}$

dist   A    B    C    D
A      0
B      7    0
C      13   8    0
D      17   12   14   0

Q score, e.g. for A-B: (N-2)*AB – (AB+AC+AD) – (AB+BC+BD)

Pair   Calculation                               Q
A-B    (4-2)*7 – (7+13+17) – (7+8+12)            -50
A-C    (4-2)*13 – (7+13+17) – (13+8+14)          -46
A-D    (4-2)*17 – (7+13+17) – (17+12+14)         -46
B-C    (4-2)*8 – (7+8+12) – (13+8+14)            -46
B-D    (4-2)*12 – (7+8+12) – (17+12+14)          -46
C-D    (4-2)*14 – (13+8+14) – (17+12+14)         -50

-50 is the lowest score, and we could use either A-B or C-D.
We arbitrarily choose A-B to join first.
S36/109
$d_{AY} = \frac{1}{2} d_{AB} + \frac{1}{2(N-2)} \left[ \sum_{k=1}^{N} d_{Ak} - \sum_{k=1}^{N} d_{Bk} \right]$

Branch   Calculation                                      d
A-Y      (1/2)*7 + {1/4*[(7+13+17) – (7+8+12)]}           6
B-Y      7 – 6                                            1
C-X      (1/2)*14 + {1/4*[(13+8+14) – (17+12+14)]}        5
D-X      14 – 5                                           9
X-Y                                                       2
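
The Q scores and branch lengths of this example can be checked with a few lines of code. This is a minimal sketch of the two formulas only, not a complete neighbor-joining implementation; storing distances in a dictionary keyed by `frozenset` pairs is an illustrative choice:

```python
from itertools import combinations


def row_sum(dist, taxa, i):
    """Sum of the distances from taxon i to every other taxon."""
    return sum(dist[frozenset((i, k))] for k in taxa if k != i)


def q_matrix(dist, taxa):
    """Q_ij = (N - 2) * d_ij - sum_k d_ik - sum_k d_jk for every pair."""
    n = len(taxa)
    return {
        (i, j): (n - 2) * dist[frozenset((i, j))]
        - row_sum(dist, taxa, i)
        - row_sum(dist, taxa, j)
        for i, j in combinations(taxa, 2)
    }


def branch_lengths(dist, taxa, i, j):
    """Lengths of the branches from i and from j to their new common node."""
    n = len(taxa)
    d_ij = dist[frozenset((i, j))]
    d_i = 0.5 * d_ij + (row_sum(dist, taxa, i) - row_sum(dist, taxa, j)) / (2 * (n - 2))
    return d_i, d_ij - d_i


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D"]
    dist = {
        frozenset("AB"): 7, frozenset("AC"): 13, frozenset("AD"): 17,
        frozenset("BC"): 8, frozenset("BD"): 12, frozenset("CD"): 14,
    }
    print(q_matrix(dist, taxa))
    print(branch_lengths(dist, taxa, "A", "B"))  # A-Y and B-Y
    print(branch_lengths(dist, taxa, "C", "D"))  # C-X and D-X
```

The remaining internal branch X-Y follows from additivity, for example d(A,C) – (A-Y) – (C-X) = 13 – 6 – 5 = 2.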

S37/109
Neighbor Joining (NJ)

Advantages:
• Very efficient
• Also for large datasets
• A single tree is estimated by minimising genetic distance, in a short time and
with little computational expenditure.

Disadvantages:
• The method lacks accuracy because there is no attempt to correct for
potential bias (homoplasy).
• The method lacks precision because the outcome is partly contingent on
the tree with which the search process begins.
• Does not examine all possible topologies

S38/109
UPGMA (Unweighted Pair Group Method
with Arithmetic mean)
• UPGMA is the oldest distance matrix method.
• Simplest method - uses sequential clustering algorithm
• It uses a distance matrix representing measure of genetic distance
between pairs of species being considered
• It clusters the two closest species.
• It then computes a new distance matrix, using the arithmetic mean of the distances to the members of the new cluster.
• This is repeated until all species are grouped.

S39/109
S40/109
UPGMA Step 1: combine B and C

      A    B    C    D    E
A     0    10   12   10   7
B          0    4    4    13
C               0    6    15
D                    0    13
E                         0

      A    BCD    E
A     0    10.5   7
BCD        0      13.5
E                 0

      AE   BCD
AE    0    12
BCD        0
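
The clustering order can be reproduced with a short script. This is a sketch under stated assumptions: it uses the size-weighted arithmetic mean that defines UPGMA and breaks ties by taking the first closest pair, so its intermediate values for the BCD cluster (about 10.67, 13.67 and 12.17) differ slightly from the 10.5, 13.5 and 12 shown above, which correspond to a plain average of the two merged rows. The data structures are illustrative:

```python
from itertools import combinations


def upgma(taxa, dist):
    """Cluster taxa with UPGMA; return merges as (cluster_a, cluster_b, distance).

    `dist` maps pairs of taxon names, e.g. ("A", "B"), to distances.
    Clusters are tuples of taxon names. Node height = distance / 2.
    """
    clusters = [(t,) for t in taxa]
    d = {frozenset(((a,), (b,))): v for (a, b), v in dist.items()}
    merges = []
    while len(clusters) > 1:
        # Closest pair; ties go to the first pair in list order.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merges.append((a, b, d[frozenset((a, b))]))
        merged = a + b
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # Size-weighted mean = mean over all original taxon pairs.
            d[frozenset((merged, c))] = (
                len(a) * d[frozenset((a, c))] + len(b) * d[frozenset((b, c))]
            ) / len(merged)
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E"]
    dist = {
        ("A", "B"): 10, ("A", "C"): 12, ("A", "D"): 10, ("A", "E"): 7,
        ("B", "C"): 4, ("B", "D"): 4, ("B", "E"): 13,
        ("C", "D"): 6, ("C", "E"): 15, ("D", "E"): 13,
    }
    for a, b, distance in upgma(taxa, dist):
        print(a, b, distance, "height", distance / 2)
```

The merge order (B with C, then D, then A with E, then the two clusters) matches the steps above.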

S41/109
UPGMA Result

      A    B    C    D    E
A     0    10   12   10   7
B          0    4    4    13
C               0    6    15
D                    0    13
E                         0

[Figure: the resulting UPGMA tree for taxa a to e with branch lengths, and the same tree after "Correction using Formulae"]

S42/109
Distance method
Advantages:
• Fast - suitable for analysing data sets which are too large for
other more computationally intensive methods such as
maximum likelihood.
• A large number of models are available with many parameters
- improves estimation of distances.

Disadvantages:
• Information is lost - given only the distances, it is
impossible to derive the original sequences.
• Only through character based analyses can the history of
sites be investigated; e.g., the most informative positions be
inferred.
S43/109
Character based

S44/109
Character- (Sequence-) based methods
Most common:
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)
• Bayesian Inference

S45/109
Parsimony Methods
• Parsimony uses the aligned sequences to generate the tree that minimizes the number of mutations, that is,
the sum of all branch lengths, without regard to the length of any individual branch.
• It works directly on the aligned sequences, does not use a distance matrix or an evolutionary model, and
completely ignores the possibility of multiple mutations at the same site.

• Informative sites
Not all sites contribute useful information to counting mutations; only informative sites help to choose between trees.
• Invariant sites
Sites where all sequences have the same base; they are worthless.
• Singleton sites
Sites where only one sequence has the mutation; they are also worthless, because
no matter what the tree topology is, a singleton site always needs exactly 1 mutation to
generate.
• Uninformative sites
Nucleotide (or amino acid) columns that do not allow the distinction between two trees.
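
The site categories above can be assigned automatically. A minimal sketch, assuming the usual criterion that a site is parsimony-informative when at least two different states each occur in at least two sequences; the category labels returned are illustrative:

```python
from collections import Counter


def classify_site(column: str) -> str:
    """Classify one alignment column for parsimony analysis."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # States that occur in at least two sequences.
    shared = [state for state, c in counts.items() if c >= 2]
    if len(shared) >= 2:
        return "informative"
    if len(counts) == 2 and min(counts.values()) == 1:
        return "singleton"
    return "uninformative"


def classify_alignment(sequences):
    """Classify every column of a list of equal-length aligned sequences."""
    return [classify_site("".join(col)) for col in zip(*sequences)]


if __name__ == "__main__":
    alignment = ["AAGA", "AAGC", "AGAG", "AGAT"]
    print(classify_alignment(alignment))
```

Singleton sites are a special case of uninformative sites; both are skipped when tree lengths are compared.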
S46/109
Maximum Parsimony
• It was originally developed for morphological characters.
• The topology of the resulting tree is the one that requires the smallest
number of evolutionary changes (the principle of William of Ockham)
Principle:
1. Estimate the minimum number of substitutions for a given topology
2. Parsimony-informative sites (exclude invariable sites and singletons)
3. Searching MP trees by
i. Exhaustive search
ii. Heuristic search
4. Result: multiple resulting trees are possible (mainly unrooted trees are
produced)

S47/109
Algorithm
Step 1: Determine the ancestral residues for a given tree topology and for a given
alignment site that requires the smallest total number of changes in the whole
tree. Let d be this total number of changes.

Step 2: Compute d for each alignment site.

Step 3 : Add d values for all alignment sites giving the length L of tree.

Step 4: Compute L value for each possible tree shape.

Step 5: Retain the shortest tree(s)
= the tree(s) that require the smallest number of changes
= the most parsimonious tree(s).
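
Steps 1 to 4 can be sketched in code. The slides do not name a specific procedure for Step 1; the example below uses Fitch's algorithm, a standard way to count the minimum number of changes at one site on a given binary tree. The nested-tuple tree format and the toy alignment are assumptions made for illustration:

```python
def fitch_changes(tree, states):
    """Minimum number of changes at one site on a binary tree (Fitch's algorithm).

    `tree` is a nested tuple of taxon names; `states` maps taxon -> residue.
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = node
        a, b = state_set(left), state_set(right)
        if a & b:
            return a & b      # shared state: no change needed here
        changes += 1          # disjoint sets: one change on this part of the tree
        return a | b

    state_set(tree)
    return changes


def tree_length(tree, alignment):
    """Sum of per-site changes d over all alignment sites (the length L)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


if __name__ == "__main__":
    alignment = {"1": "AAG", "2": "AAA", "3": "GGA", "4": "GGG"}
    topologies = {
        "(1,2),(3,4)": (("1", "2"), ("3", "4")),
        "(1,3),(2,4)": (("1", "3"), ("2", "4")),
        "(1,4),(2,3)": (("1", "4"), ("2", "3")),
    }
    for label, tree in topologies.items():
        print(label, tree_length(tree, alignment))
```

In this toy alignment the topology (1,2),(3,4) needs the fewest changes and would be retained in Step 5.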

S48/109

S49/109

S50/109
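Number of changes at each of the 10 alignment sites, and their sum (tree length L), for three candidate topologies:

Leaf layout   Changes per site        L
1 3 / 2 4     0 3 2 2 0 1 1 1 1 3     14
1 2 / 3 4     0 3 2 2 0 1 2 1 2 3     16
1 3 / 4 2     0 3 2 1 0 1 2 1 2 3     15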

S51/109
Maximum Parsimony (MP)
Advantages:
• It is a simple method and free from assumptions.
• Easy to understand the operation.
• Does not depend on an explicit model of evolution
• Gives both trees and associated hypotheses of character evolution.

Disadvantages:
• Generally produces multiple result trees.
• Does not take into account homoplasy.
• Creates wrong topologies if the substitution rate varies extensively
between lineages

S52/109
Maximum Likelihood (ML)
Principle
• It looks for the tree that, under a given model of evolution, maximizes the
likelihood of the observed data
• It calculates likelihoods for each position in the alignment and for all possible
topologies (gaps generally removed) and returns the tree with the highest
likelihood.
• It locates the most likely tree topology through a hill-climbing algorithm
• Searching strategies are rarely exhaustive and mostly heuristic, like-
• NNI (Nearest neighbor interchanges)
• TBR (Tree bisection-reconnection)
• SPR (Subtree pruning and regrafting)

S53/109
Maximum likelihood methods
• Hypotheses
• The substitution process follows a probabilistic model whose mathematical
expression, but not parameter values, is known a priori.
• Sites evolve independently from each other.
• All sites follow the same substitution process (some methods use a more
realistic hypothesis).
• Substitution probabilities do not change with time on any tree branch. They
may vary between branches.

S54/109
Maximum likelihood algorithm
• Step 1:
Let us consider a given rooted tree, a given site, and a given set of branch lengths.
• Let S1, S2, S3, S4 be the observed bases at the site in seq. 1, 2, 3, 4,
S5, S6, S7 the unknown and variable ancestral bases,
and l1, l2, …, l6 the given branch lengths.

[Figure: rooted tree with root S7; S7 is connected to S5 by branch l5 and to S6 by branch l6; S5 is connected to S1 (l1) and S2 (l2); S6 is connected to S3 (l3) and S4 (l4)]

$P(S1, S2, S3, S4) = \sum_{S7} \sum_{S5} \sum_{S6} P(S7)\, P_{l5}(S7,S5)\, P_{l6}(S7,S6)\, P_{l1}(S5,S1)\, P_{l2}(S5,S2)\, P_{l3}(S6,S3)\, P_{l4}(S6,S4)$

where P(S7) is estimated by the average base frequencies in studied sequences.

S55/109
• Step 2: Let us compute the probability that the entire sequences have
evolved:
$P(Sq1, Sq2, Sq3, Sq4) = \prod_{\text{all sites}} P(S1, S2, S3, S4)$

• Step 3: Let us compute the branch lengths l1, l2, …, l6 that give the
highest P(Sq1, Sq2, Sq3, Sq4) value. This is the likelihood of the tree.

• Step 4: Let us compute the likelihood of all possible trees. The tree
predicted by the method is the one having the highest likelihood.
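
Steps 1 and 2 can be written out for the four-sequence tree described above. This is a sketch under stated assumptions: it uses the Jukes-Cantor model, so P(S7) is fixed at 1/4 rather than estimated from the data, branch lengths are in expected substitutions per site, and the branch-length optimisation of Step 3 is not included. Function names are illustrative:

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x: str, y: str, length: float) -> float:
    """Jukes-Cantor probability of observing y at the end of a branch starting at x."""
    e = math.exp(-4.0 * length / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(s1, s2, s3, s4, lengths):
    """P(S1, S2, S3, S4): sum over the unknown ancestral bases S7, S5, S6."""
    l1, l2, l3, l4, l5, l6 = lengths
    total = 0.0
    for s7, s5, s6 in product(BASES, repeat=3):
        total += (
            0.25                                        # P(S7) under Jukes-Cantor
            * jc_prob(s7, s5, l5) * jc_prob(s7, s6, l6)
            * jc_prob(s5, s1, l1) * jc_prob(s5, s2, l2)
            * jc_prob(s6, s3, l3) * jc_prob(s6, s4, l4)
        )
    return total


def log_likelihood(sequences, lengths):
    """Log of the product over all sites, for four aligned sequences."""
    return sum(math.log(site_likelihood(*column, lengths)) for column in zip(*sequences))


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "ACTT", "AGTT"]
    print(log_likelihood(seqs, (0.1, 0.1, 0.1, 0.1, 0.05, 0.05)))
```

Working with the log-likelihood avoids numerical underflow when the per-site probabilities are multiplied over long alignments.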

S56/109
• Example: Likelihood of a single sequence with
two nucleotides AC
• For DNA sequence comparison the model has 2 parts, the base
composition (A, G, C, T) and the process.
• If the model is the Jukes-Cantor model, which has a base composition of ¼
for each nucleotide, then the likelihood will be 1/4 × 1/4 = 1/16.
• If the model has a composition of 40% A and 10% C, the likelihood of the
sequence will be 0.4 × 0.1 = 0.04.
• If we take the 16 possible two-nucleotide combinations and calculate their
likelihoods, the sum of those likelihoods is 1.
• For any model ,the sum of the likelihoods of all the different data
possibilities should be 1.
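
The arithmetic of this example can be reproduced as follows. A minimal sketch that considers the base composition only; the G and T frequencies in the second model are hypothetical values chosen so that the composition sums to 1, since the slide gives only A and C:

```python
from itertools import product


def sequence_likelihood(seq: str, freqs: dict) -> float:
    """Likelihood of a sequence under a composition-only model."""
    result = 1.0
    for base in seq:
        result *= freqs[base]
    return result


if __name__ == "__main__":
    jukes_cantor = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    # Hypothetical G and T values; only A = 0.4 and C = 0.1 come from the example.
    skewed = {"A": 0.4, "C": 0.1, "G": 0.25, "T": 0.25}

    print(sequence_likelihood("AC", jukes_cantor))  # 1/16
    print(sequence_likelihood("AC", skewed))        # about 0.04

    # The likelihoods of all 16 two-nucleotide sequences sum to 1.
    print(sum(sequence_likelihood(a + b, skewed) for a, b in product("ACGT", repeat=2)))
```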

S57/109
The probability of nucleotide substitution
SpC
TCAGCCGACTGT
SpD
TCAGACGACTGT

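• The actual distance (d) between the two sequences is related to the
probability that the sequences differ at a site (p):

$p = \frac{3}{4}\left[1 - e^{-\frac{4}{3}d}\right]$

where $d = 3\alpha t$

[Figure: Jukes-Cantor substitution diagram; every change between A, G, C and T occurs at rate α]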
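
Solving the relationship above for d gives the Jukes-Cantor corrected distance, d = -(3/4) ln(1 - 4p/3). A minimal sketch applying it to the SpC and SpD sequences shown on this slide; the function names are illustrative:

```python
import math


def expected_p(d: float) -> float:
    """Expected proportion of differing sites for a distance d: p = 3/4 [1 - e^(-4d/3)]."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jukes_cantor_distance(p: float) -> float:
    """Invert the relationship: d = -3/4 * ln(1 - 4p/3), defined for 0 <= p < 3/4."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    sp_c = "TCAGCCGACTGT"
    sp_d = "TCAGACGACTGT"
    p = sum(a != b for a, b in zip(sp_c, sp_d)) / len(sp_c)
    print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.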
S58/109
Maximum likelihood : properties
• This is the best justified method from a theoretical viewpoint.

• Sequence simulation experiments have shown that this method works better than
all others in most cases.

• But it is a very computer-intensive method.

• It is nearly always impossible to evaluate all possible trees because there are too
many.

• A partial exploration of the space of possible trees is done. The mathematical
certainty of obtaining the most likely tree is lost.
S59/109
Maximum Likelihood
Advantages:
• Highly accurate
• Allows various forms of homoplasy to be corrected for.
• Provides a robust statistical context in which to evaluate
specific hypotheses.
• A single tree is produced that is generally precise.

Disadvantages:
• The complexity of the estimation process means that it
is slow and computationally demanding.
• The hill-climbing algorithm is susceptible to local optima and
so is not guaranteed to return the optimal solution.
S60/109
Tree evaluation
• Tree evaluation is done to analyse how well the data supports the
result tree
• Tests for Tree evaluation
• Topology
• Tree reconciliation (comparison of the gene tree with the species tree)
• Robustness (e.g. bootstrap, aLRT (PhyML))
• Branch lengths tests

S61/109
Bootstrap
Principle:
• New MSA datasets are created by randomly choosing N columns, with replacement, from the original MSA,
where N is the length of the original MSA
• Phylogenetic analysis is then performed on all bootstrap replicates
• The consensus tree indicates bootstrap support for each node
• Mostly 1000 replicates (100 replicates for large datasets)
• Bootstrap support values: min. 98% (strict), min. 95% (accepted)

Properties
• Internal branches supported by ≥ 90% of replicates are considered statistically
significant.
• The bootstrap procedure only detects whether the sequence length is enough to support a particular
node.
• The bootstrap procedure does not help to determine whether the tree-building method is good.

S62/109
Bootstrapping Tests
• It involves repeatedly taking random samples of the data, of the same size as the original data
set, and then recalculating the test statistic of interest. This process is called “sampling with
replacement”, which means some data points are used more than once and others aren’t used at
all.
• In phylogenetic trees, bootstrapping tests are used to determine how well the different tree nodes
are supported.
• Imagine the data in the form of a multiple alignment: use every column (position) in the
alignment once, and build a tree.
• Once the tree is built, determine all of the splits in the tree.
• Now, resample the data about 1000 times.
• For each run, build a tree and determine its splits.
• For each node in the original tree, count how many samples give the same splits as that
node.
• These numbers are listed on the tree by each node.
• Often there is a lower limit of 50% or 60% for accepting a node as valid.
• Nodes with lower scores are often collapsed in a condensed tree.
S63/109
Bootstrap procedure

The support of each internal branch is expressed as percent of replicates.

S64/109
Bootstrapping Example
1 2 3 4 5 6 7 8 9 10
1: T G A A G G C T T C
2: T A G A G A G C T C
3: T G G A G G G A C T
4: T G G A G G C A T T
5: C G G A G A G C T T

# each line below is a randomly chosen list of columns to use.


1343194975
5496142252
9651422826
9969191658
6724464136
3119574879
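
The resampling step can be sketched as follows, using the five-sequence alignment of this example. Only the column resampling is shown; building a tree from each replicate and counting splits is left out. The function name and the use of a seeded random generator are illustrative choices:

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    Returns the new alignment and the list of chosen (0-based) column indices.
    """
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    replicate = ["".join(seq[i] for i in picks) for seq in alignment]
    return replicate, picks


if __name__ == "__main__":
    alignment = [
        "TGAAGGCTTC",
        "TAGAGAGCTC",
        "TGGAGGGACT",
        "TGGAGGCATT",
        "CGGAGAGCTT",
    ]
    rng = random.Random(42)  # fixed seed so that the run is repeatable
    for _ in range(3):
        replicate, picks = bootstrap_replicate(alignment, rng)
        print([i + 1 for i in picks])  # 1-based column numbers, as in the lists above
        print("\n".join(replicate))
```

Each replicate has the same number of sequences and columns as the original alignment, with some columns repeated and others missing.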
S65/109
Phylogenomic databases
Phylogenomic databases differ in their

• Goals
• Methodologies
• Number of species
• Taxonomic range
• Hierarchies
• Result presentation
• Update frequencies

Certain databases that we have gone through are-

COG, KOG, eggNOG, Ensembl [Compara], HOGENOM, InParanoid ,


OMA browser, OrthoDB, OrthoMCL, PhylomeDB

S66/109
COG http://www.ncbi.nlm.nih.gov/COG/

S67/109
KOG http://genome.jgi.doe.gov/Tutorial/tutorial/kog.html

S68/109
eggNOG http://eggnogdb.embl.de/#/app/home

S69/109
Ensembl (Compara) http://www.ensembl.org/info/docs/api/compara/index.html

S70/109
HOGENOM http://doua.prabi.fr/databases/hogenom/home.php?contents=query

S71/109
InParanoid http://inparanoid.sbc.su.se/cgi-bin/index.cgi

S72/109
OMA browser http://omabrowser.org/oma/home/

S73/109
OrthoDB http://orthodb.org/

S74/109
OrthoMCL http://www.orthomcl.org/orthomcl/

S75/109
PhylomeDB http://phylomedb.org/

S76/109
Software For Phylogenetic Analysis
Examples of online tools
• Phylodendron
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
• Clustal W
http://www.genome.jp/tools/clustalw/
• PAUP
http://paup.csit.fsu.edu/
• BioNJ
http://www.atgc-montpellier.fr/bionj/
• PhyML
http://www.atgc-montpellier.fr/phyml/

Examples of offline tools
• Phylip
• Clustal X
• MrBayes
• MacClade
• TCS
• BioEdit
• TreeView
• DnaSP
• Arlequin

274 software packages described at one website
S77/109
Online Tools

S78/109
PHYLODENDRON

S79/109
Clustal W2
ClustalW2 is a general purpose DNA or protein multiple sequence alignment program for three or
more sequences. For the alignment of two sequences please instead use our pairwise sequence alignment tools.

S80/109
PAUP [Phylogenetic Analysis Using Parsimony and other Methods]
URL: http://paup.csit.fsu.edu/

•PAUP has been released by Sinauer Associates, of Sunderland, Massachusetts.

•It is the most sophisticated parsimony program, with many options and close compatibility with MacClade.

•It also includes parsimony, distance matrix, invariants, and maximum likelihood methods and many indices and
statistical tests.

•It has Macintosh, PowerMac, Windows, and Unix/OpenVMS versions.

S81/109
BIONJ

S82/109
PHYML

S83/109
Offline Tools

S84/109
PHYLIP (Phylogeny Inference Package)
• Available free in Windows/MacOS/Linux systems
• Parsimony, distance matrix and likelihood methods (bootstrapping and
consensus trees)
• Data can be molecular sequences, gene frequencies, restriction sites
and fragments, distance matrices and discrete characters
input and output

S85/109
S86/109
S87/109
S88/109
Clustal X
Clustal X is a windows interface MSA program that provides an integrated environment for performing multiple
sequence and profile alignments and analysing the results. The sequence alignment is displayed in a window on the
screen.

S89/109
MrBayes

•MrBayes is a program for Bayesian inference of phylogeny using Markov chain
Monte Carlo methods.

•A separate program with a similar name, msBayes (pronounced em es bayeszzz), is developed by
Mike Hickerson, Eli Stahl, Wen Huang, and Naoki Takebayashi and released under the
GNU Public License.

•msBayes allows complex and flexible comparative phylogeographic inference.

•With msBayes we can test for simultaneous divergence or colonization
across multiple co-distributed pairs of taxa.

•msBayes uses hierarchical approximate Bayesian computation
(HABC) to estimate hyper-parameters given DNA sequence data.

S90/109
2009. Bayesian phylogenetic analysis using MRBAYES
S91/109
MacClade
URL: http://phylogeny.arizona.edu/macclade/macclade.html

Developed by Wayne Maddison and David Maddison

MacClade enables you to use the mouse-window interface to specify and rearrange
phylogenies by hand, and watch the number of character steps and the distribution of
states of a given character on the tree change as you do so.

[Figure: panels A and B showing a phylogeny with tips M. cephalotes, M. phaeocephalus, M. panamensis, M. ferox, M. barbirostris, M. tuberculifer (Ecuador), M. tuberculifer (Argentina), samples labelled phaeonotus-pelzelni, ferocior, pelzelni, swainsoni-pelzelni, swainsoni-ferocior and swainsoni, M. tyrannulus, Rhytipterna immunda and Tyrannus caudifasciatus. Legend: Breeding range: Northern South America / Southern South America]

S92/109
TCS (named after Templeton, Crandall and Sing)

•A program for estimating gene genealogies within a population.

•It supports cladistic analysis of phenotypic associations with haplotypes inferred from
restriction endonuclease mapping and DNA sequence data.

•Cladogram estimation is a method that connects existing haplotypes in a minimum
spanning tree, which is essentially a parsimony method.

•It can also infer networks with loops in them.

[Figure: haplotype network with nodes labelled by sample groups A to G]

S93/109
BioEdit

•URL: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html

•It is a sequence editor with many kinds of general molecular biology functions
available (alignment, BLAST searches, plasmid drawing, restriction mapping,
sequence machine trace viewing, etc.).

•It comes with a number of existing phylogeny programs which can be
automatically run from within BioEdit, e.g. TreeView, fastDNAml, PHYLIP.

•Joining different parts of a sequence together (consensus sequence)
•Sequence alignments (manual vs. ClustalW)
•Alignments of up to 20,000 sequences
•Export in GenBank, FASTA, or PHYLIP format
All information from: http://evolution.genetics.washington.edu/phylip/software.html
S94/109
TreeView

• Visualising trees
• We can change the graphic
presentation of a tree to a
cladogram, rectangular
cladogram, radial tree,
phylogram etc.
• But it does not change the
structure of a tree

S95/109
DnaSP

•It is for the analysis of nucleotide polymorphism from aligned DNA sequence data.

•DnaSP can estimate several measures of DNA sequence variation, linkage disequilibrium,
recombination, gene flow and gene conversion parameters within and between populations.

•It calculates measures of population divergence, including the Jukes-Cantor
distance, which can be used as a distance in phylogeny reconstruction.

S96/109
Arlequin

•It can perform many kinds of population genetic tasks, including estimation of gene frequencies,
testing of linkage disequilibrium, and analysis of diversity between populations.

•It can compute a variety of genetic distance measures, including those of Jukes and Cantor, the
Kimura 2-parameter distance, and the Tamura-Nei distance, each of these with or without correction
for gamma-distributed rates of evolution.

•It can also compute a Minimum Spanning Tree network.

S97/109
Servers for phylogenetic analysis
• http://www.phylogeny.fr/
• http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
• http://phylobench.vital-it.ch/raxml-bb/
• http://power.nhri.org.tw/power/home.htm

S98/109
http://www.phylogeny.fr/

S99/109
S100/109
http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html

S101/109
S102/109
http://phylobench.vital-it.ch/raxml-bb/

S103/109
http://power.nhri.org.tw/power/home.htm

S104/109
Applications of Phylogenetic analysis
1. It is used as a tool for investigating problems such as
i. HIV mutation
ii. Evolution of influenza
iii. Biogeography
iv. Forensic science

2. Drug discovery
i. Vaccine development
S105/109
3. It is used to study the order of separation of areas, based on the different taxa
that occupy them.
4. Predicting functions of uncharacterized genes - ortholog detection

DISADVANTAGES
Loss of phylogenetic signal due to saturation:
When the compared homologous sequences have experienced too many residue
substitutions since divergence, it is impossible to determine the phylogenetic tree,
whatever the tree-building method used.
Often saturation may not be detectable.
S106/109
Textbooks
• Page & Holmes Molecular Evolution: A Phylogenetic Approach. Blackwell
Science.
• Felsenstein Inferring Phylogenies. Sinauer Associates.
• Hall Phylogenetic trees made easy. Sinauer Associates.
• Molecular evolution
Fundamentals of molecular evolution (2nd edition); Graur & Li; Sinauer, 2000.
• Evolution in general
Evolution (2nd edition); M. Ridley; Blackwell, 1996.
• Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A practical guide
to the analysis of genes and proteins”, 2001, Wiley.
• Barbara Resch, “Hidden Markov Models - A Tutorial for the Course
Computational Intelligence”, 2010.

S107/109
Software:
• PHYLIP : an extensive package of programs for all platforms (NJ, MP,
ML)
http://evolution.genetics.washington.edu/phylip/software.html
• MrBayes (Bayesian): http://mrbayes.csit.fdsu.edu
• ClustalX : multiple sequence alignment with a graphical interface
(for all types of computers).
http://www.ebi.ac.uk/FTP/index.html and go to ‘software’
• Database similarity searches (Blast) :
http://www.ncbi.nlm.nih.gov/BLAST/

S108/109
Websites:
• MultiPhyl (ML via email)
http://distributed.cs.nuim.ie/multiphyl.php

• Felsenstein’s Phylogeny program page (links to available software):
http://evolution.genetics.washington.edu/phylip/software.html

• Lecture notes of molecular systematics
http://www.bioinf.org/molsys/lectures.html

S109/109
