Phylogenetics

S2/109
Systematics
It is used to determine the evolutionary history and
relationships among organisms, drawing on evidence from a
wide variety of sources including paleontology,
morphology, and molecular biology.
Three schools of systematics:
• Evolutionary (Synthetic)
• Phenetics (Numerical Taxonomy)
• Phylogenetic (Cladistic)
Phylogenetics
It is the study of evolutionary relationships and
determines how a family might have been
derived during evolution.
Objective:
To discover all of the branching relationships
in the tree and the branch lengths.
S3/109
Why do we need to study phylogenies?
• To know the origin of organisms (i.e. how we evolved)
IE68 - biological databases - phylogeny

S4/109
Clusters of Orthologous Groups
• The COG are generated by comparing predicted and known proteins in all
completely sequenced microbial genomes to infer sets of orthologs.
• Each COG consists of a group of proteins found to be orthologous across at least
three lineages and likely corresponds to an ancient conserved domain.
• It provides a fast alternative for describing the functional characteristics of one
microbe or a community of microbes because the database is significantly smaller.
• The current COG database used is CloVR (Cloud Virtual Resource), which is
composed of 144k proteins and over 4800 COGs.
• Each COG has a specific functional description:
• Cellular Processes And Signaling
• Information Storage And Processing
• Metabolism
S5/109
Determination of COGs for a set of proteins
• Input: a single text file containing the amino acid sequences of the proteins in FASTA format.
NOTE - All proteins must belong to a single species, but may come from different strains.
• There are two approaches:
1. The first approach is simple:
i. Run BLASTp with the query sequences against the COG database available at the following
URL: ftp://ftp.ncbi.nlm.nih.gov/pub/kristensen/thousandgenomespogs/blastdb/
ii. Then look for the best match for each query and see which COG group it belongs to.
2. The second approach is more complex:
i. Install the COG software (COGsoft.201204.tar) from the following link:
URL: ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/
ii. Then run an all-against-all PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool).
iii. Then convert the data into a format acceptable to COGnitor by using the different modules of the COG software.
iv. Every step in this approach takes time.
S6/109
Synteny
It is a valid deduction that two or
more genomic regions are derived
from a single ancestral genomic
region in such proximity that they
may be subject to linkage.
• Two types of synteny

1. SYNTENIC CORRELATION
• It is a measure of genomic conservation.
2. SYNTENIC ASSOCIATION
• It measures the proportion of errors made in assigning a gene to a chromosome in one species that can be
eliminated by knowing which chromosome the gene belongs to in the other.
S7/109
SyMAP (Synteny Mapping and Analysis Program)
• It is a software package for
• detecting,
• displaying, and
• querying
syntenic relationships between
sequenced chromosomes.
• SyMAP can run as a standalone
desktop application or from the web.
S8/109
Phylogenetic trees
S9/109
Trees terminology
• Tree is composed of outer branches representing the taxa,
nodes and branches.
• Taxa: It refers to the sequences. (singular taxon)
• Operational taxonomic unit (OTU): the samples selected
to be used in a study.
• Node: a branch point in a tree
• Branch: defines the relationship between the taxa in terms
of descent and ancestry
• Branch length : represents the number of character changes
that have occurred in the branch.
• Topology: the branching patterns of the tree
• Root: the common ancestor of all taxa
• Clade: a group of two or more taxa that includes both their
common ancestor and all of its descendants.
S10/109
Relationship between Hierarchy and Phylogeny
S11/109
Trees
Classification based on:
• Rooting: Rooted, Unrooted
• Branching: Bifurcating, Multifurcating
• Groups: Monophyletic, Polyphyletic, Paraphyletic
• Construction of tree: Phenetic, Cladistic
• Distance: Cladogram, Additive, Ultrametric
S12/109
Classification of tree based on Rooting
ROOTED TREE
• It has a node that is identified as the root from which ultimately
all other nodes descend,
• It has a direction that corresponds to evolutionary time;
• the closer a node is to the root of the tree the older it is in time.
• It allow us to define ancestor-descendant relationships between
nodes.
UNROOTED TREES
• They lack a root, and hence show how the taxa are related without
specifying the direction of evolution.
• They do not identify ancestors and descendants.
• sequences that may be adjacent on an unrooted tree need not
be evolutionarily closely related.
S13/109
Unrooted Trees
• Most phylogenetic methods produce unrooted trees, so there are two ways to
root an unrooted tree:
• 1. Outgroup Method
• 2. Molecular Clock Hypothesis
Outgroup Method
• It analyses a group of sequences known a priori to be external to the group under
study.
Molecular Clock Hypothesis
• All lineages are supposed to have evolved at the same speed since
divergence from their common ancestor.
• The root is at the point equidistant from
all tree leaves.
S14/109
Rooting a Tree
STEP 1 - A direction for time is needed (tree-building algorithms are completely time-reversible).
STEP 2 - Include one or more sequences that are known to be more distantly related to
all the others than these are to each other; they are known as an “outgroup”.
STEP 3 - The root is placed on the branch that connects the outgroup to the rest of the
sequences, halfway between them.
S15/109
Possible evolutionary trees
Distinct trees for 8 taxa
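The count of distinct trees grows very quickly with the number of taxa. As a minimal sketch, the standard double-factorial formulas for strictly bifurcating trees can be evaluated as follows; the function names are illustrative and not part of the slides.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("an unrooted tree needs at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa: int) -> int:
    """Number of distinct rooted bifurcating trees: (2n - 3)!! for n >= 2."""
    if n_taxa < 2:
        raise ValueError("a rooted tree needs at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


for n in (4, 8):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For 8 taxa this gives 10,395 unrooted and 135,135 rooted trees, which is why exhaustive searches are only feasible for small data sets.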

S16/109
Classification of tree based on Branching
Bifurcating
Exactly two descendants per interior node
Multifurcating
More than two descendants per interior node
S17/109
Classification of tree based on Groups
Monophyletic group
• Includes an ancestor and all of its descendants.
• Example: Taxon A, B and C share a common ancestor.
Paraphyletic group
• Includes an ancestor and some, but not all, of its descendants.
• Example: Taxon A is highly derived and looks very different from B, C, and the ancestor.
Polyphyletic group
• Includes two convergent descendants but not their common ancestor.
• Example: Taxon A and C share similar traits through convergent evolution.
S18/109
S19/109
Classification of tree based on construction
Phenetic methods
• They construct phenograms by considering the current states of
characters without regard to the evolutionary history that brought the
species to their current phenotypes.
• Phenograms are based on overall similarity.
Cladistic methods
• They construct cladograms that rely on assumptions about ancestral
relationships as well as on current data.
• Cladograms are based on character evolution, with characters such as
• eye color (blue, brown, green)
• nucleotide bases: A, C, T, G
• codons (nucleotide triplets that encode amino acids): ACC, CGT, GAT, etc.
S20/109
Phenetics
Phenetics (overall similarity)
S21/109
Classification of tree based on distance
CLADOGRAM
• It shows the branching pattern only.
• Branch lengths have no meaning but sometimes differ for artistic effect.
PHYLOGRAM
• Also known as an additive or metric tree.
• Branch lengths are proportional to evolutionary distance.
ULTRAMETRIC TREE
• Branch lengths are proportional to time.
• This is the molecular clock model, which implies that evolution occurs at a constant
rate in all species.
S22/109
“Splits / partitions”
• A split arises when a specific internal branch
is removed: the taxa fall into two groups.
• When you list the species in one half of a split, the
other half is automatically defined.
Generating a Set of Splits
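A small sketch of how a set of splits can be generated: each internal branch separates the taxa below it from all other taxa. The nested-tuple tree representation and the function names are assumptions made for this illustration, not a format used in the slides.

```python
def leaves(node):
    """Return the set of taxon names found below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Return the non-trivial splits (bipartitions) of a tree given as nested tuples."""
    all_taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return  # a leaf has no internal branches below it
        for child in node:
            side = leaves(child)
            other = all_taxa - side
            # Branches leading to a single taxon give trivial splits; skip them.
            if len(side) >= 2 and len(other) >= 2:
                found.add(frozenset([side, other]))
            visit(child)

    visit(tree)
    return found


# Hypothetical 5-taxon tree: ((A,B),(C,(D,E)))
example_tree = (("A", "B"), ("C", ("D", "E")))
for split in splits(example_tree):
    print(" | ".join("".join(sorted(side)) for side in split))
```

Each split is stored as an unordered pair of taxon sets, so the two branches attached to the root, which describe the same bipartition, are counted once.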
S23/109
Phylogenetics approach
S24/109
Step 1 : Sequence analysis
Step 2: Multiple sequence alignment
Step 3: Phylogenetic analysis
S25/109
Phylogenetic analysis
1. Data selection.
2. Data comparison
3. Selection of a data model
4. Selection of an evolutionary model
5. Tree-building
6. Tree evaluation
S26/109
1. Data selection
Parameters to be considered:
• Input data must be homologous!
• Similar distribution
• Number of character states, size of the dataset
Type of data:
• Morphological characters
• Physiological characters
• Gene order
• Sequence data (nucleotide and amino acid)
2. Data comparison
• Choose a suitable alignment method
MSA methods:
• ClustalW (very fast)
• Muscle (very fast)
• MAFFT (fast)
• Probcons
• T-coffee
NOTE - Highly diverged sequences:
• Domain/family predictions
• Structures
S27/109
3. Selection of a data model
Parameters to be considered:
• Each position in the alignment should be homologous!
• Missing data (in some OTUs)
• Number of characters
Two categories of data:
• Numerical data
1. Distance between objects, i.e. evolutionary distance between two species
• Character data
1. Each character has a finite number of states
2. E.g. number of legs = 1, 2, 4
4. Selection of an evolutionary model
Parameters to be considered:
1. Frequencies of aa exchange during evolution
2. Presence of invariable sites
• Phylogenetic tree-building presumes different evolutionary models.
• The model chosen influences the outcome of the analysis and should be
considered in the interpretation of the analysis results.
S28/109
5. Tree Building methods
S29/109
Distance based
General principle :
Sequence alignment
↓ (1)
Matrix of evolutionary distances between sequence pairs
↓ (2)
(unrooted) tree
STEP 1 - Compute distances

STEP 2 - Tree-building by
• Fitch-Margoliash (FM)
• Neighbor Joining (NJ)
• Unweighted pair-group method using arithmetic averages
(UPGMA)
S30/109
Step 1: Compute distances
Measure for the extent of sequence divergence:
p distance: $\hat{p} = n_d / n$
$\hat{p}$ = proportion of sites that differ (p distance)
$n_d$ = number of aa differences
$n$ = number of aa used
• Relationship of p with t (time)
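The p distance defined above can be computed directly from two aligned sequences. This is a minimal sketch; skipping columns that contain a gap character is an assumption of this example, and the two short sequences are invented for illustration.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0      # n: number of sites used
    differences = 0   # n_d: number of sites that differ
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: columns with a gap are not counted
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Illustrative aligned peptides: 2 differences over 10 sites.
print(p_distance("MKTAYIAKQR", "MKSAYIAKQK"))
```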

S31/109
Step 2: Tree-building
S32/109
Fitch-Margoliash Method
• This method generates an unrooted additive tree.
• The branch lengths are unequal; they are proportional to
evolutionary distance.
• The method works on small, 3-branch trees.
STEP 1 - determines the branch
lengths and the position of the
internal node from the distances.
[Figure: panels illustrating STEP 2 and STEP 3]
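For a single 3-branch tree, the branch lengths follow from the three pairwise distances. The slides do not show the formula, so the sketch below uses the standard three-point solution (each pair of branches must add up to the corresponding distance); the function name is illustrative.

```python
def three_point_branch_lengths(d_ab: float, d_ac: float, d_bc: float):
    """Branch lengths a, b, c of a star tree joining A, B and C at one internal node.

    Solves a + b = d_ab, a + c = d_ac, b + c = d_bc.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


# Distances A-B = 7, A-C = 13, B-C = 8 (the same values appear in the NJ example).
print(three_point_branch_lengths(7, 13, 8))
```

With these inputs the branches to A, B and C come out as 6, 1 and 7.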
S33/109
Neighbor Joining (NJ)
Principles:
• It is a bottom-up clustering method.
• Tree topology and branch lengths are estimated from a genetic distance
matrix
• Neighbours are defined as taxa connected by a single node in an unrooted
tree.
• Closest neighbours are successively joined by a new node until the tree is
resolved.
• It results in a single unrooted tree with branch length estimates, which needs
to be rooted by the outgroup method.
• NJ is a fast method, even for hundreds of sequences.
S34/109
The Neighbor-Joining Method: algorithm
• Start from a star topology and progressively construct a tree as follows:
• Step 1: Use the distances d measured between the N sequences.
• Step 2: For all pairs i and j: consider the tree topology in which i and j are joined, and compute
$S_{i,j}$, the sum of all “best” branch lengths. (Saitou and Nei have found a simple
way to compute $S_{i,j}$.)
• Step 3: Retain the pair (i,j) with the smallest $S_{i,j}$ value. Group i and j in the tree.
• Step 4: Compute new distances d between N-1 objects: the pair (i,j) and the N-2
remaining sequences:
$d_{(i,j),k} = (d_{i,k} + d_{j,k}) / 2$
• Step 5: Return to step 1 as long as N ≥ 4. When N = 3, an (unrooted) tree is
obtained.
S35/109
Example
• We are using only 4 taxa.
• Step 1 is to calculate the neighbor distances Q, using the following equation:
$$Q_{ij} = (N-2)\,d_{ij} - \sum_{k=1}^{N} d_{ik} - \sum_{k=1}^{N} d_{jk}$$
dist  A   B   C   D
A     0
B     7   0
C     13  8   0
D     17  12  14  0
Q score, e.g. for A-B: (N-2)*AB - (AB+AC+AD) - (AB+BC+BD)
A-B  (4-2)*7 - (7+13+17) - (7+8+12)    = -50
A-C  (4-2)*13 - (7+13+17) - (13+8+14)  = -46
A-D  (4-2)*17 - (7+13+17) - (17+12+14) = -46
B-C  (4-2)*8 - (7+8+12) - (13+8+14)    = -46
B-D  (4-2)*12 - (7+8+12) - (17+12+14)  = -46
C-D  (4-2)*14 - (13+8+14) - (17+12+14) = -50
-50 is the lowest score, and we could use either A-B or C-D.
We arbitrarily choose A-B to join first.
S36/109
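Branch lengths from the joined taxa to their new node (Y joins A and B; X joins C and D):
$$d_{AY} = \frac{1}{2}\,d_{AB} + \frac{1}{2(N-2)} \left[ \sum_{k=1}^{N} d_{Ak} - \sum_{k=1}^{N} d_{Bk} \right]$$
d score
A-Y  (1/2)*7 + {1/4*[(7+13+17) - (7+8+12)]}    = 6
B-Y  7 - 6                                      = 1
C-X  (1/2)*14 + {1/4*[(13+8+14) - (17+12+14)]} = 5
D-X  14 - 5                                     = 9
X-Y                                             = 2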
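The Q scores and branch lengths of this 4-taxon example can be re-computed with a few lines of code. This is a sketch of the two formulas only, not a complete neighbor-joining implementation; the helper names are illustrative.

```python
from itertools import combinations

# Pairwise distances from the 4-taxon example.
DIST = {("A", "B"): 7, ("A", "C"): 13, ("A", "D"): 17,
        ("B", "C"): 8, ("B", "D"): 12, ("C", "D"): 14}
TAXA = ["A", "B", "C", "D"]


def d(x, y):
    """Symmetric lookup of the distance between two taxa."""
    if x == y:
        return 0.0
    return float(DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)])


def row_sum(x):
    """Sum of the distances from x to every taxon."""
    return sum(d(x, k) for k in TAXA)


def q_score(i, j):
    """Q_ij = (N - 2) * d_ij - sum_k d_ik - sum_k d_jk."""
    n = len(TAXA)
    return (n - 2) * d(i, j) - row_sum(i) - row_sum(j)


def branch_lengths(i, j):
    """Lengths of the branches from i and from j to the node that joins them."""
    n = len(TAXA)
    to_i = 0.5 * d(i, j) + (row_sum(i) - row_sum(j)) / (2 * (n - 2))
    return to_i, d(i, j) - to_i


q_table = {pair: q_score(*pair) for pair in combinations(TAXA, 2)}
for pair, score in q_table.items():
    print(pair, score)
print("A-Y, B-Y:", branch_lengths("A", "B"))
print("C-X, D-X:", branch_lengths("C", "D"))
```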
S37/109
Neighbor Joining (NJ)
Advantages:
• Very efficient, also for large datasets.
• A single tree is estimated by minimising genetic
distance, in a short time and
with little computational
expenditure.
Disadvantages:
• The method lacks accuracy because
there is no attempt to correct for
potential bias (homoplasy).
• The method lacks precision because
the outcome is partly contingent on
the tree with which the search process
begins.
• Does not examine all possible
topologies.
S38/109
UPGMA (Unweighted Pair Group Method
with Arithmetic mean)
• UPGMA is the oldest distance matrix method.
• Simplest method - uses a sequential clustering algorithm.
• It uses a distance matrix representing a measure of genetic distance
between the pairs of species being considered.
• It clusters the two closest species.
• A new distance matrix is computed, using the arithmetic mean of the distances to the new cluster.
• This is repeated until all species are grouped.
S39/109
S40/109
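UPGMA example
     A    B    C    D    E
A    0    10   12   10   7
B         0    4    4    13
C              0    6    15
D                   0    13
E                        0
UPGMA Step 1: combine B and C.
Matrix after B, C and D have been merged into BCD:
     A    BCD   E
A    0    10.5  7
BCD       0     13.5
E               0
Matrix after A and E have been merged into AE:
     AE   BCD
AE   0    12
BCD       0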
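The sequential clustering in this example can be sketched as below. Note the assumption about the averaging rule: the code takes the simple mean of the two merged rows, which reproduces the values on the slide (10.5, 13.5, 12). Averaging over all original members of a cluster instead would give slightly different numbers, for example 32/3 ≈ 10.67 for A to BCD. Function and variable names are illustrative.

```python
from itertools import combinations

# Pairwise distances from the 5-taxon example.
DISTANCES = {
    ("A", "B"): 10, ("A", "C"): 12, ("A", "D"): 10, ("A", "E"): 7,
    ("B", "C"): 4, ("B", "D"): 4, ("B", "E"): 13,
    ("C", "D"): 6, ("C", "E"): 15,
    ("D", "E"): 13,
}


def cluster(distances, taxa):
    """Repeatedly merge the closest pair of clusters.

    The distance from a merged cluster to any other cluster is the simple
    mean of the two rows being merged (the rule that matches the slide).
    Returns the list of merges and the dictionary of all distances used.
    """
    dist = {frozenset(pair): float(value) for pair, value in distances.items()}
    active = list(taxa)
    merges = []
    while len(active) > 1:
        # Closest pair; ties are broken by the order of the active list.
        a, b = min(combinations(active, 2), key=lambda pair: dist[frozenset(pair)])
        merged = "".join(sorted(a + b))
        merges.append((a, b, dist[frozenset((a, b))]))
        for other in active:
            if other not in (a, b):
                dist[frozenset((merged, other))] = (
                    dist[frozenset((a, other))] + dist[frozenset((b, other))]
                ) / 2
        active = [name for name in active if name not in (a, b)] + [merged]
    return merges, dist


merges, dist = cluster(DISTANCES, ["A", "B", "C", "D", "E"])
for first, second, height in merges:
    print(f"join {first} + {second} at distance {height}")
```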
S41/109
UPGMA Result
[Figure: the resulting UPGMA tree for taxa a to e with its branch lengths, shown next to the tree after "Correction using Formulae"; the distance matrix of the previous slide is repeated alongside]
S42/109
Distance method
Advantages:
• Fast - suitable for analysing data sets which are too large for
other more computationally intensive methods such as
maximum likelihood.
• A large number of models are available with many parameters
- improves estimation of distances.
Disadvantages:
• Information is lost - given only the distances, it is
impossible to derive the original sequences.
• Only through character-based analyses can the history of
sites be investigated; e.g., the most informative positions be
inferred.
S43/109
Character based
S44/109
Character- (Sequence-) based methods
Most common:
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)
• Bayesian Inference
S45/109
Parsimony Methods
• The aligned sequences are used to find the tree that minimizes the total number of mutations, i.e.
the sum of all branch lengths, without regard to the length of any individual branch.
• It works directly on the aligned sequences and does not use a distance matrix or an evolutionary model, and it
completely ignores the possibility of multiple mutations at the same site.
• Informative sites
Not all sites contribute useful information to counting mutations; only informative sites help to choose between trees.
• Invariant sites
Sites where all sequences have the same base are worthless.
• Singleton sites
Sites where only one sequence has the mutation are also worthless, because
no matter what the tree topology is, a singleton site always needs exactly 1 mutation to
generate.
• Uninformative sites
Nucleotide (or amino acid) columns that do not allow the distinction between two trees.
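The site categories above can be told apart by counting the states in each alignment column. The sketch below uses the usual working definition of a parsimony-informative site (at least two states, each present in at least two sequences); the five short sequences and the function name are illustrative.

```python
from collections import Counter


def classify_site(column: str) -> str:
    """Classify one alignment column for parsimony analysis."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # States shared by at least two sequences.
    shared = [state for state, n in counts.items() if n >= 2]
    if len(shared) >= 2:
        return "informative"
    if len(counts) == 2 and min(counts.values()) == 1:
        return "singleton"
    return "uninformative"


# Illustrative alignment of five sequences, ten columns.
SEQUENCES = ["TGAAGGCTTC", "TAGAGAGCTC", "TGGAGGGACT", "TGGAGGCATT", "CGGAGAGCTT"]

site_classes = [classify_site("".join(col)) for col in zip(*SEQUENCES)]
for position, label in enumerate(site_classes, start=1):
    print(position, label)
```

In this alignment only columns 6, 7, 8 and 10 are informative; the other columns are invariant or singletons.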
S46/109
Maximum Parsimony
• It was originally developed for morphological characters.
• The topology of the result tree is the one that requires the smallest
number of evolutionary changes (the principle attributed to William of Ockham).
Principle:
1. Estimate the minimum number of substitutions for a given topology
2. Use parsimony-informative sites (exclude invariable sites and singletons)
3. Searching MP trees by
i. Exhaustive search
ii. Heuristic search
4. Result - multiple result trees are possible (mainly unrooted trees
result)
S47/109
Algorithm
Step 1: Determine the ancestral residues for a given tree topology and for a given
alignment site that requires the smallest total number of changes in the whole
tree. Let d be this total number of changes.
Step 2: Compute d for each alignment site.
Step 3 : Add d values for all alignment sites giving the length L of tree.
Step 4: Compute L value for each possible tree shape.
Step 5: Retain the shortest tree(s)

the tree(s) that require the smallest number of changes
the most parsimonious tree(s).
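Steps 1 to 5 can be sketched with Fitch's algorithm, one common way to obtain the minimum number of changes d per site; the slides do not name a specific algorithm, so this choice is an assumption. The 4-taxon alignment and the nested-tuple trees are invented for illustration.

```python
def fitch_changes(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    `tree` is a leaf name or a pair (left, right); `states` maps leaf name to state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_changes(left, states)
    right_set, right_changes = fitch_changes(right, states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # No shared state: one extra change is needed at this node.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """L: the sum of d over all alignment sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical alignment of four taxa, four sites.
ALIGNMENT = {"1": "AGCT", "2": "AGCA", "3": "TGAA", "4": "TGAT"}

# The three possible groupings of four taxa.
TREES = {
    "(1,2),(3,4)": (("1", "2"), ("3", "4")),
    "(1,3),(2,4)": (("1", "3"), ("2", "4")),
    "(1,4),(2,3)": (("1", "4"), ("2", "3")),
}

lengths = {name: tree_length(tree, ALIGNMENT) for name, tree in TREES.items()}
shortest = min(lengths, key=lengths.get)
print(lengths, "most parsimonious:", shortest)
```

Enumerating every tree as done here is only practical for very few taxa; larger data sets need the heuristic searches mentioned earlier.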
S48/109
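[Figure: worked maximum parsimony example built up over three slides; the number of changes d at each alignment site is filled in step by step for each candidate tree]
S49/109
S50/109
Number of changes at each of the 10 alignment sites, and the tree length L, for the three candidate trees of four taxa:
• Tree 1: 0 3 2 2 0 1 1 1 1 3 = 14
• Tree 2: 0 3 2 2 0 1 2 1 2 3 = 16
• Tree 3: 0 3 2 1 0 1 2 1 2 3 = 15
Tree 1 is the shortest (L = 14) and is therefore the most parsimonious tree.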
S51/109
Maximum Parsimony (MP)
Advantages:
• It is a simple method and free from assumptions.
• Easy to understand the operation.
• Does not depend on an explicit model of evolution.
• Gives both trees and associated hypotheses of character evolution.
Disadvantages:
• Generally produces multiple result trees.
• Does not take into account homoplasy.
• Creates wrong topologies if the substitution rate varies extensively
between lineages.
S52/109
Maximum Likelihood (ML)
Principle
• It looks for the tree that, under a given model of evolution, maximizes the
likelihood of the observed data
• It calculates likelihoods for each position in the alignment and for all possible
topologies (gaps generally removed) and returns the tree with the highest
likelihood.
• It locates the most likely tree topology through a hill-climbing algorithm
• Searching strategies are rarely exhaustive and mostly heuristic, like-
• NNI (Nearest neighbor interchanges)
• TBR (Tree bisection-reconnection)
• SPR (Subtree pruning and regrafting)
S53/109
Maximum likelihood methods
• Hypotheses
• The substitution process follows a probabilistic model whose mathematical
expression, but not parameter values, is known a priori.
• Sites evolve independently from each other.
• All sites follow the same substitution process (some methods use a more
realistic hypothesis).
• Substitution probabilities do not change with time on any tree branch. They
may vary between branches.
S54/109
Maximum likelihood algorithm
• Step 1:
Let us consider a given rooted tree, a given site, and a given set of branch lengths.
• Let S1, S2, S3, S4 be the observed bases at the site in seq. 1, 2, 3, 4,
S5, S6, S7 the unknown and variable ancestral bases,
and l1, l2, …, l6 the given branch lengths.
[Figure: rooted tree in which the root S7 is joined to S5 by branch l5 and to S6 by branch l6; S5 is joined to S1 and S2 by l1 and l2; S6 is joined to S3 and S4 by l3 and l4]
$$P(S_1, S_2, S_3, S_4) = \sum_{S_7} \sum_{S_5} \sum_{S_6} P(S_7)\, P_{l_5}(S_7, S_5)\, P_{l_6}(S_7, S_6)\, P_{l_1}(S_5, S_1)\, P_{l_2}(S_5, S_2)\, P_{l_3}(S_6, S_3)\, P_{l_4}(S_6, S_4)$$
where P(S7) is estimated by the average base frequencies in the studied sequences.
S55/109
• Step 2: Let us compute the probability that the entire sequences have
evolved:
$$P(Sq_1, Sq_2, Sq_3, Sq_4) = \prod_{\text{all sites}} P(S_1, S_2, S_3, S_4)$$
• Step 3: Let us compute branch lengths l1, l2, …, l6 that give the
highest P(Sq1, Sq2, Sq3, Sq4) value. This is the likelihood of the tree.
• Step 4: Let us compute the likelihood of all possible trees. The tree
predicted by the method is that having the highest likelihood.
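Steps 1 and 2 can be sketched for the 4-sequence tree as follows. The example assumes the Jukes-Cantor model with branch lengths in expected substitutions per site and equal base frequencies of 1/4 for P(S7); in the slides, P(S7) is instead estimated from the data. The sequences and branch lengths are invented, and steps 3 and 4 (optimising branch lengths and comparing trees) are not shown.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x: str, y: str, length: float) -> float:
    """Jukes-Cantor probability of going from base x to base y along a branch."""
    decay = math.exp(-4.0 * length / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(s1, s2, s3, s4, l):
    """P(S1, S2, S3, S4): sum over the unknown ancestral bases S7, S5, S6.

    `l` maps the branch numbers 1..6 to branch lengths l1..l6.
    """
    total = 0.0
    for s7, s5, s6 in product(BASES, repeat=3):
        total += (
            0.25  # P(S7), assumed equal for all bases
            * jc_prob(s7, s5, l[5]) * jc_prob(s7, s6, l[6])
            * jc_prob(s5, s1, l[1]) * jc_prob(s5, s2, l[2])
            * jc_prob(s6, s3, l[3]) * jc_prob(s6, s4, l[4])
        )
    return total


def log_likelihood(seqs, l):
    """Log of the product of the site likelihoods over all alignment sites."""
    return sum(math.log(site_likelihood(*column, l)) for column in zip(*seqs))


# Hypothetical data: four aligned sequences and one set of branch lengths.
SEQS = ["ACGT", "ACGA", "TCGA", "TCGT"]
LENGTHS = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.05, 6: 0.05}
print(log_likelihood(SEQS, LENGTHS))
```

Working with the sum of logarithms rather than the raw product avoids numerical underflow on long alignments.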
S56/109
• Example: Likelihood of a single sequence with
two nucleotides AC
• For DNA sequence comparison the model has 2 parts, the base
composition (A, G, C, T) and the process.
• If the model is Jukes – Cantor model, which has a base composition of ¼
for each nucleotide then the likelihood will be 1/4 X 1/4 = 1/16.
• If the model has a composition of 40%A and 10%C the likelihood of the
sequence will be 0.4 x 0.1=0.04
• If we take the 16 possible nucleotide combinations and calculate the sum
of all of them the sum of those likelihoods is 1.
• For any model ,the sum of the likelihoods of all the different data
possibilities should be 1.
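The arithmetic of this example is easy to check in code. In the skewed composition below, only 40% A and 10% C come from the slide; the values for G and T are assumed so that the frequencies sum to 1.

```python
from itertools import product


def sequence_likelihood(seq: str, freqs: dict) -> float:
    """Likelihood of a sequence under a composition-only model (independent sites)."""
    result = 1.0
    for base in seq:
        result *= freqs[base]
    return result


jukes_cantor = {base: 0.25 for base in "ACGT"}
# 40% A and 10% C as in the example; G and T are assumed values.
skewed = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}

print(sequence_likelihood("AC", jukes_cantor))
print(sequence_likelihood("AC", skewed))

# The likelihoods of all 16 possible dinucleotides sum to 1 under either model.
for model in (jukes_cantor, skewed):
    total = sum(sequence_likelihood(a + b, model) for a, b in product("ACGT", repeat=2))
    print(round(total, 12))
```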
S57/109
The probability of nucleotide substitution
SpC
TCAGCCGACTGT
SpD
TCAGACGACTGT
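• The actual distance (d) between the two sequences is related to the
probability (p) that the sequences differ at a site:
$$p = \frac{3}{4}\left[1 - e^{-\frac{4}{3} d}\right]$$
where d = 3αt
[Figure: Jukes-Cantor substitution diagram; every change between A, G, C and T occurs at the same rate α]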
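Inverting the relationship p = (3/4)[1 - e^(-4d/3)] gives the Jukes-Cantor corrected distance d = -(3/4) ln(1 - 4p/3). The sketch below applies it to the SpC and SpD sequences shown on this slide; the function names are illustrative.

```python
import math


def proportion_different(seq_a: str, seq_b: str) -> float:
    """Observed proportion p of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def jukes_cantor_distance(p: float) -> float:
    """Corrected distance d = -(3/4) * ln(1 - 4p/3); only defined for p < 3/4."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


sp_c = "TCAGCCGACTGT"
sp_d = "TCAGACGACTGT"
p = proportion_different(sp_c, sp_d)
d = jukes_cantor_distance(p)
print(p, d)
```

The two sequences differ at 1 of 12 sites, and the corrected distance is slightly larger than p because it allows for multiple substitutions at the same site.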
S58/109
Maximum likelihood : properties
• This is the best justified method from a theoretical viewpoint.
• Sequence simulation experiments have shown that this method works better than
all others in most cases.
• But it is a very computer-intensive method.
• It is nearly always impossible to evaluate all possible trees because there are too
many.
• A partial exploration of the space of possible trees is done. The mathematical

certainty of obtaining the most likely tree is lost.
S59/109
Maximum Likelihood
Advantages:
• Highly accurate.
• Allows various forms of homoplasy to be corrected for.
• Provides a robust statistical context in which to evaluate
specific hypotheses.
• A single tree is produced that is generally precise.
Disadvantages:
• The complexity of the estimation process means that it
is slow and computationally demanding.
• The hill-climbing algorithm is susceptible to local optima and
so is not guaranteed to return the optimal solution.
S60/109
Tree evaluation
• Tree evaluation is done to analyse how well the data supports the
result tree
• Tests for Tree evaluation
• Topology
• Tree reconciliation (comparison of the gene tree with the species tree)
• Robustness (e.g. bootstrap, aLRT (PhyML))
• Branch lengths tests
S61/109
Bootstrap
Principle:
• New MSA datasets are created by randomly choosing N columns, with replacement, from the original MSA,
where N is the length of the original MSA
• Phylogenetic analysis is then performed on all bootstrap replicates
• The consensus tree indicates bootstrap support for each node
• Mostly 1000 replicates (100 copies for large datasets)
• Bootstrap support values: min. 98% (strict), min. 95% (accepted)
Properties
• Internal branches supported by ≥ 90% of replicates are considered as statistically
significant.
• The bootstrap procedure only detects if sequence length is enough to support a particular
node.
• The bootstrap procedure does not help determining if the tree-building method is good.
S62/109
Bootstrapping Tests
• It involves repeatedly taking random samples of the data, of the same size as the original data
set, and then recalculating the test statistic of interest. This process is called “sampling with
replacement”, which means some data points are used more than once and others aren’t used at
all.
• In phylogenetic trees, bootstrapping tests are used to determine how well the different tree nodes are
supported.
• Imagine the data in the form of a multiple alignment: use every column (position) in the
alignment once, and build a tree.
• Once the tree is built, determine all of the splits in the tree.
• Now, resample the data about 1000 times.
• For each run, build a tree and determine its splits.
• For each node in the original tree, count how many samples give the same splits as that
node.
• These numbers are listed on the tree by each node.
• Often there is a lower limit of 50% or 60% for accepting a node as valid.
• Nodes with lower scores are often collapsed to give a condensed tree.
S63/109
Bootstrap procedure
The support of each internal branch is expressed as percent of replicates.
S64/109
Bootstrapping Example
1 2 3 4 5 6 7 8 9 10
1: T G A A G G C T T C
2: T A G A G A G C T C
3: T G G A G G G A C T
4: T G G A G G C A T T
5: C G G A G A G C T T
# each line below is a randomly chosen list of columns to use.

1343194975
5496142252
9651422826
9969191658
6724464136
3119574879
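The resampling step of this example can be sketched as below. The alignment is the one shown above; the fixed random seed, the 0-based column indices and the function names are choices made for this illustration, and the tree-building step for each replicate is not shown.

```python
import random

# The five aligned sequences from the example, ten columns each.
ALIGNMENT = {
    "1": "TGAAGGCTTC",
    "2": "TAGAGAGCTC",
    "3": "TGGAGGGACT",
    "4": "TGGAGGCATT",
    "5": "CGGAGAGCTT",
}


def bootstrap_columns(n_columns: int, rng: random.Random) -> list:
    """Draw n_columns column indices (0-based) with replacement."""
    return [rng.randrange(n_columns) for _ in range(n_columns)]


def resample(alignment: dict, columns: list) -> dict:
    """Build a pseudo-alignment made of the chosen columns, in the chosen order."""
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
replicates = [resample(ALIGNMENT, bootstrap_columns(10, rng)) for _ in range(6)]
for replicate in replicates:
    print(replicate["1"])
```

Because the same column indices are applied to every sequence, each replicate remains a valid alignment from which a tree can be built.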
S65/109
Phylogenomic databases
Phylogenomic databases differ in their
• Goals
• Methodologies
• Number of species
• Taxonomic range
• Hierarchies
• Result presentation
• Update frequencies
Certain databases that we have gone through are-
COG, KOG, eggNOG, Ensembl [Compara], HOGENOM, InParanoid ,

OMA browser, OrthoDB, OrthoMCL, PhylomeDB
S66/109
COG http://www.ncbi.nlm.nih.gov/COG/
S67/109
KOG http://genome.jgi.doe.gov/Tutorial/tutorial/kog.html
S68/109
eggNOG http://eggnogdb.embl.de/#/app/home
S69/109
Ensembl (Compara) http://www.ensembl.org/info/docs/api/compara/index.html
S70/109
HOGENOM http://doua.prabi.fr/databases/hogenom/home.php?contents=query
S71/109
InParanoid http://inparanoid.sbc.su.se/cgi-bin/index.cgi
S72/109
OMA browser http://omabrowser.org/oma/home/
S73/109
OrthoDB http://orthodb.org/
S74/109
OrthoMCL http://www.orthomcl.org/orthomcl/
S75/109
PhylomeDB http://phylomedb.org/
S76/109
Software For Phylogenetic Analysis
Examples of online tools
• Phylodendron
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
• Clustal W
http://www.genome.jp/tools/clustalw/
• PAUP
http://paup.csit.fsu.edu/
• BioNJ
http://www.atgc-montpellier.fr/bionj/
• PhyML
http://www.atgc-montpellier.fr/phyml/
Examples of offline tools
• Phylip
• Clustal X
• MrBayes
• MacClade
• TCS
• BioEdit
• TreeView
• DnaSP
• Arlequin
274 software packages described at one website
S77/109
Online Tools
S78/109
PHYLODENDRON
S79/109
Clustal W2
ClustalW2 is a general purpose DNA or protein multiple sequence alignment program for three or
more sequences. For the alignment of two sequences please instead use our pairwise sequence alignment tools.
S80/109
PAUP [Phylogenetic Analysis Using Parsimony and other Methods]
URL: http://paup.csit.fsu.edu/
• PAUP has been released by Sinauer Associates, of
Sunderland, Massachusetts.
• It is the most sophisticated parsimony program, with
many options and close compatibility with MacClade.
• It includes parsimony, distance matrix, invariants,
and maximum likelihood methods and many indices and
statistical tests.
• It has Macintosh, PowerMac, Windows, and
Unix/OpenVMS versions.
S81/109
BIONJ
S82/109
PHYML
S83/109
Offline Tools
S84/109
PHYLIP (Phylogeny Inference Package)
• Available free in Windows/MacOS/Linux systems
• Parsimony, distance matrix and likelihood methods (bootstrapping and
consensus trees)
• Data can be molecular sequences, gene frequencies, restriction sites
and fragments, distance matrices and discrete characters
input and output
S85/109
S86/109
S87/109
S88/109
Clustal X
Clustal X is a windows interface MSA program that provides an integrated environment for performing multiple
sequence and profile alignments and analysing the results. The sequence alignment is displayed in a window on the
screen.
S89/109
MrBayes
• MrBayes is a program for
Bayesian inference of phylogeny using Markov chain
Monte Carlo methods.
• It was developed by John Huelsenbeck and Fredrik Ronquist.
• A similarly named but separate program, msBayes (pronounced em es bayeszzz), was developed by Mike Hickerson, Eli Stahl,
Wen Huang, and Naoki Takebayashi and released under the
GNU Public License.
• msBayes allows complex and flexible comparative
phylogeographic inference.
• With msBayes we can test for simultaneous divergence or colonization
across multiple co-distributed pairs of taxa.
• msBayes uses hierarchical approximate Bayesian computation
(HABC) to estimate hyper-parameters given DNA sequence
data.
S90/109
2009. Bayesian phylogenetic analysis using MRBAYES
S91/109
MacClade
URL: http://phylogeny.arizona.edu/macclade/macclade.html
Developed by Wayne Maddison and David Maddison.
MacClade enables you to use the mouse-window interface to specify and rearrange
phylogenies by hand, and watch the number of
character steps and the distribution of states of
a given character on the tree change as you do
so.
[Figure: example phylogeny, panels A and B, with tips including M. cephalotes, M. phaeocephalus, M. panamensis, M. ferox, M. barbirostris, M. tuberculifer (Ecuador and Argentina), M. tyrannulus, the swainsoni, pelzelni, phaeonotus and ferocior forms, Rhytipterna immunda and Tyrannus caudifasciatus; legend: breeding range, Northern South America vs. Southern South America]
S92/109
TCS (named after Templeton, Crandall and Sing)
• A program for estimating gene genealogies
within a population.
• It implements a cladistic analysis of phenotypic
associations with haplotypes inferred from
restriction endonuclease mapping and DNA
sequence data.
• Cladogram estimation is a method that
connects existing haplotypes in a minimum
spanning tree, which is essentially a parsimony
method.
• It can also infer networks with loops in them.
[Figure: haplotype network; each node is labelled with the sample codes (for example A1, B2, F9) that share the haplotype]
S93/109
BioEdit
• URL: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html
• It is a sequence editor with many kinds of
general molecular biology functions
available (alignment, BLAST searches,
plasmid drawing, restriction mapping,
sequence machine trace viewing, etc.).
• It comes with a number of existing
phylogeny programs which can be
automatically run from within BioEdit, e.g.
TreeView, fastDNAml, PHYLIP.
• Joining different parts of a sequence together
(consensus sequence)
• Sequence alignments (manual vs. ClustalW)
• Alignments of up to 20,000 sequences
• Export in GenBank, Fasta, or PHYLIP
format
All information from: http://evolution.genetics.washington.edu/phylip/software.html
S94/109
TreeView
• Visualising trees
• We can change the graphic
presentation of a tree to a
cladogram, rectangular
cladogram, radial tree,
phylogram etc.
• But it does not change the
structure of a tree
S95/109
DnaSP
• It is for the analysis of nucleotide
polymorphism from aligned DNA sequence data.
• DnaSP can estimate several measures of DNA
sequence variation, linkage disequilibrium,
recombination, gene flow and gene conversion
parameters within and between populations.
• It calculates measures of population
divergence, including the Jukes-Cantor
distance, which can be used as a distance in
phylogeny reconstruction.
S96/109
Arlequin
• It can perform many kinds of population genetic
tasks including estimation of gene frequencies,
testing of linkage disequilibrium, and analysis of
diversity between populations.
• It can compute a variety of genetic distance
measures including those of Jukes and Cantor, the
Kimura 2-parameter distance, and the Tamura-Nei
distance, each of these with or without correction
for gamma-distributed rates of evolution.
• It can also compute a Minimum Spanning Tree
network.
S97/109
Servers for phylogenetic analysis
• http://www.phylogeny.fr/
• http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
• http://phylobench.vital-it.ch/raxml-bb/
• http://power.nhri.org.tw/power/home.htm
S98/109
http://www.phylogeny.fr/
S99/109
S100/109
http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
S101/109
S102/109
http://phylobench.vital-it.ch/raxml-bb/
S103/109
http://power.nhri.org.tw/power/home.htm
S104/109
Applications of Phylogenetic analysis
1. It is used as a tool for investigating
problems in
i. HIV virus mutation
ii. Evolution of influenza
iii. Biogeography
iv. Forensic science
2. Drug discovery
i. Vaccine development
S105/109
3. It is used to study the order of separation of areas, based on the different taxa
that occupy them.
4. Predicting functions of uncharacterized genes - ortholog detection
DISADVANTAGES
• Saturation causes loss of phylogenetic signal.
• When the compared homologous sequences have experienced too many residue
substitutions since divergence,
it is impossible to determine the phylogenetic tree, whatever the tree-building method
used.
• Often saturation may not be detectable.
S106/109
Textbooks
• Page & Holmes, Molecular Evolution: A Phylogenetic Approach. Blackwell
Science.
• Felsenstein, Inferring Phylogenies. Sinauer Associates.
• Hall, Phylogenetic Trees Made Easy. Sinauer Associates.
• Molecular evolution: Graur & Li, Fundamentals of Molecular Evolution (2nd edition). Sinauer, 2000.
• Evolution in general: M. Ridley, Evolution (2nd edition). Blackwell, 1996.
• Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A practical guide
to the analysis of genes and proteins”, 2001, Wiley.
• Barbara Resch, “Hidden Markov Models - A Tutorial for the Course
Computational Intelligence”, 2010.
S107/109
Software:
• PHYLIP : an extensive package of programs for all platforms (NJ, MP,
ML)
http://evolution.genetics.washington.edu/phylip/software.html
• MrBayes (Bayesian): http://mrbayes.csit.fsu.edu
• ClustalX : multiple sequence alignment with a graphical interface
(for all types of computers).
http://www.ebi.ac.uk/FTP/index.html and go to ‘software’
• Database similarity searches (Blast) :
http://www.ncbi.nlm.nih.gov/BLAST/
S108/109
Websites:
• MultiPhyl (ML via email)
http://distributed.cs.nuim.ie/multiphyl.php
• Felsenstein’s Phylogeny program page (links to available software):

http://evolution.genetics.washington.edu/phylip/software.html
• Lecture notes of molecular systematics

http://www.bioinf.org/molsys/lectures.html
S109/109
