Phylogenetics
S2/109
Systematics
• It is used to determine the evolutionary history of, and
relationships among, organisms, drawing on a wide variety
of sources including paleontology, morphology, and
molecular biology.
• It is the study of evolutionary relationships and
determines how a family might have been
derived during evolution.
Objective-
To discover all of the branching relationships
in the tree and the branch lengths.
S3/109
Why do we need to study phylogenies?
S5/109
Determination of 'COGs' for proteins
• Input: a single text file containing the amino acid sequences of the proteins in FASTA format.
NOTE - All proteins must belong to a single species, but may come from different strains.
• Two approaches are possible:
1. The first approach is simple.
i. Run BLASTp of the query sequences against the COG database available at the following
URL: ftp://ftp.ncbi.nlm.nih.gov/pub/kristensen/thousandgenomespogs/blastdb/
ii. Then look for the best match for each query and see which COG group it belongs to.
2. The second approach is more complex:
i. Install the COG software (COGsoft.201204.tar) from the following link:
URL: ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/
ii. Then run an all-against-all PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool).
iii. Then convert the data into a format acceptable to COGnitor using the different modules of the COG software.
iv. Every step in this approach takes time.
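The "best match for each query" step of the first approach can be sketched in Python. This is a minimal sketch, assuming the BLASTp results were saved in the standard 12-column tabular format (`-outfmt 6`). The example rows, the subject names and the `subject_to_cog` lookup are hypothetical, invented only for illustration; real subject-to-COG mappings come from the COG data files.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(tabular_text):
    """Return {query_id: (subject_id, bitscore)}, keeping the highest bit score."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(FIELDS, row))
        score = float(rec["bitscore"])
        if rec["qseqid"] not in best or score > best[rec["qseqid"]][1]:
            best[rec["qseqid"]] = (rec["sseqid"], score)
    return best

# Hypothetical example data: two queries, three hits.
example = (
    "q1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\tsubjB\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-20\t95.5\n"
    "q2\tsubjC\t85.0\t120\t18\t0\t1\t120\t1\t120\t1e-60\t210\n"
)
# Hypothetical subject-to-COG lookup table.
subject_to_cog = {"subjA": "COG0001", "subjB": "COG0002", "subjC": "COG0003"}

hits = best_hits(example)
assignments = {q: subject_to_cog.get(s) for q, (s, _) in hits.items()}
```

Ranking by bit score is one reasonable choice; ranking by E-value would be an equally simple variant.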
S6/109
Synteny
Synteny is the inference that two or
more genomic regions are derived
from a single ancestral genomic
region; syntenic loci lie in such
proximity that they may be subject
to linkage.
S7/109
SyMAP (Synteny Mapping and Analysis Program)
S8/109
Phylogenetic trees
S9/109
Trees terminology
• A tree is composed of nodes and branches; the outer
branches represent the taxa.
• Taxa: the sequences being compared (singular: taxon).
• Operational taxonomic unit (OTU): the samples selected
to be used in a study.
• Node: a branch point in a tree.
• Branch: defines the relationship between the taxa in terms
of descent and ancestry.
• Branch length: represents the number of character changes
that have occurred along the branch.
• Topology: the branching pattern of the tree.
• Root: the common ancestor of all taxa.
• Clade: a group of two or more taxa that includes both their
common ancestor and all of their descendants.
S10/109
Relationship between Hierarchy and Phylogeny
S11/109
Trees
Classification based on
UNROOTED TREES
• They lack a root, and hence do not specify the direction of
evolutionary relationships.
• They say nothing about which nodes are ancestors and which are descendants.
• Sequences that are adjacent on an unrooted tree need not
be evolutionarily closely related.
S13/109
Unrooted Trees
• Most phylogenetic methods produce unrooted trees; there are two ways to
root an unrooted tree:
• 1. Outgroup Method
• 2. Molecular Clock Hypothesis
S14/109
Rooting a Tree
STEP 1 - A direction for time is needed (tree-building algorithms are completely time-reversible).
STEP 2 - Include one or more sequences, known as an “outgroup”, that are known to be more
distantly related than all the others.
STEP 3 - The root is placed on the branch that connects the outgroup to the rest of the
sequences, halfway between them.
S15/109
Possible evolutionary trees
Multifurcating
More than two descendants per interior node
S17/109
Classification of tree based on Groups
• Monophyletic group: includes an ancestor and all of its descendants.
• Paraphyletic group: includes an ancestor and some, but not all, of its descendants.
• Polyphyletic group: includes two convergent descendants but not their common ancestor.
S20/109
Phenetics
S21/109
Classification of tree based on distance
• CLADOGRAM: shows the branching pattern. Branch lengths have no meaning, but sometimes
differ for artistic effect.
• PHYLOGRAM: also known as an additive or metric tree. Branch lengths are proportional
to evolutionary distance.
• ULTRAMETRIC TREE: branch lengths are proportional to time. This is the molecular clock
model, which implies that evolution occurs at a constant rate in all species.
S22/109
“Splits / partitions”
• A split arises when a specific internal branch
is removed: the taxa fall into two groups.
• When you list the species in one group of the split, the
other is automatically defined.
S23/109
Phylogenetics approach
S24/109
Step 1 : Sequence analysis
S25/109
Phylogenetic analysis
1. Data selection.
2. Data comparison
3. Selection of a data model
4. Selection of an evolutionary model
5. Tree-building
6. Tree evaluation
S26/109
1. Data selection
Parameters to be considered:
• Input data must be homologous!
• Similar distribution
• Number of character states and size of the dataset
Type of data:
• Morphological characters
• Physiological characters
• Gene order
• Sequence data (nucleotide and amino acid)
2. Data comparison
• Choose a suitable alignment method
MSA methods:
• ClustalW (very fast)
• Muscle (very fast)
• MAFFT (fast)
• Probcons
• T-coffee
NOTE - Highly diverged sequences:
• Domain/family predictions
• Structures
S27/109
3. Selection of a data model
Parameters to be considered:
• Each position in the alignment should be homologous!
• Missing data (in some OTUs)
• Number of characters
Two categories:
• Numerical data
1. Distance between objects, i.e. evolutionary distance between two species
• Character data
1. Each character has a finite number of states
2. E.g. number of legs = 1, 2, 4
S29/109
Distance based
General principle:
Sequence alignment → (1) → matrix of evolutionary distances between sequence pairs → (2) → (unrooted) tree
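Step (1), turning an alignment into a matrix of pairwise distances, can be sketched as follows. This is a minimal sketch that uses the simplest possible measure, the proportion of differing sites (p-distance); the toy alignment and the function names are assumptions made for illustration, and a real analysis would usually apply a substitution model on top of this.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """alignment: {name: aligned sequence}; returns {(name1, name2): distance}."""
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m, n in combinations(sorted(alignment), 2)}

# Hypothetical toy alignment (all sequences have the same aligned length).
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACTTAC-A"}
dist = distance_matrix(toy)
```

Step (2) then hands this matrix to a tree-building method such as Fitch-Margoliash, Neighbor Joining or UPGMA.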
S32/109
Fitch-Margoliash Method
• This method generates an unrooted additive tree.
• The branch lengths are unequal: they are proportional to
evolutionary distance.
• The method works on small, 3-branch trees.
STEP 1 - Determine the branch lengths and the position of the
internal node from the distances.
[Figure: panels labelled STEP 2 and STEP 3]
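For a 3-branch tree, the branch lengths follow directly from the three pairwise distances, because each distance is the sum of two branches. A minimal sketch of that calculation, with illustrative distance values and a hypothetical function name:

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths a, b, c from the internal node to taxa A, B and C.

    Solves a + b = d_ab, a + c = d_ac, b + c = d_bc.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

# Illustrative distances between three taxa.
a, b, c = three_point_branch_lengths(7, 13, 8)
```

With more than three taxa, the method applies the same calculation repeatedly, treating the remaining taxa as a single composite third group.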
S33/109
Neighbor Joining (NJ)
Principles:
• It is a bottom-up clustering method.
• Tree topology and branch lengths are estimated from a genetic distance
matrix
• Neighbours are defined as taxa connected by a single node in an unrooted
tree.
• Closest neighbours are successively joined by a new node until the tree is
resolved.
• It results in a single unrooted tree with branch length estimates that need
to be rooted by the outgroup method.
• NJ is a fast method, even for hundreds of sequences.
S34/109
The Neighbor-Joining Method: algorithm
• Start from a star topology and progressively construct a tree as follows:
• Step 1: Use the distances d measured between the N sequences.
• Step 2: For all pairs i and j: consider the tree topology in which i and j are grouped, and compute
Si,j, the sum of all “best” branch lengths. (Saitou and Nei found a simple
way to compute Si,j.)
• Step 3: Retain the pair (i,j) with the smallest Si,j value. Group i and j in the tree.
• Step 4: Compute new distances d between N-1 objects: the pair (i,j) and the N-2
remaining sequences.
d(i,j),k = (di,k + dj,k) / 2
• Step 5: Return to step 1 as long as N ≥ 4. When N = 3, an (unrooted) tree is
obtained.
S35/109
Example
• We are using only 4 taxa.
• Step 1 is to calculate the neighbor distances Q, using the following equation:
$$Q_{ij} = (N-2)\,d_{ij} - \sum_{k=1}^{N} d_{ik} - \sum_{k=1}^{N} d_{jk}$$
dist   A    B    C    D
A      0
B      7    0
C      13   8    0
D      17   12   14   0
Pair   (N-2)*AB – (AB+AC+AD) – (AB+BC+BD)      Q score
A-B    (4-2)*7 – (7+13+17) – (7+8+12)          -50
A-C    (4-2)*13 – (7+13+17) – (13+8+14)        -46
A-D    (4-2)*17 – (7+13+17) – (17+12+14)       -46
B-C    (4-2)*8 – (7+8+12) – (13+8+14)          -46
B-D    (4-2)*12 – (7+8+12) – (17+12+14)        -46
C-D    (4-2)*14 – (13+8+14) – (17+12+14)       -50
-50 is the lowest score, and we could use either A-B or C-D.
We arbitrarily choose A-B to join first.
S36/109
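• Branch lengths from each joined pair to its new node (Y joins A and B; X joins C and D):
$$d_{AY} = \frac{1}{2}\,d_{AB} + \frac{1}{2(N-2)} \left[ \sum_{k=1}^{N} d_{Ak} - \sum_{k=1}^{N} d_{Bk} \right]$$
Branch   d                                             score
A-Y      (1/2)*7 + {1/4*[(7+13+17) – (7+8+12)]}        6
B-Y      7-6                                           1
C-X      (1/2)*14 + {1/4*[(13+8+14) – (17+12+14)]}     5
D-X      14-5                                          9
X-Y                                                    2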
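The Q scores and branch lengths of this 4-taxon example can be checked with a short Python sketch. The distances are those of the example matrix (A-B 7, A-C 13, A-D 17, B-C 8, B-D 12, C-D 14); the function names are hypothetical, and only the first joining step is covered, not a full Neighbor Joining implementation.

```python
from itertools import combinations

taxa = ["A", "B", "C", "D"]
pairwise = {("A", "B"): 7, ("A", "C"): 13, ("A", "D"): 17,
            ("B", "C"): 8, ("B", "D"): 12, ("C", "D"): 14}
dist = {frozenset(k): v for k, v in pairwise.items()}

def d(i, j):
    """Symmetric lookup of the distance between taxa i and j."""
    return 0 if i == j else dist[frozenset((i, j))]

def row_sum(i):
    """Sum of distances from taxon i to every taxon."""
    return sum(d(i, k) for k in taxa)

def q_score(i, j):
    """Q_ij = (N-2)*d_ij - sum_k d_ik - sum_k d_jk."""
    return (len(taxa) - 2) * d(i, j) - row_sum(i) - row_sum(j)

def branch_lengths(i, j):
    """Lengths of the branches from i and from j to their new joining node."""
    to_i = d(i, j) / 2 + (row_sum(i) - row_sum(j)) / (2 * (len(taxa) - 2))
    return to_i, d(i, j) - to_i

q = {pair: q_score(*pair) for pair in combinations(taxa, 2)}
```

Here `q` reproduces the Q score column, and `branch_lengths("A", "B")` and `branch_lengths("C", "D")` reproduce the A-Y, B-Y, C-X and D-X values.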
S37/109
Neighbor Joining (NJ)
Advantages:
• Very efficient
• Also for large datasets
• A single tree is estimated by minimising genetic distance, in a short time and
with little computational expenditure.
Disadvantages:
• The method lacks accuracy because there is no attempt to correct for
potential bias (homoplasy).
• The method lacks precision because the outcome is partly contingent on
the tree with which the search process begins.
• Does not examine all possible topologies
S38/109
UPGMA (Unweighted Pair Group Method
with Arithmetic mean)
• UPGMA is the oldest distance matrix method.
• Simplest method - uses a sequential clustering algorithm
• It uses a distance matrix representing a measure of genetic distance
between the pairs of species being considered
• It clusters the two closest species.
• A new distance matrix is computed using the arithmetic mean of the distances to this first cluster
• This is repeated until all species are grouped
S39/109
S40/109
      A    B    C    D    E
A     0    10   12   10   7
B          0    4    4    13
C               0    6    15
D                    0    13
E                         0
UPGMA Step 1: combine B and C

      A    BCD   E
A     0    10.5  7
BCD        0     13.5
E                0

      AE   BCD
AE    0    12
BCD        0
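The clustering steps of this example can be sketched in Python. One observation, offered as a reading of the numbers rather than of the author's intent: the intermediate values shown (10.5, 13.5 and 12) are what you get by taking the plain mean of the two merged clusters' distances, whereas averaging over all member pairs (the size-weighted update normally associated with UPGMA) gives slightly different values. The sketch supports both through a `size_weighted` flag; the function and variable names are hypothetical.

```python
from itertools import combinations

def cluster(dist, taxa, size_weighted=True):
    """Repeatedly merge the closest pair of clusters.

    dist: {frozenset({x, y}): distance} between single taxa.
    Returns the merges as (cluster1, cluster2, distance), clusters being tuples.
    size_weighted=True averages over all member pairs; False takes the
    plain mean of the two merged clusters' distances.
    """
    d = {frozenset(((a,), (b,))): dist[frozenset((a, b))]
         for a, b in combinations(taxa, 2)}
    clusters = [(t,) for t in taxa]
    merges = []
    while len(clusters) > 1:
        # min() keeps the first minimum, so ties are broken by input order.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merges.append((x, y, d[frozenset((x, y))]))
        new = x + y
        wx, wy = (len(x), len(y)) if size_weighted else (1, 1)
        clusters = [c for c in clusters if c not in (x, y)]
        for c in clusters:
            d[frozenset((new, c))] = (
                wx * d[frozenset((x, c))] + wy * d[frozenset((y, c))]
            ) / (wx + wy)
        clusters.append(new)
    return merges

taxa = ["A", "B", "C", "D", "E"]
upper = {("A", "B"): 10, ("A", "C"): 12, ("A", "D"): 10, ("A", "E"): 7,
         ("B", "C"): 4, ("B", "D"): 4, ("B", "E"): 13,
         ("C", "D"): 6, ("C", "E"): 15, ("D", "E"): 13}
dist = {frozenset(k): v for k, v in upper.items()}

plain = cluster(dist, taxa, size_weighted=False)
weighted = cluster(dist, taxa, size_weighted=True)
```

Both variants merge B with C first, then add D, then join A with E; they differ only in the distance at which the last two clusters meet.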
S41/109
UPGMA Result
[Figure: UPGMA result tree for taxa a-e, with a second version labelled "Correction using Formulae"]
S42/109
Distance method
Advantages:
• Fast - suitable for analysing data sets which are too large for other more
computationally intensive methods such as maximum likelihood.
• A large number of models are available with many parameters - improves
estimation of distances.
Disadvantages:
• Information is lost - given only the distances, it is impossible to derive the
original sequences.
• Only through character based analyses can the history of sites be investigated;
e.g., most informative positions be inferred.
S43/109
Character based
S44/109
Character- (Sequence-) based methods
Most common:
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)
• Bayesian Inference
S45/109
Parsimony Methods
• Parsimony takes the aligned sequences and generates the tree that minimizes the number of mutations,
i.e. the sum of all branch lengths, without regard to the length of any individual branch.
• It works directly on the aligned sequences, does not use a distance matrix or evolutionary model, and
completely ignores the possibility of multiple mutations.
• Informative sites
Not all sites contribute useful information to counting mutations.
• Invariant sites
Sites where all sequences have the same base; they are worthless.
• Singleton sites
Sites where only one sequence has the mutation; they are also worthless, because
no matter what the tree topology is, a singleton site always needs exactly 1 mutation to
generate.
• Uninformative sites
Nucleotide (or amino acid) columns that do not allow the distinction between two trees.
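The site categories above can be told apart by counting character states per alignment column. A minimal sketch, assuming the usual criterion that a parsimony-informative site has at least two states that each occur in at least two sequences; the toy alignment and the function name are hypothetical.

```python
from collections import Counter

def classify_site(column):
    """Classify one alignment column (a string with one character per sequence)."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # Informative: at least two states, each present in two or more sequences.
    if sum(1 for c in counts.values() if c >= 2) >= 2:
        return "informative"
    # Singleton: exactly one sequence differs from all the others.
    if len(counts) == 2 and min(counts.values()) == 1:
        return "singleton"
    return "uninformative"

# Hypothetical toy alignment of four sequences, six columns.
seqs = ["AAGACA", "AAGACC", "ACTACG", "ACTTGG"]
columns = ["".join(col) for col in zip(*seqs)]
labels = [classify_site(col) for col in columns]
```

Singletons are reported separately here, although they are also uninformative for choosing between trees.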
S46/109
Maximum Parsimony
• It was originally developed for morphological characters.
• The topology of the resulting tree is the one that requires the smallest
number of evolutionary changes (the principle of William of Ockham).
Principle:
1. Estimate the minimum number of substitutions for a given topology
2. Use parsimony-informative sites (exclude invariable sites and singletons)
3. Search for MP trees by
i. Exhaustive search
ii. Heuristic search
4. Result: multiple result trees are possible (mainly unrooted trees
result)
S47/109
Algorithm
Step 1: For a given tree topology and a given alignment site, determine the
ancestral residues that require the smallest total number of changes in the whole
tree. Let d be this total number of changes.
Step 2: Compute d for each alignment site.
Step 3: Add the d values for all alignment sites, giving the length L of the tree.
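One common way to obtain the per-site count d is Fitch's algorithm, sketched below. The slides do not say which procedure they use, so this is an assumption made for illustration; the toy alignment and the nested-tuple tree encoding are also hypothetical. The tree is written as rooted for convenience; with unordered characters the count does not depend on where the root is placed.

```python
def fitch(tree, states):
    """Return (state_set, changes) for one site on a rooted binary tree.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    states: {leaf name: observed character at this site}.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint state sets: one extra change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, alignment):
    """Sum the per-site change counts d over all sites to get the length L."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

# Hypothetical toy alignment of four sequences.
aln = {"1": "AAGT", "2": "AAGC", "3": "ACTT", "4": "ACTC"}
topologies = {
    "(1,2),(3,4)": (("1", "2"), ("3", "4")),
    "(1,3),(2,4)": (("1", "3"), ("2", "4")),
    "(1,4),(2,3)": (("1", "4"), ("2", "3")),
}
lengths = {name: tree_length(t, aln) for name, t in topologies.items()}
```

The maximum parsimony tree is the topology with the smallest length, here `(1,2),(3,4)`.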
S48/109
S49/109
S50/109
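[Figure: per-site change counts for three alternative topologies of sequences 1-4]
Topology 1: 0 + 3 + 2 + 2 + 0 + 1 + 1 + 1 + 1 + 3 = 14
Topology 2: 0 + 3 + 2 + 2 + 0 + 1 + 2 + 1 + 2 + 3 = 16
Topology 3: 0 + 3 + 2 + 1 + 0 + 1 + 2 + 1 + 2 + 3 = 15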
S51/109
Maximum Parsimony (MP)
Advantages:
• It is a simple method and free from assumptions.
• Easy to understand the operation.
• Does not depend on an explicit model of evolution
• Gives both trees and associated hypotheses of character evolution.
Disadvantages:
• Generally produces multiple result trees.
• Does not take into account homoplasy.
• Creates wrong topologies if the substitution rate varies extensively
between lineages
S52/109
Maximum Likelihood (ML)
Principle
• It looks for the tree that, under a given model of evolution, maximizes the
likelihood of the observed data
• It calculates likelihoods for each position in the alignment and for all possible
topologies (gaps generally removed) and returns the tree with the highest
likelihood.
• It locates the most likely tree topology through a hill-climbing algorithm
• Searching strategies are rarely exhaustive and mostly heuristic, such as:
• NNI (Nearest neighbor interchanges)
• TBR (Tree bisection-reconnection)
• SPR (Subtree pruning and regrafting)
S53/109
Maximum likelihood methods
• Hypotheses
• The substitution process follows a probabilistic model whose mathematical
expression, but not parameter values, is known a priori.
• Sites evolve independently from each other.
• All sites follow the same substitution process (some methods use a more
realistic hypothesis).
• Substitution probabilities do not change with time on any tree branch. They
may vary between branches.
S54/109
Maximum likelihood algorithm
• Step 1:
Let us consider a given rooted tree, a given site, and a given set of branch lengths.
• Let S1, S2, S3, S4: observed bases at the site in seq. 1, 2, 3, 4
and S5, S6, S7: unknown and variable ancestral bases
and l1, l2, …, l6 be the given branch lengths
[Figure: rooted tree with tips S1-S4, ancestral nodes S5 and S6, root S7, and branch lengths l1-l6]
S55/109
• Step 2: Let us compute the probability that the entire sequences have
evolved:
$$P(Sq_1, Sq_2, Sq_3, Sq_4) = \prod_{\text{all sites}} P(S_1, S_2, S_3, S_4)$$
• Step 3: Let us compute branch lengths l1, l2, …, l6 that give the
highest P(Sq1, Sq2, Sq3, Sq4) value. This is the likelihood of the tree.
• Step 4: Let us compute the likelihood of all possible trees. The tree
predicted by the method is that having the highest likelihood.
S56/109
• Example: Likelihood of a single sequence with
two nucleotides AC
• For DNA sequence comparison the model has 2 parts, the base
composition (A, G, C, T) and the process.
• If the model is Jukes – Cantor model, which has a base composition of ¼
for each nucleotide then the likelihood will be 1/4 X 1/4 = 1/16.
• If the model has a composition of 40%A and 10%C the likelihood of the
sequence will be 0.4 x 0.1=0.04
• If we take the 16 possible two-nucleotide combinations and sum their
likelihoods, the total is 1.
• For any model, the sum of the likelihoods of all the different data
possibilities should be 1.
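The arithmetic of this example can be checked with a few lines of Python. The sketch treats each base as drawn independently from the base composition, as the example does. Only the 40% A and 10% C figures come from the text; the G and T shares in `skewed` are assumed values chosen so that the composition sums to 1.

```python
from itertools import product

def sequence_likelihood(seq, composition):
    """Probability of a sequence when each base is drawn independently."""
    p = 1.0
    for base in seq:
        p *= composition[base]
    return p

# Jukes-Cantor base composition: 1/4 for each nucleotide.
jc = {base: 0.25 for base in "ACGT"}
# 40% A and 10% C as in the text; the G and T shares are assumed.
skewed = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}

lik_jc = sequence_likelihood("AC", jc)
lik_skewed = sequence_likelihood("AC", skewed)
# Sum over all 16 possible two-nucleotide sequences.
total = sum(sequence_likelihood("".join(pair), skewed)
            for pair in product("ACGT", repeat=2))
```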
S57/109
The probability of nucleotide substitution
SpC: TCAGCCGACTGT
SpD: TCAGACGACTGT
• The actual distance (d) between the two sequences is related to the
probability (p) that the sequences differ at a site:
$$p = \frac{3}{4}\left[1 - e^{-\frac{4}{3}d}\right]$$
where d = 3αt
[Figure: the four nucleotides A, G, C and T, with each substitution occurring at rate α]
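The Jukes-Cantor relation p = 3/4 [1 − e^(−4d/3)] can be inverted to estimate d from the observed proportion of differing sites, giving d = −3/4 ln(1 − 4p/3). A minimal sketch applied to the SpC and SpD sequences; the function names are hypothetical.

```python
import math

def jc_p_from_d(d):
    """Expected proportion of differing sites for a Jukes-Cantor distance d."""
    return 0.75 * (1 - math.exp(-4 * d / 3))

def jc_d_from_p(p):
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if p >= 0.75:
        raise ValueError("p >= 3/4: sequences are saturated, d is undefined")
    return -0.75 * math.log(1 - 4 * p / 3)

spc = "TCAGCCGACTGT"
spd = "TCAGACGACTGT"
p_obs = sum(a != b for a, b in zip(spc, spd)) / len(spc)  # 1 difference in 12 sites
d_est = jc_d_from_p(p_obs)
```

The estimated distance is slightly larger than the observed proportion, because the model allows for multiple substitutions at the same site.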
S58/109
Maximum likelihood : properties
• This is the best justified method from a theoretical viewpoint.
• Sequence simulation experiments have shown that this method works better than
all others in most cases.
• It is nearly always impossible to evaluate all possible trees because there are too
many.
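How quickly the number of trees grows can be made concrete: for n labelled taxa there are (2n−5)!! = 3 × 5 × … × (2n−5) distinct unrooted bifurcating topologies. A minimal sketch with a hypothetical function name:

```python
def unrooted_tree_count(n):
    """Number of unrooted bifurcating topologies for n >= 3 labelled taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

def rooted_tree_count(n):
    """Rooting adds one branch choice, equivalent to adding one more taxon."""
    return unrooted_tree_count(n + 1)

counts = {n: unrooted_tree_count(n) for n in (4, 5, 10, 20)}
```

Four taxa give 3 unrooted trees, ten taxa already give more than two million, which is why heuristic searches are used.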
S61/109
Bootstrap
Principle:
• New MSA datasets are created by randomly choosing N columns, with replacement, from the original MSA,
where N is the length of the original MSA
• Phylogenetic analysis is then performed on all bootstrap replicates
• The consensus tree indicates bootstrap support for each node
• Mostly 1000 replicates (100 replicates for large datasets)
• Bootstrap support values: min. 98% (strict), min. 95% (accepted)
Properties
• Internal branches supported by ≥ 90% of replicates are considered as statistically
significant.
• The bootstrap procedure only detects if sequence length is enough to support a particular
node.
• The bootstrap procedure does not help determining if the tree-building method is good.
S62/109
Bootstrapping Tests
• It involves repeatedly taking random samples of the data, each of the same size as the original data
set, and then recalculating the test statistic of interest. This process is called “sampling with
replacement”, which means some data points are used more than once and others aren’t used at
all.
• In phylogenetics, bootstrapping tests are used to determine how well the different nodes of a tree
are supported.
• Imagine the data in the form of a multiple alignment: use every column (position) in the
alignment once, and build a tree.
• Once the tree is built, determine all of the splits in the tree.
• Now, resample the data about 1000 times.
• For each run, build a tree and determine its splits.
• For each node in the original tree, count how many samples give the same split as that
node.
• These numbers are listed on the tree by each node.
• Often there is a lower limit of 50% or 60% for accepting a node as valid.
• Nodes with lower scores are often collapsed to give a condensed tree.
S63/109
Bootstrap procedure
S64/109
Bootstrapping Example
1 2 3 4 5 6 7 8 9 10
1: T G A A G G C T T C
2: T A G A G A G C T C
3: T G G A G G G A C T
4: T G G A G G C A T T
5: C G G A G A G C T T
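The column resampling step can be sketched in Python using the five sequences of this example. Tree building and split counting are left out; the function name and the fixed random seed are assumptions made so that the sketch is reproducible.

```python
import random

alignment = {
    "1": "TGAAGGCTTC",
    "2": "TAGAGAGCTC",
    "3": "TGGAGGGACT",
    "4": "TGGAGGCATT",
    "5": "CGGAGAGCTT",
}

def bootstrap_replicate(aln, rng):
    """Sample columns with replacement; the replicate keeps the original length.

    Returns the resampled alignment and the list of column indices drawn.
    """
    n_cols = len(next(iter(aln.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    resampled = {name: "".join(seq[i] for i in picks) for name, seq in aln.items()}
    return resampled, picks

rng = random.Random(0)  # fixed seed, assumed here for reproducibility
replicates = [bootstrap_replicate(alignment, rng) for _ in range(1000)]
```

Each replicate uses the same column indices for every sequence, so whole alignment columns are resampled rather than individual characters. A tree would then be built from each replicate and its splits compared with those of the original tree.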
• Goals
• Methodologies
• Number of species
• Taxonomic range
• Hierarchies
• Result presentation
• Update frequencies
S66/109
COG http://www.ncbi.nlm.nih.gov/COG/
S67/109
KOG http://genome.jgi.doe.gov/Tutorial/tutorial/kog.html
S68/109
eggNOG http://eggnogdb.embl.de/#/app/home
S69/109
Ensembl (Compara) http://www.ensembl.org/info/docs/api/compara/index.html
S70/109
HOGENOM http://doua.prabi.fr/databases/hogenom/home.php?contents=query
S71/109
InParanoid http://inparanoid.sbc.su.se/cgi-bin/index.cgi
S72/109
OMA browser http://omabrowser.org/oma/home/
S73/109
OrthoDB http://orthodb.org/
S74/109
OrthoMCL http://www.orthomcl.org/orthomcl/
S75/109
PhylomeDB http://phylomedb.org/
S76/109
Software For Phylogenetic Analysis
Examples of online tools
• Phylodendron
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
• Clustal W
http://www.genome.jp/tools/clustalw/
• PAUP
http://paup.csit.fsu.edu/
• BioNJ
http://www.atgc-montpellier.fr/bionj/
• PhyML
http://www.atgc-montpellier.fr/phyml/
Examples of offline tools
• Phylip
• Clustal X
• MrBayes
• MacClade
• TCS (Transitive Consistency Score)
• BioEdit
• TreeView
• DnaSP
• Arlequin
274 software packages described at one website
S77/109
Online Tools
S78/109
PHYLODENDRON
S79/109
Clustal W2
ClustalW2 is a general purpose DNA or protein multiple sequence alignment program for three or
more sequences. For the alignment of two sequences please instead use our pairwise sequence alignment tools.
S80/109
PAUP [Phylogenetic Analysis Using Parsimony and other Methods]
URL: http://paup.csit.fsu.edu/
S81/109
BIONJ
S82/109
PHYML
S83/109
Offline Tools
S84/109
PHYLIP (Phylogeny Inference Package)
• Available free in Windows/MacOS/Linux systems
• Parsimony, distance matrix and likelihood methods (bootstrapping and
consensus trees)
• Data can be molecular sequences, gene frequencies, restriction sites
and fragments, distance matrices and discrete characters
input and output
S85/109
S86/109
S87/109
S88/109
Clustal X
Clustal X is a windows interface MSA program that provides an integrated environment for performing multiple
sequence and profile alignments and analysing the results. The sequence alignment is displayed in a window on the
screen.
S89/109
MrBayes
S90/109
2009. Bayesian phylogenetic analysis using MRBAYES
S91/109
MacClade
URL: http://phylogeny.arizona.edu/macclade/macclade.html
Developed by … Wayne Maddison and …
… character steps and the distribution of states of …
[Figure: example trees (panels A and B) with tips including M. cephalotes, M. phaeocephalus, M. panamensis, M. ferox, M. barbirostris, M. tuberculifer, M. tyrannulus, Rhytipterna immunda and Tyrannus caudifasciatus]
S92/109
TCS (Transitive Consistency Score)
• A program for estimating gene genealogies within a population.
• A cladistic analysis of phenotypic …
[Figure: example gene genealogy with nodes labelled by groups A to G]
S93/109
BioEdit
• URL: http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html
• Visualising trees
• We can change the graphic presentation of a tree to a cladogram, rectangular
cladogram, radial tree, phylogram etc.
• But this does not change the structure of the tree.
S95/109
DnaSP
S96/109
Arlequin
S97/109
Servers for phylogenetic analysis
• http://www.phylogeny.fr/
• http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
• http://phylobench.vital-it.ch/raxml-bb/
• http://power.nhri.org.tw/power/home.htm
S98/109
http://www.phylogeny.fr/
S99/109
S100/109
http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
S101/109
S102/109
http://phylobench.vital-it.ch/raxml-bb/
S103/109
http://power.nhri.org.tw/power/home.htm
S104/109
Applications of Phylogenetic analysis
1. It is used as a tool for investigating problems
i. HIV virus mutation
ii. Evolution of influenza
iii. Biogeography
iv. Forensic science
2. Drug discovery
i. Vaccine development
S105/109
3. It is used to study the order of separation of the areas based on different taxa
occupied.
4. Predicting functions of uncharacterized genes - ortholog detection
DISADVANTAGES
Loss of phylogenetic signal due to saturation: this occurs when the compared
homologous sequences have experienced too many residue substitutions since
divergence.
S107/109
Software:
• PHYLIP : an extensive package of programs for all platforms (NJ, MP,
ML)
http://evolution.genetics.washington.edu/phylip/software.html
• MrBayes (Bayesian): http://mrbayes.csit.fdsu.edu
• ClustalX : multiple sequence alignment with a graphical interface
(for all types of computers).
http://www.ebi.ac.uk/FTP/index.html and go to ‘software’
• Database similarity searches (Blast) :
http://www.ncbi.nlm.nih.gov/BLAST/
S108/109
Websites:
• MultiPhyl (ML via email)
http://distributed.cs.nuim.ie/multiphyl.php
S109/109