PHYLOGENETIC ANALYSIS
• A phylogenetic tree, also known as a
phylogeny, is a diagram that depicts the lines
of evolutionary descent of different species,
organisms, or genes from a common ancestor.
[Figure: anatomy of a phylogenetic tree for taxa A to E. Terminal nodes represent the taxa (genes, populations, species, etc.) used to infer the phylogeny. Branches are lineages. Internal nodes are divergence points and represent hypothetical ancestors of the taxa. The ancestral node is the root of the tree.]
Three types of trees
[Figure: the same taxa drawn as a cladogram, a phylogram, and an ultrametric tree.]
All show the same evolutionary relationships, or branching orders, between the taxa.
• Phylogenetic analysis can be used to solve a
number of interesting problems
– Forensics
• HIV mutates rapidly
– Predicting evolution of influenza viruses
– Predicting functions of uncharacterized genes -
orthologue detection
– Drug discovery
– Vaccine development
• Target inferred common ancestor
Objectives
• Evolution,
• Elements of phylogeny,
• Methods of phylogenetic analysis,
• Phylogenetic tree of life,
• Comparison of genetic sequence
of organisms,
• Phylogenetic analysis tools-
– Phylip,
– ClustalW.
Evolution
• Speciation
– Evolution of new organisms is driven by
• Mutations
– The DNA sequence can be changed due to single base changes,
deletion/insertion of DNA segments, etc.
• Selection bias
– Speciation events lead to creation of different
species.
– Speciation can be caused by physical separation into
groups in which
different genetic variants become dominant
• Any two species share a (possibly distant) common
ancestor
• The molecular clock hypothesis
Elements of phylogeny
A phylogenetic tree
• A phylogenetic tree is a graph reflecting the
approximate distances between a set of objects
(species, genes, proteins, families) in a hierarchical
fashion
[Figure: example trees for taxa A to D with numeric branch lengths, labelling a branch (edge), a split (bipartition), and the root of the tree.]
• UPGMA clustering,
• Neighbour Joining
• Fitch-Margoliash
UPGMA
(Unweighted Pair Group Method with Arithmetic Mean)
• Maximum Parsimony
• Maximum Likelihood (ML)
Maximum Parsimony
(Fitch, 1977)
• Parsimony – carefulness in the use of resources.
• The basic underlying principle behind parsimony is
given by Occam’s Razor:
• “Given a choice between a hard and an easy way of
doing things, nature will always pick the easiest way, i.e.
simple is always preferred over complex.”
• Parsimony assumes that the relationship that requires
the fewest number of mutations to explain the current
state of sequences being considered is the relationship
that is most likely to be correct.
Concept of parsimony
• The concept of parsimony is at the heart of all
character based methods of phylogenetic
reconstruction.
• The 2 fundamental ideas of biological parsimony are:
– Mutations are exceedingly rare events ;
– The more unlikely events a model invokes, the less likely the
model is to be correct.
• As a result, the relationship that requires the fewest
number of mutations to explain the current state of
the sequences being considered, is the
relationship that is most likely to be
correct.
• The maximum parsimony algorithm searches
for the minimum number of genetic events
(nucleotide substitutions or amino acids
changes) to infer the most parsimonious tree
from a set of sequences.
• The best tree is one which needs fewest
changes.
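The idea of counting the fewest changes can be sketched in a few lines. The block below is a minimal illustration, not a full tree search: it scores three fixed candidate trees with the standard Fitch set-intersection counting rule and picks the one needing the fewest changes. The four-taxon alignment, the tree encoding (nested tuples), and all function names are assumptions made for this example.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for ONE alignment column on a fixed tree.

    tree   -- a taxon name (leaf) or a 2-tuple of subtrees (rooted, binary)
    states -- dict mapping taxon name -> character observed at this column
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):            # leaf: its observed state
            return {states[node]}
        left = state_set(node[0])
        right = state_set(node[1])
        common = left & right
        if common:                           # children can agree: no change needed
            return common
        changes += 1                         # children disagree: one change required
        return left | right

    state_set(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum the per-column change counts over the whole alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(length)
    )


# Hypothetical toy alignment (illustrative data only).
ALIGNMENT = {"A": "AAGT", "B": "AAGC", "C": "ACTC", "D": "ACTT"}

# The three possible groupings of four taxa, written as nested tuples.
TREES = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {name: parsimony_score(tree, ALIGNMENT) for name, tree in TREES.items()}
best_tree = min(scores, key=scores.get)      # tree needing the fewest changes
```

For this toy data the grouping ((A,B),(C,D)) needs 4 changes, against 6 and 5 for the other two, so it is the most parsimonious of the three. Real programs must also search the space of possible trees, which grows very quickly with the number of taxa.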
• Maximum Parsimony (positive points):
– Does not reduce sequence information to a single number
– Tries to provide information on the ancestral sequences
– Evaluates different trees
• Maximum Parsimony (negative points):
– Is slow in comparison with distance methods
– Does not use all the sequence information (only
informative sites are used)
– Does not correct for multiple mutations (does not imply a
model of evolution)
– Does not provide information on the branch lengths
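The remark that only informative sites are used can be made concrete. Under the usual definition, a column is parsimony-informative when at least two different character states each occur in at least two taxa. The sketch below applies that definition to a hypothetical alignment. The function name and the data are assumptions for illustration.

```python
from collections import Counter


def informative_sites(sequences):
    """Return the 0-based indices of parsimony-informative columns.

    A column qualifies when at least two distinct states are each
    present in two or more sequences.
    """
    sites = []
    for index, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        shared_states = [state for state, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            sites.append(index)
    return sites


# Hypothetical aligned sequences (illustrative data only).
SEQUENCES = ["AAGTA", "AAGCA", "ACTCG", "ACTTA"]
SITES = informative_sites(SEQUENCES)
```

Here column 0 is constant and column 4 varies in a single sequence only, so neither helps to choose between trees. Columns 1, 2 and 3 are the informative ones.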
Maximum likelihood
• This approach is purely statistical.
• Probabilities are considered for every individual
nucleotide substitution in a sequence alignment.
• Since transitions are observed roughly 3 times as often as
transversions, it can reasonably be argued that, at a site where
the sequences carry C, T and G, the sequences with C and T are
more likely to be closely related to each other than either is
to the sequence with G.
• Calculation of probabilities is complicated by the fact that
the sequence of the common ancestor of the sequences
considered is unknown.
• Furthermore, multiple substitutions may have occurred at
one or more sites, and all sites are not necessarily
independent or equivalent.
• Notes:
• 1. This is the best justified method from a
theoretical viewpoint;
• 2. ML estimates the branch lengths of the final
tree;
• 3. ML methods are usually consistent;
• 4. Sequence simulation experiments have shown
that this method works better than all others in
most cases.
• Drawback: ML methods need a long computation time to
construct a tree.
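The argument about C, T and G above rests on telling transitions (purine to purine, or pyrimidine to pyrimidine) apart from transversions (purine to pyrimidine or the reverse). The sketch below only counts the two kinds of difference between a pair of aligned sequences. It is not a likelihood calculation, and the function name and example sequences are assumptions for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1, seq2):
    """Count (transitions, transversions) between two aligned DNA sequences.

    Positions with identical bases, gaps or ambiguity codes are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1        # A<->G or C<->T
        else:
            transversions += 1      # purine <-> pyrimidine
    return transitions, transversions


# Hypothetical aligned pair (illustrative data only).
TS, TV = count_ts_tv("ACGTAC", "GAGTTT")
```

A substitution model that distinguishes the two classes would assign a C to T change a higher probability than a C to G change, which is how the observed bias enters a likelihood score.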
Advantages and disadvantages of character
based methods
• Advantages
– MP tries to provide information on the ancestral
sequences
– ML tends to outperform alternative methods such as
parsimony or distance methods even with very
short sequences
• Disadvantages
– Slow in comparison with distance methods
– MP does not use all the sequence information
– ML result is dependent on the model of evolution
used
Applications
• There is a wide array of applications of
phylogenetic analysis, which include:
– Evolution studies
– Medical research and epidemiology
– In ecology
– In criminal studies
– Finding orthologues and paralogues
Comparison of genetic sequence of organisms
Phylogenetic analysis tools-
• Phylip,
• ClustalW/X.
PHYLIP (Phylogeny Inference Package)
http://evolution.genetics.washington.edu/phylip.html
• The source code is distributed (in C), and executables are also
distributed.
• The data are read into the program from a text file, which the user
can prepare using any word processor or text editor (but it is
important that this text file not be in the special format of that
word processor -- it should instead be in "flat ASCII" or "Text Only"
format).
• Most of the programs look for the data in a file called "infile" -- if
they do not find this file they then ask the user to type in the file
name of the data file.
• Output is written to files with names like "outfile" and
"outtree".
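As an illustration of the plain-text input described above, the sketch below builds a sequence file in the classic sequential PHYLIP layout: a first line with the number of taxa and the number of sites, then one line per taxon with the name padded to 10 characters followed by the sequence. The helper names and the example taxa are assumptions for this example. Check the documentation of the specific PHYLIP program for the options it expects.

```python
def format_phylip(alignment):
    """Return the alignment as text in sequential PHYLIP layout.

    alignment -- dict mapping taxon name (at most 10 characters) to sequence
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    lines = [f"{len(alignment)} {n_sites}"]
    for name, seq in alignment.items():
        if len(name) > 10:
            raise ValueError(f"taxon name longer than 10 characters: {name}")
        lines.append(f"{name:<10}{seq}")     # name padded to exactly 10 columns
    return "\n".join(lines) + "\n"


def write_infile(alignment, path="infile"):
    """Write the alignment as plain ASCII text to the given path."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(format_phylip(alignment))


# Hypothetical two-taxon alignment (illustrative data only).
EXAMPLE = {"Alpha": "ACGT", "Beta": "ACGA"}
TEXT = format_phylip(EXAMPLE)
```

Calling write_infile(EXAMPLE) in the working directory would produce a file named "infile" that the programs look for by default.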