
BTC 506

PHYLOGENETIC ANALYSIS
• A phylogenetic tree, also known as a
phylogeny, is a diagram that depicts the lines
of evolutionary descent of different species,
organisms, or genes from a common ancestor.

– Attempt to reconstruct evolutionary ancestors

– Estimate time of divergence from ancestor


What is phylogenetic analysis and why
should we perform it?

Phylogenetic analysis has two major components:

1. Phylogeny inference or “tree building”:
the inference of the branching orders, and
ultimately the evolutionary relationships,
between “taxa” (entities such as genes,
populations, species, etc.)
2. Character and rate analysis:
using phylogenies as analytical frameworks
for rigorous understanding of the evolution of
various traits or conditions of interest
Common Phylogenetic Tree Terminology

[Figure: a rooted tree with five taxa, A to E, labelled as follows.]
• Terminal nodes: represent the taxa (genes, populations, species, etc.) used to infer the phylogeny
• Branches or lineages
• Internal nodes or divergence points: represent hypothetical ancestors of the taxa
• Ancestral node or root of the tree
Three types of trees

[Figure: the same four taxa (A, B, C, D) drawn as a cladogram, a phylogram and an ultrametric tree.]
• Cladogram: branch lengths have no meaning
• Phylogram: branch lengths represent genetic change
• Ultrametric tree: branch lengths represent time

All show the same evolutionary relationships, or branching orders, between the taxa.
• Phylogenetic analysis can be used to solve a
number of interesting problems
– Forensics
• HIV mutates rapidly
– Predicting evolution of influenza viruses
– Predicting functions of uncharacterized genes
(orthologue detection)
– Drug discovery
– Vaccine development
• Target inferred common ancestor
Objectives
• Evolution,
• Elements of phylogeny,
• Methods of phylogenetic analysis,
• Phylogenetic tree of life,
• Comparison of genetic sequence
of organisms,
• Phylogenetic analysis tools-
– Phylip,
– ClustalW.
Evolution
• Speciation
– Evolution of new organisms is driven by
• Mutations
– The DNA sequence can be changed due to single base changes,
deletion/insertion of DNA segments, etc.
• Selection bias
– Speciation events lead to creation of different
species.
– Speciation caused by physical separation into
groups where
different genetic variants become dominant
• Any two species share a (possibly distant) common
ancestor
• The molecular clock hypothesis
Elements of phylogeny
A phylogenetic tree
• A phylogenetic tree is a graph reflecting the
approximate distances between a set of objects
(species, genes, proteins, families) in a hierarchical
fashion

• Leaves: current species; sequences in current species
• Internal nodes: hypothetical common ancestors
• Branch (edge) lengths: the “time” from one speciation to the next (branching represents
speciation into two new species)
[Figure: Example of a rooted tree for taxa A to D with branch lengths, labelling the root of the tree, interior nodes (vertices), terminal nodes (leaves), branches (edges) and a split (bipartition).]

Fig. Example of an Unrooted Tree for 4 Taxa

Fig. Terms Used in representing a phylogenetic Tree (taxa A to E): terminal node (leaf), interior node (vertex), split (bipartition), branch (edge), root of tree


Methods of phylogenetic analysis
Methods for analysing phylogenetic trees

Distance Methods
• Also called phenetic
• Trees are constructed by similarity of sequences.
• Tree is called a dendrogram.
• Does not necessarily reflect evolutionary relationship.
• E.g.
– UPGMA clustering,
– Neighbour Joining
– Fitch-Margoliash

Character Methods
• Also called cladistic
• Trees are calculated by considering various possible pathways of evolution.
• Based on parsimony or likelihood methods
• Tree is called a cladogram.
• Use each alignment position as evolutionary information to build a tree.
• E.g.
– Maximum parsimony
– Maximum likelihood
– Bayesian
Distance Methods

• UPGMA clustering,
• Neighbour Joining
• Fitch-Margoliash
UPGMA
(Unweighted Pair Group Method with Arithmetic mean)

• Originally developed for numerical taxonomy in 1958 by
Sokal and Michener.
• The oldest distance method.
• Uses a sequential clustering algorithm.
• Produces rooted trees.
• Assumes that the tree is ultrametric, i.e. that the rate of
substitution is constant across all branches of the tree
(a molecular clock).
This method follows a clustering procedure:
(1) Assume that initially each species is a cluster
on its own.
(2) Join the two closest clusters and recalculate the
distance from the joined pair to every other cluster
by taking the average.
(3) Repeat this process until all species are
connected in a single cluster.
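
The three steps above can be sketched in a few lines of Python. This is a minimal illustration and not the code of any particular package: the function name upgma, the nested-tuple tree representation and the example distances are all assumptions made for this sketch. The average in step (2) is taken over all pairs of original taxa, which is why cluster sizes are tracked.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa with UPGMA; return (tree as nested tuples, node heights).

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between taxa a and b
    """
    # Step 1: every taxon starts as its own cluster of size 1.
    clusters = {name: 1 for name in names}
    d = dict(dist)
    heights = []
    while len(clusters) > 1:
        # Step 2: find and join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Under the ultrametric assumption the new node sits at half the distance.
        heights.append(d[frozenset((a, b))] / 2)
        # Distance to every other cluster: size-weighted average of the old ones.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    # Step 3: a single cluster remains; it is the rooted tree.
    (tree,) = clusters
    return tree, heights

# Hypothetical distances between four taxa (illustration only).
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(p): v for p, v in pairs.items()}
tree, heights = upgma(["A", "B", "C", "D"], dist)
print(tree, heights)
```

Here A and B are joined first, then C, then D, so the nesting of the returned tuples mirrors the order of clustering.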
• Advantages
– Fast
– Can handle many sequences
• Disadvantages
– Gives unreliable trees when rates of substitution are
unequal among lineages
– Does not itself account for multiple substitutions at a site.
NEIGHBOUR JOINING METHOD
• Neighbor-joining methods apply general data
clustering techniques to sequence analysis
using genetic distance as a clustering metric.
• Developed in 1987 by Saitou and Nei.
• The simple neighbor-joining method produces
unrooted trees and does not assume a
constant rate of evolution (i.e., a molecular
clock) across lineages.
• It begins with an unresolved star-like tree.
• Each pair of taxa is evaluated for being joined, and the
sum of all branch lengths of the resulting tree is
calculated.
• The pair that yields the smallest sum is considered
the closest neighbors and is joined.
• A new branch is inserted between the joined pair and the
rest of the tree, and the branch lengths are
recalculated.
• This process is repeated until the tree is fully
resolved, i.e. only two nodes remain to be connected.
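
The joining loop can be sketched as follows. This sketch assumes the widely used Q-criterion form of neighbor joining, in which the pair minimising Q(i, j) = (n - 2) d(i, j) - r(i) - r(j) is the pair that gives the smallest total branch length; the function name, the edge-list output and the internal node labels (node0, node1, ...) are illustrative choices, and the five-taxon distance matrix is a made-up additive example.

```python
def neighbor_joining(names, dist):
    """Return the edges (node, new_node, branch_length) of an unrooted NJ tree.

    names: list of taxon labels
    dist:  dict mapping (a, b) tuples -> distance between taxa a and b
    """
    d = {frozenset(k): v for k, v in dist.items()}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: total distance from node i to all other current nodes.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Pick the pair with the smallest Q value.
        i, j = min(
            ((p, q) for x, p in enumerate(nodes) for q in nodes[x + 1:]),
            key=lambda pq: (n - 2) * d[frozenset(pq)] - r[pq[0]] - r[pq[1]],
        )
        dij = d[frozenset((i, j))]
        u = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        # Branch lengths from the new internal node to the joined pair.
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, u, li))
        edges.append((j, u, dij - li))
        # Distances from the new node to each remaining node.
        nodes = [k for k in nodes if k not in (i, j)]
        for k in nodes:
            d[frozenset((u, k))] = (
                d[frozenset((i, k))] + d[frozenset((j, k))] - dij
            ) / 2
        nodes.append(u)
    # Connect the last two nodes.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

# Made-up additive distances between five taxa (illustration only).
dist = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
        ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
        ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
edges = neighbor_joining(["a", "b", "c", "d", "e"], dist)
for edge in edges:
    print(edge)
```

Unlike the UPGMA procedure, the two branches leading to a joined pair may have different lengths, which reflects the absence of a molecular clock assumption.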
DRAWBACKS
• It produces only one tree and neglects
other possible trees, which might be as good
as the NJ tree, if not significantly better.
• Moreover, since errors in distance estimates
grow exponentially for longer distances,
under some conditions this method will yield a
biased tree.
FITCH – MARGOLIASH METHOD
• Proposed in 1967
• Produces unrooted trees
• Defines a criterion for fitting trees to distance matrices
• Uses a weighted least squares method for
clustering based on genetic distance.
• Closely related sequences are given more weight
in the tree construction process to correct for the
increased inaccuracy in measuring distances
between distantly related sequences.
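
The weighting idea can be illustrated with a small scoring function. The sketch below assumes the usual form of the criterion, the sum over all pairs of (D - d)^2 / D^2, where D is the observed distance and d is the distance along the tree; the function name and the numbers are hypothetical. Dividing by D^2 is what gives closely related pairs more weight.

```python
def fitch_margoliash_score(observed, tree_dist):
    """Weighted least-squares misfit between observed and tree distances.

    observed:  dict mapping a pair of taxa -> observed distance D
    tree_dist: dict mapping the same pairs -> path length d on the tree
    """
    return sum(
        (D - tree_dist[pair]) ** 2 / D ** 2
        for pair, D in observed.items()
    )

# Hypothetical observed distances and the path lengths of one candidate tree.
observed = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
candidate = {("A", "B"): 2.0, ("A", "C"): 5.0, ("B", "C"): 3.0}
score = fitch_margoliash_score(observed, candidate)
print(score)
```

A tree search would compare this score across candidate topologies and branch lengths and keep the tree with the smallest value.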
Character Based methods

• Maximum Parsimony
• Maximum Likelihood(ML)
Maximum Parsimony
(Fitch, 1977)
• Parsimony – carefulness in the use of resources.
• The basic underlying principle behind parsimony is
given by Occam’s Razor:
• “Given a choice between a hard and an easy way of
doing things, nature will always pick the easiest way, i.e.
simple is always preferred over complex.”
• Parsimony assumes that the relationship that requires
the fewest number of mutations to explain the current
state of sequences being considered is the relationship
that is most likely to be correct.
Concept of parsimony
• The concept of parsimony is at the heart of many
character-based methods of phylogenetic
reconstruction.
• The 2 fundamental ideas of biological parsimony are:
– Mutations are exceedingly rare events ;
– The more unlikely events a model invokes, the less likely the
model is to be correct.
• As a result, the relationship that requires the fewest
number of mutations to explain the current state of
the sequences being considered, is the
relationship that is most likely to be
correct.
• The maximum parsimony algorithm searches
for the minimum number of genetic events
(nucleotide substitutions or amino acids
changes) to infer the most parsimonious tree
from a set of sequences.
• The best tree is one which needs fewest
changes.
• Maximum Parsimony (positive points):
– Does not reduce sequence information to a single number
– Tries to provide information on the ancestral sequences
– Evaluates different trees
• Maximum Parsimony (negative points):
– Is slow in comparison with distance methods
– Does not use all the sequence information (only
informative sites are used)
– Does not correct for multiple mutations (does not imply a
model of evolution)
– Does not provide information on the branch lengths
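
Counting the changes that a given tree needs can be sketched with Fitch's set-based procedure for a fixed rooted binary tree. This is an illustrative sketch only: the function names, the nested-tuple tree encoding and the four short sequences are assumptions, and a real parsimony program would also search over many more topologies.

```python
def fitch_site_changes(tree, states):
    """Minimum number of changes at one site on a fixed rooted binary tree.

    tree:   nested 2-tuples with leaf names at the tips, e.g. (("A", "B"), "C")
    states: dict mapping leaf name -> character observed at this site
    """
    def visit(node):
        if isinstance(node, str):
            # Leaf: the state set is just the observed character.
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # The children can share a state, so no extra change is needed.
            return common, left_cost + right_cost
        # No shared state: one change must have happened below this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]

def parsimony_length(tree, alignment):
    """Total number of changes over all sites of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )

# Hypothetical aligned sequences for four taxa (illustration only).
alignment = {"A": "GTCA", "B": "GTCA", "C": "GACG", "D": "GACA"}
candidates = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
lengths = [parsimony_length(t, alignment) for t in candidates]
best = candidates[lengths.index(min(lengths))]
print(lengths, best)
```

In this example only the second site distinguishes the trees, which matches the point that only informative sites contribute to the choice of tree.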
Maximum likelihood
• This approach is purely statistical.
• Probabilities are considered for every individual
nucleotide substitution in a sequence alignment.
• Since transitions are observed roughly 3 times as often as
transversions, it can reasonably be argued that, at a site where
the sequences carry C, T and G, the sequences with C and T are more
likely to be closely related to each other than either is to the
sequence with G.
• Calculation of probabilities is complicated by the fact that
the sequence of the common ancestor of the sequences
considered is unknown.
• Furthermore, multiple substitutions may have occurred at
one or more sites, and not all sites are necessarily
independent or equivalent.
• Notes:
– 1. This is the best-justified method from a
theoretical viewpoint;
– 2. ML estimates the branch lengths of the final
tree;
– 3. ML methods are usually consistent;
– 4. Sequence simulation experiments have shown
that this method works better than all others in
most cases.
• Drawback: ML methods need a long computation time to
construct a tree.
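
To make the idea of a per-site probability concrete, the sketch below computes the likelihood of two aligned sequences as a function of their distance and picks the distance that maximises it. It assumes the Jukes-Cantor (JC69) model, which treats all substitutions as equally likely and therefore ignores the transition/transversion difference mentioned above; it is used here only because it is the simplest model. The function names, the two sequences and the search grid are illustrative.

```python
import math

def jc69_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences d substitutions per site apart."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    total = 0.0
    for x, y in zip(seq1, seq2):
        # 0.25 is the assumed equilibrium frequency of the base in seq1.
        total += math.log(0.25 * (p_same if x == y else p_diff))
    return total

def jc69_distance(seq1, seq2):
    """Closed-form JC69 distance, which corrects for multiple substitutions."""
    p = sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Two hypothetical aligned sequences that differ at 4 of 20 sites.
seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "GCGTATGTACATACGCACGT"

# Crude grid search for the distance with the highest likelihood.
grid = [k / 1000 for k in range(1, 2001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(seq1, seq2, d))
print(best_d, jc69_distance(seq1, seq2))
```

The grid maximum agrees with the closed-form distance up to the grid spacing. A full ML tree search repeats this kind of calculation over many branches and topologies, which is one reason for the long computation times.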
Advantages and disadvantages of character
based methods
• Advantages
– MP tries to provide information on the ancestral
sequences
– ML tends to outperform alternative methods such as
parsimony or distance methods even with very
short sequences
• Disadvantages
– Slow in comparison with distance methods
– MP does not use all the sequence information
– ML result is dependent on the model of evolution
used
Applications
• There is a wide array of applications of
phylogenetic analysis, which include:
– Evolution studies
– Medical research and epidemiology
– Ecology
– Criminal (forensic) studies
– Finding orthologues and paralogues
Comparison of genetic sequence of organisms
Phylogenetic analysis tools-

• Phylip,
• ClustalW/X.
PHYLIP (Phylogeny Inference Package)
http://evolution.genetics.washington.edu/phylip.html

• Available free for Windows/MacOS/Linux
systems
• Parsimony, distance matrix and likelihood
methods (bootstrapping and consensus trees)
• Data can be molecular sequences, gene
frequencies, restriction sites and fragments,
distance matrices and discrete characters
• PHYLIP (the PHYLogeny Inference Package) is a package of programs for
inferring phylogenies (evolutionary trees).
• It is available free over the Internet, and written to work on as many
different kinds of computer systems as possible.
• The source code is distributed (in C), and executables are also
distributed.
• In particular, already-compiled executables are available for Windows
(95/98/NT/2000/me/xp/Vista), Mac OS X, and Linux systems.
• Older executables are also available for Mac OS 8 or 9 systems.
• Complete documentation is available in documentation files that come
with the package.
The Phylip Manual
• The PHYLIP manual is an excellent source of information.
• It includes brief one-line descriptions of the programs.
• The easiest way to run PHYLIP programs is via a
command line menu (similar to clustalw).
• A program is invoked by clicking on its icon, or by
typing the program name at the command line:
    > protdist
    > neighbor
• If there is no file called infile, the program responds with:
    [gogarten@carrot gogarten]$ seqboot
    seqboot: can't find input file "infile"
    Please enter a new file name>

Methods
• Methods that are available in the package include
– parsimony,
– distance matrix, and
– likelihood methods, including
• bootstrapping and
• consensus trees.

• Data types that can be handled include
– molecular sequences,
– gene frequencies,
– restriction sites and fragments,
– distance matrices, and
– discrete characters.
Programs
• The programs are controlled through a menu, which asks the users
which options they want to set, and allows them to start the
computation.
• The data are read into the program from a text file, which the user
can prepare using any word processor or text editor. It is
important that this text file not be in the special format of that
word processor; it should instead be in "flat ASCII" or "Text Only"
format.
• Some sequence analysis programs, such as the ClustalW alignment
program, can write data files in the PHYLIP format.
• Most of the programs look for the data in a file called "infile". If
they do not find this file, they ask the user to type in the file
name of the data file.
• Output is written to special files with names like "outfile" and
"outtree".
• Trees written to "outtree" are in the Newick format, an informal
standard agreed to in 1986 by authors of a number of major
phylogeny packages.
• At this stage PHYLIP does not have a mouse-and-windows (graphical)
interface.

• PHYLIP is probably the most widely-distributed phylogeny package.
• It is the sixth most frequently cited phylogeny package,
after MrBayes, PAUP*, RAxML, PhyML, and MEGA.
• PHYLIP is also the oldest widely-distributed package.
• It has been in distribution since October, 1980, and has
over 30,000 registered users.
• It is still being updated.
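
The input and output conventions described above (a plain-text "infile" in PHYLIP format and trees in Newick format) can be illustrated with two small helpers. This is a sketch under stated assumptions: the function names are made up, the layout shown is the simple sequential PHYLIP layout with names padded to 10 characters, and the Newick helper writes topology only, without branch lengths.

```python
def format_phylip(alignment):
    """Return aligned sequences as text in sequential PHYLIP format.

    The first line holds the number of sequences and the alignment length.
    Each following line holds a name padded to 10 characters, then the sequence.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f"{len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"

def to_newick(tree):
    """Serialise nested tuples of leaf names as a Newick string."""
    def fmt(node):
        if isinstance(node, str):
            return node
        return "(" + ",".join(fmt(child) for child in node) + ")"
    return fmt(tree) + ";"

# Hypothetical data for illustration.
text = format_phylip({"Human": "ACGT", "Chimp": "ACGA"})
newick = to_newick((("A", "B"), ("C", "D")))
print(text)
print(newick)
```

Saving the returned text as a plain ASCII file named infile in the program's working directory is what lets a PHYLIP program find it without prompting for a file name.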


[Figures: the PHYLIP program folder and the PHYLIP menu interface]
CLUSTALW
• www.ebi.ac.uk/clustalw/

• Clustal is a widely used progressive multiple sequence
alignment (MSA) program, available either as a
stand-alone or as an online program.
• The latest version is 2.0. There are two
main variations:
– ClustalW: command line interface
– ClustalX: graphical user interface
• It is available for Windows, Mac OS, and
Unix/Linux.
• The program is available from the Clustal
homepage or the European Bioinformatics Institute
FTP server.
There are three main steps:
• Do pairwise alignments of all the sequences
• Create a phylogenetic (guide) tree, or use a user-defined tree
• Use the tree to carry out the
multiple alignment
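
The first of these steps can be sketched as follows. This is a simplified illustration, not ClustalW's own procedure: it assumes a basic Needleman-Wunsch global alignment with a fixed match score, mismatch score and linear gap penalty (ClustalW uses substitution matrices and more elaborate gap penalties), and the function names and sequences are hypothetical. The resulting pairwise distances are the kind of input from which a guide tree can be built in step two.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def identity_distance(a, b):
    """1 minus the fraction of identical columns in the pairwise alignment."""
    x, y = global_align(a, b)
    matches = sum(p == q for p, q in zip(x, y))
    return 1 - matches / len(x)

# Hypothetical sequences; every pair is aligned once.
seqs = {"s1": "GATTACA", "s2": "GATACA", "s3": "GCTTACA"}
names = list(seqs)
matrix = {(p, q): identity_distance(seqs[p], seqs[q])
          for i, p in enumerate(names) for q in names[i + 1:]}
print(matrix)
```

The number of pairwise alignments grows with the square of the number of sequences, which is why this first step is a noticeable part of the running time for large data sets.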
