
Wellcome Trust Workshop

Working with Pathogen Genomes


Module 6 Phylogeny
Phylogeny

• Phylogeny refers to the ancestry of a biological lineage, but is also synonymous with phylogenetic tree

• Taxonomy began by grouping taxa together based on morphology at various structural levels

• Phylogeny is tree-like, or dichotomous

• Phylogeny provides the historical basis to the comparative method
Principle of phylogenetics

• Inferring relationships is about similarity.

• Homology describes similarity due to common inheritance from an ancestor. Homologous characters provide useful similarity.

• Homoplasy describes similarity due to independent acquisitions of the same or superficially similar character state. Homoplasious characters provide a misleading picture of phylogeny.

• Distance in a phylogenetic tree reflects a decreasing number of shared, homologous characters (assuming that evolution maximises homology).
Phylogenetic trees in biology

• Tool for understanding biological processes
– Examination of phylogeny to determine distance to characterized molecules
– Draw conclusions regarding biological functions not otherwise apparent
– Multiple alignments vs. pairwise homology

• Genomes are historical entities
– Their structure and function reflect the past
Applications to genome biology

• Gene family evolution
– orthology vs paralogy
– gene duplications and losses can be inferred through comparisons of ‘gene’ and ‘species’ trees
– the placement of a gene in the ‘wrong’ position within a phylogeny is used to support horizontal gene transfer.
• Microarray data analysis
– Comparative genome hybridization (CGH) distance matrix
• Phylogenomics
– gene order, gene content and concatenated sequences can be used to infer phylogeny
• Recombination
– tests for recombination and gene conversion use phylogenetic profiles to detect breakpoints
Building a phylogenetic tree

• Identify protein, DNA or RNA sequences of interest


– Fasta format file of concatenated
• Multiple sequence alignment
– ClustalX
• Construct phylogeny
– PHYML
• View and edit tree
– ATV
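
The first step above starts from sequences in FASTA format. As a minimal sketch, assuming plain FASTA text in which each record begins with a `>` header line, a reader could look like this. The function name `read_fasta` and the two example records (ungapped fragments of the globin sequences used in the ClustalX overview) are illustrative assumptions and not part of the workshop software.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect them piece by piece.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Illustrative input only: two short fragments, one wrapped over two lines.
example = """>Hbb_Human
PEEKSAVTALWGKVN
VDEVGG
>Hbb_Horse
GEEKAAVLALWDKVNEEEVGG
"""

sequences = read_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```

The resulting dictionary is the kind of input a multiple alignment program would then take; the alignment step itself is left to ClustalX.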
Overview of ClustalX Procedure (CLUSTAL W)

• Quick pairwise alignment: calculate distance matrix

Hbb_Human  1   -
Hbb_Horse  2  .17   -
Hba_Human  3  .59  .60   -
Hba_Horse  4  .59  .59  .13   -
Myg_Whale  5  .77  .77  .75  .75   -

• Neighbor-joining tree (guide tree)

[Figure: guide tree of Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale]

• Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

[Figure: alpha-helices are marked above the alignment]
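
To make the distance matrix step concrete, here is a minimal sketch that computes an uncorrected p-distance (the fraction of differing positions) for every pair of the five aligned fragments shown above. It assumes that columns with a gap in either sequence are skipped, which is one common convention and not necessarily what ClustalX does internally. Because these are short fragments, the values need not match the matrix on the slide, which was presumably computed from the full-length sequences.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing positions, ignoring columns with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else 0.0


# The five aligned fragments from the slide.
alignment = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

names = list(alignment)
matrix = [[p_distance(alignment[i], alignment[j]) for j in names] for i in names]

# Print the lower triangle, in the same layout as the slide.
for row, name in enumerate(names):
    values = " ".join(f"{matrix[row][col]:.2f}" for col in range(row))
    print(f"{name:<10} {values}")
```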
Creating multiple alignments

• Phylogeny is meaningless unless it is based on a well-constructed alignment
• Issues to consider
– Alignment parameters
  • Weight matrix parameters
  • Gap penalties
– Truncated sequences
– Non-homologous sequences
Multiple alignments: parameters
Multiple alignments: Gap penalties

High gap penalties

Default gap penalties

Low gap penalties
Multiple alignments: truncated sequences
Multiple alignments: non-homologous sequences
Constructing phylogenies
• Stages in constructing phylogenies:

1. Data scoring; producing genetic distances or character states (‘distance’ or ‘discrete’ data).
2. Tree sorting; processes for searching ‘tree-space’, e.g., hill-climbing or MCMC.
3. Estimation; identifying the most acceptable tree topology and model parameters using a variety of methods (‘clustering’ or ‘optimising’ methods)

• Phylogenetic methods:
• Algorithmic
• Neighbor-joining
• UPGMA
• Tree-searching
• Maximum parsimony
• Maximum likelihood
• Bayesian inference

• No one method is best for all circumstances


Neighbor Joining (NJ)
Principles:

• Tree topology and branch lengths are estimated from a genetic distance matrix.

   A      B      C      D      E      F      G      H      I
A  ·
B  0.001  ·
C  0.025  0.024  ·
D  0.003  0.002  0.019  ·
E  0.336  0.331  0.219  0.231  ·
F  0.021  0.019  0.001  0.018  0.233  ·
G  0.001  0.001  0.025  0.002  0.256  0.023  ·
H  0.056  0.044  0.005  0.042  0.132  0.051  0.043  ·
I  0.325  0.300  0.116  0.195  0.005  0.122  0.366  0.213  ·

Advantages:

• A single tree is estimated by minimising genetic distance, in a short time and with little computational expenditure.

Disadvantages:

• The method lacks accuracy because there is no attempt to correct for potential
bias (homoplasy).
• The method lacks precision because the outcome is partly contingent on the tree
with which the search process begins.
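
As a rough illustration of how a tree is built from a distance matrix, the sketch below follows the standard neighbor-joining recipe: pick the pair with the lowest Q value, attach both to a new internal node, recompute distances to that node, and repeat. The function name, the `node1`, `node2` labels and the small five-taxon matrix are assumptions made for the example; the matrix is a hypothetical additive one chosen so the result is easy to check by hand, not the matrix from the slide.

```python
def neighbor_joining(names, dist):
    """Return (joins, final_edge) for a symmetric distance matrix.

    joins is a list of (node_a, node_b, length_a, length_b, new_node).
    final_edge is (node_a, node_b, length) for the last two nodes.
    """
    # Copy into a dict of dicts so the input matrix is left untouched.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    joins = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        totals = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair minimising Q(a, b) = (n - 2) d(a, b) - total(a) - total(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - totals[a] - totals[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from a and b to the new internal node.
        length_a = 0.5 * d[a][b] + (totals[a] - totals[b]) / (2 * (n - 2))
        length_b = d[a][b] - length_a
        counter += 1
        new_node = f"node{counter}"
        # Distances from the new node to every remaining node.
        d[new_node] = {new_node: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new_node][c] = d[c][new_node] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new_node]
        joins.append((a, b, length_a, length_b, new_node))
    a, b = nodes
    return joins, (a, b, d[a][b])


# Hypothetical additive distances between five taxa (illustrative only).
taxa = ["a", "b", "c", "d", "e"]
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

joins, final_edge = neighbor_joining(taxa, distances)
for step in joins:
    print(step)
print(final_edge)
```

In this sketch ties between equally good pairs are broken by taking the first pair encountered, so the input order of the taxa can affect which pair is joined first.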
Maximum parsimony (MP)
Principles:
• Searches through tree topologies in ‘tree-space’ using a ‘hill-climbing’ algorithm.
• Scores trees on their ‘length’, i.e., the number of character state changes required to
explain the distribution of characters on a given tree topology.
• Looks for the tree with the minimum number of changes, i.e. the topology with the fewest
character changes overall.

Advantages:
• Generally accurate method with few assumptions.
• Phylogenetic hypotheses can be statistically tested by comparing the lengths of different
trees.
• Tree estimation is relatively fast and undemanding.

Disadvantages:
• There are typically several shortest trees, resulting in a potentially ambiguous consensus
topology.
• There is no explicit model of evolution and so the method is prone to error under certain
circumstances, e.g., long-branch attraction (homoplasy).
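
Tree ‘length’ can be illustrated with Fitch's algorithm, a standard way of counting the minimum number of state changes a character requires on a given binary tree. The sketch below scores the three possible groupings of four taxa; the nested-tuple tree representation, the function names and the four short sequences are assumptions made for the example.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one character.

    tree is either a taxon name (leaf) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The two children can share a state, so no extra change is needed.
        return common, left_cost + right_cost
    # Otherwise one change is required at this node.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


# Hypothetical four-taxon alignment (illustrative only).
alignment = {"A": "ACGT", "B": "ACGA", "C": "TCAA", "D": "TCAT"}

candidate_trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

lengths = {label: tree_length(tree, alignment)
           for label, tree in candidate_trees.items()}
shortest = min(lengths, key=lengths.get)
print(lengths, shortest)
```

With only four taxa every topology can be scored; for larger data sets the number of trees grows too quickly, which is why a heuristic search through tree-space is used instead.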
Maximum likelihood (ML)
Principles:

• Looks for the tree that, under a given model of evolution, maximizes the likelihood of the
observed data

• Applies a complex model of DNA or protein sequence evolution that estimates parameters for
specific substitutions and other qualities of molecular sequences

• Locates the most likely tree topology through a hill-climbing algorithm

• Various models accommodate sources of molecular homoplasy that might result in the wrong tree:
• ‘Multiple hits’ (substitutional saturation)
• Rate convergence
• Rate heterogeneity
• Base composition bias
• Codon usage bias
• Secondary structure
• Covariance
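
The simplest case of maximising a likelihood under a substitution model is estimating the distance between two sequences. The sketch below assumes the Jukes-Cantor (JC69) model, which is not named on the slide and is used here only because it is the simplest DNA model. It climbs the likelihood surface with a crude step-halving hill climb and compares the result with the closed-form JC69 distance, which also shows how a model corrects for ‘multiple hits’: the estimated distance is larger than the observed proportion of differences. The counts (30 differences in 100 sites) and all function names are hypothetical.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d (substitutions per site) under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # The factor 0.25 is the equilibrium frequency of the starting base.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def jc69_distance(p):
    """Closed-form JC69 correction of an observed proportion of differences."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def hill_climb(n_sites, n_diff, start=0.5, step=0.1, tolerance=1e-6):
    """Move d uphill in fixed steps, halving the step when neither side improves."""
    d = start
    while step > tolerance:
        current = jc69_log_likelihood(d, n_sites, n_diff)
        moved = False
        for candidate in (d + step, d - step):
            if candidate > 0 and jc69_log_likelihood(candidate, n_sites, n_diff) > current:
                d = candidate
                moved = True
                break
        if not moved:
            step /= 2.0
    return d


# Hypothetical data: 30 differences observed in 100 aligned sites.
observed_p = 30 / 100
ml_estimate = hill_climb(n_sites=100, n_diff=30)
closed_form = jc69_distance(observed_p)
print(observed_p, round(ml_estimate, 4), round(closed_form, 4))
```

This is a one-parameter toy. A program such as PHYML optimises topology, branch lengths and model parameters together, which is a far larger search.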
Maximum Likelihood
Advantages:

• Highly accurate because considerable biological realism is introduced through the substitutional model. This allows various forms of homoplasy to be corrected for.

• Phylogenetic estimation within the likelihood framework provides a robust statistical context
in which to evaluate specific hypotheses.

• A single tree is produced that is generally precise.

Disadvantages:

• The complexity of the estimation process means that it is slow and computationally demanding.

• The hill-climbing algorithm is susceptible to local optima and so is not guaranteed to return the optimal solution.
Bootstrapping a tree
• Statistical estimate of the reliability of groupings
• Sites (columns) in the alignment are resampled with replacement to produce pseudo-replicate alignments, which are used to generate trees
• Process is iterated multiple times (100-1000 times)
• Agreement among the resulting trees is summarized with a majority-rule consensus tree
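
The resampling step can be sketched as follows, assuming the usual bootstrap in which alignment columns are drawn with replacement until the replicate has the same length as the original. The tree-building step is replaced by a deliberately simple stand-in, `closest_pair`, which reports only the two most similar sequences as a grouping; it, the other function names and the four-taxon alignment are hypothetical and serve only to show how support values are counted.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: return the most similar pair as one grouping."""
    def differences(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    pair = min(combinations(sorted(alignment), 2), key=differences)
    return [frozenset(pair)]


def bootstrap_support(alignment, build_groups, n_replicates=100, seed=0):
    """Fraction of replicates in which each grouping is recovered."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate = bootstrap_alignment(alignment, rng)
        for group in build_groups(replicate):
            counts[group] = counts.get(group, 0) + 1
    return {group: count / n_replicates for group, count in counts.items()}


# Hypothetical alignment (illustrative only).
alignment = {"A": "ACGTACGTAC", "B": "ACGAACGTAC",
             "C": "TCAATCGAAC", "D": "TCATTCGAAT"}

support = bootstrap_support(alignment, closest_pair, n_replicates=200, seed=1)
for group, value in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(group), value)
```

In a real analysis `build_groups` would run a full tree reconstruction on each replicate and return every clade in the resulting tree.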
Bayesian
Principles:

• Based on the notion of posterior probabilities: probabilities that are estimated, based on
some model (prior expectations), after learning something about the data.
• Uses an MCMC process to search through tree-space.
• Selects the tree-topology with the highest probability, given the data.

Advantages:

• Intuitive
• Potential for any complex model.
• Provides both parameter estimates (i.e., tree) and their probabilities in a single analysis.
• Many different hypotheses can be evaluated in a single analysis.
• The MCMC algorithm makes integrating over all parameter values fast and accurate;
MCMCs are able to break out of local optima.
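
To show what an MCMC search does in miniature, the sketch below runs a Metropolis sampler on a single parameter, the distance between two sequences, instead of on whole trees. It assumes a Jukes-Cantor (JC69) likelihood and an exponential prior on the distance; both choices, the data (30 differences in 100 sites), the tuning values and the function names are hypothetical and chosen only to keep the example small.

```python
import math
import random


def log_posterior(d, n_sites, n_diff, prior_rate=1.0):
    """Unnormalised log posterior of distance d: JC69 likelihood times exponential prior."""
    if d <= 0:
        return -math.inf
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e
    p_diff = 0.25 - 0.25 * e
    if p_diff <= 0:
        return -math.inf
    log_likelihood = (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)
    return log_likelihood - prior_rate * d


def metropolis(n_sites, n_diff, n_steps=20000, burn_in=2000, proposal_sd=0.05, seed=0):
    """Sample d from its posterior with a symmetric Gaussian random-walk proposal."""
    rng = random.Random(seed)
    d = 0.1
    current = log_posterior(d, n_sites, n_diff)
    samples = []
    for step in range(n_steps):
        proposal = d + rng.gauss(0.0, proposal_sd)
        candidate = log_posterior(proposal, n_sites, n_diff)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            d, current = proposal, candidate
        if step >= burn_in:
            samples.append(d)  # keep samples only after the burn-in period
    return samples


samples = metropolis(n_sites=100, n_diff=30)
posterior_mean = sum(samples) / len(samples)
print(len(samples), round(posterior_mean, 3))
```

Because downhill moves are sometimes accepted, the chain can leave a local optimum, which a strict hill climb cannot. Programs such as MrBayes apply the same idea to tree topologies, branch lengths and model parameters together.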
Bayesian
Disadvantages:

• An evolutionary model must be specified a priori, in the form of prior probabilities (‘priors’). Is there sufficient knowledge of these probabilities?

• The MCMC must be run long enough for variation in the parameter estimates to smooth out or reach ‘convergence’. The time required is never certain.

• Posterior probabilities describe the absolute probability of particular nodes and branch lengths; these can be overestimated.

[Figure: Bayesian inference (BI) tree of isolates Tb93 1 to Tb93 14; scale bar 0.1]
Remember

All trees are wrong


Cladograms and phylograms

• Cladograms show branching order; branch lengths are meaningless
• Phylograms show branch order and branch lengths

[Figure: the same tree of Bacterium 1-3 and Eukaryote 1-4 drawn as a cladogram and as a phylogram]
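
In a tree file the difference comes down to whether branch lengths are stored. As a minimal sketch, assuming the tree is written in Newick format (a common format for tree files, although the slides do not name one), removing the `:length` annotations turns a phylogram description into a cladogram description. The tree string and its branch lengths are made up for illustration.

```python
import re


def strip_branch_lengths(newick):
    """Remove ':<number>' branch-length annotations from a Newick string."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


# Hypothetical phylogram with invented branch lengths.
phylogram = "((Bacterium_1:0.10,Bacterium_2:0.20):0.05,(Eukaryote_1:0.30,Eukaryote_2:0.10):0.40);"
cladogram = strip_branch_lengths(phylogram)
print(cladogram)
```

The branching order is untouched, so both strings describe the same topology.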
Rooting using an outgroup

• The root defines common ancestry

[Figure: an unrooted tree of archaea and eukaryotes with a bacteria outgroup, and the same tree rooted by the outgroup, in which the archaea and the eukaryotes each form a monophyletic group]
Further details
Textbooks:

Page & Holmes Molecular Evolution: A Phylogenetic Approach. Blackwell Science.

Felsenstein Inferring Phylogenies. Sinauer Associates.

Hall Phylogenetic trees made easy. Sinauer Associates.

Software:
PhyML (ML): http://atgc.lirmm.fr/phyml/
PAUP* (NJ, MP, ML): http://paup.csit.fdsu.edu
PHYLIP (NJ, MP, ML): http://evolution.genetics.washington.edu/phylip.html
MrBayes (Bayesian): http://mrbayes.csit.fdsu.edu
Splitstree (Networks): http://www.splitstree.org
FindModel (Model Test): http://www.hiv.lanl.gov/content/sequence/findmodel/findmodel.html

Websites:
MultiPhyl (ML via email) http://distributed.cs.nuim.ie/multiphyl.php

Felsenstein’s Phylogeny program page (links to available software): http://evolution.genetics.washington.edu/phylip/software.html
