Working With Pathogen Genomes: Module 6 Phylogeny

Wellcome Trust Workshop
Working with Pathogen Genomes

Module 6 Phylogeny
Phylogeny
• Phylogeny refers to the ancestry of a biological lineage, but is also synonymous with phylogenetic tree
• Taxonomy began by grouping taxa together based on morphology at various structural levels
• Phylogeny is tree-like, or dichotomous
• Phylogeny provides the historical basis to the comparative method
Principle of phylogenetics
• Inferring relationships is about similarity.
• Homology describes similarity due to common inheritance from an ancestor. Homologous characters are the informative kind of similarity.
• Homoplasy describes similarity due to independent acquisitions of the same or superficially similar character state. Homoplasious characters provide a misleading picture of phylogeny.
• Distance in a phylogenetic tree reflects a decreasing number of shared, homologous characters (assuming that evolution maximises homology).
Phylogenetic trees in biology
• Tool for understanding biological processes
– Examination of phylogeny to determine distance to characterized molecules
– Draw conclusions regarding biological functions not otherwise apparent
– Multiple alignments vs. pairwise homology
• Genomes are historical entities
– Their structure and function reflect the past
Applications to genome biology
• Gene family evolution

– orthology vs paralogy
– gene duplications and losses can be inferred through comparisons of
‘gene’ and ‘species’ trees
– the placement of a gene in the ‘wrong’ position within a phylogeny is used
to support horizontal gene transfer.
• Microarray data analysis
– Comparative genome hybridization (CGH) distance matrix
• Phylogenomics
– gene order, gene content and concatenated sequences can be used to
infer phylogeny
• Recombination
– tests for recombination and gene conversion use phylogenetic profiles to
detect breakpoints
Building a phylogenetic tree
• Identify protein, DNA or RNA sequences of interest
– FASTA format file of concatenated
• Multiple sequence alignment
– ClustalX
• Construct phylogeny
– PHYML
• View and edit tree
– ATV
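The first step above hands the sequences to the alignment program as a FASTA file: each record is a header line starting with ">" followed by one or more lines of sequence. As a minimal sketch of reading such a file before alignment, the function name `parse_fasta` and the example records are assumptions for illustration only and are not part of the course material.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence name: sequence}.

    The name is taken as the first whitespace-delimited word of each header.
    """
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a name")
            name = fields[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record input, for illustration only.
EXAMPLE = """>seq1 example protein
PEEKSAVTALW
GKVN
>seq2
GEEKAAVLALWDKVN
"""

for seq_name, seq in parse_fasta(EXAMPLE).items():
    print(seq_name, len(seq), seq)
```

Checking names and lengths at this point is a cheap way to catch truncated or duplicated records before they reach the alignment step.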
Overview of ClustalX Procedure
CLUSTAL W

• Quick pairwise alignment: calculate distance matrix

Hbb_Human  1   -
Hbb_Horse  2  .17   -
Hba_Human  3  .59  .60   -
Hba_Horse  4  .59  .59  .13   -
Myg_Whale  5  .77  .77  .75  .75   -

• Neighbor-joining tree (guide tree)

[Figure: guide tree of Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale]

• Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ

[Figure: alpha-helices marked on the alignment]
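To make the "calculate distance matrix" step concrete, the sketch below computes a simple p-distance (the fraction of differing sites) for every pair of rows in the alignment fragment shown on the slide. This is an illustration under assumptions: the function names are hypothetical, columns with a gap in either sequence are skipped, and the values will not reproduce the slide's matrix, which was presumably computed from the full-length sequences with Clustal's own pairwise procedure.

```python
from itertools import combinations

# Alignment fragment copied from the slide (rows 1 to 5).
ALIGNMENT = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing sites over columns where neither row has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # ignore gapped columns (an assumption of this sketch)
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Return {(name_a, name_b): p-distance} for every pair of sequences."""
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


for (name_a, name_b), d in distance_matrix(ALIGNMENT).items():
    print(f"{name_a:10s} {name_b:10s} {d:.2f}")
```

Even on this short fragment the two beta-globins and the two alpha-globins come out closer to each other than to the other chain, which is the pattern the guide tree is built from.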
Creating multiple alignments
• Phylogeny is meaningless unless it is based on a well-done alignment
• Issues to consider
– Alignment parameters
• Weight matrix parameters
• Gap penalties
– Truncated sequences
– Non homologous sequences
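The effect of gap penalties can be illustrated by scoring one fixed pairwise alignment under different penalties. The sketch below assumes an affine scheme in which the first position of a gap costs `gap_open` and each further position costs `gap_extend`; the function name, the match and mismatch scores, and the default values are hypothetical choices for illustration and are not ClustalX's actual scoring scheme, which uses a weight matrix rather than a flat match score.

```python
def alignment_score(row_a, row_b, match=1.0, mismatch=-1.0,
                    gap_open=10.0, gap_extend=0.5):
    """Score two aligned rows with an affine gap penalty.

    The first position of a gap run costs gap_open; each further position
    in the same run costs gap_extend (one common convention, assumed here).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    gap_in_a = gap_in_b = False  # is a gap run currently open in each row?
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column of gaps carries no information
        if a == "-":
            score -= gap_extend if gap_in_a else gap_open
            gap_in_a, gap_in_b = True, False
        elif b == "-":
            score -= gap_extend if gap_in_b else gap_open
            gap_in_a, gap_in_b = False, True
        else:
            score += match if a == b else mismatch
            gap_in_a = gap_in_b = False
    return score


# The same gapped alignment scored under a high and a low opening penalty.
for penalty in (10.0, 1.0):
    print(penalty, alignment_score("GATTACA", "GA-TACA", gap_open=penalty))
```

With a high opening penalty the gapped alignment scores poorly, so an aligner would prefer mismatches over gaps; with a low penalty gaps become cheap and alignments fill with them.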
Multiple alignments: parameters
Multiple alignments: Gap penalties
High gap penalties
Default gap penalties
Low penalties
Multiple alignments: truncated sequences
Multiple alignments: non-homologous sequences
Constructing phylogenies
• Stages in constructing phylogenies:
1. Data scoring; producing genetic distances or character states (‘distance’ or ‘discrete’ data).
2. Tree sorting; processes for searching ‘tree-space’, e.g., hill-climbing or MCMC.
3. Estimation; identifying the most acceptable tree topology and model parameters using
a variety of methods (‘clustering’ or ‘optimising’ methods)
• Phylogenetic methods:
– Algorithmic: neighbor-joining, UPGMA
– Tree-searching: maximum parsimony, maximum likelihood, Bayesian inference
• No one method is best for all circumstances

Neighbor Joining (NJ)
Principles:
• Tree topology and branch lengths are estimated from a genetic distance matrix.

     A      B      C      D      E      F      G      H      I
A    ·
B  0.001    ·
C  0.025  0.024    ·
D  0.003  0.002  0.019    ·
E  0.336  0.331  0.219  0.231    ·
F  0.021  0.019  0.001  0.018  0.233    ·
G  0.001  0.001  0.025  0.002  0.256  0.023    ·
H  0.056  0.044  0.005  0.042  0.132  0.051  0.043    ·
I  0.325  0.300  0.116  0.195  0.005  0.122  0.366  0.213    ·

Advantages:
• A single tree is estimated by minimising genetic distance, in a short time and with little computational expenditure.
Disadvantages:
• The method lacks accuracy because there is no attempt to correct for potential
bias (homoplasy).
• The method lacks precision because the outcome is partly contingent on the tree
with which the search process begins.
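As a rough illustration of how a tree is built from a distance matrix, the following is a minimal sketch of the neighbor-joining procedure applied to the nine-taxon matrix on this slide (values as read from the slide). The function names and the Newick output format are choices made here, and ties between equally good pairs are broken by input order. Real data are rarely perfectly tree-like, so some estimated branch lengths can be negative; production software handles such cases more carefully.

```python
LABELS = list("ABCDEFGHI")

# Lower triangle of the slide's matrix: row i holds the distances from
# taxon i to taxa 0..i-1.
LOWER = [
    [],
    [0.001],
    [0.025, 0.024],
    [0.003, 0.002, 0.019],
    [0.336, 0.331, 0.219, 0.231],
    [0.021, 0.019, 0.001, 0.018, 0.233],
    [0.001, 0.001, 0.025, 0.002, 0.256, 0.023],
    [0.056, 0.044, 0.005, 0.042, 0.132, 0.051, 0.043],
    [0.325, 0.300, 0.116, 0.195, 0.005, 0.122, 0.366, 0.213],
]


def lower_triangle_to_dict(labels, rows):
    """Map each unordered pair of labels to its distance."""
    return {
        frozenset((labels[i], labels[j])): rows[i][j]
        for i in range(len(labels))
        for j in range(i)
    }


def neighbor_joining(labels, dist):
    """Return a Newick string for the neighbor-joining tree (2 or more taxa).

    dist maps frozenset({a, b}) to the distance between a and b.
    Internal nodes are named by their own Newick sub-string.
    """
    nodes = list(labels)
    d = dict(dist)

    def get(a, b):
        return 0.0 if a == b else d[frozenset((a, b))]

    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {a: sum(get(a, b) for b in nodes) for a in nodes}
        # Choose the pair that minimises the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * get(a, b) - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        d_ab = get(a, b)
        # Branch lengths from the new internal node to a and to b.
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"
        # Distances from the new node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (get(a, c) + get(b, c) - d_ab)
        nodes = [c for c in nodes if c not in (a, b)] + [new]

    # Two nodes remain: connect them with a single branch (unrooted tree).
    a, b = nodes
    return f"({a}:{get(a, b):.4f},{b});"


print(neighbor_joining(LABELS, lower_triangle_to_dict(LABELS, LOWER)))
```

The printed Newick string can be opened in a tree viewer to inspect the topology. Note that the procedure visits each pair of taxa a bounded number of times and never revisits a join, which is why it is fast but returns only a single tree.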
Maximum parsimony (MP)
Principles:
• Searches through tree topologies in ‘tree-space’ using a ‘hill-climbing’ algorithm.
• Scores trees on their ‘length’, i.e., the number of character state changes required to
explain the distribution of characters on a given tree topology.
• Looks for the tree with the minimum number of changes, i.e. the topology with the fewest
character changes overall.
Advantages:
• Generally accurate method with few assumptions.
• Phylogenetic hypotheses can be statistically tested by comparing the lengths of different
trees.
• Tree estimation is relatively fast and undemanding.
Disadvantages:
• There are typically several shortest trees, resulting in a potentially ambiguous consensus
topology.
• There is no explicit model of evolution and so the method is prone to error under certain
circumstances, e.g., long-branch attraction (homoplasy).
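The tree "length" used to score a topology can be computed per character with Fitch's small-parsimony algorithm. The sketch below is a minimal illustration, assuming unordered character states, a rooted binary tree written as nested tuples, and a hypothetical four-taxon alignment; none of these names or data come from the course material. It only scores given topologies and does not perform the hill-climbing search.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one character.

    tree is a taxon name (leaf) or a 2-tuple of subtrees;
    states maps each taxon name to its character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # Otherwise one change is required on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over every alignment column."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


# Hypothetical alignment and the three possible groupings of four taxa.
TOY = {"W": "AAG", "X": "AAC", "Y": "TTG", "Z": "TTC"}
CANDIDATES = [
    (("W", "X"), ("Y", "Z")),
    (("W", "Y"), ("X", "Z")),
    (("W", "Z"), ("X", "Y")),
]

for candidate in CANDIDATES:
    print(candidate, tree_length(candidate, TOY))
```

The topology with the smallest length is the most parsimonious. In this toy case the third column conflicts with the first two, a small example of the homoplasy that parsimony has to count as extra changes.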
Maximum likelihood (ML)
Principles:
• Looks for the tree that, under a given model of evolution, maximizes the likelihood of the
observed data
• Applies a complex model of DNA or protein sequence evolution that estimates parameters for
specific substitutions and other qualities of molecular sequences
• Locates the most likely tree topology through a hill-climbing algorithm
• Various models accommodate sources of molecular homoplasy that might result in the wrong tree:
– ‘Multiple hits’ (substitutional saturation)
– Rate convergence
– Rate heterogeneity
– Base composition bias
– Codon usage bias
– Secondary structure
– Covariance
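The simplest illustration of correcting for ‘multiple hits’ is the Jukes-Cantor (JC69) distance, which converts the observed proportion of differing nucleotide sites p into an estimate of the number of substitutions per site, d = -(3/4) ln(1 - 4p/3). JC69 is used here only as a minimal worked example; it assumes equal base frequencies and equal substitution rates, and the models applied in practice are usually richer. The function name is an assumption.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance from an observed proportion of
    differing nucleotide sites p (valid for 0 <= p < 0.75)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 under JC69")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The correction grows as sequences approach saturation.
for p in (0.01, 0.10, 0.30, 0.60):
    print(f"observed {p:.2f} -> corrected {jc69_distance(p):.3f}")
```

For small p the corrected and observed distances are nearly equal; as p approaches 0.75 the observed differences increasingly underestimate the true amount of change, which is the saturation effect listed above.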
Maximum Likelihood
Advantages:
• Highly accurate because considerable biological realism is introduced through the substitutional model. This allows various forms of homoplasy to be corrected for.
• Phylogenetic estimation within the likelihood framework provides a robust statistical context
in which to evaluate specific hypotheses.
• A single tree is produced that is generally precise.
Disadvantages:
• The complexity of the estimation process means that it is slow and computationally
demanding.
• The hill-climbing algorithm is susceptible to local optima, so it is not guaranteed to return the optimal tree.
Bootstrapping a tree
• Statistical estimate of the reliability of groupings
• Sites (columns) in an alignment are resampled with replacement to build pseudoreplicate alignments, and a tree is generated from each
• Process is iterated multiple times (100-1000 times)
• Agreement among the resulting trees is summarized with a majority-rule consensus tree
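The resampling step can be sketched in a few lines. In the example below, alignment columns are drawn with replacement to build each replicate, and support for a grouping is the fraction of replicates in which it appears. The tree-building step is replaced by a deliberately crude stand-in (`closest_pair_clades`, which just reports the two most similar sequences as a group), so this shows the bootstrap bookkeeping only; all names and the toy alignment are assumptions for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def closest_pair_clades(alignment):
    """Toy stand-in for tree building: return the closest pair as one clade."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    pair = min(combinations(sorted(alignment), 2),
               key=lambda p: hamming(alignment[p[0]], alignment[p[1]]))
    return {frozenset(pair)}


def bootstrap_support(alignment, build_clades, n_reps=100, seed=0):
    """Fraction of bootstrap replicates in which each clade is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_reps):
        counts.update(build_clades(bootstrap_alignment(alignment, rng)))
    return {clade: count / n_reps for clade, count in counts.items()}


# Hypothetical four-taxon alignment.
TOY = {
    "t1": "ACGTACGTAC",
    "t2": "ACGTACGTAA",
    "t3": "TTGTACCTAC",
    "t4": "TTGAACCTGC",
}

for clade, support in bootstrap_support(TOY, closest_pair_clades).items():
    print(sorted(clade), support)
```

In a real analysis `build_clades` would run the chosen tree-building method on each replicate and return every clade in the resulting tree; clades recovered in more than half of the replicates would then appear in the majority-rule consensus.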
Bayesian
Principles:
• Based on the notion of posterior probabilities: probabilities that are estimated, based on
some model (prior expectations), after learning something about the data.
• Uses an MCMC process to search through tree-space.
• Selects the tree-topology with the highest probability, given the data.
Advantages:
• Intuitive
• Potential for any complex model.
• Provides both parameter estimates (i.e., tree) and their probabilities in a single analysis.
• Many different hypotheses can be evaluated in a single analysis.
• The MCMC algorithm makes integrating over all parameter values fast and accurate;
MCMCs are able to break out of local optima.
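The posterior probability referred to above can be written out with Bayes' theorem. In the sketch below the symbols are assumptions introduced for illustration: \tau is the tree topology, \theta collects branch lengths and substitution-model parameters, and D is the alignment. The second line gives the acceptance probability of a proposed move from (\tau, \theta) to (\tau', \theta') in a Metropolis-type MCMC with a symmetric proposal; note that the intractable normalising term P(D) cancels in the ratio.

```latex
\[
P(\tau, \theta \mid D) \;=\; \frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}{P(D)},
\qquad
P(\tau \mid D) \;=\; \int P(\tau, \theta \mid D)\, d\theta
\]
\[
\alpha \;=\; \min\!\left(1,\;
\frac{P(D \mid \tau', \theta')\, P(\tau', \theta')}
     {P(D \mid \tau, \theta)\, P(\tau, \theta)}\right)
\]
```

Because moves to less probable states are sometimes accepted, the chain can leave a local optimum, and the proportion of sampled trees containing a given clade approximates that clade's posterior probability.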
Bayesian
Disadvantages:
• An evolutionary model must be specified a priori, in the form of prior probabilities (‘priors’). Is there sufficient knowledge of these probabilities?
• The MCMC must be run long enough for variation in the parameter estimates to smooth out or reach ‘convergence’. The time required is never certain.
• Posterior probabilities describe the absolute probability of particular nodes and branch lengths; these can be overestimated.
[Figure: Bayesian inference (BI) tree of isolates Tb93 1 to Tb93 14; scale bar 0.1]
Remember
All trees are wrong

Cladograms and phylograms
• Cladograms show branching order - branch lengths are meaningless
• Phylograms show branch order and branch lengths
[Figure: the same seven taxa (Bacterium 1-3, Eukaryote 1-4) drawn as a cladogram and as a phylogram]
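The difference between the two drawings is easy to see in the Newick text format commonly used to store trees, where a branch length follows each node after a colon. As a small sketch, assuming trees are held as Newick strings with unquoted labels, the function below (a hypothetical helper, not part of any tree viewer) strips the branch lengths so that only the branching order remains.

```python
import re

# A colon followed by a number, optionally signed or in scientific notation.
_BRANCH_LENGTH = re.compile(r":[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def to_cladogram(newick):
    """Remove branch lengths from a Newick string, keeping the topology.

    Assumes unquoted taxon labels (no colons inside names).
    """
    return _BRANCH_LENGTH.sub("", newick)


# Hypothetical phylogram with branch lengths.
PHYLOGRAM = "((Bacterium_1:0.10,Bacterium_2:0.20):0.05,Eukaryote_1:0.30);"

print(to_cladogram(PHYLOGRAM))
```

The phylogram string carries both kinds of information, while the stripped string carries only the grouping of taxa, which is all a cladogram displays.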
Rooting using an outgroup
• The root defines common ancestry
[Figure: an unrooted tree of archaea and eukaryotes, and the same tree rooted by a bacteria outgroup; once rooted, the archaea and the eukaryotes each form a monophyletic group]
Further details
Textbooks:
Page & Holmes Molecular Evolution: A Phylogenetic Approach. Blackwell Science.
Felsenstein Inferring Phylogenies. Sinauer Associates.
Hall Phylogenetic trees made easy. Sinauer Associates.
Software:
PHYML (ML): http://atgc.lirmm.fr/phyml/
PAUP* (NJ, MP, ML): http://paup.csit.fdsu.edu
PHYLIP (NJ, MP, ML): http://evolution.genetics.washington.edu/phylip.html
MrBayes (Bayesian): http://mrbayes.csit.fdsu.edu
SplitsTree (Networks): http://www.splitstree.org
FindModel (Model Test): http://www.hiv.lanl.gov/content/sequence/findmodel/findmodel.html
Websites:
MultiPhyl (ML via email): http://distributed.cs.nuim.ie/multiphyl.php
Felsenstein’s Phylogeny program page (links to available software): http://evolution.genetics.washington.edu/phylip/software.html
