You are on page 1of 21

Subjects of this lecture

Introduction to 1 Introducing some of the terminology of


phylogenetics.
Phylogenetic Analysis
2 Introducing some of the most commonly
used methods for phylogenetic analysis.
Irit Orr
3 Explain how to construct phylogenetic
trees.

Phylogenetics - WHY?

¾ Find evolutionary ties between organisms.


Taxonomy - is the science of
(Analyze changes occuring in different organisms
classification of organisms. during evolution).
Phylogeny - is the evolution of a ¾ Find (understand) relationships between an
genetically related group of organisms. ancestral sequence and it descendants.
Or: a study of relationships between (Evolution of family of sequences)
collection of "things" (genes, proteins,
organs..) that are derived from a common ¾ Estimate time of divergence between a group of
organisms that share a common ancestor.
ancestor.
Basic tree
Of Life

Eukaryote tree

Eubacteria tree

Similar sequences, common ancestor...

From a common ancestor sequence, two


DNA sequences are diverged.

Each of these two sequences start to


accumulate nucleotide substitutions.

The number of these mutations are used in


... common ancestor, similar function molecular evolution analysis.
How we calculate the Relationships of Phylogenetic
Degree of Divergence Analysis and Sequences Analysis
When 2 sequences found in 2 organisms are
If two sequences of length N differ from very similar, we assume that they have derived
each other at n sites, then their degree of from one ancestor.
divergence is: AAGAATC AAGAGTT
n/N or n/N*100%.

AAGA(A/G)T(C/ T)

The sequences alignment reveal which positions are


conserved from the ancestor sequence.

Relationships of Phylogenetic Relationships of Phylogenetic


Analysis and Sequences Analysis Analysis and Sequences Analysis

The progressive multiple alignment of a group of Most phylogenetic methods assume that each
sequences, first aligns the most similar pair. position in a sequence can change
Then it adds the more distant pairs. independently from the other positions.
The alignment is influenced by the “most Gaps in alignments represent mutations in
similar” pairs and arranged accordingly, but….it sequences such as: insertion, deletion, genetic
does not always correctly represent the rearrangments.
evolutionary history of the occured changes. Gaps are treated in various ways by the
Not all phylogenetic methods work this way. phylogenetic methods. Most of them ignore
gaps.
Relationships of Phylogenetic
Analysis and Sequences Analysis What is a phylogenetic tree?

An illustration of the evolutionary


Another approach to treat gaps is by using relationships among a group of organisms.
sequences similarity scores as the base
for the phylogenetic analysis, instead of
using the alignment itself, and trying to Dendrogram is another name for a
phylogenetic tree.
decide what happened at each position.
The similarity scores based on scoring A tree is composed of nodes and branches.
matrices (with gaps scores) are used by One branch connects any two adjacent
the DISTANCE methods. nodes. Nodes represent the taxonomic
units. (sequences)

Rooted Phylogenetic Tree


What is a phylogenetic tree?

™ E.G: 2 very similar sequences will be seqA Leaves = Outer branches


neighbors on the outer branches and will be 1
Represent the taxa (sequences)
seqB
connected by a common internal branch. 2 *
Nodes = 1 2 3
Types of Trees
* 3 seqC Represent the relationships
Trees Networks
Among the taxa (sequences)
* e.g Node 1 represent the ancestor
seqD
seq from which seqA and seqB derived.
Only one path between More than one path
any pair of nodes between any pair of
nodes Branches *
The length of the branch represent
the # of changes that occurred in the
seqs prior to the next level of separation.
External branches - are branches
that end with a tip.
(FA,FB,GC,GD,IE)
In a phylogenetic tree...
A more recent diversions
9 Each NODE represents a speciation event in
F
evolution. Beyond this point any sequence
H
B changes that occurred are specific for each
branch (specie).
I C
G

9 The BRANCH connects 2 NODES of the tree.


D The length of each BRANCH between one
NODE to the next, represents the # of changes
Internal branches - are that occurred until the next separation
branches that do not end E (speciation).
with a tip.
(IH,HF,HG)
more ancient diversions

In a phylogenetic tree... Tree structure

9 NOTE: The amount of evolutionary time that ♠Terminal nodes - represent the data (e.g
passed from the separation of the 2 sequences
is not known. The phylogenetic analysis can only
sequences) under comparison
estimate the # of changes that occurred from the (A,B,C,D,E), also known as OTUs,
time of separation. (Operational Taxonomic Units).
9 After the branching event, one taxon (sequence)
can undergo more mutations then the other ♠Internal nodes - represent inferred
taxon.
ancestral units (usually without empirical
data), also known as HTUs, (Hypothetical
9 Topology of a tree is the branching pattern of a Taxonomic Units).
tree.
Slide taken from Dr. Itai Yanai
The Molecular Clock Hypothesis
Different kinds of trees can be used to
depict different aspects of evolutionary history
All the mutations occur in the same rate in all
1. Cladogram: the tree branches.
simply shows relative recency of common ancestry
The rate of the mutations is the same for all
positions along the sequence.
7 2. Additive trees:
2 a cladogram with branch lengths,
The Molecular Clock Hypothesis is most suitable for
3 4
3 also called phylograms and metric trees
1 5

1
3
closely related species.

1
3 1
3. Ultrametric trees:
(dendograms) special kind of additive tree in which the
1
3 tips of the trees are all equidistant from the root
2
1 1 1 1

Unrooted Tree = Phenogram


Rooted Tree = Cladogram
A phylogenetic tree where all
A phylogenetic tree that the "objects" on it are related
all the "objects" on it descendants - but there is B
share a known common not enough information to
ancestor (the root). specify the common
There exists a particular ancestor (root).
root node. A

The paths from the root The path between nodes of


to the nodes Root
the tree do not specify an
B
correspond to evolutionary time.
evolutionary time. C
Slide taken from Dr. Itai Yanai Rooted versus Unrooted

™ The number of tree topologies of rooted


Types of Trees
Rooted vs. Unrooted
tree is much higher than that of the
unrooted tree for the same number of
Branches Nodes
OTUs.
Rooted Interior M–2 M–1
Total 2M – 2 2M – 1
Unrooted Interior
Total
M–3
2M – 3
M–2
2M – 2
™ Therefore, the error of the unrooted tree
topology is smaller than that of the rooted
M is the number of OTU’s tree.

Slide taken from Dr. Itai Yanai

The number of rooted and unrooted trees:


Orthologs - genes related by speciation
Possible Number of
Number events. Meaning same genes in different
of OTU’s Rooted trees Unrooted
2 1
trees
1 species.
3 3 1
4 15 3
5
6
105
945
15
105
Paralogs - genes related by duplication
7 10395 945 events. Meaning duplicated genes in the
8 135135 10395
9 2027025 135135 same species.
10 34459425 2027025

OTU – Operational Taxonomical Unit


Selecting sequences for phylogenetic Selecting sequences for phylogenetic
analysis analysis
™Non-coding DNA regions have more
What type of sequence to use, Protein or
substitution than coding regions.
DNA?
™Proteins are much more conserved since
they "need" to conserve their function.
The rate of mutation is assumed to be the
So it is better to use sequences that
same in both coding and non-coding
mutate slowly (proteins) than DNA.
regions.
However, if the genes are very small, or
However, there is a difference in the they mutate slowly, we can use them for
substitution rate. building the trees.

Alignment of a coding region should be compared with


Known Problems of Multiple Alignments the alignment of their protein sequences, to be sure about
the placement of gaps.

Important sites could be misaligned by the


software used for the sequence alignment.
T Y R R S R ACA TAC AGG CGA
T Y R R
That will effect the significance of the site - T Y R R S R ACA
T
TAC
Y
AGG
R
CGA
R
and the tree. T Y R - S R ACA
T
TAC
Y
AGG
R
---
-
For example: ATG as start codon, or specific T Y R - S R ACA TAC --- CGA
T Y - R
amino acids in functional domains. T Y R R S R ACA TAC AGG CGA
T Y R R
Gaps - Are treated differently by different
alignment programs and should play no
part in building trees.
Known Problems of Multiple Alignments Selecting sequences for phylogenetic
analysis
♥ Low complexity regions - effect the multiple ‡ Sequences that are being compared belong
alignment because they create random bias together (orthologs).
for various regions of the alignment. ‡ If no ancestral sequence is available you may
♥ Low complexity regions should be removed use an "outgroup" as a reference to measure
from the alignment before building the tree. distances. In such a case, for an outgroup you
need to choose a close relative to the group
being compared.
If you delete these regions you need to ‡ For example: if the group is of mammalian
consider the affect of the deletions on the sequences then the outgroup should be a sequence
branch lengths of the whole tree. from birds and not plants.

How to choose a phylogenetic method?


Taken from Dr.Itai Yanai

Choose set of related seqs Given a multiple alignment, how do we construct the tree?
Obtain MultipleAlignment
(DNA or Proteins

Is there a strong similarity?


A - GCTTGTCCGTTACGAT
B – ACTTGTCTGTTACGAT
C – ACTTGTCCGAAACGAT ?
Strong similarity D - ACTTGACCGTTTCCTT
Distant (weak) similarity E – AGATGACCGTTTCGAT
Maximum Parsimony Distance methods F - ACTACACCCTTATGAG

Check validity of the Very weak similarity


results Maximum Likelihood
Building Phylogenetic Trees Statistical Methods
Main methods: 9 Bootstrapping Analysis –
¾ Distances matrix methods Is a method for testing how good a dataset fits a
Neighbour Joining, UPGMA evolutionary model.
This method can check the branch arrangement
¾ Character based methods:
(topology) of a phylogenetic tree.
Parsimony methods
In Bootstrapping, the program re-samples
Maximum Likelihood method columns in a multiple aligned group of
¾ Validation method: sequences, and creates many new alignments,
Bootstrapping (with replacement the original dataset).
Jack Knife These new sets represent the population.

Statistical Methods Taken from Dr. Itai Yanai

The process is done at least 100 times.


Given the following tree, estimate the confidence of
the two internal branches
Phylogenetic trees are generated from all the
sets.
Part of the results will show the # of times a Chimpanzee Human

particular branch point occurred out of all the Gibbon

trees that were built. Gorilla


Orang-utan

The higher the # - the more valid the


branching point.
Taken from Dr. Itai Yanai Statistical Methods
Estimating Confidence from the Resamplings
Bootstrap values between 90-100 are
1. Of the 100 trees:
41/100 28/100 31/100 considered statistically significant
Gorilla Human Chimpanzee Human
Chimpanzee Human

Gibbon
Gibbon Gibbon
Gorilla
Orang-utan Chimpanzee Gorilla
Orang-utan Orang-utan
2. Upon the original tree we superimpose bootstrap values:
Chimpanzee Human
41
In 41 of the 100 trees, Gibbon
chimp and gorilla are
split from the rest. In 100 of the 100 trees,
Gorilla 100
gibbon and orang-utan
Orang-utan are split from the rest.

Character Based Methods Character Based Methods

All Character Based Methods assume that Q: How do you find the minimum # of changes
each character substitution is independent needed to explain the data in a given tree?
of its neighbors.
A: The answer will be to construct a set of possible
ways to get from one set to the other, and choose
ƒ Maximum Parsimony (minimum evolution) the "best". (for example: Maximum Parsimony)
- in this method one tree will be given
CCGCCACGA
(built) with the fewest changes required to P P R
explain (tree) the differences observed in CGGCCACGA
the data. R P R
Character Based Methods - Maximum
Parsimony

∞ Not all sites are informative in parsimony.


Maximum Parsimony

Start by classifying the sites:

∞ Informative site, is a site that has at least 2


characters, each appearing at least in 2 of
123456789012345678901
Mouse CTTCGTTGGATCAGTTTGATA
Rat CCTCGTTGGATCATTTTGATA
the sequences of the dataset. Dog
Human
CTGCTTTGGATCAGTTTGAAC
CCGCCTTGGATCAGTTTGAAC
------------------------------------
Invariant * * ******** *****
Variant ** * * **
------------------------------------
Informative ** ** Taken from
Dr. Itai Yanai
Non-inform. * *

123456789012345678901 Maximum Parsimony


Mouse CTTCGTTGGATCAGTTTGATA
123456789012345678901
Rat CCTCGTTGGATCATTTTGATA Taken from
Mouse CTTCGTTGGATCAGTTTGATA
Dog CTGCTTTGGATCAGTTTGAAC Dr. Itai Yanai
Human CCGCCTTGGATCAGTTTGAAC Rat CCTCGTTGGATCATTTTGATA Taken from
** * Dog CTGCTTTGGATCAGTTTGAAC Dr. Itai Yanai
Human CCGCCTTGGATCAGTTTGAAC
Mouse Dog Dog Mouse Mouse Rat
G T T T G G G
Informative ** **
G G G G G
Site 5:
G C G C T C Mouse Dog Dog Mouse Mouse Rat
Rat Human Rat Human Dog Human
Mouse Dog Dog Mouse Mouse Rat
T C C T T C C T T T C C
Site 2: Rat Human Rat Human Dog Human
C C C C T C
Rat Human Rat Human Dog Human 3 0 1
Mouse Dog Dog Mouse Mouse Rat
T T G G G G G T T G G T
Site 3:
T G T G G G
Rat Human Rat Human Dog Human
Character Based Methods - Maximum Maximum Parsimony Methods are
Parsimony Available…

∞ The Maximum Parsimony method is good for ¾For DNA in Programs:


similar sequences, a sequences group with paup, molphy,phylo_win
small amount of variations
In the Phylip package:
DNAPars, DNAPenny, etc..
Maximum Parsimony methods do not give the
branch lengths only the branch order. ¾For Protein in Programs:
paup, molphy,phylo_win
For larger set it is recommended to use In the Phylip package:
the “branch and bound” method instead PROTPars
Of Maximum Parsimony.

Character Based Methods - Character Based Methods -


Maximum Likelihood Maximum Likelihood

Basic idea of Maximum Likelihood method is Maximum Likelihood method – using a tree
building a tree based on mathemaical model. model for nucleotide substitutions, it will try to
This method find a tree based on probability find the most likely tree (out of all the trees of
calculations that best accounts for the large the given dataset).
amount of variations of the data (sequences)
set. The Maximum Likelihood methods are very slow
and cpu consuming.
Maximum Likelihood method (like the Maximum
Parsimony method) performs its analysis on Maximum Likelihood methods can be found in
each position of the multiple alignment. phylip, paup or puzzle.
This is why this method is very heavy on CPU.
Maximum Likelihood method Character Based Methods

Are available in the Programs:


paup or puzzle The Maximum Likelihood methods are
In phylip package in programs: very slow and cpu consuming (computer
DNAML and DNAMLK expensive).

Maximum Likelihood methods can be


found in phylip, paup or puzzle.

Distances Matrix Methods Distances Matrix Methods

Distance methods assume a molecular clock, Distance - the number of substitutions per site per
meaning that all mutations are neutral and time period.
therefore they happen at a random clocklike Evolutionary distance are calculated based on one
rate. of DNA evolutionary models.
This assumption is not true for several reasons:
Different environmental conditions affect mutation Neighbors – pairs of sequences that have the
rates. smallest number of substitutions between them.
This assumption ignores selection issues which are On a phylogenetic tree, neighbors are joined by a
different with different time periods.
node (common ancestor).
Distances Matrix Methods Distance method steps

Distance methods vary in the way they construct


1 Multiple alignments - based on all against
the trees. all pairwise comparisons.
2 Building distance matrix of all the compared
Distance methods try to place the correct sequences (all pair of OTUs).
positions of all the neighbors, and find the 3 Disregard of the actual sequences.
correct branches lengths.
4 Constructing a guide tree by clustering the
distances. Iteratively build the relations
Distance based clustering methods:
(branches and internal nodes) between all
Neighbor-Joining (unrooted tree)
OTUs.
UPGMA (rooted tree)

Distance method steps Distances Matrix Methods

Construction of a distance tree using clustering with the Unweighted


Pair Group Method with Arithmatic Mean (UPGMA)
Distances matrix methods can be found in the
First, construct a distance matrix: following Programs:
A B C D E Clustalw, Phylo_win, Paup
A - GCTTGTCCGTTACGAT
B
C


ACTTGTCTGTTACGAT
ACTTGTCCGAAACGAT
B
C
2
4 4
In the GCG software package:
D
E
-

ACTTGACCGTTTCCTT
AGATGACCGTTTCGAT
D 6 6 6 Paupsearch, distances
E 6 6 6 4
F - ACTACACCCTTATGAG
F 8 8 8 8 8 In the Phylip package:
DNADist, PROTDist, Fitch, Kitch, Neighbor
From http://www.icp.ucl.ac.be/~opperd/private/upgma.html
Mutations as data source for Correction of Distances between
evolutionary analysis DNA sequences

Mutation - an error in DNA replication or In order to detect changes in DNA


DNA repair. sequences we compare them to each
other.
Only mutations that occur in germline We assume that each observed change in
cells play a rule in evolution. However, in similar sequences, represent a “single
some organisms there is no distinction mutation event”.
between germline or somatic mutation. The greater the number of changes, the
Only mutations that were fixed in the more possible types of mutations.
population are called substitutions.

Original Sequence
Mutations - Substitutions
A
C
T
G Q: What do we measure by sequence alignment?
A
A
A: Substitutions in the aligned sequences.
C
G

A
T
A
The rate of substitution in regions that
A
C C > A Single Substitution evolve under no constraints are assumed
T T
G G
to be equal to the mutation rate.
Multiple A>C>T A
Substitution A A
coincidental Substitution
C>G C>A
G Parallel Substitution G
T>A T>A
Mutations - Substitutions Mutations

Point mutation - mutation in a single ♣Deletion - removal of one or more nucleotides


nucleotide. from the DNA.
Segmental mutation - mutation in several ♣Insertion - addition of one or more
adjacent nucleotides. nucleotides to the DNA.
Substitution mutation - replacement of one
nucleotide with another. ♣Inversion - rotation by 180 of a double-
Recombination - exchange of a sequence stranded DNA segment comprising 2 or more
with another. base-pairs.

1 AGGCAAACCTACTGGTCTTAT Original Sequence


*
2 AGGCAAATCCTACTGGTCTTAT Transtion c-t
Substitution Mutations

3 AGGCAAACCTACTGCTCTTAT transversion g-c Transition - a change between purines


* (A,G) or between pyrimidines (T,C).
4 AGGCAAACCTACTGGTCTTAT recombination gtctt

ACCTA
5 AGGCAA CTGGTCTTAT deletion accta Transversion - a change between purines
(A,G) to pyrimidines (T,C).
6 AGGCAAACCTACTAAAGCGGTCTTAT insertion aagcg

Substitution mutations usually arise from


7 AGGTTTGCCTACTGGTCTTAT inversion from 5' gcaaac3'
to 5' gtttgc 3'
mispairing of bases during replication.
Let's assume that.. Where do the mutations occur?

A mutation of a sense codon to another Synonymous substitutions mostly occur at


sense codon, occur in equal frequency for the 3rd position of the codon.
all the codons.
Nonsynonymous substitutions mostly
occur at the 2nd position of the codon.
If so we can now compute the expected
proportion of different types of substitution
mutations from the genetic code. Any substitution in the 2nd position is
nonsynonymous

Correction of Distances between Jukes & Cantor one-parameter model


DNA sequences
This model assumes that substitutions
between the 4 bases occur with equal frequency.
There are several evolutionary models Meaning no bias in the direction of the change.
used to correct for the likelihood of
multiple mutations and reversions in DNA A α G
sequences. α Is the rate of substitutions
These evolutionary models use a In each of the 3 directions
For one base.
normalized distance measurement that is
the average degree of change per length α ( Is the one parameter).
of aligned sequences.
C T
Kimura two-parameter model

This model assumes that transitions


(A - G or T - C) occur more often than
transversions (purine -pyrimidine). These evolutionary models improve the
distance calculations between the
A sequences.
α G
These evolutionary models have less
α Is the rate of transitional
Substitutions.
effect in phylogenetic predictions of
closely related sequences.
β β Is the rate of transversional
β
substitutions. These evolutionary models have better
effect with distant related sequences.
C T
α

DNA Evolution Models How to read the tree?

Start at the base and follow


A basic process in DNA sequence rat The progression of the branch
evolution is the substitution of one points (nodes)
nucleotide with another. bovine

This process is slow and can not be human


observed directly.
The study of DNA changes is used to goldfish
estimate the rate of evolution, and the zebrafish
evolution history of organisms.
Xeno
How to draw Trees? (Building trees software)
A Tip...
* Unrooted trees should be plotted using
the DRAWGRAM program (phylip), or For DNA sequences use the Kimura's
similar. model in the building trees programs.

* Rooted trees should be plotted using the For PROTEINS the differences lie with
DRAWTREE program (phylip), or similar. the scoring (substitution) matrices used.
For more distant sequences you should
use BLOSUM with lower # (i.e., for
* On a PC use the TreeView program
distant proteins use blosum45 and for
similar proteins use blosum60).

Known problems of Phylogenetic Analysis Known problems of Phylogenetic Analysis

™ Order of the input data (sequences) - ™ The definition of "best tree" is ambiguous.
The order of the input sequences effects the tree It might mean the most likely tree, or a tree with
construction. You can "correct" this effect in some the fewest changes, or a tree best fit to a known
of the programs (like phylip), using the Jumble model, etc..
option. (J in phylip set to 10).

The trees that result from various methods differ


™ The number of possible trees is huge for large
datasets. Often it is not possible to construct all from each other. Never the less, in order to
trees, but can guarantee only "a good" tree not compare trees, one need to assume some
the "best tree". evolutionary model so that the trees may be
tested.
Known problems of Phylogenetic Analysis How many trees to build?
∆ The amount of data used for the tree construction is
not always "informative". For example, if we ! For each dataset it is recommended to build
compare proteins that are 100% similar in a site, we more than one tree. Build a tree using a
do not have information to infer to phylogenetic distance method and if possible also use a
relationships between these proteins. character-based method, like maximum
parsimony.
∆ Population effects are often to be considered,
especially if we have a lot of variety (large # of
alleles for one protein). ! The core of the tree should be similar in
both methods, otherwise you may suspect
that your tree is incorrect.

You might also like