
Phylogeny

EVOLUTIONARY
PROCESSES I
Phylogenetic Trees
Alignments and Phylogeny: What’s the
connection?

If a group of aligned sequences shows significant similarity to each other, this is
usually taken as evidence that they are the result of divergent evolution from
a common ancestral sequence. Why?

Will the sequence alignment contain traces of the evolutionary history of
these sequences?

Is it possible to infer this history by complex analysis of a multiple sequence
alignment?

What is the mechanism of Darwinian Evolution? How do DNA sequences
evolve?
Unrooted versus rooted species trees

Unrooted and rooted phylogenetic trees.


(A) Unrooted tree. This tree is fully resolved in that each internal node has
three branches leading from it, one connecting to the ancestor and two to
descendants. However, the direction of evolution along the internal
branches—that is, which ancestral species has evolved from which—
remains undetermined.
(B)  Rooted tree for the same set of existing species. The brown bird
marked root can now be distinguished as the last common ancestor of
all the yellow and brown birds. The line upward from the root bird indicates
where the ancestors of the root bird would be.
What is phylogenetic analysis and why should we perform it?

Phylogenetic analysis has two major components:

1. Phylogeny inference or tree building:
the inference of the branching orders, and
ultimately the evolutionary relationships,
between taxa (entities such as genes,
populations, species, etc.)
2. Character and rate analysis:
using phylogenies as analytical frameworks
for rigorous understanding of the evolution of
various traits or conditions of interest

Which is larger: the basal rate of DNA mutation or the rate of PAM?
Vocabulary:

taxa (singular taxon) or operational taxonomic units (OTUs)

species trees

branches or edges

external nodes: existing species (genes, proteins)

internal nodes: hypothesized ancestral states of species (genes, proteins)

the internal branch points represent speciation events


Common Phylogenetic Tree Terminology

[Figure: rooted tree of five taxa, A to E, with the following parts labeled.]
Terminal Nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
Branches or Lineages
Internal Nodes or Divergence Points: represent hypothetical ancestors of the taxa
Ancestral Node or ROOT of the Tree
Mutation Rates and Bacterial Growth


Even if only a single S. aureus cell were to make its way into your wound, it would take only
10 generations for that single cell to grow into a colony of more than 1,000 (2^10 = 1,024), and
just 10 more generations for it to erupt into a colony of more than 1 million (2^20 = 1,048,576).
For a bacterium that divides about every half hour (which is how quickly S. aureus can grow
in optimal conditions), that is a lot of bacteria in less than 12 hours. S. aureus has about 2.8
million nucleotide base pairs in its genome. At a rate of, say, 10^-10 mutations per nucleotide
base, that amounts to nearly 300 mutations in that population of bacteria within 10 hours!
To better understand the impact of this situation, think of it this way: With a genome size of
2.8 × 10^6 and a mutation rate of 1 mutation per 10^10 base pairs, it would take a single
bacterium 30 hours to grow into a population in which every single base pair in the genome
will have mutated not once, but 30 times! Thus, any individual mutation that could
theoretically occur in the bacteria will have occurred somewhere in that population, in just
over a day.
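
The 10-generation and 20-generation figures above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in the passage; the function names are illustrative, and the model (one founder cell, perfect doubling every 30 minutes, each cell copying its genome once) is a simplifying assumption.

```python
GENOME_SIZE = 2.8e6           # base pairs in the S. aureus genome (from the text)
MUTATION_RATE = 1e-10         # mutations per base pair per genome copy (from the text)
MINUTES_PER_GENERATION = 30   # optimal-conditions doubling time (from the text)


def cells_after(generations):
    """Cells descended from a single founder after the given number of doublings."""
    return 2 ** generations


def expected_mutations(cells):
    """Expected new mutations if every cell in the population copies its genome once."""
    return cells * GENOME_SIZE * MUTATION_RATE


for generations in (10, 20):
    cells = cells_after(generations)
    hours = generations * MINUTES_PER_GENERATION / 60
    print(f"{generations} generations ({hours:.0f} h): {cells:,} cells, "
          f"about {expected_mutations(cells):.1f} expected mutations")
```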
Four different types of rooted phylogenetic tree

(A) Cladogram in which branch
lengths have no meaning.
(B) An additive tree, in which
branch lengths are a
measure of evolutionary
divergence.
(C)  An ultrametric tree, which in
addition to the properties of
the additive tree has the
same constant rate of
mutation assumed along all
branches.
(D)  An additive tree for the same
set of species as in (B), which has
been rooted by the addition
of data for a distantly related
outgroup (orange bird).

Note that for additive trees, the branch lengths are proportional to the number of
mutations that have occurred. Seems simple, but it's not. Why?
Three types of trees showing the same evolutionary relationships,
or branching orders, between the taxa
[Figure: the same four taxa (A, B, C, D) drawn three ways.
Cladogram: horizontal axis has no meaning.
Phylogram (additive tree): horizontal axis shows genetic change; the branches carry numeric lengths (6, 1, 1, 3, 1 and 5 in the figure).
Ultrametric tree: horizontal axis shows time.]


An ultrametric tree has, in addition to the properties of the additive tree, the
same constant rate of mutation assumed along all branches. This last property
is often referred to as a molecular clock, because one can, in principle, measure
the actual times of evolutionary events from such trees.
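
One practical consequence of the molecular clock assumption is that every tip of an ultrametric tree lies at the same distance from the root, which is easy to test. The sketch below assumes a simple tuple representation, (name, branch_length, children), and uses two hypothetical trees with made-up branch lengths; neither the representation nor the numbers come from the slides.

```python
def tip_depths(node, depth=0.0):
    """Return {tip name: summed branch length from the root to that tip}.

    A node is (name, branch_length, children); tips have an empty child list.
    """
    name, length, children = node
    depth += length
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(tip_depths(child, depth))
    return depths


def is_ultrametric(root, tolerance=1e-9):
    """True when all tips are equally distant from the root."""
    depths = tip_depths(root).values()
    return max(depths) - min(depths) <= tolerance


# Hypothetical additive tree: tips end at different distances from the root.
additive = ("root", 0, [
    ("n1", 1, [("B", 6, []), ("n2", 1, [("C", 3, []), ("A", 1, [])])]),
    ("D", 5, []),
])

# Hypothetical clock-like tree with the same branching order.
clock_like = ("root", 0, [
    ("n1", 1, [("B", 3, []), ("n2", 1, [("C", 2, []), ("A", 2, [])])]),
    ("D", 4, []),
])

print(tip_depths(additive), is_ultrametric(additive))
print(tip_depths(clock_like), is_ultrametric(clock_like))
```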
Examples of a species tree and a gene tree

(A)  A species tree showing the evolutionary relationships between seven eukaryotes, with one
more distantly related to the others (Hydra) used as an outgroup to root the tree. Xenopus is
a frog, Catostomus a fish, Drosophila a fruit fly, and Artemia the brine shrimp.
(B)  The gene tree for the Na+–K+ ion pump membrane protein family members found in the
species shown in (A). In some species, e.g., Hydra and Xenopus, only one member of the
family is known, whereas other species, such as humans and chickens, have three
members. The small squares at nodes indicate gene duplications.

What is the mechanism of a gene duplication?


Can you picture unequal homologous recombination?
Definitions of ortholog and paralog?
A tree can be represented as a set of splits.
(A) Unrooted additive tree using
fictitious data for eight mammalian
taxa. The horizontal lines carry the
information about evolutionary
change; the vertical lines are
purely for visual clarity. The scale
bar refers to branch length and in
this case represents a genetic
distance of 0.2 mutations per site.

(B) A table representing all the
possible internal branch splits of
the tree shown in (A). The columns
correspond to the taxa and the
rows to the split. The two groups of
each split are shown by labeling
the taxa in one group with an
asterisk, and leaving the others
blank.
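
A split table of this kind can be generated directly from a tree topology. The following is a rough sketch only: the eight taxa are given placeholder names A to H and the topology is hypothetical, since the slide's tree is not reproduced here. The unrooted tree is written with a three-way branch at the top level, so each internal branch is listed exactly once.

```python
def collect_splits(tree):
    """Return (tips below this node, splits for every internal branch below it).

    A tree is a tip name (str) or a tuple of subtrees. Each split is recorded
    as the set of tips on the side of the branch away from the top-level node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    tips, splits = frozenset(), []
    for child in tree:
        child_tips, child_splits = collect_splits(child)
        tips |= child_tips
        splits += child_splits
        if len(child_tips) > 1:        # the branch above this child is internal
            splits.append(child_tips)
    return tips, splits


# Hypothetical unrooted topology for eight placeholder taxa.
tree = ((("A", "B"), "C"), ("D", "E"), ("F", ("G", "H")))

all_tips, splits = collect_splits(tree)
taxa = sorted(all_tips)
print(" ".join(taxa))
for split in splits:
    # One row per split: an asterisk marks the taxa in one of the two groups.
    print(" ".join("*" if taxon in split else " " for taxon in taxa))
```

An unrooted, fully resolved tree of N taxa has N - 3 internal branches, so the eight-taxon example produces five rows.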
Phylogenetic trees diagram the evolutionary
relationships between the taxa
[Figure: tree of five taxa, listed top to bottom as Taxon B, Taxon C, Taxon A, Taxon D, Taxon E.]

Vertical dimension: no meaning to the spacing between the taxa, or to the order in
which they appear from top to bottom.

Horizontal dimension: can have no scale (for cladograms), can be proportional to
genetic distance or amount of change (for phylograms or additive trees), or can be
proportional to time (for ultrametric trees or true evolutionary trees).
((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses
These say that B and C are more closely related to each other than either is to A,
and that A, B, and C form a clade that is a sister group to the clade composed of
D and E. If the tree has a time scale, then D and E are the most closely related.
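
The nested-parentheses notation can be read mechanically. The sketch below is a minimal parser for topology-only strings like the one above (tip names and commas, no branch lengths or support values); it is an illustration, not a full parser for the format, and the function names are assumptions.

```python
def parse_nested(text):
    """Parse a topology-only nested-parentheses string into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                       # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                   # consume ","
                children.append(node())
            pos += 1                       # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]             # a tip name

    return node()


def tips(tree):
    """Tip names below a node, in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in tips(child)]


def clades(tree):
    """Tip lists for this node and every internal node below it."""
    if isinstance(tree, str):
        return []
    found = [tips(tree)]
    for child in tree:
        found += clades(child)
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(",".join(clade))
```

Each printed line is one clade: the whole tree, then (A,B,C), (B,C) and (D,E), matching the relationships described above.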
Which species are the closest living
relatives of modern humans?

[Figure, left panel: tree with tips Humans, Chimpanzees, Bonobos, Gorillas, Orangutans; time axis from 14 to 0 MYA.]
Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all
show that bonobos and chimpanzees are related more closely to humans than either
is to gorillas.

[Figure, right panel: tree with tips Gorillas, Chimpanzees, Bonobos, Orangutans, Humans; time axis from 15-30 to 0 MYA.]
The pre-molecular view was that the great apes (chimpanzees, gorillas and
orangutans) formed a clade separate from humans, and that humans diverged from
the apes at least 15-30 MYA.
Did the Florida Dentist infect his patients with HIV?

[Figure: phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people (local controls).]

Patients A, B, C, E and G: Yes. The HIV sequences from these patients fall within
the clade of HIV sequences found in the dentist.

Patients D and F: No.

From Ou et al. (1992) and Page & Holmes (1998)


A few examples of what can be learned
from character analysis using phylogenies
as analytical frameworks:

•  When did specific episodes of positive Darwinian
selection occur during evolutionary history?
•  Which genetic changes are unique to the human
lineage?
•  What was the most likely geographical location of
the common ancestor of the African apes and
humans?
•  Plus countless others…
There are three possible unrooted trees
for four taxa (A, B, C, D)
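Tree 1: A and B on one side of the internal branch, C and D on the other
Tree 2: A and C on one side of the internal branch, B and D on the other
Tree 3: A and D on one side of the internal branch, B and C on the other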
Phylogenetic tree building (or inference) methods are aimed at
discovering which of the possible unrooted trees is "correct".
We would like this to be the true biological tree — that is, one
that accurately represents the evolutionary history of the taxa.
However, we must settle for discovering the computationally
correct or optimal tree for the phylogenetic method of choice.
The number of unrooted trees increases in a greater
than exponential manner with number of taxa
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...           ...
30            ≈ 8.69 x 10^36

(2N - 5)!! = # unrooted trees for N taxa

[Figure: example unrooted trees for three, four, five and six taxa (A to F).]
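
The double factorial (2N - 5)!! is the product of the odd numbers from 1 up to 2N - 5, so the counts are easy to reproduce. A small sketch (the function name is illustrative):

```python
def unrooted_tree_count(n_taxa):
    """Number of fully resolved unrooted trees for n_taxa taxa: (2N - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):   # 3, 5, ..., 2N - 5
        count *= odd
    return count


for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
    print(f"{n:>2} taxa: {unrooted_tree_count(n):.3e} unrooted trees")
```

Python integers have arbitrary precision, so the exact count is available even for 30 taxa; the scientific notation is only for display.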
A condensed tree showing well-supported features is derived by
applying the bootstrap procedure.

The bootstrap procedure assigns values to individual branches that indicate
whether their associated splits are well supported by the data.

(A) Each internal branch of the tree has been given a number that indicates the
percentage of bootstrap trees in which that branch occurs.

(B) A condensed tree is produced by removing internal branches that are
supported by less than 60% of the bootstrap trees. This procedure results in
multifurcating nodes.

Note that the branch lengths no longer have meaning, so all line lengths are for
aesthetic purposes only.
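
The resampling step behind these percentages can be sketched briefly. In the usual form of the procedure, alignment columns are drawn with replacement to build each pseudo-replicate alignment, a tree is built from each replicate, and the support for a branch is the percentage of replicate trees containing its split. The sketch below covers only the resampling and the final percentage; tree building is left out, and the alignment, the split labels and the function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length.

    `alignment` maps taxon name to an aligned sequence; all sequences must
    have the same length.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def support_percent(replicate_splits, split):
    """Percentage of replicate trees whose set of splits contains `split`."""
    hits = sum(split in splits for splits in replicate_splits)
    return 100.0 * hits / len(replicate_splits)


# Hypothetical toy alignment, for illustration only.
alignment = {"taxon1": "ACGTACGT", "taxon2": "ACGTACGA", "taxon3": "ACCTACGA"}
replicate = bootstrap_alignment(alignment, random.Random(0))
print(replicate)
```

Under the 60% rule described above, a branch would be collapsed whenever support_percent returns a value below 60.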
Consensus trees show features that are consistent between trees.

Assuming that the four trees in (A) are all equally strongly supported,
they can be represented by the strict consensus tree shown in (B),
in which only splits that occur in all the trees are represented: that is,
(A,B,C) and (D,E,F).

(C) Majority rule consensus trees (60% and 50%) for the four trees
in (A). The (A,B) split occurs in only 50% of the trees, and thus is
not included separately in the 60% consensus tree, whereas the (E,F)
split occurs in 75%. The (A,B) split can be included in the 50% tree.
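
The consensus rules above amount to counting how often each split occurs and keeping those that reach a threshold. The sketch below uses four hypothetical trees, each written as a set of splits, chosen only so that the frequencies match the ones quoted above ((A,B,C) in all four, (E,F) in 75%, (A,B) in 50%); they are not the trees from the figure.

```python
from collections import Counter


def consensus_splits(trees, threshold):
    """Splits present in at least `threshold` (a fraction from 0 to 1) of the trees."""
    counts = Counter(split for tree in trees for split in tree)
    return {split for split, n in counts.items() if n / len(trees) >= threshold}


def split(tips):
    """A split over taxa A to F, recorded by the tips on one side of the branch."""
    return frozenset(tips)


# Hypothetical trees; the split ABC | DEF is recorded as its ABC side.
trees = [
    {split("ABC"), split("AB"), split("EF")},
    {split("ABC"), split("AB"), split("EF")},
    {split("ABC"), split("AC"), split("EF")},
    {split("ABC"), split("BC"), split("DE")},
]

for label, threshold in (("strict", 1.0), ("60% majority", 0.6), ("50% majority", 0.5)):
    kept = sorted("".join(sorted(s)) for s in consensus_splits(trees, threshold))
    print(label, kept)
```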
Quiz Oct 18
Which is the correct
representation?

[Figure: tree with tips A, B, C, D, E, listed top to bottom.]

a.  ((A,B,C),(D,E))
b.  ((C,(A,B)),(D,E))
c.  (C,(A,B),(D,E)))
d.  ((A,(B,C)),(D,E))
Best DNA alignments require the alignment of
amino acid sequences first
Which is the better
alignment?

RevTrans 1.4 Server


http://www.cbs.dtu.dk/services/RevTrans/
Number of observed mutations is often significantly less than the actual
number of mutations because of overlapping/redundant mutations

The straight red line represents the p-distance (the fraction of nonidentical
sites in an alignment) that would be observed if each site received at most one
mutation. The observed p-distance in an alignment is plotted (black
line) against the average number of
mutations at each site as calculated by
the PAM model described in Section 5.1.
This can be compared with Figure 5.2,
which shows the fraction of identical
alignment sites for different PAM
distances. It can be seen that the
observed fraction of nonidentical sites in
an alignment is always an underestimate
of the actual number of mutations that
have occurred, except when the number
of mutations is very small.
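
The gap between observed and actual change can be illustrated for DNA. The figure uses the PAM model for proteins; the sketch below swaps in the simpler Jukes-Cantor correction for nucleotides, d = -(3/4) ln(1 - 4p/3), purely as an illustration of the same effect. The sequences and function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of nonidentical sites, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor estimate of substitutions per site from a p-distance below 0.75."""
    return -0.75 * math.log(1 - 4 * p / 3)


print(p_distance("ACGTACGT", "ACGTACGA"))
for p in (0.01, 0.10, 0.30, 0.50, 0.70):
    print(f"observed p = {p:.2f}, corrected distance = {jukes_cantor(p):.3f}")
```

For small p the two values nearly coincide; as p grows, the corrected distance rises much faster than the observed one, which is the underestimate described above.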
A comparison of the average percentage GC content of codons in
different bacteria.

Note which position of the codon triplets generally corresponds to the total G+C
content of the genome. Why?
Transition and transversion mutations

(A)  The possible transitions (blue) and transversions (red) between the four bases in DNA.
Note that there are twice as many ways of generating a transversion as a transition.
(B)  The observed numbers of transitions, transversions, and total mutations in an aligned
set of cytochrome c oxidase subunit 2 (COII) mitochondrial gene sequences from the
mammalian subfamily Bovinae.
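
The two-to-one ratio in (A) and the kind of tally shown in (B) can both be reproduced with a short classifier. This is a sketch with hypothetical sequences; the COII data from the figure are not reproduced here.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Classify a change between two different DNA bases."""
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_changes(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT" and a != b:   # skip gaps and ambiguity codes
            counts[classify(a, b)] += 1
    return counts


# Every possible change from one base to a different base.
ways = Counter(classify(a, b) for a in "ACGT" for b in "ACGT" if a != b)
print(ways)

# Hypothetical aligned pair, for illustration only.
print(count_changes("AACCGT", "GACTTT"))
```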
Influence of selective pressure on the observed frequency of
synonymous and non-synonymous mutations

[Figure: counting numbers of synonymous and non-synonymous mutations.]

The proportion of synonymous mutations in a protein-coding gene forms a baseline
against which the proportion of non-synonymous mutations can be measured, and
this ratio can be used to determine whether that gene has been subject to positive
selection, negative selection, or no selection.

Positive Selection:
If a protein is more effective as a result of a non-synonymous mutation, under
selective pressure the mutation is likely to be retained. Positive selection will
result in a greater number of non-synonymous mutations than expected.

Negative Selection:
If a mutation decreases the effective role of a protein,
under selective pressure it is likely to be lost. Negative
selection will result in fewer non-synonymous mutations
than expected, implying that change is being strongly
selected against; that is, the sequence is being
conserved.

Counted as 1.5 non-synonymous and 0.5 synonymous ‘ways’ to get to the observed
mutation.
Add these up for a whole alignment to obtain Sd and Nd (see book).
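
Fractional counts such as 1.5 and 0.5 arise when two codons differ at more than one position: the single-base steps can be taken in different orders, and the synonymous and non-synonymous steps are averaged over those orders. The sketch below illustrates that averaging with the standard genetic code. The codon pair TTT and GTA is an illustrative choice, not necessarily the pair in the figure, and as a simplification the sketch does not exclude pathways that pass through stop codons.

```python
from itertools import permutations, product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def path_counts(codon_a, codon_b):
    """Average (synonymous, non-synonymous) steps over all orders of the changes."""
    differing = [i for i in range(3) if codon_a[i] != codon_b[i]]
    if not differing:
        return 0.0, 0.0
    orders = list(permutations(differing))
    syn = nonsyn = 0
    for order in orders:
        current = codon_a
        for pos in order:
            # Apply one single-base change and compare the encoded amino acids.
            changed = current[:pos] + codon_b[pos] + current[pos + 1:]
            if CODON_TABLE[changed] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = changed
    return syn / len(orders), nonsyn / len(orders)


print(path_counts("TTT", "GTA"))
```

For TTT and GTA there are two pathways: TTT to GTT to GTA has one non-synonymous and one synonymous step, while TTT to TTA to GTA has two non-synonymous steps, giving 0.5 synonymous and 1.5 non-synonymous changes on average.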
Nine possible single base (point) mutations per codon

These mutational possibilities are used to calculate the number of
possible sites of non-synonymous (N) and synonymous (S) mutations.

In this case:
1st position: 3 of 3 changes are N, counts as one N site
2nd position: 3 of 3 changes are N, counts as one N site
3rd position: 3 of 3 changes are S, counts as one S site
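
The per-position bookkeeping above can be written out for any codon: each position contributes the fraction of its three possible changes that are synonymous to S, and the remaining fraction to N. The sketch below is a simplified illustration using the standard genetic code. The example codon GGT (glycine) is a hypothetical choice of a codon whose third position is fully synonymous, and changes to stop codons are simply counted as non-synonymous.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def site_counts(codon):
    """Return (S sites, N sites) for one codon; the two always sum to 3."""
    s_sites = 0.0
    for pos in range(3):
        synonymous = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                synonymous += 1
        s_sites += synonymous / 3      # fraction of the 3 changes at this position
    return s_sites, 3 - s_sites


print(site_counts("GGT"))   # third position fully synonymous
print(site_counts("ATG"))   # methionine: no synonymous changes
```

Summing these values over every codon in a sequence gives the totals of S and N sites against which the observed differences are compared.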

Why are these factors regarding synonymous and non-synonymous
mutations important?
1.  Informs the type of selective pressure (positive vs ‘purifying’)
2.  Provides signatures of historic duplication events
Genome applications of codon mutational divergence analysis

Estimated frequencies of synonymous substitutions

At least two periods of large-scale duplication events in the
Arabidopsis genome
Setting-up of a phylogenetic analysis:

1.  Perform a multiple sequence alignment (MSA). The quality of the MSA is a
critical determinant in the ultimate quality of the phylogenetic tree. Therefore,
every effort should be undertaken to use as much information as possible to
improve the alignment.
a)  Possible to use structural information to improve an automatic alignment
(e.g. ClustalW) and manually adjust.
b)  Some researchers remove blocks of sequence where the alignments are
ambiguous.

2.  To obtain information on DNA sequence evolution, convert the protein MSA to
an MSA of the corresponding DNA sequences, using the amino acid alignment
as the template for the DNA sequence alignment. The resultant DNA sequence MSA
can provide information regarding the likely selective forces (positive versus
negative selection).

3.  Use the MSA to generate a phylogenetic hypothesis using one of the many
phylogenetic tree generating programs currently available.
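
Step 2 can be sketched as "threading" each coding sequence onto its aligned protein: every residue in the protein alignment takes the next codon, and every gap becomes three gap characters. This is a minimal illustration of the idea, not the implementation used by any particular server; the function name and example sequences are hypothetical, and the sketch does not check that each codon actually encodes the residue it is paired with.

```python
def thread_codons(aligned_protein, coding_dna, gap="-"):
    """Build a codon-level alignment row from a gapped protein and its coding DNA."""
    if len(coding_dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    codons = [coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3)]
    n_residues = sum(ch != gap for ch in aligned_protein)
    if n_residues != len(codons):
        raise ValueError("protein and DNA lengths do not correspond")
    next_codon = iter(codons)
    # A gap column expands to three gap characters; a residue takes its codon.
    return "".join(gap * 3 if ch == gap else next(next_codon)
                   for ch in aligned_protein)


# Hypothetical two-sequence protein alignment and the matching coding sequences.
protein_alignment = {"seq1": "MK-L", "seq2": "MKAL"}
coding = {"seq1": "ATGAAACTG", "seq2": "ATGAAGGCTCTG"}

for name, aligned in protein_alignment.items():
    print(name, thread_codons(aligned, coding[name]))
```

Because gaps are always inserted in multiples of three, the reading frame is preserved and codon positions stay aligned across sequences.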
The evolutionary history of a gene that has undergone two separate
duplication events

(A) A species tree is depicted by the pale blue cylinders, with the branch points (nodes)
in the cylinders representing speciation events. In the ancestral species a gene is
present as a single copy and has function a (blue). At some time a gene duplication
event occurs within the genome, producing two identical gene copies, one of which
subsequently evolves a different function, identified as b (red).
