Evolutionary Change in DNA Sequences
The evolutionary process is a series of gene substitutions in which
new alleles, each arising as a mutation in a single individual,
progressively increase in frequency and ultimately become fixed in
the population.

We may look at the process from a
different point of view.

An allele that becomes fixed is different in its sequence from the
allele that it replaces. That is, the substitution of a new allele
for an old one is the substitution of a new sequence for a previous
sequence.

If we use a time scale in which one time unit is
larger than the time of fixation, then the DNA
sequence at any given locus will appear to
change with time.

actgggggtaaactatcggtatagatcataa
actgggggttaactatcggtatagatcataa
actgggggttaactatcggtatagatcataa
actgggggttaactatcggtatagatcataa
actgggggtgaactatcggtatagatcataa
actgggggtgaactatcggtacagatcataa

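The six snapshots above can be compared pairwise to locate where the sequence changed. The following is a minimal Python sketch; the names `snapshots` and `site_changes` are illustrative and do not come from the slides.

```python
# Six snapshots of the same locus, copied from the slide (oldest first).
snapshots = [
    "actgggggtaaactatcggtatagatcataa",
    "actgggggttaactatcggtatagatcataa",
    "actgggggttaactatcggtatagatcataa",
    "actgggggttaactatcggtatagatcataa",
    "actgggggtgaactatcggtatagatcataa",
    "actgggggtgaactatcggtacagatcataa",
]


def site_changes(old, new):
    """Return (0-based position, old base, new base) for every differing site."""
    if len(old) != len(new):
        raise ValueError("sequences must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


# Substitutions that appear between consecutive snapshots.
steps = [site_changes(a, b) for a, b in zip(snapshots, snapshots[1:])]

# Differences visible when only the first and last snapshots are compared.
endpoints = site_changes(snapshots[0], snapshots[-1])
```

Summing the per-step changes gives three substitutions, while comparing only the first and last snapshots shows two differences, because one site changed twice.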
To study the dynamics of nucleotide substitution, we must make
several assumptions regarding the probability of substitution of
one nucleotide by another.

Jukes & Cantor's one-parameter model

Assumption:
Substitutions occur with equal probabilities
among the four nucleotide types.

If the nucleotide residing at a certain site in a DNA sequence is A
at time 0, what is the probability, PA(t), that this site will be
occupied by A at time t?

Since we start with A, PA(0) = 1. At time 1, the probability of
still having A at this site is

$$P_{A(1)} = 1 - 3\alpha$$

where α is the probability per unit time of A changing to each one
of T, C, or G, so that 3α is the probability of A changing to T, C,
or G, and 1 − 3α is the probability that A has remained unchanged.

To derive the probability of having A at time 2, we consider two
possible scenarios:

1. The nucleotide has remained unchanged from time 0 to time 2.

2. The nucleotide has changed to T, C, or G at time 1, but has
subsequently reverted to A at time 2.

$$P_{A(2)} = (1 - 3\alpha)P_{A(1)} + \alpha\left[1 - P_{A(1)}\right]$$

The following recurrence applies to any two consecutive time
points t and t + 1:

$$P_{A(t+1)} = (1 - 3\alpha)P_{A(t)} + \alpha\left[1 - P_{A(t)}\right]$$

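The recurrence can be iterated directly to see how PA(t) moves away from its starting value. This is a minimal Python sketch; the function name `iterate_pa` and the value α = 0.01 are illustrative assumptions, not values from the slides.

```python
def iterate_pa(alpha, steps, pa0=1.0):
    """Iterate P_A(t+1) = (1 - 3*alpha) * P_A(t) + alpha * (1 - P_A(t)).

    Returns the list [P_A(0), P_A(1), ..., P_A(steps)].
    """
    pa = pa0
    history = [pa]
    for _ in range(steps):
        pa = (1 - 3 * alpha) * pa + alpha * (1 - pa)
        history.append(pa)
    return history


# Hypothetical substitution probability per time unit; start with A (P_A(0) = 1).
history = iterate_pa(alpha=0.01, steps=1000)
```

With these assumed inputs the values decrease from 1 toward 1/4, the equilibrium value of the model.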
We can rewrite the equation in terms of the amount of change in
PA(t) per unit time as:

$$\Delta P_{A(t)} = P_{A(t+1)} - P_{A(t)} = -3\alpha P_{A(t)} + \alpha\left[1 - P_{A(t)}\right] = -4\alpha P_{A(t)} + \alpha$$

We approximate the discrete-time process by a continuous-time
model, by regarding ΔPA(t) as the rate of change at time t.

$$\frac{dP_{A(t)}}{dt} = -4\alpha P_{A(t)} + \alpha$$

The solution is:

$$P_{A(t)} = \frac{1}{4} + \left(P_{A(0)} - \frac{1}{4}\right)e^{-4\alpha t}$$

If we start with A, the probability that the site has A at time 0
is 1. Thus, PA(0) = 1, and consequently,

$$P_{A(t)} = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}$$

If we start with non-A, the probability that the site has A at
time 0 is 0. Thus, PA(0) = 0, and consequently,

$$P_{A(t)} = \frac{1}{4} - \frac{1}{4}e^{-4\alpha t}$$

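The closed-form solution can be checked numerically against the discrete-time recurrence it approximates. The sketch below is in Python; the function names and the small value of α used in the check are assumptions made for illustration.

```python
import math


def pa_closed_form(alpha, t, pa0=1.0):
    """Continuous-time solution: P_A(t) = 1/4 + (P_A(0) - 1/4) * exp(-4*alpha*t)."""
    return 0.25 + (pa0 - 0.25) * math.exp(-4 * alpha * t)


def pa_recurrence(alpha, t, pa0=1.0):
    """Discrete-time value after t whole time units of the one-step recurrence."""
    pa = pa0
    for _ in range(t):
        pa = (1 - 3 * alpha) * pa + alpha * (1 - pa)
    return pa


# Hypothetical rate; the two models agree closely when alpha is small.
alpha = 1e-4
gap = abs(pa_closed_form(alpha, 1000) - pa_recurrence(alpha, 1000))
```

The continuous-time model is an approximation, so `gap` is small but not zero; it shrinks as α per time unit gets smaller.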
In the Jukes and Cantor model, the probability of each of the four
nucleotides at equilibrium (t = ∞) is 1/4.

$$P_{A(0)} = 1:\quad P_{A(t)} = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}$$

$$P_{A(0)} = 0:\quad P_{A(t)} = \frac{1}{4} - \frac{1}{4}e^{-4\alpha t}$$

So far, we treated PA(t) as a probability.

However, PA(t) can also be interpreted as the frequency of A in a
DNA sequence at time t.

For example, if we start with a sequence made of adenines only,
then PA(0) = 1, and PA(t) is the expected frequency of A in the
sequence at time t.

The expected frequency of A in the sequence at equilibrium will be
1/4, and so will the expected frequencies of T, C, and G.

After reaching equilibrium no further
change in the nucleotide frequencies
is expected to occur.
However, the actual frequencies of the
nucleotides will remain unchanged
only in DNA sequences of infinite
length.
In practice, fluctuations in nucleotide
frequencies are likely to occur.
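The effect of sequence length on these fluctuations can be illustrated by sampling. The Python sketch below does not simulate the substitution process itself; it assumes that at equilibrium every site is independently A, C, G, or T with probability 1/4, and draws random sequences of two lengths. The function name and the fixed seed are illustrative choices.

```python
import random
from collections import Counter


def equilibrium_frequencies(length, seed=0):
    """Sample one equilibrium sequence and return its nucleotide frequencies.

    Assumption: sites are independent and each base has probability 1/4.
    """
    rng = random.Random(seed)
    seq = rng.choices("ACGT", k=length)
    counts = Counter(seq)
    return {base: counts[base] / length for base in "ACGT"}


freq_short = equilibrium_frequencies(100)       # short sequence: noisy frequencies
freq_long = equilibrium_frequencies(100_000)    # long sequence: close to 1/4 each
```

In a sketch like this the long sequence stays close to 1/4 for every base, while a short sequence can deviate noticeably.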

NUMBER OF NUCLEOTIDE SUBSTITUTIONS BETWEEN TWO DNA SEQUENCES

After two nucleotide sequences diverge from
each other, each of them will start accumulating
nucleotide substitutions.

If two sequences of length N differ from each other at n sites,
then the proportion of differences, n/N, is referred to as the
degree of divergence, or the Hamming distance per site.

Degrees of divergence are usually expressed as percentages
(n/N × 100%).

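The degree of divergence is a direct site-by-site count. The Python sketch below assumes the two sequences are already aligned, have equal length, and contain no gaps; the function name and the example sequences are made up for illustration.

```python
def degree_of_divergence(seq1, seq2):
    """Return n/N, the proportion of sites at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    n = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return n / len(seq1)


# Hypothetical aligned sequences differing at two of ten sites.
d = degree_of_divergence("ACGTACGTAC", "ACGTTCGTAA")
percent = d * 100
```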
The observed number of differences is likely to be smaller than the
actual number of substitutions due to multiple hits at the same
site.

[Figure: 13 mutations = 3 differences]

Number of substitutions between two noncoding (NOT protein coding) sequences

The one-parameter model

In this model, it is sufficient to consider only I(t), which is the
probability that the nucleotide at a given site at time t is the
same in both sequences.

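$$I_{(t)} = \frac{1}{4} + \frac{3}{4}e^{-8\alpha t}$$

where I(t) is the proportion of identical nucleotides between two
sequences that diverged t time units ago. Compare this with the
result for a single sequence:

$$P_{A(t)} = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}$$

The exponent is doubled because substitutions accumulate along both
lineages, so the two sequences are separated by a total of 2t time
units.
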
The probability that the two sequences are different at a site at
time t is p = 1 – I(t).

$$p = \frac{3}{4}\left(1 - e^{-8\alpha t}\right)$$

t is usually not known and, thus, we cannot estimate α. Instead, we
compute K, which is the number of substitutions per site since the
time of divergence between the two sequences.

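In each of the two lineages a site accumulates substitutions at
rate 3α, so K = 2(3αt) = 6αt. Solving
$p = \frac{3}{4}\left(1 - e^{-8\alpha t}\right)$ for 8αt gives

$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$

L = number of sites compared between the two sequences; the
sampling variance of K decreases as L grows.
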
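A short Python sketch of the one-parameter distance follows. The expression K = −(3/4) ln(1 − 4p/3) follows from the equation for p together with K = 6αt. The variance expression used here, p(1 − p) / [L(1 − 4p/3)²], is the commonly quoted large-sample approximation; it is included as an assumption because the slide that presumably showed it did not survive extraction. All function names are illustrative.

```python
import math


def jc_p(alpha, t):
    """Expected proportion of differing sites: p = 3/4 * (1 - exp(-8*alpha*t))."""
    return 0.75 * (1 - math.exp(-8 * alpha * t))


def jc_distance(p):
    """Substitutions per site, K = -3/4 * ln(1 - 4p/3); defined only for p < 3/4."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def jc_variance(p, L):
    """Approximate sampling variance of K for L compared sites (assumed formula)."""
    return p * (1 - p) / (L * (1 - 4 * p / 3) ** 2)


# Round trip with hypothetical values: K should recover 6 * alpha * t.
k = jc_distance(jc_p(alpha=1e-3, t=50))
```

Because of multiple hits, K is always at least as large as the observed proportion p, and the gap widens as p approaches 3/4.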
Kimura's two-parameter model

Assumptions:
• The rate of transitional substitution at each nucleotide site is
α per unit time.
• The rate of each type of transversional substitution is β per
unit time.

α ⁄ β ≈ 5 - 10

If the nucleotide residing at a
certain site in a DNA sequence is
A at time 0, what is the
probability, PA(t), that this site will
be occupied by A at time t?

After one time unit the probability of A changing into G is α, the
probability of A changing into C is β, and the probability of A
changing into T is β. Thus, the probability of A remaining
unchanged after one time unit is:

$$P_{AA(1)} = 1 - \alpha - 2\beta$$

To derive the probability of having A at time 2, we consider four
possible scenarios:

1. A remained unchanged at t = 1 and t = 2

2. A changed into G at t = 1 and reverted by a transition to A at
t = 2

3. A changed into C at t = 1 and reverted by a transversion to A at
t = 2

4. A changed into T at t = 1 and reverted by a transversion to A at
t = 2

$$P_{AA(2)} = (1 - \alpha - 2\beta)P_{AA(1)} + \beta P_{TA(1)} + \beta P_{CA(1)} + \alpha P_{GA(1)}$$

By extension we obtain the following recurrence equation for the
general case:

$$P_{AA(t+1)} = (1 - \alpha - 2\beta)P_{AA(t)} + \beta P_{TA(t)} + \beta P_{CA(t)} + \alpha P_{GA(t)}$$

After rewriting this equation as the amount of change in PAA(t) per
unit time, and after approximating the discrete-time model by the
continuous-time model, we obtain the following differential
equation:

$$\frac{dP_{AA(t)}}{dt} = -(\alpha + 2\beta)P_{AA(t)} + \beta P_{TA(t)} + \beta P_{CA(t)} + \alpha P_{GA(t)}$$

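Similarly, we can obtain equations for PTA(t), PCA(t), and PGA(t),
and from this set of four equations, we arrive at the following
solution:

$$P_{AA(t)} = \frac{1}{4} + \frac{1}{4}e^{-4\beta t} + \frac{1}{2}e^{-2(\alpha + \beta)t}$$

For comparison, the one-parameter solution, which this expression
reduces to when α = β, is

$$P_{A(t)} = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}$$
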
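The closed-form PAA(t) can be checked by integrating the four coupled differential equations numerically. The Python sketch below uses a plain Euler scheme; the function names, the step count, and the rates α = 0.3 and β = 0.1 used in the check are illustrative assumptions.

```python
import math


def k2p_paa(alpha, beta, t):
    """Closed-form P_AA(t) under the two-parameter model."""
    return (0.25 + 0.25 * math.exp(-4 * beta * t)
            + 0.5 * math.exp(-2 * (alpha + beta) * t))


def integrate_from_a(alpha, beta, t, steps=20000):
    """Euler-integrate the four coupled equations, starting with A at time 0.

    Returns the probabilities of [A, G, C, T] at time t.
    """
    dt = t / steps
    partner = [1, 0, 3, 2]  # state order A, G, C, T; transitions pair A<->G, C<->T
    p = [1.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        total = sum(p)
        new_p = []
        for i in range(4):
            transition_src = p[partner[i]]
            transversion_src = total - p[i] - transition_src
            rate = (-(alpha + 2 * beta) * p[i]
                    + alpha * transition_src
                    + beta * transversion_src)
            new_p.append(p[i] + dt * rate)
        p = new_p
    return p


# Hypothetical rates and time span for the comparison.
numeric = integrate_from_a(0.3, 0.1, 2.0)
exact = k2p_paa(0.3, 0.1, 2.0)
```

Euler integration is only first-order accurate, so `numeric[0]` matches `exact` up to a small discretization error rather than exactly.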
In the Jukes-Cantor model:

PAA(t) = PGG(t) = PCC(t) = PTT(t)

Because of the symmetry of the substitution scheme, this equality
also holds for Kimura's two-parameter model.

3 probabilities

X(t) = the probability that a nucleotide at a site at time t is
identical to that at time 0

$$X_{(t)} = \frac{1}{4} + \frac{1}{4}e^{-4\beta t} + \frac{1}{2}e^{-2(\alpha + \beta)t}$$

At equilibrium, the equation reduces to X(∞) = 1/4. Thus, as in the
case of Jukes and Cantor's model, the equilibrium frequencies of
the four nucleotides are 1/4.

Y(t) = the probability that the initial nucleotide and the
nucleotide at time t differ from each other by a transition.

Because of the symmetry of the substitution scheme,
Y(t) = PAG(t) = PGA(t) = PTC(t) = PCT(t).

$$Y_{(t)} = \frac{1}{4} + \frac{1}{4}e^{-4\beta t} - \frac{1}{2}e^{-2(\alpha + \beta)t}$$

Z(t) = the probability that the nucleotide at time t and the
initial nucleotide differ by a specific type of transversion:

$$Z_{(t)} = \frac{1}{4} - \frac{1}{4}e^{-4\beta t}$$

Each nucleotide is subject to two types of transversion, but only
one type of transition. Therefore, the probability that the initial
nucleotide and the nucleotide at time t differ by a transversion is
2Z(t), twice the probability of a specific type of transversion.

X(t) + Y(t) + 2Z(t) = 1

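The three probabilities and their sum can be evaluated directly. This is a minimal Python sketch; the function name `k2p_probabilities` is illustrative, and the formulas are the ones given for X(t), Y(t), and Z(t).

```python
import math


def k2p_probabilities(alpha, beta, t):
    """Return (X, Y, Z) under the two-parameter model.

    X: same nucleotide as at time 0
    Y: differs from the initial nucleotide by a transition
    Z: differs by one specific type of transversion
    """
    e1 = math.exp(-4 * beta * t)
    e2 = math.exp(-2 * (alpha + beta) * t)
    x = 0.25 + 0.25 * e1 + 0.5 * e2
    y = 0.25 + 0.25 * e1 - 0.5 * e2
    z = 0.25 - 0.25 * e1
    return x, y, z


# Hypothetical rates with transitions five times as frequent as each transversion.
x, y, z = k2p_probabilities(alpha=0.05, beta=0.01, t=10)
total = x + y + 2 * z
```

Setting α = β makes Y(t) equal to Z(t) and reduces X(t) to the one-parameter result 1/4 + (3/4)e^(−4αt).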
Number of substitutions between two noncoding (NOT protein coding) sequences

The differences between two
sequences are classified into
transitions and transversions.

P = proportion of transitional differences

Q = proportion of transversional differences
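P and Q can be counted site by site once each difference is classified. The Python sketch below assumes aligned, gap-free sequences of equal length containing only A, C, G, and T (any other character in a mismatch would be counted as a transversion). The function name and example sequences are made up for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_proportions(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        pair = {a, b}
        # A transition stays within purines (A<->G) or within pyrimidines (C<->T).
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    return transitions / n, transversions / n


# Hypothetical pair: one transition (A/G) and two transversions (T/A, A/C).
P, Q = transition_transversion_proportions("AAGCTTCGAA", "AGGCTACGAC")
```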

$$V(K) = \frac{1}{L}\left[\left(\frac{1}{1 - 2P - Q}\right)^{2}P + \left(\frac{1}{2 - 4P - 2Q} + \frac{1}{2 - 4Q}\right)^{2}Q - \left(\frac{P}{1 - 2P - Q} + \frac{Q}{2 - 4P - 2Q} + \frac{Q}{2 - 4Q}\right)^{2}\right]$$

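A Python sketch of the two-parameter estimate follows. The variance follows the V(K) expression above. The distance formula K = (1/2) ln[1/(1 − 2P − Q)] + (1/4) ln[1/(1 − 2Q)] is the standard two-parameter estimator; it is supplied here as an assumption, since the slide that presumably showed it did not survive extraction. Function names are illustrative.

```python
import math


def _k2p_terms(P, Q):
    """Return a = 1/(1 - 2P - Q) and b = 1/(1 - 2Q), checking the valid range."""
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("P and Q are too large for the two-parameter correction")
    return 1 / (1 - 2 * P - Q), 1 / (1 - 2 * Q)


def k2p_distance(P, Q):
    """Substitutions per site: K = 1/2 * ln(a) + 1/4 * ln(b) (assumed estimator)."""
    a, b = _k2p_terms(P, Q)
    return 0.5 * math.log(a) + 0.25 * math.log(b)


def k2p_variance(P, Q, L):
    """Sampling variance V(K) for L compared sites, as in the formula above."""
    a, b = _k2p_terms(P, Q)
    c = (a + b) / 2  # equals 1/(2 - 4P - 2Q) + 1/(2 - 4Q)
    return (a * a * P + c * c * Q - (a * P + c * Q) ** 2) / L


# Hypothetical observed proportions from 1000 compared sites.
K = k2p_distance(0.10, 0.05)
V = k2p_variance(0.10, 0.05, 1000)
```

The square root of `V` gives an approximate standard error for `K`.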
Numerical example (2P-model)

There are substitution
schemes with more
than two parameters!

THANK YOU
