Evolution of Long-Range Fractal Correlations and 1/f Noise in DNA Base Sequences
Richard F. Voss
IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598
(Received 18 February 1992)
A new method of quantifying correlations in symbolic sequences is applied to DNA nucleotides. Spectral density measurements of individual base positions demonstrate the ubiquity of low-frequency $1/f^\beta$ noise and long-range fractal correlations as well as prominent short-range periodicities. Ensemble averages over classifications in the GenBank databank (primate, invertebrate, plant, etc.) show systematic changes in spectral exponent $\beta$ with evolutionary category.
A wide variety of physical systems show power-law correlations in space (fractals) [1-3] or time ($1/f$ noises) [4,5]. The characterization and understanding of these correlations has become an important and surprisingly difficult problem. Although diffusion limited aggregation [2] and self-organizing criticality [6] are based on simple physical models, a detailed prediction of their power-law correlations has proven elusive. The measurements reported here demonstrate that similar power laws have evolved in DNA.

Numerous attempts have been made to characterize and graphically portray the genetic information stored in DNA nucleotide sequences [7-12]. One popular method maps the four individual bases to steps in a variant of a random walk. Unfortunately, when the number of spatial dimensions is less than 4, the mapping itself introduces correlations between the bases. Nevertheless, recent results [11] on a few sequences have shown long-range fractal or scaling correlations. Here, a new method of characterizing arbitrary symbolic sequences is described that does not introduce correlations between symbols. When applied to DNA sequences it confirms the ubiquity of low-frequency $1/f^\beta$ noise and the corresponding long-range fractal correlations as well as prominent short-range periodicities. Moreover, averages over > 5 × 10^[exponent illegible] bases from > 25000 sequences in 10 classifications (primate, invertebrate, plant, etc.) of the GenBank [13] databank show systematic changes in correlations and spectral exponent $\beta$ with evolutionary category.

There are a number of accepted techniques [14] for characterizing random processes $x(t)$. Both the correlation function $C_x(\tau) = \langle x(t)\,x(t+\tau) \rangle$ and the spectral density $S_x(f) \propto \left| \int x(t)\, e^{2\pi i f t}\, dt \right|^2 / \Delta f$ quantify correlations at the time scale $\tau \approx 1/f$. Generally, $S_x(f)$ and $C_x(\tau)$ are connected by the Wiener-Khintchine relations [14], $S_x(f) \propto \int C_x(\tau) \cos(2\pi f \tau)\, d\tau$ and $C_x(\tau) \propto \int S_x(f) \cos(2\pi f \tau)\, df$. Fast Fourier transform (FFT) algorithms allow efficient direct computation of $S_x(f)$ from sample sequences and the estimation of $C_x(\tau)$ from $S_x(f)$.

A true random process or white noise $w(t)$ has no correlations in time: $C_w(\tau) \propto \delta(\tau)$ and $S_w(f) \propto \mathrm{const}$. Its integral $X(t) = \int w(t)\, dt$ produces a Brownian motion or random walk with $S_X(f) \propto 1/f^2$. The average distance traveled in a time $T$ is $\Delta X(T) = \langle |X(t+T) - X(t)|^2 \rangle^{1/2} \propto T^{1/2}$. Like $w(t)$, the future changes of $X(t)$ are independent of its past.

Many naturally occurring fluctuations, however, from electronic voltages and time standards to meteorological, biological, traffic, economic, and musical quantities [1,4,5,15-18] have nontrivial correlations that extend over all measured time scales. These $1/f$ noises exhibit $S(f) \propto 1/f^\beta$ with $\beta \approx 1$ over many decades. In most cases, the physical reason for this long-range power-law behavior remains a mystery.

The most effective mathematical model for such scaling noises has been fractional Brownian motion (fBm) [19,20] as the integral of fractional Gaussian noises $x_H(t)$. An fBm process $X_H(t) = \int x_H(t)\, dt$ is specified by $0 < H < 1$ such that

$$\Delta X_H(T) = \langle |X_H(t+T) - X_H(t)|^2 \rangle^{1/2} \propto T^H,$$

and a corresponding

$$S_x(f) \propto 1/f^\beta, \quad \text{where } \beta = 2H - 1,$$

for the increments $x_H(t)$. Thus, as $H \to 0$ the fBm $X_H(t) \to 1/f$ noise and $x_H(t) \to f$ noise. As $H \to 1$, $X_H(t) \to 1/f^3$ noise and $x_H(t) \to 1/f$ noise. $H = 1/2$ gives normal Brownian motion.

A definition for the product $x(t)\,x(t+\tau)$ is central to $C(\tau)$, $S(f)$, and $\Delta X(T)$. For non-numeric sequences $x_1, \ldots, x_N$ composed of $K$ allowed symbols, care must be taken to define procedures that do not make a priori assumptions about relations between the symbols. Assigning a number to each symbol, or a walk displacement [8-12] along a direction in a $D$-dimensional space with $D < K$, introduces numeric correlations.

Such problems may be avoided with equal-symbol multiplication, defined as $x_n x_m = 1$ if $x_n = x_m$, and 0 otherwise. A binary indicator function or projection operator $U_k$ selects the elements of $x_n$ that are equal to symbol $k$: $U_k[x_n] = 1$ if $x_n = k$, and 0 otherwise. Consequently, $x_n x_m = \sum_k U_k[x_n]\, U_k[x_m]$ and

$$C(\tau) = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} U_k[x_n]\, U_k[x_{n+\tau}] = \sum_{k=1}^{K} C_k(\tau).$$
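The indicator-function decomposition and the equal-symbol correlation $C(\tau) = \sum_k C_k(\tau)$ can be illustrated with a minimal Python sketch. The function names, the toy sequence, and the choice to truncate the sum at the end of the sequence (rather than wrap around) are assumptions made for illustration; the source does not specify how the boundary is handled.

```python
import numpy as np

def indicator_sequences(seq, alphabet):
    """Map each symbol k to its binary indicator array U_k[x_n] (hypothetical helper)."""
    arr = np.array(list(seq))
    return {k: (arr == k).astype(float) for k in alphabet}

def equal_symbol_correlation(seq, alphabet, max_lag):
    """Return {k: C_k(tau)} for tau = 0..max_lag.

    C_k(tau) = (1/N) * sum_n U_k[x_n] * U_k[x_{n+tau}].
    Assumption: terms with n + tau beyond the sequence end are dropped.
    """
    n = len(seq)
    result = {}
    for k, u in indicator_sequences(seq, alphabet).items():
        result[k] = np.array(
            [np.dot(u[: n - tau], u[tau:]) / n for tau in range(max_lag + 1)]
        )
    return result

# Toy sequence with period 4, chosen only to make the result easy to check by hand.
toy = "ACGTACGTACGT"
c_k = equal_symbol_correlation(toy, "ACGT", max_lag=4)
c_total = sum(c_k.values())  # C(tau) = sum over k of C_k(tau)
```

No numeric value is assigned to any symbol, so no ordering or distance between A, C, G, and T is implied; each symbol contributes only through matches with itself.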
1992 The American Physical Society 3805
VOLUME 68, NUMBER 25 PHYSICAL REVIEW LETTERS 22 JUNE 1992
Here $C_k(\tau)$, the equal-symbol correlation function for symbol $k$, is the probability that $x_{n+\tau} = k$ if $x_n = k$. Similarly,

$$S(f) = \sum_{k=1}^{K} S_k(f),$$

where $S_k(f)$ is the equal-symbol spectral density of symbol $k$.

Thus, a symbolic sequence may be decomposed into the $K$ binary sequences $U_k[x_n]$ which identify the positions of symbol $k$. Fourier transforms of $U_k[x_n]$ yield the frequency components $U_k(f)$, the individual $S_k(f) \propto |U_k(f)|^2$ and $C_k(\tau)$, as well as $S_\Sigma(f) = \sum_k S_k(f)$ and $C_\Sigma(\tau) = \sum_k C_k(\tau)$. Cross-correlation spectra, $S_{j,k}(f) \propto U_j(f)\, U_k^*(f)$, between different symbols $j$ and $k$ may also be calculated.

This method is illustrated with the 229354 bases in the complete genome [21] of the human Cytomegalovirus strain AD 169 (GenBank [13] accession number X17403). Figure 1 shows the four indicator functions $U_k[x_n]$ ($k$ = A, C, G, T) at different levels of averaging. This plot graphically reveals many positive and negative correlations between different bases and illustrates the difficulties with low-$D$ walks. For some averaging scales and positions (e.g., around 160000) the pyrimidines (C, T) are strongly anticorrelated, around 100000 the correlation is near 0, while near 115000 the correlation is positive. At all scales, however, the $U_k[x_n]$ are extremely irregular and there is little change as the averaging is increased from 10 to 100 and 1000 sites.

This independence of scale is confirmed by the measured equal-symbol $S_k(f)$ shown at the top of Fig. 2. The sequence was divided into independent sections of size $N_{FFT}$ = 2^[exponent illegible] for FFT transforms and the results from individual sections were averaged together. Although there are small differences, each of the $S_k(f)$ has the same overall behavior. At large $f$ (small $\tau$) $S_k(f)$ is dominated by the white noise of an uncorrelated random process. Within the white noise there is a sharp peak at $f = 1/3$ (period $\tau = 3$) that is probably related to the nucleotide triplet (or codon) that specifies one of 20 amino acids [7]. At low $f$ (large $\tau$), however, $S_k(f)$ shows power-law scaling similar to $1/f$ noise. For comparison and as a verification of the method, a superposition of the
[FIG. 1 and FIG. 2: plots not reproduced. Legible labels: traces for A, T, C, G; a peak marked 1/3; "white noise limit"; "digits of π"; values T 0.77, C 0.83, G 0.92.]
$S_k(f)$ for the first 1.13 × 10^[exponent illegible] decimal digits of $\pi$ [22] is also shown. This symbolic sequence shows no long-range correlation.

In many cases $1/f^\beta$ noises can be studied over larger ranges by subtracting the high-$f$ white-noise limit [23], $\Delta S(f) = S(f) - S(\infty)$, under the assumption that the $1/f^\beta$ and white noises are independent. The subtraction is also illustrated in Fig. 2, where the $\Delta S(f) \propto 1/f^\beta$ scaling clearly extends from around 100 bases to 100000 bases. Differences in the exponents $\beta_k$ are visible. For this sample, A is most correlated with T and C with G at low $f$. A similar separation of the short-range and scaling components is much more difficult with $\Delta X(T)$ analysis [11].

The limited accuracy of estimating $\beta$ from individual sequences can be greatly improved by averaging over the large number of DNA sequences in the GenBank databank. The GenBank classification (mammal, invertebrate, plant, etc.), moreover, allows examination of statistical changes with evolutionary category.

Figure 3 shows log-log plots of the equal-symbol $\Delta S_\Sigma(f) = \sum_k \Delta S_k(f)$ averaged over specific categories in the GenBank Release 68 database with $N_{FFT} = 2048$. Sequences $> 2N_{FFT}$ contributed multiple samples. The category average included the 6610 sequences $> N_{FFT}$ and significantly reduced the scatter about $1/f^\beta$.

The category average also clarifies the small-period (high-$f$) variations. Figure 4 shows $S_\Sigma(\tau = 1/f)$ versus period $\tau = 1/f$ for $N_{FFT} = 512$ to emphasize short-range structure. The large narrow codon peak at period 3 is the most prominent feature, but the increased averaging allows other small structures to be visible. For example, the peak at period 9 for primates, vertebrates, and inver-

[FIG. 3 plot not reproduced. Horizontal axis: frequency, f (base^-1); the peak at f = 1/3 is marked. Tabulated exponents:]

category        β
primate         0.77
rodent          0.81
mammal          0.84
vertebrate      0.87
invertebrate    1.00
plant           0.86
virus           0.82
organelle       0.71
bacteria        1.16
phage           1.02

FIG. 3. Equal-symbol $\Delta S_\Sigma(f)$ for categories of DNA sequences from the GenBank Release 68 databank. Least-squares estimates of $\beta$ are tabulated and the resulting fits are shown as solid lines. $\Delta S_\Sigma(f)$ offset for clarity.
[FIG. 4 plot not reproduced. Horizontal axis: period, τ = 1/f, from 2 to 15. Tabulated sequence counts:]

category        N_seq
primate         4926
rodent          4512
mammal          1243
vertebrate      1246
invertebrate    2196
plant           2990
bacteria        4008
virus           2748
organelle       824
phage           324

FIG. 4. Equal-symbol $S_\Sigma(f) = \sum_k S_k(f)$ vs period $\tau = 1/f$ for specific categories of DNA base sequences from the GenBank Release 68 database emphasizes short-range correlations. In this linear plot successive $S_\Sigma(f)$ are offset for clarity and $N_{seq}$ is tabulated for each category.
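The spectral procedure described in the text (Fourier transforming each indicator sequence in sections of $N_{FFT}$ bases, averaging the sections, subtracting the high-frequency white-noise limit, and fitting $\beta$ by least squares on log-log axes) can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: the function names, the normalization by the section length, the absence of a window function, and the estimate of $S(\infty)$ as the mean of the spectrum above a cutoff `white_band` are all choices made here. The white-noise estimate in particular would be biased by the codon peak at $f = 1/3$ on real coding DNA.

```python
import numpy as np

def equal_symbol_spectrum(seq, alphabet="ACGT", n_fft=2048):
    """Section-averaged S_Sigma(f) = sum_k S_k(f), with S_k(f) ~ |U_k(f)|^2.

    Assumptions: non-overlapping sections, no windowing, power divided by n_fft.
    """
    arr = np.array(list(seq))
    n_sections = len(arr) // n_fft
    if n_sections == 0:
        raise ValueError("sequence is shorter than one FFT section")
    arr = arr[: n_sections * n_fft]
    freqs = np.fft.rfftfreq(n_fft, d=1.0)  # cycles per base
    total = np.zeros(freqs.size)
    for k in alphabet:
        u = (arr == k).astype(float).reshape(n_sections, n_fft)
        power = np.abs(np.fft.rfft(u, axis=1)) ** 2 / n_fft
        total += power.mean(axis=0)  # average over sections
    return freqs[1:], total[1:]  # drop the f = 0 (mean) component

def fit_beta(freqs, spectrum, f_max=0.01, white_band=0.25):
    """Least-squares slope of log(Delta S) vs log(f) for f <= f_max.

    Assumption: S(infinity) is approximated by the mean of S(f) for f >= white_band.
    """
    white = spectrum[freqs >= white_band].mean()
    delta = spectrum - white
    mask = (freqs <= f_max) & (delta > 0)
    slope, _intercept = np.polyfit(np.log(freqs[mask]), np.log(delta[mask]), 1)
    return -slope

# Usage on an uncorrelated random sequence (expected to look like white noise).
rng = np.random.default_rng(0)
random_seq = "".join(rng.choice(list("ACGT"), size=2048 * 16))
f, s = equal_symbol_spectrum(random_seq)
```

For an uncorrelated sequence with equally likely bases, each indicator has variance 3/16, so the summed spectrum should be flat near 0.75 under the normalization assumed here; a real DNA sequence would instead be expected to rise above that level at low $f$.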