
A low-complexity distance for DNA strings

Liviu P. DINU, Andrea SGARRO


University of Bucharest, Faculty of Mathematics and Computer Science,
Academiei 14, Bucharest, Romania
Universita degli Studi di Trieste,
Via Valerio, Trieste, Italy
E-mail: ldinu@funinf.cs.unibuc.ro, sgarro@units.it

Abstract
We exhibit a low-complexity but non-trivial distance for strings to be
used in biology.

1 Introduction
In general, when a new DNA sequence is given, the first step taken by a
biologist would be to compare the new sequence with sequences that are already
well studied and annotated. Sequences that are similar would probably have
the same function, or, if two sequences from different organisms are similar,
there may be a common ancestor sequence (this kind of relation has important
implications in speciation study and phylogenetic analysis).
One of the most widely used methods for sequence comparison is sequence
alignment. Sequence alignment is the procedure of comparing two (pairwise
alignment) or more (multiple sequence alignment) sequences by searching for a
series of individual characters or character patterns that are in the same
order in the sequences. The standard pairwise alignment method is based on
dynamic programming [Smith and Waterman 1981]. The method compares every pair
of characters of the two sequences and generates an alignment and a score,
which depends on the scoring scheme used (i.e., a scoring matrix for the
different base-pair combinations, match and mismatch scores, or a scheme for
insertion or deletion (gap) penalties).
Although dynamic programming for sequence alignment is mathematically
optimal, it is far too slow for comparing a large number of bases. A typical
DNA database today contains billions of bases, and the number is still
increasing rapidly. To enable sequence search and comparison to be performed
in a reasonable time, fast heuristic local alignment algorithms have been
developed (e.g. BLAST, freely available at http://www.ncbi.nlm.nih.gov/BLAST).
DNA sequence comparison problems are described in many scientific papers
as a kind of string alignment and matching problem. However, comparison
and alignment of biological sequences differs from the general string
matching problem in computer science in at least two respects. First, the
number of bases to be compared can be huge. For example, just for the human
genome, the number of nucleotide bases is around $3 \times 10^9$. This places
a huge demand on the computational efficiency and speed of the processing
algorithms. Secondly, one should have some knowledge about the biological
nature of the problem to be solved. A result that is mathematically sound may
be highly implausible and might not reflect what is known about the
biological process. For example, consider the problem of aligning two
protein-coding DNA sequences. It is not very sensible to align the DNA
sequences of protein-coding genes directly. Instead, it is much more sensible
to translate the sequences to their corresponding amino acid sequences and
then put the gaps into the DNA sequence alignment according to where they are
found in the amino acid alignment. To illustrate, consider the alignment of
two protein-coding DNA sequences ATGCTGTTAGGG and ATGCTCGTAGGG. An alignment
algorithm might give the solution below as the preferred alignment:
ATGCTGTTAGGG
ATGCTCGTAGGG

However, the alternative alignment below, although less optimal
mathematically, may be much more plausible biologically:

ATGCT-GTTAGGG
ATGCTCGT-AGGG

On the other hand, inaccurate alignments are considered to be one of the most
significant sources of error in molecular phylogenies (Lake and Moore, 1998;
Thorne, 2000). Furthermore, once an acceptable alignment is obtained, a high
fraction of the original sequence information is sometimes discarded, making
the estimation of similarity highly restricted and only relevant to one or
just a few selected domains [G. Stuart, K. Moffett, S. Baker. Integrated gene
and species phylogenies from unaligned whole genome protein sequences.
Bioinformatics, 18, 100-108, 2002].
The above motivations, the constant appearance of papers dedicated to the
problem of DNA sequence analysis [...], and the ranking of this problem in
first position in two lists of major open problems in bioinformatics (Koonin,
1999; Wooley, 1999) are reason enough to say that DNA sequence comparison is
an exciting problem which is waiting for new approaches.
We propose in this paper a low-complexity but non-trivial distance for strings
to be used in biology. Our method is easily implemented, does not use the
alignment principle, and has good computational behavior (i.e. linear time
complexity for the distance and polynomial time for the median string problem;
we recall that the median string problem is NP-hard, i.e. intractable, in the
case of the edit distance). Also, we can say that our method gives the total
non-alignment score, which is related to one of the suggestions of Karp in
(Karp, 2002): “the distance between genomes should be measured not only by
counting mutations, but also by determining the number of large-scale
rearrangements needed to transform one genome to another”.
The method has already been used successfully in computational linguistics,
in problems such as the similarity of Romance languages [Dinu and Dinu 2005].
The paper is organized as follows: in the next section we explain the rank
distance principle. Section 3 is the central part (the formalization of our
method is presented there), and in the final sections an experimental result
and future work are stated.

2 Motivation
To measure the distance between two strings, we use the following strategy: we
scan both strings (from left to right) and for each letter of the first string
we count the number of elements between its position in the first string and
the position of its first occurrence in the second string. Finally, we sum all
these scores and obtain the rank distance. It is readily checked that the rank
distance gives a score of zero to letters which are in the same position in
both strings, like the Hamming distance. On the other hand, the reduced
sensitivity of the rank distance to deletions and insertions is paramount
here: it allows us to make use of ad hoc extensions (rather than the edit-like
extension), which do not much affect the (low) computational complexity of the
rank distance. So, rank distance sides with Hamming distance rather than
Levenshtein distance as far as computational complexity is concerned; the fact
that in the Hamming and in the rank case the median string problem is
tractable, while in the edit case it is not (actually the problem is NP-hard),
is a very significant indicator. A comparison with classical ordinal distances
(as the rank distance is) is also to the point, e.g. with bubble distances or
Spearman’s footrule, the more so since the latter, which has long been used in
non-parametric statistics, is tightly related to the rank distance.
Historically, Spearman’s footrule¹ is the first ordinal distance to have been
used; however, it evaluates distances between permutations of the integers
1, 2, . . . , n, rather than between sequences on a fixed and known alphabet, as
is the case with rank distances². While Spearman’s footrule is linear, and so
very easy to compute, the bubble distance is quadratic, and so less
user-friendly (by the way, the name “bubble” is derived from an elementary
sub-optimal sorting algorithm called “bubble sort”, which swaps adjacent
integers whenever they are found to be in the wrong order). The average value
is at two-thirds of the way to the maximum value in Spearman’s case, while it
is at half the way (and so perfectly balanced) in the bubble case; this is
because Spearman’s footrule becomes rather “undiscriminating” for highly
different orderings. So: the footrule is more “easy-going” but much easier to
compute than the bubble distance. Rank distance has the same drawbacks and the
same advantages as Spearman’s footrule: its average value is equally
unbalanced, but computationally it is quite attractive. As for “classical”
ordinal distances for integers (Spearman’s footrule, bubble distance and
others, with average values, maximal values etc.) the reader is referred to
the basic work [Diaconis].

¹ Both Spearman’s footrule and binary Hamming distance are special cases of a
well-known metric distance called taxi distance or Manhattan distance, which is
known to be equivalent to the usual Euclidean distance. Computationally,
Manhattan distance is obviously linear.
² Rank distance can be extended to (potentially) infinite alphabets, but then
its computational complexity increases. This generalization is not needed
here.
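
To make the comparison between the two classical ordinal distances concrete, the following Python sketch computes Spearman’s footrule and the bubble distance for permutations and averages them over all permutations of a small n. The function names are illustrative and do not come from the paper; the bubble distance is computed here by naive inversion counting, which is only one possible way to obtain it.

```python
from itertools import permutations

def footrule(p, q):
    """Spearman's footrule: sum over items of |position in p - position in q|."""
    where = {item: k for k, item in enumerate(q)}
    return sum(abs(k - where[item]) for k, item in enumerate(p))

def bubble_distance(p, q):
    """Adjacent swaps needed to turn p into q, counted as inversions (quadratic)."""
    where = {item: k for k, item in enumerate(q)}
    seq = [where[item] for item in p]
    n = len(seq)
    return sum(1 for a in range(n) for b in range(a + 1, n) if seq[a] > seq[b])

def mean_and_max(dist, n):
    """Average and maximum of dist(identity, p) over all permutations p of 1..n."""
    identity = tuple(range(1, n + 1))
    values = [dist(identity, p) for p in permutations(identity)]
    return sum(values) / len(values), max(values)

if __name__ == "__main__":
    for n in (4, 6, 8):
        print(n, mean_and_max(footrule, n), mean_and_max(bubble_distance, n))
```

For n = 4 the footrule averages 5 out of a maximum of 8, while the bubble distance averages 3 out of a maximum of 6, i.e. exactly half; the footrule ratio moves towards two-thirds as n grows.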

3 Rank distance in biology


Let us choose a finite alphabet, say {A, C, G, T}, and two strings on that
alphabet, which for the moment will be constrained to be a permutation of each
other (when taken away ???). E.g. take the two strings of length 6, AACGTT
and CTGATA; number the occurrences of repeated letters in increasing order
to obtain $A_1 A_2 C_1 G_1 T_1 T_2$ and $C_1 T_1 G_1 A_1 T_2 A_2$. Now, proceed
as follows: in the first sequence $A_1$ is in position 1, while it is in
position 4 in the second sequence, and so the difference is 3; compute the
difference in positions for all letters and sum them. In this case the
differences are 3, 4, 2, 1, 3, 1 and so the distance is 14. Even if the
computation of the rank distance as based on its definition is quadratic, we
shall exhibit below an algorithm which brings it back to linear complexity;
actually, even the “naive” (and in principle quadratic) computation will turn
out to have a very low empirical complexity when applied to sequences of
practical interest, for reasons to be discussed below.
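
The worked example above can be reproduced with a short Python sketch. The helper names are hypothetical, and the code assumes, as in the example, that the second string is a rearrangement of the first.

```python
from collections import defaultdict

def index_occurrences(s):
    """Label each letter with its occurrence number: 'AAC' -> [('A',1), ('A',2), ('C',1)]."""
    seen = defaultdict(int)
    labelled = []
    for ch in s:
        seen[ch] += 1
        labelled.append((ch, seen[ch]))
    return labelled

def position_differences(u, v):
    """Per-symbol |position in u - position in v|, listed in the order of u.

    Assumes v is a rearrangement of u, as in the example in the text.
    """
    where = {sym: j for j, sym in enumerate(index_occurrences(v), start=1)}
    return [abs(i - where[sym]) for i, sym in enumerate(index_occurrences(u), start=1)]

diffs = position_differences("AACGTT", "CTGATA")
print(diffs, sum(diffs))  # [3, 4, 2, 1, 3, 1] 14
```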
The further generalizations (adding dashes at the beginning, at the end, or
at random positions) are ad hoc and are basically validated empirically.

3.1 Preliminaries
An alphabet is a finite non-empty set. The elements of an alphabet Σ are called
letters or symbols. A word or string is a finite sequence of zero or more
letters of Σ; the word with zero letters is called the empty word and is
denoted by λ. The set of all words over Σ is denoted by $\Sigma^*$, whereas the
set of all non-empty words over Σ is denoted by
$\Sigma^+ = \Sigma^* \setminus \{\lambda\}$. The concatenation of two words
u, v, denoted uv, is obtained by juxtaposition, that is, writing v at the end
of u.
The set $\Sigma^*$ is the free monoid generated by Σ with respect to the
operation of concatenation. The length of a word w, denoted |w|, is the number
of letters appearing in w; each letter is counted as many times as it occurs.
Given a word w and a letter x, one denotes by $|w|_x$ the number of occurrences
of the letter x in w. A language over an alphabet Σ is any subset of
$\Sigma^*$.

3.2 Definitions
Firstly, we introduce rank distance on strings without repetitions (a.k.a.
permutations or rankings). Let $u = a_1 a_2 \ldots a_n$ and
$v = b_1 b_2 \ldots b_m$ be two strings such that $|u|_{a_i} = 1$ for all
$i = 1, \ldots, n$ and $|v|_{b_i} = 1$ for all $i = 1, \ldots, m$ (for
simplicity, we write $|u|_i$ instead of $|u|_{a_i}$). For an element
$a_i \in u$ we define its order by $\mathrm{ord}(a_i|u) = i$ (its position in
the string, counted from left to right). Using this notation, the rank distance
is defined as follows:

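Definition 1 The rank distance between two strings without repetitions u and
v is given by:

$$\Delta(u, v) = \sum_{x \in u \cap v} |\mathrm{ord}(x|u) - \mathrm{ord}(x|v)| + \sum_{x \in u \setminus v} \mathrm{ord}(x|u) + \sum_{x \in v \setminus u} \mathrm{ord}(x|v). \qquad (1)$$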

To extend rank distance to arbitrary strings, we index each symbol of the
string with its occurrence number (one plus the number of its previous
occurrences in the string). The rank distance between two strings is the rank
distance between the corresponding indexed words.
Example 1 Let $w_1 = abbab$ and $w_2 = abbbac$ be two strings. Their
corresponding indexed strings are $\bar{w}_1 = a_1 b_1 b_2 a_2 b_3$ and
$\bar{w}_2 = a_1 b_1 b_2 b_3 a_2 c_1$, respectively. So,
$\Delta(w_1, w_2) = \Delta(\bar{w}_1, \bar{w}_2) = 8$.

Remark 1 Note that the above transformation can be done in linear time (by
memorizing for each symbol, in an array, how many times it appears in the
string).
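
As an illustration of Definition 1 together with the indexing of Remark 1, here is a minimal Python sketch. The function names are assumptions made for this example; ord is taken as the 1-based position counted from the left, as in the definition.

```python
def index_occurrences(s):
    """Linear-time indexing in the spirit of Remark 1: one counter per symbol."""
    counts = {}
    out = []
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
        out.append((ch, counts[ch]))
    return out

def rank_distance_no_repeats(u, v):
    """Formula (1) for sequences whose items are all distinct."""
    ord_u = {x: i for i, x in enumerate(u, start=1)}
    ord_v = {x: j for j, x in enumerate(v, start=1)}
    shared = sum(abs(ord_u[x] - ord_v[x]) for x in ord_u if x in ord_v)
    only_u = sum(i for x, i in ord_u.items() if x not in ord_v)
    only_v = sum(j for x, j in ord_v.items() if x not in ord_u)
    return shared + only_u + only_v

def rank_distance(u, v):
    """Rank distance between arbitrary strings, via their indexed versions."""
    return rank_distance_no_repeats(index_occurrences(u), index_occurrences(v))

print(rank_distance("abbab", "abbbac"))  # 8, as in Example 1
```

In Example 1 the two swapped symbols $a_2$ and $b_3$ contribute 1 each, and the unmatched $c_1$ contributes its position, 6.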

Remark 2 Since (1) gives a greater importance to the right part of the
strings, and this is not necessarily justified in biology, we add to this value
the value given by applying the rank distance to the reversed strings (mirror
images). So, we will use the formula:

$$\Delta_{av}(u, v) = \frac{\Delta(u, v) + \Delta(mi(u), mi(v))}{2} \qquad (2)$$

where mi(u) is the reversal of the string u. It keeps all the properties of
rank distance.
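
Formula (2) can be sketched in Python as follows. This is an illustration under the same assumptions as Definition 1 (positions counted from 1, the k-th occurrence of a letter matched with the k-th occurrence in the other string); the function names are not from the paper.

```python
from collections import defaultdict

def rank_distance(u, v):
    """Rank distance on occurrence-indexed strings, positions counted from 1."""
    pos_u, pos_v = defaultdict(list), defaultdict(list)
    for i, ch in enumerate(u, start=1):
        pos_u[ch].append(i)
    for j, ch in enumerate(v, start=1):
        pos_v[ch].append(j)
    total = 0
    for ch in set(pos_u) | set(pos_v):
        pu, pv = pos_u.get(ch, []), pos_v.get(ch, [])
        k = min(len(pu), len(pv))
        total += sum(abs(a - b) for a, b in zip(pu, pv))  # k-th with k-th occurrence
        total += sum(pu[k:]) + sum(pv[k:])                # unmatched occurrences
    return total

def rank_distance_avg(u, v):
    """Formula (2): mean of the distance and the distance between mirror images."""
    return (rank_distance(u, v) + rank_distance(u[::-1], v[::-1])) / 2

print(rank_distance("abbab", "abbbac"), rank_distance_avg("abbab", "abbbac"))  # 8 7.0
```

For the strings of Example 1 the mirrored pair gives 6 instead of 8, because the unmatched c moves from the last position to the first.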

Deletions and insertions are less worrying in the rank case than in
the Hamming case³: if one incorrectly moves a symbol by, say, one position,
the Hamming distance loses any track of it, but rank distance does not, and
the mistake is quite light. So, generalizations in the spirit of the edit
distance are unavoidable in the Hamming case, even if they are computationally
very demanding, while in the rank case we may think of ad hoc ways out, which
are computationally convenient, as we did in the definitions above.
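
A small Python experiment may help to see this difference in sensitivity. The example string is made up for the illustration, and the two functions are sketches of the Hamming distance and of the rank distance as defined above, not code from the paper.

```python
from collections import defaultdict

def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(u, v))

def rank_distance(u, v):
    """Rank distance on occurrence-indexed strings, positions counted from 1."""
    pos_u, pos_v = defaultdict(list), defaultdict(list)
    for i, ch in enumerate(u, start=1):
        pos_u[ch].append(i)
    for j, ch in enumerate(v, start=1):
        pos_v[ch].append(j)
    total = 0
    for ch in set(pos_u) | set(pos_v):
        pu, pv = pos_u.get(ch, []), pos_v.get(ch, [])
        k = min(len(pu), len(pv))
        total += sum(abs(a - b) for a, b in zip(pu, pv)) + sum(pu[k:]) + sum(pv[k:])
    return total

u = "ACGTACGTAC"
v = u[1:] + u[0]   # the first base is moved to the end; all other bases shift by one
print(hamming(u, v), rank_distance(u, v))  # 10 14
```

Here the Hamming distance reaches its maximum value of 10, whereas the rank distance is 14, well below the maximum of 50 that the footnote gives for strings of length 10.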
³ When the length n is given, the maximum value for the rank distance is
quadratic in n, while it is only n in the Hamming case. More exactly, it is
$n^2/2$ if n is even and $(n^2 - 1)/2$ if n is odd, as is easily checked.

3.3 On the complexity of Rank distance
In DNA sequence analysis, the strings may be huge (for example, for the
human genome, the number of nucleotide bases is around $3 \times 10^9$), so it
is essential to have good algorithms (regarding time and space complexity) to
calculate the rank distance between two sequences.
We give here two linear-time algorithms. The first one works with linear
supplementary space; basically, it reduces rank distances to the computation
of taxi distances, which are obviously linear. The second one, which is
directly based on the definition of the rank distance, is a little slower than
the first, but it has the advantage of needing no supplementary space.
We now mention a convenient property of the rank distance: equal aligned
bases can be removed, as shown below (unequal bases keep their original ranks):

      1 2 3 4 5 6              1 2 3 4 5 6
  u = a a g c c t    ⇒    ū = a − g − c −
  v = c a a c g t          v̄ = c − a − g −

If we denote by ū and v̄ the two transformed strings, we have:

∆(u, v) = ∆(ū, v̄)

Formally, this property is as follows:

Proposition 1 If a letter x is in the same position in both strings u and v,
then its contribution to the final score is equal to ||u| − |v||. Obviously, if
|u| = |v| then the contribution is zero.
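
The transformation can be checked on the example above with the following Python sketch. The helper names and the use of "-" as a gap mark are assumptions of this illustration; the code only verifies the equality for the pair of strings shown in the text and is not a general proof.

```python
def strip_aligned(u, v, gap="-"):
    """Replace bases that are equal and aligned by a gap mark; others keep their place."""
    if len(u) != len(v):
        raise ValueError("this sketch assumes |u| = |v|")
    u_bar = "".join(gap if a == b else a for a, b in zip(u, v))
    v_bar = "".join(gap if a == b else b for a, b in zip(u, v))
    return u_bar, v_bar

def rank_distance(u, v, gap="-"):
    """Rank distance with 1-based positions; gap marks hold a position but are not scored."""
    def positions(s):
        table = {}
        for i, ch in enumerate(s, start=1):
            if ch != gap:
                table.setdefault(ch, []).append(i)
        return table
    pos_u, pos_v = positions(u), positions(v)
    total = 0
    for ch in set(pos_u) | set(pos_v):
        pu, pv = pos_u.get(ch, []), pos_v.get(ch, [])
        k = min(len(pu), len(pv))
        total += sum(abs(a - b) for a, b in zip(pu, pv)) + sum(pu[k:]) + sum(pv[k:])
    return total

u, v = "aagcct", "caacgt"
u_bar, v_bar = strip_aligned(u, v)
print(u_bar, v_bar, rank_distance(u, v), rank_distance(u_bar, v_bar))  # a-g-c- c-a-g- 8 8
```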

Algorithm 1 (linear in time and space)

• We use four arrays A[][], C[][], G[][], T[][] with 2 rows and |u| columns.
• In each array we store the following data: A[1][i] contains the rank of the
i-th ’a’ in the first string and A[2][i] contains the rank of the i-th ’a’ in
the second string (it is 0 if there are no more ’a’-s). Analogously for C[][],
G[][] and T[][].
• Finally we compute the sum:

$$\sum_i |A[1][i] - A[2][i]| + \sum_i |C[1][i] - C[2][i]| + \sum_i |G[1][i] - G[2][i]| + \sum_i |T[1][i] - T[2][i]|$$

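Example 2 For the transformed strings ū = a − g − c − and v̄ = c − a − g −
(positions 1 to 6), the first row of each array holds the ranks in ū and the
second row the ranks in v̄:

$$A = \begin{pmatrix} 1 \\ 3 \end{pmatrix}, \quad C = \begin{pmatrix} 5 \\ 1 \end{pmatrix}, \quad G = \begin{pmatrix} 3 \\ 5 \end{pmatrix}$$

So, ∆(u, v) = 2 + 4 + 2 = 8.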

Remark 3 The time complexity of Algorithm 1 is O(|u| + |v|).

Remark 4 In this algorithm, the elements of the matrices A, C, G and T
are integers. The matrices have 2 rows and $\max(|u|_a, |v|_a)$,
$\max(|u|_c, |v|_c)$, $\max(|u|_g, |v|_g)$ and $\max(|u|_t, |v|_t)$ columns,
respectively. The total number of columns of the 4 matrices is at most
|u| + |v|. So, the supplementary space is at most O(|u| + |v|).
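
A possible Python rendering of Algorithm 1 is sketched below. It assumes |u| = |v|, takes ranks as 1-based positions from the left (as in Definition 1 and Example 2), and skips equal aligned bases; the function names and the dictionary of per-letter rows are choices made for this illustration.

```python
def build_arrays(u, v, alphabet="acgt"):
    """For each letter, the ranks of its unaligned occurrences in u and in v."""
    rows = {ch: ([], []) for ch in alphabet}   # rows[ch] = (ranks in u, ranks in v)
    for rank, (a, b) in enumerate(zip(u, v), start=1):
        if a != b:                             # equal aligned bases contribute 0
            rows[a][0].append(rank)
            rows[b][1].append(rank)
    return rows

def rank_distance_arrays(u, v, alphabet="acgt"):
    """Sum of column-wise |row 1 - row 2| over the per-letter arrays."""
    if len(u) != len(v):
        raise ValueError("this sketch assumes |u| = |v|")
    dist = 0
    for in_u, in_v in build_arrays(u, v, alphabet).values():
        width = max(len(in_u), len(in_v))
        padded_u = in_u + [0] * (width - len(in_u))   # 0 marks "no more occurrences"
        padded_v = in_v + [0] * (width - len(in_v))
        dist += sum(abs(x - y) for x, y in zip(padded_u, padded_v))
    return dist

print(build_arrays("aagcct", "caacgt"))
print(rank_distance_arrays("aagcct", "caacgt"))  # 8
```

Both passes touch each position a constant number of times, and the lists hold at most |u| + |v| integers in total, in line with Remarks 3 and 4. A letter outside the given alphabet raises a KeyError in this sketch.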

Remark 5 The reader may have noticed that the so-called bit complexity of our
algorithm is n log n rather than n, due to the fact that we have to write down
longer and longer integer representations, e.g. decimal representations. The
algorithmic complexity referred to here is, however, the standard one, since in
practice the extra complexity due to integer representation is irrelevant.

Algorithm 2 (without supplementary space)

• We use eight non-negative integer variables ia, ic, ig, it and ja, jc, jg, jt
which point to the last a, c, g or t read in the first (i) and second (j)
string (initially all are 0).
• So, if we read an ’a’ in the first string, we search the second string for
the next ’a’ starting from position ja; if it is found, we compute the
difference |ja − ia| and add it to the final sum. Analogously for c, g and t.

Remark 6 It can be easily checked that the time complexity is at most
4 × max(|u|, |v|).

Remark 7 Since this algorithm doesn’t use supplementary space, it can be run
also on a modest computer even for huge strings.
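
The pointer idea of Algorithm 2 can be sketched in Python as follows. This is an illustration rather than the authors’ implementation: it keeps one resume position per letter, scores unmatched letters by their 1-based position from the left as in Definition 1, and, as an extra assumption, moves a pointer to the end of v after a failed search so that the same stretch is never scanned twice.

```python
def rank_distance_pointers(u, v):
    """Rank distance using one search pointer per letter instead of position arrays."""
    nxt = dict.fromkeys(set(u) | set(v), 0)   # 0-based index in v where the search resumes
    dist = 0
    for i, ch in enumerate(u):
        j = v.find(ch, nxt[ch])               # next unused occurrence of ch in v
        if j == -1:
            dist += i + 1                     # ch has no partner left in v
            nxt[ch] = len(v)
        else:
            dist += abs(i - j)
            nxt[ch] = j + 1
    for ch, start in nxt.items():             # occurrences in v that were never matched
        j = v.find(ch, start)
        while j != -1:
            dist += j + 1
            j = v.find(ch, j + 1)
    return dist

print(rank_distance_pointers("abbab", "abbbac"))   # 8
print(rank_distance_pointers("AACGTT", "CTGATA"))  # 14
```

Apart from one integer per letter of the alphabet, no storage proportional to the string lengths is needed, which is the point made in Remark 7.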

In the Addenda, we detail both algorithms.

4 Experiment
To test our method in bioinformatics, we use a classical problem: the
phylogenetic analysis of the mammals.
We use the complete mitochondrial genome sequences of the following 22
mammals, available in the EMBL database: human (Homo sapiens, V00662), common
chimpanzee (Pan troglodytes, D38116), pygmy chimpanzee (Pan paniscus, D38113),
gorilla (Gorilla gorilla, D38114), orangutan (Pongo pygmaeus, D38115),
Sumatran orangutan (Pongo pygmaeus abelii, X97707), gibbon (Hylobates lar,
X99256), horse (Equus caballus, X79547), donkey (Equus asinus, X97337), Indian
rhinoceros (Rhinoceros unicornis, X97336), white rhinoceros (Ceratotherium
simum, Y07726), harbor seal (Phoca vitulina, X63726), gray seal (Halichoerus
grypus, X72004), cat (Felis catus, U20753), fin whale (Balaenoptera physalus,
X61145), blue whale (Balaenoptera musculus, X72204), cow (Bos taurus, V00654),
sheep (Ovis aries, AF010406), rat (Rattus norvegicus, X14848), mouse (Mus
musculus, V00711), North American opossum (Didelphis virginiana, Z29573), and
platypus (Ornithorhynchus anatinus, X83427).
Our approach is the following: for any two mtDNA sequences used in this
study, we compute the normalized average rank distance. To normalize the
average rank distance, we divide ∆av by the maximum possible value which it
can reach. The maximum value between two strings u and v is equal to
$\frac{|u|(|u|+1) + |v|(|v|+1)}{2}$. So, the normalized average rank distance
between two strings is:

$$\Delta_{av}(u, v) = \frac{\Delta(u, v) + \Delta(mi(u), mi(v))}{|u|(|u| + 1) + |v|(|v| + 1)} \qquad (3)$$
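
Formula (3) can be sketched in Python as below. The function names are illustrative, and returning 0.0 for two empty strings is a convention chosen for this sketch, not something stated in the paper.

```python
from collections import defaultdict

def rank_distance(u, v):
    """Rank distance on occurrence-indexed strings, positions counted from 1."""
    pos_u, pos_v = defaultdict(list), defaultdict(list)
    for i, ch in enumerate(u, start=1):
        pos_u[ch].append(i)
    for j, ch in enumerate(v, start=1):
        pos_v[ch].append(j)
    total = 0
    for ch in set(pos_u) | set(pos_v):
        pu, pv = pos_u.get(ch, []), pos_v.get(ch, [])
        k = min(len(pu), len(pv))
        total += sum(abs(a - b) for a, b in zip(pu, pv)) + sum(pu[k:]) + sum(pv[k:])
    return total

def normalized_rank_distance(u, v):
    """Formula (3): averaged rank distance divided by its maximum possible value."""
    denom = len(u) * (len(u) + 1) + len(v) * (len(v) + 1)
    if denom == 0:
        return 0.0                      # assumption: two empty strings are at distance 0
    return (rank_distance(u, v) + rank_distance(u[::-1], v[::-1])) / denom

print(normalized_rank_distance("AA", "CC"))      # 1.0, no letter in common
print(normalized_rank_distance("ACGT", "ACGT"))  # 0.0, identical strings
```

The maximum is reached when the two strings share no letter, since every symbol then contributes its full position in both the direct and the mirrored computation.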

In our experiment, the usual length of a mtDNA sequence is around $2^{14}$
letters.
Three ad hoc extensions of the normalized average rank distance were tested in
our experiment: the first one is the one defined above (i.e. it uses the
strings in their initial form, without any modification); the second and the
third modify the strings in order to obtain strings of the same length. The
second extension appends ||u| − |v|| dashes at the end of the shorter of the
two strings u and v and then computes the normalized average rank distance
between these new strings; the third extension inserts ||u| − |v|| dashes at
random positions of the shorter string and then computes the normalized
average rank distance.
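
The two length-equalizing steps can be sketched in Python as follows. The function names, the "-" gap character and the optional random generator argument are assumptions of this illustration; how the dashes are then scored by the distance is not specified in the text, so the sketch stops at the padding.

```python
import random

def pad_end(u, v, gap="-"):
    """Second extension: append dashes to the shorter string."""
    n = max(len(u), len(v))
    return u.ljust(n, gap), v.ljust(n, gap)

def pad_random(u, v, gap="-", rng=None):
    """Third extension: insert dashes at random positions of the shorter string."""
    rng = rng or random.Random()
    n = max(len(u), len(v))

    def spread(s):
        # Choose which of the n final positions hold a dash; letters keep their order.
        gap_slots = set(rng.sample(range(n), n - len(s)))
        letters = iter(s)
        return "".join(gap if k in gap_slots else next(letters) for k in range(n))

    return spread(u), spread(v)

print(pad_end("ACG", "ACGTT"))                            # ('ACG--', 'ACGTT')
print(pad_random("ACG", "ACGTT", rng=random.Random(0)))   # dashes at random places
```

Passing a seeded `random.Random` makes the third extension reproducible, which may be useful when a distance matrix has to be recomputed.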
Using the normalized ∆av, we computed the distance matrices for all the
mammalian mitochondrial DNAs listed above. Then we used the Neighbor Joining
method to construct the phylogenies for the mammals. We used the PHYLIP
package, available at
http://evolution.genetics.washington.edu/philip.html.
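
As a rough illustration of how a computed distance matrix might be handed over to a neighbor-joining program, the following Python sketch writes a square matrix as text. It assumes the layout commonly used for PHYLIP distance input (number of species on the first line, then one row per species starting with a name padded to ten characters); this should be checked against the documentation of the PHYLIP version in use. The function name and the numbers in the example are made up.

```python
def phylip_distance_matrix(names, dist):
    """Format a symmetric distance matrix as PHYLIP-style square input text."""
    if len(names) != len(dist):
        raise ValueError("one row of distances is needed per name")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        label = name[:10].ljust(10)   # names occupy the first ten characters
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"

example = phylip_distance_matrix(
    ["human", "chimp"],
    [[0.0, 0.1],
     [0.1, 0.0]],   # made-up values, only to show the layout
)
print(example)
```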
In Fig. 1 we have drawn the phylogenetic tree of the mammals obtained
with the third extension of our distance.
The obtained tree has a topological structure comparable to the structure
of trees reported by other researchers (Cao et al. 1998, Reyes et al. 2000,
Li et al. 2004). However, two differences can be observed: the classifications
of the rat and the cat in our research are not similar to their corresponding
classifications in the other papers. If we look at the distance matrix, we can
observe that the gray seal and harbor seal are the closest mammals to the cat,
but the cat is not the closest mammal to the gray seal and the harbor seal. The
same holds for the mouse and the rat (the rat is the closest mammal to the
mouse, but the mouse is not the closest mammal to the rat).

Figure 1: The mammalian phylogeny built from complete mammalian mtDNA
sequences using rank distance

5 Conclusion
In this paper we exhibit a low-complexity distance for the DNA sequence com-
parison problem. We show that our distance can be computed in linear time
(O(2n), where n is the length of the longer string). We propose two linear
algorithms to compute the distance: the first uses supplementary space, while
the second doesn’t.
The reported experiment (the phylogenies of mammals) produced similar
results to experiments reported in the literature, but the computational costs
in time and space were much lower.

6 Addenda
Algorithm 1 (linear in time and space)
Input: two strings, u and v

x := |u|; ia := 0; ic := 0; ig := 0; it := 0;
ja := 0; jc := 0; jg := 0; jt := 0; dist := 0;
{ all entries of A, C, G, T are initially 0 }
for i := 1 to |u| do
begin
  read(u[i], v[i]);
  if u[i] ≠ v[i] then        { equal aligned bases are skipped }
  begin
    case u[i] of
      A: ia++; A[1, ia] := x;
      C: ic++; C[1, ic] := x;
      G: ig++; G[1, ig] := x;
      T: it++; T[1, it] := x;
    end;
    case v[i] of
      A: ja++; A[2, ja] := x;
      C: jc++; C[2, jc] := x;
      G: jg++; G[2, jg] := x;
      T: jt++; T[2, jt] := x;
    end;
  end;
  x := x − 1;                { x is the rank of position i, counted from the right }
end;
for i := 1 to max(ia, ja) do dist := dist + |A[1, i] − A[2, i]|;
for i := 1 to max(ic, jc) do dist := dist + |C[1, i] − C[2, i]|;
for i := 1 to max(ig, jg) do dist := dist + |G[1, i] − G[2, i]|;
for i := 1 to max(it, jt) do dist := dist + |T[1, i] − T[2, i]|;
return dist

Output: dist.

Algorithm 2 (without supplementary space)

Input: two strings, u and v

Step 1 Initialization:
dist := 0; ia := 1; ic := 1; ig := 1; it := 1;

Step 2 Advance through the strings:
for i := 1 to min(|u|, |v|) do        { we assume that |u| < |v| }
begin
  x := u[i];
  case x of                           { start: position in v where the search for x resumes }
    A: start := ia;
    C: start := ic;
    G: start := ig;
    T: start := it;
  end;
  for j := start to |v| do
    if u[i] = v[j] then
    begin
      dist := dist + |i − j|;
      u[i] := ’z’; v[j] := ’z’;        { mark both occurrences as matched }
      case x of
        A: ia := j + 1;
        C: ic := j + 1;
        G: ig := j + 1;
        T: it := j + 1;
      end;
      break;
    end;
end;

Step 3 Computing the distance (unmatched letters):
for i := 1 to |u| do
  if u[i] ≠ ’z’ then dist := dist + |u| + 1 − i;
for i := 1 to |v| do
  if v[i] ≠ ’z’ then dist := dist + |v| + 1 − i;
return dist;

Output: dist.

References
[1] Aurelio Reyes, Carmela Gissi, Graziano Pesole, Francois M. Catzeflis, and
Cecilia Saccone. Where Do Rodents Fit? Evidence from the Complete
Mitochondrial Genome of Sciurus vulgaris. Mol. Biol. Evol., 17(6):979-983, 2000
[2] Y. Cao, A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N.
Okada, S. Paabo, M. Hasegawa. Conflict among individual mitochondrial
proteins in resolving the phylogeny of Eutherian orders. J. Mol. Evol., 47,
307-322, 1998
[3] L. P. Dinu, A. Dinu. On the Syllabic Similarities of Romance Languages. In
A. Gelbukh (Ed.): CICLing 2005. LNCS 3406, 785-788, 2005
[4] L. P. Dinu, F. Manea. An efficient approach for computing median rankings
(submitted)
[5] Koonin. The emerging paradigm and open problems in comparative
genomics. Bioinformatics, 15, 265-266, 1999
[6] R. Holmquist, M. Miyamoto, M. Goodman. Higher-Primate Phylogeny - Why
Can’t We Decide? Mol. Biol. Evol., 5(3):201-216, 1988
[7] G. Stuart, K. Moffett, S. Baker. Integrated gene and species phylogenies
from unaligned whole genome protein sequences. Bioinformatics, 18, 100-108,
2002
[8] A. W.-C. Liew, H. Yan, M. Yang. Pattern recognition techniques for the
emerging field of bioinformatics: A review. Pattern Recognition, 38,
2055-2073, 2005
[9] S. C. Chan, A. K. C. Wong, D. K. Y. Chiu. A survey of multiple sequence
comparison methods. Bulletin of Math. Biology, 54(4), 563-598, 1992
[10] J. A. Lake, J. E. Moore. Phylogenetic analysis and comparative
genomics. Trends guide to bioinformatics, Trends Journal Supplement, 22-23,
1998
[11] Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul M. B. Vitanyi. The Similarity
Metric. IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO Y, MONTH 2004
[12] T. F. Smith, M. S. Waterman. Comparison of biosequences. Adv. Appl.
Math., 2, 482-489, 1981
[13] Wooley. Trends in computational biology: a summary based on a RECOMB
plenary lecture. J. Comput. Biology, 6, 459-474, 1999
