# E q u a t i o n (22) m a y be p r o v e d by inspection of (18), while (21) m a y be demonstrated by expanding the logarithm in (17) into a Taylor series and retaining only the first two

terms. Equations (21) and (22) show that as the n u m b e r o f accesses becomes immaterial, one should use very small resident and overflow records to reduce the total storage volume. Substituting (21) a n d (22) into eq. (13), we get

Programming Techniques

G. M a n a c h e r , S.L. G r a h a m Editors

V* ~ , R . s / a + R . s . (a -- 1)/a = R . s ,

(23)

which is the absolute minimal storage volume needed, w i t h o u t a n y " o v e r h e a d " a d d e d by the storage method.

**A Fast Algorithm for Computing Longest Common Subsequences
**

James W. Hunt Stanford University Thomas G. Szymanski Princeton University

Previously published algorithms for finding the longest common subsequence of two sequences of length n have had a best-case running time of O(n2). A n algorithm for this problem is presented which has a running time of O((r + n) log n), where r is the total number of ordered pairs of positions at which the two sequences match. Thus in the worst case the algorithm has a running time of O(n 2 log n). However, for those applications where most positions of one sequence match relatively few positions in the other sequence, a running time of O(n log n) can be expected. Key Words and Phrases: Longest common subsequence, efficient algorithms CR Categories: 3.73, 3.63, 5.25

Appendix

PROPERTY 1. F o r all a > 1,

p*(a) < q*(a).

(A1)

PROOF. Instead o f (A1), we prove the equivalent exp (q*/s)/exp (p*/s) > 1. Let us introduce the following n o t a t i o n : (A2)

b = a + (a 2 -- 1) ~,

c = 2(a -- 1).

(A3)

T h e n it is easily seen that

q*/s = 2/(b -- 1), p * / s = In (b/c).

(A4) (A5)

Substituting (A4) and (A5) into (A2) a n d expanding the n u m e r a t o r as a T a y l o r series, retaining the first three terms, we get exp (q*/s) exp (p*/s) exp (2/(b -- 1)) (A6)

b/c > 1 -+- 2 / ( b - - 1) + 2 / ( b -- 1) 2 _ (b z + 1)c b/c (b -- 1)2b

F r o m definition (A3), we get the following identity: (b -- 1) 2 = b.c. Substituting (A7) into (A6), we finally have exp (q*/s) > (b 2 + 1).c _ b 2 - I - 1 > 1. exp (p*/s) b 2. c b2 Received February 1974; revised March 1975 (A7)

References

1. Benner, F.H. On designing generalized file records for management information systems. Proc. AFIPS 1967 FJCC, Vol. 31, AFIPS Press, Montvale, N.J., pp. 291-303. 2. Burroughs DISK FORTE Users Manual. Burroughs Corp., Detroit, Mich., 1973. 3. Collmeyer, A.J., and Shemer, J.E. Analysis of retrieval performance for selected file organization techniques. Proc. AFIPS 1970 FJCC, Vol. 37, pp. 201-210. 4. Olle, T.W. Generalized systems for storing structured variable length data and retrieving information. In Mechanized Information Storage, Retrieval and Dissemination, K. Samuelson, Ed., Rome, 1968. 5. Wilde, D.J., and Beighfler, C.S. Foundations of Optimization. Prentice-Hall, Englewood Cliffs, N.J., 1967.

Copyright © 1977, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery. The work of the first author was partially supported by Bell Laboratories' Cooperative Research Fellowship Program. The work of the second author was partially supported by NSF Grants GJ-35570 and DCR74-21939. Author's addresses: J.W. Hunt, Department of Electrical Engineering, Stanford University, Stanford CA 94305; T.G. Szymanski, Dept. of Electrical Engineering and Computer Science, Princeton University, Brackett Hall, Engineering Quadrangle, Princeton, NJ 98540. Communications of the ACM May 1977 Volume 20 Number 5

3511

A [i-k. These algorithms all have a worst-case (as well as a best-case) running time of O(n~)2 A more relevant parameter for this problem is r. considered as sequences.k _~ Ti.v are defined. .k] have a c o m m o n subsequence of length k. then U is said to be a subsequence of Ir if there exists a monotonically increasing sequence of integers r l . T5.1 =
351
.k -. The number of elements in the set {(i. The longest common subsequence of these files. A [ l :i+ l ] and B[I:T¢+I.Introduction
Many algorithms [1. our algorithm will perform substantially better.k-1 < j <__ Ti.k-1 and Ti. thus A[I:i] and B[I:T~+Lk--1] have a common subsequence of length k -. If A [1 :i] and B[1 :T~.k ITi. by Lemma 2 and the premise of this case.kmust lie in a specific range of values. then certainly A[I:i-t-1] and B[l:Ti. Thus in the general case our algorithm will not take much longer than the algorithms of [1. that is. r~.k-1 < T~+L~ • The following rule suffices to compute T~+Lk from T~. then Ti. T~. For ease of presentation.k] do also. Note that each row of the T array is strictly increasing. Ti. any c o m m o n subsequence of the sequences A[I:i-t-1] and B[I:T~+~.~ must equal Ti+l.~ . Ti. Thus Ti+Lk <__ Ti.k may thus be considered as a pointer which tells us how much of the B sequence is needed to produce a c o m m o n subsequence of length k with the first i elements of A.k . Moreover.a = 6. . Case 1. whereas in many common applications.k] is the last member of this common subsequence or else T~.~ would not be minimal. rw~ such that U[i] = V[r~]for 1 < i < I U I. Deleting the last element from each of these sequences can remove at most one element from this common subsequence.k-1 <__ T~.k = the smallest j such that A[I:i] and B[I~] contain a c o m m o n subsequence of length k.2 = 3. Communications of the ACM May 1977 Volume 20 Number 5
The key data structure needed by our algorithm is an array of "threshhold values" T~. B[T~+L~] does not match A[iq-1]. LEMMA 2. In these situations our algorithm will exhibit an O(n log n) behavior. A [j]. In the worst case this is of course O(n 2 log n).1 common subsequence of
1An unpublished result of Michael Paterson shows how to construct an O(n~/log n) algorithm for the longest common subsequence problem for sequences over a finite alphabet. N o such j exists.~ if no such j exists
f
Preliminary Results
PROOF. represents that common "core" which does not have to be changed if we desire to edit one file into the other. that is. By definition. Suppose that we have computed T~. However. PROOF. the total number of matching pairs of positions within the sequences in question. 6] for finding the longest common subsequence of two sequences of length n have appeared in the literature.k. namely the length k .k for all values of k and wish to compute T~+1. (3) Finding the edit distance between two files in which the individual lines of the files are considered to be atomic. A longest common subsequence is a common subsequence of greatest possible length.k] must have B[T~+Lk] as its last element. We denote the length of A by t A I.1 ] contain a common subsequence of length k .k-1 < Ti+l.4 = 7. T5. . T~. Thus Ti. Throughout this paper A and B will be used to denote the sequences in question. By the minimality of T~+~. U i s a c o m m o n s u b s e q u e n c e o f A and B if U is a subsequence of both A and B.1] = B[i] T~+Lk = ~ and Ti. LEMMA 1. 6]. A[i] is the ith element of A and A[i'. .1. T~. Therefor the same common subsequence of length k is also contained in A[I:i] and B[l:Ti+l. There exists a m i n i m a l j for which A[iA-1] = B[j] and T~. Ti.~.1. Consider the common subsequence of length k contained in All:i] and B[l:Ti.1 < T~. 4.k ~ Ti+l. we can expect r to be close to n. Clearly B[T~. Ts.~ < . • • • .~ defined by Ti.1.k]. [] This linear ordering is of paramount importance in the efficient implementation of our algorithm. and an O((n~ log log n)/log n) algorithm for sequences over an infinite ordered alphabet.j] denotes the sequence A [i].k for all values of k.k]. LEMMA 3.k and by Lemma 2. B = badbabd we have T5. .k • Case 2. We first show T~+x. If U and V are finite sequences.5 = undefined. PROOF. All results of this paper apply to the case of the infinite ordered alphabet. We shall present an O ( ( r + n ) log n) algorithm for the longest common subsequence problem. Let A be a finite sequence of elements chosen from some alphabet. given sequences A = abcbdda.
1. 4. For example. If T i n . smallest j such that A[iq.k] have a common subsequence of length k. (2) Finding a maximum cardinality linearly ordered subset of some finite collection of vectors in 2-space [7].k-~ < j < Ti. . j) such that A[i] = B[j]} will be denoted by r. Each T~.1].1 [] and Ti. we shall assume both sequences have the same length which will be denoted by n. Therefore All :i] and B [ I : T ~ . "'" . Accordingly Ti. Typical of such applications are the following: (1) Finding the longest ascending subsequence of a permutation of the integers from 1 to n [3]. Certainly A[l:iq-l] and B[l:j] contain a c o m m o n subsequence of length k.k.k -. v .k_~ <__ T¢+~. k .k. < T i . for a large number of applications.

k = j.
for i := 1 step 1 until n do forj on M/1TCHLIST[i] do
beg~ find k such that THRESH[k--I] < j <_ THRESH[k]. The main source of inefficiency in this algorithm is the inner loop on j in which we repeatedly search for elements of the B sequence which match A[i]. 4. B[j] "tacked" onto the end.
would be set to j8 and so forth. for i := 1 step 1 until n do THRESH[i] := n q. we see that the loop on j must run downwards. k. Algorithm 1
element array . Finally.k_t < T~+l. .k.k] does not match A [ i + I ] . F o r each position i we need a list of corresponding j positions such that A[i] = B[j].k < j is incorrect and hence we must have T~+t. i f j < THRESH[k] then begin THRESH[k] := j.k ~ Ti+t. I f instead the loop on j ran upwards from 1 to n.k. These lists must be kept in decreasing order in j. Suppose that for some value of i.1 until 1 do if/110 = B[j] then
= = = = = = =
(5.
pointer array LINK[1 :n]. integer array THRESH[O:n]. Since these latter assignments are unwarranted. T~. Since the inner loop of Algorithm 1 considers values of j in decreasing order. Since L e m m a 2 guarantees that T~.
THRESH[O] := 0. 2) (6. . However.j2. pointer PTR. k := largest k such that THRESH[k] ~ n -t. 1) ( ) MATCHLIST[2] (7.k-t < j l < "'" < j. >j~andA[i] = B[jq] for I < q _< p. notice that the direction of the loop on j is crucial. implying that Ti. end.k = jt and that T H R E S H [ k ] should be updated to this value.. Subsequent refinements will enable us to not only improve the running time to O((r + n) log n) but also recover the actual longest c o m m o n subsequence. integer array THRESH[O:n]. A LINK[k--l]).k we can conclude that the last element of the length k c o m m o n subsequence of A [ I : [ + I ] and B[I:T~+I. B = badbabd the desired lists are MATCHLIST[I] MATCHLIST[2] MATCHLIST[3] MATCHLIST[4] MATCHLIST[5] MATCHLIST[6] MATCHLIST[7]
Algorithm 2 element array/111 :n]. advance PTR. By L e m m a 2 then.
end. B[1 :n].
list array MATCHLIST[1 :n].k_I] with the pair A [ i + 1]... each of the values j.k < T~+~. Thus T~+l.jp) such that jx >j2 > . jm-x. for i := 1 step 1 until n do
The correctness of the algorithm follows from consideration of the invariant relation " T H R E S H [ k ] = T~-L~ for all k " which holds at the start of each iteration on i. <_ T ~-1. [] We can now present an O(n 2 log n) algorithm for determining the length of the longest c o m m o n subsequence.~ for all k " which holds at the end of each iteration on i.
for i : = 1 step 1 until n do
THRESH[i] := n + 1. . T H R E S H [ k + 2 ]
352
set M/1TCHLIST[i] :-. . P T R := LINK[k]. comment Step 2: initialize the THRESH array.. LINK[O] := null. THRESH[k] := j. but THRESH[k+I] would be set to j 2 .(jr . say B[j~].k] also contain a c o m m o n subsequence of length k which implies that T~. print Largest k such that THRESH[k] ~ n + 1.1. LINK[k] := newnode (i.
Communications of the ACM
May 1977 Volume 20 Number 5
.. while PTR ~ null do begin print (i.1 ] = Ti-l.j) pair pointed to by PTR. for the sequences A = abcbdda.:~ = T H R E S H [ k ] . Thus A[I:i] and B[I:T~+I. All positions of the A sequence which contain the same element m a y be set up to use the same physical list of matching j's. Thus Algorithm 1 m a y be implemented to run in O(n ~ log n) time.~.k < j.A[1 :i] and B[T~. jl will cause T H R E S H [ k ] to take on successively smaller values until it is set equal to the desired value of jx. j. we see that T~. Since the T H R E S H array is monotonically increasing (Lemma 1) we can utilize a binary search to implement the "find" operation in time O(log n).k. This contradiction leads us to conclude that the original assumption of T~+~.
end.k = T~+l. A [i] matches several different B elements in a given "threshold" interval.
The Algorithm
A small a m o u n t of preprocessing will vastly improve the performance of Algorithm 1. B[j~] with T H R E S H [ k . THRESH[O] := 0. Assume temporarily that T~+1. " " . Linked list techniques obviate the need for this search. and the invariant relation " T H R E S H [ k ] = T~. . Ti+Lk < j _< Ti.. .1. comment Step 3: compute successive THRESH values.411 :n].. integer i. B[1 :n].
We can now display our final algorithm. then not only would T H R E S H [ k ] be set to jx.k < j. comment Step 4: recover longest common subsequence in reverse
order. 3) MATCHLIST[5] MATCHLIST[I].
for i := 1 step 1 until n do for j := n step -. by the above assumption and the premise of this case.k.
begin find k such that THRESH[k--I] < j <_ THRESH[k].
end. comment Step 1: build linked lists. F r o m L e m m a 3.

Alfred V. Douglas McIlroy who first suggested this problem to us.. pp. deleting. Stanford. n. we conclude that the time requirement for step 3 is O(n + r log n). of Illinois at Urbana-Champaign. D. The authors are indebted to M. N. Clearly this takes at most O(n) time. Aho and Jeffrey D.
Received May 1975. Since these innermost statements involve one binary search plus a few operations which require constant time. and Fischer. Step 3 wiU require the allocation of at most O(r) list nodes. Oct. J. A special case of the maximal common subsequence problem.
r = n. 341-343. 168-173.R. This step takes a total of O(n log n) time and O(n) space.E. Peter van Emde Boas has shown that each such operation can he performed in O(log log n) time [2]. and testing membership of elements in a set where all elements are restricted to the first n integers.. and Yao. we can use a distribution sort to create the MATCHLISTs in O(n) time. The problem of part (c) is. equivalent to finding the longest common subscquence of the given permutation and the sequence 1. 5. on Foundations Comptr. Comptr. Step 2 clearly takes O(n) time. j) such that A [i] = B[j] taken in order of decreasing j within increasing i. Sci. (c) The longest ascending subsequence of a permutation oJ the first n integers may be Jound in O(n log log n) time.. Fredman. for the permutation case note that each integer appears exactly once in each sequence and thus wc have
common subsequence of the sequences d and B in time O((r + n) log n) and space O(r + n). Klarner. The value returned by newnode is a pointer to the list node just created. Calif. Step 2 takes O(n) time. Dep. Wagner. van Emde Boas.G. 3. The string-to-string correctioP problem. 1975.J. step 3 takes O(n + r log log n) time and step 4 takes O(n) time.A. We record each (i. UIUCDCS-R-75-699. Chvatal. STAN-CS-72-292. Step 1 entails a sorting procedure to set up the MATCHLISTs. His data stucture requires O(n log log n) time for initialization. DiscreteMathematics 11. Algorithm 2 finds and prints a longest
THEOREM 2. M. UUman provided us with several enlightening conversations including the particular example given following Theorem 1 which shows that our algorithm can require as much as O(r) space.
[]
Acknowledgments. On computing the length of longest increasing subsequences. Dep.. [] We note that certain input sequences such as A = "aabaabaab. Selected combinatorial research problems. Feb. LINK[k] points to the head of a list of (i. Step 1 can be implemented by sorting each sequence while keeping track of each element's original position. (a) Algorithm 2 can be implemented to have a running time of O(r log log n + n log n) over an infinite alphabet.. 1975). V. Sci. See [4] for an algorithm which never uses more than O(n) space nor less than O(n~) time. U. A. We shall present a c o m m o n analysis. 1975.L. 16th Annual Symp. of course. Peter van Emde Boas made several helpful comments on an early draft of this paper. Harold Stone suggested a variant of the problem (described and solved in [5]) which led to the development of the present algorithm. and Knuth. Princeton. Stanford U.
3S3
. Princeton U... THEOREM 1. 75-84. we can use them to present the following theoretical result. Sci. and a pointer to some other list node. Finally. Comptr. 6 (June 1975). Jan. this sort can be done in O(n log n) time.. Urbana.A. • • • . Preserving order in a forest in less than logarithmic time. Yao. F o r the infinite alphabet case. respectively. On computing the rank function for a set of vectors. Comm." and B = "ababab. A linear space algorithm for computing maximal common subsequences. D. These arguments are. In the other two cases. j) pairs describing a common subsequence of length k. 7. Szymansld. Electrical Eng. 1 (Jan. an index of a position in the B sequence.The subroutine newnode invoked in step 3 is a subroutine which creates a list node whose fields contain the values of the arguments to newnode.." cause Algorithm 2 to use O(r) space even if list nodes are reclaimed whenever they become inaccessible. 6.S... ACM 21..J. (b) Algorithm 2 can be implemented to have a running time of O((n + r) log log n) over a fixed finite alphabet.Dep. 1 (Jan. Thus whenever THRESH[k] is defined. In step 4 we recover the actual longest common subsequence. June 1972. Hirschberg.. The two outer loops of step 3 should be considered as a single loop over all pairs (i.C. M. TR-170. In other words. In this step we also implement a simple backtracking device that will allow us to recover the longest common subsequence. revised January 1976
References 1. PROOF. Since at most one list node is created per search. 29-35. 1975). an index of a position in the A sequence.F. Although the necessary algorithms are quite complex. F. All three parts of the theorem use basically the same algorithm although the implementation of some of the steps varies slightly. 1975. j) pair which causes an element of the T H R E S H array to change value. the outer loops of step 3 induce exactly r executions of the innermost statements of step 3. P. 4. We may then merge the sorted sequences creating the MATCHLISTs as we go. D. In all three cases we require O(n log log n) time to initialize van Emde Boas's data structures.. 2. Ill.
PROOF. 2. T.
Communications of the A C M May 1977 Volume 20 Number 5
A Final Note
The key operations in the implementation of our algorithm are the operations of inserting.. ACM 18.