# Algorithms on Strings, Trees, and Sequences

COMPUTER SCIENCE AND COMPUTATIONAL BIOLOGY

Dan Gusfield
University of California, Davis

CAMBRIDGE UNIVERSITY PRESS

Contents

Preface

Part I  Exact String Matching: The Fundamental String Problem

1 Exact Matching: Fundamental Preprocessing and First Algorithms
2 Exact Matching: Classical Comparison-Based Methods
3 Exact Matching: A Deeper Look at Classical Methods
4 Seminumerical String Matching

Part II  Suffix Trees and Their Uses

5 Introduction to Suffix Trees
6 Linear-Time Construction of Suffix Trees
7 First Applications of Suffix Trees
8 Constant-Time Lowest Common Ancestor Retrieval
9 More Applications of Suffix Trees

Part III  Inexact Matching, Sequence Alignment, Dynamic Programming

10 The Importance of (Sub)sequence Comparison in Molecular Biology
11 Core String Edits, Alignments, and Dynamic Programming
12 Refining Core String Edits and Alignments
13 Extending the Core Problems
14 Multiple String Comparison - The Holy Grail
15 Sequence Databases and Their Uses - The Mother Lode

Part IV  Currents, Cousins, and Cameos

16 Maps, Mapping, Sequencing, and Superstrings
17 Strings and Evolutionary Trees
18 Three Short Topics
19 Models of Genome-Level Mutations

Glossary
Index

Preface

History and motivation

So without worrying much about the more difficult chemical and biological aspects of DNA and protein, our computer science group¹ was empowered to consider a variety of biologically important problems defined primarily on sequences, or (more in the computer science vernacular) on strings: reconstructing long strings of DNA from overlapping string fragments; determining physical and genetic maps from probe data under various experimental protocols; storing, retrieving, and comparing DNA strings; comparing two or more strings for similarities; searching databases for related strings and substrings; defining and exploring different notions of string relationships; looking for new or ill-defined patterns occurring frequently in DNA; and looking for structural patterns in DNA and protein.

1. The other long-term members were William Chang, Gene Lawler, Dalit Naor, and Frank Olken.

I am grateful for the support of my work and the work of my students and postdoctoral researchers, including the special year on computational biology held by the DIMACS Center for Discrete Mathematics and Theoretical Computer Science. I would also like to thank the following people for the help they have given me along the way: Stephen Altschul, David Axelrod, Doug Brutlag, William Chang, Archie Cobbs, Richard Cole, Russ Doolittle, Martin Farach, Jane Gitschier, George Hartzell, Paul Horton, Robert Irving, Sorin Istrail, Tao Jiang, Dick Karp, John Kececioglu, Jim Knight, Dina Kravets, Gad Landau, Gene Lawler, Udi Manber, Marci McClure, Kevin Murphy, Gene Myers, Dalit Naor, John Nguyen, Frank Olken, Mike Paterson, William Pearson, Pavel Pevzner, R. Ravi, Fred Roberts, Hershel Safer, Baruch Schieber, Ron Shamir, Jay Snoddy, Sylvia Spengler, Paul Stelling, Elizabeth Sweedyk, Martin Tompa, Esko Ukkonen, Lusheng Wang, Tandy Warnow, and Mike Waterman.

PART I

Exact String Matching: The Fundamental String Problem

Exact matching: what's the problem?

Given a string P called the pattern and a longer string T called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text T. For example, if P = aba and T = bbabaxababay then P occurs in T starting at locations 3, 7, and 9. Note that two occurrences of P may overlap, as illustrated by the occurrences of P at locations 7 and 9.

Importance of the exact matching problem

The practical importance of the exact matching problem should be obvious to anyone who uses a computer. The problem arises in widely varying applications, too numerous to even list completely. Some of the more common applications are in word processors; in utilities such as grep on Unix; in textual information retrieval programs such as Medline, Lexis, or Nexis; in library catalog searching programs that have replaced physical card catalogs in most large libraries; in internet browsers and crawlers, which sift through massive amounts of text available on the internet for material containing specific keywords; in internet news readers that can search the articles for topics of interest; in the giant digital libraries that are being planned for the near future; in electronic journals that are already being "published" on-line; in telephone directory assistance; in on-line encyclopedias and other educational CD-ROM applications; and in on-line dictionaries and thesauri, especially those with cross-reference features (the Oxford English Dictionary project has created an electronic OED containing 50 million words). In molecular biology there are several hundred specialized databases holding raw DNA, RNA, and amino acid strings, or processed patterns (called motifs) derived from the raw string data. Some of these databases will be discussed in Chapter 15.

Hasn't exact matching been so well solved that it can be put in a black box and taken for granted? Right now, I am editing a ninety-page file using an "ancient" shareware word processor and a PC clone (486), and every exact match command that I've issued executes faster than I can blink. That's rather depressing for someone writing a book containing a large section on exact matching algorithms. So is there anything left to do on this problem? The answer is that for typical word-processing applications there probably is little left to do. The exact matching problem is solved for those applications (although other more sophisticated string tools might be useful in word processors). But the story changes radically for other applications. Users of Melvyl, the on-line catalog of the University of California library system, often experience long, frustrating delays even for fairly simple matching requests. Even grepping through a large directory can demonstrate that exact matching is not yet trivial. Recently we used GCG (a very popular interface to search DNA and protein databanks) to search Genbank (the major U.S. DNA database) for a thirty-character string, which is a small string in typical uses of Genbank. The search took over four hours (on a local machine using a local copy of the database) to find that the string was not there.¹ And Genbank today is only a fraction of the size it will be when the various genome programs go into full production mode, cranking out massive quantities of sequenced DNA. Certainly there are faster, common database searching programs (for example, BLAST), and there are faster machines one can use (for example, an e-mail server is available for exact and inexact database matching running on a 4,000 processor MasPar computer). But the point is that the exact matching problem is not so effectively and universally solved that it needs no further attention. It will remain a problem of interest as the size of the databases grow and also because exact matching will continue to be a subtask needed for more complex searches that will be devised.²

1. We later repeated the test using the Boyer-Moore algorithm on our own raw copy of Genbank. The search took less than ten minutes, most of which was devoted to movement of text between the disk and the computer, with less than one minute used for the actual text search.

2. I just visited the Alta Vista web page maintained by the Digital Equipment Corporation. The Alta Vista database contains over 21 billion words collected from over 10 million web sites. A search for all web sites that mention "Mark Twain" took a couple of seconds and reported that twenty thousand sites satisfy the query. See [W2] for other applications.

Even assuming that the exact matching problem itself is sufficiently solved, one might ask whether the problem is still of any research or educational interest. The answer is that the entire field of string algorithms remains vital and open, and the education one gets from studying exact matching may be crucial for solving less understood problems. Perhaps the most important reason to study exact matching in detail is to understand the various ideas developed for it; many of these will be illustrated in this book. That education has three parts: specific algorithms, general algorithmic styles, and analysis and proof techniques. All three are covered in this book, but style and proof technique get the major emphasis.

Overview of Part I

In Chapter 1 we present naive solutions to the exact matching problem and develop the fundamental tools needed to obtain more efficient methods. Although the classical solutions to the problem will not be presented until Chapter 2, we will show at the end of Chapter 1 that the use of fundamental tools alone gives a simple linear-time algorithm for exact matching. Chapter 2 develops several classical methods for exact matching, using the fundamental tools developed in Chapter 1. Chapter 3 looks more deeply at those methods and extensions of them. Chapter 4 moves in a very different direction, exploring methods for exact matching based on arithmetic-like operations rather than character comparisons. Although exact matching is the focus of Part I, some aspects of inexact matching and the use of wild cards are also discussed. The exact matching problem will be discussed again in Part II, where it (and extensions) will be solved using suffix trees.
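Before moving on, the problem statement itself is small enough to check directly. The sketch below is not any of the algorithms discussed in this book, just a literal transcription of the definition (the function name is mine); it reports 1-based starting positions, matching the convention used in the text:

```python
def occurrences(P, T):
    """Report every 1-based position of T at which pattern P begins."""
    n = len(P)
    return [i + 1 for i in range(len(T) - n + 1) if T[i:i + n] == P]

# The example from the text; note that occurrences of P may overlap.
print(occurrences("aba", "bbabaxababay"))  # [3, 7, 9]
```

Because every alignment is checked independently, overlapping occurrences (here at positions 7 and 9) are reported naturally.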

Basic string definitions

We will introduce most definitions at the point where they are first used, but several definitions are so fundamental that we introduce them now.

Definition  For any string S, S(i) denotes the ith character of S, and S[i..j] denotes the substring of S that begins at position i and ends at position j. The substring S[i..j] is the empty string if i > j. A prefix of S is a substring that begins at position 1, and a suffix of S is a substring that ends at position |S|, where |S| denotes the length of S.

For example, california is a string, lifo is a substring, cal is a prefix, and ornia is a suffix.

Terminology confusion. A substring consists of contiguous characters of S, whereas the characters in a subsequence might be interspersed with characters not in the subsequence. Worse, in the biological literature one often sees the word "sequence" used in place of "subsequence". However, in this book we will always maintain a distinction between "subsequence" and "substring" and never use "sequence" for "subsequence". We will generally use "string" when pure computer science issues are discussed and use "sequence" or "string" interchangeably in the context of biological applications. Of course, we will also use "sequence" when its standard mathematical meaning is intended. The first two parts of this book primarily concern problems on strings and substrings; problems on subsequences are considered in Parts III and IV.

We will usually use the symbol S to refer to an arbitrary fixed string that has no additional assumed features or roles. However, when a string is known to play the role of a pattern or the role of a text, we will refer to the string as P or T respectively. We will use lower case Greek characters (α, β, γ) to refer to variable strings and use lower case roman characters to refer to single variable characters.

1 Exact Matching: Fundamental Preprocessing and First Algorithms

1.1. The naive method

Almost all discussions of exact matching begin with the naive method, and we follow this tradition. The naive method aligns the left end of P with the left end of T and then compares the characters of P and T left to right until either two unequal characters are found or until P is exhausted, in which case an occurrence of P is reported. In either case, P is then shifted one place to the right, and the comparisons are restarted from the left end of P. This process repeats until the right end of P shifts past the right end of T.

Figure 1.1 gives a flavor of these ideas, using P = abxyabxz and T = xabxyabxyabxz. Note that an occurrence of P begins at location 6 of T. The naive algorithm first aligns P at the left end of T, immediately finds a mismatch, and shifts P by one position. It then finds that the next seven comparisons are matches and that the succeeding comparison (the ninth overall) is a mismatch. It then shifts P by one place, immediately finds a mismatch, and repeats this cycle two additional times, until the left end of P is aligned with character 6 of T. At that point it finds eight matches and concludes that P occurs in T starting at position 6. In this example, a total of twenty comparisons are made by the naive algorithm.

[Figure 1.1: The first scenario illustrates pure naive matching, and the next two illustrate smarter shifts. A caret beneath a character indicates a match and a star indicates a mismatch made by the algorithm.]
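The naive method can be sketched directly. The function below (its name and the comparison counter are mine, not part of the original presentation) also counts character comparisons, which makes it easy to confirm the total of twenty comparisons claimed for the example P = abxyabxz, T = xabxyabxyabxz:

```python
def naive_match(P, T):
    """Naive method: align P at each position of T, compare left to right,
    shift by one place after a mismatch or a reported occurrence."""
    n, m = len(P), len(T)
    hits, comparisons = [], 0
    for s in range(m - n + 1):          # left end of P aligned with T[s]
        j = 0
        while j < n:
            comparisons += 1
            if P[j] != T[s + j]:        # mismatch: stop and shift P
                break
            j += 1
        if j == n:                      # P exhausted: an occurrence
            hits.append(s + 1)          # 1-based positions, as in the text
    return hits, comparisons

hits, comps = naive_match("abxyabxz", "xabxyabxyabxz")
print(hits, comps)  # [6] 20
```

The twenty comparisons break down exactly as described in the text: one mismatch at the first alignment, eight comparisons at the second, one mismatch at each of the next three alignments, and eight matches at the successful alignment.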

Early ideas for speeding up the naive method

The first ideas for speeding up the naive method all try to shift P by more than one character when a mismatch occurs, but never to shift it so far as to miss an occurrence of P in T. Shifting by more than one position saves comparisons since it moves P through T more rapidly. In addition to shifting by larger amounts, some methods try to reduce comparisons by skipping over parts of the pattern after a shift. We will examine many of these ideas in detail.

A smarter algorithm might realize, after the ninth comparison, that the next three comparisons of the naive algorithm will be mismatches. This smarter algorithm skips over the next three shift/compares, immediately moving the left end of P to align with position 6 of T, thus saving three comparisons. How can a smarter algorithm do this? After the ninth comparison, the algorithm knows that the first seven characters of P match characters 2 through 8 of T. If it also knows that the first character of P (namely a) does not occur again in P until position 5 of P, it has enough information to conclude that character a does not occur again in T until position 6 of T. Hence it has enough information to conclude that there can be no matches between P and T until the left end of P is aligned with position 6 of T.

An even smarter algorithm knows that the next occurrence in P of the first three characters of P (namely abx) begins at position 5. Then since the first seven characters of P were found to match characters 2 through 8 of T, this smarter algorithm has enough information to conclude that when the left end of P is aligned with position 6 of T, the next three comparisons must be matches. This smarter algorithm avoids making those three comparisons. Instead, after the left end of P is moved to align with position 6 of T, the algorithm compares character 4 of P against character 9 of T. This smarter algorithm therefore saves a total of six comparisons over the naive algorithm.

The above example illustrates the kinds of ideas that allow some comparisons to be skipped, although it should still be unclear how an algorithm can efficiently implement these ideas. Reasoning of this sort is the key to shifting by more than one character. In the above example, the smarter method was assumed to know that character a did not occur again until position 5, and the even smarter method was assumed to know that the pattern abx was repeated again starting at position 5. This assumed knowledge is obtained in the preprocessing stage.

1.2. The preprocessing approach

Many string matching and analysis algorithms are able to efficiently skip comparisons by first spending "modest" time learning about the internal structure of either the pattern P or the text T. During that time, the other string may not even be known to the algorithm. This part of the overall algorithm is called the preprocessing stage. Preprocessing is followed by a search stage, where the information found during the preprocessing stage is used to reduce the work done while searching for occurrences of P in T.

For the exact matching problem, all of the algorithms mentioned in the previous section preprocess pattern P. (The opposite approach of preprocessing text T is used in other algorithms, such as those based on suffix trees.) These preprocessing methods, as originally developed, are similar in spirit but often quite different in detail and conceptual difficulty. In this book we take a different approach and do not initially explain the originally developed preprocessing methods. Rather, we highlight the similarity of the preprocessing tasks needed for several different matching algorithms, by first defining a fundamental preprocessing of P that is independent of any particular matching algorithm. Then we show how each specific matching algorithm uses the information computed by the fundamental preprocessing of P. The result is a simpler, more uniform exposition of the preprocessing needed by several classical matching methods, and a simple linear-time algorithm for exact matching based only on this preprocessing (discussed in Section 1.5). This approach to linear-time pattern matching was developed in [202]. Those methods, as originally developed, will be explained later in the book.

1.3. Fundamental preprocessing of the pattern

Fundamental preprocessing will be described for a general string denoted by S. In specific applications of fundamental preprocessing, S will often be the pattern P, but here we use S instead of P because fundamental preprocessing will also be applied to strings other than P.

Definition  Given a string S and a position i > 1, let Zi(S) be the length of the longest substring of S that starts at i and matches a prefix of S.

For example, when S = aabcaabxaaz:
Z5(S) = 3 (aabc...aabx...),
Z6(S) = 1 (aa...ab...),
Z7(S) = Z8(S) = 0,
Z9(S) = 2 (aab...aaz).

When S is clear by context, we will use Zi in place of Zi(S).

To introduce the next concept, consider the boxes drawn in Figure 1.2. Each box starts at some position j > 1 such that Zj is greater than zero, and the length of the box starting at j is meant to represent Zj. Therefore, each box in the figure represents a maximal-length substring of S that matches a prefix of S and that does not start at position one. Each such box is called a Z-box. We use ri to denote the right-most end of any Z-box that begins at or to the left of position i, and α to denote the substring in the Z-box ending at ri. We use li for the position of the left end of the Z-box that ends at ri; in case there is more than one Z-box ending at ri, then li can be chosen to be the left end of any of those Z-boxes.

[Figure 1.2: Each solid box represents a substring of S that matches a prefix of S and that starts between positions 2 and i. Each box is called a Z-box. The copy of α that occurs as a prefix of S is also shown in the figure.]

As an example, suppose S = aabaabcaxaabaabcy; then Z10 = 7, r15 = 16, and l15 = 10.

The linear-time computation of Z values from S is the fundamental preprocessing task that we will use in all the classical linear-time matching algorithms that preprocess P. But before detailing those uses, we show how to do the fundamental preprocessing in linear time.

1.4. Fundamental preprocessing in linear time

The task of this section is to show how all the Zi values can be computed in O(|S|) time. The algorithm computes Zi, ri, and li for each successive position i, starting from i = 2. All the Z values computed will be kept by the algorithm, but in any iteration i, only the values ri-1 and li-1 are needed; no earlier r or l values are used. Hence the algorithm only needs a single variable r and a single variable l to hold the most recent values.

To see how earlier Z values can help, consider a concrete example. Suppose k = 121, all the values Z2 through Z120 have already been computed, and r120 = 130 and l120 = 100. That means there is a substring of length 31 starting at position 100 and matching a prefix of S (of length 31). It follows that the substring of length 10 starting at position 121 must match the substring of length 10 starting at position 22 of S, and so Z22 may be very helpful in computing Z121. As one case, if Z22 is three, then a little reasoning shows that Z121 must also be three. Thus in this illustration, Z121 can be deduced without any additional character comparisons. The main idea of the algorithm is to use the already computed Z values to accelerate the computation of Zk in exactly this way. The cases will be formalized and proven correct below.

The Z algorithm

To begin, the algorithm finds Z2 by explicitly comparing, left to right, the characters of S[2..|S|] and S[1..|S|] until a mismatch is found; Z2 is the length of the matching string. If Z2 > 0, then r = r2 is set to Z2 + 1 and l = l2 is set to 2. Otherwise r and l are set to zero. Now assume inductively that the algorithm has correctly computed Zi for i up to k - 1, and that it knows the current r = rk-1 and l = lk-1. The algorithm next computes Zk, r = rk, and l = lk.

1. If k > r, then the algorithm finds Zk by explicitly comparing the characters starting at position k to the characters starting at position 1 of S, until a mismatch is found. If Zk > 0, then r is set to k + Zk - 1 and l is set to k.

2. If k <= r, then position k is contained in a Z-box, so S(k) is contained in the substring α = S[l..r] that matches a prefix of S. Hence character S(k) also appears at position k' = k - l + 1 of S, and the substring β = S[k..r] must match the substring of the same length starting at position k' (see Figure 1.3). There are two subcases.

2a. If Zk' < |β|, then Zk = Zk', and r and l remain unchanged (see Figure 1.4).

2b. If Zk' >= |β|, then the entire substring S[k..r] must be a prefix of S, so Zk >= |β| = r - k + 1. The algorithm then explicitly compares the characters starting at position r + 1 of S to the characters starting at position |β| + 1 of S, until a mismatch is found. Say the mismatch occurs at position q >= r + 1. Then Zk is set to q - k, r is set to q - 1, and l is set to k (see Figure 1.5).

[Figure 1.3: String S[k..r] is labeled β and also occurs starting at position k' of S.]

[Figure 1.4: Case 2a. The longest string starting at k' that matches a prefix of S is shorter than |β|. In this case, Zk = Zk'.]

[Figure 1.5: Case 2b. The longest string starting at k' that matches a prefix of S is at least |β|.]

Theorem 1.4.1. Using Algorithm Z, value Zk is correctly computed and variables r and l are correctly updated.

PROOF  In Case 1, Zk is set correctly since it is computed by explicit comparisons. Also (since k > r in Case 1), before Zk is computed, no Z-box has been found that starts between positions 2 and k - 1 and that ends at or after position k. Therefore, when Zk > 0 in Case 1, the algorithm does find a new Z-box ending at or after k, and it is correct to change r to k + Zk - 1. Hence the algorithm works correctly in Case 1.

In Case 2a, the substring beginning at position k can match a prefix of S only for length Zk' < |β|. If not, then the next character to the right, character k + Zk', must match character 1 + Zk'. But character k + Zk' matches character k' + Zk' (since Zk' < |β|), so character k' + Zk' must match character 1 + Zk'. However, that would be a contradiction to the definition of Zk', for it would establish a substring longer than Zk' that starts at k' and matches a prefix of S. Hence Zk = Zk' in this case. Further, k + Zk - 1 < r, so r and l remain correctly unchanged.

In Case 2b, β must be a prefix of S (as argued in the body of the algorithm), and since any extension of this match is explicitly verified by comparing characters beyond r to characters beyond the prefix β, the full extent of the match is correctly computed. Hence Zk is correctly obtained in this case. Furthermore, since k + Zk - 1 >= r, the algorithm correctly changes r and l. □

Theorem 1.4.2. All the Zi(S) values are computed by the algorithm in O(|S|) time.

PROOF  The time is proportional to the number of iterations, |S|, plus the number of character comparisons. Each comparison results in either a match or a mismatch, so we next bound the number of matches and mismatches that can occur. Each iteration ends the first time a comparison finds a mismatch, so there are at most |S| mismatches in total. Every match in an iteration extends the matched region past the current value of r, and r never decreases and is bounded by |S|, so there can be at most |S| matches in total. Hence the total time is O(|S|). □
1.5. The simplest linear-time exact matching algorithm

Before discussing the more complex (classical) exact matching methods, we show that fundamental preprocessing alone provides a simple linear-time exact matching algorithm. This is the simplest linear-time matching algorithm we know of. Recall that P has length n, T has length m, and n <= m. Let S = P$T be the string consisting of P followed by the symbol "$" followed by T, where "$" is a character appearing in neither P nor T. Then S has length n + m + 1 = O(m). Compute Zi(S) for i from 2 to n + m + 1. Because "$" does not appear in P or T, Zi <= n for every i > 1. Any value of i > n + 1 such that Zi(S) = n identifies an occurrence of P in T starting at position i - (n + 1) of T. Conversely, if P occurs in T starting at position j of T, then Z(n+1)+j must be equal to n. Since all the Zi(S) values can be computed in O(n + m) = O(m) time, this method correctly yields all the occurrences of P in T in O(m) time.

The method can be implemented to use only O(n) space (in addition to the space needed for pattern and text) independent of the size of the alphabet. There is no need to record the Z values for characters in T; we only need to record the Z values for the n characters in P and to maintain the current l and r. Those values are sufficient to compute (but not store) the Z value of each character in T and hence to identify and output any position i where Zi = n.

There is another characteristic of this method worth introducing here: it is an alphabet-independent linear-time method. That is, we never had to assume that the alphabet size was finite or that we knew the alphabet ahead of time; a character comparison only determines whether the two characters match or mismatch, and the method needs no further information about the alphabet. We will see that this characteristic is also true of the Knuth-Morris-Pratt and Boyer-Moore algorithms, but not of the Aho-Corasick algorithm or methods based on suffix trees.

Why continue?

Since function Zi can be computed for the pattern in linear time and can be used directly to solve the exact matching problem in O(m) time (with only O(n) additional space), one may ask why continue? In what way are more complex methods (Knuth-Morris-Pratt, Boyer-Moore, real-time matching, Aho-Corasick, suffix trees, etc.) deserving of attention? Efficient implementations have been devised for a number of these algorithms, and all of them run in linear, O(n + m), time. The Boyer-Moore algorithm (with the proper implementation) also runs in linear worst-case time but typically runs in sublinear time, examining only a fraction of the characters of T. Hence it is the preferred method in most cases. The Apostolico-Giancarlo version of it is valuable because it has all the advantages of the Boyer-Moore method and yet allows a relatively simple proof of linear worst-case running time. Methods based on suffix trees typically preprocess the text rather than the pattern and then lead to algorithms in which the search time is proportional to the size of the pattern rather than the size of the text. This is an extremely desirable feature. Moreover, suffix trees can be used to solve much more complex problems than exact matching, including problems that are not easily solved by direct application of the fundamental preprocessing. The details will be discussed in the next two chapters.

1.6. Exercises

The first four exercises use the fact that fundamental preprocessing can be done in linear time and that all occurrences of P in T can be found in linear time.

1. Use the existence of a linear-time exact matching algorithm to solve the following problem in linear time: Given two strings α and β, determine if α is a circular (or cyclic) rotation of β. Here α is a circular rotation of β if α and β have the same length and α consists of a suffix of β followed by a prefix of β. For example, defabc is a circular rotation of abcdef. This is a classic problem with a very elegant solution.

2. A circular string of length n is a string in which character n is considered to precede character 1 (see Figure 1.6). Another way to think about a circular string β is through the linear string obtained from it by starting at character 1 and reading around the circle; for example, the linear string derived from the circular string in Figure 1.6 is accatggc. Give a linear-time algorithm to determine whether a linear string α is a substring of a circular string β.

3. Suffix-prefix matching. Give an algorithm that takes in two strings α and β, of lengths n and m, and finds the longest suffix of α that exactly matches a prefix of β. The algorithm should run in O(n + m) time.

4. Tandem arrays. A substring α contained in string S is called a tandem array of β (called the base) if α consists of more than one consecutive copy of β. For example, if S = xyzabcabcabcabcpq, then α = abcabcabcabc is a tandem array of β = abc. (Note that S also contains a tandem array of abcabc, that is, a tandem array with a longer base.) A maximal tandem array is a tandem array that cannot be extended either left or right. Given the base β, a tandem array of β in S can be described by two numbers (s, k), giving its starting location in S and the number of times β is repeated. Suppose S has length n. Give an example to show that two maximal tandem arrays of a given base β can overlap. Then give an O(n)-time algorithm that takes S and β as input, finds every maximal tandem array of β, and outputs the pair (s, k) for each occurrence. Since maximal tandem arrays of a given base can overlap, a naive algorithm would establish only an O(n²)-time bound.

5. In Case 2b of the Z algorithm, the case can be split into two subcases, one for Zk' > |β| and one for Zk' = |β|. Show that if Zk' > |β| then Zk = |β|, and hence no character comparisons are needed in that subcase. This is a reasonable way to organize the algorithm so as to eliminate an unneeded character comparison.

6. If Case 2b is refined as in the previous exercise, would this result in an overall speedup of the algorithm? You must consider all operations, not just character comparisons.

7. Parameterized matching. Baker [43] introduced the following matching problem and applied it to a problem of software maintenance. The motivation is to track down duplication in a large software system, where one wants to find not only exact matches between sections of code but parameterized matches: a parameterized match between two sections of code means that one section can be transformed into the other by replacing the parameter names (e.g., identifiers and constants) of one section with the parameter names of the other via a one-to-one function. Formally, let Σ and Π be two alphabets containing no symbols in common. Each symbol in Σ is called a token and each symbol in Π is called a parameter. A string can consist of any combination of tokens and parameters. Two strings S1 and S2 p-match if and only if each token in S1 is opposite a matching token in S2, each parameter in S1 is opposite a parameter in S2, and the correspondence between parameters is one-to-one. For example, with Σ the uppercase and Π the lowercase English alphabet, S1 = XYabCaCXZddbW p-matches S2 = XYdxCdCXZccxW under the correspondence a to d, b to x, and d to c. Show how to adapt the Z algorithm so that, given a pattern and a text, all locations in the text where the pattern p-matches can be found in linear time.

8. Circular DNA. The exercises on circular strings are not merely theoretical. Although the DNA of higher organisms is organized into linear chromosomes, circular DNA is common and important: mitochondrial DNA is circular, bacteria such as E. coli contain circular DNA in their plasmids, and the DNA of some viruses is circular as well. Individual molecules in a viral population may differ from each other by mutations, and two circular molecules carrying the same information may be written down starting from different characters, so the problem of determining whether the DNA strings of two individuals are "the same" becomes the problem of determining whether one string is a circular rotation of the other [450]. Explain how the tools developed in the preceding exercises apply to such problems on circular DNA strings.

9. Matching in the cell. In E. coli, the enzyme RecA, together with ATP, allows a single strand of DNA to rapidly locate and hybridize to the region of a double-stranded DNA molecule whose opposite strand is complementary to it, a step in replication and recombination described in [475]. Remarkably, this matching is accomplished more rapidly than can be explained by kinetic analysis of an undirected search, so precisely this kind of matching problem is, in a sense, "solved" in nature; proposals for molecular computation (computing with, not merely about, DNA strings) push the analogy further. Discuss the relationship between such biological matching mechanisms and the computational matching problems and tools of this chapter.
This does In Baker's XYabCaCXZddbW to parameter the definition c-matches d in Notice to c in a in 5. can then be the A digression on circular strings in DNA in using the existence biological 0 fa linear-time Bacterial exact and small DNA in 5. be of use in those organisms. while parameter d in 5. properties. coli. circular called pla.occur in viruses.6: A circular string fj. Zq.) jlis by this natural in \$. in Sections 1. ::: 1/31.1). II the Z algorithm immediately findS that without Z2 = q > 0. (or b.) is opposite in \$.. is a Now give an O(n)-time algorithm that takes 5 and 13as input. Given the base (5. and finds the longest time. is a Circular a. Note that 5 array of abcabc{Le. arrayoffJ. until it finds when expllclt a mismaich. some virus genomes linear order individual a plausible exhibit circular.11.M.2 and without The above two problems matching However. comparison. In that experiment. A maximal either A tandem left 0 r right. Flesh out and justify the details when 01 this claim does explicit comparisons but in fact Argue that Therefore.) experiments in [475] can be described as substring these natural These matching systems problems relating to and linear DNA in E. EXERCISES 13 matches a prefix of I~' m. also contains tandem array = abcabcabcabc array =- abc.
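The rotation test in the first exercise reduces to exact matching: α is a circular rotation of β exactly when |α| = |β| and α occurs as a substring of ββ. A minimal sketch in Python follows; note that Python's built-in substring search is used here in place of a guaranteed linear-time matcher (such as one based on the Z algorithm), so it illustrates the reduction, not the time bound.

```python
def is_rotation(alpha: str, beta: str) -> bool:
    # alpha is a circular rotation of beta exactly when the two strings
    # have the same length and alpha occurs as a substring of beta + beta.
    return len(alpha) == len(beta) and alpha in beta + beta

assert is_rotation("cdab", "abcd")       # "abcd" rotated by two places
assert not is_rotation("acbd", "abcd")   # same characters, not a rotation
assert not is_rotation("abc", "abcd")    # lengths differ
```

Substituting a linear-time matcher for the `in` test gives the O(n) bound the exercise asks for.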

2.1. Introduction

This chapter develops a number of classical comparison-based matching algorithms for the exact matching problem. With suitable extensions, all of these algorithms can be implemented to run in linear worst-case time, and all achieve this performance by preprocessing pattern P. (Methods that preprocess T will be considered in Part II of the book.) The original preprocessing methods for these various algorithms are related in spirit but are quite different in conceptual difficulty; some of them are quite difficult. This chapter does not follow the original preprocessing methods but instead exploits the fundamental preprocessing, developed in the previous chapter, to implement the needed preprocessing for each specific matching algorithm. Also, in contrast to previous expositions, we emphasize the Boyer-Moore method over the Knuth-Morris-Pratt method, since Boyer-Moore is the practical method of choice for exact matching. Knuth-Morris-Pratt is nonetheless completely developed, partly for historical reasons, but mostly because it generalizes to problems such as real-time string matching and matching against a set of patterns more easily than Boyer-Moore does. These two topics will be described in this chapter and the next.

2.2. The Boyer-Moore Algorithm

As in the naive algorithm, the Boyer-Moore algorithm successively aligns P with T and then checks whether P matches the opposing characters of T. Further, after the check is complete, P is shifted right relative to T just as in the naive algorithm. However, the Boyer-Moore algorithm contains three clever ideas not contained in the naive algorithm: the right-to-left scan, the bad character shift rule, and the good suffix shift rule. Together, these ideas lead to a method that typically examines fewer than m + n characters and that, with a certain extension, runs in linear worst-case time.

2.2.1. Right-to-left scan

For any alignment of P with T, the Boyer-Moore algorithm checks for an occurrence of P by comparing characters from right to left. For example, consider the alignment of P against T shown below:

    12345678901234567
    xpbctbxabpqxctbpq
      tpabxab

To check whether P occurs in T at this position, the Boyer-Moore algorithm starts at the right end of P, first comparing T(9) with P(7). Finding a match, it then compares T(8) with P(6), etc., moving right to left until it finds a mismatch when comparing T(5) with P(3). At that point P is shifted right relative to T (the amount for the shift will be discussed below) and the comparisons begin again at the right end of P. Clearly, if P is shifted right by only one place after each mismatch, or after an occurrence of P is found, then the worst-case running time of this approach is O(nm) just as in the naive algorithm. So at this point it isn't clear why comparing characters from right to left is any better than checking from left to right. However, with two additional ideas (the bad character and the good suffix rules), shifts of more than one position often occur, and in typical situations large shifts are common. We next examine these two ideas.

2.2.2. Bad character rule

The point of this shift rule is to shift P by more than one character when possible. For each character x of the alphabet, let R(x) be the position of the right-most occurrence of x in P, with R(x) defined to be zero if x does not occur in P. When a mismatch occurs at position i of P and the mismatched character in T is x, the bad character rule shifts P right by max[1, i − R(x)] places. In the above example, T(5) = t mismatches with P(3), and R(t) = 1, so P can be shifted right by two positions. After the shift, the comparison of P and T begins again at the right end of P.
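The simple bad character rule can be sketched directly; the function names below are illustrative, not from the book. R(x) records the rightmost position of each character in P, and the shift after a mismatch at position i of P against text character x is max(1, i − R(x)):

```python
def rightmost_positions(P):
    # R(x): 1-based position of the rightmost occurrence of character x
    # in P; characters absent from P are treated as R(x) = 0 by the caller.
    R = {}
    for i, ch in enumerate(P, start=1):
        R[ch] = i
    return R

def bad_char_shift(R, i, x):
    # Simple bad character rule: on a mismatch at position i of P against
    # text character x, shift by i - R(x), but never less than one place
    # (the rule has no effect when the rightmost x lies at or right of i).
    return max(1, i - R.get(x, 0))

R = rightmost_positions("tpabxab")
assert R["t"] == 1 and R["b"] == 7
# Book example: T(5) = t mismatches P(3); R(t) = 1 gives a shift of 2.
assert bad_char_shift(R, 3, "t") == 2
```

The table costs O(|Σ|) space and one lookup per mismatch, matching the discussion below.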

Extended bad character rule

The bad character rule is a useful heuristic for mismatches near the right end of P, but it has no effect if the mismatching character from T occurs in P to the right of the mismatch point. This may be common when the alphabet is small and the text contains many similar, but not exact, substrings. That situation is typical of DNA, which has an alphabet of size four, and even protein, which has an alphabet of size twenty, often contains different regions of high similarity. In such cases, the following extended bad character rule is more robust: When a mismatch occurs at position i of P and the mismatched character in T is x, then shift P to the right so that the closest x to the left of position i in P is below the mismatched x in T. If there is no occurrence of x in P to the left of position i, then all of P is shifted past the x in T.

Because the extended rule gives larger shifts, the only reason to prefer the simpler rule is to avoid the added implementation expense of the extended rule. The simpler rule uses only O(|Σ|) space (Σ is the alphabet) for array R, and one table lookup for each mismatch. As we will see, the extended rule can be implemented to take only O(n) space and at most one extra step per character comparison. That amount of added space is not often a critical issue, but it is an empirical question whether the longer shifts make up for the added time used by the extended rule. The original Boyer-Moore algorithm only uses the simpler bad character rule.

Implementing the extended bad character rule

Rather than a two-dimensional table, keep for each character x a list of the positions where x occurs in P, in decreasing order. When a mismatch occurs at position i of P and the mismatched character in T is x, scan x's list from the top until we reach the first number less than i or discover there is none. If there is none, then there is no occurrence of x before i, and all of P is shifted past the x in T. Otherwise, the found entry gives the desired position of x. After a mismatch at position i of P, the time to scan the list is at most n − i, which is roughly the number of characters that matched. So in worst case, this approach at most doubles the running time of the Boyer-Moore algorithm. However, in most problem settings the added time will be vastly less than double. One could also do binary search on the list in circumstances that warrant it.

2.2.3. The (strong) good suffix rule

The bad character rule by itself is reputed to be highly effective in practice, particularly for English text [229], but it proves less effective for small alphabets and it does not lead to a linear worst-case running time. For that, we introduce another rule called the strong good suffix rule. The original preprocessing method [278] for the strong good suffix rule is generally considered quite difficult and somewhat mysterious (although a weaker version of it is easy to understand). In fact, the preprocessing for the strong rule was given incorrectly in [278] and corrected, without much explanation, in [384]. Code based on [384] is given without real explanation in the text by Baase [32], but there are no published sources that try to fully explain the method.¹ In contrast, the fundamental preprocessing of P discussed in Chapter 1 makes the needed preprocessing simple. That is the approach we take here. Pascal code for strong preprocessing, based on an outline by Richard Cole [108], is shown in Exercise 24 at the end of this chapter.

Strong good suffix rule: Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next comparison to the left. Then find, if it exists, the right-most copy t' of t in P such that t' is not a suffix of P and the character to the left of t' in P differs from the character to the left of t in P. Shift P to the right so that substring t' in P is below substring t in T. (Figure 2.1: Good suffix shift rule, where character x of T mismatches with character y of P. Characters y and z of P are guaranteed to be distinct by the good suffix rule, so z has a chance of matching x.) If t' does not exist, then shift the left end of P past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T. If no such shift is possible, then shift P by n places to the right. If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T. If no such shift is possible, then shift P by n places.

For a specific example, consider the alignment of P and T given below:

    123456789012345678
    prstabstubabvqxrst
      qcabdabdab

When the mismatch occurs at position 8 of P and position 10 of T, t = ab, and t' occurs in P starting at position 3. Hence P is shifted right by six places, resulting in the following alignment:

    123456789012345678
    prstabstubabvqxrst
            qcabdabdab

Note that the extended bad character rule would have shifted P by only one place in this example.

¹ A recent plea for such an explanation appeared on the internet newsgroup comp.theory.
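Before formalizing the good suffix preprocessing, the right-to-left scan and the simple bad character shift already combine into a complete search. The sketch below (names illustrative, good suffix rule omitted for brevity) makes the structure concrete; as noted in the text, the bad character rule alone still has an O(nm) worst case.

```python
def boyer_moore_bad_char(P, T):
    # Boyer-Moore search using only the simple bad character rule:
    # scan right to left, then shift so the rightmost occurrence of the
    # mismatched text character lines up (always at least one place).
    n, m = len(P), len(T)
    R = {c: i for i, c in enumerate(P)}    # rightmost 0-based position in P
    occurrences = []
    k = n - 1                              # index in T of P's right end
    while k < m:
        i, h = n - 1, k
        while i >= 0 and P[i] == T[h]:     # right-to-left comparison
            i, h = i - 1, h - 1
        if i < 0:
            occurrences.append(k - n + 2)  # 1-based start of the match
            k += 1
        else:
            k += max(1, i - R.get(T[h], -1))
    return occurrences

assert boyer_moore_bad_char("tpabxab", "xpbctbxabpqxctbpq") == []
assert boyer_moore_bad_char("ab", "xabcab") == [2, 5]
```

Adding the good suffix shift (taking the maximum of the two proposed shifts) turns this into the full algorithm developed below.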

The original published Boyer-Moore algorithm [75] uses a simpler, weaker, version of the good suffix rule. That version just requires that the shifted P agree with the t and does not specify that the next characters to the left of those occurrences of t be different. An explicit statement of the weaker rule can be obtained by deleting the italicized phrase in the first paragraph of the statement of the strong good suffix rule. In the previous example, the weaker shift rule shifts P by three places rather than six. When we need to distinguish the two rules, we will call the simpler rule the weak good suffix rule and the rule stated above the strong good suffix rule. For the purpose of proving that the search part of Boyer-Moore runs in linear worst-case time, the weak rule is not sufficient, and in this book the strong version is assumed unless stated otherwise.

Theorem 2.2.1. The use of the good suffix rule never shifts P past an occurrence in T.

PROOF Suppose the right end of P is aligned with character k of T before the shift, and suppose that the good suffix rule shifts P so its right end aligns with character k' > k. Any occurrence of P ending at a position l strictly between k and k' would immediately violate the selection rule for k', since it would imply either that a closer copy of t occurs in P or that a longer prefix of P matches a suffix of t. □

2.2.4. Preprocessing for the good suffix rule

We now formalize the preprocessing needed for the Boyer-Moore algorithm.

Definition For each i, L(i) is the largest position less than n such that string P[i..n] matches a suffix of P[1..L(i)]; if there is no position satisfying this condition, then L(i) = 0. L'(i) is the largest such position with the added requirement that the character preceding that suffix differs from P(i − 1).

For example, if P = cabdabdab, then L(8) = 6 and L'(8) = 3.

The L'(i) values can be obtained in linear time from the following N values.

Definition For string P, N_j(P) is the length of the longest suffix of the substring P[1..j] that is also a suffix of the full string P.

For example, if P = cabdabdab, then N_3(P) = 2 and N_6(P) = 5.

Recall that Z_i(S) is the length of the longest substring of S that starts at i and matches a prefix of S. N is the reverse of Z: if P^r denotes the string obtained by reversing P, then N_j(P) = Z_{n−j+1}(P^r). Hence the N_j(P) values can be obtained in O(n) time by using Algorithm Z on P^r. The following theorem is then immediate.

Theorem 2.2.2. L(i) is the largest index j less than n such that N_j(P) ≥ |P[i..n]| (which is n − i + 1), and L'(i) is the largest index j less than n such that N_j(P) = |P[i..n]| = n − i + 1.

Given Theorem 2.2.2, it follows immediately that all the L'(i) values can be accumulated in linear time from the N values using the following algorithm:

Z-based Boyer-Moore
    for i := 1 to n do L'(i) := 0;
    for j := 1 to n − 1 do
    begin
        i := n − N_j(P) + 1;
        L'(i) := j;
    end;

The L(i) values (when desired) can then be obtained as follows:

    L(2) := L'(2);
    for i := 3 to n do L(i) := max[L(i − 1), L'(i)];

Theorem 2.2.3. The above method correctly computes the L values.

Final preprocessing detail

The preprocessing stage must also prepare for the case when L'(i) = 0 or when an occurrence of P is found. The following definition and theorem accomplish that.

Definition Let l'(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. If none exists, then let l'(i) be zero.

Theorem 2.2.4. l'(i) equals the largest j ≤ |P[i..n]|, which is n − i + 1, such that N_j(P) = j.

We leave the proof, as well as the problem of how to accumulate the l'(i) values in linear time, as a simple exercise. (Exercise 9 of this chapter.)
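The Z-based preprocessing just described can be sketched as follows: run the Z algorithm on the reversed pattern to get the N_j values, then accumulate L'(i) by scanning j upward so that the largest qualifying j survives. The arrays are 1-based to match the text's indexing; treat this as an illustration, not a tuned implementation.

```python
def z_values(S):
    # Standard Z algorithm: Z[i] (0-based, i >= 1) is the length of the
    # longest substring of S starting at i that matches a prefix of S.
    n = len(S)
    Z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z

def good_suffix_tables(P):
    n = len(P)
    Zr = z_values(P[::-1])
    # N[j] (1-based): longest suffix of P[1..j] that is also a suffix of
    # P, obtained as N_j(P) = Z_{n-j+1} of the reversed pattern.
    N = [0] * (n + 1)
    for j in range(1, n):
        N[j] = Zr[n - j]
    # L'[i] (1-based): largest j < n with N[j] = n - i + 1; scanning j
    # upward lets the largest qualifying j overwrite earlier ones.
    Lp = [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            Lp[n - N[j] + 1] = j
    return N, Lp

N, Lp = good_suffix_tables("cabdabdab")
assert N[3] == 2 and N[6] == 5    # the book's example values
assert Lp[8] == 3                 # L'(8) = 3 for P = cabdabdab
```

The l'(i) values can be accumulated from the same N array (Exercise 9).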

For example. and is often much inferior in practice to the Boyer--Moorc method (and others). ifi =Othen begin report an occurrence of P in T ending at position k k:=k+n-l'(2). When an occurrence of P is found. it is not trivial However. a mismatch occurs at position i-I of P and L'U) > O. in Section 3. the The best known linear-time algorithm for the exact matching problem is due to Knuth..3.1'(2) places.2. there is no shifting in the actual implementation. Both of these proofs were quite difficult and established worst-case time bounds no better than 5m comparisons. Although it is rarely the method of choice. The Apostolico-Giancarlo variant of Boyar-Moore is discussed in Section 3. a sublinear running time is almost always observed in practice. This was first proved by Knuth.3. Richard Cole gave a much simpler proof [108] establishing a bound of4m comparisons and also gave a difficult proof establishing a tight bound of 3m comparisons. while i > 0 and begin i:=i-I. which efficiently finds all occurrences in a text of any pattern from a set of patterns. '0 th.' 2. if we only use the bad character shift rule. We will later show.e. then the rule shift'i P by n . THE KNUTH-MORRIS-PRATI ALGORITHM 23 2. Notice that this can be deduced without even knowing what string T is or exactly how P is aligned with T. and given rules to determine by how much P should be "shifted".1 k:=n. these preprocessed values are used during the search stage of the algorithm to achieve larger shifts. Rather. for the case that P doesn't occur in T. but assuming randomly generated strings. We will present Cole's proof of 4m comparisons in Section 3. the index k is increased to the point where the right end of P would be "shifted". in typical string matching applications involving natural language text.2.1.wefullyde"'loptre method in Section 3.L'(i) places to the right.. We now formalize this idea . 
gives a "Boyer-Moore-like" algorithm that allows a fairly direct proof of a 2m worst-case bound on the number of comparisons. we use a variant of Galil's idea to achieve the linear time bound in all cases At the other extreme. and Pratt [2781. whilek::. The complete Beyer-Moore algorithm We have argued that neither the good suffix rule nor the bad character rule shift P so far as to miss any occurrence of P. and an alternate proof was given by Guibas and Odlyzko [196]. The first of these modifications was due to Galil [168]. Hence. in1h~field. Later. We won't discuss random string analysis in this book but refer the reader to [184]. Although Cole's proof for the linear worst case is vastly simpler than earlier proofs. The Knuth-Morris-Pratt shift idea For a given alignment of P with T. and for its hiSlorical rol. Note that the rules work correctly even when /'(i) = 0 One special case remains. Afterdiscussing Cole's proof. each act of shifting P takes constant time. a fairly simple extension of the Beyer-Moore algorithm. suppose the naive algorithm matches the first i characters of P against their counterparts in T and then mismatches on the next comparison. else shift P (increase k) by the maximum amount determined by the (extended) bad character rule and the good suffix rule. In the case that L'(i) == 0. so that the L'(i)-length prefix of the shifted P aligns with the C(i)·length suffix of tbc unshifted P.-e. if P = ubcxabcde and. The Knuth-Morris-Pratt algorithm P(i) = T(h) do h :=h -1. When the first comparison is a mismatch (i. and its linear time bound is (fairly) easily proved. 2.on. the good suffix rule shifts P by n -/'(i) places.22 EXACT MATCHING:CLASSICAL COMPARISON·BASED METHODS 2. The naive algorithm would shift P by just one place and begin comparing again from the left end of P. several simple modifications to the method correct this problem. But a larger shift may often be possible.6. For those Knu'h·Morri. However. 
The algorithm also forms the basis of the well-known Aho-Corasick algorithm.2. and is important in order to complete the full story of Boyer-Moore.. We will present several solution. Morris. then it is easily deduced (and we will prove below) that P can be shifted by four places without passing over any occurrences of P in T. If. in Section 3. that by using the strong good suffix rule alone. Note that although we have always talked about "shifting P>. the mismatch occurs in position 8 of P.lt . Morris.·Prattmethodhere . end.5. So the Beyer-Moore algorithm shifts by the largest amount given by either of the rules. due to Apostolico and Giancarlo [261. the expected running time is sublinear. it can be simply explained. Moreover.3.mdo begin h:=k. We can now present the complete algorithm The Boyer-Moore algorithm Beyer-Moore method has a worst-case running time of O(m) provided that the pattern does not appear in the text. yielding an O(m) time bound in all cases. P(n) mismatches) then P should be shifted one place to the right 2.. Only the location of the mismatch in P must be known.2 When the pattern does appear in the text then the original Beyer-Moore method runs in 0(nm) worst-case time. during the search stage.4. in the present alignment of P with T.et problem including the Aho-Corasick . and Prall [278]. The Knurh-Morrts-Pran algorithm is based on this kind of reasoning to make larger shifts than the naive algorithm makes. end. then the worst-case running time is O(nm). then the good suffix rule shifts P by n .2. The good suffix rule in the search stage of Beyer-Moore Having computed L'U) and l'(il for each position i in P.

Definition For each position i in pattern P, define sp_i(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P.

Stated differently, sp_i(P) is the length of the longest proper substring of P[1..i] that ends at i and that matches a prefix of P. When the string is clear by context we will use sp_i in place of the full notation. Note that by definition, sp_1 = 0 for any string. For example, if P = abcaeabcabd, then sp_2 = sp_3 = 0, sp_4 = 1, sp_8 = 3, and sp_10 = 2.

An optimized version of the Knuth-Morris-Pratt algorithm uses the following values.¹

Definition For each position i in pattern P, define sp'_i(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp'_i + 1) are unequal.

Clearly, sp'_i(P) ≤ sp_i(P) for any string. For example, if P = bbccaebbcabd, then sp_8 = 2, but sp'_8 = 1 since the single character b occurs as both the first and last character of P[1..8] and is followed by character b in position 2 and by character c in position 9.

The Knuth-Morris-Pratt shift rule: For any alignment of P and T, if the first mismatch (comparing from left to right) occurs in position i + 1 of P and position k of T, then shift P to the right (relative to T) so that P[1..sp'_i] aligns with T[k − sp'_i..k − 1]. In other words, shift P exactly i + 1 − (sp'_i + 1) = i − sp'_i places to the right, so that character sp'_i + 1 of P will align with character k of T. In the case that an occurrence of P has been found (no mismatch), shift P by n − sp'_n places.

The advantage of the shift rule is twofold. First, it often shifts P by more than just a single character. Second, after a shift, the left-most sp'_i characters of P are guaranteed to match their counterparts in T. Thus, to determine whether the newly shifted P matches its counterpart in T, the algorithm can start comparing P and T at position sp'_i + 1 of P (and position k of T).

As an example, suppose P = abcxabcde as above, T = xyabcxabcxadcdqfeg, and the left end of P is aligned with character 3 of T. Then P and T will match for 7 characters but mismatch on character 8 of P, and P will be shifted by 7 − sp'_7 = 7 − 3 = 4 places as shown below:

    123456789012345678
    xyabcxabcxadcdqfeg
      abcxabcde
          abcxabcde

As guaranteed, the first 3 characters of the shifted P match their counterparts in T (and their counterparts in the unshifted P).

Theorem 2.3.1. The use of the stronger shift rule based on sp'_i guarantees that the same mismatch will not occur again in the new alignment.

Theorem 2.3.2. For any alignment of P and T, if characters 1 through i of P match the opposing characters of T but character i + 1 mismatches T(k), then P can be shifted by i − sp'_i places to the right without passing over any occurrence of P in T.

PROOF Suppose not, so that there is an occurrence of P starting strictly to the left of the shifted P (see Figure 2.2), and let α and β be the substrings shown in the figure; β is the prefix of P of length sp'_i, shown relative to the shifted position of P. The unshifted P matches T up through position i of P and position k − 1 of T, and all characters in the (assumed) missed occurrence of P match their counterparts in T. Both of these matched regions contain the substrings α and β, so the unshifted P and the assumed occurrence of P match on the entire substring αβ. Hence αβ is a suffix of P[1..i] that matches a prefix of P. Now let l = |αβ| + 1, so that position l in the "missed occurrence" of P is opposite position k in T. Character P(l) cannot be equal to P(i + 1), since P(l) is assumed to match T(k) and P(i + 1) does not match T(k). Thus αβ is a proper suffix of P[1..i] that matches a prefix of P, and the next character is unequal to P(i + 1). But |α| > 0 due to the assumption that an occurrence of P starts strictly before the shifted P, so |αβ| > |β| = sp'_i, contradicting the definition of sp'_i. Hence the theorem is proved. □

(Figure 2.2: Assumed missed occurrence used in the correctness proof for Knuth-Morris-Pratt, shown relative to the shifted position of P.)

Theorem 2.3.2 says that the Knuth-Morris-Pratt shift rule does not miss any occurrence of P in T, and so the Knuth-Morris-Pratt algorithm will correctly find all occurrences of P in T.

¹ The reader should be alerted that traditionally the Knuth-Morris-Pratt algorithm has been described in terms of failure functions, which are related to the sp'_i values. Failure functions will be explicitly defined in Section 2.3.3.
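The definitions of sp_i and sp'_i can be checked directly with a small quadratic routine that simply spells them out (the Z-based preprocessing described later computes the same values in O(n)); here it is run against the two examples above.

```python
def sp_values(P):
    # Quadratic check of the definitions, 1-based arrays as in the text.
    # sp[i]:  longest proper suffix of P[1..i] matching a prefix of P.
    # spp[i]: sp'_i, with the added condition that the characters
    #         following the two copies differ (vacuous at i = n).
    n = len(P)
    sp = [0] * (n + 1)
    spp = [0] * (n + 1)
    for i in range(1, n + 1):
        for k in range(i - 1, 0, -1):          # longest candidates first
            if P[:k] == P[i - k:i]:
                if sp[i] == 0:
                    sp[i] = k
                if spp[i] == 0 and (i == n or P[k] != P[i]):
                    spp[i] = k
    return sp, spp

sp, spp = sp_values("abcaeabcabd")
assert (sp[2], sp[3], sp[4], sp[8], sp[10]) == (0, 0, 1, 3, 2)
sp, spp = sp_values("bbccaebbcabd")
assert sp[8] == 2 and spp[8] == 1     # the sp versus sp' example above
```

The assertions reproduce exactly the example values given in the text.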

Theorem 2.3.3. In the Knuth-Morris-Pratt method, the number of character comparisons is at most 2m.

PROOF Divide the algorithm into compare/shift phases, where a single phase consists of the comparisons done between successive shifts. After any shift, the comparisons in the phase go left to right and start either with the last character of T compared in the previous phase or with the character to its right. Since P is never shifted left, in any phase at most one comparison involves a character of T that was previously compared. Thus, the total number of character comparisons is bounded by m + s, where s is the number of shifts done in the algorithm. But s < m, since after m shifts the right end of P is certainly to the right of the right end of T, so the number of comparisons done is bounded by 2m. □

2.3.2. Preprocessing for Knuth-Morris-Pratt

The key to the speedup of the Knuth-Morris-Pratt algorithm over the naive algorithm is the use of sp' (or sp) values. It is easy to see how to compute all the sp' and sp values from the Z values obtained during the fundamental preprocessing of P. We verify this below.

Definition Position j > 1 maps to i if i = j + Z_j(P) − 1.

Theorem 2.3.4. For any i > 1, sp'_i(P) = Z_j, where j is the smallest position in the range 1 < j ≤ i that maps to i. If there is no such j, then sp'_i(P) = 0.

PROOF If sp'_i(P) is greater than zero, then there is a proper suffix β of P[1..i] that matches a prefix of P, such that P(i + 1) does not match P(|β| + 1). Letting j denote the start of β, Z_j = |β| = sp'_i(P), and j maps to i. We claim that j is the smallest position in the range 2 to i that maps to i. Suppose not, and let j* be a position in the range 1 < j* < j that maps to i. Then P[j*..i] would be a proper suffix of P[1..i] that matches a prefix (call it β̄) of P, with |β̄| > |β|; moreover, by the definition of mapping, P(i + 1) ≠ P(|β̄| + 1), so sp'_i(P) ≥ |β̄| > |β|, contradicting the assumption that sp'_i = |β|. Conversely, if there is no j in the range 1 < j ≤ i that maps to i, then sp'_i(P) must be zero. □

The proofs of the analogous claims for sp_i(P) are similar and are left as exercises. Given Theorem 2.3.4, all the sp' values can be computed in linear time using the Z_j values as follows:

Z-based Knuth-Morris-Pratt
    for i := 1 to n do sp'_i := 0;
    for j := n downto 2 do
    begin
        i := j + Z_j(P) − 1;
        sp'_i := Z_j(P);
    end;

The sp values are obtained by adding the following:

    sp_n := sp'_n;
    for i := n − 1 downto 2 do sp_i := max[sp_{i+1} − 1, sp'_i];

2.3.3. A full implementation of Knuth-Morris-Pratt

We have described the Knuth-Morris-Pratt algorithm in terms of shifting P, but we never accounted for time needed to implement shifts. The reason is that shifting is only conceptual and P is never explicitly shifted. Rather, as in the case of Boyer-Moore, pointers to P and T are incremented. We use pointer p to point into P and one pointer c (for "current" character) to point into T.

Definition For each position i from 1 to n + 1, define the failure function F'(i) to be sp'_{i−1} + 1 (and define F(i) = sp_{i−1} + 1), where sp'_0 and sp_0 are defined to be zero.

We will only use the (stronger) failure function F' in this discussion but will refer to F later. After a mismatch in position i + 1 > 1 of P, the Knuth-Morris-Pratt algorithm "shifts" P so that the next comparison is between the character in position c of T and the character in position sp'_i + 1 of P. But sp'_i + 1 = F'(i + 1), so a general "shift" can be implemented in constant time by just setting p to F'(i + 1). Two special cases remain. When the mismatch occurs in position 1 of P, then p is set to F'(1) = 1 and c is incremented by one. When an occurrence of P is found, then P is shifted right by n − sp'_n places. This is implemented by setting F'(n + 1) to sp'_n + 1.

Putting all the pieces together gives the full Knuth-Morris-Pratt algorithm.

Knuth-Morris-Pratt algorithm
begin
    Preprocess P to find F'(k) = sp'_{k−1} + 1 for k from 1 to n + 1;
    c := 1;
    p := 1;
    While c + (n − p) ≤ m do
    begin
        While P(p) = T(c) and p ≤ n do
        begin
            p := p + 1;
            c := c + 1;
        end;
        if p = n + 1 then
            report an occurrence of P starting at position c − n of T;
        if p = 1 then c := c + 1;
        p := F'(p);
    end;
end.

The time analysis is equally simple.
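The whole pipeline, fundamental (Z) preprocessing, the Z-based computation of sp', the failure function F', and the search loop, can be sketched in Python following the pseudocode above. The 1-based pattern and text positions are kept to match the text; treat this as an illustration rather than a tuned implementation.

```python
def z_values(S):
    # Fundamental preprocessing: Z[i] (0-based, i >= 1) is the length of
    # the longest substring starting at i that matches a prefix of S.
    n = len(S)
    Z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z

def kmp_search(P, T):
    n, m = len(P), len(T)
    Z = z_values(P)
    # Z-based sp' computation: j runs from n down to 2 so the smallest j
    # mapping to i = j + Z_j - 1 supplies sp'_i (1-based arrays).
    spp = [0] * (n + 2)
    for j in range(n, 1, -1):
        if Z[j - 1] > 0:
            spp[j + Z[j - 1] - 1] = Z[j - 1]
    # Failure function: F'(i) = sp'_{i-1} + 1, with F'(1) = 1.
    F = [1] * (n + 2)
    for i in range(2, n + 2):
        F[i] = spp[i - 1] + 1
    occurrences = []
    p = c = 1                          # 1-based pointers into P and T
    while c + (n - p) <= m:
        while p <= n and P[p - 1] == T[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            occurrences.append(c - n)  # 1-based start of the occurrence
        if p == 1:
            c += 1
        p = F[p]
    return occurrences

assert kmp_search("aa", "aaaa") == [1, 2, 3]
assert kmp_search("abcxabcde", "xyabcxabcxadcdqfeg") == []
```

The second assertion mirrors the running example: P = abcxabcde matches seven characters of that text before every candidate alignment fails, so no occurrence is reported.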

First... This shift guarantees that the prefix P[l. Then P should be shifted right by i .4. We will not use this terminology here and instead represent the method in pseudo code I I 2. a method must do at most a constant amount of work between the time it first examines any position in T and the time it last examines that position. Give a 'hand-wavinq" explanation tor this 2. In general terms.5p. Preprocessing for real-time string matching The proof of this theorem is almost identical to the proof of Theorem and is left to the reader. . and we leave that as an exercise. For that reason. such as when searching lor an English word in a book. positions.3."J matches the opposing substring in T and that I(k) matches the next character in P. as the size 01 the pattern grows.2. It follows that the search is done in real time.5. So. 2. The utility of a real-time matcher is two fold. Note also that if the search stage is real time it certainly is also linear time. sp. and it now also never reexamines a position involved in a mismatch. the Knuth-Morris-Pratt method is not a reid-time method To be real lime. "Common sense" and the (-l(nm) worst-case time bound of the Beyer-Moore algorithm {usingonlythebadcharacterrule)bothwouldsuggestthatempiricalrunningtimesincrease with increasing pattern length (assumingafixedtext). it does not guarantee that T(k) = P{sp. the resulting real-time method is generally referred to as a deterministic finite-state string matcher and is often represented with a finite state machine diagram..2 for the Knuth-Morris-Pratt algorithm. Preprocessing of P need not be real time. the earlier for Knuth-Morris-Pratt Z-based real-time fori:= I ton do matching 2. as indicated above. Give a handwaving explanation for this. If the size of the alphabet is explicitly included in the time and space bounds. if. this is not true when the position i~ involved in a mismatch. if a position of T is involved in a match. usual) that the alphabet is finite.l places. 
Below we show how to find all the SP(ix) values in linear time. + 1).Brute torce (the naive algorithmj is reported [184j to run laster than mostot her methods. When searching lora single word ora small phrase in a large Enqlish text. In the Knuth-Morris-Pratt method. this gives an algorithm that does linear preprocessing of P and real-time search of T. Note that the definition of real time only concerns the search stage of the algorithm. Although the shift based on sp. + I). EXERCISES 29 case an occurrence of P in T has been found) or until a mismatch occurs at some positions i + 1 of P andk ofT.> 0.ln the Ianer case. and the next comparison is between characters T(k) and P(sp. It is easy to establish that the algorithm finds all occurrences of P in T. The required preprocessing of P is quite similar to the preprocessing done in Section 2. No explicit comparison of those substrings is needed. in certain applications. Hence. P[ 1.1.r and P(i + 1). + I).5. such as when the characters of the text are being sent to a small memory machine.the simple bad character rule seems to be as etfective as lhe extended bad character rule. guarantees that P(i + I) differs from Plsp. In "typical" applications of exact matching. the method becomes real time because it still never reexamines a position in T involved in a match (a feature inherited from the Knuth-Morris-Pratt algorithm). But when sea rchinginactualEnglish Knowing the sP:i.d values for each character x in the alphabet allows a shift rule that converts the Knuth-Morris-Pratt method into a real-time Suppose P is compared against a substring of T and T(k) = . Exercises 1. it is never examined again (this is easy to verify) but.SP:i. the comparison between T(k) and The next needed comparison is between characters II I . For historical reasons.SP(i. Converting Knuth-Morris-Pratt to a real-time method Note that the linear time (and space) bound for this method require that the alphabet L be finite. 
then the preprocessing time and space needed for the algorithm is O(]I:]n) We will use the Z values obtained during fundamental preprocessing of P to convert the Knuth-Morris-Pratt method into a real-time method. of the length of the shift rule.l of the shifted pattern matches its opposing substring in T.28 EXACT MATCH!NG:CLASSICAL COMPARISON-BASED METHODS 2. one might need to guarantee that each character can be fully processed before the next one is due to arrive If the time for each character is constant. how would you expect this observation to hold up with smaller alphabets (say in DNA with an alphabet size of four). Hence Tlk) might be compared several times (perhaps Q(lPll times) with differing characters in P. the search stage of this algorithm never examines a character in T more than once. and when the text has many long sections of similar but not exact substrings? 3. then P is shifted tight by i -Ip. guaranteeing that the prefix. This allows us to do ]L I comparisons in constant time. Together.4.
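The deterministic finite-state matcher described above can be made concrete with a short sketch (this is an illustration in my own notation, not the book's pseudocode). For a finite alphabet it builds a table delta[i][x] giving the next matched-prefix length after reading character x, so the search does constant work per text character, exactly what the real-time definition requires:

```python
def build_dfa(p, alphabet):
    """delta[i][x]: length of pattern prefix matched after state i reads x."""
    m = len(p)
    # classical failure function (the sp values)
    sp = [0] * m
    v = 0
    for k in range(1, m):
        while v > 0 and p[v] != p[k]:
            v = sp[v - 1]
        if p[v] == p[k]:
            v += 1
        sp[k] = v
    delta = [dict() for _ in range(m + 1)]
    for x in alphabet:
        delta[0][x] = 1 if p[0] == x else 0
    for i in range(1, m + 1):
        for x in alphabet:
            if i < m and p[i] == x:
                delta[i][x] = i + 1           # extend the current match
            else:
                delta[i][x] = delta[sp[i - 1]][x]  # fall back, precomputed
    return delta

def search(p, t, alphabet):
    """Real-time scan: each text character is examined exactly once."""
    delta = build_dfa(p, alphabet)
    occ, state = [], 0
    for k, c in enumerate(t):
        state = delta[state][c]
        if state == len(p):
            occ.append(k - len(p) + 2)  # 1-based start position
    return occ
```

The preprocessing is O(|Sigma|n) time and space, matching the bound stated above; the search never reexamines a text position after a match or a mismatch.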

25. In this chapter, we showed how to use the Z values to compute both the sp_i values needed by Knuth-Morris-Pratt and the sp'_(i,x) values needed for its real-time extension. Instead of using Z values, show how to obtain these values from the sp_i and/or sp'_i values in linear [O(n|Sigma|)] time, where n is the length of P and |Sigma| is the size of the alphabet.

26. Prove that the shift rule used by the real-time string matcher does not miss any occurrences of P in T.

27. Although we don't know how to simply convert the Boyer-Moore algorithm to be a real-time method the way Knuth-Morris-Pratt was converted, we can make similar changes to the good suffix rule to make the Boyer-Moore shift more effective. That is, when a mismatch occurs between P(i) and T(h), we can look for the right-most copy in P[1..n] of the matched suffix (other than the suffix itself) such that the preceding character is not T(h). Show how to modify the preprocessing so that this rule can be implemented efficiently.

28. Below is an implementation of Richard Cole's preprocessing for the strong good suffix rule. Examine it and prove that it computes the same shift values as the Z-based method given in this chapter.

program gsmatch(input,output);
  {This is an implementation of Richard Cole's
   preprocessing for the strong good suffix rule}

type
  tstring = string[200];
  indexarray = array[0..201] of integer;

var
  p: tstring;
  matchshift: indexarray;
  i, m: integer;

procedure readstring(var p: tstring; var m: integer);
  begin
    readln(p);
    m := length(p);
    writeln('the length of the string is ', m)
  end;

procedure gsshift(p: tstring; var gs_shift: indexarray; m: integer);
  {gs_shift[i] is the amount to shift P when a mismatch occurs
   at position i of P, i.e., after the suffix p[i+1..m] has
   been matched against the text}
  var
    i, j: integer;
    kmp_shift: indexarray;
  begin
    for i := 0 to m do
      gs_shift[i] := 0;
    {stage 1: shifts determined by a reoccurrence of the
     matched suffix preceded by a different character}
    i := m;
    j := m + 1;
    kmp_shift[i] := j;
    while i > 0 do
      begin
        while (j <= m) and (p[i] <> p[j]) do
          begin
            if gs_shift[j] = 0 then
              gs_shift[j] := j - i;
            j := kmp_shift[j]
          end;
        i := i - 1;
        j := j - 1;
        kmp_shift[i] := j
      end;
    {stage 2: shifts determined by a prefix of P matching
     a suffix of the matched part of T}
    j := kmp_shift[0];
    for i := 0 to m do
      begin
        if gs_shift[i] = 0 then
          gs_shift[i] := j;
        if i = j then
          j := kmp_shift[j]
      end
  end;

begin {main}
  writeln('input a string on a single line');
  readstring(p, m);
  gsshift(p, matchshift, m);
  writeln('the value in cell i is the number of positions to shift');
  writeln('after a mismatch occurring in position i of the pattern');
  for i := 1 to m do
    writeln(matchshift[i])
end. {main}
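For readers who want to check the two-stage computation outside Pascal, here is a compact sketch of the same strong good suffix preprocessing (my own function name; 0-based indexing, so shift[j] is the shift to use when the suffix p[j:] has been matched and the comparison at p[j-1] mismatches):

```python
def good_suffix_shifts(p):
    """Strong good suffix shifts for Boyer-Moore.

    shift[j] is the amount to shift when the suffix p[j:] has been
    matched against the text and the comparison at p[j-1] mismatches;
    shift[len(p)] applies when nothing has matched yet.
    """
    m = len(p)
    shift = [0] * (m + 1)
    border = [0] * (m + 2)
    # Stage 1: the matched suffix reoccurs in p preceded by a
    # character different from the one that caused the mismatch.
    i, j = m, m + 1
    border[i] = j
    while i > 0:
        while j <= m and p[i - 1] != p[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = border[j]
        i -= 1
        j -= 1
        border[i] = j
    # Stage 2: only a prefix of p matches a suffix of the matched part.
    j = border[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:
            j = border[j]
    return shift
```

Stage 1 fills in shifts justified by a reoccurrence of the matched suffix; stage 2 falls back to the longest prefix of p that matches a suffix of the matched region, exactly as in the strong good suffix rule.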

29. Suppose we are given a tree in which each edge is labeled with one or more characters, and we are given a pattern P. The label of a subpath of the tree is the concatenation of the characters on the edges of the subpath. The problem is to find all subpaths of paths starting at the root that are labeled with the pattern P. Note that although each subpath must be part of a path directed from the root, the subpath itself need not start at the root. Give an algorithm for this problem that runs in time proportional to the total length of the labels on the edges of the tree plus the length of P. Assume a fixed-size alphabet.

Figure 2.3: The pattern P = aqra labels two subpaths of paths starting at the root, although the subpaths themselves do not start at the root. Note that an edge label is displayed from the top of the edge down towards the bottom of the edge; thus in the figure there is an edge labeled "qra", not "arq". There is also another subpath in the tree labeled aqra (it starts above the character z), but it violates the requirement that it be a subpath of a path starting at the root.

3. Exact Matching: A Deeper Look at Classical Methods

3.1. A Boyer-Moore variant with a "simple" linear time bound

Apostolico and Giancarlo [26] suggested a variant of the Boyer-Moore algorithm that allows a fairly simple proof of linear worst-case running time. With this variant, no character of T will ever be compared after it is first matched with any character of P. It is then immediate that the number of comparisons is at most 2m: Every comparison is either a match or a mismatch; there can only be m mismatches since each one results in a nonzero shift of P, and there can only be m matches since no character of T is compared again after it matches a character of P. We will also show that (in addition to the time for comparisons) the time taken for all the other work in this method is linear in m. Given the history of very difficult and partial analyses of the Boyer-Moore algorithm, it is quite amazing that a close variant of the algorithm allows a simple linear time bound. We present here a further improvement of the Apostolico-Giancarlo idea, resulting in an algorithm that simulates exactly the shifts of the Boyer-Moore algorithm. It infers and avoids many of the explicit matches that Boyer-Moore makes. The method therefore has all the rapid shifting advantages of the Boyer-Moore method as well as a simple linear worst-case time analysis.

3.1.1. Key ideas

Our version of the Apostolico-Giancarlo algorithm simulates the Boyer-Moore algorithm, finding exactly the same mismatches that Boyer-Moore would find and making exactly the same shifts. We take the following high-level view of the Boyer-Moore algorithm: We divide the algorithm into compare/shift phases numbered 1 through q <= m. In a compare/shift phase, the right end of P is aligned with a character of T, and P is compared right to left with selected characters of T until either all of P is matched or until a mismatch occurs. Then P is shifted right by some amount as given in the Boyer-Moore shift rules.

Recall that N_i(P) is the length of the longest suffix of P[1..i] that matches a suffix of P. In Section 2.4, where preprocessing for Boyer-Moore was discussed, we showed how to compute N_i for every i in O(n) time, where n is the length of P. We assume here that vector N has been obtained during the preprocessing of P.

Two modifications of the Boyer-Moore algorithm are required. First, during the search for P in T (after the preprocessing), we maintain an m-length vector M in which at most one entry is updated in every phase. Consider a phase where the right end of P is aligned with position j of T, and suppose that P and T match for l places (from right to left) but no farther. Then, at the end of the phase, set M(j) to a value k <= l (the rules for selecting k are detailed below). M(j) records the fact that a suffix of P of length k (at least) occurs in T and ends exactly at position j. As the algorithm proceeds, a value for M(j) is set for every position j in T that is aligned with the right end of P; M(j) is undefined for all other positions in T. The second modification exploits the vectors N and M to speed up the Boyer-Moore algorithm by inferring certain matches and mismatches.

3.1.2. One phase in detail

Consider a phase in which the right end of P is aligned with position j of T. Set h to j and i to n, and repeat the following steps until the phase ends.

Phase algorithm

1. If M(h) is undefined or M(h) = N_i = 0, then compare T(h) and P(i) as follows: If T(h) = P(i) and i = 1, then declare that an occurrence of P has been found in T ending at position j, set M(j) to j - h + 1, and shift according to the Boyer-Moore rules based on finding an occurrence of P ending at j (this ends the phase). If T(h) = P(i) and i > 1, then decrease both h and i by one and repeat the phase algorithm. If T(h) differs from P(i), then set M(j) to j - h (leaving M(j) undefined if j = h) and shift P by the Boyer-Moore rules based on a mismatch at position i of P (this ends the phase).

2. If M(h) < N_i, then P matches its counterpart in T from the right end of P down to position i - M(h) + 1 of P; no explicit comparisons are needed. Decrease both h and i by M(h) and repeat the phase algorithm.

3. If M(h) >= N_i and N_i = i > 0, then P matches T down to character 1 of P, so declare that an occurrence of P has been found in T ending at position j, set M(j) to j - h + N_i, and shift according to the Boyer-Moore rules based on finding an occurrence of P ending at j (this ends the phase).

4. If M(h) > N_i and N_i < i, then P matches T from the right end of P down to position i - N_i + 1 of P, but mismatches at position i - N_i. Set M(j) to j - h + N_i and shift P by the Boyer-Moore rules based on a mismatch at position i - N_i of P (this ends the phase).

5. If M(h) = N_i and 0 < N_i < i, then P matches T from the right end of P down to position i - N_i + 1 of P, but nothing can be inferred about the characters farther left. Decrease both h and i by N_i and repeat the phase algorithm.

Note that in every case M(j) is set to a value no larger than the number of characters matched in the phase, and hence no larger than n.

3.1.3. Correctness and linear-time analysis

Theorem 3.1.1. Using M and N as above, the Apostolico-Giancarlo variant of the Boyer-Moore algorithm correctly finds all occurrences of P in T.

PROOF We prove correctness by showing that the algorithm simulates the original Boyer-Moore algorithm. That is, for any given alignment of P with T, the algorithm is correct when it declares a match down to a given position and a mismatch at the next position. The interesting situation is Case 4. Suppose the Boyer-Moore algorithm is about to compare characters T(h) and P(i), and suppose the simulation knows that M(h) > N_i. That means that an N_i-length substring of P ends at position i and matches a suffix of P, while an M(h)-length substring of T ends at position h and matches a suffix of P. Since M(h) > N_i, the N_i-length suffixes of those two substrings must match, and we can conclude that the next N_i comparisons (from P(i) and T(h) moving leftward) in the Boyer-Moore algorithm would be matches. Further, if N_i = i, then P matches T from the right end of P down to character 1, so an occurrence of P in T has been found. If N_i < i, then the next pair of characters must mismatch [i.e., T(h - N_i) differs from P(i - N_i)], for otherwise a suffix of P[1..i] longer than N_i would match a suffix of P (see Figure 3.1). Hence P matches T for j - h + N_i characters and mismatches at position i - N_i of P. The other cases are argued similarly, and the rest of the simulation is correct since the shift rules are the same as in the Boyer-Moore algorithm. []

Figure 3.1: Substring alpha has length N_i and substring beta has length M(h) > N_i. The two strings must match from their right ends for N_i characters, but mismatch at the next character.

The following definitions and lemma will be helpful in bounding the work done by the algorithm.
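The vector N used throughout the simulation (N_i(P) is the length of the longest suffix of P[1..i] that matches a suffix of P) can be computed in O(n) time by running the Z algorithm on the reversed pattern, since a suffix match in P is a prefix match in its reversal. A minimal sketch of this standard computation (function names are my own):

```python
def z_array(s):
    """z[i]: length of the longest substring of s starting at i
    that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])  # reuse earlier Z-box
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def n_array(p):
    """N values: n[i] is the length of the longest suffix of p[:i+1]
    that matches a suffix of p (0-based analogue of N_{i+1}(P))."""
    return z_array(p[::-1])[::-1]
```

Note that the last entry equals the full pattern length, since the whole pattern trivially matches itself.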

Definition. If M(j) is defined, the interval [j - M(j) + 1, j] of T is called a covered interval.

Lemma 3.1.1. No two covered intervals cross, although one interval can contain another.

Figure 3.2: a. Diagram showing covered intervals that do not cross. b. Two covered intervals that do cross.

PROOF The proof is by induction on the number of intervals created. Certainly the claim is true until the first interval is created, and that interval does not cross itself. Now assume that no intervals cross and consider the phase where the right end of P is aligned with position j of T. Since j is to the right of any interval, h begins the phase outside any interval.

We consider how h could first be set to a position inside an interval, in a position other than the right end of that interval. An execution of Case 2 or 5 moves h from the right end of some interval I = [k, h] to position k - 1, one place to the left of I. Suppose that this is the first time in the phase that h (presently k - 1) is in an interval in a position other than its right end, say in some interval I'. The right end of I cannot be to the left of the right end of I' (for then position k - 1 would have been strictly inside I' before the move), and the right ends of I and I' cannot be equal (since M(h) has at most one value for any h). But these conditions imply that I and I' cross, which is assumed to be untrue. Furthermore, Case 1 is never executed when h is at the right end of an interval (since then M(h) is defined and greater than zero), and a match in Case 1 decrements h by only one place, so an execution of Case 1 cannot cause h to move beyond the right-most character of a covered interval. So in all cases, after any move, h + 1 is either not in a covered interval or is at the left end of an interval.

A new covered interval gets created only when the phase ends. If the phase ends after Case 1 with a mismatch, then j > h, M(j) gets set to j - h, and the interval [h + 1, j] is created after the algorithm examines position h; in Cases 3 and 4, M(j) is set to j - h + N_i, creating the interval [h - N_i + 1, j]. By the argument above, the left end of the new interval is either outside every existing interval or at the left end of one, so the new interval [j - M(j) + 1, j] does not cross any existing interval. The previously existing intervals have not changed, so there are no crossing intervals at the end of the phase, and the induction is complete. []

Theorem 3.1.2. Using M and N as above, the Apostolico-Giancarlo variant of Boyer-Moore makes at most 2m character comparisons, and (in addition to the time for comparisons) the time taken for all the other work is O(m).

PROOF Characters are explicitly compared only in Case 1. Every phase ends when a comparison finds a mismatch, and every phase, except the last, is followed by a nonzero shift of P. Thus the algorithm can find at most m mismatches. Whenever a phase ends, M(j) is set at least as large as j - h, so all characters in T that matched a character of P during that phase are contained in the covered interval [j - M(j) + 1, j]. The algorithm only examines the right end of an interval, and Case 1 is never executed when h is at the right end of an interval, so the algorithm never compares a character of T in the strict interior of a covered interval. Hence no character of T will ever be compared again after it is first in a match, the algorithm finds at most m matches, and the total number of character comparisons is bounded by 2m.

To bound the amount of additional work, we focus on the number of accesses of M during execution of the five cases, since the amount of additional work is proportional to the number of such accesses. A character comparison is done whenever Case 1 applies, and the phase ends whenever Case 3 or 4 applies, so Cases 1, 3, and 4 can apply at most O(m) times, since there are at most O(m) shifts and compares. However, it is possible that Case 2 or Case 5 can apply repeatedly without an immediate shift or immediate character comparison. For example, Case 5 would apply twice in a row (without a shift or character comparison) if N_i = M(h) > 0 and N_{i-N_i} = M(h - M(h)) > 0. But whenever Case 2 or 5 applies, h is moved from the right end of an interval to one place left of that interval, and at the end of the phase that position ends up strictly inside the new covered interval, so M there will never be accessed again. The effect is that Cases 2 and 5 can apply at most once for any position in T, so the number of accesses made when these cases apply is also O(m). []

3.2. Cole's linear worst-case bound for Boyer-Moore

In this section we present a linear worst-case time analysis of the original Boyer-Moore algorithm. We will show that by using the good suffix rule alone, the Boyer-Moore method has a worst-case running time of O(m), provided that the pattern does not appear in the text. Later we will extend the Boyer-Moore method to take care of the case that P does occur in T.

As in our analysis of the Apostolico-Giancarlo algorithm, we divide the Boyer-Moore algorithm into compare/shift phases numbered 1 through q <= m. In compare/shift phase i, a suffix of P is matched right to left with characters of T until either all of P is matched or until a mismatch occurs. Suppose that, for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next comparison to the left. Then the good suffix rule shifts P by the least amount such that the matched substring t again matches its counterpart in the shifted P and such that the characters of P aligned with the mismatched position of T before and after the shift differ. If no such shift is possible, then shift P by the least amount so that a prefix of the shifted pattern matches a suffix of the occurrence of t in T, if possible; otherwise, shift P by n places to the right. Similarly, if an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted pattern matches a suffix of the occurrence of P in T; if no such shift is possible, then shift P by n places.

3.2.1. Cole's proof when the pattern does not occur in the text

Definition. Let s_i denote the amount by which P is shifted right at the end of phase i. The portion of T matched during phase i is denoted t_i, and the mismatch ending the phase occurs just to the left of t_i.

Assume that P does not occur in T, so the compare part of every phase ends with a mismatch. In each compare/shift phase, we divide the comparisons into those that compare a character of T that has previously been compared (in a previous phase) and those that compare a character of T for the first time in the execution of the algorithm. Let g_i be the number of comparisons in phase i of the first type (comparisons involving a previously examined character of T), and let g'_i be the number of comparisons in phase i of the second type. Over the entire algorithm the number of comparisons is the sum over all phases of (g_i + g'_i), and our goal is to show that this sum is O(m). Certainly the sum of the g'_i terms is at most m, since a character of T can be compared for the first time only once. We will show that for any phase i, g_i <= 3s_i. Then, since the sum of the s_i terms is at most m (the total length of all the shifts is at most m), it will follow that the sum of the g_i terms is at most 3m, so the total number of comparisons done by the algorithm is at most 4m.

An initial lemma

We start with the following definitions and a lemma that is valuable in its own right; its proof is typical of the style of thinking used in dealing with overlapping matches.

Definition. For any string beta, beta^r denotes the string obtained by concatenating together r copies of beta.

Definition. A string alpha is semiperiodic with period beta if alpha consists of a (possibly empty) proper suffix of beta followed by one or more complete copies of beta. A string alpha is periodic with period beta if alpha consists of two or more complete copies of beta. Note that a periodic string is by definition also semiperiodic, and note also that a string cannot have itself as a period, although a period may itself be periodic.

For example, bcabcabc is semiperiodic with period abc, but it is not periodic, whereas abababab is periodic with period abab and also with the shorter period ab.

An alternate definition of a semiperiodic string is sometimes useful. We use the term prefix semiperiodic, for a string consisting of one or more complete copies of a string beta followed by a prefix of beta, to distinguish this definition from the definition given for "semiperiodic". At first glance the two definitions look different, but the following lemma (whose proof is simple and is left as an exercise) shows that these two definitions are really alternate reflections of the same structure: a string is semiperiodic with a period of a given length if and only if it is prefix semiperiodic with a period of the same length. For example, the string abaabaabaabaabaab is semiperiodic with period aab and is prefix semiperiodic with period aba.

Return to Cole's proof

Phase i ends by shifting P right by s_i positions. Let beta denote the suffix of P of length s_i.

Lemma 3.2.1. If g_i > 3s_i in phase i (so that, in particular, |t_i| > 3s_i), then P and t_i are semiperiodic with period beta.

PROOF Starting from the right end of P, mark off substrings of length s_i until fewer than s_i characters remain on the left (see Figure 3.4). There will be at least three full substrings, since |P| >= |t_i| > 3s_i. Consider how P aligns with T before and after the phase-i shift (see Figure 3.5). By the good suffix rule, the characters of the shifted P that lie below t_i agree with t_i, so each marked substring of P must equal the marked substring to its right, and hence each equals beta. Therefore P is semiperiodic with period beta, and t_i, which matches a suffix of P of length greater than 3s_i, is semiperiodic with period beta as well. []
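The two forms of the definition are easy to check mechanically. The following sketch (my own helper names; beta is assumed nonempty) tests both forms, and can be used to confirm on examples that the two notions coincide for periods of equal length:

```python
def is_semiperiodic(s, beta):
    """s = (suffix of beta) + beta^k for some k >= 1."""
    k = 0
    while s.endswith(beta):          # strip whole copies from the right
        s = s[:-len(beta)]
        k += 1
    return k >= 1 and len(s) < len(beta) and beta.endswith(s)

def is_prefix_semiperiodic(s, beta):
    """s = beta^k + (prefix of beta) for some k >= 1."""
    k = 0
    while s.startswith(beta):        # strip whole copies from the left
        s = s[len(beta):]
        k += 1
    return k >= 1 and len(s) < len(beta) and beta.startswith(s)
```

For instance, abaabaabaabaabaab is semiperiodic with period aab and prefix semiperiodic with period aba, matching the example in the text: the periods differ but have the same length.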

Recall that we want to bound g_i, the number of characters compared in the i-th phase that have been previously compared in earlier phases. A character in t_i could have previously been examined only during a phase in which P overlaps t_i. So to bound g_i, we closely examine in what ways P could have overlapped t_i during earlier phases. Figure 3.6 shows string t_i as a concatenation of copies of string beta. Let k' be the position in T just to the left of t_i (so T(k') is involved in the mismatch ending phase i).

Figure 3.6: Substring t_i in T is semiperiodic with period beta; the mismatch ending phase i occurs at position k', just to the left of t_i.

Lemma 3.2.2. In any phase h < i, the right end of P cannot be aligned with the right end of t_i or with the right end of any full copy of beta marked off in t_i.

PROOF Suppose, for contradiction, that in phase h the right end of P is aligned with the right end of some full copy of beta in t_i. For concreteness, call that copy beta' and say that its right end is q|beta| places to the left of the right end of t_i, where q >= 0 (see Figure 3.7, where q = 3). Let k be the position in P opposite T(k') in phase h. We will first deduce how phase h must have ended, and then we'll use that to prove the lemma.

Strings P and t_i are semiperiodic with period beta. Consequently, in phase h, the comparison of P and T will find matches until the left end of t_i is reached (Figure 3.5 shows the string equalities involved). Moreover, since P is semiperiodic with period beta, P(k) = P(k - |beta|) = ... = P(k - q|beta|), and P(k - q|beta|) is the character of P opposite T(k') in phase i, since that is the alignment of P and T in phase i > h. But in phase i the mismatch occurs when comparing T(k') with exactly that character. So P(k) differs from T(k'), and therefore phase h must have ended with a mismatch between T(k') and P(k).

Figure 3.5: The arrows show the string equalities described in the proof.

Figure 3.7: The case when the right end of P is aligned with the right end of a full copy of beta. A mismatch must occur between T(k') and P(k). Here q = 3.

Now consider the possible shifts of P done in phase h, and consider where the right end of P is after the phase-h shift. P must have moved right between phases h and i, and in phase i the right end of P is aligned with the right end of t_i, so the phase-h shift cannot move the right end of P past the right end of t_i. There are two cases to consider: 1. either the right end of P is opposite the right end of another full copy of beta in t_i, or 2. the right end of P is in the interior of a full copy of beta. We will show that every possible shift leads to a contradiction.

Case 1. Suppose the phase-h shift aligns the right end of P with the right end of some other full copy of beta in t_i. Then, by the same chain of equalities as above (P is semiperiodic with period beta), the character of the shifted P that is opposite T(k') equals P(k). But the good suffix rule requires that the character brought opposite the mismatched position T(k') differ from the character previously there, namely P(k). This is a contradiction, so in phase h the shift cannot align the right end of P with the right end of another full copy of beta.

Case 2. Suppose the phase-h shift aligns P so that its right end aligns with some character in the interior of a full copy of beta. For concreteness, say that the right end of P is aligned with a character in the interior of copy beta' of string beta. Let gamma-delta be the string in the shifted P positioned opposite beta' in t_i, where gamma is the part of the shifted P through the right end of beta' and delta is the remainder, so that |gamma| + |delta| = |beta|. Since, by the good suffix rule, the characters in the shifted P below t_i agree with t_i (see Figure 3.8), gamma is a suffix of beta and delta is a prefix of beta, and beta = gamma-delta = delta-gamma. But if gamma-delta = delta-gamma with both strings nonempty, then beta = rho^t for some string rho and t > 1, which contradicts the assumption that beta is the smallest string of which the matched region is a concatenation of copies. So we would again have a contradiction to the selection of beta, and the phase-h shift cannot move the right end of P into the interior of a full copy of beta.

Figure 3.8: Case when the right end of P is aligned with a character in the interior of a full copy of beta.

Starting with the assumption that in phase h the right end of P is aligned with the right end of a full copy of beta, we reached the conclusion that no shift in phase h is possible. Hence the assumption is wrong and the lemma is proved. []

Note that nowhere in the proof is it assumed that phase h ends with a mismatch; the proof only needs the assumption that phase i ends with a mismatch, not that phase h does. That is, the preceding lemma holds even if phase h is assumed to end by finding an occurrence of P in T. This observation will be used later, when P does occur in T (and phases do not necessarily end with mismatches).

Lemma 3.2.3. If in phase h < i the right end of P is not aligned with the right end of a full copy of beta in t_i, then P can match t_i in T for at most |beta| - 1 characters.

PROOF Suppose P matches t_i for |beta| or more characters. Then the right-most |beta| matched characters of P would match a string consisting of a suffix (gamma) of beta followed by a prefix (delta) of beta, with |gamma| + |delta| = |beta|; thus gamma-delta = delta-gamma, and t_i would have a smaller period than beta, contradicting the definition of beta, unless gamma or delta is empty, which would mean that the right end of P is aligned with the right end of a full copy of beta. []

Note again that this lemma holds even if phase h is assumed to find an occurrence of P; only phase i needs to end with a mismatch.

Theorem 3.2.1. Assuming P does not occur in T, g_i <= 3s_i in every phase i, and hence the number of comparisons done by the Boyer-Moore algorithm, using only the good suffix rule, is at most 4m = O(m).

3.2.2. The case when the pattern does occur in the text

Consider P consisting of n copies of a single character and T consisting of m copies of the same character. Then P occurs in T starting at every position in T except the last n - 1 positions, and the number of comparisons done by the Boyer-Moore algorithm is Theta(mn). The O(m) time bound proved in the previous section breaks down because it was derived by showing that g_i <= 3s_i, and that derivation required the assumption that phase i ends with a mismatch. So when P does occur in T (and phases do not necessarily end with mismatches), we must modify the Boyer-Moore algorithm in order to recover the linear running time. Galil [168] gave the first such modification. Below we present a version of his idea.

The approach comes from the following observation: Suppose in phase i that the right end of P is positioned with character k of T, and that P is compared with T down to character s of T.

3.3. The original preprocessing for Knuth-Morris-Pratt

3.3.1. The method does not use fundamental preprocessing

In Section 1.3 we showed how to compute all the sp_i values from the Z_i values obtained during fundamental preprocessing of P. The use of Z_i values was conceptually simple and allowed a uniform treatment of various preprocessing problems. However, the classical preprocessing method given in Knuth-Morris-Pratt [278] is not based on fundamental preprocessing. The approach taken there is very well known and is used or extended in several additional methods (such as the Aho-Corasick method that is discussed next). For those reasons, a serious student of string algorithms should also understand the classical algorithm for Knuth-Morris-Pratt preprocessing.

The preprocessing algorithm computes sp_i(P) for each position i from i = 2 to i = n (sp_1 is zero). To explain the method, we focus on how to compute sp_{k+1}, assuming that sp_i is known for each i <= k. The situation is shown in Figure 3.9, where string alpha is the prefix of P of length sp_k. That is, alpha is the longest string that occurs both as a proper prefix of P and as a substring of P ending at position k. For clarity, let alpha' refer to the copy of alpha that ends at position k, and let x denote character k + 1 of P.

Figure 3.9: The situation after finding sp_k.

The easy case

Clearly, sp_{k+1} <= sp_k + 1, for otherwise a prefix of P longer than alpha would occur as a proper suffix of P[1..k]; that would contradict the definition of sp_k (and the selection of alpha). Moreover, sp_{k+1} = sp_k + 1 if and only if the character to the right of alpha is x, since alpha followed by x would then be a prefix of P that also occurs as a proper suffix of P[1..k + 1]. So if character P(sp_k + 1) = x, the computation of sp_{k+1} is immediate.

The general case

However, if character P(sp_k + 1) differs from x, then sp_{k+1} < sp_k + 1, and our goal is to find sp_{k+1}, or equivalently, the prefix that the algorithm will next try to match. Let beta denote the prefix of P of length sp_{k+1}; that is, beta is the longest proper suffix of P[1..k + 1] that matches a prefix of P. Since beta ends with character x, write beta as beta1 followed by x, so beta1 is a proper suffix of P[1..k] that matches a prefix of P (see Figure 3.10). Now beta1 cannot be longer than alpha, for then beta1 would be a prefix of P longer than alpha occurring as a proper suffix of P[1..k], contradicting the definition of sp_k. Hence beta1 is a proper prefix of P[1..sp_k], and since both beta1 and alpha' are suffixes of P[1..k], beta1 must also be a proper suffix of alpha' (see Figure 3.11).

Figure 3.10: The general reduction: sp_{k+1} is found by finding beta1.

Figure 3.11: beta1 must be a suffix of alpha'.

So the algorithm should find the longest proper prefix of P[1..sp_k] that also matches a suffix of P[1..sp_k], and then check whether the character to the right of that prefix is x. By the definition of sp, the required prefix ends at character sp_{sp_k}. So if character P(sp_{sp_k} + 1) = x, then we have found beta; otherwise we proceed as before, restricting our search to ever smaller prefixes of P. Eventually, either a valid prefix is found, or the beginning of P is reached. In the latter case, sp_{k+1} = 1 if P(1) = P(k + 1); otherwise sp_{k+1} = 0.

The complete preprocessing algorithm

Putting all the pieces together gives the following algorithm for finding all the sp values:

Algorithm SP(P)
  sp_1 := 0;
  for k := 1 to n - 1 do
  begin
    x := P(k + 1);
    v := sp_k;
    while P(v + 1) ≠ x and v ≠ 0 do
      v := sp_v;
    if P(v + 1) = x then
      sp_{k+1} := v + 1
    else
      sp_{k+1} := 0;
  end;

Correctness follows by induction: when P(sp_k + 1) ≠ P(k + 1), the algorithm next examines the prefix of length sp_{sp_k}, and by the induction hypothesis that value has already been correctly computed, so the algorithm correctly sets sp_{k+1}. See the example in Figure 3.12, a "bouncing ball" cartoon of the successive assignments to the variable v during original Knuth-Morris-Pratt preprocessing.

Theorem 3.3.1. Algorithm SP finds all the sp_i(P) values in O(n) time, where n is the length of P.

PROOF  Note first that the algorithm consists of two nested loops, a for loop and a while loop. The for loop executes exactly n - 1 times, incrementing the value of k each time. The while loop executes a variable number of times each time it is entered. The work of the algorithm is proportional to the number of times the value of v is assigned. We consider the places where the value of v is assigned and focus on how the value of v changes over the execution of the algorithm. The value of v is assigned once each time the for statement is reached; it is assigned a variable number of times inside the while loop. Hence the number of times v is assigned is n - 1 plus the number of times it is assigned inside the while loop. How many times that can be is the key question.

Each assignment of v inside the while loop must decrease the value of v, and each of the n - 1 times v is assigned at the for statement, its value either increases by one or it remains unchanged (at zero). So the total amount that the value of v can increase (at the for statement) over the entire algorithm is at most n - 1. But since the value of v starts at zero and is never negative, the total amount that the value of v can decrease over the entire algorithm must also be bounded by n - 1, the total amount it can increase. Hence v can be assigned in the while loop at most n - 1 times, and hence the total number of times that the value of v can be assigned is at most 2(n - 1) = O(n), and the theorem is proved.  []

3.3.2. How to compute the optimized shift values

The (stronger) sp'_i values can be easily computed from the sp_i values in the algorithm below. In the classical method the (weaker) sp values are computed first and then the more desirable sp' values are derived from them, whereas the order is just the opposite in the method based on fundamental preprocessing (i.e., on Z values). For the purposes of the algorithm, character P(n + 1), which does not exist, is defined to be different from any character in P.

Algorithm SP'(P)
  sp'_1 := 0;
  for i := 2 to n do
  begin
    v := sp_i;
    if P(v + 1) ≠ P(i + 1) then
      sp'_i := v
    else
      sp'_i := sp'_v;
  end;

Theorem 3.3.2. Algorithm SP' correctly computes all the sp'_i(P) values in O(n) time. The algorithm only does constant work per position, so the total time for the algorithm is O(n).
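To make the pseudocode concrete, the two algorithms can be sketched in Python as follows (the function names and the use of 0-indexed Python strings are conveniences of this sketch, not the book's 1-indexed notation):

```python
def sp_values(P):
    """Algorithm SP: classical Knuth-Morris-Pratt preprocessing.
    Returns [sp_1, ..., sp_n], where sp_k is the length of the longest
    proper suffix of P[1..k] that matches a prefix of P."""
    n = len(P)
    sp = [0] * (n + 1)            # sp[0] unused; sp[1] = 0
    for k in range(1, n):         # compute sp[k+1] from sp[1..k]
        x = P[k]                  # x = P(k+1) in the book's notation
        v = sp[k]
        while v > 0 and P[v] != x:
            v = sp[v]             # recurse on ever-smaller prefixes
        sp[k + 1] = v + 1 if P[v] == x else 0
    return sp[1:]

def sp_prime_values(P):
    """Algorithm SP': derives the optimized shift values sp' from sp."""
    n = len(P)
    sp = [0] + sp_values(P)       # 1-indexed copy of the sp values
    spp = [0] * (n + 1)
    for i in range(1, n):
        v = sp[i]
        # sp'_i = sp_i if P(sp_i + 1) != P(i + 1); otherwise sp'_i = sp'_{sp_i}
        spp[i] = v if P[v] != P[i] else spp[v]
    spp[n] = sp[n]                # P(n+1) "does not exist", so sp'_n = sp_n
    return spp[1:]
```

For P = abab, sp = (0, 0, 1, 2) but sp' = (0, 0, 0, 2): the prefix aba has sp_3 = 1, yet sp'_3 = 0 because the character following the matched prefix equals the character that would immediately mismatch again.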

3.4. Exact matching with a set of patterns

Definition  The keyword tree for a set of patterns P is a rooted tree K in which each edge is labeled with exactly one character, any two edges out of the same node have distinct labels, every pattern P_i in P is spelled out by the characters on the path from the root to some node of K (that node is numbered i), and every leaf of K is mapped to by some pattern in P.

Figure 3.13 shows the keyword tree for the set of patterns {potato, poetry, pottery, science, school}. Clearly, every node in the keyword tree corresponds to a prefix of one of the patterns in P, and every prefix of a pattern maps to a distinct node in the tree.

Assuming a fixed-size alphabet, it is easy to construct the keyword tree for P in O(n) time, where n is the total length of the patterns. Define K_i to be the (partial) keyword tree that encodes patterns P_1, ..., P_i of P. Tree K_1 just consists of a single path of |P_1| edges out of root r. Each edge on this path is labeled with a character of P_1, and when read from the root, these characters spell out P_1. The number 1 is written at the node at the end of this path. To create K_2 from K_1, first find the longest path from root r that matches the characters of P_2 in order. This path is unique because, at any branching node v of K_1, all edges out of v have distinct labels. The path either ends by exhausting P_2 or it ends at some node v in the tree where no further match is possible. In the first case, P_2 already occurs in the tree, so we write the number 2 at the node where the path ends. In the second case, we create a new path out of v, labeled by the remaining (unmatched) characters of P_2, and write the number 2 at the end of that path. An example of these two possibilities is shown in Figure 3.14. In either of the two cases, K_2 will have at most one branching node (a node with more than one child), and the characters on the two edges out of the branching node will be distinct. The latter property holds inductively for any tree K_i: at any branching node v in K_i, the characters on the edges out of v are distinct.

In general, to create K_{i+1} from K_i, start at the root of K_i and follow, as far as possible, the (unique) path in K_i that matches the characters in P_{i+1} in order. If pattern P_{i+1} is exhausted (fully matched), then number the node where the match ends with the number i + 1. If a node v is reached where no further match is possible but P_{i+1} is not fully matched, then create a new path out of v labeled with the remaining unmatched part of P_{i+1} and number the endpoint of that path with the number i + 1. During the insertion of P_{i+1}, the work done at any node is bounded by a constant, since the alphabet is finite and no two edges out of a node are labeled with the same character. Hence for any i, it takes O(|P_{i+1}|) time to insert pattern P_{i+1} into K_i, and so the time to construct the entire keyword tree is O(n).

3.4.1. Naive use of keyword trees for set matching

Because no two edges out of any node are labeled with the same character, we can use the keyword tree to search for all occurrences in T of patterns from P. To begin, consider how to search for occurrences of patterns in P that begin at character l of T: follow the unique path in K that matches a prefix of T[l..m] as far as possible. Numbered nodes along that path indicate patterns in P that start at position l: if a node on this path is numbered by i, then P_i occurs in T starting from position l. More than one such numbered node can be encountered if some patterns in P are prefixes of other patterns in P. So, to find all patterns that occur in T, start from each position l in T and follow the unique path from r in K that matches a substring of T starting at character l. For a fixed l, the traversal takes time proportional to the minimum of m and n, so by successively incrementing l from 1 to m and traversing K, the exact set matching problem can be solved in O(nm) time.

Since running a linear-time single-pattern method separately for each of the z patterns would take O(n + zm) time, it is natural to ask whether the exact set matching problem can be solved faster. Perhaps surprisingly, it can: the problem can be solved in O(n + m + k) time, where k is the number of occurrences in T of the patterns from P. The first method to achieve this bound is due to Aho and Corasick [9]. In this section we develop the Aho-Corasick method, although some of the proofs are left to the reader. An equally efficient but more robust method for the exact set matching problem is based on suffix trees and is discussed in Section 7.
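The construction and the naive search just described can be sketched in Python as follows (the dictionary-of-dicts node layout and the function names are conveniences of this sketch; positions are 0-indexed):

```python
def build_keyword_tree(patterns):
    """Builds the keyword tree K.  Each node is a dict with 'edges'
    (one child per character, so edges out of a node have distinct labels)
    and 'out' (indices of patterns whose path ends at that node).
    Inserting P_i takes O(|P_i|) time, O(n) in total."""
    root = {'edges': {}, 'out': []}
    for i, pat in enumerate(patterns):
        node = root
        for ch in pat:            # follow the unique matching path as far as possible
            node = node['edges'].setdefault(ch, {'edges': {}, 'out': []})
        node['out'].append(i)     # number the node where the pattern ends
    return root

def naive_search(root, T):
    """Naive use of the keyword tree: restart a walk from the root at
    every position l of T.  Worst case O(nm); the failure links of
    Aho-Corasick remove the restarts."""
    hits = []
    for l in range(len(T)):
        node = root
        for c in range(l, len(T)):
            node = node['edges'].get(T[c])
            if node is None:
                break
            for i in node['out']:
                hits.append((i, l))   # pattern i starts at position l
    return hits
```

Note that when one pattern is a prefix of another (for example ab and abc), a single walk reports both, since two numbered nodes lie on the same root path.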

We will reduce this to 0(0 + m + k) time below.4. a walk from the root of K determines if the string is in the dictionary.16: Keyword showingthe failurelinks tree For example. For a fixed I. the concatenation of characters on the path from the root to u spells out the string. with only one subtlety that occurs if a pattern in P is a proper substring of another pattern in P.15: Keyword to itlustratethe labetof a node tree 3. The speedup: generalizing Knuth-Morris-Pratt The above naive approach to the exact set matching problem is analogous to the naive search we discussed before introducing the Knuth-Morris-Pratt method. the problem is to determine if the text T (an individual presented string) completely matches some string in P. theater. tattoo. other) and its keyword tree shown in Figure 3.15 the node pointed to by the arrow is labeled with the stringpott. where after every mismatch the pattern is shifted by only one position. Then a sequence of individual strings will be presented. the traversal of a path of K. so by successively incrementing I from 1 to m and traversing K. The strings in the dictionary are encoded into a keyword tree }C. This generalization is fairly direct. called the dictionary problem.4. That is.c(uj is used to denote the label on u. The dictionary problem Without any further embellishments. The Aho-Corasick algorithm makes the same kind of improvements. So there must be a path from the root in K that spells out . EXACT MATCHING WITH A SET OF PATfER. where k is the number of occurrences. Let v be the node labeled with the string potato Since [at is prefix of tattoo. in Figure 3. and it is the longest proper suffix of potot that is a prefix of any patterninP. PROOF K encodes all the patterns in P and. and the comparisons are always begun at the left end of the pattern.16. this simple keyword tree algorithm efficiently solves a special case of set matching."IS 55 Numbered nodes along that path indicate patterns in P that start at position I. 
We will reduce this to O(n + m + k) time below, with only one subtlety, which occurs if a pattern in P is a proper substring of another pattern in P. For that reason, it is very helpful to (temporarily) make the following assumption:

Assumption  No pattern in P is a proper substring of any other pattern in P.

The dictionary problem

Without any further embellishments, this simple keyword tree algorithm efficiently solves a special case of set matching, called the dictionary problem. In the dictionary problem, a set of strings (forming a dictionary) is initially known and preprocessed. Then a sequence of individual strings will be presented, and for each one, the task is to find if the presented string is contained in the dictionary. That is, the problem is to determine if the text T (an individual presented string) completely matches some string in P. The utility of a keyword tree is clear in this context: the strings in the dictionary are encoded into a keyword tree K, and a walk from the root of K determines if the presented string is in the dictionary, in time proportional to the minimum of m and n.

The speedup: generalizing Knuth-Morris-Pratt

We now return to the general set matching problem of determining which strings in P are contained in text T. The naive approach above is analogous to the naive search we discussed before introducing the Knuth-Morris-Pratt method: after every mismatch the pattern is shifted by only one position, and the comparisons are always begun at the left end of the pattern. The Knuth-Morris-Pratt algorithm improves on that naive algorithm by shifting the pattern by more than one position when possible and by never comparing characters to the left of the current character in T. The Aho-Corasick algorithm makes the same kind of improvements, incrementing l by more than one and skipping over initial parts of paths in K, when possible. The key is to generalize the function sp_i (defined on page 27 for a single pattern) to operate on a set of patterns.

Failure functions for the keyword tree

Definition  Each node v in K is labeled with the string, denoted L(v), obtained by concatenating in order the characters on the path from the root of K to node v.

Definition  For any node v of K, define lp(v) to be the length of the longest proper suffix of string L(v) that is a prefix of some pattern in P.

For example, consider the set of patterns P = {potato, tattoo, theater, other} and its keyword tree shown in Figure 3.16. In Figure 3.15 the node pointed to by the arrow is labeled with the string potat, and lp(v) = 3: tat is a suffix of potat and a prefix of tattoo, and it is the longest proper suffix of potat that is a prefix of any pattern in P.

Lemma  For any node v of K, there is a unique node in K labeled with the lp(v)-length suffix of L(v).

PROOF  By definition, the lp(v)-length suffix of L(v) is a prefix of some pattern in P, so there must be a path from the root in K that spells out that string. By the construction of K, no two paths from the root spell out the same string, so this path is unique and the lemma is proved.  []

Definition  For a node v of K, let n_v be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0, n_v is the root of K.

We call the ordered pair (v, n_v) a failure link. Figure 3.16 shows the keyword tree for P = {potato, tattoo, theater, other} with failure links drawn as pointers from every node v to node n_v where lp(v) > 0; the other failure links point to the root and are not shown. For example, the failure link from the node labeled potat points to the node labeled tat. (Later we will show how to efficiently find all the failure links.)

The failure links speed up the search

Suppose that we know the failure link v -> n_v for each node v in K. How do the failure links help speed up the search? The Aho-Corasick algorithm uses the function v -> n_v in a way that directly generalizes the use of the function i -> sp_i in the Knuth-Morris-Pratt algorithm. As before, we use l to indicate the starting position in T of the patterns being searched for, and a pointer c into T to indicate the "current character" of T to be compared with a character on K.

To understand the use of the function v -> n_v, suppose we have traversed the tree to node v but cannot continue (i.e., character T(c) does not occur on any edge out of v). We know that string L(v) occurs in T starting at position l and ending at position c - 1. By the definition of the function v -> n_v, it is guaranteed that string L(n_v) matches string T[c - lp(v)..c - 1]. So l can be increased to c - lp(v), the comparisons can resume at node n_v, and there is no need to actually make the comparisons on the path from the root to node n_v; the algorithm could traverse K from the root to node n_v and be sure to match all the characters on this path with the characters in T starting from position c - lp(v). When lp(v) = 0, l is increased to c and the comparisons begin at the root of K. The only case remaining is when the mismatch occurs at the root; in this case, c must be incremented by 1 and comparisons again begin at the root.

For correctness, we have to argue that there are no occurrences of patterns of P starting strictly between the old l and c - lp(v) in T; thus l can be incremented to c - lp(v) without missing any occurrences, and yet we may be sure that every occurrence of a pattern in P that begins at character c - lp(v) of T will be correctly detected. With the given assumption that no pattern in P is a proper substring of another one, that argument is almost identical to the proof of Theorem 2.3.2 in the analysis of Knuth-Morris-Pratt, and it is left as an exercise.

The following algorithm uses the failure links to search for occurrences in T of patterns from P:

Algorithm AC search
  l := 1;
  c := 1;
  w := root of K;
  repeat
    while there is an edge (w, w') labeled with character T(c) do
    begin
      if w' is numbered by pattern i then
        report that P_i occurs in T starting at position l;
      w := w';
      c := c + 1;
    end;
    l := c - lp(w);
    w := n_w;
  until c > m;

For example, consider the text T = xxpotattooxx and the keyword tree shown in Figure 3.16. When l = 3, the text matches the string potat but mismatches at the next character. At this point c = 8, and lp(v) = 3 for the node v labeled potat. So l is incremented to 5 = 8 - 3 = c - lp(v), the comparisons resume at node n_v, labeled tat, and the next comparison is between character T(8) and the character t on the edge below tat.

With this algorithm, l may increase by more than one, avoiding the reexamination of characters of T to the left of c. Of course (just as in Knuth-Morris-Pratt), the use of the function v -> n_v certainly accelerates the naive search for patterns of P. But does it improve the worst-case running time? By the same sort of argument used to analyze the search time (not the preprocessing time) of Knuth-Morris-Pratt (Theorem 2.3.3), it is easily established that the search time for Aho-Corasick is linear.

Linear preprocessing for the failure function
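The whole method, keyword tree, failure links computed breadth-first, and the linear-time scan, fits in a short Python sketch. This version also lets each node inherit the outputs of the node its failure link points to, so it does not need the no-substring assumption; all names, and the flat-array node layout, are conveniences of the sketch (positions are 0-indexed):

```python
from collections import deque

def aho_corasick(patterns, T):
    """Returns (pattern index, start position) pairs for all occurrences
    in T of patterns from the set, in O(n + m + k) time."""
    # --- build the keyword tree ---
    goto, fail, out = [{}], [0], [[]]
    for i, pat in enumerate(patterns):
        v = 0
        for ch in pat:
            if ch not in goto[v]:
                goto.append({}); fail.append(0); out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append(i)
    # --- breadth-first computation of the failure links ---
    q = deque(goto[0].values())          # children of the root fail to the root
    while q:
        v = q.popleft()
        for ch, w in goto[v].items():
            q.append(w)
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[w] = goto[f][ch] if ch in goto[f] and goto[f][ch] != w else 0
            out[w] += out[fail[w]]       # inherit outputs of the suffix node
    # --- scan T, following failure links on mismatch ---
    hits, v = [], 0
    for c, ch in enumerate(T):
        while v and ch not in goto[v]:
            v = fail[v]
        v = goto[v].get(ch, 0)
        for i in out[v]:
            hits.append((i, c - len(patterns[i]) + 1))
    return hits
```

On the example from the text, aho_corasick(["potato", "tattoo", "theater", "other"], "xxpotattooxx") reports the single occurrence of tattoo.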

is a more realistic problem. The time to search for occurrences in T of patterns from P is O(m + z}. but complicate the problem a bit. This is a two-dimensional generalization of the exact string matching problem Admittedly. A transcription factor is a protein that binds to specific locations in DNA and regulates. For each starting location j of P. in Sections 9. the Zinc Finger is a common transcription factor that has the following signature' Correctness Clearly. ab) and discussed 2. The problem of matching with wild cards should need little motivating. Another important transcription factor is the Leucine Zipper. . if the number of wild cards is bounded by a fixed constant (independent of the size of P) then the following method. The above method uses this idea in reverse. We treat each pattern in P as being distinct even if there are multiple copies of it in P. which truly arises in numerous practical applications. which also is digitized. there is an occurrence of Pin T starting at position p if and only if.62 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 63 text T. For example. In summary. where ITI = m and: is the number of occurrences. Two-dimensional exact matching A second classic application of exact set matching occurs in a generalization of string matching to two-dimensional exact matching. Thousands of times thereafter. We assume that the bottom edges of the two rectangles are parallel to each other. Hence z must be bounded by km. at position p. [For example. many transcription factors are now known and can be separated into families characterized by specific substrings containing wildcards For example. then the method does run in linear time. Then whenever an occurrence of a pattern from P is found in T. if P 11=1'/2=5'/3=7. Given a pattern P containing wild cards. a eel! can be incremented to at most k. allowing some errors. thi. all starting positions of P. one would look for occurrences of any of these 600. 
Two-dimensional matching that is inexact. Let C be a vector of length ITI initialized to all zeros. and the algorithm runs in O(km) time. each separated by six wild card amino acids If the number of pencilled wild cards is unbounded. based on exact set pattern matching.000 patterns in text strings of length 150. we will return either the pattern. we return to the problem of exact matching with a single pattern. However. for each i. compelling applications of two-dimensional exact matching are hard to find. Scan vector C for any cell with value k. Unlike the one-dimensional exact matching problem. leading to a nonlinear O(km) time bound.} be the (multi-)set of maximal substrings of P that do not contain any wild cards. we want to find all occurrences of P in a text T. This correctly determines whether P occurs starting at p because each string in P can cause at most one increment to cell p of C. Correctness and complexity of the method A related application comes from the "BAC~PAC" proposal [442J for sequencing the human genome (see page 418). 600. E P occurs at position j = p + Ii .1 of T. where CYS is the amino acid cysteine and HIS is the amino acid histidine. then cell 12 of C is incremented by one. I.3.000 strings (patterns) of length 500 would first be obtained and entered into the computer. problem is somewhat contrived. Let /1. which is two-thousand times as large as the typical text to be searched 3. km need not be O(m) and hence the number of times C is incremented may grow faster than O(m).. LetP = [PI. exactly one cell in C is incremented. Hence P occurs in T starting at p if and only if similar witnesses for position p are found for each of the k strings in P. the transcription of the DNA into RNA. as it is not difficult to think up realistic cases where the pattern contains wild cards. /2. There is an occurrence of P in T starting at position p if and only if C(p) = k. subpattem P. In this duction of the protein that the DNA codes for is regulated. 
runs in linear time: Exact matching with wild cards O.) 3. in . or both. then this provides one "witness" that P occurs at T starting at position p = j . to the problem of wild cards when they occur in 3. if the second copy of string ab is found in T starting at position 18. Exact matching with wild cards As an application of exact set matching. the pattern ab¢l¢lc¢l occurs twice in the text x aboccbabubcax. and pattern Pi starts at position Ii in P. it is not known if the problem can be solved in linear time. Complexity The time used by the Aho-Corasick algorithm to build the keyword tree for Pis O(n).5. find for each string Pi in 'P.000. Suppose we have a rectangular digitized picture T. C. either enhancing or suppressing. If pattern Pi E P is found to occur starting at position j of T. Note that in this version of the problem no wild cards appear in T and that each wild card matches only a single character rather than a substring of unspecified length. . But if k is assumed to be bounded (independent of IP I). increment the count in cell j -Ij + 1 of C by one. the number of witnesses that observe an occurrence of P beginning at p.3. which consists of four to seven Ieucincs. The study factors has exploded in the past decade. in T. P.Ii + 1. Using the Aho-Coraslck algorithm (or the suffix tree approach to later).5. text. called a wild card.2. Note that the total length of the patterns is 300 million characters. where each point is given a number indicating its color and brightness. Although the number of character comparisons used is just O(m). and we want to find all occurrences (possibly overlapping) of the smaller picture in the larger one. . We modify the exact matching problem by introducing a character e . One very important case where simple wild cards occur is in DNA transcription factors. that matches any single character. . P2. The algorithm counts. In that method.. We are also given a smaller rectangular picture P. furthermore. 
but its solution requires '* be ab¢¢lc¢ab¢l¢ then P = lab.) = Later. be the starting positions in P of each of these substrings (For example.
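The counting method can be sketched in Python as follows. For brevity this sketch finds the P_i occurrences by direct scanning rather than by Aho-Corasick, so it illustrates the counting logic rather than the linear-time bound; the character '?' plays the role of the wild card φ, and positions are 0-indexed:

```python
def wildcard_match(P, T, wild='?'):
    """Returns all 0-indexed start positions of P in T, where each
    occurrence of the wild card in P matches any single character."""
    # Step 1: maximal wild-card-free substrings P_i with their offsets l_i in P.
    pieces, i = [], 0
    while i < len(P):
        if P[i] == wild:
            i += 1
            continue
        j = i
        while j < len(P) and P[j] != wild:
            j += 1
        pieces.append((P[i:j], i))
        i = j
    k = len(pieces)
    # Step 2: one witness per piece occurrence, counted in C.
    C = [0] * len(T)
    for pat, off in pieces:
        for j in range(len(T) - len(pat) + 1):
            if T[j:j + len(pat)] == pat:
                p = j - off               # implied start of P for this witness
                if 0 <= p <= len(T) - len(P):
                    C[p] += 1
    # Step 3: P occurs at p if and only if all k pieces witnessed it.
    return [p for p in range(len(T) - len(P) + 1) if C[p] == k]
```

On the text's example, the pattern ab??c? is found at the two positions where it occurs in xabvccbababcax.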

3.5.2. Two-dimensional exact matching

A second classic application of exact set matching occurs in a generalization of string matching to two-dimensional exact matching. Suppose we have a rectangular digitized picture T, where each point is given a number indicating its color and brightness. We are also given a smaller rectangular picture P, which also is digitized, and we want to find all occurrences (possibly overlapping) of the smaller picture in the larger one. We assume that the bottom edges of the two rectangles are parallel to each other. This is a two-dimensional generalization of the exact string matching problem. Admittedly, compelling applications of two-dimensional exact matching are hard to find, and the problem as stated is somewhat contrived; two-dimensional matching that is inexact, allowing some errors, is a more realistic problem, but its solution requires more complex techniques of the type we will examine in Part III of the book. The exact problem is presented here as an illustration of how exact set matching can be used. The approach given below follows [66]; for a sophisticated treatment of two-dimensional matching see [22] and [169], and many additional methods have been presented since that improve on those papers in various ways.

Let m be the number of points in T, let n be the number of points in P, and let n' be the number of rows in P. Just as in exact string matching, we want to find the smaller picture in the larger one in O(n + m) time, where O(nm) is the time for the obvious approach. The method is divided into two phases. Assume for now that each of the rows of P is distinct; later we will relax this assumption.

In the first phase, give each row of P a number from 1 to n'. Add to the end of each row of T an end-of-row marker (some character not in the alphabet) and concatenate the rows of T together to form a single text string T' of length O(m). Treating each row of P as a separate pattern, use the Aho-Corasick algorithm to find all occurrences in T' of any row of P. Since P is rectangular, all of its rows have the same length, so no row is a proper substring of another. Whenever an occurrence of row i of P is found starting at position (p, q) of T, write the number i in cell (p, q) of an array M with the same dimensions as T. Because the rows have equal length, at most one number is written in any cell of M. This phase takes O(n + m) time.

In the second phase, scan each column of M, looking for an occurrence of the string 1, 2, ..., n' in consecutive cells in a single column. For example, if this string is found in column 6, starting at row 12 and ending at row n' + 11, then P occurs in T with its upper left corner in row 12, column 6. It is easy to verify that P occurs in T exactly at the positions identified this way. Phase two can be implemented in O(n' + m) = O(n + m) time by applying any linear-time exact matching algorithm to each column of M.

Now suppose the rows of P are not all distinct. Then, first find all rows that are identical and give them a common label; for example, if rows 3, 6, and 10 are the same, each occurrence of that row in T is marked with the same label, and in the second phase we look for the corresponding string of labels down each column of M. The time remains O(n + m).

Theorem 3.5.1. If T and P are rectangular pictures with m and n cells, respectively, then all exact occurrences of P in T can be found in O(n + m) time, improving upon the naive method, which takes O(nm) time.

Note the similarity between this solution and the solution to the exact matching problem with wild cards discussed in the previous section.
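The two-phase method can be sketched in Python as follows. Rows are strings, identical rows of P automatically share a label, and plain scans stand in for Aho-Corasick in phase one and for Knuth-Morris-Pratt in phase two, so the sketch shows the structure of the method rather than the O(n + m) bound; all names are conveniences of the sketch and positions are 0-indexed:

```python
def match_2d(P, T):
    """Returns the (row, column) upper-left corners of all occurrences
    of rectangular picture P in rectangular picture T."""
    labels = {}                       # distinct row -> label; identical rows share
    row_label = [labels.setdefault(row, len(labels) + 1) for row in P]
    rP, cP = len(P), len(P[0])
    rT, cT = len(T), len(T[0])
    # Phase 1: mark every cell of T at which some row of P begins.
    M = [[0] * cT for _ in range(rT)]
    for r in range(rT):
        for c in range(cT - cP + 1):
            lab = labels.get(T[r][c:c + cP])
            if lab:
                M[r][c] = lab         # rows of P have equal length: one label per cell
    # Phase 2: look for the label sequence of P down each column of M.
    hits = []
    for c in range(cT - cP + 1):
        for r in range(rT - rP + 1):
            if all(M[r + i][c] == row_label[i] for i in range(rP)):
                hits.append((r, c))
    return hits
```

Because identical rows of P receive the same label, the sketch handles repeated rows with no extra work, matching the relaxation described in the text.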
3.6. Regular expression pattern matching

A regular expression is a way to specify a set of related strings, sometimes referred to as a pattern. Many important patterns in biosequences are specified by regular expressions; the major example is the PROSITE database, developed by Amos Bairoch [41, 42]. PROSITE contains substrings (patterns) that identify families of proteins: a protein is placed in a family when it contains a substring that matches the regular expression characterizing the family. Regular expression pattern matching is also the basis of widely used text-searching programs, such as the Unix tool grep. In this section, we examine the problem of finding substrings of a text string that match a given regular expression.

The PROSITE format for regular expressions is not the standard one, but an example makes it easy to follow. The following PROSITE expression describes the granin family of proteins:

  [ED]-[EN]-L-[SAN]-x-x-[DE]-x-E-L

Each capital letter specifies a single amino acid, a group of amino acids enclosed by brackets indicates that exactly one of those amino acids must be chosen for that position, and a lower case x indicates that any one of the twenty amino acids can be chosen for that position. Positions are separated by dashes, so this expression describes strings of length ten. Whenever a substring matching this regular expression is found in a protein, that protein is identified as potentially belonging to the granin family. For example, the string ENLSSEDEEL matches the expression and is found in human granin proteins. (PROSITE expressions can also specify a permitted number of repetitions of a position, but we will not need that extension here.)

It is helpful to give a formal, recursive definition of a regular expression; the definition will be used later. Assume that the alphabet Σ does not contain the symbols *, +, (, ), or ε:

1. The symbol ε is a regular expression, as is any single character from Σ. (Here ε represents the empty string, the string of length zero.)
2. If R and S are regular expressions, then so are (R), R + S, and RS.
3. If R is a parenthesized regular expression, then R* is a regular expression, specifying that R may be repeated any number of times (including zero times).

These recursive rules are simple to follow, but the rule for * may need some explanation. The definition given here (in which * applies only to a parenthesized expression) is not the standard one, but it is closer to the way regular expressions are actually specified in many applications, and the restriction can easily be relaxed. For example, let Σ be the alphabet of lower case English characters. Then

  (a + c + t)ykk(p + q)*vdt(l + z + ε)

is a regular expression over Σ, and aykkpqppvdt is a string specified by it: a was the choice for the subexpression (a + c + t), the subexpression (p + q) was repeated four times (choosing p, q, p, p), and ε was the choice for the subexpression (l + z + ε).

It is very useful to represent a regular expression R by a directed graph G(R), usually called a nondeterministic finite state automaton. The graph has a start node s and a termination node t, and each edge is labeled with a single symbol from Σ U {ε}. Each s to t path in G(R) specifies a string by concatenating the characters of Σ that label the edges of the path. The set of strings specified by all such paths is exactly the set of strings specified by the regular expression R. The rules for constructing G(R) from R are simple and are left as an exercise; it is easy to show that if R has n symbols, then G(R) can be constructed with at most 2n edges. An example is shown in Figure 3.19, the directed graph for the regular expression (d + o + g)((n + o)w)*(c + l + ε).

Theorem 3.6.1. If T is of length m, and the regular expression R contains n symbols, then it is possible to determine whether T contains a substring matching R in O(nm) time.

To search for a match, we first consider the simpler problem of determining whether some prefix of T matches R. For i from 0 to m, define N(i) to be a set of nodes of G(R), as follows: N(0) consists of node s plus all nodes reachable from s by traversing edges labeled ε; for i > 0, a node v is in N(i) if and only if v can be reached from some node in N(i - 1) by traversing an edge labeled T(i) followed by zero or more edges labeled ε. This gives a constructive rule, by induction on i, for finding each set N(i) from N(i - 1) and character T(i). It easily follows that the prefix T[1..i] matches R if and only if there is a path in G(R) from s that ends at t and generates T[1..i], that is, if and only if t is in N(i). If G(R) contains e edges, then each iteration (computing N(i) from N(i - 1)) can be implemented in O(e) time, and since e <= 2n, determining for every i whether some prefix T[1..i] matches R takes O(nm) time in total.

To search for a nonprefix substring of T that matches R, simply search for a prefix of T that matches Σ*R, where Σ* represents any number of repetitions (including zero) of any character of Σ. With this detail, the theorem follows; the remaining details are left as an exercise and can be found in [10] and [8].
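The node-set computation in the proof can be sketched directly in Python. Since the text leaves the construction rules for G(R) as an exercise, the graph below is hand-built for the small expression ab*; the edge format and function name are conveniences of the sketch:

```python
def nfa_substring_end(edges, s, t, T):
    """Simulates G(R) on T by the node-set rule of Theorem 3.6.1.
    edges maps node -> list of (label, node); label None is an epsilon edge.
    Keeping s in every set N(i) has the same effect as prepending Sigma* to R,
    so a match may begin at any position.  Each step costs O(e), O(me) total.
    Returns the 0-indexed end position of the first nonempty substring of T
    matching R, or -1 if there is none."""
    def eps_close(nodes):
        stack, seen = list(nodes), set(nodes)
        while stack:
            v = stack.pop()
            for lab, w in edges.get(v, []):
                if lab is None and w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    N = eps_close({s})                    # N(0); empty-string matches are ignored here
    for i, ch in enumerate(T):
        step = {w for v in N for lab, w in edges.get(v, []) if lab == ch}
        N = eps_close(step | {s})         # re-seed s: a match may start anywhere
        if t in N:
            return i
    return -1

# Hand-built G(R) for R = ab*:  s --a--> 1,  1 --b--> 1,  1 --eps--> t
G = {'s': [('a', 1)], 1: [('b', 1), (None, 't')]}
```

Each character of T advances every active node at once, which is exactly why the simulation costs O(e) per character rather than exploring the nondeterministic paths one by one.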
3.7. Exercises

1. Evaluate empirically the speed of the Boyer-Moore method against the Apostolico-Giancarlo method under different assumptions about the text and the pattern. These assumptions should include the "randomness" of the text or pattern, the size of the alphabet, and the level of periodicity of the text or pattern.

2. In the Apostolico-Giancarlo method, M(j) is set to a number less than or equal to the length of the (right-to-left) match found at position j of T. In some cases the algorithm learns the exact location of the mismatch, and it would seem a good thing to set M(j) to a larger value, possibly the full length of the match. Find examples where this can be done, make the rule precise, and prove that the modified method still gives a correct simulation of Boyer-Moore. Then explain why this was not done in the algorithm. Hint: It's the time bound.

3. Show how to modify the Apostolico-Giancarlo method, with more careful bookkeeping, so that it runs in the same time bound but, in place of M (which is of size m), uses an array of size n.

4. In the Apostolico-Giancarlo method, when comparing a text position against a pattern position, it may be better to compare the two characters first and consult M and N only if the characters match. Evaluate this idea both theoretically and empirically.

5. Prove the lemma of Section 3.2 showing the equivalence of the two definitions of semiperiodic strings.

6. For each of the n prefixes P[1..i] of P, we want to know whether the prefix is semiperiodic and, if so, with what period. Show how to determine this for all n prefixes in time linear in the length of P. Hint: Use the Z-algorithm.

7. The linear-time analysis of the Boyer-Moore algorithm (Cole's worst-case bound) breaks down if only the weak good suffix shift rule is used. Argue that the bound is simply untrue in that case, considering the example of T = abababababababababab and P = xaaaaaaaaa run without also using the bad character rule. Can the argument be fixed, or is the linear time bound false when only the weak rule is used?

8. Prove that the search phase of the Aho-Corasick algorithm runs in O(m) time if no pattern in P is a proper substring of another, and otherwise in O(m + k) time, where k is the total number of occurrences.

9. The Aho-Corasick algorithm can have the same problem that the Knuth-Morris-Pratt algorithm has when it uses sp values rather than sp' values: after a mismatch it may compare the same text character against characters known to mismatch. Show how to optimize the failure function of the keyword tree, in analogy to the use of sp' in place of sp, so that such repeated comparisons are avoided.

10. Using the assumption that no pattern in P is a proper substring of another, complete the correctness proof of the Aho-Corasick algorithm. That is, prove that if no further matches are possible at a node v, then l can be set to c - lp(v) and the comparisons resumed at node n_v, without missing any occurrences in T of patterns from P.

11. The search part of the Knuth-Morris-Pratt algorithm can be viewed as a slightly modified version of its preprocessing part applied to a longer string. Show that applying the Knuth-Morris-Pratt preprocessing to the string P$T (where $ is a character appearing in neither P nor T) gives a linear-time method to find all occurrences of P in T.

12. Give the rules for constructing the directed graph G(R) from a regular expression R, and prove that if R contains n symbols, then G(R) can be constructed with at most 2n edges.

13. Fill in the details of the O(nm)-time method for determining whether T contains a substring that matches the regular expression R, including the reduction from substring matching to prefix matching via the expression Σ*R.

13. (A better failure function.) The failure function used above is analogous to the sp values in that it only uses sp values rather than sp' values. Develop an improved failure function, and show that it avoids the unnecessary comparisons that the weaker function allows. What time bound would result?

14. Suppose the Knuth-Morris-Pratt (or Boyer-Moore) method is used once for each pattern in the set P, rather than using the Aho-Corasick method. What time bound would result?

15. Prove Lemmas 3.2 and 3.4.

16. Theorem 3.1 states the time bound for determining whether T contains a substring that matches a pattern. Extend the discussion and the theorem to cover the task of explicitly finding and outputting all such matches. State the time bound as the sum of a term that is independent of the number of matches plus a term that depends on that number.

17. The time analysis in the proof of Theorem 3.1 separately considers the path in the keyword tree K for each pattern P in P. This results in an overcount of the time actually used by the algorithm. Perform the analysis more carefully to relate the running time of the algorithm to the number of nodes in K.

18. Suppose that wild cards are permitted in the text but not in the pattern. Explain how matching can be handled in this case. What can you say about the problem of exact matching with these kinds of wild cards when they are permitted in the pattern, in the text, or in both? Discuss correctness and complexity issues.

19. Another approach to handling wild cards in the pattern is to modify the Knuth-Morris-Pratt or Boyer-Moore methods. It is tempting to try to "fix up" these methods to handle wild cards directly. Does this approach seem promising? Try it.

20. The nonlinear time behavior of the wild card algorithm is due to duplicate copies of strings in P. Show that such duplicates can be found and removed in linear time. Does first removing duplicates fix the nonlinear time behavior?

21. Suppose that m > n. Show how to modify the wild card method by replacing the array C (which is of size m) with a list of length n, while keeping the same running time.

22. PROSITE patterns often specify ranges by an extension of regular expression notation. For example, the range specification CD(2,4) indicates that the substring CD can repeat either two, three, or four times. Biological strings are always finite, so such concise specifications do not require the closure operator. Show that a range specification can be expressed as a regular expression, but that the size of the regular expression may increase much more than the length of the concise range specification. Explain, hence, the importance (or the utility) of handling range specifications directly.

23. Wild cards can clearly be encoded into a regular expression. However, it may be more efficient to modify the definition of a regular expression to explicitly include the wild card symbol, and to handle it directly. Develop that idea, and explain how wild cards can be handled efficiently. Consider also a wild card symbol "*" that can match any length substring.

24. Show how to construct G(R) from a regular expression R. The construction should have the property that if R contains n symbols, then G(R) contains at most O(n) edges. Explain why |N(i)| = O(n) for any i. Also, show that finding the set N(i) from N(i - 1) and T(i) is trivial if G(R) contains no ε edges. Finally, suppose each edge of G(R) can be labeled with a range of characters rather than just a single character; show that one can still search for a substring of T that matches R in O(me) time, where m is the length of T and e is the number of edges.

25. In the two-dimensional matching problem, perhaps we can omit the Aho-Corasick phase as follows: Keep a counter at each cell of the large picture. When we find that row i of the small picture occurs in a row of the large picture, increment the counter for the cell corresponding to the implied upper left corner of the small picture. Then declare that P occurs in the large picture with upper left corner in any cell whose counter becomes n' (the number of rows of P). Does this work? If not, what are the problems, and can they be avoided?

26. Suppose in the two-dimensional matching problem that a method other than Aho-Corasick is used to find the rows of the small picture inside the large picture. Discuss the problems (and solutions if you see them).

27. Show how to extend the two-dimensional matching method to the case when the bottom of the rectangular pattern is not parallel to the bottom of the large picture, but the orientation of the two bottoms is known. What happens if the pattern is not rectangular?

28. In the two-dimensional matching method, identical rows of the small picture were given the same label. Show how such duplicate rows can be found in linear time.

29. Give a complete proof of the correctness and O(n + m) time bound for the two-dimensional matching method described in the text (Section 3.3).

30. Suppose we have q > 1 small (distinct) rectangular pictures and we want to find all occurrences of any of them in a larger rectangular picture, efficiently. Let n be the total number of points in all the small pictures and let m be the number of points in the large picture. First suppose that all the small pictures have the same width; discuss how to solve this problem efficiently. Then consider the case when the small pictures do not all have the same width. Try to make the sizes of the q small pictures as varied as possible. Can you solve the problem in O(n + m) time?

31. Give an example where the number of occurrences in T of patterns in set P grows faster than O(n + m). Why does the method not run in O(n + m) time in that case? Can you fix it and make it run in O(n + m) time?

32. In the wild card problem we first assumed that no pattern in P is a substring of another, and then extended the algorithm to the case when that assumption does not hold. Could we instead simply reduce the case when substrings are permitted to the case when they are not? For example, suppose we just add a new symbol to the end of each string in P that appears nowhere else in the patterns. Does it work? Discuss correctness and complexity issues. Finally, for the set matching problem, try to develop shift rules and preprocessing methods in the style of the Boyer-Moore approach.
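The N(i) prefix-set simulation used for regular expression matching above can be sketched in a few lines. This is a minimal illustration, assuming a graph encoded as a dictionary of labeled edges (with None standing for an ε label); the function names and the tiny example automaton in the usage note are illustrative conveniences, not the book's construction rules.

```python
def eps_closure(edges, nodes):
    """All nodes reachable from `nodes` by zero or more epsilon (None) edges."""
    stack, seen = list(nodes), set(nodes)
    while stack:
        u = stack.pop()
        for label, v in edges.get(u, []):
            if label is None and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def matches_somewhere(edges, s, t, text):
    """Simulate Sigma* R: N(0) is the eps-closure of {s}; N(i) is built
    from N(i-1) using character text[i-1].  Re-adding s at every step
    models the Sigma* prefix, so a match may start at any position.
    Returns True iff some substring of text matches the expression."""
    N = eps_closure(edges, {s})
    if t in N:
        return True
    for c in text:
        step = {v for u in N for (label, v) in edges.get(u, []) if label == c}
        N = eps_closure(edges, step | {s})
        if t in N:
            return True
    return False
```

For example, the automaton for the single pattern "ab" can be written as edges = {0: [(None, 1)], 1: [("a", 2)], 2: [("b", 3)]} with s = 0 and t = 3; each character of the text then costs O(e) set work, matching the O(me) bound.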

4. Seminumerical String Matching

4.1. Arithmetic versus comparison-based methods

All of the exact matching methods in the first three chapters, as well as most of the methods that have yet to be discussed in this book, are examples of comparison-based methods: the main primitive operation in each of those methods is the comparison of two characters. There are, however, string matching methods based on bit operations or on arithmetic, rather than character comparisons. These methods have a very different flavor than the comparison-based approaches, even though one can sometimes see character comparisons hidden at the inner level of these "seminumerical" methods. We will discuss three examples of this approach: the Shift-And method and its extension to a program called agrep to handle inexact matching; the use of the Fast Fourier Transform in string matching; and the random fingerprint method of Karp and Rabin.

4.2. The Shift-And method

R. Baeza-Yates and G. Gonnet [35] devised a simple, bit-oriented method that solves the exact matching problem very efficiently for relatively small patterns (the length of a typical English word, for example). They call this method the Shift-Or method, but it seems more natural to call it Shift-And. Recall that pattern P is of size n and the text T is of size m.

Definition Let M be an n by m binary valued array. Entry M(i, j) is 1 if and only if the first i characters of P exactly match the i characters of T ending at character j.

Clearly, M(n, j) = 1 if and only if an occurrence of P ends at position j of T; hence computing the last row of M solves the exact matching problem.

4.2.1. How to construct array M

To compute M, the algorithm first constructs an n-length binary vector U(x) for each character x of the alphabet. U(x) is set to 1 for the positions in P where character x appears. For example, if P = abaac, then U(a) = 10110.

Definition Bit-Shift(j - 1) is the vector derived by shifting the vector for column j - 1 down by one position and setting the first bit to 1.

Figure 4.1: Column j - 1 before and after operation Bit-Shift(j - 1).

Array M is constructed column by column as follows: Column one of M is initialized to all zero entries if T(1) does not equal P(1); when T(1) = P(1), its first entry is 1 and the remaining entries are 0. After that, the entries for column j > 1 are obtained from column j - 1 and the U vector for character T(j). In particular, the vector for column j is obtained by the bitwise AND of the vector Bit-Shift(j - 1) with the vector U(T(j)). More formally, if we let M(j) denote the jth column of M, then M(j) = Bit-Shift(j - 1) AND U(T(j)). For example, if P = abaac and T = xabxabaaxa, then the seventh column of M is 10100, because prefixes of P of lengths one and three end at position seven of T. The eighth character of T is character a, which has a U vector of 10110. When the seventh column of M is shifted down and an AND is performed with U(a), the result is 10010, which is the correct eighth column of M.

To see in general why the Shift-And method produces the correct array entries, observe that for any i > 1 the array entry for cell (i, j) should be 1 if and only if the first i - 1 characters of P match the i - 1 characters of T ending at character j - 1 and character P(i) matches character T(j). The first condition is true when the array entry for cell (i - 1, j - 1) is 1, and the second condition is true when the ith bit of the U vector for character T(j) is 1. By first shifting column j - 1 and then ANDing with U(T(j)), the algorithm ANDs together entry (i - 1, j - 1) of column j - 1 with entry i of the vector U(T(j)). Hence the algorithm computes the correct entries for array M.

4.2.2. Shift-And is effective for small patterns

Although the Shift-And method is very simple, and in the worst case the number of bit operations is clearly Θ(mn), the practical efficiency comes from the fact that the vectors are bit vectors (of length n) and the operations are very simple: shifting by one position and ANDing bit vectors. Furthermore, column j only depends on column j - 1, so only two columns of M are needed at any given time; all previous columns can be forgotten. Thus when the pattern is relatively small, such as a single English word, every column of M and every U vector can be encoded into a single computer word, and both the Bit-Shift and the AND operations can be done as single-word operations. These are very fast operations in most computers and can easily be specified in languages such as C. Even if n is several times the size of a single computer word, only a few word operations are needed. Hence, for reasonably sized patterns, the Shift-And method is very efficient in both time and space regardless of the size of the text. From a purely theoretical standpoint it is not a linear-time method, but it certainly is practical and would be the method of choice in many circumstances.

4.2.3. agrep: The Shift-And method with errors

S. Wu and U. Manber [482] devised a method, packaged into a program called agrep, that amplifies the Shift-And method by finding inexact occurrences of a pattern in a text. By inexact we mean that the pattern either occurs exactly in the text or occurs with a "small" number of mismatches or inserted or deleted characters. For example, the pattern atcgaa occurs in the text aatatccacaa with two mismatches starting at position four; it also occurs with four mismatches starting at position two. In this section we will explain agrep and how it handles mismatches; the case of permitted insertions and deletions will be left as an exercise. For a small number of errors and for small patterns, agrep is extremely fast and can be used in the core of more elaborate text searching methods. Inexact matching is the focus of Part III, but the ideas behind agrep are so closely related to the Shift-And method that it is appropriate to examine agrep at this point.

Definition For a pattern P and text T, M^k(i, j) is the natural extension of the definition of M(i, j) to allow up to k mismatches. That is, M^k(i, j) is 1 if and only if the first i characters of P match a substring of T ending at position j, with at most k mismatches. M^0 is the array M used in the Shift-And method. If M^k(n, j) = 1, then there is an occurrence of P in T ending at position j that contains at most k mismatches.

4.2.4. How to compute M^k

Let k be the fixed maximum permitted number of mismatches specified by the user. The method computes M^l for all values of l between 0 and k. There are several ways to organize the computation and its description, but for simplicity we will compute column j of each array M^l before any columns past j are computed in any array. That is, for every j we compute column j in each of the arrays M^0, M^1, ..., M^k in increasing order of l. As in the Shift-And method, the zero column of each array is initialized to all zeros. We let M^l(j) denote the jth column of M^l. Then the jth column of M^l is computed by:

M^l(j) = M^(l-1)(j) OR [Bit-Shift(M^l(j - 1)) AND U(T(j))] OR Bit-Shift(M^(l-1)(j - 1)).

Intuitively, this just says that the first i characters of P will match a substring of T ending at position j, with at most l mismatches, if and only if one of the following three conditions holds:

  - The first i characters of P match a substring of T ending at j, with at most l - 1 mismatches.
  - The first i - 1 characters of P match a substring of T ending at j - 1, with at most l mismatches, and the next pair of characters in P and T are equal.
  - The first i - 1 characters of P match a substring of T ending at j - 1, with at most l - 1 mismatches.

It is simple to establish that these recurrences are correct. Column j of each array depends only on column j - 1, so all previous columns can be forgotten, and over the entire algorithm the number of bit operations is O(knm). The efficiency of the method depends on the size of k and n: when n is less than the size of a single computer word, a column of any M^l fits into a few words and the operations remain fast, but the larger k is, the slower the method. In agrep, the user chooses a value of k and then the arrays M^0, M^1, ..., M^k are computed. For many applications, a value of k as small as 3 or 4 is sufficient, and the method is extremely fast.

4.3. The match-count problem and Fast Fourier Transform

The Shift-And approach can be extended to compute, for every alignment, the number of characters of P that match T. To do so, let us define a new matrix MC.

Definition MC(i, j) is the number of characters of P[1..i] that match the characters of T[j - i + 1..j].

A simple algorithm to compute matrix MC generalizes the Shift-And method, replacing the AND operation with an increment by one: the zero column of MC starts with all zeros, and each MC(i, j) entry is set to MC(i - 1, j - 1) + 1 if P(i) = T(j), and otherwise it is set to MC(i - 1, j - 1). This computation is again a form of inexact matching, which is the focus of Part III; however, the solution is so connected to the Shift-And method that we consider it here, and it is a natural introduction to the next topic. This extension uses Θ(nm) additions and comparisons, although each addition operation is particularly simple, just incrementing by one.

If we want to compute the entire MC array, then Θ(nm) time is necessary, but the most important information is contained in the last row of MC. For each position j ≥ n in T, the last row indicates the number of characters that match when the right end of P is aligned with character j of T. Any entry with value n in the last row indicates an occurrence of P in T, but values less than n count the exact number of characters that match for each different alignment. The problem of finding the last row of MC is called the match-count problem. Match-counts are useful in several problems to be discussed later.
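The column-by-column computation of the arrays M^0, ..., M^k can be sketched as a direct bit-parallel rendering of the recurrence above. The function name, the 0-based end positions, and the reporting of the smallest mismatch level per position are illustrative choices, not prescribed by the text.

```python
def agrep_mismatches(P, T, k):
    """Bit-parallel search allowing up to k mismatches (no insertions or
    deletions).  M[l] holds one column of M^l, one bit per prefix of P,
    computed by:
      M^l(j) = M^{l-1}(j) | (BitShift(M^l(j-1)) & U(T(j)))
                          | BitShift(M^{l-1}(j-1)).
    Returns (end_position, l) pairs, l the smallest level that fires."""
    n = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    goal = 1 << (n - 1)
    mask = (1 << n) - 1
    M = [0] * (k + 1)                     # columns j-1 of M^0 .. M^k
    hits = []
    for j, c in enumerate(T):
        u = U.get(c, 0)
        new = [0] * (k + 1)
        new[0] = (((M[0] << 1) | 1) & u) & mask
        for l in range(1, k + 1):
            new[l] = (new[l - 1]
                      | (((M[l] << 1) | 1) & u)      # extend with a match
                      | ((M[l - 1] << 1) | 1)) & mask  # spend one mismatch
            # Bit-Shift sets the first bit: a length-1 prefix always fits.
        M = new
        for l in range(k + 1):
            if M[l] & goal:
                hits.append((j, l))
                break
    return hits
```

On the book's example, the pattern atcgaa is found ending at 0-based position 8 of aatatccacaa with two mismatches, which corresponds to the occurrence starting at (1-based) position four.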

3. i). The extra zeros are there to handle the cases when the left end of a is to the left end of fJ and. so yea. We still assume. when a = P and fJ = T the vector Yea. The solution will work for any alphabet. i) for each i. and all other characters become Os. To compute Va(a. Using Fast Fourier Similar definitions apply for the other three characters. position Pa to start at position i of a" and count the number of columns where both bits are equal to I. f3) = VAl>". For most problems of interest log m « 11. SEM1NUMER1CAL STRING MATCHING 4. fJ) contains the information needed for the last row of Me. where the indices in the expression are taken modulo 11 + m. .respectively. but we treat the FFI" itself as a black box and leave it to any interested reader to learn the details of the FFT. all we have done is to recede the match-count problem.Ifi e c rhen we have 1011000100010 10010110 Clearly. Negative numbers specify positions to the left of the left end of fJ.(a. Surprisingly.c. Further. For this more general problem the two strings involved will be denoted a and fJ rather than P and T.just directly count the number of resulting matches and mismatches).. i) can be directly computed in O(n) time (for any i. fJ. suggesting a way to solve these problems quickly with hardware. Convert the two strings into binary stnngs o. This is where correlation and the FIT come in. fJ} Transform for match-counts The match-count problem is essentially that of finding the last row of the matrix Me However. since the roles of the two strings will be completely symmetric. For example. This approach was developed by Fischer and Paterson [157J and independently in the biological literature by Pelsenstein. one for each character in the alphabet. and [99J. j~'Hm-l fJ. if i = 3 then we get 1011000100010 10010110 andtheV.fJ. and VoCO'. no number requires more than O(1ogm) bits). that lad = n ::: m = Itil The problem then becomes how to compute Va(a. 
Enough zeros were padded so that when the right end of a is right of the right end of fJ. THE MATCH·COUNT PROBLEM AND FAST FOUR(ER TRANSFORM 75 A fast worst-case method for the match-count problem? The high-level approach We break up the match-count problem into four problems. so this technique yields a large speedup. fJ) can be computed in O(IIIll) total time. 21123456789 accctgtcc aactgccg then the left end of a is aligned with position -2 of fJ Index i ranges from -n + I to m. V(O'. The O(mlogm) method is based on the Fast Fourier Transform (FFT). (3.3.gofDNA Vll(o:. and this receding doesn't suggest a way to compute Vo(a. 13) + V.3. f-l) in O(mlogm) total time by using the Fast Fourier Transform. Notice that when i > m .1. when a is aligned with fJ as follows. For concreteness we use the four-letteralphabeta. but rather we will solve a more general problem whose solution contains the desired information. and Koclun [152J. when the right end of a is to the right end of {3. but the operations are still more complex than just incrementing by one. We now show how to compute V(a. But it contains more information because we allow the left end of a to be to the left of the left end of 13. fJ) more efficiently than before the binary coding and padding. Hence no "illegitimate wraparound" of 0: and 13 is possible.74 4. [59).(a. fJ.11. and we also allow the right end of a to be to the right of the right end of fJ. the corresponding bits in the padded r"ia are all opposite zeros..3)=2. Sawyer. there is specialized hardware for FIT that is very fast. For any fixed i . i) = L j~1l r"iQ(j) x i3"U + j). Can any of the linear-time exact matching methods discussed in earlier chapters be adapted to solve the match-count problem in linear time? That is an open question. The extension of the Shift-And method discussed above solves the match-count problem but uses 8(nm) arithmeticoperationsinallca~es. the right end of a is right of the right end of 13. With these definitions. [581. 
the match-count problem can be solved with only O(m log m) arithmetic operations if we allow multiplication and division of complex numbers. and p". conversely. i) is correctly computed So far. where every occurrence of character a becomes aI. We will reduce the match-count problem to a problem that can be efficiently solved by the FIT. f3) + V. Other work that builds on this approach is found in [3]. fJ. let a be acaacggaggtor and fJ be accacgaag. but it is easiest to explain it on a small alphabet. The numbers remain small enough to permit the unit-time model of computation (that is. and positive numbers specify the other positions.For example. For example. Then the binary strings ii~ and Po are 10110001000(0 and 100100110. V(a.l.2. we will not work onMe directly. however. 4.(o:. f3) + V~(o:.
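The character-by-character reduction can be sketched as follows: for each character, correlate the two indicator vectors and sum the results. The book treats the FFT as a black box; the toy radix-2 FFT below and the 0-based left-end indexing of the returned counts are illustrative choices of this sketch, not the text's presentation.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    With invert=True computes the un-normalized inverse transform."""
    z = len(a)
    if z == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * z
    for k in range(z // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / z) * odd[k]
        out[k] = even[k] + w
        out[k + z // 2] = even[k] - w
    return out

def correlate(x, y):
    """Cross-correlation of equal-length real vectors via the FFT:
    corr[i] = sum_j x[j] * y[j + i], zero-padded to avoid wraparound."""
    z = 1
    while z < 2 * len(x):
        z *= 2
    X = fft([complex(v) for v in x] + [0j] * (z - len(x)))
    Y = fft([complex(v) for v in y] + [0j] * (z - len(y)))
    w = fft([Xk.conjugate() * Yk for Xk, Yk in zip(X, Y)], invert=True)
    return [round(c.real / z) for c in w]

def match_counts(alpha, beta, alphabet="acgt"):
    """counts[i] = number of matching characters when alpha's left end is
    placed at (0-based) position i of beta; one correlation per character."""
    m = len(beta)
    counts = [0] * m
    for ch in alphabet:
        a = [1 if c == ch else 0 for c in alpha] + [0] * (m - len(alpha))
        b = [1 if c == ch else 0 for c in beta]
        corr = correlate(a, b)
        for i in range(m):
            counts[i] += corr[i]
    return counts
```

Each of the four correlations costs O(m log m) arithmetic operations, which is the source of the overall bound.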

Cyclic correlation

Definition Let X and Y be two z-length vectors with real number components. The cyclic correlation of X and Y is a z-length real vector W, where

W(i) = Σ_{j=1}^{z} X(j) × Y(i + j),

and the indices in the expression are taken modulo z.

Clearly, for a fixed character x, the problem of computing V_x(α, β, i) for every position i is exactly the problem of computing the cyclic correlation of the padded strings ᾱ_x and β̄_x, with z = n + m. An algorithm based only on the definition of cyclic correlation would require Θ(z²) operations for two vectors of length z. But cyclic correlation is a classic problem known to be solvable in O(z log z) time using the Fast Fourier Transform. (The FFT is more often associated with the convolution problem for two vectors, but cyclic correlation and convolution are very similar: cyclic correlation is solved by reversing one of the input vectors and then computing the convolution of the two resulting vectors.) The FFT method, and its use in the solution of the cyclic correlation problem, is beyond the scope of this book, but the key is that it solves the cyclic correlation problem in O(z log z) arithmetic operations. Hence it solves the match-count problem using only O(m log m) arithmetic operations. This is surprisingly efficient and a definite improvement over the Θ(nm) bound given by the generalized Shift-And approach. However, the FFT requires operations over complex numbers, so each arithmetic step is more involved (and perhaps more costly) than in the more direct Shift-And method.

Handling wild cards in match-counts

Recall the wild card discussion begun in Chapter 3, where the wild card symbol φ matches any other single character. For example, if α = agφφctφa and β = agctctgt, then V(α, β, 1) = 7 (i.e., all positions are counted as a match except the last). How do we incorporate these wild card symbols into the FFT approach to computing match-counts?

If the wild cards occur in only one of the two strings, say α, then the solution is very direct: when computing V_x(α, β, i) for a character x, simply replace each wild card symbol in α with the character x. This works because, for any fixed starting point i and any position j in α, the jth position will contribute a 1 to V_x(α, β, i) for at most one character x, depending on what character of β is opposite position j of α.

But if wild cards occur in both α and β, then this direct approach will not work: if two wild cards are opposite each other when α starts at position i, those two symbols will be counted as a match when computing V_x(α, β, i) for every character x. So if for a fixed i there are k places where two wild cards line up, then the computed V(α, β, i) will be 3k larger than the correct value. How can we avoid this overcount? The answer is to find what k is and then correct the overcount. The idea is to treat φ as a real character: compute V_φ(α, β, i) for each i, where in both strings each wild card is replaced with a 1 and every other character with a 0. Then V_φ(α, β, i) is exactly k for position i, and subtracting three times it from the computed count gives the correct value. In summary:

Theorem 4.3.1. The match-count problem can be solved in O(m log m) time even if an unbounded number of wild cards are allowed in either P or T.

Later, after discussing suffix trees and common ancestors, we will present in Section 9.3 a different, more comparison-based approach to handling wild cards that appear in both strings.

4.4. Karp-Rabin fingerprint methods for exact match

The Shift-And method assumes that we can efficiently shift a vector of bits, and its extension to the match-count problem assumes that we can efficiently increment an integer by one. If we treat a (row) bit vector as an integer number, then a left shift by one bit results in the doubling of the number (assuming no bits fall off the left end). So it is not much of an extension to assume, in addition to being able to increment an integer, that we can also efficiently multiply an integer by two. With that added primitive operation we can turn the exact match problem (again without mismatches) into an arithmetic problem. The first result will be a simple linear-time method that has a very small probability of making an error.

4.4.1. Arithmetic replaces comparisons

Definition For a binary pattern P, consider P to be an n-bit binary number and define

H(P) = Σ_{i=1}^{n} 2^{n-i} P(i).

Similarly, for each position r in T, let T_r denote the n-length substring of T starting at position r; consider T_r to be an n-bit binary number and define H(T_r) accordingly.

For example, if P = 0101, then n = 4 and H(P) = 2³ × 0 + 2² × 1 + 2¹ × 0 + 2⁰ × 1 = 5.

Theorem 4.4.1. There is an occurrence of P starting at position r of T if and only if H(P) = H(T_r). For example, if T = 101101010 and P = 0101, then H(T_5) = 5 = H(P), and P occurs in T starting at position 5.

4.4. Karp-Rabin fingerprint methods for exact match

Theorem 4.4.1 converts the exact match problem into a numerical problem, comparing the two numbers H(P) and H(T_r) rather than directly comparing characters. The proof, which we leave to the reader, is an immediate consequence of the fact that every integer can be written in a unique way as the sum of positive powers of two. But unless the pattern is fairly small, the computation of H(P) and H(T_r) will not be efficient. The problem is that the required powers of two used in the definition of H(P) and H(T_r) grow large too rapidly: the number 2^n requires n bits, whereas in the unit-cost random access machine (RAM) model the largest allowed numbers must be represented in O[log(n + m)] bits. Thus the required numbers are exponentially too large, and the use of such large numbers violates the RAM model. (Even worse, when the alphabet is not binary but, say, has t characters, then numbers as large as t^n are needed.)

In 1987 R. Karp and M. Rabin [266] published a method (devised almost ten years earlier), called the randomized fingerprint method, that preserves the spirit of the above numerical approach, using numbers that satisfy the RAM model, but that is extremely efficient as well. It is a randomized method where the only if part of Theorem 4.4.1 continues to hold, but the if part does not. Instead, the if part will hold with high probability. This is explained in detail below.

Fingerprints of P and T

The general idea is that, instead of working with numbers as large as H(P) and H(T_r), we will work with those numbers reduced modulo a relatively small integer p. The arithmetic will then be done on numbers requiring only a small number of bits, and so will be efficient. But the really attractive feature of this method is a proof that the probability of error can be made small if p is chosen randomly in a certain range.

Definition For a positive integer p, Hp(P) = H(P) mod p. The number Hp(P) is called a fingerprint of P; Hp(T_r) is defined analogously.

The goal will be to choose a modulus p small enough that the arithmetic is kept efficient, yet large enough that the probability of a false match between P and T is kept small. Clearly, if H(P) and H(T_r) must be computed before they can be reduced modulo p, then we have the same problem of intermediate numbers that are too large. Fortunately, modular arithmetic allows one to reduce at any time (i.e., one can never reduce too much), so that the following generalization of Horner's rule holds: Hp(P) can be computed by applying Horner's rule and reducing modulo p after each operation. The point of Horner's rule is not only that the number of multiplications and additions required is linear, but that the intermediate numbers are always kept small. For example, if P = 101111 and p = 7, then Hp(P) can be computed as follows:

1
1 x 2 mod 7 + 0 = 2
2 x 2 mod 7 + 1 = 5
5 x 2 mod 7 + 1 = 4
4 x 2 mod 7 + 1 = 2
2 x 2 mod 7 + 1 = 5.

Computed directly, H(P) = 47 and Hp(P) = 47 mod 7 = 5.

Intermediate numbers are also kept small when computing Hp(T_r) for any r > 1, since that computation can be organized the way that Hp(P)'s was. But even greater efficiency is possible. Since

H(T_r) = 2 x H(T_{r-1}) - 2^n x T(r - 1) + T(r + n - 1),

it follows that

Hp(T_r) = [(2 x Hp(T_{r-1})) - (2^n mod p) x T(r - 1) + T(r + n - 1)] mod p.

Further, 2^n mod p = 2 x (2^{n-1} mod p) mod p, so each successive power of two taken mod p can be computed in constant time, and hence Hp(T_r) can be computed from Hp(T_{r-1}) with only a small constant number of operations. Moreover, every fingerprint remains in the range 0 to p - 1, so the size of a fingerprint does not violate the RAM model.

Theorem 4.4.2. If P occurs in T starting at position r, then Hp(P) = Hp(T_r). However, the converse does not hold for every r: from Hp(P) = Hp(T_r) alone we cannot necessarily conclude that P occurs in T starting at r.

Prime moduli limit false matches

The key comes from choosing p to be a prime number in the proper range and from exploiting properties of prime numbers. We state the needed properties without proof.

Definition For a positive integer u, pi(u) is the number of primes that are less than or equal to u.

The following theorem is a variant of the famous prime number theorem.

Theorem. u/ln(u) <= pi(u) <= 1.26 u/ln(u), where ln(u) is the base e logarithm of u [383].

Lemma 4.4.1. If u >= 29, then the product of all the primes that are less than or equal to u is greater than 2^u [383].

For example, for u = 29 the prime numbers less than or equal to 29 are 2, 3, 5, 7, 11, 13, 17, 19, 23, and 29. Their product is 6,469,693,230, whereas 2^29 is 536,870,912.
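The mod-p Horner evaluation and the constant-time window update just described can be sketched as follows; this is a minimal illustration for a binary string, with function names of our own choosing, not code from the book:

```python
def fingerprint(bits, p):
    # Generalized Horner's rule: reduce mod p after every step, so each
    # intermediate value stays small instead of growing to n bits.
    h = 0
    for b in bits:
        h = (2 * h + b) % p
    return h

def next_fingerprint(h_prev, left_bit, new_bit, n, p):
    # Hp(T_r) from Hp(T_{r-1}) in a constant number of operations:
    # H(T_r) = 2 * H(T_{r-1}) - 2^n * T(r-1) + T(r+n-1), all taken mod p.
    return (2 * h_prev - pow(2, n, p) * left_bit + new_bit) % p

P = [1, 0, 1, 1, 1, 1]              # the book's example: P = 101111, H(P) = 47
assert fingerprint(P, 7) == 47 % 7 == 5

T = [0, 1, 0, 1, 1, 1, 1, 0, 1]
n = len(P)
h = fingerprint(T[:n], 7)
for r in range(1, len(T) - n + 1):
    h = next_fingerprint(h, T[r - 1], T[r + n - 1], n, 7)
    assert h == fingerprint(T[r:r + n], 7)   # rolling update agrees with Horner
```

Python's three-argument `pow` computes 2^n mod p without ever forming the n-bit number 2^n, which is exactly the point of the successive-squaring-free update described in the text.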

Lemma 4.4.2. If u >= 29 and x is any number less than or equal to 2^u, then x has fewer than pi(u) (distinct) prime divisors.

PROOF Suppose x does have k > pi(u) distinct prime divisors q1, q2, ..., qk. Then 2^u >= x >= q1 q2 ... qk (the first inequality is by assumption; the second holds because some primes in the factorization of x may be repeated). But q1 q2 ... qk is at least as large as the product of the smallest k primes, which is greater than the product of the first pi(u) primes (by the assumption that k > pi(u)), which in turn is greater than 2^u (by Lemma 4.4.1). So the assumption that k > pi(u) leads to the contradiction that 2^u > 2^u, and the lemma is proved. □

We now define a false match between P and T to mean that there is an r such that P does not occur in T starting at r, but Hp(P) = Hp(T_r).

The central theorem

Now we are ready for the central theorem of the Karp-Rabin approach.

Theorem 4.4.3. Let P and T be any strings such that nm >= 29, where n and m are the lengths of P and T, respectively. If p is a prime chosen randomly between 1 and I, then the probability of a false match between P and T is at most pi(nm)/pi(I).

The key step of the proof is that any prime that allows Hp(P) = Hp(T_r) at some position r where P does not occur must divide |H(P) - H(T_r)|, and hence divides the product of these differences over all such positions, a number that is less than 2^{nm}. By Lemma 4.4.2, such a prime therefore lies in a set of at most pi(nm) integers, whereas p was chosen randomly from among the pi(I) primes between 1 and I.

Theorem 4.4.3 leads to the following random fingerprint algorithm for finding all occurrences of P in T.

Random fingerprint algorithm
1. Choose a positive integer I (to be discussed in more detail below).
2. Randomly pick a prime p less than or equal to I, and compute Hp(P).
3. For each position r in T, compute Hp(T_r) and test to see if it equals Hp(P). If the numbers are equal, then either declare a probable match or check explicitly that P occurs in T starting at that position r.

Given the fact that each Hp(T_r) can be computed in constant time from Hp(T_{r-1}), the fingerprint algorithm runs in O(m) time, excluding any time used to explicitly check a declared match. For typical values of n and m, it may be reasonable not to bother explicitly checking declared matches; we will return to the issue of checking later.

How to choose I

To fully analyze the probability of error, we have to answer the question of what I should be.

Corollary 4.4.1. When I = nm^2, a little arithmetic (which we leave to the reader) shows that the probability of a false match between P and T is at most 2.53/m.

Extensions

If one prime is good, why not use several? Why not pick k primes p1, p2, ..., pk randomly between 1 and I and compute k fingerprints? For each position r in T, compute Hpi(T_r) for every one of the k primes; there can be an occurrence of P starting at r only if Hpi(P) = Hpi(T_r) for every one of the k selected primes.
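The three numbered steps of the random fingerprint algorithm can be sketched directly. This is a toy version: it assumes a binary string, finds the random prime by trial division (adequate only for the small limits of a demonstration, not for a realistic I = nm^2), and all function names are ours:

```python
import random

def is_prime(x):
    # trial division; fine for a demonstration with small limits
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True

def random_prime(limit):
    # step 2: randomly pick a prime p <= limit
    while True:
        candidate = random.randint(2, limit)
        if is_prime(candidate):
            return candidate

def fingerprint(bits, p):
    h = 0
    for b in bits:
        h = (2 * h + b) % p
    return h

def random_fingerprint_match(P, T):
    # steps 1-3 with I = n * m^2; returns the DECLARED positions (0-based),
    # which always include every true occurrence but may include false matches
    n, m = len(P), len(T)
    p = random_prime(n * m * m)
    hp = fingerprint(P, p)
    ht = fingerprint(T[:n], p)
    two_n = pow(2, n, p)
    declared = []
    for r in range(m - n + 1):
        if r > 0:
            # constant-time rolling update of Hp(T_r) from Hp(T_{r-1})
            ht = (2 * ht - two_n * T[r - 1] + T[r + n - 1]) % p
        if ht == hp:
            declared.append(r)
    return declared

P = [1, 0, 1]
T = [1, 0, 1, 1, 1, 0, 1]
occ = random_fingerprint_match(P, T)
assert set(occ) >= {0, 4}     # the true occurrences are never missed
```

The final assertion reflects the one-sided nature of the method: fingerprint equality is necessary for an occurrence, so only false extras, never misses, are possible.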

Theorem 4.4.4. When k primes are chosen randomly between 1 and I and k fingerprints are used, the probability of a false match between P and T is at most [pi(nm)/pi(I)]^k.

This makes intuitive sense: since the primes are chosen randomly and independently, the bound just multiplies the probability that each of the k primes allows a false match somewhere in T, for a false match can occur only if each of the k primes is in the set of at most pi(nm) integers identified in the proof of Theorem 4.4.3. Applying this with k = 4 and I = nm^2 to the running example with m = 4000 reduces the probability of a false match by many orders of magnitude, while the computational effort of using four primes only increases by four times.

Even lower limits on error

Notice that Theorem 4.4.3 randomizes over the choice of primes and bounds the probability that a randomly picked prime will allow a false match anywhere in T. However, for the algorithm to actually make an error at some specific position r, each of the k primes must simultaneously allow a false match at the same r. This is an even less likely event. If each of the k primes allows Hpi(P) = Hpi(T_r) at the same position r where P does not occur, then each prime divides |H(P) - H(T_r)| <= 2^n, so each lies in a set of at most pi(n) integers, and the probability that all k primes do so is at most [pi(n)/pi(I)]^k. With k = 4 and n, m, and I as in the previous example, a little arithmetic (which we leave to the reader) shows that the probability of a false match between P and T is bounded by 10^{-12}. For typical values of n and m, a small choice of k will assure that the probability of an error due to a false match is less than the probability of error due to a hardware malfunction. For additional probabilistic analysis of the Karp-Rabin method, see [182].

Returning to the case where only a single prime is used, suppose the algorithm explicitly checks whether P occurs in T whenever Hp(P) = Hp(T_r), and for some r it finds that P does not occur there. Theorem 4.4.3 says nothing about how bad a particular prime can be, and once the prime has been shown to allow a false match, it is no longer random. It may well be a prime that allows numerous false matches (a demon seed). Then one may be better off by picking a new prime to use for the continuation of the computation. We mention this further refinement, discussed in [266], because by picking a new prime after each error is detected, the probability of a long series of errors on any particular problem instance falls so rapidly that one is effectively protected against it.

Checking for error in linear time

All the variants of the Karp-Rabin method presented above have the property that they find all true occurrences of P in T, but they may also find false matches: locations where P is declared to be in T even though it is not there. If one checks for P at each declared location directly, this checking would seem to require Theta(nm) worst-case time, although the expected time can be made smaller. We present here an O(m)-time method, noted first by S. Muthukrishnan [336], that determines if any of the declared locations are false matches. That is, in O(m) time the method either verifies that the Karp-Rabin algorithm has found no false matches or it declares that there is at least one false match (but it may not be able to find all the false matches). The method is related to Galil's extension of the Boyer-Moore algorithm (Section 3.2), but the reader need not have read that section.

Consider a list L of (starting) locations in T where the Karp-Rabin algorithm declares P to be found. A run is a maximal interval of consecutive starting locations l1, l2, ... in L such that every two successive numbers in the interval differ by at most n/2 (i.e., l_{i+1} - l_i <= n/2). The method works on each run separately, so we first discuss how to check for false matches in a single run.

In a single run, the method explicitly checks for the occurrence of P at the first two positions in the run, l1 and l2. If P does not occur at both of those locations, then the method has found a false match and stops. Otherwise, the method learns that P is semiperiodic with period l2 - l1. We use d to refer to l2 - l1, and we show that d is the smallest period of P. If d is not the smallest period, then d must be a multiple of the smallest period, say d'. (This follows easily from the GCD Theorem.) But that implies that there is an occurrence of P starting at position l1 + d' < l2, and since the Karp-Rabin method never misses any occurrence of P, that contradicts the choice of l2 as the second declared occurrence of P in the interval. So d must be the smallest period of P, and it follows that if there are no false matches in the run, then l_{i+1} - l_i = d for each i in the run.

With this observation, for each subsequent position l_i in the run, the method verifies that l_i - l_{i-1} = d; it then suffices to successively check the last d characters in each declared occurrence of P against the last d characters of P. If any of these successive checks finds a mismatch, the method declares a false match in the run and stops. Otherwise, P does in fact occur starting at each declared location in the run.

For the time analysis, note first that no character of T is examined more than twice during a check of a single run. Moreover, since two runs are separated by at least n/2 positions and each run spans at least n positions, no character of T can be examined in more than two consecutive runs. It follows that the total time for the method, over all runs, is O(m).
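With a checking procedure available, the Monte Carlo method can be wrapped into one that never errs: rerun with a fresh random prime until an explicit check finds no false match. In the sketch below the check is the direct comparison at each declared position (worst case Theta(nm)) rather than the O(m) run-based method described above, and all names are ours:

```python
import random

def is_prime(x):
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True

def random_prime(limit):
    while True:
        c = random.randint(2, limit)
        if is_prime(c):
            return c

def fingerprint(bits, p):
    h = 0
    for b in bits:
        h = (2 * h + b) % p
    return h

def declared_positions(P, T, p):
    # fingerprint scan; each window is recomputed directly for brevity
    # (the rolling update would make this pass O(m))
    n = len(P)
    hp = fingerprint(P, p)
    return [r for r in range(len(T) - n + 1)
            if fingerprint(T[r:r + n], p) == hp]

def las_vegas_match(P, T):
    # Rerun with a new random prime until no declared position is false.
    # The returned list is then exactly the set of true occurrences, since
    # fingerprints never miss an occurrence and every extra was rejected.
    n = len(P)
    limit = n * len(T) ** 2
    while True:
        p = random_prime(limit)
        declared = declared_positions(P, T, p)
        if all(T[r:r + n] == P for r in declared):
            return declared

assert las_vegas_match([1, 0, 1], [1, 0, 1, 1, 0, 1]) == [0, 3]
```

Because a randomly chosen prime is bad only with probability O(1/m), the expected number of repetitions is close to one, which is what makes the expected running time linear.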

Why fingerprints?

The Karp-Rabin fingerprint method runs in linear worst-case time, but with a nonzero (though extremely small) chance of error. With the ability to check for false matches in O(m) time, the Karp-Rabin algorithm can alternatively be converted from a method with a small probability of error that runs in O(m) worst-case time to one that makes no error but runs in O(m) expected time (a conversion from a Monte Carlo algorithm to a Las Vegas algorithm). To achieve this, simply (re)run and (re)check the Karp-Rabin algorithm until no false matches are detected. We leave the details as an exercise.

In contrast, we have seen several methods that run in linear worst-case time and never make errors. So what is the point of studying the Karp-Rabin method? There are three responses to this question. First, the method is based on very different ideas than the linear-time methods that guarantee no error, and a central goal of this book is to present a diverse collection of ideas used in a range of techniques. Second, the method is simple and can be extended to other problems, such as two-dimensional pattern matching with odd pattern shapes, a problem that is more difficult for other methods. Third, the method is accompanied by concrete proofs establishing significant properties of its performance. Methods similar in spirit to fingerprints (or filters) predate the Karp-Rabin method, but they generally lack any theoretical analysis, and little has been proven about their performance.

4.5. Exercises
1. Evaluate empirically the Shift-And method against the other methods discussed earlier. Vary the sizes of P and T.
2. Show how to extend the Shift-And method to efficiently handle wild cards, both in the pattern and the text. Show that the efficiency of the method is not affected by the number of wild cards. Do the same for agrep.
3. Adapt the Shift-And method and agrep to handle a set of patterns. Can you do better than just handling each pattern in the set independently?
4. PROSITE patterns often specify a range for the number of times that a subpattern repeats. Show how range specifications of this type can be easily handled by the Shift-And method. Can such ranges be efficiently handled by agrep? Explain the utility of these extensions for collections of patterns such as those that appear in PROSITE.
5. Extend the agrep method to find an "occurrence" of a pattern P inside a text T when a small number of insertions and deletions of characters, as well as mismatches, are allowed.
6. Show how the Shift-And approach can be extended to handle regular expressions that do not use the Kleene closure, and compare it with the regular expression pattern matching method of Chapter 3.
7. Devise a purely comparison-based method to compute match-counts in O(m log m) time; that is, examine whether the FFT can be replaced with character comparisons in the match-count problem. (Open problem.)
8. Complete the arithmetic details in the proof of Corollary 4.4.1.
9. Show how to extend the Karp-Rabin fingerprint approach to the two-dimensional pattern matching problem discussed in Chapter 3. Do it.
10. Prove the correctness of the O(m)-time method for checking declared matches. In particular, explain why, in each run, the method needs to explicitly check for P at only the first two positions and not at every declared position.
11. Complete the details and analysis of the algorithm that converts the Karp-Rabin method from a Monte Carlo-style randomized algorithm to a Las Vegas-style randomized algorithm, one that never makes an error and whose expected running time is O(m).
12. There are improvements possible in the Las Vegas-style method of the previous exercise. Find and analyze some of them.

PART II
Suffix Trees and Their Uses

5 Introduction to Suffix Trees

A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing discussed in Section 1.3. Suffix trees can be used to solve the exact matching problem in linear time (achieving the same worst-case bound that the Knuth-Morris-Pratt and the Boyer-Moore algorithms achieve), but their real virtue comes from their use in linear-time solutions to many string problems more complex than exact matching, as will be illustrated by additional applications in Chapter 9.

The classic application for suffix trees is the substring problem. One is first given a text T of length m. After O(m), or linear, preprocessing time, one must be prepared to take in any unknown string S of length n and in O(n) time either find an occurrence of S in T or determine that S is not contained in T. That is, the allowed preprocessing takes time proportional to the length of the text, but thereafter, the search for S must be done in time proportional to the length of S, independent of the length of T. These bounds are achieved with the use of a suffix tree: the suffix tree for the text is built in an O(m) preprocessing stage; thereafter, whenever a string of length n is input, the algorithm searches for it in O(n) time using that suffix tree.

The O(m) preprocessing and O(n) search result for the substring problem is extremely useful, because m may be huge compared to n. The methods of Part I would preprocess each requested string on input and then take Theta(m) (worst-case) time to search for the string in the text; those algorithms would be impractical on any but trivial-sized texts.

Often the text is a fixed set of strings, for example, a collection of STSs or ESTs (see Sections 3.5.1 and 7.10), so that the substring problem is to determine whether the input string is a substring of any of the fixed strings. Superficially, this case of multiple text strings resembles the dictionary problem discussed in the context of the Aho-Corasick algorithm, and it is natural to expect that the Aho-Corasick algorithm could be applied. However, that method does not solve the substring problem in the desired time bounds, because it will only determine whether the new string is a full string in the dictionary, not whether it is a substring of a string in the dictionary. Suffix trees work nicely to efficiently solve this problem as well.

After presenting the construction algorithms, several applications and extensions will be discussed in Chapter 7. Then a remarkable result, the constant-time least common ancestor method, will be presented in Chapter 8; that method greatly amplifies the utility of suffix trees. Moreover (as we will detail in Chapter 9), suffix trees provide a bridge between exact matching problems, the focus of Part I, and inexact matching problems that are the focus of Part III.

5.1. A short history

The first linear-time algorithm for constructing suffix trees was given by Weiner [473] in 1973, although he called his tree a position tree. A different, more space efficient algorithm to build suffix trees in linear time was given by McCreight [318] a few years later. More recently, Ukkonen [438] developed a conceptually different linear-time algorithm for building suffix trees that has all the advantages of McCreight's algorithm (and when properly viewed can be seen as a variant of McCreight's algorithm) but allows a much simpler explanation.

Although more than twenty years have passed since Weiner's original result (which Knuth is claimed to have called "the algorithm of 1973" [24]), suffix trees have not made it into the mainstream of computer science education, and they have generally received less attention and use than might have been expected. This is probably because the two original papers of the 1970s have a reputation for being extremely difficult to understand. That reputation is well deserved but unfortunate, because the algorithms, although nontrivial, are no more complicated than are many widely taught methods. Moreover, when implemented well, the algorithms are practical and allow efficient solutions to many complex string problems. We know of no other single data structure (other than those essentially equivalent to suffix trees) that allows efficient solutions to such a wide range of complex string problems.

Our approach is to introduce each algorithm at a high level, giving simple, inefficient implementations. Those implementations are then incrementally improved to achieve linear running times. We believe that the expositions and analyses given here, particularly for Weiner's algorithm, are much simpler and clearer than in the original papers, and we hope that these expositions result in a wider use of suffix trees in practice. Chapter 6 fully develops the linear-time algorithms of Ukkonen and Weiner and then briefly mentions the high-level organization of McCreight's algorithm and its relationship to Ukkonen's algorithm.

5.2. Basic definitions

When describing how to build a suffix tree for an arbitrary string, we will refer to the generic string S of length m. We do not use P or T (denoting pattern and text) because suffix trees are used in a wide range of applications where the input string sometimes plays the role of a pattern, sometimes a text, sometimes both, and sometimes neither.

Definition A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i..m].

Definition For any node v in a suffix tree, the path-label of v is the concatenation of the edge-labels on the path from the root to v, and the string-depth of v is the number of characters in v's label.

For example, the suffix tree for the string xabxac is shown in Figure 5.1. The path from the root to the leaf numbered 1 spells out the full string S = xabxac, while the path to the leaf numbered 5 spells out the suffix ac, which starts in position 5 of S. String xa labels the internal node w (so node w has path-label xa), string a labels node u, and string xabx labels a path that ends inside the leaf edge touching leaf 1.

Figure 5.1: Suffix tree for string xabxac. The node labels u and w on the two interior nodes will be used later.

As stated above, the definition of a suffix tree for S does not guarantee that a suffix tree for any string S actually exists. The problem is that if one suffix of S matches a prefix of another suffix of S, then no suffix tree obeying the above definition is possible, since the path for the first suffix would not end at a leaf. For example, if the last character of xabxac is removed, creating string xabxa, then suffix xa is a prefix of suffix xabxa, so the path spelling out xa would not end at a leaf.

To avoid this problem, we can add a character to the end of S that is not in the alphabet that string S is taken from. Then no suffix of the resulting string can be a prefix of any other suffix. In this book we use \$ for the "termination" character. When it is important to emphasize the fact that this termination character has been added, we will write it explicitly as in S\$. Much of the time, however, this reminder will not be necessary and, unless explicitly stated otherwise, every string S is assumed to be extended with the termination symbol \$, even if the symbol is not explicitly shown. In the examples below, we assume (as was true in Figure 5.1) that the last character of S appears nowhere else in S.

A suffix tree is related to the keyword tree (without backpointers) considered in Section 3.4. Given string S, if set P is defined to be the m suffixes of S, then the suffix tree for S can be obtained from the keyword tree for P by merging any path of nonbranching nodes into a single edge. The simple algorithm given in Section 3.4 for building keyword trees could therefore be used to construct a suffix tree for S in O(m^2) time, rather than the O(m) bound we will establish.

5.3. A motivating example

Before diving into the details of the methods to construct suffix trees, let's look at how a suffix tree for a string is used to solve the exact match problem: Given a pattern P of

length n and a text T of length m, find all occurrences of P in T in O(n + m) time.

We have already seen several solutions to this problem. Suffix trees provide another approach: Build a suffix tree T for text T in O(m) time. Then, match the characters of P along the unique path in T until either P is exhausted or no more matches are possible. In the latter case, P does not appear anywhere in T. In the former case, every leaf in the subtree below the point of the last match is numbered with a starting location of P in T, and every starting location of P in T numbers such a leaf.

The key to understanding the former case (when all of P matches a path in T) is to note that P occurs in T starting at position j if and only if P occurs as a prefix of T[j..m]. But that happens if and only if string P labels an initial part of the path from the root to leaf j. It is the initial path that will be followed by the matching algorithm.

The matching path is unique because no two edges out of a common node can have edge-labels beginning with the same character. And, because we have assumed a finite alphabet, the work at each node takes constant time, so the time to match P to a path is proportional to the length of P.

For example, Figure 5.2 shows a fragment of the suffix tree for string T = awyawxawxz. Pattern P = aw appears three times in T, starting at locations 1, 4, and 7. Pattern P matches a path down to the point shown by an arrow, and as required, the leaves below that point are numbered 1, 4, and 7.

Figure 5.2: Three occurrences of aw in awyawxawxz. Their starting positions number the leaves in the subtree of the node with path-label aw.

If P fully matches some path in the tree, the algorithm can find all the starting positions of P in T by traversing the subtree below the end of the matching path, collecting position numbers written at the leaves. All occurrences of P in T can therefore be found in O(n + m) time. This is the same overall time bound achieved by several algorithms considered in Part I, but the distribution of work is different. Those earlier algorithms spend O(n) time for preprocessing P and then O(m) time for the search. In contrast, the suffix tree approach spends O(m) preprocessing time and then O(n + k) search time, where k is the number of occurrences of P in T.

To collect the k starting positions of P, traverse the subtree at the end of the matching path using any linear-time traversal (depth-first say), and note the leaf numbers encountered. Since every internal node has at least two children, the number of leaves encountered is proportional to the number of edges traversed, so the time for the traversal is O(k), even though the total string-depth of those O(k) edges may be arbitrarily larger than k.

If only a single occurrence of P is required, and the preprocessing is extended a bit, then the search time can be reduced from O(n + k) to O(n) time. The idea is to write at each node one number (say the smallest) of a leaf in its subtree. This can be achieved in O(m) time in the preprocessing stage by a depth-first traversal of T. The details are straightforward and are left to the reader. Then, in the search stage, the number written on the node at or below the end of the match gives one starting position of P in T.

In Section 7.1 we will again consider the relative advantages of methods that preprocess the text versus methods that preprocess the pattern(s). Later, in Section 7.8, we will also show how to use a suffix tree to solve the exact matching problem using O(n) preprocessing and O(m) search time, achieving the same bounds as in the algorithms presented in Part I.

5.4. A naive algorithm to build a suffix tree

To further solidify the definition of a suffix tree and develop the reader's intuition, we present a straightforward algorithm to build a suffix tree for string S. This naive method first enters a single edge for suffix S[1..m]\$ (the entire string) into the tree; then it successively enters suffix S[i..m]\$ into the growing tree, for i increasing from 2 to m. We let N_i denote the intermediate tree that encodes all the suffixes from 1 to i.

In detail, tree N_1 consists of a single edge between the root of the tree and a leaf labeled 1. The edge is labeled with the string S\$. Tree N_{i+1} is constructed from N_i as follows: Starting at the root of N_i, find the longest path from the root whose label matches a prefix of S[i+1..m]\$. This path is found by successively comparing and matching characters in suffix S[i+1..m]\$ to characters along a unique path from the root, until no further matches are possible. The matching path is unique because no two edges out of a node can have labels that begin with the same character. At some point, no further matches are possible, because no suffix of S\$ is a prefix of any other suffix of S\$. When that point is reached, the algorithm is either at a node, w say, or it is in the middle of an edge. If it is in the middle of an edge, (u, v) say, then it breaks edge (u, v) into two edges by inserting a new node, called w, just after the last character on the edge that matched a character in S[i+1..m]\$ and just before the first character on the edge that mismatched. The new edge (u, w) is labeled with the part of the (u, v) label that matched with S[i+1..m]\$, and the new edge (w, v) is labeled with the remaining part of the (u, v) label. Then (whether a new node w was created or whether one already existed at the point where the match ended), the algorithm creates a new edge (w, i+1) running from w to a new leaf labeled i+1, and it labels the new edge with the unmatched part of suffix S[i+1..m]\$.

The tree now contains a unique path from the root to leaf i+1, and this path has the label S[i+1..m]\$. Note that all edges out of the new node w have labels that begin with different first characters, and so it follows inductively that no two edges out of a node have labels with the same first character. Assuming, as usual, a bounded-size alphabet, the above naive method takes O(m^2) time to build a suffix tree for the string S of length m.
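The naive construction and the subtree-collection search described above can be sketched as follows; edge labels are stored as plain substrings (so this is the O(m^2) method, not a space-efficient implementation), and the class and function names are ours:

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> (label, child)
        self.leaf = None     # suffix number, set at leaves only

def build_suffix_tree(S):
    # Naive O(m^2) construction: enter suffix S[i..m]$ for i = 1, 2, ...,
    # splitting an edge with a new node when a match ends inside its label.
    S = S + "$"
    root = Node()
    for i in range(len(S)):
        node, rest = root, S[i:]
        while rest:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = i + 1                       # leaves numbered from 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):                         # matched the whole edge
                node, rest = child, rest[k:]
            else:                                       # split the edge at depth k
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                node, rest = mid, rest[k:]
    return root

def occurrences(root, P):
    # Match P along the unique path from the root, then collect the leaf
    # numbers in the subtree below the end of the matching path.
    node, rest = root, P
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return []
        node, rest = child, rest[k:]
    found, stack = [], [node]
    while stack:
        v = stack.pop()
        if v.leaf is not None:
            found.append(v.leaf)
        stack.extend(child for _, child in v.children.values())
    return sorted(found)

tree = build_suffix_tree("awyawxawxz")        # the example of Figure 5.2
assert occurrences(tree, "aw") == [1, 4, 7]   # three occurrences, as in the text
```

The search cost splits exactly as described: matching P along the path is proportional to the length of P, and the subtree traversal is proportional to the number k of occurrences.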


6 Linear-Time Construction of Suffix Trees

We will present two methods for constructing suffix trees in detail, Ukkonen's method and Weiner's method. Weiner was the first to show that suffix trees can be built in linear time, and his method is presented both for its historical importance and for some different technical ideas that it contains. However, Ukkonen's method is equally fast and uses far less space (i.e., memory) in practice than Weiner's method. Hence Ukkonen's is the method of choice for most problems requiring the construction of a suffix tree. We also believe that Ukkonen's method is easier to understand, and therefore it will be presented first. A reader who wishes to study only one method is advised to concentrate on it. However, our development of Weiner's method does not depend on understanding Ukkonen's algorithm, and the two algorithms can be read independently (with one small shared section noted in the description of Weiner's method).

Figure 6.1: Suffix tree for string xabxa\$

Figure 6.2: Implicit suffix tree for string xabxa

6.1. Ukkonen's linear-time suffix tree algorithm
Esko Ukkonen [438] devised a linear-time algorithm for constructing a suffix tree that may be the conceptually easiest linear-time construction algorithm. This algorithm has a space-saving improvement over Weiner's algorithm (which was achieved first in the development of McCreight's algorithm), and it has a certain "on-line" property that may be useful in some situations. We will describe that on-line property but emphasize that the main virtue of Ukkonen's algorithm is the simplicity of its description, proof, and time analysis. The simplicity comes because the algorithm can be developed as a simple but inefficient method, followed by "common-sense" implementation tricks that establish a better worst-case running time. We believe that this less direct exposition is more understandable, as each step is simple to grasp.

The implicit suffix tree of a string S (defined below) will have fewer leaves than the suffix tree for string S\$ if and only if at least one of the suffixes of S is a prefix of another suffix. The terminal symbol \$ was added to the end of S precisely to avoid this situation. However, if S ends with a character that appears nowhere else in S, then the implicit suffix tree of S will have a leaf for each suffix and will hence be a true suffix tree.

As an example, consider the suffix tree for string xabxa\$ shown in Figure 6.1. Suffix xa is a prefix of suffix xabxa, and similarly the string a is a prefix of abxa. Therefore, in the suffix tree for xabxa\$ the edges leading to leaves 4 and 5 are labeled only with \$. Removing these edges creates two nodes with only one child each, and these are then removed as well. The resulting implicit suffix tree for xabxa is shown in Figure 6.2. As another example, Figure 5.1 on page 91 shows a tree built for the string xabxac. Since character c appears only at the end of the string, the tree in that figure is both a suffix tree and an implicit suffix tree for the string.
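The criterion just stated can be checked directly: a suffix keeps its leaf in the implicit suffix tree exactly when it is not a prefix of another suffix. A small illustration (the helper name is ours):

```python
def implicit_leaf_suffixes(S):
    # Suffix numbers (1-based) that remain leaves once the $-edges are removed,
    # i.e., suffixes of S that are not prefixes of any other suffix of S.
    suffixes = [S[i:] for i in range(len(S))]
    return [i + 1 for i, s in enumerate(suffixes)
            if not any(j != i and t.startswith(s)
                       for j, t in enumerate(suffixes))]

assert implicit_leaf_suffixes("xabxa") == [1, 2, 3]   # xa and a lose their leaves
assert implicit_leaf_suffixes("xabxac") == [1, 2, 3, 4, 5, 6]   # c is unique: true suffix tree
```

The two assertions mirror the figures: for xabxa the leaves 4 (suffix xa) and 5 (suffix a) disappear, while for xabxac the implicit suffix tree and the suffix tree coincide.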
Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S -each suffix is spelled out by the characters on some path from the root of the implicit suffix tree. However, if the path does not end at a leaf, there will be no marker to indicate the path's end. Thus implicit suffix trees, on their own, are somewhat less informative than true suffix trees. We will use them just as a tool in Ukkonen's algorithm to finally obtain the true suffix tree for S

6.1.1. Implicit

suffix
implicit

trees
suffix trees, the last of which is

Ukkonen's algorithm constructs a sequence of converted to a true suffix tree of the string S

Definition An implicit sufJu tree for string S is a tree obtained from the suffix tree for S\$ by removing every copy of the terminal symbol \$ from .the edge labels of the tree, then removing any edge that has no label,andthen removing any node that does not have at least tWQ children An implicit suffix tree for a prefix S[I ..il of S is similarly defined by taking the suffix tree for S[l..i]\$ and deleting \$ symbols, edges, and nodes as above Definition tom. The implicit suffix tree for any string S will have fewer leaves than the suffix tree for '4 We denote the implicit suffix tree of the string
S[I .. i]
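The leaf-count claim above can be checked with a small sketch (the Node layout and helper names are our own illustration, not from the text): naively inserting the suffixes of xabxa with no terminal symbol yields only three leaves, yet every suffix is still spelled out on some path.

```python
# Naive construction of an implicit suffix tree: insert each suffix; a
# suffix that is a prefix of an existing path adds nothing (its end is
# simply left unmarked), mirroring the discussion above.

class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child node)

def implicit_tree(s):
    root = Node()
    for i in range(len(s)):
        node, suffix = root, s[i:]
        while suffix:
            c = suffix[0]
            if c not in node.children:
                node.children[c] = (suffix, Node())  # new leaf edge
                break
            label, child = node.children[c]
            k = 0
            while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
                k += 1
            if k == len(suffix):       # suffix ends inside the tree: no marker
                break
            if k < len(label):         # diverges inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[c] = (label[:k], mid)
                child = mid
            node, suffix = child, suffix[k:]
    return root

def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, child in node.children.values())

def spells(node, w):
    # True if w is spelled out on some path from the root
    while w:
        if w[0] not in node.children:
            return False
        label, child = node.children[w[0]]
        if len(w) <= len(label):
            return label[:len(w)] == w
        if not w.startswith(label):
            return False
        node, w = child, w[len(label):]
    return True
```

For xabxa, only the suffixes xabxa, abxa, and bxa end at leaves; xa and a end unmarked inside edges, exactly as in Figure 6.2.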

6.1.2. Ukkonen's algorithm at a high level

Ukkonen's algorithm constructs an implicit suffix tree I_i for each prefix S[1..i] of S, starting from I_1 and incrementing i by one until I_m is built. The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m). We will explain
Ukkonen's algorithm by first presenting an O(m^3)-time method to build all trees I_i and then optimizing its implementation to obtain the claimed time bound.

High-level description of Ukkonen's algorithm

Ukkonen's algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i. Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1..i+1]. In extension j of phase i+1, the algorithm first finds the end of the path from the root labeled with substring S[j..i]. It then extends the substring by adding the character S(i+1) to its end, unless S(i+1) already appears there. So in phase i+1, string S[1..i+1] is first put in the tree, followed by strings S[2..i+1], S[3..i+1], ... (in extensions 1, 2, 3, ..., respectively). Extension i+1 of phase i+1 extends the empty suffix of S[1..i], that is, it puts the single character string S(i+1) into the tree (unless it is already there). Tree I_1 is just the single edge labeled by character S(1). Procedurally, the algorithm is as follows:

High-level Ukkonen algorithm
Construct tree I_1.
For i from 1 to m-1 do
begin {phase i+1}
    For j from 1 to i+1
    begin {extension j}
        Find the end of the path from the root labeled S[j..i] in the
        current tree. If needed, extend that path by adding character
        S(i+1), thus assuring that string S[j..i+1] is in the tree.
    end;
end;

Figure 6.3: Implicit suffix tree for string axabx before the sixth character, b, is added

Suffix extension rules

To turn this high-level description into an algorithm, we must specify exactly how to perform a suffix extension. Let S[j..i] = β be a suffix of S[1..i]. In extension j, when the algorithm finds the end of β in the current tree, it extends β to be sure the suffix βS(i+1) is in the tree. It does this according to one of the following three rules:

Rule 1 In the current tree, path β ends at a leaf. That is, the path from the root labeled β extends to the end of some leaf edge. To update the tree, character S(i+1) is added to the end of the label on that leaf edge.

Rule 2 No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. In this case, a new leaf edge starting from the end of β must be created and labeled with character S(i+1). A new node will also have to be created there if β ends inside an edge. The leaf at the end of the new leaf edge is given the number j.

Rule 3 Some path from the end of string β starts with character S(i+1). In this case the string βS(i+1) is already in the current tree, so (remembering that in an implicit suffix tree the end of a suffix need not be explicitly marked) we do nothing.

As an example, consider the implicit suffix tree for S = axabx shown in Figure 6.3. The first four suffixes end at leaves, but the single character suffix x ends inside an edge. When a sixth character b is added to the string, the first four suffixes get extended by rule 1, the fifth suffix gets extended by rule 2, and the sixth by rule 3.
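The high-level algorithm and the three extension rules translate almost directly into code. The sketch below is a naive O(m^3) rendering in Python (our own node layout and helper names, not the book's pseudocode), with none of the speedups developed next:

```python
# Naive Ukkonen: in phase i+1, every suffix of s[0..i] is extended by one
# of the three rules. Edges are stored as [label, child] lists so that a
# leaf-edge label can be extended in place (rule 1).

class Node:
    def __init__(self):
        self.children = {}  # first character -> [edge label, child node]

def extend(root, beta, c):
    """Apply the appropriate extension rule so that beta + c is in the tree."""
    node, incoming, used, rest = root, None, 0, beta
    while rest:                                  # walk to the end of beta,
        incoming = node.children[rest[0]]        # which is known to exist
        label, child = incoming
        if len(rest) < len(label):               # beta ends inside this edge
            used = len(rest)
            break
        rest = rest[len(label):]
        node = child
    if used:                                     # end of beta is mid-edge
        label, child = incoming
        if label[used] == c:
            return                               # Rule 3: already present
        mid = Node()                             # Rule 2: split, add new leaf
        mid.children[label[used]] = [label[used:], child]
        mid.children[c] = [c, Node()]
        incoming[0], incoming[1] = label[:used], mid
    elif incoming is not None and not node.children:
        incoming[0] += c                         # Rule 1: extend the leaf edge
    elif c not in node.children:
        node.children[c] = [c, Node()]           # Rule 2: new leaf edge
    # else Rule 3: do nothing

def ukkonen_naive(s):
    root = Node()
    for i in range(len(s)):                      # phase i+1 adds s[i]
        for j in range(i + 1):                   # extension j handles s[j..i]
            extend(root, s[j:i], s[i])
    return root

def leaf_strings(node, prefix=""):
    if not node.children:
        return [prefix]
    out = []
    for label, child in node.children.values():
        out += leaf_strings(child, prefix + label)
    return out
```

Each extension re-walks from the root, which is exactly the inefficiency that suffix links and the implementation tricks of the next section remove.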

6.1.3. Implementation and speedup
Using the suffix extension rules given above, once the end of a suffix β of S[1..i] has been found in the current tree, only constant time is needed to execute the extension rules (to ensure that suffix βS(i+1) is in the tree). The key issue in implementing Ukkonen's algorithm then is how to locate the ends of all the i+1 suffixes of S[1..i]. Naively we could find the end of any suffix β in O(|β|) time by walking from the root of the current tree. By that approach, extension j of phase i+1 would take O(i+1-j) time, I_{i+1} could be created from I_i in O(i^2) time, and I_m could be created in O(m^3) time. This algorithm may seem rather foolish since we already know a straightforward algorithm to build a suffix tree in O(m^2) time (and another is discussed in the exercises), but it is easier to describe Ukkonen's O(m) algorithm as a speedup of the O(m^3) method above. We will reduce the above O(m^3) time bound to O(m) time with a few observations and implementation tricks. Each trick by itself looks like a sensible heuristic to accelerate the algorithm, but acting individually these tricks do not necessarily reduce the worst-case

bound. However, taken together, they do achieve a linear worst-case time. The most important element of the acceleration is the use of suffix links.

Suffix links: first implementation speedup

Definition. Let xα denote an arbitrary string, where x denotes a single character and α denotes a (possibly empty) substring. For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

We will sometimes refer to a suffix link from v to s(v) as the pair (v, s(v)). For example, in Figure 6.1 (on page 95) let v be the node with path-label xa and let s(v) be the node whose path-label is the single character a. Then there exists a suffix link from node v to node s(v). In this case, α is just a single character long. As a special case, if α is empty, then the suffix link from an internal node with path-label xα goes to the root node. The root node itself is not considered internal and has no suffix link from it.

Although the definition of suffix links does not imply that every internal node of an implicit suffix tree has a suffix link from it, it will, in fact, have one. We actually establish something stronger in the following lemmas and corollaries.

Following Corollary 6.1.1, all internal nodes in the changing tree will have suffix links from them, except for the most recently added internal node, which will receive its suffix link by the end of the next extension. We now show how suffix links are used to speed up the implementation.

Following a trail of suffix links to build I_{i+1}

Lemma 6.1.1. If a new internal node v with path-label xα is added to the current tree in extension j of some phase i+1, then either the path labeled α already ends at an internal node of the current tree or an internal node at the end of string α will be created (by the extension rules) in extension j+1 in the same phase.

PROOF A new internal node v is created in extension j (of phase i+1) only when extension rule 2 applies. That means that in extension j, the path labeled xα continued with some character other than S(i+1), say c. Thus, in extension j+1, there is a path labeled α in the tree and it certainly has a continuation with character c (although possibly with other characters as well). There are then two cases to consider: Either the path labeled α continues only with character c or it continues with some additional character. When α is continued only by c, extension rule 2 will create a node s(v) at the end of path α. When α is continued with two different characters, then there must already be a node s(v) at the end of path α. The Lemma is proved in either case. □

Corollary 6.1.1. In Ukkonen's algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.

PROOF The proof is inductive and is true for tree I_1 since I_1 contains no internal nodes. Suppose the claim is true through the end of phase i, and consider a single phase i+1. By Lemma 6.1.1, when a new node v is created in extension j, the correct node s(v) ending the suffix link from v will be found or created in extension j+1. No new internal node gets created in the last extension of a phase (the extension handling the single character suffix S(i+1)), so all suffix links from internal nodes created in phase i+1 are known by the end of the phase and tree I_{i+1} has all its suffix links. □

Corollary 6.1.1 is similar to Theorem 6.2.5, which will be discussed during the treatment of Weiner's algorithm, and states an important fact about implicit suffix trees and ultimately about suffix trees. For emphasis, we restate the corollary in slightly different language.

Corollary 6.1.2. In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.

Assuming all suffix links are kept current, extension j ≥ 2 of phase i+1 is implemented as follows:

Single extension algorithm: SEA

Begin
1. Find the first node v at or above the end of S[j-1..i] that either has a suffix link from it or is the root. This requires walking up at most one edge from the end of S[j-1..i] in the current tree. Let γ (possibly empty) denote the string between v and the end of S[j-1..i].
2. If v is not the root, traverse the suffix link from v to node s(v) and then walk down from s(v) following the path for string γ. If v is the root, then follow the path for S[j..i] from the root, as in the naive algorithm.
3. Using the extension rules, ensure that the string S[j..i]S(i+1) is in the tree.
4. If a new internal node w with path-label xα was created in extension j-1 (by extension rule 2), then by Lemma 6.1.1 string α must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).
End

Extension j = 1 of phase i+1 is special: the path for the full string S[1..i] must end at a leaf, since S[1..i] is the longest string represented in the tree. Assuming the algorithm keeps a pointer to the leaf edge for the current full string S[1..i], the first extension of phase i+1 need not do any up or down walking. Moreover, the first extension of phase i+1 always applies suffix extension rule 1.

What has been achieved so far? The use of suffix links is clearly a practical improvement over walking from the root in each extension, as done in the naive algorithm. But does their use improve the worst-case running time? The answer is that, as described, the use of suffix links does not yet improve the time bound. But a simple trick, called the skip/count trick, will reduce the traversal time to something proportional to the number of nodes on the path. This trick will also be central in other algorithms to build and use suffix trees.

Trick number 1: skip/count trick

In Step 2 of extension j the algorithm walks down from node s(v) along a path labeled γ. Recall that there surely must be a γ path from s(v). Directly implemented, this walk along γ takes time proportional to |γ|, the number of characters on that path. But because no two edges out of a node begin with the same character, the walk can select the correct edge out of each node by examining a single character, and can then skip over the whole edge in constant time whenever the edge's length is no more than the number of characters of γ remaining. Implemented this way, the walk down from s(v) takes time proportional to the number of nodes on the path, not to |γ|. It will then follow that the time for all the down walks in a phase is at most O(m).

Definition. Define the node-depth of a node u to be the number of nodes on the path from the root to u.

The key fact is that when a suffix link (v, s(v)) is traversed, the node-depth of s(v) is at least the node-depth of v minus one: the suffix link from any internal ancestor of v goes to an ancestor of s(v), the node-depths of any two distinct ancestors of v must differ, and the only extra ancestor that v can have (without a corresponding ancestor for s(v)) is an internal ancestor whose path-label is a single character long. Hence the node-depth of s(v) is at least one (for the root) plus the number of internal ancestors of v that have path-labels more than one character long.

Theorem 6.1.1. Using suffix links and the skip/count trick, any phase of Ukkonen's algorithm takes O(m) time.

PROOF In each extension, the current node-depth is first reduced by at most two (up-walking one edge and traversing one suffix link), and thereafter the down-walk increases the current node-depth by one at each node skip. Since the maximum node-depth is m and there are at most m extensions in the phase, the total number of node skips, and hence the total down-walking time in the phase, is O(m). □

There are m phases, so this gives an O(m^2) bound for building I_m. Thus suffix links with the skip/count trick do not yet improve on the straightforward O(m^2) algorithm; two more tricks bring the bound down to O(m).

Trick 2: rule 3 is a show stopper

If rule 3 applies in extension j of phase i+1, then S[j..i]S(i+1) is already in the tree, so every shorter suffix S[j'..i]S(i+1), for j' > j, is also already in the tree, and rule 3 will apply in all further extensions of the phase. Therefore the phase can end the first time rule 3 applies; all remaining extensions of the phase are done implicitly.

Trick 3: once a leaf, always a leaf

If at some point a leaf is created and labeled j, then extension rule 1 applies to extension j in every successive phase: the leaf stays a leaf, and only the label on its leaf edge grows. So the label on every leaf edge is written as an interval (p, e), where e is a global index denoting "the current end". In phase i+1 the algorithm sets e to i+1 once; it only does constant work to increment variable e, and this implicitly extends every leaf edge by character S(i+1). All other extensions (before and after the explicit extensions) are done implicitly.

In any phase i, let j_i denote the last explicit extension executed. Since extensions 1 through j_i of phase i all ended with rule 1 or rule 2, the corresponding suffixes end at leaves, and by Trick 3 their extensions in phase i+1 are implicit. Explicit extensions in phase i+1 (using algorithm SEA) are then only required from extension j_i + 1 until the first extension where rule 3 applies (or until extension i+1 is done). Phase i+1 is implemented as follows:

Single phase algorithm: SPA

Begin
1. Increment index e to i+1. (By Trick 3 this correctly implements all implicit extensions 1 through j_i.)
2. Explicitly compute successive extensions (using algorithm SEA) starting at j_i + 1, until reaching the first extension j* where rule 3 applies or until all extensions in this phase are done. (By Trick 2, this correctly implements all the additional implicit extensions j* + 1 through i + 1.)
3. Set j_{i+1} to j* - 1, to prepare for the next phase.
End

Step 3 correctly sets j_{i+1} because the initial sequence of extensions where extension rule 1 or 2 applies must end at the point where rule 3 first applies.

The punch line

The key feature of algorithm SPA is that phase i+2 will begin computing explicit extensions with extension j*, where j* was the last explicit extension computed in phase i+1. Hence two consecutive phases share at most one index (j*) where an explicit extension is executed (see Figure 6.9). Moreover, phase i+1 ends knowing where string S[j*..i+1] ends, so the repeated extension of j* in phase i+2 can execute the suffix extension rule for it without any up-walking, suffix link traversals, or node skipping. That means the first explicit extension in any phase only takes constant time. The index j* never decreases, it is bounded by m, and there are only m phases; the algorithm therefore executes only 2m explicit extensions.

As established earlier, the time for an explicit extension is a constant plus some time proportional to the number of node skips it does during the down-walk in that extension. To bound the total number of node skips done during all the down-walks, we consider (similar to the proof of Theorem 6.1.1) how the current node-depth changes during successive extensions, even extensions in different phases. In each explicit extension the current node-depth is first reduced by at most two (up-walking one edge and traversing one suffix link), and thereafter the down-walk increases the current node-depth by one at each node skip. The current node-depth does not change between the end of one extension and the beginning of the next, and it remains the same between two successive phases. Since the maximum node-depth is m and there are only 2m explicit extensions, it follows (as in the proof of Theorem 6.1.1) that the maximum number of node skips done during all the down-walking (and not just in a single phase) is bounded by O(m). As the algorithm executes implicit extensions, it only does constant work per phase to increment variable e. All work has been accounted for, and it is now easy to prove the main result.

Theorem 6.1.2. Using suffix links and implementation tricks 1, 2, and 3, Ukkonen's algorithm builds implicit suffix trees I_1 through I_m in O(m) total time.

Figure 6.9: Cartoon of a possible execution of Ukkonen's algorithm. Each line represents a phase, and each number represents an explicit extension executed by the algorithm. In this cartoon there are four phases and seventeen explicit extensions. In any two consecutive phases, there is at most one index where the same explicit extension is executed in both phases.

Creating the true suffix tree

The final implicit suffix tree I_m is converted to a true suffix tree in O(m) time, so Ukkonen's algorithm builds a true suffix tree for S, along with all its suffix links, in O(m) time.

6.2. Weiner's linear-time suffix tree algorithm

Definition. Suff_i denotes the suffix S[i..m] of S starting in position i. Suff_1 is the entire string S.

Weiner's algorithm constructs trees encoding Suff_m, Suff_{m-1}, ..., Suff_1, that is, it processes the suffixes in decreasing order of i. We will first implement the method in a straightforward inefficient way.
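The skip/count walk-down can be sketched in a few lines (a toy node layout with hypothetical names, not the book's pseudocode). To follow a string γ known to exist below a node, only one character per edge is examined, and whole edges are skipped in constant time:

```python
# Skip/count trick: walking down a path labeled gamma costs O(1) per node
# visited, not O(1) per character of gamma.

class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child node)

def skip_count_walk(start, gamma):
    """Locate the end of gamma below `start`, assuming the path exists.
    Returns (node, first_char_of_edge, offset) if gamma ends inside an edge,
    or (node, None, 0) if gamma ends exactly at a node."""
    node, g = start, 0                          # g = characters of gamma consumed
    while g < len(gamma):
        label, child = node.children[gamma[g]]  # choose edge by one character
        if len(label) <= len(gamma) - g:        # skip the entire edge
            g += len(label)
            node = child
        else:                                   # gamma ends inside this edge
            return node, gamma[g], len(gamma) - g
    return node, None, 0
```

Note that the loop body compares only edge lengths, never edge contents; the correctness rests on the guarantee, used in Step 2 of SEA, that a γ path surely exists below the starting node.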

7.3. APL3: The substring problem for a database of patterns

The substring problem was introduced in Chapter 5 (page 89). In the most interesting version of this problem, a set of strings, or a database, is first known and fixed. Later, a sequence of strings will be presented, and for each presented string S, the algorithm must find all the strings in the database containing S as a substring. This is the reverse of the exact set matching problem, where the issue is to find which of the fixed patterns are in a substring of the input string.

In the context of databases for genomic DNA data [63, 320], the problem of finding substrings is a real one that cannot be solved by exact set matching. The DNA database contains a collection of previously sequenced DNA strings. When a new DNA string is sequenced, it could be contained in an already sequenced string, and an efficient method to check that is of value. (Of course, the opposite case is also possible, that the new string contains one of the database strings, but that is the case of exact set matching.)

What constitutes a good data structure and lookup algorithm for the substring problem? The two constraints are that the database should be stored in a small amount of space and that each lookup should be fast. A third desired feature is that the preprocessing of the database should be relatively fast.

Suffix trees yield a very attractive solution to this database problem. A generalized suffix tree T for the strings in the database is built in O(m) time and, more importantly, requires only O(m) space, where m, the total length of all the strings in the database, is assumed to be large. Any single string S of length n is found in the database, or declared not to be there, in O(n) time. As usual, this is accomplished by matching the string against a path in the tree starting from the root. The full string S is in the database if and only if the matching path reaches a leaf of T at the point where the last character of S is examined. Moreover, if S is a substring of strings in the database, then the algorithm can find all strings in the database containing S as a substring in O(n + k) time, where k is the number of occurrences of the substring. As usual, this is achieved by traversing the subtree below the end of the matched path for S. If the full string S cannot be matched against a path in T, then S is not in the database, and neither is it contained in any string there. However, the matched path does specify the longest prefix of S that is contained as a substring in the database.

The substring problem is one of the classic applications of suffix trees. The results obtained using a suffix tree are dramatic and not achieved using the Knuth-Morris-Pratt, Boyer-Moore, or even the Aho-Corasick algorithm. Solving the problem with those methods would give time and space bounds that are the reverse of the bounds shown above for suffix trees: O(n) for preprocessing and O(m) for search. The time/space tradeoff remains, but a suffix tree can be used for either of the chosen time/space combinations, whereas no such choice is available for a keyword tree.

7.4. APL4: Longest common substring of two strings

A classic problem in string analysis is to find the longest substring common to two given strings S1 and S2. This is the longest common substring problem (different from the longest common subsequence problem, which will be discussed in Part III). For example, if S1 = superiorcalifornialives and S2 = sealiver, then the longest common substring of S1 and S2 is alive.

An efficient and conceptually simple way to find a longest common substring is to build a generalized suffix tree for S1 and S2. Each leaf of the tree represents either a suffix from one of the two strings or a suffix that occurs in both the strings. Mark each internal node v with a 1 (2) if there is a leaf in the subtree of v representing a suffix from S1 (S2). The path-label of any internal node marked both 1 and 2 is a substring common to both S1 and S2, and the longest such string is the longest common substring. So the algorithm has only to find the node with the greatest string-depth (number of characters on the path to it) that is marked both 1 and 2. Construction of the suffix tree can be done in linear time (proportional to the total length of S1 and S2), and the node markings and calculations of string-depth can be done by standard linear-time tree traversal methods. In summary, a longest common substring can be found in linear time.

Although the longest common substring problem looks trivial now, given our knowledge of suffix trees, it is very interesting to note that in 1970 Don Knuth conjectured that a linear-time algorithm for this problem would be impossible [24, 278]. We will return to this problem later, giving a more space-efficient solution.

One somewhat morbid application of this substring problem is a simplified version of a procedure that is in actual use to aid in identifying the remains of U.S. military personnel. Mitochondrial DNA from live military personnel is collected, and a small interval of each person's DNA is sequenced. The sequenced interval has two key properties: It can be reliably isolated by the polymerase chain reaction, and the DNA string in it is highly variable (i.e., it likely differs between individuals). That interval is therefore used as a "nearly unique" identifier. Later, if needed, DNA is extracted from the remains of personnel who have been killed and the sequenced interval is examined. However, the condition of the remains may not allow complete extraction or sequencing of the desired DNA interval. In that case, one might want to compute the length of the longest substring found both in the newly extracted DNA and in one of the strings in the database. That problem reduces to finding the longest substring in one fixed string that is also in some string in a database of strings. A solution to that problem is an immediate extension of the longest common substring problem and is left to the reader. The longest common substring found would then narrow the possibilities for the identity of the person.
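The marking method above can be sketched with a naive, quadratic-time generalized suffix tree (the node layout and the terminator characters '#' and '$' are our own assumptions, presumed absent from the inputs; a real implementation would use the linear-time construction):

```python
# Longest common substring via a generalized suffix tree: insert every
# suffix of s1+"#" and s2+"$", then find the deepest node whose subtree
# contains leaves from both strings.

class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child node)

def add_suffix(root, suffix):
    node = root
    while suffix:
        c = suffix[0]
        if c not in node.children:
            node.children[c] = (suffix, Node())
            return
        label, child = node.children[c]
        k = 0
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k < len(label):              # split the edge where they diverge
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[c] = (label[:k], mid)
            child = mid
        node, suffix = child, suffix[k:]

def longest_common_substring(s1, s2):
    t1, t2 = s1 + "#", s2 + "$"         # distinct terminators per string
    root = Node()
    for i in range(len(t1)):
        add_suffix(root, t1[i:])
    for i in range(len(t2)):
        add_suffix(root, t2[i:])
    best = [""]

    def walk(node, path):
        if not node.children:           # leaf: terminator tells the source
            return {1 if "#" in path else 2}
        marks = set()
        for label, child in node.children.values():
            marks |= walk(child, path + label)
        if marks == {1, 2} and len(path) > len(best[0]):
            best[0] = path              # deepest node marked by both strings
        return marks

    walk(root, "")
    return best[0]
```

For the example strings above, the sketch returns alive.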

7.5. APL5: Recognizing DNA contamination

Often the various laboratory processes used to isolate, purify, clone, copy, maintain, probe, or sequence a DNA string will cause unwanted DNA to become inserted into the string of interest or mixed together with a collection of strings. Contamination of protein in the laboratory can also be a serious problem. During cloning, contamination is often caused by a fragment (substring) of a vector (DNA string) used to incorporate the desired DNA in a host organism, or the contamination is from the DNA of the host itself (for example bacteria or yeast). Contamination can also come from very small amounts of undesired foreign DNA that gets physically mixed into the desired DNA and then amplified by PCR (the polymerase chain reaction) used to make copies of the desired DNA. Without going into these and other specific ways that contamination occurs, we refer to the general phenomenon as DNA contamination.

Contamination is an extremely serious problem, and there have been embarrassing occurrences of large-scale DNA sequencing efforts where the use of highly contaminated clone libraries resulted in a huge amount of wasted sequencing. Besides the general issue of the accuracy of the sequence finally obtained, contamination can greatly complicate the task of shotgun sequence assembly (discussed in Sections 16.14 and 16.15), in which short strings of sequenced DNA are assembled into long strings by looking for overlapping substrings. A good illustration of the host-contamination issue comes from the study of the nematode C. elegans, one of the key model organisms of molecular biology, whose genome was sequenced using YACs (Yeast Artificial Chromosomes).

On a less happy note, the announcement a few years ago that DNA had been successfully extracted from dinosaur bone is now viewed as premature at best. The "extracted" DNA sequences were shown, through DNA database searching, to be more similar to mammal DNA (particularly human) [21] than to bird and crocodilian DNA, suggesting that much of the DNA in hand was from human contamination and not from dinosaurs. Dr. S. Blair Hedges, one of the critics of the dinosaur claims, stated: "In looking for dinosaur DNA we all sometimes find material that at first looks like dinosaur genes but later turns out to be human contamination. But this one was published." [80] These embarrassments might have been avoided if the sequences had been examined early for signs of likely contaminants, before large-scale analysis was performed or results published. Russell Doolittle [129] writes "... more than a few studies have been curtailed when a preliminary search of the sequence revealed it to be a common contaminant ... As a rule, the experimentalist should search early and often".

Fortunately, the DNA sequences from many of the possible contaminants are known. These include cloning vectors, PCR primers used in purification, the complete genomic sequence of the host organism (yeast, for example), and other DNA sources being worked with in the laboratory. The problem, then, is to determine if the DNA string in hand has any sufficiently long substrings (say length l or more) from the known set of possible contaminants. This motivates the following computational problem:

DNA contamination problem. Given a string S1 (the newly isolated and sequenced string of DNA) and a known string S2 (the combined sources of possible contamination), find all substrings of S2 that occur in S1 and that are longer than some given length l. These substrings are candidates for unwanted pieces of S2 that have contaminated the desired DNA string.

This problem can easily be solved in linear time by extending the approach discussed above for the longest common substring of two strings. Build a generalized suffix tree for S1 and S2. Then mark each internal node that has in its subtree a leaf representing a suffix of S1 and also a leaf representing a suffix of S2. Finally, report all marked nodes that have string-depth of l or greater. If v is such a marked node, then the path-label of v is a suspicious string that may be contaminating the desired DNA string. If there are no marked nodes with string-depth above the threshold l, then one can have greater confidence (but not certainty) that the DNA has not been contaminated by the known contaminants.

More generally, one often has an entire set of known DNA strings that might contaminate a desired DNA string. The approach in this case is to build a generalized suffix tree for the set P of possible contaminants together with S1, and then mark every internal node that has a leaf in its subtree representing a suffix from S1 and a leaf representing a suffix from a pattern in P. All marked nodes of string-depth l or more identify suspicious substrings. Generalized suffix trees can be built in time proportional to the total length of the strings in the tree, and all the other marking and searching tasks described above can be performed in linear time by standard tree traversal methods. Hence suffix trees can be used to solve the contamination problem in linear time. In contrast, it is not clear if the Aho-Corasick algorithm can solve the problem in linear time, since that algorithm is designed to search for occurrences of full patterns from P in S1, rather than for substrings of patterns.

7.6. APL6: Common substrings of more than two strings

One of the most important questions asked about a set of strings is: What substrings are common to a large number of the distinct strings? This is in contrast to the important problem of finding substrings that occur repeatedly in a single string.

In biological strings (DNA, RNA, or protein) the problem of finding substrings common to a large number of distinct strings arises in many different contexts. We will say much more about this when we discuss database searching in Chapter 15 and multiple string comparison in Chapter 14. Most directly, the problem of finding common substrings arises because mutations that occur in DNA after two species diverge will more rapidly change those parts of the DNA or protein that are less functionally important. The parts of the DNA or protein that are critical for the correct functioning of the molecule will be more highly conserved, because mutations that occur in those regions will more likely be lethal. Therefore, finding DNA or protein substrings that occur commonly in a wide range of species helps point to regions or subpatterns that may be critical for the function or structure of the biological string. Less directly, the problem of finding (exactly matching) common substrings in a set of distinct strings arises as a subproblem of many heuristics developed in the biological literature to align a set of strings. That problem, called multiple alignment, will be discussed in some detail in Section 14.10.

The biological applications motivate the following exact matching problem: Given a set of strings, find substrings "common" to a large number of those strings.
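A minimal sketch of the contamination check from APL5 above, under the same assumptions as before (naive quadratic-time generalized suffix tree; the terminators '#' and '$' are our own choice, presumed absent from the inputs). It reports the path-label of every node whose subtree contains suffixes of both strings and whose string-depth is at least l:

```python
# DNA contamination check: all node path-labels common to s1 and s2 with
# string-depth >= l are reported as suspicious substrings.

class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child node)

def add_suffix(root, suffix):
    node = root
    while suffix:
        c = suffix[0]
        if c not in node.children:
            node.children[c] = (suffix, Node())
            return
        label, child = node.children[c]
        k = 0
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k < len(label):              # split the edge where they diverge
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[c] = (label[:k], mid)
            child = mid
        node, suffix = child, suffix[k:]

def contamination(s1, s2, l):
    """Sorted path-labels of all nodes marked by both strings, depth >= l."""
    t1, t2 = s1 + "#", s2 + "$"         # distinct terminators per string
    root = Node()
    for i in range(len(t1)):
        add_suffix(root, t1[i:])
    for i in range(len(t2)):
        add_suffix(root, t2[i:])
    suspicious = []

    def walk(node, path):
        if not node.children:           # leaf: terminator tells the source
            return {1 if "#" in path else 2}
        marks = set()
        for label, child in node.children.values():
            marks |= walk(child, path + label)
        if marks == {1, 2} and len(path) >= l:
            suspicious.append(path)     # marked node at or above threshold
        return marks

    walk(root, "")
    return sorted(suspicious)
```

Reported strings are node path-labels, so nested suspicious substrings (e.g. a string and a longer extension of it) may both appear, exactly as in the marked-node description in the text.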

The word "common" here means "occurring with equality". A more difficult problem is to find "similar" substrings in many given strings, where "similar" allows a small number of differences. Problems of this type will be discussed in Part III.

Formal problem statement and first method

Suppose we have K strings whose lengths sum to n.

Definition For each k between 2 and K, we define l(k) to be the length of the longest substring common to at least k of the strings.

We want to compute a table of K - 1 entries, where entry k gives l(k) and also points to one of the common substrings of that length. For example, consider the set of strings {sandollar, sandlot, handler, grand, pantry}. Then the l(k) values (without pointers to the strings) are:

k    l(k)    one substring
2    4       sand
3    3       and
4    3       and
5    2       an

Surprisingly, the problem can be solved in linear, O(n), time [236]. That time bound is nontrivial and is achieved by a generalization of the longest common substring method for two strings. We will return to this problem in Section 9.7, where an O(n)-time solution will be presented. To prepare for that result, we show here how to solve the problem in O(Kn) time. Even this bound is impressive: it really is amazing that so much information about the contents and substructure of the strings can be extracted in so little time.

First, build a generalized suffix tree T for the K strings. Each of the K strings is given a distinct termination symbol, so that identical suffixes appearing in more than one string end at distinct leaves in the generalized suffix tree. Hence, each leaf in T has only one string identifier: each leaf represents a suffix from one of the K strings and is marked with one of K unique string identifiers, 1 to K, to indicate which string the suffix is from.

Definition For every internal node v of T, define C(v) to be the number of distinct string identifiers that appear at the leaves in the subtree of v.

Computing the C(v) numbers. In linear time, it is easy to compute for each internal node v the number of leaves in v's subtree. But that number may be larger than C(v), since two leaves in the subtree may have the same identifier. That repetition of identifiers is what makes it hard to compute C(v) in O(n) time. Therefore, instead of counting the number of leaves below v, the algorithm uses O(Kn) time to explicitly compute which identifiers are found below any node. For each internal node v, a K-length bit vector is created that has a 1 in bit i if there is a leaf with identifier i in the subtree of v. The vector for v is obtained by ORing the vectors of the children of v; for a node with l children this takes lK time. Then C(v) is just the number of 1-bits in that vector. Since there are O(n) edges in the tree, the total time over the entire tree is O(Kn).

Once the C(v) numbers are known, and the string-depth of every node is known, the desired l(k) values can be easily accumulated with a linear-time traversal of the tree. That traversal builds a vector V where, for each value of k from 2 to K, V(k) holds the string-depth (and location, if desired) of the deepest (string-depth) node v encountered with C(v) = k. (When encountering a node v with C(v) = k, compare the string-depth of v to the current value of V(k), and if v's depth is greater, change V(k) to the depth of v.) At the end of the traversal, V(k) reports the length of the longest string common to exactly k of the strings. To find l(k), simply scan V from largest to smallest index, writing into each position the maximum V(k) value seen: that is, if V(k) is empty or V(k) < V(k + 1), then set V(k) to V(k + 1). The resulting vector holds the desired l(k) values. Clearly, the time needed to build the entire table is O(Kn).

7.7. APL7: Building a smaller directed graph for exact matching

As discussed before, in many applications space is the critical constraint, and any significant reduction in space is of value. In this section we consider how to compress a suffix tree into a directed acyclic graph (DAG) that can be used to solve the exact matching problem (and others) in linear time but that uses less space than the tree. This compression is appropriate if we only want to determine whether a pattern occurs in a larger text, rather than learning all the locations of the pattern occurrence(s).

Consider the suffix tree for the string S = xyxaxaxa shown in Figure 7.1. The edge-labeled subtree below node p is isomorphic to the subtree below node q, except for the leaf numbers: for every path from p there is a path from q with the same path-labels, and vice versa. Therefore, we could merge p into q by redirecting the labeled edge from p's parent to now go into q, deleting the subtree of p, as shown in Figure 7.2. After merging two nodes in the suffix tree, the resulting directed graph is not a tree but a directed acyclic graph.

Figure 7.1: Suffix tree for string xyxaxaxa, without suffix links shown.

These compression techniques can also be used to build a directed acyclic word graph (DAWG), which is the smallest finite-state machine that can recognize suffixes of a given string. Linear-time algorithms for building DAWGs are developed in [70], [71], and [115]. Thus the method presented here to compress suffix trees can either be considered as an application of suffix trees to building DAWGs or simply as a technique to compact suffix trees.
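Returning to the common-substring table of Section 7.6: the whole pipeline can be specified by a brute-force sketch (quadratic space, with a hypothetical helper name, standing in for the O(Kn) suffix-tree method described above):

```python
from collections import defaultdict

def l_table(strings):
    """Naive computation of the table of Section 7.6: l(k) is the length
    of the longest substring common to at least k of the strings.
    Enumerates every substring of every string, records which strings
    contain it, and takes maxima; only meant to illustrate the definition."""
    K = len(strings)
    owners = defaultdict(set)            # substring -> indices of strings containing it
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                owners[s[i:j]].add(idx)
    l = {k: 0 for k in range(2, K + 1)}
    for sub, who in owners.items():
        for k in range(2, len(who) + 1): # common to len(who) strings, hence to any k <= len(who)
            if len(sub) > l[k]:
                l[k] = len(sub)
    return l
```

On the five-string example above, this reproduces the table l(2) = 4 ("sand"), l(3) = l(4) = 3 ("and"), l(5) = 2 ("an").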

DAGs versus DAWGs

The DAG D created by the algorithm above is not a DAWG as defined in [70]. A DAWG represents a finite-state machine and, as such, each edge label is allowed to have only one character. The main theoretical feature of the DAWG for a string S is that it is the finite-state machine with the fewest number of states (nodes) that recognizes suffixes of S. The DAG D for string S has as few (or fewer) nodes and edges than does the associated DAWG for S, and so is as compact as the DAWG even though it may not be a finite-state machine. D can be converted to a finite-state machine by expanding any edge of D whose label has k characters into k edges labeled by one character each. But the resulting finite-state machine would not necessarily have the minimum number of states, and hence it would not necessarily be the DAWG for S. Still, the construction of the DAWG for a string, or of a machine that recognizes substrings of a string, is mainly of theoretical interest.

7.8. APL8: A reverse role for suffix trees, and major space reduction

We have previously shown how suffix trees can be used to solve the exact matching problem with O(m) preprocessing time and space (building a suffix tree of size O(m) for the text T) and O(n + k) search time (where n is the length of the pattern and k is the number of occurrences). We have also seen how suffix trees are used to solve the exact set matching problem with the same bounds. In contrast, the Knuth-Morris-Pratt (or Boyer-Moore) method preprocesses the pattern in O(n) time and space, and then searches in O(m) time; the Aho-Corasick method achieves similar bounds for the set matching problem (n is now the total size of all the patterns). Of course, both kinds of methods need space to represent the strings they run on. Moreover, the situation sometimes arises that the pattern(s) will be given first and held fixed while the text varies. In those cases it is clearly superior to preprocess the pattern(s), and the space bounds for suffix trees often make their use unattractive compared to the other methods. So the question arises of whether we can solve those problems by building a suffix tree of the pattern, not the text, achieving exactly the same time and space bounds (preprocessing versus search time and space) as in the Knuth-Morris-Pratt or Aho-Corasick methods. This is the reverse of the normal use of suffix trees. To explain this, we will develop a result due to Chang and Lawler [94], who solved a somewhat more general problem, called the matching statistics problem.

Matching statistics: duplicating bounds and reducing space

Definition Define ms(i) to be the length of the longest substring of T starting at position i that matches a substring somewhere (in an unknown location) in P.

For example, if T = abcxabcdex and P = wyabcwzqabcdw, then ms(1) = 3 and ms(5) = 4. Clearly, there is an occurrence of P starting at position i of T if and only if ms(i) = |P|. Thus the problem of finding the matching statistics is a generalization of the exact matching problem, and matching statistics provide one bridge between exact matching methods and problems of approximate string matching. Matching statistics are also used in a variety of other applications described in the book. One advertisement we give here is to say that matching statistics are central to a fast approximate matching method designed for rapid database searching; this will be detailed in Section 12.3. Matching statistics can also be used to reduce the size of the suffix tree needed in solutions to problems more complex than exact matching; the first example of space reduction using matching statistics will be given in Section 7.9. This use of matching statistics will probably be more important than their use to duplicate the preprocessing/search bounds of Knuth-Morris-Pratt and Aho-Corasick.

How to compute matching statistics

First build a suffix tree T(P) for the pattern P, keeping the suffix links created during its construction by Ukkonen's algorithm. The matching statistics are then found by matching characters of the text T against T(P); the suffix links are used to accelerate this matching, just the way they accelerate the construction of the tree in Ukkonen's algorithm.

To learn ms(1), match characters of T against T(P) by following the unique matching path of T[1..m]. The length of that matching path is ms(1). Now suppose in general that the algorithm has just followed a matching path to learn ms(i), for i < m. That means that the algorithm has located a point b in T(P) such that the path to that point exactly matches a prefix of T[i..m], but no further matches are possible (possibly because a leaf has been reached). Having learned ms(i), proceed as follows to learn ms(i + 1). If b is an internal node v of T(P), then the algorithm can follow its suffix link to s(v). If b is not an internal node, then the algorithm can back up to the node v just above b. If v is the root, then the search for ms(i + 1) begins at the root. But if v is not the root, then the algorithm follows the suffix link from v to s(v). Let beta denote the string between node v and point b, and let x alpha denote the path-label of v, so that x alpha beta is the longest substring in P that matches a substring starting at position i of T. Then alpha beta is a string in P matching a substring starting at position i + 1 of T, so alpha beta must be a prefix of T[i + 1..m]. Since s(v) has path-label alpha, the path from the root to s(v) matches a prefix of T[i + 1..m], and hence the search for ms(i + 1) can start at node s(v) rather than at the root. Moreover, since alpha beta occurs in P, there must be a path labeled beta out of s(v). Instead of traversing that path by examining every character on it, the algorithm uses the skip/count trick (detailed in Ukkonen's algorithm, Section 6.1.3) to traverse it in time proportional to the number of nodes on the path. When the end of that beta path is reached, the algorithm continues to match single characters from T against characters in the tree until either a leaf is reached or until no further matches are possible. In either case, ms(i + 1) is the string-depth of the ending position.
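As a specification of what the suffix-link algorithm above computes, here is a naive quadratic version of matching statistics, using the example strings from the text (the suffix-tree algorithm produces the same values in O(m) time):

```python
def matching_statistics(T, P):
    """ms(i): for each position of T, the length of the longest substring
    of T starting there that matches a substring somewhere in P.  This
    brute-force version collects every substring of P in a set and
    extends each match one character at a time; it serves only as a
    reference definition, not as the efficient algorithm."""
    subs = {P[i:j] for i in range(len(P)) for j in range(i + 1, len(P) + 1)}
    ms = []
    for i in range(len(T)):
        length = 0
        while i + length < len(T) and T[i:i + length + 1] in subs:
            length += 1
        ms.append(length)
    return ms

# The example from the text: ms(1) = 3 and ms(5) = 4 (1-indexed).
ms = matching_statistics("abcxabcdex", "wyabcwzqabcdw")
```

Note that an occurrence of P starts at position i of T exactly when ms(i) equals the length of P.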

There is one special case that can arise in computing ms(i + 1): if ms(i) = 1 or ms(i) = 0 (so that the algorithm is at the root), then the search for ms(i + 1) begins at the root. Note also that the character comparisons done after reaching the end of the beta path begin either with the same character in T that ended the search for ms(i) or with the next character in T, depending on whether that search ended with a mismatch or at a leaf.

Correctness and time analysis for matching statistics

The proof of correctness of the method is immediate, since it merely simulates the naive method for finding each ms(i). Now consider the time required by the algorithm. The analysis is very similar to that done for Ukkonen's algorithm.

Theorem 7.8.1. Using only a suffix tree for P and a copy of T, all the matching statistics ms(i), for i from 1 to m, can be found in O(m) time.

PROOF The search for any ms(i + 1) begins by backing up at most one edge from position b to a node v and traversing one suffix link to node s(v). From s(v) a beta path is traversed in time proportional to the number of nodes on it, and then a certain number of additional character comparisons, the "after-beta" comparisons, are done. The back-up and suffix link traversal decrease the current depth by only a constant amount, while the current depth is increased at each step of any beta traversal; hence the total work in all the beta traversals is O(m). It only remains to consider the total time used in all the character comparisons done in the after-beta traversals. The key is that the after-beta comparisons needed to compute ms(i + 1) begin with the character in T that ended the computation for ms(i) or with the next character in T. Hence the after-beta comparisons performed when computing ms(i) and ms(i + 1) share at most one character in common. It follows that at most 2m comparisons in total are performed during all the after-beta comparisons. That takes care of all the work done in finding all the matching statistics, and the theorem is proved. □

A small but important extension

The number ms(i) indicates the length of the longest substring starting at position i of T that matches a substring somewhere in P, but it does not indicate the location of any such match in P. For some applications (such as those in Section 9.1.2) we must also know, for each i, the location of at least one such matching substring. We next modify the matching statistics algorithm so that it provides that information, as a pointer p(i) for each position i. First do a depth-first traversal of T(P), marking each internal node with the leaf number of one of the leaves in its subtree. Then, when using T(P) to find each ms(i): if the search stops at a node u, the desired p(i) is the suffix number written at u; otherwise (when the search stops on an edge (u, v)), p(i) is the suffix number written at node v.

Back to STSs

Recall the discussion of STSs in Section 3.5.1. There it was mentioned that, because of errors, exact matching may not be an appropriate way to find STSs in new sequences. But since the number of sequencing errors is generally small, we can expect long regions of agreement between a new DNA sequence and any STS it (ideally) contains. Those regions of agreement should allow the correct identification of the STSs it contains. Using a (precomputed) generalized suffix tree for the STSs (which play the role of P), compute matching statistics for the new DNA sequence (which is T) against the set of STSs. The number ms(i) gives the length of the longest region of agreement starting at position i of the new sequence, and generally, the pointer p(i) will point to the appropriate STS in the suffix tree. The time for the computation is just proportional to the length of the new sequence. We leave it to the reader to flesh out the details.

7.9. APL9: Space-efficient longest common substring algorithm

In Section 7.4, the longest common substring of two strings was found by building a generalized suffix tree for both strings. Matching statistics allow the same problem to be solved while building a suffix tree for only the smaller of the two strings: compute the matching statistics of the larger string against that tree, and the largest ms(i) value gives the length (and a location) of the longest common substring. This is the first example of the space reduction promised above.

7.10. APL10: All-pairs suffix-prefix matching

Here we present a more complex use of suffix trees that is interesting in its own right and that will be central in the linear-time superstring approximation algorithm to be discussed in Section 16.17.

Definition Given two strings Si and Sj, any suffix of Si that matches a prefix of Sj is called a suffix-prefix match of Si, Sj.

Definition Given a collection of strings S = S1, S2, ..., Sk of total length m, the all-pairs suffix-prefix problem is the problem of finding, for each ordered pair Si, Sj in S, the longest suffix-prefix match of (Si, Sj).
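The problem just defined can be sketched directly (the strings below are hypothetical; the single-pair routine is the standard KMP failure-function method, and the all-pairs wrapper is the naive baseline that the suffix-tree algorithm of this section improves upon):

```python
def longest_suffix_prefix(si, sj):
    """Length of the longest suffix of si that is a prefix of sj, via the
    KMP failure function of sj + '#' + si, where '#' is a separator
    occurring in neither string.  Linear in |si| + |sj|."""
    s = sj + "#" + si
    f = [0] * len(s)              # f[p]: longest proper prefix of s[:p+1]
    for p in range(1, len(s)):    # that is also a suffix of it
        k = f[p - 1]
        while k > 0 and s[p] != s[k]:
            k = f[k - 1]
        if s[p] == s[k]:
            k += 1
        f[p] = k
    return f[-1]                  # border of the whole string = overlap length

def all_pairs_suffix_prefix(strings):
    """Naive all-pairs table: apply the linear-time single-pair method to
    every ordered pair, giving an O(km) total bound for k strings of
    total length m."""
    return {(i, j): longest_suffix_prefix(strings[i], strings[j])
            for i in range(len(strings))
            for j in range(len(strings)) if i != j}

# The suffix "tle" of "cattle" is a prefix of "tlesson", so entry (0, 1) is 3.
table = all_pairs_suffix_prefix(["cattle", "tlesson", "ssonata"])
```

The suffix-tree solution described next computes the same table in O(m + k^2) time rather than O(km).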

Motivation for the problem

The main motivation for the all-pairs suffix-prefix problem comes from its use in implementing fast approximation algorithms for the shortest superstring problem (to be discussed in Section 16.17). The superstring problem is itself motivated by sequencing and mapping problems in DNA that will be discussed in Chapter 16. Another motivation for the shortest superstring problem, and hence for the all-pairs suffix-prefix problem, arises in data compression; this connection will be discussed in the exercises for Chapter 16.

A different, direct application of the all-pairs suffix-prefix problem is suggested by computations reported in [190]. In that research, DNA from the organism C. elegans (which is a worm) was analyzed for the presence of highly conserved substrings called ancient conserved regions (ACRs). One of the main objectives of the research was to estimate the number of ACRs that occur in the genes of C. elegans. Their approach was to extrapolate from the number of ACRs they observed in a set of around 1,400 ESTs (see Section 3.5.1) from the organism.

To describe the role of suffix-prefix matching in this extrapolation, we need to remember some facts about ESTs. For the purposes here, we can think of an EST as a sequenced DNA substring of length around 300 nucleotides, originating in a gene of much greater length. If EST alpha originates in gene beta, then the actual location of substring alpha in beta is essentially random, and many different ESTs can be collected from the same gene beta. However, when an EST is obtained, one does not learn the identity of the originating gene, and it is not easy to tell if two ESTs originate from the same gene. Moreover, in the common method used to collect ESTs, ESTs will more frequently be collected from genes that are more highly expressed (transcribed) than from genes that are less frequently expressed. We can thus consider ESTs as a biased sampling of the underlying gene sequences.

Now we return to the extrapolation problem. The goal is to use the ACR data observed in the ESTs to estimate the number of ACRs in the entire set of genes. A simple extrapolation would be justified if the ESTs were essentially random samples selected uniformly from the entire set of C. elegans genes. However, genes are not uniformly sampled, so a simple extrapolation would be wrong if the prevalence of ACRs is systematically different in ESTs from frequently or infrequently expressed genes. In that research, it was indeed found that ACRs occur more commonly in ESTs from frequently expressed genes (more precisely, from ESTs that overlap other ESTs). How can that prevalence be determined? When an EST is obtained, one doesn't know the gene it comes from, so how can ESTs from frequently sampled genes be distinguished from the others? The approach taken in [190] is to compute the "overlap" between each pair of ESTs. An EST that has no substantial overlap with another EST was considered in the study to be from an infrequently expressed (and sampled) gene, whereas an EST that has substantial overlap with one or more of the other ESTs is considered to be from a frequently expressed gene. Since all the ESTs are of comparable length, substring containment (which is possible only among strings of unequal length) can essentially be ignored, and the heart of that computation consists of solving the all-pairs suffix-prefix problem on the set of ESTs. (Because there may be some sequencing errors, one should also solve the all-pairs longest common substring problem.) After categorizing the ESTs in this way, the authors of [190] conclude that ACRs occur more commonly in ESTs from frequently expressed genes.

Solving the all-pairs suffix-prefix problem in linear time

For a single pair of strings, the preprocessing discussed in Section 2.4 will find the longest suffix-prefix match in time linear in the length of the two strings. However, applying the preprocessing to each of the k^2 pairs of strings separately gives a total bound of O(km) time. Using suffix trees it is possible to reduce the computation time to O(m + k^2), assuming (as usual) that the alphabet is fixed.

The main data structure used to solve the all-pairs suffix-prefix problem is the generalized suffix tree T(S) for the k strings in set S. As T(S) is constructed, the algorithm also builds a list L(v) for each internal node v. List L(v) contains the index i if and only if v is incident with a terminal edge whose leaf is labeled by a suffix of string Si. That is, L(v) holds index i if and only if the path-label of v is a complete suffix of string Si. For example, consider the generalized suffix tree shown in Figure 6.11 (page 117). The node with path-label ba has an L list consisting of the single index 2, the node with path-label a has a list consisting of indices 1 and 2, and the node with path-label xa has a list consisting of index 1. All the other lists in this example are empty. Clearly, the lists can be constructed in linear time during (or after) the construction of T(S).

Now consider a fixed string Sj, and focus on the path from the root of T(S) to the leaf j representing the entire string Sj. The key observation is the following: if v is a node on this path and i is in L(v), then the path-label of v is a suffix of Si and a prefix of Sj. So for each index i, the deepest node v on the path to leaf j such that i is in L(v) identifies the longest match between a suffix of Si and a prefix of Sj; the path-label of v is the longest suffix-prefix match of (Si, Sj). It is easy to see that by one traversal from the root to leaf j we can find the deepest such nodes for all 1 <= i <= k (i != j).

Following the above observation, the algorithm efficiently collects the needed suffix-prefix matches by traversing T(S) in a depth-first manner. As it does, it maintains k stacks, one for each string. During the depth-first traversal, when a node v is reached in a forward edge traversal, push v onto the ith stack for each i in L(v). When a leaf j (representing the entire string Sj) is reached, scan the k stacks and record for each index i the current top of the ith stack. It is not difficult to see that the top of stack i contains the node v that defines the suffix-prefix match of (Si, Sj); if the ith stack is empty, then there is no overlap between a suffix of string Si and a prefix of string Sj. When the depth-first traversal backs up past a node v, we pop the top of any stack whose index is in L(v).

All the longest suffix-prefix matches are therefore found in O(m + k^2) time: the pushes and pops take O(m) time in total, and each of the k leaf visits spends constant time per answer. Since m is the size of the input and k^2 is the size of the output, the algorithm is time optimal. □

Extensions

We note two extensions. Let k* <= k^2 be the number of ordered pairs of strings that have a nonzero length suffix-prefix match. By using double links, we can maintain a linked list of the nonempty stacks.

Then, when a leaf of the tree is reached during the traversal, only the stacks on this list need be examined, so all nonzero length suffix-prefix matches can be found in O(m + k*) time. Note that the position of the stacks in the linked list will vary, since a stack that goes from empty to nonempty must be linked at one of the ends of the list; hence we must also keep (in the stack) the name of the string associated with that stack. For the second extension, suppose we want to collect for every pair not just the longest suffix-prefix match, but all suffix-prefix matches, no matter how long they are. We modify the above solution so that when the tops of the stacks are scanned, the entire contents of each scanned stack is read out. If the output size is k*, then the complexity for this solution is O(m + k*).

7.11. Introduction to repetitive structures in molecular strings

Several sections of this book, as well as several exercises, are devoted to discussing efficient algorithms for finding various types of repetitive structures in strings. Some aspects of one type of repetitive structure have already been discussed in the exercises of Chapter 1, and more will be discussed later in the book. The intent here is not to write a dissertation on repetitive DNA or protein, but to motivate the algorithmic techniques we develop; our principal interest is in important repetitive structures seen in biological strings (DNA, RNA, and protein). For emphasis, we briefly introduce some of those repetitive structures.

Repetitive structures in biological strings

One of the most striking features of DNA (and to a lesser degree, RNA and protein) is the extent to which repeated substrings occur in the genome. This is particularly true of eukaryotes (higher-order organisms whose DNA is enclosed in a cell nucleus). For example, most of the human Y chromosome consists of repeated substrings, and overall, families of reiterated sequences account for about one third of the human genome [128, 317]. In an analysis of 3.6 million bases of DNA from C. elegans, over 7,000 families of repetitive sequences were identified [5]. In contrast, prokaryotes (organisms such as bacteria whose DNA is not enclosed in a nucleus) have in total little repetitive DNA, although they still possess certain highly structured small-scale repeats.

In addition to its sheer quantity, repetitive DNA is striking for the variety of repeated structures it contains, for the various proposed mechanisms explaining the origin and maintenance of repeats, and for the biological functions that some of the repeats may play (see [394] for one aspect of gene duplication). There is a vast literature on repetitive structures in DNA; in many texts on genetics or molecular biology (for example, [315], [317], and [469]) one can find extensive discussions of repetitive strings and their hypothesized functional and evolutionary role. For an introduction to repetitive elements in human DNA, see [253] and [255].

In the following discussion of repetitive structures in DNA and protein, we divide the structures into three types: local, small-scale repeated strings whose function or origin is at least partially understood; simple repeats, both local and interspersed, whose function is less clear; and more complex interspersed repeated strings whose function is even more in doubt.

Definition A palindrome is a string that reads the same backwards as forwards.

For example, the string xyaayx is a palindrome under this definition. The Random House dictionary definition of "palindrome" is: a word, sentence or verse reading the same backwards as forwards [441]. Ignoring spaces, the sentence "was it a cat i saw" is another example. In the vernacular of molecular biology, however, "palindrome" refers to a complemented palindrome, illustrated in Figure 7.3.

Figure 7.3: A palindrome in the vernacular of molecular biology.
5' TCGACCGGTCGA 3'
3' AGCTGGCCAGCT 5'
Each strand is a complemented palindrome according to the definitions used in this book. The double-stranded string is the same after reflection around both the horizontal and vertical midpoints.

Small-scale local repeats whose function or origin is partially understood include: complemented palindromes in both DNA and RNA, which act to regulate DNA transcription (the two parts of the complemented palindrome fold and pair to form a "hairpin loop"); nested complemented palindromes in tRNA (transfer RNA) that allow the molecule to fold up into a cloverleaf structure by complementary base pairing; tandem arrays of repeated RNA that flank retroviruses (viruses whose primary genetic material is RNA) and facilitate the incorporation of viral DNA (produced from the RNA sequence by reverse transcription) into the host's DNA; single copy inverted repeats that flank transposable (movable) DNA in various organisms and that facilitate that movement or the inversion of the DNA orientation; short repeated substrings (both palindromic and nonpalindromic) in DNA that may help the chromosome fold into a more compact structure; clustered genes that code for important proteins (such as histone) that regulate chromosome structure and must be made in large number; families of genes that code for similar proteins (hemoglobins and myoglobins, for example); similar genes that probably arose through duplication and subsequent mutation (including pseudogenes that have mutated to the point that they no longer function); copies of genes that code for important RNAs (rRNAs and tRNAs) that must be produced in large number; and repeated substrings at the ends of viral DNA (in a linear state) that allow the concatenation of many copies of the viral DNA (a molecule of this type is called a concatamer).
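The notion of a complemented palindrome is easy to state operationally: a DNA string is a complemented palindrome exactly when it equals its own reverse complement, which is what lets its two halves pair into a hairpin. A minimal sketch, using the standard Watson-Crick complement pairing:

```python
# A <-> T and C <-> G; a complemented palindrome equals the reverse
# complement of itself.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_complemented_palindrome(s):
    """True if the DNA string s equals its own reverse complement."""
    return s == s.translate(COMPLEMENT)[::-1]

print(is_complemented_palindrome("TCGACCGGTCGA"))  # the strand of Figure 7.3
```

The same test recognizes restriction sites such as GAATTC, discussed later in this section.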

Other partially understood local repeats include common exons of eukaryotic DNA that may be basic building blocks of many genes, and common functional or structural subunits in protein (motifs and domains).

Simple repeats that are less well understood often arise as tandem arrays (consecutive repeated strings, also called "direct repeats") of repeated DNA. For example, the string TTAGGG appears at the ends of every human chromosome in arrays containing one to two thousand copies [332]. Highly dispersed tandem arrays of length-two strings are common, and other long tandem arrays consisting of short strings are very common and are widely distributed in the genomes of mammals. These repeats are called satellite DNA (further subdivided into micro- and mini-satellite DNA); their function and origin is less clear, but their existence has been heavily exploited in genetic mapping and forensics. Some tandem arrays may originate and continue to grow by a postulated mechanism of unequal crossing over in meiosis, although there is serious opposition to that theory. With unequal crossing over in meiosis, the likelihood that more copies will be added in a single meiosis increases as the number of existing copies increases.

A number of genetic diseases (Fragile X syndrome, Huntington's disease, Kennedy's disease, myotonic dystrophy, ataxia) are now understood to be caused by increasing numbers of tandem DNA repeats of a string three bases long. Generally, the number of triples in the repeat increases with successive generations, which appears to explain why the disease increases in severity with each generation. These triplet repeats somehow interfere with the proper production of particular proteins. In addition to tri-nucleotide repeats, other mini-satellite repeats also play a role in human genetic diseases [286].

Repetitive DNA that is interspersed throughout mammalian genomes, and whose function and origin is less clear, is generally divided into SINEs (short interspersed nuclear sequences) and LINEs (long interspersed nuclear sequences). The classic example of a SINE is the Alu family. The Alu repeats occur about 300,000 times in the human genome and account for as much as 5% of the DNA of human and other mammalian genomes. Alu repeats are substrings of length around 300 nucleotides and occur as nearly (but not exactly) identical copies widely dispersed throughout the genome. Moreover, the interior of an Alu string itself consists of repeated substrings of length around 40, and the Alu sequence is often flanked on either side by tandem repeats of length 7-10. Those right and left flanking sequences are usually complemented palindromic copies of each other. So the Alu repeats wonderfully illustrate various kinds of phenomena that occur in repetitive DNA. For an introduction to Alu repeats see [254].

One of the most fascinating discoveries in molecular genetics is a phenomenon called genomic (or gametic) imprinting, whereby a particular allele of a gene is expressed only when it is inherited from one specific parent [48, 391]. Sometimes the required parent is the mother and sometimes the father. The allele will be unexpressed, or expressed differently, if inherited from the "incorrect" parent. This is in contradiction to the classic Mendelian rule of equivalence: that chromosomes (other than the Y chromosome) have no memory of the parent they originated from, and that the same allele inherited from either parent will have the same effect on the child. In mice and humans, sixteen imprinted gene alleles have been found to date [48]. Five of these require inheritance from the mother, and the rest from the father. The DNA sequences of these sixteen imprinted genes all share the common feature that they contain tandem repeats; a detail not contained in that report is that the direct (tandem) repeats in the genes studied [48] have a total length of about 1,500 bases.

Restriction enzyme cutting sites illustrate another type of small-scale, structured, repeating substring of great importance to molecular biology. A restriction enzyme is an enzyme that recognizes a specific substring in the DNA of both prokaryotes and eukaryotes and cuts (or cleaves) the DNA every place where that pattern occurs (exactly where it cuts inside the pattern varies with the pattern). There are hundreds of known restriction enzymes, and their use has been absolutely critical in almost all aspects of modern molecular biology and recombinant DNA technology. For example, the surprising discovery that eukaryotic DNA contains introns (DNA substrings that interrupt the DNA of protein coding regions), for which Nobel prizes were awarded in 1993, was closely coupled with the discovery and use of restriction enzymes in the late 1970s.

Restriction enzyme cutting sites are interesting examples of repeats because they tend to be complemented palindromic substrings. For example, the restriction enzyme EcoRI recognizes the complemented palindrome GAATTC and cuts between the G and the adjoining A (the substring TTC when reversed and complemented is GAA). Other restriction enzymes recognize separated (or interrupted) complemented palindromes. For example, restriction enzyme BglI recognizes GCCNNNNNGGC, where N stands for any nucleotide; the enzyme cuts between the last two Ns. The complemented palindromic structure has been postulated to allow the two halves of the complemented palindrome (separated or not) to fold and form complementary pairs; this folding then apparently facilitates either recognition or cutting by the enzyme. Because of the palindromic structure of restriction enzyme cutting sites, people have scanned DNA databases looking for common repeats of this form in order to find additional candidates for unknown restriction enzyme cutting sites.

The existence of highly repetitive DNA, such as Alus, makes certain kinds of large-scale DNA sequencing more difficult (see Sections 16.11 and 16.16), but their existence can also facilitate certain cloning, mapping, and searching efforts. For example, one general approach to low-resolution physical mapping (finding on a true physical scale where features of interest are located in the genome) or to finding genes causing diseases involves inserting pieces of human DNA that may contain a feature of interest into the hamster genome. Each resulting hybrid hamster cell incorporates different parts of the human DNA, and these hybrid cells can be tested to identify a specific cell containing the human feature of interest. This technique is called somatic cell hybridization. Having found such a cell, one then has to identify the parts of the hamster's hybrid genome that are human. But what is a distinguishing feature between human and hamster DNA? One approach exploits the Alu sequences. Alu sequences specific to human DNA are so common in the human genome that most fragments of human DNA longer than 20,000 bases will contain an Alu sequence [317]. Accordingly, the fragments of human DNA in the hybrid can be identified by probing the hybrid for fragments of Alu. The same idea is used to isolate human oncogenes (modified growth-promoting genes that facilitate certain cancers) from human tumors. Fragments of human DNA from the tumor are first transferred to mouse cells. Cells that receive the fragment of human DNA containing the oncogene become transformed and replicate faster than cells that do not. This isolates the human DNA fragment containing the oncogene from the rest of the human DNA, but then the human DNA has to be identified within the mouse cells, and again the Alu sequences provide a human-specific signature to probe for.

7.12. APL11: Finding all maximal repetitive structures in linear time

Before developing algorithms for finding repetitive structures, we must carefully define those structures. The key problem is to define repetitive structures in a way that does not generate overwhelming output and yet captures all the meaningful phenomena in a clear way. A poor definition may lead to an avalanche of output. For example, if a string consists of n copies of the same character, an algorithm searching for all pairs of identical substrings (an initially reasonable definition of a repetitive structure) would output Theta(n^4) pairs, an undesirable result. Other poor definitions may not capture the structures of interest, or they may make reasoning about those structures difficult. Poor definitions are particularly confusing when dealing with the set of all repeats of a particular type. In this section, we address the issue through various notions of maximality.

A recurring objection is that the first repetitive string problems we consider concern exact repeats (although with complementation and inversion allowed), whereas most cases of repetitive DNA involve nearly identical copies. Techniques that handle more liberal errors will be considered later in the book; some techniques for handling inexact palindromes (complemented or not) and inexact repeats will be considered in Sections 9.5 and 9.6. Another objection is that simple techniques suffice for small-length repeats. For example, if one seeks repeating DNA of length ten, it makes sense to first build a table of all the 4^10 possible strings and then scan the target DNA with a length-ten template, hashing substring locations into the precomputed table. Despite these objections, and although admittedly not every repetitive string problem we will discuss is perfectly motivated by a biological problem or phenomenon known today, the fit of the computational problems we will discuss to biological phenomena is good enough to motivate sophisticated techniques for handling exact or nearly exact repetitions. Those techniques pass the "plausibility" test in that they, or the ideas that underlie them, may be of future use in computational biology.

Algorithmic problems on repeated structures. We consider specific problems concerning repeated structures in strings in several sections of the book and in exercises in other chapters. Other ways of defining and studying repetitive structures are addressed in Exercises 56, 57, and 58 in this chapter. We now consider problems concerning exactly repeated substrings in a single string.

Definition. A maximal pair in a string S is a pair of identical substrings α and β in S such that the character to the immediate left (right) of α is different from the character to the immediate left (right) of β. That is, extending α and β in either direction would destroy the equality of the two strings. (As a convention, an occurrence that touches an end of S is considered to be bordered there by a unique character appearing nowhere else, so equality always fails on that side.) A maximal pair is represented by the triple (p1, p2, n'), where p1 and p2 give the starting positions of the two substrings and n' gives their length. For a string S, R(S) denotes the set of all triples representing maximal pairs in S.

Definition. A maximal repeat α is a substring of S that occurs in some maximal pair of S; R'(S) denotes the set of maximal repeats of S.

Note that no matter how many times a string participates in a maximal pair in S, it is represented only once in R'(S). Hence |R'(S)| is less than or equal to |R(S)| and is generally much smaller. For example, in S = aabxayaab, substring a is a maximal repeat, whereas ab is not, since every occurrence of ab is preceded by the same character a. Maximal repeats may also overlap or contain one another; a single string can have both abc and abcy among its maximal repeats, for instance.
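Because these definitions are easy to misapply, a small brute-force sketch may help. The function below is my own illustration of the definition (the name and the sentinel convention are assumptions, and it runs in polynomial time, not the linear time achieved by the suffix-tree algorithm developed next): a substring is a maximal repeat exactly when its occurrence list shows at least two distinct left characters and at least two distinct right characters, with the string ends acting as unique sentinels.

```python
def maximal_repeats(s):
    """Return the set R'(S) of maximal repeats of s, by brute force.

    A substring is a maximal repeat iff it occurs in some maximal pair,
    which holds iff its occurrences have at least two distinct left
    characters and at least two distinct right characters (string ends
    act as unique sentinel characters).  Illustrative code only.
    """
    n = len(s)
    occ = {}                          # substring -> list of start positions
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ.setdefault(s[i:j], []).append(i)
    result = set()
    for sub, starts in occ.items():
        if len(starts) < 2:
            continue
        lefts = {s[i - 1] if i > 0 else '^' for i in starts}
        rights = {s[i + len(sub)] if i + len(sub) < n else '$' for i in starts}
        if len(lefts) >= 2 and len(rights) >= 2:
            result.add(sub)
    return result
```

On S = aabxayaab this reports exactly {a, aab}, matching the discussion above; on a string of n copies of one character it reports the n - 1 proper prefixes, consistent with the bound of Theorem 7.12.1 below.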

Maximal pairs, maximal repeats, and supermaximal repeats (defined below) are only three possible ways to define exact repetitive structures of interest; other models of exact repeats are given in the exercises. Certain kinds of repeats are elegantly represented in graphical form in a device called a landscape [104]; an efficient program to construct the landscape, based essentially on suffix trees, is also described in that paper. Problems related to palindromes and tandem repeats are considered in several sections throughout the book, and inexact repeats will be considered in Sections 9.5 and 9.6. In the next sections we detail how to efficiently find all maximal pairs, maximal repeats, and supermaximal repeats.

7.12.1. A linear-time algorithm to find all maximal repeats

Let T denote the suffix tree for S. The maximal repeats turn out to be few in number and compactly representable:

Theorem 7.12.1. There can be at most n maximal repeats in any string of length n.

PROOF. Since T has n leaves, and each internal node other than the root must have at least two children, T can have at most n internal nodes. Theorem 7.12.2 below then implies the theorem. □

Theorem 7.12.1 would be a trivial fact if at most one maximal repeat could begin at any position, but that is not so. In S = aabxayaab, substring a is a maximal repeat but so is aab, and both occur starting at position one; some occurrences of a are contained in occurrences of aab, which is a superstring of a, although not every occurrence of a is contained in that superstring.

For the needed definitions, consider the (fictitious) character to the left of position one of S to be a unique character appearing nowhere else in S.

Definition. For each position i in S, character S(i - 1) is the left character of i. The left character of a leaf of T is the left character of the suffix position represented by that leaf.

Definition. A node v of T is called left diverse if at least two leaves in v's subtree have different left characters. By definition, a leaf cannot be left diverse.

Lemma 7.12.1. Being left diverse is a property that propagates upward: if a node v is left diverse, so are all of its ancestors in the tree.

Theorem 7.12.2. The string α labeling the path to a node v of T is a maximal repeat if and only if v is left diverse.

PROOF. Suppose first that v is left diverse. That means there are substrings xα and yα in S, where x and y represent different characters. Let the first substring be followed by character p. If the second substring is followed by any character but p, then α participates in a maximal pair and the theorem is proved. So suppose that the two occurrences are xαp and yαp. But since v is a (branching) node, there must also be a substring αq in S for some character q that is different from p. Whatever character precedes this occurrence of αq, it cannot equal both x and y: if this occurrence of αq is preceded by character x then it participates in a maximal pair with string yαp, and if it is preceded by y then it participates in a maximal pair with xαp. Either way, α must be part of a maximal pair and hence α is a maximal repeat. Conversely, if α is a maximal repeat then it participates in a maximal pair, and there must be occurrences of α that have distinct left characters; hence v must be left diverse. □

By Lemma 7.12.1, the left diverse nodes form a subtree of T hanging from the root. Call a left diverse node a frontier node if none of its children are left diverse. The maximal repeats are then compactly represented by the subtree of T from the root down to the frontier nodes: the path label of every nonroot node in that subtree is a maximal repeat. (This kind of tree is sometimes referred to as a compact trie, but we will not use that terminology.)

Finding left diverse nodes in linear time. When the suffix tree is built, record at each leaf for suffix i its left character S(i - 1). Then process the nodes of T bottom up, marking each node either as left diverse or with the single character that is the left character of every leaf in its subtree. A leaf is marked with its recorded left character. An internal node v is marked left diverse if any of its children are marked left diverse, or if two of its children are marked with different characters; otherwise all leaves below v share one left character x, and v is marked with x. The work at each node is proportional to its number of children, so the entire marking, and hence the location of all maximal repeats, takes O(n) time.
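The definition of a maximal pair can likewise be checked by brute force. The sketch below is my own illustrative code (cubic time, not the O(n + k) suffix-tree method of Section 7.12.3); it reports each maximal pair as a 0-based triple (p1, p2, length), with the string ends acting as automatically mismatching sentinels.

```python
def maximal_pairs(s):
    """Enumerate all maximal pairs (p1, p2, length) of s by brute force.

    A pair of identical substrings at p1 < p2 is maximal when the
    characters immediately left of the two copies differ and the
    characters immediately right of them differ; a copy touching a
    string end is unextendable on that side.  Illustrative code only.
    """
    n = len(s)

    def left(i):          # character before position i, '^' off the end
        return s[i - 1] if i > 0 else '^'

    def right(i, length):  # character after the copy at i, '$' off the end
        return s[i + length] if i + length < n else '$'

    pairs = []
    for length in range(1, n):
        for p1 in range(n - length + 1):
            for p2 in range(p1 + 1, n - length + 1):
                if (s[p1:p1 + length] == s[p2:p2 + length]
                        and left(p1) != left(p2)
                        and right(p1, length) != right(p2, length)):
                    pairs.append((p1, p2, length))
    return pairs
```

For S = aabxayaab this yields nine maximal pairs in all: eight pairs of occurrences of a, plus the single pair (0, 6, 3) for aab.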

Theorem 7.12.3. All the maximal repeats in S can be found in O(n) time, and a tree representation for them can be constructed from suffix tree T in O(n) time as well.

7.12.2. Finding supermaximal repeats in linear time

Recall that a supermaximal repeat is a maximal repeat that is not a substring of any other maximal repeat. We establish here efficient criteria to find all the supermaximal repeats in a string S. To do this, we solve the more general problem of finding near-supermaximal repeats.

Definition. A substring α of S is a near-supermaximal repeat if α is a maximal repeat in S that occurs at least once in a location where it is not contained in another maximal repeat. Such an occurrence of α is said to witness the near-supermaximality of α.

For example, in the string aabxayaabxab, substring a is a maximal repeat but not a supermaximal or a near-supermaximal repeat, whereas in aabxayaab, substring a is again not supermaximal, but it is near-supermaximal: the occurrence of a at position five is not contained in any other maximal repeat, and that occurrence witnesses the fact.

With this terminology, a supermaximal repeat α is a maximal repeat in which every occurrence of α is a witness to its near-supermaximality. Note that it is not true that the set of near-supermaximal repeats is simply the set of maximal repeats that are not supermaximal repeats. For finer detail, we can define the degree of near-supermaximality of α as the fraction of occurrences of α that witness its near-supermaximality; the degree of each near-supermaximal repeat can also be computed in linear time.

The suffix tree T for S will be used to locate the near-supermaximal and the supermaximal repeats. Let v be a node corresponding to a maximal repeat α, let w (possibly a leaf) be one of v's children, and let y be the first character on the edge from v to w. The leaves in the subtree of T rooted at w identify the locations of the occurrences of α that are followed by y. If w is an internal node, then the substring αy occurs at least twice in S, so each of those occurrences of α is contained in a maximal repeat; thus no occurrence of α in the subtree of w can witness the near-supermaximality of α unless w is a leaf. So suppose w is a leaf; then w specifies a single particular occurrence of αy, say with α starting at position i, and αy occurs only once in S. Let x = S(i - 1) be the left character of that occurrence. If there is another occurrence of α with preceding character x, then xα occurs twice and so is either a maximal repeat or is contained in one; in that case, the occurrence of α at i is contained in a maximal repeat. If there is no other occurrence of α with a preceding x, then xα occurs only once in S, the occurrence of α at i is not contained in a maximal repeat, and so it witnesses the near-supermaximality of α. In summary:

Theorem 7.12.4. A maximal repeat α corresponding to node v of T is near-supermaximal if and only if some child w of v is a leaf whose left character is the left character of no other leaf below v, and α is supermaximal if and only if every child of v is a leaf and the left characters of those leaves are all distinct.

Given the suffix tree, these conditions are easily checked, so all supermaximal and near-supermaximal repeats can be identified in linear time.

7.12.3. Finding all the maximal pairs in linear time

We now turn to the question of finding all the maximal pairs. Since there can be more than O(n) of them, the running time of the algorithm will be stated in terms of the size of the output: letting n denote the length of S, if there are k maximal pairs, the method works in O(n + k) time. The algorithm is an extension of the method given earlier to find all maximal repeats.

First, build a suffix tree for S, and for each leaf specifying a suffix i, record its left character S(i - 1). Now traverse the tree from bottom up, visiting each node in the tree. During the traversal, each node v is given a set of lists, one for each left character: the list at v indexed by character x contains all the starting positions of substrings in S that match the string on the path to v and that have left character x; that is, it is just the list of leaf numbers below v that specify suffixes in S that are immediately preceded by character x.

At the start of the visit to node v, before v's lists have been created, the algorithm outputs the maximal pairs containing the string α that labels the path to v. For each character x and each child v' of v, the algorithm forms the Cartesian product of the list for x at v' with the union of every list for a character other than x at a child of v other than v'. Any pair in this product gives the starting positions of a maximal pair for string α: the two occurrences have different left characters by construction, and they have different right characters because they lie below different children of v. The algorithm can therefore output each maximal pair as a triple (p1, p2, n'). Afterward, the list for each character x at v is created by linking the lists for x at v's children, in constant time per child.

The creation of the suffix tree, its bottom-up traversal, and all the list linking take O(n) time. Each operation used in a Cartesian product produces a maximal pair not produced anywhere else, so O(k) time is used in those operations. In summary, we can state:

Theorem 7.12.5. All the maximal pairs in S can be found in O(n + k) time, where k is the number of maximal pairs. If we only want to count the number of maximal pairs, the algorithm can be modified to run in O(n) time.
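The supermaximal and near-supermaximal definitions can again be checked by brute force. This sketch is my own illustration (polynomial time, not the linear-time suffix-tree criteria above): it finds all maximal repeats, then counts, for each one, the witness occurrences not covered by an occurrence of any other maximal repeat.

```python
def repeat_structure(s):
    """Brute-force classification of the maximal repeats of s.

    Returns (maximal, supermaximal, near_supermaximal) as sets of
    strings.  Illustrative code only; the suffix-tree criteria in the
    text achieve linear time."""
    n = len(s)
    occ = {}                        # substring -> list of start positions
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ.setdefault(s[i:j], []).append(i)

    maximal = set()
    for sub, starts in occ.items():
        if len(starts) < 2:
            continue
        lefts = {s[i - 1] if i > 0 else '^' for i in starts}
        rights = {s[i + len(sub)] if i + len(sub) < n else '$' for i in starts}
        if len(lefts) >= 2 and len(rights) >= 2:   # left and right diverse
            maximal.add(sub)

    supermaximal, near_supermaximal = set(), set()
    for a in maximal:
        la = len(a)
        # an occurrence of a at i is a witness if it lies inside no
        # occurrence of any other (necessarily longer) maximal repeat
        witnesses = sum(
            1 for i in occ[a]
            if not any(b != a and j <= i and i + la <= j + len(b)
                       for b in maximal for j in occ[b])
        )
        if witnesses >= 1:
            near_supermaximal.add(a)
        if witnesses == len(occ[a]):
            supermaximal.add(a)
    return maximal, supermaximal, near_supermaximal
```

On aabxayaab this classifies aab as supermaximal and a as near-supermaximal but not supermaximal, while on aabxayaabxab the repeat a is not even near-supermaximal, matching the examples in the text.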

If only maximal pairs of a certain minimum length are requested (this would be the typical case in many applications), then the algorithm can be modified to run in O(n + km) time, where km is the number of maximal pairs of length at least the required minimum: simply stop the bottom-up traversal at any node whose string-depth falls below that minimum.

7.13. APL12: Circular string linearization

Recall the definition of a circular string S given in Exercise 2 of Chapter 1 (page 11). The characters of S are initially numbered sequentially from 1 to n starting at an arbitrary point in S. The circular string linearization problem for a circular string S of n characters is the following: Choose a place to cut S so that the resulting linear string is the lexically smallest of all the n possible linear strings created by cutting S.

This problem arises in chemical databases for circular molecules. Each such molecule is represented by a circular string of chemical characters; to allow faster lookup and comparisons of molecules, one wants to store each circular string by a canonical linear string, and a natural choice for the canonical linear string is the one that is lexically least. A single circular molecule may itself be a part of a more complex molecule, so this problem arises in the "inner loop" of more complex chemical retrieval and comparison problems.

7.13.1. Solution via suffix trees

Arbitrarily cut the circular string S, giving a linear string L. Double L, creating the string LL, affix the terminal symbol $ at the end of LL, and build the suffix tree T for LL$. Here we interpret $ to be lexically greater than any character in the alphabet used for S. (Intuitively, the purpose of doubling L is to allow efficient consideration of strings that begin with a suffix of L and end with a prefix of L.) Then traverse tree T with the rule that, at every node, the traversal follows the edge whose first character is lexically smallest over all first characters on edges out of the node. This traversal continues until the traversed path has string-depth n. Such a depth will always be reached (with the proof left to the reader). Each leaf l in the subtree below the point reached gives a cutting point yielding the same, lexically smallest, linear string: if 1 < l <= n, then cutting S between characters l - 1 and l creates a lexically smallest linearization of the circular string; if l = 1 or l = n + 1, then cut the circular string between character n and character 1. The correctness of this solution is easy to establish and is left as an exercise.

This method runs in linear time and is therefore time optimal. A different linear-time method with a smaller constant was given by Shiloach [404].
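The doubling idea can be expressed without a suffix tree: every candidate linearization is a length-n window of LL. The sketch below (my own illustrative code, with an assumed function name; O(n^2) worst case rather than the linear time of the suffix-tree traversal or of Shiloach's method) returns a 0-based cut position and the resulting linearization.

```python
def least_rotation_cut(s):
    """Return (cut, linearization): the 0-based cut position whose
    rotation of circular string s is lexically smallest.

    Brute-force version of the doubling idea: each rotation of s is a
    length-n window of s + s, so we take the minimum window."""
    n = len(s)
    doubled = s + s
    best = 0
    for i in range(1, n):
        if doubled[i:i + n] < doubled[best:best + n]:
            best = i
    return best, doubled[best:best + n]
```

For example, the circular string cab linearizes as abc by cutting before the a, and tartar linearizes as artart.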
7.14. APL13: Suffix arrays - more space reduction

In Section 6.5, we saw that when alphabet size is included in the time and space bounds, the suffix tree for a string of length m either requires Theta(m|Σ|) space or the minimum of O(m log m) and O(m log |Σ|) time. Similarly, searching for a pattern P of length n using a suffix tree can be done in O(n) time only if Theta(m|Σ|) space is used for the tree, or if we assume that up to |Σ| character comparisons cost only constant time; otherwise, the search takes the minimum of O(n log m) and O(n log |Σ|) comparisons. For these reasons, a suffix tree may require too much space to be practical in some applications, and a more space efficient approach is desired that still retains most of the advantages of searching with a suffix tree.

It is important to be clear on the setting of the problem. In the context of the substring problem (see Section 7.3), string T will be held fixed for a long time and searched many times, so the key issues are the time needed for the search and the space used by the fixed data structure representing T. The space used during the preprocessing of T is of less concern, although it should still be "reasonable".

To address these concerns, Manber and Myers [308] proposed a new data structure, called a suffix array, that is space efficient and yet can be used to solve the exact matching problem or the substring problem almost as efficiently as with a suffix tree. The more formal notion of a suffix array and the basic algorithms for building and using it were developed in [308], although many of the ideas were anticipated in the biological literature by Martinez [310]. After defining suffix arrays, we show how to convert a suffix tree to a suffix array in linear time.

Definition. Given an m-character string T, a suffix array for T, called Pos, is an array of the integers in the range 1 to m, specifying the lexicographic order of the m suffixes of string T. That is, the suffix starting at position Pos(1) of T is the lexically smallest suffix, and in general suffix Pos(i) of T is lexically smaller than suffix Pos(i + 1).

As an implementation detail, we will affix a terminal symbol $ to the end of S, but now we interpret it to be lexically less than any other character in the alphabet. This is in contrast to its interpretation in the previous section.

As an example of a suffix array, if T is mississippi, then the suffix array Pos is 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3. Figure 7.4 lists the eleven suffixes in lexicographic order.

Notice that the suffix array holds only integers and hence contains no information about the alphabet used in string T. Therefore, the space required by suffix arrays is modest: for a string of length m, the array can be stored in exactly m computer words, assuming a word size of at least log m bits. When augmented with an additional 2m values (the Lcp values introduced below), the suffix array can be used to find all the occurrences in T of a pattern P in O(n + log2 m) single-character comparison and bookkeeping operations, and this bound is independent of the alphabet size. Since for most problems of interest log2 m is O(n), the substring problem is solved by using suffix arrays as efficiently as by using suffix trees. Exercises 53, 54, and 55 develop an alternative, more space efficient (but slower) method for building a suffix array.

7.14.1. Suffix tree to suffix array in linear time

We assume that sufficient space is available to build a suffix tree for T (this is done once during a preprocessing phase), but that the suffix tree cannot be kept intact to be used in the (many) subsequent searches for patterns in T. Instead, we convert the suffix tree to the more space efficient suffix array; once the suffix array is built, the suffix tree is discarded.

A suffix array for T can be obtained from the suffix tree T for T by performing a "lexical" depth-first traversal of T. Since no two edges out of a node v have labels beginning with the same character, there is a strict lexical ordering of the edges out of v. This ordering implies that the path from the root of T that follows the lexically smallest edge out of each encountered node leads to a leaf of T representing the lexically smallest suffix of T. More generally, a depth-first traversal of T that traverses the edges out of each node v in their lexical order will encounter the leaves of T in the lexical order of the suffixes they represent. Suffix array Pos is therefore just the ordered list of suffix numbers encountered at the leaves of T during the lexical depth-first search.

For example, the suffix tree for T = tartar is shown in Figure 7.5. The lexical depth-first traversal visits the leaves in the order 5, 2, 6, 3, 4, 1, defining the values of array Pos.

Figure 7.4: The eleven suffixes of mississippi listed in lexicographic order. The starting positions of those suffixes define the suffix array Pos.

Figure 7.5: The lexical depth-first traversal of the suffix tree visits the leaves in order 5, 2, 6, 3, 4, 1.

If the branches out of each node of the tree are organized in a sorted linked list (as discussed in Section 6.5, page 116), then the overhead to do a lexical depth-first search is the same as for any depth-first search: every time the search must choose an edge out of a node v to traverse, it simply picks the next edge on v's linked list. Hence the suffix tree for T is constructed in linear time, and the traversal also takes only linear time, so we have the following:

Theorem 7.14.2. The suffix array Pos for a string T of length m can be constructed in O(m) time.

7.14.2. How to search for a pattern using a suffix array

The suffix array for string T allows a very simple algorithm to find all occurrences of any pattern P in T. The key is that if P occurs in T, then all the locations of those occurrences will be grouped consecutively in Pos. For example, P = issi occurs in mississippi at locations 2 and 5, which are indeed adjacent in Pos (see Figure 7.4). So to find all the occurrences of P in T, simply do binary search over the suffix array. In more detail, suppose that P is lexically less than the suffix in the middle position of Pos (i.e., suffix Pos(ceil(m/2))). In that case, the first place in Pos that contains a position where P occurs in T must be in the first half of Pos. Similarly, if P is lexically greater than suffix Pos(ceil(m/2)), then the places where P occurs in T must be in the second half of Pos. Using binary search, one can therefore find the smallest index i in Pos (if any) such that P exactly matches the first n characters of suffix Pos(i). Similarly, one can find the largest index i' with that property. Then pattern P occurs in T starting at every location given by Pos(i) through Pos(i').

The lexical comparison of P to any suffix takes time proportional to the length of the common prefix of those two strings, and that prefix has length at most n. Therefore, the binary search uses at most O(n log m) single-character comparisons. Of course, the true behavior of the algorithm depends on how many long prefixes of P occur in T. If very few long prefixes of P occur in T, then it will rarely happen that a specific lexical comparison actually takes Theta(n) time, and generally the O(n log m) bound is quite pessimistic. In "random" strings (even on large alphabets) this method should run in O(n + log m) expected time. In cases where many long prefixes of P do occur in T, the method can be improved with the two tricks described in the next two subsections.
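The grouping property and the plain binary search can be sketched as follows. This is my own illustration using 0-based indices; the suffix array here is built by simple sorting rather than the linear-time suffix-tree conversion described above.

```python
def suffix_array(t):
    # Illustrative construction by sorting; not the linear-time method.
    return sorted(range(len(t)), key=lambda i: t[i:])

def find_occurrences(t, p, pos):
    """All starting positions of p in t, via two binary searches over pos.

    Because pos lists suffixes in lexical order, the suffixes beginning
    with p occupy one contiguous block of pos."""
    n = len(p)

    def prefix(i):                 # first n characters of suffix pos[i]
        return t[pos[i]:pos[i] + n]

    lo, hi = 0, len(pos)           # leftmost index with prefix >= p
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(pos)                  # leftmost index with prefix > p
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[left:lo])
```

On T = mississippi and P = issi this reports the 0-based positions 1 and 4, which are the 1-based locations 2 and 5 mentioned above.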

7.14.3. A simple accelerant

As the binary search proceeds, let L and R denote the left and right boundaries of the "current search interval". At the start of the search, L equals 1 and R equals m. In each iteration of the binary search, a query is made at location M = ceil((R + L)/2) of Pos. The search algorithm keeps track of the longest prefixes of suffix Pos(L) and suffix Pos(R) that match a prefix of P. Let l and r denote those two prefix lengths, respectively, and let mlr = min(l, r). At the start of the search, l and r are found by explicitly comparing P to suffix Pos(1) and to suffix Pos(m).

Since array Pos gives the lexical ordering of the suffixes of T, if i is any index between L and R, then the first mlr characters of suffix Pos(i) must be the same as the first mlr characters of suffix Pos(L) and hence of P. Therefore, the lexical comparison of P and suffix Pos(M) can begin from position mlr + 1 of the two strings, rather than starting from the first position. Maintaining mlr during the binary search adds little additional overhead to the algorithm but avoids many redundant comparisons. Still, the use of mlr alone does not improve the worst case: the worst-case time for this revised method is still O(n log m).

7.14.4. A super-accelerant

Call an examination of a character in P redundant if that character has been examined before. The goal of the acceleration is to reduce the number of redundant character examinations to at most one per iteration of the binary search, hence O(log m) in all. The desired time bound, O(n + log m), follows immediately. To speed up the search, the algorithm uses the precomputed values Lcp(L, M) and Lcp(M, R) for each triple (L, M, R) that arises during the execution of the binary search, where Lcp(i, j) denotes the length of the longest common prefix of suffix Pos(i) and suffix Pos(j). For example, when T = mississippi, suffix Pos(3) is issippi, suffix Pos(4) is ississippi, and so Lcp(3, 4) is four (see Figure 7.4). For now, we assume that these values can be obtained in constant time when needed and show how they help the search.

How to use Lcp values. In any iteration of the binary search, if l = r, then compare P to suffix Pos(M) starting from position mlr + 1 = l + 1 = r + 1, as before. When l is not equal to r, assume without loss of generality that l > r (the case r > l is symmetric). There are then three subcases:

If Lcp(L, M) > l, then suffix Pos(M) agrees with suffix Pos(L) beyond position l, where P disagrees with suffix Pos(L); hence P must be lexically greater than suffix Pos(M), and the algorithm sets L to M. No characters of P are examined, and l and r are unchanged.

If Lcp(L, M) < l, then suffix Pos(M) disagrees with suffix Pos(L), and hence with P, at position Lcp(L, M) + 1, in a way that makes P lexically less than suffix Pos(M); the algorithm sets R to M and sets r to Lcp(L, M). Again, no characters of P are examined.

If Lcp(L, M) = l, then P agrees with suffix Pos(M) through its first l characters, so the algorithm compares P to suffix Pos(M) starting from position l + 1 and proceeds according to the outcome, updating l or r to the length of the match found. By simple case analysis it is easy to verify that neither l nor r ever decreases during the binary search.

In the two cases (l = r, or Lcp(L, M) = l > r) where the algorithm examines a character during the iteration, the comparisons start with character max(l, r) + 1 of P. Suppose k characters of P are examined in an iteration; then there are k - 1 matches during the iteration, and at the end of the iteration max(l, r) increases by k - 1 (either l or r is changed to that value). Hence, at the start of any iteration, all characters of P up through position max(l, r) have already been examined, but the next character may not have been. It follows that every iteration of the binary search either terminates the search, examines no characters of P, or makes at most one redundant character examination, ending after the first mismatch occurs in that iteration. Thus no more than log2 m redundant comparisons are done overall, and there are at most n nonredundant comparisons of characters of P, giving a total bound of O(n + log m) comparisons. All the other work in the algorithm can clearly be done in time proportional to these comparisons, so with the Lcp values in hand, the search runs in O(n + log m) time.

Later we will show how to compute the particular Lcp values needed by the binary search during the preprocessing of T.
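The simple accelerant of Section 7.14.3 can be sketched as follows (my own code and names, 0-based; it implements only the min(l, r) trick, not the Lcp super-accelerant). l and r hold the number of leading characters of P known to match the suffixes at the current boundaries, and every comparison at the probe M resumes at min(l, r) instead of position 0.

```python
def sa_search_mlr(t, p, pos):
    """Leftmost index of suffix array pos whose suffix begins with p,
    or -1.  Sketch of the mlr accelerant: the comparison at probe M
    restarts after min(l, r) characters already known to match."""
    m, n = len(pos), len(p)

    def extend(i, k):
        # extend a verified match of k characters between p and suffix pos[i]
        s = pos[i]
        while k < n and s + k < len(t) and t[s + k] == p[k]:
            k += 1
        return k

    L, R = -1, m        # invariant: suffix at L is < p, suffix at R is >= p
    l = r = 0           # verified match lengths at L and R (0 is always safe)
    while R - L > 1:
        M = (L + R) // 2
        k = extend(M, min(l, r))
        s = pos[M]
        if k == n or (s + k < len(t) and t[s + k] > p[k]):
            R, r = M, k          # suffix at M is lexically >= p
        else:
            L, l = M, k          # suffix at M is lexically < p
    if R < m and extend(R, 0) == n:
        return R
    return -1
```

The boundary sentinels L = -1 and R = m stand for virtual suffixes below and above everything, which is why their match lengths start at 0.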

7.14.5. How to obtain the Lcp values

The Lcp values needed to accelerate searches are precomputed in the preprocessing phase, during the creation of the suffix array; it is therefore plausible that those values can be accumulated during the O(m)-time preprocessing of T, and we now show how. We first consider how many Lcp values are ever needed over all possible executions of the binary search. Consider the binary tree B whose nodes represent all the possible search intervals (L, R) that could arise in the binary search of an ordered list of length m: the root of B represents the interval (1, m), and a node representing an interval (L, R) with R - L > 1 has two children, representing (L, M) and (M, R) for M = ceil((R + L)/2). Since B is a binary tree with m leaves, B has 2m - 1 nodes in total. So there are only O(m) Lcp values, those associated with the nodes of B, that the O(n + log m)-time string and substring matching algorithm must precompute.

Figure 7.7: Binary tree B representing all the possible search intervals in any execution of binary search in a list of length m = 8.

The values Lcp(i, i + 1) for i from 1 to m - 1 sit at the leaves of B, and they are easily accumulated in O(m) time during the linear-time, lexical, depth-first traversal of T used to construct the suffix array: Lcp(i, i + 1) is the string-depth of the deepest common ancestor of the leaves for suffixes Pos(i) and Pos(i + 1). For example, consider again the suffix tree for T = tartar shown in Figure 7.5 (page 151). The values of Lcp(i, i + 1) for i from 1 to 5 are 2, 0, 1, 0, 3. In particular, Lcp(5, 6) is the string-depth of the common parent of leaves 4 and 1; since that parent is labeled with the string tar, Lcp(5, 6) is 3. If we assume that the string-depths of the nodes are known (these can be accumulated in linear time), the leaf values of B can be recorded during the lexical depth-first traversal of T. This clearly takes just O(m) time.

The rest of the Lcp values are easy to accumulate because of the following lemma (the hardest part of which involves parsing it; the proof is essentially immediate from the properties of lexical ordering):

Lemma 7.14.1. For any pair of positions i < j, Lcp(i, j) is the smallest value of Lcp(k, k + 1) for k from i to j - 1.

PROOF. Suffix Pos(i) and suffix Pos(j) of T have a common prefix of length Lcp(i, j). By the properties of lexical ordering, suffix Pos(k) must also have that common prefix for every k between i and j. Therefore, Lcp(k, k + 1) is at least Lcp(i, j) for every k from i to j - 1. Conversely, by transitivity, Lcp(i, i + 2) must be at least as large as the minimum of Lcp(i, i + 1) and Lcp(i + 1, i + 2); extending this observation, Lcp(i, j) must be at least as large as the smallest Lcp(k, k + 1) for k from i to j - 1. Combining the two directions proves the lemma. □

Given Lemma 7.14.1, the remaining Lcp values for B can be computed from the leaf values in linear time by a bottom-up traversal of B, setting the Lcp value at any internal node v to the minimum of the Lcp values of its two children. In summary, all 2m - 1 Lcp values needed by the accelerated search can be precomputed in O(m) time.
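The leaf values of B and Lemma 7.14.1 can be checked with a few lines of Python (my own sketch, 0-based, computing adjacent Lcp values directly from the string rather than from the traversal of T):

```python
def adjacent_lcps(t, pos):
    """Lcp(i, i+1) for consecutive entries of suffix array pos (0-based).

    By Lemma 7.14.1, the Lcp value of any wider interval (i, j) is the
    minimum of these adjacent values."""
    def lcp(a, b):
        k = 0
        while a + k < len(t) and b + k < len(t) and t[a + k] == t[b + k]:
            k += 1
        return k
    return [lcp(pos[i], pos[i + 1]) for i in range(len(pos) - 1)]
```

For tartar this yields the values 2, 0, 1, 0, 3 quoted above, and for mississippi the entry corresponding to the 1-based pair (3, 4) is four, matching the Lcp(3, 4) example of Section 7.14.4.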
7.14.6. Where do large alphabet problems arise?

A large part of the motivation for suffix arrays comes from problems that arise in using suffix trees when the underlying alphabet is large, so it is natural to ask where large alphabets occur. First, there are natural languages, such as Chinese, with large "alphabets" (using some computer representation of the Chinese pictograms). However, most large alphabets of interest to us arise because the string contains numbers, each of which is treated as a character. One simple example is a string that comes from a picture, where each character in the string gives the color or gray level of a pixel.

String and substring matching problems where the alphabet contains numbers, and where P and T are large, also arise in computational problems in molecular biology. One example is the map matching problem. A restriction enzyme map for a single enzyme specifies the locations in a DNA string where copies of a certain substring (a restriction enzyme recognition site) occur. Each such site may be separated from the next one by many thousands of bases. Hence, the restriction enzyme map for that single enzyme is represented as a string consisting of a sequence of integers specifying the distances between successive enzyme sites. Considered as a string, each integer is a character of a (huge) underlying alphabet. More generally, a map may display the sites of many different patterns of interest (whether or not they are restriction enzyme sites), so the string (map) then consists of characters from a finite alphabet (representing the known patterns of interest) alternating with integers giving the distances between such sites.

7.15. APL14: Suffix trees in genome-scale projects

Generalized suffix trees and suffix arrays are now being used as the central data structures in three genome-scale projects.

Arabidopsis thaliana  An Arabidopsis thaliana genome project at Michigan State University and the University of Minnesota is initially creating an EST map of the Arabidopsis genome (see Chapter 3 for a discussion of ESTs and Chapter 16 for a discussion of mapping). In that project generalized suffix trees are used in several ways [63, 64, 65]. First, each sequenced fragment is checked to catch any contamination by known vector sequences; the vector sequences are kept in a generalized suffix tree. Second, each new sequenced fragment is checked against fragments already sequenced to find duplicate sequences or regions of high similarity; the fragment sequences are kept in an expanding generalized suffix tree for this purpose. Since the project will sequence about 36,000 fragments, each of length about 400 bases, the efficiency of the searches for duplicates and for contamination is critical. Third, suffix trees are used in the search for biologically significant patterns in the obtained Arabidopsis sequences. Patterns of interest are often represented as regular expressions, and the variety of known patterns of interest is itself large (see [435]), so generalized suffix trees are used to accelerate regular expression pattern matching. (The use of suffix trees to speed up regular expression pattern matching, with errors, is discussed in Section 12.4.)

Borrelia burgdorferi  Borrelia is the bacterium causing Lyme disease. Its genome is about one million bases and is currently being sequenced at the Brookhaven National Laboratory using a directed sequencing approach to fill in gaps after an initial shotgun sequencing phase (see Chapter 16). Chen and Skiena [100] developed methods based on suffix trees and suffix arrays to solve the fragment assembly problem for this project. In fragment assembly, one major bottleneck is overlap detection, which requires solving a variant of the suffix-prefix matching problem (allowing some errors) for all pairs of strings in a large set (see Section 16.15). The Borrelia work [100] consisted of 4,612 fragments (strings) totaling 2,032,740 bases; using suffix trees and suffix arrays, the needed overlaps were computed in about fifteen minutes. To compare the speed and accuracy of the suffix tree methods to pure dynamic programming methods for overlap detection, Chen and Skiena closely examined cosmid-sized data. The test established that the suffix tree approach gives a 1,000-fold speedup over the (slightly) more accurate dynamic programming approach, finding 99% of the significant overlaps found by using dynamic programming.

Yeast  Suffix trees are also the central data structure in genome-scale analysis of Saccharomyces cerevisiae (brewer's yeast), done at the Max-Planck Institute [320]. In that project, highly optimized suffix trees called hashed position trees are used to solve problems of "clustering sequence data into evolutionary related protein families, structure prediction, and fragment assembly" [320]. Suffix trees are "particularly suitable for finding substring patterns in sequence databases" [320].
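The suffix-prefix overlap computation at the heart of fragment assembly can be stated very simply. The quadratic sketch below (my own illustration, exact matches only, no errors) shows for one pair of fragments what the suffix-tree methods of Section 7.10 compute for all pairs at once in linear total time.

```python
def overlap(a, b):
    """Length of the longest suffix of fragment `a` that equals a prefix of
    fragment `b` (exact match; real overlap detection must tolerate errors)."""
    for l in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:l]):
            return l
    return 0

# Fragment b extends fragment a with a three-base overlap.
assert overlap("ACGTTG", "TTGCAA") == 3
assert overlap("ACGT", "GGGG") == 0
```

Running this for every ordered pair of k fragments costs k^2 string scans, which is exactly the bottleneck the generalized suffix tree removes.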
It often happens that a DNA substring is obtained and studied without knowing where that DNA is located in the genome or whether that substring has been previously researched. If both the new and the previously studied DNA are fully sequenced and put in a database, then the issue of previous work or locations would be solved by exact string matching. But most DNA substrings that are studied are not fully sequenced; maps are easier and cheaper to obtain than sequences. Consequently, the following matching problem on maps arises and translates to a matching problem on strings with large alphabets:

Given an established (restriction enzyme) map for a large DNA string and a map from a smaller string, determine if the smaller string is a substring of the larger one.

Since each map is represented as an alternating string of characters and integers, and since distances are often known with high precision, the numbers are not rounded off. The alphabet is huge because the range of integers is huge. Of course, the problem becomes more difficult in the presence of errors, when a small number of errors in a match are allowed, when the integers in the strings may not be exact, or when sites are missing or spuriously added. That harder problem, called map alignment, is discussed in Chapter 16.

7.16. APL15: A Boyer-Moore approach to exact set matching

The Boyer-Moore algorithm for exact matching (single pattern) will often make long shifts of the pattern, examining only a small percentage of all the characters in the text. In contrast, Knuth-Morris-Pratt examines all characters in the text in order to find all occurrences of the pattern. In the case of exact set matching, the Aho-Corasick algorithm is analogous to Knuth-Morris-Pratt: it examines all characters of the text. Since the Boyer-Moore algorithm for a single string is far more efficient in practice than Knuth-Morris-Pratt, one would like to have a Boyer-Moore type algorithm for the exact set matching problem, that is, a method that typically examines only a sublinear portion of T. No known simple algorithm achieves this goal and also has a linear worst-case running time.

The preprocessing needed to implement the bad character rule is simple and is left to the reader. The generalization of the bad character rule to set matching is easy but, unlike the case of a single pattern, use of the bad character rule alone may not be very effective. As the number of patterns grows, the typical size of i2 - i is likely to decrease, because some pattern is likely to have character T(j - 1) close to, but left of, the point where the previous matches end. This is particularly true if the alphabet is small, making the bad character rule almost useless. A bad character rule analogous to the simpler, unextended bad character rule for a single pattern would be even less useful. Moreover, in some applications in molecular biology the total length of the patterns in P is larger than the size of T. Consequently, a rule analogous to the good suffix rule is crucial in making a Boyer-Moore approach effective.

Good suffix rule

To adapt the (weak) good suffix rule to the set matching problem we reason as follows: After a matched path in K^r is found (either finding an occurrence of a pattern, or not), let j be the left-most character in T that was matched along the path, and let α = T[j..i] be the substring of T that was matched in the search just ended (matched from position i down to position j of T).

A direct generalization (to set matching) of the two-string good suffix rule would shift the right ends of all the patterns in P to the smallest value i2 > i (if it exists) such that T[j..i] matches a substring of some pattern P in P. This shift is analogous to the good suffix shift for two patterns: pattern P must then contain a copy of α ending exactly i2 - i positions from its right end. (Figure 7.10: Substring α, matched from position i down to position j of T. Figure 7.11: The shift when a pattern P contains a copy of α; the position of the copy of α in P determines the amount of the shift.)

Notice that because P contains more than one pattern, that shift may be too large. When there are patterns smaller than P, it may happen that, if the right end of every pattern moves to i2, the left end of the smallest pattern P_min would be placed more than one position to the right of i. If that happens, an occurrence of P_min in T could be missed. Even if that doesn't happen, there is another problem, again due to patterns that are smaller than P. Suppose that a prefix β of some pattern P' in P matches a suffix of α. If P' is smaller than P, then shifting the right end of P' to i2 may shift the prefix β of P' past the matching substring β in T. In that case, an occurrence of P' in T could be missed.

So let i3 be the smallest index greater than i (if i3 exists) such that, when the right ends of all the patterns in P are aligned with position i3 of T, a prefix of at least one pattern is aligned opposite a suffix of α in T. Then the good suffix rule is: Increase i to the minimum of i2 and i3. If either or both are nonexistent, ignore them in this rule; if neither i2 nor i3 exists, then no further overlap with α is possible and i should be increased by the length of the smallest pattern in P.

The question now is how to efficiently determine i2 and i3. We will first discuss i2. During the preprocessing phase, build a generalized suffix tree T' for the set of patterns P^r (the set of reversed patterns). Recall that in a generalized suffix tree each leaf is associated with both a pattern P' in P^r and a number z specifying the starting position of a suffix of P'. The first number on each leaf refers to a string in P^r, and the second number refers to a suffix starting position in that string. In addition, during the preprocessing, determine a number z_v for each internal node v; z_v is the first (or only) number written at each internal node (the second number, d_v, will be introduced later). These preprocessing tasks are easily accomplished in linear time by standard methods and are left to the reader. As an example of the preprocessing, consider the set P = {wxa, xaqq, qxax} and the generalized suffix tree for P^r shown in Figure 7.12.

We can now describe how T' is used during the search to determine the value i2, if it exists. After matching α along a path in K^r, traverse the path labeled α^r from the root of T'. That path exists because α^r is a prefix of some pattern in P^r (that is what the search in K^r determined). Suppose an occurrence of α^r starts at position z > 1 of some pattern P'^r; then the corresponding pattern P contains a copy of α that ends z - 1 positions from the end of P and that is not a suffix of P. To set i2 we need, over all such occurrences, the pattern P in P containing the copy of α ending closest to its right end; equivalently, we need the copy of α^r starting closest to, but to the right of, the left end of its string. If that copy starts at position z, then i should be increased by z - 1.
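Returning briefly to the bad character rule: its weakening as the set grows can be seen directly. For a whole set of patterns, a safe shift after a mismatch on a text character can be no larger than the smallest distance, over all patterns, from a pattern's right end to an occurrence of that character. The sketch below is my own illustration of this effect, not the book's preprocessing.

```python
def bad_char_shifts(patterns):
    """For each character, the minimum over all patterns of the distance from
    the pattern's right end to the rightmost occurrence of that character.
    This caps the shift the bad character rule can safely allow for the whole
    set; adding more patterns can only shrink these values."""
    shifts = {}
    for p in patterns:
        for dist, ch in enumerate(reversed(p)):
            # dist = distance from the right end of p to this occurrence of ch
            cur = shifts.get(ch)
            if cur is None or dist < cur:
                shifts[ch] = dist
    return shifts

s1 = bad_char_shifts(["wxa"])
s2 = bad_char_shifts(["wxa", "xaqq", "qxax"])
assert s1["x"] == 1
assert s2["x"] == 0     # 'x' now sits at the right end of some pattern: no shift
```

With three patterns the permitted shift for x already drops to zero, which is why the good suffix rule carries the method for sets.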

Consider the leaves at or below the end of the α^r path in T'. Each such leaf is associated with a string P' in P^r and a number z, and it certifies that a copy of α^r occurs in P' starting at position z. A leaf with z greater than one is associated with a string P' that contains a copy of α^r starting to the right of position one, which is exactly what i2 requires. Therefore, during preprocessing, set z_v for each node v of T' to the smallest leaf number z greater than one at or below v (z_v is undefined if no such leaf exists). In the search phase, after the algorithm matches a string α in T, it traverses the path α^r in T' and checks the first node v at or below the end of that path. If z_v is defined there, then i2 exists and equals i + z_v - 1; otherwise i2 is undefined. Clearly, the determination of i2 takes O(|α|) time.

Now we turn to the computation of i3. Again we use the generalized suffix tree T' for P^r. To get the idea of the method, let P in P be any pattern such that a suffix of α is a prefix of P. Since some suffix of α is a prefix of P, some initial portion of the α^r path in T' describes a complete suffix of P^r. There thus must be a leaf edge (u, z) branching off the α^r path that is labeled only by the terminal character $, where u is a node on the α^r path and z is the starting position of that suffix of P^r. Conversely, any such edge identifies a pattern with a prefix matching a suffix of α. So, in the preprocessing, find every leaf edge (u, z) of T' that is labeled only by the terminal character $, and set a variable d_u to z at each such node u. Then, for each node v of T', set d_v equal to the smallest value of d_u over any ancestor u of v (including v itself) at which d is defined. With these values in place, i3 can be found during the same traversal of the α^r path used to search for i2: the algorithm checks the first node v at or below the end of the matched path, and if d_v is defined there, then i3 exists and equals i + d_v - 1, where d_v equals the smallest d value at a node on the α^r path. The value is correct because a d value of z corresponds to aligning a prefix of length |P| - z + 1 of some pattern P opposite a suffix of α, which moves the right end of P to position i + z - 1.

Theorem 7.16.1. After the algorithm matches a string α in T, the values i2 and i3 (when they exist) can be obtained in O(|α|) additional time.

The proof is immediate and is left to the reader. Hence, every time a string α is matched in K^r, only O(|α|) additional time is used to search for i2 and i3. Thus the time to implement the shifts using the two trees is proportional to the time used to find the matches, only doubling the time of the search. The search phase will not miss any occurrence of a pattern if either the good suffix rule or the bad character rule is used by itself. Moreover, the two rules can be combined to increment i by the largest amount specified by either of the two rules.

An implementation eliminating redundancy

The implementation above builds two trees (the keyword tree K^r and the generalized suffix tree T') in time and space proportional to the total size of the patterns in P. From an asymptotic standpoint the two trees are as small as one, and the two traversals are as fast as one, but clearly there is superfluous work in this implementation: a single tree and a single traversal per search phase should suffice. Here's how.

Preprocessing for Boyer-Moore exact set matching
Begin
1. Build a generalized suffix tree T' for the strings in P^r. (Each leaf in the tree is numbered both by a specific pattern P' in P^r and by a specific starting position z of a suffix in P'; the number z is used both as a leaf name and as the starting position of the suffix associated with that leaf.)
2. Identify every leaf edge (u, z) of T' that is labeled only by the terminal character $; mark node u and set d_u = z.
3. Remove the subtree rooted at any unmarked node (including leaves) of T' that has no marked descendant. (Nodes were marked in step 2.)
4. In the resulting tree, set d_v, for each node v, equal to the smallest value of d_u for any ancestor u of v (including v itself) at which d is defined.
End

The above preprocessing tasks are easily accomplished in linear time by standard tree traversal methods. Let L denote the tree at the end of the preprocessing. Tree L is essentially the familiar keyword tree K^r, but more compacted: any path of nodes each with only one descendant has been replaced with a single edge. Hence the test to see if a pattern of P ends at position i can be executed using tree L rather than K^r, and the search used to find i2 can share that single traversal. Each node v in L now has associated with it the values needed to compute i2 and i3 in constant time: after the algorithm matches a string α in T by following the path α^r in L, it checks the first node v at or below the end of the matched path; if z_v is defined there, then i2 = i + z_v - 1, and if d_v is defined there, then i3 = i + d_v - 1. Figure 7.13 shows the tree L corresponding to the tree T' of Figure 7.12.

Figure 7.12: Generalized suffix tree T' for the set P = {wxa, xaqq, qxax}.
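As a cross-check on the suffix-tree machinery, the quantity computed for i2 has a direct naive characterization: over all patterns, find the copy of the matched string α that ends closest to a pattern's right end without being a suffix of that pattern; the distance from that copy's end to the pattern's end is the shift. The slow sketch below is my own illustration of that definition, not the book's algorithm.

```python
def good_suffix_shift(patterns, alpha):
    """Smallest good-suffix shift for the matched string `alpha`: over all
    patterns P and all occurrences of alpha in P that are not a suffix of P,
    minimize the distance from the occurrence's right end to P's right end.
    Returns None if no such occurrence exists. (The tree T' yields the same
    value in O(|alpha|) time.)"""
    best = None
    for p in patterns:
        start = p.find(alpha)
        while start != -1:
            end = start + len(alpha)        # one past the copy's right end
            if end != len(p):               # exclude copies that are suffixes
                dist = len(p) - end
                if best is None or dist < best:
                    best = dist
            start = p.find(alpha, start + 1)
    return best

# With the book's example set P = {wxa, xaqq, qxax}:
assert good_suffix_shift(["wxa", "xaqq", "qxax"], "xa") == 1
assert good_suffix_shift(["wxa", "xaqq", "qxax"], "qq") is None
```

For α = xa, the copy inside qxax ends one position from the pattern's right end, so a shift of 1 realigns it; the copy in wxa is a suffix and is ignored.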

7.17. APL16: Ziv-Lempel data compression

Large text or graphics files are often compressed in order to save storage space or to speed up transmission when the file is shipped. Most operating systems have compression utilities, and some file transfer programs automatically compress, ship, and uncompress the file, without user intervention. The field of text compression is itself the subject of several books (for example, see [423]) and will not be handled in depth here. However, a popular compression method due to Ziv-Lempel [487, 488] has an efficient implementation using suffix trees [382], providing another illustration of their utility. The Ziv-Lempel method is widely used (it is the basis for the Unix utility compress), although there are actually several variants of the method that go by the same name (see [487] and [488]). In this section, we present a variant of the basic method and an efficient implementation of it using suffix trees.

Definition  For any position i in a string S of length m, define the substring Prior_i to be the longest prefix of S[i..m] that also occurs as a substring of S[1..i - 1].

Definition  For any position i in S, define l_i as the length of Prior_i. For l_i > 0, define s_i as the starting position of the left-most copy of Prior_i.

For example, if S = abaxcabaxabz, then Prior_7 is bax, l_7 = 3, and s_7 = 2. Note that when l_i > 0, the copy of Prior_i starting at s_i is totally contained in S[1..i - 1].

The Ziv-Lempel method uses some of the l_i and s_i values to construct a compressed representation of string S. The basic insight is that if the text S[1..i - 1] has already been represented (perhaps in compressed form) and l_i is greater than zero, then the next l_i characters of S (substring Prior_i) need not be explicitly described. Rather, that substring can be described by the pair (s_i, l_i), pointing to an earlier occurrence of the substring. Following this insight, a compression method could process S left to right, outputting the pair (s_i, l_i) in place of the explicit substring S[i..i + l_i - 1] when possible, and outputting the character S(i) when needed. Full details are given in the algorithm below.

Compression algorithm 1
begin
    i := 1
    Repeat
        compute l_i and s_i
        if l_i > 0 then
            begin
                output (s_i, l_i)
                i := i + l_i
            end
        else
            begin
                output S(i)
                i := i + 1
            end
    Until i > m
end

For example, S = abababababababababababababababab would be represented as ab(1, 2)(1, 4)(1, 8)(1, 16). That representation uses 24 symbols in place of the original 32 symbols, and if we extend this example to contain k repeated copies of ab, then the compressed representation contains approximately 5 log_2 k symbols, a dramatic reduction in space.

To decompress a compressed string, process the compressed string left to right, and assume inductively that the first j terms (single characters or (s, l) pairs) of the compressed string have been processed, yielding characters 1 through i - 1 of the original string S. The next term in the compressed string is either the character S(i), which can be output directly, or it is a pair (s_i, l_i) pointing to a substring of S strictly before i. Since the copy of Prior_i starting at s_i is totally contained in S[1..i - 1], any pair (s_i, l_i) in the representation points to a substring that has already been fully decompressed, so the next l_i characters can simply be copied. In either case, the algorithm has the information needed to decompress the jth term, and since the first term in the compressed string is the first character of S, we conclude by induction that the decompression algorithm can obtain the original string S.
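Compression algorithm 1 and its decompressor are short enough to state directly. The sketch below is a naive quadratic implementation of the variant described above (the suffix-tree implementation achieves linear time); positions are 0-based here, unlike the book's 1-based pairs, and each copy is required to lie wholly inside the already-processed prefix so that decompression can proceed left to right.

```python
def compress(S):
    """Parse S left to right; emit a raw character, or a pair (s, l) pointing
    to the left-most earlier copy S[s:s+l] of the longest prefix of the rest
    of S that is wholly contained in the part already processed."""
    out, i, n = [], 0, len(S)
    while i < n:
        best_len, best_start = 0, 0
        for s in range(i):
            l = 0
            # the copy must stay inside S[0:i]
            while s + l < i and i + l < n and S[s + l] == S[i + l]:
                l += 1
            if l > best_len:                # strict '>' keeps the left-most copy
                best_len, best_start = l, s
        if best_len > 0:
            out.append((best_start, best_len))
            i += best_len
        else:
            out.append(S[i])
            i += 1
    return out

def decompress(items):
    S = []
    for it in items:
        if isinstance(it, tuple):
            s, l = it
            S.extend(S[s:s + l])            # copy is already fully decompressed
        else:
            S.append(it)
    return "".join(S)

assert compress("ab" * 16) == ['a', 'b', (0, 2), (0, 4), (0, 8), (0, 16)]
assert decompress(compress("abaxcabaxabz")) == "abaxcabaxabz"
```

On (ab)^16 the parse doubles at every step, mirroring the logarithmic compression claimed above.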

Implementation using suffix trees

The key implementation question is how to compute s_i and l_i each time the algorithm requests those values for a position i. The algorithm compresses S left to right and does not request (s_i, l_i) for any position i already in the compressed part of S; the compressed substrings are therefore nonoverlapping. If each requested pair (s_i, l_i) can be found in O(l_i) time, then the entire algorithm runs in O(m) time.

Exploiting the fact that the alphabet is fixed, the O(l_i) time bound is easily achieved for any request. Before beginning the compression, the algorithm first builds a suffix tree T for S and then numbers each node v with the number c_v. This number equals the smallest suffix (position) number of any leaf in v's subtree, and so it gives the left-most starting position in S of any copy of the substring that labels the path from the root to v. The tree can be built in O(m) time, and all the node numbers can be obtained in O(m) time by any standard tree traversal method (or bottom-up propagation).

When the algorithm needs to compute (s_i, l_i) for some position i, it traverses the unique path in T that matches a prefix of S[i..m]. The traversal ends at point p (not necessarily a node) when i equals the string-depth of point p plus the number c_v, where v is the first node at or below p. At that point, the path from the root to p describes the longest prefix of S[i..m] that also occurs in S[1..i - 1], so s_i equals c_v and l_i equals the string-depth of p. Since the alphabet is fixed, the time to find (s_i, l_i) is O(l_i). Thus the entire compression algorithm runs in O(m) time.

A one-pass version

In the implementation above we assumed S to be known ahead of time and that a suffix tree for S could be built before compression begins. That works fine in many contexts. However, the method can also be modified to operate on-line as S is being input, one character at a time. Essentially, the algorithm is implemented so that the compaction of S is interwoven with the construction of T. The easiest way to see how to do this is with Ukkonen's linear-time suffix tree algorithm, which builds implicit suffix trees on-line as characters are added to the right end of the growing string.

Assume that the compaction has been done for S[1..i - 1] and that the implicit suffix tree I_{i-1} for S[1..i - 1] has been constructed. At that point, the compaction algorithm needs to know (s_i, l_i). It can obtain that pair in exactly the same way as in the above implementation, provided the c_v values have been written at each node v in I_{i-1}. However, the algorithm cannot traverse each of the implicit suffix trees to set those values, since that would take more than linear time overall. Instead, the values are set as nodes are created: whenever a new internal node v is created in Ukkonen's algorithm by splitting an edge (u, w), c_v is set to c_w, and whenever a new leaf v is created, c_v is just the suffix number associated with leaf v. In this way, only constant time is needed to update the c_v values when a new node is added to the tree. In summary, the one-pass version of Compression algorithm 1 can implement Ziv-Lempel in linear time.

The real Ziv-Lempel

The compression scheme given in Compression algorithm 1, although not the actual Ziv-Lempel method, does resemble it and capture its spirit. The real Ziv-Lempel method is a one-pass algorithm whose output differs from the output of Compression algorithm 1 in that, whenever it outputs a pair (s_i, l_i), it then explicitly outputs S(i + l_i), the character following the substring. For example, with this convention S = abababababababababababababababab would be compressed to ab(1, 2)a(2, 4)b(1, 10)a(2, 11). Certainly for compaction purposes this extra character is not needed and seems extraneous, and it is not completely clear why the Ziv-Lempel algorithm outputs it. Historically, it may have been easier to reason about the shortest substring starting at position i that does not appear anywhere earlier in the string (which is what the pair together with the extra character describes) than to reason about the longest substring starting at i that does appear earlier (which is what (s_i, l_i) alone describes).

7.18. APL17: Minimum length encoding of DNA

Compression has also been used to study the "relatedness" of two strings S1 and S2 of DNA [14, 324]. Essentially, the idea is to build a suffix tree for S1 and then compress string S2 using only the suffix tree for S1. That compression of S2 takes advantage of substrings in S2 that appear in S1, but does not take advantage of repeated substrings in S2 alone. Similarly, S1 can be compressed using only a suffix tree for S2. If the two strings are highly related, then both computations should significantly compress the string at hand. These compressions reflect and estimate the "relatedness" of S1 and S2. (Other, more common ways to study the relatedness or similarity of two strings of DNA are extensively discussed in Part III.)

Another biological use for Ziv-Lempel-like algorithms involves estimating the "entropy" of short strings in order to discriminate between exons and introns in eukaryotic DNA [146]. Farach et al. [146] report that the average compression of introns does not differ significantly from the average compression of exons, and hence compression by itself does not distinguish exons from introns. However, they also report an extension of that approach to be effective in distinguishing exons from introns.
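The relatedness computation can be imitated in a few lines. The toy sketch below (my own illustration, not the method of [14, 324] or [146]) greedily parses one string into the longest substrings that occur anywhere in the other, falling back to single characters; related strings yield fewer phrases, that is, better cross-compression.

```python
def cross_parse_count(target, reference):
    """Number of phrases when `target` is greedily parsed into longest
    substrings occurring anywhere in `reference` (a lone character is used
    when no match exists). Related strings yield fewer phrases."""
    i, phrases, n = 0, 0, len(target)
    while i < n:
        l = 1
        # extend the current phrase while it still occurs in `reference`
        while i + l <= n and target[i:i + l] in reference:
            l += 1
        l = max(1, l - 1)                   # last length that still matched
        i += l
        phrases += 1
    return phrases

related = cross_parse_count("ACGTACGTAC", "TTACGTACGG")
unrelated = cross_parse_count("ACGTACGTAC", "TTTTTTTTTT")
assert related < unrelated
```

Note that, like the relatedness measure in the text, this count is asymmetric: it uses substrings of the reference only, never repeats within the target itself.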

EXERCISES

42. It is estimated that a few thousand yeast strings were incorrectly identified in the larger composite databases, and incorrectly listed yeast strings complicate the kinds of large-scale computations discussed in this chapter. Moreover, more recent sequencing of yeast is of higher quality than the old sequencing, with perhaps one tenth of the errors. How would you go about cleaning up the data and organizing the sequence contigs? How should the new, higher-quality sequence be integrated into the existing composite databases? Grapple with this data curation problem. The tools discussed in this chapter should be of use in any large-scale clean-up, but the question is empirical rather than theoretical: no elegant worst-case result should be expected, and proposed methods will have to be tried out on real data to test their utility.

43. Show in detail whether and how a suffix array for a string can be used to efficiently solve the kinds of problems considered in this chapter. It is natural to first use the suffix array for a string to construct the suffix tree for that string. Give the details.

44. Prove the correctness of the method presented in Section 7.13 for the circular string linearization problem.

45. In Section 7.16 we used a suffix tree to implement a weak good suffix rule for a Boyer-Moore set matching algorithm. With that implementation, the increment of index i was independent of the alphabet size and could be determined in constant time after any test. Give the details of the preprocessing needed to implement the bad character rule in the set matching case, and consider how to extend the implementation to a strong good suffix rule. Can you keep the increment of i independent of the alphabet size?

46. In the Ziv-Lempel compression algorithm, the traversal used to compute (s_i, l_i) ends at point p when the string-depth of p plus c_v equals i, where v is the first node at or below p. Prove that the traversal ends at the correct point, and explain what would go wrong if the traversal were allowed to continue past that point. Also, try to give some explanation for why the real Ziv-Lempel algorithm outputs the extra character S(i + l_i), which Compression algorithm 1 does not.

47. Given a string S and a parameter k, define a k-cover to be a set C of substrings of S, each of length k or greater, such that S can be expressed as the concatenation of substrings of C; substrings of C may be used repeatedly, and their occurrences in S may overlap. Give a linear-time algorithm to find a k-cover of S, or determine that no such cover exists. If there is no k-cover, find the largest k' < k (if any) such that S has a k'-cover.

48. Continuing the previous problem: if there is no k-cover, one could instead find a set of substrings of S, each of length k or greater, that cover the most characters of S, or find a set of nonoverlapping substrings, each of length k or greater, that cover as much of S as possible. Try to give linear-time algorithms for these problems; they are harder.

49. In eukaryotic organisms (unlike most prokaryotic organisms, such as E. coli), a gene is typically composed of alternating exons and introns: the concatenation of the exons specifies a protein, while the function of the introns is unclear. Proteins are often built in a modular form, composed of distinct domains (units, or folds) with distinct functions, and the same domains are seen, in a variety of orders and combinations, in many different proteins. The theory of exon shuffling, which is still an unsettled issue, holds that new genes are often created by the reuse and reordering of exons from existing genes, exons ideally corresponding to individual protein domains; proteins whose genes were created in this way are called mosaic proteins. Given anonymous strings of DNA from protein-coding regions, one should therefore try to identify candidate exons by finding common or repeated substrings, each of length k or greater, that appear in two or more of the strings. Discuss how the tools developed in this chapter could be applied to this problem, and what their limitations are. (Little repetitive structure is seen in most prokaryotic organisms, but that does not mean these techniques have no application to prokaryotic problems.) Whether substrings found in this way are actual evidence of exon shuffling is a harder, empirical question.

Successive refinement methods

Successive refinement is a general algorithmic technique that has been used for a number of string problems [114, 199, 265]. The next several exercises develop it. To introduce the ideas, let S be a string of length n. The following relation, defined on pairs of suffixes of S, is central.

Definition  i E_k j if and only if suffix i and suffix j of S agree for at least their first k characters.

E_k is an equivalence relation, and so it partitions the suffixes (equivalently, the starting positions 1 through n) into equivalence classes. Since S has n characters, the following two facts hold:

Fact 1  Every class of E_n is a singleton.

Fact 2  Every E_{k+1} class is a subset of an E_k class, and so the E_{k+1} partition is a refinement of the E_k partition.

50. Verify Facts 1 and 2.

We use a labeled tree T, called the refinement tree, to represent the successive refinements of the E_k partitions as k increases from 0 to n. The root of T represents E_0, whose single class contains all n suffixes of S. Each child of the root represents a class of E_1 and, in general, each node at level k represents a class of E_k; the children of such a node represent the classes of E_{k+1} that refine it.

51. What is the relationship of T to the keyword tree (Section 3.4) constructed from the set of suffixes of S?

52. Now modify T as follows: whenever a node v' represents the same set of suffixes as its parent node v, contract v and v' to a single node. In the new tree T', each nonleaf node has at least two children. What is the relationship of T' to the suffix tree for S? Show how to convert a suffix tree for S into tree T', and how to convert T' into the suffix tree for string S.
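The E_k relation refines very mechanically, and the doubling step used by these methods (i E_2k j if and only if i E_k j and (i + k) E_k (j + k)) translates directly into code. The sketch below is my own illustration, with 1-based positions as in the text; a suffix shorter than 2k becomes a singleton, and classes are returned as unordered sets, ignoring the lexical ordering that the algorithms maintain.

```python
from collections import defaultdict

def e1_classes(S):
    """E_1: positions (1-based) grouped by their first character."""
    groups = defaultdict(list)
    for i, ch in enumerate(S, start=1):
        groups[ch].append(i)
    return {frozenset(v) for v in groups.values()}

def refine_double(S, classes, k):
    """E_2k from E_k: i E_2k j iff i E_k j and (i+k) E_k (j+k)."""
    n = len(S)
    cid = {}
    for c_index, cls in enumerate(classes):
        for i in cls:
            cid[i] = c_index
    groups = defaultdict(list)
    for i in range(1, n + 1):
        if i + k <= n:
            key = (cid[i], cid[i + k])      # pair of class ids, per the lemma
        else:
            key = ('short', i)              # suffix too short: singleton
        groups[key].append(i)
    return {frozenset(v) for v in groups.values()}

S = "mississippi$"
E1 = e1_classes(S)
assert frozenset({2, 5, 8, 11}) in E1       # the suffixes beginning with 'i'
E2 = refine_double(S, E1, 1)
assert frozenset({3, 6}) in E2 and frozenset({4, 7}) in E2   # 'ss' vs 'si'
```

Iterating refine_double with k = 1, 2, 4, ... reaches the all-singleton partition in a logarithmic number of rounds.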

the tandem repeats (In Section so comes before the class for the single m.1.fl.jand the classes i+kE.1 (page 40) om-meres. Instead. trom th. any numbers 01 this follows Giveacomp~eteproofofthe the E'k partinon Irom the 55.{2.8).6.S) remain E.2).7) were in the same reftnedin E. to refinement.i Lemma implies that{i.j+ of k lists all Cl = {JI contained S Is called maximal From Fact 2.1. simply and th e crasses are arranged of all are no additional Maximal interested before.partition. Processing the first and all the tandem used a successive arrays in a given string S.20. [2.g that the can be obtained it easy to identify order.are:{12).goflthm In correctness Ekparlition refinement approach of E".1 we will discuss and we allow the repeats the tandem in a string.7) and [3.. useful to use Lemma Lemma 7.) the tandem orby array 4). 112):12. The class in identifyingthemaximaltandemarrayscontainedinastring. Primitivetandemarravs Recall tandem that a string repeat.inlexicalorder are: 112).1).5).j) is a pm-triple. creates [3. These by the substrings in Ihe same represent.S. efficient We focus here on all the maximal to find all methods and implicitly to contain encode and to identify the most informative m lexical order of those characters. What of the integers 1 to n. Alt. tandem of {J before were initial'v atter defined of [308] starts In Sofone by computing character (3. suited is such [308] The advantage for parallel an algorithm. The pm-trlples triples k..8. is obtained in practice depends on the size of the alphabet with O(nlogn) by refining implement should character and the manner comparisons that it We use the pair ({J.{4. IS represented. and E. [114] (with different terminology) refinement method 10 lind all the pmlor each and hence Each class it creates of E2 holds £2 classes {3. First i. i .6. Be sure to detail how the classes Assume smgleton. Now consider bythepair(abababab. i and i + k are both contained in a single class E.(3. 57. 
To make the definitions concrete, consider S = mississippi$, where the end-of-string character $ is considered lexically smaller than any other character. E_1 has five classes, arranged in lexical order of the characters they represent: {12}, {2,5,8,11}, {1}, {9,10}, and {3,4,6,7}. The class for $ comes first, and the class for the i's comes before the class for the single m. The E_2 classes that refine {2,5,8,11} are {11}, {8}, and {2,5}, in lexical order of the substrings i$, ip, and is; the E_2 classes of {3,4,6,7} are {4,7} and {3,6}; the E_2 classes of {9,10} are {10} and {9}; class {1} remains a singleton; and {12} disappears, since no 2-length substring begins at position 12. Note that the classes of E_2 are again arranged in lexical order of the substrings they represent, and that each class of E_2 refines a class of E_1, as Fact 1 requires.

Several string algorithms use successive refinement without explicitly finding or representing all the classes in the refinement tree. The next exercises develop two of them: the construction of the suffix array of S in O(n log n) time, and Crochemore's method for finding all the tandem repeats in S.
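The refinement step of Fact 3 can be sketched as follows, assuming the E_k classes are given in lexical order. Because pairs of class ranks are sorted lexicographically, the E_2k classes come out in lexical order as well (the names are mine, not from the text):

```python
from collections import defaultdict

def ek_classes(s, k):
    """Brute-force E_k classes in lexical order (1-based positions)."""
    d = defaultdict(list)
    for i in range(1, len(s) - k + 2):
        d[s[i - 1:i - 1 + k]].append(i)
    return [d[sub] for sub in sorted(d)]

def refine(s, ek, k):
    """Obtain the E_2k partition from the E_k partition using Fact 3:
    i E_2k j iff i E_k j and i+k E_k j+k. Since the E_k classes arrive
    in lexical order, sorting pairs of class ranks leaves the E_2k
    classes in lexical order too."""
    rank = {i: c for c, cls in enumerate(ek) for i in cls}
    pairs = defaultdict(list)
    for c, cls in enumerate(ek):
        for i in cls:
            if i + 2 * k - 1 <= len(s):   # a 2k-substring starts at i
                pairs[(c, rank[i + k])].append(i)
    return [pairs[key] for key in sorted(pairs)]

s = "mississippi$"
print(refine(s, ek_classes(s, 1), 1))   # the E_2 classes of s
```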
The original suffix array construction of Manber and Myers [308] can be viewed as successive refinement. It starts by computing the E_1 partition: each class of E_1 lists the starting locations of one distinct character of S, and the classes are arranged in lexical order of those characters. The algorithm then computes the E_2, E_4, E_8, ... partitions, keeping the classes of each E_k in lexical order of the k-length substrings they represent. When every class is a singleton, the left-to-right order of the (singleton) classes describes a permutation of the integers 1 to n, and that permutation is precisely the suffix array of S. Note that the algorithm can stop as soon as every class is a singleton, and (assuming for convenience that n is a power of two) this must happen within log2 n iterations.

By Fact 3, the E_2k partition can be obtained from the E_k partition in O(n) time. It is not clear, however, how to implement a direct use of Fact 3 efficiently while keeping the classes in lexical order. Instead, a reverse refinement is used: rather than examining a class C of E_k to see how C is refined, each class C is used in turn as a refiner, to see how it forces other classes to split or to stay together. For each number i in C, locate and mark the number i - k; the numbers marked by C inside any single class of E_k then identify a class of E_2k, and if the refiners are processed in lexical order, the E_2k classes are created in lexical order as well. Explain this in detail, give a complete and efficient implementation, and be sure to detail how the classes of E_2k are kept in lexical order. Conclude that the suffix array of S can be constructed in O(n log n) total time.

56. In the algorithm as detailed in [308], only some of the classes are explicitly represented, so the use of space is reduced in practice (see also [116]). What is the space advantage of this method over an algorithm that first builds the suffix tree and then obtains the suffix array from it? Note that the reverse refinement method computes the suffix tree only implicitly. Which approach is best in practice depends on the size of the alphabet and the manner in which the string is represented.
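The whole construction can be sketched compactly in the spirit of the successive refinement just described. For simplicity this sketch uses Python's comparison sort on pairs of ranks in each round (so it runs in O(n log^2 n) rather than the O(n log n) achievable with a radix sort on the pairs):

```python
def suffix_array(s: str):
    """Suffix array by successive refinement (prefix doubling).
    Returns the 1-based starting positions of the suffixes of s
    in lexical order."""
    n = len(s)
    rank = [ord(c) for c in s]            # E_1: order by first character
    sa = sorted(range(n), key=lambda i: rank[i])
    k = 1
    while k < n:
        # E_2k order: sort by the pair of E_k ranks (Fact 3)
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for t in range(1, n):
            new_rank[sa[t]] = new_rank[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
        rank = new_rank
        k *= 2
        if rank[sa[-1]] == n - 1:         # every class is now a singleton
            break
    return [i + 1 for i in sa]

print(suffix_array("mississippi$"))   # → [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```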
Primitive tandem arrays

Recall that a string α is called a tandem array if α can be written as β^l for some substring β and some l ≥ 2 (when l = 2, α is also called a tandem repeat). The description of a tandem array by a pair (β, l) is not unique. For example, α = abababababababab can be described by (abababab, 2), by (abab, 4), or by (ab, 8). A string β is called primitive if β is not itself expressible as a tandem array; in the example above, only (ab, 8) describes α with a primitive β. A tandem array β^l contained in S is called maximal if there are no additional copies of β immediately before or immediately after it. The importance of tandem arrays and repeats was discussed earlier in the book. Since a string may contain a great many tandem arrays, it is often best to limit the size of the output by focusing on a specific subset of the maximal tandem arrays that implicitly and succinctly encodes all the tandem arrays; we focus here on efficient methods to find that subset. We use the triple (i, β, l), called a pm-triple, to specify a maximal tandem array β^l that begins at position i of S and whose β is primitive. Note that two or more maximal tandem arrays can begin at the same position: in mississippi, the two maximal tandem arrays ss and ssissi both begin at position 3, and the maximal tandem arrays in mississippi are described by the pm-triples (2, iss, 2), (3, s, 2), (3, ssi, 2), (6, s, 2), and (9, p, 2).

Lemma The pm-triples of S succinctly encode all the tandem arrays in S. This implies the very nontrivial fact that in any string of length n there can be only O(n log n) pm-triples, which limits the size of the output.

57. Prove the lemma. One direction is easy; the other direction is harder, and it may be useful to use Lemma 3.2.1 (page 40).

Crochemore [114] (with different terminology) used a successive refinement method to find all the pm-triples in O(n log n) time, and we develop that method here. The method finds the E_k partition for each k, and the following observation is central: there is a tandem repeat of a k-length substring β starting at position i of S if and only if the numbers i and i + k are both contained in a single class of E_k. More generally, a class of E_k that contains a maximal series of numbers i, i + k, i + 2k, ..., i + jk, in which each consecutive pair of numbers differs by k, identifies the maximal tandem array β^{j+1} beginning at position i, where β is the k-length substring starting there; the lemma above then makes it easy to identify the pm-triples among these. Give the implementation details, showing how such series are found efficiently as the classes are refined, and prove that the whole computation, over all k, can be carried out in O(n log n) time.