## Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

COMPUTER SCIENCE AND COMPUTATIONAL BIOLOGY

Dan Gusfield

University vfCalifomiu, Davis

UCAMBRIDGE

V

UNIVERSITY PRESS

Contents

Preface

xiii String Problem

I

Exact String Matching: The Fundamental

**1 Exact l\latching: Fundamental Preprocessing and First Algorithms 1.1 1.2 1.3 1.4 1.5 1.6
**

The naive method

Exercises

10 11 Methods 16 16 16 23 27 29 35 35 39 48

52

2 Exact Matching: Classical Comparison-Based 2.1 2.2 2.3 2.4 2.5 Introduction

**Real-time string matching
**

Exercises

3 Exact Matching: 3.1 3.2

A Deeper took at Classical Methods

;! ;::;;'~:~':i~;':~~':~;:

3.5

3.6

3.7

,""·,,,,o,;,.,,,ioo Rcgulare~prcssio[]patternmatching Exercises

6! 65 67 70 70 70 73 77 84

4 Scminumerical String Matching 4.1 4.2 4.3 4.4 4.5

The

match-count

problem

and Fast Fourier

Transform

Karp--Rabinllngcrprintrnelhod,forexactmatch Exercises

vii

viii II Suffix Trees and Their Uses 5 Introduction 5.1 5.2 5.3 5.4 to Suffix Trees 87 89 90 90 91 93 94 8.10 8.11

CONTENTS

ix 192 193 196 196 197 199 200 201 202 205 207 209 212 215 215 215 217 223 224 225 230 235 245 254 254 259 270 279 287 293 302 308 312 312 321 325 329

For the purists: how to avoid bit-level operations Exercises of Suffix Trees

9 More Applications 9.1 9.2 9.3

Amotivating cxample A naive algorithm to build a suffix tree of Suffix Trees

6 Linear- Time Construction

'.4

9.5 9.6 9.7 9.8 A linear-time solution Exercises Matching,

to

6.1 6.2 6.3 6.4 6.5 Practical implementation issucs 6.6 Exercises 7 First Applications 7.1 7.2 7.3 7.4 7.5 7.6 7.7 of Suffix Trees

107

'4

the multiple common substring problem

**115 116 116 119 122 122 123 124 125 125 127 129 132
**

135

III Inexact

Sequence

Alignment,

Dynamic

Programming

10 The Importance

of (Sub)sequence

Comparison

in Molecular Biology

11 Core String Edits, Ajjgnrnents, 11.1 11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9

and Dynamic Programming

introduction distance between two strinas

7.' 7.'

7.10 7.11 7.12 7.13 7.14 7.15 7.16

7.17

APL13:Suffixarrays-morespaceredudion APLI4: Surtixtrees in genome-scale projects

7.18 7.19 7.20

Additional applications Exercises Lowest Common Ancestor Retrieval

**135 138 143 148 149 156 157 164 167 168 168 181
**

181

Gaps Exercises

12 Refining Core String Edits and Alignments 12.1 12.2 12.3 12.4 12.5 12.6 12.7 ~1=2.~8 Exercises 13 Extending 13.1 13.2 13.3 13.4 the Core Problems

___

8 Constant-Time 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9

Introduction The assumed machine model COlflp1etebinarytrees:averysimp1ecase How to sctve rce quenesln zt FimstepsinmappingTtoB The mapping of T to B The linear-time preprocessing ofT Answering an lcaquery in constant time The binary tree is only conceptual

182 182 183 184 186 188 18' 191

Exercises - The Holy Grail

14 Multlple String Comparison 14.1 14.2 14.3

332 332 335 336

9 14.6 16.8 Exercise.ll string inference alignments (SP) objective functions function 341 342 17 Strings and Evolutionary 17.The Mother Lode 15.1 17.10 16. Shotgun Sequence Final sequencing bottom-up DNA assembly on top-down.2 15.1 18.8 15. Mapping.2 17.4 Success stories of database search 17.8 16.7 14.9 PROSITE BLOCKS and BLOSUM substitution matrices for database searching The BLOSUM Additional Exercises 15.12 considerations 381 385 385 386 387 391 19 l\"lodelsofGenome-Level 19.5 14.15 16.11 16.6 Multiple Introduction MUltiple Multiple Multip1e Comments Common Exercises sequence alignment alignment alignment multiple comparison with with for structu'.1 Introduction Genome rearrangements Mutations with inversions 492 493 19.5 17.4 17. and Cameos and Superstrings 393 395 395 395 396 398 16 Maps.4 16.7 Ultrametric Trees 447 449 to computing multiple the sum-of-pairs consensus tree 14.5 16.7 15. bottom-up sequencing sequencing: the 16. Cousins.14 picture using YACs sequencing 16.2 530 16.11 343 351 354 358 359 366 370 370 373 375 376 377 379 trees and ultrametric distances Additive-distance trees 15 Sequence Databases and Their Uses .3 18.7 16.3 16.2 18.10 on bounded-error approximations 14. 456 458 466 470 471 474 475 480 480 482 485 490 492 18 Three Short Topics 18.6 Molecular Exercises computation: computing with (not about) DNA strings 15.16 16.CONTENTS 14.2 498 499 501 505 Glossary Index 524 IV Currents.4 14.19 Exercises .6 17.18 398 401 406 407 4ll 412 415 415 416 420 420 424 comments Theshortestsuperstringproblem Sequencing by hybridization 425 437 442 16.10 15.3 15.12 Directed Top-down.11 15.17 16.4 The databaseindustry Algorithmic Real sequence FASTA BLAST PAM: the first major amino acid substitution matrices issues in database search search database 15. 16.13 16.1 15.5 15.8 objective to a (phylogenetic) alignment methods 14. Sequencing.1 16.9 16.3 17.

looking for new or illdefined patterns occurring frequently ill DNA.Preface History and motivation and and So without worrying much about the more difficult chemical and biological aspects of DNA and protein. Dali{ Naor.e. retrieving. comparing two or more strings for similarities. defining and exploring different notions of string relationships. looking for structural patterns in DNA and L Th< other \ong·{crm members ". our computer science group was empowered to consider a variety of biologically important problems defined primarily on sequences.e William Chang. and comparing DNA strings. Gene Lawler. and Frank Olken xiii . or (more in the computer science vernacular) on strings: reconstructing long strings of DNA from overlapping string fragments: determining physical and genetic maps from probe data under various experimental protocols. searching databases for related strings and substrings. storing.

particularly of ideas rather than of concrete methods. and protein). The use of sequence data is already central in several subareas of molecular. computational biology is hot. but didn't have. the computer scientist who wants to enter the general field of computational molecular biology. My general background was in combinatorial optimization. and many computer scientists are now entering the (now more hectic.xiv PREFACE protein. I would still choose to include a broad range of algorithmic techniques from pure computer science In this book. however. but as detailed in this book. and existing algorithmic methods used by biologists. should receive a training in string algorithms that is much broader than a tour through techniques of known present application. Molecular biology and computer science are changing much too rapidly for that kind of narrow approach. the problem of searching for an exact or "nearly exact" copy a given text. was Moreover. but faint. Matching problems are central. although I had a prior interest in algorithms for building evolutionary trees and had studied some genetics and molecular biology in order to pursue that interest 'What we needed then. and who learns string algorithms with that end in mind. B ut that is not enough Computer scientists need to learnfundamentul ideas and techniques that will endure long after today's central motivating applications are forgotten. finding conserved. computational molecular biology was a largely undiscovered area for computer sci- ence. No doubt.biology and the full impact of having extensive sequence data is yet to be seen. I cover a wide spectrum of string techniques . most of the available sources from either community focused on matching. Significant contributions to computational biology might be made by extending or adapting algorithms from computer science. and to anticipaled rather than to existing problems is a matter of judgment and speculation. Our "relative advantage" partly lies in the mastery and use of those skills. effective algorithms. What should they learn? The problem is that the emerging field of computational molecular biology is not well defined and its definition is made more difficult by rapid changes in molecular biology itself. theoretical computer scientists try to develop effective algorithms somewhat differently than other algorithmisrs. They need to study methods that prepare them to frame and tackle future problems and applications. treatment Why mix computer science and computational biology in one book? My interest in computational biology began in 1980. more competitive) field [280]. seminal papers on computational issues in biology (such as the one by Buneman [83 j) did not appear in mainstream computer science venues but in obscure places such as conferences on cum· putational archeology [226]. That side interest allowed me an occasional escape from the hectic. but had little exposure to specific string algorithms in biology. Potential application. Still. Therefore. lower bound arguments. We rely more beavily on correctness proofs. worst-case analysis. algorithms that operate on molecular sequence data (strings) are at the heart of computational molecular biology. Early on. they constitute only a part of the many important computational problems defined on strings. So even if 1 were 10 write a book for computer scientists who only want to do computational biology. but which might plausibly be of use either in current problems or in problems that we could anticipate arising when vast quantities of sequenced DNA or protein become available. and bounded approximation results (among other techniques) to guide the development of practical. although it was an active area for statisticians and mathematicians (notably Michael Waterman and David Sankoff who have largely framed the field). some of the material contained in this book will never find direct application in biology. when I started reading papers on building evolutionary trees. while other material will find uses in surprising ways. and more We organized uur efforts into two high-level tasks. But seventeen years later. along with a rigorous treatment of specific string algorithms of greatest current and potential import in computational biology. Certain string algorithms that were generally deemed to be irrelevant to biology just a few years ago have become adopted . we needed to learn the relevant biology. hyper competitive "hot" topics that theoretical computer science focuses on. Moreover.well beyond those of established utility. computer scientists need to learn the string techniques that have been most successfully applied. At that point. RNA. Our problem None of us was an expert on string algorithms. laboratory protocols. This is illustrated by several recent sublinear-time approximate matching methods for database searching that rely on an interplay between exact matching methods from computer science and dynamic programming methods already utilized in molecular biology. randomized algorithm analysis. Thux. At that point I had a textbook knowledge of Knuth-Morris-Pratt and a deep confusion about Beyer-Moore (under what circumstances it was a linear time algorithm and how to do strong preprocessing in linear I understood the use of dynamic programming to compute edit distance. patterns in many DNA and protein sequences. Hence. those techniques that seem to have the greatestpOfemial application in future molecular biology. First. This book is an attempt to provide such a dual. determining secondary (two-dimensional) structure of RNA. we recognized that summer a need for a rigorous and fundamental treatment of the general topic of algorithms that operate on strings. even when the original algorithm has no clear utility in biology. Getting sequence data is relatively cheap and fast (and getting more so) compared to more traditional laboratory investigations. algorithms that operate on strings will continue to be the area of closest intersection and interaction between computer science and molecular biology. Second we sought to canvass the computer science literature for ideas and algorithms that weren't already used by biologists. Certainly then. and integrated. The big-picture question in computational molecular biology is how to "do" as much "real biology" as possible by exploiting molecular sequence data (DNA. I have selected from the many possible illustrations.

it emphasizes the ideas and derivations of the methods it presents. although there are many biologists (and of course mathematicians) with sufficient algorithmic background to read the book. thc book covers all the major styles of thinking about string algorithms. rigorous text on deterministic algorithms that operate on strings. that string algorithms have been employed in molecular biology. I also expect that my (hopefully simpler and clearer) expositions on linear-time suffix tree constructions and on the constant-time least common ancestor algo- rithrn will make those important methods more available and widely understood. The Lawrence Berkeley Laboratory. It only lightly touches on probabilistic analysis and does not discuss parallel algorithms or the elegant. Many of those algorithms utilize trees as data-structures or arise in biologic alproblemsre1atedto evolutionary trees. hence the inclusion of "trees" in the title The model reader is a research-level professional in computer science or a graduate or advanced undergraduate student in computer science. It covers the full spectrum of string algorithms from classical computer science to modern molecular biology and. many complex methods that have previously been written exclusively for the specialist in string algorithms. I apologize in advance. With these exceptions. and space). I connect theoretical results from computer science on sublinear-time algorithms with widely used methods for biological database search. both unifying and simplifying the preprocessing tasks needed for those algorithms.xvi PREFACE PREFACE xvii by practicing biologists in both large-scale projects and in narrower technical problems Techniques previously dismissed because they originally addressed (exact) string problems where perfect data were assumed have been incorporated as components of more robust techniques that handle imperfect data. Throughout the book. trees. In the book Ltry to explain in complete detail. The book is intended to serve as both a reference and a main text for courses in pure computer science and for computer science-oriented courses on computational biology Explicit discussions of biological applications appear throughout the book. but I am much too new an arrival and not nearly brave enough to attempt a complete taxonomy of the field. therefore. Similarly.eduTgusficldistrpgms. It is the book I wished I had available when I began learning about string algorithms.ve been coded in C and are available at nt{p:llwwwcsif..h{ml . the preprocessing methods I present for Knuth-MorrisPratt. usually providing complete proofs of behavior (correctness. To better expose ideas and encourage discovery. and at a reasonable pace. to the many people whose work may not be properly recognized. In the discussion of multiple sequence alignment I bring together the three major objective functions that have been proposed for multiple alignment and show a continuity between approximation algonfhms for those three multiple alignment problems. The book contains some new approaches I developed to explain certain classic and complex material. Beyer-Moore and several other linear-time pattern matching algorithms differ from the classical methods.·i.e. with three exceptions. 1 consider several different ways to define repeated substrings and use each specific definition to explore computational problems and algorithms on repeated substrings. The reader who absorbs the material in this hook will gain a deep and broad understanding of the field and sufficient sophistication to undertake original research Reflecting my background. "rne book also does not cover srochasnc-orientedmethods that have come out of the machine learning community. this book is a general-purpose rigorous treatment of the entire field of deterministic algorithms that operate on strings and sequences. The National Science Foundation. rather than simply providing an inventory of available algorithms.ueda. I discuss a number of biological issues in detail in order to give the reader a deeper appreciation for the reasons that many biological problems have been cast as problems on strings and for the variety of (often very imaginative) technical way. However. worst-case time. inefficient version and then successively apply additional insight and implementation detail to obtain the desired result. I discuss many computational problems concerning repeated substrings (a very widespread phenomenon in DNA).. as I find it rarely serves to explain interesting ideas. The Program in Math and Molec. This book covers all the classic topics and most of the important advanced techniques in the field of string algorithms. many of the algorithm' in {he hook h.' and 1 provide over 400 exercises to both reinforce the material of the book and to develop additional topics. What the book is not pointed the reader to what I hope are helpful references. What the book is Following the above discussion. although some of the algorithms in this book are extensively used as subrools in those methods. I often present a complex algorithm by introducing a naive. More important. I avoid detailed code. Acknowledgments I would like to thank The Department of Energy Human Genome Program. the book rigorously discusses each of its topics. results on algorithms for infinite alphabets and on algorithms using only constant auxiliary space. In summary This book is a general. and sequences. the chapter on evolutionary tree construction exposes the commonality of several distinct problems and solutions in a way that is not well known. In particular. when appropriate. but very theoretical. but are more concentrated in the last sections of Part II and in most of Parts III and IV. connects those two fields.

Dalit Naor. Paul Stelling. Jane Guschier. Martin Tandy Warnow. Frank Olken. Elizabeth Sweedyk. Marci McClure. Martin Farach. Hershel Sater. for support of my work and the work of my students and postdoctoral researchers. Martin Tompa. Wilham Pearson. Dick Karp. and Lusheng Wang. Paul Horton. Tao Jiang. Sorin Istrail. David Axelrod. Archie Cobbs. Individually. Kevin Murphy. Gene Lawler. lohn Keoecioglu. John Nguyen. Gene Myers. Doug Brutlag. Snoddy. lim Knight. and The DIMACS Center for Discrete Mathematics and Computer Science special year on computational biology. Fred Roberts. R. Baruch Schieber. George Hartzell. Richard Cole. Russ Doolittle. Mike Paterson. Pavel Pevzner. Dina Kravers. PART! Exact String Matching: The Fundamental String Problem 1i 1 . Esko Ukkonen. Ron Shamir. Robert Irving.xviii ular Biology. Udi Manber. Ravi. lowe a great debt of appreciation to William Chang. I would also like 10 thank the following people for the help they have given me along the way: Stephen Altschul. Sylvia Spengler. and Mike Waterman. Gad Landau.

~h for att web sites that mention ··M. common database searching programs (for example. Note that two occurrences of P may overlap. if any. as illustrated by the occurrences of P at locations 7 and 9. often experience long.andrep-onedlllaltwenlytho". DNA database) for a thirty-character string. I am editing a ninety-page file using an "ancient" shareware word processor ami a PC clone (486). RNA.rkTwain"lOok. of pattern P in text T. In molecular biology there are several hundred specialized databases holding raw DNA. 7. Even grepping through a large directory can demonstrate that exact matching is not yet trivial.ond.. and in numerous specialized databases. Some of the more common applications are in word processors.. But the point is that the exact matching problem is not so effectively and universally solved that it needs no further attention. where it (and extensions) will be solved using suffix trees Basic string definitions We will introduce most definitions at the point where they are first used.fylhcqucry For another example ". The exact matching problem is solved for those applications (although other more sophisticated string tools might be useful in word processors). wilh Ie« lhan one minute u. Recently we used GCG (a very popular interface to search DNA and protein databanks) to search Genbank (the major U. in internet browsers and crawlers. which sift through massive amounts of text availableontheinternetformaterial containing specific keywords.? And Genbank today is only a fraction of the size it will be when the various genome programs go into full production mode. Chapter 3 looks more deeply at those methods and extensions of them.arch l<>okIe« lh"" len minutes. Although exact matching is the focus of Pan I.".e [W21 for other applications.i[e.000 processor Maspar computer). or Nex!s. the entire field of string algorithms remains vital and open. and analysis and proof techniques. That education takes three terms: specific algorithms.S. Users of Melvyl.. But the story changes radically I I just vi.. Lexis. and 9. an e-mail server is available for exact and inexact database matching running on a 4. Importance of the exact matching problem The practical importance of the exact matching problem should be obvious to anyone who uses a computer. the on-line catalog of the University of California library system. but style and proof technique get the major emphasis. and every exact match command that I've issued executes faster than I can blink.EXACT STRING MATCHING 3i Exact matching: what's the problem? Given a string P called the pattern and a longer string T called the text.e . All three are covered in this book. The search took over four hours (on a local machine using a local copy of the database) to find that the string was not tbere. in on-line dictionaries and thesauri. the exaet matching problem is to find all occurrences. in on-line encyclope dias and other educational CD-ROM applications. A . The All.. cranking out massive quantities of sequenced DNA. but several definitions are so fundamental that we introduce them now II I We taler repeated Ihe leM ". and taken for granted? Right now. Certainly there are faster. one might ask whether the problem is still of any research or educational interest.(i. over 21 billion wor.'" database contain. for example. The problem arises in widely varying applications. exploring methods for exact matching based on arithmetic-like operations rather than character comparisons. So is there anything left to do on this problem? The answer is that for typical word-processing applications there probably is little left to do. The exact matching problem will be discussed again in Part Il. and the education one gets from studying exact matching may be crucial for solving less understood problems. devoled 10 mo.. most of which wa.tininternet news readers that can search the articles for topics of interest.ooopioor. It will remain a problem of interest as the size of the databases grow and also because exact matching will continue to be a subtask needed for more complex searches that will be devised."od. frustrating delays even for fairly simple matching requests. if P = aba and T = bbabaxabahay then P occurs in T starting at locations 3. The ". some aspects of inexact matching and the use of wild cards are also discussed. or processed patterns (called motifs) derived from the raw string data.ctl by lhe''''lual Ie" ". in library catalog searching programs that have replaced physical card catalogs in most large libraries. Chapter 4 moves in a very different direction.iled the 1111" Vi'ta web poge maintained by the Digital fujuipmenl Corporation. which is a small string in typical uses of Genbank. Many of these will be illustrated in this book But perhaps the most important reason to study exact matching in detail is to understand the various ideas developed for it. in electronic journals that are already being "published" on-line: in telephone directory assistance.arch . using the fundamental tools developed in Chapter I. and there are faster machines one can use (for example. general algorithmic styles. Chapter 2 develops several classical methods for exact matching. Even assuming that the exact matching problem itself is sufficiently solved. in textual information retrieval programs such as Medline. Hasn't exact matching heen so well solved that it can be put in a black box. intheg iant digital libraries that are being planned for the near future. That's rather depressing for someone writing a book containing a large section on exact matching algorithms.h collected from over 10 mittion web "to. we will show at the end of Chapter 1 that the usc of fundamental tools alone gives a simple linear-time algorithm for exact matching. BLAST).'ement of text between the disk and the computer. Although the practical importance of the exact matching problem is not in doubt. too numerous to even list completely. and amino acid strings. Overview of Part I In Chapter I we naive solutions to the exact matching problem and develop the fundamental needed to obtain more efficient methods. in utilities such as grep on Unix. Although the classical solutions to the problem will not be presented until Chapter 2. especially those with cross' features (the Oxford English Dictionary project has created an electronic OED containing 50 million words). Some of these databases will be discussed in Chapter 15. For example. Vi.ing the Boyer-Moore algorithm on (lur own ra"'copy ofGenbank.

Problems on subsequences are considered in parts 11Iand IV.1. Shifting by more than one position saves comparisons since it moves P through T more rapidly. It then finds that the next seven comparisons are matches and that the succeeding comparison (the nimh overall) is a mismatch. we will also use "sequem:e" when Its standard mathematical rnearung IS Intended The first two parts of this book primarily concern problems on strings and substrings. after the ninth comparison. The naive algorithm first aligns P at the left end of T. Almost all discussions of exact matching begin with the naive method. until the left end of P is aligned with character 6 of T. and omia is a 1.. Therefore. a total of twenty comparisons are made by the naive algorithm A smarter algorithm might realize. Of course. At thai point it finds eight matches and concludes that P occurs in T starting at position 6. We w ill generally use "string" when pure computer science. using P '" abxyabx: and T = xabxyabxvabx: Note that an occurrence of P begins at location 6 of T. for clarity. y.1 gives a flavor of these ideas. but never shift it so an occurrence of P in T. This process repeats until the right end of P shifts past the right end of T. Figure 1. In addition 10 shifting by larger amounts. The naive method aligns the left end of P with the left end of T and then compares the characters of P and T left to right until either two unequal characters are found or until P is exhausted. some methods try to reduce comparisons by Skipping over parts of the pattern after the shift. and shifts P by one position. In either case. The naive method Definition For any string S. In this example. Terminology confusion acters in a subsequence might be interspersed with characters not Worse. P is then shifted one place to the right. cal is a prefix. in which case an occurrence of P is reponed. We will examine many of these ideas in detail. We will usually use the symbol S to refer to an arbitrary fixed string that has no additional assumed features or roles. Sri) denotes the ith character of S.1. Early ideas for speeding up the naive method The first ideas for speeding up the naive method all character when a mismatch occurs." u~ed in p~ace of "subsequence". and the comparisons are restarted from the left end of. in this book we will always rnamtam a distinction between "subsequence" and "substring" and never use "sequence" for "subsequence". However. in the biological literature one often sees the w~rd "sequence. and repeats this cycle two additional times. we will refer to the string as P or T respectively. It then shifts P by one place. 8) to refer to variable strings and use lower case roman characters to refer to single variable characters. issu~s are dis~us~ed and "" "sequence" or "string" interchangeably in the context of biological apph~at~ons. We will use lower case Greek char deters (ct. P. that the next three . and we follow this tradition. Exact Matching: Fundamental Preprocessing and First Algorithms For example. 1.EXACT STRING MATCHING 1 Definition suffix S[i .1. finds a mismatch. california i~ a string. when a string is known to play the role of a pattern or the role of a text. (3. j] is the empty string ifi > t. lifo is a substring. immediately finds a mismatch.

Hence it has enough information to conclude that there can be no matches between P and T until the left end of P is aligned with position 6 of T. we will use Z. Efficient implementations have been devised for a number of algorithms such as the Knuth-Morris-Pratt algorithm. Then we show how each specific matching algorithm uses the information computed by the fundamental preprocessing of P.. we highlight the similarity of the preprocessing tasks needed for several different matching algorithms. An even smarter algorithm knows the next occurrence in P of the first three characters of P (namely abx) begin at position 5. A comparisons of the naive algorithm will be mismatches. The preprocessing approach Many string matching and analysis algorithms are able to efficiently skip comparisons by first spending "modest' time learning about the internal structure of either the pattern P or the text T. are similar in spirit but often quite different in detail and conceptual difficulty. The copy of« that occurs as a prefix of Sisalsoshowninthefigure .r. Z6(5) Z7(5) =I (aa . This smarter algorithm therefore saves a total of six comparisons over the naive algorithm The above example illustrates the kinds of ideas that allow some comparisons to be skipped..) These preprocessing methods.:onsider the boxes drawn in Figure 1. In this book we take a different approach and do not initially explain the originally developed preprocessing methods. All of these algorithms have been implemented to run in linear time (O(n + m) time). = Z~(5) Z9(5) =2 (aab . In specific applications of fundamental preprocessing. and the Apostolico-Giancarlo version of it.5). r. and the next two illustrate smartershifls. This smarter algorithm skips over the next three shift/compares. Then since the first seven characters of P were found to match characters 2 through 8 of T. each box in the figure represents a maximal-length 1. 111he above example. This approach to linear-time pattern matching was developed in [202] 1. In addition to shifting by larger amounts. thus saving three comparisons.. we will see that certain aligned characters do not need to be compared.l: The first scenario illustrates pure naive matching.. after the left end of P is moved to align with position 6 of T. Fundamental preprocessing of the pattern Fundamental preprocessing win be described for a general string denoted by S.ab .2: Each solid box represents a substring of Sthatmatchesaprefixof San d thai starts between posnlona 2 and i. it has enough information to conclude that character a does not occur again in T until position 6 of T. is greater than zero. caret beneath a character indicates a match and a star indicates a mismatch made by the algorithm.Thenlidenotes the left end of ". (The opposite approach of preprocessing' text T is used in other algorithms. The length of the box starting at i is meant to represent Zi' Therefore. immediately moving the left end of P to align with position 6 of T.3. the t 21. (wh. During that time. this smarter algorithm has enough information to conclude that when the left end of P is aligned with position 6 of T. Preprocessing is followed by a search stage. a real-time extension of it... Each boxts caued a Z·box. I. and the even smarter method was assumed to know that the pattern abx was repeated again' starting at position 5. The result is a simpler mere uniform exposition of the preprocessing needed by several classical matching methods and a simple linear time algorithm for exact matching based only on this preprocessing (discussed in Section 1. Reasoning of this sort is the key to shifting by more than one character.. i Figure 1. the other suing may not even be known to the algorithm.aaz). This assumed knowledge is obtained in the preprocessing stage For the exact matching problem.. the algorithm compares character 4 of P against character 9 of T. Those methods will be explained later in the book. the Beyer-Moore algorithm.. but here we use S instead of P because fundamental preprocessing will also be applied to strings other than P. all of the algorithms mentioned in the previous section preprocess pattern P. Rather. the next three comparisons must be matches. in place of ZitS) To introduce the next concept. We use rj 10 denote the right-moslend of any Z-boxthat begins at ortothe left of p<lsition iand « to denote the substring inth eZ-boxendingatf.2. as originally developed.3. 25(S)=3 (aabc . Instead. This part of the overall algorithm is called the preprocessing stage. the algorithm knows that the first seven characters of P match characters 2 through Sof Fclfitalsc knows that the first character of P (name!ya) does not occur again in P until position 5 of P. although it should still be unclear how an algorithm can efficiently implement these ideas.2. by first defining a fundamental preprocessing of P that is independent of any particular matching algorithm. fUNDAMENTAL PREPROCESSING OF THE PATTERN T xabxyabxyabxz abxyabxz T' xabxyabxyabxz abxyabx~ T:xab"yabxyabxz P:ab><ya:oxz abxyabxz abxyab><z Figurel. such as those based on suffix trees. The details will be discussed in the next two chapters smarter method was assumed to know that character a did not occur again until position 5. = 0. How can a smarter algorithm do this? After the ninth comparison. S will often be the pattern P.). This smarter algorithm avoids making those three comparisons.). where the information found during the preprocessing stage is used to reduce the work done while searching for occurrences of P in T. Each box starts at some position j > 1 such that Z. When 5 is dear by context.EXACT MATCHING 1.

In fact.. before Z. FUNDAMENTAL PREPROCESSING IN LINEAR TIME substring of 5 that matches a prefix of 5 and that does not start at position one. starting from i = 2.4. then I.. say. suppose 5 = aabaabcaxaabaabcy. r . we have FiguraL3: String S[k. and so Z12 may be very helpful ill computing ZI2I' As one case. Zk' < 1. no Z-box has been found that starts . As an example. and Ij5 = 10 The linear time computation of Z values from 5 is the fundamental preprocessing task that we will use in all the classical linear-time matching algorithms that preprocess P But before detailing those uses.lnthis case. uFTIi3l k t ~ k )' k+Zk-1 Figure 1.I > I. similarly. Now assume inductively that the algorithm has correctly computed Z. Fundamental preprocessing in linear time Figure 1. If Z2 > 0. and rl211 130 and 112U 100. Z. I remain unchanged (see Figure IA) Ecd is correctly computed and variables rand P~OOF In Case I. Thus in this illustration. Ii is the position of the left end of the Z-box that ends at r.4. This case. and assume that the algorithm knows the current r = rk_1 and I = Ii-I' The algorithm next computes Zko r = r" and I = Ix The main idea is to use the already computed Z values to accelerate the computation of Zk. left to right. Hence the algorithm only uses a single variable. It follows that the substring of length 10 starting at position 121 must match the substring of length 10 starting at position 22 of 5. Z2 is the length of the matching string. More formally. the characters of 5[2 . That = = means that there is a substring of length 31 starting at position WOand matching u prefix of S (of length 31). can be chosen to be the left end of any of those Zcboxes.. for i up to k . if Zri is three. all the values Z2 through 2120 have already been computed. 212l call be deduced without any additional character comparisons. values for j = i . then r = r : is set to Z2 + I and I = 12is set to 2.EXACT MATCHING 1. As a concrete example.81 then Zk = Zk' and r...L No earlier r or I values are needed. will be formalized and proven correct below 2a. ISI] until a mismatch is found. Zk can be deduced from the previous Z values without doing any additional character comparisons.4: Case2a. the algorithm finds Z~ by explicitly comparing. but in any iteration i. to refer 10 the value. then ZIO = 7. along with the others. TheZalgurithm The preprocessing algorithm computes Z. then a little reasoning shows that ZI21 must also be three.rj is labeled fJ and also occurs starting at position k'of S We use the term Ii for the value of j specified in the above definition.Thelongeslslringslartingatk'thatmatchesaprsfixofSisshorterthanllll.5: Case2b The longesl siring slarling at k'that matches a prefix of Sisatl sasllPI. That is. is computed.bcx ending at r. Zk is set correctly since it is computed by explicit comparisons. To begin. we show how to do the fundamental preprocessing in linear time z. Also (since k > r in Case I). Otherwise rand / are set to zero. Zk=Zk' 1.r15 = 16. andtheupdatedrand discovered so far. 151] and 5[1 . r" and Ii for each successive position i. in some cases. If 2b. Therefore. suppose k = 121. the algorithm only needs the rj and I. In case there is more than one z. All the Z values computed will be kept by the algorithm. Each such box is called a Zbox. it I. and rhe current values otr and r.

SOT and I remain correctly unchanged. k+ ZI. and n ::: til. and it is correct to change r to k + 21. including problems that are not easily solved hy direct application of rhc fundamental preprocessing 1.. 1. give a linear-time preoede Character algorithm to determine whether a linear string about a is this to Exercise asubstr"lngofacircularstringj3. 151.AcircularstringollengthnisastringinwhichCharacter nls consldereoto 1 {see Figure 1. In Case 2b. Therefore. defabc is a Circular of abcdef.6. the algorithm correctly changes r and I. Given two strings 01 a and p. Why continue? Since function Z.. Moore.· must match character 1 to the definition of Z". 1.1 and that ends at or after position k. character k + 2." must match character I + z. PROOF The time is proportional to the number of iterations. we show that fundamental preprocessing alone provides a simple linear-time exact matching algorithm This is the simplest linear-time matching algorithm we know of.1. examining only a fraction of the characters ofT. Similar solution. Hence Zk is correctly obtained in this case. ::: n for every i > I. Therefore.4. real-time matching. There is another characteristic of this method worth introducing here: The method is considered an alphabet-independent linear-time method. then must be to 11. Repeating Algorithm Z for each position i z. for it would establish a substring longer than 2k• that starts at k' and matches a prefix of S.4. we only need to record the Z values =0 =0 the proper implementation) it also runs in time but typically runs in silblinearlime.(5) values are computed by the algorithm ill 0(151) lillie. 5 = PST has length n + til + I = Oem). The method can be implemented to use only O(n) space (in addition to the space needed for pattern and text) independent of the size of the alphabet. when Zk > 0 in Case I. 0 Corollary 1. Aho-Coraslck. Another way to think . then the next character to the right. thatis. . suffix tree etc. But character k + Z. Methods based on suffix trees typically preprocess the text rather than the pattern and then lead to algorithms in which the search time is proportional t? the size of the pattern rather than the size of the text. > for the n characters in P and also maintain the current I and r. Hence it is the preferred method in most cases. Exercises The first time lour exercises use the lact that fundamental be found prOceSSing in linear time. So. position k' (determined in step 2) will always fall inside P. matches character k' + z. the algorithm does find a new Z-box ending at or after k. Z.5.I :::: r.(5) values can be computed in O(n + m) = the occurrences of P inTin O(m) time. if P occurs in T starting at position j of T. Compute Z. Recall that P has Icngthn and T has length m. we never had to assume that the alphabet size was finite or that we knew the alphabet ahead of time . it needs no further information about the alphabet. The simplest linear-time exact matching algorithm Before discussing the more complex (classical) exact matching methods. can be computed for the pattern in linear time and can be used directly to solve the exact matching problem in O(m) time (with only O(n) additional why continue? In what are more complex methods (Knuth-Morris-Pratt.10 EXACT MATCHING 1. This is an extremely desirable feature. Let 5 PST be the string consisting of P followed by the symbol "S" followed by T. Any value of i > n + I such that Z. values. For example.1. the substring beginning at position k can match a prefix of S only for length ZI.2. III Case Za. Each comparison results in either a match or a mismatch. 1. Since Zi S n for all i. If not.) deserving of attention? 2 correctly yields all the Theorem 1.(5) n identifies an occurrence of Pin T starting at position i . EXERCISES 11 between positions 2 and k . That is. Furthermore. plus the number of character comparisons. Since all the Z. very elegant 2. We will see that this characteristic is also true of the Knuth-Marris-Prall and Beyer-Moore algorithms.ifaandfihavethesamelengthandaconsistsofasuffixofj3foilowed prefix wlth a fl. there is no need to record the Z values for characters in T. Conversely. to solve the following (or cyclic) problem of bya This is a classic problem can be done In linear and that all occurrences of Pin T can 1.' < lfil.6.(5) for i from 2 to n + m + L Because "S" does not appear in P or T. f3 must be a prefix of 5 (as argued in the body of the algorithm) and since any extension of this match is explicitly verified by comparing characters beyond r to characters beyond the prefix /3. but not of the Aho-Corasick algorithm or methods based on suffix trees. Hence the algorithm works correctly in Case I. Instead.a character comparison only determines whether the two characters match or mismatch. suffix trees can be used to solve much more complex problems than exact matching. where "S" is a character appearing in neither P nor T. Use the existence of a linear-time exact matching in linear time. . All the Z. Hence Zk = Zk' in this case. determine rotation algorithm if a is a circular rotation p.I < T.5. Further. so we next bound the number of matches and mismatches that can occur. since k + Z! .6). the full extent of the match is correctly computed. Those values are sufficient to compute (but not store) the Z value of each character in T and hence to identify and output any position i where Z.1. The Apostolico-Giancarlo method is valuable because it has all the advantages of the Beyer-Moore method and yet allows a relatively simple proof of linear worst-case running time. Moreover. character k' + Z. = n.(n + I) of T.

A substring then III contained than in string 5 is called copy is a tandem a tandem be extended a ta.andoutputsthepair(S.. string 13. a naive algorithm O(ri}-timebound. to rotations character comparisons case that Z k' = ItJl.. DNA are mostly exercises circular obtained additional comparisons executing and we don't know is typically whose circular. that parameter 5. Then is a substring fi tt and only ifft rotalion of $.. coli replication. Nucleotide if the DNA of two individual rotation. III tools for handling will be discussed matching.k)foreachoccurrence.) is opposite a matching token in a parameter s. Tandem run in O(n+m) arrays.Sincemaximal of a given base can overlap.. of '" if 5 = the base) if '" consists a tandem array of more III one consecutive xyzabcabcabcabcpq. case alphabet of one of its ptasmlds. finds would establish every maximal tandem on Iyan tandem arrays and ending n. 5 has length n. we want to point outthat circular DNA is common in its genomic such and important. remain solve their matching mechanisms the between parameter 5. complementary Thus the problem system. The linear string jiderived from it is accalggc. any critical both problems that they address. and parameter compucompuuses of demonstrate important other names in S:1 Eh = XYdxCdCXZccxW. strings They undetermined. are added rotation to F coli containing 0fpthenthestrandopposite with 13creating whether a proper of DNA may be a step III Now we present in common. do. Figure 1.12 EXACT MATCHING and 1.6. DNA molecules cells contain DNA. (or $. however. 6. but even when of the order rapidly 7. one for speedup of the algorithm? Z". Case 2b can be refined Z".g. Eh. circular strings but do not suggest indirect as well as linear molecular experiments immediate strings. Suffix-prefix Give an algorithm in two strings and f3. to Ci) hybridizes This transfer of determining XYabCaCXZddW and only if Over Land Eh are double-stranded in the replication rctaticn Other circular of c as a single strand. but parameterized of code means names (e.andthe used for such rapid matching role of enzyme tational tational circular problems.. giving array its starting a is an exampte'ot is a tandem f3. Consider becomes: the special this case of Exercise problem the problem as in Exercise Is a a circular experimental called rotation conditions (for energy) If(¥ is a circular of $? This problem arises n a parameterized of one section tetraosrormeo into the other the parameter in linear in E. array that cannot of times 7.smids. = I. Baker[43] Z algorithm is split into two cases. would this result in an overall comparisons You must c onsideralloperations.the algorithm way to organize an unneeded no character only inthe mitochondrial double-stranded (higher addition DNA and in additional contain plasmid This is a reasonable so as 10 eliminate = 1131 nd hence a are needed the algorithm. A string example. In Case 2b of the Z algorithm. all the values character Z~ . linear DNA molecule.For example. names of the other via aon e-to-onelunction" Let Land n be two alphabets and parameters Fl. solved leaving of the plasmid. suffix of (1 that exactly The algorithm should 4. Viral DNA circular mutations. can consist of any combinations is a legal one strand of which is called string (¥ is the uppercase and n is the lower (which has the DNA sequence plasmid.M'. main body of Algorithm Z. ratherthanadditiona viruses have mutated introduced 'The the following application match matching between problem sections and applied it to a problem 01 software system. coli a single by the parameter definition. a part of the program that cannot be changed. Interestingly. Give an example to show that two maximal tandem arrays of a fj be lhe linear string ft obtained of circular from {Jstartingalcharacter1 string given bese s cen overtep.Zq_'.. application. a tandem of /3 in 5 can be described (see Section by two numbers location repeated Suppose problem substring is the following.and one for Z". p-ma/chif the formal (ReeA) and ATP molecules Each symbol if:E in :E is called a token and each symbol of tokens alphabet English string in n is called a paramefer. k). section We can matches and con- from each otheronlybya I mutations are actually 2 when stnnq c has length is solved maintenance: where stants) is to track down duplication between by replacing two sections in a large software tha tone want to find not only exact matches of code. identifiers Itisveryinterestingtonotethattheproblemsaddressedintheexercises "solved" Then lime in nature. provide RecA in E. a nucleus) and even some true eukaryotes as yeast circular strings may someday it is linear the and away in another character organisms Z. algorithm. For then said to 1. II Case 2b olthe not just character 8. a token represents . > I. in addition in some viral populations . of lengths n il I For example. (or $. at character of soma circular Let in 5 and the number substring /3 is repeated.· > IfJl then Zk comparisons to their nuclear tools for handling is not always For example. Each parameter s. (or $... Each token in $. Precisely matching and is "solved" repiication an enzyme strand to underthecertain descri bed in [475]. array with a longer base).tv for developing Several bock.13 and that takes 3. maps not violate ot p-matching. Two strings 5. Consequently. maps problemsfasterthancanbeexplainedbykineticanalysis. and a double-stranded (1.ndem array of 13(called of /3. are needed. = names in $. of the DNA in one individual problem is to determine will be a Circular rotation [450]. crme motivation ts. containing from and no symbols Land Fl. This does In Baker's XYabCaCXZddbW to parameter the definition c-matches d in Notice to c in a in 5. can then be the A digression on circular strings in DNA in using the existence biological 0 fa linear-time Bacterial exact and small DNA in 5. be of use in those organisms. while parameter d in 5. properties. coli. circular called pla.occur in viruses.6: A circular string fj. Zq.) jlis by this natural in $. in Sections 1. ::: 1/31.1). II the Z algorithm immediately findS that without Z2 = q > 0. (or b.) is opposite in $.. is a Now give an O(n)-time algorithm that takes 5 and 13as input. Given the base (5. and finds the longest time. is a Circular a. Note that 5 array of abcabc{Le. arrayoffJ. until it finds when expllclt a mismaich. some virus genomes linear order individual a plausible exhibit circular.11.M.2 and without The above two problems matching However. comparison. In that experiment. A maximal either A tandem left 0 r right. Flesh out and justify the details when 01 this claim does explicit comparisons but in fact Argue that Therefore.) experiments in [475] can be described as substring these natural These matching systems problems relating to and linear DNA in E. EXERCISES 13 matches a prefix of I~' m. also contains tandem array = abcabcabcabc array =- abc.

second. expensive. Z 10 compute more involved algorithm all the 2. mand of Characters all occurrences the variable making of the variable afe renamed 5. three A gene. Consider dimensional isto acters although then find all substrings in T of length 151 that are for the amino the amino acio sequence acid character This is done as follows. be done three formed times to get the proper frame) to form the string acids in 5. The third Glyand frame has two complete the three gga and cgg. specifying s. problem?) if there three but three viable one is given a DNA string of That is. For example. the length in T starting Z to lind all substrings more involved versions P. find all substrings in 0(1 PI + I TI) of T that c-matcn time. cedens. The problem at position of S{Li].Sincethereare43 two or more codons. That consecutive is. you don't into a different a tunctional. organism after a large amount of the genome of that organism where for a given protein acid in or the The following fancy tools. The input consists of characters. lor each formed lrom In [43] and [239]. b. then the DNA string long. zt for a siring a prefix Of course. Allow the encoding code and the abitityto of 5.. let i in S that P. locationotthegenethatcodesfortheprotein. For example.. But by better easy. the number from of opposing that match k are aligned. protein.Forapairofsubstrings. also locates sequence The match 12. HInt: helpful. filled-in crossword and a rnultiset problem The problem all tile charthe structure acid Phenylalanine in the single acid vauoe acids. Now suppose One can use that of interest. amino the amino find a connected in the multiset. each a string over 1: 11. complex methods 5 sequencing.14 whereas a parameter names represents in 5.64 triples In fact. two-dimensional in the text that matches as V). a. problem. Each such -eameennr are actuatly genes for each the a string in the amino Fo rexampie. A DNA molecule (nucleotides). is to restrict of two- acid. gac. steps (when USing the codons Clearl y. were part of a larger program.e . three problems require of can be solved without the Z algorlthm or other Fantasy protein has been is encoded protein from They only thought. when the two for each can be solved (From Paul 9. The method of the longest substring should also be able to ati that can be i. and Gly. acters encodes single (amino can bethoughtotasastringoveranalphabetolfourcharacters can be thought which fora of as a string over an alphabet embedded protein. coding for amino acids using the lewest number The goal is to produce translations. whereonesubstringisfromonestringandoneisfromtheother. in any reading-frame of the genetic of that string).. As mentioned intervals a different encoding compact in Exercise frame. and . coding Trp and Thr. In prokervotes.lneach pairs of substrings. of T that p-match of the p-match characters 5 to this problem position that runs in O(m) time.P to find atl substrings nthat by the characters is a substring of S. of r rormeo Irom the in 0(151) time (the implementation Then show 5 = {a. So. is physically particular every three string. Asp." n characters. then chemically typically Starting encode specify of cedens a the multi set 5 01 amino acids in the protein.chemicaliy acid sequence finding that the whole long DNA translate ot a protein DNA sequence is very slow. Problem: cedens where Suppose frame". This The above is most segment easily problem explained the amino .thethreetranstations steps in the in at most a program's be changed \I these variable. could EXACT MATCHING r. there are organisms not just a single protein. but not too difficult). The second for amino acids of the degeneracy frame has two complete tgg and BCg. acid string. il each protein isoniy n+2nucleotides (some viruses amino for example) proteins. Any such substrings it is unlikely that there T. and EXERCISES 15 of indexing code). that does the three n character then they couio both be replaced The most basic occurrences string slightly modified starting algorithm p-match problem by a call to a single is: Given function c-matcres values a text T P. operat ions). but each specifies of the three string which reading atggacgga. amino has three acids (over the amino Determine specifying string of lengthnandanotherprotein bet. Such a coding (abbreviated a codon. not interrupted the protein. complete substring because a DNA encoding of $. character and the full association the codon amino acid alphabet the two-dimensional text (say a of the preceding puzzle) substructure aCids is called the genetic gtrcodesforthe triples for the same amino amino For example. substrings k. acid alphaconsider (ChallengIng string m> 5. characters of length each and an additional parameterk. Find a method and nindexlnq steps wilh this terminology.) 10. one would like to find all those 01 metonqest how to modify details are how to use the Show it is not necessary.Presently. from the amino of the protein. is called code. in a contiguous So assume of DNA for each amino determining we have the protein but do not know its sequence and somewhat is relatively unreliable.. If you are acquainted although Ihe notion 01 a finite·s tate naosoucer may be is and n. in the genetic Met.. Then than for function Zj. let 5 be a mu/lisetof are formed caba S be the length Let Tbe let a text string of length in Toflength T. Th e task is to produce. triple wilt b e more than one candidate. of the DNA string and the fewest number acids in a table holding with 3n examinations table.! c-match. in a DNA molecule. we define the march-countas substrings of the 8(ff) with O{kff} organizing Horton. (i. is COded for by six different of DNA encoding reading those encoding.one molecule may become in prokaryotes. That is a very rect "reading potentially if the correct nucleotide acid string. cooons. in the protein the gene for the protein variant in the long DNA string. do not form How can this be done? A simpler but only twenty is a posslbil itythat to be rectangular 13. Give an algorithm acid alphabet) is a string lor the following a DNA problem. the two strings. know amino n nucleotldes. DNA characters DNA nucieotides Itt codes lao of twenty of length pairs of substrings operations (character the computations. 10. g} char- in the DNA sequence acid alphabet while a protein acids).orthird translates each of the three reading reading frames of the string. to begin at any point The problem use any reading The first reading code specify codons. (abbreviated there lor the amino as F). (this may have to determine acid the time can be reduced to 0(11") of the organism is known.) decomposition {There proteins (characters) begins with thelirst. First. reading Arg in the DNA for S. containing each read in form codons acid and that some triples acid Leucine this is the case. C. alg. st ring of length that contains is difficult frame a the associated lrame the amino codons. in 0(1 PI + I Til time problem are solved by more Give a solution state. c) and of abahgCEJbah. but you don't know the corof the string known into contains n acids. which can be renamed as long as then in nations Thus if $. for the genome the amino the multiset sequence each codon of amino acids that make up the protein to determine acid sequence into the amino the amino However.and so there are 8(ff) characters Clearly. and gga. 01 character examl- . You are given two strings string there are n-k+1 n sequenced. aero sequ ence DNA codon useful in sequencing the DNA is a particular by introns. of a protein s.s. n. consistently.yet frames. in the DNA and 3n indexing translations the two programs identical. The problem comparisons is to compute plus arithmetic the match-count the problem operations. encoding The input is a protein for not only specifies different. are candidates at a particular amino one amino to amino the codon possible point in the DNA string.. names to look up amino can be done genetic code examinations Ihe genetic to the corresponding two programs subroutine and a pattern variable S:!.

we emphasize the Beyer-Moore method over the Knuth-Morris-Pratt method. T(5) = I mismatches with P(3) and R(f) = 1. erc. all of these algorithms can be implemented (0 run in linear worst-case time. Bad character rule As in the naive algorithm. in contrast to previous expositions.2. Further. then the worst-case running time of this approach is O(nm) just as in the naive algorithm. so P can he shifted right by two positions.actmatching. However.1.2.1. the bad character shift rule. We next examine these two ideas. The point of this shift rule is to shift P by more than one character when possible. with two additional ideas (the bad character and the good suffix rules).2. and all achieve this performance by preprocessing pattern P. shifts of more than one position often occur. 10 implement the needed preprocessing for each specific matching algorithm Also. The Boyer-Moore Algorithm To check whether P occurs in T at this position.\1 17 2 Exact Matching: Classical Comparison-Based Methods algorithm. consider the alignment of P against T shown below' 12345678901234567 xpbctbxabpqxctbpq tpabxab 2. and the good suffix shift rule. it then compares T(8) with P(6).2. the Beyer-Moore algorithm successively aligns P with T and then checks whether P matches the opposing characters of T. partly for historical reasons. if P is shifted right by one place after each mismatch. So at this point it isn't clear why comparing characters from right to left is any better than checking from left to right. With suitable extensions.2. the Beyer-Moore algorithm contains three clever ideas not contained in the naive algorithm' the right-to-left scan. (Methods that preprocess T will be considered in Part II of the book. For example. developed in the previous chapter. Rlght-to-Ien scan by one position. However. the Beyer-Moore algorithm starts at the right end of P. Together. In the above example. Aftcrthe shift. but mostly because It generalizes to problems such as real-rime string matching and matching against a set of patterns more easily than Boyer-Moore does These two topics will be described in this chapter and the next 2. At that point P is shifted right relative to T (the amount for the shift will be discussed below) and the comparisons begin again at the right end of P Clearly. the comparison of P and T begins again at the right end of P . after the check is complete. first comparing T(9) with P(7). THE BOYER-MOORE ALGOR1TH.) The origi~al preprocessing methods for these various algorithms are related in spirit but are quite different III conceptual difficulty. Some of the original preprocessing methods are quite difficult.. and in typical situations large shifts are common. or after an occurrence of P is found. Finding a match. moving right to [eft until it finds a mismatch when comparing T(5) with P(3). I This chapter does not follow the original preprocessing methods but instead exploits fundamental preprocessing. 2. thcse ideas kad to a method examines fewer than m + /I characters (an 2. Knuth-Morris-Pra~t is nonetheless completely developed. P is shifted right relative to T just as in the naive algorithm.2. Introduction This chapter develops a number of classical comparison-based matching algorithms for the exact matching problem. since Beyer-Moore is the practical method of choice for eX.

and all of P is shifted past the r in T. the preprocessing for the strong rule was given incorrectly in [278] and corrected. The (strong) good suffix rule The bad character rule by itself is reputed to be highly effective in practicerparticularly for English text [229]. so z has a chance 0tmatchingx good suffix rule. many similar.i. If no such shift is possible.! Pascal code for strong preprocessing.2. If an occurrence of P is found.met newsgwup compo thenry .3. One could also do binary search on the list in circumstances that warrant it 123456789012345678 T: prstabstubabvqxrst qcabdabdab 1234567890 When the mismatch occurs at position 8 of P and position \0 of T. For that. resulting in the followingalignmcnt 2. However. That is the approach we take here. shift P past I in T For a specific example consider the alignment of P and T given below T is x. In contrast. THE BOYER_MOORE ALGORITHM [9 Extended bad character rule The bad character rule is a useful heuristic for mismatches near the right end of P. which is roughly the number of characters that matched. in [384]. the extended rule can be implemented to take only O(n) space and at most one extra step per character comparison. then shift P by n places to the right. That situation is typical of DNA. The simpler rule uses only O(l"EI)space("E is the alphabet) for array R.. the fundamental preprocessing of P discussed in Chapter I makes the simple. That amount of added space is not often a critical issue. If there is none then there is no o~currence o~ x before. this approach at most doubles the running time of the Boyer-Moore algorithm. scan . Hence P is shifted right by six places. and one table lookup for each mismatch. that is. but not exact.. without much explanation. The strong good prefix of the shifted pattern matches a suffix oft in T. substrings. As we will see. In such cases. which has an alphabet of size twenty. the found entry gives the desired position of x. In fact. The original Beyer-Moore algorithm only uses the simpler bad character rule. Implementing the extended bad character rule P before shift P after shift Figure 2. then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T...1: Good suffix shift rule. Characters yandzof Pare guaranteed to be distinct by the good sultix rule. then shift P to the right so that the closest x to the left of position i in P is below the mismatched x in T.. where character x 01 T mismatches with Character y at P. the following extended bud character rule is more robust: When a mismatch occurs at position i of P and the mismatched character in T is x. So in worst case. Otherwise. based on an outline by Richard Cole [lOn. Because the extended rule gives larger shifts. in most problem settings the added time will be vastly less than double.is EXACT MATCHING:CLASS!CAL COMPARISON-BASED METHODS 2. After a mismatch at position i of P the lime to scan the list is at most n . but proves less effective for small alphabets and it does not lead to a linear worst-case running time. but it has no effect If the mismatching character from T occurs in P to the right of the mismatch point This may be common when the alphabet is small and the text contain.2. The original preprocessing method [2781 for the strong good suffix rule is generally considered quite difficult and somewhat mysterious (although a weaker version of it is easy to understand).c's list from the top until we reach the first number less than or discover there is none. which has an alphabet of size four. we introduce anOther rule called the strong II l A Tocent plea appeared on the . and even protein. often contains different regions of high similarity. Code based Oil [384] is given without real explanation in the text by Baase [321. the only reason to prefer the simpler rule is to avoid the added implementation expense of the extended rule. but there are no published sources that try to fully explain the merhod. I '" ah and I' occurs in P starting at position 3. is shown in Exercise 24 at the end of this chapter. then shift P by n places.i. but it is an empirical question whether the longer shifts make up for the added time used by the extended rule.

if P 123456789012345678 T: prstabstubabvqxrst qcabdabdab Note that the extended bad character rule would have shifted P this example 'theorem by = cabdabdab. if one exists.1). (Exercise 9 of this chapter) l'(i) . Definition Let l'(i) denote the length of the largest suffix. P[i . since it would imply either that a closer copy of t occurs in P or that a longer prefix of P matches a suffix of t. if P' denotes th~ string obtained by reversing P. the weak rule is not sufficient. then L(R) = 6 and C(ll) = 3 Final preprocessing detail The preprocessing stage must also prepare for the case when L'(i) = 0 or when an occurrence of P is found. 2. version of the good suffix rule.2.3. lfnone exists. Preprocessing for the good suffix rule We now formalize the preprocessing needed for the Beyer-Moore algorithm.nJI.4. THE BOYER-MOORE ALGORITHM 21 = For example.20 EXACT MATCHING:CLASSICAL COMPARISON-BASED METHODS 2.2. and suppose that the good suffix: rule shifts P so its right end aligns with character k' > k Any occurrence of P ending '<It a position I strictly between k and k' would immediately violate the selection rule for k'.of of P. then NJ(P) = 2 and Nb(P) 5.2. L'(il]: Theorem 2. The above method correctly computes the L values 2. n] that is also a prefix I 2. N is the reverse of Z.4. [J The original published Beyer-Moore algorithm l75 J uses a simpler.. The use of the good suffix rule never shifts P past all occurrence in T Given Theorem 2. then lct/'(i) be zero Theorem Nj(Pl=j.2. weaker. if P = cabdabdab. In the previous example. Nj(P) is the length of the longest P[I.. as well as the problem of how to accumulate the linear time. Clearly.2. For example.1.2. that is. and in th is book the strong version is assumed unless stated otherwise L(2):= L'(2): := for i 3 to n do L(i):= max[L(i . then Nj(P) = Zn-hl(P'). as a simple exercise. For the purpose of proving that the search part of Beyer-Moore runs in linear worst-case time. Hence the Nj(P) values can be obtained in O(n) time by using Algorithm Z on P'. The following theorem is then immediate.. An explicit statement of the weaker rule can be obtained by deleting the italics phrase in the fi rsr paragraph of the statement of the strong good suffix rule. The following definition and theorem accomplish that. suffix of the substring II We leave the proof. such that values in Definition For string P. it follows immediately that allthe C(i) values can be accumulated in linear time from the N values using the following algorithm Z-based Boyer-Moore PROOF Suppose the right end of P is aligned with character k of T before the shift. I'(i) equals the largest j ::: IP[i . That version just requires that the shifted P agree with the t and does not specify that the next characters to the left of those cccurrences or r be different. only one place in Recall that Z.2.j] that is also a suffix of the full string P. the weaker shift rule shifts P by three places rather than Six. we will call the simpler rule the weak good suffix rule and the rule slated above the strong good suffix rule.(S) is the length of the longest substring of 5 that starts at i and matches a prefix of S. When we need to distinguish the two rules. which is n - i + I.

For example. and is often much inferior in practice to the Boyer--Moorc method (and others). ifi =Othen begin report an occurrence of P in T ending at position k k:=k+n-l'(2). When an occurrence of P is found. it is not trivial However. a mismatch occurs at position i-I of P and L'U) > O. in Section 3. the The best known linear-time algorithm for the exact matching problem is due to Knuth..3.1'(2) places.2. there is no shifting in the actual implementation. Both of these proofs were quite difficult and established worst-case time bounds no better than 5m comparisons. Although it is rarely the method of choice. The Apostolico-Giancarlo variant of Boyar-Moore is discussed in Section 3. a sublinear running time is almost always observed in practice. This was first proved by Knuth.3. Richard Cole gave a much simpler proof [108] establishing a bound of4m comparisons and also gave a difficult proof establishing a tight bound of 3m comparisons. while i > 0 and begin i:=i-I. which efficiently finds all occurrences in a text of any pattern from a set of patterns. '0 th.' 2. if we only use the bad character shift rule. We will later show.e. then the rule shift'i P by n . THE KNUTH-MORRIS-PRATI ALGORITHM 23 2. Notice that this can be deduced without even knowing what string T is or exactly how P is aligned with T. and given rules to determine by how much P should be "shifted".1 k:=n. these preprocessed values are used during the search stage of the algorithm to achieve larger shifts. Rather. for the case that P doesn't occur in T. but assuming randomly generated strings. We will present Cole's proof of 4m comparisons in Section 3. the index k is increased to the point where the right end of P would be "shifted". in typical string matching applications involving natural language text.2.1.wefullyde"'loptre method in Section 3.L'(i) places to the right.. We now formalize this idea . gives a "Boyer-Moore-like" algorithm that allows a fairly direct proof of a 2m worst-case bound on the number of comparisons. we use a variant of Galil's idea to achieve the linear time bound in all cases At the other extreme. and Pratt [2781. whilek::. The complete Beyer-Moore algorithm We have argued that neither the good suffix rule nor the bad character rule shift P so far as to miss any occurrence of P. and an alternate proof was given by Guibas and Odlyzko [196]. The first of these modifications was due to Galil [168]. Hence. in1h~field. Later. We won't discuss random string analysis in this book but refer the reader to [184]. Although Cole's proof for the linear worst case is vastly simpler than earlier proofs. The Knuth-Morris-Pratt shift idea For a given alignment of P with T. and for its hiSlorical rol. Note that the rules work correctly even when /'(i) = 0 One special case remains. Afterdiscussing Cole's proof. each act of shifting P takes constant time. a fairly simple extension of the Beyer-Moore algorithm. suppose the naive algorithm matches the first i characters of P against their counterparts in T and then mismatches on the next comparison. else shift P (increase k) by the maximum amount determined by the (extended) bad character rule and the good suffix rule. In the case that L'(i) == 0. so that the L'(i)-length prefix of the shifted P aligns with the C(i)·length suffix of tbc unshifted P.-e. if P = ubcxabcde and. The Knuth-Morris-Pratt algorithm P(i) = T(h) do h :=h -1. When the first comparison is a mismatch (i. and its linear time bound is (fairly) easily proved. 2.on. the good suffix rule shifts P by n -/'(i) places.22 EXACT MATCHING:CLASSICAL COMPARISON·BASED METHODS 2. The naive algorithm would shift P by just one place and begin comparing again from the left end of P. several simple modifications to the method correct this problem. But a larger shift may often be possible.6. For those Knu'h·Morri. However. The algorithm also forms the basis of the well-known Aho-Corasick algorithm.2. and is important in order to complete the full story of Boyer-Moore.. We will present several solution. Morris. then it is easily deduced (and we will prove below) that P can be shifted by four places without passing over any occurrences of P in T. If. in Section 3. that by using the strong good suffix rule alone. Note that although we have always talked about "shifting P>. the mismatch occurs in position 8 of P.lt . Morris.·Prattmethodhere . end.5. So the Beyer-Moore algorithm shifts by the largest amount given by either of the rules. due to Apostolico and Giancarlo [261. the expected running time is sublinear. it can be simply explained. Moreover.3.mdo begin h:=k. We can now present the complete algorithm The Boyer-Moore algorithm Beyer-Moore method has a worst-case running time of O(m) provided that the pattern does not appear in the text. yielding an O(m) time bound in all cases. P(n) mismatches) then P should be shifted one place to the right 2.. Only the location of the mismatch in P must be known.2 When the pattern does appear in the text then the original Beyer-Moore method runs in 0(nm) worst-case time. during the search stage.4. in the present alignment of P with T.et problem including the Aho-Corasick . and Prall [278]. The Knurh-Morrts-Pran algorithm is based on this kind of reasoning to make larger shifts than the naive algorithm makes. end. then the worst-case running time is O(nm). then the good suffix rule shifts P by n .2. The good suffix rule in the search stage of Beyer-Moore Having computed L'U) and l'(il for each position i in P.

after a shift. First. contradicting the definition of sp. However.'Pivalue.. k . For example.3. This is true even without knowing T or how P is positioned with T.1 of T.houldbealertoJthattraditionaIlYlheKnuth-Mnrris-Prnnalgmilhmha"beendescribed. As an example. In fact.i] that matches a proper prefix of P. Staled differently. xyabcxabcxadcdqJeg.3 = a places. if the first mismatch (comparing from left to right) occurs in position i + I of P and position k of T. and so sp. SPI = 0 for any string An optimized version of the Knuth-Morris-Pratt algorithm uses the following values '-1 _+-----'-: ---'-:' ':'f-----f---- P before shifl P aflershifl P" ~~---~--"missedoccurren~eof Figure 2. and P will be shifted 4 Thcroader. Note that by definition... it often shifts P by more than just a single character. t> is the prefix of P of length sp.5pS = 3.1]. and all characters in the (assumed) missed occurrence of P match their counterparts in T.on. and SPIO = 2. Character P(/) cannot be equal to P(i + I) since P(l) is assumed to match T(k) and Pit + I) does not match T(k). (P) to be the length of the longest proper .'iP.81 > 1131 sp. Thus af3 is a proper suffix of P[I. both copies of the string are followed by the same character c. In particular.. The advantage of the shift rule is twofold. characters of P are guaranteed to match their counterparts in T. shift P exactly i + I . if P = bbccaebbcaod.. the first 3 characters of the shifted P match their counterparts in T (and their counterparts in the unshifted P) Summarizing.2: Assumed missed occurrence ueeco correctness prool tor Knuth-Morris·Pratt by 4 places as shown below: Clearly...(.. where P = of P mismatches then P will be shifted by 7 . + I) = i . the left-most sp. Both of these matched regions contain the substrings a and f3.3 PROO~.. Hence the theorem = is proved.nlcrm.l aligns with Trk . and the left end of P is aligned with character 3 of T. But [c] > 0 due to the assumption that an occurrence of P starts strictly before the shifted P.8] and is followed by character b in position 2 and by character c in poshion 9. . to determine whether the newly shifted P matches its counterpart in T. then .FailurefunCI. The unshifted P matches T up through position i of P and position k . Thus.. The shift rule guarantees that the prefix PlI . In the case that an occurrence of P has been found (no mismatch). sp.] of the shifted P matches its opposing substring in T.(P) for all P. SP4 = 1.sp.ln other words. 8}.i) that matches a prefix of P.of !ailure!unCI. for example. then shift P to the right (relative to T) so that prl . and the next character is unequal to P(i + I). suppose P = abcxabcde as above. so that there is an occurrence of P Starting strictly to the left of the shifted P (see Figure 2. + I of P will align with character k of T. The next comparison i. SPi(P) is the length of the longest proper substring of PILi] that ends at i and that matches a prefix of P. so the unshifted P and the assumed occurrence of P match on the entire substring af3. Second.3.(P) :':0 sp. so la. we have For any alignment of P and T. Now let I = laf3! + I so that position I in the "missed occurrence" of P is opposite position k in T.2).' = 1 since the single character b occurs as both the first and last character of prl . and let a and t> be the substrings shown in the figure.IPS = a proper prefix of PII . if P = ahcaeabcahd.. the algorithm can start comparing P and T at position +I of P (and position k of T). When the string is clear by context we will use sp. define sp. Hence af) is a suffix of P[l. . The time analysis is equally simple . 0 Theorem 2.Suppose not.. whicharerciutedtothc.sp.. shift P by n . and so the Knuth-Morris-Pratt algorithm will correctly find all occurrences of Pin T.' -c 2.ol! •.24 EXACT MATCHING:CLASSICAL COMPARISON·BASED METHODS 2. THE KNUTH-MORRIS-PRATT ALGORITHM 25 Definlnnn For each pcsltion i in pattern P. places to the right so that character sp..i] tbat matcbes a prefix of P. then SP2 = sp_] = D. guarantees that the same mismatch will not above example.mjJixof PII . shown relative to the shifted povition of P.3. sp. Then P and T will match for 7 characters but mismatch on character 8 of P. + I] The use of the stronger shift rule based on sp.2 says that the Knuth-Morris-Pratt shift rule does not miss any occurrence of P in T.8] and as a suffix of PlI . sp.'iP. then made between characters T(k) and Plsp.willbee'pticillydelincdinSection2. The Knuth-Morrls-Pratt shift rule 123456789012345678 xyabcxabcxadcdqfeg abcxabcde abcxabcde As guaranteed. in place of the full notation.sp~ places.

so sp:(P)::: Itli > [cj. where a single phase consists of the comparisons done between successive shifts. as in the case of Beyer-Moore. REAL-TIME STRING MATCHING 27 Theorem 2. where s is the number of shifts done in the algorithm. Whilec+(II-p):. but we never accounted for time needed toimplement shifts. Zj = [e ] = sp.(P) and j maps to i. PROOF Divide the algorithm into compare/shift phases.i] would be a proper suffix of P[I .3. so the number of comparisons done is bounded by 2m 0 :"" sp~(P).(P)] 2. end. such that P[i + 1] does not match P[lal + 11. The reason is that shifting is only conceptual and P is never explicitly shifted. sp' and sp values can be computed in linear time using end.: i that mapstoi. sp.4.+ I Putting all the pieces together gives the full Knuth-Morris-Pran algorithm Knnth-Morrts-Pratt algorithm SP~_I PROOF If spJP) is greater than zero.i] that matches a prefix of P. letting j denote the start of c .I p:= rep). 0 begin Preprocess P to find F'(k) = c.(P) must be zero Now suppose sp. the Knuth-Morris-Pratt algorithm "shifts" P so that the next comparison is between the character in positionc of T and the character in position rpj-l. and p s: II Given Theorem 2. The proofs of the claims for SPi(P) are similar and are left as exercises. Therefore. . and let j* be a position in the range I < j' -: j that maps toi..3. the numberof choracter comparisons is at most 2m. then sp. Butsp. + I for k from 1 to 11 + I.sp. The sp Zj.3.P(lfJD. We claim that j is the smallest position in the range 2 to i that maps to i. ifp =11+ I then report an occurrence of P starting at position ifp:= I then c t. in any phase at most one comparison involves a char deter of T that was previously compared. the total number of character comparisons is bounded by m + s.e c -l. where sP. Preprocessing for Knuth-Morrfs-Pratt The key to the speed up of the Knuth-Morris-Pratt algorithm over the naive algorithm is the use of sp' (orsp) values. We use pointer p 10 point into P and one pointer c (for "current" character] to point into T + Iton+ 1.(P):= downto z do max[sp.i] that matches a prefix (call it fJ) of P. Then PlJ* .J andspo are defined to be zero r(i) 2.~p' and sp values from the Z values obtained during the fundamental preprocessing of P. + I = F'(i + I). Real-time string matching In the search stage of the Knuth-Morris-Pratt algorithm. pointers to P and T are incremented. define the failure function F'(O to be I). Rather.=ndownt02do begin i:=j+Zj(P)-I. We verify this below We will only use the (stronger) failure function > in this discussion but will refer to I of P. = T(c) I.4.= I.+-l(P) - 1. After any shift.:m do begin While pep) do begin p :=p+ c:=c+l. c- II of T. P is aligned against a substring of T and the two strings are compared left to right until either all of P is exhausted (in which values are obtained by adding the following' .contradicring the assumption thatsp: = a.I of P.26 EXACf MATCHING:CLASSICAL COMPARISON·BASED METHODS 2. :=n-l sp. by the definition of mapping. then there is a proper suffix ct of P[l. P(i + I) #. p:= I. if there is no j in the range 1 < j:. Two special cases remain. then P is shifted right by n . This is implemented by setting F'(n+ I)tosp. end.2. :=0.4. Since P is never shifted left. Hence..3. the comparisons in the phase go left to right and start either with the last character of T compared in the previous phase or with the character to its right. But s < m since after m shifts the right end of P is certainly to the right of the right end of T. 2. In the Knuth-Morris-Pratt method. A full implementation of Knuth-Morris-Pratt We have described the Knuth-Morris-Pratt algorithm in terms of shifting P.sp~ places.. It is easy to see how to compute all the .. forj. all the the Zi values as follows: Z-based Knuth-Morris-Pratt fori:= I tondo sp. Thus. thenp is set to F'(I) = I and cis incremented by one.3.= end.(P) .3. When an occurrence of P is found. so a general "shift" can be imp1cmentedin constant time by just setting p to F'(i + 1). Moreover. Suppose not.> 0 and let j be as defined above. When the mismatch occurs in position 1 of P.

First... This shift guarantees that the prefix P[l. Then P should be shifted right by i .4. We will not use this terminology here and instead represent the method in pseudo code I I 2. a method must do at most a constant amount of work between the time it first examines any position in T and the time it last examines that position. Give a 'hand-wavinq" explanation tor this 2. In general terms.5p. Preprocessing for real-time string matching The proof of this theorem is almost identical to the proof of Theorem and is left to the reader. . and we leave that as an exercise. For that reason. such as when searching lor an English word in a book. positions.3."J matches the opposing substring in T and that I(k) matches the next character in P. as the size 01 the pattern grows.2. It follows that the search is done in real time.5. So. 2. The utility of a real-time matcher is two fold. Note also that if the search stage is real time it certainly is also linear time. sp. and it now also never reexamines a position involved in a mismatch. the Knuth-Morris-Pratt method is not a reid-time method To be real lime. "Common sense" and the (-l(nm) worst-case time bound of the Beyer-Moore algorithm {usingonlythebadcharacterrule)bothwouldsuggestthatempiricalrunningtimesincrease with increasing pattern length (assumingafixedtext). it does not guarantee that T(k) = P{sp. the resulting real-time method is generally referred to as a deterministic finite-state string matcher and is often represented with a finite state machine diagram..2 for the Knuth-Morris-Pratt algorithm. Preprocessing of P need not be real time. the earlier for Knuth-Morris-Pratt Z-based real-time fori:= I ton do matching 2. as indicated above. Give a handwaving explanation for this. If the size of the alphabet is explicitly included in the time and space bounds. if. this is not true when the position i~ involved in a mismatch. if a position of T is involved in a match. usual) that the alphabet is finite.l places. Below we show how to find all the SP(ix) values in linear time. + 1).Brute torce (the naive algorithmj is reported [184j to run laster than mostot her methods. When searching lora single word ora small phrase in a large Enqlish text. In the Knuth-Morris-Pratt method. this gives an algorithm that does linear preprocessing of P and real-time search of T. Note that the definition of real time only concerns the search stage of the algorithm. Although the shift based on sp. + I). EXERCISES 29 case an occurrence of P in T has been found) or until a mismatch occurs at some positions i + 1 of P andk ofT.> 0.ln the Ianer case. and the next comparison is between characters T(k) and P(sp. It is easy to establish that the algorithm finds all occurrences of P in T. The required preprocessing of P is quite similar to the preprocessing done in Section 2. No explicit comparison of those substrings is needed. in certain applications. Hence. P[ 1.1.r and P(i + 1). + I).5. such as when the characters of the text are being sent to a small memory machine.the simple bad character rule seems to be as etfective as lhe extended bad character rule. guarantees that P(i + I) differs from Plsp. In "typical" applications of exact matching. the method becomes real time because it still never reexamines a position in T involved in a match (a feature inherited from the Knuth-Morris-Pratt algorithm). But when sea rchinginactualEnglish Knowing the sP:i.d values for each character x in the alphabet allows a shift rule that converts the Knuth-Morris-Pratt method into a real-time Suppose P is compared against a substring of T and T(k) = . Exercises 1. it is never examined again (this is easy to verify) but.SP:i. the comparison between T(k) and The next needed comparison is between characters II I . For historical reasons.SP(i. Converting Knuth-Morris-Pratt to a real-time method Note that the linear time (and space) bound for this method require that the alphabet L be finite. then the preprocessing time and space needed for the algorithm is O(]I:]n) We will use the Z values obtained during fundamental preprocessing of P to convert the Knuth-Morris-Pratt method into a real-time method. of the length of the shift rule.l of the shifted pattern matches its opposing substring in T.28 EXACT MATCH!NG:CLASSICAL COMPARISON-BASED METHODS 2. one might need to guarantee that each character can be fully processed before the next one is due to arrive If the time for each character is constant. how would you expect this observation to hold up with smaller alphabets (say in DNA with an alphabet size of four). Hence Tlk) might be compared several times (perhaps Q(lPll times) with differing characters in P. the search stage of this algorithm never examines a character in T more than once. and when the text has many long sections of similar but not exact substrings? 3. then P is shifted tight by i -Ip. guaranteeing that the prefix. This allows us to do ]L I comparisons in constant time. Together.4.

It is natural by example Below is working ing. So rather than starting do. In Section 1. how the Z algorithm can then be modified need z in P. not even a right-to-Ieftcomparison 6. using more is the average assumptions shift using the extended to implement about the Sizes of P and the following 16. The point of the exercise an algorithmic idea using . Given characters. versus 22. Consider point would now the case that the pattern times length creating following to compute a natural in that detailed method. Prove that if ZI< is greater sp. if we use S = PT. than zero even when up the Boyer-Moore good suffix rule shifts the mismalching shifted cbstow would method. maps to k. = 117. E valuate assumln grandomstrings when the bad character Combine computes the answers to the previous tor a string two exercises to create a linear-time algorithm that rule is used as before.althoughthe difficult to explain of Pin T. this approach Pascal has the same behavior code (in Turbo Pascal) rule. Given the answer sp" where an example Stated there k+ 1 = sp. on sp.eitherempirically 7. Conclude that than zerc. we do not use a separator how to find all occurrences place of Zvalues. is known only atthe no character k.=i-liforanyindex that this conjecture sp:(P) P. equals the largest position i such that kwhere «e: Zk iis of examining characters P from righl to left. Give a proot of Theorem time.5. but wtth sp values t. the search moves right to left in the makes it more with would at the a more that for Boyer-Moore can be based directly rather than on Z values.wherei::: II' Show ofJ. it looks the bad 20. Explain in what way the method (needed is a "simulation" for Boyer-Moore) of the Z algorithm should be sPn for any i. Show that this is not true. algorithm At that point we must begin can be generalized compares tirst). Prove the claims in Theorem only the 2. exercises method for this. bad character rule.) algorithms. 24. alone. the utility of this all the Z values S given only the sp values for S and not the string on real data or by analysis where fewer comparisons it with the good suffix snseu rule is 21.the values is incorrect. Suppose. can be greater x. T. we showed for string the this. values copy of t under then the pr e2. exposition Work and more However. preprocessing for it and for Knuth-Morris-Pratt. but if they are unequal at the start of iteration using str'lng S. That 11. the pattern Argue method preprocess- out the details ot this approach. Is it true that given determined? sp values for a given string from the sp'values alone? are completely algorithm to convey behind the idea of the algorithm. is known comparison from Ihe right end of P. Are the sp values determined ep' of the algorithm. First.2. as the or'rg'lnal Boyer-Moore implementing Richard Cole's on average to conjecture th atsp.. zwill not first compare rule trom y.3. Perform with different it? say that SPI maps tok If k = I-sp. Prove that the Knuth-Morris-Pratt occurrences 12. Show how to do that when usee in why it is more effective the preprocessing allow an easier Jeftaiter uniform {This P and T. give an example i such mat sn to show that Z. Now show T in linear time using S = PT. 10. It is different method explain a program Then 13. as it did to search explanation bound. Construct an example instead in P that are least bad character likely to be in T. S itself). the bad character shift rule from th eBoyer-Mooremethodto method itself.000 it took about as long to search In Section the we establish Z values accomplish final three 2. as the Boyer-Moore a rfqht-to-lett we apptyfbe Recall that but r•. but there is no guaranteelhal then then no position mapstok kofthe Zalgorithm however. At what search Z values from S (I. exposition olthe each search needed lefl 10 right in the pattern. That so that comparisons. (somehow) Z" know the to compute x. How much empirically. sp values and the converse.e. to incorporate how effective algorithm preprocessing. for the strong preprocessing 14. If we use the weak the matched processing Explain this. exercise. the Boyer-Moore on an English for the word 'Inter" Give a hand_waving increases creasing? 4. algorithm that (when Zk is computed).4 concerning good suffix the program.2. time. =i it is natural toconje cturethat rules.. knowing five only all the to The We text of about 300. In Boyer-Moore pattern. . Recall the definition i. 'k can be deduced from the so varues for every position in P in the order character of how unlikely they are to be in T (most first at those characters rule or extended approach. the algorithm unlikely characters is. Evaluate good-suffix 5. Letxand and let z denote using at least three bad character characters the character another is way. Thus. Evaluate without Would empirically algorithm runs faster in practice characters. when given longer five times Using sp values to compute Zvalues that one can compute not knowing a linear-time exerCise exercise exposes all the sp values In the next algorithm suggests a hole algorithm. By the suffix rule. you expect bad character the search to stop de- The first you expect times to start increasing in combination needed also me point? rule compared to the original for the two rule? Does the T. and Boyer-Moore-like pattern the methods moves ieftto right relative 10 the text. shift rules (either based on spor sp') do not miss any suffix good suffix rule in Beyer-Moore require that shifts the closest to be different. not be Pfarther than the extended idea lor speeding i is the largest such that sp. The idea ctme instead of Pagainst bad character rule in the Boyer-Moore in 19. that is.4. and analyze than the approac hbasedonfundamental in [278]. on page 8.EXACT MATCHING. SupposethereisapositionisuchthatSPimapstok. place the pattern and show how it anows the Knuth-Morris-Pratt method orto the naive matching that rule is and explain right end of the text. but doesn't the next character is not as simple as it might at first appea r. empirically the effectiveness of the strong good suffix shift for Boyer-Moore the weak shift rule 9. -c t.5 we showed the Z values that all the occurrences on the str'lng S = of Pin of Pin Explain symbol T could be found how the method between in linear time would change in are made used alone. largest such position. choices the utility of the extended the evaluation exercises develop a correct linear-time In [202].andletibethe Prove that Z. Suppose in Tand comparisons that a phase Prespectively. and the Size 01 the alphabet. for 'fnteracuvely". why is begun bad character from the right at the shifted P. 8. is known at the start of iteration end of iteration Hence k. the number and Ihat the v oenote in the it will be 18.3.men r. patterns.CLASSICAL COMPARISON-BASED METHODS 2. Then show how to accumulate all the I'(i) varuesm linear by computing P$ T.andthatr. this modified r. Show why this is It may seem that /'(i) notlrue. EXERCISES 31 texts. Upon mismatching. This will shift Pagain. Exami ne the code to extract the prove correctness is that it is difficult and isclosertotheoriginal its running 15. Evaluate of combining rule. 23. ends with a mismatch rule. It is possible Then evaluate the Boyer-Moore and to combine uniform the preprocessing a small change for Boyer-Moore to Boyer-MOOre shifting for Knuth·Morris-Pratt. and conduct a mismatch. Show method extended x and z? If they are equal r. distinct extra shift pay for the extra computation different of occurrences of Pin to Ihe previous position Zk always equals T.

riteln('the writelnl'after value in .1 we showed for the how to use Z values and the values values. we can make similar Pof Show rule to make the Beyer-Moore 1 . zero = 0.. Prove Theorem In this chapter.n] itself) such that shift more effective. EXERCISES 33 gsmatch(input. of using Z values sP:.j. indexarray.4. cell i is the numbe . = p[j]) (3) then preprocessing tstring kmp_shift[k]: end (3) r-».. to compute needed both Ihe sP: and SA values Instead for its real-time these values of Pand extension. <" m) do {2l i:"j_old to j-1 do > j-1) then gs_shift[ij: j-1.. {stage j:=j+1.eqe rLr for if (j m. i:= 1 to m do write(Inatchshift[i] wr i t. begin go_on:=true. where n is the length lEI is the length 01 the (gs_shift[j] alphabet.. for k:=m-l (2) downto 1 do 26.outputj.integer) .x sp:.of positions in position i of the to shift'). Prove that the shift rule used by the real-lime of Pin string matcher doe s oot mrse envoccurrences T. 27. (' the length of the string is end. Lold:=1. = array[l . + .m.. Although method strong occurs than we don't know howto shift P[i simply convert the Boyer-Moore algorithm to be a real-time changes P[i+l to the i < m) then go_on: [3) j. while if if else end. show how to obtain from the SA and/or begin t sP: values in linear [O{nll:lll time. (gs_shift[iJ j :=j+kmp_shift[j]. character That is. n] (other how to modify T{h) we can look for the right·mostcopyin the preceding is T{h). is an implementation for the strong of Richard Cole's rule) good suffix if (p[k] begin j: ~j -1. occurring pattern').matchshift. 28. procedure gs_shift: gsshift(p:tstring. var miLnt. readstring(p.. when a mismatch . kInp_shift: indexarri. else indaxarray kmp_shift end. go_on:boolean.32 program (This EXACf MATCHING:CLASSICAL COMPARISON-BASED METHODS 2.~ j+kmp_shift[j+lj false. begin write1n [ll {main} I' input a string on {2} a single a mismatch gs_shift[j] kmp_shift[m] (stage I) 25. the way Knuth-Morris-Pratt between p(i)and was converted. matehahi ft: indexarray.ml. brnshiEt.i:integer.tstring. 2) (2) [k] : ~j-k+l.S. p.1Y. gsshift(p.j_old. readstring(var p:tstring.m) i.k:integer. end. 1001 of integer. Lold:"j. while begin procedure begin readlp} . writeln end. = string[200].. m:"Length (p) . for begin for {I} j:= 1 to rn do :=I.e Ln r end. (prj1 [3) > <> used in Knuth-Morris-Pratt p[k]l and j-k) go_on then do gs_shift[j] :~ j-k. [main} . 2.

the Boyer-Moore assuming 29. and of the at in the tree is the concatenation the subpath must labels on the edges path directed characters in the subpath. Consider a phase where the right end of P is aligned with position j of T and suppose that P and T match for I places (from right to left) but no farther. it is quite amazing that a close variant of the algorithm allows a simple linear time bound We present here a further improvement of the Apostolico-Giancarlo idea. It infers and avoids many of the explicit matches that Beyer-Moore makes We take the following high-level view of the Beyer-Moore algorithm. from the root. where n is the length of P. It is then immediate that the number of comparisons is at most 2m: Every comparison is either a match or a mismatch. where preprocessing for Boyer-Moore was discussed. the right end of P is aligned with a character of T. We assume here that vector N has been obtained during the preprocessing of P.3) of number Apostolico and Giancarlo [26] suggested a variant of the Boyer-Moore algorithm that allows a fairly simple proof of linear worst-case running time. The method therefore has all the rapid shifting advantages of the Beyer-Moore method as well as a simple linear worst-case time analysis. thesubpath for this on the edges of the tree plus the lenglh 3.butitviolatestherequirementIhatitbeasubpathofapathstartingatihe root. In acompare/shift phase. Note Ihalalthough that runs be part ofa 2. With this variant.The is to find all subpat hs of paths starting not start at the root {see proportional to the total of P. that N. In Section 2. Those paths start at the root. we maintain an m length vector M in which at most one entry is updated in every phase. Suppose weare preprocessing so that the needed information is collected in linear time. during the search for P in T (after the preprocessing). resulting in an algorithm that simulates exactly the shifts of the Beyer-Moore algorithm.3: The pattern p= eqre labels two subpalhs of paths starting at the root. set M(j) to a value k S I (the rules for selecting k are detailed below). P is shifted right by some amount as given in the Boyer-Moore shift rules. A Beyer-Moore variant with a "simple" linear time bound FIgure 2. there can only be m mismatches since each one results in a nonzero shift of P.1. First.4 we showed how to compute N. The label with pattern problem each edge is labeled ofa subpath problem itseltneed in time with one or more characters. M(j) records the fact that a suffix of P oflength k (at least) occurs in T and ends exactly .34 EXACf MATCHING:CLASSICAL COMPARISON-BASED METHODS 3 Exact Matching: A Deeper Look at Classical Methods 3. Then. Thus in the figure. However. not "arq". and P is compared right 10 left with selected characters of T until either all of P is matched or until a mismatch occurs.. Note that an edge label is displayed from the top 01 the edge down towards the bottom of the edge. Then. We will also show that (in addition to the time for comparisons) the time taken for all the other work in this method is linear in m Given the history of very difficult and partial analyses of the Beyer-Moore algorithm. buithe subpaihscontaining aqrado not.2. no character of T will ever be compared after it is first matched with any character of P.2.4. a fixed size alphabet we are given a tree where given a pattern P.iJ that matches a suffix of P. Figure the root that are labeled Give an algorithm P. and there can only be Iff matches since no character of T is compared again after it matches a character of P. for every i in O(n) time. Recall from Section 2. There is also another subpath in the tree labeled aqra (itstartsab-ovethecharacterz). finding exactly the same mismatches that Boyer-Moore would find and making exactly the same shifts.(P} is the length of the longest suffix of P[1.1. We divide the algorithm into compare/shift phases numbered I through q S m. Two modifications of the Boyer-Moore algorithm are required.1. there is an edge labeled "qra"'. Key ideas OUf version of the Apostollco-Giancarlo algorithm simulates the Beyer-Moore algorithm.

3.-length substring of P ends at position i and matches a suffix of P. PROOF We prove correctness by showing that the algorithm simulates the original BoyerMoore algorithm.1: Substring Ct has length Ni and substring fJ has length M(h) :> Ni. + 1 of P.) 0. As the algorithm proceeds. 3.2.. Correctness and linear-time analysis Theorem 3.h.1. = i > 0.1. and we can conclude that the next Ni comparisons (from PU) and T(h) moving leftward) in the Boyer-Moore algorithm would be matches.1). at position j. The two strings must match from Iheir right ends lor Nicharacters. One phase in detail j andi setton Phase algorithm 1. then P matches T from the right end of P down to character i . PU . then declare that an occurrence of P has been found in Tendingatpositionj. and N.-length suffixes of those two substrings must match. M and N as above. If M(II) ::: N. but mismatch at the next character.N.h. . and if N.36 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3. That means that an N. Set MU) to j .Ni of P (this ends the phase). but the next pair of characters mismatch [i. The rest of the simulation is correct since the shift rules are the same as in the Beyer-Moore algorithm 3. = 0.N. then an occurrence of P in T has been found. then compare T(II) and P{I) as follows: The following definitions and lemma will be helpful in bounding the work done by the algorithm 2. while an M(h)-length substring of Tends at position h and matches a suffix of P. and No < I.1. the algorithm is correct when it declares a match down to a given position and a mismatch at the next position.e. a value for M(j) is set for every position j in T that is aligned with the right end of P: M(j) is undefined for all other positions in T The second modification exploits the vectors Nand M to speed up the Beyer-Moore algorithm by inferring certain matches and mismatches. Set M(j)to j . T(h . M(h) > Ni 5. if N.1. < i . So the N. Shift P by the Beyer-Moore rules based un a mismatch at position i . Further. for any given alignment of P with T. = i. and suppose it knows that M(h) > N. suppose the Beyer-Moore algorithm is about to compare characters T(h).N.N. A BOYER-MOORE VARIANT WITH A "SIMPLE" LINEAR TIME BOUND 37 4. To the idea.)] Hence P matches T for j . and shift according to the Beyer-Moore rules based on finding an occurrence of P endingatj(thisendsthephase). M(j) must be set to a value less than or equal to j . the Apostolico-Giancarlo variant of the T. then we can be sure that 3.II + Ni.M(j)mustbesettoavaluelessthanorequalton. That is.1. (see Figure 3.h + Ni characters and mismatches at position i . of P. If Figure 3. If M{II) is undefined or M(Il) = N.

)] does not cross any existing interval. TnCase I. andtherightendsofl and J'cannotbee qual (since M(h) has at most one value for any h). the substring of T consisting of the matched PROOF Every phase ends if a comparison finds a mismatch and every phase. other than the right end ofthe interval. and that this is the first time in the phase that II (presently k . then) > hand M(j) will certainly get set to ) . except the last. However. and in Cases 3 and 4.I) is in an interval in a position other than its right end. and j is to the right of any interval. b. The effect is that Cases 2 and 5 can apply at most once for any position in T. and M(h) will never be accessed again. Thus the algorithm can find at most m mismatches. the new inrervalje + 1. and that interval does not cross itself.2: a. = M(II) > 0 and N'-Ni = M(h . 3. and the induction is complete D if the comparison involving T(h) is a match then. We will show that by using the good suffix rule alone. This is also true for Cases 3 and 4 since the phase ends after either of those cases. Case 2 or 5 could apply repeatedly before a comparison or shift is done. Hence.M(h»). and the total number of character comparisons is bounded by 2m. the Beyer-Moore method has a worst-case running time of O(m). then shift P byn places to the right. either the phase ends or h is decremented by one place. That means that the right end of I cannot he to the left of the right end of I' (for then position k . Now suppose thatk . no character of T will ever be compared again after it is first in a match. Cole's linear worst-case bound for Beyer-Moore Suppose for a given alignment of P and T. We consider how h could first be set to a position inside an interval.Two covered intervals thai do cross PROOF The proof is by induction on the number of intervals created. it is possible that Case 2 or Case 5 can apply without an immediate shift or immediate character comparison. If no such shift is possible. An execution of Case 2 or 5 moves h from the right end of some intcrvall = [k . then shift P by the least amount so that a proper prefix of the shifted pattern matches a suffix of the occurrence of P in T. at the end of the phase.. although one interval can contain another..I is in some interval I' but is not at its right end.. The previously existing intervals have not changed. which is assumed to be untrue. we divide the Boyer-Moore algorithm into compare/shift phases numbered I through q ::: m. that move must follow an execution of Case 2 or 5.M(j) + 1. Since) is to the right of any interval.)]. so that a prefix of the shifted pattern matches a suffix of t in T possible. a substring t of T matches a suffix of P. if it exists. In compare/shift phase i. For example. so in all cases h + I is either not in a covered interval or is at the left end of an interval. So position h will be in the strict interior of the covered interval defined by ) Therefore. P is immediately shifted.theinterval[h+l . and if II is the right end of an interval then M(II) is defined and greater than 0. is followed by a nonzero shift of P. so the number of accesses made when these cases apply is also O(m). we focus on the number of accesses of M during execution of the five cases since the amount of additional work is proportional to the number of such accesses. To bound the amount of additional work. But these conditions imply that I and l' cross. Consequently. So ifh is ever moved into an interval in a position other than its right end. then in that phase only the right end of any covered interval is examined A new covered interval gets created in the phase only after the execution of Case I. A character comparison is done whenever Case I applies. M(j) is set at least as large as ) . and .h + I. Diagram showing covered intervals that do no! cross. Now assume that no intervals cross and consider the phase where the right end of P is aligned with positionj ofT. Certainly the claim is true until the first interval is created. "Whenever Case 3 or 4 applies. so there are no crossing intervals at the end of the phase. If an occurrence of P is found. To bound the matches. one place to the left of I. but a mismatch occurs at the next comparison to the left. h is the right end of an interval. Later we will extend the Beyer-Moore method to take care of the case that P does occur in T As in our analysis of the Apostolico-Giancarlo algorithm. Now the algorithm only examines the right end of an interval. Hence Cases I.Inanyofthesecases. then shift P by II places. observe that characters are explicitly compared only in Case I. That is. Since h = j at the start of the phase. a suffix of P is matched right to left with characters of T until either all of P is matched or until a mismatch occurs. provided that the pattern does not appear in the text. h will never be examined again. h] to position k -1.I would have been strictly inside 1'). and h + I is either not in an interval or is the left end of one.h + I or more at the end of that phase.2.2. so the algorithm never compares a character of T in a covered interval. In the latter case. 3. COLE'S LINEAR WORST·CASE BOuND FOR BOYER-MOORE 39 ~ i' __ 1 j-M(j)+! Figure 3. ifno intervals cross at the start of the phase.38 EXACT MATCHING: A DEEPER o b) __ ~ j'-M(j')+l o ~-L LOOK AT CLASSiCAL METHODS 3. h begins outside any interval. h is not in any interval. j]iscreatedafterthealgorithmexamines position h. But whenever Case 2 or 5 applies. That means that all characters in T that matched a character of P during that phase <ITe contained in the covered intervallJ . Rule 1 is never executed when h is at the right end of an interval {since then M(h) is defined and greater than zero). Hence the algorithm finds at most m matches.. Then find. So an execution of Case I cannot cause h to move beyond the right-most character of a covered interval. and 4 can apply at most O(m) times since there are at most O(m) shifts and compares. Case 5 would apply twice in a row (without a shift or character comparison) if N. D 3. and after any execution of Rule I. or4.

) :s 4m An initial lemma We start with the following definition and a lemma that is valuable in its own right. Consider how P aligns with T before and after that shift (see Figure 3. f3. the portion . be the number of comparisons in phase i of the second type. and our goal is to show that this sum is Certain[Y'Lr~lg.Phasei ends by shifting P right by Si positions.2. but it is not periodic. and let g.1 + I > 3. until less than 5i characters remain on the left (see Figure 3.1'. Definition r copies of We use the term "prefix semipenodic" to distinguish this definition from the definition given for "semiperiodic". pattern is then be periodic. mark off substrings of length s. For example. Let gi be the number of comparisons in phase i of the first type (comparisons involving a previously examined character of T). There will be at least three full substrings since IPI = It. Then.2.Thenp = ab. We will show that for any phase i. f3' denotes the string obtained by concatenating together Return to Cole's proof = ababababab = yo. denote the amount by which P is shifted right at the end of phase i.1. bcabcabc is semiperiodic with period abc.. over the entire algorithm the number of comparisons is 'Lr~l(gi + g. 3. s.4). Assume that P does not occur in T. Then since Lr~l s. and its proof is typical of the style of thinking used in dealing with overlapping matches For any string d. abababab is periodic with period abab and also with shorter period abo An alternate definition of a semiperiodic string is sometimes useful. For example. In each compare/shift phase. so the compare part of every phase ends with a mismatch.5). COLE'S LINEAR WORST·CASE BOUND FOR BOYER-MOORE 41 characters is denoted t" and the mismatch occurs just to the left of shifted right by an amount determined by the good suffix rule. Cole's proofwben the pattern does not occur in the text Definition Let s. Note also that a string cannot have itself as a period although a period may itself PROOF Starting from the right end of P. :s number of comparisons done by the algorithm is 'Lr~ I (gi + g. a is the part of the shifted P to the right of the original P. the string abaabaahaabaabaah is serniperiodic with period aab and is prefix semiperiodic with period uba. By definition ofsi and a. By the good suffix rule. The following useful lemma is easy to verify.l. we divide the comparisons into those that compare a character of T thai has previously been compared (in a previous phase) and those comparisons that compare a character of T for the first time in the execution of the algorithm./3. length of ail the shifts is at most m) it will follow that'LL g. but the following lemma (whose proof is simple and is left as an exercise) shows that these two definitions are really alternate reflections of the same For example.40 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS Ii' The 3. Note that a periodic string is by definition also serniperiodic. ::::g. String: abcahc is periodic with period abc.

after the phase h shift a character of P is opposite character T(k') (the character of T that will mismatch in phase 0.) or 2. if in phase h the right end of P is aligned with the right end of a /3. then phase h must have ended with a mismatch between T(k') and P(k). could have previously been examined only during a phase where P overlaps f" So to bound g" we closely examine in what ways P could have overlapped Ii during earlier phases.consequcntly. = P(I +q!tll) = P(k).4: Startingfromtheright. Butin phase i the mismatch occurs when comparing T(k') with i\I). Let k' be the position in T just to the left of Ii (so T(k') is involved in the mismatch ending phase i). All but one of the characters compared in phase i are contained in Ii. The reasonis the following: Strings P andr. where q ~ 1 (see Figure 3. as a concatenation of copies of string {3. So in phase h. and let k be the position in P opposite T(k') in phase h. string Figure 3. In phase h. Hence. the comparison of P and T will find matches until the left end of t.1: The case when the right end of P is aligned w~h a right end of A mismatch must OCCurbetween T(k') and P(k). for contradiction. 5 in phase h. since that is the alignment of P and T in phase i > h. This fact will be used below to prove thelernma. Here q = 3 Figure 3. is semlperiodic with period fj.' Recall that we want to bound g" (he number of characters compared in the ith phase that have been previously compared in earlier phases. There are two cases to consider: I. call that copy fj and say that its tight end is qld] plac estotheleftoftheright of ti. We will first deduce how phase h must have ended.6 shows string t. and in phase h the right end of P is aligned with the right end of some fJ. For . that in phase h the right end of P is aligned with the right end of some other full copy of fJ in t. and then we'll use that to prove the lemma. are semiperiodic with period /3.42 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3. and a character in t. We claim that. COLE'S LINEAR WORST-CASE BOUND FOR BOYER-MOORE 43 I !3 I !3 I !3 I !3 I !3 I !3. so P(k) = P(1) # T(k').5.S: The arrows show the string equalities described in the prool concreteness. suppose. Either the right end of P is opposite the right end of another full copy of fJ (in I.. So. the right end of P cannot be aligned with the right end of t. the right end of P will not be shifted in phase h past the right end of ri. I ~ Flgur&3. We will show that every possible shift leads to a contradiction. P(I) = P(I + 11'1) =. and in phase h.emarkecloffinP. Consider where the right end of P is after the phase h shift. P and T will certainly match until the left end of string li_ Now P is semipcrlodic with /3.2. Now we consider the possible shifts of P done in phase h. Figure3.. in phase h. t. but then mismatch when comparing T(k')and P(k).. Therefore. and P must have moved right between phases hand i. so no shifts are possible and the ass limed alignment of P and T in phase h is not possible. has length occurs here Phaslength Ilil +1. k' s.6: Substring ti in T is semiperiodic with period/l 'j 1"""h'" mismatch Figure 3. Figure 3. the right end of P is exactly ql/3i places to the left of the right end ot t.3: String".2.substringsoflengthltcl=sja. proving the lenuna Since h < i .7). The right end of P isin the interior of a full copy of/l PROOF By Lemma 3.

and that P is compared with T down .1 + I 1. Since P is semiperiodic with period. by the good suffix in the shifted P below fJ agree with Ii (see Figure 3.2. y is a suffix of {3.and Iyl + 101= 1. when phase h finds an occurrence of P in T. if P matches I.) is the remainder. is reached.j3 in eve!)' phase i 3s" then in phase h < i. in T for. Recall that the good suffix rule shifts P (when possible) by the smallest amount so that all the characters of T that matched in phase h again match with the shifted P and 3. COLE'S LINEAR WORST-CASE BOUND FOR BOYER-MOORE 45 I~I~I~I~I~ s pl~1 Figure 3. So when P does occur in T (and phases do not necessarily end with mismatches).8.8 followed by a prefix (8) of {3. In fact.2.2. That of a mismatch. So if the right end of P is in the interior of .8 holds even if phase h is assumed to end by finding an is.2.7 a mismatch would occur in phase h before the left end of t.2.2. and the number of comparisons done by the Boyer-Moore algorithm is H(mn) The O(m) time bound proved in the previous section breaks down because it was derived by showing that g. If It.8' in phase h. we reached the conclusion that no shift in phase h is possible Hence the assumption is wrong and the lemma is proved. contradicting the good suffix rule. ::: g. nowhere in the proof is it assumed that phase h ends with a mismatch. D Lemma 3.Then Ii would haveasmallerperiodthan/3. and that required the assumption that phase i ends with a mismatch. that alignment of P and T would match at least until the left end of t. By Lemma 3. it remains there forever.81= 1{31.2.8: Case when the right end 01 Pis aligned with a character in the interior of a fJ. Therefore.I positions.rlfll).Consider that alignment.1. s.8. not that phase then the proof of the lemma of the above proof. Starting with the assumption that in phase h the right end of P is aligned with the tight end of a fun copy of {3. For concreteness. Case 2 Suppose the phase h shift aligns P so that its right end aligns with some character in the interior of a full copy of f3. and the characters of P aligned with T(k") before and after the shift would be different. P(k) must be equal to P(k .1. D Note again that this lemma holds even if phase h is assumed to find an occurrence of P. By Lemma3. Since fJ = {3. This observation will be used later PRom' Suppose in phase h that the right end of P is aligned with a character of fi other than one of the left-most 1. and so would match at:position k" of T. Hence the good suffix rule would nor shift the right end of P past the right of the end of If. say that the right end of Pisalignedwithacharacterincopy{3'ofstring .81characters to the right of the left end of f" and so by Lemma 3.8'.8 = yo = tly.contradictingthedefinitionof/J semiperiodic with period {3. Thus if the end of P were aligned with the end of {3'then all the characters of T that matched in phase h would again match. only that phase r does.8' is not the left-most copy of {3. P can match Ii in T for at most PROOF Since P is not aligned with the end of any {3in phase a. however.8'.81-lchuracfers > so that the two characters of P aligned with T(k") before and after the shift are unequal We claim these conditions hold when the right end of P is aligned with the right end of {3'. That is. Below we present a version of his idea The approach comes from the following observation: Suppose in phase i that the right end of P is positioned with character k of T.8'. the phase-h shift cannot move the right end of P to the right end of . After that mismatch. Let y8 be the string in the shifted P positioned opposite fi in t.and the Lemma is proved. if the right end of P is aligned in the interior of {3'in phase h. 0 Note again that Lemma OCCUITen<. where y is the string through the end of f3 and.81. thus yo = oy. it must also be in the interior of {3'in phase h + I. the two characters of P aligned with T(k") before and after the shift cannot be equal. 3. {3 = pi fort> I. we must modify the Boyer-Moore algorithm in order to recover the linear running time.2.I characters or the right-most 1. Say that mismatch occurs at position k" of T. Moreover. the proof only needs the assumption that phase i ends with h does. Since.:::3si.81characters. and by Lemma 3.That means (hat. so the phase-s + I shift would also not move the right end of P past If.the right end of P is at least 1.6.44 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3.2.8).So we would again have .7. which contradicts the assumption that {3is the smallest string such that a = 131for some I. the right end of some P is opposite a character in the interior of p. .:e P in T. This is impossible since in phase i > h the right end of P is aligned with the right end of I" which is to the right of .8 is a prefix of {3. this again would lead to a contradiction to the selection of. only needs the reasoning contained in the first two paragraphs Theorem 3.8 Of more characters then the right-most 13characters of P would match a string consisting of a suffix (y)of. But h was arbitrary.2. Hence the right end of P is not in the interior of {3'. Assuming P does not occur in T. in this alignment. Galil [1681 gave the first such modification. and we will show that the shift will also not move the end of P past the right end of .8. P is shifted right by some amount determined by the good suffix rule. The case when the pattern does occur in the text Consider P consisting of n copies of a single character and T consisting of m copies of the same character.1. Therefore. Then P occurs in T starting at every position in T except the last n .

COLE'S LINEAR WORST-CASE BOUND FOR BOYER-MOORE 47 to character s of T. the Beyer-Moore algorithm uses the largest shift given by either the (extended) bad character rule or the (strong) good suffix rule. However.61new characters ofT may be repeated many times. it should still be Oem) when both rules are used. and we assume in the rest of this section that the Boyer--Mocre algorithm includes this rule.2. Even with the Galil rule. It seems intuitive that if the lime bound is Oem) when only the good suffix rule is used.) If the phase-r shift moves P so that its left end is to the right of character s of T. But then if P overlaps with the run by more than 1. no character of T compared in phase i + 1 will have ever been compared previously. But then no mismatch would have been possible in phase k since the pattern in phase k is aligned exactly 1. Let Q be the set of phases of the first type and let d. if phase i + I ends by finding an occurrence of P then P will again shift by exactly 1f31 places and no comparisons in i + 2 will examine a character of T compared in any earlier phase. It is easy to implement this modification to the algorithm. It follows then that over the entire algorithm the total number of such additional comparisons in O(m). :'0 3si if phase i ends with a mismatch. the analysis of how P of phase h < i is aligned with P of phase i did not need the assumption that phase h ends with a mismatch Those proofs cover buth the Case that h ends with a mismatch and that h ends by finding an occurrence of P.3.6 (each copy of P places 10 the right of the previous occurrence) and is culled a run. the pattern is shifted right by 5<' = 1. So suppose phase i ends by finding an occurrence of P in T and then shifts by less than n13. which we call the Cali/rule. + 1)/3. Repeating this reasoning.2. The (extended) bad character rule only moves P to the right. A run ends in some phase k > i when a mismatch is found (or when the algorithm terminates). In phase i the right end of P was aligned with some character of T. However. 0 3. we never made any assumptions about how P came to be positioned there. PROOF In the analysis using only the suffix rule we focused on the comparisons done in an arbitrary phase i.2. In particular. we again ignore the case that Si :::: (n + 1)/3 ::::(d. Hence the shift in phase i moves P right by exactly 1f31 positions. It is possible that characters of T in the run could be examined again in phases after k. even though earlier phases might end with a match For phases in Q.46 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3. Therefore. we deduced hounds on how many characters compared in that phase could have been compared in earlier phases. since the total number of comparisons done in such phases must be bounded by L 3s. Then LiEQdi + Li~Q(ltil + 1) is a bound on the total number of comparisons done in the algorithm. As an example of such an overlap. the algorithm can conclude that an occurrence of P has been found even without explicitly comparing characters to the left of T{k + I). let f3 denote the shortest period of P.2.61 characters. Rather. = O(m). However. Adding in the bad character rule Recall that in computing a shift after a mismatch. if the right-to-left comparisons get down to position k of T. and since phase k' does not end with a mismatch those comparisons must still be counred In phase k'. over the entire algorithm the number of comparisons used to find those occurrences is If no additional all comparisons in phases that end with a mismatch have already been accounted for (in the accounting for phases not in Q) and are ignored here Let k' > k > i be a phase in which an occurrence of P is found overlapping the earlier run but is not part of that run. This shifting P by exactly 1f31 positions and then identifying by examining only 1. if the left end of the new occurrence of P in T starts ata left end of a copy of. contradicting the selection of. so all lemmas and analyses showing the Oem) bound remain correct even with its use. Further. Then a run begins at the start of the substring and ends with Its twelfth character. So P can overlap the run only by part of the run's left-most copy of f3. characters in the run will be examined again in phase k'. and so the intuition requires a proof to account for them. So in phase k'.5 it follows that P is semiperiodic. and an overlapping occurrence of P (not part of the run) begins with that character.finishingtheproofof the lemma. Such a succession of overlapping occurrences of P then consists of a I. in phase i + I. PROOF Partition the phases into those that do find an occurrence of P and those that do not. 0 . Hence it again holds that g. :'0 3m.6.6. given an arbitrary placement of P in a phase ending with a mismatch. then contiguous copies of . Hence all of the lemmas and analyses remain correct if P is arbitrarily picked up and moved some distance to the right at any time during the algorithm. be the number of comparisons done in phase i if i E Q.61 places to the right of its position in phasek -1 (where an occurrence of P was found).6 in the run.6 is periodic.1 implies that. since phase k' ends by finding an occurrence of P. A phase that reexamines characters of the run either ends with a mismatch or ends by findinganoccurrenceofP that overlaps the earlier run but is not part of it. then in phase i + 1 a prefix of P definitely matches the characters of T up to T(k). suppose P = axaaxa and T contains the substring axooxaaxaaxaxaaxaaxa. Lemma 3. and using the Galil rule in the Beyer-Moore algorithm. By a proof essentially the same as for Lemma 3. Thus.61positions.6continue past the right end of the run. which proved that gi ::0 only needed the assumption that phase i ends with a mismatch and that h < i. (We don't specify whether the phase ends by finding a mismatch or by finding an occurrence of P in T.:(lUcatenation of copies of . Thus any phase that finds an occurrence of P overlapping an earlier run next shifts P by a number of positions larger than the length of the overlap (and hence the number of comparisons). The quantity Li~Q(I[il + I) is again previous section. the left end of the new P in T must start with an interior character of some copy of. it follows immediately that in any single run the number of comparisons used to identify the occurrences of P contained in that run is exactly the length of the run. certain "interference" is plausible.

r . THE ORiGiNAL PREPROCESSING FOR KNUTH-MORRIS-PRATT 49 3. The general case Our goal istofind~pl.2.-t-I.. a serious student of string algorithms should also understand the classical algorithm for Knuth-Morris-Pratt preprocessing.3. + I then the character after a must be . Figure 3. the required prefix ends at character sp"P!' So if character P(. w~ focus on how 10 compute SPk+l ..spd. 0 3.11: ii a' k HI must be a suffixof" Sp. To explain the method.48 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3. values obtained during fundamental preprocessing of P. The use of Z. to the algorithm should find the longest proper prefix of P[Lsp. The original preprocessing for Knuth-Morrfs-Pratt 3. 1-1' . Now clearly. otherwise SPk~1 = 0 The complete preprocessing algorithm Putting all the pieces together gives the following algorithm for finding 11 and . either a valid prefix is found.ithe prefix that the algorithm will next try to compute).3. we can state this as The general reduction Fi!P-lre3.. restricting our search to ever smaller prefixes of p. By the definition of s pi. That is. 1:1:+1 place of P[I. to find fj 3. However.r .isfoundbyfindingii m.Ip. or the beginning of P is reached. + I = lal + I. since ex = P[ l .orequivalently. Hence SPk+l S SPk + I. But 1J is also a proper suffix of P[I . Eventually..sPk+.. 'P.k + 1]). The method does not use fundamental preprocessing In Section 1.1.10.9. Those two facts would contradict the definition of ''Pk (and the selection of a). The preprocessing algorithm computes SPi(P) for each position i from i = 2 to i = n ~SPI is zero).3.. the classical preprocessing method given in Knuth-MoITis-Pratt [278J is not based on fundamental preprocessing. a is the longest string that occurs both as a proper prefix of P and as a substring of P ending at position k.L1 (i. Conversely. That is. is a proper suffix of P[l .3. and let f3 = 11x denote the prefix of P of length SPk.9: Thes'ltuaflonafterflnO'ngsPk Ul Figure3. let a' refer to the copy of a that ends at position k Let x denote character k + 1 of P. SPHI = I if P(I) = P(k + I). SPHI = SPk + 1 if the character to the right of ex is . since ax would then be a prefix of P that also occurs as a proper suffix of P[I . In the latter case.3. k] (because f3x.3.k]).. values was conceptually simple and allowed a uniform treatment of various preprocessing problems. then 11 would be a prefix of P that is longer than a.And clearly. For those reasons. if ~"Pk+l "" sp. Is. where string a IS the prefix of P of length sp.lthat matches a of P[l. or else we recurse again. The easy case However.assu~ing that sp. The situation IS shown in Figure 3.known for each I S k.IP"PI + I) = x then we have found /3. Finding SPk+! is equivalent to findmg string jl.3 we showed how to compute all the Sf/i values from Z. The approach taken there is very well known and is used or extended in several additional methods (such as the Aho-Corasick method thai is discussed next). k + I]. For clarity. We should proceed as before.lpd and then check whether the character 10 the right of that prefix is ..e.r .

BUI since P(k + I) = P(SPk + P(:>p. I). else end.I times u is assigned at the for statement. and the theorem is proved. "". The value of 1! is assigned once is a suffix P[l.e. values can be easily computed from the SPi values in the algorithm below.k].. + P(k + 1) the algorithm correctly sets sp.2. We consider the places where the value of u is assigned and focus on how the value of v changes over the execution of the algorithm. Hence L' can be assigned in the while loop at most n .. prefix P[l.50 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3.spkl that also I) # P(sp~ + I). so the total amount that the value of v can increase (at thefor statement) over the entire algorithm is at most n . The/or loop executes exactly n .I.2tolldo begin v:= sp. 1 The arrows show the SPk+l How to find each time the for statement is reached. and hence the total number of times that the value of u can be assigned is at most 2(n . Theorem 3. In the classical method the (weaker) sp values are computed first and then the more desirable sp' values are derived . it is assigned a variable number of times inside the while loop.+ I) =X then SPk+-I:= u+ 1 else end. incrementing the value of k each time The while loop executes a variable number of times each time it is entered The work of the algorithm is proportional to the number of times the value of v is assigned. that value has already been correctly computed as. THE ORIGINAL PREPROCESSING FOR KNUTH-MORRIS-PRATI 51 h /"A~ x a b q a b ...3.I times. 0 x:= P(k+I).1) "'" O(n). end: If P(!. only does constant work per position._ + 1) By the induction hypothesis. While P(v+ l}#x V:=SPl'. I) =0.I times.: If P(v + I) sp.I plus the number of times it is assigned inside the while loop. So when P(sp. PROOF Note first that the algorithm consists of two nested loops.r a b r "b x a h q a b x a lIt Figure 3. values in O(n) lime u :=sp. But since the value of u starts at zero and is never negative. # Pti + I) then #-x and u"# OdD Theorem 3.. each time this loop is reached.12. Algorithm S P'(P) correal)' computes all the sp. a for loop and a while loop. and u '# OdD 3. V:=SPk.12: "Bouncing baircartccn ofo. its value either increases by one or it remains unchanged (at zero). follows: See the example in Figure 3.(PJ values in O(n) time.iginal successive assignments to the vafiablev Knuth-Morris-Pratt preprocessing.1. Each assignment of v inside the while loop must decrease the value of v. and each of the n . Hence the number of times v is assigned is n . For the purposes of the algorithm. How many times that can be is the key question. the total amount it can increase.3.I. + 1) # P(sp. sP"/'!. How to compute the optimized shift values The (stronger) sp.4. on Z values).3.isdefinedtobedifferentfromanycharacterinP Algorithm SP'(P) end. The value of v is initially zero.3. lfP(v+l)=xthen SPk+l '. the total amou nt that the value of v can detrease over the entire algorithm must also be bounded by n . total time for the II It is interesting to compare the classical method for computing sp and sp' and the method based on fundamental preprocessing (i. where n is the tength of P. Algorithm SP finds all the sp. The entire set ofsp values are found as Algorithm SP(P) ~Pl = 0 For k ce I ton -1 do begin x:= ru . character P(n + notexist.= u else SPk+l +1 := 0.

P2 already occurs in the tree. method for the exact set matching problem is based on suffix trees and is discussed in Section 7. If a node is encountered on this path that is numbered by i. Pi of P Tree IC) just consists of a single path of IPII edges out of root r. the (unique) path in lCi that matches the characters in Pi+1 in order. first find the longest path from root r that matches the characters of P2 in order. every node in the keyword tree corresponds to a prefix of one of the patterns in P. In the first case.. then number the node where the match ends with the number i + I. f\. In the second case. all edges out ofv have distinct labels.14 In either of the above two cases. That is.isparty. 3. these characters spell out PI' The number I is written at the node at the end of this path. some of the proofs are left to the reader. . It can be solved in O(n +m +k) time.J In this section. since the alphabet is finite and no two edges out of a node are labeled with the same character. start from each position I in T and follow the unique path from r in IC that matches a substring of T starting at character I . Aho-Ccrasick method. whereas the order is just the opposite in the method based on fundamental preprocessing. it is easy to construct the keyword tree for P in O(n} lime.. More than one such numbered node can be encountered if some patterns in P are prefixes of other patterns in P In general. we can use the keyword tree to search for an occurrences in T of patterns from P. To begin. find the longest prefix of P1 that matches the characters on some path from r. We will see that the latter property holds inductively for any tree IC. then create a new path out of u labeled with the remaining unmatched part of P. 13 shows the keyword tree for the set of patterns (potato. This path is unique because. the work done at any node is bounded by a Constant. Define IC. Hence for any i. 3.+I and number the endpoint of that path with the numberi + I..2 every leaf of K is mapped to by some pattern in P For example. In generaI.when when f\. _That is. and so the time to construct the entire keyword tree is O(n).4. The. a_ The insertion of pattern f\. and write number 2 at the end of that path An example of these two possibilities is shown in Figure 3. the characters on the edges out of u are distinct Ifpattern P''''I is exhausted (fully matched). iSPil_ b. During the insertion of Pi+ I. Figure 3.14: Pattern P1 Is the string P"r. If a node u is reached where no further match is possible but P'+I is not fully matched. wherek is the number of occurrences in T of the patterns from P.1. K: with five pattems from them.. we create anew path out or».1. pottery. the exact set matching problem can be solved faster than.4. Each edge on this path is labeled with a character of PI and when read from the root. labeled by the remaining (unmatched) characters of P2. and so we write the number 2 at the node where the path ends. at any branching node » inlCi. and the characters on the two edges out of the branching node will be distinct. poetry. at any branching node v of IC. as far as possible.tn create Ki+1 from lC"start at the rootoflCi and follow. That path either ends by exhausting P2 or it ends at some node u in the tree where no further match is possible.4. EXACT MATCHING WITH A SET OF PATIERNS S3 '/ Flgur&3. to be the (partial) keyword tree that encodes patterns PI.. consider how to search for occurrences of patterns in P that begin at character I of T: Follow the unique path in IC that matches a prefix ofT as far as possible.and every prefix of a pattern maps to a distinct node in the tree Assuming a fixed-size alphabet.13: KeywDrd tree b) /~ .S2 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3. Naive use of keyword trees for set matching Because no two edges out of any node are labeled with the same character. then Pi occurs in T starting from position I. O(I! + zm).. science. To create 1C2 from ICI. it takes O(IPHII) time to insert pattern Pi+) into lCi. The first method to achieve this bound is due to Aho and Corasick [9J. 1C2 will have at most one branching node (a node with more than one child). school) Clearly. An equally but more robust. Exact matching with a set of patterns each of the z patterns Perhaps surprisingly.nsertion Figure 3. to find an patterns that occur in T.

We will reduce this to 0(0 + m + k) time below.4. a walk from the root of K determines if the string is in the dictionary.16: Keyword showingthe failurelinks tree For example. For a fixed I. the concatenation of characters on the path from the root to u spells out the string. with only one subtlety that occurs if a pattern in P is a proper substring of another pattern in P.15: Keyword to itlustratethe labetof a node tree 3. The speedup: generalizing Knuth-Morris-Pratt The above naive approach to the exact set matching problem is analogous to the naive search we discussed before introducing the Knuth-Morris-Pratt method. the problem is to determine if the text T (an individual presented string) completely matches some string in P. theater. tattoo. other) and its keyword tree shown in Figure 3.15 the node pointed to by the arrow is labeled with the stringpott. where after every mismatch the pattern is shifted by only one position. Then a sequence of individual strings will be presented. the traversal of a path of K. so by successively incrementing I from 1 to m and traversing K. The strings in the dictionary are encoded into a keyword tree }C. This generalization is fairly direct. called the dictionary problem.4. That is.c(uj is used to denote the label on u. The dictionary problem Without any further embellishments. The Aho-Corasick algorithm makes the same kind of improvements. So there must be a path from the root in K that spells out . EXACT MATCHING WITH A SET OF PATfER. where k is the number of occurrences. Let v be the node labeled with the string potato Since [at is prefix of tattoo. in Figure 3. and it is the longest proper suffix of potot that is a prefix of any patterninP. PROOF K encodes all the patterns in P and. and the comparisons are always begun at the left end of the pattern.16. this simple keyword tree algorithm efficiently solves a special case of set matching."IS 55 Numbered nodes along that path indicate patterns in P that start at position I. In this special case of exact set matching. for each I. In the dictionary problem. The key is to generalize the function SPi (defined on page 27 for a single pattern) to operate on a set of patterns. takes time proportional to the minimum of m and n. ' i ! Figure 3. by definition. it is very helpful to (temporarily) make the following assumption' Assumption No pattern in P is a proper substring of any other pattern in P 3.2.3. the lp(u)-!ength suffix of L(V) is a prefix of some pattern in P.lp(v)=3. The utility of a keyword tree is clear in this context. the exact set matching problem can be solved in O(nm) time. the task is to find if the presented string is contained in the dictionary.. define lp(u) to be the length of the longest proper suffix of string Zru) that is a prefix of some pattern in P. for each one. Successively incrementing I by one and starting each search from root r is analogous to the naive exact match method for a single pattern. a set of strings (funning a dictionary) is initially known and preprocessed. The Knurh-Morris-Prau algorithm improves on that naive algorithm by shifting the pattern by more than one position when possible and by never comparing characters to the left of the current character in T. and when an individual string is presented. when possible.4. Definition For any node u of K. incrementing I by more than one and skipping over initial parts of paths in K.. We now return to the general set matching problem of determining which strings in P are contained in text T Ftgure3.54 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3. consider the set of patterns P = {po/atu. Failure functions for the keyword tree Definition Each node u in K is labeled with the string obtained by concatenating in order the characters on the path from the root ofK to node u.c(v) For example. So. .

.16 shows the keyword tree for P = {potato. and yet we may be sure that every occurrence ofa pattern in P that begins at character c ":"'lp(v) of T will be correctly detected. for node v . By the construction of T no two paths spell out the same string.~. we have to argue that there are no occurrences of patterns of P starting strictly between the old I and c -lp(v) in T. c must be incremented by 1 and comparisons again begin at the root Therefore.17: Keyword tree used 10 compute the failurefunclion -: ". c can be left unchanged.. At this point c = 8.2 in the analysis of Knuth-Morris-Pratt.3. As before.So when /p(v) :::: 0. Figure 3. and the next comparison is between character T(S) and character t on the edge below tal With this algorithm. The failure links speed up the search Suppose that we know the failure link u 1--* II" for each node u in K. avoiding the reexamination of characters of T to the left of c. labeled tat. other). That is.n. 3. and it is left as an exercise.. But does it improve the worst-case running time? By the same sort of argument used to analyze the search time (not the preprocessing time) of Knuth-Morris-Pratt (Theorem 2. the algorithm could traverse K from the root to node no and be suretomatchallthecharactersonthispathwiththecharactersinTstartingfromposition c -lp(v) . We know that string L(u) occurs in T starting at position 1 and ending at position c _ J. the text matches the string polar but mismatches at the next character.16. So I is incremented t05 = 8 . I can be increased to c -Ip(v). Instead. and thus / can be incremented to c -lp(v) without missing any occurrences. The other failure links point to the root and are not shown. Failure links are shown as pointers from every node u to node ilL' where lp(v) > O. EXACT MATCHING WITH A SET OF PATIERNS 57 string a.) How do the failure links help speed up the search? The Aho-Corasick algorithm uses the function u H. then I is increased to c and the comparisons begin at the root of K The only case remaining is when the mismatch occurs at the root.4.e-lj.) efaiture link. (Later we will show how to efficiently find those links. the comparisons should bcgin at node n. repeat While there is an edge (w. character T(e) does not occur on any edge out of u). comparing character c of T against the characters on the edges out of n..4. that argument is almost identical to the proof of Theorem 2. it is guaranteed that string L(n. to the node n.56 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3. tatoo.5.e. n. and the failure link from the node v labeled potat points J Figure 3. When I = 3.. and [p(v) = 3. in a way that directly generalizes the use of the function i t-+ SPi in the Knuth-Morris-Pratt algorithm. theater. To understand the usc of the function u t-+ n. . is the root of ftC C(v) oflcnglhlp(v).4.. With the given assumption that no pattern in P is a proper substring of another one. suppose we have traversed the tree to node v but cannot continue (i. We cal! the ordered pair (u. w:=rootofK. when no further matches are possible. In this case.3. w') labeled character begin if urls numbered by pattem f then wi= T(e) end. We also use pointer c into T to indicate the "current character" of T to be compared with a character on K. .. I may increase by more than one. so this path is unique and the lemma is proved.) matches siring T[e-lp(u) . and 1:= c -lp(w). For example. 0 Definition Definition For a node v of K let II" be the unique node in K labeled with the suffix of When lp(v) = Omen 11..4. we use / to indicate the starting position in T of the patterns being searched for.theuseoffunctionu t-+ n"certainlyacceleratesthenaivesearchforpatterns ofP.3.. Whenlp(v) = O.3). The following algorithm uses the failure links to search for occurrences in T of patterns from P Algorithm AC search 1:= I. Linear preprocessing for the failure function e:= I. and there IS no need 10 actually make the comparisons on the path from the root to node n. consider the text T = XXpOlaUOOXX and the keyword tree shown in Figure 3. w:= 11".. it is easily established that the search time for Aho-Corasick is in linear time 3. untilc >m.. Bv the definition of the function u t-+ n. Of course (just as in Knuth-Morris-Pratt)..

Now consider how lpO can decrease..c(n.58 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3. IpO can be increased by a total of at most t. During the computation of nl• for any node u. That path ends at the node v with L(V) = acat. We now relax that assumption. thenlp(u) is zero. When v is one edge from the root. the number of assignments made inside the while loop is at most t. then n •. v). then lp(v) S lp(Li) .. Now no edges out of v are labeled g. It follows that over all the computations done for nodes on the path for P.. Lemma 3. Pin For example.and since no proper suffix of acal is a prefix of acari or uc.. is node w' and we are done. Now {pO is never negative. Consider a single pattern P of length I and its path in K for pattern P. (Node Iln. If there is an edge (w. If the traversal of K reaches l' then T certainly contains the patterns tal and at end at the current c Conversely. . Then after one additional comparison the current character pointer will be set to m + I and the algorithm will terminate without finding the occurrence of ca in T. Then pattern .) is a proper suffix of C(nt") followed by x.6.r .4. x v'istheparentofvinK. It is easy to repair this problem with the following observations whose proofs we leave to the reader. The total time used is proportional to the number of assignments inside the loop. for every node v. lpO is increased by a total of at most t. We will analyze the time used in the algorithm to find the failure links for the nodes on this path. This happens because the algorithm shifts (increases /) so as to match the longest suffix of L(V) with a prefix of some pattern in P. . If one pattern is a substring of another. Ilv is the root of K.3. r Figure 3. It that edge does exist. as if the path shares no nodes with path. we arrive at the following algorithm for computing nv for a node u: Algorithmn. so over all executions of Algorithm no for nodes on the path for P. That analysis will overcouru the actual amount of work done by the algorithrn.. 0 3.4. So when the algorithm gets stuck at node u it returns to the root with g as the current character. The total time used by PROOF The argument is a direct generalization of the argument used to analvze time in the classic preprocessing for Knuth-Morris-Pratt. cal and T = acatg. during the computation of no. with exactly the same justification as in the classic preprocessing for Knuth-Morris-Pratt.labeled with character x. Consider the case when P = Ii/cart. . Note the importance of the assumption that n" is already known for every node II that isk or fewer characters from r To find n. then the algorithm may make I too large. for any other pattern in P. Suppose a node » has been reached during the algorithm.4. Those links form a directed path from the node u labeled potat to the numbered node labeled at. in O(n) rocalume). the algorithm matches T along a path in K until character g is the current character. at) along with some of the failure links. As given. repeatedly apply the above algorithm to the nodes in K in a breadth-first manner starting at the root. EXACT MATCHING WITH A SET OF PATIERNS 59 of the classic preprocessing for Knuth-Morris-Pratt.) (not necessarily proper) followed by character x. w') out of w labeled x then else Repeating this analysis for every pattern in P yields the result that all the failure links are established in time proportional to the sum of the pattern lengths in P (i.) Continuing in this way.e. and it sets / to 5. ami yet Algorithm AC search (page 56) uses the same keyword tree as before. and during all the computations along path P.• next to see if there is an edge out ofit labeled with character x. w starts at nv' and so has initial node depth equal to /p(v').Clearly. When Il" is finally set.' labeled with character . So we examine /In. While there is no edge out of w labeled x and w # dow:== n".) must be a suffix of L(n. end (while).18 shows the keyword tree for P = (potato. tarter. but it will still establish a linear time bound The key is to see how /p(l') varies as the algorithm is executed on each successive node v down the path for P. the node depth of ur decreases every lime an assignment to w is made (inside the while loop). soifw is assigned k times./p(v) S /p(l")+ I. is the character on the edge (~.16: Keyword tree showing a directed path trom potatto at through lal. However. Now let u be an arbitrary node on the path for P and let v'betheparentofv. Figure 3.' is k or fewer edges from the root. lp(l') equals the current depth of us. So the firs! thing to check is whether there is an edge (nv" w') out of node n. The full Aho-Corastck algorithm: relaxing the substring assumption Until now we have assumed that no pattern in P is a substring of another pattern in P. If there is no such edge out of Il. then [en. pOI..k and /pO decreases by atleast e. Embedded occurrences of patterns in L(V) that are not suffixes of L(V) have no influence on how much I increases. is known because nl. and hence all failure linksonthepathforParesetinO(t)time.

5 for a discussion of STS maps). repeal While there is an edge (tv... We will describe some of these in Sections 7. if ll.317 J. it returns along the path to node v and continues executing the full AC search algorithm. so the total time for the algorithm is O(n +m +k). Without going into full biological detail. The output links can be determined in 0(11) lime during the running of the preprocessing algorithm n. In this way. even though the number of traversals of output links can exceed that linear bound. whenever a nude u is encountered that has an output link from it. whenever a numbered node is encountered during the full AC search. an exact set matching problem. then the output link from u points to w. occur only once in the entire genome [II 1. each traversal The concept of a Sequence-tagged-site (STS) is one of the most useful by-products that has come out of the Human Genome Project [Ill. Presently.1. there may be some errors in either the STS map or in the newly sequenced DNA causing trouble for this approach (see Section 16. Further. An early goal of the Human Genome Project was to select and map (locate on the genome] a set of STSs such that any substring in the genome of length 100.3.2 of the book .8. otherwise v has no output link. Three applications of exact set matching i 3. The STSs orthe ESTs provide acomputerbased set of indices to which new DNA sequences can be referenced. Of course. an STS is intuitively a DNA string of length 200--300 nucleocides whose right and left ends. So the full search algorithm is Algorithm /:= c:==:\. For example.3 for more detail on cDNA) and typically reflect the protein coding parts of a gene sequence.. 234. But additionally. The work of adding uutput links adds only constant time per node. w:= w' and c :=c+ I: end. Thus with STSs. in Figure 3. an occurrence is detected and reported. and 12. an output link points unly to a numbered node. When that path traversal reaches anode with no output link. w:=root. so the overall time foralgorithmn.60 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3.5. With the output links. w') labeled r(c) begin if w' is numbered by pattern i or there is a directed path of failure links from w' to a node numbered with then report that Pi occurs in T ending at position c. all occurrences in T of patterns ofP can be detected in Oim +k) time. 9. In summary we have. map location of anonymous sequenced DNA becomes a string problem.. the number of errors should be a small percentage of the length of the STS. Note that the total length of all the STSs and ESTs is very large compared to the typical size of an anonymous piece of DNA. As before. 399]. and the path of output links from any node o passes through all the numbered nodes reachable from u via a path of failure links. and suffix tree methods 3. the algorithm must traverse the path of output links from u.the problem is just one of finding which STSs are contained in the anonymous DNA. and that will allow more sophisticated exact (and inexact) matching methods to succeed. which are STSs that come from genes rather than parts of intergene DNA. full AC search I. over both the construction and search phases of the algorithm the number of character comparisons is still bounded by 0(11 +m). Matching against a DNA or protein library of known patterns There are a number of applications in molecular biology where a relatively stable library of interesting or distinguishing DNA or protein substrings have been constructed The Sequence-togged sites (STSs) and Expressed sequence tags (ESTs) provide our first important illustration. reporting an occurrence ending at position c of T for each link in the path. Consequently.. Since no character comparisons are done during any output link traversal. When the 11"value is determined. Thus each STS occurs uniquely in the DNA of interest.4. ESTs are obtained from mRNA and cDNA (see Section 11.18 the nodes for tat and potat will have their output links set to the node for at.5) we will discuss further implementation issues that affect the practical performance of both the Aho--Corasick method. A more refined goal is to make a map containing ESTs (expressed sequence tags). Although this definition is not quite correct. THREE APPLICATIONS OF EXACT SET MATCHING 61 of an output link leads to the discovery of a pattern occurrence. Implementation output link (if there is one) at a node » points to that numbered node (a node associated with the end of a pattern in 'P) other than v that is reachable from t' by the fewest failure links. In a later chapter (Section 6. the possible output link from node v is determined as follows: If II" is a numbered node then the output link from v points to ll. But in this application. where k is the total number of occurrences. and 1:= c -Ip(w).5. With an STS map. it is adequate for our purposes. the keyword tree and the Aho-Ccrasick method (with a search time proportional to the length of the anonymous DNA) are of direct use in this problem for they allow very rapid identification of STSs or ESTs that occur in newly sequenced DNA.5.8.000 or more contains at least one of those STSs. is not numbered hut has an output link to a node w. one can locate on the map any sufficiently long string of anonymous but sequenced DNA . hundreds of thousands of STSs and tens of thousands of ESTs have been found and placed in computer databases [234 l.remainsO(n). of length 20--30 nucleotides each. Sequence-tagged-sues w:= untilc >11: II".

is a more realistic problem. The time to search for occurrences in T of patterns from P is O(m + z}. but complicate the problem a bit. This is a two-dimensional generalization of the exact string matching problem Admittedly. A transcription factor is a protein that binds to specific locations in DNA and regulates. For each starting location j of P. in Sections 9. the Zinc Finger is a common transcription factor that has the following signature' Correctness Clearly. ab) and discussed 2. The problem of matching with wild cards should need little motivating. Another important transcription factor is the Leucine Zipper. . if the number of wild cards is bounded by a fixed constant (independent of the size of P) then the following method. The above method uses this idea in reverse. We treat each pattern in P as being distinct even if there are multiple copies of it in P. which truly arises in numerous practical applications. which also is digitized. there is an occurrence of Pin T starting at position p if and only if.62 EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 63 text T. For example. In summary. where ITI = m and: is the number of occurrences. Two-dimensional exact matching A second classic application of exact set matching occurs in a generalization of string matching to two-dimensional exact matching. Thousands of times thereafter. We assume that the bottom edges of the two rectangles are parallel to each other. Hence z must be bounded by km. at position p. [For example. many transcription factors are now known and can be separated into families characterized by specific substrings containing wildcards For example. then the method does run in linear time. Then whenever an occurrence of a pattern from P is found in T. if P 11=1'/2=5'/3=7. Given a pattern P containing wild cards. a eel! can be incremented to at most k. allowing some errors. thi. all starting positions of P. one would look for occurrences of any of these 600. Two-dimensional matching that is inexact. Let C be a vector of length ITI initialized to all zeros. and the algorithm runs in O(km) time. each separated by six wild card amino acids If the number of pencilled wild cards is unbounded. based on exact set pattern matching.000 patterns in text strings of length 150. we will return either the pattern. we return to the problem of exact matching with a single pattern. However. for each i. compelling applications of two-dimensional exact matching are hard to find. Scan vector C for any cell with value k. Unlike the one-dimensional exact matching problem. leading to a nonlinear O(km) time bound.} be the (multi-)set of maximal substrings of P that do not contain any wild cards. we want to find all occurrences of P in a text T. This correctly determines whether P occurs starting at p because each string in P can cause at most one increment to cell p of C. Correctness and complexity of the method A related application comes from the "BAC~PAC" proposal [442J for sequencing the human genome (see page 418). 600. E P occurs at position j = p + Ii .1 of T. where CYS is the amino acid cysteine and HIS is the amino acid histidine. then cell 12 of C is incremented by one. I.3.000 strings (patterns) of length 500 would first be obtained and entered into the computer. problem is somewhat contrived. Let /1. which is two-thousand times as large as the typical text to be searched 3. km need not be O(m) and hence the number of times C is incremented may grow faster than O(m).. LetP = [PI. exactly one cell in C is incremented. Hence P occurs in T starting at p if and only if similar witnesses for position p are found for each of the k strings in P. the transcription of the DNA into RNA. as it is not difficult to think up realistic cases where the pattern contains wild cards. /2. There is an occurrence of P in T starting at position p if and only if C(p) = k. subpattem P. In this duction of the protein that the DNA codes for is regulated. runs in linear time: Exact matching with wild cards O.) 3. in . or both. then this provides one "witness" that P occurs at T starting at position p = j . to the problem of wild cards when they occur in 3. if the second copy of string ab is found in T starting at position 18. Exact matching with wild cards As an application of exact set matching. the pattern ab¢l¢lc¢l occurs twice in the text x aboccbabubcax. and pattern Pi starts at position Ii in P. it is not known if the problem can be solved in linear time. Complexity The time used by the Aho-Corasick algorithm to build the keyword tree for Pis O(n).5. find for each string Pi in 'P.000. Suppose we have a rectangular digitized picture T. C. either enhancing or suppressing. If pattern Pi E P is found to occur starting at position j of T. Note that in this version of the problem no wild cards appear in T and that each wild card matches only a single character rather than a substring of unspecified length. . But if k is assumed to be bounded (independent of IP I). increment the count in cell j -Ij + 1 of C by one. the number of witnesses that observe an occurrence of P beginning at p.3. which consists of four to seven Ieucincs. The study factors has exploded in the past decade. in T. P.Ii + 1. Using the Aho-Coraslck algorithm (or the suffix tree approach to later).5. text. called a wild card.2. Note that the total length of the patterns is 300 million characters. where each point is given a number indicating its color and brightness. Although the number of character comparisons used is just O(m). and we want to find all occurrences (possibly overlapping) of the smaller picture in the larger one. . We modify the exact matching problem by introducing a character e . One very important case where simple wild cards occur is in DNA transcription factors. that matches any single character. . P2. The algorithm counts. In that method.. We are also given a smaller rectangular picture P. furthermore. but its solution requires '* be ab¢¢lc¢ab¢l¢ then P = lab.) = Later. be the starting positions in P of each of these substrings (For example.

matches developed are computed to lind matches several special expressions to start in biological first with [279. alphabet can strings. assumption. expression are actually format for the string and time..6. which takes O(nm) time. applications. Section In this the Unix regular expression the problem by a given for significant substrings These been have in proteins that on PROSITE) of finding programs text string we examine specified grep. improving upon the naive method.422]. looking for an occurrence of the string n' in consecutive cells in a single column. but rather is correct for a string and that 3 replaces 1:) is not standard.. So for now. if this string is found in column 6. Just as in exact string matching.4. any and symbol example. can REGULAR EXPRESS10N PATTERN MATCHING 65 techniques of the exact settings presented type we will examine in Part 111 of the book. 10 are rows the same of 3.2. We do a similar look for occurrences at most one number for any other 3. easily be converted case As an example. q) rows of P in write the small chosen only anyone This actually the protein example.8 for more section. we want to find the smaller picture in the larger one in 0(11 + m) time. This ensures e repres. recursive contrary from the Formal PROSITE list: definitions expression assume f definition to the following for a regular formed that from an alphabet E does 1:. P same row dash x letter indicates a Single one amino known is found acid of amino substring Section complete numberi each will row of another by brackets indicates that for that position.. first lind all identical then we might rows give in These recursive rules are simple string means The Note to follow. exactly twenty in any and be chosen.e . then all exact occurrences of P in T can be found in O(n + m) time. can some explanation. and R= S= . cards discussed that the rows label of in the previous section. at most be written In the . let n be the number of points in P.6. alphabet not contain *. 1fT and P are rectangular pictures with m and n cells. For a sophisticated treatment of two-dimensional matching see [22] and [169]. and amino from For acids which a group must are separated acids A be but Then.. array M with dimensions as T. later we will relax this proteins. and let n' be the number of rows in P. in appear A formal defi- of a regular expression some is given PROSITE expression proteins: specifies of the a single use Since of P among alphabet) string To do this. two-dimensional in more The complex method matching and as an illustration of how to more set matching two-dimensional and [661. where O(IIm) is the time for the obvious approach. +. starting at row 12 and ending at row n' + 12. problem Note with the similarity wild between this solution and the solution to the exact will matching 3. ). found and follows approach given Since sets of substrings as regular The particularly have been then. In phase where 2.42]. just inclusion that the but as part of a regular above in PROSITE to do so.q) of row P ofT. but may of length need zero). T' O(m). over Then +c + t)ykk(p + vdt(l + z + is a regular expression E.000 amino granin acid Whenever an occurrence in position (p. only will Forexample. problem. methods.2. many additional methods have been presented since that improve on those papers in various ways. A distinction be discussed in the exercises Now and give them phase cell of suppose them P are not is easily thing of row all distinct.64 more complex EXACT MATCHING: A DEEPER LOOK AT CLASSICAL METHODS 3. utility patterns.ents the empty then zero (i. have the identifies [ED]-[EN]-L-[SAN]-x-x-[DEJ-x-E-L Every by a enclosed string specified Each capital by this regular specifies that of the regular appear expression expression has amino of those acids proteins. respectively. T O(n + m) i of time. to the present but is closer definition to the way example can that regular given approach it takes O(n +m) in many conform In summary. Baircch (see match in [41. all the label one. more complex. For Simplicity.6. However. the P as a separate Aho--Corasick is rectangular. add search an end each and for all occurrences of row these row of in marker rows together of (some to Assume for now that each of the rows of P are distinct. For example. exact 3. of which a to each of length algorithm all can rows use T concatenate treating for width.• the string that the If Risaparenthesizedregular any number of times (outside specified does not not rows 6 and phase that a look 6 expression. Phase two can be implemented in 0(11' +m) = O(n+m) time by applying any linear-time exact matching algorithm to each column of M This gives an O(n + m) time solution to the two-dimensional exact set matching I. and of in it during that are identical. Theorem 3.ifrows3. 1.2. (including of R' times).6. database. of P is assumed in any cell and because P is rectangular. done during and Then.. don't tree for the row patterns). several developed patterns ofa databases by Amos referred to as a we view be used problems. because the problem as slated is somewhat unrealistic. one a few of these by the ENLSSEDEEL of another to be distinct of M. 3. of character method rows not text the is divided in the into two the phases. can be specified such constructed to hold is the major 15. in human describes ten positions.J Many importan~ In Regular way expression pattern matching sometimes in biosequences. is found at position same expression 192. PROSITE database strings. and one of the strings to regular regular sequences expression.. (. Let In be the total number of points in T. The each form pattern. a set of substrings. expression of parentheses R be repeated expressions M have written I.12). of a simple The following particular regular family expression.416. 10. rows ofT.3. row In the of first and phase. and the and we first takes to search all occurrences so no T' of any is row P. of granin It is helpful nition an example later. Because number is specified regular proteins. then P occurs in T when its upper left corner is at position (6. The symbol a common (this the construction of the keyword Then. 2 . 10. etc. rows Hence of phase al! occurrences starting the of complete (p.5. (a let 1: q)* be the alphabet of lower t)(pq) English characters. A regular expression is a ~attem. We now give a formal. we will not discuss the newer. as an introduction the basic realistic in [44] to specify a set of related (patterns) expressions. scan each column of M. second phase. .1. n' in It is easy to verify the columns that this M.

where k is the total Theorem 3. complete that if no further the correctness proof of the Aho-Ccrasick then O(e) of time (see Exercise I can be set to in T To search nonprefix the regular substring T that matches zero) expression in (including of any character of T that matches R. The graph has a SImi node s and a termination node I. the search algorithm rule for finding from NU - I} and character if and only Therefore.il· the sets for this each 11. R. ~\:/~ EXACT MATCHING: A DEEPER LOOK AT CLASSICAL Figure3. n prefixes in time linear in the length 7.19: Directed graph for the regular expression (d+ 0+ g)«(n METHODS 3. Find examples M(j) could always where be set that this Now. the "randomn ess" of the text or pattern.e. Each s 10 I path in G(R) specifies a string by concatenating the characters of 2: that label the edges of the path. To specify S. to determine whether each prefix is semi periodic and with what inthe the lime should establish be linear. then utility of the optimization Using the assumption another That c -Ip(v) pattern is. method. In the Apostolico-Giancarlo length of the (right-te-Ieft) the algorithm algorithm learns would the exact method. the size of the alphabet.. preux P[t . 13. substring in T that matches whether of determining (unspecified) plus all nodes a node R.it may be better match.2. Argue of Boyer-Moore. since the (/+z+l') R by a directed graph G{ R) (usually called a nondeterministic. but modified Again. to comparethec Evaluate haracters first and N if the two characters this idea both theoretically aykkpqppvdrpq repeated four is a string and specified the empty by R.. r* represents any number of repetitions 1:. If G(R) - contains T(i)] e edges. G(R) from R are simple and are left then as an exercise. It easily follows if there is path in G(R) from s that prefix T[ I . Make this precise. we now have the following: and the comparisons at node n. which may be large. The is that P is substring matches resumed free (l.6. 6. was In the Apostolico--Giancarlo then examine M and and empirically.. we first consider prejU of T matches R. of G(R) 8. string e was the choice the subexpression 4. The Aho-Corasickalgorithm . prove that algorithm iteration O(me). the method so that it runs in the same time. compute the time reason is finished) can be viewed of the Knuth-Morris-Pratt and quantify the preprocessing T part PST. then il is possible 10determine whether T contains a substring matching R in O(nm) time. By being the Let Cole's 9. 11 is a periodic is one) such that of P. Similarto Pin of what was done in Section method to the string fact. where from 29). and each edge is labeled with a single symbol from 1: U €. E P is a substring algorithm of i [finding for 1) and character can be implemented Pi E P). il matches R if and only discussion. and otherwise + k) time. Show Hint: That prefixes of P. Show how the text and the pattern. to run in T. that no pattern are possible at a node missing Pi v. of the mismalch in all cases.5. With this detail. simply search for a prefix 1:* R. N(i) a the length on i that a node v is in N(i) 1. The rules for It is very useful to represent a regular expression constructing if a regular edges. or is the linear the example bad character time bound Consider of T "" abababababababababab rule the classical method version P""xaaaaaaaaawithoutalsousingthe 10. only the weak Can the argument rule is used? be fixed. In the Apostolico--Giancarlo to modify +fl sizen.2 n showing the equivalence of the two definitions the of serraoenocnc strings. s Cole's worst-case that are reachable if s by traversing from edges edges labeled labeled e .7.~N~~. match MU) is set to be a number T starting at position jof less than or equal to the T.7. Solve Searching To search simpler N(O) from for a problem for matches the regular some expression problem as above period. u is in set u can be reached by zero set or more some node in N(i gives I) by traversing a constructive by induction an edge NU). but in place of M uses an array of + o)w)*(c +1 +€)(c 3. This In general. we also want to know the period. labeled TU) followed set NU) rule is used. tothe change full length would seem to be a good thin gto do. the conetan tintheO(m)boundfrom uneer-nme where analysis of the Boyer-Moore bound breaks algorithm down if only the weak Boyer_Moore simply untrue shift when and be the set of nodes node consisting of node f. Exercises 1.andthis result in a correct less than the length of the match. Given N(i) T(I). The set of strings specified by all such paths is exactly the set of strings specified by the regular expression R. G(R} constructed 2n For each of the string. Evaluate Giancarlo sumptions empirically method should the under include speed different of the Boyer-Moore about method against the ApcstclicoThese as" assumptions etc M is of size m. for i > 0. can have the same probtem that the Knuth-Morris-Pratt algorithm phase of the Aho-Corasick algorithm in O(m runs in Oem) time if no pattern substring of another. 1l can be written how to determine z-aiqonthm the same ee c. for i from is of T.. In P part of the Knuth-Morris-Pratt as a slightly to the applied algorithm N(O of contains node the above all prefixes T that match t.19. (an be It is easy 10 show using at most that expression of Pand sets the value to be strictly location of the match.66 .for some string this for all Of course. show that applying Knuth-Morris-Pratt 01 preprocessing PST gives a linear-time optimized of to find all occurrence (after the preprocessing ends at u and generates if the string T[l·. Show more careful bookkeeping. EXERCISES 67 3. finite state automaton). is. R has n symbols. Prove that the search in P is a proper number of occurrences. An example is shown in Figure 3. the level of periodicity of the text or pattern. Then explain simulation why this was not done in Ihe algorithm Hint: 11'5the lime bound 5. the subexpression specified (p+q) by of R times.1. for each k > 1 (if there The details are left as an exercise and (an be found in [lOJ and [81 P[t . m is the length N(i of the text string T. Prove Lemma 3. we want to know whether i we want to know the largest (t. array 2. to find 0 tom. and the regular expression R contains n symbols. If T is of length m. method. without any occurrences ofpalternsfromP 12.

30.3 in all the small m be the O(n number how to solve this problem have the same width. Suppose currences number that relate to the case of patterns thai are not substring picture. EXERCISES 69 upper left corner in any cell whose it only uses sp values rather than 51" values. Does + this work? tenoo. j _ i counter 3. Lemmas 3. What time bound method is used once for would result? of 24. A better failure function tailurefunction. that Knuth-Morris·Pratt used. tor example. How are reflected time.68 has when Figure in EXACT MATCHING: A DEEPER LOOK AT CLASSlCAL METHODS (i'. be avoided? Biologica I strings are always algorithm done in the two-dimensional when identical rows were first found a nnear-ume be avoided. Suppose else in the patterns.Extendthediscussionandthetheoremtocoverthetaskofexplicitly State the time bound as the sum of a term that is on that number. have the property that if Rcontainsnsymbols • then G(R)contains more carefully to relate the running time of the algorithm Since the directed time.2 anc 3.4. and il not what are the problems? array C (which time. Another approach to handling algorithms. That to what is the closure +. that the set NU) can be naively that the improvement found from N(i . can 27.1 separately considers the path in K for 28. This suggests in O(ne) reduction edges. nowhere issues.5. However. This results in an overcount of the time actually used by the algorithm to the number 29. when algorithm wild cards the analysis in the proof of Theorem 3. in P is a substring of patterns of another to the both 32.asdefined it may be more efficient 10 modify Develop the definition that idea and ofa regularexpr pattern eeson the wild card handled symbol. P. range Explain The formal how. such regular of the 20. Prove free. small we have of points Discuss pictures m) lime? pictures rectangular and we want to find all ocpicture. showing would avoid this situation. Lst n bethe suppose total all the of points in the large thai k. to explicitly include can be efficiently lor the wild card problem. DiSCUSS the problem are permitted 18. that no pattern tothe when substrings PROSITE finite three. to develop shift rules and preprocessing approach 33. symbol "*" can always However. wild cards are permitted in the text but not in the pattern in the text for this task is Ole). Explain hence the importance. where does not include the length perhaps specifications. explain how wild cards matching in the text. to the case when the bottom re. and such duplicates in linear time.3). Then the required directed As a simplification. Then declare that P occurs in 3. can or the utility. expression range over the length of the more how such the regular in the that the wild card can match any length substring. Theorem finding 3. or in both? 22. What can you say about in the text. The time analysis each pattern Pertorm of nodes inK 17. rather than Aho-Corasick Show how to extend the iwo-dlmensronat the rectangular pattern is not parallel What of the two bottoms is known. Explain I N(i)1 = how this no f O(n) for any i.D. Perhaps a counter occurs we can omit phase at each in row jofthe two of the two-dimensional picture. and or Boyer-Moore can handle discuss 23. 16. 15. independent of the number plus a term that depends the problems if you see them). in a + t). such an improved 14. that is. expression graph Show that one can still search for a substring m is the length of T and e is the number wild cards in the pattern Does this is to modify the Knuth-Morris-Pratt methods that seem promising? Try it. q > 1 small (distinct) pictures pictures rectangular in a larger and efficiently.4.4. of matches if T contains a substring that matches wild cards in the pattern. + m) time suffices. ofE edges E edges always Explain in the graph G(R). of the wild card algorithmisd can be found duplicates matching Does and removed from problem ue to duplicate is similar copies of to the nonlinear in P. of time is achieved. Give an example grow lasterthsn Why not? Can you fix it and make it run in O(n of occurrences in T of patterns in set P.1) and T(i) is trivial if G( R) contains when a. but finite that CD can repeat can be expressed increase expression? Show or four times.7. and try to use it 10 obtain 31. in P. range specifications do those concise directed specifications PROSITE graph each sIring in P that appears correctness 21. in both the text and pattern time behavior by first removing string Consider label. The at Show how to construct construction mostO(n)edges should G( R) from a regular expression Pin P. starting When at position matching Keep picture for cell cell of the large large picture we find that row i of the small (i'.buttheorientation lar? as follows: ounter is not rectangu method to the bottom of the largepictu happens if the pattern 26.16 where the edge below the character ooteso is direct ed to the character Give t hedetailsforcompuling becomes n' (the number of rows of T with Pl.1 states the time bound for determining and outputting all such matches. expression. the same by an extension patterns often of the regular specify expression algorithm can repeat either ina as a two. and O(n+ m) time bound for the two-dimensional Give a complete matching method proof of the correctness described inthe text (Section 3. In the wild card Could case we instead problem simply we first assumed the algorithm reduce the case For example. Suppose in the two-dimensional matching problem being matching each pattern 25.6. are permitted). concise range the number CD(2-4) ola definition much inthe of times indicates regular that a substring expression one. Wildcardscanclearlybeencodedintoaregularexpression.lf Rdoes not contain finite. and then we extended when they are not? and complexity case when that assum ption does not hold are allowed Consider we just add a new symbol Does it work? to the end of of numbers. how this simplifies the searching this approach tt work. exact matching with these kinds of wi Id cards in the pattern. Try to make the growth of any of the qsmall as large as possible.edges of T that matches of edges acter. For example. it is tempting approach "fix up' the method and given a single method 19. A. is of length m > nj Show how to modily by a list of length the wild card method while keeping by replacing running n.incrementthec . (and solutions aregularexpressionR. show that graph for the input size n. graph G( R) contains the time stated Explain O(n) edges when R contains n symbols. Since strings (and solution if you see one) of using the Aho-Corasick and b. ain This is shown. rather than just a single char- specifications in O(me) for the re gularexpression(. the number Be sure you account O(n+ m).

The main primitive operation in each of those methods is the comparison of two characters. Baeza-Yates and G. Recall that pattern P is of size n and the text T is of size m because prefixes of P of lengths one and three end at position seven of T. string matching methods based on bit operations or on arithmetic. j) should be 1 if and only if the first i-I characters of P match the i-I characters of T ending at character j . For the algorithm to compute M it first constructs an a-length binary vector U(x) for each character x of the alphabet I for the positions in P where character . For example.1 and character P(i) matches character T(j). the = I if and only if an occurrence of P ends at position j ofT. the use of the Fast Fourier Transform in string matching: and the random lingerprint method of Karp and Rabin Array M is constructed column by column as follows: Column one of M is initialized to all zero entries if T(1) # P(I). the vector for column j is obtained by the bitwise AND of vector Bit-Shiftt j . j . To see in genera! why the Shift-And method produces the correct array entries. The Stuft-And method R. which has a U vector of When the eighth column of M is shifted down and anA~D resultis is performed with Ural. the entries for column j > 1 are obtained from column j . There are. if we let M(j) denote the jth column of M.1 with entry i of the vector U(T(j».1) is I. and the second condition is true when the ith bit of the U vector for character T(j) is 1. the algorithm ANDs together entry (i .1. The eighth character of T is character a. then M(j) = Bit·Shift(j . The first condition is true when the array entry for cell (i . Otherwise. Figure 4. however.I. as well as most of the methods that have yet to be discussed in this book.4.1. rather than character comparisons.1: Column j- 1 before and after operation Bir-Shif!(j _ 1) 4. Gonnet [35] devised a simple. These methods therefore have a very different flavor than the comparison-based approaches. 70 . More formally. but it seems more natural to call it Shift-And. when T(1) = P(l) its first entry is I and the remaining entries are O.2.1) with the U vector for character T(j). Arithmetic versus comparison-based methods All of the exact matching methods in the first three chapters.r appears.1) AND U(T(j»). if P 10100010. are examples of comparison-based methods.1 shows a column j- I before and after the bit-shift which is the correct ninth column ofM.2. How to construct array M 4. hence computing the of M solves the exact matching problem.I. For example.1. even though one can sometimes see character comparisons hidden at the inner level of these "seminumericai" methods. For example. THE SHIFf·AND METHOD 71 4 Seminumerical String Matching Figure 4. if P = abuac and T = sabsabaaxa then the eighth column of M is 4. After that. observe that for any i > 1 the array entry for cell (i. We will discuss three examples of this approach: the Shift-And method and its extension to a program called agrep to handle inexact matching.I) of column j .ln particular. Hence the algorithm computesthecorrcctentriesforarrayM. They call this method the Shift-Or method. j . By first shifting column j . bit-oriented method that solves the exact matching problem very efficiently for relatively small patterns (the length of a typical English word for example).I and the U vector for character T(j).2.

j) = I then there is an occurrence of P in T ending at position j that contains at most k mismatches. ugrepisextremelyfast. it also occurs with four mismatches starting at position two. For a small number of errors and for small patterns. and otherwise it is set to MC(i . For example. that amplifies the Shift-And method by finding inexact occurrences of a pattern in a text By inexact we mean that the pattern either occurs exactly in the text or occurs with a "small" number of mismatches or inserted or deleted characters. Match-counts are useful in several problems 10 be discussed later. The zero column of MC starts with all zeros. so that a column of any M! fits into a few words.i + I . The case of permitted insertions and deletions will be left as an exercise. In ogrep. We let MkU) denote the jth column of Mk. but it certainly is practical and would be the method of choice in many circumstances zero column of each array is again initialized to all zeros.1.£. How to compute Mk Let k be the fixed maximum permitted number of mismatches specified by the user. There are several ways to organize the computation and its description. agrep is very efficient and can be used in the core of more elaborate text searching methods. This extension uses 8(nm) additions and comparisons.3. for every j we will compute column j in arrays . I. In particular.•. Mk are computed. j) to allow up tok mismatches. However.the larger k is. and the next pair of characters in P and T are equal. j ]. j . The efficiency of the method depends on the size of k .. but each MC(i. replacing the AND operation with the increment by one operation.2. of P with T.2. Thus when the pattern is relatively small. . and in worst case the number of bit operations is clearly EJ(mn). The match-count problem and Fast Fourier Transform to compute for pairi. with at most 1 mismatches. packaged into a program called agrep. every column of M and every U vector can be encoded into a single computer word. let us define a new matrix MC That is.3.j the number of characters TU . Any entry with value n in the last row indicates an occurrence of P in T. for reasonable sized patterns. MO is the array M used in the Shift-And method.11 in increasing order of I.72 SEMINUMERICAL STRING MATCHING 4. Manber [482] devised a method. j . • The first i-I mismatches characters of P match a substring of T ending at j .1) + 1. just incrementing by one. the last row indicates the number of characters that match when the right end of P is aligned with character j of T. with at most I- I 4. In this section we will explain ogrep and how it handles mismatches. Hence. the slower the method. The method will compute MI for all values of 1 between 0 and k. if and only if one of the following three conditions hold: • The first matches i characters of P match a substring of T ending at j. it is a natural introduction to the next topic. Then the jth column of computed by: is Intuitively. a value of k as small as 3 or 4 is sufficient. the method is very efficient if n is less than the size of a single computer word. For clarity. For each position j ::: /I in T. and k is also small. match-counts.1. These are very fast operations in most computers and can be specified in languages such as C Even if n is several times the size of a single computer word. the pattern at cgaa occurs in the text aatatccacaa with two mismatches starting at position four.2.Ml. This computation is again a form of inexact matching. as was true of agrep.3. and over the entire algorithm the numberofbit operations is O(knm). If Mk(n. Wu and U. which is the focus of Part III. although each addition operation is particularly simple. the . only a few word operations are needed.1. the Shift-And method is very efficient in both time and space regardless of the size of the text From a purely theoretical standpoint it is not a linear time method. In addition. Column j only depends on column j . Therefore. such as single English words. Inexact matching is the focus of Part HL bur the ideas behind agrep are so closely related to the Shift-And method that it is appropriate to examine agrep at this point. and the method is extremely fast. As in the Shift-And method. The problem of finding the last row of MC is called the match-count problem. 4.2. M2. j) is the natural extension of the definition of M(i. but the most important information is contained in the last row of Me. with at most 1 . For many applications. . j) entry now is set to MCU . this just says that the first i cbaracters of P will match a substring of T ending at position i.4.I.the user chooses a value of k and then the arrays M. M((i.1) if P(i) #. 4. so all previous columns can be forgotten.shifting by one position and ANDing bit vectors. Furthermore. If we want to compute the entire MC array then (-)(nm) time is necessary. ·.I. the solution is so connected to the Shift-And method that we consider it here.' PROBLEM AND FAST FOURIER TRANSFORM 73 M' 4.I misI • The first i-I characters of P match a substring of T ending at j . the practical efficiency comes from the fact that the vectors are bit vectors (again of length n) and the operations are very simple . but for simplicity we will compute column j of each array MI before any columns past j will be computed in any array. and both the Bit-Shift and the AND operations can be done as single-word operations. .1. with at most mismatches. Shift-And is effective for small patterns Although the Shift-And method is very simple. only two columns of Mare needed at any given time.T(j). but values less than n count the exact number that match for each of different alignment. THE MATCH-COU!'. In that case. Further. agrep: The Shift-And method with errors s. It is simple to establish that these recurrences are correct.'·1 A simple algorithm to compute matrix MC generalizes the Shift-And method.

3. i). The extra zeros are there to handle the cases when the left end of a is to the left end of fJ and. so yea. We still assume. when a = P and fJ = T the vector Yea. The solution will work for any alphabet. i) for each i. and all other characters become Os. To compute Va(a. Using Fast Fourier Similar definitions apply for the other three characters. position Pa to start at position i of a" and count the number of columns where both bits are equal to I. f3) = VAl>". For most problems of interest log m « 11. SEM1NUMER1CAL STRING MATCHING 4. fJ) contains the information needed for the last row of Me. where the indices in the expression are taken modulo 11 + m. .respectively. but we treat the FFI" itself as a black box and leave it to any interested reader to learn the details of the FFT. all we have done is to recede the match-count problem.Ifi e c rhen we have 1011000100010 10010110 Clearly. Negative numbers specify positions to the left of the left end of fJ.(a. Surprisingly.c. Further. For this more general problem the two strings involved will be denoted a and fJ rather than P and T.just directly count the number of resulting matches and mismatches).. i) can be directly computed in O(n) time (for any i. fJ. suggesting a way to solve these problems quickly with hardware. Convert the two strings into binary stnngs o. This is where correlation and the FIT come in. fJ} Transform for match-counts The match-count problem is essentially that of finding the last row of the matrix Me However. since the roles of the two strings will be completely symmetric. For example. This approach was developed by Fischer and Paterson [157J and independently in the biological literature by Pelsenstein. one for each character in the alphabet. and [99J. j~'Hm-l fJ. if i = 3 then we get 1011000100010 10010110 andtheV.fJ. and VoCO'. no number requires more than O(1ogm) bits). that lad = n ::: m = Itil The problem then becomes how to compute Va(a. Enough zeros were padded so that when the right end of a is right of the right end of fJ. THE MATCH·COUNT PROBLEM AND FAST FOUR(ER TRANSFORM 75 A fast worst-case method for the match-count problem? The high-level approach We break up the match-count problem into four problems. so this technique yields a large speedup. fJ) can be computed in O(IIIll) total time. 21123456789 accctgtcc aactgccg then the left end of a is aligned with position -2 of fJ Index i ranges from -n + I to m. V(O'. The O(mlogm) method is based on the Fast Fourier Transform (FFT). (3.3.gofDNA Vll(o:. and this receding doesn't suggest a way to compute Vo(a. 13) + V.3. f-l) in O(mlogm) total time by using the Fast Fourier Transform. Notice that when i > m .1. when a is aligned with fJ as follows. For concreteness we use the four-letteralphabeta. but rather we will solve a more general problem whose solution contains the desired information. and Koclun [152J. when the right end of a is to the right end of {3. but the operations are still more complex than just incrementing by one. We now show how to compute V(a. But it contains more information because we allow the left end of a to be to the left of the left end of 13. fJ) more efficiently than before the binary coding and padding. Hence no "illegitimate wraparound" of 0: and 13 is possible.74 4. [59).(a. fJ.11. and we also allow the right end of a to be to the right of the right end of fJ. the corresponding bits in the padded r"ia are all opposite zeros..3)=2. Sawyer. there is specialized hardware for FIT that is very fast. For any fixed i . i) = L j~1l r"iQ(j) x i3"U + j). Can any of the linear-time exact matching methods discussed in earlier chapters be adapted to solve the match-count problem in linear time? That is an open question. The extension of the Shift-And method discussed above solves the match-count problem but uses 8(nm) arithmeticoperationsinallca~es. the right end of a is right of the right end of 13. With these definitions. [581. the match-count problem can be solved with only O(m log m) arithmetic operations if we allow multiplication and division of complex numbers. and p". conversely. i) is correctly computed So far. where every occurrence of character a becomes aI. We will reduce the match-count problem to a problem that can be efficiently solved by the FIT. f3) + V. Other work that builds on this approach is found in [3]. fJ. let a be acaacggaggtor and fJ be accacgaag. but it is easiest to explain it on a small alphabet. The numbers remain small enough to permit the unit-time model of computation (that is. and positive numbers specify the other positions.For example. For example. Then the binary strings ii~ and Po are 10110001000(0 and 100100110. V(a.l.2. we will not work onMe directly. however. 4.(o:. f3) + V~(o:.

X Clearly.4. and the Shift-And method assumes that we can efficiently increment an integer by one. /3. jJ. and its use in the solution of the cyclic correlation problem. Say jJ.2. i} is computed by replacing each wild card with I.wherethe indices in the expression are taken modulo z. I} = 7 (i. For example. /3. KARP-RABIN FINGERPRINT METHODS FOR EXACT MATCH 77 Cyclic correlation Definition Let X and Y be two a-length vectors with rea! number components indexed "" where for each character characterx.r . the problem of computing vector Va(a.f3). let ~2" 'PU} H(Pl= Similarly. z = n +m. The first result will he a simple linear-time method that has a very small probability of making an Recall the wild card discussion begun in Section 3. simply replace each wild card symbol in fi with the character x. is beyond the scope of this book. So if for a fixed i there arek places where two wild cards line up. what character in c is opposite the j position inr)). for two vectors each of length z. then the computed L. the converse is also true. require 8(Z2) of computing Va(a. so Theorem 4. This is surprisingly efficient and a definite improvement over the I.r .}=6 Clearly. Then {3.). cyclic correlation is solved by reversing one of the input vectors and then computing the convolution of the two resulting vectors.lndetaii.1.3 a different. But cyclic correlation is a classic problem known to be solvable in 0(:: log z) time using the Fast Fourier Transform.. and compute V<I>(a. i) for every position i and character . but the key is thai it Solves the cyclic correlation problem in 0(: log z) arithmetic operations. How do we incorporate these wild card symbols into the FFT approach computing match-counts? If the wild cards only occur in one of the two strings. i} for each i .Insummary. Similarly. This works because for any fixed starting point i and any position j in/3.andr =2.) The FFT method. where the wild card symbol ¢ matches any other single character. that we can also efficiently multiply an integer by two.5.(a. (J. i) would be too large. i) for each r = a. {3) is exactly the cyclic correlation of padded strings tia andPa./3. If two wild cards are opposite each other when ~ srarts at position i. So it is not much of an extension to assume. then Vx(a.4.I of tr (Le. But if wild cards occur in both a and fj.j .n=4.4.(a. more comparison-based approach to handling wild cards that appear in both stnngs.4. V. consider P to be an n-bit binary number. then this direct approach will not work. There is an occurrence of P starting at position r of T if and only if R(P) = H(T. after discussing suffix trees and common ancestors. since those two symbols will be counted as a match when computing V. . andg.' Handling wild cards in match-counts 4. operations. if ~ = ag¢¢ct¢a and f3 = agctctgt. i) for at most one character .1.thenH{T. Arithmetic replaces comparisons starung nt Definition For the binary pattern P.-)(nm) bound given by the generalized Shift-And approach. (The FFr is more often associated with the convolution problem for two vectors. Theorem 4.the jth position will contribute a I to V. jJ. The cyclic correiatian of X and Y is an z-Iength real vector W(i) x Y(i+ j). i) value How can we avoid this overcount? The answer is to find what k is and then correct the overcount. consider T. based only on the definition of cyclic correlation would Now an algorithm Later. in addition to being able to increment an integer. (3. If we treat a (row) bit vector as an integer number then a left shift by one bit results in the doubling of the number (assuming no bits fall off the left end). With that added primitive operation we can turn the exact match problem (again without mismatches) into an arithmetic problem. fi. if there is an occurrence of P starting at positionr ofT then H(P} = H(T. However. we will present in Section 9. then the solution is very direct When computing VA~.) However. if an w= Y = tJ~." to be an n-bit binary number.c..3.e. . The idea is to treat ¢ as a real character. then V(a. so again no progress is apparent. 4.(a. if P = 0101 then n = 4 and H(P} = 23 X 0 + 2' x 1 + 21 x 0+20 x! =5.76 SEMINUMERICAL STRING MATCHING 4. The match-count problem can be solved in O(m logm} time even unbounded number of wild cards are allowed in either P or T.1. the FFT requires operations over complex numbers and so each arithmetic step is more involved (and perhaps more costly) than in the more direct Shift-And method. In fact. all positions are counted as a match except the last). but cyclic correlation and convolution are very similar. Hence it solves the match-count problem using only O(m logm) arithmetic operations.ifT = IOII0101O.r.c. Vx(a. For example. depending on what character is in position i -i. let That is. and the problem = fia. Karp-Rabin fingerprint methods for exact match method assumes that we can efficiently shift a vector of bits. i) will be 3k larger than the correct V(~.

KARP-RABIN FINGERPRINT METHODS FOR EXACT MATCH 79 7 = The proof.then Moreover. but the if part does not. By reducing H(P) and modulo a number p.? The problem is that the required powers of two used in definition of H(P) and H(T. yet large enough that the probability of a false match between P and T is kept small.) = H(T. It is a randomized method where the only if part of Theorem 4.e. 4. me x 2mod7+0=2 2x2mod7+!=5 5 x 2mod7+1 =4 4 x 2mod 7+ 1 = 2 2x2mod7+1=5 5mod7=5.j.bu{{hetimetodo{heupdatesisnNthei. each successive power of two taken mod can be computed in constant time and each successive value Prime moduli limit false matches Clearly. I than by following the definition directly (and w< will need th. but the number 2" requires n bits. rr(lI) is The following theorem is a variant of the famous prime . HAT. but that the intermediate numbers are always kept small. if P occurs in Tstaningatpositionr then Hp(P) = H1. Hp(Tr) can be computed from Hp(Tr_1) with only a small constant number of operations. this can be computed as follows: I H(P) = 47 and Hp(P) = 47 mod 5. We state the needed properties of prime Definition equalto a.. the (fpart will hold with high probability. one can never-reduce too much). Since Hp(T. every fingerprint remains in the range 0 to P -1. For example. that preserves the spirit of the above numerical approach.) =2 x H(T.). the computation of H(P) and H(Tr) wil! not be efficient.j = The general idea is that.2. the use of such large numbers violates the unit-time random access machine (RAM) model. The arithmetic will then be done on numbers requiring only a small number of bits. H(T. The following definitions and lemmas make this precise. However.> 1. But if H(P) and ff(T. instead of working with numbers as large as H(P) and H(T. called the randomized fingerprint method. the largest allowed numbers must be represented in O[log(n + m)] bits. . Rabin [2661 published a method (devised almost ten years earlier).1 converts the exact match problem into a numerical problem.. 2" mod p=2 X (2"-1 mod pI p mod p. is an immediate consequence of the fact that every integer can be written in a unique way as the sum of positive powers of two. then numbers as large as r" are needed.4. Fortunately. comparing the two numbers H(P) and H(T. Instead. Fingerprints of P and T it follows that Hp(T. if P = lOllil andp = 7. Karp and M.4.) Therefore.1 continues to hold.\7. when the alphabet is not binary but say has t characters. But unless the pattern is fairly small. even greater efficiency is possible: For r .. modular arithmetic allows one to reduce at any time (i. Theorem 4. out 1l0W the converse does uot hold for every That is. This is explained in detail in the next section.4. The key comes from choosing p to be number in the proper range and of prime numbers.. For a positive integer the number of primes that are less than or number II.). (From the standpoint of complexity theory. we will work with those numbers reduced modulo a relatively small integer p. Thus the required numbers are exponentially too large. but that is extremely efficient as well.{lateton). Theorem 4.) Even worse.4. ~ ::s n(u)::S 1. In 1987 R."uehere theorem.(T. In that model. using numbers that satisfy the RAM model.). so that the following generalization of Horner's rule hold.where In(u) is rhebase e logarithm ofu [383J._d - -1) + T(r +n - I).78 SEMINUMERICAL STRING MATCHING 4. the utility of using fingerprints should be apparent. The point of Horner's rule is not only that the number of multiplications and additions required is linear. since that computation can be organized the way that Hp( P) was.j must be computed before they can be reduced modulo p.) rather than directly comparing characters. One can more efficiently compute H{T.) Hp(T.2. Intermediate numbers are also kept small when computing Hp(Tr) for any r . so the size of does nor violate the RAM model.4. a fingerprint The goal will be to choose a modulus p small enough that the arithmetic is kept efficient.) mod 2"T(r p H(T. then we have the same problem of intermediate numbers that are too large.+I I from H(T. [(2 x H(Tr_Jl mod p) - (2" mod p) x T(r - 1) + Ttr +n .) grow large too rapidly.lI] mod p Further.26. we cannot necessarily conclude that P occurs in T Already. which we leave to the reader. and so will be efficient But the reallv attractive feature of this method is a proof that the probability of error can be made sm~ll if p is chosen randomly in a certain range.

2).. which is greater than the product of the first rr(lI) primes (by assumption that k > JT(ul). if u ::: 29.. the fingerprint algorithm runs in O(m) time. be reasonable not to bother explicitly checking declared matches. 7. . whet now is the probability of a false match between P and T~ One bound is fairly immediate and intuitive currences Theorem 4. However. however. why not use several? Why not pick k primes PI... Fer each posnion r in T. 5..4. and the lemma is proved.. then than n(u) (distinct) prime divisors .4 .(T. 11. the pro- Notice that Theorem 4. If the numbers are equal. where nand m are the Corullary 4.23. It may.r Random fingerprint 1. bUI Hp. the probability of a false march is at most ~ number theorem (Theorem 4. excluding any time used to explicitly check a declared match. 17. fer II = 29 the prime numbers less than or equal 1029 are 2.4. Given the fad that each Hp(T. the product of the primes less than or equal to II is greater than 2" (by Lemma 4.(P) = Hp.2).80 SEMINUMERICAL STRING MATCHING 4. we have to answer the question of what I should be How to choose I Theorem 4.) for each of the k primes. But QIQ2 .r does have k > rr(u) distinct prime divisors 2" ::: x::: QIQ2. Pk randomly and compute k fingerprints') For any position r . When I= nm2. then either declare a probable match or check explicitly that P occurs in T starting at thatposition r.(Efficiem has fewer PROOF Suppose .4.4. p.l and test to see if it equals HAP). depending on the probability of an error.4.564. there can be an occurrence of P starling at r on(v if HrJP) = H". below) and compute [fp(Pl.4.1 leads to the following random fingerprint algorithm for finding all ocof P in T . Their product is 2. qk is at least as large as the product of the smallest k primes. . KARP-RABIN FINGERPRINT METHODS FOR EXACT MATCH 81 Lemma 4. {hen the product of all the primes (hat are less than or equal to u isgreater than 2" [383J For example. 13.2.3 holds for any choice of pattern P and text T such that has a false match Extensions If one prime is good. algorithm I Choose a positive integer (to be discussed in more detail I. Corollary 4. 0 The central theorem Now we are ready for the central theorem of the Karp-Rabin approach 3.) can be computed in constant time from Hp(Tr_I)..r may be repeated).4. 10 fully analyze the probability of error. and 29. For now. Let P and T be any strings such that nm 2: 29.870.410 whereas i~<Jis 536.2. So the assumption that k >11"(11) leads to the contradiction that 2" > 2".qk (the first inequality is from the second from the fact that some primes in the factorization of .4. If II 2: 29 and x is any number less than or equal to 2". We now define a false match between P and T to mean that there is an r such thai P does nOI occur in T starting atr.912.156.1. 19.3.) for everyone of the k selected primes. We will return to the issue of checking later. compute Hp(T..(T.

4. as befure. Moreover.. Hence. see [182].4 is again very weak. as a first check.4. KARP-RABIN FINGERPRINT METHODS FOR EXACT MATCH 83 Theorem 4. If d is not the smallest period. if k = 4 and n. but they may also find false matches .4. Otherwise. But that implies that there is an occurrence of P starting at position II + d' < d2. II and h. we can apply Corollary 4. Otherwise. say d'.3 holds for each of the primes. lif I -Ii::::: nI2). but the reader need not have read that section. suppose the algorithm explicitly checks that P occurs in T when Hp(P) "" H. it is no longer random.) (page 431). which is stated in Section 16. and since the Karp-Rabin method never misses any occurrence of P. Thus.2 to each prime. If any of these successive checks finds a mismatch. it declares a false match and stops if this check fails for some i. f2 •. For additional probabilistic analysis of the Karp-Rabin method. the probability of a false match between P and T is at most (1. P dues in fact occur starting at each declared location 1 . With this observation we can reduce the probability of a false match as follows: allows numerous false matches (a demon seed). We present here an O(m)-time method. so we first discuss how to check for false matches in a single run In a single run. PROOF We saw in the proof of Theorem 4. But once the prime has been shown to allow a false match. However. a small choice of k will assure that the probability of an error due to a false match is less than the probability of error due to a hardware malfunction. So d must be the smallest period of P.tion r .d. this checking would seem to require 8(nm) worst-case time.3 says nothing about how bad a particular prime can be. a little arithmetic (which we leave to the reader) Corollary 4. and the theorem is proved. When k primes are chosen randomly between I and I and k fingerprints are used.2.. Theorem 4.I. for position Ii. If one checks for P at each declared location.26im-'2k-Il(l + 0. although the expected time can be made smaller. 0 As an example.3 that if p is a prime that allows Hp(P) = Hp(T. as in the Galil method. Assuming. Even lower limits on error The analysis in the proof of Theorem 4. establishing then This probability falls so rapidly that one is effectively protected against a long series of errors on any particular problem instance. then the probability ofa false match between P and T is at most by 10-12.(T. then d must be a multiple of the smallest period.. and it follows that if there are no false matches in the run. andk = 4 reduces the We mention one further refinement discussed in [266]. the method either verities that the Karp-Rabin algorithm has found no false matches or it declares that there is at least one false match (but it may not be able to find all the false matches) in O(m) time The method is related to Galil'sexrension of the Boyer-Moore algorithm (Section 3. the probability of a false match is reduced dramatically.e.4. it suftices to successively check the last d characters in each declared occurrence of P against the last d characters of P. since two runs are separated by at least nj2 positions and each run is at least n positions long. . even though it is not there. in L such that every two successive numbers in the interval differ by at must nj2 (i. Muthukrishnan [336]... For the time analysis. Theorem 4. If P does not occur in both of those locations then the method has found a false match and stops. For typical values of nand m.~)]k. the probability of a false match between P and T is at most [~~.17.locations where P is declared to be in T. that contradict.4. then p is in a set of at most Jr(nm) inregers. the method checks the d characters of T starting at position I. That is.). This makes intuitive sense. But by picking a new prime after each error is detected. note first that no character of T is examined more than twice during a check of a single run. while the computational effort of using four primes only increases by four times.2). from to-' to to-". no character of T can be examined in . Consider a list L of (starting) locations in T where the Karp-Rabin algorithm declares P to be found. The method works on each run separately. This is an even less likely event. to check each location in C.. It may well be a prime that in the run.6lnml Applying this to the running example ofn probability of a false match to at most 2 x = 4000. So the probability that all the primes are in the set is bounded by [~t. that determines if any of the declared locations are false matches. m. Then one may be better off by picking a new prime to use for the continuation of the computation. a false match can occuronlyife ach of the k primes is in that set. the bound from Theorem 4. noted first by S. then /'''1 =d for each i in the run. the choice of 12 as the second occurrence of P in the interval between II and t.3 randomizes over the choice of primes and bounds the probability that a randomly picked prime will allow a false match anywhere in T. (This follows easily from the GCD Theorem. and it finds that P does not occur there.) at some position r where P does not occur. A run is a maximal interval of consecutive starting locations ii.4. the method verifies that Ii+l .82 SEMINUMERICAL STRING MATCHING 4.When k fingerprints are used. each of the primes i must simultaneously allow a false match at the same r. then the method has found a false match in the run and stops. for the algorithm to actually make an error at some specific po.Ii = each i. Checking for error in linear time All the variants of the Karp-Rabin method presented above have the property that they find all true occurrences of P in T.4. the method learns that P is semipenodic with period I~ -II We usc d to refer to /1-1 and we show that d is the smallest period of P. When k primes are chosen randomly and used in the fingerprint algorithm.3. that I shows = nm".4. Returning to the case where only a single prime is used. the method explicitly checks for the occurrence of P at the first two positions in the run. because it just multiplies the probability that each of the k primes allows a false match somewhere in T. Otherwise. That is.4. and since the primes are chosen randomly (independently). when and h. and I arc as in the previous example.4.1. + n .

5. But the main attraction is that the method is based on very different ideas than the linear-time methods that guarantee no error. for Patonlyl. Show how to efficiently approach. establishing significant properties of the method's performance. ln the Karpcheck to the two-dimensional pattern matching 10.84 SEMINUMERICAL STRING MATCHING specifications depends 8. Extend Kleene of wild cards method in the strings. be deleted from P.3. the method when can find in O(m} time all those point. Also. we have seen several methods that run in linear worst-case time and never make errors. We mentioned by the O{nm} in PROSITE. the method is accompanied by concrete proofs. algorithms. Complete time. it can be thought of as a method that never makes an error and whose expected running time is linear. Little has been proven about their performance. but. the method and why runs containing no false matches. but runs in Oem) expected time (a conversion from a Monte Carlo algorithm to a Las Vegas algorithm). The random problem discussed in Section 3. and the text) in the Shift-And of neither method is affected 4. unlike the Karp-Rabin method. the method is simple and can be extended to other problems. the Karp-Rabin randomized to check method from a Monte-Carloalgorithm. Prove the correctness 5. they generally lack any theoretical analysis.4. how.5. Vary Ihe sizes of Pand 2.1 . EXERCISES 85 method or agrep? The answer partly more than two consecutive runs. (Open be efficiently handled 4.4. Why fingerprints? The Karp-Rabin fingerprint method runs in linear worst-case lime. a small number of insertions and deletions of characters. Ranges matching method patterns often specify a range for handled 32 of Chapter pattern thenumberoftimesthatasubpattern expression of this type can be easily of Section 3.5_ Exercises 1. Perhaps can be replaced one can the FFT method with character comparisons in the case of computing the proof of Corollary fingerprint approach 4. Explain 4. the Karp-Rabin algorithm can be converted from a method with a small probability of error that runs in Oem) worst-case time. as well as are allowed. Complete the detailS and analysis algorithm to convert style randomized 12. Second. over all runs. Thus the method is included because a central goal of this book is to present a diverse collection of idem. can be extended Doit. is. to efficiently for agrep. used in a range of techniques. Alternatively. Methods similar in spirit to fingerprints (or filters) predate the Karp-Rabin method. To achieve this. It follows that the total time for the method. by the number 6. to one that makes no error.3. simply (rejrun and (re)check the Karp-Rabin algorithm until no false matches are detected. 4. isO(m) With the ability to check for false matches in Oem) time. There Rabin to a Las Vegas-style possible in the method are improvemerlts method. from a practical standpoint. Adapt Shift-And each pattern and agrepto handle a set of patterns. Can such range I . characters can be inserted P and characters 3. 9. wild cards (both in the pattern Show that the efficiency Do the same for agrep.6. We leave the details as an exercise with the Shift-And on the number problem) Devise of such a purely specifications that appear method in the ex pression to compute in detail match-counts to see if complex matchin comparison-based examine O(mlogm} arithmetic counts. the Shift-And closure. as those handle regular expressions that do not use the to collections . 3 that PAOSITE repeats. Evaluate empirically T method to solve That the problem of finding an "occurrence" into of a pattern P can the Shift-And method against methods discussed earlier. Can you do better than just harldlirlg in the set independently? of the agrep handle method. Extend the agrep inSide a text T. at some needs to explicitly and not 12. Do the same patterns in Exercise regular such Explain the utility of these extensions of biosequence 7. but with a nonzero (though extremely small) chance of error. First. In contrast. So what is the point of studying the Karp-Rabin method? There are three responses to this question. and proofs 11. such as two-dimensional pattern matching with odd pattern shapes . when mismatches. Explain for false matches For example.3.a problem that is more difficult for other methods.

PART II Suffix Trees and Their Uses .

as will be illustrated by additional applications in Chapter 9. the allowed preprocessing takes time proportional to the length of the text. but thereafter. the Aho--Corasick method does not solve the substring problem in the desired time bounds.1 and 7.3.those methods would preprocess each requested string on input. In is important. for example. Some of those applications provide a bridge to inexact matching. Thus it is natural to expect that the Ahc--Corasick algorithm could be applied. so that the substring problem is to determine whether the input string is a substring of any of the fixed strings.5 Introduction to Suffix Trees A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing discussed in Section 1. suffix trees provide a bridge between exact matching problems.where the focus is on inexact matching 89 . thereafter. those algorithms would be impractical on any but trivial-sized texts Often the text is a fixed set of strings. several applications and extensions will be discussed in Chapter 7. Suffix trees work nicely to efficiently solve this problem as well. Then a remarkable result. and then take (~)(m) (worst-case) time to search for the string in the text. One is first given a text T of length m . Because m may be huge compared to n . That method greatly amplifies the utility of suffix trees. the focus of Part I. independent of the length of T. and inexact matching problems that are the focus of Part III The classic application for suffix trees is the substring problem. Suffix trees can be used to solve the exact matching problem in linear time (achieving the same worst-case bound that the Knuth-Morris-Pratt and the Boyer-Moore algorithms achieve). this case of multiple rext strings resembles the dictionary problem discussed in the context of the Aho-Corasick algorithm.10). preprocessing rime. The suffix tree for the text is built in preprocessing stage. but their real virtue comes from their use in linear-time solutions to many string problems more complex than exact matching. These bounds are achieved with the use ofa suffix tree. That is. Superficially. the search for S must be done in time proportional to the length of S. more applications of suffix trees will be discussed in PartIII. because it will only determine if the new string is a/till string in the dictionary. The O(m) and O(n) search result for the substring problem is extremely useful. a collection of STSs or ESTs (see Sections 3. whenever a string of length searchesforitinO(n)timeusingthatsuffixtree. After O(m).one must be prepared to take in any unknown string S of length n and in O(n) time either find an occurrence of S in T or determine that S is not contained in T. methods . the constant-time least common a!!cestor method. Moreover (as we will detail in Chapter 9). or linear. However. will be presented in Chapter 8. not whether it is a substring ofastring in the dictionary After presenting the algorithms.5.

As ~:~~~:::~li~~a~~~~: ~:~~::e~:it:ea~~ ~:. sometimes a text. A short history The first linear-time algorithm for constructing suffix trees was given by Weiner [4731 in 1973. No two edges our ot a node can have edge-labels beginning with Ihe same characte r.). And. The path from the root to the leaf numbered I spells oUI the full string S = xabxac.1 string xa labels the internal node w (so node w has path-tabel w). when implemented well. in Figure 5. Chapter 6 fully develops the linear-time algorithms of Ukkonen and Weiner and then briefly mentions the high-level organization of McCreight's algorithm and its relationship t~ ~kk~nen's ~lgorithm.3.4. then the suffix tree for 5 can be obtained from the keyword tree for P by merging any path of nonbranching nodes into a single edge. That reputation is well deserved but unfortunate. while the path to the leaf numbered 5 spells out the suffix ac. More recently. particularly for Weiner's algorithm. That is. A different. I). A motivating example Before diving into the details of the methods to construct suffix trees.. although nontrivial. and string xabx labels a path that ends inside edge (w. sometimes both. This is probably because the two original papers of the 1970s have a reputation for being extremely difficult to understand. then suffixxc is a prefix of suffix xabxa. . Eachintcmal norle. Our aoproecn is to introduce each algorithm at a high level.1: Sufflxtreelorstring xabxac. we can add a character to the end of S that is not in the alphabet that string 5 is taken from. To avoid this problem. The node tatlets uand won ltletwo interiornodes wiltbe used tater. every string S is assumed to be extended with the termination symbol $. that is. and we hope that these expositions result in a wider use of suffix trees in practice. and sometimes neither.rit~~: algorithms for Definition For any node u in a suffix tree.ss~~. 5. Then. if set P is defined to be the m suffixes of 5. which starts in position 5 of S. we assume (as was true in Figure 5. rather than the O(m) bound we will establish. In this book we use $ for the "termination" character. The simple algorithm given in Section 3. string a labels node II. the characters in u's label string-depth of v is the number of Definition A suffix tree T for an m-character string S is a rooted directed tree with exactly er leavesnumhered 1 tom. ~Ivlng simple. if the last character of xahxac is removed. To achieve this in practice. we will refer to the generic string 5 of length m. We do not use P or T (denoting pattern and text) because suffix trees are used in a wide range of applications where the input string sometimes plays the role of a pattern. For example. Basic definitions When describing how to build a suffix tree for an arbitrary string. are no more complicated than are many widely taught methods. it spells out Sh .ml For example. however.1. The key feature of the suffix tree is that for any leaf i . since the path for the first suffix would not end at a leaf.V:~~:t~ ~~.90 INTRODUCTION TO SUFfiX TREES 5. so the path spelling out xa would not end at a leaf. the concatenation of the edge-Iabels on the path ~romtheroottoleafl exactly spells out the suffix of 5 that starts at position i . As stated above. other than the root. A ManVATING EXk\1PLE 91 5. although he called his tree a position tree. more space efficient algorithm to build suffix trees in linear time was given by McCreight [318] a few years later. the algorithms are practical and allow efficient solutions to many complex string problems. the definition ofa suffix tree for 5 does not guarantee that a suffix tree for any string 5 actually exists.4 for building keyword trees could be used to construct a suffix tree for S in O(m2) time. inside the leaf edge touching leaf 1 5. suffix trees have not made it into the mainstreamof computer science education. has at least two children and each edge is labeled with a noncmpty substring of S.1. even if the symbol is not explicitly shown A suffix tree is related to the keyword tree (without backpcintersj considered in Seclion 3.~ 2 Flgure5. we will write it explicitly as in S5. We believe that the exposruons and analvses given here. Those Implementations are. Much of the time. no suffix of the resulting string can be a prefix of any other suffix. unless explicitly stated otherwise.nl implementations.1) that the last character of 5 appears nowhere else in S. Ukkonen [438] developed a conceptually different linear-time algorithm for building suffix trees that has all the advantages of McCreight's algorithm (and when properly viewed can be seen as a variant of McCreight's algorithm) but allows a much simpler explanation Although more than twenty years have passed since Weiner's original result (which Knuth is claimed to have called "the algorithm of 1973" [24]). We know of no other single data structure (other than those essentially equivalent to suffix trees) that allows efficient solutions to such a wide range of complex string problems. The problem is that if one suffix of S matches a prefix For example. and they have generally received less attention and use than might have been expected.~:s~:.then incrementally Improved to achieve linear runmng times. Given string S. the original papers.2. m~fficre. because the algorithms. the suffix tree for the string xabxac is shown in Figure 5. creating string xabxa. are much simpler and clearer than i. this reminder will not be necessary and. let's look at how a suffix tree for a string is used to solve the exact match problem: Given a pattern P of . When it is important to emphasize the fact that this termination character has been added. of another suffix of 5 then no suffix tree obeying the above definition is possible.

.4. collecting position numbers written at the leaves.1 we will again consider the relative advantages of methods that preprocess the text versus methods that preprocess the pattemts). find all occurrences of P in T in O(n + ml time. Then (whether a new node w was created or whether one already existed at the point where the match ended). even though the total string-depth of those O(k) edges may be arbitrarily larger than k. w) is labeled with the part of the (u. Suffix trees provide another approach' Build a suffix tree T for text T in O(m) time. then it breaks edge (u. but the distribution of work is different.4. and note the leaf numbers encountered. This path is found by successively comparing and matching characters in suffix SU + I.8. and the new edge (w. This is the same overall time bound achieved by several algorithms considered in Part I. the number of leaves encountered . We have already seen several solutions to this problem. we will also show how to use a suffix tree to solve the exact matching problem using O(n) preprocessing and Oem) search time. But that happens if and only if string P labels an initial part of the path from the root to leaf j. traverse the subtree at the end of the matching path using any linear-time traversal (depth-first say). until no further matches are possible.. as follows' Starting at the root of Ni find the longest path from the root whose label matches a prefix of S[I + I. A naive algorithm to build a suffix tree To further solidify the definition of a suffix tree and develop the reader's intuition.mJ. Their starting positions number the leaves in the 5. called IIJ.m]$ to characters along a unique path from the root. denote the intermediate tree that encodes all the suffixes from I to I In detail.. the work at each node takes COnstant time and so the time to match P to a path is proportional to the length of p For example. the algorithm can find all the starting positions of P in T by traversing the Subtree below the end of the matching path. the suffix tree approach spends Oem) preprocessing time and then O(n + k) search time. every leaf in the subtree below the point of the last match is numbered with a starting location of P in T.just after the last character on the edge that matched a character in Sri + I . The edge is labeled with the string S$. The tree now contains a unique path from the root to leaf i + 1. so the time for the traversal is O(k). To collect the k starting positions of P.92 INTRODUCTION TO SUFFIX TREES SA.2. m]S into the growing tree. Assuming. Tree N'+l is constructed from N. Later..mJ$ (the entire string) into the tre e. And.ml and just before the first character on the edge that mismatched. In contrast. Since every internal node has at least two children. and every starting location of P in T numbers such a leaf The key to understanding the former case (when ali of P matches a path in T) is to note that P occurs in T starting at position j if and only if P occurs as a prefix of Tfj. achieving the same bounds as in the algorithms presented in Part I Figure 5..mlS.m]S. When that point is reached. At some point. the above naive method takes O(m2) time to build a suffix tree for the string Soflengthm. A NAIVE ALGORITHM TO BUILD A SUFFIX TREE 93 is proportional to the number of edges traversed. the algorithm creates a new edge (ui. u] label that matched with Sri + Lm]. Pattern P matches a path down to the point shown by an arrow. Figure 5. In Section 7. All occurrences of P in T can therefore be found in O(n +m) time. as usual. the algorithm is either at a node. then the search time can be reduced from O(n + k) to O(n) time. If only a single occurrence of P is required. and the preprocessing is extended a bit. and so it follows inductively that no two edges out of a node have labels with the same first character. If it is in the middle of an edge. If P fully matches some path in the tree.. Then.fhenit successively enters suffix 5[i . we present a straightforward algorithm to build a suffix tree for string S. v) say. The details are straightforward and are left to the reader. In the latter case.2: Three occurrences of aw in awyawxawxz_ subtreeofthenodewithpath-labelaw. This naive method first enters a single edge for suffix 5[1 . The new edge (u. in the search stage. and as required. u) is labeled with the remaining part of thc (u. The idea ts to write at each node one number (say the smallest) of a leaf in its subtree. It is the initial path that will be followed by the matching algorithm. v) label. the number written on the node at or below the end of the match gives one starting position of P in T. v) into two edges by inserting a new node. in Section 7.m ]$. a bounded-size alphabet. The matching path is unique because no two edges out of a common node can have edge-labels beginning with the same character. The matching path is unique because no two edges (Jut of a node can have labels that begin with the same character. match the characters of P along the unique path in T until either P is exhausted or no more matches are possible. and this path has the label Sri + l. In the fornier case.2shows a fragment ofthesuffix tree for string T = awyawxawx~ Pattern P = aw appears three times in T starting at locations 1.. for i increasing from 2 to m. i + I) running from w to a new leaf labeled i + I. Those earlier algorithms spend O(n) time for preprocessing P and then Oem) time for the search.. and 7. P does not appear anywhere in T. because we have assumed a finite alphabet. the leaves below that point are numbered 1. tree NI consists of a single edge between the TOotof the tree and a leaf labeled I. Then. where k is the number of occurrences of Pin T. w say. This can be achieved in Oem) time in the preprocessing stage by a depth-first traversal of T. and it labels the new edge with the unmatched part of suffix Sri + 1. or it is in the middle of an edge.4. length n and a text T of length m. {u. Note that all edges out of the new node w have labels that begin with different first characters. We let N.and 7. no further matches are possible because no suffix of 5$ is a prefix of any other suffix of S$.

6.1. UKKONEN'S

LINEAR-TIME

SUFFIX

TREE

ALGORITHM

95

6 Linear-Time Construction of Suffix Trees

We will present two methods for constructing suffix trees in detail, Ukkone~'s.me.thod and Weiner's method. Weiner was the first to show that suffix trees can be built In linear time, and his method is presented both for its historical importance and for some different technical ideas that it contains. However, Ukkonen's method is equally fast and uses far less space (i.e., memory) in practice ,than Weiner's method. Hence Ukkonen is the met,ho,d of choice for most problems requiring the construction of a suffix tree. We also believe that Ukkonen's method is easier to understand. Therefore, it will be presented first. A reader who wishes to study only one method is advised to concentrate on it. However, our development of Weiner's method does not depend on understanding Ukkonen'~ algorithm, and the two algorithms can be read independently (with one small shared section noted In the description of Weiner's method).

Figure 6.1: Suffix tree lor string xabxa$

Figure 6.2: Implic~suffixtreeforstringxabx8

**6.1. Ukkonen's linear-time suffix tree algorithm
**

Esko Ukkonen [438J devised a linear-time algorithm for constructing a suffix tree that may be the conceptually easiest linear-time construction algorithm. This algorithm has a spacesaving improvement over Weiner's algorithm (which was achieved first in the developme~t of McCreight's algorithm), and it has a certain "on-line" propert~ that may be ~sef~1 U1 some situations. We will-describe that on-line property but emphasize that the mam Virtue of tnconen's algorithm is the simplicity of its description, proof,. and time. analy.sis. The simplicity comes because the algorithm can be developed as a slm~le but inefficient method, followed by "common-sense" implementation tricks that establish a better worstcase running time. We believe that this less direct exposition is more understandable, as each step is simple to grasp. string S$ if and only if at least one of the suffixes of S is a prefix of another suffix. The terminal symbol $ was added to the end of S precisely to avoid this situation. However, if S ends with a character that appears nowhere else in S, then the implicit suffix tree of S will have a leaf for each suffix and will hence be a true suffix tree. As an example, consider the suffix tree for string ,ra/Ha$ shown in Figure 6.1. Suffix xa is a prefix of suffix x aoxa, and similarly the string a is a prefix of abxa. Therefore, in the suffix tree for x.abxa the edges leading to leaves 4 and 5 are labeled only with $ Removing these edges creates two nodes with only one child each, and these arc then removed as well. The resulting implicit suffix tree for xab.ta is shown in Figure 6.2. As another example, Figure 5.1 on page 91 shows a tree built for the string xooxuc. Since character c appears only at the end of the string, the tree in that figure is both a suffix tree and an implicit suffix tree for the string. Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S -each suffix is spelled out by the characters on some path from the root of the implicit suffix tree. However, if the path does not end at a leaf, there will be no marker to indicate the path's end. Thus implicit suffix trees, on their own, are somewhat less informative than true suffix trees. We will use them just as a tool in Ukkonen's algorithm to finally obtain the true suffix tree for S

6.1.1. Implicit

suffix

implicit

trees

suffix trees, the last of which is

Ukkonen's algorithm constructs a sequence of converted to a true suffix tree of the string S

Definition An implicit sufJu tree for string S is a tree obtained from the suffix tree for S$ by removing every copy of the terminal symbol $ from .the edge labels of the tree, then removing any edge that has no label,andthen removing any node that does not have at least tWQ children An implicit suffix tree for a prefix S[I ..il of S is similarly defined by taking the suffix tree for S[l..i]$ and deleting $ symbols, edges, and nodes as above Definition tom. The implicit suffix tree for any string S will have fewer leaves than the suffix tree for '4 We denote the implicit suffix tree of the string

S[I .. i]

**6.1.2_ Ukkonen's algorithm at a high level
**

Ukkonen'g algorithm constructs an implicit suffix tree L, for each prefix S[1..i] of S, starting from II and incrementing i by one until Im is built. The true suffix tree for S is constructed from Im, and the time for the entire algorithm is O(m). We will explain

by Ij, for

i

from 1

96

LINEAR-TIME

CONSTRUCTION

OF SUFFIX

TREES

6.1.

UKKONEN'S

LINEAR-TIME

SUFFIX

TREE

ALGORITHM

97

Ukkonen's algorithm by first presenting an O(ml)-time method to build all trees Ii and then optimizing its implementation to obtain the claimed time bound High-level description of Ukkonen's algorithm

Ukkonen's algorithm is divided into m phases. In phase i + 1, tree 7,+1 is constructed from Ii' Each phase i + I is further divided into i + 1 extensions, one for each of the i + I suffixes of S[I .. i + IJ. In extension j of phase j + I, the algorithm first finds the end of the path from the root labeled with substring Slj ..r]. It then extends the substring by adding the character S{i + 1) to its end, unless S(i + 1) already appears there. So in phase i + 1, string S[l . .i + 11 is first put in the tree, followed by strings 5[2,,1 + l], S[3 .. 1 + 1]"" (in extensions 1,2,3" .. , respectively), Extension i + 1 of phase i + 1 extends the empty suffix of S[I ..r], that is, it puts the single character string Sri + I) into the tree (unless it is already there). Tree Ij is just the single edge labeled by character S(I). Procedurally, the algorithm is as follows' High-level Ukkonen algorithm Construct tree Zj . Forifromltom-Ido begin [phase r + I} Forjfrom 1 tol+ I beginjextensionj} Find the end of the path from the root labeled S[j.,i] in the current tree. If needed, extend that path by adding character thusassuringthatstringS[j . .i +l]isinthetree end; end;

FigureS.3;

Implicit suffix tree for string i'Xi'bx before the sixth cnereceer. b. is added

S(i

+ 1),

Figure 6.4; Exlendedimplicilsuffixlreeaftertheaddilionolcharacterb,

sum: x extension

rules

To tum this high-level description into an algorithm, we must specify exactly how to perform asu.fJixextension. Let SU .. i] = {3 be a suffix of S[l..i]. In extension i. when the algorithm finds the end of {3 in the current tree, it extends f3 to be sure the suffix {3SU + I) is in the tree. It docs this according to one of the following three rules' Rule 1 In the current tree, path {3 ends at a leaf. That is, the path from the root labeled {3 extends to the end of some leaf edge. To update the tree. character S(i + \) is added to the end of the label on that leaf edge. Rule 2 No path from the end of string /3 starts with character S(i + I), but at least one labeled path continues from the end of f3. In this case, a new leaf edge starting from the end of /3 must he created and labeled with character SU + I). A new node will also have to be created there if f3 ends inside an edge. The leaf at the end of the new leaf edge is given the number j Rule 3 Some path from the end of string (3 starts with character S(i + 1). In this case the string (3S(j + 1) is already in the current tree, so (remembering that in an implicit suffix. tree theendofa suffix need not be explicitly marked) we do nothing

As an example, consider the implicit suffix tree for S = axabx shown in Figurc 6.3 The first four suffixes end at leaves, but the single character suffix x ends inside an edge When a sixth character b is added to the string, the first four suffixes get extended by Lfhe fifth suffix gets extended by rule 2. and the sixth by rule 3. The

**6.1.3. Implementation and speedup
**

Using the suffix extension rules given above, once the end of a suffix ji of S[Lil has been found in the current tree. only constant time is needed to exec utc the extension rules (to ensure that suffix {3SU + I) is in the tree). The key issue in implementing Ukkonen's algorithm then is how to locate the ends of all the 1 + I suffixes of S[I .. i]. Naively we could find the end of any suffix j:J in O(ltlll time by walking from the root of the current tree. By that approach, extension j of phase i + I would take OU + 1 - j) time. IHj could be created from I, in O(j2) time. and Im could be created in O(m) time. This algorithm may seem rather foolish since we already know a straightforward algorithm to build a suffix tree in O(m2) time (and another is discussed in the exercises), bur it is easier to describe Ukkonen's O(m.) algorithm as a speedup of the O(m) method above. We will reduce the above O(mJ) lime bound to O(m.) time with a few observations and implementation tricks. Each trick by itself looks like a sensible heuristic to accelerate the algorithm, but acting individually these tricks do not necessarily reduce the worst-case

98

LINEAR-TIME

CONSTRUCTION

OF SUFFIX

TREES

99 time. The most Following Corollary 6.1.1, all internal nodes in the changing tree will have suffix links from them, except for the most recently added internal node, which will receive its suffix link by the end of the next extension. We now show how suffix links are used to speed up the implementation

bound. However, taken together, they do achieve a linear worst-case important element of the acceleration is the use of wffix links. Suffix links: first implementation speedup

Definition Let xu denote an arbitrary string, where x denotes a single character and a denotes a (possibly empty) substring. For an internal node u with path -Iabetxc. if there isanothernodes(v)withpath-labela,thenapointerfromvlos(v)iscalledasuffulillk. We will sometimes refer to a suffix link from » tos(v) as the pair (v, s{v». For-example, in Figure 6.1 (on page 95) let v be the node with path-label x a and let s(v) be the node whose path-label is the single character a. Then there exists a suffix link from node v to nodes(v). In this case, a is just a single character long. As a special case, if a is empty, then the suffix link from an internal node with path-label XCI: goes to the root node. The root node itself is not considered internal and has no suffix link from it. Although definition of suffix links does not imply that every internal node of an implicit suffix tree has a suffix link from it, it will, in fact, have one. We actually establish something stronger in the following lemmas and corollaries.

Following a trail of suffix links to build

Ii+!

PROOF A new internal node v is created in extension j (of phase i + 1) only when extension rule 2 applies. That means that in extension j, the path labeled xa continued with some character other than S(i + 1), say c. Thus, in extension j + I, there is a path labeled a in the tree and it certainly has a continuation with character c (although possibly with other characters as well). There are then two cases to consider: Eitherthe path labeled Ci continues only with character c or it continues with some additional character. When a is continued only by c, extension rule 2 will create a node s(1') at the end of path a. When a is continued with two different characters, then there must already be a node s(v) at the end uf path 0:. The Lemma is proved in either case. 0 Corollary 6,1.1. In Ukkonen's algorithm, any newly created internal node will have a suffixlink from it by the end of the next extension. PROOf The proof is inductive and is true for tree II since II contains no internal nodes Suppose the claim is true through the end of phase i, and consider a single phase i + I. By Lemma6.1.1, when a new node v is created in extension i. the correct node s(v) ending the suffix link from v will be found or created in extension j + I. No new internal node gets created in the last extension ofa phase (the extension handling the single character suffix S(i + I)), so all suffix links fromintemal nodes created in phase i + I are known by the end of the phase and tree Ii+1 has all its suffix links. 0 Corollary 6.1.1 is similar to Theorem 6.2.5, which will be discussed during the treatment of Weiner's algorithm, and states an important fact about implicit suffix trees and ultimately about suffix trees. For emphasis, we restate the corollary in slightly different language. Corollary 6.1.2. In any implicit suffix tree I" if internal node v has path-label there is a node s(v) of I, with path-label e.

XCi,

then

Recall that in phase i + 1 the algorithm locates suffix Srj ..i] of S[I . .i] in extension j, for j increasing from I to i + 1. Naively, this is accomplished by matching the string S[j .. i] along a path from the root in the current tree. Suffix links can shortcut this walk and each extension. The first two extensions (for j = I and j = 2) in any phase i + 1 are the easiest to describe The end of the full string S[I . .i] must end at a leaf of Ii since S[I . .i] is the longest string represented in that tree. That makes it easy to find the end of that suffix (as the trees are constructed, we can keep a pointer to the leaf corresponding to the current full string S[I • .i]), and its suffix extension is handled by Rule 1 of the extension rules. So the first extension of any phase is special and only takes constant time since the algorithm has a pointer to the end of the current full string. Let string S[1..i] be .ec , where X is a single character and a is a (possibly empty) substring, and let (u, I) be the tree-edge that enters leaf I. The algorithm next must find the end of string S[2 . .i] = a in the current tree derived from Ii' The key is that node v is either the root or it is an interior node of Ii' If it is the root, then to find the end of a the algorithm just walks down the tree following the path labeled c as in the naive algorithm Butifv is an internal node, then by Corollary 6.1.2 (since v was in I,) u has a suffix link out of it to node s(~'). Further, since s(v) has a path-label that is a prefix of string a, the end of string 0: must end in the subtree of s(1'). Consequently, in searching for the end of Ci in the current tree, the algorithm need not walk down the entire path from the root, but can instead begin the walk from node s(v). That is the main point ofinc\udiog suffix links in the algorithm To describe the second extension in more detaiL let y denote the edge-label on edge (u. 1). To find the end of c, walk up from leaf 1 to node u; follow the suffix link from v to s{v); and walk from s(v) down the path (which may be more than a single edge) labeled y. The end of that path is the end of a (see Figure 6.5). At the end of path a, the tree is updated following the suffix extension rules. This completely describes the first two extensions ot phase r j-t To extend any string S[j . .ij to Slj .. i + I] for j > 2, repeat the same general idea Starting at the eod of string S[j - 1..i1 in the current tree, walk up at most one node to either the root or to a node u that has a suffix link from it; let y be the edge-label of that edge; assuming v is not the root. traverse the suffix link from l' tos(u); then walk down the tree from stu), following a path labeled y to the end of S[j .. i]; finally, extend the suffix to S[j .. i + 1] according to the extension rules. There is one minor difference between extensions for j > 2 and the first two extensions. In general, the end of S[i - l..iJ may be at a node that itself has a suffix link from it, in which case the algorithm traverses that suffix link. Note that even when extension rule 2 applies in extension j - I (so that the end of S[j - I..i] is at a newly created internal node w), if the parent of w is not the root, then the parent of w already has a suffix link out of it, as guaranteed by Lemma 6.1.1. Thus in extension j the algorithm never walks up more than one edge.

extension j ~2 of phasei+lis: Single extension algorithm Begin lemma leads immediately to the answer. the first extension of phase i + I need not do any up or down walking. This trick will also be central in other algorithms to build and use Trick number 1: skip/count trick In Step 2 of extension j + 1 the algorithm walks down from node 05(1') along a path labeled y. The only extra ancestor that v can have (without a corresponding ancestor for s(v» is an internal ancestor whose path-label is a prefix of the path 10 u.. called the skip/count trick. this walk along V takes time proportional to Ivl. It will then follow that the time for all the down walks in a phase is at most O(m). the use of suffix links does not yet improve the time bound.1. And. Definition from Define the node-depth of a node u to be the number of nodes on the path u the rootto Eod Assuming the algorithm keeps a pointer to the current full string S[l. ensure that the string Sfj . here we introduce a trick that will reduce the worst-case time for the node-depth of 8(V) is at least one (for the root) plus the number of internal ancestors of u who have path-labels more than one character long. O(mz). suffix link from anyinternal ancestor of v goes to an ancestorofs(v). But a simple trick. as done in the naive algorithm. Recall that there surely must be such a V path from s(1'). ijSU + l j is in the tree. when implemented using suffix links. Directly implemented. 3.ij. UKKONEN'S LlNEAR·TIME SUFFIX TREE ALGORITHM 101 algorithm to suffix trees. Single extension algorithm: SEA Putting these pieces together. But does their use improve the worst-case running time? The answer is that as described. Using the extension rules. iffJ is nonernp ty then the node labeled is an internal node. the numoer of cnaracters on that path. Moreover. will reduce the traversal time to something proportional to the number of nodes on the path. the first extension of phase i+ I always applies suffix extension rule I What has been achieved so far? The use of suffix links is dearly a practical improvement over walking from the root in each extension. because the node-depths of any two ancestors of L' must . Furthermore.. However.100 LINEAR-TIME CONSTRUCTION OF SUFFlX TREES 6.

A simple implementation detail We next establish an O(m) time bound for building a suffix tree.each suffix link traversal decreases the node-depth by at most another une (by Lemma 6.1. Consider the string S = abcdefghijklmnopqrsruvwxvz. That will be done shortly. As described so far. Using the skip/count trick. applies the suffix extension rules. the time per down-edge traversal is constant. the edge-labels of a suffix tree might contain more than 8(m) characters in total. Every suffix begins with a distinct character. Why all the work just to come back to the same time bound? The answer is that although we have made no progress on the lime bound. Ukkonen's algorithm can be implemented O(m<) time with suffix links to run in Theorem 6. that many characters makes an O(m) time bound impossible. one immediate barrier to that goal: The suffix tree may require 8(m2) space.1. There is.1. 0 Note that the O(m2) time bound for-the algorithm was obtained by multiplying the O(m) time bound on a single phase by m (since there are m phases). and the theorem is proved. Using the skip/count trick. This crude multiplication was necessary because the time analysis was directed to only a What is needed are some changes to the implementation allowing a time that crosses phase boundaries. hence there are 26 edges out of the root and .1. any phase ojUkkonen'salgoriliIm lime takes O(m) PROOF There are i + 1 ::: In extensions in phase i. walks down some number of nodes. we will need one simple implementation detail and two more little tricks.102 LINEAR-TIME CONSTRUCTION Of SlJFFIX TREES 6. j 6. Since the time fur the algorithm is at least as large as the size of its output. In a single extension the algorithm walks up at most one edge to find a node with a suffix link. so we only need \0 analyze the time for the down-walks.4. in particular. The up-walk in any extension decreases the current node-depth by at most one (since it moves upatmostone node). the time will fall to O(m). traverses one suffix link. and since no node can have depth greater than m.uffixj-i end of suffix j There are m phases.3. We have already established that all the operations other than the down-walking take constant time per extension.1. the total number of edge traversals during down-walks is bounded by 3m. we have made great conceptual progress so that with only a few more easy details. the total possible increment to current node-depth is bounded by 3m over the entire phase. and each edge traversed in a down-walk moves tu a node of greater node-depth.2). however. so the total time in a phase for all the down-walking is O(m). so the following is immediate: u can have node-depth at most one more than Corollary 6. since we started with a naive O(m') method. We do this by examining how the current node-depth can change over the phase. Thus over the entire phase the current node-depth is decremented at most 2m times. and maybe adds a suffix link. UKKONEN'S LINEAR-TIME SUFFiX TREE ALGORITHM 103 g h . At this point the reader may be a bit weary because we seem to have made no progress. It fullows that over the entire phase.1.

and so the path labeled S[j + 1. Thus. and rule 3 again applies in extensions j + 1. UKKONEN'S LINEAR-TIME SUFFIX TREE ALGORITHM lOS 6. butst ill one can construct strings of arbitrary length m so that the resulting edge-labels have more than (-)(m) characters in total. denote the last extension in this sequence. if suffix extension rule 3 applies in extension j. Moreover. requiring 26 x 27f2 characters in all. an O(m l-rime algorithm for building suffix trees requires some alternate scheme to represent the edge-labels. It is easy to see inductively thatqhadtobeiandhencethenewlabel(p. since the algorithm knows that rule I will apply in extensions I through i. q specifying the substring S[p . write (p..I. Let j.. These facts and Observation I lead to the fol\owing implementation trick.q]. the initial sequence of extensions where rule 1 or 2 applies cannot shrink in successive phases. will lead immediately to the desired linear lime bound Observation 1: Rule 3 is a show stopper In any phase..i] in the current tree must continue with character SCi + I). The extension rules are also easilyimplemented with this labeling scheme. The reason is that when rule 3 applies.iJ is 10. once there is a leaf labeled j.i+ I) represents the correct new substring forthatleaf edge By using an index pair to specify an edge-label.12 Figure 6. 1\vo more little tricks and we're done We present two more implementation tricks that come from two observations about the way the extension rules interact in successive extensions and phases. ::: j'+i' That is.. 3 could alsohaveb6solabeled8. In more detail. o. only constant time will be required to do those extensions implicitly To describe the trick. reflecting the addition of character SCi + I) to the end of each suffix. This suggests an implementation trick that in phase i + 1 avoids all explicit extensions I through j. In phase i + I. it will also apply in all further extensions (j + 1 to i + I) until the end of the phase. This i. The right tree shows the edge·labels compressed. the algorithm uses the index pair written on an edge to retrieve the needed characters from S and then performs the comparisons on those characters. always a leaf That is.1. ij does also. with the edge-labels written explicitly. i + I. Trick 3 In phase i + I.L5. no work needs to be done since the suffix of interest is already in the tree._i + 1]. Ifthis happens in extension j. index q is equal 10 i and in phase i + 1 index q gets Incremented to i + I. instead of writing indices (P. if at some point in Ukkonen's algorithm a leaf is created and labeled j (for the suffix starting at position j of S). and since the number of edges is at most 2m . always a leaf Now leaf I is created in phase I. Nota that that edge with label 2. then that leaf will remain a leaf in all successive trees created during the algorithm. Observation 2: Once a leaf. j + 2. change the index pair on that leaf edge from(p. So once a leaf. q + 1). where e is a symbol denoting "the current end". wrilton.defabcuvw. only two numbers are written on any edge. Edge-label compression substring) For example. so in i there is an initial of consecutive extensions (starting with extension where extension rule I applies. when a leaf edge is first created and would normally be labeled with substring S[p. plus Lemma 6. a new suffix link needs 10 be added to the tree only after an extension in which extension rule 2 applies.2.d.. Since any application of rule 2 creates a new leaf. . the path labeled S[j . Recall also thai for any leaf edge of Ii. e). then there is no need to explicitly find the end of any string S[LiJ fork > j I that are "done" after the tirst execuncn ofrulc 3 are said extension where the end of Slj . true because the algorithm has no mechanism for extending a leaf edge beyond its current leaf.'ump{LOn that a number with up {o log rn bit' con be re.. When extension rule 2 applies in a phase i + 1.1. in Ukkonen's algorithm when matching along an edge. q)to(p. When extension rule 3 applies. i + 1).104 L1NEAR"TIME S = abcdefabcuvw CONSTRUCTION OF SUFFIX TREES 6. Instead.. at least.d RAM modd "-. Trick 2 End any phase i + I the first time that extension rule 3 applies. extension rule I will always apply to extensicn j in any successive phase."oda. This makes it more plausible that the tree can actually be built in Oem) time2 Although the fully implemented algorithm will not explicitly write a substring on an edge..9 each is labeled with a complete suffix. and when extension rule 1 applies (on a leaf edge).compared . These tricks. some characters will repeat. it need do no additional explicit work to implement 1 We make lne . i + 1) on the edge.. recall that the label on any edge in an implicit suffix tree (or a suffix tree) can be represented by two indices P. For strings longer than the alphabet size.. Symbol e is a global index that is set to i + I once in each phase. the suffix tree uses only Oem) symbols and requires only oem) space. we will still find it conveni ent to talk about "the substring or label on an edge or path" as if the explicit substring wa s written there. it follows from Observation 2 that j.8: The left tree is a fragment of the suffix tree for string S = abt. label the newly created edge with the index pair (i + I.

and the theorem is proved. All other extensions (before and after those explicit extensions) are done implicitly.i + II ends. In summary. Each line represents a phase of the explicit extension executed by Iha algorithm.• in decreasing order We will first implement the method in a straightforward inefficient way. and J is bounded by m. Therefore. Therefore. the time for an explicit extension is a constant plus some time proportional to the number of node skips it does during the down-walk in that extension. it only does constant work to increment variable e. WEINER'S LiNEAR-TIME SUFFIX TREE ALGORITHM 107 3456 ~ 111213141516 !. All work has been accounted for. J never decreases. consider an index J corresponding to the explicit extension the algorithm is currently executing. As established earlier.. It is now easy to prove the main result. Ukkonen '. In this cartoon extensions. suffix link traversals.2_ Weiner's linear-time suffix tree algorithm r Definition Suff denotes the suffix Sli . we consider (similar to the proof of Theorem 6.toprepareforthenextpha. Theorem 6. two consecutive phases share at most one index (j') where an explicit extension is executed (sec Figure 6. That means the first explicit extension in any phase only takes constant time. and each number represents an mere are four phases and seventeen explicit One index whereihe same espncu ecereic« of Uk konen'a algorithm. Instead. Figure 6. or node skipping. Creating the true suffix tree + I.9).1.!2. along with all its suffix links in O(m) time 6.6. In any two con secutlve phases.1. which was the last explicit extension in the previous phase.1. in each explicit extension the current node-depth is first reduced by at most two (up-walking one edge and traversing one suffix link). Summarizing this. even extensions in different phases. The punch line With Tricks 2 and 3.e. Since the maximum node-depth is m. in O(m) total time.Buff is the entire string S. Eod Step 3 correctly sets ji+l because the initial sequence of extensions where extension rule 1 or2 applies must end at the point where rule 3 first applies The key feature of algorithm SPA is that phase i + 2 will begin computing explicit extensions with extension j*. the algorithm therefore executes only 2m explicit extensions. The key is that the first explicit extensionin any phase (after phase 1) begins with extension j*. there is atmost is exE!cuted in both phases those j. so the repeated extension of P in phase i + 2 can execute the suffix extension rule for without any up-walking. 3. phase i + I ends knowing where string SU*.. S(m) tmnucusumx links and implementation tricks 1. and trees through Iff. phase i + I is implemented as follows: Single phase algorithm: Begin 1. the current node-depth does not change between the end of one extension and the beginning of the next.I algorithm builds a true suffix tree [or S.1) that the maximum number of node skips done during all the down-walking (and not just in a single phase) is bounded by Oem). is the single character PROOF The time for all the implicit extensions in any phase is constant and so is O(m) over the entire algorithm As the algorithm executes explicit extensions. Over the entire execution of the algorithm.2. D SPA 6. Moreover..1) how the current node-depth changes during successive extensions. of i)..1.9: Carloon of a possible execution algorithm.106 LlNEAR·TIME CONSTRUCTION OF SUFFIX TREES 6. but it does remain the same between two successive phases. But (as detailed in the proof of Theorem 6. Increment index e to l through j.3. explicit extensions in phase i + I (using algorithm SEA) are then only required from extension ji + 1 until the first extension where rule 3 applies (or until extension i + I is done).L. (By Trick 3 this correctly implements all implicit extensions 3. and Suff.e . This will . To bound the total number of node skips done during all the down-walks. and there are only 2m explicit extensions. and thereafter the down-walk in that extension increases the current node-depth by one at each node skip. and then does explicit work for (some) extensions starting with extension i. where j* was the lasrexplicit extension computed in phase i + 1.1. (i. Setj!+1 toj*-I. i Since there are only m phases.) 2...-[ down to T. l}. Ukkonens For example. extensions. + I.2. Weiner's algorithm constructs trees from Tm".mJ ofS starting in position i. it follows (as in the proof of Theorem 6.

In general. T.+1 and will verify that it holds for tree 7.. for each i from m. Ultimately.2. and label the new edge with the remaining (unmatched) part of Suff. then create a new node there. Since Tm+l only has one edge out of the root. Finally. Let R denote that path. refer to the node there (old or new) as w. The edge labeled with the single character $ is omitted. is a prefix. the path from the root to leaf i has label Suff.+] down a path whose label matches a prefix. has been created. to create 7.+1 no two edges out of v have edge-labels beginning with the same character. Head(i) is also a prefix of SUffkfor some k > i. In either case. sinC€lsuchanedgeispartofeverysuffixtree serve to introduce and illustrate important definitions and facts. let L. more detail. from 7.. Thereafter. from 7.$. Find the end of the path labeled Head(i) in tree "T.1. path Risfoundbystartingattherootandexplicitlymatchin g successive characters of Suff.S. Figure 6. i) does not occur as the first character on any other edge out of w. The indicator vector is a bltvector soits enrries are just G or I. Head(m) is always the empty string because Sri + 1. Thus the matching continues on at most one edge out of u. Thus the claimed inductive property is maintained Clearly.-'-1 and character S(i). and each is indexed by the characters of the alphabet. Figure 6. for the English alphabet augmented with S. the algorithm constructs each tree T. no two edges out of u have edge-labels beginning with the same character.10: A step in the naive Weiner algorithm. of Suff. from tree 7. Since no further match had the first character on the label for edge (11'. each link and indicator vector will be of length 27.+1. For example. The link vector is essentially the reverse of the suffix link in Ukkcnen's algorithm. In fact. because no suffix.. That is. The idea of the method is essentially the same as the idea for constructing keyword trees (Section 3. eac h tree 7.108 LINEAR·TIME CONSTRUCfION OF SUFFIX TREES 6. of another. For example.(. The full string tat is added to the suffix tree lor at.+1 can be described as follows: Naive Weiner algorithm 1. down 10 1.S with successive characters along a unique path in 7.+l 2.]0 shows the transformation of 12 to 'Ii for the string rat. since at any node v of 7..m] is the empty string when i + 1 is greater than m and character S(m):f. will have the property that for any node (I in 7. that path exactly spells out string Suff. Let l.S. Similarly. whereas each entry in the link vector is either null or is a pointer to a tree node.+1' The matching path is unique. this is trivially true for Tm+l. If no node exists at that point. and the two links are used in similar ways to accelerate traversals insid e the tree. but for a different set of strings and without putting in backpointers.sotree 7. start at the root of7. WEINER'S LiNEAR·TIME SUFFIX TREE ALGORITHM 109 Since a copy of string Head(i) begins at some position between i + 1 and tn. Toward a more efficient implementation 6.r) spe cify the entry of the indicator vector at node v indexed by character x.2. Then we will speed up the straightforward construction to obtain Weiner's linear-time algorithm 6.$. of length equal to the size of the alphabet. We assume inductively that this property is true for tree 7.2. As the algorithm proceeds.2. A straightforward construction The first tree T"H I consists simply of a single edge out of the root labeled with the termination character S.S. It follows mat Head(i) is the longest prefix (possibly empty) of Suff that is a label on some path from the rOOIin tree 71+1' The above straightforward algorithm to build 7. ofaleafiism-i+l. no further match will be possible. FindingHead(i) efficiently Note that Heud(i) could be the empty string.(x) specify the entry of the link vector at node v indexed by character x The vectors I and L have IWO crucial properties that will be maintained inductively throughout the algorithm' .4). add an cdge out ofw to a new leaf node labeled i.

.+1 in the good case. The ifand labeled not end at a node.+1. Hence the above properties hold for T. but the converse is not true. Otherwise L"(_t)is null.3.+l point> to (internal) node i< in 1. Therefore.+1 path-label if and of node only if there path is apath u. in tree '1.. for some k Since u was found with iv(S(i)) =' I there is a path in 7. But then Suffi+1 begins f3a and Suffj. it may not yet be clear at this point why this method is more efficient than the naive algorithm for finding Head(i) That will come later.r . I.2. The algorithm must take care of two degenerate cases. The basic idea of Weiner's algorithm Weiner's algorithm uses the indicator and link vectors to find Head(i) and to construct 7. we can express Head(i) as some (possibly empty) string fj. must either be u or below l' on the path to leaf i + 1. in the tree in Figure 5.(x) o is the = I in 1. Hence there is a node II in '1.2. more efficiently. and set all the link entries for the root to null. It is also immediate that if I"(x) = 1 then IL'(x) = I for every ancestornodevofu. o 6. an initial part of the path to leaf k) labeled S(i)/3 in '1. begins S(i)f3b. for any node u and any character x. Let us first examine how the algorithm handles the degenerate cases when v and v'do not both exist.1 (page 91) consider the two internal nodes u and w with path-labels a and xc respectively./e first discuss how 10construct I. I.. PROOF Helld(i) is the longest prefix of Suff that is also a prefix of Sufi. Both i + I andk+ I are greater than or equal to I + landless than or equal to m. That means that 0: is a prefix of {3 and so node u.. Tree Tm has only one nonleafnode.(x) to zero for every other character . 0: = 13.(b) = 1.+1 path-Iabelr«. begins S(i)/3a and Suft. if u ¥= v then u would be a node below t' on the path to leaf i + =' 1. Clearly. No assumption about u' was made. For any cbaracrer if i"i has For example. labeled xa.(S(m)) to one.(5(/)) =' 1 and v has path-label 0:.2.+1 with path-label 13. but Lw(b) is null.. >i . Also.+1 that begins with Head(i) is at least one character long. Also.IlO • Forany{singJe) from the root of xo' need • UNEAR-TIME CONSTRUCTION OF SUffIX TREES 6. but these are not much different than the general "good" case where no degeneracy occurs. the above method correctly creates '1.(x) is nonnull only if lu(x) = 1. one continuing with character a and the other with character h. from 1iTI' although we must still discuss how to update the vectors. the algorithm and that il has when u and exist. say Suff. The algorithm will maintain the vectors as the tree changes.r . However. with path label {3. and LuCx) = u'. and then we handle the degenerate cases The algorithm in the good case Note that in Theorem 6. Further. in constant time after v' is found.1 and its proof WI:: only assume that node L' exists.. both begin with string Head(i) =' S(i)/3 and differ after that.+lo there must be a path from the root labeled {3 (possibly the empty string) that extends in two ways. For concreteness. This will be useful in one of the degenerate cases examined later. so v = U. where e has path-label a. Tv.q begins {3h. and we will prove inductively that the above properties hold for each tree Suff and Suff. from 7. namely the root r.ln this tree we set I. WEINER'S LINEAR-TIME SUFFIX TREE ALGORITHM III character x and any node 4+1 . Then I" (x) = 1 for the specific character . only L"(x) in T. and 1"(5(1)) = I since there is a path (namely. so Head(i) must begin with S(i)o:.c. L. so both suffixes are represented in tree 1i+I' Therefore. node u must be on the path to leaf i+ i since f is a prefix ofSuffi+1 Now 1. where u. set I. This contradicts the choice of node u.

. = or I.IHead(ill characters of Suffi). WEINER'S LINEAR-TIME SUFFIX TREE ALGORITHM 113 The two degenerate cases 4.2. z). if node w its link entries for 7i should be null.(SU)) = 0. then Head(i) end. I.2. otherwise. and 1. some path from the root would begin with SO).(Still =0. We will see how to find Head(i) efficiently in these two cases. but u' does nOI exist. Correctness It should be clear from the proof of Theorems 6.+1. a prefix of y. 1ft.(SU» = 0.(S(i») should be one new node was created. it must update the 1 and L vectors How to update the vectors After finding (or creating) node w.1 node u} has path-label case. In this case the walk ends at the root and no node u was found. Go to Step 4. there is some edge e = (r.(x) must also be I in7.2. ° rather than I. > O.1 Head(i) ends exactly t. The fact that there are no nodes between u and z in 'Ii+l means that every suffix from Suff. To see this. where u' has path-label HeadU) does The two degenerate cases arc that node v (and hence node Vi) does not exist or that u exists but u' does not. characters down (the natural analogue of the good case) 6. Tree 7i now been created. z) whose edge-label begins with character S(i). etl/?e(v". for if it did appear.lhcn let w denote that node.(x) = I then is set to ° l.2 and the discussion of the degenerate cases that the algorithm correctly creates tree T. It follows that character S(i) does oot appear in any position greater than i .2. be the number of characters from the root to ».. The full algorithm for creating Incorporating all the cases gives the following algorithm: wetner's Tree extenston -t. create a node It' at the end of Head(i).-(S(il) = I for some v (possibly the root). w created in the 2.z) PROOF It is immediate that if (x) 1 and yet l. we must update the I and L vectors so that are correct for tree 7.. + I characters from the root.4. create a new edge (w. suppose 10 the contrary that there is a node u in T.2. with path-label xHead(i). and of course node w has path-label Hcadti). Create a new leaf numbered i . atthe root. This is true whether I. Ifanode already exists at the end ofHead(i). followed with the termination character $. If the root is reached and 3.112 LINEAR-TrME CONSTRUCTION OF SUFFIX TREES 6.. although before it can create 'Ii-j. Furthermore. Recall that r denotes the root node Case 1 {. i) labeled with the remaining substring of Suff I . down to Suff +-1 that begins with string Head(i) must actually begin with the longer . 3a. from Ti+1 indicator vectors is proportional to the time for the walk The only remaining update is to set the J vector for a newly created node the interior of an edge e = [u".2. If the algorithm found a node v such that Theorem 6. Head(i) is the empty string and ends at the root Case 2 i. then some suffix in that range would begin with S(i). Lett. From Theorem 6.1 and 6. Since v exists. But can it happen at the moment that w is created'! We will 3b. In this case the walk ends at the root with LASU)) null. Let node: have path-label y.. L.(SU» would have been L So when [. from 7.

and the connection between the algorithms is not obvious That connection was suggested by and made explicit by Giegerich and Kurtz [178J.114 LINEAR-TIME CONSTRUCTION OF SUFFIX TREES 6.c. and is simpler to describe. regardless of how they are constructed: where . The only question remaining is how the current node-depth changes when a link pointeristraversedfromanodev'tov" is a This fact was also established as Coronary 6. McCreight's algorithm at the high level PR(NW Let II be a nonroor node in 7. the current node-depth increases by one.+1 there can be a path labeled xHead(i) only if there i~ also a path labeled xy. the complete string 5) (This is opposite to the order used in Weiner's algorithm.1. The time for the algorithm is proportional to the total number of nodes visited during upward walks. The naive construction method is immediate and runs in O(m!) time. but the be large in version can be seen as a somewhat disguised ofMcCrcight's algorithm. 0' is a prefix." is created the current node-depth is two.3. and suppose II has path-label S(iJQ' for some nonempty string a. D 6. and this holds for any character x. That means that Suffi+1 must begin with the string Head(i). Last comments about Weiner's algorithm Our discussion of Weiner's algorithm establishes an important fact about suffix trees.] It builds a tree encoding all the suffixes of S starting at 1 through i + I. when the algorithm is at node u' (or at the root) and then creates a new node w below 1''' (or below the root). Recall that the node-depth of a node u is the number of nodes on the path in the tree from the root to v For the time analysis we imagine that as the algorithm runs we keep track of what node has most recently been encountered and what its node-depth is. Hence the time to construct 7. However. Since Ukkonen's algorithm has all advantages of McCreight's.although it could be less. It follows that the current node-depth can also only decrease at most 2m times since the current node-depth starts at zero and is never negative. labeled xHead(i) (the requirement for [". McCreight [318] gave a different method that also runs in linear time but is more space efficient in practice. if there is a path in 7.+1 starts atleafi + 1 and walks toward the root. when the algorithm begins. and il is superficially different from Ukkonen's algorithrn. is proportional to the number of nodes encountered on the walk from leaf i + 1. kept at each node. the algorithm to construct Z' from 7.(x) = I at the moment w is created. Call the node-depth of the most recently encountered node the current node-depth. and when it finds node v or reaches the root. and assuming the usual parent pointers.j. Hence in 7.+l on the path from the root to u". MCCREIGHT'S SUFFIX TREE ALGORITHM 115 string y. of Suff. i). the high-level organization of Ukkonen and Mcflreight's algorithms are quite different. from 7. Now S(i)a is the prefix of Suff and of Suff.. Only constant time is used to follow a L link pointer. The current node-depth decreases at each move up the walk.2 during the discussion of Ukkonen's algorithm 6. this is considered linear in the length of S. for some k > i. For a fixed alphabet.6.+1 and update the vectors is proportional to the time needed during the walk from leaf i + I (ending either at u' or the root).r 6. The inefficiency in Weiner's algorithm is the space it needs for the indicator and link vectors. But since w has path-label Head(i). Time analysis of Weiner's algorithm The time to construct 7.2. This walk moves from one node 10 its parent. All nodes on the root-to-u" path are of this type.. Also. For example. that time can be reduced to O(m). and so must be below z in 7. the indicator entry for x has been set to I at every node on the path from the leaf i + L The walk finishes before node w is created. and there must be a node (possibly the root) with path-label Q' on the path to u' in 7. and so it cannot be that Iz<x) = 0 at the time when w is created. D The current node-depth can increase by one each time a new node is created and each time a link pointer is traversed. z is on the root to J + I path.3. and the theorem is proved.+1. except the node (if it exists) with path-label SU). Since u' is on the path from the root to leafi + I. and this string extends differently in the two cases. However. so the total number of nodes visited during all the upward walks is at most 2m. from the tree encoding all the suffixes of S starting at positions throughi.2.(xl to be 1) but no path xv' then the hypothesized string xHead(i) must begin at character i of S. We leave this to the interested reader to work out . one at a time. Using suffix links and the skip/count trick. Therefore. Clearly. I and L. That is.. except for the single node (if it exists) with path-label SU). So if path xHead(i) exists in 7. leaf i + I must be below w in T. then l.+1' Hence the path to v' has a node corresponding to every node on the path to u".. Hence the depth of u" is at most one more than the depth of v'. starting from suffix one (i. the current node-depth is one and just after 7. only constant time is used to move between nodes.5. when the algorithm walks up a path from a leaf the current node-depth decreases by one at each step. Hence the total number of increases in the current node-depth is at most 2m. we will only introduce McCreight's algorithm at the high level. and only constant time is used after that to add wand edge (w. McCreight's suffix tree algorithm Several years after Weiner published his linear-lime algorithm to construct a suffix tree for a string S. so the theorem is proved. 0 Wecanoowfinishthetimeanalysis Weiner's algorithm constructs the suffix tree McCreight's algorithm builds the suffix tree T for m-Iength string S by inserting the suffixes in order.

'" babxba.1 Figure 6.6 1. more attention to implementation is needed. the above method can be simulated without first concatenating the strings. When a new edge from 1) is added to the tree. When 52 is fully processed the tree will encode all the suffixes of 5'1 and all the suffixes of 52 but will have no synthetic suffixes. Those methods are easily extended to represent the suffixes of a set {SI. program./ of strings Those suffixes are represented in a tree called a generalized suffix tree. "'. Then starting at the root of this tree. the first i phases of Uk konen's algorithm for 52 have been executed on top of the tree for SI_ So. and build a suffix tree for the concatenated string. = xabxa and S. it can use an impractical amount of space as 11:1 and m get large An alternative to the array is to use a linked list at node v of characters that appear at the beginning of edge-labels out of t'.thesecondnumberindicatestheslartingpasition of the suffix in thai siring in theory by suffix trees. For large trees. quadratic (or even cubic) time algorithms into algorithms that run in O(m) worst-case time. In this case. implementing suffix trees 10 reduce practical space use can be a serious concern. PRACTICAL IMPLEMENTATION ISSUES 117 6. The key point is thatlt allows a faster termination of a search for a character that is not in the list.. .r]. For those problems. Practical implementation issues The implementation details already discussed in this chapter turn naive.4. Since the list is searched sequentially it costs no more to keep it in sorted order. S. by design. That is. A conceptually easy way to build a generalized suffix tree is to append a different end of string marker to each string in the set. all the unwanted synthetic suffixes are removed Under closer examination. and it implicitly encodes every suffix of the string S~[l .5 1. Keeping the list in sorted order will be particularly useful in some of applications of suffix trees to be discussed later.5 2. and/or where the alphabet size is in the hundreds. because each end of string marker is distinct and is not in any of the original strings. the result is the generalized suffix tree shown in Figure 6.11. However. The second subtlety is that suffixes from two strings may be identical.3 2. Suppose that the first i characters of 52 match. Hence the number of symbols per edge increases lrom two to three.5. This 'iomewhat reduces the average time to search for a given character and thus speeds up (in practice) the construction of the tree. S2. or even millions. assumed to be distinct. Consequently. suffix links allow an algorithm to move quickly from one part of the tree to a distant part of the tree. walk up at mOSI one node from the end of 52[ I . with that current tree. a leaf must indicate all of the strings and start ing As an example. but it is horrible for paging if the tree isn't entirely in memory." The comments made here for suffix trees apply as well to keyword trees used in the Aho---Corasick method The main issues in all three are how 6.2 2. if we add the string babxba to the tree for xabxa (shown in Figure 6.1). then concatenate all the strings together.S. These "synthetic" suffixes are not generally of interest. which will be used in many applications. Generalized suffix tree for a set of strings We have so far seen methods tu build a suffix tree for a single string in linear time.. paging can also be a serious problem because the trees do not have nice locality properties. etc. Essentially. where the typical string size is in the hundreds of thousands. but otherwise causes no problem. One is that the compressed label~ on differ~nt edges may refer to different strings. Repeating this approach for each of the strings in the set creates the generalized suffix tree in time proportional to the sum of the lengths of all the strings in the set There are two minor subtleties with the second approach. The resulting suffix tree will have one leaf for each suffix of the concatenated string and is built in time proportional to the sum of all the string lengths. First build a suffix tree for 51 (assuming an added terminal character).11: Generalized suffix tree far strings S.116 LINEAR-TIME CONSTRUCTION OF SUFFIX TREES 6. The tree at this point encodes all the suffixes of 51. assuming a fixed alphabet 1:. The first numbe' at a leaf indicaleslhestdng. There are problems nicely solved This array random accesses and updates and. 2. particularly as the size of the alphabet grows.. one identifying a string Si and the other a starting position in Si' One defect with this way of constructing a generalized suffix tree is that the tree represents substrings (of the concatenated string) that span more than one of the original strings. a new character (the first character on the new edge label) is added to the list. match 52 (again assuming the same terminal character has been added) against a path in the tree until a mismatch occurs. We describe the simulation using Ukkonen " algorithm and two strings 51 and 52. This is great for worst-case time bounds. although it will still be true that no suffix other. Traversals from node v are implemented by sequentially searching the list for the appropriate character.2 2. the label on any path from the root to an internal node must be a substring of one of the original strings Hence by reducing the second index of the label on leaf edges. But to make suffix trees truly practical. a "linear" time and space bound is not sufficient assurance of practicality. without changing any other parts of the tree. Indeed. resume Ukkonen's algorithm on 52 in phase i + I. The leaf numbers can easily be converted to two numbers. i]. The end of string markers must be symbols chat are not used in any of the strings.

5. if 1)) is a leaf and u is its parent. In contrast. In a position tree. but some are more equal than others The key implementation problems discussed above are all related to multiple edges (or links) at nodes. Similarly. see [23]. However. More completely. searching for a pattern P usmg a suffix tree can be done with 0(1 PI) comparisons only if we use 8(ml"E I) space. In asymptotic worst-case time and space. Nevertheless. When m and I: are large enough to make implementation difficult.lIe ordering. and McCreight ai. removing the need for explicit representation of the edge or the w. but for large trees and alphabets hashing is at least for some of the nodes. and other suggested variants of suffix trees. A suffix tree whose leaves are deleted in this way is called a position tree.the larger the alphabet. using perfect hashing techniques worst-case time bound can even be preserved.6. is 10implement the list at node v as some sort of balanced tree [10].1. Generally. where m is the size of all the patterns in a keyword tree or the size of the string in a suffix tree./" m"d. Nodes in the suffix tree toward the leaves tend to have few children and lists there are attractive.T. the exact matching method using Z values has worst-case space and comparison requirements that are alphabet independent .118 LINEAR-TIME CONSTRUCTION OF SUFFIX TREES Keeping a linked list at node v works well if the number of children of » is small. one can replace the first five levels of the tree with a five-dimensional array (indexed by substrings of length five).hthe use of suffix tinks and the skip/count trick.er. some people prefer to explicitly reflect the alphabet size in the time and space bounds of keyword and suffix tree algorithms. Every one of these substrings exists in some known protein sequence already in the databases Therefore. and usually the approach that builds the larger tree is conceptually simpler. when space is a serious practical concern (and in many problems.e been developed. we will diSCUSS such space-reducing alternatives III some detail throughout the book.. not whether one character precedes the other III sOl. The method present. For a more in-depth look at suffix tree implementation issues. in some people's view. there is a one-to-one correspondence between leaves of the tree and substrings that areuniquelyoccurringinS.lnthose applications.te·size workstali"n Inng that con T I 6. Depending on the other implementation choices. Whatever implementation is selected. a compromise between space and speed.) Later in the book we will discuss several problems involving two (or more) strings P and T. These are influenced by the size of the alphabet E .. be completety contained in the main memory . most large suffix trees are used in applications where S is fixed (a dictionaryordatabase)forsometimeandthesuffixtreewillbeusedrepeatedly. the method only checks :vhether the characters are equal or unequal. show that tinear-time suffix tree atgorithms would be impossible if edqe-tabels were written explicitlyon the edges 2. Two-dimensional exact matching is one such example. Alphabet-independent algorithms have also been developed tor a number of problems other than exact matc~ing. hashing or balanced trees may be the best choice. to the linear time and space bounds of keyword and su~x tree algorithms: "Alllinears are equal but some are more equal than others". the larger the problem. then information about 1)) up to v.Hersuffix ~ee . An alternative way to reduce the 0(&) time to O{ni) (wilhout suffix tinks or skip/count) te to keep a poinlertothe end of each suffix 01 51:1 ll . where k is the number of children of o. neither approach is superior to the other. where the total length 01 the edge-labels on their suffix trees grows faster than ("'J(m) misth e tength of the string). That time was then reduced to O(m2) wit. but where one uses a generalized suffix tree for two or more strings and the other uses just a suffix tree for the smaller string. The same idea has been applied [320] to depth seven for DNA data. Hence no prior knowledge about the alphabet need be assumed. For that reason.should be replaced with the ~nimum of O(m logm) and O(mlogl"EI). but in worst-case adds time lEI to every node operation. i: 6_6_Exercises 1. Alphabet independence: alliinears are equal. Again..ed III Section 3. and so arrays are a sensible choice at those nodes. it is clear that a suffix tree for a string will take considerably more space than the representation of the string itself..5. For example. Additions and searches then take O(log k) time and O(k) space. We win also have examples where equally time-efficient solutions exist. Nodes near the root of the tree tend to have the most children (the root has a child for every distinct character appearing in 5). alphabet-independent methods are more complex than their coarser counterparts. when two characters are compared.tring' more than one million chanl"t. we have built suffix trees for DNA and amino aoid . there are 2()'i possible amino acid substrings of length five. Those people usually refer to the construction time for keyword or suffix trees as O(m log II: I). otherwise we must allow the minimum of 0(1 P Ilog m) and 0(1 PI log 11: I) comparisons during a search for P. if the tree is dense for several levels below the root. Wein. where an entry of the array points to the place in the remaining tree that extends the five-tuple. The O(m) worst-case time bounds are preserved since II:I is assumed to be fixed.gorithms. Tnthis book we will not consider alphabet-independence much further. For nodes in the middle of a suffix tree. this can lead to a large sayings in space since roughly half the nodes in a suffix tree are leaves. then those levels can be condensed and eliminated from the explicit tree. At the extreme. space is more of a constraint than is time). where two O(IPI + ITI) time solutions exist. In addition. one has the time and motivation to experiment with different implementation choices. the best design is probably a mixture of the above choices. Due to the space and programming overhead of these methods. alphabet-independent solutions for that problem hav. one using a suffix tree for P and one using a suffix tree for T. II . although we will discuss other approaches to reducing space that can be employed If large alphabets cause excessive space use. the Aho-Corasick. all either require 8(mlLll space. And. including l Allhough. Ukkonen.So despite the added conceptual burden. Construclan infinitetarruly of strings over a tixedalphabet. The alphabet independence of these algorithms makes their linear time and space bOLmd:superior. we first introduced Ukkonen's algorithm at ahigh tevel and noted that it could be imptemented in 0(&) time. Moreover. the size of the suffix tree for a string may dictate using the sol~tion that builds the sma. this alternative makes sense only when k is fairly large The final choice is some sort of hashing scheme.the worst-case number of con:parisons (either characters Ofnumbers) used to compute Z values is uninfluenced by the size of the alphabet.3 for two-dimensional matching is based on keyword trees and hence IS not alphahetindependent..but if the numberofc hildrenofv is large thenlittle space is saved over the array while noticeably degrading perf ormance. In the text. These properties are also true of the Knuth-Morris-Pratt and the Boyer-Moore algorithms. or the O(m) time bound. A third choice. when implementing a suffix tree for the protein database. That { is. EXERCISES 119 many in molecular biology. the challenge is to find a scheme balancing space with speed. 6. Fortunately..

inour Resolve the backlinks this apparent 9. Hint: However. Consider What would be wrong in rev ersingthestringso on the left end is "simulated" problem solution where on the right end? If you issues 4. This is the common [320]. to the node in the current cannot find an efficient are and why this seems 16. to using index.Ao alternative of 5) at the that second may be in in linear time without using suffix links? The idea implicit suffix tree that 15. 6. Find a problem trees and hence implicit trees than with the single suffix trees Weiner in terms build 5 than does the sing Ie final suffixlreefor of implicit can not save the implicit with the construction suffix tree. algorithm. would Find an on-line Since (like Ukkonen's algorithm algorithm) running update in O(m) store total time that creates all the trees is e{m2). Ukkonen's the time taken to explicitly these each tree without algorithm builds all the implicit suffix trees in O(m) time. the symbol i+1 "e'' is used as the seco nd index on the label varrabre eis setior s t.120 Then Ukkonen's LINEAR-TIME CONSTRUCTION OF SUFFIX TREES 14.comparedtousingthesymbol"e". Can either or Ukkonen's of the string? the rtqht and to the left ends that a change be implemented index i.Explainwhythis s represented index on any leaf edge to In that way. such an saving it. this in detail siring S'isrlOt on the left between there the suffix tree fora is a signiiicant for the reverse the two trees. (contracting) algorithm by a change on the righ tend efficiently handle both changes to relationship between Find it. (Open question) trees.a pointer i. Make explicit of Aha-Corasick. append 5$. Can Ukkonen's is 10 maintain. Then discussion $ and letPbethesetofpatternsconsistingofthem+l a keyword tree for set P using that method the Aha-Corasick as a linear gives the suffix tree for S.1) Sol can be length of string 01 the Aho-Corasick algorithm of Section 3. in or con- high-level entire algorithm algorithm could visit allthese would run in O(m') string Sand ends and create time.2. when ftnks and skip/count) give a linear-time between McCreight's implementation algorithm for McCreight's and Ukkonen's Flesh out the relationship they both are implemented algorithm. is closest algorithm for each the prevlous the changes problem are in the inter! or of the string. is O{m"). "e" is to set the second built lor a set 01 k strings. Is there some higher-level that might establish of both the algorithms evaluate different 11. or entire of maintaining strings from the set. time method. expt ain what the technical to me end oisuftix like a difficult suffix tree sulfix 5. The relationship obvious. Explain O(i) time. and the arguments and arguments at once? implementation needed to make explicit analysis in the analysis. 10. end. in linear time.and one must dynamically Discuss how to do this how to do it if the string Weiner's algorithm maintain efficiently is growing a suffix ilthe tree for a string string is growing that is growing (contracting) 121 Ii+l from I. . state it. Thus it can be called a linear-time on-line algorithm to construct tru6 suffix algorithm 7. By using suffix 13. Discuss icsetting. Suppose tracting. choices for representing the branches attention issues out to in and the vectors in Weiner's algorithm. Note that the algorithm will have to be solved the problem 6. The time for this construction was considered contradiction the relationship between link pointers in Weiner's algorithm suffixes Removing Yet. implicit tree are deleted all in O(m) time. of every leaf edge. The time analyses how the current Examine differences bounds these of Ukkonen's algorithm and of Weiner's closely alqonthm are almost both rely on watching perfectly symmetric. m. and suffix links inUkkonen'salgorithm. and in phase is created.4: Given string algorithm. This sequence more information about that can be solved more efficiently with the sequence in parallel suffix trees may expose S. and consider both time and space the suffix tree and in using it afterwards. Pay particular the effect 01 alphabet size and string length. there has a much easier in the suffix thisapproach. Ukkcnen's suffix trees. and the time the similarities node-depth two algorithms changes. may be deleted tree lor biological than when arbitrary sequence substring Additional data strings may be case the the global to the set. Empirically 01 the nodes building 12. In Trick 30fUkkonen'salgorithm. through 1m in order and on-line. Consider added problem problem a generalized to updating the suffix tree. implementation tricks similar to those used in Ukkonen's algorithm (particularly. no work and discuss m {the tot al length vantages lor maintaining a generalized point that the leaf edge Explain is requir ed to update anydisad the generalized solution suffix tree in thisdynam in detail why this is correct. The naive described algorithm for construcling the suffix tree of S (Section 6. so thatthe 3. and prove it. Suffix links help. of implicit suffix suffix of the algorithm builds all the implicit suffix trees I.

but where SIKh an algorithm is Immediate with the use of suffix trees. In fact. rather than by suffix trees.7. and after only spending linear lime preprocessing T. In this chapter and in the exercises at its end. But there are many applications we can now discuss that illustrate the power and utility of suffix trees. without using suffix trees. where kp is the number of occurrences of P. Those algorithms spend O(n) preprocessing time so that the 122 I. where k is the number of occurrences.3) the use of suffix trees in the exact string matching problem when the pattern and the text are both known to the algorithm at the same time. C~ .8 how to use a suffix tree to solve the exact set matching problem in exactly the same time 7. the availability of suffix trees may make certain of the problems appear trivial.4 discussed the exact set matching problem. Although the asymptotic time and space bounds forthe two methods are the same when both the set P and the string T arc specified together. suffix trees can be used to achieve the same time and space bounds as Knuth-Morris-Prattand Boyes-Moore when the pattern is known first or when the pattern and text are known together.1. Can suffix trees be used in this scenario to achieve the same time bounds? Although it is not obvious. The Aho----Corasick method finds all occurrences in T of any pattern from P in O(n + m + k) time.is the classic situation handled by Knuth-Morris-Pratt or Beyer-Moore. The constant terms for the space bounds and for the search times depend on the specific way the trees are represented (see Section 6. We have already discussed (in Section 5 . APLl: Exact string matching There are three important variants of this problem depending on which SIring P or T is known first and held fixed.2. but they achieve vastly superior performance in the important case that the text is known first and held fixed. 7.2. built in O(n) time. lems. as the Knuth-Morris-Pratt or Beyer-Moore algorithms But the exact matching problem often occurs in the situation when the text T is known first and kept fixed for some time. Some of the most impressive applications need an additional tool. Recall that set P is of total length n and that text T is of length m. the constant-time lowest common ancestor algorithm. (As discussed in Section 3. Determining the relative advantages of Aho-Comsick versus suffix trees when the text is fixed and the set of patterns vary is left to the reader There is one way that suffix trees arc better. is amazing and was the prime motivation for developing suffix trees. That any pattern (unknown at the preproce:sing. O(n + m). the Aho-Corasick method uses less space than a suffix tree. even though linear-time algorithms for those problems :vere unknown before t~e advent of suffix trees. There we developed a linear-time solution due to Aho and Corasick.2.5. the suffix tree T is of size O(m). the search for all occurrences of P in T must be done as quickly as possible. several of these applications will be explored. In contrast. totally independent of the size of T. Thus for the exact matching problem (single pattern).or without some historical perspective.8.) When the total size of the patterns is smaller than the text. and is used to search in O(n) time. To solve it. The Ahc--Ccrasick method uses a keyword tree of size O(n).5). linear-time solutions to complex string prob. where Knuth had conjectured that ~ linea~-time algorithm would not be possible [24. a long sequence of patterns IS input. than keyword trees for the exact set matching problem (in addition to other problems). Using a suffix tree for T. In the case that the set of patterns is larger than the text. Other applications arise in the context of specific problems that will be discussed in detail later. Comparing suffix trees and keyword trees for exact set matching Here we compare the relative advantages of keyword trees versus suffix trees for the exact set matching problem. 2781. the problem of finding all occurrences from a set of strings P in a text T. and so are deferred until that algorithm has been discussed (in Chapter 8). Most of these applications allow surprisingly efficient. In that case the use of a suffix tree achieves the same worst-case bound. Thus the exact set matching problem is actually a simpler case because the set P is input at the same time the text is known. Another claxxic example is the longest prefix repeat problem discussed in the exercises. Let n denote the length of P and k denote the number of OCCurrencesof P in T. The total time needed in this approach is O(n + m + k) 7. where the set is input all at once. where a linear-time solution using suffix trees is easy.when the pattern is first fixed and can be preprocessed before the text is known . In contrast. one method may be preferable to the other depending on the relative sizes ofP and T and on which string can be preprocessed. we saw in the previous section that when T is first known and fixed and the pattern P varies. we build suffix tree T for T in O(m) time and then use this tree to successively search for all occurrences of each pattern in P. the answer is "yes". while the patterns vary. After the text has been preprocessed. there is a time/space trade-off and neither method is uniformly superior to the other in time and space.1. all occurrences of any specific P (oflengthn) in T can be found in O(n + kp) time. algorithms that preprocess the pattern would take O(n + m) time during the search for any single pattern P. This same time bound is easily achieved using a suffix tree T for T.4 ISone clear example. takes O(m) time to build. and for each pattern P in the sequence. We will show in Section 7. or more robust. Perhaps the best way to appreciate the power of suffix trees is for the reader to spend Some time trying to solve the problems discussed below. We will see many applications of suffix trees throughout the book. APL2: Suffix trees and the exact set matching problem Section 3. The reverse situation . but they are certainly large enough to affect practical performance. This reverse usc of suffix trees will be discussed along with a more general problem in Section 7. all Occurrences can be found in O(n + k) time. and then carries out the search in O(m) time. but where the best prior method ran in O(n logn) time.1 there are applications in molecular biology where the pattern library is much larger than the typical texts presented after the library is fixed. Hence. The longest common substring problem discussed In Section 7. APL2: SUFFIX TREES AND THE EXACT SET MATCHING PROBLEM 123 7 First Applications of Suffix Trees search can be done in Oem) time whenever a text T is specified. stage_) can be found in time proportional to its length alone. but the suffix tree uses less search time. Without this effort . the suffix tree approach uses less space but takes more time to search.

3. then 5 is not in the database. or sequence a DNA string will cause unwanted DNA to become inserted into the string of interest or mixed together with a collection of strings.S. A third desired feature is that the preprocessing of the database should be relatively fast Suffix trees yield a very attractive solution to this database problem. As expected..• likely between That interval is therefore used as a "nearly unique" identifier DNA is extracted from the remains of who have been killed is examined. Contamination of protein in the laboratory can also be a serious problem. The DNA database contains a collection of previously sequenced DNA strings. in O(n) time. requires only O(m) space. That problem reduced to finding the longest substring in one fixed string that is also in some string in a database of strings. where k is the number of occurrences of the substring. 7. If the full string 5 cannot be matched against a path in T. A generalized suffix tree T for the strings in the database is built in O(m) time and. As usual. During cloning. is assumed to be large. probe. a set of strings. 278]. or a database. if 51 = sealiver. In that case. Each leaf of the tree represents either a suffix from one of the two strings or a suffix that occurs in both the strings. The sequenced interval has two key properties: It can be reliably isolated by the polymerase chain reaction and the DNA string in it is highly variable (i.Constructionofthesuffixtreecanbcd one in linear time (proportional to the total length of 5[ and 52). this is accomplished by matching the string against a path in the tree starting from the root. or even the Aho-Corasick algorithm. maintain. more importantly. An efficient and conceptually simple way to find a longest common substring is to build a generalized suffix tree for 51 and 52. The substring problem is one of the classic applications of suffix trees. then the longest com[non substring of S. The time/space tradeoff remains. For example. whereas no such choice is available for a keyword tree. and an efficient method to check thai is of value. denoted by m. The path-label of any internal node marked both I and 2 is a substring common to both 51 and 57.4. This is the reverse of the bounds shown above for suffix trees. if 5 is a substring of strings in the database then the algorithm can find all strings in the database containing 5 as a substring. or declared not to be there. clone. the matched path does specify the longest prefix of S that is contained as a substring in the database. When a new DNA string is sequenced. Boyer-Moore. This is the reverse of the exact set matching problem where the issue is to find which of the fixed patterns are in a substring of the input string.2 and 125 of Part Ill).6. Later. and the node markings and calculations of string-depth can be done by standard linear-time tree traversal methods. The total length of all the strings in the database. More realistically. In summary.5_ APL5: Recognizing DNA contamination Often the various laboratory processes used to isolate. one looks to see if the extracted and sequenced string is a substring of one of the strings in the database. because of errors. In the context of databases for genomic DNA data [63.4. a sequence of strings will be presented and for each presented string S. the algorithm must find all the strings in the database containing S as a substring. that the new string contains one of the database strings. which will be discussed in Sections 1 1.9. The full string S is in the database if and only if the matching path reaches a leaf of T at the point where the last character of Although the longest common substring problem looks trivial now. So the algorithm has only to find the node with the greatest string-depth (number of characters on the path to it) that is marked both I and2. copy. We will return to this problem in Section 7. purify. but that is the case of exact setmatching. 7. it could be contained in an already sequenced string. Any single string S of length n is found in the database.and the longest such string is the longest common substring. That longest common substring would then narrow the possibilities for the identity of the person. APL4: Longest common substring of two strings A classic problem in string analysis is to find the longest substring common to two given strings SI and 5> This is the longes/commollsubItrillg problem (different from the longest common subsequence problem. the problem of finding substrings is a real one that cannot be solved by exact set matching. Moreover. However. What constitutes a good data structure and lookup algorithm for the substring problem? The two constraints are that the database should be stored in a small amount of space and that each lookup should be fast. 320].3. Mark each internal node u with a I (2) if there is a leaf in the subtree of v representing a suffix from 5[ (S2).) One somewhat morbid application of this substring problem is a simplified version of a procedure that is in actual usc to aid in identifying the remains of U. one might want to compute the length of the longest substring found both in the newly extracted DNA and in one of the strings in the database. A solution to that problem is an immediate extension of the longest common substring problem and is left to the reader 7. This takes O(n + k) time.e . APL3: The substring problem for a database of patterns The substring problem was introduced in Chapter 5 (page 89). military personnel Mitochondrial DNA from live military personnel is collected and a small interval of each person's DNA is sequenced. APL4: LONGEST COMMON SUBSTRING OF TWO STRINGS 125 and space bounds as for the Aho-Corasick method . contamination is often caused . The longest common substring problem will be considered in Section 7. and neither is it contained in any string there. the opposite case is also possible. is first known and fixed.124 FlRST APPLICATIONS OF SUFFIX TREES S 7. In the most interesting version of this problem. bUI a suffix tree can be used for either of the chosen time/space combinations. The results obtained using a suffix tree are dramatic and not achieved using the Knurh-Morris-Pran. we have condition of the remains may not allow complete extraction or sequencing of the desired DNA interval.O(n) for preprocessing and O(m) for search. this is achieved by traversing the subtree below the end of the matched path for 5. it is very interesting to note that in 1970 Don Knuth conjectured that a linear-time algorithm for this problem would be impossible [24. (Of course.4. giving a more space efficient solution Now recall the problem of identifying human remains mentioned in Section 7. given our knowledge of suffix trees.

Generalized suffix trees can be built in time proportional to the total length of the strings in the tree.. find all substrings of 52 that occur in 51 and that are longer than some . (The dinosaur story doesn't quite fit here because there isn't yet a substantialtranscrtpt cfhuman DNA. since that algorithm is designed to search for occurrences ofjult patterns from P in 51. the problem of finding (exactly matching) common substrings in a set of distinct strings arises as a subproblem of many heuristics developed in the biological literature to align a set of strings. and then mark every internal node that has a leaf in its subtree representing a suffix from . there is a more space efficient solution to the contamination problem. then one can have greater confidence (but not certainty) that the DNA has not been contaminated by the known contaminants. because mutations that occur in those regions will more likely be lethal. and other DNA sources being worked with in the laboratory.) A good illustration comes from the study of the nemotode C. The "extracted' DNA sequences were shown. These includec!oning vectors. called multiple alignment. the problem of finding common substrings arises because mutations that occur in DNA after two species diverge will more rapidly change those parts of the DNA or protein that are less functionally important.8. APL6: Common substrings of more than two strings One of the most important questions asked about a set of strings is: What substrings are common to a large number of the distinct strings? This is in contrast to the important problem of finding substrings that occur repeatedly in a single string In biologlcalsrrings (DNA. rather than for substrings of patterns As in the longest common substring problem.15) in which short strings of sequenced DNA arc assembled into long strings by looking for overlapping substrings. But this one was published.14 and 16.3.10. Often. etegans genome. so we move on to other things. Without going into these and other specific ways that contamination occurs.')1and a leaf representing a suffix from a pattern in P. Blair Hedges. finding DNA or protein substrings that occur commonly in a wide range of species helps poi nt to regions or subpatterns that maybe critical for the function or structure of the biological string Less directly. S. through DNA database searching. elegans. we refer to the general phenomenon as DNA contamination Contamination is an extremely serious problem. for example).. Russell Doolittle [129] writes " . The problem now is to determine if the DNA string in hand has any sufficiently long substrings (say length I or more) from the known set of possible contaminants. the complete genomic sequence of the host organism (yeast. one of the critics of the dinosaur claims. stated: "In looking for dinosaur DNA we all sometimes find material that at first looks like dinosaur genes but later turns out to be human contamination. The parts of the DNA or protein that are critical for the correct functioning of the molecule will be more highly conserved.126 FIRST APPLICATIONS OF SUFFIX TREES 7. or the contamination is from the DNA of the host itself (for example bacteria or yeast). The approach in this case is to build a generalized suffix. tree for the set P of possible contaminants together with 51." [801 These embarrassments might have been avoided if the sequences were examined early for signs of likely contaminants. The biological applications motivate the following exact matching problem: Given a This motivates the following computational problem DNA contamination problem Given a string SI (the newly isolated and sequenced string of DNA) and a known string S2 (the combined sources of possible contamination). the announcement a few years ago that DNA had been successfully extracted from dinosaur bone is now viewed as premature at best. and there have been embarrassing occurrences of large-scale DNA sequencing efforts where the use of highly contaminated clone libraries resulted in a huge amount of wasted sequencing. PCR primers. then the path-label of v is a suspicious string that may be contaminating the desired DNA string. Dr. These substrings are candidates for unwanted pieces of S2 that have contaminated the desired DNA string. Contamination can also come from very small amounts of undesired foreign DNA that gets physically mixed into the desired DNA and then amplified by PCR (the polymerase chain reaction) used to make copies of the desired DNA. Hence suffix trees can be used to solve the contamination problem in linear time. APL6: COMMON SUBSTRINGS OF MORE THAN TWO STRINGS 127 by a fragment (substring) of a vector (DNA string) used to incorporate the desired DNA in a host organism. the contamination problem and its potential solution is stated as follows given length I. Thatproblem. to be more similar to mammal DNA (particularly human) [21 than to bird and crcckodllian DNA. report all marked nodes that have string-depth of I or greater. In discussing the need to use YACs (Yeast Artificial Chromosomes) to sequence the C.. Most directly. one of the key model organisms of molecular biology. If II is such a marked node.6. the DNA sequences from many of the possible contaminants are known. As a rule. it is important to know whether the DNA of interest has been contaminated Besides the general issue of the accuracy of the sequence finally obtained. used in purification. More generally. We will say much more about this when we discuss database searching in Chapter 15 and multiple string comparison in Chapter 14. contamination can greatly complicate the task of shotgun sequence assembly (discussed in Sections 16. In contrast. based on the material in Section 7. Similarly. then. and all the other marking and searching tasks described above can be performed in linear time by standard tree traversal methods. Build a generalized suffix tree for 51 and 52· Then mark each internal node that has in its subtree a leaf representing a suffix of 51 and also a leaf representing a suffix of 52. before large-scale analysis was performed or results published.6. it is not clear if the Aho-Corasick algorithm can solve the problem in linear time. Finally. RNA. On a less happy note. This problem can easily be solved in linear time by extending the approach discussed above for the longest common substring of two strings. All marked nodes of siring-depth I or more identify suspicious substrings.. more than a few studies have been curtailed when a preliminary search of the sequence revealed it to be a common contaminant . We leave this to the reader 7. If there are no marked nodes with string-depth above the threshold I. one has an entire set of known DNA strings that might contaminate a desired DNA string. suggesting that much of the DNA in hand was from human contamination and not from dinosaurs. will be discussed in some detail in Section 14. the experimentalist should search early and often". Clearly. or protein) the problem of finding substrings common to a large number of distinct strings arises in many different contexts. Therefore.

[71].2... Once the C(v) numbers are known. •.6. grand. That is. Thus the method presented here to compress suffix trees can either be considered as an application of suffix trees to building DAWGs or simply as a technique to compact suffix trees Consider the suffix tree for a string S = xyxaxaxa shown in Figure 7. We will return to this problem in Section 9. That time bound is also nontrivial but is achieved by a generalization of the longest common substring method for two strings. strings whose lengths sum to n one substring sand and and 7. . for each value of k from 2 to K. V(k) holds the string-depth (and location if desired) of the deepest (string-depth) node u encountered with C(v) == k. we could merge pinto q hy redirecting the labeled edge from p's parent to now go into o. change V(k) to the depth of u. except for the leaf numbers. For I children. Problems of this type will be discussed in Part III Formal problem statement and first method Suppose we have Definition We want to compute a table of K . (When encountering a node v with C(v) = k. Therefore over the entire tree. where "similar" allows a small number of differences. rather than learning all the locations of the pattern occurrencets). the resulting directed graph can Surprisingly.. build a generalized suffix tree T for the K sn-ing. 1 :. and any significant reduction in space is of value. The edgelabeled subtree below node p is isomorphic to the subtree below nodc q. Computing the C(v) numbers In linear time. the desired I(k) values can be easily accumulated with a linear-time traversal of the tree That traversal builds a vector V where. each 1eafinT has only one string identifier. The resulting vector holds the desired I'. " I . and vice versa.7. O(n). The vector for » is obtained by DRing the vectors of the children of u. V(k) :::: I(k). A more difficult problem is to find "similar" substrings in many given strings. Figure 7. V(k) reports the length of the longest string that occurs exactly k times.s. To find I(k) simply scan V from largest to smallest index. The resulting graph is not a tree but a directed acyclic graph. writing into each position the maximum V(k) value seen..1 entries. compare the string-depth' of » to the current 'value of V(k) and if u's depth is greater than V(k). But that number may be larger than C(v) since two leaves in the subtree may have the same identifier. find substrings "common" to a large number of those strings. after merging two nodes in the suffix tree. consider the set of strings {sandoilar.128 FIRST APPLICATIONS OF SUFFIX TREES 7. for every path from p there is a path from q with the same path-labels. and the string-depth of every node is known. pantry}. time [236]. the time needed to build the entire table is O(Kn). this takes I K time. the algorithm uses O(Kn) time to explicitly compute which identifiers are found below any node. we show here how to solve the problem in O(Kn) time.7. These compression techniques can also be used to build a directed acyclic word graph (DAWG).. That is.] Essentially. APL7: Building a smaller directed graph for exact matching As discussed before. so that identical suffixes appearing in more than one string end at distinct leaves in the generalized suffix tree Hence. .•.. Then C(~') is JUStthe number of l-bits in that vector. handler. and [115]. sandlot. Therefore.7.l.'. it is easy to compute for each internal node v the number of leaves in u's subtree. to indicate which string the suffix is from Each of the K strings is given a distinct termination symbol. The word "common" here means "occurring with equality". where an O(n) time solution will be presented.Each leaf of the tree represents a suffix from one of the K strings and is marked with one of K unique string identifiers. since there are O(n) edges. Linear-time algorithms for building DAWGs are developed in PO]. the problem can be solved in linear. If we only want to determine whether a pattern occurs in a larger text. I to K. First. For example. Therefore.1: Suffix Iree lor string xyxaxaxawithout suffix linkS shown i . For each internal node u. Clearly.1. Then the I(k) values (without pointers to the strings) are J(k) K 7. . instead of counting the number of leaves below u.~. : . in many applications space is the critical constraint. which is the smallest finite-state machine that can recognize suffixes of a given string.. APL7: BDrLDING A SMALLER DIRECTED GRAPH FOR EXACT MATCHING 129 set of strings.••••••• . That repetition of identifiers is what makes it hard to compute C(v) in O(n) time. where entry k gives /(k) and also points to one of the common substrings of that length. a K-Iength bit vector is created that has a I in bit i if there is a leaf with identifier i in the subtree of u.deleting the subtree of p as shown in Figure 7. It really is amazing that so much information about the contents and substructure of the strings can be extracted To prepare for the O(n) result. In this section we consider how to compress a suffix tree into a directed acyclic graph (DAG) that can be used to solve the exact matching problem (and others) in linear time but that uses less space than the tree. if V(k) is empty < V(k + I) then set V(k) to V(k + 1).

Consider any occurrence of a in T and let y be the suffix of T just to the right of that occurrence of a. Since the graph is a DAG after the first merge..7. to add simple (linear-space) information to the graph so that the locations of all the occurrences can also be recovered when the graph is traversed. It follows that if p and q have the same number of leaves in their subtrees. that the edges into p are directed to q but have their original respective edge-labels. and all these subtrees are isomorphic to each other. for every (labeled) path from p to a leaf in its subtree..2: A directed acyclic graph used to recognize substrings otxyxaxaxa be used to solve the exact matching problem in the same way a suffix tree is used. suppose that the subtrees of p and q are isomorphic.ro while q has path-label 0:. however.. DAG D can be used to determine whether a pattern occurs in a text. PROOF First suppose p has a direct suffix link to q and those two nodes have the same number of leaves in their subtrees. orq into p. the leaf numbers reachable from the end of the path may no longer give the exact starting positions of the OCCUITence~. if fJ is a suffix of a it must be a proper suffix. lal. There are general algorithms for subtree isomorphism but suffix trees have additional structure making isomorphism detection much simpler. But that implies that the subtrees rooted at p and at Ij are nOI isomorphic. as an exercise. that fJ must be a suffix of c Suppose fJ is not a suffix of a. The algorithm matches characters of the pattern against a unique path from the root of the graph.node p can be merged into q if the edge-labeled subtree of p is isomorphic to the edge-labeled subtree of q in the suffix tree. The entire procedure to compact a suffix tree can now be described. in the algorithm. if fJ is a proper suffix of a. which is a contradiction 0 a. q) p to q and the in Q and both p andq are in the current DAG. the pattern occurs somewhere in the text if and only if all the characters of the pattern are matched along the path. Now the numbers of leaves in the subtrees of p and q are assumed to be equal. then all the subtrees below nodes on the path have the same number of leaves. And. by contradiction. so every path out of q is identical to some path out of p. the criteria used to determine which nodes to merge remain tied to the original suffix tree . The "correctness" of the resulting DAG is stated formally in the following theorem. the algorithm must know how to merge nodes in a DAG as well as in a tree. Therefore. However. Clearly then they have the same number of leaves. APLi: IlUtLUlNtj A :SMALLER LlIRECTED GRAPH FOR EXACTMATCHLNG 131 By the same reasoning. and that any part of the graph that is now unreachable from the rool is removed Although the merges generally occur in a DAG. but the graph seems to lose the location(s) where the pattern begins. since the suffix of T starting at i begins with xa only if the suffix of T starting at i + 1 begins with a. We address this issue in Exercise 10. pairs are merged in arbitrary order. Now since fJ is not a suffix of a. Thi . never merging two nodes that have ancestors in the suffix tree that can be merged . if there is a path of suffix links from p to q going through a node lJ. We leave the correctness of this. node p has path-Iabel. As a practical matter it makes sense to merge top-down. and hence the two subtrees are isomorphic. For every leaf numbered i in the subtree of p there is a leaf numbered i + I in the subtree of q. Suffix tree compaction begin Identify the set Q of pairs (p. For the converse side. then by the properties of suffix links. The general merge operation for both trees and DAGs is stated in the following way A merge of node p into node Ij means that all edges out of p are removed.130 HRST APPLICATIONS OF SUFFIX TREES 7. That means thatay is a suffix ofT and there is a path labeled y running from node p to a leaf in the suffix tree. Figure 7. Leta be the path-label of p andfJ be the path-label of q and assume thatldl s. p can be merged into q. and the theorem would be proved. there is an identical path (with the same labeled edges) from q to a leaf in its subtree. It may be surprising that.2. So the key algorithmic issue is how to find isomorphic subtrees in the suffix tree.taslargeasthe number in the subtree of p and no larger than the number in the subtree of q. there is a directed path of suffix links from p to q. It is possible. Moreover. Since there is a suffix link from p to q. no suffix ofT that starts just after an occurrence of fJ can have length b-l. Since f3 "# a. q) such that there is a suffix link from number of leaves in their respective subtrees is equal While there is a pair Merge node pintoq (p.7. and therefore there is no path of length I y I from q to a leaf.issue will be addressed in Exercise 10. only if the two subtrees are isomorphic.thenthenumberofleavesinthesubtreeofumustbeatlea. So we will prove. We will show that there is a directed path of suffix links between p and q. a necessary part of the proof of Theorem 7.

Moreover. we match characters of string T against T. If b is an internal node v of T then the algorithm can follow its suffix link to a node s{v). called the marching statistics problem. I I the way they accelerate the construction ofT in Ukkonen's algorithm To learn ms(l). If v is the root. Section 6. This use of matching statistics will probably be more important than their use to duplicate the preprocessing/search bounds of KnuthMorris-Pratt and Aho-Corasick.ml. there is an occurrence of P starting at position T if and only if ms(i) IFI . who solved a somewhat more general problem. If h is not an internal node. the Knuth-Morris-Pratt (or Boyer-Moore) method preprocesses the pattern in O(n) time and space. But the resulting finite-state machine would not necessarily have the minimum number of states. Still. So the question arises of whether we can solve those problems by building a not the text. Instead of traversing that path by examining every character on it./II]. A DAWG represents a finite-state machine and.. if /IIs(5) = 4.vzqab<:dw then ms( I) i of = 3 and = Clearly. Of course. and hence it would not necessarily be the DAWG for S.ing matching statistics will be given in Section 7. When the end of that fj path is reached. APL8: A reverse role for suffix trees. Matching statistics are also used in a variety of other applications described in the book. proceed as follows to learn ms(i + 1).m].1. and so is as compact as the DAWO even though it may not be a finite-state machine. we will develop a result due to Chang and Lawler [94]. One advertisement we give here is to say that matching statistics are central to a fast approximate matching method designed for rapid database searching.. and hence the path from the root to s(v) matches a prefix of T[i + l.. In those cases it is clearly superior to preprocess the patternts). [71]. each edge label is allowed to have only one character. This is the reverse of the normal use of suffix using suffix trees to achieve exactly the same time and space bounds (preprocessing versus search time and space) as in the Knuth-Morris-Pratt or Aho-Corasick methods. Having learned ms(i).8. That means that the algorithm has located a point b in T such that the path to that point exactly matches a prefix of T[i . The Aho-Corasick method achieves similar bounds for the set matching problem. as such. the search for msti + I) can Start at node I(1') rather than at the root Let fj denote the string between node u and point h. there must be a path labeled fJ out of s(v). Matching statistics: duplicating bounds and reducing space For example. The path-label of v. and major space reduction We have previously shown how suffix trees can be used to solve the exact matching problem with 0(1/1) preprocessing time and space (building a suffix tree of size Oem) for the text T) and O(n + k) search time (where n is the length of the pattern andk is the number of occurrences). by folluwing the unique matching path of T[I .. To explain this. APL8: A REVERSE ROLE FOR SUFFIX TREES.. m]..3) to traver:se it in time proportional to the number of nodes on the path.8. Moreover. the main theoretical feature of the DAWG for a string S is that it is the finite-state machine with the fewest number of slates (nodes) thai recognizes suffixes of S. MAJOR SPACE REDUCTION 133 DAGs versus DAWGs DAG D created by the algorithm is not a DAWG as defined in [70J. Matching statistics lead to space reduction Matching statistics can be used to reduce the size of the suffix tree needed in solutions to problems more complex than exact matching. The length of that matching path is /IIs(I). The first example of space reduction U'. and [lIS]./II1.9. the algorithm uses the skip/count trick (detailed in Ukkonen's algorithm. DAG D for string S has as few (or fewer) nodes and edges than does the associated DAWO for S. D can be converted to a finite-state machine by expanding any edge of D whose label has k characters intuk edges labeled by one character each.3. Hence ml is a string in P matching a substring starting at position i + I of T.. and then searches in Oem) time. then the search for msti + I) begins at the root.132 FIRST APPLICATIONS OF SUFFIX TREES 7. machine that recognizes substrings of a string. Then xafJ is the longest substring in P that matches a substring starting at position i of T. Now suppose in general that the algorithm has just followed a matching path to learn ms(i) for i < Iml. the algorithm continues to match single characters from T against characters in the tree until either a leaf is reached or until . T = abcxabcdex and P= wyabL"'. This will be detailed in Section 12. Since s( u) has path-label a. then the algorithm can back up to the node v just above b. say xe. How to compute matching statistics 7. Therefore. But if v is not the root then the algorithm follows the suffix link from v tos(1'). We have also seen how suffix trees are used to solve the exact set matching and space bounds (n is now the total size of all the patterns In contrast. However. Therefore. but no further matches are possible (possibly because a leaf has been reached).3.both run space (they have to represent the strings). construction of the DAWG for of theoretical interest.so o must be a prefix of T[i + l. Thus the problem offinding the matching statistics is a generalization of the exact matching problem. 7.8. But s(v) has path-label a. the situation sometimes arises that the pattern(s) will be given first and held fixed while the text varies. and space bounds for suffix trees often make their use unattractive compared to the other methods. Thus matching statistics provide one bridge between exact matching methods and problems of approximate string matching. isaprefixofT[i .1. preprocess the pattern .

1. if the search stops at a node u.17.134 RRST APPLICATIONS OF SUFFIX TREES 7. APL9. Now consider the time required by the algorithm.B character comparisons needed to compute ms(i + 1). Using a (precomputed) generalized suffix tree for the STSs (which play the role of P). Then. then msti + I) = O. node u with the leaf number of one of the leaves in its subtree. and then a certain number of additional 7. Correctness and time analysis for matching statistics The proof of correctness of the method is immediate since it merely simulates the naive method for finding each ms(i). and the theorem is proved. This takes time linear in the size of T. In order to accumulate the p(i) values. begin with the character in T that ended the computation for ms(i) or with the next character in T. for each ordered pair Sf. exact matching may not be an appropriate way to find STSs in new sequences.5. The key there is that the after-.. It only remains to consider the total time used in all the character comparisons done in the "after-d' traversals. But since the number of sequencing errors is generally small.9. 7. the pointer p(i) will point to the appropriate STS in the suffix tree. that matches a prefix or Sj is exactly ms(i) places. the desired pO) is the suffix number written at u. We leave it to the reader to flesh out the details. ms(i + 1) is the string-depth of the ending position.9. We next modify the matching statistics algorithm so that it provides that information. APLIO: All-pairs suffix-prefix matching Here we present a more complex use of suffix trees that is interesting in its own right and that will be central in the linear-time superstring approximation algorithm to be discussed in Section 16. first do a depth-first traversal ofT marking each II caliedaslIfjix-prejixmu/chof5" Given a collection of strings S = 51. the all-pairs suffixprefix problem is the problem of finding.8. for each i. s. v». 0 7. It follows that at most 2m comparisons in total are performed during all the after-f comparisons. when using T to find each ms(i). the location of at least one such matching substring. otherwise (when the search stops on an edge (u. Hence the after-f comparisons performed when computing ms(i) and ms(i + I) share at most one character in common. .10. SPACE·EFFlCIENT LONGEST COMMON SUBSTRING ALGORITIIM 135 no further matches are possible. Generally. There it was mentioned that. Using oniy a saffix tree for P can be found in PROOF The search for any ms(i + I) begins by backing up at most one edge from position b to a node u and traversing one suffix link to node ~'(l'). but it does not indicate the location of any such match in P. The analysis is very similar to that done for Ukkonen's algorithm Theorem7.2. A small but important extension The number m.8. . Note tbat when given a new sequence. and r(i + 1) is notin P. Definition Given two strings Si and S.2) we must also know. because of errors. for i ::: I. That takes care of all the work done in finding all the matching statistics. Those regions of agreement should allow the correct identification of the STSs it contains. . APL9: Space-efficient longest common substring algorithm the current depth is increased at each step of any fJ traversal.l'(i) indicates the length of the longest substring starting at position i of T that matches a substring somewhere in P. Sj in S. p(i) is the suffix number written at node u Back toSTSs Recall the discussion of STSs in Section 3. SA of tctallengrh m. 52. Thereisonespeciaicasethatcanariseincomputingms(i+l). In either case.lfms(i} = I orms(i) = 0 (so that the aJgorithmis at the root). From s(~') a Ii path is traversed in time proportional to the number of nodes on it. 7. compute matching statistics for the new DNA sequence (which is T) and the set ofSTSs. Note that the character comparisons done after reaching the end of the fJ path begin either with the same character in T that ended the search for ms(i) or with the next character in T. any suffix of S. we can expect long regions of agreement between a new DNA sequence and any STS it (ideally) contains. the time for the computation is just proportional to the length of the new sequence. the longest suffix-prefix match of SnS}.3.1. depending on whether that search ended with a mismatch or at a leaf.S. For some applications (such as those in Section 9.1.

When the depth-first traversal backs up past a node t'. The key observation is the following: If t' is a node on this path and i is in L(p). that matches a prefix of 5j• So for each index i.5.136 FIRST APPLICATIONS OF SUFFIX TREES 7. and because substring containment is possible among strings of unequal length. we pop the top of any stack whose index is in L(v) is time optimal 1 :1 '. for each i E L(1'). One of the main objectives of the research was to estimate the number of ACRs that occur in the genes of C. assuming (as usual) that the alphabet is fixed. List L(v) contains the index i if and only if v is incident with a terminal edge whose leaf is labeled by a suffix of string Sj.2. so a simple extrapolation would be wrong if the prevalence of ACRs is systematically different in ESTs from frequently or infrequently expressed genes. then there is no over-lap between a suffix of string 5. Following the above observation. the node with path-label a has a list consisting of indices I and2. push t' onto the ith stack. arises in datu compression: this connection will be discussed in the exercises for Chapter 16 A different. the algorithm also builds a list L(v) for each internal node u. label of v is the longest suffix-prefix match of(Si. and focus on the path from the root of 7(5) to the leaf j representing the entire string Sj. Moreover. it was indeed found that ACRs occur more commonly in EST. APLlO: ALL-PAIRS SUFFIX. or how frequently that gene is expressed. The goal is to use the ACR data observed in the ESTs to estimate the number of ACRs in the entire set of genes. it maintains k stacks. Now we return tu the extrapolation problem. Their approach was to extrapolate from the number of ACR5 they observed in the set of ESTs. eiegans. then the path-label of u is a suffix of S. To explain this. As it does. However.400 ESTs (see Section 3. Solving the all-pairs suffix-prefix problem in linear time For asingle pairof strings. To describe the role of suffix-prefix matching in this extrapolation. Using suffix trees it is possible toreduce the computation time to O(m+k2). genes are not uniformly sampled. the heart of that computation consists of solving the all-pairs suffix-prefix problem on the set of ESTs. from frequently expressed genes (more precisely. The main motivation for the all-pairs suffix-prefix problem comes from its use in implementing fast approximation algorithms for the shortest superstring problem (to be discussed in Section \6.andthenodewithpath-labelxa has a list consisting of index 1. we can maintain a linked list of . (Because there may be some sequencing errors. If EST a originates in gene fJ. As 7(5) is constructed. the lists can be constructed in linear time during (or after) the construction of7(5) Now consider a fixed string 5j..1) from the organism C. and many different ESTs can be collected from the same gene fJ. The path. so how can ESTs from frequently sampled genes be distinguished from the others? The approach taken in [190J is to compute the "overlap" between each pair of ESTs Since all the ESTs are of comparable length. For example. consider the generalized suffix tree shown in Figure 6. All the other lists in this example are empty. the authors [190J conclude' The main data structure used to solve the all-pairs suffix-prefix probl em is the genetulized suffix tree 7(5) for the k strings in set 5. That is. How can that prevalence be determined? When an EST is obtained. and it is not easy to tell if two ESTs Originate from the same gene. and hence for the all-pairs suffix-prefix problem. L(t') holds index i if and only if the path label to u is a complete suffix of string 5t. a set of around 1. and a prefix of string 5j. It is not difficult to see that the top of stack i contains the node v that defines the sutlix-prefix match of(S. ESTs are collected more frequently from some genes than others. in the common method used to collect ESTs.·1 constant time per answer D Extensions We note two extensions. 5i)· It is easy to see that by one traversal from the root to lear j we can find the deepest nodes for all I S i Sk (i"# j). the preprocessing discussed in Section 2. We can thus consider ESTs as a biased sampling of the underlying gene sequences. one doesn't know the gene it comes from. Sj). whereas an EST that has substantial overlap with one or more of the other ESTs is considered to be from a frequently expressed gene. The node with path-label ba has an L list consisting of the single index 2.) After categorizing the ESTs in this way.17). However.10. we need to remember some facts about ESTs For the purposes here. we can think of an EST as a sequenced DNA substring of length around 300 nuc1eotides. When a leaf j (representing the entire string 5i) is reached.4 will find the longest suffix-prefix match in time linear in the length of the two strings. ESTs will more frequently be collected from genes that are more highly expressed (transcribed) than from genes that are less frequently expressed. Clearly.10. one for each string. then the actual location of substring a in fJ is essentially random. from ESTs that overlap other ESTs). one should also solve the all-pairs longest common substring problcm. By using double links. one does not learn the identity of the originating gene.lfthe ith stack is empty. and a prefix of Sj. In that research. Another motivation for the shortest superstring problem. direct application of the all-pairs suffix-prefix problem is suggested by computations reported in [190]. applying the preprocessing to each of the k: pairs of strings separately gives a total bound of O(km) time. Commonly.1. The superstring problem is itself motivated by sequencing and mapping problems in DNA that will be discussed in Chapter 16.PREFIX MATCHING 137 Motivation for the problem 7. scan the k stacks and record for each index i the current top of the ith stack. originating in a gene of much greater length.elegansgenes However.firsttraversaLwhenanodevi s reachedin a forward edge traversal. A simple extrapolation would be justified if the ESTs were essentially random samples selected uniformly from the entire setofC.11 (page 117). the algorithm efficiently collects the needed suffixprefix matches by traversing 7(5) in a depth-first manner. Let k' S k2 be the number of ordered pairs of strings that have a nonzero length suffix-prefix match. An EST that has no substantial overlap with another EST was considered in the study to be from an infrequently expressed (and sampled) gene. elegans (which is a worm) were analyzed for the presence of highly conserved substrings called ancientconserved regions (ACRs). Duringthedcpth. the deepest node u on the path to leaf j such that i E L(v) identifies the longest match between a suffix of 5.

tandem arrays of repeated RNA that flank retroviruses (viruses whose primary genetic material is RNA) and facilitate the incorporation of viral DNA (produced from the RNA sequence by reverse transcription) into the host's DNA. and protein). see [253] and [255] Small-scale local repeats whose function or origin is partially understood include: complemented palindromes in both DNA and RNA.138 FIRST APPLICATIONS OF SUFFIX TREES 7. 7.()rvariouskindsofrepeatsarelOo~ommonevenlnli'1. 7. 9. To make this concrete. but to motivate the algorithmic techniques we develop For emphasis. In many texts tfor example. the sentence wa. Ignoring spaces. we briefly introduce some of those repetitive structures. 7. For-example. (In fact.12. Each strand isacomplem8ntedpalindrom8 accordingtothedelinitionsusedinthist:. In that way. report. 9. the entire contents of each scanned stack is read out. repetitive DNA is striking for the variety of repeated structures it contains. RNA.5. and [315]) on genetics or molecular biology one can find extensive discussions of repetitive strings and their hypothesized functional and evolutionary role. as well as several exercises. and more will be discussed later in the book.7. [3 17].000 families of repetitive sequences were identified [5]. elegans.) The motivation for the general topic of repetitive structures in strings comes from several sources.1.6). clustered genes that code for important proteins (such as histone) that regulate chromosome structure and must be made in large number. most of the human Y chromosome consists of repeated substrings.5 it a cat i saw is another example. 9. some aspects of one type of repetitive structure. but our principal interest is in important repetitive structures seen in biological strings (DNA. and overall Families of reiterated sequences account for about one third of the human genome. Repetitive structures in biological strings One of the most striking features of DNA (and to a lesser degree. INTRODUCTION TO REPETITIVE STRUCTURES 139 the nonempty stacks. and for the biological functions that some of the repeats may play (see [394] for one aspect of gene duplication). small-scale repeated strings whose function or origin is at least partially understood.11. The intent is not to write a dissertation on repetitive DNA or protein. This is particularly true of eukaryotes (higher-order organisms whose DNA is enclosed in a cell nucleus). for the various proposed mechanisms explaining the origin and maintenance of repeats.sinceastackthatgoesfromemptytononemptymustbelinkedatoneoftheendsofthe list.1. tandem repeats. prokaryotes (organisms such as bacteria whose DNA is not enclosed in a nucleus) have in total little repetitive DNA. In contrast. similar genes that probably arose through duplication and subsequent mutation (including pseudogenes that have mutated .2.1.6 million bases of DNA from C.6.11. [3 I71 There is a vast' literature on repetitive structures in DNA.?ok. 9. whose function is less clear. Introduction to repetitive structures in molecular strings Several sections of this book (Sections 7. 9. then the complexity for this solution is O(m+k*). We modify the above solution so that when the tops of the stacks are scanned. have already been discussed in the exercises of Chapter I. families of genes that code for similar proteins (hemoglobins and myoglobins for example). sentence or verse reading the same backwards as forwards [441].S Figure 7.£ VDJ~DD:J:JVDJ~ . hence we must also keep (in the stack) the name ofthe string associated with that stack. we divide the structures into three types: local. Note that the position of the stacks in the linked list will vary. simple repeats. 3' . Definition A palilldrome is a string that reads the same backwards as forwards.2.3: A palindrome in the vernacular of molecular biology. For example. which act to regulate DNA transcription (the two parts of the complemented palindrome fold and pair to form a "hairpin loop"). short repeated substrings (both palindromic and nonpalindromic] in DNA that may help the chromosome fold into a more compact structure. hoth local and interspersed.6.11. the Random House dictionary definition of "palindrome" is: a word. the string xyaayx is a palindrome under this definition.12. over 7. and even in protein.2. protein) is the extent to which repeated substrings occur in the genome. repeated substrings at the ends of viral DNA (in a linear state) that allow the concatenation of many copies of the viral DNA (a molecule of this type is called a concatamerr. 9. single copy inverted repeats that flank transposable (movable) DNA in various organisms and that facilitate that movement or the inversion of the DNA orientation. [128] In an analysis of 3. s' TCGACCGGTCGA In the following discussion of repetitive structures in DNA and protein. nested complemented palindromes in tRNA (transfer RNA) that allow the molecule to fold up into a cloverleaf structure by complementary base pairing. [469]. copies of genes that code for important RNAs (rRNAs and tRNAs) that must be produced in large number. only the stacks on this list need be examined. bUI all suffix-prefix matches no matter how long they are. At the other extreme. If the output size is k*. all nonzero length suffix-prefix matches can be found in Otm + k') time. suppose we want to collect for every pair not just the longest suffix-prefix match. The doubJe·str andedstringisthasame after reflection around both the horizontaland vertical midpoints. are devoted to discussing efficient algorithms for finding various types of repetitive structures in strings. and 7. For an introduction to repetitive elements in human DNA. although they still possess certain highly structured small-scale repeats In addition to its sheer quantity. Then when a leaf of the tree is reached during the traversal. and more complex interspersed repeated strings whose function is even more in doubt.

227. These repeats are called satellite DNA (further subdivided into micro and mini-satellite DNA). the string TTAGGG appears at the ends of every human chromosome in arrays containing one to two thousand copies [332]. and whose function and origin is less clear. This folding then apparently facilitates either recognition or cutting by the enzyme. the interior of an Alu string itself consists of repeated substrings of length around 40. also called "direct repeats") of repeated DNA. sixteen imprinted gene alleles have been found to date [48]. other mini-satellite repeats also play a role in human genetic diseases [286]. Uses of repetitive structures in molecular biology \ Ade!.11. Highly dispersed tandem arrays of length-two strings are common. although there is serious opposition to that theory.11. common exons of eukaryotic DNA that may be basic building blocks of many genes.S()()b. Tn addition to tri-nucleotide repeats. INTRODUCTION TO REPETITIVE STRUCTURES 141 to the point that they no longer function). The DNA sequences of these sixteen imprinted genes all share the common feature that 7. A restriction enzyme is an enzyme that recognizes a specific substring in the DNA of both prokaryotes and eukaryales and cuts (or cleaves) the DNA every place where that pattern occurs (exactly where it cuts inside the pattern varies with the pattern). and the rest from the father. The Alu repeats occur about 300. is generally divided into SINEs (short interspersed nuclear sequences) and LINEs (long interspersed nuclear sequences). Some tandem arrays may originate and continue to grow by a postulated mechanism of unequal crossing over in meiosis. The complemented palindromic structure has been postulated to allow the two halves of the complemented palindrome (separated or not) to fold and form complementary pairs. Restriction enzyme cutting sites are interesting examples of repeats because they tend to be complemented palindromic substrings. Kennedy's disease. Moreover. the likelihood that more copies will be added in a single meiosis increases as the number of existing copies increases. where N stands for any nucleotide The enzyme cuts between the last two Ns. For an introduction to Alu repeats see [254J One of the most fascinating discoveries in molecular genetics is a phenomenon called genomic (or gametic) imprinting.000 times in the human genome and account for as much as 5% of the DNA of human and other mammalian genomes Alu repeats are substrings of length around 300 nucleotides and occur as nearly (but not exactly) identical copies widely dispersed throughout the genome. The classic example of a SINE is the Alu family. the number of triples in the repeat increases with successive generations. For example. For example.n nfJOOutl.that chromosomes (other than the Y chromosome) have no memory of the parent they originated from. if inherited from the "incorrect" parent. Those right and left flanking sequences are usually complemented palindromic copies of each other. 391]. whereby a particular allele of a gene is expressed only when it is inherited from one specific parent [48. Because of the palindromic structure of restriction enzyme cutting sites. Moreover. restriction enzyme BgfI recognizes GCCNNNNNGCC.140 FIRST APPLICATIONS OF SUFFIX TREES 7. Tnmice and humans. Repetitive DNA that is interspersed throughout mammalian genomes. the surprising discovery that eukaryotic DNA contains introns (DNA substrings that interrupt the DNA of protein coding regions). ataxia) are now understood to be caused by incre asing numbers of tandem DNA repeats of a string three bases long. myotonic dystrophy. There are hundreds of known restriction enzymes and their use has been absolutely critical in almost all aspects of modern molecular biology and recombinant DNA technology. For example. people have scanned DNA databases looking for common repeats of this form in order to find additional candidates for unknown restriction enzyme cutting sites Simple repeats that are less well understood often arise as tandem arrays (consecutive repeated strings. for which Nobel prizes were awarded in 1993. This is in contradiction to the classic Mendelian rule of equivalence ."e. which appears to explain why the disease increases in severity with each generation. Five of these require inheritance from the mother. These triplet repeats somehow interfere with the proper production of particular proteins. So the Alu repeats wonderfully illustrate various kinds of phenomena that occur in repetitive DNA. structured.1 leng. Huntington's disease. A number of genetic diseases (Fragile X syndrome. With unequal crossing over in meiosis. or expressed differently. !Iie genes studied 148] have a tot. and common functional or structural subunits in protein (motifs and domains) Restriction enzyme cutting sites illustrate another type of small-scale.2. the restriction enzyme EcrJRI recognizes the complemented palindrome GAATTC and cuts between the G and the adjoining A (the substring TTC when reversed and complemented is GAA)_ Other restriction enzymes recognize separated (or interruptedi complemented palindromes. and the Alu sequence is often flanked on either side by tandem repeats of length 7-10. was closely coupled with the discovery and use of restriction enzymes in the late 1970s. Other long tandem arrays consisting of short strings are very common and arc widely distributed in the genomes of mammals. and their existence has been heavily exploited in genetic mapping and forensics.ilno!containedin!hisqlloteistha!!h<direc!itandemJrep<at'.h . For example. The allele will be unexpressed. Sometimes the required parent is the mother and sometimes the father. and that the same allele inherited from either parent will have the same effect on the child. repeating substring of great importance to molecular biology.

it is represented only once in R·(S). But what is a distinguishing feature between human and hamster DNA? One approach exploits the Alu sequences. hashing substring locations into the precomputed table. but their existence can also facilitate certain cloning. In this Iight. In this cell. may be of future use in computational biology. and yet it gives a good reflection of the maximal pairs Tn some applications.16). if a string consists of n copies of the same character. Alu sequences specific to human DNA are so common in the human genome that most fragments of human D~A longer than 20. substring a is .000 bases will contain an Alu sequence [317]. Accordingly. both strings abc and abcy are maximal repeats. and 9. Techniques that handle more liberal errors will be considered later in the book. Other ways of defining and studying repetitive structures are addressed in Exercises 56. The same idea is used to isolate human oncogenes (modified growth-promoting genes that facilitate certain cancers) from human tumors.~ Admittedly.5. Each resulting hybrid-hamster cell incorporates different parts of the human DNA. A recurring objection is that the first repetitive string problems we consider concern exact repeats (although with complementation and inversion allowed). Other poor definitions may not capture the structures of interest. makes certain kinds of large-scale DNA sequencing more difficult (see Sections 16. APLll: Finding all maximal repetitive structures in linear time Before developing algorithms for finding repetitive structures. Cells that receive the fragment of human DNA containing the oncogene become transformed and replicate faster than cells that do not. For example. such as Alus. Those techniques pass the "plausibility" test in that they. Algorithmic problems on repeated structures We consider specific problems concerning repeated structures in strings in several sections of the book.6. one then has to identify the parts of the hamster's hybrid genome that are human. an algorithm searching for all pairs of identical substrings (an initially reasonable definition of a repetitive structure) would output toJ(n4) pairs. the definition of a maxima! repeat does not properly model the desired notion of a repetitive structure. and in Sections 9.12. 57. A poor definition may lead to an avalanche of output. if one seeks repeating DNA of length ten. Poor definitions are particularly confusing when dealing with the set of all repeats of a particular type. or they may make reasoning about those structures difficult. we must carefully define those structures. in S = aabXCiyaab.12. with S as above. For example. In this section. it makes sense to first build a table of all the 411l possible strings and then scan the target DNA with a length-ten template. Some techniques for handling inexact palindromes (complemented or not) and inexact repeats will be considered in Sections 9. the fragments of human DNA in the hybrid can be identified by probing the hybrid for fragments of Alu. Hence IR'(S)I is less than or equal to IR(S)I and is generally much smaller. For example.142 ARST APPLICATIONS OF SUFFIX TREES 7. For example. the key problem is to define repetitive structures in a way that does nOI generate overwhelming output and yet captures all the meaningful phenomena in a clear way. This isolates the human DNA fragment containing the from the other human hut then the human DNA has to be 7. 9. whereas most cases of repetitive DNA involve nearly identical copies. Therefore. an undesirable result.5 and 9. or the ideas that underlie them. and 58 in this chapter.weaowconsiderproblemsconcemingexactlyrepeatedsubstringsina single string For example. not every repetitive string problem that we will discuss is perfectly motivated by a biological problem or phenomenon known today.6.1. in exercises in other chapters. Note that no matter how many times a string participates in a maximal pair in S. APLI I: FINDING ALL MAXIMAL REPETITIVE STRUCTURES 143 The existence of highly repetitive DNA. one general approach to low-resolution physical mapping (finding on a true physical scale where features of interest are located in the genome) or to finding genes causing diseases involves inserting pieces of human DNA that may contain a feature of interest into the hamster genome. Despite these objections. The output is more modest. Another objection is that simple techniques suffice for small-length repeats. Fragments of human DNA from the tumor are first transferred to mouse cells. the fit of the computational problems we will discuss to biological phenomena is good enough to motivate sophisticated techniques for handling exact or nearly exact repetitions. mapping. This technique is called somatic cell hybridization. we address the issue through various notions of maximality. and these hybrid cells can be tested to identify a specific cell containing the human feature of interest. and searching efforts.6.11 and 16.

A linear-time algorithm to find all maximal repeats The maximal repeats can be compactly represented o its children are left diverse. So suppose that the two occurrences are xap and yap. based essentially on suffix trees. APLl I: FINDING ALL MAXIMAL REPETITIVE STRUCTURES 145 L' a maximal repeat but so is aab. Problems related 10 palindromes and tandem repeats are considered in several sections throughout the book. but we will not use Ihal .12.molimc.1..12. sou must be part of a maximal pair and hence a must be a maximal repeat. maximal repeats.2. and supennaximal repeats Suppose first that v is left diverse. is also described in that paper.erminology . Hence u must be left diverse 7.12. and supermaximal repeats are only three possible ways to define exact repetitive structures of interest. Conversely. so are all of its ancestors in the tree Theorem 7. Note that being left diverse is a property that propagates upward. referTed lo as a ompact trie. since the larger substring aob that sometimes contains a may be more informative. Certain kinds of repeats are elegantly represented in graphic alformin a device called a landscape [104J. By definition. Lemma 7. we have 5 This kind of lree is . if a is a maximal repeat then it participates in a maximal pair and there must be occurrences of cc that have distinct left characters.12.12. then a is a maximal repeat and the theorem is proved.l summary. Inexact repeats will be considered in Sections 9. maximal repeats. 0 PROOF i Theorem 7.r then it participates in a maximal pair with string yap. Either way. That means there are substrings xa in S. If the second substring is followed by any character but p.. where x and y represent different characters. In the next sections we detail how to efficiently find all maximal pairs. a cannot be preceded by both . An efficient program to construct the landscape. It may not always be desirable to report Of as a repetitive structure. There can he at most n maximal repeats in any string of length n Since T has Il leaves. which is a superstring of string IX. T can have at most n internal nodes.although not every occurrence of C( is contained in that superstring.144 FIRST APPLlCATIONS OF SUFFIX TREES 7. a leaf cannot be leftdiverse ' j .1 then implies the theorem.1 would be a trivial fact if at most one substring starting at any position could be Definilion A node u ofT is called left diverse ifat least two leaves in e's subtree have differentleftcharacters. The subtree of T from the root down to the frontier nodes Theorem 7.. and each internal node other than the root must have at least two children.1. Let the first substring be by character p. If this occurrence of O!q is preceded by character . and if it is preceded by y then it participates in a maximal pair with seep. y Finding left diverse nodes in linear time :. But since v is a (branching) node there must also be a substring oc in S for some character q that is different from p. The string and onlv if v ls Ieft diverse. If a node diverse. Other models of exact repeats are given in the exercises.12. PROOF O! is left labeling the path to a node v ofT is a maximal repeat if Maximal pairs.1.r and y.5 and 9_6.

7. Since w is a leaf. That degree of each nearsupermaximal repeat can also be computed in linear time. We establish here efficient criteria to find all the superrnaximal repeats in a string S.12. Since there can be more than O(n) of them. then Xlt occurs only once in S. where a is the string labeling the path to u. the running time of the algorithm will be stated in terms of the size of the output. AI/the maximal repeats in S cun be found in 0(/1) time. In detail.r . lf w is a leaf then w specifies a single particular occurrence of substring Ii = ay. Now traverse the tree from bottom up. Finding supermaximaJ repeats in linear time Recall that a supermaximal repeat is a maximal repeat that is not a substring of any other maximal repeal.2. the occurrence of a at i is contained in a maximal repeat. substring a is a maximal repeat but not a supermeximal ora near-supermaximal repeat. Now let y be the first character on the edge from u to w. Such an occurrence of a is said to witness the near-supermaximalny of 1)/.12. If we only want to count the number of PROOf If there is another occurrence of U' with a preceding character .146 FIRST APPLICATIONS OF SUFFIX TREES 7. so O(k) time is used in those operations. we can state 7.2. but it is near-superrnaximal. Definition A substring a of S is a near-supermaximal repeat if IX is a maximal repeat in S that occurs at least once in a location where it is not contained in another maximal repeat.r Letting n denote the length of S. a supermaximal repeat It is a maximal repeat in which every occurrence of a is a witness to its near-supermaximalny. Note that it is not true that the set of near-supermaximal repeats is the set of maximal repeats that are not supermaxunal repeats The suffix tree T for S will be used to locate the near-supermaximal and the supermaximal repeats.12. The creation of the suffix tree. For each leaf specifying a suffix i. then the method works in O(n + k) time. the list at u indexed by x is just the list of leaf numbers below u that specify suffixes in S that are immediately preceded by character . and so witnesses the near-supermaximaluy of «. First. At the start of the visit to node v. To do this. we can define the degree of near-supermaximality of It as the fraction of occurrences of It that witness its near-supermaximality.r .re cord its left character SU . is not contained in a maximal repeat. c).Inthat case. and let w (possibly a leaf) be one of e's children. We now consider that case pairs containing the string that labels the path to v. whereas m aabxavaob. Moreover. If there are k maximal pairs. APLll: ANDING ALL MAXIMAL REPETITIVE STRUCTURES 147 Theorem 7.r contains all the starting positions of substrings in S that match the string on the path to v and that have the left character . Therefore. P2. 0 In summary. To create the list for character x at node v.3. If there is no other occurrence of a with a preceding x . it is easy to create (but not keep) these lists in O(I!) total time. and all the list linking take O(n) time. co' occurs only once in S. then Xlt occurs twiceandsoiseitheramaximalrepeatoriscontainedinone. which is preceded . The leaves in the subtree of T rooted at w identify the lccaTherefore. in the string uabxCtyaabxltb. we solve the more general problem of finding near-supermaximai repeats. its bottom up traversal. Thus no occurrence of 01 in L(w) can witness the near-supermaximality of It unless UJ is a leaf. The proof of this is essentially the same as the proof of Theorem 7. in the tree.12. the algorithm forms the Cartesian product of the list for x at u' with the union of every list for a character other than x at a child of u other than s'. The second occurrence of a witnesses that fact With this terminology. visiting each node in the tree. For each character x and each child v' of u. Finding all the maximal pairs in linear time We now turn to the question of finding all the maximal pairs. build a suffix tree for S.I). all supermaximal and near-supermaximal repeats can be identified in linear time. That is.12. before u's lists have been created. and a tree representation for them can be constructed from suffix tree T in O(n) lime as well by x and succeeded by y. substring It is again not supermaxirnal. The list at v indexed by . the occurrence of a starting at i. Any pair in this list gives the starting positions of a maximal pair for string It. Each list is indexed by a left character . Let v be a node corresponding to a maximal repeat 01.3.r . The algorithm is an extension of the method given earlier to find all maximal repeats. For example. the algorithm can output all maximal pairs (PI. Each operation used in a Cartesian product produces a maximal pair not produced anywhere else.

) Next. but interpret it to be lexically greater than any character in the alphabet used for S. the suffix starting at position Pos (I) of T is the lexically smallest suffix.more space reduction In Section 6. and in general suffix Pos(i) of T is lexically smaller than suffixPos(i + 1). the suffix tree for a string of length m either requires (-)(mlI:l) space or the minimum of O(mlogm) and O(m log II:I) time.3) where a fixed string T will be searched many times. we have the following theorem: linearization of the circular string. Similarly.14. many of the ideas were anticipated in the biological literature by Martinez [310J After defining suffix arrays we show how to convert a suffix tree to a suffix array in linear time.1. 7. Such a depth will always be reached (with the proof left to the reader).2. As usual. we saw that when alphabet size is included in the time and space bounds. Simply stop the bottom-up traversal at any node whose string-depth falls below that minimum. Otherwise. If I = 0 or I = n + I. so this problem arises in the "inner loop" of more complex chemical retrieval and comparison problems A natural choice for canonicallinear string is the one that is lexically least.7.14.13. but now we interpret it to be lexically less than any other character in the alphabet.5. For these reasons. The correctness of this solution is easy to establish and is left as unexercise This method runs in linear time and is therefore time optimal. If I -c I ::: n. APL13: Suffix arrays . or if we assume that up to 11:1character comparisons COS! only constant time. the key issues are the time needed for the search and the space used by the fixed data structure representing T. Each such molecule is represented by a circular string of chemical characters: to allow faster lookup and comparisons of molecules. As an example of a suffix array. then the algorithm can be modified to run in O(n + kml time.10. 8. and build the suffix tree T for LL. 3. In summary. the goal is to find a spacefor T be held fixed and that facilitates representation is not so critical.1. Solution via suffix trees Arbirrurily cur the circular string S. then rhe suffix array Pcsis Ll . . Manoer and Myers [308J proposed a new data structure. then cut the circular string between character n and character 1. Figure 7. APL12: Circular string linearization Recall the definition of a circular string 5 given in Exercise 2 of Chapter 1 (page 11). Each leaf in the subtree of this point gives a cutting point yielding the same linear string.5. the traversal follows the edge whose first character is lexically smallest over all first characters on edges out of the node. where km is the number of maximal pairs of length at least the required minimum.13. searching for a pattern P of length II using a suffix tree can be done in O(n) time only ifH(ml1:l) space is used for the tree. the search takes the minimum of O(lllogm) and O(nlog 11:1) comparisons. then the algorithm can be modified to run in O(n) time. As usual. The space used during the preprocessing of T is of less concern. traverse tree T with the rule that. alphabet The circular string linearization problem for a circular string S of n characters is the Choose a place to cut S so that the resulting linear string is the lexically possible linear strings created by cutting S This problem arises in chemical data bases for circular molecules. while P Therefore. APL13: SUFFIX ARRAYS -MORE SPACE REDUCTION 149 maximal pairs. then cutting 5 between characters I . In the context of the substring problem (see Section 7. This is in contrast to its interpretation ill the previous section. A single circular molecule may itself be a part of a more complex molecule. a suffix tree may require too much space to be practical in some applications Hence a more space efficient approach is desired that still retains most of the advantages of searching with a suffix tree.creatingthe string LL. It is important to be clear on the setting of the problem.4 lists the eleven suffixes in lexicographic order.9. String T will be held fixed for a long time. that string can be found in O{n) lime more formal notion of a suffix array basic algorithms for building it were developed in [308J. The characters of S are initially numbered sequentially from I to n starting at an arbitrary polnt in S. that space efficient and yet can be used to solve the exact matching problem or the 7. if T is mississippi. (Intuitively. called a suffix array.I and I creates a lexically smallest That is. at every node. Then. If only maximal pairs of a certain minimum length are requested (this would be the typical case in many applications). although it should still be "reasonable". A different linear-time method wish a smaller constant was given by Shiloach [404].148 FIRST APPLICATIONS OF SUFFIX TREES 7. In the exercises we consider a more space efficient way to build the representation itself 7. Any leaf I in the subtree at that point can be used to cut the string. With suffix trees. we will affix a terminal symbol $ to the end of S.4. one wants to store each circular string by a canonical linear string. the purpose of doubling L is to allow efficient consideration of strings that begin with a suffix of L and end with a prefix of L.1. affix the terminal symbol $ at the end of LL. giving a linear string L.double L. This traversal continues until the traversed path has string-depth n.6.

Every time the search must choose an edge out of a node v to traverse. then the method can be improved with the two tricks described in the next two subsections.2. the space required by suffix arrays is modestfor a string of length m. Therefore. the array can be stored in exactly m computer words. and 55 develop an alternative. so we have the following' Theorem 7. More generally. the suffix tree is discarded 7. assuming a word size of at least logm bits. then the places where P occurs in T must be in the second half of Pos. the true behavior of the algorithm depends on how many long prefixes of P occur in T.6.14. suffix Pos(fm/2l)) In that case. Suffix tree to suffix array in linear time We assume that sufficient space is available to build a suffix tree for T (this is done once during a preprocessing phase). 7. Using binary search. 1. APLl3: SUFFIX ARRAYS . more space efficient (but slower) method. Similarly. 6.3. hence Since no two edges out of v have labels beginning with the same character. This ordering implies that the path from the root ofT following the lexically smallest edge out of each encountered node leads to a leaf ofT representing the lexically smallest suffix of T. . the suffix tree for T = tartaris shown in Figure 7.14. this bound is independent of the alphabet size. defining the values of array Pos As an implementation detail.1 For example. the substring problem is solved by using suffix arrays as efficiently as by using suffix trees. Once the suffix array is built. 3.MORE SPACE REDUcnON 151 Flgure7. Suffix array Pos is therefore just the ordered list of suffix numbers encountered at the leaves of Tduring the lexical depth-first search. In more detail. How to search for a pattern using a suffix array The suffix array for string T allows a very simple algorithm to lind all occurrences of any pattern P in T. For example.4) occurrences of P in T simply do binary search over the suffix array. In "random" strings (even on large alphabets) this method should run in O(n + log m) expected time. Instead. and the traversal also takes only linear time. Similarly.150 ARST APPLICATIONS OF SUFFIX TREES 7.5. we convert the suffix tree to the more space efficient suffix array.e. Since for most problems of interest \og2 m is O(n). The suffix tree for T is constructed in linear time. one can find the largest index i' with that property Then pattern P occurs in T starting at every location given by Pos(i) through Pos(i') The lexical comparison of P to any suffix takes time proportional to the length of the common prefix of those two strings.4. When augmented with an additional 2m the suffix array can be used to find all the occurrences in of a pattern P in O(n + log2 m) single-character comparison and bookkeeping operations. In cases where many long prefixes of P do occur in T.14. 2. The lexical depth-first traversal visits the nodes in the order5. 54. one can therefore find the smallest index i in POol(if any) such that P exactly matches the first 11 characters of suffix PasU).5.5: The lex. there is a strict lexical ordering of the edges out of v. page 116) then the overhead to do a lexical depth-first search is the same as for any depth-first search. it simply picks the next edge on u's linked list. If very few long prefixes of P occur in T then it will rarely happen that a specific lexical comparison actually takes 8(n} time and generally the 0(11 log m) bound is quite pessimistic.. That prefix has length at most 11. but that the suffix tree cannot be kept intact to be used in the (many) subsequent searches for patterns in T. Figure 7.4: The eleven suffixes of mississippilisled suffixes define the suffix array Pos in lexicographic order Thesta rtingpositionsofthose Notice that the suffix array holds only integers and hence contains no information about the alphabet used in string T. Exercises 53.1. The suffix array Pos for o string T of length m can be constructed in O(m) time. if P is lexically greater than suffix Pos(lm/2l). suppose that P is lexically less than the suffix in the middle position of Pos (i. The key is that if P occurs in T then all the locations (If those occurrences will be grouped consecutively in Pos. a depth-first traversal ofT that traverses the edges out of each node v in their lexical orderw ill encounter the leaves ofT in the lexical order of the suffixes they represent. Moreover.14. the first place in Pos that contains a position where P occurs in T must be in the first half of POS. P at locations 2 and 5.1. which are indeed adjacent in Pos (see Figure 7. Of course. for building a suffix array A suffix array for T can be obtained from the suffix tree T for T by performing a "lexical" depth-first traversal ofT.cal depth-first traversal of the suHix tree vi sits the leaves in order 5. if the branches out of each node of the tree are organized in a sorted linked list (as discussed in Section 6.4.2.

as before. R) that arises during the execution of the binary search. Then in each iteration of the binary search. then compare P to suffix Pos(MJ starting from position mlr + I "" I + I = r + I. by simple case analysis it is easy to verify that neither I nor r ever decrease during the binary search. At the start of the search. whcn T "" mississippi.14. r) of P. r) The value mlr can be used to accelerate the lexical comparison of P and suffix Pos (M) Since array Pus gives the lexical ordering of the suffixes of T. the lexical comparison of P and suffix Pos(M) can begin from position mlr + 1 of the two strings.J__U\ -----"l 7. Also.r .every iteration of the binary search terminates the search. examines no characters of P. M. the first mlr characters of suffix PosU) must be the same as the first mlr characters of suffix Po. if I := r. suffix Pos(3) is isslppi. O(n + logm).3. Then there are k . or ends after the first mismatch occurs in that iteration. In the two cases (/ = r or Lcp (L. rather than starting fromthetirstposition Maintaining mlr during the binary search adds little additional overhead to the algorithm but avoids many redundant comparisons. but the next character in P has not been. L equals 1 and R equals m. let us assume without loss of generality that I > r. and mlr. A simple accelerant As the binary search proceeds. Lett and r denote those two prefix lengths. The desired time bound. Hence at the start of any character max(l. and so Lep(3. M) "" [ > r) where the algorithm examines a character during the iteration. Thus before.14. if i is any index between Land R. the worst-case time for this revised method is still f g t. let L and R denote the left and right boundaries of the "current search interval".I (either I or r is changed to that value). R) for each triple (L. all characters I to the maximum of I and r will have already been examined.MORE Icp(L. 4) is four (see Figure 7. The usc of mlr alone does not Since mlr is the minimum of I and r . Later we will show how to compute the particular Lcp values needed by the binary search during the preprocessing of T How to use Lcp + log 1/1) values Simplest case In any iteration of the binary search.14. Thus no more than log! m redundant comparisons are done overall.4.I'L-- UUUU f 'I ------".M) SPACE REDUCTION 153 7. There are at most n nonredundant comparisons of characters . At the start. Then PROOF First. That means at most one redundant comparison per iteration is done. Suppose there are k characters of P examined in that iteration. respectively. M) and Lcp(M. The search algorithm keeps track of the longest prefixes of POs (Ll and Pas (R) that match a prefix of P. Therefore. whenL "" 1 andR "" m. For now. and at the end of the iteration increases by k .~(L) and hence of P.1 matches during the iteration. General case When I there are three subcases: =I r. APL13: SUfFIX ARRAYS . A super-accelerant Call an examination of a character in P redundant if that character has been examined The goal of the acceleration is to reduce the number of redundant character examinations to at most one periteration of the binary search-hence O(logm) in all. suffix Pas (4) is ississtppi.152 FIRST APPLICATIONS OF SUFFIX TREES 7.4). whenever I =I r. explicitly compare P to suffix Pos(i) and suffix Pos(m) to find t . the comparisons start with character max(l. and letmlr = min(l. follows immediately. a query is made at location M = r(R + LJ/ll of Pos. the corresponding change of lor r the search algorithm does at most Ocn For example. we assume that these values can be obtained in constant time when needed and show how they help the search. the algorithm uses Lcp (L. r) of P may have already been examined. However. To speed up the search.

It is therefore plausible that those values can he accumulated during the O(m)-time preprocessing of T.5 7.3 3. i + 2). then by the Lemma. depth-first traversal of T used to construct the suffix array. lexical. The leaf values can be accumulated during the linear-time. LepU. with large "alphabets" (using some computer representation of the Chinese pictograms.1 involves parsing it. A large part of the motivation for suffix arrays comes from problems that arise in using suffix trees when the underlying alphabet is large.14. The remaining values are computed from the leaf values in linear time by a bottom-up traversal of B. also arise in computational problems in molecular biology. Now by transitivity. So it is natural to ask where large alphabets occur. String and substring matching problems where the alphabet contains numbers. That string-depth is 3.3 fori froml to S. First.S.6 6.) must be at least as large as the smallest Lcp(k. the O(n + logm)-time string and substring matching algorithm using a suffix array must precompute the 2m . How to obtain the Lcp values The Lcp values needed to accelerate searches are precomputed in the preprocessing phase during the creation of the suffix array. giving a total bound of n +Iogm comparisons.8 Flgu~7. the node labels specify the endpoints (L. Combined with the observation in the first paragraph. the values Lep(i. 0. Therefore.4 4. a map may display the sites of many different patterns of interest (whether or not they are restriction enzyme sites). So there are only O(m) Lep values that need be precomputed.2. More generally. Lep (i.6. the restriction enzyme map for that single enzyme is represented as a string consisting of a sequence of integers specifying the distances between successive enzyme sites. k + I) 2::.14.6) is the string-depth of the parent of leaves 4 and 1. and it is left to the reader.1 Lcp values associated with the nodes of binary tree B. for every k between i and j. A restriction enzyme map for a single enzyme specifies the locations in a DNA string where copies of a certain substring (a restriction enzyme recognition site) occurs. R) of all the possible search intervals thai could arise in the binary search of an ordered list of length m.up(i. such as Chinese. One example is the map matching problem.8 1. APL1J: SUFFIX ARRAYS-MORE SPACEREDUCTION ISS '.2 2. One simple example is a string that comes from a picture where each character in the string gives the color or gray leve1 ofa pixel. j) By the properties of lexical ordering. i + I)fori from I tom -1 are easily accumulated in O(m) time. Considered as a string. 1.8 Ifwe assume that the Siring-depths of the nodes are known (these can be accumulated in linear time). Where do large alphabet problems arise'? '1. lcp(k. Lep(5. All the other work in the algorithm can clearly be done in time proportional to these comparisons.j)is 4. the lemma is proved. but how exactly? In the next lemma we show that the Lep values at the leaves of B are easy to accumulate during the lexical depth-first traversal ofT.14.0. i + l ) are 2. The hardest part of Lemma 7 .I. Once done. k + I) for k from i to j . and where P and T are large. resulting in the following and all of P. IcpU. We first consider how many possible Lcp values 7. suffix Pos(k) must also have that common prefix. Extending this observation. 0 Given Lemma 7. In summary. This clearly takes just O(m) time. Each such site may be separated from the next one by many thousands of bases. Since B is a binary tree with m leaves.14. setting the Lcp value at any node u to the minimum of the lep values of its two children. i+2)mustbeatleastaslargea~theminimumofLep(i.8 I. B has 2m .) However. so the string (map) .I nodes in total. 0 7.since the parent of4 and I is labeled with the string tar. most large alphabets of interest to us arise because the string contains numbers. j) for every k between i and j . each integer is a character of a (huge) underlying alphabet. For example. the remaining Lcp values for B can be found by working up from the leaves.5 (page 151). consider again the suffix tree shown in Figure 7. Hence. there are natural languages. The rest of the Lcp values are easy to accumulate because of the following lemma: 4.154 FIRST APPLICATIONS OF SUFFIX TREES 7.7: Binary tree B represenling Inallstofleng!hm=8 all the possible search intervals in any execution 0fbinarysearch PROOF Suffix PosU) and Suffix Pos(j) of T have a common prefix of length lepU. each of which is treated as a character.J.14. The values of up(i. the proof is immediate from properties of suffix trees. Essentially. i + I) and LepU + I.

Third.612 fragments (strings) totaling 2. the variety of known patterns of interest is itself large (see [435]). APLl5: A BOYER-MOORE APPROACH TO EXACT SET MATCHING 157 consists of characters from a finite alphabet (representing the know n patterns of interest) altemating with integers giving the distances between such sites.6.032. This provides one motivation for using suffix arrays for matching or substring searching in place uf suffix trees.). The vector sequences are kept in a generalized suffix tree. In fragment assembly. the underlying alphabet is huge. where a small number of errors in a match arc allowed. or when sites are missing or spuriously added. generalized suffix trees and suffix arrays are now being used as the central data structures in three genome-scale projects. are used in the search for biologically significant patterns '1Il the obtained Arahidopsis sequences. To compare the speed and accuracy of the suffix tree methods to pure dynamic programming methods for overlap detection (discussed in Section 11. which requires solving a variant of the suffix-prefix matching problem (allowing some errors) for all pairs of strings in a large set (see Section 16. The fragment sequences are kept in an expanding generalized suffix tree for this purpose. each sequenced fragment is checked to catch any contamination by known vector sequences. one major bottleneck is overlap detection. Efficiency is critical 7.1). 64. 65J. each of length about 400 bases.it examines all characters of the text. An approach that permits 7. APLl4: Suffix trees in genome-scale projects Suffix trees.16. Moreover. Since each map is represented as an alternating string of characters and integers.15. Since the Beyer-Moore algorithm for a single string is far more efficient in practice than Knuth-Morris-Prau.1 for a discussion ufESTsand Chapter 16 for a discussion of mapping). Since the project will sequence about 36. the efficiency of the searches for duplicates and for contamination is important.740 bases.000 times speedup over the (slightly) more accurate dynamic programming approach. one would like to have a Beyer-Moore type algorithm for the exact set matching problem. the needed overlaps were computed in about fifteen minutes.1.156 FIRST APPLICATIONS OF SUFFIX TREES 7. suffix trees to speed up regular expression pattern matching (with errors) is discussed in Section 12. That problem.10.15. and generalized suffix trees are used to accelerate regular expression pattern matching. So in that project. First. The test established that the suffix tree approach gives a 1. a method fur the exact set matching problem that typically examines only a sublinear portion of T. Second.) Borrelia burgdorferi Borrelia is the bacterium causing Lyme disease Its genome is about one million bases and is currently being sequenced at the Brookhaven National Laboratory using a directed sequencing approach to fill in gaps after an initial shotgun sequencing phase (see Section 16." at the Michigan State University and the University of Minnesota is initially creating an EST map of the Arabidopsis genome {see Section 3. each new sequenced fragment is checked against fragments already sequenced to find duplicate sequences or regions ufhigh similarity.4 Yeast Suffix trees are also the central data structure in genome-scale analysis of Saccharomyces cerevisiae (brewer's yeast). Chen and Skiena llOOJ developed methods based on suffix trees and suffix arrays tu solve the fragment assembly problem for this project. If both the new and the previously studied DNA are fully sequenced and put in a database. the following matching problem on maps arises and translates to an matching problem on strings with large alphabets Given an established (restriction enzyme) map for a large DNA string and a map from a smaller string. structure prediction.000 fragments. and since distances are often known with high precision. (See Section 16. then the issue of previous work or locations would be solved by exact string matching. APL15: A Boyer-Moore approach to exact set matching The Boyer-Moore algorithm for exact matching (single pattern) will often make long shifts of the pattern. the problems become more difficult in the presence of errors.maps are easier and cheaper to obtain than sequences. done at the Max-Plank Institute [320]. Chen and Skiena closely examined cosrmd-sized data. Knuth-Morris-Pratt examines all characters in the text in order to find all occurrences of the pattern. called map alignment. It often happens that a DNA substring is obtained and studied without knowing where that DNA is located in the genome or whether that substring has been previously researched. as discussed in Section 7. finding 99% of the significant overlaps found by using dynamic programing. In that project generalized suffix trees are used in several ways [63. examining only a small percentage of all the characters in the text In contrast. The Borrelia work [100] consisted of 4. But most DNA substrings that are studied are not fully sequenced .14).5.16.15. is discussedin Secnon 16. Suffix trees are "particularly suitable for finding substring patterns in sequence databases" [320].4 and 16. Consequently. suffix tree. highly optimized suffix trees called hashed position trees are used to solve problems of "clustering sequence data into evolutionary related prot ein families. Patterns of interest are often represented as regular expressions. In the case of exact set matching. Of course.15 for a discussion of fragment assembly. and fragment assembly" [320J.5. the Aho-Coraxick algorithm is analogous to KnuthMorris-Pratt. No known simple algorithm achieves this goal and also has a linear worst-case running . Arabidopsis thaliana AnArahidopsis thaliana genome project. the numbers are not rounded off. when the integers in the strings may not be exact. that is. The alpbabet is huge because the range of integers is huge. determine if the smaller string is a substring of the larger one. Using suffix trees and suffix arrays.

. In the worst case. for such a shift could miss occurrences of P in'T 7. then increase ito j .8: No further match is p<:lssible at posilion j j-l . using rules that are analogous to the bad character and good suffix rules of Beyer-Moore. patterns and m is the size of the text. The problem arises because of patterns in P that are smaller than P.g: SMt when the bad characle. Suppose the test matches some path in K' against the characters from i down to j < i in T but cannot extend the path to match character T(j -1).nl. Of course...a simple keyword tree (without back pointers) together with a suffix tree. •. If a leaf of K' is reached before the left end of T is reached.1. We will not describe the Commentz-Walter algorithm but instead use suffix trees to achieve the same result more simply For simplicity of exposition. The search Recall that in the Beyer-Moore algorithm. matching characters on the path with characters in T. These high-level features of Beyer-Moore will also hold in the algorithm we present for exact set matching In the case of multiple patterns. i should not be set larger than j .. The search again chooses increasing values of index i and determines for each chosen i whether there is a pattern in set P ending at position i of text T.) The above generalization of the bad character rule from the two-string case is not quite correct. we will first describe a solution that uses two trees .8 and 7. then when the right end of every pattern in P is aligned with position i 1 of T. but the resulting computation will be inefficient. P. when the end of the pattern is placed against a position i in T. starting with T(i) and moving right to left as in Beyer-Moore.16. simply follow a path from the root of K'. However.---'-' figurel.9. The preprocessing time needed to build K/ is only O(n).---' figurel.1 + IPminl._. character j . then The first test begins with position i equal to the length of the smallest pattern in P. P. the length of T When the test for a specific position i is finished. if i! exists. (See Figures 7. index i is increased at each iteration. set j "" i + 1 before applying the shift rule.-:.16. because no backpointers are needed.I + IPrn. no occurrences of any pattern will be missed. p.1 + IPm. following as far as possible a matching path from the root.~.e. the search is carried out on a simple keyword tree K' (without backpointerx) that encodes the patterns in P'. i.. it will take (-<)(nm) time. the bad character rule fora set of patterns is: If i! minimum 7. the preprocessing is particularly simple. the algorithm increases i and returns to the root of K' to begin another test. the algorithm preprocesses the set of patterns and then uses the result of the preprocessing to accelerate the search.1 of T will be opposite a matching character in string P from P.:~~'ni~th~c~u~t:~:~P:~~i:~d~tb~~i~:n~~hing a leaf and cannot be extended. _--'-P3~ _ __.2."i" (with its left end opposite a position before j + 1 in T) will be missed. for any specific position i in T. After understanding the ideas. the comparison of individual characters proceeds right to left. The difficult work is done by the suffix tree. The entire algorithm ends when i is set larger than m. The more efficient algorithm will increase i by more than one whenever possible. But by how much should i be increased? Increasing i by one is analogous to the naive algorithm for exact matching.: P. the total length of all the patterns in P. otherwise increase of il andj . ~:~:~. (There is a special case to consider if the test fails on the first comparison.: ••• •••••. and let P' be the set of strings obtained by reversing every pattern P from an inputset P As usual. j-i i to the P. Details are given below. A direct generalization of the bad character rule increases i to the smallest index i 1 > i (if it exists) such that some pattern P from P has character TU .j + 2 positions from its right end. then the left end of the smallest pattern Pmi"in P would be aligned with a position greater than j in T _If that happens. a synthesis of the Beyer-Moore and Aho--Corasick algorithms due to Commentz-Walter [1091 solves the exact set matching problem in the spirit of the BoyerMoore algorithm.. no shift can be greater than the length of the shortest pattern P in P.t) exactly i! . Bad character rule The bad character rule from Beyer-Moore can easily be adapted to the set matching problem. With a shift of only one position.--'----."I. :..--~. rule is applied . It may happen thar f is so large that if the right ends of all the patterns are aligned with it. then the pattern number written at the leaf specifies a pattern that must occur in T ending at position i.. To make this test. etc. However. Recall that each leaf of K' specifies one of the patterns in P'. Hence. at the root of K'. Increasing i is analogous to shifting the single pattern in the original Beyer-Moore algorithm. using only the bad character information (not the suffix rules to come next).16. Its shift rules allow many characters of T to go unexamined. APLl5: A BOYER-MOORE APPROACH TO EXACf SET MATCHING 159 time. P3~ P. where n is the total size of the does not exist. :. Moreover. whether one of the patterns in P ends at position i. The following exposition interleaves the descriptions of the search method and the preprocessing that supports the search.--~. we implement the method using only the suffix tree Definition Let pc denote the reverse of a pattern P.••.158 FIRST APPLICATIONS OF SUFFIX TREES r 7. In that case.ln summary.•. it is possible that some occurrence of P. The algorithm to build K/ successively inserts each pattern into the tree. -~----+. ~: ~ P. The test at position i Tree K' can be used to test.) With this rule..

.12.[60 FIRST APPLICATIONS OF SUFFIX TREES 1. exists) such that when all the patterns in P are aligned with position i.Then the good suffix rule is: Increase i to the minimum of hi" either or both are nonexistent.16. determine the number 2" for each internal node u. Notice that because P contains more than one pattern. an occurrence of Pm. butleft of. if the right end of every pattern moves to i2.16. traverse the path labeled 0:' from the root of T'. That path exists because ~ is a suffix of some pattern in P (that is what the search in K' . Even if that doesn't happen. A bad character rule analogous to the simpler. weak good suffix rille is applied.cTI figure.. and let a = T[j . As an example of the preprocessing. We can now describe how T' is used during the search to determine value i2> if it exists After matching a' along a path in J(/. matched from posiuon i down to position jof T. _[_r=u=J-"u'--L_--i P. making the bad character rule almost useless. tn this pro Tfthatoccurrenceof~rstartsatpositionzofpattern r = Z .buildageneralizedsuffixtreeT'forthe set of patterns pr. How to determine i2 and s. As the number of pan ems grows. if 7. a prefix of at least one pattern is aligned opposite a suffix of c in T. unextended bad character rule for a single pattern would be even less useful. i J matches a substring of some pattern P in P. cyj: ___L[Yl-"uc_[_____)-< P.11: The shift when the amount ot the shift. These two preprocessing tasks are easily accomplished in linear time by standard methods and are left to the reader. If that happens.. then an occurrence of P in T could be missed.10: Substring" to the left of position j in A. xuqq. Pattern P must contain the substring ~ beginning exactly i2 .3. Recall that a denotes the substring of T that was matched in the search just ended To determine 12. the typical size ofij ~i is likely to decrease. A direct generalization (to set matching) of the two-string good suffix rule would shift the right ends of all the patterns in P to the smallest value i2 > i (if it exists) such that T[j . _m Figllre 7. As noted earlier. Good suffix rule To adapt the (weak) good suffix rule to the set matching problem we reason as follows: After a matched path in J(/ is found (either finding an occurrence of a pattern. Recall that in a generalized suffix tree each leaf is associated with both a pattern P' E P' and a number z specifying the starting position of a suffix of P' + IPminl. particularly if the alphabet is small.I positions from the end of P. If that copy of a ends r places With this suffix tree T'.16. and the second number refers to a suffix starting position in that string. APLI5: A BOYER-MOORE APPROACH TO EXACT SET MATCHING 161 The preprocessing needed to implement the bad character rule is simple and is left to the reader. when needed during the search.pattern ~ determines tne Figure 7. 1f P'issmallerthan P. Duringthepreprocessingphase. in some applications in molecular biology the total length of the patterns in P is larger than the size of T. So let I] be the smallest index greater than 1 (if i. or not) let j be the left-most character in T that was matched along the path." would be placed more than one position to the right of i . of T.4. or i P. no further match is possible 7. qxaxj and the generalized suffix tree for P' shown in Figure 7. -.Ignore i2 and/or iJ in this rule. then shifting the right end of P'tol2 may shift the prefix f3 of P' past the substring f3 in T. in the set matching case. The first number un each leaf refers to a string in P'. __ ~ P5 . that shift may he too large. The generalization of the bad character rule to set matching is easy but. The number z" is the first (or only) number written at every internal node (the second number will be introduced later). a rule analogous to the good suffix rule is crucial in making a Beyer-Moore approach effective P. the point where the previous matches end. In that case. unlike the case of a single pattern.nin T could be missed.---' -P3~ -.j + 1 positions from its right end. Suppose that a prefix f3 of some pattern P' E P matches a suffix ore. but unlike the two-pattern case. .1) close to. We will first discuss i2. This is because some pattern is likely to have character TU . The question now is how to efficiently determine h and I. When there are patterns smaller than P. consider the set P = {wxa. use of the bad character rule alone may not be very effective.i] be the substring of T that was matched by the traversal (but found in reverse order). This shift is analogous to the good suffix shift for two patterns. we need to find which pattern P in P contains a copy of a ending closest to its right end. that overlap might not be the largest overlap between a prefix of a pattern in P and a suffix of ce. Therefore. but not occurring as a suffix of P. it may happen that the left end of the smallest pattern Pm. there is another problem. The reason is again due to patterns that are smaller than P.

of a is a prefix of P. z ) be any edge branching off the a' path and labeled with the single symbol $. Over all such occurrences of a' in the strings of t». P is then the string in P that should be used to set i2. . For each node v)ofv. If neither iz nor f exist. Now consider the path labeled a' in T'.e. set a variable d" to z. There thus must be a leaf edge (u. An implementation eliminating redundancy The implementation above builds two trees in time and space proportional to the total size of the patterns in 'P. 01ends in P exactly z. equal to the smallest value of d" for any ancestor 6. (i. in Figure 7. <lofI'thatislabelcdonlybythcterminal character Siand set (including 5. . z) branching off that path. That means that P contains that is not a suffix of P. z l is just the terminal character $.12 . Since some suffix. the test to see if a pattern of P ends at position i can be executed using tree L rather than K'. For example. unlike K'. Hence. we have in T' at or below the end of path a' has a defined Using suffix tree T'. The details will be given below in Section 7. and over all such occurrences of a. This is again easy assuming the proper preprocessing ofP. Build a generalized suffix tree I' for the strings in pc. as argued above. d" is the second number written at each node u (d" is not defined for the root node). P contains the ending closest to its right end. then h can be obtained from it: The leaf defining e.) for Beyer-Moore exact set matching 1.16. after matching a string a in T. after the algorithm matches a string a in T by following the path a' in L. marked in step 2.lfJ. the search used to find i2 can be eliminated. and the two traversals are as fast as one But clearly there is superfluous work in this implementation . Remove the subtree rooted at any unmarked node (including leaves) of I. If:" is defined there.16. each node v in L now has associated with it the values needed to compute i2 and iJ in constant time In detail. Next the algorithm checks the first node v at or above the end of the matched path. then i.I. let (u.1. (Each leaf in the tree is numbered both by a specific pattern P' in P' and by a specific starting position z ofasuffix in P'. However..z)in T' that is labeled only by the terminal character $. . In addition. is undefined. (The number z is used both as a leaf name and as the starting position of the suffix associated with that leaf. of a. the two rules can be combined to increment i by the largest amount specified by either of the two rules Figure 7. with proper preprocessing. identify every edge (u.[62 FlRST APPUCATIONS OF SUFFIX TREES 7. To get the idea of the method let P E 'P be any pattern such that a suffix of a is a prefix of P. the algorithm checks the first node L' at or beneath the end of the path in L.12. If no node on that path has a d value defined. some initial portion of the a' path in T' describes a suffix of P'. The search phase will not miss any occurrence of a pattern if either the good suffix rule or the bad character rule is used by itself. Moreover. is defined there then i3 exists and equals i +d~ -I. From an asymptotic standpoint the two trees are as small as one. Preprocessing Begin 1. Here's how.. 50a' is a prefix of some pattern in P". Then the pattern P associated with z must have a prefix matching a suffix.I characters from the end of P. In the search phase.5 Now we turn to the computation of i. If a. pr contains the copy of c" starting closest to its left end.J is associated with a string P' E pr that contains a copy of at starting to the right of position one. These observations lead to the following preprocessing and search methods InthepreprocessingphasewhenT' is built. the determination of h takes O(lal) time. i should be increased by z. Using C in the search phase Let L denote the tree at the end of the preprocessing. then i2 exists and equals i + a. Thai means that a prefix of a' is a suffix of P'.3 7. APL15: A BOYER-MOORE APPROACH TO EXACT SET MATCHING 163 The proof is immediate and is left to the reader. In summary. the leaf where z = z. qxax} detennined). where dill is the smallest d value at a node on the a' path in T'. then i should be increased by the length of the smallest pattern in P 3.) (Nodes were Eod The above preprocessing tasks are easily accomplished in linear time by standard tree traversal methods. every time a string a' is matched in K' only O(lal) additional time is used to search for i2 and i3. Thus the time to implement the shifts using the tWOtrees is proportional to the lime used to find the matches. L3 shows tree L corresponding to the tree T' shown in Figure 7.2. the value ofiJ (if needed) can be found as follows' Theorem 7.16. Hence.2 Figure 7. xaqq. i. Conversely..16. Tree L is essentially the familiar keyword tree K' but is more compacted: Any path of nodes with only one descendent has been replaced with a single edge. Let u be the first node at or below the end of that path in T'. for any i. 4. The value ofi3 should be set to i +d . v in I' set d.. only doubling the time of the search used to find a. However. . where leaf z is associated with pattern P' and the label of edge (u. Find every leaf edge du =2 {II. Clearly. is defined. Again we use the generalized suffix tree T' for 'P'. Moreover.5.a single tree and a single traversal per search phase should suffice.12: Generalized suffix tree 'It for the set P = {WX8.I positions. can be found during the traversal of the a' path in T' used to search foril.) For each such node u.

ba/"bah. or it is a pair (s. .v= 0.xaqq..> then begin ° end else begin output SCi) i :=i+1 Untili. APL16: Ziv-Lempel data compression Large text or graphics files are often compressed in order to save storage space or to speed up transmission when the file is shipped. providing another illustration of their utility The Ziv-Lempel compression method is widely used (it is the basis for the Unix utility comoresss. then Priur] is bax > Definition For any position i in S. is totally contained in S[ I . abahah"b"'wha'h«baba. yielding characters 1 through i-I of the original string S.).I I· The Ziv. and uncompress the file.!-I I has ~een represented (perhaps in compressed form) and I.14. The basic insight is that if the text S[l . In either case. .. and outputting the character S(i) when needed. That is. we present a variant of the basic method and an efficient implementation of it using suffix trees For example.) in place of the explicit substring S[i .hah'. outputting the pair (Si' I.> end n To see how [.> 0. and some file transfer programs automatically compress. ship. For l.164 FIRST APPLICATlONS OF SUFFIX TREES 7. Full details are given in the algorithm below Compression algorithm 1 2. The next term in the compressed string is either character SO + 1). then the next I." characters of S (substring Priori) need not be explicitly described.-Lempel method uses some of the Ii and . 7. a popular compression method due to Ztv-Lempel [487.l. If we extend this example to containk repeated copiesofab. iflj. see [423]) and will not be handled in depth here.I. assume inductively that the first j terms (single characters or . without user intervention. the algorithm has the information needed to decompress the jth term.) in the representation points to a substring that has already been fully decompressed.b. although there are actually several variants of the rnethcd t hat go by the same name (see 14871and [488]). In this section.1 3. pointing to an earlier occurrence of the substring. then the compressed repre sentanon contains cpprcximately Slog2 k symbols . 1\ used during the search.i + Ii . that substring can be described by the pair (Sj. if S = abaxcabaxab::. i pairs) of the compressed string have been processed.1 Figure 7. define I. siasthestartingpositionoftheleft-mostcopyofPriuri tn the above cxample.13: Tree£col1"cspondingtotree'T'forthesetP=iwx3. APL16: ZIV-LEMPEL DATA COMPRESSION 165 2. a compression method could process S left to right. However. represented as That representation uses 24 symbols in place of the original 32 symbols.. starting ut s. The shifts of the P are shown in Figure7.qxaxl begin j :=1 Repeat compute f and s. The field of text compression is itself the subject cr severar books (for example. process the compressed string left to right. values to construct a compressed representation of string S.17.I.1 Note that when I. and since the first . i . is greater than zero. I.a dramatic reduction in space To decompress a compressed string...17.II when possible. as the length of Prior. Rather. Most operating systems have compression utilities. so thar any pair (Si. the copy of Prior. 488] has an efficient implementation using suffix trees 1382]. Following this insight. let T be qxaxtoqpst. Ij) pointing to a substring of S strictly before i. define Land rj = 2.4 1.

3241. In this way. It can obtain that pair in exactly the same way that is done in the above implementation if the c.i]. lO)a(2.17. Assume that the compaction has been done for 5[ I . this character is not needed and seems extraneous. l. then the entire algorithm would run in Oem) time. In either case. That works fine in many contexts. then both computations should significantly compress the string at hand Another biological use for Ziv-Lempel-Iikc algorithms involves estimating the "entropy" of short strings in order to discriminate between exons and lntrons in eukaryoric DNA [146]. One for outputting an explicit character after each pair is that + /. These compressions reflect and estimate thc "relatedness" of 51 and 5:.) time. The tree can be built in Oem) time.I. SI can be compressed using only a suffix tree for S2. l the compaction algorithm needs to know (s.-Lempel algorithm outputs the extra character. Farach et al. the path from the root to p describes the longest prefix of S[i . since that would take more than linear time overall. values when a new node is addedtothetree_Insummary. APL17: Minimum length encoding of DNA 7_17.). Certainly for compaction purposes. one character at a time Essentially. The compressed substrings are therefore nonoverlapping.) defines the shortest substring starting at position i not appear anywhere earlier in the string.i-ljhasbcenconstructcd.1. and Ii each time the algorithm requests those values for a position i_The algorithm compresses S left to right and does not request o. When the algorithm needs to compute (Si' /. Historically.16)...mJ that also occurs in S[1.m]. which establishes those c..18. __forstringS[l . Using a suffix tree for S._ If the two strings are highly related. Ii).) is OOd· Thus the entire compression algorithm runs in O(m) time.arecxteosively discussed inPartIII . rather than as (1. 7. 7 Other.2. ls just the suffix number associated with leaf v. A one-pass version In the implementation above we assumed 5 to be known ahead of time and that a suffix tree for S could be built before compression begins.3.wehave Compression has also been used to study thc "relatedness"] of two strings SI and S2 of DNA [14. the algorithm is implemented so that the compaction of S is interwoven with the construction ofT.L . the algorithm cannot traverse each of the implicit suffix trees.166 FIRST APPLICATIONS OF SUFFIX TREES 7. values have been written at each node v in I. Ii) can be found in OU. c. by induction that the 7. and all the node numbers can be obtained in Oem) time by any standard tree traversal method (or bottom-up propagation)..Atthalpoint..I] and thatimplicitsuffix tree Z. The traversal ends at point p (not necessarily a node) when i equals the string-depth of point p plus the number c. The real Ziv-Lempel The compression scheme given in Compression algorithm I although not the actual ZivLcmpel method. and if each requested pair (Si.. That compression of S2 takes advantage of substrings in 52 that appear in SI. 5 = abababababababababababababababab would be compressed to ab(l. [146) report that the average compression of introns does not differ significantly from the average compression of exons. Ii) for any position i already in the compressed part of S.17. the O(ld time bound is easily achieved for any request Before beginning the compression.). the time to find (s. and hence compression by itself does not distinguish exons from introns. For example. However. it traverses the unique path in T that matches a prefix of S[i . APLI7: MINIMUM LENGTH ENCQDlNG OF DNA 167 term in the compressed string is the first character of 5. values in a linear time traversal of T. alone Similarly. Exploiting the fact that the alphabet is fixed. and it gives the left-most starting position in S of any copy of the substring that labels the path from r to u. However. Essentially. equals the string-depth of p. we conclude decompression algorithm can obtain the original string S. and I. The real Ziv-Lempel method is a one-pass algorithm whose output differs from the output of Compression algorithm 1 in that. unlike the above implementation.wheneveritoutputsapair(_si. Si equals c. c.itthenexplicitlyoutputsSU+I. 2)a(2. the idea is to build a suffix tree for SI and then compress string 5:.li) defines the longest substring starting at i that does appear earlier. the algorithm first builds a suffix tree T for Sand then numbers each node 1! with the number c. tv). Implementation using suffix trees The key implementation question is how to compute I. 12)b. using only the suffix tree for 51. whereas (si. 7.._I. does resemble it and capture its spirit. Instead. whenever a new internal node r is created in Ukkonen's algorithm by splitting an edge (u. it may have been easier to reason about shortest substrings that do not appear earlier in the string than to reason about longest substrings that do appear earlier. 4)b(1. but the method can also be modified to operate on-line as S is being input.thecharacterfollowing the substring. they also report the following extension of that approach to be effective in distinguishing exons from introns. This number equals the smallest suffix (position) number of any leaf in u's subtree.. but does not take advantage of repeated substrings in 5:.) for some position i.: where v is the first node at or below p.18. is set to CU" and whenever a new leaf v is created. The one-pass version of Compression algorithm I can trivially implement Ziv-Lempel in linear time It is not completely clear why the Ziv. only constant time is needed to update the c. The easiest way to see how to do this is with Ukkonen's linear-time suffix tree algorithm Ukkonen's algorithm builds implicit suffix trees on-line as characters are added to the right end of the growing string. more common ways (0 Sludy the relatedness rsimilarityf o o 'trings OflWO string. So.

Then lor those npatterns. Fora tree used in the Aho-Corasick Swe can think of the edge. into a single Ifwetakethisapproach. Discuss overtime. {Section Sof in algorithms lor the shortest Generalize this idea and work 16. but is more invo!vedsinccthesubstringstartingatimustalsoappeartotheleftofpositinni The main empirical result of [146] is that the exon-average ZL value is lower than the intrcn-average ZL value by a statistically significant amount. Now suppose If the search u is the parent 01 p before number the subtree 7.7. that are not contained a keyword wilh the set 01 substrings s. resembles the way marching statistics are computed. L(i). Given a set string in problem 2. In this way.. and to find the longest string is of length 0(k2n) common substring the longest SUffices.buttofindalltheoccurrencesofthepattern. Next. Now consider are 01 length only if the set of positions in Swhere of cr end equals the set of positions all substrings Dtogether 5_ Suppose of 5. show how to Construct tree for set 0 in O(n) time. Exercises 1. 01 this sort can the DAG in linear in Sections 2. By modifying to the resulting the compaction a little extra (linear space) not only whether DAG.e. P{i. 01 this alternative tree for 5. In some applications ina larger string in O(n) time directly 01 the keyword single string from a suffix tree for S (AC) 01 S be answered it is desirable to know the number linear-time of times an input string queries string S.whatist structing he relationship the AC tree path through belween reverse 01 the string 5.3. give some information the structure It should be intuitive at this point that the exon-average ZL value and the intron-average ZL value can be computed in O(n) lime.2. values (dis- 11..) pairs 01 strings. where the strings in the building the 01 a suffix tree lor string computation labeling isomorphic. any maximal time so that these queries 12.otherwiseitoccursstartingatpositioni out the details lor any number 01 no demerges. Give an algorithm 01 each common suppose common substring substrings in a set 01 k strings Assume 01 the (. A suffix keyword a keyword following sUffix trees built in this way? tree for a string tree when implicitly S can be viewed as a keyword tree. expose the structure 01 a string about when the text is larger than the set of patterns 7. L'U) and sp.20_ EXERCISES 169 method versus the use 01 suffix trees is varied use.2. Assuming 5" let 0 be the set of in Swhereoccurrencesoffj the two strings n. build a keyword of 5. by using suffix trees to compute all the ZL(i) values. give an O(n)-time Poccursstartingatpositioni-l.168 ERST APPLICATIONS OF SUFFIX TREES 6. and 12. and the AC tree is a compact n of the compaction algorithm suffixes asasetofpatterns. it is possible to use the DAG to determine p and q. The subtrees 01 p and q are isomorphic the strings lor the tree are only implicitly Given two strings in specified. time clearly but sum to m. in Chapter 9. sepjust be the problem substrings to a single and the lirst seems much harder than the second. remove pq edge. 7. In what way does the suffix to the Aho-Corasick Boyer-Moore keyword That much methods? tree more deeply compared or but this a string. Prove the correctness 13_ Let Sr bethe can be answered The input to the AC algorithm is a set of patterns P.7. the relative Consider advantages preprocessing. Fora cussed o and put a displacement lor a pattern in u to we search the P in the text and determine below the termination that P is in the text. traversed S 01 k strings.4. the strings and adding Why can't the first problem 7. the sp values inlormation about about done lor the Knuth-Morris-Pratt of the string. Let i be a leal below the path search). P occurs to solve this problem. The technique.iJ of the Aho-Corasicx where search for the exact set matching problem. can it be easily extended 16. Show how to lind all the longest J<:2) (I don't know how 10 achieve Now try for O(mi- last bound). Make tree or the preprocessing is. the SUffix tree into an Aho-Corasick II Prove the correctness 15. Is there a relationship the number between 01 nodes in the DAG lor S and the DAG for S'? Prove it Find the relationship the DAG for 5 and the DAG for S' (this relationship 14_ In Theorem efficient substring ifand 7. The problem aratelyfrom problem reduced of finding substrings 01 finding to solve common to a set of distinct common strings was discussed string.1 still hold lora generalized suffix tree (for more than a single If not. let ZL(i) denote the length of the longest substringbeginningatithatappearssomewhereinthestringSfl . that the string lengths each n. the text is fixed and the s at of patterns time.Weillustratetheideawhen there is only a single merge q and that 01 nodes 01-1 that p has larger string depth the merge. Show how to preprocess in 0(1 P:) time using a DAG lor suffix tree s.19. the full biological significance of string compressability remains an open question. show how to compute the N(i). and space and vice versa. Additional appUcations Many additional applications of suffix trees appear in the exercises below.17) length n.. Salongwith its suffix links (or keyword one has built a generalized Show how to efficiently link pointers). representation the suffix 01 those tree in terms patterns.2) in 0(: PI) time using a suffix tree. nodes with only a single one can build a suffix tree lor Sbyfirstcon and then compressing. Now this 8.20.4 and 2. the suffix tree gives precise. The DAG Dforastring to hold? to a finite-state machine 01 edges byexpanding each edge S can be converted with more than one character on its label into a series la beled by one character . and in efficiently a node q in the suffix end. Answer more the same question to take suffix trees and Z values. After the obvious preprocessing. Since 01 any pair can be lound are different in time O(km). Assume the merge. Hence. a suffix set 01 strings: tree is useful S.4. That result is contrary 1(1 the expectation stated above that bio!ogically significant substring s (exons in this case) should be more compressable than more random substrings (which introns are believed to be). then cto S Assuming that the totat length of all the strings This result will be needed n. Does Theorem condition lor subtree tso morphism str ing)? 7. that are not in suffix tree fora convert string tree for S. we want to find every string S that is a substring is of some other algorithm superstring labeled pqedge.1 we gave an easily Sare is as follows: is a bit more direct than for suffix trees) computed condition to determine condition labeling when two subtrees that is less useful a node p and for thefailurelinksusedinthekeywordtreeandthesuffixlinksusedinUkkonen'salgorithm? Whyaren't 4. Leter An alternative be the substring occurrences fl be the tree are the suffixes specified of S. During on the new point otthe occursinthetext. in O(n) time.. 9. between 3_ We can define algorithm. and in exercises of Chapter 14 to the second by concatenating algorithm in thesettol orm one large string? Information a pattern than of 10. in Sections 12. C onslder both the cases Definition For any position i in string S. child.3. 12.

st atisticsbe alg orithm? Q is the set of all pairs (p. lor all occurrencesola the length of the view of this is that the AhaAlso. Again edge that illabeled leafito how to build the smallest with more than a single than before. We proceed as before. to skip past any part of the text that has already Fill in the details problem 26. Two states Prove that this improvement in this way) What is the relationship method? (l. During vis Clearly. S is a suffix 01 pinto directed ms(i) for each positioniinthetext. to rather until it gets bacc on the desired solves the exact matching of 8. remove all The to q. operations Can it be used to solve the of these shortcut the previous links to the backpclntera method even further: If not. (Thiscanbes olved using the lowest is possible.p) 01 the edge to the length a merge linear time using the tree and link pointers 23. By using a generalized exactly suffix tree. page 135.fori we imagine it. sometimes discussed one method in Section will Recall the discussion statistics "mOdest" can number be used to identify of errors use less space possible bounds 20.170 each.thenthe v'. u path that leads to teat I. in linear time of STSs in Section any in either of length 7. where (p.q) be modilied. There entered is a simple prove that the resulting by expanding DAG recognizes each edge of this DAG labeled by more than a single character improvement (after some to the previous let v' bethe of iterations number method. The total time should prefix of length algorithm be O(kn+ p). starting from them. that there there is a suffix link from each alltha Ie aves will gel merged.When performs first of these edges is labeled and the s eccnd cne. This finite-state the smallest FIRST APPLICATIONS OF sur=tx TREES compute in matching could statistics 7. Show that the linite-state unue-state ministic strings ta. it is the STS strings neach. is a path of suffix rithm. every edge of the DAG 01 its edge-label. Number ms(j) is defined as machine will recognize machine. Hence. to recognize substrings be ms(j) lor each position such finite-state Give an example finite-slate isomorphic character. in computed be done more carefully each leafi+1.5 using only a suffix tree for 5" contaminants to compute rather For any pair 01 strings.. of this idea and establish orch aracter set 01 patterns the solution Corasick it needs Because text.1 ms{i) suffix tree for 5. of total length prefixes of the longest than a generalized 21. whether solves the exact matching in the same total time bound but in O(n) space. obtaining the same time and space of strings.8.Atthatpoint.createadirectshortcutlinkfromvto at v.3.do skip/count not proceed operations until you are on the suffix 181 > I. then the 8 edge is split into two edges islabeledwiththefirstcharacterolS characters with lslabeled the flrstlvlc-fx] with the remaining by the introduction of a new node u. Thentheedgelrom +1 cnaracters v is directed of j-. longest keyword common tree (with backlinks). We might modify only follow the suffix cornpansons link (to the end of a) and do not do any skipjcount unless you are on the path toieaf I. and than expand each must 22. of y. c. longest pis common the number method prefix lor each pair O(n) preprocessing there is no definite time and O(m)searchtime. discussed 01 Section Show 27.the just follows but the edge not necessarily iteration. that is. all the Show how to find all the matching T and Pin 0(1 TI) time. we used the suffix link s created by reversing the link pointers Can the matching by Ukkonen'salgoalgorithm. marge Moreover. the smallest lor the small P the matching we also want to n and want to compute lor each position i in the long text string T.e. (v. start with a suffix tree for 5. Then. substrings of S. Prove that this modification What advantages correctly pro blem in linear time method compared to com- than to q. relationship between nand trees m. common 28. but that would require statistics for both T. < T. 25. The key to this proof if.) p refix common length to the of all the in Section discussed later. edge (u. algorithm pa th not on this path. In our discussion but suppose of matching statistics. machine However. The finite-state machine number process machine and for 5 puting the matching 24. p to q in Suppose into p. pair in time linear we are given pairwise longest k strings we can compute length. that the exact matching See [228] have the Aho----Corasick statistics in linear time. using only a the merge operation links connecting suffix tree for P. Explain why IYi O! lSI. together we used a suffix tree with all the possible string in their total common This is a simple use of a suffix tree.8 we discussed problem: matching statistic the reverse as given by Weiner's use of a suffix of pattern problem of leaves in their respective tree to solve the exact pattern comof is in Q. and n.8. beneath such subtrees.findthe greater for the set 01 patterns obtained the reverse and the reverse role for suttlx method. In Section malching putedthe statistics: actersm each 7. show how to solve the than zero. Show in more detail how matching STSs that a string contains. is that a deter- link can be taken to avoid the longer works indir ect route back to v' problem is correctly solved machine has the fewest are equivalent il no state in it is equivalent exactly the same set of to any other state. Using general or preprocessing time than the other. EXERCISES 171 j in P. Given having a set 01 k strings a common ancestor to sOlve the problem with a suffix tree. what is the difficulty? 7.u) Using this modilied is created or disadvantages statistics? practical are there to this modified merge. but a simpler the tength DNA contamination 7. Rather. Now suppose over all 01 the @ . do all the needed been examined. iteration ends Each node in the DAG is now a state in the finite-state 17. 7. from you are on the path from the root to the leaf labeled the string ay in the suffix tree. substring problem Can you use it nonlinear time in a reasonable of these shortcut links to the lailurelun ction used in the Knuthwhat is the method? or il not. In Section statistics 7. that method takes method + m) time and preprocessing the problem In contrast. Let v be a oolnt on the off that path). clean up the description substrings 01 the entire compaction of 5. used by the Aha-Corasik In each iteration. Let v be tha parent be expanded of p. and we q. The point is that if any luture substrings of 5. and each leaf has zero leaves Recall that T.8. where of pairs of strings by the Aho-Corasick role for suffixtrees problem In detail how this is done 7.16 we discussed in a given O(m) solves takes how to use a suffix length time O(n tree to search O(m) space.erc node on that path that was next that visit nodes by the algorithm shortcut has the fewest of states of states of any Then. but it will not necessarily of this. Edge (v. Since of edges equal as possible. it correctly sk ip/countcomputations text is m. in some "reasonable" Morris-Pratt relationship When the suffix tree encodes more than a single pattern. matching of the matching below statistic algorithm Here is a modilication but never examine into a number as small merge that method that solves the matching new charI. That is.20. In Section maomne inat finite-state are accepted. let y be the label Find all occurrences the exact P in text but does T. The solution not compute there let S be the labet of the edge into will uttimately will exploit out edges il we want to make each edge-label this fact to better from pas before. q). Suppose to compute bound? time bound? 19. assuming there are or the new string. Suffix links can also be obtained that the tree cannot of Weiner's leaves.8. Now suppose the minimum pairs of strings. you already matching machine recognizes created above number path to leaf 1 where some search erceo. Iflhe of all the patterns and O(n) search is nand Anolher time. in links and Follow the details the text unless edge-labels.q) p and q have the same number exists a suffix link from subtrees. there thelengthofthelongestsubstringstartingatpositionjin of Pthatmatchessomesubstring a suffix tree for the long tree We now consider S.

+n' repeats <: tree we can get an implicit length compute the frequency n isspecified.arld a generalized in the set to find all maximal common pairs of 5. Show how to construct in the DNA itself but also by repeats that occur in data collected the data some come from the DNA representation 33. 30. build solvethe Voflength prob31. Some of those O{m) time.172 length prefix of the pairwise problem FlRST APPLICATIONS OF SUFFIX TREES efficiently substrings 35. prefixeS. and ~ we want Try to give such an argument at any postncn lem in the same time bound tree T(S) successively initialize tree traversal. It is posslbletc First. where in time independently were contributed one copy of each distinct substring algorithm that takes in a string are mO /2 substrings count the number any string Pof S and finds the longest maximal P. Then repeat was modifications of each maximal substring. and pair is a triple (p.finding in the data base (these discussed sequences into a few contigs. i. all ow us to frequency n'.12.ifanode write the string-depth of node 5i ends at the leaf labeled with suffix v is encountered coma iningindexiinitslistL(v).e . m. Show and how to use it the number at. a comand techrepeats. follows: equally A maximal but less way to identify by the left-diverse fJfJ for some {J. effective of the window approach. for the set 01 kstrings S (as before). p.000 others. and more 14. developed for repeats (maximal pairs. wewanttofind the Smallest substring in linear time. and ~. r ot vector V. again the all-pairs suffix-prefix without vector using matching an explicit problem.respectively. of these Since there are (-)(m~) substrings. all the available Iragments to the lerlgth some of the fragments unsequenced. if the two copies pcare not only begin at positions P2 of times each appears representation using a suffix In particular. and so on.. how to use the above claim applications. See also In biological occurrences plemented niques we are often where always) common both. of Weiner's for developing 8isasupermaximalrepeatifandonlyifnoproperdescendantofvis node in v's subtree (including via a path of suffix nodeotherthanv. of substrings copy. When the walk and B = aqyxpwthen common and ~. and of triples pairs ~. A substring the worst-case (in practical discussed output. Consider suffix Then siring then in O(n) lime of k. where p.specifiesthelengthofthelongestsulfixofS. begins there shouidbeadirect. Give an O{m)-tlme method. Consider common matching tree. then That is. propose Compare any advantages or dismethod problem. on either the right or if A = aayxp/ whereas starting v tc contan any string match each string in a string that is not common to both 8. Give This a linear-time algorithm to find the longest in 8 represented v) is reachable node v in the suffix tree lor left diverse links t rom a left diverse an input string S. efforts. and For example. the autho rswanted coli genome that appear This is done much more up" the data at hand. in the sequence in a biological These methods strings analysis sequence almost fixed how literature.Provethis.. Show how to solve then the method should take time O(m) plus time proportional the problem maximum 29. numbers the two copies do not overlap. all zeros. and ~. by chance. in repeated copy all the distinct three of that we show extension of this window this or q-gram In the providing Exercise one substring repeats. should the implicit substrings are we 40. suffix-pre fix overlaps. repeats in a string It seems in 5 A of that the all-pairs in O{km) of 8 that occurs 36. to find all supermaximal not only interested repeats Very frequently esting a long find features string. coli.stringsthatcontainall huge amounts Assuming occur DNA sequences may not beth eshortestpossible). and set up a vector C is maximal the string if the addition to C of any character common common substring.lust the rlumber generalization of maximal and an O(n + to find one copy in Section O. The obvious independent way to solve this istosolve of strings in 0(k2 the longest common the substrlnqa in the database. window length. v into position reaches n' is the length of a maximal substring at those the leaf for suffix 1 of 5i. time) of this method ccmpa red to the tree traversal to the tree traversal 5. Verify solved lor 1(2. influenced by the choice Show how to adapt all the definitions supermaximal inversion problem.20. suffix window. how one has a statistical model to determine above which the techniques for finding repeats.m. [298] one example. the O(km) a term (Smallest k-repeat) Given a string find all lnteresttnq 7. approach lowing natural is of course exercises. The match 5i. S if a is a prefix of motivatlnns of the maximal efficient.1 of this method. EXERCISES 173 1ft hedatabasehastota!length. k 1 of Recall that it is not true that at most one maximal Given two strings common substring left of C results A maximal positions. pair in <: identical and so occur more than once in the string. One way to hunt for "irrterestmq" substrings appear in the database alone. V(i). near-supermaximal and complementa- of Chapter 32. and many intervals analysis sequenced mayor of the E. and a threshold .the Tin implicitrepreserltation for handling by Leung et DNA in DNA gives motivated by repetitive structures a problem 1. yxp is a maximal yx is not. problem string Explain also the problem of computing length over all the pairwise suffix-prefix a suffix 8 and a number k. Theorem length 7.simpieargumenttoestabiishthisbound. substrings) is an inverted maximal to handle of the other. n').inthesametimebounds of a string T whose length in Tin is m. supermaximal repeats is as and no S and has the form prefix repeat of suffix 38. were yet of Pin O(n) time. p._ are positions in 8.10 can be is. Ttus is a all of its advantages a is called string a prefix problem and also correct of string one for its disadva ntages. in O(m) time. to the number of mteresnnq pairs. discussed in Section That 7.. through the tree. stored In that paper they discuss sequences from E. to "clean and packaging the substrings Using etten than theyw ould be predicted even more others. coli database. where sequencing from more than contained how to count of Tis sequenced fragments in an E. However. for each of the (~) pairs + kn) time.2.Duringthiswalkfor5i. P. substring show would how to database is to look for to attractive when substrings of a string Tin O(m) time. of anonymous by chance sequences tooay before the desired redundantly contigs was begun. in 8 exactly which states k times. Establish advantages method that maintain 31. Techniques that occur The paper of analyzing of distinct strings. Give a linear-time which and are of length 41.1. in a DNA sequence and will become are available how likely any particular a substring is 'tnterestoq".foreachi. Another. substrings but in always length. Show how to enumerate of all those by independent overlapped Consequently. or (almost repeats. letting time space analysis and/or 7. Show how to solve this problem can be at most maximal repeats repeat time using any linear-time matching method. was established by connecting with suffix trees. pair in a sing te string give an O(m ~ k)-time I)-time alqorhhrn solution to this to count matctes a prenx ot S. begin pick methods aimed certain at finding substrings and interof then fola 26 by cataloging a fixed-length The result of the avoid trees in linear time. that there n maximal time bound does not require why the bound does not involve n. Show 39. repeat string .that This is a generalization m denote where the total length k is the number common common of the maximal of 8. Since the sequences the lerlgth proportional 34. repeat where problem concise I is the total length for a single of those str'lngs. coli genome reg Ions of theE. There cannot when tion.

and contains of the techniques or limitations. in the DNA Let 8 be a string of length n. algorithm.!. a gene is composed and introns. that are independent different this view. is defined on pairs of suffixes of S. exon exons being shuffling.lnthenew is the relationship T'. The goal the time-efficient solve the more complex space-e.quivalence c. We say iE. unsettled issue. the kinds of computation 5. all proteins. concerning should try 10 identify common the exons or repeated by finding substrings for exon E. since facts: Fact n in En IS a Singleton.1 it may be he Ipful to first find all the v represents the same set of suffixes retinementtree.. or cover with it as best you can.whileachieving it would be cheannq pro p. 43.) genome other database. the suffiX tree approach to to implement time. substrings in 8" each 01 length k or 52. Hence exonscorrespond gene construction of the rest of the protein). bu tclearly. sintothe desired EXERCISES 175 7.torepresentthe of the classes contains children What the elements of successive The root of refinements class mtrcns not known.. Consider string properties Therefore. because that 'does E. Note that the substrings contained case? 49.each node at level 1 repres ents a class of Et could be applied. Venfy the followmg sequenced of reusing are called a few thousand exons is called shuffling. whose Similar form. we introduce ve refinement the ideas. were in addition to organisms. Prove Theorem In the Ziv-Lempel traversal with lening 7. nsuffixesofS? Now modify T as follows. of as its parent node I v'. Successive Successive of string connect string refinement refinement [114. coli. each of length of Cin &.calledtherefinementtree.contract v'loasinglenode. that are from two strings each of length k or greater. class is a subset of an Ek class and so the Ek+l partition problem The problem: gions where common Given anonymous.16.!.fficient a suffix have no appllcaticn problems listed to prokaryotic problems is to maintainthe of the suffix array. and delerminedinconstanttimeafteranytest.! is a permutation algorithm of substrings of 5. common substrings of length k or more 7. remany of the E"partition WeuseaiabeledtreeT. Grapple organisms.yeaststringswereincorrectlyidentifiedinthelargercompositedatabases. how to convert T' to the suffix tree for string S? Show I a suffix tree for S into tree T' in O(ri') time . protein. t E~i + 1 is a refinement mosaic proteins. (units orders are often built in a modular refinement to suffix trees. methods is a general 199.lasses two on the reuse to date exons and reordering up of just exon facts of stock exon and the s. iE"" j if and only if iE. appeared higher than quality in the reported sequencing isan yeast strings.! and a parameter good suffix rule. each nonleafnode Prove Lemma has at least two children. 50.3. the excns (ideally. represent S. are substrings) strings of DNA from protein-coding Clearly.jifandonlyifsuffixiandsuffixjol8agreeforatleasltheirfirslkcharacterS.20. discussed in this chapter they would result does require In general. data should set matching With that implementation. repeats shuffling? Although maximal 44. T represents Eo and and its regions in two or more DNA strings.ltisnaturaltowonderif evidence to support by modular tnat This [468]. To further more recent of yeast may yie ld sequences databases. folds concatenation composed a single in a variety is unclear. organisms where. as k increases from 0 to n. end at point p if the string-depth the match of p plus c.4) constructed from the set of substrlnqs.16. 265]. set matching tree to implement independent a strong a weak good suffix rule for a Boyerthe increment of index i was even of the alphabet size. 53. it may not give an elegantworsl'case result. vand It node What That question an empirical.I. the same domains and there is some in many and combination s. . andapplysuccessi of distinct although modular based domains in diiferenl protein protein have distinct problems.j and i . Prove the correctness of the method presented in Section earizationproblem. to particular greater.2.and parts of cloning problem. Fact 2 Every E •.) for some position i. is instructive DNA.ertiesofthesufflxtre~. suggest createdviaexon general search 1 For any i '" j. prcblern rather In addition to methods the x-cover seemofu sa in studying all the E'~l analylical will surely is the relationship of T to the keyword 3. In eukaryotlo specifies it as much This is a harder problem. Or. Extend in this one tenth 01 the errors old sequencing curation 42. a k-cover we used a suffix algorithm. all the nsuffixesof in that class. 8. why should the &. find the largest < k (if any) such Try to give some explanation compared to compression for why the Ziv-Lempel algorithm 1 algorithms 01 finding for these problems. That is. extend wherl computing i? algorithm outputs the extra character (s.Notethat construction are made These may be reflected E. and identical but sequenced. isan equivalence S has relation characters. to cover &.ordeterminethat If there is no k-cover. nonoverlapping as possible. of the yeast database problem. as in most prokaryotic not mean that techniques organisms.14. algorithmic technique that hasbeenusedforanumber exercises. of 5"each k' of length korgreater.and to individual domains.that that there is a 51. Proteins that whose of alternating function functions exons. In Section Moore of the preprocessing to exact needed to implement the bad character rule in the asyeast. Consider Give a linear-time to find a k-cover no such cover exists. where again the increment he alphabet set of substrings concatenation C may overlap korgreater.although No elegant and common have to be tried out on real data to test their utility be expected..13 lor the circular string /in- cleaning up the data and organizing the sequence 45.Jtts estlmated AlSO.. C is a as the in al tools discussed k. However. clean-up existed above. such that s" can be expressed i can be found in constant Can you removethedependenceont of the substrings in 5" but not in some order. k'-cover. Each child of the root represents classesmat refine it tree (Section a class 01 E. in the yeast strings lrom to first use the suttlx array for a stn ng to construct the kinds 01 problems vectors tree for that string Give the details Boyer-Moore 48. and Ih. than theoretical answer. equals i? What would be the problem then find a set of substrings of past character cover the most characters Give linear-time now the problem s". The relation E.174 would you go about contigs? (This application little repetitive structures Similar FIRST APPLICATIONS OF SUFFIX TREES 7. in detail wtlether considered a suffix in this array can be used to eificiently chapter. thaI have approach 7. &. k-eover sequences be integrated will require in the existing How the new and the any large-scale here. incorrectly complicate listed the 47. or distinct are seen problems successive tn the next several are often seen of genes. for handling contains repetitive 46. Given two input strings of 5. and so il partitions every class the elements into e.

the tandem repeats (In Section so comes before the class for the single m.1.fl.jand the classes i+kE.1 (page 40) om-meres. Instead. trom th. any numbers 01 this follows Giveacomp~eteproofofthe the E'k partinon Irom the 55.{2.8).6.S) remain E.2).7) were in the same reftnedin E. to refinement.i Lemma implies that{i.j+ of k lists all Cl = {JI contained S Is called maximal From Fact 2.1. simply and th e crasses are arranged of all are no additional Maximal interested before.partition. Processing the first and all the tandem used a successive arrays in a given string S.20. [2.g that the can be obtained it easy to identify order.are:{12).goflthm In correctness Ekparlition refinement approach of E".1 we will discuss and we allow the repeats the tandem in a string.7) and [3.. useful to use Lemma Lemma 7.) the tandem orby array 4). 112):12. The class in identifyingthemaximaltandemarrayscontainedinastring. Primitivetandemarravs Recall tandem that a string repeat.inlexicalorder are: 112).1).5).j) is a pm-triple. creates [3. These by the substrings in Ihe same represent.S. efficient We focus here on all the maximal to find all methods and implicitly to contain encode and to identify the most informative m lexical order of those characters. What of the integers 1 to n. Alt. tandem of {J before were initial'v atter defined of [308] starts In Sofone by computing character (3. suited is such [308] The advantage for parallel an algorithm. The pm-trlples triples k..8. is obtained in practice depends on the size of the alphabet with O(nlogn) by refining implement should character and the manner comparisons that it We use the pair ({J.{4. IS represented. and E. [114] (with different terminology) refinement method 10 lind all the pmlor each and hence Each class it creates of E2 holds £2 classes {3. First i. i .6. Be sure to detail how the classes Assume smgleton. Now consider bythepair(abababab. i and i + k are both contained in a single class E.(3. 57. lemma There the two classes classes. lor any k.(3.2.1? time.(6. crocnemore In the creation of E. When or array can also be called i. for each E2k class.54. 9. we create than examining I > kine. .7) in O(nlognJtime. Prove Lemma is easy.s. results {lOI. in Exercise 4 in Chapter in Section 1 (page 7. such that each consecutive pair of numbers differs by k.1. we choose the later pair.14)'{6).f8).20.p.s. ends. using a reverse C of Ek to find how number C be refined.{1).. Eachclas~ al. For example. C as a ~ee h~w it forces other locate by C identify Ek classes a complete to split. creates they there can be only O(nlog classes substrrngs {2. However. t~e general idea of reverse of addHional £2" classes that 12.Explainthisindetail .1 Assume that the indices in each class where series repeat of a k·lenglh class direction is harder and it may be substring The method in [114] finds the Ek partition following 7. for S =mississippl$.[1J. are kept in lexical of Ek are sorted is a in increasing substring.. £.{I}'[10J. k. the maximal tandem arrays in mississippi Note described arrays by the pm-triples pm-triples can begin are a suffix S using approach. approach as suggested in is best? Since the first two pairs can be deduced will from the Fact 3. is central: is a tandem i and i + k are One direction 3.hough tation analysis.1ss. in order. Which description array {J'. Explain why.6. Then. 7.The lemma succinctly encode since and (9.'5 In S and limit the size of the output a subset of the maximal tandem arrays. position. {8) and {9). Class {9.k.e lexically 11..1 makes provin.fl1).and the For example.14. of the (singleton) permutation creates the O(n)-lime classes describes array a permutation S.{9).Thisimpliestheverynontrivialiactthatinanystringollengthn nJ pm-triples..fll). /) to describe Cl = abababababababab.2). The other requires a number Give complete order.2). and mark A. Each class of E. . Rather refiner to we use last.4. three of mississippi.theyconstructonly some of the classes is reduced use of space The original the suffix the in practice suffix array partitions or [116]. E.7) and {10).13). E.{9.. character the partition the importance arrays and repeats was discussed subset specific in the alphabet. 13) and We are (see Section 3.6. This "choice" now be precisely defined. Conclude 7.(10).SI and 01 length of identical or two.5. Note that the algorithm within log. the tandem in a string Cl. ~everalstringalgorithmsusesuccessiverefinementwithoutexplicitlyfindingorrepresentmgalltheclassesmtherefinementtree.{9)'{4. holds [308] co. For any x> t.8.6) one of E.2) the same first number. efficient implementation from the Ek classes implemendetails in O(n) and time. class because 14.10) to obtain the of Ek are kepi 01 the two dFfferent maximal tandem the two maximal arrays ss and ssissi both begin at position E2 partition smallest 01 the of S = mississippiS.{3. The end-of-string other character. The correctness 01 the reverse as follows: Ek class For each number in A from Fact 3 to creating array.Asdiscussed on a structured 01 the string sofinterestinorderto members. which comes to be lexically for the s's. largest the three process the classes class of {2.e.61.How $ is arrays that succinctly considered small erthanany 9. etc. the E2k partition time.2.lnstead.11. the order of numbers i.nstructs the starting array for locations 01 a k-Iength Ihe reverse in the lexical substring refinement order of S. 10 the lexically Class. or to stay together.2).10).5. has five classes: before the class it is often best to locus tandem 11) lists the position . Class the starting ordered lexically Note £.we can obtain the E2k partition in O(n) the E. can stop as soon as every class When the algorithm fJ is a k·length if and only if some single class of E" contains and this must happen n iterations. computation where as a byproduct of a successive E.i + + 2k. The For example. Cl is called array a tandem army if Cl is periodic I = 2. computed only for values of k that are a power of we need an extension of Fact 1: Fact3 For any i ¢ two.[2.8. The E2 classes of [4.[7).111.ltcanbedescribed orby(ab.7). It certainly can be obtained (abab.5. marked It IS not clear how to efficiently a class a direct use of Fact 3. We develop that method here.f8).4. should now be clear.SI.2).6) are each no numbers in that refinemenl details. iE2kj if and onlyifiE.20.20.TheclassesoIE.ssi. and 9. class. array is obare 56.7) locations 111.7). The second E.4. if there lt a can be written as {JI for some A tandem copies arrays of tandem I :0: 2. a maximal n is a p~wer of two.11) of E. some errors. E2 class 111). Prove that this method over or is the suffix method for string in Section thai the reverse refinement only compute the tree implicitly.. strings (2. refinement a suffix array in O(nlogn) detailed is the spaceadvantag e of this method an algorithm c~nstructlon tained that is better method computation In that algorithm.i + jk. fJ starting at position of i 01 S if and only if the numbers between 7.the The algorithm :he locations classes of £21< refine E. that two or more tandem can have at the same With t~e added detail that the classes associated With the classes More specdlcally.

Theorem sequenced. One such situation methods is walking". say nine. If.and to find them.20.~ 59. and by modifying 55. That substnnq somewhat fewer than 300 new Let S be a set ~f strings The algOrithm A more certain along u~elul ~Inlmum With the set should version length finite alphabet $over I:. is removed list and placed in the linked by the substrings any point which nu- list lor the appropriate except created has used as long as you know This replication nucleondes has had a tremendous allows inthe cleotldes.thenwa sequence by used to form the primer. called relinementandlemma7. Above. Give Implementation to find all the pm-tr"lpies n) time. was used to partlelly paraltebze but mrcmosome sequenttal. for every k > 1. laboriously this directed nnerennv (In the CystiC FibrOSIS case another thiS sequential Idea. Give an algorithm (using a generalized 01 the strings in S than a is given of " lor every i. byan So by Ihe time C is used as a refiner. we can check sequence to see if a string compl~mentary exists to the left. is certainly desirable. be immediate (i.. class Csay. nocreonoes that initiates overall. When refining child of each Ek_1 Imked /ists correctly 7.tgunsequencmethod process requires it takes much less lon~er walking to IS gene I.20. assuming sequencing the . a technique walking chromosome used in some DNA sequencing or gene an O(ri'J-lime The primer idea. The other identify the Ek+1 classes. a primer complementary to the last nlna nucleotidesjust a stflng comammqme nucreotces. as few ~s nine rune finds It.1 . We discuss in the location of the Cys- Ek. Crochemore consroer of thiS problem using lemma errors. the same way.!). from one. to the point of the pn mer. Do 'It ing.1.lnfact. generaliza~ion 12. a small you have to know two things can only accurately at almost Now SUppose a number. string of DNA. T~~~~~. in the same EHI class. current of nucleotldes. the largest used as refiners child of each E'"I class is skipped.14). then Ip'j to es. classes ~ewly mdetail.79 different directions a version and solved version 'In esof the that By using veloped Fact 1 in place in Exercises sorted of Fact 3. at that the conditions the replication pnmer IS of part of the original string 10 the right of ine primer site. Then replicate ata at the very start 01 the string. There pens larger is a common if the string string? Then and II IfJ I first number Show 61.20. some known want to find a nine. next 300 After 300 (say) nucleotides. if there (i. Ek7-1 class. end of a longer of a DNA string called polymerase molecular starting of the nucleotldes. p EHI class class. Being produce Theore~ resulting that each say. selection problem iin and show how to solve it efficiently stnnq " find the s hortestsubstringthat USiflg a shortest suffix trees. Now provide the bound all the implementation pm-tnples details to find all the pm-triples as a byproduct of the algorithm in this primer Ihe lofa/size of all Ihe classes is at most n log2 n and continue.(lmp- ing method of O(nlognJ sequencing sequence generally are two pm-trlples. the known would appear sinanotherplaceinthe pes itionandanysequence to the left of to 60. to replicate a small impact number the complete laboratory nucleotide methods sequence about of a long existing sestrl~g both be in the same and moved Ek class. 7. To understand technology. That would limit Ihe number would of pm-frlples with the same consisting n.1. show order.classesexceptone. and the O(n log n) bound belief is false in tne correct by example selection that the folklore problem over so~e determined our current the primer from that point Since we know the sequence point. sequenced Compare sequential process.. beginsatiandthatappearsnowhereelseinaorS.Jt that allow with the longer 7.2. suppose qare placed all E. (Usually PCR is done With two one for each end. exactly Then the one (arbitrary) is skipped (i. end of the lastdatermined can then nuclectldes Problem: Formalize More will resequence that does not appear and so sequence suftlxtree_ltoflndashorteststnng 1: that is a subslring in noneofthestringsofS The result will be that the next su bstringsequenced run in time proportional of the. problem Now the problem to the sum of the lengths string nucleotides is to . was used extensively 7.) a method to sequence used.flnd the shortest becomes S that is longer a string" substring and IS not a substring as a substring of any string of S. primers. Primer IfJl. We then one of them (at least) is that by Fact 1. To location 58. for each position (If any) thaI does not appear e generally. Let p and q be two indices not together in its proper that are together gene on human here only the DNA sequencmq that if pand has already + 1 cannot q p p+ 1 and p In DNA sequencing.7._1 class to skip.20. substring sequence to create the longer string gets sequenced using the end of each of the next substring.5 we will is the exact ~atchlng Just the powers of each class 10 obtain of two) m O(ri) in increasing algorithm time. Ifitdoes. used as refiner s a refiner either has been marked E.) problem the primer with the above chromosome walking approach. Often.e. refining This complementary string string can be used to create This creates To do that we have to repeat Ek to creafe class EH1. tic Fibrosis application the idea.bulhereonlyone"variable"primeris in this discussion.we first that the child of an £. EXERCiSES . 2: 21fJI.theboundof3nlog2nisfairIy way.20.a shortest substring asasubstnngofanystringins (il any} that begins and does not appear . then the first nine nucleotides synthesize Hence time. nonalgorilhmic proof is possible. biology. sequencing the goal is to determine the application common First.Sway hybrldlzes refiners. bv successiveiy sequenc- sequencnq but because dtolhesho.20. in Seclion it is an 16. one of finding at position of any string in S.. compute th. Ihen Cas linked E.1 allows complete freedom in choosing fixed and can be ignored The above know two facts Prove Th. This leads to the following: 7. lists plus What remains of the original linked lists for E. the complement to those which then able 10 skip one class the stated bound.n~'t~~:' we established But a direct. then is held in a linked from its current With alilhe that detail. method de- The above sentially primar allows problems can be generalized (In Section in many 54 and all the Ek partitions that method.20. The reason already have been used as refiners in some Ek class. Explain on experimental Conclude that one Ek class while need not be used as a reliner E. how to compute Next.2. To find all the pm-triples introduce to create claim q or E'+1 from Men in O(nlO9 [114) used one additional in "chromosome problems.Mor iofa S. suppose that for every k > 1. number it is possible quenca Second. the idea ona but it isn'! enough larger scale to a str'lng that is complementary a "primer". stl II using the idea ot successive easy to obtain (to be discussed a long DNA string. not used as a refiner). to the left of that pomt chain reaction Knowing (PCRl. Ekr1 list and that when linked if the algorithm are correctly identifies all the Ek this from 300 to 500. One particular selection problem. extend the reverse details refinement to maintain S. classes represented is done using a technology one 10 synthesize long string containing point. What hapot the last nine nucleotides may nOI hybridize be incorrect. Folklore .eorem which Note that Theorem suggest a long string of DNA.has it that for any position > iin $.1.P''!').lenglh~ubstnngnear!he anywher e earlier.I selection problem Chromosome arises frequently in molecular biology. this primer generally. substring ing300 ~rr. in a string for all k (not the indices 7.p.

be in the standard repertoire of string packages '" . the 01 keeping in a certain range.1. v of twa nodes . in a suffix of leaves tree are identical of the string if the paths frum the root. =of "folk" . So a more general to the left of" known string balance repetitive of length not e euostrlnc largerlhan and placing Problem: is the following: of known is known). for I 8 Constant-Time Lowest Common Ancestor Retrieval yet know the sequence. the Origin . . is an ancestor lowes/ common ancestor (lea) of both .literature. will destroylhe conditions[134. Probe selection of Ihe primer setecncn pieces problem (discussed a primer is the hybridization in Chapter probe selection problem.. stringent This is because whan desiqnmq a single certam it is desirable hybrization.180 FIRST APPLICATIONS OF SUFFIX TREES 62. only to resurface site.b~~wee~ 1 In some ways.1.(Cancer. instead oligomers of the target DNA. called walk to a crawl. It shoulddefinilely low-overhead IS simple and very fast in practice. to v. The genome was noThey were guaranteed to slow any the starting contains are known. Weinberga anyone Theywerelikequicksand.not practical.r and nine nucleotides the constraints lor other T. So one must not appear hybridize length fJ or S_ However. find the furthest Definition pathfram the root to refers to an ancestor In a rooted tree node in Tthat in Figure that is not ot s or any string in set S. larger than (th e part 01 the long string in" nine that is a primer much and a set S of strings right subslring in is no such string. primer problem died wih these sinkholes. ! most comcause On the problem that repea ted substrings R. as a black is only in the string and th~re is certainly method. !he find such a longest Some result into The usually Vishkin B Racing to Ihe Beginning of the R~"J. That to review known. contamination olthe target for PCR but to extract some information with the vector the target DNA.r and y of nodes 6 and 10 is node y and 3 it as far right as possible this version 01 the primer selection problem and show how to apply 8. probes mismatch selection the primer problem even unosr is 10 the exact matching Jor mappmq. of the vector. What do ancestors have to do with strings? The when suinss prelix constant-time applying search the result result is probl~ms.eotor ability to i and j will identifies common box.177]. The purpose efforts. then suffix i and suffix} the path to o.astringp whose sequence DNA strings).1.makingitunique.ghtof the sequenced eukaryotic monly chromosome parto! the string is more difficult that should regions writes' on them would be sucked in and then propelled. whose sequences DNA of interest.worofv Definition is the deepest For example. 7_11.1 the Formalize lea 5 while the lea of6 suffix trees to it. some vast subterranean away from "repeated sequences. is widely chapter can be read by taking !ca is if the reader of the diving details of the lea lea box result the details are not. If there nine that does long may falsely the primer then we might see ka reasons. and feasible to design probes probe s so that ca noe creetec commonanceslorqueriesml1stbc to string note labeling down and common consisting ancestors. that to a node To get a feel for the connection two leaves . Introduction We now suffix begin the discussion of an amazing other result that greatly extends the usefulness of trees (in addition to many applications). is a general n_ot which that the result theoretical. somewhere chromosomal treading miles like Alice in Wonderland. techniques discussed Such in this chapter. and the Ihere are some known sequences DNA frequently subslrings walking. the longest prefix will I and j." 8. In tbe prirner selectlon problem.bellef the Schieberto program _ it is a very practical. butarefrequentlylound in the cloned 8. Hannony Books. The result. problems th~ examples before be detailed in Chapter prefers result. In such mapping by vector DNA is common. occurring As discuss ed in Section substrings. Hence the lo~'est common prefix of suffixes primitive motivating but and} share a common commcn anc. the goal of avoiding incorrect hybridizationstolher. tunnel sy stem. This in many stnng. 19% of these as a black the technical statement taken method be an important 9. of repeated fl Still.1. the probe in which case the oligo probe may hybridize One approach to this problem the primer is a better are rarely in the genome This is precisely problem is to use specifically (or probe) fit than selection designed problem.through else in the genome. Given a substring sequence a 01 300 nucleotldes (the common oflenglh parts of (Ihelaslsubstringsequenced). ance. since we don" be avoided. In A variant DNA fingerprinting which oligomers(short of the hybridization and mapping is not to create t6) there is frequent need to see about DNA DNA of DNA) hybridize and fingerprinting to some target piece of DNA. The Seorch ro.

Thus for example. path numbers have another well-known description . That is. it is not trivial to understand at first and has been described as based on "bit magic". Counting from the left-most bit. So assume that Ica(i. number the root. called its path number. then this can be detected by a very simple constant-time algorithm.) We will refer to nodes in B by their path numbers As the tree in Figure 8.that of inordcr numbers. The notation B will refer to this complete binary tree.4.1: A general tree T w~h nodes named by a depth-first numbering However. we must be clear on what computational mode! is being used. we want to find Ica(iJJ in B (remembering that both i and j are path numbers). a path that goes left twice. as long as the numbers have only O(1og n) bits. The lea result (that lea queries canbeansweredinconstanttimeafterlinear-timepreprocessing)can be proved in this unitcost RAM model. we assume that the AND. read. and that the position of the left-most or right-most l-bit in a binary number can be found in constant time.e. the exposition is easier if we assume that certain additional bit. and with several high-level programming languages. or divided in constant time. if d == 6. Complete binary trees: a very simple case We begin the discussion of the lea result on a particularly simple tree whose nodes are named in a very special way. or used as an address in constant time. although the method is easy to program.2 for an additional example. we will explain how to achieve the same results using only the standard unit-cost RAM model that encode paths to them. Suppose that B is a rooted complete binary tree with p leaves (n == 2p . and then recursively number the right child). and XOR (exclusive or) of two (appropriate sized) binary numbers can be done in constant time. We hope the following exposition is a significant step toward making the method more widely understood 8.6 3 5 7 011 101 111 001 Figure 8. I For example. Otherwise we may he accused of cheating . The path number concept is preferred since it explicitly relates the number of a node to the description of the path to it from the root. In fact. discussed in Exercise 3. On many machines. and the XOR of two d + 1 bit numbers is obtained by independenny taking the XOR of . that a detailed discussion of the result is worthwhile. The assumed machine model Because constant retrieval time is such a demanding goal (and even linear preprocessing time requires careful attention to detail). again assuming that the numbers involved are never more than O(logn) bits"long. forbidding "large" numbers from being manipulated in constant time. the node with path bits 00 10 is named by the 7-bit number 001 0 I00. However. when the nodes of B are numbered by an inorder traversal (recursively number the left child. (See Figure 8. and a 1 indicates a right child. 8. Numbers that take more than (-j(\ogn) bits to represent cannot be operated on in constant time. that a "mask" of consecutive I-bits can be created in constant time. we certainly can allow any number with up to O(logn) bits to be written. Each path number is then padded out to d + 1bits by adding a Ito the right of the path hits followed by as many additional Os as needed to make d + 1 bits. In particular. multiplied. that encodes the unique path frum the root to u. the resulting node numbers are exactly the path numbers discussed above We leave the proof of this for the reader (it has little significance in our exposition).2: A binary tree wrth tour leaves. one of these two nodes is an ancestor of the other). the tree is complete and all leaves are at the same depth from the root. when Ica(iJ) is either i or j (i.leve[operationscan also be done in constant time. What primitive operations are permitted in constant time? In the unit-cost RAM model when the input tree has n nodes. These are standard unit-cost requirements.. so that every internal node has exactly two children and the number of edges on the path from the root 10 any leaf in B is d = log2 p. added. OR. The algorithm begins by taking the exclusive or (XOR) of the binary number for i and the binary number for i. these are reasonable assumptions. The bits that describe the path are called path bits.using an overly powerful set of primitive operations.2 suggests.1 nodes in total). after we explain the lea result using these more liberal assumptions. we let lca(iJJ denote the least common ancestor 8. The root node for d == 6 would be 1000000.3. Each node v of B is assigned a d + 1 bit number. The tree is a complete binary tree whose nodes have names Given two nodes i and j. Also. the ith bit of the path number for u corresponds to the ith edge on the path from the root to u: A 0 for the ith bit from the left indicates that the ith edge on the path goes to a left child. and then left again ends at a node whose path number begins (on the left) with 00 IO. That is. denoting the result by xij' The XOR oftwo bits is 1 if and only if the two hits are different.4_ How to solve lea queries in B Definition ofiandj For any two nodes i and j. First. HOW TO SOLVE LeA QUERIES IN B 183 1 . the root node always has a number with left bit 1 followed by d Os.182 CONSTANT-TIME LOWEST COMMON ANCESTOR RETRIEVAL 8. the result has been so heavily applied in many diverse string methods.2. The path numbers are written both in binary am! in base ten Figure 8. subtracted. and T will refer to an arbitrary tree. right once.j) is neither i nor i. But for the purists. two numbers with up to O(logn) bits can be compared. and its use is so critical in those methods. that a binary number can be shifted (left or right) hy up to O(\ogn) bits in constant time. Nonetheless.

and hence 100. Be careful not to confuse depth-first numbers used for the general rree T with path numbers used only for the binary tree B FIgure 8.I edges and then diverge. but it is presented both to develop intuition and because complete binary trees are used in the description of the general case. Since i and j are O(logn) bits long. a very elegant and relatively simple algorithm can answer lea queries in constant time.the right edge out of the root. The /ca algorithm we will present for general trees builds on the case of a complete binary tree. In summary. First steps in mapping T to B FIgure8. That IS.3: A binary tree with four leaves. if (~)(n logn) time is allowed for preprocessing T and ern logn) space is available after the preprocessing. That method is explored in Exercise 12.6 HJO I I I 011 001 standard depth-first numbering (preorder numbering) of nodes (see Figure 8. to iltustrate the definit ionef /(v). node k in B.5. in Figure 8.184 CONSTANT-TIME LOWEST COMMON ANCESTOR RETRIEVAL 8. fiRST STEPS IN MAPPING T TO B 185 I . the root. so their respective paths share one edge .4: uno 10 Node numbers given in four-bit binary. h(k) equal s (he height of 1000) is at height 4.1.1 bits of i and j are the same. If there are q nodes In the s~btree rooted ar u. For any node k (node with path number k) in B. beginmng With the number for u. numbering the nodes of u in the order that they are first encountered in the traversal. 5 0011 0101 tOOO 0110 6 0111 8. and v gets numbered k.first manner. With this numbering scheme.1 '. we have This simple case of a complete binary tree is very special. then the left most k . We first describe the generallca algorithm assuming that the T to B mapping is explicitly used. and the pcsltlcn of the leasl signilicanl i-bit is given in base ten each bit of the tWOnumbers. This is the . the XORof lDl and 111 (nodes 5 and 7) is OlD. Moreover. XOR of 00101 and 10011 is 10110.cutlve .k zeros. XOR is a constant-time operation in our model The algorithm next finds the most significant (left-most) I-bit in xu. That is.j) consists of the left most k . and the paths to i and j must agree for the first k . It follows that the path number for lea! i. then the numbers given to the other nodes lTI the subtreearek+ Ithroughk+q-I For convenience. The first thing the preprocessing does is traverse T in a depth. so the paths to 2 and 5 have no agreement.depthfirst numbers. For example. If the left most I-bit in the XOR of i and j is in position k (counting frum the left). to is both a node and a number.I bits of i (or j) followed by a Lbit followed by d + 1 .2. and then we explain how explicit mapping can be avoided. is their lowest common ancestor Lemma 8. For example. The XOR ofOlO and 101 (nodes 2 and 5) is 111. and the path from it to a leaf has four found in constant time in B. when we refer to node u. t~e no~es in the subtree of any n_o~ L' in T have conse. The idea (conceptually) is to map the nodes of a general tree T to the nodes of a complete binary tree B in such a way that lea retrievals on B will help to quickly solve lea queries on T. from this point on the nodes in T will be referred to by theirdepth-fiTht numbers. The palh numbers are in binary. by actually using complete binary trees.5.5.

- Biological Sequence Analysis
- Understanding Bioinformatics
- Durbin Et Al - Biological Sequence Analysis (CUP 2002) No OCR
- String Algorithms
- RNA-Seq Analysis Course
- 2Q8F6guU jewels of stringology.pdf
- Art of Computer Programming, The - Volume 04a - Combinatorial Algorithms Part 1 (2011)(1st)(Donald E. Knuth)
- Algorithms on String Trees and Sequences-libre
- python for biologists
- Text Algorithms
- Discovering Genomics, Proteomics, And Bioinformatics a Malcolm Campbell Laurie J Heyer 2003
- Introduction to Bioinformatics Algorithms
- Problems and Solutions in Biological Sequence Analysis - Borodovsky and Ekisheva (2006)
- (Chapman & Hall_CRC Mathematical and Computational Biology Series (Unnumbered)) Eija Korpelainen_ Jarno Tuimala_ Panu Somervuo_ Mikael Huss_ Garry Wong-RNA-seq Data Analysis_ a Practical Approach-CRC
- Secure Data Storage In Cloud Computing
- Visual Guidebook Data Visualization
- Bioinformatics - Managing Scientific Data
- Advanced Molecular Biology
- Dynamic Programming and Optimal Control, Vol I
- Dynamic Programming and Optimal Control
- David Object Oriented Design
- Grimmett g.r., Stirzaker d.r. Probability and Random Processes (3ed., Oxford, 2001)(1)
- 7160412 Crystal Report Interview Questions
- Algorithm Design
- Bioinformatics for Biologists.pdf
- Python Programming for Biology_ - Tim J. Stevens

Sign up to vote on this title

UsefulNot usefulClose Dialog## Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

Close Dialog## This title now requires a credit

Use one of your book credits to continue reading from where you left off, or restart the preview.

Loading