
Approximate String Matching and Text Retrieval

Ricardo Baeza-Yates
Center for Web Research
www.cwr.cl
Depto. de Ciencias de la Computación
Universidad de Chile
Santiago, CHILE
rbaeza@dcc.uchile.cl

Based on surveys by Baeza-Yates [3], Baeza-Yates [5], Navarro [19]
and Navarro et al. [23], and own work and views.

Outline

- Theory vs. Practice
- Problem
- String searching
- From automata to algorithms
- Filtering
- Indices
- ASM with Indices
- Concluding remarks
User's point of view

[Figure: the user poses a query over a text. User-defined text
normalization and user-defined index points and structure feed an
indexing algorithm that builds the index; a searching algorithm then
uses the index to map the query to the answer, all through the user
interface.]

- Tools vs. Intelligence
- Applications to other areas:
  Web retrieval, XML processing, NL processing, text mining,
  multimedia search, bioinformatics, signal processing, ...

Theory vs. Practice

How can we measure the goodness of an algorithm?

- Asymptotic worst case behavior
- Asymptotic average case behavior
- Practical behavior

D. Knuth [IFIP'89 invited speech]:

- Balance between theory and practice
- Software is hard
- The best theory is inspired by practice
- The best practice is inspired by theory
Problem

- Σ: a finite alphabet of size σ (σ is considered bounded)
- Text: T = t_1 t_2 ... t_n, of length n
- Pattern: P = p_1 p_2 ... p_m, of length m (m ≤ n)
- Problem: find all occurrences of P in T

Search Models

1. P is a word (depends on the language)
2. P is any sequence starting in an index-point

Some data structures assume the first model.

Answer Models

- Exact match
- Approximate match (a distance function is needed)
- Closest match, or all matches at a certain distance

Computation Models

- Space complexity: the extra space used for the search (the index)
  – RAM model: words of Θ(log n) bits
- Time complexity: the time needed to find the pattern, or an
  equivalent measure (for example, comparisons)
  – Worst case
  – Average case (uniform text and pattern)
- Cost measures: text-pattern comparisons, arithmetical/bitwise
  operations
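As a concrete baseline for the exact-match problem, a minimal
brute-force sketch (an illustration added here, not from the original
slides): it compares the pattern against every text position, O(mn)
in the worst case.

#include <stdio.h>
#include <string.h>

// Report every occurrence of pat[0..m-1] in text[0..n-1].
void brute_force(const char *text, int n, const char *pat, int m)
{
    for (int k = 0; k + m <= n; k++) {
        int j = 0;
        while (j < m && text[k + j] == pat[j])
            j++;
        if (j == m)
            printf("match at %d\n", k);
    }
}

int main(void)
{
    const char *t = "this is a text example";
    brute_force(t, (int)strlen(t), "text", 4);   // prints: match at 10
    return 0;
}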
Algorithmic point of view

Input data:

- Raw pattern and text
  – sequential, on-line, real-time algorithms
- Preprocessing of the pattern
  – the pattern is known in advance
- Preprocessing of the text (an index)
  – Inverted index
  – Suffix trees (tries, Patricia trees, ...)
  – Suffix arrays
  – Based on q-grams
  – Automata: DAWGs, suffix based
- Hybrid solutions
  – Filtering or Filtration
  – Two Level TR

String-Matching Space-Time Trade-Offs

[Figure: algorithms placed on a space complexity vs. time complexity
plane. Indexed search (tries, suffix arrays, Patricia trees, inverted
files, signature files) trades more space for less time; sequential
search (Boyer-Moore like algorithms, KMP, Shift-or, brute force) uses
little space but more time; hybrid solutions (two-level TR, RAC) sit
in between.]
String Matching: Definition

Basic problem: find the exact occurrences of a pattern in a text.

Variations:

– Allow mismatches (Hamming distance)
– Allow insertions (episode distance, not symmetric)
– Allow insertions and deletions (LCS distance)
– Allow mismatches, insertions and deletions (edit distance)
– Language dependent measures: phonetic, morphemes, etc.

Example: in the text "This is a text example ...", the pattern "text"
occurs exactly once, and "ext" occurs inside that occurrence.

Software examples: the grep command in Unix (sequential) or Google on
the Web (index based).

String Matching Complexity

n: size of the text
m: size of the pattern

Raw text:

– Worst case: Θ(n) comparisons (tight lower and upper bounds)
– Average case: Θ(n log_σ(m) / m) (lower and upper bound)
– ASM: O(kn) worst case, O(n (k + log_σ m) / m) average case

Preprocessed text:

– Index construction: O(n) time and space (finite alphabet)
– Search: O(m) comparisons in the worst case
– ASM: several results, still open
Classical Algorithms

- Knuth-Morris-Pratt
  [Figure: the pattern x is aligned with the text; after a mismatch,
  the matched prefix y determines how far the pattern can be shifted.]
- Boyer-Moore
  – Match heuristic
    [Figure: the pattern is compared right to left; the matched
    suffix y determines the shift.]
  – Occurrence heuristic
    [Figure: the text character y that caused the mismatch determines
    the shift.]
  – The match heuristic defines the BM automata

String Searching: Historical View

[Timeline figure, theory on the left, practice on the right:
1970  Knuth-Morris-Pratt, Fischer-Paterson; Boyer-Moore
1980  Karp-Rabin, Rytter, Galil; Horspool
1986  Apostolico-Giancarlo; Abrahamson
1988  Baeza-Yates; Baeza-Yates/Gonnet
1990  Cole, Choffrut, Colussi-Galil-Giancarlo, Crochemore-Perrin,
      Regnier; Sunday, Hume-Sunday, Wu-Manber, Baeza-Yates/Gonnet
1992  Cole-Hariharan; Baeza-Yates/Perleberg]
Knuth-Morris-Pratt Algorithm

- A fascinating story... from theory and practice
- Preprocessing: the next table, where next[j] tells where to resume
  the comparison after a mismatch at pattern position j
- Example:

            a b r a c a d a b r a
  next[j]   0 1 1 0 2 0 2 0 1 1 0 5

- Worst case complexity: O(n + m) (at most about 2n comparisons)
- Extension to multiple patterns: Aho-Corasick

Algorithm

search( text, n, pat, m )   // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
{
    int next[MAX_PATTERN_SIZE];

    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;
    kmp( pat, m+1, pat, m+1, next );   // Preprocess pattern
    kmp( text, n, pat, m, next );      // Search text
    pat[m+1] = END_OF_STRING;
}

// One routine does both jobs: the first call (dosearch == 0) fills
// next[] by matching the pattern against itself; later calls search.
kmp( text, n, pat, m, next )
char text[], pat[];
int n, m, next[];
{
    static int dosearch = 0;
    int i, j;

    i = 1;
    if( !dosearch ) j = next[1] = 0;   // Preprocessing
    else j = 1;
    do {
        if( j == 0 || text[i] == pat[j] )
        {
            i++; j++;
            if( !dosearch ) {          // Preprocessing
                if( text[i] != pat[j] ) next[i] = j;
                else next[i] = next[j];
            }
        }
        else j = next[j];
        if( dosearch && j > m ) {      // Search
            Report_match_at_position( i-m );
            j = next[m+1];
        }
    } while( i <= n );
    dosearch = 1;
}
Boyer-Moore-Horspool-Sunday Algorithm

- The match heuristic can be extended: BM automata, suffix automata
- In practice the occurrence heuristic is the key issue: the shift
  table d is indexed by the text character just after the current
  window; d[x] = m + 1 - (rightmost position of x in the pattern),
  or m + 1 if x does not occur in the pattern

search( text, n, pat, m )   // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
{
    int d[MAX_ALPHABET_SIZE], i, j, k, lim;

    // Preprocessing
    for( k=0; k<MAX_ALPHABET_SIZE; k++ )
        d[k] = m+1;
    for( k=1; k<=m; k++ )
        d[pat[k]] = m+1-k;

    // Search
    lim = n-m+1;
    for( k=1; k <= lim; k += d[text[k+m]] )
    {
        i = k;   // Could compare in an optimal order
        for( j=1; j<=m && text[i] == pat[j]; j++ )
            i++;
        if( j == m+1 )
            Report_match_at_position( k );
    }
}

- Complexity ranges between O(n / (m+1)) and O(mn)
- O(σ) extra space
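For reference, a self-contained 0-based rendering of the same Sunday
shift (an added sketch, assuming 8-bit characters):

#include <stdio.h>
#include <string.h>

void sunday(const char *text, int n, const char *pat, int m)
{
    int d[256];

    for (int c = 0; c < 256; c++) d[c] = m + 1;   // default shift
    for (int j = 0; j < m; j++)
        d[(unsigned char)pat[j]] = m - j;         // rightmost occurrence wins

    for (int k = 0; k + m <= n; ) {
        if (memcmp(text + k, pat, m) == 0)
            printf("match at %d\n", k);
        if (k + m >= n) break;                    // no character past window
        k += d[(unsigned char)text[k + m]];
    }
}

int main(void)
{
    const char *t = "this is a text example";
    sunday(t, (int)strlen(t), "text", 4);         // prints: match at 10
    return 0;
}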

Counting: Baeza-Yates/Perleberg, 1992

- A simple example of filtering
- Idea: count the number of matching characters for all possible
  positions of the pattern
- Straight implementation: a brute force algorithm with O(mn) worst
  and average case time
- Improvements:
  – preprocess the pattern, computing which characters of the
    alphabet should update a counter
  – we need only the last m counters

Example

  Pattern = than
  Text    = this is an example that
  Count   = 20000020000100000003

Each step is:

  For all j such that pattern[j] = text[i]
      increment count[i-j+1]

Code (here the counters start at m and count down; a position with at
most k mismatches is reported when its counter is <= k):

for (i=0; i<n; i++) {
    if ((off1=(aptr=&alpha[c=*t++])->offset) >= 0) {
        count[(i+off1)&MOD256]--;
        for (aptr=aptr->next; aptr!=NULL; aptr=aptr->next)
            count[(i+aptr->offset)&MOD256]--;
    }
    if (count[i&MOD256] <= k) printf("%d",count[i&MOD256]);
    count[i&MOD256] = m;
}

Running time

- Total cost is O(n + S), where S is the number of text-pattern
  symbol matches
- On average S = mn/σ, so the cost is O(n(1 + m/σ))
- The cost is independent of the number of mismatches
- Not suitable for small σ (e.g. DNA)
Bit Parallelism: Baeza-Yates/Gonnet, 1989 [2]

- Can be seen as a parallel algorithm using m processors
- Processor j outputs 1 if pat[1..j] equals the last j text
  characters read, and 0 otherwise
- The output of processor m signals the occurrences

Example: searching "text"

[Figure: four processors, one per pattern prefix (t, te, tex, text),
all reading the current text character; over the text
"this is a text" the output of the last processor becomes 1 exactly
where "text" ends.]
Bit sequence simulation

- One bit per processor: the whole simulation works on a bit vector!
- For finite alphabets, all possible comparisons can be precomputed
  before the search in a table T indexed by character
- In the example (pattern "text"; 1 = match at that position):

           T[t]  T[e]  T[x]  T[*]
      t      1     0     0     0
      e      0     1     0     0
      x      0     0     1     0
      t      1     0     0     0

- The resulting shift-and/or algorithm appeared in the Handbook of
  Algorithms and Data Structures, 2nd ed., 1991

Complexity

For the uniform cost RAM model (word size w), we have:

- Preprocessing time: O(m + σ)
- Search time: O(n) for m ≤ w, O(n ⌈m/w⌉) in general
- Space needed: O(σ) words

Code (shift-or convention, 0 = match; B bits per position, B = 1 for
exact searching):

// Preprocessing
for( i=0; i<MAXSYM; i++ ) T[i] = ~0;
for( lim=0, j=1; *pattern != EOS; lim |= j, j <<= B, pattern++ )
    T[*pattern] &= ~j;
lim = ~(lim >> B);

// Search
matches = 0; state = ~0;   // Initial state
for( ; *text != EOS; text++ )
{
    state = (state << B) | T[*text];   // Next state
    if( state < lim )
        matches++;   // Match at current position - len(pattern) + 1
}
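A self-contained shift-and variant (1 = match convention), added here
as an illustration; it assumes m ≤ 64 and 8-bit characters:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

void shift_and(const char *text, const char *pat)
{
    int m = (int)strlen(pat);
    uint64_t T[256] = {0}, state = 0, goal = 1ULL << (m - 1);

    for (int j = 0; j < m; j++)                   // bit j of T[c] is set
        T[(unsigned char)pat[j]] |= 1ULL << j;    // iff pat[j] == c

    for (int i = 0; text[i]; i++) {
        state = ((state << 1) | 1) & T[(unsigned char)text[i]];
        if (state & goal)                         // bit m-1 set: full match
            printf("match at %d\n", i - m + 1);
    }
}

int main(void)
{
    shift_and("this is a text", "text");          // prints: match at 10
    return 0;
}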
Extensions

- Every pattern element can be a class of symbols: just change T!
- Don't care symbols in the text: a text character that matches
  everywhere simply gets an all-match entry in T
- Multiple patterns: just one longer bit sequence (the masks are
  concatenated)
- Mismatches: count the number of mismatches instead of using &,
  with O(log k) bits per position plus an overflow bit (k is the
  maximal number of mismatches allowed)
- Insertions and deletions [Wu & Manber, 1991]: k+1 bit sequences
- Agrep: was the fastest approximate search tool for Unix, now
  nrgrep; the bit-wise approach to DP [18] is the fastest for long
  strings

Approximate String Matching: Dynamic Programming

C[i][j]: minimum number of errors to match p_1..p_i to a suffix of
t_1..t_j

  C[i][0] = i,   C[0][j] = 0
  C[i][j] = C[i-1][j-1]                                  if p_i = t_j
            1 + min(C[i-1][j-1], C[i-1][j], C[i][j-1])   otherwise

A match with at most k errors ends at text position j whenever
C[m][j] ≤ k.

Example: searching "survey" in "surgery" (with k = 2, matches end at
the last three text positions):

        s  u  r  g  e  r  y
     0  0  0  0  0  0  0  0
  s  1  0  1  1  1  1  1  1
  u  2  1  0  1  2  2  2  2
  r  3  2  1  0  1  2  2  3
  v  4  3  2  1  1  2  3  3
  e  5  4  3  2  2  1  2  3
  y  6  5  4  3  3  2  2  2
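A compact rendering of this recurrence, added for concreteness: one
column of C is kept and updated per text character (it assumes
m < 64).

#include <stdio.h>
#include <string.h>

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

// Report every text position where pat matches with <= k errors.
void dp_search(const char *text, const char *pat, int k)
{
    int m = (int)strlen(pat), C[64], prev, old;

    for (int i = 0; i <= m; i++)
        C[i] = i;                          // column for j = 0
    for (int j = 1; text[j-1] != '\0'; j++) {
        prev = C[0];                       // C[0][j-1]
        C[0] = 0;                          // a match may start anywhere
        for (int i = 1; i <= m; i++) {
            old = C[i];                    // C[i][j-1]
            C[i] = (pat[i-1] == text[j-1])
                 ? prev
                 : 1 + min3(prev, old, C[i-1]);
            prev = old;
        }
        if (C[m] <= k)
            printf("match ending at %d\n", j);
    }
}

int main(void)
{
    dp_search("surgery", "survey", 2);     // ends at positions 5, 6, 7
    return 0;
}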
From Automata to Bit-parallelism

- Exploit the automaton structure
- Consider the NFA to search for "text"
- Processors in the bit-parallel simulation correspond one to one
  with active states of the standard simulation
- Be careful with the ε-closure
- Related to hardware implementations
- Based on Baeza-Yates [5]

Approximate string searching

Consider the NFA for searching "text" with at most k errors:

[Figure: one row of states per error level (no errors, 1 error,
2 errors). Horizontal arrows consume a matching pattern character;
vertical and diagonal arrows move down one error level, modeling
insertions, substitutions and deletions.]

Longest positional match wanted?
Horizontal bit parallelism: Wu & Manber Vertical bit parallelism:


t e x t


t e x t no errors
no errors



t e x t
t e x t 1 error
1 error



t e x t
t e x t 2 errors
2 errors

Key information: highest (smallest error) active state per column


State of the search: numbers on the range .
















































Initially ( ones)


























Initially and






Drawback: Dependency on


Related to dynamic programming:


Complexity: search time






Longest common subsequence, string editing



space





Related to Ukkonen’s automata approach

27 28
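A runnable sketch of the horizontal simulation (an added illustration,
shift-and convention; it assumes m ≤ 64 and k ≤ 8):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MAXK 8

void wu_manber(const char *text, const char *pat, int k)
{
    int m = (int)strlen(pat);
    uint64_t B[256] = {0}, R[MAXK + 1], old, prev, goal = 1ULL << (m - 1);

    for (int j = 0; j < m; j++)
        B[(unsigned char)pat[j]] |= 1ULL << j;
    for (int i = 0; i <= k; i++)         // R_i starts with i low bits set
        R[i] = (1ULL << i) - 1;

    for (int pos = 0; text[pos]; pos++) {
        uint64_t c = B[(unsigned char)text[pos]];
        prev = R[0];                     // old R_0
        R[0] = ((R[0] << 1) | 1) & c;    // exact level, plain shift-and
        for (int i = 1; i <= k; i++) {
            old = R[i];
            // match | insertion | substitution and deletion | start bit
            R[i] = ((old << 1) & c) | prev | ((prev | R[i-1]) << 1) | 1;
            prev = old;
        }
        if (R[k] & goal)
            printf("match ending at %d\n", pos);
    }
}

int main(void)
{
    wu_manber("surgery", "survey", 2);   // ends at 0-based positions 4, 5, 6
    return 0;
}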
Diagonal bit parallelism: Baeza-Yates & Navarro [1996] ASM: Sequential Algorithms


p a t t
no errors



p a t t
1 error


p a t t
2 errors

Each diagonal represents an -closure (longest match):





















where




 





















Advantage: all s can be computed in parallel




 

Mixing this with filtration we get time





Related to simulation of DP over a suffix array (later) and bioin-


formatic applications

29 30
Dynamic Programming Automata and Bit-Parallelism

31 32
Filtering or Filtration

- Find potential matches and then apply a sequential algorithm to
  check each candidate
- Filtration can be done by a sequential scan or by an index
- There is a trade-off between the filtration effort and the
  verification effort
- There is always a maximum error ratio up to which filtration is
  useful, as for larger error levels the text areas to verify cover
  almost all the text
- Verification can be done in a hierarchical fashion

A First Lemma for Filtering

Lemma 1: Let A and B be two strings such that ed(A, B) ≤ k. Let
A = A_1 x_1 A_2 x_2 ... A_{j-1} x_{j-1} A_j, for strings A_i and x_i
and for any j ≥ 1. Then, at least j - k of the strings A_i appear in
B. Moreover, their relative distances inside B cannot differ from
those in A by more than k.

- Consider the sequence of at most k edit operations that convert A
  into B
- Each edit operation can affect at most one of the A_i's, so at
  least j - k of them must remain unaltered
- Relative distances: k edit operations cannot produce misalignments
  larger than k
Example

  A:  A1 x1 A2 x2 A3 x3 A4 x4 A5
  B:  A1 A2' A3 A4' A5

An example of Lemma 1 with j = 5 and k = 3:

- At least j - k = 2 of the A_i survive unaltered
- There are actually 3 such segments (A1, A3 and A5), because one of
  the errors appeared in some x_i
- Another possible reason could have been more than one error
  occurring in a single A_i

Filtering Algorithms

[Figure: overview of the filtering algorithms.]
Worst Case Complexity and Space

[Figure: worst case time vs. space of the indexed ASM algorithms.]

Average Case Complexity and Error Ratio

[Figure: average case time vs. the error ratio α = k/m.]
Best Algorithms

[Figure: the best algorithm per region of the parameter space.]

Data Structures

- Inverted indices permit searching for any word in the text
- Suffix trees allow searching for any substring of the text
- Suffix arrays permit the same operations but are slightly slower
- q-gram indexes allow searching for any text substring not longer
  than q
- q-sample indexes permit the same, but only for some text substrings
Inverted Indices

Idea: store all the words and their positions.

  1    6  9 11   17 19   24  28   33    40   46  50   55   60
  This is a text. A text has many words. Words are made from letters.

  Vocabulary   Occurrences
  letters      60 ...
  made         50 ...
  many         28 ...
  text         11, 19 ...
  words        33, 40 ...

- Vocabulary search: hashing, sorted array, etc.
- The granularity of the occurrences depends on what we want to
  answer: file, word or byte

Inverted Files: Space

- Vocabulary: sublinear in the text size, by Heaps' law
- Posting file: linear space (one occurrence = one pointer)
- Word distribution: Zipf's law
- Stopwords: about half the posting file
- Overall: linear space

[Figure: vocabulary size vs. text length (Heaps' law), and posting
list lengths vs. word rank (Zipf's law).]
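The usual statements of both empirical laws, for reference (the
constants are corpus dependent; the typical ranges given in the
comments are assumptions, not from the slides):

% Heaps' law: vocabulary of a text of n words (typically 0.4 <= beta <= 0.6)
V(n) = K\, n^{\beta}, \qquad 0 < \beta < 1

% Zipf's law: frequency of the i-th most frequent word (theta >= 1)
f_i \;\propto\; \frac{1}{i^{\theta}}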
Complex Patterns

- Search the vocabulary sequentially (e.g. with an ASM algorithm) and
  then do set operations on the posting lists (a sketch of one such
  operation follows the next slide)

[Figure: the query "bargain for trucks" [approximate] over six short
documents d1 ... d6 about used cars and trucks; the approximate
vocabulary search also matches the misspelled entries "bargan" and
"truc", and the posting lists of all matched words are then combined.]

Building Inverted Indices

- Process text pieces as large as possible in main memory, using a
  vocabulary trie with the occurrence list of each word
- Merge the partial indexes

[Figure: hierarchical merging. Level 1: the initial dumps I-1 ... I-8;
Level 2: I-1..2, I-3..4, I-5..6, I-7..8; Level 3: I-1..4, I-5..8;
Level 4: I-1..8, the final index.]
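The set operations reduce to merges of sorted lists. A minimal sketch
of the AND case (the document identifiers are illustrative, taken
from the figure above):

#include <stdio.h>

// Intersect two sorted posting lists of document identifiers.
int intersect(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { out[k++] = a[i]; i++; j++; }
    }
    return k;
}

int main(void)
{
    int car[]   = {1, 2, 4, 5, 6};    // documents containing "car"
    int truck[] = {1, 3, 5}, out[8];  // documents containing "truck"
    int k = intersect(car, 5, truck, 3, out);
    for (int i = 0; i < k; i++)
        printf("d%d ", out[i]);       // prints: d1 d5
    printf("\n");
    return 0;
}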
Two-level Text Retrieval: Block addressed inverted files

- Idea used in PIRS (Personal Information Retrieval System)
  [Wu and Manber 1993]
- The text is divided in 256 blocks of the same size
- An inverted file of all the different words of the text is built
- Each entry indicates only the blocks where the word appears
  (1 byte per block)
- First we search in the inverted file, next in the corresponding
  blocks using a fast sequential algorithm
- Complexity depends on the number of occurrences; locality of
  reference is important
- For large texts, empirical results show that the index requires
  less than 5% of the text size
- This idea works reasonably well up to about 200 MB

Inverted File Space in Practice

Index size as a fraction of the text (two typical figures per cell,
depending on whether stopwords are indexed):

  Index                     Small base   Medium base   Large base
                            (1 MB)       (200 MB)      (2 GB)
  Full inverted             45%   73%    36%   64%     35%   63%
  Document addressing       19%   26%    18%   32%     26%   47%
  Block (64K) addressing    27%   41%    18%   32%      5%    9%
  Block (256) addressing    18%   25%    1.7%  2.4%    0.5%  0.7%
Tries and Suffix Trees

[Figure: the suffix trie of the text below, with one leaf per suffix
position 1 ... 11.]

   1 2 3 4 5 6 7 8 9 10 11
   a b r a c a d a b r  a     Text

- Search time is optimal: O(m)
- Problem: space can be quadratic in a trie
- Compact suffix trees (Patricia trees): cut unary paths
- To remember the depth, a count is added at every node, or the
  string associated to the path is stored
- Space is now linear, O(n)
- Useful for complex queries: regular expressions in sublinear
  average time [4]

Suffix Arrays

   1  4  7  10    14     20 23 26 29 32    38   43 46  50   55        Index points
   to be at the beach or to be at work, that is the real question $   Text

  Suffixes:
   1: to be at the beach or to be at work, that is the real question
   4: be at the beach or to be at work, that is the real question
   7: at the beach or to be at work, that is the real question
  10: the beach or to be at work, that is the real question
  ...
  55: question

   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   7 29  4 26 14 43 20 55 50 38 10 46  1 23 32      Suffix Array

For the text "abracadabra" (index points at every position), the
suffix array is 11 8 1 4 6 9 2 5 7 10 3.
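Before looking at searching, a naive construction sketch for the small
example (an added illustration; the real constructions discussed below
are far more refined, this one is O(n^2 log n) in the worst case):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *T;              // text shared with the comparator

static int suf_cmp(const void *a, const void *b)
{
    return strcmp(T + *(const int *)a, T + *(const int *)b);
}

int main(void)
{
    T = "abracadabra";
    int n = (int)strlen(T), sa[64];

    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, n, sizeof sa[0], suf_cmp);
    for (int i = 0; i < n; i++)    // prints 10 7 0 3 5 8 1 4 6 9 2,
        printf("%d ", sa[i]);      // the 0-based form of the array above
    printf("\n");
    return 0;
}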
Suffix Array Search

- Every substring is a prefix of a suffix
- The prefix relation is compatible with the lexicographic order
- Hence, two binary searches are enough to obtain the suffix array
  range where all the occurrences of P appear
- Number of occurrences: the range size
- Time is logarithmic in the size of the array: O(m log n)

   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   7 29  4 26 14 43 20 55 50 38 10 46  1 23 32      Suffix Array

   to be at the beach or to be at work, that is the real question $   Text
   1  4  7  10    14     20 23 26 29 32    38   43 46  50   55        Index points

  (e.g., the binary searches for "be" and "bf" delimit the range of
  suffixes starting with "be")

Suffix Array: Construction

- In principle it is just a lexicographical sort
- But suffixes are suffixes of other suffixes, which can be exploited
- However, random access to the text is the bottleneck when the text
  does not fit in main memory
- Best solution in that case: sequential scans with counting

[Figure: a) a small text yields a small suffix array directly in main
memory; b) a long text is processed piecewise, with counters recording
how each small suffix array interleaves with the rest of the text;
c) the counters drive the merge of the small arrays into the final,
long suffix array.]

- Building time is now linear (2003!)
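The two binary searches, as a compact sketch over the "abracadabra"
example (0-based positions; strncmp stands in for the prefix
comparison):

#include <stdio.h>
#include <string.h>

// Return the number of occurrences of pat in text; *first receives
// the start of the suffix array range holding them.
int sa_search(const char *text, const int *sa, int n,
              const char *pat, int *first)
{
    int m = (int)strlen(pat), lo, hi, from;

    lo = 0; hi = n;                 // leftmost suffix >= pat
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(text + sa[mid], pat, m) < 0) lo = mid + 1;
        else hi = mid;
    }
    from = lo; hi = n;              // leftmost suffix > pat as a prefix
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(text + sa[mid], pat, m) <= 0) lo = mid + 1;
        else hi = mid;
    }
    *first = from;
    return lo - from;
}

int main(void)
{
    const char *t = "abracadabra";                  // 0-based suffix array
    int sa[11] = {10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2};
    int first, occ = sa_search(t, sa, 11, "abra", &first);
    printf("%d occurrences\n", occ);                // prints: 2
    return 0;
}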
Q-gram Indexes

In a q-gram index, every different text q-gram is stored, and all its
positions are stored in increasing text order.

   1 2 3 4 5 6 7 8 9 10 11
   a b r a c a d a b r  a     Text

  q-gram index (q = 3):
   abr  1, 8
   aca  4
   ada  6
   bra  2, 9
   cad  5
   dab  7
   rac  3

Q-Sample Indexes

In a q-sample index, only some text q-grams are stored. In this case
the samples do not overlap.

  q-samples index (q = 3, one sample every 4 positions):
   abr  1
   cad  5
   bra  9

- Useful to search long strings, using much less space
- A search for a string can take constant time on average
- Building them takes linear time

[Figure: an approximate occurrence of P must still contain enough of
the pattern q-grams / text q-samples, at roughly the right places.]
Algorithms for ASM using an Index

Search approaches:

- Neighborhood generation: generates and searches for, using an
  index, all the strings that are at distance k or less from the
  pattern
- Partitioning into exact searching: selects pattern substrings that
  must appear unaltered in any approximate occurrence, uses the index
  to search for those substrings, and checks the text areas
  surrounding them
- Assuming that the errors occur in the pattern or in the text leads
  to radically different approaches
- Intermediate partitioning: extracts substrings from the pattern
  that are searched for allowing fewer errors, using neighborhood
  generation

Current Results on this Taxonomy

                              Search Approach
  Data        Neighborhood  Partitioning into           Intermediate
  Structure   Generation    Exact Searching             Partitioning
                            Errors in    Errors in      Errors in    Errors in
                            Text         Pattern        Text         Pattern

  Suffix      Ukkonen 93    Jokinen &    Shi 96 [24]    n/a          n/a
  Tree        [29], Cobbs   Ukkonen 91
              95 [8]        [13]

  Suffix      Gonnet 88     n/a          n/a            n/a          Navarro &
  Array       [10]                                                   Baeza-Yates
                                                                     99 [21]

  Q-grams     n/a           Jokinen &    Navarro &      n/a          Myers 90
                            Ukkonen 91   Baeza-Yates                 [17]
                            [13],        97 [20]
                            Holsti &
                            Sutinen 94
                            [12]

  Q-samples   n/a           Sutinen &    n/a            Navarro      n/a
                            Tarhio 96                   et al. 2000
                            [26]                        [22]
Neighborhood Generation

The neighborhood of the pattern:

- Let U_k(P) = { P' : ed(P, P') ≤ k } be the "k-neighborhood" of P
  (it is a finite set)
- Generate U_k(P) and use an index to search for the text occurrences
  of its elements [17]
- Problem: U_k(P) is quite large; good bounds [28, 17] show that its
  size grows exponentially with k (roughly m^k σ^k [28])
- This approach works well only for small m and k

Backtracking

- Use a suffix tree or array to find U_k(P) in the text [4, 10, 29]
- Just some branches will be followed, but each branch factors out
  many text occurrences at once
- While searching we have three cases at a node x:
  a) ed(x, P) ≤ k, which means x ∈ U_k(P); we report all the leaves
     of the current subtree as answers
  b) every entry of the DP column for x exceeds k, which means that x
     is not a prefix of any string in U_k(P), and we can abandon this
     branch
  c) otherwise, we continue descending by every branch of that node;
     if we arrive at a leaf, we continue with a sequential algorithm
     on the remaining text
- Some improvements [13, 30, 8] avoid processing some redundant
  nodes, at the cost of a more complex node processing
- The same idea can be used to compare a whole text against another
  one, or against itself [6]
Example

- During the backtracking, the DP matrix can be seen as a stack that
  grows to the right: each new character on the current trie path
  pushes a new column

[Figure: the matrix for the pattern "survey" as the trie path
s-u-r-g-e is descended.]

- In the example, with k = 2 the backtracking indeed ends after
  reading "surge" (case a): ed("surge", "survey") = 2)
- With k = 1 the search would have been pruned after considering
  "surger" and "surga", since in both cases no entry of the last
  matrix column is ≤ 1

Partitioning into Exact Search: Errors in the Pattern

- We use Lemma 1 under the setting A = P and j = k + s. That is, the
  pattern is split in k + s pieces, and hence s of the pieces must
  appear unaltered inside any occurrence
- The pieces are searched, and the text areas where s of those pieces
  appear under the stated distance requirements are verified for a
  complete match
- The search time in the index is O(m) or O(m log n), but the
  checking time dominates
- The case s = 1, proposed in [20], comes with an analysis of the
  average time to check the candidates
- The case s > 1 is proposed in [24] without any analysis
- If s grows, the pieces get shorter and hence there are more matches
  to check; but on the other hand, forcing s pieces to match makes
  the filter stricter [24]. Recent results show that larger s is
  slower
- Note that, since we cannot know where the pattern pieces will be
  found in the text, all the text positions must be searchable
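A sketch of the simplest (s = 1) filter, added for concreteness: the
pattern is split into k+1 pieces and each piece is searched verbatim;
strstr stands in for the index lookup, and only the text area to be
verified is reported (the sequential verification itself is omitted).

#include <stdio.h>
#include <string.h>

void piece_filter(const char *text, const char *pat, int k)
{
    int m = (int)strlen(pat), p = k + 1;

    for (int i = 0; i < p; i++) {
        int from = i * m / p, len = (i + 1) * m / p - from;
        char piece[64];
        memcpy(piece, pat + from, len);
        piece[len] = '\0';
        // an index would be used here; strstr stands in for it
        for (const char *s = text; (s = strstr(s, piece)); s++)
            printf("piece \"%s\" at %ld: verify area [%ld, %ld]\n",
                   piece, (long)(s - text),
                   (long)(s - text) - from - k,
                   (long)(s - text) + (m - from) + k);
    }
}

int main(void)
{
    // pieces "su", "rv", "ey"; only "su" occurs, at position 10
    piece_filter("this is a surgery example", "survey", 2);
    return 0;
}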
Partitioning into Exact Search: Errors in the Text

- Assume now that the errors occur in the text, i.e., an area of T is
  an approximate occurrence of P
- We extract substrings of length q at fixed text intervals of length
  h (the q-samples)
- Those q-samples correspond to the A_i's of Lemma 1, and the spaces
  between q-samples to the x_i's
- What the lemma ensures is that, inside any occurrence of P
  containing j text q-samples, at least j - k of them appear in P, at
  about the same positions (± k)
- Now we need to ensure that any occurrence of P in T contains at
  least j text q-samples, i.e., h must be at most about
  (m - k - q + 1) / j

Search Algorithm

- At search time, all the (overlapping) pattern q-grams are extracted
  and searched for in the index of text q-samples
- When j - k pattern q-grams match in the text at the proper
  distances, the text area is verified
- This idea is presented in [26], with earlier versions in
  [13, 12, 27]
- The best value of q:
  – it should be small, to avoid a very large set of different
    q-samples
  – it should be large, to minimize the amount of verification
  – some analysis [25] shows that q of the order of log_σ m is the
    optimal value
- The best value of j? A larger j makes the filter stricter and may
  trigger fewer verifications
Intermediate Partitioning

- We filter the search by looking for pattern pieces, but those
  pieces are large and may still appear with errors in the
  occurrences
- However, they appear with fewer errors, and then we use
  neighborhood generation to search for them

Lemma 2: Let A and B be two strings such that ed(A, B) ≤ k. Let
A = A_1 x_1 A_2 x_2 ... A_{j-1} x_{j-1} A_j, for strings A_i and x_i
and for any j ≥ 1. Let k_1 ... k_j be any set of nonnegative numbers
such that Σ_i k_i ≥ k - j + 1. Then, at least one string A_i appears
with at most k_i errors in B.

Proof and Example

- The proof is easy: if every A_i needs more than k_i errors to match
  in B, then the total distance cannot be less than
  Σ_i (k_i + 1) = j + Σ_i k_i > k, a contradiction
- Note that in particular we can choose k_i = ⌊k/j⌋ for every i

  A:  A1 x1 A2 x2 A3
  B:  A1' A2' A3'

Example: let j = 3 and k = 5. At least one of the A_i's has at most
one error (in this case k_i = ⌊5/3⌋ = 1).
Intermediate Partitioning: Errors in the Pattern

Search approaches based on this method have been proposed in
[17, 21]. The algorithm is:

- Split the pattern in j pieces, for some j
- Use neighborhood generation to find the text positions where those
  pieces appear, allowing ⌊k/j⌋ errors
- For each such text position, check the surrounding text with an
  on-line algorithm

What value for j?

- In [17], the pattern is partitioned because they use a q-gram
  index, so they use the minimum j that gives short enough pieces
  (of length at most q)
- In [21] the index can search for pieces of any length, and the
  partitioning is done in order to optimize the search time
- Consider the evolution of the search time as j moves from 1
  (neighborhood generation) to k + 1 (partitioning into exact
  search):
  – we search for pieces of length about m/j with about k/j errors,
    so the error level stays about the same for the subpatterns
  – as j moves to 1, the cost to search for the neighborhood of the
    pieces grows exponentially with their length
  – as j moves to k + 1 this cost decreases, reaching plain exact
    searching when j = k + 1; so, to find the pieces, a larger j is
    better
Cost to verify the occurrences: consider a pattern that is split in j
pieces, for increasing j. Start with j = 2.

- Lemma 2 states that every occurrence of the pattern involves an
  occurrence of at least one of its two halves with ⌊k/2⌋ errors,
  although there may be occurrences of the halves that yield no
  occurrences of the pattern
- Consider now halving the halves (j = 4), so we have four pieces
  (call them "quarters"). Each occurrence of one of the halves
  involves an occurrence of at least one quarter with ⌊k/4⌋ errors,
  but there may be many quarter occurrences that yield no occurrences
  of a pattern half
- Hence, the verification cost grows from zero at j = 1 to its
  maximum at j = k + 1

Trade-off

[Figure: as j moves from neighborhood generation (j = 1) through
intermediate partitioning to partitioning into exact search
(j = k + 1), the search cost decreases while the verification cost
grows; the total cost is minimized in between.]

- In [21] we show that the optimal j is Θ((m + k) / log_σ n), that
  is, pieces of length Θ(log_σ n), yielding a time complexity of
  O(n^λ) for some 0 < λ < 1
- This is sublinear (λ < 1) for α < 1 - e/√σ, a pessimistic bound
  (and e is replaced by 1 in practice)
- The same results are obtained in [17] by setting q = Θ(log_σ n)
- The experiments in [21] show that this intermediate approach is by
  far superior to both extremes



Intermediate Partitioning: Errors in the Text

- Consider an occurrence containing a sequence of j text q-samples,
  which must be chosen at steps of h
- By Lemma 2, one of the q-samples must appear in the pattern with at
  most ⌊k/j⌋ errors
- Moreover, if every q-sample appears in its pattern block with e_i
  errors, then it must hold that Σ_i e_i ≤ k
- This method [26, 22] searches every pattern block in the index of
  q-samples using backtracking, so as to find the least number of
  errors needed to match each text q-sample inside it
- If a zone of j consecutive samples is found whose errors add up to
  at most k, the area is verified
- To allow efficient neighborhood searching, we need to limit the
  maximum error level allowed: permitting up to k errors may be too
  expensive, as every text q-sample would be considered

- So we choose a limit ℓ and search the pattern blocks permitting
  only ℓ errors. Every q-sample found with e ≤ ℓ errors changes its
  error estimation from ℓ + 1 to e; otherwise it stays at the
  optimistic bound ℓ + 1
- There is a trade-off here:
  – for a small ℓ value, the search of the ℓ-neighborhoods is
    cheaper, but as we must assume that the text q-samples not found
    have ℓ + 1 errors, some useless verifications are done
  – using larger ℓ values gives more exact estimates of the actual
    number of errors of each text q-sample, reducing useless
    verifications in exchange for a higher cost to search the
    ℓ-environments
- Optimal ℓ? In [22] it is mentioned that, as the cost of the search
  grows exponentially with ℓ, the minimal ℓ can be a good choice.
  Experimentally, this scheme tolerates higher error levels than the
  corresponding partitioning into exact search
Future

- Further study of the power of non-comparison based algorithms:
  – many new bit-based algorithms
  – problem reduction works for text searching
  – example: multiple string searching plus checking
    - two dimensional case [Baeza-Yates and Regnier, 1990]
    - approximate pattern matching [Wu and Manber, 1991]
- The final optimal algorithm depends on the input; further study of
  input adaptive algorithms?
- New uses for old concepts. Example: q-grams
- Indexing for ASM on NL text can be done better
- Approximation algorithms with worst-case performance guarantees
  [16]
- Use a metric space to search [7]
- New text indexes tailored to special cases: ASM

References

[1] A. Apostolico and Z. Galil. Combinatorial Algorithms on Words.
    Springer-Verlag, 1985.

[2] R. Baeza-Yates and G. H. Gonnet. A new approach to text
    searching. Communications of the ACM, 35:74–82, Oct 1992.

[3] R. Baeza-Yates. Text retrieval: Theory and practice. In 12th
    IFIP World Computer Congress, volume I, pages 465–476. Elsevier
    Science, 1992.

[4] R. Baeza-Yates and G. H. Gonnet. Fast text searching for regular
    expressions or automaton searching on tries. Journal of the ACM,
    43(6):915–936, Nov 1996.

[5] R. Baeza-Yates. A unified view of string matching algorithms. In
    K. Jeffery, J. Král, and M. Bartosek, editors, SOFSEM'96: Theory
    and Practice of Informatics, volume 1175 of Lecture Notes in
    Computer Science, pages 1–15, Milovy, Czech Republic, November
    1996. Springer Verlag.

[6] R. Baeza-Yates and G. Gonnet. A fast algorithm on average for
    all-against-all sequence matching. In Proc. 6th Symp. on String
    Processing and Information Retrieval (SPIRE'99). IEEE CS Press,
    1999. Previous version unpublished, Dept. of Computer Science,
    Univ. of Chile, 1990.

[7] E. Chávez and G. Navarro. A metric index for approximate string
    matching. In Proc. 5th Symp. on Latin American Theoretical
    Informatics (LATIN), 2002. Cancun, Mexico.

[8] A. Cobbs. Fast approximate matching using suffix trees. In Proc.
    6th Ann. Symp. on Combinatorial Pattern Matching (CPM'95), LNCS
    807, pages 41–54, 1995.

[9] R. Giegerich, S. Kurtz, and J. Stoye. Efficient implementation
    of lazy suffix trees. In Proc. 3rd Workshop on Algorithm
    Engineering (WAE'99), LNCS 1668, pages 30–42, 1999.

[10] G. Gonnet. A tutorial introduction to Computational Biochemistry
     using Darwin. Technical report, Informatik E.T.H., Zurich,
     Switzerland, 1992.

[11] G. Gonnet, R. Baeza-Yates, and T. Snider. Information Retrieval:
     Data Structures and Algorithms, chapter 3: New indices for text:
     Pat trees and Pat arrays, pages 66–82. Prentice-Hall, 1992.
[12] N. Holsti and E. Sutinen. Approximate string matching using
     q-gram places. In Proc. 7th Finnish Symp. on Computer Science,
     pages 23–32. Univ. of Joensuu, 1994.

[13] P. Jokinen and E. Ukkonen. Two algorithms for approximate string
     matching in static texts. In Proc. 2nd Ann. Symp. on
     Mathematical Foundations of Computer Science (MFCS'91), pages
     240–248, 1991.

[14] U. Manber and E. Myers. Suffix arrays: a new method for on-line
     string searches. SIAM J. on Computing, 22(5):935–948, 1993.

[15] E. McCreight. A space-economical suffix tree construction
     algorithm. J. of the ACM, 23(2):262–272, 1976.

[16] S. Muthukrishnan and C. Sahinalp. Approximate nearest neighbors
     and sequence comparisons with block operations. In Proc. ACM
     Symp. on the Theory of Computing, pages 416–424, 2000.

[17] E. Myers. A sublinear algorithm for approximate keyword
     searching. Algorithmica, 12(4/5):345–374, 1994. Earlier version
     in Tech. report TR-90-25, Dept. of CS, Univ. of Arizona, 1990.

[18] G. Myers. A fast bit-vector algorithm for approximate string
     matching based on dynamic programming. Journal of the ACM,
     46(3):395–415, 1999.

[19] G. Navarro. A guided tour to approximate string matching. ACM
     Comp. Surv., 33(1):31–88, 2001.

[20] G. Navarro and R. Baeza-Yates. A practical q-gram index for text
     retrieval allowing errors. CLEI Electronic Journal, 1(2), 1998.
     http://www.clei.cl. Earlier version in Proc. CLEI'97.

[21] G. Navarro and R. Baeza-Yates. A hybrid indexing method for
     approximate string matching. J. of Discrete Algorithms,
     1(1):205–239, 2000. Hermes Science Publishing. Earlier version
     in CPM'99.

[22] G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing
     text with approximate q-grams. In Proc. 11th Ann. Symp. on
     Combinatorial Pattern Matching (CPM'2000), LNCS 1848, pages
     350–363, 2000.

[23] G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing
     methods for approximate string matching. IEEE Data Engineering
     Bulletin, 2000.

[24] F. Shi. Fast approximate string matching with q-blocks
     sequences. In Proc. 3rd South American Workshop on String
     Processing (WSP'96), pages 257–271. Carleton University Press,
     1996.

[25] E. Sutinen and J. Tarhio. On using q-gram locations in
     approximate string matching. In Proc. 3rd European Symp. on
     Algorithms (ESA'95), LNCS 979, pages 327–340, 1995.

[26] E. Sutinen and J. Tarhio. Filtration with q-samples in
     approximate string matching. In Proc. 7th Ann. Symp. on
     Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages
     50–61, 1996.

[27] T. Takaoka. Approximate pattern matching with samples. In Proc.
     5th Int'l. Symp. on Algorithms and Computation (ISAAC'94), LNCS
     834, pages 234–242, 1994.

[28] E. Ukkonen. Finding approximate patterns in strings. J. of
     Algorithms, 6:132–137, 1985.

[29] E. Ukkonen. Approximate string matching over suffix trees. In
     Proc. 4th Ann. Symp. on Combinatorial Pattern Matching
     (CPM'93), LNCS 684, pages 228–242, 1993.

[30] E. Ukkonen. Constructing suffix trees on-line in linear time.
     Algorithmica, 14(3):249–260, 1995.
