
Approximate String Matching and Text Retrieval

Ricardo Baeza-Yates
Center for Web Research
www.cwr.cl
Depto. de Ciencias de la Computación
Universidad de Chile
Santiago, CHILE
rbaeza@dcc.uchile.cl

Based on surveys by Baeza-Yates [3], Baeza-Yates [5], Navarro [19]
and Navarro et al. [23], and own work and views.

Outline

- Theory vs. Practice
- Problem
- String searching
- From automata to algorithms
- Filtering
- Indices
- ASM with Indices
- Concluding remarks
User's point of view

[Figure: the user poses a query over a text. User-defined text
normalization and user-defined index points and structure feed an
indexing algorithm that builds the index; a searching algorithm then
uses the index to map the query to the answer, all through the user
interface.]

- Tools vs. Intelligence
- Applications to other areas:
  Web retrieval, XML processing, NL processing, text mining,
  multimedia search, bioinformatics, signal processing, ...

Theory vs. Practice

How can we measure the goodness of an algorithm?

- Asymptotic worst case behavior
- Asymptotic average case behavior
- Practical behavior

D. Knuth [IFIP'89 invited speech]:

- Balance between theory and practice
- Software is hard
- The best theory is inspired by practice
- The best practice is inspired by theory
Problem

- Σ: a finite alphabet of size σ (σ is considered bounded)
- Text: T = t_1 t_2 ... t_n, of length n
- Pattern: P = p_1 p_2 ... p_m, of length m (m ≤ n)
- Problem: find all occurrences of P in T

Search Models

1. P is a word (depends on the language)
2. P is any sequence starting in an index-point

Some data structures assume the first model.

Answer Models

- Exact match
- Approximate match (a distance function is needed)
- Closest match, or all matches at a certain distance

Computation Models

- Space complexity: the extra space used for the search (the index)
  – RAM model: words of Θ(log n) bits
- Time complexity: the time needed to find the pattern, or an
  equivalent measure (for example, comparisons)
  – Worst case
  – Average case (uniform text and pattern)
- Cost measures: text-pattern comparisons, arithmetical/bitwise
  operations
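As a concrete baseline for the exact-match problem, a minimal
brute-force sketch (an illustration added here, not from the original
slides): it compares the pattern against every text position, O(mn)
in the worst case.

#include <stdio.h>
#include <string.h>

// Report every occurrence of pat[0..m-1] in text[0..n-1].
void brute_force(const char *text, int n, const char *pat, int m)
{
    for (int k = 0; k + m <= n; k++) {
        int j = 0;
        while (j < m && text[k + j] == pat[j])
            j++;
        if (j == m)
            printf("match at %d\n", k);
    }
}

int main(void)
{
    const char *t = "this is a text example";
    brute_force(t, (int)strlen(t), "text", 4);   // prints: match at 10
    return 0;
}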
Algorithmic point of view

Input data:

- Raw pattern and text
  – sequential, on-line, real-time algorithms
- Preprocessing of the pattern
  – the pattern is known in advance
- Preprocessing of the text (an index)
  – Inverted index
  – Suffix trees (tries, Patricia trees, ...)
  – Suffix arrays
  – Based on q-grams
  – Automata: DAWGs, suffix based
- Hybrid solutions
  – Filtering or Filtration
  – Two Level TR

String-Matching Space-Time Trade-Offs

[Figure: algorithms placed on a space complexity vs. time complexity
plane. Indexed search (tries, suffix arrays, Patricia trees, inverted
files, signature files) trades more space for less time; sequential
search (Boyer-Moore like algorithms, KMP, Shift-or, brute force) uses
little space but more time; hybrid solutions (two-level TR, RAC) sit
in between.]
String Matching: Definition

Basic problem: find the exact occurrences of a pattern in a text.

Variations:

– Allow mismatches (Hamming distance)
– Allow insertions (episode distance, not symmetric)
– Allow insertions and deletions (LCS distance)
– Allow mismatches, insertions and deletions (edit distance)
– Language dependent measures: phonetic, morphemes, etc.

Example: in the text "This is a text example ...", the pattern "text"
occurs exactly once, and "ext" occurs inside that occurrence.

Software examples: the grep command in Unix (sequential) or Google on
the Web (index based).

String Matching Complexity

n: size of the text
m: size of the pattern

Raw text:

– Worst case: Θ(n) comparisons (tight lower and upper bounds)
– Average case: Θ(n log_σ(m) / m) (lower and upper bound)
– ASM: O(kn) worst case, O(n (k + log_σ m) / m) average case

Preprocessed text:

– Index construction: O(n) time and space (finite alphabet)
– Search: O(m) comparisons in the worst case
– ASM: several results, still open
Classical Algorithms

- Knuth-Morris-Pratt
  [Figure: the pattern x is aligned with the text; after a mismatch,
  the matched prefix y determines how far the pattern can be shifted.]
- Boyer-Moore
  – Match heuristic
    [Figure: the pattern is compared right to left; the matched
    suffix y determines the shift.]
  – Occurrence heuristic
    [Figure: the text character y that caused the mismatch determines
    the shift.]
  – The match heuristic defines the BM automata

String Searching: Historical View

[Timeline figure, theory on the left, practice on the right:
1970  Knuth-Morris-Pratt, Fischer-Paterson; Boyer-Moore
1980  Karp-Rabin, Rytter, Galil; Horspool
1986  Apostolico-Giancarlo; Abrahamson
1988  Baeza-Yates; Baeza-Yates/Gonnet
1990  Cole, Choffrut, Colussi-Galil-Giancarlo, Crochemore-Perrin,
      Regnier; Sunday, Hume-Sunday, Wu-Manber, Baeza-Yates/Gonnet
1992  Cole-Hariharan; Baeza-Yates/Perleberg]
Knuth-Morris-Pratt Algorithm

- A fascinating story... from theory and practice
- Preprocessing: the next table, where next[j] tells where to resume
  the comparison after a mismatch at pattern position j
- Example:

            a b r a c a d a b r a
  next[j]   0 1 1 0 2 0 2 0 1 1 0 5

- Worst case complexity: O(n + m) (at most about 2n comparisons)
- Extension to multiple patterns: Aho-Corasick

Algorithm

search( text, n, pat, m )   // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
{
    int next[MAX_PATTERN_SIZE];

    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;
    kmp( pat, m+1, pat, m+1, next );   // Preprocess pattern
    kmp( text, n, pat, m, next );      // Search text
    pat[m+1] = END_OF_STRING;
}

// One routine does both jobs: the first call (dosearch == 0) fills
// next[] by matching the pattern against itself; later calls search.
kmp( text, n, pat, m, next )
char text[], pat[];
int n, m, next[];
{
    static int dosearch = 0;
    int i, j;

    i = 1;
    if( !dosearch ) j = next[1] = 0;   // Preprocessing
    else j = 1;
    do {
        if( j == 0 || text[i] == pat[j] )
        {
            i++; j++;
            if( !dosearch ) {          // Preprocessing
                if( text[i] != pat[j] ) next[i] = j;
                else next[i] = next[j];
            }
        }
        else j = next[j];
        if( dosearch && j > m ) {      // Search
            Report_match_at_position( i-m );
            j = next[m+1];
        }
    } while( i <= n );
    dosearch = 1;
}
Boyer-Moore-Horspool-Sunday Algorithm

- The match heuristic can be extended: BM automata, suffix automata
- In practice the occurrence heuristic is the key issue: the shift
  table d is indexed by the text character just after the current
  window; d[x] = m + 1 - (rightmost position of x in the pattern),
  or m + 1 if x does not occur in the pattern

search( text, n, pat, m )   // Search pat[1..m] in text[1..n]
char text[], pat[];
int n, m;
{
    int d[MAX_ALPHABET_SIZE], i, j, k, lim;

    // Preprocessing
    for( k=0; k<MAX_ALPHABET_SIZE; k++ )
        d[k] = m+1;
    for( k=1; k<=m; k++ )
        d[pat[k]] = m+1-k;

    // Search
    lim = n-m+1;
    for( k=1; k <= lim; k += d[text[k+m]] )
    {
        i = k;   // Could compare in an optimal order
        for( j=1; j<=m && text[i] == pat[j]; j++ )
            i++;
        if( j == m+1 )
            Report_match_at_position( k );
    }
}

- Complexity ranges between O(n / (m+1)) and O(mn)
- O(σ) extra space
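For reference, a self-contained 0-based rendering of the same Sunday
shift (an added sketch, assuming 8-bit characters):

#include <stdio.h>
#include <string.h>

void sunday(const char *text, int n, const char *pat, int m)
{
    int d[256];

    for (int c = 0; c < 256; c++) d[c] = m + 1;   // default shift
    for (int j = 0; j < m; j++)
        d[(unsigned char)pat[j]] = m - j;         // rightmost occurrence wins

    for (int k = 0; k + m <= n; ) {
        if (memcmp(text + k, pat, m) == 0)
            printf("match at %d\n", k);
        if (k + m >= n) break;                    // no character past window
        k += d[(unsigned char)text[k + m]];
    }
}

int main(void)
{
    const char *t = "this is a text example";
    sunday(t, (int)strlen(t), "text", 4);         // prints: match at 10
    return 0;
}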

Counting: Baeza-Yates/Perleberg, 1992

- A simple example of filtering
- Idea: count the number of matching characters for all possible
  positions of the pattern
- Straight implementation: a brute force algorithm with O(mn) worst
  and average case time
- Improvements:
  – preprocess the pattern, computing which characters of the
    alphabet should update a counter
  – we need only the last m counters

Example

  Pattern = than
  Text    = this is an example that
  Count   = 20000020000100000003

Each step is:

  For all j such that pattern[j] = text[i]
      increment count[i-j+1]

Code (here the counters start at m and count down; a position with at
most k mismatches is reported when its counter is <= k):

for (i=0; i<n; i++) {
    if ((off1=(aptr=&alpha[c=*t++])->offset) >= 0) {
        count[(i+off1)&MOD256]--;
        for (aptr=aptr->next; aptr!=NULL; aptr=aptr->next)
            count[(i+aptr->offset)&MOD256]--;
    }
    if (count[i&MOD256] <= k) printf("%d",count[i&MOD256]);
    count[i&MOD256] = m;
}

Running time

- Total cost is O(n + S), where S is the number of text-pattern
  symbol matches
- On average S = mn/σ, so the cost is O(n(1 + m/σ))
- The cost is independent of the number of mismatches
- Not suitable for small σ (e.g. DNA)
Bit Parallelism: Baeza-Yates/Gonnet, 1989 [2]

- Can be seen as a parallel algorithm using m processors
- Processor j outputs 1 if pat[1..j] equals the last j text
  characters read, and 0 otherwise
- The output of processor m signals the occurrences

Example: searching "text"

[Figure: four processors, one per pattern prefix (t, te, tex, text),
all reading the current text character; over the text
"this is a text" the output of the last processor becomes 1 exactly
where "text" ends.]
Bit sequence simulation

- One bit per processor: the whole simulation works on a bit vector!
- For finite alphabets, all possible comparisons can be precomputed
  before the search in a table T indexed by character
- In the example (pattern "text"; 1 = match at that position):

           T[t]  T[e]  T[x]  T[*]
      t      1     0     0     0
      e      0     1     0     0
      x      0     0     1     0
      t      1     0     0     0

- The resulting shift-and/or algorithm appeared in the Handbook of
  Algorithms and Data Structures, 2nd ed., 1991

Complexity

For the uniform cost RAM model (word size w), we have:

- Preprocessing time: O(m + σ)
- Search time: O(n) for m ≤ w, O(n ⌈m/w⌉) in general
- Space needed: O(σ) words

Code (shift-or convention, 0 = match; B bits per position, B = 1 for
exact searching):

// Preprocessing
for( i=0; i<MAXSYM; i++ ) T[i] = ~0;
for( lim=0, j=1; *pattern != EOS; lim |= j, j <<= B, pattern++ )
    T[*pattern] &= ~j;
lim = ~(lim >> B);

// Search
matches = 0; state = ~0;   // Initial state
for( ; *text != EOS; text++ )
{
    state = (state << B) | T[*text];   // Next state
    if( state < lim )
        matches++;   // Match at current position - len(pattern) + 1
}
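A self-contained shift-and variant (1 = match convention), added here
as an illustration; it assumes m ≤ 64 and 8-bit characters:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

void shift_and(const char *text, const char *pat)
{
    int m = (int)strlen(pat);
    uint64_t T[256] = {0}, state = 0, goal = 1ULL << (m - 1);

    for (int j = 0; j < m; j++)                   // bit j of T[c] is set
        T[(unsigned char)pat[j]] |= 1ULL << j;    // iff pat[j] == c

    for (int i = 0; text[i]; i++) {
        state = ((state << 1) | 1) & T[(unsigned char)text[i]];
        if (state & goal)                         // bit m-1 set: full match
            printf("match at %d\n", i - m + 1);
    }
}

int main(void)
{
    shift_and("this is a text", "text");          // prints: match at 10
    return 0;
}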
Extensions

- Every pattern element can be a class of symbols: just change T!
- Don't care symbols in the text: a text character that matches
  everywhere simply gets an all-match entry in T
- Multiple patterns: just one longer bit sequence (the masks are
  concatenated)
- Mismatches: count the number of mismatches instead of using &,
  with O(log k) bits per position plus an overflow bit (k is the
  maximal number of mismatches allowed)
- Insertions and deletions [Wu & Manber, 1991]: k+1 bit sequences
- Agrep: was the fastest approximate search tool for Unix, now
  nrgrep; the bit-wise approach to DP [18] is the fastest for long
  strings

Approximate String Matching: Dynamic Programming

C[i][j]: minimum number of errors to match p_1..p_i to a suffix of
t_1..t_j

  C[i][0] = i,   C[0][j] = 0
  C[i][j] = C[i-1][j-1]                                  if p_i = t_j
            1 + min(C[i-1][j-1], C[i-1][j], C[i][j-1])   otherwise

A match with at most k errors ends at text position j whenever
C[m][j] ≤ k.

Example: searching "survey" in "surgery" (with k = 2, matches end at
the last three text positions):

        s  u  r  g  e  r  y
     0  0  0  0  0  0  0  0
  s  1  0  1  1  1  1  1  1
  u  2  1  0  1  2  2  2  2
  r  3  2  1  0  1  2  2  3
  v  4  3  2  1  1  2  3  3
  e  5  4  3  2  2  1  2  3
  y  6  5  4  3  3  2  2  2
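A compact rendering of this recurrence, added for concreteness: one
column of C is kept and updated per text character (it assumes
m < 64).

#include <stdio.h>
#include <string.h>

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

// Report every text position where pat matches with <= k errors.
void dp_search(const char *text, const char *pat, int k)
{
    int m = (int)strlen(pat), C[64], prev, old;

    for (int i = 0; i <= m; i++)
        C[i] = i;                          // column for j = 0
    for (int j = 1; text[j-1] != '\0'; j++) {
        prev = C[0];                       // C[0][j-1]
        C[0] = 0;                          // a match may start anywhere
        for (int i = 1; i <= m; i++) {
            old = C[i];                    // C[i][j-1]
            C[i] = (pat[i-1] == text[j-1])
                 ? prev
                 : 1 + min3(prev, old, C[i-1]);
            prev = old;
        }
        if (C[m] <= k)
            printf("match ending at %d\n", j);
    }
}

int main(void)
{
    dp_search("surgery", "survey", 2);     // ends at positions 5, 6, 7
    return 0;
}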
From Automata to Bit-parallelism

- Exploit the automaton structure
- Consider the NFA to search for "text"
- Processors in the bit-parallel simulation correspond one to one
  with active states of the standard simulation
- Be careful with the ε-closure
- Related to hardware implementations
- Based on Baeza-Yates [5]

Approximate string searching

Consider the NFA for searching "text" with at most k errors:

[Figure: one row of states per error level (no errors, 1 error,
2 errors). Horizontal arrows consume a matching pattern character;
vertical and diagonal arrows move down one error level, modeling
insertions, substitutions and deletions.]

Longest positional match wanted?
Horizontal bit parallelism: Wu & Manber Vertical bit parallelism:


t e x t


t e x t no errors
no errors



t e x t
t e x t 1 error
1 error



t e x t
t e x t 2 errors
2 errors

Key information: highest (smallest error) active state per column


State of the search: numbers on the range .
















































Initially ( ones)


























Initially and






Drawback: Dependency on


Related to dynamic programming:


Complexity: search time






Longest common subsequence, string editing



space





Related to Ukkonen’s automata approach

27 28
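A runnable sketch of the horizontal simulation (an added illustration,
shift-and convention; it assumes m ≤ 64 and k ≤ 8):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MAXK 8

void wu_manber(const char *text, const char *pat, int k)
{
    int m = (int)strlen(pat);
    uint64_t B[256] = {0}, R[MAXK + 1], old, prev, goal = 1ULL << (m - 1);

    for (int j = 0; j < m; j++)
        B[(unsigned char)pat[j]] |= 1ULL << j;
    for (int i = 0; i <= k; i++)         // R_i starts with i low bits set
        R[i] = (1ULL << i) - 1;

    for (int pos = 0; text[pos]; pos++) {
        uint64_t c = B[(unsigned char)text[pos]];
        prev = R[0];                     // old R_0
        R[0] = ((R[0] << 1) | 1) & c;    // exact level, plain shift-and
        for (int i = 1; i <= k; i++) {
            old = R[i];
            // match | insertion | substitution and deletion | start bit
            R[i] = ((old << 1) & c) | prev | ((prev | R[i-1]) << 1) | 1;
            prev = old;
        }
        if (R[k] & goal)
            printf("match ending at %d\n", pos);
    }
}

int main(void)
{
    wu_manber("surgery", "survey", 2);   // ends at 0-based positions 4, 5, 6
    return 0;
}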
Diagonal bit parallelism: Baeza-Yates & Navarro [1996] ASM: Sequential Algorithms


p a t t
no errors



p a t t
1 error


p a t t
2 errors

Each diagonal represents an -closure (longest match):





















where




 





















Advantage: all s can be computed in parallel




 

Mixing this with filtration we get time





Related to simulation of DP over a suffix array (later) and bioin-


formatic applications

29 30
Dynamic Programming Automata and Bit-Parallelism

31 32
Filtering or Filtration

- Find potential matches and then apply a sequential algorithm to
  check each candidate
- Filtration can be done by a sequential scan or by an index
- There is a trade-off between the filtration effort and the
  verification effort
- There is always a maximum error ratio up to which filtration is
  useful, as for larger error levels the text areas to verify cover
  almost all the text
- Verification can be done in a hierarchical fashion

A First Lemma for Filtering

Lemma 1: Let A and B be two strings such that ed(A, B) ≤ k. Let
A = A_1 x_1 A_2 x_2 ... A_{j-1} x_{j-1} A_j, for strings A_i and x_i
and for any j ≥ 1. Then, at least j - k of the strings A_i appear in
B. Moreover, their relative distances inside B cannot differ from
those in A by more than k.

- Consider the sequence of at most k edit operations that convert A
  into B
- Each edit operation can affect at most one of the A_i's, so at
  least j - k of them must remain unaltered
- Relative distances: k edit operations cannot produce misalignments
  larger than k
Example

  A:  A1 x1 A2 x2 A3 x3 A4 x4 A5
  B:  A1 A2' A3 A4' A5

An example of Lemma 1 with j = 5 and k = 3:

- At least j - k = 2 of the A_i survive unaltered
- There are actually 3 such segments (A1, A3 and A5), because one of
  the errors appeared in some x_i
- Another possible reason could have been more than one error
  occurring in a single A_i

Filtering Algorithms

[Figure: overview of the filtering algorithms.]
Worst Case Complexity and Space

[Figure: worst case time vs. space of the indexed ASM algorithms.]

Average Case Complexity and Error Ratio

[Figure: average case time vs. the error ratio α = k/m.]
Best Algorithms

[Figure: the best algorithm per region of the parameter space.]

Data Structures

- Inverted indices permit searching for any word in the text
- Suffix trees allow searching for any substring of the text
- Suffix arrays permit the same operations but are slightly slower
- q-gram indexes allow searching for any text substring not longer
  than q
- q-sample indexes permit the same, but only for some text substrings
Inverted Indices

Idea: store all the words and their positions.

  1    6  9 11   17 19   24  28   33    40   46  50   55   60
  This is a text. A text has many words. Words are made from letters.

  Vocabulary   Occurrences
  letters      60 ...
  made         50 ...
  many         28 ...
  text         11, 19 ...
  words        33, 40 ...

- Vocabulary search: hashing, sorted array, etc.
- The granularity of the occurrences depends on what we want to
  answer: file, word or byte

Inverted Files: Space

- Vocabulary: sublinear in the text size, by Heaps' law
- Posting file: linear space (one occurrence = one pointer)
- Word distribution: Zipf's law
- Stopwords: about half the posting file
- Overall: linear space

[Figure: vocabulary size vs. text length (Heaps' law), and posting
list lengths vs. word rank (Zipf's law).]
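The usual statements of both empirical laws, for reference (the
constants are corpus dependent; the typical ranges given in the
comments are assumptions, not from the slides):

% Heaps' law: vocabulary of a text of n words (typically 0.4 <= beta <= 0.6)
V(n) = K\, n^{\beta}, \qquad 0 < \beta < 1

% Zipf's law: frequency of the i-th most frequent word (theta >= 1)
f_i \;\propto\; \frac{1}{i^{\theta}}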
Complex Patterns

- Search the vocabulary sequentially (e.g. with an ASM algorithm) and
  then do set operations on the posting lists (a sketch of one such
  operation follows the next slide)

[Figure: the query "bargain for trucks" [approximate] over six short
documents d1 ... d6 about used cars and trucks; the approximate
vocabulary search also matches the misspelled entries "bargan" and
"truc", and the posting lists of all matched words are then combined.]

Building Inverted Indices

- Process text pieces as large as possible in main memory, using a
  vocabulary trie with the occurrence list of each word
- Merge the partial indexes

[Figure: hierarchical merging. Level 1: the initial dumps I-1 ... I-8;
Level 2: I-1..2, I-3..4, I-5..6, I-7..8; Level 3: I-1..4, I-5..8;
Level 4: I-1..8, the final index.]
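The set operations reduce to merges of sorted lists. A minimal sketch
of the AND case (the document identifiers are illustrative, taken
from the figure above):

#include <stdio.h>

// Intersect two sorted posting lists of document identifiers.
int intersect(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { out[k++] = a[i]; i++; j++; }
    }
    return k;
}

int main(void)
{
    int car[]   = {1, 2, 4, 5, 6};    // documents containing "car"
    int truck[] = {1, 3, 5}, out[8];  // documents containing "truck"
    int k = intersect(car, 5, truck, 3, out);
    for (int i = 0; i < k; i++)
        printf("d%d ", out[i]);       // prints: d1 d5
    printf("\n");
    return 0;
}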
Two-level Text Retrieval: Block addressed inverted files

- Idea used in PIRS (Personal Information Retrieval System)
  [Wu and Manber 1993]
- The text is divided in 256 blocks of the same size
- An inverted file of all the different words of the text is built
- Each entry indicates only the blocks where the word appears
  (1 byte per block)
- First we search in the inverted file, next in the corresponding
  blocks using a fast sequential algorithm
- Complexity depends on the number of occurrences; locality of
  reference is important
- For large texts, empirical results show that the index requires
  less than 5% of the text size
- This idea works reasonably well up to about 200 MB

Inverted File Space in Practice

Index size as a fraction of the text (two typical figures per cell,
depending on whether stopwords are indexed):

  Index                     Small base   Medium base   Large base
                            (1 MB)       (200 MB)      (2 GB)
  Full inverted             45%   73%    36%   64%     35%   63%
  Document addressing       19%   26%    18%   32%     26%   47%
  Block (64K) addressing    27%   41%    18%   32%      5%    9%
  Block (256) addressing    18%   25%    1.7%  2.4%    0.5%  0.7%
Tries and Suffix Trees

[Figure: the suffix trie of the text below, with one leaf per suffix
position 1 ... 11.]

   1 2 3 4 5 6 7 8 9 10 11
   a b r a c a d a b r  a     Text

- Search time is optimal: O(m)
- Problem: space can be quadratic in a trie
- Compact suffix trees (Patricia trees): cut unary paths
- To remember the depth, a count is added at every node, or the
  string associated to the path is stored
- Space is now linear, O(n)
- Useful for complex queries: regular expressions in sublinear
  average time [4]

Suffix Arrays

   1  4  7  10    14     20 23 26 29 32    38   43 46  50   55        Index points
   to be at the beach or to be at work, that is the real question $   Text

  Suffixes:
   1: to be at the beach or to be at work, that is the real question
   4: be at the beach or to be at work, that is the real question
   7: at the beach or to be at work, that is the real question
  10: the beach or to be at work, that is the real question
  ...
  55: question

   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   7 29  4 26 14 43 20 55 50 38 10 46  1 23 32      Suffix Array

For the text "abracadabra" (index points at every position), the
suffix array is 11 8 1 4 6 9 2 5 7 10 3.
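Before looking at searching, a naive construction sketch for the small
example (an added illustration; the real constructions discussed below
are far more refined, this one is O(n^2 log n) in the worst case):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *T;              // text shared with the comparator

static int suf_cmp(const void *a, const void *b)
{
    return strcmp(T + *(const int *)a, T + *(const int *)b);
}

int main(void)
{
    T = "abracadabra";
    int n = (int)strlen(T), sa[64];

    for (int i = 0; i < n; i++) sa[i] = i;
    qsort(sa, n, sizeof sa[0], suf_cmp);
    for (int i = 0; i < n; i++)    // prints 10 7 0 3 5 8 1 4 6 9 2,
        printf("%d ", sa[i]);      // the 0-based form of the array above
    printf("\n");
    return 0;
}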
Suffix Array Search

- Every substring is a prefix of a suffix
- The prefix relation is compatible with the lexicographic order
- Hence, two binary searches are enough to obtain the suffix array
  range where all the occurrences of P appear
- Number of occurrences: the range size
- Time is logarithmic in the size of the array: O(m log n)

   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   7 29  4 26 14 43 20 55 50 38 10 46  1 23 32      Suffix Array

   to be at the beach or to be at work, that is the real question $   Text
   1  4  7  10    14     20 23 26 29 32    38   43 46  50   55        Index points

  (e.g., the binary searches for "be" and "bf" delimit the range of
  suffixes starting with "be")

Suffix Array: Construction

- In principle it is just a lexicographical sort
- But suffixes are suffixes of other suffixes, which can be exploited
- However, random access to the text is the bottleneck when the text
  does not fit in main memory
- Best solution in that case: sequential scans with counting

[Figure: a) a small text yields a small suffix array directly in main
memory; b) a long text is processed piecewise, with counters recording
how each small suffix array interleaves with the rest of the text;
c) the counters drive the merge of the small arrays into the final,
long suffix array.]

- Building time is now linear (2003!)
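The two binary searches, as a compact sketch over the "abracadabra"
example (0-based positions; strncmp stands in for the prefix
comparison):

#include <stdio.h>
#include <string.h>

// Return the number of occurrences of pat in text; *first receives
// the start of the suffix array range holding them.
int sa_search(const char *text, const int *sa, int n,
              const char *pat, int *first)
{
    int m = (int)strlen(pat), lo, hi, from;

    lo = 0; hi = n;                 // leftmost suffix >= pat
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(text + sa[mid], pat, m) < 0) lo = mid + 1;
        else hi = mid;
    }
    from = lo; hi = n;              // leftmost suffix > pat as a prefix
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(text + sa[mid], pat, m) <= 0) lo = mid + 1;
        else hi = mid;
    }
    *first = from;
    return lo - from;
}

int main(void)
{
    const char *t = "abracadabra";                  // 0-based suffix array
    int sa[11] = {10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2};
    int first, occ = sa_search(t, sa, 11, "abra", &first);
    printf("%d occurrences\n", occ);                // prints: 2
    return 0;
}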
Q-gram Indexes

In a q-gram index, every different text q-gram is stored, and all its
positions are stored in increasing text order.

   1 2 3 4 5 6 7 8 9 10 11
   a b r a c a d a b r  a     Text

  q-gram index (q = 3):
   abr  1, 8
   aca  4
   ada  6
   bra  2, 9
   cad  5
   dab  7
   rac  3

Q-Sample Indexes

In a q-sample index, only some text q-grams are stored. In this case
the samples do not overlap.

  q-samples index (q = 3, one sample every 4 positions):
   abr  1
   cad  5
   bra  9

- Useful to search long strings, using much less space
- A search for a string can take constant time on average
- Building them takes linear time

[Figure: an approximate occurrence of P must still contain enough of
the pattern q-grams / text q-samples, at roughly the right places.]
Algorithms for ASM using an Index

Search approaches:

- Neighborhood generation: generates and searches for, using an
  index, all the strings that are at distance k or less from the
  pattern
- Partitioning into exact searching: selects pattern substrings that
  must appear unaltered in any approximate occurrence, uses the index
  to search for those substrings, and checks the text areas
  surrounding them
- Assuming that the errors occur in the pattern or in the text leads
  to radically different approaches
- Intermediate partitioning: extracts substrings from the pattern
  that are searched for allowing fewer errors, using neighborhood
  generation

Current Results on this Taxonomy

                              Search Approach
  Data        Neighborhood  Partitioning into           Intermediate
  Structure   Generation    Exact Searching             Partitioning
                            Errors in    Errors in      Errors in    Errors in
                            Text         Pattern        Text         Pattern

  Suffix      Ukkonen 93    Jokinen &    Shi 96 [24]    n/a          n/a
  Tree        [29], Cobbs   Ukkonen 91
              95 [8]        [13]

  Suffix      Gonnet 88     n/a          n/a            n/a          Navarro &
  Array       [10]                                                   Baeza-Yates
                                                                     99 [21]

  Q-grams     n/a           Jokinen &    Navarro &      n/a          Myers 90
                            Ukkonen 91   Baeza-Yates                 [17]
                            [13],        97 [20]
                            Holsti &
                            Sutinen 94
                            [12]

  Q-samples   n/a           Sutinen &    n/a            Navarro      n/a
                            Tarhio 96                   et al. 2000
                            [26]                        [22]
Neighborhood Generation

The neighborhood of the pattern:

- Let U_k(P) = { P' : ed(P, P') ≤ k } be the "k-neighborhood" of P
  (it is a finite set)
- Generate U_k(P) and use an index to search for the text occurrences
  of its elements [17]
- Problem: U_k(P) is quite large; good bounds [28, 17] show that its
  size grows exponentially with k (roughly m^k σ^k [28])
- This approach works well only for small m and k

Backtracking

- Use a suffix tree or array to find U_k(P) in the text [4, 10, 29]
- Just some branches will be followed, but each branch factors out
  many text occurrences at once
- While searching we have three cases at a node x:
  a) ed(x, P) ≤ k, which means x ∈ U_k(P); we report all the leaves
     of the current subtree as answers
  b) every entry of the DP column for x exceeds k, which means that x
     is not a prefix of any string in U_k(P), and we can abandon this
     branch
  c) otherwise, we continue descending by every branch of that node;
     if we arrive at a leaf, we continue with a sequential algorithm
     on the remaining text
- Some improvements [13, 30, 8] avoid processing some redundant
  nodes, at the cost of a more complex node processing
- The same idea can be used to compare a whole text against another
  one, or against itself [6]
Example

- During the backtracking, the DP matrix can be seen as a stack that
  grows to the right: each new character on the current trie path
  pushes a new column

[Figure: the matrix for the pattern "survey" as the trie path
s-u-r-g-e is descended.]

- In the example, with k = 2 the backtracking indeed ends after
  reading "surge" (case a): ed("surge", "survey") = 2)
- With k = 1 the search would have been pruned after considering
  "surger" and "surga", since in both cases no entry of the last
  matrix column is ≤ 1

Partitioning into Exact Search: Errors in the Pattern

- We use Lemma 1 under the setting A = P and j = k + s. That is, the
  pattern is split in k + s pieces, and hence s of the pieces must
  appear unaltered inside any occurrence
- The pieces are searched, and the text areas where s of those pieces
  appear under the stated distance requirements are verified for a
  complete match
- The search time in the index is O(m) or O(m log n), but the
  checking time dominates
- The case s = 1, proposed in [20], comes with an analysis of the
  average time to check the candidates
- The case s > 1 is proposed in [24] without any analysis
- If s grows, the pieces get shorter and hence there are more matches
  to check; but on the other hand, forcing s pieces to match makes
  the filter stricter [24]. Recent results show that larger s is
  slower
- Note that, since we cannot know where the pattern pieces will be
  found in the text, all the text positions must be searchable
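A sketch of the simplest (s = 1) filter, added for concreteness: the
pattern is split into k+1 pieces and each piece is searched verbatim;
strstr stands in for the index lookup, and only the text area to be
verified is reported (the sequential verification itself is omitted).

#include <stdio.h>
#include <string.h>

void piece_filter(const char *text, const char *pat, int k)
{
    int m = (int)strlen(pat), p = k + 1;

    for (int i = 0; i < p; i++) {
        int from = i * m / p, len = (i + 1) * m / p - from;
        char piece[64];
        memcpy(piece, pat + from, len);
        piece[len] = '\0';
        // an index would be used here; strstr stands in for it
        for (const char *s = text; (s = strstr(s, piece)); s++)
            printf("piece \"%s\" at %ld: verify area [%ld, %ld]\n",
                   piece, (long)(s - text),
                   (long)(s - text) - from - k,
                   (long)(s - text) + (m - from) + k);
    }
}

int main(void)
{
    // pieces "su", "rv", "ey"; only "su" occurs, at position 10
    piece_filter("this is a surgery example", "survey", 2);
    return 0;
}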
Partitioning into Exact Search: Errors in the Text

- Assume now that the errors occur in the text, i.e., an area of T is
  an approximate occurrence of P
- We extract substrings of length q at fixed text intervals of length
  h (the q-samples)
- Those q-samples correspond to the A_i's of Lemma 1, and the spaces
  between q-samples to the x_i's
- What the lemma ensures is that, inside any occurrence of P
  containing j text q-samples, at least j - k of them appear in P, at
  about the same positions (± k)
- Now we need to ensure that any occurrence of P in T contains at
  least j text q-samples, i.e., h must be at most about
  (m - k - q + 1) / j

Search Algorithm

- At search time, all the (overlapping) pattern q-grams are extracted
  and searched for in the index of text q-samples
- When j - k pattern q-grams match in the text at the proper
  distances, the text area is verified
- This idea is presented in [26], with earlier versions in
  [13, 12, 27]
- The best value of q:
  – it should be small, to avoid a very large set of different
    q-samples
  – it should be large, to minimize the amount of verification
  – some analysis [25] shows that q of the order of log_σ m is the
    optimal value
- The best value of j? A larger j makes the filter stricter and may
  trigger fewer verifications
Intermediate Partitioning

- We filter the search by looking for pattern pieces, but those
  pieces are large and may still appear with errors in the
  occurrences
- However, they appear with fewer errors, and then we use
  neighborhood generation to search for them

Lemma 2: Let A and B be two strings such that ed(A, B) ≤ k. Let
A = A_1 x_1 A_2 x_2 ... A_{j-1} x_{j-1} A_j, for strings A_i and x_i
and for any j ≥ 1. Let k_1 ... k_j be any set of nonnegative numbers
such that Σ_i k_i ≥ k - j + 1. Then, at least one string A_i appears
with at most k_i errors in B.

Proof and Example

- The proof is easy: if every A_i needs more than k_i errors to match
  in B, then the total distance cannot be less than
  Σ_i (k_i + 1) = j + Σ_i k_i > k, a contradiction
- Note that in particular we can choose k_i = ⌊k/j⌋ for every i

  A:  A1 x1 A2 x2 A3
  B:  A1' A2' A3'

Example: let j = 3 and k = 5. At least one of the A_i's has at most
one error (in this case k_i = ⌊5/3⌋ = 1).
Intermediate Partitioning: Errors in the Pattern

Search approaches based on this method have been proposed in
[17, 21]. The algorithm is:

- Split the pattern in j pieces, for some j
- Use neighborhood generation to find the text positions where those
  pieces appear, allowing ⌊k/j⌋ errors
- For each such text position, check the surrounding text with an
  on-line algorithm

What value for j?

- In [17], the pattern is partitioned because they use a q-gram
  index, so they use the minimum j that gives short enough pieces
  (of length at most q)
- In [21] the index can search for pieces of any length, and the
  partitioning is done in order to optimize the search time
- Consider the evolution of the search time as j moves from 1
  (neighborhood generation) to k + 1 (partitioning into exact
  search):
  – we search for pieces of length about m/j with about k/j errors,
    so the error level stays about the same for the subpatterns
  – as j moves to 1, the cost to search for the neighborhood of the
    pieces grows exponentially with their length
  – as j moves to k + 1 this cost decreases, reaching plain exact
    searching when j = k + 1; so, to find the pieces, a larger j is
    better
Cost to verify the occurrences: consider a pattern that is split in j
pieces, for increasing j. Start with j = 2.

- Lemma 2 states that every occurrence of the pattern involves an
  occurrence of at least one of its two halves with ⌊k/2⌋ errors,
  although there may be occurrences of the halves that yield no
  occurrences of the pattern
- Consider now halving the halves (j = 4), so we have four pieces
  (call them "quarters"). Each occurrence of one of the halves
  involves an occurrence of at least one quarter with ⌊k/4⌋ errors,
  but there may be many quarter occurrences that yield no occurrences
  of a pattern half
- Hence, the verification cost grows from zero at j = 1 to its
  maximum at j = k + 1

Trade-off

[Figure: as j moves from neighborhood generation (j = 1) through
intermediate partitioning to partitioning into exact search
(j = k + 1), the search cost decreases while the verification cost
grows; the total cost is minimized in between.]

- In [21] we show that the optimal j is Θ((m + k) / log_σ n), that
  is, pieces of length Θ(log_σ n), yielding a time complexity of
  O(n^λ) for some 0 < λ < 1
- This is sublinear (λ < 1) for α < 1 - e/√σ, a pessimistic bound
  (and e is replaced by 1 in practice)
- The same results are obtained in [17] by setting q = Θ(log_σ n)
- The experiments in [21] show that this intermediate approach is by
  far superior to both extremes



Intermediate Partitioning: Errors in the Text

- Consider an occurrence containing a sequence of j text q-samples,
  which must be chosen at steps of h
- By Lemma 2, one of the q-samples must appear in the pattern with at
  most ⌊k/j⌋ errors
- Moreover, if every q-sample appears in its pattern block with e_i
  errors, then it must hold that Σ_i e_i ≤ k
- This method [26, 22] searches every pattern block in the index of
  q-samples using backtracking, so as to find the least number of
  errors needed to match each text q-sample inside it
- If a zone of j consecutive samples is found whose errors add up to
  at most k, the area is verified
- To allow efficient neighborhood searching, we need to limit the
  maximum error level allowed: permitting up to k errors may be too
  expensive, as every text q-sample would be considered

- So we choose a limit ℓ and search the pattern blocks permitting
  only ℓ errors. Every q-sample found with e ≤ ℓ errors changes its
  error estimation from ℓ + 1 to e; otherwise it stays at the
  optimistic bound ℓ + 1
- There is a trade-off here:
  – for a small ℓ value, the search of the ℓ-neighborhoods is
    cheaper, but as we must assume that the text q-samples not found
    have ℓ + 1 errors, some useless verifications are done
  – using larger ℓ values gives more exact estimates of the actual
    number of errors of each text q-sample, reducing useless
    verifications in exchange for a higher cost to search the
    ℓ-environments
- Optimal ℓ? In [22] it is mentioned that, as the cost of the search
  grows exponentially with ℓ, the minimal ℓ can be a good choice.
  Experimentally, this scheme tolerates higher error levels than the
  corresponding partitioning into exact search
Future

- Further study of the power of non-comparison based algorithms:
  – many new bit-based algorithms
  – problem reduction works for text searching
  – example: multiple string searching plus checking
    - two dimensional case [Baeza-Yates and Regnier, 1990]
    - approximate pattern matching [Wu and Manber, 1991]
- The final optimal algorithm depends on the input; further study of
  input adaptive algorithms?
- New uses for old concepts. Example: q-grams
- Indexing for ASM on NL text can be done better
- Approximation algorithms with worst-case performance guarantees
  [16]
- Use a metric space to search [7]
- New text indexes tailored to special cases: ASM

References

[1] A. Apostolico and Z. Galil. Combinatorial Algorithms on Words.
    Springer-Verlag, 1985.

[2] R. Baeza-Yates and G. H. Gonnet. A new approach to text
    searching. Communications of the ACM, 35:74–82, Oct 1992.

[3] R. Baeza-Yates. Text retrieval: Theory and practice. In 12th
    IFIP World Computer Congress, volume I, pages 465–476. Elsevier
    Science, 1992.

[4] R. Baeza-Yates and G. H. Gonnet. Fast text searching for regular
    expressions or automaton searching on tries. Journal of the ACM,
    43(6):915–936, Nov 1996.

[5] R. Baeza-Yates. A unified view of string matching algorithms. In
    K. Jeffery, J. Král, and M. Bartosek, editors, SOFSEM'96: Theory
    and Practice of Informatics, volume 1175 of Lecture Notes in
    Computer Science, pages 1–15, Milovy, Czech Republic, November
    1996. Springer Verlag.

[6] R. Baeza-Yates and G. Gonnet. A fast algorithm on average for
    all-against-all sequence matching. In Proc. 6th Symp. on String
    Processing and Information Retrieval (SPIRE'99). IEEE CS Press,
    1999. Previous version unpublished, Dept. of Computer Science,
    Univ. of Chile, 1990.

[7] E. Chávez and G. Navarro. A metric index for approximate string
    matching. In Proc. 5th Symp. on Latin American Theoretical
    Informatics (LATIN), 2002. Cancun, Mexico.

[8] A. Cobbs. Fast approximate matching using suffix trees. In Proc.
    6th Ann. Symp. on Combinatorial Pattern Matching (CPM'95), LNCS
    807, pages 41–54, 1995.

[9] R. Giegerich, S. Kurtz, and J. Stoye. Efficient implementation
    of lazy suffix trees. In Proc. 3rd Workshop on Algorithm
    Engineering (WAE'99), LNCS 1668, pages 30–42, 1999.

[10] G. Gonnet. A tutorial introduction to Computational Biochemistry
     using Darwin. Technical report, Informatik E.T.H., Zurich,
     Switzerland, 1992.

[11] G. Gonnet, R. Baeza-Yates, and T. Snider. Information Retrieval:
     Data Structures and Algorithms, chapter 3: New indices for text:
     Pat trees and Pat arrays, pages 66–82. Prentice-Hall, 1992.
[12] N. Holsti and E. Sutinen. Approximate string matching using
     q-gram places. In Proc. 7th Finnish Symp. on Computer Science,
     pages 23–32. Univ. of Joensuu, 1994.

[13] P. Jokinen and E. Ukkonen. Two algorithms for approximate string
     matching in static texts. In Proc. 2nd Ann. Symp. on
     Mathematical Foundations of Computer Science (MFCS'91), pages
     240–248, 1991.

[14] U. Manber and E. Myers. Suffix arrays: a new method for on-line
     string searches. SIAM J. on Computing, 22(5):935–948, 1993.

[15] E. McCreight. A space-economical suffix tree construction
     algorithm. J. of the ACM, 23(2):262–272, 1976.

[16] S. Muthukrishnan and C. Sahinalp. Approximate nearest neighbors
     and sequence comparisons with block operations. In Proc. ACM
     Symp. on the Theory of Computing, pages 416–424, 2000.

[17] E. Myers. A sublinear algorithm for approximate keyword
     searching. Algorithmica, 12(4/5):345–374, 1994. Earlier version
     in Tech. report TR-90-25, Dept. of CS, Univ. of Arizona, 1990.

[18] G. Myers. A fast bit-vector algorithm for approximate string
     matching based on dynamic programming. Journal of the ACM,
     46(3):395–415, 1999.

[19] G. Navarro. A guided tour to approximate string matching. ACM
     Comp. Surv., 33(1):31–88, 2001.

[20] G. Navarro and R. Baeza-Yates. A practical q-gram index for text
     retrieval allowing errors. CLEI Electronic Journal, 1(2), 1998.
     http://www.clei.cl. Earlier version in Proc. CLEI'97.

[21] G. Navarro and R. Baeza-Yates. A hybrid indexing method for
     approximate string matching. J. of Discrete Algorithms,
     1(1):205–239, 2000. Hermes Science Publishing. Earlier version
     in CPM'99.

[22] G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing
     text with approximate q-grams. In Proc. 11th Ann. Symp. on
     Combinatorial Pattern Matching (CPM'2000), LNCS 1848, pages
     350–363, 2000.

[23] G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing
     methods for approximate string matching. IEEE Data Engineering
     Bulletin, 2000.

[24] F. Shi. Fast approximate string matching with q-blocks
     sequences. In Proc. 3rd South American Workshop on String
     Processing (WSP'96), pages 257–271. Carleton University Press,
     1996.

[25] E. Sutinen and J. Tarhio. On using q-gram locations in
     approximate string matching. In Proc. 3rd European Symp. on
     Algorithms (ESA'95), LNCS 979, pages 327–340, 1995.

[26] E. Sutinen and J. Tarhio. Filtration with q-samples in
     approximate string matching. In Proc. 7th Ann. Symp. on
     Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages
     50–61, 1996.

[27] T. Takaoka. Approximate pattern matching with samples. In Proc.
     5th Int'l. Symp. on Algorithms and Computation (ISAAC'94), LNCS
     834, pages 234–242, 1994.

[28] E. Ukkonen. Finding approximate patterns in strings. J. of
     Algorithms, 6:132–137, 1985.

[29] E. Ukkonen. Approximate string matching over suffix trees. In
     Proc. 4th Ann. Symp. on Combinatorial Pattern Matching
     (CPM'93), LNCS 684, pages 228–242, 1993.

[30] E. Ukkonen. Constructing suffix trees on-line in linear time.
     Algorithmica, 14(3):249–260, 1995.
