Algorithms:
An Overview
Shoshana Neuburger
The Graduate Center, CUNY
9/15/2009
Overview
Pattern Matching in 1D
Dictionary Matching
Pattern Matching in 2D
Indexing
Suffix Tree
Suffix Array
Research Directions
$p_i, t_i \in \Sigma$.
Output:
All $i$ such that
$T[i+k] = P[k+1],\ 0 \le k < m$
Output: {2,8,11}
KMP Automaton
P = ababcb
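The failure table behind the KMP automaton for P = ababcb can be sketched as below. This is a minimal Python sketch, not the slide's own construction: the names `failure_table` and `kmp_search` are illustrative assumptions, the pattern is assumed non-empty, and matches are reported 1-indexed to follow the output convention used in these slides.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0  # length of the border currently being extended
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the 1-indexed start positions of pattern in text."""
    fail = failure_table(pattern)
    hits, k = [], 0  # k = number of pattern characters matched so far
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)  # convert 0-indexed end to 1-indexed start
            k = fail[k - 1]
    return hits


print(failure_table("ababcb"))
print(kmp_search("abababcbababcb", "ababcb"))
```

Each text character is read once, and the fallback loop is paid for by earlier advances, which is where the linear running time comes from.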
Dictionary Matching
$\Sigma$ is an alphabet.
Input:
Text $T = t_1 t_2 \cdots t_n$
Dictionary of patterns $D = \{P_1, P_2, \ldots, P_k\}$
All characters in patterns and text belong to $\Sigma$.
Output:
All $i$, $j$ such that
$T[i+l] = P_j[l+1],\ 0 \le l < m_j,\ 1 \le j \le k$,
where $m_j = |P_j|$
Dictionary Matching
Algorithms
Naive Approach:
Use an efficient pattern matching
algorithm for each pattern in the
dictionary.
O(kn) time.
More efficient algorithms process text
once.
AC Automaton
Aho and Corasick extended the KMP
automaton to dictionary matching
$d = \sum_{j=1}^{k} |P_j|$
AC Automaton
D = {ab, ba, bab, babb, bb}
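For the dictionary D above, an Aho-Corasick automaton can be sketched as a trie with failure links and merged output lists. This is a minimal Python sketch under assumptions: the names `build_ac` and `ac_search` are hypothetical, transitions are stored in per-state dictionaries (hashing rather than the branching structure the slides analyse), and patterns are assumed non-empty. Matches are reported as (1-indexed start, pattern).

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output tables for a list of patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of patterns that end at each state
    for idx, p in enumerate(patterns):
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(idx)
    # Breadth-first traversal: a state's failure link points to a shallower state.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches ending here
            queue.append(t)
    return goto, fail, out


def ac_search(text, patterns):
    """Scan the text once and report (1-indexed start, pattern) pairs."""
    goto, fail, out = build_ac(patterns)
    hits, s = [], 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for idx in out[s]:
            p = patterns[idx]
            hits.append((i - len(p) + 2, p))
    return hits


D = ["ab", "ba", "bab", "babb", "bb"]
print(sorted(ac_search("babb", D)))
```

The text is processed once regardless of the number of patterns, in contrast to the O(kn) naive approach.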
Dictionary Matching
The KMP automaton does not depend on
alphabet size, while the AC automaton
does, because of its branching.
Dori, Landau (2006): AC automaton
is built in linear time for integer
alphabets.
Breslauer (1995) eliminates the log
factor in the text scanning stage.
Periodicity
A crucial task in preprocessing stage of
most pattern matching algorithms:
computing periodicity.
Many forms
failure table
witnesses
Periodicity
A periodic pattern can be
superimposed on itself without
mismatch before its midpoint.
Why is periodicity useful?
Can quickly eliminate many
candidates for pattern occurrence.
Periodicity
Definition:
S is periodic if $S = \rho' \rho^k$, $k \ge 2$,
and $\rho'$ is a proper suffix of $\rho$.
S is periodic if its longest prefix that
is also a suffix is at least half |S|.
The shortest period corresponds to
the longest border.
Periodicity - Example
S = abcabcabcab
|S| = 11
Longest border of S: b = abcabcab;
|b| = 8 so S is periodic.
Shortest period of S: $\rho$ = abc
$|\rho|$ = 3, so S is periodic.
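The example can be checked mechanically: the longest border falls out of the KMP failure table, and the shortest period length is |S| minus the longest border. A minimal Python sketch follows; the function names are illustrative assumptions, and "periodic" is tested with the border criterion stated earlier (longest border at least half of |S|).

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    if not s:
        return 0
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return fail[-1]


def shortest_period(s):
    """Length of the shortest period: |s| minus the longest border."""
    return len(s) - longest_border(s)


def is_periodic(s):
    """Periodic if the longest border covers at least half of s."""
    return 2 * longest_border(s) >= len(s)


S = "abcabcabcab"
print(longest_border(S), shortest_period(S), is_periodic(S))
```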
Witnesses
Popular paradigm in pattern matching:
1. Find consistent candidates.
2. Verify candidates.
With consistent candidates, verification is
linear.
Witnesses
Vishkin introduced the duel to choose
between two candidates by checking
the value of a witness.
Alphabet-independent method.
Witnesses
Preprocess pattern:
Compute witness for each location of
self-overlap.
Size of witness table:
$|\rho|$ if P is periodic,
$m/2$ otherwise.
Witnesses
WIT[i] = any k such that P[k] $\ne$ P[k-i+1].
WIT[i] = 0, if there is no such k.
[Figure: a pattern and its witness table.]
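The witness definition can be illustrated by a brute-force computation. This is a sketch under assumptions: the name `witness_table` is hypothetical, indices are 1-indexed as in the definition (entry 0 of the returned list is unused), and the table is computed quadratically for every position, whereas the slides only need a truncated table and efficient preprocessing.

```python
def witness_table(P):
    """wit[i] = smallest k with P[k] != P[k-i+1] (1-indexed), or 0 if none."""
    m = len(P)
    wit = [0] * (m + 1)  # wit[0] is unused so that indices match the slides
    for i in range(2, m + 1):          # i = 1 overlaps P with itself exactly
        for k in range(i, m + 1):
            # P[k] and P[k-i+1] in 1-indexed terms are P[k-1] and P[k-i] here
            if P[k - 1] != P[k - i]:
                wit[i] = k
                break
    return wit


print(witness_table("abab"))
```

For P = abab, position 3 has no witness because the pattern overlaps itself there, which reflects its period of length 2.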
Witnesses
Let j>i.
Candidates i and j are consistent if
they are sufficiently far from each
other
OR
WIT[j-i]=0.
Duel
Scan text:
If pair of candidates is close and
inconsistent, perform duel to
eliminate one (or both).
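Sufficient to identify pairwise
consistent candidates: transitivity of
consistent positions.
[Figure: duel at candidate i between pattern P and text T; the two overlapping pattern copies disagree at the witness position (a versus b), and the text character "?" decides.]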
2D Pattern Matching
$\Sigma$ is an alphabet.
Input:
Text $T[1 \ldots n, 1 \ldots n]$
Pattern $P[1 \ldots m, 1 \ldots m]$
$p_{ij}, t_{ij} \in \Sigma$.
Output:
All $(i, j)$ such that
$T[i+k, j+l] = P[k+1, l+1],\ 0 \le k, l < m$
[Figure: example input, a 2D pattern and text over the alphabet $\Sigma = \{A, B\}$.]
Bird / Baker
First linear-time 2D pattern matching
algorithm.
View each pattern row as a
metacharacter to linearize problem.
Convert 2D pattern matching to 1D.
Bird / Baker
Preprocess pattern:
Name rows of pattern using AC
automaton.
Using names, pattern has 1D
representation.
Construct KMP automaton of pattern.
Identical rows receive identical names.
Bird / Baker
Scan text:
Name positions of text that match a
row of pattern, using AC automaton
within each row.
Run KMP on named columns of text.
Since the 1D names are unique, only one
name can be given to a text location.
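The two phases above (name the rows, then match names down the columns) can be sketched in Python. This is a simplified sketch under stated assumptions: row naming uses a dictionary lookup of each length-m text substring instead of an AC automaton, so it does not achieve the algorithm's running time; the name `bird_baker_sketch` is hypothetical; T and P are lists of equal-length strings; results are 1-indexed (row, column) of the top-left corner.

```python
def bird_baker_sketch(T, P):
    """Find all occurrences of the square pattern P in the text T."""
    m = len(P)
    # Name the distinct pattern rows; identical rows receive identical names.
    names = {}
    for row in P:
        names.setdefault(row, len(names) + 1)
    pat = [names[row] for row in P]      # 1D representation of the pattern

    # KMP failure table over the sequence of row names.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and pat[i] != pat[k]:
            k = fail[k - 1]
        if pat[i] == pat[k]:
            k += 1
        fail[i] = k

    hits = []
    for c in range(len(T[0]) - m + 1):
        k = 0
        for r in range(len(T)):
            # Name of the text location (r, c); 0 means no pattern row matches.
            name = names.get(T[r][c:c + m], 0)
            while k and name != pat[k]:
                k = fail[k - 1]
            if name == pat[k]:
                k += 1
            if k == m:
                hits.append((r - m + 2, c + 1))  # 1-indexed top-left corner
                k = fail[k - 1]
    return sorted(hits)


T = ["ABAB", "BABA", "ABAB", "BABA"]
P = ["AB", "BA"]
print(bird_baker_sketch(T, P))
```

Since all pattern rows have length m, at most one distinct row can match at a given text location, which is why a single name per location suffices.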
Bird / Baker
Complexity of Bird / Baker algorithm:
$O(n^2 \log |\Sigma|)$
Alphabet-dependent.
Real-time, since it scans each text character once.
Can be used for dictionary matching:
2D Witnesses
Amir et al.: a 2D witness table can be
used for linear-time, linear-space,
alphabet-independent 2D matching.
The order of duels is significant.
Duels are performed in 2 waves over
text.
Indexing
Index text
Suffix Tree
Suffix Array
Suffix Trie
[Figure: suffix trie of banana$ with leaves suf1 through suf7.]
T = banana
suf1 = banana$
suf2 = anana$
suf3 = nana$
suf4 = ana$
suf5 = na$
suf6 = a$
suf7 = $
One leaf per suffix.
An edge represents one character.
Concatenation of edge-labels on the path from the root to
leaf i spells the suffix that starts at position i.
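A suffix trie with these properties can be built by inserting every suffix character by character. The sketch below is illustrative: nested dictionaries stand in for trie nodes, the key "leaf" (an assumed convention, distinct from any single-character edge label) stores the 1-indexed suffix number, and the construction is quadratic in time and space.

```python
def build_suffix_trie(text):
    """Insert every suffix of text; each edge is labeled by one character."""
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
        node["leaf"] = i + 1  # leaf i spells the suffix starting at position i
    return root


def is_substring(trie, pattern):
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True


def count_leaves(node):
    """Count leaf markers in the subtree rooted at node."""
    return sum(1 if key == "leaf" else count_leaves(child)
               for key, child in node.items())


trie = build_suffix_trie("banana$")
print(count_leaves(trie), is_substring(trie, "nan"), is_substring(trie, "nab"))
```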
[Figure: suffix tree of T = banana$ with leaves suf1 through suf7. Edge labels are substrings of T stored as index pairs: [1,7] = banana$, [2,2] = a, [3,4] = na, [5,7] = na$, [7,7] = $.]
              Date  Innovation                                               Scan Direction
Weiner        1973  (missing)                                                Right to left
McCreight     1976  Alphabet-independent suffix links, more efficient        Left to right
Ukkonen       1995  Online linear-time construction, represents current end  Left to right
Amir and Nor  2008  Real-time construction                                   Left to right
Index of Patterns
Lowest Common Ancestor (LCA) queries
can be answered in constant time if
the tree is preprocessed accordingly.
In a suffix tree, the LCA corresponds to
the Longest Common Prefix (LCP) of the
strings represented by the leaves.
Index of Patterns
To index several patterns:
Concatenate patterns with unique characters
separating them and build suffix tree.
Problem: inserts meaningless suffixes that span
several patterns.
OR
Build a generalized suffix tree: a single structure
for the suffixes of the individual patterns.
Can be constructed with Ukkonen's algorithm.
Suffix Array
The Suffix Array stores lexicographic order of
suffixes.
More space efficient than suffix tree.
Can locate all occurrences of a substring by
binary search.
With a Longest Common Prefix (LCP) array, searches
can be even more efficient.
The LCP array stores the length of the longest common prefix
between two adjacent suffixes in the suffix array.
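Locating a substring by binary search over the suffix array can be sketched as follows. This is a minimal Python sketch under assumptions: the suffix array is built by naive sorting (not one of the linear-time constructions), the names `build_suffix_array` and `locate` are hypothetical, and positions are 1-indexed. The LCP-based speedup is not shown.

```python
def build_suffix_array(text):
    """1-indexed suffix start positions in lexicographic order (naive sort)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def locate(text, sa, pattern):
    """Return all 1-indexed occurrences of pattern, using two binary searches."""
    m = len(pattern)

    def prefix(rank):
        start = sa[rank] - 1
        return text[start:start + m]   # first m characters of that suffix

    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "mississippi"
sa = build_suffix_array(text)
print(sa)
print(locate(text, sa, "ssi"))
```

All suffixes that begin with the pattern are adjacent in the suffix array, so the two searches delimit the full set of occurrences.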
Suffix Array

Index  Suffix
1      mississippi
2      ississippi
3      ssissippi
4      sissippi
5      issippi
6      ssippi
7      sippi
8      ippi
9      ppi
10     pi
11     i

Sort suffixes alphabetically:

Index  Suffix       LCP
11     i            0
8      ippi         1
5      issippi      1
2      ississippi   4
1      mississippi  0
10     pi           0
9      ppi          1
7      sippi        0
4      sissippi     2
6      ssippi       1
3      ssissippi    3
Suffix array
T = mississippi
Index    1   2   3   4   5   6   7   8   9  10  11
Suffix  11   8   5   2   1  10   9   7   4   6   3
LCP      0   1   1   4   0   0   1   0   2   1   3
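The LCP values for T = mississippi can be reproduced with a linear-time pass over the text once the suffix array is known. The sketch below uses the technique of Kasai et al., which the slides do not name, so treat it as a swapped-in illustration; the suffix array itself is built by naive sorting, and the function names are assumptions. LCP[r] is the common prefix length of the suffix at rank r and its predecessor in sorted order.

```python
def build_suffix_array(text):
    """1-indexed suffix start positions in lexicographic order (naive sort)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def lcp_array(text, sa):
    """lcp[r] = longest common prefix length of suffixes sa[r-1] and sa[r]."""
    n = len(text)
    rank = [0] * (n + 1)               # rank[start] = position of suffix in sa
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0                              # common prefix length carried over
    for start in range(1, n + 1):      # visit suffixes in text order
        r = rank[start]
        if r == 0:
            h = 0                      # smallest suffix has no predecessor
            continue
        prev = sa[r - 1]
        while (start - 1 + h < n and prev - 1 + h < n
               and text[start - 1 + h] == text[prev - 1 + h]):
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                     # next suffix shares at least h - 1
    return lcp


text = "mississippi"
sa = build_suffix_array(text)
print(sa)
print(lcp_array(text, sa))
```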
                        Date  Complexity  Innovation
Manber, Myers           1993  O(n log n)  (missing)
Kärkkäinen and Sanders  2003  O(n)        Linear-time
Ko and Aluru            2003  O(n)        Linear-time
(missing)               2003  O(n)        Linear-time
Compressed Indices
Suffix Tree: O(n) words = O(n log n) bits
Compressed suffix tree
Grossi and Vitter (2000)
O(n) space.
Sadakane (2007)
$O(n \log |\Sigma|)$ space.
Supports all suffix tree operations efficiently.
Slowdown of only polylog(n).
Compressed Indices
Suffix array is an array of n indices, which is stored in:
O(n) words = O(n log n) bits
Compressed Suffix Array (CSA)
Grossi and Vitter (2000)
$O(n \log |\Sigma|)$ bits
access time increased from O(1) to O(log n)
Sadakane (2003)
Pattern matching as efficient as in uncompressed
SA.
O(n log H0) bits
Compressed self-index
Compressed Indices
FM index
Ferragina and Manzini (2005)
Self-indexing data structure
First compressed suffix array that
respects the high-order empirical entropy
Size relative to compressed text length.
Improved by Navarro and Mäkinen (2007)
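The query mechanism of an FM index, backward search over the Burrows-Wheeler transform, can be illustrated with an uncompressed toy version. This is a sketch under assumptions, not the data structure of Ferragina and Manzini: the transform is built from sorted rotations, occurrence counts are computed by scanning rather than from compressed rank structures, the text must end with a unique `$` that sorts before all other characters, and the function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (text ends with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text by backward search on the BWT."""
    last = bwt(text)
    # first_rank[c] = number of characters in the text strictly smaller than c
    first_rank, total = {}, 0
    for c in sorted(set(last)):
        first_rank[c] = total
        total += last.count(c)

    def occ(c, i):
        return last[:i].count(c)       # naive rank query on the BWT

    lo, hi = 0, len(last)              # half-open range of sorted rotations
    for c in reversed(pattern):        # extend the match one character leftwards
        if c not in first_rank:
            return 0
        lo = first_rank[c] + occ(c, lo)
        hi = first_rank[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


print(bwt("mississippi$"))
print(count_occurrences("mississippi$", "ssi"))
```

The search touches only the transformed text and small count tables, which is what lets a compressed representation of them serve as a self-index.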
Word-Based Index
Text size n contains k distinct words
Index a subset of positions that
correspond to word beginnings
With O(n) working space can index entire
text and discard unnecessary positions.
Desired complexity:
O(k) space.
O(n) time will always be needed.
Problem: missing suffix links.
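The idea of indexing only word beginnings can be sketched as a sparse suffix array. This is an illustrative Python sketch, not a space-efficient construction: words are assumed to be separated by single spaces, sorting is naive, positions are 0-indexed, and the names `word_suffix_array` and `find_word_aligned` are hypothetical.

```python
def word_suffix_array(text):
    """Sort only the suffixes that start at a word beginning."""
    starts = [i for i in range(len(text))
              if text[i] != " " and (i == 0 or text[i - 1] == " ")]
    return sorted(starts, key=lambda i: text[i:])


def find_word_aligned(text, wsa, pattern):
    """Return word-aligned occurrences of pattern (0-indexed), via binary search."""
    m = len(pattern)
    lo, hi = 0, len(wsa)
    while lo < hi:                      # first indexed suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[wsa[mid]:wsa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(wsa) and text.startswith(pattern, wsa[lo]):
        hits.append(wsa[lo])
        lo += 1
    return sorted(hits)


text = "to be or not to be"
wsa = word_suffix_array(text)
print(wsa)
print(find_word_aligned(text, wsa, "to be"))
```

The index holds one entry per word rather than one per character, so occurrences that start inside a word, such as the "o" in "to", are not reported.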
[Table: prior results on word-based indexing, with columns Date and Results; the only surviving row label is Kärkkäinen and Ukkonen.]
Research Directions
Problems we are considering:
Small space dictionary matching.
Time-space optimal 2D compressed
dictionary matching algorithm.
Compressed parameterized matching.
Self-indexing word-based data structure.
Dynamic suffix array in O(n)
construction time.
Small-Space
Applications arise in which storage space
is limited.
Many innovative algorithms exist for
single pattern matching using small
additional space:
Galil and Seiferas (1981) developed first
time-space optimal algorithm for pattern
matching.
Rytter (2003) adapted the KMP algorithm to
work in O(1) additional space, O(n) time.
Research Directions
Fast dictionary matching algorithms exist
for 1D and 2D. They achieve expected
sublinear time.
No deterministic dictionary matching
method is known that works in linear time and small
space.
We believe that recent results in
compressed self-indexing will facilitate the
development of a solution to the small
space dictionary matching problem.
Compressed Matching
Data is compressed to save space.
Lossless compression schemes can be
reversed without loss of data.
Pattern matching cannot simply be run on
compressed text: a pattern can span a
compressed character.
LZ78: data can be uncompressed in
time and space proportional to the
uncompressed data.
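LZ78 compression and its reversal can be sketched in a few lines. This is a minimal Python sketch of the textbook scheme, with assumptions: the output is a list of (phrase index, next character) pairs rather than a bit-level encoding, index 0 denotes the empty phrase, and the function names are illustrative.

```python
def lz78_compress(text):
    """Encode text as (index of longest known phrase, next character) pairs."""
    dictionary = {"": 0}               # phrase -> index; 0 is the empty phrase
    output, phrase = [], ""
    for c in text:
        if phrase + c in dictionary:
            phrase += c                # keep extending the current phrase
        else:
            output.append((dictionary[phrase], c))
            dictionary[phrase + c] = len(dictionary)
            phrase = ""
    if phrase:                         # leftover phrase already in the dictionary
        output.append((dictionary[phrase], ""))
    return output


def lz78_decompress(pairs):
    """Rebuild the text by replaying the phrase dictionary."""
    phrases = [""]
    out = []
    for index, c in pairs:
        phrase = phrases[index] + c
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_compress("abababab")
print(pairs)
print(lz78_decompress(pairs))
```

Each pair stands for a whole phrase, so a pattern occurrence in the original text may straddle several pairs or lie inside one.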
Research Directions
Amir et al. (2003) devised an algorithm
for 2D LZ78 compressed matching.
They define strongly inplace as a criterion
for the algorithm: the extra space is
proportional to the optimal compression
of all strings of the given length.
We are seeking a time-space optimal
solution to 2D compressed dictionary
matching.
Thank you!