
Pattern Matching Algorithms:
An Overview
Shoshana Neuburger
The Graduate Center, CUNY
9/15/2009

Overview

Pattern Matching in 1D
Dictionary Matching
Pattern Matching in 2D
Indexing
Suffix Tree
Suffix Array
Research Directions

What is Pattern Matching?


Given a pattern and text,
find the pattern in the text.


What is Pattern Matching?

Σ is an alphabet.
Input:
Text T = t1 t2 … tn
Pattern P = p1 p2 … pm
pi, ti ∈ Σ.

Output:
All i such that T[i+k] = P[k+1], 0 ≤ k < m.

Pattern Matching - Example

Input: P = cagc
Σ = {a, g, c, t}
T = acagcatcagcagctagcat

Output: {2, 8, 11}

Pattern Matching Algorithms

Naïve Approach:
Compare pattern to text at each location.
O(mn) time.

More efficient algorithms utilize information from previous comparisons.
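To make the naïve approach concrete, here is a minimal Python sketch (mine, not from the slides) that compares the pattern against the text at every alignment and reports 1-based match positions, using the cagc example from the earlier slide.

```python
def naive_match(text, pattern):
    """Report all 1-based positions where pattern occurs in text.

    Compares the pattern against the text at every alignment,
    so the worst-case cost is O(mn)."""
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            occurrences.append(i + 1)   # 1-based, as in the slides
    return occurrences

print(naive_match("acagcatcagcagctagcat", "cagc"))  # [2, 8, 11]
```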

Pattern Matching Algorithms

Linear-time methods have two stages:
1. Preprocess pattern in O(m) time and space.
2. Scan text in O(n) time and space.

Knuth, Morris, Pratt (1977): automata method
Boyer, Moore (1977): can be sublinear

KMP Automaton

(Figure: KMP automaton for the pattern P = ababcb.)
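The slide shows KMP as an automaton; the sketch below uses the equivalent failure-table formulation (my choice of presentation, not the slide's diagram). The table is built in O(m) and the text is scanned in O(n).

```python
def failure_table(p):
    """fail[i] = length of the longest proper border of p[:i+1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, p):
    """All 1-based occurrences of p in text, O(n + m) time."""
    fail, k, out = failure_table(p), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != p[k]:
            k = fail[k - 1]
        if c == p[k]:
            k += 1
        if k == len(p):
            out.append(i - len(p) + 2)   # 1-based start of the match
            k = fail[k - 1]
    return out

print(kmp_search("abababcbab", "ababcb"))  # [3]
```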

Dictionary Matching

Σ is an alphabet.
Input:
Text T = t1 t2 … tn
Dictionary of patterns D = {P1, P2, …, Pk}
All characters in patterns and text belong to Σ.

Output:
All i, j such that T[i+l] = Pj[l+1], 0 ≤ l < mj, 1 ≤ j ≤ k,
where mj = |Pj|.

Dictionary Matching Algorithms

Naïve Approach:
Use an efficient pattern matching algorithm for each pattern in the dictionary.
O(kn) time.

More efficient algorithms process the text once.

AC Automaton

Aho and Corasick extended the KMP automaton to dictionary matching.

d = |P1| + |P2| + … + |Pk|

Preprocessing time: O(d)
Matching time: O(n log |Σ| + k).
Independent of dictionary size!

AC Automaton
D = {ab, ba, bab, babb, bb}

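As a rough illustration of the Aho-Corasick idea, here is a compact dictionary-based Python sketch of the goto/fail/output functions. It is a simplified stand-in for the automaton drawn on the slide, shown on the dictionary D = {ab, ba, bab, babb, bb}.

```python
from collections import deque

def build_ac(patterns):
    """Goto/fail/output tables for an Aho-Corasick automaton (dict-based sketch)."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:                      # phase 1: build the keyword trie
        s = 0
        for c in p:
            if c not in goto[s]:
                goto[s][c] = len(goto)
                goto.append({}); fail.append(0); out.append([])
            s = goto[s][c]
        out[s].append(p)
    queue = deque(goto[0].values())         # phase 2: failure links by BFS
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][c] if c in goto[f] and goto[f][c] != t else 0
            out[t] += out[fail[t]]          # inherit outputs along the fail link
    return goto, fail, out

def ac_search(text, patterns):
    """Report (1-based end position, pattern) for every dictionary match, one pass."""
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for i, c in enumerate(text, 1):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i, p))
    return hits

print(ac_search("babbab", ["ab", "ba", "bab", "babb", "bb"]))
# [(2, 'ba'), (3, 'bab'), (3, 'ab'), (4, 'babb'), (4, 'bb'), (5, 'ba'), (6, 'bab'), (6, 'ab')]
```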

Dictionary Matching

The KMP automaton does not depend on alphabet size, while the AC automaton branches on the alphabet.
Dori, Landau (2006): the AC automaton is built in linear time for integer alphabets.
Breslauer (1995) eliminates the log factor in the text scanning stage.

Periodicity
A crucial task in preprocessing stage of
most pattern matching algorithms:
computing periodicity.
Many forms
failure table
witnesses


Periodicity
A periodic pattern can be
superimposed on itself without
mismatch before its midpoint.
Why is periodicity useful?
Can quickly eliminate many
candidates for pattern occurrence.


Periodicity

Definition:
S is periodic if S = π′ π^k, where k ≥ 2 and π′ is a proper suffix of π.

S is periodic if its longest prefix that is also a suffix is at least half |S|.
The shortest period corresponds to the longest border.

Periodicity - Example

S = abcabcabcab
|S| = 11
Longest border of S: b = abcabcab; |b| = 8 ≥ |S|/2, so S is periodic.
Shortest period of S: π = abc; |π| = 3 ≤ |S|/2, so S is periodic.
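A short sketch of how the border/period computation might look in practice. It reuses the KMP-style failure computation and is illustrative only, not code from the slides.

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix."""
    fail = [0] * len(s)          # KMP failure table
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return fail[-1]

S = "abcabcabcab"
b = longest_border(S)                 # 8  -> border abcabcab
period = len(S) - b                   # 3  -> shortest period abc
print(b, period, b >= len(S) / 2)     # 8 3 True, so S is periodic
```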

Witnesses

Popular paradigm in pattern matching:
1. Find consistent candidates.
2. Verify candidates.
With consistent candidates, verification is linear.

Witnesses
Vishkin introduced the duel to choose
between two candidates by checking
the value of a witness.
Alphabet-independent method.


Witnesses

Preprocess pattern:
Compute a witness for each location of self-overlap.
Size of witness table:
|π|, if P is periodic;
m/2, otherwise.

Witnesses

WIT[i] = any k such that P[k] ≠ P[k-i+1].
WIT[i] = 0, if there is no such k.

k is a witness against i being a period of P.

Example: a pattern and its witness table (figure).
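A naive sketch of witness-table computation is shown below. Note the indexing convention here treats i directly as a candidate period (the slide's P[k-i+1] convention is shifted by one), and linear-time constructions exist; this quadratic version only illustrates the definition.

```python
def witness_table(p):
    """wit[i] holds a 1-based position k with p[k] != p[k-i], i.e. a witness
    against i being a period of p; 0 means i really is a period (no witness)."""
    m = len(p)
    wit = [0] * (m + 1)              # indexed by candidate period i = 1 .. m
    for i in range(1, m + 1):
        for k in range(i, m):        # 0-based: compare p[k] with p[k-i]
            if p[k] != p[k - i]:
                wit[i] = k + 1       # store the witness position 1-based
                break
    return wit

# For abab: 2 and 4 are periods (no witness), 1 and 3 are not.
print(witness_table("abab")[1:])     # [2, 0, 4, 0]
```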

Witnesses
Let j>i.
Candidates i and j are consistent if
they are sufficiently far from each
other
OR
WIT[j-i]=0.


Duel

Scan text:
If a pair of candidates is close and inconsistent, perform a duel to eliminate one (or both).
Sufficient to identify pairwise consistent candidates: transitivity of consistent positions.

(Figure: a duel between two candidate positions in T, decided by the witness character in P.)

2D Pattern Matching

Σ is an alphabet.
Input:
Text T[1…n, 1…n]
Pattern P[1…m, 1…m]
pij, tij ∈ Σ.

(Figure: an MRI image as an example of a 2D text.)

Output:
All (i, j) such that T[i+k, j+l] = P[k+1, l+1], 0 ≤ k, l < m.

2D Pattern Matching - Example

Input: a pattern and a text over Σ = {A, B} (figure).

Output: {(1,4), (2,2), (4,3)}

Bird / Baker
First linear-time 2D pattern matching
algorithm.
View each pattern row as a
metacharacter to linearize problem.
Convert 2D pattern matching to 1D.


Bird / Baker
Preprocess pattern:
Name rows of pattern using AC
automaton.
Using names, pattern has 1D
representation.
Construct KMP automaton of pattern.
Identical rows receive identical names.

Bird / Baker
Scan text:
Name positions of text that match a
row of pattern, using AC automaton
within each row.
Run KMP on named columns of text.
Since the 1D names are unique, only one
name can be given to a text location.

Bird / Baker - Example

Preprocess pattern:
Name rows of pattern using AC automaton.
Using names, pattern has 1D representation.
Construct KMP automaton of pattern.

(Figure: pattern rows over {A, B} with their 1D names.)

Bird / Baker - Example

Scan text:
Name positions of text that match a row of pattern, using AC automaton within each row.
Run KMP on named columns of text.

(Figure: text over {A, B} with the row names assigned to each text position.)
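The sketch below follows the Bird/Baker outline above, but substitutes a plain dictionary lookup of whole rows for the AC automaton and reuses failure_table from the earlier KMP sketch. It is a simplified illustration (row naming here costs O(n²m) rather than the automaton's linear scan), not the authors' algorithm.

```python
def bird_baker(text, pattern):
    """2D matching sketch: name pattern rows, then run KMP down each text column.
    Assumes all pattern rows have the same length m and all text rows length n."""
    m, n = len(pattern[0]), len(text[0])
    names = {}
    # identical rows receive identical names
    pat1d = [names.setdefault(row, len(names) + 1) for row in pattern]

    # named[i][j] = name of the pattern row that ends at text position (i, j), else 0
    named = [[0] * n for _ in text]
    for i, row in enumerate(text):
        for j in range(m - 1, n):
            named[i][j] = names.get(row[j - m + 1:j + 1], 0)

    fail = failure_table(pat1d)        # failure function from the earlier KMP sketch
    occ = []
    for j in range(m - 1, n):          # KMP over each named column
        k = 0
        for i in range(len(text)):
            while k and named[i][j] != pat1d[k]:
                k = fail[k - 1]
            if named[i][j] == pat1d[k]:
                k += 1
            if k == len(pat1d):
                occ.append((i - len(pat1d) + 2, j - m + 2))   # 1-based top-left corner
                k = fail[k - 1]
    return occ

T = ["ababa", "babab", "ababa", "babab"]
P = ["aba", "bab"]
print(bird_baker(T, P))   # [(1, 1), (3, 1), (2, 2), (1, 3), (3, 3)]
```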

Bird / Baker

Complexity of Bird / Baker algorithm:
O(n² log |Σ|) time and space.

Alphabet-dependent.
Real-time, since it scans each text character once.
Can be used for dictionary matching: replace KMP with the AC automaton.

2D Witnesses

Amir et al.: a 2D witness table can be used for linear-time and space, alphabet-independent 2D matching.
The order of duels is significant.
Duels are performed in 2 waves over the text.

Indexing
Index text
Suffix Tree
Suffix Array

Find pattern in O(m) time


Useful paradigm when text will be
searched for several patterns.

Suffix Trie

T = banana
Suffixes: banana$ (suf1), anana$ (suf2), nana$ (suf3), ana$ (suf4), na$ (suf5), a$ (suf6), $ (suf7).

(Figure: suffix trie of banana$.)

One leaf per suffix.
An edge represents one character.
Concatenation of edge-labels on the path from the root to leaf i spells the suffix that starts at position i.
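A naive suffix-trie sketch, quadratic in size and for illustration only; the dictionary representation and leaf markers are my own choices, not from the slides.

```python
def suffix_trie(text):
    """Naive suffix trie: insert every suffix character by character, O(n^2) nodes.
    Each node is a dict mapping one character to a child; the node reached at the
    end of a suffix records that suffix's 1-based starting position."""
    text = text + "$"
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
        node["leaf"] = i + 1        # 1-based suffix number
    return root

trie = suffix_trie("banana")
node = trie
for c in "banana$":                 # walk b-a-n-a-n-a-$ from the root
    node = node[c]
print(node["leaf"])                 # 1 (the suffix starting at position 1)
```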

Suffix Tree

T = banana

(Figure: suffix tree of banana$; edges are labeled with index pairs such as [1,7], [2,2], [3,4], [5,7], [7,7].)

Compact representation of trie.
A node with one child is merged with its parent.
Up to n internal nodes.
O(n) space by using indices to label edges.

Suffix Tree Construction

Naïve Approach: O(n²) time
Linear-time algorithms:

Weiner (1973): first linear-time algorithm, alphabet-dependent suffix links; scans right to left.
McCreight (1976): alphabet-independent suffix links, more efficient; scans left to right.
Ukkonen (1995): online linear-time construction, represents the current end; scans left to right.
Amir and Nor (2008): real-time construction; scans left to right.

Suffix Tree Construction


Linear-time suffix tree construction
algorithms rely on suffix links to facilitate
traversal of tree.
A suffix link is a pointer from a node
labeled xS to a node labeled S; x is a
character and S a possibly empty
substring.
Alphabet-dependent suffix links point from
a node labeled S to a node labeled xS, for
each character x.

Index of Patterns

Can answer Lowest Common Ancestor (LCA) queries in constant time if the tree is preprocessed accordingly.
In a suffix tree, LCA corresponds to the Longest Common Prefix (LCP) of the strings represented by the leaves.

Index of Patterns

To index several patterns:
Concatenate patterns with unique characters separating them and build a suffix tree.
Problem: inserts meaningless suffixes that span several patterns.
OR
Build a generalized suffix tree: a single structure for the suffixes of the individual patterns.
Can be constructed with Ukkonen's algorithm.

Suffix Array
The Suffix Array stores lexicographic order of
suffixes.
More space efficient than suffix tree.
Can locate all occurrences of a substring by
binary search.
With Longest Common Prefix (LCP) array can
perform even more efficient searches.
LCP array stores longest common prefix
between two adjacent suffixes in suffix array.

Suffix Array

Sort the suffixes of T = mississippi alphabetically:

Index  Suffix           Index  Suffix        LCP
1      mississippi      11     i             0
2      ississippi       8      ippi          1
3      ssissippi        5      issippi       1
4      sissippi         2      ississippi    4
5      issippi          1      mississippi   0
6      ssippi           10     pi            0
7      sippi            9      ppi           1
8      ippi             7      sippi         0
9      ppi              4      sissippi      2
10     pi               6      ssippi        1
11     i                3      ssissippi     3

Suffix Array

T = mississippi

Index:   1   2   3   4   5   6   7   8   9  10  11
Suffix: 11   8   5   2   1  10   9   7   4   6   3
LCP:     0   1   1   4   0   0   1   0   2   1   3
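The arrays above can be reproduced with a naive construction sketch (sorting the suffixes directly, far from the linear-time algorithms cited later in these slides):

```python
def suffix_array(text):
    """Naive construction: sort the suffixes directly.
    Returns 1-based starting positions in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def lcp_array(text, sa):
    """LCP[i] = longest common prefix of the suffixes at sa[i-1] and sa[i]."""
    def lcp(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k
    return [0] + [lcp(text[sa[i - 1] - 1:], text[sa[i] - 1:]) for i in range(1, len(sa))]

T = "mississippi"
sa = suffix_array(T)
print(sa)                 # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(lcp_array(T, sa))   # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```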

Search in Suffix Array


O(m log n):
Idea: two binary searches
- search for leftmost position of X
- search for rightmost position of X
In between are all suffixes that begin
with X
With LCP array: O(m + log n) search.
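A sketch of the O(m log n) search: two binary searches over the suffix array from the previous sketch locate the leftmost and rightmost suffixes that begin with X (the function and variable names are mine):

```python
def find_occurrences(text, sa, x):
    """All starting positions of substring x via two binary searches over sa."""
    def prefix_at(idx):                  # first |x| characters of the suffix sa[idx]
        s = sa[idx] - 1
        return text[s:s + len(x)]

    lo, hi = 0, len(sa)
    while lo < hi:                       # leftmost suffix with prefix >= x
        mid = (lo + hi) // 2
        if prefix_at(mid) < x:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    lo, hi = left, len(sa)
    while lo < hi:                       # leftmost suffix with prefix > x
        mid = (lo + hi) // 2
        if prefix_at(mid) <= x:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[left:lo])           # everything in between begins with x

T = "mississippi"
sa = suffix_array(T)                     # from the previous sketch
print(find_occurrences(T, sa, "issi"))   # [2, 5]
```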

Suffix Array Construction

Naïve Approach: O(n²) time
Indirect Construction:
preorder traversal of suffix tree
LCA queries for LCP.
Problem: does not achieve better space efficiency.

Suffix Array Construction

Direct construction algorithms:

Manber, Myers (1993): O(n log n); sort and search, KMR renaming.
Kärkkäinen and Sanders (2003): O(n); linear-time.
Ko and Aluru (2003): O(n); linear-time.
Kim et al. (2003): O(n); linear-time.

LCP array construction: range-minima queries.

Compressed Indices

Suffix Tree: O(n) words = O(n log n) bits
Compressed suffix tree:
Grossi and Vitter (2000): O(n) space.
Sadakane (2007): O(n log |Σ|) space.
Supports all suffix tree operations efficiently.
Slowdown of only polylog(n).

Compressed Indices

Suffix array is an array of n indices, which is stored in O(n) words = O(n log n) bits.
Compressed Suffix Array (CSA):
Grossi and Vitter (2000): O(n log |Σ|) bits; access time increased from O(1) to O(log n).
Sadakane (2003): pattern matching as efficient as in the uncompressed SA; O(n log H0) bits; compressed self-index.

Compressed Indices
FM index
Ferragina and Manzini (2005)
Self-indexing data structure
First compressed suffix array that
respects the high-order empirical entropy
Size relative to compressed text length.
Improved by Navarro and Mäkinen (2007)


Dynamic Suffix Tree

Choi and Lam (1997)
Strings can be inserted or deleted efficiently.
Update time proportional to the length of the string inserted/deleted.
No edges labeled by a deleted string.
Two-way pointer for each edge, which can be
done in space linear in the size of the tree.


Dynamic Suffix Array

Recent work by Salson et al.
Can update suffix array after
construction if text changes.
More efficient than rebuilding suffix
array.
Open problems:
Worst case O(n log n).
No online algorithm yet.

Word-Based Index

Text of size n contains k distinct words.
Index a subset of positions that correspond to word beginnings.
With O(n) working space, can index the entire text and discard unnecessary positions.
Desired complexity: O(k) space; will always need O(n) time.
Problem: missing suffix links.

Word-Based Suffix Tree

Construction Algorithms:

Kärkkäinen and Ukkonen (1996): O(n) time and O(n/j) space construction of a sparse suffix tree (every jth suffix).
Andersson et al. (1999): expected linear-time and k-space construction of a word-based suffix tree for k words.
Inenaga and Takeda (2006): online, O(n) time and k-space construction of a word-based suffix tree for k words.

Word-Based Suffix Array


Ferragina and Fischer (2007) word-based
suffix array construction algorithm
Time and space optimal construction.
Computation of word-based LCP array in
O(n) time and O(k) space.
Alternative algorithm for construction of
word-based suffix tree.
Searching as efficient as ordinary suffix array.

Research Directions
Problems we are considering:
Small space dictionary matching.
Time-space optimal 2D compressed
dictionary matching algorithm.
Compressed parameterized matching.
Self-indexing word-based data structure.
Dynamic suffix array in O(n)
construction time.

Small-Space
Applications arise in which storage space
is limited.
Many innovative algorithms exist for
single pattern matching using small
additional space:
Galil and Seiferas (1981) developed the first time-space optimal algorithm for pattern matching.
Rytter (2003) adapted the KMP algorithm to
work in O(1) additional space, O(n) time.

Research Directions
Fast dictionary matching algorithms exist for 1D and 2D; they achieve expected sublinear time.
No deterministic dictionary matching method works in linear time and small space.
We believe that recent results in
compressed self-indexing will facilitate the
development of a solution to the small
space dictionary matching problem.

Compressed Matching
Data is compressed to save space.
Lossless compression schemes can be
reversed without loss of data.
Pattern matching cannot be done directly in compressed text: a pattern can span a compressed character.
LZ78: data can be uncompressed in
time and space proportional to the
uncompressed data.

Research Directions
Amir et al. (2003) devised an algorithm for 2D LZ78 compressed matching.
They define "strongly inplace" as a criterion for the algorithm: the extra space is proportional to the optimal compression of all strings of the given length.
We are seeking a time-space optimal
solution to 2D compressed dictionary
matching.

Thank you!

