Professional Documents
Culture Documents
a b a c a a b
1
a b a c a b
4 3 2
a b a c a b
Pattern Matching 1
Outline and Reading
Strings (§9.1.1)
Pattern matching algorithms
Brute-force algorithm (§9.1.2)
Boyer-Moore algorithm (§9.1.3)
Knuth-Morris-Pratt algorithm (§9.1.4)
Pattern Matching 2
Strings
A string is a sequence of Let P be a string of size m
characters A substring P[i .. j] of P is the
Examples of strings: subsequence of P consisting of
the characters with ranks
Java program between i and j
HTML document A prefix of P is a substring of
DNA sequence the type P[0 .. i]
Digitized image A suffix of P is a substring of the
An alphabet is the set of type P[i ..m 1]
possible characters for a Given strings T (text) and P
family of strings (pattern), the pattern matching
Example of alphabets: problem consists of finding a
substring of T equal to P
ASCII
Unicode
Applications:
Text editors
{0, 1}
Search engines
{A, C, G, T}
Biological research
Pattern Matching 3
Brute-Force Algorithm
Algorithm BruteForceMatch(T, P)
The brute-force pattern
matching algorithm compares Input text T of size n and pattern
the pattern P with the text T P of size m
for each possible shift of P Output starting index of a
relative to T, until either substring of T equal to P or 1
a match is found, or if no such substring exists
all placements of the pattern for i 0 to n m
have been tried { test shift i of the pattern }
Brute-force pattern matching j0
runs in time O(nm) while j m T[i j] P[j]
Example of worst case: jj1
T aaa … ah
if j m
P aaah
return i {match at i}
may occur in images and DNA
else
sequences
unlikely in English text break while loop {mismatch}
return -1 {no match anywhere}
Pattern Matching 4
Boyer-Moore Heuristics
The Boyer-Moore’s pattern matching algorithm is based on two
heuristics
Looking-glass heuristic: Compare P with a subsequence of T
moving backwards
Character-jump heuristic: When a mismatch occurs at T[i] c
If P contains c, shift P to align the last occurrence of c in P with T[i]
Else, shift P to align P[0] with T[i 1]
Example
a p a t t e r n m a t c h i n g a l g o r i t h m
1 3 5 11 10 9 8 7
r i t h m r i t h m r i t h m r i t h m
2 4 6
r i t h m r i t h m r i t h m
Pattern Matching 5
Last-Occurrence Function
Boyer-Moore’s algorithm preprocesses the pattern P and the
alphabet to build the last-occurrence function L mapping to
integers, where L(c) is defined as
the largest index i such that P[i] c or
1 if no such index exists
Example:
c a b c d
{a, b, c, d}
P abacab L(c) 4 5 3 1
Pattern Matching 6
The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, ) Case 1: j 1l
. . . . . . a . . . . . .
L lastOccurenceFunction(P, )
i
im1
jm1 . . . . b a
repeat j l
if T[i] P[j] mj
if j 0 . . . . b a
return i { match at i }
else j
ii1
jj1 Case 2: 1lj
else . . . . . . a . . . . . .
{ character-jump } i
l L[T[i]] . a . . b .
i i m – min(j, 1l) l j
jm1 m (1 l)
until i n 1
return 1 { no match } . a . . b .
1l
Pattern Matching 7
Example
a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b
Pattern Matching 8
Analysis
Boyer-Moore’s algorithm
runs in time O(nm s) a a a a a a a a a
Example of worst case: 6 5 4 3 2 1
T aaa … a b a a a a a
P baaa
12 11 10 9 8 7
The worst case may occur in b a a a a a
images and DNA sequences
but is unlikely in English text 18 17 16 15 14 13
Boyer-Moore’s algorithm is b a a a a a
significantly faster than the 24 23 22 21 20 19
brute-force algorithm on b a a a a a
English text
Pattern Matching 9
The KMP Algorithm - Motivation
Knuth-Morris-Pratt’s algorithm
compares the pattern to the
text in left-to-right, but shifts . . a b a a b x . . . . .
the pattern more intelligently
than the brute-force algorithm.
When a mismatch occurs, a b a a b a
what is the most we can shift
the pattern so as to avoid
j
redundant comparisons?
a b a a b a
Answer: the largest prefix of
P[0..j] that is a suffix of P[1..j]
No need to Resume
repeat these comparing
comparisons here
Pattern Matching 10
KMP Failure Function
Knuth-Morris-Pratt’s j 0 1 2 3 4
algorithm preprocesses the
P[j] a b a a b a
pattern to find matches of
prefixes of the pattern with F(j) 0 0 1 1 2
the pattern itself
The failure function F(j) is . . a b a a b x . . . . .
defined as the size of the
largest prefix of P[0..j] that is
also a suffix of P[1..j] a b a a b a
Knuth-Morris-Pratt’s j
algorithm modifies the brute-
force algorithm so that if a
a b a a b a
mismatch occurs at P[j]T[i]
we set j F(j 1) F(j1)
Pattern Matching 11
The KMP Algorithm
The failure function can be Algorithm KMPMatch(T, P)
represented by an array and F failureFunction(P)
i0
can be computed in O(m) time
j0
At each iteration of the while- while i n
loop, either if T[i] P[j]
if j m 1
i increases by one, or return i j { match }
the shift amount i j else
increases by at least one ii1
(observe that F(j 1) < j) jj1
else
Hence, there are no more if j 0
than 2n iterations of the j F[j 1]
while-loop else
ii1
Thus, KMP’s algorithm runs in return 1 { no match }
optimal time O(m n)
Pattern Matching 12
Computing the Failure
Function
The failure function can be
represented by an array and Algorithm failureFunction(P)
can be computed in O(m) time F[0] 0
The construction is similar to i1
the KMP algorithm itself j0
while i m
At each iteration of the while- if P[i] P[j]
loop, either {we have matched j + 1 chars}
F[i] j + 1
i increases by one, or
ii1
the shift amount i j jj1
increases by at least one else if j 0 then
(observe that F(j 1) < j) {use failure function to shift P}
j F[j 1]
Hence, there are no more else
than 2m iterations of the F[i] 0 { no match }
while-loop ii1
Pattern Matching 13
Example
a b a c a a b a c c a b a c a b a a b b
1 2 3 4 5 6
a b a c a b
7
a b a c a b
8 9 10 11 12
a b a c a b
13
a b a c a b
j 0 1 2 3 4
14 15 16 17 18 19
P[j] a b a c a b
a b a c a b
F(j) 0 0 1 0 1
Pattern Matching 14