You are on page 1of 14

Pattern Matching

a b a c a a b
1
a b a c a b
4 3 2
a b a c a b

Pattern Matching 1
Outline and Reading
Strings (§9.1.1)
Pattern matching algorithms
 Brute-force algorithm (§9.1.2)
 Boyer-Moore algorithm (§9.1.3)
 Knuth-Morris-Pratt algorithm (§9.1.4)

Pattern Matching 2
Strings
A string is a sequence of Let P be a string of size m
characters  A substring P[i .. j] of P is the
Examples of strings: subsequence of P consisting of
the characters with ranks
 Java program between i and j
 HTML document  A prefix of P is a substring of
 DNA sequence the type P[0 .. i]
 Digitized image  A suffix of P is a substring of the
An alphabet  is the set of type P[i ..m  1]
possible characters for a Given strings T (text) and P
family of strings (pattern), the pattern matching
Example of alphabets: problem consists of finding a
substring of T equal to P
 ASCII
 Unicode
Applications:
 Text editors
 {0, 1}
 Search engines
 {A, C, G, T}
 Biological research
Pattern Matching 3
Brute-Force Algorithm
Algorithm BruteForceMatch(T, P)
The brute-force pattern
matching algorithm compares Input text T of size n and pattern
the pattern P with the text T P of size m
for each possible shift of P Output starting index of a
relative to T, until either substring of T equal to P or 1
 a match is found, or if no such substring exists
 all placements of the pattern for i  0 to n  m
have been tried { test shift i of the pattern }
Brute-force pattern matching j0
runs in time O(nm) while j  m  T[i  j]  P[j]
Example of worst case: jj1
 T  aaa … ah
if j  m
 P  aaah
return i {match at i}
 may occur in images and DNA
else
sequences
 unlikely in English text break while loop {mismatch}
return -1 {no match anywhere}
Pattern Matching 4
Boyer-Moore Heuristics
The Boyer-Moore’s pattern matching algorithm is based on two
heuristics
Looking-glass heuristic: Compare P with a subsequence of T
moving backwards
Character-jump heuristic: When a mismatch occurs at T[i]  c
 If P contains c, shift P to align the last occurrence of c in P with T[i]
 Else, shift P to align P[0] with T[i  1]
Example

a p a t t e r n m a t c h i n g a l g o r i t h m

1 3 5 11 10 9 8 7
r i t h m r i t h m r i t h m r i t h m

2 4 6
r i t h m r i t h m r i t h m

Pattern Matching 5
Last-Occurrence Function
Boyer-Moore’s algorithm preprocesses the pattern P and the
alphabet  to build the last-occurrence function L mapping  to
integers, where L(c) is defined as
 the largest index i such that P[i]  c or
 1 if no such index exists
Example:
c a b c d
  {a, b, c, d}
 P abacab L(c) 4 5 3 1

The last-occurrence function can be represented by an array


indexed by the numeric codes of the characters
The last-occurrence function can be computed in time O(m  s),
where m is the size of P and s is the size of 

Pattern Matching 6
The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, ) Case 1: j  1l
. . . . . . a . . . . . .
L  lastOccurenceFunction(P, )
i
im1
jm1 . . . . b a
repeat j l
if T[i]  P[j] mj
if j  0 . . . . b a
return i { match at i }
else j
ii1
jj1 Case 2: 1lj
else . . . . . . a . . . . . .
{ character-jump } i
l  L[T[i]] . a . . b .
i  i  m – min(j, 1l) l j
jm1 m  (1  l)
until i  n  1
return 1 { no match } . a . . b .

1l
Pattern Matching 7
Example

a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b

Pattern Matching 8
Analysis
Boyer-Moore’s algorithm
runs in time O(nm  s) a a a a a a a a a
Example of worst case: 6 5 4 3 2 1
 T  aaa … a b a a a a a
 P  baaa
12 11 10 9 8 7
The worst case may occur in b a a a a a
images and DNA sequences
but is unlikely in English text 18 17 16 15 14 13
Boyer-Moore’s algorithm is b a a a a a
significantly faster than the 24 23 22 21 20 19
brute-force algorithm on b a a a a a
English text

Pattern Matching 9
The KMP Algorithm - Motivation
Knuth-Morris-Pratt’s algorithm
compares the pattern to the
text in left-to-right, but shifts . . a b a a b x . . . . .
the pattern more intelligently
than the brute-force algorithm.
When a mismatch occurs, a b a a b a
what is the most we can shift
the pattern so as to avoid
j
redundant comparisons?
a b a a b a
Answer: the largest prefix of
P[0..j] that is a suffix of P[1..j]
No need to Resume
repeat these comparing
comparisons here
Pattern Matching 10
KMP Failure Function
Knuth-Morris-Pratt’s j 0 1 2 3 4 
algorithm preprocesses the
P[j] a b a a b a
pattern to find matches of
prefixes of the pattern with F(j) 0 0 1 1 2 
the pattern itself
The failure function F(j) is . . a b a a b x . . . . .
defined as the size of the
largest prefix of P[0..j] that is
also a suffix of P[1..j] a b a a b a
Knuth-Morris-Pratt’s j
algorithm modifies the brute-
force algorithm so that if a
a b a a b a
mismatch occurs at P[j]T[i]
we set j  F(j  1) F(j1)

Pattern Matching 11
The KMP Algorithm
The failure function can be Algorithm KMPMatch(T, P)
represented by an array and F  failureFunction(P)
i0
can be computed in O(m) time
j0
At each iteration of the while- while i  n
loop, either if T[i]  P[j]
if j  m  1
 i increases by one, or return i  j { match }
 the shift amount i  j else
increases by at least one ii1
(observe that F(j  1) < j) jj1
else
Hence, there are no more if j  0
than 2n iterations of the j  F[j  1]
while-loop else
ii1
Thus, KMP’s algorithm runs in return 1 { no match }
optimal time O(m  n)

Pattern Matching 12
Computing the Failure
Function
The failure function can be
represented by an array and Algorithm failureFunction(P)
can be computed in O(m) time F[0]  0
The construction is similar to i1
the KMP algorithm itself j0
while i  m
At each iteration of the while- if P[i]  P[j]
loop, either {we have matched j + 1 chars}
F[i]  j + 1
 i increases by one, or
ii1
 the shift amount i  j jj1
increases by at least one else if j  0 then
(observe that F(j  1) < j) {use failure function to shift P}
j  F[j  1]
Hence, there are no more else
than 2m iterations of the F[i]  0 { no match }
while-loop ii1

Pattern Matching 13
Example
a b a c a a b a c c a b a c a b a a b b
1 2 3 4 5 6
a b a c a b
7
a b a c a b
8 9 10 11 12
a b a c a b
13
a b a c a b
j 0 1 2 3 4 
14 15 16 17 18 19
P[j] a b a c a b
a b a c a b
F(j) 0 0 1 0 1 

Pattern Matching 14

You might also like