You are on page 1of 21

String Matching

Algorithms: String Matching: Rajeev Wankar


1

Outline and Reading

• Strings
• Pattern matching algorithms
– Brute-force algorithm
– Boyer-Moore algorithm
– Knuth-Morris-Pratt algorithm
– Demos

Algorithms: String Matching: Rajeev Wankar


2

1
Strings
• Let P be a string of size m
• A string is a sequence of
characters – A substring P[i .. j] of P is the
subsequence of P consisting
• Examples of strings: of the characters with ranks
– Java program between i and j
– HTML document – A prefix of P is a substring of
– DNA sequence the type P[0 .. i]
– Digitized image – A suffix of P is a substring of
• An alphabet S is the set of the type P[i ..m - 1]
possible characters for a • Given strings T (text) and P
family of strings (pattern), the pattern matching
problem consists of finding a
• Example of alphabets:
substring of T equal to P
– ASCII
• Applications:
– Unicode
– Text editors
– {0, 1}
– Search engines
– {A, C, G, T}
– Biological research
Algorithms: String Matching: Rajeev Wankar
3

String Matching Problem

Pattern  compress

Text  We introduce a general framework which is suitable to


capture an essence of compressed
compress pattern matching
according to various dictionary based compress
compressions. The
goal is to find all occurrences of a pattern in a text
without decompression,
compress which is one of the most active
topics in string matching. Our framework includes such
compression methods as Lempel-Ziv family, (LZ77, LZSS,
compress
LZ78, LZW), byte-pair encoding, and the static
dictionary based method. Technically, our pattern
matching algorithm extremely extends that for LZW
compressed text presented by Amir, Benson and Farach.
compress

Algorithms: String Matching: Rajeev Wankar


4

2
String-matching problem

• Find all valid shift with a given pattern P and


given text T

text T a c a a b c

pattern P a a b

S =2

Algorithms: String Matching: Rajeev Wankar


5

Brute-Force Algorithm

• The brute-force pattern matching Algorithm BruteForceMatch(T, P)


algorithm compares the pattern
P with the text T for each Input text T of size n and pattern
possible shift of P relative to T, P of size m
until either Output starting index of a
– a match is found, or substring of T equal to P or -1
– all placements of the pattern if no such substring exists
have been tried for i  0 to n - m do
• Brute-force pattern matching { test shift i of the pattern }
runs in time O(nm) if P[1 m] = T[s + 1 s + m]
• Example of worst case: then print "Pattern occurs with
– T = aaa … ah shift" s
– P = aaah
– may occur in images and
DNA sequences
– unlikely in English text

Algorithms: String Matching: Rajeev Wankar


6

3
Brute-Force Algorithm

a c a a b c a c a a b c a c a a b c a c a a b c

a a b a a b a a b a a b

S =0 S =1 S =2 S =3

Algorithms: String Matching: Rajeev Wankar


7

Analysis: Worst-case Example

1 2 3 4
pattern P a a a b
1 2 3 4 5 6 7 8 9 10 11 12 13
text T a a a a a a a a a a a a a

a a a b

a a a b
Algorithms: String Matching: Rajeev Wankar
8

4
Worst-case Analysis

• There are m comparisons for each shift in the worst


case
• There are n-m+1 shifts
• So, the worst-case running time is Θ((n-m+1)m)
– In the example on previous slide, we have (13-
4+1)4 comparisons in total
• Naïve method is inefficient because information
from a shift is not used again

Algorithms: String Matching: Rajeev Wankar


9

Rabin-Karp Algorithm

• Has a worst-case running time of O((n-m+1)m)


• But average-case is O(n+m)
• Also works well in practice
• Based on number-theoretic notion of modular
equivalence. We assume that
•  = {0,1, 2, …, 9}, i.e., each character is a decimal
digit
• In general, use radix-d where d = ||

Algorithms: String Matching: Rajeev Wankar


10

5
Rabin-Karp Algorithm

Division Theorem

• For an integer a and any positive integer n, unique


integers q and r exist such that,

0 ≤ r < n and a = q · n + r
E.g., 23 = 3 · 7 + 2, -19 = -3 · 7 + 2

• q = a/n, is the quotient of the division


• r = a mod n, is the remainder of the division

Algorithms: String Matching: Rajeev Wankar


11

Rabin-Karp Algorithm

Modular Equivalence

• If (a mod n) = (b mod n), then we say


“a is equivalent to b, modulo n”
Denoted by a  b (mod n)

• That is, a  b (mod n) if a and b have the same


remainder when divided by n

• E.g., 23  37  -19 (mod 7)

Algorithms: String Matching: Rajeev Wankar


12

6
Rabin-Karp Algorithm

Modular Arithmetic

• Arithmetic as usual on integers except that, if we are


working modulo n, every result x is replaced by one of
{0,1,…, n-1} that is equivalent to x modulo n

• That is, x is replaced by “x mod n”

• E.g., if we are working modulo 7, then each of 23, 37


and -19 will be replaced by 2

Algorithms: String Matching: Rajeev Wankar


13

Rabin-Karp Algorithm

• We can view a string of k characters (digits) as a length-k decimal


number

• E.g., the string “31425” corresponds to the decimal number


31,425

• Given a pattern P [1..m], let p denote the corresponding decimal


value

• Given a text T [1..n], let ts denote the decimal value of the length-
m substring T [(s+1)..(s+m)] for s=0,1,…,(n-m)

ts = p iff T [(s+1)..(s+m)] = P [1..m]

s is a valid shift iff ts = p


Algorithms: String Matching: Rajeev Wankar
14

7
Rabin-Karp Algorithm

• p can be computed in O(m) time (Horner’s rule)


p = P[m] + 10 (P[m-1] + 10 (P[m-2]+…))

• t0 can similarly be computed in O(m) time

• Other t1, t2,…, tn-m can be computed in O(n-m) time since ts+1 can
be computed from ts in constant time

ts+1 = 10(ts - 10m-1 ·T [s+1]) + T [s+m+1]

E.g., if T={…, 3, 1, 4, 1, 5, 2,…}, m = 5 and ts = 31,415, then

ts+1 = 10(31415 – 10000·3) + 2

Algorithms: String Matching: Rajeev Wankar


15

Rabin-Karp Algorithm
• We can compute p, t0, t1,…, tn-m in O(n+m) time

• But…a problem: this is assuming p and ts are small numbers

They may be too large to work with ease

• Solution: we can use modular arithmetic with a suitable


modulus, q

E.g., ts+1  10(ts - …)+ T [s+m+1] (mod q)

• q is chosen as a small prime number ; e.g., 13 for radix 10


• Generally, if the radix is d, then dq should fit within one computer
word

Algorithms: String Matching: Rajeev Wankar


16

8
Rabin-Karp Algorithm
• RABIN-KARP-MATCHER(T, P, d, q)

1 n ← length[T]
2 m ← length[P]
3 h ← dm-1 mod q
4 p←0
5 t0 ← 0
6 for i ← 1 to m . Preprocessing.
7 do p ← (dp + P[i]) mod q
8 t0 ← (dt0 + T[i]) mod q
9 for s ← 0 to n - m . Matching.
10 do if p = ts
11 then if P[1 m] = T [s + 1 s + m]
12 then print "Pattern occurs with shift" s
13 if s < n - m
14 then ts+1 ← (d(ts - T[s + 1]h) + T[s + m + 1]) mod q

Algorithms: String Matching: Rajeev Wankar


17

Rabin-Karp Algorithm
• How values modulo 13 are computed

Algorithms: String Matching: Rajeev Wankar


18

9
Rabin-Karp Algorithm
Problem of Spurious Hits

• ts  p (mod q) does not imply that ts = p

• Modular equivalence does not necessarily mean that two


integers are equal

• A case in which ts  p (mod q) when ts ≠ p is called a


spurious hit

• On the other hand, if two integers are not modular


equivalent, then they cannot be equal

Algorithms: String Matching: Rajeev Wankar


19

Rabin-Karp Algorithm

Algorithms: String Matching: Rajeev Wankar


20

10
Rabin-Karp Algorithm
• Basic structure like the naïve algorithm, but uses
modular arithmetic as described

• For each hit, i.e., for each s where ts  p (mod q),


verify character by character whether s is a valid shift
or a spurious hit

• In the worst case, every shift is verified


–Running time can be shown as O((n-m+1)m)

• Average-case running time is O(n+m)

Algorithms: String Matching: Rajeev Wankar


21

Boyer-Moore’s Algorithm (1)


• The Boyer-Moore’s pattern matching algorithm is based on two
heuristics
Looking-glass heuristic: Compare P with a subsequence of T moving
backwards
Character-jump heuristic: When a mismatch occurs at T[i] = α
– If P contains α, shift P to align the last occurrence of α in P with T[i]
– Else, shift P to align P[0] with T[i + 1]
T
• Example

a p a t t e r n m a t c h i n g a l g o r i t h m

1 3 5 11 10 9 8 7
r i t h m r i t h m P r i t h m r i t h m

2 4 6
r i t h m r i t h m r i t h m

Algorithms: String Matching: Rajeev Wankar


22

11
Last-Occurrence Function

• Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet S


to build the last-occurrence function L mapping alphabet S to integers,
where L(α) is defined as
– the largest index i such that P[i] = α or -1 if no such index exists

• Example:
α a b c d
S = {a, b, c, d}
L(α) 5 6 4 -1
– P = abacab

• The last-occurrence function can be represented by an array indexed by


the numeric codes of the characters
• The last-occurrence function can be computed in time O(m + s), where m
is the size of P and s is the size of S

Algorithms: String Matching: Rajeev Wankar


23

Boyer-Moore’s Algorithm (2)

Algorithm lastOccurenceFunction(P, m, S)
for each character α  S
do L[α] = 0
for j  1 to m
do L[P[α]]  j
return L
j 1 2 3 4 5 6
P(j) a b a c a b

α 97 98 99 100

L(α) 5 6 4 -1

Algorithms: String Matching: Rajeev Wankar


24

12
Boyer-Moore’s Algorithm (2)

Algorithm BoyerMooreMatch(T, P, S ) Case 1: j  1 + l


L  lastOccurenceFunction(P, m, S ) a .
. . . . . . . . . . .
i
im-1
jm-1 . . . . b a
repeat j l
if T[i] = P[j] m-j
if j = 0 . . . . b a
return i { match at i }
else j
ii-1
jj-1 Case 2: 1 + l  j
else . . . . . . a . . . . . .
{ character-jump } i
l  L[T[i]] . a . . b .
i  i + m – min(j, 1 + l) l j
jm-1 m - (1 + l)
until i > n - 1
return -1 { no match } . a . . b .

1+l
Algorithms: String Matching: Rajeev Wankar
25

Example

a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b

Algorithms: String Matching: Rajeev Wankar


26

13
Example

a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b

Algorithms: String Matching: Rajeev Wankar


27

Example

a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b

Algorithms: String Matching: Rajeev Wankar


28

14
Example

a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b

Algorithms: String Matching: Rajeev Wankar


29

Example

a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b

Algorithms: String Matching: Rajeev Wankar


30

15
Example

a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b

Algorithms: String Matching: Rajeev Wankar


31

Example

a b a c a a b a d c a b a c a b a a b b
1
a b a c a b
4 3 2 13 12 11 10 9 8
a b a c a b a b a c a b
5 7
a b a c a b a b a c a b
6
a b a c a b

Algorithms: String Matching: Rajeev Wankar


32

16
Analysis

• In worse case Boyer-


Moore’s algorithm runs in a a a a a a a a a
time O(nm + s) 6 5 4 3 2 1
• Example of worst case: b a a a a a
– T = aaa … a
12 11 10 9 8 7
– P = baaa
b a a a a a
• The worst case may occur in
images and DNA sequences 18 17 16 15 14 13
but is unlikely in English text b a a a a a
• Boyer-Moore’s algorithm is 24 23 22 21 20 19
significantly faster than the b a a a a a
brute-force algorithm on
English text

Algorithms: String Matching: Rajeev Wankar


33

Time Complexity

The Boyer-Moore algorithm as presented in the original


paper has worst-case running time of O(n+m) only if the
pattern does not appear in the text.

When the pattern does occur in the text, running time of the
original algorithm is O(nm) in the worst case.

Algorithms: String Matching: Rajeev Wankar


34

17
Knuth-Morris-Pratt Algorithm (KMP’s) Algorithm (1)

• Knuth-Morris-Pratt’s algorithm
preprocesses the pattern to find
matches of prefixes of the
pattern with the pattern itself
• The prefix function F(i) is
defined as the size of the largest
prefix of P[0..j] that is also a . . a b a a b x . . . . .
proper suffix of P[1..j]
• Knuth-Morris-Pratt’s algorithm
modifies the brute-force
a b a a b a
algorithm so that if a mismatch
occurs at P[j]  T[i] we set j
j  F(j - 1)
a b a a b a

F(j - 1)

Algorithms: String Matching: Rajeev Wankar


35

Why Some Shifts are Invalid

– The first mismatch is at


the 6th character
– consider the pattern
already matched, it is
clear a shift of 1 is not
valid the beginning a in P
would not match the b in
the text
– the next valid shift is +2
because the aba in P
matches the aba in the
text
– the key insight is that we
really only have to check
the pattern matching
against ITSELF

Algorithms: String Matching: Rajeev Wankar


36

18
i 1 2 3 4 5 6 7 8 9 10
P[i] a b a b a b a b c a
F[i] 0 0 1 2 3 4 5 6 0 1

P8 a b a b a b a b c a

P6 a b a b a b a b c a F[8]=6

P4 a b a b a b a b c a F[6]=4

P2 a b a b a b a b c a F[4]=2

P0  a b a b a b a b c a F[2]=0

Algorithms: String Matching: Rajeev Wankar


37

i 1 2 3 4 5 6 7 8 9 10
P[i] a b a b a b a b c a
F[i] 0 0 1 2 3 4 5 6 0 1

P8 a b a b a b a b c a

P6 a b a b a b a b c a F[8]=6

P4 a b a b a b a b c a F[6]=4

P2 a b a b a b a b c a F[4]=2

P0  a b a b a b a b c a F[2]=0

Algorithms: String Matching: Rajeev Wankar


38

19
KMP’s Algorithm (2)

• The prefix function can be Algorithm KMPMatch(T, P)


represented by an array and can F  F(P) /*prefixFuntion */
be computed in O(m) time i0
• At each iteration of the while- j0
loop, either while i < n
if T[i] = P[j]
– i increases by one, or if j = m - 1
– the shift amount i - j return i - j { match }
increases by at least one else
(observe that F(j - 1) < j) ii+1
jj+1
• Hence, there are no more than 2n
else
iterations of the while-loop if j > 0
• Thus, KMP’s algorithm runs in j  F[j - 1]
optimal time O(m + n) else
ii+1
return -1 { no match }

Algorithms: String Matching: Rajeev Wankar


39

KMP’s Algorithm (3)


Function prefixFunction F(P)
1. m ← length[P]
2. F[0] ← 0
3. k ← 0
4. for q ← 1 to m-1 do
5. do while ((k > 0) and (P[k + 1]  P[q]))
6. do k ← F[k]
7. if (P[k +1] = P[q])
8. then k ← k + 1
9. F[q] ← k
10. return F
Algorithms: String Matching: Rajeev Wankar
40

20
Example

a b a c a a b a c c a b a c a b a a b b
1 2 3 4 5 6
a b a c a b
7
a b a c a b
8 9 10 11 12
a b a c a b
13
a b a c a b
j 0 1 2 3 4 5
P[j] a b a c a b 14 15 16 17 18 19
F(j) 0 0 1 0 1 2 a b a c a b

Algorithms: String Matching: Rajeev Wankar


41

21

You might also like