
Algorithms, Fourth Edition
Robert Sedgewick | Kevin Wayne

5.3 Substring Search

‣ introduction
‣ brute force
‣ Knuth-Morris-Pratt
‣ Boyer-Moore
‣ Rabin-Karp

http://algs4.cs.princeton.edu

Substring search

Goal. Find pattern of length M in a text of length N (typically N >> M).

pattern   N E E D L E

text      I N A H A Y S T A C K N E E D L E I N A
match                           N E E D L E

Substring search applications

Computer forensics. Search memory or disk for signatures,
e.g., all URLs or RSA keys that the user has entered.
http://citp.princeton.edu/memory

Identify patterns indicative of spam.
・ PROFITS
・ L0SE WE1GHT
・ herbal Viagra
・ There is no catch.
・ This is a one-time mailing.
・ This message is sent in compliance with spam regulations.

Substring search applications

Electronic surveillance.
・ "Need to monitor all internet traffic." (security)
・ "No way!" (privacy)
・ "Well, we’re mainly interested in “ATTACK AT DAWN”."
・ "OK. Build a machine that just looks for that."

[Figure: a substring search machine for “ATTACK AT DAWN” that reports "found".]

Substring search applications

Screen scraping. Extract relevant data from web page.

Ex. Find string delimited by <b> and </b> after first occurrence of
pattern Last Trade:.

http://finance.yahoo.com/q?s=goog

    ...
    <tr>
    <td class= "yfnc_tablehead1"
    width= "48%">
    Last Trade:
    </td>
    <td class= "yfnc_tabledata1">
    <big><b>452.92</b></big>
    </td></tr>
    <td class= "yfnc_tablehead1"
    width= "48%">
    Trade Time:
    </td>
    <td class= "yfnc_tabledata1">
    ...
Screen scraping: Java implementation

Java library. The indexOf() method in Java's string library returns the index
of the first occurrence of a given string, starting at a given offset.

    public class StockQuote
    {
        public static void main(String[] args)
        {
            String name = "http://finance.yahoo.com/q?s=";
            In in = new In(name + args[0]);
            String text = in.readAll();
            int start = text.indexOf("Last Trade:", 0);
            int from  = text.indexOf("<b>", start);
            int to    = text.indexOf("</b>", from);
            String price = text.substring(from + 3, to);
            StdOut.println(price);
        }
    }

    % java StockQuote goog
    582.93

    % java StockQuote msft
    24.84
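The StockQuote program assumes that all three markers are present. A minimal sketch of the same indexOf() chain with guards for a missing marker might look like the following; the class name QuoteScraper, the method extract(), and the inline HTML string are hypothetical, introduced only for illustration.

```java
class QuoteScraper {
    // Returns the text between <b> and </b> that follows the first occurrence
    // of label, or null if any of the three markers is missing.
    static String extract(String text, String label) {
        int start = text.indexOf(label);
        if (start < 0) return null;
        int from = text.indexOf("<b>", start);
        if (from < 0) return null;
        int to = text.indexOf("</b>", from);
        if (to < 0) return null;
        return text.substring(from + 3, to);   // skip the 3 chars of "<b>"
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, shaped like the slide's example.
        String html = "<td>Last Trade:</td><td><big><b>452.92</b></big></td>"
                    + "<td>Trade Time:</td>";
        System.out.println(extract(html, "Last Trade:"));  // 452.92
        System.out.println(extract(html, "Volume:"));      // null
    }
}
```

Returning null on a missing marker is one design choice among several; the point is only that indexOf() returns -1 when the string is absent, which substring() would otherwise turn into an exception or a wrong slice.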

Brute-force substring search

Check for pattern starting at each text position.

 i  j  i+j  0 1 2 3 4 5 6 7 8 9 10
       txt  A B A C A D A B R A C
 0  2   2   A B R A                   pat
 1  0   1     A B R A
 2  1   3       A B R A
 3  0   3         A B R A
 4  1   5           A B R A
 5  0   5             A B R A
 6  4  10               A B R A       match: return i when j is M

(In the original figure, entries in red are mismatches, entries in gray are
for reference only, and entries in black match the text.)

Brute-force substring search: Java implementation

Check for pattern starting at each text position.

 i  j  i+j  0 1 2 3 4 5 6 7 8 9 10
       txt  A B A C A D A B R A C
 4  3   7           A D A C R
 5  0   5             A D A C R

    public static int search(String pat, String txt)
    {
        int M = pat.length();
        int N = txt.length();
        for (int i = 0; i <= N - M; i++)
        {
            int j;
            for (j = 0; j < M; j++)
                if (txt.charAt(i+j) != pat.charAt(j))
                    break;
            if (j == M) return i;    // index in text where pattern starts
        }
        return N;                    // not found
    }
Brute-force substring search: worst case

Brute-force algorithm can be slow if text and pattern are repetitive.

 i  j  i+j  0 1 2 3 4 5 6 7 8 9
       txt  A A A A A A A A A B
 0  4   4   A A A A B                 pat
 1  4   5     A A A A B
 2  4   6       A A A A B
 3  4   7         A A A A B
 4  4   8           A A A A B
 5  5  10             A A A A B       match

Brute-force substring search (worst case)

Worst case. ~ M N char compares.

Backup

In many applications, we want to avoid backup in text stream.
・Treat input as stream of data.
・Abstract model: standard input.

[Figure: a substring search machine for “ATTACK AT DAWN” that reports "found".]

Brute-force algorithm needs backup for every mismatch.

[Figure: text A A A A A A A A A A A A A A A A A A A A A B and pattern A A A A A B. After the matched chars and the mismatch, the text pointer backs up and the pattern shifts right one position.]

Approach 1. Maintain buffer of last M characters.
Approach 2. Stay tuned.
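To make the ~ M N worst-case claim concrete, here is a small sketch that instruments the brute-force search with a compare counter and runs it on the worst-case example above (pattern AAAAB, text AAAAAAAAAB). The class name BruteForceCount and the static counter are hypothetical additions for illustration; the search loop itself is the one from the slides.

```java
class BruteForceCount {
    static long compares;   // char compares made by the most recent search

    static int search(String pat, String txt) {
        int M = pat.length(), N = txt.length();
        compares = 0;
        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++) {
                compares++;                                   // one char compare
                if (txt.charAt(i + j) != pat.charAt(j)) break;
            }
            if (j == M) return i;                             // match at i
        }
        return N;                                             // not found
    }

    public static void main(String[] args) {
        int at = search("AAAAB", "AAAAAAAAAB");
        // 6 alignments, 5 compares each: (N - M + 1) * M = 30
        System.out.println(at + " " + compares);              // 5 30
    }
}
```

Every alignment matches four A's before failing (or succeeding) on the fifth char, so the count is (N - M + 1) * M, which grows like M N when N >> M.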

Brute-force substring search: alternate implementation

Same sequence of char compares as previous implementation.
・ i points to end of sequence of already-matched chars in text.
・ j stores # of already-matched chars (end of sequence in pattern).

 i  j   0 1 2 3 4 5 6 7 8 9 10
 txt    A B A C A D A B R A C
 7  3           A D A C R
 5  0             A D A C R

    public static int search(String pat, String txt)
    {
        int i, N = txt.length();
        int j, M = pat.length();
        for (i = 0, j = 0; i < N && j < M; i++)
        {
            if (txt.charAt(i) == pat.charAt(j)) j++;
            else { i -= j; j = 0; }    // explicit backup
        }
        if (j == M) return i - M;
        else        return N;
    }

Algorithmic challenges in substring search

Brute-force is not always good enough.

Theoretical challenge. Linear-time guarantee. (fundamental algorithmic problem)

Practical challenge. Avoid backup in text stream. (often no room or time to save text)

[Figure: a long text stream of near-identical sentences ("Now is the time for all good people to come to the aid of their party.") with the phrase "attack at dawn" hidden in one of them.]
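Approach 1 from the backup discussion (maintain a buffer of the last M characters) can be sketched as follows. This is an illustrative sketch, not code from the slides: the class name BufferedStreamSearch, the circular-buffer layout, and the choice to return -1 when the pattern is absent are all assumptions. It reads the stream strictly forward and never re-reads a character from the input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BufferedStreamSearch {
    // Returns the stream position where pat first starts, or -1 if absent.
    static long search(String pat, Reader in) {
        int M = pat.length();
        if (M == 0) return 0;
        char[] buf = new char[M];   // circular buffer: stream position p lives at buf[p % M]
        long n = 0;                 // number of chars read so far
        try {
            int c;
            while ((c = in.read()) != -1) {
                buf[(int) (n % M)] = (char) c;
                n++;
                if (n < M) continue;              // window not full yet
                int j = 0;                        // compare window [n-M, n) with pat
                while (j < M && buf[(int) ((n - M + j) % M)] == pat.charAt(j)) j++;
                if (j == M) return n - M;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;
    }

    // Convenience overload for in-memory text.
    static long search(String pat, String txt) {
        return search(pat, new StringReader(txt));
    }

    public static void main(String[] args) {
        System.out.println(search("NEEDLE", "INAHAYSTACKNEEDLEINA"));  // 11
    }
}
```

The buffer removes the need to back up in the input, but the number of char compares is still ~ M N in the worst case, since every window is checked from scratch.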
Knuth-Morris-Pratt substring search

Intuition. Suppose we are searching in text for pattern BAAAAAAAAA.
・Suppose we match 5 chars in pattern, with mismatch on 6th char.
・We know previous 6 chars in text are BAAAAB. (assuming { A, B } alphabet)
・Don't need to back up text pointer!

[Figure: Text pointer backup in substring searching. Text A B A A A A B A A A A A A A A A with pattern B A A A A A A A A A; after the mismatch on the sixth char, brute-force backs up to try each of the following alignments in turn, but no backup is needed.]

Knuth-Morris-Pratt algorithm. Clever method to always avoid backup. (!)

Deterministic finite state automaton (DFA)

DFA is abstract string-searching machine.
・Finite number of states (including start and halt).
・Exactly one transition for each char in alphabet.
・Accept if sequence of transitions leads to halt state.

internal representation

    j               0  1  2  3  4  5
    pat.charAt(j)   A  B  A  B  A  C
                A   1  1  3  1  5  1
    dfa[][j]    B   0  2  0  4  0  4
                C   0  0  0  0  0  6

If in state j reading char c:
    if j is 6 halt and accept
    else move to state dfa[c][j]

graphical representation

[Figure: Constructing the DFA for KMP substring search for A B A B A C. States 0 through 6, with match transitions A, B, A, B, A, C from each state to the next and mismatch transitions back to earlier states.]

Knuth-Morris-Pratt demo: DFA simulation

    txt   A A B A C A A B A B A C A A

[Demo: the DFA above reads the text one char at a time; on reaching state 6 the substring is found.]

Interpretation of Knuth-Morris-Pratt DFA

Q. What is interpretation of DFA state after reading in txt[i]?

A. State = number of characters in pattern that have been matched
   (length of longest prefix of pat[] that is a suffix of txt[0..i]).

Ex. DFA is in state 3 after reading in txt[0..6].

    i     0 1 2 3 4 5 6 7 8
    txt   B C B A A B A C A

    j     0 1 2 3 4 5
    pat   A B A B A C

    suffix of txt[0..6] = A B A = prefix of pat[]
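The simulation rule above (move to state dfa[c][j], accept in state 6) can be sketched directly from the slide's table. The class name DfaSimulation, the hard-coded table, and the mapping of chars to rows via c - 'A' are assumptions made for this example; the text must use only the letters A, B, and C.

```java
class DfaSimulation {
    // dfa[c][j] for pattern A B A B A C, copied from the table; rows are A, B, C.
    static final int[][] DFA = {
        {1, 1, 3, 1, 5, 1},
        {0, 2, 0, 4, 0, 4},
        {0, 0, 0, 0, 0, 6},
    };
    static final int M = 6;   // pattern length, also the accept state

    // DFA state after reading txt (stops early if the accept state is reached).
    static int stateAfter(String txt) {
        int j = 0;
        for (int i = 0; i < txt.length() && j < M; i++)
            j = DFA[txt.charAt(i) - 'A'][j];
        return j;
    }

    // Index of the first match, or txt.length() if there is none.
    static int search(String txt) {
        int i, j, N = txt.length();
        for (i = 0, j = 0; i < N && j < M; i++)
            j = DFA[txt.charAt(i) - 'A'][j];      // text pointer i never backs up
        return (j == M) ? i - M : N;
    }

    public static void main(String[] args) {
        System.out.println(stateAfter("BCBAABA"));      // 3, as in the Ex. above
        System.out.println(search("AABACAABABACAA"));   // 6
    }
}
```

The first call reproduces the interpretation example: after reading txt[0..6] = B C B A A B A the machine is in state 3, the length of the longest prefix of the pattern that is a suffix of the text read so far.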

Knuth-Morris-Pratt substring search: Java implementation

Key differences from brute-force implementation.
・Need to precompute dfa[][] from pattern.
・Text pointer i never decrements.
・Could use input stream.

    public int search(String txt)
    {
        int i, j, N = txt.length();
        for (i = 0, j = 0; i < N && j < M; i++)
            j = dfa[txt.charAt(i)][j];    // no backup
        if (j == M) return i - M;
        else        return N;
    }

    public int search(In in)
    {
        int i, j;
        for (i = 0, j = 0; !in.isEmpty() && j < M; i++)
            j = dfa[in.readChar()][j];    // no backup
        if (j == M) return i - M;
        else        return NOT_FOUND;
    }

Running time.
・Simulate DFA on text: at most N character accesses.
・Build DFA: how to do efficiently? [warning: tricky algorithm ahead]


Knuth-Morris-Pratt demo: DFA construction

Include one state for each character in pattern (plus accept state).

    j               0  1  2  3  4  5
    pat.charAt(j)   A  B  A  B  A  C
                A   1  1  3  1  5  1
    dfa[][j]    B   0  2  0  4  0  4
                C   0  0  0  0  0  6

Constructing the DFA for KMP substring search for A B A B A C

How to build DFA from pattern?

Include one state for each character in pattern (plus accept state).

Match transition. If in state j and next char c == pat.charAt(j), go to j+1.
(In state j, the first j characters of pattern have already been matched; if the next char matches, the first j+1 characters of pattern have been matched.)

    j               0  1  2  3  4  5
    pat.charAt(j)   A  B  A  B  A  C
                A   1     3     5
    dfa[][j]    B      2     4
                C                  6
How to build DFA from pattern?

Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.

To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c.
Running time. Seems to require j steps. (still under construction!)

Ex. dfa['A'][5] = 1; dfa['B'][5] = 4
    simulate BABA; take transition 'A' = dfa['A'][3]
    simulate BABA; take transition 'B' = dfa['B'][3]

How to build DFA from pattern?

Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.

To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c.
Running time. Takes only constant time if we maintain state X.

Ex. dfa['A'][5] = 1; dfa['B'][5] = 4; X' = 0
    from state X, take transition 'A' = dfa['A'][X]
    from state X, take transition 'B' = dfa['B'][X]
    from state X, take transition 'C' = dfa['C'][X]

Knuth-Morris-Pratt demo: DFA construction in linear time

Include one state for each character in pattern (plus accept state).

[Demo: the DFA for A B A B A C is filled in one state at a time, left to right, using the state X to supply the mismatch transitions.]
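The "simulate pat[1..j-1] on the DFA" step can be checked with a short sketch. It uses the finished table for A B A B A C from the slides and recomputes the restart state X for a given j by brute simulation, which takes j steps; the class name RestartState and the c - 'A' row mapping are assumptions for illustration.

```java
class RestartState {
    // dfa[c][j] for pattern A B A B A C; rows are A, B, C.
    static final int[][] DFA = {
        {1, 1, 3, 1, 5, 1},
        {0, 2, 0, 4, 0, 4},
        {0, 0, 0, 0, 0, 6},
    };

    // State reached by feeding pat[1..j-1] to the DFA, starting from state 0.
    static int restartState(String pat, int j) {
        int x = 0;
        for (int k = 1; k < j; k++)
            x = DFA[pat.charAt(k) - 'A'][x];
        return x;
    }

    public static void main(String[] args) {
        String pat = "ABABAC";
        int x = restartState(pat, 5);                 // simulate B A B A
        System.out.println(x);                        // 3
        System.out.println(DFA['A' - 'A'][x]);        // 1 = dfa['A'][5]
        System.out.println(DFA['B' - 'A'][x]);        // 4 = dfa['B'][5]
    }
}
```

The linear-time construction avoids repeating this simulation for every j by carrying X forward: the X for state j+1 is the state reached from the current X on pat.charAt(j).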
Constructing the DFA for KMP substring search: Java implementation

For each state j:
・Copy dfa[][X] to dfa[][j] for mismatch case.
・Set dfa[pat.charAt(j)][j] to j+1 for match case.
・Update X.

    public KMP(String pat)
    {
        this.pat = pat;
        M = pat.length();
        dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++)
        {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][X];        // copy mismatch cases
            dfa[pat.charAt(j)][j] = j+1;      // set match case
            X = dfa[pat.charAt(j)][X];        // update restart state
        }
    }

Running time. M character accesses (but space/time proportional to R M).

KMP substring search analysis

Proposition. KMP substring search accesses no more than M + N chars
to search for a pattern of length M in a text of length N.

Pf. Each pattern char accessed once when constructing the DFA;
each text char accessed once (in the worst case) when simulating the DFA.

Proposition. KMP constructs dfa[][] in time and space proportional to R M.

Larger alphabets. Improved version of KMP constructs nfa[] in time and
space proportional to M.

internal representation

    j               0  1  2  3  4  5
    pat.charAt(j)   A  B  A  B  A  C
    next[j]         0  0  0  0  0  3

graphical representation

[Figure: KMP NFA for ABABAC (NFA corresponding to the string A B A B A C). States 0 through 6 with match transitions A, B, A, B, A, C; each mismatch transition backs up at least one state.]
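Putting the constructor and the search loop together gives a small runnable sketch. The class name KMPSketch, the accessor transition(), and the choice R = 256 are assumptions for this example; the two loops are the ones shown on the slides. It assumes a non-empty pattern and chars with codes below 256.

```java
class KMPSketch {
    private static final int R = 256;   // assumed alphabet size (extended ASCII)
    private final int M;                // pattern length
    private final int[][] dfa;          // dfa[c][j] = next state from state j on char c

    KMPSketch(String pat) {
        M = pat.length();
        dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][X];          // copy mismatch cases from restart state
            dfa[pat.charAt(j)][j] = j + 1;      // set match case
            X = dfa[pat.charAt(j)][X];          // update restart state
        }
    }

    int transition(char c, int j) { return dfa[c][j]; }

    // Index of the first match, or txt.length() if there is none.
    int search(String txt) {
        int i, j, N = txt.length();
        for (i = 0, j = 0; i < N && j < M; i++)
            j = dfa[txt.charAt(i)][j];          // no backup
        return (j == M) ? i - M : N;
    }

    public static void main(String[] args) {
        KMPSketch kmp = new KMPSketch("ABABAC");
        System.out.println(kmp.transition('A', 5));        // 1
        System.out.println(kmp.transition('B', 5));        // 4
        System.out.println(kmp.search("AABACAABABACAA"));  // 6
    }
}
```

The constructor touches R entries per state, which is where the R M time and space bound comes from; the search loop reads each text char at most once.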

Knuth-Morris-Pratt: brief history

・Independently discovered by two theoreticians and a hacker.
  – Knuth: inspired by esoteric theorem, discovered linear algorithm
  – Pratt: made running time independent of alphabet size
  – Morris: built a text editor for the CDC 6400 computer
・Theory meets practice.

[Photos: Don Knuth, Jim Morris, Vaughan Pratt]

    SIAM J. COMPUT.
    Vol. 6, No. 2, June 1977

    FAST PATTERN MATCHING IN STRINGS*

    DONALD E. KNUTH, JAMES H. MORRIS, JR. AND VAUGHAN R. PRATT

    Abstract. An algorithm is presented which finds all occurrences of one given string within
    another, in running time proportional to the sum of the lengths of the strings. The constant of
    proportionality is low enough to make this algorithm of practical use, and the procedure can also be
    extended to deal with some more general pattern-matching problems. A theoretical application of the
    algorithm shows that the set of concatenations of even palindromes, i.e., the language {αα^R}*, can be
    recognized in linear time. Other algorithms which run even faster on the average are also considered.

    Key words. pattern, string, text-editing, pattern-matching, trie memory, searching, period of a
    string, palindrome, optimum algorithm, Fibonacci string, regular expression

    Text-editing programs are often required to search through a string of
    characters looking for instances of a given "pattern" string; we wish to find all
    positions, or perhaps only the leftmost position, in which the pattern occurs as a
    contiguous substring of the text. For example, catenary contains the pattern
    ten, but we do not regard canary as a substring.
    The obvious way to search for a matching pattern is to try searching at every
    starting position of the text, abandoning the search as soon as an incorrect
    character is found. But this approach can be very inefficient, for example when we
    are looking for an occurrence of aaaaaaab in aaaaaaaaaaaaaab.

[Photos: Robert Boyer, J. Strother Moore]
Boyer-Moore: mismatched character heuristic

Intuition.
・Scan characters in pattern from right to left.
・Can skip as many as M text chars when finding one not in the pattern.

 i  j    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
text     F  I  N  D  I  N  A  H  A  Y  S  T  A  C  K  N  E  E  D  L  E  I  N  A
 0  5    N  E  E  D  L  E                                                         pattern
 5  5                   N  E  E  D  L  E
11  4                                     N  E  E  D  L  E
15  0                                                 N  E  E  D  L  E
return i = 15

Mismatched character heuristic for right-to-left (Boyer-Moore) substring search

Boyer-Moore: mismatched character heuristic

Q. How much to skip?

Case 1. Mismatch character not in pattern.

before
txt   . . . . . . T L E . . . . . .
pat         N E E D L E

after
txt   . . . . . . T L E . . . . . .
pat                 N E E D L E

mismatch character 'T' not in pattern: increment i one character beyond 'T'

Boyer-Moore: mismatched character heuristic

Q. How much to skip?

Case 2a. Mismatch character in pattern.

before
txt   . . . . . . N L E . . . . . .
pat         N E E D L E

after
txt   . . . . . . N L E . . . . . .
pat               N E E D L E

mismatch character 'N' in pattern: align text 'N' with rightmost pattern 'N'

Case 2b. Mismatch character in pattern (but heuristic no help).

before
txt   . . . . . . E L E . . . . . .
pat         N E E D L E

aligned with rightmost E?
txt   . . . . . . E L E . . . . . .
pat     N E E D L E

mismatch character 'E' in pattern: align text 'E' with rightmost pattern 'E' ?

after
txt   . . . . . . E L E . . . . . .
pat           N E E D L E

mismatch character 'E' in pattern: increment i by 1

Boyer-Moore: mismatched character heuristic

Q. How much to skip?

A. Precompute index of rightmost occurrence of character c in pattern
   (-1 if character not in pattern).

    right = new int[R];
    for (int c = 0; c < R; c++)
        right[c] = -1;
    for (int j = 0; j < M; j++)
        right[pat.charAt(j)] = j;

              N   E   E   D   L   E
 c            0   1   2   3   4   5   right[c]
 A       -1  -1  -1  -1  -1  -1  -1     -1
 B       -1  -1  -1  -1  -1  -1  -1     -1
 C       -1  -1  -1  -1  -1  -1  -1     -1
 D       -1  -1  -1  -1   3   3   3      3
 E       -1  -1   1   2   2   2   5      5
 ...                                    -1
 L       -1  -1  -1  -1  -1   4   4      4
 M       -1  -1  -1  -1  -1  -1  -1     -1
 N       -1   0   0   0   0   0   0      0
 ...                                    -1

Boyer-Moore skip table computation
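The skip table and the three cases can be tied together in a short sketch. The class name SkipTable and the helper skip() are hypothetical; the table construction is the loop from the slide, and skip() applies the rule "shift by j - right[c], but by at least 1" to a mismatch at pattern index j against text char c.

```java
import java.util.Arrays;

class SkipTable {
    // right[c] = index of the rightmost occurrence of c in pat, or -1 if absent.
    static int[] rightmost(String pat, int R) {
        int[] right = new int[R];
        Arrays.fill(right, -1);
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;
        return right;
    }

    // How far to advance i after a mismatch at pattern index j on text char c.
    static int skip(int[] right, int j, char c) {
        return Math.max(1, j - right[c]);
    }

    public static void main(String[] args) {
        int[] right = rightmost("NEEDLE", 256);
        // In all three cases the mismatch is at pattern index 3 ('D').
        System.out.println(skip(right, 3, 'T'));   // 4: Case 1, 'T' not in pattern
        System.out.println(skip(right, 3, 'N'));   // 3: Case 2a, align with pattern 'N'
        System.out.println(skip(right, 3, 'E'));   // 1: Case 2b, heuristic no help
    }
}
```

In Case 2b the raw value j - right[c] is 3 - 5 = -2, which would move the pattern backward, so the max with 1 is what keeps the search advancing.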

Boyer-Moore: Java implementation

    public int search(String txt)
    {
        int N = txt.length();
        int M = pat.length();
        int skip;
        for (int i = 0; i <= N-M; i += skip)
        {
            skip = 0;
            for (int j = M-1; j >= 0; j--)
            {
                if (pat.charAt(j) != txt.charAt(i+j))
                {
                    // compute skip value (max with 1 in case other term is nonpositive)
                    skip = Math.max(1, j - right[txt.charAt(i+j)]);
                    break;
                }
            }
            if (skip == 0) return i;    // match
        }
        return N;
    }

Boyer-Moore: analysis

Property. Substring search with the Boyer-Moore mismatched character
heuristic takes about ~ N / M character compares to search for a pattern of
length M in a text of length N. (sublinear!)

Worst-case. Can be as bad as ~ M N.

 i skip  0 1 2 3 4 5 6 7 8 9
    txt  B B B B B B B B B B
 0   0   A B B B B              pat
 1   1     A B B B B
 2   1       A B B B B
 3   1         A B B B B
 4   1           A B B B B
 5   1             A B B B B

Boyer-Moore-Horspool substring search (worst case)

Boyer-Moore variant. Can improve worst case to ~ 3 N character compares
by adding a KMP-like rule to guard against repetitive patterns.
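The contrast between the typical case and the worst case can be seen by counting compares. The sketch below wraps the mismatched character heuristic in a self-contained static method; the class name BoyerMooreCount, the counter, and R = 256 are assumptions for illustration.

```java
import java.util.Arrays;

class BoyerMooreCount {
    static long compares;   // char compares made by the most recent search

    static int search(String pat, String txt) {
        int R = 256;                                  // assumed alphabet size
        int M = pat.length(), N = txt.length();
        int[] right = new int[R];
        Arrays.fill(right, -1);
        for (int j = 0; j < M; j++) right[pat.charAt(j)] = j;

        compares = 0;
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) {        // scan pattern right to left
                compares++;
                if (pat.charAt(j) != txt.charAt(i + j)) {
                    skip = Math.max(1, j - right[txt.charAt(i + j)]);
                    break;
                }
            }
            if (skip == 0) return i;                  // match
        }
        return N;                                     // not found
    }

    public static void main(String[] args) {
        int a = search("NEEDLE", "FINDINAHAYSTACKNEEDLEINA");
        System.out.println(a + " " + compares);       // 15 10
        int b = search("ABBBB", "BBBBBBBBBB");
        System.out.println(b + " " + compares);       // 10 30
    }
}
```

On the haystack example only 10 compares are needed for a 24-character text, while the repetitive example costs M compares at every one of the N - M + 1 alignments.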
Rabin-Karp fingerprint search

[Photos: Michael Rabin, Dick Karp]

Basic idea = modular hashing.
・Compute a hash of pat[0..M-1].
・For each i, compute a hash of txt[i..M+i-1].
・If pattern hash = text substring hash, check for a match.

pat.charAt(i)
    i     0 1 2 3 4
          2 6 5 3 5     % 997 = 613

txt.charAt(i)
    i     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
          3 1 4 1 5 9 2 6 5 3 5  8  9  7  9  3

    i     txt[i..i+4]
    0     3 1 4 1 5     % 997 = 508
    1     1 4 1 5 9     % 997 = 201
    2     4 1 5 9 2     % 997 = 715
    3     1 5 9 2 6     % 997 = 971
    4     5 9 2 6 5     % 997 = 442
    5     9 2 6 5 3     % 997 = 929
    6     2 6 5 3 5     % 997 = 613     match: return i = 6

Basis for Rabin-Karp substring search
(modular hashing with R = 10 and hash(s) = s (mod 997))

Modular arithmetic

Math trick. To keep numbers small, take intermediate results modulo Q.

Ex. (10000 + 535) * 1000 (mod 997)
    = (30 + 535) * 3 (mod 997)
    = 1695 (mod 997)
    = 698 (mod 997)

two useful modular arithmetic identities:
    (a + b) mod Q = ((a mod Q) + (b mod Q)) mod Q
    (a * b) mod Q = ((a mod Q) * (b mod Q)) mod Q

Efficiently computing the hash function

Modular hash function. Using the notation $t_i$ for txt.charAt(i),
we wish to compute

    $x_i = t_i R^{M-1} + t_{i+1} R^{M-2} + \cdots + t_{i+M-1} R^0 \pmod{Q}$

Intuition. M-digit, base-R integer, modulo Q.

Horner's method. Linear-time method to evaluate degree-M polynomial.

    // Compute hash for M-digit key
    private long hash(String key, int M)
    {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (h * R + key.charAt(j)) % Q;
        return h;
    }

pat.charAt()
    i     0 1 2 3 4
          2 6 5 3 5
    0     2             % 997 = 2
    1     2 6           % 997 = (2*10 + 6) % 997 = 26
    2     2 6 5         % 997 = (26*10 + 5) % 997 = 265
    3     2 6 5 3       % 997 = (265*10 + 3) % 997 = 659
    4     2 6 5 3 5     % 997 = (659*10 + 5) % 997 = 613

Computing the hash value for the pattern with Horner’s method (R = 10, Q = 997)

    26535 = 2*10000 + 6*1000 + 5*100 + 3*10 + 5
          = ((((2) * 10 + 6) * 10 + 5) * 10 + 3) * 10 + 5
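The Horner trace for 26535 can be reproduced with a few lines. One caveat: the slide's hash() uses key.charAt(j) directly (the char code), whereas the worked example treats each char as a decimal digit. The sketch below makes that digit interpretation explicit by subtracting '0'; the class name HornerHash and the extra R and Q parameters are assumptions for illustration.

```java
class HornerHash {
    // Hash of the first M decimal digits of the string, base R, modulo Q.
    static long hash(String digits, int M, int R, long Q) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (h * R + (digits.charAt(j) - '0')) % Q;   // digit value, not char code
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("26535", 5, 10, 997));    // 613
        System.out.println(hash("31415", 5, 10, 997));    // 508
    }
}
```

Taking the remainder after every step keeps h below Q, so the intermediate products stay small no matter how long the key is.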
Efficiently computing the hash function

Challenge. How to efficiently compute $x_{i+1}$ given that we know $x_i$.

    $x_i     = t_i R^{M-1} + t_{i+1} R^{M-2} + \cdots + t_{i+M-1} R^0$
    $x_{i+1} = t_{i+1} R^{M-1} + t_{i+2} R^{M-2} + \cdots + t_{i+M} R^0$

Key property. Can update "rolling" hash function in constant time!

    $x_{i+1} = (x_i - t_i R^{M-1}) R + t_{i+M}$

    (current value, subtract leading digit, multiply by radix, add new trailing digit;
     can precompute $R^{M-1}$)

Key computation in Rabin-Karp substring search
(move right one position in the text)

      4 1 5 9 2     current value
    - 4 0 0 0 0
        1 5 9 2     subtract leading digit
          * 1 0     multiply by radix
      1 5 9 2 0
            + 6     add new trailing digit
      1 5 9 2 6     new value

Rabin-Karp substring search example

First R entries: Use Horner's rule.
Remaining entries: Use rolling hash (and % to avoid overflow).

    i     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
          3 1 4 1 5 9 2 6 5 3 5  8  9  7  9  3

     0    3             % 997 = 3
     1    3 1           % 997 = (3*10 + 1) % 997 = 31
     2    3 1 4         % 997 = (31*10 + 4) % 997 = 314
     3    3 1 4 1       % 997 = (314*10 + 1) % 997 = 150
     4    3 1 4 1 5     % 997 = (150*10 + 5) % 997 = 508
     5    1 4 1 5 9     % 997 = ((508 + 3*(997 - 30))*10 + 9) % 997 = 201
     6    4 1 5 9 2     % 997 = ((201 + 1*(997 - 30))*10 + 2) % 997 = 715
     7    1 5 9 2 6     % 997 = ((715 + 4*(997 - 30))*10 + 6) % 997 = 971
     8    5 9 2 6 5     % 997 = ((971 + 1*(997 - 30))*10 + 5) % 997 = 442
     9    9 2 6 5 3     % 997 = ((442 + 5*(997 - 30))*10 + 3) % 997 = 929
    10    2 6 5 3 5     % 997 = ((929 + 9*(997 - 30))*10 + 5) % 997 = 613     match: return i-M+1 = 6

    (rows 0 to 4 use Horner's rule; rows 5 to 10 use the rolling hash)

    10000 (mod 997) = 30
    -30 (mod 997) = 997 – 30

Rabin-Karp substring search example
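The rolling update can be checked against the example by computing the hash of every length-M window of the digit string. As before, this is a sketch with assumed names (RollingHash, windowHashes) that treats chars as decimal digits so the numbers line up with the worked example; it assumes the text has at least M digits.

```java
class RollingHash {
    // Hashes of all length-M windows of a digit string, base R, modulo Q.
    static long[] windowHashes(String digits, int M, int R, long Q) {
        int N = digits.length();
        long RM1 = 1;                                   // R^(M-1) % Q
        for (int i = 1; i <= M - 1; i++) RM1 = (R * RM1) % Q;

        long[] out = new long[N - M + 1];
        long h = 0;
        for (int j = 0; j < M; j++)                     // first window: Horner's rule
            h = (h * R + (digits.charAt(j) - '0')) % Q;
        out[0] = h;

        for (int i = M; i < N; i++) {                   // remaining windows: rolling hash
            int lead = digits.charAt(i - M) - '0';
            int next = digits.charAt(i) - '0';
            h = (h + Q - RM1 * lead % Q) % Q;           // subtract leading digit (add Q to stay nonnegative)
            h = (h * R + next) % Q;                     // multiply by radix, add trailing digit
            out[i - M + 1] = h;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] x = windowHashes("3141592653589793", 5, 10, 997);
        for (int i = 0; i <= 6; i++)
            System.out.print(x[i] + " ");               // 508 201 715 971 442 929 613
        System.out.println();
    }
}
```

Adding Q before subtracting is the code-level counterpart of the slide's note that -30 (mod 997) = 997 - 30: it keeps the intermediate value nonnegative, which matters because Java's % can return a negative result.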

Rabin-Karp: Java implementation

    public class RabinKarp
    {
        private long patHash;    // pattern hash value
        private int M;           // pattern length
        private long Q;          // modulus
        private int R;           // radix
        private long RM1;        // R^(M-1) % Q

        public RabinKarp(String pat) {
            M = pat.length();
            R = 256;
            Q = longRandomPrime();             // a large prime (but avoid overflow)

            RM1 = 1;                           // precompute R^(M-1) (mod Q)
            for (int i = 1; i <= M-1; i++)
                RM1 = (R * RM1) % Q;
            patHash = hash(pat, M);
        }

        private long hash(String key, int M)
        { /* as before */ }

        public int search(String txt)
        { /* see next slide */ }
    }

Rabin-Karp: Java implementation (continued)

Monte Carlo version. Return match if hash match.

    public int search(String txt)
    {
        int N = txt.length();
        long txtHash = hash(txt, M);           // long, to match the return type of hash()
        if (patHash == txtHash) return 0;
        // check for hash collision using rolling hash function
        for (int i = M; i < N; i++)
        {
            txtHash = (txtHash + Q - RM1*txt.charAt(i-M) % Q) % Q;
            txtHash = (txtHash*R + txt.charAt(i)) % Q;
            if (patHash == txtHash) return i - M + 1;
        }
        return N;
    }

Las Vegas version. Check for substring match if hash match;
continue search if false collision.
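The Las Vegas version described above can be sketched by adding an explicit check on every hash match. The class name RabinKarpLasVegas and the helper check() are hypothetical. The slides' longRandomPrime() is not shown, so this sketch substitutes BigInteger.probablePrime with a 31-bit length, which is an assumption about how such a prime might be chosen; the guard for texts shorter than the pattern is also an addition.

```java
import java.math.BigInteger;
import java.util.Random;

class RabinKarpLasVegas {
    private final String pat;     // kept so that hash matches can be verified
    private final int M;          // pattern length
    private final long Q;         // modulus: a random 31-bit (probable) prime
    private final int R = 256;    // radix
    private long RM1;             // R^(M-1) % Q
    private final long patHash;   // pattern hash value

    RabinKarpLasVegas(String pat) {
        this.pat = pat;
        M = pat.length();
        Q = BigInteger.probablePrime(31, new Random()).longValue();
        RM1 = 1;
        for (int i = 1; i <= M - 1; i++)
            RM1 = (R * RM1) % Q;
        patHash = hash(pat, M);
    }

    private long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (h * R + key.charAt(j)) % Q;
        return h;
    }

    // Las Vegas step: confirm that the text really matches at offset i.
    private boolean check(String txt, int i) {
        return txt.regionMatches(i, pat, 0, M);
    }

    // Index of the first match, or txt.length() if there is none.
    int search(String txt) {
        int N = txt.length();
        if (N < M) return N;
        long txtHash = hash(txt, M);
        if (patHash == txtHash && check(txt, 0)) return 0;
        for (int i = M; i < N; i++) {
            txtHash = (txtHash + Q - RM1 * txt.charAt(i - M) % Q) % Q;
            txtHash = (txtHash * R + txt.charAt(i)) % Q;
            int offset = i - M + 1;
            if (patHash == txtHash && check(txt, offset)) return offset;
        }
        return N;
    }

    public static void main(String[] args) {
        RabinKarpLasVegas rk = new RabinKarpLasVegas("NEEDLE");
        System.out.println(rk.search("INAHAYSTACKNEEDLEINA"));   // 11
    }
}
```

With a 31-bit modulus, the products RM1 * char and txtHash * R stay well within the range of a long, which is the overflow concern the slide's "(but avoid overflow)" note points at. The check() call needs access to the matched region of the text, which is why this version requires backup when the input is a stream.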
Rabin-Karp analysis

Theory. If Q is a sufficiently large random prime (about M N^2),
then the probability of a false collision is about 1 / N.

Practice. Choose Q to be a large prime (but not so large to cause overflow).
Under reasonable assumptions, probability of a collision is about 1 / Q.

Monte Carlo version.
・Always runs in linear time.
・Extremely likely to return correct answer (but not always!).

Las Vegas version.
・Always returns correct answer.
・Extremely likely to run in linear time (but worst case is M N).

Rabin-Karp fingerprint search

Advantages.
・Extends to 2d patterns.
・Extends to finding multiple patterns.

Disadvantages.
・Arithmetic ops slower than char compares.
・Las Vegas version requires backup.
・Poor worst-case guarantee.

Q. How would you extend Rabin-Karp to efficiently search for any
one of P possible patterns in a text of length N ?

Rabin-Karp substring search is known as a fingerprint search because it uses a small
amount of information to represent a (potentially very large) pattern. Then it looks
for this fingerprint (the hash value) in the text. The algorithm is efficient because the
fingerprints can be efficiently computed and compared.
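One possible answer to the question about searching for any one of P patterns is to store the hashes of all patterns in a hash table and look up the rolling text hash at each position. The sketch below is only one way to do this and rests on several assumptions that are not in the slides: all patterns have the same length M, the modulus is the fixed prime 2^31 - 1 rather than a random prime, and every hash hit is verified (Las Vegas style). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiPatternRabinKarp {
    private static final int R = 256;
    private static final long Q = 2147483647L;   // 2^31 - 1, a prime (fixed here for simplicity)

    private static long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (h * R + key.charAt(j)) % Q;
        return h;
    }

    // Returns {text index, pattern index} of the leftmost match, or null if none.
    // Assumes pats is non-empty and all patterns have the same length.
    static int[] search(String[] pats, String txt) {
        int M = pats[0].length(), N = txt.length();
        if (N < M) return null;

        Map<Long, List<Integer>> byHash = new HashMap<>();   // pattern hash -> pattern indices
        for (int p = 0; p < pats.length; p++)
            byHash.computeIfAbsent(hash(pats[p], M), k -> new ArrayList<>()).add(p);

        long RM1 = 1;                                         // R^(M-1) % Q
        for (int i = 1; i <= M - 1; i++) RM1 = (R * RM1) % Q;

        long h = hash(txt, M);
        for (int start = 0; ; start++) {
            List<Integer> candidates = byHash.get(h);
            if (candidates != null)
                for (int p : candidates)                      // verify to rule out collisions
                    if (txt.regionMatches(start, pats[p], 0, M))
                        return new int[] { start, p };
            if (start + M >= N) return null;                  // no more windows
            h = (h + Q - RM1 * txt.charAt(start) % Q) % Q;    // roll the window right by one
            h = (h * R + txt.charAt(start + M)) % Q;
        }
    }

    public static void main(String[] args) {
        int[] r = search(new String[] { "NEEDLE", "HAYSTA" }, "INAHAYSTACKNEEDLEINA");
        System.out.println(r[0] + " " + r[1]);                // 3 1
    }
}
```

Each text position costs one rolling update and one table lookup regardless of P, so under the usual hashing assumptions the expected work is roughly proportional to N plus the total length of the patterns.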

Summary. The table at the bottom of the page summarizes the algorithms that we
have discussed for substring search. As is often the case when we have several
algorithms for the same task, each of them has attractive features. Brute force search
is easy to implement and works well in typical cases (Java’s indexOf() method in String
uses brute-force search); Knuth-Morris-Pratt is guaranteed linear-time with no backup in
the input; Boyer-Moore is sublinear (by a factor of M) in typical situations; and
Rabin-Karp is linear. Each also has drawbacks: brute-force might require time proportional
to MN; Knuth-Morris-Pratt and Boyer-Moore use extra space; and Rabin-Karp has a
relatively long inner loop (several arithmetic operations, as opposed to character
compares in the other methods). These characteristics are summarized in the table below.

Substring search cost summary

Cost of searching for an M-character pattern in an N-character text.

                                                         operation count        backup                 extra
algorithm            version                             guarantee   typical    in input?   correct?   space

brute force          —                                   M N         1.1 N      yes         yes        1

Knuth-Morris-Pratt   full DFA (Algorithm 5.6)            2 N         1.1 N      no          yes        M R
                     mismatch transitions only           3 N         1.1 N      no          yes        M

Boyer-Moore          full algorithm                      3 N         N / M      yes         yes        R
                     mismatched char heuristic only      M N         N / M      yes         yes        R
                     (Algorithm 5.7)

Rabin-Karp†          Monte Carlo (Algorithm 5.8)         7 N         7 N        no          yes†       1
                     Las Vegas                           7 N†        7 N        no†         yes        1

† probabilistic guarantee, with uniform hash function

Cost summary for substring-search implementations
