String Matching
Background
String matching: Naïve method
n = size of input string, m = size of pattern to be matched
Running time is O( (n-m+1)m ), which is Θ( n² ) if m = floor( n/2 )
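A minimal sketch of the naïve method, to make the (n-m+1)m bound concrete. The function name naive_match and the 0-based shifts are assumptions made for this illustration, not part of the original slides.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # There are n - m + 1 candidate shifts.
    for s in range(n - m + 1):
        # Comparing one window costs up to m character comparisons.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_match("BANANA", "ANA"))  # [1, 3]
```

Every shift pays for a full window comparison, which is the cost the hashing scheme tries to avoid.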
We can do better
How it works
Consider a hashing scheme
Each symbol in the alphabet Σ can be represented by an ordinal value { 0, 1, 2, ..., d-1 }
|Σ| = d, so the symbols act as radix-d digits
How it works
Hash pattern P into a numeric value
Let a string be represented by the sum of these digits
Horner's rule (§ 30.1)
Example
{ A, B, C, ..., Z } → { 0, 1, 2, ..., 25 }
BAN → 1 + 0 + 13 = 14
CARD → 2 + 0 + 17 + 3 = 22
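The example can be sketched in a few lines. The helper names ordinal, digit_sum and horner_value are hypothetical. digit_sum reproduces the simplified sum used in the example, while horner_value shows the radix-d value that Horner's rule computes (assuming d = 26 for the letters A to Z), which is the form the matcher pseudocode relies on.

```python
def ordinal(ch: str) -> int:
    """Map 'A'..'Z' to 0..25 (assumes uppercase ASCII letters)."""
    return ord(ch) - ord("A")


def digit_sum(s: str) -> int:
    """Simplified hash from the example: the sum of the ordinal values."""
    return sum(ordinal(ch) for ch in s)


def horner_value(s: str, d: int = 26) -> int:
    """Radix-d value of s, evaluated with Horner's rule."""
    value = 0
    for ch in s:
        value = d * value + ordinal(ch)
    return value


print(digit_sum("BAN"), digit_sum("CARD"))  # 14 22
print(horner_value("BAN"))                  # 689, i.e. 1*26**2 + 0*26 + 13
```

The digit sum ignores symbol order, so the radix-d value distinguishes more strings; it also grows quickly with the pattern length.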
Upper limits
Problem
For long patterns, or for large alphabets, the number representing a given string may be too large to be practical
Solution
Use the MOD operation: when values are taken MOD q, they will be < q
Example
BAN = 1 + 0 + 13 = 14
14 mod 13 = 1, so BAN → 1
CARD = 2 + 0 + 17 + 3 = 22
22 mod 13 = 9, so CARD → 9
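A small sketch of the MOD step, assuming q = 13 as in the example. The names digit_sum_mod and horner_mod are hypothetical. horner_mod reduces after every step of Horner's rule, so no intermediate value grows beyond roughly d*q.

```python
Q = 13  # modulus used in the example


def ordinal(ch: str) -> int:
    """Map 'A'..'Z' to 0..25 (assumes uppercase ASCII letters)."""
    return ord(ch) - ord("A")


def digit_sum_mod(s: str, q: int = Q) -> int:
    """Simplified hash from the example, reduced mod q."""
    return sum(ordinal(ch) for ch in s) % q


def horner_mod(s: str, d: int = 26, q: int = Q) -> int:
    """Radix-d value of s mod q, reducing after every Horner step."""
    value = 0
    for ch in s:
        value = (d * value + ordinal(ch)) % q
    return value


print(digit_sum_mod("BAN"), digit_sum_mod("CARD"))  # 1 9
print(horner_mod("BAN"), horner_mod("CARD"))        # 0 3
```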
Searching
Spurious Hits
Question
Does a hash value match mean that the patterns match?
Answer
No. A hash match without a pattern match is called a spurious hit
Possible cases
The MOD operation interferes with the uniqueness of hash values
14 mod 13 = 1 and 27 mod 13 = 1
The MOD value q is usually chosen as a prime such that 10q just fits within one computer word (dq for a radix-d alphabet)
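To illustrate a spurious hit with the numbers above, the sketch below uses the simplified digit-sum hash with q = 13. The function name slide_hash and the string "BZB" (ordinal sum 1 + 25 + 1 = 27) are chosen only for this illustration.

```python
def slide_hash(s: str, q: int = 13) -> int:
    """Sum of ordinals ('A' -> 0, ..., 'Z' -> 25) reduced mod q."""
    return sum(ord(ch) - ord("A") for ch in s) % q


a, b = "BAN", "BZB"                        # ordinal sums 14 and 27
same_hash = slide_hash(a) == slide_hash(b)  # both reduce to 1 mod 13
same_text = a == b

print(same_hash, same_text)  # True False
```

Equal hashes with different text is exactly the case that forces a character-by-character check after every hash match.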
Code
RABIN-KARP-MATCHER( T, P, d, q )
    n ← length[ T ]
    m ← length[ P ]
    h ← d^(m-1) mod q
    p ← 0
    t_0 ← 0
    for i ← 1 to m                              // Preprocessing
        do p ← ( d*p + P[ i ] ) mod q
           t_0 ← ( d*t_0 + T[ i ] ) mod q
    for s ← 0 to n - m                          // Matching
        do if p = t_s
              then if P[ 1..m ] = T[ s+1 .. s+m ]
                      then print "Pattern occurs with shift" s
           if s < n - m
              then t_(s+1) ← ( d * ( t_s - T[ s+1 ] * h ) + T[ s+m+1 ] ) mod q
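A possible Python rendering of the pseudocode, offered as a sketch rather than a reference implementation. It uses 0-based indexing, returns the shifts instead of printing them, and takes ord(ch) as the digit value of each character. The defaults d = 256 and q = 101 are assumptions for this example, not values from the slides.

```python
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    h = pow(d, m - 1, q)  # weight of the leading digit: d^(m-1) mod q
    p = 0                 # hash of the pattern
    t = 0                 # hash of the current text window

    # Preprocessing: hash the pattern and the first window with Horner's rule.
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q

    shifts = []
    # Matching: compare hashes first, verify characters only on a hit.
    for s in range(n - m + 1):
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Roll the window: drop text[s], append text[s + m].
            # Python's % always yields a value in 0..q-1 for positive q.
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("BANANA", "ANA"))  # [1, 3]
```

The explicit slice comparison after a hash match is what filters out spurious hits, so the result stays correct for any choice of q.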
Performance
Preprocessing (determining the pattern hash and the first window hash)
Θ( m )
Expected case
If we assume the number of hits is constant compared to n, we expect O( n )
Only hash hits require a full pattern comparison, not all shifts
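The point that only hits are verified can be checked by counting them. The sketch below is an instrumented variant for strings of decimal digits (d = 10, q = 13). The function name count_hits and the sample text are assumptions for this illustration.

```python
def count_hits(text: str, pattern: str, d: int = 10, q: int = 13):
    """Return (matching shifts, number of hash hits, number of windows).

    Assumes text and pattern contain only the characters '0'..'9'.
    """
    n, m = len(text), len(pattern)
    if not 1 <= m <= n:
        return [], 0, 0

    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q

    matches, hash_hits = [], 0
    for s in range(n - m + 1):
        if p == t:
            hash_hits += 1  # only these windows are compared in full
            if text[s:s + m] == pattern:
                matches.append(s)
        if s < n - m:
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return matches, hash_hits, n - m + 1


print(count_hits("2359023141526739921", "31415"))  # ([6], 2, 15)
```

Of the 15 windows, only 2 trigger a full comparison: the real match at shift 6 and one spurious hit (67399 mod 13 = 7, the same as 31415 mod 13).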
Demonstration
http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
Sources:
Cormen, Thomas H., et al. Introduction to Algorithms. 2nd ed. Cambridge, MA: MIT Press, 2001.
"Karp-Rabin algorithm." 15 Jan 1997. <http://www-igm.univ-mlv.fr/~lecroq/string/node5.html>.
Shomper, Keith. "Rabin-Karp Animation." E-mail to Jonathan Elchison. 12 Nov 2004.