Professional Documents
Culture Documents
Lecture3.6-3.8 String Matching Algorithms
Lecture3.6-3.8 String Matching Algorithms
Since the parity of our pattern is 0, we only need to check positions 2, 4, 6, 8, 10, and 11 in the text
Rabin Karp
• On average we expect the parity check to reject half the inputs.
• To get a better speed-up, by a factor of q, we need a fingerprint
function that maps m-bit strings to q different fingerprint values.
• Rabin and Karp proposed to use a hash function that considers
the next m bits in the text as the binary expansion of an
unsigned integer and then take the remainder after division by
q.
• A good value of q is a prime number greater than m.
Rabin Karp
• More precisely, if the m bits are s0s1s2 .. sm-1 then we compute
the fingerprint value:
m 1
sj2 m 1 j
mod q
• For the previousexample, f[i] =
j 0
5
t[i j ]2 5 j mod 7
j0
For our pattern 010111, its hash value is 23 mod 7 or 2. This means that we would only use the naïve algorithm for
positions where f[i] = 2
Rabin Karp Wrapup
• But we want to compare text, not bits!
• Text is represented using bits
• For a textual pattern and text, we simply convert the pattern into a
sequence of bits that corresponds to its ASCII sequence, and the same
for the text.
• Skipping the details of the actual implemention, we can
compute f[i] in O(m) time giving us the expected runtime of
O(m+n) given a good hashing.
Rabin-Karp
• The Rabin-Karp string searching algorithm calculates a hash value for the pattern, and for each
M-character subsequence of text to be compared.
• If the hash values are unequal, the algorithm will calculate the hash value for next M-character
sequence.
• If the hash values are equal, the algorithm will do a Brute Force comparison between the
pattern and the M-character sequence.
• In this way, there is only one comparison per text subsequence, and Brute Force is only needed
when hash values match.
14
Rabin-Karp Example
• Hash value of “AAAAA” is 37
• Hash value of “AAAAH” is 100
15
Rabin-Karp Algorithm
pattern is M characters long
hash_p=hash value of pattern
hash_t=hash value of first M letters in body of text
do
if (hash_p == hash_t)
brute force comparison of pattern and selected section of text
hash_t= hash value of next section of text, one character over
while (end of text)
16
Hash Function
• Let b be the number of letters in the alphabet. The text subsequence t[i .. i+M-1] is mapped to
the number
• Furthermore, given x(i) we can compute x(i+1) for the next subsequence t[i+1 .. i+M] in constant time, as follows:
19
Rabin-Karp Mods
• If M is large, then the resulting value (~bM) will be enormous. For this reason, we hash the value by
taking it mod a prime number q.
• The mod function is particularly useful in this case due to several of its inherent properties:
[(x mod q) + (y mod q)] mod q = (x+y) mod q
(x mod q) mod q = x mod q
• For these reasons:
h(i)=((t[i] bM-1 mod q) +(t[i+1] bM-2 mod q) + … +(t[i+M-1] mod q))mod q
h(i+1) =( h(i) b mod q
Shift left one digit
-t[i] bM mod q
Subtract leftmost digit
+t[i+M] mod q )
Add new rightmost digit
mod q 20
Rabin-Karp Complexity
• If a sufficiently large prime number is used for the hash function, the hashed values of two
different patterns will usually be distinct.
• If this is the case, searching takes O(N) time, where N is the number of characters in the
larger body of text.
• It is always possible to construct a scenario with a worst case complexity of O(MN). This,
however, is likely to happen only if the prime number used for hashing is small.
21
KMP : Knuth Morris Pratt
• This is a famous linear-time running string matching algorithm
that achieves a O(m+n) running time.
KMP : Knuth Morris Pratt
• Given a pattern P construct a finite state automaton.
KMP : Knuth Morris Pratt
• Next: push the string through the automaton, and note whether
you reach at the accepted state. (Time O(m+n)).
String matching in 2D
• Given a 2D grid/array of characters, find the occurrence of pattern P
in the grid. The seach direction could be
• horizontal
• vertical
• diagonally
• The following slides mostly taken from
www.cs.cmu.edu/~ckingsf/bioinfo-lectures/suffixtrees.pdf
Suffix Tries
• Suffix tries are a space-efficient data structure to store a string that
allows many kinds of queries to be answered quickly.
• Let A = a0a1a2 ….. an-1 be a string of length n
• Let Ai = aiai+1a2 ….. an-1 be a suffix A that begins at position i.
• A suffix trie of A is made by compressing all the suffixes.
References
• Fundamentals of Computer Algorithms 2nd Edition (2008) by Horowitz, Sahni
and Rajasekaran
• Introduction to Algorithms 3rd Edition (2012) by Thomas H Cormen, Charles E
Lieserson, Ronald
35
THANK YOU
For queries
Email: vipin.e10753@cumail.in
05/20/2022 36