Lecture3.6-3.8 String Matching Algorithms

Apex Institute of Technology
Department of Computer Science & Engineering

Bachelor of Engineering (Computer Science & Engineering)
Design and Analysis of Algorithms– (20CST-282)
Prepared By: Dr. Vipin Tiwari (E10753)
05/20/2022 DISCOVER . LEARN . EMPOWER

1
String Matching
• Problem is to find if a pattern P[1..m] occurs within text T[1..n]
• Simple solution: Naïve String Matching
• Match each position in the pattern to each position in the text
• T = AAAAAAAAAAAAAA
• P = AAAAAB
AAAAAB
etc.
• O(mn)
• It is important for algorithms on character strings
• Many programming contest problems require the use of this data
structure.
Hash Table
• Hash Function.
Hash Table
• Hash Function.
Hash Table
String Matching
• Problem is to find if a pattern P[1..m] occurs within text T[1..n]
• Simple solution: Naïve String Matching
• Match each position in the pattern to each position in the text
• T = AAAAAAAAAAAAAA
• P = AAAAAB
AAAAAB
etc.
• O(mn)
Rabin Karp
• Idea: Before spending a lot of time comparing chars for a
match, do some pre-processing to eliminate locations that could
not possibly match
• If we could quickly eliminate most of the positions then we can
run the naïve algorithm on whats left
• Eliminate enough to hopefully get O(n) runtime overall
Rabin Karp Idea
• To get a feel for the idea say that our text and pattern is a
sequence of bits.
• For example,
• P=010111
• T=0010110101001010011
• The parity of a binary value is to count the number of one’s. If odd, the
parity is 1. If even, the parity is 0. Since our pattern is six bits long, let’s
compute the parity for each position in T, counting six bits ahead. Call
this f[i] where f[i] is the parity of the string T[i..i+5].
Parity
T=0010110101001010011
P=010111
Since the parity of our pattern is 0, we only need to check positions 2, 4, 6, 8, 10, and 11 in the text
Rabin Karp
• On average we expect the parity check to reject half the inputs.
• To get a better speed-up, by a factor of q, we need a fingerprint
function that maps m-bit strings to q different fingerprint values.
• Rabin and Karp proposed to use a hash function that considers
the next m bits in the text as the binary expansion of an
unsigned integer and then take the remainder after division by
q.
• A good value of q is a prime number greater than m.
Rabin Karp
• More precisely, if the m bits are s0s1s2 .. sm-1 then we compute
the fingerprint value:
 m 1 
 sj2 m 1 j
 mod q
 
• For the previousexample, f[i] =
j  0
 5 
  t[i  j ]2 5  j  mod 7
 
 j0 
For our pattern 010111, its hash value is 23 mod 7 or 2. This means that we would only use the naïve algorithm for
positions where f[i] = 2
Rabin Karp Wrapup
• But we want to compare text, not bits!
• Text is represented using bits
• For a textual pattern and text, we simply convert the pattern into a
sequence of bits that corresponds to its ASCII sequence, and the same
for the text.
• Skipping the details of the actual implemention, we can
compute f[i] in O(m) time giving us the expected runtime of
O(m+n) given a good hashing.
Rabin-Karp
• The Rabin-Karp string searching algorithm calculates a hash value for the pattern, and for each
M-character subsequence of text to be compared.
• If the hash values are unequal, the algorithm will calculate the hash value for next M-character
sequence.
• If the hash values are equal, the algorithm will do a Brute Force comparison between the
pattern and the M-character sequence.
• In this way, there is only one comparison per text subsequence, and Brute Force is only needed
when hash values match.
14
Rabin-Karp Example
• Hash value of “AAAAA” is 37
• Hash value of “AAAAH” is 100
15
Rabin-Karp Algorithm
pattern is M characters long
hash_p=hash value of pattern
hash_t=hash value of first M letters in body of text
do
if (hash_p == hash_t)
brute force comparison of pattern and selected section of text
hash_t= hash value of next section of text, one character over
while (end of text)
16
Hash Function
• Let b be the number of letters in the alphabet. The text subsequence t[i .. i+M-1] is mapped to
the number
• Furthermore, given x(i) we can compute x(i+1) for the next subsequence t[i+1 .. i+M] in constant time, as follows:
• In this way, we never explicitly compute a new value. We

simply adjust the existing value as we move over one
character. 18
Rabin-Karp Math Example
• Let’s say that our alphabet consists of 10 letters.

• our alphabet = a, b, c, d, e, f, g, h, i, j
• Let’s say that “a” corresponds to 1, “b” corresponds to 2 and so on.
The hash value for string “cah” would be ...
3*100 + 1*10 + 8*1 = 318
19
Rabin-Karp Mods
• If M is large, then the resulting value (~bM) will be enormous. For this reason, we hash the value by
taking it mod a prime number q.
• The mod function is particularly useful in this case due to several of its inherent properties:
[(x mod q) + (y mod q)] mod q = (x+y) mod q
(x mod q) mod q = x mod q
• For these reasons:
h(i)=((t[i] bM-1 mod q) +(t[i+1] bM-2 mod q) + … +(t[i+M-1] mod q))mod q
h(i+1) =( h(i) b mod q
Shift left one digit
-t[i] bM mod q
Subtract leftmost digit
+t[i+M] mod q )
Add new rightmost digit
mod q 20
Rabin-Karp Complexity
• If a sufficiently large prime number is used for the hash function, the hashed values of two
different patterns will usually be distinct.
• If this is the case, searching takes O(N) time, where N is the number of characters in the
larger body of text.
• It is always possible to construct a scenario with a worst case complexity of O(MN). This,
however, is likely to happen only if the prime number used for hashing is small.
21
KMP : Knuth Morris Pratt
• This is a famous linear-time running string matching algorithm
that achieves a O(m+n) running time.
• Given a pattern P construct a finite state automaton.
• Next: push the string through the automaton, and note whether
you reach at the accepted state. (Time O(m+n)).
String matching in 2D
• Given a 2D grid/array of characters, find the occurrence of pattern P
in the grid. The seach direction could be
• horizontal
• vertical
• diagonally
• The following slides mostly taken from
www.cs.cmu.edu/~ckingsf/bioinfo-lectures/suffixtrees.pdf
Suffix Tries
• Suffix tries are a space-efficient data structure to store a string that
allows many kinds of queries to be answered quickly.
• Let A = a0a1a2 ….. an-1 be a string of length n
• Let Ai = aiai+1a2 ….. an-1 be a suffix A that begins at position i.
• A suffix trie of A is made by compressing all the suffixes.
References
• Fundamentals of Computer Algorithms 2nd Edition (2008) by Horowitz, Sahni
and Rajasekaran
• Introduction to Algorithms 3rd Edition (2012) by Thomas H Cormen, Charles E
Lieserson, Ronald
35
THANK YOU
For queries
Email: vipin.e10753@cumail.in
05/20/2022 36

Lecture3.6-3.8 String Matching Algorithms

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture3.6-3.8 String Matching Algorithms

Uploaded by

Copyright:

Available Formats

Apex Institute of Technology

Department of Computer Science & Engineering

05/20/2022 DISCOVER . LEARN . EMPOWER

• In this way, we never explicitly compute a new value. We

• Let’s say that our alphabet consists of 10 letters.

3100 + 110 + 8*1 = 318

You might also like