Exact String Matching Algorithms
Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU
Exact Matching: What’s the Problem
Position: 1 2 3 4 5 6 7 8 9 10 11 12
T =       b b a b a x a b a b  a  y
P = aba
P occurs in T starting at locations 3, 7, and 9.
Occurrences of P may overlap, as found at 7 and 9.
The Naive Method
• Problem is to find if a pattern P[1..m] occurs
within text T[1..n]
• Let P = abxyabxz and T = xabxyabxyabxz
• Where m = 8 and n = 13
The Naive Method
• If P = aaa and T = aaaaaaaaaa then m = 3, n = 10
• In the worst case there are exactly m(n-m+1) comparisons
• In this case 24 comparisons, which is of the order Θ(mn)
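As a rough check of the worst-case count, the sketch below counts the character comparisons the naive method makes. The function name count_naive_comparisons is made up for illustration, and the code uses ordinary 0-based C strings rather than the 1-based indexing of the slides.

```c
#include <string.h>

/* Counts character comparisons made by the naive method.
   Assumption: m = |P| and n = |T|; strings are 0-based C strings. */
long count_naive_comparisons(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    long comparisons = 0;
    if (m == 0 || m > n) return 0;
    for (size_t i = 0; i + m <= n; i++) {          /* n-m+1 alignments */
        for (size_t j = 0; j < m; j++) {
            comparisons++;                          /* text[i+j] against pat[j] */
            if (text[i + j] != pat[j]) break;       /* stop at the first mismatch */
        }
    }
    return comparisons;
}
```

With P = aaa and T = aaaaaaaaaa every alignment runs all m comparisons, so the function returns 3 * (10 - 3 + 1) = 24.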
The Naive Algorithm
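#include <stdio.h>
/* Naive search. text[1..n] and pat[1..m] are 1-based; index 0 is unused. */
void naive_search(const char text[], const char pat[], int n, int m)
{
    int i, j, k, lim = n - m + 1;
    for (i = 1; i <= lim; i++) {   /* try every alignment of pat against text */
        k = i;
        for (j = 1; j <= m && text[k] == pat[j]; j++) k++;
        if (j > m) printf("match at position %d\n", i);   /* all m characters matched */
    }
}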
• The worst-case bound can be reduced to O(m+n)
• For applications with m = 1000 and n = 10,000,000 the improvement is significant.
The Smart Algorithm
• If you know that the first character of P (namely a) does not occur again in P until position 5 of P, then after a mismatch P can be shifted further than one position instead of being tried at every position
• This skips over three comparisons
• Reasoning of this sort is the key to shifting by more than one character
The Smarter Algorithm
[Figure: compared with the naive method, the smarter algorithm skips over three comparisons with its first shift and another three with its next.]
The Smart Algorithms
• Knuth-Morris-Pratt (KMP) Algorithm
• Boyer-Moore Algorithm
• Reduce the run time to O(n+m)
• The additional knowledge requires preprocessing of the strings
• Usually P is much shorter than T, so P is preprocessed
The Preprocessing Approach
• Usually P is preprocessed instead of T
• Sometimes T is preprocessed, e.g. suffix tree
• The preprocessing methods are similar in
spirit, but often quite different in detail and
conceptual difficulty
• Fundamental preprocessing of P is
independent of any particular algorithm
• Each algorithm uses this information
Basic String Definitions/Notations
• Let S be a string
• S[i..j] is the substring of S starting at position i and ending at position j; S[i..j] is empty if i > j
• |S| is the length of the string
• S[1..i] is the prefix of S that ends at position i
• S[i..|S|] is the suffix of S that begins at position i
Example:
Position: 1 2 3 4 5 6 7 8 9 10 11 12
S =       b b a b a x a b a b  a  y
Here, |S| = 12
Substring: S[3..7] = abaxa
Prefix: S[1..4] = bbab
Suffix: S[9..12] = abay
Basic String Definitions/Notations
• A proper prefix, suffix or substring of S is, respectively, a prefix, suffix or substring that is neither the entire string S nor the empty string.
• For any string S, S(i) denotes the ith character of S
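The 1-based, inclusive notation S[i..j] can be mapped onto 0-based C strings as in the small sketch below. The helper name substring_1based and its buffer convention are assumptions made for illustration only.

```c
#include <string.h>

/* Copies S[i..j] (1-based, inclusive) into out and returns its length.
   The result is empty if i > j. out must have room for j-i+2 bytes.
   Hypothetical helper, not part of the slides. */
size_t substring_1based(const char *s, size_t i, size_t j, char *out)
{
    size_t len = strlen(s);
    if (i < 1 || i > j || i > len) { out[0] = '\0'; return 0; }
    if (j > len) j = len;              /* clip to the end of S */
    size_t k = j - i + 1;
    memcpy(out, s + (i - 1), k);       /* position i is s[i-1] in C */
    out[k] = '\0';
    return k;
}
```

For S = bbabaxababay, the call with i = 3 and j = 7 yields abaxa, the prefix S[1..4] is bbab, and the suffix S[9..12] is abay.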
Preprocessing
• Goal: To gather the information needed for speeding up the algorithm
• Definitions:
– Zi: For i>1, the length of the longest substring of S that starts at i and matches a prefix of S
– Z-box: For any position i>1 where Zi>0, the Z-box at i starts at i and ends at i+Zi-1
– ri: For every i>1, ri is the right-most endpoint of the Z-boxes that begin at or before i
– li: For every i>1, li is the left endpoint of the Z-box that ends at ri
Preprocessing
Zi(S) = the length of the longest prefix of S[i..|S|] that matches a prefix of S, where i > 1
Position: 1 2 3 4 5 6 7 8 9 10 11
S =       a a b c a a b x a a  z
Z5(S) = 3 (aabc…aabx…)
Z6(S) = 1 (aa…ab…)
Z7(S) = Z8(S) = 0
Z9(S) = 2 (aab…aaz)
We will use Zi in place of Zi(S)
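The definition of Zi can be evaluated directly by comparing characters, as in the sketch below. This is the straightforward quadratic method, not the linear-time Z algorithm, and the function name z_value_naive is an assumption for illustration.

```c
#include <string.h>

/* Z_i by direct comparison: the length of the longest prefix of S[i..|S|]
   that matches a prefix of S. i is 1-based with i > 1, as on the slide;
   other values of i return 0. */
size_t z_value_naive(const char *s, size_t i)
{
    size_t n = strlen(s);
    size_t len = 0;
    if (i < 2 || i > n) return 0;
    /* S[i+len] is s[i-1+len]; S[1+len] is s[len] */
    while (i - 1 + len < n && s[len] == s[i - 1 + len]) len++;
    return len;
}
```

For S = aabcaabxaaz this gives Z5 = 3, Z6 = 1, Z7 = Z8 = 0 and Z9 = 2, matching the values listed on the slide.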
Z Box
Defined for i > 1 where Zi is greater than zero: the Z-box at i starts at i and ends at i+Zi-1
Figure 1.2: From Gusfield

The li and ri of Z-Box
ri = the right-most endpoint of the Z-boxes that begin at or before position i.
li = the left end of the Z-box that ends at ri.
[Figure: Z-boxes drawn over a string, with positions 40, 50, 55, 62, 70, 78, 82, 85, 89 and 95 marked]
r78 = 95, l78 = 78
r82 = 95, l82 = 78
r52 = 50, l52 = 40
r75 = 85, l75 = 70
Preprocessing
i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Zi:  0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
ri:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
li:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
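The ri and li rows of the table can be derived from the Zi row with a single left-to-right pass. The sketch below assumes 1-based arrays with index 0 unused; the function name zbox_bounds is hypothetical. When several Z-boxes end at the same position it keeps the earliest one, which is one valid choice for li.

```c
/* Given z[2..n] (1-based; z[0] and z[1] unused), fills r[1..n] and l[1..n].
   r[i] is the right-most endpoint of the Z-boxes beginning at or before i,
   l[i] is the left end of that Z-box. Both are 0 while no Z-box exists. */
void zbox_bounds(const int *z, int n, int *r, int *l)
{
    int cur_r = 0, cur_l = 0;
    r[1] = 0;
    l[1] = 0;
    for (int i = 2; i <= n; i++) {
        /* A Z-box at i spans i .. i+z[i]-1; keep it if it reaches further right */
        if (z[i] > 0 && i + z[i] - 1 > cur_r) {
            cur_r = i + z[i] - 1;
            cur_l = i;
        }
        r[i] = cur_r;
        l[i] = cur_l;
    }
}
```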
Z-Algorithm
Goal: To calculate Zi for an input string S in linear time
Starting from i=2, calculate Z2, r2 and l2

For k=3; k<=n; k++, where n = |S|
In iteration k, calculate Zk, rk and lk based on Zj, rj and lj for j=2,…,k-1
For iteration k, the algorithm only needs rk-1 and lk-1 (together with the earlier Z values). Thus, there is no need to keep all ri and li. We use r and l to denote rk-1 and lk-1
Z-Algorithm
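In iteration k:
(I) if k<=r
Position k lies inside the Z-box α = S[l..r], which matches the prefix α' = S[l'..r'] with l' = 1
k'=k-l+1; r'=r-l+1; α=α'; β=β'
Here β = S[k..r] and β' = S[k'..r']
[Figure: α' with β' at positions l', k', r' and α with β at positions l, k, r, drawn over positions 1 to 17]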
Z-Algorithm
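A) If |γ'|<|β'|, that is, Zk' < r-k+1, then Zk = Zk'
γ=γ'=γ''; x≠y
Here γ'' is the prefix of S of length Zk', followed by character x; γ' is its copy starting at k', followed by y; γ is the copy starting at k, also followed by y
[Figure: γ'' x at the start of S, γ' y inside β' within α', γ y inside β within α]
i:   1  2  3  4  5  6  7  8  9 10 11 12 13
Z:   0  1  0  3  1  0  0  1  0  7  1  0  3
Example for S = aabaabcaxaabaabcy at k=13: l=10, r=16, k'=4, Z4=3 < r-k+1=4, so Z13=Z4=3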
Z-Algorithm
B) If |γ'|>|β'|, that is, Zk' > r-k+1, then Zk = |β|, i.e., r-k+1
β=β'=β''; γ'=γ''
x≠y (because α is a Z-box)
[Figure: γ'' at the start of S extends past β'' to character x, γ' at k' extends past β' to the same character x, and β at k is followed by y]
i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:   a  a  b  a  a  b  c  a  x  a  a  b  a  a  c  d
Z:   0  1  0  3  1  0  0  1  0  5  1  0  2  1  0  0
Example at k=13: l=10, r=14, k'=4, Z4=3 > r-k+1=2, so Z13=2
Z-Algorithm
C) If |γ'|=|β'|, that is, Zk' = r-k+1, then Zk ≥ |β|, i.e., Zk ≥ r-k+1
β=β'=β''; γ=γ'=γ''
x≠y (because α is a Z-box)
z≠x (because γ' is a Z-box)
z ?? y (not yet known)
Compare S[r+1,…] with S[|β|+1,…] until a mismatch occurs. Update Zk, r, and l
[Figure: γ'' followed by z at the start of S, γ' followed by x at k', γ followed by y at k]
i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:   a  a  b  a  a  e  c  a  x  a  a  b  a  a  b  d
Z:   0  1  0  2  1  0  0  1  0  5  1  0  3  1  0  0
Example at k=13: l=10, r=14, k'=4, Z4=2 = r-k+1; comparing S[15,…] with S[3,…] finds one more match, so Z13=3, r=15 and l=13
Z-Algorithm
(II) if k>r
[Figure: positions l, r, k with k to the right of r]
Compare the characters starting at k with those starting at 1 until a mismatch occurs.
Set Zk, and update r and l if necessary
Z-Algorithm
Input: Pattern P, with n = |P|
Output: Zi for i = 2,…,n
Z Algorithm
Calculate Z2, r2 and l2 explicitly by comparisons. Set r=r2 and l=l2
for k=3; k<=n; k++
  if k<=r, let k' = k-l+1
    if Zk' < r-k+1, then Zk = Zk'
    else if Zk' > r-k+1, then Zk = r-k+1
    else compare the characters starting at r+1 with those starting at |β|+1 = r-k+2 until a mismatch occurs. Set Zk, and update r and l if necessary
  else compare the characters starting at k with those starting at 1 until a mismatch occurs. Set Zk, and update r and l if necessary
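The pseudocode can be turned into C roughly as follows. This is a minimal sketch under stated assumptions: the function name z_algorithm is made up, z is a caller-supplied array of strlen(s)+1 ints indexed from 1 like the slides, and the Z2 step is folded into the main loop by starting with r = 0.

```c
#include <string.h>

/* Linear-time Z algorithm with 1-based positions, as on the slides.
   On return z[k] = Z_k for k = 2..n; z[0] and z[1] are set to 0. */
void z_algorithm(const char *s, int *z)
{
    int n = (int)strlen(s);
    int l = 0, r = 0;                   /* right-most Z-box so far; r = 0 means none */
    z[0] = 0;
    if (n >= 1) z[1] = 0;
    for (int k = 2; k <= n; k++) {
        if (k > r) {
            /* Case II: compare S[k..] with S[1..] explicitly */
            int q = 0;
            while (k + q <= n && s[k + q - 1] == s[q]) q++;
            z[k] = q;
            if (q > 0) { l = k; r = k + q - 1; }
        } else {
            int kp = k - l + 1;         /* k' */
            int beta = r - k + 1;       /* |beta| */
            if (z[kp] < beta) {
                z[k] = z[kp];           /* Case A */
            } else if (z[kp] > beta) {
                z[k] = beta;            /* Case B */
            } else {
                /* Case C: compare S[r+1..] with S[|beta|+1..] */
                int q = r + 1;
                while (q <= n && s[q - 1] == s[q - k]) q++;
                z[k] = q - k;
                l = k;
                r = q - 1;
            }
        }
    }
}
```

Run on S = aabaabcaxaabaabcy it reproduces the Z row of the Preprocessing table, and on the case B and case C example strings it gives Z13 = 2 and Z13 = 3 respectively.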
Preprocessing
i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Z:   0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:   0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:   0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
Z-Algorithm
Time complexity
• #mismatches <= number of iterations <= n, because each iteration ends at its first mismatch
• #matches: let q be the number of matches at iteration k; then r increases by at least q. Since r<=n, total #matches <= n
• T = O(#matches + #mismatches + #iterations) = O(n)
Per-iteration counts (#m = matches, #mis = mismatches):
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Z:     0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:     0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:     0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
#m:    0  1  0  3  0  0  0  1  0  7  0  0  0  0  0  0  0
#mis:  0  1  1  1  0  0  1  1  1  1  0  0  0  0  0  0  1
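The counting argument can be checked empirically by instrumenting the Z computation. In the sketch below the struct CompareCount and the function count_z_comparisons are hypothetical names; cases A and B are merged into one branch because neither performs any comparison.

```c
#include <string.h>

/* Hypothetical counters for explicit character comparisons */
typedef struct { long matches; long mismatches; } CompareCount;

/* Computes z[2..n] (1-based; z needs strlen(s)+1 ints) and counts comparisons. */
CompareCount count_z_comparisons(const char *s, int *z)
{
    CompareCount c = {0, 0};
    int n = (int)strlen(s), l = 0, r = 0;
    z[0] = 0;
    if (n >= 1) z[1] = 0;
    for (int k = 2; k <= n; k++) {
        if (k <= r && z[k - l + 1] != r - k + 1) {
            /* Cases A and B: Z_k follows from earlier values, no comparisons */
            int beta = r - k + 1;
            z[k] = z[k - l + 1] < beta ? z[k - l + 1] : beta;
            continue;
        }
        /* Case C starts comparing after r, case II starts at k */
        int q = (k <= r) ? r + 1 : k;
        while (q <= n) {
            if (s[q - 1] == s[q - k]) { c.matches++; q++; }
            else { c.mismatches++; break; }
        }
        z[k] = q - k;
        if (q - 1 >= k) { l = k; r = q - 1; }   /* a Z-box exists at k */
    }
    return c;
}
```

For the 17-character example string the totals are 12 matches and 8 mismatches, the sums of the #m and #mis rows, and both stay below n = 17.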
Simplest Linear Time Exact Matching Algorithm
Input: Pattern P, Text T
Output: Occurrences of P in T
Algorithm Simplest
S=P$T, where $ is a character that does not appear in P or T
For i=2; i<=|S|; i++
  Calculate Zi
  If Zi=|P|, then report that there is an occurrence of P in T starting at position i-|P|-1 of T
Running time: O(|P|+|T|+1) = O(n+m)
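A minimal sketch of this matching method in C is given below, assuming the caller supplies a separator character that occurs in neither string. The function name z_match, its output-array convention and its return value are illustrative assumptions. For simplicity it stores Z values for all of S rather than only for P.

```c
#include <stdlib.h>
#include <string.h>

/* Finds occurrences of pat in text using Z values of S = pat + sep + text.
   Assumption: sep appears in neither pat nor text.
   Writes 1-based start positions in text to out (capacity max_out) and
   returns how many were written, or -1 if memory allocation fails. */
int z_match(const char *pat, const char *text, char sep, int *out, int max_out)
{
    int m = (int)strlen(pat), n = (int)strlen(text);
    int total = m + 1 + n, found = 0, l = 0, r = 0;
    char *s = malloc((size_t)total + 1);
    int *z = calloc((size_t)total + 1, sizeof *z);
    if (!s || !z) { free(s); free(z); return -1; }
    memcpy(s, pat, (size_t)m);
    s[m] = sep;
    memcpy(s + m + 1, text, (size_t)n);
    s[total] = '\0';
    for (int k = 2; k <= total; k++) {
        if (k <= r && z[k - l + 1] != r - k + 1) {
            /* Z_k is determined by earlier values (cases A and B) */
            int beta = r - k + 1;
            z[k] = z[k - l + 1] < beta ? z[k - l + 1] : beta;
        } else {
            /* explicit comparisons from r+1 (case C) or from k (case II) */
            int q = (k <= r) ? r + 1 : k;
            while (q <= total && s[q - 1] == s[q - k]) q++;
            z[k] = q - k;
            if (q - 1 >= k) { l = k; r = q - 1; }
        }
        /* Z_k = |P| means P occurs in T starting at position k-|P|-1 */
        if (m > 0 && z[k] == m && found < max_out) out[found++] = k - m - 1;
    }
    free(s);
    free(z);
    return found;
}
```

With P = aba and T = bbabaxababay the reported positions are 3, 7 and 9, the same occurrences listed on the problem slide.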
Simplest Linear Time Exact Matching Algorithm
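[Figure: α' and β' in P, before the $, matched against α and β in T]
• Takes only O(|P|) extra space, because only the Z values for positions in P need to be stored
• Alphabet-independent linear time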
Reference
• Chapter 1, 2: Exact Matching: Fundamental
Preprocessing and First Algorithms
