
String Matching Algorithms Overview

The document describes string matching algorithms. It begins by defining the problem of finding all occurrences of a pattern string P in a text string T. It then presents the naive brute force algorithm that compares P to all possible substrings of T in O(nm) time, where n and m are the lengths of T and P. It also describes the Rabin-Karp algorithm that hashes substrings to integers in order to solve the problem in expected O(n+m) time. Finally, it discusses the finite automata algorithm that constructs a deterministic finite automaton to process T and recognize matches of P in O(n+m) time.

Uploaded by

Bernando Vialli
Copyright
© Attribution Non-Commercial (BY-NC)

Jim Anderson Comp 750, Fall 2009 String Matching - 1

Chapter 32: String Matching


Given: Two strings T[1..n] and P[1..m] over alphabet Σ.

Want to find all occurrences of P[1..m] "the pattern" in T[1..n] "the text".

Example: Σ = {a, b, c}
text T:     a b c a b a a b c a a b a c
pattern P:        a b a a                  (s = 3)
Terminology:
- P occurs with shift s.
- P occurs beginning at position s+1.
- s is a valid shift.

Goal: Find all valid shifts.
Applications: Text editors, search for patterns in DNA sequences
(actually, this is stretching the truth a little),

Notation and Terminology
w pre x -- w is a prefix of x.

Example: aba pre abaabc.

w suf x -- w is a suffix of x.

Example: abc suf abaabc.

Note: In the book the symbol ⊏ is used instead of pre,
and the symbol ⊐ is used instead of suf.

I couldn't figure out an easy way to reproduce these
symbols in PowerPoint.

Lemma 32.1
Lemma 32.1: Suppose x suf z and y suf z. If |x| ≤ |y| then
x suf y. If |x| ≥ |y| then y suf x. If |x| = |y| then x = y.
[Figure: x and y drawn right-aligned under z, one diagram for each of the three cases.]

More Notation
P_k = P[1..k], where k ≤ m.

Thus, P_0 = ε (the empty string), P_m = P[1..m] = P.

Similarly, T_k = T[1..k], where k ≤ n.

Our Problem: Find all s, where 0 ≤ s ≤ n - m, such that P suf T_{s+m}.

Assumption: We assume the test "x = y" takes O(t + 1) time,
where t is the length of the longest string z such that z pre x
and z pre y.

Naïve Brute-Force Algorithm
Naïve(T, P)
  n := length[T];
  m := length[P];
  for s := 0 to n - m do
    if P[1..m] = T[s+1..s+m] then
      print "pattern occurs with shift s"
    fi
  od
Running time is O((n - m + 1)m).

Bound is tight. Consider: T = a^n, P = a^m.
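
The brute-force loop above can be sketched in Python. This is only an illustration: the name naive_match is made up here, and the strings are 0-indexed, so the slice text[s:s+m] plays the role of T[s+1..s+m].

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every valid shift s, i.e. every s with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # s = 0 .. n - m
        if text[s:s + m] == pattern:    # compares up to m characters
            shifts.append(s)
    return shifts


# Text and pattern from the first example in these slides.
print(naive_match("abcabaabcaabac", "abaa"))  # [3]
```

On T = a^n, P = a^m every comparison runs the full m characters, which is the tight case mentioned above.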

Example
T = a c a a b c, P = a a b
s = 0:  a c a  vs  a a b    no match
s = 1:  c a a  vs  a a b    no match
s = 2:  a a b  vs  a a b    match!
s = 3:  a b c  vs  a a b    no match

Rabin-Karp Algorithm
Suppose Σ = {0, 1, 2, …, 9}.

Let us view P as a decimal number.

Example: View P = "31415" as 31,415.

Can also view substrings of T as decimal numbers.

Let t_s = the decimal number corresponding to T[s+1..s+m].
Let p = the decimal number corresponding to P.

We want to know all s such that t_s = p.

We can compute p in O(m) time using Horner's Rule:

p = P[m] + 10(P[m-1] + 10(P[m-2] + … + 10(P[2] + 10 P[1]) … ))

Can similarly compute t_0 in O(m) time.
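
As a small illustration of Horner's Rule for the decimal case, the following Python sketch evaluates the digits left to right, one multiplication and one addition per digit. The function name horner_decimal is an assumption made for this example.

```python
def horner_decimal(digits: str) -> int:
    """Evaluate a string of decimal digits with Horner's Rule in O(m) steps."""
    value = 0
    for ch in digits:
        value = 10 * value + int(ch)   # shift left one decimal place, add next digit
    return value


print(horner_decimal("31415"))  # 31415
```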

RK Algorithm (Continued)
Can compute t_1, t_2, …, t_{n-m} in O(n - m) time as follows:

t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1].

Example: T = 314152
t_{s+1} = 10(31415 - 10000·3) + 2
        = 14152
Time Complexity:
O(n+m) + O(n-m) = O(n+m)
The first term is the time to compute p and t_0, …, t_{n-m};
the second is the time to perform the n-m+1 comparisons.
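
The constant-time update can be checked with a short Python sketch. It assumes the text consists of decimal digits; window_values is a hypothetical helper name, and it returns the value of every length-m window, each obtained from the previous one by dropping the leading digit and appending the next.

```python
def window_values(text: str, m: int) -> list[int]:
    """Return the decimal value of every length-m window of a digit string."""
    n = len(text)
    if m == 0 or m > n:
        return []
    high = 10 ** (m - 1)               # weight of the leading digit
    t = 0
    for ch in text[:m]:                # first window by Horner's Rule, O(m)
        t = 10 * t + int(ch)
    values = [t]
    for s in range(n - m):             # each later window in O(1)
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```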

Two Problems
Might have |Σ| = d ≠ 10.
Solution: Use radix-d arithmetic.

Numbers may be very large.
Solution: Perform computations modulo q for some q.

What is q?
Select q to be a large prime such that dq fits in one memory word.

⇒ All computations can be performed using single-precision
arithmetic.

To summarize, p is computed using

p = (P[m] + d(P[m-1] + d(P[m-2] + … + d(P[2] + d P[1]) … ))) mod q

t_0 is computed similarly.

Other t_i's are computed using

t_{s+1} = (d(t_s - T[s+1]h) + T[s+m+1]) mod q, where h ≡ d^{m-1} (mod q)

Unfortunately, we have a new problem: spurious hits.

Example
pattern P:  3 1 4 1 5                                   (31415 mod 13 = 7)
text T:     2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
window 3 1 4 1 5:  7 mod 13, valid match
window 6 7 3 9 9:  7 mod 13, spurious hit

Algorithm
We deal with spurious hits by performing an explicit check whenever
there is a potential match.
RK(T, P, d, q)
  n := length[T];
  m := length[P];
  h := d^{m-1} mod q;
  p := 0;
  t_0 := 0;
  for i := 1 to m do
    p := (d·p + P[i]) mod q;
    t_0 := (d·t_0 + T[i]) mod q
  od;
  for s := 0 to n - m do
    if p = t_s then
      if P[1..m] = T[s+1..s+m] then
        print "pattern occurs with shift s"
      fi
    fi;
    if s < n - m then
      t_{s+1} := (d(t_s - T[s+1]h) + T[s+m+1]) mod q
    fi
  od
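
A minimal Python sketch of the RK procedure follows. The names rabin_karp and code, and the default values of d and q, are assumptions for illustration only. The code parameter maps a character to its digit value; the default uses character codes, and passing int reproduces the decimal example with q = 13.

```python
from typing import Callable


def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 1_000_003,
               code: Callable[[str], int] = ord) -> list[int]:
    """Return all valid shifts, verifying every hash hit explicitly."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                   # d^(m-1) mod q
    p = t = 0
    for i in range(m):                     # Horner's Rule, modulo q
        p = (d * p + code(pattern[i])) % q
        t = (d * t + code(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # A hash hit may be spurious, so compare the strings as well.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                      # roll the window one position
            t = (d * (t - code(text[s]) * h) + code(text[s + m])) % q
    return shifts


# Text and pattern from the spurious-hit example, radix 10, q = 13.
print(rabin_karp("2359023141526739921", "31415", d=10, q=13, code=int))  # [6]
```

Python's % operator always returns a non-negative result for a positive modulus, so the subtraction inside the update needs no extra correction here; in a language with signed remainders an adjustment would be needed.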

Running Time
Worst-Case: O((n - m + 1)m). (Again, consider P = a^m, T = a^n.)

Average-Case:

Some assumptions:

Assume O(1) valid shifts.

Think of 0, 1, …, q - 1 like hash buckets.

Assume each bucket is equally likely.

We expect O(n/q) spurious hits.

Expected running time is:
O(n) + O(m(number of valid shifts + n/q))
= O(n+m), choosing q ≥ m

Finite Automata Algorithm
[Figure: state-transition diagram with states 0 through 7 for the pattern P = a b a b a c a.]

Transition table (rows are states, columns a, b, c are inputs; the P column lists the pattern characters):

state   a  b  c   P
  0     1  0  0   a
  1     1  2  0   b
  2     3  0  0   a
  3     1  4  0   b
  4     5  0  0   a
  5     1  4  6   c
  6     7  0  0   a
  7     1  2  0

i             --  1  2  3  4  5  6  7  8  9 10 11
T[i]          --  a  b  a  b  a  b  a  c  a  b  a
state φ(T_i)   0  1  2  3  4  5  4  5  6  7  2  3

Processing T takes O(n) time.
But have to first construct FA.
Main Issue: How to construct FA?
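
To see the O(n) processing step in isolation, the Python sketch below hard-codes the transition table from this slide and runs it over the example text. The names DELTA and fa_match are assumptions for illustration; how to build the table is a separate question.

```python
# Transition table copied from the slide (pattern a b a b a c a, states 0..7).
DELTA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 3, "b": 0, "c": 0},
    3: {"a": 1, "b": 4, "c": 0},
    4: {"a": 5, "b": 0, "c": 0},
    5: {"a": 1, "b": 4, "c": 6},
    6: {"a": 7, "b": 0, "c": 0},
    7: {"a": 1, "b": 2, "c": 0},
}


def fa_match(text, delta, m):
    """Run the automaton over text; return the visited states and valid shifts."""
    state = 0
    states = [state]
    shifts = []
    for i, ch in enumerate(text, start=1):   # i is 1-based, as on the slide
        state = delta[state][ch]             # one table lookup per character
        states.append(state)
        if state == m:                       # accepting state: match ends at i
            shifts.append(i - m)
    return states, shifts


states, shifts = fa_match("abababacaba", DELTA, 7)
print(states)  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 2, 3]
print(shifts)  # [2]
```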

Need some Notation
φ(w) = state FA ends up in after processing w.

Example: φ(abab) = 4.

σ(x) = max{k: P_k suf x}. Called the suffix function.

Examples: Let P = ab.
σ(ε) = 0
σ(ccaca) = 1
σ(ccab) = 2

Note: If |P| = m, then σ(x) = m indicates a match.
T:      a b a b b a b b a c …
States: 0 1 … m … m …          (each m marks a match)

Note Also: x suf y ⇒ σ(x) ≤ σ(y).
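
The suffix function can be computed directly from its definition. The Python sketch below is a naive version meant only to make the definition concrete; the name suffix_function is an assumption, and the empty Python string stands for the empty string.

```python
def suffix_function(pattern: str, x: str) -> int:
    """Return the largest k such that pattern[:k] is a suffix of x."""
    k = min(len(pattern), len(x))
    # k = 0 always succeeds, because the empty prefix is a suffix of anything.
    while not x.endswith(pattern[:k]):
        k -= 1
    return k


# The three examples above, with P = ab.
print(suffix_function("ab", ""))       # 0
print(suffix_function("ab", "ccaca"))  # 1
print(suffix_function("ab", "ccab"))   # 2
```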

FA Construction
Given: P[1..m]

Let Q = states = {0, 1, …, m}. State 0 is the initial state and state m is the final state.

Define transition function δ as follows:

δ(q, a) = σ(P_q a) for each q and a.

Example: δ(5, b) = σ(P_5 b)
                 = σ(ababab)
                 = 4

Intuition: Encountering a "b" in state 5 means the current substring
doesn't match. But, you know this substring ends with "abab" -- and
this is the longest suffix that matches the beginning of P. Thus, we
go to state 4 and continue processing "abab…".

Time Complexity
FA takes O(m|Σ|) time to construct.

(Book only gives an O(m^3 |Σ|) algorithm.)

Total time is O(n + m|Σ|).
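
For illustration, here is a Python sketch that builds the transition table straight from the definition of the transition function: for each state q and character a, find the longest prefix of P that is a suffix of P_q a. This is the slow, definition-based construction (roughly cubic in m per alphabet character), not the O(m|Σ|) method. The name build_transitions is an assumption.

```python
def build_transitions(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """Return delta, where delta[q][a] is the next state from state q on input a."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            x = pattern[:q] + a            # P_q followed by a
            k = min(m, q + 1)              # the next state is at most q + 1
            while not x.endswith(pattern[:k]):
                k -= 1                     # stops at k = 0 at the latest
            row[a] = k
        delta.append(row)
    return delta


delta = build_transitions("ababaca", "abc")
print(delta[5])  # {'a': 1, 'b': 4, 'c': 6}
```

The printed row agrees with the row for state 5 in the transition table shown earlier.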

Correctness
Lemma 32.2: σ(xa) ≤ σ(x) + 1.
Proof:

Let r = σ(xa).

Case: r = 0. Clearly σ(xa) ≤ σ(x) + 1.

Case: r > 0.
[Figure: P_r aligned with the end of xa; dropping the final a leaves P_{r-1} aligned with the end of x.]

We have:
P_r suf xa
⇒ P_{r-1} suf x
⇒ r - 1 ≤ σ(x).
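
Another Lemma
Lemma 32.3: q = σ(x) ⇒ σ(xa) = σ(P_q a).
Proof:
Let q = σ(x).
⇒ P_q suf x
⇒ P_q a suf xa

Let r = σ(xa). By Lemma 32.2, r ≤ q + 1. We have:
[Figure: P_q a and P_r both aligned with the end of xa, with P_r no longer than P_q a.]

Both P_r and P_q a are suffixes of xa and |P_r| ≤ |P_q a|, so by Lemma 32.1, P_r suf P_q a, which gives r ≤ σ(P_q a).
Also, P_q a suf xa gives σ(P_q a) ≤ σ(xa) = r.
⇒ σ(P_q a) = r.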

Main Theorem
Theorem 32.4: φ(T_i) = σ(T_i) for all i = 0, 1, …, n.
Implies: The FA is in the accepting state
if and only if
the string processed so far has a match at position (length of string) - m.

Proof:

Induction on i.

Basis: i = 0. T_0 = ε.
⇒ φ(T_0) = σ(T_0) = 0.
Step: Assume φ(T_i) = σ(T_i).

Let q = φ(T_i), a = T[i+1].

Then, φ(T_i) = σ(T_i) = q, which by
Lemma 32.3, implies
σ(T_i a) = σ(P_q a). (**)

Proof Continued
φ(T_{i+1}) = φ(T_i a)        , T_{i+1} = T_i a
           = δ(φ(T_i), a)    , φ(wa) = δ(φ(w), a)
           = δ(q, a)         , q = φ(T_i)
           = σ(P_q a)        , δ(q, a) = σ(P_q a)
           = σ(T_i a)        , by (**)
           = σ(T_{i+1})      , T_{i+1} = T_i a

Knuth-Morris-Pratt Algorithm
Achieves O(n + m) by avoiding precomputation
of δ.

Instead, we precompute π[1..m] in O(m) time.

As T is scanned, π[1..m] is used to deduce
information given by δ in FA algorithm.

Motivating Example
T:  b a c b a b a b a a b c b a b
P:          a b a b a c a            (P at shift s; the first q = 5 characters match)

Shift s is discovered to be invalid because of mismatch
of 6th character of P.

By definition of P, we also know s + 1 is an invalid shift.

However, s + 2 may be a valid shift.

Motivating Example
T:  b a c b a b a b a a b c b a b
P:              a b a b a c a        (P at shift s + 2; the first k = 3 characters are already known)

The shift s + 2. Note that the first 3 characters of T starting
at s + 2 don't have to be checked again -- we already know
what they are.

Motivating Example
P_q:  a b a b a
P_k:      a b a

The longest prefix of P that is also a proper suffix of P_5 is P_3.
We will define π[5] = 3.

In general, if q characters have matched successfully at shift
s, the next potentially valid shift is s' = s + (q - π[q]).

The Prefix Function
π is called the prefix function for P.

π: {1, 2, …, m} → {0, 1, …, m-1}

π[q] = length of the longest prefix
of P that is a proper suffix
of P_q, i.e.,

π[q] = max{k: k < q and P_k suf P_q}.
Compute-π(P)
1  m := length[P];
2  π[1] := 0;
3  k := 0;
4  for q := 2 to m do
5    while k > 0 and P[k+1] ≠ P[q] do
6      k := π[k]
     od;
7    if P[k+1] = P[q] then
8      k := k + 1
     fi;
9    π[q] := k
   od;
10 return π
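
The prefix-function procedure above translates almost line for line into Python. In this sketch (the function name is an assumption) the list is 0-indexed, so entry q - 1 of the result holds the prefix-function value for q in the slide's 1-based numbering.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi, where pi[q-1] is the length of the longest prefix of the
    pattern that is a proper suffix of its first q characters."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # length of the current border
    for q in range(1, m):                   # slide's q = 2 .. m
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                   # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))     # [0, 0, 1, 2, 3, 0, 1]
print(compute_prefix_function("abbabaabba"))  # [0, 0, 0, 1, 2, 1, 1, 2, 3, 4]
```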

Example
i      1 2 3 4 5 6 7
P[i]   a b a b a c a
π[i]   0 0 1 2 3 0 1
(Same as our FA example.)

For each q, the longest prefix of P that is a proper suffix of P_q:
P_7 = a b a b a c a      a = P_1
P_6 = a b a b a c        ε = P_0
P_5 = a b a b a          a b a = P_3
P_4 = a b a b            a b = P_2
P_3 = a b a              a = P_1
P_2 = a b                ε = P_0
P_1 = a                  ε = P_0

Another Explanation
[Figure: states 0 through 7 joined by a spine labeled a b a b a c a, with ε transitions leading back to earlier states.]
Essentially KMP is computing a FA with epsilon moves. The "spine"
of the FA is implicit and doesn't have to be computed -- it's just the
pattern P. π gives the ε transitions. There are O(m) such transitions.
Recall from Comp 455 that a FA with epsilon moves is
conceptually able to be in several states at the same time (in
parallel). That's what's happening here -- we're exploring
pieces of the pattern in parallel.

Another Example
i      1 2 3 4 5 6 7 8 9 10
P[i]   a b b a b a a b b a
π[i]   0 0 0 1 2 1 1 2 3 4

For each q, the longest prefix of P that is a proper suffix of P_q:
P_10 = a b b a b a a b b a      a b b a = P_4
P_9  = a b b a b a a b b        a b b = P_3
P_8  = a b b a b a a b          a b = P_2
P_7  = a b b a b a a            a = P_1
P_6  = a b b a b a              a = P_1
P_5  = a b b a b                a b = P_2
P_4  = a b b a                  a = P_1
P_3  = a b b                    ε = P_0
P_2  = a b                      ε = P_0
P_1  = a                        ε = P_0

Time Complexity
Amortized Analysis --

Φ_0
  + loop q = 2 (1st iteration)
Φ_1
  + loop q = 3 (2nd iteration)
Φ_2
  + loop q = 4 (3rd iteration)
  …
  + loop q = m ((m - 1)st iteration)
Φ_{m-1}

Φ = potential function = value of k

Amortized cost: ĉ_i = c_i + Φ_i - Φ_{i-1}
where i is the iteration and c_i is the actual loop cost.

Time Complexity (Continued)
Total amortized cost:

Σ_{i=1}^{m-1} ĉ_i = Σ_{i=1}^{m-1} (c_i + Φ_i - Φ_{i-1})
                  = Σ_{i=1}^{m-1} c_i + Φ_{m-1} - Φ_0

If Φ_{m-1} ≥ Φ_0, then amortized cost upper bounds real cost.

We have Φ_0 = 0 (initial value of k)
Φ_{m-1} ≥ 0 (final value of k).

We show ĉ_i = O(1).

Time Complexity (Continued)
The value of ĉ_i obviously depends on how many times statement
6 is executed.

Note that k > π[k]. Thus, each execution of statement 6 decreases
k by at least 1.

So, suppose that statements 5..6 iterate several times, decreasing
the value of k.

We have: number of iterations ≤ k_old - k_new. Thus,

ĉ_i ≤ O(1) + 2(k_old - k_new) + Φ_i - Φ_{i-1}

where the O(1) term is for statements other than 5 & 6, Φ_i = k_new, and Φ_{i-1} = k_old.
The drop in potential, k_old - k_new, pays for the iterations of statements 5..6
(measuring Φ in units of the constant cost of one such iteration).

Hence, ĉ_i = O(1). Total cost is therefore O(m).
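
The amortized bound can be observed empirically. The Python sketch below instruments the prefix-function loop and counts how often the fall-back step (statement 6) runs. Since k grows by at most 1 per outer iteration and each fall-back lowers it by at least 1, the count can never exceed m - 1. The name count_fallbacks and the sample patterns are assumptions for illustration.

```python
def count_fallbacks(pattern: str) -> int:
    """Count the executions of the fall-back step while computing the prefix function."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    fallbacks = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                  # statement 6 on the slide
            fallbacks += 1
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return fallbacks


for p in ["ababaca", "abbabaabba", "a" * 20 + "b"]:
    f = count_fallbacks(p)
    assert f <= len(p) - 1                 # never more than m - 1 in total
    print(len(p), f)
```

The last pattern is a near worst case for a single iteration: the final character triggers a long run of fall-backs, yet the total over the whole run stays below m.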


Rest of the Algorithm
KMP(T, P)
  n := length[T];
  m := length[P];
  π := Compute-π(P);
  q := 0;
  for i := 1 to n do
    while q > 0 and P[q+1] ≠ T[i] do
      q := π[q]
    od;
    if P[q+1] = T[i] then
      q := q + 1
    fi;
    if q = m then
      print "pattern occurs with shift i - m";
      q := π[q]
    fi
  od
Time complexity of loop is O(n) (similar to the analysis of Compute-π).

Total time is O(m + n).
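
Putting the pieces together, a self-contained Python sketch of the matcher might look like this. The function names are assumptions; indices are 0-based, so the reported shift is i - m + 1 for a match ending at 0-based position i, which equals the slide's i - m in 1-based terms.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[q-1] = length of the longest proper border of the first q characters."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_match(text: str, pattern: str) -> list[int]:
    """Return all valid shifts of pattern in text in O(n + m) time."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = prefix_function(pattern)
    shifts = []
    q = 0                                   # number of pattern characters matched
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]                   # next candidate, no rescanning of text
        if pattern[q] == text[i]:
            q += 1
        if q == m:                          # full match ending at position i
            shifts.append(i - m + 1)
            q = pi[q - 1]                   # keep going to find overlapping matches
    return shifts


print(kmp_match("abbabababc", "ababc"))  # [5]
print(kmp_match("aaaa", "aa"))           # [0, 1, 2]
```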

Example
i      1 2 3 4 5
P[i]   a b a b c
π[i]   0 0 1 2 0

P = a b a b c
i = 1 2 3 4 5 6 7 8 9 10
T = a b b a b a b a b c

Start of 1st loop: q = 0, i = 1 [a]
2nd loop: q = 1, i = 2 [b]
3rd loop: q = 2, i = 3 [b]      mismatch detected
4th loop: q = 0, i = 4 [a]
5th loop: q = 1, i = 5 [b]
6th loop: q = 2, i = 6 [a]
7th loop: q = 3, i = 7 [b]
8th loop: q = 4, i = 8 [a]      mismatch detected
9th loop: q = 3, i = 9 [b]
10th loop: q = 4, i = 10 [c]
Termination: q = 5              match detected

Please see the book for formal correctness proofs.
(They're very tedious.)
