You are on page 1of 14

Exact String Matching: KMP & Boyer-Moore Algorithms

R.C.T. Lee and Chin Lung Lu CS 5313 Algorithms for Molecular Biology

By R.C.T. Lee and C.L. Lu

Exact String Matching p.1

Exact string matching


Given two strings T and P , where |T | = m and |P | = n, nd all the occurrences of P in T ? (Simplied version) Given two strings T and P , if P occurs in T ? Brute-force method: O(mn) KMP algorithm (Knuth, Morris and Pratt, 1977): O(m + n) Boyer-Moore algorithm (Boyer and Moore, 1977): O(m + n)

By R.C.T. Lee and C.L. Lu

Exact String Matching p.2

Brute-force method
m T P

O(m n + 1) times Time complexity of brute-force method: O(mn)


By R.C.T. Lee and C.L. Lu Exact String Matching p.3

Brute-force method
T P a b a c a b a b a b a b a b a b a b a b a b a b a b a b Simple, but not ecient because it cannot avoid rescanning T
By R.C.T. Lee and C.L. Lu Exact String Matching p.4

How to speedup the brute-force method?


Shift P by more than one place when a mismatch occurs without missing any occurrence of P in T 1. KMP algorithm Proposed by Knuth, Morris and Pratt at 1977 Time complexity: O(m + n) 2. Boyer-Moore algorithm Proposed by Boyer and Moore at 1977 Time complexity: O(m + n)

By R.C.T. Lee and C.L. Lu

Exact String Matching p.5

KMP algorithm
Left-to-right scan Failure function shift rule

By R.C.T. Lee and C.L. Lu

Exact String Matching p.6

KMP algorithm
T P shift a b a c a b a b Left-to-Right Scan a b a b mismatch occurs a b a b a b a b a b a b shift
By R.C.T. Lee and C.L. Lu

shift

a b a b
Exact String Matching p.7

Failure function
If P = p1 p2 pn , then its failure function f is dened as follows, where 1 j n. if such an largest i < j such that p1 pi = pji+1 pj , i 1 exists, f (j) = 0, otherwise. i P 1
By R.C.T. Lee and C.L. Lu

i i (j i + 1) j
Exact String Matching p.8

Examples of failure function


Example 1: P a b a b a b c f (j) 0 0 1 2 3 4 0 Example 2: P a b a c a b a b f (j) 0 0 1 0 1 2 3 2

By R.C.T. Lee and C.L. Lu

Exact String Matching p.9

KMP algorithm
/* 1 2 3 4 5 6 7 8 9 T = T1 T2 Tm and P = p1 p2 pn */ i = 1; j = 1; while i m and j n do if Ti = pj then i = i + 1; j = j + 1; else if j = 1 then i = i + 1; j = 1; else j = f (j 1) + 1;
end if end while if j = n +

1 then a match else no match.

By R.C.T. Lee and C.L. Lu

Exact String Matching p.10

KMP algorithm
T abac f (j) 0 0 1 2 P abab 1 2 3 4 5 6 7 8

abab
(1, 1) (4, 4), j = f (3) + 1 = 2

abab abab abab abab

No this step
(4, 2) (4, 2), j = f (1) + 1 = 1 (4, 1) (4, 1), i = 4 + 1 = 5, j = 1 (5, 1) (8, 4), i = 8 + 1, j = 4 + 1

By R.C.T. Lee and C.L. Lu

Exact String Matching p.11

The correctness of KMP algorithm


Tjk Tj1 T f (j 1) P p1 pk pjk pj1
pj Tj

First mismatch (Tj = pj ) before shift This move is redundant


pj

pj

after shift

Then p1 pk = Tjk Tj1 = pjk pj1 Hence f (j 1) = k > f (j 1), a contradiction.


By R.C.T. Lee and C.L. Lu Exact String Matching p.12

How to compute failure function?


Let f 1 (j) = f (j) and f k (j) = f (f k1 (j)). Let P = p1 p2 pn . Then, f (1) = 0 and for j 2, k f (j 1) + 1 if there exists k which is the least integer such that f (j) = pf k (j1)+1 = pj , 0 otherwise.
By R.C.T. Lee and C.L. Lu Exact String Matching p.13

How to compute failure function?


If pf (j1)+1 = pj , then f (j) = f (j 1) + 1. f (j) P f (j 1) pf (j1)+1 f (j 1) f (j) P
By R.C.T. Lee and C.L. Lu

f (j) pj

f (j) pj

Exact String Matching p.14

How to compute failure function?


If pf (j1)+1 = pj and pf (f (j1))+1 = pj , then f (j) = f (f (j 1)) + 1. f (j) f (j) P f (j 1) pf (j1)+1 f (j 1) P f (f (j 1)) + 1
By R.C.T. Lee and C.L. Lu Exact String Matching p.15

pj

pj

How to compute failure function?


f (j 1) pf (j1)+1 f (j 1) P P Zoom In pj pj

f (f (j 1))
By R.C.T. Lee and C.L. Lu

f (j 1)
Exact String Matching p.16

Boyer-Moore algorithm
Right-to-left scan Bad character shift rule Good sux shift rule

By R.C.T. Lee and C.L. Lu

Exact String Matching p.17

Right-to-left scan
Check whether P occurs in T at some position in the right-to-left scaning manner THE BOYERMOORE ALGORITHM Right-to-Left Scan P BUYERMOORE

By R.C.T. Lee and C.L. Lu

Exact String Matching p.18

Bad character shift rule


What happened if the initial mismatch occurs? THE BOYERMOORE ALGORITHM T Right-to-Left Scan P shift P P
By R.C.T. Lee and C.L. Lu

BUYERMOORE movement BUYERMOORE Thisbe skipped can BUYERMOORE


Exact String Matching p.19

Bad character shift rule


Bad character rule: If x = y, then P is shifted right by max{1, i B(x)} places. B(x): the position of right-most occurrence of x in P (B(x) = 0 if x does not occur in P ) T P before shift P after shift
By R.C.T. Lee and C.L. Lu

x x
B(x)

t t y t
Exact String Matching p.20

y
i

B ) (x

Bad character shift rule


x x
B(x)

T P before shift B(x) < i


i
By R.C.T. Lee and C.L. Lu

t
Right-to-Left Scan

y
i

t P after shift y t x P after shift y t


Exact String Matching p.21

B(x) = 0

Bad character shift rule


If B(x) > i, then bad character rule has no eect.

B ) (x

T P before shift y

x y
i

t x
Right-to-Left Scan

t x

B(x)

t x

P after shift ()

By R.C.T. Lee and C.L. Lu

Exact String Matching p.22

Bad character shift rule


How to compute B(x) for all x in P ? (n = |P |) for i = 1 to n do B(x) = 0; end for for i = n to 1 do if B(P [i]) = 0 then B(P [i]) = i; end for Extended bad character shift rule: For each position i in P and for each x in , nd the position of the closest occurrence of x in P to the left of i.
By R.C.T. Lee and C.L. Lu Exact String Matching p.23

Good sux rule: Case


If x = y, then nd the right-most copy t of t in P such that t is not a sux of P and z = y T P before shift z t x y t
Right-to-Left Scan

t y t

P after shift
By R.C.T. Lee and C.L. Lu

Exact String Matching p.24

Good sux rule: Case


If t does not exist, then nd the largest prex t of P such that it is equal to a sux of t T P before shift t xt Right-to-Left Scan y t t y t

P after shift

By R.C.T. Lee and C.L. Lu

Exact String Matching p.25

Good sux rule: Case


If t and t do not exist, then T P before shift P after shift xt Right-to-Left Scan y t y t

By R.C.T. Lee and C.L. Lu

Exact String Matching p.26

Good sux rule


T z t xt Right-to-Left Scan y z t t t P before shift y t y t y t
Exact String Matching p.27

P after shift for case

P after shift for case

P after shift for case


By R.C.T. Lee and C.L. Lu

You might also like