
Boyer Moore Algorithm

Idan Szpektor
Boyer and Moore
What It's About
A String Matching Algorithm

Preprocess a pattern P (|P| = n)

For a text T (|T| = m), find all of the occurrences of P in T

Time complexity: O(n + m), but usually sub-linear
Right to Left (like in Hebrew)
Matching the pattern from right to left

For a pattern abc:



T: bbacdcbaabcddcdaddaaabcbcb
P: abc

Worst case is still O(n m)


The Bad Character Rule (BCR)

On a mismatch between the pattern and the text, we can shift the pattern by more than one place.
Sublinearity!
ddbbacdcbaabcddcdaddaaabcbcb
acabc

BCR Preprocessing

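A table giving, for each position in the pattern and each character, the size of the shift. O(n |Σ|) space. O(1) access time.

For the pattern a b a c b (each entry is the position of the rightmost occurrence of the character at or before that pattern position; "-" means there is none):

   1 2 3 4 5
   a b a c b
a  1 1 3 3 3
b  - 2 2 2 5
c  - - - 4 4

A list of positions for each character. O(n + |Σ|) space. O(n) access time, but in total O(m).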
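The table variant can be sketched in Python as below. This is a minimal sketch, not the slides' own code: the name `bad_char_table` is hypothetical, and it assumes each entry stores the position of the rightmost occurrence of a character at or before a pattern position (0 when there is none), from which a shift can be derived.

```python
def bad_char_table(pattern):
    """table[c][i] = 1-based position of the rightmost occurrence of c in
    pattern[1..i], or 0 if c does not occur there (index 0 is padding)."""
    n = len(pattern)
    table = {c: [0] * (n + 1) for c in set(pattern)}
    for i in range(1, n + 1):
        for c in table:
            table[c][i] = table[c][i - 1]   # carry the previous column over
        table[pattern[i - 1]][i] = i        # the character at i occurs at i
    return table


table = bad_char_table("abacb")
for c in sorted(table):
    print(c, table[c][1:])
# a [1, 1, 3, 3, 3]
# b [0, 2, 2, 2, 5]
# c [0, 0, 0, 4, 4]
```

The dictionary only holds characters that occur in the pattern, so any other text character can be treated as "position 0", which keeps the space at O(n) rows-times-columns for the characters actually used.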
BCR - Summary

On a mismatch, shift the pattern to the right until the mismatched text char is aligned with its first occurrence in P to the left of the mismatch.

Still O(n m) worst case running time:

T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: abaaaa
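A search that uses only the bad character rule might look like the sketch below. The function name `bcr_search` and the comparison counter are additions for illustration; the sketch uses the list-of-positions variant and shifts by one place after a full match, since the rule itself says nothing about that case.

```python
def bcr_search(text, pattern):
    """Return (0-based match starts, number of character comparisons)."""
    n, m = len(pattern), len(text)
    # positions[c] = 0-based positions of c in the pattern, rightmost first
    positions = {}
    for idx in range(n - 1, -1, -1):
        positions.setdefault(pattern[idx], []).append(idx)
    matches, comparisons = [], 0
    k = n - 1  # index in text aligned with the right end of the pattern
    while k < m:
        i, h = n - 1, k
        while i >= 0:
            comparisons += 1
            if pattern[i] != text[h]:
                break
            i -= 1
            h -= 1
        if i < 0:
            matches.append(k - n + 1)
            k += 1
        else:
            # nearest occurrence of the text char strictly left of position i
            left = next((p for p in positions.get(text[h], []) if p < i), -1)
            k += i - left
    return matches, comparisons


print(bcr_search("a" * 25, "abaaaa"))  # ([], 100)
```

On the worst-case input above every alignment compares five characters and then shifts by a single place, which is the O(n m) behaviour.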
The Good Suffix Rule (GSR)

We want to use the knowledge of the matched characters in the pattern's suffix.

If we matched S characters in T, what is the smallest shift of P (if one exists) that will align a sub-string of P made of the same S characters?
GSR (Cont)
Example 1 - how much to move:

T: bbacdcbaabcddcdaddaaabcbcb
P: cabbabdbab
cabbabdbab
GSR (Cont)
Example 2 - what if there is no alignment:

T: bbacdcbaabcbbabdbabcaabcbcb
P: bcbbabdbabc
bcbbabdbabc
GSR - Detailed
We mark the matched sub-string in T with t and the mismatched char with x

1. In case of a mismatch: shift right until the first occurrence of t in P such that the next char y in P (the one to the left of that occurrence) holds y ≠ x

2. Otherwise, shift right to the largest prefix of P that aligns with a suffix of t.
Boyer Moore Algorithm
Preprocess(P)
k := n
while (k ≤ m) do
    Match P and T from right to left starting at k

    If a mismatch occurs: shift P right (advance k) by max(good suffix rule, bad char rule).

    else, print the occurrence and shift P right (advance k) by the good suffix rule.
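The loop above can be made concrete as in the following sketch. It is one possible reading rather than the authors' implementation: the helper names (`z_array`, `good_suffix_tables`, `boyer_moore`) are hypothetical, the bad character rule is the simple "rightmost occurrence in the whole pattern" variant, and the good suffix tables L and l are built from Z values computed over the reversed pattern.

```python
def z_array(s):
    """z[i] = length of the longest substring of s starting at 0-based index i
    that matches a prefix of s (z[0] is set to len(s))."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # s[left:right] is the prefix match reaching furthest right
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def good_suffix_tables(p):
    """Return 1-based tables L and l (indices 0 and n + 1 are padding)."""
    n = len(p)
    zr = z_array(p[::-1])
    N = [0] + [zr[n - j] for j in range(1, n + 1)]  # N[j] for j = 1..n
    L = [0] * (n + 2)
    l = [0] * (n + 2)
    k = 0
    for j in range(1, n):
        if N[j] > 0:
            L[n - N[j] + 1] = j
        if N[j] == j:
            k = j
        l[n - j + 1] = k
    return L, l


def boyer_moore(text, pattern):
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    L, l = good_suffix_tables(pattern)
    rightmost = {c: pos for pos, c in enumerate(pattern, start=1)}
    matches = []
    k = n  # 1-based position in text aligned with the right end of pattern
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            matches.append(k - n)
            k += n - l[2]
        else:
            bad_char = max(1, i - rightmost.get(text[h - 1], 0))
            if i == n:
                good_suffix = 1
            elif L[i + 1] > 0:
                good_suffix = n - L[i + 1]
            else:
                good_suffix = n - l[i + 1]
            k += max(bad_char, good_suffix)
    return matches


print(boyer_moore("bbacdcbaabcddcdaddaaabcbcb", "abc"))  # [8, 20]
```

Positions inside the loop are kept 1-based so that the shift formulas read the same way as in the slides; only the returned match positions are 0-based.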
Algorithm Correctness

The bad character rule shift never misses a match

The good suffix rule shift never misses a match
Preprocessing the GSR - L(i)

L(i): the biggest index j, such that j < n and prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n]

i: 1 2 3 4 5 6 7 8 9 10 11 12 13
P: b b a b b a a b b  c  a  b  b
L: 0 0 0 0 0 0 0 0 0  0  9  2 12
Preprocessing the GSR - l(i)

l(i): the length of the longest suffix of P[i..n] that is also a prefix of P

i: 1 2 3 4 5 6 7 8 9 10 11 12 13
P: b b a b b a a b b  c  a  b  b
l: - 2 2 2 2 2 2 2 2  2  2  2  1

(l(1) is not listed.)
Using L(i) and l(i) in GSR

If a mismatch occurs at position n, shift P by 1

If a mismatch occurs at position i-1 in P:
    If L(i) > 0, shift P by n - L(i)
    else shift P by n - l(i)

If P was found, shift P by n - l(2)


Building L(i) and l(i) - the Z
For a string s, Z(i) is the length of the longest sub-string of s starting at i that matches a prefix of s.

i: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
s: b b a c d c b b a  a  b  b  c  d  d
Z: - 1 0 0 0 0 3 1 0  0  2  1  0  0  0

(Z(1) is not listed.)

Naively, we can build Z in O(n^2)
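The naive construction is short enough to write out directly. In this sketch the function name `z_naive` is an assumption, and the returned list holds Z(2)..Z(|s|) in the 1-based numbering used here.

```python
def z_naive(s):
    """Z values for 1-based positions 2..len(s), by direct comparison."""
    n = len(s)
    z = []
    for i in range(1, n):  # 0-based start i corresponds to position i + 1
        length = 0
        while i + length < n and s[length] == s[i + length]:
            length += 1
        z.append(length)
    return z


print(z_naive("bbacdcbbaabbcdd"))
# [1, 0, 0, 0, 0, 3, 1, 0, 0, 2, 1, 0, 0, 0]
```

Each position may rescan up to n characters, which is where the O(n^2) bound comes from.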


From Z to N

N(i) is the length of the longest suffix of P[1..i] that is also a suffix of P.
N(i) is Z(n - i + 1), where Z is built over P reversed.

i: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
s: d d c b b a a b b  c  d  c  a  b  b
N: 0 0 0 1 2 0 0 1 3  0  0  0  0  1  -

(N(n) is not listed.)
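The index mapping between N and the Z values of the reversed string is easy to get wrong by one, so here is a small sketch of it. The names `z_values` and `n_values` are hypothetical, and the Z values are computed naively to keep the example short.

```python
def z_values(s):
    """z[i] (0-based) = longest match between s[i:] and a prefix of s."""
    n = len(s)
    z = []
    for i in range(n):
        length = 0
        while i + length < n and s[length] == s[i + length]:
            length += 1
        z.append(length)
    return z


def n_values(p):
    """Return [N(1), ..., N(n)]: N(j) = Z(n - j + 1) over the reversed string."""
    n = len(p)
    zr = z_values(p[::-1])
    # 1-based Z(n - j + 1) lives at 0-based index n - j
    return [zr[n - j] for j in range(1, n + 1)]


print(n_values("ddcbbaabbcdcabb"))
# [0, 0, 0, 1, 2, 0, 0, 1, 3, 0, 0, 0, 0, 1, 15]
```

The last entry, N(n), is always n because the whole string is trivially a suffix of itself.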
Building L(i) in O(n)
L(i): the biggest index j < n, such that prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n]

Equivalently, L(i) is the biggest index j < n such that:
N(j) == |P[i..n]| == n - i + 1

for i := 1 to n, L(i) := 0
for j := 1 to n-1
    if N(j) > 0
        i := n - N(j) + 1
        L(i) := j
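A runnable version of this loop is sketched below. To stay self-contained it computes N by direct comparison, which is quadratic; a linear build would take N from Z values instead. The names `n_values` and `build_L` are assumptions for the example.

```python
def n_values(p):
    """N[j] for 1-based j: length of the longest suffix of p[1..j] that is
    also a suffix of p (computed naively; N[0] is padding)."""
    n = len(p)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        while length < j and p[j - 1 - length] == p[n - 1 - length]:
            length += 1
        N[j] = length
    return N


def build_L(p):
    """Return [L(1), ..., L(n)]."""
    n = len(p)
    N = n_values(p)
    L = [0] * (n + 1)
    for j in range(1, n):
        if N[j] > 0:              # N(j) == 0 would point past the end of P
            L[n - N[j] + 1] = j   # later (larger) j overwrites smaller ones
    return L[1:]


print(build_L("bbabbaabbcabb"))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 2, 12]
```

Because j only increases, the last write to each L(i) is automatically the biggest qualifying index.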
Building l(i) in O(n)
l(i): the length of the longest suffix of P[i..n] that is also a prefix of P

Equivalently, l(i) is the biggest j <= |P[i..n]| == n - i + 1 such that N(j) == j

k := 0
for j := 1 to n-1
    If (N(j) == j), k := j
    l(n - j + 1) := k
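The same loop in Python, as a sketch under the same assumptions (hypothetical names `n_values` and `build_l`, N computed naively for brevity). The loop fills l(2)..l(n) only, so that is what the function returns.

```python
def n_values(p):
    """N[j] for 1-based j: length of the longest suffix of p[1..j] that is
    also a suffix of p (computed naively; N[0] is padding)."""
    n = len(p)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        while length < j and p[j - 1 - length] == p[n - 1 - length]:
            length += 1
        N[j] = length
    return N


def build_l(p):
    """Return [l(2), ..., l(n)]."""
    n = len(p)
    N = n_values(p)
    l = [0] * (n + 2)
    k = 0
    for j in range(1, n):
        if N[j] == j:      # prefix P[1..j] is also a suffix of P
            k = j
        l[n - j + 1] = k   # best prefix length that fits inside P[n-j+1..n]
    return l[2:n + 1]


print(build_l("bbabbaabbcabb"))
# [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]
```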
Building Z in O(n)

For calculating Z(i), we want to use the previously calculated Z(1)..Z(i-1)

For each i we remember the right-most Z(j): the j such that j < i and j + Z(j) >= k + Z(k), for all k < i
Building Z in O(n) (Cont)


[Figure: the string s with positions i', j and i marked]

If i < j + Z(j), then s[i..j + Z(j) - 1] appeared previously, starting at i' = i - j + 1.
Is Z(i') < Z(j) - (i - j)?
Building Z in O(n) (Cont)
For Z(2) calculate explicitly
j := 2, i := 3
While i <= |s|:
    if i >= j + Z(j), calculate Z(i) explicitly
    else
        i' := i - j + 1
        Z(i) := min(Z(i'), Z(j) - (i - j))
        If Z(i') >= Z(j) - (i - j), calculate the tail of Z(i) explicitly
    If j + Z(j) < i + Z(i), j := i
    i := i + 1
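The linear-time construction can be sketched in Python as follows. The function name `z_linear` and the 0-based indexing are choices made for this example; the variable `j` plays the same role as in the pseudocode (the start of the previously computed match that reaches furthest right).

```python
def z_linear(s):
    """Return Z values for 1-based positions 2..len(s) in O(len(s)) time."""
    n = len(s)
    z = [0] * n  # 0-based; z[0] is unused and stays 0
    j = 0        # start of the match that reaches furthest to the right
    for i in range(1, n):
        box_end = j + z[j]  # first index not covered by that match
        if i < box_end:
            # s[i:box_end] also appears at s[i - j:], so reuse that Z value,
            # capped at what is known to lie inside the covered region
            z[i] = min(z[i - j], box_end - i)
        # extend explicitly; this only ever touches characters at or beyond
        # box_end, or stops after a single failed comparison
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > box_end:
            j = i
    return z[1:]


print(z_linear("bbacdcbbaabbcdd"))
# [1, 0, 0, 0, 0, 3, 1, 0, 0, 2, 1, 0, 0, 0]
```

Folding the "reuse" and "extend" cases into one `min` plus one loop avoids a separate branch for the case where the reused value is strictly smaller than the remaining covered length; in that case the loop exits on its first comparison.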
Building Z in O(n) - Analysis

The algorithm builds Z correctly

The algorithm executes in O(n):
    A new character is matched only once
    All other operations are in O(1)
Boyer Moore Worst Case Analysis
Assume P consists of n copies of a single
char and T consists of m copies of the same
char:
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaaaa

The Boyer Moore algorithm runs in Θ(m n) when finding all the matches
The Galil Rule
In a specific matching phase, we mark with k the position in T of the right end of P. We mark with s the position of the last matched char in this phase.
[Figure labels above T: s, k, k']
T: bbacdcbaabcddcdaddaaabcbcb
P: abaab
abaab
The Galil Rule (Cont)
All the chars in positions s < j ≤ k are known to be matching. The algorithm doesn't need to check them.

An extended Boyer Moore algorithm with the Galil rule runs in O(m + n) worst case (even without the bad-character rule).
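To see the effect of the Galil rule on the number of comparisons, the sketch below runs a good-suffix-only search with the rule switched on or off. It is an illustration under assumptions, not the slides' code: the names `z_array`, `good_suffix_tables`, `bm_good_suffix` and the `galil` flag are hypothetical, and the sketch conservatively forgets the known region whenever the shifted left end of P does not lie to the right of the mismatch position.

```python
def z_array(s):
    """z[i] = longest match between s[i:] and a prefix of s (z[0] = len(s))."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def good_suffix_tables(p):
    """Return 1-based tables L and l (indices 0 and n + 1 are padding)."""
    n = len(p)
    zr = z_array(p[::-1])
    N = [0] + [zr[n - j] for j in range(1, n + 1)]
    L, l, k = [0] * (n + 2), [0] * (n + 2), 0
    for j in range(1, n):
        if N[j] > 0:
            L[n - N[j] + 1] = j
        if N[j] == j:
            k = j
        l[n - j + 1] = k
    return L, l


def bm_good_suffix(text, pattern, galil=True):
    """Good-suffix-only search. Returns (0-based match starts, comparisons)."""
    n, m = len(pattern), len(text)
    if n == 0:
        return [], 0
    L, l = good_suffix_tables(pattern)
    matches, comparisons = [], 0
    k = n      # 1-based position in text of the right end of pattern
    known = 0  # text positions from P's left end up to `known` already match
    while k <= m:
        i, h = n, k
        mismatch = False
        while i > 0 and h > known:
            comparisons += 1
            if pattern[i - 1] != text[h - 1]:
                mismatch = True
                break
            i -= 1
            h -= 1
        if not mismatch:  # compared down to the known region: an occurrence
            matches.append(k - n)
            shift, new_known = n - l[2], k
        else:
            if i == n:
                shift = 1
            elif L[i + 1] > 0:
                shift = n - L[i + 1]
            else:
                shift = n - l[i + 1]
            # keep the region only if P's new left end is right of the mismatch
            new_known = k if k - n + 1 + shift > h else 0
        known = new_known if galil else 0
        k += shift
    return matches, comparisons


print(bm_good_suffix("a" * 25, "a" * 6, galil=False)[1])  # 120
print(bm_good_suffix("a" * 25, "a" * 6, galil=True)[1])   # 25
```

On the all-`a` input every alignment is a match: without the rule each one costs n comparisons, with it only the newly uncovered characters are compared.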
Don't Sleep Yet
O(n + m) proof - Outline
Preprocess in O(n) - already proved

1. Properties of strings
2. Proof of search in O(m) if P is not in T, using
only the good suffix rule.
3. Proof of search in O(m) even if P is in T,
adding the Galil rule.
Properties of Strings
If for two strings δ, γ: δγ = γδ, then there is a string ρ such that δ = ρ^i and γ = ρ^j, i, j > 0
- Proof by induction
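The property about two strings that commute under concatenation can be checked on small inputs. The sketch below uses the known fact that, when such a common root exists, the prefix whose length is the gcd of the two lengths is one; the function name `common_root` is hypothetical.

```python
from math import gcd


def common_root(first, second):
    """If first + second == second + first (both non-empty), return
    (root, i, j) with first == root * i and second == root * j; else None."""
    if not first or not second or first + second != second + first:
        return None
    root = first[:gcd(len(first), len(second))]
    return root, len(first) // len(root), len(second) // len(root)


print(common_root("abab", "ab"))  # ('ab', 2, 1)
print(common_root("aaa", "aa"))   # ('a', 3, 2)
print(common_root("ab", "ba"))    # None
```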

Definition: A string s is semiperiodic with period β if s consists of a non-empty suffix of β (possibly the entire β) followed by one or more complete copies of β.


Properties of Strings (Cont)

A string is prefix semiperiodic with period β if it consists of one or more complete copies of β followed by a non-empty prefix of β.

A string is prefix semiperiodic iff it is semiperiodic with the same length period
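The two definitions can be tested side by side. In this sketch the function names are hypothetical and the period is passed by its length: the semiperiodic check takes the period to be the last `period_len` characters, the prefix semiperiodic check takes it to be the first `period_len` characters.

```python
def is_semiperiodic(s, period_len):
    """True if s is a non-empty suffix of the period followed by one or more
    full copies of it, the period being the last period_len chars of s."""
    if period_len <= 0 or len(s) <= period_len:
        return False
    return all(s[i] == s[i + period_len] for i in range(len(s) - period_len))


def is_prefix_semiperiodic(s, period_len):
    """True if s is one or more full copies of the period followed by a
    non-empty prefix of it, the period being the first period_len chars."""
    if period_len <= 0 or len(s) <= period_len:
        return False
    period = s[:period_len]
    copies, rest = divmod(len(s), period_len)
    if rest == 0:  # the trailing "prefix" is then the entire period
        copies, rest = copies - 1, period_len
    return s == period * copies + period[:rest]


for s, p in [("bcabcabc", 3), ("abcab", 3), ("abcd", 2), ("aaaa", 1)]:
    print(s, p, is_semiperiodic(s, p), is_prefix_semiperiodic(s, p))
```

For every input the two functions return the same answer, which is the "iff" stated above: "bcabcabc" is "bc" followed by two copies of "abc", and also two copies of "bca" followed by "bc".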
Lemma 1
Suppose P occurs in T starting at position p and also at position q, q > p. If q - p ≤ n/2 then P is semiperiodic with period β = P[n-(q-p)+1..n]


Proof - when P is Not Found in T

We have R rounds during the search.

After each round the good suffix rule decides on a right shift of si chars.

Σ si ≤ m

We shall use Σ si as an upper bound.


Proof (Cont)
For each round we count the matched chars by:
fi - the number of chars matched for the first time
gi - the number of chars already matched in previous rounds.

Σ fi ≤ m
We want to prove that gi ≤ 3si (so Σ gi ≤ 3m).
Proof (Cont)
Each round doesn't find P: it matches a substring ti and hits one bad char xi in T (xi ti is a substring of T)

T: bbacdcbaabcbbabdbabcaabcbcb
P: bdbabc

|ti| + 1 ≤ 3si ⇒ gi ≤ 3si (because gi + fi = |ti| + 1)

For the rest of the proof we assume that for the specific round i: |ti| + 1 > 3si
Lemma 2 (|ti| + 1 > 3si)
In round i we look at the matched suffix of P, marked P*. P* = yi ti, yi ≠ xi.

Both P* and ti are semiperiodic with period α of length si and hence with minimal length period β, α = β^k.

Proof: by Lemma 1.
Lemma 3 (|ti| + 1 > 3si)
Suppose P overlapped ti during round i. We shall examine in what ways P could overlap ti in previous rounds.

In any round h < i, the right end of P could not have been aligned with the right end of any full copy of β in ti.
- proof:
Both round h and i fail at char xi
two cases of possible shift after round h are invalid
Lemma 4 (|ti| + 1 > 3si)
In round h < i, P can correctly match at most |β|-1 chars in ti.

By Lemma 3, P is not aligned with the right end of a copy of β in ti in phase h.
Thus if it matched |β| chars or more there is a suffix γ of β followed by a prefix δ of β such that γδ = δγ.
By the string properties there is a substring ρ such that β = ρ^k, k > 1.
This contradicts the minimal period size property of β.
Lemma 5 (|ti| + 1 > 3si)

If in round h < i the right end of P is aligned with a char in ti, it can only be aligned with one of the following:
One of the left-most |β|-1 chars of ti
One of the right-most |β| chars of ti
- proof:
If not, by Lemmas 3 and 4, at most |β|-1 chars are matched and only from the middle of a copy of β, while there are at least |β|
A shift cannot pass the right end of that copy
Proof (Cont)

If |ti| + 1 > 3si then gi ≤ 3si

Using Lemma 5, in previous rounds we could match only the bad char xi, the last |β|-1 chars in ti or start from the first |β| right chars in ti.
In the last case, using Lemma 4, we can only match up to |β|-1 chars
in total we could previously match:
gi ≤ 1 + |β|-1 + (|β| + |β|-1) ≤ 3|β| ≤ 3si
Proof - Final

Number of matches = Σ(fi + gi) = Σ fi + Σ gi ≤ m + Σ 3si ≤ m + 3m = 4m
Proof - when P is Found in T

Split the rounds into two groups:
match rounds - an occurrence of P in T was found.
mismatch rounds - P was not found in T.

We have proved O(m) for mismatch rounds.


Proof (Cont)
After P was found in T, P will be shifted by a constant length s (s = n - l(2)).

|n| + 1 ≤ 3s ⇒ matches in round i ≤ 3s, and the shifts sum to at most m

For the rest of the proof we assume that:
|n| + 1 > 3s
Proof (|n| + 1 > 3s)
By Lemma 1, P is semiperiodic with minimal length period β, |β| = s.

If round i+1 is also a match round then, by the Galil rule, only the new |β| chars are compared.

A contiguous series of match rounds, i..i+k, is called a run.
Proof (|n| + 1 > 3s)

Σ (the length of a run, not including chars that were already matched in previous runs) ≤ m

How many chars in a run were already matched in previous runs?
Lemma (|n| + 1 > 3s)
Suppose k-1 was a match round and k is a mismatch round that ends the run.
If k' > k is the first match round after k, then it overlaps at most |β|-1 chars with the previous run (ended by round k-1).

The left end of P at round k' cannot be aligned with the left end of a full copy of β at round k-1.
As a result, P cannot overlap |β| chars or more with round k-1.
Proof (|n| + 1 > 3s)
By the Lemma and because the shift after every match round is of |β|, only the first round of a run can overlap, and only with the last previous run.

Σ (the length of the chars that were already matched in previous runs) ≤ m
Proof (|n| + 1 > 3s) - Final
Σ (the length of a run) =
Σ (the length of a run, not including chars that were already matched in previous runs) +
Σ (the length of the chars that were already matched in previous runs)
≤ m + m
