Boyer Moore Algorithm: Idan Szpektor

Boyer Moore Algorithm
Idan Szpektor
Boyer and Moore
What It's About
A String Matching Algorithm
Preprocess a Pattern P (|P| = n)
For a text T (|T| = m), find all of the occurrences of P in T
Time complexity: O(n + m), but usually sub-linear
Right to Left (like in Hebrew)
Matching the pattern from right to left
For a pattern abc:

T: bbacdcbaabcddcdaddaaabcbcb
P: abc
Worst case is still O(nm)

The Bad Character Rule (BCR)
On a mismatch between the pattern and the text, we can shift the pattern by more than one place.
Sublinearity!
T: ddbbacdcbaabcddcdaddaaabcbcb
P: acabc

BCR Preprocessing
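A table, for each position in the pattern and a character, the size of the shift. O(n|Σ|) space. O(1) access time.
For P = abacb, each entry below is the position of the rightmost occurrence of the character at or before that pattern position (- means none):
     1 2 3 4 5
P:   a b a c b
a:   1 1 3 3 3
b:   - 2 2 2 5
c:   - - - 4 4
A list of positions for each character. O(n + |Σ|) space. O(n) access time, but in total O(m).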
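The list-of-positions variant can be sketched as follows. This is only an illustration under assumptions: the names build_position_lists and bad_char_shift are invented here, positions are 1-based as on the slides, and the shift when the character does not occur to the left of the mismatch is taken to be the mismatch position itself (the pattern moves entirely past the text character).

```python
from collections import defaultdict

def build_position_lists(pattern):
    """Map each character to the 1-based positions where it occurs in the
    pattern, rightmost first. Uses O(n + |alphabet|) space."""
    positions = defaultdict(list)
    for i in range(len(pattern), 0, -1):
        positions[pattern[i - 1]].append(i)
    return positions

def bad_char_shift(positions, i, x):
    """Shift after a mismatch at pattern position i (1-based) against text char x.

    Scans the list for x until it finds an occurrence left of i. If there is
    none, the pattern is moved entirely past x (shift i)."""
    for pos in positions.get(x, []):
        if pos < i:
            return i - pos
    return i

positions = build_position_lists("abacb")
print(bad_char_shift(positions, 5, "a"))  # rightmost 'a' left of 5 is at 3: shift 2
print(bad_char_shift(positions, 4, "b"))  # rightmost 'b' left of 4 is at 2: shift 2
print(bad_char_shift(positions, 3, "c"))  # no 'c' left of 3: shift 3
```

The scan inside bad_char_shift is where the O(n) access time comes from; the full table trades that scan for O(n|Σ|) space.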
BCR - Summary
On a mismatch, shift the pattern to the right until the mismatched text char is aligned with its nearest occurrence in P to the left of the mismatch.
Still O(nm) worst case running time:
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: abaaaa
The Good Suffix Rule (GSR)
We want to use the knowledge of the matched characters in the pattern's suffix.
If we matched S characters in T, what is (if it exists) the smallest shift in P that will align a sub-string of P of the same S characters?
GSR (Cont)
Example 1 - how much to move:

P: cabbabdbab
cabbabdbab
GSR (Cont)
Example 2 - what if there is no alignment:

T: bbacdcbaabcbbabdbabcaabcbcb
P: bcbbabdbabc
bcbbabdbabc
GSR - Detailed
We mark the matched sub-string in T with t and the mismatched char with x
1. In case of a mismatch: shift right until the first occurrence of t in P such that the next char y (to its left) in P holds y ≠ x
2. Otherwise, shift right to the largest prefix of P that aligns with a suffix of t.
Boyer Moore Algorithm
Preprocess(P)
k := n
while (k ≤ m) do
    Match P and T from right to left starting at k
    If a mismatch occurs: shift P right (advance k) by max(good suffix rule, bad char rule).
    else, print the occurrence and shift P right (advance k) by the good suffix rule.
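The loop above can be sketched end to end in Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function names (z_values, preprocess, boyer_moore) are invented here, the tables L and l follow the definitions given later in these slides, positions inside the code are 1-based to match the slides, and the returned match positions are 0-based start indices.

```python
from collections import defaultdict

def z_values(s):
    """z[i] = length of the longest substring of s starting at 0-based i
    that matches a prefix of s (z[0] is left at 0 and never used)."""
    n = len(s)
    z = [0] * n
    left = right = 0  # s[left:right] is the prefix match reaching furthest right
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

def preprocess(p):
    """Return (L, l, positions), all indexed 1-based as on the slides."""
    n = len(p)
    zr = z_values(p[::-1])
    N = [0] + [zr[n - j] for j in range(1, n + 1)]  # N[j] for j = 1..n
    L = [0] * (n + 2)
    l = [0] * (n + 2)
    k = 0
    for j in range(1, n):
        if N[j] > 0:
            L[n - N[j] + 1] = j
        if N[j] == j:
            k = j
        l[n - j + 1] = k
    positions = defaultdict(list)  # char -> positions in P, rightmost first
    for i in range(n, 0, -1):
        positions[p[i - 1]].append(i)
    return L, l, positions

def boyer_moore(p, t):
    """Return the 0-based start index of every occurrence of p in t."""
    n, m = len(p), len(t)
    if n == 0 or n > m:
        return []
    L, l, positions = preprocess(p)
    matches = []
    k = n  # 1-based position in T aligned with the right end of P
    while k <= m:
        i, h = n, k
        while i > 0 and p[i - 1] == t[h - 1]:
            i, h = i - 1, h - 1
        if i == 0:
            matches.append(k - n)
            k += n - l[2]
            continue
        # Good suffix rule; the mismatch is at pattern position i.
        if i == n:
            good = 1
        elif L[i + 1] > 0:
            good = n - L[i + 1]
        else:
            good = n - l[i + 1]
        # Bad character rule: nearest occurrence of the text char left of i.
        bad = next((i - pos for pos in positions.get(t[h - 1], []) if pos < i), i)
        k += max(good, bad)
    return matches

print(boyer_moore("abc", "bbacdcbaabcddcdaddaaabcbcb"))  # [8, 20]
```

The text and pattern in the usage line are the ones from the right-to-left slide earlier in the deck.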
Algorithm Correctness
The bad character rule shift never misses a match
The good suffix rule shift never misses a match
Preprocessing the GSR - L(i)
L(i): the biggest index j, such that j < n and prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n]
    1 2 3 4 5 6 7 8 9 10 11 12 13
P:  b b a b b a a b b c  a  b  b
L:  0 0 0 0 0 0 0 0 0 0  9  2  12
Preprocessing the GSR - l(i)
l(i): the length of the longest suffix of P[i..n] that is also a prefix of P
    1 2 3 4 5 6 7 8 9 10 11 12 13
P:  b b a b b a a b b c  a  b  b
l:  - 2 2 2 2 2 2 2 2 2  2  2  1
(l is listed for i = 2..13)
Using L(i) and l(i) in GSR
If a mismatch occurs at position n, shift P by 1
If a mismatch occurs at position i-1 in P:
    If L(i) > 0, shift P by n - L(i)
    else shift P by n - l(i)
If P was found, shift P by n - l(2)

Building L(i) and l(i) - the Z
For a string s, Z(i) is the length of the longest sub-string of s starting at i that matches a prefix of s.
s:  b b a c d c b b a a b b c d d
Z:    1 0 0 0 0 3 1 0 0 2 1 0 0 0
(Z is listed for i = 2..15)
Naively, we can build Z in O(n^2)
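A direct reading of the definition gives the quadratic construction. The sketch below is only an illustration; the name z_naive and the choice to return a dictionary keyed by the 1-based position i are assumptions made here, not part of the slides.

```python
def z_naive(s):
    """Return {i: Z(i)} for i = 2..len(s), with 1-based positions as on the slides.

    Each Z(i) is found by comparing s[i..] against the prefix of s character
    by character, so the total work is O(n^2) in the worst case."""
    n = len(s)
    z = {}
    for i in range(2, n + 1):
        length = 0
        # s[length] is the next prefix char, s[i - 1 + length] the next char
        # of the substring that starts at 1-based position i.
        while i + length <= n and s[length] == s[i - 1 + length]:
            length += 1
        z[i] = length
    return z

z = z_naive("bbacdcbbaabbcdd")
print([z[i] for i in range(2, 16)])
# [1, 0, 0, 0, 0, 3, 1, 0, 0, 2, 1, 0, 0, 0]
```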

From Z to N
N(i) is the length of the longest suffix of P[1..i] that is also a suffix of P.
N(i) is Z(n - i + 1), built over P reversed.
s:  d d c b b a a b b c d c a b b
N:  0 0 0 1 2 0 0 1 3 0 0 0 0 1
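The relation between N and Z can be checked with a short sketch. The helper names z_at and n_values are assumptions made for this illustration, and Z is computed naively here for brevity rather than in linear time.

```python
def z_at(s, start):
    """Length of the longest substring of s starting at 0-based index start
    that matches a prefix of s (computed naively)."""
    length = 0
    while start + length < len(s) and s[length] == s[start + length]:
        length += 1
    return length

def n_values(p):
    """Return {j: N(j)} for j = 1..n, with 1-based positions as on the slides.

    Position j of p corresponds to 0-based index n - j of the reversed
    string, so N(j) is the Z value of the reversed string at that index."""
    n = len(p)
    reversed_p = p[::-1]
    return {j: z_at(reversed_p, n - j) for j in range(1, n + 1)}

N = n_values("ddcbbaabbcdcabb")
print([N[j] for j in range(1, 15)])
# [0, 0, 0, 1, 2, 0, 0, 1, 3, 0, 0, 0, 0, 1]
```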
Building L(i) in O(n)
L(i): the biggest index j < n, such that prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n]
L(i): the biggest index j < n such that:
N(j) == | P[i..n] | == n - i + 1
for i := 1 to n, L(i) := 0
for j := 1 to n-1
    if N(j) > 0
        i := n - N(j) + 1
        L(i) := j
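The loop can be written out as below. This is a sketch with some assumptions: the names n_values and build_L are invented for the illustration, N is computed by brute force so the block stays self-contained, and entries with N(j) = 0 are skipped because they would map to position n + 1, which is outside the table. The pattern cabdabdab is just an example chosen here.

```python
def n_values(p):
    """N[j] (1-based j, index 0 unused): length of the longest suffix of
    p[1..j] that is also a suffix of p. Brute force, for illustration only."""
    n = len(p)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        while length < j and p[j - 1 - length] == p[n - 1 - length]:
            length += 1
        N[j] = length
    return N

def build_L(p):
    """L[i] for i = 1..n (index 0 unused), built from N in one pass."""
    n = len(p)
    N = n_values(p)
    L = [0] * (n + 1)
    for j in range(1, n):
        if N[j] > 0:
            # Later (larger) j overwrite earlier ones, so the biggest j wins.
            L[n - N[j] + 1] = j
    return L

L = build_L("cabdabdab")
print(L[5], L[8])  # 6 3
```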
Building l(i) in O(n)
l(i): the length of the longest suffix of P[i..n] that is also a prefix of P
l(i): the biggest j <= | P[i..n] | == n - i + 1 such that N(j) == j
k := 0
for j := 1 to n-1
    if (N(j) == j), k := j
    l(n - j + 1) := k
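A sketch of the same loop in Python follows. The names n_values and build_l are assumptions made for this illustration, and N is computed by brute force only to keep the block self-contained.

```python
def n_values(p):
    """N[j] (1-based j, index 0 unused): length of the longest suffix of
    p[1..j] that is also a suffix of p. Brute force, for illustration only."""
    n = len(p)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        while length < j and p[j - 1 - length] == p[n - 1 - length]:
            length += 1
        N[j] = length
    return N

def build_l(p):
    """l[i] for i = 2..n (indices 0 and 1 unused), built from N in one pass."""
    n = len(p)
    N = n_values(p)
    l = [0] * (n + 1)
    k = 0
    for j in range(1, n):
        if N[j] == j:  # the prefix p[1..j] is also a suffix of p
            k = j
        l[n - j + 1] = k
    return l

l = build_l("bbabbaabbcabb")
print(l[2:])  # [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]
```

The printed values agree with the l(i) row shown on the earlier slide for the same pattern.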
Building Z in O(n)
For calculating Z(i), we want to use the previously calculated Z(2)...Z(i-1)
For each i we remember the right-most Z(j): the j such that j < i and j + Z(j) >= k + Z(k), for all k < i
Building Z in O(n) (Cont)

[Figure: string s with positions i', j and i marked]
If i < j + Z(j), s[i..j + Z(j) - 1] appeared previously, starting at i' = i - j + 1.
Z(i') < Z(j) - (i - j) ?
Building Z in O(n) (Cont)
For Z(2) calculate explicitly
j := 2, i := 3
While i <= |s|:
    if i >= j + Z(j), calculate Z(i) explicitly
    else
        i' := i - j + 1
        Z(i) := Z(i')
        If Z(i') >= Z(j) - (i - j): Z(i) := Z(j) - (i - j), then calculate the Z(i) tail explicitly
    If j + Z(j) < i + Z(i), j := i
    i := i + 1
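The steps above can be sketched in Python as follows. Assumptions made here: the name z_linear is invented, indices are 0-based (so the slide's i' = i - j + 1 becomes i - j), and the two branches are folded into a single min() followed by explicit extension, which is a common way to write the same idea.

```python
def z_linear(s):
    """Return a list z where z[i] is the length of the longest substring of s
    starting at 0-based index i that matches a prefix of s (z[0] is left at 0)."""
    n = len(s)
    z = [0] * n
    j = 0  # start of the prefix match that reaches furthest to the right
    for i in range(1, n):
        end = j + z[j]  # first index not covered by the match starting at j
        if i < end:
            # s[i:end] also appears at s[i - j:z[j]], so reuse that Z value,
            # capped at the part that is actually known.
            z[i] = min(z[i - j], end - i)
        # Explicit comparisons: only characters at or beyond `end` can match here.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > end:
            j = i
    return z

print(z_linear("bbacdcbbaabbcdd")[1:])
# [1, 0, 0, 0, 0, 3, 1, 0, 0, 2, 1, 0, 0, 0]
```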
Building Z in O(n) - Analysis
The algorithm builds Z correctly
The algorithm executes in O(n)

A new character is matched only once
All other operations are in O(1)
Boyer Moore Worst Case Analysis
Assume P consists of n copies of a single
char and T consists of m copies of the same
char:
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaaaa
Boyer Moore Algorithm runs in Θ(mn) when finding all the matches
The Galil Rule
In a specific matching phase, we mark with k the position in T of the right end of P. We mark with s the position of the last matched char in this phase.
[Figure: positions s and k marked on T]
P: abaab
abaab
The Galil Rule (Cont)
If the shift moves the left end of P to the right of s, all the chars of T in positions s < j ≤ k are known to match the aligned prefix of P. The algorithm doesn't need to check them.
An extended Boyer Moore algorithm with the Galil rule runs in O(m + n) worst case (even without the bad-character rule).
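The effect of the rule can be sketched as follows. This is a hedged illustration, not the slides' own code: it uses only the good suffix rule (as the last bullet allows), the names z_values, good_suffix_tables and search_with_galil are invented here, and the variable known (the rightmost text position already known to match a prefix of P) is reset conservatively to 0 whenever the shift does not move the left end of P past the mismatch.

```python
def z_values(s):
    """z[i] = longest match between s[i:] and a prefix of s (z[0] left at 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

def good_suffix_tables(p):
    """Return the 1-based tables (L, l) defined on the earlier slides."""
    n = len(p)
    zr = z_values(p[::-1])
    N = [0] + [zr[n - j] for j in range(1, n + 1)]
    L, l, k = [0] * (n + 2), [0] * (n + 2), 0
    for j in range(1, n):
        if N[j] > 0:
            L[n - N[j] + 1] = j
        if N[j] == j:
            k = j
        l[n - j + 1] = k
    return L, l

def search_with_galil(p, t):
    """Return (0-based match starts, number of character comparisons)."""
    n, m = len(p), len(t)
    if n == 0 or n > m:
        return [], 0
    L, l = good_suffix_tables(p)
    matches, comparisons = [], 0
    k = n      # 1-based position in T under the right end of P
    known = 0  # T up to this position is known to match the aligned prefix of P
    while k <= m:
        i, h = n, k
        while i > 0 and h > known:
            comparisons += 1
            if p[i - 1] != t[h - 1]:
                break
            i, h = i - 1, h - 1
        if i == 0 or h == known:  # reached the known region: occurrence found
            matches.append(k - n)
            shift = n - l[2]
            known = k
        else:
            if i == n:
                shift = 1
            elif L[i + 1] > 0:
                shift = n - L[i + 1]
            else:
                shift = n - l[i + 1]
            # The rule applies only if the left end of P moves past the mismatch.
            known = k if shift >= i else 0
        k += shift
    return matches, comparisons

found, compared = search_with_galil("a" * 6, "a" * 25)
print(len(found), compared)  # 20 25
```

On the worst-case input from the previous slide, the first alignment costs 6 comparisons and each of the remaining 19 costs one, instead of 6 comparisons per alignment.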
Don't Sleep Yet
O(n + m) proof - Outline
Preprocess in O(n) - already proved
1. Properties of strings
2. Proof of search in O(m) if P is not in T, using
only the good suffix rule.
3. Proof of search in O(m) even if P is in T,
adding the Galil rule.
Properties of Strings
If for two strings δ, γ: δγ = γδ then there is a string ρ such that δ = ρ^i and γ = ρ^j, i, j > 0
- Proof by induction
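The property can be illustrated with a small sketch. The function name common_root is an assumption made here, and the sketch relies on the known fact that, when the two strings commute, the common root can be taken as the prefix whose length is the gcd of the two lengths.

```python
from math import gcd

def common_root(delta, gamma):
    """If delta + gamma == gamma + delta (both non-empty), return (rho, i, j)
    with delta == rho * i and gamma == rho * j. Otherwise return None."""
    if not delta or not gamma or delta + gamma != gamma + delta:
        return None
    g = gcd(len(delta), len(gamma))
    rho = delta[:g]
    return rho, len(delta) // g, len(gamma) // g

print(common_root("abab", "ab"))  # ('ab', 2, 1)
print(common_root("ab", "ba"))    # None, because "abba" != "baab"
```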
Definition: A string s is semiperiodic with period β if s consists of a non-empty suffix of β (possibly the entire β) followed by one or more complete copies of β.

Properties of Strings (Cont)
A string is prefix semiperiodic if it contains one or more complete copies of β followed by a non-empty prefix of β.
A string is prefix semiperiodic iff it is semiperiodic with the same length period
Lemma 1
Suppose P occurs in T starting at position p and also at position q, q > p. If q - p ≤ n/2 then P is semiperiodic with period β = P[n-(q-p)+1..n]
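Lemma 1 can be checked on a small example. The sketch below is only an illustration: the helper names is_semiperiodic and occurrences, and the strings P and T, are choices made here, and positions in the code are 0-based.

```python
def is_semiperiodic(s, beta):
    """True if s is a non-empty suffix of beta (possibly all of beta)
    followed by one or more complete copies of beta."""
    b = len(beta)
    if b == 0 or len(s) <= b:
        return False
    head = len(s) % b or b           # length of the leading suffix of beta
    copies = (len(s) - head) // b    # complete copies that follow it
    return s[:head] == beta[b - head:] and s[head:] == beta * copies

def occurrences(p, t):
    """0-based start positions of p in t, found by brute force."""
    return [i for i in range(len(t) - len(p) + 1) if t.startswith(p, i)]

P = "abaabaab"
T = "abaabaabaab"
n = len(P)
starts = occurrences(P, T)
for p_pos, q_pos in zip(starts, starts[1:]):
    d = q_pos - p_pos
    if d <= n // 2:
        beta = P[n - d:]  # the last q - p chars of P
        print(d, beta, is_semiperiodic(P, beta))  # 3 aab True
```

Here P occurs at positions 0 and 3, and P splits as ab + aab + aab, where ab is a suffix of the period aab.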

Proof - when P is Not Found in T
We have R rounds during the search.
After each round the good suffix rule decides on a right shift of si chars.
Σ si ≤ m
We shall use Σ si as an upper bound.

Proof (Cont)
For each round we count the matched chars by:
fi - the number of chars matched for the first time
gi - the number of chars already matched in previous rounds.
Σ fi ≤ m
We want to prove that gi ≤ 3si (⇒ Σ gi ≤ 3m).
Proof (Cont)
Each round doesn't find P: it matches a substring ti and hits one bad char xi in T (xi ti ⊆ T)
T: bbacdcbaabcbbabdbabcaabcbcb
P: bdbabc
|ti| + 1 ≤ 3si ⇒ gi ≤ 3si (because gi + fi = |ti| + 1)
For the rest of the proof we assume that for the specific round i: |ti| + 1 > 3si
Lemma 2 (|ti| + 1 > 3si)
In round i we look at the matched suffix of P, marked P*. P* = yi ti, yi ≠ xi.
Both P* and ti are semiperiodic with period α of length si and hence with minimal length period β, α = β^k.
Proof: by Lemma 1.
Lemma 3 (|ti| + 1 > 3si)
Suppose P overlapped ti during round i. We shall examine in what ways P could overlap ti in previous rounds.
In any round h < i, the right end of P could not have been aligned with the right end of any full copy of β in ti.
- proof:
Both round h and i fail at char xi
two cases of possible shift after round h are invalid
Lemma 4 (|ti| + 1 > 3si)
In round h < i, P can correctly match at most |β|-1 chars in ti.
By Lemma 3, P is not aligned with a right end of ti in phase h.
Thus if it matched |β| chars or more there is a suffix γ of β followed by a prefix δ of β such that γδ = δγ.
By the string properties there is a substring ρ such that β = ρ^k, k > 1.
This contradicts the minimal period size property of β.
Lemma 5 (|ti| + 1 > 3si)
If in round h < i the right end of P is aligned with a char in ti, it can only be aligned with one of the following:
One of the left-most |β|-1 chars of ti
One of the right-most |β| chars of ti
- proof:
If not, by Lemma 3, 4, max |β|-1 chars are matched and only from the middle of a β copy, while there are at least |β|
A shift cannot pass the right end of that β copy
Proof (Cont)
If |ti| + 1 > 3si then gi ≤ 3si
Using Lemma 5, in previous rounds we could match only the bad char xi, the last |β|-1 chars in ti or start from the first |β| right chars in ti.
In the last case, using Lemma 4, we can only match up to |β|-1 chars
in total we could previously match:
gi ≤ 1 + |β|-1 + (|β| + |β|-1) ≤ 3|β| ≤ 3si
Proof - Final
Number of matches = Σ(fi + gi) = Σ fi + Σ gi ≤ m + 3 Σ si ≤ m + 3m = 4m
Proof - when P is Found in T
Split the rounds to two groups:
match rounds - an occurrence of P in T was found.
mismatch rounds - P was not found in T.
We have proved O(m) for mismatch rounds.

Proof (Cont)
After P was found in T, P will be shifted by a constant length s. (s = n - l(2)).
|n| + 1 ≤ 3s ⇒ Σ matches in round i ≤ Σ 3s ≤ 3m
For the rest of the proof we assume that: |n| + 1 > 3s
Proof (|n| + 1 > 3s)
By Lemma 1, P is semiperiodic with minimal length period β, |β| = s.
If round i+1 is also a match round then, by the Galil rule, only the new |β| chars are compared.
A contiguous series of match rounds, i...i+k, is called a run.
Proof (|n| + 1 > 3s)
Σ The length of a run, not including chars that were already matched in previous runs ≤ m
How many chars in a run were already matched in previous runs?
Lemma (|n| + 1 > 3s)
Suppose k-1 was a match round and k is a mismatch round that ends the run.
If k' > k is the first match round then it overlaps at most |β|-1 chars with the previous run (ended by round k-1).
The left end of P at round k' cannot be aligned with the left end of a full copy of β at round k-1.
As a result, P cannot overlap |β| chars or more with round k-1.
Proof (|n| + 1 > 3s)
By the Lemma and because the shift after every match round is of |β|, only the first round of a run can overlap, and only with the last previous run.
Σ The length of the chars that were already matched in previous runs ≤ m
Proof (|n| + 1 > 3s) - Final
Σ The length of a run =
Σ The length of a run, not including chars that were already matched in previous runs +
Σ The length of the chars that were already matched in previous runs
≤ m + m
