You are on page 1of 4

*+$',-./#!"%/,'$,#/#0!1!2 !*3,%+-#!4%#05!678'/9!:;.<%,!=,#>#.-!678'/!

Boyer Moore String Matching


(Week 4)

Definition
The Boyer–Moore string search algorithm is a particularly efficient string searching
algorithm. It was developed by Bob Boyer and J Strother Moore in 1977. The algorithm
preprocesses the target string (key) that is being searched for, but not the string being
searched (unlike some algorithms which preprocess the string to be searched, and can
then amortize the expense of the preprocessing by searching repeatedly). The execution
time of the algorithm can actually be sub-linear: it doesn't need to actually check every
character of the string to be searched but rather skips over some of them. Generally the
algorithm gets faster as the key being searched for becomes longer. Its efficiency derives
from the fact that, with each unsuccessful attempt to find a match between the search
string and the text it's searching in, it uses the information gained from that attempt to
rule out as many positions of the text as possible where the string could not match.
The algorithm compares the pattern with the text from right to left. If the text symbol
that is compared with the rightmost pattern symbol does not occur in the pattern at all,
then the pattern can be shifted by m positions behind this text symbol. The following
example illustrates this situation.
0 1 2 3 4 5 6 7 8 9 …

Text→ a b b a d a b a c b a

Pattern→ b a b a c mismatch

b a b a c

The first comparison d - c at position 4 produces a mismatch. The text symbol d does not
occur in the pattern. Therefore, the pattern cannot match at any of the positions 0, ..., 4,
since all corresponding windows don’t contain a d. The pattern can be shifted to position
5.
The best case for the Boyer-Moore algorithm is attained if at each attempt the first
compared text symbol does not occur in the pattern. Then the algorithm requires only
O(n/m) comparisons.
In case of a mismatch (or a complete match of the whole pattern),boyer moore uses two
precomputed functions to shift the window to the right. These two shift functions are
called the good-suffix shift (also called matching shift) and the bad-character shift (also
called the occurrence shift).

Bad Character Shift


This method can be applied if the bad character, i.e. the text symbol that causes a
mismatch, occurs somewhere else in the pattern. Then the pattern can be shifted so that it
is aligned to this text symbol. In the example below, comparison b - c causes a mismatch.
Text symbol b occurs in the pattern at positions 0 and 2. The pattern can be shifted so
that the rightmost b in the pattern is aligned to text symbol b.
0 1 2 3 4 5 6 7 8 9 …

Text→ a b b a b a b a c b a

Pattern→ b a b a c mismatch

b a b a c

Good-Suffix Shift

6%?'+#<!@-0$$-!@%?0-?!63,#A#5#!2 !B#;#+!CDDEFCDDG! !!!!!!!!!!"#$%!&!'(!)!


*+$',-./#!"%/,'$,#/#0!1!2 !*3,%+-#!4%#05!678'/9!:;.<%,!=,#>#.-!678'/!

Sometimes the bad character heuristics fails. In the following situation the comparison a
- b causes a mismatch. An alignment of the rightmost occurence of the pattern symbol a
with the text symbol a would produce a negative shift. Instead, a shift by 1 would be
possible. However, in this case it is better to derive the maximum possible shift distance
from the structure of the pattern. This method is called good-suffix. Example:
0 1 2 3 4 5 6 7 8 9 …

Text→ a b a a b a b a c b a

Pattern→ c a b a b

c a b a b

The suffix ab has matched. The pattern can be shifted until the next occurence of ab in
the pattern is aligned to the text symbols ab, i.e. to position 2.
In the following situation the suffix bab has matched. There is no other occurence of bab
in the pattern. But in this case the pattern cannot be shifted to position 5 as before, but
only to position 3, since a prefix of the pattern (ab) matches the end of bab. Example:
0 1 2 3 4 5 6 7 8 9 …

Text→ a a b a b a b a c b a

Pattern→ a b b a b

a b b a b

Preprocessing
For the bad-character, a function occ is required which yields, for each symbol in the
pattern, the position of its rightmost occurrence in the pattern.
Example : occ(text, x) = 2, occ(text, t) = 3.
The rightmost occurence of symbol 'x' in the string 'text' is at position 2. Symbol 't'
occurs at positions 0 and 3, the rightmost occurence is at position 3.
For the good-suffix, an array s is used. Each entry s[i] contains the shift distance of the
pattern if a mismatch at position i – 1 occurs, i.e. if the suffix of the pattern starting at
position i has matched. In order to determine the shift distance, two cases have to be
considered.

Case 1 : The matching suffix (gray) occurs somewhere else in the pattern

Case 2 : Only a part of the matching suffix occurs at the beginning of the pattern

Pseudocode

6%?'+#<!@-0$$-!@%?0-?!63,#A#5#!2 !B#;#+!CDDEFCDDG! !!!!!!!!!!"#$%!C!'(!)!


*+$',-./#!"%/,'$,#/#0!1!2 !*3,%+-#!4%#05!678'/9!:;.<%,!=,#>#.-!678'/!

void preBmBc(char *x, int m, int bmBc[]) {


int i;

for (i = 0; i < ASIZE; ++i)


bmBc[i] = m;
for (i = 0; i < m - 1; ++i)
bmBc[x[i]] = m - i - 1;
}

void suffixes(char *x, int m, int *suff) {


int f, g, i;

suff[m - 1] = m;
g = m - 1;
for (i = m - 2; i >= 0; --i) {
if (i > g && suff[i + m - 1 - f] < i - g)
suff[i] = suff[i + m - 1 - f];
else {
if (i < g)
g = i;
f = i;
while (g >= 0 && x[g] == x[g + m - 1 - f])
--g;
suff[i] = f - g;
}
}
}

void preBmGs(char *x, int m, int bmGs[]) {


int i, j, suff[XSIZE];

suffixes(x, m, suff);

for (i = 0; i < m; ++i)


bmGs[i] = m;
j = 0;
for (i = m - 1; i >= 0; --i)
if (suff[i] == i + 1)
for (; j < m - 1 - i; ++j)
if (bmGs[j] == m)
bmGs[j] = m - 1 - i;
for (i = 0; i <= m - 2; ++i)
bmGs[m - 1 - suff[i]] = m - 1 - i;
}

void BM(char *x, int m, char *y, int n) {


int i, j, bmGs[XSIZE], bmBc[ASIZE];

/* Preprocessing */
preBmGs(x, m, bmGs);
preBmBc(x, m, bmBc);

/* Searching */
j = 0;
while (j <= n - m) {
for (i = m - 1; i >= 0 && x[i] == y[i + j]; --i);
if (i < 0) {
OUTPUT(j);
j += bmGs[0];
}
else
j += MAX(bmGs[i], bmBc[y[i + j]] - m + 1 + i);
}
}

6%?'+#<!@-0$$-!@%?0-?!63,#A#5#!2 !B#;#+!CDDEFCDDG! !!!!!!!!!!"#$%!1!'(!)!


*+$',-./#!"%/,'$,#/#0!1!2 !*3,%+-#!4%#05!678'/9!:;.<%,!=,#>#.-!678'/!

Example
Text : G C A T C G C A G A G A G T A T A C A G T A C G
Pattern : G C A G A G A G
Preprocessing phase
c A C G T

bmBC(c) 1 6 2 8

i 0 1 2 3 4 5 6 7

x[i] G C A G A G A G

suff [ i ] 1 0 0 2 0 4 0 8

bmGS [ i ] 7 7 7 2 7 4 7 1

First Attempt
G C A T C G C A G A G A G T A T A C A G T A C G
1
G C A G A G A G
Shift by: 1 (bmGs[7]=bmBc[A]-8+8)
Second Attempt
G C A T C G C A G A G A G T A T A C A G T A C G
3 2 1
G C A G A G A G
Shift by: 4 (bmGs[5]=bmBc[C]-8+6
Third Attempt
G C A T C G C A G A G A G T A T A C A G T A C G
8 7 6 5 4 3 2 1
G C A G A G A G
Shift by: 7 (bmGs[0] ) à if match, shift by bmGs[0]
Fourth Attempt
G C A T C G C A G A G A G T A T A C A G T A C G
3 2 1
G C A G A G A G
Shift by: 4 (bmGs[5]=bmBc[C]-8+6)
Fifth Attempt
G C A T C G C A G A G A G T A T A C A G T A C G
2 1
G C A G A G A G
Shift by: 7 (bmGs[6]) End of text, Searching Ended

Referensi
Ø http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
Ø http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm

6%?'+#<!@-0$$-!@%?0-?!63,#A#5#!2 !B#;#+!CDDEFCDDG! !!!!!!!!!!"#$%!)!'(!)!

You might also like