Data Engineering

Pattern Matching
張賢宗

Pattern Matching 110/12/07

Outline
• What is Pattern Matching
• The Brute Force Algorithm
• The Knuth-Morris-Pratt (KMP) Algorithm
• The Boyer-Moore Algorithm
• External Pattern Matching

What is Pattern Matching?


• Given a (long) text string T and a (short) pattern P, find all occurrences of the pattern in the text.
▫ T: “It is a good day to take a god damn rest.”
▫ P: “go”
• Applications
▫ Text editor
▫ DNA Sequencing Matching
▫…

Basic Concepts
• Assume S is a string of length m
• S[i…j] is the fragment between indexes i and j (inclusive); we call this fragment a substring of S
• S[0…i] is a prefix of S, where 0<=i<=m-1
• S[i…m-1] is a suffix of S, where 0<=i<=m-1

Examples
• S: smallpig
• Substring
▫ mal
▫ lpig
• Prefix
▫ smallpig, smallpi, smallp, small, smal, sma, sm, s
• Suffix
▫ smallpig, mallpig, allpig, llpig, lpig, pig, ig, g
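
The prefix and suffix definitions can be checked mechanically. The following is a minimal sketch in C; the helper names `is_prefix` and `is_suffix` are hypothetical and do not appear in the slides. Note that this sketch also accepts the empty string as a trivial prefix and suffix, although the slides list only non-empty ones.

```c
#include <stdbool.h>
#include <string.h>

// Return true if x is a prefix of s, i.e. x equals s[0 .. |x|-1].
bool is_prefix(const char *x, const char *s)
{
    size_t k = strlen(x);
    return k <= strlen(s) && strncmp(s, x, k) == 0;
}

// Return true if x is a suffix of s, i.e. x equals s[m-|x| .. m-1].
bool is_suffix(const char *x, const char *s)
{
    size_t k = strlen(x);
    size_t m = strlen(s);
    return k <= m && strcmp(s + (m - k), x) == 0;
}
```

With S = "smallpig", `is_prefix("small", S)` and `is_suffix("pig", S)` both hold, while "mal" is a substring that is neither a prefix nor a suffix.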

The Brute Force Algorithm


• Check each position in the text T to see whether the pattern P starts at that position and matches.

T: s m a l l p i g
P: a l l
P:   a l l
P:     a l l

Brute Force in C Code


#include <string.h>

// Return the index of the first match of pattern in text, or -1 if none.
int brute(const char *text, const char *pattern)
{
    int n = strlen(text);     // n is length of text
    int m = strlen(pattern);  // m is length of pattern
    int j;
    for (int i = 0; i <= (n - m); i++) {
        j = 0;
        while ((j < m) && text[i+j] == pattern[j])
            j++;
        if (j == m)
            return i;         // match at i
    }
    return -1;                // no match
}
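
The function above stops at the first match, whereas the problem statement asks for every occurrence. One possible variant is sketched below; the name `brute_all`, the output array, and the choice to report nothing for an empty pattern are assumptions made for illustration.

```c
#include <string.h>

// Store the start index of each occurrence of pattern in text into out[]
// (at most max_out entries are written) and return the total number of
// occurrences found, including overlapping ones.
int brute_all(const char *text, const char *pattern, int *out, int max_out)
{
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    int count = 0;
    if (m == 0)
        return 0;                 // assumption: an empty pattern reports nothing
    for (int i = 0; i <= n - m; i++) {
        int j = 0;
        while (j < m && text[i + j] == pattern[j])
            j++;
        if (j == m) {             // full match starting at i
            if (count < max_out)
                out[count] = i;
            count++;
        }
    }
    return count;
}
```

For T = "It is a good day to take a god damn rest." and P = "go", this reports two occurrences, at indexes 8 and 27.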

Time Complexity
• Brute force pattern matching runs in time O(mn)
in the worst case.
• But most searches of ordinary text take close to O(m+n) time, because a mismatch usually occurs within the first few characters of the pattern.

Alphabets
• Alphabet
▫ The set of distinct characters that can appear in a string
▫ English: a~z, A~Z, 0~9
▫ Computer: character codes 0 ~ 255 (8-bit extended ASCII)
▫ Bits: 0, 1
• The brute force algorithm is fast when the alphabet of the text is large
• It is slower when the alphabet is small

Examples of Worst and Average Cases
• Example of a worst case:
▫ T: "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbba"
▫ P: "bbbbbba"
• Example of a more average case:
▫ T: "computer science and information engineering"
▫ P: "engine"
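
The gap between the two cases can be made concrete by counting character comparisons. The sketch below instruments the brute force search with a counter; the name `brute_count` and the `cmp` output parameter are assumptions added for illustration.

```c
#include <string.h>

// Brute force search that also counts character comparisons.
// Returns the index of the first match (or -1) and stores the number of
// text/pattern character comparisons in *cmp.
int brute_count(const char *text, const char *pattern, long *cmp)
{
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    *cmp = 0;
    for (int i = 0; i <= n - m; i++) {
        int j = 0;
        while (j < m) {
            (*cmp)++;                       // one character comparison
            if (text[i + j] != pattern[j])
                break;
            j++;
        }
        if (j == m)
            return i;                       // match at i
    }
    return -1;                              // no match
}
```

On a small worst-case input such as T = "bbbbbbbbba" (n = 10) and P = "bba" (m = 3), every one of the n-m+1 = 8 alignments costs m = 3 comparisons, so the counter reaches 24. On the "engine" example most alignments fail on the first character, so the count stays close to n.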

Thinking Over the Problem


• If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?

Answer
• The largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]

Why
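• Let u be the largest proper prefix of P that is also a suffix of P, with |u| = k (P[0 .. k-1] = P[m-k … m-1]), where m is the length of P
• Suppose P has matched T[i … i+m-1], and assume that we can find a match starting at T[i+d], where 0 < d < m-|u|
• Then T[i+d … i+m-1] = P[0 … m-d-1], a prefix of P with length m-d
• But T[i+d … i+m-1] is also the suffix of P with length m-d
• So P has a prefix of length m-d that is also a suffix, and m-d > |u|: a contradiction with the choice of u
• Therefore no such match can start at T[i+d]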

Example
• Find the largest prefix of
"a b a a b" ( P[0 .. j-1], mismatch at j = 5 )
that is a suffix of
"b a a b" ( P[1 .. j-1] )
• It is "a b"
• Set j = 2 // the new j value

Failure Function
• KMP preprocesses the pattern to find matches of
prefixes of the pattern with the pattern itself.
• j = mismatch position in P[]
• k = position before the mismatch (k = j-1).
• The failure function F(k) is defined as the size of
the largest prefix of P[0..k] that is also a suffix of
P[1..k].

Failure Function Example


• P: "abaaba"

j         0 1 2 3 4 5
P[j]      a b a a b a

k (= j-1) 0 1 2 3 4
F(k)      0 0 1 1 2

• F(k) is the size of the largest prefix of P[0..k] that is also a suffix of P[1..k]
• In code, F() is represented by an array, like the table.

F(4)=2
• Find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
▫ Find the size of the largest prefix of "abaab" that is also a suffix of "baab"
▫ It is "ab", so F(4) = 2
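
The definition can be evaluated directly by trying candidate lengths from longest to shortest. The sketch below does exactly that; the name `failure_naive` is hypothetical, and this direct approach is for illustration only, since it is slower than the preprocessing KMP normally uses.

```c
#include <string.h>

// F(k) computed straight from the definition: the size of the largest
// prefix of p[0..k] that is also a suffix of p[1..k].
// The caller must ensure 0 <= k < strlen(p).
int failure_naive(const char *p, int k)
{
    // Try candidate lengths from the longest possible (k) down to 1.
    for (int len = k; len >= 1; len--) {
        // Compare prefix p[0 .. len-1] with suffix p[k-len+1 .. k].
        if (strncmp(p, p + (k - len + 1), (size_t)len) == 0)
            return len;
    }
    return 0;   // no non-empty prefix qualifies
}
```

For P = "abaaba" this gives F(0..4) = 0, 0, 1, 1, 2, matching the table, and in particular `failure_naive("abaaba", 4)` is 2.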

KMP in C Code
• Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm.
▫ If a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then
k = j-1;
j = F(k); // obtain the new j

KMP in C Code
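#include <string.h>

// Return the index of the first match of pattern in text, or -1 if none.
// fail[] is the precomputed failure function of pattern.
int kmp(const char *text, const char *pattern, const int *fail)
{
    int n = strlen(text);
    int m = strlen(pattern);
    int i = 0, j = 0;
    while (i < n) {
        if (pattern[j] == text[i]) {
            if (j == m - 1)
                return i - m + 1; // match
            i++;
            j++;
        }
        else if (j > 0)
            j = fail[j-1];        // shift the pattern; i does not move back
        else
            i++;
    }
    return -1;                    // no match
}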
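
The search loop relies on a `fail[]` array that has to be built before searching. A minimal self-contained sketch follows, assuming the usual linear-time construction in which the pattern is matched against itself; the names `compute_fail` and `kmp_search` and the handling of an empty pattern are assumptions, not part of the slides.

```c
#include <stdlib.h>
#include <string.h>

// Fill fail[0..m-1] so that fail[k] is the size of the largest prefix of
// pattern[0..k] that is also a suffix of pattern[1..k]. Requires m >= 1.
static void compute_fail(const char *pattern, int m, int *fail)
{
    int i = 1, j = 0;
    fail[0] = 0;
    while (i < m) {
        if (pattern[j] == pattern[i]) {   // j+1 characters match
            fail[i] = j + 1;
            i++;
            j++;
        } else if (j > 0) {
            j = fail[j - 1];              // fall back to a shorter prefix
        } else {
            fail[i] = 0;                  // no prefix matches here
            i++;
        }
    }
}

// Return the index of the first occurrence of pattern in text, or -1.
int kmp_search(const char *text, const char *pattern)
{
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    if (m == 0)
        return 0;                         // assumption: empty pattern matches at 0
    int *fail = malloc((size_t)m * sizeof *fail);
    if (fail == NULL)
        return -1;                        // assumption: allocation failure reported as no match
    compute_fail(pattern, m, fail);

    int i = 0, j = 0, result = -1;
    while (i < n) {
        if (pattern[j] == text[i]) {
            if (j == m - 1) {
                result = i - m + 1;       // match
                break;
            }
            i++;
            j++;
        } else if (j > 0) {
            j = fail[j - 1];              // reuse what has already matched
        } else {
            i++;
        }
    }
    free(fail);
    return result;
}
```

For example, `kmp_search("smallpig", "all")` returns 2, and a pattern that does not occur yields -1.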

Analysis of KMP
• KMP runs in optimal time: O(m+n)
• The algorithm never needs to move backwards in the input text, T
▫ This makes the algorithm good for processing very large files that are read in from external devices or through a network stream.

Analysis of KMP
• KMP doesn’t work so well as the size of the alphabet increases
▫ More chance of a mismatch (more possible mismatches)
▫ Mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later

Boyer Moore Algorithm


• The Boyer-Moore pattern matching algorithm is based on two techniques.
▫ The looking-glass technique: find P in T by moving backwards through P, starting at its end.
▫ The character-jump technique

BM Case 1

BM Case 2

BM Case 3

BM Example
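• T: "a pattern matching algorithm"
• P: "rithm"
• [Figure: seven alignments of P against T, with the character comparisons numbered 1 to 11. Comparisons 1 to 6 each mismatch at the last character of P and trigger a shift; comparisons 7 to 11 match "rithm" from right to left.]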
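
The example above can be reproduced with a simplified Boyer-Moore search that combines the looking-glass scan with a character jump. This is a sketch under stated assumptions: the jump is driven by a hypothetical `last[]` table (index of the last occurrence of each character in P, or -1), the good-suffix rule is not used, and the name `bm_search` and the `cmp` counter are added for illustration.

```c
#include <string.h>

// Simplified Boyer-Moore (bad-character jump only).
// Returns the index of the first match of pattern in text, or -1, and
// stores the number of character comparisons in *cmp.
int bm_search(const char *text, const char *pattern, int *cmp)
{
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    int last[256];
    *cmp = 0;
    if (m == 0)
        return 0;                 // assumption: empty pattern matches at 0
    if (m > n)
        return -1;

    // last[c] = index of the last occurrence of c in pattern, -1 if absent.
    for (int c = 0; c < 256; c++)
        last[c] = -1;
    for (int k = 0; k < m; k++)
        last[(unsigned char)pattern[k]] = k;

    int i = m - 1;                // current position in text
    int j = m - 1;                // current position in pattern
    while (i < n) {
        (*cmp)++;
        if (pattern[j] == text[i]) {
            if (j == 0)
                return i;         // whole pattern matched, scanning right to left
            i--;
            j--;
        } else {
            int lo = last[(unsigned char)text[i]];
            int step = (j < lo + 1) ? j : lo + 1;   // min(j, 1 + lo)
            i += m - step;        // character jump
            j = m - 1;            // restart at the end of the pattern
        }
    }
    return -1;
}
```

Called with T = "a pattern matching algorithm" and P = "rithm", this sketch finds the match at index 23 after 11 comparisons, which agrees with the numbering in the example.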

BM Bad Character Shift Function


• Boyer-Moore’s algorithm preprocesses the
pattern P and the alphabet A to build the shift
values for every character.

Shift Function Example


• A = {a, b, c, d}
• P = "abacab"

x       a  b  c  d
BMBC(x) 1  1  2  6
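
The slides do not spell out how the BMBC values are computed. The sketch below assumes one definition that reproduces the table: the shift for x is m - 1 minus the index of the last occurrence of x in P, with a minimum of 1, and m when x does not occur in P. Other presentations exclude the last pattern character from the scan, which would give a different value for b. The name `bmbc_table` is hypothetical.

```c
#include <string.h>

// Fill shift[c] for every byte value c.
// Assumed definition: m - 1 - (index of last occurrence of c in pattern),
// but at least 1; m if c does not occur in the pattern.
void bmbc_table(const char *pattern, int shift[256])
{
    int m = (int)strlen(pattern);
    for (int c = 0; c < 256; c++)
        shift[c] = m;                       // character absent from the pattern
    for (int k = 0; k < m; k++) {
        int s = m - 1 - k;                  // distance from position k to the end
        // Later positions overwrite earlier ones, so the last occurrence wins.
        shift[(unsigned char)pattern[k]] = (s > 0) ? s : 1;
    }
}
```

For P = "abacab" this yields 1, 1, 2 and 6 for a, b, c and d respectively.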

BM Good Suffix
• Assume that a mismatch occurs between the character x[i] = a of the pattern and the character y[i+j] = b of the text during an attempt at position j.
• Then x[i+1 .. m-1] = y[i+j+1 .. j+m-1] = u and x[i] ≠ y[i+j]. The good-suffix shift consists of aligning the segment y[i+j+1 .. j+m-1] = x[i+1 .. m-1] with its rightmost occurrence in x that is preceded by a character different from x[i].

BM Good Suffix
• If there exists no previous segment, the shift
consists in aligning the longest suffix v of y[i+j+1
.. j+m-1] with a matching prefix of x

Refine BM Shift Function


• BMBC=BM Bad Character Shift
• BMGS=BM Good Suffix Shift
• Shift(x) = max(BMBC, BMGS)

Analysis of BM
• The worst-case running time of Boyer-Moore is O(nm)
• The best case of Boyer-Moore is O(n/m)
• Boyer-Moore is fast when the alphabet is large, slow when the alphabet is small.
• In practice, the running times compare as BM < KMP < BF

External Pattern Matching



KMP & BM
• KMP
▫ Small alphabet
▫ Network Stream
▫ External Disk
• BM
▫ Large alphabet
▫ Faster on average
