Aho-Corasick and String Matching Techniques

This document provides information about string matching algorithms and discusses the KMP and Aho-Corasick algorithms in detail. It summarizes the KMP algorithm as preprocessing the pattern to determine the longest proper suffix that matches a prefix, using this information to create failure links to efficiently shift the pattern during matching. It also describes the Aho-Corasick algorithm as generalizing KMP to match multiple patterns using a keyword tree with failure links between nodes.

Uploaded by

Akash Pal


CS 3343: Analysis of Algorithms

Lecture 26: String Matching Algorithms

Definitions
Text: a longer string T, of length m
Pattern: a shorter string P, of length n
Exact matching: find all occurrences of P in T

[Figure: a text T with each occurrence of the pattern P = aba marked beneath it.]

The naïve algorithm

[Figure: P = aba (length n) is aligned under T (length m) at every position in turn, sliding one character to the right each time.]

Time complexity
Worst case: O(mn)
Best case: O(m)
Example: aaaaaaaaaaaaaa vs. baaaaaaa
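
A minimal Python sketch of the naïve algorithm, with a comparison counter added so the best and worst cases can be checked. The function name `naive_search` is illustrative, positions are 0-based (the slides count from 1), and reading the example strings as a text of 14 a's against a short pattern is an assumption.

```python
def naive_search(text, pattern):
    """Return (match_positions, comparison_count) for the naive algorithm."""
    m, n = len(text), len(pattern)
    positions = []
    comparisons = 0
    # Try every alignment of the pattern under the text.
    for start in range(m - n + 1):
        j = 0
        while j < n:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break  # mismatch: give up on this alignment
            j += 1
        if j == n:
            positions.append(start)
    return positions, comparisons


if __name__ == "__main__":
    print(naive_search("a" * 14, "baaaaaaa"))  # mismatch on the first char every time
    print(naive_search("a" * 14, "aaaaaaab"))  # mismatch on the last char every time
```

With the pattern baaaaaaa every alignment fails on its first comparison, while aaaaaaab forces all n comparisons at every alignment.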

Average case?
Alphabet size = k
Assume equal probability
How many chars do you need to compare before finding a mismatch?
  On average: k / (k-1)
  Therefore average-case complexity: mk / (k-1)
  For a large alphabet, ~ m
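
A short derivation of the k / (k-1) figure, as a sketch under two assumptions that the slide does not spell out: each comparison matches independently with probability 1/k, and the end of the pattern is ignored (stopping early can only lower the count). Here X is the number of comparisons made at one alignment, including the one that mismatches.

```latex
\[
\mathbb{E}[X]
  = \sum_{j \ge 1} j \left(\frac{1}{k}\right)^{j-1} \left(1 - \frac{1}{k}\right)
  = \frac{1}{1 - 1/k}
  = \frac{k}{k-1}
\]
```

For DNA (k = 4) this is 4/3 comparisons per alignment; over roughly m alignments the total is mk / (k-1).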

Not as bad as you thought, huh?

Real strings are not random

T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab
Plus: O(m) average case is still bad for long strings!
Smarter algorithms:
  O(m + n) in worst case
  sub-linear in practice
  how is this possible?

How to speed up?
Pre-processing T or P
Why can pre-processing save us time?
  Uncovers the structure of T or P
  Determines when we can skip ahead without missing anything
  Determines when we can infer the result of character comparisons without actually doing them
Example:
  ACGTAXACXTAXACGXAX
  ACGTACA

Cost for exact string matching

Total cost = cost(preprocessing) + cost(comparison) + cost(output)
  cost(preprocessing): overhead
  cost(comparison): minimize
  cost(output): constant

Hope: gain > overhead

String matching scenarios

One T and one P
  Search a word in a document
One T and many P all at once
  Search a set of words in a document
  Spell checking
One fixed T, many P
  Search a completed genome for a short sequence
Two (or many) Ts for common patterns

Would you preprocess P or T?
  Always pre-process the shorter seq, or the one that is repeatedly used

Pattern pre-processing algs

Karp-Rabin algorithm
  Small alphabet and small pattern
Boyer-Moore algorithm
  The choice of most cases
  Typically sub-linear time
Knuth-Morris-Pratt algorithm (KMP)
Aho-Corasick algorithm
  The algorithm for the unix utility fgrep
Suffix tree
  One of the most useful preprocessing techniques
  Many applications

Algorithm KMP
Not the fastest
Best known
Good for real-time matching
  i.e. text comes one char at a time
  No memory of previous chars

Idea
  Left-to-right comparison
  Shift P more than one char whenever possible

Intuitive example 1
T: abcxabc?...
P: abcxabcde

[Figure: after the mismatch at ?, the naïve approach slides P one char to the right at a time and re-compares from the start of P at each alignment.]

Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened when comparing P[8] with T[i], we can shift P by four chars, and compare P[4] with T[i], without missing any possible matches.
Number of comparisons saved: 6

Intuitive example 2
T: abcxab?... (the char at ? should not be a c)
P: abcxabcde

[Figure: after the mismatch at ?, the naïve approach slides P one char to the right at a time and re-compares from the start of P at each alignment.]

Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1] without missing any possible matches.
Number of comparisons saved: 7

KMP algorithm: pre-processing

Key: the reasoning is done without even knowing what string T is. Only the location of the mismatch in P must be known.

[Figure: T contains the matched substring t followed by x. P starts with the prefix t' followed by z, and later contains t = P[j..i] followed by y.]

Pre-processing: for any position i in P, find P[1..i]'s longest proper suffix, t = P[j..i], such that t matches a prefix of P, t', and the next char of t is different from the next char of t' (i.e., y ≠ z).
For each i, let sp(i) = length(t).

KMP algorithm: shift rule

[Figure: P is aligned under T with t matched and y mismatching x. After the shift, the prefix t' = P[1..sp(i)] lies under t, and z = P[sp(i)+1] lies under x.]

Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and compare x with z.
This shift rule can be implicitly represented by creating a failure link between y and z.
Meaning: when a mismatch occurs between x on T and P[i+1], resume comparison between x and P[sp(i)+1].

Failure Link Example

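P: aataac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1)

i      1  2  3  4  5  6
P[i]   a  a  t  a  a  c
sp(i)  0  1  0  0  2  0

aa vs. at: at i = 2 the prefix a is followed by a and the suffix a by t, so sp(2) = 1.
aat vs. aac: at i = 5 the prefix aa is followed by t and the suffix aa by c, so sp(5) = 2.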

Another example
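P: abababc
If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1)

i      1  2  3  4  5  6  7
P[i]   a  b  a  b  a  b  c
sp(i)  0  0  0  0  0  4  0

ab vs. ab: at i = 3 the prefix a and the suffix a are both followed by b, so sp(3) = 0.
abab vs. abab: at i = 5 the prefix aba and the suffix aba are both followed by b, so sp(5) = 0.
ababa vs. ababc: at i = 6 the prefix abab is followed by a and the suffix abab by c, so sp(6) = 4.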
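
The sp values for both example patterns can be reproduced with a short Python sketch. It assumes the usual two-pass approach: first compute the plain longest-border lengths, then apply the extra "next chars must differ" condition from the pre-processing slide. The function name `strong_sp` is illustrative and the list is 0-indexed, so `strong_sp(P)[i - 1]` corresponds to sp(i) on the slides.

```python
def strong_sp(pattern):
    """sp values as defined on the slides (next chars must differ)."""
    n = len(pattern)

    # Pass 1: plain border lengths. sp[i] is the length of the longest
    # proper suffix of pattern[:i + 1] that is also a prefix of pattern.
    sp = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = sp[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        sp[i] = k

    # Pass 2: keep a border only if the char after the prefix differs
    # from the char after the suffix; otherwise fall back to the value
    # already computed for the shorter prefix.
    strong = [0] * n
    for i in range(n):
        v = sp[i]
        if i == n - 1 or pattern[v] != pattern[i + 1]:
            strong[i] = v
        elif v > 0:
            strong[i] = strong[v - 1]
        # v == 0 with equal next chars: no valid border, value stays 0
    return strong


if __name__ == "__main__":
    print(strong_sp("aataac"))
    print(strong_sp("abababc"))
```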

KMP Example using Failure Link

P: aataac
T: aacaataaaaataaccttacta

[Figure: successive alignments of P under T, marked ^ for a match, * for a mismatch and . for an implicit comparison: ^^*, .*, ^^^^^*, ..*, .^^^^^]

Time complexity analysis:
Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
More careful analysis: the number of comparisons can be broken into two phases:
  Comparison phase: the first time a char in T is compared to P. Total is exactly m.
  Shift phase: first comparisons made after a shift. Total is at most m.
Time complexity: at most 2m comparisons, i.e. O(m)
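
A sketch of KMP matching with failure links in Python, with a comparison counter so the 2m bound can be checked on the example. The names `failure_links` and `kmp_search` are illustrative, positions are 0-based, and the pattern is assumed to be non-empty.

```python
def failure_links(pattern):
    """0-indexed sp values with the 'next chars must differ' rule."""
    n = len(pattern)
    sp = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = sp[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        sp[i] = k
    strong = [0] * n
    for i in range(n):
        v = sp[i]
        if i == n - 1 or pattern[v] != pattern[i + 1]:
            strong[i] = v
        elif v > 0:
            strong[i] = strong[v - 1]
    return strong


def kmp_search(text, pattern):
    """Return (match_positions, comparison_count); pattern must be non-empty."""
    fail = failure_links(pattern)
    n = len(pattern)
    matches, comparisons = [], 0
    j = 0  # number of pattern chars currently matched
    for k, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[j]:
                j += 1          # match: move on to the next text char
                break
            if j == 0:
                break           # mismatch at P[1]: move on to the next text char
            j = fail[j - 1]     # follow the failure link, re-compare ch
        if j == n:
            matches.append(k - n + 1)
            j = fail[n - 1]     # keep going to find later occurrences
    return matches, comparisons


if __name__ == "__main__":
    print(kmp_search("aacaataaaaataaccttacta", "aataac"))
```

Each text char leaves the inner loop exactly once, and every extra comparison follows a failure link that shortens the current match, which is where the "at most 2m" count comes from.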

KMP algorithm using DFA (Deterministic Finite Automata)

P: aataac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3

[Figure: the failure link for P, and the equivalent DFA with states 0 to 6.]

If the next char in T is t after matching 5 chars, go to state 3

All other inputs go to state 0.

DFA Example
[Figure: the DFA for P = aataac.]

T:      aacaataataataaccttacta
States: 1201234534534560001001

Each char in T will be examined exactly once.
Therefore, exactly m comparisons are made.
But it takes longer to do pre-processing, and needs more space to store the FSA.
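
The state sequence above can be reproduced with a small Python sketch of the DFA construction. This uses the common "restart state" construction, which is an assumption about how the table is built (the slides only show the finished automaton). The names `build_dfa` and `run_dfa` are illustrative, and the alphabet is passed in explicitly. The pattern must be non-empty and use only alphabet chars.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][ch] -> next state; state j means j chars are matched."""
    n = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(n + 1)]
    dfa[0][pattern[0]] = 1
    x = 0  # restart state: where the DFA is after reading pattern[1:j]
    for j in range(1, n + 1):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]        # on a mismatch, behave like state x
        if j < n:
            dfa[j][pattern[j]] = j + 1   # on a match, advance
            x = dfa[x][pattern[j]]
    return dfa


def run_dfa(text, pattern, alphabet):
    """Return (state after each text char, 0-based match positions)."""
    dfa = build_dfa(pattern, alphabet)
    state, trace, matches = 0, [], []
    for k, ch in enumerate(text):
        state = dfa[state].get(ch, 0)    # chars outside the alphabet reset to 0
        trace.append(state)
        if state == len(pattern):
            matches.append(k - len(pattern) + 1)
    return trace, matches


if __name__ == "__main__":
    trace, matches = run_dfa("aacaataataataaccttacta", "aataac", "act")
    print("".join(map(str, trace)), matches)
```

The table has one entry per state and alphabet char, which is the extra pre-processing time and space the slide refers to.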

Difference between Failure Link and DFA

Failure link
  Preprocessing time and space are O(n), regardless of alphabet size
  Comparison time is at most 2m (at least m)

DFA
  Preprocessing time and space are O(n |Σ|)
    May be a problem for very large alphabet size
    For example, each char is a big integer
    Chinese characters
  Comparison time is always m.

The set matching problem

Find all occurrences of a set of patterns in T
First idea: run KMP or BM for each P
  O(km + n)
    k: number of patterns
    m: length of text
    n: total length of patterns

Better idea: combine all patterns together and search in one run

A simpler problem: spell-checking

A dictionary contains five words:
potato poetry pottery science school

Given a document, check if any word is (not) in the dictionary

Words in document are separated by special chars. Relatively easy.

Keyword tree for spell checking

Document: This version of the potato gun was inspired by the Weird Science team out of Illinois

Keyword tree (numbers mark the ends of dictionary words):

root
  p o
    t
      a t o        1 (potato)
      t e r y      3 (pottery)
    e t r y        2 (poetry)
  s c
    i e n c e      4 (science)
    h o o l        5 (school)

O(n) time to construct. n: total length of patterns.
Search time: O(m). m: length of text.
Common prefixes only need to be compared once.
What if there is no space between words?
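
A minimal Python sketch of the keyword tree for spell checking, using nested dicts as tree nodes. The names `build_keyword_tree`, `in_tree` and `known_words` are illustrative, and lower-casing the document and splitting on non-letters are assumptions about how words are separated.

```python
import re


def build_keyword_tree(words):
    """Nested-dict trie; the key "$" marks the end of a dictionary word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # shared prefixes share nodes
        node["$"] = word
    return root


def in_tree(root, word):
    """Walk the tree one char at a time; True only for whole words."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


def known_words(root, document):
    """Dictionary words that occur in the document, in order."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [w for w in tokens if in_tree(root, w)]


if __name__ == "__main__":
    tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
    doc = ("This version of the potato gun was inspired by "
           "the Weird Science team out of Illinois")
    print(known_words(tree, doc))
```

Because the document is split into words first, each word is looked up from the root exactly once. That shortcut is what disappears when there are no separators between words.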

Aho-Corasick algorithm
Basis of the fgrep algorithm
Generalizing KMP
  Using failure links

Example: given the following 4 patterns:
  potato
  tattoo
  theater
  other

Keyword tree

root
  p o t a t o      (potato)
  t
    a t t o o      (tattoo)
    h e a t e r    (theater)
  o t h e r        (other)

T: potherotathxythopotattooattoo

Naïve search: walk down the keyword tree from the root, starting again at every position of T.
O(mn)
m: length of text. n: length of longest pattern
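
The O(mn) search without failure links can be sketched in Python as a walk from the root at every text position. The names `build_keyword_tree` and `naive_tree_search` are illustrative, and reported positions are 0-based.

```python
def build_keyword_tree(patterns):
    """Nested-dict trie; the key None marks the end of a pattern."""
    root = {}
    for pat in patterns:
        node = root
        for ch in pat:
            node = node.setdefault(ch, {})
        node[None] = pat
    return root


def naive_tree_search(text, patterns):
    """Return (start, pattern) pairs, restarting at the root for each start."""
    root = build_keyword_tree(patterns)
    hits = []
    for start in range(len(text)):
        node = root
        i = start
        # Follow tree edges for as long as the text agrees (at most n steps).
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if None in node:
                hits.append((start, node[None]))
    return hits


if __name__ == "__main__":
    patterns = ["potato", "tattoo", "theater", "other"]
    print(naive_tree_search("potherotathxythopotattooattoo", patterns))
```

Every start position may walk up to n edges before failing, and none of that work is reused at the next start position.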

Keyword Tree with a failure link

[Figure: the keyword tree for potato, tattoo, theater and other, with one failure link drawn in.]

T: potherotathxythopotattooattoo

Keyword Tree with all failure links

[Figure: the keyword tree for potato, tattoo, theater and other, with the failure links of every node drawn in.]
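
A compact Python sketch of Aho-Corasick on the four example patterns. It assumes the standard construction: build the keyword tree, then set failure links breadth-first so that each node points to the longest proper suffix of its string that is also a path from the root. The names `build_automaton` and `search` are illustrative, and positions are 0-based.

```python
from collections import deque


def build_automaton(patterns):
    """Return (goto, fail, out) lists indexed by state; state 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)

    # Breadth-first: a node's failure link is found by following its
    # parent's failure links until one of them has an edge for ch.
    queue = deque(goto[0].values())  # depth-1 nodes fail to the root
    while queue:
        s = queue.popleft()
        for ch, child in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Also report patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return (start, pattern) pairs in order of where each match ends."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]              # mismatch: follow failure links
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits


if __name__ == "__main__":
    patterns = ["potato", "tattoo", "theater", "other"]
    print(search("potherotathxythopotattooattoo", patterns))
```

On the example text, the search reads p, o, t, fails on h, follows the failure link from pot to ot, and continues through h, e, r to report other without going back in T.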