
Finite fields and string matching

Andrew MacFie
MATH 5900X Final Project
Instructor: Daniel Panario
Carleton University
Sunday 3rd April, 2011
Abstract
First we discuss the basic string matching problem and look at solutions
using the Fast Fourier Transform [1]. Then we focus on the following
more general problem:
Given a text $t$ of length $n$, a pattern $p$ of length $m$ and a bound
$k$, we wish to find all locations in the text where the Hamming distance
between the text and the pattern (ignoring wildcard characters) is at most
$k$. This takes $\Omega(nk)$ time, and we present an algorithm of Clifford et al.
[3] with time complexity within logarithmic factors of being optimal. The
algorithm uses finite fields to relate the problem to concepts in
algebraic coding theory. Specifically, it uses a finite field of characteristic
2 to encode the strings and string indices, then makes interesting use of
the Berlekamp-Massey algorithm and fast cross-correlation (convolution)
calculations.
1 Introduction
In this report we explore how various algorithms related to finite fields can be
used together to solve the string matching problem. In this problem, there
is some alphabet $\Sigma$ (simply a finite set), over which there is a text string
$t = t_0 t_1 \cdots t_{n-1}$ and a pattern string $p = p_0 p_1 \cdots p_{m-1}$, and the
objective is to find the locations of all the matches of $p$ in $t$, that is, all $i$
such that
\[ t_i = p_0,\ t_{i+1} = p_1,\ \ldots,\ t_{i+m-1} = p_{m-1}. \]
The problem will be generalized in two ways. First, we can allow wildcard
characters in the text and the pattern that match any character. Second,
we let the Hamming distance at position $i$, which we denote by $HD(i)$, be the
number of mismatched characters between $t_i t_{i+1} \cdots t_{i+m-1}$ and $p$; then we
may ask for all positions $i$ where $HD(i) \le k$, and for each of those positions, the
locations of the mismatches found (if $k = 0$ we have the original problem). The
combination of these two is called the k-mismatch with wildcards problem.
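As a point of reference for the algorithms that follow, the problem statement above can be written directly as a brute-force $O(nm)$ procedure. The sketch below is in Python (used for all illustrative sketches added to this report; the Appendix code itself is Mathematica). The wildcard symbol `*` and the function names are assumptions of this example, not something the report fixes.

```python
WILDCARD = "*"  # assumed wildcard symbol for this sketch


def mismatch_offsets(t, p, i):
    """Offsets j where p[j] and t[i+j] differ, ignoring wildcards on either side."""
    return [j for j in range(len(p))
            if p[j] != WILDCARD and t[i + j] != WILDCARD and p[j] != t[i + j]]


def k_mismatch_naive(t, p, k):
    """Map every alignment i with HD(i) <= k to its list of mismatch offsets."""
    result = {}
    for i in range(len(t) - len(p) + 1):
        offsets = mismatch_offsets(t, p, i)
        if len(offsets) <= k:
            result[i] = offsets
    return result
```

With $k = 0$ this reduces to exact matching (with wildcards), which is a convenient way to cross-check the faster methods on small inputs.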
A well-known algorithm for the basic string matching problem is the
Knuth-Morris-Pratt algorithm, published in 1977, which takes $O(m)$ preprocessing time
and $O(n)$ matching time, so $O(m + n)$ total [4]. Finite automata-based
algorithms take $O(m|\Sigma|)$ preprocessing time and $O(n)$ matching time, where
$|\Sigma|$ is the size of the alphabet [7]. A simple algorithm presented in Section 2 of
this report is based on the Fast Fourier Transform and solves this problem in
$O(n \log m)$ time. It is modified in Section 3 to handle wildcards with no
difference in asymptotic running time. Both are due to Clifford and Clifford [1].
For the k-mismatch with wildcards problem, a randomized algorithm was
shown to take $O(n(k + \log n \log\log n) \log m)$ time with high probability [2],
and a deterministic algorithm was presented in the same paper that runs in
$O(nk^2 \log^3 m)$ time. Any algorithm needs $\Omega(nk)$ time in the worst case, since
the number of matches can be linear in $n$ and each match could have up to $k$
mismatches to record. The algorithm we present in Section 4, due to Clifford et al. [3],
takes $\tilde{O}(nk)$ time, meaning that it is within logarithmic factors of being
optimal. Importantly, it encodes the input in an alphabet of finite field elements
in order to do algebraic operations and use fast algorithms that work over finite
fields.
2 Basic string matching
In this section, we look at an algorithm from [1] for the basic problem of finding
all locations in a text string $t$ of occurrences of a pattern $p$. The first step is
to replace the alphabet with a set of integers. For example, if the alphabet was
{a, b, c} and we had p = aba and t = cababba, we can replace a, b and c with 1,
2 and 3 respectively and get the alphabet {1, 2, 3}, so p = 121 and t = 3121221.
Once we do that, one way to solve the problem is by calculating the following
quantity for each $i$ in $[0..n-m]$:
\[ \sum_{j=0}^{m-1} (p_j - t_{i+j})^2, \]
which is 0 iff there is a match at position $i$. Let this quantity be $a_i$. Now if we
expand, we get
\[ a_i = \sum_{j=0}^{m-1} \left( p_j^2 - 2 p_j t_{i+j} + t_{i+j}^2 \right)
       = \sum_{j=0}^{m-1} p_j^2 - 2 \sum_{j=0}^{m-1} p_j t_{i+j} + \sum_{j=i}^{i+m-1} t_j^2. \]
The first sum is independent of $i$, and the third sum can be calculated by first
finding the partial sums of $\{t_k^2\}_k$ once and then, for each $i$, taking differences
from that sequence. The second sum dominates time-wise, but note that it is
an element in the convolution of $p$ and $t$, which we can compute using the Fast
Fourier Transform (FFT) in the following way. It is well known that the Fourier
transform and its inverse can be used to multiply polynomials over the complex
numbers. If we use it to find the product of a polynomial with coefficients given
by $t$ and a polynomial with coefficients given by the reverse of $p$, the result will
have the coefficient of $x^{m-1+i}$ given by
\[ \sum_{j=0}^{m-1} t_{i+j}\, p_{m-1-(m-1-j)} = \sum_{j=0}^{m-1} t_{i+j}\, p_j, \tag{1} \]
from which we get the convolution of $p$ and $t$. This operation can be computed
in $O(n \log n)$ time, but we use the following trick to reduce it to $O(n \log m)$:
divide $t$ up into $n/m$ segments of length $m$. Then find the convolution of $p$
and each pair of adjacent length-$m$ segments of $t$. The results give the whole
convolution, and the time needed is $(n/m) \cdot O(2m \log(2m)) = O(n \log m)$.
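The three-sum decomposition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a textbook recursive radix-2 FFT written with the standard `cmath` module, it transforms the whole text at once (so it runs in $O(n \log n)$ and does not implement the segment trick), and the function names are hypothetical.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def cross_correlate(t, p):
    """c[i] = sum_j p[j] * t[i+j] for 0 <= i <= n-m, via equation (1)."""
    n, m = len(t), len(p)
    size = 1
    while size < n + m:
        size *= 2
    ft = fft(list(t) + [0] * (size - n))
    fp = fft(list(reversed(p)) + [0] * (size - m))
    prod = fft([x * y for x, y in zip(ft, fp)], invert=True)
    # The coefficient of x^(m-1+i) is c[i]; dividing by size completes the inverse.
    return [round((prod[m - 1 + i] / size).real) for i in range(n - m + 1)]


def exact_matches(t, p):
    """Positions i with a_i = 0, where t and p are lists of positive integers."""
    n, m = len(t), len(p)
    p_squared = sum(x * x for x in p)          # first sum, independent of i
    prefix = [0]
    for x in t:                                # partial sums of t_k^2
        prefix.append(prefix[-1] + x * x)
    corr = cross_correlate(t, p)               # second sum, for every i
    return [i for i in range(n - m + 1)
            if p_squared - 2 * corr[i] + prefix[i + m] - prefix[i] == 0]
```

For the running example, `exact_matches([3, 1, 2, 1, 2, 2, 1], [1, 2, 1])` reports the single occurrence of aba in cababba at position 1.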
3 String matching with wildcards
Extending the algorithm in Section 2 to texts and patterns with wildcard
characters is not hard and was also done in [1]. All that needs to be done is to
represent the wildcard character with 0 and represent all non-wildcard
characters with positive integers, and then the analogous quantity to $a_i$ in Section 2,
which is 0 iff there is a match at position $i$, is
\[ \sum_{j=0}^{m-1} p_j t_{i+j} (p_j - t_{i+j})^2
   = \sum_{j=0}^{m-1} p_j^3 t_{i+j} - 2 \sum_{j=0}^{m-1} p_j^2 t_{i+j}^2 + \sum_{j=0}^{m-1} p_j t_{i+j}^3, \]
where all three sums are elements in convolutions that can be calculated in
$O(n \log m)$ time as shown above.
An implementation of this algorithm in Mathematica can be found in the
Appendix.
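To make the role of the three sums concrete, here is a small Python sketch of the wildcard test. It is an illustration only: the helper `correlate` is written as a plain double loop so that the example stays exact and short, whereas the method described above would compute each of the three correlations with the FFT. The function names are assumptions of this example.

```python
def correlate(t, p):
    """c[i] = sum_j p[j] * t[i+j]; a naive stand-in for an FFT-based routine."""
    n, m = len(t), len(p)
    return [sum(p[j] * t[i + j] for j in range(m)) for i in range(n - m + 1)]


def wildcard_matches(t, p):
    """t, p: lists of integers with 0 = wildcard and positive values = symbols."""
    c1 = correlate(t, [x ** 3 for x in p])                    # sum p_j^3 t_{i+j}
    c2 = correlate([x ** 2 for x in t], [x ** 2 for x in p])  # sum p_j^2 t_{i+j}^2
    c3 = correlate([x ** 3 for x in t], p)                    # sum p_j t_{i+j}^3
    return [i for i in range(len(t) - len(p) + 1)
            if c1[i] - 2 * c2[i] + c3[i] == 0]
```

Each summand $p_j t_{i+j}(p_j - t_{i+j})^2$ is non-negative and vanishes exactly when one side is a wildcard or the two symbols agree, which is why a zero total certifies a match.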
4 k-Mismatch string matching with wildcards
We now turn to the problem of finding all locations $i$ in the text $t$ satisfying
$HD(i) \le k$, where $HD(i)$ is the number of mismatches, ignoring wildcards,
between $p$ and $t_i t_{i+1} \cdots t_{i+m-1}$. We are furthermore required to output a list for
each such $i$ of the positions of the $HD(i)$ mismatches. In this section we present
a solution due to Clifford et al. [3].
The first step is to convert everything to an alphabet that is a finite field
of characteristic 2. The field should be large enough so that we can use its
elements both as indices in $p$ and $t$, and as the alphabet. If we use the trick
mentioned in Section 2 and only work with segments of $t$ of length $2m$, we will
need $2m$ elements for indices and at most $2m$ elements for the alphabet, so we
can use the field $GF(2^{\lceil \lg(4m) \rceil})$. Thus the field is of order $O(m)$ and we can do
multiplication and addition in constant time using an exponential representation
after computing a Zech's logarithm table once in $O(m)$ time. In the notation,
we will assume the conversion to finite field elements has already taken place,
and that the input length is a power of 2.
The algorithm works by first assuming that at each position $i$ in the text,
$HD(i) \le k$. Then a system of equations is set up for each $i$ whose solutions
are $HD(i)$ and the locations of the mismatches at $i$. This step is done in one
go for all $i$ using fast convolution computations. Next, for each $i$ a polynomial
is computed whose roots are the reciprocals of the solutions to the system of
equations, and then the roots are found. Finally, we check for which $i$ we really
have $HD(i) \le k$ and discard what we computed for the others.
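The exponential representation mentioned above can be sketched as follows in Python. The sketch assumes the field is built as $GF(2)[x]$ modulo a primitive polynomial supplied by the caller as a bit mask (for example `0b10011`, i.e. $x^4 + x + 1$, for $GF(16)$), with generator $\alpha = x$; the function names and the use of `None` for the zero element are choices made for this illustration. For a pattern of length $m$ the exponent would be $e = \lceil \lg(4m) \rceil$.

```python
def build_tables(e, prim_poly):
    """Exp, log and Zech tables for GF(2^e) with generator alpha = x.

    prim_poly is assumed to be a primitive polynomial of degree e, as a bit mask.
    """
    q = 1 << e
    exp = [0] * (q - 1)          # exp[n] = alpha^n as a bit vector
    log = [None] * q             # log[alpha^n] = n; log[0] stays None
    x = 1
    for n in range(q - 1):
        exp[n] = x
        log[x] = n
        x <<= 1                  # multiply by alpha
        if x & q:                # reduce modulo the primitive polynomial
            x ^= prim_poly
    # Zech's logarithm: alpha^zech[n] = 1 + alpha^n (None when the sum is 0).
    zech = [log[1 ^ exp[n]] for n in range(q - 1)]
    return exp, log, zech


def mul_exp(a, b, q):
    """Multiply alpha^a by alpha^b; the result is returned as an exponent."""
    return (a + b) % (q - 1)


def add_exp(a, b, zech, q):
    """Add alpha^a and alpha^b using alpha^a * (1 + alpha^(b - a))."""
    z = zech[(b - a) % (q - 1)]
    return None if z is None else (a + z) % (q - 1)
```

Building the tables takes time linear in the field size, after which each multiplication or addition of nonzero elements is a table lookup plus a modular addition.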
For each $i$, this is the system of equations:
\[
\begin{aligned}
r_{1,i} + r_{2,i} + \cdots + r_{k',i} &= s_{0,i} \\
r_{1,i} x_1 + r_{2,i} x_2 + \cdots + r_{k',i} x_{k'} &= s_{1,i} \\
r_{1,i} x_1^2 + r_{2,i} x_2^2 + \cdots + r_{k',i} x_{k'}^2 &= s_{2,i} \\
&\ \,\vdots \\
r_{1,i} x_1^{2k} + r_{2,i} x_2^{2k} + \cdots + r_{k',i} x_{k'}^{2k} &= s_{2k,i},
\end{aligned}
\tag{2}
\]
where $r_{j,i}$ is the difference between the pattern and the text at the $j$th mismatch
between $p$ and $t_i \cdots t_{i+m-1}$, $x_j$ is the position of the $j$th mismatch, and
$k' = HD(i)$. The $s$'s can be found using
\[ s_{l,i} = \sum_{j=0}^{m-1} (p_j - t_{i+j}) \, (i + j)^l \, p'_j \, t'_{i+j}, \]
where $p'_j$ is 0 if $p_j$ is a wildcard and 1 otherwise, and similarly for $t'_{i+j}$.
After expanding $s_{l,i}$, it can be seen that the sequence $s_{l,0}, \ldots, s_{l,n-m}$ can be
computed by a convolution, in a similar manner to expressions in earlier
sections. The FFT has complications for fields of characteristic 2 because $1 + 1$
is not a unit, so instead a variant of the fast integer multiplication algorithm
of Schönhage and Strassen [5] is used for these convolutions, which has
running time $O(n \log n \log\log n)$; this becomes $O(n \log m \log\log m)$ by working
in overlapping size-$2m$ chunks.
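The definition of $s_{l,i}$ can be checked on small inputs with the Python sketch below. It evaluates the sum directly for one alignment, in $O(mk)$ field operations, rather than by the fast convolutions described above. Several things are assumptions of this example and not taken from [3]: the field is $GF(16)$ with modulus $x^4 + x + 1$, the wildcard is the element 0, symbols are nonzero field elements, and text position $q$ is encoded as the nonzero field element $q + 1$ so that every position has an inverse.

```python
E = 4            # assumed field GF(2^4)
POLY = 0b10011   # assumed modulus x^4 + x + 1


def gf_mul(a, b):
    """Multiply two elements of GF(2^E) (shift-and-add with reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> E:
            a ^= POLY
    return r


def syndromes(t, p, i, k, wildcard=0):
    """Return [s_0, ..., s_2k] for alignment i, straight from the definition."""
    s = [0] * (2 * k + 1)
    for j in range(len(p)):
        if p[j] == wildcard or t[i + j] == wildcard:
            continue                      # p'_j * t'_{i+j} = 0
        diff = p[j] ^ t[i + j]            # subtraction is XOR in characteristic 2
        x = i + j + 1                     # assumed encoding of text position i + j
        power = 1                         # x^l, starting with l = 0
        for l in range(2 * k + 1):
            s[l] ^= gf_mul(diff, power)
            power = gf_mul(power, x)
    return s
```

For t = 3121221 and p = 121, alignment 3 has one mismatch (difference 3 at position 5, encoded as 6), so `syndromes(t, p, 3, 1)` gives the powers-of-6 sequence scaled by 3, while the exact match at alignment 1 gives all zeros.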
Now we must solve the equations for each $i$. Let us fix an $i$ now, and
henceforth write $s_l$ for $s_{l,i}$. First we note that if we knew the $k'$ roots of the
polynomial
\[ P(z) = \prod_{i=1}^{k'} (z x_i - 1), \]
we would have our solution. For now, we will just compute the coefficients of that
polynomial quickly, and it turns out we can do so with the Berlekamp-Massey
algorithm. We will speak of the Berlekamp-Massey algorithm on polynomials
over a finite field, which returns the reciprocal of the minimal polynomial of the
sequence of coefficients of the input polynomial (footnote 1).
Lemma 1. If the Berlekamp-Massey algorithm takes as input the polynomial
$F(z)$ over $GF(q)$, it returns the polynomial $H(z)$ of minimal degree over
$GF(q)$ such that $F(z)H(z) \bmod z^{\deg F(z)+1}$ has degree less than the degree of
$H(z)$.
Proof. Let $\hat{h}$ and $\hat{f}$ be the degrees of $H(z) = \sum_{j=0}^{\hat{h}} h_j z^j$ and
$F(z) = \sum_{j=0}^{\hat{f}} f_j z^j$. Let $R(z) = \sum_{j=0}^{\hat{h}} r_j z^j$ be the reciprocal of $H(z)$,
and let $F(z)H(z) = G(z) = g_0 + g_1 z + \cdots$.
It suffices to show that $G(z) \bmod z^{\hat{f}+1}$ has degree less than $\hat{h}$ if and only if
$R(z)$ is a characteristic polynomial of $F(z)$. Now
$G(z) \bmod z^{\hat{f}+1} = \sum_{j=0}^{\hat{f}} g_j z^j$, so we must show that
$\sum_{j=\hat{h}}^{\hat{f}} g_j z^j = 0$ iff $R(z)$ is a characteristic polynomial of
$f_0, f_1, \ldots, f_{\hat{f}}$. But for $j$ in $[\hat{h}..\hat{f}]$,
\[ g_j = \sum_{i=0}^{\hat{h}} h_{\hat{h}-i} f_{j-\hat{h}+i} = \sum_{i=0}^{\hat{h}} r_i f_{j-\hat{h}+i}, \]
which are all zero exactly when $R(z)$ is a characteristic polynomial of
$f_0, f_1, \ldots, f_{\hat{f}}$.
Theorem 1. The Berlekamp-Massey algorithm on $F(z) = \sum_{l=0}^{2k} s_l z^l$ gives
$P(z)$.
Proof. We prove this by showing that $\{s_l\}_l$ has linear span at least $k'$, then
showing that $F(z)P(z) \bmod z^{2k+1}$ has degree less than $k'$. The theorem follows
from Lemma 1.
First, assume for the purposes of contradiction that there is a linear
recurrence relation $\{a_i\}_{i=0}^{k'-d}$, $d > 0$, such that $a_0, a_{k'-d} \ne 0$ and
$\sum_{i=0}^{k'-d} a_i s_{i+j} = 0$ for every $j \in [0..2k - k' + d]$. If so, wlog $d = 1$
(footnote 2) and we can write
\[ \sum_{i=0}^{k'-1} a_i s_{i+j} = \sum_{i=0}^{k'-1} a_i \sum_{l=1}^{k'} r_l x_l^{i+j} = 0. \]
In matrix form, this can be written as $aM = 0$, where $M$ is the $k'$ by $k'$ matrix
given by $M_{i,j} = \sum_{v=1}^{k'} r_v x_v^{i+j}$ and $a = (a_0, a_1, \ldots, a_{k'-1})$. Now $M = V^T D V$,
where $V$ is the Vandermonde matrix with $V_{i,j} = x_i^j$ and $D$ is a diagonal
matrix with the $r_l$ on the diagonal. Since the $r_l$ are nonzero and the $x_l$ are
distinct, we have $\det(M) = \det(V^T D V) = \det(D)\det^2(V) \ne 0$, but that means
$a = 0$, which contradicts $a_0 \ne 0$.

Footnote 1: The algorithm must also be told the intended degree of the input
polynomial since the sequence may end with a zero.
Footnote 2: If $M(x)$ is a characteristic polynomial for a sequence, so is $xM(x) + M(x)$.
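Now we show that $F(z)P(z) \bmod z^{2k+1}$ has degree less than $k'$. We have
\[ F(z) = \sum_{l=0}^{2k} \left( \sum_{i=1}^{k'} r_i x_i^l \right) z^l. \]
We change the order of summation to get
\[ F(z) = \sum_{i=1}^{k'} r_i \sum_{l=0}^{2k} (x_i z)^l. \]
Thus
\[ F(z)P(z) = \sum_{i=1}^{k'} r_i \left( \sum_{l=0}^{2k} (x_i z)^l \prod_{j=1}^{k'} (z x_j - 1) \right). \]
Now we just substitute the following from the geometric series formula
\[ \sum_{l=0}^{2k} (x_i z)^l = \frac{(z x_i)^{2k+1} - 1}{z x_i - 1} \]
and get
\[ F(z)P(z) = \sum_{i=1}^{k'} r_i \left( (z x_i)^{2k+1} - 1 \right) \prod_{j=1,\, j \ne i}^{k'} (z x_j - 1). \]
Finally, taking both sides mod $z^{2k+1}$ (and recalling that $-1 = 1$ in characteristic 2)
we have
\[ F(z)P(z) \bmod z^{2k+1} = \sum_{i=1}^{k'} r_i \prod_{j=1,\, j \ne i}^{k'} (z x_j - 1). \]
Therefore $F(z)P(z) \bmod z^{2k+1}$ is a polynomial of degree at most $k' - 1$ and the
theorem is proved.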
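Theorem 1 can be observed numerically with a textbook Berlekamp-Massey routine. The Python sketch below is not the implementation of [3]; it is the standard Massey formulation specialised to characteristic 2 (addition is XOR), over an assumed field $GF(16)$ with modulus $x^4 + x + 1$, and all function names are hypothetical. It returns the connection polynomial $C(z) = 1 + c_1 z + \cdots + c_L z^L$, which for a sequence $s_l = \sum_i r_i x_i^l$ equals $\prod_i (z x_i - 1) = P(z)$.

```python
E = 4            # assumed field GF(2^4)
POLY = 0b10011   # assumed modulus x^4 + x + 1


def gf_mul(a, b):
    """Multiply two elements of GF(2^E)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> E:
            a ^= POLY
    return r


def gf_inv(a):
    """Inverse of a nonzero element, computed as a^(2^E - 2)."""
    result = 1
    for _ in range((1 << E) - 2):
        result = gf_mul(result, a)
    return result


def berlekamp_massey(s):
    """Shortest C with s[n] + sum_{i=1..L} C[i] * s[n-i] = 0 for all n >= L."""
    C, B = [1], [1]      # current and previous connection polynomials
    L, m, b = 0, 1, 1    # current length, shift since B was saved, old discrepancy
    for n in range(len(s)):
        d = s[n]         # discrepancy between s[n] and the current prediction
        for i in range(1, min(L, len(C) - 1) + 1):
            d ^= gf_mul(C[i], s[n - i])
        if d == 0:
            m += 1
            continue
        coef = gf_mul(d, gf_inv(b))
        T = C[:]
        C = C + [0] * (len(B) + m - len(C))   # no-op if C is already long enough
        for idx, bi in enumerate(B):          # C(z) += coef * z^m * B(z)
            C[idx + m] ^= gf_mul(coef, bi)
        if 2 * L <= n:
            L, B, b, m = n + 1 - L, T, d, 1
        else:
            m += 1
    return (C + [0] * (L + 1 - len(C)))[:L + 1]
```

As a usage example, two mismatches with differences $r = (3, 5)$ at encoded positions $x = (6, 2)$ and $k = 2$ give the syndromes $[6, 0, 14, 13, 5]$, and the routine returns $[1, 4, 12]$, the coefficients of $(6z - 1)(2z - 1)$ over this field.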
Once we get the polynomial $P(z) = \prod_{i=1}^{k'} (z x_i - 1)$, we need an algorithm
for finding the roots, which is the information we need to get the $x$'s. Since
the polynomial is over a field of small characteristic we can use the factoring
algorithm given in [6], which gives the linear factors in $O(k \log^2 k \log^2 m)$
time.
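For small fields, the roots can also be found by simply trying every candidate position, which is a convenient way to test the earlier steps. The Python sketch below uses this exhaustive scan (in the spirit of a Chien search) in place of the factoring algorithm of [6]; it costs $O(mk)$ per alignment and so does not achieve the stated bound. The field $GF(16)$ with modulus $x^4 + x + 1$ and the encoding of position $q$ as the element $q + 1$ are assumptions of this example.

```python
E = 4            # assumed field GF(2^4)
POLY = 0b10011   # assumed modulus x^4 + x + 1


def gf_mul(a, b):
    """Multiply two elements of GF(2^E)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> E:
            a ^= POLY
    return r


def mismatch_positions(P, num_positions):
    """Positions q whose encoding x = q + 1 satisfies P(1/x) = 0.

    P is a coefficient list [P_0, P_1, ..., P_d], constant term first.
    """
    found = []
    for q in range(num_positions):
        x = q + 1
        acc = 0
        for c in P:                      # Horner: sum_j P_j * x^(d-j) = x^d * P(1/x)
            acc = gf_mul(acc, x) ^ c
        if acc == 0:
            found.append(q)
    return found
```

Evaluating $x^d P(1/x)$ instead of $P$ avoids computing inverses: its roots are the encoded positions themselves rather than their reciprocals.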
We have been assuming that at position $i$ in $t$, $HD(i) \le k$. We now do
a procedure that checks for each $i$ whether the mismatches at $i$ found by the
previous steps are really all the mismatches at $i$. If they are, then $HD(i) \le k$,
but if there are more, $HD(i) > k$ because otherwise the procedure would have
found the correct number.
First, using integers instead of elements of a finite field, for each $i$ we compute
\[ D[i] = \sum_{j=0}^{m-1} (p_j - t_{i+j})^2 \, p'_j \, t'_{i+j}, \]
which is the sum of the squares of the differences between the pattern and the
text at offset $i$. To compare with this, for each $i$, we sum the squares of the
differences between the pattern and the text just at each mismatch location
returned by the procedures above. If the second value is less, then we know we
don't have a k-match.
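The verification step can be sketched in Python as follows. This is an illustration under stated assumptions: $D[i]$ is computed here by a direct loop for a single alignment, whereas the report computes it for all $i$ at once by convolutions; the wildcard is taken to be the integer 0; and the reported mismatches are assumed to be given as absolute text positions. The function name is hypothetical.

```python
def all_mismatches_found(t, p, i, reported, wildcard=0):
    """True iff the reported text positions account for every mismatch at alignment i."""
    # D[i]: sum of squared differences over all non-wildcard pairs.
    full = sum((p[j] - t[i + j]) ** 2
               for j in range(len(p))
               if p[j] != wildcard and t[i + j] != wildcard)
    # Same sum restricted to the positions returned by the root-finding step.
    partial = sum((p[q - i] - t[q]) ** 2 for q in reported)
    return partial == full
```

Because every squared difference at a genuine mismatch is positive, the partial sum falls short of $D[i]$ exactly when some mismatch was not reported, which is the case $HD(i) > k$.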
Algorithm 1 k-mismatch with wildcards
1. Encode the input as finite field elements
2. Set up the systems of equations for the mismatch positions
for $i = 0$ to $n - m$ do
    3.1. Compute $P(z) = \prod_{j=1}^{k'} (z x_j - 1)$
    3.2. Find the roots of $P(z)$
end for
4. Encode the input as integers
5. Verify that all mismatches are found for each position $i$
Theorem 2. Algorithm 1 has time complexity $O(nk \log^2 m (\log^2 k + \log\log m))$.
Proof. Step 1 involves a constant number of sorts and a linear traversal, which
can be done in $O(n \log n + m)$ time. Step 2 takes time
$(2k+1) \cdot O(n \log m \log\log m) = O(kn \log m \log\log m)$ as discussed. Step 3.1 takes
$O(k \log^2 k)$ time and step 3.2 takes $O(k \log^2 k \log^2 m)$ time, so the whole for loop
takes $O(nk \log^2 k \log^2 m)$ time. Step 4 takes no more time than step 1. For step 5,
computing $D$ takes $O(n \log m)$ time since it can be done by convolutions of complex
numbers, and then there are $O(nk)$ additions required for computing the comparison
quantities.
The total running time is dominated by setting up the equations and finding
the roots of the polynomials; combining these times gives a running time of
$O(nk \log^2 m (\log^2 k + \log\log m))$.
References
[1] Peter Clifford and Raphaël Clifford. Simple deterministic wildcard matching.
Inf. Process. Lett., 101:53-54, January 2007.
[2] Raphaël Clifford, Klim Efremenko, Ely Porat, and Amir Rothschild.
K-mismatch with don't cares. In Proceedings of the 15th annual European
conference on Algorithms, ESA '07, pages 151-162, Berlin, Heidelberg, 2007.
Springer-Verlag.
[3] Raphaël Clifford, Klim Efremenko, Ely Porat, and Amir Rothschild. From
coding theory to efficient pattern matching. In Proceedings of the twentieth
Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '09, pages
778-784, Philadelphia, PA, USA, 2009. Society for Industrial and Applied
Mathematics.
[4] Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern
matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977.
[5] A. Schönhage. Schnelle Multiplikation von Polynomen über Körpern der
Charakteristik 2. Acta Inform., 7:395-398, 1976.
[6] Victor Shoup. A fast deterministic algorithm for factoring polynomials over
finite fields of small characteristic. In Proceedings of the 1991 international
symposium on Symbolic and algebraic computation, ISSAC '91, pages 14-21,
New York, NY, USA, 1991. ACM.
[7] Michael Sipser. Introduction to the theory of computation. SIGACT News,
27:27-29, March 1996.
Appendix
Implementations
The first two functions needed are fou and invFou. These simply take a list of complex numbers and return the
Fourier transform and the inverse Fourier transform of the list, respectively. They use Mathematica's implementation
of the Fast Fourier Transform, the Fourier function. Since Fourier is numerical, the function Chop is called on the
result, which removes very small numbers that sometimes show up erroneously in the results, like $1.241235 \times 10^{-16}$.
(* FourierParameters -> {1, 1} leaves the forward transform unnormalized, as needed for polynomial products *)
fou[list_?VectorQ] := Fourier[list, FourierParameters -> {1, 1}] // Chop;
invFou[list_?VectorQ] := InverseFourier[list, FourierParameters -> {1, 1}] // Chop;

Next is the polyMultiply function, which multiplies two polynomials with complex coefficients. The input is given
as coefficient lists starting with the constant term. They are multiplied using Fourier transforms implemented by fou
and invFou.
polyMultiply[largerList_?VectorQ, smallerList_?VectorQ] :=
  Module[
    {n = Length[largerList]},
    (* pad both lists to length 2 n so the cyclic product equals the ordinary product *)
    invFou[
      fou[PadRight[largerList, 2 n]] fou[PadRight[smallerList, 2 n]]]
  ];

The function conv computes the convolution of two lists of complex numbers, i.e. if the lists are t and p, the
function conv returns the list with $i$th element $\sum_{j=0}^{m-1} p_j t_{i+j}$, for $0 \le i \le n - m$. The method used is the
polynomial multiplication method described in Section 2, using the polyMultiply function.
conv[largerList_?VectorQ, smallerList_?VectorQ] :=
  Module[
    {n = Length[largerList],
     m = Length[smallerList]},
    (* the coefficient of x^(m-1+i) sits at 1-based index m+i, so keep parts m through n *)
    polyMultiply[largerList, Reverse[smallerList]][[m ;; n]]
  ];

We can now implement the string matching algorithm of Section 2. The matches function does this by first
calling ToCharacterCode to convert the strings to lists of integers t and p, then storing the partial sums of $\{t_k^2\}_k$ in
the variable tSquaredTable, then using the conv function to compute the convolution of t and p, then finding the
total of $\{p_k^2\}_k$ and combining the results into a list whose $i$th element is 0 iff there is an occurrence of p at position $i$
in t.
matches[tString_, pString_] :=
  Module[
    {
      t = ToCharacterCode[tString],
      p = ToCharacterCode[pString],
      tSquaredTable,
      m = StringLength[pString],
      n = StringLength[tString]
    },
    tSquaredTable = Accumulate[t^2];
    (* part 0 is the expression's head; overwriting it with 0 makes tSquaredTable[[0]] evaluate to 0 *)
    tSquaredTable[[0]] = 0;
    Total[p^2] - 2 conv[t, p] + Table[
      tSquaredTable[[i + m - 1]] - tSquaredTable[[i - 1]],
      {i, 1, n - m + 1}] // Chop
  ];

Finally, the function matchesWithWildcards implements the algorithm from Section 3. The character * is used as
the wildcard. Numerical issues necessitate replacing very small numbers in the output with 0.
matchesWithWildcards[tString_, pString_] :=
  Module[
    {
      (* character code 42 is "*", the wildcard, which is mapped to 0 *)
      t = ToCharacterCode[tString] /. {42 -> 0},
      p = ToCharacterCode[pString] /. {42 -> 0}
    },
    Chop[conv[t, p^3] - 2 conv[t^2, p^2] + conv[t^3, p], 10^-4]
  ];
Demonstrations
The text we will search is the first two paragraphs of Lord Byron's poem "To George, Earl Delawarr".
text = "Oh! yes, I will own we were dear to each other;
The friendships of childhood, though fleeting, are true;
The love which you felt was the love of a brother,
Nor less the affection I cherish'd for you.

But Friendship can vary her gentle dominion;
The attachment of years, in a moment expires:
Like Love, too, she moves on a swift-waving pinion,
But glows not, like Love, with unquenchable fires.";
The pattern will be the word pinion, which occurs once in the text.
pattern = "pinion"
The list $a_0, a_1, \ldots, a_{n-m}$ is returned by the function matches.
a = matches[text, pattern];
The Position function finds the index in a where there is a 0:
Position[a, 0]
{{337}}
To verify that 337 is the correct index, we find the length of all the text that comes before the word pinion:
Length[ToCharacterCode["Oh! yes, I will own we were dear to each other;
The friendships of childhood, though fleeting, are true;
The love which you felt was the love of a brother,
Nor less the affection I cherish'd for you.

But Friendship can vary her gentle dominion;
The attachment of years, in a moment expires:
Like Love, too, she moves on a swift-waving "]]
336
Arrays are 1-based (i.e. not 0-based) in Mathematica, so the position of an occurrence is
1 + the length of the text before it. Thus the position of pinion is indeed 336 + 1 = 337. Now let's do a wildcard
search. There are a number of occurrences of *ove* in the text, so let's find them:
Position[matchesWithWildcards[text, "*ove*"], 0]
{{110}, {138}, {298}, {313}, {365}}
Indeed, there are five occurrences (love, love, Love, moves, Love), and when does the first occur?
Length[ToCharacterCode["Oh! yes, I will own we were dear to each other;
The friendships of childhood, though fleeting, are true;
The "]] + 1
110