
Demonstration of Exact String Matching Algorithms using CUDA

Author List
Raymond Tay (Autodesk, formerly Linden Lab)

Summation
In this chapter, the author presents a demonstration application of three commonly used exact
string matching algorithms implemented with NVIDIA CUDA technology: Brute-force, QuickSearch
and Horspool. The author applies known CUDA techniques to implement, test and, where
applicable, optimize them. The main challenge the author faced was mapping CUDA's threading
and memory model onto algorithms that are normally designed to execute on a single CPU core.
Through this effort, the author hopes to demonstrate the power of CUDA to the budding GPU
developer.

Introduction, Problem Statement, and Context


String matching is a very important subject in the wider domain of text processing.
String-matching algorithms are basic components of practical software found under most
operating systems. String matching consists of finding one or more occurrences of a pattern in
a body of text. All the algorithms in this work locate all occurrences of the pattern in the
text body, aided by GPU acceleration. The algorithms were tested with patterns both shorter and
longer than the alphabet size. The pattern is denoted by x=x[0..m-1], where m is its length;
the text is denoted by y=y[0..n-1], where n is its length. The alphabet is the set of all
symbols used to represent the text and pattern strings; it is denoted by ∑ and its size by ∂
(e.g. for binary strings, ∑={0,1} and ∂=2).
The author is aware of the wide applicability of string matching algorithms, which ranges from
text editors and the popular Unix tool grep to virus scanning and locating DNA sequences. The
author believes that the techniques devised here can be leveraged on current mid-range
workstations, as they normally come equipped with CUDA/OpenCL-enabled graphics cards.

Core Method
The methods applied during development include the following:

1) Find ways to parallelize the sequential code

2) Minimize data transfer between the host and device

3) Coalesce global memory accesses as much as possible

4) Avoid branch divergence within a CUDA warp

For every algorithm, the work revolves around getting each CUDA thread to scan the text and
locate a match; when a thread finds a match, it updates a data structure that records the
position where the pattern was found. The data structures needed by the CUDA threads are
provided by the CUDA kernel.

Algorithms, Implementations, and Evaluations


Brute-force
The sequential form consists of a function, BF (short for BruteForce), which attempts to match
the pattern against the text by scanning the text from left to right. In the sequential code, a
single thread conducts the search, and whenever it finds a match the algorithm prints the
position at which it was found to the console.
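
The following is a minimal C sketch of a sequential brute-force scan of the kind described
above; it is not the author's listing. The function name bf_search and the out/cap parameters
are assumptions made for illustration, and matches are stored in a caller-supplied array
rather than printed so that the result can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential brute-force search (illustrative sketch, hypothetical name).
 * Scans text y[0..n-1] from left to right for pattern x[0..m-1] and stores
 * up to `cap` match positions in `out`. Returns the total number of
 * matches found, which may exceed `cap`. */
size_t bf_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *out, size_t cap)
{
    size_t count = 0;
    assert(m > 0);
    for (size_t j = 0; j + m <= n; ++j) {      /* every alignment of x in y */
        if (memcmp(x, y + j, m) == 0) {        /* compare the whole window */
            if (count < cap) out[count] = j;
            ++count;
        }
    }
    return count;
}
```

For example, searching for "aba" in "ababa" with this sketch reports the overlapping matches at
positions 0 and 2.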

In the CUDA version, N threads can conduct the same search. Each of the N threads scans the
text for a match, in parallel, and when a thread discovers a match it updates a data structure
that stores the found indices.
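
One way to picture the parallel version is a host-side C emulation in which a loop stands in
for the N CUDA threads. This is a sketch under assumptions, not the author's kernel: the names
bf_thread and bf_launch_emulated are hypothetical, the data structure for the found indices is
assumed to be a flag array with one entry per text position, and each logical thread is assumed
to test exactly one alignment. In a real kernel the thread id would typically be derived from
blockIdx.x * blockDim.x + threadIdx.x.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Work of one logical thread (hypothetical): test the single alignment `tid`
 * and record the outcome in the thread's own slot of `found`. */
static void bf_thread(size_t tid, const char *x, size_t m,
                      const char *y, size_t n, unsigned char *found)
{
    if (tid + m > n) return;          /* window would run past the text */
    found[tid] = (memcmp(x, y + tid, m) == 0);
}

/* Host-side emulation of a launch: the loop plays the role of N = n threads.
 * `found` must hold n entries and be zero-initialised by the caller. */
void bf_launch_emulated(const char *x, size_t m, const char *y, size_t n,
                        unsigned char *found)
{
    assert(m > 0);
    for (size_t tid = 0; tid < n; ++tid)
        bf_thread(tid, x, m, y, n, found);
}
```

Because every logical thread writes only its own slot, this layout needs no atomic operations;
a later compaction pass could turn the flags into a list of indices.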

The source code for the sequential and parallelized (CUDA) versions is shown below for
illustration purposes.

Illustration 1: Sequential Brute Force

In the worst case, when occurrences of the pattern follow one another throughout the string,
each CUDA thread reads every character of its window and obtains a match; this translates to
(N*m) bytes of data being read. Under the same assumption, each CUDA thread potentially writes
at most n/m times; in general, however, the text and pattern could be completely random.

Illustration 2: CUDA Brute Force

QuickSearch
The sequential QuickSearch is a variant of the popular Boyer-Moore algorithm that does not
suffer from sub-optimal performance when matching patterns drawn from small alphabets such as
DNA.

In the classic QuickSearch, the inventor of the algorithm dropped the “good suffix shift” (aka
“matching shift”) computation in favour of the “bad-character shift” (aka “occurrence shift”)
computation. The algorithm precomputes the “bad-character shift” for the pattern and then uses
the result to aid its search for the pattern in the text body.
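
As an illustration of the precomputation and search just described, here is a minimal C sketch
following the textbook formulation of Quick Search; it is not the author's listing. The names
qs_pre_bc and qs_search, the 8-bit alphabet size ASIZE and the out/cap parameters are
assumptions made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ASIZE (UCHAR_MAX + 1)   /* assumed alphabet: all 8-bit symbols */

/* Bad-character (occurrence) shift: symbols absent from the pattern shift
 * by m + 1; otherwise the shift is the distance from the symbol's rightmost
 * occurrence to one position past the end of the pattern. */
void qs_pre_bc(const unsigned char *x, size_t m, size_t qs_bc[ASIZE])
{
    for (size_t c = 0; c < ASIZE; ++c) qs_bc[c] = m + 1;
    for (size_t i = 0; i < m; ++i)     qs_bc[x[i]] = m - i;
}

/* Stores up to `cap` match positions in `out`; returns the match count. */
size_t qs_search(const unsigned char *x, size_t m,
                 const unsigned char *y, size_t n,
                 size_t *out, size_t cap)
{
    size_t qs_bc[ASIZE], count = 0, j = 0;
    assert(m > 0);
    qs_pre_bc(x, m, qs_bc);
    while (j + m <= n) {
        if (memcmp(x, y + j, m) == 0) {
            if (count < cap) out[count] = j;
            ++count;
        }
        if (j + m >= n) break;       /* no symbol follows the last window */
        j += qs_bc[y[j + m]];        /* shift on the symbol just past the window */
    }
    return count;
}
```

Note that the shift depends only on the text symbol immediately after the current window, not
on whether the window matched.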

In the CUDA version, the classic QuickSearch has been reorganized so that the “bad-character
shift” computation is parallelized. For the scanning code, a “skipping distance” data structure
is pre-computed for use by the CUDA kernel: it is a 1D array that contains the skipping
distances regardless of a match or mismatch, and each valid element is a CUDA thread's id. In
the CUDA kernel, a thread executes the scanning code only if it can locate its id in this
“skipping distance” data structure.
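
The description above leaves the exact layout of the “skipping distance” array open. One
possible reading, sketched here in host-side C, is that the alignments visited by the Quick
Search shift chain are marked in advance and that only the threads whose id is marked perform
a comparison. Everything in this sketch is an assumption for illustration: the names
qs_mark_alignments and qs_scan_emulated, the flag-array representation, and the loop that
stands in for the CUDA threads.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ASIZE (UCHAR_MAX + 1)   /* assumed alphabet: all 8-bit symbols */

/* Step 1: walk the shift chain once and mark every alignment it visits.
 * This works because the Quick Search shift ignores match or mismatch.
 * `visit` must hold n zero-initialised entries. */
void qs_mark_alignments(const unsigned char *x, size_t m,
                        const unsigned char *y, size_t n,
                        unsigned char *visit)
{
    size_t shift[ASIZE];
    assert(m > 0);
    for (size_t c = 0; c < ASIZE; ++c) shift[c] = m + 1;
    for (size_t i = 0; i < m; ++i)     shift[x[i]] = m - i;
    for (size_t j = 0; j + m <= n; j += shift[y[j + m]]) {
        visit[j] = 1;                  /* logical thread j has work to do */
        if (j + m >= n) break;         /* no symbol follows the last window */
    }
}

/* Step 2 (emulated kernel): one logical thread per text position; a thread
 * compares its window only if its id was marked in step 1.
 * `found` must hold n zero-initialised entries. */
void qs_scan_emulated(const unsigned char *x, size_t m,
                      const unsigned char *y, size_t n,
                      const unsigned char *visit, unsigned char *found)
{
    for (size_t tid = 0; tid < n; ++tid) {
        if (!visit[tid]) continue;     /* id not in the table: thread idles */
        found[tid] = (memcmp(x, y + tid, m) == 0);
    }
}
```

Under this reading the comparisons run in parallel while the shift chain itself is still walked
in order, so the sketch illustrates the division of work rather than a tuned implementation.
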
The source code for the sequential and CUDA versions of QuickSearch is presented below:

Illustration 3: Sequential QuickSearch

Illustration 4: CUDA QuickSearch


Horspool
In the classic Horspool algorithm, the implementation uses the bad-character shift computation
alone, and it is not very efficient when the pattern is shorter than the alphabet size, i.e.
m < ∂.

The “bad-character shift” computation is the same as the one shown in the sequential
QuickSearch.
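
For reference, a minimal C sketch of sequential Horspool in its textbook formulation follows;
it is not the author's listing, which may differ. In the textbook form the shift table follows
the same idea as Quick Search's but differs in detail: the default shift is m rather than m + 1,
and the last pattern character is excluded when the table is filled. The names hp_pre_bc and
hp_search, the alphabet size ASIZE and the out/cap parameters are assumptions.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ASIZE (UCHAR_MAX + 1)   /* assumed alphabet: all 8-bit symbols */

/* Bad-character shift for Horspool: default m; for each symbol in
 * x[0..m-2], the distance from its rightmost occurrence to the last
 * pattern position. */
void hp_pre_bc(const unsigned char *x, size_t m, size_t bc[ASIZE])
{
    for (size_t c = 0; c < ASIZE; ++c)  bc[c] = m;
    for (size_t i = 0; i + 1 < m; ++i)  bc[x[i]] = m - 1 - i;
}

/* Stores up to `cap` match positions in `out`; returns the match count. */
size_t hp_search(const unsigned char *x, size_t m,
                 const unsigned char *y, size_t n,
                 size_t *out, size_t cap)
{
    size_t bc[ASIZE], count = 0, j = 0;
    assert(m > 0);
    hp_pre_bc(x, m, bc);
    while (j + m <= n) {
        unsigned char c = y[j + m - 1];          /* last symbol of the window */
        if (x[m - 1] == c && memcmp(x, y + j, m - 1) == 0) {
            if (count < cap) out[count] = j;
            ++count;
        }
        j += bc[c];                              /* shift on that same symbol */
    }
    return count;
}
```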

In the CUDA version, the approach the author has taken is very similar to the CUDA version of
QuickSearch: the “bad-character shift” computation is parallelized, and for the scanning code a
“skipping distance” data structure (a 1D array that contains the skipping distances regardless
of a match or mismatch, where each valid element is a CUDA thread's id) is pre-computed for use
by the CUDA kernel. In the CUDA kernel, a thread executes the scanning code only if it can
locate its id in the “skipping distance” data structure.
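
The text does not say how the “bad-character shift” computation is parallelized. One common
way to make the table build independent of execution order is to let each pattern position
propose a shift and keep the minimum per symbol, since the rightmost occurrence always yields
the smallest shift. The host-side C sketch below emulates that idea for the textbook Horspool
table; the names bc_thread and hp_pre_bc_parallel_style are hypothetical, and on a GPU the
min-update would have to be atomic (for example with CUDA's atomicMin on an integer table).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define ASIZE (UCHAR_MAX + 1)   /* assumed alphabet: all 8-bit symbols */

/* Work of one logical thread: pattern position i proposes the shift
 * m - 1 - i for symbol x[i]; the smallest proposal wins. A plain
 * compare-and-store suffices here only because the emulation is serial. */
static void bc_thread(size_t i, const unsigned char *x, size_t m, size_t *bc)
{
    size_t proposal = m - 1 - i;
    if (proposal < bc[x[i]]) bc[x[i]] = proposal;
}

void hp_pre_bc_parallel_style(const unsigned char *x, size_t m,
                              size_t bc[ASIZE])
{
    assert(m > 0);
    /* Phase 1: one logical thread per alphabet symbol sets the default. */
    for (size_t c = 0; c < ASIZE; ++c) bc[c] = m;
    /* Phase 2: one logical thread per pattern position 0..m-2. The loop
     * deliberately runs right to left to show that order does not matter. */
    for (size_t i = m - 1; i-- > 0; )
        bc_thread(i, x, m, bc);
}
```

With the pattern GCAGAGAG this produces shifts of 1 for A, 6 for C, 2 for G and 8 for every
symbol that does not occur in the first m - 1 pattern positions.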

The source code for the sequential and CUDA versions of Horspool is shown below:

Illustration 5: Sequential Horspool


Illustration 6: CUDA Horspool

Evaluation
The author benchmarked the three sequential algorithms and their CUDA equivalents, applying
some, but not all, CUDA techniques and technologies. Each test was run for 100 iterations and
the average was taken. The tests were run on a 32-bit Ubuntu OS with an NVIDIA GTX 480 card,
an 8-core Intel i7 CPU and 6 GB of system RAM.

Two sorts of tests were conducted: (1) the pattern was shorter than the alphabet size, and
(2) the pattern was longer than the alphabet size.

One observation from the tests is that the speedup factor of the CUDA code over the sequential
code ranges from 31 to 106. Another is that the CUDA versions of the code exhibit branch
divergence and bank conflicts, and that this behavior is highly dependent on the pattern and
the text involved.

Here is the summary:


Algorithm     Type   Optimization         Search runtime   GPU effective      Speedup
                                          (milliseconds)   bandwidth (GBps)   factor
brute-force   SEQ    -O2                  24               N/A                -
brute-force   CUDA   None                 0.24             11.9               100
brute-force   CUDA   Shared memory        0.24             11.9
brute-force   CUDA   Page-locked memory   0.41             7.1                59
QuickSearch   SEQ    -O2                  16               N/A                -
QuickSearch   CUDA   None                 0.18             15.87              88
QuickSearch   CUDA   Shared memory        0.15             19.77              106
Horspool      SEQ    -O2                  16               N/A                -
Horspool      CUDA   None                 0.19             15.62              84
Horspool      CUDA   Shared memory        0.16             18.55              100

Table 1: Test results for pattern shorter than alphabet size


Algorithm     Type   Optimization         Search runtime   GPU effective      Speedup
                                          (milliseconds)   bandwidth (GBps)   factor
brute-force   SEQ    -O2                  21.2             N/A                -
brute-force   CUDA   Shared memory        0.55             5.35               38
QuickSearch   SEQ    -O2                  17.2             N/A                -
QuickSearch   CUDA   Shared memory        0.47             6.29               36
Horspool      SEQ    -O2                  16.8             N/A                -
Horspool      CUDA   Shared memory        0.53             5.56               31

Table 2: Test results when pattern is longer than the size of the alphabet
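
The derived columns in the tables can be reproduced with a few lines of arithmetic. The C
sketch below assumes that the speedup factor is the sequential runtime divided by the CUDA
runtime, and that effective bandwidth follows the definition in the NVIDIA CUDA Best Practices
Guide, i.e. bytes read plus bytes written, divided by 10^9 and by the time in seconds. The
function names are hypothetical, and the byte counts of the author's test data are not given
in the text.

```c
#include <assert.h>

/* Speedup factor: sequential runtime over CUDA runtime (same units). */
double speedup(double seq_ms, double cuda_ms)
{
    assert(cuda_ms > 0.0);
    return seq_ms / cuda_ms;
}

/* Effective bandwidth in GB/s, with 1 GB taken as 1e9 bytes:
 * (bytes read + bytes written) / 1e9 / elapsed seconds. */
double effective_bandwidth_gbps(double bytes_read, double bytes_written,
                                double ms)
{
    assert(ms > 0.0);
    return (bytes_read + bytes_written) / 1e9 / (ms / 1e3);
}

/* Inverse relation: bytes moved implied by a bandwidth and a runtime. */
double implied_bytes(double gbps, double ms)
{
    return gbps * 1e9 * (ms / 1e3);
}
```

For the first brute-force CUDA row of Table 1, 24 ms / 0.24 ms gives the listed factor of 100,
and 11.9 GBps over 0.24 ms corresponds to roughly 2.86 MB moved per search if the figures
follow this definition. Some listed factors, such as 88 for 16 ms / 0.18 ms, appear to be
rounded down.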

Final Evaluation
The author believes that performance gains would be better if the implementation used
asynchronous concurrent execution, since executing multiple kernels concurrently would possibly
improve the run times. The author also found that optimization levels beyond -O2 for the
sequential algorithms did not seem to affect the overall run times.

The author's initial experimentation with page-locked/zero-copy memory was not encouraging, as
effective bandwidth lagged significantly on the Linux operating system; the author cannot
offer an explanation at this point in time for why this is the case.

The author hoped to implement a multi-GPU solution but, due to a lack of resources, it cannot
be pursued in the near future, though the author would get a big kick out of it!

References
• KIRK D., HWU W., 2010, Programming Massively Parallel Processors, first edition.
• AHO A.V., 1990, Algorithms for finding patterns in strings, in Handbook of Theoretical
Computer Science, Volume A, Algorithms and Complexity, J. van Leeuwen ed., Chapter
5, pp 255-300, Elsevier, Amsterdam.
• HORSPOOL R.N., 1980, Practical fast searching in strings, Software - Practice &
Experience, 10(6):501-506.
• SUNDAY D.M., 1990, A very fast substring search algorithm, Communications of the
ACM, 33(8):132-142.
• Quick Search Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• Horspool Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• Brute-force Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/
• NVIDIA CUDA Programming Guide 3.0
• NVIDIA CUDA Reference Manual 3.0
• NVIDIA CUDA Best Practices Guide