String Matching on hybrid parallel architectures: an approach using MPI and NVIDIA CUDA

Department of Applied Informatics,
University of Macedonia,
March 2013

John-Alexander M. Assael
iassael@gmail.com

Supervisor
Prof. Konstantinos G. Margaritis
kmarg@uom.edu.gr
Table of Contents
Acknowledgments
Abstract
Glossary
1. Introduction
1.1. History of String Matching
1.2. Application Areas
1.2.1. Computational Biology
1.2.2. Signal Processing
1.2.3. Text Retrieval
1.2.4. Computer Security
1.3. The Speed Problem
1.4. Objective of this Thesis
2. Parallel Programming
2.1. From GPUs to General Purpose GPUs
2.2. CUDA: A General-Purpose Parallel Computing Platform
2.3. Scalable Programming Model
2.4. Clusters with MPI and CUDA
3. CUDA Programming Model
3.1. Kernels
3.2. Thread Hierarchy
3.3. Memory Hierarchy
3.4. Heterogeneous Architecture
4. System model and approach
5. Hybrid String Matching
5.1. MPI Parallelization
5.2. CUDA Parallelization
6. String Matching Algorithms
6.1. Naive Search
6.1.1. Sequential Implementation
6.1.2. CUDA Implementation
6.2. Knuth Morris-Pratt
6.2.1. Sequential Implementation
6.2.2. CUDA Implementation
6.3. Horspool
6.3.1. Sequential Implementation
6.3.2. CUDA Implementation
6.4. Karp Rabin
6.4.1. Sequential Implementation
6.4.2. CUDA Implementation
6.5. Quick Search
6.5.1. Sequential Implementation
6.5.2. CUDA Implementation
6.6. Shift Or
6.6.1. Sequential Implementation
6.6.2. CUDA Implementation
6.7. Shift And
6.7.1. Sequential Implementation
6.7.2. CUDA Implementation
7. Performance Evaluation
7.1. Testing Methodology
7.1.1. Pattern Size and CUDA Threads
7.1.2. String Matching Test Files
7.2. Measuring Speedup
7.3. CPU vs GPU Comparison
7.4. GPU vs the Cluster Comparison
8. Conclusions
9. Appendix
9.1. Naïve Search
9.1.1. MPI Implementation (mpiBrute.cpp)
9.1.2. MPI Implementation (mpiBrute.h)
9.1.3. CUDA Implementation (mpiBrute.cu)
9.2. Knuth Morris-Pratt
9.2.1. CUDA Implementation (mpiKMP.cu)
9.3. Horspool
9.3.1. CUDA Implementation (mpiHorsepool.cu)
9.4. Karp Rabin
9.4.1. CUDA Implementation (mpiKarpRabin.cu)
9.5. Quick Search
9.5.1. CUDA Implementation (mpiQuickSearch.cu)
9.6. Shift Or
9.6.1. CUDA Implementation (mpiShiftOr.cu)
9.7. Shift And
9.7.1. CUDA Implementation (mpiShiftAnd.cu)
References

Acknowledgments
I would like to thank my supervisor, Prof. Konstantinos G. Margaritis, for his support and insightful feedback during the course of this research. I would also like to thank the Parallel Distributed Processing Laboratory, as well as the Department of Applied Informatics of the University of Macedonia, for providing me access to the hardware that was used to run the experiments. Last but not least, I want to thank my family and friends for all of their support.

Abstract
String Matching algorithms are responsible for finding occurrences of a pattern within a large text. Many areas of Computer Science require demanding string-matching procedures. Over the past years, Graphics Processing Units have evolved into powerful parallel processors that outperform Central Processing Units in scientific calculations. Moreover, several GPUs can be used in parallel within a computer cluster. This thesis attempts to speed up seven major String Matching algorithms by exploiting the combined power of several CUDA-enabled GPUs in a GPU cluster. The algorithms are implemented and optimized to take advantage of an MPI distributed-memory cluster and the CUDA parallel computing architecture. Finally, their performance is compared against the corresponding sequential and single-GPU implementations.

Glossary
CUDA Compute Unified Device Architecture; NVIDIA's parallel computing architecture that harnesses the power of Graphics Processing Units (GPUs) to offer dramatic performance increases [1]

MPI Standardized and portable Message Passing Interface [2]

n The length of the text being examined

m The length of the pattern

BM Boyer Moore String Matching Algorithm

Hor Horspool String Matching Algorithm

KMP Knuth Morris Pratt String Matching Algorithm

KR Karp Rabin String Matching Algorithm

MP Morris Pratt String Matching Algorithm

QS Quick Search String Matching Algorithm

SA Shift And String Matching Algorithm

SOr Shift Or String Matching Algorithm



1. Introduction
String Matching algorithms are an important member of the String algorithms family. Their purpose is to find all the occurrences of one or several strings (also called patterns) within a larger string or text [3]. In general, there are two inputs given to each algorithm: the large text and the pattern that we are looking for within it. It is worth noting that in most algorithms the length of the text is denoted by 'n' and the length of the pattern by 'm', where m is less than or equal to n. There are many algorithms that try to give optimal solutions to this problem. In this thesis, we will analyze seven major String Matching algorithms that, through hybrid hardware and software optimizations, exhibit a dramatic performance boost. Finally, Big O notation will be used to express the complexity of the algorithms.

1.1. History of String Matching


The problem of String Matching has been regarded as fundamental to Computer Science since the first years of the digital computing era, as most applications and problems involve text processing. The first algorithm developed was the Brute-Force algorithm, based on the straightforward idea of sliding the pattern along the text and comparing it to each portion of the text (Figure 1). However, this first approach is very slow, taking O(nm) time [4]. In 1964, B. Dömölki developed the Bitap [5] algorithm, also known as Shift Or and Shift And. The algorithm's main characteristic is that it uses fast bitwise operations to accelerate the searching process. Later, in 1970, Morris and Pratt came up with a tight analysis of the Brute-Force algorithm, making use of the internal structure of the pattern [6]. The algorithm used preprocessed values that were computed in O(m) time and then performed the search in O(n+m) time, offering better performance. A more sophisticated version of the MP algorithm came later from Knuth, Morris and Pratt [7], in 1977. Although the complexity of the KMP algorithm was the same as that of MP, it was much faster in practice, as it improved the length of the shifts that take place. Moreover, in the same year another efficient algorithm was introduced by Boyer and Moore [8]. The BM algorithm is considered one of the most efficient String Matching algorithms in typical applications, forming the basis of most "search" commands. Several variants and enhancements of the BM algorithm appeared in the following years, such as Horspool [9], Galil-Seiferas [10], Apostolico-Giancarlo [11], Karp-Rabin [12] and Quick-Search [13]. In this thesis we will analyze the Horspool, Karp-Rabin and Quick-Search algorithms, as all three are simple to implement on parallel structures. In 1980, N. Horspool published a simplified version of BM, which was related to the KMP algorithm. Horspool's algorithm had various optimizations in order to achieve an average-case complexity of O(n) on a random text, although its worst case is much worse than that of BM. The Karp-Rabin algorithm was created in 1987 by M. Rabin and R. Karp, and it makes use of hash methods to search much more efficiently for the pattern inside the given text. Additionally, it is especially widely used for multiple-pattern searches. Finally, Quick Search was published in 1990 by D. Sunday. The major unique feature of Quick Search was that it was simplified to use only the bad-character shift table of BM, making it very fast in practice for short patterns and large alphabets, and also much easier to implement. This short history overview covers all the algorithms that were implemented in this thesis.

Figure 1 Brute-Force String Matching



1.2. Application Areas


The first references to the problem of string matching appeared in the sixties and the seventies. It was a common obstacle in many different fields of science. Some of the first requirements for fast search solutions came from the computational biology, signal processing and text retrieval application areas. Notably, even nowadays these remain the major areas of interest [14].

1.2.1. Computational Biology

In Computational Biology, DNA and protein sequences can be represented as long texts over specific alphabets, such as (A, C, G, T), that represent the genetic code of living organisms. Although these sequences can be very long, scientists search for specific parts within those texts, in order to assemble the DNA chain from the pieces obtained by experiments, look for specific characteristics inside them, or compare different parts. These requirements are modeled as searching for patterns in a large text file. However, in such applications approximate matching is much more useful than exact matching, as the experimental measurements have errors, and even when they are correct they may contain small differences due to mutations and evolutionary alterations. Hence, fuzzy string-matching algorithms can prove more effective when processing such files.

Figure 2 Computational Biology Codon Matching



1.2.2. Signal Processing

Another early motivation came from the area of signal processing, which deals with speech and, more generally, with sound pattern recognition. The main problem is to recognize a specific message within a transmitted audio signal. Many problems derive from this main idea, from speech recognition to music recognition and signal error correction. In such problems the main text is the encoded input signal, while a smaller encoded sequence represents the pattern that is to be found.

1.2.3. Text Retrieval

The most common problem is the retrieval of parts of a larger text while counting the occurrences of the pattern. Moreover, the problem of correcting misspelled words in written text is one of the oldest potential applications. Furthermore, as the World Wide Web expanded, a new need arose for scanning the online content of the Internet and categorizing it, in order to make it searchable by all users. This is an extremely demanding task, as the size of the data to be processed is enormous and parallel techniques are required in order to process it efficiently.

1.2.4. Computer Security

In recent years, several areas of computer security have started making use of pattern matching techniques in order to achieve various demanding tasks such as intrusion detection, file hash matching, virus scanning and spam filtering. However, all these tasks are quite demanding and run continuously, reserving a notable amount of computing power for both online and offline processes. Moreover, these tasks become even more demanding as the amount of data that has to be processed multiplies. For example, Mail Servers have far more data to process than regular computers, while all these actions have to be executed both fast and efficiently, consuming as little computing power as possible. There are several GPU implementations of security applications, such as GNORT [15] in 2008, a GPU version of the well-known intrusion detection system SNORT [16].

1.3. The Speed Problem


As discussed, there are many areas of Computer Science that require demanding string-matching procedures. Measurements on Network Intrusion Detection Systems (NIDS) that perform deep packet inspection, such as SNORT [16], have shown that 31% of total processing is due to string matching. Moreover, the percentage increases dramatically in the case of Web Traffic scanning, reaching up to 80% [17]. Another example is codon recognition in a DNA sequence; such tasks are entirely based on string matching, meaning that the execution time of the process depends directly on the execution time of string matching. Thus, string matching can be considered one of the most computationally intensive parts of many procedures in different scientific fields. In this thesis we will focus on how these tasks can be optimized and executed in less time on hybrid architectures.

1.4. Objective of this Thesis


This thesis on hybrid implementations of seven String Matching algorithms had two major objectives. The first objective was to implement several string matching algorithms to make use of the multi-core CUDA-architecture Graphics Processing Units of NVIDIA Corporation. This significantly reduces the load on the Central Processing Unit, while offering an important performance speedup by efficiently using all the hardware resources that a computer offers. However, there are many applications of string matching that exceed the capabilities of a personal computer if they are to be executed in reasonable time. Moreover, previous research such as "String Matching on a multicore GPU using CUDA" [18], published in 2009 by Ch. Kouzinopoulos and K. Margaritis, has shown a significant performance increase for several string matching algorithms when using a GPU. Thus, the second objective of this thesis was not only to present the performance speedup that a GPU card can offer, but also to distribute the demanding search tasks to scalable computer clusters equipped with NVIDIA CUDA-enabled GPU cards and compare the results.

2. Parallel Programming

2.1. From GPUs to General Purpose GPUs


The market's endless demand for better and more realistic 3D Computer Graphics evolved Graphics Processing Units into powerful, highly parallel, multithreaded, multicore processors with enormous computational power and extremely high memory bandwidth [19]. A comparison with the CPU memory bandwidth is depicted in Figure 3.

Figure 3 Memory Bandwidth CPU vs GPU

Moreover, due to their parallel architecture, they can sometimes be up to 100x faster than Central Processing Units in simple operations [20].

Figure 4 Floating-Point Operations per second CPU vs GPU

As can be seen in Figure 4, there is a significant difference between the floating-point performance of CPUs and GPUs; this is due to the nature of graphics rendering operations, which are compute-intensive and highly parallel. Consequently, many more transistors inside a GPU are devoted to data processing rather than to caching and flow control. GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very quickly. This evolution has effectively changed the nature of Graphics Processing Units into General Purpose Parallel Processors that are capable of executing demanding computational tasks much more efficiently, taking advantage of their parallel architecture.

Figure 5 GPU has more transistors devoted to Data Processing



This highly parallel model allows the accelerated execution of problems that involve multiple computations on data that are not highly correlated; the problem is distributed and the same operations are executed in parallel on multiple threads and cores. It is very important that the computations are independent and do not require previously calculated results in order to proceed to the next operation. Otherwise, performance drops significantly, as the boost is due to the highly parallel model. As the processing becomes more sequential, the execution time can even exceed the same program's CPU execution time, since individual GPU cores are much weaker than a CPU processor of the same price level.

2.2. CUDA: A General-Purpose Parallel Computing Platform

NVIDIA introduced the CUDA architecture (formerly Compute Unified Device Architecture) in November 2006. CUDA is a general-purpose parallel computing platform and programming model, implemented by the graphics processing units (GPUs) produced by NVIDIA [21]. GPUs are able to solve many complex computational problems in parallel and much faster than a CPU. This approach of solving general-purpose problems (i.e. not exclusively graphics) on GPUs is known as GPGPU [22].

Figure 6 The Multicore Architecture of a GPU



The latest version of NVIDIA CUDA, 5.0, comes with an integrated software environment that uses C as a high-level programming language. The IDE is called Nsight [23] and offers an edition for Microsoft's Visual Studio [24], for development under Microsoft Windows, and one for the open-source Eclipse platform [25], for development under Linux and Apple Mac OS X.

Nsight is now distributed as part of CUDA 5.0 and is equipped with a CUDA source code editor offering syntax highlighting, CUDA-aware refactoring and code completion. Moreover, it has an integrated debugger able to monitor program variables across several CUDA threads; it also gives the ability to set breakpoints and perform single-step execution at both the source-code and assembly levels. Finally, Nsight Profiler is an advanced code profiler that identifies performance bottlenecks using a unified CPU and GPU trace of application activity, while it also offers automated analysis of optimization opportunities. The algorithms implemented within the scope of this thesis were developed under Nsight Eclipse Edition and CUDA 5.0.

It is worth noting that in recent years CUDA development projects have exhibited a tremendous, multi-fold increase; in addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform now supports other computational interfaces, including the Khronos Group's OpenCL [26], Microsoft's DirectCompute [27], and C++ AMP [28]. Finally, wrappers for the platform are also available for Python, Perl, Fortran, Java, Ruby, Lua, Haskell, MATLAB and IDL, and there is native support in Mathematica [21].

Figure 7 CUDA supported Programming Languages and Interfaces

2.3. Scalable Programming Model


The new era of many-core GPUs and multicore CPUs has begun, characterized as an evolution in which parallel systems have taken over. Moreover, it is noteworthy that this parallelism is scaling with Moore's law. The challenge we face nowadays is the development of applications that transparently scale and adjust to each system's number of cores, whether CPU or GPU. This is very important in order to make applications efficient, easy to use and, of course, portable.

CUDA was designed to overcome these challenges, while maintaining a low learning curve for developers who are familiar with standard programming languages such as C and C++. CUDA uses three key abstractions: hierarchies of thread groups, shared memories, and barrier synchronization. These are exposed to the programmer, providing fine-grained data and thread parallelism nested within coarse-grained data parallelism and task parallelism. The programmer is required to split the problem into independent sub-problems that can be solved by blocks of threads, and each sub-problem into finer pieces solved cooperatively in parallel by all the threads within a block.

This partitioning allows the threads to cooperate when solving each sub-problem and, most importantly, such an architecture offers automatic scalability. Each block of threads can be run on any of the GPU's available multiprocessors, either concurrently or sequentially; this gives any CUDA program the ability to execute on any number of multiprocessors and to be adjusted easily, requiring only the physical multiprocessor count to be known. This scalable model allows applications to scale easily depending on the demands and hardware of each user, as NVIDIA offers a wide range of CUDA-enabled GPUs, from mainstream inexpensive (GeForce) to demanding high-performance (Quadro and Tesla) computing products [19].

Figure 8 Automatic Scalability

2.4. Clusters with MPI and CUDA


Message Passing Interface (MPI) is a standardized and portable message-
passing system, which was developed by a group of researchers from
academia and industry to function on a wide variety of parallel computers. The
first steps were taken in the early nineties, and version 1.0 of the interface was released in June 1994 [2]. MPI defines each node as a process and is considered the standard for High Performance Computing application development on distributed-memory architectures [29].

MPI defines the semantics and the syntax of a core of library routines, useful
to a wide range of developers writing portable message-passing programs.
There are MPI bindings for many languages, but the first ones were in Fortran
77, C and C++. There are several implementations of the MPI interface, many of which are free and publicly available.

One of the most widely used implementations is MPICH [30], a portable, free implementation of the message-passing interface for distributed-memory applications that run in parallel. MPICH efficiently supports different computation and communication platforms, including commodity clusters (desktop systems, shared-memory systems, multicore architectures), high-speed networks, proprietary high-end computing systems (Blue Gene, Cray) and multiple operating systems (Windows, most flavors of UNIX including Linux and Mac OS X). The cluster used in this thesis was supported by MPICH2 running under Ubuntu Linux.

As MPI is standardized, the code that is written is portable and is able to run
under any MPI implementation of the same architecture. Moreover, although
the performance may vary between different implementations, the calls that
are made have the same behavior on all of them, offering even more
portability to the parallel programs.

The most common use of MPI is in parallel programming for cluster systems, as it handles all the communications between the nodes with simplicity and, most importantly, in a standardized way. The main problem encountered is the distributed memory architecture of clustered systems: how the data are distributed and how the processes communicate when they need data from another node, and possibly from many others. MPI is responsible for all the node communications, handling the data distribution and gathering. However, a developer should always remember that passing data over the network is a
time-consuming operation, and should therefore develop efficient programs that balance the communication time with the processing time [29].

Figure 9 GPU Cluster with MPI

In order to achieve this balance, nodes have to undertake more work before the results are transmitted, minimizing the number of small network packets, as in such demanding operations even the packet headers affect the total performance. More specifically, if we had developed an algorithm to add two same-length vectors, it would be much more efficient to assign parts of each vector to each worker and send all the calculated results together once the calculations have finished, rather than sending each calculated result to the master worker separately. This strategy is generally the preferred one, especially when there are no data dependencies, as illustrated in the sketch below.
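A minimal sketch of this strategy (not taken from the thesis; function and variable names are illustrative): the master scatters equal slices of the two input vectors, each process adds its slice locally without any intermediate communication, and a single gather collects the per-process results.

#include <mpi.h>
#include <stdlib.h>

/* Distributed addition of two same-length vectors a and b into c (valid on the root).
   For brevity it is assumed that n is divisible by the number of processes. */
void distributedVectorAdd(const float *a, const float *b, float *c, int n) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = n / size;
    float *la = (float *) malloc(chunk * sizeof(float));
    float *lb = (float *) malloc(chunk * sizeof(float));
    float *lc = (float *) malloc(chunk * sizeof(float));

    /* Distribute one slice of each input vector to every process */
    MPI_Scatter(a, chunk, MPI_FLOAT, la, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_FLOAT, lb, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Local computation: no communication per element */
    for (int i = 0; i < chunk; i++)
        lc[i] = la[i] + lb[i];

    /* One result message per worker instead of one per element */
    MPI_Gather(lc, chunk, MPI_FLOAT, c, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(la); free(lb); free(lc);
}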

As can be seen, this balance is directly affected by the processing time, the data dependencies and, as a result, the network communications. This thesis focuses on using the GPU parallel architecture, taking advantage of its tremendous performance boost, on a computer cluster with CUDA-enabled GPUs governed by MPI. Therefore, in this model there are two levels of parallelism: the first level is between the nodes of the cluster, and the second one is between the cores of each worker's GPU.

3. CUDA Programming Model


In this chapter we will identify some of the major characteristics of the CUDA programming model by outlining how they are exposed in C, based on the "NVIDIA C Programming Guide" [19]. This is necessary in order to understand the experimental analysis that follows. The sample program being used is called "vectorAdd" and is available in the SDK's accompanying code samples.

3.1. Kernels
The CUDA library extends C, giving the ability to define and execute specialized CUDA functions. These functions are called kernels and, as opposed to regular C functions, they are executed N times in parallel by N different CUDA threads [19]. Each kernel function is defined using the __global__ specifier, and the number of CUDA threads that will execute the kernel is defined within the <<<…>>> execution configuration syntax. The unique thread ID that characterizes each thread executing the kernel can be accessed inside the kernel through the built-in variable threadIdx.

The following kernel example (Exhibit 1) illustrates the addition of two vectors A and B of size N, storing the result into vector C.

//  Kernel  definition  
__global__  void  vectorAdd(const  float  *A,  const  float  *B,  
float  *C)  {  
  int  i  =  threadIdx.x;  
   
C[i]  =  A[i]  +  B[i];  
}  
int  main()  {  
  ...  
  //  Kernel  invocation  with  N  threads  
  vectorAdd<<<1,  N>>>(A,  B,  C);  
  ...  
}  

 
Exhibit 1 Simple vectorAdd() kernel

3.2. Thread Hierarchy


CUDA has specified the threadIdx variable to be a 3-component vector, for
reasons of convenience. Hence, threads can be identified using 1D, 2D or 3D
thread indexes, providing a more natural way of accessing the desired
elements of a vector, matrix or a volume, respectively [19].

At this point, we should note that there is a direct relation between the thread
ID and the thread Index. This relation exists so that we can manipulate our
data and cells with much more convenience.

- For a 1D block, they are the same.
- For a 2D block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y·Dx).
- Finally, for a 3D block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y·Dx + z·Dx·Dy). A small sketch of this flattening follows.
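The following device-side sketch (not from the thesis) computes this linear thread ID, using the built-in blockDim variable, which holds the block dimensions Dx, Dy and Dz:

// Returns the linear thread ID of the calling thread inside its block,
// following the 3D formula above.
__device__ unsigned int flatThreadId() {
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}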

The example in Exhibit 2 is based on Exhibit 1, but with 2D matrices of N×N dimensions. The code computes the addition of matrices A and B, and the result is stored in matrix C.

//  Kernel  definition  
__global__  void  MatAdd(float  A[N][N],  float  B[N][N],  
float  C[N][N])  {  
  int  i  =  threadIdx.x;  
  int  j  =  threadIdx.y;  
   
C[i][j]  =  A[i][j]  +  B[i][j];  
}  
int  main()  {  
  ...  
  //  Kernel  invocation  with  one  block  of  N  *  N  *  1  threads  
int  numBlocks  =  1;  
dim3  threadsPerBlock(N,  N);  
  MatAdd<<<numBlocks,  threadsPerBlock>>>(A,  B,  C);  
  ...  
}  
 
Exhibit 2 MatAdd() kernel

As the threads of a block are executed on the same processor core and share finite hardware and especially memory resources, there is an upper limit of 1024 threads per block on current NVIDIA GPUs. However, multiple blocks can execute the same kernel. Therefore, blocks are also organized into 1D, 2D or 3D virtual grids of thread blocks. The number of thread blocks inside a virtual grid depends on the size of the data to be processed and on the number of processors that our hardware is equipped with.

Figure 10 Grid of Thread Blocks

The developer defines the number of threads per block and the number of blocks per grid within the <<<...>>> syntax, using an int or dim3 variable. The block index can be accessed within the kernel using the blockIdx variable, in the same way as threadIdx. Moreover, the dimensions of a block can be obtained using the built-in blockDim variable. The following example extends the previous MatAdd() to handle multiple blocks.

//  Kernel  definition  
__global__  void  MatAdd(float  A[N][N],  float  B[N][N],  
float  C[N][N])  {  
int  i  =  blockIdx.x  *  blockDim.x  +  threadIdx.x;    
  int  j  =  blockIdx.y  *  blockDim.y  +  threadIdx.y;  
  if  (i  <  N  &&  j  <  N)  
C[i][j]  =  A[i][j]  +  B[i][j];  
}  
int  main()  {  
  ...  
  //  Kernel  invocation  
dim3  threadsPerBlock(16,  16);    
dim3  numBlocks(N  /  threadsPerBlock.x,  N  /  threadsPerBlock.y);  
  MatAdd<<<numBlocks,  threadsPerBlock>>>(A,  B,  C);  
  ...  
}  
 
Exhibit 3 MatAdd() with Multiple Blocks

One of the most common choices, which is also used in this case, is a thread block sized 16x16, i.e. 256 threads. The number of blocks is then calculated by dividing the total number of elements N by the number of threads per block in each dimension. Moreover, thread blocks are executed independently and can be scheduled in any order across any of the available cores. This facilitates the scalability of the programs being developed, as they can easily be scaled up in relation to the number of cores we have at our disposal.

Moreover, CUDA allows the threads of a block to cooperate: they have a limited amount of shared memory and can set points at which to synchronize their execution so that memory accesses are coordinated. Synchronization points can be set within a kernel by calling the __syncthreads() intrinsic function; this function acts as a barrier that all threads within the block are required to reach before execution continues.
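A short sketch (not from the thesis) illustrating this cooperation: the threads of a block first load data into shared memory and then reduce it, with __syncthreads() acting as the barrier between the phases. A block size of 256 threads (a power of two) is assumed.

__global__ void blockSum(const float *in, float *blockTotals, int n) {
    __shared__ float partial[256];                 // shared by all threads of the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // wait until every element is loaded

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                           // wait for each reduction step
    }

    if (tid == 0)
        blockTotals[blockIdx.x] = partial[0];      // one partial sum per block
}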

3.3. Memory Hierarchy


Each thread has access to multiple memory spaces. First of all, a private local memory space is reserved for each thread. Then, each thread block has a shared memory, which is visible to all its threads and is released at the end of the execution of the block. Finally, global memory is accessible by all threads, but is less efficient for most problems.

In addition, all threads have read-only access to the constant and texture memory spaces. More specifically, CUDA offers the global, constant and texture memory spaces, which are optimized for different types of operations and usage. These memory spaces are persistent across kernel launches, and the constant and texture spaces cannot be modified from within a kernel.
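The following sketch (not from the thesis; all names are illustrative) shows how these memory spaces appear in CUDA C, assuming a block size of at most 256 threads:

__constant__ float c_factor;                        // constant memory: read-only inside kernels

__global__ void memorySpacesDemo(const float *g_in, float *g_out, int n) {
    __shared__ float s_tile[256];                   // shared memory: visible to the whole block
    int local_i = threadIdx.x;                      // local variable: private to each thread
    int i = blockIdx.x * blockDim.x + local_i;

    if (i < n) {
        s_tile[local_i] = g_in[i];                  // g_in and g_out reside in global memory
        g_out[i] = s_tile[local_i] * c_factor;      // combines global, shared and constant data
    }
}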

Figure 11 Levels of Kernel Memory Spaces



3.4. Heterogeneous Architecture


The CUDA architecture model assumes that the kernels are executed on a physically separate device that works as a coprocessor to the host system, where the main application is executed. Moreover, CUDA treats the host and the device memory spaces as different spaces, which need to be referenced differently. Therefore, some steps are necessary in order to perform any operation, as sketched below:

- Device memory allocation: the data are transferred from host memory to the reserved space in device memory.
- Data processing: the kernels are loaded and executed on the blocks of threads.
- Device memory de-allocation: the processed data are transferred from device memory back to host memory, freeing up the reserved space.
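A minimal host-side sketch of these three steps (not from the thesis; scaleKernel and runOnDevice are illustrative names):

#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                                        // trivial processing step
}

void runOnDevice(float *h_data, int n) {
    float *d_data;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **) &d_data, bytes);                       // 1. allocate device memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  //    copy host -> device
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_data, n);                // 2. process on the device
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // 3. copy results device -> host
    cudaFree(d_data);                                           //    free the reserved space
}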

Figure 12 CUDA Heterogeneous Model



4. System model and approach


The system model used is a CUDA-enabled GPU cluster of 10 computers. The computers run Ubuntu Linux, Lucid Lynx 10.04 Long Term Support release [31]. Each computer is equipped with an Intel Core 2 Duo E8400 CPU with two cores clocked at 3.00 GHz, 3 GB of RAM, and an NVIDIA GeForce GTX 280 GPU clocked at 602 MHz, with 240 CUDA cores and 1 GB of GDDR3 memory at a 1107 MHz memory clock. Additionally, a Gigabit switch handles the network communications. A shared and distributed file system is achieved using NFS [32]. The development took place under the NVIDIA Nsight Eclipse Edition IDE and the applications were compiled using mpicxx and nvcc.

5. Hybrid String Matching


In this hybrid approach there are two levels of parallelization: the first one concerns MPI and the partitioning of the text file that is to be searched across the cluster, while the second one takes place inside the CUDA architecture. The first parallelization level, for distributing jobs within the cluster, was very similar for all the implemented algorithms. The nodes of the cluster communicate through a Gigabit switch and share the same disk space through NFS [32]. Hence, once the text was copied to the file system it was instantly available to all nodes.

The two layers of parallelization are implemented inside the program separately and are connected at the linking stage. There is a main file, written in C++, that executes all the necessary procedures to initialize MPI, read the files, execute the search function and gather the results to the main node. The CUDA search function is placed in a separate file, which is written in C and is compiled with NVCC.

5.1. MPI Parallelization


The following code illustrates the structure of the main program that is executed in the beginning; the CUDA functions are examined later.

At the beginning of Exhibit 4, the MPI execution environment is initialized in the main() function of the program. Moreover, it is worth mentioning that all the MPI functions are executed through a custom wrapper called MPI_CHECK, which terminates the execution in case a command fails to run on a node; a sketch of such a wrapper is shown below. The total number of nodes available to MPI is stored in the commSize variable using the MPI_Comm_size() function, while the current host's rank number is stored in the commRank variable with the use of the MPI_Comm_rank() function.
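The MPI_CHECK wrapper itself is not shown in Exhibit 4; a minimal sketch of what it could look like (written here as a macro; the exact version used in the thesis is listed in the Appendix, mpiBrute.h):

#include <stdio.h>
#include <mpi.h>

/* Executes an MPI call and aborts the whole job if it did not succeed. */
#define MPI_CHECK(call)                                                 \
    do {                                                                \
        int mpi_err = (call);                                           \
        if (mpi_err != MPI_SUCCESS) {                                   \
            fprintf(stderr, "MPI call failed (code %d) at %s:%d\n",     \
                    mpi_err, __FILE__, __LINE__);                       \
            MPI_Abort(MPI_COMM_WORLD, mpi_err);                         \
        }                                                               \
    } while (0)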

//  Initialize  MPI  state  


MPI_CHECK(MPI_Init(&argc,  &argv));  
 
//  Get  our  MPI  node  number  and  node  count  
int  commSize,  commRank;  
MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD,  &commSize));  
MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD,  &commRank));  

 
Exhibit 4 MPI Source - Initialization

Between crucial stages of the execution, the MPI_Wtime() function is called in


order to obtain the current time in seconds and microseconds. This is used, in
order to calculate the execution time of each step, and later compare the
results. The first actual step of the algorithm is to read the pattern from a
specified path, and then convert it in to a C string variable in order to pass it
later to the search function that is written in C. Moreover, the pattern length is
stored in the lenp integer variable.

timeStartRead  =  MPI_Wtime();  
//Read  Pattern  File  
ifstream  filePattern(patternpath);  
if  (!filePattern.is_open())  {  
  cout  <<  "Rank"  <<  commRank  <<  ":  Cannot  open  pattern!"  <<  endl;  
  my_abort(1);  
}  
string  str_pattern((std::istreambuf_iterator<char>(filePattern)),  
      std::istreambuf_iterator<char>());  
filePattern.close();  
 
h_pattern  =  new  char[str_pattern.size()  +  1];  /*  +1  for  the  terminating  '\0'  copied  by  strcpy()  */
strcpy(h_pattern,  str_pattern.c_str());  
int  lenp  =  strlen(h_pattern);  

 
Exhibit 5 MPI Source - Read Pattern

The same procedure is followed in order to read the text file. However, in this case each node reads only the chunk that it was assigned. First, all nodes read the total file size and then, using the rank assigned to them by MPI, select a different chunk of the text. For example, if there were 5 nodes in total, the file would be split into 5 imaginary chunks and each node would read its own part; the second node in line would read the second chunk, and so on. This is achieved by using the start and stop auxiliary variables, which indicate the start and end positions for each node's read of the text file.

start = rank × (n / N_MPI)

stop = (rank + 1) × (n / N_MPI) + (m − 1)

where n is the text length, m the pattern length, N_MPI the total number of MPI processes and rank the node's MPI rank.

The length of the text that each node processes is slightly larger than the plain division of the total text length by the number of nodes, as it also contains the next m − 1 characters; this covers the case where an occurrence of the pattern starts at the last letters of the chunk.

//  Read  Main  File  


ifstream  fileMain(filepath);  
if  (!fileMain.is_open())  {  
  cout  <<  "Rank"  <<  commRank  <<  ":  Cannot  open  text!"  <<  endl;  
  my_abort(1);  
}  
fileMain.seekg(0,  std::ios::end);  
lenFullText  =  fileMain.tellg();  
timeEndRead  =  MPI_Wtime();  
 
int  start  =  commRank  *  lenFullText  /  commSize;  
int  stop  =  (commRank  +  1)  *  lenFullText  /  commSize  +  (lenp  -  1);  
 
if  (stop  >  lenFullText)  
  stop  =  lenFullText;  
 
fileMain.seekg(start,  std::ios::beg);  
 
h_text  =  (char  *)  calloc(stop  -  start  +  2,  sizeof(char));  /*  zero-filled;  leaves  room  for  the  terminating  '\0'  */  
fileMain.read(h_text,  stop  -  start  +  1);  
fileMain.close();  
int  lent  =  strlen(h_text);  
 
Exhibit 6 MPI Source - Read Text

Next, the length of the chunk that is to be processed is calculated, as well as the size of the result vector, which is a Boolean vector with the same size as the chunk.

Moreover, there is a global integer result vector, h_result_final, initialized on the master worker, whose size is the size of h_result divided by the pattern length and multiplied by the number of nodes; the final result positions from all nodes are gathered into it on the master worker.

max_results = (result_size / m) × N_MPI

The computeGPU() function is called with the text and the pattern as its main parameters and invokes the CUDA-implemented string matching kernel to calculate the results on the local GPU card. The same algorithm is also implemented in C and is called by the computeCPU() function, in order to compare the execution times. The implementations of both functions differ depending on the algorithm, while the remaining code is almost the same for all of them.

//  Calculate  lengths  
int  result_size  =  ceil(lenFullText  /  commSize);  
 
//  Initialize  local  results  array  
h_result  =  (bool  *)  malloc(result_size  *  sizeof(bool));  
memset(h_result,  false,  result_size  *  sizeof(bool));  
 
//  On  each  node,  run  computation  on  GPU  and  CPU  
timeStartCuda  =  MPI_Wtime();  
computeGPU(h_result,  result_size,  h_text,  lent,  h_pattern,  lenp,  
    cudaThreadSize,  timeGPU);  
timeEndCuda  =  MPI_Wtime();  
 
timeStartCPU  =  MPI_Wtime();  
computeCPU(h_text,  lent,  h_pattern,  lenp,  h_result);  
timeEndCPU  =  MPI_Wtime();  
 
if  (commRank  ==  0)  
  h_result_final  =  (int  *)  malloc(  
        result_size  /  lenp  *  commSize  *  sizeof(int));  
 
Exhibit 7 MPI Source - Execute GPU and CPU algorithms

The next step after execution is to gather the results to the master worker. However, the major part of the results vector will consist of false values, as it is not usual for a pattern to be found that many times. On the one hand, the MPI_Gather() function could be used, but it would not be
time-efficient; on the other hand, a more efficient solution is to do some preprocessing on each worker and send only the positions of the matches to the master. Moreover, in order to avoid network packet overheads, a buffer is used so that the gathered results are sent in groups.

timeStartGather  =  MPI_Wtime();  
MPI_Status  status;  
const  int  buf  =  32;  
int  position[buf];  
if  (commRank  ==  0)  {  
  //  Copy  masters  results  
  int  results  =  0;  
  for  (i  =  0;  i  <  result_size;  i++)  
    if  (h_result[i])  {  
      h_result_final[results]  =  i;  
      results++;  
    }  
 
//  Start  receiving  
  int  received  =  0;  
  int  finWorkers  =  1;  
  while  (finWorkers  <  commSize)  {  
    MPI_CHECK(  
MPI_Recv(&position,  buf,  MPI_INT,  MPI_ANY_SOURCE,  
MPI_ANY_TAG,  MPI_COMM_WORLD,  &status));  
 
    //  Get  the  number  of  the  received  results  
    MPI_Get_count(&status,  MPI_INT,  &received);  
    for  (i  =  0;  i  <  received;  i++)  {  
      h_result_final[results]  =  position[i];  
      results++;  
    }  
    //  Worker  has  finished  transmission  
    if  (status.MPI_TAG  ==  1)  
      finWorkers++;  
  }  
}  
 
Exhibit 8 MPI Source – Gather Master

The master worker first copies the local result positions to the final results
vector, and then starts receiving using MPI_Recv() until all workers have
completed transmission. In order to keep track of the number of finished
workers, the last transmission of each one is marked with the value 1 instead
of 0 as MPI_TAG.

else  {  
  int  tmp_results  =  0;  
  for  (i  =  0;  i  <  result_size;  i++)  
    if  (h_result[i])  {  
      position[tmp_results]  =  start  +  i;  
      tmp_results++;  
 
      //  If  buffer  is  full  send  results  
      if  (tmp_results  >=  buf)  {  
        MPI_CHECK(  
MPI_Send(&position,  tmp_results,  MPI_INT,  
0,  0,  MPI_COMM_WORLD  ));  
        tmp_results  =  0;  
      }  
    }  
 
  //  If  there  are  unsent  results  send  them  and  finish  transmission  
  if  (tmp_results  >  0)  
    MPI_CHECK(  
MPI_Send(&position,  tmp_results,  MPI_INT,  0,  1,  
MPI_COMM_WORLD  ));  
  else    
    MPI_CHECK(  
MPI_Send(NULL,  0,  MPI_INT,  0,  1,  MPI_COMM_WORLD  ));  
}  
timeEndGather  =  MPI_Wtime();  

 
Exhibit 9 MPI Source – Gather Workers

Each worker scans the local results array and copies the match positions to the buffer; when the buffer is full, it is sent to the master using MPI_Send(). The occurrence positions are calculated by adding the local chunk's occurrence position to the start position at which the node began reading the full text, so that the values point to positions in the original text.

5.2. CUDA Parallelization


The computeGPU() function initializes the CUDA execution environment and is called by the MPI program in order to process the text on the NVIDIA CUDA architecture. This function is quite similar for most algorithms except, as expected, for the preprocessing and search parts. In this example, we will illustrate the computeGPU() function that was used in the implementation of the Naïve algorithm.

First of all, the text and the pattern vectors are initialized and copied to GPU memory. In order to achieve this, the vectors are duplicated; however, they are allocated using the cudaMalloc() function instead of the ANSI C malloc(). For ease of use and readability, the GPU variables start with "d_", standing for "device". Then, the cudaMemcpy() function is used to duplicate the text and pattern variables into GPU memory, and cudaMemset() initializes the results Boolean vector with false values.

cudaEventRecord(startMemHost,  0);  
//  Allocate  data  on  GPU  memory  
checkCudaErrors(cudaMalloc((void**)  &d_text,  lent*sizeof(char)));  
checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp*sizeof(char)));  
checkCudaErrors(cudaMalloc((void**)  &d_result,  lent*sizeof(bool)));    
//  Copy  to  GPU  memory  
checkCudaErrors(cudaMemcpy(d_text,  h_text,  lent*sizeof(char),  
cudaMemcpyHostToDevice));  
checkCudaErrors(cudaMemcpy(d_pattern,  h_pattern,  lenp*sizeof(char),  
cudaMemcpyHostToDevice));  
 
checkCudaErrors(cudaMemset(d_result,  false,  lent*sizeof(bool)));  
 
cudaEventRecord(stopMemHost,  0);  
 
Exhibit 10 computeGPU() Memory Initialization

As can be seen, cudaEventRecord() is used to measure the execution time of each step, while checkCudaErrors() wraps the execution of CUDA functions, handling error reporting in case of a problem during execution. The next step is the calculation of the number of CUDA blocks
depending on the thread size. The formula used for 1D vectors, such as the text being processed, is presented below.

blocks = (vector_size + threadsPerBlock − 1) / threadsPerBlock

The vector size in this case is the size of the results vector, which has the same length as the text for ease of calculation, since linked lists are too complex to be used efficiently on parallel architectures.

//  Calculate  Blocks  /  Threads  


int  threadsPerBlock  =  threadSize;  
int  blocksPerGrid  =  (lent  +  threadsPerBlock  -  1)  /  threadsPerBlock;  
 
//  Run  Main  kernel  
cudaEventRecord(startCompute,  0);  
bruteforceGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  
lent,  lenp,  d_result);  
cudaEventRecord(stopCompute,  0);  
 
//  Copy  data  back  to  Host  memory  
cudaEventRecord(startMemDevice,  0);  
checkCudaErrors(cudaMemcpy(h_result,  d_result,  lent*sizeof(bool),  
cudaMemcpyDeviceToHost));  
cudaEventRecord(stopMemDevice,  0);  
 
//  Free  GPU  memory  
checkCudaErrors(cudaFree(d_text));  
checkCudaErrors(cudaFree(d_pattern));  
checkCudaErrors(cudaFree(d_result));  
 
cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
Exhibit 11 computeGPU() Execute and Finalize

Finally, after the kernel has been executed with the specified block and thread sizes using the <<<…>>> kernel call, the result vector is copied back to host memory. The device variables are freed using cudaFree(), and the following execution times are calculated: a) vector initialization and memory copy from host to device, b) kernel execution, c) memory copy from device to host.

6. String Matching Algorithms


This chapter analyzes the main parts of each String Matching algorithm that was implemented to run under the CUDA architecture. The full source code is available in the Appendix.

6.1. Naive Search


Pre-calculation: −     Search: O(mn)

The first algorithm that was developed was the Naïve algorithm (also known as Brute-Force), which is based on the straightforward idea of sliding the pattern along the text and comparing it to each portion of the text. The algorithm tries to match the first character of the text window with the first character of the pattern; in case of success, it tries to match the second, and so on. Otherwise, the algorithm continues by sliding the pattern to the next character of the text. The outer loop is responsible for sliding to the next character, while the inner loop tries to match the characters with the pattern. However, this first and simple approach is very slow, taking O(nm) time [4]. Finally, its implementation can easily be converted into a fuzzy string-matching algorithm by altering the final condition, since the number of matching characters is counted even if a mismatch is found; a sketch of this variant is shown below. This can be very useful in Computational Biology applications, but also makes it the most time-consuming algorithm.
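A sketch of such a fuzzy variant (not part of the thesis implementations; maxMismatches is an illustrative parameter): only the final condition of the sequential code in Exhibit 12 changes.

void  fuzzyCPU(const  char  *T,  int  n,  const  char  *P,  int  m,
              int  maxMismatches,  bool  *result)  {
    for  (int  x  =  0;  x  +  m  <=  n;  x++)  {
        int  k  =  0;                          /*  matching  characters  at  position  x  */
        for  (int  i  =  0;  i  <  m;  i++)
            if  (T[x  +  i]  ==  P[i])
                ++k;
        if  (k  >=  m  -  maxMismatches)        /*  relaxed  condition  instead  of  k  ==  m  */
            result[x]  =  true;
    }
}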

6.1.1. Sequential Implementation


void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  x,  i,  k;  
 
  for  (x  =  0;  x  <  n;  x++)  {  
    k  =  0;  
 
    for  (i  =  0;  i  <  m;  i++)  
      if  (T[x  +  i]  ==  P[i])  
        ++k;  
 
    if  (k  ==  m)  
      result[x]  =  true;  
  }  
}  
 
Exhibit 12 C implementation of Naïve Search

6.1.2. CUDA Implementation


__global__  void  bruteforceGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    int  k  =  0;  
    int  i;  
 
    for  (i  =  0;  i  <  m;  i++)  
      if  (T[x  +  i]  ==  P[i])  
        ++k;  
 
    if  (k  ==  m)  
      result[x]  =  true;  
  }  
}  
 
Exhibit 13 CUDA Implementation of Naïve Search

6.2. Knuth Morris-Pratt


Pre-calculation: O(m)     Search: O(n)     Overall: O(m + n)

The Knuth, Morris and Pratt [7] algorithm was conceived by D. Knuth and V. Pratt, and independently by J. Morris, in 1974. The three of them published it together in 1977. The algorithm is based on the Naïve algorithm and reduces the number of comparisons by using the information learnt in the inner loop to determine how many skips should take place in the outer loop [33]. The KMP algorithm uses the pattern to pre-compute these skips and starts searching like the Naïve algorithm. However, in case of a mismatch, it uses the pre-computed skip vector in order to determine the position from which to continue.

Due to the sequential dependencies of the pre-computation part of the algorithm, only the main search part was parallelized to take advantage of the CUDA architecture. Moreover, the CUDA implementation of the KMP algorithm was customized to search over larger chunks of the text per CUDA thread (the chunk factor in Exhibit 16), in order to make efficient use of the pre-calculated skip vector.

6.2.1. Sequential Implementation


void  preKmp(char  *x,  int  m,  int  kmpNext[])  {  
  int  i,  j;  
  i  =  0;  
  j  =  kmpNext[0]  =  -1;  
  while  (i  <  m)  {  
    while  (j  >  -1  &&  x[i]  !=  x[j])  
      j  =  kmpNext[j];  
    i++;  
    j++;  
    if  (i  <  m  &&  x[i]  ==  x[j])  
      kmpNext[i]  =  kmpNext[j];  
    else  
      kmpNext[i]  =  j;  
  }  
}  
 
 
Exhibit 14 C implementation of KMP Preprocessing [34]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  i,  j;  
  int  *kmpNext  =  (int*)  malloc((m  +  1)  *  sizeof(int));  
 
  /*  Preprocessing  */  
  preKmp(P,  m,  kmpNext);  
 
  /*  Searching  */  
  i  =  j  =  0;  
  while  (j  <  n)  {  
    while  (i  >  -1  &&  P[i]  !=  T[j])  
      i  =  kmpNext[i];  
    i++;  
    j++;  
    if  (i  >=  m)  {  
      result[j  -  i]  =  true;  
      i  =  kmpNext[i];  
    }  
  }  
}  
 
Exhibit 15 C implementation of KMP [34]

6.2.2. CUDA Implementation


__global__  void  kmpGPU(const  char  *T,  const  char  *P,  const  int  n,  const  
int  m,  const  int  *kmpNext,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    int  i,  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int  stop  =  (x  +  1)  *  m  *  chunk  +  m  -  1;  
    if  (stop  >  n)  
      stop  =  n;  
 
    i  =  0;  
    j  =  start;  
    while  (j  <  stop)  {  
      while  (i  >  -1  &&  P[i]  !=  T[j])  
        i  =  kmpNext[i];  
      i++;  
      j++;  
      if  (i  >=  m)  {  
        result[j  -  i]  =  true;  /*  record  the  match  at  its  starting  position  */  
        i  =  kmpNext[i];  
      }  
    }  
  }  
}  
 
 
Exhibit 16 CUDA Implementation of KMP

6.3. Horspool
Pre-calculation: O(m + σ)     Search: O(mn)

The Horspool [9] algorithm is a simplification of the Boyer-Moore algorithm that was published by N. Horspool in 1980. It is much easier to implement, especially on parallel structures. The Horspool algorithm uses only the bad-character shift vector of the Boyer-Moore algorithm, producing a very efficient algorithm in practice. The searching phase has a quadratic worst case, but the average number of comparisons for a single text character is between 1/σ and 2/(σ+1).

For each position of the window, the algorithm compares the window's last character with the last character of the pattern and, if they match, continues to compare backwards. Then, it shifts the window so that the rightmost occurrence of that character in the pattern is aligned with the last character of the previous window.

The CUDA implementation pre-calculates the shifts on the CPU and then copies the shift table to GPU memory. Using this table, each thread checks whether its current position is a valid shift position, in order to continue execution [35].

6.3.1. Sequential Implementation


void  preBmBcCPU(char  *P,  int  m,  int  bmBc[])  {  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  i++)  
    bmBc[i]  =  m;  
  for  (i  =  0;  i  <  m  -  1;  i++)  
    bmBc[P[i]]  =  m  -  i  -  1;  
}  
 
Exhibit 17 C implementation of Horspool Preprocessing [36]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  j,  bmBc[ASIZE];  
  char  c;  
 
  /*  Preoprocessing  */  
  preBmBcCPU(P,  m,  bmBc);  
 
  /*  Searching  */  
  j  =  0;  
  while  (j  <=  n  -  m)  {  
    c  =  T[j  +  m  -  1];  
    if  (P[m  -  1]  ==  c  &&  memcmp(P,  T  +  j,  m  -  1)  ==  0)  
      result[j]  =  true;  
    j  +=  bmBc[c];  
  }  
}  
 
Exhibit 18 C implementation of Horspool [36]

6.3.2. CUDA Implementation


void  precomputeShifts(char  *T,  int  n,  int  m,  int  *bmBc,  int  *preComp)  {  
  int  i  =  0;  
  while  (i  <=  n  -  m)  {  
    i  +=  bmBc[T[i  +  m  -  1]];  
    preComp[i]  =  i;  
  }  
}  
 
__global__  void  horsGPU(const  char  *T,  const  char  *P,  const  int  n,  const  
int  m,  const  int  *bmBc,  const  int  *preComp,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -  m  &&  preComp[x]  ==  x)  {  
    bool  found  =  true;  
 
    char  c  =  T[x  +  m  -  1];  
    for  (int  i  =  0;  i  <  m  -  1;  ++i)    
      if  (P[m  -  1]  !=  c  ||  P[i]  !=  T[x  +  i])  {  
        found  =  false;  
        break;  
      }  
    if  (found)  
      result[x]  =  true;  
  }  
}  
 
 
Exhibit 19 CUDA Implementation of Horspool

6.4. Karp Rabin


Pre-calculation: O(m)     Search: O(mn)     Expected: O(n + m)

The Karp-Rabin algorithm was created in 1987 by M. Rabin and R. Karp [12] and makes use of hashing to search for the pattern inside the given text. It is especially widely used for multiple-pattern searches. The algorithm calculates a hash value for the pattern and one for the current search window. If the hash values are unequal, it calculates the hash for the next window.

As far as the CUDA implementation is concerned, a custom memcmp() function was developed in order to handle the memory block comparisons on the device. Moreover, the hash of the pattern, hx, is pre-calculated in the main program, as it remains constant throughout the execution.
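The custom device-side comparison itself is not shown in Exhibit 21; a minimal sketch of such a helper, mirroring the semantics of the C library memcmp(), could look as follows (the exact version used in the thesis is listed in the Appendix, mpiKarpRabin.cu):

__device__  int  memcmpGPU(const  char  *a,  const  char  *b,  int  m)  {
    for  (int  i  =  0;  i  <  m;  ++i)
        if  (a[i]  !=  b[i])        /*  first  differing  byte  decides  the  sign  */
            return  (unsigned  char)  a[i]  -  (unsigned  char)  b[i];
    return  0;                     /*  the  m  bytes  are  identical  */
}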

6.4.1. Sequential Implementation


void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  d,  hx,  hy,  i,  j;  
 
  /*  Preprocessing  */  
  for  (d  =  i  =  1;  i  <  m;  ++i)  
    d  =  (d  <<  1);  
 
  for  (hy  =  hx  =  i  =  0;  i  <  m;  ++i)  {  
    hx  =  ((hx  <<  1)  +  P[i]);  
    hy  =  ((hy  <<  1)  +  T[i]);  
  }  
 
  /*  Searching  */  
  j  =  0;  
  while  (j  <=  n  -  m)  {  
    if  (hx  ==  hy  &&  memcmp(P,  T  +  j,  m)  ==  0)  
      result[j]  =  true;  
    hy  =  REHASH(T[j],  T[j  +  m],  hy);  
    ++j;  
  }  
}  
 
Exhibit 20 C implementation of Karp Rabin [37]
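The REHASH macro used above is not defined in the exhibit; in the reference implementation that the code is based on [37], the rolling-hash update follows this style, where d is the weight 2^(m−1) computed in the preprocessing loop:

/*  Remove  the  leftmost  character  a  from  the  hash  h,  shift,  and  add  the
    new  rightmost  character  b.  */
#define  REHASH(a,  b,  h)  ((((h)  -  (a)  *  d)  <<  1)  +  (b))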

6.4.2. CUDA Implementation


__global__  void  krGPU(const  char  *T,  const  char  *P,  const  int  n,  const  
int  m,  
    int  hx,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -  m)  {  
    int  hy,  i;  
 
    /*  Preprocessing  */  
    for  (hy  =  i  =  0;  i  <  m;  ++i)  {  
      hy  =  ((hy  <<  1)  +  T[i  +  x]);  
    }  
 
    /*  Searching  */  
    if  (hx  ==  hy  &&  memcmpGPU(P,  T  +  x,  m)  ==  0)  
      result[x]  =  true;  
  }  
}  
 
Exhibit 21 CUDA Implementation of Karp Rabin

6.5. Quick Search


Pre-calculation: O(m + σ)     Search: O(mn)

The Quick-Search algorithm was published in 1990 by D. Sunday [13]. Its major unique feature is that it was simplified to use only the bad-character shift table of BM, like Horspool, making it very fast in practice for short patterns and large alphabets. Furthermore, it is much easier to implement than BM, especially on parallel structures.

The algorithm moves the text window from left to right and compares it with the pattern; when a mismatch is found, it shifts the window according to the bad-character shift of the text character immediately following the window, and continues the search.

The CUDA implementation pre-calculates the shifts on the CPU and then copies the shift table to GPU memory. Using this table, each thread checks whether its current position is a valid shift position, in order to continue execution [35].

6.5.1. Sequential Implementation


void  preQsBcCPU(char  *P,  int  m,  int  qbc[])  {  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  i++)  
    qbc[i]  =  m  +  1;  
  for  (i  =  0;  i  <  m;  i++)  
    qbc[P[i]]  =  m  -­‐  i;  
}  
 
 
Exhibit 22 C implementation of Quick Search Preprocessing [38]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  i,  s,  qsbc[ASIZE];  
 
  /*  Preoprocessing  */  
  preQsBcCPU(P,  m,  qsbc);  
 
  /*  Searching  */  
  s  =  0;  
  while  (s  <=  n  -­‐  m)  {  
    i  =  0;  
    while  (i  <  m  &&  P[i]  ==  T[s  +  i])  
      i++;  
    if  (i  ==  m)  
      result[s]  =  true;  
    s  +=  qsbc[T[s  +  m]];  
  }  
}  
 
Exhibit 23 C implementation of Quick Search [38]

6.5.2. CUDA Implementation


void  precomputeShifts(char  *T,  int  n,  int  m,  int  *bmBc,  int  *preComp)  {  
  int  i  =  0;  
  while  (i  <=  n  -­‐  m)  {  
    i  +=  bmBc[T[i  +  m]];  
    preComp[i]  =  i;  
  }  
}  
 
__global__  void  quickSearchGPU(const  char  *T,  const  char  *P,  const  int  n,  
const  int  m,  const  int  *bmBc,  const  int  *preComp,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -­‐  m  &&  preComp[x]  ==  x)  {  
    bool  found  =  true;  
 
    for  (int  i  =  0;  i  <  m;  ++i)    
      if  (P[i]  !=  T[x  +  i])  {  
        found  =  false;  
        break;  
      }  
    if  (found)  
      result[x]  =  true;  
  }  
}  
 
Exhibit 24 CUDA Implementation of Quick Search

6.6. Shift Or
Pre-calculation: O(m + σ)      Search: O(n)

The Shift-Or [39] algorithm uses bitwise techniques to search within the
specified text. It is very efficient when the pattern length does not exceed
the memory-word size of the host machine. The bitwise operations keep track of
all prefixes of the pattern that currently match the text. The Shift-Or
algorithm uses a complemented bit mask D in order to avoid one of the final bit
operations required by Shift-And. The state update is illustrated with a small
example below.
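As a small illustration, assume the 2-byte pattern "ab" on a 32-bit word. The preprocessing routine of Exhibit 25 clears bit 0 of S['a'] and bit 1 of S['b'], leaves every other entry all ones, and produces a lim value with only bit 0 cleared. Scanning the text "xab", D starts as all ones; after 'x' it stays all ones, after 'a' bit 0 of D becomes 0 (the prefix "a" matches), and after 'b' bit 1 becomes 0, so D drops below lim and an occurrence is reported at position j - m + 1 = 1.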

The CUDA implementation executes the preprocessing step on the CPU, building
the table that stores a bit mask for each character of the alphabet according
to the pattern, and then copies this vector to the GPU memory. Moreover, the
algorithm was customized so that each CUDA thread searches a larger chunk of
the text, in order to reuse the running bit state across consecutive characters
more efficiently.

6.6.1. Sequential Implementation


int  preSoCPU(char  *x,  int  m,  unsigned  int  S[])  {  
  unsigned  int  j,  lim;  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    S[i]  =  ~0;  
  for  (lim  =  i  =  0,  j  =  1;  i  <  m;  ++i,  j  <<=  1)  {  
    S[x[i]]  &=  ~j;  
    lim  |=  j;  
  }  
  lim  =  ~(lim  >>  1);  
  return  (lim);  
}  
 
Exhibit 25 C implementation of Shift Or Preprocessing [40]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  unsigned  int  lim,  D;  
  unsigned  int  S[ASIZE];  
  int  j;  
 
  /*  Preprocessing  */  
  lim  =  preSoCPU(P,  m,  S);  
 
  /*  Searching  */  
  for  (D  =  ~0,  j  =  0;  j  <  n;  ++j)  {  
    D  =  (D  <<  1)  |  S[T[j]];  
    if  (D  <  lim)  
      result[j  -­‐  m  +  1]  =  true;  
  }  
}  
 
Exhibit 26 C implementation of Shift Or [40]

6.6.2. CUDA Implementation


__global__ void shiftOrGPU(const char *T, const char *P, const int n,
    const int m, const int *S, int lim, bool *result) {

  unsigned int x = blockDim.x * blockIdx.x + threadIdx.x;

  if (x < n) {
    unsigned int D, j;

    /* each thread scans its own chunk of the text (chunk is a compile-time constant) */
    int start = x * m * chunk;
    int stop = (x + 1) * m * chunk + m - 1;
    if (stop > n)
      stop = n;

    /* Searching */
    for (D = ~0, j = start; j <= stop; ++j) {
      D = (D << 1) | S[T[j]];
      if (D < lim)
        result[j - m + 1] = true;
    }
  }
}
 
Exhibit 27 CUDA Implementation of Shift Or

6.7. Shift And


Pre-calculation: O(m + σ)      Search: O(n)

The Shift-And [39] algorithm is very similar to Shift-Or and uses bitwise
techniques to search within the specified text. Their major difference is that
Shift-And keeps the non-complemented state and therefore needs one bitwise
operation more per character than Shift-Or. The two state updates are compared
below.
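A minimal side-by-side sketch of the two per-character updates, using the variable names of Exhibits 26 and 29 (S holds the complemented masks for Shift-Or and the plain masks for Shift-And, and F = 1 << (m - 1)):

/* Shift-Or: an occurrence ends at position j when bit m-1 of D becomes 0 */
D = (D << 1) | S[T[j]];
if (D < lim)
  result[j - m + 1] = true;

/* Shift-And: an occurrence ends at position j when bit m-1 of D becomes 1 */
D = ((D << 1) | 1) & S[T[j]];
if (D & F)
  result[j - m + 1] = true;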

As with Shift-Or, the CUDA implementation executes the preprocessing step on
the CPU, building the table that stores a bit mask for each character of the
alphabet according to the pattern, and then copies this vector to the GPU
memory. Each CUDA thread again searches a larger chunk of the text, in order to
reuse the running bit state across consecutive characters more efficiently.

6.7.1. Sequential Implementation


void  preSACPU(char  *x,  int  m,  unsigned  int  S[])  {  
  unsigned  int  j,  lim;  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    S[i]  =  0;  
  for  (i  =  0,  j  =  1;  i  <  m;  ++i,  j  <<=  1)    
    S[x[i]]  |=  j;  
}  
 
Exhibit 28 C implementation of Shift And Preprocessing [41]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  unsigned  int  D;  
  unsigned  int  S[ASIZE],  F;  
  int  j;  
 
  /*  Preprocessing  */  
  preSACPU(P,  m,  S);  
  F  =  1  <<  (m  -­‐  1);  
 
  /*  Searching  */  
  for  (D  =  0,  j  =  0;  j  <  n;  ++j)  {  
    D  =  ((D  <<  1)  |  1)  &  S[T[j]];  
    if  (D  &  F)  
      result[j  -­‐  m  +  1]  =  true;  
  }  
}  

 
Exhibit 29 C implementation of Shift And [41]

6.7.2. CUDA Implementation


__global__  void  shiftAndGPU(const  char  *T,  const  char  *P,  const  int  n,  
        const  int  m,  const  int  *S,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    unsigned  int  D,  F;  
    int  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int  stop  =  (x  +  1)  *  m  *  chunk  +  m  -­‐  1;  
    if  (stop  >  n)  
      stop  =  n;  
 
    F  =  1  <<  (m  -­‐  1);  
 
    /*  Searching  */  
    for  (D  =  0,  j  =  start;  j  <=  stop;  ++j)  {  
      D  =  ((D  <<  1)  |  1)  &  S[T[j]];  
      if  (D  &  F)  
        result[j  -­‐  m  +  1]  =  true;  
    }  
  }  
}  

 
Exhibit 30 CUDA Implementation of Shift And

7. Performance Evaluation

7.1. Testing Methodology

7.1.1. Pattern Size and CUDA Threads

The experiments were made using two different pattern sizes, 4 bytes and 8
bytes, since the pattern size directly affects most of the algorithms; choosing
two common pattern sizes illustrates this difference in performance. Moreover,
another parameter that affects the execution time, as well as the memory
throughput, is the number of CUDA threads per block, which is typically 256 or
more. The number of blocks depends directly on the number of threads and is
calculated with the following formula.

blocks = (vector_size + threads_per_block - 1) / threads_per_block

However, on cards with compute capability [21] of 1.x, such as the GeForce GTX
280, there is an upper limit of 65535 blocks per grid dimension. So, the larger
value of 512 threads per CUDA block was chosen, in order to be able to process
larger text files without exceeding this limit.
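For instance, with 512 threads per block a one-dimensional grid of 65535 blocks can cover up to 65535 x 512 = 33,553,920 text positions for the kernels that assign one position per thread, whereas 256 threads per block would cover at most 16,776,960 positions.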

7.1.2. String Matching Test Files

The following files compose the testing material that was used in the
experiments. They are widely known, and commonly used in string matching
tests. The texts were obtained from The Canterbury Corpus [42] and from
Project Gutenberg [43].

Filename        Size

pge0112.txt     8243 kB
bible11.txt     4843 kB
world192.txt    2415 kB
1musk10.txt     1313 kB
plrabn12.txt    471 kB

Table 1 Common Sample Files

The alphabet of the files is limited to the 128 ASCII characters. Furthermore,
the files were selected so as to exhibit the performance differences under
variable file sizes. Processing small files on a cluster inevitably suffers a
significant drop in performance due to the network overhead and the GPU memory
transfer time, so a GPU cluster is mainly effective on large files.

7.2. Measuring Speedup


In parallel computing, the term speedup [44] refers to how much faster the
parallel algorithm is compared to the corresponding sequential implementation.
The speedup that an algorithm exhibits can be calculated using the formula:

S_p = T_1 / T_p

where p is the number of processors, T_1 is the execution time of the
sequential algorithm and T_p is the execution time of the parallel algorithm with
p processors. The value depends on the architecture, as well as on how
efficiently the processing is parallelized compared to how much effort is
wasted in communication and synchronization. The ideal situation is called
linear speedup, and means that doubling the number of processors doubles the
speed.

However, as only part of a program can usually be parallelized, it is worth
mentioning Amdahl's law [45], published by G. Amdahl in 1967. This law is
commonly used in parallel computing to find the theoretical maximum expected
improvement to an overall system when only a part of the execution is
parallelized and improved. For example, assume that a program needs 20 hours on
a single core and only 1 hour of it cannot be parallelized, meaning that the
remaining 19 hours (95%) can run in parallel. No matter how efficient the
parallelization is, the total execution time cannot drop below that critical
1 hour, so the maximum speedup is 20x. More examples are depicted in Figure 13,
and the corresponding formula is given below.
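In its general form, with p denoting the fraction of the execution that can be parallelized and N the number of processors, the law can be written as

S(N) = 1 / ((1 - p) + p / N),

so that for the example above (p = 0.95) the speedup approaches 1 / 0.05 = 20 as N grows.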

Figure 13 Amdahl's Law

Amdahl's law cannot be applied directly to the CUDA versus CPU comparison, as
the processor speeds are totally different; however, it can be used to
approximate the MPI times. As far as CUDA is concerned, NVIDIA has provided
technical reports claiming CUDA to be 30x faster in most cases and sometimes
even 100x faster, depending on the algorithm structure and complexity [20]; the
comparison was made between GPU cards and CPUs of the same price range.
Speedups of 100x can be extremely important for science. To put this into
perspective, imagine a problem that takes 12 hours: a 10x speedup would reduce
the execution to 1.2 hours, a 100x speedup to 7.2 minutes and a 1000x speedup
to 43.2 seconds. The maximum speedup that was exhibited in this thesis was
approximately 21x. This is due to the nature of the algorithms that were
implemented, as CUDA performs better on problems of O(n^2) or higher
complexity, whereas most string matching algorithms usually have O(n)
complexity.

7.3. CPU vs GPU Comparison


In order to evaluate the results each algorithm was executed 10 times and the
average execution times were used. Moreover, the comparison was made
between similar price range CPU and GPU hardware (Intel Core 2 Duo E8400
and NVIDIA Geforce GTX 280).

As the alphabet size was the same for all execution experiments, a correlation
between the text size and the execution time was observed, as expected. In
the following graph the search execution times of CPU and GPU are
presented for the Naïve algorithm (pattern length 8, the GPU memory copy
times are excluded).

Figure 14 Naive CPU Times per Text (search execution time in ms for text sizes from 481,861 B to 8,441,343 B)



Figure 15 Naive GPU Times per Text (search execution time in ms for text sizes from 481,861 B to 8,441,343 B)

This correlation is observed because, in all the implemented algorithms, the
text length directly affects the complexity and, as a result, the processing
time. The following graphs are based on the largest file (pge0112.txt), as its
8,243 kB size facilitates the presentation of the results, especially for the
MPI implementations, which are more efficient under larger file sizes.

The patterns were chosen among the most frequent words of the text: the 4-byte
pattern was the word “last” and the 8-byte pattern was the word “probably”.
Both words appear about 530 times within the text, which corresponds to
approximately 0.006% of the text positions.

The following graph (Figure 16) compares the execution times of CPU and
GPU implementations of the 7 string matching algorithms, under different
pattern sizes.

Figure 16 CPU vs GPU Execution time (per algorithm, CPU and GPU times for 4 and 8 Byte patterns, in ms)

The CPU execution times of the Naïve algorithm were outside the visible chart
area, but are available in more detail in Table 2.

          CPU 4        GPU 4       CPU 8        GPU 8

Naïve     245.38 ms    12.95 ms    442.95 ms    18.82 ms
Hor       15.38 ms     9.53 ms     15.38 ms     9.52 ms
KR        71.83 ms     8.79 ms     71.42 ms     9.86 ms
KMP       103.15 ms    10.31 ms    102.17 ms    10.62 ms
QS        25.15 ms     9.75 ms     13.47 ms     9.52 ms
SA        56.74 ms     14.86 ms    56.69 ms     15.90 ms
SOr       56.30 ms     14.85 ms    56.25 ms     15.90 ms

Table 2 CPU vs GPU Execution time



All the presented execution times for both CPU and GPU include any necessary
preprocessing. However, it was observed that, although the complexity of
preprocessing depends on the pattern size and the alphabet size, the
preprocessing times of most algorithms were extremely small for both patterns,
remaining well below a millisecond even for the 8 Byte pattern. As a result,
the preprocessing part of the selected algorithms is quite insignificant for
the selected pattern sizes and does not affect the total execution time
considerably.

The CPU execution time, measured as the execution time of the computeCPU()
function, is represented by the CPU bars. The GPU execution time was calculated
as the sum of the CUDA memory copy from the Host to the Device, the
preprocessing functions, the kernel execution, and the CUDA memory copy of the
results from the Device back to the Host. These procedures are a substantial
part of the GPU algorithm, since they are required in order to print the
results or use them later in the execution. The newer NVIDIA Fermi and Kepler
cards, with compute capability 2.0 or higher, support printf() calls from
inside the kernel. This means that the final copy of the results from the GPU
memory to the host memory could be avoided, providing even better speedups.
However, the GTX 280 cards have compute capability 1.x and do not support such
calls.

It can clearly be seen that all the algorithms exhibit a significant speedup
when running on the GPU, for both pattern sizes. This performance boost is
depicted in Figure 17 as an average over the two pattern sizes.

Figure 17 Average GPU vs CPU Speedup (per algorithm, ranging from 205% to 2167%)

As can be seen, the Naïve, KR and KMP algorithms show the most significant
speedups. Furthermore, it is worth mentioning that all algorithms exhibited at
least a two-fold performance increase compared to their CPU implementations. As
the two architectures are totally different, Amdahl's law cannot be applied to
this case; the differences between the speedups are instead due to the
different structure and complexity of each algorithm. For example, the
significant speedup of the Naïve algorithm is due to its O(mn) complexity even
in the best case, which gives it the worst best case of all the algorithms, and
algorithms with larger complexities can be much more efficient when implemented
in CUDA. Moreover, the Horspool and Quick Search algorithms exhibited the
smallest speedups, as a significant portion of their structure is sequential:
their CUDA implementations first run once through the text on the CPU, using
the bad-character shift table to calculate the valid window positions, and only
then run on the GPU for those positions. So, a very large portion of the code
is still executed sequentially, limiting the speedup.

7.4. GPU vs the Cluster Comparison


As observed above, all GPU implementations exhibited a significant speedup
running on a single host. The following graph (Figure 18) compares the
execution times on a single host with the acceleration provided by using
multiple hosts; in this thesis, 10 hosts with the same hardware specifications
were used. It is worth mentioning that the large text file was chosen for these
experiments, so that there is a reasonable balance between the processing time
and the time required for gathering the results.

Figure 18 Single GPU vs MPI GPU Cluster Execution time (per algorithm, single-GPU and cluster times for 4 and 8 Byte patterns, in ms)

The 4, 8 suffixes represent the two different pattern sizes that were used, as
in the previous executions. Additionally, Table 3 presents the same execution
results with more details.

          MPI 4      MPI 8      Single 4    Single 8

Naïve     6.84 ms    7.24 ms    11.75 ms    17.63 ms
Hor       6.50 ms    6.21 ms    8.32 ms     8.32 ms
KR        6.40 ms    6.28 ms    8.79 ms     9.87 ms
KMP       7.00 ms    7.26 ms    9.12 ms     9.78 ms
QS        6.47 ms    6.30 ms    8.16 ms     8.33 ms
SA        7.14 ms    7.02 ms    13.66 ms    14.71 ms
SOr       6.98 ms    6.88 ms    13.66 ms    14.71 ms

Table 3 Single GPU vs MPI GPU Cluster Execution time

The MPI times were calculated as the sum of the total CUDA execution time and
the time needed for gathering the results from the nodes to the master worker.
The gathering step can be very time consuming, as it depends on the network
links between the nodes, which are much slower than intra-host communication.
Even with the network latency, all algorithms again exhibited a significant
performance increase, which is presented in more detail in Figure 19.

Figure 19 Average Single GPU vs MPI GPU Cluster Speedup (per algorithm, ranging from 132% to 243%)



It can clearly be seen that the speedup obtained on the MPI cluster is much
smaller than the CUDA speedup. This is due to the two memory copy procedures,
copying the data to the device and collecting the results back, which play a
significant role in the total GPU execution time. Figure 20 shows that these
two operations take, on average, about 59% of the total single-GPU execution
time, with a minimum of 38% and a maximum of 75% depending on the algorithm; in
some cases they take more time than the computation itself. Furthermore, this
portion represents the crucial part of the algorithms that cannot be
parallelized and, as Amdahl's law states, it works as a barrier to the speedup.
Specifically, Amdahl's law says that when the parallel portion is around 50%,
the maximum speedup that can be achieved is around 2x. Although many
architectural parameters affect this number, it is very interesting that the
law applies to this case, as the measured results are very close to the
specified barrier.
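Indeed, applying Amdahl's law with a parallel fraction of p = 0.5 gives a limit of 1 / (1 - 0.5) = 2 as the number of processors (or, here, nodes) grows.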

Figure 20 Single GPU Memory Copy and Execution times (per algorithm: memory copy to the device, computation, and memory copy back to the host, in ms)

It is worth mentioning that a crucial part of the MPI implementation is the
gathering of the results to the master worker, as Figure 21 depicts. This
process cannot be optimized much further and is very time consuming, as the
network links have far smaller bandwidth than PCI Express 2.0. On average,
gathering the data was responsible for 45% of the execution time across all
algorithms, with a minimum of 12% in Naive Search and a maximum of 71% in Quick
Search. Such percentages act as barriers to the optimizations that can be
performed, as the gathering of the results cannot be parallelized.

Figure 21 MPI Execution and Gather times (per algorithm: computation and result-gathering time, in ms)

Moreover, it can be seen in Figure 22 that the Karp Rabin, Horspool and Quick
Search algorithms performed best, in that order. The three of them managed to
search an 8 MB file and gather the resulting pattern occurrences in about
6.38 ms, on average over both pattern sizes.

Figure 22 Average MPI GPU Cluster execution time (per algorithm, in ms)

Finally, it is worth noting that, through this hybrid approach to string
matching, all algorithms exhibited at least a three-fold speedup. The Naïve
algorithm had the most significant performance increase of 49x compared to its
single-CPU version. Additionally, algorithms such as Karp Rabin, which ended up
having one of the best execution times, managed to run roughly 12x faster than
their single-CPU implementation, as presented in Figure 23.

Figure 23 CPU vs MPI GPU Cluster Speedup (per algorithm, ranging from 302% to 4889%)



8. Conclusions
In this thesis, parallel implementations of the Naïve, Knuth Morris-Pratt,
Horspool, Karp Rabin, Quick Search, Shift Or and Shift And exact string
matching algorithms were presented using the NVIDIA CUDA architecture. The
sequential and parallel implementations were compared in terms of running time
under different pattern sizes. The results showed that the parallel
implementations of algorithms such as Naïve Search were executed up to 21x
faster than the sequential algorithm. Furthermore, the Knuth Morris-Pratt and
Karp Rabin algorithms exhibited 9.6x and 7.7x speedups respectively, while the
rest of them exhibited at least a 2x increase. It was observed that the
speedups depended directly on the structure of each algorithm and on the
portion of the overall procedure that could be parallelized.

Furthermore, the algorithms were customized to run on a 10-node GPU cluster,
illustrating the performance acceleration offered by this model; MPI [2] was
used to handle the communication between the nodes. Running on the cluster, all
the algorithms exhibited a speedup from 1.3x to a maximum of 2.5x compared to
their single-GPU implementations. It was observed that the unavoidable data
transfers between the CPU and the GPU were responsible for approximately 59% of
the CUDA execution time, depending on the algorithm. It is very interesting
that this portion of the execution time is directly related to the speedup that
all algorithms exhibited at this stage, as it comes fully in line with Amdahl's
law. Another crucial part that could not be avoided was the gathering of the
results to the master worker, which was responsible for 45% of the MPI
execution time on average. As can be seen, both the memory copies and the
gathering of the results are very time consuming procedures, and in total they
take much more time than the actual search function.

In total, the Naïve algorithm had the most significant performance increase, of
48x, when comparing the MPI version with the sequential one. This substantial
speedup was due to the algorithm's higher complexity, as well as its simple
structure. However, the fastest algorithm for both patterns proved to be Karp
Rabin, exhibiting an 11.3x speedup in total, followed by Horspool, which was
slightly faster for 8 Byte patterns, and Quick Search.

Future research in the area of string matching, GPGPU parallel processing and
MPI could focus on intensive profiling of the implemented algorithms, on
shared-memory GPU optimizations, and on the latest NVIDIA Compute Capability
3.x, which introduces new features that could facilitate the execution by
replacing existing time consuming procedures, such as the copy of the sparse
result vector. Additionally, heterogeneous architectures using FPGAs could be
examined and compared, in terms of performance, with the existing
implementations.

9. Appendix

9.1. Naïve Search

9.1.1. MPI Implementation (mpiBrute.cpp)


#include  <iostream>  
#include  <fstream>  
#include  <cmath>  
#include  <stdlib.h>  
#include  <string.h>  
#include  <sys/time.h>  
 
using  namespace  std;  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiBrute.h"  
 
//  Error  handling  macros  
#define  MPI_CHECK(call)  \  
       if((call)  !=  MPI_SUCCESS)  {  \  
               cerr  <<  "MPI  error  calling  \""#call"\"\n";  \  
               my_abort(-­‐1);  }  
 
void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  x,  i,  k;  
 
  for  (x  =  0;  x  <  n;  x++)  {  
    k  =  0;  
 
    for  (i  =  0;  i  <  m;  i++)  
      if  (T[x  +  i]  ==  P[i])  
        ++k;  
 
    if  (k  ==  m)  
      result[x]  =  true;  
  }  
}  
 
//  Shut  down  MPI  cleanly  if  something  goes  wrong  
void  my_abort(int  err)  {  
  cout  <<  "Test  FAILED\n";  
  MPI_Abort(MPI_COMM_WORLD,  err);  
}  
 
//  Host  code  
int  main(int  argc,  char*  argv[])  {  
  int  i;  
 
  times  timeGPU  =  {0};    //  zero-initialise,  as  not  every  field  is  set  by  every  implementation  

  double  timeStartRead,  timeEndRead;  


  double  timeStartCuda,  timeEndCuda;  
  double  timeStartCPU,  timeEndCPU;  
  double  timeStartGather,  timeEndGather;  
 
  int  cudaThreadSize  =  512;  
  char  *filepath;  
  if  (argc  >  1)  {  
    cudaThreadSize  =  atoi(argv[1]);  
    filepath  =  argv[2];  
  }  else  {  
    filepath  =  "/nfs/iassael/str/world192.txt";  
  }  
  char  patternpath[]  =  "/nfs/iassael/str/pattern.txt";  
 
  char  *h_text;  
  char  *h_pattern;  
  bool  *h_result;  
  int  *h_result_final;  
 
  //  Initialize  MPI  state  
  MPI_CHECK(MPI_Init(&argc,  &argv));  
 
  //  Get  our  MPI  node  number  and  node  count  
  int  commSize,  commRank;  
  MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD,  &commSize));  
  MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD,  &commRank));  
 
  long  int  lenFullText;  
 
  timeStartRead  =  MPI_Wtime();  
  //Read  Pattern  File  
  ifstream  filePattern(patternpath);  
  if  (!filePattern.is_open())  {  
    cout  <<  "Rank"  <<  commRank  <<  ":  Cannot  open  pattern!"  <<  endl;  
    my_abort(1);  
  }    
  string  str_pattern((std::istreambuf_iterator<char>(filePattern)),  
      std::istreambuf_iterator<char>());  
  filePattern.close();  
 
  h_pattern  =  new  char[str_pattern.size()];  
  strcpy(h_pattern,  str_pattern.c_str());  
  int  lenp  =  strlen(h_pattern);  
 
  //  Read  Main  File  
  ifstream  fileMain(filepath);  
  if  (!fileMain.is_open())  {  
    cout  <<  "Rank"  <<  commRank  <<  ":  Cannot  open  text!"  <<  endl;  
    my_abort  (1);  
  }    
  fileMain.seekg(0,  std::ios::end);  
  lenFullText  =  fileMain.tellg();  
  timeEndRead  =  MPI_Wtime();  
 
  int  start  =  commRank  *  lenFullText  /  commSize;  
  int  stop  =  (commRank  +  1)  *  lenFullText  /  commSize  +  (lenp  -­‐  1);  
 
  if  (stop  >  lenFullText)  

    stop  =  lenFullText;  
 
  fileMain.seekg(start,  std::ios::beg);  
 
  //  allocate  one  extra  byte  and  zero-fill,  so  the  buffer  is  null-terminated  for  strlen()  
  h_text  =  (char  *)  calloc(stop  -  start  +  2,  sizeof(char));  
  fileMain.read(h_text,  stop  -­‐  start  +  1);  
  fileMain.close();  
 
  int  lent  =  strlen(h_text);  
 
  //  Calculate  Lenths  
  int  result_size  =  ceil(lenFullText  /  commSize);  
 
  //  Initialize  local  results  array  
  h_result  =  (bool  *)  malloc(result_size  *  sizeof(bool));  
  memset(h_result,  false,  result_size  *  sizeof(bool));  
 
  //  On  each  node,  run  computation  on  GPU  and  CPU  
  timeStartCuda  =  MPI_Wtime();  
  computeGPU(h_result,  result_size,  h_text,  lent,  h_pattern,  lenp,  
      cudaThreadSize,  timeGPU);  
  timeEndCuda  =  MPI_Wtime();  
 
  timeStartCPU  =  MPI_Wtime();  
  computeCPU(h_text,  lent,  h_pattern,  lenp,  h_result);  
  timeEndCPU  =  MPI_Wtime();  

  //  Start  timing  the  gathering  of  the  results  to  the  master  worker  
  timeStartGather  =  MPI_Wtime();  

  //  Initialize  global  results  array  
  if  (commRank  ==  0)  
    h_result_final  =  (int  *)  malloc(  
        result_size  /  lenp  *  commSize  *  sizeof(int));  
 
  //  Init  MPI  receive  buffer  
  MPI_Status  status;  
  const  int  buf  =  32;  
  int  position[buf];  
 
  if  (commRank  ==  0)  {  
    //  Copy  masters  results  
    int  results  =  0;  
    for  (i  =  0;  i  <  result_size;  i++)  
      if  (h_result[i])  {  
        h_result_final[results]  =  i;  
        results++;  
      }  
 
    //  Start  receiving  
    int  received  =  0;  
    int  finWorkers  =  1;  
    while  (finWorkers  <  commSize)  {  
      MPI_CHECK(MPI_Recv(&position,  buf,  MPI_INT,  MPI_ANY_SOURCE,    
              MPI_ANY_TAG,  MPI_COMM_WORLD,  &status));  
 
      //  Get  the  number  of  the  received  results  
      MPI_Get_count(&status,  MPI_INT,  &received);  
      for  (i  =  0;  i  <  received;  i++)  {  
        h_result_final[results]  =  position[i];  
        results++;  
      }  

      //  Worker  has  finished  transmission  


      if  (status.MPI_TAG  ==  1)  
        finWorkers++;  
    }  
  }  else  {  
    int  tmp_results  =  0;  
    for  (i  =  0;  i  <  result_size;  i++)  
      if  (h_result[i])  {  
        position[tmp_results]  =  start  +  i;  
        tmp_results++;  
 
        //  If  buffer  is  full  send  results  
        if  (tmp_results  >=  buf)  {  
          MPI_CHECK(MPI_Send(&position,  tmp_results,  MPI_INT,  0,  0,    
                MPI_COMM_WORLD  ));  
          tmp_results  =  0;  
        }  
      }  
 
    //  If  there  are  unsent  results  send  them  and  finish  transmission  
    if  (tmp_results  >  0)  {  
      MPI_CHECK(MPI_Send(&position,  tmp_results,  MPI_INT,  0,  1,  
              MPI_COMM_WORLD  ));  
    }  else  {  
      MPI_CHECK(MPI_Send(NULL,  0,  MPI_INT,  0,  1,  MPI_COMM_WORLD  ));  
    }  
  }  
  timeEndGather  =  MPI_Wtime();  
 
  if  (commRank  ==  0)  {  
    if  (dbgMode)  
      cout  <<  "name\t"  <<  "n\t"  <<  "m\t"  <<  "thr\t"  <<  "read\t"  <<  
"CPU\t"  
          <<  "cudaFunc\t"  <<  "preComp\t"  <<  "cudaHost\t"  
          <<  "cudaComp\t"  <<  "cudaDevice\t"  <<  "Gather"  <<  endl;  
    cout  <<  filepath  <<  "\t";  
    cout  <<  lent  <<  "\t"  <<  lenp  <<  "\t"  <<  cudaThreadSize  <<  "\t";  
    cout  <<  float(timeEndRead  -­‐  timeStartRead)  *  1000  <<  "\t";  
    cout  <<  float(timeEndCPU  -­‐  timeStartCPU)  *  1000     <<  "\t";  
    cout  <<  float(timeEndCuda  -­‐  timeStartCuda)  *  1000  <<  "\t";  
    //CUDA  times  
    cout  <<  timeGPU.cpuPrepro  <<  "\t"  <<  timeGPU.gpuMemHost  <<  "\t"  
        <<  timeGPU.gpuCompute  <<  "\t"  <<  timeGPU.gpuMemDevice  <<  "\t";  
    //Gather  times  
    cout  <<  float(timeEndGather  -­‐  timeStartGather)  *  1000  <<  endl;  
 
  }  
 
  //  Cleanup  
  MPI_CHECK(MPI_Finalize());  
 
  return  0;  

}  

9.1.2. MPI Implementation (mpiBrute.h)


typedef  struct  times  {  
  float  gpuMemHost;  
  float  gpuCompute;  
  float  gpuMemDevice;  
  float  cpuPrepro;  
}  times;  
 
#define  dbgMode  0  
 
extern  "C"  {  
  void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
          char  *h_pattern,  int  lenp,  int  size,  times  &timeGPU);  
  void  my_abort(int  err);  

}  

9.1.3. CUDA Implementation (mpiBrute.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  User  include  
#include  "mpiBrute.h"  
 
//  Device  code  
__global__  void  bruteforceGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    int  k  =  0;  
    int  i;  
 
    for  (i  =  0;  i  <  m;  i++)  
      if  (T[x  +  i]  ==  P[i])  
        ++k;  
 
    if  (k  ==  m)  
      result[x]  =  true;  
  }  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  ceil((size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock);  
}  
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  

  char  *d_pattern;  
  bool  *d_result;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
 
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent,  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  bruteforceGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  
lent,  lenp,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
}

9.2. Knuth Morris-Pratt

9.2.1. CUDA Implementation (mpiKMP.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiKMP.h"  
 
#define  chunk  2  
 
//  Device  code  
__global__  void  kmpGPU(const  char  *T,  const  char  *P,  const  int  n,  const  int  
m,  const  int  *kmpNext,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    int  i,  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int  stop  =  (x  +  1)  *  m  *  chunk  +  m  -­‐  1;  
    if  (stop  >  n)  
      stop  =  n;  
 
    i  =  0;  
    j  =  start;  
    while  (j  <  stop)  {  
      while  (i  >  -­‐1  &&  P[i]  !=  T[j])  
        i  =  kmpNext[i];  
      i++;  
      j++;  
      if  (i  >=  m)  {  
        result[j  -  i]  =  true;    /*  the  occurrence  starts  at  position  j  -  i  */  
        i  =  kmpNext[i];  
      }  
    }  
  }  
}  
 
void  preKmp(const  char  *P,  int  m,  int  kmpNext[])  {  
  int  i,  j;  
  i  =  0;  
  j  =  kmpNext[0]  =  -­‐1;  
  while  (i  <  m)  {  
    while  (j  >  -­‐1  &&  P[i]  !=  P[j])  
      j  =  kmpNext[j];  
    i++;  

    j++;  
    if  (i  <  m  &&  P[i]  ==  P[j])  
      kmpNext[i]  =  kmpNext[j];  
    else  
      kmpNext[i]  =  j;  
  }  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  
}  
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_kmpNext;  
 
  int  *h_kmpNext  =  (int*)  malloc((lenp  +  1)  *  sizeof(int));  
 
  clock_t  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  
  timeStartPrepro  =  clock();  
  preKmp(h_pattern,  lenp,  h_kmpNext);  
  timeEndPrepro  =  clock();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_kmpNext,  (lenp  +  1)  *  
sizeof(int)));  
 
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_kmpNext,  h_kmpNext,  (lenp  +  1)  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
 

  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  


 
  cudaEventRecord(stopMemHost,  0);  
 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  /  (lenp  *  chunk),  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  kmpGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
lenp,  d_kmpNext,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_kmpNext));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro  =  (float)  (timeEndPrepro  -  timeStartPrepro)  
      /  CLOCKS_PER_SEC  *  1000;  

}  

9.3. Horspool

9.3.1. CUDA Implementation (mpiHorsepool.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiHorspool.h"  
 
//  Device  code  
__global__  void  horsGPU(const  char  *T,  const  char  *P,  const  int  n,  const  
int  m,  
    const  int  *bmBc,  const  int  *preComp,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -­‐  m  &&  preComp[x]  ==  x)  {  
    bool  found  =  true;  
 
    char  c  =  T[x  +  m  -­‐  1];  
    for  (int  i  =  0;  i  <  m  -­‐  1;  ++i)  {  
      if  (P[m  -­‐  1]  !=  c  ||  P[i]  !=  T[x  +  i])  {  
        found  =  false;  
        break;  
      }  
    }  
    if  (found)  
      result[x]  =  true;  
  }  
}  
 
void  precomputeShifts(char  *T,  int  n,  int  m,  int  *bmBc,  int  *preComp)  {  
  int  i  =  0;  
  while  (i  <=  n  -­‐  m)  {  
    i  +=  bmBc[T[i  +  m  -­‐  1]];  
    preComp[i]  =  i;  
  }  
}  
 
void  preBmBc(char  *P,  int  m,  int  bmBc[])  {  
  int  i;  
 
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    bmBc[i]  =  m;  
  for  (i  =  0;  i  <  m  -­‐  1;  ++i)  {  
    bmBc[P[i]]  =  m  -­‐  i  -­‐  1;  
  }  
}  

 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  
}  
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_bmBc;  
  int  *d_preComp;  
  int  bmBc[ASIZE];  
  int  *h_preComp  =  (int*)  malloc(lent  *  sizeof(int));  
 
  double  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  shifts  
  timeStartPrepro  =  MPI_Wtime();  
  preBmBc(h_pattern,  lenp,  bmBc);  
  precomputeShifts(h_text,  lent,  lenp,  bmBc,  h_preComp);  
  timeEndPrepro  =  MPI_Wtime();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_bmBc,  ASIZE  *  sizeof(int)));  
  checkCudaErrors(cudaMalloc((void**)  &d_preComp,  lent  *  sizeof(int)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_bmBc,  bmBc,  ASIZE  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_preComp,  h_preComp,  lent  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  

 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  -­‐  lenp  +  1,  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  horsGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
lenp,  d_bmBc,  d_preComp,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_bmBc));  
  checkCudaErrors(cudaFree(d_preComp));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro  =  (timeEndPrepro  -­‐  timeStartPrepro)  *  1000;  
 
}  

9.4. Karp Rabin

9.4.1. CUDA Implementation (mpiKarpRabin.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  User  include  
#include  "mpiKarpRabin.h"  
 
//  Device  code  
__device__  int  memcmpGPU(const  char  *cs_in,  const  char  *ct_in,  unsigned  int  
n)  {  
  unsigned  int  i;  
  const  unsigned  char  *  cs  =  (const  unsigned  char*)  cs_in;  
  const  unsigned  char  *  ct  =  (const  unsigned  char*)  ct_in;  
 
  for  (i  =  0;  i  <  n;  i++,  cs++,  ct++)  {  
    if  (*cs  <  *ct)  {  
      return  -­‐1;  
    }  else  if  (*cs  >  *ct)  {  
      return  1;  
    }  
  }  
  return  0;  
}  
 
__global__  void  krGPU(const  char  *T,  const  char  *P,  const  int  n,  const  int  
m,  
    int  hx,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -­‐  m)  {  
    int  hy,  i;  
 
    /*  Preprocessing  */  
    for  (hy  =  i  =  0;  i  <  m;  ++i)  {  
      hy  =  ((hy  <<  1)  +  T[i  +  x]);  
    }  
 
    /*  Searching  */  
    if  (hx  ==  hy  &&  memcmpGPU(P,  T  +  x,  m)  ==  0)  
      result[x]  =  true;  
  }  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  
}  
 
//  CUDA  computation  on  each  node  

void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  


    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
 
  clock_t  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  
  timeStartPrepro  =  clock();  
  int  i,  hx;  
  for  (hx  =  i  =  0;  i  <  lenp;  ++i)  {  
    hx  =  ((hx  <<  1)  +  h_pattern[i]);  
  }  
  timeEndPrepro  =  clock();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
//  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks((lent  -­‐  lenp  +  1),  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  krGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  lenp,  
hx,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  

  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro  =  (float)  (timeEndPrepro  -  timeStartPrepro)  
      /  CLOCKS_PER_SEC  *  1000;  

}  
   

9.5. Quick Search

9.5.1. CUDA Implementation (mpiQuickSearch.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiQuickSearch.h"  
 
//  Device  code  
__global__  void  quickSearchGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  const  int  *bmBc,  const  int  *preComp,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -­‐  m  &&  preComp[x]  ==  x)  {  
    bool  found  =  true;  
 
    for  (int  i  =  0;  i  <  m;  ++i)  {  
      if  (P[i]  !=  T[x  +  i])  {  
        found  =  false;  
        break;  
      }  
    }  
    if  (found)  
      result[x]  =  true;  
  }  
}  
 
void  precomputeShifts(char  *T,  int  n,  int  m,  int  *bmBc,  int  *preComp)  {  
  int  i  =  0;  
  while  (i  <=  n  -­‐  m)  {  
    i  +=  bmBc[T[i  +  m]];  
    preComp[i]  =  i;  
  }  
}  
void  preBmBc(char  *P,  int  m,  int  bmBc[])  {  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  i++)  
    bmBc[i]  =  m  +  1;  
  for  (i  =  0;  i  <  m;  i++)  
    bmBc[P[i]]  =  m  -­‐  i;  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  
}  
 

//  CUDA  computation  on  each  node  


void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_bmBc;  
  int  *d_preComp;  
  int  bmBc[ASIZE];  
  int  *h_preComp  =  (int*)  malloc(lent  *  sizeof(int));  
 
  double  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  shifts  
  timeStartPrepro  =  MPI_Wtime();  
  preBmBc(h_pattern,  lenp,  bmBc);  
  precomputeShifts(h_text,  lent,  lenp,  bmBc,  h_preComp);  
  timeEndPrepro  =  MPI_Wtime();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_bmBc,  ASIZE  *  sizeof(int)));  
  checkCudaErrors(cudaMalloc((void**)  &d_preComp,  lent  *  sizeof(int)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_bmBc,  bmBc,  ASIZE  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_preComp,  h_preComp,  lent  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
//  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  -­‐  lenp  +  1,  threadsPerBlock);  
 

  //  Run  Main  kernel  


  cudaEventRecord(startCompute,  0);  
  quickSearchGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
      lenp,  d_bmBc,  d_preComp,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_bmBc));  
  checkCudaErrors(cudaFree(d_preComp));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro  =  (timeEndPrepro  -­‐  timeStartPrepro)  *  1000;  

}  

9.6. Shift Or

9.6.1. CUDA Implementation (mpiShiftOr.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiShiftOr.h"  
 
#define  chunk  2  
 
//  Device  code  
__global__  void  shiftOrGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  const  int  *S,  int  lim,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    unsigned  int  D,  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int  stop  =  (x  +  1)  *  m  *  chunk  +  m  -­‐  1;  
    if  (stop  >  n)  
      stop  =  n;  
 
    for  (D  =  ~0,  j  =  start;  j  <=  stop;  ++j)  {  
      D  =  (D  <<  1)  |  S[T[j]];  
      if  (D  <  lim)  
        result[j  -­‐  m  +  1]  =  true;  
    }  
 
  }  
}  
 
int  preSo(char  *T,  int  m,  int  S[])  {  
  unsigned  int  j,  lim;  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    S[i]  =  ~0;  
  for  (lim  =  i  =  0,  j  =  1;  i  <  m;  ++i,  j  <<=  1)  {  
    S[T[i]]  &=  ~j;  
    lim  |=  j;  
  }  
  lim  =  ~(lim  >>  1);  
  return  (lim);  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  

  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  


}  
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_S;  
  int  h_S[ASIZE];  
 
  clock_t  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  shifts  
  timeStartPrepro  =  clock();  
  int  lim  =  preSo(h_pattern,  lenp,  h_S);  
  timeEndPrepro  =  clock();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_S,  ASIZE  *  sizeof(int)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_S,  h_S,  ASIZE  *  sizeof(int),  cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  /  (lenp  *  chunk),  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  shiftOrGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
      lenp,  d_S,  lim,  d_result);  
  cudaEventRecord(stopCompute,  0);  

 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_S));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro = (timeEndPrepro - timeStartPrepro) / CLOCKS_PER_SEC * 1000;

}  
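The fragment below is not part of the thesis sources; it is a minimal illustrative
sketch of how a caller might report matches from the result buffer that computeGPU
fills in, assuming h_result[i] == true marks an occurrence of the pattern starting
at text position i (which is how shiftOrGPU writes its results). The function name
reportMatches is hypothetical.

#include <cstdio>

// Illustrative only: print every match position recorded in h_result.
void reportMatches(const bool *h_result, int lent, int lenp) {
  // Only positions 0 .. lent - lenp can start a full occurrence.
  for (int i = 0; i + lenp <= lent; ++i) {
    if (h_result[i])
      printf("match at position %d\n", i);
  }
}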

9.7. Shift And

9.7.1. CUDA Implementation (mpiShiftAnd.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiShiftAnd.h"  
 
#define  chunk  2  
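// Same chunked partitioning as in mpiShiftOr.cu: each thread scans chunk*m
// text characters plus an m-1 character overlap at the end of its window.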
 
//  Device  code  
__global__  void  shiftAndGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  const  int  *S,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    unsigned  int  D,  F;  
    int  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int stop = (x + 1) * m * chunk + m - 1;
    if  (stop  >  n)  
      stop  =  n;  
 
    F = 1 << (m - 1);
 
    /* Searching: bit i of D is 1 exactly when P[0..i] matches the text
       ending at position j, so D & F (bit m-1 set) signals a complete
       occurrence of P ending at T[j] and starting at j - m + 1. */
    for (D = 0, j = start; j <= stop; ++j) {
      D = ((D << 1) | 1) & S[T[j]];
      if (D & F)
        result[j - m + 1] = true;
    }
  }  
}  
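// preSA builds the Shift-And masks: S[c] has bit i set (1) exactly when
// P[i] == c, i.e. the complement of the Shift-Or table within the low m bits.
// Example: for P = "ab" (m = 2), S['a'] = 0b01 and S['b'] = 0b10.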
 
void  preSA(char  *P,  int  m,  int  S[])  {  
  unsigned  int  j;  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    S[i]  =  0;  
  for  (i  =  0,  j  =  1;  i  <  m;  ++i,  j  <<=  1)  {  
    S[P[i]]  |=  j;  
  }  
}  
 
int calcBlocks(int size, int threadsPerBlock) {
  return (size + threadsPerBlock - 1) / threadsPerBlock;
}
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_S;  
  int  h_S[ASIZE];  
 
  clock_t  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  shifts  
  timeStartPrepro  =  clock();  
  preSA(h_pattern,  lenp,  h_S);  
  timeEndPrepro  =  clock();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_S,  ASIZE  *  sizeof(int)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_S,  h_S,  ASIZE  *  sizeof(int),  cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  /  (lenp  *  chunk),  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  shiftAndGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
      lenp,  d_S,  d_result);  
  cudaEventRecord(stopCompute,  0);  

 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_S));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro = (timeEndPrepro - timeStartPrepro) / CLOCKS_PER_SEC * 1000;

}  

References
[1] Nvidia CUDA FAQ | NVIDIA Developer Zone. https://developer.nvidia.com/cuda-faq
(accessed 2013/01/27).
[2] Wikipedia Message Passing Interface.
http://en.wikipedia.org/w/index.php?title=Message_Passing_Interface&oldid=537683372
(accessed 2013/02/15).
[3] Wikipedia String searching algorithm - Wikipedia, the free encyclopedia.
http://en.wikipedia.org/wiki/String_matching (accessed 2013/01/28).
[4] Francisco G. Martin, Exact String Pattern Recognition. Escuela Universitaria de
Informática, Madrid, Spain.
[5] Dömölki, B., An algorithm for syntactic analysis. Computational Linguistics 1964,
3, 29–46.
[6] Morris, J. H.; Pratt, V. R. A Linear Pattern-Matching Algorithm; University of
California: Berkeley, 1970.
[7] Knuth, D. E.; Morris, J. H.; Pratt, V. R., Fast Pattern Matching in Strings. SIAM
Journal on Computing 1977, 6 (2), 323-350.
[8] Boyer, R. S.; Moore, J. S., A fast string searching algorithm. Commun. ACM
1977, 20 (10), 762-772.
[9] Horspool, R. N., Practical fast searching in strings. Software: Practice and
Experience 1980, 10 (6), 501-506.
[10] Galil, Z.; Seiferas, J., Time-space-optimal string matching. Journal of Computer
and System Sciences 1983, 26 (3), 280-294.
[11] Apostolico, A.; Giancarlo, R., The Boyer Moore Galil string searching strategies
revisited. SIAM J. Comput. 1986, 15 (1), 98-105.
[12] Karp, R. M.; Rabin, M. O., Efficient randomized pattern-matching algorithms.
IBM J. Res. Dev. 1987, 31 (2), 249-260.
[13] Sunday, D. M., A very fast substring search algorithm. Commun. ACM 1990, 33
(8), 132-142.
[14] Navarro, G., A guided tour to approximate string matching. ACM Comput. Surv.
2001, 33 (1), 31-88.
[15] Vasiliadis, G.; Antonatos, S.; Polychronakis, M.; Markatos, E. P.; Ioannidis, S.,
Gnort: High Performance Network Intrusion Detection Using Graphics
Processors. In Proceedings of the 11th international symposium on Recent
Advances in Intrusion Detection, Springer-Verlag: Cambridge, MA, USA, 2008;
pp 116-134.
[16] Roesch, M., Snort - Lightweight Intrusion Detection for Networks. In Proceedings
of the 13th USENIX conference on System administration, USENIX Association:
Seattle, Washington, 1999; pp 229-238.
[17] Fisk, M.; Varghese, G. Fast Content-Based Packet Handling for Intrusion
Detection; UCSD: 2001.
[18] Kouzinopoulos, C. S.; Margaritis, K. G., String Matching on a Multicore GPU
Using CUDA. In Proceedings of the 2009 13th Panhellenic Conference on
Informatics, IEEE Computer Society: 2009; pp 14-18.
[19] Nvidia, CUDA C Programming Guide. PG-02829-001_v5.0 ed.; Nvidia: 2012.
[20] Chong, J. 100x Speedups: Are they real?; Parasians Parallel Computing
Artisans: 2011.
[21] Wikipedia CUDA - Wikipedia, the free encyclopedia.
http://en.wikipedia.org/wiki/CUDA (accessed 2013/02/14).
[22] Wikipedia GPGPU - Wikipedia, the free encyclopedia.
http://en.wikipedia.org/wiki/GPGPU (accessed 2013/02/14).
[23] Nvidia Nsight Eclipse Edition | NVIDIA Developer Zone.
https://developer.nvidia.com/nsight-eclipse-edition (accessed 2013/02/14).
[24] Microsoft Visual Studio 2012 | Microsoft Visual Studio.
http://www.microsoft.com/visualstudio/eng/team-foundation-service (accessed
2013/02/14).
[25] Eclipse Foundation The Eclipse Foundation open source community website.
http://eclipse.org/ (accessed 2013/02/14).
[26] Khronos Group OpenCL - The open standard for parallel programming of
heterogeneous systems. http://www.khronos.org/opencl/ (accessed 2013/02/14).
[27] Wikipedia DirectCompute - Wikipedia, the free encyclopedia.
http://en.wikipedia.org/wiki/DirectCompute (accessed 2013/02/14).
[28] Microsoft MSDN C++ AMP Overview.
http://msdn.microsoft.com/en-us/library/vstudio/hh265136.aspx (accessed 2013/02/14).
[29] Yang, C.-T.; Huang, C.-L.; Lin, C.-F., Hybrid CUDA, OpenMP, and MPI parallel
programming on multicore GPU clusters. Computer Physics Communications
2011, 182 (1), 266-269.
[30] MPICH MPICH | High-performance and Portable MPI. http://www.mpich.org/
(accessed 2013/02/17).
[31] Ubuntu Ubuntu 10.04.4 LTS (Lucid Lynx). http://releases.ubuntu.com/lucid/
(accessed 2013/02/18).
[32] Wikipedia Network File System.
http://en.wikipedia.org/w/index.php?title=Network_File_System&oldid=536888184
(accessed 2013/02/13).
[33] Charras, C.; Lecroq, T., Handbook of Exact String Matching Algorithms. King's
College Publications: 2004.
[34] Smart Knuth Morris-Pratt | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=KMP&code=kmp
(accessed 2013/03/03).
[35] Tay, R. Demonstration of Exact String Matching Algorithms using CUDA 2011.
https://exactstrmatchgpu.googlecode.com/files/Raymond%20Tay%20Abstract%20Submission%202.pdf.
[36] Smart Horspool | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=HOR&code=hor
(accessed 2013/03/03).
[37] Smart Karp Rabin | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=KR&code=kr
(accessed 2013/03/03).
[38] Smart Quick Search | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=QS&code=qs
(accessed 2013/03/03).
[39] Baeza-Yates, R.; Gonnet, G. H., A new approach to text searching. Commun.
ACM 1992, 35 (10), 74-82.
[40] Smart Shift Or | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=SO&code=so
(accessed 2013/03/03).
[41] Smart Shift And | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=SA&code=sa
(accessed 2013/03/03).
[42] The Canterbury Corpus The Canterbury Corpus. http://corpus.canterbury.ac.nz/.
[43] Project Gutenberg PROJECT GUTENBERG - Free Books On-Line.
http://www.promo.net/pg/ (accessed 2013/02/18).
[44] Wikipedia Speedup.
http://en.wikipedia.org/w/index.php?title=Speedup&oldid=541180041 (accessed
2013/02/28).
[45] Amdahl, G. M., Validity of the single processor approach to achieving large scale
computing capabilities. In Proceedings of the April 18-20, 1967, spring joint
computer conference, ACM: Atlantic City, New Jersey, 1967; pp 483-485.
