John-Alexander M. Assael
iassael@gmail.com
Supervisor
Prof. Konstantinos G. Margaritis
kmarg@uom.edu.gr
Table of Contents
Acknowledgments i
Abstract ii
Glossary iii
1. Introduction 1
1.1. History of String Matching 1
1.2. Application Areas 3
1.2.1. Computational Biology 3
1.2.2. Signal Processing 4
1.2.3. Text Retrieval 4
1.2.4. Computer Security 4
1.3. The Speed Problem 5
1.4. Objective of this Thesis 5
2. Parallel Programming 6
2.1. From GPUs to General Purpose GPUs 6
2.2. CUDA A General-Purpose Parallel Computing Platform 8
2.3. Scalable Programming Model 10
2.4. Clusters with MPI and CUDA 11
3. CUDA Programming Model 14
3.1. Kernels 14
3.2. Thread Hierarchy 15
3.3. Memory Hierarchy 18
3.4. Heterogeneous Architecture 19
4. System model and approach 20
5. Hybrid String Matching 21
5.1. MPI Parallelization 21
5.2. CUDA Parallelization 27
6. String Matching Algorithms 29
6.1. Naive Search 29
6.1.1. Sequential Implementation 30
6.1.2. CUDA Implementation 30
6.2. Knuth Morris-Pratt 31
6.2.1. Sequential Implementation 31
6.2.2. CUDA Implementation 32
6.3. Horspool 33
6.3.1. Sequential Implementation 33
6.3.2. CUDA Implementation 34
6.4. Karp Rabin 35
6.4.1. Sequential Implementation 35
6.4.2. CUDA Implementation 36
6.5. Quick Search 37
6.5.1. Sequential Implementation 37
6.5.2. CUDA Implementation 38
6.6. Shift Or 39
6.6.1. Sequential Implementation 39
6.6.2. CUDA Implementation 40
6.7. Shift And 41
6.7.1. Sequential Implementation 41
6.7.2. CUDA Implementation 42
7. Performance Evaluation 43
7.1. Testing Methodology 43
7.1.1. Pattern Size and CUDA Threads 43
7.1.2. String Matching Test Files 44
7.2. Measuring Speedup 45
7.3. CPU vs GPU Comparison 47
7.4. GPU vs the Cluster Comparison 52
8. Conclusions 57
9. Appendix 59
9.1. Naïve Search 59
9.1.1. MPI Implementation (mpiBrute.cpp) 59
9.1.2. MPI Implementation (mpiBrute.h) 63
9.1.3. CUDA Implementation (mpiBrute.cu) 63
9.2. Knuth Morris-Pratt 66
9.2.1. CUDA Implementation (mpiKMP.cu) 66
9.3. Horspool 69
9.3.1. CUDA Implementation (mpiHorsepool.cu) 69
9.4. Karp Rabin 72
9.4.1. CUDA Implementation (mpiKarpRabin.cu) 72
9.5. Quick Search 75
9.5.1. CUDA Implementation (mpiQuickSearch.cu) 75
9.6. Shift Or 78
9.6.1. CUDA Implementation (mpiShiftOr.cu) 78
9.7. Shift And 81
9.7.1. CUDA Implementation (mpiShiftAnd.cu) 81
References 84
Acknowledgments
I would like to thank my supervisor, Prof. Konstantinos G. Margaritis, for his
support and insightful feedback throughout this research. I would also like to
thank the Parallel Distributed Processing Laboratory, as well as the Department
of Applied Informatics of the University of Macedonia, for providing me access
to the hardware that was used to run the experiments. Last but not least, I want
to thank my family and friends for all of their support.
Abstract
String Matching algorithms are responsible for finding occurrences of a
pattern within a large text. Many areas of Computer Science require
demanding string-matching procedures. Over the past years, Graphics
Processing Units have evolved into powerful parallel processors that
outperform Central Processing Units in scientific calculations. Moreover,
GPUs can be used in parallel, in computer clusters. This thesis attempts to
speed up seven major String Matching algorithms by exploiting the combined
power of several CUDA-enabled GPUs in a GPU cluster. The algorithms are
implemented and optimized to take advantage of an MPI distributed-memory
cluster and the CUDA parallel computing architecture. Finally, they are
compared, in terms of performance, with their sequential and single-GPU
implementations.
Glossary
CUDA: Compute Unified Device Architecture, a parallel computing platform by
NVIDIA that harnesses the power of Graphics Processing Units (GPUs) to
offer dramatic performance increases [1]
1. Introduction
String Matching algorithms are an important member of the String algorithms
family. Their duty is to find all the occurrences of one or several strings (also
called patterns) within a larger string or text [3]. In general, there are two
inputs given to each algorithm: the large text and the pattern that we are
looking for in it. It is noteworthy that in most algorithms the length of the text
is denoted by the letter 'n' and the length of the pattern by 'm', where m is less
than or equal to n. There are many algorithms that try to give optimal solutions
to this problem. In this thesis, we will analyze seven major String Matching
algorithms that, through hybrid hardware and software optimizations, exhibit a
dramatic performance boost. Finally, the Big O notation will be used to
express the complexity of the algorithms.
shifts that take place. Moreover, in the same year another efficient algorithm
was introduced by Boyer and Moore [8]. The BM algorithm is considered one
of the most efficient String Matching algorithms for usual applications, forming
the basis of most "search" commands. Several variants and enhancements of
the BM algorithm appeared in the following years, such as: Horspool [9],
Galil-Seiferas [10], Apostolico-Giancarlo [11], Karp-Rabin [12] and
Quick-Search [13]. In this thesis we will analyze the Horspool, Karp-Rabin and
Quick-Search algorithms, as all three are simple to implement in parallel
structures. In 1980, R. N. Horspool published a simplified version of BM,
which was related to the KMP algorithm. Horspool's algorithm introduced
various optimizations in order to achieve an average-case complexity of O(n)
on a random text, although its worst case is much worse than BM. The
Karp-Rabin algorithm was created in 1987 by M. Rabin and R. Karp, and it
made use of hash methods to search much more efficiently for the pattern
inside the given text. Additionally, it is especially widely used for
multiple-pattern searches. Finally, Quick Search was published in 1990 by
D. Sunday. The major unique feature of Quick Search was that it was
simplified to use only the bad-character shift table of BM, making it very fast
in practice for short patterns and large alphabets, and also much easier to
implement. This short history overview covers all the algorithms that were
implemented in this thesis.
Another early motivation came from the area of signal processing. This area
deals with speech and, in general, with sound pattern recognition. The main
problem is to recognize a specific message within a transmitted audio signal.
There are many problems that stem from this main idea, from speech
recognition to music recognition and signal error correction. In such problems
the main text is the encoded input signal, while a smaller encoded sequence
represents the pattern that is to be found.
The most common problem is the retrieval of parts of a larger text while
counting the occurrences of the pattern. Moreover, the problem of correcting
misspelled words in written text is one of the oldest potential applications.
Furthermore, as the World Wide Web expanded, a new need emerged for
scanning online content and categorizing it, in order to make it searchable by
all users. This is an extremely demanding task, as the size of the data to be
processed is enormous and parallel techniques are required in order to
process it efficiently.
During the last years, several areas of computer security have started making
use of pattern matching techniques in order to carry out various demanding
tasks, such as intrusion detection, file hash matching, virus scans and spam
filtering. However, all these tasks are quite demanding and run continuously,
reserving a notable amount of computing power for both online and offline
processes. Moreover, these tasks can become even more demanding as the
amount of data that has to be processed multiplies. For example, mail servers
have far more data to process than regular computers, while all these actions
have to be executed both fast and efficiently, consuming as little computing
power as possible. There are several GPU implementations of security
applications, such as GNORT [15] in 2008, a GPU version of the well-known
intrusion detection application SNORT [16].
2. Parallel Programming
This highly parallel model allows the accelerated execution of problems that
involve multiple computations on data that are not highly correlated; the
problem is distributed and the same operations are executed in parallel on
multiple threads and cores. It is very important that the computations are
independent and do not require previously calculated results in order to
proceed to the next operation. Otherwise, the performance exhibits a
significant drop, as the boost is due to the highly parallel model. As a result,
the more sequential the processing becomes, the more the execution time can
exceed the same program's CPU execution time, as the GPU cores are much
weaker than those of a similarly priced CPU.
The latest version 5.0 of NVIDIA CUDA comes with an Integrated Software
Environment based on C as a high-level programming language. The IDE is
called Nsight [23] and offers editions for Microsoft's Visual Studio [24], for
development under Microsoft Windows, and for the open-source Eclipse
platform [25], for Linux and Apple MacOSX development.
Nsight is now distributed as part of CUDA 5.0 and is equipped with a CUDA
Source Code Editor with syntax highlighting, CUDA-aware refactoring methods
and code completion. Moreover, it has an integrated debugger able to monitor
program variables across several CUDA threads, and it also gives the ability
to set breakpoints and perform single-step execution at both source-code and
assembly levels. Finally, Nsight Profiler is an advanced code profiler that
identifies performance bottlenecks using a unified CPU and GPU trace of
application activity, while it also offers automated analysis of optimization
opportunities. The algorithms implemented within the scope of this thesis were
developed under Nsight Eclipse Edition and CUDA 5.0.
This partitioning allows the threads to cooperate when solving each
sub-problem and, most importantly, such an architecture offers automatic
scalability. Each block of threads can be run on any of the available GPU
multiprocessors, either concurrently or sequentially; this gives any CUDA
program the ability to execute on any number of multiprocessors and to be
adjusted easily, since only the physical multiprocessor count needs to be
known. This scalable model allows applications to easily scale depending on
the demands and hardware of each user, as NVIDIA offers a wide range of
CUDA-enabled GPUs, from mainstream inexpensive (GeForce) to demanding
high-performance (Quadro and Tesla) computing products [19].
The first steps were made in the beginning of the nineties, and version 1.0 of
the interface was released in June 1994 [2]. MPI defines each node as a
process and is considered the standard for High Performance Computing
application development on distributed-memory architectures [29].
MPI defines the semantics and the syntax of a core of library routines, useful
to a wide range of developers writing portable message-passing programs.
There are MPI bindings for many languages, but the first ones were in Fortran
77, C and C++. There are several implementations of the MPI interface, and
many of them are totally free and available in the public domain.
One of the most widely used implementations is MPICH [30], a portable free
implementation of the message-passing interface for distributed-memory
applications that run in parallel. MPICH efficiently supports different
computation and communication platforms, including commodity clusters
(desktop systems, shared-memory systems, multicore architectures),
high-speed networks, proprietary high-end computing systems (Blue Gene,
Cray) and multiple operating systems (Windows, most flavors of UNIX
including Linux and MacOSX). The cluster that was used in this thesis was
supported by MPICH2 running under Ubuntu Linux.
As MPI is standardized, the code that is written is portable and is able to run
under any MPI implementation of the same architecture. Moreover, although
the performance may vary between different implementations, the calls that
are made have the same behavior on all of them, offering even more
portability to the parallel programs.
The most common use of MPI is in parallel programming for cluster systems,
as it handles all the communication between the nodes with simplicity and,
most important of all, in a standardized way. The main problem that is
encountered is the distributed-memory architecture of clustered systems: how
data is distributed, how the processes communicate when they need data
from another node, and many other issues. MPI is responsible for all the node
communication, such as handling data distribution and gathering. However, a
developer should always remember that passing data over the network is a
costly operation, so a balance must be found between computation and
communication.
In order to achieve this balance, nodes have to undertake more tasks before
the results are transmitted, minimizing the number of small network packets,
as in such demanding operations even the packet headers affect the total
performance. More specifically, if we had developed an algorithm to add two
same-length vectors, it would be much more efficient to assign parts of each
vector to each worker and send all the calculated results together when the
calculations have finished, rather than sending each calculated result to the
master worker. This strategy is generally the preferred one, especially when
there are no data dependencies.
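To make this strategy concrete, the following is a minimal MPI sketch of such a chunked vector addition. It is an illustrative example rather than code from this thesis; the vector length N, the illustrative data and the use of MPI_Gather() for the final collection are assumptions.

#include <mpi.h>
#include <stdlib.h>

// Sketch: each rank adds its own chunk of two vectors and the partial
// results are collected once at the end, instead of element by element.
int main(int argc, char **argv)
{
    const int N = 1 << 20;              // total vector length (illustrative)
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;               // elements per worker (assumes N divisible by size)
    double *a = (double *) malloc(chunk * sizeof(double));
    double *b = (double *) malloc(chunk * sizeof(double));
    double *c = (double *) malloc(chunk * sizeof(double));
    double *full = (rank == 0) ? (double *) malloc(N * sizeof(double)) : NULL;

    for (int i = 0; i < chunk; i++) {   // fill the local chunks with illustrative data
        a[i] = rank * chunk + i;
        b[i] = 2.0 * i;
    }

    for (int i = 0; i < chunk; i++)     // purely local computation, no communication
        c[i] = a[i] + b[i];

    // A single collective call gathers all partial results at the master (rank 0).
    MPI_Gather(c, chunk, MPI_DOUBLE, full, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}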
As can be seen, this balance is directly affected by the processing time, the
data dependencies and, as a result, the network communication. This thesis
focuses on using the GPU parallel architecture, taking advantage of its
tremendous performance boosts, on a computer cluster with CUDA-enabled
GPUs governed by MPI. Therefore, in this model there are two levels of
parallelism: the first level is between the nodes of the cluster, and the second
one is between the cores of the GPUs of each worker.
3.1. Kernels
The CUDA library extends C, giving the ability to define and execute
CUDA-specialized functions. These functions are called kernels and, as
opposed to regular C functions, they are executed N times in parallel by N
different CUDA threads [19]. Each kernel function is defined using the
__global__ specifier, and the number of CUDA threads that will execute the
kernel is defined within the <<<…>>> execution configuration syntax. The
unique thread ID that characterizes each thread executing the defined kernel
can be accessed inside the kernel through the built-in variable threadIdx.
The following kernel example (Exhibit 1) illustrates the addition of two vectors
A and B of size N, storing the result in vector C.
// Kernel definition
__global__ void vectorAdd(const float *A, const float *B, float *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    vectorAdd<<<1, N>>>(A, B, C);
    ...
}
Exhibit 1 Simple vectorAdd() kernel
At this point, we should note that there is a direct relation between the thread
ID and the thread Index. This relation exists so that we can manipulate our
data and cells with much more convenience.
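For reference, the standard mapping is the following: for a one-dimensional block the thread ID equals the thread index $x$; for a two-dimensional block of size $(D_x, D_y)$ the thread ID of the thread with index $(x, y)$ is $x + y \cdot D_x$; and for a three-dimensional block of size $(D_x, D_y, D_z)$ the thread ID of the thread with index $(x, y, z)$ is $x + y \cdot D_x + z \cdot D_x \cdot D_y$.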
The example in Exhibit 2 is based on Exhibit 1, but uses 2D matrices of N×N
dimensions. The code computes the addition of matrices A and B, and the
result is stored in matrix C.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
Exhibit 2 MatAdd() kernel
As the threads of a block will be executed on the same cores and will share
finite hardware and especially memory resources, there is an upper limit of
1024 threads per block on current NVIDIA GPUs. However, multiple blocks
can execute the same kernel. Therefore, blocks are also organized into 1D,
2D or 3D virtual grids of thread blocks. The number of thread blocks inside a
virtual grid depends on the size of the data that is to be processed and on the
number of processors that our hardware is equipped with.
The developer defines the number of threads per block and the number of
blocks per grid with the <<<...>>> syntax, using an int or dim3 variable. The
block index can be accessed within the kernel using the blockIdx variable, in
the same way as threadIdx. Moreover, the dimension of a block is obtained
using the built-in blockDim variable. The following example extends the
previous MatAdd() to handle multiple blocks.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
Exhibit 3 MatAdd() with Multiple Blocks
One of the most common choices, which is also used in this case, is a thread
block sized 16×16, meaning 256 threads. The number of blocks is then
calculated by dividing the total number of elements N by the number of
threads per block in each dimension. Moreover, thread blocks are executed
independently and can be scheduled in any order across any of the available
cores of the system. This facilitates the scalability of the programs being
developed, as they can easily be scaled up in relation to the number of cores
we have at our disposal.
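For example, with an illustrative N = 1024 elements per dimension and 16×16 thread blocks, the grid would consist of 1024/16 = 64 blocks in each dimension, i.e. 64 × 64 = 4,096 blocks of 256 threads each, or 1,048,576 threads in total.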
In addition, all threads have read-only access to the constant and texture
memory spaces. More specifically, CUDA offers the global, constant and
texture memory spaces, which are optimized for different types of operations
and usage. These memory spaces are persistent across kernel launches, and
the constant and texture spaces cannot be modified while a kernel is being
executed.
- Device memory allocation and copy: the data are transferred from host
memory to the reserved space in device memory.
- Data processing: the kernels are loaded and executed on the blocks of
threads.
- Device memory de-allocation: the processed data are transferred from
device memory back to host memory, freeing up the reserved space.
Exhibit 4 MPI Source - Initialization
timeStartRead = MPI_Wtime();

// Read pattern file
ifstream filePattern(patternpath);
if (!filePattern.is_open()) {
    cout << "Rank" << commRank << ": Cannot open pattern!" << endl;
    my_abort(1);
}
string str_pattern((std::istreambuf_iterator<char>(filePattern)),
                   std::istreambuf_iterator<char>());
filePattern.close();
h_pattern = new char[str_pattern.size() + 1];   // +1 for the terminating null
strcpy(h_pattern, str_pattern.c_str());
int lenp = strlen(h_pattern);
Exhibit 5 MPI Source - Read Pattern
The same procedure is followed in order to read the text file. However, in this
case each node reads only the chunk that it was assigned. First, all nodes
read the total file size and then, using the rank assigned to them by MPI,
select a different chunk of the file. For example, if there were 5 nodes in total,
the file would be split into 5 imaginary chunks and each node would read its
own part: the first node would read the first chunk, the second node the
second chunk, and so on. This is achieved by using the start and stop
auxiliary variables, which indicate the start and the end position when each
node is reading the text file.
$$\mathit{start} = \mathit{rank} \cdot \frac{n}{\mathit{MPI}}$$

$$\mathit{stop} = (\mathit{rank} + 1) \cdot \frac{n}{\mathit{MPI}} + (m - 1)$$
The length of the text that is to be processed by each node is slightly larger
than the integer division of the total text length by the number of nodes, as it
also contains the next m − 1 characters, covering the case where an
occurrence of the pattern starts at the last characters of the chunk.
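A minimal sketch of how these boundaries can be computed on each node is given below; the variable names follow the appendix sources, and the adjustment for the last rank is an assumption based on the stop = lenFullText statement that appears there.

// Illustrative helper: chunk boundaries read by MPI rank `rank` out of `nodes` workers.
void chunkBounds(long lenFullText, int rank, int nodes, int lenp, long *start, long *stop)
{
    long chunk = lenFullText / nodes;                 // n / (number of MPI nodes)
    *start = (long) rank * chunk;                     // first character this rank reads
    *stop = (long) (rank + 1) * chunk + (lenp - 1);   // overlap of m - 1 characters
    if (rank == nodes - 1)
        *stop = lenFullText;                          // the last rank reads to the end of the file
}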
$$\max \mathit{results} = \frac{\mathit{result\_size}}{m} \cdot \mathit{MPI}$$
The computeGPU() function is called with the text and the pattern as its main
parameters and calls the CUDA-implemented string matching kernel to
calculate the results on the local GPU card. The same algorithm is also
implemented in C and is called by the computeCPU() function, in order to
compare the execution times. The implementations of both functions differ
depending on the algorithm, while the remaining code is almost the same for
all of them.
// Calculate lengths
int result_size = ceil(lenFullText / commSize);

// Initialize local results array
h_result = (bool *) malloc(result_size * sizeof(bool));
memset(h_result, false, result_size * sizeof(bool));

// On each node, run computation on GPU and CPU
timeStartCuda = MPI_Wtime();
computeGPU(h_result, result_size, h_text, lent, h_pattern, lenp, cudaThreadSize, timeGPU);
timeEndCuda = MPI_Wtime();

timeStartCPU = MPI_Wtime();
computeCPU(h_text, lent, h_pattern, lenp, h_result);
timeEndCPU = MPI_Wtime();

if (commRank == 0)
    h_result_final = (int *) malloc(result_size / lenp * commSize * sizeof(int));
Exhibit 7 MPI Source - Execute GPU and CPU algorithms
The next step after execution is to gather the results on the master worker.
However, the major part of the results vector will consist of false values, as it
is not usual for a pattern to be found that many times. The MPI_Gather()
function could be used, but it would not be time-efficient.
timeStartGather = MPI_Wtime();
MPI_Status status;
const int buf = 32;
int position[buf];

if (commRank == 0) {
    // Copy master's results
    int results = 0;
    for (i = 0; i < result_size; i++)
        if (h_result[i]) {
            h_result_final[results] = i;
            results++;
        }

    // Start receiving
    int received = 0;
    int finWorkers = 1;
    while (finWorkers < commSize) {
        MPI_CHECK(MPI_Recv(&position, buf, MPI_INT, MPI_ANY_SOURCE,
                           MPI_ANY_TAG, MPI_COMM_WORLD, &status));

        // Get the number of the received results
        MPI_Get_count(&status, MPI_INT, &received);
        for (i = 0; i < received; i++) {
            h_result_final[results] = position[i];
            results++;
        }

        // Worker has finished transmission
        if (status.MPI_TAG == 1)
            finWorkers++;
    }
}
Exhibit 8 MPI Source – Gather Master
The master worker first copies the local result positions to the final results
vector, and then starts receiving using MPI_Recv() until all workers have
completed transmission. In order to keep track of the number of finished
workers, the last transmission of each one is marked with the value 1 instead
of 0 as MPI_TAG.
else {
    int tmp_results = 0;
    for (i = 0; i < result_size; i++)
        if (h_result[i]) {
            position[tmp_results] = start + i;
            tmp_results++;

            // If buffer is full, send results
            if (tmp_results >= buf) {
                MPI_CHECK(MPI_Send(&position, tmp_results, MPI_INT, 0, 0, MPI_COMM_WORLD));
                tmp_results = 0;
            }
        }

    // If there are unsent results, send them and finish transmission
    if (tmp_results > 0)
        MPI_CHECK(MPI_Send(&position, tmp_results, MPI_INT, 0, 1, MPI_COMM_WORLD));
    else
        MPI_CHECK(MPI_Send(NULL, 0, MPI_INT, 0, 1, MPI_COMM_WORLD));
}
timeEndGather = MPI_Wtime();
Exhibit 9 MPI Source – Gather Workers
Each worker scans the local results array and copies the positions to the
buffer; when the buffer is full, it is sent to the master using MPI_Send(). The
occurrence positions are calculated using the reading start position within the
full text and the local chunk's occurrence position, so that the values point to
the original text.
First of all, the text and the pattern vectors are initialized and copied to the
GPU memory. In order to achieve this, the vectors are duplicated; however,
they are allocated using the cudaMalloc() function instead of the ANSI C
malloc(). For ease of use and readability, the GPU variables start with "d_",
standing for "device". Then, the cudaMemcpy() function is used to duplicate
the text and pattern variables in GPU memory, and cudaMemset() initializes
the results Boolean vector with false values.
cudaEventRecord(startMemHost, 0);

// Allocate data on GPU memory
checkCudaErrors(cudaMalloc((void**) &d_text, lent * sizeof(char)));
checkCudaErrors(cudaMalloc((void**) &d_pattern, lenp * sizeof(char)));
checkCudaErrors(cudaMalloc((void**) &d_result, lent * sizeof(bool)));

// Copy to GPU memory
checkCudaErrors(cudaMemcpy(d_text, h_text, lent * sizeof(char), cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_pattern, h_pattern, lenp * sizeof(char), cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemset(d_result, false, lent * sizeof(bool)));

cudaEventRecord(stopMemHost, 0);
Exhibit 10 computeGPU() Memory Initialization
depending on the thread sizes. The function that is used for 1D vectors, such
as the text that is being processed, is presented below.

$$\mathit{blocks} = \frac{\mathit{vector\_size} + \mathit{threads\_per\_block} - 1}{\mathit{threads\_per\_block}}$$
The vector size in this case is the size of the results vector, which has the
same length as the text for ease of calculation, as linked lists are too complex
to be used easily by parallel architectures.
Finally, after the kernel has been executed with the specified block and thread
sizes using the <<<…>>> kernel call, the result vector is copied back to the
host memory. The device variables are freed using cudaFree(), and the
following execution times are calculated: a) vector initialization and memory
copy from Host to Device, b) kernel execution, c) memory copy of the results
from Device to Host.
Pre-calculation: −   Search: O(mn)
The first algorithm that was developed was the Naïve algorithm (also known
as Brute-Force), which is based on the straightforward idea of sliding the
pattern along the text and comparing it to each portion of the text. The
algorithm tries to match the first character of the text with the first character of
the pattern; in case of success, it tries to match the second, and so on.
Otherwise, the algorithm continues by sliding the pattern to the next character
of the text. The outer loop is responsible for sliding to the next character, while
the inner loop tries to match the characters with the pattern. However, this
first and simple approach is very slow, taking O(nm) time [4]. Finally, its
implementation can easily be converted into a fuzzy string-matching algorithm
by altering the condition, so that the number of matches is counted even if a
mismatch is found. This can be very useful in Computational Biology
applications, but makes it the most time-consuming algorithm.
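To indicate how this maps to the GPU, the following is a minimal kernel sketch in the spirit of the bruteforceGPU() kernel invoked in the appendix; it is a sketch of the general technique (one thread per candidate text position), not necessarily the exact thesis code.

// Sketch: each CUDA thread tests one candidate position of the text.
__global__ void bruteforceGPU(char *text, char *pattern, int n, int m, bool *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > n - m)
        return;                      // positions past n - m cannot hold a full match

    int j = 0;
    while (j < m && text[i + j] == pattern[j])
        j++;                         // compare the window starting at i with the pattern

    if (j == m)
        result[i] = true;            // the pattern occurs at position i
}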
The Knuth-Morris-Pratt [7] algorithm was conceived by D. Knuth and V. Pratt,
and independently by J. Morris, in 1974. The three of them published it
together in 1977. The algorithm is based on the Naïve algorithm and reduces
the number of comparisons by using the information learnt in the inner loop to
determine how many skips should take place in the outer loop [33]. The KMP
algorithm uses the pattern to pre-compute this number of skips and starts
searching like the naïve algorithm. However, in case of a mismatch, it uses
the pre-computed skip vector in order to determine the position from which to
continue.
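The skip vector is built by a preprocessing routine; the standard preKmp() formulation, whose tail matches the fragment listed in the appendix, is sketched below for completeness.

/* Sketch of the standard KMP preprocessing; kmpNext holds the skip positions. */
void preKmp(char *P, int m, int kmpNext[])
{
    int i = 0;
    int j = kmpNext[0] = -1;

    while (i < m) {
        while (j > -1 && P[i] != P[j])
            j = kmpNext[j];          /* fall back through previously computed borders */
        i++;
        j++;
        if (i < m && P[i] == P[j])
            kmpNext[i] = kmpNext[j];
        else
            kmpNext[i] = j;
    }
}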
void computeCPU(char *T, int n, char *P, int m, bool *result)
{
    int i, j;
    int *kmpNext = (int*) malloc((m + 1) * sizeof(int));

    /* Preprocessing */
    preKmp(P, m, kmpNext);

    /* Searching */
    i = j = 0;
    while (j < n) {
        while (i > -1 && P[i] != T[j])
            i = kmpNext[i];
        i++;
        j++;
        if (i >= m) {
            result[j - i] = true;
            i = kmpNext[i];
        }
    }
}
Exhibit 15 C implementation of KMP [34]
6.3. Horspool
For each position of the window, the algorithm compares the last character of
the window with the last character of the pattern and, if they match, continues
the comparison backwards. Then, it shifts the window so that the last
character of the previous window is aligned with its rightmost occurrence in
the pattern.
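The window shifts are driven by the bad-character shift table, which is computed from the pattern alone. A sketch of the usual preBmBcCPU()-style routine is shown below; the ASIZE value of 128 is an assumption matching the ASCII test files, and the exact thesis routine may differ.

#define ASIZE 128   /* alphabet size, assumed to match the 128-character ASCII test files */

/* Sketch of the standard Horspool bad-character table computation. */
void preBmBcCPU(char *P, int m, int bmBc[])
{
    int i;
    for (i = 0; i < ASIZE; ++i)
        bmBc[i] = m;                 /* characters absent from the pattern shift by its full length */
    for (i = 0; i < m - 1; ++i)
        bmBc[P[i]] = m - i - 1;      /* distance from each pattern character to the pattern's end */
}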
The CUDA implementation pre-calculates the shifts on the CPU and then
copies the shift table to the GPU memory. Using this table, each thread
checks whether its current position is a valid shift, in order to continue
execution [35].
void computeCPU(char *T, int n, char *P, int m, bool *result)
{
    int j, bmBc[ASIZE];
    char c;

    /* Preprocessing */
    preBmBcCPU(P, m, bmBc);

    /* Searching */
    j = 0;
    while (j <= n - m) {
        c = T[j + m - 1];
        if (P[m - 1] == c && memcmp(P, T + j, m - 1) == 0)
            result[j] = true;
        j += bmBc[c];
    }
}
Exhibit 18 C implementation of Horspool [36]
The Karp-Rabin algorithm was created in 1987 by M. Rabin and R. Karp [12]
and makes use of hash methods to search for the pattern inside the given
text. It is especially widely used for multiple-pattern searches. The algorithm
calculates a hash value for the pattern and one for the current search window.
If the hash values are unequal, it calculates the hash of the next window. The
algorithm moves the text window from left to right and compares it with the
pattern; when a mismatch is found at pattern position i, it shifts the pattern
m − i positions to the right and continues the search.
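The hashing idea can be made concrete with a rolling hash, where the hash of the next window is derived from the previous one in constant time. The following is a minimal sketch using the common shift-based rehash; it illustrates the technique and is not necessarily the exact hash used in the thesis implementation.

#include <string.h>    /* memcmp */
#include <stdbool.h>   /* bool   */

/* Sketch of a Karp-Rabin style search with a shift-based rolling hash. */
void karpRabinCPU(char *T, int n, char *P, int m, bool *result)
{
    int i, j;
    unsigned int d = 1, hp = 0, ht = 0;

    for (i = 0; i < m - 1; ++i)
        d <<= 1;                                   /* d = 2^(m-1), weight of the leading character */

    for (i = 0; i < m; ++i) {                      /* hashes of the pattern and of the first window */
        hp = (hp << 1) + P[i];
        ht = (ht << 1) + T[i];
    }

    for (j = 0; j <= n - m; ++j) {
        if (hp == ht && memcmp(P, T + j, m) == 0)
            result[j] = true;                      /* verify a hash hit character by character */
        if (j < n - m)
            ht = ((ht - T[j] * d) << 1) + T[j + m];   /* roll the hash to the next window */
    }
}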
The CUDA implementation pre-calculates the shifts on the CPU and then
copies the shift table to the GPU memory. Using this table, each thread
checks whether its current position is a valid shift, in order to continue
execution [35].
void computeCPU(char *T, int n, char *P, int m, bool *result)
{
    int i, s, qsbc[ASIZE];

    /* Preprocessing */
    preQsBcCPU(P, m, qsbc);

    /* Searching */
    s = 0;
    while (s <= n - m) {
        i = 0;
        while (i < m && P[i] == T[s + i])
            i++;
        if (i == m)
            result[s] = true;
        s += qsbc[T[s + m]];
    }
}
Exhibit 23 C implementation of Quick Search [38]
6.6. Shift Or
The Shift Or [39] algorithm uses bitwise techniques to search within the
specified text. It is very efficient if the pattern length is not longer than the
memory word size of the host machine. The bitwise operations mark all the
potential text and pattern matches. The Shift Or algorithm uses a
complemented bit mask D in order to avoid one of the final bit operations of
Shift And.
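The preprocessing routine preSoCPU() called in the listing below is not reproduced in this section; a sketch of the standard Shift-Or preprocessing, on which such a routine is typically based, is the following (the function name and the ASIZE value are assumptions).

#define ASIZE 128   /* alphabet size, assumed to match the ASCII test files */

/* Sketch of standard Shift-Or preprocessing: S[c] has a zero bit at position i
   exactly when P[i] == c; the returned value lim marks the bit of a full match. */
unsigned int preSo(char *P, int m, unsigned int S[])
{
    unsigned int j, lim;
    int i;

    for (i = 0; i < ASIZE; ++i)
        S[i] = ~0;                   /* by default, every character mismatches everywhere */
    for (lim = 0, i = 0, j = 1; i < m; ++i, j <<= 1) {
        S[P[i]] &= ~j;               /* clear the bit where this character matches the pattern */
        lim |= j;
    }
    lim = ~(lim >> 1);
    return lim;
}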
void computeCPU(char *T, int n, char *P, int m, bool *result)
{
    unsigned int lim, D;
    unsigned int S[ASIZE];
    int j;

    /* Preprocessing */
    lim = preSoCPU(P, m, S);

    /* Searching */
    for (D = ~0, j = 0; j < n; ++j) {
        D = (D << 1) | S[T[j]];
        if (D < lim)
            result[j - m + 1] = j - m + 1;
    }
}
Exhibit 26 C implementation of Shift Or [40]
The Shift And [39] algorithm is very similar to the Shift Or and uses bitwise
techniques to search within the specified text. Their major difference is that
Shift And has one bitwise operation more than Shift Or.
void computeCPU(char *T, int n, char *P, int m, bool *result)
{
    unsigned int D;
    unsigned int S[ASIZE], F;
    int j;

    /* Preprocessing */
    preSACPU(P, m, S);
    F = 1 << (m - 1);

    /* Searching */
    for (D = 0, j = 0; j < n; ++j) {
        D = ((D << 1) | 1) & S[T[j]];
        if (D & F)
            result[j - m + 1] = j - m + 1;
    }
}
Exhibit 29 C implementation of Shift And [41]
Exhibit 30 CUDA Implementation of Shift And
7. Performance Evaluation
The experiments were made using two different pattern sizes of 4 bytes and 8
bytes. The size of the pattern directly affects most of the algorithms; hence,
the difference in performance is illustrated by choosing two typical pattern
sizes. Moreover, another parameter that affects the execution time, as well as
the memory throughput, is the number of CUDA threads that run on each
block, which is 256 or more. The number of blocks is directly dependent on
the number of threads, and is calculated with the following function.
$$\mathit{blocks} = \frac{\mathit{vector\_size} + \mathit{threads\_per\_block} - 1}{\mathit{threads\_per\_block}}$$
However, on cards with compute capability [21] 1.x, such as the GeForce
GTX 280, there is an upper limit of 65,535 blocks in one dimension. So, a
larger number of 512 threads per CUDA block was chosen, in order to be able
to process larger text files without exceeding this limit.
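For example, with 512 threads per block, the 65,535-block limit corresponds to 65,535 × 512 = 33,553,920 text positions that can be processed in a single kernel launch.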
The following files compose the testing material that was used in the
experiments. They are widely known, and commonly used in string matching
tests. The texts were obtained from The Canterbury Corpus [42] and from
Project Gutenberg [43].
[Table of test files: Filename, Size]
The alphabet of the files is limited to the 128 simple ASCII characters.
Furthermore, the selection of files was made in order to exhibit the
performance differences under variable file sizes. It is an indisputable fact
that the performance of processing small files on a cluster would exhibit a
significant drop due to the network overhead and the GPU memory transfer
time. So, the use of a GPU cluster would be effective mainly on large files.
$$S = \frac{T_{\mathit{sequential}}}{T_{\mathit{parallel}}}$$
Amdahl's Law cannot be applied to the CUDA versus CPU comparison, as the
processor speeds are totally different. However, it could be used to
approximate the MPI times. As far as CUDA is concerned, NVIDIA has
provided technical reports that claim CUDA to be 30x faster in most cases
and sometimes even 100x faster, depending on the algorithm structure and
complexity [20]. The comparison was made between GPU cards and CPUs of
the same price range. It is an indisputable fact that speedups of 100x can be
extremely important for science. In order to understand it better, imagine a
problem that takes 12 hours. A 10x speedup would mean that the execution
completes in 1.2 hours. A 100x speedup means 7.2 minutes, and a 1000x
speedup 43.2 seconds! The maximum speedup that was exhibited in this
thesis was approximately 21x. This is due to the nature of the algorithms that
were implemented, as CUDA performs better on problems of O(n²) and O(n³)
complexity, whereas most String Matching algorithms usually have O(n)
complexity.
As the alphabet size was the same for all execution experiments, a correlation
between the text size and the execution time was observed, as expected. In
the following graph the search execution times of CPU and GPU are
presented for the Naïve algorithm (pattern length 8, the GPU memory copy
times are excluded).
[Chart: search execution time in ms versus text size (481,861 B to 8,441,343 B), 0–400 ms scale, with a linear trend line]
[Chart: search execution time in ms versus text size (481,861 B to 8,441,343 B), 0–12 ms scale, with a linear trend line]
The patterns were chosen among the most frequent words of the text. The
first one was the word "last" and the other one was the word "probably". Both
words appear about 530 times within the text, which is approximately 0.006%
of the text.
The following graph (Figure 16) compares the execution times of CPU and
GPU implementations of the 7 string matching algorithms, under different
pattern sizes.
[Figure 16: CPU and GPU execution times (0–120 ms) of the string matching algorithms (Naïve, Hor, KR, KMP, QS, SA) for pattern sizes of 4 and 8 bytes]
The CPU execution times of the Naïve algorithm were outside the visible chart
area, but are available in more detail in Table 2.
All the presented execution times for both CPU and GPU include any
necessary preprocessing part. However, it was observed that, as the
complexity of the preprocessing depends on the pattern size and the alphabet
size, the preprocessing times for most algorithms were extremely small for
both patterns, even for the 8-byte pattern. As a result, the preprocessing part
of the selected algorithms is quite insignificant for the selected pattern sizes
and does not affect the total execution time considerably.
The CPU execution time, measured as the execution time of the
computeCPU() function, is represented by the CPU bars. The GPU execution
time was calculated as the sum of the CUDA memory copy from the Host to
the Device, the preprocessing functions, the kernel execution, and the CUDA
memory copy time of the results from the Device back to the Host. These
procedures are a substantial part of the GPU algorithm, required in order to
print the results or use them later in the execution. The newer NVIDIA Fermi
and Kepler cards, with compute capability >2.0, support printf() function calls
from inside the kernel. This means that the final copy of the results from the
GPU memory to the Host memory could be avoided, providing even better
performance speedups. However, the GTX 280 cards have compute
capability 1.x and do not support such functions.
It can clearly be seen, that all the algorithms exhibit a significant speedup
when running under GPU, for both pattern sizes. Additionally, this
performance boost is depicted in Figure 17 as an average of the two pattern
sizes.
[Figure 17: average GPU over CPU speedup per algorithm (Naïve, Hor, KR, KMP, QS, SA, SOr), 0–2000% scale]
As can be seen, the Naïve algorithm, as well as KR and KMP, shows the most
significant performance speedups. Furthermore, it is worth mentioning that all
algorithms exhibited at least a two-fold performance increase compared to
their CPU implementations. As the two architectures are totally different,
Amdahl's law cannot be applied to this case. However, these differences
between the speedups are due to the different structure and complexity of
each algorithm. For example, the significant speedup that the Naïve algorithm
exhibited is due to the algorithm's O(mn) complexity even in the best case,
thus giving it the worst best case of all. It is a fact that algorithms with larger
complexities can be much more efficient when implemented in CUDA.
Moreover, the Horspool and KMP algorithms exhibited the smallest speedup,
as a significant portion of their structure is sequential. Specifically, the CUDA
implementation of the algorithm runs once on the CPU through the text, using
the bad-character shift table to calculate the shifts, and then runs on the GPU
for the specified positions. So, a very large portion of the code was still
executed sequentially, limiting the speedup.
[Chart: single-GPU versus MPI cluster execution times (0–15 ms) per algorithm (Naïve, Hor, KR, KMP, QS, SA), for pattern sizes 4 and 8]
The 4, 8 suffixes represent the two different pattern sizes that were used, as
in the previous executions. Additionally, Table 3 presents the same execution
results with more details.
The MPI times were calculated as the sum of the total CUDA execution time
and the time needed to gather the results from the nodes to the master
worker. The gathering function can be very time consuming, as it depends on
the network communication links between the nodes, which are much slower
than intra-host communication. Even with the network latency, all algorithms
again exhibited a significant performance increase, which is presented in
more detail in Figure 19.
[Figure 19: MPI cluster speedup over the single GPU per algorithm (Naïve, Hor, KR, KMP, QS, SA, SOr), 0–100% scale]
It can clearly be seen that the speedup of MPI is much smaller than the CUDA
speedup. This is due to the two memory copy procedures, copying the data to
the device and then collecting the results, as they play a significant role in the
total GPU execution time. Figure 20 shows that these two functions take, on
average, about 59% of the total single-GPU execution time, with a minimum of
38% and a maximum of 75%, depending on the algorithm. This means that
sometimes they take more time than the processing itself. Furthermore, this
portion represents the crucial part of the algorithms that cannot be
parallelized, as Amdahl's law states, and acts as a barrier to the speedup.
Specifically, Amdahl's law says that when the parallel portion is around 50%,
the maximum speedup that can be achieved is around 2x. Although many
architectural parameters affect this number, it is very interesting that the law
applies to this case, as the results are very close to the specified barrier.
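For reference, Amdahl's law gives the maximum speedup as

$$S_{\max} = \frac{1}{(1 - p) + \frac{p}{N}},$$

where $p$ is the parallelizable fraction of the program and $N$ the number of processing elements; for $p = 0.5$ the speedup approaches $1/(1 - 0.5) = 2$ as $N$ grows, which is the 2x barrier mentioned above.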
[Figure 20: share of memory transfers ("Mem Device") versus computation in the single-GPU execution time per algorithm (Naïve, Hor, KR, QS, SA)]
On average, the gathering of the data was responsible for 45% of the
execution time across all algorithms, with a minimum of 12% in Naive Search
and a maximum of 71% in Quick Search. Such percentages act as barriers to
the optimizations that can be performed, as the part of gathering the results
cannot be parallelized.
[Chart: share of result gathering ("Gather") versus computation in the MPI execution time per algorithm (Naïve, Hor, KR, KMP, QS, SA)]
Moreover, it can be seen in Figure 22 that Karp Rabin, Horspool and Quick
Search algorithms performed best in descending order. The three of them,
managed to search, and also gather the results of the pattern occurrences
within an 8 MB file in 6.38 ms, as an average for both pattern sizes.
[Chart: per-algorithm comparison (Naïve, Hor, KR, KMP, QS, SA)]
Finally, it is noted that through this hybrid approach to string matching, all
algorithms exhibited at least a three-fold speedup. The Naïve algorithm had
the most significant performance increase of 49x compared to its single-CPU
version. Additionally, algorithms such as Karp Rabin, which ended up having
one of the best execution times, managed to run 12x faster than their
single-CPU implementation, as presented in Figure 23.
8. Conclusions
In this thesis, parallel implementations of the Naïve, Knuth Morris-Pratt,
Horspool, Karp Rabin, Quick Search, Shift Or and Shift And exact string
matching algorithms were presented using the NVIDIA CUDA architecture.
Both sequential and parallel implementations were compared in terms of
running time under different pattern sizes. The results showed that the parallel
implementations of algorithms such as Naïve Search were executed up to
21x faster than the sequential algorithm. Furthermore, the Knuth Morris-Pratt
and the Karp Rabin algorithms exhibited 9.6x and 7.7x speedups respectively,
while the rest of them exhibited at least a 2x increase. It was observed that
the speedups were directly dependent on the structure of each algorithm and
on the portion of the overall procedure that could be parallelized. In total, the
Naïve algorithm had the most significant performance increase of 48x,
comparing the MPI version with the sequential one. This substantial speedup
was due to the algorithm's higher complexity, as well as its simple structure.
However, the fastest algorithm for both patterns proved to be Karp Rabin,
9. Appendix
stop = lenFullText;

fileMain.seekg(start, std::ios::beg);
h_text = (char *) malloc((stop - start + 1) * sizeof(char));
fileMain.read(h_text, stop - start + 1);
fileMain.close();
int lent = strlen(h_text);

// Calculate lengths
int result_size = ceil(lenFullText / commSize);

// Initialize local results array
h_result = (bool *) malloc(result_size * sizeof(bool));
memset(h_result, false, result_size * sizeof(bool));

// On each node, run computation on GPU and CPU
timeStartCuda = MPI_Wtime();
computeGPU(h_result, result_size, h_text, lent, h_pattern, lenp, cudaThreadSize, timeGPU);
timeEndCuda = MPI_Wtime();

timeStartCPU = MPI_Wtime();
computeCPU(h_text, lent, h_pattern, lenp, h_result);
timeEndCPU = MPI_Wtime();

// Initialize global results array
if (commRank == 0)
    h_result_final = (int *) malloc(result_size / lenp * commSize * sizeof(int));

// Init MPI receive buffer
MPI_Status status;
const int buf = 32;
int position[buf];

if (commRank == 0) {
    // Copy master's results
    int results = 0;
    for (i = 0; i < result_size; i++)
        if (h_result[i]) {
            h_result_final[results] = i;
            results++;
        }

    // Start receiving
    int received = 0;
    int finWorkers = 1;
    while (finWorkers < commSize) {
        MPI_CHECK(MPI_Recv(&position, buf, MPI_INT, MPI_ANY_SOURCE,
                           MPI_ANY_TAG, MPI_COMM_WORLD, &status));

        // Get the number of the received results
        MPI_Get_count(&status, MPI_INT, &received);
        for (i = 0; i < received; i++) {
            h_result_final[results] = position[i];
            results++;
        }

        // Worker has finished transmission
        if (status.MPI_TAG == 1)
            finWorkers++;
    }
}
    char *d_pattern;
    bool *d_result;

    cudaEvent_t startMemHost, stopMemHost;
    cudaEvent_t startMemDevice, stopMemDevice;
    cudaEvent_t startCompute, stopCompute;
    cudaEventCreate(&startMemHost);
    cudaEventCreate(&stopMemHost);
    cudaEventCreate(&startCompute);
    cudaEventCreate(&stopCompute);
    cudaEventCreate(&startMemDevice);
    cudaEventCreate(&stopMemDevice);

    // Allocate data on GPU memory
    cudaEventRecord(startMemHost, 0);
    checkCudaErrors(cudaMalloc((void**) &d_text, lent * sizeof(char)));
    checkCudaErrors(cudaMalloc((void**) &d_pattern, lenp * sizeof(char)));
    checkCudaErrors(cudaMalloc((void**) &d_result, lent * sizeof(bool)));

    // Copy to GPU memory
    checkCudaErrors(cudaMemcpy(d_text, h_text, lent * sizeof(char), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_pattern, h_pattern, lenp * sizeof(char), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemset(d_result, false, lent * sizeof(bool)));
    cudaEventRecord(stopMemHost, 0);

    // Invoke kernel
    int threadsPerBlock = threadSize;
    int blocksPerGrid = calcBlocks(lent, threadsPerBlock);

    // Run main kernel
    cudaEventRecord(startCompute, 0);
    bruteforceGPU<<<blocksPerGrid, threadsPerBlock>>>(d_text, d_pattern, lent, lenp, d_result);
    cudaEventRecord(stopCompute, 0);

    // Copy data back to CPU memory
    cudaEventRecord(startMemDevice, 0);
    checkCudaErrors(cudaMemcpy(h_result, d_result, lent * sizeof(bool), cudaMemcpyDeviceToHost));
    cudaEventRecord(stopMemDevice, 0);

    // Free GPU memory
    checkCudaErrors(cudaFree(d_text));
    checkCudaErrors(cudaFree(d_pattern));
    checkCudaErrors(cudaFree(d_result));

    cudaEventElapsedTime(&timeGPU.gpuMemHost, startMemHost, stopMemHost);
    cudaEventElapsedTime(&timeGPU.gpuCompute, startCompute, stopCompute);
    cudaEventElapsedTime(&timeGPU.gpuMemDevice, startMemDevice, stopMemDevice);
}
        j++;
        if (i < m && P[i] == P[j])
            kmpNext[i] = kmpNext[j];
        else
            kmpNext[i] = j;
    }
}
int calcBlocks(int size, int threadsPerBlock)
{
    return (size + threadsPerBlock - 1) / threadsPerBlock;
}
// CUDA computation on each node
void computeGPU(bool *h_result, int result_size, char *h_text, int lent,
                char *h_pattern, int lenp, int threadSize, times &timeGPU)
{
    char *d_text;
    char *d_pattern;
    bool *d_result;
    int *d_kmpNext;
    int *h_kmpNext = (int*) malloc((lenp + 1) * sizeof(int));

    clock_t timeStartPrepro, timeEndPrepro;
    cudaEvent_t startMemHost, stopMemHost;
    cudaEvent_t startMemDevice, stopMemDevice;
    cudaEvent_t startCompute, stopCompute;
    cudaEventCreate(&startMemHost);
    cudaEventCreate(&stopMemHost);
    cudaEventCreate(&startCompute);
    cudaEventCreate(&stopCompute);
    cudaEventCreate(&startMemDevice);
    cudaEventCreate(&stopMemDevice);

    // Precompute
    timeStartPrepro = clock();
    preKmp(h_pattern, lenp, h_kmpNext);
    timeEndPrepro = clock();

    // Allocate data on GPU memory
    cudaEventRecord(startMemHost, 0);
    checkCudaErrors(cudaMalloc((void**) &d_text, lent * sizeof(char)));
    checkCudaErrors(cudaMalloc((void**) &d_pattern, lenp * sizeof(char)));
    checkCudaErrors(cudaMalloc((void**) &d_result, lent * sizeof(bool)));
    checkCudaErrors(cudaMalloc((void**) &d_kmpNext, (lenp + 1) * sizeof(int)));

    // Copy to GPU memory
    checkCudaErrors(cudaMemcpy(d_text, h_text, lent * sizeof(char), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_pattern, h_pattern, lenp * sizeof(char), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_kmpNext, h_kmpNext, (lenp + 1) * sizeof(int), cudaMemcpyHostToDevice));
}
9.3. Horspool
int calcBlocks(int size, int threadsPerBlock)
{
    return (size + threadsPerBlock - 1) / threadsPerBlock;
}
// CUDA computation on each node
void computeGPU(bool *h_result, int result_size, char *h_text, int lent,
                char *h_pattern, int lenp, int threadSize, times &timeGPU)
{
    char *d_text;
    char *d_pattern;
    bool *d_result;
    int *d_bmBc;
    int *d_preComp;
    int bmBc[ASIZE];
    int *h_preComp = (int*) malloc(lent * sizeof(int));

    double timeStartPrepro, timeEndPrepro;
    cudaEvent_t startMemHost, stopMemHost;
    cudaEvent_t startMemDevice, stopMemDevice;
    cudaEvent_t startCompute, stopCompute;
    cudaEventCreate(&startMemHost);
    cudaEventCreate(&stopMemHost);
    cudaEventCreate(&startCompute);
    cudaEventCreate(&stopCompute);
    cudaEventCreate(&startMemDevice);
    cudaEventCreate(&stopMemDevice);

    // Precompute shifts
    timeStartPrepro = MPI_Wtime();
    preBmBc(h_pattern, lenp, bmBc);
    precomputeShifts(h_text, lent, lenp, bmBc, h_preComp);
    timeEndPrepro = MPI_Wtime();

    // Allocate data on GPU memory
    cudaEventRecord(startMemHost, 0);
    checkCudaErrors(cudaMalloc((void**) &d_text, lent * sizeof(char)));
    checkCudaErrors(cudaMalloc((void**) &d_pattern, lenp * sizeof(char)));
    checkCudaErrors(cudaMalloc((void**) &d_result, lent * sizeof(bool)));
    checkCudaErrors(cudaMalloc((void**) &d_bmBc, ASIZE * sizeof(int)));
    checkCudaErrors(cudaMalloc((void**) &d_preComp, lent * sizeof(int)));

    // Copy to GPU memory
    checkCudaErrors(cudaMemcpy(d_text, h_text, lent * sizeof(char), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_pattern, h_pattern, lenp * sizeof(char), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_bmBc, bmBc, ASIZE * sizeof(int), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_preComp, h_preComp, lent * sizeof(int), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemset(d_result, false, lent * sizeof(bool)));
    cudaEventRecord(stopMemHost, 0);
    // Invoke kernel
    int threadsPerBlock = threadSize;
    int blocksPerGrid = calcBlocks(lent - lenp + 1, threadsPerBlock);

    // Run main kernel
    cudaEventRecord(startCompute, 0);
    horsGPU<<<blocksPerGrid, threadsPerBlock>>>(d_text, d_pattern, lent, lenp, d_bmBc, d_preComp, d_result);
    cudaEventRecord(stopCompute, 0);

    // Copy data back to CPU memory
    cudaEventRecord(startMemDevice, 0);
    checkCudaErrors(cudaMemcpy(h_result, d_result, lent * sizeof(bool), cudaMemcpyDeviceToHost));
    cudaEventRecord(stopMemDevice, 0);

    // Free GPU memory
    checkCudaErrors(cudaFree(d_text));
    checkCudaErrors(cudaFree(d_pattern));
    checkCudaErrors(cudaFree(d_result));
    checkCudaErrors(cudaFree(d_bmBc));
    checkCudaErrors(cudaFree(d_preComp));

    cudaEventElapsedTime(&timeGPU.gpuMemHost, startMemHost, stopMemHost);
    cudaEventElapsedTime(&timeGPU.gpuCompute, startCompute, stopCompute);
    cudaEventElapsedTime(&timeGPU.gpuMemDevice, startMemDevice, stopMemDevice);

    timeGPU.cpuPrepro = (timeEndPrepro - timeStartPrepro) * 1000;
}
    checkCudaErrors(cudaFree(d_pattern));
    checkCudaErrors(cudaFree(d_result));

    cudaEventElapsedTime(&timeGPU.gpuMemHost, startMemHost, stopMemHost);
    cudaEventElapsedTime(&timeGPU.gpuCompute, startCompute, stopCompute);
    cudaEventElapsedTime(&timeGPU.gpuMemDevice, startMemDevice, stopMemDevice);

    timeGPU.cpuPrepro = (timeEndPrepro - timeStartPrepro) / CLOCKS_PER_SEC * 1000;
}
}
9.6. Shift Or
    // Copy data back to CPU memory
    cudaEventRecord(startMemDevice, 0);
    checkCudaErrors(cudaMemcpy(h_result, d_result, lent * sizeof(bool), cudaMemcpyDeviceToHost));
    cudaEventRecord(stopMemDevice, 0);

    // Free GPU memory
    checkCudaErrors(cudaFree(d_text));
    checkCudaErrors(cudaFree(d_pattern));
    checkCudaErrors(cudaFree(d_result));
    checkCudaErrors(cudaFree(d_S));

    cudaEventElapsedTime(&timeGPU.gpuMemHost, startMemHost, stopMemHost);
    cudaEventElapsedTime(&timeGPU.gpuCompute, startCompute, stopCompute);
    cudaEventElapsedTime(&timeGPU.gpuMemDevice, startMemDevice, stopMemDevice);

    timeGPU.cpuPrepro = (timeEndPrepro - timeStartPrepro) / CLOCKS_PER_SEC * 1000;
}
    // Copy data back to CPU memory
    cudaEventRecord(startMemDevice, 0);
    checkCudaErrors(cudaMemcpy(h_result, d_result, lent * sizeof(bool), cudaMemcpyDeviceToHost));
    cudaEventRecord(stopMemDevice, 0);

    // Free GPU memory
    checkCudaErrors(cudaFree(d_text));
    checkCudaErrors(cudaFree(d_pattern));
    checkCudaErrors(cudaFree(d_result));
    checkCudaErrors(cudaFree(d_S));

    cudaEventElapsedTime(&timeGPU.gpuMemHost, startMemHost, stopMemHost);
    cudaEventElapsedTime(&timeGPU.gpuCompute, startCompute, stopCompute);
    cudaEventElapsedTime(&timeGPU.gpuMemDevice, startMemDevice, stopMemDevice);

    timeGPU.cpuPrepro = (timeEndPrepro - timeStartPrepro) / CLOCKS_PER_SEC * 1000;
}
References
[1] Nvidia CUDA FAQ | NVIDIA Developer Zone. https://developer.nvidia.com/cuda-faq (accessed 2013/01/27).
[2] Wikipedia, Message Passing Interface.
http://en.wikipedia.org/w/index.php?title=Message_Passing_Interface&oldid=537683372 (accessed 2013/02/15).
[3] Wikipedia String searching algorithm - Wikipedia, the free encyclopedia.
http://en.wikipedia.org/wiki/String_matching (accessed 2013/01/28).
[4] Francisco G. Martin, Exact String Pattern Recognition. Escuela Universitaria de
Informática, Madrid, Spain.
[5] Dömölki, B., An algorithm for syntactic analysis. Computational Linguistics 1964,
3, 29–46.
[6] Morris, J. H.; Pratt, V. R. A Linear Pattern-Matching Algorithm; University of
California: Berkeley, 1970.
[7] Knuth, D.; Morris, J. H., Jr.; Pratt, V., Fast Pattern Matching in Strings. SIAM Journal
on Computing 1977, 6 (2), 323-350.
[8] Boyer, R. S.; Moore, J. S., A fast string searching algorithm. Commun. ACM
1977, 20 (10), 762-772.
[9] Horspool, R. N., Practical fast searching in strings. Software: Practice and
Experience 1980, 10 (6), 501-506.
[10] Galil, Z.; Seiferas, J., Time-space-optimal string matching. Journal of Computer
and System Sciences 1983, 26 (3), 280-294.
[11] Apostolico, A.; Giancarlo, R., The Boyer Moore Galil string searching strategies
revisited. SIAM J. Comput. 1986, 15 (1), 98-105.
[12] Karp, R. M.; Rabin, M. O., Efficient randomized pattern-matching algorithms.
IBM J. Res. Dev. 1987, 31 (2), 249-260.
[13] Sunday, D. M., A very fast substring search algorithm. Commun. ACM 1990, 33
(8), 132-142.
[14] Navarro, G., A guided tour to approximate string matching. ACM Comput. Surv.
2001, 33 (1), 31-88.
[15] Vasiliadis, G.; Antonatos, S.; Polychronakis, M.; Markatos, E. P.; Ioannidis, S.,
Gnort: High Performance Network Intrusion Detection Using Graphics
Processors. In Proceedings of the 11th international symposium on Recent
Advances in Intrusion Detection, Springer-Verlag: Cambridge, MA, USA, 2008;
pp 116-134.
[16] Roesch, M., Snort - Lightweight Intrusion Detection for Networks. In Proceedings
of the 13th USENIX conference on System administration, USENIX Association:
Seattle, Washington, 1999; pp 229-238.
[17] Fisk, M.; Varghese, G. Fast Content-Based Packet Handling for Intrusion
Detection; UCSD: 2001.