String Matching on hybrid parallel architectures: an approach using MPI and NVIDIA CUDA

Department of Applied Informatics,
University of Macedonia,
March 2013

John-Alexander M. Assael
iassael@gmail.com

Supervisor
Prof. Konstantinos G. Margaritis
kmarg@uom.edu.gr
Table of Contents
Acknowledgments
Abstract
Glossary
1. Introduction
1.1. History of String Matching
1.2. Application Areas
1.2.1. Computational Biology
1.2.2. Signal Processing
1.2.3. Text Retrieval
1.2.4. Computer Security
1.3. The Speed Problem
1.4. Objective of this Thesis
2. Parallel Programming
2.1. From GPUs to General Purpose GPUs
2.2. CUDA: A General-Purpose Parallel Computing Platform
2.3. Scalable Programming Model
2.4. Clusters with MPI and CUDA
3. CUDA Programming Model
3.1. Kernels
3.2. Thread Hierarchy
3.3. Memory Hierarchy
3.4. Heterogeneous Architecture
4. System model and approach
5. Hybrid String Matching
5.1. MPI Parallelization
5.2. CUDA Parallelization
6. String Matching Algorithms
6.1. Naive Search
6.1.1. Sequential Implementation
6.1.2. CUDA Implementation
6.2. Knuth Morris-Pratt
6.2.1. Sequential Implementation
6.2.2. CUDA Implementation
6.3. Horspool
6.3.1. Sequential Implementation
6.3.2. CUDA Implementation
6.4. Karp Rabin
6.4.1. Sequential Implementation
6.4.2. CUDA Implementation
6.5. Quick Search
6.5.1. Sequential Implementation
6.5.2. CUDA Implementation
6.6. Shift Or
6.6.1. Sequential Implementation
6.6.2. CUDA Implementation
6.7. Shift And
6.7.1. Sequential Implementation
6.7.2. CUDA Implementation
7. Performance Evaluation
7.1. Testing Methodology
7.1.1. Pattern Size and CUDA Threads
7.1.2. String Matching Test Files
7.2. Measuring Speedup
7.3. CPU vs GPU Comparison
7.4. GPU vs the Cluster Comparison
8. Conclusions
9. Appendix
9.1. Naïve Search
9.1.1. MPI Implementation (mpiBrute.cpp)
9.1.2. MPI Implementation (mpiBrute.h)
9.1.3. CUDA Implementation (mpiBrute.cu)
9.2. Knuth Morris-Pratt
9.2.1. CUDA Implementation (mpiKMP.cu)
9.3. Horspool
9.3.1. CUDA Implementation (mpiHorsepool.cu)
9.4. Karp Rabin
9.4.1. CUDA Implementation (mpiKarpRabin.cu)
9.5. Quick Search
9.5.1. CUDA Implementation (mpiQuickSearch.cu)
9.6. Shift Or
9.6.1. CUDA Implementation (mpiShiftOr.cu)
9.7. Shift And
9.7.1. CUDA Implementation (mpiShiftAnd.cu)
References

Acknowledgments
I would like to thank my supervisor, Prof. Konstantinos G. Margaritis, for his support and insightful feedback during the course of this research. I would also like to thank the Parallel Distributed Processing Laboratory, as well as the Department of Applied Informatics of the University of Macedonia, for providing me access to the hardware that was used to run the experiments. Last but not least, I want to thank my family and friends for all of their support.

Abstract
String Matching algorithms are responsible for finding occurrences of a pattern within a large text. Many areas of Computer Science require demanding string-matching procedures. Over the past years, Graphics Processing Units have evolved into powerful parallel processors that outperform Central Processing Units in scientific calculations. Moreover, several GPUs can be used in parallel within a computer cluster. This thesis attempts to speed up seven major String Matching algorithms by exploiting the combined power of several CUDA-enabled GPUs in a GPU cluster. The algorithms are implemented and optimized to take advantage of an MPI distributed-memory cluster and the CUDA parallel computing architecture. Finally, their performance is compared against the corresponding sequential and single-GPU implementations.

Glossary
CUDA Compute Unified Device Architecture; NVIDIA's parallel computing architecture that harnesses the power of Graphics Processing Units (GPUs) to offer dramatic performance increases [1]

MPI Standardized and portable Message Passing Interface [2]

n The length of the text being examined

m The length of the pattern

BM Boyer Moore String Matching Algorithm

Hor Horspool String Matching Algorithm

KMP Knuth Morris Pratt String Matching Algorithm

KR Karp Rabin String Matching Algorithm

MP Morris Pratt String Matching Algorithm

QS Quick Search String Matching Algorithm

SA Shift And String Matching Algorithm

SOr Shift Or String Matching Algorithm



1. Introduction
String Matching algorithms are an important member of the String algorithms family. Their purpose is to find all the occurrences of one or several strings (also called patterns) within a larger string or text [3]. In general, there are two inputs given to each algorithm: the large text and the pattern that we are looking for within it. It is worth noting that in most algorithms the length of the text is denoted by 'n' and the length of the pattern by 'm', where m is less than or equal to n. There are many algorithms that try to give optimal solutions to this problem. In this thesis, we will analyze seven major String Matching algorithms that, through hybrid hardware and software optimizations, exhibit a dramatic performance boost. Finally, Big O notation will be used to express the complexity of the algorithms.

1.1. History of String Matching


The problem of String Matching has been regarded as fundamental to Computer Science since the first years of the digital computing era, as most applications and problems involve text processing. The first algorithm developed was the Brute-Force algorithm, based on the straightforward idea of sliding the pattern along the text and comparing it to each portion of the text (Figure 1). However, this first approach is very slow, taking O(nm) time [4]. In 1964, B. Dömölki developed the Bitap [5] algorithm, also known as Shift Or and Shift And. The algorithm's main characteristic is that it uses fast bitwise operations to accelerate the searching process. Later, in 1970, Morris and Pratt came up with a tight analysis of the Brute-Force algorithm, making use of the internal structure of the pattern [6]. The algorithm used preprocessed values that were computed in O(m) time and then performed the search in O(n+m) time, offering better performance. A more sophisticated version of the MP algorithm came later from Knuth, Morris and Pratt [7], in 1977. Although the complexity of the KMP algorithm was the same as that of MP, it was much faster in practice, as it improved the length of the shifts that take place. Moreover, in the same year another efficient algorithm was introduced by Boyer and Moore [8]. The BM algorithm is considered one of the most efficient String Matching algorithms in typical applications, forming the basis of most "search" commands. Several variants and enhancements of the BM algorithm appeared in the following years, such as Horspool [9], Galil-Seiferas [10], Apostolico-Giancarlo [11], Karp-Rabin [12] and Quick-Search [13]. In this thesis we will analyze the Horspool, Karp-Rabin and Quick-Search algorithms, as all three are simple to implement on parallel structures. In 1980, N. Horspool published a simplified version of BM, which was related to the KMP algorithm. Horspool's algorithm had various optimizations in order to achieve an average-case complexity of O(n) on a random text, although its worst case is much worse than that of BM. The Karp-Rabin algorithm was created in 1987 by M. Rabin and R. Karp, and it makes use of hash methods to search much more efficiently for the pattern inside the given text. Additionally, it is especially widely used for multiple-pattern searches. Finally, Quick Search was published in 1990 by D. Sunday. The major unique feature of Quick Search was that it was simplified to use only the bad-character shift table of BM, making it very fast in practice for short patterns and large alphabets, and also much easier to implement. This short history overview covers all the algorithms that were implemented in this thesis.

Figure 1 Brute-Force String Matching



1.2. Application Areas


The first references to the problem of string matching appeared in the sixties and the seventies. It was a common obstacle in many different fields of science. Some of the first requirements for fast search solutions came from the computational biology, signal processing and text retrieval application areas. Notably, even nowadays these remain the major areas of interest [14].

1.2.1. Computational Biology

In Computational Biology, DNA and protein sequences can be represented as long texts over specific alphabets, such as (A, C, G, T), that represent the genetic code of living organisms. Although these sequences can be very long, scientists search for specific parts within those texts, in order to assemble the DNA chain from the pieces obtained by experiments, look for specific characteristics inside them, or compare different parts. These requirements are modeled as searching for patterns in a large text file. However, in such applications approximate matching is much more useful than exact matching, as the experimental measurements have errors, and even when they are correct they may contain small differences due to mutations and evolutionary alterations. Hence, fuzzy string-matching algorithms can prove more effective when processing such files.

Figure 2 Computational Biology Codon Matching



1.2.2. Signal Processing

Another early motivation came from the area of signal processing, which deals with speech and, more generally, with sound pattern recognition. The main problem is to recognize a specific message within a transmitted audio signal. Many problems derive from this main idea, from speech recognition to music recognition and signal error correction. In such problems the main text is the encoded input signal, while a smaller encoded sequence represents the pattern that is to be found.

1.2.3. Text Retrieval

The most common problem is the retrieval of parts of a larger text while counting the occurrences of the pattern. Moreover, the problem of correcting misspelled words in written text is one of the oldest potential applications. Furthermore, as the World Wide Web expanded, a new need arose for scanning the online content of the Internet and categorizing it, in order to make it searchable by all users. This is an extremely demanding task, as the size of the data to be processed is enormous and parallel techniques are required in order to process it efficiently.

1.2.4. Computer Security

In recent years, several areas of computer security have started making use of pattern matching techniques in order to achieve various demanding tasks such as intrusion detection, file hash matching, virus scanning and spam filtering. However, all these tasks are quite demanding and run continuously, reserving a notable amount of computing power for both online and offline processes. Moreover, these tasks become even more demanding as the amount of data that has to be processed multiplies. For example, Mail Servers have far more data to process than regular computers, while all these actions have to be executed both fast and efficiently, consuming as little computing power as possible. There are several GPU implementations of security applications, such as GNORT [15] in 2008, a GPU version of the well-known intrusion detection system SNORT [16].

1.3. The Speed Problem


As discussed, there are many areas of Computer Science that require demanding string-matching procedures. Measurements on Network Intrusion Detection Systems (NIDS) that perform deep packet inspection, such as SNORT [16], have shown that 31% of total processing is due to string matching. Moreover, the percentage increases dramatically in the case of Web Traffic scanning, reaching up to 80% [17]. Another example is codon recognition in a DNA sequence; such tasks are entirely based on string matching, meaning that the execution time of the process depends directly on the execution time of string matching. Thus, string matching can be considered one of the most computationally intensive parts of many procedures in different scientific fields. In this thesis we will focus on how these tasks can be optimized and executed in less time on hybrid architectures.

1.4. Objective of this Thesis


This thesis on hybrid implementations of seven String Matching algorithms had two major objectives. The first objective was to implement several string matching algorithms to make use of the multi-core CUDA-architecture Graphics Processing Units of NVIDIA Corporation. This significantly reduces the load on the Central Processing Unit, while offering an important performance speedup by efficiently using all the hardware resources that a computer offers. However, there are many applications of string matching that exceed the capabilities of a personal computer if they are to be executed in reasonable time. Moreover, previous research such as "String Matching on a multicore GPU using CUDA" [18], published in 2009 by Ch. Kouzinopoulos and K. Margaritis, has shown a significant performance increase for several string matching algorithms when using a GPU. Thus, the second objective of this thesis was not only to present the performance speedup that a GPU card can offer, but also to distribute the demanding search tasks to scalable computer clusters equipped with NVIDIA CUDA-enabled GPU cards and compare the results.

2. Parallel Programming

2.1. From GPUs to General Purpose GPUs


The market's endless demand for better and more realistic 3D Computer Graphics evolved Graphics Processing Units into powerful, highly parallel, multithreaded, multicore processors with enormous computational power and extremely high memory bandwidth [19]. A comparison with the CPU memory bandwidth is depicted in Figure 3.

Figure 3 Memory Bandwidth CPU vs GPU

Moreover, due to their parallel architecture, they can sometimes be up to 100x faster than Central Processing Units in simple operations [20].

Figure 4 Floating-Point Operations per second CPU vs GPU

As can be seen in Figure 4, there is a significant difference between the floating-point performance of CPUs and GPUs; this is due to the nature of graphics rendering operations, which are compute-intensive and highly parallel. Consequently, many more transistors inside a GPU are devoted to data processing rather than to caching and flow control. GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very quickly. This evolution has effectively changed the nature of Graphics Processing Units into General Purpose Parallel Processors that are capable of executing demanding computational tasks much more efficiently, taking advantage of their parallel architecture.

Figure 5 GPU has more transistors devoted to Data Processing



This highly parallel model allows the accelerated execution of problems that involve multiple computations on data that are not highly correlated; the problem is distributed and the same operations are executed in parallel on multiple threads and cores. It is very important that the computations are independent and do not require previously calculated results in order to proceed to the next operation. Otherwise, performance drops significantly, as the boost is due to the highly parallel model. As the processing becomes more sequential, the execution time can even exceed the same program's CPU execution time, since individual GPU cores are much weaker than a CPU processor of the same price level.

2.2. CUDA: A General-Purpose Parallel Computing Platform

NVIDIA introduced the CUDA architecture (formerly Compute Unified Device Architecture) in November 2006. CUDA is a general-purpose parallel computing platform and programming model, implemented by the graphics processing units (GPUs) produced by NVIDIA [21]. GPUs are able to solve many complex computational problems in parallel and much faster than a CPU. This approach of solving general-purpose problems (i.e. not exclusively graphics) on GPUs is known as GPGPU [22].

Figure 6 The Multicore Architecture of a GPU



The latest version of NVIDIA CUDA, 5.0, comes with an integrated software environment that uses C as a high-level programming language. The IDE is called Nsight [23] and offers an edition for Microsoft's Visual Studio [24], for development under Microsoft Windows, and one for the open-source Eclipse platform [25], for development under Linux and Apple Mac OS X.

Nsight is now distributed as part of CUDA 5.0 and is equipped with a CUDA source code editor offering syntax highlighting, CUDA-aware refactoring and code completion. Moreover, it has an integrated debugger able to monitor program variables across several CUDA threads; it also gives the ability to set breakpoints and perform single-step execution at both the source-code and assembly levels. Finally, Nsight Profiler is an advanced code profiler that identifies performance bottlenecks using a unified CPU and GPU trace of application activity, while it also offers automated analysis of optimization opportunities. The algorithms implemented within the scope of this thesis were developed under Nsight Eclipse Edition and CUDA 5.0.

It is worth noting that in recent years CUDA development projects have exhibited a tremendous, multi-fold increase; in addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform now supports other computational interfaces, including the Khronos Group's OpenCL [26], Microsoft's DirectCompute [27], and C++ AMP [28]. Finally, wrappers for the platform are also available for Python, Perl, Fortran, Java, Ruby, Lua, Haskell, MATLAB and IDL, and there is native support in Mathematica [21].

Figure 7 CUDA supported Programming Languages and Interfaces

2.3. Scalable Programming Model


The new era of many-core GPUs and multicore CPUs has begun, characterized as an evolution in which parallel systems have taken over. Moreover, it is noteworthy that this parallelism is scaling with Moore's law. The challenge we face nowadays is the development of applications that transparently scale and adjust to each system's number of cores, whether CPU or GPU. This is very important in order to make applications efficient, easy to use and, of course, portable.

CUDA was designed to overcome these challenges, while maintaining a low learning curve for developers who are familiar with standard programming languages such as C and C++. CUDA uses three key abstractions: hierarchies of thread groups, shared memories, and barrier synchronization. These are exposed to the programmer, providing fine-grained data and thread parallelism nested within coarse-grained data parallelism and task parallelism. The programmer is required to split the problem into independent sub-problems that can be solved by blocks of threads, and each sub-problem into finer pieces solved cooperatively in parallel by all the threads within a block.

This partitioning allows the threads to cooperate when solving each sub-problem and, most importantly, such an architecture offers automatic scalability. Each block of threads can be run on any of the GPU's available multiprocessors, either concurrently or sequentially; this gives any CUDA program the ability to execute on any number of multiprocessors and to be adjusted easily, requiring only the physical multiprocessor count to be known. This scalable model allows applications to scale easily depending on the demands and hardware of each user, as NVIDIA offers a wide range of CUDA-enabled GPUs, from mainstream inexpensive (GeForce) to demanding high-performance (Quadro and Tesla) computing products [19].

Figure 8 Automatic Scalability

2.4. Clusters with MPI and CUDA


Message Passing Interface (MPI) is a standardized and portable message-
passing system, which was developed by a group of researchers from
academia and industry to function on a wide variety of parallel computers. The
first steps were taken in the early nineties, and version 1.0 of the interface was released in June 1994 [2]. MPI defines each node as a process and is considered the standard for High Performance Computing application development on distributed-memory architectures [29].

MPI defines the semantics and the syntax of a core of library routines, useful
to a wide range of developers writing portable message-passing programs.
There are MPI bindings for many languages, but the first ones were in Fortran
77, C and C++. There are several implementations of the MPI interface, many of which are free and publicly available.

One of the most widely used implementations is MPICH [30], a portable, free implementation of the message-passing interface for distributed-memory applications that run in parallel. MPICH efficiently supports different computation and communication platforms, including commodity clusters (desktop systems, shared-memory systems, multicore architectures), high-speed networks, proprietary high-end computing systems (Blue Gene, Cray) and multiple operating systems (Windows, most flavors of UNIX including Linux and Mac OS X). The cluster used in this thesis was supported by MPICH2 running under Ubuntu Linux.

As MPI is standardized, the code that is written is portable and is able to run
under any MPI implementation of the same architecture. Moreover, although
the performance may vary between different implementations, the calls that
are made have the same behavior on all of them, offering even more
portability to the parallel programs.

The most common use of MPI is in parallel programming for cluster systems, as it handles all the communications between the nodes with simplicity and, most importantly, in a standardized way. The main problem encountered is the distributed memory architecture of clustered systems: how the data are distributed and how the processes communicate when they need data from another node, and possibly from many others. MPI is responsible for all the node communications, handling the data distribution and gathering. However, a developer should always remember that passing data over the network is a
time-consuming operation, and should therefore develop efficient programs that balance the communication time with the processing time [29].

Figure 9 GPU Cluster with MPI

In order to achieve this balance, nodes have to undertake more work before the results are transmitted, minimizing the number of small network packets, as in such demanding operations even the packet headers affect the total performance. More specifically, if we had developed an algorithm to add two same-length vectors, it would be much more efficient to assign parts of each vector to each worker and send all the calculated results together once the calculations have finished, rather than sending each calculated result to the master worker separately. This strategy is generally the preferred one, especially when there are no data dependencies, as illustrated in the sketch below.
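A minimal sketch of this strategy (not taken from the thesis; function and variable names are illustrative): the master scatters equal slices of the two input vectors, each process adds its slice locally without any intermediate communication, and a single gather collects the per-process results.

#include <mpi.h>
#include <stdlib.h>

/* Distributed addition of two same-length vectors a and b into c (valid on the root).
   For brevity it is assumed that n is divisible by the number of processes. */
void distributedVectorAdd(const float *a, const float *b, float *c, int n) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = n / size;
    float *la = (float *) malloc(chunk * sizeof(float));
    float *lb = (float *) malloc(chunk * sizeof(float));
    float *lc = (float *) malloc(chunk * sizeof(float));

    /* Distribute one slice of each input vector to every process */
    MPI_Scatter(a, chunk, MPI_FLOAT, la, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_FLOAT, lb, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Local computation: no communication per element */
    for (int i = 0; i < chunk; i++)
        lc[i] = la[i] + lb[i];

    /* One result message per worker instead of one per element */
    MPI_Gather(lc, chunk, MPI_FLOAT, c, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(la); free(lb); free(lc);
}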

As can be seen, this balance is directly affected by the processing time, the data dependencies and, as a result, the network communications. This thesis focuses on using the GPU parallel architecture, taking advantage of its tremendous performance boost, on a computer cluster with CUDA-enabled GPUs governed by MPI. Therefore, in this model there are two levels of parallelism: the first level is between the nodes of the cluster, and the second one is between the cores of each worker's GPU.

3. CUDA Programming Model


In this chapter we will identify some of the major characteristics of the CUDA programming model by outlining how they are exposed in C, based on the "NVIDIA C Programming Guide" [19]. This is necessary in order to understand the experimental analysis that follows. The sample program being used is called "vectorAdd" and is available in the SDK's accompanying code samples.

3.1. Kernels
The CUDA library extends C, giving the ability to define and execute specialized CUDA functions. These functions are called kernels and, as opposed to regular C functions, they are executed N times in parallel by N different CUDA threads [19]. Each kernel function is defined using the __global__ specifier, and the number of CUDA threads that will execute the kernel is defined within the <<<…>>> execution configuration syntax. The unique thread ID that characterizes each thread executing the kernel can be accessed inside the kernel through the built-in variable threadIdx.

The following kernel example (Exhibit 1) illustrates the addition of two vectors A and B of size N, storing the result into vector C.

//  Kernel  definition  
__global__  void  vectorAdd(const  float  *A,  const  float  *B,  
float  *C)  {  
  int  i  =  threadIdx.x;  
   
C[i]  =  A[i]  +  B[i];  
}  
int  main()  {  
  ...  
  //  Kernel  invocation  with  N  threads  
  vectorAdd<<<1,  N>>>(A,  B,  C);  
  ...  
}  

 
Exhibit 1 Simple vectorAdd() kernel

3.2. Thread Hierarchy


CUDA has specified the threadIdx variable to be a 3-component vector, for
reasons of convenience. Hence, threads can be identified using 1D, 2D or 3D
thread indexes, providing a more natural way of accessing the desired
elements of a vector, matrix or a volume, respectively [19].

At this point, we should note that there is a direct relation between the thread
ID and the thread Index. This relation exists so that we can manipulate our
data and cells with much more convenience.

- For a 1D block, they are the same.
- For a 2D block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y·Dx).
- Finally, for a 3D block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y·Dx + z·Dx·Dy). A small sketch of this flattening follows.
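The following device-side sketch (not from the thesis) computes this linear thread ID, using the built-in blockDim variable, which holds the block dimensions Dx, Dy and Dz:

// Returns the linear thread ID of the calling thread inside its block,
// following the 3D formula above.
__device__ unsigned int flatThreadId() {
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}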

The example in Exhibit 2 is based on Exhibit 1, but with 2D matrices of N×N dimensions. The code computes the addition of matrices A and B, and the result is stored in matrix C.

//  Kernel  definition  
__global__  void  MatAdd(float  A[N][N],  float  B[N][N],  
float  C[N][N])  {  
  int  i  =  threadIdx.x;  
  int  j  =  threadIdx.y;  
   
C[i][j]  =  A[i][j]  +  B[i][j];  
}  
int  main()  {  
  ...  
  //  Kernel  invocation  with  one  block  of  N  *  N  *  1  threads  
int  numBlocks  =  1;  
dim3  threadsPerBlock(N,  N);  
  MatAdd<<<numBlocks,  threadsPerBlock>>>(A,  B,  C);  
  ...  
}  
 
Exhibit 2 MatAdd() kernel

As the threads of a block are executed on the same processor core and share finite hardware and especially memory resources, there is an upper limit of 1024 threads per block on current NVIDIA GPUs. However, multiple blocks can execute the same kernel. Therefore, blocks are also organized into 1D, 2D or 3D virtual grids of thread blocks. The number of thread blocks inside a virtual grid depends on the size of the data to be processed and on the number of processors that our hardware is equipped with.

Figure 10 Grid of Thread Blocks

The developer defines the number of threads per block and the number of blocks per grid within the <<<...>>> syntax, using an int or dim3 variable. The block index can be accessed within the kernel using the blockIdx variable, in the same way as threadIdx. Moreover, the dimensions of a block can be obtained using the built-in blockDim variable. The following example extends the previous MatAdd() to handle multiple blocks.

//  Kernel  definition  
__global__  void  MatAdd(float  A[N][N],  float  B[N][N],  
float  C[N][N])  {  
int  i  =  blockIdx.x  *  blockDim.x  +  threadIdx.x;    
  int  j  =  blockIdx.y  *  blockDim.y  +  threadIdx.y;  
  if  (i  <  N  &&  j  <  N)  
C[i][j]  =  A[i][j]  +  B[i][j];  
}  
int  main()  {  
  ...  
  //  Kernel  invocation  
dim3  threadsPerBlock(16,  16);    
dim3  numBlocks(N  /  threadsPerBlock.x,  N  /  threadsPerBlock.y);  
  MatAdd<<<numBlocks,  threadsPerBlock>>>(A,  B,  C);  
  ...  
}  
 
Exhibit 3 MatAdd() with Multiple Blocks

One of the most common choices, which is also used in this case, is a thread block sized 16x16, i.e. 256 threads. The number of blocks is then calculated by dividing the total number of elements N by the number of threads per block in each dimension. Moreover, thread blocks are executed independently and can be scheduled in any order across any of the available cores. This facilitates the scalability of the programs being developed, as they can easily be scaled up in relation to the number of cores we have at our disposal.

Moreover, CUDA allows the threads of a block to cooperate: they have a limited amount of shared memory and can set points at which to synchronize their execution so that memory accesses are coordinated. Synchronization points can be set within a kernel by calling the __syncthreads() intrinsic function; this function acts as a barrier that all threads within the block are required to reach before execution continues.
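A short sketch (not from the thesis) illustrating this cooperation: the threads of a block first load data into shared memory and then reduce it, with __syncthreads() acting as the barrier between the phases. A block size of 256 threads (a power of two) is assumed.

__global__ void blockSum(const float *in, float *blockTotals, int n) {
    __shared__ float partial[256];                 // shared by all threads of the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // wait until every element is loaded

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                           // wait for each reduction step
    }

    if (tid == 0)
        blockTotals[blockIdx.x] = partial[0];      // one partial sum per block
}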

3.3. Memory Hierarchy


Each thread has access to multiple memory spaces. First of all, a private local memory space is reserved for each thread. Then, each thread block has a shared memory, which is visible to all its threads and is released at the end of the execution of the block. Finally, global memory is accessible by all threads, but is less efficient for most problems.

In addition, all threads have read-only access to the constant and texture memory spaces. More specifically, CUDA offers the global, constant and texture memory spaces, which are optimized for different types of operations and usage. These memory spaces are persistent across kernel launches, and the constant and texture spaces cannot be modified from within a kernel.
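The following sketch (not from the thesis; all names are illustrative) shows how these memory spaces appear in CUDA C, assuming a block size of at most 256 threads:

__constant__ float c_factor;                        // constant memory: read-only inside kernels

__global__ void memorySpacesDemo(const float *g_in, float *g_out, int n) {
    __shared__ float s_tile[256];                   // shared memory: visible to the whole block
    int local_i = threadIdx.x;                      // local variable: private to each thread
    int i = blockIdx.x * blockDim.x + local_i;

    if (i < n) {
        s_tile[local_i] = g_in[i];                  // g_in and g_out reside in global memory
        g_out[i] = s_tile[local_i] * c_factor;      // combines global, shared and constant data
    }
}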

Figure 11 Levels of Kernel Memory Spaces



3.4. Heterogeneous Architecture


The CUDA architecture model assumes that the kernels are executed on a physically separate device that works as a coprocessor to the host system, where the main application is executed. Moreover, CUDA treats the host and the device memory spaces as different spaces, which need to be referenced differently. Therefore, some steps are necessary in order to perform any operation, as sketched below:

- Device memory allocation: the data are transferred from host memory to the reserved space in device memory.
- Data processing: the kernels are loaded and executed on the blocks of threads.
- Device memory de-allocation: the processed data are transferred from device memory back to host memory, freeing up the reserved space.
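A minimal host-side sketch of these three steps (not from the thesis; scaleKernel and runOnDevice are illustrative names):

#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                                        // trivial processing step
}

void runOnDevice(float *h_data, int n) {
    float *d_data;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **) &d_data, bytes);                       // 1. allocate device memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  //    copy host -> device
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_data, n);                // 2. process on the device
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // 3. copy results device -> host
    cudaFree(d_data);                                           //    free the reserved space
}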

Figure 12 CUDA Heterogeneous Model



4. System model and approach


The system model used is a CUDA-enabled GPU cluster of 10 computers. The computers run Ubuntu Linux, Lucid Lynx 10.04 Long Term Support release [31]. Each computer is equipped with an Intel Core 2 Duo E8400 CPU with two cores clocked at 3.00 GHz, 3 GB of RAM, and an NVIDIA GeForce GTX 280 GPU clocked at 602 MHz, with 240 CUDA cores and 1 GB of GDDR3 memory at a 1107 MHz memory clock. Additionally, a Gigabit switch handles the network communications. A shared and distributed file system is achieved using NFS [32]. The development took place under the NVIDIA Nsight Eclipse Edition IDE and the applications were compiled using mpicxx and nvcc.

5. Hybrid String Matching


In this hybrid approach there are two levels of parallelization: the first one concerns MPI and the partitioning of the text file that is to be searched across the cluster, while the second one takes place inside the CUDA architecture. The first parallelization level, for distributing jobs within the cluster, was very similar for all the implemented algorithms. The nodes of the cluster communicate through a Gigabit switch and share the same disk space through NFS [32]. Hence, once the text was copied to the file system it was instantly available to all nodes.

The two layers of parallelization are implemented inside the program separately and are connected at the linking stage. There is a main file, written in C++, that executes all the necessary procedures to initialize MPI, read the files, execute the search function and gather the results to the main node. The CUDA search function is placed in a separate file, which is written in C and is compiled with NVCC.

5.1. MPI Parallelization


The following code illustrates the structure of the main program that is executed in the beginning; the CUDA functions are examined later.

At the beginning of Exhibit 4, the MPI execution environment is initialized in the main() function of the program. Moreover, it is worth mentioning that all the MPI functions are executed through a custom wrapper called MPI_CHECK, which terminates the execution in case a command fails to run on a node; a sketch of such a wrapper is shown below. The total number of nodes available to MPI is stored in the commSize variable using the MPI_Comm_size() function, while the current host's rank number is stored in the commRank variable with the use of the MPI_Comm_rank() function.
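The MPI_CHECK wrapper itself is not shown in Exhibit 4; a minimal sketch of what it could look like (written here as a macro; the exact version used in the thesis is listed in the Appendix, mpiBrute.h):

#include <stdio.h>
#include <mpi.h>

/* Executes an MPI call and aborts the whole job if it did not succeed. */
#define MPI_CHECK(call)                                                 \
    do {                                                                \
        int mpi_err = (call);                                           \
        if (mpi_err != MPI_SUCCESS) {                                   \
            fprintf(stderr, "MPI call failed (code %d) at %s:%d\n",     \
                    mpi_err, __FILE__, __LINE__);                       \
            MPI_Abort(MPI_COMM_WORLD, mpi_err);                         \
        }                                                               \
    } while (0)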

//  Initialize  MPI  state  


MPI_CHECK(MPI_Init(&argc,  &argv));  
 
//  Get  our  MPI  node  number  and  node  count  
int  commSize,  commRank;  
MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD,  &commSize));  
MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD,  &commRank));  

 
Exhibit 4 MPI Source - Initialization

Between crucial stages of the execution, the MPI_Wtime() function is called in


order to obtain the current time in seconds and microseconds. This is used, in
order to calculate the execution time of each step, and later compare the
results. The first actual step of the algorithm is to read the pattern from a
specified path, and then convert it in to a C string variable in order to pass it
later to the search function that is written in C. Moreover, the pattern length is
stored in the lenp integer variable.

timeStartRead  =  MPI_Wtime();  
//Read  Pattern  File  
ifstream  filePattern(patternpath);  
if  (!filePattern.is_open())  {  
  cout  <<  "Rank"  <<  commRank  <<  ":  Cannot  open  pattern!"  <<  endl;  
  my_abort(1);  
}  
string  str_pattern((std::istreambuf_iterator<char>(filePattern)),  
      std::istreambuf_iterator<char>());  
filePattern.close();  
 
h_pattern  =  new  char[str_pattern.size()  +  1];  /*  +1  for  the  terminating  '\0'  copied  by  strcpy()  */
strcpy(h_pattern,  str_pattern.c_str());  
int  lenp  =  strlen(h_pattern);  

 
Exhibit 5 MPI Source - Read Pattern

The same procedure is followed in order to read the text file. However, in this case each node reads only the chunk that it was assigned. First, all nodes read the total file size and then, using the rank assigned to them by MPI, select a different chunk of the text. For example, if there were 5 nodes in total, the file would be split into 5 imaginary chunks and each node would read its own part; the second node in line would read the second chunk, and so on. This is achieved by using the start and stop auxiliary variables, which indicate the start and end positions for each node's read of the text file.

start = rank × (n / N_MPI)

stop = (rank + 1) × (n / N_MPI) + (m − 1)

where n is the text length, m the pattern length, N_MPI the total number of MPI processes and rank the node's MPI rank.

The length of the text that each node processes is slightly larger than the plain division of the total text length by the number of nodes, as it also contains the next m − 1 characters; this covers the case where an occurrence of the pattern starts at the last letters of the chunk.

//  Read  Main  File  


ifstream  fileMain(filepath);  
if  (!fileMain.is_open())  {  
  cout  <<  "Rank"  <<  commRank  <<  ":  Cannot  open  text!"  <<  endl;  
  my_abort(1);  
}  
fileMain.seekg(0,  std::ios::end);  
lenFullText  =  fileMain.tellg();  
timeEndRead  =  MPI_Wtime();  
 
int  start  =  commRank  *  lenFullText  /  commSize;  
int  stop  =  (commRank  +  1)  *  lenFullText  /  commSize  +  (lenp  -  1);  
 
if  (stop  >  lenFullText)  
  stop  =  lenFullText;  
 
fileMain.seekg(start,  std::ios::beg);  
 
h_text  =  (char  *)  calloc(stop  -  start  +  2,  sizeof(char));  /*  zero-filled;  leaves  room  for  the  terminating  '\0'  */  
fileMain.read(h_text,  stop  -  start  +  1);  
fileMain.close();  
int  lent  =  strlen(h_text);  
 
Exhibit 6 MPI Source - Read Text

Next, the length of the chunk that is to be processed is calculated, as well as the size of the result vector, which is a Boolean vector with the same size as the chunk.

Moreover, there is a global integer result vector, h_result_final, initialized on the master worker, whose size is the size of h_result divided by the pattern length and multiplied by the number of nodes; the final result positions from all nodes are gathered into it on the master worker.

max_results = (result_size / m) × N_MPI

The computeGPU() function is called with the text and the pattern as its main parameters and invokes the CUDA-implemented string matching kernel to calculate the results on the local GPU card. The same algorithm is also implemented in C and is called by the computeCPU() function, in order to compare the execution times. The implementations of both functions differ depending on the algorithm, while the remaining code is almost the same for all of them.

//  Calculate  lengths  
int  result_size  =  ceil(lenFullText  /  commSize);  
 
//  Initialize  local  results  array  
h_result  =  (bool  *)  malloc(result_size  *  sizeof(bool));  
memset(h_result,  false,  result_size  *  sizeof(bool));  
 
//  On  each  node,  run  computation  on  GPU  and  CPU  
timeStartCuda  =  MPI_Wtime();  
computeGPU(h_result,  result_size,  h_text,  lent,  h_pattern,  lenp,  
    cudaThreadSize,  timeGPU);  
timeEndCuda  =  MPI_Wtime();  
 
timeStartCPU  =  MPI_Wtime();  
computeCPU(h_text,  lent,  h_pattern,  lenp,  h_result);  
timeEndCPU  =  MPI_Wtime();  
 
if  (commRank  ==  0)  
  h_result_final  =  (int  *)  malloc(  
        result_size  /  lenp  *  commSize  *  sizeof(int));  
 
Exhibit 7 MPI Source - Execute GPU and CPU algorithms

The next step after execution is to gather the results to the master worker. However, the major part of the results vector will consist of false values, as it is not usual for a pattern to be found that many times. On the one hand, the MPI_Gather() function could be used, but it would not be
time-efficient; on the other hand, a more efficient solution is to do some preprocessing on each worker and send only the positions of the matches to the master. Moreover, in order to avoid network packet overheads, a buffer is used so that the gathered results are sent in groups.

timeStartGather  =  MPI_Wtime();  
MPI_Status  status;  
const  int  buf  =  32;  
int  position[buf];  
if  (commRank  ==  0)  {  
  //  Copy  masters  results  
  int  results  =  0;  
  for  (i  =  0;  i  <  result_size;  i++)  
    if  (h_result[i])  {  
      h_result_final[results]  =  i;  
      results++;  
    }  
 
//  Start  receiving  
  int  received  =  0;  
  int  finWorkers  =  1;  
  while  (finWorkers  <  commSize)  {  
    MPI_CHECK(  
MPI_Recv(&position,  buf,  MPI_INT,  MPI_ANY_SOURCE,  
MPI_ANY_TAG,  MPI_COMM_WORLD,  &status));  
 
    //  Get  the  number  of  the  received  results  
    MPI_Get_count(&status,  MPI_INT,  &received);  
    for  (i  =  0;  i  <  received;  i++)  {  
      h_result_final[results]  =  position[i];  
      results++;  
    }  
    //  Worker  has  finished  transmission  
    if  (status.MPI_TAG  ==  1)  
      finWorkers++;  
  }  
}  
 
Exhibit 8 MPI Source – Gather Master

The master worker first copies the local result positions to the final results
vector, and then starts receiving using MPI_Recv() until all workers have
completed transmission. In order to keep track of the number of finished
workers, the last transmission of each one is marked with the value 1 instead
of 0 as MPI_TAG.

else  {  
  int  tmp_results  =  0;  
  for  (i  =  0;  i  <  result_size;  i++)  
    if  (h_result[i])  {  
      position[tmp_results]  =  start  +  i;  
      tmp_results++;  
 
      //  If  buffer  is  full  send  results  
      if  (tmp_results  >=  buf)  {  
        MPI_CHECK(  
MPI_Send(&position,  tmp_results,  MPI_INT,  
0,  0,  MPI_COMM_WORLD  ));  
        tmp_results  =  0;  
      }  
    }  
 
  //  If  there  are  unsent  results  send  them  and  finish  transmission  
  if  (tmp_results  >  0)  
    MPI_CHECK(  
MPI_Send(&position,  tmp_results,  MPI_INT,  0,  1,  
MPI_COMM_WORLD  ));  
  else    
    MPI_CHECK(  
MPI_Send(NULL,  0,  MPI_INT,  0,  1,  MPI_COMM_WORLD  ));  
}  
timeEndGather  =  MPI_Wtime();  

 
Exhibit 9 MPI Source – Gather Workers

Each worker scans the local results array and copies the match positions to the buffer; when the buffer is full, it is sent to the master using MPI_Send(). The occurrence positions are calculated by adding the local chunk's occurrence position to the start position at which the node began reading the full text, so that the values point to positions in the original text.

5.2. CUDA Parallelization


The computeGPU() function initializes the CUDA execution environment and is called by the MPI program in order to process the text on the NVIDIA CUDA architecture. This function is quite similar for most algorithms except, as expected, for the preprocessing and search parts. In this example, we will illustrate the computeGPU() function that was used in the implementation of the Naïve algorithm.

First of all, the text and the pattern vectors are initialized and copied to GPU memory. In order to achieve this, the vectors are duplicated; however, they are allocated using the cudaMalloc() function instead of the ANSI C malloc(). For ease of use and readability, the GPU variables start with "d_", standing for "device". Then, the cudaMemcpy() function is used to duplicate the text and pattern variables into GPU memory, and cudaMemset() initializes the results Boolean vector with false values.

cudaEventRecord(startMemHost,  0);  
//  Allocate  data  on  GPU  memory  
checkCudaErrors(cudaMalloc((void**)  &d_text,  lent*sizeof(char)));  
checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp*sizeof(char)));  
checkCudaErrors(cudaMalloc((void**)  &d_result,  lent*sizeof(bool)));    
//  Copy  to  GPU  memory  
checkCudaErrors(cudaMemcpy(d_text,  h_text,  lent*sizeof(char),  
cudaMemcpyHostToDevice));  
checkCudaErrors(cudaMemcpy(d_pattern,  h_pattern,  lenp*sizeof(char),  
cudaMemcpyHostToDevice));  
 
checkCudaErrors(cudaMemset(d_result,  false,  lent*sizeof(bool)));  
 
cudaEventRecord(stopMemHost,  0);  
 
Exhibit 10 computeGPU() Memory Initialization

As can be seen, cudaEventRecord() is used to measure the execution time of each step, while checkCudaErrors() wraps the execution of CUDA functions, handling error reporting in case of a problem during execution. The next step is the calculation of the number of CUDA blocks
depending on the thread size. The formula used for 1D vectors, such as the text being processed, is presented below.

blocks = (vector_size + threadsPerBlock − 1) / threadsPerBlock

The vector size in this case is the size of the results vector, which has the same length as the text for ease of calculation, since linked lists are too complex to be used efficiently on parallel architectures.

//  Calculate  Blocks  /  Threads  


int  threadsPerBlock  =  threadSize;  
int  blocksPerGrid  =  (lent  +  threadsPerBlock  -  1)  /  threadsPerBlock;  
 
//  Run  Main  kernel  
cudaEventRecord(startCompute,  0);  
bruteforceGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  
lent,  lenp,  d_result);  
cudaEventRecord(stopCompute,  0);  
 
//  Copy  data  back  to  Host  memory  
cudaEventRecord(startMemDevice,  0);  
checkCudaErrors(cudaMemcpy(h_result,  d_result,  lent*sizeof(bool),  
cudaMemcpyDeviceToHost));  
cudaEventRecord(stopMemDevice,  0);  
 
//  Free  GPU  memory  
checkCudaErrors(cudaFree(d_text));  
checkCudaErrors(cudaFree(d_pattern));  
checkCudaErrors(cudaFree(d_result));  
 
cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
Exhibit 11 computeGPU() Execute and Finalize

Finally, after the kernel has been executed with the specified block and thread sizes using the <<<…>>> kernel call, the result vector is copied back to host memory. The device variables are freed using cudaFree(), and the following execution times are calculated: a) vector initialization and memory copy from host to device, b) kernel execution, c) memory copy from device to host.

6. String Matching Algorithms


This chapter analyzes the main parts of each String Matching algorithm that was implemented to run under the CUDA architecture. The full source code is available in the Appendix.

6.1. Naive Search


Pre-calculation: −     Search: O(mn)

The first algorithm that was developed was the Naïve algorithm (also known as Brute-Force), which is based on the straightforward idea of sliding the pattern along the text and comparing it to each portion of the text. The algorithm tries to match the first character of the text window with the first character of the pattern; in case of success, it tries to match the second, and so on. Otherwise, the algorithm continues by sliding the pattern to the next character of the text. The outer loop is responsible for sliding to the next character, while the inner loop tries to match the characters with the pattern. However, this first and simple approach is very slow, taking O(nm) time [4]. Finally, its implementation can easily be converted into a fuzzy string-matching algorithm by altering the final condition, since the number of matching characters is counted even if a mismatch is found; a sketch of this variant is shown below. This can be very useful in Computational Biology applications, but also makes it the most time-consuming algorithm.
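A sketch of such a fuzzy variant (not part of the thesis implementations; maxMismatches is an illustrative parameter): only the final condition of the sequential code in Exhibit 12 changes.

void  fuzzyCPU(const  char  *T,  int  n,  const  char  *P,  int  m,
              int  maxMismatches,  bool  *result)  {
    for  (int  x  =  0;  x  +  m  <=  n;  x++)  {
        int  k  =  0;                          /*  matching  characters  at  position  x  */
        for  (int  i  =  0;  i  <  m;  i++)
            if  (T[x  +  i]  ==  P[i])
                ++k;
        if  (k  >=  m  -  maxMismatches)        /*  relaxed  condition  instead  of  k  ==  m  */
            result[x]  =  true;
    }
}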

6.1.1. Sequential Implementation


void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  x,  i,  k;  
 
  for  (x  =  0;  x  <  n;  x++)  {  
    k  =  0;  
 
    for  (i  =  0;  i  <  m;  i++)  
      if  (T[x  +  i]  ==  P[i])  
        ++k;  
 
    if  (k  ==  m)  
      result[x]  =  true;  
  }  
}  
 
Exhibit 12 C implementation of Naïve Search

6.1.2. CUDA Implementation


__global__  void  bruteforceGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    int  k  =  0;  
    int  i;  
 
    for  (i  =  0;  i  <  m;  i++)  
      if  (T[x  +  i]  ==  P[i])  
        ++k;  
 
    if  (k  ==  m)  
      result[x]  =  true;  
  }  
}  
 
Exhibit 13 CUDA Implementation of Naïve Search

6.2. Knuth Morris-Pratt


Pre-calculation: O(m)     Search: O(n)     Overall: O(m + n)

The Knuth, Morris and Pratt [7] algorithm was conceived by D. Knuth and V. Pratt, and independently by J. Morris, in 1974. The three of them published it together in 1977. The algorithm is based on the Naïve algorithm and reduces the number of comparisons by using the information learnt in the inner loop to determine how many skips should take place in the outer loop [33]. The KMP algorithm uses the pattern to pre-compute these skips and starts searching like the Naïve algorithm. However, in case of a mismatch, it uses the pre-computed skip vector in order to determine the position from which to continue.

Due to the sequential dependencies of the pre-computation part of the algorithm, only the main search part was parallelized to take advantage of the CUDA architecture. Moreover, the CUDA implementation of the KMP algorithm was customized to search over larger chunks of the text per CUDA thread (the chunk factor in Exhibit 16), in order to make efficient use of the pre-calculated skip vector.

6.2.1. Sequential Implementation


void  preKmp(char  *x,  int  m,  int  kmpNext[])  {  
  int  i,  j;  
  i  =  0;  
  j  =  kmpNext[0]  =  -1;  
  while  (i  <  m)  {  
    while  (j  >  -1  &&  x[i]  !=  x[j])  
      j  =  kmpNext[j];  
    i++;  
    j++;  
    if  (i  <  m  &&  x[i]  ==  x[j])  
      kmpNext[i]  =  kmpNext[j];  
    else  
      kmpNext[i]  =  j;  
  }  
}  
 
 
Exhibit 14 C implementation of KMP Preprocessing [34]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  i,  j;  
  int  *kmpNext  =  (int*)  malloc((m  +  1)  *  sizeof(int));  
 
  /*  Preprocessing  */  
  preKmp(P,  m,  kmpNext);  
 
  /*  Searching  */  
  i  =  j  =  0;  
  while  (j  <  n)  {  
    while  (i  >  -1  &&  P[i]  !=  T[j])  
      i  =  kmpNext[i];  
    i++;  
    j++;  
    if  (i  >=  m)  {  
      result[j  -  i]  =  true;  
      i  =  kmpNext[i];  
    }  
  }  
}  
 
Exhibit 15 C implementation of KMP [34]

6.2.2. CUDA Implementation


__global__  void  kmpGPU(const  char  *T,  const  char  *P,  const  int  n,  const  
int  m,  const  int  *kmpNext,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    int  i,  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int  stop  =  (x  +  1)  *  m  *  chunk  +  m  -  1;  
    if  (stop  >  n)  
      stop  =  n;  
 
    i  =  0;  
    j  =  start;  
    while  (j  <  stop)  {  
      while  (i  >  -1  &&  P[i]  !=  T[j])  
        i  =  kmpNext[i];  
      i++;  
      j++;  
      if  (i  >=  m)  {  
        result[j  -  i]  =  true;  /*  record  the  match  at  its  starting  position  */  
        i  =  kmpNext[i];  
      }  
    }  
  }  
}  
 
 
Exhibit 16 CUDA Implementation of KMP

6.3. Horspool
Pre-calculation: O(m + σ)     Search: O(mn)

The Horspool [9] algorithm is a simplification of the Boyer-Moore algorithm that was published by N. Horspool in 1980. It is much easier to implement, especially on parallel structures. The Horspool algorithm uses only the bad-character shift vector of the Boyer-Moore algorithm, producing a very efficient algorithm in practice. The searching phase has a quadratic worst case, but the average number of comparisons for a single text character is between 1/σ and 2/(σ+1).

For each position of the window, the algorithm compares the window's last character with the last character of the pattern and, if they match, continues to compare backwards. Then, it shifts the window so that the rightmost occurrence of that character in the pattern is aligned with the last character of the previous window.

The CUDA implementation pre-calculates the shifts on the CPU and then copies the shift table to GPU memory. Using this table, each thread checks whether its current position is a valid shift position, in order to continue execution [35].

6.3.1. Sequential Implementation


void  preBmBcCPU(char  *P,  int  m,  int  bmBc[])  {  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  i++)  
    bmBc[i]  =  m;  
  for  (i  =  0;  i  <  m  -  1;  i++)  
    bmBc[P[i]]  =  m  -  i  -  1;  
}  
 
Exhibit 17 C implementation of Horspool Preprocessing [36]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  j,  bmBc[ASIZE];  
  char  c;  
 
  /*  Preoprocessing  */  
  preBmBcCPU(P,  m,  bmBc);  
 
  /*  Searching  */  
  j  =  0;  
  while  (j  <=  n  -  m)  {  
    c  =  T[j  +  m  -  1];  
    if  (P[m  -  1]  ==  c  &&  memcmp(P,  T  +  j,  m  -  1)  ==  0)  
      result[j]  =  true;  
    j  +=  bmBc[c];  
  }  
}  
 
Exhibit 18 C implementation of Horspool [36]

6.3.2. CUDA Implementation


void  precomputeShifts(char  *T,  int  n,  int  m,  int  *bmBc,  int  *preComp)  {  
  int  i  =  0;  
  while  (i  <=  n  -  m)  {  
    i  +=  bmBc[T[i  +  m  -  1]];  
    preComp[i]  =  i;  
  }  
}  
 
__global__  void  horsGPU(const  char  *T,  const  char  *P,  const  int  n,  const  
int  m,  const  int  *bmBc,  const  int  *preComp,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -  m  &&  preComp[x]  ==  x)  {  
    bool  found  =  true;  
 
    char  c  =  T[x  +  m  -  1];  
    for  (int  i  =  0;  i  <  m  -  1;  ++i)    
      if  (P[m  -  1]  !=  c  ||  P[i]  !=  T[x  +  i])  {  
        found  =  false;  
        break;  
      }  
    if  (found)  
      result[x]  =  true;  
  }  
}  
 
 
Exhibit 19 CUDA Implementation of Horspool

6.4. Karp Rabin


Pre-calculation: O(m)     Search: O(mn)     Expected: O(n + m)

The Karp-Rabin algorithm was created in 1987 by M. Rabin and R. Karp [12] and makes use of hashing to search for the pattern inside the given text. It is especially widely used for multiple-pattern searches. The algorithm calculates a hash value for the pattern and one for the current search window. If the hash values are unequal, it calculates the hash for the next window.

As far as the CUDA implementation is concerned, a custom memcmp() function was developed in order to handle the memory block comparisons on the device. Moreover, the hash of the pattern, hx, is pre-calculated in the main program, as it remains constant throughout the execution.
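The custom device-side comparison itself is not shown in Exhibit 21; a minimal sketch of such a helper, mirroring the semantics of the C library memcmp(), could look as follows (the exact version used in the thesis is listed in the Appendix, mpiKarpRabin.cu):

__device__  int  memcmpGPU(const  char  *a,  const  char  *b,  int  m)  {
    for  (int  i  =  0;  i  <  m;  ++i)
        if  (a[i]  !=  b[i])        /*  first  differing  byte  decides  the  sign  */
            return  (unsigned  char)  a[i]  -  (unsigned  char)  b[i];
    return  0;                     /*  the  m  bytes  are  identical  */
}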

6.4.1. Sequential Implementation


void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  d,  hx,  hy,  i,  j;  
 
  /*  Preprocessing  */  
  for  (d  =  i  =  1;  i  <  m;  ++i)  
    d  =  (d  <<  1);  
 
  for  (hy  =  hx  =  i  =  0;  i  <  m;  ++i)  {  
    hx  =  ((hx  <<  1)  +  P[i]);  
    hy  =  ((hy  <<  1)  +  T[i]);  
  }  
 
  /*  Searching  */  
  j  =  0;  
  while  (j  <=  n  -  m)  {  
    if  (hx  ==  hy  &&  memcmp(P,  T  +  j,  m)  ==  0)  
      result[j]  =  true;  
    hy  =  REHASH(T[j],  T[j  +  m],  hy);  
    ++j;  
  }  
}  
 
Exhibit 20 C implementation of Karp Rabin [37]
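The REHASH macro used above is not defined in the exhibit; in the reference implementation that the code is based on [37], the rolling-hash update follows this style, where d is the weight 2^(m−1) computed in the preprocessing loop:

/*  Remove  the  leftmost  character  a  from  the  hash  h,  shift,  and  add  the
    new  rightmost  character  b.  */
#define  REHASH(a,  b,  h)  ((((h)  -  (a)  *  d)  <<  1)  +  (b))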

6.4.2. CUDA Implementation


__global__  void  krGPU(const  char  *T,  const  char  *P,  const  int  n,  const  
int  m,  
    int  hx,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -  m)  {  
    int  hy,  i;  
 
    /*  Preprocessing  */  
    for  (hy  =  i  =  0;  i  <  m;  ++i)  {  
      hy  =  ((hy  <<  1)  +  T[i  +  x]);  
    }  
 
    /*  Searching  */  
    if  (hx  ==  hy  &&  memcmpGPU(P,  T  +  x,  m)  ==  0)  
      result[x]  =  true;  
  }  
}  
 
Exhibit 21 CUDA Implementation of Karp Rabin

6.5. Quick Search


Pre-calculation: O(m + σ)     Search: O(mn)

The Quick-Search algorithm was published in 1990 by D. Sunday [13]. Its major unique feature is that it was simplified to use only the bad-character shift table of BM, like Horspool, making it very fast in practice for short patterns and large alphabets. Furthermore, it is much easier to implement than BM, especially on parallel structures.

The algorithm moves the text window from left to right and compares it with the pattern; when a mismatch is found, it shifts the window according to the bad-character shift of the text character immediately following the window, and continues the search.

The CUDA implementation pre-calculates the shifts on the CPU and then copies the shift table to GPU memory. Using this table, each thread checks whether its current position is a valid shift position, in order to continue execution [35].

6.5.1. Sequential Implementation


void  preQsBcCPU(char  *P,  int  m,  int  qbc[])  {  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  i++)  
    qbc[i]  =  m  +  1;  
  for  (i  =  0;  i  <  m;  i++)  
    qbc[P[i]]  =  m  -­‐  i;  
}  
 
 
Exhibit 22 C implementation of Quick Search Preprocessing [38]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  i,  s,  qsbc[ASIZE];  
 
  /*  Preoprocessing  */  
  preQsBcCPU(P,  m,  qsbc);  
 
  /*  Searching  */  
  s  =  0;  
  while  (s  <=  n  -­‐  m)  {  
    i  =  0;  
    while  (i  <  m  &&  P[i]  ==  T[s  +  i])  
      i++;  
    if  (i  ==  m)  
      result[s]  =  true;  
    s  +=  qsbc[T[s  +  m]];  
  }  
}  
 
Exhibit 23 C implementation of Quick Search [38]

6.5.2. CUDA Implementation


void  precomputeShifts(char  *T,  int  n,  int  m,  int  *bmBc,  int  *preComp)  {  
  int  i  =  0;  
  while  (i  <=  n  -­‐  m)  {  
    i  +=  bmBc[T[i  +  m]];  
    preComp[i]  =  i;  
  }  
}  
 
__global__  void  quickSearchGPU(const  char  *T,  const  char  *P,  const  int  n,  
const  int  m,  const  int  *bmBc,  const  int  *preComp,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -­‐  m  &&  preComp[x]  ==  x)  {  
    bool  found  =  true;  
 
    for  (int  i  =  0;  i  <  m;  ++i)    
      if  (P[i]  !=  T[x  +  i])  {  
        found  =  false;  
        break;  
      }  
    if  (found)  
      result[x]  =  true;  
  }  
}  
 
Exhibit 24 CUDA Implementation of Quick Search

6.6. Shift Or
Pre-calculation: O(m + σ)      Search: O(n)

The Shift-Or [39] algorithm uses bitwise techniques to search within the
specified text. It is very efficient when the pattern length does not exceed
the memory-word size of the host machine. The bitwise operations keep track of
all prefixes of the pattern that currently match the text. The Shift-Or
algorithm uses a complemented bit mask D in order to avoid one of the final bit
operations required by Shift-And. The state update is illustrated with a small
example below.
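As a small illustration, assume the 2-byte pattern "ab" on a 32-bit word. The preprocessing routine of Exhibit 25 clears bit 0 of S['a'] and bit 1 of S['b'], leaves every other entry all ones, and produces a lim value with only bit 0 cleared. Scanning the text "xab", D starts as all ones; after 'x' it stays all ones, after 'a' bit 0 of D becomes 0 (the prefix "a" matches), and after 'b' bit 1 becomes 0, so D drops below lim and an occurrence is reported at position j - m + 1 = 1.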

The CUDA implementation executes the preprocessing step on the CPU, building
the table that stores a bit mask for each character of the alphabet according
to the pattern, and then copies this vector to the GPU memory. Moreover, the
algorithm was customized so that each CUDA thread searches a larger chunk of
the text, in order to reuse the running bit state across consecutive characters
more efficiently.

6.6.1. Sequential Implementation


int  preSoCPU(char  *x,  int  m,  unsigned  int  S[])  {  
  unsigned  int  j,  lim;  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    S[i]  =  ~0;  
  for  (lim  =  i  =  0,  j  =  1;  i  <  m;  ++i,  j  <<=  1)  {  
    S[x[i]]  &=  ~j;  
    lim  |=  j;  
  }  
  lim  =  ~(lim  >>  1);  
  return  (lim);  
}  
 
Exhibit 25 C implementation of Shift Or Preprocessing [40]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  unsigned  int  lim,  D;  
  unsigned  int  S[ASIZE];  
  int  j;  
 
  /*  Preprocessing  */  
  lim  =  preSoCPU(P,  m,  S);  
 
  /*  Searching  */  
  for  (D  =  ~0,  j  =  0;  j  <  n;  ++j)  {  
    D  =  (D  <<  1)  |  S[T[j]];  
    if  (D  <  lim)  
      result[j  -­‐  m  +  1]  =  true;  
  }  
}  
 
Exhibit 26 C implementation of Shift Or [40]

6.6.2. CUDA Implementation


__global__ void shiftOrGPU(const char *T, const char *P, const int n,
    const int m, const int *S, int lim, bool *result) {

  unsigned int x = blockDim.x * blockIdx.x + threadIdx.x;

  if (x < n) {
    unsigned int D, j;

    /* each thread scans its own chunk of the text (chunk is a compile-time constant) */
    int start = x * m * chunk;
    int stop = (x + 1) * m * chunk + m - 1;
    if (stop > n)
      stop = n;

    /* Searching */
    for (D = ~0, j = start; j <= stop; ++j) {
      D = (D << 1) | S[T[j]];
      if (D < lim)
        result[j - m + 1] = true;
    }
  }
}
 
Exhibit 27 CUDA Implementation of Shift Or

6.7. Shift And


Pre-calculation: O(m + σ)      Search: O(n)

The Shift-And [39] algorithm is very similar to Shift-Or and uses bitwise
techniques to search within the specified text. Their major difference is that
Shift-And keeps the non-complemented state and therefore needs one bitwise
operation more per character than Shift-Or. The two state updates are compared
below.
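A minimal side-by-side sketch of the two per-character updates, using the variable names of Exhibits 26 and 29 (S holds the complemented masks for Shift-Or and the plain masks for Shift-And, and F = 1 << (m - 1)):

/* Shift-Or: an occurrence ends at position j when bit m-1 of D becomes 0 */
D = (D << 1) | S[T[j]];
if (D < lim)
  result[j - m + 1] = true;

/* Shift-And: an occurrence ends at position j when bit m-1 of D becomes 1 */
D = ((D << 1) | 1) & S[T[j]];
if (D & F)
  result[j - m + 1] = true;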

As with Shift-Or, the CUDA implementation executes the preprocessing step on
the CPU, building the table that stores a bit mask for each character of the
alphabet according to the pattern, and then copies this vector to the GPU
memory. Each CUDA thread again searches a larger chunk of the text, in order to
reuse the running bit state across consecutive characters more efficiently.

6.7.1. Sequential Implementation


void  preSACPU(char  *x,  int  m,  unsigned  int  S[])  {  
  unsigned  int  j,  lim;  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    S[i]  =  0;  
  for  (i  =  0,  j  =  1;  i  <  m;  ++i,  j  <<=  1)    
    S[x[i]]  |=  j;  
}  
 
Exhibit 28 C implementation of Shift And Preprocessing [41]

void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  unsigned  int  D;  
  unsigned  int  S[ASIZE],  F;  
  int  j;  
 
  /*  Preprocessing  */  
  preSACPU(P,  m,  S);  
  F  =  1  <<  (m  -­‐  1);  
 
  /*  Searching  */  
  for  (D  =  0,  j  =  0;  j  <  n;  ++j)  {  
    D  =  ((D  <<  1)  |  1)  &  S[T[j]];  
    if  (D  &  F)  
      result[j  -­‐  m  +  1]  =  true;  
  }  
}  

 
Exhibit 29 C implementation of Shift And [41]

6.7.2. CUDA Implementation


__global__  void  shiftAndGPU(const  char  *T,  const  char  *P,  const  int  n,  
        const  int  m,  const  int  *S,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    unsigned  int  D,  F;  
    int  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int  stop  =  (x  +  1)  *  m  *  chunk  +  m  -­‐  1;  
    if  (stop  >  n)  
      stop  =  n;  
 
    F  =  1  <<  (m  -­‐  1);  
 
    /*  Searching  */  
    for  (D  =  0,  j  =  start;  j  <=  stop;  ++j)  {  
      D  =  ((D  <<  1)  |  1)  &  S[T[j]];  
      if  (D  &  F)  
        result[j  -­‐  m  +  1]  =  true;  
    }  
  }  
}  

 
Exhibit 30 CUDA Implementation of Shift And

7. Performance Evaluation

7.1. Testing Methodology

7.1.1. Pattern Size and CUDA Threads

The experiments were made using two different pattern sizes, 4 bytes and 8
bytes, since the pattern size directly affects most of the algorithms; choosing
two common pattern sizes illustrates this difference in performance. Moreover,
another parameter that affects the execution time, as well as the memory
throughput, is the number of CUDA threads per block, which is typically 256 or
more. The number of blocks depends directly on the number of threads and is
calculated with the following formula.

blocks = (vector_size + threads_per_block - 1) / threads_per_block

However, on cards with compute capability [21] of 1.x, such as the GeForce GTX
280, there is an upper limit of 65535 blocks per grid dimension. So, the larger
value of 512 threads per CUDA block was chosen, in order to be able to process
larger text files without exceeding this limit.
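For instance, with 512 threads per block a one-dimensional grid of 65535 blocks can cover up to 65535 x 512 = 33,553,920 text positions for the kernels that assign one position per thread, whereas 256 threads per block would cover at most 16,776,960 positions.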

7.1.2. String Matching Test Files

The following files compose the testing material that was used in the
experiments. They are widely known, and commonly used in string matching
tests. The texts were obtained from The Canterbury Corpus [42] and from
Project Gutenberg [43].

Filename        Size

pge0112.txt     8243 kB
bible11.txt     4843 kB
world192.txt    2415 kB
1musk10.txt     1313 kB
plrabn12.txt    471 kB

Table 1 Common Sample Files

The alphabet of the files is limited to the 128 ASCII characters. Furthermore,
the files were selected so as to exhibit the performance differences under
variable file sizes. Processing small files on a cluster inevitably suffers a
significant drop in performance due to the network overhead and the GPU memory
transfer time, so a GPU cluster is mainly effective on large files.

7.2. Measuring Speedup


In parallel computing, the term speedup [44] refers to how much faster the
parallel algorithm is compared to the corresponding sequential implementation.
The speedup that an algorithm exhibits can be calculated using the formula:

S_p = T_1 / T_p

where p is the number of processors, T_1 is the execution time of the
sequential algorithm and T_p is the execution time of the parallel algorithm with
p processors. The value depends on the architecture, as well as on how
efficiently the processing is parallelized compared to how much effort is
wasted in communication and synchronization. The ideal situation is called
linear speedup, and means that doubling the number of processors doubles the
speed.

However, as only part of a program can usually be parallelized, it is worth
mentioning Amdahl's law [45], published by G. Amdahl in 1967. This law is
commonly used in parallel computing to find the theoretical maximum expected
improvement to an overall system when only a part of the execution is
parallelized and improved. For example, assume that a program needs 20 hours on
a single core and only 1 hour of it cannot be parallelized, meaning that the
remaining 19 hours (95%) can run in parallel. No matter how efficient the
parallelization is, the total execution time cannot drop below that critical
1 hour, so the maximum speedup is 20x. More examples are depicted in Figure 13,
and the corresponding formula is given below.
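In its general form, with p denoting the fraction of the execution that can be parallelized and N the number of processors, the law can be written as

S(N) = 1 / ((1 - p) + p / N),

so that for the example above (p = 0.95) the speedup approaches 1 / 0.05 = 20 as N grows.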

Figure 13 Amdahl's Law

Amdahl's law cannot be applied directly to the CUDA versus CPU comparison, as
the processor speeds are totally different; however, it can be used to
approximate the MPI times. As far as CUDA is concerned, NVIDIA has provided
technical reports claiming CUDA to be 30x faster in most cases and sometimes
even 100x faster, depending on the algorithm structure and complexity [20]; the
comparison was made between GPU cards and CPUs of the same price range.
Speedups of 100x can be extremely important for science. To put this into
perspective, imagine a problem that takes 12 hours: a 10x speedup would reduce
the execution to 1.2 hours, a 100x speedup to 7.2 minutes and a 1000x speedup
to 43.2 seconds. The maximum speedup that was exhibited in this thesis was
approximately 21x. This is due to the nature of the algorithms that were
implemented, as CUDA performs better on problems of O(n^2) or higher
complexity, whereas most string matching algorithms usually have O(n)
complexity.

7.3. CPU vs GPU Comparison


In order to evaluate the results each algorithm was executed 10 times and the
average execution times were used. Moreover, the comparison was made
between similar price range CPU and GPU hardware (Intel Core 2 Duo E8400
and NVIDIA Geforce GTX 280).

As the alphabet size was the same for all execution experiments, a correlation
between the text size and the execution time was observed, as expected. In
the following graph the search execution times of CPU and GPU are
presented for the Naïve algorithm (pattern length 8, the GPU memory copy
times are excluded).

Figure 14 Naive CPU Times per Text (search execution time in ms for text sizes from 481,861 B to 8,441,343 B)



Figure 15 Naive GPU Times per Text (search execution time in ms for text sizes from 481,861 B to 8,441,343 B)

This correlation is observed because, in all the implemented algorithms, the
text length directly affects the complexity and, as a result, the processing
time. The following graphs are based on the largest file (pge0112.txt), as its
8,243 kB size facilitates the presentation of the results, especially for the
MPI implementations, which are more efficient under larger file sizes.

The patterns were chosen among the most frequent words of the text: the 4-byte
pattern was the word “last” and the 8-byte pattern was the word “probably”.
Both words appear about 530 times within the text, which corresponds to
approximately 0.006% of the text positions.

The following graph (Figure 16) compares the execution times of CPU and
GPU implementations of the 7 string matching algorithms, under different
pattern sizes.

Figure 16 CPU vs GPU Execution time (per algorithm, CPU and GPU times for 4 and 8 Byte patterns, in ms)

The CPU execution times of the Naïve algorithm were outside the visible chart
area, but are available in more detail in Table 2.

          CPU 4        GPU 4       CPU 8        GPU 8

Naïve     245.38 ms    12.95 ms    442.95 ms    18.82 ms
Hor       15.38 ms     9.53 ms     15.38 ms     9.52 ms
KR        71.83 ms     8.79 ms     71.42 ms     9.86 ms
KMP       103.15 ms    10.31 ms    102.17 ms    10.62 ms
QS        25.15 ms     9.75 ms     13.47 ms     9.52 ms
SA        56.74 ms     14.86 ms    56.69 ms     15.90 ms
SOr       56.30 ms     14.85 ms    56.25 ms     15.90 ms

Table 2 CPU vs GPU Execution time



All the presented execution times for both CPU and GPU include any necessary
preprocessing. However, it was observed that, although the complexity of
preprocessing depends on the pattern size and the alphabet size, the
preprocessing times of most algorithms were extremely small for both patterns,
remaining well below a millisecond even for the 8 Byte pattern. As a result,
the preprocessing part of the selected algorithms is quite insignificant for
the selected pattern sizes and does not affect the total execution time
considerably.

The CPU execution time, measured as the execution time of the computeCPU()
function, is represented by the CPU bars. The GPU execution time was calculated
as the sum of the CUDA memory copy from the Host to the Device, the
preprocessing functions, the kernel execution, and the CUDA memory copy of the
results from the Device back to the Host. These procedures are a substantial
part of the GPU algorithm, since they are required in order to print the
results or use them later in the execution. The newer NVIDIA Fermi and Kepler
cards, with compute capability 2.0 or higher, support printf() calls from
inside the kernel. This means that the final copy of the results from the GPU
memory to the host memory could be avoided, providing even better speedups.
However, the GTX 280 cards have compute capability 1.x and do not support such
calls.

It can clearly be seen that all the algorithms exhibit a significant speedup
when running on the GPU, for both pattern sizes. This performance boost is
depicted in Figure 17 as an average over the two pattern sizes.

Figure 17 Average GPU vs CPU Speedup (per algorithm, ranging from 205% to 2167%)

As can be seen, the Naïve, KR and KMP algorithms show the most significant
speedups. Furthermore, it is worth mentioning that all algorithms exhibited at
least a two-fold performance increase compared to their CPU implementations. As
the two architectures are totally different, Amdahl's law cannot be applied to
this case; the differences between the speedups are instead due to the
different structure and complexity of each algorithm. For example, the
significant speedup of the Naïve algorithm is due to its O(mn) complexity even
in the best case, which gives it the worst best case of all the algorithms, and
algorithms with larger complexities can be much more efficient when implemented
in CUDA. Moreover, the Horspool and Quick Search algorithms exhibited the
smallest speedups, as a significant portion of their structure is sequential:
their CUDA implementations first run once through the text on the CPU, using
the bad-character shift table to calculate the valid window positions, and only
then run on the GPU for those positions. So, a very large portion of the code
is still executed sequentially, limiting the speedup.

7.4. GPU vs the Cluster Comparison


As observed above, all GPU implementations exhibited a significant speedup
running on a single host. The following graph (Figure 18) compares the
execution times on a single host with the acceleration provided by using
multiple hosts; in this thesis, 10 hosts with the same hardware specifications
were used. It is worth mentioning that the large text file was chosen for these
experiments, so that there is a reasonable balance between the processing time
and the time required for gathering the results.

Figure 18 Single GPU vs MPI GPU Cluster Execution time (per algorithm, single-GPU and cluster times for 4 and 8 Byte patterns, in ms)

The 4, 8 suffixes represent the two different pattern sizes that were used, as
in the previous executions. Additionally, Table 3 presents the same execution
results with more details.

          MPI 4      MPI 8      Single 4    Single 8

Naïve     6.84 ms    7.24 ms    11.75 ms    17.63 ms
Hor       6.50 ms    6.21 ms    8.32 ms     8.32 ms
KR        6.40 ms    6.28 ms    8.79 ms     9.87 ms
KMP       7.00 ms    7.26 ms    9.12 ms     9.78 ms
QS        6.47 ms    6.30 ms    8.16 ms     8.33 ms
SA        7.14 ms    7.02 ms    13.66 ms    14.71 ms
SOr       6.98 ms    6.88 ms    13.66 ms    14.71 ms

Table 3 Single GPU vs MPI GPU Cluster Execution time

The MPI times were calculated as the sum of the total CUDA execution time and
the time needed for gathering the results from the nodes to the master worker.
The gathering step can be very time consuming, as it depends on the network
links between the nodes, which are much slower than intra-host communication.
Even with the network latency, all algorithms again exhibited a significant
performance increase, which is presented in more detail in Figure 19.

Figure 19 Average Single GPU vs MPI GPU Cluster Speedup (per algorithm, ranging from 132% to 243%)



It can clearly be seen that the speedup obtained on the MPI cluster is much
smaller than the CUDA speedup. This is due to the two memory copy procedures,
copying the data to the device and collecting the results back, which play a
significant role in the total GPU execution time. Figure 20 shows that these
two operations take, on average, about 59% of the total single-GPU execution
time, with a minimum of 38% and a maximum of 75% depending on the algorithm; in
some cases they take more time than the computation itself. Furthermore, this
portion represents the crucial part of the algorithms that cannot be
parallelized and, as Amdahl's law states, it works as a barrier to the speedup.
Specifically, Amdahl's law says that when the parallel portion is around 50%,
the maximum speedup that can be achieved is around 2x. Although many
architectural parameters affect this number, it is very interesting that the
law applies to this case, as the measured results are very close to the
specified barrier.
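Indeed, applying Amdahl's law with a parallel fraction of p = 0.5 gives a limit of 1 / (1 - 0.5) = 2 as the number of processors (or, here, nodes) grows.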

Figure 20 Single GPU Memory Copy and Execution times (per algorithm: memory copy to the device, computation, and memory copy back to the host, in ms)

It is worth mentioning that a crucial part of the MPI implementation is the
gathering of the results to the master worker, as Figure 21 depicts. This
process cannot be optimized much further and is very time consuming, as the
network links have far smaller bandwidth than PCI Express 2.0. On average,
gathering the data was responsible for 45% of the execution time across all
algorithms, with a minimum of 12% in Naive Search and a maximum of 71% in Quick
Search. Such percentages act as barriers to the optimizations that can be
performed, as the gathering of the results cannot be parallelized.

Figure 21 MPI Execution and Gather times (per algorithm: computation and result-gathering time, in ms)

Moreover, it can be seen in Figure 22 that the Karp Rabin, Horspool and Quick
Search algorithms performed best, in that order. The three of them managed to
search an 8 MB file and gather the resulting pattern occurrences in about
6.38 ms, on average over both pattern sizes.

Figure 22 Average MPI GPU Cluster execution time (per algorithm, in ms)

Finally, it is worth noting that, through this hybrid approach to string
matching, all algorithms exhibited at least a three-fold speedup. The Naïve
algorithm had the most significant performance increase of 49x compared to its
single-CPU version. Additionally, algorithms such as Karp Rabin, which ended up
having one of the best execution times, managed to run roughly 12x faster than
their single-CPU implementation, as presented in Figure 23.

Figure 23 CPU vs MPI GPU Cluster Speedup (per algorithm, ranging from 302% to 4889%)



8. Conclusions
In this thesis, parallel implementations of the Naïve, Knuth Morris-Pratt,
Horspool, Karp Rabin, Quick Search, Shift Or and Shift And exact string
matching algorithms were presented using the NVIDIA CUDA architecture. The
sequential and parallel implementations were compared in terms of running time
under different pattern sizes. The results showed that the parallel
implementations of algorithms such as Naïve Search were executed up to 21x
faster than the sequential algorithm. Furthermore, the Knuth Morris-Pratt and
Karp Rabin algorithms exhibited 9.6x and 7.7x speedups respectively, while the
rest of them exhibited at least a 2x increase. It was observed that the
speedups depended directly on the structure of each algorithm and on the
portion of the overall procedure that could be parallelized.

Furthermore, the algorithms were customized to run on a 10-node GPU cluster,
illustrating the performance acceleration offered by this model; MPI [2] was
used to handle the communication between the nodes. Running on the cluster, all
the algorithms exhibited a speedup from 1.3x to a maximum of 2.5x compared to
their single-GPU implementations. It was observed that the unavoidable data
transfers between the CPU and the GPU were responsible for approximately 59% of
the CUDA execution time, depending on the algorithm. It is very interesting
that this portion of the execution time is directly related to the speedup that
all algorithms exhibited at this stage, as it comes fully in line with Amdahl's
law. Another crucial part that could not be avoided was the gathering of the
results to the master worker, which was responsible for 45% of the MPI
execution time on average. As can be seen, both the memory copies and the
gathering of the results are very time consuming procedures, and in total they
take much more time than the actual search function.

In total, the Naïve algorithm had the most significant performance increase, of
48x, when comparing the MPI version with the sequential one. This substantial
speedup was due to the algorithm's higher complexity, as well as its simple
structure. However, the fastest algorithm for both patterns proved to be Karp
Rabin, exhibiting an 11.3x speedup in total, followed by Horspool, which was
slightly faster for 8 Byte patterns, and Quick Search.

Future research in the area of string matching, GPGPU parallel processing and
MPI could focus on intensive profiling of the implemented algorithms, on
shared-memory GPU optimizations, and on the latest NVIDIA Compute Capability
3.x, which introduces new features that could facilitate the execution by
replacing existing time consuming procedures, such as the copy of the sparse
result vector. Additionally, heterogeneous architectures using FPGAs could be
examined and compared, in terms of performance, with the existing
implementations.

9. Appendix

9.1. Naïve Search

9.1.1. MPI Implementation (mpiBrute.cpp)


#include  <iostream>  
#include  <fstream>  
#include  <cmath>  
#include  <stdlib.h>  
#include  <string.h>  
#include  <sys/time.h>  
 
using  namespace  std;  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiBrute.h"  
 
//  Error  handling  macros  
#define  MPI_CHECK(call)  \  
       if((call)  !=  MPI_SUCCESS)  {  \  
               cerr  <<  "MPI  error  calling  \""#call"\"\n";  \  
               my_abort(-­‐1);  }  
 
void  computeCPU(char  *T,  int  n,  char  *P,  int  m,  bool  *result)  {  
  int  x,  i,  k;  
 
  for  (x  =  0;  x  <  n;  x++)  {  
    k  =  0;  
 
    for  (i  =  0;  i  <  m;  i++)  
      if  (T[x  +  i]  ==  P[i])  
        ++k;  
 
    if  (k  ==  m)  
      result[x]  =  true;  
  }  
}  
 
//  Shut  down  MPI  cleanly  if  something  goes  wrong  
void  my_abort(int  err)  {  
  cout  <<  "Test  FAILED\n";  
  MPI_Abort(MPI_COMM_WORLD,  err);  
}  
 
//  Host  code  
int  main(int  argc,  char*  argv[])  {  
  int  i;  
 
  times  timeGPU  =  {0};    //  zero-initialise,  as  not  every  field  is  set  by  every  implementation  

  double  timeStartRead,  timeEndRead;  


  double  timeStartCuda,  timeEndCuda;  
  double  timeStartCPU,  timeEndCPU;  
  double  timeStartGather,  timeEndGather;  
 
  int  cudaThreadSize  =  512;  
  char  *filepath;  
  if  (argc  >  1)  {  
    cudaThreadSize  =  atoi(argv[1]);  
    filepath  =  argv[2];  
  }  else  {  
    filepath  =  "/nfs/iassael/str/world192.txt";  
  }  
  char  patternpath[]  =  "/nfs/iassael/str/pattern.txt";  
 
  char  *h_text;  
  char  *h_pattern;  
  bool  *h_result;  
  int  *h_result_final;  
 
  //  Initialize  MPI  state  
  MPI_CHECK(MPI_Init(&argc,  &argv));  
 
  //  Get  our  MPI  node  number  and  node  count  
  int  commSize,  commRank;  
  MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD,  &commSize));  
  MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD,  &commRank));  
 
  long  int  lenFullText;  
 
  timeStartRead  =  MPI_Wtime();  
  //Read  Pattern  File  
  ifstream  filePattern(patternpath);  
  if  (!filePattern.is_open())  {  
    cout  <<  "Rank"  <<  commRank  <<  ":  Cannot  open  pattern!"  <<  endl;  
    my_abort(1);  
  }    
  string  str_pattern((std::istreambuf_iterator<char>(filePattern)),  
      std::istreambuf_iterator<char>());  
  filePattern.close();  
 
  h_pattern  =  new  char[str_pattern.size()];  
  strcpy(h_pattern,  str_pattern.c_str());  
  int  lenp  =  strlen(h_pattern);  
 
  //  Read  Main  File  
  ifstream  fileMain(filepath);  
  if  (!fileMain.is_open())  {  
    cout  <<  "Rank"  <<  commRank  <<  ":  Cannot  open  text!"  <<  endl;  
    my_abort  (1);  
  }    
  fileMain.seekg(0,  std::ios::end);  
  lenFullText  =  fileMain.tellg();  
  timeEndRead  =  MPI_Wtime();  
 
  int  start  =  commRank  *  lenFullText  /  commSize;  
  int  stop  =  (commRank  +  1)  *  lenFullText  /  commSize  +  (lenp  -­‐  1);  
 
  if  (stop  >  lenFullText)  

    stop  =  lenFullText;  
 
  fileMain.seekg(start,  std::ios::beg);  
 
  //  allocate  one  extra  byte  and  zero-fill,  so  the  buffer  is  null-terminated  for  strlen()  
  h_text  =  (char  *)  calloc(stop  -  start  +  2,  sizeof(char));  
  fileMain.read(h_text,  stop  -­‐  start  +  1);  
  fileMain.close();  
 
  int  lent  =  strlen(h_text);  
 
  //  Calculate  Lenths  
  int  result_size  =  ceil(lenFullText  /  commSize);  
 
  //  Initialize  local  results  array  
  h_result  =  (bool  *)  malloc(result_size  *  sizeof(bool));  
  memset(h_result,  false,  result_size  *  sizeof(bool));  
 
  //  On  each  node,  run  computation  on  GPU  and  CPU  
  timeStartCuda  =  MPI_Wtime();  
  computeGPU(h_result,  result_size,  h_text,  lent,  h_pattern,  lenp,  
      cudaThreadSize,  timeGPU);  
  timeEndCuda  =  MPI_Wtime();  
 
  timeStartCPU  =  MPI_Wtime();  
  computeCPU(h_text,  lent,  h_pattern,  lenp,  h_result);  
  timeEndCPU  =  MPI_Wtime();  

  //  Start  timing  the  gathering  of  the  results  to  the  master  worker  
  timeStartGather  =  MPI_Wtime();  

  //  Initialize  global  results  array  
  if  (commRank  ==  0)  
    h_result_final  =  (int  *)  malloc(  
        result_size  /  lenp  *  commSize  *  sizeof(int));  
 
  //  Init  MPI  receive  buffer  
  MPI_Status  status;  
  const  int  buf  =  32;  
  int  position[buf];  
 
  if  (commRank  ==  0)  {  
    //  Copy  masters  results  
    int  results  =  0;  
    for  (i  =  0;  i  <  result_size;  i++)  
      if  (h_result[i])  {  
        h_result_final[results]  =  i;  
        results++;  
      }  
 
    //  Start  receiving  
    int  received  =  0;  
    int  finWorkers  =  1;  
    while  (finWorkers  <  commSize)  {  
      MPI_CHECK(MPI_Recv(&position,  buf,  MPI_INT,  MPI_ANY_SOURCE,    
              MPI_ANY_TAG,  MPI_COMM_WORLD,  &status));  
 
      //  Get  the  number  of  the  received  results  
      MPI_Get_count(&status,  MPI_INT,  &received);  
      for  (i  =  0;  i  <  received;  i++)  {  
        h_result_final[results]  =  position[i];  
        results++;  
      }  

      //  Worker  has  finished  transmission  


      if  (status.MPI_TAG  ==  1)  
        finWorkers++;  
    }  
  }  else  {  
    int  tmp_results  =  0;  
    for  (i  =  0;  i  <  result_size;  i++)  
      if  (h_result[i])  {  
        position[tmp_results]  =  start  +  i;  
        tmp_results++;  
 
        //  If  buffer  is  full  send  results  
        if  (tmp_results  >=  buf)  {  
          MPI_CHECK(MPI_Send(&position,  tmp_results,  MPI_INT,  0,  0,    
                MPI_COMM_WORLD  ));  
          tmp_results  =  0;  
        }  
      }  
 
    //  If  there  are  unsent  results  send  them  and  finish  transmission  
    if  (tmp_results  >  0)  {  
      MPI_CHECK(MPI_Send(&position,  tmp_results,  MPI_INT,  0,  1,  
              MPI_COMM_WORLD  ));  
    }  else  {  
      MPI_CHECK(MPI_Send(NULL,  0,  MPI_INT,  0,  1,  MPI_COMM_WORLD  ));  
    }  
  }  
  timeEndGather  =  MPI_Wtime();  
 
  if  (commRank  ==  0)  {  
    if  (dbgMode)  
      cout  <<  "name\t"  <<  "n\t"  <<  "m\t"  <<  "thr\t"  <<  "read\t"  <<  
"CPU\t"  
          <<  "cudaFunc\t"  <<  "preComp\t"  <<  "cudaHost\t"  
          <<  "cudaComp\t"  <<  "cudaDevice\t"  <<  "Gather"  <<  endl;  
    cout  <<  filepath  <<  "\t";  
    cout  <<  lent  <<  "\t"  <<  lenp  <<  "\t"  <<  cudaThreadSize  <<  "\t";  
    cout  <<  float(timeEndRead  -­‐  timeStartRead)  *  1000  <<  "\t";  
    cout  <<  float(timeEndCPU  -­‐  timeStartCPU)  *  1000     <<  "\t";  
    cout  <<  float(timeEndCuda  -­‐  timeStartCuda)  *  1000  <<  "\t";  
    //CUDA  times  
    cout  <<  timeGPU.cpuPrepro  <<  "\t"  <<  timeGPU.gpuMemHost  <<  "\t"  
        <<  timeGPU.gpuCompute  <<  "\t"  <<  timeGPU.gpuMemDevice  <<  "\t";  
    //Gather  times  
    cout  <<  float(timeEndGather  -­‐  timeStartGather)  *  1000  <<  endl;  
 
  }  
 
  //  Cleanup  
  MPI_CHECK(MPI_Finalize());  
 
  return  0;  

}  

9.1.2. MPI Implementation (mpiBrute.h)


typedef  struct  times  {  
  float  gpuMemHost;  
  float  gpuCompute;  
  float  gpuMemDevice;  
  float  cpuPrepro;  
}  times;  
 
#define  dbgMode  0  
 
extern  "C"  {  
  void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
          char  *h_pattern,  int  lenp,  int  size,  times  &timeGPU);  
  void  my_abort(int  err);  

}  

9.1.3. CUDA Implementation (mpiBrute.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  User  include  
#include  "mpiBrute.h"  
 
//  Device  code  
__global__  void  bruteforceGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    int  k  =  0;  
    int  i;  
 
    for  (i  =  0;  i  <  m;  i++)  
      if  (T[x  +  i]  ==  P[i])  
        ++k;  
 
    if  (k  ==  m)  
      result[x]  =  true;  
  }  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  ceil((size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock);  
}  
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  

  char  *d_pattern;  
  bool  *d_result;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
 
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent,  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  bruteforceGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  
lent,  lenp,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
}

9.2. Knuth Morris-Pratt

9.2.1. CUDA Implementation (mpiKMP.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiKMP.h"  
 
#define  chunk  2  
 
//  Device  code  
__global__  void  kmpGPU(const  char  *T,  const  char  *P,  const  int  n,  const  int  
m,  const  int  *kmpNext,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    int  i,  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int  stop  =  (x  +  1)  *  m  *  chunk  +  m  -­‐  1;  
    if  (stop  >  n)  
      stop  =  n;  
 
    i  =  0;  
    j  =  start;  
    while  (j  <  stop)  {  
      while  (i  >  -­‐1  &&  P[i]  !=  T[j])  
        i  =  kmpNext[i];  
      i++;  
      j++;  
      if  (i  >=  m)  {  
        result[j  -  i]  =  true;    /*  the  occurrence  starts  at  position  j  -  i  */  
        i  =  kmpNext[i];  
      }  
    }  
  }  
}  
 
void  preKmp(const  char  *P,  int  m,  int  kmpNext[])  {  
  int  i,  j;  
  i  =  0;  
  j  =  kmpNext[0]  =  -­‐1;  
  while  (i  <  m)  {  
    while  (j  >  -­‐1  &&  P[i]  !=  P[j])  
      j  =  kmpNext[j];  
    i++;  

    j++;  
    if  (i  <  m  &&  P[i]  ==  P[j])  
      kmpNext[i]  =  kmpNext[j];  
    else  
      kmpNext[i]  =  j;  
  }  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  
}  
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_kmpNext;  
 
  int  *h_kmpNext  =  (int*)  malloc((lenp  +  1)  *  sizeof(int));  
 
  clock_t  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  
  timeStartPrepro  =  clock();  
  preKmp(h_pattern,  lenp,  h_kmpNext);  
  timeEndPrepro  =  clock();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_kmpNext,  (lenp  +  1)  *  
sizeof(int)));  
 
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_kmpNext,  h_kmpNext,  (lenp  +  1)  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
 

  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  


 
  cudaEventRecord(stopMemHost,  0);  
 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  /  (lenp  *  chunk),  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  kmpGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
lenp,  d_kmpNext,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_kmpNext));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro  =  (float)  (timeEndPrepro  -  timeStartPrepro)  
      /  CLOCKS_PER_SEC  *  1000;  

}  

9.3. Horspool

9.3.1. CUDA Implementation (mpiHorsepool.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiHorspool.h"  
 
//  Device  code  
__global__  void  horsGPU(const  char  *T,  const  char  *P,  const  int  n,  const  
int  m,  
    const  int  *bmBc,  const  int  *preComp,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -­‐  m  &&  preComp[x]  ==  x)  {  
    bool  found  =  true;  
 
    char  c  =  T[x  +  m  -­‐  1];  
    for  (int  i  =  0;  i  <  m  -­‐  1;  ++i)  {  
      if  (P[m  -­‐  1]  !=  c  ||  P[i]  !=  T[x  +  i])  {  
        found  =  false;  
        break;  
      }  
    }  
    if  (found)  
      result[x]  =  true;  
  }  
}  
 
void  precomputeShifts(char  *T,  int  n,  int  m,  int  *bmBc,  int  *preComp)  {  
  int  i  =  0;  
  while  (i  <=  n  -­‐  m)  {  
    i  +=  bmBc[T[i  +  m  -­‐  1]];  
    preComp[i]  =  i;  
  }  
}  
 
void  preBmBc(char  *P,  int  m,  int  bmBc[])  {  
  int  i;  
 
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    bmBc[i]  =  m;  
  for  (i  =  0;  i  <  m  -­‐  1;  ++i)  {  
    bmBc[P[i]]  =  m  -­‐  i  -­‐  1;  
  }  
}  

 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  
}  
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_bmBc;  
  int  *d_preComp;  
  int  bmBc[ASIZE];  
  int  *h_preComp  =  (int*)  malloc(lent  *  sizeof(int));  
 
  double  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  shifts  
  timeStartPrepro  =  MPI_Wtime();  
  preBmBc(h_pattern,  lenp,  bmBc);  
  precomputeShifts(h_text,  lent,  lenp,  bmBc,  h_preComp);  
  timeEndPrepro  =  MPI_Wtime();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_bmBc,  ASIZE  *  sizeof(int)));  
  checkCudaErrors(cudaMalloc((void**)  &d_preComp,  lent  *  sizeof(int)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_bmBc,  bmBc,  ASIZE  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_preComp,  h_preComp,  lent  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  

 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  -­‐  lenp  +  1,  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  horsGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
lenp,  d_bmBc,  d_preComp,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_bmBc));  
  checkCudaErrors(cudaFree(d_preComp));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro  =  (timeEndPrepro  -­‐  timeStartPrepro)  *  1000;  
 
}  

9.4. Karp Rabin

9.4.1. CUDA Implementation (mpiKarpRabin.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  User  include  
#include  "mpiKarpRabin.h"  
 
//  Device  code  
__device__  int  memcmpGPU(const  char  *cs_in,  const  char  *ct_in,  unsigned  int  
n)  {  
  unsigned  int  i;  
  const  unsigned  char  *  cs  =  (const  unsigned  char*)  cs_in;  
  const  unsigned  char  *  ct  =  (const  unsigned  char*)  ct_in;  
 
  for  (i  =  0;  i  <  n;  i++,  cs++,  ct++)  {  
    if  (*cs  <  *ct)  {  
      return  -­‐1;  
    }  else  if  (*cs  >  *ct)  {  
      return  1;  
    }  
  }  
  return  0;  
}  
 
__global__  void  krGPU(const  char  *T,  const  char  *P,  const  int  n,  const  int  
m,  
    int  hx,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -­‐  m)  {  
    int  hy,  i;  
 
    /*  Preprocessing  */  
    for  (hy  =  i  =  0;  i  <  m;  ++i)  {  
      hy  =  ((hy  <<  1)  +  T[i  +  x]);  
    }  
 
    /*  Searching  */  
    if  (hx  ==  hy  &&  memcmpGPU(P,  T  +  x,  m)  ==  0)  
      result[x]  =  true;  
  }  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  
}  
 
//  CUDA  computation  on  each  node  

void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  


    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
 
  clock_t  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  
  timeStartPrepro  =  clock();  
  int  i,  hx;  
  for  (hx  =  i  =  0;  i  <  lenp;  ++i)  {  
    hx  =  ((hx  <<  1)  +  h_pattern[i]);  
  }  
  timeEndPrepro  =  clock();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
//  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks((lent  -­‐  lenp  +  1),  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  krGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  lenp,  
hx,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  

  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro  =  (float)  (timeEndPrepro  -  timeStartPrepro)  
      /  CLOCKS_PER_SEC  *  1000;  

}  
   

9.5. Quick Search

9.5.1. CUDA Implementation (mpiQuickSearch.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiQuickSearch.h"  
 
//  Device  code  
__global__  void  quickSearchGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  const  int  *bmBc,  const  int  *preComp,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <=  n  -­‐  m  &&  preComp[x]  ==  x)  {  
    bool  found  =  true;  
 
    for  (int  i  =  0;  i  <  m;  ++i)  {  
      if  (P[i]  !=  T[x  +  i])  {  
        found  =  false;  
        break;  
      }  
    }  
    if  (found)  
      result[x]  =  true;  
  }  
}  
 
void  precomputeShifts(char  *T,  int  n,  int  m,  int  *bmBc,  int  *preComp)  {  
  int  i  =  0;  
  while  (i  <=  n  -­‐  m)  {  
    i  +=  bmBc[T[i  +  m]];  
    preComp[i]  =  i;  
  }  
}  
void  preBmBc(char  *P,  int  m,  int  bmBc[])  {  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  i++)  
    bmBc[i]  =  m  +  1;  
  for  (i  =  0;  i  <  m;  i++)  
    bmBc[P[i]]  =  m  -­‐  i;  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  
  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  
}  
 

//  CUDA  computation  on  each  node  


void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_bmBc;  
  int  *d_preComp;  
  int  bmBc[ASIZE];  
  int  *h_preComp  =  (int*)  malloc(lent  *  sizeof(int));  
 
  double  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  shifts  
  timeStartPrepro  =  MPI_Wtime();  
  preBmBc(h_pattern,  lenp,  bmBc);  
  precomputeShifts(h_text,  lent,  lenp,  bmBc,  h_preComp);  
  timeEndPrepro  =  MPI_Wtime();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_bmBc,  ASIZE  *  sizeof(int)));  
  checkCudaErrors(cudaMalloc((void**)  &d_preComp,  lent  *  sizeof(int)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_bmBc,  bmBc,  ASIZE  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_preComp,  h_preComp,  lent  *  sizeof(int),  
          cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
//  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  -­‐  lenp  +  1,  threadsPerBlock);  
 

  //  Run  Main  kernel  


  cudaEventRecord(startCompute,  0);  
  quickSearchGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
      lenp,  d_bmBc,  d_preComp,  d_result);  
  cudaEventRecord(stopCompute,  0);  
 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_bmBc));  
  checkCudaErrors(cudaFree(d_preComp));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro  =  (timeEndPrepro  -­‐  timeStartPrepro)  *  1000;  

}  

9.6. Shift Or

9.6.1. CUDA Implementation (mpiShiftOr.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiShiftOr.h"  
 
#define  chunk  2  
 
//  Device  code  
__global__  void  shiftOrGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  const  int  *S,  int  lim,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    unsigned  int  D,  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int  stop  =  (x  +  1)  *  m  *  chunk  +  m  -­‐  1;  
    if  (stop  >  n)  
      stop  =  n;  
 
    for  (D  =  ~0,  j  =  start;  j  <=  stop;  ++j)  {  
      D  =  (D  <<  1)  |  S[T[j]];  
      if  (D  <  lim)  
        result[j  -­‐  m  +  1]  =  true;  
    }  
 
  }  
}  
 
int  preSo(char  *T,  int  m,  int  S[])  {  
  unsigned  int  j,  lim;  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    S[i]  =  ~0;  
  for  (lim  =  i  =  0,  j  =  1;  i  <  m;  ++i,  j  <<=  1)  {  
    S[T[i]]  &=  ~j;  
    lim  |=  j;  
  }  
  lim  =  ~(lim  >>  1);  
  return  (lim);  
}  
 
int  calcBlocks(int  size,  int  threadsPerBlock)  {  

  return  (size  +  threadsPerBlock  -­‐  1)  /  threadsPerBlock;  


}  
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_S;  
  int  h_S[ASIZE];  
 
  clock_t  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  shifts  
  timeStartPrepro  =  clock();  
  int  lim  =  preSo(h_pattern,  lenp,  h_S);  
  timeEndPrepro  =  clock();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_S,  ASIZE  *  sizeof(int)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_S,  h_S,  ASIZE  *  sizeof(int),  cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  /  (lenp  *  chunk),  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  shiftOrGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
      lenp,  d_S,  lim,  d_result);  
  cudaEventRecord(stopCompute,  0);  

 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_S));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro = (timeEndPrepro - timeStartPrepro) / CLOCKS_PER_SEC * 1000;

}  
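The fragment below is not part of the thesis sources; it is a minimal illustrative
sketch of how a caller might report matches from the result buffer that computeGPU
fills in, assuming h_result[i] == true marks an occurrence of the pattern starting
at text position i (which is how shiftOrGPU writes its results). The function name
reportMatches is hypothetical.

#include <cstdio>

// Illustrative only: print every match position recorded in h_result.
void reportMatches(const bool *h_result, int lent, int lenp) {
  // Only positions 0 .. lent - lenp can start a full occurrence.
  for (int i = 0; i + lenp <= lent; ++i) {
    if (h_result[i])
      printf("match at position %d\n", i);
  }
}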

9.7. Shift And

9.7.1. CUDA Implementation (mpiShiftAnd.cu)


#include  <stdio.h>  
#include  <iostream>  
 
#include  <cuda_runtime.h>  
#include  <helper_functions.h>  
#include  <helper_cuda.h>  
 
//  MPI  include  
#include  <mpi.h>  
 
//  User  include  
#include  "mpiShiftAnd.h"  
 
#define  chunk  2  
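// Same chunked partitioning as in mpiShiftOr.cu: each thread scans chunk*m
// text characters plus an m-1 character overlap at the end of its window.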
 
//  Device  code  
__global__  void  shiftAndGPU(const  char  *T,  const  char  *P,  const  int  n,  
    const  int  m,  const  int  *S,  bool  *result)  {  
 
  unsigned  int  x  =  blockDim.x  *  blockIdx.x  +  threadIdx.x;  
 
  if  (x  <  n)  {  
    unsigned  int  D,  F;  
    int  j;  
 
    int  start  =  x  *  m  *  chunk;  
    int stop = (x + 1) * m * chunk + m - 1;
    if  (stop  >  n)  
      stop  =  n;  
 
    F = 1 << (m - 1);
 
    /* Searching: bit i of D is 1 exactly when P[0..i] matches the text
       ending at position j, so D & F (bit m-1 set) signals a complete
       occurrence of P ending at T[j] and starting at j - m + 1. */
    for (D = 0, j = start; j <= stop; ++j) {
      D = ((D << 1) | 1) & S[T[j]];
      if (D & F)
        result[j - m + 1] = true;
    }
  }  
}  
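// preSA builds the Shift-And masks: S[c] has bit i set (1) exactly when
// P[i] == c, i.e. the complement of the Shift-Or table within the low m bits.
// Example: for P = "ab" (m = 2), S['a'] = 0b01 and S['b'] = 0b10.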
 
void  preSA(char  *P,  int  m,  int  S[])  {  
  unsigned  int  j;  
  int  i;  
  for  (i  =  0;  i  <  ASIZE;  ++i)  
    S[i]  =  0;  
  for  (i  =  0,  j  =  1;  i  <  m;  ++i,  j  <<=  1)  {  
    S[P[i]]  |=  j;  
  }  
}  
 
int calcBlocks(int size, int threadsPerBlock) {
  return (size + threadsPerBlock - 1) / threadsPerBlock;
}
 
//  CUDA  computation  on  each  node  
void  computeGPU(bool  *h_result,  int  result_size,  char  *h_text,  int  lent,  
    char  *h_pattern,  int  lenp,  int  threadSize,  times  &timeGPU)  {  
 
  char  *d_text;  
  char  *d_pattern;  
  bool  *d_result;  
  int  *d_S;  
  int  h_S[ASIZE];  
 
  clock_t  timeStartPrepro,  timeEndPrepro;  
 
  cudaEvent_t  startMemHost,  stopMemHost;  
  cudaEvent_t  startMemDevice,  stopMemDevice;  
  cudaEvent_t  startCompute,  stopCompute;  
  cudaEventCreate(&startMemHost);  
  cudaEventCreate(&stopMemHost);  
  cudaEventCreate(&startCompute);  
  cudaEventCreate(&stopCompute);  
  cudaEventCreate(&startMemDevice);  
  cudaEventCreate(&stopMemDevice);  
 
  //Precompute  shifts  
  timeStartPrepro  =  clock();  
  preSA(h_pattern,  lenp,  h_S);  
  timeEndPrepro  =  clock();  
 
  //  Allocate  data  on  GPU  memory  
  cudaEventRecord(startMemHost,  0);  
  checkCudaErrors(cudaMalloc((void**)  &d_text,  lent  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_pattern,  lenp  *  sizeof(char)));  
  checkCudaErrors(cudaMalloc((void**)  &d_result,  lent  *  sizeof(bool)));  
  checkCudaErrors(cudaMalloc((void**)  &d_S,  ASIZE  *  sizeof(int)));  
  //  Copy  to  GPU  memory  
  checkCudaErrors(  
      cudaMemcpy(d_text,  h_text,  lent  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_pattern,  h_pattern,  lenp  *  sizeof(char),  
          cudaMemcpyHostToDevice));  
  checkCudaErrors(  
      cudaMemcpy(d_S,  h_S,  ASIZE  *  sizeof(int),  cudaMemcpyHostToDevice));  
 
  checkCudaErrors(cudaMemset(d_result,  false,  lent  *  sizeof(bool)));  
 
  cudaEventRecord(stopMemHost,  0);  
 
  //  Invoke  kernel  
  int  threadsPerBlock  =  threadSize;  
  int  blocksPerGrid  =  calcBlocks(lent  /  (lenp  *  chunk),  threadsPerBlock);  
 
  //  Run  Main  kernel  
  cudaEventRecord(startCompute,  0);  
  shiftAndGPU<<<blocksPerGrid,  threadsPerBlock>>>(d_text,  d_pattern,  lent,  
      lenp,  d_S,  d_result);  
  cudaEventRecord(stopCompute,  0);  

 
  //  Copy  data  back  to  CPU  memory  
  cudaEventRecord(startMemDevice,  0);  
  checkCudaErrors(  
      cudaMemcpy(h_result,  d_result,  lent  *  sizeof(bool),  
          cudaMemcpyDeviceToHost));  
  cudaEventRecord(stopMemDevice,  0);  
 
  //  Free  GPU  memory  
  checkCudaErrors(cudaFree(d_text));  
  checkCudaErrors(cudaFree(d_pattern));  
  checkCudaErrors(cudaFree(d_result));  
  checkCudaErrors(cudaFree(d_S));  
 
  cudaEventElapsedTime(&timeGPU.gpuMemHost,  startMemHost,  stopMemHost);  
  cudaEventElapsedTime(&timeGPU.gpuCompute,  startCompute,  stopCompute);  
  cudaEventElapsedTime(&timeGPU.gpuMemDevice,  startMemDevice,  
stopMemDevice);  
 
  timeGPU.cpuPrepro = (timeEndPrepro - timeStartPrepro) / CLOCKS_PER_SEC * 1000;

}  

References
[1] Nvidia CUDA FAQ | NVIDIA Developer Zone. https://developer.nvidia.com/cuda-faq
(accessed 2013/01/27).
[2] Wikipedia Message Passing Interface.
http://en.wikipedia.org/w/index.php?title=Message_Passing_Interface&oldid=537683372
(accessed 2013/02/15).
[3] Wikipedia String searching algorithm - Wikipedia, the free encyclopedia.
http://en.wikipedia.org/wiki/String_matching (accessed 2013/01/28).
[4] Francisco G. Martin, Exact String Pattern Recognition. Escuela Universitaria de
Informática, Madrid, Spain.
[5] Dömölki, B., An algorithm for syntactic analysis. Computational Linguistics 1964,
3, 29–46.
[6] Morris, J. H.; Pratt, V. R. A Linear Pattern-Matching Algorithm; University of
California: Berkeley, 1970.
[7] Knuth, D. E.; Morris, J. H.; Pratt, V. R., Fast Pattern Matching in Strings. SIAM
Journal on Computing 1977, 6 (2), 323-350.
[8] Boyer, R. S.; Moore, J. S., A fast string searching algorithm. Commun. ACM
1977, 20 (10), 762-772.
[9] Horspool, R. N., Practical fast searching in strings. Software: Practice and
Experience 1980, 10 (6), 501-506.
[10] Galil, Z.; Seiferas, J., Time-space-optimal string matching. Journal of Computer
and System Sciences 1983, 26 (3), 280-294.
[11] Apostolico, A.; Giancarlo, R., The Boyer Moore Galil string searching strategies
revisited. SIAM J. Comput. 1986, 15 (1), 98-105.
[12] Karp, R. M.; Rabin, M. O., Efficient randomized pattern-matching algorithms.
IBM J. Res. Dev. 1987, 31 (2), 249-260.
[13] Sunday, D. M., A very fast substring search algorithm. Commun. ACM 1990, 33
(8), 132-142.
[14] Navarro, G., A guided tour to approximate string matching. ACM Comput. Surv.
2001, 33 (1), 31-88.
[15] Vasiliadis, G.; Antonatos, S.; Polychronakis, M.; Markatos, E. P.; Ioannidis, S.,
Gnort: High Performance Network Intrusion Detection Using Graphics
Processors. In Proceedings of the 11th international symposium on Recent
Advances in Intrusion Detection, Springer-Verlag: Cambridge, MA, USA, 2008;
pp 116-134.
[16] Roesch, M., Snort - Lightweight Intrusion Detection for Networks. In Proceedings
of the 13th USENIX conference on System administration, USENIX Association:
Seattle, Washington, 1999; pp 229-238.
[17] Fisk, M.; Varghese, G. Fast Content-Based Packet Handling for Intrusion
Detection; UCSD: 2001.
[18] Kouzinopoulos, C. S.; Margaritis, K. G., String Matching on a Multicore GPU
Using CUDA. In Proceedings of the 2009 13th Panhellenic Conference on
Informatics, IEEE Computer Society: 2009; pp 14-18.
[19] Nvidia, CUDA C Programming Guide. PG-02829-001_v5.0 ed.; Nvidia: 2012.
[20] Chong, J. 100x Speedups: Are they real?; Parasians Parallel Computing
Artisans: 2011.
[21] Wikipedia CUDA - Wikipedia, the free encyclopedia.
http://en.wikipedia.org/wiki/CUDA (accessed 2013/02/14).
[22] Wikipedia GPGPU - Wikipedia, the free encyclopedia.
http://en.wikipedia.org/wiki/GPGPU (accessed 2013/02/14).
[23] Nvidia Nsight Eclipse Edition | NVIDIA Developer Zone.
https://developer.nvidia.com/nsight-eclipse-edition (accessed 2013/02/14).
[24] Microsoft Visual Studio 2012 | Microsoft Visual Studio.
http://www.microsoft.com/visualstudio/eng/team-foundation-service (accessed
2013/02/14).
[25] Eclipse Foundation The Eclipse Foundation open source community website.
http://eclipse.org/ (accessed 2013/02/14).
[26] Khronos Group OpenCL - The open standard for parallel programming of
heterogeneous systems. http://www.khronos.org/opencl/ (accessed 2013/02/14).
[27] Wikipedia DirectCompute - Wikipedia, the free encyclopedia.
http://en.wikipedia.org/wiki/DirectCompute (accessed 2013/02/14).
[28] Microsoft MSDN C++ AMP Overview.
http://msdn.microsoft.com/en-us/library/vstudio/hh265136.aspx (accessed 2013/02/14).
[29] Yang, C.-T.; Huang, C.-L.; Lin, C.-F., Hybrid CUDA, OpenMP, and MPI parallel
programming on multicore GPU clusters. Computer Physics Communications
2011, 182 (1), 266-269.
[30] MPICH MPICH | High-performance and Portable MPI. http://www.mpich.org/
(accessed 2013/02/17).
[31] Ubuntu Ubuntu 10.04.4 LTS (Lucid Lynx). http://releases.ubuntu.com/lucid/
(accessed 2013/02/18).
[32] Wikipedia Network File System.
http://en.wikipedia.org/w/index.php?title=Network_File_System&oldid=536888184
(accessed 2013/02/13).
[33] Charras, C.; Lecroq, T., Handbook of Exact String Matching Algorithms. King's
College Publications: 2004.
[34] Smart Knuth Morris-Pratt | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=KMP&code=kmp
(accessed 2013/03/03).
[35] Tay, R. Demonstration of Exact String Matching Algorithms using CUDA 2011.
https://exactstrmatchgpu.googlecode.com/files/Raymond%20Tay%20Abstract%20Submission%202.pdf.
[36] Smart Horspool | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=HOR&code=hor
(accessed 2013/03/03).
[37] Smart Karp Rabin | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=KR&code=kr
(accessed 2013/03/03).
[38] Smart Quick Search | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=QS&code=qs
(accessed 2013/03/03).
[39] Baeza-Yates, R.; Gonnet, G. H., A new approach to text searching. Commun.
ACM 1992, 35 (10), 74-82.
[40] Smart Shift Or | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=SO&code=so
(accessed 2013/03/03).
[41] Smart Shift And | String Matching Algorithms Research Tool.
http://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=SA&code=sa
(accessed 2013/03/03).
[42] The Canterbury Corpus The Canterbury Corpus. http://corpus.canterbury.ac.nz/.
[43] Project Gutenberg PROJECT GUTENBERG - Free Books On-Line.
http://www.promo.net/pg/ (accessed 2013/02/18).
[44] Wikipedia Speedup.
http://en.wikipedia.org/w/index.php?title=Speedup&oldid=541180041 (accessed
2013/02/28).
[45] Amdahl, G. M., Validity of the single processor approach to achieving large scale
computing capabilities. In Proceedings of the April 18-20, 1967, spring joint
computer conference, ACM: Atlantic City, New Jersey, 1967; pp 483-485.
