
ABSTRACT

Parallel Algorithms and Dynamic Data Structures on the Graphics Processing Unit: a
warp-centric approach

Graphics Processing Units (GPUs) are massively parallel processors with thousands of active
threads, originally designed for throughput-oriented tasks. To extract as much performance as
possible given the hardware characteristics of GPUs, it is essential that programmers not only
design efficient algorithms with good asymptotic complexity, but also take the hardware's
limitations and preferences into account. In this work, we focus our design on two high-level
abstractions: work assignment and processing. The former denotes the task the programmer
assigns to each thread or group of threads. The latter encapsulates the actual execution of
those assigned tasks.
Previous work typically couples work assignment and processing at the same granularity. The
most traditional way is per-thread work assignment followed by per-thread processing of that
assigned work: each thread sequentially processes a part of the input, and the results are then
combined appropriately. In this work, we use this approach to implement various algorithms
for the string matching problem (finding all instances of a pattern within a larger text); a
minimal sketch of the idea appears below.
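
As a concrete illustration (not the dissertation's actual implementation), the following CUDA
kernel assigns one candidate starting position to each thread, which then sequentially checks
for a full pattern match; the kernel name, parameter names, and launch shape are all hypothetical.

    #include <cuda_runtime.h>

    // Hypothetical per-thread string matching: thread "pos" tests whether
    // the pattern occurs at position "pos" of the text (brute force).
    __global__ void match_per_thread(const char* text, int text_len,
                                     const char* pattern, int pat_len,
                                     int* match_flags) {
        int pos = blockIdx.x * blockDim.x + threadIdx.x;
        if (pos > text_len - pat_len) return;  // out-of-range candidates exit

        // Per-thread processing: sequentially compare against the pattern.
        int i = 0;
        while (i < pat_len && text[pos + i] == pattern[i]) ++i;
        match_flags[pos] = (i == pat_len);  // 1 iff the pattern starts at pos
    }

Such a kernel would be launched with one thread per candidate position, for example
match_per_thread<<<(text_len + 255) / 256, 256>>>(...), with match_flags holding at least
text_len - pat_len + 1 entries.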
Another effective but less popular idea is per-warp work assignment followed by per-warp
processing of that work. It usually requires efficient intra-warp communication to process
input data that is now distributed among all threads within the warp. With the emergence of
warp-wide voting and shuffle instructions, this approach has gained more potential for solving
particular problems efficiently, with some benefits over per-thread assignment and processing.
In this work, we use this approach to implement a series of parallel algorithms: histogram,
multisplit, and radix sort.
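
As a sketch of this warp-centric style (again, illustrative rather than the dissertation's
actual kernels), the histogram below uses the warp-wide vote instruction __ballot_sync so that
each bin costs at most one atomicAdd per warp instead of one per thread; NUM_BINS and all
names are assumptions made for the example.

    #include <cuda_runtime.h>

    constexpr int NUM_BINS = 16;  // illustrative small bin count
    constexpr unsigned FULL_MASK = 0xffffffffu;
    static_assert(NUM_BINS <= 32, "one representative lane per bin");

    // Hypothetical warp-centric histogram: all 32 lanes of a warp vote on
    // each bin, and a single representative lane publishes the warp's count.
    __global__ void warp_histogram(const unsigned* keys, int n,
                                   unsigned* bins) {
        int gid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 31;

        // Out-of-range lanes use a sentinel bin so they never vote "yes";
        // no lane returns early, so all lanes reach every __ballot_sync.
        unsigned my_bin = (gid < n) ? keys[gid] % NUM_BINS : NUM_BINS;

        for (int b = 0; b < NUM_BINS; ++b) {
            unsigned vote = __ballot_sync(FULL_MASK, my_bin == (unsigned)b);
            if (vote && lane == b)  // lane b speaks for bin b
                atomicAdd(&bins[b], __popc(vote));
        }
    }

The key property is that intra-warp communication replaces most per-thread atomics; the same
grouping could also be done with newer instructions such as __match_any_sync, but a ballot
loop works on any architecture that supports warp voting.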
An advantage of using similar granularities for work assignment and processing arises in
problems with uniform per-thread or per-warp workloads, where it is quite easy to adapt
warp-synchronous ideas and achieve high performance. With non-uniform, irregular workloads,
however, different threads might finish their processing at different times, which can cause
sub-par performance. This is mainly because the whole warp remains resident on the device
until all of its threads have finished their assigned work.
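
The following fragment illustrates the problem under the same hypothetical naming: with a
per-item trip count that varies across threads, lanes that finish early sit idle, and the warp
cannot retire until its slowest lane completes.

    #include <cuda_runtime.h>

    // Illustrative only: per-thread assignment with irregular workloads.
    // work_len is a hypothetical array of per-item trip counts.
    __global__ void irregular_per_thread(const int* work_len, int n,
                                         int* out) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        int acc = 0;
        // The loop bound differs per lane; the warp stays resident until
        // the longest-running lane terminates, idling the others.
        for (int i = 0; i < work_len[tid]; ++i)
            acc += i;
        out[tid] = acc;
    }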
