We provide an efficient implementation of a fully concurrent data structure (potentially an irregular workload itself) based on the per-thread assignment and per-warp processing approach.

6. By using the slab list, we implement a dynamic hash table for the GPU, the slab hash
(Chapter 5).

7. We design a novel dynamic memory allocator, the SlabAlloc, to be used in the slab hash
(Chapter 5).

As also summarized in Table 1.1, items 1 and 4 follow traditional per-thread assignment and
per-thread processing. Items 2 and 3 follow per-warp assignment and per-warp processing. Items
5, 6 and 7 use per-thread assignment and per-warp processing (the WCWS strategy).
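To make the last distinction concrete, the following is a minimal sketch of the per-thread assignment and per-warp processing pattern: each thread may hold its own task, but the warp cooperatively processes those tasks one at a time. The function and variable names are illustrative rather than taken from this dissertation, and the sketch uses the CUDA 9+ warp intrinsics (__ballot_sync, __shfl_sync); on older toolkits the non-_sync variants would play the same role.

    // Illustrative sketch: per-thread assignment, per-warp processing.
    __device__ void warp_cooperative_process(bool has_task, unsigned my_task) {
        const unsigned FULL_MASK = 0xFFFFFFFFu;
        // Per-thread assignment: each lane reports whether it holds a task.
        unsigned work_queue = __ballot_sync(FULL_MASK, has_task);
        // Per-warp processing: the whole warp handles one lane's task at a time.
        while (work_queue != 0) {
            int src_lane = __ffs(work_queue) - 1;                      // lowest lane with work
            unsigned task = __shfl_sync(FULL_MASK, my_task, src_lane); // broadcast its task
            // ... all 32 threads cooperate on `task` here ...
            work_queue &= work_queue - 1;                              // mark that task done
        }
    }

Because every lane iterates until the shared work queue is empty, the warp stays converged while it drains the per-thread tasks, which is what allows the warp to act as the basic unit of work.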

1.2 Preliminaries: the Graphics Processing Unit


The Graphics Processing Unit (GPU) is a throughput-oriented programmable processor: it is designed to maximize overall throughput, even at the cost of increased latency for individual sequential operations.
Throughout this section and this work, we focus on NVIDIA GPUs and CUDA as our parallel
computing framework. More details can be found in the CUDA programming guide [78].
Lindholm et al. [63] and Nickolls et al. [76] also provide more details on GPU hardware and the
GPU programming model, respectively.

1.2.1 CUDA Terminology


In CUDA, the CPU is referred to as the host and all available GPUs are devices. GPU programs
(“kernels”) are launched from the host over a grid of numerous blocks (or thread-blocks); the
GPU hardware maps blocks onto its available parallel cores, the streaming multiprocessors (SMs). The programmer has no control over the scheduling of blocks to SMs, so programs must not make any assumptions about the order in which blocks execute. Each block typically consists of dozens to thousands of individual threads,
which are arranged into 32-wide warps. CUDA v8.0 and all previous versions assume that all threads within a warp execute instructions in lockstep (i.e., physically in parallel). Hence, if threads within a warp fetch different instructions to execute (e.g., because of branching statements), those instructions are serialized: threads that share the same instruction execute it together in a SIMD fashion, one branch at a time, while the remaining threads of the warp are masked off (branch divergence).
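As an illustration of these terms, the following hypothetical kernel (names and launch sizes are chosen for this example only) is launched from the host over a grid of 256-thread blocks and contains a branch that diverges within each warp, so the two sides of the branch are serialized as described above.

    #include <cuda_runtime.h>

    // Even lanes take one branch, odd lanes the other, so the two branches
    // are serialized within each 32-thread warp (branch divergence).
    __global__ void divergent_kernel(int *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (tid >= n) return;
        int lane = threadIdx.x % 32;                      // lane index within the warp
        if (lane % 2 == 0)
            out[tid] = 2 * tid;        // executed while odd lanes are masked off
        else
            out[tid] = 2 * tid + 1;    // executed while even lanes are masked off
    }

    int main() {
        const int n = 1 << 20;
        int *d_out = nullptr;
        cudaMalloc(&d_out, n * sizeof(int));
        int threads_per_block = 256;                      // 8 warps per block
        int num_blocks = (n + threads_per_block - 1) / threads_per_block;
        // The host launches the kernel over a grid of blocks; the hardware
        // schedules those blocks onto the available SMs in no guaranteed order.
        divergent_kernel<<<num_blocks, threads_per_block>>>(d_out, n);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }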
