
Approach: per-thread assignment and processing

Advantages:
• good for uniform workloads
• straightforward implementation and design
• sequential local processing, no communications
• vectorized memory access

Disadvantages:
• thread-level privatization: potentially more shared memory usage

Examples:
1. CUB's reduce, scan, histogram, and radix sort [66]
2. CUDPP's cuckoo hashing [3]
3. String matching (e.g., DRK, CRK, and HRK [5])
4. Batched dynamic data structures (e.g., GPU LSM [9])

Approach: per-warp assignment and processing

Advantages:
• good for uniform workloads
• fully coalesced memory accesses
• warp-level privatization: less shared memory usage than per-thread methods

Disadvantages:
• parallel processing requires warp-wide communications
• requires all threads within a warp to be active (no intra-warp branches)

Examples:
1. P-ary searches in B-trees and sorted lists [47]
2. CUDA-based Reyes renderer [91]
3. Ashkiani et al.'s multisplit, histogram, and radix sort [6, 7]

Approach: per-thread assignment, per-warp processing

Advantages:
• often coalesced memory accesses
• good for irregular workloads
• higher warp efficiency
• reduced branch divergence

Disadvantages:
• parallel processing requires warp-wide communications

Examples:
1. The verification stage in the DRK-2S string matching method [5]
2. Fully concurrent dynamic data structures (e.g., the slab list and the slab hash [8])
3. SlabAlloc: a warp-cooperative dynamic memory allocator [8]
Table 1.1: Advantages, disadvantages, and some examples for each assignment-processing pair.
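To make the "warp-wide communications" entries in Table 1.1 concrete, the following is a minimal CUDA sketch; the function names, indexing conventions, and launch assumptions are illustrative choices of ours, not from the source. The 32 threads of a warp combine their partial sums with shuffle intrinsics rather than shared memory.

// Warp-level sum via shuffle intrinsics (CUDA 9+); assumes blockDim.x is a
// multiple of 32 so every warp is fully populated. Illustrative only.
__inline__ __device__ int warpReduceSum(int val) {
  // Each step halves the number of lanes still holding distinct partial
  // sums; after five steps, lane 0 holds the warp-wide total.
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;
}

__global__ void warpSums(const int* in, int* out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int val = (tid < n) ? in[tid] : 0;  // out-of-range lanes contribute identity
  val = warpReduceSum(val);           // warp-wide communication step
  if ((threadIdx.x & 31) == 0 && tid < n)
    out[tid >> 5] = val;              // lane 0 of each warp writes its sum
}

The full 0xffffffff mask also illustrates the listed disadvantage: all 32 lanes must participate, which is why the kernel pads out-of-range lanes with the identity element instead of branching around the reduction.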

starvation due to code divergence (because of irregular workloads). Next, we elaborate on the existing conflated approaches and then propose our novel assignment-processing arrangements. Figure 1.1 shows some high-level schematic examples of the assignment-processing spectrum. Table 1.1 summarizes the high-level aspects of each approach, along with practical examples of algorithms implemented under each one.

Per-thread assignment, per-thread processing: A traditional way of programming GPUs is to treat CUDA threads as generic processors (e.g., as in the parallel random-access machine (PRAM) model [45]). Each thread 1) is assigned a portion of the input data from the GPU's global memory (i.e., per-thread work assignment); 2) locally processes that portion (i.e., per-thread processing); and 3) appropriately stores its result so that it can be combined with others.¹ Figure 1.1a shows this approach on a problem with a regular workload.
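As a minimal sketch of these three steps (the kernel name, signature, and chunk size are illustrative choices of ours, not from the text), each thread below sequentially scans a consecutive chunk of the input and emits one partial result for a later combining pass:

// Per-thread assignment and per-thread processing; ITEMS_PER_THREAD is
// an illustrative tuning parameter, not a value from the source.
constexpr int ITEMS_PER_THREAD = 4;

__global__ void perThreadLocalScan(const int* in, int* out,
                                   int* partials, int n) {
  int tid  = blockIdx.x * blockDim.x + threadIdx.x;
  int base = tid * ITEMS_PER_THREAD;     // 1) per-thread work assignment:
                                         //    a consecutive input chunk
  int running = 0;
  for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
    if (base + i < n) {
      running += in[base + i];           // 2) sequential local processing,
      out[base + i] = running;           //    no inter-thread communication
    }
  }
  if (base < n)
    partials[tid] = running;             // 3) per-thread result, combined
                                         //    with others in a later pass
}

Note that consecutive per-thread chunks preserve locality within a thread but are not coalesced across a warp; this is presumably why Table 1.1 lists vectorized memory access (e.g., int4 loads) as an advantage of this approach.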
This approach can be seen in most available GPU algorithms and implementations. To name a few, scan [68, 88], radix sort [69], merge sort [25], merge [11], histogram [19, 77, 89], and multisplit [41] are all implemented with this strategy. There are some technical advantages
¹In some algorithms the output does not depend on the position of the input data (i.e., any permutation of the input yields the same output), e.g., performing a reduction with a binary associative and commutative operator. But in most algorithms, the output depends on the order of the input data. As a result, each thread reads consecutive data elements to preserve locality while locally processing its portion (e.g., performing a prefix-sum operation). Throughout the rest of this work, we assume this more general case unless otherwise stated.
