Approach: per-thread assignment and processing

Advantages
• good for uniform workloads
• straightforward implementation and design
• sequential local processing, no communication
• vectorized memory access

Disadvantages
• thread-level privatization: potentially more shared memory usage

Examples
1. CUB’s reduce, scan, histogram, and radix sort [66]
2. CUDPP’s cuckoo hashing [3]
3. String matching (e.g., DRK, CRK, and HRK [5])
4. Batched dynamic data structures (e.g., GPU LSM [9])

Approach: per-warp assignment and processing

Advantages
• good for uniform workloads
• fully coalesced memory accesses
• warp-level privatization: less shared memory usage than per-thread methods

Disadvantages
• parallel processing requires warp-wide communications

Examples
1. P-ary searches in B-trees and sorted lists [47]
2. CUDA-based Reyes renderer [91]
3. Ashkiani et al.’s multisplit, histogram, and radix sort [6, 7]

Approach: per-thread assignment, per-warp processing

Advantages
• often coalesced memory accesses
• good for irregular workloads
• higher warp efficiency
• reduced branch divergence

Disadvantages
• parallel processing requires warp-wide communications
• requires all threads within a warp to be active (no intra-warp branches)

Examples
1. The verification stage in the DRK-2S string matching method [5]
2. Fully concurrent dynamic data structures (e.g., the slab list and the slab hash [8])
3. SlabAlloc: a warp-cooperative dynamic memory allocator [8]

Table 1.1: Advantages, disadvantages and some examples for each pair of assignment-processing.

starvation due to code divergence (because of irregular workloads). Next, we elaborate on the
existing conflated approaches and then propose our novel assignment-processing arrangements.
Figure 1.1 shows some high-level schematic examples of the assignment-processing spectrum.
Table 1.1 summarizes the advantages and disadvantages of each approach, along with some
practical examples of algorithms implemented under each of them.

Per-thread assignment, per-thread processing: A traditional way of programming GPUs is
to treat CUDA threads as generic processors (e.g., as in the parallel random-access machine
(PRAM) model [45]). Each thread 1) is assigned a portion of the input data from the GPU’s
global memory (i.e., per-thread work assignment); 2) locally processes that portion (i.e.,
per-thread processing); and 3) stores its result appropriately so that it can be combined with
the results of other threads.¹ Figure 1.1a shows this approach in a problem with a regular workload.
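The three steps above can be sketched on the host. This is a minimal sequential C++ emulation, not GPU code and not an implementation from any of the cited works: the loop variable `tid` stands in for a CUDA thread index, and the names `per_thread_partial_sums` and `combine` are hypothetical. It assumes a sum reduction as the local operation and an even split of the input by ceiling division.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side emulation: `tid` stands in for a CUDA thread index (assumption).
// Each emulated thread is assigned one contiguous portion of `input`,
// reduces it locally, and stores one partial result.
std::vector<long long> per_thread_partial_sums(const std::vector<int>& input,
                                               std::size_t num_threads) {
    assert(num_threads > 0);
    // Ceiling division: trailing threads may get a shorter or empty portion.
    const std::size_t items_per_thread =
        (input.size() + num_threads - 1) / num_threads;
    std::vector<long long> partial(num_threads, 0);
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        // Step 1: per-thread work assignment.
        const std::size_t begin = std::min(tid * items_per_thread, input.size());
        const std::size_t end = std::min(begin + items_per_thread, input.size());
        // Step 2: per-thread (sequential, local) processing.
        long long local = 0;
        for (std::size_t i = begin; i < end; ++i) local += input[i];
        // Step 3: store the result so it can be combined with the others.
        partial[tid] = local;
    }
    return partial;
}

// Combines the per-thread results into the final answer.
long long combine(const std::vector<long long>& partial) {
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

On a GPU the outer loop would disappear, since every thread runs the loop body concurrently, and the combine step would itself be parallel. The sketch only illustrates how work is assigned and processed.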
This approach appears in most available GPU algorithms and implementations. To name
a few, scan [68, 88], radix sort [69], merge sort [25], merge [11], histogram [19, 77, 89], and
multisplit [41] are all implemented with this strategy. There are some technical advantages
¹ In some algorithms, the output does not depend on the position of the input data (i.e., any permutation
of the input results in the same output), e.g., performing a reduction with a binary operator that is both
associative and commutative. But in most algorithms, the output depends on the order of the input data.
As a result, each thread reads consecutive data elements to preserve locality and to locally process its
portion (e.g., performing a prefix-sum operation). Throughout the rest of this work, we assume this most
general case unless otherwise stated.
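The order-dependent case can be illustrated with a prefix sum. The following is again a sequential host-side C++ emulation under the same assumptions (`tid` stands in for a thread index, and the name `chunked_inclusive_scan` is hypothetical); it is a sketch, not the scan from any cited implementation. Because each emulated thread owns consecutive elements, a local scan plus one offset per thread is enough to recover the global result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of an inclusive prefix sum in which each emulated
// thread (`tid`, an assumption standing in for a CUDA thread index)
// owns a run of consecutive input elements.
std::vector<long long> chunked_inclusive_scan(const std::vector<int>& input,
                                              std::size_t num_threads) {
    assert(num_threads > 0);
    const std::size_t n = input.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<long long> out(n);
    std::vector<long long> total(num_threads, 0);

    // Phase 1: each thread scans its own consecutive elements locally.
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        const std::size_t begin = std::min(tid * chunk, n);
        const std::size_t end = std::min(begin + chunk, n);
        long long running = 0;
        for (std::size_t i = begin; i < end; ++i) {
            running += input[i];
            out[i] = running;
        }
        total[tid] = running;
    }

    // Phase 2: an exclusive scan over the per-thread totals yields the
    // offset that each thread adds to its local results.
    long long offset = 0;
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        const std::size_t begin = std::min(tid * chunk, n);
        const std::size_t end = std::min(begin + chunk, n);
        for (std::size_t i = begin; i < end; ++i) out[i] += offset;
        offset += total[tid];
    }
    return out;
}
```

If the elements were instead dealt to threads in an interleaved fashion, a single offset per thread would no longer suffice, which is one way to see why consecutive assignment is the natural choice for order-dependent operations.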
