
1.2.3 Warp-wide communications
As we described before, in GPUs warps are the actual parallel units that operate in lockstep:
all memory accesses and computations are done in SIMD fashion for all threads within a warp.
CUDA also provides an efficient communication medium for all threads within a warp, without
directly using any shared or global memory. Here we name the two most prominent types of
warp-wide communication and discuss each briefly. For a more detailed description, refer to
the CUDA programming guide [78, Appendix B].
1.2.3.1 Warp-wide Voting:

CUDA defines a series of operations with which all threads of a warp can evaluate a binary
predicate and share the results with one another.

Any: __any(pred) returns true if there is at least one thread whose predicate is true.

All: __all(pred) returns true if the predicate is true for all threads in the warp.

Ballot: __ballot(pred) returns a 32-bit variable in which each bit represents the predicate of the
corresponding thread.
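
To make these concrete, the following kernel is a minimal sketch (not taken from this dissertation; the
kernel name and data layout are illustrative choices of ours) in which each warp counts how many of its
32 input elements are positive. It uses the synchronized variants (__any_sync, __all_sync, __ballot_sync)
that CUDA 9.0 and later require, analogous to the synchronized shuffle instructions discussed in the
footnote at the end of this section.

// Build with, e.g., nvcc -arch=sm_70 vote_example.cu
#include <cstdio>
#include <cuda_runtime.h>

// Each warp counts how many of its 32 elements are positive.
// Assumes blockDim.x is a multiple of 32, so every warp is full.
__global__ void warp_vote_example(const int* data, int* warp_counts, int n) {
  int tid     = blockIdx.x * blockDim.x + threadIdx.x;
  int lane    = threadIdx.x & 31;           // lane index within the warp
  int warp_id = tid >> 5;                   // global warp index

  int pred = (tid < n) && (data[tid] > 0);  // the binary predicate

  // All 32 lanes of the warp take part in these calls (full mask 0xffffffff).
  int any_pos = __any_sync(0xffffffffu, pred);             // some lane true?
  int all_pos = __all_sync(0xffffffffu, pred);             // every lane true?
  unsigned int ballot = __ballot_sync(0xffffffffu, pred);  // one bit per lane

  if (lane == 0) {
    if (!any_pos)      warp_counts[warp_id] = 0;               // no positive lane
    else if (all_pos)  warp_counts[warp_id] = 32;              // all lanes positive
    else               warp_counts[warp_id] = __popc(ballot);  // mixed warp
  }
}

int main() {
  const int n = 64;                          // two full warps
  int h_data[n], h_counts[2];
  for (int i = 0; i < n; ++i) h_data[i] = (i % 3 == 0) ? -1 : 1;

  int *d_data, *d_counts;
  cudaMalloc(&d_data, n * sizeof(int));
  cudaMalloc(&d_counts, 2 * sizeof(int));
  cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

  warp_vote_example<<<1, 64>>>(d_data, d_counts, n);
  cudaMemcpy(h_counts, d_counts, 2 * sizeof(int), cudaMemcpyDeviceToHost);

  printf("positives per warp: %d %d\n", h_counts[0], h_counts[1]);
  cudaFree(d_data);
  cudaFree(d_counts);
  return 0;
}
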
1.2.3.2 Warp-wide Shuffle:

The purpose of shuffle instructions is to let a thread read specific registers of other threads in the
same warp. This is particularly useful for broadcasting a certain value, or for performing parallel
operations such as reduction, scan, or binary search within a single warp. There are four different
types of shuffle instructions: (1) __shfl, (2) __shfl_up, (3) __shfl_down, (4) __shfl_xor.²
(1) is usually used to read the content of a specific register belonging to another thread. This can
be any arbitrary thread in the warp, but registers cannot be addressed with dynamic indexing (i.e.,
register names must be known at compile time); only the source thread may be chosen at run time.
For instance, in Section 3.5.5 a histogram is computed within a warp so that each thread collects the
results for specific buckets. Later, when other threads need these results, they simply use shuffle
instructions to ask the responsible thread for the corresponding bucket counts.
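
As a minimal sketch of this use case (the function name and the bucket-to-lane mapping below are our
own illustrative assumptions, not the scheme of Section 3.5.5), the device function below lets any lane
query the per-bucket counter held in a register of the lane responsible for that bucket:

// Each lane privately holds one counter (`my_count`) in a register.
// The register that is read is fixed at compile time; only the source
// lane index may be computed at run time.
__device__ int read_bucket_count(int my_count, int bucket) {
  int owner_lane = bucket & 31;  // assumed mapping: bucket b is owned by lane b mod 32
  // Every lane returns the `my_count` register of its chosen owner lane.
  return __shfl_sync(0xffffffffu, my_count, owner_lane);
}

All 32 lanes must execute the call, since every lane appears in the member mask.
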
(2)–(4) are usually used when there is a fixed pattern of communication among the threads, such as the warp-wide reduction sketched below.
² Since CUDA 9.0, threads within a warp are no longer guaranteed to be in lockstep, and specific barriers
are required to make sure all threads have reached a certain point in the program. As a result, all shuffle
instructions have been turned into synchronized versions that include an extra synchronization step
(e.g., __shfl_sync) [79].
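
As an illustration of such a fixed pattern, the device function below (a minimal sketch, not code from
this dissertation) performs a warp-wide sum reduction with the synchronized variant __shfl_down_sync;
each step halves the number of lanes that still carry partial sums:

// Warp-wide sum reduction: after five steps, lane 0 holds the sum of the
// `val` registers of all 32 lanes. Assumes a full warp participates.
__device__ int warp_reduce_sum(int val) {
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xffffffffu, val, offset);
  return val;  // the total is only meaningful on lane 0
}

Using __shfl_xor_sync with the same offsets instead yields a butterfly exchange that leaves the final
sum in every lane.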
