
For example, a warp-wide reduction can be computed using five rounds of __shfl_xor with lane
masks of 1, 2, 4, 8, and 16, respectively. In each round, each thread computes the lane ID of its
partner by XORing its own lane ID with that round's mask, so that the exchanges form a butterfly
network. In the end, all threads hold the final reduction result. Similarly, an inclusive scan can
be computed using five rounds of __shfl_up/down instructions, where in each round values are
added across a specific lane offset (up or down). More details about these implementations can be
found in the CUDA programming guide [78, Ch. B14].
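The two warp-wide patterns described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: it assumes CUDA 9 or later and therefore uses the __shfl_xor_sync and __shfl_up_sync variants instead of the legacy __shfl_xor and __shfl_up, assumes a launch with exactly one full warp of 32 threads, and omits error checking. The names warp_allreduce_sum, warp_inclusive_scan, warp_demo, and run_warp_demo are hypothetical.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned kFullMask = 0xffffffffu;  // all 32 lanes participate

// Butterfly all-reduce: after five rounds every lane holds the warp-wide sum.
__device__ int warp_allreduce_sum(int v) {
  for (int mask = 1; mask < 32; mask <<= 1) {   // masks 1, 2, 4, 8, 16
    v += __shfl_xor_sync(kFullMask, v, mask);   // partner lane = lane ^ mask
  }
  return v;
}

// Inclusive scan: in each round, add the value held `offset` lanes below.
__device__ int warp_inclusive_scan(int v) {
  const int lane = threadIdx.x & 31;
  for (int offset = 1; offset < 32; offset <<= 1) {  // offsets 1, 2, 4, 8, 16
    const int below = __shfl_up_sync(kFullMask, v, offset);
    if (lane >= offset) v += below;  // lanes with no source lane keep their value
  }
  return v;
}

// Assumes a launch with a single warp (<<<1, 32>>>).
__global__ void warp_demo(const int* in, int* sum, int* scan) {
  const int lane = threadIdx.x;
  sum[lane]  = warp_allreduce_sum(in[lane]);
  scan[lane] = warp_inclusive_scan(in[lane]);
}

// Hypothetical host helper: returns 32 per-lane sums followed by 32 scan results.
// Error checking is omitted for brevity.
std::vector<int> run_warp_demo(const std::vector<int>& in) {
  assert(in.size() == 32);
  int *d_in, *d_out;
  cudaMalloc(&d_in, 32 * sizeof(int));
  cudaMalloc(&d_out, 64 * sizeof(int));
  cudaMemcpy(d_in, in.data(), 32 * sizeof(int), cudaMemcpyHostToDevice);
  warp_demo<<<1, 32>>>(d_in, d_out, d_out + 32);
  std::vector<int> out(64);
  cudaMemcpy(out.data(), d_out, 64 * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return out;
}
```

Calling run_warp_demo with the inputs 1 through 32 should leave the same total in all 32 sum slots and the running prefix sums in the scan slots. The XOR pattern is chosen for the reduction because every lane ends up with the result without a separate broadcast step.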

1.2.4 CUDA Built-in intrinsics


CUDA provides several built-in integer intrinsics that are particularly useful for bitwise
manipulation. We use these intrinsics extensively throughout our implementations in
Chapters 2–5, especially to process the results of ballot operations. Here we name a few
examples; more detailed descriptions and other intrinsics can be found in the CUDA Math
API manual [80]:

Reverse bits: __brev() takes an unsigned integer and reverses its bits.

Number of high-order zero bits: __clz() takes an integer and returns the number of
high-order zero bits before the first set bit. For a 32-bit input, the output is a value
between 0 and 32. This instruction can also be used to find the most significant set bit of x:
32 - __clz(x) gives its one-based position (equivalently, 31 - __clz(x) is its zero-based index).

Finding the first set bit: __ffs() can be used to find the least significant set bit in an integer
variable. It returns the one-based position of that bit, or 0 if no bit is set.

Population count: __popc() returns the total number of set bits in its input argument.

All these operations are provided in 32-bit and 64-bit versions. The 64-bit versions have an
ll (i.e., long long) suffix added at the end (e.g., __brevll()).
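As a brief illustration of the intrinsics listed above, the sketch below first applies the four 32-bit versions to a single word and then uses them to decode the result of a warp ballot. It is a sketch under stated assumptions, not the code used in later chapters: it assumes CUDA 9 or later (hence __ballot_sync), a launch with exactly one full warp for the ballot kernel, and no error checking. The kernel and helper names (bit_intrinsics, ballot_demo, run_bit_intrinsics, run_ballot_demo) are hypothetical.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Applies the four 32-bit intrinsics to one word; a single thread is enough.
__global__ void bit_intrinsics(unsigned x, unsigned* out) {
  out[0] = __brev(x);                                  // bit-reversed word
  out[1] = static_cast<unsigned>(__clz(static_cast<int>(x)));  // high-order zeros, 0..32
  out[2] = static_cast<unsigned>(__ffs(static_cast<int>(x)));  // one-based lowest set bit, 0 if none
  out[3] = static_cast<unsigned>(__popc(x));           // number of set bits
}

// Each lane votes with a predicate; the 32-bit ballot is then decoded.
// Assumes a launch with a single warp (<<<1, 32>>>).
__global__ void ballot_demo(const int* flags, int* rank, int* summary) {
  const unsigned lane = threadIdx.x;
  const unsigned ballot = __ballot_sync(0xffffffffu, flags[lane] != 0);
  // Number of voting lanes below this one (an exclusive rank among voters).
  rank[lane] = __popc(ballot & ((1u << lane) - 1u));
  if (lane == 0) {
    summary[0] = __popc(ballot);                       // total number of votes
    summary[1] = __ffs(static_cast<int>(ballot)) - 1;  // lowest voting lane, -1 if none
    summary[2] = 31 - __clz(static_cast<int>(ballot)); // highest voting lane, -1 if none
  }
}

// Hypothetical host helper: returns {brev, clz, ffs, popc} for x.
std::vector<unsigned> run_bit_intrinsics(unsigned x) {
  unsigned* d_out;
  cudaMalloc(&d_out, 4 * sizeof(unsigned));
  bit_intrinsics<<<1, 1>>>(x, d_out);
  std::vector<unsigned> out(4);
  cudaMemcpy(out.data(), d_out, 4 * sizeof(unsigned), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  return out;
}

// Hypothetical host helper: returns 32 per-lane ranks followed by 3 summary values.
std::vector<int> run_ballot_demo(const std::vector<int>& flags) {
  assert(flags.size() == 32);
  int *d_flags, *d_out;
  cudaMalloc(&d_flags, 32 * sizeof(int));
  cudaMalloc(&d_out, 35 * sizeof(int));
  cudaMemcpy(d_flags, flags.data(), 32 * sizeof(int), cudaMemcpyHostToDevice);
  ballot_demo<<<1, 32>>>(d_flags, d_out, d_out + 32);
  std::vector<int> out(35);
  cudaMemcpy(out.data(), d_out, 35 * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_flags);
  cudaFree(d_out);
  return out;
}
```

For instance, with votes from lanes 3, 5, and 10 only, the per-lane rank counts how many voters sit in lower lanes, which is the quantity typically needed to compute an output offset when compacting voted elements.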
