25 - PDFsam - Escholarship UC Item 5qd0r4ws
1, 2, 4, 8, and 16 values respectively. In each round, each thread computes its partner's lane
ID by XORing its own lane ID with that round's value, so the exchanges form a butterfly network.
In the end, all threads receive the final reduction result. Similarly, an inclusive scan can be
computed using five rounds of __shfl_up/down instructions. Here, in each round, a thread adds the
value shifted up or down from a lane a fixed distance away. More details about these algorithm
implementations can be found in the CUDA programming guide [78, Ch. B14].
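The two warp-wide patterns described above can be sketched as follows. This is a minimal illustration, not the implementation from the programming guide: it assumes a single full warp of 32 active threads, and it uses the `_sync` shuffle variants (`__shfl_xor_sync`, `__shfl_up_sync`) that CUDA 9 and later require in place of the older names used in the text. The names `kFullMask`, `warp_all_reduce_sum`, `warp_inclusive_scan`, `warp_demo_kernel`, and `run_warp_demo` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Assumption: all 32 lanes of the warp are active and participate.
constexpr unsigned kFullMask = 0xffffffffu;

// Butterfly all-reduce: in each round a lane exchanges with the lane whose
// ID is (own lane ID XOR offset). After five rounds every lane holds the sum.
__device__ int warp_all_reduce_sum(int v) {
    for (int offset = 1; offset < 32; offset <<= 1) {
        v += __shfl_xor_sync(kFullMask, v, offset);
    }
    return v;
}

// Inclusive scan: five rounds of shifting values up by 1, 2, 4, 8, 16 lanes.
__device__ int warp_inclusive_scan(int v) {
    const int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        const int from_below = __shfl_up_sync(kFullMask, v, offset);
        // Lanes with ID < offset have no source lane, so they keep their value.
        if (lane >= offset) v += from_below;
    }
    return v;
}

// One warp: each lane reads one input and writes both results.
__global__ void warp_demo_kernel(const int* in, int* sum_out, int* scan_out) {
    const int lane = threadIdx.x;
    const int v = in[lane];
    sum_out[lane] = warp_all_reduce_sum(v);
    scan_out[lane] = warp_inclusive_scan(v);
}

// Hypothetical host helper: runs the kernel on 32 inputs and copies back
// 32 reduction results and 32 scan results. Returns false on a CUDA error.
bool run_warp_demo(const int* h_in, int* h_sum, int* h_scan) {
    const size_t bytes = 32 * sizeof(int);
    int* d_buf = nullptr;  // single allocation laid out as [in | sum | scan]
    if (cudaMalloc(&d_buf, 3 * bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    warp_demo_kernel<<<1, 32>>>(d_buf, d_buf + 32, d_buf + 64);
    ok = ok && cudaMemcpy(h_sum, d_buf + 32, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(h_scan, d_buf + 64, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_buf);
    return ok;
}
```

With inputs 1 through 32, every entry of the reduction output should equal 528, and entry i of the scan output should equal the sum of the first i + 1 inputs.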
Reverse bits: __brev() takes an unsigned integer and reverses its bits.
Number of high-order zero bits: __clz() takes an integer and returns the number of high-order
zero bits before the first set bit. For example, for a 32-bit input variable the output is a value
between 0 and 32. This instruction can also be used to find the position of the most significant
set bit of x, counted from 1 (i.e., 32 - __clz(x)).
Finding the first set bit: __ffs() can be used to find the least significant set bit in an integer
variable. It returns that bit's position counted from 1, or 0 if no bit is set.
Population count: __popc() returns the total number of set bits in its input argument.
All these operations are provided in 32-bit and 64-bit versions. The 64-bit versions add an
ll (i.e., long long) suffix at the end of the name (e.g., __brevll()).
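As a rough illustration of the bit-manipulation intrinsics listed above, the sketch below evaluates each of them on one 32-bit input, plus one 64-bit variant. The struct `BitOpsResult`, the kernel `bit_ops_kernel`, and the host helper `run_bit_ops` are hypothetical names introduced only for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical container for the results of the intrinsics on one input.
struct BitOpsResult {
    unsigned int reversed;  // __brev(x)
    int leading_zeros;      // __clz(x), between 0 and 32
    int msb_position;       // 32 - __clz(x): 1-based position, 0 when x == 0
    int first_set;          // __ffs(x): 1-based position, 0 when x == 0
    int pop_count;          // __popc(x)
    int pop_count64;        // __popcll(x64), the 64-bit variant
};

__global__ void bit_ops_kernel(unsigned int x, unsigned long long x64,
                               BitOpsResult* out) {
    const int lz = __clz(static_cast<int>(x));
    out->reversed = __brev(x);
    out->leading_zeros = lz;
    out->msb_position = 32 - lz;
    out->first_set = __ffs(static_cast<int>(x));
    out->pop_count = __popc(x);
    out->pop_count64 = __popcll(x64);
}

// Hypothetical host helper: launches a single thread and copies the result
// back to the host. Returns false on a CUDA error.
bool run_bit_ops(unsigned int x, unsigned long long x64, BitOpsResult* h_out) {
    BitOpsResult* d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(BitOpsResult)) != cudaSuccess) return false;
    bit_ops_kernel<<<1, 1>>>(x, x64, d_out);
    const bool ok = cudaMemcpy(h_out, d_out, sizeof(BitOpsResult),
                               cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

For x = 0x28 (binary 101000), the fields should come out as reversed = 0x14000000, leading_zeros = 26, msb_position = 6, first_set = 4, and pop_count = 2. For x64 = 0xFFFFFFFF00000000, pop_count64 should be 32.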