This scenario is referred to as branch divergence and should be avoided as much as
possible in order to reach high warp efficiency.
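As an illustration, consider the following minimal CUDA sketch (the kernel names and the choice of branch condition are hypothetical, for illustration only). In the first kernel the branch condition differs between lanes of the same warp, so both paths are serialized; in the second, the condition is uniform within each 32-thread warp, so no warp diverges:

```cuda
// Hypothetical kernels illustrating branch divergence.
__global__ void divergentKernel(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Odd and even lanes of the same warp take different paths,
    // so each warp executes both branches one after the other.
    if (tid % 2 == 0)
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;
}

__global__ void uniformKernel(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // The condition is constant within each 32-thread warp
    // (even-numbered warps vs. odd-numbered warps), so every
    // warp follows a single path and no divergence occurs.
    if ((tid / 32) % 2 == 0)
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;
}
```

Both kernels produce the same kind of output, but only the second keeps every warp on a single execution path and thus at full warp efficiency.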
1.2.2 GPU Memory Hierarchy
GPUs possess several memory resources organized into a memory hierarchy, each with a different capacity, access time, and access scope. All threads have access to a global DRAM (global memory); for example, the Tesla K40c GPU has 12 GB of global memory available. This is the largest memory resource on the device, but it is generally the slowest one as well, i.e., it requires the most instruction cycles to fetch specific memory contents.

All threads within a thread-block have access to a faster but more limited shared memory. Shared memory behaves like a manually controlled cache at the programmer's disposal, with orders-of-magnitude faster access time than global memory but much less available capacity (48 KB shared among all resident blocks per SM on a Tesla K40c). Shared memory contents are assigned to thread-blocks by the scheduler and cannot be accessed by other thread-blocks (even those on the same SM). Once a block's execution finishes, all of its contents are lost and the portion is re-assigned to a new block.

At the lowest level of the hierarchy, each thread has access to local registers. While registers have the fastest access time, they are accessible only by their own thread (64 KB of register file shared among all resident blocks per SM on a Tesla K40c). GPUs also provide other types of memory, such as the L1 cache (on-chip, 16 KB per SM, mostly used for register spills and dynamically indexed arrays), the L2 cache (on-chip, 1.5 MB for the whole Tesla K40c device), constant memory (64 KB on a Tesla K40c), texture memory, and local memory [78].

On GPUs, memory accesses are performed in units of 128 bytes. It is therefore preferred that the programmer writes the program in such a way that the threads within a warp access consecutive memory elements (coalesced memory access), i.e., 32 consecutive 4-byte units (such as float, int, etc.).
Any other memory access pattern wastes part of the available memory bandwidth of the device.
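The interplay between coalesced global-memory access and shared memory can be sketched with a short CUDA kernel (a minimal sketch; the kernel name and tile size of 256 are assumptions for illustration). Each thread of a warp reads a consecutive 4-byte element, so every warp's load maps onto 128-byte transactions; the block then reverses its tile in fast on-chip shared memory and writes the result back with equally coalesced stores:

```cuda
// Hypothetical kernel: reverse each block's segment of the input
// using shared memory, with coalesced global reads and writes.
// Assumes a launch configuration with blockDim.x == 256.
__global__ void blockReverse(float *d_out, const float *d_in)
{
    __shared__ float tile[256];      // per-block scratch in shared memory

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = d_in[gid];           // consecutive threads read consecutive
                                     // elements: coalesced global load
    __syncthreads();                 // wait until the whole tile is loaded

    // Reversal happens entirely in shared memory; the write below is
    // again to consecutive addresses, so it is also coalesced.
    d_out[gid] = tile[blockDim.x - 1 - tid];
}
```

With 256 threads per block, each of the block's eight warps issues one 128-byte transaction per 32-thread load, and the non-contiguous reversal pattern is absorbed by the shared-memory tile rather than by strided global accesses.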