
at a time. This scenario is referred to as branch divergence and should be avoided as much as possible in order to reach high warp efficiency.
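As an illustration of this point (a minimal sketch, not taken from the original text; the kernel names and the doubling operation are assumptions), a data-dependent branch can split a warp into two serialized execution paths, while a predicated, branch-free form lets all threads of the warp follow the same path:

```cuda
// Illustrative sketch: a data-dependent branch causes threads of the
// same warp to take different paths, which the hardware serializes
// (branch divergence).
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)              // threads within a warp may disagree here
        out[i] = in[i] * 2.0f;     // path A
    else
        out[i] = 0.0f;             // path B, executed after path A
}

// One common remedy: express the same logic as arithmetic/selection that
// every thread executes identically, which the compiler can predicate.
__global__ void branch_free(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = (in[i] > 0.0f) ? in[i] * 2.0f : 0.0f;
}
```

Note that the guard `if (i >= n) return;` only diverges in the last warp of the grid, which is generally harmless compared to divergence inside the main body of the computation.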

1.2.2 GPU Memory Hierarchy


GPUs possess several memory resources organized into a memory hierarchy, each with a
different capacity, access time, and access scope. All threads have
access to a global DRAM (global memory). For example, the Tesla K40c GPU has 12 GB of
global memory available. This is the largest memory resource available in the device, but it is
generally the slowest one as well, i.e., it requires the most instruction cycles to fetch specific
memory contents. All threads within a thread-block have access to a faster but more limited
shared memory. Shared memory behaves like a manually controlled cache at the programmer’s
disposal, with orders of magnitude faster access time compared to global memory but much
less available capacity (48 KB to be shared among all resident blocks per SM on a Tesla K40c).
Shared memory contents are assigned to thread-blocks by the scheduler, and cannot be accessed
by other thread-blocks (even those on the same SM). All contents are lost and the portion is
re-assigned to a new block once a block’s execution is finished. At the lowest level of the
hierarchy, each thread has access to local registers. While registers have the fastest access time,
they are only accessible by the thread that owns them (a 64 KB register file to be shared among all resident threads per
SM on a Tesla K40c). On GPUs there are also other types of memory such as the L1 cache
(on-chip 16 KB per SM, mostly used for register spills and dynamic indexed arrays), the L2
cache (on-chip 1.5 MB on the whole Tesla K40c device), constant memory (64 KB on a Tesla
K40c), texture memory, and local memory [78].
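To make the "manually controlled cache" role of shared memory concrete (a hedged sketch; the kernel, its names, and the tile size are illustrative assumptions, not from the text): each thread-block stages a tile of global memory into its shared memory once, synchronizes, and then reuses the tile from the fast on-chip storage.

```cuda
#define TILE 256  // assumed block size, for illustration only

// Illustrative sketch: each block copies one tile of global memory into
// shared memory, then reuses it (here, reversing the tile within the
// block) without further global-memory traffic. For brevity, n is
// assumed to be a multiple of the block size.
__global__ void stage_and_reverse(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];          // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];        // one global read per thread
    __syncthreads();                      // wait until the tile is loaded

    int j = blockDim.x - 1 - threadIdx.x; // reversed index within the tile
    if (i < n)
        out[i] = tile[j];                 // served from shared memory
}
```

The `__syncthreads()` barrier is essential: without it, a thread could read a tile element that another thread of the block has not yet written.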
On GPUs, memory accesses are performed in transactions of 128 bytes. It is therefore preferable
to write the program in such a way that the threads within a warp access consecutive memory
elements, i.e., 32 consecutive 4-byte units (such as floats or integers); this pattern is known as
coalesced memory access. Any other access pattern wastes part of the available memory
bandwidth of the device.
