
Revision Class

1D Grid of 1D Thread Blocks

dim3 DIMGRID(3, 1, 1);
dim3 DIMBLOCK(4, 1, 1);
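With this configuration, each thread derives one global index from its block and thread coordinates. A minimal sketch (the kernel and array names are illustrative, not from the slides):

__global__ void scale1D(float *data, int n)
{
    // 1D grid of 1D blocks: one global index per thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 11 for 3 blocks of 4 threads
    if (i < n)
        data[i] = 2.0f * data[i];   // example per-element work
}
// Launch: scale1D<<<DIMGRID, DIMBLOCK>>>(d_data, n);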

1D Grid of 2D Thread Blocks

dim3 DIMGRID(3, 1, 1);
dim3 DIMBLOCK(4, 4, 1);
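Here each block is 4x4 (16 threads), while the grid is still one-dimensional. A thread can first linearize its position within the block and then compute a global index; a sketch with illustrative names:

__global__ void fill1Dgrid2Dblocks(float *data)
{
    // Linear index of this thread within its 4x4 block (0 .. 15).
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    // Global index: 16 threads per block, blocks laid out along x only.
    int i = blockIdx.x * blockDim.x * blockDim.y + tid;   // 0 .. 47
    data[i] = 0.0f;
}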
2D Grid of 2D Thread Blocks

dim3 DIMGRID(3, 3, 1);
dim3 DIMBLOCK(4, 4, 1);
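With a 2D grid of 2D blocks, threads usually map directly onto (row, column) positions of a matrix. A sketch (the matrix M and its width are illustrative):

__global__ void init2D(float *M, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 11
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. 11
    // DIMGRID(3, 3, 1) of DIMBLOCK(4, 4, 1) covers a 12x12 element range.
    M[row * width + col] = 0.0f;
}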
Tile-Based Algorithms Using Shared Memory
• The performance of a kernel is related to the number of data accesses
from the global memory.
• In other words, the performance depends on the Compute to Global
Memory Access (CGMA) ratio.
• It is the number of floating-point operations (FLOPs) performed for
each access to global memory (a worked example follows this list).
• We know that access to shared memory is much faster than access to
global memory. However, the size of the shared memory allocated to each
block is quite small.
• Therefore, the data is partitioned into subsets or tiles so that it can be
held in the shared memory.
• Each tile is handled by a different thread block.
• Threads of a block collaboratively load a tile into the shared memory
and then each thread performs the computations using the data
elements from shared memory.
• Here, a data element of a tile is loaded into shared memory only once
by one thread of the block, but it can then be accessed from the
shared memory by all threads of that block.
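As a worked example of the CGMA ratio (a standard illustration, not reproduced from the slides): the inner loop of a naive matrix multiplication kernel reads one element of A and one element of B from global memory and performs one multiplication and one addition, i.e. 2 FLOPs for 2 global accesses, so CGMA = 1.0. If each element is instead loaded into shared memory once and reused by the T threads of a TxT block, the ratio rises toward T.

for (int k = 0; k < width; ++k)
    sum += A[row * width + k] * B[k * width + col];   // 2 global reads, 2 FLOPs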
Tiled Matrix Multiplication
• We used a 2x2 grid of 2x2 blocks to multiply two 4x4 matrices.
• The threads (0,0,0), (1,0,0), (0,1,0), and (1,1,0) produce elements C0,0,
C1,0, C0,1, and C1,1 of matrix C.

From the multiplication activities listed above, it can be observed that
each thread accesses 4 elements of matrix A and 4 elements of matrix B.
The 4 threads of block (0,0,0) thus perform a total of 16 global memory
accesses into A and 16 accesses into B.
From the same activities, we can also observe the overlap in the accesses
to the elements of A and B by different threads of block (0,0,0).
• It can be seen that each element of rows 0 and 1 of A, and of columns 0
and 1 of B, is accessed exactly twice by different threads of block
(0,0,0).
• For example, element A0,0 is accessed by both threads (0,0,0) and (1,0,0).
Instead of each thread loading the element A0,0 separately, the threads can
collaborate so that this element is loaded from the global memory into
shared memory only once.
• This data in the shared memory can then be used by all threads of the
block. In the same way, if all four threads of block (0,0,0) collaborate,
each element of rows 0 and 1 of A, and of columns 0 and 1 of B, is loaded
only once from the global memory.
• Thus, the 4 threads of block (0,0,0) perform a total of 8 global memory
accesses into A and 8 into B. The total number of accesses to each matrix
is therefore reduced by half (compared to the 16 accesses each into A
and B without thread collaboration).
• Through collaboration among the threads of a block and by using the
shared memory, the number of accesses to global memory can be
considerably reduced.
• To understand the tiled matrix multiplication kernel, we consider
4x4 matrices with 2x2 tiles. Each thread of a block computes one
element of the product matrix C. Therefore, one block produces one tile
of matrix C. Hence, we need a 2x2 grid of 2x2 blocks.
• A block loads two tiles, one each from A and B. Each thread of a block
loads one element of a given tile. Therefore, the block size must be
equal to the tile size. Choosing larger blocks requires more space in
the shared memory. Hence, care must be taken while choosing the
size of the thread block.
• For example, the tiles of matrix C are C0,0, C1,0, C0,1, and C1,1. The
tiles of C are computed as follows:

• The block (0,0,0) computes the tile C0,0.
• Initially, all elements of tile C0,0 are set to 0.
• Two matrices S_A and S_B are declared in shared memory to hold the tiles
loaded from global memory.
• The tile C0,0 is computed in two steps.
Step 1:
Loading operation: the threads of block (0,0,0) copy the tiles A0,0 and
B0,0 from global memory into the shared-memory arrays S_A and S_B.
The block can then perform A0,0 x B0,0 by accessing the elements from the
shared memory.
After the first step, the tile C0,0 is computed partially.
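In code, this first load is simple for block (0,0,0); a minimal sketch, assuming tx = threadIdx.x, ty = threadIdx.y, width = 4, and S_A, S_B as the shared-memory tiles declared above:

S_A[ty][tx] = A[ty * width + tx];   // each thread copies one element of tile A0,0
S_B[ty][tx] = B[ty * width + tx];   // and one element of tile B0,0
__syncthreads();                    // the whole tile must be loaded before use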


Step 2:
Loading operation: the threads of block (0,0,0) copy the tiles A1,0 and
B0,1 from global memory into S_A and S_B, overwriting the tiles of step 1.
The block can then perform A1,0 x B0,1 by accessing the elements from the
shared memory.
After the second step, the computation of the tile C0,0 is completed.
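Putting the two steps together and generalizing to any matrix width that is a multiple of the tile size gives the usual tiled kernel pattern. The sketch below follows that standard pattern; it is not code reproduced from the slides, and the names are illustrative:

#define TILE 2   // tile width; must equal the block dimensions

__global__ void tiledMatMul(const float *A, const float *B, float *C, int width)
{
    __shared__ float S_A[TILE][TILE];   // tile of A for the current step
    __shared__ float S_B[TILE][TILE];   // tile of B for the current step

    int tx = threadIdx.x, ty = threadIdx.y;
    int col = blockIdx.x * TILE + tx;   // column of C computed by this thread
    int row = blockIdx.y * TILE + ty;   // row of C computed by this thread

    float sum = 0.0f;                   // element of C, initially 0
    for (int step = 0; step < width / TILE; ++step)
    {
        // Collaborative load: each thread copies one element of each tile.
        S_A[ty][tx] = A[row * width + step * TILE + tx];
        S_B[ty][tx] = B[(step * TILE + ty) * width + col];
        __syncthreads();                // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)  // partial product from shared memory
            sum += S_A[ty][k] * S_B[k][tx];
        __syncthreads();                // finish with the tiles before overwriting
    }
    C[row * width + col] = sum;
}

// Launch for 4x4 matrices with 2x2 tiles:
// dim3 DIMGRID(2, 2, 1); dim3 DIMBLOCK(2, 2, 1);
// tiledMatMul<<<DIMGRID, DIMBLOCK>>>(d_A, d_B, d_C, 4);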
Memory Coalescing
• We know that, at any point during execution, all threads within a warp
execute the same instruction.
• Suppose that the threads within a warp execute a memory access
instruction and that these threads access consecutive locations in the
global memory.
• In such a case, the hardware coalesces, or combines, all these accesses
into a single access.
• That is, instead of accessing each location separately, multiple
consecutive locations are accessed in a single consolidated memory
transaction.
• Memory coalescing improves the performance of CUDA applications.
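As an illustration (a sketch; the kernel and array names are not from the slides), the access in the kernel below is coalesced because consecutive threads of a warp read consecutive addresses; the commented-out strided access is not:

__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Coalesced: thread i reads in[i], so a warp touches 32 consecutive
    // floats, which the hardware combines into a few wide transactions.
    if (i < n)
        out[i] = in[i];
    // Not coalesced (illustrative): a large stride spreads the warp's
    // accesses across distant locations, forcing separate transactions.
    // out[i] = in[(i * 32) % n];
}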
