In the first step, the block computes A00 x B00 by reading the elements from shared memory.
In the second step, the block computes A10 x B10 in the same way and adds the result to the first product.
After the second step, the computation of tile C00 is complete.
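The two steps above can be sketched as a small host-side model. This is not CUDA code and not the slides' kernel: it is a plain Python illustration in which local lists stand in for the shared-memory tiles. The names `tiled_matmul`, `naive_matmul`, and `TILE`, the 4 x 4 input matrices, and the tile width of 2 are all assumptions made for the example. Tile indices in the code are (row, column), which may differ from the labeling used in the figure.

```python
TILE = 2  # tile width, assumed for illustration


def tiled_matmul(A, B, tile=TILE):
    """Multiply square matrices tile by tile, mimicking the phases a
    thread block goes through when it stages tiles in shared memory."""
    n = len(A)
    assert n % tile == 0, "matrix width must be a multiple of the tile width"
    C = [[0] * n for _ in range(n)]
    for by in range(0, n, tile):        # tile row of C (one "block" per C tile)
        for bx in range(0, n, tile):    # tile column of C
            for phase in range(n // tile):
                k0 = phase * tile
                # "Load" one tile of A and one tile of B; these local lists
                # stand in for the shared-memory copies.
                a_tile = [[A[by + i][k0 + k] for k in range(tile)]
                          for i in range(tile)]
                b_tile = [[B[k0 + k][bx + j] for j in range(tile)]
                          for k in range(tile)]
                # Each "thread" (i, j) adds this phase's partial dot product.
                for i in range(tile):
                    for j in range(tile):
                        for k in range(tile):
                            C[by + i][bx + j] += a_tile[i][k] * b_tile[k][j]
    return C


def naive_matmul(A, B):
    """Reference result computed without tiling."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
C = tiled_matmul(A, B)
```

With a 4 x 4 matrix and 2 x 2 tiles, the `phase` loop runs exactly twice per output tile, matching the two steps described above. The top-left 2 x 2 tile of `C` is complete only after the second phase has been accumulated.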
Memory Coalescing
• We know that at any point in time during the execution all threads within
a warp execute the same instruction.
• Suppose that the threads within a warp execute a memory access
instruction and that these threads access consecutive locations in global
memory.
• In such a case, the hardware coalesces, or combines, all these accesses into
a single access.
• That is, instead of accessing each location separately, multiple
consecutive locations are accessed in a single consolidated memory
transaction.
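• Memory coalescing improves the performance of CUDA applications because
fewer memory transactions are needed to serve the same number of threads,
so global memory bandwidth is used more efficiently.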
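The effect described in the bullets above can be illustrated with a simplified model that counts how many aligned memory segments one warp-wide access touches. This is a rough sketch, not a description of any particular GPU: the 128-byte segment size and the 4-byte element size are assumptions, and the helper names `transactions` and `warp_addresses` are hypothetical. Real transaction granularity depends on the architecture and the cache configuration.

```python
WARP_SIZE = 32       # threads per warp on NVIDIA GPUs
ELEM_BYTES = 4       # assumed element size (e.g. a 32-bit float)
SEGMENT_BYTES = 128  # assumed transaction size; varies by architecture


def transactions(addresses, segment_bytes=SEGMENT_BYTES):
    """Count the distinct aligned segments touched by one warp-wide access."""
    return len({addr // segment_bytes for addr in addresses})


def warp_addresses(base, stride_elems):
    """Byte address accessed by each thread: thread t reads element t * stride."""
    return [base + t * stride_elems * ELEM_BYTES for t in range(WARP_SIZE)]


# Consecutive elements: all 32 accesses fall into one segment.
coalesced = transactions(warp_addresses(0, 1))

# Stride of 32 elements (e.g. walking down a column of a row-major
# matrix that is 32 elements wide): every thread hits its own segment.
strided = transactions(warp_addresses(0, 32))
```

Under these assumptions the consecutive pattern needs a single transaction while the strided pattern needs one per thread, which is why access patterns that keep a warp's addresses adjacent are preferred.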