Lecture 2.4

Memory Model and Locality
- Tiled Parallel Algorithms

Objective
• To understand the motivation and ideas for tiled parallel algorithms
• Memory bandwidth limits parallel kernel performance
• Tiled algorithms and barrier synchronization

Review: Basic Matrix Multiplication Kernel

    __global__ void MatrixMulKernel(int m, int n, int k, float* A, float* B, float* C)
    {
      int Row = blockIdx.y*blockDim.y+threadIdx.y;
      int Col = blockIdx.x*blockDim.x+threadIdx.x;
      if ((Row < m) && (Col < k)) {
        float Cvalue = 0.0;
        for (int i = 0; i < n; ++i)
          /* A[Row, i] and B[i, Col] */
          Cvalue += A[Row*n+i] * B[Col+i*k];
        C[Row*k+Col] = Cvalue;
      }
    }

Global Memory Access Pattern of the Base Matrix Multiplication Kernel
[Figure: Thread 1, Thread 2, … each access Global Memory directly]

How about performance on Fermi GPU?
• All threads access global memory for their input matrix elements
• Two memory accesses (8 bytes) per floating-point multiply-add
• 4 B/s of memory bandwidth per FLOPS
• Peak floating-point rate is 1,000 GFLOPS
• 4*1,000 = 4,000 GB/s required to achieve peak FLOPS rating
• Reality: 150 GB/s, limiting the code at 37.5 GFLOPS
• The actual code runs at about 25 GFLOPS
• Need to drastically cut down memory accesses to get close to the peak 1,000 GFLOPS
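The bandwidth arithmetic on this slide can be written out step by step. This is only a sketch using the figures quoted on the slide. The symbols B_needed and P_max are introduced here for illustration, and the count of two floating-point operations per multiply-add (one multiply, one add) is the assumption behind the 4 B per FLOP ratio.

```latex
\begin{align*}
\text{traffic per FLOP} &= \frac{8\ \text{bytes}}{2\ \text{FLOP}} = 4\ \text{B/FLOP} \\
B_{\text{needed}} &= 4\ \text{B/FLOP} \times 1{,}000\ \text{GFLOPS} = 4{,}000\ \text{GB/s} \\
P_{\text{max}} &= \frac{150\ \text{GB/s}}{4\ \text{B/FLOP}} = 37.5\ \text{GFLOPS}
\end{align*}
```

So at 150 GB/s the kernel is bounded at 37.5 GFLOPS by memory bandwidth alone, well below the 1,000 GFLOPS peak.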


Shared Memory Tiling/Blocking: Basic Idea
[Figure: Thread 1, Thread 2, … access On-chip Memory, which is filled from Global Memory]
• Divide the global memory content into tiles
• Focus the computation of threads on one or a small number of tiles at each point in time


Basic Concept of Blocking/Tiling
• In a congested traffic system, a significant reduction in vehicles can greatly reduce the delay seen by all vehicles
• Carpooling for commuters
• Tiling for global memory accesses
• drivers = thread operands
• cars = memory access requests

Some computations are more challenging to tile than others.
• Some carpools may be easier than others
• More efficient if neighbors are also classmates or co-workers
• Some vehicles may be more suitable for carpooling
• Similar variations exist in tiling

Carpools need synchronization.
• Good – when people have similar schedules
  [Timeline] Worker A: sleep, work, dinner
  [Timeline] Worker B: sleep, work, dinner
• Bad – when people have very different schedules
  [Timeline] Worker A: party, sleep, work
  [Timeline] Worker B: sleep, work, dinner

Same with Blocking/Tiling
• Good – when threads have similar access timing
  [Figure: timelines of Thread 1 and Thread 2 over Time]
• Bad – when threads have very different timing
  [Figure: timelines of Thread 1 and Thread 2 over Time]

Outline of Tiling Technique
• Identify a tile of global memory contents that are accessed by multiple threads
• Load the tile from global memory into on-chip memory
• Have the multiple threads access their data from the on-chip memory
• Move on to the next tile
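The four steps above can be sketched as a tiled variant of the basic matrix multiplication kernel. This is a minimal illustration, not the lecture's own code. The kernel name TiledMatrixMulKernel, the TILE_WIDTH value of 16, the shared arrays As and Bs, and the small host driver are all assumptions introduced here. The kernel assumes the thread block is launched as TILE_WIDTH x TILE_WIDTH, and it uses __syncthreads() as the barrier that keeps the threads of a block on the same tile.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size; the slides do not specify one

// Hypothetical tiled kernel: A is m x n, B is n x k, C is m x k (row-major).
// Must be launched with blockDim = (TILE_WIDTH, TILE_WIDTH).
__global__ void TiledMatrixMulKernel(int m, int n, int k,
                                     const float* A, const float* B, float* C)
{
  __shared__ float As[TILE_WIDTH][TILE_WIDTH];  // on-chip tile of A
  __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];  // on-chip tile of B

  int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
  int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
  float Cvalue = 0.0f;

  // Walk over the tiles along the shared dimension n ("move on to the next tile").
  for (int t = 0; t < (n + TILE_WIDTH - 1) / TILE_WIDTH; ++t) {
    int aCol = t * TILE_WIDTH + threadIdx.x;
    int bRow = t * TILE_WIDTH + threadIdx.y;
    // Each thread loads one element of each tile; out-of-range entries become 0.
    As[threadIdx.y][threadIdx.x] = (Row < m && aCol < n) ? A[Row * n + aCol] : 0.0f;
    Bs[threadIdx.y][threadIdx.x] = (bRow < n && Col < k) ? B[bRow * k + Col] : 0.0f;
    __syncthreads();  // barrier: the whole tile is loaded before anyone reads it

    // All threads in the block read their operands from on-chip memory.
    for (int i = 0; i < TILE_WIDTH; ++i)
      Cvalue += As[threadIdx.y][i] * Bs[i][threadIdx.x];
    __syncthreads();  // barrier: everyone is done before the tile is overwritten
  }
  if (Row < m && Col < k) C[Row * k + Col] = Cvalue;
}

int main()
{
  const int m = 50, n = 70, k = 30;  // arbitrary sizes, not multiples of TILE_WIDTH
  float *A, *B, *C;
  cudaMallocManaged(&A, m * n * sizeof(float));
  cudaMallocManaged(&B, n * k * sizeof(float));
  cudaMallocManaged(&C, m * k * sizeof(float));
  for (int i = 0; i < m * n; ++i) A[i] = (float)(i % 7) * 0.5f;
  for (int i = 0; i < n * k; ++i) B[i] = (float)(i % 5) - 2.0f;

  dim3 block(TILE_WIDTH, TILE_WIDTH);
  dim3 grid((k + TILE_WIDTH - 1) / TILE_WIDTH, (m + TILE_WIDTH - 1) / TILE_WIDTH);
  TiledMatrixMulKernel<<<grid, block>>>(m, n, k, A, B, C);
  cudaDeviceSynchronize();

  // Compare against a plain CPU triple loop.
  float maxErr = 0.0f;
  for (int r = 0; r < m; ++r)
    for (int c = 0; c < k; ++c) {
      float ref = 0.0f;
      for (int i = 0; i < n; ++i) ref += A[r * n + i] * B[i * k + c];
      maxErr = fmaxf(maxErr, fabsf(ref - C[r * k + c]));
    }
  printf("max abs error vs CPU reference: %g\n", maxErr);

  cudaFree(A); cudaFree(B); cudaFree(C);
  return 0;
}
```

In this sketch each element of A and B is read from global memory once per block and then reused by TILE_WIDTH threads from shared memory, so global memory traffic drops by roughly a factor of TILE_WIDTH compared with the basic kernel. The two barriers are the synchronization that the carpool analogy describes: threads in a block have to arrive at each tile together.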

TO LEARN MORE, READ SECTION 5.3
