2/10/09

L6: Memory Hierarchy Optimization I, Data Placement
CS6963

Administrative
•  Projects mostly graded
   –  Grades by Wednesday, possibly sooner
•  Homework coming out this afternoon
   –  Due Wednesday, Feb. 18


Overview
•  Where data can be stored
•  And how to get it there
•  Some guidelines for where to store data
•  High level description of how to write code to optimize for memory hierarchy
   –  Fill in details Wednesday

Targets of Memory Hierarchy Optimizations
•  Reduce memory latency
   –  The latency of a memory access is the time (usually in cycles) between a memory request and its completion
•  Maximize memory bandwidth
   –  Bandwidth is the amount of useful data that can be retrieved over a time interval
•  Manage overhead
   –  Cost of performing optimization (e.g., copying) should be less than anticipated gain




Optimizing the Memory Hierarchy on GPUs
•  Device memory access times non-uniform so data placement significantly affects performance.
•  But controlling data placement may require additional copying, so consider overhead.
•  Optimizations to increase memory bandwidth. Idea: maximize utility of each memory access.
   –  Align data structures to address boundaries
   –  Coalesce global memory accesses
   –  Avoid memory bank conflicts to increase memory access parallelism

Reuse and Locality
•  Consider how data is accessed
   –  Data reuse:
      •  Same data used multiple times
      •  Intrinsic in computation
   –  Data locality:
      •  Data is reused and is present in “fast memory”
      •  Same data or same data transfer
•  If a computation has reuse, what can we do to get locality?
   –  Appropriate data placement and layout
   –  Code reordering transformations


Programmer’s View: Memory Spaces
•  Each thread can:
   –  Read/write per-thread registers
   –  Read/write per-thread local memory
   –  Read/write per-block shared memory
   –  Read/write per-grid global memory
   –  Read only per-grid constant memory
   –  Read only per-grid texture memory
•  The host can read/write global, constant, and texture memory
[Figure: a grid of thread blocks; each block has shared memory, each thread has registers and local memory; global, constant, and texture memory are shared by all blocks and accessible to the host]

Hardware Implementation: Memory Architecture
•  The local, global, constant, and texture spaces are regions of device memory
•  Each multiprocessor has:
   –  A set of 32-bit registers per processor
   –  On-chip shared memory
      •  Where the shared memory space resides
   –  A read-only constant cache
      •  To speed up access to the constant memory space
   –  A read-only texture cache
      •  To speed up access to the texture memory space
[Figure: a device with multiprocessors 1…N; each multiprocessor contains processors 1…M with registers, an instruction unit, shared memory, a constant cache, and a texture cache, backed by device memory (global, constant, and texture memories)]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign




Terminology Review
•  device = GPU = set of multiprocessors
•  Multiprocessor = set of processors & shared memory
•  Kernel = GPU program
•  Grid = array of thread blocks that execute a kernel
•  Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory     Location   Cached           Access       Who
Local      Off-chip   No               Read/write   One thread
Shared     On-chip    N/A - resident   Read/write   All threads in a block
Global     Off-chip   No               Read/write   All threads + host
Constant   Off-chip   Yes              Read         All threads + host
Texture    Off-chip   Yes              Read         All threads + host

Access Times (REWRITE?)
•  Register – dedicated HW - single cycle
•  Constant and Texture caches – possibly single cycle, proportional to addresses accessed by warp
•  Shared Memory – dedicated HW - single cycle
•  Local Memory – DRAM, no cache - *slow*
•  Global Memory – DRAM, no cache - *slow*
•  Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
•  Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
•  Instruction Memory (invisible) – DRAM, cached
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Data Placement: Conceptual
•  Copies from host to device go to some part of global memory (possibly, constant or texture memory)
•  How to use SP shared memory
   –  Must construct or be copied from global memory by kernel program
•  How to use constant or texture cache
   –  Read-only “reused” data can be placed in constant & texture memory by host
•  Local memory
   –  Deals with capacity limitations of registers and shared memory
   –  Eliminates worries about race conditions
   –  … but SLOW

Data Placement: Syntax
•  Through type qualifiers
   –  __constant__, __shared__, __local__, __device__
•  Through cudaMemcpy calls
   –  Flavor of call and symbolic constant designate where to copy
•  Implicit default behavior (see the sketch below)
   –  Device memory without qualifier is global memory
   –  Host by default copies to global memory
   –  Thread variables go into registers unless capacity exceeded, then local memory
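Not on the slides: a minimal sketch tying the qualifiers and copy calls above together. All names, sizes, and the launch configuration are hypothetical.

__constant__ float d_coeff[256];      /* constant memory, read-only in kernels */
__device__ float d_result[256];       /* global memory, declared at file scope */

__global__ void scale(float *d_in) {
   __shared__ float tile[256];        /* per-block shared memory */
   int i = threadIdx.x;               /* automatic scalar: lives in a register */
   tile[i] = d_in[i] * d_coeff[i];
   __syncthreads();
   d_result[i] = tile[i];
}

void hostSetup(float *h_in, float *h_coeff) {
   float *d_in;
   cudaMalloc((void**)&d_in, 256 * sizeof(float));
   /* plain cudaMemcpy targets (global) device memory */
   cudaMemcpy(d_in, h_in, 256 * sizeof(float), cudaMemcpyHostToDevice);
   /* cudaMemcpyToSymbol copies to a __constant__ (or __device__) symbol */
   cudaMemcpyToSymbol(d_coeff, h_coeff, 256 * sizeof(float));
   scale<<<1, 256>>>(d_in);
}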




Language Extensions: Variable Type Qualifiers

Variable declaration                          Memory     Scope    Lifetime
__device__ __local__    int LocalVar;         local      thread   thread
__device__ __shared__   int SharedVar;        shared     block    block
__device__              int GlobalVar;        global     grid     application
__device__ __constant__ int ConstantVar;      constant   grid     application

•  __device__ is optional when used with __local__, __shared__, or __constant__
•  Automatic variables without any qualifier reside in a register
   –  Except arrays that reside in local memory

Variable Type Restrictions
•  Pointers can only point to memory allocated or declared in global memory:
   –  Allocated in the host and passed to the kernel:
      __global__ void KernelFunc(float* ptr)
   –  Obtained as the address of a global variable:
      float* ptr = &GlobalVar;
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
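Not on the slides: the two legal pointer sources above in one tiny sketch; the names are illustrative.

__device__ float GlobalVar;

__global__ void KernelFunc(float* ptr) {   /* ptr was allocated on the host with cudaMalloc */
   float* p = &GlobalVar;                  /* address of a global (device) variable */
   *ptr = *p;
}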

Rest of Today’s Lecture
•  Mechanics of how to place data in shared memory and constant memory
•  Tiling transformation to reuse data within
   –  Shared memory
   –  Constant cache

Constant Memory Example
•  Signal recognition:
   –  Apply input signal (a vector) to a set of precomputed transform matrices
   –  Compute M1V, M2V, …, MnV

__constant__ float d_signalVector[M];
__device__ float R[N][M];

__host__ void outerApplySignal () {
   float *h_inputSignal;
   dim3 dimGrid(N);
   dim3 dimBlock(M);
   cudaMemcpyToSymbol (d_signalVector,
         h_inputSignal, M*sizeof(float));
   ApplySignal<<<dimGrid,dimBlock>>>(M);
}

__global__ void ApplySignal (int M) {
   float result = 0.0;  /* register */
   for (int j=0; j<M; j++)
      result += d_M[blockIdx.x][threadIdx.x][j] *
                d_signalVector[j];
   R[blockIdx.x][threadIdx.x] = result;
}
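A gap worth noting (not on the slide): the kernel reads d_M, which is never declared. A matching declaration might look like the sketch below, with N and M the same compile-time constants used above; the layout is an assumption.

/* assumed layout: N transform matrices, each M x M, in device global memory */
__device__ float d_M[N][M][M];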




More on Constant Cache
•  Example from previous slide
   –  All threads in a block accessing same element of signal vector
   –  Brought into cache for first access, then latency equivalent to a register access
[Figure: processors P0 … PM-1 issue LD signalVector[j] through the instruction unit; the value is served from the constant cache into their registers]

Additional Detail
•  Suppose each thread accesses different data from constant memory on same instruction
   –  Reuse across threads?
      •  Consider capacity of constant cache and locality
      •  Code transformation needed? (later in lecture)
      •  Cache latency proportional to number of accesses in a warp
   –  No reuse?
      •  Should not be in constant memory.


Now Let’s Look at Shared Memory
•  Common Programming Pattern (5.1.2 of CUDA manual)
   –  Load data into shared memory
   –  Synchronize (if necessary)
   –  Operate on data in shared memory
   –  Synchronize (if necessary)
   –  Write intermediate results to global memory
   –  Repeat until done
[Figure: shared memory sits between the threads and global memory]

Mechanics of Using Shared Memory
•  __shared__ type qualifier required
•  Must be allocated from global/device function, or as “extern”
•  Examples:

extern __shared__ float d_s_array[];

/* a form of dynamic allocation */
/* MEMSIZE is size of per-block shared memory */
__host__ void outerCompute() {
   compute<<<gs,bs,MEMSIZE>>>();
}
__global__ void compute() {
   d_s_array[i] = …;
}

__global__ void compute2() {
   __shared__ float d_s_array[M];
   /* create or copy from global memory */
   d_s_array[j] = …;
   /* write result back to global memory */
   d_g_array[j] = d_s_array[j];
}
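Not on the slides: the fragments above elide the actual data movement; here is a minimal self-contained sketch of the dynamic (extern) variant, assuming one element per thread and a hypothetical d_g_array already resident in global memory.

extern __shared__ float d_s_array[];   /* sized at launch via the third <<< >>> argument */

__global__ void compute(float *d_g_array) {
   int i = threadIdx.x;
   int g = blockIdx.x * blockDim.x + i;
   d_s_array[i] = d_g_array[g];        /* copy from global memory */
   __syncthreads();
   d_s_array[i] *= 2.0f;               /* operate on data in shared memory */
   __syncthreads();
   d_g_array[g] = d_s_array[i];        /* write result back to global memory */
}

/* host side: MEMSIZE is the per-block shared memory size in bytes */
void outerCompute(float *d_g_array, int gs, int bs) {
   size_t MEMSIZE = bs * sizeof(float);
   compute<<<gs, bs, MEMSIZE>>>(d_g_array);
}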




Matrix Transpose (from SDK)
__global__ void transpose(float *odata, float *idata, int width, int height)
{
  __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

  // read the matrix tile into shared memory
  unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
  unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
  unsigned int index_in = yIndex * width + xIndex;
  block[threadIdx.y][threadIdx.x] = idata[index_in];

  __syncthreads();

  // write the transposed matrix tile to global memory
  xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
  yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
  unsigned int index_out = yIndex * height + xIndex;
  odata[index_out] = block[threadIdx.x][threadIdx.y];
}
odata and idata are in global memory; the tile is rearranged in shared memory and written back efficiently to global memory. (A host-side launch sketch follows the next slide.)

Recall Reuse and Locality
•  Consider how data is accessed
   –  Data reuse:
      •  Same data used multiple times
      •  Intrinsic in computation
   –  Data locality:
      •  Data is reused and is present in “fast memory”
      •  Same data or same data transfer
•  If a computation has reuse, what can we do to get locality?
   –  Appropriate data placement and layout
   –  Code reordering transformations
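Not from the SDK slide: a possible host-side launch for the transpose kernel above, assuming BLOCK_DIM is 16, width and height are multiples of BLOCK_DIM, and d_odata/d_idata are existing device allocations.

#define BLOCK_DIM 16

void launchTranspose(float *d_odata, float *d_idata, int width, int height) {
   dim3 threads(BLOCK_DIM, BLOCK_DIM);
   dim3 grid(width / BLOCK_DIM, height / BLOCK_DIM);
   transpose<<<grid, threads>>>(d_odata, d_idata, width, height);
}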


Temporal Reuse in Sequential Code
•  Same data used in distinct iterations I and I’

for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    A[j] = A[j]+A[j+1]+A[j-1];

•  A[j] has self-temporal reuse in loop i

Spatial Reuse (Ignore for now)
•  Same data transfer (usually cache line) used in distinct iterations I and I’

for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    A[j] = A[j]+A[j+1]+A[j-1];

•  A[j] has self-spatial reuse in loop j
•  Multi-dimensional array note: C arrays are stored in row-major order




Group Reuse
•  Same data used by distinct references

for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    A[j] = A[j]+A[j+1]+A[j-1];

•  A[j], A[j+1] and A[j-1] have group reuse (spatial and temporal) in loop j

Can Use Reordering Transformations!
•  Analyze reuse in computation
•  Apply loop reordering transformations to improve locality based on reuse
•  With any loop reordering transformation, always ask
   –  Safety? (doesn’t reverse dependences)
   –  Profitability? (improves locality)


Loop Permutation: A Reordering Transformation
•  Permute the order of the loops to modify the traversal order

for (i= 0; i<3; i++)                 for (j=0; j<6; j++)
  for (j=0; j<6; j++)                  for (i= 0; i<3; i++)
    A[i][j+1]=A[i][j]+B[j];              A[i][j+1]=A[i][j]+B[j];

[Figure: iteration spaces over i and j; the permuted version has a new traversal order]
•  Which one is better for row-major storage? (a note follows the next slide)

Safety of Permutation
•  Intuition: Cannot permute two loops i and j in a loop nest if doing so reverses the direction of any dependence.
•  Loops i through j of an n-deep loop nest are fully permutable if for all dependences D, either (d1, …, di-1) > 0 or forall k, i ≤ k ≤ j, dk ≥ 0
•  Stated without proof: Within the affine domain, the n-1 inner loops of an n-deep loop nest can be transformed to be fully permutable.
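Not on the slides: a brief note on the row-major question above. The two orders below are the same loop nests, annotated with their access patterns (a sketch; A is a statically sized 2-D C array here).

float A[3][7], B[6];

/* i outer, j inner: A[i][j] and A[i][j+1] sweep one row contiguously,
   so consecutive iterations touch adjacent memory -- good spatial
   locality for row-major storage. */
for (int i = 0; i < 3; i++)
  for (int j = 0; j < 6; j++)
    A[i][j+1] = A[i][j] + B[j];

/* j outer, i inner: consecutive iterations stride by a whole row of A,
   so each access may land on a different cache line. */
for (int j = 0; j < 6; j++)
  for (int i = 0; i < 3; i++)
    A[i][j+1] = A[i][j] + B[j];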




Simple Examples: 2-d Loop Nests

for (i= 0; i<3; i++)                 for (i= 0; i<3; i++)
  for (j=0; j<6; j++)                  for (j=0; j<6; j++)
    A[i][j+1]=A[i][j]+B[j];              A[i+1][j-1]=A[i][j]+B[j];

•  Distance vectors
•  Ok to permute?

Tiling (Blocking): Another Loop Reordering Transformation
•  Blocking reorders loop iterations to bring iterations that reuse data closer in time
[Figure: iteration space over I and J before and after blocking]


Tiling Example

for (j=1; j<M; j++)
  for (i=1; i<N; i++)
    D[i] = D[i] + B[j][i];

Strip mine:
for (j=1; j<M; j++)
  for (ii=1; ii<N; ii+=s)
    for (i=ii; i<min(ii+s,N); i++)
      D[i] = D[i] + B[j][i];

Permute:
for (ii=1; ii<N; ii+=s)
  for (j=1; j<M; j++)
    for (i=ii; i<min(ii+s,N); i++)
      D[i] = D[i] + B[j][i];

(A runnable check of this example follows the next slide.)

Legality of Tiling
•  Tiling = strip-mine and permutation
   –  Strip-mine does not reorder iterations
   –  Permutation must be legal
   OR
   –  Strip size less than dependence distance
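Not on the slides: a small self-contained check of the tiling example above, comparing the original and the strip-mined-and-permuted loops (a sketch; the values of N, M, and the strip size s are arbitrary choices).

#include <stdio.h>

#define N 10
#define M 6

static int imin(int a, int b) { return a < b ? a : b; }

int main(void) {
   float D1[N] = {0}, D2[N] = {0}, B[M][N];
   for (int j = 0; j < M; j++)
      for (int i = 0; i < N; i++)
         B[j][i] = j + 0.1f * i;

   /* original loop nest */
   for (int j = 1; j < M; j++)
      for (int i = 1; i < N; i++)
         D1[i] = D1[i] + B[j][i];

   /* tiled: strip-mine i with strip size s, then permute ii outward */
   int s = 4;
   for (int ii = 1; ii < N; ii += s)
      for (int j = 1; j < M; j++)
         for (int i = ii; i < imin(ii + s, N); i++)
            D2[i] = D2[i] + B[j][i];

   for (int i = 0; i < N; i++)
      printf("%d: %f %f\n", i, D1[i], D2[i]);   /* the two columns should match */
   return 0;
}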




A Few Words On Tiling
•  Tiling can be used hierarchically to compute partial results on a block of data wherever there are capacity limitations
   –  Between grids if data exceeds global memory capacity
   –  Across thread blocks if shared data exceeds shared memory capacity
   –  Within threads if data in constant cache exceeds cache capacity

Matrix Multiplication: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}
[Figure: P[i][j] is the dot product of row i of M and column j of N; all three matrices are WIDTH x WIDTH]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign


Tiled Multiply Using Thread Blocks
•  One block computes one square sub-matrix Psub of size BLOCK_SIZE
•  One thread computes one element of Psub
•  Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square shape
[Figure: M, N, and P partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; block indices (bx, by) select the Psub tile of P and thread indices (tx, ty) select an element within it]

Shared Memory Usage
•  Assume each SMP has 16KB shared memory
   –  Each Thread Block uses 2*256*4B = 2KB of shared memory.
   –  Can potentially have up to 8 Thread Blocks actively executing
   –  For BLOCK_SIZE = 16, this allows up to 8*512 = 4,096 pending loads
      •  In practice, there will probably be up to half of this due to scheduling to make use of SPs.
   –  The next BLOCK_SIZE, 32, would lead to 2*32*32*4B = 8KB shared memory usage per Thread Block, allowing only up to two Thread Blocks active at the same time
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign




First-order Size Considerations
•  Each Thread Block should have a minimum of 192 threads
   –  BLOCK_SIZE of 16 gives 16*16 = 256 threads
•  A minimum of 32 Thread Blocks
   –  A 1024*1024 P Matrix gives 64*64 = 4096 Thread Blocks
•  Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
   –  Memory bandwidth no longer a limiting factor

CUDA Code – Kernel Execution Configuration

// Setup the execution configuration
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(N.width / dimBlock.x,
             M.height / dimBlock.y);

For very large N and M dimensions, one will need to add another level of blocking and execute the second-level blocks sequentially.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign


CUDA Code – Kernel Overview

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

// Pvalue stores the element of the block sub-matrix
// that is computed by the thread
float Pvalue = 0;

// Loop over all the sub-matrices of M and N
// required to compute the block sub-matrix
for (int m = 0; m < M.width/BLOCK_SIZE; ++m) {
   code from the next few slides };

CUDA Code - Load Data to Shared Memory

// Get a pointer to the current sub-matrix Msub of M
Matrix Msub = GetSubMatrix(M, m, by);

// Get a pointer to the current sub-matrix Nsub of N
Matrix Nsub = GetSubMatrix(N, bx, m);

__shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

// each thread loads one element of the sub-matrix
Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);

// each thread loads one element of the sub-matrix
Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign




CUDA Code - Compute Result

// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
__syncthreads();

// each thread computes one element of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
   Pvalue += Ms[ty][k] * Ns[k][tx];

// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of M and N in the next iteration
__syncthreads();

CUDA Code - Save Result

// Get a pointer to the block sub-matrix of P
Matrix Psub = GetSubMatrix(P, bx, by);

// Write the block sub-matrix to device memory;
// each thread writes one element
SetMatrixElement(Psub, tx, ty, Pvalue);

This code should run at about 45 GFLOPS
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
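Not on the slides: the fragments from the last four slides assembled into one kernel for reference. The Matrix type and the GetSubMatrix / GetMatrixElement / SetMatrixElement helpers are not defined in this lecture, so this sketch replaces them with explicit row-major index arithmetic and assumes square Width x Width matrices with Width a multiple of BLOCK_SIZE.

#define BLOCK_SIZE 16

/* Assembled tiled kernel (sketch): M, N, P are row-major Width x Width
   matrices in global memory; Width is assumed to be a multiple of BLOCK_SIZE. */
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
   int bx = blockIdx.x;  int by = blockIdx.y;
   int tx = threadIdx.x; int ty = threadIdx.y;

   // Row and column of the P element computed by this thread
   int row = by * BLOCK_SIZE + ty;
   int col = bx * BLOCK_SIZE + tx;

   __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
   __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

   // Pvalue stores the element of the block sub-matrix computed by the thread
   float Pvalue = 0;

   // Loop over all the sub-matrices of M and N required for this block
   for (int m = 0; m < Width / BLOCK_SIZE; ++m) {
      // each thread loads one element of Msub and one element of Nsub
      Ms[ty][tx] = M[row * Width + (m * BLOCK_SIZE + tx)];
      Ns[ty][tx] = N[(m * BLOCK_SIZE + ty) * Width + col];
      __syncthreads();      // sub-matrices loaded before computing

      for (int k = 0; k < BLOCK_SIZE; ++k)
         Pvalue += Ms[ty][k] * Ns[k][tx];
      __syncthreads();      // computation done before loading the next tiles
   }

   // each thread writes one element of the block sub-matrix
   P[row * Width + col] = Pvalue;
}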


Matrix Multiply in CUDA
•  Imagine you want to compute extremely large matrices.
   –  That don’t fit in global memory
•  This is where an additional level of tiling could be used, between grids

Summary of Lecture
•  How to place data in constant memory and shared memory
•  Reordering transformations to improve locality
•  Tiling transformation
•  Matrix multiply example




Next Time

•  Complete this example


•  Reasoning about reuse and locality
•  Talk about projects and assign proposal
