2/10/09

L6: Memory Hierarchy Optimization I, Data Placement
CS6963

Administrative
•  Projects mostly graded
   –  Grades by Wednesday, possibly sooner
•  Homework coming out this afternoon
   –  Due Wednesday, Feb. 18


Overview
•  Where data can be stored
•  And how to get it there
•  Some guidelines for where to store data
•  High level description of how to write code to optimize for memory hierarchy
   –  Fill in details Wednesday

Targets of Memory Hierarchy Optimizations
•  Reduce memory latency
   –  The latency of a memory access is the time (usually in cycles) between a memory request and its completion
•  Maximize memory bandwidth
   –  Bandwidth is the amount of useful data that can be retrieved over a time interval
•  Manage overhead
   –  Cost of performing optimization (e.g., copying) should be less than anticipated gain




Optimizing the Memory Hierarchy on GPUs
•  Device memory access times non-uniform so data placement significantly affects performance.
•  But controlling data placement may require additional copying, so consider overhead.
•  Optimizations to increase memory bandwidth. Idea: maximize utility of each memory access.
   –  Align data structures to address boundaries
   –  Coalesce global memory accesses
   –  Avoid memory bank conflicts to increase memory access parallelism

Reuse and Locality
•  Consider how data is accessed
   –  Data reuse:
      •  Same data used multiple times
      •  Intrinsic in computation
   –  Data locality:
      •  Data is reused and is present in “fast memory”
      •  Same data or same data transfer
•  If a computation has reuse, what can we do to get locality?
   –  Appropriate data placement and layout
   –  Code reordering transformations


Programmer’s View: Memory Spaces
•  Each thread can:
   –  Read/write per-thread registers
   –  Read/write per-thread local memory
   –  Read/write per-block shared memory
   –  Read/write per-grid global memory
   –  Read only per-grid constant memory
   –  Read only per-grid texture memory
•  The host can read/write global, constant, and texture memory
[Figure: a grid of thread blocks; each block has shared memory, each thread has registers and local memory; global, constant, and texture memory are shared by all blocks and accessible to the host]

Hardware Implementation: Memory Architecture
•  The local, global, constant, and texture spaces are regions of device memory
•  Each multiprocessor has:
   –  A set of 32-bit registers per processor
   –  On-chip shared memory
      •  Where the shared memory space resides
   –  A read-only constant cache
      •  To speed up access to the constant memory space
   –  A read-only texture cache
      •  To speed up access to the texture memory space
[Figure: a device with multiprocessors 1…N; each multiprocessor contains processors 1…M with registers, an instruction unit, shared memory, a constant cache, and a texture cache, backed by device memory (global, constant, and texture memories)]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign




Terminology Review
•  device = GPU = set of multiprocessors
•  Multiprocessor = set of processors & shared memory
•  Kernel = GPU program
•  Grid = array of thread blocks that execute a kernel
•  Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory     Location   Cached           Access       Who
Local      Off-chip   No               Read/write   One thread
Shared     On-chip    N/A - resident   Read/write   All threads in a block
Global     Off-chip   No               Read/write   All threads + host
Constant   Off-chip   Yes              Read         All threads + host
Texture    Off-chip   Yes              Read         All threads + host

Access Times (REWRITE?)
•  Register – dedicated HW - single cycle
•  Constant and Texture caches – possibly single cycle, proportional to addresses accessed by warp
•  Shared Memory – dedicated HW - single cycle
•  Local Memory – DRAM, no cache - *slow*
•  Global Memory – DRAM, no cache - *slow*
•  Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
•  Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
•  Instruction Memory (invisible) – DRAM, cached
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Data Placement: Conceptual
•  Copies from host to device go to some part of global memory (possibly, constant or texture memory)
•  How to use SP shared memory
   –  Must construct or be copied from global memory by kernel program
•  How to use constant or texture cache
   –  Read-only “reused” data can be placed in constant & texture memory by host
•  Local memory
   –  Deals with capacity limitations of registers and shared memory
   –  Eliminates worries about race conditions
   –  … but SLOW

Data Placement: Syntax
•  Through type qualifiers
   –  __constant__, __shared__, __local__, __device__
•  Through cudaMemcpy calls
   –  Flavor of call and symbolic constant designate where to copy
•  Implicit default behavior (see the sketch below)
   –  Device memory without qualifier is global memory
   –  Host by default copies to global memory
   –  Thread variables go into registers unless capacity exceeded, then local memory
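Not on the slides: a minimal sketch tying the qualifiers and copy calls above together. All names, sizes, and the launch configuration are hypothetical.

__constant__ float d_coeff[256];      /* constant memory, read-only in kernels */
__device__ float d_result[256];       /* global memory, declared at file scope */

__global__ void scale(float *d_in) {
   __shared__ float tile[256];        /* per-block shared memory */
   int i = threadIdx.x;               /* automatic scalar: lives in a register */
   tile[i] = d_in[i] * d_coeff[i];
   __syncthreads();
   d_result[i] = tile[i];
}

void hostSetup(float *h_in, float *h_coeff) {
   float *d_in;
   cudaMalloc((void**)&d_in, 256 * sizeof(float));
   /* plain cudaMemcpy targets (global) device memory */
   cudaMemcpy(d_in, h_in, 256 * sizeof(float), cudaMemcpyHostToDevice);
   /* cudaMemcpyToSymbol copies to a __constant__ (or __device__) symbol */
   cudaMemcpyToSymbol(d_coeff, h_coeff, 256 * sizeof(float));
   scale<<<1, 256>>>(d_in);
}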




Language Extensions: Variable Type Qualifiers

Variable declaration                          Memory     Scope    Lifetime
__device__ __local__    int LocalVar;         local      thread   thread
__device__ __shared__   int SharedVar;        shared     block    block
__device__              int GlobalVar;        global     grid     application
__device__ __constant__ int ConstantVar;      constant   grid     application

•  __device__ is optional when used with __local__, __shared__, or __constant__
•  Automatic variables without any qualifier reside in a register
   –  Except arrays that reside in local memory

Variable Type Restrictions
•  Pointers can only point to memory allocated or declared in global memory:
   –  Allocated in the host and passed to the kernel:
      __global__ void KernelFunc(float* ptr)
   –  Obtained as the address of a global variable:
      float* ptr = &GlobalVar;
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
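Not on the slides: the two legal pointer sources above in one tiny sketch; the names are illustrative.

__device__ float GlobalVar;

__global__ void KernelFunc(float* ptr) {   /* ptr was allocated on the host with cudaMalloc */
   float* p = &GlobalVar;                  /* address of a global (device) variable */
   *ptr = *p;
}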

Rest of Today’s Lecture
•  Mechanics of how to place data in shared memory and constant memory
•  Tiling transformation to reuse data within
   –  Shared memory
   –  Constant cache

Constant Memory Example
•  Signal recognition:
   –  Apply input signal (a vector) to a set of precomputed transform matrices
   –  Compute M1V, M2V, …, MnV

__constant__ float d_signalVector[M];
__device__ float R[N][M];

__host__ void outerApplySignal () {
   float *h_inputSignal;
   dim3 dimGrid(N);
   dim3 dimBlock(M);
   cudaMemcpyToSymbol (d_signalVector,
         h_inputSignal, M*sizeof(float));
   ApplySignal<<<dimGrid,dimBlock>>>(M);
}

__global__ void ApplySignal (int M) {
   float result = 0.0;  /* register */
   for (int j=0; j<M; j++)
      result += d_M[blockIdx.x][threadIdx.x][j] *
                d_signalVector[j];
   R[blockIdx.x][threadIdx.x] = result;
}
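A gap worth noting (not on the slide): the kernel reads d_M, which is never declared. A matching declaration might look like the sketch below, with N and M the same compile-time constants used above; the layout is an assumption.

/* assumed layout: N transform matrices, each M x M, in device global memory */
__device__ float d_M[N][M][M];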




More on Constant Cache
•  Example from previous slide
   –  All threads in a block accessing same element of signal vector
   –  Brought into cache for first access, then latency equivalent to a register access
[Figure: processors P0 … PM-1 issue LD signalVector[j] through the instruction unit; the value is served from the constant cache into their registers]

Additional Detail
•  Suppose each thread accesses different data from constant memory on same instruction
   –  Reuse across threads?
      •  Consider capacity of constant cache and locality
      •  Code transformation needed? (later in lecture)
      •  Cache latency proportional to number of accesses in a warp
   –  No reuse?
      •  Should not be in constant memory.


Now Let’s Look at Shared Memory
•  Common Programming Pattern (5.1.2 of CUDA manual)
   –  Load data into shared memory
   –  Synchronize (if necessary)
   –  Operate on data in shared memory
   –  Synchronize (if necessary)
   –  Write intermediate results to global memory
   –  Repeat until done
[Figure: shared memory sits between the threads and global memory]

Mechanics of Using Shared Memory
•  __shared__ type qualifier required
•  Must be allocated from global/device function, or as “extern”
•  Examples:

extern __shared__ float d_s_array[];

/* a form of dynamic allocation */
/* MEMSIZE is size of per-block shared memory */
__host__ void outerCompute() {
   compute<<<gs,bs,MEMSIZE>>>();
}
__global__ void compute() {
   d_s_array[i] = …;
}

__global__ void compute2() {
   __shared__ float d_s_array[M];
   /* create or copy from global memory */
   d_s_array[j] = …;
   /* write result back to global memory */
   d_g_array[j] = d_s_array[j];
}
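Not on the slides: the fragments above elide the actual data movement; here is a minimal self-contained sketch of the dynamic (extern) variant, assuming one element per thread and a hypothetical d_g_array already resident in global memory.

extern __shared__ float d_s_array[];   /* sized at launch via the third <<< >>> argument */

__global__ void compute(float *d_g_array) {
   int i = threadIdx.x;
   int g = blockIdx.x * blockDim.x + i;
   d_s_array[i] = d_g_array[g];        /* copy from global memory */
   __syncthreads();
   d_s_array[i] *= 2.0f;               /* operate on data in shared memory */
   __syncthreads();
   d_g_array[g] = d_s_array[i];        /* write result back to global memory */
}

/* host side: MEMSIZE is the per-block shared memory size in bytes */
void outerCompute(float *d_g_array, int gs, int bs) {
   size_t MEMSIZE = bs * sizeof(float);
   compute<<<gs, bs, MEMSIZE>>>(d_g_array);
}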




Matrix Transpose (from SDK)
__global__ void transpose(float *odata, float *idata, int width, int height)
{
  __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

  // read the matrix tile into shared memory
  unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
  unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
  unsigned int index_in = yIndex * width + xIndex;
  block[threadIdx.y][threadIdx.x] = idata[index_in];

  __syncthreads();

  // write the transposed matrix tile to global memory
  xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
  yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
  unsigned int index_out = yIndex * height + xIndex;
  odata[index_out] = block[threadIdx.x][threadIdx.y];
}
odata and idata are in global memory; the tile is rearranged in shared memory and written back efficiently to global memory. (A host-side launch sketch follows the next slide.)

Recall Reuse and Locality
•  Consider how data is accessed
   –  Data reuse:
      •  Same data used multiple times
      •  Intrinsic in computation
   –  Data locality:
      •  Data is reused and is present in “fast memory”
      •  Same data or same data transfer
•  If a computation has reuse, what can we do to get locality?
   –  Appropriate data placement and layout
   –  Code reordering transformations
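Not from the SDK slide: a possible host-side launch for the transpose kernel above, assuming BLOCK_DIM is 16, width and height are multiples of BLOCK_DIM, and d_odata/d_idata are existing device allocations.

#define BLOCK_DIM 16

void launchTranspose(float *d_odata, float *d_idata, int width, int height) {
   dim3 threads(BLOCK_DIM, BLOCK_DIM);
   dim3 grid(width / BLOCK_DIM, height / BLOCK_DIM);
   transpose<<<grid, threads>>>(d_odata, d_idata, width, height);
}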


Temporal Reuse in Sequential Code
•  Same data used in distinct iterations I and I’

for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    A[j] = A[j]+A[j+1]+A[j-1];

•  A[j] has self-temporal reuse in loop i

Spatial Reuse (Ignore for now)
•  Same data transfer (usually cache line) used in distinct iterations I and I’

for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    A[j] = A[j]+A[j+1]+A[j-1];

•  A[j] has self-spatial reuse in loop j
•  Multi-dimensional array note: C arrays are stored in row-major order




Group Reuse
•  Same data used by distinct references

for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    A[j] = A[j]+A[j+1]+A[j-1];

•  A[j], A[j+1] and A[j-1] have group reuse (spatial and temporal) in loop j

Can Use Reordering Transformations!
•  Analyze reuse in computation
•  Apply loop reordering transformations to improve locality based on reuse
•  With any loop reordering transformation, always ask
   –  Safety? (doesn’t reverse dependences)
   –  Profitability? (improves locality)


Loop Permutation: A Reordering Transformation
•  Permute the order of the loops to modify the traversal order

for (i= 0; i<3; i++)                 for (j=0; j<6; j++)
  for (j=0; j<6; j++)                  for (i= 0; i<3; i++)
    A[i][j+1]=A[i][j]+B[j];              A[i][j+1]=A[i][j]+B[j];

[Figure: iteration spaces over i and j; the permuted version has a new traversal order]
•  Which one is better for row-major storage? (a note follows the next slide)

Safety of Permutation
•  Intuition: Cannot permute two loops i and j in a loop nest if doing so reverses the direction of any dependence.
•  Loops i through j of an n-deep loop nest are fully permutable if for all dependences D, either (d1, …, di-1) > 0 or forall k, i ≤ k ≤ j, dk ≥ 0
•  Stated without proof: Within the affine domain, the n-1 inner loops of an n-deep loop nest can be transformed to be fully permutable.
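Not on the slides: a brief note on the row-major question above. The two orders below are the same loop nests, annotated with their access patterns (a sketch; A is a statically sized 2-D C array here).

float A[3][7], B[6];

/* i outer, j inner: A[i][j] and A[i][j+1] sweep one row contiguously,
   so consecutive iterations touch adjacent memory -- good spatial
   locality for row-major storage. */
for (int i = 0; i < 3; i++)
  for (int j = 0; j < 6; j++)
    A[i][j+1] = A[i][j] + B[j];

/* j outer, i inner: consecutive iterations stride by a whole row of A,
   so each access may land on a different cache line. */
for (int j = 0; j < 6; j++)
  for (int i = 0; i < 3; i++)
    A[i][j+1] = A[i][j] + B[j];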




Simple Examples: 2-d Loop Nests

for (i= 0; i<3; i++)                 for (i= 0; i<3; i++)
  for (j=0; j<6; j++)                  for (j=0; j<6; j++)
    A[i][j+1]=A[i][j]+B[j];              A[i+1][j-1]=A[i][j]+B[j];

•  Distance vectors
•  Ok to permute?

Tiling (Blocking): Another Loop Reordering Transformation
•  Blocking reorders loop iterations to bring iterations that reuse data closer in time
[Figure: iteration space over I and J before and after blocking]


Tiling Example

for (j=1; j<M; j++)
  for (i=1; i<N; i++)
    D[i] = D[i] + B[j][i];

Strip mine:
for (j=1; j<M; j++)
  for (ii=1; ii<N; ii+=s)
    for (i=ii; i<min(ii+s,N); i++)
      D[i] = D[i] + B[j][i];

Permute:
for (ii=1; ii<N; ii+=s)
  for (j=1; j<M; j++)
    for (i=ii; i<min(ii+s,N); i++)
      D[i] = D[i] + B[j][i];

(A runnable check of this example follows the next slide.)

Legality of Tiling
•  Tiling = strip-mine and permutation
   –  Strip-mine does not reorder iterations
   –  Permutation must be legal
   OR
   –  Strip size less than dependence distance
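Not on the slides: a small self-contained check of the tiling example above, comparing the original and the strip-mined-and-permuted loops (a sketch; the values of N, M, and the strip size s are arbitrary choices).

#include <stdio.h>

#define N 10
#define M 6

static int imin(int a, int b) { return a < b ? a : b; }

int main(void) {
   float D1[N] = {0}, D2[N] = {0}, B[M][N];
   for (int j = 0; j < M; j++)
      for (int i = 0; i < N; i++)
         B[j][i] = j + 0.1f * i;

   /* original loop nest */
   for (int j = 1; j < M; j++)
      for (int i = 1; i < N; i++)
         D1[i] = D1[i] + B[j][i];

   /* tiled: strip-mine i with strip size s, then permute ii outward */
   int s = 4;
   for (int ii = 1; ii < N; ii += s)
      for (int j = 1; j < M; j++)
         for (int i = ii; i < imin(ii + s, N); i++)
            D2[i] = D2[i] + B[j][i];

   for (int i = 0; i < N; i++)
      printf("%d: %f %f\n", i, D1[i], D2[i]);   /* the two columns should match */
   return 0;
}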




A Few Words On Tiling
•  Tiling can be used hierarchically to compute partial results on a block of data wherever there are capacity limitations
   –  Between grids if data exceeds global memory capacity
   –  Across thread blocks if shared data exceeds shared memory capacity
   –  Within threads if data in constant cache exceeds cache capacity

Matrix Multiplication: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}
[Figure: P[i][j] is the dot product of row i of M and column j of N; all three matrices are WIDTH x WIDTH]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign


Tiled Multiply Using Thread Blocks
•  One block computes one square sub-matrix Psub of size BLOCK_SIZE
•  One thread computes one element of Psub
•  Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square shape
[Figure: M, N, and P partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; block indices (bx, by) select the Psub tile of P and thread indices (tx, ty) select an element within it]

Shared Memory Usage
•  Assume each SMP has 16KB shared memory
   –  Each Thread Block uses 2*256*4B = 2KB of shared memory.
   –  Can potentially have up to 8 Thread Blocks actively executing
   –  For BLOCK_SIZE = 16, this allows up to 8*512 = 4,096 pending loads
      •  In practice, there will probably be up to half of this due to scheduling to make use of SPs.
   –  The next BLOCK_SIZE, 32, would lead to 2*32*32*4B = 8KB shared memory usage per Thread Block, allowing only up to two Thread Blocks active at the same time
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign




First-order Size Considerations
•  Each Thread Block should have a minimum of 192 threads
   –  BLOCK_SIZE of 16 gives 16*16 = 256 threads
•  A minimum of 32 Thread Blocks
   –  A 1024*1024 P Matrix gives 64*64 = 4096 Thread Blocks
•  Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
   –  Memory bandwidth no longer a limiting factor

CUDA Code – Kernel Execution Configuration

// Setup the execution configuration
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(N.width / dimBlock.x,
             M.height / dimBlock.y);

For very large N and M dimensions, one will need to add another level of blocking and execute the second-level blocks sequentially.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign


CUDA Code – Kernel Overview

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

// Pvalue stores the element of the block sub-matrix
// that is computed by the thread
float Pvalue = 0;

// Loop over all the sub-matrices of M and N
// required to compute the block sub-matrix
for (int m = 0; m < M.width/BLOCK_SIZE; ++m) {
   code from the next few slides };

CUDA Code - Load Data to Shared Memory

// Get a pointer to the current sub-matrix Msub of M
Matrix Msub = GetSubMatrix(M, m, by);

// Get a pointer to the current sub-matrix Nsub of N
Matrix Nsub = GetSubMatrix(N, bx, m);

__shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

// each thread loads one element of the sub-matrix
Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);

// each thread loads one element of the sub-matrix
Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign




CUDA Code - Compute Result

// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
__syncthreads();

// each thread computes one element of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
   Pvalue += Ms[ty][k] * Ns[k][tx];

// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of M and N in the next iteration
__syncthreads();

CUDA Code - Save Result

// Get a pointer to the block sub-matrix of P
Matrix Psub = GetSubMatrix(P, bx, by);

// Write the block sub-matrix to device memory;
// each thread writes one element
SetMatrixElement(Psub, tx, ty, Pvalue);

This code should run at about 45 GFLOPS
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
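Not on the slides: the fragments from the last four slides assembled into one kernel for reference. The Matrix type and the GetSubMatrix / GetMatrixElement / SetMatrixElement helpers are not defined in this lecture, so this sketch replaces them with explicit row-major index arithmetic and assumes square Width x Width matrices with Width a multiple of BLOCK_SIZE.

#define BLOCK_SIZE 16

/* Assembled tiled kernel (sketch): M, N, P are row-major Width x Width
   matrices in global memory; Width is assumed to be a multiple of BLOCK_SIZE. */
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
   int bx = blockIdx.x;  int by = blockIdx.y;
   int tx = threadIdx.x; int ty = threadIdx.y;

   // Row and column of the P element computed by this thread
   int row = by * BLOCK_SIZE + ty;
   int col = bx * BLOCK_SIZE + tx;

   __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
   __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

   // Pvalue stores the element of the block sub-matrix computed by the thread
   float Pvalue = 0;

   // Loop over all the sub-matrices of M and N required for this block
   for (int m = 0; m < Width / BLOCK_SIZE; ++m) {
      // each thread loads one element of Msub and one element of Nsub
      Ms[ty][tx] = M[row * Width + (m * BLOCK_SIZE + tx)];
      Ns[ty][tx] = N[(m * BLOCK_SIZE + ty) * Width + col];
      __syncthreads();      // sub-matrices loaded before computing

      for (int k = 0; k < BLOCK_SIZE; ++k)
         Pvalue += Ms[ty][k] * Ns[k][tx];
      __syncthreads();      // computation done before loading the next tiles
   }

   // each thread writes one element of the block sub-matrix
   P[row * Width + col] = Pvalue;
}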


Matrix Multiply in CUDA
•  Imagine you want to compute extremely large matrices.
   –  That don’t fit in global memory
•  This is where an additional level of tiling could be used, between grids

Summary of Lecture
•  How to place data in constant memory and shared memory
•  Reordering transformations to improve locality
•  Tiling transformation
•  Matrix multiply example




Next Time

•  Complete this example


•  Reasoning about reuse and locality
•  Talk about projects and assign proposal
