CS6963 Lesson 6: Memory Hierarchy I
Portions © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
Overview of Device Memory Spaces

Each thread can:
– Read/write per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory

Hardware view (per multiprocessor):
– Registers: a set of 32-bit registers per processor
– Shared Memory
– Texture Memory
– Global, constant, texture memories
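The slides list these spaces abstractly; below is a minimal sketch (not from the slides) of how each space shows up in CUDA source. The kernel name demo, the array coeff, and the sizes are made up for illustration.

#include <cuda_runtime.h>

#define TILE 256

// Constant memory: read-only from kernels, written by the host.
__constant__ float coeff[4];

__global__ void demo(const float* gIn, float* gOut)   // gIn/gOut live in global memory
{
    // Shared memory: on-chip, visible to all threads in the same block.
    __shared__ float tile[TILE];

    int t = threadIdx.x;                               // per-thread scalars live in registers
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = gIn[i];                                  // global -> shared
    __syncthreads();

    gOut[i] = coeff[0] * tile[t];                      // constant + shared -> global
}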
Terminology Review

• Device = GPU = set of multiprocessors
• Multiprocessor = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory    Location   Cached           Access       Who
Local     Off-chip   No               Read/write   One thread
Shared    On-chip    N/A - resident   Read/write   All threads in a block
Global    Off-chip   No               Read/write   All threads + host
Constant  Off-chip   Yes              Read         All threads + host
Texture   Off-chip   Yes              Read         All threads + host

Access Times

• Register – dedicated HW – single cycle
• Constant and Texture caches – possibly single cycle, proportional to addresses accessed by warp
• Shared Memory – dedicated HW – single cycle
• Local Memory – DRAM, no cache – *slow*
• Global Memory – DRAM, no cache – *slow*
• Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached
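A hedged host-side counterpart to the earlier sketch, showing who writes each space in the table: the host allocates and fills global memory with cudaMalloc/cudaMemcpy and writes constant memory with cudaMemcpyToSymbol, while kernels only read the constant data. The names demo, coeff, and TILE carry over from the earlier sketch and are not from the slides.

int main(void)
{
    const int n = 1024;                                // 4 blocks of TILE threads
    size_t bytes = n * sizeof(float);
    float hIn[1024], hOut[1024], hCoeff[4] = { 2.0f, 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < n; i++) hIn[i] = (float)i;

    float *dIn, *dOut;                                 // global memory: read/write, all threads + host
    cudaMalloc((void**)&dIn, bytes);
    cudaMalloc((void**)&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    // constant memory: written by the host, read (and cached) by all threads
    cudaMemcpyToSymbol(coeff, hCoeff, sizeof(hCoeff));

    demo<<<n / TILE, TILE>>>(dIn, dOut);
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}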
Temporal Reuse

• Same data used in distinct iterations I and I'

  for (i=1; i<N; i++)
    for (j=1; j<N; j++)
      A[j] = A[j] + A[j+1] + A[j-1];

• A[j] has self-temporal reuse in loop i

Spatial Reuse

• Same data transfer (usually cache line) used in distinct iterations I and I'

  for (i=1; i<N; i++)
    for (j=1; j<N; j++)
      A[j] = A[j] + A[j+1] + A[j-1];

• A[j] has self-spatial reuse in loop j
• Multi-dimensional array note: C arrays are stored in row-major order
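To make the row-major note concrete, here is a small, hypothetical C example (not from the slides) contrasting two traversal orders of a 2-D array; the inner-loop choice determines whether consecutive iterations reuse the same cache line.

#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double A[ROWS][COLS];

int main(void)
{
    double sum = 0.0;

    // Good spatial reuse: inner loop walks j, so consecutive iterations
    // touch adjacent elements of the same row (row-major layout).
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += A[i][j];

    // Poor spatial reuse: inner loop walks i, so consecutive iterations
    // are a full row (COLS doubles) apart in memory.
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += A[i][j];

    printf("%f\n", sum);
    return 0;
}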
Loop Permutation: A Reordering Transformation

Permute the order of the loops to modify the traversal order:

  for (i=0; i<3; i++)                 for (j=0; j<6; j++)
    for (j=0; j<6; j++)                 for (i=0; i<3; i++)
      A[i][j+1] = A[i][j] + B[j];         A[i][j+1] = A[i][j] + B[j];

The permuted nest visits the (i, j) iteration space in a new traversal order. Which one is better for row-major storage?

Safety of Permutation

• Intuition: Cannot permute two loops i and j in a loop nest if doing so reverses the direction of any dependence.
• Loops i through j of an n-deep loop nest are fully permutable if, for all dependences D, either (d1, …, di-1) > 0 or, for all k with i ≤ k ≤ j, dk ≥ 0.
• Stated without proof: Within the affine domain, the n-1 inner loops of an n-deep loop nest can be transformed to be fully permutable.
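As a sanity check on the safety rule, the following hypothetical C program (not from the slides) runs both loop orders on separate copies of the array; the only dependence has distance (0,1), which becomes (1,0) after interchange and stays lexicographically positive, so the two versions must agree.

#include <stdio.h>
#include <string.h>

int main(void)
{
    float A1[3][7], A2[3][7], B[6];
    for (int j = 0; j < 6; j++) B[j] = (float)(j + 1);
    memset(A1, 0, sizeof A1);
    memset(A2, 0, sizeof A2);

    // original order: i outer, j inner
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 6; j++)
            A1[i][j+1] = A1[i][j] + B[j];

    // permuted order: j outer, i inner
    for (int j = 0; j < 6; j++)
        for (int i = 0; i < 3; i++)
            A2[i][j+1] = A2[i][j] + B[j];

    // the interchange does not reverse the dependence direction,
    // so the results must match
    printf("results match: %d\n", memcmp(A1, A2, sizeof A1) == 0);
    return 0;
}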
Simple Examples: 2-D Loop Nests

  for (i=0; i<3; i++)                 for (i=0; i<3; i++)
    for (j=0; j<6; j++)                 for (j=0; j<6; j++)
      A[i][j+1] = A[i][j] + B[j];         A[i+1][j-1] = A[i][j] + B[j];

• Distance vectors
• Ok to permute?

Tiling (Blocking): Another Loop Reordering Transformation

• Blocking reorders loop iterations to bring iterations that reuse data closer in time
Tiling Example

  for (j=1; j<M; j++)
    for (i=1; i<N; i++)
      D[i] = D[i] + B[j][i];

Strip-mine:
  for (j=1; j<M; j++)
    for (ii=1; ii<N; ii+=s)
      for (i=ii; i<min(ii+s,N); i++)
        D[i] = D[i] + B[j][i];

Permute:
  for (ii=1; ii<N; ii+=s)
    for (j=1; j<M; j++)
      for (i=ii; i<min(ii+s,N); i++)
        D[i] = D[i] + B[j][i];

Legality of Tiling

• Tiling = strip-mine and permutation
  – Strip-mine does not reorder iterations
  – Permutation must be legal
  OR
  – strip size less than dependence distance
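A runnable version of the tiled loop nest above, as a rough sketch with made-up sizes (N, M, and strip size s); it checks that the strip-mined and permuted form produces the same result as the original nest.

#include <stdio.h>

#define N 8
#define M 4

static int min(int a, int b) { return a < b ? a : b; }

int main(void)
{
    float D[N], Dt[N], B[M][N];
    for (int i = 0; i < N; i++) { D[i] = Dt[i] = 0.0f; }
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            B[j][i] = (float)(j * N + i);

    int s = 3;   // strip (tile) size, chosen arbitrarily

    // original loop nest
    for (int j = 1; j < M; j++)
        for (int i = 1; i < N; i++)
            D[i] = D[i] + B[j][i];

    // strip-mined and permuted (tiled) version
    for (int ii = 1; ii < N; ii += s)
        for (int j = 1; j < M; j++)
            for (int i = ii; i < min(ii + s, N); i++)
                Dt[i] = Dt[i] + B[j][i];

    // each D[i] is updated in the same j order in both versions,
    // so the tiled nest must compute identical sums
    int same = 1;
    for (int i = 0; i < N; i++) same = same && (D[i] == Dt[i]);
    printf("tiled version matches: %d\n", same);
    return 0;
}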
A Few Words On Tiling

• Tiling can be used hierarchically to compute partial results on a block of data wherever there are capacity limitations
  – Between grids if data exceeds global memory capacity
  – Across thread blocks if shared data exceeds cache capacity

Matrix Multiplication: A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}
Size Considerations

• Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square in shape.
• With BLOCK_SIZE = 16, each thread block uses 2*16*16*4B = 2KB of shared memory (one tile each for M and N), so with 16KB of shared memory per multiprocessor up to 8 thread blocks can be actively executing.
  – For BLOCK_SIZE = 16, this allows up to 8*512 = 4,096 pending loads.
  – In practice, there will probably be up to half of this due to scheduling to make use of SPs.
  – The next BLOCK_SIZE, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks active at the same time.
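The arithmetic above can be sketched as a small, hypothetical C program (not from the slides); it only models the shared-memory limit, assumes a G80-era 16KB of shared memory per SM, and ignores the hardware's other occupancy limits (e.g., the cap on resident blocks per SM).

#include <stdio.h>

int main(void)
{
    int smem_per_sm = 16 * 1024;                   // assumed 16KB shared memory per SM
    for (int bs = 8; bs <= 32; bs *= 2) {
        // two tiles (Ms and Ns) of bs x bs floats per thread block
        int smem_per_block = 2 * bs * bs * 4;
        int blocks_per_sm  = smem_per_sm / smem_per_block;
        // two global loads per thread per tile phase, bs*bs threads per block
        int pending_loads  = blocks_per_sm * 2 * bs * bs;
        printf("BLOCK_SIZE=%2d: %5dB smem/block, %2d blocks/SM, %5d pending loads\n",
               bs, smem_per_block, blocks_per_sm, pending_loads);
    }
    return 0;
}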
// Pvalue stores the element of the block sub-matrix
// that is computed by the thread
float Pvalue = 0;

// Loop over all the sub-matrices of M and N
// required to compute the block sub-matrix
for (int m = 0; m < M.width/BLOCK_SIZE; ++m) {
    // code from the next few slides
};

__shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

// each thread loads one element of the sub-matrix
Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);

// each thread loads one element of the sub-matrix
Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);
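Putting the fragments together, here is a hedged end-to-end sketch of the tiled kernel they come from, written with plain pointer arguments and explicit indexing instead of the slides' Matrix struct and GetMatrixElement helper; Width is assumed to be a multiple of BLOCK_SIZE.

#define BLOCK_SIZE 16

// One thread computes one element of P; each block computes one
// BLOCK_SIZE x BLOCK_SIZE tile of P by marching over tiles of M and N.
__global__ void MatrixMulTiled(const float* M, const float* N, float* P, int Width)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * BLOCK_SIZE + ty;
    int col = blockIdx.x * BLOCK_SIZE + tx;

    float Pvalue = 0;
    for (int m = 0; m < Width / BLOCK_SIZE; ++m) {
        // each thread loads one element of each sub-matrix into shared memory
        Ms[ty][tx] = M[row * Width + (m * BLOCK_SIZE + tx)];
        Ns[ty][tx] = N[(m * BLOCK_SIZE + ty) * Width + col];
        __syncthreads();

        // multiply the two tiles out of shared memory
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }
    P[row * Width + col] = Pvalue;
}

// launch configuration (Width assumed to be a multiple of BLOCK_SIZE):
//   dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
//   dim3 dimGrid(Width / BLOCK_SIZE, Width / BLOCK_SIZE);
//   MatrixMulTiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);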
Tiling Between Grids

• Imagine you want to compute extremely large matrices.
  – That don't fit in global memory
• This is where an additional level of tiling could be used, between grids.

Summary of Lecture

• How to place data in constant memory and shared memory
• Reordering transformations to improve locality
• Tiling transformation
• Matrix multiply example
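A rough host-side sketch (not from the slides) of what tiling "between grids" could look like: the host streams one chunk of a data set that is too large for global memory onto the device, launches a grid per chunk, and copies results back. The chunk size and the processChunk kernel are hypothetical placeholders.

// Hypothetical per-chunk kernel; the real computation would go here.
__global__ void processChunk(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                       // placeholder work
}

void processLargeArray(float* h_data, long total, int chunkElems)
{
    float* d_chunk;
    cudaMalloc((void**)&d_chunk, chunkElems * sizeof(float));

    for (long base = 0; base < total; base += chunkElems) {
        int n = (total - base < chunkElems) ? (int)(total - base) : chunkElems;

        // stage one tile of the problem into global memory
        cudaMemcpy(d_chunk, h_data + base, n * sizeof(float), cudaMemcpyHostToDevice);

        // one grid per chunk; 256 threads per block is an arbitrary choice
        processChunk<<<(n + 255) / 256, 256>>>(d_chunk, n);

        // bring this chunk's results back before reusing the device buffer
        cudaMemcpy(h_data + base, d_chunk, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_chunk);
}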