Administrative
• Homework #2 posted on website
• Project proposals
CS6963
L7: Memory Hierarchy II
2/11/09
Loop Permutation: A Reordering Transformation
• Permute the order of the loops to modify the traversal order

  for (i=0; i<3; i++)              for (j=0; j<6; j++)
    for (j=0; j<6; j++)              for (i=0; i<3; i++)
      A[i][j+1]=A[i][j]+B[j];          A[i][j+1]=A[i][j]+B[j];

• New traversal order!
• Which one is better for row-major storage?

[Figure: i-j iteration-space diagrams of the two traversal orders]

Safety of Permutation
• Intuition: Cannot permute two loops i and j in a loop nest if doing so reverses the direction of any dependence.
• Loops i through j of an n-deep loop nest are fully permutable if, for all dependences D, either (d1, …, di-1) > 0 or, for all k with i ≤ k ≤ j, dk ≥ 0.
• Stated without proof: Within the affine domain, the n-1 inner loops of an n-deep loop nest can be transformed to be fully permutable.
CS6963
CS6963
L7:
Memory
Hierarchy
II
L7:
Memory
Hierarchy
II
Simple Examples: 2-D Loop Nests

  for (i=0; i<3; i++)              for (i=0; i<3; i++)
    for (j=0; j<6; j++)              for (j=0; j<6; j++)
      A[i][j+1]=A[i][j]+B[j];          A[i+1][j-1]=A[i][j]+B[j];

• Distance vectors?
• OK to permute?

[Figure: I-J iteration-space diagrams with dependence arrows]

Tiling (Blocking): Another Loop Reordering Transformation
• Blocking reorders loop iterations to bring iterations that reuse data closer in time
Tiling Example

  for (j=1; j<M; j++)
    for (i=1; i<N; i++)
      D[i] = D[i] + B[j][i];

Strip-mine:

  for (j=1; j<M; j++)
    for (ii=1; ii<N; ii+=s)
      for (i=ii; i<min(ii+s,N); i++)
        D[i] = D[i] + B[j][i];

Permute:

  for (ii=1; ii<N; ii+=s)
    for (j=1; j<M; j++)
      for (i=ii; i<min(ii+s,N); i++)
        D[i] = D[i] + B[j][i];

Legality of Tiling
• Tiling = strip-mine and permutation
  – Strip-mine does not reorder iterations
  – Permutation must be legal, OR
  – strip size less than dependence distance
• Tiling can be used hierarchically to compute partial results on a block of data wherever there are capacity limitations
  – Between grids if data exceeds global memory capacity
  – Across thread blocks if shared data exceeds shared memory capacity
  – Within threads if data in constant cache exceeds cache capacity
  – Special form (unroll-and-jam) used for registers

• Reuse analysis can be formulated in a manner similar to dependence analysis
  – Particularly true for temporal reuse
  – Spatial reuse requires special handling of the most quickly varying dimension (still ignoring)
• Simplification for today’s lecture
  – Estimate data footprint for innermost loop for different scenarios
  – Select scenario that minimizes footprint
2/11/09
DO I = 1, N DO K = 1, N by TK
DO J = 1, N DO I = 1, N by TI
DO K = 1, N DO J = 1, N
C(I,J)= C(I,J) + A(I,K) * B(K,J) DO KK = K, min(KK+ TK,N)
DO II = I, min(II+ TI,N)
C(II,J)= C(II,J)+A(II,KK)*B(KK,J)
BK
• CM(I) = 2N3/cls + N2
• CM(J) = 2N3 + N2 BI
• CM(K) = N3 + N3/cls + N2
• Ordering by innermost loop cost: (J, K, I)
C
A
B
• Scalar replacement
  – May be followed by replacing array references with scalar variables to help compiler identify register opportunities
• In other architectures with deep instruction pipelines, this optimization can also be used to expose instruction-level parallelism.
Scalar Replacement

  DO K = 1, N by TK
    DO I = 1, N by 4
      DO J = 1, N
        C1 = C(I,J)
        C2 = C(I+1,J)
        C3 = C(I+2,J)
        C4 = C(I+3,J)
        DO KK = K, min(K+TK-1, N)
          C1 = C1 + A(I,KK)*B(KK,J)
          C2 = C2 + A(I+1,KK)*B(KK,J)
          C3 = C3 + A(I+2,KK)*B(KK,J)
          C4 = C4 + A(I+3,KK)*B(KK,J)

Now C accesses are to named registers. Compiler guaranteed to map to registers.

Matrix Multiplication: A Simple Host Version in C

  // Matrix multiplication on the (CPU) host in double precision
  void MatrixMulOnHost(float* M, float* N, float* P, int Width)
  {
    for (int i = 0; i < Width; ++i)
      for (int j = 0; j < Width; ++j) {
        double sum = 0;
        for (int k = 0; k < Width; ++k) {
          double a = M[i * Width + k];
          double b = N[k * Width + j];
          sum += a * b;
        }
        P[i * Width + j] = sum;
      }
  }

[Figure: row i of M and column j of N combine into element (i,j) of P; each matrix is WIDTH × WIDTH]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE498AL, University of Illinois, Urbana-Champaign
• Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square shape
  – Each thread block uses 2*256*4B = 2KB of shared memory.
  – Can potentially have up to 8 thread blocks actively executing
  – For BLOCK_SIZE = 16, this allows up to 8*512 = 4,096 pending loads
    • In practice, there will probably be up to half of this due to scheduling to make use of SPs.
  – The next BLOCK_SIZE, 32, would lead to 2*32*32*4B = 8KB shared memory usage per thread block, allowing only up to two thread blocks active at the same time

[Figure: block decomposition of P into BLOCK_SIZE × BLOCK_SIZE sub-matrices Psub, indexed by block (by, bx) and thread (ty, tx)]
  // Pvalue stores the element of the block sub-matrix
  // that is computed by the thread
  float Pvalue = 0;

  // Loop over all the sub-matrices of M and N
  // required to compute the block sub-matrix
  for (int m = 0; m < M.width/BLOCK_SIZE; ++m) {
    // code from the next few slides
  };

  __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
  __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

  // each thread loads one element of the sub-matrix
  Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);
  // each thread loads one element of the sub-matrix
  Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);