Shared Memory
David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012
// each thread computes one element of the block sub-matrix
float Pvalue = 0;
[Figure: CUDA device memory model — a Grid contains Block (0, 0) and Block (1, 0); each block has its own Shared Memory and per-thread Registers, and all threads can access Global Memory and Constant Memory.]
[Figure: Thread 1 and Thread 2 each fetching the same input directly from Global Memory, versus fetching it from On-chip Memory after a single global load.]
[Figures: carpooling analogy — two workers can carpool only when their daily schedules (sleep, work, dinner, party) line up in time; with mismatched schedules carpooling fails. Likewise, Thread 1 and Thread 2 must follow similar execution timing to share data in on-chip memory.]
Outline of Technique
Identify a block/tile of global memory content that is accessed by multiple threads
Load the block/tile from global memory into on-chip memory
Have the multiple threads access their data from the on-chip memory
Move on to the next block/tile
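The four steps above can be sketched as a generic CUDA pattern. This is illustrative, not from the slides: the kernel name, the 1-D tile, and the placeholder computation are assumptions; only the load / __syncthreads() / compute / __syncthreads() shape is the technique itself.

```cuda
#define TILE_WIDTH 16

// Generic shape of the technique: threads in a block stage one tile of
// global data in shared memory, compute from the fast on-chip copy,
// then move on to the next tile.
__global__ void tiledPattern(const float *in, float *out, int Width)
{
    __shared__ float tile[TILE_WIDTH];        // on-chip storage for one tile

    for (int t = 0; t < Width / TILE_WIDTH; ++t) {
        int idx = t * TILE_WIDTH + threadIdx.x;
        tile[threadIdx.x] = in[idx];          // step 2: cooperative load
        __syncthreads();                      // tile fully loaded

        // step 3: multiple threads reuse the on-chip data
        out[idx] = tile[threadIdx.x]
                 + tile[(threadIdx.x + 1) % TILE_WIDTH];
        __syncthreads();                      // step 4: move to the next tile
    }
}
```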
Tiled algorithms
[Figure: a WIDTH × WIDTH matrix partitioned into tiles; tx and ty index a thread's position within a tile.]
[Figure: index calculation for 2 × 2 blocks — Col = blockIdx.x * blockDim.x + threadIdx.x and Row = blockIdx.y * blockDim.y + threadIdx.y; for block (0, 0), Col = 0 * 2 + threadIdx.x and Row = 0 * 2 + threadIdx.y, giving Row, Col ∈ {0, 1}.]
Tiled Multiply
[Figure: each block computes one TILE_WIDTH × TILE_WIDTH sub-matrix Pdsub of Pd from tiles of Md and Nd; bx, by index the block and tx, ty (0, 1, 2, ..., TILE_WIDTH-1) index the thread within it; all matrices are WIDTH × WIDTH.]
Loading a Tile
All threads in a block participate
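Inside the tiled kernel, the cooperative load might look like the fragment below. This is a sketch, not a complete kernel: Mds, Nds follow the slides' naming, while d_M, d_N, Row, Col, Width, ty, tx, and the phase counter m are assumed to be defined by the surrounding kernel.

```cuda
// Each of the TILE_WIDTH x TILE_WIDTH threads loads exactly one element
// of the current M tile and one element of the current N tile.
__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

Mds[ty][tx] = d_M[Row * Width + m * TILE_WIDTH + tx];    // one M element
Nds[ty][tx] = d_N[(m * TILE_WIDTH + ty) * Width + Col];  // one N element
__syncthreads();  // the tile is not usable until all threads have loaded
```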
[Figure sequence: phase-by-phase loading of 2 × 2 tiles — elements M0,0..M1,1 and N0,0..N1,1 are copied from global memory into shared memory (SM), one tile pair per phase.]
Barrier Synchronization
__syncthreads() is an API function call in CUDA: every thread in a block waits at the call until all threads in the block have reached it.
[Figure: Thread 0 through Thread N-1 arrive at the barrier at different times; none proceeds until the last thread arrives.]
Figure 4.11 An example execution timing of barrier synchronization.
[Figure (the Tiled Multiply diagram, repeated with index annotations): Row = by * TILE_WIDTH + ty; block (bx, by) and thread (tx, ty) locate the TILE_WIDTH × TILE_WIDTH sub-matrix Pdsub of Pd within the WIDTH × WIDTH matrices Md and Nd.]
[Figure: converting the 2-D tile accesses into 1-D (row-major) indices into d_M, d_N, d_P, all of width Width —
M[Row][m*TILE_WIDTH+tx]  becomes  d_M[Row*Width + m*TILE_WIDTH + tx]
N[m*TILE_WIDTH+ty][Col]  becomes  d_N[(m*TILE_WIDTH+ty) * Width + Col]]
// each thread computes one element of the block sub-matrix
float Pvalue = 0;
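Putting the fragments together, a complete tiled kernel consistent with the slides' naming (d_M, d_N, d_P, Mds, Nds, Pvalue) looks like the sketch below; the exact listing on the slides is not reproduced here, and Width is assumed to be a multiple of TILE_WIDTH:

```cuda
#define TILE_WIDTH 16

// Tiled matrix multiplication: d_P = d_M * d_N, all Width x Width,
// stored as row-major 1-D arrays.
__global__ void MatrixMulKernel(float *d_M, float *d_N, float *d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x,  by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;

    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    // each thread computes one element of the block sub-matrix
    float Pvalue = 0;

    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // collaborative loading of one Md tile and one Nd tile
        Mds[ty][tx] = d_M[Row * Width + m * TILE_WIDTH + tx];
        Nds[ty][tx] = d_N[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();          // tile fully loaded before anyone computes

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();          // tile fully used before anyone reloads
    }
    d_P[Row * Width + Col] = Pvalue;
}
```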
Device Query
Number of devices in the system:
    int dev_count;
    cudaGetDeviceCount(&dev_count);
Capability of devices:
    cudaDeviceProp dev_prop;
    for (int i = 0; i < dev_count; i++) {
        cudaGetDeviceProperties(&dev_prop, i);
        // decide if device has sufficient resources and capabilities
    }
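The fragments above can be assembled into a complete host program. The printed fields (name, multiProcessorCount, maxThreadsPerBlock, sharedMemPerBlock) are standard cudaDeviceProp members, chosen here as examples; which properties matter depends on the kernel's resource needs:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Complete device-query sketch built from the slide's fragments.
int main(void)
{
    int dev_count;
    cudaGetDeviceCount(&dev_count);
    printf("%d CUDA device(s) found\n", dev_count);

    cudaDeviceProp dev_prop;
    for (int i = 0; i < dev_count; i++) {
        cudaGetDeviceProperties(&dev_prop, i);
        // decide if the device has sufficient resources and capabilities
        printf("device %d: %s, %d SMs, %d max threads/block, %zu B shared mem/block\n",
               i, dev_prop.name, dev_prop.multiProcessorCount,
               dev_prop.maxThreadsPerBlock, dev_prop.sharedMemPerBlock);
    }
    return 0;
}
```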