RULE #1: DON'T TRY TOO HARD

[Figure: performance vs. time curves, built up over several slides. Annotations: "Peak Performance", "Unrealistic Effort/Reward", "Get on this curve", "Here be ninjas", "Hire an intern", "Premature excitement", "Point of diminishing returns".]
PERFORMANCE CONSTRAINTS

[Pie chart of performance constraints: Memory 75%, Compute Intensity 10%, Occupancy 10%, Divergence 3%, Instruction 2%.]

[Chart breaking down the memory slice: Occupancy, Coalescence, Divergent Access, Cache Inefficiency, Register Spilling.]
MEMORY ORDERS OF MAGNITUDE

TALK BREAKDOWN
In no particular order

WHERE TO BEGIN?
THE OBVIOUS

PCI ISSUES

[Diagram: GPUs (registers, shared memory) attached to the CPU over the PCIe bus; copying data across the bus runs at ~16 GB/sec.]
PIN YOUR CPU MEMORY

[Diagram sequence: the DMA controller moves data from CPU memory to the GPU. With pageable memory the OS can swap pages out from under the transfer; pinning the pages lets the DMA engine copy directly.]
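A minimal sketch of the pinned-allocation pattern (buffer names and sizes are illustrative, not from the slides):

```cuda
// Sketch: pinned (page-locked) host memory vs. pageable memory.
// cudaMallocHost pins the pages so the DMA engine can read them
// directly, which also makes cudaMemcpyAsync truly asynchronous.
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 26;        // 64 MB, example size
    float *h_pinned, *d_buf;
    cudaMallocHost(&h_pinned, bytes);    // page-locked host allocation
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // With a pageable buffer the runtime stages through an internal
    // pinned buffer and the "async" copy effectively blocks.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```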
REMEMBER: PCIe GOES BOTH WAYS
PCIe is full duplex: uploads and downloads can overlap.
STREAMS & CONCURRENCY
Hiding the cost of data transfer

[Timeline diagram: in a single stream, copy-up, work, and copy-back run back-to-back; splitting the work across streams overlaps them.]

Can keep on breaking work into smaller chunks and saving time

[Chart: total time with 8 streams vs. 2 streams vs. 1 stream.]

SMALL PCIe TRANSFERS

[Chart: 8 / 2 / 1 streams; annotation: "Too many".]

APPARENTLY NOT THAT SMALL
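The chunking idea can be sketched as follows (kernel name, chunk math, and the cap of 8 streams are assumptions; the host buffer must be pinned for the copies to overlap):

```cuda
// Sketch: split one large transfer+kernel into chunks across streams
// so copy-up, compute, and copy-back overlap in the hardware queues.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;             // stand-in for real work
}

void run_chunked(float *h, float *d, int n, int nStreams) {
    // Assumes nStreams <= 8 and n divisible by nStreams, and that
    // h points to pinned memory (cudaMallocHost).
    int chunk = n / nStreams;
    cudaStream_t s[8];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);
    for (int i = 0; i < nStreams; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
}
```

As the charts above suggest, more streams help only up to a point: each chunk must still be large enough to amortize per-transfer overhead.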
FROM GPU MEMORY TO GPU THREADS

FEEDING THE MACHINE

[Diagram: data flows from device memory through L2 cache lines into each SM's registers and shared memory.]
VECTORIZE MEMORY LOADS
Multi-Word as well as Multi-Thread

[Diagram: with int, one transaction serves threads T0-T31; with int2, two transactions serve T0-T15 and T16-T31; with int4, four transactions serve eight threads each (T0-T7, T8-T15, T16-T23, T24-T31).]
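A minimal sketch of a vectorized load (kernel name assumed; `n4` is the element count in float4 units):

```cuda
// Sketch: load 4 floats per thread via float4, so each thread issues
// one 16-byte load instead of four scalar 4-byte loads.
__global__ void copy_vec4(const float4 *in, float4 *out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];   // single wide load and store
}
```

This assumes the buffers are 16-byte aligned, which cudaMalloc guarantees for the allocation base.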
DO MULTIPLE LOADS PER THREAD
Multi-Thread, Multi-Word AND Multi-Iteration
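The common way to get multiple loads per thread is a grid-stride loop (a sketch; the SAXPY body is illustrative):

```cuda
// Sketch: grid-stride loop. Each thread walks the array in steps of
// the whole grid, amortizing index arithmetic and keeping several
// independent loads in flight per thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}
```

A side benefit is that the same kernel works for any problem size and any launch configuration.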
COALESCED MEMORY ACCESS
It's not enough just to use all SIMT threads

[Diagram — Uncoalesced: sequential memory accesses are unassociated (threads touch addresses in order 3, 1, 4, 2), so one warp access splits into multiple transactions.]

SIMT PENALTIES WHEN NOT COALESCED

    x = data[threadIdx.x]   // coalesced: adjacent threads, adjacent words
    x = data[rand()]        // uncoalesced: random access pattern
SCATTER & GATHER

[Diagram: gathering reads from scattered source locations into consecutive threads; scattering writes from consecutive threads to scattered destinations (elements 1 2 3 4 mapped to positions 3, 1, 4, 2).]

AVOID SCATTER/GATHER IF YOU CAN
SORTING MIGHT BE AN OPTION

[Diagram: accessing 2 4 1 3 in place is slow; sorting to 1 2 3 4 first makes the accesses fast.]

PRE-SORTING TURNS OUT TO BE GOOD
DATA LAYOUT: "AOS vs. SOA"
Sometimes you can't just sort your data

Array-of-Structures vs. Structure-of-Arrays, both sized with:

    #define NPTS (1024 * 1024)
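A sketch of the two layouts (struct and field names are assumed, not from the slides):

```cuda
// Sketch: the same point data in the two layouts.
#define NPTS (1024 * 1024)

// Array-of-Structures: one struct per point. A warp reading only .x
// strides by sizeof(PointAoS), wasting most of each cache line.
struct PointAoS {
    float x, y, z;
};
// PointAoS points[NPTS];          // points[i].x

// Structure-of-Arrays: each field is contiguous. A warp reading x[i]
// for consecutive i touches consecutive addresses and coalesces.
struct PointsSoA {
    float x[NPTS];
    float y[NPTS];
    float z[NPTS];
};
// PointsSoA points;               // points.x[i]
```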
SOA: STRIDED ARRAY ACCESS
GPU reads data one element at a time, but in parallel by 32 threads in a warp

[Conceptual layout diagram]

AOS: COALESCED BUT COMPLEX
GPU reads data one element at a time, but in parallel by 32 threads in a warp

[Conceptual layout diagram]

    double u0 = gridData.u[0][threadIdx.x];
BLOCK-WIDE LOAD VIA SHARED MEMORY
Read data linearly as bytes; use shared memory to convert to structs

[Diagram: device memory read linearly into shared memory, then reinterpreted as structs.]
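One way this staging can look (struct, block size of 256, and kernel name are all assumptions for the sketch):

```cuda
// Sketch: stage a block's worth of structs through shared memory.
// Threads read the raw words linearly (coalesced), synchronize, then
// each thread picks up its own struct from shared memory.
struct Cell { float a, b, c, d; };        // assumed 16-byte struct

__global__ void load_via_shmem(const Cell *in, Cell *out, int n) {
    __shared__ Cell tile[256];            // assumes blockDim.x == 256
    const float *src = reinterpret_cast<const float *>(in + blockIdx.x * 256);
    float *dst = reinterpret_cast<float *>(tile);

    // 256 structs * 4 floats = 1024 floats, read linearly by the block.
    for (int i = threadIdx.x; i < 256 * 4; i += blockDim.x)
        dst[i] = src[i];
    __syncthreads();

    int gid = blockIdx.x * 256 + threadIdx.x;
    if (gid < n) out[gid] = tile[threadIdx.x];
}
```

The device-memory traffic is fully coalesced regardless of the struct layout; the shuffling happens in fast shared memory.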
CLEVER AOS/SOA TRICKS
Helps for any data size

HANDY LIBRARY TO HELP YOU
(AB)USING THE CACHE

MAKING THE MOST OF L2-CACHE

[Diagram: ~2,000 GB/sec from L2 cache vs. ~300 GB/sec from device memory.]
TRAINING DEEP NEURAL NETWORKS

LOTS OF PASSES OVER DATA

[Diagram: the input runs through a 3x3 convolution (weights W1), a 5x5 FFT convolution (W2), and a 7x7 convolution (W3), producing the classification: "Cat!"]

MULTI-RESOLUTION CONVOLUTIONS
Pass 1: 3x3
Pass 2: 5x5
Pass 3: 7x7

TILED, MULTI-RESOLUTION CONVOLUTION
Pass 1: 3x3
Pass 2: 5x5
Pass 3: 7x7
60
SHARED MEMORY: DEFINITELY WORTH IT
61
USING SHARED MEMORY WISELY
Shared memory arranged into “banks” for concurrent SIMT access
▪ 32 threads can read simultaneously so long as into separate banks
62
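The classic illustration of the bank rule is padding a shared tile so column accesses don't conflict (a sketch; assumes 32x32 thread blocks and a width that is a multiple of 32):

```cuda
// Sketch: avoid bank conflicts on column reads by padding the tile.
// With 32 banks of 4 bytes, tile[32][32] puts each column in a single
// bank, so a warp reading a column serializes 32 ways. tile[32][33]
// shifts every row by one bank, so the same read hits 32 banks.
__global__ void transpose_tile(const float *in, float *out, int width) {
    __shared__ float tile[32][33];        // +1 column of padding
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;    // transposed block origin
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```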
STENCIL ALGORITHM
Many algorithms have high data re-use: potentially good for shared memory
Adjacent threads share (W-1) items of data, giving good potential for re-use

STENCILS IN SHARED MEMORY
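A 1D version of the shared-memory stencil might look like this (radius, block size of 256, and the averaging body are assumptions):

```cuda
// Sketch: 1D stencil of width W = 2*RADIUS + 1. Each block stages its
// tile plus halo cells into shared memory, so each input element is
// read from device memory once instead of W times.
#define RADIUS 3
__global__ void stencil1d(const float *in, float *out, int n) {
    __shared__ float s[256 + 2 * RADIUS];     // assumes blockDim.x == 256
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x + RADIUS;

    if (g < n) s[l] = in[g];
    if (threadIdx.x < RADIUS) {               // first RADIUS threads load halos
        s[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        int r = g + blockDim.x;
        s[l + blockDim.x] = (r < n) ? in[r] : 0.0f;
    }
    __syncthreads();

    if (g < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) acc += s[l + k];
        out[g] = acc / (2 * RADIUS + 1);
    }
}
```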
SIZE MATTERS

PERSISTENT KERNELS
Revisiting the tiled convolutions
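The persistent-kernel shape can be sketched as follows (kernel name and tile loop are assumptions; the idea is to launch just enough blocks to fill the machine once and loop inside the kernel):

```cuda
// Sketch: a "persistent" kernel. Instead of one short-lived block per
// tile, launch roughly (#SMs * blocks-per-SM) blocks and have each
// block loop over tiles, keeping shared-memory state live across what
// would otherwise be separate kernel launches.
__global__ void persistent_tiles(const float *in, float *out, int nTiles) {
    for (int t = blockIdx.x; t < nTiles; t += gridDim.x) {
        // load tile t into shared memory, process it, write results
        // (tile body elided in this sketch)
    }
}
```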
OPERATING DIRECTLY FROM CPU MEMORY
Can save memory copies. It's obvious when you think about it ...

[Diagram: the GPU computes while writing results straight back to CPU memory.]

Uses lots of threads to cover host-memory access latency. Takes advantage of bi-directional PCIe.
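A minimal zero-copy sketch (kernel name and sizes assumed); the host buffer is mapped into the device address space so the kernel accesses it over PCIe with no explicit copies:

```cuda
// Sketch: zero-copy access to pinned, mapped host memory.
#include <cuda_runtime.h>

__global__ void scale_inplace(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;      // each access crosses the PCIe bus
}

int main() {
    int n = 1 << 20;
    float *h, *d;
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d, h, 0);   // device-visible alias of h
    scale_inplace<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();              // results are already in h
    cudaFreeHost(h);
    return 0;
}
```

This only pays off when there is enough parallel work in flight to hide the host-memory latency, as the slide notes.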
OCCUPANCY AND REGISTER LIMITATIONS

Register file is bigger than shared memory and L1 cache!

Occupancy can kill you if you use too many registers.

Often worth forcing fewer registers to allow more blocks per SM:

    __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)

But watch out for math functions! Registers used per call:

    Function   float   double
    log          7       18
    cos         16       28
    acos         6       18
    cosh         7       10
    tan         15       28
    erfc        14       22
    exp          7       10
    log10        6       18
    normcdf     16       26
    cbrt         8       20
    sqrt         6       12
    rsqrt        5       12
    y0          20       30
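In context, the qualifier attaches to a kernel like this (kernel name and body are assumptions; the bounds tell the compiler to cap register use so at least that many blocks fit per SM):

```cuda
// Sketch: constrain register allocation with __launch_bounds__ so the
// compiler targets at least 4 resident blocks per SM at 256 threads
// per block, even if that means spilling in register-hungry math.
__global__ void
__launch_bounds__(256, 4)
heavy_math(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = erfc(cos(in[i]));   // double-precision math costs
                                            // many registers (see table)
}
```

The same effect can be applied compilation-wide with nvcc's -maxrregcount flag, but __launch_bounds__ scopes it to one kernel.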