
CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES
Stephen Jones, GTC 2017

The art of doing more with less

RULE #1: DON’T TRY TOO HARD

[Chart: Performance vs. Time, with a "Peak Performance" ceiling. Annotations along the effort/reward curve: "Unrealistic Effort/Reward", "Premature excitement", "Most people give up here" ("Wait, it's going slower??"), "Trough of despair" ("4 weeks and this is it?"), "Point of diminishing returns", "Here be ninjas", "Hire an intern". The advice: don't waste time on the flat start, get onto the steep part of the curve, and cut short the time spent chasing the last few percent.]
PERFORMANCE CONSTRAINTS

[Pie chart of what typically limits kernel performance:]

Memory               75%
Occupancy            10%
Compute Intensity    10%
Divergence            3%
Instruction           2%

The memory slice breaks down further into:

▪ CPU <> GPU Transfer
▪ Coalescence
▪ Divergent Access
▪ Cache Inefficiency
▪ Register Spilling
MEMORY ORDERS OF MAGNITUDE

Approximate bandwidths along the path from the SM to CPU memory:

Registers / shared memory (SM)    20,000 GB/sec
L2 cache                           2,000 GB/sec
GPU memory (GDRAM)                   300 GB/sec
PCIe bus                              16 GB/sec
CPU memory (DRAM)                    150 GB/sec
TALK BREAKDOWN
In no particular order

1. Why Didn’t I Think Of That?
2. CPU Memory to GPU Memory (the PCIe Bus)
3. GPU Memory to the SM
4. Registers & Shared Memory
5. Occupancy, Divergence & Latency
6. Weird Things You Never Thought Of (and probably shouldn’t try)
WHERE TO BEGIN?

THE OBVIOUS

Start with the NVIDIA Visual Profiler.
CPU <> GPU DATA MOVEMENT

PCIe ISSUES

Moving data over the PCIe bus: at roughly 16 GB/sec it is by far the slowest link between the GPU's registers/shared memory and CPU memory.
PIN YOUR CPU MEMORY

The GPU's DMA controller can only transfer from pinned (page-locked) CPU memory. With an ordinary pageable allocation, the CPU must first allocate and pin a staging page and copy your data into it before the DMA transfer can start (and the OS is free to swap pageable data out in the meantime).

If the memory is user-pinned, the DMA controller reads it directly, with no staging copy:

cudaHostAlloc( &data, size, cudaHostAllocMapped );

cudaHostRegister( &data, size, cudaHostRegisterDefault );
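A minimal sketch (not from the slides; the buffer names and sizes are made up) contrasting a pageable copy with a pinned, asynchronous copy:

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t size = 64 * 1024 * 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, size);

    // Pageable host memory: the driver has to stage it through an internal
    // pinned buffer before the DMA engine can see it.
    float *pageable = (float *)malloc(size);
    cudaMemcpy(d_data, pageable, size, cudaMemcpyHostToDevice);

    // Pinned host memory: the DMA engine reads it directly, and the copy can
    // also run asynchronously with respect to the CPU.
    float *pinned;
    cudaHostAlloc((void **)&pinned, size, cudaHostAllocDefault);
    cudaMemcpyAsync(d_data, pinned, size, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_data);
    return 0;
}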
REMEMBER: PCIe GOES BOTH WAYS

STREAMS & CONCURRENCY
Hiding the cost of data transfer

Operations in a single stream are ordered, but the hardware can copy and compute at the same time.

[Diagram: a single stream runs "Copy data to GPU", then "Compute", then "Copy data to Host" strictly one after another along the time axis.]
STREAMS & CONCURRENCY

[Diagram: with two streams, each stream's "Copy up / Work / Copy back" sequence overlaps with the other's, so the whole job finishes earlier than the single-stream version: time is saved.]
STREAMS & CONCURRENCY

Can keep on breaking work into smaller chunks and saving time.

[Diagram: the same workload run with 8 streams, 2 streams and 1 stream; more streams give more overlap and a shorter total time.]
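A minimal sketch (not from the slides) of a chunked copy/compute/copy pipeline across several streams. The kernel, buffer names and chunk size are hypothetical; h_in and h_out must be pinned for the async copies to overlap with compute.

#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                 // stand-in for real work
}

void pipeline(float *h_in, float *h_out, float *d_buf, int total, int chunk) {
    const int numStreams = 8;
    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; s++) cudaStreamCreate(&streams[s]);

    for (int off = 0, s = 0; off < total; off += chunk, s = (s + 1) % numStreams) {
        int n = (total - off < chunk) ? (total - off) : chunk;
        // Copy up, compute and copy back are ordered within this stream,
        // but overlap with the chunks issued to the other streams.
        cudaMemcpyAsync(d_buf + off, h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf + off, n);
        cudaMemcpyAsync(h_out + off, d_buf + off, n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < numStreams; s++) cudaStreamDestroy(streams[s]);
}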
SMALL PCIe TRANSFERS

PCIe is designed for large data transfers, but fine-grained copy/compute overlap prefers small transfers. So how small can we go?

[Chart: the same workload split into ever smaller chunks: too many streams, 8 streams, 2 streams, 1 stream.]
APPARENTLY NOT THAT SMALL

FROM GPU MEMORY TO GPU THREADS

FEEDING THE MACHINE

From GPU memory to the SMs.
USE THE PARALLEL ARCHITECTURE
Hardware is optimized to use all SIMT threads at once

▪ Threads run in groups of 32 (warps)
▪ The cache is sized to service sets of 32 requests at a time (the L2 cache line)
▪ High-speed GPU memory works best with linear access
VECTORIZE MEMORY LOADS
Multi-Word as well as Multi-Thread

[Diagram: with int loads, threads T0-T31 fill one cache line per fetch. With int2 loads, T0-T15 and T16-T31 fill two cache lines in a single fetch. With int4 loads, T0-T7, T8-T15, T16-T23 and T24-T31 fill four cache lines in a single fetch.]
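A minimal sketch (a hypothetical copy kernel, not from the slides): each thread loads and stores one int4, i.e. four ints at a time, so a warp fills four cache lines per fetch. The buffers must be 16-byte aligned (cudaMalloc allocations are).

__global__ void copy_int4(const int4 *in, int4 *out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        out[i] = in[i];     // one 16-byte load and one 16-byte store per thread
}

// Launch with n4 = number of ints / 4, e.g.:
// copy_int4<<<(n4 + 255) / 256, 256>>>(in, out, n4);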
DO MULTIPLE LOADS PER THREAD
Multi-Thread, Multi-Word AND Multi-Iteration

One copy per thread (maximum overhead):

__global__ void copy(int2 *input, int2 *output, int max) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;

    if( id < max ) {
        output[id] = input[id];
    }
}

Multiple copies per thread (amortize overhead):

__global__ void copy(int2 *input, int2 *output, int max,
                     int loadsPerThread) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;

    for(int n=0; n<loadsPerThread; n++) {
        if( id >= max ) {
            break;
        }
        output[id] = input[id];
        id += blockDim.x * gridDim.x;
    }
}
“MAXIMAL” LAUNCHES ARE BEST

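A minimal sketch (hypothetical helper, not from the slides) of sizing a "maximal" launch: ask the runtime how many blocks fit per SM, launch exactly that many, and let each thread loop with a grid stride.

#include <cuda_runtime.h>

__global__ void copyGridStride(const int2 *in, int2 *out, int max) {
    for (int id = threadIdx.x + blockDim.x * blockIdx.x; id < max;
         id += blockDim.x * gridDim.x)
        out[id] = in[id];
}

void launchMaximal(const int2 *in, int2 *out, int max) {
    int device, numSMs, blocksPerSM;
    const int blockSize = 256;

    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  copyGridStride, blockSize, 0);

    // Exactly enough blocks to fill every SM, no more.
    copyGridStride<<<numSMs * blocksPerSM, blockSize>>>(in, out, max);
}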
COALESCED MEMORY ACCESS
It’s not enough just to use all the SIMT threads

Coalesced: sequential memory accesses are adjacent.

Uncoalesced: sequential memory accesses are unassociated (scattered).

[Diagram: threads 1-4 reading four adjacent words vs. threads 1-4 reading four scattered words.]
SIMT PENALTIES WHEN NOT COALESCED

x = data[threadIdx.x]   →  a single 32-wide operation
x = data[rand()]        →  32 one-wide operations
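A minimal sketch (hypothetical kernels, not from the slides) of the difference: in the first kernel adjacent threads read adjacent words, so each warp's 32 accesses coalesce into a few transactions; in the second, a stride scatters the warp's accesses across memory.

__global__ void coalescedRead(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                          // adjacent threads, adjacent words
}

__global__ void stridedRead(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((long long)i * stride) % n]; // adjacent threads, scattered words
}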
SCATTER & GATHER

Gathering: reading randomly, writing sequentially.
Scattering: reading sequentially, writing randomly.

[Diagram: threads 1-4 gathering from scattered source locations into adjacent outputs, and scattering adjacent inputs to scattered destinations.]
AVOID SCATTER/GATHER IF YOU CAN

SORTING MIGHT BE AN OPTION

If reading non-sequential data is expensive, is it worth sorting it first to make it sequential?

[Diagram: gathering from unsorted data (2 4 1 3) is slow; after a sort, a coalesced read of (1 2 3 4) is fast.]

Even if you’re only going to read it twice, then yes!
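A minimal sketch (assuming Thrust, which ships with the CUDA toolkit; the vector names are made up): sort the gather indices once and keep the permutation, so later passes can read the data sequentially and scatter results back if needed.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void sortGatherIndices(thrust::device_vector<int> &indices,
                       thrust::device_vector<int> &permutation) {
    permutation.resize(indices.size());
    thrust::sequence(permutation.begin(), permutation.end());
    // After this call, `indices` is ascending and `permutation[i]` records
    // where sorted element i originally lived.
    thrust::sort_by_key(indices.begin(), indices.end(), permutation.begin());
}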
PRE-SORTING TURNS OUT TO BE GOOD

DATA LAYOUT: “AOS vs. SOA”
Sometimes you can’t just sort your data

Array-of-Structures:

#define NPTS 1024 * 1024

struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};

Coefficients_AOS gridData[NPTS];

Structure-of-Arrays:

#define NPTS 1024 * 1024

struct Coefficients_SOA {
    double u[3][NPTS];
    double x[3][3][NPTS];
    double p[NPTS];
    double rho[NPTS];
    double eta[NPTS];
};

Coefficients_SOA gridData;

Single-thread code prefers arrays of structures, for cache efficiency.
SIMT code prefers structures of arrays, for execution & memory efficiency.
DATA LAYOUT: “AOS vs. SOA”

#define NPTS 1024 * 1024

struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};

Coefficients_AOS gridData[NPTS];

[Conceptual layout: each structure packs u0 u1 u2, x00..x22, p, rho and eta together for one point.]
AOS: STRIDED ARRAY ACCESS

The GPU reads data one element at a time, but in parallel across the 32 threads in a warp.

double u0 = gridData[threadIdx.x].u[0];

[Diagram: in the Array-of-Structures memory layout, consecutive threads read u[0] values that are a whole structure apart, so each warp load is strided.]
SOA: COALESCED BUT COMPLEX

The GPU reads data one element at a time, but in parallel across the 32 threads in a warp.

double u0 = gridData.u[0][threadIdx.x];

[Diagram: in the Structure-of-Arrays memory layout, consecutive threads read adjacent u[0] values, so each warp load is coalesced, at the cost of a more complex data structure.]
BLOCK-WIDE LOAD VIA SHARED MEMORY
Read data linearly as bytes. Use shared memory to convert to struct

[Diagram: the block copies its structures linearly from device memory into shared memory; the threads which own the data then grab it from shared memory.]
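A minimal sketch (hypothetical, assuming the Coefficients_AOS struct defined earlier): the block copies its structures from device memory as a flat array of doubles, so the global loads are coalesced, then each thread picks up its own struct from shared memory. Launch with blockDim.x * sizeof(Coefficients_AOS) bytes of dynamic shared memory and NPTS a multiple of blockDim.x.

__global__ void loadViaShared(const Coefficients_AOS *in, double *out) {
    extern __shared__ Coefficients_AOS tile[];          // blockDim.x structs

    const int wordsPerStruct = sizeof(Coefficients_AOS) / sizeof(double);
    const int blockWords     = blockDim.x * wordsPerStruct;

    const double *src = reinterpret_cast<const double *>(in + blockIdx.x * blockDim.x);
    double       *dst = reinterpret_cast<double *>(tile);

    // Block-wide, coalesced copy of blockDim.x structures.
    for (int w = threadIdx.x; w < blockWords; w += blockDim.x)
        dst[w] = src[w];
    __syncthreads();

    // Each thread now reads "its" structure from fast shared memory.
    Coefficients_AOS c = tile[threadIdx.x];
    out[blockIdx.x * blockDim.x + threadIdx.x] = c.p + c.rho * c.eta;   // stand-in computation
}

// Launch example:
// loadViaShared<<<NPTS / 256, 256, 256 * sizeof(Coefficients_AOS)>>>(gridData, results);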
CLEVER AOS/SOA TRICKS
Helps for any data size
HANDY LIBRARY TO HELP YOU

Trove – A utility library for fast AOS/SOA access and transposition


https://github.com/bryancatanzaro/trove

(AB)USING THE CACHE

MAKING THE MOST OF L2-CACHE

L2 cache is fast (roughly 2,000 GB/sec vs. 300 GB/sec for GDRAM) but small:

Architecture   L2 Cache Size   Total Threads   Cache Bytes per Thread
Kepler         1536 KB         30,720          51
Maxwell        3072 KB         49,152          64
Pascal         4096 KB         114,688         36
TRAINING DEEP NEURAL NETWORKS

LOTS OF PASSES OVER DATA

[Diagram: an FFT feeds 3x3 (W1), 5x5 (W2) and 7x7 (W3) convolutions, whose results are combined to produce the classification ("Cat!").]
MULTI-RESOLUTION CONVOLUTIONS

Pass 1: 3x3
Pass 2: 5x5
Pass 3: 7x7
TILED, MULTI-RESOLUTION CONVOLUTION

Pass 1: 3x3
Pass 2: 5x5
Pass 3: 7x7

Do all 3 passes per tile, with each tile sized to fit in the L2 cache, so the data is still resident when the next pass reads it.
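A minimal sketch of the tiling idea (hypothetical: the convolution<K> kernel here is only a stub, and the 256-thread blocks and chunked 1-D layout are made up; the real kernel calls appear in the persistent-kernel example later):

template<int K>
__global__ void convolution(int *data, int count) {
    // Stub for a K x K convolution pass over `count` elements.
}

void tiledConvolution(int *ptr, int total, int chunkSize, cudaStream_t s) {
    // Each chunk is small enough to stay resident in L2 across all 3 passes.
    for (int offset = 0; offset < total; offset += chunkSize) {
        int n = (total - offset < chunkSize) ? (total - offset) : chunkSize;
        int numblocks = (n + 255) / 256;
        convolution<3><<< numblocks, 256, 0, s >>>(ptr + offset, n);
        convolution<5><<< numblocks, 256, 0, s >>>(ptr + offset, n);
        convolution<7><<< numblocks, 256, 0, s >>>(ptr + offset, n);
    }
}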
LAUNCHING FEWER THAN MAXIMUM THREADS

SHARED MEMORY: DEFINITELY WORTH IT

USING SHARED MEMORY WISELY
Shared memory is arranged into “banks” for concurrent SIMT access

▪ 32 threads can read simultaneously so long as they hit separate banks

Shared memory has 4-byte and 8-byte “bank” size modes.
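A minimal sketch of the bank rule (a shared-memory tile transpose, which is not from the slides): padding the tile by one column keeps a warp's column-wise reads spread across 32 different banks. Assumes 32x32 thread blocks and a square matrix whose width is a multiple of 32.

__global__ void transposeTile(const float *in, float *out, int width) {
    __shared__ float tile[32][33];            // 33, not 32: the padding column keeps
                                              // a warp's column reads in 32 banks

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;        // transposed block coordinates
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}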
STENCIL ALGORITHM

Many algorithms have high data re-use: potentially good for shared memory.

“Stencil” algorithms accumulate data from neighbours onto a central point.

▪ A stencil has width “W” (for example, W=5)

Adjacent threads share (W-1) items of data: good potential for data re-use.
STENCILS IN SHARED MEMORY

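A minimal sketch (a hypothetical 1-D stencil with W=5, not the author's code): each block stages its elements plus a halo of W-1 neighbours in shared memory, so adjacent threads re-use each other's loads. Launch with (threads + 2*RADIUS) * sizeof(float) bytes of dynamic shared memory.

#define W 5
#define RADIUS (W / 2)

__global__ void stencil1d(const float *in, float *out, int n) {
    extern __shared__ float tile[];           // blockDim.x + 2*RADIUS floats

    int bdim = blockDim.x;
    int gid  = blockIdx.x * bdim + threadIdx.x;
    int lid  = threadIdx.x + RADIUS;

    // Interior element (clamped at the right-hand end of the array).
    tile[lid] = in[min(gid, n - 1)];

    // The first RADIUS threads also load the left and right halo elements.
    if (threadIdx.x < RADIUS) {
        tile[lid - RADIUS] = in[max(gid - RADIUS, 0)];
        tile[lid + bdim]   = in[min(gid + bdim, n - 1)];
    }
    __syncthreads();

    if (gid < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; k++)
            sum += tile[lid + k];
        out[gid] = sum / W;
    }
}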
SIZE MATTERS

PERSISTENT KERNELS
Revisiting the tiled convolutions

Avoid multiple kernel launches by caching in shared memory instead of L2.

Separate kernel launches with L2 re-use:

void tiledConvolution() {
    convolution<3><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<5><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<7><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
}

Single launch of a persistent kernel:

__global__ void convolutionShared(int *data, int count, int sharedelems) {
    extern __shared__ int shdata[];
    shdata[threadIdx.x] = data[threadIdx.x + blockDim.x*blockIdx.x];
    __syncthreads();

    convolve<3>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<5>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<7>(threadIdx.x, shdata, sharedelems);
}
OPERATING DIRECTLY FROM CPU MEMORY

Can save memory copies. It’s obvious when you think about it ...

Copy-then-compute: compute only begins when the first copy has finished, and the task only ends when the second copy has finished.

    Copy data to GPU → Compute → Copy data to Host

Operating directly from CPU memory: compute begins after the first fetch. Uses lots of threads to cover the host-memory access latency, and takes advantage of bi-directional PCIe.

    Read from CPU / Compute / Write to CPU, all overlapped
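A minimal sketch (hypothetical kernel and sizes, not the author's code) of operating directly on mapped, pinned CPU memory: the kernel reads and writes over PCIe as it computes, so no explicit cudaMemcpy is needed, and the grid-stride loop keeps many accesses in flight to hide the host-memory latency.

#include <cuda_runtime.h>

__global__ void scale(const float *in, float *out, int n, float a) {
    // Every access to in/out travels over PCIe; many resident threads help
    // cover that latency.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        out[i] = a * in[i];
}

int main() {
    const int n = 1 << 24;
    float *h_in, *h_out, *d_in, *d_out;

    // Pinned, mapped host allocations that the GPU can access directly.
    cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
    cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

    for (int i = 0; i < n; i++) h_in[i] = (float)i;

    scale<<<1024, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();            // results land directly in h_out

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}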
OCCUPANCY AND REGISTER LIMITATIONS

The register file is bigger than shared memory and L1 cache!

Occupancy can kill you if you use too many registers. It is often worth forcing fewer registers to allow more blocks per SM:

__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
__global__ void compute() {
    y = acos(pow(log(fdivide(tan(cosh(erfc(x))), 2)), 3));
}

But watch out for math functions! Registers used, float vs. double:

Function      float   double
log             7      18
cos            16      28
acos            6      18
cosh            7      10
tan            15      28
erfc           14      22
exp             7      10
log10           6      18
normcdf        16      26
cbrt            8      20
sqrt            6      12
rsqrt           5      12
y0             20      30
y1             22      30
fdivide        11      20
pow            11      24
grad. desc.    14      22
THANK YOU!
