
CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES
Stephen Jones, GTC 2017

The art of doing more with less

RULE #1: DON’T TRY TOO HARD

[Chart: Performance vs. Time, with a "Peak Performance" ceiling. Annotations along the effort/reward curve: "Unrealistic Effort/Reward", "Premature excitement", "Most people give up here" ("Wait, it's going slower??"), "Trough of despair" ("4 weeks and this is it?"), "Point of diminishing returns", "Here be ninjas", "Hire an intern". The advice: don't waste time on the flat start, get onto the steep part of the curve, and cut short the time spent chasing the last few percent.]
PERFORMANCE CONSTRAINTS

[Pie chart of what typically limits kernel performance:]

Memory               75%
Occupancy            10%
Compute Intensity    10%
Divergence            3%
Instruction           2%

The memory slice breaks down further into:

▪ CPU <> GPU Transfer
▪ Coalescence
▪ Divergent Access
▪ Cache Inefficiency
▪ Register Spilling
MEMORY ORDERS OF MAGNITUDE

Approximate bandwidths along the path from the SM to CPU memory:

Registers / shared memory (SM)    20,000 GB/sec
L2 cache                           2,000 GB/sec
GPU memory (GDRAM)                   300 GB/sec
PCIe bus                              16 GB/sec
CPU memory (DRAM)                    150 GB/sec
TALK BREAKDOWN
In no particular order

1. Why Didn’t I Think Of That?
2. CPU Memory to GPU Memory (the PCIe Bus)
3. GPU Memory to the SM
4. Registers & Shared Memory
5. Occupancy, Divergence & Latency
6. Weird Things You Never Thought Of (and probably shouldn’t try)
WHERE TO BEGIN?

THE OBVIOUS

Start with the NVIDIA Visual Profiler.
CPU <> GPU DATA MOVEMENT

PCIe ISSUES

Moving data over the PCIe bus: at roughly 16 GB/sec it is by far the slowest link between the GPU's registers/shared memory and CPU memory.
PIN YOUR CPU MEMORY

The GPU's DMA controller can only transfer from pinned (page-locked) CPU memory. With an ordinary pageable allocation, the CPU must first allocate and pin a staging page and copy your data into it before the DMA transfer can start (and the OS is free to swap pageable data out in the meantime).

If the memory is user-pinned, the DMA controller reads it directly, with no staging copy:

cudaHostAlloc( &data, size, cudaHostAllocMapped );

cudaHostRegister( &data, size, cudaHostRegisterDefault );
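A minimal sketch (not from the slides; the buffer names and sizes are made up) contrasting a pageable copy with a pinned, asynchronous copy:

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t size = 64 * 1024 * 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, size);

    // Pageable host memory: the driver has to stage it through an internal
    // pinned buffer before the DMA engine can see it.
    float *pageable = (float *)malloc(size);
    cudaMemcpy(d_data, pageable, size, cudaMemcpyHostToDevice);

    // Pinned host memory: the DMA engine reads it directly, and the copy can
    // also run asynchronously with respect to the CPU.
    float *pinned;
    cudaHostAlloc((void **)&pinned, size, cudaHostAllocDefault);
    cudaMemcpyAsync(d_data, pinned, size, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_data);
    return 0;
}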
REMEMBER: PCIe GOES BOTH WAYS

STREAMS & CONCURRENCY
Hiding the cost of data transfer

Operations in a single stream are ordered, but the hardware can copy and compute at the same time.

[Diagram: a single stream runs "Copy data to GPU", then "Compute", then "Copy data to Host" strictly one after another along the time axis.]
STREAMS & CONCURRENCY

[Diagram: with two streams, each stream's "Copy up / Work / Copy back" sequence overlaps with the other's, so the whole job finishes earlier than the single-stream version: time is saved.]
STREAMS & CONCURRENCY

Can keep on breaking work into smaller chunks and saving time.

[Diagram: the same workload run with 8 streams, 2 streams and 1 stream; more streams give more overlap and a shorter total time.]
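A minimal sketch (not from the slides) of a chunked copy/compute/copy pipeline across several streams. The kernel, buffer names and chunk size are hypothetical; h_in and h_out must be pinned for the async copies to overlap with compute.

#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                 // stand-in for real work
}

void pipeline(float *h_in, float *h_out, float *d_buf, int total, int chunk) {
    const int numStreams = 8;
    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; s++) cudaStreamCreate(&streams[s]);

    for (int off = 0, s = 0; off < total; off += chunk, s = (s + 1) % numStreams) {
        int n = (total - off < chunk) ? (total - off) : chunk;
        // Copy up, compute and copy back are ordered within this stream,
        // but overlap with the chunks issued to the other streams.
        cudaMemcpyAsync(d_buf + off, h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf + off, n);
        cudaMemcpyAsync(h_out + off, d_buf + off, n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < numStreams; s++) cudaStreamDestroy(streams[s]);
}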
SMALL PCIe TRANSFERS

PCIe is designed for large data transfers, but fine-grained copy/compute overlap prefers small transfers. So how small can we go?

[Chart: the same workload split into ever smaller chunks: too many streams, 8 streams, 2 streams, 1 stream.]
APPARENTLY NOT THAT SMALL

FROM GPU MEMORY TO GPU THREADS

FEEDING THE MACHINE

From GPU memory to the SMs.
USE THE PARALLEL ARCHITECTURE
Hardware is optimized to use all SIMT threads at once

▪ Threads run in groups of 32 (warps)
▪ The cache is sized to service sets of 32 requests at a time (the L2 cache line)
▪ High-speed GPU memory works best with linear access
VECTORIZE MEMORY LOADS
Multi-Word as well as Multi-Thread

[Diagram: with int loads, threads T0-T31 fill one cache line per fetch. With int2 loads, T0-T15 and T16-T31 fill two cache lines in a single fetch. With int4 loads, T0-T7, T8-T15, T16-T23 and T24-T31 fill four cache lines in a single fetch.]
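A minimal sketch (a hypothetical copy kernel, not from the slides): each thread loads and stores one int4, i.e. four ints at a time, so a warp fills four cache lines per fetch. The buffers must be 16-byte aligned (cudaMalloc allocations are).

__global__ void copy_int4(const int4 *in, int4 *out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        out[i] = in[i];     // one 16-byte load and one 16-byte store per thread
}

// Launch with n4 = number of ints / 4, e.g.:
// copy_int4<<<(n4 + 255) / 256, 256>>>(in, out, n4);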
DO MULTIPLE LOADS PER THREAD
Multi-Thread, Multi-Word AND Multi-Iteration

One copy per thread (maximum overhead):

__global__ void copy(int2 *input, int2 *output, int max) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;

    if( id < max ) {
        output[id] = input[id];
    }
}

Multiple copies per thread (amortize overhead):

__global__ void copy(int2 *input, int2 *output, int max,
                     int loadsPerThread) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;

    for(int n=0; n<loadsPerThread; n++) {
        if( id >= max ) {
            break;
        }
        output[id] = input[id];
        id += blockDim.x * gridDim.x;
    }
}
“MAXIMAL” LAUNCHES ARE BEST

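A minimal sketch (hypothetical helper, not from the slides) of sizing a "maximal" launch: ask the runtime how many blocks fit per SM, launch exactly that many, and let each thread loop with a grid stride.

#include <cuda_runtime.h>

__global__ void copyGridStride(const int2 *in, int2 *out, int max) {
    for (int id = threadIdx.x + blockDim.x * blockIdx.x; id < max;
         id += blockDim.x * gridDim.x)
        out[id] = in[id];
}

void launchMaximal(const int2 *in, int2 *out, int max) {
    int device, numSMs, blocksPerSM;
    const int blockSize = 256;

    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  copyGridStride, blockSize, 0);

    // Exactly enough blocks to fill every SM, no more.
    copyGridStride<<<numSMs * blocksPerSM, blockSize>>>(in, out, max);
}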
COALESCED MEMORY ACCESS
It’s not enough just to use all the SIMT threads

Coalesced: sequential memory accesses are adjacent.

Uncoalesced: sequential memory accesses are unassociated (scattered).

[Diagram: threads 1-4 reading four adjacent words vs. threads 1-4 reading four scattered words.]
SIMT PENALTIES WHEN NOT COALESCED

x = data[threadIdx.x]   →  a single 32-wide operation
x = data[rand()]        →  32 one-wide operations
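A minimal sketch (hypothetical kernels, not from the slides) of the difference: in the first kernel adjacent threads read adjacent words, so each warp's 32 accesses coalesce into a few transactions; in the second, a stride scatters the warp's accesses across memory.

__global__ void coalescedRead(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                          // adjacent threads, adjacent words
}

__global__ void stridedRead(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((long long)i * stride) % n]; // adjacent threads, scattered words
}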
SCATTER & GATHER

Gathering: reading randomly, writing sequentially.
Scattering: reading sequentially, writing randomly.

[Diagram: threads 1-4 gathering from scattered source locations into adjacent outputs, and scattering adjacent inputs to scattered destinations.]
AVOID SCATTER/GATHER IF YOU CAN

SORTING MIGHT BE AN OPTION

If reading non-sequential data is expensive, is it worth sorting it first to make it sequential?

[Diagram: gathering from unsorted data (2 4 1 3) is slow; after a sort, a coalesced read of (1 2 3 4) is fast.]

Even if you’re only going to read it twice, then yes!
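A minimal sketch (assuming Thrust, which ships with the CUDA toolkit; the vector names are made up): sort the gather indices once and keep the permutation, so later passes can read the data sequentially and scatter results back if needed.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void sortGatherIndices(thrust::device_vector<int> &indices,
                       thrust::device_vector<int> &permutation) {
    permutation.resize(indices.size());
    thrust::sequence(permutation.begin(), permutation.end());
    // After this call, `indices` is ascending and `permutation[i]` records
    // where sorted element i originally lived.
    thrust::sort_by_key(indices.begin(), indices.end(), permutation.begin());
}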
PRE-SORTING TURNS OUT TO BE GOOD

DATA LAYOUT: “AOS vs. SOA”
Sometimes you can’t just sort your data

Array-of-Structures:

#define NPTS 1024 * 1024

struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};

Coefficients_AOS gridData[NPTS];

Structure-of-Arrays:

#define NPTS 1024 * 1024

struct Coefficients_SOA {
    double u[3][NPTS];
    double x[3][3][NPTS];
    double p[NPTS];
    double rho[NPTS];
    double eta[NPTS];
};

Coefficients_SOA gridData;

Single-thread code prefers arrays of structures, for cache efficiency.
SIMT code prefers structures of arrays, for execution & memory efficiency.
DATA LAYOUT: “AOS vs. SOA”

#define NPTS 1024 * 1024

struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};

Coefficients_AOS gridData[NPTS];

[Conceptual layout: each structure packs u0 u1 u2, x00..x22, p, rho and eta together for one point.]
AOS: STRIDED ARRAY ACCESS

The GPU reads data one element at a time, but in parallel across the 32 threads in a warp.

double u0 = gridData[threadIdx.x].u[0];

[Diagram: in the Array-of-Structures memory layout, consecutive threads read u[0] values that are a whole structure apart, so each warp load is strided.]
SOA: COALESCED BUT COMPLEX

The GPU reads data one element at a time, but in parallel across the 32 threads in a warp.

double u0 = gridData.u[0][threadIdx.x];

[Diagram: in the Structure-of-Arrays memory layout, consecutive threads read adjacent u[0] values, so each warp load is coalesced, at the cost of a more complex data structure.]
BLOCK-WIDE LOAD VIA SHARED MEMORY
Read data linearly as bytes. Use shared memory to convert to struct

[Diagram: the block copies its structures linearly from device memory into shared memory; the threads which own the data then grab it from shared memory.]
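A minimal sketch (hypothetical, assuming the Coefficients_AOS struct defined earlier): the block copies its structures from device memory as a flat array of doubles, so the global loads are coalesced, then each thread picks up its own struct from shared memory. Launch with blockDim.x * sizeof(Coefficients_AOS) bytes of dynamic shared memory and NPTS a multiple of blockDim.x.

__global__ void loadViaShared(const Coefficients_AOS *in, double *out) {
    extern __shared__ Coefficients_AOS tile[];          // blockDim.x structs

    const int wordsPerStruct = sizeof(Coefficients_AOS) / sizeof(double);
    const int blockWords     = blockDim.x * wordsPerStruct;

    const double *src = reinterpret_cast<const double *>(in + blockIdx.x * blockDim.x);
    double       *dst = reinterpret_cast<double *>(tile);

    // Block-wide, coalesced copy of blockDim.x structures.
    for (int w = threadIdx.x; w < blockWords; w += blockDim.x)
        dst[w] = src[w];
    __syncthreads();

    // Each thread now reads "its" structure from fast shared memory.
    Coefficients_AOS c = tile[threadIdx.x];
    out[blockIdx.x * blockDim.x + threadIdx.x] = c.p + c.rho * c.eta;   // stand-in computation
}

// Launch example:
// loadViaShared<<<NPTS / 256, 256, 256 * sizeof(Coefficients_AOS)>>>(gridData, results);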
CLEVER AOS/SOA TRICKS
Helps for any data size
HANDY LIBRARY TO HELP YOU

Trove – A utility library for fast AOS/SOA access and transposition


https://github.com/bryancatanzaro/trove

(AB)USING THE CACHE

MAKING THE MOST OF L2-CACHE

L2 cache is fast (roughly 2,000 GB/sec vs. 300 GB/sec for GDRAM) but small:

Architecture   L2 Cache Size   Total Threads   Cache Bytes per Thread
Kepler         1536 KB         30,720          51
Maxwell        3072 KB         49,152          64
Pascal         4096 KB         114,688         36
TRAINING DEEP NEURAL NETWORKS

LOTS OF PASSES OVER DATA

[Diagram: an FFT feeds 3x3 (W1), 5x5 (W2) and 7x7 (W3) convolutions, whose results are combined to produce the classification ("Cat!").]
MULTI-RESOLUTION CONVOLUTIONS

Pass 1: 3x3
Pass 2: 5x5
Pass 3: 7x7
TILED, MULTI-RESOLUTION CONVOLUTION

Pass 1: 3x3
Pass 2: 5x5
Pass 3: 7x7

Do all 3 passes per tile, with each tile sized to fit in the L2 cache, so the data is still resident when the next pass reads it.
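A minimal sketch of the tiling idea (hypothetical: the convolution<K> kernel here is only a stub, and the 256-thread blocks and chunked 1-D layout are made up; the real kernel calls appear in the persistent-kernel example later):

template<int K>
__global__ void convolution(int *data, int count) {
    // Stub for a K x K convolution pass over `count` elements.
}

void tiledConvolution(int *ptr, int total, int chunkSize, cudaStream_t s) {
    // Each chunk is small enough to stay resident in L2 across all 3 passes.
    for (int offset = 0; offset < total; offset += chunkSize) {
        int n = (total - offset < chunkSize) ? (total - offset) : chunkSize;
        int numblocks = (n + 255) / 256;
        convolution<3><<< numblocks, 256, 0, s >>>(ptr + offset, n);
        convolution<5><<< numblocks, 256, 0, s >>>(ptr + offset, n);
        convolution<7><<< numblocks, 256, 0, s >>>(ptr + offset, n);
    }
}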
LAUNCHING FEWER THAN MAXIMUM THREADS

SHARED MEMORY: DEFINITELY WORTH IT

USING SHARED MEMORY WISELY
Shared memory is arranged into “banks” for concurrent SIMT access

▪ 32 threads can read simultaneously so long as they hit separate banks

Shared memory has 4-byte and 8-byte “bank” size modes.
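A minimal sketch of the bank rule (a shared-memory tile transpose, which is not from the slides): padding the tile by one column keeps a warp's column-wise reads spread across 32 different banks. Assumes 32x32 thread blocks and a square matrix whose width is a multiple of 32.

__global__ void transposeTile(const float *in, float *out, int width) {
    __shared__ float tile[32][33];            // 33, not 32: the padding column keeps
                                              // a warp's column reads in 32 banks

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;        // transposed block coordinates
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}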
STENCIL ALGORITHM

Many algorithms have high data re-use: potentially good for shared memory.

“Stencil” algorithms accumulate data from neighbours onto a central point.

▪ A stencil has width “W” (for example, W=5)

Adjacent threads share (W-1) items of data: good potential for data re-use.
STENCILS IN SHARED MEMORY

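A minimal sketch (a hypothetical 1-D stencil with W=5, not the author's code): each block stages its elements plus a halo of W-1 neighbours in shared memory, so adjacent threads re-use each other's loads. Launch with (threads + 2*RADIUS) * sizeof(float) bytes of dynamic shared memory.

#define W 5
#define RADIUS (W / 2)

__global__ void stencil1d(const float *in, float *out, int n) {
    extern __shared__ float tile[];           // blockDim.x + 2*RADIUS floats

    int bdim = blockDim.x;
    int gid  = blockIdx.x * bdim + threadIdx.x;
    int lid  = threadIdx.x + RADIUS;

    // Interior element (clamped at the right-hand end of the array).
    tile[lid] = in[min(gid, n - 1)];

    // The first RADIUS threads also load the left and right halo elements.
    if (threadIdx.x < RADIUS) {
        tile[lid - RADIUS] = in[max(gid - RADIUS, 0)];
        tile[lid + bdim]   = in[min(gid + bdim, n - 1)];
    }
    __syncthreads();

    if (gid < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; k++)
            sum += tile[lid + k];
        out[gid] = sum / W;
    }
}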
SIZE MATTERS

PERSISTENT KERNELS
Revisiting the tiled convolutions

Avoid multiple kernel launches by caching in shared memory instead of L2.

Separate kernel launches with L2 re-use:

void tiledConvolution() {
    convolution<3><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<5><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<7><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
}

Single launch of a persistent kernel:

__global__ void convolutionShared(int *data, int count, int sharedelems) {
    extern __shared__ int shdata[];
    shdata[threadIdx.x] = data[threadIdx.x + blockDim.x*blockIdx.x];
    __syncthreads();

    convolve<3>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<5>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<7>(threadIdx.x, shdata, sharedelems);
}
OPERATING DIRECTLY FROM CPU MEMORY

Can save memory copies. It’s obvious when you think about it ...

Copy-then-compute: compute only begins when the first copy has finished, and the task only ends when the second copy has finished.

    Copy data to GPU → Compute → Copy data to Host

Operating directly from CPU memory: compute begins after the first fetch. Uses lots of threads to cover the host-memory access latency, and takes advantage of bi-directional PCIe.

    Read from CPU / Compute / Write to CPU, all overlapped
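A minimal sketch (hypothetical kernel and sizes, not the author's code) of operating directly on mapped, pinned CPU memory: the kernel reads and writes over PCIe as it computes, so no explicit cudaMemcpy is needed, and the grid-stride loop keeps many accesses in flight to hide the host-memory latency.

#include <cuda_runtime.h>

__global__ void scale(const float *in, float *out, int n, float a) {
    // Every access to in/out travels over PCIe; many resident threads help
    // cover that latency.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        out[i] = a * in[i];
}

int main() {
    const int n = 1 << 24;
    float *h_in, *h_out, *d_in, *d_out;

    // Pinned, mapped host allocations that the GPU can access directly.
    cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
    cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

    for (int i = 0; i < n; i++) h_in[i] = (float)i;

    scale<<<1024, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();            // results land directly in h_out

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}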
OCCUPANCY AND REGISTER LIMITATIONS

The register file is bigger than shared memory and L1 cache!

Occupancy can kill you if you use too many registers. It is often worth forcing fewer registers to allow more blocks per SM:

__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
__global__ void compute() {
    y = acos(pow(log(fdivide(tan(cosh(erfc(x))), 2)), 3));
}

But watch out for math functions! Registers used, float vs. double:

Function      float   double
log             7      18
cos            16      28
acos            6      18
cosh            7      10
tan            15      28
erfc           14      22
exp             7      10
log10           6      18
normcdf        16      26
cbrt            8      20
sqrt            6      12
rsqrt           5      12
y0             20      30
y1             22      30
fdivide        11      20
pow            11      24
grad. desc.    14      22
THANK YOU!
