
CMPE 220 | Winter 2008

CUDA Optimizations
- Overview
- Memory optimizations
- Execution configuration optimization
- Instruction optimization


A. Di Blas

Overview

Optimize algorithms for the GPU

- Maximize independent parallelism
- Maximize arithmetic intensity = operations/byte
- It is often better to recompute than to transfer or cache
- Do more computation on the GPU rather than transferring data
  - Even when the computation has low parallelism and complexity
  - Example: reduction
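As a hedged sketch of the reduction example just mentioned (the kernel name, array names, and block size are illustrative, not from the course code), a block-level sum in shared memory keeps the work on the GPU instead of shipping the whole array back to the host:

```cuda
// Illustrative block-level sum reduction. BLOCK_SIZE is an assumed
// power-of-two thread count, not a name from the course material.
#define BLOCK_SIZE 256

__global__ void reduceSum(const float *in, float *blockSums, int n)
{
    __shared__ float partial[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element (0.0f past the end of the array).
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum; a second pass (or the
    // host) reduces the short per-block array.
    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];
}
```

Even though the final pass over the per-block sums has little parallelism, doing it on the GPU avoids a round trip over the bus, which is exactly the point of the bullet above.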


Optimize for memory system

- Take advantage of memory coalescing to global and local memory
- Optimize for spatial locality in constant and texture memory (they are cached)
- Use shared memory as a programmer-managed cache; it is roughly 100x faster than global memory (4 cc vs. 400-600 cc)
- Use one or just a few threads to load from global memory into shared memory, then share among all threads
- Use shared memory to reorder transfers to global memory so that they are coalesced
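The staging pattern in the last two bullets can be sketched as follows (a hedged illustration; the kernel name, tile size, and the reversed access pattern are invented for the example, and for clarity it assumes n is a multiple of TILE):

```cuda
// Threads cooperatively load a tile with coalesced reads, synchronize,
// then each thread reads the tile in whatever order it needs.
#define TILE 256

__global__ void staged(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[TILE];
    int base = blockIdx.x * TILE;

    // Coalesced load: thread k reads the k-th consecutive word.
    tile[threadIdx.x] = g_in[base + threadIdx.x];
    __syncthreads();

    // Reordered use: reading shared memory backwards is cheap, while the
    // same permuted pattern on global memory would be uncoalesced.
    g_out[base + threadIdx.x] = tile[TILE - 1 - threadIdx.x];
}
```

The global loads and stores both stay coalesced; the permutation happens entirely inside fast shared memory.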


Use parallelism efficiently

- Partition the computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks
- Keep resource usage low enough to support multiple active thread blocks per multiprocessor


Terminology

- Thread: concurrent code and associated state, executed on the CUDA device in parallel with other threads
- Warp: a group of threads executed physically in parallel on a streaming multiprocessor (SM) in SIMD fashion
  - Half-warp: the first or second half of a warp
- Block: a group of threads that are executed together and that share shared memory on a multiprocessor
- Grid: a group of blocks that execute a single CUDA kernel, logically in parallel, on a single GPU
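The terminology above maps directly onto a kernel launch. In this hedged sketch (the kernel and variable names are invented for illustration), the grid, block, and thread indices combine into a global element index:

```cuda
__global__ void scale(float *data, float k, int n)
{
    // Global thread index: block index * threads per block + thread index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;
}

// Host side: launch a grid with enough 256-thread blocks to cover n
// elements. Each block's threads share one SM's shared memory; each
// 32-thread warp within a block executes in SIMD.
void launchScale(float *d_data, float k, int n)
{
    int threadsPerBlock = 256;                        // 8 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, k, n);
}
```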


Memory optimizations

Memory architecture

- Thousands of lightweight threads
  - Low switching overhead
  - Hide instruction latency
- Random access to global memory
  - Any thread can read/write any location


Memory     Location   Cached   Access   Who
--------   --------   ------   ------   ----------------------
Local      Off-chip   NO       R/W      One thread
Shared     On-chip    n/a      R/W      All threads in a block
Global     Off-chip   NO       R/W      All threads + host
Constant   Off-chip   YES      R        All threads + host
Texture    Off-chip   YES      R        All threads + host


Hiding global memory latency

- Event-based scheduling is more efficient than time-based scheduling
- With many threads, memory latency can be hidden efficiently


Texture and constant memory

- Texture memory is cached (8/16 KB)
  - Uses the same texture cache used for graphics
  - Optimized for 2D spatial locality; best performance when all threads in a warp read locations that are close in 2D
- Constant memory is cached (32 KB)
  - 4 cc per read within a single warp, per thread, with broadcast
  - If all threads in a warp read the same address: 4 cc
  - If all threads in a warp read different addresses: 64 cc


Using constant memory

- Declaration (at global scope):

  __constant__ char midpointsGpu[CONST_MEM_SIZE];

  - No dynamic allocation (no cudaMalloc())
- Data transfers:
  - cudaMemcpyToSymbol(..., kind), where kind can be cudaMemcpyHostToDevice or cudaMemcpyDeviceToDevice
  - cudaMemcpyFromSymbol(..., kind), where kind can be cudaMemcpyDeviceToHost or cudaMemcpyDeviceToDevice
- NOTE: cudaGetSymbolAddress() cannot take the address of __constant__ data.
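Putting the calls above together, a minimal sketch of host-side constant-memory traffic (the buffer size and function names around the declared symbol are illustrative):

```cuda
// Hedged sketch of constant-memory transfers; CONST_MEM_SIZE and the
// helper function names are invented for the example.
#define CONST_MEM_SIZE 1024

__constant__ char midpointsGpu[CONST_MEM_SIZE];  // static declaration only

void uploadMidpoints(const char *hostBuf)
{
    // Copy host -> constant memory: the symbol is named directly,
    // no cudaMalloc() is involved.
    cudaMemcpyToSymbol(midpointsGpu, hostBuf, CONST_MEM_SIZE, 0,
                       cudaMemcpyHostToDevice);
}

void downloadMidpoints(char *hostBuf)
{
    // Copy constant memory -> host.
    cudaMemcpyFromSymbol(hostBuf, midpointsGpu, CONST_MEM_SIZE, 0,
                         cudaMemcpyDeviceToHost);
}
```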


Global memory access coalescing

- If the per-thread memory accesses for a single half-warp form a contiguous region of addresses, all accesses are coalesced into a single access.
- Contiguous region sizes:
  - 64 bytes: each thread reads a 4-byte word (int, float, ...)
  - 128 bytes: each thread reads an 8-byte double-word (int2, float2, ...)
  - 256 bytes: each thread reads a 16-byte quad-word (int4, float4, ...)
- Additional restrictions:
  - The starting address of a region must be aligned to the region size
  - The kth thread in a half-warp must access the kth element in the block
- Exception: not all threads need to participate
  - Example: predicated access, divergence within a half-warp

NOTE: cudaMalloc() always aligns to 256-byte boundaries.


Coalesced access: reading floats

[Figure: threads t0-t15 read consecutive 4-byte words at addresses 128, 132, 136, ..., 192. Top: all threads participate. Bottom: some threads do not participate. Both accesses are coalesced.]

Uncoalesced access: reading floats

[Figure: threads t0-t15 read 4-byte words from the region at addresses 128-192. Top: permuted access by threads. Bottom: misaligned starting address. Both accesses are uncoalesced.]

Shared memory

- 16 banks, each 4 bytes wide
- As fast as registers, if there are no bank conflicts
- The fast case:
  - All threads in a half-warp read from different banks
  - All threads in a half-warp read the same address: only one read, followed by a broadcast
- The slow case:
  - Multiple threads in the same half-warp access the same bank
  - Accesses are serialized: delay = maximum number of accesses to a single bank


Shared memory: no bank conflicts

[Figure: threads 0-15 mapped one-to-one onto banks 0-15. Left: stride-1 access. Right: any 1:1 permutation. Both are conflict-free.]

Shared memory: with bank conflicts

[Figure: threads 0-15 mapped onto banks. Left: stride-2 access (two threads per bank: 2-way conflict). Right: stride-8 access (eight threads per bank: 8-way conflict).]

Execution configuration optimization

- Threads are executed sequentially, in warp-SIMD groups
  - Executing other warps is the only way to hide latency and keep the hardware busy
- Occupancy = number of warps running concurrently on a multiprocessor, as a fraction of the maximum number of warps that can run concurrently
- Occupancy is limited by resource usage:
  - Registers
  - Shared memory
- Increasing occupancy does not necessarily improve performance, but low-occupancy kernels cannot hide memory latency


Grid and block size heuristics

- # of blocks > # of multiprocessors: all SMs have at least one block to execute
- # of blocks / # of multiprocessors > 2: multiple blocks per multiprocessor
- # of blocks > 100, to scale to future devices

Register dependency

- Read-after-write (RAW) in registers: about 11 cc of latency
- To completely hide this latency:
  - Run at least 192 threads = 6 32-thread warps per SM
  - The threads do not have to belong to the same block
- Limiting factor: 8192 registers per SM, shared by all active threads
- Use the -maxrregcount=N flag to the nvcc compiler
  - N = maximum number of registers per kernel
  - Excess registers will spill to local memory (slow)


Another method to determine resource usage: the .cubin file

- Compile with the -cubin flag
- Open the .cubin file with a text editor. Example:

  architecture {sm_10}
  abiversion {0}
  code {
      name = _Z8saxpyGPUfPfS_S_i
      lmem = 0
      smem = 36
      reg = 6
      bar = 0
      bincode {
          0x3080d1fd 0x6c6087c8 0x1000d005 0x0423c780
          ...


The .cubin file example, cont.

  ...
  const {
      segname = const
      segnum = 1
      offset = 0
      bytes = 8
      mem {
          0000000000 0xffffffff
      }
  }
  }


Optimizing threads per block

- # of threads per block = multiple of the warp size (32)
- More threads per block = better memory latency hiding
- More threads per block = fewer registers per thread
  - Kernel launches fail if too many registers are required
- Heuristics:
  - Use at least 64 threads per block
  - 192 or 256 whenever possible
  - Must experiment!


Instruction optimization

Instruction optimization: arithmetic instructions

- int and float add, shift, min, max: 4 cc per warp
- float mul, mad: 4 cc per warp
- 32-bit int mul: "multiple" cc per warp
  - Use __mul24() or __umul24() for 4-cc 24-bit integer multiplication
- Integer divide and modulo are "more expensive" (how much more?)
- Two types of math operations:
  - func(): compiles to multiple operations; slower, higher accuracy
  - __func(): compiles to a single ISA instruction ("intrinsics")
- Intrinsic reciprocal, reciprocal square root, sin, cos, log, exp: 16 cc
- Other functions are combinations of the above
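A typical use of the 24-bit multiply intrinsic is index arithmetic, since block and thread indices fit comfortably in 24 bits. A hedged sketch (the kernel and array names are invented):

```cuda
// Index computed with the 4-cc 24-bit multiply instead of the slower
// full 32-bit multiply; correct as long as blockIdx.x * blockDim.x
// fits in 24 bits, which holds for practical launch configurations.
__global__ void addOne(int *data, int n)
{
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n)
        data[i] += 1;   // int add: 4 cc per warp
}
```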


GPU results may not match CPU

- Floating-point math is not associative: different execution orders produce different results
- CPUs (Intel) use precision higher than 64 bits (80-bit internal)
- GPUs are not fully IEEE-754 compliant:
  - Not all rounding modes are supported
  - Denormalized numbers are not supported
  - Infinity (not sure)
  - No exceptions


Making programs float-safe

- Future hardware will have double-precision support
  - Everything so far is single-precision only
  - Double precision will be more expensive and slower
- Avoid double precision when single precision is enough:
  - Add the 'f' specifier to constants: 0.1234f
  - Use the float version of standard library functions: sinf(x) instead of sin(x)


Control flow instructions

- Branching may generate divergence:
  - Threads from a single warp take different paths
  - The different paths must be serialized (it's SIMD!)
- Avoid divergence with conditions on the thread ID
  - Example with warp divergence (branch granularity < warp size):

    if (threadIdx.x > 2) {}

  - Example without warp divergence (branch granularity = warp size):

    if (threadIdx.x / WARP_SIZE > 2) {}
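Both branch forms above can be placed side by side in one kernel; this hedged sketch (kernel name and the simple bodies are invented) shows why the second form avoids serialization:

```cuda
#define WARP_SIZE 32

__global__ void divergenceDemo(float *data)
{
    // Divergent: within the first warp, threads 0-2 and threads 3-31
    // take different paths, so the warp executes both paths serially.
    if (threadIdx.x > 2)
        data[threadIdx.x] *= 2.0f;

    // Non-divergent: every thread of a given warp computes the same
    // warp index, so whole warps take the same path and nothing is
    // serialized.
    if (threadIdx.x / WARP_SIZE > 2)
        data[threadIdx.x] += 1.0f;
}
```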


Lab

Third CUDA lab

Write a program cudaLab3.cu to:
- Optimize array inversion
