QUIZ
Principles of Efficient GPU Programming
❑ Decrease arithmetic intensity
LEVELS OF OPTIMIZATION
1. Picking good algorithms (our concern)
2. Basic principles for efficiency (our concern)
3. Arch-specific detailed optimization (for the "geeks")
4. Instruction micro-optimization (for the "geeks")
CPU: 80%, GPU: 30%
Quiz
Match each technique to an optimization level (1–4):
❑ 3: Vector registers (SSE, AVX)
❑ 1: Use Merge sort O(n log n) or Heap sort O(n log n) instead of Odd-Even sort O(n²)
❑ 2: Write cache-aware code (iterate a 2D array row by row, not column by column)
❑ 3: Block for the L1 cache
❑ 4: Compute x^(−1/2) as 0x5f3759df − (x >> 1) (Quake 3 game)
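The level-4 trick above can be sketched in C++. This is a hedged sketch of the well-known Quake 3 routine; the Newton–Raphson refinement step and the function name `fast_rsqrt` are additions for illustration, not shown on the slide:

```cpp
#include <cstdint>
#include <cstring>
#include <cmath>

// Fast inverse square root (Quake 3 style): reinterpret the float's
// bits as an integer, apply the magic-constant trick, then refine the
// estimate with one Newton-Raphson step.
float fast_rsqrt(float x) {
    float half = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);   // bit-level reinterpretation
    i = 0x5f3759df - (i >> 1);       // magic constant minus half the exponent
    float y;
    std::memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y);   // one Newton-Raphson iteration
    return y;
}
```

The point of the example is level 4's character: it trades a tiny amount of accuracy for speed by exploiting the IEEE-754 bit layout.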
APOD
SYSTEMATIC OPTIMIZATION
The APOD cycle: Analyze → Parallelize → Optimize → Deploy (then repeat)
APOD (CONT.)
Analyze: profile the whole application
  Where can it benefit?
  By how much?
Parallelize:
  Pick an approach
    Complete algorithms: find libraries or write them yourself (CUDA or OpenCL)
    Small code sections: directives (OpenMP for CPU, OpenACC for GPU)
  Pick an algorithm
Analysis means understanding hotspots and understanding strong vs. weak scaling.
WEAK VS. STRONG SCALING
A serial program solves problem size A in time T.
  Ex: a PageRank matrix multiplication in an hour.
Strong scaling: solve the same problem size A in less time as processors are added.
Weak scaling: solve a proportionally larger problem in the same time T as processors are added.
UNDERSTANDING HOTSPOTS
Don't rely on intuition; run a profiler:
  Gprof
  VTune
  Very Sleepy
(Profiler bar chart: % of time per function a, b, c, d; a ≈ 50%, b ≈ 20%, c and d together ≈ 30%.)
Maximum speedup if you parallelize (drop that portion's time to zero):
  Speedup by parallelizing function a: 2×
  Speedup by parallelizing functions a, c, d: 5×
UNDERSTANDING HOTSPOTS (CONT.)
Amdahl's Law: the total speedup from parallelization is limited by the portion of time spent in the part that is not parallelized.

  max speedup = 1 / (s + p/n)

where s is the serial portion, p = (1 − s) is the parallel portion, and n is the number of threads.

  max speedup = 1/s = 1/(1 − p)   (we drop p/n to ~0 because n is large enough)

Previous example:
  p(a) = 50%: 1/(1 − 50%) = 2×
  p(a, c, d) = 80%: 1/(1 − 80%) = 5×
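The formula can be checked numerically. A minimal sketch; `amdahl` is an illustrative helper name, not from the slides:

```cpp
// Amdahl's law: maximum speedup given serial fraction s and n threads.
double amdahl(double s, double n) {
    double p = 1.0 - s;          // parallel fraction
    return 1.0 / (s + p / n);    // 1 / (s + p/n)
}
```

With very large n the result approaches 1/s, reproducing the 2× and 5× figures above.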
PARALLELIZE
MATRIX TRANSPOSE – SERIAL

A 3×3 matrix, stored row-major:

      c=0 c=1 c=2
r=0:   a   b   c     flattened: [0]a [1]b [2]c [3]d [4]e [5]f [6]g [7]h [8]i
r=1:   d   e   f
r=2:   g   h   i

One element per step: out[r][c] = in[c][r];

Trace (flattened indices):
  c=0, r=0: out[0] = in[0]
  c=0, r=1: out[3] = in[1]
  c=0, r=2: out[6] = in[2]
  c=1, r=0: out[1] = in[3]
  c=1, r=1: out[4] = in[4]
  …
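The serial step above, written out as a plain C++ sketch (row-major 3×3, matching the diagram; the loop order follows the trace):

```cpp
// Serial transpose: copy one element per step, out[r][c] = in[c][r].
const int N = 3;

void transpose(const char in[N][N], char out[N][N]) {
    for (int c = 0; c < N; ++c)      // same element order as the trace
        for (int r = 0; r < N; ++r)
            out[r][c] = in[c][r];
}
```

Every element is independent of the others, which is what makes the parallel versions on the following slides possible.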
MATRIX TRANSPOSE – PARALLEL “SIMPLE”
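A minimal sketch of the “simple” parallel version, with hypothetical names: one thread per element, launched over an n×n grid of 2D blocks.

```cuda
// One thread per element; adjacent threads (consecutive threadIdx.x)
// write adjacent output elements, so writes are coalesced, but the
// corresponding reads stride through memory by n.
__global__ void transpose_simple(const float *in, float *out, int n) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < n && c < n)
        out[r * n + c] = in[c * n + r];
}
```

Either the reads or the writes end up strided; the tiled versions later in the deck fix that.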
OPTIMIZATION
A program's time splits between Computation and Memory.
OPTIMIZATION (CONT.1)
A: Memory clock: 900 MHz = 900 × 10⁶ clocks per second
B: Memory bus width: 64 bits = 8 bytes per clock
Theoretical peak bandwidth: A × B = 7.2 GB/s
OPTIMIZATION (CONT.3)
Nsight: integrated debuggers and profilers, a complete development environment with many tools.
Nvidia Visual Profiler (NVVP)
LITTLE'S LAW
Little's Law: data in flight = latency × throughput. To saturate the memory bus, keep enough transactions (threads) in flight to cover the memory latency.
OPTIMIZATION (CONT.2)
Memory access patterns: uncoalesced accesses cost one transaction per element (N steps); coalesced accesses serve a whole warp in one transaction (1 step).
MATRIX TRANSPOSE – PARALLEL “TILED” [COALESCED]
# blocks = 32×32, # threads per block = 32×32

(Diagram: the 1024×1024 input is split into 32×32 tiles; each block reads its tile row by row (line 1 … line 32) and writes those rows as columns of the output.)

Still no coalesced write?!
Cache write policy (write-through, write-back)
See NVVP replay overhead
MATRIX TRANSPOSE – PARALLEL “TILED SHARED MEMORY”
# blocks = 32×32, # threads per block = 32×32

(Diagram: each block copies its 32×32 tile of the 1024×1024 matrix into shared memory, then writes it out transposed, so both the global read and the global write are coalesced.)

Still no coalesced write? ✓ Solved: stage the tile in SharedMemory and write it out transposed.
See NVVP replay overhead
✓ Make the tile 16×16
See NVVP occupancy
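The tiled shared-memory version can be sketched as follows. A hedged sketch: `TILE` and the kernel name are illustrative, and the `+1` padding anticipates the bank-conflict discussion below.

```cuda
#define TILE 32

// Tiled transpose through shared memory: each block stages a TILE x TILE
// tile, then writes it back transposed so that both the global read and
// the global write are coalesced. The +1 column of padding staggers the
// tile across shared-memory banks to avoid bank conflicts.
__global__ void transpose_tiled(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read

    __syncthreads();

    // Swap block indices so the tile lands in its transposed position.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

The strided access is confined to fast shared memory, where it is cheap, while every global access stays coalesced.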
OCCUPANCY (CONT.2)
Shared memory bank conflict: shared memory is organized in banks; when threads of a warp access different addresses in the same bank, the accesses serialize.
OPTIMIZE
Computation
COMPUTATION
OPTIMIZATION
Minimize time spent at barriers
Minimize thread divergence
Math Optimization
Streams
MINIMIZE THREAD DIVERGENCE

THREADS WARP
(Diagram: threads 0, 1, 2, …, 31 form a warp; all 32 reach if(cond){ … } together, and the threads where cond is false sit idle while the rest execute the body.)
BRANCH DIVERGENCE
Q: What is the maximum branch-divergence penalty for a CUDA thread block with 1024 threads?
A: 32× slowdown. Divergence happens within a warp of 32 threads, so at worst the 32 threads of a warp take different paths through if(cond){ … } else { … } and execute serially.
QUIZ
How many times slower does each kernel run due to branch divergence?

❑ 32
__global__ void foo() { switch (threadIdx.x % 32) { case 0: … case 31: … } }
kernel<<<1024, 1>>>();

❑ 32
__global__ void foo() { switch (threadIdx.x % 64) { case 0: … case 63: … } }
kernel<<<1024, 1>>>();
(a warp holds only 32 threads, so divergence cannot exceed 32-way)

❑ 1
__global__ void foo() { switch (threadIdx.y % 32) { case 0: … case 31: … } }

(Diagram: a 64×16 block; each row holds threadIdx.x = 0…63 at a fixed threadIdx.y, so threadIdx.y is constant within each warp and the last kernel does not diverge.)
QUIZ (CONT.2)
CUDA assigns thread IDs as follows:
  x varies fastest
  y varies slower
  z varies slowest
(Diagram: a 16×16 block; each row holds threadIdx.x = 0…15 at a fixed threadIdx.y.)
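The ordering can be modeled in plain C++ (an illustrative helper, not device code; `linear_id` is an assumed name):

```cpp
// CUDA linearizes thread IDs within a block with x varying fastest,
// then y, then z: id = x + y*dimX + z*dimX*dimY.
int linear_id(int x, int y, int z, int dimX, int dimY) {
    return x + y * dimX + z * dimX * dimY;
}
```

Consecutive linear IDs are grouped into warps of 32, which is why threadIdx.x determines divergence in the quiz above.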
LOOP DIVERGENCE
__global__ void foo() {
    for (int i = 0; i < threadIdx.x % 32; i++)   // A
        bar();
}
Threads in the same warp run the loop a different number of times, so the whole warp waits for its slowest thread: up to 32-way divergence at line A.
REAL WORLD DIVERGENCE EXAMPLE
Operating on a 1024×1024 image with special handling of pixels on the boundary: only the warps covering boundary pixels diverge, a small fraction of the whole image.
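A hedged sketch of such a kernel; the blur stencil and all names here are hypothetical, chosen only to show where the boundary branch sits:

```cuda
// Only threads on the image edge take the special-case branch, so at
// most the warps covering the boundary diverge -- a tiny fraction of
// the 1024 x 1024 image.
__global__ void blur(const float *in, float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    if (x == 0 || y == 0 || x == w - 1 || y == h - 1)
        out[y * w + x] = in[y * w + x];   // boundary: copy through
    else
        out[y * w + x] = 0.25f * (in[y * w + x - 1] + in[y * w + x + 1] +
                                  in[(y - 1) * w + x] + in[(y + 1) * w + x]);
}
```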
MATH OPTIMIZATION
Use double precision only when needed
  64-bit math is slower than 32-bit math
  float a = b + 2.5 is slower than float a = b + 2.5f (the un-suffixed literal promotes the arithmetic to double)
Use intrinsics when possible
  intrinsic versions of sin(), sqrt(), etc. are faster than the math.h versions, at a cost of 2–3 bits of precision
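The literal-suffix point can be verified on the host with a small C++ sketch; the same promotion rule applies in device code:

```cpp
#include <cstddef>

// b + 2.5 mixes a float with a double literal, so the whole expression
// is evaluated in double precision; the 'f' suffix keeps it in float.
float b = 1.0f;
const std::size_t promoted = sizeof(b + 2.5);   // size of a double expression
const std::size_t kept     = sizeof(b + 2.5f);  // size of a float expression
```

On the GPU the promoted form silently pulls in the slower 64-bit units.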
HOST-GPU INTERACTION
Swap-out problem: pageable host memory may be swapped out by the OS, so transfers to the GPU must go through extra staging.
(Diagram: CPU and host memory connected to the GPU and GPU memory over PCIe Gen2, ~6 GB/s.)
HOST-GPU INTERACTION OPTIMIZATION
Use pinned (page-locked) host memory:
  cudaMallocHost() or cudaHostRegister() instead of malloc()
If you are using pinned memory:
  use cudaMemcpyAsync() instead of cudaMemcpy()
  this helps free the CPU: the call returns immediately
  but you need to define a stream
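Putting the two bullets together; a hedged host-side sketch with hypothetical names (`h_buf`, `d_buf`, `copy_async`):

```cuda
#include <cuda_runtime.h>

// Pinned host buffer + asynchronous copy on a stream: the CPU is free
// to do other work until the explicit synchronize.
void copy_async(int n) {
    float *h_buf, *d_buf;
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMallocHost(&h_buf, n * sizeof(float));   // page-locked, not malloc
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, s);  // returns immediately
    // ... CPU work can overlap with the transfer here ...
    cudaStreamSynchronize(s);                    // h_buf safe to reuse now

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(s);
}
```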
STREAMS
Stream: a sequence of operations (memory transfers, kernels) that execute in order.
Ex.
  Default stream (s0):      Multiple streams:
  cudaMemcpyAsync(..);      cudaMemcpyAsync(.., s1);
  A<<<..>>>();              A<<<.., s2>>>(..);
  cudaMemcpyAsync(..);      B<<<.., s4>>>(..);
  B<<<..>>>();              cudaMemcpyAsync(.., s3);
✓ Operations placed in different streams may overlap and run concurrently.
QUIZ [CONT.]
If all operations take exactly 1 second, how long does it take to complete the multi-stream sequence?
Answer: 1 second, but with wrong values (nothing synchronizes the overlapped copies with the kernels that use their data).
STREAMS (CONT.)
Old GPUs: break a big amount of data into chunks
  too much overhead to transfer all the data back and forth at once (GPU memory ↔ CPU memory)
Streams are good with big amounts of data:
  run chunk transfers and kernels in an asynchronous manner
  overlap memory transfers with compute
  fill the GPU with smaller kernels
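The chunking pattern above can be sketched as follows. A hedged sketch: `process`, `run_chunked`, and the chunk geometry are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n);   // hypothetical kernel

// Pipeline H2D copies and kernels over two streams so each chunk's
// transfer overlaps the previous chunk's compute.
void run_chunked(float *h, float *d, int total, int chunk) {
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int i = 0; i < total / chunk; ++i) {
        cudaStream_t s = streams[i % 2];    // alternate between streams
        cudaMemcpyAsync(d + i * chunk, h + i * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, s);
        process<<<(chunk + 255) / 256, 256, 0, s>>>(d + i * chunk, chunk);
    }
    cudaDeviceSynchronize();                // wait for all streams
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}
```

Within each stream the copy still precedes its kernel, so correctness is preserved while the two streams' work interleaves.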
APOD
Analyze → Parallelize → Optimize → Deploy
PROBLEM SET
Optimize the Histogram program in lecture 3.
(Histogram figure: bar heights 38, 34, 16, 12.)