Performance Considerations
Frauke Sprengel
Outline
1 Warps
    Warps as Scheduling Units
    Streaming Processors as SIMD Processors
    Control Divergence
    Control Divergence Examples
5 Parallel Reduction
    Combining Neighboring Elements
    Combining Blocks of Elements
6 Guidelines
In this chapter, we take a closer look at the performance of CUDA kernels
and discuss methods to improve it.
In the second part, we introduce the concept of parallel reduction. CUDA's
concept of warps plays a central role in these considerations.
Warps as Scheduling Units

[Figure 4.1: Threads in blocks are partitioned into warps — the threads t0, t1, …, t31 of each block form consecutive 32-thread warps (Nvidia 2016a).]

• Each block is divided into 32-thread warps
  – An implementation technique, not part of the CUDA programming model
• Warps are scheduling units in an SM
• Threads in a warp execute in Single Instruction Multiple Data (SIMD) manner
• The number of threads in a warp may vary in future generations
Organisation of Threads in Warps

• Problem: The threads of the active warp are waiting for the result of a device memory access.
• Solution: Another warp whose threads have all required data in registers or shared memory is selected for execution instead.
• The first warp resumes later, once the device memory access has finished.

In general, the more warps an SM has to choose from, the better memory latency can be hidden.
Warps in Multi-dimensional Thread Blocks

The threads of a multi-dimensional block are first linearized into 1D in row-major order (x-dimension first, y-dimension next, z-dimension last) before being partitioned into warps.
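As a small illustration (the kernel and array name are not from the slides; the index formula is standard CUDA), the linearized index and the warp a thread belongs to can be computed as follows:

__global__ void warpIdDemo(int *warpOfThread) {
    // Row-major linearization: x varies fastest, then y, then z.
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    // Consecutive groups of 32 linearized threads form one warp
    // (indexing is per block; launch a single block for the demo).
    warpOfThread[linear] = linear / 32;
}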
Streaming Processors as SIMD Processors

• The control unit for instruction fetch, decode, and control is shared among multiple processing units
• Control overhead is minimized (Module 1)

[Figure: block diagram of a processor (SM) — several processing units, each with an ALU and register file, a shared memory, and one shared control unit (PC, IR) connected to memory and I/O.]
SIMD Execution Among Threads in a Warp

• All threads in a warp must execute the same instruction at any point in time
• This works efficiently if all threads follow the same control flow path
  – All if-then-else statements make the same decision
  – All loops iterate the same number of times
Control Divergence

Control divergence occurs when threads in a warp take different control flow paths by making different control decisions:

• Some threads take the then-path and others the else-path of an if statement.
• Some threads execute a different number of loop iterations than others.

When a warp diverges, the hardware executes each path in turn, disabling the threads on the other path, which costs performance.
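To make this concrete, a small sketch (not from the original slides; the kernel is illustrative): a branch on threadIdx.x can diverge within a warp, while a condition that is uniform across each warp does not:

__global__ void divergenceDemo(float *out) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    // Divergent: even and odd lanes of the same warp take different
    // paths, so the warp executes both paths one after the other.
    if (threadIdx.x % 2 == 0) out[i] = 1.0f;
    else                      out[i] = 2.0f;
    // No divergence: all 32 lanes of a warp share the same value of
    // threadIdx.x / 32, so each warp takes a single path.
    if ((threadIdx.x / 32) % 2 == 0) out[i] += 10.0f;
}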
Example: Vector Addition

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
Analysis for a vector size of 1000 elements: with, e.g., 256-thread blocks, 4 blocks (1024 threads in 32 warps) are launched. Only the last warp, covering threads 992–1023, diverges at the boundary check i < n: threads 992–999 take the then-path, threads 1000–1023 do not. All other warps take the same path.
Number of Warps
• 4 × 4: 16 threads per block, at most 32 blocks per SM
→ 512 threads in 32 warps, each of which is only half full
• 8 × 8: 64 threads per block, at most 2048/64 = 32 blocks per SM
→ 2048 threads in 64 warps
• 16 × 16: 256 threads per block, at most 2048/256 = 8 blocks
→ 2048 threads in 64 warps
• 32 × 32: 1024 threads per block, at most 2048/1024 = 2 blocks
→ 2048 threads in 64 warps
• 64 × 64: not allowed, would require 4096 threads per block
If you look only at the total number of warps, it does not matter whether you choose 8 × 8, 16 × 16, or 32 × 32. With respect to control divergence, however, smaller blocks can be preferable.
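The limits assumed above (2048 threads and 32 blocks per SM, 1024 threads per block) depend on the compute capability; a minimal sketch querying them at runtime with the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}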
Registers

The number of registers required by each thread can limit the number of blocks that can run on a multiprocessor, since all threads resident on an SM share its register file.
Shared memory
In a similar way, the amount of shared memory required by each block can
limit the number of blocks that can run on a multiprocessor.
As an aid for determining the optimal grid and block size, a CUDA occupancy calculator (an Excel sheet) is provided at
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
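Occupancy can also be queried programmatically; a minimal sketch (the kernel and block size are illustrative) using the CUDA runtime occupancy API:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

int main() {
    int numBlocks;        // resident blocks per SM for this configuration
    int blockSize = 256;  // threads per block (example value)
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel, blockSize, 0 /* dynamic shared memory */);
    printf("Resident blocks per SM: %d\n", numBlocks);
    return 0;
}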
DRAM Burst – A System View

Each DRAM access delivers a burst of data from a range of consecutive addresses; memory bandwidth is used efficiently only when the full burst is consumed.
Memory Coalescing

When the threads of a warp access consecutive device memory locations, the hardware coalesces the accesses into a single burst.

[Figure: coalesced loads — in each load iteration, threads T0–T3 access four consecutive elements of a 16-element array (0–15), so each warp access maps onto one burst.]
Un-coalesced Accesses

[Figure: un-coalesced loads — in each load iteration, threads T0–T3 access elements that are not adjacent in the 16-element array (0–15), so the accesses cannot be combined into a single burst.]
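A hedged sketch of both patterns in kernel code (array and parameter names are illustrative; the input is assumed large enough for the strided read):

__global__ void accessPatterns(const float *data, float *out, int stride) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    // Coalesced: consecutive threads read consecutive addresses,
    // so one warp access is served by one burst.
    float a = data[i];
    // Un-coalesced: consecutive threads read addresses `stride`
    // elements apart, so a warp touches many bursts.
    float b = data[i * stride];
    out[i] = a + b;
}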
Linearized 2D C Arrays

A 2D C array is stored in linear memory space in row-major order: element A[i][j] of an m × n array lies at linear offset i · n + j.
Column-wise Matrix Access
Row-wise Matrix Access
Memory Coalescing in Matrix Multiplication?

Is memory coalescing possible in the matrix-matrix multiplication kernel that uses device memory?

int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
// ...
for (int i = 0; i < k; i++) {
    multVal += A[row * k + i] * B[i * k + col];
}
[Figure 4.10: Two access patterns of basic matrix multiplication (Nvidia 2016a) — each thread traverses a row of A via A[Row*n+i] and a column of B via B[i*k+Col], where i is the loop counter of the inner-product loop.]
Column-wise Access for B

B accesses are coalesced.

[Figure: the linearized array B0,0 … B3,3 — in each load iteration, threads T0–T3 access four consecutive elements of one row of B, i.e., adjacent memory locations.]
Row-wise Access for A

A accesses are not coalesced.

[Figure: the linearized array A0,0 … A3,3 — in load iteration 0, threads T0–T3 access A0,0, A1,0, A2,0, A3,0, which lie a full row apart in memory; the same holds in load iteration 1.]
Memory Coalescing in Matrix Multiplication?

Is memory coalescing possible in the matrix-matrix multiplication kernel that uses shared memory?

int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
int rowIdxBlock = threadIdx.y;
int colIdxBlock = threadIdx.x;
// ...
for (int l = 0; l < gridSize; l++) {
    srcAShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] =
        A[row * size + l * BLOCK_SIZE + colIdxBlock];
    srcBShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] =
        B[(l * BLOCK_SIZE + rowIdxBlock) * size + col];
    // ...
}

Yes: in both loads the fastest-varying thread index (threadIdx.x, via colIdxBlock and col) addresses consecutive memory locations, so both accesses are coalesced.
Using Shared Memory

Have each thread load an A element and a B element at the same relative position as its C element.

With tx = threadIdx.x and ty = threadIdx.y, tile 0 is accessed with 2D indexing as A[Row][tx] and B[ty][Col].

[Figure 4.13: Loading an input tile into shared memory — matrices A (m × n), B (n × k), and C (m × k), with the tile positioned at row Row and column Col (Nvidia 2016a).]
Corner Turning

[Figure: corner turning — in the original access pattern, d_M and d_N are read from device memory directly; in the tiled access pattern, tiles of both matrices are first copied into shared memory with coalesced accesses, and the multiplication is then performed on the shared memory values.]
Shared Memory Banks
Loop Unrolling

A loop requires control instructions and (at least in a for loop) one register for the loop counter, as in the matrix-matrix multiplication example.

for (int k = 0; k < BLOCK_SIZE; k++) {
    multVal +=
        srcAShared[rowIdxBlock * BLOCK_SIZE + k] *
        srcBShared[k * BLOCK_SIZE + colIdxBlock];
}
For loops with a fixed number of iterations, loop unrolling might increase performance.

multVal +=
      srcAShared[rowIdxBlock * BLOCK_SIZE + 0] *
      srcBShared[0 * BLOCK_SIZE + colIdxBlock]
    + srcAShared[rowIdxBlock * BLOCK_SIZE + 1] *
      srcBShared[1 * BLOCK_SIZE + colIdxBlock]
    + srcAShared[rowIdxBlock * BLOCK_SIZE + 2] *
      srcBShared[2 * BLOCK_SIZE + colIdxBlock]
    // ...
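Instead of unrolling by hand, the compiler can be asked to do it; a short sketch applying CUDA's #pragma unroll directive to the loop from the previous slide (same variable names as above):

// Request full unrolling; nvcc replaces the fixed-trip-count loop
// with straight-line code like the hand-unrolled version above.
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
    multVal += srcAShared[rowIdxBlock * BLOCK_SIZE + k] *
               srcBShared[k * BLOCK_SIZE + colIdxBlock];
}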
Data Prefetching

Data prefetching overlaps memory latency with computation: the data needed for the next loop iteration is loaded (e.g., into registers) while the current iteration is still being processed.
Parallel Reduction

Reduction algorithms derive a single value from an array, e.g., the sum, the maximum, or the minimum of its elements.

In parallel programming, access to the single result value has to be synchronized.
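As an illustration of such synchronization (a sketch, not from the slides), a naive sum reduction can let every thread add its element to a single result with CUDA's atomicAdd; the serialization this causes is exactly what the tree-shaped reductions below avoid:

__global__ void naiveSum(const float *in, float *result, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        atomicAdd(result, in[i]);  // correct, but serializes on `result`
}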
Combining Neighboring Elements

extern __shared__ float partialSum[];
// copy array to partialSum
int tx = threadIdx.x;
for (int stride = 1; stride < blockDim.x; stride <<= 1) {
    __syncthreads();
    if (tx % (2 * stride) == 0) {
        partialSum[tx] += partialSum[tx + stride];
    }
}

In each round, the active threads (tx divisible by 2 · stride) are scattered across the block, so almost every warp contains both active and inactive threads — i.e., control divergence in every round.
Combining Blocks of Elements

extern __shared__ float partialSum[];
// copy array to partialSum
int tx = threadIdx.x;
for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
    __syncthreads();
    if (tx < stride) {
        partialSum[tx] += partialSum[tx + stride];
    }
}

Here the active threads (tx < stride) are contiguous, so as long as stride is a multiple of the warp size, each warp is either entirely active or entirely inactive and no control divergence occurs.
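On recent GPUs the warp-internal steps of such a reduction can also be written with warp shuffle intrinsics, avoiding shared memory and __syncthreads() inside a warp; a sketch assuming CUDA 9 or later (the function name is illustrative):

// Sum reduction within one warp: in each step, every lane adds the
// value held by the lane `offset` positions above it.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up with the sum of all 32 lanes
}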
Guidelines