
Visual Computing – GPU Computing

Performance Considerations

Frauke Sprengel
Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines

In this chapter, we take a closer look at the performance of CUDA kernels
and discuss methods to improve it.
In the second part, we introduce the concept of parallel reduction. CUDA's
concept of warps plays a central role in these considerations.

Outline
1 Warps
Warps as Scheduling Units
Streaming Processors as SIMD Processors
Control Divergence
Control Divergence Examples

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines

Warps as Scheduling Units

Figure 4.1: Threads in Warps in Blocks (Nvidia 2016a).

• Each block is divided into 32-thread warps.
• An implementation technique, not part of the CUDA programming model
• Warps are scheduling units in the SM.
• Threads in a warp execute in Single Instruction Multiple Data (SIMD)
manner.
• The number of threads in a warp may vary in future generations.

Organisation of Threads in Warps

A block of n threads is divided into n/32 warps of 32 threads each. In this
way, memory latency is hidden as follows.

• Problem: The threads of the active warp are waiting for the results of a
device memory access.
• Solution: Another warp whose threads have all required data in registers
or shared memory is selected for execution instead.
• The first warp is resumed later, when the device memory access has
finished.

In general, the larger the number of warps, the better memory latency can
be hidden.

Organisation of Threads in Warps

• Example: A block of 512 threads is divided into 512/32 = 16 warps.


• If the number of threads within a block is not a multiple of 32, dummy
threads are added in the last warp.
• The division of threads into warps is deterministic:
• For a 1D block layout, warp 0 contains threads 0 to 31, warp 1 contains
threads 32 to 63, etc.
• For 2D or 3D block layouts, the threads are treated linearly with x as
innermost and z as outermost index.

Warps in Multi-dimensional Thread Blocks
The threads of a block are first linearized into 1D in row-major order
(x-dimension first, y-dimension next, and z-dimension last).

Figure 4.2: Organisation of Threads in Warps (Kirk and Hwu 2010).
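
As an illustration (a sketch, not part of the original slides), the linear thread
index and the resulting warp and lane indices for a multi-dimensional block
can be computed on the device as follows; warpPosition is a hypothetical
helper name.

__device__ void warpPosition(int *warpId, int *laneId)
{
    // Linearize according to the rule above: x innermost, then y, then z.
    int linearTid = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;
    *warpId = linearTid / warpSize;   // which warp of the block
    *laneId = linearTid % warpSize;   // position within that warp
}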

Organisation of Threads in Warps

• The partitioning scheme is consistent across devices.
• Thus you can use this knowledge in control flow.
• However, the exact size of warps may change from generation to
generation.
• DO NOT rely on any ordering within or between warps.
• If there are any dependencies between threads, you must __syncthreads()
to get correct results.

Streaming Processors as SIMD Processors

• Control unit for instruction fetch, decode, and control is shared among
multiple processing units.
• Control overhead is minimized (Module 1).

Figure 4.3: Streaming Processors are SIMD Processors (Nvidia 2016a).

SIMD Execution Among Threads in a Warp

• All threads in a warp must execute the same instruction at any point in
time
• This works efficiently if all threads follow the same control flow path
• All if-then-else statements make the same decision
• All loops iterate the same number of times

Control Divergence
Control divergence occurs when threads in a warp take different control flow
paths by making different control decisions.
• Some take the then-path and others take the else-path of an if-
statement.
• Some threads take a different number of loop iterations than others.

The execution of threads taking different paths is serialized on current
GPUs.
• The control paths taken by the threads in a warp are traversed one at a
time until there are no more.
• During the execution of each path, all threads taking that path will be
executed in parallel.
• The number of different paths can be large when considering nested
control flow statements.
Control Divergence Examples

Divergence can arise when a branch or loop condition is a function of thread
indices:
• Example kernel statement with divergence:
• if (threadIdx.x > 2) { }
• This creates two different control paths for threads in a block.
• Decision granularity < warp size; threads 0, 1, and 2 follow a different
path than the rest of the threads in the first warp.
• Example without divergence:
• if (blockIdx.x > 2) { }
• Decision granularity is a multiple of the block size; all threads in any given
warp follow the same path.
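
The two cases can be contrasted in a small sketch (hypothetical kernel, not
from the slides):

__global__ void divergenceExample(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: within each warp, threads 0-2 take a different path
    // than the remaining threads, so both paths are serialized.
    if (threadIdx.x > 2) {
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }

    // Not divergent: the condition is uniform within a warp, because
    // all threads of a warp share the same blockIdx.x.
    if (blockIdx.x > 2) {
        data[i] -= 1.0f;
    }
}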

Example: Vector Addition

Kernel code for vector addition


// Compute vector sum C = A + B
// Each thread performs one pair-wise addition

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

Example: Vector Addition
Analysis for a vector size of 1 000 elements

• Assume that the block size is 256 threads (8 warps in each block).
• All threads in Blocks 0, 1, and 2 are within the valid range.
• i values from 0 to 767
• There are 24 warps in these three blocks, none will have control divergence.
• Most warps in Block 3 will not have control divergence.
• Threads in warps 0–6 are all within the valid range, thus no control
divergence.
• One warp in Block 3 will have control divergence.
• Threads with i values 992–999 will all be within the valid range.
• Threads with i values 1000–1023 will be outside the valid range.
• The effect of serialization due to control divergence will be small.
• Only 1 out of 32 warps has control divergence (1/32 ≈ 3.1 %).
• The impact on performance will likely be less than 3%.

Outline
1 Warps

2 Organization of Threads in Warps in Blocks


Number of Warps
Number of Registers
Amount of Shared Memory

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines
Number of Warps

Example: For matrix-matrix multiplication, should we use blocks of dimension


4 × 4, 8 × 8, 16 × 16, 32 × 32, 64 × 64?
Assume we are given a GPU of compute capability 5.0 with the following
restrictions:

Maximum number of threads per block 1024


Maximum number of blocks per multiprocessor 32
Maximum number of threads per multiprocessor 2048
Number of registers per block 65536
Size of shared memory per block 49152 bytes

Warps
• 4 × 4: 16 threads per block, at most 32 blocks per SM
→ 512 threads in 32 warps, each of which is only half full
• 8 × 8: 64 threads per block, at most 2048/64 = 32 blocks per SM
→ 2048 threads in 64 warps
• 16 × 16: 256 threads per block, at most 2048/256 = 8 blocks
→ 2048 threads in 64 warps
• 32 × 32: 1024 threads per block, at most 2048/1024 = 2 blocks
→ 2048 threads in 64 warps
• 64 × 64: not allowed, would require 4096 threads per block

If you look only at the largest number of warps, it doesn't matter whether
you choose 8 × 8, 16 × 16, or 32 × 32. Concerning control divergence, it can
be better to take smaller blocks.
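
For illustration (a sketch, not from the slides; matMulKernel and the device
pointers are assumed names), the 16 × 16 choice expressed as a launch
configuration:

#define BLOCK_SIZE 16

// 16 x 16 = 256 threads per block, i.e. 8 full warps per block.
dim3 block(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid((n + BLOCK_SIZE - 1) / BLOCK_SIZE,
          (n + BLOCK_SIZE - 1) / BLOCK_SIZE);
matMulKernel<<<grid, block>>>(d_A, d_B, d_C, n);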

Registers

Besides the number of warps, the number of required registers has to be


taken into account. The registers of a multiprocessor are divided between the
blocks, and within the blocks between the threads. For the time of execution
of a kernel, each thread has its own exclusive set of registers.

Registers

Example: If we choose blocks of dimension 16 × 16 for a kernel which requires


r registers per thread, how many blocks and threads can run on each
multiprocessor?
As stated above, a GPU of compute capability 5.0 has 65536 registers per
multiprocessor.

Registers

• r = 32: each block requires 32 · 256 = 8192 registers


→ ⌊65536/8192⌋ = 8 blocks with 8 · 256 = 2048 threads can run on
each multiprocessor
• r = 33: each block requires 33 · 256 = 8448 registers
→ ⌊65536/8448⌋ = 7 blocks – in reality only 6, since registers are
allocated by warps – with 6 · 256 = 1536 threads can run on each
multiprocessor

Increasing the number of required registers by 1 thus decreases the possible


number of threads per multiprocessor by 1/4.
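
As a hedged sketch (not from the slides): the register budget of a kernel can
be influenced at compile time, e.g. with the __launch_bounds__ qualifier or
the nvcc option -maxrregcount, so that the desired number of blocks remains
resident per multiprocessor.

// Ask the compiler to keep register usage low enough that 8 blocks of
// 256 threads can be resident per multiprocessor (kernel body as before).
__global__ void __launch_bounds__(256, 8)
matMulKernel(const float *A, const float *B, float *C, int n)
{
    // ...
}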

Shared memory

In a similar way, the amount of shared memory required by each block can
limit the number of blocks that can run on a multiprocessor.
As an aid for determining the optimal grid and block size, a CUDA
occupancy calculator (in terms of an Excel sheet) is provided at
http://developer.download.nvidia.com/compute/cuda/CUDA_
Occupancy_calculator.xls
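
Alternatively (an illustrative sketch, not part of the slides), the occupancy for
a given kernel and block size can be queried at runtime with the CUDA
runtime API; matMulKernel is the hypothetical kernel from above.

#include <cstdio>
#include <cuda_runtime.h>

// How many blocks of the given size can be resident per multiprocessor,
// taking the kernel's register and shared memory usage into account?
void printOccupancy(int blockSize, size_t dynamicSmemBytes)
{
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, matMulKernel, blockSize, dynamicSmemBytes);
    printf("Resident blocks per SM: %d (%d threads)\n",
           numBlocks, numBlocks * blockSize);
}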

Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited


DRAM Bursting
Device Memory Coalescing
Device Memory Coalescing for Matrices
Memory Coalescing in Matrix Multiplication?
Shared Memory Banks

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines

DRAM Bursting – A System View

Figure 4.4: Array with burst sections (Nvidia 2016a).

• Each address space is partitioned into burst sections.
• Whenever a location is accessed, all other locations in the same burst
section are also delivered to the processor.
• Basic example: a 16-byte address space, 4-byte burst sections.
• In practice, we have at least a 4 GB address space and burst section sizes
of 128 bytes or more.

Device Memory Coalescing

Figure 4.5: Memory Coalescing (Nvidia 2016a).

When all threads of a warp execute a load instruction and all accessed
locations fall into the same burst section, only one DRAM request will be
made and the access is fully coalesced.

Accesses in a warp are to consecutive locations if the index in an array
access is of the form
A[(expression with terms independent of threadIdx.x) + threadIdx.x];
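
Two hypothetical copy kernels (a sketch, not from the slides) contrast a
coalesced and an un-coalesced access pattern for a row-major n × n matrix:

// Coalesced: consecutive threads access consecutive addresses;
// the index has the form (terms independent of threadIdx.x) + threadIdx.x.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
        out[row * n + col] = in[row * n + col];
}

// Un-coalesced: consecutive threads are n elements apart, so their
// accesses spread across many burst sections.
__global__ void copyStrided(const float *in, float *out, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        out[row * n + col] = in[row * n + col];
}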
Un-coalesced Accesses

Figure 4.6: Un-coalesced Accesses (Nvidia 2016a).

When the accessed locations spread across burst section boundaries:
• Coalescing fails.
• Multiple DRAM requests are made.
• The access is not fully coalesced.
• Some of the bytes accessed and transferred are not used by the threads.

Device Memory Coalescing

The performance of device memory access can be increased by memory


coalescing, i. e., accessing consecutive addresses by consecutive threads.
Example: 32 consecutive threads (within a warp) access an array of 4-byte
float entries in device memory.

• With memory coalescing, one memory transaction of a 128-byte segment


(from compute capability 1.2) is required.
• Without memory coalescing, 32 independent memory transactions of
32-byte segments (which is the minimal size) are required.

Linearized Arrays

A 2D C array M is stored in linear memory space in row-major order
(linearized order in increasing address):
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3

Figure 4.7: A 2D C array in linear memory space (Nvidia 2016a).

Column-wise Matrix Access

Figure 4.8: Column-wise Matrix Access (Kirk and Hwu 2010).

Row-wise Matrix Access

Figure 4.9: Row-wise Matrix Access (Kirk and Hwu 2010).

Memory Coalescing in Matrix Multiplication?
Is memory coalescing possible in the matrix-matrix multiplication kernel that
uses device memory?
int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
// ...
for (int i = 0; i < k; i++) {
    multVal += A[row * k + i] * B[i * k + col];
}
Two access patterns of basic matrix multiplication: A[row * k + i] and
B[i * k + col], where i is the loop counter in the inner-product loop of the
kernel code.

Figure 4.10: Access patterns for matrix multiplication (Nvidia 2016a).
Column-wise Access for B

B accesses are coalesced: in each load iteration, consecutive threads
(T0–T3) access consecutive elements of one row of B.

Figure 4.11: B accesses are coalesced (Nvidia 2016a).

Row-wise Access for A

A accesses are not coalesced: in each load iteration, consecutive threads
(T0–T3) access elements of one column of A, which are far apart in the
linearized array.

Figure 4.12: A accesses are not coalesced (Nvidia 2016a).

Memory Coalescing in Matrix Multiplication?
Is memory coalescing possible in the matrix-matrix multiplication kernel that
uses shared memory?
int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
int rowIdxBlock = threadIdx.y;
int colIdxBlock = threadIdx.x;
// ...
for (int l = 0; l < gridSize; l++) {
    srcAShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] =
        A[row * size + l * BLOCK_SIZE + colIdxBlock];
    srcBShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] =
        B[(l * BLOCK_SIZE + rowIdxBlock) * size + col];
    // ...
}

Using Shared Memory

Have each thread load an A element and a B element at the same relative
position as its C element.
With tx = threadIdx.x and ty = threadIdx.y, accessing tile 0 uses the 2D
indexing A[Row][tx] and B[ty][Col].

Figure 4.13: Loading an input tile into shared memory (Nvidia 2016a).
Corner Turning

The original access pattern to d_M and d_N is replaced by a tiled access
pattern: the tiles are first copied into shared memory, and the multiplication
is then performed with the shared memory values.

Figure 4.14: Corner turning (Nvidia 2016a).

Shared Memory Banks

Memory considerations are also necessary to optimize the usage of shared
memory, which is divided into 16 or 32 banks (depending on the compute
capability) in an interleaved fashion, i. e., such that successive 32-bit words
are assigned to successive banks.
In this way, bank conflicts between threads within one warp can be avoided,
provided shared memory is accessed appropriately. In particular, the stride
between words in shared memory that are addressed by successive threads
must not have a common factor with the number of banks. For 16 = 2⁴ or
32 = 2⁵ banks this means that the stride has to be odd. Again we refer to the
CUDA C Programming Guide (NVIDIA 2015a), appendices G.2–5 with
figure 17, for details.
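
A common illustration (a sketch, not from the slides) is padding a shared
memory tile by one element, so that accesses with a stride equal to the tile
width hit different banks; transposeTile and TILE are assumed names.

#define TILE 32

__global__ void transposeTile(const float *in, float *out, int n)
{
    // Without the +1 padding, reading tile[threadIdx.x][threadIdx.y]
    // below would use a stride of 32 words between the threads of a
    // warp, so they would all hit the same bank; a stride of 33 (odd)
    // has no common factor with the number of banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}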

Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance


Loop Unrolling
Data Prefetching

5 Parallel Reduction

6 Guidelines
Loop Unrolling

A loop requires control instructions and (at least in a for loop) one register
for the count variable, as in the matrix-matrix multiplication example.
for (int k = 0; k < BLOCK_SIZE; k++) {
    multVal +=
        srcAShared[rowIdxBlock * BLOCK_SIZE + k] *
        srcBShared[k * BLOCK_SIZE + colIdxBlock];
}

Loop Unrolling
For loops with a fixed number of iterations, loop unrolling might increase the
performance.
multVal +=
      srcAShared[rowIdxBlock * BLOCK_SIZE + 0] *
      srcBShared[0 * BLOCK_SIZE + colIdxBlock]
    + srcAShared[rowIdxBlock * BLOCK_SIZE + 1] *
      srcBShared[1 * BLOCK_SIZE + colIdxBlock]
    + srcAShared[rowIdxBlock * BLOCK_SIZE + 2] *
      srcBShared[2 * BLOCK_SIZE + colIdxBlock]
    // ...

By default all loops with a fixed number of iterations are automatically


unrolled by nvcc.
The loop unrolling for any given loop can be controlled by the #pragma unroll
directive (cf. CUDA C Programming Guide (NVIDIA 2015a)).
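
A minimal sketch (assuming the same shared memory tiles as above) of
requesting unrolling explicitly:

// Fully unroll the inner-product loop over the tile; an integer argument
// such as  #pragma unroll 4  would request partial unrolling instead.
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
    multVal += srcAShared[rowIdxBlock * BLOCK_SIZE + k] *
               srcBShared[k * BLOCK_SIZE + colIdxBlock];
}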
Data Prefetching

In the matrix-matrix multiplication kernel that uses shared memory, the
synchronization statements have the effect that all threads access device
memory at the same time (lines 2 and 3).
1 for (int l = 0; l < gridSize; l++) {
2     // copy matrix blocks concurrently from device
3     // memory to shared memory
4     __syncthreads();
5     // multiply current matrix blocks in shared memory
6     __syncthreads();
7 }

Data Prefetching

// copy first matrix blocks (l = 0) concurrently from
// device memory to registers
for (int l = 0; l < gridSize; l++) {
    // copy current matrix blocks (l) concurrently from
    // registers to shared memory
    __syncthreads();
    // copy next matrix blocks (l + 1) concurrently from
    // device memory to registers
    // multiply current matrix blocks (l) in shared memory
    __syncthreads();
}
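
A hedged sketch of how this pattern could look in the tiled kernel from above
(the register variables aReg and bReg are assumptions, not from the slides):

// Prefetch the first A and B tile elements into registers (l = 0).
float aReg = A[row * size + colIdxBlock];
float bReg = B[rowIdxBlock * size + col];

for (int l = 0; l < gridSize; l++) {
    // Stage the prefetched elements in shared memory.
    srcAShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] = aReg;
    srcBShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] = bReg;
    __syncthreads();

    // Prefetch the next tiles (l + 1) while the current ones are used.
    if (l + 1 < gridSize) {
        aReg = A[row * size + (l + 1) * BLOCK_SIZE + colIdxBlock];
        bReg = B[((l + 1) * BLOCK_SIZE + rowIdxBlock) * size + col];
    }

    // Multiply the current matrix blocks held in shared memory.
    for (int k = 0; k < BLOCK_SIZE; k++)
        multVal += srcAShared[rowIdxBlock * BLOCK_SIZE + k] *
                   srcBShared[k * BLOCK_SIZE + colIdxBlock];
    __syncthreads();
}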

Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction
Combining Neighboring Elements
Combining Blocks of Elements

6 Guidelines
Parallel Reduction

Reduction algorithms extract a single value from an array, which is used for
many purposes, e. g.,

• determining the maximal or minimal array value,


• computing the sum of all array values,
• computing the average array value,
• computing the inner product (scalar product) of two vectors.

In parallel programming, the access to the single value for the result has to be
synchronized.

Parallel Reduction

Parallel reduction uses a divide-and-conquer approach known from soccer
tournaments (knockout system).

• n teams are divided into n/2 pairs.


• First round: n/2 pairs play in parallel, n/2 winners advance to second
round.
• Second round: n/4 pairs play in parallel, n/4 winners advance to third
round, etc.
• Last round: one pair plays the final, the winner wins the tournament.

Combining Neighboring Elements

Figure 4.15: Combining Neighboring Elements (Kirk and Hwu 2010).

Combining Neighboring Elements

extern __shared__ float partialSum[];
// copy array to partialSum
int tx = threadIdx.x;
for (int stride = 1; stride < blockDim.x; stride <<= 1) {
    __syncthreads();
    if (tx % (2 * stride) == 0) {
        partialSum[tx] += partialSum[tx + stride];
    }
}

Combining Blocks of Elements

Figure 4.16: Combining Blocks of Elements (Kirk and Hwu 2010).

extern __shared__ float partialSum[];
// copy array to partialSum
int tx = threadIdx.x;
for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
    __syncthreads();
    if (tx < stride) {
        partialSum[tx] += partialSum[tx + stride];
    }
}
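
A hedged sketch of how this block-wise reduction could be wrapped into a
complete kernel and launched (blockSumKernel and the host-side combination
of the partial sums are assumptions, not from the slides):

// Each block reduces blockDim.x elements of in[] to one partial sum.
// Assumes blockDim.x is a power of two.
__global__ void blockSumKernel(const float *in, float *blockSums, int n)
{
    extern __shared__ float partialSum[];
    int tx = threadIdx.x;
    int i  = blockIdx.x * blockDim.x + tx;
    partialSum[tx] = (i < n) ? in[i] : 0.0f;   // pad the last block with zeros

    for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (tx < stride)
            partialSum[tx] += partialSum[tx + stride];
    }
    if (tx == 0) blockSums[blockIdx.x] = partialSum[0];
}

// Host side: one partial sum per block, combined on the CPU
// (or by a second kernel launch for very large arrays):
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   blockSumKernel<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_sums, n);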

Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines

Guidelines

• A large number of warps per multiprocessor is advantageous for hiding
memory latency, but the available number of registers and the size of
shared memory pose further constraints.
• Memory coalescing for device memory and bank access for shared
memory should be kept in mind when accessing arrays by successive
threads.
• Minor performance improvement might be possible by loop unrolling,
data prefetching, and selecting the thread granularity, i. e., deciding
between a lot of threads with a small number of operations and fewer
threads with a larger number of operations.
• In parallel reduction algorithms, the order of combining array elements is
important for the overall performance.

