
Visual Computing – GPU Computing

Performance Considerations

Frauke Sprengel
Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines

In this chapter, we take a closer look at the performance of CUDA kernels
and discuss methods to improve it.
In the second part, we introduce the concept of parallel reduction. CUDA's
concept of warps plays a central role in these considerations.

Outline
1 Warps
Warps as Scheduling Units
Streaming Processors as SIMD Processors
Control Divergence
Control Divergence Examples

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines

Warps as Scheduling Units

Figure 4.1: Threads in Warps in Blocks (Nvidia 2016a).

• Each block is divided into 32-thread warps.
• An implementation technique, not part of the CUDA programming model
• Warps are scheduling units in the SM.
• Threads in a warp execute in Single Instruction Multiple Data (SIMD)
manner.
• The number of threads in a warp may vary in future generations.

Organisation of Threads in Warps

A block of n threads is divided into n/32 warps of 32 threads each. In this
way, memory latency is hidden as follows.

• Problem: The threads of the active warp are waiting for the results of a
device memory access.
• Solution: Another warp whose threads have all required data in registers
or shared memory is selected for execution instead.
• The first warp is resumed later, when the device memory access has
finished.

In general, the larger the number of warps, the better memory latency can
be hidden.

Organisation of Threads in Warps

• Example: A block of 512 threads is divided into 512/32 = 16 warps.


• If the number of threads within a block is not a multiple of 32, dummy
threads are added in the last warp.
• The division of threads into warps is deterministic:
• For a 1D block layout, warp 0 contains threads 0 to 31, warp 1 contains
threads 32 to 63, etc.
• For 2D or 3D block layouts, the threads are treated linearly with x as
innermost and z as outermost index.

Warps in Multi-dimensional Thread Blocks
The threads of a block are first linearized into 1D in row-major order
(x-dimension first, y-dimension next, and z-dimension last).

Figure 4.2: Organisation of Threads in Warps (Kirk and Hwu 2010).
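
As an illustration (a sketch, not part of the original slides), the linear thread
index and the resulting warp and lane indices for a multi-dimensional block
can be computed on the device as follows; warpPosition is a hypothetical
helper name.

__device__ void warpPosition(int *warpId, int *laneId)
{
    // Linearize according to the rule above: x innermost, then y, then z.
    int linearTid = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;
    *warpId = linearTid / warpSize;   // which warp of the block
    *laneId = linearTid % warpSize;   // position within that warp
}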

Organisation of Threads in Warps

• The partitioning scheme is consistent across devices.
• Thus you can use this knowledge in control flow.
• However, the exact size of warps may change from generation to
generation.
• DO NOT rely on any ordering within or between warps.
• If there are any dependencies between threads, you must __syncthreads()
to get correct results.

Streaming Processors as SIMD Processors

• Control unit for instruction fetch, decode, and control is shared among
multiple processing units.
• Control overhead is minimized (Module 1).

Figure 4.3: Streaming Processors are SIMD Processors (Nvidia 2016a).

SIMD Execution Among Threads in a Warp

• All threads in a warp must execute the same instruction at any point in
time
• This works efficiently if all threads follow the same control flow path
• All if-then-else statements make the same decision
• All loops iterate the same number of times

Control Divergence
Control divergence occurs when threads in a warp take different control flow
paths by making different control decisions.
• Some take the then-path and others take the else-path of an if-
statement.
• Some threads take a different number of loop iterations than others.

The execution of threads taking different paths is serialized on current
GPUs.
• The control paths taken by the threads in a warp are traversed one at a
time until there are no more.
• During the execution of each path, all threads taking that path will be
executed in parallel.
• The number of different paths can be large when considering nested
control flow statements.
Control Divergence Examples

Divergence can arise when a branch or loop condition is a function of thread
indices:
• Example kernel statement with divergence:
• if (threadIdx.x > 2) { }
• This creates two different control paths for threads in a block.
• Decision granularity < warp size; threads 0, 1, and 2 follow a different
path than the rest of the threads in the first warp.
• Example without divergence:
• if (blockIdx.x > 2) { }
• Decision granularity is a multiple of the block size; all threads in any given
warp follow the same path.
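
The two cases can be contrasted in a small sketch (hypothetical kernel, not
from the slides):

__global__ void divergenceExample(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: within each warp, threads 0-2 take a different path
    // than the remaining threads, so both paths are serialized.
    if (threadIdx.x > 2) {
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }

    // Not divergent: the condition is uniform within a warp, because
    // all threads of a warp share the same blockIdx.x.
    if (blockIdx.x > 2) {
        data[i] -= 1.0f;
    }
}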

Example: Vector Addition

Kernel code for vector addition


// Compute vector sum C = A + B
// Each thread performs one pair-wise addition

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

Example: Vector Addition
Analysis for a vector size of 1 000 elements

• Assume that the block size is 256 threads (8 warps in each block).
• All threads in Blocks 0, 1, and 2 are within the valid range.
• i values from 0 to 767
• There are 24 warps in these three blocks, none will have control divergence.
• Most warps in Block 3 will not have control divergence.
• Threads in warps 0–6 are all within the valid range, thus no control
divergence.
• One warp in Block 3 will have control divergence.
• Threads with i values 992–999 will all be within the valid range.
• Threads with i values 1000–1023 will be outside the valid range.
• The effect of serialization due to control divergence will be small.
• Only 1 out of 32 warps has control divergence (1/32 ≈ 3.1 %).
• The impact on performance will likely be less than 3%.

Outline
1 Warps

2 Organization of Threads in Warps in Blocks


Number of Warps
Number of Registers
Amount of Shared Memory

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines
Number of Warps

Example: For matrix-matrix multiplication, should we use blocks of dimension


4 × 4, 8 × 8, 16 × 16, 32 × 32, 64 × 64?
Assume we are given a GPU of compute capability 5.0 with the following
restrictions:

Maximum number of threads per block 1024


Maximum number of blocks per multiprocessor 32
Maximum number of threads per multiprocessor 2048
Number of registers per block 65536
Size of shared memory per block 49152 bytes

Warps
• 4 × 4: 16 threads per block, at most 32 blocks per SM
→ 512 threads in 32 warps, each of which is only half full
• 8 × 8: 64 threads per block, at most 2048/64 = 32 blocks per SM
→ 2048 threads in 64 warps
• 16 × 16: 256 threads per block, at most 2048/256 = 8 blocks
→ 2048 threads in 64 warps
• 32 × 32: 1024 threads per block, at most 2048/1024 = 2 blocks
→ 2048 threads in 64 warps
• 64 × 64: not allowed, would require 4096 threads per block

If you look only at the largest number of warps, it doesn't matter whether
you choose 8 × 8, 16 × 16, or 32 × 32. Concerning control divergence, it can
be better to take smaller blocks.
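
For illustration (a sketch, not from the slides; matMulKernel and the device
pointers are assumed names), the 16 × 16 choice expressed as a launch
configuration:

#define BLOCK_SIZE 16

// 16 x 16 = 256 threads per block, i.e. 8 full warps per block.
dim3 block(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid((n + BLOCK_SIZE - 1) / BLOCK_SIZE,
          (n + BLOCK_SIZE - 1) / BLOCK_SIZE);
matMulKernel<<<grid, block>>>(d_A, d_B, d_C, n);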

Registers

Besides the number of warps, the number of required registers has to be


taken into account. The registers of a multiprocessor are divided between the
blocks, and within the blocks between the threads. For the time of execution
of a kernel, each thread has its own exclusive set of registers.

Registers

Example: If we choose blocks of dimension 16 × 16 for a kernel which requires


r registers per thread, how many blocks and threads can run on each
multiprocessor?
As stated above, a GPU of compute capability 5.0 has 65536 registers per
multiprocessor.

Registers

• r = 32: each block requires 32 · 256 = 8192 registers


→ ⌊65536/8192⌋ = 8 blocks with 8 · 256 = 2048 threads can run on
each multiprocessor
• r = 33: each block requires 33 · 256 = 8448 registers
→ ⌊65536/8448⌋ = 7 blocks – in reality only 6, since registers are
allocated by warps – with 6 · 256 = 1536 threads can run on each
multiprocessor

Increasing the number of required registers by 1 thus decreases the possible


number of threads per multiprocessor by 1/4.
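
As a hedged sketch (not from the slides): the register budget of a kernel can
be influenced at compile time, e.g. with the __launch_bounds__ qualifier or
the nvcc option -maxrregcount, so that the desired number of blocks remains
resident per multiprocessor.

// Ask the compiler to keep register usage low enough that 8 blocks of
// 256 threads can be resident per multiprocessor (kernel body as before).
__global__ void __launch_bounds__(256, 8)
matMulKernel(const float *A, const float *B, float *C, int n)
{
    // ...
}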

Shared memory

In a similar way, the amount of shared memory required by each block can
limit the number of blocks that can run on a multiprocessor.
As an aid for determining the optimal grid and block size, a CUDA
occupancy calculator (in terms of an Excel sheet) is provided at
http://developer.download.nvidia.com/compute/cuda/CUDA_
Occupancy_calculator.xls
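
Alternatively (an illustrative sketch, not part of the slides), the occupancy for
a given kernel and block size can be queried at runtime with the CUDA
runtime API; matMulKernel is the hypothetical kernel from above.

#include <cstdio>
#include <cuda_runtime.h>

// How many blocks of the given size can be resident per multiprocessor,
// taking the kernel's register and shared memory usage into account?
void printOccupancy(int blockSize, size_t dynamicSmemBytes)
{
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, matMulKernel, blockSize, dynamicSmemBytes);
    printf("Resident blocks per SM: %d (%d threads)\n",
           numBlocks, numBlocks * blockSize);
}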

Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited


DRAM Bursting
Device Memory Coalescing
Device Memory Coalescing for Matrices
Memory Coalescing in Matrix Multiplication?
Shared Memory Banks

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines

DRAM Bursting – A System View

Figure 4.4: Array with burst sections (Nvidia 2016a).

• Each address space is partitioned into burst sections.
• Whenever a location is accessed, all other locations in the same burst
section are also delivered to the processor.
• Basic example: a 16-byte address space, 4-byte burst sections.
• In practice, we have at least a 4 GB address space and burst section sizes
of 128 bytes or more.

Device Memory Coalescing

Figure 4.5: Memory Coalescing (Nvidia 2016a).

When all threads of a warp execute a load instruction and all accessed
locations fall into the same burst section, only one DRAM request will be
made and the access is fully coalesced.

Accesses in a warp are to consecutive locations if the index in an array
access is of the form
A[(expression with terms independent of threadIdx.x) + threadIdx.x];
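
Two hypothetical copy kernels (a sketch, not from the slides) contrast a
coalesced and an un-coalesced access pattern for a row-major n × n matrix:

// Coalesced: consecutive threads access consecutive addresses;
// the index has the form (terms independent of threadIdx.x) + threadIdx.x.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
        out[row * n + col] = in[row * n + col];
}

// Un-coalesced: consecutive threads are n elements apart, so their
// accesses spread across many burst sections.
__global__ void copyStrided(const float *in, float *out, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        out[row * n + col] = in[row * n + col];
}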
Un-coalesced Accesses

Figure 4.6: Un-coalesced Accesses (Nvidia 2016a).

When the accessed locations spread across burst section boundaries:
• Coalescing fails.
• Multiple DRAM requests are made.
• The access is not fully coalesced.
• Some of the bytes accessed and transferred are not used by the threads.

Device Memory Coalescing

The performance of device memory access can be increased by memory


coalescing, i. e., accessing consecutive addresses by consecutive threads.
Example: 32 consecutive threads (within a warp) access an array of 4-byte
float entries in device memory.

• With memory coalescing, one memory transaction of a 128-byte segment


(from compute capability 1.2) is required.
• Without memory coalescing, 32 independent memory transactions of
32-byte segments (which is the minimal size) are required.

Linearized Arrays

A 2D C array M is stored in linear memory space in row-major order
(linearized order in increasing address):
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3

Figure 4.7: A 2D C array in linear memory space (Nvidia 2016a).

Column-wise Matrix Access

Figure 4.8: Column-wise Matrix Access (Kirk and Hwu 2010).

Row-wise Matrix Access

Figure 4.9: Row-wise Matrix Access (Kirk and Hwu 2010).

Memory Coalescing in Matrix Multiplication?
Is memory coalescing possible in the matrix-matrix multiplication kernel that
uses device memory?
int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
// ...
for (int i = 0; i < k; i++) {
    multVal += A[row * k + i] * B[i * k + col];
}
Two access patterns of basic matrix multiplication: A[row * k + i] and
B[i * k + col], where i is the loop counter in the inner-product loop of the
kernel code.

Figure 4.10: Access patterns for matrix multiplication (Nvidia 2016a).
Column-wise Access for B

B accesses are coalesced: in each load iteration, consecutive threads
(T0–T3) access consecutive elements of one row of B.

Figure 4.11: B accesses are coalesced (Nvidia 2016a).

Row-wise Access for A

A accesses are not coalesced: in each load iteration, consecutive threads
(T0–T3) access elements of one column of A, which are far apart in the
linearized array.

Figure 4.12: A accesses are not coalesced (Nvidia 2016a).

Memory Coalescing in Matrix Multiplication?
Is memory coalescing possible in the matrix-matrix multiplication kernel that
uses shared memory?
int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
int rowIdxBlock = threadIdx.y;
int colIdxBlock = threadIdx.x;
// ...
for (int l = 0; l < gridSize; l++) {
    srcAShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] =
        A[row * size + l * BLOCK_SIZE + colIdxBlock];
    srcBShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] =
        B[(l * BLOCK_SIZE + rowIdxBlock) * size + col];
    // ...
}

Using Shared Memory

Have each thread load an A element and a B element at the same relative
position as its C element.
With tx = threadIdx.x and ty = threadIdx.y, accessing tile 0 uses the 2D
indexing A[Row][tx] and B[ty][Col].

Figure 4.13: Loading an input tile into shared memory (Nvidia 2016a).
Corner Turning

The original access pattern to d_M and d_N is replaced by a tiled access
pattern: the tiles are first copied into shared memory, and the multiplication
is then performed with the shared memory values.

Figure 4.14: Corner turning (Nvidia 2016a).

Shared Memory Banks

Memory considerations are also necessary to optimize the usage of shared
memory, which is divided into 16 or 32 banks (depending on the compute
capability) in an interleaved fashion, i. e., such that successive 32-bit words
are assigned to successive banks.
In this way, bank conflicts between threads within one warp can be avoided,
provided shared memory is accessed appropriately. In particular, the stride
between words in shared memory that are addressed by successive threads
must not have a common factor with the number of banks. For 16 = 2⁴ or
32 = 2⁵ banks this means that the stride has to be odd. Again we refer to the
CUDA C Programming Guide (NVIDIA 2015a), appendices G.2–5 with
figure 17, for details.
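
A common illustration (a sketch, not from the slides) is padding a shared
memory tile by one element, so that accesses with a stride equal to the tile
width hit different banks; transposeTile and TILE are assumed names.

#define TILE 32

__global__ void transposeTile(const float *in, float *out, int n)
{
    // Without the +1 padding, reading tile[threadIdx.x][threadIdx.y]
    // below would use a stride of 32 words between the threads of a
    // warp, so they would all hit the same bank; a stride of 33 (odd)
    // has no common factor with the number of banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}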

Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance


Loop Unrolling
Data Prefetching

5 Parallel Reduction

6 Guidelines
Loop Unrolling

A loop requires control instructions and (at least in a for loop) one register
for the count variable, as in the matrix-matrix multiplication example.
for (int k = 0; k < BLOCK_SIZE; k++) {
    multVal +=
        srcAShared[rowIdxBlock * BLOCK_SIZE + k] *
        srcBShared[k * BLOCK_SIZE + colIdxBlock];
}

Loop Unrolling
For loops with a fixed number of iterations, loop unrolling might increase the
performance.
multVal +=
      srcAShared[rowIdxBlock * BLOCK_SIZE + 0] *
      srcBShared[0 * BLOCK_SIZE + colIdxBlock]
    + srcAShared[rowIdxBlock * BLOCK_SIZE + 1] *
      srcBShared[1 * BLOCK_SIZE + colIdxBlock]
    + srcAShared[rowIdxBlock * BLOCK_SIZE + 2] *
      srcBShared[2 * BLOCK_SIZE + colIdxBlock]
    // ...

By default all loops with a fixed number of iterations are automatically


unrolled by nvcc.
The loop unrolling for any given loop can be controlled by the #pragma unroll
directive (cf. CUDA C Programming Guide (NVIDIA 2015a)).
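
A minimal sketch (assuming the same shared memory tiles as above) of
requesting unrolling explicitly:

// Fully unroll the inner-product loop over the tile; an integer argument
// such as  #pragma unroll 4  would request partial unrolling instead.
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
    multVal += srcAShared[rowIdxBlock * BLOCK_SIZE + k] *
               srcBShared[k * BLOCK_SIZE + colIdxBlock];
}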
Data Prefetching

In the matrix-matrix multiplication kernel that uses shared memory, the
synchronization statements have the effect that all threads access device
memory at the same time (lines 2 and 3).
1 for (int l = 0; l < gridSize; l++) {
2     // copy matrix blocks concurrently from device
3     // memory to shared memory
4     __syncthreads();
5     // multiply current matrix blocks in shared memory
6     __syncthreads();
7 }

Data Prefetching

// copy first matrix blocks (l = 0) concurrently from
// device memory to registers
for (int l = 0; l < gridSize; l++) {
    // copy current matrix blocks (l) concurrently from
    // registers to shared memory
    __syncthreads();
    // copy next matrix blocks (l + 1) concurrently from
    // device memory to registers
    // multiply current matrix blocks (l) in shared memory
    __syncthreads();
}
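
A hedged sketch of how this pattern could look in the tiled kernel from above
(the register variables aReg and bReg are assumptions, not from the slides):

// Prefetch the first A and B tile elements into registers (l = 0).
float aReg = A[row * size + colIdxBlock];
float bReg = B[rowIdxBlock * size + col];

for (int l = 0; l < gridSize; l++) {
    // Stage the prefetched elements in shared memory.
    srcAShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] = aReg;
    srcBShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] = bReg;
    __syncthreads();

    // Prefetch the next tiles (l + 1) while the current ones are used.
    if (l + 1 < gridSize) {
        aReg = A[row * size + (l + 1) * BLOCK_SIZE + colIdxBlock];
        bReg = B[((l + 1) * BLOCK_SIZE + rowIdxBlock) * size + col];
    }

    // Multiply the current matrix blocks held in shared memory.
    for (int k = 0; k < BLOCK_SIZE; k++)
        multVal += srcAShared[rowIdxBlock * BLOCK_SIZE + k] *
                   srcBShared[k * BLOCK_SIZE + colIdxBlock];
    __syncthreads();
}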

Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction
Combining Neighboring Elements
Combining Blocks of Elements

6 Guidelines
Parallel Reduction

Reduction algorithms extract a single value from an array, which is used for
many purposes, e. g.,

• determining the maximal or minimal array value,


• computing the sum of all array values,
• computing the average array value,
• computing the inner product (scalar product) of two vectors.

In parallel programming, the access to the single value for the result has to be
synchronized.

Parallel Reduction

Parallel reduction uses a divide-and-conquer approach known from soccer
tournaments (knockout system).

• n teams are divided into n/2 pairs.


• First round: n/2 pairs play in parallel, n/2 winners advance to second
round.
• Second round: n/4 pairs play in parallel, n/4 winners advance to third
round, etc.
• Last round: one pair plays the final, the winner wins the tournament.

Combining Neighboring Elements

Figure 4.15: Combining Neighboring Elements (Kirk and Hwu 2010).

Combining Neighboring Elements

extern __shared__ float partialSum[];
// copy array to partialSum
int tx = threadIdx.x;
for (int stride = 1; stride < blockDim.x; stride <<= 1) {
    __syncthreads();
    if (tx % (2 * stride) == 0) {
        partialSum[tx] += partialSum[tx + stride];
    }
}

Combining Blocks of Elements

Figure 4.16: Combining Blocks of Elements (Kirk and Hwu 2010).

extern __shared__ float partialSum[];
// copy array to partialSum
int tx = threadIdx.x;
for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
    __syncthreads();
    if (tx < stride) {
        partialSum[tx] += partialSum[tx + stride];
    }
}
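
A hedged sketch of how this block-wise reduction could be wrapped into a
complete kernel and launched (blockSumKernel and the host-side combination
of the partial sums are assumptions, not from the slides):

// Each block reduces blockDim.x elements of in[] to one partial sum.
// Assumes blockDim.x is a power of two.
__global__ void blockSumKernel(const float *in, float *blockSums, int n)
{
    extern __shared__ float partialSum[];
    int tx = threadIdx.x;
    int i  = blockIdx.x * blockDim.x + tx;
    partialSum[tx] = (i < n) ? in[i] : 0.0f;   // pad the last block with zeros

    for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (tx < stride)
            partialSum[tx] += partialSum[tx + stride];
    }
    if (tx == 0) blockSums[blockIdx.x] = partialSum[0];
}

// Host side: one partial sum per block, combined on the CPU
// (or by a second kernel launch for very large arrays):
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   blockSumKernel<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_sums, n);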

Outline
1 Warps

2 Organization of Threads in Warps in Blocks

3 Memory Access Revisited

4 Further Means to Improve Performance

5 Parallel Reduction

6 Guidelines

Guidelines

• A large number of warps per multiprocessor is advantageous for hiding
memory latency, but the available number of registers and the size of
shared memory pose further constraints.
• Memory coalescing for device memory and bank access for shared
memory should be kept in mind when accessing arrays by successive
threads.
• Minor performance improvement might be possible by loop unrolling,
data prefetching, and selecting the thread granularity, i. e., deciding
between a lot of threads with a small number of operations and fewer
threads with a larger number of operations.
• In parallel reduction algorithms, the order of combining array elements is
important for the overall performance.

