Lecture 2.

2

Kernel-based Parallel Programming
- Control Divergence

Objective
• To understand the implications of warp scheduling on control flow
  – Nature of control flow
  – Warp partitioning
  – Control divergence

Warps as Scheduling Units
• Each Block is divided into 32-thread Warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are scheduling units in SM
  – Threads in a warp execute in SIMD
[Figure: warps of Block 1 and Block 2, each warp holding threads t0, t1, t2, … t31]
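The division of a block into warps can be sketched as a small helper. This is a minimal sketch, assuming the 32-thread warp size stated above; `kAssumedWarpSize` and `warpsPerBlock` are hypothetical names, not part of CUDA. Device code can read the built-in variable `warpSize` instead of hard-coding the value.

```cuda
// Assumption: 32 threads per warp, as stated on the slide.
constexpr int kAssumedWarpSize = 32;

// Number of warps needed to hold a block of `threadsPerBlock` threads.
// A block whose size is not a multiple of the warp size still occupies
// a whole, partially filled, last warp (ceiling division).
__host__ __device__ constexpr int warpsPerBlock(int threadsPerBlock)
{
    return (threadsPerBlock + kAssumedWarpSize - 1) / kAssumedWarpSize;
}
```

For example, a 256-thread block holds 8 warps, and a 100-thread block holds 4 warps, the last of which has only 4 threads.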

Back to Von-Neumann
• Every instruction needs to be fetched from memory, decoded, then executed.
• Instructions come in three flavors: Operate, Data transfer, and Program Control Flow.
• An example instruction cycle is the following:
  Fetch | Decode | Execute | Memory

Control Flow Operations
• Example of control flow instruction: BRp #-4
  – If the condition is positive, jump back four instructions

How thread blocks are partitioned
• Thread blocks are partitioned into warps
  – Thread IDs within a warp are consecutive and increasing
  – Warp 0 starts with Thread ID 0
• Partitioning is always the same
  – Thus you can use this knowledge in control flow
  – However, the exact size of warps may change from generation to generation
  – (Covered next)
• DO NOT rely on any ordering within or between warps
  – If there are any dependencies between threads, you must __syncthreads() to get correct results (more later).
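The partitioning rule can be written down as two index helpers, plus a kernel that uses it in control flow. This is a sketch for 1D blocks only; `warpIndexInBlock`, `laneInWarp`, and `warpAlignedBranch` are hypothetical names. Because the warp size may change between GPU generations, the kernel passes the built-in `warpSize` rather than a literal 32.

```cuda
// Warp that thread `tid` of a 1D block falls into: threads 0..warpSz-1 form warp 0, and so on.
__host__ __device__ constexpr int warpIndexInBlock(int tid, int warpSz)
{
    return tid / warpSz;
}

// Position of thread `tid` inside its warp (often called the lane).
__host__ __device__ constexpr int laneInWarp(int tid, int warpSz)
{
    return tid % warpSz;
}

// Branching on the warp index keeps every thread of a warp on the same path.
// Assumption: `out` has at least gridDim.x * blockDim.x elements.
__global__ void warpAlignedBranch(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (warpIndexInBlock(threadIdx.x, warpSize) % 2 == 0) {
        out[i] = 1;   // taken by whole warps 0, 2, 4, ...
    } else {
        out[i] = -1;  // taken by whole warps 1, 3, 5, ...
    }
}
```

With a warp size of 32, thread 31 is the last lane of warp 0 and thread 32 is lane 0 of warp 1.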

Control Flow Instructions
• Main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – Different execution paths are serialized in current GPUs
    • The control paths taken by the threads in a warp are traversed one at a time until none remain.
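One way to see whether the threads of a warp disagree on a branch condition is a warp vote. The sketch below uses the CUDA intrinsics `__activemask()` and `__ballot_sync()`; `isDivergent` and `flagDivergentWarps` are hypothetical names. It only reports disagreement on the condition. Whether the hardware actually serializes the two paths also depends on the compiler, which may predicate short branches.

```cuda
// True when the threads of one warp disagree on a branch condition.
// `votes` has one bit set per thread whose condition is true;
// `active` has one bit set per thread that took part in the vote.
__host__ __device__ inline bool isDivergent(unsigned votes, unsigned active)
{
    return votes != 0u && votes != active;
}

// Each warp records whether the condition `threadIdx.x > threshold` splits it.
// Assumptions: 1D blocks, and `flags` holds one int per warp in the grid.
__global__ void flagDivergentWarps(int threshold, int* flags)
{
    bool cond = static_cast<int>(threadIdx.x) > threshold;

    unsigned active = __activemask();               // threads executing this vote together
    unsigned votes  = __ballot_sync(active, cond);  // bit k set if lane k has cond == true

    int warpsInBlock = (blockDim.x + warpSize - 1) / warpSize;
    int warpInGrid   = blockIdx.x * warpsInBlock + threadIdx.x / warpSize;
    if (threadIdx.x % warpSize == 0) {              // one writer per warp
        flags[warpInGrid] = isDivergent(votes, active) ? 1 : 0;
    }
}
```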

Control Divergence Examples
• Divergence can arise only when branch condition is a function of thread indices
• Example with divergence:
  – if (threadIdx.x > 2) { }
  – This creates two different control paths for threads in a block
  – Branch granularity < warp size; threads 0, 1 and 2 follow a different path than the rest of the threads in the first warp
• Example without divergence:
  – if (blockIdx.x > 2) { }
  – Branch granularity is a multiple of block size; all threads in any given warp follow the same path
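The two branch conditions above can be placed in complete kernels and checked with a small host-side count. This is a minimal sketch, assuming 1D blocks and 32-thread warps; all function names and the constant `kAssumedWarpSize` are hypothetical.

```cuda
constexpr int kAssumedWarpSize = 32;  // assumption, matching the 32-thread warps in this lecture

// Condition depends on threadIdx.x: in warp 0, threads 0..2 skip the body, threads 3..31 run it.
// Assumption: `out` has at least gridDim.x * blockDim.x elements.
__global__ void branchOnThread(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 0;
    if (threadIdx.x > 2) { out[i] = 1; }
}

// Condition depends on blockIdx.x only: the same for every thread of a block, hence of a warp.
__global__ void branchOnBlock(int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 0;
    if (blockIdx.x > 2) { out[i] = 1; }
}

// Host-side check: how many warps of one block see both outcomes of `tid > threshold`?
inline int divergentWarpsForThreadBranch(int threadsPerBlock, int threshold)
{
    int divergent = 0;
    for (int start = 0; start < threadsPerBlock; start += kAssumedWarpSize) {
        int end = start + kAssumedWarpSize;
        if (end > threadsPerBlock) end = threadsPerBlock;  // partially filled last warp
        int taken = 0;
        for (int tid = start; tid < end; ++tid) {
            if (tid > threshold) ++taken;
        }
        if (taken != 0 && taken != end - start) ++divergent;
    }
    return divergent;
}
```

For a 256-thread block and the condition `threadIdx.x > 2`, only the first of the 8 warps is counted as divergent.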

Example: Vector Addition Kernel
Device Code

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
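A host-side launch for this kernel might look like the following. This is a sketch, not the lecture's own host code: the wrapper `vecAdd`, the helper `blocksFor`, and the block size of 256 are assumptions made for illustration, and `n` is assumed to be positive.

```cuda
#include <cuda_runtime.h>

__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];  // bounds check: the last block may have extra threads
}

// Blocks needed so that blocks * threadsPerBlock >= n (ceiling division).
inline int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical wrapper: copy inputs to the device, launch the kernel, copy the result back.
inline cudaError_t vecAdd(const float* hA, const float* hB, float* hC, int n)
{
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threadsPerBlock = 256;  // assumed block size
        vecAddKernel<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(dA, dB, dC, n);
        err = cudaGetLastError();         // reports launch configuration errors
    }
    // This copy waits for the kernel to finish before reading dC.
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);  // cudaFree accepts a null pointer
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

With `n = 1000` and 256 threads per block, `blocksFor` returns 4, so 1,024 threads are launched and the `if (i < n)` test disables the last 24.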

Analysis for vector size of 1,000 elements
• Assume a block size of 256 threads, so four blocks (0 to 3) are launched
• All threads in Blocks 0, 1, and 2 are within valid range
  – i values from 0 to 767
• Block 3 will have control divergence
  – 1st group: i values from 768-999
  – 2nd group: i values from 1000-1023
  – Within Block 3, only the last warp (i values 992-1023) contains threads from both groups
• Performance penalty to Blocks with no divergence is very small
  – Only the last Block will have divergence
  – Overall performance penalty is small as long as there are a large number of elements.
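The analysis can be reproduced with a short host-side count of the warps that straddle the `i < n` boundary. This is a sketch, assuming 32-thread warps and 1D blocks; `WarpCounts`, `countBoundaryDivergence`, and `kAssumedWarpSize` are hypothetical names.

```cuda
constexpr int kAssumedWarpSize = 32;  // assumption, matching the 32-thread warps in this lecture

struct WarpCounts {
    int total;      // warps launched for the whole grid
    int divergent;  // warps with some threads passing `i < n` and some failing it
};

// Counts warps for a 1D launch of ceil(n / threadsPerBlock) blocks.
inline WarpCounts countBoundaryDivergence(int n, int threadsPerBlock)
{
    WarpCounts counts = {0, 0};
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    for (int b = 0; b < blocks; ++b) {
        for (int w = 0; w < threadsPerBlock; w += kAssumedWarpSize) {
            int lanes = threadsPerBlock - w;            // threads in this warp
            if (lanes > kAssumedWarpSize) lanes = kAssumedWarpSize;
            int first = b * threadsPerBlock + w;        // global index i of the warp's first thread
            int valid = 0;
            for (int lane = 0; lane < lanes; ++lane) {
                if (first + lane < n) ++valid;
            }
            ++counts.total;
            if (valid != 0 && valid != lanes) ++counts.divergent;
        }
    }
    return counts;
}
```

For `n = 1000` and 256 threads per block this reports 32 warps, of which one (covering i values 992 to 1023) is divergent.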

READ SECTION 4.3 TO LEARN MORE.