
Synchronization

These notes introduce:

• Ways to achieve thread synchronization.

• __syncthreads()

• cudaThreadSynchronize()

ITCS 4/5145 Parallel Programming, B. Wilkinson, July 11, 2012. CUDASynchronization.ppt


Thread Barrier Synchronization

When we divide a computation into parallel parts to be done concurrently by independent threads, we often need all threads to finish their computation before processing the next stage of the computation.

In parallel programming, we call this barrier synchronization: all threads wait when they reach the barrier until all the threads have reached that point, and then they are all released to continue.

[Figure: threads T0, T1, T2, ..., Tn-1 are active over time, then each waits at the barrier until every thread has arrived.]

CUDA synchronization

CUDA provides a synchronization barrier routine for the threads within each block:

__syncthreads()

This routine is used within a kernel. Threads wait at this point until all threads in the block have reached it, and then they are all released.

NOTE: __syncthreads() only synchronizes a thread with the other threads in its own block.


Threads only synchronize with other threads in the block

Kernel code:

__global__ void mykernel() {
    .
    .
    .
    __syncthreads();
    .
    .
    .
}

[Figure: Block 0 through Block n-1 each reach their own barrier at __syncthreads() and then continue; each block has a separate barrier.]

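As a concrete illustration, here is a minimal sketch of a kernel that relies on __syncthreads() (the kernel name, BLOCK_SIZE, and the block-reversal operation are assumptions for illustration, not from the original slides; the array length is assumed to be a multiple of BLOCK_SIZE):

#define BLOCK_SIZE 256

// Reverses each block-sized segment of d_in into d_out.
// __syncthreads() guarantees the whole tile has been written to shared
// memory by every thread in the block before any thread reads from it.
__global__ void reverseBlock(int *d_out, const int *d_in) {
    __shared__ int tile[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = d_in[i];          // each thread writes one element

    __syncthreads();                      // barrier: all writes now visible

    d_out[i] = tile[blockDim.x - 1 - threadIdx.x];  // safe to read any element
}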
__syncthreads() constraints

All threads in a block must reach a particular __syncthreads() call or deadlock occurs.

Multiple __syncthreads() calls can be used in a kernel, but each one is a distinct barrier. Hence you cannot have:

if ( ... ) {
    ...
    __syncthreads();
} else {
    ...
    __syncthreads();
}

and expect threads going through different paths to be synchronized with each other. Either all threads must go through the if clause or all must go through the else clause. A legal alternative is shown below.
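One legal pattern, as a sketch (the kernel name, the 256-thread block assumption, and the per-branch arithmetic are illustrative): keep the divergent work inside the branch but place a single __syncthreads() that every thread in the block reaches:

// Divergent work stays inside the branch; the barrier itself is outside,
// so every thread in the block reaches the same __syncthreads().
__global__ void branchedKernel(float *d_out) {
    __shared__ float partial[256];       // assumes blockDim.x == 256

    float v;
    if (threadIdx.x < 128)
        v =  2.0f * threadIdx.x;         // illustrative "if" work
    else
        v = -1.0f * threadIdx.x;         // illustrative "else" work
    partial[threadIdx.x] = v;

    __syncthreads();    // single barrier reached by all threads in the block

    // now any element written above can safely be read by any thread
    d_out[blockIdx.x * blockDim.x + threadIdx.x] =
        partial[blockDim.x - 1 - threadIdx.x];
}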
Global Kernel Barrier

Unfortunately, no global kernel barrier routine is available in CUDA.

Often we want to synchronize all threads in the computation. To do that, we have to use workarounds such as returning from the kernel and placing a barrier in the CPU code.

The following could be used in the CPU code:

myKernel<<<B,T>>>( … );
cudaThreadSynchronize();

cudaThreadSynchronize() waits until all preceding commands in all "streams" have completed. It is not needed if there is an existing synchronous CUDA call such as cudaMemcpy().
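The two cases side by side, as a host-code sketch (the kernel argument dev_A and the copy parameters are illustrative; note that later CUDA releases deprecate cudaThreadSynchronize() in favour of cudaDeviceSynchronize()):

// Case 1: explicit barrier on the host.
myKernel<<<B, T>>>(dev_A);       // dev_A is an illustrative argument
cudaThreadSynchronize();         // host blocks until the kernel has finished

// Case 2: implicit barrier.  A blocking cudaMemcpy() issued in the default
// stream cannot start until the preceding kernel has completed, so no
// separate synchronization call is needed.
myKernel<<<B, T>>>(dev_A);
cudaMemcpy(A, dev_A, arraySize, cudaMemcpyDeviceToHost);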
Achieving global synchronization through multiple kernel launches

Kernel launches are efficiently implemented:

- Minimal hardware overhead
- Little software overhead

So we could do:

for (int i = 0; i < n; i++) {
    myKernel<<<B,T>>>( … );
    cudaThreadSynchronize();
}

Recursion is not allowed within a kernel, but it can be used in host code to launch kernels.

Code Example: N-body problem

We need to compute the forces on each body in each time interval, then update the positions and velocities of the bodies, and then repeat.

for (int t = 0; t < tmax; t++) {  // for each time period, force calculation on all bodies

    cudaMemcpy(dev_A, A, arraySize, cudaMemcpyHostToDevice);   // data to GPU

    bodyCal<<<B,T>>>(dev_A);                                    // kernel call

    cudaMemcpy(A, dev_A, arraySize, cudaMemcpyDeviceToHost);    // updated data

}  // end of time period loop

No explicit synchronization is needed, as cudaMemcpy provides it here.
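The slides do not show the body of bodyCal; the following is only a rough all-pairs sketch of what such a kernel might look like (the Body struct and its fields, the constants G and DT, and the extra body-count parameter n are assumptions for illustration):

struct Body { float x, y, vx, vy, mass; };

#define G  6.674e-11f    // gravitational constant (illustrative units)
#define DT 0.01f         // time step (illustrative)

// One thread per body: accumulate the force from every other body, then
// update this body's velocity and position for one time step.
// Note: reading other bodies' positions while some threads may already have
// updated theirs is a data race; a real code would write the new state to a
// second buffer.  The race is ignored here to keep the sketch short.
__global__ void bodyCal(Body *dev_A, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float fx = 0.0f, fy = 0.0f;
    for (int j = 0; j < n; j++) {
        if (j == i) continue;
        float dx = dev_A[j].x - dev_A[i].x;
        float dy = dev_A[j].y - dev_A[i].y;
        float distSqr = dx * dx + dy * dy + 1e-9f;   // softening avoids /0
        float invDist = rsqrtf(distSqr);
        float f = G * dev_A[i].mass * dev_A[j].mass * invDist * invDist;
        fx += f * dx * invDist;
        fy += f * dy * invDist;
    }
    dev_A[i].vx += DT * fx / dev_A[i].mass;
    dev_A[i].vy += DT * fy / dev_A[i].mass;
    dev_A[i].x  += DT * dev_A[i].vx;
    dev_A[i].y  += DT * dev_A[i].vy;
}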
Reasoning behind not having CUDA global synchronization within the GPU

Expensive to implement for a large number of GPU processors.

At the block level, it allows blocks to be executed in any order on the GPU.

Blocks of different sizes can be used depending upon the resources of the GPU – so-called "transparent scalability."

Other ways to achieve global synchronization (if it cannot be avoided)

• The CUDA memory fence __threadfence(), which waits for memory operations to be visible to other threads, but on its own it is probably not usable for synchronization.

• Write your own code in the kernel that implements global synchronization.

How? Using atomics and critical sections (see next). A rough sketch follows.
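For illustration only, here is a sketch of the kind of kernel-side barrier this implies (the names globalBarrier and arrived are assumptions; a 1-D grid is assumed). It is fragile: it deadlocks unless every block of the grid is resident on the GPU at the same time, and it can be used at most once per launch because the counter is never reset:

__device__ unsigned int arrived = 0;   // global arrival counter

// Inter-block barrier sketch using an atomic counter and a memory fence.
__device__ void globalBarrier() {
    __syncthreads();                          // barrier within the block first
    if (threadIdx.x == 0) {
        __threadfence();                      // make this block's writes visible
        atomicAdd(&arrived, 1);               // announce that this block arrived
        while (atomicAdd(&arrived, 0) < gridDim.x)
            ;                                 // spin until all blocks arrive
    }
    __syncthreads();                          // release the block's other threads
}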

Discussion points

• Using writes to global memory to enforce synchronization is expensive.

Questions
