
Block IDs and Thread IDs

- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - …

[Figure 3.2. An Example of CUDA Thread Organization: the host launches Kernel 1 on Grid 1, a 2×2 arrangement of blocks, and Kernel 2 on Grid 2; Block (1,1) is shown expanded into a 4×2×2 arrangement of threads. Courtesy: NVIDIA]


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
CUDA Memory Model Overview

- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long latency access
- We will focus on global memory for now
  - Constant and texture memory will come later

[Figure: the grid contains Block (0,0) and Block (1,0); each block has its own Shared Memory and per-thread Registers for Thread (0,0) and Thread (1,0); the host and all threads read and write Global Memory]
Chapter 4
Parallel Programming in CUDA C

4.1 Chapter Objectives

- You will learn one of the fundamental ways CUDA exposes its parallelism.
- You will write your first parallel code with CUDA C.
4.2.1 Summing Vectors (1)

CPU Vector Sums

- Source: add_loop_cpu.cu
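The CPU version boils down to a single loop; the sketch below is a reconstruction in the spirit of add_loop_cpu.cu (the array size N is an assumption, not necessarily the book's value):

```c
#include <stdio.h>

#define N 10

/* add_loop_cpu-style CPU vector sum: one "core" walks the whole
   array, incrementing tid by 1 each step. */
void add(int *a, int *b, int *c) {
    int tid = 0;              /* this is CPU zero, so we start at zero */
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;             /* we have one CPU, so we increment by one */
    }
}
```

Writing the loop with an explicit `tid` counter looks odd for a CPU, but it sets up the parallel version, where each block computes exactly one `tid`.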
4.2.1 Summing Vectors (2)

GPU Vector Sums

- Source: add_loop_gpu.cu
  - We allocate three arrays on the device using calls to cudaMalloc(): two arrays, dev_a and dev_b, to hold inputs, and one array, dev_c, to hold the result.
  - Because we are environmentally conscientious coders, we clean up after ourselves with cudaFree().
  - Using cudaMemcpy(), we copy the input data to the device with the parameter cudaMemcpyHostToDevice and copy the result data back to the host with cudaMemcpyDeviceToHost.
  - We execute the device code in add() from the host code in main() using the triple angle bracket syntax.
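Put together, those steps look roughly like this (a sketch, not the book's verbatim listing; the book's HANDLE_ERROR checking around each CUDA call is omitted here):

```cuda
#include <stdio.h>

#define N 10

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;             // which block is running this copy?
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(void) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // Allocate the three arrays on the device.
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) { a[i] = -i; b[i] = i * i; }

    // Copy the inputs to the device.
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Launch N blocks of 1 thread each.
    add<<<N,1>>>(dev_a, dev_b, dev_c);

    // Copy the result back to the host.
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    // Clean up after ourselves.
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
```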
GPU Vector Sums (2)

- We have written a function called add() that executes on the device. We accomplished this by taking C code and adding a __global__ qualifier to the function name.
GPU Vector Sums (3)

- add<<<N,1>>>(dev_a, dev_b, dev_c);
  - The first number, N, is the number of parallel blocks in which we would like the device to execute our kernel.
    - This argument can be a three-dimensional value; each grid dimension may be at most 65,535.
  - The second number, 1, is the number of threads in a block.
    - This can also be a three-dimensional value; at most 512 threads per block.
  - N copies of the kernel are created and run in parallel.
  - We call each of these parallel invocations a block.
GPU Vector Sums (4)

- How can we tell from within the code which block is currently running?
- The variable blockIdx.x indicates this.
- It is one of the built-in variables that the CUDA runtime defines.
  - Others: blockIdx.y, blockDim.x{.y}, gridDim.x{.y}, threadIdx.x{.y, .z}
GPU Vector Sums (5)

- CUDA C allows you to define a group of blocks in two dimensions.
- We call the collection of parallel blocks a grid.
- This launch specifies to the runtime system that we want a one-dimensional grid of N blocks:

  add<<<N,1>>>(dev_a, dev_b, dev_c);

- The blocks will have varying values for blockIdx.x, the first taking value 0 and the last taking value N-1.
GPU Vector Sums (6)

- Imagine four blocks: all run through the same copy of the device code, but each has a different value for the variable blockIdx.x (0, 1, 2, and 3).
GPU Vector Sums (7)

- Why do we check whether tid is less than N?
  - It should always be less than N, since we've specifically launched our kernel such that this assumption holds.
  - Checking anyway protects against out-of-bounds memory access if the kernel is ever launched with a configuration that breaks this assumption.
4.2.2 A Fun Example

- Julia Set
  - A point is not in the set if the process of iterating the equation diverges for that point.
  - That is, if the sequence of values produced by iterating the equation grows toward infinity, the point is considered outside the set.
  - Conversely, if the values taken by the equation remain bounded, the point is in the set.

  Z_{n+1} = Z_n^2 + C
Julia Set Example
CPU Julia Set (1)
CPU Julia Set (2)
CPU Julia Set (3)
ß  Function julia()!
Þ  Return 1 if the point is in the set otherwise 0
Þ  The point’s color to be red if julia() return 1 and
black if it returns 0.
Þ  We begin by translating our pixel coordinate to a
coordinate in complex space.
Þ  To center the complex plane at the image center, we
shift by DIM/2.
Þ  To ensure that the image spans the range of -1.0 to 1.0,
we scale the image coordinate by DIM/2.
Þ  Thus, given an image point at (x, y), we get a point in
complex space at
((DIM/2-x)/(DIM/2), ((DIM/2-y)/(DIM/2))
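That coordinate translation can be written as a tiny helper (a sketch; the function name and the choice of DIM = 1000 are illustrative, not from the book's listing):

```c
#define DIM 1000

/* Translate an image pixel (x, y) into a point in complex space:
   shift by DIM/2 to center the plane, then scale by DIM/2 so the
   image spans roughly -1.0 to 1.0 on each axis. */
void pixel_to_complex(int x, int y, float *jx, float *jy) {
    *jx = (float)(DIM/2 - x) / (DIM/2);
    *jy = (float)(DIM/2 - y) / (DIM/2);
}
```

With this mapping, the center pixel (DIM/2, DIM/2) lands on 0 + 0i, and the corner pixel (0, 0) lands on 1 + 1i.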
CPU Julia Set (4)

- We need to determine whether the point is in or out of the Julia Set by computing the values of the iterative equation Z_{n+1} = Z_n^2 + C.
- For C, we choose -0.8 + 0.156i.
- We compute 200 iterations of this function.
- After each iteration, we check whether the magnitude of the result exceeds some threshold (say 1,000). If so, the equation is diverging, and we can return 0 to indicate that the point is not in the set.
- On the other hand, if we finish all 200 iterations and the magnitude is still bounded under 1,000, we assume that the point is in the set, and we return 1 to the caller, kernel().
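A CPU sketch of that iteration in plain C (a reconstruction, not the book's exact listing: the complex helpers are modeled on the book's cuComplex struct, and this version takes the complex coordinates directly rather than the pixel coordinates):

```c
/* Minimal complex-number helpers. */
struct cuComplex { float r, i; };

static float magnitude2(struct cuComplex z) {
    return z.r * z.r + z.i * z.i;
}

static struct cuComplex cmul(struct cuComplex a, struct cuComplex b) {
    struct cuComplex v = { a.r * b.r - a.i * b.i, a.i * b.r + a.r * b.i };
    return v;
}

static struct cuComplex cadd(struct cuComplex a, struct cuComplex b) {
    struct cuComplex v = { a.r + b.r, a.i + b.i };
    return v;
}

/* Return 1 if the complex point (jx, jy) stays bounded under
   Z_{n+1} = Z_n^2 + C for 200 iterations, otherwise 0. */
int julia(float jx, float jy) {
    struct cuComplex c = { -0.8f, 0.156f };
    struct cuComplex z = { jx, jy };
    for (int i = 0; i < 200; i++) {
        z = cadd(cmul(z, z), c);            /* Z = Z*Z + C */
        if (magnitude2(z) > 1000.0f)
            return 0;                        /* diverging: outside the set */
    }
    return 1;                                /* still bounded: in the set */
}
```

Comparing the squared magnitude against 1,000 avoids a square root on every iteration.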
CPU Julia Set (5)
CPU Julia Set (6)
GPU Julia Set (1)
GPU Julia Set (2)
ß  Source:Julia_gpu.cu !
ß  *dev_bitmap!
Þ  For allocating memory on device
ß  dim3 grid(DIM, DIM);!
Þ  Two dimensional grid of blocks
Þ  dim3 is not a standard C type but the CUDA definition
Þ  dim3 represents a three-dimensional tuple, however,
currently a three-dimensional launch grid is not
supported.
Þ  dim3 grid(DIM, DIM) means dim3 grid(DIM,
DIM, 1)!
Þ  kernel<<<grid,1>>>(dev_bitmap);
GPU Julia Set (3)
GPU Julia Set (4)

- No longer need nested for loops
- blockIdx
  - One block for each pair of integers (x, y) between (0,0) and (DIM-1, DIM-1).
- offset = x + y * gridDim.x;
  - Linear offset into the output buffer, ptr.
  - gridDim is another built-in variable and holds the dimensions of the grid that was launched.


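The indexing described above can be sketched as a kernel like this (a reconstruction; julia() here is assumed to be a __device__ function that maps the pixel to complex space and runs the iteration):

```cuda
__global__ void kernel(unsigned char *ptr) {
    // Map from blockIdx to pixel position: one block per pixel.
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;     // linear offset into the output buffer

    // Color the pixel red if the point is in the set, black otherwise.
    int juliaValue = julia(x, y);
    ptr[offset * 4 + 0] = 255 * juliaValue;  // R
    ptr[offset * 4 + 1] = 0;                 // G
    ptr[offset * 4 + 2] = 0;                 // B
    ptr[offset * 4 + 3] = 255;               // alpha
}
```

Because the grid is DIM×DIM, the nested for loops of the CPU version disappear: each block computes exactly one pixel.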
GPU Julia Set (5)
OpenGL with MacOS
Environment Variables

- export PATH=/usr/local/cuda/bin:$PATH
- export PATH=/opt/local/bin:/opt/local/sbin:$PATH
- export CPLUS_INCLUDE_PATH=/usr/X11/include:$CPLUS_INCLUDE_PATH
- export C_INCLUDE_PATH=/usr/X11/include:$C_INCLUDE_PATH
- export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH
Compiling and Linking Libraries

- nvcc julia_cpu.cu -Xlinker -framework,OpenGL,-framework,GLUT