CSL 730: Parallel Programming

CUDA (Compute Unified Device Architecture)
Monday 21 March 2011
Tesla (G80)
[Block diagram: each SM contains SPs with a 16KB scratchpad; SMs share a 128KB L2 cache backed by GBs of DRAM.]
Fermi
[Block diagram: each SM contains SPs with 64K of combined L1/scratchpad plus texture and constant caches; SMs share a 768KB L2 cache backed by GBs of DRAM.]
CPU vs GPU architecture
• Memory latency needs to be hidden
  – Run many threads
  – Feasible because of the GPU's high “compute” density
[Diagram contrasting CPU on-chip cache (~ 8MB) with GPU on-chip memory (~ 64KB). Source: NVIDIA]
CUDA Architecture
[Architecture diagram (courtesy NVIDIA)]
SM
[Fermi SM diagram: instruction cache; two warp schedulers, each with a dispatch unit; a register file (32,768 x 32-bit); 32 CUDA cores; 16 LD/ST units; 4 SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache. Each CUDA core has a dispatch port, an operand collector, FP and INT units, and a result queue.]
Figure 6. Each Fermi SM includes 32 cores, 16 load/store units, four special-function units, a 32K-word register file, 64K of configurable RAM, and thread control logic. Each core has both floating-point and integer execution units. (Source: NVIDIA)

Floating-point operations follow the IEEE 754-2008 floating-point standard. Each core can perform one single-precision fused multiply-add operation in each clock period and one double-precision FMA in two clock periods. At the chip level, Fermi performs more than 8× as many double-precision operations per clock than the previous GT200 generation, where double-precision processing was handled by a dedicated unit per SM with much lower throughput. IEEE floating-point compliance includes all four rounding modes, and subnormal numbers (numbers closer to zero than a normalized format can …
Monday 21 March 2011
GPU Performance
• Massively parallel: 512 cores
• Low power
• Massively threaded:1000s of threads
– Hardware-supported threads

Courtesy NVIDIA
Monday 21 March 2011
What is CUDA?
• “Compute Unified Device Architecture”
• General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
• Driver for loading computation programs into the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute: a graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
CUDA is C-like
• Integrated host+device app C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

Serial Code (host)
  . . .
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
  . . .
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);

Courtesy Kirk & Hwu
CUDA Devices and Threads
• A compute device
  – Is a coprocessor to the CPU or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
  – Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – GPU needs 1000s of threads for full efficiency
    • A multi-core CPU needs only a few
Courtesy Kirk & Hwu
Extended C
• Declspecs
  – global, device, shared, local, constant
• Built-in variables
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
  __shared__ float region[M];
  ...
  region[threadIdx.x] = image[i];
  __syncthreads();
  ...
  image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

Courtesy Kirk & Hwu
Extended-C SW stack
[Compilation flow: integrated source (foo.cu) → nvcc C/C++ frontend → GPU assembly (foo.s) and CPU host code (foo.cpp). OCG compiles foo.s into architecture SASS (foo.sass); gcc / cl compiles the host code.]
[The same compilation flow, annotated with the tools that attach to it: cuda-gdb, CUDA Visual Profiler, and Parallel Nsight.]
CUDA software pipeline
• Source files have a mix of host and device code
• nvcc separates device code from host code
  – and compiles device code into PTX/cubin
• Host code is output as C code
  – nvcc can invoke the host compiler
  – or, it can be compiled later
• Applications can link to the generated host code
  – host code includes PTX/cubin code as a global initialized data array
  – and cudart (CUDA C runtime) function calls to load and launch kernels
• Alternatively, one may load and execute the PTX/cubin using the CUDA driver API (a sketch follows)
  – host code is then ignored
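As a concrete illustration of the driver-API path, here is a minimal sketch that loads and launches PTX without any cudart-generated host code. It assumes a PTX file foo.ptx containing an entry point named f (both hypothetical), and uses cuLaunchKernel from the driver API of later toolkit releases; error checking is omitted for brevity.

#include <cuda.h>

int main() {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    CUdeviceptr d_A;
    int N = 256;

    cuInit(0);                        // initialize the driver API
    cuDeviceGet(&dev, 0);             // first CUDA device
    cuCtxCreate(&ctx, 0, dev);        // create a context on it

    cuModuleLoad(&mod, "foo.ptx");    // hypothetical PTX produced by nvcc
    cuModuleGetFunction(&fn, mod, "f");

    cuMemAlloc(&d_A, N * sizeof(float));

    void *args[] = { &d_A };          // kernel parameters
    cuLaunchKernel(fn,
                   1, 1, 1,           // grid dimensions
                   N, 1, 1,           // block dimensions
                   0, 0,              // shared memory bytes, stream
                   args, 0);          // params, extra
    cuCtxSynchronize();

    cuMemFree(d_A);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}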
CUDA software architecture
[Software stack diagram (source: NVIDIA). Callouts: the stack provides library functions for the host as well as the device, and implements a subset of stdlib.]
System Requirements
• CUDA GPU
  – With CUDA device driver (a device-probe sketch follows this list)
• CUDA software
  – CUDA Toolkit
    • Tools to build a CUDA application
    • Libraries
    • Header files and other resources
  – CUDA SDK
    • Sample projects (with configurations), including utility functions
• C/C++ compiler
  – Needs to be a compatible version
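To confirm the driver and a CUDA GPU are actually present, a small probe along these lines can be built with the toolkit (a sketch; the particular fields printed are illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
        printf("  SMs: %d, global memory: %zu bytes, shared/block: %zu bytes\n",
               prop.multiProcessorCount, prop.totalGlobalMem,
               prop.sharedMemPerBlock);
    }
    return 0;
}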
Arrays of Parallel Threads
• A CUDA kernel is executed many times
  – By a block of threads running concurrently
  – Once per thread, each running the same kernel (SPMD)
  – Threads have access to their ID
    • may compute different memory addresses or control flow

[Diagram: threads with IDs 0-7 each execute the same body:]

float x = input[tID];
float y = func(x);
output[tID] = y;
Arrays of Parallel Blocks
• Multiple blocks of threads may execute a kernel
  – A grid of blocks
  – Threads within a block communicate using shared memory, global memory, atomic operations, and barriers
  – Threads in different blocks only share global memory

[Diagram: Thread Block 0 … Thread Block N - 1, each with thread IDs 0-7, all executing the same body (a complete sketch follows):]

float x = input[tID];
float y = func(x);
output[tID] = y;
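Putting blocks and thread IDs together, a complete version of the snippet above might look as follows. The global thread ID combines the block ID with the position within the block; func, the kernel name, and the sizes are illustrative placeholders.

#include <cuda_runtime.h>

__device__ float func(float x) { return 2.0f * x; }  // placeholder

__global__ void process(const float *input, float *output) {
    // Global thread ID: block offset plus position within the block
    int tID = blockIdx.x * blockDim.x + threadIdx.x;
    float x = input[tID];
    float y = func(x);
    output[tID] = y;
}

int main() {
    const int nBlocks = 4, nThreads = 8;   // N = 32 elements
    const int N = nBlocks * nThreads;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    // ... initialize d_in via cudaMemcpy ...
    process<<<nBlocks, nThreads>>>(d_in, d_out);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}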
Main CUDA Construct
• Run k instances of function f
  – Such parallel functions are called kernels
  – Declared with the __global__ specifier

// Kernel definition
__global__ void f(float* A)
{
    int id = threadIdx.x;
    ...
}

int main()
{
    // Kernel invocation
    f<<<1, N>>>(A);
}
Note: object-oriented parts of C++ are not supported for device code in earlier CUDA versions.
Kernel Invocation
• <<<A, B>>> specifies a 2-level hierarchy
  – Grid of blocks
    • |A| blocks, each block of size |B|
  – All threads within a block are scheduled on the same SIMD SM
    • Can share local memory
      – Actually called shared memory in CUDA lingo
      – There is separate thread-local memory, which, ironically, may not be physically close
    • Can synchronize with each other within a block
      – __syncthreads() (see the sketch after this list)
  – Different blocks are only loosely tied
    • Must be able to execute independently (concurrently)
    • They do share global memory
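A sketch of how shared memory and __syncthreads() combine within one block: each thread loads an element into shared memory, the barrier guarantees all loads are complete, and only then do threads read their neighbors' elements. The kernel and its names are illustrative.

__global__ void shift(const float *in, float *out, int n) {
    __shared__ float buf[256];          // shared across this block only
    int i = threadIdx.x;
    buf[i] = in[i];                     // each thread loads one element
    __syncthreads();                    // barrier: all loads now visible
    // Safe to read a neighbor's element only after the barrier
    out[i] = buf[(i + 1) % n];
}
// Launch with one block of n threads (n <= 256):
//   shift<<<1, n>>>(d_in, d_out, n);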
Thread Execution
• A block does not execute in a SIMD fashion
  – There are only 8 SPs per SM (on Tesla/G80)
• Threads execute in groups of 32 parallel threads
  – called warps
• A warp is divided into two half-warps
  – There need not be 32 or even 16 SPs
  – Logical separation; instructions may be “double pumped”
• All threads of a warp start together
  – But may diverge by branching
  – Branch paths are serialized until they converge back
  – Important efficiency consideration (see the divergence sketch below)
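A sketch of divergence within a warp: in the first kernel, threads of the same warp take different paths, so both paths are serialized; in the second, the condition is uniform across each 32-thread warp, so no serialization occurs within a warp. Both kernels are illustrative.

__global__ void diverge(float *data) {
    int i = threadIdx.x;
    // Odd/even threads of the same warp take different paths:
    // the warp pays for the if AND the else, one after the other.
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

__global__ void converge(float *data) {
    int i = threadIdx.x;
    // Condition is uniform across each 32-thread warp, so each
    // warp executes only one of the two paths.
    if ((i / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}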
Grid/Block Dimension
• A and B need not be ints
  – A is an (up to) two-dimensional vector
    • dim3 A(m, n)
  – B is an (up to) three-dimensional vector
    • dim3 B(a, b, c); a, b, c are ints
    • a × b × c <= 512 on Tesla (1024 on Fermi)
      – Resource sharing further limits the count
      – Up to 8 blocks may co-exist on an SM; at least 1 must fit
    • c is the most significant dimension
    • a is the least significant dimension
• Dereference: B.x, B.y and B.z
• Thread ID = (x + y * a + z * a*b) for the thread at position (x, y, z) in the block (example below)
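In code, the launch and the ID computation look like this (a sketch; the grid and block sizes are illustrative, and every block computes the same block-local IDs here purely to show the ID math):

#include <cuda_runtime.h>

__global__ void kernel3D(float *out) {
    // Flattened ID within the block: x least significant, z most
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    out[tid] = (float)tid;       // illustrative use (block-local ID)
}

int main() {
    dim3 A(4, 2);                // 4 x 2 grid of blocks
    dim3 B(8, 8, 4);             // 8 * 8 * 4 = 256 threads per block (<= 512)
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    kernel3D<<<A, B>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}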
Thread ID
[Diagram: a thread at position (x, y, z) in an a × b × c block has ID (x + y * a + z * a*b).]
CUDA Memory Model Overview
• Global memory
  – Main host-device data communication path
  – Visible to all threads
  – Long latency
• Shared memory
  – Fast memory
  – Use as scratch
  – Shared across a block
• More memory segments
  – Constant and texture
    • Read-only, cached

[Diagram: the host exchanges data with the grid's global memory; each block has its own shared memory and each thread its own registers. Courtesy Kirk & Hwu]
Memory Model Details
• Shared memory is tied to a block
  – Lifetime ends with the block
• Global, constant, and texture memories are persistent across kernels (within an application)
• These are recognized as device memory
  – Separate from host memory
• The app must explicitly allocate/de-allocate device memory
  – And manage data transfer between host and device memory
CUDA Device Memory Allocation
• cudaMalloc()
  – Allocates in global memory
  – Requires parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object
  – Beware of display mode change
• cudaFree()
  – Frees an object from global memory
    • Takes a pointer to the object to free
• Called on the host!
  – Feels like host pointers

[Same memory-model diagram as before. Thanks Kirk & Hwu]
CUDA Device Memory Allocation (cont.)
• Code example:
  – Allocate a 64 * 64 single-precision float array
  – Attach the allocated storage to Md
  – Suffix “d” often used for device data

int TILE_WIDTH = 64;
float *Md, *M;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);
...
cudaFree(Md);

Courtesy Kirk & Hwu
Example Memory Copy

size_t size = N * sizeof(float);
// Allocate vectors in host memory
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
// Make sure to initialize the input vector h_A
float *d_A, *d_B;
// Allocate vectors in device “global” memory
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
// Copy host->device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
// Invoke kernel on GPU
ProcessDo<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, N);
// Copy result from device memory (d_B) to host memory (h_B)
cudaMemcpy(h_B, d_B, size, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(d_A);
cudaFree(d_B);
See cudaMallocPitch() and cudaMalloc3D() for allocating 2D/3D arrays. These pad rows to meet alignment requirements for efficient access (also see cudaMemcpy2D() and cudaMemcpy3D()). A pitched-allocation sketch follows.
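A minimal sketch of the pitched-allocation pattern for a 2D array, assuming illustrative sizes; note that rows are addressed through the returned pitch (in bytes), not through the logical width:

#include <cuda_runtime.h>

int main() {
    const int width = 100, height = 64;    // illustrative sizes
    float *d_arr;
    size_t pitch;                          // bytes per (padded) row
    cudaMallocPitch((void**)&d_arr, &pitch, width * sizeof(float), height);

    // Row r starts pitch bytes after row r-1, so index via char*:
    //   float *row = (float*)((char*)d_arr + r * pitch);

    // Copy a host array whose rows are tightly packed (spitch = width bytes):
    float h_arr[100 * 64] = {0};
    cudaMemcpy2D(d_arr, pitch, h_arr, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaFree(d_arr);
    return 0;
}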
CUDA Host-Device Data Transfer
• cudaMemcpy()
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous transfer is also possible

[Same memory-model diagram as before. Courtesy Kirk & Hwu]
CUDA Host-Device Data Transfer
• Code example:
  – Recall the allocation earlier
  – Transfer a 64 * 64 float array
  – M is in host memory and Md is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

Courtesy Kirk & Hwu
More Ways to Initialize
• Constant memory can be initialized from the host:

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));

• There is also page-locked (i.e., pinned) host memory
  – cudaHostAlloc() and cudaFreeHost()
  – Copies between page-locked host memory and device memory can be performed concurrently with kernel execution
  – Page-locked host memory can be directly mapped into the address space of the device
  – Bandwidth between host memory and device memory is generally higher (a sketch follows)
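A sketch of the pinned-memory pattern: allocate page-locked host memory, then issue asynchronous copies and the kernel launch in a stream so they can overlap with other work. The kernel name and sizes are placeholders.

#include <cuda_runtime.h>

__global__ void work(float *d) { d[threadIdx.x] *= 2.0f; }  // placeholder

int main() {
    const int N = 256;
    float *h_data, *d_data;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Page-locked host allocation: enables async, higher-bandwidth copies
    cudaHostAlloc((void**)&h_data, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_data, N * sizeof(float));

    // Asynchronous copies: return immediately, ordered within the stream
    cudaMemcpyAsync(d_data, h_data, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    work<<<1, N, 0, stream>>>(d_data);
    cudaMemcpyAsync(h_data, d_data, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);       // wait for the whole sequence

    cudaFree(d_data);
    cudaFreeHost(h_data);
    cudaStreamDestroy(stream);
    return 0;
}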
