
CUDA Odds and Ends

Joseph Kider
University of Pennsylvania
CIS 565 - Fall 2011
Sources
- Patrick Cozzi, Spring 2011
- NVIDIA CUDA Programming Guide
- CUDA by Example
- Programming Massively Parallel Processors
Agenda
- Atomic Functions
- Page-Locked Host Memory
- Streams
- Graphics Interoperability
Atomic Functions
- What is the value of count if 8 threads execute ++count?

__device__ unsigned int count = 0;

// ...
++count;
Atomic Functions
- Read-modify-write atomic operation
  - Guaranteed no interference from other threads
  - No guarantee on order
- Shared or global memory
- Requires compute capability 1.1 (> G80)

See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements
Atomic Functions
- What is the value of count if 8 threads execute atomicInc below?

__device__ unsigned int count = 0;

// ...
// atomic ++count
// (atomicInc wraps back to 0 once the old value reaches its second
// argument, so UINT_MAX, not 1, gives plain increment behavior)
atomicInc(&count, UINT_MAX);
Atomic Functions
- How do you implement atomicInc?

__device__ int atomicAdd(
    int *address, int val);
Atomic Functions
- How do you implement atomicInc?

__device__ int atomicAdd(
    int *address, int val)
{
    // Made-up keyword:
    __lock(address) {
        *address += val;
    }
}
Atomic Functions
- How do you implement atomicInc without locking?
- What if you were given an atomic compare and swap?

int atomicCAS(int *address, int compare, int val);
Atomic Functions
- atomicCAS pseudo implementation

int atomicCAS(int *address, int compare, int val)
{
    // Made-up keyword:
    __lock(address) {
        int old = *address;
        *address = (old == compare) ? val : old;
        return old;
    }
}
Atomic Functions
- Example:

*addr = 1;

atomicCAS(addr, 1, 2); // returns 1, *addr = 2
atomicCAS(addr, 1, 3); // returns 2, *addr unchanged (compare failed)
atomicCAS(addr, 2, 3); // returns 2, *addr = 3
Atomic Functions
- Again, how do you implement atomicInc given atomicCAS?

__device__ int atomicAdd(
    int *address, int val);
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
    // Read the original value at *address
    int old = *address, assumed;
    do {
        assumed = old;
        // If the value at *address didn't change, increment it;
        // otherwise, loop until atomicCAS succeeds
        old = atomicCAS(address, assumed, assumed + val);
    } while (assumed != old);
    return old;
}

The value of *address after this function returns is not necessarily the original value of *address + val. Why?
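Atomic Functions
The same compare-and-swap loop generalizes to other operations. Here is a sketch assuming only atomicCAS (CUDA already provides a built-in atomicMax; this is just the exercise):

__device__ int atomicMaxCAS(int *address, int val)
{
    int old = *address, assumed;
    do {
        assumed = old;
        // Write the larger of the two values; retry if *address changed
        old = atomicCAS(address, assumed, max(assumed, val));
    } while (assumed != old);
    return old;
}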
Atomic Functions
- Lots of atomics:

// Arithmetic       // Bitwise
atomicAdd()         atomicAnd()
atomicSub()         atomicOr()
atomicExch()        atomicXor()
atomicMin()
atomicMax()
atomicInc()
atomicDec()
atomicCAS()

See B.10 in the NVIDIA CUDA C Programming Guide
Atomic Functions
- How can threads from different blocks work together?
- Use atomics sparingly. Why?
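Atomic Functions
A minimal sketch of cross-block cooperation through a global atomic counter (the kernel and variable names here are made up for illustration):

__device__ unsigned int blocksDone = 0;

__global__ void countFinishedBlocks()
{
    // ... per-block work ...
    __syncthreads();

    // One thread per block bumps a global counter; atomicAdd makes
    // the read-modify-write safe across all blocks
    if (threadIdx.x == 0)
        atomicAdd(&blocksDone, 1);
}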
Page-Locked Host Memory
- Page-locked memory
  - Host memory that is essentially removed from virtual memory, so it can never be paged out
  - Also called pinned memory
Page-Locked Host Memory
- Benefits
  - Overlap kernel execution and data transfers

Normally, the data transfer must complete before the kernel executes; with page-locked memory, the data transfer and kernel execution can overlap in time.

See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements
Page-Locked Host Memory
- Benefits
  - Increased memory bandwidth for systems with a front-side bus
    - Up to ~2x throughput

Image from http://arstechnica.com/hardware/news/2009/10/day-of-nvidia-chipset-reckoning-arrives.ars
Page-Locked Host Memory
- Benefits
  - Write-combining memory
    - Page-locked memory is cacheable by default
    - Allocate with cudaHostAllocWriteCombined to
      - Avoid polluting the L1 and L2 caches
      - Avoid snooping transfers across PCIe
      - Improve transfer performance by up to 40% - in theory
    - Reading from write-combining memory is slow!
      - Only write to it from the host
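Page-Locked Host Memory
A short sketch of allocating a write-combined staging buffer (the buffer name and size are illustrative):

float *stagePtr;
size_t bytes = 1024 * sizeof(float);

// Page-locked + write-combined: fast host-to-device transfers,
// but host reads from this buffer are very slow
cudaHostAlloc((void**)&stagePtr, bytes, cudaHostAllocWriteCombined);

// ... write (never read) stagePtr on the host, then copy to device ...

cudaFreeHost(stagePtr);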
Page-Locked Host Memory
- Benefits
  - Page-locked host memory can be mapped into the address space of the device on some systems
    - What systems allow this?
    - What does this eliminate?
    - What applications does this enable?
  - Call cudaGetDeviceProperties() and check canMapHostMemory
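Page-Locked Host Memory
A sketch of mapping page-locked memory into the device's address space (often called zero-copy; the names and size are illustrative):

float *hostPtr, *devPtr;
size_t bytes = 1024 * sizeof(float);

// Must be set before any other CUDA call in the process
cudaSetDeviceFlags(cudaDeviceMapHost);

// Page-locked allocation that is also mapped into device space
cudaHostAlloc((void**)&hostPtr, bytes, cudaHostAllocMapped);

// Get the device-side pointer for the same memory
cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);

// Kernels can now access the host buffer through devPtr,
// eliminating the explicit cudaMemcpy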
Page-Locked Host Memory
- Usage:

cudaHostAlloc() / cudaMallocHost()
cudaFreeHost()
cudaMemcpyAsync()

See 3.2.5 in the NVIDIA CUDA C Programming Guide
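Page-Locked Host Memory
A minimal, self-contained usage sketch (sizes and names are illustrative; error checking omitted):

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;
    float *pinned, *devBuf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost((void**)&pinned, bytes); // page-locked host allocation
    cudaMalloc((void**)&devBuf, bytes);

    // The async copy returns immediately and can overlap kernel
    // execution because the host buffer is page-locked
    cudaMemcpyAsync(devBuf, pinned, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream); // wait for the copy to finish

    cudaFree(devBuf);
    cudaFreeHost(pinned);
    cudaStreamDestroy(stream);
    return 0;
}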


Page-Locked Host Memory

DEMO
CUDA SDK Example: bandwidthTest
Page-Locked Host Memory
- What's the catch?
  - Page-locked memory is scarce
    - Allocations will start failing well before allocations in pageable memory do
  - Reduces the amount of physical memory available to the OS for paging
    - Allocating too much will hurt overall system performance
Streams
- Stream: sequence of commands that execute in order
- Streams may execute their commands out of order or concurrently with respect to other streams

Stream A: Command 0, Command 1, Command 2
Stream B: Command 0, Command 1, Command 2
Streams
- Is this a possible order?

Time: Stream A's Command 0, 1, 2, then Stream B's Command 0, 1, 2 (the streams run back to back)

Yes - each stream's commands execute in order, and nothing forces the two streams to overlap.
Streams
- Is this a possible order?

Time: Command 0, Command 0, Command 1, Command 1, Command 2, Command 2 (the two streams' commands interleaved)

Yes - within each stream the commands still execute in order; any interleaving across streams is allowed.
Streams
- Is this a possible order?

Time: Command 0, Command 0, Command 2, Command 2, Command 1, Command 1

No - this would execute a stream's Command 2 before its Command 1, and commands within a stream must execute in order.
Streams
- Is this a possible order?

Time: Command 0 || Command 0, Command 1 || Command 1, Command 2 || Command 2 (the two streams running concurrently in lockstep)

Yes - streams may execute their commands concurrently with respect to other streams.
Streams
- In CUDA, what commands go in a stream?
  - Kernel launches
  - Host <-> device memory transfers
Streams
- Code Example
  1. Create two streams
  2. In each stream:
     1. Copy page-locked memory to device
     2. Launch kernel
     3. Copy memory back to host
  3. Destroy streams
Stream Example (Step 1 of 3)

// Create two streams
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
{
    cudaStreamCreate(&stream[i]);
}

// Allocate two buffers in page-locked memory
float *hostPtr;
cudaMallocHost(&hostPtr, 2 * size);
Stream Example (Step 2 of 3)

for (int i = 0; i < 2; ++i)
{
    cudaMemcpyAsync(/* ... */,
        cudaMemcpyHostToDevice, stream[i]);
    kernel<<<100, 512, 0, stream[i]>>>(/* ... */);
    cudaMemcpyAsync(/* ... */,
        cudaMemcpyDeviceToHost, stream[i]);
}

Commands are assigned to, and executed by, streams
Stream Example (Step 3 of 3)

for (int i = 0; i < 2; ++i)
{
    // Blocks until commands complete
    cudaStreamDestroy(stream[i]);
}
Streams
- Assume the compute capability supports:
  - Overlap of data transfer and kernel execution
  - Concurrent kernel execution
  - Concurrent data transfer
- How can the streams overlap?

See G.1 in the NVIDIA CUDA C Programming Guide for more on compute capabilities
Streams
- Can we have more overlap than this?

Stream A: host-to-device transfer, then kernel execution, then device-to-host transfer
Stream B: the same sequence, shifted so B's host-to-device transfer overlaps A's kernel, and B's kernel overlaps A's device-to-host transfer
Streams
- Can we have this?

Stream A: host-to-device transfer, then kernel execution, then device-to-host transfer
Stream B: host-to-device transfer overlapping A's kernel, a kernel running concurrently with A's kernel, then a device-to-host transfer
Streams
- Implicit Synchronization
  - An operation that requires a dependency check to see if a kernel finished executing blocks all kernel launches from any stream until the checked kernel is finished

See 3.2.6.5.3 in the NVIDIA CUDA C Programming Guide for all limitations
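Streams
An illustrative sketch (the kernel, grid, and stream names are made up) of one common source of implicit synchronization, a launch into the default stream:

kernelA<<<grid, block, 0, streamA>>>();

// Launched into the default stream (stream 0): this implicitly
// synchronizes, so it waits for kernelA, and kernelB waits for it
kernelDefault<<<grid, block>>>();

kernelB<<<grid, block, 0, streamB>>>();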
Streams
- Can we have this?

Stream A: host-to-device transfer, kernel execution, then a device-to-host transfer that is dependent on the kernel's completion
Stream B: host-to-device transfer, kernel execution, device-to-host transfer

No - the dependency check for Stream A's device-to-host transfer blocks Stream B's kernel until the kernel from Stream A completes.
Streams
- Performance Advice
  - Issue all independent commands before dependent ones
  - Delay synchronization (implicit or explicit) as long as possible
Streams
- Rewrite this to allow concurrent kernel execution

for (int i = 0; i < 2; ++i)
{
    cudaMemcpyAsync(/* ... */, stream[i]);
    kernel<<< /* ... */ stream[i]>>>();
    cudaMemcpyAsync(/* ... */, stream[i]);
}
Streams

for (int i = 0; i < 2; ++i) // to device
    cudaMemcpyAsync(/* ... */, stream[i]);

for (int i = 0; i < 2; ++i)
    kernel<<< /* ... */ stream[i]>>>();

for (int i = 0; i < 2; ++i) // to host
    cudaMemcpyAsync(/* ... */, stream[i]);
Streams
- Explicit Synchronization
  - cudaThreadSynchronize()
    - Blocks until commands in all streams finish
  - cudaStreamSynchronize()
    - Blocks until commands in a stream finish

See 3.2.6.5 in the NVIDIA CUDA C Programming Guide for more synchronization functions
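Streams
A short sketch contrasting the two calls (the kernel, grid, and stream names are illustrative):

kernelA<<<grid, block, 0, streamA>>>();
kernelB<<<grid, block, 0, streamB>>>();

// Wait only for streamA; streamB may still be running
cudaStreamSynchronize(streamA);

// Wait for the commands in every stream to finish
cudaThreadSynchronize();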
Timing with Stream Events
- Events can be added to a stream to monitor the device's progress
- An event is completed when all commands in the stream preceding it complete
Timing with Stream Events

// Create two events; each will record a timestamp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record events before and after each stream is assigned its work
cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
{
    // ... assign work to stream[i] ...
}
cudaEventRecord(stop, 0);

// Block until the stop event completes
cudaEventSynchronize(stop);

// Compute the elapsed time
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
// cudaEventDestroy(...)
Graphics Interoperability
- What applications use both CUDA and OpenGL/Direct3D?
  - CUDA -> GL
  - GL -> CUDA
- If CUDA and GL cannot share resources, what is the performance implication?
Graphics Interoperability
- Graphics Interop: map a GL resource into the CUDA address space
  - Buffers
  - Textures
  - Renderbuffers
Graphics Interoperability
- OpenGL Buffer Interop
  1. Assign the device with GL interop
     - cudaGLSetGLDevice()
  2. Register the GL resource with CUDA
     - cudaGraphicsGLRegisterBuffer()
  3. Map it
     - cudaGraphicsMapResources()
  4. Get the mapped pointer
     - cudaGraphicsResourceGetMappedPointer()
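Graphics Interoperability
A sketch of the four steps for an existing GL buffer object (vbo is assumed to be a GLuint already created with OpenGL; error checking omitted):

// 1. Pair the CUDA device with the GL context (before other CUDA calls)
cudaGLSetGLDevice(0);

// 2. Register the GL buffer object with CUDA
cudaGraphicsResource *resource;
cudaGraphicsGLRegisterBuffer(&resource, vbo, cudaGraphicsMapFlagsNone);

// 3. Map it (typically once per frame)
cudaGraphicsMapResources(1, &resource, 0);

// 4. Get a device pointer that CUDA kernels can read and write
float *devPtr;
size_t numBytes;
cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &numBytes, resource);

// ... launch kernels on devPtr ...

// Unmap before GL uses the buffer again
cudaGraphicsUnmapResources(1, &resource, 0);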
