
CUDA 6.0
PRACE School
BSC / UPC, Barcelona, June 2-6, 2014

Acknowledgements

To the great Nvidia people, for sharing with me ideas, material, figures, presentations... Particularly, for this presentation:
  Mark Ebersole (webinars and slides):
    CUDA 6.0 overview.
    Optimizations for Kepler.
  Mark Harris (SC13 talk, webinar and Parallel Forall blog):
    CUDA 6.0 announcements.
    New hardware features in Maxwell.

Manuel Ujaldón
Associate Professor, Univ. of Malaga (Spain)
Conjoint Senior Lecturer, Univ. of Newcastle (Australia)
CUDA Fellow

Talk contents [40 slides]

1. The evolution of CUDA [7 slides]
2. CUDA 6.0 support [5]
3. Compiling and linking (CUDA 5.0 only) [3]
4. Dynamic parallelism (CUDA 5 & 6) [6]
5. New tools for development, debugging and optimization (CUDA 5 & 6) [1]
6. GPUDirect-RDMA (CUDA 5 & 6) [4]
7. Unified memory (CUDA 6.0 only) [13]
I. The evolution of CUDA
The impressive evolution of CUDA

                     Year 2008     Year 2014
CUDA-capable GPUs    100,000,000   500,000,000
CUDA downloads       150,000       2,100,000
Supercomputers       1             52
University courses   60            780
Academic papers      4,000         40,000

The CUDA software is downloaded once every minute.

Worldwide distribution of CUDA university courses: [world map figure]

Summary of GPU evolution

2001: First many-cores (vertex and pixel processors).
2003: Those processors become programmable (with Cg).
2006: Vertex and pixel processors unify.
2007: CUDA emerges.
2008: Double-precision floating-point arithmetic.
2010: Operands are IEEE-normalized and memory is ECC.
2012: Wider support for irregular computing.
2014: The CPU-GPU memory space is unified.
Still pending: reliability in clusters and connection to disk.

The CUDA family picture: [figure]
CUDA 5.0 highlights

Dynamic parallelism:
  Spawn new parallel work from within GPU code (from GK110 on).
GPU object linking:
  Libraries and plug-ins for GPU code.
New Nsight Eclipse Edition:
  Develop, debug and optimize... all in one tool!
GPUDirect:
  RDMA between GPUs and PCI-express devices.

CUDA 5.5 highlights

Optimized for MPI applications:
  Enhanced Hyper-Q support for multiple MPI processes via the new Multi-Process Service (MPS) on Linux systems.
  MPI workload prioritization enabled by CUDA stream prioritization.
  Multi-process MPI debugging & profiling.
Support for ARM platforms:
  Native compilation, for easy application porting.
  Fast cross-compilation on x86 for large applications.
Developer tools:
  Step-by-step guidance for Visual Profiler and Nsight Eclipse Edition.
  Single-GPU debugging for Linux.
  Static CUDA runtime library.

CUDA 6.0 highlights

Unified memory:
  CPU and GPU can share data without much programming effort.
Extended Library Interface (XT) and drop-in libraries:
  Libraries much easier to use.
GPUDirect RDMA:
  A key achievement in multi-GPU environments.
Developer tools:
  Visual Profiler enhanced with:
    Side-by-side source and disassembly view.
    New analysis passes (per-SM activity level) that generate a kernel analysis report.
  Multi-Process Server (MPS) support in nvprof and cuda-memcheck.
  Nsight Eclipse Edition supports remote development (x86 and ARM).

II. CUDA 6.0 support
(operating systems and platforms)
Operating systems and platforms (depending on OS)

CUDA 6 Production Release: https://developer.nvidia.com/cuda-downloads
Windows:
  XP, Vista, 7, 8, 8.1, Server 2008 R2, Server 2012.
  Visual Studio 2008, 2010, 2012, 2012 Express.
Linux:
  Fedora 19.
  RHEL & CentOS 5, 6.
  OpenSUSE 12.3.
  SUSE SLES 11 SP2, SP3.
  Ubuntu 12.04 LTS (including ARM cross and native), 13.04.
  ICC 13.1.
Mac:
  OSX 10.8, 10.9.

GPUs for CUDA 6.0

CUDA Compute Capability 3.0 (sm_30; 2012 versions of Kepler like Tesla K10, GK104):
  Does not support dynamic parallelism nor Hyper-Q.
  Supports unified memory with a separate pool of shared data with auto-migration (a subset of the memory, which has many limitations).
CUDA Compute Capability 3.5 (sm_35; 2013 and 2014 versions of Kepler like Tesla K20, K20X and K40, GK110):
  Supports dynamic parallelism and Hyper-Q.
  Supports unified memory, with restrictions similar to CCC 3.0.
CUDA Compute Capability 5.0 (sm_50; 2014 versions of Maxwell like GeForce GTX750Ti, GM107-GM108):
  Full support of dynamic parallelism, Hyper-Q and unified memory.

Deprecations

Things that tend to be obsolete:
  Still supported.
  Not recommended.
  New developments may not work with them.
  Likely to be dropped in the future.
Some examples:
  32-bit applications on x86 Linux (toolkit & driver).
  32-bit applications on Mac (toolkit & driver).
  G80 platform / sm_10 (toolkit).
Dropped support

cuSPARSE legacy API.
Ubuntu 10.04 LTS (toolkit & driver).
SUSE Linux Enterprise Server 11 SP1 (toolkit & driver).
Mac OSX 10.7 (toolkit & driver).
Mac models with the MCP79 chipset (driver):
  iMac: 20-inch (early 09), 24-inch (early 09), 21.5-inch (late 09).
  MacBook Pro: 15-inch (late 08), 17-inch (early 09), 17-inch (mid 09), 15-inch (mid 09), 15-inch 2.53 GHz (mid 09), 13-inch (mid 09).
  Mac mini: early 09, late 09.
  MacBook Air: late 08, mid 09.

III. Compiling and linking

CUDA 4.0: Whole-program compilation and linking

CUDA 4 required a single source file for a single kernel. It was not possible to link external device code: all files had to be included together to build one whole program.
[Figure: include files together to build the executable.]

CUDA 5.0: Separate compilation & linking

Now it is possible to compile and link each file separately. That way, we can build multiple object files independently, which can later be linked to build the executable file.
We can also combine object files into static libraries, which can be shared between different source files when linking:
  To facilitate code reuse.
  To reduce the compilation time.
This also enables closed-source device libraries to call user-defined device callback functions. Both workflows are sketched below.
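For illustration, a minimal sketch of separate compilation, assuming two hypothetical source files kernels.cu and main.cu, where main.cu calls a __global__ function defined in kernels.cu:

    // kernels.cu (hypothetical)
    __global__ void scale(float *v, float a) { v[threadIdx.x] *= a; }

    // main.cu (hypothetical)
    __global__ void scale(float *v, float a);   // declaration; device code linked in later

    int main() {
        float *v;
        cudaMalloc(&v, 32 * sizeof(float));
        scale<<<1, 32>>>(v, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(v);
        return 0;
    }

    // Compile each file separately with relocatable device code (-dc), then link:
    //   nvcc -arch=sm_35 -dc kernels.cu main.cu
    //   nvcc -arch=sm_35 kernels.o main.o -o app
    // Or archive the device objects into a static library first:
    //   nvcc -lib kernels.o -o libkernels.a
    //   nvcc -arch=sm_35 main.o libkernels.a -o app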

IV. Dynamic parallelism in CUDA 5 & 6



Dynamic parallelism allows CUDA 5.0 to improve three primary issues:

Execution:
  Data-dependent execution.
  Recursive parallel algorithms.
Performance:
  Dynamic load balancing.
  Thread scheduling to help fill the GPU.
Programmability:
  Library calls from GPU kernels.
  Simplify the CPU/GPU division of work.

Familiar syntax and programming model:

    int main() {
        float *data;
        setup(data);
        A <<< ... >>> (data);
        B <<< ... >>> (data);
        C <<< ... >>> (data);
        cudaDeviceSynchronize();
        return 0;
    }

    __global__ void B(float *data) {
        do_stuff(data);
        X <<< ... >>> (data);
        Y <<< ... >>> (data);
        Z <<< ... >>> (data);
        cudaDeviceSynchronize();
        do_more_stuff(data);
    }

The CPU launches kernels A, B and C; kernel B then launches grids X, Y and Z directly from the GPU.
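As a compilable counterpart to the sketch above (hypothetical kernel names; requires a CCC 3.5+ GPU):

    __global__ void child(float *data) {
        data[threadIdx.x] += 1.0f;          // child grid works on the same data
    }

    __global__ void parent(float *data) {
        child<<<1, 32>>>(data);             // launched from GPU code, no CPU round-trip
        cudaDeviceSynchronize();            // device-side wait for the child grid
    }

    int main() {
        float *data;
        cudaMalloc(&data, 32 * sizeof(float));
        cudaMemset(data, 0, 32 * sizeof(float));
        parent<<<1, 1>>>(data);             // the CPU only launches the parent
        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }

    // Build with relocatable device code and the device runtime library:
    //   nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt -o dynpar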
Before CUDA 6.0: Tight limit on the Pending Launch Buffer (PLB)

Applications using dynamic parallelism can launch too many grids and exhaust the pre-allocated pending launch buffer (PLB).
This results in launch failures, sometimes intermittent due to scheduling.
PLB size tuning can fix the problem, but often involves trial-and-error.
[Figure: a finite pending launch buffer overflows, causing out-of-memory failures with too many concurrent launches.]

CUDA 6.0 uses an extended PLB (EPLB)

EPLB guarantees that all launches succeed by falling back to a lower-performance virtualized launch buffer when the fast PLB is full.
No more launch failures, regardless of scheduling.
PLB size tuning provides a direct path to performance improvement.
Enabled by default.
[Figure: the finite pending launch buffer is backed by a virtualized extended PLB.]
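The PLB size mentioned above is exposed through the device runtime limits. A minimal sketch (the call and limit name are from the CUDA runtime API; the chosen value is an arbitrary example):

    #include <cstdio>

    int main() {
        // Enlarge the fast pending launch buffer before using dynamic parallelism.
        cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);

        size_t plb = 0;
        cudaDeviceGetLimit(&plb, cudaLimitDevRuntimePendingLaunchCount);
        printf("PLB capacity: %zu pending grids\n", plb);
        return 0;
    }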

CUDA 6.0: Performance improvements in key use cases

Kernel launch.
Repeated launch of the same set of kernels.
cudaDeviceSynchronize().
Back-to-back grids in a stream.

[Chart: performance improvements on dynamic parallelism. Back-to-back launch and launch-and-synchronize times (microseconds) fall steadily from CUDA 5 to CUDA 5.5 to CUDA 6, down to roughly 10.6 and 9.1 microseconds in CUDA 6.]
V. New tools for development, debugging and optimization

New features in Nvidia Nsight, Eclipse Edition (also available for Linux and Mac OS)

CUDA-aware editor:
  Automated CPU-to-GPU code refactoring.
  Semantic highlighting of CUDA code.
  Integrated code samples & docs.
Nsight debugger:
  Simultaneous debugging of CPU and GPU code.
  Inspect variables across CUDA threads.
  Breakpoints & single-step debugging.
Nsight profiler:
  Quickly identifies bottlenecks at the source-line level, using a unified CPU-GPU trace.
  Integrated expert system.
  Fast edit-build-profile optimization cycle.

VI. GPUDirect

Communication among GPU memories

GPUDirect 1.0 was released with Fermi to allow communications among GPUs within CPU clusters.
[Figure: sender and receiver data paths through host memory.]

Kepler + CUDA 5 support GPUDirect-RDMA (Remote Direct Memory Access)

This allows a more direct transfer path between GPUs.
Usually, the link is PCI-express or InfiniBand.

GPUDirect-RDMA in Maxwell

The situation is more complex in CUDA 6.0 with unified memory.

Preliminary results using GPUDirect-RDMA (better performance ahead with CUDA 6.0 & OpenMPI)

Inter-node GPU-GPU latency, measured using:
  Tesla K40m GPUs (no GeForces).
  MPI MVAPICH2 library.
  ConnectX-3 adapters, IVB 3 GHz.
[Chart: GPU-GPU latency (microseconds) vs. message size (bytes).]

Better MPI application scaling:
  Code: HSG (bioinformatics).
  2 GPU nodes, 4 MPI processes on each node.
[Chart: total execution time (seconds).]
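For reference, this is the kind of code such benchmarks exercise: with a CUDA-aware MPI build (e.g., MVAPICH2 with GPUDirect-RDMA enabled), device pointers are passed straight to MPI calls. A hedged sketch, not the actual benchmark code:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *d_buf;                      // device memory; no host staging buffer
        cudaMalloc(&d_buf, 1024 * sizeof(float));

        if (rank == 0)                     // RDMA can move GPU data directly to the wire
            MPI_Send(d_buf, 1024, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, 1024, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }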
VII. Unified memory
The idea

[Figure: today, the CPU uses dual-, tri- or quad-channel DDR3 main memory (~100 GB/s) and the GPU uses 256-, 320- or 384-bit GDDR5 video memory (~300 GB/s), joined by PCI-express (~10 GB/s). From Kepler on, CPU and GPU see a single unified memory.]

Unified memory contributions

Simpler programming and memory model:
  Single pointer to data, accessible anywhere.
  Eliminates the need for cudaMemcpy().
  Greatly simplifies code porting.
Performance through data locality:
  Data migrates to the accessing processor.
  Global coherency is guaranteed.
  Hand tuning with cudaMemcpyAsync() is still allowed.

System requirements

GPU: Kepler (GK10x+) or Maxwell (GM10x+); limited performance on CCC 3.0 and CCC 3.5.
Operating system: 64 bits.
Windows: 7 or 8 (WDDM & TCC); no XP/Vista.
Linux: kernel 2.6.18+; all CUDA-supported distros, but not ARM.
Linux on ARM: ARM64 only.
Mac OSX: not supported in CUDA 6.0.

CUDA memory types

                      Zero-copy (pinned memory)  Unified virtual addressing  Unified memory
CUDA call             cudaMallocHost(&A, 4);     cudaMalloc(&A, 4);          cudaMallocManaged(&A, 4);
Allocation fixed in   Main memory (DDR3)         Video memory (GDDR5)        Both
Local access for      CPU                        Home GPU                    CPU and home GPU
PCI-e access for      All GPUs                   Other GPUs                  Other GPUs
Other features        Avoids swapping to disk    No CPU access               On-access CPU/GPU migration
Coherency             At all times               Between GPUs                Only at launch & sync
Full support since    CUDA 2.2                   CUDA 1.0                    CUDA 6.0
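The three columns of the table correspond to three allocation calls. A minimal side-by-side sketch (the buffer size is an assumed example):

    #include <cuda_runtime.h>

    int main() {
        const int N = 1024;                          // assumed element count
        float *zc, *dev, *um;
        cudaMallocHost(&zc, N * sizeof(float));      // zero-copy: pinned in main memory (DDR3)
        cudaMalloc(&dev, N * sizeof(float));         // fixed in video memory (GDDR5); no CPU access
        cudaMallocManaged(&um, N * sizeof(float));   // unified memory: one pointer for CPU and GPU

        // ... use the buffers ...

        cudaFreeHost(zc);
        cudaFree(dev);
        cudaFree(um);
        return 0;
    }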
Additions to the CUDA API

New call: cudaMallocManaged().
  Drop-in replacement for cudaMalloc(); allocates managed memory.
  Returns a pointer accessible from both host and device.
New call: cudaStreamAttachMemAsync().
  Manages concurrency in multi-threaded CPU applications.
New keyword: __managed__.
  Global variable annotation that combines with __device__.
  Declares a global-scope migratable device variable.
  The symbol is accessible from both GPU and CPU code.

A preliminary example: sorting elements from a file

CPU code in C:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *) malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

GPU code in CUDA 6.0 (here qsort is assumed to be a GPU sorting kernel):

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);
        fread(data, 1, N, fp);
        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();
        use_data(data);
        cudaFree(data);
    }
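Putting cudaMallocManaged() and __managed__ together, a minimal runnable sketch (hypothetical kernel and sizes; requires CCC 3.0+ and a 64-bit OS):

    #include <cstdio>

    __managed__ int sum = 0;                        // global-scope migratable variable

    __global__ void accumulate(int *data, int n) {
        for (int i = 0; i < n; i++)                 // single thread, just to show the data path
            sum += data[i];
    }

    int main() {
        int *data;
        cudaMallocManaged(&data, 4 * sizeof(int));  // one pointer, valid on CPU and GPU
        for (int i = 0; i < 4; i++)
            data[i] = i + 1;                        // the CPU writes directly, no cudaMemcpy()
        accumulate<<<1, 1>>>(data, 4);
        cudaDeviceSynchronize();                    // mandatory before the CPU touches the data again
        printf("sum = %d\n", sum);                  // the CPU reads the managed variable
        cudaFree(data);
        return 0;
    }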

Before unified memory

    struct dataElem {
        int prop1;
        int prop2;
        char *text;
    };

A deep copy is required: we must copy the structure and everything that it points to. This is why C++ invented the copy constructor.
CPU and GPU cannot share a copy of the data (coherency). There are two addresses and two copies of the data, which prevents memcpy-style comparisons, checksumming and other things.
[Figure: CPU memory and GPU memory each hold their own copy of dataElem and of the text "Hello, world".]

The code required without unified memory

    void launch(dataElem *elem) {
        dataElem *g_elem;
        char *g_text;
        int textlen = strlen(elem->text);

        // Allocate storage for struct and text.
        cudaMalloc(&g_elem, sizeof(dataElem));
        cudaMalloc(&g_text, textlen);

        // Copy up each piece separately, including the new text pointer value.
        cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
        cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
        cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);

        // Finally we can launch our kernel, but CPU and GPU
        // use different copies of "elem".
        kernel<<< ... >>>(g_elem);
    }
The code required WITH unified memory

    void launch(dataElem *elem) {
        kernel<<< ... >>>(elem);
    }

What remains the same:
  Data movement.
  The GPU accesses a local copy of text.
What has changed:
  The programmer sees a single pointer.
  CPU and GPU both reference the same object.
  There is coherence.
  To choose pass-by-reference vs. pass-by-value you need to use C++.
[Figure: a single dataElem object in unified memory, referenced by both CPU and GPU.]

Another example: linked lists

[Figure: a linked list of (key, value, next) nodes in CPU memory; all GPU accesses go through the PCI-express bus.]

Linked lists are almost impossible to manage in the original CUDA API. The best you can do is use pinned memory:
  Pointers are global, just as unified memory pointers are.
  Performance is low: the GPU is limited by PCI-e bandwidth.
  GPU latency is very high, which matters particularly for linked lists because of the intrinsic pointer chasing.

Linked lists with unified memory

[Figure: the same list now lives in unified memory, reachable from both CPU and GPU.]

List elements can be passed between CPU & GPU:
  No need to move data back and forth between CPU and GPU.
  Elements can be inserted and deleted from the CPU or the GPU.
  But the program must still ensure there are no race conditions (data is coherent between CPU & GPU at kernel launch only).
A minimal sketch follows.
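A minimal sketch of such a shared list, assuming a toy Node type and a single-thread search kernel:

    #include <cstdio>

    struct Node { int key, value; Node *next; };

    __global__ void find(Node *head, int key, int *out) {
        *out = -1;
        for (Node *p = head; p != NULL; p = p->next)  // the GPU chases CPU-built pointers
            if (p->key == key) { *out = p->value; return; }
    }

    int main() {
        Node *head = NULL;
        for (int i = 0; i < 3; i++) {                 // the CPU builds the list in unified memory
            Node *n;
            cudaMallocManaged(&n, sizeof(Node));
            n->key = i; n->value = 10 * i; n->next = head;
            head = n;
        }
        int *out;
        cudaMallocManaged(&out, sizeof(int));
        find<<<1, 1>>>(head, 2, out);
        cudaDeviceSynchronize();                      // coherence point before the CPU reads
        printf("value for key 2: %d\n", *out);        // prints 20
        return 0;
    }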
Unified memory: Summary

Drop-in replacement for cudaMalloc(), using cudaMallocManaged():
  cudaMemcpy() is now optional.
  Greatly simplifies code porting.
  Less host-side memory management.
Enables shared data structures between CPU & GPU:
  A single pointer to data: no change to data structures.
  Powerful for high-level languages like C++.
Unified memory: Future developments

CUDA 6.0 (now):
  Single pointer to data. No cudaMemcpy() is required.
  Coherence at launch & synchronize.
  Shared C/C++ data structures.
CUDA 6 (on the way):
  Prefetching mechanisms.
  Migration hints.
  Additional OS support.
CUDA 6 (for Maxwell):
  System allocator unified.
  Stack memory unified.
  Hardware-accelerated coherence.
