
CUDA 6.0
PRACE School
BSC / UPC, Barcelona, June 2-6, 2014

Acknowledgements

To the great Nvidia people, for sharing with me ideas, material, figures, presentations... Particularly, for this presentation:
  Mark Ebersole (webinars and slides):
    CUDA 6.0 overview.
    Optimizations for Kepler.
  Mark Harris (SC13 talk, webinar and Parallel Forall blog):
    CUDA 6.0 announcements.
    New hardware features in Maxwell.

Manuel Ujaldón
Associate Professor, Univ. of Malaga (Spain)
Conjoint Senior Lecturer, Univ. of Newcastle (Australia)
CUDA Fellow

Talk contents [40 slides]

1. The evolution of CUDA [7 slides]
2. CUDA 6.0 support [5]
3. Compiling and linking (CUDA 5.0 only) [3]
4. Dynamic parallelism (CUDA 5 & 6) [6]
5. New tools for development, debugging and optimization (CUDA 5 & 6) [1]
6. GPUDirect-RDMA (CUDA 5 & 6) [4]
7. Unified memory (CUDA 6.0 only) [13]
I. The evolution of CUDA
The impressive evolution of CUDA

                     Year 2008     Year 2014
CUDA-capable GPUs    100,000,000   500,000,000
CUDA downloads       150,000       2,100,000
Supercomputers       1             52
University courses   60            780
Academic papers      4,000         40,000

The CUDA software is downloaded once every minute.

Worldwide distribution of CUDA university courses: [world map figure]

Summary of GPU evolution

2001: First many-cores (vertex and pixel processors).
2003: Those processors become programmable (with Cg).
2006: Vertex and pixel processors unify.
2007: CUDA emerges.
2008: Double-precision floating-point arithmetic.
2010: Operands are IEEE-normalized and memory is ECC.
2012: Wider support for irregular computing.
2014: The CPU-GPU memory space is unified.
Still pending: reliability in clusters and connection to disk.

The CUDA family picture: [figure]
CUDA 5.0 highlights

Dynamic parallelism:
  Spawn new parallel work from within GPU code (from GK110 on).
GPU object linking:
  Libraries and plug-ins for GPU code.
New Nsight Eclipse Edition:
  Develop, debug and optimize... all in one tool!
GPUDirect:
  RDMA between GPUs and PCI-express devices.

CUDA 5.5 highlights

Optimized for MPI applications:
  Enhanced Hyper-Q support for multiple MPI processes via the new Multi-Process Service (MPS) on Linux systems.
  MPI workload prioritization enabled by CUDA stream prioritization.
  Multi-process MPI debugging & profiling.
Support for ARM platforms:
  Native compilation, for easy application porting.
  Fast cross-compilation on x86 for large applications.
Developer tools:
  Step-by-step guidance for Visual Profiler and Nsight Eclipse Edition.
  Single-GPU debugging for Linux.
  Static CUDA runtime library.

CUDA 6.0 highlights

Unified memory:
  CPU and GPU can share data without much programming effort.
Extended Library Interface (XT) and drop-in libraries:
  Libraries much easier to use.
GPUDirect RDMA:
  A key achievement in multi-GPU environments.
Developer tools:
  Visual Profiler enhanced with:
    Side-by-side source and disassembly view.
    New analysis passes (per-SM activity level) that generate a kernel analysis report.
  Multi-Process Server (MPS) support in nvprof and cuda-memcheck.
  Nsight Eclipse Edition supports remote development (x86 and ARM).

II. CUDA 6.0 support
(operating systems and platforms)
Operating systems and platforms (depending on OS)

CUDA 6 Production Release: https://developer.nvidia.com/cuda-downloads
Windows:
  XP, Vista, 7, 8, 8.1, Server 2008 R2, Server 2012.
  Visual Studio 2008, 2010, 2012, 2012 Express.
Linux:
  Fedora 19.
  RHEL & CentOS 5, 6.
  OpenSUSE 12.3.
  SUSE SLES 11 SP2, SP3.
  Ubuntu 12.04 LTS (including ARM cross and native), 13.04.
  ICC 13.1.
Mac:
  OSX 10.8, 10.9.

GPUs for CUDA 6.0

CUDA Compute Capability 3.0 (sm_30; 2012 versions of Kepler like Tesla K10, GK104):
  Does not support dynamic parallelism nor Hyper-Q.
  Supports unified memory with a separate pool of shared data with auto-migration (a subset of the memory, which has many limitations).
CUDA Compute Capability 3.5 (sm_35; 2013 and 2014 versions of Kepler like Tesla K20, K20X and K40, GK110):
  Supports dynamic parallelism and Hyper-Q.
  Supports unified memory, with restrictions similar to CCC 3.0.
CUDA Compute Capability 5.0 (sm_50; 2014 versions of Maxwell like GeForce GTX750Ti, GM107-GM108):
  Full support of dynamic parallelism, Hyper-Q and unified memory.

Deprecations

Things that tend to be obsolete:
  Still supported.
  Not recommended.
  New developments may not work with them.
  Likely to be dropped in the future.
Some examples:
  32-bit applications on x86 Linux (toolkit & driver).
  32-bit applications on Mac (toolkit & driver).
  G80 platform / sm_10 (toolkit).
Dropped support

cuSPARSE legacy API.
Ubuntu 10.04 LTS (toolkit & driver).
SUSE Linux Enterprise Server 11 SP1 (toolkit & driver).
Mac OSX 10.7 (toolkit & driver).
Mac models with the MCP79 chipset (driver):
  iMac: 20-inch (early 09), 24-inch (early 09), 21.5-inch (late 09).
  MacBook Pro: 15-inch (late 08), 17-inch (early 09), 17-inch (mid 09), 15-inch (mid 09), 15-inch 2.53 GHz (mid 09), 13-inch (mid 09).
  Mac mini: early 09, late 09.
  MacBook Air: late 08, mid 09.

III. Compiling and linking

CUDA 4.0: Whole-program compilation and linking

CUDA 4 required a single source file for a single kernel. It was not possible to link external device code: all files had to be included together to build one whole program.
[Figure: include files together to build the executable.]

CUDA 5.0: Separate compilation & linking

Now it is possible to compile and link each file separately. That way, we can build multiple object files independently, which can later be linked to build the executable file.
We can also combine object files into static libraries, which can be shared between different source files when linking:
  To facilitate code reuse.
  To reduce the compilation time.
This also enables closed-source device libraries to call user-defined device callback functions. Both workflows are sketched below.
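For illustration, a minimal sketch of separate compilation, assuming two hypothetical source files kernels.cu and main.cu, where main.cu calls a __global__ function defined in kernels.cu:

    // kernels.cu (hypothetical)
    __global__ void scale(float *v, float a) { v[threadIdx.x] *= a; }

    // main.cu (hypothetical)
    __global__ void scale(float *v, float a);   // declaration; device code linked in later

    int main() {
        float *v;
        cudaMalloc(&v, 32 * sizeof(float));
        scale<<<1, 32>>>(v, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(v);
        return 0;
    }

    // Compile each file separately with relocatable device code (-dc), then link:
    //   nvcc -arch=sm_35 -dc kernels.cu main.cu
    //   nvcc -arch=sm_35 kernels.o main.o -o app
    // Or archive the device objects into a static library first:
    //   nvcc -lib kernels.o -o libkernels.a
    //   nvcc -arch=sm_35 main.o libkernels.a -o app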

IV. Dynamic parallelism in CUDA 5 & 6



Dynamic parallelism allows CUDA 5.0 to improve three primary issues:

Execution:
  Data-dependent execution.
  Recursive parallel algorithms.
Performance:
  Dynamic load balancing.
  Thread scheduling to help fill the GPU.
Programmability:
  Library calls from GPU kernels.
  Simplify the CPU/GPU division of work.

Familiar syntax and programming model:

    int main() {
        float *data;
        setup(data);
        A <<< ... >>> (data);
        B <<< ... >>> (data);
        C <<< ... >>> (data);
        cudaDeviceSynchronize();
        return 0;
    }

    __global__ void B(float *data) {
        do_stuff(data);
        X <<< ... >>> (data);
        Y <<< ... >>> (data);
        Z <<< ... >>> (data);
        cudaDeviceSynchronize();
        do_more_stuff(data);
    }

The CPU launches kernels A, B and C; kernel B then launches grids X, Y and Z directly from the GPU.
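As a compilable counterpart to the sketch above (hypothetical kernel names; requires a CCC 3.5+ GPU):

    __global__ void child(float *data) {
        data[threadIdx.x] += 1.0f;          // child grid works on the same data
    }

    __global__ void parent(float *data) {
        child<<<1, 32>>>(data);             // launched from GPU code, no CPU round-trip
        cudaDeviceSynchronize();            // device-side wait for the child grid
    }

    int main() {
        float *data;
        cudaMalloc(&data, 32 * sizeof(float));
        cudaMemset(data, 0, 32 * sizeof(float));
        parent<<<1, 1>>>(data);             // the CPU only launches the parent
        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }

    // Build with relocatable device code and the device runtime library:
    //   nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt -o dynpar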
Before CUDA 6.0: Tight limit on the Pending Launch Buffer (PLB)

Applications using dynamic parallelism can launch too many grids and exhaust the pre-allocated pending launch buffer (PLB).
This results in launch failures, sometimes intermittent due to scheduling.
PLB size tuning can fix the problem, but often involves trial-and-error.
[Figure: a finite pending launch buffer overflows, causing out-of-memory failures with too many concurrent launches.]

CUDA 6.0 uses an extended PLB (EPLB)

EPLB guarantees that all launches succeed by falling back to a lower-performance virtualized launch buffer when the fast PLB is full.
No more launch failures, regardless of scheduling.
PLB size tuning provides a direct path to performance improvement.
Enabled by default.
[Figure: the finite pending launch buffer is backed by a virtualized extended PLB.]
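The PLB size mentioned above is exposed through the device runtime limits. A minimal sketch (the call and limit name are from the CUDA runtime API; the chosen value is an arbitrary example):

    #include <cstdio>

    int main() {
        // Enlarge the fast pending launch buffer before using dynamic parallelism.
        cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);

        size_t plb = 0;
        cudaDeviceGetLimit(&plb, cudaLimitDevRuntimePendingLaunchCount);
        printf("PLB capacity: %zu pending grids\n", plb);
        return 0;
    }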

CUDA 6.0: Performance improvements in key use cases

Kernel launch.
Repeated launch of the same set of kernels.
cudaDeviceSynchronize().
Back-to-back grids in a stream.

[Chart: performance improvements on dynamic parallelism. Back-to-back launch and launch-and-synchronize times (microseconds) fall steadily from CUDA 5 to CUDA 5.5 to CUDA 6, down to roughly 10.6 and 9.1 microseconds in CUDA 6.]
V. New tools for development, debugging and optimization

New features in Nvidia Nsight, Eclipse Edition (also available for Linux and Mac OS)

CUDA-aware editor:
  Automated CPU-to-GPU code refactoring.
  Semantic highlighting of CUDA code.
  Integrated code samples & docs.
Nsight debugger:
  Simultaneous debugging of CPU and GPU code.
  Inspect variables across CUDA threads.
  Breakpoints & single-step debugging.
Nsight profiler:
  Quickly identifies bottlenecks at the source-line level, using a unified CPU-GPU trace.
  Integrated expert system.
  Fast edit-build-profile optimization cycle.

VI. GPUDirect

Communication among GPU memories

GPUDirect 1.0 was released with Fermi to allow communications among GPUs within CPU clusters.
[Figure: sender and receiver data paths through host memory.]

Kepler + CUDA 5 support GPUDirect-RDMA (Remote Direct Memory Access)

This allows a more direct transfer path between GPUs.
Usually, the link is PCI-express or InfiniBand.

GPUDirect-RDMA in Maxwell

The situation is more complex in CUDA 6.0 with unified memory.

Preliminary results using GPUDirect-RDMA (better performance ahead with CUDA 6.0 & OpenMPI)

Inter-node GPU-GPU latency, measured using:
  Tesla K40m GPUs (no GeForces).
  MPI MVAPICH2 library.
  ConnectX-3 adapters, IVB 3 GHz.
[Chart: GPU-GPU latency (microseconds) vs. message size (bytes).]

Better MPI application scaling:
  Code: HSG (bioinformatics).
  2 GPU nodes, 4 MPI processes on each node.
[Chart: total execution time (seconds).]
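For reference, this is the kind of code such benchmarks exercise: with a CUDA-aware MPI build (e.g., MVAPICH2 with GPUDirect-RDMA enabled), device pointers are passed straight to MPI calls. A hedged sketch, not the actual benchmark code:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *d_buf;                      // device memory; no host staging buffer
        cudaMalloc(&d_buf, 1024 * sizeof(float));

        if (rank == 0)                     // RDMA can move GPU data directly to the wire
            MPI_Send(d_buf, 1024, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, 1024, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }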
VII. Unified memory
The idea

[Figure: today, the CPU uses dual-, tri- or quad-channel DDR3 main memory (~100 GB/s) and the GPU uses 256-, 320- or 384-bit GDDR5 video memory (~300 GB/s), joined by PCI-express (~10 GB/s). From Kepler on, CPU and GPU see a single unified memory.]

Unified memory contributions

Simpler programming and memory model:
  Single pointer to data, accessible anywhere.
  Eliminates the need for cudaMemcpy().
  Greatly simplifies code porting.
Performance through data locality:
  Data migrates to the accessing processor.
  Global coherency is guaranteed.
  Hand tuning with cudaMemcpyAsync() is still allowed.

System requirements

GPU: Kepler (GK10x+) or Maxwell (GM10x+); limited performance on CCC 3.0 and CCC 3.5.
Operating system: 64 bits.
Windows: 7 or 8 (WDDM & TCC); no XP/Vista.
Linux: kernel 2.6.18+; all CUDA-supported distros, but not ARM.
Linux on ARM: ARM64 only.
Mac OSX: not supported in CUDA 6.0.

CUDA memory types

                      Zero-copy (pinned memory)  Unified virtual addressing  Unified memory
CUDA call             cudaMallocHost(&A, 4);     cudaMalloc(&A, 4);          cudaMallocManaged(&A, 4);
Allocation fixed in   Main memory (DDR3)         Video memory (GDDR5)        Both
Local access for      CPU                        Home GPU                    CPU and home GPU
PCI-e access for      All GPUs                   Other GPUs                  Other GPUs
Other features        Avoids swapping to disk    No CPU access               On-access CPU/GPU migration
Coherency             At all times               Between GPUs                Only at launch & sync
Full support since    CUDA 2.2                   CUDA 1.0                    CUDA 6.0
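The three columns of the table correspond to three allocation calls. A minimal side-by-side sketch (the buffer size is an assumed example):

    #include <cuda_runtime.h>

    int main() {
        const int N = 1024;                          // assumed element count
        float *zc, *dev, *um;
        cudaMallocHost(&zc, N * sizeof(float));      // zero-copy: pinned in main memory (DDR3)
        cudaMalloc(&dev, N * sizeof(float));         // fixed in video memory (GDDR5); no CPU access
        cudaMallocManaged(&um, N * sizeof(float));   // unified memory: one pointer for CPU and GPU

        // ... use the buffers ...

        cudaFreeHost(zc);
        cudaFree(dev);
        cudaFree(um);
        return 0;
    }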
Additions to the CUDA API

New call: cudaMallocManaged().
  Drop-in replacement for cudaMalloc(); allocates managed memory.
  Returns a pointer accessible from both host and device.
New call: cudaStreamAttachMemAsync().
  Manages concurrency in multi-threaded CPU applications.
New keyword: __managed__.
  Global variable annotation that combines with __device__.
  Declares a global-scope migratable device variable.
  The symbol is accessible from both GPU and CPU code.

A preliminary example: sorting elements from a file

CPU code in C:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *) malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

GPU code in CUDA 6.0 (here qsort is assumed to be a GPU sorting kernel):

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);
        fread(data, 1, N, fp);
        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();
        use_data(data);
        cudaFree(data);
    }
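Putting cudaMallocManaged() and __managed__ together, a minimal runnable sketch (hypothetical kernel and sizes; requires CCC 3.0+ and a 64-bit OS):

    #include <cstdio>

    __managed__ int sum = 0;                        // global-scope migratable variable

    __global__ void accumulate(int *data, int n) {
        for (int i = 0; i < n; i++)                 // single thread, just to show the data path
            sum += data[i];
    }

    int main() {
        int *data;
        cudaMallocManaged(&data, 4 * sizeof(int));  // one pointer, valid on CPU and GPU
        for (int i = 0; i < 4; i++)
            data[i] = i + 1;                        // the CPU writes directly, no cudaMemcpy()
        accumulate<<<1, 1>>>(data, 4);
        cudaDeviceSynchronize();                    // mandatory before the CPU touches the data again
        printf("sum = %d\n", sum);                  // the CPU reads the managed variable
        cudaFree(data);
        return 0;
    }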

Before unified memory

    struct dataElem {
        int prop1;
        int prop2;
        char *text;
    };

A deep copy is required: we must copy the structure and everything that it points to. This is why C++ invented the copy constructor.
CPU and GPU cannot share a copy of the data (coherency). There are two addresses and two copies of the data, which prevents memcpy-style comparisons, checksumming and other things.
[Figure: CPU memory and GPU memory each hold their own copy of dataElem and of the text "Hello, world".]

The code required without unified memory

    void launch(dataElem *elem) {
        dataElem *g_elem;
        char *g_text;
        int textlen = strlen(elem->text);

        // Allocate storage for struct and text.
        cudaMalloc(&g_elem, sizeof(dataElem));
        cudaMalloc(&g_text, textlen);

        // Copy up each piece separately, including the new text pointer value.
        cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
        cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
        cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);

        // Finally we can launch our kernel, but CPU and GPU
        // use different copies of "elem".
        kernel<<< ... >>>(g_elem);
    }
The code required WITH unified memory

    void launch(dataElem *elem) {
        kernel<<< ... >>>(elem);
    }

What remains the same:
  Data movement.
  The GPU accesses a local copy of text.
What has changed:
  The programmer sees a single pointer.
  CPU and GPU both reference the same object.
  There is coherence.
  To choose pass-by-reference vs. pass-by-value you need to use C++.
[Figure: a single dataElem object in unified memory, referenced by both CPU and GPU.]

Another example: linked lists

[Figure: a linked list of (key, value, next) nodes in CPU memory; all GPU accesses go through the PCI-express bus.]

Linked lists are almost impossible to manage in the original CUDA API. The best you can do is use pinned memory:
  Pointers are global, just as unified memory pointers are.
  Performance is low: the GPU is limited by PCI-e bandwidth.
  GPU latency is very high, which matters particularly for linked lists because of the intrinsic pointer chasing.

Linked lists with unified memory

[Figure: the same list now lives in unified memory, reachable from both CPU and GPU.]

List elements can be passed between CPU & GPU:
  No need to move data back and forth between CPU and GPU.
  Elements can be inserted and deleted from the CPU or the GPU.
  But the program must still ensure there are no race conditions (data is coherent between CPU & GPU at kernel launch only).
A minimal sketch follows.
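A minimal sketch of such a shared list, assuming a toy Node type and a single-thread search kernel:

    #include <cstdio>

    struct Node { int key, value; Node *next; };

    __global__ void find(Node *head, int key, int *out) {
        *out = -1;
        for (Node *p = head; p != NULL; p = p->next)  // the GPU chases CPU-built pointers
            if (p->key == key) { *out = p->value; return; }
    }

    int main() {
        Node *head = NULL;
        for (int i = 0; i < 3; i++) {                 // the CPU builds the list in unified memory
            Node *n;
            cudaMallocManaged(&n, sizeof(Node));
            n->key = i; n->value = 10 * i; n->next = head;
            head = n;
        }
        int *out;
        cudaMallocManaged(&out, sizeof(int));
        find<<<1, 1>>>(head, 2, out);
        cudaDeviceSynchronize();                      // coherence point before the CPU reads
        printf("value for key 2: %d\n", *out);        // prints 20
        return 0;
    }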
Unified memory: Summary

Drop-in replacement for cudaMalloc(), using cudaMallocManaged():
  cudaMemcpy() is now optional.
  Greatly simplifies code porting.
  Less host-side memory management.
Enables shared data structures between CPU & GPU:
  A single pointer to data: no change to data structures.
  Powerful for high-level languages like C++.
Unified memory: Future developments

CUDA 6.0 (now):
  Single pointer to data. No cudaMemcpy() is required.
  Coherence at launch & synchronize.
  Shared C/C++ data structures.
CUDA 6 (on the way):
  Prefetching mechanisms.
  Migration hints.
  Additional OS support.
CUDA 6 (for Maxwell):
  System allocator unified.
  Stack memory unified.
  Hardware-accelerated coherence.
