PRACE School
BSC, Barcelona UPC, June 2-6, 2014
Manuel Ujaldón
Associate Professor, Univ. of Malaga (Spain)
Conjoint Senior Lecturer, Univ. of Newcastle (Australia)
CUDA Fellow

Acknowledgements
CUDA growth:
- CUDA downloads: 150,000 -> 2,100,000
- Supercomputers: 1 -> 52
- University courses: 60 -> 780
- Academic papers: 4,000 -> 40,000
CUDA 5.0 highlights:
- GPUDirect RDMA: a key achievement in multi-GPU environments.

CUDA 5.5 highlights:
- Developer tools: Visual Profiler enhanced with:
  - Side-by-side source and disassembly view.
  - New analysis passes (per-SM activity level), generating a kernel analysis report.
- Multi-Process Server (MPS) support in nvprof and cuda-memcheck.
- Nsight Eclipse Edition supports remote development (x86 and ARM).

CUDA 6.0 highlights:
- Unified Memory: CPU and GPU can share data without much programming effort.
- Extended Library Interface (XT) and Drop-in Libraries: libraries are much easier to use.

II. CUDA 6.0 support (operating systems and platforms)
CUDA 6 Production Release: https://developer.nvidia.com/cuda-downloads

Operating systems (platforms depend on the OS):
- Windows: XP, Vista, 7, 8, 8.1, Server 2008 R2, Server 2012, with Visual Studio 2008, 2010, 2012, 2012 Express.
- Linux: Fedora 19; RHEL & CentOS 5, 6; OpenSUSE 12.3; SUSE SLES 11 SP2, SP3; Ubuntu 12.04 LTS (including ARM cross and native), 13.04; ICC 13.1.
- Mac: OS X 10.8, 10.9.
GPUs for CUDA 6.0:
- CUDA Compute Capability 3.0 (sm_30, 2012 versions of Kepler like Tesla K10, GK104): still supported. Do not support dynamic parallelism nor Hyper-Q. Support unified memory with a separate pool of shared data with auto-migration (a subset of the memory, which has many limitations).
- CUDA Compute Capability 3.5 (sm_35, 2013 and 2014 versions of Kepler like Tesla K20, K20X and K40, GK110): support dynamic parallelism and Hyper-Q. Support unified memory, with restrictions similar to CCC 3.0.
- CUDA Compute Capability 5.0 (sm_50, 2014 versions of Maxwell like GeForce GTX 750 Ti, GM107-GM108): full support of dynamic parallelism, Hyper-Q and unified memory.

Deprecations. Things that tend to be obsolete:
- Not recommended.
- New developments may not work with them.
- Likely to be dropped in the future.
Some examples:
- 32-bit applications on x86 Linux (toolkit & driver).
- 32-bit applications on Mac (toolkit & driver).
- G80 platform / sm_10 (toolkit).
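These compute-capability tiers can be checked at run time; a minimal sketch using the standard CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties (the sm_35 gating shown is this document's summary of the feature table above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // sm_35 and above: dynamic parallelism and Hyper-Q are available.
        bool dynPar = (prop.major > 3) || (prop.major == 3 && prop.minor >= 5);
        printf("Device %d: %s, CC %d.%d, dynamic parallelism: %s\n",
               d, prop.name, prop.major, prop.minor, dynPar ? "yes" : "no");
    }
    return 0;
}
```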
Dropped support
CUDA 5.0: Separate Compilation & Linking

[Figure: a finite Pending Launch Buffer fails with out-of-memory when there are too many concurrent launches, vs. a virtualized, extended Pending Launch Buffer (PLB).]
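The separate compilation flow introduced in CUDA 5.0 can be sketched with two nvcc invocations (the file names are hypothetical; -dc compiles to objects with relocatable device code so device symbols can be resolved across files):

```shell
# Hypothetical two-file project: kernels in a.cu call device functions in b.cu.
nvcc -arch=sm_35 -dc a.cu b.cu
# nvcc performs the device link and the host link in one step:
nvcc -arch=sm_35 a.o b.o -o app
```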
[Chart: comparison across CUDA 5, CUDA 5.5 and CUDA 6.]
New features in NVIDIA Nsight, Eclipse Edition, also available for Linux and Mac OS
Additions to the CUDA API:
- New call: cudaMallocManaged(). Drop-in replacement for cudaMalloc() that allocates managed memory. Returns a pointer accessible from both host and device.
- New call: cudaStreamAttachMemAsync(). Manages concurrency in multi-threaded CPU applications.
- New keyword: __managed__. Global variable annotation that combines with __device__. Declares a global-scope migratable device variable. The symbol is accessible from both GPU and CPU code.

A preliminary example: sorting elements from a file.

CPU code in C:

    void sortfile(FILE *fp, int N) {
      char *data;
      data = (char *) malloc(N);

      fread(data, 1, N, fp);
      qsort(data, N, 1, compare);

      use_data(data);

      free(data);
    }

GPU code in CUDA 6.0:

    void sortfile(FILE *fp, int N) {
      char *data;
      cudaMallocManaged(&data, N);

      fread(data, 1, N, fp);
      qsort<<<...>>>(data, N, 1, compare);
      cudaDeviceSynchronize();

      use_data(data);

      cudaFree(data);
    }
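The __managed__ keyword described above can be sketched as follows (the accumulate kernel and the array size are illustrative, not from the slides):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Global-scope managed variable: one symbol, visible to CPU and GPU code.
__device__ __managed__ int sum = 0;

__global__ void accumulate(const int *data, int n) {
    // Illustrative single-thread kernel: add n integers into 'sum'.
    for (int i = 0; i < n; ++i) sum += data[i];
}

int main() {
    int *data;
    cudaMallocManaged(&data, 4 * sizeof(int));   // visible to host and device
    for (int i = 0; i < 4; ++i) data[i] = i + 1; // CPU writes directly
    accumulate<<<1, 1>>>(data, 4);
    cudaDeviceSynchronize();                     // required before the CPU reads
    printf("sum = %d\n", sum);                   // CPU reads the managed symbol
    cudaFree(data);
    return 0;
}
```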
A deep copy is required: we must copy the structure and everything that it points to. This is why C++ invented the copy constructor, among other things.

[Figure: two addresses and two copies of the data; *text points to "Hello, world" on both the CPU and the GPU side.]

    // Allocate storage for struct and text
    cudaMalloc(&g_elem, sizeof(dataElem));
    cudaMalloc(&g_text, textlen);

    // Copy up each piece separately, including new text pointer value
    cudaMemcpy(g_elem, elem, sizeof(dataElem));
    kernel<<< ... >>>(g_elem);
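With unified memory, the two-allocation, two-copy pattern above collapses into managed allocations that both sides can dereference; a hedged sketch, assuming a dataElem layout with a char * member as suggested by the figure:

```cuda
#include <cstring>
#include <cuda_runtime.h>

struct dataElem {
    int prop1;
    char *text;   // nested pointer that previously forced a deep copy
};

__global__ void kernel(dataElem *elem) {
    // The kernel can read elem->text directly: same pointer on both sides.
}

int main() {
    dataElem *elem;
    char *text;
    // One address per allocation, valid on both CPU and GPU:
    cudaMallocManaged(&elem, sizeof(dataElem));
    cudaMallocManaged(&text, 32);
    strcpy(text, "Hello, world");
    elem->text = text;          // pointer is valid on the device as-is
    kernel<<<1, 1>>>(elem);     // no cudaMemcpy, no pointer fix-up
    cudaDeviceSynchronize();
    cudaFree(text);
    cudaFree(elem);
    return 0;
}
```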
The code required WITH unified memory

Another example: linked lists

[Fragments: coherence @ launch & synchronize; migration hints; stack memory unified.]
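For the linked-list case, managed memory lets the CPU build the list and the GPU follow the very same next pointers; a sketch under the same assumptions (the Node layout and sumList kernel are hypothetical):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Node {
    int value;
    Node *next;   // device-valid without translation under unified memory
};

__global__ void sumList(const Node *head, int *out) {
    // Single-thread traversal of a CPU-built list.
    int s = 0;
    for (const Node *p = head; p != nullptr; p = p->next) s += p->value;
    *out = s;
}

int main() {
    // Build a three-node list entirely in managed memory on the CPU.
    Node *head = nullptr;
    for (int v = 3; v >= 1; --v) {
        Node *n;
        cudaMallocManaged(&n, sizeof(Node));
        n->value = v;
        n->next = head;
        head = n;
    }
    int *out;
    cudaMallocManaged(&out, sizeof(int));
    sumList<<<1, 1>>>(head, out);   // GPU walks the same pointers
    cudaDeviceSynchronize();
    printf("sum = %d\n", *out);     // CPU reads the result directly
    return 0;
}
```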