Glib Ivashkevych
For reference
GFLOPs – 10^9 FLoating-point Operations Per Second
∼ 55 GFLOPs on Intel Core i7 Nehalem 975 (according to
Intel)
∼ 125 GFLOPs on AMD Opteron Istanbul 2435
∼ 500 GFLOPs in double and ∼ 2 TFLOPs in single precision
on Nvidia Tesla C2050
∼ 3.2 · 10^3 GFLOPs on ASCI Red at Sandia National
Laboratories – the fastest supercomputer as of November 1999
∼ 87 · 10^3 GFLOPs on TSUBAME-1 Grid Cluster at Tokyo
Institute of Technology – first GPU-based supercomputer, №88
in Top500 as of November 2010 (№56 in November 2009)
∼ 2.56 · 10^6 GFLOPs on Tianhe-1-A at the National
Supercomputing Center in Tianjin – the fastest
supercomputer as of November 2010 – GPU-based
Computations on GPU: a road towards desktop supercomputing
Examples
∼ 8x in double precision
∼ 4 − 6x speed–up
Notes:
CUBLAS – CUDA-accelerated Basic Linear Algebra Subprograms
MKL – Math Kernel Library
LAPACK for heterogeneous systems
∼ 9x in double precision
∼ 20x in single precision
[1] A.S. Antoniou et al., American Institute of Aeronautics and Astronautics
Paper 2010-0525
Quantum chemistry
Calculations of molecular orbitals [1]
Nvidia GeForce GTX 280 vs Intel Core2Quad Q6600 (2.4GHz)
∼ 173x over serial non-optimized code
∼ 14x over parallel optimized code (4 threads)
[1] D.J. Hardy et al., GPGPU 2009
Medical Imaging
Isosurface reconstruction from scalar volumetric data [1]
Nvidia GeForce GTX 285 vs ?
∼ 68x over optimized CPU code
nearly real-time processing of data
[1] T. Kalbe et al., Proceedings of the 5th International Symposium on Visual
Computing (ISVC 2009)
GPUGrid.net
Biomolecular simulations
accelerated by Nvidia CUDA boards and Sony PlayStation
∼ 8000 users from 101 countries
∼ 145 TFLOPs on average ≈ №25 in Top500
∼ 50 GFLOPs from every active user
[1] P.J. Clark et al., Processing Petabytes per Second with the ATLAS
Experiment at the LHC (GTC 2010)
VFX, rendering
image editing, video
Outline
1 History
2 CUDA: architecture overview and programming model
3 Threads and memories hierarchy
4 Toolbox
5 PyCUDA&PyOpenCL
6 EnSPy functionality
7 EnSPy architecture
8 Example: D5 potential
9 Example: Hill problem
10 Example: Hill problem, N–body version
11 Performance results
12 GPU computing prospects
History
History in brief:
GPGPU in 2001-2006:
through graphics API (OpenGL or DirectX)
extremely hard
only in single precision
GPGPU today:
straightforward
easy
in double precision.
CUDA: architecture overview and programming model
consists of multiprocessors (MPs)
each MP has:
8 stream processors
1 unit for double precision operations
shared memory
global memory
each MP has:
32 stream processors (SPs)
4 SFUs (Special Function Units)
each SP has:
1 FP unit & 1 INT unit
// Kernel definition
__global__ void SomeKernel(float *A, float *B, float *C)
{
    // Some math: each thread processes one element
    int i = threadIdx.x;
    C[i] = DeviceFunction(A[i], B[i]);  // DeviceFunction: any __device__ function
}

// Host code
int main()
{
    // Kernel invocation: 1 block of N threads
    // (A, B and C must point to device memory)
    SomeKernel<<<1, N>>>(A, B, C);
}
Execution model
Scalability
underlying hardware architecture is hidden
threads can synchronize only within a thread block (a block runs on a single MP)
Threads and memories hierarchy
Single threads
each thread has private local memory
are identified by the built-in variable threadIdx (uint3 type)
int idx = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;
Thread blocks
each block has shared memory visible to all threads within the block
are identified by the built-in variable blockIdx (uint3 type)
int b_idx = blockIdx.x + blockIdx.y * gridDim.x;
// Prepare the kernel launch
int threadsPerBlock = 256;
// Round up so that all N elements are covered by a thread
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
// Copy the result from device memory to host memory
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}
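The launch configuration above can be checked on the CPU. The following is a minimal sketch in plain Python, not CUDA code: it emulates the rounded-up block count and the usual global index `blockIdx.x * blockDim.x + threadIdx.x` for a one-dimensional launch. The function names are hypothetical and chosen only for illustration.

```python
def blocks_per_grid(n, threads_per_block):
    """Ceiling division: the smallest number of blocks that covers n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def global_indices(n, threads_per_block):
    """Emulate which array element each (block, thread) pair would handle."""
    handled = []
    for block_idx in range(blocks_per_grid(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            # Same formula as blockIdx.x * blockDim.x + threadIdx.x in a kernel
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # bounds guard: the last block may be only partially used
                handled.append(i)
    return handled


if __name__ == "__main__":
    n = 1000
    print(blocks_per_grid(n, 256))
    print(global_indices(n, 256) == list(range(n)))
```

The bounds guard matters because the block count is rounded up: without it, threads of the last block would index past the end of the arrays.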
Toolbox
Start–up tools
drivers
CUDA Toolkit
nvcc compiler, runtime library, header files, CUBLAS, CUFFT,
Visual Profiler etc.
CUDA SDK
examples, Occupancy Calculator etc.
Free download at
http://developer.nvidia.com/object/cuda_2_3_downloads.html
[1] Supported distros in CUDA 3.2: Fedora 13, RH Enterprise 4.8 & 5.5,
OpenSUSE 11.2, SLED 11.0, Ubuntu 10.04
Developer tools
CUDA-gdb
integration into gdb
CUDA C support
works on all 32/64–bit Linux distros
breakpoints and single step execution
PyCUDA&PyOpenCL
Python
easy to learn
dynamically typed
rich built–in functionality
interpreted
very well documented
has a large and active community
Scientific tools:
Scipy – modeling and simulation
Fourier transforms
ODE
Optimization
scipy.weave.inline – C inlining with little or no overhead
···
NumPy – arrays
flexible array creation routines
sorting, random sampling and statistics
···
PyCUDA
provides complete access to CUDA features
automatically manages resources
error handling and translation to Python exceptions
convenient abstractions: GPUArray
metaprogramming: dynamic creation of CUDA source code
interactive!
PyOpenCL is pretty much the same in concept, but not only for
Nvidia GPUs:
also for ATI/AMD cards, AMD & Intel processors etc. (IBM Cell?)
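The metaprogramming point can be illustrated without a GPU. The sketch below only builds CUDA source text from a template using the Python standard library; the template, the kernel name and the helper `make_kernel_source` are hypothetical and are not part of PyCUDA. With PyCUDA installed, a string like this would typically be handed to `pycuda.compiler.SourceModule` for compilation.

```python
from string import Template

# CUDA C source with placeholders filled in at run time.
KERNEL_TEMPLATE = Template("""\
__global__ void ${name}(const ${dtype} *a, const ${dtype} *b, ${dtype} *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = ${expression};
}
""")


def make_kernel_source(name, expression, dtype="float"):
    """Return CUDA source for an element-wise kernel computing `expression`."""
    return KERNEL_TEMPLATE.substitute(name=name, dtype=dtype, expression=expression)


if __name__ == "__main__":
    # Generate a double precision element-wise product kernel.
    print(make_kernel_source("vec_mul", "a[i] * b[i]", dtype="double"))
```

Generating the source at run time is what allows the precision or the formula to be chosen from Python, at the cost of a compilation step for each new variant.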
EnSPy functionality
Motivation
Combine the flexibility of Python with the efficiency of C++ → CUDA for
N-body simulations
interface of EnSPy is written in Python
core of EnSPy is written in C++
joined together by scipy.weave.inline
C++ core could be used without Python – just include header
and link with precompiled shared library
easily extensible: both through high–level Python interface
and low–level C++ core – new algorithms, initial distributions
etc.
multi–GPU parallelization
it’s easy to experiment with EnSPy!
Types of ensembles:
"Simple" ensemble – without interaction, only external potential
N-body ensemble – both external potential and gravitational
interaction between particles
Current algorithms:
4th-order Runge–Kutta for the "simple" ensemble
Hermite scheme with shared time steps for the N-body ensemble
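For the "simple" ensemble every particle evolves independently, so one integrator step can be applied to each particle in turn (on a GPU, typically one particle per thread). The sketch below is a plain Python version of the classical 4th-order Runge–Kutta step. It is not EnSPy code: the function names and the harmonic oscillator test system are assumptions made for illustration.

```python
def rk4_step(f, t, y, h):
    """Advance the state y (list of floats) by one step h for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def oscillator(t, y):
    """Hypothetical test system: harmonic oscillator with state y = [q, p]."""
    q, p = y
    return [p, -q]


def integrate_ensemble(f, states, t_end, h):
    """Integrate every non-interacting particle independently up to t_end."""
    n_steps = int(round(t_end / h))
    result = []
    for y in states:
        t = 0.0
        for _ in range(n_steps):
            y = rk4_step(f, t, y, h)
            t += h
        result.append(y)
    return result


if __name__ == "__main__":
    # Two particles with different initial conditions, integrated to t = 1.
    print(integrate_ensemble(oscillator, [[1.0, 0.0], [0.0, 1.0]], 1.0, 0.01))
```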
Drawback:
extra time for generation and compilation of new code
EnSPy architecture
Input parameters
Ensemble population (predefined or user-specified distribution)
Example: D5 potential
Overview
Problem:
Escape from potential well.
Watched values (trigger):
N(t) – number of particles, remaining in the well at time t
Potential:
$U_{D5} = 2ay^2 - x^2 + xy^2 + \frac{x^4}{4}$
[Figure: two panels with axes (x, y) and (x, px)]
Calculation setup:
"Simple" ensemble
uniform initial distribution of N = 10240 particles in
x > 0 ∩ U(x, y) < E
trigger: x = 0 → q0 = 0.
12 lines of simple Python code (examples/d5.py):
specification of integration parameters
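One way to obtain the initial distribution described above is rejection sampling. The sketch below is not the content of examples/d5.py, which is not shown here. The value a = 1, the bounding box and the function names are assumptions made for illustration, and only positions are sampled (momenta are left out).

```python
import random


def u_d5(x, y, a):
    """D5 potential: U = 2*a*y^2 - x^2 + x*y^2 + x^4/4."""
    return 2 * a * y**2 - x**2 + x * y**2 + x**4 / 4


def sample_initial_positions(n, energy, a, box=2.5, seed=0):
    """Rejection-sample n points uniformly from {x > 0} and {U(x, y) < energy}.

    Assumes the allowed region fits inside 0 < x < box, |y| < box
    (true for a = 1 and energy < 1).
    """
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        x = rng.uniform(0.0, box)
        y = rng.uniform(-box, box)
        if x > 0 and u_d5(x, y, a) < energy:
            points.append((x, y))
    return points


if __name__ == "__main__":
    pts = sample_initial_positions(10240, energy=0.1, a=1.0)
    print(len(pts))
```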
Results:
Regular particles are trapped in the well → the initial "mixed state" splits
[Figure: N(t)/N(0) versus t, curves for E = 0.1 and E = 0.9]
Example: Hill problem
Overview
Problem:
Toy model of escape from a star cluster: escape of a star from the
potential of a rotating point-mass star cluster Mc and a point-mass
galaxy core Mg ≫ Mc
Watched values (trigger):
N(t) – number of particles, remaining in cluster at time t
Potential:
$U_{Hill} = -3\omega^2 x^2 - \frac{G M_c}{r^2}$
[Figure: Hill curves in the (x, y) plane]
Calculation setup:
"Simple" ensemble
uniform initial distribution of N = 10240 particles in
|x| < r_t ∩ U(x, y) < E
ω = 1/√3 → r_t = 1
trigger: |x| − r_t = 0 → abs(q0) - 1. = 0.
12 lines of simple Python code (examples/hill_plain.py):
specification of integration parameters
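The trigger expression can be read as a function whose zero crossing marks an escape. The sketch below shows one possible way to turn it into the watched value N(t): a particle is counted as escaped when the trigger goes from negative to non-negative during a step. How EnSPy evaluates triggers internally is not stated here, so the function names and the sign-change test are assumptions.

```python
def hill_trigger(q0, r_t=1.0):
    """Trigger expression abs(q0) - 1.: zero on the tidal boundary |x| = r_t."""
    return abs(q0) - r_t


def update_remaining(x_prev, x_next, remaining):
    """Flag particles whose trigger changed sign in this step; return N(t)."""
    for i, (xp, xn) in enumerate(zip(x_prev, x_next)):
        if remaining[i] and hill_trigger(xp) < 0.0 <= hill_trigger(xn):
            remaining[i] = False  # the particle crossed |x| = r_t and has escaped
    return sum(remaining)


if __name__ == "__main__":
    remaining = [True, True, True]
    # The second particle moves from x = -0.9 to x = -1.1 and leaves the cluster.
    print(update_remaining([0.5, -0.9, 0.2], [0.7, -1.1, 0.3], remaining))
```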
Results:
Trapping of regular particles (some tricky physics here):
[Figure: N(t) versus nt for E = −1.3, E = −0.8 and E = −0.3]
Example: Hill problem, N–body version
Overview
Problem:
Simplified model of escape from a star cluster: escape of a star from the
potential of a rotating star cluster with total mass Mc and the point
potential of a galaxy core with mass Mg ≫ Mc (2D)
Watched values:
Configuration of cluster
Potential of galaxy core in cluster frame of reference (tidal
approximation):
$U_{HillNB} = -3\omega^2 x^2$
Calculation setup:
N–body ensemble
2D (z = 0) initial distribution of N = 10240 particles inside a
circle of radius R with zero initial velocities
14 lines of simple Python code (examples/hill_nbody.py):
specification of integration parameters
Mc = 1, R = 200, ω = 1/√3
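The force evaluation behind the N-body ensemble can be sketched as a direct O(N²) sum of pairwise gravitational accelerations plus the tidal term derived from U_HillNB = −3ω²x², which gives a_x = 6ω²x. This is a CPU illustration only and not the EnSPy kernel. G = 1, the softening length eps and the function name are assumptions, and any further rotating-frame terms are left out because they are not listed above.

```python
def accelerations(pos, masses, omega, g=1.0, eps=1e-3):
    """Return 2D accelerations from mutual gravity plus the tidal term.

    pos    -- list of (x, y) positions
    masses -- list of particle masses (e.g. Mc / N each)
    eps    -- softening length (assumption) to avoid division by zero
    """
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi = pos[i]
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            inv_r3 = (dx * dx + dy * dy + eps * eps) ** -1.5
            acc[i][0] += g * masses[j] * dx * inv_r3
            acc[i][1] += g * masses[j] * dy * inv_r3
        # Tidal term: a_x = -dU/dx with U = -3 * omega^2 * x^2
        acc[i][0] += 6.0 * omega**2 * xi
    return acc


if __name__ == "__main__":
    omega = 3 ** -0.5
    print(accelerations([(-1.0, 0.0), (1.0, 0.0)], [0.5, 0.5], omega))
```

Each particle's sum is independent of the others, which is why this all-pairs pattern maps well onto one GPU thread per particle.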
[Figure: cluster configuration panels; vertical axis y]
[Figure (performance results): axes speed-up, GFlop/s and N (number of particles); legend: OpenMP, SSE optimized, CUDA; GTX260 DP, N-body; GTX260 DP, "simple" ensemble]
GPU computing prospects
Yesterday:
uniform programming with OpenCL: no need to care about
concrete implementation
desktop supercomputers (full ATX form–factor):
Today:
CUDA 3.2 → C++: classes, namespaces, default
parameters, operator overloading
∼ 2TFLOPs/4TFLOPs
∼ 2GFLOPs/8TFLOPs
concurrent kernel execution
stable double-precision support
(12 August 2010)
∼ 8x in GFLOPs, ∼ 6x in GFLOPs/$, ∼ 5x in GFLOPs/W vs four Intel Xeon
X5550 (85GFLOPs/73GFLOPs)
LOEWE-CSC (University of Frankfurt): №22 in Top500
Tianhe-1-A, Nebulae, Tsubame-2: №1, 3, 4 SC from Top500
Tomorrow:
OpenCL 1.2 (?) → matrix and "field" complex and real types
New libraries: GPU programming as simple as CPU programming