
Computations on GPU: a road towards desktop supercomputing

Glib Ivashkevych

Institute of Theoretical Physics, NSC KIPT

November 24, 2010


Quick outline

GPU – Graphic Processing Unit


programmable
manycore
multithreaded
with very high memory bandwidth

We are going to talk about:


how GPUs became useful for scientific computations
GPU intrinsics and programming
how to get as much as possible from a GPU and survive :)
the future of GPUs and GPU programming
We will use CUDA (Nvidia's Compute Unified Device Architecture) most of the time, and OpenCL (Open Computing Language)
Should we care?

But first of all: do we really need GPU computing?

Short answer: yes!


high performance
transparent scalability

More accurate answer: yes, for problems with high parallelism.


large datasets
portions of the data can be processed independently

Most accurate answer: yes, for problems with high data parallelism.
For reference
GFLOPs – 10⁹ FLoating Point Operations Per second
∼ 55 GFLOPs on Intel Core i7 Nehalem 975 (according to Intel)
∼ 125 GFLOPs on AMD Opteron Istanbul 2435
∼ 500 GFLOPs in double and ∼ 2 TFLOPs in single precision on Nvidia Tesla C2050
∼ 3.2 · 10³ GFLOPs on ASCI Red at Sandia National Laboratory – the fastest supercomputer as of November 1999
∼ 87 · 10³ GFLOPs on the TSUBAME-1 Grid Cluster at Tokyo Institute of Technology – the first GPU-based supercomputer, №88 in Top500 as of November 2010 (№56 in November 2009)
∼ 2.56 · 10⁶ GFLOPs on Tianhe-1A at the National Supercomputing Center in Tianjin – the fastest supercomputer as of November 2010 – GPU-based
Examples

Matrix and vector operations


CUBLAS¹ on Nvidia Tesla C2050 (CUDA 3.2) vs Intel MKL² 10.2
on Intel Core i7 Nehalem (4 threads)

∼ 8x in double precision

CULA³ on Nvidia Tesla C2050 (CUDA 3.2)

up to ∼ 220 GFLOPs in double precision
up to ∼ 450 GFLOPs in single precision

vs Intel MKL 10.2

∼ 4–6x speed-up

¹ CUDA accelerated Basic Linear Algebra Subprograms
² Math Kernel Library
³ LAPACK for Heterogeneous systems

Fast Fourier Transform


CUFFT on Nvidia Tesla C2070 (CUDA 3.2)

up to 65 GFLOPs in double precision


up to 220 GFLOPs in single precision

vs Intel MKL on Intel Core i7 Nehalem

∼ 9x in double precision
∼ 20x in single precision

Physics: Computational Fluid Dynamics


Simulation of transition to turbulence¹
Nvidia Tesla S1070 vs quad-core Intel Xeon X5450 (3GHz)
∼ 20x over serial code
∼ 10x over OpenMP implementation (2 threads)
∼ 5x over OpenMP implementation (4 threads)

¹ A.S. Antoniou et al., American Institute of Aeronautics and Astronautics Paper 2010-0525

Quantum chemistry
Calculation of molecular orbitals¹
Nvidia GeForce GTX 280 vs Intel Core2Quad Q6600 (2.4GHz)
∼ 173x over serial non-optimized code
∼ 14x over parallel optimized code (4 threads)

¹ D.J. Hardy et al., GPGPU 2009

Medical Imaging
Isosurface reconstruction from scalar volumetric data¹
Nvidia GeForce GTX 285 vs ?
∼ 68x over optimized CPU code
nearly real-time processing of data

¹ T. Kalbe et al., Proceedings of the 5th International Symposium on Visual Computing (ISVC 2009)

GPUGrid.net
Biomolecular simulations
accelerated by Nvidia CUDA boards and Sony PlayStation
∼ 8000 users from 101 countries
∼ 145 TFLOPs on average ≈ №25 in Top500
∼ 50 GFLOPs from every active user

ATLAS experiment on Large Hadron Collider


Particle tracking, triggering, event simulation¹
possible Higgs events – track a large number of particles
∼ 32x in tracking, ∼ 35x in triggering on Nvidia Tesla C1060

¹ P.J. Clark et al., Processing Petabytes per Second with the ATLAS Experiment at the LHC (GTC 2010)

And even more examples in:


N–body simulations
seismic simulations
molecular dynamics
SETI@Home & MilkyWay@Home
finance
neural networks
...
and, of course, graphics

VFX, rendering
image editing, video

GPU Technology Conference 2010 (September 20-23)


Outline
1 History
2 CUDA: architecture overview and programming model
3 Threads and memories hierarchy
4 Toolbox
5 PyCUDA&PyOpenCL
6 EnSPy functionality
7 EnSPy architecture
8 Example: D5 potential
9 Example: Hill problem
10 Example: Hill problem, N–body version
11 Performance results
12 GPU computing prospects
History

History in brief:
GPGPU in 2001-2006:
through graphics API (OpenGL or DirectX)
extremely hard
only in single precision

GPGPU today:
straightforward
easy
in double precision.
CUDA: architecture overview and programming model

Hardware model: GT200 architecture

consists of multiprocessors
each MP has:
8 stream processors
1 unit for double precision operations
shared memory
global memory

Hardware model: Fermi architecture

each MP has:
32 stream processors
4 SFUs (Special Function Units)
each SP has:
1 FP unit & 1 INT unit

Hardware model: Multiprocessors and threads


an MP can launch numerous threads
threads are "lightweight" – little creation and switching overhead
threads run the same code
thread synchronization within an MP
cooperation via shared memory
each thread has a unique identifier – thread ID

Efficiency is achieved by hiding latency with computation, not by cache usage as on a CPU

Software model: C for CUDA


a set of extensions to C
runtime library
function and variable type qualifiers
built–in vector types: float4, double2 etc.
built–in variables
Kernels
map the parallel part of the program onto the GPU
execution: N times in parallel by N CUDA threads

CUDA Driver API

low-level control over execution
no need for the nvcc compiler if kernels are precompiled – only the driver is needed

Software model: Example

// Some function - executed on device (GPU)
__device__ float DeviceFunction(float* A, float* B)
{
    // Some math
    return smth;
}

// Kernel definition
__global__ void SomeKernel(float* A, float* B, float* C)
{
    // Some math
    C[threadIdx.x] = DeviceFunction(A, B);
}

// Host code
int main()
{
    // Kernel invocation
    SomeKernel<<<1, N>>>(A, B, C);
}

Software model: Explanations

the __device__ qualifier defines a function that is:
executed on the device
callable from the device only

the __global__ qualifier defines a function that is:
executed on the device
callable from the host only

Execution model

Scalability
the underlying hardware architecture is hidden
threads can synchronize only within an MP

we do not need to know the exact number of MPs

scalable applications – from the GTX8800 to Fermi


Threads and memories hierarchy

Single threads
each thread has private local memory
threads are identified by the built-in variable threadIdx (uint3 type)
int idx = threadIdx . x + threadIdx . y + threadIdx . z ;

threads form a 1-, 2- or 3-dimensional array – a vector, matrix or field

Threads are organized into thread blocks

Thread blocks
each block has shared memory visible to all threads within the block
blocks are identified by the built-in variable blockIdx (uint3 type)
int b_idx = blockIdx . x + blockIdx . y ;

the dimension of a block is given by the built-in variable blockDim (dim3 type)

Blocks are organized into a grid

Grid of thread blocks

global device memory is accessible to all threads in the grid
the dimension of the grid is given by the built-in variable gridDim (dim3 type)
[figure: threads and memories hierarchy]
Example: vector addition


int main()
{
    // Allocate vectors in device memory
    size_t size = N * sizeof(float);
    float* d_A;
    cudaMalloc((void**)&d_A, size);
    float* d_B;
    cudaMalloc((void**)&d_B, size);
    float* d_C;
    cudaMalloc((void**)&d_C, size);

    // Copy data from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Prepare the kernel launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}


// Kernel code
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
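As a serial reference (plain Python, ours for illustration), the loop below computes exactly what the N CUDA threads of VecAdd compute in parallel, one iteration per thread:

```python
# Serial reference for the VecAdd kernel: one loop iteration plays the
# role of one CUDA thread; i corresponds to the global thread index.
def vec_add(a, b):
    c = [0.0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

print(vec_add([1, 2, 3, 4], [10, 20, 30, 40]))  # → [11, 22, 33, 44]
```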

Performance analysis and optimization

there must be enough thread blocks per MP to hide latency
try not to under-populate blocks
use the memory bandwidth (∼ 100 GB/s!) efficiently
coalescing
non-optimized access to global memory can reduce performance by order(s) of magnitude
try to achieve high arithmetic intensity
never diverge threads within one warp:
divergence → serialization = lost parallelism
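The divergence rule can be pictured with a toy cost model (our simplification, not a hardware simulation): a warp pays one serialized pass per distinct branch path taken by its 32 threads.

```python
WARP_SIZE = 32

def serialized_passes(branch_taken):
    # a warp executes each distinct branch path in a separate pass,
    # so the cost is the number of distinct paths taken
    return len(set(branch_taken))

uniform = [0] * WARP_SIZE                # all threads agree -> 1 pass
diverged = [0] * (WARP_SIZE - 1) + [1]   # one thread differs -> 2 passes
print(serialized_passes(uniform), serialized_passes(diverged))  # → 1 2
```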
Toolbox

Start-up tools
drivers
CUDA Toolkit
nvcc compiler, runtime library, header files, CUBLAS, CUFFT, Visual Profiler etc.
CUDA SDK
examples, Occupancy Calculator etc.

Free download at
http://developer.nvidia.com/object/cuda_2_3_downloads.html

Support for 32- and 64-bit Windows, Linux¹ & Mac OS X

¹ Supported distros in CUDA 3.2: Fedora 13, RH Enterprise 4.8 & 5.5, OpenSUSE 11.2, SLED 11.0, Ubuntu 10.04

Developer tools
CUDA-gdb
integration into gdb
CUDA C support
works on all 32/64-bit Linux distros
breakpoints and single-step execution

CUDA Visual Profiler

tracks events with hardware counters
global memory loads/stores
total branches and divergent branches taken by threads
instruction count
number of thread warps serialized due to address conflicts (shared and constant memory)
PyCUDA&PyOpenCL

Python
easy to learn
dynamically typed
rich built-in functionality
interpreted
very well documented
has a large and active community

Scientific tools:
SciPy – modeling and simulation
Fourier transforms
ODE
optimization
scipy.weave.inline – C inlining with little or no overhead
···
NumPy – arrays
flexible array creation routines
sorting, random sampling and statistics
···

Python is a convenient way of interfacing C/C++ libraries
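A minimal taste of the NumPy features listed above (array creation, sorting, basic statistics):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)        # flexible array creation
print(x.tolist())                   # → [0.0, 0.25, 0.5, 0.75, 1.0]

y = np.array([3, 1, 2])
print(np.sort(y).tolist())          # → [1, 2, 3]
print(float(y.mean()))              # → 2.0
```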



PyCUDA
provides complete access to CUDA features
automatically manages resources
error handling and translation to Python exceptions
convenient abstractions: GPUArray
metaprogramming: creation of CUDA source code dynamically
interactive!
PyOpenCL is pretty much the same in concept – but not only for Nvidia GPUs.
Also for ATI/AMD cards, AMD & Intel processors etc. (IBM Cell?)
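The metaprogramming bullet deserves a sketch: CUDA source can be assembled as an ordinary Python string and, on a machine with a GPU, handed to pycuda.compiler.SourceModule. The snippet below only builds the source text, so it runs without a GPU; the kernel name and expression are illustrative placeholders of ours, not PyCUDA API.

```python
KERNEL_TEMPLATE = """
__global__ void map_kernel(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = %(expression)s;
}
"""

def make_kernel_source(expression):
    # substitute a user-supplied expression into the CUDA template
    return KERNEL_TEMPLATE % {"expression": expression}

src = make_kernel_source("x[i] * x[i] + 1.0f")
print("__global__" in src)  # → True
```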

Python and CUDA

We could interface via:
Python C API – low-level approach: overkill
SWIG, Boost::Python – high-level approach: overkill
PyCUDA – the simplest and most straightforward way, for CUDA only
scipy.weave.inline – a simple and straightforward way for both CUDA and plain C/C++
EnSPy functionality

Motivation
Combine the flexibility of Python with the efficiency of C++ → CUDA for N-body simulations
the interface of EnSPy is written in Python
the core of EnSPy is written in C++
joined together by scipy.weave.inline
the C++ core can be used without Python – just include the header and link with the precompiled shared library
easily extensible, both through the high-level Python interface and the low-level C++ core – new algorithms, initial distributions etc.
multi-GPU parallelization
it's easy to experiment with EnSPy!

EnSPy functionality
Types of ensembles:
"simple" ensemble – no interaction, only an external potential
N-body ensemble – both an external potential and gravitational interaction between particles

Current algorithms:
4th-order Runge–Kutta for "simple" ensembles
Hermite scheme with shared time steps for N-body ensembles
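For the N-body ensemble, the gravitational interaction is the direct O(N²) pairwise sum. A plain-Python sketch of that sum (G = 1 units; the softening eps is our addition to avoid the r → 0 singularity), not EnSPy's actual CUDA code:

```python
def accelerations(pos, mass, eps=1e-6):
    # direct-sum gravitational accelerations: a_i = sum_j m_j r_ij / |r_ij|^3
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + eps * eps
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += mass[j] * dx[k] * inv_r3
    return acc

# two unit masses one unit apart attract each other with |a| = 1
a = accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0])
print(round(a[0][0], 5), round(a[1][0], 5))  # → 1.0 -1.0
```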

Predefined initial distributions:

uniform, point and spherical for "simple" ensembles
uniform sphere with 2T/|U| = 1 for N-body ensembles
the user can supply functions (in Python) for initial ensemble generation

User-specified values and expressions:

parameters of the initial distribution
potential, forces, parameters of the integration scheme
an arbitrary number of triggers – the number Nᵢ(t) of particles that have not crossed a given hypersurface Fᵢ(q, p) = 0 before time t
an arbitrary number of averages – quantities F̄ᵢ(q, p, t) to be averaged over the ensemble

Runtime generation and compilation of C and CUDA code:

user-specified expressions (as Python strings) are wrapped by the EnSPy template subpackage into C functions and a CUDA module
compiled at runtime

High usability and calculation efficiency:

flexible Python interface
all actual calculations are performed by a runtime-generated C extension and a precompiled shared library

Drawback:
extra time for generation and compilation of new code
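The wrapping step can be pictured with a toy template (the names below are illustrative, not EnSPy's real API): a trigger expression arrives as a Python string and leaves as compilable C source.

```python
C_TEMPLATE = """double trigger_%(index)d(const double* q, const double* p)
{
    return %(expr)s;
}
"""

def wrap_trigger(index, expr):
    # wrap a user-supplied expression string into a C function body
    return C_TEMPLATE % {"index": index, "expr": expr}

# the D5 trigger x = 0 becomes a C function returning q[0]
print(wrap_trigger(0, "q[0]"))
```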
EnSPy architecture

Execution flow and architecture

input parameters → ensemble population (predefined or user-specified distribution) → code generation and compilation → launching N_GPUs threads

GPU parallelization scheme for N–body simulations



Order of force calculation


Example: D5 potential

Overview
Problem:
Escape from a potential well.
Watched value (trigger):
N(t) – the number of particles remaining in the well at time t

Potential:

U_D5 = 2ay² − x² + xy² + x⁴/4

"Critical" energy: E_cr = E_S = 0

Potential and structure of phase space:

[figure: level lines of the D5 potential (x–y plane) and phase-space structure (x–pₓ plane)]

Calculation setup:
"simple" ensemble
uniform initial distribution of N = 10240 particles in x > 0 ∩ U(x, y) < E
trigger: x = 0 → q0 = 0.
12 lines of simple Python code (examples/d5.py):
specification of integration parameters
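Outside EnSPy, the essence of this setup fits in a few lines of plain Python: the sketch below integrates one particle in the D5 potential with a classic 4th-order Runge–Kutta step and checks energy conservation. The parameter value a = 1 and the initial condition are our assumptions, for illustration only.

```python
A = 1.0  # parameter a of the D5 potential (assumed value)

def potential(x, y):
    return 2 * A * y**2 - x**2 + x * y**2 + x**4 / 4

def force(x, y):
    # F = -grad U
    return (2 * x - y**2 - x**3, -4 * A * y - 2 * x * y)

def energy(s):
    x, y, px, py = s
    return 0.5 * (px**2 + py**2) + potential(x, y)

def deriv(s):
    x, y, px, py = s
    fx, fy = force(x, y)
    return (px, py, fx, fy)

def rk4_step(s, h):
    k1 = deriv(s)
    k2 = deriv(tuple(si + 0.5 * h * ki for si, ki in zip(s, k1)))
    k3 = deriv(tuple(si + 0.5 * h * ki for si, ki in zip(s, k2)))
    k4 = deriv(tuple(si + h * ki for si, ki in zip(s, k3)))
    return tuple(si + h / 6 * (p + 2 * q + 2 * r + w)
                 for si, p, q, r, w in zip(s, k1, k2, k3, k4))

state = (1.0, 0.2, 0.0, 0.1)  # (x, y, px, py): E < Ecr = 0, bound orbit
e0 = energy(state)
for _ in range(1000):
    state = rk4_step(state, 0.01)
print(abs(energy(state) - e0) < 1e-5)  # → True
```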

Results:
Regular particles are trapped in the well → the initial "mixed state" splits

[figure: N(t)/N(0) vs t for E = 0.1 and E = 0.9]
Example: Hill problem

Overview
Problem:
Toy model of escape from a star cluster: escape of a star from the potential of a point rotating star cluster M_c and a point galaxy core M_g ≫ M_c
Watched value (trigger):
N(t) – the number of particles remaining in the cluster at time t

"Potential" in the cluster frame of reference (tidal approximation):

U_Hill = −(3/2)ω²x² − GM_c/r

"Critical" energy: E_cr = E_S = −4.5ω²

Potential:

[figure: Hill curves in the x–y plane]

Calculation setup:
"simple" ensemble
uniform initial distribution of N = 10240 particles in |x| < r_t ∩ U(x, y) < E
ω = 1/√3 → r_t = 1
trigger: |x| − r_t = 0 → abs(q0) - 1. = 0.
12 lines of simple Python code (examples/hill_plain.py):
specification of integration parameters

Results:
Trapping of regular particles (some tricky physics here):

[figure: N(t) vs nt for E = −1.3, −0.8 and −0.3]
Example: Hill problem, N–body version

Overview
Problem:
Simplified model of escape from a star cluster: escape of a star from the potential of a rotating star cluster with total mass M_c and the point potential of a galaxy core with mass M_g ≫ M_c (2D)

Watched values:
Configuration of the cluster
Potential of the galaxy core in the cluster frame of reference (tidal approximation):

U_HillNB = −(3/2)ω²x²

”Toy” Hill model vs N–body Hill model:



Calculation setup:
N–body ensemble
2D (z = 0) initial distribution of N = 10240 particles inside a circle of radius R with zero initial velocities
14 lines of simple Python code (examples/hill_nbody.py):
specification of integration parameters
Mc = 1, R = 200, ω = 1/√3

Results: cluster configuration

[figure: cluster configuration in the x–y plane at steps 201, 401, 601, 801, 1001 and 1201]
Performance results

OpenSUSE 11.2, GCC 4.4, CUDA 3.0. AMD Athlon X2 4400+ (2.3GHz) / Intel Core2Duo E8500 (3.16GHz), Nvidia GeForce GTX 260.
Not as good as it could be – subject to improvement.
Estimate: ∼ 1 TFLOPs on 2x recent Fermi graphics processors

[figure: speed-up vs N for OpenMP, SSE-optimized and CUDA code; GFlop/s vs number of particles for the GTX 260 in double precision, N-body and "simple" ensembles]
GPU computing prospects

Outline
1 History
2 CUDA: architecture overview and programming model
3 Threads and memories hierarchy
4 Toolbox
5 PyCUDA&PyOpenCL
6 EnSPy functionality
7 EnSPy architecture
8 Example: D5 potential
9 Example: Hill problem
10 Example: Hill problem, N–body version
11 Performance results
12 GPU computing prospects
Computations on GPU: a road towards desktop supercomputing
GPU computing prospects

Yesterday:
uniform programming with OpenCL: no need to care about the concrete implementation
desktop supercomputers (full ATX form factor):

Nvidia Tesla C1060 x4: ∼ 300 GFLOPs / 4 TFLOPs (double/single precision), Windows & Linux 32/64-bit support
ATI FireStream x4: ∼ 960 GFLOPs / 4.8 TFLOPs, Windows & Linux 32/64-bit support

Today:
CUDA 3.2 → C++: classes, namespaces, default parameters, operator overloading

Nvidia Tesla C2050/C2070 x4: ∼ 2 TFLOPs / 4 TFLOPs (double/single precision), concurrent kernel execution; ∼ 8x in GFLOPs, ∼ 6x in GFLOPs/$, ∼ 5x in GFLOPs/W vs four Intel Xeon X5550 (85 GFLOPs / 73 GFLOPs)
ATI FireStream 9350/9370 x4: ∼ 2 TFLOPs / 8 TFLOPs, stable double-precision support (12 August 2010); LOEWE–CSC (University of Frankfurt): №22 in Top500
Tianhe-1A, Nebulae, Tsubame-2: №1, 3 and 4 supercomputers in Top500

Tomorrow:
OpenCL 1.2 (?) → matrix and "field" complex and real types
new libraries: GPU programming as simple as CPU programming

Nvidia GeForce GTX 580: ∼ 0.75 TFLOPs / 1.5 TFLOPs (double/single precision)
ATI Radeon 6950 "Cayman": ∼ 0.75 TFLOPs / 3 TFLOPs