Cuda Seminar

Computations on GPU: a road towards desktop supercomputing
Glib Ivashkevych
Institute of Theoretical Physics, NSC KIPT
November 24, 2010

Quick outline
GPU – Graphics Processing Unit

programmable
manycore
multithreaded
with very high memory bandwidth
We are going to talk about:

how GPUs became useful for scientific computations
GPU intrinsics and programming
how to get as much as possible from the GPU and survive :)
the future of GPUs and GPU programming
Note: mostly CUDA (Nvidia's Compute Unified Device Architecture), and OpenCL (Open Computing Language)
Should we care?
But first of all: do we really need GPU computing?

Short answer: yes!

high performance
transparent scalability
More accurate answer: yes, for problems with high parallelism.

large datasets
portions of data could be processed independently
Most accurate answer: yes, for problems with high data parallelism.
Reference
For reference
GFLOPs – 10^9 floating point operations per second
∼ 55 GFLOPs on Intel Core i7 Nehalem 975 (according to Intel)
∼ 125 GFLOPs on AMD Opteron Istanbul 2435
∼ 500 GFLOPs in double and ∼ 1 TFLOPs in single precision on Nvidia Tesla C2050
∼ 3.2 · 10^3 GFLOPs on ASCI Red at Sandia National Laboratories – the fastest supercomputer as of November 1999
∼ 87 · 10^3 GFLOPs on TSUBAME-1 Grid Cluster at Tokyo Institute of Technology – first GPU-based supercomputer, №88 in Top500 as of November 2010 (№56 in November 2009)
∼ 2.56 · 10^6 GFLOPs on Tianhe-1-A at the National Supercomputing Center in Tianjin – the fastest supercomputer as of November 2010 – GPU-based
Examples
Matrix and vector operations

CUBLAS (CUDA accelerated Basic Linear Algebra Subprograms) on Nvidia Tesla C2050 (CUDA 3.2) vs Intel MKL (Math Kernel Library) 10.2 on Intel Core i7 Nehalem (4 threads)
∼ 8x in double precision
CULA (LAPACK for heterogeneous systems) on Nvidia Tesla C2050 (CUDA 3.2)
up to ∼ 220 GFLOPs in double precision
up to ∼ 450 GFLOPs in single precision
vs Intel MKL 10.2
∼ 4 − 6x speed–up
Examples
Fast Fourier Transform

CUFFT on Nvidia Tesla C2070 (CUDA 3.2)
up to 65 GFLOPs in double precision

up to 220 GFLOPs in single precision
vs Intel MKL on Intel Core i7 Nehalem
∼ 9x in double precision
∼ 20x in single precision
Examples
Physics: Computational Fluid Dynamics

Simulation of transition to turbulence (A.S. Antoniou et al., American Institute of Aeronautics and Astronautics Paper 2010–0525)
Nvidia Tesla S1070 vs quad-core Intel Xeon X5450 (3GHz)
∼ 20x over serial code
∼ 10x over OpenMP implementation (2 threads)
∼ 5x over OpenMP implementation (4 threads)
Examples
Quantum chemistry
Calculations of molecular orbitals (D.J. Hardy et al., GPGPU 2009)
Nvidia GeForce GTX 280 vs Intel Core2Quad Q6600 (2.4GHz)
∼ 173x over serial non–optimized code
∼ 14x over parallel optimized code (4 threads)
Examples
Medical Imaging
Isosurface reconstruction from scalar volumetric data (T. Kalbe et al., Proceedings of 5th International Symposium on Visual Computing, ISVC 2009)
Nvidia GeForce GTX 285 vs ?
∼ 68x over optimized CPU code
nearly real-time processing of data
Examples
GPUGrid.net
Biomolecular simulations
accelerated by Nvidia CUDA boards and Sony PlayStation
∼ 8000 users from 101 countries
∼ 145 TFLOPs on average ≈ №25 in Top500
∼ 50 GFLOPs from every active user
Examples
ATLAS experiment on Large Hadron Collider

Particle tracking, triggering, event simulation (P.J. Clark et al., Processing Petabytes per Second with the ATLAS Experiment at the LHC, GTC 2010)
possible Higgs events – track large number of particles
∼ 32x in tracking, ∼ 35x in triggering on Nvidia Tesla C1060
Examples
And even more examples in:

N–body simulations
seismic simulations
molecular dynamics
SETI@Home & MilkyWay@Home
finance
neural networks
...
and, of course, graphics
VFX, rendering
image editing, video
Examples
GPU Technology Conference 2010 (September 20-23)

Outline
1 History
2 CUDA: architecture overview and programming model
3 Threads and memories hierarchy
4 Toolbox
5 PyCUDA&PyOpenCL
6 EnSPy functionality
7 EnSPy architecture
8 Example: D5 potential
9 Example: Hill problem
10 Example: Hill problem, N–body version
11 Performance results
12 GPU computing prospects
History
History in brief:
GPGPU in 2001-2006:
through graphics API (OpenGL or DirectX)
extremely hard
only in single precision
GPGPU today:
straightforward
easy
in double precision.
CUDA: architecture overview and programming model
Hardware model: GT200 architecture
consists of multiprocessors
each MP has:
8 stream processors
1 unit for double precision operations
shared memory
global memory
Hardware model: Fermi architecture
each MP has:
32 stream processors
4 SFUs (Special Function Units)
each SP has:
1 FP Unit & 1 INT Unit
Hardware model: Multiprocessors and threads

an MP can launch numerous threads
threads are "lightweight" – little creation and switching overhead
threads run the same code
thread synchronization within an MP
cooperation via shared memory
each thread has a unique identifier – thread ID
Efficiency is achieved by hiding latency with computation, not by cache usage as on a CPU
Software model: C for CUDA

a set of extensions to C
runtime library
function and variable type qualifiers
built–in vector types: float4, double2 etc.
built–in variables
Kernels
maps parallel part of the program to the GPU
execution: N times in parallel by N CUDA threads
CUDA Driver API

low–level control over the execution
no need for the nvcc compiler if kernels are precompiled – only the driver is needed
Software model: Example

// Some function, executed on the device (GPU)
__device__ float DeviceFunction(float* A, float* B)
{
    // Some math (placeholder: any per-thread computation)
    float smth = A[threadIdx.x] + B[threadIdx.x];
    return smth;
}

// Kernel definition
__global__ void SomeKernel(float* A, float* B, float* C)
{
    // Each thread stores its own result
    C[threadIdx.x] = DeviceFunction(A, B);
}

// Host code
int main()
{
    // A, B, C: device arrays of N floats, allocated elsewhere
    // Kernel invocation: 1 block of N threads
    SomeKernel<<<1, N>>>(A, B, C);
}
Software model: Explanations

__device__ qualifier defines a function that is:
executed on device
callable from device only
__global__ qualifier defines a function that is:
executed on device
callable from host only
Execution model
Scalability
underlying hardware architecture is hidden
threads can synchronize only within an MP
we do not need to know the exact number of MPs
scalable applications – from GeForce 8800 GTX to Fermi

Threads and memories hierarchy
Single threads
each thread has private local memory
threads are identified by the built–in variable threadIdx (uint3 type)
int idx = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;
threads form a 1–, 2– or 3–dimensional array – vector, matrix or field
Threads are organized into thread blocks
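One common way to turn a 3-dimensional thread index into a single unique number is row-major flattening. The sketch below emulates this on the CPU in plain Python; the block shape (4, 2, 3) and the function name are illustrative assumptions, not part of CUDA itself.

```python
from itertools import product


def linear_thread_id(tx, ty, tz, dim_x, dim_y):
    """Flatten a 3-D thread index (tx, ty, tz) into one integer."""
    return tx + ty * dim_x + tz * dim_x * dim_y


# Hypothetical block shape, playing the role of blockDim = (4, 2, 3).
DIM_X, DIM_Y, DIM_Z = 4, 2, 3

ids = sorted(
    linear_thread_id(tx, ty, tz, DIM_X, DIM_Y)
    for tz, ty, tx in product(range(DIM_Z), range(DIM_Y), range(DIM_X))
)

# Every thread gets its own ID, covering 0 .. DIM_X * DIM_Y * DIM_Z - 1.
all_unique = ids == list(range(DIM_X * DIM_Y * DIM_Z))
```

A plain sum of the three components would not be unique: threads (1, 0, 0) and (0, 1, 0) would both get 1.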
Thread blocks
each block has shared memory visible to all threads within the block
blocks are identified by the built–in variable blockIdx (uint3 type)
int b_idx = blockIdx.x + blockIdx.y * gridDim.x;
dimension of the block is identified by the built–in variable blockDim (dim3 type)
Blocks are organized into a grid
Grid of thread blocks

global device memory is accessible by all threads in the grid
dimension of the grid is identified by built–in variable gridDim
(dim3 type)

Example: vector addition

// Kernel code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    // Global index: block offset plus index within the block
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code (N and the host arrays h_A, h_B, h_C are defined elsewhere)
int main()
{
    // Allocate vectors in device memory
    size_t size = N * sizeof(float);
    float* d_A;
    cudaMalloc((void**)&d_A, size);
    float* d_B;
    cudaMalloc((void**)&d_B, size);
    float* d_C;
    cudaMalloc((void**)&d_C, size);
    // Copy data from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Prepare the kernel launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy the result from device memory to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
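To see why the launch rounds the number of blocks up and why the kernel needs the bounds check, the launch can be emulated on the CPU. This is a plain Python sketch under stated assumptions: the function names are made up for illustration, and the two nested loops stand in for the blocks and threads that a GPU would run in parallel.

```python
def blocks_per_grid(n, threads_per_block):
    """Ceiling division: enough blocks to cover all n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def emulate_vec_add(a, b, threads_per_block=256):
    """Serial emulation of a 1-D grid of 1-D blocks adding two vectors."""
    n = len(a)
    c = [0.0] * n
    for block_idx in range(blocks_per_grid(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            # Same global index as in a CUDA kernel:
            # blockDim.x * blockIdx.x + threadIdx.x
            i = threads_per_block * block_idx + thread_idx
            if i < n:  # the last block may be only partially used
                c[i] = a[i] + b[i]
    return c


result = emulate_vec_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], threads_per_block=2)
```

With 3 elements and 2 threads per block, two blocks are launched and the last thread of the second block does nothing.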
Performance analysis and optimization

there must be enough thread blocks per MP to hide latency
try not to under–populate blocks
use memory bandwidth (∼ 100 GB/s!) efficiently
coalescing
non–optimized access to global memory could reduce performance by order(s) of magnitude
try to achieve high arithmetic intensity
never diverge threads within one warp:
divergence → serialization ≠ parallelism
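A rough back-of-the-envelope estimate shows why arithmetic intensity matters. The sketch below assumes a purely bandwidth-bound kernel, 4-byte floats, and the ∼ 100 GB/s figure quoted above; the function name is illustrative and caches are ignored.

```python
def bandwidth_bound_gflops(flops_per_element, bytes_per_element, bandwidth_gb_s):
    """Upper bound on GFLOPs when memory traffic is the only limit."""
    intensity = flops_per_element / bytes_per_element  # FLOPs per byte
    return intensity * bandwidth_gb_s


# Single-precision vector addition: 1 addition per element,
# 2 loads + 1 store of 4 bytes each = 12 bytes of traffic per element.
vec_add_bound = bandwidth_bound_gflops(1, 12, 100.0)
```

Under these assumptions vector addition cannot exceed roughly 8 GFLOPs, far below the peak arithmetic rate, so kernels that do more arithmetic per byte loaded make much better use of the GPU.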
Toolbox
Toolbox
Start–up tools
drivers
CUDA Toolkit
nvcc compiler, runtime library, header files, CUBLAS, CUFFT,
Visual Profiler etc.
CUDA SDK
examples, Occupancy Calculator etc.
Free download at
http://developer.nvidia.com/object/cuda_2_3_downloads.html
Support for 32- and 64-bit Windows, Linux & Mac OS X
Supported Linux distros in CUDA 3.2: Fedora 13, RH Enterprise 4.8 & 5.5, OpenSUSE 11.2, SLED 11.0, Ubuntu 10.04
Toolbox
Developers Tools
CUDA-gdb
integration into gdb
CUDA C support
works on all 32/64–bit Linux distros
breakpoints and single step execution
CUDA Visual Profiler

tracks events with hardware counters
global memory loads/stores
total branches and divergent branches taken by threads
instruction count
number of serialized thread warps due to address conflicts
(shared and constant memory)
PyCUDA&PyOpenCL
PyCUDA&PyOpenCL
Python
easy to learn
dynamically typed
rich built–in functionality
interpreted
very well documented
has a large and active community
PyCUDA&PyOpenCL
Scientific tools:
SciPy – modeling and simulation
Fourier transforms
ODE
Optimization
scipy.weave.inline – C inlining with little or no overhead
···
NumPy – arrays
flexible array creation routines
sorting, random sampling and statistics
···
Python is a convenient way of interfacing C/C++ libraries

PyCUDA&PyOpenCL
PyCUDA
provides complete access to CUDA features
automatically manages resources
error handling and translation to Python exceptions
convenient abstractions: GPUArray
metaprogramming: dynamic creation of CUDA source code
interactive!
PyOpenCL is pretty much the same in concept – but not only for Nvidia GPUs.
Also for ATI/AMD cards, AMD & Intel processors etc. (IBM Cell?)
PyCUDA&PyOpenCL
Python and CUDA

We could interface with:
Python C API – low–level approach: overkill
SWIG, Boost::Python – high–level approach: overkill
PyCUDA – the simplest and most straightforward way, for CUDA only
scipy.weave.inline – a simple and straightforward way for both CUDA and plain C/C++
EnSPy functionality
EnSPy functionality
Motivation
Combine flexibility of Python with efficiency of C++ → CUDA for N–body simulations
interface of EnSPy is written in Python
core of EnSPy is written in C++
joined together by scipy.weave.inline
C++ core could be used without Python – just include the header and link with the precompiled shared library
easily extensible, both through the high–level Python interface and the low–level C++ core – new algorithms, initial distributions etc.
multi–GPU parallelization
it's easy to experiment with EnSPy!
EnSPy functionality
Types of ensembles:
”Simple” ensemble – without interaction, only external
potential
N–body ensemble – both external potential and gravitational
interaction between particles
Current algorithms:
4-th order Runge–Kutta for ”simple” ensemble
Hermite scheme with shared time steps for N-body ensemble
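For readers unfamiliar with the 4th-order Runge–Kutta scheme, one step can be sketched in a few lines of plain Python. This is a generic textbook illustration, not EnSPy's actual implementation; the harmonic oscillator test problem and all names are assumptions chosen for the example.

```python
import math


def rk4_step(f, t, y, h):
    """Advance dy/dt = f(t, y) by one step of size h (y is a list of floats)."""
    def shifted(base, slope, scale):
        return [b + scale * s for b, s in zip(base, slope)]

    k1 = f(t, y)
    k2 = f(t + h / 2, shifted(y, k1, h / 2))
    k3 = f(t + h / 2, shifted(y, k2, h / 2))
    k4 = f(t + h, shifted(y, k3, h))
    return [
        yi + h / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def oscillator(t, state):
    """Hamilton's equations for H = (p^2 + q^2) / 2: dq/dt = p, dp/dt = -q."""
    q, p = state
    return [p, -q]


# Integrate one particle from (q, p) = (1, 0) up to t = 1.
state, h = [1.0, 0.0], 0.01
for step in range(100):
    state = rk4_step(oscillator, step * h, state, h)
```

In a "simple" ensemble every particle is advanced independently in this way, which is what makes the problem a good fit for one GPU thread per particle.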
EnSPy functionality
Predefined initial distributions:

Uniform, point and spherical for ”simple” ensembles
Uniform sphere with 2T /|U| = 1 for N-body ensemble
user could supply functions (in Python) for initial ensemble
generation
User specified values and expressions:

parameters of initial distribution
potential, forces, parameters of integration scheme
arbitrary number of triggers – numbers N_i(t) of particles which do not cross the given hypersurface F_i(q, p) = 0 before time t
arbitrary number of averages – quantities F̄_i(q, p, t) which should be averaged over the ensemble
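The idea of a trigger can be illustrated with a small plain-Python sketch: for each stored time, count the particles for which F(q, p) has not yet changed sign. The data layout, the function names and the sign-change test are assumptions made for this illustration; EnSPy's own bookkeeping may differ.

```python
def sign(x):
    return (x > 0) - (x < 0)


def count_not_crossed(trajectories, F):
    """Return N(t) at each stored time.

    trajectories[k] is the list of (q, p) states of particle k,
    where q and p are tuples of coordinates and momenta.
    """
    start = [sign(F(*traj[0])) for traj in trajectories]
    inside = [True] * len(trajectories)
    counts = []
    for step in range(len(trajectories[0])):
        for k, traj in enumerate(trajectories):
            # A particle drops out once F changes sign (or hits zero).
            if inside[k] and sign(F(*traj[step])) != start[k]:
                inside[k] = False
        counts.append(sum(inside))
    return counts


# Two hypothetical 1-D particles; the hypersurface is q0 = 0.
moving = [((1.0,), (-1.0,)), ((0.5,), (-1.0,)), ((-0.5,), (-1.0,))]
resting = [((1.0,), (0.0,))] * 3
counts = count_not_crossed([moving, resting], lambda q, p: q[0])
```

Here the first particle crosses q0 = 0 between the second and third stored states, so the count drops from 2 to 1 at the last time.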
EnSPy functionality
Runtime generation and compilation of C and CUDA code:

User specified expressions (as Python strings) are wrapped by
EnSPy template subpackage into C functions and CUDA
module
Compiled at runtime
High usage and calculation efficiency:

flexible Python interface
all actual calculations are performed by runtime generated C
extension and precompiled shared library
Drawback:
extra time for generation and compilation of new code
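Wrapping a user expression into C source at runtime can be sketched with the standard library alone. The template text, the function signature and the names below are hypothetical and are not taken from the EnSPy template subpackage; the compilation step is not shown.

```python
from string import Template

# Hypothetical C function template; ${name} and ${expr} are filled in at runtime.
C_FUNCTION = Template(
    "double ${name}(const double *q, const double *p, double t)\n"
    "{\n"
    "    return ${expr};\n"
    "}\n"
)


def wrap_expression(name, expr):
    """Turn a user expression (a Python string) into C source text."""
    return C_FUNCTION.substitute(name=name, expr=expr)


# A trigger expression such as "q[0]" becomes a complete C function.
source = wrap_expression("trigger_0", "q[0]")
```

The generated text would then be handed to a compiler, which is where the extra start-up time mentioned above comes from.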
EnSPy architecture
EnSPy architecture
Execution flow and architecture
Input parameters
Ensemble population (predefined or user specified distribution)
Code generation and compilation
Launching N_GPUs threads

EnSPy architecture
GPU parallelization scheme for N–body simulations

EnSPy architecture
Order of force calculation

Example: D5 potential
Overview
Problem:
Escape from potential well.
Watched values (trigger):
N(t) – number of particles, remaining in the well at time t
Potential:
U_{D5} = 2ay^2 - x^2 + xy^2 + \frac{x^4}{4}
"Critical" energy: E_{cr} = E_S = 0
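The D5 potential, U = 2ay² − x² + xy² + x⁴/4, is easy to evaluate directly, which gives a quick check of the critical energy. In this plain-Python sketch the parameter value a = 1 is an arbitrary assumption; the checked points lie on the line y = 0, where the result does not depend on a.

```python
import math


def u_d5(x, y, a):
    """D5 potential: U = 2*a*y^2 - x^2 + x*y^2 + x^4/4."""
    return 2.0 * a * y**2 - x**2 + x * y**2 + x**4 / 4.0


A = 1.0  # hypothetical parameter value

# Stationary points on the line y = 0 satisfy dU/dx = -2x + x^3 = 0.
saddle_energy = u_d5(0.0, 0.0, A)            # x = 0
well_energy = u_d5(math.sqrt(2.0), 0.0, A)   # x = +sqrt(2); x = -sqrt(2) gives the same
```

The point at the origin has energy 0, matching the saddle energy E_S = 0, while the two points at x = ±√2 sit at energy −1.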

Potential and structure of phase space:
[Figure: level lines of D5 potential; axes x, y and x, p_x]
Calculation setup:
”Simple ensemble”
uniform initial distribution of N = 10240 particles in
x > 0 ∩ U(x, y ) < E
trigger: x = 0 → q0 = 0.
12 lines of simple Python code (examples/d5.py):
specification of integration parameters
Results:
Regular particles are trapped in the well → the initial "mixed state" splits
[Figure: N(t)/N(0) versus t for E = 0.1 and E = 0.9]
Example: Hill problem
Outline
1 History
4 Toolbox
5 PyCUDA&PyOpenCL
Overview
Problem:
Toy model of escape from a star cluster: escape of a star from the potential of a point rotating star cluster of mass M_c and a point galaxy core of mass M_g ≫ M_c
Watched values (trigger):
N(t) – number of particles, remaining in cluster at time t
"Potential" in cluster frame of reference (tidal approximation):
U_{Hill} = -3\omega^2 x^2 - \frac{G M_c}{r^2}
"Critical" energy: E_{cr} = E_S = -4.5\omega^2

Potential:
[Figure: Hill curves in the (x, y) plane]
Calculation setup:
”Simple ensemble”
uniform initial distribution of N = 10240 particles in |x| < r_t ∩ U(x, y) < E
ω = 1/√3 → r_t = 1
trigger: |x| − r_t = 0 → abs(q0) - 1. = 0.
12 lines of simple Python code (examples/hill_plain.py):
Results:
Trapping of regular particles (some tricky physics here):
[Figure: N(t) versus number of time steps n_t for E = −1.3, E = −0.8 and E = −0.3]
Example: Hill problem, N–body version
Overview
Problem:
Simplified model of escape from a star cluster: escape of a star from the potential of a rotating star cluster with total mass M_c and the point potential of a galaxy core with mass M_g ≫ M_c (2D)
Watched values:
Configuration of cluster
Potential of galaxy core in cluster frame of reference (tidal approximation):
U_{HillNB} = -3\omega^2 x^2
”Toy” Hill model vs N–body Hill model:

Calculation setup:
N–body ensemble
2D (z = 0) initial distribution of N = 10240 particles inside a circle of radius R with zero initial velocities
14 lines of simple Python code (examples/hill_nbody.py):
M_c = 1, R = 200, ω = 1/√3
Results: cluster configuration
[Figure: cluster configuration in the (x, y) plane at steps 201, 401, 601, 801, 1001 and 1201]
Performance results
OpenSUSE 11.2, GCC 4.4, CUDA 3.0. AMD Athlon X2 4400+ (2.3GHz) / Intel Core2Duo E8500 (3.16GHz), Nvidia Geforce 260 GTX. Not as good as it could be: subject to improvement.
Estimation: ∼ 1 TFLOPs on 2x recent Fermi graphic processors
[Figures: speed–up versus N and GFlop/s versus number of particles; legend entries: OpenMP, SSE optimized, CUDA, GTX260 DP N–body, GTX260 DP "simple" ensemble]
GPU computing prospects
Yesterday:
uniform programming with OpenCL: no need to care about concrete implementation
desktop supercomputers (full ATX form–factor):
Nvidia Tesla C1060 x4: ∼ 300 GFLOPs / 4 TFLOPs, Windows & Linux 32/64–bit support
ATI FireStream x4: ∼ 960 GFLOPs / 4.8 TFLOPs, Windows & Linux 32/64–bit support
Today:
CUDA 3.2 → C++: classes, namespaces, default parameters, operator overloading
Nvidia Tesla C2050/2070 x4:
∼ 2 TFLOPs / 4 TFLOPs
concurrent kernel execution
∼ 8x in GFLOPs, ∼ 6x in GFLOPs/$, ∼ 5x in GFLOPs/W vs four Intel Xeon X5550 (85 GFLOPs / 73 GFLOPs)
Tianhe-1-A, Nebulae, Tsubame-2: №1, 3, 4 supercomputers in Top500
ATI FireStream 9350/9370 x4:
∼ 2 TFLOPs / 8 TFLOPs
stable double–precision support (12 August 2010)
LOEWE–CSC (University of Frankfurt): №22 in Top500
Tomorrow:
OpenCL 1.2 (?) → matrix and "field" complex and real types
New libraries: GPU programming as simple as CPU programming
Nvidia Geforce 580 GTX: ∼ 0.75 TFLOPs / 1.5 TFLOPs
ATI Radeon 6950 "Cayman": ∼ 0.75 TFLOPs / 3 TFLOPs
