
Webinar 20130404

Molecular Simulation with GROMACS on CUDA GPUs
Erik Lindahl

GROMACS is used on a wide range of resources
We're comfortably on the single-microsecond scale today
Larger machines often mean larger systems, not necessarily longer simulations

Why use GPUs?

Throughput:
Sampling
Free energy
Cost efficiency
Power efficiency
Desktop simulation
Upgrade old machines
Low-end clusters

Performance:
Longer simulations
Parallel GPU simulation using Infiniband
High-end efficiency by using fewer nodes
Reach timescales not possible with CPUs

Many GPU programs today

Caveat emptor:
It is much easier to get a reference problem/algorithm to scale,
i.e., you see much better relative scaling before introducing any optimization on the CPU side
When comparing programs:
What matters is absolute performance (ns/day), not the relative speedup!

Gromacs-4.5 with OpenMM

Previous version - what was the limitation?
Gromacs ran entirely on the CPU, acting as a fancy interface
The actual simulation ran entirely on the GPU using OpenMM kernels
Only a few select algorithms worked
Multi-CPU sometimes beat GPU performance...

Why don't we use the CPU too?

CPU: 0.5-1 TFLOP
Random memory access OK (not great)
Great for complex, latency-sensitive stuff (domain decomposition, etc.)

GPU: ~2 TFLOP
Random memory access won't work
Great for throughput

Gromacs-4.6 next-generation GPU implementation: programming model

Domain decomposition with dynamic load balancing
One MPI rank per domain; each rank runs N OpenMP (PME) threads on the CPU
Each rank drives one GPU context
Load is balanced both between domains and between the CPU and GPU parts of the step (launch sketch below)
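A launch sketch of this model (the MPI binary name, rank count and thread count are illustrative assumptions, not fixed requirements):

  # One node with 4 GPUs: 4 domains, 1 MPI rank per GPU, 6 OpenMP threads per rank
  mpirun -np 4 mdrun_mpi -ntomp 6 -gpu_id 0123

The -gpu_id string maps the ranks on a node to GPU IDs, so ranks 0-3 use GPUs 0-3 here.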

Heterogeneous CPU-GPU acceleration in GROMACS-4.6

Wallclock time for an MD step: ~0.5 ms if we want to simulate 1 μs/day
We cannot afford to lose all the previous acceleration tricks!

CPU trick 1: all-bond constraints

Δt limited by fast motions: ~1 fs
SHAKE (iterative, slow): 2 fs
Remove bond vibrations
Problematic in parallel (won't work)
Compromise: constrain H-bonds only: ~1.4 fs

GROMACS (LINCS): LINear Constraint Solver

Approximate matrix inversion expansion
Fast & stable - much better than SHAKE
Non-iterative
Enables 2-3 fs timesteps
Parallel: P-LINCS (from Gromacs 4.0)

LINCS (positions at t=1 and t=2):
A) Move without constraints
B) Project out motion along the bonds
C) Correct for rotational extension of the bond
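For reference, a simplified sketch of the projection behind LINCS (notation loosely following Hess et al., 1997; the diagonal scaling and the final rotation correction are omitted):

\[
\mathbf{r}_{n+1} = (\mathbf{I} - \mathbf{T}\,\mathbf{B}_n)\,\mathbf{r}^{\mathrm{unc}} + \mathbf{T}\,\mathbf{d},
\qquad
\mathbf{T} = \mathbf{M}^{-1}\mathbf{B}_n^{T}\,(\mathbf{B}_n \mathbf{M}^{-1}\mathbf{B}_n^{T})^{-1}
\]

Here \(\mathbf{r}^{\mathrm{unc}}\) is the unconstrained update, \(\mathbf{B}_n\) holds the normalized bond directions, \(\mathbf{M}\) is the diagonal mass matrix and \(\mathbf{d}\) the constraint lengths. The "approximate matrix inversion expansion" replaces the inverse with a short truncated series, \((\mathbf{I}-\mathbf{A}_n)^{-1} \approx \mathbf{I} + \mathbf{A}_n + \mathbf{A}_n^2 + \dots + \mathbf{A}_n^K\), which is what makes the method non-iterative.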

CPU trick 2: Virtual sites

The next-fastest motions are H-angle vibrations and rotations of CH3/NH2 groups
Try to remove them:
Ideal H positions are constructed from the heavy atoms
CH3/NH2 groups are made rigid
Calculate forces, then project them back onto the heavy atoms
Integrate only the heavy-atom positions, reconstruct the Hs
Enables 5 fs timesteps! (setup sketch below)
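A minimal setup sketch: in GROMACS, hydrogen virtual sites are generated when the topology is built (the file names are placeholders):

  # replace hydrogens by virtual sites when building the topology
  pdb2gmx -f protein.pdb -o conf.gro -p topol.top -vsite hydrogens

combined with a 5 fs integration step (dt = 0.005) in the mdp file.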

[Figure: virtual-site construction types (2-, 3- and 4-atom constructions: 1-a, 3fd, 3fad, 3out, 4fd), comparing interactions with degrees of freedom]

[Figure excerpt on domain decomposition: the eighth-shell scheme. Cells 1-7 communicate coordinates to cell 0; the zones that must be communicated, within the cut-off radius rc, are dashed. Bonded interactions between charge groups are assigned to a single processor by taking the smallest x, y and z coordinates of the charge groups involved, which only requires the extra communicated volumes shown.]
CPU trick 3: Non-rectangular cells & decomposition

Load balancing works for arbitrary triclinic cells
Lysozyme, 25k atoms in a rhombic dodecahedron (36k atoms would be needed in a cubic cell)
All these tricks now work fine with GPUs in GROMACS-4.6!


From neighbor lists to cluster pair lists in GROMACS-4.6

x,y gridding, z sort, z binning: full x,y,z gridding
Organize atoms as clusters (tiles) and build a cluster pair list with all-vs-all interactions within each pair of clusters
[Figure: cluster pair interaction matrix]

Tiling circles is difficult

Need a lot of cubes to cover a sphere
Interactions outside the cutoff should be 0.0
(group cutoff vs. Verlet cutoff)
GROMACS-4.6 calculates a large enough buffer zone so no interactions are missed
Optimize nstlist for performance - no need to worry about missing any interactions with Verlet! (mdp sketch below)
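As an illustration of that last point, a minimal mdp fragment (option names as in the GROMACS 4.6 era; the values are just examples, check the mdp documentation for your version):

  cutoff-scheme       = Verlet
  nstlist             = 20       ; with Verlet this is purely a performance knob
  verlet-buffer-drift = 0.005    ; target energy drift; the pair-list buffer is derived from it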

Tixel algorithm work-efficiency

8x8x8 tixels compared to a non-performance-optimized Verlet scheme
[Bar chart, work-efficiency 0-0.9; legend: Verlet / tixel pruned / tixel non-pruned:
rc=1.5, rl=1.6: 0.36 / 0.58 / 0.82
rc=1.2, rl=1.3: 0.29 / 0.52 / 0.75
rc=0.9, rl=1.0: 0.21 / 0.42 / 0.73]

Highly memory-efficient algorithm:

Can handle 20-40 million atoms with 2-3 GB memory
Even cheap consumer cards will get you a long way

PME weak scaling

Xeon X5650 (3 threads) + C2075 per process
[Plot: iteration time per 1000 atoms (ms/step, 0-0.35) vs. system size per GPU (1,500 to ~3 million atoms), for the 1xC2075 CUDA force kernel alone and the total CPU time with 1, 2 and 4 C2075 GPUs. Annotations: ~480 μs/step at 1,500 atoms and ~700 μs/step at 6,000 atoms.]
The complete time step includes the kernel, h2d/d2h transfers, CPU constraints, CPU PME, CPU integration, and OpenMP & MPI overhead

Example performance: systems with ~24,000 atoms, 2 fs time steps, NPT

[Bar charts, ns/day (0-300):
Amber, DHFR: CPU (96 CPU cores); GPU (1xGTX680); GPU (4xGTX680)
Gromacs, RNAse: CPU (6 cores); CPU (2x8 cores); 6 CPU cores + 1xK20c GPU; 6 CPU cores + 1xGTX680 GPU; plus the same setups with dodecahedron box + virtual sites (5 fs): 6 CPU cores; 2x8 CPU cores; 6 cores + 1xK20c; 6 cores + 1xGTX680]
The Villin headpiece

~8,000 atoms, 5 fs steps, explicit solvent, triclinic box, PME electrostatics
[Bar chart, ns/day (0-1200): i7 3930K (GMX 4.5); i7 3930K (GMX 4.6); i7 3930K + GTX680; E5-2690 + GTX Titan]
2,546 FPS (beat that, Battlefield 4)

GLIC: Ion channel

Membrane protein, 150,000 atoms
Running on a simple desktop!
[Bar chart, ns/day (0-40): i7 3930K (GMX 4.5); i7 3930K (GMX 4.6); i7 3930K + GTX680; E5-2690 + GTX Titan]

Strong scaling of reaction-field and PME

1.5M-atom water box, RF cutoff = 0.9 nm, PME auto-tuned cutoff
[Log-log plot: performance (ns/day, 0.1-100) vs. number of processes/GPUs (1-100+), for RF and PME against their linear-scaling references]

Challenge: GROMACS has very short iteration times, which puts hard requirements on latency and bandwidth
Small systems often work best using only a single GPU!

GROMACS 4.6 extreme scaling

Scaling to 130 atoms/core: ADH protein, 134k atoms, PME, rc >= 0.9
[Log-log plot: ns/day (1-1000) vs. number of sockets (CPU or CPU+GPU, 1-64), for Cray XK6/X2090, XK7/K20X, XK6 CPU only, and XE6 CPU only]

Using GROMACS
with GPUs in practice

Compiling GROMACS with CUDA

Make sure the CUDA driver is installed
Make sure the CUDA SDK is in /usr/local/cuda
Use the default GROMACS distribution
Just run cmake and we will detect CUDA automatically and use it (build sketch below)

gcc-4.7 works great as a compiler
On Macs, you want to use icc (commercial)
Longer Mac story: Clang does not support OpenMP, which gcc does. However, the current gcc versions for Macs do not support AVX on the CPU. icc supports both!
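A minimal configure-and-build sketch (the paths and parallel build width are assumptions for illustration; GMX_GPU is the GROMACS CMake switch for CUDA acceleration):

  # from an unpacked GROMACS 4.6 source tree
  mkdir build && cd build
  cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
  make -j 8
  sudo make install   # installs under /usr/local/gromacs by default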

Using GPUs in practice

In your mdp file:

  cutoff-scheme   = Verlet
  nstlist         = 10          ; likely 10-50
  coulombtype     = pme         ; or reaction-field
  vdw-type        = cut-off
  nstcalcenergy   = -1          ; only when writing edr

The Verlet cutoff-scheme is more accurate
Necessary for GPUs in GROMACS
Use the -testverlet mdrun option to force it with old tpr files (example below)
Slower on a single CPU, but scales well on CPUs too!
The shift modifier is applied to both Coulomb and VdW by default on GPUs - change with coulomb-modifier / vdw-modifier
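A usage sketch for a single node with one GPU (the thread counts and file name are illustrative; drop -testverlet for a tpr already generated with the Verlet scheme):

  # 1 thread-MPI rank, 6 OpenMP threads, non-bonded work on GPU 0
  mdrun -ntmpi 1 -ntomp 6 -nb gpu -gpu_id 0 -testverlet -deffnm md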

Load balancing

  rcoulomb        = 1.0
  fourierspacing  = 0.12

If we increase/decrease the Coulomb direct-space cutoff and the reciprocal-space PME grid spacing by the same factor, we maintain accuracy
... but we move work between the CPU & GPU! (example below)
By default, GROMACS-4.6 does this automatically at the start of each run - you will see diagnostic output
GROMACS excels when you combine a fairly fast CPU and GPU. Currently, this means Intel CPUs.
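For example (the numbers are purely illustrative of what the automatic tuning might choose):

  rcoulomb        = 1.2     ; direct-space cutoff scaled up by 1.2x -> more work for the GPU
  fourierspacing  = 0.144   ; grid spacing scaled by the same 1.2x -> coarser PME grid, less CPU work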

Demo

Acknowledgments

GROMACS: Berk Hess, David van der Spoel, Per Larsson, Mark Abraham
Gromacs-GPU: Szilard Pall, Berk Hess, Rossen Apostolov
Multi-threaded PME: Roland Schulz, Berk Hess
Nvidia: Mark Berger, Scott LeGrand, Duncan Poole, and others!

Test Drive K20 GPUs!

Questions?

Run GROMACS on a Tesla K20 GPU today
Devang Sachdev - NVIDIA
dsachdev@nvidia.com
@DevangSachdev

Contact us

Experience the acceleration: sign up for a FREE GPU Test Drive on remotely hosted clusters
www.nvidia.com/GPUTestDrive

GROMACS questions: check www.gromacs.org or the gmx-users@gromacs.org mailing list

Stream other webinars from GTC Express:
http://www.gputechconf.com/page/gtc-express-webinar.html

Register for the next GTC Express webinar

Molecular Shape Searching on GPUs
Paul Hawkins, Applications Science Group Leader, OpenEye
Wednesday, May 22, 2013, 9:00 AM PDT
Register at www.gputechconf.com/gtcexpress
