
Webinar 20130404

Molecular Simulation with GROMACS on CUDA GPUs
Erik Lindahl

GROMACS is used on a wide range of resources
We're comfortably on the single-microsecond scale today
Larger machines often mean larger systems, not necessarily longer simulations

Why use GPUs?

Throughput:
Sampling
Free energy
Cost efficiency
Power efficiency
Desktop simulation
Upgrade old machines
Low-end clusters

Performance:
Longer simulations
Parallel GPU simulation using Infiniband
High-end efficiency by using fewer nodes
Reach timescales not possible with CPUs

Many GPU programs today

Caveat emptor:
It is much easier to get a reference problem/algorithm to scale,
i.e., you see much better relative scaling before introducing any optimization on the CPU side
When comparing programs:
What matters is absolute performance (ns/day), not the relative speedup!

Gromacs-4.5 with OpenMM

Previous version - what was the limitation?
Gromacs ran entirely on the CPU, acting as a fancy interface
The actual simulation ran entirely on the GPU using OpenMM kernels
Only a few select algorithms worked
Multi-CPU sometimes beat GPU performance...

Why don't we use the CPU too?

CPU: 0.5-1 TFLOP
Random memory access OK (not great)
Great for complex, latency-sensitive stuff (domain decomposition, etc.)

GPU: ~2 TFLOP
Random memory access won't work
Great for throughput

Gromacs-4.6 next-generation GPU implementation: programming model

Domain decomposition with dynamic load balancing
One MPI rank per domain; each rank runs N OpenMP (PME) threads on the CPU
Each rank drives one GPU context
Load is balanced both between domains and between the CPU and GPU parts of the step (launch sketch below)
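A launch sketch of this model (the MPI binary name, rank count and thread count are illustrative assumptions, not fixed requirements):

  # One node with 4 GPUs: 4 domains, 1 MPI rank per GPU, 6 OpenMP threads per rank
  mpirun -np 4 mdrun_mpi -ntomp 6 -gpu_id 0123

The -gpu_id string maps the ranks on a node to GPU IDs, so ranks 0-3 use GPUs 0-3 here.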

Heterogeneous CPU-GPU acceleration in GROMACS-4.6

Wallclock time for an MD step: ~0.5 ms if we want to simulate 1 μs/day
We cannot afford to lose all the previous acceleration tricks!

CPU trick 1: all-bond constraints

Δt limited by fast motions: ~1 fs
SHAKE (iterative, slow): 2 fs
Remove bond vibrations
Problematic in parallel (won't work)
Compromise: constrain H-bonds only: ~1.4 fs

GROMACS (LINCS): LINear Constraint Solver

Approximate matrix inversion expansion
Fast & stable - much better than SHAKE
Non-iterative
Enables 2-3 fs timesteps
Parallel: P-LINCS (from Gromacs 4.0)

LINCS (positions at t=1 and t=2):
A) Move without constraints
B) Project out motion along the bonds
C) Correct for rotational extension of the bond
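For reference, a simplified sketch of the projection behind LINCS (notation loosely following Hess et al., 1997; the diagonal scaling and the final rotation correction are omitted):

\[
\mathbf{r}_{n+1} = (\mathbf{I} - \mathbf{T}\,\mathbf{B}_n)\,\mathbf{r}^{\mathrm{unc}} + \mathbf{T}\,\mathbf{d},
\qquad
\mathbf{T} = \mathbf{M}^{-1}\mathbf{B}_n^{T}\,(\mathbf{B}_n \mathbf{M}^{-1}\mathbf{B}_n^{T})^{-1}
\]

Here \(\mathbf{r}^{\mathrm{unc}}\) is the unconstrained update, \(\mathbf{B}_n\) holds the normalized bond directions, \(\mathbf{M}\) is the diagonal mass matrix and \(\mathbf{d}\) the constraint lengths. The "approximate matrix inversion expansion" replaces the inverse with a short truncated series, \((\mathbf{I}-\mathbf{A}_n)^{-1} \approx \mathbf{I} + \mathbf{A}_n + \mathbf{A}_n^2 + \dots + \mathbf{A}_n^K\), which is what makes the method non-iterative.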

CPU trick 2: Virtual sites

The next-fastest motions are H-angle vibrations and rotations of CH3/NH2 groups
Try to remove them:
Ideal H positions are constructed from the heavy atoms
CH3/NH2 groups are made rigid
Calculate forces, then project them back onto the heavy atoms
Integrate only the heavy-atom positions, reconstruct the Hs
Enables 5 fs timesteps! (setup sketch below)
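A minimal setup sketch: in GROMACS, hydrogen virtual sites are generated when the topology is built (the file names are placeholders):

  # replace hydrogens by virtual sites when building the topology
  pdb2gmx -f protein.pdb -o conf.gro -p topol.top -vsite hydrogens

combined with a 5 fs integration step (dt = 0.005) in the mdp file.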

[Figure: virtual-site construction types (2-, 3- and 4-atom constructions: 1-a, 3fd, 3fad, 3out, 4fd), comparing interactions with degrees of freedom]

[Figure excerpt on domain decomposition: the eighth-shell scheme. Cells 1-7 communicate coordinates to cell 0; the zones that must be communicated, within the cut-off radius rc, are dashed. Bonded interactions between charge groups are assigned to a single processor by taking the smallest x, y and z coordinates of the charge groups involved, which only requires the extra communicated volumes shown.]
CPU trick 3: Non-rectangular cells & decomposition

Load balancing works for arbitrary triclinic cells
Lysozyme, 25k atoms in a rhombic dodecahedron (36k atoms would be needed in a cubic cell)
All these tricks now work fine with GPUs in GROMACS-4.6!


From neighbor lists to cluster pair lists in GROMACS-4.6

x,y gridding, z sort, z binning: full x,y,z gridding
Organize atoms as clusters (tiles) and build a cluster pair list with all-vs-all interactions within each pair of clusters
[Figure: cluster pair interaction matrix]

Tiling circles is difficult

Need a lot of cubes to cover a sphere
Interactions outside the cutoff should be 0.0
(group cutoff vs. Verlet cutoff)
GROMACS-4.6 calculates a large enough buffer zone so no interactions are missed
Optimize nstlist for performance - no need to worry about missing any interactions with Verlet! (mdp sketch below)
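As an illustration of that last point, a minimal mdp fragment (option names as in the GROMACS 4.6 era; the values are just examples, check the mdp documentation for your version):

  cutoff-scheme       = Verlet
  nstlist             = 20       ; with Verlet this is purely a performance knob
  verlet-buffer-drift = 0.005    ; target energy drift; the pair-list buffer is derived from it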

Tixel algorithm work-efficiency

8x8x8 tixels compared to a non-performance-optimized Verlet scheme
[Bar chart, work-efficiency 0-0.9; legend: Verlet / tixel pruned / tixel non-pruned:
rc=1.5, rl=1.6: 0.36 / 0.58 / 0.82
rc=1.2, rl=1.3: 0.29 / 0.52 / 0.75
rc=0.9, rl=1.0: 0.21 / 0.42 / 0.73]

Highly memory-efficient algorithm:

Can handle 20-40 million atoms with 2-3 GB memory
Even cheap consumer cards will get you a long way

PME weak scaling

Xeon X5650 (3 threads) + C2075 per process
[Plot: iteration time per 1000 atoms (ms/step, 0-0.35) vs. system size per GPU (1,500 to ~3 million atoms), for the 1xC2075 CUDA force kernel alone and the total CPU time with 1, 2 and 4 C2075 GPUs. Annotations: ~480 μs/step at 1,500 atoms and ~700 μs/step at 6,000 atoms.]
The complete time step includes the kernel, h2d/d2h transfers, CPU constraints, CPU PME, CPU integration, and OpenMP & MPI overhead

Example performance: systems with ~24,000 atoms, 2 fs time steps, NPT

[Bar charts, ns/day (0-300):
Amber, DHFR: CPU (96 CPU cores); GPU (1xGTX680); GPU (4xGTX680)
Gromacs, RNAse: CPU (6 cores); CPU (2x8 cores); 6 CPU cores + 1xK20c GPU; 6 CPU cores + 1xGTX680 GPU; plus the same setups with dodecahedron box + virtual sites (5 fs): 6 CPU cores; 2x8 CPU cores; 6 cores + 1xK20c; 6 cores + 1xGTX680]
The Villin headpiece

~8,000 atoms, 5 fs steps, explicit solvent, triclinic box, PME electrostatics
[Bar chart, ns/day (0-1200): i7 3930K (GMX 4.5); i7 3930K (GMX 4.6); i7 3930K + GTX680; E5-2690 + GTX Titan]
2,546 FPS (beat that, Battlefield 4)

GLIC: Ion channel

Membrane protein, 150,000 atoms
Running on a simple desktop!
[Bar chart, ns/day (0-40): i7 3930K (GMX 4.5); i7 3930K (GMX 4.6); i7 3930K + GTX680; E5-2690 + GTX Titan]

Strong scaling of reaction-field and PME

1.5M-atom water box, RF cutoff = 0.9 nm, PME auto-tuned cutoff
[Log-log plot: performance (ns/day, 0.1-100) vs. number of processes/GPUs (1-100+), for RF and PME against their linear-scaling references]

Challenge: GROMACS has very short iteration times, which puts hard requirements on latency and bandwidth
Small systems often work best using only a single GPU!

GROMACS 4.6 extreme scaling

Scaling to 130 atoms/core: ADH protein, 134k atoms, PME, rc >= 0.9
[Log-log plot: ns/day (1-1000) vs. number of sockets (CPU or CPU+GPU, 1-64), for Cray XK6/X2090, XK7/K20X, XK6 CPU only, and XE6 CPU only]

Using GROMACS
with GPUs in practice

Compiling GROMACS with CUDA

Make sure the CUDA driver is installed
Make sure the CUDA SDK is in /usr/local/cuda
Use the default GROMACS distribution
Just run cmake and we will detect CUDA automatically and use it (build sketch below)

gcc-4.7 works great as a compiler
On Macs, you want to use icc (commercial)
Longer Mac story: Clang does not support OpenMP, which gcc does. However, the current gcc versions for Macs do not support AVX on the CPU. icc supports both!
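A minimal configure-and-build sketch (the paths and parallel build width are assumptions for illustration; GMX_GPU is the GROMACS CMake switch for CUDA acceleration):

  # from an unpacked GROMACS 4.6 source tree
  mkdir build && cd build
  cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
  make -j 8
  sudo make install   # installs under /usr/local/gromacs by default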

Using GPUs in practice

In your mdp file:

  cutoff-scheme   = Verlet
  nstlist         = 10          ; likely 10-50
  coulombtype     = pme         ; or reaction-field
  vdw-type        = cut-off
  nstcalcenergy   = -1          ; only when writing edr

The Verlet cutoff-scheme is more accurate
Necessary for GPUs in GROMACS
Use the -testverlet mdrun option to force it with old tpr files (example below)
Slower on a single CPU, but scales well on CPUs too!
The shift modifier is applied to both Coulomb and VdW by default on GPUs - change with coulomb-modifier / vdw-modifier
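A usage sketch for a single node with one GPU (the thread counts and file name are illustrative; drop -testverlet for a tpr already generated with the Verlet scheme):

  # 1 thread-MPI rank, 6 OpenMP threads, non-bonded work on GPU 0
  mdrun -ntmpi 1 -ntomp 6 -nb gpu -gpu_id 0 -testverlet -deffnm md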

Load balancing

  rcoulomb        = 1.0
  fourierspacing  = 0.12

If we increase/decrease the Coulomb direct-space cutoff and the reciprocal-space PME grid spacing by the same factor, we maintain accuracy
... but we move work between the CPU & GPU! (example below)
By default, GROMACS-4.6 does this automatically at the start of each run - you will see diagnostic output
GROMACS excels when you combine a fairly fast CPU and GPU. Currently, this means Intel CPUs.
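For example (the numbers are purely illustrative of what the automatic tuning might choose):

  rcoulomb        = 1.2     ; direct-space cutoff scaled up by 1.2x -> more work for the GPU
  fourierspacing  = 0.144   ; grid spacing scaled by the same 1.2x -> coarser PME grid, less CPU work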

Demo

Acknowledgments

GROMACS: Berk Hess, David van der Spoel, Per Larsson, Mark Abraham
Gromacs-GPU: Szilard Pall, Berk Hess, Rossen Apostolov
Multi-threaded PME: Roland Schulz, Berk Hess
Nvidia: Mark Berger, Scott LeGrand, Duncan Poole, and others!

Test Drive K20 GPUs!

Questions?

Run GROMACS on a Tesla K20 GPU today
Devang Sachdev - NVIDIA
dsachdev@nvidia.com
@DevangSachdev

Contact us

Experience the acceleration: sign up for a FREE GPU Test Drive on remotely hosted clusters
www.nvidia.com/GPUTestDrive

GROMACS questions: check www.gromacs.org or the gmx-users@gromacs.org mailing list

Stream other webinars from GTC Express:
http://www.gputechconf.com/page/gtc-express-webinar.html

Register for the next GTC Express webinar

Molecular Shape Searching on GPUs
Paul Hawkins, Applications Science Group Leader, OpenEye
Wednesday, May 22, 2013, 9:00 AM PDT
Register at www.gputechconf.com/gtcexpress
