GROMACS is used on a wide range of resources.
We're comfortably on the single-microsecond scale today.
Performance
Longer simulations
Parallel GPU simulation using InfiniBand
High-end efficiency by using fewer nodes
Reach timescales not possible with CPUs
CPU: 0.5-1 TFLOP, random memory access OK (not great)
GPU: ~2 TFLOP, random memory access won't work, great for throughput
Gromacs-4.6 next-generation GPU implementation: Programming model
Domain decomposition with dynamic load balancing
[Diagram: each domain is one MPI rank running N OpenMP threads (separate PME ranks possible) on the CPU, with one GPU context per rank; load is balanced between domains and between CPU and GPU.]
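As a rough sketch of how this model maps onto a launch command (the node layout, rank count, and thread count below are assumptions, not recommendations from this talk):

# 2 MPI ranks, one per GPU, each running 8 OpenMP threads (hypothetical 2-GPU node)
mpirun -np 2 mdrun_mpi -ntomp 8 -gpu_id 01

Each rank then owns one domain and one GPU context, matching the diagram above.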
GROMACS removes fast degrees of freedom:
[Figure: the LINCS constraint algorithm (correction steps t=1, t=2) and virtual interaction site constructions (3fd, 3fad, 3out, 4fd); interaction sites vs. degrees of freedom.]
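Both are enabled at the parameter/topology level; a minimal sketch of the relevant settings (values shown are common choices, not taken from this talk):

constraints           = all-bonds   ; replace bond vibrations with LINCS constraints
constraint-algorithm  = lincs
; virtual hydrogen sites are generated when building the topology, e.g.:
; pdb2gmx -vsite hydrogen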
Eighth-sphere (8th-shell) communication in domain decomposition:

FIG. 2: The domain decomposition cells (1-7) that communicate coordinates to cell 0. Cell 2 is hidden below cell 7.
FIG. 3: The zones to communicate to the processor of cell 0; see the text for details. The zones that need to be communicated to cell 0 are dashed; rc is the cut-off radius.

To ensure that all bonded interactions between charge groups can be assigned to a processor, it is sufficient that the charge groups within a sphere of radius rc are present on at least one processor for every possible center of the sphere. In Fig. 3 this means that we also need to communicate volumes B and C. When no bonded interactions are present between charge groups, these volumes are not communicated. For 2D decomposition, B and C are the only extra volumes that need to be communicated. Bonded interactions are distributed over the processors by finding the smallest x, y and z coordinate of the charge groups involved and assigning the interaction to the corresponding processor. For 3D domain decomposition the picture becomes more complicated, but the procedure is analogous.
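In practice the decomposition grid, dynamic load balancing, and the bonded communication distance can be steered from the mdrun command line; a minimal sketch with illustrative (not tuned) values:

# 4x2x1 domain decomposition, dynamic load balancing on,
# 2 separate PME ranks, explicit bonded-interaction distance (nm)
mpirun -np 10 mdrun_mpi -dd 4 2 1 -dlb yes -npme 2 -rdd 1.2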
Cluster pair list: organize particles as tiles with all-vs-all interactions within each cluster pair.
[Diagram: interaction matrix grouped into tiles of cluster-cluster interactions.]
[Chart: Verlet cut-off scheme; comparison of the Verlet, pruned tixel, and non-pruned tixel pair-list schemes at rc=0.9, 1.2, 1.5 nm (rl=1.0, 1.3, 1.6 nm).]
Gromacs: RNAse benchmark
[Chart: performance in ns/day for CPU-only runs (6 cores; 2*8 cores) and single-node configurations: i7 3930K (GMX4.5), i7 3930K (GMX4.6), i7 3930K + GTX 680, E5-2690 + GTX Titan.]
Strong scaling of reaction-field and PME
1.5M atom waterbox, RF cutoff = 0.9 nm, PME auto-tuned cutoff
[Plot: performance (ns/day, log scale) vs. number of processes/GPUs for RF and PME, each with a linear-scaling reference line.]
Challenge: GROMACS has very short iteration times, which places hard requirements on interconnect latency and bandwidth.
Small systems often work best using only a single GPU!
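For a small system on a single workstation, a one-GPU run could look like the following sketch (thread count and GPU id are assumptions for a typical 6-core desktop, not values from this talk):

# Thread-MPI build: 1 rank, 6 OpenMP threads, non-bonded work on GPU 0
mdrun -ntmpi 1 -ntomp 6 -gpu_id 0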
[Plot: ns/day (log scale) vs. number of sockets (CPU or CPU+GPU, 1-64) for Cray XK6/X2090, XK7/K20X, XK6 CPU only, and XE6 CPU only.]
Using GROMACS with GPUs in practice

cutoff-scheme    = Verlet
nstlist          = 10          ; likely 10-50
coulombtype      = pme         ; or reaction-field
vdwtype          = cut-off
nstcalcenergy    = -1          ; only when writing edr

Load balancing:
rcoulomb         = 1.0
fourierspacing   = 0.12
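With these settings mdrun balances the CPU (PME) and GPU (short-range) load at run time by scaling rcoulomb and the PME grid spacing together, which keeps the electrostatics accuracy while shifting work to the GPU. The tuning is on by default; a sketch of turning it off, e.g. for benchmarking:

# Disable automatic PP-PME (cut-off/grid) load-balancing tuning
mdrun -notunepme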
Demo
Acknowledgments
GROMACS: Berk Hess, David van der Spoel, Per Larsson, Mark Abraham
Gromacs-GPU: Szilard Pall, Berk Hess, Rossen Apostolov
Multi-Threaded PME: Roland Schulz, Berk Hess
Nvidia: Mark Berger, Scott LeGrand, Duncan Poole, and others!
Questions?
Contact us:
GROMACS questions: check www.gromacs.org or the gmx-users@gromacs.org mailing list
Stream other webinars from GTC Express: http://www.gputechconf.com/page/gtc-express-webinar.html