You are on page 1of 1

Python tutorial Introduction to Molecular Dynamics

Navigation simulations with OpenMM


What is Python?
Running Molecular Dynamics (MD) simulations of all possible kinds of molecular systems in a
How do I install Python? correct and efficient way is a non-trivial task. Luckily, there are many different programs and
How do I use Python? packages out there to help you with that. Due to the complexity of the matter, however, these
Read more programs are mostly not exactly straightforward to use. All of them have their strengths and
Tutorial weaknesses and come with their own flavor of addressing the exercise of representing molecules
Introduction to Molecular in silico and propagating their configuration through time. The basic concepts underlying their
Dynamics simulations with design are often very similar, though. We chose the open source toolkit OpenMM to practically
OpenMM introduce you to the art of molecular simulations, because it provides a relatively well accessible
Pre-requirements interface in Python. Although it is not as feature rich out of the box as other solutions, it is
Primer on molecular successfully used in daily research.
structure files
Structure preparation with We will show you here step-by-step how to setup and run a standard NVT MD simulation of a
OpenMM protein in water with OpenMM. Refer to the official website for installation instructions
Simulating in OpenMM (OpenMM is available in the conda package manager) and documentation. We will also need
Molecular Dynamics some kind of molecule viewer to inspect molecular structures. We can recommend VMD for this
exercises purpose (also available via conda).
AlgoSB2021 exercises on
conda install -c omnia -c conda-forge openmm
PCA, TICA, and MSM conda install -c conda-forge vmd
estimation
Hints To completely follow this introduction, you will also need the PDBFixer that has to be installed
Attic separately. You will not need the PDBFixer for the upcoming exercise.

Quick search $ git clone https://github.com/openmm/pdbfixer.git


$ cd pdbfixer
Go $ python setup.py install

Pre-requirements
[1]: import shlex # Optional
import subprocess # Optional

import matplotlib as mpl


import matplotlib.pyplot as plt

[2]: # Matplotlib configuration


mpl.rcParams.update(mpl.rcParamsDefault)
mpl.rc_file('../../matplotlibrc', use_default_template=False)

Primer on molecular structure files


A central element of every molecular simulation is the (atomic) structure of the system of
interest. In a classic MD simulation every atom in the system is represented as a point in R 3

with x-, y -, and z-coordinates. We need to provide these coordinates in a machine (and human)
readable, consistent format to be of any use in a computer simulation. A popular file format,
especially for biochemical systems, is the PDB file format. PDB files (filename extension .pdb)
are text files and can be opened with the text editor of your choice. Many (crystal-, NMR-,
cryoEM-, …) structures that can be used as starting structures in simulations are available in the
RCSB Protein Data Bank (PDB) for everyone to download. Structures in this data base are
identified by a letter code – the PDB-ID. For our example, we use the Structure of the
carbohydrate-recognition domain of human Langerin with the PDB-ID 3P5G. Langerin is an
endocytic pattern recognition receptor important for the human immune system.

Info: If you want to learn more about PDB structure files than we tackle here, check
out this Guide to understanding PDB data.

[3]: # Lets have a look into the PDB file


with open("3p5g.pdb") as pdb_file:
show = 10 # Lines to show
for lc, line in enumerate(pdb_file):
# Print the first lines of the file
if lc < show:
print(line.strip())
print("=" * 80)
print("Total line count: ", lc)

HEADER SUGAR BINDING PROTEIN 08-OCT-10 3P5G


TITLE STRUCTURE OF THE CARBOHYDRATE-RECOGNITION DOMAIN OF HUMAN
TITLE 2 WITH BLOOD GROUP B TRISACCHARIDE (GAL ALPHA1-3(FUC ALPHA1
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: C-TYPE LECTIN DOMAIN FAMILY 4 MEMBER K;
COMPND 3 CHAIN: A, B, C, D;
COMPND 4 FRAGMENT: LANGERIN CRD (UNP RESIDUES 193-328);
COMPND 5 SYNONYM: LANGERIN;
COMPND 6 ENGINEERED: YES
SOURCE MOL_ID: 1;
====================================================================
Total line count: 5922

Each line in the PDB file begins with a record keyword identifying the kind of information
stored in the corresponding line. Records like HEADER, TITLE, or COMPND contain information
that we can safely ignore when we are only interested in the atomic coordinates of the molecule.
Depending on what you want to find out about the structure, however, they can be very useful.
The COMPND statements tells us for example, that the file contains 4 (identical) subunits – so
called chains A, B, C, and D – each consisting of 136 amino acid residues (residues 193 to 328).
We can also get the complete sequence of the protein contained in this file from SEQRES entries.

[4]: # What is the protein sequence?


with open("3p5g.pdb") as pdb_file:
for line in pdb_file:
if line.startswith("SEQRES"):
line = line.split()
if line[2] == "A":
# Only print sequence for chain A
print(*line[4:])

GLN VAL VAL SER GLN GLY TRP LYS TYR PHE LYS GLY ASN
PHE TYR TYR PHE SER LEU ILE PRO LYS THR TRP TYR SER
ALA GLU GLN PHE CYS VAL SER ARG ASN SER HIS LEU THR
SER VAL THR SER GLU SER GLU GLN GLU PHE LEU TYR LYS
THR ALA GLY GLY LEU ILE TYR TRP ILE GLY LEU THR LYS
ALA GLY MET GLU GLY ASP TRP SER TRP VAL ASP ASP THR
PRO PHE ASN LYS VAL GLN SER ALA ARG PHE TRP ILE PRO
GLY GLU PRO ASN ASN ALA GLY ASN ASN GLU HIS CYS GLY
ASN ILE LYS ALA PRO SER LEU GLN ALA TRP ASN ASP ALA
PRO CYS ASP LYS THR PHE LEU PHE ILE CYS LYS ARG PRO
TYR VAL PRO SER GLU PRO

Often a PDB file contains more than only the structure of the main molecule. Crystallographic
structures for example can contain co-crystallised solvent and salt as well as ligands like small
drug like molecules. Record entries like HETNAM or FORMULA may tell you more about what else
is in your file. Our example comes with a sugar ligand, a calcium-ion, and crystal water.

[5]: # What else is in the file?


with open("3p5g.pdb") as pdb_file:
for line in pdb_file:
if line.startswith(("HETNAM", "FORMUL")):
print(line.strip())

HETNAM GLA ALPHA D-GALACTOSE


HETNAM GAL BETA-D-GALACTOSE
HETNAM FUC ALPHA-L-FUCOSE
HETNAM CA CALCIUM ION
FORMUL 5 GLA 3(C6 H12 O6)
FORMUL 5 GAL C6 H12 O6
FORMUL 5 FUC 4(C6 H12 O5)
FORMUL 6 CA 4(CA 2+)
FORMUL 13 HOH *816(H2 O)

Finally, the atomic coordinates we were initially interested in are listed under ATOM records.
From left to right, each ATOM record holds the atom ID, atom name, residue name, chain ID,
residue ID, xyz-coordinates in Å, crystallographic occupancy, crystallographic temperature
factor and the chemical element of a single atom in a strict order. The different parts are identify
by a character position, i.e. the chain ID for example occupies the 22nd character of the ATOM
entry. Besides the xyz-coordinates, for an MD simulation we need especially the atom name and
the residue name column. To represent an atom in such a simulation by suitable force field
parameters, it is not enough to know that we have for example a carbon atom somewhere. It is
required to know instead that it is say the Cα -atom of a glycine. Note, that the first residue in
chain A is GLY198 which means that residues 193 to 197 are missing in this chain.

[6]: with open("3p5g.pdb") as pdb_file:


show = 10
count = 0
# Print the first ATOM entries
for line in pdb_file:
if count >= show:
break
if line.startswith(("ATOM")):
print(line.strip())
count += 1

ATOM 1 N GLY A 198 1.923 33.617 6.889 1.00 29.44


ATOM 2 CA GLY A 198 2.927 34.067 5.942 1.00 33.42
ATOM 3 C GLY A 198 4.064 33.079 5.747 1.00 21.30
ATOM 4 O GLY A 198 5.016 33.355 5.022 1.00 26.76
ATOM 5 N TRP A 199 3.964 31.919 6.391 1.00 19.08
ATOM 6 CA TRP A 199 4.990 30.892 6.261 1.00 19.93
ATOM 7 C TRP A 199 4.987 30.356 4.840 1.00 28.23
ATOM 8 O TRP A 199 3.926 30.117 4.255 1.00 30.68
ATOM 9 CB TRP A 199 4.772 29.756 7.263 1.00 22.11
ATOM 10 CG TRP A 199 4.961 30.161 8.697 1.00 19.73

Atoms not part of the main molecule are usually put into HETATM records of similar structure.
TER additionally marks the end of a chain, which is particularly important for proteins because
the termini are modeled differently than residues within the chain.

[7]: with open("3p5g.pdb") as pdb_file:


show = 34
count = 0
# Print the first HETATM entries
for line in pdb_file:
if count >= show:
break
if line.startswith(("HETATM")):
print(line.strip())
count += 1

HETATM 4215 C1 GLA A 400 35.389 42.564 1.554 1.00 34.53


HETATM 4216 C2 GLA A 400 34.417 41.967 0.543 1.00 38.46
HETATM 4217 C3 GLA A 400 33.765 40.724 1.124 1.00 24.16
HETATM 4218 C4 GLA A 400 34.841 39.740 1.563 1.00 25.80
HETATM 4219 C5 GLA A 400 35.798 40.434 2.520 1.00 33.42
HETATM 4220 C6 GLA A 400 36.930 39.510 2.960 1.00 35.43
HETATM 4221 O2 GLA A 400 33.406 42.903 0.245 1.00 35.44
HETATM 4222 O3 GLA A 400 32.945 40.132 0.148 1.00 34.77
HETATM 4223 O4 GLA A 400 35.543 39.288 0.424 1.00 26.84
HETATM 4224 O5 GLA A 400 36.347 41.582 1.899 1.00 48.07
HETATM 4225 O6 GLA A 400 37.680 40.140 3.977 1.00 26.17
HETATM 4226 C1 GAL A 401 35.691 44.693 5.881 1.00 58.00
HETATM 4227 C2 GAL A 401 34.721 44.101 4.866 1.00 34.37
HETATM 4228 C3 GAL A 401 35.517 43.473 3.723 1.00 26.42
HETATM 4229 C4 GAL A 401 36.531 44.479 3.176 1.00 47.76
HETATM 4230 C5 GAL A 401 37.356 45.105 4.295 1.00 46.54
HETATM 4231 C6 GAL A 401 38.282 46.188 3.748 1.00 66.04
HETATM 4232 O1 GAL A 401 34.968 45.256 6.951 1.00 41.39
HETATM 4233 O2 GAL A 401 33.927 43.149 5.553 1.00 30.54
HETATM 4234 O3 GAL A 401 34.648 43.055 2.689 1.00 32.94
HETATM 4235 O4 GAL A 401 35.864 45.510 2.481 1.00 33.15
HETATM 4236 O5 GAL A 401 36.495 45.674 5.258 1.00 38.71
HETATM 4237 O6 GAL A 401 38.986 46.792 4.811 1.00 54.30
HETATM 4238 C1 FUC A 402 32.629 42.960 4.961 1.00 22.64
HETATM 4239 C2 FUC A 402 31.943 41.857 5.765 1.00 19.14
HETATM 4240 C3 FUC A 402 31.667 42.330 7.178 1.00 15.88
HETATM 4241 C4 FUC A 402 30.867 43.622 7.148 1.00 22.27
HETATM 4242 C5 FUC A 402 31.577 44.638 6.258 1.00 18.90
HETATM 4243 C6 FUC A 402 30.712 45.886 6.131 1.00 32.10
HETATM 4244 O2 FUC A 402 32.798 40.739 5.847 1.00 24.94
HETATM 4245 O3 FUC A 402 30.972 41.311 7.888 1.00 20.82
HETATM 4246 O4 FUC A 402 29.573 43.399 6.633 1.00 25.32
HETATM 4247 O5 FUC A 402 31.822 44.121 4.962 1.00 24.17
HETATM 4248 CA CA A 500 32.105 39.059 7.647 1.00 19.65

To get a better overview over the system, let’s open the PDB file in VMD (start VMD directly
from the Jupyter notebook for a quick view or externally otherwise).

[8]: # Lauch VMD from Jupyter


view = subprocess.Popen(
shlex.split("vmd 3p5g.pdb"),
close_fds=True,
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE
)

We display the protein chains in the NewCartoon representation, colored by SecondaryStructure.


For the Ca-ion we use the VDW representation and color it with a ColorID of our choice. Water
and ligand atoms are shown in the Licorice representation and colored by Name. You can set
these options in the Main > Graphics > Representations... menu. Note, that the
structure does not contain hydrogen atoms.

FIGURE Full structure under the PDB entry 3P5G, showing 4 langerin molecules with calcium
atoms, crystal water and different sugar ligands bound.

Fixing and modifying the structure


For our simulation we now want to extract only chain A including the present protein residues
198 to 325. You could do this in VMD via Main > File > Save Coordinates... but meta
information stored in the file is usually lost this way. You can perform this task manually by just
deleting information you do not need from the text file. The best way in terms of reproducibility
and traceability, however, might be to use Python for this. Note, that we have to issues to deal
with here:

We need to pay separate attention in our case to residues that have atom entries with split
occupancies, a leftover of the crystal structure prediction, like for SER277 shown below.
For this residue two alternative sets of atoms are given, indicated by a letter at the 17th
character of the ATOM entry.
We want to reconstruct the missing residues 193 to 197 and 326 to 328. In particular we
want to fix missing terminal atoms in this way.

[9]: # Check structure for alternative atom positions


with open("3p5g.pdb") as pdb_file:
for line in pdb_file:
if not line.startswith("ATOM"):
continue
if not line[21] == "A": # Chain A?
continue
if line[16] != " ": # Alternativ atoms positions?
print(line.strip())
ATOM 658 CA ASER A 277 29.047 24.388 2.751 0.50 19.39
ATOM 659 CA BSER A 277 29.040 24.387 2.738 0.50 18.11
ATOM 662 CB ASER A 277 28.811 24.669 4.239 0.50 18.59
ATOM 663 CB BSER A 277 28.762 24.666 4.218 0.50 17.85
ATOM 664 OG ASER A 277 27.712 23.928 4.738 0.50 15.78
ATOM 665 OG BSER A 277 29.938 24.530 4.996 0.50 17.39

FIGURE SER277 has two alternative sets of atom positions in the PDB File, one displayed
transparentl

FIGURE The terminal PRO325 is missing a terminal oxygen atom.

Let’s tackle the ambiguous atom positions first. We can fix this kind of shortcoming easily while
extracting atom information from the file.

[10]: allowed_res_ids = list(range(198, 326)) + [500]


# Protein 198 to 325 and Ca

# Copy desired information from one file to another


with open("3p5g.pdb") as in_file:
with open("chain_A.pdb", "w") as out_file:
for line in in_file:
if line.startswith(("ATOM", "TER", "HETATM")):
# check for atomic entries
if not line[21] == "A":
# Chain A?
continue
if not int(line[22:26]) in allowed_res_ids:
# Keep residue?
continue
if line[16] == "B": # Alternative set of atoms?
# Throw away location B
continue
if line[16] == "A": # Alternative set of atoms?
# Keep location A
line = line[:16] + " " + line[17:]
out_file.write(line)

The second task, adding missing residues and atoms, can not be so easily solved by ourselves.
We will use a tool from the OpenMM cosmos for this – the PDBFixer. This tool can be used
through a Python interface. To save the fixed PDB file back to disk we will also need the PDB
file handler from the OpenMM package. We will discuss the file handling by OpenMM more in
the next section.

[11]: import pdbfixer


import simtk.openmm.app as app # OpenMM interface

The fixing work the following way: We create a PDBFixes Python object from our pre-
processed PDB file. Than we let the fixer find missing residues, which it does by analysing the
header of the PDB file (which is why we wanted to let it be in there). Finally we can fix the
missing atoms and save to a new PDB file.

[12]: fixer = pdbfixer.PDBFixer("chain_A.pdb") # Load file


fixer.findMissingResidues() # Identify missing residue
fixer.findMissingAtoms() # Identify missing atoms
print(fixer.missingResidues)
{(0, 0): ['GLN', 'VAL', 'VAL', 'SER', 'GLN'], (0, 128): ['SER', 'GLU

The PDBFixer found, that in chain A (0) at the beginning (residue id 0) five residues, and at the
end (residue id 128) three residues should be added. The fixer.missingResidues attribute
tells us this in form of a Python dictionary having tuples of (chain ID, residue ID) has
keys (points to insert residues) and lists of residues as values. We are fine with that and proceed
with applying the changes.

[13]: fixer.addMissingAtoms() # Apply the fix


app.PDBFile.writeFile(fixer.topology, fixer.positions, open('chain_A
# Save to file

There is one last thing we want to repair. The PDBFixer renumbered the residues in the protein
from 1 to 136 and we would like to restore the biological numbering starting at 193 going to
328. Note also, that the PDBFixer put the calcium atom in a another chain B than the protein and
omitted the PDB meta-information, but we are fine with that at this point.

[14]: offset = 192 # Residue numbering should start here

# Fix residue numbering


with open('chain_A_fixed.pdb') as in_file:
with open('chain_A_fixednumbering.pdb', "w") as out_file:
for line in in_file:
if line.startswith(("ATOM", "TER", "HETATM")) and line[2
resid = int(line[22:26]) + offset
line = line[:22] + f"{resid:>4}" + line[26:]
out_file.write(line)

We visualise the reduced and fixed structure – now only containing a single langerin chain plus
calcium atom – once more in VMD, this time using the QuickSurf representation and coloring by
Name. In the next part we will prepare this structure for a standard simulation in water.

FIGURE Pre-processed langerin structure in surface representation, carbon – black, oxygen –


red, nitrogen – blue, sulfur – yellow, calcium – dark blue.

Structure preparation with OpenMM


To use OpenMM to prepare our structure further for the simulation in water we need to import a
few Python modules.

[15]: import simtk.openmm as mm # Main OpenMM functionality


import simtk.openmm.app as app # Application layer (handy interface
import simtk.unit as unit # Unit/quantaty handling

OpenMM provides a PDB reader (as we saw already in the last part), with that we can load our
structure. This will create a PDB file object, that we name molecule. On this object we can
access several important attributes, like the atomic positions. Things like positions that have a
value and a unit are represented in OpenMM as Quantity objects.

Info: If you are looking for a nice way of handling units and quantities outside
OpenMM in Python in general, try out Pint.

[16]: molecule = app.PDBFile("chain_A_fixednumbering.pdb") # Load PDB


molecule.positions[0] # Access position of atom 1

[16]: Quantity(value=Vec3(x=-1.3648, y=4.1043, z=0.38580000000000003), uni

Another central attribute of this molecule is its topology. The topology term is not used
completely consistent among different MD softwares, but it essentially describes the MD
representation of a molecule including its (fixed) constitution, i.e. its connectivity. The topology
is the mapping of atomic elements at positions (e.g. oxygen at position xyz) to atom types within
residues connected by bonds, angels, etc. (e.g. carbonynyl oxygen of the glycine backbone,
bonded to carbonylic carbon). This mapping is necessary to select force field parameters (like
partial charges, Lennard-Jones coefficients, vibrational force constants) from a force field, that
understands the atom types and such in the topology, for the simulation. The topology is the
bottleneck of most MD setups and it is the main reason why simulations of proteins with
canonic amino acids (straightforward topology creation) can be done relatively easy while less
structured systems (arbitrarily complex topology creation) can be problematic. When we read a
PDB file with OpenMM, the topology is automatically created for us.

[17]: molecule.topology

[17]: <Topology; 2 chains, 137 residues, 1104 atoms, 1145 bonds>

A modern force field that we can choose for the simulation of our system is the AMBER14SB
force field available in OpenMM. We need to create a force field object from a force field file.
Force field files understood by OpenMM can be for example .xml file, so just another type of
text file that collects force field parameters, ready to match a created topology. We create the
force field object combining the AMBER14SB protein parameters with compatible parameters
for the TIP3P water model since we want to simulate in water later.

[18]: forcefield = app.ForceField(


"amber14/protein.ff14SB.xml",
"amber14/tip3p.xml"
)

In the next step we want to modify our structure by adding missing hydrogen atoms and putting
it into a box of water. In addition, we need to account for the fact that our system as a non-zero
total charge, due to the presence of charged groups like amino acid side chains and the Ca - 2+

ion. OpenMM has limited support for operations like this via a so called Modeller. We prepare
a molecule for modeling by passing its positions and topology to a Modeller, creating an object
that provides methods to add hydrogens and to add solvent. These methods in turn require a
force field object to work properly.

[19]: model = app.Modeller(molecule.topology, molecule.positions)


# Prepare molecule for modeling

model.addHydrogens(
forcefield
)

model.addSolvent(
forcefield,
padding=1*unit.nanometer,
neutralize=True
)

Info: We could have used the PDBFixer, too, for the addition of hydrogens. Generally
there is often more than one possibility to perform these processing tasks.

We can transfer the modified atom positions and the topology that does now include water back
to the initially created molecule PDBFile object.

[20]: molecule.positions = model.getPositions()


molecule.topology = model.getTopology()

And using molecules writeFile method we can write the new structure to disk, to be for
example inspected in VMD. To be able to construct a valid PDB file the writer needs the atom
positions and the topology.

[21]: with open("solvated.pdb", "w") as file_:


molecule.writeFile(
molecule.topology, molecule.positions,
file=file_
)

FIGURE We created a cubic simulation box filled with water.

FIGURE The system is fully solvated and one chloride has been added to neutralise the total
charge.

Info: In VMD you can add different representation for different atom selection. VMD
understand many selection keywords, like "protein", "water", "ion" or "resid
1", "chain A". You can select atoms based on distances (in Å) with e.g. "same
resid as within 2 of chain A".

Simulating in OpenMM
Setup
Now that we have our structure well prepared, we can start to use OpenMM for calculations.
The standard road to MD simulations usually involves the following steps:

Energy minimisation: The system we have prepared is most probably not in an energy
minimum. This is because we have put the solid state crystal structure into a new solvent
box, where we removed crystal water and choose solvent positions largely at random. To
prevent the system to blow up at the beginning of a simulation, we need to minimise its
energy.
NVT equilibration: MD simulations can be performed in different ensembles. Often it is
desired to simulate a system under realistic conditions, e.g. say at a physiological
temperature. The temperature can be controlled throughout the simulation by coupling the
system to a thermostat. When we switch on such a coupling the system needs some time to
adjust to it, in other words the temperature needs to equilibrate in a simulation run prior to
the actual simulation.
NPT equilibration: If in addition to the temperature also the pressure should be controlled
via a barostat, a second equilibration in in which pressure coupling is switched on needs to
be done.
Seeding and other optional steps: Once we have a stable equilibrated system, we are good
to go for a production MD simulation. In practice other preliminary steps may follow first,
like for example a high temperature run to generate a set of starting configurations for a set
of production runs.

Although this are the standard steps done to get to the final simulation, depending on the system
fewer or more stages can be necessary, like for example a vacuum minimisation before the
solvation or several NVT equilibrations under varying conditions.

In any case, OpenMM requires us to setup a system object first, that combines the molecular
topology with the chosen force field and simulation parameters. There are different ways to
create as system. Here we show using the createSystem method of the force field object
created earlier.

[22]: # Create a simulation system from a force field


system = forcefield.createSystem(
molecule.topology, # pass the molecular topology
nonbondedMethod=app.PME, # How to treat non-bonded int
nonbondedCutoff=1*unit.nanometer, # cutoff for non-bonded inter
constraints=app.HBonds # Constrain H-heavy atom bond
)

The system abstracts the molecule in terms of forces considered between its atoms. These forces
are calculated during the simulation to propagate the system. Forces are added to the system
from the force field according to the topology (e.g. add a harmonic potential for every bond
choosing the right force constant). Simulation parameters determine further how to evaluate the
different force contribution. PME for example is a scheme used to calculate pairwise
electrostatic interactions. We choose a cutoff for the non-bonded interactions of 1 nm beyond
which force contributions can be neglected to speed up the simulation. Furthermore, we
constraint hydrogen-heavy atom bonds meaning we do not really simulate the corresponding
stretching but rather keep the bond lengths at a reasonable value. In this way we can choose a
larger time step that does not need to resolve the vibration.

Thermostats and such interacting with the system during the simulation are also considered force
contributions, so adding a thermostat for our NVT simulation later is done via the addForce
method of the system. We choose an Andersen thermostat with a coupling rate of 1 ps to keep
the temperature at 300 K.

[23]: system.addForce(
mm.AndersenThermostat(300*unit.kelvin, 1/unit.picosecond)
)

[23]: 5

We also need to choose an integrator that will actually do the work of calculating forces and
modifying atom positions and velocities. Here we use the Verlet integrator and a typical time
step of 2 fs.

[24]: integrator = mm.VerletIntegrator(2.0*unit.femtoseconds)


integrator.setConstraintTolerance(0.00001)

Finally, we have to put everything together once more in another layer of abstraction by creating
a simulation object. From this simulation object, the simulations are started later. A simulation
bundles the molecular topology (the raw description), the system (the force/parameter
description), the integrator (the workhorse), hardware instructions and the simulation context.

[25]: simulation = app.Simulation(


molecule.topology, # Topology
system, # System
integrator, # Integrator
mm.Platform.getPlatformByName('CPU') # Platform
)

Under the term simulation context, OpenMM understands everything that is only secondarily
connected to the definition of a simulation, namely the current state of the run with certain atom
positions and velocities. In this way, the same conceptual simulation object can run many
independent simulations with e.g. different starting structures. We define in the context of our
simulation the current molecule positions as starting positions.

[26]: simulation.context.setPositions(molecule.positions)

Energy minimisation
With having done the abstraction of what and how we want to simulate in a simulation object,
minimising the energy is easy.

[27]: simulation.minimizeEnergy()

After this job is done, we can retrieve the current state of our molecule from the simulation
context and save the new atom positions to a file. Let’s see the outcome.

[28]: state = simulation.context.getState(getPositions=True)


molecule.positions = state.getPositions()

[29]: with open("min.pdb", "w") as file_:


molecule.writeFile(
molecule.topology, molecule.positions,
file=file_
)

FIGURE The atoms positions in our system have changed during the minimisation (before –
transparent).

NVT equilibration
As we can retrieve the context of a simulation, we can in the same way set it whenever we want.
Let’s for illustration purposes reload the minimised structure and reset the simulation state.

[30]: molecule = app.PDBFile("min.pdb")


simulation.context.setPositions(molecule.positions)

Manually getting the state out of a simulation in between runs works fine, but it is also possible
to automise this task. OpenMM provides so called Reporters that are attached to a simulation
and save data in fixed intervals. When we now want to run a NVT equilibration for 100 ps and
we want to save information about the progress of the run and the current temperature every 1
ps, we can do this using a StateDataReporter.

[31]: run_length = 50000 # 50000 * 2 fs = 100 ps


simulation.reporters.append(
app.StateDataReporter(
"equilibration.log", 500, step=True,
potentialEnergy=True, totalEnergy=True,
temperature=True, progress=True,
remainingTime=True, speed=True,
totalSteps=run_length,
separator='\t')
)

Running the simulation for a certain amount of steps is as easy as starting the minimisation.
Everything needed for the simulation has been already defined in the simulation object.

[32]: simulation.step(run_length)

For our system this simulation takes about 1 hour to finish on a reguler 4 CPU core machine.
After the simulation has completed we want to check if the equilibration was long enough. We
can do this by reading the log-file created by the reporter. We can for example plot the
temperature and energy development over time.

[33]: pot_e = []
tot_e = []
temperature = []

with open("equilibration.log") as file_:


for line in file_:
if line.startswith("#"):
continue
for l, v in zip((pot_e, tot_e, temperature), line.split()[2
l.append(float(v))

t = range(1, 101)

[34]: plt.close("all")
fig, ax = plt.subplots()
ax.plot(t, [x / 1000 for x in pot_e], label="potential")
ax.plot(t, [x / 1000 for x in tot_e], label="total")

ax.set(**{
"title": "Energy",
"xlabel": "time / ps",
"xlim": (0, 100),
"ylabel": "energy / 10$^{3}$ kJ mol$^{-1}$"
})

ax.legend(
framealpha=1,
edgecolor="k",
fancybox=False
)
plt.show()

[35]: plt.close("all")
fig, ax = plt.subplots()
ax.plot(t, temperature)

ax.set(**{
"title": "Temperature",
"xlabel": "time / ps",
"xlim": (0, 100),
"ylabel": "temperature / K"
})
plt.show()

This looks good. In the beginning of the run, the temperature of the system rises from zero (we
have started with no initial particle velocities) to the desired value of about 300 K. The observed
fluctuations of the instantaneous temperature are expected. Overall temperature and energy are
converged well. To save the equilibrated state of the system, including not only atom positions
but also velocities and everything that is needed to continue the simulation from here at a later
point, we can instead of accessing the simulation context once more, use the saveState
method of the simulation.

[36]: simulation.saveState("eq.xml")

Production
Continuing from the equilibrated system we could now proceed with a “real” MD simulation.
First, we could reload the state into the simulation.

[37]: simulation.loadState("eq.xml")

Writing output during a simulation is relatively time consuming, so we modify our state reporter
that it only writes information to a log-file every 100 ps which is enough to keep track of the
progress of the run. We also want to write out the current atom positions in regular intervals to
be analysed later as an MD trajectory of atomic coordinates. We could use a PDBReporter for
this to append the structures to a PDB file. OpenMM offers, however, also a different format that
is better suited to store long trajectories. A DCDReporter saves positions to a binary .dcd file
which uses much less disk space than a .pdb text file. We could for example write out the
positions every 10 ps. How long a simulation should be in the end depends much on the question
you want to answer with the simulation. If you want to simulate rather slow molecular processes
like protein folding you will find yourself quickly in a situation were you would need to
simulate several microseconds. The length of a typical single run may be around a few hundreds
of nanoseconds. If we want to simulate our system here for only 100 ns it would take us roughly
40 days to finish this on the machine used for the equilibration, so we will not actually do this
now. If you have a GPU with good cooling at your disposal (which can speed up the simulation
about a factor of 100) and you do not care too much about your electricity bill, you could give it
a try.

[38]: simulation.reporters = [] # Delete all current reporters

run_length = 50000000 # 50000000 * 2 fs = 100 ns


simulation.reporters.append(
app.StateDataReporter(
"simulation.log", 50000, step=True,
potentialEnergy=True,
temperature=True, progress=True,
remainingTime=True, speed=True,
totalSteps=run_length,
separator='\t')
)

simulation.reporters.append(
app.DCDReporter('trajectory.dcd', 5000)
)

[39]: simulation.step(run_length)

©2020, Jan-Oliver Joswig. | Powered by Sphinx 3.3.1 & Alabaster 0.7.12 | Page source

You might also like