Pre-requirements
[1]: import shlex # Optional
import subprocess # Optional
with x-, y-, and z-coordinates. We need to provide these coordinates in a machine- (and human-)
readable, consistent format to be of any use in a computer simulation. A popular file format,
especially for biochemical systems, is the PDB file format. PDB files (filename extension .pdb)
are text files and can be opened with the text editor of your choice. Many (crystal, NMR,
cryo-EM, …) structures that can be used as starting structures in simulations are available in the
RCSB Protein Data Bank (PDB) for everyone to download. Structures in this database are
identified by a four-character alphanumeric code, the PDB-ID. For our example, we use the
structure of the carbohydrate-recognition domain of human Langerin with the PDB-ID 3P5G.
Langerin is an endocytic pattern recognition receptor important for the human immune system.
Info: If you want to learn more about PDB structure files than we tackle here, check
out this Guide to understanding PDB data.
Each line in the PDB file begins with a record keyword identifying the kind of information
stored in that line. Records like HEADER, TITLE, or COMPND contain information
that we can safely ignore when we are only interested in the atomic coordinates of the molecule.
Depending on what you want to find out about the structure, however, they can be very useful.
The COMPND records tell us, for example, that the file contains 4 (identical) subunits, the
so-called chains A, B, C, and D, each consisting of 136 amino acid residues (residues 193 to 328).
We can also get the complete sequence of the protein contained in this file from the SEQRES entries.
GLN VAL VAL SER GLN GLY TRP LYS TYR PHE LYS GLY ASN
PHE TYR TYR PHE SER LEU ILE PRO LYS THR TRP TYR SER
ALA GLU GLN PHE CYS VAL SER ARG ASN SER HIS LEU THR
SER VAL THR SER GLU SER GLU GLN GLU PHE LEU TYR LYS
THR ALA GLY GLY LEU ILE TYR TRP ILE GLY LEU THR LYS
ALA GLY MET GLU GLY ASP TRP SER TRP VAL ASP ASP THR
PRO PHE ASN LYS VAL GLN SER ALA ARG PHE TRP ILE PRO
GLY GLU PRO ASN ASN ALA GLY ASN ASN GLU HIS CYS GLY
ASN ILE LYS ALA PRO SER LEU GLN ALA TRP ASN ASP ALA
PRO CYS ASP LYS THR PHE LEU PHE ILE CYS LYS ARG PRO
TYR VAL PRO SER GLU PRO
Often a PDB file contains more than just the structure of the main molecule. Crystallographic
structures, for example, can contain co-crystallised solvent and salt as well as ligands like small
drug-like molecules. Record entries like HETNAM or FORMUL may tell you more about what else
is in your file. Our example comes with a sugar ligand, a calcium ion, and crystal water.
Finally, the atomic coordinates we were initially interested in are listed under ATOM records.
From left to right, each ATOM record holds the atom ID, atom name, residue name, chain ID,
residue ID, xyz-coordinates in Å, crystallographic occupancy, crystallographic temperature
factor, and the chemical element of a single atom in a strict order. The different parts are
identified by their character position, e.g. the chain ID occupies the 22nd character of the ATOM
entry. Besides the xyz-coordinates, an MD simulation needs in particular the atom name and
the residue name columns. To represent an atom in such a simulation by suitable force field
parameters, it is not enough to know that we have, for example, a carbon atom somewhere. We
need to know instead that it is, say, the Cα atom of a glycine. Note that the first residue in
chain A is GLY198, which means that residues 193 to 197 are missing in this chain.
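The fixed-column layout can be illustrated with a few lines of Python. This is only a minimal sketch: the function name parse_atom_record is made up here, and the example record is built from invented coordinates, occupancy, and temperature factor. Only the column positions are taken from the PDB format.

```python
def parse_atom_record(line):
    """Split one ATOM/HETATM line into fields by character position.

    PDB columns are counted from 1, so the chain ID in column 22
    is line[21] in Python.
    """
    return {
        "record": line[0:6].strip(),
        "atom_id": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "alt_loc": line[16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_id": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# Hypothetical record for the C-alpha atom of GLY198 in chain A.
# The format string spells out the column widths; all numbers are invented.
example = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}          {:>2}"
).format("ATOM", 1, " CA ", "", "GLY", "A", 198, "", 11.0, 22.0, 33.0, 1.0, 20.0, "C")

atom = parse_atom_record(example)
print(atom["atom_name"], atom["res_name"], atom["chain_id"], atom["res_id"])
```

Slicing by position rather than splitting on whitespace matters because neighbouring fields may touch or be empty, as the alternate-location field is here.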
Atoms not part of the main molecule are usually put into HETATM records of similar structure.
TER additionally marks the end of a chain, which is particularly important for proteins because
the termini are modeled differently than residues within the chain.
To get a better overview of the system, let’s open the PDB file in VMD (start VMD directly
from the Jupyter notebook for a quick view or externally otherwise).
FIGURE Full structure under the PDB entry 3P5G, showing 4 langerin molecules with calcium
atoms, crystal water and different sugar ligands bound.
In our case we need to pay separate attention to residues that have atom entries with split
occupancies, a leftover of the crystal structure determination, as for SER277 shown below.
For this residue two alternative sets of atoms are given, indicated by a letter at the 17th
character of the ATOM entry.
We also want to reconstruct the missing residues 193 to 197 and 326 to 328. In particular, we
want to fix missing terminal atoms in this way.
FIGURE SER277 has two alternative sets of atom positions in the PDB file, one displayed
transparently.
Let’s tackle the ambiguous atom positions first. We can fix this kind of shortcoming easily while
extracting atom information from the file.
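One possible way to do this, shown as a minimal sketch and not necessarily the procedure used for this tutorial: keep only the first alternative (labelled A) and clear the indicator at character 17. The function name keep_first_altloc is made up, and the example records are invented and cut off after the residue number for brevity. Occupancies are left as they are.

```python
def keep_first_altloc(lines, keep="A"):
    """Drop alternative atom positions other than `keep` and clear the flag.

    The alternate-location indicator is character 17 (line[16]) of an
    ATOM/HETATM record. All other records pass through unchanged, so the
    header of the file is preserved.
    """
    cleaned = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            if line[16] not in (" ", keep):
                continue  # skip the second (third, ...) alternative
            line = line[:16] + " " + line[17:]
        cleaned.append(line)
    return cleaned


# Hypothetical excerpt with two alternative positions for OG of SER277.
records = [
    "HEADER    EXAMPLE",
    "ATOM    600  CB  SER A 277",
    "ATOM    601  OG ASER A 277",
    "ATOM    602  OG BSER A 277",
]

for line in keep_first_altloc(records):
    print(line)
```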
The second task, adding missing residues and atoms, cannot be solved so easily by ourselves.
We will use a tool from the OpenMM ecosystem for this, the PDBFixer. This tool can be used
through a Python interface. To save the fixed PDB file back to disk we will also need the PDB
file handler from the OpenMM package. We will discuss the file handling by OpenMM more in
the next section.
The fixing works the following way: We create a PDBFixer Python object from our
pre-processed PDB file. Then we let the fixer find missing residues, which it does by analysing
the header of the PDB file (which is why we left the header in there). Finally, we can fix the
missing atoms and save to a new PDB file.
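These steps can be sketched as follows. The function name fix_structure and any file names passed to it are placeholders, and the import path assumes a recent OpenMM release (older releases expose the same classes under simtk.openmm.app). Treat this as an outline of the PDBFixer calls, not as the exact notebook cells.

```python
def fix_structure(in_path, out_path):
    """Add missing residues and atoms with PDBFixer and write a new PDB file."""
    # Imported inside the function so that this sketch can be loaded
    # even where PDBFixer and OpenMM are not installed.
    from pdbfixer import PDBFixer
    from openmm import app

    fixer = PDBFixer(filename=in_path)
    fixer.findMissingResidues()    # compares the header sequence with the atoms present
    print(fixer.missingResidues)   # inspect the proposal before applying it
    fixer.findMissingAtoms()       # also registers atoms of the missing residues
    fixer.addMissingAtoms()        # builds the missing residues and atoms

    with open(out_path, "w") as handle:
        app.PDBFile.writeFile(fixer.topology, fixer.positions, handle)
    return fixer
```

A call could look like fix_structure("preprocessed.pdb", "fixed.pdb"), with both file names being hypothetical.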
The PDBFixer found that in chain A (0) five residues should be added at the beginning (residue
id 0) and three residues at the end (residue id 128). The fixer.missingResidues attribute
tells us this in the form of a Python dictionary with tuples of (chain ID, residue ID) as
keys (the points at which to insert residues) and lists of residues as values. We are fine with
that and proceed with applying the changes.
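To make the structure of this dictionary concrete, here is a hand-written stand-in. It is not program output: the residue names were read off the SEQRES listing above for residues 193 to 197 and 326 to 328, and the variable names are made up.

```python
# Stand-in for fixer.missingResidues: (chain index, insertion index) -> names.
missing_residues = {
    (0, 0): ["GLN", "VAL", "VAL", "SER", "GLN"],   # residues 193 to 197
    (0, 128): ["SER", "GLU", "PRO"],               # residues 326 to 328
}

n_present = 128  # residues 198 to 325 are present in the ATOM records

for (chain_index, insert_at), names in sorted(missing_residues.items()):
    print(f"chain {chain_index}, insert at {insert_at}: {'-'.join(names)}")

# After the fix the chain should have the full length given by COMPND.
n_total = n_present + sum(len(names) for names in missing_residues.values())
print(n_total)
```

The insertion index 128 follows from the 128 residues already present (indices 0 to 127), so inserting at 128 appends to the end of the chain.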
There is one last thing we want to repair. The PDBFixer renumbered the residues in the protein
from 1 to 136, and we would like to restore the biological numbering from 193 to 328. Note
also that the PDBFixer put the calcium atom into a different chain (B) than the protein and
omitted the PDB meta-information, but we are fine with that at this point.
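Restoring the numbering amounts to adding an offset of 192 to every residue ID of the protein chain. A minimal text-level sketch is given below. It is one of several possible ways (the residue IDs could just as well be changed on the OpenMM topology), the function name shift_residue_ids is made up, and the example records are invented and cut off after the residue number.

```python
def shift_residue_ids(lines, offset, chain_id="A"):
    """Add `offset` to the residue number (columns 23-26) of ATOM records."""
    shifted = []
    for line in lines:
        if line.startswith("ATOM") and line[21] == chain_id:
            new_id = int(line[22:26]) + offset
            line = f"{line[:22]}{new_id:>4}{line[26:]}"
        shifted.append(line)
    return shifted


# Hypothetical first and last protein atoms plus the calcium ion in chain B.
records = [
    "ATOM      1  N   GLN A   1",
    "ATOM   1050  C   PRO A 136",
    "HETATM 1100 CA    CA B 137",
]

for line in shift_residue_ids(records, 192):
    print(line)
```

Only ATOM records of the selected chain are touched, so the calcium entry keeps whatever number the PDBFixer assigned.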
We visualise the reduced and fixed structure – now only containing a single langerin chain plus
calcium atom – once more in VMD, this time using the QuickSurf representation and coloring by
Name. In the next part we will prepare this structure for a standard simulation in water.
OpenMM provides a PDB reader (as we already saw in the last part) with which we can load our
structure. This creates a PDBFile object that we name molecule. On this object we can
access several important attributes, like the atomic positions. Things like positions that have a
value and a unit are represented in OpenMM as Quantity objects.
Info: If you are looking for a nice way of handling units and quantities outside
OpenMM in Python in general, try out Pint.
Another central attribute of this molecule is its topology. The term topology is not used
completely consistently across MD software packages, but it essentially describes the MD
representation of a molecule including its (fixed) constitution, i.e. its connectivity. The topology
is the mapping of atomic elements at positions (e.g. oxygen at position xyz) to atom types within
residues connected by bonds, angles, etc. (e.g. carbonyl oxygen of the glycine backbone,
bonded to the carbonyl carbon). This mapping is necessary to select force field parameters (like
partial charges, Lennard-Jones coefficients, vibrational force constants) for the simulation from
a force field that understands the atom types used in the topology. The topology is the
bottleneck of most MD setups, and it is the main reason why simulations of proteins with
canonical amino acids (straightforward topology creation) can be done relatively easily, while less
structured systems (arbitrarily complex topology creation) can be problematic. When we read a
PDB file with OpenMM, the topology is automatically created for us.
[17]: molecule.topology
A modern force field that we can choose for the simulation of our system is the AMBER14SB
force field available in OpenMM. We need to create a force field object from a force field file.
Force field files understood by OpenMM can be, for example, .xml files, i.e. just another type of
text file that collects force field parameters, ready to be matched to a created topology. We
create the force field object by combining the AMBER14SB protein parameters with compatible
parameters for the TIP3P water model, since we want to simulate in water later.
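Creating such an object could look like the following sketch. The two file names are the Amber14 and TIP3P parameter files bundled with OpenMM; that exactly these files were used for this tutorial is an assumption, as is the function name build_forcefield. Older OpenMM releases need from simtk.openmm import app instead.

```python
def build_forcefield():
    """Combine Amber14 protein parameters with the TIP3P water model."""
    # Imported inside the function so that the sketch loads without OpenMM.
    from openmm import app

    # Several files can be passed at once; their definitions are merged.
    return app.ForceField("amber14-all.xml", "amber14/tip3p.xml")
```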
In the next step we want to modify our structure by adding missing hydrogen atoms and putting
it into a box of water. In addition, we need to account for the fact that our system has a non-zero
total charge, due to the presence of charged groups like amino acid side chains and the Ca²⁺
ion. OpenMM has limited support for operations like this via a so-called Modeller. We prepare
a molecule for modelling by passing its topology and positions to a Modeller, creating an object
that provides methods to add hydrogens and to add solvent. These methods in turn require a
force field object to work properly.
model.addHydrogens(
    forcefield
)
model.addSolvent(
    forcefield,
    padding=1*unit.nanometer,
    neutralize=True
)
Info: We could have used the PDBFixer, too, for the addition of hydrogens. Generally
there is often more than one possibility to perform these processing tasks.
We can transfer the modified atom positions and the topology, which now includes water, back
to the initially created molecule PDBFile object.
Using the writeFile method of molecule, we can then write the new structure to disk, for
example to inspect it in VMD. To construct a valid PDB file, the writer needs the atom
positions and the topology.
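The creation of the Modeller and the way back to a file can be sketched as below. The function names make_modeller and write_model and the output path are made up for illustration, and the import assumes a recent OpenMM release.

```python
def make_modeller(molecule):
    """Create a Modeller from the topology and positions of a PDBFile object."""
    # Imported inside the function so that the sketch loads without OpenMM.
    from openmm import app

    return app.Modeller(molecule.topology, molecule.positions)


def write_model(model, molecule, out_path):
    """Copy the modelled system back to `molecule` and write it as a PDB file."""
    molecule.topology = model.topology
    molecule.positions = model.positions
    with open(out_path, "w") as handle:
        # writeFile needs both the topology and the positions.
        molecule.writeFile(molecule.topology, molecule.positions, handle)
    return molecule
```

After adding hydrogens and solvent to the Modeller, a call such as write_model(model, molecule, "solvated.pdb") would produce the file to inspect (hypothetical file name).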
FIGURE The system is fully solvated and one chloride has been added to neutralise the total
charge.
Info: In VMD you can add different representations for different atom selections. VMD
understands many selection keywords, like "protein", "water", "ion" or "resid
1", "chain A". You can select atoms based on distances (in Å) with e.g. "same
resid as within 2 of chain A".
Simulating in OpenMM
Setup
Now that we have our structure well prepared, we can start to use OpenMM for calculations.
The standard road to MD simulations usually involves the following steps:
Energy minimisation: The system we have prepared is most probably not in an energy
minimum. This is because we have put the solid state crystal structure into a new solvent
box, where we removed crystal water and chose solvent positions largely at random. To
prevent the system from blowing up at the beginning of a simulation, we need to minimise its
energy.
NVT equilibration: MD simulations can be performed in different ensembles. Often it is
desired to simulate a system under realistic conditions, e.g. at a physiological
temperature. The temperature can be controlled throughout the simulation by coupling the
system to a thermostat. When we switch on such a coupling, the system needs some time to
adjust to it; in other words, the temperature needs to equilibrate in a simulation run prior to
the actual simulation.
NPT equilibration: If in addition to the temperature the pressure should also be controlled
via a barostat, a second equilibration in which pressure coupling is switched on needs to
be done.
Seeding and other optional steps: Once we have a stable equilibrated system, we are good
to go for a production MD simulation. In practice, other preliminary steps may follow first,
for example a high-temperature run to generate a set of starting configurations for a set
of production runs.
Although these are the standard steps on the way to the final simulation, fewer or more stages
can be necessary depending on the system, for example a vacuum minimisation before the
solvation or several NVT equilibrations under varying conditions.
In any case, OpenMM requires us to set up a system object first that combines the molecular
topology with the chosen force field and simulation parameters. There are different ways to
create a system. Here we use the createSystem method of the force field object
created earlier.
The system abstracts the molecule in terms of forces considered between its atoms. These forces
are calculated during the simulation to propagate the system. Forces are added to the system
from the force field according to the topology (e.g. a harmonic potential is added for every bond,
choosing the right force constant). Simulation parameters further determine how to evaluate the
different force contributions. PME, for example, is a scheme used to calculate pairwise
electrostatic interactions. We choose a cutoff of 1 nm for the non-bonded interactions, beyond
which direct force contributions are neglected to speed up the simulation. Furthermore, we
constrain bonds between hydrogen and heavy atoms, meaning we do not really simulate the
corresponding stretching but rather keep the bond lengths at a reasonable value. In this way we
can choose a larger time step that does not need to resolve the vibration.
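The choices described above translate into keyword arguments of createSystem. The following is a sketch under the assumption of a recent OpenMM release; the function name build_system is made up, and the arguments are meant to mirror the description, not to reproduce the original cell verbatim.

```python
def build_system(forcefield, topology):
    """Create an OpenMM System with PME, a 1 nm cutoff and constrained X-H bonds."""
    # Imported inside the function so that the sketch loads without OpenMM.
    from openmm import app, unit

    return forcefield.createSystem(
        topology,
        nonbondedMethod=app.PME,             # particle mesh Ewald electrostatics
        nonbondedCutoff=1 * unit.nanometer,  # cutoff for direct-space interactions
        constraints=app.HBonds,              # keep bonds involving hydrogen fixed
    )
```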
Thermostats and similar components that interact with the system during the simulation are also
treated as force contributions, so adding a thermostat for our NVT simulation later is done via
the addForce method of the system. We choose an Andersen thermostat with a collision
frequency of 1 ps⁻¹ to keep the temperature at 300 K.
[23]: system.addForce(
mm.AndersenThermostat(300*unit.kelvin, 1/unit.picosecond)
)
[23]: 5
We also need to choose an integrator that will actually do the work of calculating forces and
modifying atom positions and velocities. Here we use the Verlet integrator and a typical time
step of 2 fs.
Finally, we have to put everything together once more in another layer of abstraction by creating
a simulation object. From this simulation object, the simulations are started later. A simulation
bundles the molecular topology (the raw description), the system (the force/parameter
description), the integrator (the workhorse), hardware instructions, and the simulation context.
By simulation context, OpenMM means everything that is only secondarily
connected to the definition of a simulation, namely the current state of the run with its atom
positions and velocities. In this way, the same conceptual simulation object can run many
independent simulations with e.g. different starting structures. We define in the context of our
simulation the current molecule positions as starting positions.
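Integrator and simulation object could be created as in this sketch. The function name build_simulation and the optional platform_name argument are made up; the hardware platform is only selected explicitly if a name such as "CPU" or "CUDA" is given, otherwise OpenMM picks one itself. The import assumes a recent OpenMM release.

```python
def build_simulation(topology, system, platform_name=None):
    """Bundle topology, system and a Verlet integrator into a Simulation."""
    # Imported inside the function so that the sketch loads without OpenMM.
    import openmm as mm
    from openmm import app, unit

    integrator = mm.VerletIntegrator(2 * unit.femtoseconds)  # 2 fs time step
    if platform_name is None:
        return app.Simulation(topology, system, integrator)

    # Optional hardware instruction: run on a specific platform.
    platform = mm.Platform.getPlatformByName(platform_name)
    return app.Simulation(topology, system, integrator, platform)
```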
[26]: simulation.context.setPositions(molecule.positions)
Energy minimisation
Having abstracted what and how we want to simulate into a simulation object,
minimising the energy is easy.
[27]: simulation.minimizeEnergy()
After this job is done, we can retrieve the current state of our molecule from the simulation
context and save the new atom positions to a file. Let’s see the outcome.
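Retrieving the state and writing the positions could be done along these lines. The function name write_current_positions and the output path are placeholders, and the import assumes a recent OpenMM release.

```python
def write_current_positions(simulation, out_path):
    """Save the positions currently held by the simulation context as PDB."""
    # Imported inside the function so that the sketch loads without OpenMM.
    from openmm import app

    # Positions have to be requested explicitly when a state is created.
    state = simulation.context.getState(getPositions=True)
    with open(out_path, "w") as handle:
        app.PDBFile.writeFile(simulation.topology, state.getPositions(), handle)
    return state
```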
FIGURE The atom positions in our system have changed during the minimisation (before:
transparent).
NVT equilibration
Just as we can retrieve the context of a simulation, we can set it whenever we want.
For illustration purposes, let’s reload the minimised structure and reset the simulation state.
Manually getting the state out of a simulation in between runs works fine, but it is also possible
to automate this task. OpenMM provides so-called Reporters that are attached to a simulation
and save data at fixed intervals. If we now want to run an NVT equilibration for 100 ps and
save information about the progress of the run and the current temperature every 1
ps, we can do this using a StateDataReporter.
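Reporter intervals and run lengths are given in integration steps, so the times above have to be converted using the 2 fs time step. A minimal sketch follows; the helper names n_steps and attach_state_reporter and the log-file path argument are made up, while the reported quantities follow the description above.

```python
def n_steps(duration_ps, timestep_fs=2.0):
    """Number of integration steps covering `duration_ps` picoseconds."""
    return round(duration_ps * 1000 / timestep_fs)


run_length = n_steps(100)   # 100 ps of NVT equilibration
report_every = n_steps(1)   # one log line per picosecond


def attach_state_reporter(simulation, log_path, interval):
    """Log step, time, energies and temperature every `interval` steps."""
    # Imported inside the function so that the sketch loads without OpenMM.
    from openmm import app

    simulation.reporters.append(
        app.StateDataReporter(
            log_path, interval,
            step=True, time=True,
            potentialEnergy=True, totalEnergy=True, temperature=True,
        )
    )


print(run_length, report_every)
```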
Running the simulation for a certain number of steps is as easy as starting the minimisation.
Everything needed for the simulation has already been defined in the simulation object.
[32]: simulation.step(run_length)
For our system this simulation takes about 1 hour to finish on a regular machine with 4 CPU cores.
After the simulation has completed we want to check if the equilibration was long enough. We
can do this by reading the log-file created by the reporter. We can for example plot the
temperature and energy development over time.
[33]: pot_e = []
tot_e = []
temperature = []
t = range(1, 101)
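The lists above have to be filled from the reporter's log-file. The following is a minimal sketch of one way to do that, assuming the log is the comma-separated text written by a StateDataReporter with a header line starting with #. The function name read_state_log is made up, the column titles are assumed to match those OpenMM writes, and the two data lines are invented stand-ins for the real file.

```python
import csv
import io


def read_state_log(handle):
    """Collect the columns of a StateDataReporter CSV log as lists of floats."""
    rows = csv.reader(handle)
    # The header starts with '#', and the first title keeps its quotes.
    header = [name.lstrip("#").strip('"') for name in next(rows)]
    columns = {name: [] for name in header}
    for row in rows:
        for name, value in zip(header, row):
            columns[name].append(float(value))
    return columns


# Invented sample in the assumed layout; pass an opened log-file instead.
sample = io.StringIO(
    '#"Step","Time (ps)","Potential Energy (kJ/mole)",'
    '"Total Energy (kJ/mole)","Temperature (K)"\n'
    "500,1.0,-400000.0,-350000.0,150.0\n"
    "1000,2.0,-390000.0,-330000.0,250.0\n"
)

log = read_state_log(sample)
pot_e = log["Potential Energy (kJ/mole)"]
tot_e = log["Total Energy (kJ/mole)"]
temperature = log["Temperature (K)"]
```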
[34]: plt.close("all")
fig, ax = plt.subplots()
ax.plot(t, [x / 1000 for x in pot_e], label="potential")
ax.plot(t, [x / 1000 for x in tot_e], label="total")
ax.set(**{
"title": "Energy",
"xlabel": "time / ps",
"xlim": (0, 100),
"ylabel": "energy / 10$^{3}$ kJ mol$^{-1}$"
})
ax.legend(
framealpha=1,
edgecolor="k",
fancybox=False
)
plt.show()
[35]: plt.close("all")
fig, ax = plt.subplots()
ax.plot(t, temperature)
ax.set(**{
"title": "Temperature",
"xlabel": "time / ps",
"xlim": (0, 100),
"ylabel": "temperature / K"
})
plt.show()
This looks good. In the beginning of the run, the temperature of the system rises from zero (we
started with no initial particle velocities) to the desired value of about 300 K. The observed
fluctuations of the instantaneous temperature are expected. Overall, temperature and energy
have converged well. To save the equilibrated state of the system, including not only atom
positions but also velocities and everything else that is needed to continue the simulation from
here at a later point, we can use the saveState method of the simulation instead of accessing
the simulation context once more.
[36]: simulation.saveState("eq.xml")
Production
Continuing from the equilibrated system we could now proceed with a “real” MD simulation.
First, we could reload the state into the simulation.
[37]: simulation.loadState("eq.xml")
Writing output during a simulation is relatively time consuming, so we modify our state reporter
so that it only writes information to a log-file every 100 ps, which is enough to keep track of the
progress of the run. We also want to write out the current atom positions at regular intervals to
be analysed later as an MD trajectory of atomic coordinates. We could use a PDBReporter for
this to append the structures to a PDB file. OpenMM offers, however, also a different format that
is better suited to store long trajectories. A DCDReporter saves positions to a binary .dcd file,
which uses much less disk space than a .pdb text file. We could, for example, write out the
positions every 10 ps. How long a simulation should be in the end depends very much on the
question you want to answer with it. If you want to simulate rather slow molecular processes
like protein folding, you will quickly find yourself in a situation where you would need to
simulate several microseconds. The length of a typical single run may be around a few hundred
nanoseconds. If we want to simulate our system here for only 100 ns, it would take us roughly
40 days to finish this on the machine used for the equilibration, so we will not actually do this
now. If you have a GPU with good cooling at your disposal (which can speed up the simulation
by about a factor of 100) and you do not care too much about your electricity bill, you could
give it a try.
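The numbers in this paragraph follow from simple arithmetic, sketched here with made-up variable names. The inputs are the figures quoted in the text: 1 hour per 100 ps on the CPU machine, a 2 fs time step, and a GPU speed-up of roughly 100.

```python
hours_per_100_ps = 1.0   # timing of the NVT equilibration on the CPU machine
target_ns = 100          # desired production length
timestep_fs = 2

# 100 ns is 1000 times longer than the 100 ps equilibration run.
scale = target_ns * 1000 / 100
cpu_days = hours_per_100_ps * scale / 24
gpu_hours = hours_per_100_ps * scale / 100   # assuming a speed-up of about 100

# Writing a frame every 10 ps corresponds to this many integration steps.
dcd_interval = 10 * 1000 // timestep_fs

print(f"CPU: about {cpu_days:.0f} days, GPU: about {gpu_hours:.0f} hours")
print(f"DCD interval: {dcd_interval} steps")
```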
simulation.reporters.append(
    app.DCDReporter('trajectory.dcd', 5000)
)
[39]: simulation.step(run_length)
©2020, Jan-Oliver Joswig.