
Available online at www.sciencedirect.com

ScienceDirect

Current Opinion in Structural Biology

Data science techniques in biomolecular force field development
Ye Ding1,2, Kuang Yu3 and Jing Huang1,2

Abstract

Recent advances in data science are impacting the development of classical force fields. Here we review some ideas and techniques from data science that have been used in force field development, including database construction, atom typing, and machine learning potentials. We highlight how new tools such as active learning and automatic differentiation are facilitating the generation of target data and the direct fitting to macroscopic observables. Philosophical changes on how force field models should be built and used are also discussed. It’s inspiring that more accurate biomolecular force fields can be developed with the aid of data science techniques.

Addresses
1 Westlake AI Therapeutics Lab, Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, 310024, China
2 Key Laboratory of Structural Biology of Zhejiang Province, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, 310024, China
3 Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, Guangdong, 518055, China

Corresponding author: Huang Jing (huangjing@westlake.edu.cn)

Current Opinion in Structural Biology 2023, 78:102502
This review comes from a themed issue on Theory and Simulation/Computational Methods (2023)
Edited by Turkan Haliloglu and Gregory A. Voth
For complete overview of the section, please refer to the article collection - Theory and Simulation/Computational Methods (2023)
Available online 30 November 2022
https://doi.org/10.1016/j.sbi.2022.102502
0959-440X/© 2022 Elsevier Ltd. All rights reserved.

Keywords
Data Science, Machine Learning, Force Field, Molecular Modeling, Molecular Dynamics Simulation.

Introduction
Data science, a popular term in recent years, represents the study of extracting knowledge and making inference from data [1]. A variety of data science techniques have been developed to curate, handle, and make use of large-scale, heterogeneous data. The most successful technique is probably machine learning (ML), in particular deep learning (DL) that employs neural networks (NNs). The development of data science is also making a profound impact on how scientific research is conducted [2]. Some have advocated that data science is the fourth paradigm of scientific discovery [3].

Molecular mechanics force fields (FFs) are physics-based computational models that connect the force exerted on each atom with the coordinates of all atoms in a given simulation system. Due to their central roles in the molecular modeling and simulations of biomolecules, efforts are continuously made to increase their accuracy, such as introducing more sophisticated functional forms and adopting better parametrization strategies [4]. One of the key issues in biomolecular FF development is transferability, or the ability of FF models to generalize across different simulation environments [5,6]. This is a non-trivial task due to the significant difference between typical fitting data (quantum mechanics (QM) calculations of small organic molecules in vacuum) and the application scenarios (biological macromolecules in the condensed phase). Traditional biomolecular FFs are thus often considered empirical, and their development and refinement are largely a matter of trial and error. Many works are dedicated to the systematic development of biomolecular FFs with different algorithms [7].

With the ideas and techniques introduced by data science, it might be possible to make FF development less empirical and more data-driven. Established computational tools for data labeling, feature extraction, and model fitting could be directly utilized (Figure 1). Furthermore, new philosophies on how data should be treated and how models should be constructed and evaluated might help. In this manuscript, we will review the emerging efforts to bring data science into biomolecular FF development. These range from new fitting algorithms and new molecular representation methods to automatic differentiation techniques. An important viewpoint in data science is that data play the central role and should be made publicly available, so we start by reviewing works on the construction and sharing of FF-relevant datasets.

From providing FF parameters to making data sets open access
Databases are the fundamental infrastructures in the era of data science. Modern computational models require larger and larger data sets for training and validation, and
high-quality databases facilitate the development of computational models. A good example is AlphaFold [8], whose success relies on the high-quality experimentally resolved protein structures curated by the Protein Data Bank (PDB) database [9]. FF development typically requires large amounts of quantum mechanics-based calculations using high-level correlation methods such as the coupled cluster or at least the MP2 methods. Traditional FF development works only publish the final FF models but do not make the data used to fit the models publicly accessible. With the general emphasis on reproducibility and data availability, high-quality QM databases for FF development have been established in recent years [10].

Figure 1

How data science techniques are transforming the development of empirical FFs. The importance of databases is emphasized, and new techniques such as active learning are replacing rigid scan in the generation of fitting data. New optimization methods such as Bayesian inference and stochastic gradient descent are introduced. Classical FF models (left panels) are released as text files containing parameters for different atom types, while in data science trained models (right panels) are often provided as a whole and atom types can be replaced by continuous representations based on topological connectivity.

We note that most existing QM databases at the CCSD(T)/CBS level are intended for benchmarking other QM methods such that only the geometries and interactions of equilibrated structures are included. FF development requires instead information on off-equilibrated structures, or at least potential energy scans along key degrees of freedom (DOFs). How to generate representative off-equilibrated conformations is a key challenge. Some QM databases have started to provide such information; for example, the NCIAtlas database provides 10-point scans along hydrogen bonding or dispersion interaction distances [11,12]. Specific data sets for FF development have been released, most notably a much larger database released by D. E. Shaw Research that contains the interaction energies of dimer complexes of organic molecules at the CCSD(T)/CBS level [13]. Off-equilibrated conformations of the dimer complexes were generated by rigid scans along the radial intermolecular axis with initial structures extracted from QM optimization and molecular dynamics (MD) trajectory frames. A central dataset DES370K contains about 370,000 geometries evaluated with CCSD(T), and an additional dataset DES5M contains about 5 million geometries evaluated with SNS-MP2, an ML approach whose accuracy is equivalent to the coupled cluster level [14]. Smith et al. created the ANI-1x and ANI-1ccx datasets of small organic molecules with conformational sampling along low-frequency normal modes [15]. Relevant off-equilibrated conformations are selected by active learning and subjected to QM calculations. The ANI-1x dataset contains 4.5 million conformations evaluated with the ωB97x functional. Via an active learning procedure using machine learning potentials trained with ANI-1x, a smaller ANI-1ccx dataset was generated with 0.5 million conformations evaluated with the DLPNO-CCSD(T) method. These two datasets cover only the elements C, H, O, and N, while broader coverage has also been attempted [16]. Open access databases accumulating high-quality QM calculation results would greatly promote the application of various data science techniques in FF parametrization. On the other hand, standardization of these emergent databases is desired to make them easy to access [7].

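The rigid scans along the radial intermolecular axis mentioned above for generating off-equilibrated dimer conformations can be illustrated with a minimal sketch. It assumes that the radial axis is the line joining the two monomer centroids and that both monomers are kept internally frozen. The function names (`centroid`, `rigid_radial_scan`), the toy coordinates, and the scaling factors are hypothetical choices for illustration only, and are not the protocol used to build DES370K.

```python
import math

def centroid(coords):
    """Geometric center of a list of (x, y, z) tuples (mass weighting omitted)."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def rigid_radial_scan(mono_a, mono_b, scales):
    """Translate monomer B rigidly along the centroid-centroid axis.

    Each scale multiplies the original centroid separation; the internal
    geometries of both monomers are left untouched.
    """
    ca, cb = centroid(mono_a), centroid(mono_b)
    axis = tuple(b - a for a, b in zip(ca, cb))
    frames = []
    for s in scales:
        shift = tuple((s - 1.0) * v for v in axis)
        moved_b = [tuple(x + d for x, d in zip(atom, shift)) for atom in mono_b]
        frames.append((list(mono_a), moved_b))
    return frames

# Toy dimer (made-up coordinates): two diatomic fragments 3.0 angstrom apart.
mono_a = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
mono_b = [(3.0, 0.0, 0.0), (3.0, 1.0, 0.0)]
frames = rigid_radial_scan(mono_a, mono_b, scales=[0.8, 0.9, 1.0, 1.25, 1.5, 2.0])
separations = [distance(centroid(a), centroid(b)) for a, b in frames]
```

In a real workflow each frame would then be passed to a QM single-point calculation to obtain the interaction energy; that step is omitted here.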

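Active-learning selection of conformations for QM labeling, as used for the ANI datasets above, is often driven by the disagreement among several independently trained models. The sketch below shows that selection step only, assuming a committee of callables that each return an energy for a conformation. The names `committee_std` and `select_for_labeling`, the toy one-dimensional "potentials", and the threshold value are hypothetical, and the snippet is not the procedure of any specific dataset.

```python
import statistics

def committee_std(models, conformation):
    """Standard deviation of the energies predicted by the committee members."""
    predictions = [model(conformation) for model in models]
    return statistics.pstdev(predictions)

def select_for_labeling(models, candidates, threshold):
    """Keep candidates whose committee disagreement exceeds the threshold."""
    return [c for c in candidates if committee_std(models, c) > threshold]

# Toy committee: three 1-D "potentials" that agree near x = 1 and diverge far from it.
models = [
    lambda x: (x - 1.0) ** 2,
    lambda x: 1.1 * (x - 1.0) ** 2,
    lambda x: 0.9 * (x - 1.0) ** 2,
]
candidates = [0.5, 1.0, 1.5, 3.0]
selected = select_for_labeling(models, candidates, threshold=0.1)
```

Only the conformation far from the region where the toy models agree is selected; in practice the selected structures would be labeled with QM calculations, added to the training set, and the committee retrained.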


The validation and refinement of biomolecular FFs also rely on condensed phase experimental data, such as NMR scalar and dipolar couplings as well as molecular liquid properties [17–20]. For example, the helical content of (AAQAA)3 derived from NMR measurements and its temperature dependence have systematically driven protein FFs towards more accurate backbone conformational sampling [17,21,22]. The Protein Ensemble Database (PED) stores the structural ensemble information of intrinsically disordered proteins (IDPs), with about 212 protein entries annotated with experimentally measured NMR, SAXS, or FRET data [23]. More efforts to construct databases containing relevant experimental data would benefit the development of biomolecular FFs.

Atom typing: from discrete types to continuous embedding
Force field parameters can be generally divided into two sets. One set includes per-atom parameters such as partial charges and van der Waals (vdW) parameters, which describe the non-bonded interactions. The other includes parameters to model bonded interaction terms based on the topological connectivity of atoms in molecular fragments. The first step in employing an FF model for a particular simulation system is atom typing, and the assignment of both sets of parameters depends on the atom types pre-defined in the FF. Atom typing is in general difficult for small molecule ligands [24], while for biomolecular FFs it also remains a question whether it is necessary to introduce new atom types during FF development.

As the atom type is encoded by an atom’s topological environment, mapping from the atom’s substructure to FF parameters is feasible with a text-rich format for chemical environment definition, as illustrated by Mobley et al. [25]. More generally, graph neural networks naturally provide an architecture that mimics the topological environment, laying the foundation for machine learning of atom types. Zhang proposed to use the topology adaptive graph convolutional network (TAGCN) [26] for automatic atom typing [27]. For each atom, the output of TAGCN is a probability distribution over all pre-defined atom types in a certain FF. The network was trained to reproduce the atom types assigned using the rule-based CGENFF program [24,28], and achieved over 90% fidelity for the atom typing problem in validation. To some extent, atom typing in empirical FFs represents an abstraction layer to reduce the dimensionality of the parameter space [25]. We note that once atom types are determined, parameter assignment can also be aided by machine learning [29,30]. While atom typing reduces the search space and enhances model transferability, accuracy is inevitably compromised by the finite discrete parameter space during optimization. Chodera and co-workers proposed Espaloma, an FF model that discards atom typing and assigns parameters with message-passing neural networks (MPNNs) [31]. Instead of mapping discrete atom types into FF parameters with tabulated functions, Espaloma uses trained continuous functions to directly map the topological connectivity to bonded and non-bonded parameters. Determination of atomic partial charges is complicated by the additional constraint that the total charge should be an integer. To avoid the manual redistribution of predicted charges, Espaloma predicts atomic electronegativity and hardness and uses them to compute the atomic charges analytically with Lagrange multipliers to satisfy the constraint. This approach leads to more accurate small molecule parameters, as demonstrated by the significant improvement in the binding free energy calculations of the Tyk2 ligand-protein system [31].

Setting aside the functional forms in FFs
Classical FFs model molecular interactions with parametrized mathematical functions. The functional forms were empirically designed and have been extremely successful, remaining basically the same for more than 50 years [32]. More sophisticated functional forms have been introduced to account for polarization effects through different formulations of polarizable FF models [33,34]. In classical FFs, the curse of dimensionality is avoided by a “divide-and-conquer” strategy that decomposes the total energy into different interaction terms, which effectively approximates the high-dimensional potential energy surface by a series of 1- or 2-dimensional subspaces. Theoretical advances in data science point out that neural network-based machine learning can overcome the curse of dimensionality when approximating high-dimensional functions [35,36]. Recently, a variety of machine learning potentials (MLPs) have been developed that discard the empirical functional forms in FFs and directly map the atomic coordinates onto the potential energy and forces [37].

MLPs can be generally classified into kernel-based or neural network-based methods according to the ML techniques employed (Figure 2). A determining factor of their accuracy is how well the ML architecture can preserve the physical symmetries of the simulation system such as the rotational, translational, and permutation invariance [38]. MLPs should also be size-extensive, which means being transferable across systems with various sizes. The pioneering work by Behler and Parrinello in 2007 proposed to construct the total potential energy of a simulation system as the summation of atomic energies of individual particles [39]. Each atomic energy is predicted by an NN that takes environmental descriptors of the corresponding atom as inputs. In Behler’s work, these input features are
manually crafted to guarantee the translational and rotational invariance. Smith et al. extended the definition of environment descriptors by including 3-body interaction features, yielding the ANI series of models that are widely used to model drug-like organic molecules [16,40]. It’s also possible to replace the manually designed environment descriptors with atomic features learned by another embedding network. The embedding NN and the fitting NN are trained together, and different embedding strategies lead to different MLPs such as the Deep Potential (DP) model [41] and the SchNet model [42]. Following the popular philosophy of data science, developers of the DP method provide full-stack training and development tools to expand the application scenarios for different users [43,44].

Figure 2

Illustrations on how the potential energy is determined by atom coordinates in (a) classical FFs, (b) kernel-based MLPs, (c) NN-based MLPs that utilize atom-centered symmetry functions as environment descriptors, (d) the DP model with descriptors learned from embedding networks, and (e) MLPs with descriptors learned by MPNN.

One consequence of setting aside the functional forms is that separation of physically meaningful interaction terms is not straightforward anymore. “Divide-and-conquer” strategies for parametrizing classical FFs are thus no longer valid. Machine learning models are also notorious for the extremely large number of parameters to be determined, which implies that high-quality MLPs require a vast amount of training data. To this end, active learning is widely used in the construction of MLPs [44,45]. In active learning, multiple MLP models are trained at the same time and the variance of their predictions over a certain molecular conformation serves as the criterion on whether this conformation should be added to the training set. Using MD simulation for conformational searching in active learning eventually leads to simultaneous model training and application, similar to the “on-the-fly” MD approach [46,47]. Moreover, active learning can be, and should be, combined with more radical techniques to sample conformational spaces as well as the chemical space.

While MLPs can achieve QM accuracy with the same scaling as classical FFs, they are still computationally much more expensive. Difficulties in the proper handling of long-range interactions also limit their applications to highly heterogeneous systems [48,49]. To explicitly include the non-local effects that are crucial for the accurate modeling of long-range interactions, improvement in current MLP architectures is needed. Unke et al. recently reported their efforts to construct an MLP model for protein simulations by training on “bottom-up” and “top-down” fragments of varying sizes [50]. In general, accurate MLPs to simulate proteins solvated in water or embedded in lipid bilayers are not yet available. A more practical approach to using MLPs for biomolecular systems is via hybrid models. Compared with the well-established QM/MM method, one advantage of MLP/MM is that the electronic degrees of
freedom in MLPs are implicitly integrated out. This would simplify MD integration as there are consistent DOFs between the MLP part and the classical MM part in the model.

Both the ANI and the DP models have been used to construct hybrid MLP/MM models for MD simulations of biomolecules [51]. The whole simulation system is described by a classical MM model, with a subset of atoms modeled by an additional MLP model. The MLP model can be trained to reproduce the difference in atomic forces between QM calculations and the MM model for the subset, or a subtraction of MM forces can be performed during the simulation. We also note that MLPs can be utilized to improve QM/MM calculations, either as a correction to bridge low-level and high-level QM methods [52,53] or as leverage to model the QM influence on the QM/MM boundary [54].

Turbocharging FF development with automatic differentiation
Development of biomolecular FFs typically involves running MD simulations, from which thermodynamic properties can be computed and compared with experimental measurements. This requires analytical first derivatives or even second derivatives (Hessians) of the potential energy function. Furthermore, thermodynamic properties can be leveraged to directly optimize FF parameters through reweighting by taking the derivatives with respect to particular FF parameters [22,55]. Implementation of these derivatives is however cumbersome, which limits the advanced optimization of FFs. Fortunately, this is a common task in data science. ML frameworks, in particular JAX, now provide strong support for automatic differentiation and thus remove the coding burden of derivative calculations.

Automatic differentiation allows agile FF construction based on macroscopic observable values. Examples include FF fitting targeting radial distribution functions and specific folded conformations [56], and FF parametrization using hydration free energies with automatic differentiation [57]. In another study, a coarse-grained protein FF is constructed using only RMSDs with respect to targeted native folded conformational states [58]. Aided by automatic differentiation, there are a series of efforts towards the construction of end-to-end differentiable molecular dynamics [59,60]. We do note that automatic differentiation in DL bears the possible issue of numerical instability, as the iterative application of the chain rule might induce exploding or vanishing gradients [61,62]. Recurrent neural networks are usually included in ML architectures to confront the issue and enhance the robustness of training models. Similar considerations should be taken into account when applying automatic differentiation techniques in MD simulations and FF development.

What if there are multiple equally good FF models?
During the development of biomolecular FFs, sometimes it’s possible to generate several parameter sets when fitting to the QM calculations of model compounds. These sets were usually subjected to MD simulations of a few biomolecular systems, and one set was selected empirically as the published final model. With more sophisticated optimizers and the direct fitting to condensed phase experimental data, it would be common that several equally good FF models are derived. For example, Lindorff-Larsen and co-workers generated three different parameter sets when optimizing a coarse-grained protein hydrophobicity scale (HPS) model with Bayesian parameter-learning [63], and they found that there was no significant difference between these three models in describing the relevant IDP properties. This would be even more common with MLPs due to their much larger parameter spaces. Interestingly, an understanding of the non-deterministic nature of FF models would provide additional insights on how these models should be used.

Data science provides a series of theoretical tools to handle the uncertainties in computational models, such as the Bayesian neural network and the committee model. In the development of FFs, uncertainty estimation is also important for their application in different scenarios [64–67]. Ceriotti and co-workers derived a theoretical framework for the error estimation of thermodynamic properties calculated from MD simulations considering the uncertainties of the FF models employed [68]. Essentially, uncertainty propagation can be performed for ensemble-averaged properties by computing the uncertainty for each MD frame using the committee model. Their work also demonstrated the advantage of constructing MLPs as add-ons to classical FFs such that accuracy and generalization can be achieved at the same time. A similarity to the well-established ensemble learning method in ML should be noted [69].

Conclusions
Data science has changed human life in the past decade, especially with the applications of deep learning. In the process, a variety of techniques have been developed for the training and deployment of computational models, which can help us develop better biomolecular force fields. Scaling up these approaches to large, heterogeneous biomolecular systems is still a challenge due to the difficulties in efficiently sampling the chemical space and modeling long-range interactions. Attempts to tackle these problems are ongoing, such as active learning and large-scale pre-trained models, but good solutions are yet to come. We expect that accurate and transferable MLPs or hybrid models using NNs for proteins and nucleic acids will become available in the very near future. Other data science techniques than those
reviewed here might also be useful. For example, continual learning can potentially change the way the transferability of FFs is assessed. Methods to handle label noise and imbalanced datasets might also be applicable to the robust optimization of FF models. Together with applications in finding collective variables for enhanced sampling [70], generating conformational ensembles [71], or even directly accelerating the dynamics propagation [72], data science and machine learning are revolutionizing the modeling and simulation of biomolecular systems.

Conflict of interest statement
Nothing declared.

Data availability
Data will be made available on request.

Acknowledgments
This work was supported by the Zhejiang Provincial Natural Science Foundation of China (Grant No. LR19B030001), the National Natural Science Foundation of China (Grant No. 21803057, 32171247), and the Shenzhen Science and Technology Innovation Commission (Grant No: WDZC20200819115243002).

References
Papers of particular interest, published within the period of review, have been highlighted as:

* of special interest
** of outstanding interest

1. Dhar V: Data science and prediction. Commun ACM 2013, 56:64–73.
2. Schlick T, Portillo-Ledesma S: Biomolecular modeling thrives in the age of technology. Nature computational science 2021, 1:321–331.
3. Tolle KM, Tansley DSW, Hey AJ: The fourth paradigm: data-intensive scientific discovery [point of view]. Proc IEEE 2011, 99:1334–1337.
4. Huang J, MacKerell AD: Force field development and simulations of intrinsically disordered proteins. Curr Opin Struct Biol 2018, 48:40–48.
5. MacKerell Jr AD: Empirical force fields for biological macromolecules: overview and issues. J Comput Chem 2004, 25:1584–1604.
6. Nerenberg PS, Head-Gordon T: New developments in force fields for biomolecular simulations. Curr Opin Struct Biol 2018, 49:129–138.
7. Van der Spoel D: Systematic design of biomolecular force fields. Curr Opin Struct Biol 2021, 67:18–24.
8. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al.: Highly accurate protein structure prediction with alphafold. Nature 2021, 596:583–589.
9. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res 2000, 28:235–242.
10. Kriz K, Schmidt L, Andersson A, Walz M-M, van der Spoel D: An imbalance in the force: the need for standardised benchmarks for molecular simulation.
11. Řezáč J: Non-covalent interactions atlas benchmark data sets: hydrogen bonding. J Chem Theor Comput 2020, 16:2355–2368.
12. Řezáč J: Non-covalent interactions atlas benchmark data sets 5: London dispersion in an extended chemical space. Phys Chem Chem Phys 2022, 24:14780–14793.
13. ** Donchev AG, Taube AG, Decolvenaere E, Hargus C, McGibbon RT, Law K-H, Gregersen BA, Li J-L, Palmo K, Siva K, et al.: Quantum chemical benchmark databases of gold-standard dimer interaction energies. Sci Data 2021, 8:1–9.
    This work presented a large database of gold-standard dimer interaction energies with millions of off-equilibrated conformations of organic molecules that would be useful for FF development.
14. McGibbon RT, Taube AG, Donchev AG, Siva K, Hernández F, Hargus C, Law K-H, Klepeis JL, Shaw DE: Improving the accuracy of Møller-Plesset perturbation theory with neural networks. J Chem Phys 2017, 147:161725.
15. Smith JS, Zubatyuk R, Nebgen B, Lubbers N, Barros K, Roitberg AE, Isayev O, Tretiak S: The ani-1ccx and ani-1x data sets, coupled-cluster and density functional theory properties for molecules. Sci Data 2020, 7:1–10.
16. Devereux C, Smith JS, Huddleston KK, Barros K, Zubatyuk R, Isayev O, Roitberg AE: Extending the applicability of the ani deep learning molecular potential to sulfur and halogens. J Chem Theor Comput 2020, 16:4192–4202.
17. Best RB, Buchete N-V, Hummer G: Are current molecular dynamics force fields too helical? Biophys J 2008, 95:L07–L09.
18. Robustelli P, Piana S, Shaw DE: Developing a molecular dynamics force field for both folded and disordered protein states. Proc Natl Acad Sci USA 2018, 115:E4758–E4766.
19. Xu Y, Huang J: Validating the charmm36m protein force field with lj-pme reveals altered hydrogen bonding dynamics under elevated pressures. Commun Chem 2021, 4:99.
20. Caleman C, Van Maaren PJ, Hong M, Hub JS, Costa LT, Van Der Spoel D: Force field benchmark of organic liquids: density, enthalpy of vaporization, heat capacities, surface tension, isothermal compressibility, volumetric expansion coefficient, and dielectric constant. J Chem Theor Comput 2012, 8:61–74.
21. Huang J, MacKerell A: Induction of peptide bond dipoles drives cooperative helix formation in the (aaqaa)3 peptide. Biophys J 2014, 107:991–997.
22. Huang J, Rauscher S, Nawrocki G, Ran T, Feig M, de Groot BL, Grubmueller H, MacKerell A: Charmm36m: an improved force field for folded and intrinsically disordered proteins. Nat Methods 2017, 14:71–73.
23. Lazar T, Martínez-Pérez E, Quaglia F, Hatos A, Chemes LB, Iserte JA, Méndez NA, Garrone NA, Saldaño TE, Marchetti J, et al.: Ped in 2021: a major update of the protein ensemble database for intrinsically disordered proteins. Nucleic Acids Res 2021, 49:D404–D411.
24. Vanommeslaeghe K, MacKerell Jr AD: Automation of the charmm general force field (cgenff) i: bond perception and atom typing. J Chem Inf Model 2012, 52:3144–3154.
25. Mobley DL, Bannan CC, Rizzi A, Bayly CI, Chodera JD, Lim VT, Lim NM, Beauchamp KA, Slochower DR, Shirts MR, et al.: Escaping atom types in force fields using direct chemical perception. J Chem Theor Comput 2018, 14:6076–6092.
26. Du J, Zhang S, Wu G, Moura JM, Kar S: Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370.
27. * Zhang J: Atom typing using graph representation learning: how do models learn chemistry? J Chem Phys 2022, 156:204108.
    This paper trained a graph neural network to address the atom typing problem.
28. Vanommeslaeghe K, Raman EP, MacKerell Jr AD: Automation of the charmm general force field (cgenff) ii: assignment of bonded parameters and partial atomic charges. J Chem Inf Model 2012, 52:3155–3168.
29. Chatterjee P, Sengul MY, Kumar A, MacKerell Jr AD: Harnessing deep learning for optimization of Lennard-Jones parameters for the polarizable classical drude oscillator force field. J Chem Theor Comput 2022, 18:2388–2407.
30. Kumar A, Pandey P, Chatterjee P, MacKerell Jr AD: Deep neural network model to predict the electrostatic parameters in the polarizable classical drude oscillator force field. J Chem Theor Comput 2022, 18:1711–1725.
31. ** Wang Y, Fass J, Kaminow B, Herr JE, Rufa D, Zhang I, Pulido I, Henry M, Macdonald HEB, Takaba K, et al.: End-to-end differentiable construction of molecular mechanics force fields. Chem Sci 2022, 13:12016–12033.
    This work proposed a new way to construct FFs that directly maps the topological connectivity to FF parameters, discarding the discrete atom types in classical FF models.
32. Lifson S, Warshel A: Consistent force field for calculations of conformations, vibrational spectra, and enthalpies of cycloalkane and n-alkane molecules. J Chem Phys 1968, 49:5116–5129.
33. Lemkul J, Huang J, Roux B, MacKerell A: An empirical polarizable force field based on the classical drude oscillator model: development history and recent applications. Chem Rev 2016, 116:4983–5013.
34. Huang J, Simmonett AC, Pickard FC, MacKerell AD, Brooks BR: Mapping the drude polarizable force field onto a multipole and induced dipole model. J Chem Phys 2017, 147:161702.
35. Han J, Jentzen A: Solving high-dimensional partial differential equations using deep learning. Proc Natl Acad Sci USA 2018, 115:8505–8510.
36. Beneventano P, Cheridito P, Graeber R, Jentzen A, Kuckuck B: Deep neural network approximation theory for high-dimensional functions. arXiv preprint arXiv:2112.14523.
37. Unke OT, Chmiela S, Sauceda HE, Gastegger M, Poltavsky I, Schütt KT, Tkatchenko A, Müller K-R: Machine learning force fields. Chem Rev 2021, 121:10142–10186.
48. Yue S, Muniz MC, Calegari Andrade MF, Zhang L, Car R, Panagiotopoulos AZ: When do short-range atomistic machine-learning models fall short? J Chem Phys 2021, 154:034111.
49. Behler J, Csányi G: Machine learning potentials for extended systems: a perspective. Eur Phys J B 2021, 94:1–11.
50. * Unke OT, Stöhr M, Ganscha S, Unterthiner T, Maennel H, Kashubin S, Ahlin D, Gastegger M, Sandonas LM, Tkatchenko A, et al.: Accurate machine learned quantum-mechanical force fields for biomolecular simulations. arXiv preprint arXiv:2205.08306.
    This work constructed an MLP model for simulations of solvated proteins with training data obtained from the QM calculations on millions of “bottom-up” and “top-down” selected conformations of solvated proteins.
51. ** Rufa DA, Macdonald HEB, Fass J, Wieder M, Grinaway PB, Roitberg AE, Isayev O, Chodera JD: Towards chemical accuracy for alchemical free energy calculations with hybrid physics-based machine learning/molecular mechanics potentials. BioRxiv.
    This paper proposed a hybrid MLP/MM model for protein-ligand systems by applying the ANI-2x model to describe the intramolecular interactions of ligands.
52. Pan X, Yang J, Van R, Epifanovsky E, Ho J, Huang J, Pu J, Mei Y, Nam K, Shao Y: Machine-learning-assisted free energy simulation of solution-phase and enzyme reactions. J Chem Theor Comput 2021, 17:5745–5758.
53. Zeng J, Giese TJ, Ekesan S, York DM: Development of range-corrected deep learning potentials for fast, accurate quantum mechanical/molecular mechanical simulations of chemical reactions in solution. J Chem Theor Comput 2021, 17:6993–7009.
54. Lier B, Poliak P, Marquetand P, Westermayr J, Oostenbrink C: Burnn: buffer region neural network approach for polarizable-embedding neural network/molecular mechanics simulations. J Phys Chem Lett 2022, 13:3812–3818.
38. Chmiela S, Sauceda HE, Müller K-R, Tkatchenko A: Towards 55. Wang L-P, Martinez TJ, Pande VS: Building force fields: an
exact molecular dynamics simulations with machine-learned automatic, systematic, and reproducible approach. J Phys
force fields. Nat Commun 2018, 9:1–10. Chem Lett 2014, 5:1885–1891.
39. Behler J, Parrinello M: Generalized neural-network represen- 56. W. Wang, S. Axelrod, R. Gómez-Bombarelli, Differentiable mo-
tation of high-dimensional potential-energy surfaces. Phys lecular simulations for control and learning, arXiv preprint arXiv:
Rev Lett 2007, 98, 146401. 2003.00868.
40. Smith JS, Isayev O, Roitberg AE: Ani-1: an extensible neural 57. Wang X, Li J, Yang L, Chen F, Wang J, Chang J, Chen J,
network potential with dft accuracy at force field computa- Zhang L, Yu K: Dmff: an open-source automatic differentiable
tional cost. Chem Sci 2017, 8:3192–3203. platform for molecular force field development and molecular
41. Zhang L, Han J, Wang H, Saidi W, Car R, Weinan E: End-to-end dynamics simulation. ChemRxiv 2022, https://doi.org/10.26434/
symmetry preserving inter-atomic potential energy model for chemrxiv-2022-2c7gv.
finite and extended systems. In Advances in neural information 58. Greener JG, Jones DT: Differentiable molecular simulation
processing systems; 2018:4436–4446. can learn all the parameters in a coarse-grained force field for
42. Schütt KT, Sauceda HE, Kindermans P-J, Tkatchenko A, proteins. PLoS One 2021, 16, e0256990.
Müller K-R: Schnet–a deep learning architecture for mole- 59. Schoenholz S, Cubuk ED: Jax md: a framework for differen-
cules and materials. J Chem Phys 2018, 148, 241722. tiable physics. Adv Neural Inf Process Syst 2020, 33:
43. Wang H, Zhang L, Han J, Weinan E: Deepmd-kit: a deep 11428–11441.
learning package for many-body potential energy represen- 60. Doerr S, Majewski M, Pérez A, Kramer A, Clementi C, Noe F,
tation and molecular dynamics. Comput Phys Commun 2018, Giorgino T, De Fabritiis G: Torchmd: a deep learning frame-
228:178–184. work for molecular simulations. J Chem Theor Comput 2021,
44. Zhang Y, Wang H, Chen W, Zeng J, Zhang L, Wang H, Weinan E: 17:2355–2363.
Dp-gen: a concurrent learning platform for the generation of 61. Pascanu R, Mikolov T, Bengio Y: On the difficulty of training
reliable deep learning based potential energy models. recurrent neural networks. In International conference on ma-
Comput Phys Commun 2020, 253, 107206. chine learning. PMLR; 2013:1310–1318.
45. Smith JS, Nebgen B, Lubbers N, Isayev O, Roitberg AE: Less is 62. L. Metz, C. D. Freeman, S. S. Schoenholz, T. Kachman, Gradi-
more: sampling chemical space with active learning. J Chem ents are not all you need, arXiv preprint arXiv:2111.05803.
Phys 2018, 148, 241733.
63. Tesei G, Schulze TK, Crehuet R, Lindorff-Larsen K: Accurate
46. Csányi G, Albaret T, Payne M, De Vita A: Learn on the fly”: a * model of liquid–liquid phase behavior of intrinsically disor-
hybrid classical and quantum-mechanical molecular dy- dered proteins from optimization of single-chain properties.
namics simulation. Phys Rev Lett 2004, 93, 175503. Proc Natl Acad Sci USA 2021, 118, e2111696118.
47. Li Z, Kermode JR, De Vita A: Molecular dynamics with on-the- Title: Accurate model of liquid–liquid phase behavior of intrinsically
fly machine learning of quantum-mechanical forces. Phys disordered proteins from optimization of single–chain properties.
Rev Lett 2015, 114, 096405. Description: This work optimized a coarse-grained protein

www.sciencedirect.com Current Opinion in Structural Biology 2023, 78:102502




