Professional Documents
Culture Documents
Macromolecular Modeling and Design in Rosetta: Recent Methods and Frameworks
Macromolecular Modeling and Design in Rosetta: Recent Methods and Frameworks
https://doi.org/10.1038/s41592-020-0848-2
T
he understanding that molecular structure determines modeling, protein design, and interaction with peptides and nucleic
biological function has motivated decades of experimen- acids (Fig. 1 and Tables 1 and 2). Over more than 20 years, the com-
tal determination of protein structure and function. Many munity of developers and scientists, the RosettaCommons, grew
computational packages have been developed to guide experimen- from a single academic laboratory to laboratories at over 60 institu-
tal methods and elucidate macromolecular structure, including tions wordwide26. The software has undergone several transitions,
Rosetta. Rosetta offers capabilities spanning many bioinformatics including in programming language and implementation, with the
and structural-bioinformatics tasks. Computational structural biol- latest protocols based on Rosetta3, first released in 200827. The score
ogy frameworks with similarly comprehensive scope are few, but function has been continuously improved and was described in
key to progress in biology. Schrödinger1, the Molecular Operating refs. 28,29. As part of our sustained focus on accessibility, usability and
Environment2 and Discovery Studio3 are computational chemistry scientific reproducibility, we developed several interfaces (PyRosetta30,
platforms for advanced modeling and design for structural biology, RosettaScripts31, Foldit32) and emphasized publishing protocol cap-
drug discovery and material science, based on molecular mechan- tures33 to accompany manuscripts. As those interfaces have grown
ics, molecular dynamics and quantum mechanics calculations. The more versatile and modular, development has accelerated and
HHSuite4 includes tools for bioinformatics, sequence alignments, branched in many directions. However, the interoperability, exten-
structure prediction and modeling. The BioChemicalLibrary5 sibility and modularity enable scientists to combine modules in a
(BCL) includes tools for structure prediction and drug discovery, wide variety of combinations, making it difficult to keep up with all
and several sequence-to-structure methods using machine learn- the developments within the software and the scientific community.
ing approaches. The Integrative Modeling Platform6 (IMP) models Here we have compiled the latest method developments in Rosetta
large macromolecular complexes by incorporating various types of from the past 5 years, divided into several categories; we provide
experimental data. OpenBabel7 is a ChemInformatics toolbox sup- direction on where to find further information for specific model-
porting molecular mechanics calculations, being most heavily used ing problems. The Supplementary Note contains more details on
for interconversion of file formats. the protocols, with extensive links to documentation, resources on
Molecular dynamics packages like CHARMM8, AMBER9, the web, limitations, and competitors.
GROMACS10 and others simulate most atoms explicitly with a
physics-based energy function that relies on solving Newton’s equa- General overview and challenges
tion of motion. These methods can be used for folding small pro- A typical Rosetta protocol is outlined in Fig. 2a: the conformation of
teins, model refinement, modeling phenomena such as ion flow a biomolecule (the Pose) is altered, either deterministically or sto-
through membrane channels, and modeling interactions with small chastically, via a Mover and the resulting conformation is evaluated
molecules and are therefore highly complementary to Rosetta. by a ScoreFunction. The move is accepted based on the Metropolis
OpenMM11 is an API (application programming interface) for set- criterion and the energy difference between the original and the
ting up molecular simulations and can be used as a library or stand- new conformation:
alone application.
Many other tools are available for more specialized tasks — if Enew <Eorig accept
for instance, for de novo modeling (AlphaFold12,13, QUARK14,
RaptorX15), homology modeling (Modeller16, SwissModel17), fold
recognition (iTasser18), protein–protein docking (HADDOCK19, if Enew ≥ Eorig accept with probability P ¼ e�ððEnew �Eorig Þ=T Þ
Zdock20, ClusPro21), ligand docking (AutoDock22, FlexX23, Glide24)
and many other tasks requiring molecular modeling. As the focus Many independent trajectories are generated, and the final
here is on Rosetta developments, a comprehensive list of related models are evaluated based on the scientific objective. This setup
methods is listed in the Supplementary Note. highlights common limitations in Rosetta protocols involving
Development of Rosetta started in the mid-1990s; it was initially sampling, scoring (discussed in “Rosetta’s score function” below),
aimed at protein structure prediction and protein folding25. Over or technical challenges. Many protocols suffer from undersam-
time, the number of applications grew to address diverse model- pling34, especially when flexibility is involved. Sampling is a limi-
ing tasks, from protein–protein or protein–small molecule docking tation for structure prediction (especially for large structures),
to incorporating nuclear magnetic resonance (NMR) data, loop protein design and unconstrained global protein–protein docking.
A full list of authors and affiliations appear at the end of the paper.
For example, even with local docking we are limited by backbone pairwise interactions that have multi-body, cooperative properties.
flexibility and performance deteriorates with larger flexibility in The HBNet protocol47 has been used to design de novo coiled coils
the binding interface. Small-molecule docking similarly relies with interaction specificity mediated by designed hydrogen bond
on correct identification of the binding interface and is limited networks, including homo-oligomers47, membrane proteins48 and
by flexibility between unbound and bound states. Enormous con- large sets of orthogonal heterodimers49. An improvement to HBNet
formational search spaces are also prohibitive for RNA modeling uses a Monte Carlo search to sample hydrogen bond networks with
because of the size and combinatorics of the torsion space (see notably improved performance50. We further developed a statistical
“Modeling nucleic acids and their interactions with proteins” potential to place highly coordinated water molecules on the surface
below), membrane proteins because of their size, and carbohy- of biomolecules. On a dataset of 153 high-resolution protein–pro-
drates because of branching and flexibility. tein interfaces, the method predicts 17% of native interface waters
Some Rosetta applications suffer from technical challenges in with 20% precision within 0.5 Å of the crystallographic water posi-
implementation; a lack of documentation, protocol captures or tions51. The potential is accessible through the ExplicitWaterMover
support; and a need for more diverse chemistries for biomolecules. (formerly WaterBoxMover) in RosettaScripts.
Technical challenges are either historical or due to lack of inter- There are still several limitations to the score function. (1) It does
est in the community to develop and advance methods in these not directly estimate entropy52, which has been shown to improve
unique areas. sampling efficiency53. However, rotamer bond angles, solvation,
fragments and pair terms all implicitly model this component of
Rosetta’s score function the free energy, which at these temperatures and solvation densi-
Rosetta’s score function has been continuously improved over many ties account for more than half of the entropy. (2) In most cases,
years35 with guiding principles including improving speed of com- knowledge-based score terms are derived from high-resolution
putation, increasing extensibility and improving accuracy across crystal structures, representing a single state on the energy land-
multiple tasks. The main score function is a linear combination of scape, and do not represent flexibility well (in comparison to
weighted score terms that balances physics-based and statistically solution NMR). (3) Knowledge-based terms are less interpre-
derived potentials describing respectively van der Waals energies, table and transferable than physics-based terms. (4) Scoring per-
hydrogen bonds, electrostatics, disulfide bonds, residue solvation, formance scales with the number of score terms and has become
backbone torsion angles, sidechain rotamer energies, and an aver- slower, although more accurate, over time. (5) The solvation model
age unfolded state reference energy (Fig. 2b): is implicit, and hence fast, but hinders explicit modeling of ions,
water molecules or lipid environments. (6) Several score functions
¼ EvdW þ Ehbond þ Eelec þ Edisulf for specific applications (RNA, membrane proteins, carbohydrates,
þEsolv þ EBBtorsion þ Erotamer þ Eref non-canonical amino acids) are still developing.
Some energy terms are decomposed into several components Major applications
to parameterize each of them separately. For instance, the van der Predicting protein structures. Rosetta was originally developed for
Waals energy is split into attractive and repulsive terms between de novo protein structure prediction, assembling fragments from
different residues, in addition to an intra-residue repulsive term. known protein structures via a Monte Carlo procedure and evaluat-
A detailed account of the all-atom score function was published ing the models with the score function. While the community’s main
recently28. goals have moved to macromolecular design over the past decade,
The newest score function29, REF2015, reproduces ther- performance in the CASP13 blind prediction challenge remains
modynamic observables (such as liquid-phase properties36 respectable54, with ranking for refinement and prediction of mul-
and liquid-to-vapor transfer free energies37) in addition to timeric complexes among the top three groups. Meanwhile, other
structure-based38 tests. It also utilizes a new, derivative-free optimi- groups have refined their tools exploiting evolutionary couplings
zation technique, which is suitable for robust optimization of >100 and machine learning: for instance Google’s DeepMind developed
parameters. Further, a new energy term was added that takes into AlphaFold12,13, which uses Rosetta for refinement, with outstand-
consideration non-ideality of bond lengths and angles in cartesian ing performance in the recent CASP1354. Another highly ranking
space39. The cartesian term39 is also the basis for a cartesian_ddG method is the Zhang server, built on iTasser14 and QUARK14.
method, which has been used to calculate ΔΔG values of mutations Homology modeling was improved by using multiple templates
(where ΔG is the free energy of folding) to assess changes in protein in RosettaCM55 (now available on the new Robetta56,57 server), which
stability. Only the backbones and side chains of residues near the hybridizes the most homologous portions from multiple templates
mutation site are allowed to move40. Due to the local optimization, into a single model while modeling missing residues de novo55.
this protocol is much faster than the previous gold-standard ddg_ Without a template, predicting protein structures de novo remains
monomer41 while retaining the same level of accuracy. REF2015 one of the most challenging tasks in structural biology, even
is now compatible with an expanded palette of chemical building though the incorporation of evolutionary coupling constraints (for
blocks—canonical and non-canonical l-α-amino acids and their instance, from GREMLIN58) has led to enormous improvements in
d-amino acid counterparts, exotic achiral amino acids, peptoids model quality. An iterative hybridization approach improves sam-
and oligoureas—and can model metalloproteins42. Score functions pling and uses a genetic algorithm that recombines models from
that enable simultaneous modeling of protein and RNA are being an input pool to create models that have features from their parents
explored43. REF2015 is now thread-safe and fully mirror symmet- but are also distinct. Creating several child models in each iteration,
ric; that is, enantiomers in mirror conformations score identically. updating the input pool, and performing 30–50 iterations lead to
Guidance energy terms for design have been added to encour- improved model accuracy because features that are scored favorably
age certain features, such as specific amino acid compositions44,45, are repeatedly used in the recombination, such that the models in
hydrogen bonding networks, or global or local net charges, and dis- the pool converge over time. Iterative hybridization has been used
courage others, such as repeat sequences that hinder NMR assign- to improve model quality of de novo predicted models59 as well as
ments, buried unsatisfied hydrogen bond donors and acceptors, or homology models60. Model refinement or generating ensembles
voids within the protein46. of structures (useful for design) can be accomplished by several
Hydrogen bond networks are important for biomolecular struc- algorithms in Rosetta: FastRelax61, Backrub62 or vicinity sampling63
ture and catalysis but have been challenging to design because of using kinematic closure (KIC) or next-generation-KIC (NGK) loop
Biomineral surface
Structure prediction Protein–protein docking Ligand docking Loop modeling docking
O R3
Tasks
N
N
R2 O
Systems
Modeling with Non-canonical
Protein design experimental data Symmetric assemblies chemistries
Fig. 1 | Capabilities of the Rosetta macromolecular modeling suite. Some popular tasks that can be addressed in Rosetta (blue) and major systems that
can be modeled (red). Note that this is an incomplete list of Rosetta’s broad modeling capabilities.
modeling64. Loop modeling65 was implemented early in Rosetta66,67, For symmetric homomers, Rosetta SymDock272 uses the same
with initial approaches relying on fragment sampling and itera- six-dimensional scoring scheme as RosettaDock. Symmetry infor-
tive cyclic coordinate descent (CCD)68 for chain closure. Later, a mation can be extracted from a homologous complex, or from a
KIC approach relied on polynomial resultants to analytically solve global docking search for a given point symmetry using our sym-
for closed conformations, producing more native-like loops69,70. metry framework73. An induced-fit-based all-atom refinement
Next-generation KIC64 is an innovation that improves sampling step relieves clashes in tightly packed complexes to give physically
by employing diversification (that is, wider range of conforma- realistic models. On a benchmark set of 43 complexes with differ-
tions) and intensification (that is, focus around previously gen- ent cyclic and dihedral symmetries, global docking on homology
erated conformations), substantially increasing the fraction of models had accuracies of 61% and 42% for cyclic and dihedral sym-
near-native models64 and modeling longer loops. A related method, metries, respectively72. These accuracies can be markedly improved
GeneralizedKIC44 (GenKIC), samples loop geometries between when adding restraints.
fixed endpoints, including non-standard peptide chemistries or
chemistries that conventional loop-modeling algorithms do not Docking small-molecule ligands into proteins. Structure-based
typically handle. drug design has become a key drug optimization tool and leverages
the vast array of knowledge contained in the increasing numbers of
Modeling protein–protein complexes. Another early expan- deposited structures in the PDB. RosettaLigand74 has demonstrated
sion of Rosetta’s functionality was RosettaDock, a method for success in predicting small molecule–protein interactions. Later in
predicting the structure of protein–protein complexes. The latest the drug development process, medicinal chemists optimize ligands
version, RosettaDock4.071, incorporates protein flexibility from on the basis of structure–activity relationships by synthesizing dif-
pre-generated protein ensembles, mimicking conformer selection. ferent ligands that share a core chemical scaffold and are assumed to
This has improved sampling efficiency by automatically adjusting bind to their target in a similar fashion75. RosettaLigandEnsemble76
the sampling rate on the basis of the diversity of the input ensembles. improves sampling during ligand docking by taking advantage
Scoring has been improved by a six-dimensional coarse-grained of ligand similarities and docking a congeneric series of ligands
scoring scheme called motif_dock_score, employing score grids simultaneously, allowing a placement that works for all considered
generated from known complexes in the Protein Data Bank (PDB). ligands while optimizing the binding interface for each ligand inde-
In local docking benchmarks with backbone deviations of up to pendently. Experimental structure–activity relationships can help
2.2 Å, RosettaDock4.0 successfully docked ~50% of complexes71. identify preferred binding modes. Small-molecule ligands can also
Mover Edisulf
Esolv
Eelec
ScoreFunction ScoreFunction
evaluates
Ehbond
whether new New conformation
conformation is created during a move
is accepted Erotamer
EBBtorsion
EvdW Lennard–Jones for attractive or repulsive interaction Esolv Implicit solvation model penalizes buried polar atoms
Ehbond Hydrogen bonding allows buried polar atoms EBBtorsion Backbone torsion preferences from main-chain potential
Eelec Electrostatic interaction between charges Erotamer Side-chain torsion angles from rotamer library
Edisulf Disulfide bonds between cysteines Eref Unfolded state reference energy for design
Fig. 2 | Main elements of Rosetta are scoring and sampling. a, Three main elements are required in a Rosetta protocol. The Pose is the biomolecule, such
as a protein, RNA, DNA, small molecule, or glycan, in a specific conformation. Residues in the Pose can be selected via ResidueSelectors and the behavior
for side-chain optimization or mutation can be defined by TaskOperations. Specific Movers then control how the conformation of the Pose is changed, and
the new conformation is subsequently evaluated by a ScoreFunction. The Metropolis criterion decides whether the new conformation is accepted during
sampling. Many independent sampling trajectories are generated, and the final models are evaluated according to the purpose of the protocol. b, The score
function consists of a weighted linear combination of various score terms, highlighted in the figure and described in the text.
of novel antibody models97,98. The models are docked to a target of SEWING creates de novo designs by recombining parts of pro-
interest, either locally to a specific epitope or globally, followed by tein structures from randomly selected helical building blocks103.
an optimization step comprising rigorous backbone sampling and SEWING’s requirement-driven approach allows users to specify fea-
sequence design for improving model stability and binding affinity. tures that should be incorporated into their designs during backbone
generation without requiring a certain size or three-dimensional
Designing new proteins and functions. Protein design99 relies on fold. New features include incorporation of functional motifs
several of the same core functionalities needed for structure predic- such as protein-binding peptides for protein interface design and
tion, and synergy and interoperability between design and predic- ligand binding sites for ligand-binding protein design104. A similar
tion models has always been a core Rosetta principle. For example, algorithm was implemented for antibody design (AbDesign, see
this synergy is well illustrated by the biased forward folding method: above), which was generalized for enzyme design105. A more general
during de novo protein design100, a test for the consistency of the approach is RosettaRemodel, performing protein design by rebuild-
designed sequence is whether ab initio structure prediction will ing parts or all of the structure106 from fragments of known pro-
yield the same structure that was used as a starting point for the tein structures. RosettaRemodel uses a blueprint file in which the
design. However, computationally testing a large number of designs user defines secondary and supersecondary structure of the desired
is prohibited by the vast conformational search space for ab initio fold. Remodel interfaces with various Rosetta protocols and allows
structure prediction. To limit that space and test more designs, de novo modeling; fixed-backbone sequence design; refinement;
biased forward folding101 uses 3 (instead of 200) fragments per loop insertion, deletion and remodeling; disulfide engineering;
residue position, with fragments being chosen on the basis of the domain assembly; and motif grafting.
r.m.s. deviation to the native structure used to instantiate the design A common task is not only design toward a certain goal (posi-
process. Protein design is easier when starting from known struc- tive design), but design away from undesired features (negative
tures and when redesigning for well understood objectives such as design). Such a multi-state design107 approach evaluates strengths
thermostability102. More difficult design objectives include de novo and weaknesses of a single sequence on multiple backbones — for
design (without a template structure) and design for novel folds or instance, binding to one but not another protein partner. REstrained
functions. Successes in these cases require sampling of enormous CONvergence108 (RECON) allows each state to sample multiple
conformational spaces, depending on the protein size. Another sim- sequences during the design process, which is iteratively applied
plification of de novo design is thermostabilization of the protein, by increasing the restraint weight to encourage sequence conver-
essentially creating rigid structures that are mostly non-functional, gence. RECON achieves on average 70% sequence recovery (a 30%
by expanding the energy gap between folded and unfolded designs increase compared to multi-state design) for large multi-state design
to facilitate structural characterization. To date, novel functional problems, such as antibody affinity maturation or the prediction of
designs mostly exploit known structures, and the next frontier is the evolutionary sequence profiles of flexible backbones109,110.
design of novel functions onto de novo scaffolds. Moreover, nature Protein function can be designed by motif grafting—that is,
typically does not design for the global minimum energy confor- grafting a known motif or predicted active or binding site from
mation (in terms of stability) because proteins require flexibility to a template structure onto a new protein. This approach has been
carry out their functions. used for antibodies and vaccine design111 using the fold_from_loops
Design of novel protein structures and functions toward thera- application, where the functional motif is used as a starting point
peutic intervention is addressed by various methods in Rosetta. of an extended structure that is folded following the constraints of
Fig. 4 | User interfaces to the codebase. a, Rosetta can be run from a terminal and offers three interfaces to the codebase. The top panel outlines the task
to be accomplished: making two mutations in a protein and then refining the structure. The panels underneath show how this task can be accomplished
in the different interfaces. The command line panel shows the executable, input files and options to run two specific applications. RosettaScripts is an
XML-based scripting language that offers more flexibility by combining Movers and ScoreFunctions into a custom Protocol. PyRosetta offers direct access
to the underlying code objects but requires knowledge of the codebase. b, Point-and-click interfaces to the codebase. InteractiveRosetta is a graphical
user interface (GUI) to PyRosetta. It offers controls to the most popular protocols, file formats and options. Foldit is a video game primarily used to
crowd-source real-world scientific puzzles and is also usable on custom proteins of interest. It can run some popular applications via a game interface.
ROSIE hosts a multitude of servers, each executing a particular protocol. It currently includes servers for 25 Rosetta methods. (InteractiveRosetta and
Foldit panels reprinted from refs. 184,214 under Creative Commons licenses.).
measurements for parameterization of these models and adapta- such as loop modeling (GenKIC and Stepwise Assembly), refinement
tion of a multitude of score terms. (GlycanTreeModeler), symmetry, and RosettaScripts-accessible
classes such as MoveMaps and ResidueSelectors. Linkages are auto-
Adding carbohydrates to the modeling process. Carbohydrates matically determined during PDB read-in. Carbohydrates work
are fundamental to life179,180, but because of challenges in experi- with Cartesian minimization and can be refined into electron den-
mental characterization and computational sampling and scor- sity maps129. Limitations in the carbohydrate framework include
ing, their structures have been historically under-studied. The the increased sampling space due to carbohydrate flexibility and
RosettaCarbohydrate framework128 models carbohydrate structures branching, and the need to model many different chemistries with
and complexes such as glycosylated proteins or protein–sugar com- possible branching and cyclization. Developments in this area have
plexes (Fig. 3f) with the same algorithms one would use for proteins. only recently started, and much work remains.
RosettaCarbohydrate can handle commonly studied and uncom-
mon carbohydrate structures, including linear, cyclic and branched User interfaces and usability
structures, sugar modifications, and conjugations. Methods exist Advances have also focused on improving usability of Rosetta
for sampling ring conformations, packing substituents, refining gly- through several user interfaces to suit different use cases and work-
cosidic linkages, sampling from linkage ‘fragments’, and extending flow styles (Fig. 4). The command line was the first and is still
glycan chains. Scoring of saccharide-containing sugars includes a the most often used interface to Rosetta methods. Additionally,
quantum-mechanically derived intrinsic backbone term181. Because Rosetta features two popular scripting interfaces: RosettaScripts
saccharide residues are stored as distinct data structures, we can and PyRosetta. RosettaScripts31 uses Extensible Markup Language
integrate bioinformatic and statistical data into these algorithms, (XML) to build complex protocols using core machinery27, with-
opening the door for glycoengineering and design applications. out requiring knowledge of the codebase. PyRosetta30,182 is a collec-
RosettaCarbohydrate has been integrated with other frameworks, tion of Python bindings to the source code, allowing flexible and
1
Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA. 2Department of Biology, New York University, New York,
New York, USA. 3Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, USA. 4Department of Biochemistry,
University of Washington, Seattle, WA, USA. 5Institute for Protein Design, University of Washington, Seattle, WA, USA. 6Lyell Immunopharma Inc.,
Seattle, WA, USA. 7Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. 8Department of
Biochemistry, Duke University, Durham, NC, USA. 9Cyrus Biotechnology, Seattle, WA, USA. 10Department of Immunology and Microbiology, The Scripps
Research Institute, La Jolla, CA, USA. 11Department of Microbiology and Molecular Genetics, IMRIC, Ein Kerem Faculty of Medicine, Hebrew University
of Jerusalem, Jerusalem, Israel. 12Department of Chemistry and Biochemistry, Ohio State University, Columbus, OH, USA. 13Graduate Program in
Bioinformatics, University of California San Francisco, San Francisco, CA, USA. 14Institute of Bioengineering, École Polytechnique Fédérale de Lausanne,
Lausanne, Switzerland. 15Baylor College of Medicine, Department of Pharmacology, Houston, TX, USA. 16Biological Physics Structure and Design PhD
Program, University of Washington, Seattle, WA, USA. 17Department of Pharmacology, Vanderbilt University, Nashville, TN, USA. 18Institute of Quantitative
Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ, USA. 19Swiss Institute of Bioinformatics, Lausanne, Switzerland. 20Fred
Hutchinson Cancer Research Center, Seattle, WA, USA. 21Department of Biological Sciences, Rensselaer Polytechnic Institute, Troy, NY, USA. 22Khoury
College of Computer Sciences, Northeastern University, Boston, MA, USA. 23Department of Biochemistry, Stanford University School of Medicine,
Stanford, CA, USA. 24DSM Biotechnology Center, Delft, the Netherlands. 25Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, PA, USA.
26
Department of Chemistry, Vanderbilt University, Nashville, TN, USA. 27University of Maryland Institute for Bioscience and Biotechnology Research,
Rockville, MD, USA. 28Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD, USA. 29Program in Molecular
Biophysics, Johns Hopkins University, Baltimore, MD, USA. 30Faculty of Chemistry, Biological and Chemical Research Centre, University of Warsaw,
Warsaw, Poland. 31Department of Chemistry & Biochemistry, University of Denver, Denver, CO, USA. 32The Knoebel Institute for Healthy Aging, University
of Denver, Denver, CO, USA. 33Research School of Chemistry, Australian National University, Canberra, Australian Capital Territory, Australia. 34Program in
Bioinformatics and Computational Biology, Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
35
Center for Computational Biology, University of Kansas, Lawrence, KS, USA. 36Biophysics Program, Stanford University, Stanford, CA, USA. 37Institute
for Computational Science, University of Zurich, Zurich, Switzerland. 38S3IT, University of Zurich, Zurich, Switzerland. 39Department of Chemistry and
Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ, USA. 40Department of Chemistry and Chemical Biology, The State
University of New Jersey, Piscataway, NJ, USA. 41Center for Integrative Proteomics Research, Rutgers, The State University of New Jersey, Piscataway,
NJ, USA. 42Computational Biology and Molecular Biophysics Program, Rutgers, The State University of New Jersey, Piscataway, NJ, USA. 43Department
of Computer and Information Science, University of Massachusetts Dartmouth, Dartmouth, MA, USA. 44Department of Bioengineering and Therapeutic
Sciences, University of California San Francisco, San Francisco, CA, USA. 45Center for Structural Biology, Vanderbilt University, Nashville, TN, USA.
46
Medical Device Development and Regulation Research Center, School of Engineering, University of Tokyo, Tokyo, Japan. 47Department of Bioengineering,
School of Engineering, University of Tokyo, Tokyo, Japan. 48Department of Chemistry, Franklin & Marshall College, Lancaster, PA, USA. 49Department of
Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel. 50Division of Infection Medicine, Department of Clinical Sciences Lund, Faculty of
Medicine, Lund University, Lund, Sweden. 51Institute for Research in Biomedicine Barcelona, The Barcelona Institute of Science and Technology, Barcelona,
Spain. 52Departments of Chemistry, Pharmacology and Biomedical Informatics, Vanderbilt University, Nashville, TN, USA. 53Institute for Chemical Biology,
Vanderbilt University, Nashville, TN, USA. 54Department of Computer Science, University of California Santa Cruz, Santa Cruz, CA, USA. 55Molecular and
Cellular Biology Program, University of Washington, Seattle, WA, USA. 56Chemical and Physical Biology Program, Vanderbilt Vaccine Center, Vanderbilt
University, Nashville, TN, USA. 57Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA, USA. 58Department of
Chemistry, University of California, Davis, Davis, CA, USA. 59Department of Biochemistry and Molecular Medicine, University of California, Davis, Davis,
California, USA. 60Genome Center, University of California, Davis, Davis, CA, USA. 61Department of Computer Science, New York University, New York,
NY, USA. 62Center for Data Science, New York University, New York, NY, USA. 63These authors contributed equally: Julia Koehler Leman, Brian D. Weitzner,
Steven M. Lewis. ✉e-mail: Julia.koehler.leman@nyu.edu; Bonneau@nyu.edu