Figure 1
During unsupervised training (a), a generative decoder learns to generate proteins similar to those in the unsupervised training set from embedding vectors produced by the encoder. The embeddings can then be used as inputs to a predictor for downstream modeling tasks such as activity or stability prediction, or the encoder can be fine-tuned together with the predictor (b). The decoder can be used to generate new functional sequences (c), or the entire generative model can be tuned to generate functional sequences optimized for a desired property (d).
data are used to train regression models. These models are then used as estimates for the true experimental value and can be used to search through and identify beneficial sequences in silico.

Learned representations have the potential to be more informative than one-hot encodings of sequence or amino acid physicochemical properties. They encode discrete protein sequences in a continuous and compressed latent space, wherein further optimization can be performed. Ideally, these representations capture contextual information [16] that simplifies downstream modeling. Notably, these representations do not always outperform simpler representations if training with a large fraction of the total data [17].

For example, in BioSeqVAE [18], the latent representation was learned from 200,000 sequences between 100 and 1000 amino acids in length obtained from Swiss-Prot [8]. The authors demonstrate that a simple random forest classifier from scikit-learn [19] can be used to learn the relationship between roughly 60,000 sequences (represented by the outputs of the VAE encoder) and their protein localization and enzyme classification (by Enzyme Commission number) in a downstream prediction task. By optimizing in latent space for either property through the downstream models and decoding this latent representation to a protein sequence, the authors generate examples that have either one or both desired properties. Although the authors did not validate the generated proteins in vitro, they did observe sequence homology between their designed sequences and natural sequences with the desired properties.
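To make the workflow concrete, the sketch below (our illustration, not code from [18]) treats the pretrained encoder as a fixed feature extractor: a hypothetical `encode` function stands in for the VAE encoder, and a scikit-learn random forest is fit on the resulting embeddings for a downstream classification task.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def encode(sequences):
    # Hypothetical stand-in for the pretrained VAE encoder:
    # returns one fixed-length latent vector per protein sequence.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sequences), 64))

sequences = ["MKTAYIAKQR", "MVLSPADKTN", "MASNTVSAQG", "MKVLWAALLV"]
labels = [0, 1, 0, 1]  # e.g. a localization or enzyme class per sequence

X = encode(sequences)  # fixed embeddings from the generative model's encoder
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, labels, cv=2))  # downstream predictor on top of the embeddings
```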
While the previous study used the output from a pretrained network as a fixed representation, another approach is to fine-tune the generative network for the new task. Autoregressive models are trained to predict the next token in a sequence from the previous tokens (Appendix A). When pretrained on large databases of protein sequences, they have stronger performance than other architectures on a variety of downstream discriminative tasks, such as predicting protein function classification, contacts, and secondary structure [20–22,11]. There are few examples of experimental validation in this space owing to the cost and time of wet lab experiments needed to physically verify computational predictions. However, Biswas et al [23] demonstrated that a double fine-tuning scheme results in discriminative models that can find improved sequences after training on very few measured sequences. First, they train an autoregressive model on sequences in UniRef50 [24]. They then fine-tune the autoregressive model on evolutionarily related sequences. Finally, they use the activations from the penultimate layer to represent each position of an input protein sequence in a simple downstream model (in a second round of fine-tuning), showing promising results on two tasks: improving the fluorescence activity of Aequorea victoria green fluorescent protein and optimizing TEM-1 β-lactamase. After training on just 24 randomly selected sequences, this approach consistently outperforms one-hot encodings, with 5–10 times the hit rate (defined as the fraction of proposed sequences with activity greater than the wild type). The authors show that the pretrained representation separates functional and nonfunctional sequences, allowing the final discriminator to focus on distinguishing the best sequences from mediocre but functional ones. Biasing the initial input set for functional variants further optimizes evolution [25].
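A minimal sketch of a low-N pipeline of this kind (ours, under the assumption that a fine-tuned language model exposes per-position activations; `embed_positions` below is a hypothetical stand-in) pools the activations into one vector per sequence, fits a small regressor on a handful of measured variants, ranks candidates by predicted activity, and scores the proposals by the hit rate.

```python
import numpy as np
from sklearn.linear_model import Ridge

def embed_positions(seq):
    # Hypothetical stand-in for penultimate-layer activations of a fine-tuned model.
    rng = np.random.default_rng(sum(map(ord, seq)))
    return rng.normal(size=(len(seq), 128))  # (sequence length, hidden dimension)

def featurize(seqs):
    # Mean-pool the per-position representation into one vector per sequence.
    return np.stack([embed_positions(s).mean(axis=0) for s in seqs])

# A small labelled set, standing in for ~24 measured variants.
train_seqs = ["MKT" + c for c in "AILVFWYGSTCMNQ"]
train_activity = np.linspace(0.1, 2.0, len(train_seqs))
model = Ridge(alpha=1.0).fit(featurize(train_seqs), train_activity)

# Rank proposed variants by predicted activity; the hit rate is the fraction of
# proposals whose measured activity exceeds the wild type (set to 1.0 here).
candidates = ["MKT" + c for c in "DEKRHP"]
predicted = model.predict(featurize(candidates))
proposals = [s for _, s in sorted(zip(predicted, candidates), reverse=True)][:3]
measured = {s: 0.8 + 0.2 * i for i, s in enumerate(proposals)}  # placeholder assay values
hit_rate = np.mean([measured[s] > 1.0 for s in proposals])
```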
Protein sequence generation

In addition to improving function predictions in downstream modeling, generative models can also be used to generate new functional protein sequences. Here, we describe recent successful examples of sequences generated using VAEs, GANs, and autoregressive models.

Hawkins-Hooker et al [26] generate functional luciferases from two VAE architectures: (1) by computing the multiple sequence alignment (MSA) first and then training a VAE (MSA VAE) on the aligned sequences and (2) by introducing an autoregressive component to the decoder to learn the unaligned sequences (AR VAE). Motivated by a similar model used for text generation [27], the decoder of the AR VAE contains an upsampling component, which converts the compressed representation to the length of the output sequence, and an autoregressive component. Both models were trained with roughly 70,000 luciferase sequences (~360 residues) and were quite successful: 21 of 23 and 18 of 24 variants generated with the MSA VAE and AR VAE, respectively, showed measurable activity.

The authors of ProteinGAN successfully trained a GAN to generate functional malate dehydrogenases [28]. In one of the first published validations of GAN-generated sequences, after training with nearly 17,000 unique sequences (average length: 319), 13 of 60 sequences generated by ProteinGAN display near-wild-type-level activity, including a variant with 106 mutations from the closest known sequence. Interestingly, although the positional entropy of the final set of sequences closely matched that of the initial input, the generated sequences expand into new structural domains as classified by CATH [29], suggesting structural diversity in the generated results.

Shin et al [30] applied autoregressive models to generate single-domain antibodies (nanobodies). As the nanobody’s complementarity-determining region is difficult to align owing to its high variation, an autoregressive strategy is particularly advantageous because it does not require sequence alignments. With hundreds of thousands of antibody sequences, the authors trained a residual dilated convolutional network over 250,000 updates. While other (recurrent) architectures were tested to capture longer-range information, exploding gradients were encountered, as is common in these architectures. After training, the authors generated more than 37 million new sequences by sampling amino acids at each new position in the sequence. Further clustering, diversity selection, and removal of motifs that may make expression more challenging (such as glycosylation sites and sulfur residues) enabled the researchers to winnow this number to below 200,000, for which experimental results are pending.
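The sampling step itself is conceptually simple. The sketch below (ours, not the authors' code) assumes a trained autoregressive model is available through a hypothetical `next_residue_probs` function and draws one residue at a time, conditioned on the residues generated so far.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
STOP = "*"

def next_residue_probs(prefix):
    # Hypothetical stand-in for a trained autoregressive model p(x_t | x_<t).
    rng = np.random.default_rng(len(prefix))
    logits = rng.normal(size=len(AMINO_ACIDS) + 1)  # 20 residues plus a stop token
    return np.exp(logits) / np.exp(logits).sum()

def sample_sequence(max_len=120, seed=0):
    rng = np.random.default_rng(seed)
    seq = ""
    for _ in range(max_len):
        token = rng.choice(AMINO_ACIDS + [STOP], p=next_residue_probs(seq))
        if token == STOP:
            break
        seq += token
    return seq

samples = [sample_sequence(seed=i) for i in range(5)]  # in practice, millions of samples are drawn
```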
Wu et al. applied the Transformer encoder-decoder model [31] to generate signal peptides for industrial enzymes [32]. Signal peptides are short (15–30 amino acid) sequences prepended to target protein sequences that signal the transport of the target sequence. After training with 25,000 pairs of target and signal peptide sequences, signal peptide sequences were generated and tested in vitro, with roughly half of the generated sequences resulting in secreted and functional enzymes after expression in Bacillus subtilis.

Although this work suffices as an early set of experimentally verified examples, there are many improvements that can be made, such as introducing information on which to condition generation. Sequences are typically designed for a specific task, and task-specific information can be incorporated in the training process [33]. For example, a VAE decoder can be conditioned on the identity of the metal cofactors bound [34]. After training on 145,000 enzyme examples in MetalPDB [35], the authors find a higher fraction of desired metal-binding sites observed in generated sequences. In addition, 11% of 1000 sequences generated to recreate a removed copper-binding site identified the correct binding amino acid triad. The authors also applied this approach to design specific protein folds, validating their results with Rosetta and molecular dynamics simulations. In ProGen, Madani et al [36] condition an autoregressive sequence model on protein metadata, such as a protein’s functional and/or organismal annotation. While this work does not have functional experimental validation, after training on 280 million sequences and their annotations from various databases, the authors show that computed energies from Rosetta [37] of the generated sequences are similar to those of natural sequences.
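A common way to implement this kind of conditioning (a sketch of the general idea, not the specific architectures of [34] or [36]) is to prepend a metadata tag to each training sequence so that an autoregressive model learns p(sequence | annotation), and then to prompt with the same tag at generation time. The tag names and sequences below are made up for illustration.

```python
# Prepend a metadata tag to each training sequence; the model then conditions on it.
def tag_sequence(annotation, sequence):
    return f"<{annotation}> {sequence}"

training_pairs = [
    ("EC:1.1.1.37", "MKVAVLGAAGGIGQALALLLK"),  # e.g. a malate dehydrogenase fragment
    ("Cu-binding", "MHSHGGHDHSHEH"),            # e.g. a copper-binding annotation
]
corpus = [tag_sequence(a, s) for a, s in training_pairs]

# At generation time, condition on the desired property by prompting with its tag:
prompt = "<Cu-binding> "
# new_sequence = autoregressive_model.sample(prompt)  # hypothetical trained model
```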
Optimization with generative models

Although much of the existing work is designed to generate ‘valid’ sequences, eventually, the protein engineer expects ‘improved’ sequences. An emerging approach to this optimization problem is to optimize with generative models [38–40]. Instead of generating viable examples, this framework trains models to generate optimized sequences by placing higher probability on improved sequences (Figure 1d).

One approach to optimization is to bias the data fed to a GAN. Amimeur et al trained Wasserstein GANs [41] on 400,000 heavy- or light-chain sequences from human antibodies to generate regions of 148 amino acids of the respective chain [40]. After initial training, by biasing further input data toward desired properties (length, size of a negatively charged region, isoelectric point, and estimated immunogenicity), the estimated properties of the generated examples shift in the desired direction. While it is not known from the experimental validation what fraction of the 100,000 generated constructs is functional, extensive biophysical characterization of two of the successful designs shows promising signs of retaining the designed properties in vitro. An alternative study developing a Feedback GAN (FBGAN) framework extends this idea by iteratively generating sequences from a GAN, scoring them with an oracle, and replacing the lowest-scoring members of the training set with the highest-scoring generated sequences [42].
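Schematically, one round of that feedback loop (our sketch of the procedure described in [42]; `gan` and `oracle` are hypothetical stand-ins for a trainable generator and a scoring function) looks like the following.

```python
def feedback_gan_round(training_set, gan, oracle, n_generate=100, n_replace=20):
    # One iteration of an FBGAN-style feedback loop.
    gan.train(training_set)                      # retrain the generator on the current set
    generated = gan.sample(n_generate)           # propose new sequences
    best_generated = sorted(generated, key=oracle, reverse=True)[:n_replace]

    # Drop the lowest-scoring members of the training set, add the
    # highest-scoring generated sequences, and repeat in the next round.
    survivors = sorted(training_set, key=oracle, reverse=True)[:len(training_set) - n_replace]
    return survivors + best_generated
```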
Fortunately, this optimization can be enforced algorithmically. The Design by Adaptive Sampling algorithm [43] improves the iterative retraining scheme by using a probabilistic oracle and weighting generated sequences by the probability that they are above the Qth percentile of scores from the previous iteration. This allows the optimization to become adaptively more stringent and guarantees convergence under some conditions. The authors validate this approach on synthetic ground-truth data by training models (of a different type) on real biological data. They then show that generated sequences outperform traditional evolutionary methods (and the previously mentioned FBGAN) when restricted to a budget of 10,000 sequences. The current iteration of this work, Conditioning by Adaptive Sampling, improves this approach by avoiding regions too far from the training set, where the oracle is unreliable [38], while other approaches focus on the oracle as the design moves between regions of sequence space [44] or emphasize diversity in the generated sequences [45].
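The reweighting step can be written compactly. The sketch below (ours, following the description above; the Gaussian oracle and the numbers are illustrative) weights each generated sequence by the oracle's probability that its score exceeds the Q-th percentile of the previous round's scores; the generative model is then retrained on its own samples with these weights.

```python
import numpy as np
from scipy.stats import norm

def dbas_weights(mu, sigma, scores_prev, q=0.9):
    # Weight = P(score > gamma) under a Gaussian oracle, where gamma is the
    # q-th percentile of the scores from the previous iteration.
    gamma = np.quantile(scores_prev, q)
    return 1.0 - norm.cdf(gamma, loc=mu, scale=sigma)

mu = np.array([1.2, 0.8, 1.5, 0.9, 1.1])      # oracle means for 5 generated sequences
sigma = np.array([0.3, 0.2, 0.4, 0.3, 0.2])   # oracle uncertainties
weights = dbas_weights(mu, sigma, scores_prev=np.array([0.7, 0.9, 1.0, 1.3]))
```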
Another approach [39] for model-based optimization has roots in reinforcement learning (RL) [46]. The RL framework is typically applied when a decision maker must choose an action from those available, given the current state. From this action, the state changes through the transition function, with some reward. When a given state and action are independent of all previous states and actions (the Markov property), the system can be modeled as a Markov decision process. This requirement is satisfied by interpreting protein sequence generation as a process wherein the sequence is generated from left to right. At each time step, we begin with the sequence as generated to that point (the current state), then select the next amino acid (the action), and add that amino acid to the sequence (the transition function). The reward remains 0 until generation is complete, and the final reward is the fitness measurement for the generated sequence.
The action (picking the next amino acid) is decided by a policy network, which is trained to output a probability over all available actions based on the sequence thus far and the expected future reward. Notably, the transition function is simple (adding an amino acid).
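In code, one episode of this left-to-right formulation is short. The sketch below is ours, not the authors' implementation: `policy_probs` is a hypothetical policy network and `fitness` a hypothetical surrogate for the terminal reward.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def policy_probs(state):
    # Hypothetical policy network: probabilities over the 20 possible next amino acids.
    rng = np.random.default_rng(len(state))
    logits = rng.normal(size=len(AMINO_ACIDS))
    return np.exp(logits) / np.exp(logits).sum()

def fitness(sequence):
    # Hypothetical surrogate reward; in practice it comes from models fit to assay data.
    return sequence.count("W") - sequence.count("P")

def rollout(length=50, seed=0):
    rng = np.random.default_rng(seed)
    state = ""                                   # current state: the sequence so far
    for _ in range(length):
        action = rng.choice(AMINO_ACIDS, p=policy_probs(state))
        state += action                          # transition: append the chosen amino acid
    return state, fitness(state)                 # reward is 0 until generation is complete

sequence, reward = rollout()
```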
The major challenge under the RL framework is then determining the expected reward. To tackle this issue, Angermueller and coauthors use a panel of machine learning models, each learning a surrogate fitness function from the data available after each round of experimentation [39]. The subset of models from this panel that pass some threshold accuracy (as empirically evaluated by cross-validation) is selected for use in estimating the reward, and the policy network is then updated based on the estimated reward. Thus, this algorithm enables using a panel of models to potentially capture various aspects of the fitness landscape, but only uses the models that have sufficient accuracy to update the policy network. The authors also incorporate a diversity metric by including a term in the expected reward for a sequence that counts the number of similar sequences previously explored. The authors applied this framework to various biologically motivated synthetic datasets, including an antimicrobial peptide (8–75 amino acids) dataset simulated with random forests. With eight rounds testing up to 250 sequences each, the authors obtained higher fitness values compared with other methods, including Conditioning by Adaptive Sampling and FBGAN. However, the authors also show that the diversity of the proposed sequences quickly drops, and only the diversity term added to the expected reward prevents it from converging to zero.
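As a rough sketch of that reward estimate (our reading of the description above, not the authors' code; the similarity rule and thresholds are invented for illustration): only surrogate models that pass a cross-validation threshold contribute, their predictions are averaged, and a penalty proportional to the number of similar sequences already proposed is subtracted.

```python
import numpy as np

def estimated_reward(sequence, models, cv_scores, history,
                     accuracy_threshold=0.5, diversity_weight=0.1):
    # Average the predictions of sufficiently accurate surrogate models and
    # penalize similarity to previously proposed sequences.
    trusted = [m for m, s in zip(models, cv_scores) if s >= accuracy_threshold]
    if not trusted:
        return 0.0
    surrogate = float(np.mean([m(sequence) for m in trusted]))

    def similar(a, b):  # illustrative definition of "similar"
        return len(a) == len(b) and np.mean([x == y for x, y in zip(a, b)]) > 0.8

    n_similar = sum(similar(sequence, prev) for prev in history)
    return surrogate - diversity_weight * n_similar

# Toy surrogate models and their cross-validation accuracies.
models = [lambda s: s.count("A") / len(s), lambda s: s.count("K") / len(s)]
cv_scores = [0.7, 0.4]  # the second model falls below the threshold and is ignored
print(estimated_reward("AKAAKLLA", models, cv_scores, history=["AKAAKLLV"]))
```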
Although significant work has been invested in optimizing protein sequences with generative models, this direction is still in its infancy, and it is not clear which approach or framework has general advantages, particularly as many of these approaches have roots in nonbiological fields. In the future, balancing increased sequence diversity against staying within each model’s trusted regions of sequence space [38,45], or against other desired protein properties, will be necessary to broaden our understanding of protein sequence space.

Conclusions and future directions

Machine learning has shown preliminary success in protein engineering, enabling researchers to access optimized sequences with unprecedented efficiency. These approaches allow protein engineers to efficiently sample sequence space without being limited to nature’s repertoire of mutations. As we continue to explore sequence space, expanding from the sequences that nature has kindly prepared, there is hope that we will find diverse solutions for myriad problems [47].

Many of the examples presented required testing many protein variants, and many of the advances in machine learning have been driven by data collection as well. For example, a large contribution to the current boom in deep learning can be traced back to ImageNet, a database of well-annotated images used for classification tasks [48]. For proteins, a well-organized biannual competition for protein structure prediction known as CASP (Critical Assessment of protein Structure Prediction) [49] enabled machine learning to push the field of structure prediction forward [50]. A large database of protein sequences also exists [8], with reference clusters provided [51,24] that can be linked to Gene Ontology annotations [52]. However, these sequences are rarely coupled to ‘fitness’ measurements, and when they are, the measurements are collected under diverse experimental conditions. While databases such as ProtaBank [53] promise to organize data collected along with their experimental conditions, protein sequence design for diverse functions has yet to experience its ImageNet moment.

Fortunately, a wide variety of tools are being developed for collecting large amounts of data, including deep mutational scanning [54] and methods involving continuous evolution [55–57]. These techniques contain their own nuances and data artifacts that must be considered [58], and unifying across studies must be done carefully [59]. Although these techniques currently apply to a subset of desired protein properties that are robustly measured, such as survival, fluorescence, and binding affinity, we must continue to develop experimental techniques if we hope to model and understand more complex traits such as enzymatic activity.

In the meantime, machine learning has enabled us to generate useful protein sequences on a variety of scales. In low- to medium-throughput settings, protein engineering guided by discriminative models enables efficient identification of improved sequences through the learned surrogate fitness function. In settings with larger amounts of data, deep generative models have various strengths and weaknesses that may be leveraged depending on design and experimental constraints. By integrating machine learning with rounds of experimentation, data-driven protein engineering promises to maximize the return on expensive laboratory work, enabling protein engineers to quickly design useful sequences.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements
The authors wish to thank Lucas Schaus and Sabine Brinkmann-Chen for feedback on early drafts. This work is supported by the Camille and Henry Dreyfus Foundation (ML-20-194) and the NSF Division of Chemical, Bioengineering, Environmental, and Transport Systems (1937902).
Figure A.1
(a) Variational Autoencoders (VAEs) are tasked with encoding sequences in a structured latent space. Samples from this latent space may then be decoded to
functional protein sequences. (b) Generative Adversarial Networks (GANs) have two networks locked in a Nash equilibrium: the generative network generates
synthetic data that look real, while the discriminative network discerns between real and synthetic data. (c) Autoregressive models predict the next amino acid in a
protein given the amino-acid sequence up to that point.
directly in the model state [70,71], and an added memory layer is introduced in Long Short-Term Memory (LSTM) networks to account for long-range interactions [72–74]. Finally, Transformer networks are based on the attention mechanism, which computes a soft probability distribution over all positions in the sequence [75,76]. They were also developed for language modeling to capture all possible pairwise interactions [31,77–79]. Notably, Transformer networks are also used for autoencoding pretraining, where tokens throughout the sequence (regardless of order) are masked and reconstructed [77].
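For concreteness, the scaled dot-product attention at the core of the Transformer [31] can be written (standard notation, not reproduced from this review) as

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\]

where Q, K, and V are the query, key, and value projections of the sequence positions and d_k is the key dimension.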
References
Papers of particular interest, published within the period of review, have been highlighted as:
* of special interest
** of outstanding interest

1. Romero PA, Arnold FH: Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol 2009, 10:866–876, https://doi.org/10.1038/nrm2805.
2. Arnold FH: Directed evolution: bringing new chemistry to life. Angew Chem Int Ed 2018, 57:4143–4148, https://doi.org/10.1002/anie.201708408.
3. Huang P-S, Boyken SE, Baker D: The coming of age of de novo protein design. Nature 2016, 537:320–327, https://doi.org/10.1038/nature19946.
4. Garcia-Borrás M, Houk KN, Jiménez-Osés G: Computational design of protein function. Comput Tools Chem Biol 2017, 3:87, https://doi.org/10.1039/9781788010139-00087.
5. Yang KK, Wu Z, Arnold FH: Machine-learning-guided directed evolution for protein engineering. Nat Methods 2019, 16:687–694, https://doi.org/10.1038/s41592-019-0496-6.
6. Mazurenko S, Prokop Z, Damborsky J: Machine learning in enzyme engineering. ACS Catal 2020, 10:1210–1223, https://doi.org/10.1021/acscatal.9b04321.
7. Volk MJ, Lourentzou I, Mishra S, Vo LT, Zhai C, Zhao H: Biosystems design by machine learning. ACS Synth Biol 2020, 9:1514–1533, https://doi.org/10.1021/acssynbio.0c00129.
8. UniProt Consortium: UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 2021, 49:D480–D489.
9. Ingraham J, Garg V, Barzilay R, Jaakkola T: Generative models for graph-based protein design. In Advances in Neural Information Processing Systems; 2019:15794–15805.
10. Sabban S, Markovsky M: RamaNet: computational de novo helical protein backbone design using a long short-term memory generative adversarial neural network. F1000Research 2020, 9, https://doi.org/10.12688/f1000research.22907.1.
11. Bepler T, Berger B: Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations; 2019.
12. Anand N, Eguchi RR, Derry A, Altman RB, Huang P: Protein sequence design with a learned potential. bioRxiv 2020, https://doi.org/10.1101/2020.01.06.895466.
13. * Hie B, Bryson BD, Berger B: Leveraging uncertainty in machine learning accelerates biological discovery and design. Cell Syst 2020, 11:461–477, https://doi.org/10.1016/j.cels.2020.09.007.
This paper is a clear demonstration of the efficacy of learned embeddings for both proteins and small molecules, and additionally shows how modeled uncertainty enables the identification of improved sequences.
14. Fox RJ, Davis SC, Mundorff EC, Newman LM, Gavrilovic V, Ma SK, Chung LM, Ching C, Tam S, Muley S, Grate J, Gruber J, Whitman JC, Sheldon RA, Huisman GW: Improving catalytic function by ProSAR-driven enzyme evolution. Nat Biotechnol 2007, 25:338–344, https://doi.org/10.1038/nbt1286.
15. Liao J, Warmuth MK, Govindarajan S, Ness JE, Wang RP, Gustafsson C, Minshull J: Engineering proteinase K using machine learning and synthetic genes. BMC Biotechnol 2007, 7, https://doi.org/10.1186/1472-6750-7-16.
16. Xu Y, Verma D, Sheridan RP, Liaw A, Ma J, Marshall N, McIntosh J, Sherer EC, Svetnik V, Johnston J: A deep dive into machine learning models for protein engineering. J Chem Inf Model 2020, 60:2773–2790, https://doi.org/10.1021/acs.jcim.0c00073.
17. Shanehsazzadeh A, Belanger D, Dohan D: Is transfer learning necessary for protein landscape prediction? arXiv 2020, https://arxiv.org/abs/2011.03443.
18. Costello Z, Garcia Martin H: How to hallucinate functional proteins. arXiv 2019, https://arxiv.org/abs/1903.00458.
19. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al.: Scikit-learn: machine learning in Python. J Mach Learn Res 2011, 12:2825–2830.
20. Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM: Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 2019, 16:1315–1322.
21. * Rives A, Goyal S, Meier J, Guo D, Ott M, Zitnick CL, Ma J, Fergus R: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA 2021, 118:e2016239118, https://doi.org/10.1101/622803.
An early example of modern protein language modeling, which applied the BERT training objective to protein sequences.
22. Rao R, Bhattacharya N, Thomas N, Duan Y, Chen P, Canny J, Abbeel P, Song Y: Evaluating protein transfer learning with TAPE. In Advances in Neural Information Processing Systems; 2019:9686–9698.
23. ** Biswas S, Khimulya G, Alley EC, Esvelt KM, Church GM: Low-N protein engineering with data-efficient deep learning. bioRxiv 2021, https://doi.org/10.1101/2020.01.23.917682.
An excellent example leveraging learned embeddings for protein engineering, enabling improved variants to be identified after training on as few as 24 variants.
24. Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium: UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015, 31:926–932, https://doi.org/10.1093/bioinformatics/btu739.
25. Wittmann BJ, Yue Y, Arnold FH: Machine learning-assisted directed evolution navigates a combinatorial epistatic fitness landscape with minimal screening burden. bioRxiv 2020, https://doi.org/10.1101/2020.12.04.408955.
26. Hawkins-Hooker A, Depardieu F, Baur S, Couairon G, Chen A, Bikard D: Generating functional protein variants with variational autoencoders. PLoS Comput Biol 2021, 17:e1008736.
27. Semeniuta S, Severyn A, Barth E: A hybrid convolutional variational autoencoder for text generation. arXiv 2017, https://arxiv.org/abs/1702.02390.
28. Repecka D, Jauniskis V, Karpus L, Rembeza E, Rokaitis I, Zrimec J, Poviloniene S, Laurynenas A, Viknander S, Abuajwa W, et al.: Expanding functional protein sequence spaces using generative adversarial networks. Nat Mach Intell 2021:1–10.
29. Sillitoe I, Dawson N, Lewis TE, Das S, Lees JG, Ashford P, Tolulope A, Scholes HM, Senatorov I, Bujan A, et al.: CATH: expanding the horizons of structure-based functional annotations for genome sequences. Nucleic Acids Res 2019, 47:D280–D284, https://doi.org/10.1093/nar/gky1097.
30. Shin J-E, Riesselman AJ, Kollasch AW, McMahon C, Simon E, Sander C, Manglik A, Kruse AC, Marks DS: Protein design and variant prediction using autoregressive generative models. Nat Commun 2021, https://doi.org/10.1038/s41467-021-22732-w.
70. Mikolov T, Karafiát M, Burget L, Cernocký J, Khudanpur S: Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association; 2010.
71. Kalchbrenner N, Blunsom P: Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing; 2013:1700–1709.
72. Hochreiter S, Schmidhuber J: Long short-term memory. Neural Comput 1997, 9:1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735.
73. Sutskever I, Vinyals O, Le QV: Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems; 2014:3104–3112.
74. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, https://arxiv.org/abs/1406.1078.
75. Bahdanau D, Cho K, Bengio Y: Neural machine translation by jointly learning to align and translate. arXiv 2014, https://arxiv.org/abs/1409.0473.
76. Luong M-T, Pham H, Manning CD: Effective approaches to attention-based neural machine translation. arXiv 2015, https://arxiv.org/abs/1508.04025.
77. Devlin J, Chang M-W, Lee K, Toutanova K: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv 2018, https://arxiv.org/abs/1810.04805.
78. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I: Language models are unsupervised multitask learners. OpenAI Blog 2019, 1:9.
79. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, et al.: HuggingFace's Transformers: state-of-the-art natural language processing. arXiv 2019, https://arxiv.org/abs/1910.03771.