Machine Learning For Designing Next-Generation mRNA Therapeutics

pubs.acs.org/accounts Article
Published as part of the Accounts of Chemical Research special issue “mRNA Therapeutics”.
Sebastian M. Castillo-Hair and Georg Seelig*
Cite This: Acc. Chem. Res. 2022, 55, 24−34
CONSPECTUS: Over just the last 2 years, mRNA therapeutics and
vaccines have undergone a rapid transition from an intriguing concept to
real-world impact. However, whereas some aspects of mRNA
therapeutics, such as the use of chemical modifications to increase
stability and reduce immunogenicity, have been extensively optimized for
over two decades, other aspects, particularly the selection and design of
the noncoding leader and trailer sequences which control translation
efficiency and stability, have received comparatively less attention. In
practice, such 5′ and 3′ untranslated regions (UTRs) are often borrowed
from highly expressed human genes with few or no modifications, as is
the case for the Pfizer/BioNTech COVID-19 vaccine. Focusing on the
5′UTR, we here argue that model-driven design is a promising alternative
that provides unprecedented control over 5′UTR function. We review
recent work that combines synthetic biology with machine learning to
build quantitative models that relate ribosome loading, and thus translation efficiency, to the 5′UTR sequence. We first introduce an
experimental approach that uses polysome profiling and high-throughput sequencing to quantify ribosome loading for hundreds of
thousands of 5′UTRs in parallel. We apply this approach to measure ribosome loading in synthetic RNA libraries with a random
sequence inserted into the 5′UTR. We then review Optimus 5-Prime, a convolutional neural network model trained on the
experimental data. We highlight that very accurate models of biological regulation can be learned from synthetic data sets with
degenerate 5′UTRs. We validate model predictions not only on held-out data sets from our random library but also on a large library
of over 30 000 human 5′UTR fragments and using translation reporter data collected independently by other groups. Both the
experiment and model are compatible with commonly used chemically modified nucleosides, in particular, pseudouridine (Ψ) and 1-
methyl-pseudouridine (m1Ψ). We find that, in general, 5′UTRs have very similar impacts when combined with different protein-
coding sequences and even in the context of different chemical modifications. We demonstrate that Optimus 5-Prime can be
combined with design algorithms to generate de novo sequences with precisely defined translation efficiencies. We emphasize recent
developments in design algorithms that rely on activation maximization and generative modeling to improve both the fitness and
diversity of designed sequences. We show that, compared with prior approaches such as genetic algorithms, these approaches are not
only faster but also less likely to get stuck in local sequence optima. Finally, we discuss how the approach reviewed here can be
generalized to other gene regions and applications.
Received: October 7, 2021
Published: December 14, 2021
■ KEY REFERENCES
• Sample, P. J.; Wang, B.; Reid, D. W.; Presnyak, V.; McFadyen, I. J.; Morris, D. R.; Seelig, G. Human 5′ UTR Design and Variant Effect Prediction from a Massively Parallel Translation Assay. Nature Biotechnology 2019, 37, 803−809 (ref 1). A neural network model, Optimus 5-Prime, trained on data from a massively parallel translation assay accurately predicts how the 5′UTR sequence controls ribosome loading and, together with a genetic algorithm, enables the design of high-performing 5′UTR sequences for mRNA therapeutics.
• Linder, J.; Seelig, G. Fast Activation Maximization for Molecular Sequence Design. BMC Bioinformatics 2021, 22, 510 (ref 2). Fast SeqProp is a computational design algorithm based on activation maximization that can be combined
© 2021 The Authors. Published by American Chemical Society
https://doi.org/10.1021/acs.accounts.1c00621
Figure 1. Workflow combining high-throughput assays and machine learning for characterizing mRNA regulation and engineering UTR sequences
with high performance.
with sequence-function models such as Optimus 5-Prime to rapidly design functional and high-fitness sequences.
• Linder, J.; Bogard, N.; Rosenberg, A. B.; Seelig, G. A Generative Neural Network for Maximizing Fitness and Diversity of Synthetic DNA and Protein Sequences. Cell Systems 2020, 11, 49−62.e16 (ref 3). Deep Exploration Networks (DENs) are a class of generative sequence design models that can be used to design sequence libraries while simultaneously maximizing the performance of all sequences and minimizing similarity between them.
■ INTRODUCTION
We are witnessing the beginning of the mRNA therapeutics revolution. Indeed, this technology is now in the public spotlight thanks to its role in fighting the COVID-19 pandemic: It resulted in two of the most effective vaccines available,4,5 with more currently in clinical trials,6 and has continued to enable the rapid development of potential booster shots against variants of concern.7 But its scope is not limited to vaccines for infectious diseases, as mRNA therapeutics are being evaluated for clinical applications such as cancer immunotherapy,8−10 regenerative medicine,11−13 and protein replacement therapy,14,15 among others.16,17 This wave of mRNA therapeutics is the result of decades of work on many fronts, including the development of novel 5′-cap analogs18−21 and capping methods,22−25 chemically modified bases,26−28 and lipid nanoparticles,29 all of which improve mRNA stability, decrease immunogenicity, improve delivery efficiency, and, in general, result in the successful in vivo expression of the therapeutic protein.
Nevertheless, there is still untapped potential in optimizing the primary sequence, either to further improve protein expression or to encode more complex pharmacokinetics. For example, 5′ and 3′ untranslated regions (UTRs) are heavily involved in the regulation of translation and mRNA degradation; however, how their sequence affects these processes is not completely understood, and poor UTR design can negatively impact the expression of the therapeutic protein. Thus, most mRNA therapies to date take their UTRs from highly expressed human genes such as α- and β-globin.30 Recent studies, however, have shown that alternative UTRs can result in higher expression,31,32 suggesting that there is significant room for improvement. Moreover, when targeting different cell types or tissues, UTRs may need further tuning to account for the differential expression of regulators such as RNA-binding proteins (RBPs)33 and microRNAs (miRNAs).34 To further complicate matters, there is a complex interplay between the different sequence-dependent regulatory mechanisms that ultimately control protein expression. Recent studies have shown that 5′UTR elements that repress translation can also diminish mRNA stability,35 but 5′UTRs with very high translation efficiencies can also have a destabilizing effect and ultimately reduce expression.36 Clearly, quantitative models that take into account these effects and predict protein expression from sequence are crucial to unlocking the full potential of mRNA therapeutics.
In this Account, we describe a framework that combines high-throughput assays and deep-learning techniques to develop predictive models, obtain biological insights, and engineer de novo sequences that optimize protein expression for mRNA therapeutics applications (Figure 1). The first component of this framework consists of massively parallel reporter assays (MPRAs) based on large synthetic gene or transcript libraries, whereby sequence variation is targeted to the region under study. Such an approach makes it possible to experimentally interrogate and quantify how a particular part of the mRNA (5′UTR, coding sequence (CDS), 3′UTR) contributes to protein production. In particular, here we
Figure 2. Polysome profiling data from a random 5′UTR library captures known regulatory effects. (A) Schematic of polysome profiling
experiment. (B) Influence of upstream start codon position along the 5′UTR. Plots show the average MRL of sequences containing a canonical
(AUG) or noncanonical (CUG, GUG) start codon at the indicated position with respect to the primary AUG. (C) Influence of context around
upstream start codon. Sequences containing an upstream AUG, GUG, or CUG between positions −21 and −8 were grouped by their surrounding
context (strong, moderate, weak) and whether they occur in-frame or out-of-frame with the primary AUG. p values were calculated using two-sided
t tests. (D) Relationship between the 5′UTR secondary structure and the measured ribosome load. 20 000 5′UTR sequences were grouped by their
predicted minimum free energy, and the MRL distribution of each group was plotted.
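The grouping used in panels B and C can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' analysis code: the function names are hypothetical, the primary AUG is assumed to follow the 5′UTR directly (so the last UTR base is position −1), and the strong/moderate/weak classes assume the common convention of counting how many of the two Kozak positions (purine at −3, G at +4) are satisfied.

```python
# Hypothetical helper names. The context classes follow a common Kozak
# convention and are an assumption, not the authors' exact grouping.
START_CODONS = ("AUG", "CUG", "GUG")


def kozak_context(seq, i):
    """Classify the context of the codon whose first base is seq[i].

    Checks the -3 position (three bases upstream of the codon) and the
    +4 position (the base right after the codon). Returns None when
    either position falls outside seq.
    """
    if i - 3 < 0 or i + 3 >= len(seq):
        return None
    purine_at_minus3 = seq[i - 3] in "AG"
    g_at_plus4 = seq[i + 3] == "G"
    hits = int(purine_at_minus3) + int(g_at_plus4)
    return {2: "strong", 1: "moderate", 0: "weak"}[hits]


def upstream_starts(utr):
    """Return (codon, position, in_frame, context) for each upstream start codon.

    position is relative to the primary AUG, which is assumed to follow
    the 5'UTR directly: the last UTR base is -1.
    """
    utr = utr.upper().replace("T", "U")
    n = len(utr)
    found = []
    for i in range(n - 2):
        codon = utr[i:i + 3]
        if codon in START_CODONS:
            position = i - n                # first base of a 50 nt UTR is -50
            in_frame = (n - i) % 3 == 0     # same reading frame as primary AUG
            # Append the primary AUG so that +4 is defined near the 3' end.
            context = kozak_context(utr + "AUG", i)
            found.append((codon, position, in_frame, context))
    return found
```

For example, `upstream_starts("GGGACCAUGGCCC")` reports a single out-of-frame uAUG at position −7 in a strong context.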
focus on recent work characterizing the influence of the 5′UTR sequence on translation efficiency.
The second component consists of deep-learning methods capable of identifying complex regulatory relationships from the resulting data sets. Specifically, we show that a convolutional neural network (CNN) model trained on MPRA data accurately predicts ribosome loading from 5′UTR sequences.
Finally, we discuss methods for engineering novel sequences that achieve specified performance levels or exceed the performance of endogenous sequences. We first report results from designing 5′UTRs using a genetic algorithm, an iterative discrete search method that “evolves” sequences in silico until the model prediction matches a prespecified target. We then describe more recent methods that are capable of rapidly designing large libraries of diverse sequences while avoiding many of the inefficiencies and pitfalls of genetic algorithms. We conclude with a short discussion on how this work can be extended to 3′UTR and CDS sequences and to other molecular phenomena such as mRNA degradation.
■ A MASSIVELY PARALLEL TRANSLATION ASSAY FOR CHARACTERIZING 5′UTRs
Translation of most mRNAs in eukaryotes starts with the assembly of the eIF4F complex at the 5′cap followed by recruitment of the 43S preinitiation complex containing the 40S ribosomal subunit, several eukaryotic initiation factors (eIFs), and the Met-tRNAiMet anticodon.37 Next, the assembled 43S complex scans the 5′UTR in the 5′ to 3′ direction until a start codon is recognized. Successful recognition depends on the start codon identity (the canonical AUG is more likely to be recognized than CUG or GUG) and the context around it. The “Kozak consensus sequence”, which includes a purine at −3 and a G at +4, promotes recognition.38,39 Finally, initiation factors dissociate, the 60S subunit is recruited, the full 80S ribosome is assembled, and peptide elongation begins.37 Translation of the primary open reading frame (ORF) can be heavily influenced by sequence elements within the 5′UTR. For example, upstream start codons (uAUGs) and upstream ORFs (uORFs) can have a repressive effect by capturing ribosomes that would otherwise initiate at the primary start codon.40 Secondary structure and RNA-binding proteins may block or interfere with ribosome scanning.41 Other elements, such as the 5′ terminal oligopyrimidine tract (5′TOP), regulate changes in translation in response to stress.42 Finally, internal ribosome entry sites (IRESs) allow translation initiation in a cap-independent manner.43 All of these cis-regulatory elements may interact with one another and influence translation in ways that remain challenging to predict.
Machine-learning approaches can, in principle, be used to build models that predict translation efficiency from 5′UTR sequence, but such models require very large-scale and high-quality training data. A potential solution could be found in high-throughput translation data sets obtained from the human transcriptome. Most of these used Ribo-seq,44 a method wherein ribosome-bound transcripts are digested and the ribosome-protected mRNA fragments are sequenced. Ribo-seq provides mRNA translation efficiencies and even identifies the reading frame being translated, but it has difficulty distinguishing between transcript isoforms of the same gene due to a short ∼30 nt fragment length. TrIP-seq,45 wherein mRNAs are fractionated based on the number of elongating ribosomes before sequencing, can distinguish transcript isoforms and has
Figure 3. Polysome profiling data from the random 5′UTR eGFP library generalizes to different CDSs and mRNA chemistries. (A) MRL from a
library of 3110 5′UTRs with an eGFP versus mCherry CDS. (B) Schematic of uridine compared with the modified nucleosides pseudouridine (Ψ)
and 1-methyl-pseudouridine (m1Ψ). (C,D) MRL comparison for modified versus unmodified chemistries. The eGFP library was resynthesized
using Ψ (C) or m1Ψ (D), and polysome profiling was performed as with the unmodified library. r² values were calculated from the 20 000 sequences
with the highest read coverage. Plots show 3000 sequences randomly chosen from this subset.
been successfully applied to studying the impact of alternative 5′ and 3′ UTRs.45,46
Still, endogenous transcript data may not be optimal for training predictive models for multiple reasons. First, endogenous transcripts contain highly variable UTR and CDS sequences, making it difficult to reliably isolate how a specific part of the mRNA, such as the 5′UTR, influences translation. Second, the size of an endogenous data set is fundamentally limited by the size of the human transcriptome. Because deep learning can take advantage of extremely large data sets to achieve exceptional performance,47 obtaining more examples than what the genome can provide is desirable. Finally, sequences with deleterious effects are likely to be underrepresented in endogenous data, potentially resulting in major model blind spots.
An alternative approach is to use MPRAs, where large libraries of synthetic reporter sequences are assayed. Here variation is restricted to a particular sequence element, in the form of either fully degenerate or endogenous sequence fragments.48 In addition, the MPRA library size can be orders of magnitude larger than the number of genomic examples. Previous MPRAs for characterizing translation in human cells have used stable single-copy integration of DNA libraries followed by fluorescence-activated sorting and sequencing.49,50 Similarly, we have previously used DNA libraries combined with a growth selection assay to study translation in yeast.51 Measurements from DNA libraries are affected by 5′UTRs influencing transcription or RNA processing as well as translation, which, in some studies, has been compensated for by placing a fluorescent reporter downstream of an IRES in the same transcript.49,50
To characterize the influence of the 5′UTR on translation, we developed an MPRA in which we measure the ribosome loading of a random library of hundreds of thousands of mRNAs (Figure 2A). This assay uses polysome profiling followed by sequencing, as in TrIP-seq;45 however, we use synthetic mRNA libraries where sequence variation is targeted to the 5′UTR region. Specifically, our reporter design contains a constant enhanced green fluorescent protein (eGFP) CDS, a 3′UTR derived from bovine growth hormone (BGH), and a 5′UTR with an initial 25 nt-long fixed segment followed by a 50 nt fully degenerate region. Two out-of-frame stop codons present at the beginning of the eGFP CDS ensured that initiation at a randomly generated out-of-frame start codon would not result in extended translation. Prior MPRAs for splicing52 and even translation49 often used short (6−10 nt) random regions such that every possible sequence combination was covered. However, building on our own previous work on alternative splicing53 and translation regulation in yeast,51 we used a longer random sequence to allow for more diverse motif combinations. Whereas we cannot cover every possible 50-mer (there are >10^30), short regulatory sequences such as start and stop codons should appear frequently and in many different positions and combinations.
mRNA was synthesized, capped, and polyadenylated using an in vitro transcription (IVT) system, which, compared with transfecting plasmid DNA, allowed us to remove confounding transcriptional and RNA processing effects. Synthesized mRNA was transfected into HEK293T cells and incubated for 12 h. Cells were then lysed in the presence of cycloheximide, an antibiotic that halts elongating ribosomes. The lysate was run through a sucrose gradient, and fractions containing mRNAs bound to distinct numbers of ribosomes (polysomes) were collected, barcoded, and sequenced. The resulting data set contains read counts for 280 000 5′UTR sequences. From here, we obtained the mean ribosome load (MRL) for each sequence by multiplying, for each fraction, the proportion of reads corresponding to a specific sequence times the number of associated ribosomes and summing these products. The MRL is thus a quantitative measure of translation efficiency.
Our polysome profiling data set recapitulated previously known regulatory effects. 5′UTRs with upstream AUGs had, on average, lower MRL values when the AUG was out-of-frame with respect to the primary start codon (Figure 2B). This effect was also present with noncanonical upstream start codons (CUG, GUG) but to a significantly lesser extent (Figure 2B). Notably, the context around upstream start codons strongly influenced their repressive effect: Out-of-frame uAUGs were more repressive when surrounded by a purine (A,G) at position −3 and a guanine at +4, matching the Kozak consensus sequence, whereas uCUGs and uGUGs had statistically significant effects only within a similarly strong context (Figure 2C). Similar effects were observed for uORFs. These observations are consistent with stronger uAUGs and uORFs redirecting ribosomes that would otherwise initiate at the primary start codon. Additionally, sequences with lower predicted free energies had, on average, lower MRLs, consistent with stable secondary structures interfering with ribosome scanning (Figure 2D).
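The MRL calculation described above can be written out as a short sketch. It is only one reading of the text: the function and argument names are hypothetical, and the step that divides each fraction by its total read count (to correct for unequal sequencing depth) is an assumption that the text does not spell out.

```python
def mean_ribosome_load(read_counts, ribosomes_per_fraction):
    """Compute a mean ribosome load (MRL) per sequence.

    read_counts: dict mapping a 5'UTR sequence to a list of raw read
        counts, one entry per polysome fraction.
    ribosomes_per_fraction: number of ribosomes associated with each
        fraction, in the same order.
    """
    n_fractions = len(ribosomes_per_fraction)

    # Assumption: normalise each fraction by its total reads so that
    # fractions sequenced more deeply do not dominate.
    totals = [
        sum(counts[f] for counts in read_counts.values())
        for f in range(n_fractions)
    ]

    mrl = {}
    for seq, counts in read_counts.items():
        depth_norm = [c / t if t else 0.0 for c, t in zip(counts, totals)]
        total = sum(depth_norm)
        if total == 0:
            continue  # sequence not observed in any fraction
        # Proportion of this sequence's signal found in each fraction.
        proportions = [x / total for x in depth_norm]
        # Weight each proportion by its ribosome number and sum.
        mrl[seq] = sum(
            p * r for p, r in zip(proportions, ribosomes_per_fraction)
        )
    return mrl
```

A sequence found only in the ribosome-free fraction gets an MRL of 0, and one found only in the two-ribosome fraction gets an MRL of 2. Sequences spread over several fractions land in between.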
Figure 4. Optimus 5-Prime can predict ribosome loading and protein expression from a given 5′UTR sequence. (A) Optimus 5-Prime architecture.
An input sequence represented as a 50 × 4 one-hot encoded matrix (bottom) is fed into two convolutional layers (middle) followed by a fully connected
layer to generate an MRL prediction (top). (B) Measured versus predicted MRLs on a held-out test set of 20 000 sequences. Red: 5′UTRs with no
uAUGs. Blue: 5′UTRs with uAUGs. (C) Predicted MRL versus eGFP fluorescence for 10 mRNAs selected to have a wide range of MRL values.
mRNAs were independently transfected into HEK293 cells and imaged using an IncuCyte S3 live-cell analysis system. The maximum fluorescence
over a 20.5 h time window is shown.
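To make the input representation in panel A concrete, the sketch below one-hot encodes a sequence and slides a single convolutional filter over it. It is illustrative only: the filter is written by hand to respond to "AUG", whereas the filters in a trained network are learned, and the base ordering (A, C, G, U), the bias value, and all function names are assumptions.

```python
BASES = "ACGU"  # assumed channel order


def one_hot(seq):
    """Encode a sequence as a list of L rows with 4 channels each."""
    seq = seq.upper().replace("T", "U")
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def conv1d_relu(x, kernel, bias=0.0):
    """Valid 1D cross-correlation of x (L x 4) with kernel (k x 4), then ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(x) - k + 1):
        total = bias
        for j in range(k):
            for c in range(4):
                total += x[start + j][c] * kernel[j][c]
        out.append(max(0.0, total))
    return out


# Hand-written filter that scores +1 per matching base of "AUG".
aug_filter = one_hot("AUG")

# With a bias of -2, only a full three-base match gives a positive activation.
activations = conv1d_relu(one_hot("CCAUGCC"), aug_filter, bias=-2.0)
```

Here `activations` is nonzero only at the window that contains AUG, which is the sense in which a convolutional filter "identifies a short motif" at any position of the input.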
Figure 5. Optimus 5-Prime predictions generalize across mRNA chemistries, cell lines, and endogenous 5′UTRs. (A) Coefficients of determination
(r²) for model predictions when training and test data sets are taken from one of two replicates of the original eGFP data set without modification
(U), with pseudouridine (Ψ), or with 1-methyl-pseudouridine (m1Ψ). (B,C) Optimus 5-Prime predictions compared with translation efficiency
measurements for 77 5′UTRs designed and characterized in six different cell lines by Ferreira et al.56 mRNA reporters contained a GFP ORF
preceded by a designed 5′ UTR and a red fluorescent protein (RFP) ORF preceded by an IRES to be used as a normalization control. Cell lines
included human embryonic kidney cells (293T), mouse pre-B lymphocytes (PD31), human chronic myelogenous leukemia cells (K562), human
colon cancer cells (HCT116), Chinese hamster ovary cells (CHO-K1), and mouse plasmacytoma (MPC11). 5′UTRs were used in this analysis
only if their GFP ORFs started with ATGG. 5′UTRs shorter than 50 bp were zero-padded before being used with Optimus 5-Prime. (B) Direct
comparison with measurements in PD31 cells. (C) Coefficients of determination (r²) of measurements versus MRL predictions in all cell lines. (D)
Predicted versus observed MRL for wild-type and SNV-containing human 5′UTR sequences.
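The zero-padding mentioned in this caption can be sketched as follows. The function name is hypothetical, and padding on the left (5′) side is an assumption here: the caption does not state which side was padded. Left-padding has the property that the bases closest to the start codon always occupy the same rows of the input.

```python
def left_pad(encoded, length, width=4):
    """Pad a one-hot matrix (rows of `width` channels) with all-zero rows
    on the 5' side until it has `length` rows."""
    if len(encoded) > length:
        raise ValueError("sequence is longer than the model input")
    pad_rows = [[0.0] * width for _ in range(length - len(encoded))]
    return pad_rows + [list(row) for row in encoded]


# Two encoded bases padded to a four-row input.
short = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
padded = left_pad(short, 4)
```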
Associations between 5′UTR sequences and MRL measurements generalized to other coding sequences and nucleotide chemistries, both relevant to mRNA therapeutic applications. We tested the same ∼3000 5′UTRs together with either the eGFP or mCherry CDS, two fluorescent proteins of different origins and with widely differing sequences. We observed excellent correlation between MRLs measured in both contexts (r² = 0.732, Figure 3A), suggesting that identical UTRs result in similar MRLs, even if the CDS context changes.
Similarly, we resynthesized the original 280 000-member 5′UTR library but replaced uridine in the IVT reaction with the chemically modified nucleosides pseudouridine (Ψ) and 1-methyl-pseudouridine (m1Ψ) (Figure 3B). mRNAs incorporating these modifications avoid activation of intracellular pattern recognition receptors, which would result in an immune response that suppresses translation and promotes mRNA degradation. Consequently, these modifications are commonly used in mRNA therapeutics.6,11 MRLs were found to be highly correlated across chemistries (Figure 3C,D).
■ DEVELOPING PREDICTIVE MODELS OF RIBOSOME LOADING
Deep learning has been highly successful at various tasks in molecular biology54 due to at least two factors. First, the tiered nature of the molecular interactions involved in a particular process (sequence motifs recruit effector proteins, which form complexes with other proteins, which, in turn, interact with other complexes and sequence motifs) is efficiently captured by the layered architecture of a deep-learning network. Second, because deep learning is capable of capturing complex, nonlinear interactions, these models are uniquely suited to take advantage of extremely large data sets to obtain improved performance.47,55
To predict ribosome loading from 5′UTR sequence, we developed a convolutional neural network (CNN) model named Optimus 5-Prime. The model contains two convolutional layers with filters that identify short motifs from the input and one fully connected layer that ultimately computes the MRL prediction (Figure 4A). We first trained this model using 260 000 random 5′UTRs and associated MRLs from the polysome profiling data set. Optimus 5-Prime was able to
predict ribosome loading with remarkable precision when tested against 20 000 samples held out from training (r² = 0.93, Figure 4B, compared with r² = 0.64 for the best k-mer linear model with k ≤ 6). To further validate whether MRL predictions were indicative of output protein expression, we selected 10 sequences and performed individual eGFP fluorescence measurements, which were highly correlated with MRL predictions (Figure 4C).
To be maximally useful for the design of mRNA therapeutics, Optimus 5-Prime needs to generalize to different coding sequences, chemical modifications, cell types, sequence “types” (e.g., human rather than random), or lengths. As detailed above, we found UTR sequences to be highly transferable between different CDS contexts and even chemical modifications (Figure 3). Accordingly, despite being trained on eGFP library data, Optimus 5-Prime predictions could explain 78 and 77% of the observed MRL variation in data from two replicates with ∼200 000 random 5′UTRs preceding an mCherry CDS (Figure 5A). Similarly, despite being trained only on unmodified mRNA data, Optimus 5-Prime could explain 69−73% of the observed MRL variation in the Ψ library and 68−76% in the m1Ψ library. Still, the model accuracy could be increased to 84−85, 77−82, and 72−81% by retraining directly on the mCherry, Ψ, and m1Ψ data sets, respectively (Figure 5A). Therefore, whereas a model trained on unmodified RNA data is reasonably accurate, training directly on modified RNA data will be ideal for predicting the impact of such modifications in mRNA therapeutics contexts.
To test whether Optimus 5-Prime would perform well on sequences designed by others, we turned to translation measurements conducted with six different cell lines and 77 5′UTRs designed by Ferreira et al.56 and found that Optimus 5-Prime could explain 73−85% of the reported variation (Figure 5C). We also note that measurements reported for different cell types are very highly correlated, suggesting that the basic regulatory rules (e.g., strengths of Kozak, role of uORFs, etc.) remain similar between cell types. These observations also suggest that a model trained on data collected in a single cell type can generalize to other, more clinically relevant cell types.
We also showed that Optimus 5-Prime generates accurate MRL predictions on human 5′UTRs despite being trained on random sequences only. We synthesized 35 212 5′UTRs extracted from the 50 nt long region immediately preceding the start codon in human transcripts and 3577 single nucleotide variant (SNV) sequences from ClinVar57 and assayed them as described above. MRL predictions were highly correlated with experimental observations (r² = 0.82, Figure 5D).
A limitation of the initial version of Optimus 5-Prime is its fixed 50 nt long input, as UTRs used for mRNA therapeutics can be longer, whereas human 5′UTRs range from tens to thousands of bases with a median length of 218.41 Thus we constructed and characterized a new library with a degenerate 5′UTR region ranging from 25 to 100 bases. We then retrained Optimus 5-Prime using a longer 100-base input layer. Input sequences shorter than 100 bases were accommodated by left-padding their one-hot-encoded vector with zeros. On a test data set of held-out random and human 5′UTR sequences, we found MRL predictions to be highly correlated with measurements (r² from 0.84 to 0.75). Recently, Gagneur and coworkers used our MPRA data to train a model based on convolutions staggered every three bases to further extend MRL predictions to arbitrary-length 5′UTR sequences.58
Finally, we use Optimus 5-Prime to score the translation efficiencies of 5′UTRs previously used in mRNA therapeutics. First, the commonly used α- and β-globin 5′UTRs6 have predicted MRLs of 6.1 and 6.6, respectively. Therefore, compared with the 25−100 nt library data set, these sequences can be placed in the 65th and 86th percentiles. The BioNTech/Pfizer BNT-162b2 COVID-19 vaccine4 uses a modified α-globin with a consensus Kozak and, according to our model, results in an MRL of 6.3 (76th percentile). Finally, the Moderna mRNA-1273 vaccine5 uses a synthetic 5′UTR, which our model predicts to have an MRL of 5.7 (52nd percentile). Whereas most of these UTRs result in higher-than-average ribosome loading and generally high expression, our results suggest that further optimization could be beneficial for strongly expressing proteins in therapeutic applications. In fact, in very recent work, Exposito and coworkers compared six 5′UTRs selected from our eGFP library because of their high measured MRLs to the β-globin 5′UTR and found that at least one of the six synthetic sequences (“UTR4”) resulted in higher GFP expression across three different cell types. Most notably, an 80% increase in fluorescence compared with the β-globin control was reported in primary human-monocyte-derived dendritic cells.59
■ DESIGNING SEQUENCES FOR ENHANCED AND SPECIFIC mRNA TRANSLATION
Methods to rationally design regulatory sequences with custom performance, such as 5′UTRs that achieve target translation efficiencies, have been a major focus of synthetic biology. Early genetic engineering relied on using regulatory sequences from endogenous sources, with the expectation that they would perform as well in their new synthetic context,60,61 an approach still largely used with UTRs for mRNA therapeutics.6,30 However, further work, in particular, related to promoters, demonstrated that engineered sequences could allow for the finer tuning of performance62,63 and even outperform native sequences64 while being more robust to context changes.65 Methods to design these sequences included building chimeras from native sequences,64 rationally inserting66 or deleting67 regulatory motifs, and screening libraries containing random mutations62,63 or permutations of sequence elements.68,69
An alternative approach is model-based design, where a predictive sequence-to-function model is used alongside a search algorithm to generate fully synthetic sequences with target performance. An early demonstration of this approach was the ribosome binding site (RBS) calculator, a software package that designs bacterial 5′UTRs for specified translation efficiencies.70 The RBS calculator was successful in part because bacterial translation initiation relies on binding of the 16S ribosomal RNA to a sequence element in the mRNA 5′UTR; therefore, a model based entirely on RNA hybridization thermodynamics was sufficiently accurate. However, detailed biophysical models may not be available for other processes. Machine-learning models, such as Optimus 5-Prime, that can be trained on large-scale example data sets even in the absence of a quantitative biophysical model provide a powerful alternative as “oracles” for sequence design.
We first demonstrated the machine-learning-guided design of functional sequence elements in the context of yeast 5′UTR regulation. Specifically, we used a neural network model trained on 500 000 random UTRs together with random
Figure 6. (A) Sequence design using Optimus 5-Prime and a genetic algorithm. (B) Predicted and observed MRLs of 12 000 sequences designed
for different target MRLs. (C,D) Model performance before (C) and after (D) retraining with a subset of the designed sequences evaluated on a
held-out designed sequence test set.
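The design loop of panel A (mutation and crossover in silico, scoring with a predictive model, selection of the best sequences) can be sketched as below. This is a minimal illustration, not the authors' implementation: `toy_oracle` is a made-up stand-in for a trained model such as Optimus 5-Prime, and the population size, mutation rate, and number of generations are arbitrary assumptions.

```python
import random

BASES = "ACGU"


def toy_oracle(seq):
    """Made-up stand-in for a trained predictor (NOT Optimus 5-Prime):
    penalises upstream AUGs and mildly rewards A/U content."""
    au_fraction = sum(seq.count(b) for b in "AU") / len(seq)
    return 4.0 + 4.0 * au_fraction - 2.0 * seq.count("AUG")


def mutate(seq, rng, rate=0.05):
    """Replace each base with a random base with probability `rate`."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in seq)


def crossover(a, b, rng):
    """Single-point crossover between two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(target, oracle=toy_oracle, length=50, pop_size=100,
           generations=50, keep=20, seed=0):
    """Evolve a sequence whose predicted score is close to `target`."""
    rng = random.Random(seed)

    def fitness(seq):
        return -abs(oracle(seq) - target)

    population = ["".join(rng.choice(BASES) for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the best sequences unchanged for the next round.
        population.sort(key=fitness, reverse=True)
        parents = population[:keep]
        # Variation: refill the population with mutated crossovers.
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - keep)
        ]
        population = parents + children
    return max(population, key=fitness)


best = evolve(target=6.0)
```

To maximize the prediction instead of matching a target, the fitness function would simply return `oracle(seq)`. Because the search only ever sees the oracle's predictions, it can drift into regions where the model is unreliable, which is the blind-spot issue discussed in the text.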
Figure 7. Fast SeqProp, a gradient-based sequence design method with PWM sampling and per-base logit normalization, rapidly finds high-
performing sequences. (A) Methods based on search heuristics introduce random changes to a candidate sequence. Performance needs to be
evaluated for several candidates before finding an improvement. (B) Gradient-based methods move in the direction of increased performance, as
measured by the gradient of the cost function. (C) Fast SeqProp optimizes logits via gradient descent. Logits are normalized across positions, and
one-hot encoded sequences are sampled from PWMs before being presented to the pretrained predictive model. (D) Cost function over number of
iterations with and without logit normalization and PWM sampling when optimizing for high MRLs using Optimus 5-Prime.
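The idea in panels B and C, optimizing continuous logits by following a gradient instead of testing random changes, can be illustrated with a deliberately simplified sketch. It is not Fast SeqProp: it omits the PWM sampling and logit normalization that the caption describes, and it uses a toy linear oracle (a hypothetical `weights` matrix scored against the softmax of the logits) so that the gradient can be written by hand. With a real neural network the gradient would come from automatic differentiation.

```python
import math

BASES = "ACGU"


def softmax(z):
    """Numerically stable softmax over one position's four logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def design(weights, steps=200, lr=1.0):
    """Gradient ascent on per-position logits.

    weights: L x 4 matrix defining a toy linear oracle,
        score = sum over positions i and bases c of weights[i][c] * p[i][c],
        where p[i] = softmax(logits[i]) is a relaxed (continuous) sequence.
    Returns the argmax sequence and the final position probability matrix.
    """
    logits = [[0.0] * 4 for _ in weights]
    for _ in range(steps):
        for z, w in zip(logits, weights):
            p = softmax(z)
            expected = sum(pc * wc for pc, wc in zip(p, w))
            for c in range(4):
                # d(score)/d(z_c) through the softmax: p_c * (w_c - E_p[w])
                z[c] += lr * p[c] * (w[c] - expected)
    pwm = [softmax(z) for z in logits]
    seq = "".join(BASES[max(range(4), key=row.__getitem__)] for row in pwm)
    return seq, pwm


# Toy oracle that prefers A, then U, then G at three positions.
toy_weights = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 2.0],
               [0.0, 0.5, 1.0, 0.0]]
sequence, pwm = design(toy_weights)
```

Every update moves the logits in a direction that increases the oracle score, so no evaluations are spent on candidates that turn out to be worse. The discrete sequence is read off at the end by taking the most probable base at each position.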
mutagenesis to computationally evolve high fitness sequences.51 This work gave a proof of principle for machine-learning-guided sequence design, but regulatory sequences optimized for gene expression in yeast are unlikely to be optimal for mRNA therapeutics applications. Given the high accuracy achieved by Optimus 5-Prime, we similarly evaluated its application in designing 5′UTRs with specified translation efficiencies.1 Our design approach was based on genetic algorithms,71 a discrete search heuristic previously used for designing bacterial 5′UTRs70 and RNAs with pseudoknotted structures.72 Starting with a set of random 50 bp 5′UTRs, each iteration consisted of random mutations and crossovers in silico, followed by scoring using Optimus 5-Prime and selection of the best sequences for the next round (Figure 6A). To test this approach, we designed 12 000 sequences to either achieve one of seven discrete MRL values between 3 and 9 or to maximize the MRL. These sequences were then synthesized and tested via polysome profiling. We found excellent agreement between the target and experimental MRLs when the target was eight or lower. However, for larger target MRLs, the experimental measurements were lower than predicted (Figure 6B). A closer inspection of sequences in this stage revealed the appearance of long poly-U stretches not present in the training library. Therefore, the genetic algorithm was likely exploiting a blind spot in Optimus 5-Prime (a region in sequence space where predictions would have low quality) to further increase MRLs. Notably, retraining the model using a subset of the designed sequences and their measured MRLs improved the prediction accuracy (Figure 6C,D). Recent work by Lu and coworkers also used a genetic algorithm to design 5′UTRs for DNA gene therapy applications, but their approach jointly optimized transcription and translation.73

■ IMPROVED ALGORITHMS FOR SEQUENCE DESIGN

Sequence design based on search heuristics such as genetic algorithms has several issues that we have tried to address in recent work. For example, design can be slow and inefficient: The mutation and crossover operations change a few nucleotides at a time and are not guaranteed to increase the performance at every step, resulting in small improvements despite multiple model evaluations (Figure 7A). A more efficient approach is activation maximization through gradient descent, where the gradient of the performance metric with respect to the model input is used to iteratively refine a candidate sequence.74 Progress is always made in the direction of increased performance, and fewer model evaluations are required (Figure 7B). However, gradients can only be taken with respect to continuous real-valued inputs, and thus some modifications are needed to design sequences consisting of discrete letters. Attempts at addressing this limitation include representing sequences via unstructured real-valued matrices75 and introducing a “softmax” layer that transforms unbounded real-valued inputs (“logits”) into position-weight matrices (PWMs) before feeding them to the model.74 However, these approaches may result in poor performance because models are trained on one-hot encoded data (i.e., unambiguous sequences), not real-valued inputs.
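For contrast with the gradient-based methods, the search-heuristic loop summarized in Figures 6A and 7A (mutation, crossover, scoring, selection) can be sketched as follows. This is a minimal illustration only: `toy_score` is a hypothetical stand-in for a trained predictor such as Optimus 5-Prime and has no biological meaning, and the function names, population size, mutation rate, and iteration count are assumptions rather than the settings used in the original work.

```python
import random

ALPHABET = "ACGU"

def toy_score(seq):
    """Hypothetical stand-in for a trained predictor (not Optimus 5-Prime).
    Returns the fraction of A/C bases; it has no biological meaning."""
    return sum(base in "AC" for base in seq) / len(seq)

def mutate(seq, rate, rng):
    """Replace each base by a random one with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else b for b in seq)

def crossover(a, b, rng):
    """Single-point crossover between two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def genetic_design(target, length=50, pop_size=100, n_keep=20,
                   n_iter=50, rate=0.02, seed=0):
    """Evolve sequences whose predicted score approaches `target`."""
    rng = random.Random(seed)
    fitness = lambda s: -abs(toy_score(s) - target)
    pop = ["".join(rng.choice(ALPHABET) for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(n_iter):
        # Score the whole population and keep the best sequences.
        parents = sorted(pop, key=fitness, reverse=True)[:n_keep]
        # Refill the population with mutated crossovers of the survivors.
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                           rate, rng)
                    for _ in range(pop_size - n_keep)]
        pop = parents + children  # survivors are carried over unchanged
    return max(pop, key=fitness)

best = genetic_design(target=0.8)
```

Every generation scores the full population even though most candidates are no better than their parents, which is the inefficiency that motivates gradient-based design.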
We previously demonstrated a hybrid continuous/discrete
solution to this problem: At every iteration, model evaluations
are made using one-hot encoded sequences sampled from the
PWM, but gradients are evaluated with respect to the
continuous input logits. We successfully used this approach
to design sequences with custom alternative polyadenylation
isoform ratios.55 In more recent work,2 we evaluated the effect
of normalizing logits across all positions and introducing per-
base scaling and bias factors (Figure 7C). These additions
helped us avoid issues with vanishing gradients, an error mode
where gradients become too small to drive meaningful updates.
As a result, our Fast SeqProp method converges rapidly in a
variety of design tasks, including maximizing transcription
factor binding, transcriptional activity, alternative polyadenylation,
and translation using Optimus 5-Prime (Figure 7D).
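The hybrid continuous/discrete scheme described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the published Fast SeqProp implementation: `toy_predict` and `toy_input_grad` are a hypothetical saturating scorer and its input gradient standing in for a trained network, `W` is a random weight matrix, and the learning rate and iteration count are arbitrary. Logits are normalized per base across positions, scaled and shifted by per-base factors, converted to a PWM, and sampled; the gradient computed at the sampled one-hot sequence is then passed straight through to the continuous parameters.

```python
import numpy as np

BASES = np.array(list("ACGU"))

def toy_predict(x, W):
    """Hypothetical saturating scorer standing in for a trained model."""
    L = x.shape[0]
    return L * np.tanh(np.sum(W * x) / L)

def toy_input_grad(x, W):
    """Gradient of toy_predict with respect to its (one-hot) input x."""
    L = x.shape[0]
    return (1.0 - np.tanh(np.sum(W * x) / L) ** 2) * W

def design(W, n_iter=500, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    L = W.shape[0]
    logits = rng.normal(size=(L, 4))
    gamma, beta = np.ones(4), np.zeros(4)  # per-base scale and bias (assumed init)
    history = []
    for _ in range(n_iter):
        # Forward: normalize each base channel across positions, scale, softmax.
        mu, sigma = logits.mean(axis=0), logits.std(axis=0) + 1e-8
        n = (logits - mu) / sigma
        s = gamma * n + beta
        pwm = np.exp(s - s.max(axis=1, keepdims=True))
        pwm /= pwm.sum(axis=1, keepdims=True)
        # Sample a one-hot sequence from the PWM; the predictor only sees this.
        idx = np.array([rng.choice(4, p=row) for row in pwm])
        x = np.eye(4)[idx]
        history.append(toy_predict(x, W))
        # Backward: straight-through estimator (treat d(sample)/d(PWM) as identity).
        g = toy_input_grad(x, W)
        ds = pwm * (g - np.sum(g * pwm, axis=1, keepdims=True))  # softmax backward
        d_gamma, d_beta = np.sum(ds * n, axis=0), np.sum(ds, axis=0)
        dn = ds * gamma
        # Backward pass of the per-channel normalization.
        d_logits = (dn - dn.mean(axis=0) - n * np.mean(dn * n, axis=0)) / sigma
        # Gradient ascent on the continuous parameters.
        logits += lr * d_logits
        gamma += lr * d_gamma
        beta += lr * d_beta
    # Return the PWM from the last iteration and its most likely sequence.
    return pwm, "".join(BASES[pwm.argmax(axis=1)]), history

W = np.random.default_rng(1).normal(size=(50, 4))
pwm, seq, history = design(W)
```

With this toy scorer the sampled sequence only rescales the gradient; with a real neural network predictor, the gradient direction itself would depend on which one-hot sequence was sampled.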
Still, despite the speed improvement of Fast SeqProp, activation maximization methods share a few limitations. First, the algorithm needs to be run from scratch for every new generated sequence. Furthermore, optimization might get stuck in local minima or converge to a region in the sequence space far from the training data set, where the model is not accurate. A related issue is the lack of an explicit mechanism to force generated sequences to be distinct, thus limiting the diversity of sequences available for experimental testing and reducing the likelihood of finding one with high performance.

A different class of design methods is based on deep generative models, neural networks trained to learn the distribution of a training data set to generate completely new examples with similar properties.74,76,77 A major advantage over gradient methods is speed: After training, generating new examples requires a single evaluation of the generative model without any iterations; however, the basic versions of these methods do not optimize sequence performance or explicitly maximize diversity. We recently developed Deep Exploration Networks (DENs),3 an activation-maximizing deep generative model that addresses these limitations (Figure 8). DENs are trained via gradient descent by minimizing a cost function composed of two terms: one related to the performance of a generated sequence as given by an independent, pretrained predictive model (e.g., Optimus 5-Prime) and the other computed from a similarity metric between two generated sequences (Figure 8A). By simultaneously maximizing performance and minimizing similarity, DENs learn to generate highly diverse sequences with high performance. Furthermore, DENs can be restricted from generating sequences that deviate too much from the sequence space defined by the training data set of the predictor by using a variational autoencoder (VAE)76 to penalize deviation during training (Figure 8A). We showed that DENs can be used to design sequences with specified alternative polyadenylation isoform ratios and custom cleavage positions, splicing regulatory sequences with maximal differential splicing in two different cell lines and highly diverse GFP sequences with high fluorescence. Still, a potential drawback of generative models is the up-front requirement for model training. Thus we recommend using Fast SeqProp for simple design tasks and DENs when maximum performance and large numbers of diverse sequences are desirable.

Figure 8. Deep Exploration Networks. (A) A DEN is a neural network that transforms random real-valued vectors (Z) into PWMs, from which sequences can be sampled. During training, generated sequences are scored based on their predicted performance and their similarity to each other. Optionally, another generative network such as a VAE, trained on the same data as the predictive model, can be used to make sure DEN-generated sequences do not dramatically deviate from the training data set. (B) During generation, a single evaluation of the trained DEN results in a different generated sequence.

■ SUMMARY AND OUTLOOK

Improving our ability to map mRNA sequence to function and vice versa is key to developing a new generation of mRNA therapeutics. Here we reviewed how an approach combining high-throughput MPRA data, deep learning, and sequence design algorithms can be used to characterize mRNA regulation, extract biological insights, and design novel sequences with high performance.

We expect this work to be extended in several directions in the near future. First, MPRAs targeting regions other than the 5′UTR and processes other than translation will be developed. As a recent example, Qian and coworkers developed an MPRA with a short randomized uORF in the 5′UTR to study the interplay between translation and mRNA stability.35 Similarly, MPRAs targeting the 3′UTR region have been developed to study the effect of variants on mRNA abundance78 and subcellular transcript localization in neurons.79 Second, models capable of predicting multiple biomolecular processes from sequence should be further developed. For example, recent work found that sequences selected to have an exceptionally high MRL could result in low mRNA stability,36 suggesting that optimizing a predictor that models translation alone may not be an optimal strategy for maximizing protein expression. Similarly, in a test of six synthetic 5′UTRs with very high MRLs (7.8−10), a range of expression levels was observed,
possibly because of confounding effects of the 5′UTR sequence on stability or even cell toxicity.59 Finally, we expect understanding and engineering cell-type-specific expression to be a major goal going forward, as targeting expression to specific cell or tissue types will limit the side effects of future mRNA therapeutics. Whereas some cell-type specificity can currently be achieved by pasting miRNA binding elements into the 3′UTR,80,81 we expect model-based design to further increase the specificity and allow the targeting of cell types and tissues that are currently inaccessible.

■ AUTHOR INFORMATION

Corresponding Author
Georg Seelig − Department of Electrical & Computer Engineering and Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, Washington 98195, United States; orcid.org/0000-0002-3163-8782; Email: gseelig@uw.edu

Author
Sebastian M. Castillo-Hair − Department of Electrical & Computer Engineering and eScience Institute, University of Washington, Seattle, Washington 98195, United States; orcid.org/0000-0002-2384-3129

Complete contact information is available at:
https://pubs.acs.org/10.1021/acs.accounts.1c00621

Notes
The authors declare no competing financial interest.

Biographies
Sebastian M. Castillo-Hair is a Data Science Postdoctoral Fellow at the eScience Institute and the Department of Electrical and Computer Engineering at University of Washington. He studies how high-throughput assays and machine learning can be used to engineer novel synthetic biological systems. Previously, he obtained his Ph.D. in Bioengineering at Rice University.

Georg Seelig is a Professor at the University of Washington. His research interests are in synthetic biology and genomics.

■ ACKNOWLEDGMENTS

We thank Johannes Linder for feedback on this manuscript. This work was supported by NIH Awards R01GM120379 and R01HG009892 to G.S., and by the University of Washington eScience Institute with support from the Washington Research Foundation to S.M.C.

■ REFERENCES

(1) Sample, P. J.; Wang, B.; Reid, D. W.; Presnyak, V.; McFadyen, I. J.; Morris, D. R.; Seelig, G. Human 5′ UTR Design and Variant Effect Prediction from a Massively Parallel Translation Assay. Nat. Biotechnol. 2019, 37, 803−809.
(2) Linder, J.; Seelig, G. Fast Activation Maximization for Molecular Sequence Design. BMC Bioinf. 2021, 22, 510.
(3) Linder, J.; Bogard, N.; Rosenberg, A. B.; Seelig, G. A Generative Neural Network for Maximizing Fitness and Diversity of Synthetic DNA and Protein Sequences. Cell Systems 2020, 11, 49−62.
(4) Polack, F. P.; Thomas, S. J.; Kitchin, N.; Absalon, J.; Gurtman, A.; Lockhart, S.; Perez, J. L.; Pérez Marc, G.; Moreira, E. D.; Zerbini, C.; Bailey, R.; Swanson, K. A.; Roychoudhury, S.; Koury, K.; Li, P.; Kalina, W. V.; Cooper, D.; Frenck, R. W.; Hammitt, L. L.; Türeci, Ö.; Nell, H.; Schaefer, A.; Ünal, S.; Tresnan, D. B.; Mather, S.; Dormitzer, P. R.; Şahin, U.; Jansen, K. U.; Gruber, W. C. Safety and Efficacy of the BNT162b2 MRNA Covid-19 Vaccine. N. Engl. J. Med. 2020, 383, 2603−2615.
(5) Baden, L. R.; El Sahly, H. M.; Essink, B.; Kotloff, K.; Frey, S.; Novak, R.; Diemert, D.; Spector, S. A.; Rouphael, N.; Creech, C. B.; McGettigan, J.; Khetan, S.; Segall, N.; Solis, J.; Brosz, A.; Fierro, C.; Schwartz, H.; Neuzil, K.; Corey, L.; Gilbert, P.; Janes, H.; Follmann, D.; Marovich, M.; Mascola, J.; Polakowski, L.; Ledgerwood, J.; Graham, B. S.; Bennett, H.; Pajon, R.; Knightly, C.; Leav, B.; Deng, W.; Zhou, H.; Han, S.; Ivarsson, M.; Miller, J.; Zaks, T. Efficacy and Safety of the MRNA-1273 SARS-CoV-2 Vaccine. N. Engl. J. Med. 2021, 384, 403−416.
(6) Chaudhary, N.; Weissman, D.; Whitehead, K. A. MRNA Vaccines for Infectious Diseases: Principles, Delivery and Clinical Translation. Nat. Rev. Drug Discovery 2021, 20, 817−838.
(7) Wu, K.; Choi, A.; Koch, M.; Elbashir, S.; Ma, L.; Lee, D.; Woods, A.; Henry, C.; Palandjian, C.; Hill, A.; Jani, H.; Quinones, J.; Nunna, N.; O’Connell, S.; McDermott, A. B.; Falcone, S.; Narayanan, E.; Colpitts, T.; Bennett, H.; Corbett, K. S.; Seder, R.; Graham, B. S.; Stewart-Jones, G. B.; Carfi, A.; Edwards, D. K. Variant SARS-CoV-2 mRNA vaccines confer broad neutralization as primary or booster series in mice. bioRxiv 2021, DOI: 10.1101/2021.04.13.439482.
(8) Sebastian, M.; Schröder, A.; Scheel, B.; Hong, H. S.; Muth, A.; von Boehmer, L.; Zippelius, A.; Mayer, F.; Reck, M.; Atanackovic, D.; Thomas, M.; Schneller, F.; Stöhlmacher, J.; Bernhard, H.; Gröschel, A.; Lander, T.; Probst, J.; Strack, T.; Wiegand, V.; Gnad-Vogt, U.; Kallen, K.-J.; Hoerr, I.; von der Muelbe, F.; Fotin-Mleczek, M.; Knuth, A.; Koch, S. D. A Phase I/IIa Study of the MRNA-Based Cancer Immunotherapy CV9201 in Patients with Stage IIIB/IV Non-Small Cell Lung Cancer. Cancer Immunol. Immunother. 2019, 68, 799−812.
(9) Papachristofilou, A.; Hipp, M. M.; Klinkhardt, U.; Früh, M.; Sebastian, M.; Weiss, C.; Pless, M.; Cathomas, R.; Hilbe, W.; Pall, G.; Wehler, T.; Alt, J.; Bischoff, H.; Geißler, M.; Griesinger, F.; Kallen, K.-J.; Fotin-Mleczek, M.; Schröder, A.; Scheel, B.; Muth, A.; Seibel, T.; Stosnach, C.; Doener, F.; Hong, H. S.; Koch, S. D.; Gnad-Vogt, U.; Zippelius, A. Phase Ib Evaluation of a Self-Adjuvanted Protamine Formulated MRNA-Based Active Cancer Immunotherapy, BI1361849 (CV9202), Combined with Local Radiation Treatment in Patients with Stage IV Non-Small Cell Lung Cancer. J. Immunother. Cancer 2019, 7, 38.
(10) Beck, J. D.; Reidenbach, D.; Salomon, N.; Sahin, U.; Türeci, Ö.; Vormehr, M.; Kranz, L. M. MRNA Therapeutics in Cancer Immunotherapy. Mol. Cancer 2021, 20, 69.
(11) Kwon, H.; Kim, M.; Seo, Y.; Moon, Y. S.; Lee, H. J.; Lee, K.; Lee, H. Emergence of Synthetic MRNA: In Vitro Synthesis of MRNA and Its Applications in Regenerative Medicine. Biomaterials 2018, 156, 172−193.
(12) Warren, L.; Lin, C. MRNA-Based Genetic Reprogramming. Mol. Ther. 2019, 27, 729−734.
(13) Chanda, P. K.; Sukhovershin, R.; Cooke, J. P. MRNA-Enhanced Cell Therapy and Cardiovascular Regeneration. Cells 2021, 10, 187.
(14) Magadum, A.; Kaur, K.; Zangi, L. MRNA-Based Protein Replacement Therapy for the Heart. Mol. Ther. 2019, 27, 785−793.
(15) Trepotec, Z.; Lichtenegger, E.; Plank, C.; Aneja, M. K.; Rudolph, C. Delivery of MRNA Therapeutics for the Treatment of Hepatic Diseases. Mol. Ther. 2019, 27, 794−802.
(16) Sahin, U.; Karikó, K.; Türeci, Ö. MRNA-Based Therapeutics – Developing a New Class of Drugs. Nat. Rev. Drug Discovery 2014, 13, 759−780.
(17) Pardi, N.; Hogan, M. J.; Porter, F. W.; Weissman, D. MRNA Vaccines – a New Era in Vaccinology. Nat. Rev. Drug Discovery 2018, 17, 261−279.
(18) Stepinski, J.; Waddell, C.; Stolarski, R.; Darzynkiewicz, E.; Rhoads, R. E. Synthesis and Properties of MRNAs Containing the Novel “Anti-Reverse” Cap Analogs 7-Methyl(3′-O-Methyl)GpppG and 7-Methyl(3′-Deoxy)GpppG. RNA 2001, 7, 1486−1495.
(19) Jemielity, J.; Fowler, T.; Zuberek, J.; Stepinski, J.; Lewdorowicz, M.; Niedzwiecka, A.; Stolarski, R.; Darzynkiewicz, E.; Rhoads, R. E.
Novel “Anti-Reverse” Cap Analogs with Superior Translational Properties. RNA 2003, 9, 1108−1122.
(20) Kuhn, A. N.; Diken, M.; Kreiter, S.; Selmi, A.; Kowalska, J.; Jemielity, J.; Darzynkiewicz, E.; Huber, C.; Türeci, Ö.; Sahin, U. Phosphorothioate Cap Analogs Increase Stability and Translational Efficiency of RNA Vaccines in Immature Dendritic Cells and Induce Superior Immune Responses in Vivo. Gene Ther. 2010, 17, 961−971.
(21) Kocmik, I.; Piecyk, K.; Rudzinska, M.; Niedzwiecka, A.; Darzynkiewicz, E.; Grzela, R.; Jankowska-Anyszka, M. Modified ARCA Analogs Providing Enhanced Translational Properties of Capped MRNAs. Cell Cycle 2018, 17, 1624−1636.
(22) Ensinger, M. J.; Martin, S. A.; Paoletti, E.; Moss, B. Modification of the 5′-Terminus of MRNA by Soluble Guanylyl and Methyl Transferases from Vaccinia Virus. Proc. Natl. Acad. Sci. U. S. A. 1975, 72, 2525−2529.
(23) Yisraeli, J. K.; Melton, D. A. [4] Synthesis of Long, Capped Transcripts in Vitro by SP6 and T7 RNA Polymerases. In Methods in Enzymology; RNA Processing Part A: General Methods; Academic Press, 1989; Vol. 180, pp 42−50.
(24) Yunus, M. A.; Chung, L. M. W.; Chaudhry, Y.; Bailey, D.; Goodfellow, I. Development of an Optimized RNA-Based Murine Norovirus Reverse Genetics System. J. Virol. Methods 2010, 169, 112−118.
(25) Henderson, J. M.; Ujita, A.; Hill, E.; Yousif-Rosales, S.; Smith, C.; Ko, N.; McReynolds, T.; Cabral, C. R.; Escamilla-Powers, J. R.; Houston, M. E. Cap 1 Messenger RNA Synthesis with Co-Transcriptional CleanCap® Analog by In Vitro Transcription. Current Protocols 2021, 1, e39.
(26) Karikó, K.; Muramatsu, H.; Welsh, F. A.; Ludwig, J.; Kato, H.; Akira, S.; Weissman, D. Incorporation of Pseudouridine Into MRNA Yields Superior Nonimmunogenic Vector With Increased Translational Capacity and Biological Stability. Mol. Ther. 2008, 16, 1833−1840.
(27) Andries, O.; Mc Cafferty, S.; De Smedt, S. C.; Weiss, R.; Sanders, N. N.; Kitada, T. N1-Methylpseudouridine-Incorporated MRNA Outperforms Pseudouridine-Incorporated MRNA by Providing Enhanced Protein Expression and Reduced Immunogenicity in Mammalian Cell Lines and Mice. J. Controlled Release 2015, 217, 337−344.
(28) Li, B.; Luo, X.; Dong, Y. Effects of Chemically Modified Messenger RNA on Protein Expression. Bioconjugate Chem. 2016, 27, 849−853.
(29) Hou, X.; Zaks, T.; Langer, R.; Dong, Y. Lipid Nanoparticles for MRNA Delivery. Nat. Rev. Mater. 2021, 1−17.
(30) Weng, Y.; Li, C.; Yang, T.; Hu, B.; Zhang, M.; Guo, S.; Xiao, H.; Liang, X.-J.; Huang, Y. The Challenge and Prospect of MRNA Therapeutics Landscape. Biotechnol. Adv. 2020, 40, 107534.
(31) Orlandini von Niessen, A. G.; Poleganov, M. A.; Rechner, C.; Plaschke, A.; Kranz, L. M.; Fesser, S.; Diken, M.; Löwer, M.; Vallazza, B.; Beissert, T.; Bukur, V.; Kuhn, A. N.; Türeci, Ö.; Sahin, U. Improving MRNA-Based Therapeutic Gene Delivery by Expression-Augmenting 3′ UTRs Identified by Cellular Library Screening. Mol. Ther. 2019, 27, 824−836.
(32) Roth, N.; Schön, J.; Hoffmann, D.; Thran, M.; Thess, A.; Mueller, S. O.; Petsch, B.; Rauch, S. CV2CoV, an Enhanced MRNA-Based SARS-CoV-2 Vaccine Candidate, Supports Higher Protein Expression and Improved Immunogenicity in Rats. bioRxiv 2021, DOI: 10.1101/2021.05.13.443734.
(33) Gerstberger, S.; Hafner, M.; Tuschl, T. A Census of Human RNA-Binding Proteins. Nat. Rev. Genet. 2014, 15, 829−845.
(34) Sood, P.; Krek, A.; Zavolan, M.; Macino, G.; Rajewsky, N. Cell-Type-Specific Signatures of MicroRNAs on Target MRNA Expression. Proc. Natl. Acad. Sci. U. S. A. 2006, 103, 2746−2751.
(35) Jia, L.; Mao, Y.; Ji, Q.; Dersh, D.; Yewdell, J. W.; Qian, S.-B. Decoding MRNA Translatability and Stability from the 5′ UTR. Nat. Struct. Mol. Biol. 2020, 27, 814−821.
(36) Leppek, K.; Byeon, G. W.; Kladwang, W.; Wayment-Steele, H. K.; Kerr, C. H.; Xu, A. F.; Kim, D. S.; Topkar, V. V.; Choe, C.; Rothschild, D.; Tiu, G. C.; Wellington-Oguri, R.; Fujii, K.; Sharma, E.; Watkins, A. M.; Nicol, J. J.; Romano, J.; Tunguz, B.; Participants, E.; Barna, M.; Das, R. Combinatorial Optimization of MRNA Structure, Stability, and Translation for RNA-Based Therapeutics. bioRxiv 2021, DOI: 10.1101/2021.03.29.437587.
(37) Jackson, R. J.; Hellen, C. U. T.; Pestova, T. V. The Mechanism of Eukaryotic Translation Initiation and Principles of Its Regulation. Nat. Rev. Mol. Cell Biol. 2010, 11, 113−127.
(38) Kozak, M. Point Mutations Define a Sequence Flanking the AUG Initiator Codon That Modulates Translation by Eukaryotic Ribosomes. Cell 1986, 44, 283−292.
(39) Kozak, M. Structural Features in Eukaryotic MRNAs That Modulate the Initiation of Translation. J. Biol. Chem. 1991, 266, 19867−19870.
(40) Hinnebusch, A. G.; Ivanov, I. P.; Sonenberg, N. Translational Control by 5′-Untranslated Regions of Eukaryotic MRNAs. Science 2016, 352, 1413−1416.
(41) Leppek, K.; Das, R.; Barna, M. Functional 5′ UTR MRNA Structures in Eukaryotic Translation Regulation and How to Find Them. Nat. Rev. Mol. Cell Biol. 2018, 19, 158−174.
(42) Avni, D.; Biberman, Y.; Meyuhas, O. The 5′ Terminal Oligopyrimidine Tract Confers Translational Control on TOP MRNAs in a Cell Type- and Sequence Context-Dependent Manner. Nucleic Acids Res. 1997, 25, 995−1001.
(43) Hellen, C. U. T.; Sarnow, P. Internal Ribosome Entry Sites in Eukaryotic MRNA Molecules. Genes Dev. 2001, 15, 1593−1612.
(44) Ingolia, N. T.; Ghaemmaghami, S.; Newman, J. R. S.; Weissman, J. S. Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science 2009, 324, 218−223.
(45) Floor, S. N.; Doudna, J. A. Tunable Protein Synthesis by Transcript Isoforms in Human Cells. eLife 2016, 5, No. e10921.
(46) Blair, J. D.; Hockemeyer, D.; Doudna, J. A.; Bateup, H. S.; Floor, S. N. Widespread Translational Remodeling during Human Neuronal Differentiation. Cell Rep. 2017, 21, 2005−2016.
(47) Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. arXiv [cs.CV], August 4, 2017, 1707.02968. https://arxiv.org/abs/1707.02968 (accessed 2021-09-06).
(48) Kinney, J. B.; McCandlish, D. M. Massively Parallel Assays and Quantitative Sequence−Function Relationships. Annu. Rev. Genomics Hum. Genet. 2019, 20, 99−127.
(49) Noderer, W. L.; Flockhart, R. J.; Bhaduri, A.; Diaz de Arce, A. J.; Zhang, J.; Khavari, P. A.; Wang, C. L. Quantitative Analysis of Mammalian Translation Initiation Sites by FACS-Seq. Mol. Syst. Biol. 2014, 10, 748.
(50) Diaz de Arce, A. J.; Noderer, W. L.; Wang, C. L. Complete Motif Analysis of Sequence Requirements for Translation Initiation at Non-AUG Start Codons. Nucleic Acids Res. 2018, 46, 985−994.
(51) Cuperus, J. T.; Groves, B.; Kuchina, A.; Rosenberg, A. B.; Jojic, N.; Fields, S.; Seelig, G. Deep Learning of the Regulatory Grammar of Yeast 5′ Untranslated Regions from 500,000 Random Sequences. Genome Res. 2017, 27, 2015−2024.
(52) Ke, S.; Shang, S.; Kalachikov, S. M.; Morozova, I.; Yu, L.; Russo, J. J.; Ju, J.; Chasin, L. A. Quantitative Evaluation of All Hexamers as Exonic Splicing Elements. Genome Res. 2011, 21, 1360−1374.
(53) Rosenberg, A. B.; Patwardhan, R. P.; Shendure, J.; Seelig, G. Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences. Cell 2015, 163, 698−711.
(54) Eraslan, G.; Avsec, Ž.; Gagneur, J.; Theis, F. J. Deep Learning: New Computational Modelling Techniques for Genomics. Nat. Rev. Genet. 2019, 20, 389−403.
(55) Bogard, N.; Linder, J.; Rosenberg, A. B.; Seelig, G. A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation. Cell 2019, 178, 91−106.
(56) Ferreira, J. P.; Overton, K. W.; Wang, C. L. Tuning Gene Expression with Synthetic Upstream Open Reading Frames. Proc. Natl. Acad. Sci. U. S. A. 2013, 110, 11284−11289.
(57) Landrum, M. J.; Lee, J. M.; Benson, M.; Brown, G.; Chao, C.; Chitipiralla, S.; Gu, B.; Hart, J.; Hoffman, D.; Hoover, J.; Jang, W.; Katz, K.; Ovetsky, M.; Riley, G.; Sethi, A.; Tully, R.; Villamarin-Salomon, R.; Rubinstein, W.; Maglott, D. R. ClinVar: Public Archive of Interpretations of Clinically Relevant Variants. Nucleic Acids Res. 2016, 44, D862−D868.
(58) Karollus, A.; Avsec, Ž.; Gagneur, J. Predicting Mean Ribosome Load for 5′UTR of Any Length Using Deep Learning. PLoS Comput. Biol. 2021, 17, No. e1008982.
(59) Linares-Fernández, S.; Moreno, J.; Lambert, E.; Mercier-Gouy, P.; Vachez, L.; Verrier, B.; Exposito, J.-Y. Combining an Optimized MRNA Template with a Double Purification Process Allows Strong Expression of in Vitro Transcribed MRNA. Mol. Ther.–Nucleic Acids 2021, 26, 945−956.
(60) Breathnach, R.; Harris, B. A. Plasmids for the Cloning and Expression of Full-Length Double-Stranded CDNAs under Control of the SV40 Early or Late Gene Promoter. Nucleic Acids Res. 1983, 11, 7119−7136.
(61) Studier, F. W.; Moffatt, B. A. Use of Bacteriophage T7 RNA Polymerase to Direct Selective High-Level Expression of Cloned Genes. J. Mol. Biol. 1986, 189, 113−130.
(62) Alper, H.; Fischer, C.; Nevoigt, E.; Stephanopoulos, G. Tuning Genetic Control through Promoter Engineering. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 12678−12683.
(63) Nevoigt, E.; Kohnke, J.; Fischer, C. R.; Alper, H.; Stahl, U.; Stephanopoulos, G. Engineering of Promoter Replacement Cassettes for Fine-Tuning of Gene Expression in Saccharomyces Cerevisiae. Appl. Environ. Microbiol. 2006, 72, 5266−5273.
(64) de Boer, H. A.; Comstock, L. J.; Vasser, M. The Tac Promoter: A Functional Hybrid Derived from the Trp and Lac Promoters. Proc. Natl. Acad. Sci. U. S. A. 1983, 80, 21−25.
(65) Mutalik, V. K.; Guimaraes, J. C.; Cambray, G.; Lam, C.; Christoffersen, M. J.; Mai, Q.-A.; Tran, A. B.; Paull, M.; Keasling, J. D.; Arkin, A. P.; Endy, D. Precise and Reliable Gene Expression via Standard Transcription and Translation Initiation Elements. Nat. Methods 2013, 10, 354−360.
(66) Lutz, R.; Bujard, H. Independent and Tight Regulation of Transcriptional Units in Escherichia Coli Via the LacR/O, the TetR/O and AraC/I1-I2 Regulatory Elements. Nucleic Acids Res. 1997, 25, 1203−1210.
(67) Chao, S.-H.; Harada, J. N.; Hyndman, F.; Gao, X.; Nelson, C. G.; Chanda, S. K.; Caldwell, J. S. PDX1, a Cellular Homeoprotein, Binds to and Regulates the Activity of Human Cytomegalovirus Immediate Early Promoter. J. Biol. Chem. 2004, 279, 16111−16120.
(68) Magnusson, T.; Haase, R.; Schleef, M.; Wagner, E.; Ogris, M. Sustained, High Transgene Expression in Liver with Plasmid Vectors Using Optimized Promoter-Enhancer Combinations. Journal of Gene Medicine 2011, 13, 382−391.
(69) Blazeck, J.; Garg, R.; Reed, B.; Alper, H. S. Controlling Promoter Strength and Regulation in Saccharomyces Cerevisiae Using Synthetic Hybrid Promoters. Biotechnol. Bioeng. 2012, 109, 2884−2895.
(70) Salis, H. M.; Mirsky, E. A.; Voigt, C. A. Automated Design of Synthetic Ribosome Binding Sites to Control Protein Expression. Nat. Biotechnol. 2009, 27, 946−950.
(71) Eiben, A. E.; Smith, J. From Evolutionary Computation to the Evolution of Things. Nature 2015, 521, 476−482.
(72) Taneda, A. Multi-Objective Genetic Algorithm for Pseudoknotted RNA Sequence Design. Front. Genet. 2012, 3, 36.
(73) Cao, J.; Novoa, E. M.; Zhang, Z.; Chen, W. C. W.; Liu, D.; Choi, G. C. G.; Wong, A. S. L.; Wehrspaun, C.; Kellis, M.; Lu, T. K. High-Throughput 5′ UTR Engineering for Enhanced Protein Production in Non-Viral Gene Therapies. Nat. Commun. 2021, 12, 4138.
(74) Killoran, N.; Lee, L. J.; Delong, A.; Duvenaud, D.; Frey, B. J. Generating and Designing DNA with Deep Generative Models. arXiv [cs.LG], December 17, 2017, 1712.06148. https://arxiv.org/abs/1712.06148 (accessed 2021-09-28).
(75) Lanchantin, J.; Singh, R.; Lin, Z.; Qi, Y. Deep Motif: Visualizing Genomic Sequence Classifications. arXiv [cs.LG], June 2, 2016, 1605.01133. https://arxiv.org/abs/1605.01133 (accessed 2021-09-28).
(76) Kingma, D. P.; Welling, M. Auto-Encoding Variational Bayes. arXiv [stat.ML], May 1, 2014, 1312.6114. https://arxiv.org/abs/1312.6114 (accessed 2021-08-28).
(77) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv [stat.ML], June 10, 2014, 1406.2661. https://arxiv.org/abs/1406.2661 (accessed 2021-08-28).
(78) Griesemer, D.; Xue, J. R.; Reilly, S. K.; Ulirsch, J. C.; Kukreja, K.; Davis, J. R.; Kanai, M.; Yang, D. K.; Butts, J. C.; Guney, M. H.; Luban, J.; Montgomery, S. B.; Finucane, H. K.; Novina, C. D.; Tewhey, R.; Sabeti, P. C. Genome-Wide Functional Screen of 3′UTR Variants Uncovers Causal Variants for Human Disease and Evolution. Cell 2021, 184, 5247−5260.e19.
(79) Mikl, M.; Eletto, D.; Lee, M.; Lafzi, A.; Mhamedi, F.; Sain, S. B.; Handler, K.; Moor, A. E. A Massively Parallel Reporter Assay Reveals Focused and Broadly Encoded RNA Localization Signals in Neurons. bioRxiv 2021, DOI: 10.1101/2021.04.27.441590.
(80) Xie, Z.; Wroblewska, L.; Prochazka, L.; Weiss, R.; Benenson, Y. Multi-Input RNAi-Based Logic Circuit for Identification of Specific Cancer Cells. Science 2011, 333, 1307−1311.
(81) Jain, R.; Frederick, J. P.; Huang, E. Y.; Burke, K. E.; Mauger, D. M.; Andrianova, E. A.; Farlow, S. J.; Siddiqui, S.; Pimentel, J.; Cheung-Ong, K.; McKinney, K. M.; Köhrer, C.; Moore, M. J.; Chakraborty, T. MicroRNAs Enable MRNA Therapeutics to Selectively Program Cancer Cells to Self-Destruct. Nucleic Acid Ther. 2018, 28, 285−296.