
News

AI-enhanced protein design makes proteins that have never existed
Protein engineers are drawing on rapidly evolving machine learning tools, deep reservoirs of data, and the structure-predicting firepower of AlphaFold2 to pursue more sophisticated de novo protein designs.

By Michael Eisenstein

On 26 January, Profluent came out of stealth mode with $9 million in seed funding to support the company’s efforts to apply machine learning (ML) to engineer novel functional proteins. This is just the latest in a steady flurry of investment in this space. Last January, Generate Biomedicines signed a $50 million drug development deal with Amgen that could potentially net the company more than $1.9 billion in total, and a few months later, Arzeda drew $33 million in series B funding to support its ongoing protein design programs. Other startups are also starting to crowd the field, such as computational company Cradle, which exited stealth in November with a $5.5 million seed investment, and Monod Bio, which launched with $25 million in seed funding in August.

[Figure: AI-based algorithms can guide the design of proteins exhibiting many different kinds of symmetry, from simple spherical forms to complex icosahedral designs. Credit: Generate Biomedicines.]

ML and other artificial intelligence (AI)-based computational tools have already proven their prowess at predicting real-world protein structures. AlphaFold 2, an algorithm developed by scientists at DeepMind that can confidently predict protein structure purely on the basis of an amino acid sequence, has become a household name since its launch in July 2021. Today, AlphaFold 2 is used routinely by many structural biologists, with over 200 million structures predicted.

This ML toolbox could generate made-to-order proteins too, including those with functions not present in nature. This is an appealing prospect because, despite natural proteins’ vast molecular diversity, there are many biomedical and industrial problems that evolution has never been compelled to solve. Scientists are now rapidly moving toward a future in which they can apply careful computational analysis to infer the underlying principles governing the structure and function of real-world proteins and apply them to construct bespoke proteins with functions devised by the user. Lucas Nivon, CEO and cofounder of Cyrus Biotechnology, believes the ultimate impact of such in silico-designed proteins will be massive and compares the field to the fledgling biotech industry of the 1980s. “I think in 30 years 30, 40 or 50% of drugs will be computationally designed proteins,” he says.

To date, companies operating in the protein design space have largely focused on retooling existing proteins to perform new tasks or enhance specific properties, rather than true design from scratch. For example, scientists at Generate Biomedicines have drawn on existing knowledge about the SARS-CoV-2 spike protein and its interactions with the receptor protein ACE2 to design a synthetic protein that can consistently block viral entry across diverse variants. “In our internal testing, this molecule is quite resistant to all of the variants that we’ve seen thus far,” says cofounder and CTO Gevorg Grigoryan, adding that Generate aims to file Investigational New Drug paperwork to clear the way for clinical testing in the second quarter of this year. More ambitious programs are on the horizon, although it remains to be seen how soon the leap to de novo design, in which new proteins are built entirely from scratch, will come.

The field of AI-assisted protein design is blossoming, but the roots of the field stretch back more than two decades, with work by academic researchers like David Baker and colleagues at what is now the Institute for Protein Design at the University of Washington. Starting in the late 1990s, Baker, who has co-founded companies in this space including Cyrus, Monod and Arzeda, oversaw the development of Rosetta, a foundational software suite for predicting and manipulating protein structures. Since then, Baker and other researchers have developed many other powerful tools for protein design, powered by rapid progress in ML algorithms and, particularly, by progress in a subset of ML techniques known as deep learning. This past September, for example, Baker’s team published their deep learning ProteinMPNN platform, which allows them to input the structure they want and have the algorithm spit out an amino acid sequence likely to produce that de novo backbone structure, achieving a >50% success rate.

Some of the greatest excitement in the deep learning world relates to generative models that can create entirely new proteins, never seen before in nature. These modeling tools belong to the same category of algorithms used to produce eerie and compelling AI-generated artworks in programs like Stable Diffusion or DALL-E 2 and text in programs like

nature biotechnology

ChatGPT. In those cases, the software is trained on vast amounts of annotated image data and then uses those insights to produce new pictures in response to user queries. The same feat can be achieved with protein sequences and structures, where the algorithm draws on a rich repository of real-world biological information to dream up new proteins based on the patterns and principles observed in nature. To do this, however, researchers also need to give the computer guidance on the biochemical and physical constraints that inform protein design, or else the resulting output will offer little more than artistic value.

One effective strategy to understand protein sequence and structure is to approach them as ‘text’, using language modeling algorithms that follow rules of biological ‘grammar’ and ‘syntax’. “To generate a fluent sentence or a document, the algorithm needs to learn about relationships between different types of words, but it needs to also learn facts about the world to make a document that’s cohesive and makes sense,” says Ali Madani, a computer scientist formerly at Salesforce Research who recently founded Profluent. In a recent publication, Madani and colleagues describe a language modeling algorithm that can yield novel computer-designed proteins that can be successfully produced in the lab with catalytic activities comparable to those of natural enzymes. Language modeling is also a key part of Arzeda’s toolbox, according to co-founder and CEO Alexandre Zanghellini. For one project, the company used multiple rounds of algorithmic design and optimization to engineer an enzyme with improved stability against degradation. “In three rounds of iteration, we were able to go from complete disappearance of the protein after four weeks to retention of effectively 95% activity,” he says.

A recent preprint from researchers at Generate describes a new generative modeling-based design algorithm called Chroma, which includes several features that improve its performance and success rate. These include diffusion models, an approach used in many image-generation AI tools that makes it easier to manipulate complex, multidimensional data. Chroma also employs algorithmic techniques to assess long-range interactions between residues that are far apart on the protein’s amino acid backbone, but that may be essential for proper folding and function. In a series of initial demonstrations, the Generate team showed that they could obtain sequences that were predicted to fold into a broad array of naturally occurring and arbitrarily chosen structures and subdomains, including the shapes of the letters of the alphabet, although it remains to be seen how many will form these folds in the lab.

In addition to the new algorithms’ power, the tremendous amount of structural data captured by biologists has also allowed the protein design field to take off. The Protein Data Bank, a critical resource for protein designers, now contains more than 200,000 experimentally solved structures. The AlphaFold 2 algorithm is also proving to be a game changer here in terms of providing training material and guidance for design algorithms. “They are models, so you have to take them with a grain of salt, but now you have this extraordinarily large amount of predicted structures that you can build upon,” says Zanghellini, who says this tool is a core component of Arzeda’s computational design workflow.

For AI-guided design, more training data are always better. But existing gene and protein databases are constrained by a limited range of species and a heavy bias towards humans and commonly used model organisms. Basecamp Research is building an ultra-diverse repository of biological information obtained from samples collected in biomes in 17 countries, ranging from the Antarctic to the rainforest to hydrothermal vents on the ocean floor. Chief Technology Officer Philipp Lorenz says that once the genomic data from these specimens are analyzed and annotated, they can assemble a knowledge-graph that can reveal functional relationships between diverse proteins and pathways that would not be obvious purely on the basis of sequence-based analysis. “It’s not just generating a new protein,” says Lorenz. “We are finding protein families in prokaryotes that have been thought to exist only in eukaryotes.” This means many more starting points for AI-guided protein design efforts, and Lorenz says that his team’s own design experiments have achieved an 80% success rate at producing functional proteins.

But proteins do not function in a vacuum. Tess van Stekelenburg, an investor at Hummingbird Ventures, notes that Basecamp, one of the companies funded by the firm, captures all manner of environmental and biochemical context for the proteins it identifies. The resulting ‘metadata’ accompanying each protein sequence can help guide the engineering of proteins that express and function optimally in particular conditions. “It gives you a lot more ability to constrain for things like pH, temperature or pressure, if that’s what you’re planning to look at,” she says.

Some companies are also looking to augment public structural biology resources with data of their own. Generate is in the process of building a multi-instrument cryo-electron microscopy facility, which will allow them to generate near-atomic-resolution structures at relatively high throughput. Such internally generated structural data are more likely to include relevant metadata about individual proteins than data from publicly available resources.

In-house wet lab facilities are another critical component of the design process because experimental results are, in turn, used to retrain the algorithm to achieve even better outcomes in future rounds. Grigoryan notes that, although Generate likes to spotlight its algorithmic toolbox, the majority of its workforce comprises experimentalists. And Bruno Correia, a computational biologist at the École Polytechnique Fédérale de Lausanne, says that the success of a protein design effort depends on close consultation between algorithm experts and experienced wet-lab practitioners. “This notion of how protein molecules are and how they behave experimentally builds in a lot of constraints,” says Correia. “I think it’s a mistake to handle biological entities just as a piece of data.”

Biological validation is an extremely important consideration for investors in this sector, says van Stekelenburg. “If you are doing de novo, the real gold standard is not which architecture are you using; it’s what percentage of your designed proteins had the end desired property,” she says. “If you can’t show that, then it doesn’t make sense.” Accordingly, most companies pursuing computational design are still focused on tuning protein function rather than overhauling it, shortening the leap between prediction and performance.

Nivon says that Cyrus typically works with existing drugs and proteins that fall short in a particular parameter. “This could be a drug that needs better efficacy, lower immunogenicity or a better toxicity profile,” he says. For Cradle, the primary goal is to improve protein therapeutics by optimizing properties like stability. “We’ve benchmarked our model against empirical studies so that people can get a sense of how well this might work in an experimental setting,” says founder and CEO Stef van Grieken.

Arzeda’s focus is on enzyme engineering for industrial applications. They have already succeeded in creating proteins with novel catalytic functions for use in agriculture, materials and food science. These projects often begin with a relatively well-established
core reaction that is catalyzed in nature. But to adapt these reactions to work with a different substrate, “you need to remodel the active site dramatically,” says Zanghellini. Some of the company’s projects include a plant enzyme that can break down a widely used herbicide, as well as enzymes that can convert relatively low-value plant byproducts into useful natural sweeteners.

Generate’s first-generation engineering projects have focused on optimization. In one published study, company scientists showed that they could ‘resurface’ the amino acid-metabolizing enzyme L-asparaginase from Escherichia coli bacteria, altering the amino acid composition of its exterior to greatly reduce its immunogenicity. But with the new Chroma algorithm, Grigoryan says that Generate is ready to embark on more ambitious projects, in which the algorithm can start building true de novo designs with user-designated structural and functional features. Of course, Chroma’s design proposals must then be validated by experimental testing, although Grigoryan says “we’re very encouraged by what we’ve seen.”

Zanghellini believes the field is near an inflection point. “We’re starting to see the possibility of really truly creating a complex active site and then building the protein around it,” he says. But he adds that many more challenges await. For example, a protein with excellent catalytic properties might be exceedingly difficult to manufacture at scale or exhibit poor properties as a drug. In the future, however, next-generation algorithms should make it possible to generate de novo proteins optimized to tick off many boxes on a scientist’s wish list rather than just one.

Michael Eisenstein
Philadelphia, PA, USA

Acknowledgements
Additional reporting by Shafaq Zia.

