RF Diffusion Manual

The RF Diffusion manual outlines the process of protein backbone generation using machine learning algorithms, specifically RF Diffusion, ProteinMPNN, and AlphaFold2. It details the steps involved in designing proteins, including backbone generation, sequence design, and experimental validation, while emphasizing the algorithm's ability to create new proteins with desired functions. The manual also describes the practical application of these tools in developing a specific nanobody for OXA48, including the setup and execution of the RF Diffusion Google Colab.

MANUALS RF-DIFFUSION
SUPERBUGBUSTER TEAM
RF Diffusion manual
Designing a protein backbone, the chain of nitrogen and carbon atoms that forms the base structure of a
protein, is a really challenging task. Historically, protein structures were first determined by experiments
such as X-ray crystallography or cryo-electron microscopy. But experiments have their limits. This is why,
in the past few years, a number of software tools have appeared that predict the structure of a protein by
modeling it computationally.

1. How does it work?

Models like RF Diffusion are useful tools for finding proteins that nature never developed and that are
valuable to us because of their new or modified functions or properties.

But, first, it is important to understand how protein design works.


1- Backbone generation: first, it is important to find the basic structure of a protein that can carry
the desired function or bind to a specific target protein. This is where RF Diffusion comes in.
2- Sequence design: next, a sequence that can fold into that backbone structure has to be found.
This part is done with ProteinMPNN.
3- Computational filtering: the best designs are then selected. The problem is run in the reverse
direction with other algorithms such as AlphaFold2: if the designed sequence is not predicted to
fold into the structure that was designed, it is set aside, because a good fold increases the
chances of success.
4- Experimental characterization: it is of course necessary to validate the designed proteins
experimentally.

The first 3 steps are all included in the Google Colab linked on the Contribution page.

RF diffusion

This machine learning algorithm uses deep learning methods to generate protein backbones. It is based
on a probabilistic diffusion model: it is trained on real structures to which a known amount of random
noise, called Gaussian noise, has been added, and the model has to remove that noise step by step.

Once trained, it can produce a protein starting from pure noise.

Because each run starts from a different random noise sample, the number of distinct outputs it can give
is practically endless.
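The noising half of that training procedure can be sketched in a few lines. This is only a toy illustration under stated assumptions: the function name `add_gaussian_noise`, the `alpha_bar` parameter and the four-point backbone are invented for this example, and the sketch perturbs plain (x, y, z) coordinates, which is simpler than what RF Diffusion actually does internally.

```python
import math
import random


def add_gaussian_noise(coords, alpha_bar, rng):
    """Return a noised copy of a list of (x, y, z) points.

    alpha_bar is the share of the original signal that is kept:
    1.0 leaves the points untouched, 0.0 replaces them with pure noise.
    """
    keep = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [
        tuple(keep * c + noise * rng.gauss(0.0, 1.0) for c in point)
        for point in coords
    ]


# Hypothetical four-residue trace, invented for illustration only.
backbone = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

unchanged = add_gaussian_noise(backbone, alpha_bar=1.0, rng=rng)
slightly_noised = add_gaussian_noise(backbone, alpha_bar=0.99, rng=rng)
pure_noise = add_gaussian_noise(backbone, alpha_bar=0.0, rng=rng)

print(slightly_noised)
print(pure_noise)
```

During training the model sees samples like `slightly_noised` at many noise levels and learns to recover the clean structure; at generation time it starts from something like `pure_noise` and denoises step by step.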


RF Diffusion builds on RoseTTAFold, a network that predicts the structure of a protein from its sequence
of amino acids.
If you already have a structure, it can generate another structure that binds to the first one, such as a
protein binder.
It can generate the fold around an active site.
It can also create new proteins with desired functions (from known bioactive compounds).
This algorithm has high accuracy and a higher success rate at the experimental confirmation stage than
previous modeling algorithms: around 10%.

Protein MPNN

ProteinMPNN is a deep learning-based protein sequence design method that can design new proteins with
high accuracy. It was developed in David Baker's laboratory and published in 2022. ProteinMPNN is trained
on a protein databank comprising thousands of high-resolution structures.
This model takes a protein structure as input and, based on its backbone, predicts new sequences that will
fold into that backbone. Optionally, we can run AlphaFold2 on the predicted sequences to check whether
they adopt the same backbone.

AlphaFold2

AlphaFold2 is an artificial intelligence algorithm developed to predict the three-dimensional structure of
proteins. It was developed by DeepMind and announced in 2020. AlphaFold2 can predict the
three-dimensional structure of a protein from its amino acid sequence with high accuracy.

2. Why did we use it?

For the design of our BacPROTACs, we needed a protein specific to OXA48. We decided to use a
nanobody, but none was available in the published literature. So we found one specific to CMY-2, a
homologue of OXA48.
In order to make that nanobody interact with our target protein, we needed to modify it. Indeed, a
nanobody is composed of four conserved framework regions and three variable loops that make it specific
to one protein.
So we decided to use a docking model (at the suggestion of Riccardo Pellarin, who also helped us use
this program) to modify the nanobody and try to find good candidates.

3. How did we use the RF Diffusion Google Colab?


We didn’t use this algorithm at its full capacity, and we do not claim to know all the tips and ways to
use it. We will just show what we did to find our nanobody.

Warning: without a paid version, the notebook cannot exceed a certain running time or a certain number
of runs. If a lot of runs are needed, using different email addresses can help. If the time limit is
exceeded, the user has to do all the steps again.

Before using this algorithm, we submitted the nanobody (CMY-2 specific)/OXA48 dimer to AlphaFold2, in
order to obtain a predicted scaffold structure and see how the two proteins could interact.

Step by step procedure:


1.

1. First, the setup installs all the databases and commands needed for the algorithm to work. This
cell has to finish running before going to the next step.
2. The name of the run. The results will be downloaded under that name at the end of the program.
3. This line allows you to choose which parts will be optimized:
A:B1-25/B32-51/B58-100/B116-129
● A → The first monomer (chain of amino acids) won’t change
● : → Separator between the two monomers (contigs)
● B1-25 → The amino acids from position 1 to 25 (of the second monomer, B) are fixed
● / → Separator between two segments within a contig
● B32-51 → The amino acids from position 32 to 51 are fixed. So the amino acids from 26
to 31 inclusive will be optimized.

A3-30/36/A33-68
● The algorithm will diffuse a loop of 36 a.a. between the two fixed parts
4. The amino acid chain is needed here. The “:” symbol marks the separation between two
monomers.
5. Number of times the algorithm will pass over the backbone again and re-optimize it to obtain the
best backbone possible. Set to 50 by default.
6. How many different initial designs the algorithm will propose. Set to 1 by default. We chose 8.
Since we didn’t need any symmetry, we didn’t use the second part and left it as it was.
After the information is entered, the cell needs to be run.
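To double-check a contig line before running the cell, the notation explained in item 3 can be read by a small script. This is a minimal sketch that only follows the rules as described in this manual; the functions `parse_contig` and `gaps_between_fixed` are hypothetical helpers written for this example, not part of the notebook, and the real notebook may accept forms that this sketch rejects.

```python
import re

# A fixed segment looks like "B32-51": chain letter, first residue, last residue.
SEGMENT = re.compile(r"^([A-Za-z])(\d+)-(\d+)$")


def parse_contig(contig):
    """Split a contig string into chains, each a list of labelled segments."""
    chains = []
    for chain_text in contig.split(":"):        # ":" separates monomers
        segments = []
        for part in chain_text.split("/"):      # "/" separates segments
            match = SEGMENT.match(part)
            if match:
                chain = match.group(1)
                start, end = int(match.group(2)), int(match.group(3))
                segments.append(("fixed", chain, start, end))
            elif part.isdigit():                # bare number: residues to diffuse
                segments.append(("diffuse", int(part)))
            elif part.isalpha() and len(part) == 1:
                segments.append(("whole_chain", part))
            else:
                raise ValueError(f"unrecognised segment: {part!r}")
        chains.append(segments)
    return chains


def gaps_between_fixed(segments):
    """Residue ranges left out between consecutive fixed segments of one chain."""
    fixed = [s for s in segments if s[0] == "fixed"]
    gaps = []
    for (_, _, _, prev_end), (_, _, next_start, _) in zip(fixed, fixed[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


nanobody = parse_contig("A:B1-25/B32-51/B58-100/B116-129")
print(gaps_between_fixed(nanobody[1]))  # [(26, 31), (52, 57), (101, 115)]
print(parse_contig("A3-30/36/A33-68"))
```

For our nanobody contig, the three ranges printed are the stretches of chain B that are not listed as fixed, that is, the regions left to be optimized.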

2.
1. This cell will display the 3D structures of all the backbones it found.

2. This line allows you to choose the number of sequences that ProteinMPNN will propose for each
backbone (for example, we used 8 sequences per backbone).
3. Number of times the algorithm will pass over the sequence again and re-optimize it to obtain the
best sequence possible. Set to 1 by default.
This cell will find the best sequences possible to fit the backbone and test them with AlphaFold2 to see
if they fold according to the backbone. It will show sequence coverage plots (the number of unique reads
that include a given nucleotide in the reconstructed sequence)

and the structures with their fold prediction confidence (pLDDT).


3.

1. This cell shows the best result of the run.


2. This cell downloads the results to the user’s computer.

We ran this program multiple times to have a lot of proposals at the end. Once we had all of them,
we needed to choose.
The program gives the results with a lot of scores that it calculated along the way. Here are the scores and
their meaning:

- MPNN (message passing neural network):


This score is given by ProteinMPNN to each sequence it designs. It is the average negative
log-probability of the designed amino acids given the backbone, so a lower score means that the
sequence fits the backbone better.

- pLDDT (predicted local distance difference test):


It goes from 0 to 100 and is a per-residue confidence score: the confidence of the model in its
prediction for each amino acid residue, evaluated on the α carbon atoms.
pLDDT > 90 signifies very high confidence, 70–90 good enough confidence, while < 70 indicates low
model confidence.

- pTM (predicted template modeling score):


It estimates the structural similarity between the predicted structure and the true structure. It goes
from 0 to 1: pTM < 0.2 means that the protein is disordered or that the similarity is negligible,
and pTM > 0.5 means that the similarity is strong enough to draw a conclusion.

- PAE (inter-chain predicted alignment error):


It rates how well residues are positioned relative to one another in space. It is the expected error
in the position of one residue, within a domain or between domains or chains, when the predicted
and true structures are aligned on another residue. It is expressed in Ångströms, on a scale of
about 0 to 35; the lower the value, the more confident the prediction.

- RMSD (root mean square deviation):


It is a measure of the average distance between the atoms (usually the backbone atoms) of
superimposed proteins. It rates whether the backbone initially designed and the predicted fold of
the final nanobody superimpose well or not. The problem with this score is that between really low
values (really good superposition) and really high ones, it is difficult to draw a conclusion. A value
below 2 Å is fairly good.
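Two of the scores above are simple enough to write out. The sketch below computes an RMSD and applies the pLDDT confidence bands quoted in the list. It assumes the two structures are already superimposed (it does not perform the superposition itself), and the helper names and the two-atom coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two already-superimposed lists of (x, y, z) atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def plddt_band(plddt):
    """Confidence band for a pLDDT value on the 0 to 100 scale."""
    if plddt > 90:
        return "very high"
    if plddt >= 70:
        return "good"
    return "low"


# Invented coordinates: each atom is displaced by exactly 1 Å.
designed = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (3.8, 0.0, -1.0)]

print(rmsd(designed, predicted))  # 1.0
print(plddt_band(93))             # very high
```

Because RMSD is a root mean square, a few badly placed atoms weigh more than many slightly misplaced ones, which is one reason intermediate values are hard to interpret.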

To rank our nanobodies, we chose to keep only sequences with a pLDDT > 0.89 (that is, 89 on the 0 to
100 scale described above) and the best RMSD. Indeed, as all the other scores were really good,
RF Diffusion ranks the nanobodies based on the RMSD (not all of them are below 2 Å, so the lower the
better). The PAE, pTM and MPNN scores were good for all of them, so they did not help us reach any
conclusion.
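This selection rule is easy to apply by script once the scores have been collected from several runs. The sketch below is only an illustration: the `designs` list holds made-up placeholder values, not our results, and `rank_designs` is a hypothetical helper; pLDDT is written on the 0 to 1 scale used for the threshold.

```python
# Placeholder scores invented for illustration; these are NOT real results.
designs = [
    {"name": "design_0", "plddt": 0.93, "rmsd": 1.4},
    {"name": "design_1", "plddt": 0.85, "rmsd": 0.9},
    {"name": "design_2", "plddt": 0.91, "rmsd": 2.6},
    {"name": "design_3", "plddt": 0.90, "rmsd": 1.1},
]


def rank_designs(designs, min_plddt=0.89):
    """Keep designs above the pLDDT threshold, then sort by RMSD (lowest first)."""
    confident = [d for d in designs if d["plddt"] > min_plddt]
    return sorted(confident, key=lambda d: d["rmsd"])


ranked = rank_designs(designs)
names = [d["name"] for d in ranked]
print(names)  # ['design_3', 'design_0', 'design_2']
```

Note that `design_1` has the lowest RMSD but is dropped first because its pLDDT is under the threshold: the filter is applied before the sort.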
