Abstract—In this paper, we describe how neural networks can be used for high throughput screening of potential drug candidates. Individual small molecules (ligands) are assessed for their potential to bind to specific proteins (receptors). Committees of multilayer networks are used to classify protein-ligand complexes as good binders or bad binders, based on selected chemical descriptors. The novel aspects of this paper include the use of statistical analyses on the weights of single layer networks to select the appropriate descriptors, the use of Monte Carlo cross-validation to provide confidence measures of network performance (and also to identify problems in the data), and the use of Self Organizing Maps to analyze the performance of the trained network and identify anomalies. We demonstrate the procedures on a large practical data set, and use them to discover a promising characteristic of the data.

Daniel M. Hagan is with the Department of Biochemistry and Martin T. Hagan is with the School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078, USA (email: {dhagan, mhagan}@okstate.edu).

978-1-5090-0620-5/16/$31.00 © 2016 IEEE

I. INTRODUCTION

Biochemistry sees drugs as ligands: small molecules or ions that bind to specific regions of protein macromolecules. Cells use ligands to control the behavior of proteins, much like an on-off switch. When ligands bind to their target region (see Figure 1), the protein and ligand change shape to make the tightest fit possible and reach the lowest energy state. If the ligand is an agonist, it induces the active conformation of the protein. If it is an antagonist, it keeps the protein in an inactive state. Protein-ligand interactions are driven by intermolecular forces (ionic bonds, hydrogen bonds, van der Waals forces, etc.) as well as entropy and solvent related effects, which create a non-covalent protein-ligand complex. The shape and energy of these complexes are unique for every ligand, and this is what determines their biological functionality. In the lab, biochemists must use an assay to determine the binding affinity of novel ligands. These assays are quite accurate, but also expensive and time consuming.

Fig. 1. Protein-ligand complex.

Computational chemists can generate enormous libraries of drug-like molecules, and need a way to very rapidly screen out those that are unlikely to bind, so that any likely binders can be confirmed experimentally. This technique is referred to as high throughput virtual screening (HTVS). Computers can use 3D structures to simulate the physics of protein-ligand interactions and attempt to predict their binding affinities. These programs start with a rigid protein structure and dock ligands into their target region through energy minimization, following steepest descent and molecular dynamics calculations. Programs such as Autodock 4 [1] can approach experimental accuracy, but are too computationally intensive for high throughput applications; consider the vast chemical space occupied by drug-like molecules (estimated at 10^33 unique molecules [2]). HTVS applications have seen considerable success in achieving the correct conformation of ligands by docking them into rigid protein structures, but very little success in determining their affinity. Autodock Vina is orders of magnitude faster than its predecessor, and yet more accurate in determining the binding conformation of candidate ligands [3]. Vina's estimate of affinity, however, is hardly better than a guess.

Neural networks have been used previously in HTVS as a "post-docking" processing step in two ways [4]. Neural networks have been trained to classify docked complexes into good binders and bad binders (having affinity greater or less than 25 μM). They have also been trained to find a continuous function to predict Kd (the binding affinity constant). With the rising number and accuracy of crystallographic and NMR complexes in online databases, there is more and more potential for so-called knowledge-based functions to accomplish the affinity prediction. These functions are statistical analyses of complexes in the databases: pairs of atoms that are often found close to each other are considered to be energetically favorable. NNScore [4] is knowledge based, using atom type pairs and their prevalences as seen in the databases. But it is also empirical, because it characterizes complexes by Coulomb energy, van der Waals forces, number of ligand torsions, and other significant physical descriptors.

In this paper, we extend the work of [4] in several ways. First, we use Monte Carlo cross-validation to provide confidence measures of network performance and to identify problems in the data. In addition, we use Self Organizing Maps to analyze the performance of trained networks and identify anomalies. We were able to use these methods to discover a very interesting characteristic of a commonly used data set that has the potential to significantly increase the accuracy of binding predictions.
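The knowledge-based idea described above amounts to counting how often each receptor-ligand atom-type pair occurs within a distance cutoff. The following is a minimal illustrative sketch, not NNScore itself; the toy atom typing, coordinates, and the 4 Å cutoff are simplifying assumptions:

```python
import math
from collections import Counter

def close_contacts(receptor_atoms, ligand_atoms, cutoff=4.0):
    """Count receptor-ligand atom-type pairs within `cutoff` angstroms.

    Each atom is (atom_type, (x, y, z)). Returns a Counter keyed by
    the sorted atom-type pair, e.g. ('C', 'N') -> 2.
    """
    counts = Counter()
    for t1, p1 in receptor_atoms:
        for t2, p2 in ligand_atoms:
            if math.dist(p1, p2) <= cutoff:
                counts[tuple(sorted((t1, t2)))] += 1
    return counts

# Toy example: one receptor nitrogen near two ligand carbons;
# the distant oxygen falls outside the cutoff and is not counted.
receptor = [("N", (0.0, 0.0, 0.0))]
ligand = [("C", (1.5, 0.0, 0.0)), ("C", (3.0, 0.0, 0.0)), ("O", (9.0, 0.0, 0.0))]
print(close_contacts(receptor, ligand))  # Counter({('C', 'N'): 2})
```

In a knowledge-based scoring function, such counts accumulated over a database of complexes stand in for pairwise energetic favorability.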
II. MATERIALS

The data we used to develop our neural network classifier was partially comprised of the 3446 protein-ligand complexes of the "refined" set of the PDBbind database [5]. This set was curated from the Protein Data Bank to include only high quality complexes with experimentally determined structures and binding affinity data. The set was selected for diversity, representing affinities ranging from 10 mM to 1 pM, and excessively redundant proteins and ligands were removed. Only about 700 complexes had affinities greater than 25 μM (weak binders), compared to 2700 less than 25 μM (strong binding complexes). This imbalance arises from a general lack of structural data for weak binding complexes. If we train on this data alone, we can expect the networks to reflect the same bias and give more false positives. To provide a more balanced training set, we included the 1431 "weak binding" complexes generated by Durrant et al. [4]. These were created by docking the NCI Diversity Set II – a set of 1364 random drug-like ligands – into the 3446 receptors that were already part of the training set, using Autodock Vina. Only those complexes with final energies between 0 and -4 kcal/mol (according to Vina) were used to create this weak binding set. The final training set consisted of 2741 good binders and 2130 bad binders.

Structural files for each complex in the PDB refined set were downloaded. For each complex, there is one ligand file in MOL2 format and one receptor file in PDB format. These files need to be in PDBQT format, and some required information must be added. In Figure 2, we have outlined the procedure for preparing the correct PDBQT format from the formats available on the PDBbind database. Starting with the ligands, we used Raccoon [6] in MGLTools v.1.5.4 to add the required information.

Fig. 2. Flowchart showing file processing steps.

The network input descriptors fall into three categories: proximity counts, electrostatic forces, and ligand characteristics. The proximity list counts pairs of atoms between the receptor and ligand within 2 Å of each other; the number is counted separately for each atom type pair. A similar proximity list counts pairs within 4 Å. Electrostatic force is a function of proximity, but also of partial atomic charge. This force is calculated for atoms within 4 Å of each other and then summed for each unique atom type pair. Finally, the ligand is characterized by counting the number of atoms it contains, along with the number of torsions that remain after binding.
[Figure: two layer network with R inputs, S1 tansig hidden neurons, and S2 softmax output neurons; a^1 = tansig(W^1 p + b^1), a^2 = softmax(W^2 a^1 + b^2).]

Fig. 5. Two Layer Network.

For classification purposes, we will use committees of two layer networks. Each network in the committee will provide a vote on the classification (good or bad binder), and the class with the most votes will be the final choice. Both single layer and two layer networks will be trained using the scaled conjugate gradient algorithm [7], as implemented in the Neural Network Toolbox for MATLAB [8].

1) Training, validation and test sets: In order to develop confidence ranges for the performance of the networks, we used Monte Carlo cross-validation (also known as repeated random sub-sampling validation).

If a complex is frequently misclassified, this could indicate that there is a problem with the data, or it could indicate that the complex is in a region of the input space that is poorly sampled, with few neighboring complexes. It could also indicate that the descriptors used in the network input are insufficient to characterize the binding quality of the complex. Later in this paper, we will investigate consistently misclassified data.

3) Removing irrelevant inputs: In addition to the use of Monte Carlo cross-validation to assess the generalization capabilities of the neural network classifiers, we also used it in a novel way to remove irrelevant descriptors from the network inputs. We trained single layer networks, using Monte Carlo cross-validation, as described above. We then considered the distribution of weights across the Monte Carlo trials. Since in a single layer network each weight corresponds to a single input, the distribution of a specific weight can indicate the relevance of the corresponding descriptor.
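The procedure of repeated random splits, with per-weight distributions collected across trials, can be sketched as follows. This is an illustrative stand-in on synthetic data with a simple logistic single-layer model, not the paper's MATLAB scaled-conjugate-gradient networks; the split sizes, trial count, and learning rate are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 "complexes", 5 descriptors; only the
# first two descriptors actually carry class information.
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(float)

def train_single_layer(X, y, lr=0.5, epochs=200):
    """Logistic single-layer network trained by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Monte Carlo cross-validation: repeated random train/test splits.
weights, acc = [], []
for trial in range(30):
    idx = rng.permutation(len(y))
    train, test = idx[:150], idx[150:]
    w, b = train_single_layer(X[train], y[train])
    pred = (X[test] @ w + b) > 0
    acc.append((pred == y[test].astype(bool)).mean())
    weights.append(w)

weights = np.array(weights)
print("accuracy: %.2f +/- %.2f" % (np.mean(acc), np.std(acc)))
# A weight whose distribution across trials straddles zero suggests an
# irrelevant input; consistently large weights suggest a relevant one.
print("mean weights:", weights.mean(axis=0).round(2))
```

The spread of the 30 accuracy values gives the confidence range, and the per-column weight distributions drive the input-removal decision described above.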
The neuron numbers of the SOM are shown in Figure 7. After training, we expect that the cluster represented by Neuron 1 will be near the clusters represented by Neurons 2 and 6. This will better enable us to judge the distribution of inputs in the training set.

TABLE III. ADJUSTING NEURONS AND NETWORKS FOR TWO LAYER NETWORK COMMITTEES
[Table body not recovered.]

[Figure: per-neuron SOM cluster response plots. Panel titles: Neuron 2: TN=53, TP=168, FN=11, FP=29. Neuron 3: TN=36, TP=97, FN=0, FP=4. Neuron 4: TN=375, TP=220, FN=63, FP=33. Neuron 5: TN=1, TP=155, FN=0, FP=1.]
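A Self Organizing Map [9] of the kind used for this clustering can be sketched in a few lines. This is a generic Kohonen SOM on toy 2-D data, not the authors' configuration; the grid size, epoch count, and decay schedules are arbitrary assumptions:

```python
import numpy as np

def train_som(data, grid=(3, 2), epochs=400, lr0=0.5, sigma0=1.5, seed=0):
    """Minimal Kohonen SOM: a grid of neurons whose weight vectors are
    pulled toward sampled inputs, with grid neighbors moving together."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    w = rng.normal(size=(rows * cols, data.shape[1]))
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1 - frac)                 # learning rate decays to zero
        sigma = sigma0 * (1 - frac) + 1e-3    # neighborhood radius shrinks
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((w - x) ** 2).sum(axis=1))  # best-matching unit
        # Gaussian neighborhood function on the neuron grid
        h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1) / (2 * sigma**2))
        w += lr * h[:, None] * (x - w)
    return w

# Two well-separated blobs should map to different neurons.
rng = np.random.default_rng(1)
blob_a = rng.normal(0, 0.1, size=(50, 2))
blob_b = rng.normal(3, 0.1, size=(50, 2))
w = train_som(np.vstack([blob_a, blob_b]))
bmu_a = np.argmin(((w - blob_a.mean(0)) ** 2).sum(axis=1))
bmu_b = np.argmin(((w - blob_b.mean(0)) ** 2).sum(axis=1))
print("distinct BMUs:", bmu_a != bmu_b)
```

Because neighboring neurons are updated together, nearby clusters in the input space end up on nearby neurons of the grid, which is what lets the per-neuron misclassification counts above be read as a map of the training set.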
There were 854 complexes that were misclassified more than 50% of the time. Of these, 246 were good binders and 608 were bad. There were 4017 complexes that were correctly classified more than 50% of the time. Of these, 2495 were good, and 1522 were bad. This means that bad binders were more likely to be misclassified than good binders, but significant numbers of both bad and good binders were contained in both the misclassified data set and the correctly classified data set.

Fig. 11. Weights when trained only on well-classified data.

Now, consider Figure 12. These are the weights for the single layer network that was trained only on the complexes that were originally misclassified.

Fig. 12. Weights when trained only on misclassified data.

Figure 13 shows a scatter plot of the weights from the full set versus the weights from the misclassified set. The best fit linear equation is y = -0.44x + 0.34, with an R value of 0.8. There is a generally negative relationship, but there is a significant amount of variation. There are some inputs that have the largest weights in the complete data set (also the well classified), but small weights in the originally misclassified. In the future, we will investigate those descriptors that are important for the originally well classified data, but apparently not consequential for the originally misclassified data.

Fig. 13. Weights from full set versus weights from misclassified set.

V. CONCLUSIONS

In this paper, we have described how neural networks can be used in the virtual screening of potential drug candidates. There are several novel aspects of our research. We have used a statistical analysis of the weights of single layer networks to select relevant inputs, we have used Monte Carlo cross-validation to provide confidence measures of network performance and to identify problems in network training, and we have used Self Organizing Maps to investigate the training results and identify anomalies. We applied these techniques to a large practical data set of protein-ligand complexes. We were able to classify the complexes into good binding and bad binding categories with average accuracies of better than 82%. More importantly, by using an analysis of the Monte Carlo cross-validation and Self Organizing Map results, we were able to discover a characteristic of the data set that has promise for producing even more accurate results. By dividing the data into two separate categories, and training different network committees on each category, we were able to achieve average accuracies of more than 90%. This division of the data into the two categories cannot be achieved using the current descriptors. The next step of this research will be to look for biochemical differences between the complexes in the two categories.

VI. ACKNOWLEDGEMENTS

The authors would like to thank Dr. Jacob Durrant of the University of California, San Diego, for supplying us with data on well-docked bad binders and for helpful suggestions on this work.

REFERENCES

[1] G. M. Morris, R. Huey, W. Lindstrom, M. F. Sanner, R. K. Belew, D. S. Goodsell, and A. J. Olson, "AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility," J. Computational Chemistry, vol. 30, no. 16, pp. 2785–2791, 2009.
[2] P. G. Polishchuk, T. I. Madzhidov, and A. Varnek, "Estimation of the size of drug-like chemical space based on GDB-17 data," J. Computer-Aided Molecular Design, vol. 27, no. 8, pp. 675–679, Aug. 2013.
[3] O. Trott and A. J. Olson, "AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading," J. Computational Chemistry, vol. 31, no. 2, pp. 455–461, 2010.
[4] J. D. Durrant and J. A. McCammon, "NNScore: a neural-network-based scoring function for the characterization of protein/ligand complexes," J. Chem. Inf. Model., vol. 50, no. 10, pp. 1865–1871, Sept. 2010.
[5] R. Wang, X. Fang, Y. Lu, and S. Wang, "The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures," J. Med. Chem., vol. 47, no. 12, pp. 2977–2980, 2004.
[6] S. Forli, "Raccoon - AutoDock VS: an automated tool for preparing AutoDock virtual screenings," http://autodock.scripps.edu/resources/raccoon, accessed: 2016-01-10.
[7] M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, no. 4, pp. 525–533, 1993.
[8] H. B. Demuth, M. H. Beale, and M. T. Hagan, The Neural Network Toolbox for MATLAB. Boston, MA: The MathWorks, 2014.
[9] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.