
Virtual Drug Screening Using Neural Networks

Daniel M. Hagan and Martin T. Hagan

Abstract—In this paper, we describe how neural networks can be used for high throughput screening of potential drug candidates. Individual small molecules (ligands) are assessed for their potential to bind to specific proteins (receptors). Committees of multilayer networks are used to classify protein-ligand complexes as good binders or bad binders, based on selected chemical descriptors. The novel aspects of this paper include the use of statistical analyses on the weights of single layer networks to select the appropriate descriptors, the use of Monte Carlo cross-validation to provide confidence measures of network performance (and also to identify problems in the data), and the use of Self Organizing Maps to analyze the performance of the trained network and identify anomalies. We demonstrate the procedures on a large practical data set, and use them to discover a promising characteristic of the data.

I. INTRODUCTION

Biochemistry sees drugs as ligands, which are small molecules or ions that bind to specific regions of protein macromolecules. Ligands are used by cells to control the behavior of proteins, essentially like an on-off switch. When ligands bind to their target region (see Figure 1), the protein and ligand change shape to make the tightest fit possible and reach the lowest energy state. If the ligand is an agonist, it induces the active conformation of the protein. If it is an antagonist, it keeps the protein in an inactive state. Protein-ligand interactions are driven by intermolecular forces (ionic bonds, hydrogen bonds, van der Waals forces, etc.) as well as entropy and solvent-related effects, which create a non-covalent protein-ligand complex. The shape and energy of these complexes are unique for every ligand, and this is what determines their biological functionality. In the lab, biochemists must use an assay to determine the binding affinity of novel ligands. These assays are quite accurate, but also expensive and time consuming.

Computational chemists can generate enormous libraries of drug-like molecules, and need a way to very rapidly screen out those that are not likely to bind, so that any likely binders can be confirmed experimentally. This technique is referred to as high throughput virtual screening (HTVS). Computers can use 3D structures to simulate the physics of protein-ligand interactions and attempt to predict their binding affinities. These programs start with a rigid protein structure and dock ligands into their target region through energy minimization following steepest descent and molecular dynamics calculations. Programs such as Autodock 4 [1] can approach experimental accuracy, but are too computationally intensive for high throughput applications; consider the vast chemical space occupied by drug-like molecules (estimated at 10^33 unique molecules [2]). HTVS applications have seen considerable success in achieving the correct conformation of ligands by docking them into rigid protein structures, but very little success in determining their affinity. Autodock Vina is orders of magnitude faster than its predecessor, and yet more accurate in determining the binding conformation of candidate ligands [3]. Vina's estimate of affinity, however, is hardly better than a guess.

Neural networks have been used previously in HTVS as a "post-docking" processing step in two ways [4]. Neural networks have been trained to classify docked complexes into good binders and bad binders (with affinities below or above 25 μM, respectively). They have also been trained to find a continuous function to predict Kd (the binding affinity constant). With the rising number and accuracy of crystallographic and NMR complexes in online databases, there is more and more potential for so-called knowledge-based functions to accomplish the affinity prediction. These functions are statistical analyses of complexes in the databases. Pairs of atoms that are often found close to each other are considered to be energetically favorable. NNscore [4] is knowledge-based and uses atom type pairs and their prevalences as seen in the databases. But it is also empirical, because it characterizes complexes on Coulomb energy, van der Waals forces, number of ligand torsions, and other significant physical descriptors.

In this paper, we extend the work of [4] in several ways. First, we use Monte Carlo cross-validation to provide confidence measures of network performance and to identify problems in the data. In addition, we use Self Organizing Maps to analyze the performance of trained networks and identify anomalies. We were able to use these methods to discover a very interesting characteristic of a commonly used data set that has the potential to significantly increase the accuracy of binding predictions.

Daniel M. Hagan is with the Department of Biochemistry and Martin T. Hagan is with the School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078, USA (email: {dhagan, mhagan}@okstate.edu).

Fig. 1. Protein-ligand complex.

II. MATERIALS

The data we used to develop our neural network classifier was partially comprised of the 3446 protein-ligand complexes of the "refined" set of the PDBbind database [5]. This set was curated from the Protein Data Bank to include only high quality complexes with experimentally determined structures and binding affinity data. Diversity was considered, to represent affinities ranging from 10 mM to 1 pM, and excessively redundant proteins and ligands were also removed. Only about 700 complexes had affinities greater than 25 μM (weak binders), compared to 2700 with affinities less than 25 μM (strong binding complexes). This imbalance arises from a general lack of structural data for weak binding complexes. If we train on this data alone, we can expect the networks to reflect the same bias and give more false positives. To provide a more balanced training set, we included the 1431 "weak binding" complexes generated by Durrant et al. [4]. These were created by docking the NCI Diversity Set II – a set of 1364 random drug-like ligands – into the 3446 receptors that were already part of the training set, using Autodock Vina. Only those complexes with final energies between 0 and -4 kcal/mol (according to Vina) were used to create this weak binding set. The final training set consisted of 2741 good binders and 2130 bad binders.

Structural files for each complex in the PDBbind refined set were downloaded. For each complex, there is one ligand file in MOL2 format and one receptor file in PDB format. These files need to be converted to PDBQT format, and some required information must be added. In Figure 2, we have outlined the procedure for preparing the correct PDBQT format from the formats available on the PDBbind database. Starting with the ligands, we used Raccoon [6] in MGLTools v1.5.4 to add hydrogen atoms, merge non-polar hydrogens with their parent atom, delete lone pairs, assign Gasteiger charges, and convert to PDBQT format. For the protein receptors, we used Reduce to add hydrogens, and MGLTools v1.5.4 to merge non-polar hydrogens and assign Gasteiger charges. Autodock 4 atom types were assigned to all atoms.

Fig. 2. Flowchart showing file processing steps.

From the molecular descriptions of the molecules in the database, we need to compute a limited number of descriptors that can be input to the neural network. Figure 3 shows the 194 descriptors that we considered. We decided to start with the same 194 descriptors that were used in [4]. We will train a single layer network and use the weights to gauge the contribution each input has toward the final decision. If our first set of inputs trains to high accuracy, we can remove those inputs that correspond to the smallest weights (based on a t-test). This process of removing inputs and retraining can be repeated until only the significant descriptors remain, without sacrificing accuracy. In Figure 3, all of the colored atoms/atom pairs were removed by this process. Each color represents one round of removing inputs. Green atoms/atom pairs were removed in the first training, light blue in the second, blue in the third, magenta in the fourth, and red in the fifth round of training.

The inputs used in the initial training fall into three main categories: proximity counts, electrostatic forces, and ligand characteristics. The proximity list counts pairs of atoms between the receptor and ligand within 2 Å of each other. The number is counted separately for each atom type pair. A similar proximity list counts pairs within 4 Å. Electrostatic force is a function of proximity, but also of partial atomic charge. This force is calculated for atoms within 4 Å of each other and then summed for each unique atom type pair. Finally, the ligand is characterized by counting the number of atoms it contains, along with the number of torsions that remain after binding.

Fig. 3. Descriptors used as neural network inputs.
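To make the descriptor definitions concrete, the following MATLAB sketch computes the three descriptor groups for one complex. It assumes the PDBQT files have already been parsed into hypothetical variables (recXYZ/ligXYZ coordinate arrays, recType/ligType AutoDock atom-type labels, and recQ/ligQ Gasteiger charges); the Coulomb constant and units are our own assumption, since the paper does not fix a convention.

    % Proximity counts, electrostatic sums and ligand characteristics for one
    % complex (recXYZ, ligXYZ are n-by-3; recType, ligType are cell arrays of
    % AutoDock atom types; recQ, ligQ are Gasteiger charges -- all hypothetical).
    D = pdist2(recXYZ, ligXYZ);                 % receptor-ligand distances in Angstroms

    count2 = containers.Map();                  % atom-type-pair counts within 2 A
    count4 = containers.Map();                  % atom-type-pair counts within 4 A
    elec4  = containers.Map();                  % summed Coulomb-type term within 4 A
    kC = 332.06;                                % assumed Coulomb constant (kcal*A/mol/e^2)

    [iR, iL] = find(D <= 4);                    % every receptor-ligand pair within 4 A
    for m = 1:numel(iR)
        key = sprintf('%s_%s', recType{iR(m)}, ligType{iL(m)});
        if ~isKey(count4, key), count4(key) = 0; count2(key) = 0; elec4(key) = 0; end
        count4(key) = count4(key) + 1;
        elec4(key)  = elec4(key) + kC*recQ(iR(m))*ligQ(iL(m))/D(iR(m), iL(m));
        if D(iR(m), iL(m)) <= 2
            count2(key) = count2(key) + 1;
        end
    end

    ligandAtomCount = numel(ligType);           % ligand characterization
    % the remaining torsion count would be read from the ligand PDBQT file

Stacking the per-pair counts and sums over a fixed list of atom type pairs, together with the two ligand terms, gives the descriptor vector that is fed to the networks.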



III. METHODS

A. Feedforward neural networks

In order to classify protein-ligand complexes into good binder and bad binder categories, we will use feedforward neural networks. We will use both single layer and two layer networks. As discussed in a later section, the single layer network will be used to determine which inputs (descriptors) are relevant to the classification process. The single layer network also provides a baseline for classification accuracy. After the relevant inputs have been selected, a committee of two layer networks will be used to make the final classification.

Figure 4 shows the single layer network we will use. It uses a log-sigmoid activation function, and is equivalent to logistic regression. The network will be trained to minimize mean square error.

Fig. 4. Single Layer Network: a = logsig(Wp + b), where p is the R x 1 input vector, W is a 1 x R weight matrix, and b is a scalar bias.

Figure 5 shows the two layer network. The hidden layer uses a tan-sigmoid activation function, and the output layer uses a softmax activation function. The network will be trained to minimize cross-entropy. The number of neurons in the hidden layer will be adjusted for robust generalization performance.

Fig. 5. Two Layer Network: a1 = tansig(W1 p + b1), a2 = softmax(W2 a1 + b2), where W1 is S1 x R, W2 is 2 x S1, and the output layer has two neurons.

For classification purposes, we will use committees of two layer networks. Each network in the committee will provide a vote on the classification (good or bad binder), and the class with the most votes will be the final choice.

Both single layer and two layer networks will be trained using the scaled conjugate gradient algorithm [7], as implemented in the Neural Network Toolbox for MATLAB [8].

1) Training, validation and test sets: In order to develop confidence ranges for the performance of the networks, we used Monte Carlo cross-validation (also known as repeated random subsampling validation). For every Monte Carlo trial, we divided the data into two sets. The first set consisted of a randomly sampled 85% of the data, and the second (test) set consisted of the remaining 15% of the data. For the training of each member of the committee of networks, the first set (85% of the data) was randomly divided again into a training set (70% of the full data set) and a validation set (15% of the full data set). Each committee member is trained to optimize the performance of the network on its assigned training set. The validation set is monitored during training, and training stops when the error on the validation set increases. This early stopping procedure helps avert overfitting. Each member of the committee is thus trained on a somewhat different data set (and with a different set of initial weights). To summarize, on each Monte Carlo trial a committee of multiple feedforward networks is trained. The committee of networks then votes on the test set, and the error is calculated. After all of the Monte Carlo trials are complete, the mean and standard deviation of the test set errors are computed. Since the test sets are different for each Monte Carlo trial, and since the test sets are not used in any way for training, the distribution of the Monte Carlo errors can be expected to be representative of typical future errors.

2) Consistently misclassified data: We used Monte Carlo cross-validation in a novel way in this work. As described above, on each Monte Carlo trial, different elements of the full data set (protein-ligand complexes) are used in the training, validation and test sets. The percent errors that we report in a later section of this paper are computed only on the test sets. However, at the completion of each Monte Carlo trial, we also applied the full data set to the committee of networks, in order to classify every protein-ligand complex (whether in the training, validation or testing sets). We then kept track of the number of times each protein-ligand complex was misclassified. We would expect that a given complex would be more likely to be classified correctly when it is included in the training or validation set, and more likely to be misclassified when it is included in the test set. However, if some complexes were misclassified much more often than others, this could indicate problems. If a complex is frequently misclassified, this could indicate that there is a problem with the data, or it could indicate that the complex is in a region of the input space that is poorly sampled, with few neighboring complexes. It could also indicate that the descriptors used in the network input are insufficient to characterize the binding quality of the complex. Later in this paper, we will investigate consistently misclassified data.
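As a concrete illustration, the MATLAB sketch below carries out one Monte Carlo trial along the lines just described, using patternnet from the Neural Network Toolbox; X (descriptors), T (one-hot targets), the committee size Nn, and the running tally missCount are assumed to exist, and all variable names are our own. Recent toolbox releases give patternnet a tan-sigmoid hidden layer, a softmax output, and a cross-entropy performance function by default, which corresponds to the two layer network of Figure 5; older releases may need these set explicitly.

    % One Monte Carlo trial (sketch): 15% test set, committee of Nn networks,
    % each trained on its own 70%/15% train/validation split of the remaining data.
    % missCount (1-by-N, zeroed before the first trial) accumulates how often each
    % complex is misclassified over the trials.
    N     = size(X, 2);
    idx   = randperm(N);
    nTest = round(0.15*N);
    test  = idx(1:nTest);                       % independent test set
    pool  = idx(nTest+1:end);                   % 85% pool for the committee

    votes = zeros(2, N);
    for k = 1:Nn
        p      = pool(randperm(numel(pool)));
        nVal   = round(0.15*N);
        valInd = p(1:nVal);                     % early-stopping validation set
        trnInd = p(nVal+1:end);                 % roughly 70% of the full data set

        net = patternnet(5, 'trainscg');        % 5 hidden neurons, scaled conjugate gradient [7]
        net.divideFcn            = 'divideind';
        net.divideParam.trainInd = trnInd;
        net.divideParam.valInd   = valInd;
        net.divideParam.testInd  = [];
        net = train(net, X, T);

        [~, c] = max(net(X), [], 1);            % this member's vote on every complex
        votes  = votes + [c == 1; c == 2];
    end

    [~, committeeClass] = max(votes, [], 1);    % majority vote of the committee
    [~, trueClass]      = max(T, [], 1);
    testError = mean(committeeClass(test) ~= trueClass(test));   % error reported per trial
    missCount = missCount + (committeeClass ~= trueClass);       % per-complex tally (full set)

Repeating this over 100 trials gives the mean and standard deviation of testError reported in the results, and missCount gives the per-complex misclassification counts used in the later analysis.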



3) Removing irrelevant inputs: In addition to using Monte Carlo cross-validation to assess the generalization capabilities of the neural network classifiers, we also used it in a novel way to remove irrelevant descriptors from the network inputs. We trained single layer networks, using Monte Carlo cross-validation, as described above. We then considered the distribution of weights across the Monte Carlo trials. Since in a single layer network each weight corresponds to a single input, the distribution of a specific weight can indicate the importance of the corresponding input. If an input (descriptor) is not informative about the binding strength of protein-ligand complexes, then we would expect the weights for that input to vary randomly over the Monte Carlo trials, with a mean value near zero. Using the average value and sample standard deviation of each weight, we can perform a one-sample t-test of the hypothesis that the mean is different from zero. After computing the t-statistic, a p-value can be found. If the p-value is below a statistical significance level (we used 0.01), then we reject the hypothesis that the mean is zero. If it is above the significance level, we cannot reject the hypothesis that the mean is zero, and we then remove that input.

B. Self Organizing Map

In addition to feedforward networks, we also use the Self Organizing Map (SOM) [9], which is an unsupervised clustering network. The SOM is a topology preserving network, in that neurons within the network have neighbor relationships that are preserved by the training process. We will use the SOM to analyze the training of the feedforward networks. In particular, we will be able to investigate the distribution of misclassified complexes, and improve the training of the feedforward networks.

IV. RESULTS AND DISCUSSION

A. Single Layer Results

Table I shows the results of training the single layer network as descriptors are removed from the input. The average percent error over 100 Monte Carlo trials, on the independent test sets, is provided, along with the sample standard deviation. In addition, the table shows the average sensitivity and specificity for each case, along with their corresponding sample standard deviations. The original input contains 194 descriptors. The descriptors with p-values greater than 0.01 were then removed, to produce 170 descriptors. After the simplified network was trained, new p-values were computed, and additional descriptors were removed. This continued until 165 descriptors remained, at which time subsequent p-values remained less than 0.01. To further investigate the importance of the remaining descriptors, we removed additional inputs, approximately 20 at a time, using p-value ranking. It was possible to remove a total of 50 of the original inputs (leaving 144 descriptors) without any significant decrease in performance.

TABLE I
SINGLE LAYER NETWORK PERFORMANCE

Inputs  Perc Error  STD   Spec  STD   Sens  STD
 194      20.4      0.25  86.9  0.26  71.4  0.28
 170      19.9      0.19  87.4  0.17  71.7  0.29
 166      20.0      0.21  87.4  0.20  71.5  0.30
 165      20.1      0.17  87.4  0.17  71.8  0.28
 144      19.7      0.19  87.5  0.17  72.0  0.22
 124      22.2      0.21  84.4  0.18  70.1  0.24
 105      21.9      0.16  84.7  0.10  70.6  0.21
  84      22.1      0.17  84.3  0.11  70.6  0.17
  64      23.4      0.17  82.6  0.13  69.8  0.18
  44      23.8      0.15  82.0  0.10  69.4  0.15
  24      26.0      0.17  81.6  0.12  64.7  0.18
  14      26.8      0.17  80.0  0.12  65.2  0.09
   2      39.9      0.18  88.8  0.30  23.0  0.30

B. Multilayer Results

Table II shows the average performance of the 20 member committee of two layer networks with 5 neurons in the hidden layer, using the numbers of inputs that were found using the single layer network. As we can see, the pattern is very similar to the single layer case.

TABLE II
TWO LAYER NETWORK PERFORMANCE

Inputs  Perc Error  STD   Spec  STD   Sens  STD
 194      18.5      0.13  89.9  0.06  72.3  0.07
 170      18.4      0.14  89.9  0.06  72.2  0.06
 166      18.5      0.15  89.9  0.07  72.1  0.06
 165      18.4      0.11  89.8  0.07  72.2  0.08
 144      18.6      0.13  89.8  0.06  72.1  0.06
 124      21.4      0.14  86.6  0.07  70.0  0.07
 105      20.9      0.15  86.8  0.06  69.9  0.07
  84      21.5      0.13  86.3  0.06  69.3  0.06
  64      22.2      0.14  84.9  0.06  69.0  0.08
  44      23.0      0.15  83.9  0.06  68.9  0.07
  24      25.0      0.15  85.0  0.06  62.6  0.07
  14      25.8      0.15  83.0  0.05  63.5  0.06
   2      39.4      0.16  92.5  0.06  19.6  0.09

It should also be noted that in each case there is significant variation among the 100 Monte Carlo trials. For example, the lowest average percent error of 0.184 occurs with 165 descriptors in the input vector, but the sample standard deviation over the 100 trials is 0.011, which means we could expect the error on any one trial to range from 0.162 to 0.206. This suggests that, with data of the type we have used, it would not be appropriate to report results on a single test set, since performance can vary significantly with small variations in data selection.

Figure 6 shows a comparison between the average percent errors of the one layer and two layer networks, as the number of inputs is changed. (Inputs were removed in the order of their p-values.) There are several things that we can take from this figure. First, we can see that the two layer network does have consistently smaller errors for every set of inputs, although the amount of reduction is not extremely large. Second, the patterns of the two curves are very similar. Even though the p-values that were used to remove the inputs were determined using the single layer network, the results on the two layer network follow the same pattern. This gives us some confidence in the technique we have proposed for removing irrelevant inputs.

At a significance level of 1%, 29 inputs were removed (resulting in 165 remaining inputs). Another 20 inputs were removed (those with the next 20 highest p-values) without any significant increase in the percent error. This pattern is demonstrated in both the single layer and two layer results, which again provides corroboration for the use of single layer p-values to assist in input removal.
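The p-value screening described in Section III and used to produce Tables I and II amounts to a few lines of MATLAB. In this sketch, Wall is a hypothetical 100-by-R matrix holding the single layer weight vector from each Monte Carlo run, and ttest is the one-sample t-test from the Statistics and Machine Learning Toolbox.

    % Remove inputs whose mean weight over the Monte Carlo runs is not
    % significantly different from zero (Wall is runs-by-inputs, hypothetical).
    alpha = 0.01;
    [~, pvals] = ttest(Wall);           % column-wise one-sample t-test of H0: mean = 0
    keep = pvals < alpha;               % descriptors to retain
    fprintf('Removing %d of %d descriptors\n', sum(~keep), numel(pvals));
    Xreduced = X(keep, :);              % reduced input set for retraining

The networks are then retrained on Xreduced, new p-values are computed, and the cycle is repeated until every remaining descriptor has a p-value below the significance level.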



Fig. 6. Percent Error Comparison (one layer versus two layer networks, as a function of the number of inputs).

Table III shows network performance as the number of hidden layer neurons (S1) and the number of networks in the committee (Nn) are varied (with 165 inputs to the network). There is a consistent, although not large, difference between the multilayer performance and the single layer performance, but having more than three neurons and more than 10 networks provides little additional improvement.

TABLE III
ADJUSTING NEURONS AND NETWORKS FOR TWO LAYER NETWORK COMMITTEES

S1   Nn   Perc Error  STD   Spec  STD   Sens  STD
 3    5     19.0      0.18  90.3  0.25  70.6  0.25
 3   10     18.5      0.16  91.3  0.09  70.2  0.13
 3   20     18.0      0.11  90.8  0.07  71.0  0.09
 3   40     18.3      0.16  91.0  0.07  70.8  0.07
 5    5     18.7      0.14  90.6  0.11  71.2  0.15
 5   10     18.3      0.12  91.0  0.09  70.9  0.14
 5   20     18.4      0.13  90.8  0.07  71.3  0.10
 5   40     18.3      0.13  90.3  0.05  71.6  0.07
10    5     18.6      0.12  89.4  0.10  72.8  0.12
10   10     18.4      0.12  90.2  0.09  72.2  0.10
10   20     18.6      0.15  90.0  0.06  72.5  0.08
10   40     18.5      0.14  90.0  0.04  72.6  0.05
20    5     18.5      0.14  88.0  0.10  74.7  0.10
20   10     18.1      0.13  89.1  0.08  73.6  0.07
20   20     18.2      0.14  88.9  0.05  74.1  0.07
20   40     17.7      0.12  88.8  0.04  74.4  0.05

C. Self-Organizing Map analysis

Approximately 600 of the 4872 data points were always misclassified, whether they were included in the training, validation or testing data sets. We would generally expect that data points would be less likely to be misclassified when they were included in the training set, and more likely to be misclassified when they were in the testing set. This would not be the case, however, if the inputs from different classes were so similar to each other as to be indistinguishable. In order to investigate whether or not this is the case, we will cluster the input vectors to determine their similarities. The clustering method we will use is the Self-Organizing Map (SOM). The SOM is a topology preserving network, in that neurons within the network have neighbor relationships that are preserved by the training process. After training, the inputs from the training set are characterized by a small set of prototype vectors – one for each neuron in the SOM.

To illustrate the process, we will consider the 14 most important input elements from the original 194, as determined by their p-values. We will train a 5x5 SOM (25 neurons) with a hexagonal grid to cluster the input vectors. The neuron numbers of the SOM are shown in Figure 7. After training, we expect that the cluster represented by Neuron 1 will be near the clusters represented by Neurons 2 and 6. This will better enable us to judge the distribution of inputs in the training set.

Fig. 7. SOM neuron numbers and misclassified complexes.

Fig. 8. SOM Hits.
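A minimal sketch of this clustering step, using selforgmap from the Neural Network Toolbox, is shown below; X14 (the 14 selected descriptors for all complexes), the logical label isGood, and the logical flag alwaysMissed (complexes misclassified in every Monte Carlo trial) are assumed to be available, and their names are our own.

    % Train a 5x5 SOM (hexagonal topology is the default) and summarize each
    % cluster, as in the hit histogram of Figure 8 (X14 is 14-by-N).
    som = selforgmap([5 5]);
    som = train(som, X14);
    assign = vec2ind(som(X14));                 % winning neuron for each complex

    for c = 1:25
        members = (assign == c);
        fprintf('Neuron %2d: %4d hits, %5.1f%% good binders, %5.1f%% always misclassified\n', ...
                c, sum(members), 100*mean(isGood(members)), 100*mean(alwaysMissed(members)));
    end

The neighbor relationships of the hexagonal grid are what allow adjacent neurons (for example, Neurons 1, 2 and 6) to represent similar regions of the 14-dimensional input space.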



Figure 8 is a hit histogram of the trained SOM. The size of the hexagons represents the number of input vectors that are in each cluster. Here we can see that Neuron 4 has the largest cluster, with 691 input vectors. The green hexagons represent clusters in which more than 50% of the input vectors are good binders. The red hexagons represent clusters in which more than 50% of the input vectors are bad binders. Figure 7 illustrates where the consistently misclassified vectors are located. A red hexagon indicates that there are consistently misclassified vectors in the cluster, and that the majority of those vectors are False Positives (bad binders classified as good). A green hexagon indicates False Negatives (good binders that are classified as bad). The size of the hexagon indicates the percentage of misclassified vectors in the cluster. For example, Neuron 1 has 72 vectors in its cluster; the majority are good binders. It also has a large number of misclassified vectors, and the majority are False Positives (bad binders that are classified as good binders). Comparing Figure 8 and Figure 7, we can see that most misclassified inputs are bad binders that are classified as good, and they are on the left side of the feature map, where most good binders are located.

There are several things we can say based on an initial reading of these two figures. The good binders generally fall on the left side of the SOM, and the bad binders fall on the right side. This tells us that the 14 inputs we are using are able to characterize the binding properties of the protein-ligand complex, at least in a general way, and this is what enables an accuracy of approximately 80%. However, there are a number of clusters in which many input vectors are consistently misclassified. Clusters 1, 6, 11 and 21, in particular, have a high percentage of misclassified vectors. The highest percentages of misclassified vectors occur in clusters that contain primarily good binders, and the errors are generally False Positives (bad binders that are misclassified as good binders). Does this mean that the existing inputs are not sufficient to distinguish the binding ability of certain types of protein-ligand complexes?

Further insight can be obtained by examining the individual clusters. Figure 9 shows the clusters associated with the bottom row of the SOM. The left column shows the cluster center as a thick green line and the inputs associated with the cluster as thin black, cyan, red or blue lines. The black lines represent true negative inputs (bad binders that are correctly classified), the cyan lines represent true positives (good binders that are correctly classified), the red lines represent false negatives (good binders that are incorrectly classified), and the blue lines represent false positives (bad binders that are incorrectly classified). In order to better visualize the differences between the inputs associated with each cluster, the right column of Figure 9 shows the input vectors for each cluster after the cluster center has been subtracted. It also indicates the numbers of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). We can see that Neuron 1 contains a majority of good binders, but most of the bad binders in this cluster are misclassified as good binders. The True Positives and False Positives are very similar to each other, at least in terms of the 14 descriptors that are being used to make up the input vector. This indicates that the 14 selected characteristics may not be sufficient to distinguish these good and bad binders.

Cluster 2 also has misclassified input vectors, although the percentage is not as high as in Cluster 1. From Figure 9 we can see that the Cluster 1 and Cluster 2 centers are fairly different from each other, although they both contain mainly bad binders.

In Cluster 3 there are relatively few misclassified vectors, although the cluster contains both good and bad binders. Figure 9(f) shows that the True Negatives (black lines) have a different pattern than the True Positives (cyan lines), while the relatively few False Positives (blue lines) look more like the True Positives than the True Negatives.

Cluster 4 contains mainly bad binders, but even most of the good binders are correctly classified. There are misclassified vectors, but they are generally False Negatives (good binders that are classified as bad). Most of the 691 input vectors in this cluster are close to the cluster center, except at elements 1 and 14. The False Negatives do not seem to be distinguishable from the True Negatives.

Cluster 5 is mainly distinguishable from Cluster 4 in the second element. All but two of the 157 vectors in Cluster 5 are good binders, and only one of the two bad binders is incorrectly classified. The misclassified vector is very similar to the True Positives, and is quite distinct from the True Negative. Again, the indication is that the 14 descriptors may not be sufficient to distinguish this bad binder.

A similar study was made for the 144 element input vectors, and a similar conclusion was reached. The original 194 characteristics of the protein-ligand complexes that we have been using do not appear to be sufficient to provide significantly reduced classification error. Most of the complexes that are being misclassified are misclassified whether or not they are included in the training set. Based on this result, and the SOM analysis, it appears that we need to use additional descriptors in order to improve the classification abilities of the network. It is not a matter of increasing the number of neurons or layers in the networks.

D. Analysis of Misclassified Data

As described earlier, a Monte Carlo analysis of the training process was performed. For example, the best results were obtained for a committee of 20 networks, where each network had two layers with five neurons in the hidden layer. During the training process, the 20-network committees were generated 100 times, with different training/validation and test sets, and the mean and standard deviation of the errors of the 100 different committees were computed. In addition, we counted the number of times that each ligand-protein complex in the entire data set (training, validation and testing) was misclassified over the 100 Monte Carlo trials. One would expect that a typical complex would be less likely to be misclassified if it was included in the training (or validation) set, and more likely to be misclassified if it was included in the test set (which is not used to update weights or stop training). At each Monte Carlo trial, 15% of the data is included in the test set. If a given complex was always misclassified when it was included in the test set, and never misclassified when it was included in the training or validation sets, then we would expect it to be misclassified, on average, 15 times out of the 100 Monte Carlo trials. Of course, this would be an extreme result. If the network is being trained effectively, and if the descriptors in the network inputs are sufficiently explanatory, we would expect the number of misclassifications to be much less.



Fig. 9. Example Clusters. For each of the five clusters in the bottom row of the SOM, the left panel shows the cluster inputs and the cluster center, and the right panel shows the inputs after the cluster center has been subtracted, with the counts: Neuron 1: TN=4, TP=48, FN=2, FP=18; Neuron 2: TN=53, TP=168, FN=11, FP=29; Neuron 3: TN=36, TP=97, FN=0, FP=4; Neuron 4: TN=375, TP=220, FN=63, FP=33; Neuron 5: TN=1, TP=155, FN=0, FP=1.
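Panels like those in Figure 9 can be assembled along the following lines (a sketch; assign, X14, committeeClass and trueClass are the hypothetical variables from the earlier sketches, with class 1 denoting a good binder and c the cluster of interest).

    % Break one SOM cluster into TP/TN/FP/FN groups and plot its inputs against
    % the cluster center, as in Fig. 9.
    m      = find(assign == c);
    center = mean(X14(:, m), 2);                         % cluster center
    isTP = trueClass(m)==1 & committeeClass(m)==1;       % correctly classified good binders
    isTN = trueClass(m)==2 & committeeClass(m)==2;       % correctly classified bad binders
    isFN = trueClass(m)==1 & committeeClass(m)==2;       % good binders classified as bad
    isFP = trueClass(m)==2 & committeeClass(m)==1;       % bad binders classified as good

    subplot(1,2,1); hold on                              % left panel: raw inputs
    plot(X14(:, m(isTN)), 'k'); plot(X14(:, m(isTP)), 'c');
    plot(X14(:, m(isFN)), 'r'); plot(X14(:, m(isFP)), 'b');
    plot(center, 'g', 'LineWidth', 3);

    subplot(1,2,2);                                      % right panel: center removed
    plot(bsxfun(@minus, X14(:, m), center));
    title(sprintf('TN=%d  TP=%d  FN=%d  FP=%d', sum(isTN), sum(isTP), sum(isFN), sum(isFP)));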



Figure 10 is a histogram of the number of misclassifications that occur for each of the 4871 complexes. Notice that over 600 of the complexes are misclassified in every Monte Carlo trial (100 times). This result was very unexpected. It means that even when these complexes are in the training set, they are misclassified every time. As illustrated in Figure 8 and Figure 7, the consistently misclassified complexes consist of both good and bad binders. In addition, they must be distributed throughout the input space, since they appear in almost every cluster in the SOM network. This suggests that there must be good binders that look like bad binders, and bad binders that look like good binders. This raises the question as to whether these consistently misclassified complexes have anything in common.

Fig. 10. Histogram of the Number of Misclassifications of Each Complex.

To test this, we divided the data into two sets. One set contained all complexes that were misclassified more than 50% of the time, and the other set contained all complexes that were misclassified less than 50% of the time. The two data sets were then trained separately. We would, of course, expect that the complexes that were usually correctly classified in the full data set would also be correctly classified in the smaller data set, where the original misclassified complexes were removed. We would not, however, expect that the complexes that were always misclassified should now be accurately classified, when the correctly classified complexes were removed. There is no reason, a priori, to expect that the misclassified complexes have anything in particular in common with each other. If the consistently misclassified complexes were simply very difficult to classify, then they should still be difficult to classify when the correctly classified complexes are removed.

There were 4872 complexes in the original data set. Of these, 2741 were good binders and 2130 were bad. We found 854 complexes that were misclassified more than 50% of the time. Of these, 246 were good binders and 608 were bad. There were 4017 complexes that were correctly classified more than 50% of the time. Of these, 2495 were good, and 1522 were bad. This means that bad binders were more likely to be misclassified than good binders, but significant numbers of both bad and good binders were contained in both the misclassified data set and the correctly classified data set.

Using just the single layer network, we trained on both the well-classified complexes and the misclassified complexes. For the well-classified complexes, we found, as expected, that the accuracy improved. The mean percent error over 100 Monte Carlo trials was 6.23%. This is much less than the approximately 20% error on the full data set. The average sensitivity was 94.5%, and the average specificity was 93.9%. Both of these numbers are significantly higher than the results on the full data set.

For the misclassified data set, the results were unexpected. We found that the average percent error was 16.3%. This is less than the 20% error on the full data set. The average sensitivity was 92.6%, and the average specificity was 71.6%. The specificity is approximately the same as the result for the full data set, and the sensitivity is higher. Remember that the error on these complexes was close to 100% when they were included in the full data set.

It appears that there is some consistent difference between the complexes that were consistently misclassified and those that were not. To investigate this further, we inspected the weights of the single layer network to see how they differed between the two data sets. First, consider Figure 11. This is a plot of the weights for the single layer network, with 144 inputs, that was trained on the complexes that were misclassified less than 50% of the time when trained on the full data set. There are three curves in the figure. The blue curve is the mean of the weights over the 100 Monte Carlo runs. The red line is two standard deviations above the mean, and the green line is two standard deviations below the mean. Almost all of the Monte Carlo runs fell between the red and green lines. The data was very consistent when these 144 inputs were used. (For space reasons, the plot of the weights for the complete data set is not shown, but the shape of that plot was almost identical to the originally well classified case, although the weights were much smaller – with a maximum of about 3, instead of 80.)

Fig. 11. Weights when trained only on well-classified data.
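The split-and-retrain experiment and the weight summaries of Figures 11 through 13 can be sketched as follows; missCount is the per-complex tally accumulated over the Monte Carlo trials, trainSingleLayerMC is a hypothetical helper that repeats the single layer training of Section III and returns a runs-by-weights matrix, and meanFullWeights is assumed to hold the mean single layer weights obtained on the full data set.

    % Split by misclassification rate, retrain each subset, and compare weights.
    wellIdx = find(missCount <= 50);                        % 4017 mostly well-classified complexes
    missIdx = find(missCount >  50);                        %  854 mostly misclassified complexes

    Wwell = trainSingleLayerMC(X(:, wellIdx), T(:, wellIdx), 100);   % hypothetical helper, 100 runs
    Wmiss = trainSingleLayerMC(X(:, missIdx), T(:, missIdx), 100);

    % Figures 11 and 12: mean weight with a band of +/- two standard deviations
    mu = mean(Wwell); sd = std(Wwell);
    plot(mu, 'b'); hold on; plot(mu + 2*sd, 'r'); plot(mu - 2*sd, 'g');

    % Figure 13: full-set weights against misclassified-set weights, with best fit line
    figure; scatter(mean(Wmiss), meanFullWeights);
    pfit = polyfit(mean(Wmiss), meanFullWeights, 1);        % about -0.44x + 0.34 in the text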



Now, consider Figure 12. These are the weights for the single layer network that was trained only on the complexes that were misclassified more than 50% of the time when trained on the full data set. These weights follow a very different pattern than in the other case. The weight pattern shown in Figure 11 is able to correctly classify the majority of the complexes in the original data set. However, there is a set of 600 to 800 complexes that seem to require a different type of classifier, represented by the weights in Figure 12. In certain regions of the weight matrix, the weights for the originally misclassified data are approximately the negative of the weights for the well classified data, but this does not hold completely.

Fig. 12. Weights when trained only on misclassified data.

Figure 13 shows a scatter plot of the weights from the full set versus the weights from the misclassified set. The best fit linear equation is y = -0.44x + 0.34, with an R value of 0.8. There is a generally negative relationship, but there is a significant amount of variation. There are some inputs that have the largest weights in the complete data set (and also in the well classified set), but small weights in the originally misclassified set. In the future, we will investigate those descriptors that are important for the originally well classified data, but apparently not consequential for the originally misclassified data.

Fig. 13. Weights from the full set versus weights from the misclassified set.

V. CONCLUSIONS

In this paper, we have described how neural networks can be used in the virtual screening of potential drug candidates. There are several novel aspects of our research. We have used a statistical analysis of the weights of single layer networks to select relevant inputs, we have used Monte Carlo cross-validation to provide confidence measures of network performance and to identify problems in network training, and we have used Self Organizing Maps to investigate the training results and identify anomalies. We applied these techniques to a large practical data set of protein-ligand complexes. We were able to classify the complexes into good binding and bad binding categories with average accuracies of better than 82%. More importantly, by using an analysis of the Monte Carlo cross-validation and Self Organizing Map results, we were able to discover a characteristic of the data set that has promise for producing even more accurate results. By dividing the data into two separate categories, and training different network committees on each category, we were able to achieve average accuracies of more than 90%. This division of the data into the two categories cannot be achieved using the current descriptors. The next step of this research will be to look for biochemical differences between the complexes in the two categories.

VI. ACKNOWLEDGEMENTS

The authors would like to thank Dr. Jacob Durrant of the University of California, San Diego, for supplying us with data on well docked bad binders and for helpful suggestions on this work.

REFERENCES

[1] G. M. Morris, R. Huey, W. Lindstrom, M. F. Sanner, R. K. Belew, D. S. Goodsell, and A. J. Olson, "Autodock4 and AutodockTools4: automated docking with selective receptor flexibility," J. Computational Chemistry, vol. 16, no. 1, pp. 2785–2791, June 2009.

[2] P. G. Polishchuk, T. I. Madzhidov, and A. Varnek, "Estimation of the size of drug-like chemical space based on GDB-17 data," J. Computer-Aided Molecular Design, vol. 8, no. 1, pp. 675–679, Aug. 2013.

[3] O. Trott and A. J. Olson, "AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading," J. Computational Chemistry, vol. 31, no. 2, pp. 455–461, June 2009.

[4] J. D. Durrant and J. A. McCammon, "NNScore: A neural-network-based scoring function for the characterization of protein/ligand complexes," J. Chem. Inf. Model., vol. 50, no. 10, pp. 1865–1871, Sept. 2010.

[5] R. Wang, X. Fang, Y. Lu, and S. Wang, "The PDBbind database: Collection of binding affinities for protein-ligand complexes with known three-dimensional structures," J. Med. Chem., vol. 47, no. 12, pp. 2977–2980, Dec. 2004.

[6] S. Forli, "Raccoon - AutoDock VS: an automated tool for preparing AutoDock virtual screenings," http://autodock.scripps.edu/resources/raccoon, accessed 2016-01-10.

[7] M. Moller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, no. 4, pp. 525–533, Dec. 1993.

[8] H. B. Demuth, M. H. Beale, and M. T. Hagan, The Neural Network Toolbox for MATLAB. Boston, MA: The MathWorks, 2014.

[9] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.

