Figure 1: Graphical representation of the major graph operations. The dark blue node represents the specific node v, the
light blue nodes represent the neighbours of v, and the dotted node represents the dummy super node S. The orange and yellow
arrows represent dense layers with different weights. The red arrows represent max pooling, and the black arrow represents the
identity mapping. The dummy super node is connected to all other nodes by directed edges. A directed edge indicates that the
dummy super node learns features from the genuine nodes but does not affect their features. Note that, although the GraphConv
and GraphPool operations are shown for a single node v, they are in fact performed on all nodes in the graph
simultaneously.
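The two per-node operations in Figure 1 can be written down compactly. The following is a minimal NumPy sketch of our own (the function names and toy example are illustrative assumptions, not the authors' released code, and the model's degree-dependent weights are collapsed into a single shared weight here): GraphConv applies one dense layer to the node itself and a second, shared dense layer to each neighbour, summing the results; GraphPool takes an element-wise max over the node and its neighbours.

```python
import numpy as np

def graph_conv_node(h_v, neighbor_feats, W_self, W_nb, b):
    # New feature of node v: a dense layer on v itself plus a shared
    # dense layer applied to every neighbour, summed (Figure 1, left).
    return W_self @ h_v + sum(W_nb @ h_u for h_u in neighbor_feats) + b

def graph_pool_node(h_v, neighbor_feats):
    # Element-wise max over the node and its neighbours (Figure 1, right).
    return np.stack([h_v] + list(neighbor_feats)).max(axis=0)

# Toy example: 4-dimensional features, two neighbours.
d = 4
h_v = np.ones(d)
neighbors = [np.full(d, 2.0), np.full(d, 3.0)]
W_self, W_nb, b = np.eye(d), np.eye(d), np.zeros(d)
conv_out = graph_conv_node(h_v, neighbors, W_self, W_nb, b)  # 1 + 2 + 3 = 6 per entry
pool_out = graph_pool_node(h_v, neighbors)                   # max(1, 2, 3) = 3 per entry
```

Note that neither operation changes the number of nodes; graph pooling only enlarges the receptive field.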
would be affected by further neighbours. The left thumbnail in Figure 1 illustrates the graph convolution operation. The yellow arrows indicate the dense layer with weight W^d_nb for the neighbours, representing the neighbours' effect on the specific node. The orange arrows indicate the dense layer with weight W^d_self for the specific node itself.

In analogy to the pooling layers of convolutional neural networks, graph pooling was introduced by (Altae-Tran et al. 2017). Graph pooling is a simple operation that returns the maximum activation across the node and its neighbours. A simple illustration is shown in Figure 1. With graph pooling, we can easily enlarge the receptive field without adding extra weights. However, graph pooling does not change the size of the graph, as the pooling of a CNN does to the features of images.

Although there is no real convolution operation in a graph convolutional network, it is designed for reasons similar to those behind convolutional neural networks: 'local geometry' and reasonable weight sharing. Convolutional networks (LeCun et al. 1998) apply convolutions to exploit the translation equivariance of image features, and the features learned by a CNN are translation equivariant to some extent. A graph convolutional network likewise focuses on learning local features, by sharing the weights of the dense layers that operate on nodes of the same degree. Thus, the features learned by a graph convolutional network are determined by local neighbours rather than the global position of nodes. Such features are equivariant to permutations of atomic groups.

Dummy Super Node

Learning graph-level representations is a central problem of molecule classification and regression. Modern graph convolution networks often contain two graph convolution layers and two graph pooling layers; the receptive field of such a model is too small, especially compared with the size of most drug compounds. Simply summing the node-level features cannot enlarge the receptive field of the model in a real sense. A naive way to enlarge the receptive field is simply to apply more graph convolutional layers. However, such a method makes the model more likely to overfit and to perform worse on the validation and test subsets, for lack of enough data in drug datasets.

In order to learn better graph-level representations, we introduce a dummy super node, connected to all nodes in the graph by directed edges, as the representation of the graph. Since the dummy super node is directly connected to all nodes in the graph, it can easily learn a global feature through one graph convolution layer. The directed edges pointing from the genuine nodes to the dummy super node indicate that the dummy super node can learn features from all genuine nodes, while none of the genuine nodes is affected by the dummy super node. Consequently, the dummy super node can learn the global feature while the genuine nodes keep learning local features that are equivariant to permutations of atomic groups. Since the feature of a molecule is likely to be more complex than that of an atom, we can use a longer vector as the feature of the dummy super node.

Formally, we can formulate the new feature S' of the dummy super node after a graph convolution layer as

    S' = W_self S + Σ_{1≤i≤N} W_nb v_i + b        (2)

where W_self is the weight for the dummy super node, W_nb is the weight for the genuine nodes, N is the number of nodes, v_i is the feature of the i-th node, and b is the bias. The dummy super node S is initialized to zero and is updated through the graph convolutional network simultaneously with the genuine nodes.
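Equation (2) translates directly into code. Below is a minimal NumPy sketch of our own (names and shapes are illustrative assumptions, not the authors' implementation); as noted above, the super node's feature vector may be longer than the node features, so W_nb maps node features into the super node's space:

```python
import numpy as np

def update_super_node(S, node_feats, W_self, W_nb, b):
    # Eq. (2): S' = W_self @ S + sum_i W_nb @ v_i + b.
    # S may be a longer vector than the node features v_i,
    # so W_self is (d_s, d_s) while W_nb is (d_s, d).
    return W_self @ S + sum(W_nb @ v for v in node_feats) + b

# Toy example: super node of size 2, two genuine nodes of size 3.
S = np.zeros(2)                       # super node initialized to zero
nodes = [np.ones(3), np.ones(3)]
W_self, W_nb, b = np.eye(2), np.ones((2, 3)), np.zeros(2)
S_new = update_super_node(S, nodes, W_self, W_nb, b)  # each entry: 3 + 3 = 6
```

Because the edges are directed, this update reads from the genuine nodes but never writes back to them, matching Equation 1 for the genuine nodes.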
[Figure 2 diagram: GC+RELU+BN -> GP -> GC+RELU+BN -> GP -> GC+RELU+BN -> Two-Layer Classifier. GC: Graph Convolution, GP: Graph Pooling, BN: Node-Level Batch Normalization]
Figure 2: Our network structure. We apply three graph convolution blocks (GC + RELU + BN) and two graph pooling layers,
then feed the feature of the dummy super node to a two-layer classifier. The batch normalization used here is node-level
batch normalization. The third graph convolution block only calculates the feature of the dummy super node, since the new
features of the other nodes would not be fed into the following classifier.
The new features of the genuine nodes are not affected by the dummy super node and remain the same as in Equation 1. Graph pooling is applied only to the genuine nodes, and the dummy super node is not regarded as a neighbour of other nodes when graph pooling is applied.

Network Structure

In addition to graph operations, we also apply node-level batch normalization, relu activation, tanh activation and dense layers in the model. For some datasets with imbalanced data, we replace the common cross entropy loss with focal loss. The whole network structure is shown in Figure 2. We add the dummy super node and its edges to the original graph before sending it to the network. In our model, we apply three graph convolution blocks (GC + RELU + BN) and two graph pooling layers, then we feed the feature of the dummy super node to a two-layer classifier. The batch normalization used there is node-level batch normalization, which is discussed later. Only the feature of the dummy super node is calculated in the third graph convolution block, since the new features of the other nodes would not be fed into the following classifier.

Unlike ResNet, we place the relu activation before batch normalization, as our experiments show that the model is hard to train when the relu activation is placed after it. We argue that molecule features differ from image features: noise in images usually does not affect the image label, while noise in molecules does.

Node-Level Batch Normalization  Since the number of atoms varies between molecules, standard batch normalization (Ioffe and Szegedy 2015) cannot be applied to a graph convolution network directly. We extend standard batch normalization to node-level batch normalization: we normalize the feature of each node to zero mean and unit variance. Since the dummy super node is also a node, we simply apply the identical normalization to the dummy super node as well.

Focal Loss  Data in drug datasets is often unbalanced. Take the HIV dataset as an instance: only 1487 chemical compounds are HIV active in a dataset of 41913 chemical compounds in total. However, previous work (Altae-Tran et al. 2017; Wu et al. 2017) does not pay much attention to this imbalance in drug datasets.

Downsampling is a simple method to tackle unbalanced data. However, the amount of data in drug datasets is usually not large, owing to the difficulty of acquiring data and getting it labeled, and simple downsampling may result in serious overfitting of the neural network for lack of sufficient data. In addition, boosting is an effective resampling method that may address imbalanced data, but it is not easy to integrate boosting with deepchem (Wu et al. 2017), the modern chemical machine learning framework.

Focal loss is an elegant and brief proposal introduced by (Lin et al. 2017) to address class imbalance. It shows that focal loss can effectively improve the performance of one-stage detectors in object detection. In fact, focal loss is a kind of reshaped cross entropy loss in which the weights of well-classified examples are reduced. Formally, focal loss is defined as

    FL(p) = −( y (1 − p)^γ log p + (1 − y) p^γ log(1 − p) )

where y ∈ {0, 1} specifies the ground-truth class, p ∈ [0, 1] is the model's estimated probability for the class with label y = 1, and γ is a tunable focusing parameter. When γ = 0, focal loss is equivalent to cross entropy loss, and as γ is increased, the effect of the modulating factor is likewise increased. Focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the classifier during training.

Experiment

We evaluate our model on several datasets of MoleculeNet, covering drug activity, toxicity and solvation. Our code is released on GitHub (https://github.com/microljy/graph_level_drug_discovery).

MoleculeNet

MoleculeNet is a dataset collection built upon multiple public databases, covering various levels of molecular properties, ranging from atomic-level properties to macroscopic influences on the human body.
Table 1: Performance on the FreeSolv dataset. R2 is the squared Pearson correlation coefficient (larger is better). Since the R2 used by (Wu et al. 2017) cannot perfectly reflect the performance of the model, we provide RMSE and MAE (kcal/mol) as well.

Model      Split     | Valid R2  RMSE   MAE   | Test R2  RMSE   MAE
GraphConv  Index     | 0.935     0.909  0.703 | 0.941    0.963  0.738
GraphConv  Random    | 0.928     0.982  0.644 | 0.895    1.228  0.803
GraphConv  Scaffold  | 0.883     2.115  1.555 | 0.709    2.067  1.535
Our        Index     | 0.952     0.787  0.566 | 0.945    0.933  0.598
Our        Random    | 0.933     1.010  0.652 | 0.910    1.112  0.659
Our        Scaffold  | 0.884     2.076  1.404 | 0.746    1.939  1.415
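The three metrics reported in Table 1 are standard; for reference, a minimal NumPy sketch of our own (note that the R2 here is the squared Pearson correlation, as the caption states, not the coefficient of determination):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    # R2 as used here: squared Pearson correlation (larger is better).
    r = np.corrcoef(y_true, y_pred)[0, 1]
    # RMSE and MAE are in the units of the labels (kcal/mol for FreeSolv).
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mae = np.mean(np.abs(y_true - y_pred))
    return r ** 2, rmse, mae

# A perfect predictor gives R2 = 1, RMSE = 0, MAE = 0.
r2, rmse, mae = regression_metrics(np.array([-3.0, -1.0, 2.0]),
                                   np.array([-3.0, -1.0, 2.0]))
```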
For chemical data, randomly splitting datasets into train/valid/test collections, as is widely done in machine learning, is often not correct (Sheridan 2013). Consequently, MoleculeNet implements multiple different splittings (index split, random split, scaffold split) for each dataset. The scaffold split attempts to separate molecules structurally into the training/validation/test sets; thus, the scaffold split offers a greater challenge and demands a higher level of generalization ability from learning algorithms than the index split and the random split.

For quite a few datasets in MoleculeNet, the numbers of positive and negative samples are not balanced, so the accuracy metrics widely used in machine learning classification tasks are not suitable here. In MoleculeNet, classification tasks are evaluated by the area under curve (AUC) of the receiver operating characteristic (ROC) curve, and regression tasks are evaluated by the squared Pearson correlation coefficient (R2).

We pick the following datasets (Tox21, ToxCast, HIV, MUV, PCBA, FreeSolv) on macroscopic chemical influences on the human body from MoleculeNet, and evaluate our model on them.

Tox21  The Toxicology in the 21st Century (Tox21) initiative created a public database measuring the toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. Tox21 contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including stress response pathways and nuclear receptors.

ToxCast  ToxCast is another data collection (from the same initiative as Tox21) providing toxicology data for a large library of compounds based on in vitro high-throughput screening (Richard et al. 2016). The processed collection in MoleculeNet contains qualitative results of over 600 experiments on 8615 compounds.

MUV  The MUV dataset (Rohrer and Baumann 2009) contains 17 challenging tasks for around 90,000 compounds and is specifically designed for virtual screening techniques. The positive examples in these datasets are selected to be structurally distinct from one another.

PCBA  PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening (Wang et al. 2011). The processed collection in MoleculeNet is a subset that contains 128 bioassays measured on over 400,000 compounds.

HIV  The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for 41,913 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). In MoleculeNet, the latter two labels are combined, making it a classification task between inactive (CI) and active (CA and CM).

FreeSolv  The Free Solvation Database (FreeSolv) provides experimental and calculated hydration free energies of small molecules in water (Mobley and Guthrie 2014). FreeSolv contains 643 compounds, a subset of which are also used in the SAMPL blind prediction challenge (Mobley et al. 2014).

Settings and Performance Comparisons

We apply the identical network to all the above datasets, except for the output channel number of the last layer, which is determined by the number of tasks. For different datasets, we only tune the training hyperparameters. We compare our model with logistic regression taking ECFP features as input, and with the standard graph convolution network. To have a fair comparison, we reimplement the graph convolution network, and most of the results of our implementation are much better than those reported by (Wu et al. 2017).

Classification Tasks  For classification tasks (Tox21, ToxCast, MUV, PCBA), we utilize the area under curve (AUC) of the ROC curve as the metric (larger is better). The results are shown in Table 2, Table 3, Table 5 and Table 4. Our model generally surpasses the standard graph convolution network and conventional logistic regression, on both the small toxicity datasets (Tox21, ToxCast) and the larger bioactivity datasets (MUV, PCBA). This indicates that our method can generally improve the performance of graph convolution networks. Compared with the standard graph convolution network, our model achieves an average improvement of about 1.5% on the test subsets of the four datasets.

Regression Tasks  For the regression task (FreeSolv), we utilize the squared Pearson correlation coefficient (R2), RMSE and MAE as metrics. The experimental results are shown in Table 1. We only report the performance of the graph convolution network and our model here. Our model has better performance in general. We argue that our model fails to surpass the graph convolution network on the validation subset under random splitting because of stochastic factors introduced by the splitting.
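The focal loss defined earlier, used for the class-imbalanced HIV experiments in Table 6 with γ = 2, is straightforward to implement. Below is a minimal NumPy sketch of the binary form (our own illustration, not the authors' training code):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    # FL(p) = -( y (1-p)^gamma log p + (1-y) p^gamma log(1-p) );
    # gamma = 0 recovers the ordinary binary cross entropy.
    p = np.clip(p, eps, 1.0 - eps)  # numerical stability
    return -(y * (1.0 - p) ** gamma * np.log(p)
             + (1.0 - y) * p ** gamma * np.log(1.0 - p))

# A well-classified positive (p = 0.9) is strongly down-weighted:
ce = focal_loss(0.9, 1, gamma=0.0)   # plain cross entropy
fl = focal_loss(0.9, 1, gamma=2.0)   # modulated by (1 - 0.9)^2 = 0.01
```

Down-weighting easy examples this way keeps the abundant inactive compounds from dominating the gradient during training.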
Table 2: The area under curve (AUC) of the ROC curve of various models on the Tox21 dataset.

Model      Split     Train  Valid  Test
ECFP+LR    Index     0.903  0.704  0.738
ECFP+LR    Random    0.901  0.742  0.755
ECFP+LR    Scaffold  0.905  0.651  0.697
GraphConv  Index     0.945  0.829  0.820
GraphConv  Random    0.938  0.833  0.846
GraphConv  Scaffold  0.955  0.778  0.752
Our        Index     0.965  0.839  0.848
Our        Random    0.964  0.842  0.854
Our        Scaffold  0.971  0.788  0.759

Table 3: The area under curve (AUC) of the ROC curve of various models on the ToxCast dataset.

Model      Split     Train  Valid  Test
ECFP+LR    Index     0.727  0.578  0.464
ECFP+LR    Random    0.713  0.538  0.557
ECFP+LR    Scaffold  0.717  0.496  0.496
GraphConv  Index     0.904  0.723  0.708
GraphConv  Random    0.901  0.734  0.754
GraphConv  Scaffold  0.914  0.662  0.640
Our        Index     0.927  0.747  0.734
Our        Random    0.924  0.746  0.768
Our        Scaffold  0.929  0.696  0.657

Table 4: The area under curve (AUC) of the ROC curve of various models on the PCBA dataset.

Model      Split     Train  Valid  Test
ECFP+LR    Index     0.809  0.776  0.781
ECFP+LR    Random    0.808  0.772  0.773
ECFP+LR    Scaffold  0.811  0.746  0.757
GraphConv  Index     0.895  0.855  0.851
GraphConv  Random    0.896  0.854  0.855
GraphConv  Scaffold  0.900  0.829  0.829
Our        Index     0.904  0.869  0.864
Our        Random    0.899  0.863  0.867
Our        Scaffold  0.907  0.847  0.845

Table 5: The area under curve (AUC) of the ROC curve of various models on the MUV dataset.

Model      Split     Train  Valid  Test
ECFP+LR    Index     0.960  0.773  0.717
ECFP+LR    Random    0.954  0.780  0.740
ECFP+LR    Scaffold  0.956  0.702  0.712
GraphConv  Index     0.951  0.816  0.792
GraphConv  Random    0.949  0.787  0.836
GraphConv  Scaffold  0.979  0.779  0.735
Our        Index     0.982  0.816  0.795
Our        Random    0.989  0.800  0.845
Our        Scaffold  0.990  0.816  0.762

Table 6: The area under curve (AUC) of the ROC curve of various models on the HIV dataset. With focal loss (γ = 2), our method achieves a further improvement.

Model             Split     Train  Valid  Test
ECFP+LR           Index     0.864  0.739  0.741
ECFP+LR           Random    0.860  0.806  0.809
ECFP+LR           Scaffold  0.858  0.798  0.738
GraphConv         Index     0.945  0.779  0.728
GraphConv         Random    0.939  0.835  0.822
GraphConv         Scaffold  0.938  0.795  0.769
Our               Index     0.973  0.789  0.737
Our               Random    0.951  0.842  0.830
Our               Scaffold  0.967  0.813  0.763
Our + Focal Loss  Index     0.993  0.793  0.749
Our + Focal Loss  Random    0.993  0.843  0.851
Our + Focal Loss  Scaffold  0.992  0.816  0.776

The hydration free energy to be predicted in FreeSolv has been widely used as a test of computational chemistry methods. With energy values ranging from -25.5 to 3.4 kcal/mol in the FreeSolv dataset, the RMSE of ab-initio calculation results reaches up to 1.5 kcal/mol (Mobley et al. 2014). Our method clearly outperforms the ab-initio calculations when we utilize the index split and the random split. However, it fails to outperform the ab-initio calculations when the dataset is split by the scaffold split, which separates molecules structurally. We argue that it is insufficient data that results in the weak generalization ability of our model. When fed with enough data, our model may surpass classic ab-initio calculations overall.

Class Imbalance  We apply focal loss on the HIV dataset, which has unbalanced classes. The experimental results in Table 6 show that focal loss can further improve the performance. Interestingly, our model overfits quickly on the train dataset when we apply focal loss. We argue that focal loss helps our model handle hard examples and fit the train dataset better.

Conclusion

In this paper, we point out that molecular property prediction demands graph-level representations. However, most previous works on graph neural networks focus only on learning node-level representations. Such representations are clearly not sufficient for molecular property prediction. In order to learn graph-level representations, we propose a dummy super node that is connected to all nodes in the graph by directed edges, and we modify the graph operations to help the dummy super node learn a graph-level feature. Thus, we can handle graph-level classification and regression in the same way as node-level classification and regression. The experiments on MoleculeNet show that our method can generally improve the performance of graph convolution networks.