
Journal of Chemical Theory and Computation

This document is confidential and is proprietary to the American Chemical Society and its authors. Do not
copy or disclose without written permission. If you have received this item in error, notify the sender and
delete all copies.

Prediction errors of molecular machine learning models lower than hybrid DFT error

Journal: Journal of Chemical Theory and Computation

Manuscript ID ct-2017-00577j.R1

Manuscript Type: Article

Date Submitted by the Author: 19-Sep-2017

Complete List of Authors: Faber, Felix; University of Basel
Hutchison, Luke; Google
Huang, Bing; University of Basel
Gilmer, Justin; Google
Schoenholz, Samuel; Google
Dahl, George; Google
Vinyals, Oriol; Google
Kearnes, Steven; Google
Riley, Patrick; Google
von Lilienfeld, O. Anatole; University of Basel

ACS Paragon Plus Environment



Prediction errors of molecular machine learning models lower than hybrid DFT error

Felix A. Faber,†,§ Luke Hutchison,‡,§ Bing Huang,† Justin Gilmer,‡ Samuel S. Schoenholz,‡ George E. Dahl,‡ Oriol Vinyals,¶ Steven Kearnes,‡ Patrick F. Riley,‡ and O. Anatole von Lilienfeld∗,†

†Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials, Department of Chemistry, University of Basel, Klingelbergstrasse 80, CH-4056 Basel, Switzerland
‡Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, US
¶Google, 5 New Street Square, London EC4A 3TW, UK
§These authors contributed equally

E-mail: anatole.vonlilienfeld@unibas.ch
Abstract

We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to ∼118k distinct molecules. Molecular structures and properties at hybrid density functional theory (DFT) level of theory come from the QM9 database [Ramakrishnan et al., Scientific Data 1, 140022 (2014)] and include enthalpies and free energies of atomization, HOMO/LUMO energies and gap, dipole moment, polarizability, zero point vibrational energy, heat capacity, and the highest fundamental vibrational frequency. Various molecular representations have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), and angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR), and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). Out-of-sample errors are strongly dependent on the choice of representation, regressor, and molecular property. Electronic properties are typically best accounted for by MG and GC, while energetic properties are better described by HDAD and KRR. The specific combinations with the lowest out-of-sample errors in the ∼118k training set size limit are (free) energies and enthalpies of atomization (HDAD/KRR), HOMO/LUMO eigenvalue and gap (MG/GC), dipole moment (MG/GC), static polarizability (MG/GG), zero point vibrational energy (HDAD/KRR), heat capacity at room temperature (HDAD/KRR), and highest fundamental vibrational frequency (BAML/RF). We present numerical evidence that ML model predictions deviate from DFT (B3LYP) less than DFT (B3LYP) deviates from experiment for all properties. Furthermore, out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. The results suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data were available.

1 Introduction

Due to its favorable trade-off between accuracy and computational cost, Density Functional Theory (DFT)1,2 is the workhorse of quantum chemistry3—despite its well known shortcomings regarding spin-states, van der Waals interactions, and chemical reactions.4,5 Failures to predict reaction profiles are particularly worrisome,6 and recent analysis casts even more doubt on the usefulness of DFT functionals obtained through parameter fitting.7 The prospect of universal and computationally much more efficient machine learning (ML) models, trained on data from experiments or generated at higher levels of electronic structure theory such as post-Hartree-Fock or quantum Monte Carlo (e.g. exemplified in Ref. 8), therefore represents an appealing alternative strategy. Not surprisingly, a lot of recent effort has been devoted to developing ever more accurate ML models of properties of molecular and condensed phase systems.

Several ML studies have already been published using a data set called QM9,9 consisting of molecular quantum properties for the ∼134k smallest organic molecules containing up to 9 heavy atoms (C, O, N, or F; not counting H) in the GDB-17 universe.10 Some of these studies have developed or used representations we consider in this work, such as BAML (bonds, angles, machine learning),11 bag of bonds (BOB),12,13 and the Coulomb matrix (CM).13,14 Atomic variants of the CM have also been proposed and tested on QM9.15 Other representations have also been benchmarked on QM9 (or QM7, a smaller but similar data set), such as Fourier series of radial distance distributions,16 motifs,17 the smooth overlap of atomic positions (SOAP)18 in combination with regularized entropy match,19 and constant size descriptors based on connectivity and encoded distance distributions.20 Ramakrishnan et al.8 introduced a ∆-ML approach, where the difference between properties calculated at coarse/accurate quantum levels of theory is being modeled. Furthermore, neural network models, as well as deep tensor neural networks, have recently been proposed and tested on the same or similar data sets.21,22 Dral et al.23 use such data to machine learn optimal molecule specific parameters for the OM224 semiempirical method, and orthogonalization tests are benchmarked in Ref. 25.

However, limited work has yet been done in systematically assessing various methods and properties on large sets of the exact same chemicals.26 In order to unequivocally establish if ML has the potential to replace hybrid DFT for the screening of properties, one has to demonstrate that ML test errors are systematically lower than estimated hybrid DFT accuracies for all the properties available. This study accomplishes that through an assessment of unprecedented scale: (i) In order to approximate large training set sizes N, we included 13 quantum properties from up to ∼118k molecules in training (90% of QM9). (ii) We tested multiple regressors (Bayesian ridge regression (BR), linear regression with elastic net regularization (EN), random forest (RF), kernel ridge regression (KRR), and the neural network (NN) models graph convolutions (GC)27 and gated graphs (GG)28) and (iii) multiple representations, including BAML, BOB, CM, extended connectivity fingerprints (ECFP4), histograms of distance, angle, and dihedral (HDAD), molecular atomic radial angular distribution (MARAD), and molecular graphs (MG). (iv) We investigated all combinations of regressors and representations, except that MG was used exclusively with the NN models, because GC and GG depend fundamentally on the input representation being a graph instead of a flat feature vector.

The best models for the various properties are: atomization energy at 0 Kelvin (HDAD/KRR), atomization energy at room temperature (HDAD/KRR), enthalpy of atomization at room temperature (HDAD/KRR), free energy of atomization at room temperature (HDAD/KRR), HOMO/LUMO eigenvalue and gap (MG/GC), dipole moment (MG/GC), static polarizability (MG/GG), zero point vibrational energy (HDAD/KRR), heat capacity at room temperature (HDAD/KRR), and the highest fundamental vibrational frequency (BAML/RF). For training set size of ∼118k

(90% of data set) we have found the additional out-of-sample error added by machine learning to be lower than or as good as DFT errors at B3LYP level of theory relative to experiment for all properties, and that chemical accuracy (see Table 3) is reached, or in sight.

This paper is organized as follows: First we briefly describe the methods, including data set, model validation protocols, representations, and regressors. In section III, we present and discuss the results, and section IV concludes the paper.

2 Method

2.1 Data set

We have used the QM9 data set consisting of ∼134k drug-like organic molecules.9 Molecules in the data set consist of H, C, O, N and F, and contain up to 9 heavy atoms. For each molecule several properties, calculated at DFT level of theory (B3LYP/6-31G(2df,p)), were included. We used: atomization energy at 0 Kelvin U0 (eV); atomization energy at room temperature U (eV); enthalpy of atomization at room temperature H (eV); free energy of atomization at room temperature G (eV); HOMO eigenvalue εHOMO (eV); LUMO eigenvalue εLUMO (eV); HOMO-LUMO gap ∆ε (eV); norm of the dipole moment µ = [Σ_{i∈{x,y,z}} (∫ dr n(r) r_i)²]^{1/2} (Debye), where n(r) is the molecular charge density distribution; static isotropic polarizability α = (1/3) Σ_{i∈{x,y,z}} α_ii (Bohr³), where α_ii is a diagonal element of the polarizability tensor; zero point vibrational energy ZPVE (eV); heat capacity at room temperature Cv (cal/mol/K); and the highest fundamental vibrational frequency ω1 (cm⁻¹). For energies of atomization (U0, U, H and G) all models yield very similar errors. We will therefore only discuss U0 for the remainder. The 3053 molecules specified in Ref. 9 which failed SMILES consistency tests were excluded from our study, as well as two linear molecules, leaving ∼131k molecules.

2.2 Model validation

Starting from the ∼131k molecules in QM9 after removing the ∼3k molecules (see above), we have created a number of train-validation-test splits. We have split the data set into test and non-test sets and varied the percentage of data in the test set to explore the effect of the amount of data on error rates. Inside the non-test set, we have performed 10-fold cross validation for hyperparameter optimization. That is, for each model 90% of the non-test set (the training set) is used for training and 10% (the validation set) is used for hyperparameter selection. For each test/non-test split, we have trained 10 models on different subsets of the non-test set, and we report the mean error on the test set across those 10 models. Note that the non-test set will be referred to as training set in the results section in order to simplify discussion.

In terms of CPU investments necessary for training the respective models, we note that EN/BR, RF/KRR, and GC/GG required minutes, hours, and multiple days, respectively. Using GPUs could dramatically reduce such timings.

2.3 DFT errors

To place the quality of our prediction errors in the right context, experimental accuracy estimates of hybrid DFT become desirable. Here, we summarize literature results comparing DFT at B3LYP level of theory to experiments for the various properties we study. Where data is available, the corresponding deviation from experiment is listed in Table 3, alongside our ML prediction errors (vide infra).

In order to also get an idea of hybrid DFT energy errors for organic molecules, such as the compounds studied herein, we refer to a comparison of PBE and B3LYP results for 6k constitutional isomers of C7H10O2.8 After centering the data by subtracting the mean shift from G4MP2 (177.8 (PBE) and 95.3 (B3LYP) kcal/mol), the remaining MAEs are roughly ∼2.5 and ∼3.0 kcal/mol for B3LYP and PBE, respectively. This is in agreement with what Curtiss et al.29 found. They compared DFT

to experimental values from 69 small organic molecules (of which 47 were substituted with F, Cl, and S), with up to 6 heavy atoms (not counting hydrogens), and calculated the energies using B3LYP/6-311+G(3df,2p). The resulting mean absolute deviation from experimental values was 2.3 kcal/mol.

Rough hybrid DFT error estimates for dipole moment and polarizability have been obtained from Ref. 30. The errors are estimated with reference to experimental values, for a data set consisting of 49 molecules with up to 7 heavy atoms (C, Cl, F, H, O, P, or S).

Frontier molecular orbital energies (HOMO, LUMO and HOMO-LUMO gap) cannot be measured directly. However, for the exact (yet unknown) exchange-correlation potential, the Kohn-Sham HOMO eigenvalue corresponds to the negative of the vertical ionization potential (IP).31 Unfortunately, within hybrid DFT, the precise meaning of the frontier eigenvalues and the gap is less clear, and we therefore refrain from a direct comparison of B3LYP to experimental numbers. Nevertheless, we have included eigenvalues and the gap due to their widespread use for molecular and materials design applications.

Hybrid DFT RMSE estimates with respect to experimental values of ZPVE and ω1 (the highest fundamental vibrational frequency) were published in Ref. 32 for a set of 41 organic molecules, with up to 6 heavy atoms (not counting hydrogen), calculated using B3LYP/cc-pVTZ.

Normally distributed data has a constant MAE-to-RMSE ratio,33 which is roughly 0.8. We have used this ratio to approximate the MAE from the RMSE estimates reported for ZPVE and ω1. Deviations of DFT (at the B3LYP/6-311G** level of theory) from experimental heat capacities were reported by DeTar,34 who obtained errors for 16 organic molecules, with up to 8 heavy atoms (not counting hydrogens).

Note, however, that one should be cautious when referring to these errors: Strictly speaking they cannot be compared since different basis sets, molecules, and experiments were used. We also note that all DFT errors in this paper are estimated from B3LYP, and using other functionals can yield very different errors. Nevertheless, we feel that the quoted errors provide meaningful guidance as to what one can expect from DFT for each property.

2.4 Representations

The design of molecular representations is a long-standing problem in chem-informatics and materials informatics, and many interesting and promising variants have already been proposed. Below, we provide the details on the representations selected for this study. While finalizing our study, competitive alternatives were introduced35,36 but have been tested only for energies (and polarizabilities).

2.4.1 CM and BOB

The Coulomb matrix (CM) representation14 is a square atom-by-atom matrix, where off-diagonal elements are the Coulomb nuclear repulsion terms between atom pairs. The diagonal elements approximate the electronic potential energy of the free atoms. Atom indices in the CM are sorted by the L1 norm of each atom's row (or column). The Bag of Bonds (BOB)12 representation uses exclusively CM elements, grouping them for different atom pairs into different bags, and sorting them within each bag by their relative magnitude.

2.4.2 BAML

The recently introduced BAML (bonds, angles, machine learning) representation can be viewed as a many-body extension of BOB.11 All pairwise nuclear repulsions are replaced by Morse/Lennard-Jones potentials for bonded/non-bonded atoms, respectively. Furthermore, three- and four-body interactions between covalently bonded atoms are included using angular and torsional terms, respectively. Parameters and functional forms are based on the universal force field (UFF).37
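To make the pairwise constructions of Secs. 2.4.1 and 2.4.2 concrete, the following minimal NumPy sketch builds a sorted Coulomb matrix. It is an illustration only, not the code used in this study: the diagonal term 0.5 Z_i^2.4 follows the convention introduced with the CM in Ref. 14, while the function name, units, and the example geometry are our own assumptions.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Sketch of the Coulomb matrix (CM) representation.

    Z: nuclear charges, shape (n,)
    R: Cartesian coordinates, shape (n, 3)
    Off-diagonal elements: Coulomb nuclear repulsion Z_i * Z_j / |R_i - R_j|.
    Diagonal elements: 0.5 * Z_i**2.4, an approximation to the free-atom energy.
    """
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    # Sort atom indices by the L1 norm of each atom's row, as described above.
    order = np.argsort(-np.abs(M).sum(axis=1))
    return M[np.ix_(order, order)]

# Hypothetical example: a C-O diatomic with an illustrative 1.13 length unit bond.
cm = coulomb_matrix([6, 8], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.13]])
```

BOB can then be obtained by regrouping these same matrix elements into per-element-pair bags sorted by magnitude, rather than keeping the matrix layout.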

2.4.3 ECFP4

Extended Connectivity Fingerprints38 (ECFP4) are a common representation of molecules in cheminformatics based studies. They are particularly popular for drug discovery.39–41 The basic idea, typical also for other cheminformatics descriptors42 (e.g. the signature descriptor43,44), is to represent a molecule as the set of subgraphs up to a fixed diameter (here we use ECFP4, which is a max diameter of 4 bonds). To produce a fixed length vector, the subgraphs can be hashed such that every subgraph sets one bit in the fixed length vector to 1. In this work, we use a fixed length vector of size 1024. Note that ECFP4 is based solely on the molecular graph specifying all covalent bonds, e.g. as encoded by SMILES strings.

2.4.4 MARAD

Molecular atomic radial angular distribution (MARAD) is an atomic radial distribution function (RDF) based representation. Per atom it consists of three RDFs using Gaussians of interatomic distances, and parallel and orthogonal projections of distances in atom triplets, respectively. Distances between two molecules can be evaluated analytically. Unfortunately, most regressors evaluated in this work, such as BR, EN and RF, do not rely on inner products and distances between representations. We resolve this issue by projecting MARAD onto bins in order to work with all regressors (apart from GG and GC, which use MG exclusively). The three-body terms in MARAD contain information about both angles and distances of all atoms involved. This differs from HDA (see below), where distances and angles are decoupled and placed in separate bins. Note that unlike BAML or HDAD, there are only two- and three-body terms; no four-body terms (dihedral angles) have been included within MARAD. Details about how the projected MARAD is calculated can be found in the Supplementary Material. Further details and characteristics of MARAD will also be discussed in a forthcoming separate in-depth study.

2.4.5 HD, HDA, and HDAD

BOB, BAML and MARAD rely on computing functions of given interatomic distances, and/or angles, and/or torsions, and then either projecting that value on to discrete bins, or sorting the values. As a straightforward alternative, we also investigated representations which account directly for pairwise distances, triple-wise angles, and quad-wise dihedral angles through manually generated bins in histograms. The resulting representations, in increasing interatomic many-body order, are called HD (histogram of distances), HDA (histogram of distances and angles), and HDAD (histogram of distances, angles and dihedral angles). For any given molecule, one iterates through each atom ai, producing a set of distance, angle and dihedral angle features for ai.

Distance features were produced by measuring the distance between ai and aj (for i ≠ j) for each element pair. The distance features were assigned a label incorporating the atomic symbols of ai and aj sorted alphabetically (with H last), e.g. if ai was a carbon atom and aj was a nitrogen atom, the distance feature for the atom pair would be labeled C-N. These labels will be used to group all features with the same label into a histogram and allow us to only count each pair of atoms once.

Angle features were produced by taking the principal angles formed by the two vectors spanning from each atom ai to every subset of 2 of its 3 nearest atoms, aj and ak. The angle features were labeled by the element type of ai, followed by the alphabetically sorted element types of aj and ak (except for hydrogens, which were listed last). The example where ai is a carbon atom, aj a hydrogen atom, and ak a nitrogen atom would be assigned the label C-N-H.

Dihedral angle features were produced by taking the principal angles between two planes. We take ai as the origin, and for each of the four nearest neighbors in turn, labeling the neighbor atom aj, and forming a vector Vij = ai → aj. Then all (3 choose 2) subsets of the remaining three out of four nearest neighbors of ai are chosen, and labeled as ak and al. This third and fourth atom respectively form two triangular faces when

paired with Vij: ⟨ak, ai, aj⟩ and ⟨al, ai, aj⟩. The dihedral angle between the two triangular faces was calculated. These dihedral angle features were labeled with the atomic symbol for ai, followed by the atomic symbols for aj, ak and al, sorted alphabetically, with the exception that hydrogens were listed last, e.g. C-C-N-H.

The features from all molecules have been aggregated for each label type to generate a histogram for each label type. Fig. 1 exemplifies this for C-N distances, C-C-C angles, and C-C-C-O dihedrals for the entire QM9 data set. Certain typical molecular characteristics can be recognized upon mere inspection. For example, the C-N histogram displays a strong and isolated peak between 1.1 and 1.5 Å, corresponding to occurrences of single, double, and triple bonds. For distances above 2 Å, peaks at typical radii of second and third covalent bonding shells around N can be recognized at 2.6 Å and 3.9 Å. Also C-C-C angles can be easily interpreted: The peaks close to zero and π rad correspond to geometries where three atoms are part of a linear (alkyne, or nitrile) type of motif. The broad and largest peak corresponds to 120 and 109 degrees, typically observed in sp2 and sp3 hybridized atoms.

The morphology of each histogram has then been examined to identify apparent peaks and troughs, motivated by the idea that peaks indicate structural commonalities among molecules. Bin centers have been placed at each significant local minimum and maximum (shown as vertical lines in Fig. 1). Values at 15-25 bin centers have been chosen as a representation for each label type. All bin center values are provided in the Supplementary Material. For each molecule, the collection of features has subsequently been rendered into a fixed-size representation, producing one vector component for each bin center, within each label type. This has been accomplished following a two-step process. (i) Binning and interpolation: Each feature value is projected on the two nearest bins, with the relative amount projected on each bin following from linear interpolation between the two bins. For example, a feature with value 1.7, which lies between two bins placed at 1.0 and 2.0 respectively, contributes 0.3 and 0.7 to the first and second bin, respectively. (ii) Reduction: The collection of contributions within each bin of each molecule's feature vector is condensed to a single value by summing all contributions.

2.4.6 Molecular Graphs

We have investigated several neural network models which are based on molecular graphs (MG) as representation. The inputs are real-valued vectors associated with each atom and with each pair of atoms. More specifically, we have used the featurization described in Kearnes et al.27 with the removal of partial charge and the addition of Euclidean distances to the pair feature vectors. All elements of the feature vector are described in Tables 1 and 2.

The featurization process was unsuccessful for a small number of molecules (367) because of conversion failures from geometry to rational SMILES string when using OpenBabel45 or RDKit;46 these molecules were excluded from all results using the molecular graph features.

Table 1: Atom features for the MG representation: values provided for each atom in the molecule.

Feature           Description
Atom type         H, C, N, O, F (one-hot).
Chirality         R or S (one-hot or null).
Formal charge     Integer electronic charge.
Ring sizes        For each ring size (3-8), the number of rings that include this atom.
Hybridization     sp, sp2, or sp3 (one-hot or null).
Hydrogen bonding  Whether this atom is a hydrogen bond donor and/or acceptor (binary values).
Aromaticity       Whether this atom is part of an aromatic system.

Note that within a previous draft of this study,47 we reported biased results for GC/GG models due to use of Mulliken partial charges within the MG representation. All MG results presented herein have been obtained without any Mulliken charges in the representation. Model hyperparameters for the GC

logarithmic grid for the 10 percent training set (from 0.25 to 8192 for the Gaussian kernel and 0.1 to 16384 for the Laplacian kernel). In order to simplify the width screening, prior to learning all feature vectors were normalized (scaling the input vector by the mean norm across the training set) by the Euclidean norm for the Gaussian kernel and the Manhattan norm for the Laplacian kernel.

2.5.2 Bayesian Ridge Regression

We use BR54 as implemented in scikit-learn.53 BR is a linear model with an L2 penalty on the coefficients. Unlike ridge regression, where the strength of that penalty is a regularization hyperparameter which must be set, in Bayesian ridge regression the optimal regularizer is estimated from the data.

2.5.3 Elastic Net

EN55 is also a linear model. Unlike BR, the penalty on the weights is a mix of L1 and L2 terms. In addition to the regularization hyperparameter for the weight penalty, elastic net has an additional hyperparameter l1_ratio to control the relative strength of the L1 and L2 weight penalties. We used the scikit-learn53 implementation and set l1_ratio = 0.5. We then did a hyperparameter search on the regularization parameter in a base-10 logarithmic grid from 1e-6 to 1.0.

2.5.4 Random Forest

We use RF56 as implemented in scikit-learn.53 RF regressors produce a value by averaging many individual decision trees fitted on randomly resampled sets of the training data. Each node in each decision tree is a threshold of one input feature. Early experiments did not reveal strong differences in performance based on the number of trees used, once a minimal number was reached. We have used 120 trees for all regressions.

2.5.5 Graph Convolutions

We have used the GC model as described in Kearnes et al.27, with several structural modifications and optimized hyperparameters. The graph convolution model is built on the concepts of "atom" layers (one real vector associated with each atom) and "pair" layers (one real vector associated with each pair of atoms). The graph convolution architecture defines operations to transform atom and pair layers to new atom and pair layers. There are three structural changes to the model used herein when compared to the one described in Kearnes et al.27. We describe these briefly here, with details in the Supplementary Material. First, we have removed the "pair order invariance" property by simplifying the (A → P) transformation. Since the model only uses the atom layer for the molecule level features, pair order invariance is not needed. Second, we have used the Euclidean distance between atoms. In the (P → A) transformation, we divide the value from the convolution step by a series of distance exponentials. If the original convolution for an atom pair (a, b) with distance d produces a vector V, we concatenate the vectors V, V/d, V/d², V/d³, and V/d⁶ to produce the transformed value for the pair (a, b). Third, we have followed other work on neural networks based on chemical graphs57 which uses a sum of softmax operations to convert a real valued vector to a sparse vector and to sum those sparse vectors across all the atoms. We use the same operation here, along with a simple sum across the atoms, to produce molecule level features from the top atom layer. We have found that this works as well as or better than the Gaussian histograms first used in GC.27 To optimize the network, we have searched the hyperparameter space using Gaussian Process Bandit Optimization58 as implemented by HyperTune.59 The hyperparameter search has been based on the evaluation of the validation set for a single fold of the data. Further details, including parameters and search ranges chosen for this paper, are listed in the Supplementary Material.
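The distance-based modification of the (P → A) transformation described above can be sketched in a few lines. This is an illustration, not the paper's implementation; the function name and array layout are our own assumptions, and only the concatenation [V, V/d, V/d², V/d³, V/d⁶] is taken from the text.

```python
import numpy as np

def expand_pair_features(V, d):
    """Concatenate a convolved pair vector V for an atom pair at distance d
    with copies damped by powers of the distance, as in the modified
    (P -> A) transformation: [V, V/d, V/d**2, V/d**3, V/d**6]."""
    V = np.asarray(V, dtype=float)
    return np.concatenate([V, V / d, V / d**2, V / d**3, V / d**6])

# A pair vector of length 4 becomes a vector of length 20; e.g. for d = 2
# the V/d block holds 0.5, the V/d**6 block holds 1/64.
out = expand_pair_features(np.ones(4), d=2.0)
```

The damped copies let subsequent layers weight short-range pair contributions more strongly than long-range ones without learning the distance dependence from scratch.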

2.5.6 Gated Graph Neural Networks

We have used the gated graph neural network (GG) model as described in Li et al.28. Similar to the GC model, it is a deep neural network whose input is a set of node features {x_v, v ∈ G}, and an adjacency matrix A with entries in a discrete set S = {0, 1, ..., k} to indicate different edge types. It has internal hidden representations for each node in the graph, h_v^t, of dimension d, which it updates for T steps of computation. Its output is invariant to all graph isomorphisms, meaning the order of the nodes presented to the model does not matter. To include the most relevant distance information we distinguish four different covalent bonding types (single, double, triple, aromatic). All remaining atom pairs are binned by their interatomic distance [in Å] into 10 bins: [0,2], [2,2.5], [2.5,3], [3,3.5], [3.5,4], [4,4.5], [4.5,5], [5,5.5], [5.5,6], and [6,∞]. Using these bins, the adjacency matrix has entries in an alphabet of size 14 (k = 14), indicating bond type for covalently bonded atoms, and distance bin for all other atoms. We have trained the GG model on each target property individually. Further technical details are specified in the Supplementary Material.

3 Results and discussion

3.1 Overview

We present an overview of the most relevant numerical results in Table 3. It contains the test errors for all combinations of regressors, representations, and properties for models trained on ∼118k molecules. The best models for the respective properties are U0 (HDAD/KRR), εHOMO (MG/GC), εLUMO (MG/GC), ∆ε (MG/GC), µ (MG/GC), α (MG/GG), ZPVE (HDAD/KRR), Cv (HDAD/KRR), and ω1 (BAML/RF). We do not show results for the other three energies, U(T = 298K), H(T = 298K), and G(T = 298K), since identical observations as for U0 can be made.

Overall, NN and KRR regressors perform well for most properties. The ML out-of-sample errors outperform DFT accuracy at B3LYP level of theory as well as chemical accuracy, both defined alongside in Table 3, for U0 (HDAD/KRR and MG/GG), µ (GC), Cv (HDAD/KRR), and ω1 (BAML/KRR, MG/GC, HDAD/KRR, BOB/KRR, HD/KRR and MG/GG). For the remaining properties (εHOMO, εLUMO, ∆ε, α, and ZPVE) the best models come within a factor of 2 of target accuracy, while all of them, except εHOMO, εLUMO and ∆ε (for which we do not have reliable reference data), outperform DFT accuracy.

In Fig. 2 out-of-sample errors as a function of training set size (learning curves) are shown for all properties and representations with the best corresponding regressor. It is important to note that all models on display systematically improve with training set size, exhibiting the typical linearly decaying behavior on a log-log plot.11,61 Errors for most models shown decay with roughly the same slopes, indicating similar exponents in the power law of error decay. Notable exceptions, i.e. property models with considerably steeper learning curves (slopes and offsets of all learning curves can be found in Tables S4 and S5 in the Supplementary Material), are MG/GC for µ, MG/GG and HDAD/KRR for α, CM/KRR and BOB/KRR for ⟨R²⟩, HDAD/KRR and MG/GG for U0, and MG/GG for ω1. These results suggest that the specified representations capture particularly well the effective dimensionality of the corresponding property in chemical space.

3.2 Regressors

Inspection of Table 3 indicates that the regressors can roughly be ordered by performance, independent of property and representation: GC>GG>KRR>RF>BR>EN. It is noteworthy how EN, BR, and RF regressors perform substantially worse than GC/GG/KRR. The bad performance of EN and BR is due to their low model capacities. This can also be seen from the learning curves of all regressors presented in Figures S1 to S6 of the Supplementary Material. The performance of BR and EN improves only slightly with increased training set size and even gets worse for some property/representation combinations. These
60 level of theory and reach chemical (target) two regressors also exhibit very similar learn-
ACS Paragon Plus Environment
9
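The edge labelling described in Section 2.5.6 (four covalent bond types plus ten distance bins, giving a 14-letter alphabet) can be sketched in a few lines. This is our illustration, not the authors' code; the particular integer assignment of labels is an arbitrary choice.

```python
# Sketch of the 14-letter edge alphabet used by the GG model: covalently
# bonded atom pairs keep their bond type, all other pairs are labelled by a
# binned interatomic distance.
BOND_TYPES = {"single": 1, "double": 2, "triple": 3, "aromatic": 4}

# Upper bin edges in Angstrom for non-bonded pairs:
# [0,2], [2,2.5], ..., [5.5,6], and the open-ended [6,inf) bin.
DIST_EDGES = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]

def edge_label(bond_type, distance):
    """Return an integer edge label in {1, ..., 14} for one atom pair."""
    if bond_type is not None:
        return BOND_TYPES[bond_type]
    # Non-bonded pair: distance bin, offset past the 4 bond types.
    for i, upper in enumerate(DIST_EDGES):
        if distance <= upper:
            return 5 + i
    return 14  # the open-ended [6, inf) bin
```

A value of 0 could then mark "no edge" when assembling the full adjacency matrix A with entries in S = {0, 1, ..., 14}.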
Table 3: MAE on out-of-sample data for all combinations of representation, regressor, and property at ∼118k (90%) training set size. Regressors include linear regression with elastic net regularization (EN), Bayesian ridge regression (BR), random forest (RF), kernel ridge regression (KRR), and molecular-graph based neural networks (GG/GC). The best combination for each property is marked with an asterisk (*). Additionally, the table contains the mean MAE of the representations for each property and regressor, and the mean MAE normalized by the MAD (see Table 4) (NMMAE) over all properties for each regressor/representation combination.

             U0       εHOMO    εLUMO    ∆ε       µ        α        ZPVE      Cv        ω1       NMMAE
             eV       eV       eV       eV       Debye    Bohr³    eV        cal/molK  cm⁻¹     arb. u.
EN    CM     0.911    0.338    0.631    0.722    0.844    1.33     0.0265    0.906     131.0    0.423
      BOB    0.602    0.283    0.521    0.614    0.763    1.2      0.0232    0.7       81.4     0.35
      BAML   0.212    0.186    0.275    0.339    0.686    0.793    0.0129    0.439     60.4     0.231
      ECFP4  3.68     0.224    0.344    0.383    0.737    3.45     0.27      1.51      86.6     0.462
      HDAD   0.0983   0.139    0.238    0.278    0.563    0.437    0.00647   0.0876    94.2     0.183
      HD     0.192    0.203    0.299    0.36     0.705    0.638    0.00949   0.195     104.0    0.236
      MARAD  0.183    0.222    0.305    0.391    0.707    0.698    0.00808   0.206     108.0    0.256
      Mean   0.84     0.228    0.373    0.441    0.715    1.22     0.0509    0.578     95.1
BR    CM     0.911    0.338    0.632    0.723    0.844    1.33     0.0265    0.907     131.0    0.424
      BOB    0.586    0.279    0.521    0.614    0.761    1.14     0.0222    0.684     80.9     0.343
      BAML   0.202    0.183    0.275    0.339    0.685    0.785    0.0129    0.444     60.4     0.229
      ECFP4  3.69     0.224    0.344    0.383    0.737    3.45     0.27      1.51      86.7     0.462
      HDAD   0.0614   0.14     0.238    0.278    0.565    0.43     0.00318   0.0787    94.8     0.182
      HD     0.171    0.203    0.298    0.359    0.705    0.633    0.00693   0.19      104.0    0.235
      MARAD  0.171    0.184    0.257    0.315    0.647    0.533    0.00854   0.201     103.0    0.226
      Mean   0.828    0.221    0.367    0.43     0.706    1.19     0.05      0.574     94.5
RF    CM     0.431    0.208    0.302    0.373    0.608    1.04     0.0199    0.777     13.2     0.239
      BOB    0.202    0.12     0.137    0.164    0.45     0.623    0.0111    0.443     3.55     0.142
      BAML   0.2      0.107    0.118    0.141    0.434    0.638    0.0132    0.451     2.71*    0.141
      ECFP4  3.66     0.143    0.145    0.166    0.483    3.7      0.242     1.57      14.7     0.349
      HDAD   1.44     0.116    0.136    0.156    0.454    1.71     0.0525    0.895     3.45     0.198
      HD     1.39     0.126    0.139    0.15     0.457    1.66     0.0497    0.879     4.18     0.197
      MARAD  0.21     0.178    0.243    0.311    0.607    0.676    0.0102    0.311     19.4     0.199
      Mean   1.08     0.142    0.174    0.209    0.499    1.43     0.0569    0.761     8.74
KRR   CM     0.128    0.133    0.183    0.229    0.449    0.433    0.0048    0.118     33.5     0.136
      BOB    0.0667   0.0948   0.122    0.148    0.423    0.298    0.00364   0.0917    13.2     0.0981
      BAML   0.0519   0.0946   0.121    0.152    0.46     0.301    0.00331   0.082     19.9     0.105
      ECFP4  4.25     0.124    0.133    0.174    0.49     4.17     0.248     1.84      26.7     0.383
      HDAD   0.0251*  0.0662   0.0842   0.107    0.334    0.175    0.00191*  0.0441*   23.1     0.0768
      HD     0.0644   0.0874   0.113    0.143    0.364    0.299    0.00316   0.0844    21.3     0.0935
      MARAD  0.0529   0.103    0.124    0.163    0.468    0.343    0.00301   0.0758    21.3     0.112
      Mean   0.662    0.101    0.126    0.159    0.427    0.859    0.0383    0.333     22.7
GG    MG     0.0421   0.0567   0.0628   0.0877   0.247    0.161*   0.00431   0.0837    6.22     0.0602
GC    MG     0.15     0.0549*  0.062*   0.0869*  0.101*   0.232    0.00966   0.097     4.76     0.0494
BR performs only slightly better than EN for most combinations. The only clear exception to this rule is for ZPVE and U0 together with HDAD, where BR performs significantly better than EN. Also, BR and EN errors rapidly converge to a constant w.r.t. training set size for all representations and properties, except for HDAD, which is the only representation showing a noteworthy improvement with increasing training set size for some properties. The constant learning curves are not surprising as (a) the number of free regression parameters in BR and EN is relatively small and does not grow with training set size, and (b) the underlying model is a linear combination with small flexibility. This behavior implies error convergence already for relatively small training sets.

Table 4: Mean and mean absolute deviation (MAD) for all properties in the QM9 data set, as well as target MAE and DFT (at the B3LYP level of theory) MAE relative to experiment for each property, and the number of molecules used to estimate the values (in parentheses in the DFT row). The target accuracies are taken from Ref. 13. The target accuracy for atomization energies and orbital energies was set to 1 kcal/mol, which is generally accepted as (or close to) chemical accuracy within the chemistry community. Target accuracies used for µ and α are 0.1 D and 0.1 Bohr³, respectively, which is within the error of CCSD relative to experiment. 30 Target accuracies used for ω1 and ZPVE are 10 cm⁻¹, which is slightly larger than the CCSD(T) error for predicting frequencies. 60 The target accuracy used for Cv was not explained in Ref. 13. Section 2.3 discusses how the errors for DFT were obtained.

         U0        εHOMO   εLUMO   ∆ε      µ         α        ZPVE        Cv        ω1
         eV        eV      eV      eV      Debye     Bohr³    eV          cal/molK  cm⁻¹
Mean     −76.6     −6.54   0.322   6.86    2.67      75.3     4.06        31.6      3500
MAD      8.19      0.439   1.05    1.07    1.17      6.29     0.717       3.21      238
Target   0.043     0.043   0.043   0.043   0.10      0.10     0.0012      0.050     10
DFT      0.10(69)  NA      NA      NA      0.10(49)  0.4(49)  0.0097(41)  0.34(16)  28(41)

RF performs poorly compared to GC, GG, and KRR for all properties except ω1, the highest lying fundamental vibrational frequency in each molecule. For this property RF yields an astounding performance, with out-of-sample errors as small as single-digit cm⁻¹; B3LYP achieves a mean absolute error of only 28 cm⁻¹ with respect to experiment. 32 The distribution of ω1 (Fig. 1 of Ref. 13) suggests a simple reason for this: there are three distinct peaks, which correspond to typical C-H, N-H, and O-H stretch vibrations, in increasing order. Therefore the principal learning task for this property is to detect if there is an OH group, and if not, if there is an NH group. If neither group is present, CH will yield the vibration with the highest frequency. As such, this is essentially about classifying which bonds are present in the molecule. RF works by fitting a decision tree to the target property. Each branch in the tree is based on an inequality of one entry in the representation. RF should therefore be able to identify which bonds are present in a molecule simply by looking at the entries in each element-pair and/or triplet bin of the representations. For RF, a fractional importance can be assigned to each input feature (the sum of all importances is 1.0). Analyzing the importance of the bins in HDAD for the RF model reveals that the three bins with highest importance are O-H placed at 0.961 Å, N-H placed at 1.01 Å, and C-C-H at 3.138 radians, with importances of 0.587, 0.305, and 0.064, respectively. These first three bins constitute ∼96% of the prediction of ω1, and the distances of the O-H and N-H bins are very similar to O-H and N-H bond lengths. C-C-H is placed at ∼π radians, which means that it has to correspond to a linear C-C-H (alkyne) chain, implying that the two carbons are bonded by a triple bond (typically the C-H with the lowest pKa and the highest C-H stretch vibration).

KRR performs remarkably well on average. For extensive energetic properties it yields the lowest overall errors in combination with HDAD and BOB, respectively. Its outstanding performance is not surprising in light of the multiple previous ML studies dealing with compositional as well as configurational spaces. The neural network flavors GC and GG, however, yield better performance on average, and the lowest errors for all electronic (mostly intensive) properties, i.e. dipole moment, polarizability, and HOMO/LUMO eigenvalues and gaps. A possible explanation for this property-dependent difference in performance between KRR and NN could be the inherently additive and multiplicative natures, respectively, of these regressors. The energy being extensive, it is consistent with this picture that effective, quasi-particle based linear KRR estimates have recently been reported to deliver very accurate
and HD can even yield slightly lower errors than HDAD. In our opinion, this is due to the additional bins of angles and dihedrals adding noise rather than signal. By contrast, the separation of distances, angles, and dihedral angles into different bins is not a problem for the KRR methods, because the kernels used are purely distance based. This makes it possible for KRR to exploit the extra three- and four-body information in HDAD and to gain an advantage over HD. We note, however, that the remarkable performance of HDAD is possible despite its striking simplicity. As illustrated in Fig. 1 and discussed above, characteristic chemical behavior can be directly obtained by human inspection of HDAD. As such, HDAD corresponds to a representation very much in the "Occam's razor" style. Unfortunately, due to its discrete nature and its origin in sorting distances, HDAD will suffer from a lack of differentiability, which might limit its applicability when modeling forces or other non-equilibrium properties.

MARAD, containing similar information as HDA, performs similarly to BAML; yet MARAD requires no prior knowledge about the physics encoded in the universal force field, such as electronic hybridization states, bond order, or functional potential shapes (Morse, Lennard-Jones, harmonic angular potentials, or sinusoidal dihedrals). BOB and CM, previously state of the art, result in relatively poor performance. ECFP4 produces out-of-sample errors on par with or slightly better than CM/KRR for intensive properties (µ, HOMO/LUMO eigenvalues and gap); however, it produces errors that are off the chart for all extensive properties (α, ZPVE, U0, and Cv).

4 Conclusions

We have benchmarked many combinations of regressors and representations on the same QM9 data set consisting of ∼131k organic molecules. For all properties, the best ML model prediction errors reach the accuracy of DFT at the B3LYP level with respect to experiment. For 7 out of 12 distinct properties (atomization energies, heat capacity, ω1, µ) out-of-sample errors reach levels on par with chemical accuracy, or better, using a training set size of ∼118k molecules (90% of the QM9 data set). For the remaining properties α, εHOMO, εLUMO, ∆ε, and ZPVE, errors of the best models come within a factor of 2 of chemical accuracy.

Regressors EN, BR, and RF lead to rather high out-of-sample errors, while KRR and graph-based NN regressors compete for the lowest errors. We have found that GC, GG, and KRR show the best performance across all properties, except for the highest vibrational frequency, for which RF performs best. There is no single representation and regressor combination that works best for all properties (though forthcoming work with further improvements to the GG based models indicates best-in-class performance across all properties 63). For intensive electronic properties (µ, HOMO/LUMO eigenvalues and gap) we have found MG/NN to yield the highest predictive power, while HDAD/KRR corresponds to the most accurate model for extensive properties (α, ZPVE, U0, and Cv). The latter point is remarkable considering the simplicity of KRR, being just a linear expansion of the property in the training set, and of HDAD, being just histograms of distances, angles, and dihedrals. Using BR and EN is not recommended if accuracy is of importance; both regressors perform worse across all properties. Apart from predicting the highest fundamental vibrational frequency best, RF based models deliver rather unsatisfactory performance. The ECFP4 based models have shown poor general performance in comparison to all other representations studied; ECFP4 is not recommended for investigations of molecular properties.

We should caution the reader that all our results refer to equilibrium structures of a set of only ∼131k organic molecules. While ∼131k molecules might seem sufficiently large to be representative, this number is dwarfed in comparison to chemical space, i.e. the space populated by all theoretically stable molecules, estimated to exceed 10⁶⁰ for medium-sized organic molecules. 64 Furthermore, ML models for predicting properties of molecules in non-equilibrium or strained configurations might require substantially more training data.
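The simplicity noted above, HDAD being essentially histograms of distances, angles, and dihedrals, can be conveyed in a short sketch. The code below builds only the distance-histogram (HD-style) part; it is our illustration, not the authors' implementation, and the bin width, cutoff, and the idealized water-like geometry are arbitrary choices.

```python
import itertools
import math
from collections import defaultdict

def hd_histograms(symbols, coords, bin_width=0.1, r_max=6.0):
    """Toy HD-style descriptor: one interatomic-distance histogram per
    (sorted) element pair. HDAD would additionally add angle and dihedral
    histograms built in the same spirit."""
    nbins = int(r_max / bin_width)
    hists = defaultdict(lambda: [0] * nbins)
    for i, j in itertools.combinations(range(len(symbols)), 2):
        r = math.dist(coords[i], coords[j])
        if r < r_max:
            pair = tuple(sorted((symbols[i], symbols[j])))
            hists[pair][int(r / bin_width)] += 1
    return dict(hists)

# A hypothetical, idealized water-like geometry (O-H bond lengths ~0.96 A):
h = hd_histograms(["O", "H", "H"],
                  [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)])
```

Both O-H distances fall in the same ~0.96 Å bin, which is exactly the kind of bin the RF importance analysis in Section 3.2 singles out for predicting ω1.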
This point is also of relevance because some of the highly accurate models described herein (MG based) currently use bond-based graph connectivity in addition to distance, raising questions about their applicability to reactive processes.

In summary, for the organic molecules studied, we have collected numerical evidence which suggests that the out-of-sample error of ML is consistently better than the estimated accuracy of DFT at the B3LYP level. While this is no guarantee that ML models would reach the same error levels if more accurate, explicitly electron-correlated or experimental reference data were used, previous studies indicate that similar performance can be expected when using higher levels of theory. 8 More specifically, one might intuitively expect that going beyond hybrid DFT to higher-quality reference data (either wavefunction-based QM or experiment) would represent a more challenging learning problem, and therefore imply the need for larger training set sizes. Results in Ref. 8, however, suggest that ML models can predict the differences between HF and MP2, CCSD, and CCSD(T) equally well using the same training set.

As such, we conclude that future reference data sets for training state-of-the-art machine learning models of molecular properties should preferably use reference levels of theory which go beyond DFT at the B3LYP level of accuracy. While it seems unlikely that for each class of molecules hundreds of thousands of experimental training data points will become available in the foreseeable future, it might well be possible to reach such a scale using efficient implementations of explicitly electron-correlated methods within high-performance computing campaigns. Finally, we note that future work could deal with improving representations and regressors, with the goal of reaching similar predictive power using less data.

Acknowledgement The authors thank Dirk Bakowies for helpful comments, and Adrian Roitberg for pointing out an issue with the use of partial charges in the neural net models in an earlier version of this paper. O.A.v.L. acknowledges support from the Swiss National Science Foundation (No. PP00P2_138932, 310030_160067), the research fund of the University of Basel, and from Google. This material is based upon work supported by the Air Force Office of Scientific Research, Air Force Material Command, USAF, under Award No. FA9550-15-1-0026. This research was partly supported by the NCCR MARVEL, funded by the Swiss National Science Foundation. Some calculations were performed at sciCORE (http://scicore.unibas.ch/), the scientific computing core facility at the University of Basel.

5 SI

Supplementary information regarding raw data, the MARAD representation, graph convolutions, gated graphs, random forests, and learning curves is reported, as well as root-mean-square errors for ML predictions after training on the largest training set.
References

(1) Hohenberg, P.; Kohn, W. Inhomogeneous Electron Gas. Phys. Rev. 1964, 136, B864.

(2) Kohn, W.; Sham, L. J. Self-Consistent Equations Including Exchange and Correlation Effects. Phys. Rev. 1965, 140, A1133.

(3) Burke, K. Perspective on density functional theory. J. Chem. Phys. 2012, 136, 150901.

(4) Koch, W.; Holthausen, M. C. A Chemist's Guide to Density Functional Theory; Wiley-VCH, 2002.

(5) Cohen, A. J.; Mori-Sánchez, P.; Yang, W. Challenges for Density Functional Theory. Chem. Rev. 2012, 112, 289–320.

(6) Plata, R. E.; Singleton, D. A. A Case Study of the Mechanism of Alcohol-Mediated Morita Baylis-Hillman Reactions. The Importance of Experimental Observations. J. Am. Chem. Soc. 2015, 137, 3811–3826.

(7) Medvedev, M. G.; Bushmarinov, I. S.; Sun, J.; Perdew, J. P.; Lyssenko, K. A. Density functional theory is straying from the path toward the exact functional. Science 2017, 355, 49–52.

(8) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Big Data Meets Quantum Chemistry Approximations: The ∆-Machine Learning Approach. J. Chem. Theory Comput. 2015, 11, 2087–2096.

(9) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 2014, 1, 140022.

(10) Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864–2875.

(11) Huang, B.; von Lilienfeld, O. A. Communication: Understanding molecular representations in machine learning: The role of uniqueness and target similarity. J. Chem. Phys. 2016, 145, 161102.

(12) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K.-R.; Tkatchenko, A. Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space. J. Phys. Chem. Lett. 2015, 6, 2326–2331.

(13) Ramakrishnan, R.; von Lilienfeld, O. A. Many Molecular Properties from One Kernel in Chemical Space. Chimia 2015, 69, 182–186.

(14) Rupp, M.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Phys. Rev. Lett. 2012, 108, 058301.

(15) Barker, J.; Bulin, J.; Hamaekers, J.; Mathias, S. Localized Coulomb Descriptors for the Gaussian Approximation Potential. arXiv preprint arXiv:1611.05126, 2016.

(16) von Lilienfeld, O. A.; Ramakrishnan, R.; Rupp, M.; Knoll, A. Fourier series of atomic radial distribution functions: A molecular fingerprint for machine learning models of quantum chemical properties. Int. J. Quantum Chem. 2015, 115, 1084–1093.

(17) Huan, T. D.; Mannodi-Kanakkithodi, A.; Ramprasad, R. Accelerated materials property predictions and design using motif-based fingerprints. Phys. Rev. B 2015, 92, 014106.

(18) Bartók, A. P.; Kondor, R.; Csányi, G. On representing chemical environments. Phys. Rev. B 2013, 87, 184115.

(19) De, S.; Bartók, A. P.; Csányi, G.; Ceriotti, M. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 2016, 18, 13754–13769.
(20) Collins, C. R.; Gordon, G. J.; von Lilienfeld, O. A.; Yaron, D. J. Constant Size Molecular Descriptors For Use With Machine Learning. arXiv preprint arXiv:1701.06649, 2016.

(21) Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: An extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci. 2017, 8, 3192–3203.

(22) Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 2017, 8, 13890.

(23) Dral, P. O.; von Lilienfeld, O. A.; Thiel, W. Machine Learning of Parameters for Accurate Semiempirical Quantum Chemical Calculations. J. Chem. Theory Comput. 2015, 11, 2120–2125.

(24) Weber, W.; Thiel, W. Orthogonalization corrections for semiempirical methods. Theor. Chem. Acc. 2000, 103, 495–506.

(25) Dral, P. O.; Wu, X.; Spörkel, L.; Koslowski, A.; Thiel, W. Semiempirical Quantum-Chemical Orthogonalization-Corrected Methods: Benchmarks for Ground-State Properties. J. Chem. Theory Comput. 2016, 12, 1097–1120.

(26) Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; von Lilienfeld, O. A.; Tkatchenko, A.; Müller, K.-R. Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies. J. Chem. Theory Comput. 2013, 9, 3404–3419.

(27) Kearnes, S.; McCloskey, K.; Berndl, M.; Pande, V.; Riley, P. Molecular Graph Convolutions: Moving Beyond Fingerprints. J. Comput. Aided Mol. Des. 2016, 30, 595–608.

(28) Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated Graph Sequence Neural Networks. ICLR 2016.

(29) Curtiss, L. A.; Raghavachari, K.; Redfern, P. C.; Pople, J. A. Assessment of Gaussian-2 and density functional theories for the computation of enthalpies of formation. J. Chem. Phys. 1997, 106, 1063–1079.

(30) Hickey, A. L.; Rowley, C. N. Benchmarking quantum chemical methods for the calculation of molecular dipole moments and polarizabilities. J. Phys. Chem. A 2014, 118, 3678–3687.

(31) Stowasser, R.; Hoffmann, R. What do the Kohn-Sham orbitals and eigenvalues mean? J. Am. Chem. Soc. 1999, 121, 3414.

(32) Sinha, P.; Boesch, S. E.; Gu, C.; Wheeler, R. A.; Wilson, A. K. Harmonic Vibrational Frequencies: Scaling Factors for HF, B3LYP, and MP2 Methods in Combination with Correlation Consistent Basis Sets. J. Phys. Chem. A 2004, 108, 9213–9217.

(33) Geary, R. C. The Ratio of the Mean Deviation to the Standard Deviation as a Test of Normality. Biometrika 1935, 27, 310–332.

(34) DeTar, D. F. Calculation of Entropy and Heat Capacity of Organic Compounds in the Gas Phase. Evaluation of a Consistent Method without Adjustable Parameters. Applications to Hydrocarbons. J. Phys. Chem. A 2007, 111, 4464–4477.

(35) Huo, H.; Rupp, M. Unified Representation for Machine Learning of Molecules and Crystals. arXiv preprint arXiv:1704.06439, 2017.

(36) Bartók, A. P.; De, S.; Poelking, C.; Bernstein, N.; Kermode, J.; Csányi, G.; Ceriotti, M. Machine Learning Unifies the Modelling of Materials and Molecules. arXiv preprint arXiv:1706.00179, 2017.
(37) Rappe, A. K.; Casewit, C. J.; Colwell, K. S.; Goddard III, W. A.; Skiff, W. M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J. Am. Chem. Soc. 1992, 114, 10024–10035.

(38) Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.

(39) Besnard, J.; Ruda, G. F.; Setola, V.; Abecassis, K.; Rodriguiz, R. M.; Huang, X.-P.; Norval, S.; Sassano, M. F.; Shin, A. I.; Webster, L. A.; Simeons, F. R. C.; Stojanovski, L.; Prat, A.; Seidah, N. G.; Constam, D. B.; Bickerton, G. R.; Read, K. D.; Wetsel, W. C.; Gilbert, I. H.; Roth, B. L.; Hopkins, A. L. Automated design of ligands to polypharmacological profiles. Nature 2012, 492, 215–220.

(40) Lounkine, E.; Keiser, M. J.; Whitebread, S.; Mikhailov, D.; Hamon, J.; Jenkins, J. L.; Lavan, P.; Weber, E.; Doak, A. K.; Côté, S.; Shoichet, B. K.; Urban, L. Large-scale prediction and testing of drug activity on side-effect targets. Nature 2012, 486, 361–367.

(41) Huigens III, R. W.; Morrison, K. C.; Hicklin, R. W.; Flood Jr, T. A.; Richter, M. F.; Hergenrother, P. J. A Ring Distortion Strategy to Construct Stereochemically Complex and Structurally Diverse Compounds from Natural Products. Nature Chemistry 2013, 5, 195.

(42) Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, 2009.

(43) Faulon, J.-L.; Visco, Jr., D. P.; Pophale, R. S. The Signature Molecular Descriptor. 1. Using Extended Valence Sequences in QSAR and QSPR Studies. J. Chem. Inf. Comp. Sci. 2003, 43, 707.

(44) Visco, J.; Pophale, R. S.; Rintoul, M. D.; Faulon, J. L. Developing a methodology for an inverse quantitative structure activity relationship using the signature molecular descriptor. J. Mol. Graph. Model. 2002, 20, 429–438.

(45) O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An open chemical toolbox. J. Cheminform. 2011, 3, 1–14.

(46) Landrum, G. RDKit: Open-source cheminformatics. http://www.rdkit.org 2014, 3, 2012.

(47) Faber, F. A.; Hutchison, L.; Huang, B.; Gilmer, J.; Schoenholz, S. S.; Dahl, G. E.; Vinyals, O.; Kearnes, S.; Riley, P. F.; von Lilienfeld, O. A. Fast machine learning models of electronic and energetic properties consistently reach approximation errors better than DFT accuracy. arXiv preprint arXiv:1702.05532, 2017.

(48) Müller, K.-R.; Mika, S.; Rätsch, G.; Tsuda, K.; Schölkopf, B. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks 2001, 12, 181–201.

(49) Schölkopf, B.; Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press, 2002.

(50) Vovk, V. In Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik; Schölkopf, B., Luo, Z., Vovk, V., Eds.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2013; pp 105–116.

(51) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, 2011.

(52) Hoerl, A. E.; Kennard, R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 2000, 80.
(53) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.

(54) Neal, R. M. Bayesian Learning for Neural Networks; Springer-Verlag New York, Inc.: Secaucus, NJ, USA, 1996.

(55) Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol. 2005, 67, 301–320.

(56) Breiman, L. Random forests. Machine Learning 2001, 45, 5–32.

(57) Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R. P. Convolutional Networks on Graphs for Learning Molecular Fingerprints. Advances in Neural Information Processing Systems. 2015; pp 2215–2223.

(58) Desautels, T.; Krause, A.; Burdick, J. W. Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. J. Mach. Learn. Res. 2014, 15, 4053–4103.

(59) Google HyperTune. https://cloud.google.com/ml/ (accessed 2016).

(60) Tew, D. P.; Klopper, W.; Heckert, M.; Gauss, J. Basis Set Limit CCSD(T) Harmonic Vibrational Frequencies. J. Phys. Chem. A 2007, 111, 11242–11248.

(61) Müller, K.-R.; Finke, M.; Murata, N.; Schulten, K.; Amari, S. A numerical study on learning curves in stochastic multilayer feedforward networks. Neural Comput. 1996, 8, 1085–1106.

(62) Huang, B.; von Lilienfeld, O. A. The "DNA" of chemistry: Scalable quantum machine learning with "amons". arXiv preprint arXiv:1707.04146, 2017.

(63) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; Dahl, G. E. Neural Message Passing for Quantum Chemistry. Proceedings of the 34th International Conference on Machine Learning, ICML 2017. 2017.

(64) Kirkpatrick, P.; Ellis, C. Chemical space. Nature 2004, 432, 823.
Figure 3: TOC graphic.