
Supporting Information

Holistic Prediction of the pKa in Diverse Solvents Based on a Machine-Learning Approach**
Qi Yang, Yao Li, Jin-Dong Yang, Yidi Liu, Long Zhang,* Sanzhong Luo,* and Jin-Pei Cheng


Table of Contents

pKa dataset
Molecular descriptors
Machine learning methods
The results and analysis


Details of pKa dataset

All experimental pKa values were collected from the iBonD database (http://ibond.nankai.edu.cn) and filtered. Valid pKa values
were selected according to the following rules:
1) pKa values lying outside the solvent’s pKa window were selected with care. Some calculated or deduced data were excluded;
2) When several pKa values for the same molecule in the same solvent were reported in different sources, we checked the raw
data in the original source. If the pKa values were reasonable and varied by no more than 2 pKa units, their average was
retained; if they varied by more than 2 pKa units and it was impossible to tell which value was more reliable, the strongly
deviating data were eliminated;
3) During training, outliers that were predicted poorly (e.g., a difference between the experimental and predicted values
greater than 5 pKa units) were double-checked to make sure that all pKa values were reliable and free of typos. A few hundred
pKa values were identified and corrected during the whole process.
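
Rule 2 can be illustrated with a minimal sketch. It assumes that "variation" means the difference between the largest and smallest reported value, and that an entry whose reports disagree by more than 2 pKa units is dropped entirely; the molecule names and values below are made up for illustration and are not taken from the dataset.

```python
from statistics import mean

MAX_SPREAD = 2.0  # pKa units, the threshold stated in rule 2


def consolidate(values, max_spread=MAX_SPREAD):
    """Merge several reported pKa values for one (molecule, solvent) pair.

    Returns the average when the reports agree within max_spread,
    otherwise None (assumption: the whole entry is discarded).
    """
    if not values:
        return None
    spread = max(values) - min(values)
    if spread <= max_spread:
        return mean(values)
    return None


# Hypothetical reports, keyed by (molecule, solvent)
reports = {
    ("mol_A", "DMSO"): [18.0, 18.2, 17.9],  # spread 0.3 -> averaged
    ("mol_B", "MeCN"): [20.1, 23.5],        # spread 3.4 -> dropped
}
curated = {key: consolidate(vals) for key, vals in reports.items()}
```

In practice the authors also inspected the original sources by hand, which a purely numerical filter such as this one cannot replace.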

Details of molecular descriptors

All molecular descriptors were computed with RDKit (http://www.rdkit.org/), an open-source cheminformatics toolkit. Because
descriptor calculation is very fast, it is well suited to online descriptor computation and pKa prediction, which is a major
advantage over quantum chemical property calculations (those methods need time-consuming DFT or ab initio calculations).

Table S1. The list of molecular descriptors used in this study

Molecular descriptors Description


Estate EState fingerprint of the molecule, based on the concept in Hall and Kier, JCICS 35, 1039-1045 (1995). Estate contains the
number of times each possible atom type is hit.
EstateValue Similar to Estate, except that EstateValue contains the sum of the EState indices for the atoms of each type.
MACC The MACCS keys are the most widely known and used structural fingerprints, based on molecular substructure features.
There are two flavors: 166 bit and 320 bit. In RDKit, the bit length is 167 to maintain consistency with other software
packages, which number the keys from 1.
MD A variety of molecular descriptors are available within RDKit. For more details about the RDKit molecular descriptors,
please see https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors.
Morgan2_512 The Morgan fingerprint belongs to the family of circular fingerprints and is similar to the well-known ECFP and FCFP
fingerprints. The algorithm is described in Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints,
JCIM 50:742-54 (2010). For Morgan2_512, the circular radius is 2 and the bit length is 512.
Morgan2_1024 The Morgan fingerprint radius is 2, and the bit length is 1024.
Morgan2_2048 The Morgan fingerprint radius is 2, and the bit length is 2048.
Morgan3_512 The Morgan fingerprint radius is 3, and the bit length is 512.
Morgan3_1024 The Morgan fingerprint radius is 3, and the bit length is 1024.
Morgan3_2048 The Morgan fingerprint radius is 3, and the bit length is 2048.
MDEstate Combination of MD and Estate fingerprint.
MDEstateValue Combination of MD and EstateValue fingerprint.
MDMACC Combination of MD and MACC fingerprint.
MDMorgan2_512 Combination of MD and Morgan2_512 fingerprint.
MDMorgan2_1024 Combination of MD and Morgan2_1024 fingerprint.
MDMorgan2_2048 Combination of MD and Morgan2_2048 fingerprint.
MDMorgan3_512 Combination of MD and Morgan3_512 fingerprint.
MDMorgan3_1024 Combination of MD and Morgan3_1024 fingerprint.
MDMorgan3_2048 Combination of MD and Morgan3_2048 fingerprint.
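
The combined descriptors in Table S1 can be read as a simple concatenation of the MD values with the fingerprint bits. The sketch below shows that idea under stated assumptions: the function name `combine`, the 1-based bit naming, and the example values are hypothetical, and the real descriptor values would come from RDKit rather than from hand-written dictionaries.

```python
def combine(md, fingerprint, fp_prefix):
    """Concatenate named molecular descriptors with fingerprint bits.

    md          -- dict mapping descriptor name to value
    fingerprint -- sequence of 0/1 bits
    fp_prefix   -- prefix for the generated bit names (numbered from 1)
    """
    names = list(md) + [f"{fp_prefix}{i}" for i in range(1, len(fingerprint) + 1)]
    values = [float(v) for v in md.values()] + [float(b) for b in fingerprint]
    return names, values


# Hypothetical inputs: two descriptor values and a 4-bit toy fingerprint
md_example = {"MolLogP": 1.46, "TPSA": 20.23}
fp_example = [0, 1, 1, 0]

feature_names, feature_vector = combine(md_example, fp_example, "MACCS_fp")
```

Keeping the names alongside the values makes it straightforward to map model feature importances back to individual descriptors or bits.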

The feature importance

Table S2. The feature rank.

Entry Feature name[a,b] Contribution (%) Feature type Entry Feature name Contribution (%) Feature type
1 MinPartialCharge 0.025938947 POC 206 fr_Al_COO 0.0005002 S
2 MaxPartialCharge 0.02533156 POC 207 fr_phenol_noOrthoHbond 0.0005002 S
3 MinAbsEStateIndex 0.02486709 POC 208 MACCS_fp153 0.0005002 FP

4 qed 0.020393873 S 209 fr_nitro_arom 0.000493054 S
5 MinEStateIndex 0.019514948 POC 210 MACCS_fp106 0.000493054 FP
6 MinAbsPartialCharge 0.0194292 POC 211 MACCS_fp75 0.000485909 FP
7 MaxEStateIndex 0.019264849 POC 212 MACCS_fp82 0.000485909 FP
8 Kappa3 0.019136226 S 213 MACCS_fp101 0.000485909 FP
9 BalabanJ 0.01887898 S 214 fr_ArN 0.000478763 S
10 MaxAbsPartialCharge 0.01882896 POC 215 NumAromaticCarbocycles 0.000471617 S
11 FpDensityMorgan3 0.018035786 S 216 MACCS_fp160 0.000471617 FP
12 MolLogP 0.017142572 POC 217 MACCS_fp63 0.000464472 FP
13 PEOE_VSA7 0.016270794 POC 218 MACCS_fp93 0.000464472 FP
14 PEOE_VSA9 0.015963528 POC 219 fr_Nhpyrrole 0.000457326 S
15 FpDensityMorgan2 0.014913108 S 220 SMR_VSA2 0.000457326 POC
16 SMR_VSA10 0.014913108 POC 221 solvent7 0.000457326 L
17 PEOE_VSA8 0.014670154 POC 222 MACCS_fp76 0.000457326 FP
18 SlogP_VSA2 0.014520094 POC 223 MACCS_fp103 0.000457326 FP
19 VSA_EState9 0.014262848 POC 224 MACCS_fp165 0.000457326 FP
20 TPSA 0.013491111 POC 225 fr_Al_OH_noTert 0.00045018 S
21 Kappa2 0.012869434 S 226 MACCS_fp51 0.00045018 FP
22 FpDensityMorgan1 0.012676499 S 227 MACCS_fp39 0.000443034 FP
23 EState_VSA8 0.011197336 POC 228 MACCS_fp113 0.000443034 FP
24 EState_VSA4 0.011140171 POC 229 fr_halogen 0.000428743 S
25 EState_VSA5 0.010868633 POC 230 fr_phenol 0.000428743 S
26 Chi4v 0.010697136 S 231 MACCS_fp89 0.000428743 FP
27 Chi2v 0.01038987 S 232 MACCS_fp109 0.000428743 FP
28 HallKierAlpha 0.010382725 S 233 MACCS_fp111 0.000421597 FP
29 Chi3v 0.010311267 S 234 MACCS_fp115 0.000414452 FP
30 SMR_VSA1 0.010296976 POC 235 MACCS_fp117 0.000414452 FP
31 EState_VSA6 0.010232665 POC 236 MACCS_fp135 0.000414452 FP
32 PEOE_VSA6 0.010168353 POC 237 MACCS_fp146 0.000414452 FP
33 Chi4n 0.01013977 S 238 fr_benzene 0.000407306 S
34 EState_VSA2 0.010004002 POC 239 MACCS_fp30 0.000407306 FP
35 PEOE_VSA10 0.00988967 POC 240 MACCS_fp110 0.000407306 FP
36 BertzCT 0.009696736 S 241 RingCount 0.00040016 S
37 SMR_VSA7 0.009482364 POC 242 MACCS_fp158 0.00040016 FP
38 Chi3n 0.00943949 S 243 MACCS_fp119 0.000393014 FP
39 EState_VSA3 0.009410908 POC 244 MACCS_fp122 0.000393014 FP
40 Chi2n 0.009396615 S 245 MACCS_fp147 0.000385869 FP
41 SlogP_VSA1 0.009346596 POC 246 fr_ketone_Topliss 0.000378723 S
42 FractionCSP3 0.008782084 S 247 MACCS_fp74 0.000378723 FP
43 SlogP_VSA5 0.00873921 POC 248 MACCS_fp116 0.000378723 FP
44 solvent25 0.008467672 L 249 MACCS_fp65 0.000364432 FP
45 SMR_VSA5 0.008460527 POC 250 fr_ketone 0.000357286 S
46 PEOE_VSA1 0.008403362 POC 251 solvent37 0.00035014 L
47 SlogP_VSA6 0.00823901 POC 252 MACCS_fp142 0.000342994 FP
48 EState_VSA7 0.008046076 POC 253 solvent30 0.000335849 L
49 PEOE_VSA11 0.007896015 POC 254 MACCS_fp164 0.000335849 FP
50 Kappa1 0.007574458 S 255 MACCS_fp25 0.000321557 FP
51 solvent12 0.007517293 L 256 fr_amide 0.000314411 S
52 Chi1v 0.007452981 S 257 MACCS_fp48 0.000314411 FP
53 MolMR 0.007374378 POC 258 MACCS_fp60 0.000314411 FP
54 MolWt 0.007310067 S 259 MACCS_fp72 0.000314411 FP

55 MaxAbsEStateIndex 0.007167153 POC 260 fr_methoxy 0.000307266 S
56 EState_VSA9 0.006945635 POC 261 fr_nitro_arom_nonortho 0.000307266 S
57 Chi0v 0.006459727 S 262 MACCS_fp41 0.00030012 FP
58 Chi1n 0.006459727 S 263 fr_tetrazole 0.000292974 S
59 PEOE_VSA3 0.006373978 POC 264 MACCS_fp69 0.000292974 FP
60 HeavyAtomMolWt 0.00628823 S 265 MACCS_fp87 0.000292974 FP
61 SlogP_VSA3 0.006223918 POC 266 solvent9 0.000285829 L
62 Chi0n 0.006181044 S 267 fr_aldehyde 0.000271537 S
63 PEOE_VSA2 0.006123878 POC 268 fr_nitroso 0.000271537 S
64 SMR_VSA6 0.00588807 POC 269 MACCS_fp49 0.000271537 FP
65 EState_VSA1 0.00578803 POC 270 NumAromaticRings 0.000264391 S
66 PEOE_VSA12 0.005716573 POC 271 NumSaturatedRings 0.000264391 S
67 PEOE_VSA13 0.004487509 POC 272 fr_sulfone 0.000264391 S
68 NHOHCount 0.004473218 S 273 solvent21 0.000264391 L
69 LabuteASA 0.004466072 POC 274 MACCS_fp67 0.000264391 FP
70 SlogP_VSA4 0.004223118 POC 275 MACCS_fp139 0.000264391 FP
71 SMR_VSA3 0.004194535 POC 276 SlogP_VSA7 0.000242954 POC
72 Chi1 0.004065912 S 277 MACCS_fp141 0.000242954 FP
73 VSA_EState8 0.004065912 POC 278 solvent5 0.000235809 L
74 EState_VSA10 0.003958726 POC 279 MACCS_fp26 0.000235809 FP
75 PEOE_VSA14 0.003930144 POC 280 solvent22 0.000228663 L
76 VSA_EState10 0.003780084 POC 281 MACCS_fp64 0.000228663 FP
77 SMR_VSA9 0.003722918 POC 282 solvent4 0.000221517 L
78 molecular_state2 0.003708626 L 283 HeavyAtomCount 0.000214371 S
79 NumHeteroatoms 0.003694335 S 284 fr_nitrile 0.000214371 S
80 ExactMolWt 0.003622878 S 285 MACCS_fp126 0.000214371 FP
81 solvent28 0.003558566 L 286 MACCS_fp129 0.000214371 FP
82 solvent2 0.003351341 L 287 MACCS_fp58 0.000207226 FP
83 SlogP_VSA10 0.003322758 POC 288 MACCS_fp130 0.000207226 FP
84 PEOE_VSA4 0.003315612 POC 289 fr_N_O 0.00020008 S
85 NumHDonors 0.003008346 S 290 MACCS_fp144 0.00020008 FP
86 NumHAcceptors 0.002965472 S 291 MACCS_fp148 0.00020008 FP
87 NumRotatableBonds 0.002879723 S 292 fr_alkyl_halide 0.000192934 S
88 Chi0 0.002786829 S 293 MACCS_fp35 0.000192934 FP
89 SMR_VSA4 0.002743955 POC 294 fr_Imine 0.000178643 S
90 SlogP_VSA12 0.00270108 POC 295 MACCS_fp38 0.000178643 FP
91 PEOE_VSA5 0.002636769 POC 296 MACCS_fp46 0.000178643 FP
92 SlogP_VSA8 0.002193735 POC 297 MACCS_fp57 0.000178643 FP
93 solvent38 0.002186589 L 298 NumSaturatedHeterocycles 0.000171497 S
94 Ipc 0.001993655 S 299 MACCS_fp33 0.000164351 FP
95 NOCount 0.001915052 S 300 MACCS_fp104 0.000164351 FP
96 solvent20 0.001886469 L 301 fr_nitro 0.000157206 S
97 NumValenceElectrons 0.001829303 S 302 solvent6 0.000157206 L
98 molecular_state3 0.00180072 L 303 MACCS_fp29 0.000157206 FP
99 MACCS_fp145 0.001757846 FP 304 MACCS_fp163 0.000157206 FP
100 molecular_state1 0.001743555 L 305 fr_amidine 0.00015006 S
101 MACCS_fp70 0.001722117 FP 306 fr_sulfide 0.00015006 S
102 MACCS_fp140 0.001714972 FP 307 MACCS_fp12 0.000135769 FP
103 solvent10 0.001693535 L 308 MACCS_fp22 0.000135769 FP
104 fr_NH0 0.001686389 S 309 fr_priamide 0.000128623 S

105 MACCS_fp132 0.001672097 FP 310 fr_guanido 0.000121477 S
106 fr_para_hydroxylation 0.00160064 S 311 MACCS_fp40 0.000114331 FP
107 solvent3 0.00160064 L 312 fr_oxime 0.00010004 S
108 MACCS_fp50 0.001522037 FP 313 solvent14 0.00010004 L
109 solvent39 0.001507746 L 314 MACCS_fp166 0.00010004 FP
110 solvent1 0.001493455 L 315 fr_C_S 9.29E-05 S
111 solvent8 0.001436289 L 316 fr_morpholine 9.29E-05 S
112 MACCS_fp98 0.001407706 FP 317 MACCS_fp23 9.29E-05 FP
113 MACCS_fp151 0.001371977 FP 318 fr_Ndealkylation1 8.57E-05 S
114 MACCS_fp152 0.001364832 FP 319 fr_sulfonamd 8.57E-05 S
115 fr_NH1 0.00135054 S 320 MACCS_fp14 8.57E-05 FP
116 solvent35 0.001329103 L 321 MACCS_fp37 8.57E-05 FP
117 MACCS_fp161 0.00130052 FP 322 fr_azo 7.86E-05 S
118 MACCS_fp97 0.001257646 FP 323 MACCS_fp34 7.15E-05 FP
119 MACCS_fp90 0.001236209 FP 324 MACCS_fp43 7.15E-05 FP
120 fr_C_O_noCOO 0.001229063 S 325 fr_hdrzine 6.43E-05 S
121 fr_aniline 0.001207626 S 326 solvent13 6.43E-05 L
122 MACCS_fp54 0.001207626 FP 327 MACCS_fp20 6.43E-05 FP
123 MACCS_fp96 0.001186189 FP 328 MACCS_fp52 6.43E-05 FP
124 solvent34 0.001171897 L 329 fr_unbrch_alkane 5.72E-05 S
125 solvent33 0.001136169 L 330 MACCS_fp16 5.72E-05 FP
126 MACCS_fp81 0.001107586 FP 331 MACCS_fp21 5.72E-05 FP
127 MACCS_fp85 0.001086149 FP 332 NumSaturatedCarbocycles 5.00E-05 S
128 MACCS_fp79 0.001036129 FP 333 MACCS_fp56 5.00E-05 FP
129 MACCS_fp80 0.001028983 FP 334 MACCS_fp59 5.00E-05 FP
130 fr_SH 0.0010004 S 335 MACCS_fp24 4.29E-05 FP
131 MACCS_fp143 0.000993255 FP 336 fr_piperdine 3.57E-05 S
132 MACCS_fp94 0.000971817 FP 337 solvent19 2.86E-05 L
133 MACCS_fp155 0.000964672 FP 338 MACCS_fp47 2.86E-05 FP
134 MACCS_fp156 0.00095038 FP 339 fr_hdrzone 2.14E-05 S
135 NumAliphaticRings 0.000936089 S 340 fr_thiazole 2.14E-05 S
136 MACCS_fp84 0.000936089 FP 341 MACCS_fp68 2.14E-05 FP
137 fr_C_O 0.000921797 S 342 fr_Ndealkylation2 1.43E-05 S
138 fr_aryl_methyl 0.000914652 S 343 fr_furan 1.43E-05 S
139 fr_pyridine 0.000914652 S 344 fr_imide 1.43E-05 S
140 MACCS_fp105 0.000914652 FP 345 fr_piperzine 1.43E-05 S
141 SlogP_VSA11 0.000893214 POC 346 fr_urea 1.43E-05 S
142 fr_Ar_NH 0.000878923 S 347 MACCS_fp28 1.43E-05 FP
143 MACCS_fp99 0.000878923 FP 348 MACCS_fp32 1.43E-05 FP
144 MACCS_fp91 0.000871777 FP 349 fr_lactone 7.15E-06 S
145 MACCS_fp133 0.000871777 FP 350 fr_phos_acid 7.15E-06 S
146 fr_COO 0.00085034 S 351 fr_term_acetylene 7.15E-06 S
147 MACCS_fp157 0.000836049 FP 352 MACCS_fp31 7.15E-06 FP
148 MACCS_fp136 0.000828903 FP 353 MACCS_fp61 7.15E-06 FP
149 fr_Ar_OH 0.000821757 S 354 MACCS_fp62 7.15E-06 FP
150 MACCS_fp53 0.000814612 FP 355 NumRadicalElectrons 0 S
151 fr_bicyclic 0.000807466 S 356 fr_HOCCN 0 S
152 MACCS_fp159 0.000807466 FP 357 fr_alkyl_carbamate 0 S
153 MACCS_fp42 0.000793174 FP 358 fr_azide 0 S
154 MACCS_fp55 0.000793174 FP 359 fr_barbitur 0 S

155 MACCS_fp162 0.000793174 FP 360 fr_benzodiazepine 0 S
156 MACCS_fp150 0.000786029 FP 361 fr_diazo 0 S
157 MACCS_fp78 0.000771737 FP 362 fr_dihydropyridine 0 S
158 MACCS_fp131 0.000771737 FP 363 fr_epoxide 0 S
159 MACCS_fp44 0.000764592 FP 364 fr_isocyan 0 S
160 MACCS_fp125 0.000757446 FP 365 fr_isothiocyan 0 S
161 MACCS_fp128 0.000736009 FP 366 fr_lactam 0 S
162 MACCS_fp134 0.00070028 FP 367 fr_oxazole 0 S
163 MACCS_fp83 0.000693134 FP 368 fr_phos_ester 0 S
164 fr_quatN 0.000685989 S 369 fr_prisulfonamd 0 S
165 MACCS_fp95 0.000685989 FP 370 fr_thiocyan 0 S
166 fr_COO2 0.000678843 S 371 fr_thiophene 0 S
167 MACCS_fp123 0.000671697 FP 372 SMR_VSA8 0 POC
168 NumAliphaticCarbocycles 0.000664552 S 373 SlogP_VSA9 0 POC
169 fr_Al_OH 0.000664552 S 374 EState_VSA11 0 POC
170 MACCS_fp100 0.000664552 FP 375 VSA_EState1 0 POC
171 solvent36 0.000657406 L 376 VSA_EState2 0 POC
172 MACCS_fp71 0.000657406 FP 377 VSA_EState3 0 POC
173 MACCS_fp73 0.000643114 FP 378 VSA_EState4 0 POC
174 MACCS_fp149 0.000643114 FP 379 VSA_EState5 0 POC
175 MACCS_fp154 0.000643114 FP 380 VSA_EState6 0 POC
176 solvent27 0.000635969 L 381 VSA_EState7 0 POC
177 fr_Ar_N 0.000628823 S 382 solvent11 0 L
178 fr_ester 0.000628823 S 383 solvent15 0 L
179 MACCS_fp107 0.000628823 FP 384 solvent16 0 L
180 MACCS_fp118 0.000621677 FP 385 solvent18 0 L
181 MACCS_fp108 0.000614532 FP 386 solvent23 0 L
182 MACCS_fp137 0.000614532 FP 387 solvent29 0 L
183 solvent17 0.000607386 L 388 solvent31 0 L
184 MACCS_fp102 0.00060024 FP 389 solvent32 0 L
185 solvent26 0.000585949 L 390 MACCS_fp1 0 FP
186 fr_allylic_oxid 0.000578803 S 391 MACCS_fp2 0 FP
187 MACCS_fp120 0.000578803 FP 392 MACCS_fp3 0 FP
188 fr_imidazole 0.000564512 S 393 MACCS_fp4 0 FP
189 MACCS_fp124 0.000557366 FP 394 MACCS_fp5 0 FP
190 fr_Ar_COO 0.00055022 S 395 MACCS_fp6 0 FP
191 fr_NH2 0.00055022 S 396 MACCS_fp7 0 FP
192 NumAliphaticHeterocycles 0.000543074 S 397 MACCS_fp8 0 FP
193 NumAromaticHeterocycles 0.000543074 S 398 MACCS_fp9 0 FP
194 MACCS_fp121 0.000543074 FP 399 MACCS_fp10 0 FP
195 MACCS_fp127 0.000543074 FP 400 MACCS_fp11 0 FP
196 MACCS_fp66 0.000528783 FP 401 MACCS_fp13 0 FP
197 MACCS_fp77 0.000528783 FP 402 MACCS_fp15 0 FP
198 MACCS_fp92 0.000528783 FP 403 MACCS_fp17 0 FP
199 MACCS_fp112 0.000528783 FP 404 MACCS_fp18 0 FP
200 fr_ether 0.000521637 S 405 MACCS_fp19 0 FP
201 MACCS_fp88 0.000521637 FP 406 MACCS_fp27 0 FP
202 MACCS_fp138 0.000521637 FP 407 MACCS_fp36 0 FP
203 solvent24 0.000514492 L 408 MACCS_fp45 0 FP
204 MACCS_fp86 0.000514492 FP 409 MACCS_fp167 0 FP
205 MACCS_fp114 0.000514492 FP

[a] The chemical meaning of each molecular descriptor can be found at https://datagrok.ai/help/domains/chem/descriptors
[b] POC: physical organic chemistry parameters; S: structure-related parameters; FP: MACCS fingerprints; L: ionic state and solvent
labels.
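
To see how much each feature type contributes overall, the per-feature contributions in Table S2 can be summed by type. The sketch below does this for a handful of rows copied from the table; it is only an illustration of the bookkeeping, and the totals it produces cover this excerpt, not the full table.

```python
from collections import defaultdict

# (feature name, contribution, feature type) for a few rows of Table S2
ROWS = [
    ("MinPartialCharge", 0.025938947, "POC"),
    ("MaxPartialCharge", 0.02533156, "POC"),
    ("MinAbsEStateIndex", 0.02486709, "POC"),
    ("qed", 0.020393873, "S"),
    ("solvent25", 0.008467672, "L"),
    ("MACCS_fp145", 0.001757846, "FP"),
]


def contribution_by_type(rows):
    """Sum the contribution column for each feature type."""
    totals = defaultdict(float)
    for _name, contribution, feature_type in rows:
        totals[feature_type] += contribution
    return dict(totals)


totals = contribution_by_type(ROWS)
for feature_type, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{feature_type}: {total:.6f}")
```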

Details of machine learning methods

Machine learning methods, including random forest (RF), support vector machine (SVM), decision tree regression (DTR), Gaussian
process regression (GPR), K-nearest neighbor (KNN), AdaBoost, gradient boosting, bagging regression, and extra tree regression,
were built in Python using the scikit-learn package, an open-source tool for data analysis and machine learning
(https://github.com/scikit-learn/scikit-learn). The XGBoost model was built with the xgboost package (https://github.com/dmlc/xgboost).
Because of the large amount of computation involved, only a restricted parameter space was searched and optimized. 5-fold
cross-validation was applied to most of the methods. For some methods, the parameter combinations we tried gave no significant
performance improvement, so the default values were used during training and validation. For more details about the parameter
settings, please see https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
and https://xgboost.readthedocs.io/en/latest/parameter.html.
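
As an illustration of the 5-fold cross-validation mentioned above, the following standard-library sketch splits sample indices into five folds and yields one train/test partition per fold. The function name, the shuffling, and the seed are assumptions made for this example; the source does not state how the folds were drawn, and in practice scikit-learn's own cross-validation utilities would normally be used.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: shuffled folds
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(23, k=5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training, {len(test)} test samples")
```

Each sample appears in exactly one test fold, so every data point is used for validation once and for training in the remaining folds.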

Table S3. The list of parameters used in different machine learning methods.

Method Parameters
Decision tree regression (DTR) sklearn.tree.DecisionTreeRegressor(max_depth = 20)

Random forest (RF) sklearn.ensemble.RandomForestRegressor(n_estimators = 500, criterion = 'mse', oob_score = True, bootstrap = True,
max_features = 'auto', n_jobs = -1)
Support vector machine (SVM) sklearn.svm.SVR(kernel = 'rbf', C = 1e4, gamma = 0.01)
Gaussian process regression (GPR) kernel = 1.0 * RBF(length_scale = 1) + WhiteKernel(noise_level = 1)
sklearn.gaussian_process.GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 0, normalize_y = True)
K-nearest neighbor (KNN) sklearn.neighbors.KNeighborsRegressor(n_neighbors = 8, weights = 'distance')
AdaBoost sklearn.ensemble.AdaBoostRegressor(DecisionTreeRegressor(max_depth = 100), n_estimators = 100, random_state = 42,
learning_rate = 0.01)
Gradient boosting sklearn.ensemble.GradientBoostingRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 8, loss = 'ls',
random_state = 42 )
Bagging regression sklearn.ensemble.BaggingRegressor()
Extra tree regression sklearn.tree.ExtraTreeRegressor()
XGBoost xgboost.XGBRegressor(learning_rate = 0.04, n_estimators = 1200, max_depth = 10, min_child_weight = 7, subsample = 0.8,
colsample_bytree = 0.6, gamma = 0.1, reg_alpha = 2, reg_lambda = 0.5, verbosity = 2, n_jobs = -1, seed = 42)

Details of neural networks

As described by Figure 1, a fully connected neural network with 3 hidden layers was built to map the molecular descriptors
to the target pKa value. The model was trained using the SGD optimizer with an initial learning rate of 0.01 and a momentum of 0.5
applied every epoch. A restricted set of model hyperparameters was searched and optimized (e.g., the number of hidden layers,
the number of neurons per layer, the activation function, the dropout percentage, etc.) to find relatively good results.
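
The optimizer settings above (learning rate 0.01, momentum 0.5) can be illustrated with a minimal sketch of one SGD-with-momentum update. It assumes the common convention v ← μ·v + g, w ← w − lr·v; the source does not say which deep learning framework or momentum convention was used, and the toy objective below is purely illustrative.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.5):
    """One SGD update with momentum on plain Python lists.

    Assumed convention: v <- momentum * v + g, then w <- w - lr * v.
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, v = [0.0], [0.0]
first_w, _ = sgd_momentum_step(w, [2 * (w[0] - 3.0)], v)
for _ in range(2000):
    gradient = [2 * (w[0] - 3.0)]
    w, v = sgd_momentum_step(w, gradient, v)
print(f"w after training: {w[0]:.6f}")
```

In a real network the same update is applied to every weight tensor, with the gradients supplied by backpropagation.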


Figure S1. Details of the neural network.

The Results and Analysis

Table S4. The screening results of descriptors with XGBoost algorithm.

Entry Descriptor Train_RMSE Train_R2 Test_RMSE Test_R2


1 Morgan2_512 1.134 0.966 2.357 0.849
2 Morgan3_512 1.002 0.973 2.472 0.834
3 Morgan2_1024 1.307 0.995 2.327 0.853
4 Morgan3_1024 1.188 0.962 2.375 0.846
5 Morgan2_2048 1.449 0.944 2.266 0.86
6 Morgan3_2048 1.333 0.953 2.33 0.852
7 Estate 1.734 0.92 2.739 0.796
8 EstateValue 1.178 0.963 2.766 0.792
9 MACC 0.877 0.98 1.908 0.901
10 POCc 0.326 0.997 1.824 0.909
11 Estate_POC 0.325 0.997 1.818 0.91
12 EstateValue_POC 0.328 0.997 1.82 0.91
13 MACC_POC 0.323 0.997 1.698 0.922
14 Morgan2_512_POC 0.34 0.997 1.771 0.915
15 Morgan2_1024_POC 0.359 0.997 1.777 0.914
16 Morgan2_2048_POC 0.394 0.996 1.748 0.917
17 Morgan3_512_POC 0.33 0.997 1.768 0.915
18 Morgan3_1024_POC 0.357 0.997 1.771 0.915
19 Morgan3_2048_POC 0.386 0.996 1.735 0.918
20 MACC_POC_ISL[a] 0.298 0.998 1.504 0.939

[a] ISL: Ionic Status Label.
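
For reference, the RMSE and R2 columns can be computed as sketched below. This assumes that R2 denotes the coefficient of determination, 1 − SS_res/SS_tot; the source does not define it explicitly, and the sample values are made up.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical experimental and predicted pKa values
y_exp = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 3.0, 3.5]
print(f"RMSE = {rmse(y_exp, y_hat):.3f}, R2 = {r2(y_exp, y_hat):.3f}")
```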

Table S5. The 5-fold CV statistics for the pKa prediction models based on the whole dataset.

ML Methods CV_RMSE CV_MAE CV_R2


Tree Regression 2.66 (0.11)[a] 1.44 (0.04) 0.81 (0.02)
Random Forest 1.80 (0.08) 1.04 (0.03) 0.91 (<0.01)
Gaussian Process 1.59 (0.05) 0.92 (0.03) 0.93 (<0.01)
Gradient Boosting 1.73 (0.05) 1.10 (0.02) 0.92 (<0.01)
SVM 2.41 (0.09) 1.33 (0.04) 0.84 (0.01)
KNN 3.26 (0.03) 1.97 (0.02) 0.71 (<0.01)
Ada Boost 1.73 (0.06) 0.94 (0.02) 0.92 (<0.01)
Bagging Tree 1.95 (0.07) 1.15 (0.03) 0.90 (<0.01)
Extra Tree 2.46 (0.05) 1.31 (0.02) 0.84 (<0.01)

DNN 1.41 (0.07) 0.87 (0.02) 0.95 (<0.01)
XGBoost 1.50 (0.06) 0.88 (0.02) 0.94 (<0.01)

[a] Values in brackets are standard deviations.
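
Entries of the form "mean (standard deviation)" in Table S5 can be produced from per-fold scores as in the sketch below. The fold scores are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the source does not specify which was reported.

```python
from statistics import mean, stdev


def summarise(fold_scores):
    """Format per-fold scores as 'mean (std)' with two decimals.

    Standard deviations that would round to 0.00 are shown as '<0.01'.
    """
    average = mean(fold_scores)
    spread = stdev(fold_scores)  # assumption: sample standard deviation
    spread_text = "<0.01" if spread < 0.005 else f"{spread:.2f}"
    return f"{average:.2f} ({spread_text})"


# Hypothetical per-fold results for one model
rmse_per_fold = [1.42, 1.55, 1.50, 1.47, 1.56]
r2_per_fold = [0.94, 0.94, 0.95, 0.94, 0.94]

rmse_summary = summarise(rmse_per_fold)
r2_summary = summarise(r2_per_fold)
print(rmse_summary, r2_summary)
```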

Table S6. Statistical analysis of SSM and HM-6S

Solvent Train_RMSE Train_MAE Train_R2 Test_RMSE Test_MAE Test_R2

H2O 0.264 0.145 0.996 1.508 0.892 0.881


DMSO 0.239 0.155 0.999 2.718 1.767 0.878
EtOH/H2O 0.197 0.136 0.993 0.856 0.491 0.856
MeCN 0.255 0.170 0.999 2.710 1.671 0.866
MeOH 0.217 0.155 0.996 1.722 0.861 0.771
DMF 0.238 0.172 0.997 1.918 1.162 0.793
HM-6S 0.273 0.166 0.998 1.481 0.887 0.944

Table S7. Statistical analysis of HM-6S categorized by solvent

Solvent HM-6S_test_RMSE HM-6S_test_RAE HM-6S_test_R2

H2O 1.358 0.836 0.903


DMSO 2.345 1.533 0.909
EtOH/H2O 0.800 0.451 0.872
MeCN 2.009 1.282 0.926
MeOH 1.033 0.617 0.919
DMF 1.066 0.701 0.937

Table S8. The pKa correlation between different solvents

Entry Solvent pair (X-Y)[a] Predicted [b] R2 Experimental [c] R2


1 H2O-DMSO y = 1.00x + 3.52 0.66 y = 1.10x + 3.62 0.63
2 H2O-MeCN y = 0.96x + 10.64 0.54 y = 0.97x + 11.50 0.45
3 DMSO-MeCN y = 0.92x + 7.68 0.74 y = 1.07x + 9.08 0.82
4 H2O-EtOH/H2O y = 0.79x + 2.06 0.90 y = 0.82x + 1.33 0.84
5 EtOH/H2O-MeOH y = 1.20x + 0.86 0.94 y = 1.11x + 2.57 0.76
6 H2O-MeOH y = 0.94x + 3.37 0.84 y = 0.85x + 4.36 0.69
7 H2O-DMF y = 0.95x + 3.94 0.69 y = 0.83x + 7.15 0.45
8 DMSO-EtOH/H2O y = 0.61x + 0.99 0.82 y = 0.52x + 1.51 0.41
9 DMSO-MeOH y = 0.79x + 1.46 0.90 y = 0.65x + 2.87 0.76
10 DMSO-DMF y = 0.89x + 1.13 0.93 y = 0.96x + 1.46 0.98
11 MeCN-EtOH/H2O y = 0.50x – 1.38 0.62 y = 0.44x – 2.59 0.37
12 MeCN-MeOH y = 0.67x – 2.01 0.74 y = 0.56x – 1.52 0.77
13 MeCN-DMF y = 0.78x – 3.22 0.81 y = 0.40x + 4.06 0.35
14 EtOH/H2O-DMF y = 1.25x + 1.17 0.82 y = 2.68x – 2.92 0.63
15 MeOH-DMF y = 1.07x – 0.04 0.94 y = 0.95x + 2.27 0.61

[a] A linear correlation of the pKa values was performed for each solvent pair X-Y, with the pKa value in solvent X as the independent
variable and the pKa value in solvent Y as the dependent variable. [b] The correlation analysis was based on the pKa values predicted
with HM-XGBoost; 15338 pairs of pKa data were used in total. [c] The correlation analysis was based on the experimental pKa values
available in the pKa dataset. The number of data points varied between solvent pairs.
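
The correlations in Table S8 are ordinary least-squares lines of the form y = ax + b. A minimal sketch of such a fit is given below; the function name and the paired pKa values are hypothetical and chosen so that the fit is exact, and R2 is taken here as the squared Pearson correlation, which is an assumption about how the table values were obtained.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared), where r_squared is the
    squared Pearson correlation coefficient.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical pKa values of four compounds in solvent X and solvent Y
pka_x = [0.0, 1.0, 2.0, 3.0]
pka_y = [3.5, 4.5, 5.5, 6.5]

slope, intercept, r_squared = linear_fit(pka_x, pka_y)
print(f"y = {slope:.2f}x + {intercept:.2f}, R2 = {r_squared:.2f}")
```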


Figure S2. pKa Correlation between H2O and DMSO (Left: Pred. Data; Right: Exp. Data)

Figure S3. pKa Correlation between H2O and MeCN (Left: Pred. Data; Right: Exp. Data)

Figure S4. pKa Correlation between DMSO and MeCN (Left: Pred. Data; Right: Exp. Data)


Figure S5. pKa Correlation between H2O and EtOH:H2O (1:1) (Left: Pred. Data; Right: Exp. Data)

Figure S6. pKa Correlation between EtOH:H2O (1:1) and MeOH (Left: Pred. Data; Right: Exp. Data)

Figure S7. pKa Correlation between H2O and MeOH (Left: Pred. Data; Right: Exp. Data)


Figure S8. pKa Correlation between H2O and DMF (Left: Pred. Data; Right: Exp. Data)

Figure S9. pKa Correlation between DMSO and EtOH:H2O (1:1) (Left: Pred. Data; Right: Exp. Data)

Figure S10. pKa Correlation between DMSO and MeOH (Left: Pred. Data; Right: Exp. Data)


Figure S11. pKa Correlation between DMSO and DMF (Left: Pred. Data; Right: Exp. Data)

Figure S12. pKa Correlation between MeCN and EtOH:H2O (1:1) (Left: Pred. Data; Right: Exp. Data)

Figure S13. pKa Correlation between MeCN and MeOH (Left: Pred. Data; Right: Exp. Data)


Figure S14. pKa Correlation between MeCN and DMF (Left: Pred. Data; Right: Exp. Data)

Figure S15. pKa Correlation between EtOH:H2O (1:1) and DMF (Left: Pred. Data; Right: Exp. Data)

Figure S16. pKa Correlation between MeOH and DMF (Left: Pred. Data; Right: Exp. Data)

Table S9. The Experimental and predicted pKa of SAMPL6.

SAMPL6 Exp. Pred. (NN) Pred. (XGBoost)
SM01 9.53 ± 0.01 9.41 8.58
SM02 5.03 ± 0.01 3.95 3.68
SM03 7.02 ± 0.01 9.02 8.35

SM04 6.02 ± 0.01 5.38 4.73
SM05 4.59 ± 0.01 6.58 4.22
SM06 3.03 ± 0.04 3.85 3.63
11.74 ± 0.01 10.78 10.19
SM07 6.08 ± 0.01 5.32 5.33
SM08 4.22 ± 0.01 4.76 4.87
SM09 5.37 ± 0.01 5.17 5.36
SM10 9.02 ± 0.01 9.50 8.79
SM11 3.89 ± 0.01 3.91 4.00
SM12 5.28 ± 0.01 4.06 4.08
SM13 5.77 ± 0.01 5.69 5.89
SM14 2.58 ± 0.01 5.03 4.75
5.30 ± 0.01 5.55 5.02
SM15 4.70 ± 0.01 4.90 4.90
8.94 ± 0.01 8.58 8.92
SM16 5.37 ± 0.01 3.91 3.35
10.65 ± 0.01 11.21 10.42
SM17 3.16 ± 0.01 3.26 3.68
SM18 2.15 ± 0.02 3.28 3.70
9.58 ± 0.03 6.96 5.90
11.02 ± 0.04 11.01 11.00
SM19 9.56 ± 0.02 9.61 8.77
SM20 5.70 ± 0.03 6.98 6.23
SM21 4.10 ± 0.01 4.36 2.36
SM22 2.40 ± 0.02 2.94 2.30
7.43 ± 0.01 7.71 6.23
SM23 5.45 ± 0.01 4.78 4.94
SM24 2.60 ± 0.01 4.18 2.86
MAE 0.80 0.85
RMSE 1.07 1.17
R2 0.84 0.82

Figure S17. The linear fitting between exp. and pred. data of SAMPL6 molecules


Scheme S1. The predicted micro-states of SAMPL6 molecules (P1: Prediction with XGBoost model; P2: Prediction with NN model).


Scheme S2. The predicted micro-states of drug molecules

Table S10. The Prediction results of Thiourea Catalysts in DMSO

Catalyst pKa value XGBoost NN


TU1a 17.6 17.58 17.26
TU1b 17.5 17.58 17.26
TU1c 16.1 15.96 15.66
TU1d 15.7 15.71 15.41
TU1e 15.2 15.19 15.16
TU1f 16.0 16.17 15.92
TU1g 13.8 13.72 13.38

TU1h 21.1 20.89 20.55
TU2a 16.2 16.26 15.62
TU2b 15.8 15.85 15.33
TU2c 15.2 15.25 14.56
TU2d 13.2 13.16 12.71
TU2e 19.5 19.37 18.25
TU3a 16.9 17.31 17.82
TU3b 16.3 17.09 17.59
TU3c 15.5 16.10 16.68
TU3d 13.3 13.64 13.90
TU3e 19.9 19.86 18.93
TU4a 17.3 17.36 16.92
TU4b 17.2 17.26 16.80
TU4c 15.9 15.92 15.51
TU4d 15.6 15.75 15.35
MAE 0.16 0.55
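
The MAE row of Table S10 can be re-derived from the listed values. The sketch below copies the experimental, XGBoost, and NN columns from the table and computes the mean absolute error for each model; the results should agree with the reported MAE values to within rounding.

```python
# (catalyst, experimental pKa, XGBoost prediction, NN prediction) from Table S10
TABLE_S10 = [
    ("TU1a", 17.6, 17.58, 17.26), ("TU1b", 17.5, 17.58, 17.26),
    ("TU1c", 16.1, 15.96, 15.66), ("TU1d", 15.7, 15.71, 15.41),
    ("TU1e", 15.2, 15.19, 15.16), ("TU1f", 16.0, 16.17, 15.92),
    ("TU1g", 13.8, 13.72, 13.38), ("TU1h", 21.1, 20.89, 20.55),
    ("TU2a", 16.2, 16.26, 15.62), ("TU2b", 15.8, 15.85, 15.33),
    ("TU2c", 15.2, 15.25, 14.56), ("TU2d", 13.2, 13.16, 12.71),
    ("TU2e", 19.5, 19.37, 18.25), ("TU3a", 16.9, 17.31, 17.82),
    ("TU3b", 16.3, 17.09, 17.59), ("TU3c", 15.5, 16.10, 16.68),
    ("TU3d", 13.3, 13.64, 13.90), ("TU3e", 19.9, 19.86, 18.93),
    ("TU4a", 17.3, 17.36, 16.92), ("TU4b", 17.2, 17.26, 16.80),
    ("TU4c", 15.9, 15.92, 15.51), ("TU4d", 15.6, 15.75, 15.35),
]


def mean_absolute_error(reference, predicted):
    """Average of |reference - predicted| over all pairs."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


experimental = [row[1] for row in TABLE_S10]
mae_xgb = mean_absolute_error(experimental, [row[2] for row in TABLE_S10])
mae_nn = mean_absolute_error(experimental, [row[3] for row in TABLE_S10])
print(f"XGBoost MAE: {mae_xgb:.2f}")
print(f"NN MAE: {mae_nn:.3f}")
```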

Table S11. The predicted pKa Values of Squaramides in DMSO

squaramide pKa value XGBoost NN


SA1 12.48 14.81 12.33
SA2 10.55 10.59 10.07
SA3 8.37 8.51 8.44
SA4 12.17 12.18 11.81
SA5 10.54 11.27 11.73
SA6 14.99 13.75 13.07
SA7 12.18 12.25 11.50
SA8 11.03 11.22 11.07
SA9 15.13 14.74 13.32
SA10 13.10 13.14 12.80
SA11 11.83 11.74 11.47
SA12 16.28 13.06 12.769
SA13 13.30 13.32 12.72
SA14 11.87 11.95 11.31
SA15 16.46 14.28 12.42
SA16 11.42 11.45 11.11
SA17 11.57 11.64 11.29
MAE 0.64 0.98


Table S12. The predicted pKa Values of BINOL type Organocatalysts in DMSO

catalyst pKa value XGBoost NN

BINOL1 12.98 14.51 13.46

BINOL2 11.95 13.43 13.16

BINOL3 10.39 13.04 11.32

BINOL4 9.78 12.05 10.58

BINOL5 9.44 12.08 11.05

BINOL6 9.30 12.03 10.12

BINOL7 10.93 14.67 11.73

BINOL8 10.89 13.55 10.45

BINOL9 9.99 8.31 8.74

BINOL10 9.68 8.53 8.81

BINOL11 10.73 12.84 10.94

BINOL12 12.35 12.07 9.52

BINOL13 11.92 11.72 10.44

BINOL14 10.45 14.45 11.70

BINOL15 16.43 14.46 12.37

MAE 2.07 1.27

Table S13. pKa values of proline type organocatalysts in DMSO

Cat. pKa value XGBoost NN

PA1 11.57 11.77 11.33
PA2 11.50 11.75 10.25

PA3 11.55 11.60 10.18

PA4 11.39 11.00 10.16

PA5 11.17 10.25 9.16

PA6 11.40 11.33 10.00

PA7 14.34 14.45 14.25

PA8 18.60 19.23 19.24

PA9 20.67 20.65 20.23

PA11 23.81 23.75 23.47

PA12 11.26 11.26 11.34

PA13 16.71 16.63 15.79

MAE 0.23 0.78

Table S14. pKa Values of 6’-HBCA Catalysts in DMSO

6’-HBCA Cat pKa value XGBoost NN


CA1 8.13 11.02 13.61
CA2 15.46 14.75 16.13
CA3 15.55 14.70 15.56
CA4 15.57 14.88 16.14
CA5 15.32 14.01 16.42
CA6a 15.53 14.70 15.56
CA7 15.44 14.59 16.86
CA8 17.38 15.47 17.61
CA9 8.95 11.28 14.33
CA10 8.64 11.18 15.75
CA11 10.19 13.09 14.26
CA12 7.85 9.99 11.52
CA13 7.79 9.99 10.63
CA14 7.08 8.83 9.67
CA15 7.64 9.27 10.40

CA16 6.90 7.53 8.73
CA17 6.76 7.91 8.49
CA18 20.24 18.51 20.65
MAE 1.61 2.33

Scheme S3. pKa Values of aminocatalysts in MeCN
