
Computational Biology and Chemistry 95 (2021) 107597
Virtual screening of dipeptidyl peptidase-4 inhibitors using quantitative structure–activity relationship-based artificial intelligence and molecular docking of hit compounds
Oky Hermansyah a, Alhadi Bustamam b, Arry Yanuar a,*
a Laboratory of Biomedical Computation and Drug Design, Faculty of Pharmacy, Universitas Indonesia, Depok 16424, Indonesia
b Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Depok 16424, Indonesia
ARTICLE INFO

Keywords: Artificial intelligence; DPP-4; KNIME; Machine learning; QSAR; Virtual screening

ABSTRACT

Dipeptidyl peptidase-4 (DPP-4) inhibitors are becoming essential drugs in the treatment of type 2 diabetes mellitus; however, some classes of these drugs exert side effects, including joint pain and pancreatitis. Studies suggest that these side effects might be related to secondary inhibition of DPP-8 and DPP-9. In this study, we identified DPP-4-inhibitor hit compounds selective against DPP-8 and DPP-9. We built a virtual screening workflow using a quantitative structure–activity relationship (QSAR) strategy based on artificial intelligence to allow faster screening of millions of molecules for the DPP-4 target relative to other screening methods. Five regression machine learning algorithms and four classification machine learning algorithms were applied to build virtual screening workflows, with the regression QSAR model using support vector regression (R2pred 0.78) and the classification QSAR model using the random forest algorithm with 92.2% accuracy. Virtual screening of > 10 million molecules yielded 2 716 hit compounds with a pIC50 value of > 7.5. Additionally, molecular docking results of several potential hit compounds for DPP-4, DPP-8, and DPP-9 identified CH0002 as showing high inhibitory potential against DPP-4 and low inhibitory potential for DPP-8 and DPP-9 enzymes. These results demonstrated the effectiveness of this technique for identifying DPP-4-inhibitor hit compounds selective for DPP-4 and against DPP-8 and DPP-9 and suggest its potential for discovering hit compounds of other targets.
Abbreviations: DL, Deep Learning; XGBoost, XGBoost Tree Ensemble; RF, Random Forest; MLR, Multiple Linear Regression; SVR, Support Vector Regression; SVM, Support Vector Machine.
* Correspondence to: Faculty of Pharmacy, Universitas Indonesia, Depok 16424, Indonesia.
E-mail address: arry.yanuar@ui.ac.id (A. Yanuar).
https://doi.org/10.1016/j.compbiolchem.2021.107597
Received 17 April 2021; Received in revised form 25 October 2021; Accepted 26 October 2021
Available online 30 October 2021
1476-9271/© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Dipeptidyl peptidase-4 (DPP-4) (EC 3.4.14.5) inhibitors are important oral antidiabetic drugs for treating type 2 diabetes (T2DM). Sitagliptin was reported in 2006 as the first DPP-4 inhibitor agent, and since then, this class of drugs has increasingly shifted the role of sulfonylurea in T2DM treatment according to national and international guidelines. The drugs work differently from most other antidiabetic drugs. DPP-4 inhibition stimulates the pancreas to produce and release insulin while reducing or normalizing body weight without causing hypoglycemia (Alam et al., 2018; Chylewska et al., 2018; Gallwitz, 2019; Popovic-Djordjevic et al., 2018; Sesti et al., 2019).

Some commercially available DPP-4 inhibitors are well-tolerated, whereas others have side effects, ranging from joint pain to pancreatitis, that are related to the secondary inhibition of enzymes with high sequence homology to DPP-4 (e.g., DPP-8 and DPP-9) (Huan et al., 2015; Patel and Ghate, 2014). Therefore, there is a need to develop novel DPP-4 inhibitors selective against DPP-8 and DPP-9 enzymes.

Novel DPP-4 inhibitors can be developed through high-throughput screening, which is generally performed by pharmaceutical companies. An alternative is computer-aided drug design through virtual screening (Hughes et al., 2011; Pei et al., 2020; Shamsara, 2019; Wang et al., 2019) of large databases containing millions of compounds, such as ChEMBL (Gaulton et al., 2012) and PubChem (Kim et al., 2016). A virtual screening method developed using a quantitative structure–activity relationship (QSAR) strategy (regression or classification) capable of predicting the selectivity of a molecule for DPP-4 can reveal the relationship between molecular structures, represented by descriptors or fingerprints, and potential hit compounds.

In this study, we performed virtual screening by developing a workflow (Kausar and Falcao, 2018; Ramesh and Muthuraman, 2020; P. Mazanetz et al., 2012; Santos et al., 2019; Wójcikowski et al., 2019) that uses QSAR classification models to predict hit compounds, followed by QSAR regression to assess their values (Ramesh and Muthuraman, 2020). Additionally, we employed artificial intelligence (AI) techniques to incorporate a supervised learning algorithm into the QSAR process (Bitencourt-Ferreira et al., 2021; Bitencourt-Ferreira and de Azevedo, 2019a; da Silva et al., 2020; Kumar et al., 2018) to allow faster evaluation of millions of molecules relative to other virtual screening methods (Neves et al., 2018). Previous studies have employed AI-based QSAR analysis of DPP-4 inhibitors (Al-Fakih et al., 2019; Gu et al., 2013; Sokolović et al., 2017); however, they considered only certain compound derivatives and small datasets, thereby limiting their predictive ability to only those derivatives. The principle of QSAR suggests that similar structures have similar activities (Martin et al., 2002); therefore, using a small dataset and a limited number of derivative compounds biases the prediction, especially when screening several compounds. Therefore, our goal was to expand AI-based QSAR to enable the prediction of DPP-4-inhibitor compounds from larger datasets and without limitations by the number or type of derivative. Moreover, a previous study employed a machine learning approach to DPP-4-inhibitor screening (Cai et al., 2017); however, their prediction accuracy was < 90%.

Herein, we developed a QSAR model that meets the QSAR statistical parameter standard (Golbraikh and Tropsha, 2002; Roy et al., 2015a) for regression models and the accuracy standard for classification models. We employed five machine learning regression algorithms (i.e., XGBoost tree ensemble, random forest, support vector regression, deep learning, and multiple linear regression) and four machine learning classification algorithms (i.e., XGBoost tree ensemble, random forest, support vector regression, and deep learning) (Babajide Mustapha and Saeed, 2016; Roy et al., 2015b). Further, we used molecular docking to identify hit compounds selective toward DPP-4 and against DPP-8 and DPP-9 and related commercially available ligands, including trelagliptin, omarigliptin, and carmegliptin (Burness, 2015; Makrilakis, 2019; Mattei et al., 2010; McKeage, 2015). Commercial compounds were included because they are expected to exhibit inhibitory constants (Ki) similar to those of potential hit compounds (Huan et al., 2015).

2. Methods

2.1. Dataset

The dataset was downloaded from the ChEMBL website, and it includes the human DPP-4 target and an IC50 activity filter (https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL284/). The dataset contained 4 661 compounds after removing empty activity values, salt ions, and small fragments, with activity units presented as molar values. The molecular structures were normalized, and duplicate compounds were identified and removed (Cherkasov et al., 2014; Kausar and Falcao, 2018). The remaining 3 933 compounds were used as the regression modeling dataset.

For the QSAR classification models, we used 4 355 compounds from the ChEMBL database obtained using a scientific literature filter. The missing values and duplicates were corrected, salt ions and small fragments were removed, and the molecular structure was normalized (Cherkasov et al., 2014; Kausar and Falcao, 2018), leaving 3 740 compounds. The data were classified as active and inactive compounds, with pIC50 activities > 7.5 designating active compounds and pIC50 < 6 designating inactive compounds (those with pIC50 between 6.0 and 7.5 were removed) (Cai et al., 2017). The remaining 2 307 compounds were used for the model development.

Fig. 1. QSAR workflow for modeling DPP-4 inhibitors.

2.2. Calculation of descriptors and fingerprints

Descriptor and fingerprint calculations were performed using four nodes in KNIME (i.e., RDKit descriptor calculation and fingerprinting and CDK fingerprint and molecular properties) (Beisken et al., 2013). The total descriptors and fingerprints of each molecular structure included 17 784 and 19 875 features for the classification and regression models, respectively.

2.3. Standardization of activities and partitions

The activity of each molecular structure (IC50, in nM) was converted into logarithmic values in molar units [pIC50 = −log (IC50 × 10⁻⁹)] (Selvaraj et al., 2011). The dataset was randomly partitioned (80%:20%), and the 80% portion was further partitioned 90%:10% for training and validation, with the 20% portion used as the test dataset (Baldi and Brunak, 2001; Ripley, 1996; Xiong et al., 2020). The training set was used for model construction and validation, and the test dataset was used for internal and external evaluation of the models (Liu et al., 2018; Mozafari et al., 2020; Myint et al., 2012).
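The activity conversion, the active/inactive labeling, and the two-stage partition described in the Methods can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' KNIME workflow: the function names (`pic50_from_nm`, `label_activity`, `partition`), the fixed random seed, and the truncating rounding of the split sizes are assumptions made for the example.

```python
import math
import random


def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


def label_activity(pic50, active_above=7.5, inactive_below=6.0):
    """Label a compound; values in the intermediate band return None (discarded)."""
    if pic50 > active_above:
        return "active"
    if pic50 < inactive_below:
        return "inactive"
    return None


def partition(items, seed=0):
    """Shuffle, split 80:20, then split the 80% part 90:10 (sizes truncated)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary assumption
    n_model = int(len(shuffled) * 0.8)     # 80% used for model building
    n_train = int(n_model * 0.9)           # 90% of that part used for training
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_model]
    test = shuffled[n_model:]
    return train, validation, test


if __name__ == "__main__":
    print(round(pic50_from_nm(0.064), 2))      # 10.19
    print(label_activity(pic50_from_nm(5.0)))  # 5 nM -> pIC50 8.30 -> active
    train, validation, test = partition(range(3933))
    print(len(train), len(validation), len(test))  # 2831 315 787
```

With 3 933 records, this truncating split gives 2 831 training and 787 test records, which agrees with the set sizes reported for the regression model.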
2.4. Feature selection

Optimal features were selected using several feature-selection methods. Dimension reduction was performed with principal component analysis, calculation of high correlation, and application of the random forest algorithm (Silipo et al., 2014). The identified features were tested using several machine learning models, and those with the highest accuracy were used as features for modeling.

2.5. Quantitative structure–activity relationship modeling

Five machine learning algorithms were used to build the QSAR regression models (XGBoost tree ensemble, multiple linear regression, random forest, deep learning, and support vector regression) and four were used for the QSAR classification models (XGBoost tree ensemble, random forest, deep learning, and support vector regression). The partitioned dataset was used to build the machine learning models from training and validation (Fig. 1). The hyperparameter value of each model was determined through optimization with a random search (a combination of 100 experiments), with the hyperparameter showing the best performance used for internal and external validation.

2.6. Evaluation of the QSAR regression model

To determine goodness-of-fit model performance, we performed statistical analysis of the regression coefficient (R), determination coefficient (R²), and mean-square error (MSE) for each machine learning algorithm. Hyperparameter selection was evaluated using the lowest root (R)MSE.

$$R = \sqrt{1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}} \tag{1}$$

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \tag{2}$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{3}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{4}$$

where y_i is the actual value, ŷ_i is the predicted value, and ȳ is the actual average value. For a QSAR regression model to produce a reliable prediction, the recommended minimum value of the correlation coefficient (R) for in vivo data is ≥ 0.8, with a coefficient of determination (R²) ≥ 0.6 (Veerasamy et al., 2011).

We used a set of previously described parameters to determine the external predictability of the QSAR regression models (a model is considered satisfactory if all conditions are met) (Golbraikh and Tropsha, 2002):

1. R²(cv) > 0.5
2. R²(ext) > 0.6
3. (R² − R₀²) / R² < 0.1 and 0.85 ≤ k ≤ 1.15, or
4. (R² − R′₀²) / R² < 0.1 and 0.85 ≤ k′ ≤ 1.15
5. |R₀² − R′₀²| < 0.3

where R²(cv) and R²(ext) are the coefficients of determination from internal validation results and the test data, respectively; R₀² and R′₀² are the coefficients of determination between actual and predicted activities at zero intercepts and between predicted and actual activities, respectively; and k and k′ are the slopes of the regression line through the origin point (Golbraikh and Tropsha, 2002).

A previous study proposed r²m(test) for external validation of models (Roy et al., 2015a), with this value calculated using the square of the correlation coefficient between actual and predicted activities from the test dataset. For acceptable predictions, the average r²m(test) should be > 0.5 and ∆r²m(test) should be < 0.2.

2.7. Evaluation of the classification QSAR model

Internal validation and external validation datasets were used to test classification model performance. All models were evaluated as follows:

Sensitivity = Recall = TP / (TP + FN) (5)
Specificity = TN / (TN + FP) (6)
Accuracy = (TP + TN) / (TP + FP + TN + FN) (7)
F-Measure = 2(Recall × Precision) / (Recall + Precision) (8)
Precision = TP / (TP + FP) (9)

A receiver operating characteristic (ROC) plot was used to visualize the model behavior according to the area under the ROC curve (AUC), with an AUC of 0.5 indicating that the classification model has no discriminatory power (Gramatica, 2013; Roy et al., 2015a).

2.8. Testing QSAR model workflow on other targets

To assess QSAR workflow implementation, targets from the ChEMBL database [opioid sigma receptor (CHEMBL287) and a β1 adrenergic receptor (CHEMBL213)] were used as a modeling dataset to evaluate predictive capability. Activity data for each target were downloaded from ChEMBL and analyzed with KNIME, and the same five machine learning regression algorithms were used for the analysis.

2.9. Virtual screening using multiple databases

To identify DPP-4-inhibitor hit compounds, virtual screening was performed using various databases, including 1 870 461 molecules from ChEMBL (Davies et al., 2015; Gaulton et al., 2017), 1 424 986 molecules from PubChem (Kim et al., 2018) (https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/; last update: October 18, 2019), and 7 869 542 molecules from Molport (ftp://molport.com/; last update: October 18, 2019). These molecules were tested using the QSAR classification models to determine their activities. Active molecules were further processed for similarity testing using the ChEMBL DPP-4-inhibitor data to determine whether a hit compound was novel. The Tanimoto coefficient was used as the similarity descriptor, with a similarity limit of 0.85 as a threshold through 166-bit MACCS fingerprinting (Danishuddin and Khan, 2016; Martin et al., 2002).

The activities of the active molecules that met the similarity criteria were determined using the QSAR regression models. The obtained hit compounds were then compared with DPP-4 target decoys from DUD-E (Mysinger et al., 2012) (http://dude.docking.org/targets/dpp4) to determine whether the hit compounds were similar to pan-assay interference compounds (Lipinski et al., 2001). We employed Lipinski's rule of five to determine whether a hit compound qualifies for potential development as a DPP-4 inhibitor.

2.10. Molecular docking of hit compounds with DPP enzymes

Hit compounds with the highest activities were determined from the virtual screening results, and molecular docking was performed using the DPP-4, DPP-8, and DPP-9 crystal structures obtained from the Protein Data Bank (PDB) (Berman, 2000; Kang et al., 2007). DPP-4 crystal structures include various commercially available ligands, such as trelagliptin (PDB: 5KBY) (Grimshaw et al., 2016), alogliptin (PDB: 2ONC) (Feng et al., 2007), omarigliptin (PDB: 4PNZ) (Biftu et al., 2014), and
carmegliptin (PDB: 3KWF) (Mattei et al., 2010), whereas DPP-8 (PDB: 6HP8) (Ross, 2019) and DPP-9 (PDB: 6EOR) include no bound ligands (Ross et al., 2018). Potential hit compounds should exhibit a predictive inhibitory activity against DPP-4 greater than or equal to that of the original ligands. In contrast, to be selective against DPP-8 and DPP-9, hit compounds should have a low inhibitory activity toward these enzymes (> 10 000 nM) (Huan et al., 2015). Molecular docking was performed using Autodock (v.1.5.6), Discovery Studio R2 Client 2017 (Biovia, San Diego, CA, USA), and PyMol (v.2.3.4; Schrödinger, Inc., New York, NY, USA). Before assessing the screened compounds, the original ligands were redocked to establish a reference root-mean-square deviation (RMSD). Quality docking was indicated by an RMSD of < 2 Å. All redocking values satisfy this threshold.

Fig. 2. Analysis of the chemical space. Training set versus test set (external validation) defined by molecular weight (MW) and ALogP. (A) For the regression model, green symbols (x) are results from the training set, and red symbols (∆) are results from the test set. (B) For the classification model, blue symbols (□) are results from the training set, and red symbols (∆) are results from the test set.

Fig. 3. Feature selection. Seven features developed to obtain the best method using four learning models.

Fig. 4. Internal validation results using the regression model. Support vector regression produced the best performance among other models according to the lowest MSE.

3. Results and Discussion

Dataset diversity is a critical factor when using QSAR methods. Previous applications of QSAR to assess DPP-4 inhibitors mainly focused on local (conventional) QSAR modeling using both two- and three-dimensional QSAR and pharmacophores. However, this provides reliable predictions across only a limited chemical space. Therefore, there is a need to use a more extensive and diverse dataset to build a global QSAR method (Shi et al., 2018).

3.1. Analysis of the chemical space

For the regression model (Fig. 2A), we used a training dataset with 2 831 molecules with molecular activities ranging from 0.064 nM (pIC50
Table 1
Internal validation results for the classification model.
Models          TP    FP    TN    FN    Sensitivity   Specificity   F-Measure   Precision   Accuracy
Deep learning   891   93    786   75    0.9224        0.8942        0.9138      0.9055      0.9079
Random forest   925   105   774   41    0.9576        0.8805        0.9269      0.8981      0.9209
SVM             952   457   422   14    0.9855        0.4801        0.8017      0.6757      0.7447
XGBoost         907   93    786   59    0.9389        0.8942        0.9227      0.9070      0.9176
TP = true positive; FP = false positive; TN = true negative; FN = false negative.
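The classification metrics can be recomputed directly from the four confusion-matrix counts. The sketch below uses a hypothetical helper, `classification_metrics`, written for illustration only (it is not part of the authors' KNIME workflow), and applies it to the random forest counts of Table 1 (TP = 925, FP = 105, TN = 774, FN = 41).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics of Eqs. (5)-(9) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # recall, Eq. (5)
    specificity = tn / (tn + fp)                 # Eq. (6)
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # Eq. (7)
    precision = tp / (tp + fp)                   # Eq. (9)
    f_measure = 2 * sensitivity * precision / (sensitivity + precision)  # Eq. (8)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "precision": precision,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Random forest, internal validation (Table 1)
    rf = classification_metrics(tp=925, fp=105, tn=774, fn=41)
    for name, value in rf.items():
        print(f"{name}: {value:.4f}")
    # sensitivity: 0.9576, specificity: 0.8805, f_measure: 0.9269,
    # precision: 0.8981, accuracy: 0.9209
```

The printed values match the random forest row of Table 1 to four decimal places, which is a convenient sanity check when re-entering the counts by hand.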
Table 2
Statistical parameters for external validation of various models.
Metric               DL       XGBoost   MLR      RF       SVR      Standard
Q²                   0.6920   0.7530    0.5939   0.7532   0.7607   > 0.5 (a)
R²(pred)             0.5910   0.7617    0.6013   0.7668   0.7761   > 0.6 (a)
MSE (validation)     0.7686   0.6163    1.0134   0.6157   0.5971   –
MSE (ext)            1.0370   0.6043    1.0109   0.5914   0.5679   –
R₀²                  0.2564   0.6914    0.4597   0.6319   0.7281   –
R′₀²                 0.5911   0.7618    0.6014   0.7668   0.7762   –
(R² − R₀²) / R²      0.5672   0.0924    0.2422   0.1847   0.0624   < 0.1 (a)
(R² − R′₀²) / R²     0.0023   0.0000    0.0086   0.0107   0.0006   < 0.1 (a)
|R₀² − R′₀²|         0.3347   0.0704    0.1417   0.1349   0.0481   < 0.3 (a)
k                    0.9979   1.0024    0.9976   1.0005   0.9975   0.85 ≤ k ≤ 1.15 (a)
k′                   0.9797   0.9845    0.9805   0.9867   0.9902   0.85 ≤ k′ ≤ 1.15 (a)
R²m                  0.2760   0.5181    0.3665   0.4272   0.5668   –
R′²m                 0.5686   0.7618    0.5586   0.6874   0.7563   –
R²m (average)        0.4223   0.6400    0.4625   0.5573   0.6616   > 0.5 (b)
∆R²m                 0.2925   0.2437    0.1920   0.2602   0.1895   < 0.2 (b)
Model predictive     Fail     Fail      Fail     Fail     Yes
(a) = standard Golbraikh and Tropsha (2002)
(b) = standard Roy, Kar & Das (2015)
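As a rough illustration of how the external-validation parameters listed in Table 2 relate to observed and predicted activities, the following sketch computes them for a small set of values. It assumes one commonly used formulation (squared Pearson correlation for R², regression-through-origin slopes k and k′, and r²m = R²(1 − √|R² − R₀²|)); published definitions of R₀² differ between implementations, so this should not be read as the exact procedure used to produce Table 2. The function name `external_validation` and the example data are hypothetical.

```python
import math


def external_validation(y_obs, y_pred):
    """External-validation statistics under one common formulation (assumed)."""
    n = len(y_obs)
    mean_o = sum(y_obs) / n
    mean_p = sum(y_pred) / n
    s_oo = sum((y - mean_o) ** 2 for y in y_obs)
    s_pp = sum((p - mean_p) ** 2 for p in y_pred)
    s_op = sum((y - mean_o) * (p - mean_p) for y, p in zip(y_obs, y_pred))
    r2 = s_op ** 2 / (s_oo * s_pp)  # squared Pearson correlation

    sum_op = sum(y * p for y, p in zip(y_obs, y_pred))
    k = sum_op / sum(p * p for p in y_pred)       # observed vs predicted, through origin
    k_prime = sum_op / sum(y * y for y in y_obs)  # predicted vs observed, through origin
    r0_sq = 1 - sum((y - k * p) ** 2 for y, p in zip(y_obs, y_pred)) / s_oo
    r0p_sq = 1 - sum((p - k_prime * y) ** 2 for y, p in zip(y_obs, y_pred)) / s_pp

    mse = sum((y - p) ** 2 for y, p in zip(y_obs, y_pred)) / n
    rm_sq = r2 * (1 - math.sqrt(abs(r2 - r0_sq)))
    rmp_sq = r2 * (1 - math.sqrt(abs(r2 - r0p_sq)))

    # Q2 from cross-validation is not computed here; only the test-set checks are.
    slope_ok = ((r2 - r0_sq) / r2 < 0.1 and 0.85 <= k <= 1.15) or (
        (r2 - r0p_sq) / r2 < 0.1 and 0.85 <= k_prime <= 1.15
    )
    golbraikh_tropsha = r2 > 0.6 and slope_ok and abs(r0_sq - r0p_sq) < 0.3
    roy = (rm_sq + rmp_sq) / 2 > 0.5 and abs(rm_sq - rmp_sq) < 0.2
    return {
        "r2": r2, "k": k, "k_prime": k_prime,
        "r0_sq": r0_sq, "r0_prime_sq": r0p_sq,
        "mse": mse, "rmse": math.sqrt(mse),
        "rm_sq_mean": (rm_sq + rmp_sq) / 2, "delta_rm_sq": abs(rm_sq - rmp_sq),
        "passes": golbraikh_tropsha and roy,
    }


if __name__ == "__main__":
    observed = [5.2, 6.1, 6.9, 7.8, 8.4, 9.0]    # hypothetical pIC50 values
    predicted = [5.5, 6.0, 7.1, 7.5, 8.6, 8.8]   # hypothetical model output
    for name, value in external_validation(observed, predicted).items():
        print(name, value)
```

Returning all intermediate quantities, rather than a single pass/fail flag, makes it easier to see which individual criterion a model fails, as is done column by column in Table 2.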
Table 3
External validation of the classification model.
Models          TP    FP    TN    FN    Sensitivity   Specificity   F-measure   Precision   Accuracy
Deep learning   215   25    212   10    0.9556        0.8945        0.9247      0.8958      0.9242
Random forest   219   29    208   6     0.9733        0.8776        0.9260      0.8831      0.9242
SVM             221   137   100   4     0.9822        0.4219        0.7581      0.6173      0.6948
XGBoost         215   26    211   10    0.9556        0.8903        0.9227      0.8921      0.9221
TP = true positive; FP = false positive; TN = true negative; FN = false negative.
= 10.19) to 9 100 µM (pIC50 = 2.04) and a test set of 787 molecules (for external validation) with molecular activities ranging from 0.012 nM (pIC50 = 10.92) to 1 000 µM (pIC50 = 3).

For the classification model (Fig. 2B), we used a training set of 1 845 molecules (879 active and 966 inactive), whereas the test set (external validation) contained 462 molecules (237 active and 225 inactive). Analysis of the chemical space was performed to examine the diversity of the datasets. Fig. 2 shows plots of DPP-4-inhibitor compound diversity in each dataset according to molecular weight (MW) and partition coefficient (XLogP). The compounds had MWs ranging from 128.094 Da to 1 173.69 Da and partition coefficients ranging from −4.107 to 18.493, suggesting that the datasets showed significant heterogeneity in the chemical space, potentially resulting in broader predictive ability for new compounds (Kong and Yan, 2017).

3.2. Feature selection

Feature selection was performed for seven feature types (Fig. 3) to identify the optimal features for use in compound identification, with this undertaken using the random forest algorithm. This process reduced 17 569 total features by 98.8%, resulting in 208 features for the QSAR regression model and 200 features for the QSAR classification model.

3.3. Optimization of machine learning algorithms

The model with the lowest error and highest accuracy was obtained from algorithm parameter optimization by searching for the lowest random error within a specific parameter range across 100 parameter
Fig. 5. ROC curve of the four classification models.
Table 4
Performance of QSAR workflow on various targets.
Target                                    Models          Q²       MSE      R² (ext)   Dataset   Curation   Training   Validation   Test
Beta-1 adrenergic receptor (CHEMBL213)    Deep Learning   0.6601   0.4265   0.9134     1508      620        446        496          50
                                          MLR             0.1570   1.0578   −0.2724
                                          Random Forest   0.7349   0.3326   0.6462
                                          SVR             0.7312   0.3373   0.6515
                                          XGBoost         0.7099   0.3641   0.6676
Sigma Opioid receptor (CHEMBL233)         Deep Learning   0.6730   0.6113   0.0736     2280      1157       832        925          232
                                          MLR             0.4672   0.9959   0.5693
                                          Random Forest   0.7725   0.4253   0.7318
                                          SVR             0.7543   0.4593   0.7311
                                          XGBoost         0.7453   0.4762   0.7422
values and 100 repetitions. Parameter-optimization results were as follows: deep learning using two dense layers with a learning rate of 0.01 and 100 neurons (RMSE = 0.8335; batches, 65; epochs, 505); XGBoost (RMSE = 0.7684; rounds, 1000; maximum depth, 13; and eta, 0.2994); multiple linear regression (RMSE = 0.9605; offset parameter, 0.23707); random forest (RMSE = 0.7765; number of models, 109; tree depth, 15); support vector regression (RMSE = 0.7573; cost parameters, 79; degree, 5).

For optimization of classification methods, the deep learning optimal parameters on the dense layer were obtained using a rectified linear unit (ReLU) weight-initialization strategy and a leaky ReLU activation function (batch size, 15; epochs, 843). The highest accuracies for each algorithm were 0.9431 (deep learning), 0.9404 (XGBoost; nRounds, 500; maxDepth, 12), 0.9404 (random forest; nMethod, 216; treeDepth, 21), and 0.6938 (support vector regression; sigma, 0.9662; penalty, 25).

Among the regression models, support vector regression showed the best results, which agree with previous studies that applied the QSAR model to DPP-4-inhibitor analysis (Gu et al., 2013; Yang et al., 2013). Additionally, this method is widely used to manage high-dimensional variables, especially with small datasets, by using kernels to map nonlinear data into high-dimensional feature space. XGBoost has not been extensively employed in QSAR modeling because it is new compared to other machine learning methods, although it shows a faster analytical capability and satisfactory predictive results (Babajide Mustapha and Saeed, 2016).

The optimization results for the classification algorithms showed accuracy of up to 94%, although support vector regression showed an accuracy of only 69%, suggesting that the hyperplane was unable to separate active and inactive compounds in the classification model. However, ensemble-based models, such as random forest and XGBoost, were able to predict active and inactive compounds well. These results contrasted with those of the regression QSAR model, which showed support vector regression as the best algorithm. In the regression method, the transformation of the variable to higher dimensions resulted in better prediction of pIC50 relative to other methods.

In the regression model, the goodness-of-fit from multiple linear regression and deep learning was low relative to results from support vector regression and ensemble methods. This is likely because multiple linear regressions are intended for linear data and show worse
Fig. 6. Molecular structures of hit compounds identified by virtual screening.
performance with nonlinear data. For deep learning, the poor results were likely due to the lack of training data, which hindered model performance.

3.4. Internal validation

Internal validation of the regression QSAR model showed < 25% difference for the results derived from all of the models, suggesting no overfitting (Veerasamy et al., 2011). Support vector regression showed the lowest error (Fig. 4), followed by XGBoost and random forest. Internal validation of the QSAR classification models (Table 1) revealed that deep learning, XGBoost, and random forest showed excellent performance (accuracy > 90%), with random forest showing the best performance.

3.5. External validation

Although the QSAR regression models met the requirements for parameters k and k′, the (R² − R₀²) / R² values for deep learning, multiple linear regression, and random forest were > 0.1, indicating that they were unsuitable as predictive QSAR models.

For verification of similarities between observed and predicted data,
Fig. 7. The interaction of compounds CH0001, CH0002, and CH0003 with DPP4. Ojeda-Montes et al. (2018) report that compounds selected for DPP8 and DPP9 have hydrophobic interactions with Phe357, Arg358, or Tyr547 residues of DPP4. (a) Hydrophobic interaction of the Pi-Pi T-Shaped type between the aromatic group of the molecule CH0001 and DPP4 (3KWF) at residues Phe357 and Tyr547. (b) Hydrophobic interaction of the Pi-Pi Stacked type between the aromatic group of the molecule CH0002 and DPP4 (4PNZ) at residue Phe357. (c) There was no hydrophobic interaction between CH0003 and DPP4 (2ONC) with Phe357, Arg358, or Tyr547 residues.
Table 2 shows that deep learning and multiple linear regression did not meet the average R²m requirement (values < 0.5), whereas the other methods met it; of these, only support vector regression also showed a ∆R²m value (< 0.2) making it eligible for the QSAR regression model.

Support vector regression fulfilled all requirements necessary for the QSAR regression model, suggesting that it can be used to predict hit compounds during virtual screening. Additionally, support vector regression showed a significant difference between the curves formed by predictive values and the zero intercept, suggesting increased model accuracy.

Table 3 shows external validation results of the classification method, and Fig. 5 shows ROC analysis of the accuracies of the deep learning, XGBoost, and random forest methods [all higher than previous studies (accuracy > 80%)] (Cai et al., 2017). Deep learning and XGBoost algorithms showed better performance on external validation; however, we used the random forest algorithm to build the virtual screening workflow based on its higher accuracy relative to the other algorithms.

3.6. Testing QSAR regression workflow on other targets

QSAR prediction of several targets (Table 4) produced results with a coefficient of determination reaching 0.7 for the support vector regression, XGBoost, and random forest models, suggesting that the workflow can be applied to other targets using raw datasets downloaded from the ChEMBL database.

3.7. Virtual screening results

Virtual screening results of the ChEMBL, PubChem, and Molport databases identified several hit compounds (Fig. 6), which were compared with the DPP-4 inhibitors from the ChEMBL database (similarity values < 0.85) and with DPP-4 DUD-E decoys. We ultimately identified 2 716 potential hit compounds.

3.8. Molecular docking results

Molecules showing high potential inhibitory activity after docking with DPP-4 included CH0002, CH0003, and CH0001 (ChEMBL identifiers). Specifically, CH0002 showed slightly higher inhibitory activity than trelagliptin (PDB: 5KBY), CH0001 showed higher inhibitory activity than omarigliptin (PDB: 4PNZ), and CH0003 showed higher inhibitory activity than almost all ligands.

In accordance with the proposal of Ojeda-Montes et al. (2018), to produce inhibitors that have potential activity against DPP4 and are selective against DPP8 and DPP9, a compound must have an aromatic ring so that it can form a π–π interaction with Phe357, a negatively charged group so that it can interact electrostatically with Arg358, and another aromatic ring to form an additional π–π interaction with Tyr547. The virtual screening hit compounds CH0001, CH0002, and CH0003 have several of these groups (Fig. 7). DPP4 inhibitors such as trelagliptin, alogliptin, omarigliptin, and carmegliptin also interact with these residues. In compound CH0003, the interaction with DPP4 (2ONC) did not form a molecular bond with these residues, which resulted in low selectivity with DPP8 and DPP9 (Ojeda-Montes et al., 2018).

Generally, compounds are classified as having high (Ki < 1 µM), moderate (1 µM < Ki < 10 µM), or weak (Ki > 10 µM) inhibitory potential (Havale and Pal, 2009; Kang et al., 2014; Taur et al., 2012). Molecular docking results using DPP-8 indicated that CH0003 (Ki = 52.98 nM) and CH0001 (Ki = 310.16 nM) showed high potential inhibitory activity and that CH0002 (Ki = 1190 nM) showed moderate inhibitory activity, which was similar to trelagliptin (Ki = 1410 nM). For DPP-9, CH0003 (Ki = 22.27 nM) and CH0001 (Ki = 332.85 nM) showed high inhibitory activity, whereas CH0002 (Ki = 1220 nM) showed moderate inhibitory activity, which was similar to carmegliptin (Ki = 1730 nM). Based on the molecular docking of hit compounds against DPP4, DPP8, and DPP9 (Table 5), the best compound from the virtual screening study was CH0002 (Fig. 8), which showed potential inhibitory activity against DPP4, but moderate inhibition against both DPP8 and DPP9. The
compound CH0001 had a higher inhibitory activity against DPP8 and DPP9 than CH0002, while CH0003 had the lowest selectivity.

Table 5
Molecular docking results of hit compounds with DPP-4, DPP-8, and DPP-9.
Macromolecule (PDB ID)   Ligand    Binding energy (kcal/mol)   Inhibition constant, Ki (nM)   Molecule
DPP4 (5KBY)              6RL1510   −9.66                       82.49                          Trelagliptin
                         CH0001    −9.42                       125.21
                         CH0002    −9.67                       81.13
                         CH0003    −11.27                      5.51
                         MP0001    −5.67                       69420
                         MP0002    −8.54                       551.45
                         MP0005    −4.55                       459580
                         PC0001    −6.68                       12780
                         PC0002    −7.13                       5910
                         PC0003    −8.43                       658
DPP4 (2ONC)              SY1800    −10.43                      22.83                          Native ligand of 2ONC
                         CH0001    −9.51                       106.54
                         CH0002    −9.78                       67.62
                         CH0003    −11.46                      4
                         MP0001    −5.27                       136760
                         MP0002    −8.5                        586.04
                         MP0005    −4.7                        358230
                         PC0001    −6.43                       19500
                         PC0002    −6.77                       10940
                         PC0003    −7.97                       1440
DPP4 (4PNZ)              2VH802    −10.37                      25.12                          Omarigliptin
                         CH0001    −9.78                       67.96
                         CH0002    −8.22                       940.40
                         CH0003    −8.91                       293.62
                         MP0001    −4.89                       261080
                         MP0002    −7                          7420
                         MP0005    −3.69                       1980000
                         PC0001    −6.86                       9350
                         PC0002    −6.65                       13280
                         PC0003    −7.66                       2410
DPP4 (3KWF)              B1Q1      −9.75                       71.41                          Carmegliptin
                         CH0001    −9.05                       234.16
                         CH0002    −9.21                       177.65
                         CH0003    −10.11                      39.1
                         MP0001    −4.04                       1090000
                         MP0002    −7.41                       3730
                         MP0005    −4.16                       897610
                         PC0001    −6.94                       8170
                         PC0002    −6.35                       22310
                         PC0003    −7.59                       2720
DPP8 (6HP8)              GK2901    −6.69                       12490                          Native ligand of 6HP8
                         CH0001    −8.88                       310.16
                         CH0002    −8.08                       1190
                         CH0003    −9.93                       52.98
                         MP0001    −6.42                       19660
                         MP0002    −5.88                       48820
                         MP0005    −4.17                       879860
                         PC0001    −6.91                       8650
                         PC0002    −6.22                       27410
                         PC0003    −7.4                        3790
                         2VH802    −5.21                       152160                         Omarigliptin
                         B1Q1      −6.26                       25890                          Carmegliptin
                         6RL1510   −7.98                       1410                           Trelagliptin
                         SY1800    −8.54                       551                            Native ligand of 2ONC
DPP9 (6EOR)              9XH901    −11.24                      5.8                            Native ligand of 6EOR
                         CH0001    −8.84                       332.85
                         CH0002    −8.07                       1220
                         CH0003    −10.44                      22.27
                         MP0001    −5.28                       135300
                         MP0002    −7.11                       6130
                         MP0005    −4.37                       630360
                         PC0001    −7.23                       5000
                         PC0002    −6.82                       10060
                         PC0003    −7.76                       2050
                         CH0090    −8.28                       857.76
                         2VH802    −8.32                       792                            Omarigliptin
                         B1Q1      −7.86                       1730                           Carmegliptin
                         6RL1510   −7.42                       3660                           Trelagliptin

Fig. 8. Visualization of the CH0002 hit compound with DPP-4 (PDB: 4PNZ).

4. Conclusions

In summary, we developed a QSAR method that uses AI to create a virtual screening workflow that meets QSAR statistical parameter standards. Additionally, we identified support vector regression as the best algorithm for satisfying the QSAR regression parameters, whereas the classification model achieved > 90% accuracy, with an area under the ROC curve of 0.96. The optimal algorithm for the classification method was random forest, which was used to predict hit compounds, with support vector regression used to predict their activity. Virtual screening identified 2 716 compounds, which were then analyzed by molecular docking with DPP-4, DPP-8, and DPP-9, revealing CH0002 as a selective inhibitor of DPP-4.

Declarations

Ethics approval and consent to participate

Not applicable.

Funding

This work was supported by the fund through PITTA Indonesia (International Indexed Publication for UI Student Final Project) 2019 grant (NKB-0472/UN2. R3.1/HKP.05.00/2019) awarded to A.Y. from the Directorate of Research and Community Service at the University of Indonesia for publication manuscript.

CRediT authorship contribution statement

OH designed the study, implemented all KNIME workflow development, and prepared the manuscript; AB helped with study design; AY supervised the study; OH, AB, and AY discussed the results; OH and AY analyzed data and wrote the manuscript; and all authors read and approved the final manuscript.

Declaration of Competing Interest

The authors declare that they have no known competing financial
SY1800 − 7.34 4150 Native ligand interests or personal relationships that could have appeared to influence
of 2ONC the work reported in this paper.
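The binding energies and inhibition constants in Table 5 appear to be linked by the standard thermodynamic relation Ki = exp(ΔG / (R·T)), which docking programs such as AutoDock use to report an estimated Ki. The sketch below reproduces one tabulated value and computes a simple Ki ratio as a rough indicator of selectivity. The temperature of 298.15 K, the helper names `ki_nm_from_binding_energy` and `selectivity_ratio`, and the use of a Ki ratio as a selectivity measure are assumptions made for illustration; this section does not state them.

```python
import math

# Gas constant in kcal/(mol*K); the temperature is an assumed value (298.15 K).
R_KCAL_PER_MOL_K = 1.98719e-3
ASSUMED_TEMPERATURE_K = 298.15


def ki_nm_from_binding_energy(delta_g_kcal_per_mol, temperature_k=ASSUMED_TEMPERATURE_K):
    """Estimate Ki in nM from a binding free energy via Ki = exp(dG / (R*T))."""
    ki_molar = math.exp(delta_g_kcal_per_mol / (R_KCAL_PER_MOL_K * temperature_k))
    return ki_molar * 1e9  # convert mol/L to nmol/L


def selectivity_ratio(ki_off_target_nm, ki_target_nm):
    """Hypothetical helper: ratio > 1 means weaker predicted binding to the off-target."""
    return ki_off_target_nm / ki_target_nm


# Ki values (nM) for CH0002 copied from Table 5.
ch0002_ki_nm = {
    "DPP-4 (5KBY)": 81.13,
    "DPP-8 (6HP8)": 1190.0,
    "DPP-9 (6EOR)": 1220.0,
}

# Binding energy of CH0002 with DPP-4 (5KBY) from Table 5 is -9.67 kcal/mol.
estimated_ki = ki_nm_from_binding_energy(-9.67)
print(f"Estimated Ki for CH0002 / DPP-4 (5KBY): {estimated_ki:.1f} nM")

target = ch0002_ki_nm["DPP-4 (5KBY)"]
for name in ("DPP-8 (6HP8)", "DPP-9 (6EOR)"):
    ratio = selectivity_ratio(ch0002_ki_nm[name], target)
    print(f"Ki ratio {name} vs DPP-4 (5KBY): {ratio:.1f}")
```

The estimate differs slightly from the tabulated Ki because the binding energies in Table 5 are rounded to two decimals. The ratio also depends on which of the four DPP-4 structures in Table 5 is taken as the reference.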
Data Availability

Data are available at Hermansyah, Oky; Bustamam, Alhadi; Yanuar, Arry (2019), “Dataset for QSAR Modeling of DPP-4 Inhibitors,” Mendeley Data, v2. https://doi.org/10.17632/4sw5hr2yz7.2.

Acknowledgments

We thank Prof. Heru Suhartanto (Faculty of Computer Sciences, Universitas Indonesia) for providing advice on and discussion of machine learning models.

Consent for Publication

Not applicable.