
Received: 28 January 2020 Revised: 10 March 2020 Accepted: 28 April 2020
DOI: 10.1002/jcc.26223
FULL PAPER
HCVpred: A web server for predicting the bioactivity of hepatitis C virus NS5B inhibitors

Aijaz Ahmad Malik1 | Chuleeporn Phanus-umporn1 | Nalini Schaduangrat1 | Watshara Shoombuatong1 | Chartchalerm Isarankura-Na-Ayudhya2 | Chanin Nantasenamat1
1 Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
2 Department of Clinical Microbiology and Applied Technology, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand

Correspondence
Chanin Nantasenamat, Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand.
Email: chanin.nan@mahidol.edu

Funding information
Mahidol University, Grant/Award Numbers: A32/2561, B.E. 2557-2559; Thailand Research Fund, Grant/Award Number: RSA6280075

Abstract
Hepatitis C virus (HCV) is one of the major causes of liver disease, affecting an estimated 170 million people and culminating in 300,000 deaths from cirrhosis or liver cancer. NS5B is one of three potential therapeutic targets against HCV (the other two being NS3/4A and NS5A) and is central to viral replication. In this study, we developed a classification structure–activity relationship (CSAR) model for identifying substructures giving rise to anti-HCV activities among a set of 578 non-redundant compounds. NS5B inhibitors were described by a set of 12 fingerprint descriptors and predictive models were constructed from 100 independent data splits using the random forest algorithm. The modelability (MODI index) of the data set was determined to be robust with a value of 0.88, exceeding the established threshold of 0.65. The predictive performance was deduced by the accuracy, sensitivity, specificity, and Matthews correlation coefficient, which were found to be statistically robust (i.e., the former three parameters afforded values in excess of 0.8 while the latter statistical parameter provided a value >0.7). An in-depth analysis of the top 20 important descriptors revealed that aromatic rings and alkyl side chains are important for NS5B inhibition. Finally, the predictive model is deployed as a publicly accessible HCVpred web server (available at http://codes.bio/hcvpred/) that allows users to predict the biological activity as being active or inactive against HCV NS5B. Thus, the knowledge and web server presented herein can be used in the design of more potent and specific drugs against the HCV NS5B.
KEYWORDS
data science, flavivirus, HCVpred, hepatitis C virus, NS5B, QSAR, structure–activity relationship, web server
1 | INTRODUCTION

Hepatitis C virus (HCV) is the etiological agent of liver disease that imposes a global health burden affecting more than 170 million people.[1] HCV belongs to the Flaviviridae family of viruses and its genome is comprised of a (+)-stranded RNA which encodes the polyprotein chain of 3,000 amino acids containing structural (i.e., core, E1, E2, and p7) and non-structural (i.e., NS2, NS3 serine proteinase (EC 3.4.21.98), NS4B, NS5A, and NS5B (EC 2.7.7.48)) proteins.[2,3] HCV is responsible for causing acute hepatitis and chronic liver diseases that eventually culminate in liver cirrhosis and hepatocellular carcinoma in patients without treatment.[4] The current standard of care for HCV therapy is pegylated α-interferon (PEG-IFN), either alone or in combination with ribavirin (RBV), a broad spectrum antiviral agent.[5–9] Even though there has been moderate success in treating the infected patients, the side-effects associated
J Comput Chem. 2020;1–15. wileyonlinelibrary.com/journal/jcc © 2020 Wiley Periodicals, Inc. 1

2 MALIK ET AL.
with these drugs are very severe including fatigue, haemolytic anemia, depression, and flu-like symptoms.[10] Therefore, it is of utmost importance to identify novel antiviral agents that can cure the overall HCV infection and eliminate or reduce the severe side-effects associated with PEG-IFN/RBV therapy.

Several viral proteins have been demonstrated as promising targets for HCV treatment. Among these, NS5B (i.e., an RNA-dependent RNA polymerase [RdRp]) is an essential non-structural protein component and is the key for the survival of HCV as it orchestrates the replication process, leading to the outburst of progeny viruses in infected cells.[11] The absence of a functional equivalent of NS5B in mammalian cells, together with the fact that it can be inhibited by a wide range of compounds acting on various allosteric sites, makes it one of the main targets for HCV treatment. In addition, NS5B is a 66 kDa protein comprising 591 amino acids which is found at the C terminus of the HCV polyprotein.[12] Furthermore, the crystal structure of NS5B shows the characteristic thumb, finger and palm domains with the active site being at the palm domain as shown in Figure 1.[13] However, aside from the catalytic site, several allosteric binding sites within NS5B have been identified which serve as potential targets for the development of new anti-HCV therapy as they have lesser off-target effects on the host cells. These allosteric sites include (a) “Thumb” pocket I, (b) “Thumb” pocket II, (c) “Palm” pocket I, (d) “Palm” pocket II, and (e) “Palm” pocket III. Therefore, based on the mode-of-action, inhibitors binding to the NS5B polymerase can be divided into two main categories: nucleoside inhibitors (NIs) and non-nucleoside inhibitors (NNIs). NIs are nucleoside/nucleotide analogues that bind to the NS5B active site and terminate the elongation of the HCV genome during RNA synthesis by preventing further incorporation of incoming nucleotides, while NNIs bind to the allosteric sites and induce conformational changes that block the initiation and elongation of viral RNA.[14,15] Therefore, NS5B polymerase inhibitors could serve as ideal drugs for the treatment of HCV infection.

FIGURE 1 Crystal structure of HCV NS5B RNA-dependent RNA polymerase showing different allosteric sites together with their bound co-crystallized non-nucleoside inhibitors. Finger, palm, and thumb domains are colored in red, blue, and green. Five different allosteric sites are shown with inhibitors, colored according to the following scheme: Palm I, II, and III in yellow (PDB IDs 3H2L, 3FQL, and 3LKH, respectively) while Thumb I and II are colored in red (PDB IDs 2BRL and 3FRZ, respectively) [Color figure can be viewed at wileyonlinelibrary.com]

Structural and chemical functions of compounds/inhibitors together with their biological activity are essential aspects in helping us understand the influence of physicochemical properties on their biological activity. Computer-aided drug design (CADD) represents a repertoire of computational tools which has become instrumental in chemical biology and computational approaches for the elucidation of structure–activity relationship. Informative chemical descriptors known as ligands are utilized
to make sense of the bioactivity via computational tools.[16,17] Moreover, the physicochemical properties of compounds can be computed using molecular descriptor calculation software that represents compounds as quantitative values and/or qualitative labels.[18] Quantitative structure–activity/property relationship (QSAR/QSPR) is a commonly used computational technique for building predictive models that can discern the influence of significant molecular descriptors governing their biological activities and chemical properties.[16,19] Such QSAR models have been successfully employed for modeling a wide range of targets such as antioxidant,[20–22] antimicrobial,[23–26] anticancer,[27,28] and antiviral[29,30] activities. As such, owing to the beneficial functions of QSAR models, herein, a large number of HCV inhibitors were used as the data set for the development of predictive models to facilitate rational drug design.

A considerable amount of literature has been published on in silico methods to identify novel inhibitors against NS5B polymerase, including QSAR, pharmacophore modeling, molecular dynamics (MD) simulation, and molecular docking.[31–34] Recently, Wei et al.[35] discovered novel inhibitors using different methods of virtual screening that included random forest (RF), e-pharmacophore, and molecular docking studies. In their study, the authors identified five novel anti-HCV compounds with IC50 values ranging from 2 to 23.84 μM targeting NS5B polymerase. In their review, Barreca et al.[36] discussed in detail various approaches to identify and optimize computational methods in the discovery of novel inhibitors binding the allosteric sites of NS5B. Additionally, Worachartcheewan et al.[30] employed SMILES-based descriptors to build QSAR models using the Monte Carlo method which showed high sensitivity, specificity and accuracy. Similarly, Rafiei et al.[37] also built QSAR models using Genetic Algorithm-Multiple Linear Regression (GA-MLR) in order to correlate the biological activities of NS5B inhibitors with chemical structure. Furthermore, Talele et al.[38] identified novel inhibitors against the “Thumb” site I of NS5B using an in silico approach and validated their predictions in vitro with IC50 values ranging from 7.7 to 68.0 μM. Similarly, Musmuca et al.[39] used a structure-based approach for 3D QSAR to identify new “Thumb” site II inhibitors with IC50 values ranging between 46 and 73.3 μM. In addition, Golub et al.[40] performed virtual screening of the inhibitors and validated the findings by in vitro assay. In their study, they used 169 inhibitors from the Otava database for virtual screening and identified 59 compounds that have high potential as anti-HCV agents. However, the in vitro assay identified only eight candidates that could neutralize the protein. Interestingly, the most active compound identified had the 4-hydrazinoquinazoline scaffold. Elhefnawi et al.[41] used structure-based screening in identifying novel inhibitors against NS5B. In their study, a compilation of all PDB files for each allosteric binding site was performed followed by the identification of common physical interactions that are shared across all inhibitors of the target allosteric site. Afterwards, the authors ranked and validated the outcome of their redocking procedure using both a machine learning algorithm (e.g., artificial neural network) and molecular dynamics. Thus, they concluded that novel inhibitors had lower predicted IC50 values and stable binding poses. Overall, computational strategies have been proven to be powerful and accessible tools for the identification of novel inhibitors against the NS5B polymerase.

In this study, we compiled a non-redundant set of 578 inhibitors with known IC50 values against the HCV NS5B. Briefly, the investigated compounds were described by a set of 12 fingerprint descriptors while the IC50 values were binned to qualitative class labels (i.e., active and inactive). After that, predictive models were constructed using the RF algorithm. The constructed QSAR model was shown to robustly classify compounds as being active or inactive against the HCV NS5B as deduced by the accuracy, sensitivity, specificity, and Matthews correlation coefficient, which were found to be statistically competent. Furthermore, the underlying important substructures that are crucial for the bioactivity were elucidated and characterized. Particularly, an in-depth analysis of the top 20 important descriptors revealed that the aromatic ring and alkyl side chains are important for NS5B polymerase inhibition. Thus, this knowledge can be used in the design of more potent and specific drugs against HCV.

2 | MATERIALS AND METHODS

2.1 | Data compilation and curation

A data set of inhibitors against HCV NS5B (Target ChEMBL ID: CHEMBL5375) was compiled from the ChEMBL database, release 23,[72] which initially comprised 1,903 bioactivity data points from 1,350 compounds. For this study, the IC50 was selected as the bioactivity unit for further investigation and this constituted a final data set of 865 compounds. As this study sets out to develop a classification model of HCV inhibitors, we defined thresholds of <1 and >10 μM for distinguishing actives from inactives, respectively. Moreover, compounds with intermediate biological activity (i.e., IC50 values ranging between 1 and 10 μM), which consisted of 284 inhibitors, were not selected for this study. Finally, a set of non-redundant and curated compounds consisting of 578 inhibitors was obtained and subjected to further analysis.

2.2 | Molecular descriptors

The fingerprint descriptors for each compound of the data set were computed using the PaDEL-Descriptor software.[73] In general, molecular descriptors are numerical values that characterize the properties of molecules and help in the analysis of chemical structure information. As such, they are very important for QSAR studies. In this study, a total of 12 molecular fingerprints belonging to nine classes were used for describing the chemical structures and this consisted of AtomPairs 2D, CDK fingerprinter, CDK extended, CDK graph only, E-state, Klekota–Roth, MACCS, PubChem, and Substructure. It is worth noting that the Substructure, Klekota–Roth, and 2D atom pairs fingerprints had two versions in which one encodes the presence/absence of the descriptor (i.e., 1 or 0 to denote the presence or absence) while the second is a count version that represents the frequency of the particular descriptor (i.e., 2, 3, 4, 5, etc.).
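The activity binning described in Section 2.1 can be illustrated with a short Python sketch. This is not the authors' code (their pipeline is written in R); the function name `label_activity` and the example IC50 values are hypothetical, and treating the boundary values of exactly 1 and 10 μM as intermediate is an assumption, since the text only states thresholds of <1 and >10 μM.

```python
from typing import Optional

ACTIVE_BELOW_UM = 1.0     # IC50 < 1 uM -> active (threshold from Section 2.1)
INACTIVE_ABOVE_UM = 10.0  # IC50 > 10 uM -> inactive (threshold from Section 2.1)


def label_activity(ic50_um: float) -> Optional[str]:
    """Return 'active', 'inactive', or None for the intermediate range."""
    if ic50_um < ACTIVE_BELOW_UM:
        return "active"
    if ic50_um > INACTIVE_ABOVE_UM:
        return "inactive"
    # 1-10 uM: intermediate activity, excluded from the modeling set
    # (boundary handling is an assumption of this sketch)
    return None


# Hypothetical compounds and IC50 values in uM, for illustration only
example_ic50 = {"cpd_A": 0.09, "cpd_B": 5.0, "cpd_C": 23.84, "cpd_D": 1.0}
labels = {name: label_activity(value) for name, value in example_ic50.items()}

# Keep only compounds that received a class label
kept = {name: label for name, label in labels.items() if label is not None}
print(kept)
```

Dropping the intermediate range widens the gap between the two classes, which is why 284 of the 865 compounds with IC50 data do not enter the modeling set.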
2.3 | Data filtering

In order to reduce complexity and bias in model building, constant and near-constant variables were removed from each fingerprint descriptor set. The variability of each fingerprint descriptor was assessed using an SD of 0.1 as the cut-off value, and descriptors with SD values >0.1 were selected for further analysis.

2.4 | Data balancing

To reduce the tendency of overfitting on imbalanced data, we used an undersampling technique by randomly selecting a subset of 166 active compounds from the initial set of 415 active compounds. Since a single data split may lead to a biased predictive model, we addressed this problem by following the suggestion of Puzyn et al., according to which the prediction models should be constructed from N independent data splits. Hence, in this study, the data was split into an 80/20 ratio, where 80% of the entire data set was used as the internal set and the remaining 20% served as the external set.

2.5 | Statistical analysis

In this study, we used six common descriptive statistical parameters, encompassing the minimum (Min), first quartile (Q1), median, mean, third quartile (Q3), and maximum (Max) parameters to identify the trend in individual descriptors of both active and inactive compounds. The outcome was visualized by box plot using the R package ggplot2.

2.6 | Multivariate analysis

For a CSAR model, its prediction performance will depend not only on the compound descriptors but also on the predictor used. In this study, we employ RF owing to previous successful models and interpretability in many applications. Below, we briefly introduce the basic concept and parameter optimization procedures of the RF algorithm used to generate the model.

Random forest is an ensemble classifier producing a number of decision trees, using a randomly selected subset of training samples and variables. The classification of RF starts at the root node, in which the data set at each node is split according to the value of descriptors that are selected such that compounds of different activities are predominantly moved to different branches.[74,75] Finally, the classification is obtained by aggregating the results of all trees through a majority vote.[76,77] The RF classifier was generated using the randomForest package in the R programming language. To accurately predict the activity of NS5B inhibitors, it is necessary to tune two parameters of the RF model, that is, the number of trees used for constructing the RF classifier (ntree) and the number of random candidate features (mtry). In this study, a 10-fold CV procedure was applied to tune the ntree parameter from the range of ntree {100, 200, …, 1,000}, while the mtry parameter was estimated using the tuneRF function in the randomForest package.[78] In order to provide a better understanding of the compound activity of NS5B inhibitors, the RF model was also used to identify informative descriptors by means of its efficient built-in feature importance estimator. The mean decrease of the Gini index (MDGI) was utilized to estimate the important descriptors. The descriptor with the largest value of MDGI represents the most important feature as that descriptor contributes most to the prediction performance of the model.

2.7 | Data set modelability

The modelability is necessarily dependent on the underlying relatedness of chemical structures and their bioactivities. Particularly, two highly similar compounds with striking differences in their bioactivity (i.e., one compound of the pair affording favorable bioactivity whereas the other affords poor bioactivity), otherwise known as activity cliffs, would be detrimental for machine learning algorithms in their attempt to correlate structures with related bioactivity. On the other hand, similar compounds with similar bioactivities (i.e., both compounds in a compound pair provide the same bioactivity class) would contribute favorably to the modelability of the data set. This so-called modelability index (MODI) has been introduced by Golbraikh et al.[79] MODI is defined as an activity class-weighted ratio of the number of nearest-neighbor pairs of compounds with the same activity class versus the total number of pairs. The statistical metric can be computed as follows:

Step 1: For any given pair of compounds C_i and C_j defined by an m-dimensional vector, the normalized Euclidean distance (D_normalized) is computed as follows:

$$d_{ij} = \lVert C_i - C_j \rVert = \sqrt{\sum_{k=1}^{m} \left(C_{ik} - C_{jk}\right)^2} \tag{1}$$

$$\bar{d}_i = \frac{\sum_{j=1}^{n} d_{ij}}{n - 1} \tag{2}$$

$$D_{\mathrm{normalized}} = \frac{D - \min D}{\max D - \min D} \tag{3}$$

where d_ij, \bar{d}_i, and n represent the distance score between two compounds, the mean Euclidean distance, and the number of compounds, respectively.

Step 2: For each compound in a data set, the MODI can be computed by identifying its first nearest neighbor (i.e., a compound with the smallest Euclidean distance) belonging to the same or different class as follows:
$$\mathrm{MODI} = \frac{1}{N_C} \sum_{i=1}^{N_C} \frac{N_i^{\mathrm{same}}}{N_i^{\mathrm{total}}} \tag{4}$$

where N_C is the number of classes (i.e., N_C = 2 for active and inactive compounds), N_i^same is the number of compounds of the ith class that have their first nearest neighbors belonging to the same ith class, and N_i^total is the number of compounds belonging to the ith class. A data set is considered to be modelable if the MODI index is greater than the threshold value of 0.65. The MODI index was computed using an in-house developed R code.

2.8 | Model validation

Predictions were tallied as true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The fitness of the model was then assessed using various statistical parameters including the overall prediction accuracy (Ac), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC).

$$\mathrm{Ac} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{5}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN} \times 100 \tag{6}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP} \times 100 \tag{7}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{8}$$

where TP, TN, FP, and FN represent the instances of true positives, true negatives, false positives, and false negatives, respectively.

2.9 | Applicability domain analysis

The main purpose of the applicability domain (AD) is to estimate the boundaries within which the model can make reliable and accurate predictions for compounds on the basis of similarity with the compounds on which the model was constructed. The compounds that satisfy the scope of the model are within the AD while the rest are outside the AD. In this study, we used the PCA bounding box to assess the AD of compounds from the training (internal) and testing (external) sets.

2.10 | Development of the HCVpred web server

The predictive model is deployed as the HCVpred web server by initially exporting the model as an RDS file (i.e., the model.RDS file). The Shiny package, which is essentially a web framework, is used as the frontend and backend of the web server. Technically, the web server is comprised of two major components: (a) user interface and (b) server, which are saved as ui.R and server.R, respectively. The ui.R file accepts input values (i.e., the SMILES notation of query compounds) and transfers this information to the server.R file where the SMILES notation is submitted to the PaDEL-Descriptor software for descriptor calculation followed by applying the computed descriptors as input to the predictive model (i.e., which is read from the model.RDS file) that will then classify the query compounds as either being active or inactive (i.e., the bioactivity class label). Such predicted class labels are then printed out on the web server whereby users can also download the predicted results as a CSV file.

2.11 | Reproducible research

Data and code used in this study are provided publicly on GitHub at https://github.com/chaninlab/hcvpred/.

3 | RESULTS AND DISCUSSION

A summary of the QSAR workflow for the modeling process performed herein is provided in Figure 2.

3.1 | Chemical space analysis

One of the main reasons for performing chemical space analysis is to explore the characteristic differences between the active and inactive compounds. First, we will explore the general chemical space by visualizing the distribution of actives and inactives as a function of the molecular weight (MW) versus the Ghose–Crippen–Viswanadhan octanol–water partition coefficient (ALogP). Second, we will then compare actives and inactives as a function of the Lipinski's rule-of-five (Ro5) descriptors. Using the Ro5 method, it is possible to predict whether a chemical compound will exhibit the pharmacological or biological property required for an orally active drug in humans. These properties are based on the observation that most orally active drugs are relatively small and moderately lipophilic molecules, and comprise MW, ALogP, number of hydrogen bond donors (nHBDon), and number of hydrogen bond acceptors (nHBAcc).[42] Visualization of the MW as a function of ALogP is shown in Figure 3. As can be observed, most of the compounds are clustered within the MW range of 300–600 Da with an ALogP in the range of 1.5–5. In addition, data visualization and statistical analysis of the Ro5 are shown in Figure 4. It can be seen that most of the compounds follow the criteria of the Ro5 with a MW of <500 Da, and ALogP, nHBDon, and nHBAcc values of <10. Furthermore, the results from statistical analysis display a significant difference between the active and inactive compounds using the Mann–Whitney U test. Most of the active compounds were larger than the inactive compounds, as observed from the mean value of the box plots. Similarly, the ALogP values of the active compounds
were greater than the inactive compounds. However, the nHBDon values of both active and inactive compounds were similar whereas the nHBAcc values of the active compounds were observed to be less than those of the inactive compounds.

FIGURE 2 Schematic summary of the QSAR modeling workflow presented in this study. Workflow steps shown: initial data set (1,029 compounds); data curation; curated data set (578 compounds, split into an active set of 412 compounds and an inactive set of 166 compounds); descriptor calculation; descriptor filtering; balanced data set; random data splitting into an internal set (248 compounds) and an external set (84 compounds); training, cross-validation, and external validation; RF modeling; predictive model.

FIGURE 3 Chemical space analysis of HCV inhibitors, plotted as ALogP (y-axis) versus MW (x-axis). Active and inactive compounds are shown in green and red colors, respectively.

Moreover, the AD of the classification model was estimated using the PubChem fingerprints as input for PCA analysis and the resulting PCA scores plot (i.e., also referred to herein as the bounding box plot) is shown in Figure 5. The data set comprising of 578 compounds after balancing was divided further into two subsets consisting of internal (80%) and external sets (20%) using the Kennard–Stone algorithm. It should be noted that the internal set is used as the training set for predictive model construction (i.e., for subsequent prediction on the external set) and is also subjected to 10-fold CV. It was found that the chemical space distribution of the external set falls well within the boundaries of the internal set. Thus, results indicated that the AD is well defined for the CSAR model developed herein.

3.2 | QSAR modeling

In order to develop a robust QSAR model, this study follows the guidelines of the OECD[43] that consist of the following main points: (a) the data set has a defined endpoint, (b) an unambiguous learning algorithm, (c) a defined AD of the QSAR model, (d) appropriate measures of goodness-of-fit, robustness, and predictivity, and (e) mechanistic interpretation of the QSAR model. With respect to the last point, which is to develop interpretable QSAR models, this study makes use of interpretable molecular fingerprints computed using the PaDEL-Descriptor software. In particular, five (PubChem, Substructure, Substructure count, Klekota–Roth and Klekota–Roth count) of the 12 fingerprints are readily interpretable and Table 1 provides a list of these fingerprints along with their description.

Before the development of the predictive model, the activity cliffs in the data set must be detected and accordingly treated using the modelability score of the data set or MODI index. This method is used to identify the probability of obtaining the CSAR model by sorting the active compounds from inactive compounds. It was found that PubChem fingerprints have a MODI index of 0.88 which is >0.65
thereby indicating that the data set is reliable for building a classification model.

In this study, we developed the CSAR model based on the RF algorithm in order to differentiate the active and inactive NS5B inhibitors. Table 2 shows the results from 100 independent runs for the RF model with 12 different types of fingerprints over an internal validation test, 10-fold CV, and an external validation test. The best averaged values were observed as Ac 87.85 ± 1.87% and MCC 0.75 ± 0.03 which was achieved for the PubChem fingerprint descriptors as evaluated by 10-fold CV. Concurrently, Klekota–Roth and Substructure descriptors also performed well harboring the second and third highest averaged values for Ac and MCC in which models built using Klekota–Roth fingerprints afforded Ac and MCC values of 84.41 ± 2.18% and 0.66 ± 0.04, respectively, while models using Substructure fingerprints afforded Ac and MCC values of 85.45 ± 2.08% and 0.70 ± 0.04, respectively. Interestingly, the external validation for PubChem, Klekota–Roth, and Substructure fingerprint descriptors was better than the rest of the descriptors as shown in Table 3. Considering the results from 10-fold CV and the external validation tests, it can thus be seen that these fingerprint descriptors (i.e., PubChem, Klekota–Roth, and Substructure fingerprint) are superior to the other fingerprint classes.

FIGURE 4 Box plot of Lipinski's rule-of-five descriptors for investigated HCV inhibitors. Panels: MW, ALogP, nHBDon, and nHBAcc, each comparing active and inactive compounds.

FIGURE 5 Bounding box plot (PC1 versus PC2) of internal (pink) and external (cyan) sets for assessing the applicability domain.

3.3 | Mechanistic interpretation of feature importance

The 20 top-ranked PubChem descriptors as derived from the RF model are shown in Figure 6 and described in Table 4, which consisted of descriptors pertaining to the following classes: seven aromatic, four aliphatic hydrocarbons, two aldehyde/ketone, two nitrogen-containing compounds, two alcohols, and two atom counts. Therefore, in this section the contributions of substructure towards the overall functioning of compounds will be discussed separately in the following sections.
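The descriptor ranking discussed in this section relies on the mean decrease of the Gini index (Section 2.6). As a rough illustration of the quantity being averaged, the Python sketch below computes the Gini impurity decrease for a single split on one binary fingerprint bit. In a random forest these decreases are accumulated over every node that splits on a descriptor and averaged over all trees, so this single-split version is a deliberate simplification. The function names and the class counts are hypothetical and are not taken from the authors' R code or from Table 4.

```python
def gini(n_active: int, n_inactive: int) -> float:
    """Gini impurity of a node holding the given class counts."""
    total = n_active + n_inactive
    if total == 0:
        return 0.0
    p = n_active / total
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def gini_decrease(bit_on: tuple, bit_off: tuple) -> float:
    """Impurity decrease when a node is split on one binary descriptor.

    bit_on and bit_off are (n_active, n_inactive) counts for the compounds
    in which the fingerprint bit is set and not set, respectively.
    """
    n_on, n_off = sum(bit_on), sum(bit_off)
    n = n_on + n_off
    parent = gini(bit_on[0] + bit_off[0], bit_on[1] + bit_off[1])
    children = (n_on / n) * gini(*bit_on) + (n_off / n) * gini(*bit_off)
    return parent - children


# Hypothetical counts for a balanced node of 100 actives and 100 inactives
informative = gini_decrease(bit_on=(90, 10), bit_off=(10, 90))
uninformative = gini_decrease(bit_on=(50, 50), bit_off=(50, 50))
print(round(informative, 3), round(uninformative, 3))
```

A bit that separates the classes well yields a large decrease, whereas a bit that is equally common among actives and inactives yields none, which is the intuition behind ranking descriptors by MDGI.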
TABLE 1 List of molecular fingerprints employed in this study for representing chemical structures of the HCV NS5B inhibitor dataset

Fingerprint | Number of features | Description | Reference
CDK | 1,024 | Fingerprint of length 1,024 and search depth of 8 | [44]
CDK extended | 1,024 | Extends the fingerprint with additional bits describing ring features | [44]
CDK graph only | 1,024 | A special version that considers only the connectivity and not bond order | [44]
E-state | 79 | Electrotopological state atom types | [45]
MACCS | 166 | Binary representation of chemical features defined by MACCS keys | [46]
PubChem | 881 | Binary representation of substructures defined by PubChem | [47]
Substructure | 307 | Presence of SMARTS patterns for functional groups | [48]
Substructure count | 307 | Count of SMARTS patterns for functional groups | [48]
Klekota–Roth | 4,860 | Presence of chemical substructures | [49]
Klekota–Roth count | 4,860 | Count of chemical substructures | [49]
2D atom pairs | 780 | Presence of atom pairs at various topological distances | [50]
2D atom pairs count | 780 | Count of atom pairs at various topological distances | [50]
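Several fingerprints in Table 1 come in a presence version and a count version. The difference can be shown with a toy Python sketch. The substructure keys and the list of matches below are hypothetical and do not correspond to any real PaDEL-Descriptor key set; in practice the matches would come from substructure searching against a molecule, which is not reproduced here.

```python
from collections import Counter

# Hypothetical substructure keys defining the fingerprint positions
keys = ["aromatic_ring", "carbonyl", "hydroxyl", "primary_amine"]

# Hypothetical substructure matches found in one molecule
matches_in_molecule = ["aromatic_ring", "aromatic_ring", "carbonyl", "aromatic_ring"]

occurrences = Counter(matches_in_molecule)

# Count version: how many times each substructure occurs
count_fp = [occurrences.get(key, 0) for key in keys]

# Presence version: 1 if the substructure occurs at least once, else 0
presence_fp = [1 if n > 0 else 0 for n in count_fp]

print(count_fp)
print(presence_fp)
```

The presence version discards multiplicity, so two molecules with one and three aromatic rings look identical at that position, while the count version keeps them apart.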
3.3.1 | Aliphatic hydrocarbons

This group consisted of the PubchemFP712, PubchemFP335, PubchemFP697, and PubchemFP696 descriptors, where the former two were the first and fourth ranked descriptors while the latter two were the 12th and 16th ranked descriptors. A closer look at the chemical structure of the fingerprints revealed that PubchemFP712 and PubchemFP697 were non-linear, branched aliphatic hydrocarbons while PubchemFP335 and PubchemFP696 were linear, non-branched aliphatic hydrocarbons. Furthermore, an in-depth analysis was performed by computing the ratio of active to inactive compounds so as to reveal any inherent preference for a functional moiety that the active or inactive compounds may have. Branching of the aliphatic hydrocarbons apparently did not exert any distinct influence on the active/inactive ratio. Interestingly, the first and fourth ranked descriptors (i.e., PubchemFP712 and PubchemFP335) provided a relatively high active/inactive ratio (i.e., 9.95 for the former, corresponding to 219 actives and 22 inactives, and 13.20 for the latter, corresponding to 198 actives and 15 inactives) and these descriptors contained no more than six carbon atoms (i.e., six carbon atoms for the former and four carbon atoms for the latter). On the other hand, the 12th and 16th descriptors (i.e., PubchemFP697 and PubchemFP696) yielded a much lower active/inactive ratio (i.e., 3.49 for the former, corresponding to 227 actives and 65 inactives, and 2.68 for the latter, corresponding to 411 actives and 153 inactives) and these descriptors consisted of eight carbon atoms.

3.3.2 | Aromatic fingerprints

This group contained the maximum number of PubChem fingerprints with seven out of the top 20 important features (i.e., PubchemFP758, PubchemFP821, PubchemFP192, PubchemFP185, PubchemFP199, PubchemFP193, and PubchemFP713). They all belonged to the general class of compounds harboring the aromaticity/cyclic moiety. The foremost feature in this class was PubchemFP758, which ranked fifth among the top 20 features. Interestingly, it was found that the ratio of active/inactive was 29 (i.e., 174 actives to six inactives), which is very high, making it evident that this descriptor plays an important role among the active compounds. In addition, the cyclic/aromatic moiety plays a very important role by interacting with aromatic/hydrophobic residues in either the palm domain or thumb domain. Besides that, two polar residues (i.e., Ser476 and Tyr447) were also involved in hydrophobic interactions. As mentioned earlier, these moieties are found mainly in two types of interactions with the residues of NS5B: (a) π–π interactions with the cyclic/aromatic moiety and (b) hydrogen bonding with the acceptor in the inhibitor. Moreover, a number of studies have shown the interaction of the aromatic moiety with the corresponding residues in the vicinity of the compound. For instance, Chen et al.[51] have shown π–π interactions between the aromatic ring of benzothiadiazines and residues Tyr448 and Phe193 of the palm domain I binding pocket of NS5B. This finding was supported by another study by Tedesco et al.[52] where in both cases, it was shown that the aromatic interaction is indispensable for the efficacy of the compound to inhibit the functional activity of the protein. In another independent study by Kolykhalov et al.,[53] the removal of the aromatic moiety led to the reduction of hydrophobic interactions and eventually reduced the activity of the compound. Additionally, Ikegashira et al.[54] demonstrated a unique and successful approach in the development of potent inhibitors directed towards the thumb domain I. In their study, they found that the best enzymatic and cell-based inhibitory activities were achieved with varying ring sizes (five- to eight-membered rings) in the compound. They observed that benzodiazepinoindole displays remarkable potency with an IC50 of 0.09 μM. This drug is currently in phase I of the clinical trials. However, Wang et al.[55] showed the involvement of similar interactions between the aromatic moiety of the compound and residues in the thumb domain II, namely Tyr477, Ile482, Leu497, Trp528, and Leu419. Furthermore, there are numerous studies that show similar results[56] and it is clear that most of the interactions involve the aromatic moiety between the inhibitor and residues. Of
TABLE 2 Performance summary of CSAR models for predicting NS5B inhibitors of HCV
Training set 10-fold CV set External set
Descriptor class N Ac Sn Sc MCC Ac Sn Sc MCC Ac Sn Sc MCC

CDK 1,024 99.59 ± 0.38 99.74 ± 0.4 99.44 ± 0.66 0.99 ± 0.07 87.79 ± 1.93 90.30 ± 2.13 85.63 ± 2.34 0.76 ± 0.04 88.73 ± 3.36 91.63 ± 3.94 86.66 ± 5.00 0.78 ± 0.07
CDK extended 1,024 99.70 ± 0.32 99.68 ± 0.45 99.73 ± 0.44 0.99 ± 0.06 87.88 ± 2.02 90.15 ± 2.55 85.94 ± 2.25 0.76 ± 0.04 88.00 ± 3.13 90.32 ± 4.45 86.37 ± 4.30 0.76 ± 0.06
Estate 79 95.83 ± 1.03 95.67 ± 1.35 96.03 ± 1.56 0.92 ± 2.01 85.68 ± 2.01 86.05 ± 2.22 85.38 ± 2.44 0.71 ± 0.04 86.32 ± 4.30 87.18 ± 4.97 86.00 ± 5.46 0.73 ± 0.08
CDK graph only 1,024 98.52 ± 0.68 98.52 ± 0.90 98.54 ± 0.98 0.97 ± 0.01 87.33 ± 1.74 89.17 ± 2.21 85.73 ± 2.04 0.74 ± 0.03 87.34 ± 3.53 89.56 ± 5.56 85.98 ± 4.73 0.75 ± 0.07
MACCS 166 99.25 ± 0.06 99.36 ± 0.70 99.14 ± 0.80 0.98 ± 0.01 88.00 ± 1.91 88.00 ± 2.05 88.08 ± 2.43 0.76 ± 0.03 89.17 ± 3.40 89.12 ± 4.54 89.78 ± 4.84 0.78 ± 0.07
PubChem 880 99.29 ± 0.58 99.33 ± 0.76 99.25 ± 0.79 0.98 ± 0.01 87.85 ± 1.87 88.80 ± 2.36 87.02 ± 2.12 0.75 ± 0.03 88.06 ± 3.43 89.37 ± 4.51 87.25 ± 4.50 0.76 ± 0.07
Substructure 307 95.95 ± 1.60 95.58 ± 2.05 96.37 ± 1.70 0.91 ± 0.03 84.30 ± 2.18 85.03 ± 2.75 83.66 ± 2.29 0.68 ± 0.04 84.73 ± 4.10 85.92 ± 5.81 84.26 ± 4.92 0.69 ± 0.08
Substructure count 307 99.23 ± 0.78 99.02 ± 1.06 99.45 ± 0.73 0.98 ± 0.015 85.45 ± 2.08 85.48 ± 2.49 85.52 ± 2.39 0.70 ± 0.04 86.5 ± 3.62 86.46 ± 4.87 87.08 ± 4.73 0.73 ± 0.07
Klekota–Roth 4,860 98.16 ± 1.17 98.62 ± 1.71 97.77 ± 1.35 0.96 ± 0.02 83.28 ± 2.02 84.25 ± 2.49 82.48 ± 2.49 0.66 ± 0.04 84.71 ± 3.31 85.92 ± 5.37 84.33 ± 4.62 0.70 ± 0.06
Klekota–Roth count 4,860 98.91 ± 0.94 99.21 ± 1.30 98.64 ± 1.08 0.97 ± 0.01 83.41 ± 2.18 84.03 ± 2.59 82.94 ± 2.69 0.66 ± 0.04 83.84 ± 3.70 85.01 ± 5.42 83.39 ± 4.76 0.68 ± 0.07
2D atom pairs 780 98.04 ± 0.96 98.43 ± 0.98 97.66 ± 1.46 0.96 ± 0.02 84.43 ± 2.63 85.15 ± 2.54 83.82 ± 3.21 0.68 ± 0.05 85.45 ± 4.48 85.94 ± 5.38 85.59 ± 5.78 0.71 ± 0.09
2D atom pairs count 780 99.81 ± 0.25 99.87 ± 0.29 99.75 ± 0.40 0.99 ± 0.00 86.08 ± 1.88 87.18 ± 2.17 85.11 ± 2.24 0.72 ± 0.04 87.14 ± 3.54 87.70 ± 5.09 87.23 ± 4.73 0.74 ± 0.07
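
The four statistics reported in Table 2 can all be derived from the counts of a binary confusion matrix. The following is a minimal Python sketch of the standard definitions of accuracy (Ac), sensitivity (Sn), specificity (listed as Sc in Table 2), and MCC. The function name and the example counts are illustrative assumptions and are not taken from the study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity (all in %) and MCC
    from the four cells of a binary confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total
    sensitivity = 100.0 * tp / (tp + fn)   # fraction of actives recovered
    specificity = 100.0 * tn / (tn + fp)   # fraction of inactives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, mcc

# Hypothetical counts for a single data split (not data from the paper)
ac, sn, sp, mcc = classification_metrics(tp=74, tn=29, fp=4, fn=8)
print(f"Ac={ac:.2f}%  Sn={sn:.2f}%  Sp={sp:.2f}%  MCC={mcc:.2f}")
```

In a repeated-split setting such as the one summarized in Table 2, a function of this kind would be evaluated once per split and the results summarized as mean ± standard deviation.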
note, even though the interacting residues are widely distributed across NS5B, most of the inhibitors have the aromatic/cyclic hydrocarbon as one of their moieties.

3.3.3 | Aldehyde/ketone fingerprints

This group is composed of only two members, namely PubchemFP537 and PubchemFP432, ranked as the second and ninth among the important PubChem descriptors. Although they are ranked high, the ratio of active to inactive is very low (<1). This indicates that these descriptors are part of the general structural composition of the ligand but might not be crucial for the functional activity of the compounds. However, Love et al.[57] illustrated that the hydrophobic interactions and key protein-inhibitor interactions also comprise direct hydrogen bonds between the enol-ketone oxygen of the inhibitor and the

TABLE 3 Difference in MCC values for training set versus 10-fold CV set (MCCTr–MCCCV) and for training set versus external set (MCCTr–MCCExt)
Descriptor class       MCCTr–MCCCV     MCCTr–MCCExt
CDK                    0.23 ± 0.03     0.21 ± 0.05
CDK extended           0.23 ± 0.034    0.23 ± 0.055
Estate                 0.20 ± 0.019    0.18 ± 0.064
CDK graph only         0.22 ± 0.02     0.22 ± 0.05
MACCS                  0.22 ± 0.026    0.19 ± 0.05
PubChem                0.22 ± 0.026    0.22 ± 0.056
Substructure           0.23 ± 0.011    0.22 ± 0.05
Substructure count     0.27 ± 0.026    0.25 ± 0.056
Klekota–Roth           0.30 ± 0.01     0.265 ± 0.04
Klekota–Roth count     0.30 ± 0.025    0.29 ± 0.055
2D atom pairs          0.27 ± 0.033    0.24 ± 0.07
2D atom pairs count    0.28 ± 0.04     0.28 ± 0.07
FIGURE 6 Box plots of top 20 features as deduced from the Gini index from RF models built using PubChem fingerprint [Color figure can be viewed at wileyonlinelibrary.com]
(Plot axes: PubChem fingerprints PubchemFP712, PubchemFP537, PubchemFP335, PubchemFP758, PubchemFP12, PubchemFP2, PubchemFP821, PubchemFP432, PubchemFP646, PubchemFP192, PubchemFP697, PubchemFP451, PubchemFP185, PubchemFP574, PubchemFP696, PubchemFP199, PubchemFP193, PubchemFP655, PubchemFP713, and PubchemFP548 versus Gini index, 0 to 5.)
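
Figure 6 ranks features by the Gini index of the RF models. As a rough illustration of what that quantity measures, the sketch below computes the decrease in Gini impurity produced by a single split on one binary fingerprint bit. It uses the PubchemFP758 counts quoted in Section 3.3.2 (174 actives and 6 inactives carrying the bit) and assumes class totals of 412 actives and 166 inactives for the 578 compounds. The RF importance accumulates this kind of decrease over many nodes and trees, so the number printed here is not the value plotted in Figure 6. All function names are illustrative.

```python
def gini_impurity(n_active, n_inactive):
    """Gini impurity 1 - sum(p_k^2) of a node holding the given class counts."""
    total = n_active + n_inactive
    if total == 0:
        return 0.0
    p_active = n_active / total
    p_inactive = n_inactive / total
    return 1.0 - (p_active ** 2 + p_inactive ** 2)

def gini_decrease(on_counts, off_counts):
    """Impurity decrease when a node is split by a binary feature.

    on_counts and off_counts are (actives, inactives) tuples for the
    compounds with the bit set and unset, respectively.
    """
    on_a, on_i = on_counts
    off_a, off_i = off_counts
    n_on, n_off = on_a + on_i, off_a + off_i
    n = n_on + n_off
    parent = gini_impurity(on_a + off_a, on_i + off_i)
    children = ((n_on / n) * gini_impurity(on_a, on_i)
                + (n_off / n) * gini_impurity(off_a, off_i))
    return parent - children

# Assumed class totals: 412 actives and 166 inactives (578 compounds)
on = (174, 6)                 # compounds carrying the bit (from the text)
off = (412 - 174, 166 - 6)    # remaining compounds (derived from the assumption)
print(f"Gini decrease for this single split: {gini_decrease(on, off):.4f}")
```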
TABLE 4 List of the 20 top-ranking PubChem fingerprints and their corresponding descriptions
Fingerprint SMARTS pattern Substructure description

PubchemFP712 C–C(C)–C(C)–C 2,3-Dimethylbutane
PubchemFP537 O=C–C–O 2-Hydroxyacetaldehyde
PubchemFP335 C(C)(C)(C)(H) 1-Butane
PubchemFP758 Cc1c(N)cccc1 2-Methylaniline
PubchemFP12 ≥16C ≥16 carbon atoms
PubchemFP2 ≥16H ≥16 hydrogen atoms
PubchemFP821 CC1C(N)CCCC1 2-Methylcyclohexan-1-amine
PubchemFP432 C(–C)(–C)(=O) Propan-2-one
PubchemFP646 O=C–N–C–[#1] N-methylformamide
PubchemFP192 ≥3 any ring size 6 ≥3 six-membered rings of any type
PubchemFP697 C–C–C–C–C–C(C)–C 2-Methylheptane
PubchemFP451 C(–N)(=O) Formamide
PubchemFP574 O–C–C:C–C (2E)-but-2-en-1-ol
PubchemFP696 C–C–C–C–C–C–C–C Octane
PubchemFP193 ≥3 saturated or aromatic carbon-only ring size 6 ≥3 saturated or aromatic carbon-only six-membered rings
PubchemFP655 O–C–C–C:C But-3-en-1-ol
PubchemFP713 Cc1ccc(C)cc1 1,4-Xylene
PubchemFP548 O–C–C:C Prop-2-en-1-ol
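
The active/inactive ratios discussed in Section 3.3 follow directly from the number of active and inactive compounds that contain a given fingerprint. The sketch below recomputes them from the counts quoted in the text; the dictionary layout and the function name are illustrative assumptions, and the results agree with the quoted ratios to within rounding.

```python
# (actives, inactives) containing each fingerprint, as quoted in Sections 3.3.1 and 3.3.2
counts = {
    "PubchemFP712": (219, 22),
    "PubchemFP335": (198, 15),
    "PubchemFP697": (227, 65),
    "PubchemFP696": (411, 153),
    "PubchemFP758": (174, 6),
}

def active_inactive_ratio(n_active, n_inactive):
    """Ratio of actives to inactives; infinite when no inactive carries the bit."""
    return float("inf") if n_inactive == 0 else n_active / n_inactive

ratios = {fp: active_inactive_ratio(*c) for fp, c in counts.items()}

# List the fingerprints from the most to the least active-enriched
for fp, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fp}: {ratio:.2f}")
```

A ratio above 1 means the substructure occurs more often among actives than inactives. As noted for the aldehyde/ketone fingerprints, a feature can rank highly by Gini index and still have a ratio below 1.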
backbone amide NH of Leu476 along with water-mediated H-bond to Tyr477. What does this tell us?

3.4 | Structural interpretation of allosteric binding site

NNIs are classified based on their binding position on NS5B of HCV into four main groups consisting of PS-I, PS-II, TS-I, and TS-II. Even though they bind at different sites on the target protein, most of the inhibitors retain their aromatic moiety within the core structure or as a side chain. Therefore, in this section, the contributions of inhibitor binding to different allosteric sites will be explored.

3.4.1 | Thumb site I inhibitors

Thumb site I has attracted the attention of many pharmaceutical companies and researchers around the world as it is close to the active site. In addition, being on the upper section of NS5B makes it more accessible for new inhibitors. From the crystallographic studies, it was found that inhibitors specific to this site bind to the pockets occupied by the side chains of Leu30 and Leu31. These residues lie between Ser29 and Arg32, which are critical for GTP binding, thus hindering the interaction indirectly and inhibiting the initiation process of NS5B. Most of the inhibitors interact mainly via hydrophobic interactions. The only exception is the presence of a salt bridge/hydrogen bond between the inhibitor and the guanidinium group of Arg503 as shown in Figure 1a. From the X-ray structure of the Thumb site I inhibitors, Di Marco et al.[58] demonstrated that the binding site was mostly lipophilic with the aromatic ring of the inhibitor stacked against the Pro495 side chain while Arg498 and Arg503 provided stability to the complex (PDB ID 2BRK). Moreover, most of the known active inhibitors have aromatic/cyclic moieties that form key interactions with residues of the target NS5B such as Phe93, Phe193, Tyr415, Tyr429, Tyr448, Trp500, and Trp550.

3.4.2 | Thumb site II inhibitors

Thumb site II, a narrow and shallow hydrophobic cavity near the base of the thumb domain, is located approximately 35 Å from the active site. The most prominent and critical interaction with lipophilic inhibitors involves the residues within the “dimple” region as defined by Leu419, Trp528, Tyr477, and Arg422. Besides that, ligands also form hydrogen bond interactions with Ser476 and Tyr477 via the hydroxyl group and backbone amides. One of the most advanced and well-studied inhibitors is hydroxydihydropyranone.[59] Even though the exact mechanism of action is not well established, it
has been hypothesized that these inhibitors block the association of NS5B with other proteins/RNA, which are critical for replication. Using the structural information, Love et al.[57] showed that dihydropyrones can inhibit viral replication by 30-fold with sub-micromolar potency in biochemical assays but lacked potency in cell-based assays. However, further optimization of dihydropyrones led to the discovery of filibuvir, a potent inhibitor of HCV NS5B with a good PK profile. In addition, filibuvir has been tested in combination with other drugs and is currently in Phase 2b of clinical trials with promising results. Although the mechanism is unknown, the mode of action of such inhibitors most likely involves interference with protein conformations that are critical for RNA synthesis. However, Biswal et al.[60] were the first to demonstrate a plausible mechanism of action for this class of inhibitors. They demonstrated that filibuvir and similar inhibitors interact using hydrophobic interactions with Lys419 and Met423 and direct hydrogen bonds with Ser476 and Tyr477 as shown in Figure 1b. The binding of these inhibitors induced conformational changes in the thumb site II, in particular within the positively charged helix (helix T), which is in close proximity to the GTP binding site, thus hindering RNA replication in vivo. Such changes result in the reduction of GTP affinity and stability of NS5B, which further inhibits the polymerase function of the protein.

3.5 | Palm site I inhibitors

The palm I (PS-1) binding site is located at the junction between the thumb and palm domains near the active site of NS5B. Among all the NS5B allosteric sites, the palm domain binding site is not only the largest binding site but it is also deeply recessed and hydrophobic in nature. The combination of these unique properties allows a broad range of chemotypes to act as inhibitors. The key interactions at PS-1 include hydrogen bonds with Ser556, polar interactions with Tyr448 and hydrophobic contacts with Pro197, Leu384, Met414, Tyr415, and Tyr448 (π–π interaction between the side chain and the aromatic moiety of the inhibitor) as shown in Figure 1c. These findings were demonstrated by Yeung et al.[61] in the co-crystal structure of the complex. Furthermore, most of the inhibitors block HCV replication by interfering with the
FIGURE 7 Screenshots of the HCVpred web server. Screenshots before (a) and after (b) submission, respectively, of the input SMILES notation [Color figure can be viewed at wileyonlinelibrary.com]
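
Section 3.7 describes the server workflow as SMILES input, descriptor calculation, RF prediction of the class label with class probabilities, and a downloadable CSV file. The sketch below illustrates only the last two steps under stated assumptions: the probabilities are hypothetical placeholders for the output of a trained classifier, the 0.5 cut-off and the CSV column names are assumptions, and none of this is the HCVpred source code.

```python
import csv
import io

def label_from_probability(p_active, threshold=0.5):
    """Assign the class label from the predicted probability of the active class."""
    return "active" if p_active >= threshold else "inactive"

def predictions_to_csv(predictions):
    """Serialize (SMILES, P(active)) pairs to CSV text, one row per compound."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Column names are hypothetical; the real server output may differ
    writer.writerow(["smiles", "predicted_class", "p_active", "p_inactive"])
    for smiles, p_active in predictions:
        writer.writerow([smiles, label_from_probability(p_active),
                         f"{p_active:.3f}", f"{1.0 - p_active:.3f}"])
    return buffer.getvalue()

# Hypothetical probabilities standing in for the RF model output
example = [("c1ccccc1C(=O)O", 0.82), ("CCO", 0.10)]
print(predictions_to_csv(example))
```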
interaction between the inner thumb/palm domain near the active site. Due to extensive research both in academia and by pharmaceutical companies, successful compounds directed against PS-1 are currently in clinical development.[62] Benzothiadiazine derivatives such as dasabuvir (ABT-333), setrobuvir (RG-7790), and ABT-072 have been identified as NS5B PS-1 binders. Dasabuvir is the first approved NNI drug for the treatment of HCV.[63] Additionally, setrobuvir, in combination with and without IFN, exhibits strong antiviral activity.[64] Moreover, another phase IIa study of triple therapy with two direct-acting antivirals (DAAs; ABT-450 and ABT-072) and RBV showed a higher rate of viral clearance after 12 weeks of treatment.[65] It is noteworthy that benzofurans including nesbuvir are currently the only class of molecules that have been reported to bind to the palm site II of the HCV NS5B.

3.6 | Palm site II and III inhibitors

In-depth studies have been carried out on benzofuran inhibitors including nesbuvir as they are the only class of compounds which bind to the palm site II (PS-II) of NS5B.[66] Furthermore, nesbuvir showed promising results during phase I clinical trials and progressed to phase II. However, it was shown to increase the level of liver enzymes in patients infected with HCV and thus the trials were canceled.[67] Nevertheless, during the trial phase, the crystal structure of a PS-II inhibitor in complex with NS5B was elucidated, which allowed researchers to identify the key interactions as shown in Figure 1d.[68] The figure shows the benzofuran core, which forms two major hydrogen bonds with residues Ser365 and Arg500, while the fluorophenyl group forms hydrophobic interactions with residues Ile363, Val370, Leu360, Leu204, Val321, Leu314, and Leu384. Concurrently, the crystal structure of oxyfluorobenzamides, a class of inhibitors providing inhibition of HCV replication in vitro with a submicromolar range of activity, was observed to bind near the palm pockets I and III.[69] The inhibitor forms hydrogen bonds with residues Met414, Asn316, and Tyr195 while also forming a significant number of hydrophobic interactions with residues Pro197, Tyr415, Ile447, Tyr448, Trp550, and Phe551 as shown in Figure 1e. In the last few years, the focus of research has shifted to the palm sites II/III inhibitors, judging by the several patents filed by various pharmaceutical companies (e.g., Abbott Laboratories, Biota, Bristol-Myers, GlaxoSmithKline, and Presidio Pharmaceutical). One such compound from Presidio Pharmaceutical (PPI-383) has shown promising anti-viral activity against major HCV genotypes (e.g., 1a/b, 2a, 3, and 4a).[70] A key feature of this compound is that it binds moderately with NS5B and exhibits good oral bioactivity in animal models. However, the crystal structure of this compound has not yet been revealed and it is currently in phase II of clinical trials.[71]

3.7 | Model deployment as the HCVpred web server

In a typical life cycle of a predictive model, once models are published and results are displayed in the publication, this essentially marks the end of the model's utility. In a way, the model has served its purpose of making predictions and providing useful insights on the underlying important features. In an effort to extend the life cycle of the predictive model, we believe that deploying the predictive model as a public web server that would allow scientists, researchers, and graduate students to make use of the model's predictive insights would greatly add value to the model while at the same time benefiting the scientific community as a whole. Moreover, the deployed model also empowers biologists or chemists with no background in computer science to make use of the predictive model. Thus, the predictive model described herein is deployed as a public web server called HCVpred and is available at http://codes.bio/hcvpred/.

In a nutshell, the HCVpred web server accepts as input the SMILES notation of the chemical structures of interest and submits this to the server for descriptor calculation (i.e., using PaDEL-Descriptor software) followed by applying the trained RF predictive model to make predictions on the bioactivity class labels (i.e., whether the compound will be active or inactive) along with the corresponding probability class values for the active and inactive classes. Screenshots of the HCVpred web server before and after submission of the input SMILES notation are provided in Figure 7a,b. As noted in Figure 7b, predicted results are given in the text box found underneath the Status/Output section while a Download CSV button will appear after the prediction has been made so that users can save the predicted results to their own computer.

4 | CONCLUSION

With the increasing incidence of global HCV infection, the development of novel and promising anti-HCV compounds is urgently needed. HCV NS5B is an ideal drug target, mainly due to its central role in viral replication and its fewer off-target effects in humans. Furthermore, combinations of various computational approaches revealed the structure–activity relationships of NS5B inhibitors, which are essential for the rational drug design of promising therapeutic agents against HCV infection. In this study, a total of 578 compounds were compiled from the ChEMBL database and pre-processed so as to remove redundancy. Twelve sets of fingerprint descriptors were used to construct the QSAR models and their performance was comparatively evaluated. From the results, it was found that PubChem fingerprints together with the RF learning algorithm not only provided the best performance but also an interpretable set of descriptors. Using the RF-derived Gini index, important features that are critical for NS5B inhibition were identified. Feature analysis of the active compounds revealed the importance of the aromaticity/cyclic moiety (i.e., PubchemFP758, PubchemFP821, PubchemFP192, PubchemFP185, PubchemFP199, PubchemFP193, and PubchemFP713 corresponding to non-polar interaction with the surrounding residues, especially aromatic and polar residues of NS5B), the aliphatic group (i.e., PubchemFP712, PubchemFP335, PubchemFP697, and PubchemFP696 corresponding to a conjugated double bond) and the aldehyde/ketone moiety (i.e., PubchemFP537 and PubchemFP432 corresponding to non-specific interactions in aqueous medium). Moreover, analysis of the structure–activity relationship revealed that the hydrophobic interaction
with the aromatic ring plays a pivotal role in inhibiting the functional activity of NS5B. Thus, the basic understanding regarding the quantitative structural and functional relationship might serve as general guidelines for the data-driven design of potentially active NS5B inhibitors.

ACKNOWLEDGMENTS

This study is supported by the New Researcher Grant (A32/2561) from Mahidol University; the annual budget grant (B.E. 2557-2559) of Mahidol University and the Research Career Development Grant (No. RSA6280075) from the Thailand Research Fund.

ORCID

Chanin Nantasenamat https://orcid.org/0000-0003-1040-663X

REFERENCES

[1] T. K. Scheel, C. M. Rice, Nat. Med. 2013, 19, 837.
[2] S. Chinnaswamy, H. Cai, C. Kao, Virus Adapt. Treat. 2010, 2, 73.
[3] N. J. da Silveira, H. A. Arcuri, C. E. Bonalumi, F. P. de Souza, I. M. Mello, P. Rahal, J. R. Pinho, W. F. de Azevedo, BMC Struct. Biol. 2005, 5, 1.
[4] H. B. El-Serag, Gastroenterology 2004, 127, 27.
[5] A. E. Grzegorzewska, Curr. Med. Chem. 2019, 26, 4832.
[6] I. Gentile, F. Borgia, A. R. Buonomo, G. Castaldo, G. Borgia, Curr. Med. Chem. 2013, 20, 3733.
[7] I. Gentile, C. Viola, F. Borgia, G. Castaldo, G. Borgia, Curr. Med. Chem. 2009, 16, 1115.
[8] L. Gragnani, A. Piluso, T. Urraro, A. Fabbrizzi, E. Fognani, L. Petraccia, A. Genovesi, L. Giubilei, J. Ranieri, C. Stasi, M. Monti, A. L. Zignego, Curr. Drug Targets 2017, 18, 772.
[9] J. J. Feld, Curr. Drug Targets 2017, 18, 851.
[10] M. P. Walker, T. C. Appleby, W. Zhong, J. Y. Lau, Z. Hong, Antivir. Chem. Chemother. 2003, 14, 1.
[11] T. J. Liang, M. G. Ghany, N. Engl. J. Med. 2013, 368, 1907.
[12] F. Penin, J. Dubuisson, F. A. Rey, D. Moradpour, J.-M. Pawlotsky, Hepatology 2004, 39, 5.
[13] C. A. Lesburg, M. B. Cable, E. Ferrari, Z. Hong, A. F. Mannarino, P. C. Weber, Nat. Struct. Mol. Biol. 1999, 6, 937.
[14] P. Poijarvi-Virta, H. Lonnberg, Curr. Med. Chem. 2006, 13, 3441.
[15] C. Caillet-Saguy, S. P. Lim, P.-Y. Shi, J. Lescar, S. Bressanelli, Antivir. Res. 2014, 105, 8.
[16] A. Cherkasov, E. N. Muratov, D. Fourches, A. Varnek, I. I. Baskin, M. Cronin, J. Dearden, P. Gramatica, Y. C. Martin, R. Todeschini, V. Consonni, V. E. Kuz'min, R. Cramer, R. Benigni, C. Yang, J. Rathman, L. Terfloth, J. Gasteiger, A. Richard, A. Tropsha, J. Med. Chem. 2014, 57, 4977.
[17] V. Prachayasittikul, A. Worachartcheewan, W. Shoombuatong, N. Songtawee, S. Simeon, V. Prachayasittikul, C. Nantasenamat, Curr. Top. Med. Chem. 2015, 15, 1780.
[18] C. Nantasenamat, C. Isarankura-Na-Ayudhya, V. Prachayasittikul, Expert Opin. Drug Discov. 2010, 5, 633.
[19] A. Z. Dudek, T. Arodz, J. Galvez, Comb. Chem. High Throughput Screen. 2006, 9, 213.
[20] S. Das, I. Mitra, S. Batuta, M. Niharul Alam, K. Roy, N. A. Begum, Bioorg. Med. Chem. Lett. 2014, 24, 5050.
[21] D. Amić, D. Davidović-Amić, D. Bešlo, V. Rastija, B. Lučić, N. Trinajstić, Curr. Med. Chem. 2007, 14, 827.
[22] W. F. de Azevedo, S. Leclerc, L. Meijer, L. Havlicek, M. Strnad, S. H. Kim, Eur. J. Biochem. 1997, 243, 518.
[23] M. Arisoy, O. Temiz-Arpaci, I. Yildiz, F. Kaynak-Onurdag, E. Aki, I. Yalcin, U. Abbasoglu, SAR QSAR Environ. Res. 2008, 19, 589.
[24] M. Pir, H. Agirbas, F. Budak, M. Ilter, Med. Chem. Res. 2016, 25, 1794.
[25] W. F. de Azevedo, F. Canduri, J. S. de Oliveira, L. A. Basso, M. S. Palma, J. H. Pereira, D. S. Santos, Biochem. Biophys. Res. Commun. 2002, 295, 142.
[26] J. H. Pereira, F. Canduri, J. S. de Oliveira, N. J. da Silveira, L. A. Basso, M. S. Palma, W. F. de Azevedo, D. S. Santos, Biochem. Biophys. Res. Commun. 2003, 312, 608.
[27] N. Suvannang, L. Preeyanon, A. A. Malik, N. Schaduangrat, W. Shoombuatong, A. Worachartcheewan, T. Tantimongcolwat, C. Nantasenamat, RSC Adv. 2018, 8, 11344.
[28] V. Prachayasittikul, R. Pingaew, A. Worachartcheewan, C. Nantasenamat, S. Prachayasittikul, S. Ruchirawat, V. Prachayasittikul, Eur. J. Med. Chem. 2014, 84, 247.
[29] E. F. da Cunha, K. S. Matos, T. C. Ramalho, Med. Chem. 2013, 9, 774.
[30] A. Worachartcheewan, V. Prachayasittikul, A. P. Toropova, A. A. Toropov, C. Nantasenamat, Mol. Divers. 2015, 19, 955.
[31] P. P. L. F. de Albuquerque, L. H. S. Santos, D. Antunes, E. R. Caffarena, A. S. Figueiredo, Virus Res. 2020, 278, 197867.
[32] F. Dragoni, A. Boccuto, F. Picarazzi, A. Giannini, F. Giammarino, F. Saladini, M. Mori, E. Mastrangelo, M. Zazzi, I. Vicenti, Antiviral Res. 2020, 175, 104708.
[33] A. A. Elfiky, A. Ismail, Life Sci. 2019, 238, 116958.
[34] G. S. Hassan, H. H. Georgey, E. Z. Mohammed, F. A. Omar, Eur. J. Med. Chem. 2019, 184, 111747.
[35] Y. Wei, J. Li, J. Qing, M. Huang, M. Wu, F. Gao, D. Li, Z. Hong, L. Kong, W. Huang, J. Lin, PLoS One 2016, 11, 1.
[36] M. L. Barreca, N. Iraci, G. Manfroni, V. Cecchetti, Future Med. Chem. 2011, 3, 1027.
[37] H. Rafiei, M. Khanzadeh, S. Mozaffari, M. H. Bostanifar, Z. M. Avval, R. Aalizadeh, E. Pourbasheer, EXCLI J. 2016, 15, 38.
[38] T. T. Talele, P. Arora, S. S. Kulkarni, M. R. Patel, S. Singh, M. Chudayeu, N. Kaushik-Basu, Bioorg. Med. Chem. 2010, 18, 4630.
[39] I. Musmuca, A. Caroli, A. Mai, N. Kaushik-Basu, P. Arora, R. Ragno, J. Chem. Inf. Model. 2010, 50, 662.
[40] A. G. Golub, K. Gurukumar, A. Basu, V. G. Bdzhola, Y. Bilokin, S. M. Yarmoluk, J.-C. Lee, T. T. Talele, D. B. Nichols, N. Kaushik-Basu, Eur. J. Med. Chem. 2012, 58, 258.
[41] M. Elhefnawi, M. ElGamacy, M. Fares, BMC Bioinformatics 2012, 13, S5.
[42] C. A. Lipinski, F. Lombardo, B. W. Dominy, P. J. Feeney, Adv. Drug Deliv. Rev. 2001, 46, 3.
[43] 37th Joint Meeting of the Chemicals Committee, OECD principles for the validation, for regulatory purposes, of (quantitative) structure–activity relationship models, 2004.
[44] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Willighagen, J. Chem. Inf. Model. 2003, 43, 493.
[45] L. H. Hall, L. B. Kier, J. Chem. Inf. Model. 1995, 35, 1039.
[46] J. L. Durant, B. A. Leland, D. R. Henry, J. G. Nourse, J. Chem. Inf. Comput. Sci. 2002, 42, 1273.
[47] NCBI, PubChem Substructure Fingerprint, version January 3, 2009. ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt
[48] C. Laggner, SMARTS Patterns for Functional Group Classification, Inte:Ligand Software-Entwicklungs und Consulting GmbH, Maria Enzersdorf, Austria 2005.
[49] J. Klekota, F. P. Roth, Bioinformatics 2008, 24, 2518.
[50] R. E. Carhart, D. H. Smith, R. Venkataraghavan, J. Chem. Inf. Model. 1985, 25, 64.
[51] C.-M. Chen, Y. He, L. Lu, H. B. Lim, R. L. Tripathi, T. Middleton, L. E. Hernandez, D. W. A. Beno, M. A. Long, W. M. Kati, T. D. Bosse, D. P. Larson, R. Wagner, R. E. Lanford, W. E. Kohlbrenner, D. J. Kempf, T. J. Pilot-Matias, A. Molla, Antimicrob. Agents Chemother. 2007, 51, 4290.
[52] R. Tedesco, A. N. Shaw, R. Bambal, D. Chai, N. O. Concha, M. G. Darcy, D. Dhanak, D. M. Fitch, A. Gates, W. G. Gerhardt, D. L. Halegoua, C. Han, G. A. Hofmann, V. K. Johnston, A. C. Kaura, N. Liu, R. M. Keenan, J. Lin-Goerke, R. T. Sarisky, K. J. Wiggall, M. N. Zimmerman, K. J. Duffy, J. Med. Chem. 2006, 49, 971.
[53] A. A. Kolykhalov, K. Mihalik, S. M. Feinstone, C. M. Rice, J. Virol. 2000, 74, 2046.
[54] K. Ikegashira, T. Oka, S. Hirashima, S. Noji, H. Yamanaka, Y. Hara, T. Adachi, J.-I. Tsuruha, S. Doi, Y. Hase, T. Noguchi, I. Ando, N. Ogura, S. Ikeda, H. Hashimoto, J. Med. Chem. 2006, 49, 6950.
[55] M. Wang, K. K.-S. Ng, M. M. Cherney, L. Chan, C. G. Yannopoulos, J. Bedard, N. Morin, N. Nguyen-Ba, M. H. Alaoui-Ismaili, R. C. Bethell, M. N. G. James, J. Biol. Chem. 2003, 278, 9489.
[56] M. J. Sofia, W. Chang, P. A. Furman, R. T. Mosley, B. S. Ross, J. Med. Chem. 2012, 55, 2481.
[57] R. A. Love, H. E. Parge, X. Yu, M. J. Hickey, W. Diehl, J. Gao, H. Wriggers, A. Ekker, L. Wang, J. A. Thomson, P. S. Dragovich, S. A. Fuhrman, J. Virol. 2003, 77, 7575.
[58] S. di Marco, C. Volpari, L. Tomei, S. Altamura, S. Harper, F. Narjes, U. Koch, M. Rowley, R. de Francesco, G. Migliaccio, A. Carfí, J. Biol. Chem. 2005, 280, 29765.
[59] P. Manvar, F. Shaikh, R. Kakadiya, K. Mehariya, R. Khunt, B. Pandey, A. Shah, Tetrahedron 2016, 72, 1293.
[60] B. K. Biswal, M. M. Cherney, M. Wang, L. Chan, C. G. Yannopoulos, D. Bilimoria, O. Nicolas, J. Bedard, M. N. G. James, J. Biol. Chem. 2005, 280, 18202.
[61] K. S. Yeung, B. R. Beno, K. Parcella, J. A. Bender, K. A. Grant-Young, A. Nickel, P. Gunaga, P. Anjanappa, R. O. Bora, K. Selvakumar, K. Rigat, Y. K. Wang, M. Liu, J. Lemm, K. Mosure, S. Sheriff, C. Wan, M. Witmer, K. Kish, U. Hanumegowda, X. Zhuo, Y. Z. Shu, D. Parker, R. Haskell, A. Ng, Q. Gao, E. Colston, J. Raybon, D. M. Grasela, K. Santone, M. Gao, N. A. Meanwell, M. Sinz, M. G. Soars, J. O. Knipe, S. B. Roberts, J. F. Kadow, J. Med. Chem. 2017, 60, 4369.
[62] D. Dhanak, K. J. Duffy, V. K. Johnston, J. Lin-Goerke, M. Darcy, A. N. Shaw, B. Gu, C. Silverman, A. T. Gates, M. R. Nonnemacher, D. L. Earnshaw, D. J. Casper, A. Kaura, A. Baker, C. Greenwood, L. L. Gutshall, D. Maley, A. DelVecchio, R. Macarron, G. A. Hofmann, Z. Alnoah, H. Y. Cheng, G. Chan, S. Khandekar, R. M. Keenan, R. T. Sarisky, J. Biol. Chem. 2002, 277, 38322.
[63] J. J. Feld, K. V. Kowdley, E. Coakley, S. Sigal, D. R. Nelson, D. Crawford, O. Weiland, H. Aguilar, J. Xiong, T. Pilot-Matias, B. DaSilva-Tillmann, L. Larsen, T. Podsadecki, B. Bernstein, N. Engl. J. Med. 2014, 370, 1594.
[64] P. Ferenci, Chapter 56: current treatment of chronic hepatitis C, in Liver Pathophysiology (Ed: P. Muriel), Academic Press, Boston, MA 2017, p. 781.
[65] J. de Bruijne, J. van de Wetering de Rooij, A. A. van Vliet, X. J. Zhou, M. F. Temam, J. Molles, J. Chen, K. Pietropaolo, J. Z. Sullivan-Bólyai, D. Mayers, H. W. Reesink, Antimicrob. Agents Chemother. 2012, 56, 4525.
[66] C. Sarrazin, C. Hézode, S. Zeuzem, J. M. Pawlotsky, J. Hepatol. 2012, 56, S88.
[67] N. M. Kneteman, A. Y. Howe, T. Gao, J. Lewis, D. Pevear, G. Lund, D. Douglas, D. F. Mercer, D. L. Tyrrell, F. Immermann, I. Chaudhary, J. Speth, S. A. Villano, J. O'Connell, M. Collett, Hepatology 2009, 49, 745.
[68] J. Q. Hang, Y. Yang, S. F. Harris, V. Leveque, H. J. Whittington, S. Rajyaguru, G. Ao-Ieong, M. F. McCown, A. Wong, A. M. Giannetti, S. Le Pogam, F. Talamas, N. Cammack, I. Najera, K. Klumpp, J. Biol. Chem. 2009, 284, 15517.
[69] C. C. Cheng, G. W. Shipps, Z. Yang, N. Kawahata, C. A. Lesburg, J. S. Duca, J. Bandouveres, J. D. Bracken, C. K. Jiang, S. Agrawal, E. Ferrari, H. C. Huang, Bioorg. Med. Chem. Lett. 2010, 20, 2119.
[70] R. Colonno, N. Huang, M. Lau, Q. Huang, A. Huq, E. Peng, M. Bencsik, M. Zhong, L. Li, J. Hepatol. 2012, 56, S464.
[71] P. L. Beaulieu, Chapter 8 design and development of QSAR polymerase non-nucleoside inhibitors for the treatment of hepatitis C virus infection, in Successful strategies for the discovery of antiviral drugs, The Royal Society of Chemistry, Cambridge, UK 2013, p. 248.
[72] A. Gaulton, A. Hersey, M. Nowotka, A. P. Bento, J. Chambers, D. Mendez, P. Mutowo, F. Atkinson, L. J. Bellis, E. Cibrian-Uhalte, M. Davies, N. Dedman, A. Karlsson, M. P. Magarinos, J. P. Overington, G. Papadatos, I. Smit, A. R. Leach, Nucleic Acids Res. 2017, 45, D945.
[73] C. W. Yap, J. Comput. Chem. 2011, 32, 1466.
[74] M. Wójcikowski, P. Siedlecki, P. J. Ballester, Methods Mol. Biol. 2019, 2053, 1.
[75] P. J. Ballester, Biomolecules 2019, 9, 216.
[76] L. Breiman, Mach. Learn. 2001, 45, 5.
[77] L. Breiman, J. Friedman, R. A. Olshen, C. J. Stone, Classification and regression trees, Chapman & Hall/CRC, Boca Raton, FL 1984.
[78] A. Liaw, M. Wiener, R News 2002, 2, 18.
[79] A. Golbraikh, E. Muratov, D. Fourches, A. Tropsha, J. Chem. Inf. Model. 2014, 54, 1.

How to cite this article: Malik AA, Phanus-umporn C, Schaduangrat N, Shoombuatong W, Isarankura-Na-Ayudhya C, Nantasenamat C. HCVpred: A web server for predicting the bioactivity of hepatitis C virus NS5B inhibitors. J Comput Chem. 2020;1–15. https://doi.org/10.1002/jcc.26223
