
SAR and QSAR in Environmental Research

ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/gsar20

QSPR modelling for investigation of different properties of
aminoglycoside-derived polymers using 2D descriptors

P.M. Khan & K. Roy

To cite this article: P.M. Khan & K. Roy (2021) QSPR modelling for investigation of different
properties of aminoglycoside-derived polymers using 2D descriptors, SAR and QSAR in
Environmental Research, 32:7, 595-614, DOI: 10.1080/1062936X.2021.1939150

To link to this article: https://doi.org/10.1080/1062936X.2021.1939150

Published online: 21 Jun 2021.

SAR AND QSAR IN ENVIRONMENTAL RESEARCH
2021, VOL. 32, NO. 7, 595–614
https://doi.org/10.1080/1062936X.2021.1939150

QSPR modelling for investigation of different properties of
aminoglycoside-derived polymers using 2D descriptors
P.M. Khan^a and K. Roy^b

^a Department of Pharmacoinformatics, National Institute of Pharmaceutical Education and Research
(NIPER), Kolkata, India; ^b Drug Theoretics and Cheminformatics Laboratory, Division of Medicinal and
Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India

ABSTRACT
The quantitative structure-property relationship (QSPR) method is commonly used to predict
different physicochemical characteristics of interest of chemical compounds, with the
objective of accelerating the design and development of novel chemical compounds in the
biotechnology and healthcare industries. In the present report, we have employed a QSPR
approach to predict two properties of aminoglycoside-derived polymers (i.e. polymer-DNA
binding and aminoglycoside-derived polymer-mediated transgene expression). The final QSPR
models were obtained using the partial least squares (PLS) regression approach with only
specific categories of two-dimensional descriptors and were subsequently evaluated using
different internationally accepted validation metrics. The proposed models are robust and
non-random, demonstrating excellent predictive ability for test set compounds. We have also
developed different kinds of consensus models from several validated individual models to
improve the prediction quality for external set compounds. The present findings provide new
insight for designing an aminoglycoside-derived polymer library based on the identified
physicochemical properties, as well as for predicting polymer properties before synthesis.

ARTICLE HISTORY
Received 28 March 2021
Accepted 2 June 2021

KEYWORDS
QSPR; aminoglycoside-derived polymers; polymer-DNA binding; transgene expression; PLS

CONTACT K. Roy kunalroy_in@yahoo.com
Supplemental data for this article can be accessed at: https://doi.org/10.1080/1062936X.2021.1939150
© 2021 Informa UK Limited, trading as Taylor & Francis Group

Introduction
Gene therapy delivers exogenous DNA into mammalian cells to treat numerous genetic
disorders [1] and many other diseases, such as neurodegenerative [2] and infectious
disorders [3], AIDS [4] and different types of cancer [5–7]. Membrane-impermeable DNA
is delivered into the cell using either viral vectors or synthetically designed non-viral
vectors [8,9]. Viral vectors are more widely used than non-viral vectors as carriers for
delivering exogenous DNA into target cells. However, the clinical risks (e.g. immunogenicity,
insertional mutagenesis, limited carrying capacity, and viral degradation) and potentially
high production costs associated with viral vectors limit their therapeutic applications
[10,11]. To overcome these problems, scientists worldwide concentrate mainly on developing
efficient and biocompatible non-viral vectors (e.g. cationic lipids and polymers) [12–15].
Non-viral vectors offer certain merits over viral vectors, such as lower production cost,
flexibility in chemical design and development, nearly unlimited capacity to carry DNA, and
safety [16]. However, non-viral vectors also suffer from drawbacks, such as low transgene
efficacy and high cytotoxicity [17]. At present, attention has shifted towards the design
and development of novel non-viral vehicles with higher transgene efficacy and
biocompatibility through appropriate alterations of the chemical structure [18,19]. Polymers
have been extensively explored as delivery vehicles for small molecules (drugs) as well as
macromolecules (e.g. DNA, RNA and proteins) [20,21]. In the present study, we have employed
a quantitative structure–property relationship (QSPR) approach to predict different
properties of aminoglycoside-derived polymers. A combinatorial library of
aminoglycoside-derived polymers was synthesized and characterized by another group [19,22].
The aminoglycoside core was selected for polymer synthesis mainly because its structure
contains several functional groups, such as hydrophilic sugar, hydroxyl, and amine groups,
which facilitate both the design of different derivatives and the polymerization process
[23]. Furthermore, polymers derived from aminoglycoside cores are more likely to be
biodegradable owing to the glycosidic linkages in these molecules [24].
Only a few QSPR models have been proposed to predict the percentage of DNA binding and the
transgene expression efficacy of aminoglycoside-derived polymers. For example, Rege et al.
[25] proposed a nonlinear support vector machine (SVM) model to predict the DNA binding
efficacy of an aminoglycoside-polyamine library. The final SVM model, with a cross-validated
r² value of 0.97, comprises several features encoding information about molecular size,
basicity, methylene group spacing between amine centres, hydrogen-bond donor groups and
positive groups [25]. Similarly, Potta et al. proposed a nonlinear SVM model to predict
aminoglycoside-derived polymer-mediated transgene expression using a training set of 30
polymers, and validated the model using only three test set compounds. The final SVM model,
with r² = 0.78, comprises five structural features: PEOE_VSA_PPOS, log P, RECON_SIEPMax,
BCUT_PEOE_3, and RECON_PIPMax [19]. Miryala et al. [26] reported the parallel synthesis and
QSPR modelling of a small library of 27 lipo-polymers obtained by conjugating three alkanoyl
chlorides (i.e. hexanoyl chloride, myristoyl chloride and stearoyl chloride) with three
aminoglycoside-derived polymers (neomycin, paromomycin and apramycin cross-linked with
glycerol diglycidylether (GDE)) as novel non-viral vectors for transgene delivery and
expression in target cells. The final SVM model comprises six unique features, ALKYL_rsynth,
ALKYL_vdw_vol, AMINO_Q_VSA_FFPOS, AMINO_a_nN, vsa_other, and AMINO_PEOE_RPC+, which stand for
the synthetic feasibility of the alkyl group, the van der Waals volume of the alkyl group,
the fractional positive polar van der Waals surface area of the aminoglycosides, the number
of nitrogen atoms, the van der Waals surface area of other atoms, and the relative positive
partial charge of the aminoglycosides, respectively. The model, with an internal parameter
r² = 0.8365 and an external predictive variance r²pred = 0.6543, was selected as the best
model using the online Learning Equipment (SOLE) platform, a web-based machine learning
system [26]. Zhen et al. [22] proposed two-step chemometric models to predict
aminoglycoside-derived polymer-mediated transgene expression. The first step involved
developing QSPR models of different physicochemical properties of the polymers, such as
molecular weight, hydrophobicity, zeta potential, and percentage DNA binding. In the next
step, the experimental and predicted values of these physicochemical properties were
individually combined with estimated descriptors and used to develop new QSPR models to
predict the transgene expression efficacy mediated by aminoglycoside-derived polymers. The
final SVM model for predicting the percentage DNA binding of polymers was obtained using a
dataset of 33 aminoglycoside-derived polymers. The selected final model, with r² = 0.886 and
r²pred = 0.94, comprises ten unique descriptors computed using the Molecular Operating
Environment. Next, a PLS QSPR model for predicting polymer-mediated transgene expression was
developed using the experimental physicochemical properties of the polymers and estimated
descriptors from a representative building block structure. The reported PLS and SVM models
were based on eight unique descriptors, with internal r² values of 0.76 and 0.80 and
external predictive variance values of r²pred = 0.723 and 0.734, respectively. A QSPR model
for predicting polymer-mediated transgene expression was also generated using the predicted
physicochemical properties and the estimated descriptors from a representative building
block structure. The reported PLS and SVM models were based on four unique descriptors, with
internal r² values of 0.685 and 0.716 and external predictive variance values of
r²pred = 0.838 and 0.924, respectively [22].
Most of the models mentioned above were generated with nonlinear algorithms using both
experimental and computationally estimated two-dimensional descriptors; determining
experimental descriptors for model development requires additional time and resources.
Moreover, nonlinear models are more complex than linear models. In most cases, the dataset
used for model development was small, and the developed models were validated using a
maximum of only three compounds. To the best of our knowledge, none of the reported studies
employed a consensus-based approach for estimating different properties of
aminoglycoside-derived polymers.
In the current work, we have employed a QSPR approach to predict two properties of
aminoglycoside-derived polymers (i.e. polymer-DNA binding and aminoglycoside-derived
polymer-mediated transgene expression). The QSPR approach explores the relationship between
chemical structural attributes and response endpoints [27]. It rests on the similarity
principle: similar chemical compounds show similar activity/property/toxicity, and a slight
modification of the chemical structure changes the endpoint value of a particular compound.
The prime objective of the present work was to develop robust and statistically significant
2D QSPR models to predict polymer-DNA binding and polymer-mediated transgene expression. To
generate the QSPR models, we used two datasets of different sizes (33 and 44
aminoglycoside-derived polymers for polymer-DNA binding and polymer-mediated transgene
expression, respectively). Representative building block structures of the polymers were
used for molecular descriptor computation. Next, we identified suitable subsets of crucial
attributes using a genetic algorithm approach; the identified features were then subjected
to the best subset selection tool. The best combinations of descriptors were used for QSPR
model development with the partial least squares (PLS) regression technique. The final
models were rigorously evaluated using internationally accepted internal and external
validation metrics. Subsequently, the validated models were used for consensus model
development to improve the predictive performance for the external set compounds and to
lower the prediction error. The present findings provide new insight for designing an
aminoglycoside-derived polymer library based on the identified physicochemical properties
and for predicting crucial properties of polymeric vehicles before their synthesis.

Materials and methods


Dataset
The experimental response values of two endpoints, i.e. percentage DNA binding (%DNA
binding) and relative luciferase expression (Log10(RLU/mg)), of aminoglycoside-derived
polymers were gathered from previously published reports [19,22]. The datasets comprise 33
(%DNA binding) and 44 (Log10(RLU/mg)) aminoglycoside-derived polymers. Tables S1 and S2 (in
the supporting information) list the aminoglycoside-derived polymers along with their
experimentally determined percentage DNA binding and relative luciferase expression values.
The raw experimental percentage DNA binding (%DNA binding) data were used directly as the
response variable for QSPR model generation. Subsequently, we analysed the quality of the
same model after converting the response variable to its logarithm (Log10(%DNA binding)).
The experimental relative luciferase expression values were transformed to the logarithmic
scale (Log10), and the transformed values were used as the response endpoint. The detailed
chemical synthesis of the library of aminoglycoside-derived polymers is described elsewhere
[19,22].

Building block structure


Representative building block structures of the polymers were constructed based on the
polymerization reaction between different aminoglycosides and diglycidyl ethers (Figures
S1-S2 in the supporting information). The building block representation, or monomer
structure approximation, of aminoglycoside-derived polymers is described in detail elsewhere
[19,28]; here we give only a brief overview. The building block structures were obtained by
reacting only the primary amines of the different aminoglycoside cores with the epoxide
functionality present in the diglycidyl ethers [19]. This is an excellent approximation
given that the ratio of the kinetic rate constants shows primary amines reacting with
epoxide moieties twofold faster than secondary amines [19,28]. The second epoxide
functionality in the diglycidyl linker was kept open to represent the random environment of
the polymers [19]. The representative building block structures were drawn manually using
MarvinSketch software (http://www.chemaxon.com/), cleaned by adding explicit hydrogens and
aromatizing the molecule, and finally stored in MDL .mol format, a recommended format for
descriptor calculation software tools such as PaDEL-Descriptor [29] and AlvaDesc [30].
Figure S3 in the supporting information shows the building block structures of the 44
polymers used for molecular descriptor calculation.

Descriptor calculation and dataset division


The prepared building block structures of the polymers were submitted to two molecular
descriptor calculation tools, PaDEL-Descriptor and AlvaDesc. We calculated only specific
categories of 2D descriptors with a definite physicochemical meaning. Atom-type E-state
indices, ring descriptors, 2D atom pairs, connectivity indices, constitutional indices,
functional group counts, molecular properties and atom-centred fragment descriptors were
computed using AlvaDesc [30]. Only the extended topochemical atom (ETA) class of descriptors
was computed using PaDEL-Descriptor (Version 2.20) [29]. These descriptors were selected
based on our previous experience of their performance in developing simple but meaningful
QSPR models for different endpoints. After eliminating redundant variables, constant
variables, variables with all values missing, and inter-correlated descriptors (|r| > 0.9),
the primary descriptor pools comprised 154 and 170 descriptors for QSPR modelling of
percentage DNA binding and relative luciferase expression, respectively.
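The pre-filtering step described above can be sketched as follows. This is only an illustration, not the procedure used by the descriptor software: the function names, the toy descriptor table, and the choice to discard a descriptor when any value is missing are assumptions made for the example.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / sqrt(s_aa * s_bb)

def prefilter(descriptors, r_cut=0.9):
    """Drop descriptors that are incomplete, constant, or inter-correlated.

    `descriptors` maps a (hypothetical) descriptor name to its list of
    values; None marks a missing value. A descriptor is kept only if its
    |r| with every previously kept descriptor is at most r_cut.
    """
    kept = {}
    for name, values in descriptors.items():
        if any(v is None for v in values):      # missing values (assumption)
            continue
        if max(values) == min(values):          # constant variable
            continue
        if any(abs(pearson(values, other)) > r_cut for other in kept.values()):
            continue                            # inter-correlated with a kept one
        kept[name] = values
    return kept

# Toy table with made-up values, for illustration only.
toy = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.1],      # nearly collinear with "a"
    "c": [5, 5, 5, 5],        # constant
    "d": [1, None, 2, 3],     # contains a missing value
    "e": [4, 1, 3, 2],
}
print(list(prefilter(toy)))   # ['a', 'e']
```

Note that the outcome of a greedy correlation filter depends on the order in which descriptors are visited; which member of a correlated pair is retained is a design choice.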
The prime goal of the current work was to generate robust and significant QSPR models for
predicting polymer-DNA binding and relative luciferase expression in the cell. To this end,
we divided both datasets into two subsets, i.e. a training set (used exclusively for model
training) and a test set (used for validation), using three dataset division software tools,
i.e. Kennard-Stone [31], Euclidean distance [32] and sorted response (available at
http://dtclab.webs.com/software-tools). In the current work, statistically superior models
for both endpoints were obtained using the Euclidean distance-based division [32]. The final
models for predicting percentage DNA binding and aminoglycoside-derived polymer-mediated
transgene expression were obtained using 25 and 31 compounds in the training sets (ntrain)
and 8 and 10 compounds in the test sets (ntest), respectively.
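As an illustration of distance-based dataset division, the sketch below implements the classical Kennard-Stone selection, one of the three schemes tried above. It is not the tool used in this work (the final models came from the Euclidean distance-based division), and the function name and toy data are hypothetical.

```python
def kennard_stone(X, n_train):
    """Return (train_indices, test_indices) chosen by Kennard-Stone.

    X is a list of descriptor vectors. Selection starts from the two most
    distant samples and then repeatedly adds the sample whose nearest
    already-selected neighbour is farthest away.
    """
    def d2(a, b):
        # Squared Euclidean distance; the ordering matches the true distance.
        return sum((u - v) ** 2 for u, v in zip(a, b))

    n = len(X)
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: d2(X[ij[0]], X[ij[1]]),
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        nxt = max(remaining, key=lambda k: min(d2(X[k], X[s]) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# One-dimensional toy example with made-up values.
train, test = kennard_stone([[0.0], [1.0], [2.0], [10.0]], n_train=3)
print(sorted(train), test)   # [0, 2, 3] [1]
```

In practice the descriptors would normally be scaled before computing distances, so that no single descriptor dominates the selection.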

Model development and validation


In the present study, the initial pool of molecular descriptors of each dataset was
subjected to a genetic algorithm (GA) to extract important features over several iterations.
The primary pools of 154 and 170 descriptors were reduced to 16 and 38 descriptors for QSPR
modelling of percentage DNA binding and relative luciferase expression, respectively, by
retaining the features that occurred repeatedly in the initial population of GA models
obtained over multiple GA iterations [33,34]. The reduced descriptor sets were subjected to
the best subset selection tool to obtain the best combinations of descriptors for QSPR
modelling. The best combinations were selected based on the values of different internal and
external validation metrics and were then subjected to the PLS regression approach [35]
(tool available at http://teqip.jdvu.ac.in/QSAR_Tools/) to obtain final robust and
significant QSPR models. Note that the final models presented in this work are not multiple
linear regression (MLR) models but partial least squares (PLS) regression models [35]. PLS
is a generalization of MLR; MLR is the special case of PLS in which the number of latent
variables equals the number of model descriptors. PLS is robust in the sense that it can
handle numerous, intercorrelated and noisy variables. The final QSPR models for predicting
percentage DNA binding and aminoglycoside-derived polymer-mediated transgene expression were
obtained with two and three latent variables, respectively (the latent variables, rather
than the original descriptors, are the actual regressors), and were subsequently validated
using different internal and external validation parameters to judge the acceptability of
the QSPR models. The internal parameters are the determination coefficient (r²), the
leave-one-out cross-validated correlation coefficient of the training set (Q²LOO), r²m(LOO)
[36,37] and the mean absolute error of the training set (MAEtrain100%). The external
parameters assess the predictive ability of the models on test set compounds using the
external predictive variance (r²pred), r²m(test), the mean absolute error of the test set
(MAEtest95%), etc. [38]. Figure 1 provides a detailed schematic overview of the protocol
employed in the present study for QSPR model development.
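To make the statement that latent variables, not the original descriptors, are the regressors more concrete, here is a minimal single-response PLS (PLS1) sketch based on the NIPALS algorithm. It is not the software used in this work; the function names and toy data are assumptions, and the data are noise-free so that using as many latent variables as descriptors reproduces the MLR fit exactly.

```python
from math import sqrt

def pls1_fit(X, y, n_components):
    """Fit a PLS1 model with NIPALS; returns (x_mean, y_mean, components)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]   # centred X
    f = [v - y_mean for v in y]                                   # centred y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the residual y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        if norm < 1e-12:          # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next latent variable.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components

def pls1_predict(model, x):
    """Predict the response for one descriptor vector x."""
    x_mean, y_mean, components = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(e[j] * w[j] for j in range(len(e)))   # score on this latent variable
        y_hat += q * t
        e = [e[j] - t * load[j] for j in range(len(e))]
    return y_hat

# Toy data: y is an exact linear function of two made-up descriptors.
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 4]]
y = [1 + 2 * a + 3 * b for a, b in X]
model = pls1_fit(X, y, n_components=2)   # as many latent variables as descriptors
print([round(pls1_predict(model, row), 6) for row in X])
```

With fewer latent variables than descriptors, the same code yields a compressed model, which is how PLS copes with intercorrelated descriptors.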

AD study and consensus modelling


As per OECD principle 3, the applicability domain of a QSAR model should be defined in
chemical space, which helps to determine how reliable the model's predictions are for
untested compounds. To comply with OECD principle 3, we performed an applicability domain
(AD) study of the rigorously validated QSPR models using the DModX (distance to model in X
space) technique [35].
In addition, the final validated individual models were used to generate consensus models
with the Intelligent Consensus Predictor software tool [39], in order to increase the
prediction quality for external set compounds and minimize prediction errors. Each
individual QSPR model contains a small number of independent variables based on different
structural features. An individual model may therefore overemphasize some characteristic
features, underrate a few others, and ignore several properties altogether. Consensus
modelling can help to overcome these limitations: consensus models combine information from
several models, consider a wider range of molecular features, and offer a broader
applicability domain together with improved prediction quality [39].
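The basic idea of consensus prediction can be sketched with plain and weighted averaging of individual model outputs. This is a simplified illustration only: it does not reproduce the selection logic of the Intelligent Consensus Predictor tool, and the function names, weights and numbers below are hypothetical.

```python
def consensus_mean(predictions):
    """Average the predictions of several models, compound by compound.

    `predictions` is a list with one prediction list per model, all in the
    same compound order.
    """
    return [sum(vals) / len(vals) for vals in zip(*predictions)]

def consensus_weighted(predictions, weights):
    """Weighted average; `weights` holds one non-negative weight per model."""
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, vals)) / total
        for vals in zip(*predictions)
    ]

# Two hypothetical models predicting two compounds (made-up numbers).
preds = [[1.0, 2.0], [3.0, 4.0]]
print(consensus_mean(preds))                 # [2.0, 3.0]
print(consensus_weighted(preds, [1, 3]))     # [2.5, 3.5]
```

One plausible choice of weights, stated here as an assumption rather than the method of the cited tool, is to make them inversely proportional to each model's cross-validated error.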

Figure 1. Detailed schematic overview of the protocol employed in the present study for QSPR model
development.

Results and discussion


QSPR models for the prediction of polymer-DNA binding efficacy
In the current study, we developed a QSPR model to estimate polymer-DNA binding efficacy
using 33 data points for aminoglycoside-derived polymers obtained from a published
compilation [19,22]. Before QSPR modelling, the curated data were divided into two subsets
(a modelling set and a validation set) using three dataset division software tools, i.e.
the Kennard-Stone algorithm [31], the Euclidean distance-based approach [32] and the sorted
response-based approach. The best training and validation sets were obtained with the
Euclidean distance-based method, in which 25 compounds were used for model training and
eight compounds were used to validate the resulting model. The partial least squares
technique was employed to generate the QSPR model. Equations 1 and 2 were derived without
and with log transformation of the response variable, respectively, keeping all other
settings (descriptors, number of descriptors, divisional subsets, latent variables, etc.)
the same. Equation 1 uses the raw percentage DNA binding data as the response variable,
which allows a direct comparison with earlier research that did not transform the response
variable. Equation 2 is the counterpart of Equation 1 obtained with the logarithmically
transformed percentage DNA binding data (Log10(%DNA binding)) as the response variable. A
comparison of Equations 1 and 2 revealed a noticeable change in the internal quality
metrics but only a negligible change in the external predictive variance. The changes in a
few of the reported internal validation metrics arise because their calculation depends on
the range of the response variable [38]. All subsequent analyses in this article are based
on Equation 1 only. A scatter plot of the observed vs predicted values for the training and
test set compounds, obtained with the model without logarithmic transformation of the
response variable, is given in Figure 2 and indicates the goodness of fit and of the
predictions. The final models were rigorously validated using different internal and
external statistical parameters, as shown below:
Model generated by employing direct percentage DNA binding data as response
variable without any transformation:

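$$\%\,\mathrm{DNA\ binding} = 336.26 + 10.02\,\mathrm{gmax} + 1.40\,\mathrm{F08[C\text{-}N]} + 655.71\,\mathrm{PW3} + 7.632\,\mathrm{B10[N\text{-}N]} \tag{1}$$

$$n_{\mathrm{train}} = 25;\ n_{\mathrm{test}} = 8;\ \mathrm{LV} = 2;\ r^2 = 0.913;\ Q^2 = 0.878;\ r^2_{\mathrm{pred}} = 0.966;$$
$$r^2_{m(\mathrm{LOO})\,\mathrm{train}} = 0.829;\ \Delta r^2_{m(\mathrm{LOO})\,\mathrm{train}} = 0.062;\ \mathrm{MAE}_{\mathrm{Training}\,100\%} = 3.650;$$
$$r^2_{m(\mathrm{LOO})\,\mathrm{test}} = 0.93;\ \Delta r^2_{m(\mathrm{LOO})\,\mathrm{test}} = 0.026;\ \mathrm{CCC}_{\mathrm{test}} = 0.979;\ \mathrm{MAE}_{\mathrm{Test}\,100\%} = 2.249;\ \mathrm{Quality}_{\mathrm{test\ MAE}} = \mathrm{GOOD}$$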

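Figure 2. Scatter plots of observed vs predicted DNA binding of QSPR models obtained using
aminoglycoside-derived polymers.

Model generated by employing the logarithmically transformed percentage DNA binding data
(Log10(%DNA binding)) as the response variable:

$$\mathrm{Log}_{10}(\%\,\mathrm{DNA\ binding}) = 3.054 + 0.099\,\mathrm{gmax} + 0.014\,\mathrm{F08[C\text{-}N]} + 9.917\,\mathrm{PW3} + 1.041\,\mathrm{B10[N\text{-}N]} \tag{2}$$

$$n_{\mathrm{train}} = 25;\ n_{\mathrm{test}} = 8;\ \mathrm{LV} = 2;\ r^2 = 0.868;\ Q^2 = 0.794;\ r^2_{\mathrm{pred}} = 0.962;$$
$$r^2_{m(\mathrm{LOO})\,\mathrm{train}} = 0.714;\ \Delta r^2_{m(\mathrm{LOO})\,\mathrm{train}} = 0.097;\ \mathrm{MAE}_{\mathrm{Training}\,100\%} = 0.054;$$
$$r^2_{m(\mathrm{LOO})\,\mathrm{test}} = 0.944;\ \Delta r^2_{m(\mathrm{LOO})\,\mathrm{test}} = 0.011;\ \mathrm{CCC}_{\mathrm{test}} = 0.979;\ \mathrm{MAE}_{\mathrm{Test}\,100\%} = 0.025;\ \mathrm{Quality}_{\mathrm{test\ MAE}} = \mathrm{GOOD}$$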
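Some of the external validation metrics reported with the equations can be computed directly from observed and predicted values. The sketch below uses the usual textbook definitions of the external predictive variance, the mean absolute error and the concordance correlation coefficient; the function names and toy numbers are assumptions, and the r²m metrics and the MAE-based quality criteria are not covered.

```python
def r2_pred(y_test, y_hat, y_train_mean):
    """External predictive variance: 1 - PRESS / SS about the training mean."""
    press = sum((o - p) ** 2 for o, p in zip(y_test, y_hat))
    ss = sum((o - y_train_mean) ** 2 for o in y_test)
    return 1 - press / ss

def mae(y_obs, y_hat):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(y_obs, y_hat)) / len(y_obs)

def ccc(y_obs, y_hat):
    """Lin's concordance correlation coefficient."""
    n = len(y_obs)
    mean_o, mean_p = sum(y_obs) / n, sum(y_hat) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_hat))
    s_oo = sum((o - mean_o) ** 2 for o in y_obs)
    s_pp = sum((p - mean_p) ** 2 for p in y_hat)
    return 2 * s_op / (s_oo + s_pp + n * (mean_o - mean_p) ** 2)

# Made-up observed and predicted test set values, for illustration only.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(r2_pred(observed, predicted, y_train_mean=2.0))   # 0.5
print(mae(observed, predicted))
print(ccc(observed, predicted))
```

Because r²pred is referenced to the training set mean, it can differ from the squared correlation between observed and predicted test values.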

To determine the importance of each descriptor in Equation 1, we performed a variable
importance plot (VIP) analysis. Descriptors with VIP scores greater than one contribute
most to the modelling of polymer-DNA binding and are considered the most crucial variables
in the QSPR model. In our case, two variables, F08[C-N] and gmax, are regarded as the
essential descriptors, while PW3 and B10[N-N], with VIP scores below one, are considered
less important (Figure S4 in the supporting information). We also performed a loading plot
analysis to identify the most influential descriptors in the final model. F08[C-N] and gmax
lie far from the origin of the plot and are therefore considered the most influential
descriptors, whereas PW3 and B10[N-N] lie somewhat closer to the origin and are considered
less influential (Figure 3).
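For reference, the VIP score of descriptor j in a PLS model is commonly defined as shown below. The notation is introduced here for illustration and is not taken from the original text: p is the number of descriptors, A the number of latent variables, w_{ja} the weight of descriptor j in latent variable a, and SS_a the sum of squares of the response explained by latent variable a.

```latex
\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} SS_a \left( w_{ja} / \lVert \mathbf{w}_a \rVert \right)^{2}}{\sum_{a=1}^{A} SS_a}}
```

With this definition the squared VIP scores sum to p, so their mean is one, which is why a score above one is usually read as above-average importance.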

Figure 3. Loading plot of the final QSPR model obtained using aminoglycoside-derived polymers for
prediction of percentage DNA binding.

The final QSPR equation is based on four unique descriptors calculated using the AlvaDesc
software tool. All four descriptors contribute positively to polymer-DNA binding efficacy,
suggesting that higher descriptor values result in higher DNA binding and vice versa. The
first descriptor, gmax, belongs to the atom-type electrotopological state (E-state) indices
and stands for the maximum E-state value in the molecule [40]. A high E-state value is
generally associated with the most electronegative atoms in a molecule, and the selection of
this descriptor likely reflects structural features in which such atoms lie adjacent to
electrophilic centres [41,42]. Careful analysis of the data revealed that compounds #13, 14
and 16 in the training set and compounds #15 and 29 in the test set have higher gmax values
owing to the presence of an amide functional group (an electronegative nitrogen atom
adjacent to the electrophilic carbon of the amide bond), leading to higher DNA binding
efficacy of these polymers. Conversely, compound #18 has a lower gmax value and shows low
DNA binding efficacy.
The second important descriptor in the equation is F08[C-N], which belongs to the class of
2D atom pair descriptors [43]; it stands for the frequency of C–N atom pairs at topological
distance 8 in the molecule. Its positive coefficient indicates that an increase in the
number of C–N pairs at topological distance eight results in higher DNA binding. For
example, compound #1, with a higher C–N frequency at topological distance eight, shows
higher DNA binding efficacy. It is also evident from a previous report that a shorter
distance between nitrogen atoms within the aminoglycoside core results in higher polymer-DNA
binding [19].
The next descriptor in the final model is PW3, which belongs to the shape topological
descriptor class and denotes the path/walk 3 Randić shape index [44]. Its positive
regression coefficient indicates that an increase in the PW3 value increases polymer-DNA
binding and vice versa. For example, compound #16, with a higher PW3 value, shows higher
polymer-DNA binding.
The final variable in the equation is B10[N-N], which belongs to the 2D atom pair
descriptors [43]; it denotes the presence or absence of an N–N atom pair at topological
distance 10 in the molecule, and the presence of such a pair increases polymer-DNA binding.
For example, compound #13 shows greater polymer-DNA binding owing to a nitrogen–nitrogen
atom pair separated by a topological distance of 10. This suggests that the distribution of
nitrogen atoms in the macromolecule is essential for estimating polymer-DNA binding
efficacy. It is also clear from a previous report that nitrogen cations in the molecules
may bind to the phosphate groups of DNA [25].
In addition, we performed a Y-randomization study to check whether the final model was
obtained by chance. The analysis was performed by generating 100 models in which the
response values were shuffled while the descriptor values were kept intact. If the r²Y
intercept and Q²Y intercept values of the randomized models exceed 0.3 and 0.05,
respectively, the final model can be considered to have been obtained by chance. The plot
analysis revealed that the r²Y and Q²Y intercept values were below these thresholds
(r²Y = 0.0266 and Q²Y = −0.336), suggesting that the proposed model was not obtained by
chance (Figure S5 in the supporting information).
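The logic of Y-randomization can be sketched with a single-descriptor linear model: the response is shuffled repeatedly and the fit quality of each randomized model is compared with that of the true model. This is a simplified illustration with made-up data and hypothetical function names; it reports the raw r² of the randomized fits rather than the permutation plot intercepts used above.

```python
import random

def r2_univariate(x, y):
    """r2 of a one-descriptor least-squares line (squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy * s_xy / (s_xx * s_yy)

def y_randomization(x, y, n_rounds=100, seed=0):
    """Refit after shuffling y, keeping x intact; return the r2 of each round."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        y_perm = list(y)
        rng.shuffle(y_perm)
        scores.append(r2_univariate(x, y_perm))
    return scores

# Made-up descriptor and response values with a clear linear trend.
x = [float(v) for v in range(10)]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2, 16.8, 19.1]
true_r2 = r2_univariate(x, y)
random_r2 = y_randomization(x, y)
print(true_r2, sum(random_r2) / len(random_r2))
```

A model that reflects a genuine structure-property relationship should score far better than the average of its randomized counterparts.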
Finally, to define the proposed model’s applicability in the chemical space, we have
performed the AD study using the DModX approach. The AD analysis revealed that all the
compounds are within the domain in both sets, i.e., training and test sets (Figure S6 in
supporting information).

QSPR modelling of antibiotic-derived polymers mediated transgene expression


Cationic polymers are widely used for gene delivery: they carry plasmid DNA into cells, where
the delivered transgene is ultimately expressed. In the present work, we examine how
two-dimensional properties computed solely from the building block structures of
aminoglycoside-derived polymers affect transgene expression. We employed the widely used
QSPR modelling approach to determine the relationship between the physicochemical
properties of aminoglycoside-derived polymers and transgene expression. The dataset
comprises experimentally determined transgene expression values of 44
aminoglycoside-derived polymers obtained from a published compilation [19,22]. In the
initial analysis, three of the 44 polymers showed high prediction residuals, which might
influence the QSPR modelling. These three polymers (compounds #19, #33 and #37) were
therefore removed from the dataset. Before QSPR modelling, the final dataset of 41 polymers
was divided into a training set and a test set using three different dataset division
approaches, i.e., the Kennard-Stone algorithm [31], the Euclidean distance-based approach
[32] and the sorted response-based approach. The training and test sets for the final models
were those obtained with the Euclidean distance-based method, in which 31 compounds were
used for model training and 10 compounds were used to validate the obtained models.
Subsequently, a subset of features was selected from the initial pool of descriptors based
on the repeated or frequent occurrence of features in the initial population of GA models
obtained by multiple GA iterations; the selected subset was then subjected to a best subset
selection tool to develop a number of MLR models with different combinations of descriptors.
Finally, four individual models with different combinations of descriptors were selected,
and the pooled descriptors were subjected to the partial least squares (PLS) regression
technique to build new QSPR models to predict aminoglycoside-derived polymer-mediated
transgene expression. Each of the four final individual models was obtained with three
latent variables (which extracted meaningful information for QSPR modelling from the model
descriptors). We rigorously validated each model using several internationally accepted
internal and external statistical parameters. Subsequently, the final validated models were
employed for consensus model development using the Intelligent
Consensus Predictor tool [39] to improve the prediction quality for the external validation
set. A scatter plot of the observed vs predicted values for the training and test set
compounds is given in Figure 4, which indicates the goodness of fit and the quality of
predictions. The four selected individual models and the values of their statistical
parameters are shown in Table 1.
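
The dataset division step described above can be illustrated with a minimal sketch of the Kennard-Stone selection rule [31], one of the three approaches that were tried. This is not the software tool used in this work, and the final models relied on the Euclidean distance-based method instead; the descriptor matrix `X`, the function name `kennard_stone` and the set sizes are hypothetical and serve only as an illustration.

```python
from math import dist

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by the Kennard-Stone rule."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Start with the two mutually most distant samples.
    best = (-1.0, 0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(X[i], X[j])
            if d > best[0]:
                best = (d, i, j)
    selected = [best[1], best[2]]
    remaining = [k for k in range(n) if k not in selected]
    # Repeatedly add the sample whose nearest selected neighbour is farthest away.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda k: min(dist(X[k], X[s]) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Hypothetical two-descriptor matrix for six compounds (illustrative values only).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (2.5, 2.5), (0.1, 0.1)]
train_idx = kennard_stone(X, 4)
test_idx = [k for k in range(len(X)) if k not in train_idx]
print(train_idx, test_idx)
```

The selected samples span the descriptor space, so the compounds left for the test set tend to lie inside the region covered by the training set.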
To identify the relative importance of each variable in the final models, we performed
a variable importance plot (VIP) analysis using the SIMCA-P software tool [45].
The descriptors are presented in descending order of their relative importance in the
final models (Figure S7 in the Supporting Information). To further verify the VIP
analysis results, we performed a loading plot analysis to identify the most
influential descriptors and their relative significance in the final models (as shown in
Figure 5).
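
For orientation, the VIP score referred to above is commonly defined for a PLS model as written below. This is the general textbook form; that SIMCA-P evaluates exactly this expression is an assumption here. In this notation, p is the number of descriptors, A the number of latent variables (three in the present models), w_a the weight vector of latent variable a, and SS_a the sum of squares of the response explained by latent variable a.

```latex
\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} SS_a \left( w_{ja} / \lVert w_a \rVert \right)^2}{\sum_{a=1}^{A} SS_a}}
```

Because the squared VIP scores average to one over all descriptors, a score greater than one is commonly read as an above-average contribution, which is the threshold used in the discussion of the descriptors.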
The F09[C-O] descriptor, which appears in three of the final individual models (IM 1–3), is
the second most important descriptor in QSPR models 1 and 2 and the most important
descriptor in the third QSPR model, with a VIP score greater than one. F09[C-O]
belongs to the class of 2D atom pair descriptors and represents the frequency of
carbon-oxygen atom pairs at topological distance 9 [43]. It correlates positively with the
response values in all the models in which it appears, which indicates that relative
luciferase expression increases as the frequency of carbon-oxygen atom pairs at topological
distance nine increases, and vice versa. Previous studies reported that a higher number of
oxygen atoms in the molecular building block results in higher efficacies of transgene
expression in the cell [19]. Close analysis of the present data revealed that polymers with
the cross-linker GDE, which has one extra oxygen atom, have higher values of this descriptor
than those with RDE and other linkers except PPGDE and PEGDE. Similarly, more oxygen atoms
in the

Figure 4. Scatter plots of observed vs predicted luciferase expression efficacy of four individual QSPR
models obtained using aminoglycoside-derived polymers.

Table 1. Final individual and consensus QSPR models for predicting aminoglycoside-derived polymers mediated transgene expression and the detailed statistical
values of internal and external parameters.

Model equations of the individual models:
Model 1: $\log_{10}(\mathrm{RLU/mg}) = 7.283 + 0.0407\,\text{F09[C-O]} - 0.0942\,\text{F09[O-O]} - 0.0158\,\text{F10[C-O]} - 4.6334\,\text{ETA\_EtaP\_L}$
Model 2: $\log_{10}(\mathrm{RLU/mg}) = 6.301 + 0.0436\,\text{F09[C-O]} - 0.106\,\text{nROR} + 2.041\,\text{minssCH2} - 0.01116\,\text{ETA\_Beta}$
Model 3: $\log_{10}(\mathrm{RLU/mg}) = 6.005 + 0.0323\,\text{F09[C-O]} - 0.086\,\text{F09[O-O]} - 0.0056\,\text{F10[C-O]} - 0.0180\,\text{NssCH2}$
Model 4: $\log_{10}(\mathrm{RLU/mg}) = 6.747 + 0.460\,\text{MaxssssC} + 0.019\,\text{SsOH} + 3.400\,\text{minssCH2} - 0.0619\,\text{C-006}$

Consensus models:
CM 0: Average of predictions from all input individual models
CM 1: Average of predictions from ‘qualified’ individual models
CM 2: Weighted average predictions from ‘qualified’ individual models
CM 3: Best selection of predictions (compound-wise) from ‘qualified’ individual models

| Model No. | LVs | $r^2$ | $Q^2_{LOO}$ | $\overline{r^2_{m(LOO)}}$ train | $\Delta r^2_{m(LOO)}$ train | $r^2_{pred}$ | $\overline{r^2_{m(LOO)}}$ test | $\Delta r^2_{m(LOO)}$ test | MAE95% test | RMSEc | RMSEp | SEE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 0.779 | 0.717 | 0.618 | 0.134 | 0.903 | 0.542 | 0.142 | 0.133 | 0.582 | 0.196 | 0.635 |
| 2 | 3 | 0.786 | 0.713 | 0.609 | 0.156 | 0.861 | 0.522 | 0.205 | 0.174 | 0.573 | 0.235 | 0.626 |
| 3 | 3 | 0.784 | 0.710 | 0.605 | 0.145 | 0.852 | 0.282 | 0.308 | 0.157 | 0.575 | 0.243 | 0.628 |
| 4 | 3 | 0.782 | 0.702 | 0.601 | 0.113 | 0.843 | 0.461 | 0.171 | 0.162 | 0.578 | 0.250 | 0.632 |
| CM 0 |  |  |  |  |  | 0.939 | 0.699 | 0.142 | 0.11259 | - | 0.1557 | - |
| CM 1 |  |  |  |  |  | 0.939 | 0.699 | 0.142 | 0.11259 | - | 0.1557 | - |
| CM 2 |  |  |  |  |  | 0.939 | 0.691 | 0.157 | 0.11257 | - | 0.1555 | - |
| CM 3 |  |  |  |  |  | 0.941 | 0.743 | 0.025 | 0.11839 | - | 0.1531 | - |

LVs = Latent variables, $r^2$ = Determination coefficient, $Q^2_{LOO}$ = Leave one out cross-validation, $r^2_{pred}$ = External set predictivity and MAE95% test = Mean absolute error value of test set after removal
of 5% of high residual compounds, RMSEc = Root mean square error of training set, RMSEp = Root mean square error of test set, SEE = Standard error of estimate of training set.

Figure 5. Loading plot of the four individual PLS models developed for prediction of polymer-
mediated transgene expression.

aminoglycoside core also positively affect the F09[C-O] descriptor values; for example,
paromomycin, with a higher oxygen atom count in the aminoglycoside core, gives
a higher value than the other aminoglycoside cores. The present model proposes that
a higher frequency of C-O atom pairs separated by topological distance 9 (rather than
the number of oxygen atoms in the molecule alone) contributes to higher efficacies of
transgene expression due to increased hydrogen bonding potential and polarization of
the polymers. For example, compound #32 (Streptomycin-PEGDE) shows the least relative
luciferase expression in the cell due to a lower value of this descriptor than
compound #25 (Paromomycin-GDE), which has a higher descriptor value.
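
To make the definition of an atom-pair frequency descriptor such as F09[C-O] concrete, the following minimal sketch counts element pairs at a given topological distance (number of bonds on the shortest path) in a hydrogen-suppressed molecular graph. It is an illustration only, not the descriptor software used in this work; the function names and the toy linear molecule are hypothetical.

```python
from collections import deque

def topological_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nb in adjacency[atom]:
            if nb not in dist:
                dist[nb] = dist[atom] + 1
                queue.append(nb)
    return dist

def atom_pair_frequency(elements, bonds, elem_a, elem_b, distance):
    """Count unordered atom pairs (elem_a, elem_b) separated by `distance` bonds."""
    adjacency = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    count = 0
    for i in range(len(elements)):
        d = topological_distances(adjacency, i)
        for j in range(i + 1, len(elements)):
            if d.get(j) == distance and {elements[i], elements[j]} == {elem_a, elem_b}:
                count += 1
    return count

# Hypothetical linear chain O-C-C-C-C-C-C-C-C-C-O (heavy atoms only).
elements = ["O"] + ["C"] * 9 + ["O"]
bonds = [(k, k + 1) for k in range(10)]

f09_co = atom_pair_frequency(elements, bonds, "C", "O", 9)   # C-O pairs nine bonds apart
f10_co = atom_pair_frequency(elements, bonds, "C", "O", 10)  # C-O pairs ten bonds apart
f10_oo = atom_pair_frequency(elements, bonds, "O", "O", 10)  # O-O pairs ten bonds apart
print(f09_co, f10_co, f10_oo)
```

In this toy chain each terminal oxygen has exactly one carbon nine bonds away, so the C-O count at distance 9 is two, while the only pair ten bonds apart is the O-O pair.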
Another descriptor F10[C-O] appearing in two final individual models (IMs 1 and 3) also
belongs to the class of 2D atom pair descriptors [43], which stands for the frequency of
carbon-oxygen atoms at topological distance 10. In contrast to the F09[C-O] descriptor,
however, the relative luciferase expression increases if the frequency of
carbon-oxygen atom pairs at topological distance 10 decreases, and vice versa. For
example, compound #5 (Neomycin-PEGDE) shows lower relative luciferase expression
due to the higher frequency of carbon-oxygen atom pairs at topological distance
10. This observation indicates that the carbon and oxygen atoms of a pair should be
separated by exactly nine bonds, and that lengthening the path between the two atoms by
even one edge results in a decline of relative luciferase expression in the cells.
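
The opposite effects of F09[C-O] and F10[C-O] can be checked numerically with the equation of individual model 1 in Table 1, using the coefficient signs described in the text (positive for F09[C-O]; negative for F09[O-O], F10[C-O] and ETA_EtaP_L). The sketch below is only an illustration: the function name and the descriptor values are hypothetical and do not correspond to any polymer in the dataset.

```python
def predict_im1(f09_co, f09_oo, f10_co, eta_etap_l):
    """log10(RLU/mg) from the individual model 1 equation reported in Table 1."""
    return (7.283
            + 0.0407 * f09_co
            - 0.0942 * f09_oo
            - 0.0158 * f10_co
            - 4.6334 * eta_etap_l)

# Hypothetical descriptor values (illustrative only).
base = predict_im1(f09_co=20, f09_oo=5, f10_co=18, eta_etap_l=0.2)
one_more_f09 = predict_im1(f09_co=21, f09_oo=5, f10_co=18, eta_etap_l=0.2)
one_more_f10 = predict_im1(f09_co=20, f09_oo=5, f10_co=19, eta_etap_l=0.2)

print(round(one_more_f09 - base, 4))  # change caused by one extra C-O pair at distance 9
print(round(one_more_f10 - base, 4))  # change caused by one extra C-O pair at distance 10
```

With all other descriptors held fixed, one additional C-O pair at distance 9 raises the predicted response by 0.0407 log units, whereas one additional pair at distance 10 lowers it by 0.0158 log units.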
The next descriptor is F09[O-O], which represents the frequency of oxygen–oxygen
atom pairs at topological distance nine [43]. It is the most important descriptor in the
first individual QSPR model, with a VIP score greater than one. It correlates negatively
with the response in the models in which it appears, which indicates that relative
luciferase expression increases as the frequency of oxygen–oxygen atom pairs at topological
distance nine decreases, and vice versa. For example, compound #26 (Paromomycin-PPGDE) shows
the least relative luciferase expression in the cell due to its higher F09[O-O] descriptor
value. On the other hand, compound #43 (Sisomicin-BGDE) shows higher luciferase expression
due to the low frequency of oxygen–oxygen atom pairs at topological distance nine.
The ETA_EtaP_L descriptor belongs to the extended topochemical atom (ETA) descriptor
class [46,47] and is the least contributing descriptor in the first individual model. This
descriptor signifies local connectedness relative to molecular size and provides
information related to branching, the presence of heteroatoms and unsaturation [48]. It
correlates negatively with relative luciferase expression, suggesting that an increase in
branching/unsaturation relative to molecular size results in lower relative luciferase
expression in the cells. Close observation of the data revealed that polymers with the RDE,
EGDE and GDE cross-linkers have lower values of ETA_EtaP_L due to, respectively, the
presence of a phenyl group (which imparts unsaturation to the molecules and decreases
polymer mass density but subsequently enhances the hydrophobicity of the
polymer), the short chain length (size) and branching in the cross-linker chain.
For example, compound #17 (Apramycin-CDDE), with a higher value of the ETA_EtaP_L
descriptor, shows lower luciferase expression than compound #12 (Streptomycin-EGDE),
which has a lower value of the ETA_EtaP_L descriptor.
The nROR descriptor appears in the second individual model (IM 2), where it is
the most important descriptor based on its VIP score. The nROR
descriptor belongs to the class of functional group count descriptors [43] and represents
the number of aliphatic ether functionalities in the molecular building block. The negative
coefficient indicates that a greater cross-linker length with aliphatic ether functional
groups results in lower efficacies of transgene expression, and vice versa. For example,
cross-linkers with aliphatic ether functional groups or long chains, including PEGDE and
PPGDE, are not appropriate for developing polymeric vehicles for gene delivery [22];
compound #5 (Neomycin-PEGDE) shows lower expression due to its higher number
of aliphatic ether functional groups. In contrast, compound #42
(Kanamycin-RDE) shows higher expression due to its lower number of aliphatic
ether functional groups.
The next descriptor that appears in individual model 2 is ETA_Beta, which belongs
to the class of extended topochemical atom (ETA) indices [46,47]. The descriptor
provides information about the measure of the molecules’ electronic environment. It
shows a negative contribution to the relative luciferase expression in the cells,
suggesting that a higher value of this descriptor results in a lower response
value, and vice versa. For example, compound #32 (Streptomycin-PEGDE) shows
a lower response value than compound #28 (Paromomycin-RDE) due to the large
difference in their descriptor values. The difference arises from the higher number of
oxygen atoms in the molecule, which may
participate in hydrogen bond formation in the polymer-DNA complex.
The next descriptor, minssCH2, which appears in individual models 2 and 4, belongs to
the atom-type E-state indices [40] and represents the minimum atom-type E-state for
-CH2-. In simple words, ‘min’ stands for the minimum E-state value of a particular group,
‘ss’ stands for the two single bonds of that group, and ‘CH2’ represents the hydride
group. It contributes positively to the response value, indicating that
a higher value of the descriptor results in higher relative luciferase expression, and
vice versa. For example, compound #43 (Sisomicin-BGDE) has a higher value of the
minssCH2 descriptor and shows higher transgene expression efficacy. In contrast,
the NssCH2 descriptor appears in individual model 3 with a negative contribution
to the response value, indicating that a higher number of atoms
of type ssCH2 in the molecule results in lower relative luciferase expression, and
vice versa.
The other descriptors in the final models that belong to the atom-type E-state
indices [40] are MaxssssC (maximum atom-type E-state for >C<) and SsOH (the
sum of E-state indices for -OH groups in the molecule). They contribute positively
to the predicted endpoint value. The hydroxy groups present in the
aminoglycoside core enhance the hydrophilic surface of the polymers and are more likely
to be involved in hydrogen bonding with the cell surface. For example, compound #4
(Neomycin-GDE), with a higher sum of E-state indices for the OH groups in the
molecule, shows higher relative luciferase expression.
The last descriptor, which appears in individual model 4, is C-006. It belongs to
the class of atom-centred fragment descriptors [43] and represents the CH2RX fragment
(where X represents an electronegative atom such as oxygen, nitrogen or sulphur); the
presence of this fragment in the molecule results in
a lower response value. For example, compound #5 (Neomycin-PEGDE) shows a lower response
value than compound #20 (Apramycin-RDE) due to the large difference in the descriptor values,
which reflects the number of these fragments in the molecules.
The AD study revealed that none of the training set compounds were outliers and none of
the test set compounds were outside the AD, indicating that the trained models cover
all the essential features responsible for predicting relative luciferase expression (Figures
S8 and S9 in the Supporting Information).
We also performed a Y-scrambling test to check whether the final models were
obtained by chance. For this analysis, we produced 100
models by randomly shuffling the response values while
keeping the descriptor values intact. If the r2Y intercept and Q2Y intercept values
of the produced models exceed 0.3 and 0.05, respectively, the final models can be
considered to have been obtained by chance. The analysis indicates
that the proposed models were not obtained by chance (Figure S10 in the
Supporting Information).
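
The idea behind the Y-scrambling test can be sketched as follows. For brevity, the sketch replaces the PLS models of this work with a one-descriptor least-squares fit, whose r2 equals the squared Pearson correlation, and it does not reproduce the intercept criterion stated above; the data, the function names and the random seed are hypothetical.

```python
import random

def r_squared_single(x, y):
    """r2 of a one-descriptor least-squares fit (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def y_scramble(x, y, n_models=100, seed=0):
    """Refit the model n_models times on shuffled responses; return the r2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_models):
        y_perm = y[:]          # copy so the original response order is kept intact
        rng.shuffle(y_perm)    # break the descriptor-response pairing
        scores.append(r_squared_single(x, y_perm))
    return scores

# Hypothetical descriptor and response values with a genuine linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

true_r2 = r_squared_single(x, y)
perm_r2 = y_scramble(x, y)
mean_perm_r2 = sum(perm_r2) / len(perm_r2)
print(round(true_r2, 3), round(mean_perm_r2, 3))
```

A model that reflects a real structure-property relationship should score far higher than the models built on shuffled responses, which is what the test is designed to reveal.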

Intelligent consensus QSPR modelling to predict polymer mediated transgene expression
The four rigorously validated individual models were subjected to the Intelligent
Consensus Predictor tool [39] to develop different consensus models with lower
prediction errors and higher prediction quality for the external set compounds. The
statistical values of the internationally accepted internal and external validation
parameters of the individual models (IMs) and the external validation parameters of the
different kinds of consensus models (CMs) are reported in Table 1. A comparative analysis
of the IMs (IM1 to IM4) and the CMs (CM0 to CM3) revealed that the prediction qualities of
all CMs are better than those of the IMs. Similarly, in terms of prediction errors, all the
CMs give significantly lower prediction errors than the IMs. Among the consensus
models, CM2 emerged as the best based on the MAE95%
validation metric.
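
The simplest consensus scheme listed in Table 1 (CM 0, the plain average of the individual predictions) and the MAE95% metric can be sketched as below. This is not the Intelligent Consensus Predictor tool itself: the selection of ‘qualified’ models and the weighting used for CM 1 to CM 3 are not reproduced, the predictions are hypothetical, and rounding down the number of removed compounds is an assumption of this sketch.

```python
def consensus_average(predictions):
    """Compound-wise mean of the predictions of several models (CM 0 style)."""
    return [sum(vals) / len(vals) for vals in zip(*predictions)]

def mae95(observed, predicted):
    """Mean absolute error after discarding the 5% of compounds with the highest residuals."""
    residuals = sorted(abs(o - p) for o, p in zip(observed, predicted))
    n_drop = (5 * len(residuals)) // 100  # assumption: round down
    kept = residuals[: len(residuals) - n_drop]
    return sum(kept) / len(kept)

# Hypothetical observed responses for 20 test compounds.
observed = [5.0 + 0.1 * k for k in range(20)]
# Two hypothetical models with opposite systematic errors.
model_a = [o + 0.2 for o in observed]
model_b = [o - 0.1 for o in observed]
model_b[-1] = observed[-1] + 3.0  # one badly predicted compound

consensus = consensus_average([model_a, model_b])
print(round(mae95(observed, model_a), 3),
      round(mae95(observed, model_b), 3),
      round(mae95(observed, consensus), 3))
```

When the individual models err in different directions, averaging cancels part of the error, and the 5% trimming keeps a single badly predicted compound from dominating the metric.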

Comparison with previously reported QSPR models for prediction of aminoglycoside-derived polymers mediated transgene expression
We have performed a comparative analysis between the models of the current work and
several previously reported QSPR models for predicting aminoglycoside-derived
polymer-mediated transgene expression, as shown in Table S3 in the Supporting Information
file. The comparative analysis revealed that the previously reported QSPR models were
obtained using the nonlinear support vector machine (SVM) regression technique with 30
aminoglycoside-derived polymers in the training set. The molecular descriptors
employed in that QSPR modelling were calculated using the commercially available MOE
software tool, and a few physicochemical features were obtained experimentally.
The reported models were then validated using a limited number of
compounds in the test set (ntest = 3). In contrast, we obtained four
individual QSPR models using 31 compounds in the training set and
rigorously validated them using a larger test set (10 compounds)
and different internationally accepted internal and external parameters (shown
in Table S3 in the Supporting Information file).
Moreover, the proposed models were obtained using only two-dimensional descrip­
tors with significant physicochemical meaning, which are easy to estimate and interpret.
All the models were generated using the PLS regression technique to avoid the co-
linearity problem and obtain robust models. We have also performed an AD analysis of
each model to define their domain in the chemical space within which the prediction
obtained by a particular model was considered reliable. Finally, we have also generated
consensus models to reduce the prediction errors and enhance external set prediction
quality.

Conclusion
In the present study, we have successfully generated QSPR models to predict the polymer
DNA binding, and polymer-mediated transgene expression efficacy of aminoglycoside-
derived polymers. The final QSPR models were obtained using the partial least squares
(PLS) regression technique employing datasets of two different sizes (33 and 44
aminoglycoside-derived polymers for polymer DNA binding and polymer-mediated transgene
expression, respectively). Several structural attributes were found to contribute to the
prediction of polymer DNA binding as well as polymer-mediated transgene
expression of aminoglycoside-derived polymers. In the case of polymer DNA binding,
the maximum E-state of a molecule, the path/walk three Randic index, the presence
of a pair of nitrogen atoms separated by topological distance ten and the presence of
a C-N pair at topological distance eight result in higher polymer DNA binding. In the case
of polymer-mediated transgene expression, higher values of
variables such as the frequency of carbon-oxygen atom pairs at topological distance 9,
the minimum atom-type E-state for -CH2-, the sum of E-state indices for -OH groups in the
molecule and the maximum atom-type E-state for >C< result in higher transgene expression
in the cells. Conversely, higher values of descriptors such as the frequency of
carbon-oxygen atom pairs at topological distance 10, the frequency of oxygen–oxygen atom
pairs at topological distance nine, the number of aliphatic ether functionalities in the
molecular building block and the presence of the CH2RX fragment in the molecule (where
X represents an electronegative atom such as oxygen, nitrogen or sulphur) lead to lower
transgene expression efficacy. The present findings provide new insight for designing
aminoglycoside-derived polymer libraries based on the identified physicochemical
properties and for predicting crucial properties of polymeric vehicles before their
synthesis.

Acknowledgements
PMK thanks the National Institute of Pharmaceutical Education and Research Kolkata, the Ministry of
Chemicals & Fertilizers, Department of Pharmaceuticals, Government of India for providing financial
assistance in the form of a fellowship.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by the Ministry of Chemicals and Fertilizers, Govt. of India.

ORCID
K. Roy http://orcid.org/0000-0003-4486-8074

References
[1] W.F. Anderson, Gene therapy for genetic diseases, Hum. Gene Ther. 5 (1994), pp. 281–282.
doi:10.1089/hum.1994.5.3-281.
[2] M.G. Kaplitt, A. Feigin, C. Tang, H.L. Fitzsimons, P. Mattis, P.A. Lawlor, R.J. Bland, D. Young,
K. Strybing, D. Eidelberg, and M.J. During, Safety and tolerability of gene therapy with an
adeno-associated virus (AAV) borne GAD gene for Parkinson’s disease: An open label, phase
I trial, Lancet 369 (2007), pp. 2097–2105. doi:10.1016/S0140-6736(07)60982-9.
[3] B.A. Bunnell and R.A. Morgan, Gene therapy for infectious diseases, Clin. Microbiol. Rev. 11
(1998), pp. 42–56. doi:10.1128/CMR.11.1.42.
[4] R. Wolkowicz and G. Nolan, Gene therapy progress and prospects: Novel gene therapy
approaches for AIDS, Gene Ther. 12 (2005), pp. 467–476. doi:10.1038/sj.gt.3302488.
[5] N.A. Horn, J.A. Meek, G. Budahazi, and M. Marquet, Cancer gene therapy using plasmid DNA:
Purification of DNA for human clinical trials, Hum. Gene Ther. 6 (1995), pp. 565–573.
doi:10.1089/hum.1995.6.5-565.
[6] Z.R. Yang, H.F. Wang, J. Zhao, Y.Y. Peng, J. Wang, B.A. Guinn, and L.Q. Huang, Recent
developments in the use of adenoviruses and immunotoxins in cancer gene therapy, Cancer
Gene Ther. 14 (2007), pp. 599–615. doi:10.1038/sj.cgt.7701054.
[7] G. Ermak, Emerging Medical Technologies, World scientific publishing Co. Pvt.Ltd, Singapore,
2015.
[8] H. Yin, R.L. Kanasty, A.A. Eltoukhy, A.J. Vegas, J.R. Dorkin, and D.G. Anderson, Non-viral vectors
for gene-based therapy, Nat. Rev. Genet. 15 (2014), pp. 541–555. doi:10.1038/nrg3763.
[9] C.E. Thomas, A. Ehrhardt, and M.A. Kay, Progress and problems with the use of viral vectors for
gene therapy, Nat. Rev. Genet. 4 (2003), pp. 346–358. doi:10.1038/nrg1066.
[10] N. Bessis, F.J. GarciaCozar, and M.C. Boissier, Immune responses to gene therapy vectors:
Influence on vector function and effector mechanisms, Gene Ther. 11 (2004), pp. S10–S17.
doi:10.1038/sj.gt.3302364.
[11] C. Baum, O. Kustikova, U. Modlich, Z. Li, and B. Fehse, Mutagenesis and oncogenesis by
chromosomal insertion of gene transfer vectors, Hum. Gene Ther. 17 (2006), pp. 253–263.
doi:10.1089/hum.2006.17.253.
[12] D. Niculescu-Duvaz, J. Heyes, and C.J. Springer, Structure-activity relationship in cationic lipid
mediated gene transfection, Curr. Med. Chem. 10 (2005), pp. 1233–1261. doi:10.2174/
0929867033457476.
[13] S.K. Samal, M. Dash, S. Van Vlierberghe, D.L. Kaplan, E. Chiellini, C. Van Blitterswijk, L. Moroni,
and P. Dubruel, Cationic polymers and their therapeutic potential, Chem. Soc. Rev. 41 (2012),
pp. 7147–7194.
[14] D. Pezzoli, F. Olimpieri, C. Malloggi, S. Bertini, A. Volonterio, and G. Candiani, Chitosan-graft-
branched polyethylenimine copolymers: Influence of degree of grafting on transfection behavior,
PLoS One 7 (2012), pp. e34711. doi:10.1371/journal.pone.0034711.
[15] R. Labas, F. Beilvert, B. Barteau, S. David, R. Chèvre, and B. Pitard, Nature as a source of
inspiration for cationic lipid synthesis, Genetica 138 (2010), pp. 153–168. doi:10.1007/s10709-
009-9405-8.
[16] A.D. Miller, The problem with cationic liposome/micelle-based non-viral vector systems for gene
therapy, Curr. Med. Chem. 10 (2005), pp. 1195–1211. doi:10.2174/0929867033457485.
[17] H. Gonzalez, S.J. Hwang, and M.E. Davis, New class of polymers for the delivery of macromo­
lecular therapeutics, Bioconjug. Chem. 10 (1999), pp. 1068–1074. doi:10.1021/bc990072j.
[18] J.H. Jeong, S.W. Kim, and T.G. Park, Molecular design of functional polymers for gene therapy,
Prog. Polym. Sci. 32 (2007), pp. 1239–1274.
[19] T. Potta, Z. Zhen, T.S.P. Grandhi, and M.D. Christensen, Discovery of antibiotics-derived poly­
mers for gene delivery using combinatorial synthesis and cheminformatics modeling,
Biomaterials 35 (2014), pp. 1977–1988. doi:10.1016/j.biomaterials.2013.10.069.
[20] B. Shi, M. Zheng, W. Tao, R. Chung, D. Jin, D. Ghaffari, and O.C. Farokhzad, Challenges in DNA
delivery and recent advances in multifunctional polymeric DNA delivery systems,
Biomacromolecules 18 (2017), pp. 2231–2246. doi:10.1021/acs.biomac.7b00803.
[21] W. Wagner, S. Sakiyama-Elbert, and G. Zhang, Biomaterials Science: An Introduction to
Materials in Medicine, Academic Press, London, 2020.
[22] Z. Zhen, T. Potta, M.D. Christensen, E. Narayanan, K. Kanagal, C.M. Breneman, and K. Rege,
Accelerated materials discovery using chemical informatics investigation of polymer physico­
chemical properties and transgene expression efficacy, ACS Biomater. Sci. Eng. 5 (2019), pp.
654–669. doi:10.1021/acsbiomaterials.8b00963.
[23] M. Chen, M. Hu, D. Wang, G. Wang, X. Zhu, D. Yan, and J. Sun, Multifunctional hyperbranched
glycoconjugated polymers based on natural aminoglycosides, Bioconjug. Chem. 23 (2012), pp.
1189–1199. doi:10.1021/bc300016b.
[24] N.D. Stebbins, M.A. Ouimet, and K.E. Uhrich, Antibiotic-containing polymers for localized,
sustained drug delivery, Adv. Drug Deliv. Rev. 78 (2014), pp. 77–87. doi:10.1016/j.
addr.2014.04.006.
[25] K. Rege, A. Ladiwala, S. Hu, M. Breneman, J.S. Dordick, and S.M. Cramer, Investigation of
DNA-binding properties of an aminoglycoside-polyamine library using Quantitative
Structure-Activity Relationship (QSAR) models, J. Chem. Inf. Model. 45 (2005), pp. 1854–1863.
doi:10.1021/ci050082g.
[26] B. Miryala, Z. Zhen, T. Potta, C.M. Breneman, and K. Rege, Parallel synthesis and quantitative
structure-activity relationship (QSAR) modeling of aminoglycoside-derived lipopolymers for
transgene expression, ACS Biomater. Sci. Eng. 1 (2015), pp. 656–668. doi:10.1021/
acsbiomaterials.5b00045.
[27] K. Roy, S. Kar, and R. Das, Understanding the Basics of QSAR for Applications in Pharmaceutical
Sciences and Risk Assessment, Academic press, New York, 2015.
[28] S. Paz-Abuin, M.P. Pellin, M. Paz-Pazos, and A. Lopez-Quintela, Influence of the reactivity of
amine hydrogens and the evaporation of monomers on the cure kinetics of epoxy-amine: Kinetic
questions, Polymer (Guildf) 38 (1997), pp. 3795–3804. doi:10.1016/S0032-3861(96)00957-3.
[29] C.W. Yap, PaDEL-descriptor: An open source software to calculate molecular descriptors and
fingerprints, J. Comput. Chem. 32 (2011), pp. 1466–1474. doi:10.1002/jcc.21707.
[30] AlvaDesc (software for molecular descriptors calculation) version 2.0.2, 2020, https://www.
alvascience.com, 2020.
[31] R.W. Kennard and L.A. Stone, Computer aided design of experiments, Technometrics 11 (1969),
pp. 137–148. doi:10.1080/00401706.1969.10490666.
[32] H. Golmohammadi, Z. Dashtbozorgi, and W.E. Acree Jr, Quantitative structure–activity relation­
ship prediction of blood-to-brain partitioning behavior using support vector machine, Eur.
J. Pharm. Sci. 47 (2011), pp. 421–429. doi:10.1016/j.ejps.2012.06.021.
[33] K. Roy, S. Kar, and R.N. Das, A Primer on QSAR/QSPR Modeling: Fundamental Concepts, Springer,
New York, 2015.
[34] P.M. Khan and K. Roy, Current approaches for choosing feature selection and learning algo­
rithms in quantitative structure–activity relationships (QSAR), Expert Opin. Drug Discov. 13
(2018), pp. 1075–1089. doi:10.1080/17460441.2018.1542428.
[35] S. Wold, M. Sjöström, and L. Eriksson, PLS-regression: A basic tool of chemometrics, Chemom.
Intell. Lab. Syst. 58 (2001), pp. 109–130. doi:10.1016/S0169-7439(01)00155-1.
[36] K. Roy and I. Mitra, On various metrics used for validation of predictive QSAR models with
applications in virtual screening and focused library design, Comb. Chem. High Throughput
Screen. 14 (2011), pp. 450–474. doi:10.2174/138620711795767893.
[37] K. Roy, I. Mitra, P. Ojha, S. Kar, R.N. Das, and H. Kabir, Introduction of rm2 (rank) metric
incorporating rank-order predictions as an additional tool for validation of QSAR/QSPR models,
Chemom. Intell. Lab. Syst. 118 (2012), pp. 200–210. doi:10.1016/j.chemolab.2012.06.004.
[38] K. Roy, R.N. Das, P. Ambure, and R.B. Aher, Be aware of error measures. Further studies on
validation of predictive QSAR models, Chemom. Intell. Lab. Syst. 152 (2016), pp. 18–33.
doi:10.1016/j.chemolab.2016.01.008.
[39] K. Roy, P. Ambure, S. Kar, and P. Kumar Ojha, Is it possible to improve the quality of predictions
from an “intelligent” use of multiple QSAR/QSPR/QSTR models? J. Chemom. 32 (2018), pp.
e2992. doi:10.1002/cem.2992.
[40] L.H. Hall and L.B. Kier, Electrotopological state indices for atom types: A novel combination of
electronic, topological, and valence state information, J. Chem. Inf. Comput. Sci. 35 (1995), pp.
1039–1045. doi:10.1021/ci00028a014.
[41] J. Votano, M. Parham, L. Hall, L. Kier, S. Oloff, A. Tropsha, Q. Xie, and W. Tong, Three new
consensus QSAR models for the prediction of Ames genotoxicity, Mutagenesis 19 (2004), pp.
365–377. doi:10.1093/mutage/geh043.
[42] S. Gupta, N. Basant, D. Mohan, and K.P. Singh, Inter-moieties reactivity correlations: An
approach to estimate the reactivity endpoints of major atmospheric reactants towards organic
chemicals, RSC Adv. 6 (2016), pp. 50297–50305. doi:10.1039/C6RA06805G.
[43] R. Todeschini and V. Consonni, Handbook of Molecular Descriptors, Vol. 11, John Wiley & Sons,
New Jersey, 2008.
[44] Y.S. Prabhakar, R.K. Rawal, M.K. Gupta, V.R. Solomon, and S.B. Katti, Topological descriptors in
modeling the HIV inhibitory activity of 2-aryl-3-pyridyl-thiazolidin-4-ones, Comb. Chem. High
Throughput Screen. 8 (2005), pp. 431–437. doi:10.2174/1386207054546531.
[45] Z. Wu, D. Li, J. Meng, and H. Wang, Introduction to SIMCA-P and its application, in Handbook of
Partial Least Squares, V.V. Esposito, W.W. Chin, J. Henseler, and H. Wang (Eds.), Berlin-
Heidelberg, Springer, 2010, pp. 757–774.
[46] K. Roy, Quantitative Structure-Activity Relationships in Drug Design, Predictive Toxicology, and
Risk Assessment, IGI Global, Hershey, Pennsylvania, 2015.
[47] K. Roy and G. Ghosh, Introduction of extended topochemical atom (eta)indices in the valence
electron mobile (vem) environment as tools for QSAR/QSPR studies, Internet Electron. J. Mol.
Des. 2 (2003), pp. 599–620.
[48] A. Karmakar, P. Ambure, T. Mallick, S. Das, K. Roy, and N.A. Begum, Exploration of synthetic
antioxidant flavonoid analogs as acetylcholinesterase inhibitors: An approach towards finding
their quantitative structure–activity relationship, Med. Chem. Res. 28 (2019), pp. 723–741.
doi:10.1007/s00044-019-02330-8.