Received: 8 April 2023 Revised: 11 May 2023 Accepted: 11 May 2023
DOI: 10.1002/pca.3239

SPECIAL ISSUE REVIEW

R software for QSAR analysis in phytopharmacological studies

Sanjoy Singh Ningthoujam 1 | Rajat Nath 2 | Sibashish Kityania 2 | Pranab Behari Mazumder 3 | Manabendra Dutta Choudhury 2 | Anupam Das Talukdar 2 | Lutfun Nahar 4 | Satyajit D. Sarker 5

1 Government Hindi Teachers' Training College, Imphal, Manipur, India
2 Department of Life Science and Bioinformatics, Assam University, Silchar, Assam, India
3 Department of Biotechnology, Assam University, Silchar, Assam, India
4 Laboratory of Growth Regulators, Institute of Experimental Botany, The Czech Academy of Sciences and Palacký University, Olomouc, Czech Republic
5 Centre for Natural Products Discovery (CNPD), School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK

Correspondence
Anupam Das Talukdar, Department of Life Science and Bioinformatics, Assam University, Silchar, Assam, India.
Email: anupam@bioinfoaus.ac.in
Lutfun Nahar, Laboratory of Growth Regulators, Institute of Experimental Botany, The Czech Academy of Sciences and Palacký University, Šlechtitelů 27, 78371 Olomouc, Czech Republic.
Email: nahar@ueb.cas.cz

Funding information
Czech Agency Grant, Grant/Award Numbers: Project 23-05474S, 23-05474S; European Regional Development Fund, Grant/Award Numbers: Project ENOCH (No. CZ.02.1.01/0.0/0.0/16_019/00008, CZ.02.1.01/0.0/0.0/16_019/0000868

Abstract
Introduction: In recent decades, quantitative structure–activity relationship (QSAR) analysis has become an important method for drug design and natural product research. With the availability of bioinformatic and cheminformatic tools, a vast number of descriptors have been generated, making it challenging to select potential independent variables that are accurately related to the dependent response variable.
Objective: The objective of this study is to demonstrate various descriptor selection procedures, such as the Boruta approach, all subsets regression, the ANOVA approach, the AIC method, stepwise regression, and genetic algorithm, that can be used in QSAR studies. Additionally, we performed regression diagnostics using R software to test parameters such as normality, linearity, residual histograms, PP plots, multicollinearity, and homoscedasticity.
Results: The workflow designed in this study highlights the different descriptor selection procedures and regression diagnostics that can be used in QSAR studies. The results showed that the Boruta approach and genetic algorithm performed better than other methods in selecting potential independent variables. The regression diagnostics parameters tested using R software, such as normality, linearity, residual histograms, PP plots, multicollinearity, and homoscedasticity, helped in identifying and diagnosing model errors, ensuring the reliability of the QSAR model.
Conclusion: QSAR analysis is vital in drug design and natural product research. To develop a reliable QSAR model, it is essential to choose suitable descriptors and perform regression diagnostics. This study offers an accessible, customizable approach for researchers to select appropriate descriptors and diagnose errors in QSAR studies.

KEYWORDS
descriptor, feature selection, MLR, QSAR, R software, regression assumption, regression diagnostics
1 | INTRODUCTION

Quantitative structure–activity relationship (QSAR) analysis is a computational technique that is widely used for analyzing chemical data. It has diverse applications in various fields such as agrochemistry, pharmaceutical chemistry, toxicology, bioinformatics, and other areas of chemistry.1,2 QSAR models describe mathematical equations representing the correlations between the chemical or biological activities
Phytochemical Analysis. 2023;1–20. wileyonlinelibrary.com/journal/pca © 2023 John Wiley & Sons Ltd. 1
2 NINGTHOUJAM ET AL.
of compounds and their structural and physicochemical information called descriptors that are expressed in the form of numerical quantities. These models are developed using appropriate statistical or machine learning approaches. Many powerful statistical software packages, both commercial and open source, are available that can efficiently perform QSAR. Though the process is popular and common, there are many issues that are frequently overlooked in its analyses. QSAR analysis is faced with various challenges including lack of basic concepts and flawed interpretation of the generated results. If a data matrix is fed into any statistical software, it is inevitable that some results are generated. At present, one-dimensional to multidimensional QSAR methods are available that can be used in lead optimization, prediction of biological activity, and characterization of pharmacokinetic properties.3

Along with prediction of biological activities, QSAR models help in identifying the parameters responsible for biological responses essential for lead compound identification and optimization.4 As such, QSAR analysis makes a significant contribution to rational drug design in the present context. The field of QSAR has experienced significant advancements in recent years, with the development of new methods and practices. These include the prediction of ADMET properties and biological activities and the application of QSAR analysis in a variety of industries, such as the chemical, biotechnological, cosmetic, and pharmaceutical industries.1 New domains of QSAR applications include refinement in process chemistry and prediction and optimization of synthetic routes. Besides these advances, QSAR modeling has relevance in drug discovery and identifying pharmacological activities of phytochemicals.4–6

There are different statistical packages and specific software programs for QSAR analysis, but with different degrees of reliability.7 These packages may belong to either commercial or open source products. Though commercial statistical products have many visible features, open source software has the advantages of customizability, freedom, quality, flexibility, interoperability, and auditability. Some of the software programs are dedicated statistical software programs such as SPSS, Stata, SAS, Weka, Tanagra, RapidMiner, MATLAB, and R. There are some integrated software programs such as EasyQSAR, SyByl-X, Codessa, and Discovery Studio, which include QSAR functionalities with statistical functions.8 One popular open source statistical program is R, which can run on a variety of platforms, ranging from Windows and MacOS to Linux.9 Moreover, open source software is a good candidate for implementing statistical analyses in QSAR studies. This paper aims to provide a review of R software for QSAR analysis, specifically in the context of phytopharmacological studies. The applicability of R software is focused on three crucial phases of QSAR analysis: descriptor selection, testing for regression assumptions, and addressing assumption violations.

2 | GENERALIZED STEPS IN QSAR

The development of QSAR models in phytochemical analysis generally involves several modular steps. The first step is the derivation of chemical features and properties from chemical structures or previous experimental data. In the next step, feature selection is performed with various techniques to determine the most significant properties and reduce the number of features to be used in model development.10 In the last step, a QSAR model is developed to determine an empirical function that can represent the relationship between the input data and their biological properties.

Generalized steps in QSAR modeling include:

1. Generate descriptors: This involves converting the molecular structure of the materials into a set of numbers that capture their molecular and physicochemical properties in a relevant way.
2. Selection of a subset of descriptors: This step involves selecting a small subset of descriptors that have the greatest impact on the biological properties of the compounds.
3. Derivation of an empirical relationship between the selected descriptors and the biological activities or responses.
4. Validation: This involves verifying the model in terms of its robustness, predictive power, and applicability domain.

After that, the newly developed model can be used for predicting the biological responses of new molecules for which biological activities are yet to be determined. In essence, a QSAR analysis comprises three important parts: (1) the activity data to be modeled and hence predicted (response variable), (2) data with which to model (predictors or descriptors), and (3) a method to formulate the model (Figure 1).11

3 | METHODS IN QSAR

Many regression and pattern recognition methods are included in QSAR analysis.12 Various methodologies and techniques have been developed, ranging from 2D-QSAR to 3D-QSAR. Major differences in these techniques are the selection of structural parameters used for characterizing molecular properties and the mathematical approach for describing the relationship between descriptors and biological responses.13 In QSAR modeling, various statistical approaches are used to derive mathematical equations with varying numbers of variables. These methods may be (1) regression-based methods, (2) classification-based methods, or (3) machine learning methods. QSAR methods may be linear or nonlinear. Apart from that, QSAR models are also classified as receptor-dependent and receptor-independent based on the consideration of molecular receptor binding.

In the classification-based QSAR models, validation parameters like accuracy, sensitivity, specificity, precision, and F-measures are commonly used.14 Regression-based methods are commonly used in many QSAR studies. Multiple linear regression (MLR) models are used in many QSAR publications because of their simplicity, transparency, reproducibility, and easy interpretability.15 QSAR studies utilize a range of machine learning tools, including support vector machine, artificial neural network, and random forest. In addition to these established methods, new techniques such as local lazy regression
(LLR), gene expression programming (GEP), and projection pursuit regression (PPR) are also employed in QSAR studies.16

FIGURE 1 Generalized workflow of QSAR showing the focus of this paper: (1) descriptor selection, (2) testing for regression assumptions, and (3) addressing assumption violations.

4 | PRESENT TRENDS IN QSAR

There has been unprecedented growth in the area of QSAR analysis, thereby changing the dimensions of applications in drug discovery such as lead optimization, drug–receptor interactions, and protein–protein interactions.17 There has been an increase in the diversity of datasets used for QSAR with the advancement of robotics and miniaturization. Large volumes of data can be accessed from publicly available databases such as ChEMBL, PubChem, and ZINC.1

4.1 | Artificial intelligence and machine learning in QSAR

The use of machine learning and artificial intelligence (AI) in QSAR modeling is becoming increasingly common. Various standard machine learning methods have been employed in QSAR, which can be classified into the following categories:
1. Supervised learning (e.g., k-nearest neighbor, regression analysis, Bayesian probabilistic learning, support vector machines, neural networks, and random forest).
2. Unsupervised learning (e.g., clustering algorithms, principal components analysis [PCA], and independent components analysis [ICA]). Some supervised methods, such as support vector machines, probabilistic graphical models, and neural networks, can also support unsupervised learning.

The deep neural network (DNN) has emerged as an important machine learning method. However, this approach is not new, as it can be traced back to the 1990s. Over the past few decades, improvements in algorithms, hardware advancements, and the use of GPUs have significantly enhanced neural networks. The application of DNNs gained traction after the Merck Molecular Activity Challenge in 2012.1 In addition to DNNs, there are several other machine learning techniques employed in QSAR studies, including k-nearest neighbors, partial least squares, support vector machines, relevance vector machines, random forest, Gaussian processes, and boosting. Random forest is another popular method for QSAR modeling as it provides good predictions with few adjustable parameters.

4.2 | Improvement in validation method

There is a shift in the validation methods of QSAR, where common practice involves splitting the data into an external test set and a training set. The model developed using the training dataset is then utilized to predict the test set and evaluate the accuracy of the predictions. One significant advancement in this field is the use of time-split test sets, where data generated in the later phases are assigned as the test set. Time-split validation is considered to be a more reliable method for estimating R² in true prospective prediction than random test set selection and leave-one-out cross-validation. In QSAR model development, the use of this approach in combination with random selection can be considered as standard cross-validation.

4.3 | Multitask modeling

In classical QSAR modeling, one predicted activity is generally determined at a given time. However, a candidate compound may produce effects when it binds to targets other than its predicted target. These multiple activities need to be studied during drug development by multiparameter optimization or multitask modeling, which is the technique of studying compounds on the basis of more than one predicted activity at a time. It can be accomplished by either a group of single task models or a single model that can model multiple activities simultaneously. This approach may use different techniques, including deep learning techniques. The approach has significant value when data are limited. This modeling approach utilizes methods such as perturbation theory machine learning (PTML), inductive learning, and multiobjective optimization.

4.4 | Improvement in applicability domain

Applicability domain analysis provides a tool for determining the reliability of QSAR models. It defines the molecular space for training the model and also provides the conditions where the model should be applied. Applicability domains of QSAR models can be estimated by different approaches such as the leverage approach, the DModX approach, and the Euclidean distance approach.14 As the QSAR models are data-driven models based on patterns or rules of the training samples, these models can be valid only within limited applicability domains. Various applicability domain characterization methods and applicability domain metrics are proposed at present with utilization of machine learning algorithms.18

4.5 | Modelability

An important emerging trend in QSAR analysis is the concept of modelability, which suggests that the predictivity of QSAR models is influenced by activity cliffs. Activity cliffs are observed when structurally similar compounds act on the same target with different activities. The occurrence of activity cliffs makes it difficult to predict the properties of the target compound. Such limitations cannot be addressed by altering the method used or the descriptors in the QSAR modeling. When different stereoisomers exhibit different activities, the use of descriptors with stereochemical information can reduce activity cliffs.

4.6 | Interpretability

Early QSAR methods were usually applied for molecules that were close analogs. With the development of more sophisticated modeling methods with diverse datasets and esoteric descriptors, the concept of interpretability has become significant. While modeling, only the relevant subset of descriptors is selected. This approach enhances the generalizability of QSAR models and also simplifies their interpretation. Typically, QSAR models are interpreted in two ways. In the first case, the focus is on identifying descriptors with the ability to derive properties. The second approach strives to project significant features from the model onto representative molecules. This approach is used to emphasize structural features that are correlated with a particular activity.

5 | POLYPHARMACOLOGICAL APPLICATIONS OF QSAR

Potential applications of QSAR analysis in searching for drugs with polypharmacological implications are also recognized. The classical QSAR models are developed with groups of compounds selected as a training set having a common bioactivity. Most of these compounds belong to the same chemical series. Thanks to the introduction of diverse assays and different technologies, it is accepted that the drug
candidates may interact with many biological targets. Such polypharmacological properties may generate additive or synergistic effects or create adverse or toxic effects.19 However, experimental validation of thousands of drug candidates for highly variable targets in traditional wet lab settings is unrealistic. The situation is aggravated by the fact that the same ligand–target interaction may produce different outcomes in different assays. Considering these constraints, in silico prediction of multiple bioactivities through QSAR models has emerged as a notable alternative.1

Compounds exerting varying pharmacological effects also exhibit a distinct intrinsic property known as the biological activity spectrum. This refers to the array of biological activities that a compound displays upon interaction with different biological systems. This property is related to the multitarget profiling of compounds. There are several models for multitarget profiling, with the pioneering work being prediction of activity spectra for substances (PASS). This tool enables the prediction of various biological, pharmacological, and biochemical activities at different levels (molecular and organism) on the basis of the structure of the compound. PTML is another method that can predict multiple target properties simultaneously under different environments.20

A major limitation observed in multitarget profiling concerns the evaluation of the predicted compound sets. There are only a few comparative results about the predictability of different approaches. Out of them, the best results are obtained with feedforward neural networks (FNNs), while the lowest predictability is observed for the similarity ensemble approach (SEA).

Another application of multitarget QSAR is the search for interactions between ligands and biological targets in combined chemical-biological space. Chemogenomic or proteochemometric strategies have been developed to search for all molecules that are capable of interacting with any biological target.21 The approach utilized in predicting ligands for a specific receptor differs from the classical QSAR approach in that it involves predicting interactions with multiple targets across both biological and chemical domains. This method aims to overcome the problem of predicting off-target proteins during drug discovery, which can lead to unwanted side effects during the process.22 Recently, a Monte Carlo-based QSAR approach was used to study the use of protease inhibitors for COVID-19 treatment.

6 | DESCRIPTOR-FREE QSAR

Mainstream QSAR analyses are based on linking molecular descriptors (X) to the response variable (Y). However, determining the appropriate descriptors is a complex process because for many of them it is difficult to explain how they are related to the response activity. Determination of the molecular descriptors may require an appropriate software framework and may suffer from human bias. There have been attempts to develop QSAR models by completely eliminating molecular descriptors, particularly for large and diverse datasets, by using deep learning techniques such as long short-term memory networks (LSTM) from SMILES code.23

7 | QSAR IN PHYTOCHEMICAL ANALYSIS AND DRUG DISCOVERY

There are various applications of QSAR in predicting the hitherto unknown properties of phytochemical compounds. Rahman et al.24 applied QSAR models to predict the activity of antiviral phytochemicals targeting the dengue virus NS3 protease. The pIC50 (mM) values for three phytochemicals (cyanidin 3-glucoside, dithymoquinone, and glabridin) were predicted by the MLR model to have good potency.

Due to the COVID-19 pandemic, the exploration of antiviral phytochemicals has emerged as a new area of focus in research. One of the key objectives in designing and creating antiviral medications is to target the SARS-CoV-2 protease. To identify compounds capable of selectively blocking the activity of the main protease of SARS-CoV-2, Islam et al. developed an MLR model that predicts the binding energies of antiviral phytochemicals.25

The spike proteins expressed by SARS-CoV-2 possess a receptor binding domain (RBD) that is accountable for the virus's ability to infect human cells. Specifically, these proteins bind to the protease domain of angiotensin-converting enzyme 2 (ACE2) receptor proteins found on host cells, allowing the virus to enter the cells. Basu et al. examined five phytochemicals (hesperidin, anthraquinone, rhein, chrysin, and emodin) from Indian and Chinese medicinal plants and evaluated their antiviral activity through QSAR and molecular docking studies. They observed that hesperidin, emodin, and chrysin have potential for being used in COVID-19 treatment.26

TABLE 1 Some of the R packages that can be used in QSAR analysis.

R package | Purpose | Reference
Boruta | Boruta feature selection | 28
car: Companion to Applied Regression | Variance inflation factor test | 29
leaps: Regression Subset Selection | All subsets regression | 30
lmtest: Testing Linear Regression Models | Durbin–Watson test | 31
MASS | Stepwise regression | 32
subselect | Genetic algorithm | 33
ezqsar | Dedicated QSAR package | 34
RMol | Transforming SD/Molfile structure information into R objects | 35
RRegrs | Model selection with multiple regression models | 36
camb | Property and bioactivity modeler | 37
chemmodlab | A modeling laboratory for fitting and assessing machine learning models | 38
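As a practical aside to Table 1, a session can check which packages are already installed before an analysis is started. The following is a minimal sketch rather than a prescribed setup step; the selection of six package names, and the assumption that they are obtainable from CRAN, are ours.

```r
# Packages from Table 1 that are assumed here to be hosted on CRAN.
pkgs <- c("Boruta", "car", "leaps", "lmtest", "MASS", "subselect")

# requireNamespace() returns TRUE when a package can be loaded,
# without attaching it to the search path.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Report missing packages instead of installing them silently.
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported as missing can then be installed with install.packages() at the discretion of the user.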
FIGURE 2 In the QSAR process, focus in this review paper is given to feature selection, violations of regression assumptions, and their subsequent solutions.

8 | R SOFTWARE IN QSAR

R is a comprehensive programming language and environment that incorporates all standard statistical tests, models, and analyses for practicing statisticians and researchers.27 As it is open source, code can be freely accessed. Anyone can customize and improve the code according to their preferences. Another advantage of R is that there are several ways to extend its functionality. Developers can write their own packages, extensions, and interfaces for various functions for data operation, modeling, and graphical outputs. R software is free and can be downloaded from www.r-project.org.9 Operations can be implemented through various integrated development environments (IDEs) such as RStudio. In addition to R base installations, various R packages that can be used for QSAR studies are available from the CRAN website (https://cran.r-project.org/web/packages) and other resources. Some of the R packages used in QSAR analysis are listed in Table 1.

Along with the R base installation, these packages can be used for feature selection, testing model assumptions, and subsequent solutions as focused on in this paper (Figure 2). Other processes or steps are beyond the scope of this paper. Steps mentioned here are not fixed and can be tweaked as desired. In the process, the dataset is split into three mutually exclusive sets: the training set, the trial set, and the external evaluation set. External validation sets are different from test sets or training sets and are used to estimate the prediction error when comparing models. When sufficient data are available, it is preferable to split the data into a training set, a validation set, and a test set. When the available data are insufficient for splitting into subsets, all the data are used in the training set.39

9 | FEATURE SELECTION

When conducting QSAR studies, it is crucial to determine whether all variables should be examined or if certain ones deemed insignificant should be excluded.40 Ultimately, the selection of a final QSAR model involves achieving a balance between predictive accuracy and model simplicity. Predictive accuracy refers to the model's capacity to precisely predict the outcomes of new data based on the training data.

Feature selection is a technique that involves selecting a subset of pertinent features from a larger set of input features for model development purposes. In QSAR studies, feature selection is a crucial step for identifying the most appropriate descriptors from a vast number of potential descriptors for a particular activity. The objective is to reduce the dimensionality of the input space and remove redundant or irrelevant descriptors. Models with fewer variables are easier to interpret and understand, as it is more straightforward to identify the important variables. Such models could also provide better performance for new data and reduce the risk of overfitting or overtraining.41 Feature selection also helps remove noise from the analysis. Many statistical tools are available for feature selection. Descriptor data are pruned or filtered to remove intercorrelated and redundant descriptor data. With the advancement of QSAR, many new techniques and algorithms for descriptor selection have emerged that belong to one of three major categories, viz. filter, wrapper, and embedded/hybrid methods (Figure 3).41 Filter methods are usually applied without any machine learning process and in an unsupervised manner. Filter methods apply a statistical or information-theoretic metric to assign a score to each feature. Then the top-ranked variables are selected regardless of the model to be used. For instance, metrics such as correlation, information gain, or mutual information might be used. Although filter methods can usually be applied without machine learning, they might be used in combination with machine learning methods to improve model performance. In the wrapper techniques, a linear or nonlinear classifier (or regressor) is used to select descriptors. Selections are made based on the performance of the variables in a specific modeling algorithm. The hybrid or embedded method is a combination of the above two techniques. It involves incorporation of descriptor selection into the modeling algorithm itself.

Traditional methods of feature selection include both subjective and objective approaches, which may range from human expert judgment to statistical and algorithmic techniques. Human experts depend on their domain knowledge, experience, and scientific intuition to select the relevant descriptors. The traditional approach also includes correlation analysis, where features having the highest correlation coefficients with the target variable are selected for the model. Statistical methods such as t tests, ANOVA, and intelligent algorithms are also common methods. Mention may be made of methods like forward stepping and backward elimination, neural network pruning, simulated annealing, genetic algorithm (GA), and exhaustive enumerations. In the present approach, the Boruta method, the ANOVA approach, the Akaike Information Criterion (AIC) method, stepwise regression, all subsets regression, and GA are considered.
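The filter strategy and the correlation analysis described in this section can be illustrated with base R alone. The sketch below is not taken from the paper: the data are simulated, the descriptor names are hypothetical, and both the 0.9 redundancy cutoff and the limit of k = 2 descriptors are arbitrary choices made only for illustration.

```r
# Filter-style descriptor selection on simulated data (all names hypothetical).
set.seed(42)
n <- 60
desc <- data.frame(LogP = rnorm(n), Mass = rnorm(n), Noise = rnorm(n))
# Volume is made nearly collinear with Mass to mimic intercorrelated descriptors.
desc$Volume <- desc$Mass + rnorm(n, sd = 0.05)
# The simulated response depends on LogP and Mass only.
activity <- 2 * desc$LogP - 1.5 * desc$Mass + rnorm(n, sd = 0.3)

# Step 1: score each descriptor by |Pearson r| with the response and rank them.
score <- sapply(desc, function(x) abs(cor(x, activity)))
ranked <- names(sort(score, decreasing = TRUE))

# Step 2: walk down the ranking, skipping any descriptor that is highly
# correlated (|r| > 0.9) with one already kept, until k descriptors are chosen.
k <- 2
keep <- character(0)
for (d in ranked) {
  redundant <- length(keep) > 0 &&
    any(abs(cor(desc[[d]], desc[keep])) > 0.9)
  if (!redundant) keep <- c(keep, d)
  if (length(keep) == k) break
}

print(round(score, 2))
print(keep)
```

Because the scoring never fits the final model, the selected subset is independent of the modeling algorithm, which is the defining property of a filter method; a wrapper method would instead refit a model for each candidate subset.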
FIGURE 3 Types of feature selection.

9.1 | Boruta method

Boruta is a feature selection algorithm that uses a random forest algorithm as a base model for descriptor selection in default mode.28 An example is shown in this paper. The dataset for this example is based on the Selwood dataset popular in GA and QSAR modeling.41 The syntax of Boruta for feature selection is provided here (Listing 1).

The package is based on an all-relevant feature selection algorithm that identifies all features that are relevant to the given target variable. Although the Boruta package is based on a random forest algorithm by default, the package does not handle missing (NA) values. Running this package is time consuming and demands computational resources for large datasets.
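As a hedged illustration of the kind of call described above (not the authors' Listing 1), a Boruta run might look as follows. The sketch assumes the Boruta package is installed; the data are simulated rather than the Selwood dataset, and the descriptor names and the Activity response are hypothetical.

```r
# Sketch only: simulated descriptors with hypothetical names.
# The code runs only if the Boruta package is available.
if (requireNamespace("Boruta", quietly = TRUE)) {
  set.seed(123)
  n <- 100
  qsar <- data.frame(
    LogP = rnorm(n), Mass = rnorm(n), Volume = rnorm(n),
    SurfaceArea = rnorm(n), HydrationEnergy = rnorm(n)
  )
  # The simulated response depends on LogP and Mass only.
  qsar$Activity <- 3 * qsar$LogP - 2 * qsar$Mass + rnorm(n, sd = 0.5)

  # Boruta does not handle missing values, so drop incomplete rows first.
  qsar <- na.omit(qsar)

  # Run the all-relevant selection with the default random forest importance.
  fit <- Boruta::Boruta(Activity ~ ., data = qsar, doTrace = 0, maxRuns = 100)
  print(fit)

  # Resolve any descriptors left as tentative, then list the confirmed ones.
  fit_final <- Boruta::TentativeRoughFix(fit)
  selected <- Boruta::getSelectedAttributes(fit_final, withTentative = FALSE)
  print(selected)

  # Importance statistics and the final decision for every descriptor.
  print(Boruta::attStats(fit_final))
}
```

Calling plot(fit) on the fitted object draws the importance boxplots. TentativeRoughFix() issues a warning and returns the object unchanged when no tentative descriptors remain.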
While running the function, Boruta gives a straightforward and unambiguous way to determine the significance of variables in a dataset. Each variable is assigned a status of confirmed, rejected, or tentative. Tentative attributes are identified as potentially important but did not pass the statistical test of significance in a default number of random forest runs. So another process for tentative attributes is required. Then, the list of the confirmed attributes can be generated by the R syntax.

In the first run, Boruta produces confirmed attributes with rejected and tentative attributes, which can be visually represented by a graphical plot (Figure 4). In these boxplots, blue represents the minimal, average, and maximum Z-scores of the shadow attributes, which are randomly permuted copies of the descriptors that act as noise variables. In additional color-coded boxplots, red, yellow, and green represent Z-scores of rejected, tentative, and confirmed attributes, respectively. Specific color schemes can be customized according to the preferences of the user. After making a decision on tentative attributes, they are classified as either confirmed or rejected based on the specific research model and objectives of the analysis. Then, a list of the confirmed descriptors can be obtained.

9.2 | ANOVA approach

By using the anova() function in the R base installation,9 two nested models can be compared. A model is nested within another when all of its terms also appear in the larger model, which contains additional terms as well. The larger model is therefore a more complex version of the nested model, with more parameters to estimate.

For example, in one model, where MIC is the response variable, regression coefficients for Mass and Volume are non-significant. It can be tested whether a model without these two variables could predict as well as one that includes them. An example syntax of the ANOVA approach in an in-house QSAR study is presented here (Listing 2).

In the example with the ANOVA approach using in-house dummy data, two models have been generated. Model 1 is nested within Model 2. In this process, Mass and Volume are added to the linear prediction. The test is non-significant with p = 0.994, which suggests that the descriptors do not significantly contribute to the linear prediction of the response variable. This means that these two descriptors can be dropped from our model.
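The nested-model comparison described for the ANOVA approach can be sketched with base R. This is an illustration under stated assumptions, not the authors' Listing 2: the data are simulated, so the p-value will differ from the p = 0.994 reported for the in-house data, and only the variable names (MIC, Mass, Volume, SurfaceArea, HydrationEnergy) follow the text.

```r
# Simulated stand-in for the in-house data; MIC depends only on
# SurfaceArea and HydrationEnergy, so Mass and Volume are uninformative.
set.seed(1)
n <- 50
dat <- data.frame(
  SurfaceArea = rnorm(n), HydrationEnergy = rnorm(n),
  Mass = rnorm(n), Volume = rnorm(n)
)
dat$MIC <- 1.2 * dat$SurfaceArea - 0.8 * dat$HydrationEnergy +
  rnorm(n, sd = 0.5)

# Model 1 (reduced) is nested within Model 2 (full).
model1 <- lm(MIC ~ SurfaceArea + HydrationEnergy, data = dat)
model2 <- lm(MIC ~ SurfaceArea + HydrationEnergy + Mass + Volume, data = dat)

# F test of whether adding Mass and Volume improves the fit.
cmp <- anova(model1, model2)
print(cmp)

# Second row holds the test for the two added descriptors.
p_value <- cmp[["Pr(>F)"]][2]
print(p_value)
```

A large p-value in the second row indicates that the two added descriptors do not significantly improve the fit, which would support retaining the reduced model.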
F I G U R E 4 Boxplots showing importance of attributes. (A) After the first run. (B) After treating tentative attributes. [Colour figure can be
viewed at wileyonlinelibrary.com]
descriptors that do not make a significant contribution to the

model are removed. The procedure continues until there is no
further improvement in the fit of the model.
Stepwise model selection (forward, backward, and stepwise) can

be implemented by using the stepAIC() function provided by the
MASS package in R.42 An example syntax of an in-house QSAR study
is included here (Listing 5).
The process will continue by generating different models with fewer descriptors and lower AIC values. It will stop when removing any descriptor would lead to an increase in the AIC value.
Stepwise regression methods are popular but are now falling out of favor. These methods are controversial and have limitations. They may build a good model, but not necessarily the best model for the given dataset. Because features are added or removed iteratively, not all possible combinations of features that may contribute to the best model are considered; that is, not all possible models are evaluated. The process may also encounter a problem if there are missing values, so it is better to deal with missing values before running it.
9.3 | AIC method
The AIC measures the quality of a statistical model. It is used to compare different models fitted to the same data. Models with smaller AIC values are preferred over models with larger AIC values, because smaller AIC values indicate models that fit the data well with fewer parameters. R code for the AIC method is implemented in the MASS package42 with MIC as the response and Mass, Volume, SurfaceArea, and HydrationEnergy as the predictors.
The process provides different values. The models with smaller AIC values are usually selected. The AIC method does not require a nested approach. As used here, both the ANOVA and AIC methods compare two models at a time, which becomes impractical when many candidate models must be screened.
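A minimal sketch of an AIC comparison between two models that are not nested is given below. It uses the AIC() function from the base stats package on simulated data; the data, coefficients, and object names (fit1, fit2) are assumptions made for illustration.

```r
# Simulated data (hypothetical): MIC depends on SurfaceArea and HydrationEnergy
set.seed(7)
n <- 50
qsar <- data.frame(
  Mass = rnorm(n), Volume = rnorm(n),
  SurfaceArea = rnorm(n), HydrationEnergy = rnorm(n)
)
qsar$MIC <- 1 + 2 * qsar$SurfaceArea - 1.5 * qsar$HydrationEnergy +
  rnorm(n, sd = 0.5)

# Two candidate models; neither is nested within the other
fit1 <- lm(MIC ~ SurfaceArea + HydrationEnergy, data = qsar)
fit2 <- lm(MIC ~ Mass + Volume, data = qsar)

# AIC() returns one row per model with its degrees of freedom and AIC value
aic_table <- AIC(fit1, fit2)
print(aic_table)

# The model with the smaller AIC is preferred
preferred <- rownames(aic_table)[which.min(aic_table$AIC)]
```

Because no F-test is involved, the comparison is valid even though the two models share no descriptors, which is the practical meaning of "does not require a nested approach."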
9.4 | Stepwise regression
Stepwise regression is a common feature selection method in linear regression models. It is an iterative procedure that starts with an initial set of candidate variables. Variables are incrementally introduced or removed from a model, one at a time, based on their statistical significance, until a final model is obtained. It can be implemented in any of the following three methods:
1. In forward stepwise regression, the procedure starts with a model with no predictors. Then, variables are added to the model one at a time, on the basis of their statistical significance. The procedure stops when no further improvement in the model's fit is achieved.
2. In backward stepwise regression, all variables are included in the initial model. Then, variables are deleted one at a time until no further improvement in the model's fit is achieved.
3. In stepwise regression (usually called stepwise), both forward and backward approaches are combined. The process commences with the forward phase, where descriptors are added one by one. At each step, the model is reassessed, and any
9.5 | All subsets regression
All subsets regression is a method to evaluate all possible combinations of descriptors. In R, it can be implemented using the regsubsets() function included in the leaps package.30 Criteria such as R2, adjusted R2, or the Mallows Cp statistic can be employed to determine the most appropriate models in all subsets regression. The outputs can be visually represented using either the plot() function or the subsets() function, which is found in the car package.29
Findings from the regsubsets() function for all subsets regression in the “leaps” package can be observed by plotting the result (Figure 5). In the given example, the initial row (at the bottom) represents a model with a constant and a variable called Mass, exhibiting an adjusted R2 value of 0.33. Above this row, there is a model featuring an intercept and a variable called SurfaceEnergy, with a value of 0.1. On the 12th row, a model with the intercept, SurfaceEnergy, HydrationEnergy, and Volume has a value of 0.54. Above this model, there is another model with intercept, SurfaceArea (variable), and HydrationEnergy (variable) having an adjusted R2 value of 0.55. In this way, a model with minimum descriptors with a larger adjusted R2 is obtained. This indicates that the most appropriate subset of descriptors to be incorporated into the model should solely contain two variables, namely, SurfaceArea and HydrationEnergy.
FIGURE 5 Plot from all subsets regression.
Automatic methods are helpful when the number of descriptors is large. In that case, all subsets regression cannot practically evaluate every possible model, and it is more efficient to use a search algorithm such as stepwise regression or forward selection to find the best model.
9.6 | GA
GA is a method inspired by Darwin's theory of evolution. Here, each model competes with the others according to the concept of the “survival of the fittest.” The genetic function in the subselect package33 in R can perform GA for variable selection. An example syntax of GA in R is provided here (Listing 7).
In the GA approach, a theoretical best model could not be obtained but a population of acceptable models could be generated. This characteristic gives another role to the expert knowledge of the researcher. The approach provides a chance to estimate the relationships with the response from multiple perspectives.
Thousands of molecular descriptors that are derived from different algorithmic methods are available for QSAR analysis. While many descriptors represent specific chemical or physical properties, determination of clear physical or chemical interpretations of some descriptors remains difficult. A clear mechanistic interpretation could not be given as such descriptors are not well defined or identified.43 There is no hard and fast criterion for determining the “best” model. Statistical methods can be used to determine which descriptors are statistically significant relative to one another. In a practical sense, it is difficult to determine whether the variables are important or not. For more practical purposes, it is better to depend more on the domain knowledge of the researchers in selecting descriptors from among the statistically significant variables. The final decision on feature selection still depends on the experience, expertise, and personal judgment of the investigator. So, it is better to consider all highlighted descriptors and reach a consensus based on the domain knowledge of the investigator.
FIGURE 6 (A) Nonlinearity, unequal variance, and outliers can be detected by a residuals versus fitted plot. (B) Heteroscedasticity can be located by a scale–location plot. (C) The normality assumption of the residuals can be verified by the normal Q-Q plot. (D) A residuals versus leverage plot can help in identifying influential observations (outliers). [Colour figure can be viewed at wileyonlinelibrary.com]
10 | REGRESSION DIAGNOSTICS
R software and add-on packages can be used for regression diagnostics. In the present study, evaluation of the statistical assumptions in regression analysis is conducted in the following manner: (1) generation of the object through the “lm()” function in the R base installation and (2) applying the “plot()” function to graphically visualize the results of the “lm()” function.9 The example model is developed through MLR. The efficacy of the MLR model is dependent on the degree of validity of the assumptions on which the analyses are grounded. MLR is based on several assumptions, such as linearity, independence, normality, and homoscedasticity. Application of the plot() function to the result from the lm() function provides an indication of the regression assumptions to some extent (Figure 6).
10.1 | Linearity
In regression analysis, it is assumed that independent variables have a linear relationship with the response variable. This assumption can be checked through a residual plot. To create the plot, standardized residuals are plotted on the y-axis, while standardized predicted values are plotted on the x-axis. To facilitate interpretation of the output, a horizontal line can be included. Syntax for drawing a residual plot for the object “fit” generated by the lm() function is provided (Listing 8).
The assumption that the variables have a linear relationship is analyzed by creating a residual plot in R (Figure 7). Points close to the horizontal line represent observations that are well predicted by the model. Values located significantly above the horizontal line are underpredicted, while those below the horizontal line are overpredicted. In other words, the linearity assumption can be verified by analyzing the pattern of residuals around the horizontal line.
FIGURE 7 Checking linearity of the model. [Colour figure can be viewed at wileyonlinelibrary.com]
10.2 | Normality
In regression analysis, it is assumed that the variables exhibit a normal distribution. In the developed regression model, it is assumed that the distribution of residuals conforms to a normal distribution, with a mean of zero and consistent variance. This assumption can be checked by reviewing the quantile–quantile (Q-Q) plot, goodness of fit (e.g., the Kolmogorov–Smirnov test), and the probability–probability (PP) plot or by developing a residual histogram to compare it with a fitted normal curve.
A normal Q-Q plot is a graphical technique that serves as a probability plot, aimed at assessing whether the residuals in a regression model follow a normal distribution. In the residual histogram method, the residual values are plotted on a histogram and a fitted normal curve is overlaid on top of the histogram. Then, this plot is used for visually assessing the normality assumption of the residuals. If the residuals conform to normality, the plotted points on the normal Q-Q plot should be distributed approximately along a straight line, which is often referred to as the 45-degree line (as shown in Figure 6). If the residuals exhibit consistent variance, the points displayed in the scale–location plot (also called the spread–location plot or S-L plot) should form an unordered band around a horizontal line, known as the line of zero slope. Apart from these tests, there are other specialized codes for testing regression assumptions.
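The residual-based checks for linearity and normality (residual plot, Q-Q plot, Kolmogorov–Smirnov test, PP plot, and residual histogram) can be sketched with the R base installation alone. The example below is an illustration under stated assumptions: the data are simulated, the model object is called fit as in the text, and the PP-plot coordinates use the common (i − 0.5)/n convention for the empirical probabilities, which the article does not specify.

```r
# Simulated data (hypothetical); in practice 'fit' is the MLR model of the study
set.seed(123)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 - 0.5 * x2 + rnorm(n, sd = 0.7)
fit <- lm(y ~ x1 + x2)

std_res  <- rstandard(fit)                   # standardized residuals
std_pred <- as.numeric(scale(fitted(fit)))   # standardized predicted values

# Kolmogorov-Smirnov test against N(0, 1); only approximate for estimated residuals
ks <- ks.test(std_res, "pnorm")

# PP-plot coordinates: empirical versus theoretical cumulative probabilities
pp_empirical   <- (seq_len(n) - 0.5) / n
pp_theoretical <- pnorm(sort(std_res))

pdf(NULL)  # null graphics device so the script runs without a display; remove to view

# Linearity: standardized residuals against standardized predicted values
plot(std_pred, std_res, xlab = "Standardized predicted values",
     ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# Normality: Q-Q plot with reference line
qqnorm(std_res)
qqline(std_res)

# Normality: PP plot with diagonal reference line
plot(pp_empirical, pp_theoretical, xlab = "Empirical probability",
     ylab = "Theoretical probability")
abline(0, 1)

# Normality: residual histogram with a standard normal curve overlaid
hist(std_res, freq = FALSE, main = "Residual histogram",
     xlab = "Standardized residuals")
curve(dnorm(x), add = TRUE)

invisible(dev.off())
```

A non-significant Kolmogorov–Smirnov result and points lying close to the reference lines are consistent with normally distributed residuals; neither proves normality on its own.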
10.3 | PP plot
A PP plot is a graphical tool for assessing the normality assumption of the residuals. This plot compares the empirical cumulative distribution function of the sample with the cumulative distribution function of a normal distribution. To create the PP plot in R, the probability distribution is the first requirement, which is generated using the pnorm(VAR) function in the R base installation, where VAR is the variable containing the residuals (Listing 10). To enable comparison, an abline() function is included in the code to plot a diagonal line across the plot.
From the PP plot generated by the R script, normality can be assessed (Figure 8). The distribution is considered to be normal if the plotted points fall along the diagonal line. The closeness of the plotted points to the diagonal line represents how closely the distribution of the data matches the normal distribution. The extent to which the plotted points depart from the diagonal line represents the degree of non-normality.
FIGURE 8 PP plot.
10.4 | Residual histogram
In regression analysis, residuals refer to the differences between the observed values and the predicted values derived from the regression equation, which are used to estimate the experimental error. Examination of residuals is a crucial step in all statistical modeling because careful study of residuals can reveal the adequacy of the model and the assumptions. How closely the residuals of the sample follow a normal distribution can be assessed by observing the residual histogram (Figure 9), that is, by comparing the histogram with the overlaid normal curve. Syntax in R for generating residual histograms from standardized residuals (Listing 9) can be implemented in the R base installation.9
10.5 | Homoscedasticity
One important assumption in linear regression modeling is homoscedasticity, which assumes that the variance of the residuals or errors is consistent across all levels of the independent variables. In contrast, heteroscedasticity occurs when the magnitude of the error term is not constant across all values of the independent variables, representing a violation of the homoscedasticity assumption. To check for homoscedasticity, tests such as the Breusch–Pagan test and the NCV test can be used. The Breusch–Pagan test can be implemented using the bptest() function in the lmtest package.31 The NCV test can be conducted using the ncvTest() function in the car package.29 The Breusch–Pagan test can be performed from the object “fit” through the following syntax.
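For the homoscedasticity check of Section 10.5, the text names bptest() in the lmtest package. The sketch below instead computes the studentized Breusch–Pagan statistic by hand (sample size times the R2 of an auxiliary regression of the squared residuals on the predictors) so that it runs with the R base installation only. The helper name breusch_pagan and the simulated data are assumptions of this example; if lmtest is installed, bptest(fit) is expected to give the same statistic.

```r
# Simulated data (hypothetical) in which the error spread grows with x1
set.seed(2023)
n  <- 200
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n, sd = 0.3 * x1)
fit <- lm(y ~ x1 + x2)

# Studentized Breusch-Pagan test computed from first principles
breusch_pagan <- function(model) {
  X      <- model.matrix(model)          # intercept plus predictors
  res_sq <- residuals(model)^2           # squared residuals
  aux    <- lm.fit(X, res_sq)            # auxiliary regression
  r2     <- 1 - sum(aux$residuals^2) / sum((res_sq - mean(res_sq))^2)
  stat   <- length(res_sq) * r2          # LM statistic = n * R^2
  df     <- ncol(X) - 1                  # number of predictors
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

bp <- breusch_pagan(fit)
print(bp)
```

A p value below 0.05 points to heteroscedasticity, which is expected here because the simulated error variance was made to depend on x1.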
FIGURE 9 Residual histogram from a sample dataset. [Colour figure can be viewed at wileyonlinelibrary.com]
FIGURE 10 Plots showing assumptions, heteroscedasticity, and nonlinearity. [Colour figure can be viewed at wileyonlinelibrary.com]
In the syntax for the lmtest package, “fit” is the model and “varformula” describes only the potential explanatory variables. Both methods have a p value below 0.05 when there is loss of homoscedasticity.
When the p value is greater than 0.05 in the studentized Breusch–Pagan test in the “lmtest” package and the Non Constant Variance Score Test in the “car” package, it suggests failure to reject the null hypothesis, so the assumption of homoscedasticity is retained. This means that there is no evidence that the variance of the residuals changes across the values of the independent variables.
In Figure 6, residuals versus fitted and scale–location plots provide information on homoscedasticity. If there is absolutely no heteroscedasticity, the scatterplot of residuals is expected to be evenly distributed along a flat line (Figure 10).
10.6 | No multicollinearity
In MLR, it is assumed that the independent variables are not highly correlated with each other, that is, there is no multicollinearity among them. Multicollinearity (also collinearity) is a phenomenon that occurs in multiple regression models when two or more descriptors are highly correlated with each other. The phenomenon is observed when the independent variables are not entirely independent of each other. It is common practice to eliminate highly correlated descriptors in MLR equations for QSAR/QPSR model development. For checking multicollinearity, four primary methods are commonly used: examining the correlation matrix, assessing tolerance, calculating the variance inflation factor (VIF), and checking the condition index.
In our approach, a correlation matrix and VIF criteria are used. While detecting multicollinearity with the VIF, as a general rule, √VIF > 2 indicates a multicollinearity problem. In R software, the VIF can be calculated by using the “vif()” function in the “car” package.29 Multicollinearity can also be checked by using another handy tool called ENMTools, which is developed for environmental niche modeling.44 This tool is a Perl script that is implemented through the Tk+ package, which is a graphical user interface toolkit for the Tcl programming language.
In the first approach, multicollinearity is checked by estimating cross-correlation using the Pearson correlation coefficient in R software9 itself or ENMTools.44 Here, a correlation matrix is generated by the cor() function in the R base installation. There are many alternative approaches for generating the correlation matrix, such as the Hmisc package and corrplot.
To ensure that there is no significant collinearity among the independent variables, the correlation coefficients in the Pearson's bivariate correlation matrix need to be below 0.8.
In the matrix generated by either of the methods, R values generated by each pair of variables are checked. When R values are greater than or less than a particular threshold value (say +0.8 or −0.8), only one of the variables is selected for the model. Out of the pair of variables that possess multicollinearity, which variable is to be selected is based on the potential relevance to the overall properties of the QSAR study.
Variables  Desc01  Desc02  Desc03  Desc04  Desc05  Desc06  Desc07  Desc08  Desc09  Desc10  Desc11
Desc01
Desc02     0.292
Desc03     0.598   0.158
Desc04     0.728   0.380   0.849
Desc05     0.832   0.005   0.195   0.250
Desc06     0.962   0.422   0.717   0.871   0.666
Desc07     0.703   0.558   0.810   0.975   0.201   0.865
Desc08     0.666   0.279   0.332   0.395   0.546   0.587   0.403
Desc09     0.858   0.170   0.535   0.650   0.749   0.849   0.611   0.259
Desc10     0.896   0.153   0.268   0.351   0.980   0.755   0.332   0.637   0.765
Desc11     0.967   0.345   0.730   0.877   0.676   0.995   0.851   0.602   0.846   0.757
Desc12     0.481   0.654   0.595   0.661   0.086   0.602   0.732   0.435   0.307   0.224   0.577
As per the Topliss and Costello rule, the linear regression model should have training sets with at least five chemicals for every descriptor to minimize the likelihood of chance correlations.45 The number of features having multicollinearity in MLR may be restricted in common statistical analysis, but in QSAR analysis, there are some exceptions. Multicollinearity is not expected to impact the overall performance in the predictive model where the objective is purely to predict outcomes rather than explaining the relationship between variables.46 As pruning of highly correlated descriptors is the default setting in commercial MLR software packages, one may overlook meaningful correlations. Sometimes, descriptors may act synergistically and provide models that perform better than what would be expected by simply summing up the contributions of individual components. In a study involving density functional theory QSAR analysis, the QSAR model developed with hardness, EHOMO, MRA-4, and MRB-4 could predict the potential bioactivity of the set of chalcone molecules for treating Mycobacterium tuberculosis infection.47 In this study, the descriptors hardness and EHOMO are found to be related with the following formula:
$$\eta = \frac{E_{\mathrm{LUMO}} - E_{\mathrm{HOMO}}}{2},$$
where η is the hardness and EHOMO is the energy of the highest occupied molecular orbital. It is the responsibility of the researchers to
investigate the mechanism of action or identify areas where such association could enhance the acceptability of a model.
In other statistical analyses, multicollinearity is a genuine concern and can be addressed by applying any statistical approach. However, in QSAR analysis, the multicollinearity problem has to be addressed carefully. Perfect multicollinearity arises when an independent variable in a regression model is perfectly correlated with either another descriptor or a linear combination of multiple other independent variables. Such issues can be easily handled. Special attention is required in the diagnosis and evaluation of lesser degrees of multicollinearity, which are more prevalent. There are instances where single descriptors that are highly correlated and yield poor performance can provide valuable insight into the model.46 In this case, domain knowledge of the researchers becomes more important.
10.7 | Autocorrelation
In MLR, there is another assumption that there is no autocorrelation in the dataset. Autocorrelation means that the residuals are dependent on each other rather than being independent. For instance, the value of y at a certain point, y(x + 1), is not independent of the value of y at the previous point, y(x). This violates the assumption of independence in regression analysis. The presence of autocorrelation can be checked by drawing a scatterplot or performing the Durbin–Watson test. This test is used to examine the null hypothesis that the residuals in a regression model are not linearly correlated. The Durbin–Watson statistic (d) ranges between 0 and 4, with a value of around 2 indicating the absence of significant autocorrelation. As a rule of thumb, a value of d between 1.5 and 2.5 suggests the absence of significant autocorrelation in the MLR dataset. This test is performed by using the “dwtest()” function in the package “lmtest”31 of R software.
11 | OUTLIERS
Outliers are data points that deviate significantly from the rest of the data and are not predicted well by the statistical model.48 Outliers can be expressed as either unusually large positive or negative residuals. Positive residuals suggest that the model is underestimating the response variable, while negative residuals indicate that the model is overestimating the response. In a Q-Q plot, data points that fall outside the confidence band are generally considered to be outliers. A common rule of thumb is to pay attention to the standardized residuals greater than 2 or less than −2, as they may indicate the presence of outliers in the dataset.49 In our approach, outliers are detected in statistical tests by using the “car” package. The outlierTest() function in this package calculates the Bonferroni adjusted p value for the largest absolute studentized residual.29 This function helps in identifying the potential outliers in the dataset.
The success of QSAR analysis depends on the identification and removal of the outliers50 as MLR is very sensitive to outliers' effects. Outliers are usually indicative of either measurement errors or of the population itself, and sometimes they may also occur by chance. Particularly in QSAR analysis, outliers in the data may not necessarily be the result of statistical fluctuations or measurement errors. Instead, these may be indicative of the presence of activity cliffs, which can be defined as a high ratio of the difference in biological activity between two chemical compounds to their “distance” or separation in a given chemical space.51 The presence of “cliffs” in the descriptor space can lead to dramatic changes in the bioactive properties upon addition or removal of one small chemical group. If an outlier is present, it is better to rerun the test after deleting the outlier data. In the following example, observation no. 5 is an outlier.
12 | INFLUENTIAL OBSERVATIONS
Influential observations are observations that have significant effects on the estimated values of the model parameters. Their presence or absence in the model may greatly influence the outcome of the analysis. Influential observations can be identified by the two methods of Cook's distance or D statistic29 and added variable plots.49 Observations with Cook's D values greater than 4/(n − k − 1) may be considered influential, where n is the sample size and k is the number of predictor variables. It is to be noted that this is just a guideline and needs to be applied judiciously. Added variable plots and Cook's D plots can be drawn in the “car” package.29
13 | CORRECTIVE MEASURES
When there are violations of regression assumptions, there are many approaches to deal with these issues. Some of the commonly used corrective measures against violation of assumptions are given below.
13.1 | Deleting observations
In this approach, identified outliers are deleted to enhance the conformity of a dataset to the normality assumption. Influential observations are usually deleted as they have a disproportionate influence on the outcome. Once the largest outlier or most influential observation is deleted, the model is re-evaluated. If any outliers or influential observations persist, the procedure is repeated until a satisfactory fit is achieved. There should be some caution in deleting observations. Removing the problematic observations is a reasonable approach when an observation is identified as an outlier due to data errors or inaccurate protocols. In certain cases, the unusual observation may provide important information about the data. Studying why this observation deviates from the rest of the data can contribute to understanding the response variable, so there is a chance of serendipitous discovery.
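Before any descriptor or observation is deleted, the numeric checks described in Sections 10.6, 10.7, 11, and 12 can be reproduced with the R base installation alone. The following is a minimal sketch on simulated data in which one descriptor is collinear by construction and observation no. 5 is a planted outlier. The VIF and the Durbin–Watson statistic are computed from their definitions rather than with vif() or dwtest(); the data and all object names are assumptions of the example.

```r
# Simulated data (hypothetical): x3 nearly duplicates x1; observation 5 is an outlier
set.seed(99)
n   <- 30
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$x3 <- dat$x1 + rnorm(n, sd = 0.1)
dat$y  <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.5)
dat$y[5] <- dat$y[5] + 8

fit <- lm(y ~ x1 + x2 + x3, data = dat)
k   <- length(coef(fit)) - 1                 # number of predictors

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing descriptor j on the others
X   <- model.matrix(fit)[, -1, drop = FALSE]
vif <- sapply(colnames(X), function(v) {
  others <- X[, colnames(X) != v, drop = FALSE]
  1 / (1 - summary(lm(X[, v] ~ others))$r.squared)
})
collinear <- names(vif)[sqrt(vif) > 2]       # rule of thumb from the text

# Durbin-Watson statistic: values near 2 suggest no autocorrelation
e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)

# Outliers: standardized residuals beyond +/- 2
std_res  <- rstandard(fit)
outliers <- which(abs(std_res) > 2)

# Influential observations: Cook's D above 4 / (n - k - 1)
cooks_d     <- cooks.distance(fit)
influential <- which(cooks_d > 4 / (n - k - 1))

# Delete the largest outlier and re-evaluate the model
worst <- unname(which.max(abs(std_res)))
refit <- lm(y ~ x1 + x2 + x3, data = dat[-worst, ])

print(list(vif = vif, dw = dw, outliers = outliers, influential = influential))
```

As the surrounding sections caution, the flags are guidelines: a flagged compound may reflect an activity cliff rather than an error, so deletion should follow inspection, not replace it.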
13.2 | Transforming variables
Transformation of variables is an important strategy for addressing the issues when models fail to meet the assumptions of normality, linearity, and homoscedasticity. Transformation refers to replacing a variable Y with another variable Y^λ.
1. If the model violates the normality assumption, a possible solution is to transform the response variable.
2. When the assumption of linearity is violated, transforming the predictor variables can be the solution.
3. With issues involving heteroscedasticity (non-constant error variance), transformations of the response variables may help.
13.3 | Normalization
Normally distributed data are preferable for QSAR analysis. When data do not follow a normal distribution, it is necessary to understand the underlying cause of non-normality so as to take up appropriate remedial actions. Data can be transformed by various methods.
Some of the common examples of data transformations in our daily life are currency exchange (USD to INR) and the conversion of degrees Celsius to Fahrenheit or vice versa. These transformations are examples of linear transformations. This approach involves simple multiplication or division of the original data by a constant or coefficient or addition or subtraction from the data. However, linear transformations do not change the shape of the data distribution significantly and do not contribute much in normalization.
One of the popular methods is Box–Cox power transformation, which uses an exponent (lambda) for transforming data into a “normal shape.” The value of lambda indicates the power to which all data should be raised during transformation. During Box–Cox power transformation, optimal lambda is searched from −5 to +5 until the most suitable value is found.
Although Box–Cox power transformation is a widely used method to achieve normality in non-normal data, it is not a guarantee for normality in all cases. Other methods need to be employed in conjunction with this approach. The Box–Cox method searches for the value that results in the smallest standard deviation. It is necessary to verify the normality of the transformed data by using a probability plot. One limitation of Box–Cox power transformation is that it can only be applied to data that are positive and greater than zero. To achieve this, a constant (C) may be added to all data so that all data become positive before transformation. The transformation equation then becomes
$$Y^{\lambda} = (Y + C)^{\lambda}.$$
The Box–Cox procedure aims to identify the optimal power transformation to address the deviations from the assumptions of the linear regression model. The process tries to find the power that produces a linear relationship between the variables and satisfies the assumptions of normality, constant variance, and linearity. For the linear model fit, R commands may be used (Listing 12) for plotting the “log likelihood” of the lambda parameter (λ) against a range of lambda values, typically from −2 to 2. It can help in identifying the value of lambda that maximizes the log-likelihood function.
In the example shown in Figure 11A, the dotted vertical line represents the ideal value of λ at around 1.5. To further refine the estimation of the optimal lambda value, the range of lambda values can be adjusted, for example, to a range of 1 to 2 with increments of 0.1, and the log-likelihood function can then be plotted again.
In the new plot (Figure 11B), the estimates are refined to indicate the best value of lambda. The plot indicates that the best value of λ is about 1.45. We then transform the response variable accordingly, add it to the original dataset, and run a new linear model (Table 2).
FIGURE 11 Box–Cox plot.
TABLE 2 Common Box–Cox transformations; these values are not universal and the optimal lambda value may differ according to the specific dataset and objective of the study.
Lambda   Y′ (transformed value of Y)
−2       Y^−2 = 1/Y^2 (reciprocal square transformation)
−1       Y^−1 = 1/Y (reciprocal transformation)
−0.5     Y^−0.5 = 1/√Y (reciprocal square root transformation)
0        log(Y) (logarithmic transformation)
0.5      Y^0.5 = √Y (square root transformation)
1        Y^1 = Y (no transformation/identity transformation)
2        Y^2 (square transformation)
13.4 | Adding or removal of descriptors
Changing the variables in the model will significantly affect the model's fit. Deleting descriptors is an important approach for dealing with multicollinearity. If the sole purpose of QSAR analysis is to make predictions, then the issue of multicollinearity might not be a major one.46 If we want to extend the QSAR analysis to interpretations of individual descriptors, we have to deal with multicollinearity issues. Removal of one of the descriptors involved in the multicollinearity when the square root of the VIF is greater than 2 is one approach for addressing multicollinearity, but it may not always be the right choice. Other alternatives include the use of ridge regression and GA for selecting the best set of descriptors.
13.5 | Different approaches
There are different statistical methods that can be attempted if MLR approaches could not provide satisfactory results. As explained above, one approach is to use ridge regression for addressing multicollinearity issues. If there are outliers or influential observations, a robust regression model is preferred. If the normality assumption is violated, non-parametric regression is more suited. When there is significant nonlinearity, a nonlinear regression model is more appropriate.
14 | EXAMPLE OF QSAR WITH SPECIFIC PHYTOCHEMICALS
QSAR analysis is an efficient method for finding effective and less toxic phytochemical drug candidates.4 Prior knowledge of the biological system, various factors concerning physiological processes and pathological conditions, molecular dynamics and their properties, and modulating factors of drugs highly contribute to the development of effective QSAR models.14 Many QSAR studies have contributed to the determination of the biological properties of various phytochemicals. Some of these phytochemicals are highlighted.
TABLE 3 Some of the QSAR studies on phytochemicals using different software programs.
Natural product source | Disease/properties | Reference
Withanolide from Withania somnifera | Human breast cancer cell lines | 54
Pulvinic acid and coumarine derivatives | Antioxidant properties against radiation sources of Fenton, gamma, and UV | 59
Bioactive compounds of Gracilaria corticata | Lipinski rule of five and ADMET prediction | 60
705 phytochemical compounds | SARS-CoV and MERS-CoV | 57
Leaf fractionated compounds from Gongronema latifolium | Type 2 diabetes mellitus | 61
40 antiviral phytochemicals | NS3 protease of dengue virus | 24
Plant-derived essential oils | Fumigant and topical activities on Musca domestica | 62
Flavonoid derivatives from Artocarpus anisophyllus | Alzheimer's disease targeting acetylcholinesterase | 63
Coumarin isolated from Rutaceae species | Against enzyme glyceraldehyde-3-phosphate dehydrogenase (gGAPDH) from glycosomes of the parasite Trypanosoma cruzi, the causative agent of Chagas disease | 64
Natural products and related derivatives | Monoamine oxidase inhibitors | 65
Neolignan-based diaryl-tetrahydrofuran and -furan analogs | Against T. cruzi by trypanothione reductase activity | 66
Cucurbitaceae plants | HepG2 and HSC-T6 liver cell lines | 67
Monoterpenes from essential oils of diverse plants | Against Aedes aegypti | 68
Naturally occurring pentacyclic triterpenes | Inhibitors of glycogen phosphorylase contributing to hyperglycemia in type 2 diabetes | 69
Agarwal et al.52 employed a random forest-based binary QSAR model to screen a library of natural compounds for their anticancer
activity against EGFR double mutant tumors and identified a few lead compounds with potential for overcoming drug resistance in cancer.

In a study carried out by Shukla et al.,53 a QSAR model was developed to predict the biological activities of 18β-glycyrrhetinic acid (GA) derivatives against the triple-negative breast cancer cell line MDA-MB-231. The model identified specific structural features that significantly contributed to the cytotoxic activity of the compounds. The addition of an acetyl group at C-3 increased the lipophilicity of GA and improved its cytotoxicity, while substitutions at C-30 with butyl amide, propylamide, and amino ethyl amide were found to decrease the cytotoxicity potential. The study also confirmed the importance of the C-30 carboxylic group for the cytotoxic activity of GA. The results of this study provide valuable insights into early lead compound discovery for the potential treatment of metastatic breast cancer.

Yadav et al. performed 3D-QSAR analysis and docking simulations on derivatives of ursolic acid to evaluate their potential as anticancer agents against the bladder cancer cell line T24. They showed that structural modifications of the compounds could significantly affect their biological activity. The 3D-QSAR models developed in their study helped to identify the important structural features required for enhancing the activity of the derivatives. The study revealed that the compounds with a bulky and electron-withdrawing substituent at position C-3 of the triterpenoid skeleton showed higher anticancer activity by inhibiting the NF-κB pathway. The docking simulations further supported the biological activity of the derivatives by showing their favorable binding interactions with the active site of the target protein. The findings of this study could be useful for the design and development of potent anticancer agents targeting the NF-κB pathway. Overall, their study highlights the potential of 3D-QSAR analysis and docking simulations in predicting the biological activity of compounds and guiding the development of new drugs.54

In a field-based 3D-QSAR study,55 a model for the triterpene maslinic acid and its analogs was established. The model was successfully applied to virtually screen a large number of compounds for anticancer activity against the breast cancer cell line MCF-7. The model's acceptable regression and cross-validation coefficients demonstrated its accuracy and reliability for identifying active compounds. The activity atlas models provided a global view of the training set and helped to identify the key features underlying the structure–activity relationship (SAR). The virtual screening process, which involved applying filters for oral bioavailability, drug-like features, chemical synthesis, and cellular target docking, resulted in the identification of P-902 as the top hit. Overall, the results of this study demonstrate the usefulness of QSAR models in drug discovery for anticancer therapy and lead compound optimization from natural products. QSAR modeling can be a valuable tool in identifying potential drug candidates and in reducing the time and cost associated with traditional drug discovery methods.

A QSAR-based machine learning-integrated stepwise method was used to discover novel anti-obesity phytochemicals that antagonize the glucocorticoid receptor. The workflow made use of several R packages.56

R-based QSAR analysis was performed on selected phytochemicals to explore their therapeutic potential for COVID-19 treatment. Highly correlated features were screened using the cor function, and linear models were fitted using the lm function in R software.57 The potential of 4-chloro-3-formylcoumarin derivatives targeting human thymidine phosphorylase was analyzed by R-based QSAR.58 Some more QSAR-based studies with phytochemicals are listed in Table 3.

ACKNOWLEDGMENTS
The authors are grateful to DelCON's e-Journal Access Facility. Lutfun Nahar gratefully acknowledges the financial support received from the European Regional Development Fund—Project ENOCH (No. CZ.02.1.01/0.0/0.0/16_019/0000868) and the Czech Agency Grant—Project 23-05474S. In addition, equal major contributions from SSN and RN are duly acknowledged.

DATA AVAILABILITY STATEMENT
All the data can be requested from the corresponding author upon reasonable request.

ORCID
Rajat Nath https://orcid.org/0000-0002-6633-8122
Anupam Das Talukdar https://orcid.org/0000-0001-8916-2791

REFERENCES
1. Muratov EN, Bajorath J, Sheridan RP, et al. QSAR without borders. Chem Soc Rev. 2020;49(11):3525-3564. doi:10.1039/D0CS00098A
2. Selassie C, Verma RP. History of quantitative structure-activity relationships. In: Burger's Medicinal Chemistry and Drug Discovery. Vol 1. Wiley; 2003:1-48.
3. Veerasamy R. QSAR—an important in-silico tool in drug design and discovery. In: Advances in Computational Modeling and Simulation. Springer; 2022:191-208. doi:10.1007/978-981-16-7857-8_16
4. Das AP, Agarwal SM. Recent advances in the area of plant-based anti-cancer drug discovery using computational approaches. Mol Divers. 2023;1-25. doi:10.1007/s11030-022-10590-7
5. Ojo OA, Ojo AB, Okolie C, et al. Deciphering the interactions of bioactive compounds in selected traditional medicinal plants against Alzheimer's diseases via pharmacophore modeling, auto-QSAR, and molecular docking approaches. Molecules. 2021;26(7):1996. doi:10.3390/molecules26071996
6. Omoboyowa DA. Exploring molecular docking with E-pharmacophore and QSAR models to predict potent inhibitors of 14-α-demethylase protease from Moringa spp. Pharmacol Res Mod Chin Med. 2022;4:100147. doi:10.1016/j.prmcm.2022.100147
7. McCullough BD. Assessing the reliability of statistical software: part I. Am Stat. 1998;52(4):358-366.
8. Emmert-Streib F. Statistical Modelling of Molecular Descriptors in QSAR/QSPR. John Wiley & Sons; 2012.
9. R Core Team. R: A Language and Environment for Statistical Computing. http://www.R-project.org/. R Foundation for Statistical Computing; 2016.
10. Lo Y-C, Rensi SE, Torng W, Altman RB. Machine learning in chemoinformatics and drug discovery. Drug Discov Today. 2018;23(8):1538-1546. doi:10.1016/j.drudis.2018.05.010
11. Puzyn T. Recent Advances in QSAR Studies Methods and Applications. Springer; 2022.
12. McNaught AD, Wilkinson A. Compendium of Chemical Terminology: IUPAC Recommendations. Wiley–Blackwell; 1997.
13. Golbraikh A, Tropsha A. QSAR modeling using chirality descriptors derived from molecular topology. J Chem Inf Comput Sci. 2003;43(1):144-154. doi:10.1021/ci025516b
14. Kar S, Roy K. QSAR of phytochemicals for the design of better drugs. Expert Opin Drug Discovery. 2012;7(10):877-902. doi:10.1517/17460441.2012.716420
15. Roy K, Kar S, Das RN. Statistical methods in QSAR/QSPR. In: A Primer on QSAR/QSPR Modeling: Fundamental Concepts. Springer; 2015:37-59. doi:10.1007/978-3-319-17281-1_2
16. Liu P, Long W. Current mathematical methods used in QSAR/QSPR studies. Int J Mol Sci. 2009;10(5):1978-1998. doi:10.3390/ijms10051978
17. Sharma S, Bhatia V. Recent trends in QSAR in modelling of drug-protein and protein-protein interactions. Comb Chem High Throughput Screen. 2021;24(7):1031-1041. doi:10.2174/1386207323666201209093537
18. Wang Z, Chen J. Applicability domain characterization for machine learning QSAR models. In: Machine Learning and Deep Learning in Computational Toxicology. Springer; 2023:323-353. doi:10.1007/978-3-031-20730-3_13
19. Roth BL, Sheffler DJ, Kroeze WK. Magic shotguns versus magic bullets: selectively non-selective drugs for mood disorders and schizophrenia. Nat Rev Drug Discov. 2004;3(4):353-359. doi:10.1038/nrd1346
20. Luan F, Kleandrova VV, González-Díaz H, et al. Computer-aided nanotoxicology: assessing cytotoxicity of nanoparticles under diverse experimental conditions by using a novel QSTR-perturbation approach. Nanoscale. 2014;6(18):10623-10630. doi:10.1039/C4NR01285B
21. Kubinyi H. Chemogenomics in drug discovery. Ernst Schering Res Found Workshop. 2006;58:1-19. doi:10.1007/978-3-540-37635-4_1
22. Playe B, Stoven V. Evaluation of deep and shallow learning methods in chemogenomics for the prediction of drugs specificity. J Chem. 2020;12(1):11. doi:10.1186/s13321-020-0413-0
23. Chakravarti SK, Alla SRM. Descriptor free QSAR modeling using deep learning with long short-term memory neural networks. Front Artif Intell. 2019;2:17. doi:10.3389/frai.2019.00017
24. Rahman MM, Biswas S, Islam KJ, et al. Antiviral phytochemicals as potent inhibitors against NS3 protease of dengue virus. Comput Biol Med. 2021;134:104492. doi:10.1016/j.compbiomed.2021.104492
25. Islam R, Parves MR, Paul AS, et al. A molecular modeling approach to identify effective antiviral phytochemicals against the main protease of SARS-CoV-2. J Biomol Struct Dyn. 2021;39(9):3213-3224. doi:10.1080/07391102.2020.1761883
26. Basu A, Sarkar A, Maulik U. Molecular docking study of potential phytochemicals and their effects on the complex of SARS-CoV2 spike protein and human ACE2. Sci Rep. 2020;10(1):17699. doi:10.1038/s41598-020-74715-4
27. Williams G. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Springer Science & Business Media; 2011. doi:10.1007/978-1-4419-9890-3
28. Kursa M, Rudnicki W. Boruta: wrapper algorithm for all relevant feature selection. Version 5.2.0. 2017.
29. Fox J, Weisberg S. An R Companion to Applied Regression. Sage Publications; 2011.
30. Lumley T. Leaps: regression subset selection. R package version 3.0. Based on Fortran code by Alan Miller. 2017.
31. Hothorn T, Zeileis A, Farebrother RW, Cummins C, Millo G, Mitchell D. lmtest: testing linear regression models. R package version 0.9-34. 2015.
32. Jiratchayut K, Bumrungsup C. Penalized linear regression methods where the predictors have grouping effect. Thail Stat. 2019;17(2):212-222.
33. Cerdeira JO, Silva PD, Cadimo J, Minhoto M. Subselect: selecting variable subsets. Version 0.13. 2017.
34. Shamsara J. Ezqsar: an R package for developing QSAR models directly from structures. Open Med Chem J. 2017;11(1):212-221. doi:10.2174/1874104501711010212
35. Grabner M, Varmuza K, Dehmer M. RMol: a toolset for transforming SD/Molfile structure information into R objects. Source Code Biol Med. 2012;7(1):12. doi:10.1186/1751-0473-7-12
36. Tsiliki G, Munteanu CR, Seoane JA, Fernandez-Lozano C, Sarimveis H, Willighagen EL. RRegrs: an R package for computer-aided model selection with multiple regression models. J Chem. 2015;7(1):46. doi:10.1186/s13321-015-0094-2
37. Murrell DS, Cortes-Ciriano I, van Westen GJP, et al. Chemically Aware Model Builder (camb): an R package for property and bioactivity modelling of small molecules. J Chem. 2015;7(1):45. doi:10.1186/s13321-015-0086-2
38. Ash JR, Hughes-Oliver JM. chemmodlab: a cheminformatics modeling laboratory R package for fitting and assessing machine learning models. J Chem. 2018;10:1-20.
39. Alexander DL, Tropsha A, Winkler DA. Beware of R2: simple, unambiguous assessment of the prediction accuracy of QSAR and QSPR models. J Chem Inf Model. 2015;55(7):1316-1322. doi:10.1021/acs.jcim.5b00206
40. Varmuza K, Filzmoser P, Dehmer M. Multivariate linear QSPR/QSAR models: rigorous evaluation of variable selection for PLS. Comput Struct Biotechnol J. 2013;5(6):e201302007. doi:10.5936/csbj.201302007
41. Goodarzi M, Dejaegher B, Heyden YV. Feature selection methods in QSAR studies. J AOAC Int. 2012;95(3):636-651. doi:10.5740/jaoacint.SGE_Goodarzi
42. Ripley B, Venables B, Bates DM, Hornik K, Gebhardt A, Firth D. MASS: support functions and datasets for Venables and Ripley's MASS. R package version 7.3-47. 2017.
43. Cherkasov A, Muratov EN, Fourches D, et al. QSAR modeling: where have you been? Where are you going to? J Med Chem. 2014;57(12):4977-5010. doi:10.1021/jm4004285
44. Warren DL, Glor RE, Turelli M. ENMTools: a toolbox for comparative studies of environmental niche models. Ecography. 2010;33(3):607-611.
45. Topliss JG, Costello RJ. Chance correlations in structure-activity studies using multiple regression analysis. J Med Chem. 1972;15(10):1066-1068. doi:10.1021/jm00280a017
46. Peterangelo SC, Seybold PG. Synergistic interactions among QSAR descriptors. Int J Quantum Chem. 2004;96(1):1-9. doi:10.1002/qua.10591
47. Barua N, Sarmah P, Hussain I, Deka RC, Buragohain AK. DFT-based QSAR models to predict the antimycobacterial activity of chalcones. Chem Biol Drug Des. 2012;79(4):553-559. doi:10.1111/j.1747-0285.2011.01289.x
48. Hawkins DM, Hawkins D. Multivariate outlier detection. In: Identification of Outliers. Springer; 1980:104-114. doi:10.1007/978-94-015-3994-4_8
49. Kabacoff R. R in Action. Manning Publications Co.; 2011.
50. Tropsha A. Best practices for QSAR model development, validation, and exploitation. Mol Inform. 2010;29(6–7):476-488. doi:10.1002/minf.201000061
51. Maggiora GM. On outliers and activity cliffs—why QSAR often disappoints. J Chem Inf Model. 2006;46(4):1535. doi:10.1021/ci060117s
52. Agarwal SM, Nandekar P, Saini R. Computational identification of natural product inhibitors against EGFR double mutant (T790M/L858R) by integrating ADMET, machine learning, molecular docking and a dynamics approach. RSC Adv. 2022;12(26):16779-16789. doi:10.1039/D2RA00373B
53. Shukla A, Tyagi R, Meena S, Datta D, Srivastava SK, Khan F. 2D- and 3D-QSAR modelling, molecular docking and in vitro evaluation studies on 18β-glycyrrhetinic acid derivatives against triple-negative breast cancer cell line. J Biomol Struct Dyn. 2020;38(1):168-185. doi:10.1080/07391102.2019.1570868
54. Yadav DK, Kumar S, Saloni S, et al. Molecular docking, QSAR and ADMET studies of withanolide analogs against breast cancer. Drug Des Devel Ther. 2017;11:1859-1870. doi:10.2147/DDDT.S130601
55. Alam S, Khan F. 3D-QSAR studies on maslinic acid analogs for anticancer activity against breast cancer cell line MCF-7. Sci Rep. 2017;7(1):6019. doi:10.1038/s41598-017-06131-0
56. Shin SH, Hur G, Kim NR, Park JHY, Lee KW, Yang H. A machine learning-integrated stepwise method to discover novel anti-obesity phytochemicals that antagonize the glucocorticoid receptor. Food Funct. 2023;14(4):1869-1883. doi:10.1039/D2FO03466B
57. Bhargav A, Chaurasia P, Kumar R, Ramachandran S. Phytovid19: a compilation of phytochemicals research in coronavirus. Struct Chem. 2022;33(6):2169-2177. doi:10.1007/s11224-022-02035-6
58. Scior T, Garcia-Hernandez JC, Abdallah HH, Alexander C. QSAR applied to 4-chloro-3-formylcoumarin derivatives targeting human thymidine phosphorylase. Clin Complementary Med Pharmacol. 2022;2(2):100031. doi:10.1016/j.ccmp.2022.100031
59. Ahmadi S, Ghanbari H, Lotfi S, Azimi N. Predictive QSAR modeling for the antioxidant activity of natural compounds derivatives based on Monte Carlo method. Mol Divers. 2021;25(1):87-97. doi:10.1007/s11030-019-10026-9
60. Biswal A, Aishwarya A, Sharma A, Pazhamalai V. 2D QSAR, Admet prediction and multiple receptor molecular docking strategy in bioactive compounds of Gracilaria corticata against Plasmodium falciparum (contractile Protein). Inform Med Unlocked. 2019;17:100258. doi:10.1016/j.imu.2019.100258
61. Ajiboye BO, Iwaloye O, Owolabi OV, et al. Screening of potential antidiabetic phytochemicals from Gongronema latifolium leaf against therapeutic targets of type 2 diabetes mellitus: multi-targets drug design. SN Appl Sci. 2022;4(1):14. doi:10.1007/s42452-021-04880-2
62. Duchowicz PR, Bennardi DO, Ortiz EV, Comelli NC. QSAR models for insecticidal properties of plant essential oils on the housefly (Musca domestica L.). SAR QSAR Environ Res. 2021;32(5):395-410. doi:10.1080/1062936X.2021.1905711
63. Das S, Laskar MA, Sarker SD, et al. Prediction of anti-Alzheimer's activity of flavonoids targeting acetylcholinesterase in silico. Phytochem Anal. 2017;28(4):324-331. doi:10.1002/pca.2679
64. Menezes IR, Lopes JC, Montanari CA, et al. 3D QSAR studies on binding affinities of coumarin natural products for glycosomal GAPDH of Trypanosoma cruzi. J Comput Aided Mol Des. 2003;17(5/6):277-290. doi:10.1023/A:1026171723068
65. Dhiman P, Malik N, Khatkar A. 3D-QSAR and in-silico studies of natural products and related derivatives as monoamine oxidase inhibitors. Curr Neuropharmacol. 2018;16(6):881-900. doi:10.2174/1570159X15666171128143650
66. Hartmann AP, de Carvalho MR, Bernardes LSC, et al. Synthesis and 2D-QSAR studies of neolignan-based diaryl-tetrahydrofuran and -furan analogues with remarkable activity against Trypanosoma cruzi and assessment of the trypanothione reductase activity. Eur J Med Chem. 2017;140:187-199. doi:10.1016/j.ejmech.2017.08.064
67. Bartalis J, Halaweish FT. In vitro and QSAR studies of cucurbitacins on HepG2 and HSC-T6 liver cell lines. Bioorg Med Chem. 2011;19(8):2757-2766. doi:10.1016/j.bmc.2011.01.037
68. dos Santos IM, Agra JPG, de Carvalho TGC, de Azevedo Maia GL, de Alencar Filho EB. Classical and 3D QSAR studies of larvicidal monoterpenes against Aedes aegypti: new molecular insights for the rational design of more active compounds. Struct Chem. 2018;29(5):1287-1297. doi:10.1007/s11224-018-1110-8
69. Liang Z, Zhang L, Li L, et al. Identification of pentacyclic triterpenes derivatives as potent inhibitors against glycogen phosphorylase based on 3D-QSAR studies. Eur J Med Chem. 2011;46(6):2011-2021. doi:10.1016/j.ejmech.2011.02.053

How to cite this article: Ningthoujam SS, Nath R, Kityania S, et al. R software for QSAR analysis in phytopharmacological studies. Phytochemical Analysis. 2023;1-20. doi:10.1002/pca.3239
