
Quantitative Structure-Activity Relationship (QSAR): Modeling Approaches to Biological Applications
Swathik Clarancia Peter1, Tamil Nadu Agricultural University, Coimbatore, India
Jaspreet Kaur Dhanja1, Vidhi Malik, and Navaneethan Radhakrishnan, Indian Institute of Technology Delhi, New Delhi, India
Mannu Jayakanthan, Tamil Nadu Agricultural University, Coimbatore, India
Durai Sundar, Indian Institute of Technology Delhi, New Delhi, India
© 2018 Elsevier Inc. All rights reserved.

Introduction

The quantitative structure-activity relationship (QSAR) approach relies on the basic chemical principle that the biological
activity of any ligand or compound is associated with the arrangement of atoms forming its molecular structure. In other
words, structurally related molecules tend to possess similar biological activities. This structural information can be defined
in terms of a series of parameters called molecular descriptors. In QSAR, the biological activity is represented as a function
of these molecular descriptors, as depicted in Eq. (1).

$$\text{Biological response or activity} = f(\text{molecular descriptors}) \tag{1}$$

The model thus developed based on the biological activities of known ligands is used to predict the response of new
compounds.
QSAR finds applicability in a wide range of fields including toxicology (Wang et al., 2014; Rochani et al., 2010), ecotoxicology
(Hermens et al., 1984; Van Gestel and Ma, 1990; Escher et al., 2006), drug design and discovery (Zernov et al., 2003; Speck-Planche
et al., 2012; Buolamwini and Assefa, 2002), chemical data mining (Shen et al., 2004), combinatorial library design (Ghose et al.,
1999; Zhang et al., 2008) and so on.
QSAR studies therefore involve the selection of active and inactive compounds together with a measure of their biological
activity, the description and calculation of molecular descriptors, the selection of appropriate features, and the construction
and evaluation of the mathematical model.

QSAR and QSPR

Quantitative structure-activity relationship (QSAR) prediction depends on the structure of the molecule and the atoms present
in the compound. Biological activity is expressed either as a numerical value (for example, bioavailability or inhibitory
concentration) or as the presence or absence of a condition (for example, infected/not infected or mutagenic/non-mutagenic).
Various QSAR studies have been carried out to understand biological properties such as pharmacokinetics (Vieira et al., 2014;
Gombar and Hall, 2013), blood-brain barrier (BBB) penetration (Zhang et al., 2008), carcinogenicity (Fjodorova et al., 2010;
Kar and Roy, 2011), drug metabolism (Braga and Andrade, 2012; Lewis, 2000), bio-concentration (Grisoni et al., 2016; Papa et al.,
2007), permeability (Gozalbes et al., 2011; Fujikawa et al., 2007), drug clearance (Manga et al., 2003; Boik and Newman, 2008),
mutagenicity (Valencia et al., 2013; Barber et al., 2016), and so on. Another term associated with this approach is Quantitative
structure-property relationship (QSPR). In QSPR, physicochemical properties of chemical compounds are determined from
molecular structure information. Physicochemical properties such as melting point (Katritzky et al., 2002; Modarresi et al.,
2006), boiling point (Sola et al., 2008; Dai et al., 2013), solubility (Duchowicz and Castro, 2009; Gao et al., 2002), stability
(Dioury et al., 2014; Ghasemi et al., 2010), dielectric constant (Achary, 2014; Soltanpour et al., 2016), reactivity (Toropov
et al., 2004), diffusion coefficient (Mirkhani et al., 2012), thermodynamic properties (Puri et al., 2002; Duchowicz et al., 2006)
and hydrophobicity (Zou et al., 2016; Berinde, 2013) have been modeled through quantitative structure-property relationships.

History

The concept of QSAR originated more than a century ago. Crum-Brown and Fraser proposed in 1868 that the physiological activity
of molecules depends on their composition (Crum-Brown and Fraser, 1868). The narcotic effect of the primary alcohols in relation
to their molecular weight was then studied by Richardson (Richardson, 1869). Following this, studies of simple organic compounds
in relation to water solubility (Richet, 1893), the potency variation of narcotic compounds (Meyer, 1899), the study of chemical
reactivity of substituted benzenes (Hammett, 1937), a study of narcosis based on logP and thermodynamics (Ferguson, 1939), the
application of physical organic chemistry to linear steric energy relationships (Taft, 1952, 1953a, 1953b), the linear free energy
relationship model by Hansch and Fujita (1964), the QSAR study based on molecular fragments by Free and Wilson (1964) and the
substituent-based structure-activity relationship (Fujita and Ban, 1971) are some of the important landmarks in the history of
QSAR. The development of 2D-QSAR began in the early 1970s and that of 3D-QSAR in the early 1980s. With the arrival of new
technologies and perspectives, QSAR has now become multidimensional. QSAR approaches of increased dimensionality, such as 4D,
5D and 6D, have come into practice and have led to increased predictability, reliability and precision of the models.

1 Equal contribution.

Encyclopedia of Bioinformatics and Computational Biology doi:10.1016/B978-0-12-809633-8.20197-0

Types of QSAR Methodologies

QSAR can be broadly classified in two ways (Qiao et al., 2014). The first classification is based on the dimensions of the
descriptors involved in the model, such as 1D, 2D, 3D, 4D, 5D and so forth. The other is based on the type of biological activity
predicted as the dependent variable. This includes Quantitative structure-toxicity relationship (QSTR) (Can, 2014), Quantitative
structure-metabolism relationship (QSMR), Quantitative structure-reactivity relationship or Quantitative structure-retention
relationship (QSRR) (Hemmateenejad et al., 2009; Goryński et al., 2013), Quantitative structure-permeability relationship or
Quantitative structure-pharmacokinetics relationship (QSPR) (Moss et al., 2002; Mayer and Van De Waterbeemd, 1985), Quantitative
structure-bioavailability relationship or Quantitative structure-binding affinity relationship (QSBR) (Andrews et al., 2000;
Zhang et al., 2006), and so forth.
QSAR models can also be grouped by the type of correlation analysis (linear or non-linear) (Roy and Mandal, 2008), or by the
binding nature of molecule and receptor (receptor-dependent or receptor-independent) (Magdziarz et al., 2009).

QSAR Model Construction

The quality of a QSAR model depends to a large extent on the data used for its construction. Hence, prior to model
development, it is necessary to gain insight into the data. A thorough understanding of the problem and of the factors
influencing it assists in discerning meaningful relationships. Relevant background information about the system under study,
whether biological or chemical, should be collected through a literature search. Data sets for model construction need to be
chosen carefully, as poor and inconsistent data lead to a corrupt model. Several other factors, such as the division of data
into training and test sets, the molecular descriptors, and the statistical methods used for model development, also influence
the quality of the QSAR model.
The schematic representation of QSAR model construction is given in Fig. 1. The steps involved in QSAR model generation are
briefly described in the following sections.

Fig. 1 Schematic representation of QSAR (quantitative structure-activity relationship) model development.



Data Pre-Processing
Pre-processing eliminates noise and redundancy in the data sets. It involves data transformation, which includes smoothing,
normalization and aggregation (Tomar and Agarwal, 2014), as well as data reduction, sampling when data sets are large, noise
elimination, feature selection, data cleaning, data integration when data are collected from heterogeneous sources, and
discretization (Cocu et al., 2008).
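As an illustration of the normalization step, the following minimal sketch rescales a single descriptor column in two common ways. The function names and the sample molecular weights are hypothetical and are not taken from the cited studies.

```python
from statistics import mean, pstdev

def autoscale(column):
    """Scale one descriptor column to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant descriptor carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]

def min_max(column):
    """Rescale one descriptor column to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]

# Hypothetical molecular weights for four compounds.
mol_weight = [100.0, 150.0, 200.0, 250.0]
scaled = min_max(mol_weight)
standardized = autoscale(mol_weight)
```

Scaling of this kind keeps descriptors with large numerical ranges from dominating distance-based or regression-based steps later in the workflow.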

Training and Test Data Sets


The data are divided into training and test sets. The training set is used to formulate the QSAR model, while the test set is
used to evaluate its predictability and accuracy. The dataset is generally divided in such a way that both sets span the entire
descriptor space. Employing an appropriate data-splitting technique improves model prediction. The approaches available for
dividing a dataset include k-means clustering, selection based on X response, selection based on Y response, random selection,
statistical molecular design, sphere exclusion, Kennard-Stone selection, Kohonen's self-organizing map selection, and
extrapolation-oriented test set selection (Roy et al., 2008).
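One of the listed approaches, Kennard-Stone selection, can be sketched as follows. This is a minimal illustration that assumes Euclidean distance in descriptor space; the function name, the sample points and the choice of three training compounds are hypothetical.

```python
from math import dist

def kennard_stone(points, n_train):
    """Return indices of n_train points chosen by Kennard-Stone selection."""
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of points")
    # Start with the two compounds that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_train:
        # Add the candidate whose nearest selected neighbour is farthest away.
        best = max(
            remaining,
            key=lambda i: min(dist(points[i], points[s]) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical two-descriptor coordinates for five compounds.
descriptors = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
train_idx = kennard_stone(descriptors, 3)
test_idx = [i for i in range(len(descriptors)) if i not in train_idx]
```

Because each new training compound is the one farthest from those already chosen, the training set tends to cover the boundaries of the descriptor space, and the remaining compounds form the test set.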

Calculation of Molecular Descriptors


Information about the structure of molecules, defined by molecular descriptors obtained from different representations such
as 2D and 3D, is embedded into the QSAR model. The molecular conformations used should be correct in order to obtain a better
predictive model. There are different kinds of descriptors, such as count descriptors (0D), fingerprints (1D), topological
descriptors (2D), geometrical descriptors (3D), grid-based descriptors (4D) and so forth. The complexity of the information
and the power to discriminate between similar structures provided by the descriptors increase with dimensionality. 0D and 1D
descriptors provide basic information, such as molecular weight and the number of constituent elements, that is derived directly
from the molecular formula. The net charge of the molecule is a 1D descriptor. Topological indices, which are computed from the
structural formula, are 2D descriptors. These are based on graph theory and reflect the connections in the structure. The most
widely used topological descriptor is the connectivity index proposed by Randić (Randić, 2001; Li and Shi, 2008; Kier, 1985).
Other topological indices are Wiener's index W (Ivanciuc, 2000; Wiener, 1947), connectivity indices (Kier et al., 1975), Kier
shape indices (Kier, 1985), the Balaban J index (Balaban, 1982) and Zagreb indices (Gutman and Trinajstić, 1972). The 3D
descriptors are based on the three-dimensional coordinates of the atoms comprising the compounds. Some commonly used methods
for calculating 3D descriptors are CoMFA, CoMSIA, CoMBINE, GERM, CoMMA, GRIND, WHIM, HoloQSAR and CoSA. 4D, 5D and 6D
descriptors are multidimensional descriptors, which include parameters describing the structure and flexibility of the
receptor-binding site in conjunction with ligand topology. 4D descriptors are based on reference grids and molecular dynamics
simulations. Descriptors calculated using multiple conformations, orientations, protonation states and stereoisomers of the
ligand constitute 5D descriptors. The solvation terms constitute 6D descriptors. Molecular descriptors can be calculated using
various software packages, some of which are listed in Table 1 (Damale et al., 2014).
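To make the idea of a graph-theoretical (2D) descriptor concrete, the sketch below computes Wiener's index W as the sum of bond-count distances over all pairs of atoms in a hydrogen-suppressed molecular graph. The adjacency-list encoding and the function name are assumptions made for illustration.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path lengths over all pairs of atoms in a molecular graph."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives bond-count distances from `source`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distance:
                    distance[neighbour] = distance[atom] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    # Each pair was counted twice (once from each end).
    return total // 2

# Hydrogen-suppressed carbon skeletons, atoms numbered arbitrarily.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
w_linear = wiener_index(n_butane)
w_branched = wiener_index(isobutane)
```

The two isomers share a molecular formula, so 0D and 1D descriptors cannot tell them apart, whereas the topological index differs because branching shortens the average path between atoms.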

Feature Selection
Feature selection reduces the dataset horizontally, that is, it reduces the number of descriptors. Among the large number of
calculated descriptors, only a few are chosen to define the model. Feature selection after descriptor calculation removes
collinearity between descriptor pairs. Selection of the most appropriate features is done using filter and wrapper methods
(Goodarzi et al., 2012). Filter methods reduce the size of the descriptor pool based on inter-variable correlations: molecular
descriptors that are inter-correlated are removed, retaining only one descriptor from each pair (Roy et al., 2015a). Descriptors
with very low variance are also removed. Filter methods use techniques such as chi-square analysis, Shannon entropy, odds ratio,
GSS coefficient (Liu, 2004), correlation-based feature selection (Demel et al., 2009), Fisher score, Kolmogorov-Smirnov
statistics (Guyon et al., 2002), and principal component analysis. Distance-based methods such as Euclidean distance measures
are also grouped under filter methods. Wrapper methods use regression-based approaches to select descriptors. In general,
wrapper methods require more computational power and perform better than filter methods. Recursive feature elimination (Xue
et al., 2004), variable selection and modeling based on the prediction (Liu et al., 2003), k-nearest neighbor, backward
elimination, forward selection, genetic algorithm, Bayesian regularized neural network, factor analysis and combinatorial
protocol are some of the commonly used wrapper methods. Hybrid methods that combine filter and wrapper methods are also used
for selecting features (Goodarzi et al., 2012).
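The filter strategy described above (drop near-constant descriptors, then keep one descriptor from each highly correlated pair) can be sketched as follows. The thresholds, function names and descriptor values are hypothetical and would need tuning for a real data set.

```python
from math import sqrt
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient between two descriptor columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def filter_descriptors(table, min_variance=1e-8, max_abs_r=0.9):
    """Return names of descriptors kept after variance and correlation filtering."""
    # Step 1: drop (near-)constant descriptors.
    kept = [name for name, values in table.items() if pvariance(values) > min_variance]
    # Step 2: from each highly correlated pair keep only the first descriptor.
    selected = []
    for name in kept:
        if all(abs(pearson(table[name], table[other])) < max_abs_r for other in selected):
            selected.append(name)
    return selected

# Hypothetical descriptor table: one column per descriptor, one value per compound.
table = {
    "mol_weight": [100.0, 120.0, 140.0, 160.0, 180.0],
    "heavy_atoms": [7.0, 8.5, 10.0, 11.0, 13.0],  # nearly collinear with mol_weight
    "n_rings": [1.0, 1.0, 1.0, 1.0, 1.0],         # constant across compounds
    "logp": [1.2, 0.4, 2.9, 0.8, 1.7],
}
selected = filter_descriptors(table)
```

Which member of a correlated pair survives depends here only on column order; in practice the descriptor that is easier to interpret or better correlated with activity is often the one retained.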

Table 1 Software for calculation of molecular descriptors

Software | Description | Availability
ACD/LogP Freeware | logP prediction by fragment-based algorithm | Freely available
Dragon | Calculation of topological, constitutional & geometrical descriptors | Commercial
MOLGEN | Calculation of topological, constitutional & geometrical descriptors | Freely available
PaDEL Descriptors | Calculation of 2D & 3D descriptors | Freely available

Heuristic methods based on multiple linear regression are also used in selecting descriptors for QSAR models. The heuristic
method is fast compared to other methods. It discards descriptors with constant values and descriptors whose values are not
available for all the structures in the dataset, thereby removing trivial descriptors. It also removes highly correlated
descriptors (Liu and Long, 2009). The heuristic method has been used for searching descriptor space and selecting vital
descriptors in a QSAR study of 1,4-dihydropyridine calcium channel antagonists (Si et al., 2006), for selecting descriptors
for a multivariable linear model to predict the Percent of Applied Dose Dermally Absorbed (PADA) of polycyclic aromatic
hydrocarbons (Wang et al., 2008), and for descriptor selection in a QSAR analysis of photosystem II electron transfer
inhibitors (Karacan et al., 2012).

QSAR Methods
Statistical methods are used in model construction and in feature selection when there are large numbers of descriptors. They
are helpful in obtaining functional endpoints. The statistical methods can be classified into regression-based approaches,
classification-based approaches, and machine learning techniques. QSAR modeling can be done for both linear and non-linear
properties. Methods for modeling linear properties include linear regression and partial least squares regression, while
artificial neural networks are employed for modeling non-linear properties.
Models can be constructed using both supervised and unsupervised techniques. Chance effects are smaller in unsupervised
learning than in supervised learning, because unsupervised methods do not use the response values to fit the model.
Semi-supervised learning can be advantageous over both supervised and unsupervised techniques, as it considers both labeled
and unlabeled data and can thereby give better performance (Settles, 2012). A comparison of supervised and semi-supervised
learning algorithms on different data sets suggested that the addition of unlabeled data is helpful for certain types of
datasets and methods (Levatic et al., 2013).

Regression based methods


Multiple linear regression
The Multiple Linear Regression (MLR) method helps in establishing a correlation between the independent and dependent
variables. Here, the dependent variable is the biological activity or physicochemical property of the system being studied,
and the independent variables are molecular descriptors obtained from different representations. In simple linear regression
models, the dependent variable is predicted using only one descriptor or feature. Multiple linear regression models consider
more than one descriptor for the prediction of the property or activity in question. The model based on simple linear
regression can be represented by the equation given below:
$$y = a + bx \tag{2}$$
where y is the dependent (response) variable representing the physicochemical property or biological activity, x is the
independent (predictor) variable accounting for the molecular descriptor, a is the intercept, and b is the regression
coefficient.
Examples of QSAR studies involving the use of MLR method include the prediction of binding affinities of H3 antagonists
(Dastmalchi et al., 2012) and inhibitory activity of human non-pancreatic secretory phospholipase A2 (Singh and Verma, 2014).
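Extending Eq. (2) to several descriptors gives a model of the form y = a + b1 x1 + b2 x2 + ..., whose coefficients are obtained by ordinary least squares. The sketch below solves the normal equations with plain Gauss-Jordan elimination; it is a didactic illustration under that assumption, and the function names and data are hypothetical.

```python
def fit_mlr(X, y):
    """Ordinary least squares: return [intercept, b1, b2, ...]."""
    rows = [[1.0] + list(x) for x in X]  # prepend a column of ones for the intercept
    p = len(rows[0])
    # Normal equations (A^T A) beta = A^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("descriptors are collinear; cannot fit the model")
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]

def predict(coefficients, x):
    """Evaluate a + b1*x1 + b2*x2 + ... for one compound."""
    return coefficients[0] + sum(b * v for b, v in zip(coefficients[1:], x))

# Hypothetical data generated from y = 1 + 2*x1 - 0.5*x2.
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
y = [2.0, 4.5, 5.0, 7.5, 8.5]
coefficients = fit_mlr(X, y)
new_activity = predict(coefficients, (6.0, 2.0))
```

The collinearity check reflects why feature selection precedes MLR: when two descriptors carry the same information, the normal equations become singular and the coefficients are no longer uniquely defined.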

Partial least squares method


Partial Least Squares (PLS), developed from principal component regression, helps in building models that predict more than
one dependent variable (Lorber et al., 1987). This method is used when the number of variables is greater than the number of
compounds in the dataset and when the variables considered for the study are correlated (Cramer, 1993). It is applied in the
3D-QSAR technique Comparative Molecular Field Analysis (CoMFA) to reduce the number of descriptors. PLS is also used in the
validation of models (Mota et al., 2009). For such data it offers advantages over other regression methods (Cramer, 1993).
PLS has been used in the construction of many successful QSAR models. Prediction of the binding affinity of polycyclic
aromatic compounds to the rat liver 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) receptor (Johnels et al., 1989) is one such
example. PLS is also used for QSAR studies in combination with other methods, as in Genetic Partial Least Squares (G/PLS),
Factor Analysis Partial Least Squares (FA-PLS) and Orthogonal Signal Correction Partial Least Squares (OSC-PLS) (Liu and
Long, 2009).

Classification based methods


Cluster analysis
Clustering involves placing similar data into groups in a way that maximizes similarity within groups and dissimilarity between
groups. It includes methods such as hierarchical clustering and k-means clustering. In hierarchical clustering, clusters are
formed on the basis of dissimilarities calculated from the distances between objects (for example, Euclidean distances).
k-means clustering is a non-hierarchical method based on k centroids. Other classification-based methods include linear
discriminant analysis and logistic regression (Roy et al., 2015c; Agresti, 2007; Harrell, 2001).
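The k-means procedure mentioned above can be sketched as an alternation between assigning each compound to its nearest centroid and recomputing the centroids. The version below is a minimal illustration that, as a simplifying assumption, uses the first k points as initial centroids; real implementations usually use random or k-means++ initialization. Names and data are hypothetical.

```python
from math import dist

def k_means(points, k, iterations=100):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each compound joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return labels, centroids

# Hypothetical two-descriptor coordinates forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
labels, centroids = k_means(points, 2)
```

Unlike hierarchical clustering, the number of clusters k must be chosen in advance, and the result can depend on the initial centroids.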

Machine learning techniques


Artificial neural network
An Artificial Neural Network (ANN) mimics the behavior of biological neurons. The ANN has an input layer, one or more hidden
layers and an output layer. The molecular information is fed through the input layer and processed by a number of processing
units in parallel, and the biological activity or property is obtained as the output. The ANNs commonly used in QSAR are back
propagation neural networks, probabilistic neural networks, Kohonen self-organizing maps and Bayesian regularized neural
networks. Neural networks can be supervised or unsupervised in nature. Learning is supervised when the network is trained on
compounds with known activity values; the trained model is then validated with a separate test set. The training set helps in
fitting the weight parameters and in choosing the number of hidden layers in the network architecture. ANN methods have proven
to be highly adaptable and are deployed in modeling non-linear systems with high variability in the data sets. Superior models
can be obtained from ANNs compared to traditional approaches like MLR and PLS (Shi et al., 2010). Techniques such as dropout,
applied during training, reduce over-fitting and produce improved results compared to conventional ANN models. However,
Bayesian networks still stand out in their performance.
ANNs have been recently used in many QSAR studies. Some examples include the study of neurotrophic effects of N-p-Tolyl/
phenylsulfonyl L-amino acid thioester derivatives (Luo et al., 2011) and the study of antibacterial activity of oxazolidinone
derivatives (Zou and Zhou, 2007).

Support vector machine


Support Vector Machine (SVM) is a machine learning approach that uses a linear classifier to classify data into two categories.
The classifier is non-probabilistic. SVMs have been reported to perform better than 3D-QSAR models: in a comparative study of
3D-QSAR modeling and SVMs for predicting the activity of BRAF-V600E and HIV integrase inhibitors, SVMs outperformed the 3D-QSAR
models (Wesley et al., 2016). SVMs are also used in combination with other methods such as MLR and PLS for building more
powerful and accurate QSAR models. In one QSAR report, in which the anti-Alzheimer activity of triazolylthiophenes as
cyclin-dependent kinase 5 inhibitors was studied (Garkani-Nejad and Ghanbari, 2016), MLR was used for selecting molecular
descriptors, and support vector regression and PLS were used for constructing non-linear and linear models, respectively. When
these methods were compared, support vector regression outperformed the other methods. SVMs are also employed in dimensionality
reduction through variable ranking and selection (Bi et al., 2003), and they overcome the issue of over-fitting observed in
artificial neural networks.
In a study on predicting the inhibition of dihydrofolate reductase by pyrimidines, SVMs outperformed other methods, namely a
radial basis function network, a nearest neighbor classifier and a decision tree (Burbidge et al., 2001). QSPR models are also
being developed by combining SVMs with other methods such as Principal Component Least Square methods (Veyseh et al., 2015;
Khorshidi et al., 2014).

Gene expression programming


Gene Expression Programming (GEP) is based on the genetic algorithm and genetic programming. GEP has been used in QSAR
modeling for the prediction of dermal penetration (PADA, Percent of Applied Dose Dermally Absorbed) of polycyclic aromatic
hydrocarbons (Wang et al., 2008), prediction of the EC50 of anti-HIV drugs (Si et al., 2008), prediction of the binding
affinity of substituted 1-(3,3-diphenylpropyl)-piperidinyl amides and ureas to the chemokine receptor 5, and prediction of the
toxicity of aromatic compounds (Shi et al., 2010). In these studies GEP gave better predictions than the previously discussed
methods such as ANNs (Shi et al., 2010) and SVMs (Si et al., 2008). Further, Improved Gene Expression Programming (IGEP) has
been reported to be more efficient than existing methods (Fu et al., 2010).
Some other methods deployed in QSAR studies include Monte Carlo simulations (Kumar and Chauhan, 2017), principal component
analysis (Suzuki et al., 2001), and decision trees and the random forest algorithm (Polishchuk et al., 2009; Simeon et al.,
2016). Still newer methods such as Projection Pursuit Regression (Du et al., 2011) and Local Lazy Regression (Guha et al.,
2006; Lei et al., 2010) are also implemented in QSAR models.

Software for QSAR Studies and Modeling

QSAR studies are carried out using various platforms that help in building models for predicting chemical, biological and
toxicological activities. Some of these are listed in Table 2.

List of Databases Used in QSAR Studies

Attempts have also been made to archive constructed QSAR models for further reference and use. Some of the resulting databases
are listed in Table 3.

Validation of Models

Once the model is constructed, it must be validated. Validation guards against chance correlation among the numerous
descriptors used in the model and against over-fitting of the data. It helps in assessing the accuracy and predictive ability
of the model. The Organisation for Economic Co-operation and Development (OECD) has put forth five principles to test the
model. They are (1) a defined endpoint,

Table 2 Platforms used for QSAR modeling

Software | Description | Availability
3D-QSAR | To build 3D-QSAR models | Freely available
ACD/Tox Suite | Used for prediction of toxicity endpoints | Commercial, free web service
ADMET Predictor | Calculation of ADMET properties | Commercial
AZOrange | Machine learning platform for QSAR modeling | Freely available
BioPPsy | Prediction of pharmacokinetic properties of drug candidates by QSPR modeling | Freely available
BioTriangle | Web-based platform for calculating molecular descriptors | Freely available
BlueDesc | Molecular descriptor calculator | Freely available
CACTVS | Molecular descriptor calculator | Freely available
CAESAR | Models for developmental toxicity | Freely available
ChemDes | Descriptor and fingerprint calculation (web-based) | Freely available
CODESSA | Generates predictive QSAR models from quantum chemical, topological & electrostatic descriptors | Commercial
CoFFer | Web-based QSAR service for the prediction of chemical compounds | Freely available
CORALSEA | Building of quantitative structure-property/activity relationships | Freely available
Derek | Rule-based system with structural alerts for developmental toxicity, teratogenicity, testicular toxicity, and oestrogenicity | Commercial
DMax | Data mining tool for QSAR, virtual screening and compound screening data analysis | Freely available
DWFS | Parallel GA wrapper feature selection (web-based) | Freely available
ECOSAR | Calculates aquatic toxicities | Freely available
EPISuite | Suite of programs for estimation of physicochemical properties and environmental fate | Freely available
eTOXlab | Development and validation of QSAR models | Freely available
GUSAR | Development of QSAR/QSPR models (web-based) | Freely available
HASL | Software package for 3D-QSAR | Commercial
HYBOT-PLUS | Descriptor calculation | Commercial
Leadscope | QSAR models to predict reproductive and developmental toxicity for rodent foetus | Commercial
Mathematica | Software package for ANN development | Commercial
Matlab | Software package for ANN development | Commercial
MC 3DQSAR | Generates QSAR equations | Freely available
Molcode Toolbox | Prediction of toxicological endpoints | Commercial
MultiCASE | Models for developmental toxicity | Commercial
Neuralware | Software package for ANN development | Commercial
OECD QSAR Application Toolbox | QSAR models to fill data gaps & missing data | Freely available
PASS | Predicts biological activity using Bayesian algorithm, predicts embryotoxicity and teratogenicity | Commercial
QSARpro | Predicts activity and optimizes lead compounds using QSAR models | Commercial
SPSS | Software package for ANN development | Commercial
Statistica | Software package for ANN development | Commercial
TerraQSAR | Database of compounds with structure-specific biological activity | Freely available
T.E.S.T | Predicts toxicities of compounds by applying QSAR methodologies | Freeware
TIMES | Prediction of oestrogen, androgen and aryl hydrocarbon binding compounds | Commercial
TOPKAT | Prediction of toxicological endpoints | Commercial
Toxmatch | Provides chemical similarity indices to assist in read-across assessments & developing categories | Freely available
WEBCDK | Calculation of molecular descriptors (web-based) | Freely available
VCCL | Suite of programs for descriptor calculation, dimensionality reduction & data analysis | Freely available
VEGA-QSAR | QSAR models for regulatory purposes can be accessed and new QSAR models can be built | Freely available
VirtualToxLab | Based on the combination of automated flexible docking and mQSAR | Commercial

(2) an unambiguous algorithm, (3) a defined domain of applicability, (4) appropriate measures of goodness-of-fit, robustness and
prediction accuracy, and (5) a mechanistic interpretation, if possible (Roy et al., 2015b). Models can be validated through
techniques such as internal validation, external validation, and cross validation (Veerasamy et al., 2011). In internal
validation, the activity is predicted and the validation parameters are estimated using the same compounds that were used for
model construction, in order to analyze the precision of the prediction. This approach is not suitable when a new test set of
compounds is used. But the external validation technique works well

Table 3 List of databases related to QSAR studies

Database | Description | URL
Danish QSAR database | Database of QSAR predictions | http://qsar.food.dtu.dk
ECOTOX | Online database that provides toxicity data on aquatic life, terrestrial plants & wildlife | https://cfpub.epa.gov/ecotox/
EDKB (Endocrine Disruptor Knowledge Base) | Online database with predictive models to predict binding affinity of compounds with oestrogen & androgen nuclear receptor proteins | http://edkb.fda.gov/webstart/edkb/index.html
JRC QSAR Model Database | Database of QSAR models | https://eurl-ecvam.jrc.ec.europa.eu/databases/jrc-qsar-model-database
MOE | Database of molecular data and QSAR modeling | https://www.chemcomp.com/MOE-Molecular_Operating_Environment.htm
MOLE db | Free database for molecular descriptors | http://michem.disat.unimib.it/mole_db/
QsarDB | Database of QSAR/QSPR models | https://qsardb.org/

even with new datasets. In this case, the dataset is divided into test and training sets, and model validation is done with
test set compounds that are independent of the training set (Roy et al., 2015b; Veerasamy et al., 2011; Roy and Kar, 2015).
However, external validation has been criticized because it sets aside a large portion of the data set for testing rather than
for model building (Hawkins et al., 2003). Golbraikh and Tropsha (2002) regarded a high value of cross-validated R2 (q2) as
only one of the criteria for a QSAR model to have high predictive power. They emphasized that relying on q2 alone to judge
predictivity is incorrect; instead, an external data set should be used for validation to establish the predictive power of
the model (Golbraikh and Tropsha, 2002).
Validation metrics are of two types, depending on the type of QSAR model (Kar and Roy, 2011; Roy et al., 2015b):
1. Regression-based QSAR models
2. Classification-based QSAR models

Both regression and classification based methods have their unique metrics for validation.

Validation Metrics for Regression-Based Methods


Validation metrics for regression-based models are calculated for both internal and external validation strategies.

Validation metrics for internal validation


Internal validation of QSAR models employs the use of molecules from training set to test the predictability of the model. Some of
the most common methods used for internal validation of QSAR models are described here.

Least square fitting


Least square fitting is similar to linear regression and is the most commonly used validation method. It measures the squared
correlation coefficient (R2) between the predicted and experimental values of activity. Outliers can be removed from the
training data set in order to optimize the QSAR model if the difference between R2 and R2adj is less than 0.3 (Veerasamy et
al., 2011).
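As a worked illustration, R2 and its adjusted form can be computed as below. The sketch assumes the usual coefficient-of-determination form R2 = 1 - SSres/SStot, which coincides with the squared correlation coefficient for a least-squares fit on the training set, and the standard adjustment for the number of descriptors; the activity values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_compounds, n_descriptors):
    """Penalize R2 for the number of descriptors used in the model."""
    return 1.0 - (1.0 - r2) * (n_compounds - 1) / (n_compounds - n_descriptors - 1)

# Hypothetical experimental and fitted activities for five training compounds.
observed = [5.0, 6.0, 7.0, 8.0, 9.0]
predicted = [5.5, 5.5, 7.0, 8.5, 8.5]
r2 = r_squared(observed, predicted)
r2_adj = adjusted_r_squared(r2, n_compounds=5, n_descriptors=2)
difference = r2 - r2_adj
```

Because R2adj falls as descriptors are added without a matching gain in fit, the gap between R2 and R2adj indicates how much of the apparent fit is owed to model size.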

Chi-squared ( χ2) and root-mean squared error (RMSE)


The χ2 and RMSE values are used to assess the predictive quality of a model. The χ2 value reflects the difference between the
experimentally determined bioactivity values and the values predicted by the model, whereas the RMSE value expresses the error
between the mean of the experimental and the predicted activity values. Even for models with a large R2 value (that is, ≥ 0.7),
the values of χ2 and RMSE should be lower than 0.5 and 0.3, respectively, for the model to have good predictive ability
(Veerasamy et al., 2011).
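The two quantities can be sketched as follows, assuming the commonly used definitions χ2 = Σ (y_obs − y_pred)² / y_pred and RMSE = sqrt(Σ (y_obs − y_pred)² / n). These exact formulas are an assumption, since the text above does not spell them out, and the activity values are hypothetical.

```python
from math import sqrt

def chi_squared(observed, predicted):
    """Sum of squared errors, each scaled by the predicted value (assumed definition)."""
    return sum((o - p) ** 2 / p for o, p in zip(observed, predicted))

def rmse(observed, predicted):
    """Root-mean-squared error between observed and predicted activities."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Hypothetical experimental and predicted activities (for example, pIC50 values).
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.2, 5.8, 7.1, 8.1]
chi2_value = chi_squared(observed, predicted)
rmse_value = rmse(observed, predicted)
# Thresholds quoted in the text above: chi-squared < 0.5 and RMSE < 0.3.
passes = chi2_value < 0.5 and rmse_value < 0.3
```

Note that χ2 in this form requires positive predicted values, so it suits activities expressed on scales such as pIC50.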

Cross validation
Cross validation approaches for internal validation include Leave-Group-Out (LGO), in which a molecule or a group of molecules
is left out while the model is created and the predictability of the model is then evaluated using the molecules left out.
Some of the important measures used in the internal cross validation of QSAR models are listed below.

Leave-One-Out (LOO) cross validation


In LOO cross validation, one compound is left out and the QSAR model is constructed using the remaining compounds. The
eliminated compound is then used to test the resulting model. This process is repeated, eliminating each of the compounds in the
dataset one by one. The results obtained are used to estimate the parameters involved in the validation metrics. The
predictability of the model is assessed by the Predicted Residual Sum of Squares (PRESS) and the cross-validated R² (Q²), while
the Standard Deviation of Error of Prediction (SDEP) is obtained from PRESS (Roy and Kar, 2015; Roy et al., 2015b).
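
The LOO procedure can be sketched as follows. This is an illustration only: it assumes a one-descriptor linear model fitted by ordinary least squares, the common definitions Q² = 1 − PRESS/SS_total and SDEP = sqrt(PRESS/n), and a hypothetical data set.

```python
from math import sqrt
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for one descriptor: returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def loo_cross_validation(x, y):
    """Leave each compound out once, refit, and predict the left-out compound."""
    press = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss_total = sum((v - mean(y)) ** 2 for v in y)
    q2 = 1 - press / ss_total
    sdep = sqrt(press / len(x))
    return press, q2, sdep

# Hypothetical descriptor values (x) and activities (y).
x = [1.0, 1.5, 2.1, 2.8, 3.3, 3.9, 4.6, 5.0]
y = [4.2, 4.9, 5.3, 6.1, 6.4, 7.2, 7.6, 8.1]

press, q2, sdep = loo_cross_validation(x, y)
print(f"PRESS = {press:.3f}, Q2 = {q2:.3f}, SDEP = {sdep:.3f}")
```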

Leave-Some-Out cross validation


In the case of Leave-Some-Out (LSO) or Leave-Many-Out (LMO) cross validation, a set of compounds is eliminated and models are
created with the rest of the compounds. The left-out compounds are then used to check the predictability of the model. Similar to
the LOO approach, the LMO approach involves repetitive cycles of elimination and model creation until every compound has been
treated as part of a test set (Veerasamy et al., 2011; Kar and Roy, 2011; Roy et al., 2015b). On completion of all cycles of model
training and testing, the overall LMO-Q² value is calculated from the compounds' predicted activity values. The LMO method is more
consistent than LOO (Veerasamy et al., 2011).
The value of Q² is usually smaller than the value of R². To avoid over-fitting of the model, the difference between R² and Q²
should not exceed 0.3. An over-fitted model may predict the activity of training set compounds very well, but its predictivity for
new compounds is compromised (Veerasamy et al., 2011).
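
A minimal sketch of an LMO cycle and of the R² versus Q² check is given below. It assumes a one-descriptor least-squares model, a random but seeded split into equal-sized groups, and hypothetical data; the group count and the seed are illustrative choices, not values from the cited studies.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for one descriptor: returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def fitted_r2(x, y):
    """R2 of the model fitted on all compounds."""
    slope, intercept = fit_line(x, y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return 1 - ss_res / sum((b - mean(y)) ** 2 for b in y)

def lmo_q2(x, y, n_groups=4, seed=0):
    """Leave each group out once; every compound is predicted exactly one time."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    groups = [indices[g::n_groups] for g in range(n_groups)]
    press = 0.0
    for left_out in groups:
        held = set(left_out)
        kept = [i for i in indices if i not in held]
        slope, intercept = fit_line([x[i] for i in kept], [y[i] for i in kept])
        press += sum((y[i] - (slope * x[i] + intercept)) ** 2 for i in left_out)
    return 1 - press / sum((v - mean(y)) ** 2 for v in y)

# Hypothetical descriptor values (x) and activities (y).
x = [1.0, 1.5, 2.1, 2.8, 3.3, 3.9, 4.6, 5.0]
y = [4.2, 4.9, 5.3, 6.1, 6.4, 7.2, 7.6, 8.1]

r2, q2 = fitted_r2(x, y), lmo_q2(x, y)
overfit_suspected = (r2 - q2) > 0.3  # threshold quoted in the text
print(f"R2 = {r2:.3f}, LMO-Q2 = {q2:.3f}, over-fitting suspected = {overfit_suspected}")
```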

True Q² and rₘ² metrics


True Q², proposed by Hawkins et al. (2003), is used for small data sets, and the rₘ² metric is calculated based on the scaled
values of observed and predicted activity. Q² should not be treated as ultimate proof of good predictability of a model. A value
of Q² higher than 0.5 should not be interpreted as high predictive power of a QSAR model until the ability of the model to predict
the activity of a large number of compounds not used for training has been tested (Golbraikh and Tropsha, 2002). Other metrics
used for internal validation include true rₘ²(LOO) and Y-randomization, a test for chance correlation. The true rₘ²(LOO) metric
reveals external validation characteristics, as its value is derived from the model developed after repetitive cycles of LOO. The
Y-randomization test involves process randomization and model randomization, and validates the model by permuting the response
values with respect to the unaltered descriptor matrix (Roy et al., 2015b).
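
The response-permutation step of Y-randomization can be sketched as follows. This illustration covers only the permutation of activity values against an unchanged descriptor; it assumes a one-descriptor least-squares model, a hypothetical data set, and an arbitrary number of trials and seed.

```python
import random
from statistics import mean

def r_squared_of_fit(x, y):
    """R2 of a one-descriptor least-squares fit (squared Pearson correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def y_randomization(x, y, n_trials=100, seed=0):
    """Refit the model on shuffled responses while the descriptor stays unchanged."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scores.append(r_squared_of_fit(x, shuffled))
    return scores

# Hypothetical descriptor values (x) and activities (y).
x = [1.0, 1.5, 2.1, 2.8, 3.3, 3.9, 4.6, 5.0]
y = [4.2, 4.9, 5.3, 6.1, 6.4, 7.2, 7.6, 8.1]

true_r2 = r_squared_of_fit(x, y)
random_r2 = y_randomization(x, y)
print(f"true R2 = {true_r2:.3f}, mean randomized R2 = {mean(random_r2):.3f}")
```

If the randomized models score nearly as well as the original one, the original correlation is likely to have arisen by chance.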

Validation metrics for external validation


The validation metrics employed in external validation are as follows:

a) Predictive R², which can also be given as Q²(F1), describes the correlation between observed and predicted data. A model is
said to have good predictive power if the value of Q²(F1) is greater than 0.5 (Roy et al., 2015b).
b) Q²(F2) and Q²(F3) use the mean of the test data set and the training data set, respectively (Schüürmann et al., 2008). For
validation of a QSAR model, a threshold value of 0.5 is defined for both metrics (Roy et al., 2015b).
c) Golbraikh and Tropsha's criteria put forth conditions for the selection of training and test data sets. To have good predictive
power, a QSAR model should satisfy the following conditions (Golbraikh and Tropsha, 2002; Veerasamy et al., 2011):
i. Q²training > 0.5
ii. R²test > 0.6
iii. (r² − r₀²)/r² < 0.1 or (r² − r′₀²)/r² < 0.1, where r₀² is the R² of predicted vs. observed activities and r′₀² is the R² of
observed vs. predicted activities.
iv. 0.85 ≤ k ≤ 1.15 or 0.85 ≤ k′ ≤ 1.15, where k and k′ are the slopes of the regression lines through the origin.
d) Other metrics include the Root Mean Square Error of Prediction (RMSEP), which measures the prediction error of the QSAR model
(Roy et al., 2015b); the Concordance Correlation Coefficient (CCC), the most restrictive and precautionary measure, with an ideal
value of 1 (Chirico and Gramatica, 2011); rₘ²(rank), which assesses rank-order predictions (Roy et al., 2015b); and rₘ²(test),
which describes the relationship between observed and predicted values (Roy and Mitra, 2011; Roy et al., 2015b).
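
Several of the external metrics listed above can be sketched as follows. The formulas are assumptions drawn from their commonly published forms, and the labelling of the two through-origin slopes as k (observed vs. predicted) and k′ (predicted vs. observed) is likewise assumed; all data are hypothetical.

```python
from math import sqrt
from statistics import mean

def sum_sq(values, center):
    """Sum of squared deviations of values from a given center."""
    return sum((v - center) ** 2 for v in values)

def external_metrics(y_train, y_test, y_pred):
    """External validation metrics for test-set observations and predictions."""
    n_test = len(y_test)
    press = sum((o - p) ** 2 for o, p in zip(y_test, y_pred))
    train_mean, test_mean, pred_mean = mean(y_train), mean(y_test), mean(y_pred)
    covariance = sum((o - test_mean) * (p - pred_mean) for o, p in zip(y_test, y_pred))
    cross = sum(o * p for o, p in zip(y_test, y_pred))
    return {
        # F1 uses the training-set mean, F2 the test-set mean.
        "q2_f1": 1 - press / sum_sq(y_test, train_mean),
        "q2_f2": 1 - press / sum_sq(y_test, test_mean),
        # F3 compares the mean squared test error with the training-set variance.
        "q2_f3": 1 - (press / n_test) / (sum_sq(y_train, train_mean) / len(y_train)),
        "rmsep": sqrt(press / n_test),
        "ccc": 2 * covariance / (sum_sq(y_test, test_mean) + sum_sq(y_pred, pred_mean)
                                 + n_test * (test_mean - pred_mean) ** 2),
        # Slopes of the regression lines forced through the origin.
        "k": cross / sum(p * p for p in y_pred),
        "k_prime": cross / sum(o * o for o in y_test),
    }

# Hypothetical training activities, test activities and test predictions.
y_train = [4.2, 4.9, 5.3, 6.1, 6.4, 7.2, 7.6, 8.1]
y_test = [4.6, 5.8, 6.9, 7.9]
y_pred = [4.8, 5.6, 7.1, 7.6]

metrics = external_metrics(y_train, y_test, y_pred)
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```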

Validation Metrics for Classification-Based Methods


The validation metric employed in classification-based methods is the Wilks' lambda (λ) statistic. It is used to test the
significance of the discriminant model function and is calculated as the ratio of the within-category sum of squares to the total
dispersion. Its value lies in the range 0 < λ < 1, with a lower value corresponding to a higher level of discrimination. Further,
the canonical index (Rc) is used to estimate the strength of the relationship between the dependent and independent variables;
chi-square (χ²) is used to check the quality of the classification-based model; and the squared Mahalanobis distance is a measure
calculated using random data points (Roy and Mitra, 2011; Roy et al., 2015b).
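
The ratio behind Wilks' lambda can be illustrated for the simplest case of a single discriminant score per compound. This is a simplified sketch with hypothetical scores and class labels; with several variables the statistic is normally computed from the determinants of the within-group and total dispersion matrices.

```python
from statistics import mean

def wilks_lambda_univariate(groups):
    """Within-category sum of squares divided by the total sum of squares.

    `groups` maps a class label to the list of discriminant scores in that class.
    """
    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = mean(all_scores)
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    ss_within = sum(
        sum((s - mean(scores)) ** 2 for s in scores) for scores in groups.values()
    )
    return ss_within / ss_total

# Hypothetical discriminant scores for two well-separated classes.
separated = {"active": [1.0, 1.2, 0.9], "inactive": [5.0, 5.1, 4.8]}
# Hypothetical scores for two classes with identical distributions.
overlapping = {"active": [1.0, 3.0], "inactive": [1.0, 3.0]}

print(f"separated classes:   lambda = {wilks_lambda_univariate(separated):.4f}")
print(f"overlapping classes: lambda = {wilks_lambda_univariate(overlapping):.4f}")
```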

Interpretation & Applicability Domain Analysis

The parameters or descriptors used in the model should be interpretable. Mechanistic interpretation of the built QSAR model
helps in understanding the influence of the descriptors on the predicted activity. Applicability domain analysis helps to determine
whether the built QSAR model can be used for a given set of compounds. The applicability domain is defined by the theoretical
region in the chemical space of the descriptors and the modeled activity. It indicates whether prediction of activity or
response by the constructed QSAR model is feasible for a given set of compounds. A QSAR model prediction for a set of
compounds is therefore reliable only if the chemical space of those compounds falls within the applicability domain
of the compounds used for training the model (Roy et al., 2015b). The theoretical region in the chemical space is identified or
estimated by different methods. Applicability domain assessment is done through probability density distribution, geometrical
methods, distance-based methods, ranges in descriptor space when descriptor space is used, and the range of the response variable
when modeled response space is used (Jaworska et al., 2005).
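
Two of the simpler approaches, descriptor ranges and a distance-based check, can be sketched as follows. This is an illustration under stated assumptions: the descriptors (a logP-like value and a molecular weight) and the compounds are hypothetical, and the distance cut-off (the largest centroid distance seen in the training set) is one possible choice rather than a prescribed rule.

```python
from math import dist

def descriptor_ranges(training):
    """Minimum and maximum of each descriptor over the training compounds."""
    return [(min(column), max(column)) for column in zip(*training)]

def in_range_domain(compound, ranges):
    """True if every descriptor lies inside the training-set range."""
    return all(low <= value <= high for value, (low, high) in zip(compound, ranges))

def in_distance_domain(compound, training):
    """True if the compound is no farther from the training centroid than the
    most distant training compound (Euclidean distance, unscaled descriptors)."""
    centroid = [sum(column) / len(column) for column in zip(*training)]
    cutoff = max(dist(row, centroid) for row in training)
    return dist(compound, centroid) <= cutoff

# Hypothetical training compounds: [logP-like descriptor, molecular weight].
training = [[1.0, 200.0], [2.0, 250.0], [3.0, 300.0], [2.5, 280.0]]
ranges = descriptor_ranges(training)

inside = [2.2, 260.0]
outside = [6.0, 520.0]
for compound in (inside, outside):
    print(compound, in_range_domain(compound, ranges), in_distance_domain(compound, training))
```

In practice descriptors are usually scaled before distances are computed, since otherwise the descriptor with the largest numerical range dominates the result.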

Multidimensional QSAR

QSAR started with 0D and has evolved to 6D. Each dimension arose from the need to overcome the limitations of the previous
ones. 1D QSAR calculates molecular properties such as electronic, hydrophobic and steric properties. 2D QSAR considers
geometric parameters, topological indices, molecular fingerprints and polar surface area, but it excludes steric properties. The
3D QSAR technique focuses on the spatial properties of the compound, so the drawbacks of 2D QSAR are addressed in the 3D QSAR
method. The descriptor methods used in 3D QSAR are either alignment-dependent or alignment-independent. The
alignment-dependent methods include Comparative Molecular Field Analysis (CoMFA), Comparative Molecular Similarity Indices
Analysis (CoMSIA), Comparative Binding Energy Analysis (CoMBINE), Comparative Residue Interaction Analysis (CoRIA), Hint
Interaction Field Analysis (HIFA) and so forth. The alignment-independent methods include Comparative Molecular Moment
Analysis (CoMMA), Comparative Spectral Analysis (CoSA), and Hologram QSAR (HQSAR). 3D QSAR employs methods such as
Artificial Neural Networks (ANN), Partial Least Squares (PLS), cluster analysis, principal component analysis and others for
descriptor selection, which makes it more powerful than 2D QSAR. Even then, the 3D QSAR technique faces difficulties when the
data set contains a large number of compounds, and prediction accuracy is compromised. To overcome these drawbacks, 4D-QSAR
evolved. With its fourth dimension of ensemble sampling, 4D-QSAR addresses the issues of 3D QSAR. It includes descriptors for
grid occupancy measures. 4D-QSAR can be applied to both receptor-independent and receptor-dependent analysis (Hopfinger et al.,
1997). Further, 5D-QSAR evolved by adding a new dimension to 4D-QSAR, namely multiple ligand topology representations. The
ensembles of multiple representations deployed in 5D-QSAR make this approach less biased than 4D-QSAR (Vedani and Dobler,
2002). 6D-QSAR improves on the 5D-QSAR strategy by including another dimension for the solvation function, which helps in
analyzing different solvation models (Polanski, 2009).

QSAR in Drug Designing and Discovery

Drug design and discovery is a laborious and time-consuming process. It takes roughly 10–12 years for a molecule to be identified
and approved as a drug. Most drugs fail during pre-clinical and clinical trials, and the cost of drug discovery is very high.
Determining drug candidates through QSAR studies at an early stage would reduce the cost of production and the failure rate. QSAR
approaches help in identifying hits from a large library of compounds. The identified hit molecules can be purchased and studied
for activity through experiments (Tang et al., 2009; Montero-Torres et al., 2006). The molecules with proven activity can be further
optimized to design promising drug candidates. QSAR studies thereby avoid the synthesis and testing of a large number of
compounds, saving enormous time and cost.

Case Studies

A number of success stories reflecting the potential of QSAR in building reliable models have been reported in the literature.
Here we discuss a few studies that dealt with a wide range of problems to give insight into the applications of the QSAR approach.
The first example is a QSAR model for designing better drugs to combat cancer. Apoptosis is an important
process that can decide the fate of a cell. Malfunctioning of this process, with atypical expression of B-cell lymphoma-2 (Bcl-2)
anti-apoptotic proteins, is a hallmark of cancer and in most cases results in resistance to chemotherapy and radiotherapy.
Multiple approaches, including antisense oligonucleotides (ASOs), peptides and small-molecule inhibitory compounds, have been
designed against these anti-apoptotic proteins. However, low cost and easy delivery make small molecules the method of choice. To
develop better compounds on the basis of available information, various QSAR models have been generated, but these models are
limited to a single scaffold. In one such study, an attempt was made to develop QSAR models for seven different classes of
inhibitors reported in the literature targeting Bcl-2 and Bcl-xL (Kanakaveti et al., 2017). A total of 453 such small-molecule
inhibitors with known IC50 (ranging from 1 nM to 100 μM) were grouped into seven categories, comprising Apogossypol (89
compounds), Quinazoline thione (51 compounds), Pyrazole pyrimidine phenyl acyl (110 compounds), Quinolone (56 compounds),
Thiomorpholine (42 compounds), Benzothiazole hydrazine (78 compounds) and Polyquinoline (27 compounds), depending upon the core
structure. A total of 787 features (including constitutional, topological, electrostatic, geometrical and physicochemical
descriptors) were tested. The most relevant descriptors were chosen and fitted using multiple linear regression analysis. Two- to
three-parameter models were generated for all the different classes of compounds. (n − 1) and (n − 10) leave-out cross validations
were used to evaluate the performance of the generated models. The QSAR analysis resulted in models with Pearson correlation
coefficients ranging from 0.95 to 0.985. Three known inhibitors, ABT-199, Navitoclax and Sabutoclax, were tested against the
generated models, and the predicted IC50 was comparable to the reported activity. A correlation between pIC50 and pKi (−log Ki)
was delineated, and commonalities were found in the activity shown by the seven families with respect to structural disparities
using an
approach called similarity–descriptor coupling. The study has been translated into a user-friendly webtool to predict a pan or
specific inhibitor for Bcl-2 and Bcl-xL targets (Kanakaveti et al., 2017).
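
The logarithmic activity scale mentioned above can be sketched as follows. This is a generic illustration of the definition pIC50 = −log10(IC50 in mol/L), written for concentrations given in nanomolar; the example concentrations are hypothetical and are not data from the cited study.

```python
from math import log10

def pic50_from_nanomolar(ic50_nm):
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in mol/L).

    Since 1 nM = 1e-9 mol/L, -log10(ic50_nm * 1e-9) equals 9 - log10(ic50_nm).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - log10(ic50_nm)

# Hypothetical inhibitor concentrations in nM.
for ic50_nm in (1.0, 250.0, 100_000.0):
    print(f"IC50 = {ic50_nm:>9.1f} nM  ->  pIC50 = {pic50_from_nanomolar(ic50_nm):.2f}")
```

The same conversion applies to pKi when an inhibition constant Ki is used in place of IC50. Working on this scale means that more potent compounds have larger values and that activities spanning several orders of magnitude can be fitted by a linear model.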
The second example illustrates how pharmacokinetic properties of drug molecules can be predicted using QSAR modeling.
Penetration through the blood-brain barrier (BBB) is one of the most important features reflecting the drug-likeness of
small-molecule compounds. Drugs designed for the central nervous system should be able to cross this barrier; however, penetration
of the BBB by peripherally acting drugs should be minimized to avoid side effects. Estimating the pharmacokinetic properties of
candidate compounds in silico before testing them experimentally can therefore save enormous time and money. In this study, a set
of 529 organic compounds was used to correlate chemical structures with the distribution coefficients between the brain and blood
using artificial neural networks. The molecular structures were defined in terms of the number of occurrences of various types of
fragments containing up to 10 atoms. Of all the fragment descriptors, the most important ones were selected using stepwise
multiple linear regression. The best model was based on fragments comprising up to nine atoms. The model was further
validated using a dataset of 2053 compounds categorized as BBB+ (penetrating) and BBB− (non-penetrating). The
constructed model correctly classified 90% of the BBB+ compounds from a test set; however, the prediction specificity for the BBB−
category was low. Features with the strongest positive effects on LogBB included hydrophobic fragments such as alkyl or aromatic
groups and the presence of a hydroxyl group in the α-position to the carbonyl group. On the other hand, permeability
decreased if the structure contained strongly polar groups such as hydroxyl, carboxyl, and guanidine. This model has been
integrated into a web service for predicting ADMET parameters of drugs, developed in the Laboratory of Medicinal Chemistry,
Department of Chemistry, Lomonosov Moscow State University (Dyabina et al., 2016).
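
The distinction drawn above between the share of penetrating compounds recognized and the specificity for the non-penetrating class can be sketched as follows. The labels and counts are a hypothetical toy example and do not reproduce the data of the cited study.

```python
def sensitivity_specificity(actual, predicted, positive="BBB+"):
    """Fraction of positives recovered and fraction of negatives recovered."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    true_neg = sum(a != positive and p != positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    return true_pos / (true_pos + false_neg), true_neg / (true_neg + false_pos)

# Hypothetical test set: 10 penetrating and 5 non-penetrating compounds.
actual = ["BBB+"] * 10 + ["BBB-"] * 5
predicted = ["BBB+"] * 9 + ["BBB-"] + ["BBB+"] * 3 + ["BBB-"] * 2

sensitivity, specificity = sensitivity_specificity(actual, predicted)
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

A classifier can therefore recover most penetrating compounds while still mislabeling many non-penetrating ones, which is why both figures are usually reported together.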
With increasing industrialization, degradation of pollutants has become crucial for keeping the environment clean. Benzene
and its derivatives are the most common chemical structures found in nature. They are also structural parts of the degradation
intermediates of complex pollutants such as pesticides, pharmaceuticals, surfactants or synthetic dyes. However, the chemical
properties deciding their environmental fate and behavior depend on the type, number and position of the functional groups present
at substitution sites. Here QSAR proved to be a fast and reliable approach for testing the degradation of aromatic compounds.
Thirty-six congeneric single-benzene-ring compounds with known biodegradability were divided into training and test sets (24 and
12 compounds, respectively). Semi-empirical quantum-chemical descriptors such as dipole moment, energy of the highest
occupied molecular orbital (EHOMO), energy of the lowest unoccupied molecular orbital (ELUMO), the energy difference between
EHOMO and ELUMO, final heat of formation and ionization potential, along with various other molecular descriptors, were
calculated using various software packages. The correlation between the descriptors and biodegradability prior to and at
half-life was obtained using Genetic Algorithm variable selection and Multiple Linear Regression Analysis. The best-selected
models, chosen on the basis of statistical parameters, were validated using Leave-Many-Out and “Y-scrambling” tests. The generated
models were then used to study the key structural features that influence the biodegradability of the compound of interest and to
correlate them with its degradation mechanisms by UV-C/H2O2. Molecular mass, the number of C-C bonds (which determines the degree
of saturation of the benzene ring and makes it susceptible to cleavage into more readily degradable aliphatic compounds), electron
donating/withdrawing groups, symmetry of the molecule, presence of a sulfo group, ionization potential and electrotopological
states were some of the important descriptors related to the biodegradability of aromatic compounds in water. Such models can be
extended to broader purposes such as risk assessment studies (Cvetnic et al., 2017).
In the following example, structural factors of ionic liquids (ILs) were correlated with the process of micelization. The Critical
Micelization Concentration (CMC) of an ionic liquid depends upon the types of ions it possesses. Micelization affects the synthesis,
purification and regeneration routes of these ionic liquids and is thus an important feature to consider. In this study, an attempt
was made to derive a qualitative relationship between the structural features of ions and their effect on the micelization of ILs.
It was also examined whether the micelization process is governed by the constituent ions separately or whether they have additive
effects. The literature was explored to collect experimental CMC data for ILs. A dataset of 59 structurally diverse ILs with CMC
values ranging between 0.098 and 902 mM was prepared. Of these, 42 compounds were used to train the model, while the rest were used
for testing. Various molecular descriptors were generated using DRAGON software, and the most appropriate ones were chosen using
the genetic algorithm implemented in QSARINS software. The correlation between the descriptors and CMCs was derived using the
Multiple Linear Regression (MLR) technique. Various statistical parameters such as the determination coefficient, Concordance
Correlation Coefficient, Root Mean Square Error, Mean Absolute Error and F-value were used to evaluate the fitting of the model and
its significance. The leave-one-out and Y-scrambling approaches were used for internal validation and for investigating the
robustness of the generated model. An entirely new test data set was used for external validation. A decrease in CMC was
attributed to less spherical, improperly folded cations containing a larger hydrophobic domain. For anions, a bigger size was
associated with a decrease in CMC. Also, the effects of cations and anions in determining the CMC were found to be independent of
each other (Barycki et al., 2017).
People today have become more careful about their eating habits in order to adopt a healthy lifestyle. They avoid overconsumption
of high-calorie food to reduce the risk of various metabolic disorders, cardiovascular diseases, obesity and diabetes. Sugars or
saccharides are major contributors to these conditions. Industry is focusing on finding new compounds, natural or synthetic, with
low calories but high sweetness. There are various QSAR models in the literature that help in extracting the structural features
responsible for imparting sweetness to compounds, or in distinguishing between sweet and non-sweet compounds. A study
was carried out to virtually screen known natural compounds using QSAR modeling for the identification of new sweeteners
(Chéron et al., 2017). A database of 316 compounds belonging to seventeen chemical families, with sweetness values ranging from
0.20 to 225,000, was created. The sweetness index for all the compounds was relative to sucrose. The protonation state of the
compounds was adjusted according to the pH value of saliva, i.e., 6.5. Descriptors were calculated for both 2D and 3D structures of
compounds using Dragon software. Two machine learning algorithms, Random Forest and Support Vector Regression, were used to
generate the models. The training set consisted of 225 molecules, while the test set had 91 molecules. Leave-one-out
cross-validation was used for both 2D and 3D QSAR models. Additional filters were applied to avoid undesirable properties such as
bitterness and toxicity. The majority of the identified natural sweeteners were from the terpene family, while fewer than 200
molecules belonged to the categories of saccharides, polyphenols or phenylpropanoids. The most potent natural sweeteners were
based on saponin and stevioside scaffolds, with 1000–10,000 times more sweetness than sucrose (Chéron et al., 2017).

References

Achary, P., 2014. QSPR modelling of dielectric constants of π-conjugated organic compounds by means of the CORAL software. SAR and QSAR in Environmental Research 25,
507–526.
Agresti, A., 2007. An Introduction to Categorical Data Analysis. John Wiley.
Andrews, C.W., Bennett, L., Lawrence, X.Y., 2000. Predicting human oral bioavailability of a compound: Development of a novel quantitative structure-bioavailability
relationship. Pharmaceutical Research 17, 639–644.
Balaban, A.T., 1982. Highly discriminating distance-based topological index. Chemical Physics Letters 89, 399–404.
Barber, C.E., Marshall, D.A., Mosher, D.P., et al., 2016. Development of system-level performance measures for evaluation of models of care for inflammatory arthritis in
Canada. The Journal of Rheumatology. 150839. [jrheum].
Barycki, M., Sosnowska, A., Puzyn, T., 2017. Which structural features stand behind micelization of ionic liquids? Quantitative structure-property relationship studies. Journal of
Colloid and Interface Science 487, 475–483.
Berinde, Z., 2013. A QSPR study of hydrophobicity of phenols and 2-(aryloxy-α-acetyl)-phenoxathiin derivatives using the topological index ZEP. Creative Mathematics and
Informatics 22, 33–40.
Bi, J., Bennett, K., Embrechts, M., Breneman, C., Song, M., 2003. Dimensionality reduction via sparse support vector machines. Journal of Machine Learning Research 3,
1229–1243.
Boik, J.C., Newman, R.A., 2008. Structure-activity models of oral clearance, cytotoxicity, and LD50: A screen for promising anticancer compounds. BMC Pharmacology 8, 12.
Braga, R.C., Andrade, C.H., 2012. QSAR and QM/MM approaches applied to drug metabolism prediction. Mini Reviews in Medicinal Chemistry 12, 573–582.
Buolamwini, J.K., Assefa, H., 2002. CoMFA and CoMSIA 3D QSAR and docking studies on conformationally-restrained cinnamoyl HIV-1 integrase inhibitors: Exploration of a
binding mode at the active site. Journal of Medicinal Chemistry 45, 841–852.
Burbidge, R., Trotter, M., Buxton, B., Holden, S., 2001. Drug design by machine learning: Support vector machines for pharmaceutical data analysis. Computers & Chemistry
26, 5–14.
Can, A., 2014. Quantitative structure–toxicity relationship (QSTR) studies on the organophosphate insecticides. Toxicology Letters 230, 434–443.
Chéron, J.-B., Casciuc, I., Golebiowski, J., Antonczak, S., Fiorucci, S., 2017. Sweetness prediction of natural compounds. Food Chemistry 221, 1421–1425.
Chirico, N., Gramatica, P., 2011. Real external predictivity of QSAR models: How to evaluate it? Comparison of different validation criteria and proposal of using the
concordance correlation coefficient. Journal of Chemical Information and Modeling 51, 2320–2335.
Cocu, A., Dumitriu, L., Craciun, M., Segal, C., 2008. A Hybrid Approach for Data Preprocessing in the QSAR Problem. Knowledge-Based Intelligent Information and
Engineering Systems. Springer. pp. 565–572.
Cramer, R.D., 1993. Partial least squares (PLS): Its strengths and limitations. Perspectives in Drug Discovery and Design 1, 269–278.
Crum-Brown, A., Fraser, T., 1868. On the connection between chemical constitution and physiological action. Part 1. On the physiological action of the ammonium bases,
derived from Strychia, Brucia, Thebaia, Codeia, Morphia and Nicotia. Transactions of the Royal Society of Edinburgh 25, 151–203.
Cvetnic, M., Perisic, D.J., Kovacic, M., et al., 2017. Prediction of biodegradability of aromatics in water using QSAR modeling. Ecotoxicology and Environmental Safety 139,
139–149.
Dai, Y.-M., Zhu, Z.-P., Cao, Z., et al., 2013. Prediction of boiling points of organic compounds by QSPR tools. Journal of Molecular Graphics and Modelling 44, 113–119.
Damale, M.G., Harke, S.N., Kalam Khan, F.A., Shinde, D.B., Sangshetti, J.N., 2014. Recent advances in multidimensional QSAR (4D-6D): A critical review. Mini Reviews in
Medicinal Chemistry 14, 35–55.
Dastmalchi, S., Hamzeh-Mivehroud, M., Asadpour-Zeynali, K., 2012. Comparison of different 2D and 3D-QSAR methods on activity prediction of histamine H3 receptor
antagonists. Iranian Journal of Pharmaceutical Research: IJPR 11, 97.
Demel, M.A., Janecek, A.G., Gansterer, W.N., Ecker, G.F., 2009. Comparison of contemporary feature selection algorithms: Application to the classification of ABC‐transporter
substrates. Molecular Informatics 28, 1087–1091.
Dioury, F., Duprat, A., Dreyfus, G.R., Ferroud, C., Cossy, J., 2014. QSPR prediction of the stability constants of gadolinium (III) complexes for magnetic resonance imaging.
Journal of Chemical Information and Modeling 54, 2718–2731.
Du, H., Hu, Z., Bazzoli, A., Zhang, Y., 2011. Prediction of inhibitory activity of epidermal growth factor receptor inhibitors using grid search-projection pursuit regression
method. PLOS ONE 6, e22367.
Duchowicz, P.R., Castro, E.A., 2009. QSPR studies on aqueous solubilities of drug-like compounds. International Journal of Molecular Sciences 10, 2558–2577.
Duchowicz, P.R., Castro, E.A., Fernandez, F., Pankratov, A., 2006. QSPR evaluation of thermodynamic properties of acyclic and aromatic compounds. Anales de la Asociación
Química Argentina. SciELO Argentina. 31–45.
Dyabina, A., Radchenko, E., Palyulin, V., Zefirov, N., 2016. Prediction of blood-brain barrier permeability of organic compounds. Doklady Biochemistry and Biophysics.
Springer. pp. 371–374.
Escher, B.I., Bramaz, N., Richter, M., Lienert, J., 2006. Comparative ecotoxicological hazard assessment of beta-blockers and their human metabolites using a mode-of-action-
based test battery and a QSAR approach. Environmental Science & Technology 40, 7402–7408.
Ferguson, J., 1939. The use of chemical potentials as indices of toxicity. Proceedings of the Royal Society of London Series B, Biological Sciences 127, 387–404.
Fjodorova, N., Vračko, M., Novič, M., Roncaglioni, A., Benfenati, E., 2010. New public QSAR model for carcinogenicity. Chemistry Central Journal. Springer. p. S3.
Free, S.M., Wilson, J.W., 1964. A mathematical contribution to structure-activity studies. Journal of Medicinal Chemistry 7, 395–399.
Fujikawa, M., Nakao, K., Shimizu, R., Akamatsu, M., 2007. QSAR study on permeability of hydrophobic compounds with artificial membranes. Bioorganic & Medicinal
Chemistry 15, 3756–3767.
Fujita, T., Ban, T., 1971. Structure-activity study of phenethylamines as substrates of biosynthetic enzymes of sympathetic transmitters. Journal of Medicinal Chemistry
14, 148–152.
Fu, W., Zhang, Y., Cheng, Z., 2010. Improved gene expression programming and its application to QSAR. In: Proceedings of the Sixth International Conference on Natural
Computation (ICNC), IEEE, 4057–4061.
Gao, H., Shanmugasundaram, V., Lee, P., 2002. Estimation of aqueous solubility of organic compounds with QSPR approach. Pharmaceutical Research 19, 497–503.
Garkani-Nejad, Z., Ghanbari, A., 2016. Application of support vector machine in QSAR study of triazolyl thiophenes as cyclin dependent kinase-5 inhibitors for their
anti-alzheimer activity.
Ghasemi, J.B., Ahmadi, S., Ayati, M., 2010. QSPR modeling of stability constants of the Li-hemispherands complexes using MLR: A theoretical host-guest study.
Macroheterocycles 3, 234–242.
Ghose, A.K., Viswanadhan, V.N., Wendoloski, J.J., 1999. A knowledge-based approach in designing combinatorial or medicinal chemistry libraries for drug discovery. 1.
A qualitative and quantitative characterization of known drug databases. Journal of Combinatorial Chemistry 1, 55–68.
Golbraikh, A., Tropsha, A., 2002. Beware of q²! Journal of Molecular Graphics and Modelling 20, 269–276.
Gombar, V.K., Hall, S.D., 2013. Quantitative structure–activity relationship models of clinical pharmacokinetics: Clearance and volume of distribution. Journal of Chemical
Information and Modeling 53, 948–957.
Goodarzi, M., Dejaegher, B., Heyden, Y.V., 2012. Feature selection methods in QSAR studies. Journal of AOAC International 95, 636–651.
Goryński, K., Bojko, B., Nowaczyk, A., et al., 2013. Quantitative structure–retention relationships models for prediction of high performance liquid chromatography retention time
of small molecules: Endogenous metabolites and banned compounds. Analytica Chimica Acta 797, 13–19.
Gozalbes, R., Jacewicz, M., Annand, R., Tsaioun, K., Pineda-Lucena, A., 2011. QSAR-based permeability model for drug-like compounds. Bioorganic & Medicinal Chemistry
19, 2615–2624.
Grisoni, F., Consonni, V., Vighi, M., Villa, S., Todeschini, R., 2016. Investigating the mechanisms of bioconcentration through QSAR classification trees. Environment
International 88, 198–205.
Guha, R., Dutta, D., Jurs, P.C., Chen, T., 2006. Local lazy regression: Making use of the neighborhood to improve QSAR predictions. Journal of Chemical Information and
Modeling 46, 1836–1847.
Gutman, I., Trinajstić, N., 1972. Graph theory and molecular orbitals. Total φ-electron energy of alternant hydrocarbons. Chemical Physics Letters 17, 535–538.
Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Machine Learning 46, 389–422.
Hammett, L.P., 1937. The effect of structure upon the reactions of organic compounds. Benzene derivatives. Journal of the American Chemical Society 59, 96–103.
Hansch, C., Fujita, T., 1964. ρ-σ-π Analysis. A method for the correlation of biological activity and chemical structure. Journal of the American Chemical Society 86,
1616–1626.
Harrell, F., 2001. Regression modeling strategies. Nashville: Springer.
Hawkins, D.M., Basak, S.C., Mills, D., 2003. Assessing model fit by cross-validation. Journal of Chemical Information and Computer Sciences 43, 579–586.
Hemmateenejad, B., Sanchooli, M., Mehdipour, A., 2009. Quantitative structure–reactivity relationship studies on the catalyzed Michael addition reactions. Journal of Physical
Organic Chemistry 22, 613–618.
Hermens, J., Canton, H., Janssen, P., De Jong, R., 1984. Quantitative structure-activity relationships and toxicity studies of mixtures of chemicals with anaesthetic potency:
Acute lethal and sublethal toxicity to Daphnia magna. Aquatic Toxicology 5, 143–154.
Hopfinger, A., Wang, S., Tokarski, J.S., et al., 1997. Construction of 3D-QSAR models using the 4D-QSAR analysis formalism. Journal of the American Chemical Society 119,
10509–10524.
Ivanciuc, O., 2000. QSAR comparative study of Wiener descriptors for weighted molecular graphs. Journal of Chemical Information and Computer Sciences 40, 1412–1422.
Jaworska, J., Nikolova-Jeliazkova, N., Aldenberg, T., 2005. QSAR applicability domain estimation by projection of the training set descriptor space: A review. ATLA-
NOTTINGHAM- 33, 445.
Johnels, D., Gillner, M., Nordén, B., Toftgård, R., Gustafsson, J.Å., 1989. Quantitative structure‐activity relationship (QSAR) analysis using the partial least squares (PLS)
method: The binding of polycyclic aromatic hydrocarbons (PAH) to the rat liver 2, 3, 7, 8–tetrachlorodibenzo‐P‐dioxin (TCDD) receptor. Molecular Informatics 8, 83–89.
Kanakaveti, V., Sakthivel, R., Rayala, S., Gromiha, M.M., 2017. Importance of functional groups in predicting the activity of small molecule inhibitors for Bcl‐2 and Bcl‐xL.
Chemical Biology & Drug Design.
Kar, S., Roy, K., 2011. Development and validation of a robust QSAR model for prediction of carcinogenicity of drugs. Indian Journal of Biochemistry & Biophysics 48 (2),
Karacan, M.S., Yakan, Ç., Yakan, M., et al., 2012. Quantitative structure–activity relationship analysis of perfluoroiso-propyldinitrobenzene derivatives known as photosystem II
electron transfer inhibitors. Biochimica et Biophysica Acta (BBA)-Bioenergetics 1817, 1229–1236.
Katritzky, A.R., Lomaka, A., Petrukhin, R., et al., 2002. QSPR correlation of the melting point for pyridinium bromides, potential ionic liquids. Journal of Chemical Information
and Computer Sciences 42, 71–74.
Khorshidi, N., Sarkhosh, M., Niazi, A., 2014. QSPR study of maximum absorption wavelength of various flavones by multivariate image analysis and principal components-least
squares support vector machine. Journal of Scientific and Innovative Research 3, 189–202.
Kier, L.B., 1985. A shape index from molecular graphs. Molecular Informatics 4, 109–116.
Kier, L.B., Hall, L.H., Murray, W.J., Randić, M., 1975. Molecular connectivity I: Relationship to nonspecific local anesthesia. Journal of Pharmaceutical Sciences 64, 1971–1974.
Kumar, A., Chauhan, S., 2017. Monte Carlo method based QSAR modelling of natural lipase inhibitors using hybrid optimal descriptors. SAR and QSAR in Environmental
Research 28, 179–197.
Lei, B., Ma, Y., Li, J., et al., 2010. Prediction of the adsorption capability onto activated carbon of a large data set of chemicals by local lazy regression method. Atmospheric
Environment 44, 2954–2960.
Levatić, J., Džeroski, S., Supek, F., Šmuc, T., 2013. Semi-supervised learning for quantitative structure-activity modeling. Informatica 37, 173.
Lewis, D.F., 2000. Structural characteristics of human P450s involved in drug metabolism: QSARs and lipophilicity profiles. Toxicology 144, 197–203.
Li, X., Shi, Y., 2008. A survey on the Randic index. MATCH Communications in Mathematical and in Computer Chemistry 59, 127–156.
Liu, P., Long, W., 2009. Current mathematical methods used in QSAR/QSPR studies. International Journal of Molecular Sciences 10, 1978–1998.
Liu, S.-S., Liu, H.-L., Yin, C.-S., Wang, L.-S., 2003. VSMP: A novel variable selection and modeling method based on the prediction. Journal of Chemical Information and
Computer Sciences 43, 964–969.
Liu, Y., 2004. A comparative study on feature selection methods for drug discovery. Journal of Chemical Information and Computer Sciences 44, 1823–1828.
Lorber, A., Wangen, L.E., Kowalski, B.R., 1987. A theoretical foundation for the PLS algorithm. Journal of Chemometrics 1, 19–31.
Luo, J., Hu, J., Fu, L., Liu, C., Jin, X., 2011. Use of artificial neural network for a QSAR study on neurotrophic activities of N-p-tolyl/phenylsulfonyl L-amino acid thiolester
derivatives. Procedia Engineering 15, 5158–5163.
Magdziarz, T., Mazur, P., Polanski, J., 2009. Receptor independent and receptor dependent CoMSA modeling with IVE-PLS: Application to CBG benchmark steroids and
reductase activators. Journal of Molecular Modeling 15, 41–51.
Manga, N.N., Duffy, J.C., Rowe, P.H., Cronin, M.T., 2003. A hierarchical QSAR model for urinary excretion of drugs in humans as a predictive tool for biotransformation.
Molecular Informatics 22, 263–273.
Mayer, J.M., Van De Waterbeemd, H., 1985. Development of quantitative structure-pharmacokinetic relationships. Environmental Health Perspectives 61, 295.
Meyer, H., 1899. Zur theorie der alkoholnarkose. Naunyn-Schmiedeberg's Archives of Pharmacology 42, 109–118.
Mirkhani, S.A., Gharagheizi, F., Sattari, M., 2012. A QSPR model for prediction of diffusion coefficient of non-electrolyte organic compounds in air at ambient condition.
Chemosphere 86, 959–966.
Modarresi, H., Dearden, J.C., Modarress, H., 2006. QSPR correlation of melting point for drug compounds based on different sources of molecular descriptors. Journal of
Chemical Information and Modeling 46, 930–936.
Montero-Torres, A., García-Sánchez, R.N., Marrero-Ponce, Y., et al., 2006. Non-stochastic quadratic fingerprints and LDA-based QSAR models in hit and lead generation
through virtual screening: Theoretical and experimental assessment of a promising method for the discovery of new antimalarial compounds. European Journal of Medicinal
Chemistry 41, 483–493.
Moss, G.P., Dearden, J.C., Patel, H., Cronin, M.T., 2002. Quantitative structure–permeability relationships (QSPRs) for percutaneous absorption. Toxicology in Vitro 16,
299–317.
Mota, S.G., Barros, T.F., Castilho, M.S., 2009. 2D QSAR studies on a series of bifonazole derivatives with antifungal activity. Journal of the Brazilian Chemical Society 20,
451–459.
Papa, E., Dearden, J., Gramatica, P., 2007. Linear QSAR regression models for the prediction of bioconcentration factors by physicochemical properties and structural
theoretical molecular descriptors. Chemosphere 67, 351–358.
Polanski, J., 2009. Receptor dependent multidimensional QSAR for modeling drug-receptor interactions. Current Medicinal Chemistry 16, 3243–3257.
Polishchuk, P.G., Muratov, E.N., Artemenko, A.G., et al., 2009. Application of random forest approach to QSAR prediction of aquatic toxicity. Journal of Chemical Information
and Modeling 49, 2481–2488.
Puri, S., Chickos, J.S., Welsh, W.J., 2002. Three-dimensional quantitative structure–property relationship (3D-QSPR) models for prediction of thermodynamic properties of
polychlorinated biphenyls (PCBs): Enthalpy of sublimation. Journal of Chemical Information and Computer Sciences 42, 109–116.
Qiao, L.S., Cai, Y.-L., He, Y.-S., et al., 2014. Trend of multi-scale QSAR in drug design. Asian Journal of Chemistry 26, 5917.
Randić, M., 2001. The connectivity index 25 years after. Journal of Molecular Graphics and Modelling 20, 19–35.
Richardson, B., 1869. Physiological research on alcohols. The Medical Times and Gazette. 703–706.
Richet, M., 1893. Note sur le rapport entre la toxicité et les propriétés physiques des corps. Compt Rend Soc Biol (Paris) 45, 775–776.
Rochani, A.K., Suma, B., Kumar, S., Jays, J., Madhavan, V., 2010. QSAR, ADME and QSTR studies of some synthesized anti-cancer 2-indolinone derivatives. International
Journal of Pharma and Bio Sciences 1, 208–218.
Roy, K., Kar, S., 2015. How to judge predictive quality of classification and regression based QSAR models. In: Haq, Z.U., Madura, J. (Eds.), Frontiers of Computational
Chemistry. Sharjah, UAE: Bentham Science, pp. 71–120.
Roy, K., Kar, S., Das, R.N., 2015a. A Primer on QSAR/QSPR Modeling: Fundamental Concepts. Springer.
Roy, K., Kar, S., Das, R.N., 2015b. Statistical methods in QSAR/QSPR. A Primer on QSAR/QSPR Modeling. 37–59.
Roy, K., Kar, S., Das, R.N., 2015c. Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment. Academic Press.
Roy, K., Mandal, A.S., 2008. Development of linear and nonlinear predictive QSAR models and their external validation using molecular similarity principle for anti-HIV indolyl
aryl sulfones. Journal of Enzyme Inhibition and Medicinal Chemistry 23, 980–995.
Roy, K., Mitra, I., 2011. On various metrics used for validation of predictive QSAR models with applications in virtual screening and focused library design. Combinatorial
Chemistry & High Throughput Screening 14, 450–474.
Roy, K., Roy, P., Leonard, J., 2008. On some aspects of validation of predictive QSAR models. Chemistry Central Journal 2, P9.
Schüürmann, G., Ebert, R.-U., Chen, J., Wang, B., Kühne, R., 2008. External validation and prediction employing the predictive squared correlation coefficient: Test set activity
mean vs training set activity mean. Journal of Chemical Information and Modeling 48, 2140–2145.
Settles, B., 2012. Active learning: Synthesis lectures on artificial intelligence and machine learning. Long Island, NY: Morgan & Claypool.
Shen, M., Béguin, C., Golbraikh, A., et al., 2004. Application of predictive QSAR models to database mining: Identification and experimental validation of novel anticonvulsant
compounds. Journal of Medicinal Chemistry 47, 2356–2364.
Shi, W., Zhang, X., Shen, Q., 2010. Quantitative structure-activity relationships studies of CCR5 inhibitors and toxicity of aromatic compounds using gene expression
programming. European Journal of Medicinal Chemistry 45, 49–54.
Si, H., Yuan, S., Zhang, K., et al., 2008. Quantitative structure activity relationship study on EC50 of anti-HIV drugs. Chemometrics and Intelligent Laboratory Systems 90, 15–24.
Si, H.Z., Wang, T., Zhang, K.J., De Hu, Z., Fan, B.T., 2006. QSAR study of 1,4-dihydropyridine calcium channel antagonists based on gene expression programming.
Bioorganic & Medicinal Chemistry 14, 4834–4841.
Simeon, S., Anuwongcharoen, N., Shoombuatong, W., et al., 2016. Probing the origins of human acetylcholinesterase inhibition via QSAR modeling and molecular docking.
PeerJ 4, e2322.
Singh, K., Verma, N., 2014. 3-Dimensional QSAR and Molecular Docking Studies of a Series of Indole Analogues as Inhibitors of Human Non-Pancreatic Secretory
Phospholipase A2.
Sola, D., Ferri, A., Banchero, M., Manna, L., Sicardi, S., 2008. QSPR prediction of N-boiling point and critical properties of organic compounds and comparison with a group-
contribution method. Fluid Phase Equilibria 263, 33–42.
Soltanpour, S., Shahbazy, M., Omidikia, N., Kompany-Zareh, M., Baharifard, M.T., 2016. A comprehensive QSPR model for dielectric constants of binary solvent mixtures. SAR
and QSAR in Environmental Research 27, 165–181.
Speck-Planche, A., Kleandrova, V.V., Luan, F., Cordeiro, M.N.D., 2012. Rational drug design for anti-cancer chemotherapy: Multi-target QSAR models for the in silico discovery
of anti-colorectal cancer agents. Bioorganic & Medicinal Chemistry 20, 4848–4855.
Suzuki, T., Ide, K., Ishida, M., Shapiro, S., 2001. Classification of environmental estrogens by physicochemical properties using principal component analysis and hierarchical
cluster analysis. Journal of Chemical Information and Computer Sciences 41, 718–726.
Taft Jr, R.W., 1952. Polar and steric substituent constants for aliphatic and o-Benzoate groups from rates of esterification and hydrolysis of esters. Journal of the American
Chemical Society 74, 3120–3128.
Taft Jr, R.W., 1953a. The general nature of the proportionality of polar effects of substituent groups in organic chemistry. Journal of the American Chemical Society 75,
4231–4238.
Taft Jr, R.W., 1953b. Linear steric energy relationships. Journal of the American Chemical Society 75, 4538–4539.
Tang, H., Wang, X.S., Huang, X.-P., et al., 2009. Novel inhibitors of human histone deacetylase (HDAC) identified by QSAR modeling of known inhibitors, virtual screening,
and experimental validation. Journal of Chemical Information and Modeling 49, 461–476.
Tomar, D., Agarwal, S., 2014. A survey on pre-processing and post-processing techniques in data mining. International Journal of Database Theory and Application 7, 99–128.
Toropov, A., Kudyshkin, V., Voropaeva, N., Ruban, I., Rashidova, S.S., 2004. QSPR modeling of the reactivity parameters of monomers in radical copolymerizations. Journal of
Structural Chemistry 45, 945–950.
Valencia, A., Prous, J., Mora, O., Sadrieh, N., Valerio, L.G., 2013. A novel QSAR model of Salmonella mutagenicity and its application in the safety assessment of drug
impurities. Toxicology and Applied Pharmacology 273, 427–434.
Van Gestel, C., Ma, W.-C., 1990. An approach to quantitative structure-activity relationships (QSARs) in earthworm toxicity studies. Chemosphere 21, 1023–1033.
Vedani, A., Dobler, M., 2002. 5D-QSAR: The key for simulating induced fit? Journal of Medicinal Chemistry 45, 2139–2149.
Veerasamy, R., Rajak, H., Jain, A., et al., 2011. Validation of QSAR models-strategies and importance. International Journal of Drug Design & Discovery 3, 511–519.
Veyseh, S., Hamzehali, H., Niazi, A., Ghasemi, J.B., 2015. Application of multivariate image analysis in QSPR study of pKa of various acids by principal components-least
squares support vector machine. Journal of the Chilean Chemical Society 60, 2985–2987.
Vieira, J.B., Braga, F.S., Lobato, C.C., et al., 2014. A QSAR, pharmacokinetic and toxicological study of new artemisinin compounds with anticancer activity. Molecules 19,
10670–10697.
Wang, D.-D., Feng, L.-L., He, G.-Y., Chen, H.-Q., 2014. QSAR studies for the acute toxicity of nitrobenzenes to the Tetrahymena pyriformis. Journal of the Serbian Chemical
Society 79, 1111–1125.
Wang, T., Si, H., Chen, P., Zhang, K., Yao, X., 2008. QSAR models for the dermal penetration of polycyclic aromatic hydrocarbons based on Gene Expression Programming.
Molecular Informatics 27, 913–921.
Wesley, L., Veerapaneni, S., Desai, R., et al., 2016. 3D-QSAR and SVM Prediction of BRAF-V600E and HIV integrase inhibitors: A comparative study and characterization of
performance with a new expected prediction performance metric. American Journal of Biochemistry and Biotechnology 12, 253–262.
Wiener, H., 1947. Structural determination of paraffin boiling points. Journal of the American Chemical Society 69, 17–20.
Xue, Y., Li, Z.-R., Yap, C.W., et al., 2004. Effect of molecular descriptor feature selection in support vector machine classification of pharmacokinetic and toxicological
properties of chemical agents. Journal of Chemical Information and Computer Sciences 44, 1630–1638.
Zernov, V.V., Balakin, K.V., Ivaschenko, A.A., Savchuk, N.P., Pletnev, I.V., 2003. Drug discovery using support vector machines. The case studies of drug-likeness,
agrochemical-likeness, and enzyme inhibition predictions. Journal of Chemical Information and Computer Sciences 43, 2048–2056.
Zhang, L., Zhu, H., Oprea, T.I., Golbraikh, A., Tropsha, A., 2008. QSAR modeling of the blood–brain barrier permeability for diverse organic compounds. Pharmaceutical
Research 25, 1902.
Zhang, S., Golbraikh, A., Tropsha, A., 2006. Development of quantitative structure–binding affinity relationship models based on novel geometrical chemical descriptors of the
protein–ligand interfaces. Journal of Medicinal Chemistry 49, 2713–2724.
Zou, C., Zhou, L., 2007. QSAR study of oxazolidinone antibacterial agents using artificial neural networks. Molecular Simulation 33, 517–530.
Zou, J.-W., Huang, M., Huang, J.-X., Hu, G.-X., Jiang, Y.-J., 2016. Quantitative structure–hydrophobicity relationships of molecular fragments and beyond. Journal of Molecular
Graphics and Modelling 64, 110–120.

Biographical Sketch

Swathik Clarancia Peter received her M.Tech. in Computational Biology from Pondicherry University, India in
2017 and B.Tech. in Bioinformatics from Tamil Nadu Agricultural University, Coimbatore, India in 2015. At
present she is working as a senior research fellow in ICAR-Sugarcane Breeding Institute, Coimbatore, India. She
works in the field of drug design & discovery and transcriptome data analysis.

Mannu Jayakanthan is an Assistant Professor in the Department of Plant Molecular Biology and Bioinformatics at
Tamil Nadu Agricultural University, Coimbatore, India. He obtained his PhD from Pondicherry University and
pursued his postdoctoral fellowship at the Centre for Cellular and Molecular Biology (CCMB) in Hyderabad,
India. His research interest lies in computer-aided drug discovery.

Jaspreet Kaur received her M.Tech. in Bioinformatics from Delhi Technological University, India in 2013, and
B.Tech. in Biotechnology from Amity University, India in 2011. She joined Indian Institute of Technology, Delhi,
India in 2014 for the Ph.D. program. She works in the field of computer aided drug designing and devising
computational approaches for aiding in targeted genome editing.

Vidhi Malik received her M.Tech. in Bioinformatics from Delhi Technological University, India in 2013 and
B.Tech. in Biotechnology from Sardar Vallabhbhai Patel University of Agriculture and Technology, India in 2011.
She joined Indian Institute of Technology, Delhi, India in 2015 for the Ph.D. program. Her research interest lies in
next generation sequence (NGS) data analysis and computer-aided drug designing.

Navaneethan Radhakrishnan received his B.Tech. in Bioinformatics from Tamil Nadu Agricultural University,
India in 2016. He joined Indian Institute of Technology Delhi, India in 2016 as Junior Research Fellow in the
Department of Biochemical Engineering and Biotechnology. His research interest lies in understanding
biological systems using computational approaches.

Durai Sundar is a DuPont Young Professor in the Department of Biochemical Engineering and Biotechnology at
Indian Institute of Technology, Delhi. He obtained his education from Pondicherry University and Johns Hopkins
University, Baltimore, United States. He is a specialist in molecular and computational biology and his current
research interests are in rational design of genome editing tools and in the biological activity of natural drugs.
