Basics of QSAR Modeling by Prof Rahul D. Jawarkar

Basics of QSAR Modeling
Prof. Rahul D. Jawarkar,

Department of Pharmaceutical Chemistry,
Dr Rajendra Gode Institute of Pharmacy,
University Mardi Road, Amravati, Maharashtra, India(444602),
E-Mail: rahuljawarkar@gmail.com,
Contact:+91-7385178762.
Drug: A drug is a single active chemical moiety that is found
in a medicine and is used for the diagnosis, prevention, treatment and
cure of a disease.
Natural source (80%) Synthetic (20%)

Chemotherapy: It is the treatment of infection or malignancy
with a specific chemical that has selectively adverse effects on
the infecting organism or malignant cell, with minimal effect on
the host cells.
Drug Development
Drug Discovery- Finding therapeutic actions of the
molecule, e.g. penicillin, the antiplatelet action of aspirin,
etc.
Drug Designing- Modifying the molecule for high
activity and favorable Absorption-Distribution-Metabolism-Excretion-Toxicity
(ADMET) properties, e.g. Tamiflu, Relenza,
dorzolamide, etc.
Drug Delivery- Developing methods for drug
administration, e.g. gelatin, starch, etc.
Conventional Procedure for drug discovery/designing:
Synthesis-Testing-Synthesis-Testing
• Designing a new drug costs about $300 million.
• It takes 10-15 years to launch a drug on the market.
• Resources such as time and chemicals are consumed.
• The process is slow and frustrating, with a low success rate.
QSAR is not theoretical !!!!
• Collection of experimental bioactivity like IC50, EC50,
LD50, Kd, Ki, etc.

• Use of chemical structures of reported molecules only
• Comparison of bioactivity of one molecule with another
• Finding reasons for high and low activity
• Validating analysis using Statistical techniques
• OECD guidelines
In short, the experimental work has already been done; QSAR
analysis is then applied to the experimental data to identify
the reasons for the bioactivity of a molecule.
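Potency values such as IC50 span several orders of magnitude, so they are commonly put on a logarithmic scale before modeling. The sketch below is an illustration that is not part of the original slides: it assumes IC50 values reported in nanomolar and uses the common convention pIC50 = -log10(IC50 in mol/L). The function name and the sample values are hypothetical.

```python
import math

def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)

# Illustrative values only (not taken from a real data set).
for ic50 in (10.0, 1000.0):
    print(f"{ic50} nM -> pIC50 = {pic50_from_nm(ic50):.2f}")
```

A higher pIC50 means a more potent compound, which makes comparisons between molecules easier to read.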
Quantitative Structure-Activity Relationship (QSAR)
“Similar compounds behave similarly, and
Activity or Property varies with Structure.”
Do you agree?
Activity = Lipophilicity + Steric + Electronic + Unknown Factors
A QSAR is a multivariate, mathematical relationship
between a set of 2D- and 3D- physicochemical
properties (molecular descriptors) and a biological
activity/toxicity.
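For the linear case, this relationship can be written in a general form. The notation below is an assumption made for illustration: x_i are the descriptor values, b_i are the fitted coefficients, and ε collects the unknown factors.

```latex
\[
\text{Activity} = b_0 + \sum_{i=1}^{n} b_i\, x_i + \varepsilon
\]
```

Nonlinear QSAR models replace the weighted sum with a more general function of the descriptors.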
Important steps involved in
QSAR analysis:
• Experimental data collection
• Structure drawing and appropriate 3D-optimization
• Molecular descriptor calculation and pruning
• Model building
• Model validation
• Model interpretation
Experimental data collection:
1. ChEMBL Database - EMBL-EBI: ChEMBL is a manually
curated database of bioactive molecules with drug-like
properties. It brings together chemical, bioactivity and
genomic data
https://www.ebi.ac.uk/chembl/
2. Binding Database: BindingDB is a public, web-accessible
database of measured binding affinities, focusing chiefly on
the interactions of proteins considered to be drug targets with
small, drug-like molecules.
http://bindingdb.org/bind/index.jsp
3. Enzyme Database – BRENDA: A comprehensive enzyme
information system. https://www.brenda-enzymes.org/
Structure drawing and appropriate
3D-optimization:
Identification, Information & Description

Molecular descriptor calculation
and pruning:
1D: e.g. molecular weight (MW), number of atoms, etc.
2D: e.g. distances, functional groups, etc.
3D: e.g. torsional angles, etc.
Step-2: Calculation of Descriptors
Charge on atom
Dipole moment
pKa
HOMO, LUMO
Chirality
Hydrogen bond donor/acceptor
LogP
Thermodynamic properties, etc.
Note: At present, more than 45,000 descriptors can be calculated!
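As a small illustration of 1D descriptors, the sketch below computes the molecular weight and the atom count from a molecular formula using only the Python standard library. It is a simplified example and not the method used by descriptor software: it assumes formulas without brackets or charges, and the element table is deliberately short. The names `ATOMIC_MASS`, `parse_formula` and `one_d_descriptors` are hypothetical.

```python
import re

# Average atomic masses (g/mol) for a few common elements; extend as needed.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "F": 18.998, "S": 32.06, "Cl": 35.45, "Br": 79.904}

def parse_formula(formula: str) -> dict:
    """Parse a simple molecular formula (no brackets, no charges) into element counts."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"Unsupported formula: {formula!r}")
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"Element not in table: {symbol}")
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

def one_d_descriptors(formula: str) -> dict:
    """Return two 1D descriptors: molecular weight and total number of atoms."""
    counts = parse_formula(formula)
    mw = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return {"MW": mw, "n_atoms": sum(counts.values())}

# Aspirin has the molecular formula C9H8O4.
print(one_d_descriptors("C9H8O4"))
```

2D and 3D descriptors need the connectivity or the optimized geometry of the molecule, so they are normally calculated with dedicated software.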

Step-3: Descriptor selection & Model building
Not all descriptors contain useful information.
Many descriptors provide the same information.
Using too many descriptors results in overfitting.
Using improper descriptors results in poor and misleading models.
Using many descriptors can lead to chance correlations.
Use SR, GA, MA, etc. to select the best descriptors.
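Descriptors that carry no information or repeat information are often removed before any selection algorithm is run. The sketch below is a simple pre-filter, not one of the selection methods named above: it drops constant descriptors and one descriptor of each highly correlated pair. The threshold `r_max=0.95`, the function names and the descriptor values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def prune_descriptors(table, r_max=0.95):
    """Keep descriptor names that are non-constant and not highly
    correlated (|r| > r_max) with a descriptor that was already kept."""
    kept = []
    for name, values in table.items():
        if pstdev(values) == 0:          # constant column: no information
            continue
        if any(abs(pearson_r(values, table[k])) > r_max for k in kept):
            continue                     # redundant with a kept descriptor
        kept.append(name)
    return kept

# Made-up descriptor values for four molecules (illustration only).
descriptors = {
    "MW": [180.2, 151.2, 206.3, 194.2],
    "n_heavy_atoms": [13, 11, 15, 14],   # nearly the same information as MW
    "n_rings": [1, 1, 1, 1],             # constant
    "logP": [1.2, 0.5, 3.5, -0.1],
}
print(prune_descriptors(descriptors))
```

Which descriptor of a correlated pair is kept depends on the order of the table, so in practice the more interpretable descriptor is usually listed first.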

Current Methods for Model Building
A) Multiple Linear Regression (MLR)
• Best Multiple Linear Regression (BMLR),
• Heuristic Method (HM),
• Genetic Algorithm-Multiple Linear Regression (GA-MLR),
• Stepwise MLR,
• Factor Analysis MLR and so on.
B) Partial Least Squares (PLS)
• Genetic Partial Least Squares (G/PLS),
• Factor Analysis Partial Least Squares (FA-PLS),
• Orthogonal Signal Correction Partial Least Squares (OSC-PLS)
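All of the MLR variants listed above share the same core step: fitting a linear equation to the selected descriptors by least squares. The sketch below shows only that core step, solved through the normal equations in plain Python. It does not implement BMLR, HM, GA-MLR or PLS, and the data are synthetic values generated from a known equation so that the result can be checked. All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("Singular matrix: descriptors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_mlr(X, y):
    """Return [intercept, b1, ..., bk] that minimise the squared error."""
    rows = [[1.0] + list(r) for r in X]                # add intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coeffs, row):
    """Predict the activity of one molecule from its descriptor values."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))

# Synthetic data generated from: activity = 2 + 0.5*x1 - 1.0*x2
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [0.5, 2.0, -0.5, 1.0, -0.5]
coeffs = fit_mlr(X, y)
print([round(c, 3) for c in coeffs])
```

With real data, the number of molecules should be much larger than the number of descriptors, otherwise the fit is prone to the overfitting described in Step-3.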
Step-4: Validation of model
a) Leave-One-Out Cross-Validation
b) Leave-Many-Out Cross-Validation
c) External validation
d) Use PCA, Simulated Annealing, Automatic Relevance
Determination (ARD), etc.
e) Use Bayesian Statistics or Gaussian Processes,
since they do not require cross-validation!
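Leave-one-out cross-validation can be sketched for the simplest case, a model with a single descriptor. Each molecule is left out in turn, the line is refitted on the remaining molecules, and the left-out activity is predicted. The sketch assumes the common convention Q² = 1 - PRESS / SS_tot, with SS_tot computed around the mean of the full data set; other conventions exist. The data values and function names are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx

def r_squared(x, y):
    """Coefficient of determination of the model fitted on all data."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def q2_loo(x, y):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / SS_tot."""
    my = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]      # leave molecule i out
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - press / ss_tot

# Made-up data: one descriptor (e.g. logP) against activity (e.g. pIC50).
logp = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
activity = [5.1, 5.4, 6.0, 6.1, 6.8, 7.0]
print(f"R2 = {r_squared(logp, activity):.3f}, Q2(LOO) = {q2_loo(logp, activity):.3f}")
```

For a least-squares model, Q² from leave-one-out cannot exceed R², so a large gap between the two is a warning sign of overfitting. External validation applies the same idea to molecules that were never used in model building.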
Modern trends in QSAR modeling
• Currently, there is much talk about the use of artificial
intelligence (AI) in chemistry.
• AI is the superset of tasks that demonstrate characteristics
of human intelligence, while ML is a subset of AI which
accesses data, analyses trends and generates intelligent,
actionable insights.
• Many people use the term AI in the same context as ML in
many data-rich disciplines, ranging from health care to
astronomy.
• In this regard, one can say that AI has been used in
chemistry since the 1960s under the name QSAR.
[Figure: two compounds tested against troponin I-interacting kinase (TNNI3K); panel labels "A.I. identified" and "Real". One compound: IC50 = 8000 nM*. The other compound: AI-predicted IC50 = 7800 nM; experimental IC50 = 80 nM*.]
*Lawhorn, B. G. et al., Identification of purines and 7-deazapurines as potent and
selective type I inhibitors of troponin I-interacting kinase (TNNI3K). J. Med. Chem.
2015, 58, 7431−7448.
[Figure: two compounds tested against spleen tyrosine kinase (Syk). One compound: IC50 = 8.8 nM*. The other compound: AI-predicted IC50 = 10 nM; experimental IC50 = 0.060 nM*.]
*Ellis, J. M. et al., Overcoming mutagenicity and ion channel activity: optimization of
selective spleen tyrosine kinase inhibitors. J. Med. Chem. 2015, 58, 1929−1939.
QSAR based virtual screening
• Molecular docking can rapidly identify large subsets of
molecules with desired activity from large screening
collections of compounds (10^5–10^6 compounds) using
automated methods.
• However, the hit rate ranges between 0.01% and 0.1% !!!
• Most of the screened compounds are routinely reported as
false positives.
• On the other hand, typical hit rates for QSAR-based virtual
screening range between 1% and 40% !!!!!
Reference: Neves BJ, Braga RC, Melo-Filho CC, Moreira-Filho JT, Muratov EN and
Andrade CH (2018) QSAR-Based Virtual Screening: Advances and Applications in
Drug Discovery. Frontiers in Pharmacology 9. doi: 10.3389/fphar.2018.01275
QSAR based virtual screening:
Success Stories
• Zhang et al. (2013) used a data set of 3,133 compounds reported
as active or inactive against P. falciparum to
develop QSAR models.
• The QSAR models were applied for VS of the ChemBridge
database.
• After VS, 176 potential antimalarial compounds were
identified and submitted to experimental validation along
with 42 putative inactive compounds.
• Twenty-five compounds showed antimalarial activity against P.
falciparum.
• All 42 compounds predicted as inactive by the models were
confirmed experimentally to be inactive.
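The numbers reported for this study can be turned into a hit rate and compared with the 1% to 40% range quoted earlier for QSAR-based virtual screening. The sketch below is only arithmetic on the figures given in the bullets above; the function name is hypothetical.

```python
def hit_rate(confirmed: int, tested: int) -> float:
    """Percentage of tested compounds whose prediction was confirmed."""
    if tested <= 0:
        raise ValueError("Number of tested compounds must be positive")
    return 100.0 * confirmed / tested

# Figures from the Zhang et al. (2013) bullets above.
print(f"Predicted actives confirmed:   {hit_rate(25, 176):.1f}%")
print(f"Predicted inactives confirmed: {hit_rate(42, 42):.1f}%")
```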
QSAR based virtual screening:
Success Stories
• Alves et al. (2020) used a data set of 113 compounds (40 actives
and 73 inactives) for the SARS-CoV Mpro to develop QSAR models.
• The QSAR models were applied for VS of the DrugBank
database of FDA approved drugs.
• After VS, 42 potential drugs were identified, but only 11 were
tested experimentally.
• Three compounds showed strong activity against the SARS-CoV-2
Mpro.
1. Zhang, L. et al. (2013) Discovery of novel antimalarial compounds enabled by
QSAR-based virtual screening. J. Chem. Inf. Model. 53, 475–492. doi:
10.1021/ci300421n
2. Alves et al. (2020) QSAR Modeling of SARS-CoV Mpro Inhibitors Identifies
Sufugolix, Cenicriviroc, Proglumetacin, and Other Drugs as Candidates for
Repurposing against SARS-CoV-2. Mol. Inf. doi: 10.1002/minf.202000113
Disadvantages of QSAR
• False correlations may arise because biological data
are subject to considerable experimental error (noisy data).
• If the training dataset is not large enough, the data collected
may not reflect the complete property space.
Consequently, many QSAR results cannot be used to
confidently predict which compounds are most likely to have
the best activity.
• Features may not be reliable either. This is particularly
serious for 3D features, because the 3D structures of ligands
bound to the receptor may not be available. A common
approach is to use a minimized structure, but that may not
represent reality well.
Free Software for QSAR
1. ACD Chemsketch (www.acdlabs.com)
2. PyMOL
3. RDKit
4. ChemDraw
5. Avogadro software (https://avogadro.cc/)
6. OpenBabel (http://openbabel.org/wiki/Main_Page)
7. MMTK (http://dirac.cnrs-orleans.fr/MMTK.html)
8. PyDescriptor (available from Dr. V. H. Masand)
9. PaDEL (http://www.yapcwsoft.com/dd/padeldescriptor/)
10. BuildQSAR (https://profanderson.net/files/buildqsar.php)
11. Weka (https://www.cs.waikato.ac.nz/ml/weka/)
12. ‘R’ packages like GA-MLR, caret, etc.
Databases
1. ChEMBL Database - EMBL-EBI: ChEMBL is a
manually curated database of bioactive molecules with
drug-like properties. It brings together chemical,
bioactivity and genomic data
https://www.ebi.ac.uk/chembl/
2. Enzyme Database – BRENDA: A comprehensive
enzyme information system.
https://www.brenda-enzymes.org/
