
Open Source for Computer-Aided Drug

Discovery

Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Biomedical Applications
Concept Level
★ Proteome annotation ★ Drug discovery ★ Vaccine design ★ Biomarkers

Molecules or Objects
Proteins & Peptides
• Structure prediction
• Subcellular localization
• Therapeutic application
• Ligand binding
Gene Expression
• Disease biomarkers
• Drug biomarkers
• mRNA expression
• Copy number variation
Chemoinformatics
• Drug design
• Chemical descriptors
• QSAR models
• Personalized inhibitors
Image Annotation
• Image classification
• Medical images
• Disease classification
• Disease diagnostics
Concept of Drug and Vaccine

 Concept of Drug
 Kill invading foreign pathogens
 Inhibit the growth of pathogens
 Concept of Vaccine
 Generate memory cells
 Train the immune system to face various existing disease agents
History of Drug/Vaccine development

 Plants or Natural Products
 Plants and natural products were the source of medicinal substances
 Example: foxglove used to treat congestive heart failure
 Foxglove contains digitalis, a cardiotonic glycoside
 Identification of the active component
 Accidental Observations
 Penicillin is one good example
 Alexander Fleming observed the effect of mold
 Mold (Penicillium) produces the substance penicillin
 Discovery of penicillin led to large-scale screening
 Soil microorganisms were grown and tested
 Streptomycin, neomycin, gentamicin, tetracyclines etc.
History of Drug/Vaccine development
Chemical Modification of Known Drugs
Drug improvement by chemical modification
Penicillin G -> Methicillin; morphine -> nalorphine
Receptor-Based Drug Design
The receptor is the target (usually a protein)
The drug molecule binds to it to cause biological effects
Also called the lock-and-key system
Structure determination of the receptor is important
Ligand-Based Drug Design
Search for a lead compound or active ligand
The structure of the ligand guides the drug design process
Drug Design based on Bioinformatics Tools

 Detect the molecular basis of disease
 Detection of the drug-binding site
 Tailor a drug to bind at that site
 Protein modeling techniques
 Traditional method (brute-force testing)
 Rational drug design techniques
 Screen likely compounds that are built
 Modeling of large numbers of compounds (automated)
 Application of artificial intelligence
 Limitation: availability of known structures
Drug Discovery & Development
1. Identify disease
2. Isolate protein involved in disease (2-5 years)
3. Find a drug effective against the disease protein (2-5 years)
   Drug Design: Molecular Modeling, Virtual Screening
4. Preclinical testing (1-3 years)
5. Formulation and scale-up
6. Human clinical trials (2-10 years)
7. FDA approval (2-3 years)
Technology is impacting this process
GENOMICS, PROTEOMICS & BIOPHARM.: potentially producing many more targets and "personalized" targets
HIGH THROUGHPUT SCREENING: screening up to 100,000 compounds a day for activity against a target protein
VIRTUAL SCREENING: using a computer to predict activity
COMBINATORIAL CHEMISTRY: rapidly producing vast numbers of compounds
MOLECULAR MODELING: computer graphics & models help improve activity
IN VITRO & IN SILICO ADME MODELS: tissue and computer models begin to replace animal testing
Computer Aided Drug Design Techniques
- Physicochemical Properties Calculations

- Partition Coefficient (LogP), Dissociation Constant (pKa) etc.


- Drug Design
- Ligand Based Drug Design
- QSARs
- Pharmacophore Perception
- Structure Based Drug Design
- Docking & Scoring
- de-novo drug design
- Pharmacokinetic Modeling (QSPRs)
- Absorption, Metabolism, Distribution and Toxicity etc.
- Cheminformatics
- Database Management
- Similarity / Diversity Searches
- All techniques join together to form VIRTUAL SCREENING protocols
Quantitative Structure Activity Relationships (QSAR)

 QSARs are mathematical relationships linking chemical structure with biological
activity, using physicochemical or other derived properties as an interface.

Biological Activity = f (Physico-chemical properties)

 Mathematical methods used in QSAR include various regression and pattern recognition
techniques.

 The physicochemical or other properties used for generating QSARs are termed descriptors
and treated as independent variables.

 The biological property is treated as the dependent variable.


QSAR and Drug Design

Compounds + biological activity

QSAR

New compounds with


improved biological activity
Chemical Space Issue
QSAR Generation Process

1. Selection of training set


2. Enter biological activity data
3. Generate conformations
4. Calculate descriptors
5. Selection of statistical method
6. Generate a QSAR equation
7. Validation of QSAR equation
8. Predict for Unknown
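A minimal sketch of steps 4-8 in Python with scikit-learn, assuming a descriptor matrix X and measured activities y; the data here are random placeholders, not from any real study, and the actual workflow may use different descriptor and regression software.

# Hypothetical sketch of steps 4-8: descriptors X, activities y, a linear QSAR model
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                                         # 30 compounds x 5 descriptors
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=30)   # placeholder activity values

model = LinearRegression()                                # statistical method (step 5)
model.fit(X, y)                                           # generate the QSAR equation (step 6)
q2 = cross_val_score(model, X, y, cv=5, scoring="r2")     # validation (step 7)
print("coefficients:", model.coef_, "cross-validated R2:", q2.mean())
print("predicted activity of a new compound:", model.predict(rng.normal(size=(1, 5))))  # step 8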
Descriptors

Selection of Descriptors
1. Structural descriptors
2. Electronic descriptors
3. Quantum mechanical descriptors
4. Thermodynamic descriptors
5. Shape descriptors
6. Spatial descriptors
7. Conformational descriptors
8. Receptor descriptors

Questions guiding the selection:
1. What is relevant to the therapeutic target?
2. What variation is relevant to the compound series?
3. What property data can be readily measured?
4. What can be readily calculated?
Singla et al. (2013) Open source software and web services for designing
therapeutic molecules. Curr Top Med Chem. 13(10):1172-91.

An overview of the workflow of the in silico drug design process
Important Points

1. Source of Molecules (databases or repositories)


2. Molecular Editors (editing & viewing existing molecules)
3. Analog Generators (software used to generate analogs)
4. Structure Optimization (Energy/geometry of molecules)
5. Calculation of Molecular Descriptors
6. Chemical Similarity Search
7. Development of QSAR/QSPR Models
8. Classification and Clustering of Small Molecules
9. Docking Small Molecules in Macromolecules
10. Pharmacophore Tools/Search
11. Software for ADMET Techniques
12. Designing of Inhibitors
13. Major Initiatives towards affordable drugs
Databases and resources of chemical compounds
Major molecular editors
Analogs generation software
Software for structure optimization of molecules
Computing molecular descriptors
Similarity search algorithms and their web links
Different types of pharmacophore searching software
Software: brief description (web link)
PharmMapper [69]: Ligand-based pharmacophore search (http://59.78.96.61/pharmmapper/)
PharmaGist [70]: Ligand-based pharmacophore search (http://bioinfo3d.cs.tau.ac.il/PharmaGist/)
Pharmer [71]: Both PDB- and ligand-based pharmacophore search (http://smoothdock.ccbb.pitt.edu/pharmer/)
ZincPharmer [133]: Both PDB- and ligand-based pharmacophore search (http://zincpharmer.csb.pitt.edu/pharmer.html)
Boomer: Pharmacokinetic drug monitoring (http://www.boomer.org/)
Cyber Patient: A software for pharmacokinetic simulations (http://www.labsoft.com/www/software.html)
PKfit: A tool for pharmacokinetic modeling (http://cran.csie.ntu.edu.tw/web/packages/PKfit/index.html)
JPKD: Therapeutic drug monitoring (http://pkpd.kmu.edu.tw/jpkd/)
tdm: Therapeutic drug monitoring (http://pkpd.kmu.edu.tw/tdm/)
mobilePK: (http://pkpd.kmu.edu.tw/mobilepk/)
PaDEL-Descriptor: http://www.yapcwsoft.com/dd/padeldescriptor/
Molecular Descriptors

Experimental Descriptors
Physico-chemical properties, e.g.:
• Boiling point
• Melting point
• Dipole moment
• Polarizability
• Molar refractivity

Theoretical Descriptors
Derived from a symbolic representation of the molecule and can be further
classified according to the different types of molecular representation.
For example, water can be represented in many different formats:
1. Molecular formula: H2O
2. PDBe ligand code: HOH
3. European Community number: 231-791-2, etc.

Source:
http://www.moleculardescriptors.eu/tutorials/T2_moleculardescriptors_chemom.pdf
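As an illustration, a small sketch of computing a few theoretical descriptors directly from a SMILES string, assuming the open-source RDKit toolkit is installed (PaDEL-Descriptor, listed above, is another option):

from rdkit import Chem                               # assumes RDKit is installed
from rdkit.Chem import Descriptors, Crippen

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")    # aspirin from its SMILES string
print("Molecular weight:", Descriptors.MolWt(mol))
print("LogP estimate:", Crippen.MolLogP(mol))        # partition coefficient
print("TPSA:", Descriptors.TPSA(mol))                # topological polar surface area
print("H-bond donors/acceptors:", Descriptors.NumHDonors(mol), Descriptors.NumHAcceptors(mol))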
Different File Formats

SDF file (Structure-Data File): saved in plain text and contains chemical structure records;
used as a standard exchange format for chemical information.

SMILES (Simplified Molecular Input Line Entry System) is a chemical notation that allows
a user to represent a chemical structure in a way that can be used by the computer.

MDL Mol format: an MDL Molfile is a file format for holding information about the atoms,
bonds, connectivity and coordinates of a molecule.

Link for different file formats:


https://en.wikipedia.org/wiki/Chemical_file_format
https://open-babel.readthedocs.io/en/latest/FileFormats/Overview.html
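A brief sketch of moving between these formats, assuming RDKit is installed (Open Babel offers similar conversions):

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("O")          # water from a SMILES string
AllChem.Compute2DCoords(mol)           # generate coordinates for the Molfile
print(Chem.MolToMolBlock(mol))         # MDL Molfile text (atoms, bonds, coordinates)

writer = Chem.SDWriter("water.sdf")    # write an SDF record
writer.write(mol)
writer.close()
for m in Chem.SDMolSupplier("water.sdf"):
    print(Chem.MolToSmiles(m))         # read the SDF back and emit canonical SMILES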
Case Study for Computer-Aided Drug
Discovery

Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Personalized medicine: integrated data for treatment decisions
• Family history
• Proteomics
• Gene expression
• Imaging
• Clinical information
• Metabolomics
Integrated data -> Treatment decision -> Personalized medicine
Pharmacogenomics (PGx)
◆ Pharmacogenomics (PGx) – the study of variations of DNA and RNA
characteristics as related to drug response.
It explains inter-individual differences in drug metabolism.

Inputs: DNA sequence, gene expression, copy number, SNPs
Through pharmacokinetics and pharmacodynamics, PGx links these inputs to the
physiological drug response: identifying responders and non-responders,
efficacy and toxicity.
Feature Engineering & Case Studies

Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Feature Engineering
Few Facts
 It is the most important part of prediction/classification
 It is an art; it is human-driven design
 A number of techniques have been used in the past
 Principles are not well defined or validated
 Commonly used feature selection methods will be discussed
 Theoretical view: more features mean more discrimination power
 In practice, identification of the major features is more important
 Optimization of features is a major challenge
Why feature reduction
➢ Many domains, like biology, have thousands of variables/features (e.g. genetic features)

➢ Many features are irrelevant, and there are redundant ones

➢ The probability distribution can be very complex and hard to estimate (e.g.
dependencies between variables)

➢ Irrelevant and redundant features can "confuse" learners

➢ Limited training data

➢ Limited computational resources

➢ Curse of dimensionality
Definition of Feature Selection
 Classification/Regression (Supervised Learning):

L = {(x1, y1), …, (xi, yi), …, (xm, ym)} ⊆ X × Y
x = (x1, …, xn)T
X = f1 × … × fi × … × fn

Select features from F = {f1, …, fi, …, fn} to obtain F' ⊆ F

F' has fewer than or the same number of features after reduction
Feature Selection / Extraction
 Feature Selection:

{f1, …, fi, …, fn} → {fi1, …, fij, …, fim}
with ij ∈ {1, …, n}, j = 1, …, m, and ia = ib ⇒ a = b (no feature selected twice)

 Feature Extraction/Creation:

{f1, …, fi, …, fn} → {g1(f1, …, fn), …, gj(f1, …, fn), …, gm(f1, …, fn)}
Filter Methods: Variable Ranking
 A simple method for feature selection using variable ranking
is to select the k highest-ranked features according to a scoring function S.

 This is usually not optimal

 but often preferable to other, more complicated methods

 computationally efficient: only calculation and sorting of n scores
Filter Methods: Ranking Correlation
Correlation Criteria:
 Pearson correlation coefficient

R(fi, y) = cov(fi, y) / sqrt( var(fi) · var(y) )

 Estimate for m samples:

R(fi, y) = Σ_{k=1..m} (f_{k,i} − f̄_i)(y_k − ȳ) /
           sqrt( Σ_{k=1..m} (f_{k,i} − f̄_i)² · Σ_{k=1..m} (y_k − ȳ)² )
The higher the correlation between the feature and the target, the higher the score!
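A small sketch of this ranking in Python (random placeholder data; in practice X would hold the features and y the target):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))                   # m = 100 samples, n = 6 features
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(scale=0.5, size=100)

# Score each feature by |Pearson correlation| with the target, then rank
scores = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]              # highest score first
k = 2
print("top-k features:", ranking[:k], "scores:", scores[ranking[:k]].round(3))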
Filter Methods: Classification
Methods
1. Difference in means between positive and negative samples

Ranking of features based on the maximum difference (Pmean − Nmean)

2. Significance of the difference in means (t-test)

3. Threshold-based feature selection

Based on MCC
Based on Accuracy
4. Ranking of features
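A sketch of methods 1 and 2 with NumPy and SciPy (placeholder data, not a real dataset):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
X_pos = rng.normal(loc=1.0, size=(50, 5))             # positive-class samples
X_neg = rng.normal(loc=0.0, size=(50, 5))             # negative-class samples

mean_diff = X_pos.mean(axis=0) - X_neg.mean(axis=0)   # Pmean - Nmean per feature
t_stat, p_val = stats.ttest_ind(X_pos, X_neg, axis=0) # significance of the difference
print("mean differences:", mean_diff.round(2))
print("features ranked by t-test p-value:", np.argsort(p_val))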
Feature Subset Selection
Wrapper Methods
• The problem of finding the optimal subset is NP-hard!

• A wide range of heuristic search strategies can be used.
Two different classes:
– Forward selection
(start with an empty feature set and add features at each step)
– Backward elimination
(start with the full feature set and discard features at each step)

• Predictive power is usually measured on a validation set or by cross-validation

• By using the learner as a black box, wrappers are universal and simple!
• Criticism: a large amount of computation is required.
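A sketch of a wrapper in Python, assuming a recent scikit-learn release that provides SequentialFeatureSelector (synthetic data):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
learner = LogisticRegression(max_iter=1000)            # the learner used as a black box
sfs = SequentialFeatureSelector(learner, n_features_to_select=5,
                                direction="forward", cv=5)   # "backward" also possible
sfs.fit(X, y)                                          # cross-validated greedy search
print("selected feature indices:", sfs.get_support(indices=True))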
Feature Subset Selection
Embedded Methods
• Specific to a given learning machine!

• Performs variable selection (implicitly) in the process of training

• E.g. the WINNOW algorithm
(linear unit with multiplicative updates)
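For illustration, a sketch of an embedded method using an L1-regularized logistic regression in scikit-learn (not the WINNOW algorithm itself), where selection happens implicitly during training:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)                                   # training and selection in one step
print("features kept by the L1 penalty:", np.flatnonzero(clf.coef_[0]))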
Important points 1/2
• Feature selection can significantly increase the
performance of a learning algorithm (both
accuracy and computation time) – but it is not
easy!

• One can work on problems with very high-dimensional feature spaces

• Relevance <-> Optimality

• Correlation and mutual information between
single variables and the target are often used as
ranking criteria for variables.
Important points 2/2
• One cannot automatically discard variables with
small scores – they may still be useful together
with other variables.

• Filters – Wrappers – Embedded Methods

• How to search the space of all feature subsets?

• How to assess the performance of a learner that uses a particular feature subset?
Filters, Wrappers, and Embedded Methods
Filter: All features -> Filter -> Feature subset -> Predictor
Wrapper: All features -> Multiple feature subsets <-> Predictor (search guided by predictor performance)
Embedded method: All features -> Embedded method (selects the feature subset while training the predictor) -> Predictor
Filters
Methods:
 Criterion: Measure feature/feature subset
“relevance”
 Search: Usually order features (individual feature
ranking or nested subsets of features)
 Assessment: Use statistical tests

Results:
 Are (relatively) robust against overfitting
 May fail to select the most “useful” features
Wrappers
Methods:
 Criterion: Measure feature subset “usefulness”
 Search: Search the space of all feature subsets
 Assessment: Use cross-validation

Results:
 Can in principle find the most “useful” features,
but
 Are prone to overfitting
Embedded Methods
Methods:
 Criterion: Measure feature subset “usefulness”
 Search: Search guided by the learning process
 Assessment: Use cross-validation

Results:
 Similar to wrappers, but
 Less computationally expensive
 Less prone to overfitting
Three "Ingredients"
Assessment: cross-validation, performance bounds, statistical tests
Criterion: single feature relevance, relevance in context, feature subset relevance, performance of the learning machine
Search: single feature ranking, nested subsets (forward selection / backward elimination), heuristic or stochastic search, exhaustive search
Feature selection examples
Garg A, Tewari R, Raghava GP. KiDoQ: using docking based energy scores to develop ligand based
model for predicting antibacterials. BMC Bioinformatics. 2010 Mar 11;11:125
 23 inhibitors against the DHDPS enzyme
 11 energy-based descriptors obtained from docking using AutoDock
 F-stepping remove-one approach
 Singh H, Singh S, Singla D, Agarwal SM, Raghava GP. QSAR based model for discriminating EGFR
inhibitors and non-inhibitors using Random forest. Biol. Direct. 2015 Mar 25;10:10.
 EGFR dataset: 508 inhibitors and 2997 non-inhibitors
 881 PubChem fingerprints
 Frequency-based feature selection technique
 Difference of frequency in inhibitors and non-inhibitors
Feature selection examples
Chauhan JS, Dhanda SK, Singla D; Open Source Drug Discovery Consortium, Agarwal SM, Raghava
GP. QSAR-based models for designing quinazoline/imidazothiazoles/pyrazolopyrimidines based
inhibitors against wild and mutant EGFR. PLoS One. 2014 Jul 3;9(7)
 Selection of descriptors having high correlation with IC50
 Removal of descriptors with low variance
 Removal of highly correlated descriptors
 Removal of useless descriptors having a lot of zeros
Dhanda SK, Singla D, Mondal AK, Raghava GP. DrugMint: a webserver for predicting and designing
of drug-like molecules. Biol Direct. 2013 Nov 5;8:28. doi: 10.1186/1745-6150-8-28.
 Weka software
 Remove Useless (rm-useless): variation is either too high or negligible
 CfsSubsetEval module of Weka: descriptors with high correlation with the class/activity and very
low inter-correlation
Feature selection examples
Bhalla S, Chaudhary K, Kumar R, Sehgal M, Kaur H, Sharma S, Raghava GP. Gene expression-based
biomarkers for discriminating early and late stage of clear cell renal cancer. Sci Rep. 2017 Mar
28;7:44997.
523 samples to discriminate early and late stage ccRCC
Total descriptors: expression of 20,538 genes per sample
Threshold-based approach for ranking genes (over- or under-expressed)
Removal of highly correlated genes (cutoff 0.60)
Average of the outputs of models based on the best genes/descriptors
Weka Fast Correlation-Based Filter (FCBF): selection utilizes predominant
correlation to identify relevant features in high-dimensional datasets in a reduced
feature space
K-NEAREST NEIGHBOR METHOD (KNN)

Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
K-NEAREST NEIGHBOR METHOD (KNN)
 Classification of unknown object based on similarity/distance of
annotated object
 Search similar objects in a database of known objects
 Different names
 Memory based reasoning
 Example based learning
 Instance based reasoning
 Case based reasoning
K-NEAREST NEIGHBOR METHOD (KNN)
KNN – Number of Neighbors
 If K=1, select the nearest neighbor
 If K>1,
 For classification select the most frequent neighbor.
 Voting or average concept
 Preference/weight to similarity/distance
 For regression calculate the average of K neighbors.

Weight to Instance
 All instances or examples are not equally reliable
 Weight an instance based on its success in prediction
Distance Metrics
Standardization
 Transform raw feature values into z-scores:

zij = (xij - mj) / sj

 xij is the value of the i-th sample for the j-th feature
 mj is the average of xij over all input samples for feature j
 sj is the standard deviation of xij over all input samples
 The range and scale of z-scores should be similar (providing the
distributions of raw feature values are alike)
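A quick sketch of the transformation, by hand and with scikit-learn's StandardScaler (toy numbers):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.6, 60.0],
              [1.8, 72.0],
              [2.0, 90.0]])                       # raw features on different scales

z_manual = (X - X.mean(axis=0)) / X.std(axis=0)   # z_ij = (x_ij - m_j) / s_j
z_sklearn = StandardScaler().fit_transform(X)
print(np.allclose(z_manual, z_sklearn))           # True: identical standardization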
Instance Based Reasoning
Type of IBR
• IB1 is based on the standard KNN
• IB2 is incremental KNN learner that only incorporates misclassified
instances into the classifier.
• IB3 discards instances that do not perform well by keeping success
records.

Weight to Instance
 All instances or examples are not equally reliable
 Weight an instance based on its success in prediction
Python Code

https://www.youtube.com/watch?v=6kZ-OPLNcgE
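In addition to the video above, a minimal sketch of KNN classification with scikit-learn (the iris dataset is used here only as a stand-in example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")  # weight neighbors by distance
knn.fit(X_train, y_train)                   # "training" only stores the annotated examples
print("test accuracy:", knn.score(X_test, y_test))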
K-Nearest Neighbor
More about KNN
1. Prepare: numeric values
2. Method: similarity/distance
3. Train: does not apply to the kNN algorithm
4. Compute the distance/similarity of the instance to the examples
5. Identify the neighbors of the instance
6. Voting for classification & average for regression

Pros: high accuracy, insensitive to outliers

Cons: computationally expensive, requires a lot of memory
Works with: numeric values, nominal values
Important Points (KNN)
➢ The k-nearest neighbors (KNN) algorithm is simple and
➢ easy to implement; it can be used to solve both classification and regression problems.
➢ The KNN algorithm assumes that similar things exist in close proximity.
➢ KNN captures the idea of similarity by calculating the distance between points.
➢ As we decrease the value of K to 1, our predictions become less stable.
➢ The algorithm gets significantly slower as the size of the data increases.
➢ Accuracy depends on the quality of the data.
➢ One must find an optimal k value (number of nearest neighbors).
➢ Poor at classifying data points near a boundary, where they can be classified wrongly.
Support Vector Machine Learning

Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Introduction to Support Vector Machine (SVM)

 Proposed by Boser, Guyon and Vapnik in 1992

 Gained increasing popularity in the late 1990s
 Highly effective on small datasets
 Minimal over-optimization
 Classification (binary, multi-class), regression, clustering
 Most popular implementations: SMO, LIBSVM and SVMlight
 Tuning SVM parameters is a challenge (trial and error)
Hyperplanes in 2D and 3D feature space

If the number of input features is 2, then the hyperplane is just a line. If the number of
input features is 3, then the hyperplane becomes a two-dimensional plane.
Linear Separators
 Which of the linear separators is optimal?
Perceptron Revisited: Linear Separators
 Binary classification can be viewed as the task of
separating classes in feature space:

wTx + b = 0
wTx + b > 0
wTx + b < 0

f(x) = sign(wTx + b)
Classification Margin
 The distance from an example x to the separator is r = (wTx + b) / ||w||
 Examples closest to the hyperplane are support vectors.
 The margin ρ of the separator is the width of separation between classes.
Maximum Margin Classification
 Maximizing the margin is good according to intuition and PAC
theory.
 Implies that only support vectors are important; other training
examples are ignorable.
Soft Margin Classification
 What if the training set is not linearly separable?
 Slack variables ξi can be added to allow misclassification of difficult or
noisy examples.

Linear SVMs: Overview
 The classifier is a separating hyperplane.
 Most “important” training points are support vectors; they
define the hyperplane.
 Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrangian
multipliers αi.
 Both in the dual formulation of the problem and in the solution, training points
appear only inside inner products xiTxj.
Non-linear SVMs
 Datasets that are linearly separable with some noise work out great.

 But what are we going to do if the dataset is just too hard?

 How about mapping the data to a higher-dimensional space (e.g. adding x² as a feature)?
Non-linear SVMs: Feature spaces
 General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
The “Kernel Trick”
 The linear classifier relies on inner product between vectors K(xi,xj)=xiTxj
 If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes: K(xi,xj)= φ(xi) Tφ(xj)
 A kernel function is some function that corresponds to an inner product in some
expanded feature space.
Examples of Kernel Functions
 Linear: K(xi, xj) = xiTxj
 Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p
 Gaussian (radial-basis function network): K(xi, xj) = exp( -||xi - xj||² / (2σ²) )
HARD MARGIN
If the training data is linearly separable, we can select two parallel hyperplanes
that separate the two classes of data.
Distance between them is as large as possible.

SOFT MARGIN
As most real-world data are not fully linearly separable, we allow some
margin violations to occur; this is called soft margin classification.
It is better to have a large margin, even though some constraints are violated.
Some data points are allowed to lie either on the incorrect side of the hyperplane or
between the margin and the correct side of the hyperplane.
Cost Function and Gradient Updates
➔ In the SVM algorithm, we are looking to maximize the margin between the data
points and the hyperplane. The loss function that helps maximize the margin is
hinge loss.
➔ The function of the first term, hinge loss, is to penalize misclassifications. It
measures the error due to misclassification (or data points being closer to the
classification boundary than the margin). The second term is the regularization
term, which is a technique to avoid overfitting by penalizing large coefficients in
the solution vector.
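A NumPy sketch of this cost (average hinge loss plus an L2 penalty) and one sub-gradient step, on placeholder data:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)       # labels in {-1, +1}
w, b, lam = np.array([0.5, 0.5]), 0.0, 0.01      # current weights, bias, regularization strength

margins = y * (X @ w + b)
hinge = np.maximum(0.0, 1.0 - margins).mean()    # penalizes misclassified or margin-violating points
cost = hinge + lam * np.dot(w, w)                # regularization penalizes large coefficients
print("hinge loss:", round(hinge, 3), "total cost:", round(cost, 3))

viol = margins < 1                               # only margin violations contribute to the gradient
grad_w = -(y[viol, None] * X[viol]).sum(axis=0) / len(y) + 2 * lam * w
w = w - 0.1 * grad_w                             # one gradient update with learning rate 0.1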
Ensemble Methods, Decision Tree
and Random forest

Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Ensemble learning
Ensembles of Classifiers

 Idea
 Combine the classifiers to improve the performance
 Ensembles of Classifiers
 Combine the classification results from different
classifiers to produce the final output
 Unweighted voting
 Weighted voting
Build Ensemble Classifiers
• Basic idea:
Build different “experts”, and let them vote
• Advantages:
Improve predictive performance
Other types of classifiers can be directly included
Easy to implement
Not too much parameter tuning
• Disadvantages:
The combined classifier is not so transparent (black box)
Not a compact representation
Ensemble-based methods
 Classifier combination/aggregation is a general concept
 Creation of a large set of classifiers, here called the "base learners"
 Building a large number of base classifiers
 These base classifiers should be diverse
 Aim to maximize accuracy using the combination
 These methods will be:
 Bagging,
 Boosting,
 AdaBoost,
 Random Forests
Outline
Bias/Variance Tradeoff

• Ensemble methods that minimize variance


– Bagging
– Random Forests

• Ensemble methods that minimize bias


– Functional Gradient Descent
– Boosting
– Ensemble Selection
Bagging
 Bagging = Bootstrap + aggregating
 It uses bootstrap resampling to generate L different training sets from the
original training set
 On the L training sets it trains L base learners
 During testing it aggregates the L learners by taking their average (using
uniform weights for each classifier), or by majority voting
 The diversity or complementarity of the base learners is not controlled in
any way; it is left to chance and to the instability of the base learning
method
 The ensemble model is almost always better than the individual base learners
if the base learners are unstable (which means that a small change in the
training dataset may cause a large change in the result of the training)
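A sketch of bagging with scikit-learn, comparing a single (unstable) decision tree with L = 50 bootstrapped trees aggregated by majority voting; the `estimator` keyword assumes a recent scikit-learn release:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(estimator=tree, n_estimators=50, bootstrap=True, random_state=0)
print("single tree accuracy:", round(cross_val_score(tree, X, y, cv=5).mean(), 3))
print("bagged trees accuracy:", round(cross_val_score(bagged, X, y, cv=5).mean(), 3))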
Bootstrap resampling
 Suppose we have a training set with n samples
 Create L different training sets
 Bootstrap resampling takes random samples
 Randomness is required to obtain different sets for the L rounds
 Sampling with replacement is required
 As the L training sets are different, the results of training over
these sets will be different
 Works better with unstable learners (e.g. neural nets, decision trees)
 Not really effective with stable learners (e.g. k-NN, SVM)
Aggregation methods
 There are several methods to combine (aggregate) the outputs of the various classifiers
 When the output is a class label:
 Majority voting
 Weighted majority voting (e.g. we can weight each classifier
by its reliability, which also has to be estimated somehow, of course…)
 When the output is numeric (e.g. a probability estimate dj for each class cj):
 We can combine the dj scores by taking their (weighted)
mean, product, minimum, maximum, …
 Stacking
 Instead of using the above simple aggregation rules, we can
train yet another classifier on the output values of the base
classifiers
Comparison of ensemble methods (Method: minimize bias? / minimize variance? / other comments)

Bagging: Complex model class (deep DTs) / Bootstrap aggregation (resampling training data) / Does not work for simple models.
Random Forests: Complex model class (deep DTs) / Bootstrap aggregation + bootstrapping features / Only for decision trees.
Gradient Boosting (AdaBoost): Optimize training performance / Simple model class (shallow DTs) / Determines which model to add at run-time.
Ensemble Selection: Optimize validation performance / Optimize validation performance / Pre-specified dictionary of models learned on the training set.

• State-of-the-art prediction performance
– Won the Netflix Challenge
– Won numerous KDD Cups
– Industry standard
Decision Tree for PlayTennis

Outlook (branches: Sunny, Overcast, Rain)
Sunny -> Humidity: High -> No, Normal -> Yes

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
When to consider Decision Trees
 Instances describable by attribute-value pairs
 Target function is discrete valued
 Disjunctive hypothesis may be required
 Possibly noisy training data
 Missing attribute values
 Examples:
 Medical diagnosis
 Credit risk analysis
 Object classification for robot manipulator (Tan 1993)
Converting a Tree to Rules

Outlook
- Sunny -> Humidity: High -> No; Normal -> Yes
- Overcast -> Yes
- Rain -> Wind: Strong -> No; Weak -> Yes

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
Basic Concept
 A Decision Tree is an important data structure known to
solve many computational problems
Binary Decision Tree
A B C f
0 0 0 m0
0 0 1 m1
0 1 0 m2
0 1 1 m3
1 0 0 m4
1 0 1 m5
1 1 0 m6
1 1 1 m7

Basic Concept
 In the last example, we considered a decision tree where the values of every
attribute are binary. A decision tree is also possible where attributes are of a
continuous data type.
Decision Tree with numeric data

Some Characteristics
 Decision tree may be n-ary, n ≥ 2.
 There is a special node called root node.
 All nodes drawn with circle (ellipse) are called internal nodes.
 All nodes drawn with rectangle boxes are called terminal nodes or leaf
nodes.
 Edges of a node represent the outcome for a value of the node.
 In a path, a node with same label is never repeated.
 Decision tree is not unique, as different ordering of internal nodes can give
different decision tree.
Decision Tree and Classification Task
 Decision tree helps us to classify data.
 Internal nodes are some attribute
 Edges are the values of attributes

 External nodes are the outcome of classification


 Such a classification is, in fact, made by posing questions starting from the root
node to each terminal node.

Decision Tree and Classification Task
Vertebrate Classification
Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class
Human | Warm | hair | yes | no | no | yes | no | Mammal
Python | Cold | scales | no | no | no | no | yes | Reptile
Salmon | Cold | scales | no | yes | no | no | no | Fish
Whale | Warm | hair | yes | yes | no | no | no | Mammal
Frog | Cold | none | no | semi | no | yes | yes | Amphibian
Komodo | Cold | scales | no | no | no | yes | no | Reptile
Bat | Warm | hair | yes | no | yes | yes | yes | Mammal
Pigeon | Warm | feathers | no | no | yes | yes | no | Bird
Cat | Warm | fur | yes | no | no | yes | no | Mammal
Leopard | Cold | scales | yes | yes | no | no | no | Fish
Turtle | Cold | scales | no | semi | no | yes | no | Reptile
Penguin | Warm | feathers | no | semi | no | yes | no | Bird
Porcupine | Warm | quills | yes | no | no | yes | yes | Mammal
Eel | Cold | scales | no | yes | no | no | no | Fish
Salamander | Cold | none | no | semi | no | yes | yes | Amphibian

What are the class labels of Dragon and Shark?
Decision Tree and Classification Task
Vertebrate Classification
 Suppose, a new species is discovered as follows.
Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class
Gila Monster | cold | scale | no | no | no | yes | yes | ?

 A decision tree that can be induced from the data is as follows.
Building Decision Tree
 Many decision tree can be constructed from a dataset
 Some of the tree may not be optimum
 Some of them may give inaccurate result
 Two approaches are known
 Greedy strategy
 A top-down recursive divide-and-conquer

 Modification of greedy strategy


 ID3
 C4.5
 CART, etc.
Unsupervised decision trees are structurally similar to hierarchical clustering
Build Decision Tree Algorithm
 Algorithm BuildDT
 Input: D : training data set
 Output: T : decision tree
Steps
1. If all tuples in D belong to the same class Cj
Add a leaf node labeled as Cj
Return // Termination condition
2. Select an attribute Ai (so that it is not selected twice in the same branch)
3. Partition D = { D1, D2, …, Dp} based on the p different values of Ai in D
4. For each Dk ϵ D
Create a node and add an edge between D and Dk with label as the Ai’s attribute value in Dk
5. For each Dk ϵ D
BuildDT(Dk) // Recursive call
6. Stop
Node Splitting in BuildDT Algorithm
 The BuildDT algorithm must provide a method for expressing an attribute test condition and the
corresponding outcome for different attribute types

 Case: Binary attribute


 This is the simplest case of node splitting
 The test condition for a binary attribute generates only two outcomes

Node Splitting in BuildDT Algorithm
 Case: Nominal attribute
 Since a nominal attribute can have many values, its test condition can be expressed
in two ways:
 A multi-way split
 A binary split

 Multi-way split: the outcome depends on the number of distinct values of the
corresponding attribute

 Binary splitting by grouping attribute values



Node Splitting in BuildDT Algorithm
 Case: Ordinal attribute
 It also can be expressed in two ways:
 A multi-way split
 A binary split

 Multi-way split: it is the same as in the case of a nominal attribute


 Binary splitting attribute values should be grouped maintaining the order property of the attribute
values

Node Splitting in BuildDT Algorithm
 Case: Numerical attribute
 For numeric attribute (with discrete or continuous values), a test condition can be expressed
as a comparison set

 Binary outcome: A >v or A ≤ v

 In this case, decision tree induction must consider all possible split positions

 Range query : vi ≤ A < vi+1 for i = 1, 2, …, q (if q number of ranges are chosen)

 Here, q should be decided a priori

 For a numeric attribute, decision tree induction is a combinatorial optimization problem

Illustration : BuildDT Algorithm
Attributes:
Gender = {Male (M), Female (F)} // binary attribute
Height = {1.5, …, 2.5} // continuous attribute
Class = {Short (S), Medium (M), Tall (T)}

Person | Gender | Height | Class
1 | F | 1.6 | S
2 | M | 2.0 | M
3 | F | 1.9 | M
4 | F | 1.88 | M
5 | F | 1.7 | S
6 | M | 1.85 | M
7 | F | 1.6 | S
8 | M | 1.7 | S
9 | M | 2.2 | T
10 | M | 2.1 | T
11 | F | 1.8 | M
12 | M | 1.95 | M
13 | F | 1.9 | M
14 | F | 1.8 | M
15 | F | 1.75 | S

Given a person, we are to test in which class s/he belongs.
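For illustration, a sketch that induces such a tree from this table with scikit-learn's DecisionTreeClassifier (entropy criterion), encoding Gender as F=0, M=1; this is one possible tree, not necessarily the one shown on the following slides.

from sklearn.tree import DecisionTreeClassifier, export_text

gender = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0]        # F=0, M=1
height = [1.6, 2.0, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 2.2, 2.1,
          1.8, 1.95, 1.9, 1.8, 1.75]
label = ['S', 'M', 'M', 'M', 'S', 'M', 'S', 'S', 'T', 'T',
         'M', 'M', 'M', 'M', 'S']
X = list(zip(gender, height))

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, label)
print(export_text(tree, feature_names=["Gender", "Height"]))   # the induced splits
print(tree.predict([[1, 1.97]]))                               # classify a new person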
Illustration : BuildDT Algorithm
 To build a decision tree, we can select an attribute in two different orderings:
<Gender, Height> or <Height, Gender>

 Further, for each ordering, we can choose different ways of splitting

 Different instances are shown in the following.

 Approach 1 : <Gender, Height>



Illustration : BuildDT Algorithm

Illustration : BuildDT Algorithm
 Approach 2 : <Height, Gender>

Concept of Entropy

More ordered: less entropy; more organized or ordered (less probable)
Less ordered: higher entropy; less organized or disordered (more probable)
Information Gain

Examples (Gini Index):
A bag with 4 red balls: Gini index (impurity) = 1 – 1 = 0, purity is 100%
A bag with 2 red and 2 blue balls: Gini index = 1 – (1/4 + 1/4) = 0.5
A bag with 3 red and 1 blue ball: Gini index = 1 – (1/16 + 9/16) = 6/16

Examples (Entropy):
A bag with 4 red balls: Entropy = -1 log2(1) = 0
A bag with 2 red and 2 blue balls: Entropy = -1/2 log2(1/2) - 1/2 log2(1/2) = 1
A bag with 3 red and 1 blue ball: Entropy = 0.811
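The same numbers can be reproduced with two small Python functions (a sketch):

from math import log2

def gini(counts):
    """Gini impurity for class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy for class counts: -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

for bag in ([4], [2, 2], [3, 1]):       # the red/blue bags above
    print(bag, "Gini =", round(gini(bag), 3), "Entropy =", round(entropy(bag), 3))
# [4] -> 0.0 / 0.0, [2, 2] -> 0.5 / 1.0, [3, 1] -> 0.375 / 0.811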
ID3: Decision Tree Induction Algorithms

 Iterative Dichotomizer 3 for decision trees,

 In ID3, each node corresponds to a splitting attribute and each arc is a possible value of that
attribute.

 At each node, the splitting attribute is selected to be the most informative


 In ID3, entropy is used to measure how informative a node is.
 It is observed that splitting on any attribute has the property that average entropy of the resulting training subsets
will be less than or equal to that of the previous training set.

 ID3 algorithm defines a measurement of a splitting called Information Gain to determine the goodness of a
split.
 The attribute with the largest value of information gain is chosen as the splitting attribute and
 it partitions into a number of smaller training sets based on the distinct values of attribute under split.

CART Algorithm
 It is observed that information gain measure used in ID3 is biased towards test with many
outcomes, that is, it prefers to select attributes having a large number of values.

 L. Breiman, J. Friedman, R. Olshen and C. Stone in 1984 proposed an algorithm to build a


binary decision tree also called CART decision tree.
 CART stands for Classification and Regression Tree
 In fact, invented independently at the same time as ID3 (1984).
 ID3 and CART are two cornerstone algorithms spawned a flurry of work on decision tree induction.

 CART is a technique that generates a binary decision tree; that is, unlike ID3, in CART
only two children are created for each node.

 ID3 uses information gain as the measure to select the best attribute to split on, whereas CART
does the same but uses another measurement called the Gini index. It is also known as the Gini Index
of Diversity and is denoted as 𝛾.

Gini Index of Diversity
Gini Index

Suppose D is a training set with size |D|, C = {c1, c2, …, ck} is the set of k classes, and
A = {a1, a2, …, am} is any attribute with m different values. Like the entropy measure in ID3, CART
proposes the Gini index (denoted by G) as the measure of impurity of D. It can be defined as follows.

G(D) = 1 − Σ_{i=1..k} pi²

where pi is the probability that a tuple in D belongs to class ci, and pi can be estimated as

pi = |Ci,D| / |D|

where |Ci,D| denotes the number of tuples in D with class ci.
Algorithm C 4.5
 J. Ross Quinlan, a researcher in machine learning, developed a decision tree induction algorithm
in 1984 known as ID3 (Iterative Dichotomiser 3).

 Quinlan later presented C4.5, a successor of ID3, addressing some limitations in ID3.

 ID3 uses information gain measure, which is, in fact biased towards splitting attribute having a
large number of outcomes.

 For example, if an attribute has distinct values for all tuples, then it would result in a large
number of partitions, each one containing just one tuple.
 In such a case, note that each partition is pure, and hence the purity measure of the partition is
EA(D) = 0
Algorithm C4.5 : Introduction
Limitation of ID3

In the following, each tuple belongs to a unique class. The splitting on A is shown.

EA(D) = Σ_{j=1..n} (|Dj| / |D|) · E(Dj) = Σ_{j=1..n} (|Dj| / |D|) · 0 = 0

Thus, α(A, D) = E(D) − EA(D) is maximum in such a situation.


Algorithm: C 4.5 : Introduction
 Although the previous situation is an extreme case, intuitively we can infer
that ID3 favours splitting attributes having a large number of values
 compared to other attributes, which have less variation in their values.

 Such a partition appears to be useless for classification.

 This type of problem is called overfitting problem.

Note:
Decision Tree Induction Algorithm ID3 may suffer from overfitting problem.

Algorithm: C 4.5 : Introduction
 The overfitting problem in ID3 is due to the measurement of information gain.

 In order to reduce the bias due to the use of information gain, C4.5 uses a
different measure called Gain Ratio, denoted as 𝛽.

 Gain Ratio is a kind of normalization of information gain using a split information value.

Notes on Decision Tree Induction algorithms
1. Optimal Decision Tree: Finding an optimal decision tree is an NP-complete
problem. Hence, decision tree induction algorithms employ a heuristic based
approach to search for the best in a large search space. Majority of the algorithms
follow a greedy, top-down recursive divide-and-conquer strategy to build
decision trees.

2. Missing data and noise: Decision tree induction algorithms are quite robust to
the data set with missing values and presence of noise. However, proper data pre-
processing can be followed to nullify these discrepancies.

3. Redundant Attributes: The presence of redundant attributes does not adversely
affect the accuracy of decision trees. It is observed that if an attribute is chosen
for splitting, then another attribute which is redundant is unlikely to be chosen for
splitting.

4. Computational complexity: Decision tree induction algorithms are
computationally inexpensive, in particular when the sizes of training sets are
large. Moreover, once a decision tree is known, classifying a test record is
extremely fast, with a worst-case time complexity of O(d), where d is the
maximum depth of the tree.
Ensemble Methods, Decision Tree
and Random forest

Prof. Gajendra P.S. Raghava


Head, Center for Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
RANDOM FOREST
➢Random forest is a supervised learning algorithm that can
be used for classification as well as regression.
➢It is mainly used for classification problems.
➢The "forest" it builds is an ensemble of decision trees,
usually trained with the “bagging” method.
➢The general idea of the bagging method is that a combination
of learning models increases the overall performance.
Working of Random Forest Algorithm
 Step 1: First, start with the selection of random samples from a
given dataset.
 Step 2: Next, this algorithm will construct a decision tree for
every sample. Then it will get the prediction result from every
decision tree.
 Step 3: In this step, voting will be performed for every predicted
result.
 Step 4: At last, select the most voted prediction result as the
final prediction result.
Random Forest Classifier
Training data: N examples, M features

Draw bootstrap samples from the training data
Construct a decision tree from each bootstrap sample
At each node, choose the split feature only among m < M randomly selected features
Take the majority vote over all trees
Important Points
➢Random forest has nearly the same hyperparameters as a decision tree or
a bagging classifier.
➢Random forest adds additional randomness to the model, while growing
the trees. Instead of searching for the most important feature while splitting
a node, it searches for the best feature among a random subset of
features. This results in a wide diversity that generally results in a better
model.
➢Therefore, in random forest, only a random subset of the features is taken
into consideration by the algorithm for splitting a node. You can even make
trees more random by additionally using random thresholds for each
feature rather than searching for the best possible thresholds.
Important Points
 Random forest is based on decision trees that are created by randomly
splitting the data.
 The collection of generated decision trees is known as the forest.
 Each decision tree is formed using feature selection indicators like
information gain.
 Each tree is grown on an independent sample.
 For a classification problem, each tree votes and the class with the
highest number of votes is chosen.
 For regression, the average of all the trees' outputs is declared as
the result.
 It is one of the most powerful algorithms compared to all others.
FEATURE IMPORTANCE
 It is very easy to measure the relative importance of each feature for the
prediction.
 Sklearn provides a great tool for this that measures a feature's importance
by how much the tree nodes that use that feature reduce impurity, averaged across all trees in the forest.
 By looking at the feature importance you can decide which features to possibly
drop.
 This is important because a general rule in machine learning is that the more
features you have, the more likely your model will suffer from overfitting, and vice
versa.
 Random forests make use of Gini importance or MDI (Mean Decrease in Impurity)
to compute the importance of each attribute.
 The total decrease in node impurity due to a feature is also called its Gini importance.
 Alternatively, importance can be measured by how much accuracy or model fit decreases
when a feature is dropped.
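A sketch of reading these importances from a fitted scikit-learn random forest (iris used only as a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ holds the Gini importance (mean decrease in impurity)
for name, score in sorted(zip(data.feature_names, forest.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print(name, round(score, 3))        # low-scoring features are candidates to drop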
Features and Advantages
The advantages of random forest are:
 One of the most accurate learning algorithms available.
 Runs efficiently on large databases.
 It can handle thousands of input variables.
 Identification of importance of variables in classification
 Highly effective in estimating missing data.
 It has methods for balancing error in class population unbalanced data sets.
 Generated forests can be saved for future use on other data.
 It computes proximities between pairs of cases (clustering, locating outliers)
 The capabilities can be used for unsupervised learning
 It offers an experimental method for detecting variable interactions.

Implementation in Python
 import numpy as np
 import matplotlib.pyplot as plt
 import pandas as pd
 from sklearn.model_selection import train_test_split
 from sklearn.ensemble import RandomForestClassifier
 # X: feature matrix, y: class labels (prepared beforehand, e.g. loaded with pandas)
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)
 classifier = RandomForestClassifier(n_estimators = 50)
 classifier.fit(X_train, y_train)
 y_pred = classifier.predict(X_test)
