Discovery
Molecules or Objects
Proteins & Peptides: • Structure prediction • Subcellular localization • Therapeutic application • Ligand binding
Gene Expression: • Disease biomarkers • Drug biomarkers • mRNA expression • Copy number variation
Chemoinformatics: • Drug design • Chemical descriptors • QSAR models • Personalized inhibitors
Image Annotation: • Image classification • Medical images • Disease classification • Disease diagnostics
Concept of Drug and Vaccine
Concept of Drug
Kill invading foreign pathogens
Inhibit the growth of pathogens
Concept of Vaccine
Generate memory cells
Train the immune system to face various existing disease agents
History of Drug/Vaccine development
Preclinical testing (1-3 years)
Human clinical trials (2-10 years)
Formulation
FDA approval (2-3 years)
Technology is impacting this process:
GENOMICS, PROTEOMICS & BIOPHARM.: potentially producing many more targets and “personalized” targets
COMBINATORIAL CHEMISTRY (find drug): rapidly producing vast numbers of compounds
MOLECULAR MODELING: computer graphics & models help improve activity
IN VITRO & IN SILICO ADME MODELS (preclinical testing): tissue and computer models begin to replace animal testing
Computer Aided Drug Design Techniques
- Physicochemical Properties Calculations
QSARs are mathematical relationships linking chemical structure with biological
activity, using physicochemical or other derived properties as the interface.
Mathematical methods used in QSAR include various regression and pattern recognition
techniques.
The physicochemical or other properties used for generating QSARs are termed descriptors
and are treated as independent variables.
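As a minimal sketch of how such a model is fit in practice, assuming a table of precomputed descriptors and measured activities (the descriptor values and activities below are invented purely for illustration):

# Minimal QSAR sketch: a linear model linking descriptors (independent
# variables) to biological activity. All numbers are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows = compounds; columns = descriptors (e.g. logP, mol. weight, TPSA)
X = np.array([[1.2, 180.2, 40.5],
              [2.8, 250.3, 60.1],
              [0.5, 150.1, 30.2],
              [3.1, 310.4, 75.3]])
y = np.array([5.1, 6.8, 4.2, 7.3])   # e.g. pIC50 activity values

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # contribution of each descriptor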
QSAR
Selection of Descriptors
Descriptor classes:
1. Structural descriptors
2. Electronic descriptors
3. Quantum mechanical descriptors
4. Thermodynamic descriptors
5. Shape descriptors
6. Spatial descriptors
7. Conformational descriptors
8. Receptor descriptors
Guiding questions:
1. What is relevant to the therapeutic target?
2. What variation is relevant to the compound series?
3. What property data can be readily measured?
4. What can be readily calculated?
Singla et al. (2013) Open source software and web services for
designing therapeutic molecules. Curr Top Med Chem. 13(10):1172-91.
Source: http://www.moleculardescriptors.eu/tutorials/T2_moleculardescriptors_chemom.pdf
Different File Formats
SDF file (Structure-Data File): saved in plain text and contains chemical structure records;
used as a standard exchange format for chemical information.
SMILES (Simplified Molecular Input Line Entry System) is a chemical notation that allows
a user to represent a chemical structure in a way that can be used by the computer.
MDL mol format: An MDL Molfile is a file format for holding information about the atoms,
bonds, connectivity and coordinates of a molecule.
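A short sketch of reading these formats with RDKit, an open-source cheminformatics toolkit (the file names below are placeholders, not files from this course):

# Reading the three formats with RDKit; file names are placeholders.
from rdkit import Chem

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')   # SMILES (aspirin)
mol2 = Chem.MolFromMolFile('compound.mol')          # MDL Molfile
for m in Chem.SDMolSupplier('library.sdf'):         # SDF: many records per file
    if m is not None:
        print(Chem.MolToSmiles(m))                  # canonical SMILES of each record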
[Figure: integrated clinical information and metabolomics data feed the treatment decision, leading to personalized medicine]
Pharmacogenomics(PGx)
◆ Pharmacogenomics(PGx) – the study of variations of DNA and RNA
characteristics as related to drug response.
PGx explains inter-individual differences in drug metabolism, pharmacodynamics, and toxicity, driven by variants such as SNPs.
Feature Engineering & Case Studies
➢ Curse of dimensionality!
Definition of Feature Selection
Classification/Regression (Supervised Learning):
Training data $L = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_m, y_m)\} \subseteq X \times Y$,
where each sample $x = (x_1, \ldots, x_n)^T$ is described by the feature set $F = \{f_1, \ldots, f_i, \ldots, f_n\}$.
Feature selection: select a subset $F' \subseteq F$.
Feature Extraction/Creation: map the original features $F$ to a new feature set $F'$.
$$R(f_i, y) = \frac{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)^2 \; \sum_{k=1}^{m} (y_k - \bar{y})^2}}$$
The higher the correlation between the feature and the target, the higher the score!
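A small sketch of this filter criterion in NumPy (the feature matrix and labels are made up for illustration):

# Rank features by absolute Pearson correlation with the target.
import numpy as np

def correlation_scores(X, y):
    """Pearson correlation of each feature (column of X) with the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

X = np.random.rand(50, 10)              # 50 samples, 10 features
y = X[:, 3] + 0.1 * np.random.rand(50)  # target mostly driven by feature 3
ranking = np.argsort(-np.abs(correlation_scores(X, y)))
print(ranking)                          # feature 3 should rank first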
Filter Methods: Classification
Methods:
1. Difference in mean for positive and negative samples
2. Ranking based on MCC
3. Ranking based on Accuracy
4. Ranking of features
Feature Subset Selection
Wrapper Methods
• The problem of finding the optimal subset is NP-hard!
• E.g. WINNOW-algorithm
(linear unit with multiplicative updates)
Important points 1/2
• Feature selection can significantly increase the
performance of a learning algorithm (both
accuracy and computation time) – but it is not
easy!
[Diagram: in the wrapper approach, multiple feature subsets are generated from all features and each is evaluated by the predictor; in the embedded approach, the method selects the feature subset while training the predictor itself]
Filters
Methods:
Criterion: Measure feature/feature subset
“relevance”
Search: Usually order features (individual feature
ranking or nested subsets of features)
Assessment: Use statistical tests
Results:
Are (relatively) robust against overfitting
May fail to select the most “useful” features
Wrappers
Methods:
Criterion: Measure feature subset “usefulness”
Search: Search the space of all feature subsets
Assessment: Use cross-validation
Results:
Can in principle find the most “useful” features,
but
Are prone to overfitting
Embedded Methods
Methods:
Criterion: Measure feature subset “usefulness”
Search: Search guided by the learning process
Assessment: Use cross-validation
Results:
Similar to wrappers, but
Less computationally expensive
Less prone to overfitting
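To make the three families concrete, a hedged scikit-learn sketch on synthetic data (all parameter choices are arbitrary):

# Filter, wrapper, and embedded selection on the same synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5)

# Filter: rank features by a univariate statistic, independent of any model
filt = SelectKBest(f_classif, k=5).fit(X, y)

# Wrapper: repeatedly refit a model, dropping the weakest feature each round
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: selection happens inside training (impurity-based importances)
emb = RandomForestClassifier(n_estimators=100).fit(X, y)

print(np.where(filt.get_support())[0])            # filter choice
print(np.where(wrap.get_support())[0])            # wrapper choice
print(np.argsort(emb.feature_importances_)[-5:])  # embedded top-5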
Three “Ingredients”
Criterion: single feature relevance, relevance in context, feature subset relevance, performance of the learning machine
Search: single feature ranking; nested subsets (forward selection / backward elimination); heuristic or stochastic search; exhaustive search
Assessment: cross-validation, performance bounds, statistical tests
Feature selection examples
Garg A, Tewari R, Raghava GP. KiDoQ: using docking based energy scores to develop ligand based
model for predicting antibacterials. BMC Bioinformatics. 2010 Mar 11;11:125.
23 inhibitors against the DHDPS enzyme
11 energy-based descriptors obtained from docking using AutoDock
F-stepping remove-one approach
Singh H, Singh S, Singla D, Agarwal SM, Raghava GP. QSAR based model for discriminating EGFR
inhibitors and non-inhibitors using Random forest. Biol. Direct. 2015 Mar 25;10:10.
EGFR inhibitors: 508 inhibitors and 2997 non-inhibitors
881 PubChem fingerprints
Frequency-based feature selection technique: difference of fingerprint frequency between
inhibitors and non-inhibitors
Feature selection examples
Chauhan JS, Dhanda SK, Singla D; Open Source Drug Discovery Consortium, Agarwal SM, Raghava
GP. QSAR-based models for designing quinazoline/imidazothiazoles/pyrazolopyrimidines based
inhibitors against wild and mutant EGFR. PLoS One. 2014 Jul 3;9(7)
Selection of descriptors having high correlation with IC50
Removal of descriptors with low variance
Removal of highly correlated descriptors
Removal of useless descriptors containing mostly zeros
Dhanda SK, Singla D, Mondal AK, Raghava GP. DrugMint: a webserver for predicting and designing
of drug-like molecules. Biol Direct. 2013 Nov 5;8:28. doi: 10.1186/1745-6150-8-28.
Weka software
Remove Useless (rm-useless): drops attributes whose values either vary too much or show negligible variation
CfsSubsetEval module of Weka: selects features that have high correlation with the class/activity and very
low inter-correlation among themselves
Feature selection examples
Bhalla S, Chaudhary K, Kumar R, Sehgal M, Kaur H, Sharma S, Raghava GP. Gene expression-based
biomarkers for discriminating early and late stage of clear cell renal cancer. Sci Rep. 2017 Mar
28;7:44997.
523 samples used to discriminate early and late stage ccRCC
Total descriptors: expression of 20,538 genes per sample
Threshold-based approach for ranking genes (over- or under-expressed)
Removal of highly correlated genes (correlation > 0.60)
Average of the outputs of models based on the best genes/descriptors
Weka Fast Correlation-Based Filter (FCBF) selection: utilizes predominant
correlation to identify relevant features in high-dimensional datasets in a reduced
feature space
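A sketch of the correlated-descriptor removal step used in several of these studies (the threshold and data are illustrative, not taken from the papers):

# Drop one of every pair of features whose correlation exceeds a threshold.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 5)), columns=list("abcde"))
df["f"] = df["a"] * 0.95 + rng.random(100) * 0.05   # nearly duplicates "a"

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.60).any()]
print(to_drop)                     # "f" (and any other highly correlated columns)
reduced = df.drop(columns=to_drop)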
K-NEAREST NEIGHBOR METHOD (KNN)
Weight to Instance
Not all instances (examples) are equally reliable
Weight an instance based on its success in prediction
Distance Metrics
Standardization
Transform raw feature values into z-scores
$$z_{ij} = \frac{x_{ij} - m_j}{s_j}$$
where $x_{ij}$ is the value of the $j$th feature for the $i$th sample, $m_j$ is the mean of feature $j$ over all input samples, and $s_j$ is the corresponding standard deviation.
The range and scale of the z-scores should then be similar across features (provided the
distributions of raw feature values are alike).
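In scikit-learn this transformation is a single call (synthetic data for illustration):

# z-score standardization: each column gets mean 0 and standard deviation 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 55.0]])
Z = StandardScaler().fit_transform(X)
print(Z.mean(axis=0))  # ~[0, 0]
print(Z.std(axis=0))   # ~[1, 1]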
Instance Based Reasoning
Type of IBR
• IB1 is based on the standard KNN algorithm.
• IB2 is an incremental KNN learner that only incorporates misclassified
instances into the classifier.
• IB3 discards instances that do not perform well, by keeping success
records.
Python Code
https://www.youtube.com/watch?v=6kZ-OPLNcgE
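In the spirit of the linked tutorial, a minimal KNN sketch with scikit-learn (the iris data here is a stand-in for any numeric dataset):

# Minimal k-nearest-neighbor classifier; iris is a stand-in dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 neighbors, majority vote
knn.fit(X_train, y_train)                   # "training" just stores the data
print(knn.score(X_test, y_test))            # test-set accuracy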
K-Nearest Neighbor
More about KNN
1. Prepare: numeric values
2. Method: similarity/distance metric
3. Train: does not apply to the kNN algorithm
4. Compute the distance/similarity of the instance to the stored examples
5. Identify the neighbors of the instance
6. Vote for classification; average for regression
If the number of input features is 2, then the hyperplane is just a line. If the
number of input features is 3, then the hyperplane becomes a two-
dimensional plane.
Linear Separators
Which of the linear separators is optimal?
Perceptron Revisited: Linear Separators
Binary classification can be viewed as the task of
separating classes in feature space:
wTx + b = 0
wTx + b > 0
wTx + b < 0
f(x) = sign(wTx + b)
Classification Margin
The distance from an example x to the separator is r = y(wTx + b) / ||w||.
Examples closest to the hyperplane are support vectors.
The margin ρ of the separator is the width of separation between the classes.
Maximum Margin Classification
Maximizing the margin is good according to intuition and PAC
theory.
Implies that only support vectors are important; other training
examples are ignorable.
Soft Margin Classification
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification of difficult or
noisy examples.
Linear SVMs: Overview
The classifier is a separating hyperplane.
Most “important” training points are support vectors; they
define the hyperplane.
Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrangian
multipliers αi.
Both in the dual formulation of the problem and in the solution,
training points appear only inside inner products xiTxj.
Non-linear SVMs
Datasets that are linearly separable with some noise work out great.
How about mapping data to a higher-dimensional space?
[Figure: 1-D points on the x axis that are not linearly separable become separable after mapping x to (x, x²)]
Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
The “Kernel Trick”
The linear classifier relies on inner product between vectors K(xi,xj)=xiTxj
If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes: K(xi,xj)= φ(xi) Tφ(xj)
A kernel function is some function that corresponds to an inner product in some
expanded feature space.
Examples of Kernel Functions
Linear: K(xi,xj) = xiTxj
Polynomial of power p: K(xi,xj) = (1 + xiTxj)^p
Gaussian (radial-basis function network): K(xi,xj) = exp(−‖xi − xj‖² / (2σ²))
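A hedged sketch showing the three kernels in scikit-learn (the moons data is a stand-in for a non-linearly-separable set; parameter values are arbitrary):

# The same SVM with linear, polynomial, and RBF (Gaussian) kernels.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=3, gamma="scale", C=1.0).fit(X, y)
    print(kernel, clf.score(X, y))   # RBF should fit the moons best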
HARD MARGIN
If the training data is linearly separable, we can select two parallel hyperplanes
that separate the two classes of data.
Distance between them is as large as possible.
SOFT MARGIN
As most of the real-world data are not fully linearly separable, we will allow some
margin violation to occur, which is called soft margin classification.
It is better to have a large margin, even though some constraints are violated.
Allow some data points to lie either on the incorrect side of the hyperplane,
or between the margin and the hyperplane on the correct side.
Cost Function and Gradient Updates
➔ In the SVM algorithm, we are looking to maximize the margin between the data
points and the hyperplane. The loss function that helps maximize the margin is
hinge loss.
➔ The function of the first term, hinge loss, is to penalize misclassifications. It
measures the error due to misclassification (or data points being closer to the
classification boundary than the margin). The second term is the regularization
term, which is a technique to avoid overfitting by penalizing large coefficients in
the solution vector.
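Written out, the objective described above is (a standard soft-margin formulation; λ is the regularization strength):

$$\min_{w,b}\;\frac{1}{m}\sum_{i=1}^{m}\max\bigl(0,\;1 - y_i\,(w^T x_i + b)\bigr) \;+\; \lambda\,\|w\|^2$$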
Ensemble Methods, Decision Tree
and Random forest
Idea
Combine the classifiers to improve the performance
Ensembles of Classifiers
Combine the classification results from different
classifiers to produce the final output
Unweighted voting
Weighted voting
Build Ensemble Classifiers
• Basic idea:
Build different “experts” and let them vote
• Advantages:
Improved predictive performance
Other types of classifiers can be directly included
Easy to implement
Not much parameter tuning
• Disadvantages:
The combined classifier is not very transparent (black box)
Not a compact representation
Ensemble-based methods
Classifier combination/aggregation is a general concept
Creation of a large set of classifiers, here called the “base learners”
Building a large number of base classifiers
These base classifiers should be diverse
Aim: maximize accuracy through combination
These methods will be:
Bagging,
Boosting,
AdaBoost,
Random Forests
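Before those, a minimal sketch of unweighted voting over diverse base learners (synthetic data; the estimator choices are arbitrary):

# Unweighted majority voting over three diverse base learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier()),
], voting="hard")                      # hard = unweighted majority vote
print(ensemble.fit(X, y).score(X, y))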
Outline
Bias/Variance Tradeoff
Bagging: complex model class (deep decision trees); bootstrap aggregation (resampling the training data); does not work for simple models
Random Forests: complex model class (deep decision trees); bootstrap aggregation plus bootstrapping of features; only for decision trees
Outlook
Basic Concept
In the last example, we considered a decision tree where the values of every
attribute are binary only. Decision trees are also possible where attributes
have continuous data types.
Decision Tree with numeric data
Some Characteristics
Decision tree may be n-ary, n ≥ 2.
There is a special node called root node.
All nodes drawn with circle (ellipse) are called internal nodes.
All nodes drawn with rectangle boxes are called terminal nodes or leaf
nodes.
Edges of a node represent the outcome for a value of the node.
In a path, a node with the same label is never repeated.
A decision tree is not unique, as different orderings of internal nodes can give
different decision trees.
Decision Tree and Classification Task
Decision tree helps us to classify data.
Internal nodes test some attribute
Edges represent the values of that attribute
Decision Tree and Classification Task
Vertebrate Classification
[Table: vertebrate training data with columns Name, Body Temperature, Skin Cover, Gives Birth, Aquatic Creature, Aerial Creature, Has Legs, Hibernates, Class]
What are the class labels of Dragon and Shark?
Suppose a new species is discovered as follows.
[Table: the new species described by the same attribute columns, with Class unknown]
Building Decision Tree
Many decision trees can be constructed from a dataset
Some of the trees may not be optimal
Some of them may give inaccurate results
Two approaches are known:
Greedy strategy
A top-down recursive divide-and-conquer approach
Node Splitting in BuildDT Algorithm
Case: Nominal attribute
Since a nominal attribute can have many values, its test condition can be expressed
in two ways:
A multi-way split
A binary split
Multi-way split: the outcome depends on the number of distinct values of the
corresponding attribute
Node Splitting in BuildDT Algorithm
Case: Numerical attribute
For a numeric attribute (with discrete or continuous values), a test condition can be expressed
as a comparison
In this case, decision tree induction must consider all possible split positions
Range query: vi ≤ A < vi+1 for i = 1, 2, …, q (if q ranges are chosen)
Illustration : BuildDT Algorithm
Attributes:
Gender = {Male (M), Female (F)}   // binary attribute
Height = [1.5, …, 2.5]            // continuous attribute
Class = {Short (S), Medium (M), Tall (T)}
Given a person, we are to test in which class s/he belongs.

Person  Gender  Height  Class
1       F       1.6     S
2       M       2.0     M
3       F       1.9     M
4       F       1.88    M
5       F       1.7     S
6       M       1.85    M
7       F       1.6     S
8       M       1.7     S
9       M       2.2     T
10      M       2.1     T
11      F       1.8     M
12      M       1.95    M
13      F       1.9     M
14      F       1.8     M
15      F       1.75    S
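As a hedged illustration, the table above can be fit with scikit-learn (gender encoded as F=0, M=1; note sklearn grows binary trees with Gini impurity by default, so the result may differ from a hand-built multi-way tree):

# Decision tree on the Gender/Height table above; F=0, M=1.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 1.6], [1, 2.0], [0, 1.9], [0, 1.88], [0, 1.7],
     [1, 1.85], [0, 1.6], [1, 1.7], [1, 2.2], [1, 2.1],
     [0, 1.8], [1, 1.95], [0, 1.9], [0, 1.8], [0, 1.75]]
y = ["S", "M", "M", "M", "S", "M", "S", "S", "T", "T",
     "M", "M", "M", "M", "S"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["Gender", "Height"]))
print(tree.predict([[1, 1.9]]))   # classify a new person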
Illustration : BuildDT Algorithm
To build a decision tree, we can select the attributes in two different orderings:
<Gender, Height> or <Height, Gender>
Illustration : BuildDT Algorithm
Approach 2 : <Height, Gender>
Concept of Entropy
Information Gain
A bag with 4 red balls:
Gini index (impurity) = 1 − 1 = 0; Entropy = −1·log2(1) = 0. Purity is 100%.
A bag with 2 red and 2 blue balls:
Gini index = 1 − (1/4 + 1/4) = 0.5; Entropy = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1.
A bag with 3 red and 1 blue ball:
Gini index = 1 − (9/16 + 1/16) = 6/16 ≈ 0.375; Entropy = 0.811.
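The same numbers can be reproduced with a few lines of Python:

# Gini impurity and entropy for a bag of colored balls.
from collections import Counter
from math import log2

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(gini("rrrr"), entropy("rrrr"))   # 0.0 and 0.0 (pure bag; may print -0.0)
print(gini("rrbb"), entropy("rrbb"))   # 0.5 and 1.0
print(gini("rrrb"), entropy("rrrb"))   # 0.375 and 0.811...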
ID3: Decision Tree Induction Algorithms
In ID3, each node corresponds to a splitting attribute and each arc is a possible value of that
attribute.
The ID3 algorithm defines a measure of split quality called Information Gain to determine the goodness of a
split.
The attribute with the largest information gain is chosen as the splitting attribute, and
the training set is partitioned into a number of smaller sets based on the distinct values of the attribute under split.
CART Algorithm
It is observed that the information gain measure used in ID3 is biased towards tests with many
outcomes; that is, it prefers to select attributes having a large number of values.
CART is a technique that generates a binary decision tree; that is, unlike ID3, in CART
only two children are created for each node.
ID3 uses information gain as the measure to select the best attribute to split on, whereas CART
does the same using another measure called the Gini index. It is also known as the Gini Index
of Diversity and is denoted as 𝛾.
Gini Index of Diversity
Gini Index
Suppose D is a training set of size |D|, $C = \{c_1, c_2, \ldots, c_k\}$ is the set of k classes, and
$A = \{a_1, a_2, \ldots, a_m\}$ is any attribute with m distinct values. Like the entropy measure in ID3, CART
proposes the Gini index (denoted by G) as the measure of impurity of D. It can be defined as follows.

$$G(D) = 1 - \sum_{i=1}^{k} p_i^2$$

where $p_i$ is the probability that a tuple in D belongs to class $c_i$, and $p_i$ can be estimated as

$$p_i = \frac{|C_i, D|}{|D|}$$

where $|C_i, D|$ denotes the number of tuples in D with class $c_i$.
Algorithm C4.5
J. Ross Quinlan, a researcher in machine learning, developed a decision tree induction algorithm
in the 1980s known as ID3 (Iterative Dichotomiser 3).
Quinlan later presented C4.5, a successor of ID3, addressing some limitations of ID3.
ID3 uses the information gain measure, which is in fact biased towards splitting attributes having a
large number of outcomes.
For example, if an attribute has distinct values for all tuples, then it would result in a large
number of partitions, each one containing just one tuple.
In such a case, note that each partition is pure, and hence the purity measure of the partition is
$E_A(D) = 0$.
Algorithm C4.5 : Introduction
Limitation of ID3
In the following, each tuple belongs to a unique class. Splitting on A gives

$$E_A(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|}\, E(D_j) = \sum_{j=1}^{n} \frac{1}{|D|}\cdot 0 = 0$$

Note:
The decision tree induction algorithm ID3 may suffer from the overfitting problem.
Algorithm C4.5: Introduction
The overfitting problem in ID3 is due to the information gain measure.
To reduce this bias, C4.5 uses a different measure called Gain Ratio, denoted as 𝛽.
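For reference, the standard definition of the gain ratio (notation consistent with the formulas above):

$$\beta(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}_A(D)}, \qquad \text{SplitInfo}_A(D) = -\sum_{j=1}^{n} \frac{|D_j|}{|D|}\,\log_2\frac{|D_j|}{|D|}$$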
2. Missing data and noise: Decision tree induction algorithms are quite robust to
data sets with missing values and the presence of noise. However, proper data pre-
processing can be followed to nullify these discrepancies.
Random Forest Classifier
[Diagram: a dataset of N examples with M features, resampled into many bootstrap sets]
1. Draw bootstrap samples of the N examples.
2. Construct a decision tree from each bootstrap sample.
3. At each node, choose the split feature among only m < M randomly selected features.
4. Repeat to grow a large collection of trees.
5. Take the majority vote of the trees to classify a new example.
Important Points
➢Random forest has nearly the same hyperparameters as a decision tree or
a bagging classifier.
➢Random forest adds additional randomness to the model while growing
the trees. Instead of searching for the most important feature when splitting
a node, it searches for the best feature among a random subset of
features. This produces wide diversity, which generally yields a better
model.
➢Therefore, in random forest, only a random subset of the features is taken
into consideration by the algorithm for splitting a node. You can even make
trees more random by additionally using random thresholds for each
feature rather than searching for the best possible thresholds.
Important Points
Random forest is based on decision trees that are created by randomly
splitting the data.
The collection of generated decision trees is known as the forest.
Each decision tree is formed using feature selection indicators such as
information gain.
Each tree is built on an independent (bootstrap) sample.
For a classification problem, each tree votes and the class with the highest
number of votes is chosen.
For regression, the average of all the trees' outputs is taken as the result.
It is among the most powerful of the commonly used algorithms.
FEATURE IMPORTANCE
It is very easy to measure the relative importance of each feature for the
prediction.
Sklearn provides a great tool for this: it measures a feature's importance by how
much the tree nodes that use that feature reduce impurity, across all trees in the forest.
By looking at the feature importance you can decide which features to possibly
drop.
This is important because a general rule in machine learning is that the more
features you have, the more likely your model will suffer from overfitting, and vice
versa.
Random forests make use of Gini importance, or MDI (Mean Decrease in Impurity),
to compute the importance of each attribute.
The total decrease in node impurity contributed by a feature is called its Gini importance.
(A related measure, mean decrease in accuracy, tracks how much model accuracy
drops when a feature is removed or permuted.)
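A short sketch of reading these importances from a fitted forest (synthetic data; parameters are arbitrary):

# Impurity-based (Gini / MDI) feature importances from a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")   # candidates to drop score near 0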
Features and Advantages
The advantages of random forest are:
One of the most accurate learning algorithms available.
Runs efficiently on large databases.
It can handle thousands of input variables.
Identifies the importance of variables in classification.
Highly effective in estimating missing data.
It has methods for balancing error in class-imbalanced data sets.
Generated forests can be saved for future use on other data.
It computes proximities between pairs of cases (useful for clustering and locating outliers).
These capabilities can be used for unsupervised learning.
It offers an experimental method for detecting variable interactions.
Implementation in Python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# The original slide assumes X and y already exist; breast cancer data is
# used here only as a stand-in dataset so the snippet runs end to end.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)        # illustrative data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

classifier = RandomForestClassifier(n_estimators=50)   # 50 trees in the forest
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))             # fraction correct on test set