
Molecular Descriptors and Virtual Screening Using a Datamining Approach

Abhik Seal
OSDD Cheminformatics
Aim of Cheminformatics Project
• Screen molecules that interact with potential TB targets using classifiers.
• Dock the selected molecules with the targets to screen them further for leads.
• Use cheminformatics techniques such as QSAR, 3D-QSAR and ADMET prediction to look for potential leads, and design drugs from those leads by building combinatorial libraries.
Tuberculosis
Obstacles For Drug Design
• The HIV epidemic has dramatically increased the risk of developing active TB.
• Increasing emergence of multi-drug-resistant TB (MDR-TB).
• Emergence of extensively drug-resistant (XDR) TB strains.
• XDR-TB is characterized by resistance to at least the two first-line drugs rifampicin and isoniazid, and additionally to a fluoroquinolone and an injectable drug such as kanamycin.
• Existing TB drugs are only able to target actively growing bacteria, through the inhibition of cell processes such as cell wall biogenesis and DNA replication.
• TB chemotherapy is characterized by efficient bactericidal activity but extremely weak sterilizing activity, i.e. an inability to kill slowly growing and slowly metabolizing bacteria.
Drugs Currently in Development

Expected timelines towards approval of candidate drugs currently in the clinical stage of development
(Sources: Global TB Alliance Annual Report 2004-2005; Stop TB Partnership Working Group on New Drugs for TB, Strategic Plan 2006-2015)
Commonly Used TB drugs and Targets
Main Properties of Anti TB drugs
QSAR and Drug Design
Compounds + biological activity → QSAR → New compounds with improved biological activity
What is QSAR?
QSAR is a mathematical relationship between a biological
activity of a molecular system and its geometric and chemical
characteristics.
A general formula for a quantitative structure-activity relationship
(QSAR) can be given by the following:
activity = f (molecular or fragmental properties)
o QSAR attempts to find a consistent relationship between biological activity and molecular properties, so that these “rules” can be used to evaluate the activity of new compounds.
Molecule Properties
SPC : Structure Property Correlation
The molecular structure determines three groups of properties:
CHEMICAL PROPERTIES: pKa, Log P, solubility, stability
INTRINSIC PROPERTIES: molar volume, connectivity indices, charge distribution, molecular weight, polar surface area, ...
BIOLOGICAL PROPERTIES: activity, toxicity, biotransformation, pharmacokinetics, ...
Molecule Descriptors
o Molecular descriptors are numerical values that
characterize properties of molecules.

o The descriptors fall into four classes:
a) Topological
b) Geometrical
c) Electronic
d) Hybrid or 3D Descriptors
Classification of Descriptors
Topological Descriptors
Topological descriptors are derived directly from the connection table
representation of the structure which include:
a) atom and bond counts
b) substructure counts
c) molecular connectivity indices (Wiener index, Randic index, Chi index)
d) kappa indices
e) path descriptors
f) distance-sum connectivity
g) molecular symmetry
Geometrical Descriptors
Geometrical descriptors are derived from the three-dimensional
representations and include:
a) principal moments of inertia
b) molecular volume
c) solvent-accessible surface area
d) charged partial surface area
e) molecular surface area
Electronic Descriptors
Electronic descriptors characterize molecular structures with quantities such as:
a) dipole moment
b) quadrupole moment
c) polarizability
d) HOMO and LUMO energies
e) dielectric energy
f) molar refractivity
Hybrid and 3D Descriptors
a) geometric atom pairs and topological
torsions
b) spatial autocorrelation vectors
c) WHIM indices
d) BCUTs
e) GETAWAY descriptors
f) Topomers
g) pharmacophore fingerprints
h) EVA descriptors
i) molecular field descriptors
Limit Of Descriptors
• The data set should contain at least 5 times as many compounds as descriptors in the QSAR.
• The reason for this is that too few compounds relative to the number of descriptors will give a falsely high correlation:
  • 2 points exactly determine a line.
  • 3 points exactly determine a plane (etc.).
• A data set of drug candidates that is similar in size to the number of descriptors therefore gives a meaningless correlation.
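The "2 points exactly determine a line" case can be reproduced in a few lines. This is a minimal sketch, assuming a single descriptor and purely random (made-up) descriptor and activity values; the function names are illustrative and not part of any QSAR package.

```python
import random


def r_squared_one_descriptor(x, y):
    """R^2 of an ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def random_r_squared(n_compounds, seed=0):
    """Fit a line to random 'descriptor' and 'activity' values (pure noise)."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n_compounds)]
    y = [rng.random() for _ in range(n_compounds)]
    return r_squared_one_descriptor(x, y)


if __name__ == "__main__":
    print("2 compounds:  ", random_r_squared(2))
    print("200 compounds:", random_r_squared(200))
```

With 2 compounds the fit is perfect even though the data are noise; with 200 compounds the same procedure should give an R² close to zero, which is why the compound-to-descriptor ratio matters.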


Freely Available Tools to Calculate Molecular Descriptors
• CDK tool
http://rguha.net/code/java/cdkdesc.html
• PowerMV
http://nisla05.niss.org/PowerMV/?q=PowerMV/
• MOLD2
http://www.fda.gov/ScienceResearch/BioinformaticsT
• PaDEL-Descriptor
http://www.downv.com/Windows/install-PaDEL- De
Admet Descriptors to Screen Molecules
Bioavailability
The bioavailability of a compound depends on a hierarchy of factors:

Bioavailability: absorption, liver metabolism
Absorption: permeability, gut-wall metabolism, transporters
Underlying molecular properties: lipophilicity, solubility, flexibility, hydrogen bonding, molecular size/shape

PREDICTION OF
ADMET PROPERTIES
 Requirements for a drug:
◦ Must bind tightly to the biological target in vivo
◦ Must pass through one or more physiological barriers
(cell membrane or blood-brain barrier)
◦ Must remain long enough to take effect
◦ Must be removed from the body by metabolism,
excretion, or other means
• ADMET: Absorption, Distribution, Metabolism, Excretion (Elimination), Toxicity
Lipinski Rule of Five (Oral Drug Properties)

• Poor absorption or permeation is more likely when:
◦ MW > 500
◦ LogP > 5
◦ More than 5 H-bond donors (sum of OH and NH groups)
◦ More than 10 H-bond acceptors (sum of N and O atoms)
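The four conditions can be checked mechanically once the properties are known. This is a minimal sketch, assuming the property values have already been computed by one of the descriptor tools; the function name and the example values are hypothetical.

```python
def lipinski_violations(mol_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Return the list of Rule of Five conditions that a molecule violates."""
    violations = []
    if mol_weight > 500:
        violations.append("MW > 500")
    if log_p > 5:
        violations.append("LogP > 5")
    if h_bond_donors > 5:
        violations.append("more than 5 H-bond donors")
    if h_bond_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


if __name__ == "__main__":
    # Made-up property values for two hypothetical molecules.
    print(lipinski_violations(300.0, 2.5, 2, 4))
    print(lipinski_violations(650.0, 6.2, 2, 4))
```

Returning the list of violated conditions, rather than a single yes/no flag, leaves the cut-off (how many violations are tolerated) to the screening protocol.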
Polar Surface Area
Defined as the amount of molecular (van der Waals) surface arising from polar atoms (nitrogen and oxygen atoms together with their attached hydrogens).

PSA seems to optimally encode those drug properties which play an important role in membrane penetration: molecular polarity, H-bonding features and also solubility.

It provides excellent correlations with transport properties of drugs (PSA is used in the prediction of oral absorption, brain penetration, intestinal absorption and Caco-2 permeability).

It has also been effectively used to characterize drug-likeness during virtual screening and combinatorial library design.

The calculation of PSA, however, is rather time-consuming because of the necessity to generate a reasonable 3D molecular geometry and to calculate the surface itself.

Peter Ertl introduced an extremely rapid method to obtain the PSA descriptor simply from the sum of contributions of polar fragments in a molecule, without the necessity to generate its three-dimensional (3D) geometry.
PSA in Intestinal Absorption
• Intestinal absorption is usually expressed as fraction absorbed (FA), the percentage of the initial dose appearing in the portal vein.
• A PSA model was built for the β-adrenoreceptor antagonists [1]. An excellent sigmoidal relationship between PSA and FA after oral administration was obtained. Similar sigmoidal relationships can also be obtained for the topological PSA (TPSA).
• These results suggest that drugs with a PSA < 60 Å² are completely (more than 90%) absorbed, whereas drugs with a PSA > 140 Å² are absorbed to less than 10%. This conclusion was later confirmed by the correct classification of a set of endothelin receptor antagonists as having either low, intermediate or high permeability.
• PSA was also shown to play an important role in explaining human in vivo jejunum permeability [2]. A model based on PSA and LogP for the prediction of drug absorption was developed for 199 well-absorbed and 35 poorly absorbed compounds [3].
PSA in Blood-Brain Barrier (BBB) Penetration
• Drugs that act on the CNS need to cross the BBB in order to reach their target, while minimal BBB penetration is required for other drugs to prevent CNS side effects.
• A common measure of BBB penetration is the ratio of drug concentrations in the brain and the blood, expressed as log(C_brain / C_blood), abbreviated LogBB.
• Van de Waterbeemd and Kansy were probably the first to correlate the PSA of a series of CNS drugs with their membrane transport. They obtained a fair correlation of brain uptake with single-conformer PSA and molecular volume descriptors.
• Clark et al. derived a model for 55 compounds using TPSA and LogP.
TPSA alone:
LogBB = 0.516 - 0.115 * TPSA
n = 55, r² = 0.686, r = 0.828, σ = 0.42
TPSA in combination with ClogP:
LogBB = 0.070 - 0.014 * TPSA + 0.169 * ClogP
n = 55, r² = 0.787, r = 0.887, σ = 0.35
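The two-descriptor equation quoted above is easy to apply once TPSA and ClogP are available. This is a small sketch that uses the coefficients exactly as given on the slide; the function names and the input values in the example are hypothetical.

```python
def log_bb_tpsa_clogp(tpsa, clogp):
    """LogBB from the two-descriptor equation: 0.070 - 0.014*TPSA + 0.169*ClogP."""
    return 0.070 - 0.014 * tpsa + 0.169 * clogp


def brain_to_blood_ratio(log_bb):
    """Convert LogBB = log10(C_brain / C_blood) back to a concentration ratio."""
    return 10 ** log_bb


if __name__ == "__main__":
    # Made-up descriptor values: TPSA = 60 (in squared angstroms), ClogP = 2.0.
    log_bb = log_bb_tpsa_clogp(60.0, 2.0)
    print("LogBB =", round(log_bb, 3))
    print("C_brain / C_blood =", brain_to_blood_ratio(log_bb))
```

For the made-up inputs the equation gives LogBB = 0.070 - 0.840 + 0.338 = -0.432, i.e. a predicted brain concentration below the blood concentration.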
• The great majority of orally administered CNS drugs have a PSA < 70 Å². An analysis of non-CNS compounds suggested that these have a PSA < 120 Å².
• Thus, the majority of non-CNS-penetrating, orally absorbed compounds have PSA values between 70 and 120 Å².
Partition coefficients
X(aqueous) ⇌ X(octanol), with equilibrium constant P

Partition coefficient P (usually expressed as log10 P or logP) is defined as:

$$P = \frac{[X]_{\text{octanol}}}{[X]_{\text{aqueous}}}$$

P is a measure of the relative affinity of a molecule for the lipid and aqueous phases in the absence of ionisation.

1-Octanol is the most frequently used lipid phase in pharmaceutical research. This is because:
• It has a polar and a non-polar region (like a membrane phospholipid)
• Po/w is fairly easy to measure
• Po/w often correlates well with many biological properties
• It can be predicted fairly accurately using computational models
Calculation of logP
LogP for a molecule can be calculated from a sum of fragmental
or atom-based terms plus various corrections.

logP = Σ fragments + Σ corrections


[Figure: structure of phenylbutazone, with a chain branch marked]

clogP for Windows output: PHENYLBUTAZONE (C: 3.16, M: 3.16)

Class      | Type   | Log(P) contribution description   | Value
FRAGMENT   | # 1    | 3,5-pyrazolidinedione             | -3.240
ISOLATING  | CARBON | 5 aliphatic isolating carbon(s)   |  0.975
ISOLATING  | CARBON | 12 aromatic isolating carbon(s)   |  1.560
EXFRAGMENT | BRANCH | 1 chain and 0 cluster branch(es)  | -0.130
EXFRAGMENT | HYDROG | 20 H(s) on isolating carbons      |  4.540
EXFRAGMENT | BONDS  | 3 chain and 2 alicyclic (net)     | -0.540
RESULT     | 2.11   | All fragments measured            | clogP 3.165
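The phenylbutazone result can be checked by summing the listed contributions, following logP = Σ fragments + Σ corrections. This is a small sketch; the values are the ones in the clogP output above, while the variable and function names are illustrative.

```python
# Contributions copied from the clogP output for phenylbutazone.
PHENYLBUTAZONE_CONTRIBUTIONS = [
    ("3,5-pyrazolidinedione fragment", -3.240),
    ("5 aliphatic isolating carbons", 0.975),
    ("12 aromatic isolating carbons", 1.560),
    ("1 chain branch", -0.130),
    ("20 H on isolating carbons", 4.540),
    ("3 chain and 2 alicyclic bonds (net)", -0.540),
]


def clogp_from_contributions(contributions):
    """Sum fragment values and correction terms into a single logP estimate."""
    return sum(value for _description, value in contributions)


if __name__ == "__main__":
    total = clogp_from_contributions(PHENYLBUTAZONE_CONTRIBUTIONS)
    print("clogP =", round(total, 3))
```

The terms add up to 3.165, matching the RESULT line of the output.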


What else does logP affect?

logP affects:
• Binding to the enzyme / receptor
• Aqueous solubility
• Binding to P450 metabolising enzymes
• Absorption through membranes
• Binding to blood / tissue proteins (less free drug to act)
• Binding to the hERG heart ion channel (cardiotoxicity risk)

So log P needs to be optimised


Admet Descriptors Calculation Tools
• PreADMET http://preadmet.bmdrc.org/
  ◦ Molecular descriptors calculation: 1081 diverse molecular descriptors
  ◦ Drug-likeness prediction: Lipinski rule, lead-like rule, Drug DB-like rule
  ◦ ADME prediction: Caco-2, MDCK, BBB, HIA, plasma protein binding and skin permeability data
  ◦ Toxicity prediction: Ames test and rodent carcinogenicity assay
• SPARC Online Calculator http://ibmlc2.chem.uga.edu/sparc/
On-line calculator for prediction of pKa, solubility, polarizability and other properties; search in the database of experimental pKa values is also available
• Daylight Chemical Information Systems www.daylight.com/daycgi/clogp
Calculation of log P by the CLOGP algorithm from BioByte; also access to the LOGPSTAR database of experimental log P data.
Admet Tools Continued..
• Molinspiration Cheminformatics
www.molinspiration.com/services/index.
Calculation of molecular properties relevant to drug design and QSAR, including log P, polar surface area, Rule of Five parameters, and drug-likeness index
• Pirika - www.pirika.com
Calculation of various types of molecular properties, including boiling point, vapor pressure, and solubility; web demo restricted to aliphatic molecules
• Actelion - www.actelion.com/page/property_explorer
Calculation of molecular weight, logP, solubility, drug-score and toxicity risk.
• Virtual Computational Chemistry Laboratory - www.vcclab.org
Prediction of log P and water solubility based on associative neural networks as well as other parameters; comparison of various prediction methods
Virtual Screening
Ways to Assess Structures from a
Virtual Screening Experiment
 Use a previously derived mathematical model
that predicts the biological activity of each
structure
 Run substructure queries to eliminate
molecules with undesirable functionality
 Use a docking program to ID structures
predicted to bind strongly to the active site of a
protein (if target structure is known)
 Filters remove structures not wanted in a
succession of screening methods
Main Classes of Virtual Screening
Methods
 Depend on the amount of structural and
bioactivity data available
◦ One active molecule known: perform similarity
search (ligand-based virtual screening)
◦ Several active molecules known: try to ID a common
3D pharmacophore, then do a 3D database search
◦ Reasonable number of active and inactive structures
known: train a machine learning technique
◦ 3D structure of the protein known: use protein-
ligand docking
STRUCTURE-BASED VIRTUAL
SCREENING
 Protein-Ligand Docking
◦ Aims to predict 3D structures when a molecule
“docks” to a protein
 Need a way to explore the space of possible protein-ligand
geometries (poses)
• Need to score the ligand poses such that the score reflects the binding affinity of the ligand
 Need to score or rank the poses to ID most likely binding
mode and assign a priority to the molecules
◦ Problem: involves many degrees of freedom
(rotation, conformation) and solvent effects
 Conformations of ligands in complexes often have very
similar geometries to minimum-energy conformations
of the isolated ligand
Protein-Ligand Docking
Methods
 Modern methods explore orientational and
conformational degrees of freedom at the same
time
◦ Monte Carlo algorithms (change the conformation of the ligand or subject the molecule to a translation or rotation within the binding site)
◦ Genetic algorithms
◦ Incremental construction approaches
Distinguish “Docking” and “Scoring”
 Docking involves the prediction of the binding
mode of individual molecules
◦ Goal: ID orientation closest in geometry to the
observed X-ray structure
 Scoring ranks the ligands using some function
related to the free energy of association of the
two units
◦ The DOCK function looks at atom pairs separated by 2.3-3.5 Angstroms
◦ A pair-wise linear potential looks at attractive and repulsive regions, taking into account steric and hydrogen-bonding interactions (e.g. MolDock)
Structure-Based Virtual
Screening: Other Aspects
 Computationally intensive and complex
 Multitude of possible parameters figure into
docking programs
 Docking programs require 3D conformation as
the starting point or require partial atomic
charges for protein and ligand
 X-Ray Crystallographic studies don’t include
hydrogens, but most docking programs require
them.
Ligand Based Virtual Screening
The ligand-based approach mainly uses pharmacophore maps and QSAR to identify or modify a lead in the absence of a known three-dimensional structure of the receptor. It is necessary to have experimental affinities and molecular properties of a set of active compounds for which the chemical structures are known.
a) PHARMACOPHORE: A pharmacophore is an explicit geometric hypothesis of the critical features of a ligand. Standard features include H-bond donors and acceptors, charged groups, and hydrophobic patterns. The hypothesis can be used to screen databases for compounds and to refine existing leads.
 For a geometric alignment of the functional groups of the leads, it is necessary to specify the
conformations that individual compounds adopt in their bound state.
 Since the simple presence of a pharmacophoric fingerprint is not sufficient for predicting
activity, inactive compounds possessing the required pharmacophoric features must also be
considered.
 By comparing the volume of the active and the inactive compounds, a common volume can
be constructed in order to approximate the shape of the (unknown) receptor site to further
refine the pharmacophore model and to screen out additional compounds.
[Figure: Pharmacophore modelling workflow. Stages shown: 3D compound structures, feature analysis, set of conformers, align to template, compare, pharmacophore, validation, application]
Continued.......
b) QSAR: The goal of QSAR studies is to predict the activity of new compounds based solely on their chemical structure. The underlying assumption is that the biological activity can be attributed to incremental contributions of the molecular fragments determining the biological activity. This assumption is called the linear free energy principle. Information about the strength of interactions is captured for each compound by, for example, steric, electronic, and hydrophobic descriptors.
Molecular similarity and searching Molecules
What is it?
Chemical, pharmacological or biological properties of two compounds
match.
The more the common features, the higher the similarity between two
molecules.

Chemical

The two structures on top are chemically similar to each other. This is reflected in their
common sub-graph, or scaffold: they share 14 atoms

Pharmacophore

The two structures above are less similar chemically (topologically) yet have the same
pharmacological activity, namely they both are Angiotensin-Converting Enzyme (ACE)
inhibitors
Molecular similarity
How to calculate it?
Quantitative assessment of similarity/dissimilarity of structures
 need a numerically tractable form
 molecular descriptors, fingerprints, structural keys

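Sequences/vectors of bits, or numeric values, that can be compared by distance functions or similarity metrics.

E = Euclidean distance, T = Tanimoto index

$$E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$T(x, y) = \frac{B(x \,\&\, y)}{B(x) + B(y) - B(x \,\&\, y)}$$

where B(x) is the number of bits set in x, and x & y is the bitwise AND of the two fingerprints.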
Molecular descriptors
a) chemical fingerprint

hashed binary fingerprint


o encodes topological properties of the chemical graph: connectivity,
edge label (bond type), node label (atom type)
o allows the comparison of two molecules with respect to their
chemical structure

Construction

1. find all 0, 1, …, n step walks in the chemical graph
2. generate a bit array for each walk, with a given number of bits set
3. merge the bit arrays with a logical OR operation
Molecular descriptors
Example 1: chemical fingerprint
Example
CH3 – CH2 – OH
walks from the first carbon atom
length walk bit array
0 C 1010000000
1 C–H 0001010000
1 C–C 0001000100
2 C–C–H 0001000010
2 C–C–O 0100010000
3 C–C–O–H 0000011000
merge bit arrays for the first carbon atom: 1111011110
This example illustrates how a 10 bits long topological chemical fingerprint is
created for a simple chain structure. In this example all walks up to 3 steps are
considered, and 2 bits are set for each pattern.
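The merge step for the first carbon atom of CH3–CH2–OH can be reproduced directly from the table. This is a minimal sketch: the per-walk bit arrays are copied from the example above (in a real fingerprint they would come from a hash function, which is not modelled here), and the names are illustrative.

```python
# Bit arrays for the walks starting at the first carbon atom, as listed above.
WALK_BIT_ARRAYS = [
    ("C", "1010000000"),
    ("C-H", "0001010000"),
    ("C-C", "0001000100"),
    ("C-C-H", "0001000010"),
    ("C-C-O", "0100010000"),
    ("C-C-O-H", "0000011000"),
]


def merge_bit_arrays(bit_arrays):
    """Merge equal-length bit strings with a logical OR."""
    width = len(bit_arrays[0])
    merged = 0
    for bits in bit_arrays:
        merged |= int(bits, 2)  # OR this walk's bits into the fingerprint
    return format(merged, "0{}b".format(width))


if __name__ == "__main__":
    fingerprint = merge_bit_arrays([bits for _walk, bits in WALK_BIT_ARRAYS])
    print(fingerprint)  # 1111011110
```

Because different walks can set the same bit (here C–H, C–C and C–C–H all set bit 4), the merged fingerprint has fewer set bits than the sum over the walks.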
Molecular Similarity
Example 1: chemical fingerprint

0100010100010100010000000001101010011010100000010100000000100000

0100010100010100010000000001101010011010100000000100000000100000
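The Tanimoto index of the two fingerprints shown above can be computed from bit counts, using T = B(x & y) / (B(x) + B(y) - B(x & y)). This is a minimal sketch; the function name is illustrative, and returning 0.0 for two empty fingerprints is an assumed convention.

```python
def tanimoto(fp_x, fp_y):
    """Tanimoto index of two equal-length bit strings."""
    x = int(fp_x, 2)
    y = int(fp_y, 2)
    common = bin(x & y).count("1")  # B(x & y)
    bits_x = bin(x).count("1")      # B(x)
    bits_y = bin(y).count("1")      # B(y)
    union = bits_x + bits_y - common
    return common / union if union else 0.0


FP_1 = "0100010100010100010000000001101010011010100000010100000000100000"
FP_2 = "0100010100010100010000000001101010011010100000000100000000100000"

if __name__ == "__main__":
    print(round(tanimoto(FP_1, FP_2), 3))
```

The first fingerprint has 17 bits set, the second has 16, and all 16 are shared, so T = 16 / (17 + 16 - 16) = 16/17 ≈ 0.94: the two molecules differ in a single bit.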
Molecular descriptors
Example 2: pharmacophore fingerprint

• encodes pharmacophore properties of molecules as frequency counts of pharmacophore point pairs at a given topological distance
• allows the comparison of two molecules with respect to their pharmacophore

Construction

1. map pharmacophore point types to atoms
2. calculate the length of the shortest path between each pair of atoms
3. assign a histogram to every pharmacophore point pair and count the frequency of the pair with respect to its distance
Molecular descriptors
Example 2: pharmacophore fingerprint

[Figure: atoms coloured by pharmacophore point type (acceptor, donor, hydrophobic, none), with two histograms of pair counts (y-axis 0 to 12) for the pharmacophore point pairs AA, DA, DD, HA, HD and HH at topological distances 1 to 6]
Virtual screening using fingerprints
Individual query structure

0101010100010100010100100000000000010010000010010100100100010000

Query fingerprint (above). It is compared by proximity with each of the target fingerprints below; the closest targets are returned as hits.

0000000100001101000000101010000000000110000010000100001000001000
0100010110010010010110011010011100111101000000110000000110001000
0100010100011101010000110000101000010011000010100000000100100000
0001101110011101111110100000100010000110110110000000100110100000
0100010100110100010000000010000000010010000000100100001000101000
0100011100011101000100001011101100110110010010001101001100001000
0101110100110101010111111000010000011111100010000100001000101000
0100010100111101010000100010000000010010000010100100001000101000
0001000100010100010100100000000000001010000010000100000100000000
0100010100010011000000000000000000010100000010000000000000000000
0100010100010100000000000000101000010010000000000100000000000000
0101010101111100111110100000000000011010100011100100001100101000
0100010100011000010000011000000000010001000000110000000001100000
0000000100000000010000100000000000001010100000000100000100100000
0100010100010100000000100000000000010000000000000100001000011000
0001000100001100010010100000010100101011100010000100001000101000
0100011100010100010000100001001110010010000010001100000000101000
0101010100010100010100100000000000010010000010010100100100010000

Target fingerprints (above), one per target structure
Hypothesis Fingerprints
Advantages | Disadvantages
• strict conditions for hits if actives are fairly similar | • false results with asymmetric metrics • misses common features of highly diverse sets • very sensitive to one missing feature
• captures common features of more diverse active sets | • less selective if actives are very similar
• captures common features of more diverse active sets • specific treatment of the absence of a feature • less sensitive to outliers | • less selective if actives are very similar
SUMMARY
 Virtual screening methods are central to many
cheminformatics problems in:
◦ Design
◦ Selection
◦ Analysis
 Increasing numbers of molecules can be
evaluated using these techniques
 Reliability and accuracy remain as problems in
docking and predicting ADMET properties
 Need much more reliable and consistent
experimental data
Datamining and Machine Learning
Approaches to Virtual Screening
Idea of Datamining
• Data mining is the discovery of patterns in data, for example:
a) a hunter looks for patterns in animal migration behavior
b) farmers seek patterns in crop growth
c) politicians seek patterns in voter opinion
d) patterns in compound structures
• The patterns that are discovered must be meaningful and lead to some advantage.
• The process must be automatic or semiautomatic.
Canonical learning Problems
 Supervised Learning: given examples of inputs and
corresponding desired outputs, predict outputs on
future inputs.
a) Classification
b) Regression
c) Time series prediction
 Unsupervised Learning: given only inputs, automatically
discover representations, features, structure, etc.
a) Clustering
b) Outlier detection
c) Compression
Datamining Methods
• Substructural Analysis
Each substructural fragment is assumed to make a contribution to activity irrespective of the other fragments in the molecule. The idea is to derive a weight for each fragment that reflects its tendency to occur in active or inactive molecules. The sum of the weights gives the score of a molecule, which enables a new set of structures to be ranked in decreasing probability of activity.
The weight of fragment i is calculated from act(i), the number of active molecules that contain the i-th fragment, and inact(i), the number of inactive molecules that contain the i-th fragment.
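The scoring scheme can be sketched as follows. The weight used here, w(i) = act(i) / (act(i) + inact(i)), is an assumption: it is one commonly used form for substructural analysis, and other weighting schemes exist. The fragment names, counts and molecules are made up for illustration.

```python
def fragment_weights(active_counts, inactive_counts):
    """Assumed weight per fragment: act(i) / (act(i) + inact(i))."""
    weights = {}
    for fragment in set(active_counts) | set(inactive_counts):
        act = active_counts.get(fragment, 0)
        inact = inactive_counts.get(fragment, 0)
        if act + inact > 0:  # skip fragments never seen in training
            weights[fragment] = act / (act + inact)
    return weights


def score_molecule(fragments, weights):
    """Score of a molecule = sum of the weights of the fragments it contains."""
    return sum(weights.get(fragment, 0.0) for fragment in fragments)


def rank_molecules(molecules, weights):
    """Molecule names sorted by decreasing score."""
    return sorted(molecules,
                  key=lambda name: score_molecule(molecules[name], weights),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical training counts: number of actives / inactives per fragment.
    act = {"nitro": 8, "ester": 1, "amide": 5}
    inact = {"nitro": 2, "ester": 9, "amide": 5}
    weights = fragment_weights(act, inact)
    new_molecules = {
        "mol_A": ["nitro", "amide"],
        "mol_B": ["ester", "amide"],
        "mol_C": ["nitro", "ester"],
    }
    print(rank_molecules(new_molecules, weights))
```

Because each fragment contributes independently, scoring a new molecule is a dictionary lookup per fragment, which is what makes the method fast enough for large libraries.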
Discriminant algorithms
• The aim of discriminant analysis is to separate the molecules into their constituent classes.
• The simplest is linear discriminant analysis which, in the case of two activity classes and two descriptors, aims to find a straight line that separates the data such that the maximum number of compounds are correctly classified.
• If more than two variables are used, the line becomes a hyperplane.
• The idea is to express a class as a linear combination of attributes:
X = w0 + w1*a1 + w2*a2 + w3*a3 + ...
where X = class; a1, a2, ... = attributes; w0, w1, w2, ... = weights
Neural Networks (NN)
• The two neural network architectures most commonly used in chemistry are feed-forward networks and Kohonen networks.
• The feed-forward NN is a supervised learning method, as it uses the values of the dependent variables to derive the model. The Kohonen network or self-organizing map (SOM) is an unsupervised method.
• The feed-forward NN contains layers of nodes with connections between all pairs of nodes in adjacent layers. A key feature is the presence of hidden nodes, which, together with the back-propagation algorithm, makes the network applicable to many fields.
• The neural network must first be trained with a set of inputs. Once it has been trained, it can be used to predict values for new and unseen molecules.
Neural Networks Continued...
The figure below shows a feed-forward network with 3 hidden nodes and one output.

• A Kohonen NN consists of a rectangular array of nodes, and each node has an associated vector that corresponds to the input data (descriptor values).
• The data is presented to the network one molecule at a time, and the distance between each node vector and the molecule vector is determined with a distance metric. The node with the minimum distance becomes the winning node.
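Finding the winning node is a nearest-vector search over the grid. This is a minimal sketch, assuming Euclidean distance as the metric and a made-up 2 x 2 grid with two descriptors per node; the subsequent update of the node vectors during training is not shown.

```python
import math


def winning_node(node_vectors, molecule_vector):
    """Return the grid position whose node vector is closest to the molecule."""
    return min(
        node_vectors,
        key=lambda position: math.dist(node_vectors[position], molecule_vector),
    )


if __name__ == "__main__":
    # Hypothetical 2 x 2 Kohonen grid: (row, column) -> node vector.
    grid = {
        (0, 0): [0.0, 0.0],
        (0, 1): [1.0, 0.0],
        (1, 0): [0.0, 1.0],
        (1, 1): [1.0, 1.0],
    }
    descriptors = [0.9, 0.2]  # made-up descriptor values for one molecule
    print(winning_node(grid, descriptors))
```

For the made-up molecule the closest node vector is [1.0, 0.0], so position (0, 1) wins.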
Disadvantage of Neural Networks
• It is difficult to design the best model for a neural network, i.e. to choose the number of hidden layers and nodes that will best fit the data.
• Another practical issue is overtraining. An overtrained NN will give excellent results on the training data but will perform poorly on unseen data (test data), because the network memorizes the training data.
• The way to solve this problem is to divide the data into training and test sets and to monitor performance on the test set during training. Test-set performance increases until it reaches a plateau and then starts to decline; at this point the network has its maximum predictive ability.
DECISION TREES (DT)
• With a feed-forward NN it is not possible to explain the result for a given input: because of the complex interconnections between nodes, one cannot determine which properties are important.
• Decision trees consist of a set of rules that associate molecular descriptor values with the property of interest.
• A DT is a tree whose nodes contain specific rules. Each rule may correspond to the presence or absence of a particular feature.
• In a DT one starts at the root node and follows the edge appropriate to the first rule. This continues until a terminal node is reached, at which point the molecule can be assigned to the active or inactive class.
• DT algorithms such as ID3, C4.5 and C5.0 use information theory to choose which criterion to apply at each step.
• In random forests, a small subset of the descriptors is randomly selected at each node rather than using the full set.
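The information-theoretic choice of a rule can be illustrated with entropy and information gain, the criterion used by ID3 (C4.5 uses a normalised variant, the gain ratio, not shown here). This is a minimal sketch for a presence/absence feature; the labels and feature values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(labels, feature_present):
    """Entropy reduction obtained by splitting on a presence/absence feature."""
    with_feature = [l for l, present in zip(labels, feature_present) if present]
    without_feature = [l for l, present in zip(labels, feature_present) if not present]
    remainder = 0.0
    for subset in (with_feature, without_feature):
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


if __name__ == "__main__":
    labels = ["active"] * 4 + ["inactive"] * 4
    separating = [True] * 4 + [False] * 4    # feature present only in actives
    uninformative = [True, False] * 4        # feature unrelated to activity
    print(information_gain(labels, separating))     # 1.0
    print(information_gain(labels, uninformative))  # 0.0
```

At each node the tree builder would evaluate every candidate feature this way and split on the one with the highest gain.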
Support Vector Machines (SVM)
• Support vector machines select a small number of critical boundary instances, called support vectors, from each class and build a linear discriminant function that separates them as widely as possible.
• Molecules in the test set are mapped to the same feature space, and their activity is predicted according to which side of the hyperplane they fall on.
• The distance to the boundary can be used to assign a confidence level to the prediction: the greater the distance, the higher the confidence.
• The output of the SVM is given by $f(x) = \mathrm{sign}(g(x))$, where $g(x) = w^{T}x + b$, w is a vector and b is a scalar.
• A linear SVM can be applied only when the active and inactive compounds can be divided by a straight line (hyperplane) in the feature space.
SVM continued....
• When the data cannot be separated linearly, kernel functions are used to transform the data to a higher-dimensional space.
• The output of the SVM is given by $f(x) = \mathrm{sign}(g(x))$, and g(x) is given by

$$g(x) = \sum_{k=1}^{m} \alpha_k y_k K(x_k, x) + b$$

where K is the so-called kernel function, the suffix k indexes the support vectors $x_k$ (with class labels $y_k$ and learned coefficients $\alpha_k$), and m stands for the number of support vectors.
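The kernel form of the decision function, g(x) = Σ_k α_k y_k K(x_k, x) + b with f(x) = sign(g(x)), can be evaluated directly once the support vectors are known. This is a minimal sketch with a Gaussian kernel; the support vectors, labels, coefficients α_k, bias b and the gamma parameter are made-up values, not the result of training.

```python
import math


def gaussian_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def svm_g(x, support_vectors, labels, alphas, b, kernel=gaussian_kernel):
    """g(x) = sum over support vectors of alpha_k * y_k * K(x_k, x), plus b."""
    return sum(alpha * y * kernel(sv, x)
               for sv, y, alpha in zip(support_vectors, labels, alphas)) + b


def svm_predict(x, support_vectors, labels, alphas, b, kernel=gaussian_kernel):
    """f(x) = sign(g(x)); +1 is taken as 'active', -1 as 'inactive'."""
    return 1 if svm_g(x, support_vectors, labels, alphas, b, kernel) >= 0 else -1


if __name__ == "__main__":
    # Hypothetical trained model with two support vectors.
    support_vectors = [[0.0, 0.0], [2.0, 2.0]]
    labels = [+1, -1]
    alphas = [1.0, 1.0]
    bias = 0.0
    print(svm_predict([0.1, 0.1], support_vectors, labels, alphas, bias))
    print(svm_predict([1.9, 2.0], support_vectors, labels, alphas, bias))
```

Only the support vectors enter the sum, so prediction cost depends on m rather than on the size of the training set.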
• The Gaussian and the polynomial kernel functions are commonly used.
Strengths and Weaknesses of SVM

Strengths
 Training is relatively easy
 No local optima
 It scales relatively well to high dimensional data
 Tradeoff between classifier complexity and error can be controlled
explicitly
 Non-traditional data like strings and trees can be used as input to
SVM, instead of feature vectors

Weaknesses
• Need to choose a “good” kernel function.
Measuring Classifier Performance
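N = total number of instances in the dataset
TP_j = number of true positives for class j
FP_j = number of false positives for class j
TN_j = number of true negatives for class j
FN_j = number of false negatives for class j

$$\text{Accuracy} = \frac{TP_j + TN_j}{N}$$

$$\text{Sensitivity (recall)} = \frac{TP_j}{TP_j + FN_j}$$

$$\text{Specificity} = \frac{TN_j}{TN_j + FP_j}$$

$$\text{Precision} = \frac{TP_j}{TP_j + FP_j}$$

Specificity and precision are distinct measures: specificity is the fraction of true negatives that are recognised, while precision is the fraction of predicted positives that are correct.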
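The measures can be computed from the four counts for a class using the standard definitions: accuracy = (TP + TN) / N, sensitivity (recall) = TP / (TP + FN), specificity = TN / (TN + FP) and precision = TP / (TP + FP). This is a minimal sketch; the counts in the example are made up.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Standard performance measures from confusion-matrix counts for one class."""
    n = tp + fp + tn + fn  # total number of instances
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


if __name__ == "__main__":
    # Hypothetical screening result: 100 compounds in total.
    for name, value in classifier_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(name, round(value, 3))
```

In virtual screening the active class is usually a small minority, so accuracy alone can look high for a classifier that finds few actives; sensitivity and precision for the active class are more informative.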
Types of Datamining Learning Process in Weka
• Classification learning: the learning scheme is presented with a set of classified examples, from which it is expected to learn a way of classifying unseen examples.
• Association learning: any association among features is sought, not just ones that predict a particular class value.
• Clustering: groups of examples that belong together are sought.
• Numeric prediction: the outcome to be predicted is not a discrete class but a numeric quantity.
Classifier Algorithms in WEKA
a) Bayes classifiers: AODE, BAYES NET, NAÏVE BAYES, NAÏVE BAYES MULTINOMIAL, NAÏVE BAYES UPDATABLE
b) Trees: ADTREE, ID3, J48, LMT, NBTREE, RANDOM FOREST, RANDOM TREE, REP TREE
c) Functions: LINEAR REGRESSION, LOGISTIC, MULTILAYER PERCEPTRON, RBF NETWORK, SIMPLE LINEAR REGRESSION, SIMPLE LOGISTIC, SMO, SMO REG
d) Rules: CONJUNCTIVE RULE, DECISION TABLE, JRIP, M5RULES, NNGE, ONE R, PRISM, ZERO R
Summary
 Machine learning is mainly applied to ligand-based drug
screening and it is applied to the calculation of the
optimal distance between the feature vectors of active
and inactive compounds.
 A kernel is essentially a similarity function with certain
mathematical properties, and it is possible to define
kernel functions over all sorts of structures for
example, sets, strings, trees, and probability
distributions .
 Interest in neural networks appears to have declined
since the arrival of support vector machines, perhaps
because the latter generally require fewer parameters
to be tuned to achieve the same (or greater) accuracy.
THANK YOU
