Machine Learning in Predictive Pharmacology and Toxicology
Andreas Karwath and Christoph Helma
Machine Learning and Data Mining, WS 2004/05
Contents
Predictive Pharmacology and Toxicology
Feature Mining Algorithms
MolFea
Graph Mining
Classification/Regression Algorithms
lazar
SMIREP
Goal
Given:
Database with
Chemical structures
Biological activities
Database with untested structures
Task:
Predict biological activities of untested structures
Inductive Databases for Toxicity Predictions
CAS SMILES SAL
100-00-5 ON(=O)c1ccc(Cl)cc1 1
100-01-6 Nc1ccc(cc1)N(=O)O 1
100-40-3 C=CC1CCC=CC1 0
100-41-4 CCc1ccccc1 0
100-42-5 C=Cc1ccccc1 1
100-44-7 ClCc1ccccc1 1
100-51-6 OCc1ccccc1 0
100-52-7 O=Cc1ccccc1 0
100-63-0 NNc1ccccc1 1
100-75-4 O=NN1CCCCC1 1
100-97-0 C1N2CN3CN1CN(C2)C3 1
10034-93-2 NNOS(=O)(=O)O 1
10034-96-5 O=S1(=O)O[Mn]O1 0
10043-35-3 OB(O)O 0
10043-52-4 [Ca] 0
101-05-3 Clc1ccccc1Nc2nc(Cl)nc(Cl)n2 0
101-14-4 Nc1ccc(Cc2ccc(N)c(Cl)c2)cc1Cl 1
101-73-5 CC(C)Oc1ccc(Nc2ccccc2)cc1 0
101-80-4 Nc1ccc(Oc2ccc(N)cc2)cc1 1
101-90-6 C(Oc1cccc(OCC2CO2)c1)C3CO3 1
...
[Diagram: the training data above feed an inductive database engine; given a test structure, the engine outputs a toxicity prediction]
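A minimal sketch of how such a training database could be held in memory, assuming RDKit is available; the dictionary layout and variable names are illustrative, not part of the original system:

    from rdkit import Chem

    # Illustrative training database: CAS number -> (SMILES, SAL activity).
    training = {
        "100-41-4": ("CCc1ccccc1", 0),
        "100-44-7": ("ClCc1ccccc1", 1),
        "100-63-0": ("NNc1ccccc1", 1),
    }

    # Parse the SMILES strings into molecule objects for later feature mining.
    mols = {cas: (Chem.MolFromSmiles(smi), act)
            for cas, (smi, act) in training.items()}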
Chemical Structures
[Figures: example chemical structures]
Structural Alerts
Alkylating electrophilic centers
Unstable epoxides
Aromatic amines
Azo structures
N-nitroso groups
Aromatic nitro groups

[Structure diagrams illustrating these alerts]
Outline of a Predictive Toxicology System
Calculation/Mining for Chemical Features
Feature Selection
Classification/Regression
Chemical Features
Presence of substructures
Graph theoretic descriptors
Physico/chemical properties of the molecule
(e.g. logP, HOMO, LUMO, …)
3D-parameters
Biological properties (e.g. from screening
assays)
Spectra (IR, NMR, MS, …)
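Several of these features can be computed directly with standard cheminformatics toolkits. A hedged example, assuming RDKit: the Wildman-Crippen logP estimate for one compound from the dataset above:

    from rdkit import Chem
    from rdkit.Chem import Crippen

    mol = Chem.MolFromSmiles("ClCc1ccccc1")   # benzyl chloride from the dataset
    print(Crippen.MolLogP(mol))               # estimated octanol/water logP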

Feature Selection
Feature construction
Create fewer, more predictive features (e.g. PCA, Clustering)
Wrapper methods
Score feature subsets with learning algorithm
Forward selection
Backward elimination
Effective, but time-consuming for proper cross-validation
Filter methods
Rank features according to a score function (e.g. chi-square, r²)
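A sketch of the filter method with scikit-learn's chi-square score, assuming a binary fragment/compound matrix X and activity labels y (randomly generated here for illustration):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 500))   # 200 compounds, 500 binary fragments
    y = rng.integers(0, 2, size=200)          # activity labels

    selector = SelectKBest(chi2, k=100)       # rank features, keep the 100 best
    X_reduced = selector.fit_transform(X, y)
    print(X_reduced.shape)                    # (200, 100)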
Data Mining Algorithms
Statistical Methods (e.g. various regression techniques)
Bayesian Techniques
k-Nearest Neighbors
Decision Trees
Rule Learners
Neural Nets
Support Vector Machines

Classical QSAR
Decision Trees
PART Rules
Support Vector Machines
+1.63 * c:c:c:c:c:c:c:c:c
+1.44 * C-Cl
+1.32 * C-C-C-C-N-C
+1.31 * C-C-C-O
+0.95 * C-C=C
+0.87 * c:c:c:c:c:n
+0.82 * C-C-C-C=C
+0.82 * C-C-C-N-C
+0.80 * c:c:c-C=O
+0.78 * C-N-C
+...
-1.48 * Cl-C-Cl
-1.45 * C-C-C=C-C
-1.01 * C-N-c:c
-1.01 * C-N-c:c:c
-0.95 * C-C
-0.95 * C-C-N-C
-0.94 * C-O-C=O
-0.94 * c:c:c:c:c:c-S
-0.94 * c:c:c:c:c-S
-0.94 * c:c:c:c-S
- ...
Selection of Data Mining Algorithms
Desired structure and complexity of the models
Representational assumptions
Mechanistic background
Purpose of the model
Performance issues
Capabilities of the algorithm
Sensitivity towards noisy data
Missing values
Skewed distributions between active/inactive molecules
Personal preferences, hypes, …
Problem Setting
Non-congeneric compounds
No common mode of action
Poor knowledge about biochemical mechanisms
Several hundred to several thousand compounds in the training set
Missing values
Skewed distributions in the training set
Requirements
Informative, comprehensible and traceable
output
Rationales for predictions, e.g.
Relevant features
Similar compounds in the training set
Rules that are applicable for the test compound
Necessary information depends on prediction algorithm
Requirements
Informative, comprehensible and traceable
output
Confidence in predictions, e.g.
Confidence intervals
Classification probabilities
Scope of the training set, e.g.
Unknown features of the test structure
Similarity to compounds in the training set
The Molecular Feature Miner
Given: datasets with molecules
Queries for molecular fragments (substructures) with constraints concerning frequency and syntax
e.g. ('Cl' ⊑ f) ∧ (freq(f, Carcinogens) > 5) ∧ (freq(f, NonCarcinogens) ≤ 2)
Solver: based on the levelwise version space algorithm
Linear Molecular Fragments
A fragment is a sequence of linearly connected atoms (e.g. 'O-c:c:c:c-Cl')
O, C, Cl, N, S, ... denote elements
- ... single bond
= ... double bond
# ... triple bond
: ... aromatic bond
(hydrogens implicit)
SMARTS encoding: 'O-c:c:c:c-Cl'
Properties of Linear Molecular Fragments
g ~ s
g is equivalent to s (syntactic variants) if and only if one is the reversal of the other
E.g. 'C-O-S' and 'S-O-C' denote the same substructure
g ⊑ s
g is more general than s if and only if g is a subsequence of s or of the reversal of s
⊑ is a partial order on the fragments in M
E.g. 'C-c' ⊑ 'C-c:c:c'
E.g. 'c:C' ⊑ 'C-C-C:c'
Primitive Constraints
f ⊑ P, P ⊑ f, ¬(f ⊑ P) and ¬(P ⊑ f):
f ... unknown target fragment
P ... a specific fragment
e.g. 'c:c:c' ⊑ f
freq(f, D)
absolute frequency of a fragment f in a set of molecules D
freq(f, D1) > t, freq(f, D2) ≤ t
t ... positive integer
D1, D2 ... sets of molecules
e.g. freq(f, Pos) > 20
Computing Borders
Borders completely characterize the set of solutions
e.g.: if C-c is a solution and C-c:c:c-N is a solution, then {C-c:c, C-c:c:c} are also solutions
Compute
the set of the most general solutions G (i.e. shortest fragments)
the set of the most specific solutions S (i.e. longest fragments)
to determine all solutions
Conjunctive queries: update G or S
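The levelwise search behind these computations can be sketched in a few lines. A toy miner for the minimum-frequency constraint, assuming RDKit; right-extension of fragments is a simplification of MolFea, which additionally maintains the borders G and S instead of materializing all solutions:

    from rdkit import Chem

    def freq(fragment, mols):
        # Absolute frequency of a SMARTS fragment in a set of molecules.
        patt = Chem.MolFromSmarts(fragment)
        return sum(1 for m in mols if m.HasSubstructMatch(patt))

    def levelwise(mols, min_freq, atoms=("C", "N", "O", "c"), bonds=("-", "=", ":")):
        level = [a for a in atoms if freq(a, mols) >= min_freq]      # level 1
        solutions = list(level)
        while level:
            # Extend each frequent fragment by one bond and one atom.
            candidates = {f + b + a for f in level for b in bonds for a in atoms}
            level = [c for c in candidates
                     if Chem.MolFromSmarts(c) is not None            # syntactically valid
                     and freq(c, mols) >= min_freq]                  # anti-monotonic pruning
            solutions.extend(level)
        return solutions

    mols = [Chem.MolFromSmiles(s) for s in ("CCc1ccccc1", "OCc1ccccc1", "NNc1ccccc1")]
    print(levelwise(mols, min_freq=2))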
Levelwise Version Space Algorithm
[Diagram: fragments ordered from general (top) to specific (bottom); the solutions lie between border G (everything above is too general, i.e. too frequent w.r.t. the max. frequency) and border S (everything below is too specific, i.e. infrequent w.r.t. the min. frequency)]
Levelwise Version Space Algorithm
[Diagram: a min. frequency constraint updates S to a new border S']
Levelwise Version Space Algorithm
[Diagram: a max. frequency constraint updates G to G'; a min. frequency constraint updates S to S']
Example Dataset
[Figure: the molecules of the example dataset]
Example Query
freq(f,A) > 2
LEVEL 1:
Candidates: [Li], [Be], B, C, N, O, F, [Na], [Mg], [Al], [Si], P, S, Cl, [K], [Ca], [Sc], [Ti],
[V], [Cr], [Mn], [Fe], [Co], [Ni], [Cu], [Zn], [Ga], [Ge], [As], [Se], Br, [Rb], [Sr], [Y],
[Zr], [Nb], [Mo], [Tc], [Ru], [Rh], [Pd], [Ag], [Cd], [In], [Sn], [Sb], [Te], I, [Cs], [Ba],
[Lu], [Hf], [Ta], [W], [Re], [Os], [Ir], [Pt], [Au], [Hg], [Tl], [Pb], [Bi], [Po], [At], [Rn],
[Fr], [Ra], [Lr], c, n, s, o, p (78)
Frequent: C, N, O, c (4)
LEVEL 2:
Candidates: C-C, C-N, C-O, C-c, C=C, C=N, C=O, C=c, C#C, C#N, C#O, C#c, N-N, N-O, N-c, N=N, N=O, N=c, N#N, N#O, N#c, O-O, O-c, O=O, O=c, O#O, O#c, c-c, c=c, c#c (30)
Frequent: C-C, C-N, C-O, C-c, c-c (5)
Example Query II
LEVEL 3 :
Candidates: C-C-C, C-C-N, C-C-O, C-C-c, N-C-N, N-C-O, N-C-c, C-N-C, O-C-O, O-C-c, C-O-C, C-c-c, c-C-c, C-c-C, c-c-c (15)
Frequent: C-C-C, C-C-N, C-C-O, C-C-c, C-c-c, c-c-c (6)
LEVEL 4 :
Candidates: C-C-C-C, C-C-C-N, C-C-C-O, C-C-C-c, N-C-C-N, N-C-C-O, N-C-C-c,
O-C-C-O, O-C-C-c, C-C-c-c, c-C-C-c, C-c-c-c, C-c-c-C, c-c-c-c (14)
Frequent: C-C-C-O, C-C-C-c, N-C-C-O, N-C-C-c, C-C-c-c, C-c-c-c, c-c-c-c (7)
LEVEL 5 :
Candidates: O-C-C-C-O, O-C-C-C-c, C-C-C-c-c, c-C-C-C-c, N-C-C-c-c, C-C-c-c-c,
C-c-c-c-c, C-c-c-c-C, c-c-c-c-c (9)
Frequent: C-C-C-c-c, N-C-C-c-c, C-C-c-c-c, C-c-c-c-c, c-c-c-c-c (5)
LEVEL 6 :
Candidates: C-C-C-c-c-c, N-C-C-c-c-c, C-C-c-c-c-c, C-c-c-c-c-c, C-c-c-c-c-C, c-c-c-c-c-c (6)
Frequent: C-C-C-c-c-c, N-C-C-c-c-c, C-C-c-c-c-c, C-c-c-c-c-c, c-c-c-c-c-c (5)
Example Query III
LEVEL 7 :
Candidates: C-C-C-c-c-c-c, N-C-C-c-c-c-c, C-C-c-c-c-c-c, C-c-c-c-c-c-c, C-c-c-c-c-c-C, c-c-c-c-c-c-c (6)
Frequent: C-C-C-c-c-c-c, N-C-C-c-c-c-c, C-C-c-c-c-c-c, C-c-c-c-c-c-c (4)
LEVEL 8 :
Candidates: C-C-C-c-c-c-c-c, N-C-C-c-c-c-c-c, C-C-c-c-c-c-c-c, C-c-c-c-c-c-c-C (4)
Frequent: C-C-C-c-c-c-c-c, N-C-C-c-c-c-c-c, C-C-c-c-c-c-c-c (3)
LEVEL 9 :
Candidates: C-C-C-c-c-c-c-c-c, N-C-C-c-c-c-c-c-c (2)
Frequent: C-C-C-c-c-c-c-c-c, N-C-C-c-c-c-c-c-c (2)
LEVEL 10 :
Candidates: (0)
Frequent: (0)
============================================
G: C, N, O, c (4)
S: C-C-C-O, N-C-C-O, C-C-C-c-c-c-c-c-c, N-C-C-c-c-c-c-c-c (4)
============================================
Example Dataset
S: C-C-C-O, N-C-C-O, C-C-C-c-c-c-c-c-c, N-C-C-c-c-c-c-c-c
MolFea for Min. Frequency Constraint
MolFea Problems
Goal: Identification of the most predictive
fragments
Restriction to linear fragments
MolFea works only with (anti-)monotonic constraints
Extension towards convex evaluation functions
(e.g. Chi-square) possible, but rather inefficient
Necessity to keep track of multiple activities
Graph Mining
Finding recurring sub-graphs in a graph database
FSG
gSpan
Labeled Graph
We define a labeled graph G as a five-element tuple G = (V, E, Σ_V, Σ_E, λ) where
V is the set of vertices of G,
E ⊆ V × V is a set of undirected edges of G,
Σ_V (Σ_E) are the sets of vertex (edge) labels,
λ is the labeling function λ: V → Σ_V and E → Σ_E that maps vertices and edges to their labels.
[Figure: three example labeled graphs P, Q and S with vertex labels a, b, c, d and edge labels x, y]
Frequent Sub-graph Mining
Given a graph database GD = {G_0, G_1, ..., G_n}, find all sub-graphs appearing in at least min_sup graphs.
Methodology
Transaction ↔ Labeled graph
Item ↔ Vertex
Item set ↔ Sub-graph (induced, connected, ...)
Size of an item set ↔ Number of edges/vertices (size of a sub-graph)
Frequent Sub-graph Mining
min_sup = 2
Input: A set GD of labeled undirected graphs
[Figure: the example graphs Q, P and S from above]
Output: All frequent sub-graphs (w.r.t. min_sup) from GD.
[Figure: the frequent sub-graphs of Q, P and S, e.g. single edges a-b and two- and three-vertex combinations of a and b]
Graph/Subgraph Isomorphism: Hard Problems
Graph isomorphism
Determine whether two graphs are equivalent.
Suspected to be neither in P nor NP-complete.
Subgraph isomorphism
Determine whether one graph is contained in another.
NP-complete
Isomorphic sub-graphs are considered the same sub-graph.
Canonical labeling is equivalent to graph isomorphism.
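Both problems can be illustrated with networkx (an assumption for illustration; the miners in this lecture implement their own specialized tests):

    import networkx as nx
    from networkx.algorithms import isomorphism

    G = nx.cycle_graph(6)          # a hexagon, e.g. a benzene-like skeleton
    H = nx.path_graph(4)           # a chain of four vertices

    print(nx.is_isomorphic(G, H))  # graph isomorphism: False
    # Subgraph isomorphism (NP-complete in general): is H contained in G?
    print(isomorphism.GraphMatcher(G, H).subgraph_is_isomorphic())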
Finding Frequent Subgraphs
Greedy searches (GBI)
Inductive logic programming (ILP)
Inductive database approach (MolFea)
Graph theory based approaches:
Apriori approaches (AGM, FSG, …)
Generation of sub-graph candidates
Involves sub-graph isomorphism tests, which are NP-complete, so pruning is expensive.
DFS based (gSpan)
Kernel based
FSG: Frequent Subgraph Discovery Algorithm
Incremental and breadth-first on the size of frequent sub-graphs (like Apriori for frequent item sets)
Counting of frequent single- and double-edge sub-graphs
For finding frequent k-sub-graphs (k ≥ 3):
Candidate generation
Joining two (k–1)-sub-graphs similar to each other.
Candidate pruning by the downward closure property
Frequency counting
Check if a sub-graph is contained in a transaction.
Repeat the steps for k = k + 1
Increase the size of the sub-graphs by one edge.
FSG finds frequent connected sub-graphs in a bottom-up, breadth-first manner
FSG: Algorithm
[Figure: the FSG pseudocode]
Trivial Operations Become Complicated with Graphs…
Candidate generation
To determine whether two candidates can be joined, one needs to perform sub-graph isomorphism testing.
Isomorphism for redundancy check
Candidate pruning
To check the downward closure property, a sub-graph isomorphism test is needed.
Frequency counting
Sub-graph isomorphism for checking containment of a frequent
sub-graph
How to reduce the number of graph/sub-graph
isomorphism operations?
FSG: Candidate Generation
To generate a size-k candidate (k edges)
Take the intersection of the parent lists of two (k–1)-frequent sub-graphs
to see if the two (k–1)-frequent sub-graphs share the same size-(k–2) parent.
Parent lists are obtained in the pruning phase.
Sub-graph isomorphism free!
Example
parent(c_5) = { g_4, h_4, i_4 }, parent(d_5) = { f_4, g_4, h_4 }
Generate size-6 candidates from the cores g_4 and h_4.
Canonical labeling for redundancy check
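A sketch of the join-candidate detection via parent lists (names are illustrative; parents maps each (k–1)-frequent sub-graph to its (k–2)-cores recorded during pruning):

    def joinable_cores(f1, f2, parents):
        # A shared (k-2)-core means f1 and f2 can be joined: detected by a
        # plain set intersection, with no sub-graph isomorphism test.
        return parents[f1] & parents[f2]

    parents = {"c5": {"g4", "h4", "i4"}, "d5": {"f4", "g4", "h4"}}
    print(joinable_cores("c5", "d5", parents))   # {'g4', 'h4'}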
FSG: Candidate Generation
Intersection of the parent lists of two 3-frequent sub-graphs
Without sub-graph isomorphism, one can detect the core of the two 3-frequent sub-graphs.
Redundancy check by canonical labeling
FSG: CG Core Extension
FSG: CG Core Extension II
FSG: Candidate Pruning
Downward closure property
Every (k–1)-sub-graph must be frequent
Keep the list of (k–1)-sub-graphs
FSG: Candidate Pruning
Pruning of size-k candidates
For all (k–1)-sub-graphs of a size-k candidate, check if the downward closure property holds.
Canonical labeling is used to speed up the computation.
Build the parent list of (k–1)-frequent sub-graphs for the k-candidate.
Used later in candidate generation, if this candidate survives the frequency counting check.
FSG: Frequency Counting
Employs transaction IDs in TID lists
If a size-k candidate is contained in a transaction, all its size-(k–1) parents must be contained in the same transaction.
Perform sub-graph isomorphism only on the intersection of the TID lists of the parent frequent sub-graphs of size k–1.
Significantly reduces the number of sub-graph isomorphism checks.
Trade-off between running time and memory
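A sketch of TID-list based frequency counting (names are illustrative): only transactions containing every (k–1)-parent can contain the candidate, so the expensive isomorphism test runs on few graphs:

    def support_tids(candidate, parent_ids, tid_lists, db, contains):
        # The candidate can only occur where all of its (k-1)-parents occur.
        possible = set.intersection(*(tid_lists[p] for p in parent_ids))
        # contains(g, G) stands for a sub-graph isomorphism test.
        return {t for t in possible if contains(candidate, db[t])}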
FSG: Summary
Faster than naive Apriori-based graph miners like AGM
Finds connected sub-graphs (AGM: induced sub-graphs)
High memory consumption (TID lists)
Still a lot of sub-graph isomorphism testing
gSpan: Graph-Based Substructure Pattern Mining
Avoids cost-intensive operations such as
Candidate generation
Isomorphism testing
Uses two main concepts:
DFS lexicographic order
Minimum DFS code
Search space for finding sub-graphs
[Search tree levels: no edge, 1 edge, 2 edges, ...]
Every node is equivalent to one sub-graph
An extra edge is added for each child vertex
gSpan: Extensions to sub-graphs
Introduction of a lexicographic order on graphs
→ Limits the extension of an existing sub-graph:
edges are only added at the rightmost vertex or along the rightmost path
gSpan: Extensions to sub-graphs [2]
[Figure: five possible rightmost-path extensions (1)-(5) of a sub-graph]
gSpan: Extensions to sub-graphs [3]
gSpan: Extensions to sub-graphs [4]
[Figure: search tree in which non-minimal branches are pruned]
gSpan: DFS Code
Canonical encoding of sub-graphs
One code line per edge
Enables checking against previously generated isomorphic sub-graphs
gSpan: DFS Code [2]
[Figure: a labeled graph with five vertices (labels X, X, Y, Z, Z; DFS numbers 0-4) and edge labels a, b, c, d]
Each edge is encoded as a 5-tuple (i, j, l_i, l_(i,j), l_j):
0 (0,1,X,a,Y)
1 (1,2,Y,b,X)
2 (2,0,X,a,X)
3 (2,3,X,c,Z)
4 (3,1,Z,b,Y)
5 (1,4,Y,d,Z)
gSpan: DFS Code [4]
[Figure: the same four-vertex graph (labels A, B, A, B) with two different DFS codes:
(a) (0,1,A,-,B), (1,2,B,-,A), (2,3,A,-,B)
(b) (0,1,A,-,B), (1,2,B,-,A), (0,3,A,-,B)
code (a) is minimal]
gSpan: minimal DFS Code Algorithm
start with the alphanumerically lowest 1-edged subgraph
number the two vertices 0 and 1, mark the edge as explored
set k = 1
while unexplored edges exist:
  if possible, add a backward edge from the rightmost vertex to a vertex on the rightmost path (in ascending order of the vertex number), using the smallest vertex-edge-vertex label, and mark the edge as explored
  else, if possible, add a forward edge and the corresponding new vertex to a vertex on the rightmost path (in descending order of the vertex number), using the smallest vertex-edge-vertex label; label the new vertex with k+1, set k = k+1, and mark the edge as explored
gSpan: minimal DFS Code Example
[Figure: step-by-step construction of the DFS code for a graph with vertices X, X, Y, Z, Z (DFS numbers 0-4) and edge labels a, a, b, b, c, d]
Generating the DFS code, one entry (i, j, l_i, l_(i,j), l_j) per edge:
(0,1,X,a,X)
(1,2,X,a,Y)
(2,0,Y,b,X)
(2,3,Y,b,Z)
(3,0,Z,c,X)
(2,4,Y,d,Z)
gSpan: DFS Lexicographical Order
α = code(G_α, T_α) = (a_0, a_1, ..., a_m)
β = code(G_β, T_β) = (b_0, b_1, ..., b_n)
α ≤ β iff (1) or (2):
(1) ∃ t, 0 ≤ t ≤ min(m,n): a_k = b_k for k < t, and a_t <_e b_t
(2) a_k = b_k for 0 ≤ k ≤ m, and n ≥ m
Minimum DFS code
The minimum DFS code min(G), in DFS lexicographical order, is
the canonical label of graph G.
Graphs A and B are isomorphic if min(A) = min(B).
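A sketch of the canonical-label idea; all_dfs_codes is a hypothetical helper enumerating every DFS code of a graph as a list of 5-tuples, and plain tuple comparison stands in for the full DFS lexicographic order (gSpan itself never enumerates all codes; it prunes non-minimal ones early):

    def min_dfs_code(G, all_dfs_codes):
        return min(all_dfs_codes(G))     # the canonical label min(G)

    def isomorphic(G1, G2, all_dfs_codes):
        # Two graphs are isomorphic iff their minimum DFS codes coincide.
        return min_dfs_code(G1, all_dfs_codes) == min_dfs_code(G2, all_dfs_codes)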
gSpan: DFS Lexicographic Order
DFS lexicographic order is a linear order on DFS codes, defined as follows.
If α = code(G_α, T_α) = (a_0, ..., a_m) and β = code(G_β, T_β) = (b_0, ..., b_n), with code entries a = (i_a, j_a, l_{i_a}, l_{(i_a,j_a)}, l_{j_a}) and b = (i_b, j_b, l_{i_b}, l_{(i_b,j_b)}, l_{j_b}), then α ≤ β iff either of the following is true:
(i) ∃ t, 0 ≤ t ≤ min(m,n): a_k = b_k for k < t, and a_t <_e b_t
(ii) a_k = b_k for 0 ≤ k ≤ m, and n ≥ m
The edge order a <_e b holds iff one of the following is true:
both are forward edges, and (j_a < j_b, or j_a = j_b and i_a > i_b)
both are backward edges, and (i_a < i_b, or i_a = i_b and j_a < j_b)
a is a backward edge, b is a forward edge, and i_a < j_b
a is a forward edge, b is a backward edge, and j_a ≤ i_b
(i_a, j_a) = (i_b, j_b), and the label triple (l_{i_a}, l_{(i_a,j_a)}, l_{j_a}) is lexicographically smaller than (l_{i_b}, l_{(i_b,j_b)}, l_{j_b})
gSpan: Example
[Figure: three example graphs (1), (2), (3) with vertex labels A, B, C]
gSpan: Example [2]
minSup = 3
Frequent vertices: A, B, C
Frequent sub-graphs with one edge:
[Figure: the one-edge candidates; those with support < minSup are discarded]
[Figure: DFS-code search trees for (1), (2), (3); branches with no frequent children and non-minimal codes are pruned]
Synthetic data
Number of graphs: 10 000
Number of frequent sub-graphs: 200
minSup: 100
N: number of labels
I: size of the sub-graphs
T: size of the graphs
Chemical data
340 molecules
66 atom types and 4 bond types as labels
on average only 27 vertices with 28 edges
[Plot: runtime (sec, log scale 1-1000) vs. support threshold (0-30%) for Apriori (FSG) and gSpan]
Summary: gSpan
Lower memory requirements.
Faster than naïve FSG by an order of magnitude.
No candidate generation.
Lexicographic ordering minimizes search tree.
Pruning of false positives.
Upgradable to other data structures
Webpage for graph mining:
http://hms.liacs.nl/graphs.html
lazar: Lazy Structure Activity Relationships
lazar Design Goals
Compliance with the OECD Principles for
(Q)SAR validation (Setubal principles)
Defined algorithm
Possibility to evaluate the mechanistic basis
Defined applicability domain (unknown features,
similarity to training instances)
Rationales for predictions
Traceable results
lazar Features
Can use various types of descriptors
Feature languages (e.g. linear fragments, subgraphs)
Structural alerts
Chemical similarity is always determined with respect to a given biological effect
Discrimination between activating and inactivating
features
Classification and Regression
Confidence (intervals) for predictions
Almost no statistical assumptions
lazar: Lazy Structure Activity Relationships
For a given chemical structure
Search for similar (with respect to a given biological activity) compounds in the training set
Classification:
Predict weighted majority class
Regression:
Predict weighted median activity
Modified k-nearest neighbour classification/regression
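A minimal sketch of the two prediction modes from similarity-weighted neighbours (illustrative names, not the actual lazar code):

    def classify(neighbours):
        # neighbours: list of (similarity, class) with class in {0, 1}
        votes = {0: 0.0, 1: 0.0}
        for sim, cls in neighbours:
            votes[cls] += sim
        return max(votes, key=votes.get)          # weighted majority class

    def weighted_median(neighbours):
        # neighbours: list of (similarity, activity) with numeric activity
        items = sorted(neighbours, key=lambda p: p[1])
        half = sum(sim for sim, _ in items) / 2.0
        acc = 0.0
        for sim, act in items:
            acc += sim
            if acc >= half:
                return act

    print(classify([(0.67, 1), (0.63, 1), (0.45, 0)]))   # -> 1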
Example Prediction
Compound #101 [120-71-8]
COc1ccc(C)cc1N
Activity: SAL
349 Neighbors
Similarity Nr. Smiles Activity
0.67 219 COc1ccc(N)cc1 1
0.67 143 COc1ccccc1N 1
0.63 379 COc1ccc(N)c(OC)c1 1
0.50 296 COc1cc(N=Nc2ccccc2)ccc1N 1
0.50 220 COc1cc(ccc1N)c2ccc(N)c(OC)c2 1
0.45 20 CC(C)Oc1ccc(Nc2ccccc2)cc1 0
0.41 466 CCOc1ccc(NC(=O)C)cc1 1
0.41 201 CCOc1ccc(NC(=O)C)cc1N 1
0.40 30 COc1ccc(C=CC)cc1 0
0.40 312 COc1ccc(NOS(=O)(=O)O)cc1N 1
...
Feature actives inactives p_active chisq p_chisq
c-N 162 65 0.23 47.76 1.00
O-c:c:c:c-C 8 33 -0.29 13.80 1.00
O-C 98 153 -0.09 8.88 1.00
Prediction: 1 (17.18)
Database Activity: 1
Why instance-based classification/regression?
Very few model assumptions
Works similarly to domain experts, so interpretations are intuitive and checkable
More specific than global models
Has to deal with fewer features than global models
Scope of the training set
Detection of unknown features
Similarity to training instances
Straightforward cross-validation
Good performance on QSAR datasets
Chemical Similarity with Respect to Different Biological Effects
[Figure: compound similarities shown separately for alkylating and antimitotic effects]
lazar Classification
For a given test structure
Determine features that are present in the test structure but not in the training structures (i.e. unknown features)
For each training structure
Determine all features that are present in the training structure or the test structure
Determine the set of most significant non-redundant features
Use their statistical significance to calculate a weighted Tanimoto index as similarity index
Classification
Similarity-weighted majority vote from all neighbors
Identification of Non-Redundant Features
Definition of Redundancy
Determination of Most Significant Non-Redundant Features
Determine statistical significances on the training set (e.g. chi-square for classification, sign test for regression)
Sort features according to their statistical
significance
For each set of redundant features
Select feature with highest statistical significance as
non-redundant feature
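A sketch of this selection step; groups is an assumed precomputed partition of the features into sets of mutually redundant features:

    def non_redundant(groups, significance):
        # Pick the most significant feature from each redundancy group.
        return [max(group, key=significance.get) for group in groups]

    significance = {"O=N": 82.72, "O-N": 55.73, "c-N": 47.76}
    print(non_redundant([{"O=N", "O-N"}, {"c-N"}], significance))   # ['O=N', 'c-N']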
Non-Redundant Features for Mutagenicity (with Structural Alerts)
Activity: SAL
Feature actives inactives p_active chisq p_chisq p_active*p_chisq
O=N 110 13 0.41 82.72 1.00 0.41
O-N 79 11 0.39 55.73 1.00 0.39
c-N 162 65 0.23 47.76 1.00 0.23
N-N 65 15 0.33 34.46 1.00 0.33
O=C-C-C-C 8 51 -0.35 28.80 1.00 -0.35
O=C-O 22 77 -0.26 27.28 1.00 -0.26
c-C-C 17 64 -0.27 24.48 1.00 -0.27
c:c:c:c:c-C-O 7 39 -0.33 20.39 1.00 -0.33
C-C-C-C-C 19 61 -0.25 19.57 1.00 -0.25
O-c:c:c-C 5 31 -0.35 17.28 1.00 -0.35
O-c:c:c:c:c-C 5 30 -0.34 16.42 1.00 -0.34
c:c:c:c:c:c:c:c:c 44 15 0.26 16.12 1.00 0.26
c:o:c:c 28 6 0.34 15.65 1.00 0.34
O-C-C-C-C 18 53 -0.23 15.19 1.00 -0.23
c-c 36 11 0.28 14.91 1.00 0.28
C=C-C-C-C 5 27 -0.33 13.87 1.00 -0.33
O-c:c:c:c-C 8 33 -0.29 13.80 1.00 -0.29
c-[F,Cl,Br,I] 17 48 -0.22 12.97 1.00 -0.22
[CH2]-[F,Cl,Br,I] 32 10 0.28 12.94 1.00 0.28
n:c:c:c:c:n:c 11 0 0.52 11.74 1.00 0.52
c:c:c:c-S-N 0 12 -0.48 11.42 1.00 -0.48
C-C-C=C-C-C 7 28 -0.28 11.40 1.00 -0.28
n:c:c:c:c:c:c:n 10 0 0.52 10.68 1.00 0.52
n:c:s:c 12 1 0.44 10.04 1.00 0.44
c:c:c:c:n:c:c:c:c:c 14 2 0.39 9.79 1.00 0.39
Determination of Activity-Related Chemical Similarity
Determine the sets of most significant non-redundant fragments for compounds A and B, with statistical significances p_i
Calculate
sim(A, B) = ( Σ_{i ∈ A∩B} p_i ) / ( Σ_{i ∈ A∪B} p_i )
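A direct transcription of this formula, with fragment sets as Python sets and p mapping each fragment to its significance:

    def weighted_tanimoto(frags_a, frags_b, p):
        inter = sum(p[f] for f in frags_a & frags_b)
        union = sum(p[f] for f in frags_a | frags_b)
        return inter / union if union else 0.0

    p = {"c-N": 0.23, "O=N": 0.41, "O-C": 0.09}
    print(weighted_tanimoto({"c-N", "O=N"}, {"c-N", "O-C"}, p))   # 0.23 / 0.73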
lazar: Lazy Structure Activity Relationships
For a given chemical structure
Search for similar (with respect to a given biological activity) compounds in the training set
Classification:
Predict weighted majority class
Regression:
Predict weighted median activity
Determination of Prediction Confidence
Classification:
Presence of unknown features: expert evaluation
Distance to training instances
Regression:
Presence of unknown features: expert evaluation
Upper and lower weighted 95% percentiles of neighbors
as 95% confidence interval
Conf = N · ( Σ sim_pos − Σ sim_neg ) / Σ sim_all
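A sketch of this confidence measure as reconstructed above (the slide formula is garbled in the source, so this reading is an assumption):

    def confidence(sim_pos, sim_neg, N=1.0):
        # Net similarity of the majority class relative to all neighbours.
        total = sum(sim_pos) + sum(sim_neg)
        return N * (sum(sim_pos) - sum(sim_neg)) / total if total else 0.0

    print(confidence([0.67, 0.63], [0.45]))   # (1.30 - 0.45) / 1.75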