
**Machine Learning in Predictive Pharmacology and Toxicology**

Andreas Karwath and Christoph Helma

Machine Learning and Data-Mining WS2004/05


Contents

Predictive Pharmacology and Toxicology

Feature Mining Algorithms

MolFea

Graph Mining

Classification/Regression Algorithms

lazar

SMIREP


Goal

Given:

Database with

Chemical structures

Biological activities

Database with untested structures

Task:

Predict biological activities of untested structures


Inductive Databases for Toxicity Predictions

CAS SMILES SAL

100-00-5 ON(=O)c1ccc(Cl)cc1 1

100-01-6 Nc1ccc(cc1)N(=O)O 1

100-40-3 C=CC1CCC=CC1 0

100-41-4 CCc1ccccc1 0

100-42-5 C=Cc1ccccc1 1

100-44-7 ClCc1ccccc1 1

100-51-6 OCc1ccccc1 0

100-52-7 O=Cc1ccccc1 0

100-63-0 NNc1ccccc1 1

100-75-4 O=NN1CCCCC1 1

100-97-0 C1N2CN3CN1CN(C2)C3 1

10034-93-2 NNOS(=O)(=O)O 1

10034-96-5 O=S1(=O)O[Mn]O1 0

10043-35-3 OB(O)O 0

10043-52-4 [Ca] 0

101-05-3 Clc1ccccc1Nc2nc(Cl)nc(Cl)n2 0

101-14-4 Nc1ccc(Cc2ccc(N)c(Cl)c2)cc1Cl 1

101-73-5 CC(C)Oc1ccc(Nc2ccccc2)cc1 0

101-80-4 Nc1ccc(Oc2ccc(N)cc2)cc1 1

101-90-6 C(Oc1cccc(OCC2CO2)c1)C3CO3 1

...

[Diagram: the training data is loaded into an inductive database engine, which is then queried with a test structure to produce a toxicity prediction]


Chemical Structures


Structural Alerts

Alkylating electrophilic centers

Unstable epoxides

Aromatic amines

Azo structures

N-nitroso groups

Aromatic nitro groups

…



Outline of a Predictive Toxicology System

Calculation/Mining for Chemical Features

Feature Selection

Classification/Regression


Chemical Features

Presence of substructures

Graph theoretic descriptors

Physico-chemical properties of the molecule (e.g. logP, HOMO, LUMO, …)

3D parameters

Biological properties (e.g. from screening assays)

Spectra (IR, NMR, MS, …)

…


Feature Selection

Feature construction

Create fewer, more predictive features (e.g. PCA, Clustering)

Wrapper methods

Score feature subsets with learning algorithm

Forward selection

Backward elimination

Effective, but time-consuming for proper cross-validation

Filter methods

Rank features according to a score function (e.g. Chi-square, r²)
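A minimal sketch (not from the slides) of the filter approach: each binary substructure feature is scored against the binary activity label with a chi-square statistic computed from its 2x2 contingency table, and the features are ranked by that score. All names are illustrative.

```python
def chi_square(feature_present, active):
    """Chi-square score of one binary feature against a binary activity label."""
    n = len(active)
    obs = {(f, a): 0 for f in (0, 1) for a in (0, 1)}
    for f, a in zip(feature_present, active):
        obs[(f, a)] += 1
    f_tot = {f: obs[(f, 0)] + obs[(f, 1)] for f in (0, 1)}
    a_tot = {a: obs[(0, a)] + obs[(1, a)] for a in (0, 1)}
    score = 0.0
    for f in (0, 1):
        for a in (0, 1):
            expected = f_tot[f] * a_tot[a] / n
            if expected > 0:
                score += (obs[(f, a)] - expected) ** 2 / expected
    return score

def rank_features(feature_matrix, active):
    """Rank feature columns by chi-square score, most significant first."""
    n_features = len(feature_matrix[0])
    scores = [chi_square([row[j] for row in feature_matrix], active)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```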


Data Mining Algorithms

Statistical Methods (e.g. various regression techniques)

Bayesian Techniques

k-Nearest Neighbors

Decision Trees

Rule Learners

Neural Nets

Support Vector Machines

…


Classical QSAR


Decision Trees


PART Rules


Support Vector Machines

+1.63 * c:c:c:c:c:c:c:c:c

+1.44 * C-Cl

+1.32 * C-C-C-C-N-C

+1.31 * C-C-C-O

+0.95 * C-C=C

+0.87 * c:c:c:c:c:n

+0.82 * C-C-C-C=C

+0.82 * C-C-C-N-C

+0.80 * c:c:c-C=O

+0.78 * C-N-C

+...

-1.48 * Cl-C-Cl

-1.45 * C-C-C=C-C

-1.01 * C-N-c:c

-1.01 * C-N-c:c:c

-0.95 * C-C

-0.95 * C-C-N-C

-0.94 * C-O-C=O

-0.94 * c:c:c:c:c:c-S

-0.94 * c:c:c:c:c-S

-0.94 * c:c:c:c-S

- ...
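Weights of this kind are what a linear model over binary fragment-presence features produces. A minimal sketch, assuming scikit-learn's LinearSVC as the learner (an assumption, not necessarily the setup behind the slide): each fragment gets one coefficient, and sorting the coefficients yields a listing like the one above.

```python
from sklearn.svm import LinearSVC

def fragment_weights(fragments, X, y):
    """X: 0/1 rows (one per molecule, one column per fragment); y: 0/1 activities.
    Returns (weight, fragment) pairs, largest weight first."""
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    weights = clf.coef_[0]          # one coefficient per fragment column
    return sorted(zip(weights, fragments), reverse=True)

# Illustrative usage with toy data (hypothetical fragments and molecules):
# fragments = ["c:c:c:c:c:c", "C-Cl", "Cl-C-Cl"]
# X = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
# y = [1, 1, 1, 0]
# for w, frag in fragment_weights(fragments, X, y):
#     print(f"{w:+.2f} * {frag}")
```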


Selection of Data Mining Algorithms

Desired structure and complexity of the models

Representational assumptions

Mechanistic background

Purpose of the model

Performance issues

Capabilities of the algorithm

Sensitivity towards noisy data

Missing values

Skewed distributions between active/inactive molecules

Personal preferences, hypes, …


Problem Setting

Non–congeneric compounds

No common mode of action

Poor knowledge about biochemical mechanisms

Several hundred to several thousand compounds in the training set

Missing values

Skewed distributions in the training set


Requirements

Informative, comprehensible and traceable output

Rationales for predictions, e.g.

Relevant features

Similar compounds in the training set

Rules that are applicable for the test compound

Necessary information depends on prediction algorithm


Requirements

Informative, comprehensible and traceable output

Confidence in predictions, e.g.

Confidence intervals

Classification probabilities

Scope of the training set, e.g.

Unknown features of the test structure

Similarity to compounds in the training set


The Molecular Feature Miner (MolFea)

Given: datasets with molecules

Queries for molecular fragments (substructures) with constraints concerning frequency and syntax

e.g. ('Cl' ≤ f) ∧ (freq(f, Carcinogens) > 5) ∧ (freq(f, NonCarcinogens) ≤ 2)

Solver: based on the levelwise version space algorithm


Linear Molecular Fragments

A fragment is a sequence of linearly connected atoms (e.g. 'O-c:c:c:c-Cl')

O, C, Cl, N, S, ... denote elements

- ... single bond

= ... double bond

# ... triple bond

: ... aromatic bond

(hydrogens implicit)

SMARTS encoding, e.g. 'O-c:c:c:c-Cl'


Properties of Linear Molecular Fragments

g ~ s

g is equivalent to s (syntactic variants) only when they are a reversal of one another

E.g. 'C-O-S' and 'S-O-C' denote the same substructure

g ≤ s

g is more general than s if and only if g is a subsequence of s or g is a subsequence of the reversal of s

a partial order on fragments in M

E.g. 'C-c' ≤ 'C-c:c:c'

E.g. 'c:C' ≤ 'C-C-C:c'
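A minimal sketch of the two relations, interpreting "subsequence" as a contiguous piece of the atom/bond token sequence (fragments are connected paths); the function names are illustrative.

```python
def tokens(fragment):
    """Split a linear fragment like 'O-c:c:c:c-Cl' into alternating
    atom and bond tokens: ['O', '-', 'c', ':', 'c', ...]."""
    out, cur = [], ""
    for ch in fragment:
        if ch in "-=#:":
            out.append(cur)
            out.append(ch)
            cur = ""
        else:
            cur += ch
    out.append(cur)
    return out

def canonical(fragment):
    """Syntactic variants ('C-O-S' vs. 'S-O-C') share the same canonical form."""
    t = tokens(fragment)
    return min(t, list(reversed(t)))

def more_general(g, s):
    """g <= s: g occurs as a contiguous piece of s or of the reversal of s."""
    tg, ts = tokens(g), tokens(s)
    def contains(big, small):
        return any(big[i:i + len(small)] == small
                   for i in range(len(big) - len(small) + 1))
    return contains(ts, tg) or contains(list(reversed(ts)), tg)
```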


Primitive Constraints

f ≤ P, P ≤ f, not (f ≤ P) and not (P ≤ f):

f ... unknown target fragment,

P ... a specific fragment

e.g. 'c:c:c' ≤ f

freq(f, D)

absolute frequency of a fragment f on a set of molecules D

freq(f, D1) > t, freq(f, D2) ≤ t,

t ... positive integer

D1, D2 ... sets of molecules

e.g. freq(f, Pos) > 20


Computing Borders

Borders completely characterize the set of solutions

e.g.: if C-c is a solution and C-c:c:c-N is a solution -> {C-c:c, C-c:c:c} are also solutions

Compute

the set of the most general solutions G (i.e. shortest fragments)

the set of the most specific solutions S (i.e. longest fragments)

to determine all solutions

Conjunctive queries: update G or S


Levelwise Version Space Algorithm

[Diagram: fragments ordered from general (top) to specific (bottom). Above the most general border G, fragments are too general (too frequent w.r.t. the max. frequency constraint); below the most specific border S, fragments are too specific (infrequent w.r.t. the min. frequency constraint); the solutions lie between G and S.]


Levelwise Version Space Algorithm

[Diagram: a min. frequency constraint moves the most specific border S upwards to S'; the solutions lie between G and the updated S'.]


Levelwise Version Space Algorithm

[Diagram: a max. frequency constraint additionally moves the most general border G downwards to G'; the solutions lie between the updated borders G' and S'.]


Example Dataset


Example Query

freq(f,A) > 2

LEVEL 1:

Candidates: [Li], [Be], B, C, N, O, F, [Na], [Mg], [Al], [Si], P, S, Cl, [K], [Ca], [Sc], [Ti], [V], [Cr], [Mn], [Fe], [Co], [Ni], [Cu], [Zn], [Ga], [Ge], [As], [Se], Br, [Rb], [Sr], [Y], [Zr], [Nb], [Mo], [Tc], [Ru], [Rh], [Pd], [Ag], [Cd], [In], [Sn], [Sb], [Te], I, [Cs], [Ba], [Lu], [Hf], [Ta], [W], [Re], [Os], [Ir], [Pt], [Au], [Hg], [Tl], [Pb], [Bi], [Po], [At], [Rn], [Fr], [Ra], [Lr], c, n, s, o, p (78)

Frequent: C, N, O, c (4)

LEVEL 2:

Candidates: C-C, C-N, C-O, C-c, C=C, C=N, C=O, C=c, C#C, C#N, C#O, C#c, N-N, N-O, N-c, N=N, N=O, N=c, N#N, N#O, N#c, O-O, O-c, O=O, O=c, O#O, O#c, c-c, c=c, c#c (30)

Frequent: C-C, C-N, C-O, C-c, c-c (5)


Example Query II

LEVEL 3:

Candidates: C-C-C, C-C-N, C-C-O, C-C-c, N-C-N, N-C-O, N-C-c, C-N-C, O-C-O, O-C-c, C-O-C, C-c-c, c-C-c, C-c-C, c-c-c (15)

Frequent: C-C-C, C-C-N, C-C-O, C-C-c, C-c-c, c-c-c (6)

LEVEL 4:

Candidates: C-C-C-C, C-C-C-N, C-C-C-O, C-C-C-c, N-C-C-N, N-C-C-O, N-C-C-c, O-C-C-O, O-C-C-c, C-C-c-c, c-C-C-c, C-c-c-c, C-c-c-C, c-c-c-c (14)

Frequent: C-C-C-O, C-C-C-c, N-C-C-O, N-C-C-c, C-C-c-c, C-c-c-c, c-c-c-c (7)

LEVEL 5:

Candidates: O-C-C-C-O, O-C-C-C-c, C-C-C-c-c, c-C-C-C-c, N-C-C-c-c, C-C-c-c-c, C-c-c-c-c, C-c-c-c-C, c-c-c-c-c (9)

Frequent: C-C-C-c-c, N-C-C-c-c, C-C-c-c-c, C-c-c-c-c, c-c-c-c-c (5)

LEVEL 6:

Candidates: C-C-C-c-c-c, N-C-C-c-c-c, C-C-c-c-c-c, C-c-c-c-c-c, C-c-c-c-c-C, c-c-c-c-c-c (6)

Frequent: C-C-C-c-c-c, N-C-C-c-c-c, C-C-c-c-c-c, C-c-c-c-c-c, c-c-c-c-c-c (5)


Example Query III

LEVEL 7:

Candidates: C-C-C-c-c-c-c, N-C-C-c-c-c-c, C-C-c-c-c-c-c, C-c-c-c-c-c-c, C-c-c-c-c-c-C, c-c-c-c-c-c-c (6)

Frequent: C-C-C-c-c-c-c, N-C-C-c-c-c-c, C-C-c-c-c-c-c, C-c-c-c-c-c-c (4)

LEVEL 8:

Candidates: C-C-C-c-c-c-c-c, N-C-C-c-c-c-c-c, C-C-c-c-c-c-c-c, C-c-c-c-c-c-c-C (4)

Frequent: C-C-C-c-c-c-c-c, N-C-C-c-c-c-c-c, C-C-c-c-c-c-c-c (3)

LEVEL 9:

Candidates: C-C-C-c-c-c-c-c-c, N-C-C-c-c-c-c-c-c (2)

Frequent: C-C-C-c-c-c-c-c-c, N-C-C-c-c-c-c-c-c (2)

LEVEL 10:

Candidates: (0)

Frequent: (0)


============================================

G: C, N, O, c (4)

S: C-C-C-O, N-C-C-O, C-C-C-c-c-c-c-c-c, N-C-C-c-c-c-c-c-c (4)

============================================
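A minimal sketch of such a levelwise search for the minimum-frequency constraint alone (assumptions: molecules are already given as labeled graphs rather than SMILES, and syntactic variants, i.e. reversals, are not merged as MolFea would do). Frequent fragments of length k are extended by one bond and atom, and only candidates occurring in at least min_freq molecules survive.

```python
def occurs(fragment, mol):
    """fragment: token list like ['C', '-', 'c', ':', 'c'];
    mol: (atom_labels, bonds) with atom_labels = {idx: 'C', ...} and
    bonds = {idx: [(neighbour_idx, ':'), ...], ...}."""
    atoms, bonds = mol

    def match(seq):
        def extend(pos, idx, used):
            if pos == len(seq):
                return True
            bond, atom = seq[pos], seq[pos + 1]
            return any(extend(pos + 2, nbr, used | {nbr})
                       for nbr, b in bonds.get(idx, [])
                       if b == bond and atoms[nbr] == atom and nbr not in used)
        return any(atoms[i] == seq[0] and extend(1, i, {i}) for i in atoms)

    # a fragment matches if it occurs as written or as its reversal
    return match(fragment) or match(list(reversed(fragment)))

def frequency(fragment, mols):
    return sum(occurs(fragment, m) for m in mols)

def levelwise(mols, min_freq):
    """All linear fragments occurring in at least min_freq molecules."""
    elements = {lbl for atoms, _ in mols for lbl in atoms.values()}
    bond_types = {b for _, bonds in mols for nbrs in bonds.values() for _, b in nbrs}
    level = [[e] for e in sorted(elements) if frequency([e], mols) >= min_freq]
    solutions = list(level)
    frequent_atoms = {f[0] for f in level}
    while level:
        candidates = {tuple(f + [b, a]) for f in level
                      for b in sorted(bond_types) for a in sorted(frequent_atoms)}
        level = [list(c) for c in sorted(candidates)
                 if frequency(list(c), mols) >= min_freq]
        solutions.extend(level)
    return solutions
```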


Example Dataset

S: C-C-C-O, N-C-C-O, C-C-C-c-c-c-c-c-c, N-C-C-c-c-c-c-c-c


MolFea for Min. Frequency Constraint


MolFea Problems

Goal: Identification of the most predictive fragments

Restriction to linear fragments

MolFea works only with (anti-)monotonic constraints

Extension towards convex evaluation functions (e.g. Chi-square) possible, but rather inefficient

Necessity to keep track of multiple activities


Graph Mining

Finding recurring sub-graphs in a graph database

FSG

gSpan


Labeled Graph

We define a labeled graph G as a five element tuple G = (V, E, Σ_V, Σ_E, ℓ) where

V is the set of vertices of G,

E ⊆ V × V is a set of undirected edges of G,

Σ_V (Σ_E) are the sets of vertex (edge) labels,

ℓ is the labeling function, ℓ: V → Σ_V and E → Σ_E, which maps vertices and edges to their labels.

[Figure: three example labeled graphs P (vertices p1-p5), Q (vertices q1-q3) and S (vertices s1-s3) with vertex labels a, b, c, d and edge labels x, y]


Frequent Sub-graph Mining

Given a graph database GD = {G_0, G_1, …, G_n}, find all sub-graphs appearing in at least min_sup graphs.

Methodology

Transaction ↔ Labeled graph

Item ↔ Vertex

Item set ↔ Sub-graph (induced, connected, …)

Size of an item set ↔ Number of edges/vertices (size of a sub-graph)


Frequent Sub-graph Mining

min_sup = 2

Input: A set GD of labeled undirected graphs

[Figure: the example graphs P, Q and S from the previous slide, and the sub-graphs occurring in at least min_sup = 2 of them]

Output: All frequent sub-graphs (w.r.t. min_sup) from GD.


Graph/Subgraph Isomorphism: Hard Problems

Graph isomorphism

Determine if two graphs are equivalent.

Suspected to be neither in P nor NP-complete.

Subgraph isomorphism

Determine if a graph is a part of another.

NP-complete

Isomorphic sub-graphs are considered the same sub-graph.

Canonical labeling is equivalent to graph isomorphism.
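A minimal sketch of both checks using networkx (the library choice is an assumption; the slides do not prescribe one) on small vertex- and edge-labeled graphs.

```python
import networkx as nx
from networkx.algorithms import isomorphism

def make_graph(vertex_labels, edges):
    """vertex_labels: {id: 'a', ...}; edges: [(u, v, 'x'), ...]."""
    g = nx.Graph()
    for v, lbl in vertex_labels.items():
        g.add_node(v, label=lbl)
    for u, v, lbl in edges:
        g.add_edge(u, v, label=lbl)
    return g

node_eq = isomorphism.categorical_node_match("label", None)
edge_eq = isomorphism.categorical_edge_match("label", None)

def same_graph(g1, g2):
    """Graph isomorphism: are the two labeled graphs equivalent?"""
    return nx.is_isomorphic(g1, g2, node_match=node_eq, edge_match=edge_eq)

def contains_subgraph(big, small):
    """Subgraph isomorphism (NP-complete): does `small` occur inside `big`?
    Note: this checks the node-induced variant."""
    gm = isomorphism.GraphMatcher(big, small, node_match=node_eq, edge_match=edge_eq)
    return gm.subgraph_is_isomorphic()
```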


Finding Frequent Subgraphs

Greedy searches (GBI)

Inductive logic programming (ILP)

Inductive database approach (MolFea)

Graph theory based approaches:

Apriori approaches (AGM, FSG, …)

Generation of sub-graph candidates

Involves sub-graph isomorphism tests, which is an NP-complete problem, so pruning is expensive.

DFS based (gSpan)

Kernel based


FSG: Frequent Subgraph Discovery Algorithm

Incremental and breadth-first on the size of frequent sub-graphs (like Apriori for frequent item sets)

Counting of frequent single- and double-edge sub-graphs

For finding frequent size-k sub-graphs (k ≥ 3):

Candidate generation

Joining two size-(k–1) sub-graphs similar to each other

Candidate pruning by the downward closure property

Frequency counting

Check if a sub-graph is contained in a transaction

Repeat the steps for k = k + 1

Increase the size of sub-graphs by one edge

FSG finds frequent connected sub-graphs in a bottom-up, breadth-first manner


FSG: Algorithm


Trivial Operations Become Complicated With Graphs…

Candidate generation

To determine two candidates for joining, one needs to perform sub-graph isomorphism testing.

Isomorphism for redundancy check

Candidate pruning

To check the downward closure property, a sub-graph isomorphism check is needed.

Frequency counting

Sub-graph isomorphism for checking containment of a frequent sub-graph

How to reduce the number of graph/sub-graph isomorphism operations?


FSG: Candidate Generation

To generate a size-k candidate (k edges)

Take the intersection of the parent lists of two (k–1)-frequent sub-graphs

To see if the two (k–1)-frequent sub-graphs share the same size-(k–2) parent

Parent lists are obtained at the pruning phase

Sub-graph isomorphism free!

Example

parent(c_5) = { g_4, h_4, i_4 }, parent(d_5) = { f_4, g_4, h_4 }

Generate size-6 candidates from the cores g_4 and h_4.

Canonical labeling for redundancy check


FSG: Candidate Generation

Intersection of the parent lists of two 3-frequent sub-graphs

Without sub-graph isomorphism, one can detect the core of the two 3-frequent sub-graphs.

Redundancy check by canonical labeling


FSG: CG Core Extension


FSG: CG Core Extension II


FSG: Candidate Pruning

Downward closure property

Every (k–1)-sub-graph must be frequent

Keep the list of (k–1)-sub-graphs


FSG: Candidate Pruning

Pruning of size-k candidates

For all the (k–1)-sub-graphs of a size-k candidate, check if the downward closure property holds.

Canonical labeling is used to speed up the computation.

Build the parent list of (k–1)-frequent sub-graphs for the k-candidate.

Used later in the candidate generation, if this candidate survives the frequency counting check.
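A minimal sketch of the pruning step: a size-k candidate survives only if each of its connected (k–1)-edge sub-graphs is already known to be frequent. The "canonical form" used here (a sorted list of labeled edges) is a simplification of FSG's canonical labeling and is ambiguous for general graphs; it only illustrates the logic.

```python
from itertools import combinations

def connected(edge_subset):
    """edge_subset: set of (u, v) endpoint pairs; True if they form one component."""
    verts = {x for e in edge_subset for x in e}
    if not verts:
        return False
    seen, stack = set(), [next(iter(verts))]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack += [w for (a, b) in edge_subset for w in (a, b) if v in (a, b)]
    return seen == verts

def crude_canonical(edges, vlabel):
    """Sorted list of labeled edges -- a stand-in for a real canonical labeling."""
    return tuple(sorted(tuple(sorted((vlabel[u], vlabel[v]))) + (lbl,)
                        for u, v, lbl in edges))

def survives_pruning(candidate_edges, vlabel, frequent_forms):
    """candidate_edges: list of (u, v, edge_label); frequent_forms: set of
    crude_canonical(...) values of the frequent (k-1)-edge sub-graphs."""
    k = len(candidate_edges)
    for subset in combinations(candidate_edges, k - 1):
        if not connected({(u, v) for u, v, _ in subset}):
            continue                      # only connected sub-graphs are checked
        if crude_canonical(subset, vlabel) not in frequent_forms:
            return False                  # an infrequent (k-1)-sub-graph: prune
    return True
```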


FSG: Frequency Counting

Employs transaction IDs in TID lists

If a size-k candidate is contained in a transaction, all its size-(k–1) parents must be contained in the same transaction.

Perform sub-graph isomorphism only on the intersection of the TID lists of the candidate's parent frequent sub-graphs of size k–1.

Significantly reduces the number of sub-graph isomorphism checks.

Trade-off between running time and memory
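A minimal sketch of TID-list based counting: sub-graph isomorphism is attempted only on the transactions that contain all (k–1)-parents of the candidate. The `contains_subgraph` test is assumed to be given (for example the networkx helper sketched earlier).

```python
def count_with_tid_lists(candidate, parents, tid_lists, graphs, contains_subgraph):
    """parents: canonical forms of the candidate's (k-1)-parents;
    tid_lists: {parent_form: set of graph ids containing that parent};
    graphs: {graph id: labeled graph}."""
    # only these transactions can possibly contain the candidate
    possible = set.intersection(*(tid_lists[p] for p in parents))
    # the expensive isomorphism test is run on this intersection only
    tids = {gid for gid in possible if contains_subgraph(graphs[gid], candidate)}
    return len(tids), tids   # support and the candidate's own TID list
```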


FSG: Summary

Faster than naive Apriori-based graph miners such as AGM

Finds connected sub-graphs (AGM: induced sub-graphs)

High memory consumption (TID lists)

Still a lot of sub-graph isomorphism testing


gSpan: Graph-Based Substructure Pattern Mining

Avoids cost-intensive steps like

Candidate generation

Isomorphism testing

Uses two main concepts:

DFS lexicographic order

Minimum DFS code


Search space for finding sub-graphs

[Search tree levels: no edge, 1 edge, 2 edges, …]

Every node is equivalent to one sub-graph

An extra edge is added for each child vertex


gSpan: Extensions to sub-graphs

Introduction of a lexicographic order on graphs

Limits the extension of an existing sub-graph: edges are only added to the rightmost vertex of the path


gSpan: Extensions to sub-graphs [2]

[Figure: extension examples (1)-(5)]


gSpan: Extensions to sub-graphs [3]


gSpan: Extensions to sub-graphs [4]

[Figure: a branch of the search tree is pruned]


gSpan: DFS Code

Canonical encoding of sub-graphs

One code line per edge

Enables checking against previously generated isomorphic sub-graphs


gSpan: DFS Code [2]

[Figure: a labeled graph with vertices X, X, Y, Z, Z and edge labels a, b, c, d, together with a DFS traversal numbering the vertices 0-4]

Each code entry is a tuple (i, j, l_i, l_(i,j), l_j):

0 (0,1,X,a,Y)
1 (1,2,Y,b,X)
2 (2,0,X,a,X)
3 (2,3,X,c,Z)
4 (3,1,Z,b,Y)
5 (1,4,Y,d,Z)


gSpan: DFS Code [4]

[Figure: two DFS codes of the same graph over vertex labels A and B - (0,1,A,-,B), (1,2,B,-,A), (2,3,A,-,B) versus (0,1,A,-,B), (1,2,B,-,A), (0,3,A,-,B); the first of these, which extends from the rightmost vertex, is the minimal one]


gSpan: minimal DFS Code Algorithm

start with the alphanumerically lowest 1-edge subgraph

number the two vertices 0 and 1, mark the edge as explored

set k = 1

while unexplored edges exist:

if possible,

add a backward edge from the rightmost vertex to a vertex on the rightmost path (in ascending order of the vertex number), using the smallest vertex-edge-vertex label

mark the edge as explored

else if possible,

add a forward edge and the corresponding vertex to a vertex on the rightmost path (in descending order of the vertex number), using the smallest vertex-edge-vertex label

label the new vertex with k+1

set k = k+1

mark the edge as explored


gSpan: Extensions to sub-graphs [2]

[Figure: extension examples (1)-(5)]


gSpan: minimal DFS Code Example

[Figure: DFS traversal of a graph with vertices labeled X, X, Y, Z, Z (numbered 0-4) and edge labels a, b, c, d]

Generating the DFS code (i, j, l_i, l_(i,j), l_j):

(0,1,X,a,X)
(1,2,X,a,Y)
(2,0,Y,b,X)
(2,3,Y,b,Z)
(3,0,Z,c,X)
(2,4,Y,d,Z)


gSpan: DFS Lexicographical Order

α = code(G_α, T_α) = (a_0, a_1, …, a_m)

β = code(G_β, T_β) = (b_0, b_1, …, b_n)

α ≤ β iff (1) or (2):

(1) ∃ t, 0 ≤ t ≤ min(m, n), such that a_k = b_k for k < t and a_t <_e b_t

(2) a_k = b_k for 0 ≤ k ≤ m, and n ≥ m

Minimum DFS code

The minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G.

Graphs A and B are isomorphic if min(A) = min(B).
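A minimal sketch of conditions (1) and (2): a DFS code is a list of (i, j, l_i, l_(i,j), l_j) tuples and, as a simplification, plain tuple comparison stands in for the full edge order <_e formalized on the next slide.

```python
def dfs_code_leq(alpha, beta, edge_lt=lambda a, b: a < b):
    """alpha, beta: lists of (i, j, l_i, l_ij, l_j) tuples.
    True if alpha <= beta in DFS lexicographic order."""
    for a_t, b_t in zip(alpha, beta):
        if a_t == b_t:
            continue
        return edge_lt(a_t, b_t)          # condition (1): first differing entry
    return len(alpha) <= len(beta)        # condition (2): alpha is a prefix of beta

# The minimum DFS code of a graph (its canonical label) is the smallest of all its
# DFS codes under this order; two graphs are isomorphic iff these minimum codes agree.
```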


gSpan: DFS Lexicographic Order

DFS lexicographic order is a linear order on DFS codes, defined as follows. If α = (a_0, a_1, …, a_m) and β = (b_0, b_1, …, b_n) are DFS codes, then α ≤ β iff either of the following is true:

(i) ∃ t, 0 ≤ t ≤ min(m, n), such that a_k = b_k for k < t and a_t <_e b_t

(ii) a_k = b_k for 0 ≤ k ≤ m, and n ≥ m

The edge order <_e compares two entries a_t = (i_a, j_a, l_ia, l_(ia,ja), l_ja) and b_t = (i_b, j_b, l_ib, l_(ib,jb), l_jb) first on their endpoints and then on their labels:

both backward edges: a_t <_e b_t if j_a < j_b

a_t backward, b_t forward: a_t <_e b_t if i_a < j_b

a_t forward, b_t backward: a_t <_e b_t if j_a ≤ i_b

both forward edges: a_t <_e b_t if i_b < i_a, or if i_a = i_b and j_a < j_b

equal endpoints: the labels (l_i, l_(i,j), l_j) are compared lexicographically


gSpan: Extensions to sub-graphs [4]

[Figure: a branch of the search tree is pruned]


gSpan: Example

[Figure: three example graphs (1), (2) and (3) with vertex labels A, B and C]


gSpan: Example [2]

minSup = 3

Frequent vertices: A, B, C

Frequent sub-graphs with one edge: [shown as a figure for graphs (1), (2) and (3); sub-graphs with support < minSup are discarded]


[Figure: the gSpan search tree for the example graphs (1), (2) and (3) - branches with no frequent children and non-minimal DFS codes are pruned]


Synthetic data

Number of graphs: 10 000

Number of frequent

sub-graphs: 200

minSup: 100

N: number of labels

I: size of the sub-graphs

T: size of graphs


Chemical data

340 molecules

66 atom types and 4 bond types as labels

on average only 27 vertices with 28 edges

[Plot: runtime (sec, 1-1000, log scale) vs. support threshold (0-30 %) for Apriori (FSG) and gSpan]


Summary: gSpan

Lower memory requirements.

Faster than naïve FSG by an order of magnitude.

No candidate generation.

Lexicographic ordering minimizes search tree.

Pruning of false positives.

Upgradable to other data structures

Webpage for graph mining:

http://hms.liacs.nl/graphs.html


lazar: Lazy Structure Activity Relationships


lazar Design Goals

Compliance with the OECD principles for (Q)SAR validation (Setubal principles)

Defined algorithm

Possibility to evaluate the mechanistic basis

Defined applicability domain (unknown features, similarity to training instances)

Rationales for predictions

Traceable results


lazar Features

Can use various types of descriptors

Feature languages (e.g. linear fragments, subgraphs)

Structural alerts

Chemical similarity is always determined with respect to a given biological effect

Discrimination between activating and inactivating features

Classification and Regression

Confidence (intervals) for predictions

Almost no statistical assumptions


lazar: Lazy Structure Activity Relationships

For a given chemical structure

Search for similar compounds (with respect to the given biological activity) in the training set

Classification:

Predict the weighted majority class

Regression:

Predict the weighted median activity

Modified k-nearest neighbour classification/regression
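A minimal sketch of the prediction step only (the neighbour search and the similarity measure themselves are assumed to be given): a similarity-weighted majority vote for classification and a similarity-weighted median for regression.

```python
def predict_class(neighbours):
    """neighbours: list of (similarity, activity) with activity in {0, 1}.
    Similarity-weighted majority vote."""
    votes = {0: 0.0, 1: 0.0}
    for sim, act in neighbours:
        votes[act] += sim
    return max(votes, key=votes.get)

def predict_median(neighbours):
    """neighbours: list of (similarity, activity) with numeric activities.
    Similarity-weighted median of the neighbours' activities."""
    ordered = sorted(neighbours, key=lambda sa: sa[1])
    half = sum(sim for sim, _ in ordered) / 2.0
    running = 0.0
    for sim, act in ordered:
        running += sim
        if running >= half:
            return act
```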


Example Prediction

Compound #101 [120-71-8]

COc1ccc(C)cc1N

Activity: SAL

349 Neighbors

Similarity Nr. Smiles Activity

0.67 219 COc1ccc(N)cc1 1

0.67 143 COc1ccccc1N 1

0.63 379 COc1ccc(N)c(OC)c1 1

0.50 296 COc1cc(N=Nc2ccccc2)ccc1N 1

0.50 220 COc1cc(ccc1N)c2ccc(N)c(OC)c2 1

0.45 20 CC(C)Oc1ccc(Nc2ccccc2)cc1 0

0.41 466 CCOc1ccc(NC(=O)C)cc1 1

0.41 201 CCOc1ccc(NC(=O)C)cc1N 1

0.40 30 COc1ccc(C=CC)cc1 0

0.40 312 COc1ccc(NOS(=O)(=O)O)cc1N 1

...

Feature actives inactives p_active chisq p_chisq

c-N 162 65 0.23 47.76 1.00

O-c:c:c:c-C 8 33 -0.29 13.80 1.00

O-C 98 153 -0.09 8.88 1.00

Prediction: 1 (17.18)

Database Activity: 1


Why instance based classification/regression?

Very few model assumptions

Works similarly to domain experts -> the interpretation is intuitive and checkable

More specific than global models

Has to deal with less features than global models

Scope of the training set

Detection of unknown features

Similarity to training instances

Straightforward cross-validation

Good performance on QSAR datasets


Chemical Similarity in Respect to Different Biological Effects

[Figure: two panels, "Alkylating" and "Antimitotic", illustrating chemical similarity with respect to different biological effects]


lazar Classification

For a given test structure

Determine features that are present in the test structure but not in the training structures (i.e. unknown features)

For each training structure

Determine all features that are present in the training structure or the test structure

Determine the set of most significant non-redundant features

Use their statistical significance to calculate a weighted Tanimoto index as the similarity index

Classification

Similarity-weighted majority vote from all neighbors


Identification of Non-Redundant Features


Definition of Redundancy


Determination of the Most Significant Non-Redundant Features

Determine statistical significances on the training set (e.g. chi-square for classification, sign test for regression)

Sort features according to their statistical significance

For each set of redundant features

Select the feature with the highest statistical significance as the non-redundant feature
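A minimal sketch of this selection step: features are visited in order of decreasing significance and kept only if they are not redundant with an already selected feature; the redundancy test itself is assumed to be given (it follows the definition on the previous slides).

```python
def select_non_redundant(features, significance, redundant):
    """features: list of feature names; significance: {feature: score};
    redundant(f, g): True if f and g are redundant for the test compound."""
    selected = []
    for f in sorted(features, key=significance.get, reverse=True):
        if not any(redundant(f, g) for g in selected):
            selected.append(f)      # most significant member of its redundancy group
    return selected
```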


Non-Redundant Features for Mutagenicity (with Structural Alerts)

Activity: SAL

Feature actives inactives p_active chisq p_chisq p_active*p_chisq

O=N 110 13 0.41 82.72 1.00 0.41

O-N 79 11 0.39 55.73 1.00 0.39

c-N 162 65 0.23 47.76 1.00 0.23

N-N 65 15 0.33 34.46 1.00 0.33

O=C-C-C-C 8 51 -0.35 28.80 1.00 -0.35

O=C-O 22 77 -0.26 27.28 1.00 -0.26

c-C-C 17 64 -0.27 24.48 1.00 -0.27

c:c:c:c:c-C-O 7 39 -0.33 20.39 1.00 -0.33

C-C-C-C-C 19 61 -0.25 19.57 1.00 -0.25

O-c:c:c-C 5 31 -0.35 17.28 1.00 -0.35

O-c:c:c:c:c-C 5 30 -0.34 16.42 1.00 -0.34

c:c:c:c:c:c:c:c:c 44 15 0.26 16.12 1.00 0.26

c:o:c:c 28 6 0.34 15.65 1.00 0.34

O-C-C-C-C 18 53 -0.23 15.19 1.00 -0.23

c-c 36 11 0.28 14.91 1.00 0.28

C=C-C-C-C 5 27 -0.33 13.87 1.00 -0.33

O-c:c:c:c-C 8 33 -0.29 13.80 1.00 -0.29

c-[F,Cl,Br,I] 17 48 -0.22 12.97 1.00 -0.22

[CH2]-[F,Cl,Br,I] 32 10 0.28 12.94 1.00 0.28

n:c:c:c:c:n:c 11 0 0.52 11.74 1.00 0.52

c:c:c:c-S-N 0 12 -0.48 11.42 1.00 -0.48

C-C-C=C-C-C 7 28 -0.28 11.40 1.00 -0.28

n:c:c:c:c:c:c:n 10 0 0.52 10.68 1.00 0.52

n:c:s:c 12 1 0.44 10.04 1.00 0.44

c:c:c:c:n:c:c:c:c:c 14 2 0.39 9.79 1.00 0.39


Determination of Activity-Related Chemical Similarity

Determine the set of most significant non-redundant fragments for compounds A and B, with statistical significances p_i

Calculate

sim(A, B) = ( Σ_{A∩B} p_i ) / ( Σ_{A∪B} p_i )
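A minimal sketch of this activity-weighted similarity: the significances of the features shared by both compounds divided by the significances of all features of either compound.

```python
def weighted_similarity(features_a, features_b, p):
    """features_a/b: sets of feature names; p: {feature: significance p_i}."""
    shared = features_a & features_b
    union = features_a | features_b
    denom = sum(p[f] for f in union)
    return sum(p[f] for f in shared) / denom if denom else 0.0
```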


lazar: Lazy Structure Activity Relationships

For a given chemical structure

Search for similar compounds (with respect to the given biological activity) in the training set

Classification:

Predict weighted majority class

Regression:

Predict weighted median activity


Determination of Prediction Confidence

Classification:

Presence of unknown features: expert evaluation

Distance to training instances

Regression:

Presence of unknown features: expert evaluation

Upper and lower weighted 95% percentiles of the neighbors as the 95% confidence interval

Conf = N · ( Σ sim_pos − Σ sim_neg ) / Σ sim_all
