
Data mining and querying of integrated chemical and biological information using Chem2Bio2RDF

David J. Wild
Assistant Professor & Director, Cheminformatics Program
Indiana University School of Informatics and Computing
djwild@indiana.edu - http://djwild.info
Overview
  Background – Big Data drug discovery & Chem2Bio2RDF
  Contextualizing Chem2Bio2RDF – BioRDF, LODD, LOD
  Using SPARQL queries for polypharmacology
  Finding all links between any two entities
  Algorithm
  Pathfinder visualization
  Visualization tools
  ChemBioScape: Visualization & pathfinding in Cytoscape
  PlotViz: Visualization & SPARQL querying in 3D Chemical Space
  BioLDA and Topic Models: Advanced Literature Mining
  Summary

Chem2Bio2RDF David Wild, August 2010. http://djwild.info.


Epochs in drug discovery
Empirical – up until 1960’s
  754 – First pharmacy opened in Baghdad
  Late 1800’s – major pharmaceutical companies, mass production
  1900-1960 – major discoveries (insulin, penicillin, the pill …)

Rational – 1960’s to 1990’s
  Designing molecules to target protein active sites – “lock and key”
  Computational Drug Discovery
  Biggest success: HIV (RT, protease inhibitors)

Big Experiment – 1990’s to 2000’s
  High throughput screening
  Microarray Assays
  Gene Sequencing and Human Genome Project

Big Data – 2010’s onwards
  Informatics-driven drug discovery
  Accepting the body is complex and we don’t understand it well
  Everything is connected



Big Data in the public domain
  There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters, the public domain holds information on:
  69 million compounds and 449,392 bioassays (PubChem)
  4,763 drugs (DrugBank)
  9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
  14 million human nucleotide sequences (EMBL)
  19 million life science publications - 800,000 new each year (PubMed)
  Multitude of other sets (drugs, toxicogenomics, chemogenomics, SAR, …)

  Even more important are the relationships between these entities. For example
a chemical compound can be linked to a gene or a protein target in a multitude
of ways:
  Biological assay with percent inhibition, IC50, etc.
  Crystal structure of ligand/protein complex
  Co-occurrence in a paper abstract
  Computational experiment (docking, predictive model)
  Statistical relationship
  System association (e.g. involved in the same pathways or cellular processes)
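One way to picture these typed relationships is as subject-predicate-object triples, which is the shape RDF gives them. The following is a minimal sketch in plain Python; every identifier and predicate name in it is hypothetical and is not taken from the actual Chem2Bio2RDF vocabulary.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples. The identifiers and the
# predicate names are illustrative only, not the Chem2Bio2RDF schema.
TRIPLES = [
    ("compound:C1", "assay_active_against", "protein:P1"),
    ("compound:C1", "cocrystallized_with", "protein:P1"),
    ("compound:C1", "co_occurs_in_abstract_with", "gene:G1"),
    ("compound:C2", "docked_against", "protein:P1"),
    ("protein:P1", "encoded_by", "gene:G1"),
]


def links_between(triples, subject, obj):
    """Return every predicate that directly links subject to obj."""
    return [p for s, p, o in triples if s == subject and o == obj]


def neighbours(triples):
    """Index triples as subject -> list of (predicate, object) pairs."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
    return index


# One compound-protein pair can be linked in more than one way.
print(links_between(TRIPLES, "compound:C1", "protein:P1"))
```

Keeping the relationship type on the edge, rather than merging all evidence into a single "related to" link, is what lets later queries distinguish an assay result from a literature co-occurrence.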



PubChem growth since 2005

[Figure: PubChem Substance Size 2005-2010. Linear axis from 0 to 80,000,000 substances, 2005-01 to 2010-07; labeled values 2,824,265, 35,379,748, 56,774,950 and 69,088,100; annotation "Addition of ChemSpider".]

[Figure: PubChem Bioassays 2005-2010. Logarithmic axis from 1 to 1,000,000 bioassays, 2005-01 to 2010-07; labeled value 434,635; annotation "Addition of ChEMBL".]

Large amount of data and links for each compound



Proteins & Genes

http://www.genome.jp/en/db_growth.html



NGS and PHR/EHRs add new dimensions to data



Informatics-based drug discovery

Predicting new molecular targets for known drugs. Nature 462, 175-181 (12 November 2009)



“Systems chemical biology” and chemogenomics



Chem2Bio2RDF BMC Bioinformatics, 2010, 11, 255; chem2bio2rdf.org



The Semantic Web – meaning & relationships



Chem2Bio2RDF – RDF integration & SPARQL querying

Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J.
Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems
chemical biology data. BMC Bioinformatics 2010, 11, 255



Chem2Bio2RDF context



Chem2Bio2RDF Relationships



Linked Open Data Cloud (linkeddata.org)



Converting data into RDF



SPARQL Interface to Chem2Bio2RDF



Finding multi-target inhibitors of MAPK pathway with a SPARQL query



Finding compounds with similar polypharmacology using SPARQL



Relating Pathways to Adverse Drug Reactions



Top 3 pathways by number of paths



Isoniazid and Ethionamide – replicate paper results

  Banerjee, A., Dubnau, E., Quemard, A., Balasubramanian, V., Um, K., Wilson, T., et al.: inhA, a gene encoding a target for
isoniazid and ethionamide in Mycobacterium tuberculosis. Science, 263(5144), 227-230 (1994).



Pathfinding – shortest path algorithm (DFS)
  For a network G = (V, E) and an association query (ai, aj), association search is to find and rank possible associations {αk(ai, aj)}.

Input: a query (vi, vj) and a social network G = (V, E)
Output: a ranked list of associations A = {αk} with L(αk) < (1+β)Lmin, where Lmin is the length of the shortest association and β is a user-defined parameter.
Algorithm: Our proposed algorithm
{
  /* Step 1. Shortest association finding */
  /* The following is resolved in a single heap-based */
  /* shortest-association finding solution */
  foreach (v ∈ V\vj) { d'(v) ← shortest-association from v to vj; }
  Lmin ← d'(vi);
  /* Step 2. Near-shortest associations finding */
  stack ← (vi, NULL);
  /* c(v) denotes the times v appears in the current association */
  /* It is used to avoid loops in the association */
  /* d(v) denotes the length of the current association */
  foreach (v ∈ V) { d(v) ← 0; c(v) ← 0; } c(vi) ← 1;
  while (stack is not empty) {
    (s, e) ← node at the top of stack;
    E(s) ← all edges pointing out from the node s;
    foreach (es ∈ E(s)) {
      (s, u) ← the edge pointed to by es;
      if ( c(u) = 0 && d(s) + 1 + d'(u) < (1+β)Lmin ) {
        if ( u = vj ) {
          /* find a new association */
          α(vi, vj) ← all edges in stack ∪ es;
          d(α(vi, vj)) ← d(s) + 1 + d'(u); // calculate the length
          add (α(vi, vj), d(α(vi, vj))) into A;
        } else {
          if ( stack.size() < max_length ) {
            push (u, es) on stack;
            c(u) ← c(u) + 1; d(u) ← d(s) + 1;
          }
        }
      } else {
        pop (s, e) from stack; c(s) ← c(s) - 1;
      }
    }
  }
  /* Step 4. Ranking the found associations */
  /* to rank the found associations with the shortest on the top */
  A ← sort (A);
  return A;
}
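The idea behind the pseudocode (precompute the distance from every node to the target, then run a depth-first search that prunes any branch unable to finish within (1+β)·Lmin) can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' implementation: it assumes unit-length edges, so a breadth-first search stands in for the heap-based shortest-path step, and it uses recursion in place of the explicit stack. The function names and the toy graph are hypothetical.

```python
from collections import deque


def distances_to_target(graph, target):
    """BFS over reversed edges: d'(v) = fewest edges from v to target."""
    reverse = {v: [] for v in graph}
    for v, outs in graph.items():
        for u in outs:
            reverse.setdefault(u, []).append(v)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        for u in reverse.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def near_shortest_paths(graph, source, target, beta=0.5, max_length=10):
    """Enumerate loop-free paths with length < (1 + beta) * Lmin, shortest first."""
    d_prime = distances_to_target(graph, target)
    if source not in d_prime:
        return []  # target is unreachable from source
    bound = (1 + beta) * d_prime[source]  # strict bound, as on the slide
    found = []
    path = [source]
    on_path = {source}  # plays the role of c(v) > 0: blocks loops

    def extend(s):
        depth = len(path) - 1  # d(s): edges used to reach s
        for u in graph.get(s, []):
            if u in on_path or u not in d_prime:
                continue
            if depth + 1 + d_prime[u] >= bound:
                continue  # cannot finish within the length bound
            if u == target:
                found.append(path + [u])
            elif len(path) < max_length:
                path.append(u)
                on_path.add(u)
                extend(u)
                path.pop()
                on_path.remove(u)

    extend(source)
    return sorted(found, key=len)


# Toy directed graph: A-B-D has length 2, A-C-E-D has length 3.
toy = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "E": ["D"], "D": []}
print(near_shortest_paths(toy, "A", "D", beta=1.0))
```

Because the bound is a strict inequality, β must be greater than zero for even the shortest path to be returned; with β = 0.5 the toy graph yields only A-B-D, and with β = 1.0 it yields both paths.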



Chem2Bio2RDF Dashboard: finding paths



Pathfinder
NFKB1

Glucocorticoid Receptor

Triamcinolone Dexamethasone

http://ella.slis.indiana.edu/~yuysun/flex/pathfinder.html



Fleroxacin-Pefloxacin relationships



Top 3 pathways linked to Hepatitis



ChemBioScape – pathfinding in Cytoscape

PlotViz – visualizing in chemical space

Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics
Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop,
ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago, IL



Chemical & Biological Literature Extraction



Integration of PubMed BioTerms



ChemBioSpace – a literature-centric space



Bio-LDA Topic Model
  Identifies “latent topics” by word association: a kind of fuzzy clustering
  Each word can be associated with multiple topics, with varying degrees of strength
  Term-topic edges are labeled with probability (i.e. strength of a relationship to a topic). Term-term edges are labeled with KL divergence (a measure of distance)
  We considered BioTerms rather than free text, and applied the model to 336,899 MedLine abstracts on 50 topics published in 2009
  Based on work done by Jie Tang on social networks (see www.arnetminer.com)
  He, B., Sankaranarayanan, M., Ding, Y., Tang, J., Wang, H., Wild, D., Chen, B., Sun, Y., Sugimoto, C., Wu, Y., & Qiu, J. Semantic Path and Topic Mining in Linked Life Data (submitted). The 9th International Semantic Web Conference (ISWC'10), Shanghai, China, Nov 7-11, 2010



Bio-LDA Topic Model
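  Distribution of BioTerms over Topics (θ: rows are BioTerms B_1 … B_m, columns are topics T_1 … T_z; φ: rows are topics T_1 … T_z, columns are words w_1 … w_n)

\[
\theta =
\begin{array}{c|cccc}
       & T_1         & T_2         & \cdots & T_z         \\ \hline
B_1    & \theta_{11} & \theta_{12} & \cdots & \theta_{1z} \\
B_2    & \theta_{21} & \theta_{22} & \cdots & \theta_{2z} \\
\vdots & \vdots      & \vdots      & \ddots & \vdots      \\
B_m    & \theta_{m1} & \theta_{m2} & \cdots & \theta_{mz}
\end{array}
\qquad
\varphi =
\begin{array}{c|cccc}
       & w_1          & w_2          & \cdots & w_n          \\ \hline
T_1    & \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1n} \\
T_2    & \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2n} \\
\vdots & \vdots       & \vdots       & \ddots & \vdots       \\
T_z    & \varphi_{z1} & \varphi_{z2} & \cdots & \varphi_{zn}
\end{array}
\]

  Generative probability:

\[
P_{\text{Bio-LDA}}(w \mid d, \theta, \varphi) = \sum_{z=1}^{T} \sum_{x=1}^{B_d} P(w \mid z, \varphi_z)\, P(z \mid x, \theta_x)\, P(x \mid d)
\]

  Kullback-Leibler Divergence:

\[
\mathrm{sKL}(b_i, b_j) = \sum_{z=1}^{T} \left( \theta_{b_i z} \log \frac{\theta_{b_i z}}{\theta_{b_j z}} + \theta_{b_j z} \log \frac{\theta_{b_j z}}{\theta_{b_i z}} \right)
\]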
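To make the generative probability concrete, the double sum can be evaluated directly from the two matrices. The sketch below is a plain-Python illustration with made-up numbers; the uniform choice for P(x | d) and the function name are assumptions introduced here, not something the slide specifies.

```python
def word_probability(w, doc_bioterms, theta, phi, p_x_given_d=None):
    """P(w | d) = sum over topics z and bio-terms x of P(w|z) * P(z|x) * P(x|d).

    theta[x][z] is P(topic z | bio-term x); phi[z][w] is P(word w | topic z).
    """
    if p_x_given_d is None:
        # Assumption: every bio-term in the document is equally likely.
        p_x_given_d = {x: 1.0 / len(doc_bioterms) for x in doc_bioterms}
    total = 0.0
    for z in range(len(phi)):
        for x in doc_bioterms:
            total += phi[z][w] * theta[x][z] * p_x_given_d[x]
    return total


# Toy model: 2 bio-terms, 2 topics, 3 words (all values are invented).
theta = [[0.9, 0.1],
         [0.2, 0.8]]
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.3, 0.6]]

probs = [word_probability(w, [0, 1], theta, phi) for w in range(3)]
print(probs)
```

Because each row of θ and φ is a probability distribution, the word probabilities for a document sum to one, which is a quick sanity check on any implementation.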



Bio-LDA III
  Entropy
    In information theory, entropy is a measure of the uncertainty associated with a random variable.
    Here we can compute the bio-term entropies over topics.
  Kullback-Leibler divergence (KL divergence)
    A non-symmetric measure of the difference between two probability distributions.
    Here we used the KL divergence as the non-symmetric distance measure for two bio-terms over topics.
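Both quantities are short computations over a bio-term's row of θ. The sketch below shows the Shannon entropy of one topic distribution and the two-direction sum written as sKL on the Bio-LDA Topic Model slide; plain KL divergence is non-symmetric, while summing both directions gives a symmetric value. The small `eps` floor that guards against zero probabilities is an assumption added here, as are the function names and example numbers.

```python
import math


def entropy(dist):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def symmetric_kl(p, q, eps=1e-12):
    """sKL(p, q) = sum_z [ p_z log(p_z / q_z) + q_z log(q_z / p_z) ].

    eps is a hypothetical floor that avoids log(0) and division by zero.
    """
    total = 0.0
    for pz, qz in zip(p, q):
        pz, qz = max(pz, eps), max(qz, eps)
        total += pz * math.log(pz / qz) + qz * math.log(qz / pz)
    return total


# Invented topic distributions for two bio-terms over three topics.
term_a = [0.7, 0.2, 0.1]
term_b = [0.1, 0.3, 0.6]

print(entropy(term_a), entropy(term_b))
print(symmetric_kl(term_a, term_b))
```

A bio-term concentrated on one topic has low entropy, and two bio-terms with similar topic distributions have a small sKL value.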



Example: Topic 10



Relating Drugs to ABL1 with Chem2Bio2RDF+LDA

Drug
Gene
Disease
ChemBioSpace Link
Predicted Link



Hydrocortisone – Dexamethasone links

  Fig. Use Case 1. Network diagram of the paths obtained between Hydrocortisone and Dexamethasone using ChemBioScape. DrugBank interactions record each drug’s targets. In this case, DB00741 and DB01234 share common targets through several different DrugBank interaction IDs.



Tolcapone and Entacapone links

  Fig. Use Case 2. Tolcapone and Entacapone are connected to each other through DrugBank interactions 2348 and 1962. The two drugs also appear in PubMed articles 8119326 and 8223912 via their CIDs (compound IDs).



Summary
  Chem2Bio2RDF is useful in its own right, but requires advanced SPARQL skills to use.

  It is good to have both relational database and triple-store forms.

  Pathfinding between two points, integrated with tools like Pathfinder and ChemBioScape, offers an intuitive and straightforward way to data mine big RDF networks.

  With large RDF networks, ranking of paths is extremely important (we are working on this).

  Integration of PubMed BioTerms and advanced topic modeling offers both a new data source and a way of ranking paths (sum of KL divergence over a path).

  We are now doing “proof of pudding” work (relationships with experimentalists, etc.).
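The path-ranking idea mentioned in the summary (sum of KL divergence over a path) can be sketched as below. This is only an illustration: the node names and divergence values are invented, and ranking the smallest total first is an assumption, since the slides do not state the ordering.

```python
def path_score(path, divergence):
    """Sum the divergence over each pair of consecutive nodes on a path."""
    return sum(divergence[frozenset(pair)] for pair in zip(path, path[1:]))


def rank_paths(paths, divergence):
    """Order paths by total divergence, smallest first (an assumed ordering)."""
    return sorted(paths, key=lambda p: path_score(p, divergence))


# Hypothetical symmetric divergences between neighbouring terms.
divergence = {
    frozenset(("drug", "gene1")): 0.2,
    frozenset(("gene1", "disease")): 0.3,
    frozenset(("drug", "gene2")): 0.9,
    frozenset(("gene2", "disease")): 0.4,
}

paths = [["drug", "gene2", "disease"], ["drug", "gene1", "disease"]]
print(rank_paths(paths, divergence))
```

Keying the lookup on an unordered pair fits a symmetric divergence; a non-symmetric measure would need ordered pairs instead.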



Acknowledgments
Semantic Web Lab: http://swl.slis.indiana.edu/
Cheminformatics: http://djwild.info
Salsa HPC Group: http://salsahpc.indiana.edu/

Seung-Hee Bae, Jong Choi, Bin Chen, Ying Ding, Xiao Dong, Geoffrey Fox,
Dazhi Jiao, Judy Qiu, Yuyin Sun, Huijun Wang, Qian Zhu
