Data Mining and Querying of Integrated Chemical and Biological Information Using Chem2Bio2RDF

David J. Wild
Assistant Professor & Director, Cheminformatics Program
Indiana University School of Informatics and Computing
djwild@indiana.edu - http://djwild.info
Overview
  Background – Big Data drug discovery & Chem2Bio2RDF
  Contextualizing Chem2Bio2RDF – BioRDF, LODD, LOD
  Using SPARQL queries for polypharmacology
  Finding all links between any two entities
  Algorithm
  Pathfinder visualization
  Visualization tools
  ChemBioScape: Visualization & pathfinding in Cytoscape
  PlotViz: Visualization & SPARQL querying in 3D Chemical Space
  BioLDA and Topic Models: Advanced Literature Mining
  Summary
Chem2Bio2RDF David Wild, August 2010. http://djwild.info.

Epochs in drug discovery
Empirical – up until the 1960s
  754: First pharmacy opened in Baghdad
  Late 1800s – major pharmaceutical companies, mass production
  1900-1960 – major discoveries (insulin, penicillin, the pill …)
Rational – 1960s to 1990s
  Designing molecules to target protein active sites – “lock and key”
  Computational Drug Discovery
  Biggest success: HIV (RT, protease inhibitors)
Big Experiment – 1990s to 2000s
  High throughput screening
  Microarray Assays
  Gene Sequencing and Human Genome Project
Big Data – 2010s onwards
  Informatics-driven drug discovery
  Accepting the body is complex and we don’t understand it well
  Everything is connected

Big Data in the public domain
  There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters, the public domain holds information on:
  69 million compounds and 449,392 bioassays (PubChem)
  4,763 drugs (DrugBank)
  9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
  14 million human nucleotide sequences (EMBL)
  19 million life science publications - 800,000 new each year (PubMed)
  Multitude of other sets (drugs, toxicogenomics, chemogenomics, SAR, …)
  Even more important are the relationships between these entities. For example, a chemical compound can be linked to a gene or a protein target in a multitude of ways:
  Biological assay with percent inhibition, IC50, etc.
  Crystal structure of ligand/protein complex
  Co-occurrence in a paper abstract
  Computational experiment (docking, predictive model)
  Statistical relationship
  System association (e.g. involved in the same pathways or cellular processes)

PubChem growth since 2005
[Figure: PubChem Substance Size 2005-2010. Y axis 0 to 80,000,000; x axis 2005-01 to 2010-07. Labeled values: 2,824,265; 35,379,748; 56,774,950; 69,088,100. Annotation: Addition of ChemSpider.]
[Figure: PubChem Bioassays 2005-2010. Y axis 1 to 1,000,000 in powers of ten; x axis 2005-01 to 2010-07. Labeled value: 434,635. Annotation: Addition of ChEMBL.]
Large amount of data and links for each compound

Proteins & Genes
http://www.genome.jp/en/db_growth.html

NGS and PHR/EHRs add new dimensions to data

Informatics-based drug discovery
Predicting new molecular targets for known drugs. Nature 462, 175-181 (12 November 2009)

“Systems chemical biology” and chemogenomics

Chem2Bio2RDF BMC Bioinformatics, 2010, 11, 255; chem2bio2rdf.org

The Semantic Web – meaning & relationships

Chem2Bio2RDF – RDF integration & SPARQL querying
Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J.
Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 2010, 11, 255

Chem2Bio2RDF context

Chem2Bio2RDF Relationships

Linked Open Data Cloud (linkeddata.org)

Converting data into RDF

SPARQL Interface to Chem2Bio2RDF

Finding multi-target inhibitors of MAPK pathway with a SPARQL query
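
The query shown on this slide is not reproduced in the text. As a rough illustration of the shape of such a query, the sketch below runs the same kind of pattern match over a tiny in-memory list of triples in plain Python. Every identifier and predicate name here (`inhibits`, `memberOf`, `compound:A`, `gene:G1`, and so on) is hypothetical and is not the Chem2Bio2RDF vocabulary; in SPARQL the same idea would be a graph pattern with a grouped count of targets per compound.

```python
# Hypothetical toy triples (subject, predicate, object).
# None of these names come from Chem2Bio2RDF; they only illustrate the pattern.
TRIPLES = [
    ("compound:A", "inhibits", "gene:G1"),
    ("compound:A", "inhibits", "gene:G2"),
    ("compound:B", "inhibits", "gene:G2"),
    ("compound:C", "inhibits", "gene:G3"),
    ("gene:G1", "memberOf", "pathway:MAPK"),
    ("gene:G2", "memberOf", "pathway:MAPK"),
    ("gene:G3", "memberOf", "pathway:Other"),
]


def multi_target_compounds(triples, pathway, min_targets=2):
    """Return compounds that inhibit at least `min_targets` genes in `pathway`."""
    # Pattern 1: ?gene memberOf <pathway>
    members = {s for s, p, o in triples if p == "memberOf" and o == pathway}
    # Pattern 2: ?compound inhibits ?gene, joined on ?gene
    hits = {}
    for s, p, o in triples:
        if p == "inhibits" and o in members:
            hits.setdefault(s, set()).add(o)
    # Equivalent of grouping by compound and filtering on the target count
    return {c: sorted(genes) for c, genes in hits.items() if len(genes) >= min_targets}


result = multi_target_compounds(TRIPLES, "pathway:MAPK")
```

Here only `compound:A` qualifies, because it is the only compound linked to two genes in the chosen pathway.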

Finding compounds with similar polypharmacology using SPARQL

Relating Pathways to Adverse Drug Reactions

Top 3 pathways by number of paths

Isoniazid and Ethionamide – replicate paper results
  Banerjee, A., Dubnau, E., Quemard, A., Balasubramanian, V., Um, K., Wilson, T., et al.: inhA, a gene encoding a target for
isoniazid and ethionamide in Mycobacterium tuberculosis. Science, 263(5144), 227-230 (1994).

Pathfinding – shortest path algorithm (DFS)
  For a network G = (V, E) and an association query (ai, aj), association search is to find and rank possible associations {αk(ai, aj)}.
Input: a query (vi, vj) and a social network G = (V, E)
Output: a ranked list of associations A = {αk} with L(αk) < (1+β)Lmin, where Lmin is the length of the shortest association and β is a user-defined parameter.
Algorithm: Our proposed algorithm
{
  /* Step 1. Shortest association finding */
  /* The following is resolved in a single heap-based shortest-association finding solution */
  foreach (v ∈ V\vj) { d'(v) ← shortest-association from v to vj; }
  Lmin ← d'(vi);
  /* Step 2. Near-shortest associations finding */
  stack ← (vi, NULL);
  /* c(v) denotes the times v appears in the current association */
  /* It is used to avoid loops in the association */
  /* d(v) denotes the length of the current association */
  foreach (v ∈ V) { d(v) ← 0; c(v) ← 0; } c(vi) ← 1;
  while (stack is not empty) {
    (s, e) ← node at the top of stack;
    E(s) ← all edges pointing out from the node s;
    foreach (es ∈ E(s)) {
      (s, u) ← the edge pointed to by es;
      if (c(u) = 0 && d(s) + 1 + d'(u) < (1+β)Lmin) {
        if (u = vj) {
          /* find a new association */
          α(vi, vj) ← all edges in stack ∪ es;
          d(α(vi, vj)) ← d(s) + 1 + d'(u); // calculate the length
          add (α(vi, vj), d(α(vi, vj))) into A;
        } else {
          if (stack.size() < max_length) {
            push (u, es) on stack;
            c(u) ← c(u) + 1; d(u) ← d(s) + 1;
          }
        }
      } else {
        pop (s, e) from stack; c(s) ← c(s) - 1;
      }
    }
  }
  /* Step 4. Ranking the found associations */
  /* to rank the found associations with the shortest on the top */
  A ← sort (A);
  return A;
}
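
The near-shortest association search above can be sketched in Python. This is a simplified reading of the pseudocode, not the authors' implementation: it assumes unit edge weights (so a breadth-first search stands in for the heap-based shortest-path step), uses recursion instead of an explicit stack, and represents the graph as a plain adjacency dictionary. The function names, the default `beta`, and the default `max_length` are assumptions made for the example.

```python
from collections import deque


def distances_to(graph, target):
    """Step 1: hop count d'(v) from every node to `target` (BFS on reversed edges)."""
    reverse = {}
    for s, neighbours in graph.items():
        for u in neighbours:
            reverse.setdefault(u, []).append(s)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for prev in reverse.get(node, []):
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    return dist


def near_shortest_paths(graph, source, target, beta=0.5, max_length=6):
    """Step 2: enumerate loop-free paths whose length is < (1 + beta) * Lmin."""
    d_to = distances_to(graph, target)
    if source not in d_to:
        return []  # target is unreachable from source
    bound = (1 + beta) * d_to[source]
    found = []
    path = [source]
    on_path = {source}  # plays the role of c(v): blocks revisiting a node

    def dfs(node):
        depth = len(path) - 1  # d(s): edges used so far
        for nxt in graph.get(node, []):
            if nxt in on_path or nxt not in d_to:
                continue
            # Prune when even the best completion breaks the strict bound
            if depth + 1 + d_to[nxt] >= bound:
                continue
            if nxt == target:
                found.append(path + [nxt])
            elif depth + 1 < max_length:
                path.append(nxt)
                on_path.add(nxt)
                dfs(nxt)
                path.pop()
                on_path.remove(nxt)

    dfs(source)
    found.sort(key=len)  # final step: shortest associations first
    return found
```

Because the bound is strict, as in the Output line of the pseudocode, `beta` must be greater than zero for any path to be returned. For an undirected network, each edge would be listed in both directions in the adjacency dictionary.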

Chem2Bio2RDF Dashboard: finding paths

Pathfinder
[Figure: Pathfinder view with nodes NFKB1, Glucocorticoid Receptor, Triamcinolone, Dexamethasone]
http://ella.slis.indiana.edu/~yuysun/flex/pathfinder.html

Fleroxacin-Pefloxacin relationships

Top 3 pathways linked to Hepatitis

ChemBioScape – pathfinding in Cytoscape
PlotViz – visualizing in chemical space
Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL

Chemical & Biological Literature Extraction

Integration of PubMed BioTerms

ChemBioSpace – a literature-centric space

Bio-LDA Topic Model
  Identifies “latent topics” by word association: a kind of fuzzy clustering
  Each word can have associations with multiple topics, each with a varying degree of strength
  Term-topic edges are labeled with a probability (i.e. the strength of the relationship to a topic). Term-term edges are labeled with KL divergence (a measure of distance)
  We considered BioTerms rather than free text, and applied to 336,899 MEDLINE abstracts on 50 topics published in 2009
  Based on work done by Jie Tang on social networks (see www.arnetminer.com)
  He, B., Sankaranarayanan, M., Ding, Y., Tang, J., Wang, H., Wild, D., Chen, B., Sun, Y., Sugimoto, C., Wu, Y., & Qiu, J. Semantic Path and Topic Mining in Linked Life Data (submitted). The 9th International Semantic Web Conference (ISWC'10), Shanghai, China, Nov 7-11, 2010

Bio-LDA Topic Model
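  Distribution of BioTerms over Topics
  θ has one row per BioTerm (B_1 ... B_m) and one column per topic (T_1 ... T_z); φ has one row per topic (T_1 ... T_z) and one column per word (w_1 ... w_n):
\[
\theta =
\begin{pmatrix}
\theta_{11} & \theta_{12} & \cdots & \theta_{1z} \\
\theta_{21} & \theta_{22} & \cdots & \theta_{2z} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{m1} & \theta_{m2} & \cdots & \theta_{mz}
\end{pmatrix},
\qquad
\phi =
\begin{pmatrix}
\phi_{11} & \phi_{12} & \cdots & \phi_{1n} \\
\phi_{21} & \phi_{22} & \cdots & \phi_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\phi_{z1} & \phi_{z2} & \cdots & \phi_{zn}
\end{pmatrix}
\]
  Generative probability:
\[
P_{\text{Bio-LDA}}(w \mid d, \theta, \phi) = \sum_{z=1}^{T} \sum_{x=1}^{B_d} P(w \mid z, \phi_z)\, P(z \mid x, \theta_x)\, P(x \mid d)
\]
  Kullback-Leibler Divergence:
\[
\mathrm{sKL}(b_i, b_j) = \sum_{z=1}^{T} \left( \theta_{b_i z} \log \frac{\theta_{b_i z}}{\theta_{b_j z}} + \theta_{b_j z} \log \frac{\theta_{b_j z}}{\theta_{b_i z}} \right)
\]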
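
The sKL formula above can be checked numerically with a few lines of Python. This is a minimal sketch: each BioTerm is assumed to be given as a list of topic probabilities (one row of θ), natural logarithms are used, and the small `eps` floor that avoids log(0) is an addition for the example, not something stated on the slide.

```python
import math


def symmetric_kl(theta_i, theta_j, eps=1e-12):
    """sKL(b_i, b_j): sum over topics of p*log(p/q) + q*log(q/p)."""
    if len(theta_i) != len(theta_j):
        raise ValueError("both BioTerms need the same number of topics")
    total = 0.0
    for p, q in zip(theta_i, theta_j):
        # Assumed smoothing: clamp zero probabilities so the logs stay finite
        p = max(p, eps)
        q = max(q, eps)
        total += p * math.log(p / q) + q * math.log(q / p)
    return total
```

Identical distributions give a divergence of zero, and swapping the two arguments leaves the value unchanged.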
Bio-LDA III
  Entropy
    In information theory, entropy is a measure of the uncertainty associated with a random variable.
    Here we can compute the bio-term entropies over topics
  Kullback-Leibler divergence (KL divergence)
    A non-symmetric measure of the difference between two probability distributions.
    Here we used the KL divergence as the non-symmetric distance measure for two bio-terms over topics
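
The slide does not give the entropy formula. Assuming the standard Shannon definition, H(b) = -Σ_z θ_bz log θ_bz over the topic probabilities of a bio-term b, a minimal Python sketch looks like this (natural logarithm; the function name is made up for the example):

```python
import math


def topic_entropy(theta_b):
    """Shannon entropy of one bio-term's distribution over topics."""
    # Terms with zero probability contribute nothing (limit of p*log(p) is 0)
    return -sum(p * math.log(p) for p in theta_b if p > 0)
```

A bio-term spread evenly over all topics has the maximum entropy, log of the number of topics, while a bio-term tied to a single topic has entropy zero.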

Example: Topic 10

Relating Drugs to ABL1 with Chem2Bio2RDF+LDA
[Figure legend: Drug, Gene, Disease, ChemBioSpace Link, Predicted Link]

Hydrocortisone – Dexamethasone links
  Fig. Use Case 1. Network diagram of the paths obtained between Hydrocortisone and Dexamethasone using ChemBioScape. DrugBank interaction contains information about every drug’s target. In this case, DB00741 and DB01234 share common targets through several different DrugBank interaction IDs.

Tolcapone and Entacapone links
  Fig. Use Case 2. Tolcapone and Entacapone are connected to each other through DrugBank interactions 2348 and 1962. Also, the two drugs appear in PubMed articles 8119326 and 8223912 via their CID (Compound ID).

Summary
  Chem2Bio2RDF is useful in its own right, but requires advanced SPARQL skills to use.
  It is good to have both relational database and triple-store forms
  Pathfinding between two points, integrated with tools like Pathfinder and ChemBioScape, offers an intuitive and straightforward way to data mine big RDF networks
  With large RDF networks, ranking of paths is extremely important (we are working on this)
  Integration of PubMed BioTerms and advanced topic modeling offers both a new data source and a way of ranking paths (sum of KL divergence over a path)
  We are now doing “proof of the pudding” work (relationships with experimentalists, etc.)
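
The path-ranking idea in the summary (sum of KL divergence over a path) can be sketched as follows. This is an illustration under stated assumptions, not the authors' ranking code: every node on a path is assumed to have a topic distribution, the symmetric KL form from the Bio-LDA slide is used for each consecutive pair, and paths with a smaller total are assumed to rank first. The slide does not state the ranking direction, and the node names and distributions in the usage lines are invented.

```python
import math


def symmetric_kl(p_dist, q_dist, eps=1e-12):
    """Symmetric KL divergence between two topic distributions."""
    total = 0.0
    for p, q in zip(p_dist, q_dist):
        p = max(p, eps)  # assumed smoothing to avoid log(0)
        q = max(q, eps)
        total += p * math.log(p / q) + q * math.log(q / p)
    return total


def path_score(path, theta):
    """Sum of divergences between each pair of consecutive nodes on the path."""
    return sum(symmetric_kl(theta[a], theta[b]) for a, b in zip(path, path[1:]))


def rank_paths(paths, theta):
    """Order paths by total divergence, smallest first (assumed direction)."""
    return sorted(paths, key=lambda path: path_score(path, theta))


# Hypothetical topic distributions for four nodes
theta = {
    "A": [0.9, 0.1],
    "B": [0.8, 0.2],
    "C": [0.1, 0.9],
    "D": [0.85, 0.15],
}
ranked = rank_paths([["A", "C", "D"], ["A", "B", "D"]], theta)
```

In this toy case the path through `B` ranks ahead of the path through `C`, because `B` stays close to `A` and `D` in topic space while `C` does not.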

Acknowledgments
Semantic Web Lab: http://swl.slis.indiana.edu/
Cheminformatics: http://djwild.info
Salsa HPC Group: http://salsahpc.indiana.edu/
Seung-Hee Bae, Jong Choi, Bin Chen, Ying Ding, Xiao Dong, Geoffrey Fox, Dazhi Jiao, Judy Qiu, Yuyin Sun, Huijun Wang, Qian Zhu
