David J. Wild
Assistant Professor & Director, Cheminformatics Program
Indiana University School of Informatics and Computing
djwild@indiana.edu - http://djwild.info
Overview
Background – Big Data drug discovery & Chem2Bio2RDF
Contextualizing Chem2Bio2RDF – BioRDF, LODD, LOD
Using SPARQL queries for polypharmacology
Finding all links between any two entities
Algorithm
Pathfinder visualization
Visualization tools
ChemBioScape: Visualization & pathfinding in Cytoscape
PlotViz: Visualization & SPARQL querying in 3D Chemical Space
BioLDA and Topic Models: Advanced Literature Mining
Summary
Even more important are the relationships between these entities. For example,
a chemical compound can be linked to a gene or a protein target in a multitude
of ways:
Biological assay with percent inhibition, IC50, etc.
Crystal structure of ligand/protein complex
Co-occurrence in a paper abstract
Computational experiment (docking, predictive model)
Statistical relationship
System association (e.g. involved in the same pathways or cellular processes)
PubChem growth since 2005
[Figure: PubChem Substance Size 2005-2010. Time axis in months, 2006-01 to 2010-07; labeled values 2,824,265, 35,379,748, 56,774,950 and 69,088,100; annotation: "Addition of ChemSpider".]
[Figure: PubChem Bioassays 2005-2010. Logarithmic count axis (1 to 1,000,000) against time, 2005-01 to 2010-07; labeled value 434635; annotation: "Addition of ChEMBL".]
Chem2Bio2RDF David Wild, August 2010. http://djwild.info.
Large amount of data and links for each compound
http://www.genome.jp/en/db_growth.html
Predicting new molecular targets for known drugs. Nature 462, 175-181 (12 November 2009)
Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J.
Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems
chemical biology data. BMC Bioinformatics 2010, 11, 255
Banerjee, A., Dubnau, E., Quemard, A., Balasubramanian, V., Um, K., Wilson, T., et al.: inhA, a gene encoding a target for
isoniazid and ethionamide in Mycobacterium tuberculosis. Science, 263(5144), 227-230 (1994).
Glucocorticoid Receptor
Triamcinolone Dexamethasone
http://ella.slis.indiana.edu/~yuysun/flex/pathfinder.html
December 2009.
PlotViz – visualizing in chemical space
Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics
Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop,
ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL
Generative probability:
\[
P_{\text{Bio-LDA}}(w \mid d, \theta, \phi) = \sum_{z=1}^{T} \sum_{x=1}^{B_d} P(w \mid z, \phi_z)\, P(z \mid x, \theta_x)\, P(x \mid d)
\]
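The double sum in the generative probability can be checked numerically. The following is a minimal sketch, assuming that z indexes the T topics and x indexes the B_d bio-terms attached to document d (the slide does not define these symbols, so this reading is an assumption). The function name and the toy numbers are hypothetical.

```python
from typing import Sequence


def bio_lda_word_prob(
    phi_w: Sequence[float],            # phi_w[z] = P(w | z, phi_z), one entry per topic
    theta: Sequence[Sequence[float]],  # theta[x][z] = P(z | x, theta_x)
    p_x_given_d: Sequence[float],      # p_x_given_d[x] = P(x | d), one entry per bio-term
) -> float:
    """Evaluate the double sum over topics z and bio-terms x for one word w."""
    total = 0.0
    for x, p_x in enumerate(p_x_given_d):
        for z, p_w in enumerate(phi_w):
            total += p_w * theta[x][z] * p_x
    return total


# Toy values (hypothetical): T = 2 topics, B_d = 2 bio-terms in the document.
phi_w = [0.10, 0.02]
theta = [[0.7, 0.3], [0.2, 0.8]]
p_x_given_d = [0.5, 0.5]
p = bio_lda_word_prob(phi_w, theta, p_x_given_d)
```

With these toy inputs the four terms are 0.035, 0.003, 0.010 and 0.008, which sum to 0.056.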
Kullback-Leibler Divergence:
\[
\mathrm{sKL}(b_i, b_j) = \sum_{z=1}^{T} \left( \theta_{b_i z} \log \frac{\theta_{b_i z}}{\theta_{b_j z}} + \theta_{b_j z} \log \frac{\theta_{b_j z}}{\theta_{b_i z}} \right)
\]
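The sKL formula sums the KL divergence in both directions over the T topics. A minimal sketch of that computation, assuming each bio-term is represented by its topic distribution as a plain list of probabilities. The `eps` floor that guards against zero probabilities is an assumption added here, not something the slides specify.

```python
import math
from typing import Sequence


def symmetric_kl(
    theta_i: Sequence[float],
    theta_j: Sequence[float],
    eps: float = 1e-12,  # assumed smoothing floor, to avoid log(0) and division by zero
) -> float:
    """Sum over topics of p*log(p/q) + q*log(q/p) for two topic distributions."""
    total = 0.0
    for p, q in zip(theta_i, theta_j):
        p = max(p, eps)
        q = max(q, eps)
        total += p * math.log(p / q) + q * math.log(q / p)
    return total


# Toy topic distributions (hypothetical values).
d = symmetric_kl([0.9, 0.1], [0.5, 0.5])
```

Because both directions are summed, swapping the two arguments gives the same value, and identical distributions give zero.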
Bio-LDA III
Entropy
In information theory, entropy is a measure of the uncertainty associated
with a random variable.
Here we can compute the bio-term entropies over topics
Kullback-Leibler divergence (KL divergence)
A non-symmetric measure of the difference between two probability
distributions.
Here we used the KL divergence, summed in both directions (sKL), as the
distance measure between two bio-terms over topics
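The bio-term entropy over topics mentioned above can be sketched as the Shannon entropy of a bio-term's topic distribution. This assumes the distribution is available as a list of probabilities that sum to 1. The function name is hypothetical, and the natural logarithm is an arbitrary choice of base.

```python
import math
from typing import Sequence


def topic_entropy(theta_b: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one bio-term's distribution over topics."""
    # Zero-probability topics contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in theta_b if p > 0)


# A bio-term spread evenly over 4 topics versus one concentrated in a single topic.
h_uniform = topic_entropy([0.25, 0.25, 0.25, 0.25])
h_peaked = topic_entropy([1.0, 0.0, 0.0, 0.0])
```

A bio-term spread evenly over many topics has high entropy, while one concentrated in a single topic has entropy zero.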
[Figure legend: Drug, Gene, Disease, ChemBioSpace Link, Predicted Link]
Fig. Use Case 1. Network diagram of the paths obtained between Hydrocortisone and Dexamethasone using
ChemBioScape. Drugbank interaction contains information about every drug's target. In this case, DB00741 and
DB01234 share common targets through several different Drugbank interaction IDs.
Fig. Use Case 2. Tolcapone and Entacapone are connected to each other through Drugbank
interactions 2348 and 1962. Also, the two drugs appear in PubMed articles 8119326 and 8223912
via their CID (Compound ID)
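The two use cases show paths between a pair of drugs that run through intermediate nodes such as interactions and articles. One way to enumerate such paths is a depth-limited search for simple paths in an undirected graph, sketched below. This is an illustrative assumption, not necessarily the algorithm used by the pathfinder or ChemBioScape, and the node labels are hypothetical placeholders rather than real identifiers.

```python
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map from (node, node) pairs."""
    graph: Graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def all_simple_paths(graph: Graph, source: str, target: str, max_edges: int) -> List[List[str]]:
    """Return every path from source to target that repeats no node and uses at most max_edges edges."""
    paths: List[List[str]] = []

    def extend(path: List[str]) -> None:
        node = path[-1]
        if node == target:
            paths.append(list(path))
            return
        if len(path) - 1 >= max_edges:
            return  # depth limit reached without hitting the target
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:
                path.append(nxt)
                extend(path)
                path.pop()

    extend([source])
    return paths


# Hypothetical toy network: two drugs linked through a shared target and a shared article.
edges = [
    ("drug:A", "interaction:1"), ("interaction:1", "target:T"),
    ("target:T", "interaction:2"), ("interaction:2", "drug:B"),
    ("drug:A", "article:X"), ("article:X", "drug:B"),
]
graph = build_graph(edges)
paths = all_simple_paths(graph, "drug:A", "drug:B", max_edges=4)
```

The `max_edges` bound matters in practice: in a large RDF network the number of simple paths grows quickly with path length, so some cutoff is needed before any ranking step.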
With large RDF networks, ranking of paths is extremely important (we are
working on this)
Integration of PubMed BioTerms and advanced topic modeling offers both a
new data source and a way of ranking paths (sum of KL divergence over a
path)
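The ranking idea, the sum of KL divergence over a path, can be sketched as follows. The sketch assumes that every node on a path has a topic distribution, that the symmetric form of the divergence is applied to each consecutive pair of nodes, and that a smaller total ranks first. The slides do not state the ranking direction, so that last point is an assumption. All node labels and distributions are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def symmetric_kl(p_dist: Sequence[float], q_dist: Sequence[float], eps: float = 1e-12) -> float:
    """Symmetric KL divergence between two topic distributions (eps is an assumed floor)."""
    total = 0.0
    for p, q in zip(p_dist, q_dist):
        p, q = max(p, eps), max(q, eps)
        total += p * math.log(p / q) + q * math.log(q / p)
    return total


def path_score(path: List[str], topics: Dict[str, Sequence[float]]) -> float:
    """Sum the divergence over each consecutive pair of nodes on the path."""
    return sum(symmetric_kl(topics[a], topics[b]) for a, b in zip(path, path[1:]))


def rank_paths(paths: List[List[str]], topics: Dict[str, Sequence[float]]) -> List[List[str]]:
    """Order paths by ascending total divergence (assumed: smaller means more coherent)."""
    return sorted(paths, key=lambda path: path_score(path, topics))


# Hypothetical topic distributions over two topics.
topics = {
    "drug:A": [0.80, 0.20],
    "gene:G": [0.70, 0.30],
    "disease:D": [0.10, 0.90],
    "drug:B": [0.75, 0.25],
}
candidate_paths = [
    ["drug:A", "disease:D", "drug:B"],
    ["drug:A", "gene:G", "drug:B"],
]
ranked = rank_paths(candidate_paths, topics)
```

Here the path through `gene:G` ranks first because its topic distribution is close to those of both drugs, while the path through `disease:D` accumulates a large divergence on each hop.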
Seung-Hee Bae, Jong Choi, Bin Chen, Ying Ding, Xiao Dong, Geoffrey Fox,
Dazhi Jiao, Judy Qiu, Yuyin Sun, Huijun Wang, Qian Zhu