Disease Phenotypes: the association of genes and proteins with disease phenotype
Targets & Networks: proteins, their function(s) and their interactions
scalable ontologies
Proprietary knowledge
Public knowledge
Experimental data
Patent information
Biological networks
Metabolic networks
e.g. KEGG, BRENDA
Regulatory networks
Expert networks (e.g. Transpath, Biocarta, STKE…)
Large scale experiments (e.g. genome-wide location analysis)
Data stored …
… in public databases
However, text is
mostly unstructured
redundant
ambiguous
allows for different interpretations
complexity of information source
Information extraction requires natural language processing
TEMIS Insight Discoverer Extractor
Organisation of documents by meaning
Quick overview over many documents
Exhibiting interesting relations between topics
E.g.: Clustering of about 1100 documents about “caspase 1” with respect to their MeSH annotation (C – Diseases and D – Chemicals and Drugs)
Clustering of documents containing “caspase 1”
Chem. Compounds:
/12 Heterocyclic Compounds ; Heterocyclic Compounds 1-Ring ; Chemical Actions ; Enzyme Inhibitors ; Enzyme Inhibitors ; Protease Inhibitors ; Protease Inhibitors ; Chemical Actions and Uses ; Alkaloids ; Heterocyclic Compounds with 4 or More Rings ; 18
/15 Polycyclic Hydrocarbons Aromatic ; Polycyclic Hydrocarbons ; Metabolic Diseases ; Nutritional and Metabolic Diseases ; Naphthacenes ; Tetracyclines ; Naphthols ; Tetracyclines ; Naphthalenes ; Congenital Hereditary and Neonatal Diseases and Abnormalities ; 11
/6 Inorganic Chemicals ; Nitrogen Compounds ; Elements ; Metals ; Amino Acids Peptides and Proteins ; Peptides ; Pathological Conditions Signs and Symptoms ; Pathologic Processes ; Immunologic and Biological Factors ; Amino Acids ; 39
Lipids:
/8 Lipids ; Lipids and Antilipemic Agents ; Membrane Lipids ; Sphingolipids ; Glycosphingolipids ; Glycolipids ; Glycosphingolipids ; Glycosphingolipids ; Glycolipids ; Glycoconjugates ; 31
Diseases:
/7 Neoplasms ; Neoplasms by Histologic Type ; Neoplasms Glandular and Epithelial ; Neoplasms Germ Cell and Embryonal ; Neuroectodermal Tumors ; Neuroectodermal Tumors ; Neoplasms Nerve Tissue ; Proteins ; Neoplasms Connective and Soft Tissue ; Amino Acids Peptides and Proteins ; 32
/10 Nervous System Diseases ; Central Nervous System Diseases ; Wounds and Injuries ; Disorders of Environmental Origin ; Brain Diseases ; Neurodegenerative Diseases ; Trauma Nervous System ; Pathological Conditions Signs and Symptoms ; Caspases ; Cysteine Endopeptidases ; 23
/13 Cardiovascular Diseases ; Vascular Diseases ; Ischemia ; ischemia ; Immunologic and Biological Factors ; Brain Diseases ; Pathologic Processes ; Nervous System Diseases ; Caspases ; Immunologic Factors ; 16
/14 Digestive System Diseases ; Liver Diseases ; Hepatitis ; Intestinal Diseases ; Gastrointestinal Diseases ; hepatitis ; Pancreatitis ; Pancreatic Diseases ; pancreatitis ; Immunologic and Biological Factors ; 15
/19 Hemic and Lymphatic Diseases ; Hematologic Diseases ; Lymphatic Diseases ; Lymphoproliferative Disorders ; Immunoproliferative Disorders ; Lymphoma ; Lymphoma ; Lymphoproliferative Disorders ; Lymphoma ; Bone Marrow Diseases ; 9
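The grouping of documents by their MeSH annotation can be sketched as follows. The slides do not say which clustering algorithm was used, so the single-link grouping by Jaccard similarity, the threshold of 0.5, and the toy documents `doc1` to `doc4` are assumptions made purely for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of MeSH terms that two documents have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_annotation(docs: dict, threshold: float = 0.5) -> list:
    """Single-link clustering: documents whose annotation sets overlap
    by at least `threshold` end up in the same cluster (union-find)."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for doc_id in docs:
        clusters.setdefault(find(doc_id), set()).add(doc_id)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical documents; the MeSH terms are taken from the cluster labels
docs = {
    "doc1": {"Enzyme Inhibitors", "Protease Inhibitors", "Alkaloids"},
    "doc2": {"Enzyme Inhibitors", "Protease Inhibitors"},
    "doc3": {"Liver Diseases", "Hepatitis"},
    "doc4": {"Liver Diseases", "Hepatitis", "Pancreatitis"},
}
clusters = cluster_by_annotation(docs)
```

With these toy annotations the two inhibitor documents form one cluster and the two liver disease documents another.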
Information Extraction – Construction of Information Networks
F12A
Neuronectin, GMEM, tenascin, HXB, cytotactin, hexabrachion
COL1A1
Collagen alpha 1(I) chain
Spelling variants: Alpha 1 collagen
Alpha-1 type I collagen
Permutations: alpha 1(I) collagen
TNF receptor 1
Nested names: collagen, type I, alpha receptor
Dictionary-Based Approach
Biomedical dictionaries (e.g. SwissProt, TREMBL, LocusLink) allow for easy identification of multi-word terms
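A minimal sketch of dictionary-based name recognition, assuming a simple token normalization and a longest-match scan. The matching procedure actually used is not shown on the slides; the function names `normalize`, `build_dictionary` and `find_names` and the two-entry dictionary are hypothetical.

```python
import re


def normalize(text: str) -> list:
    """Lower-case and split into letter and digit tokens, so spelling
    variants such as 'Alpha-1' and 'alpha 1' yield the same tokens."""
    return re.findall(r"[a-z]+|\d+", text.lower())


def build_dictionary(entries: dict) -> dict:
    """Map each normalized name (token tuple) to its entry identifier."""
    lookup = {}
    for identifier, names in entries.items():
        for name in names:
            lookup[tuple(normalize(name))] = identifier
    return lookup


def find_names(text: str, lookup: dict) -> list:
    """Left-to-right scan that prefers the longest match, so a nested
    name does not pre-empt the longer multi-word term containing it."""
    tokens = normalize(text)
    max_len = max(len(key) for key in lookup)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            identifier = lookup.get(tuple(tokens[i:i + n]))
            if identifier is not None:
                hits.append((" ".join(tokens[i:i + n]), identifier))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return hits


# Illustrative entries; names are taken from the examples on the slides
entries = {
    "COL1A1": ["COL1A1", "Collagen alpha 1(I) chain", "Alpha-1 type I collagen"],
    "BTC": ["BTC", "betacellulin"],
}
lookup = build_dictionary(entries)
hits = find_names("Betacellulin and alpha 1 type I collagen were measured.", lookup)
```

Here `hits` contains one match for BTC and one for COL1A1, even though the text writes "alpha 1" without the hyphen used in the dictionary entry.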
Method Validation
Definition of a manually curated benchmark set is extremely time-consuming.
Use existing annotation from (an older version of) the TransPath database,
i.e. extract proteins (141) and references to abstracts (490) as gold standard.
Evaluate method using various dictionaries and matching parameters.
Biomedical name recognition is an active field of research. The BioCreative’03 competition, for instance, will provide further grounds for large-scale validation of proposed methods…
Results justify use of our method as building block for interaction extraction via higher level semantic analysis...
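Evaluation against such a gold standard can be sketched as a set comparison of (protein, abstract) pairs. The slides do not state which measures were reported; precision, recall and F1 are assumed here as the usual choices, and the identifiers in the example are made up.

```python
def evaluate(predicted: set, gold: set) -> dict:
    """Compare predicted (protein, abstract) pairs with the gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical pairs: (protein identifier, abstract identifier)
gold = {("P1", "A1"), ("P1", "A2"), ("P2", "A2"), ("P3", "A3")}
predicted = {("P1", "A1"), ("P2", "A2"), ("P4", "A3")}
scores = evaluate(predicted, gold)
```

Running the same function once per dictionary and parameter setting gives directly comparable scores.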
Information Extraction – Construction of Information Networks
Techniques:
Example sentence:
“Betacellulin (BTC) has been demonstrated to directly bind to both EGFR and HER4 and induces the growth of certain epithelial cell types.”
Protein - Protein Interactions
Example sentence:
“<concept name="~id=685"> has been demonstrated to directly bind to both <concept name="~id=1956"> and <concept name="~id=2066"> and induces the growth of certain epithelial cell types.”
<concept name="~id=685">
  <concept name="~LOCUSLINK">
    <ll>685@LOCUSLINK</ll>
  </concept>
  <concept name="~SWISSPROT">
    <sp>BTC_HUMAN@SWISSPROT</sp>
  </concept>
  <concept name="~PREFERRED">
    <pr>BTC</pr>
  </concept>
  <concept name="~SYNONYMS">
    <syn>BTC</syn>
    <syn>betacellulin</syn>
    <syn>Betacellulin precursor</syn>
  </concept>
  <concept name="~DISEASES">
    <dis>Intestinal Neoplasms</dis>
    <dis>Carcinoma, Squamous Cell</dis>
    <dis>Pancreatic Neoplasms</dis>
    <dis>Vulvar Neoplasms</dis>
    <dis>Haemophilus Infections</dis>
    <dis>Papilloma</dis>
    <dis>Head and Neck Neoplasms</dis>
    <dis>Pharyngeal Neoplasms</dis>
    …
  </concept>
</concept>
<concept name="~id=2066">
  <concept name="~LOCUSLINK">
    <ll>2066@LOCUSLINK</ll>
  </concept>
  <concept name="~SWISSPROT">
    <sp>ERB4_HUMAN@SWISSPROT</sp>
  </concept>
  <concept name="~PREFERRED">
    <pr>ERBB4</pr>
  </concept>
  <concept name="~SYNONYMS">
    <syn>ERBB4</syn>
    <syn>v-erb-a erythroblastic leukemia viral oncogene homolog 4</syn>
    <syn>v-erb-a erythroblastic leukemia viral oncogene homolog 4 (avian)</syn>
    <syn>HER4</syn>
    <syn>avian erythroblastic leukemia viral (v-erb-b2) oncogene homolog 4</syn>
    <syn>v-erb-a avian erythroblastic leukemia viral oncogene homolog-like 4</syn>
    <syn>ERB4</syn>
    <syn>Receptor protein-tyrosine kinase erbB-4 precursor</syn>
    <syn>EC 2.7.1.112</syn>
    ...
  </concept>
  <concept name="~DISEASES">
    <dis>Glioblastoma</dis>
    <dis>Carcinoma, Endometrioid</dis>
    <dis>Hand Deformities, Congenital</dis>
    <dis>Neoplasms, Glandular and Epithelial</dis>
    <dis>Muscle Neoplasms</dis>
    <dis>Mammary Neoplasms</dis>
    ...
  </concept>
</concept>
<concept name="~id=1956">
  <concept name="~LOCUSLINK">
    <ll>1956@LOCUSLINK</ll>
  </concept>
  <concept name="~SWISSPROT">
    <sp>EGFR_HUMAN@SWISSPROT</sp>
  </concept>
  <concept name="~PREFERRED">
    <pr>EGFR</pr>
  </concept>
  <concept name="~SYNONYMS">
    <syn>EGFR</syn>
    <syn>ERBB</syn>
    <syn>ERBB1</syn>
    <syn>Epidermal growth factor receptor</syn>
    <syn>EC 2.7.1.112</syn>
    ...
  </concept>
  <concept name="~DISEASES">
    <dis>Pharyngeal Neoplasms</dis>
    <dis>Lymphatic Metastasis</dis>
    <dis>Brain Stem Neoplasms</dis>
    <dis>Neoplasms, Hormone-Dependent</dis>
    <dis>Fibroadenoma</dis>
    <dis>Mammary Neoplasms</dis>
    <dis>Papilloma</dis>
    <dis>Maxillary Sinus Neoplasms</dis>
    <dis>Bladder Neoplasms</dis>
    <dis>Vulvar Neoplasms</dis>
    <dis>Oropharyngeal Neoplasms</dis>
    <dis>Head and Neck Neoplasms</dis>
    ...
  </concept>
</concept>
Protein - Protein Interactions
Example sentence:
“<concept name="~id=685"> has been demonstrated to directly bind to both <concept name="~id=1956"> and <concept name="~id=2066"> and induces the growth of certain epithelial cell types.”
Extracted information:
BTC binds to EGFR
BTC binds to ERBB4
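How the two binding statements follow from the tagged sentence can be sketched with a single pattern. The rule language of the extraction tool is not shown on the slides, so the regular expression below is only a stand-in; the `PREFERRED` mapping copies the preferred names from the concept entries.

```python
import re

CONCEPT = re.compile(r'<concept name="~id=(\d+)">')

# Preferred names as listed in the concept entries
PREFERRED = {"685": "BTC", "1956": "EGFR", "2066": "ERBB4"}

# Subject concept, the trigger phrase "bind(s) to [both]", then one or
# more object concepts joined by "," or "and".
BINDING = re.compile(
    r'<concept name="~id=(\d+)">'
    r'[^<]*?\bbinds?\s+to\s+(?:both\s+)?'
    r'((?:<concept name="~id=\d+">(?:\s*(?:,|and)\s*)?)+)'
)


def extract_binding(sentence: str) -> list:
    """Return one (subject, 'binds to', object) triple per object."""
    triples = []
    for match in BINDING.finditer(sentence):
        subject = PREFERRED.get(match.group(1), match.group(1))
        for obj in CONCEPT.findall(match.group(2)):
            triples.append((subject, "binds to", PREFERRED.get(obj, obj)))
    return triples


sentence = (
    '<concept name="~id=685"> has been demonstrated to directly bind to both '
    '<concept name="~id=1956"> and <concept name="~id=2066"> and '
    'induces the growth of certain epithelial cell types.'
)
triples = extract_binding(sentence)
```

The coordination "both … and …" is resolved into two separate triples, one per binding partner, which matches the extracted information listed above.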
Protein Interaction SkillCartridge™
In this example, concepts for activating and inhibiting processes are defined.
Extraction of Protein and Gene Interactions
<protein id=111067 pmid=12515821 snr=0/> activates <protein id=100423 pmid=12515821 snr=0/> by entering in a complex with <protein id=104331 pmid=12515821 snr=0/>, Abi1, and <protein id=113769 pmid=12515821 snr=0/>.
Suppression of <protein id=107215 pmid=12516092 snr=0/> transactivation by <protein id=108242 pmid=12516092 snr=0/> accompanies inhibition of <protein id=201409 pmid=12516092 snr=0/> induction.
Although <protein id=100694 pmid=12515826 snr=11/> inside raft clusters seems to be cleaved by <protein id=101049 pmid=12515826 snr=11/>, <protein id=100694 pmid=12515826 snr=11/> outside rafts undergo cleavage by alpha-secretase.
<protein id=100694 pmid=12515826 snr=2/> is cleaved by <protein id=101094 pmid=12515826 snr=2/> or by alpha-secretase to initiate amyloidogenic (release of A beta) or nonamyloidogenic processing of <protein id=100694 pmid=12515826 snr=2/>, respectively.
The extracted concepts are listed in the skill cartridge studio.
The user can browse through the extracted concepts and view the text source.
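The protein tags in these sentences can be collected into records with a few lines of code. This is a sketch only: the tag layout is read off the examples, and interpreting `snr` as a sentence number within the abstract is an assumption, as are the function names.

```python
import re
from collections import defaultdict

# Matches tags of the form <protein id=... pmid=... snr=.../>
PROTEIN_TAG = re.compile(
    r"<protein\s+id=(\d+)\s+pmid\s*=\s*(\d+)\s+snr=(\d+)\s*/?>"
)


def parse_mentions(text: str) -> list:
    """Return one record per protein tag found in the text."""
    return [
        {"id": int(m[1]), "pmid": int(m[2]), "snr": int(m[3])}
        for m in PROTEIN_TAG.finditer(text)
    ]


def group_by_sentence(mentions: list) -> dict:
    """Group protein ids by (pmid, snr); snr is assumed to be the
    sentence number, so each key identifies one source sentence."""
    grouped = defaultdict(list)
    for mention in mentions:
        grouped[(mention["pmid"], mention["snr"])].append(mention["id"])
    return dict(grouped)


text = (
    "Suppression of <protein id=107215 pmid=12516092 snr=0/> transactivation by "
    "<protein id=108242 pmid=12516092 snr=0/> accompanies inhibition of "
    "<protein id=201409 pmid=12516092 snr=0/> induction."
)
by_sentence = group_by_sentence(parse_mentions(text))
```

Grouping by source sentence keeps every extracted mention traceable to the text it came from.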
Fraunhofer SCAI
Text Mining for the Context Specific Interpretation of Clinical Data
Chemical Compounds
Combining Text Mining with Experimental Data Analysis
Aim:
Find sub-networks (a significant area) with predominantly significantly regulated genes.
Specify behavior
for expression measurements via a p-value,
functional relatedness using connectedness in biological networks.
SigAr-Search (Sohler et al., GCB 2003) uses greedy search strategy for exploring subgraphs and statistical null model for validation of hypothetical SigArs.
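The idea of growing a connected sub-network of significantly regulated genes can be illustrated with a generic greedy search. This is not the SigAr-Search algorithm itself: the gain function, the threshold `alpha`, the stopping rule and the toy network are assumptions, and the statistical null model used for validation is left out.

```python
import math


def node_gain(p_value: float, alpha: float = 0.05) -> float:
    """Positive for genes with p < alpha, negative otherwise."""
    return math.log10(alpha) - math.log10(p_value)


def greedy_subnetwork(graph: dict, p_values: dict, seed: str,
                      alpha: float = 0.05) -> tuple:
    """Grow a connected node set from `seed`, always adding the
    neighbouring gene with the largest gain while that gain is positive."""
    selected = {seed}
    score = node_gain(p_values[seed], alpha)
    while True:
        frontier = {n for s in selected for n in graph[s]} - selected
        if not frontier:
            break
        best = max(frontier, key=lambda n: (node_gain(p_values[n], alpha), n))
        gain = node_gain(p_values[best], alpha)
        if gain <= 0:
            break
        selected.add(best)
        score += gain
    return selected, score


# Hypothetical network (symmetric adjacency lists) and expression p-values
graph = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}
p_values = {"A": 0.001, "B": 0.01, "C": 0.2, "D": 0.0001, "E": 0.03}
selected, score = greedy_subnetwork(graph, p_values, seed="A")
```

In this toy network the search stops at A, B and E: gene D is strongly regulated but is only reachable through the non-significant gene C, which shows the main limitation of a purely greedy strategy.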
Combining network information and expression data
Harnessing the Power of Semantic Text Analysis for the Interpretation of Experimental Data
Chemical Compounds
Disease Phenotypes: the association of genes and proteins with disease phenotype
Targets & Networks
Statistical Association of Proteins and Diseases
disease context specific protein-protein-interaction network
Chemical Compounds: chemical compounds and their association to targets
Chemical Compound Name Recognition
Multiple names for one compound:
Acetylsalicylic Acid, 2-(Acetyloxy)benzoic Acid, Aspirin, Acetysal, Colfarit, Dispril, Easprin, Ecotrin, Endosprin, Magnecyl, Micristin, Polopirin, Polopiryna, …
Roads to go …..
In the future we will focus on the following topics:
Daniel Hanisch
Special thanks also to Ralf Zimmer and Florian Sohler (LMU Munich) and to TEMIS group