
Symposium on

Text Mining in the Life Sciences



Text Mining in Clinical Genome Research I

Dr. Martin Hofmann


Department of Bioinformatics
Cooperation between TEMIS and Fraunhofer SCAI

TEMIS: industry leading software in text mining
• approved
• scalable
• standard compliant
• text format independent

Fraunhofer SCAI: biomedical domain expertise
• domain specific dictionaries
• ontologies
• software for biomedical entity recognition
Seite 3
From Genes to Drugs …
• Chemical Compounds: chemical compounds and their association to targets
• Disease Phenotypes: the association of genes and proteins with disease phenotype
• Targets & Networks: proteins, their function(s) and their interactions


Seite 4
Proprietary knowledge

Public knowledge

Experimental data

Patent information

Seite 5
Biological networks

Metabolic networks
• e.g. KEGG, BRENDA

Regulatory networks
• Expert networks (e.g. Transpath, Biocarta, STKE, …)
• Large scale experiments (e.g. genome-wide location analysis)

Protein interaction networks
• e.g. DIP, BIND

Seite 6
Data stored …

… in public databases

… in proprietary databases and tables

… and as free (unstructured) text

The challenge: … link between databases and text


Seite 8
Why text mining?
• Text is still by far the most important and most complete source of information in biology and medicine.
• Textual information is growing at a rate that makes it almost impossible to follow all relevant information, even in restricted fields.
• Information extracted from free text resides in the brains of individual researchers; the current state of knowledge is not made broadly available.
• Network information stored in databases (e.g. KEGG, Transpath) is often incomplete or not specific enough with respect to certain species or cell types.
Seite 9
Why text mining?
For this reason, we have developed text mining methods mainly for the construction of interaction networks based on biomedical free text.
Seite 10
Challenges in text mining

However, text is
• mostly unstructured
• redundant
• ambiguous
• open to different interpretations
• a complex information source

• Linking text mining results to experimental data


Seite 11
Knowledge Management in the Biomedical Domain

• Information retrieval and organization requires excellent text mining machinery: TEMIS Insight Discoverer Clusterer
• Information extraction requires natural language processing: TEMIS Insight Discoverer Extractor
• Construction of biological networks requires a sound understanding of biomedical entities: SCAI ProMiner
• Integrated context specific analysis of data from different domains requires biomedical ontologies: biomedical research community
Seite 12
Making Use of Public Resources ….

• Organisation of documents by meaning
• Quick overview of many documents
• Exhibiting interesting relations between topics
• E.g.: clustering of about 1100 documents about “caspase 1” with respect to their MeSH annotation (C – Diseases and D – Chemicals and Drugs)
Seite 13
Clustering of documents containing “caspase 1”

Chem. Compounds:
/12 Heterocyclic Compounds ; Heterocyclic Compounds 1-Ring ; Chemical Actions ; Enzyme Inhibitors ; Enzyme Inhibitors ; Protease Inhibitors ; Protease Inhibitors ; Chemical Actions and Uses ; Alkaloids ; Heterocyclic Compounds with 4 or More Rings ; 18
/15 Polycyclic Hydrocarbons Aromatic ; Polycyclic Hydrocarbons ; Metabolic Diseases ; Nutritional and Metabolic Diseases ; Naphthacenes ; Tetracyclines ; Naphthols ; Tetracyclines ; Naphthalenes ; Congenital Hereditary and Neonatal Diseases and Abnormalities ; 11
/6 Inorganic Chemicals ; Nitrogen Compounds ; Elements ; Metals ; Amino Acids Peptides and Proteins ; Peptides ; Pathological Conditions Signs and Symptoms ; Pathologic Processes ; Immunologic and Biological Factors ; Amino Acids ; 39

Lipids:
/8 Lipids ; Lipids and Antilipemic Agents ; Membrane Lipids ; Sphingolipids ; Glycosphingolipids ; Glycolipids ; Glycosphingolipids ; Glycosphingolipids ; Glycolipids ; Glycoconjugates ; 31

Diseases:
/7 Neoplasms ; Neoplasms by Histologic Type ; Neoplasms Glandular and Epithelial ; Neoplasms Germ Cell and Embryonal ; Neuroectodermal Tumors ; Neuroectodermal Tumors ; Neoplasms Nerve Tissue ; Proteins ; Neoplasms Connective and Soft Tissue ; Amino Acids Peptides and Proteins ; 32
/10 Nervous System Diseases ; Central Nervous System Diseases ; Wounds and Injuries ; Disorders of Environmental Origin ; Brain Diseases ; Neurodegenerative Diseases ; Trauma Nervous System ; Pathological Conditions Signs and Symptoms ; Caspases ; Cysteine Endopeptidases ; 23
/13 Cardiovascular Diseases ; Vascular Diseases ; Ischemia ; ischemia ; Immunologic and Biological Factors ; Brain Diseases ; Pathologic Processes ; Nervous System Diseases ; Caspases ; Immunologic Factors ; 16
/14 Digestive System Diseases ; Liver Diseases ; Hepatitis ; Intestinal Diseases ; Gastrointestinal Diseases ; hepatitis ; Pancreatitis ; Pancreatic Diseases ; pancreatitis ; Immunologic and Biological Factors ; 15
/19 Hemic and Lymphatic Diseases ; Hematologic Diseases ; Lymphatic Diseases ; Lymphoproliferative Disorders ; Immunoproliferative Disorders ; Lymphoma ; Lymphoma ; Lymphoproliferative Disorders ; Lymphoma ; Bone Marrow Diseases ; 9
Seite 16
Information Extraction – Construction of Information Networks

• Detection of named entities in text (e.g. protein names)
• Finding of semantic associations (e.g. protein-protein interactions)

Techniques:
• Automatic pattern extraction
• Combination of natural language processing and statistical approaches
Seite 17
Protein Name Recognition

F12A
• Multiple names for one gene: Neuronectin, GMEM, tenascin, HXB, cytotactin, hexabrachion
• Ambiguous names in databases: p21, EPO, large T antigen
• Common word names: WAS, STEP, TRAIL, iCE, StAR, …
• Multi-word terms: Interleukin 1 alpha, Tumor necrosis factor beta, …
• Spelling variants and permutations: Collagen, type I, alpha 1; COL1A1; Collagen alpha 1(I) chain; Alpha 1 collagen; Alpha-1 type I collagen; alpha 1(I) collagen
• Nested names: TNF receptor 1; collagen, type I, alpha receptor
Seite 18
Dictionary-Based Approach
Biomedical dictionaries (SwissProt, TREMBL, LocusLink, …) allow for
• easy identification of multi-word terms
• matching of multiple synonyms to one biological entity
• identification of ambiguous names
• mapping of extracted knowledge to external data sources (e.g. expression data, gene ontology information, …)

Obviously, entities not contained in the dictionary cannot be found.
But: such entries are usually of minor interest, as they cannot be automatically associated with relevant information.
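
The dictionary idea above can be illustrated with a minimal Python sketch. It is not the ProMiner implementation: the function names `build_index` and `lookup` are hypothetical, and the toy dictionary only reuses a few identifiers and synonyms from the example records shown later in this talk.

```python
from collections import defaultdict

# Toy dictionary: LocusLink ID -> synonyms (subset of the example records in this talk).
DICTIONARY = {
    "685": ["BTC", "betacellulin", "Betacellulin precursor"],
    "1956": ["EGFR", "ERBB", "ERBB1", "Epidermal growth factor receptor", "EC 2.7.1.112"],
    "2066": ["ERBB4", "HER4", "ERB4", "EC 2.7.1.112"],
}


def build_index(dictionary):
    """Invert the dictionary: lower-cased synonym -> set of entity IDs."""
    index = defaultdict(set)
    for entity_id, synonyms in dictionary.items():
        for synonym in synonyms:
            index[synonym.lower()].add(entity_id)
    return index


def lookup(index, name):
    """Return (sorted entity IDs, is_ambiguous); the list is empty for unknown names."""
    ids = sorted(index.get(name.lower(), ()))
    return ids, len(ids) > 1


INDEX = build_index(DICTIONARY)
print(lookup(INDEX, "HER4"))          # one entity
print(lookup(INDEX, "EC 2.7.1.112"))  # shared by two entities, flagged as ambiguous
```

Because every hit resolves to a database identifier, the result can be joined directly with external data such as expression tables.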
Seite 19
Dictionary-Based Approach
For instance, our semi-automatic generation and curation procedure results in a human dictionary containing ~17,000 objects and ~100,000 synonyms.
Spelling variants are handled by a special algorithm (token assignment and matching scores).
Seite 20
Scoring a Match Based on Token Classes
Basic observation: Tokens in synonyms differ in importance when matches and/or mismatches occur during the search.

• Construct a scoring function based on token classes.
• Train coefficients with machine learning methods.

→ Need to define a search procedure to employ this scoring scheme.
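
A scoring function of this kind might look like the following Python sketch. The token classes, the class weights and the function names are assumptions made for illustration; in the approach described here the coefficients are trained, not hand-set.

```python
import re

# Hypothetical token classes and hand-set weights (the talk trains these coefficients).
NON_DESCRIPTIVE = {"protein", "precursor", "chain", "type"}
MODIFIER_WORDS = {"alpha", "beta", "gamma", "i", "ii"}
MATCH_WEIGHT = {"modifier": 2.0, "non_descriptive": 0.2, "standard": 1.0}
MISMATCH_PENALTY = {"modifier": 3.0, "non_descriptive": 0.1, "standard": 1.5}


def tokenize(text):
    """Split into lower-cased alphabetic and numeric tokens."""
    return re.findall(r"[a-z]+|\d+", text.lower())


def token_class(token):
    """Assign a token to one of the three illustrative classes."""
    if token.isdigit() or token in MODIFIER_WORDS:
        return "modifier"
    if token in NON_DESCRIPTIVE:
        return "non_descriptive"
    return "standard"


def match_score(synonym, candidate):
    """Reward synonym tokens found in the candidate, penalize missing ones.

    Token order and extra candidate tokens are ignored in this sketch.
    """
    candidate_tokens = set(tokenize(candidate))
    score = 0.0
    for token in tokenize(synonym):
        cls = token_class(token)
        if token in candidate_tokens:
            score += MATCH_WEIGHT[cls]
        else:
            score -= MISMATCH_PENALTY[cls]
    return score


# A permuted spelling scores high; a changed modifier ("2" instead of "1") scores low.
print(match_score("collagen alpha 1 chain", "alpha 1 collagen"))
print(match_score("collagen alpha 1 chain", "collagen alpha 2 chain"))
```

The design point is that a missing non-descriptive token such as "chain" costs little, while a mismatching modifier such as "1" versus "2" usually denotes a different gene and is penalized heavily.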


Seite 21
Fast Approximate Matching Algorithm
• The search process needs to be repeated on update or first inclusion of dictionaries.
• The Medline database is huge and rapidly growing.
• Characteristics of the search algorithm:
o Inspect each word in the database only once.
o Keep a list of candidate solutions and extend their scores on token parsing.
o Prune rejected, report accepted candidates.

→ Search 15,000,000 abstracts for all human proteins and genes (17,000 entities) overnight.
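
The three search characteristics can be sketched in Python as a single left-to-right scan that keeps a list of open candidates. This is a simplified illustration, not the algorithm used in the talk: it substitutes exact token matching for approximate, score-based matching, and the name `find_mentions` is hypothetical.

```python
def find_mentions(tokens, synonyms):
    """Scan `tokens` once and return (start, end, entity_id) for each dictionary hit.

    `synonyms` maps a tuple of lower-cased tokens to an entity ID.
    """
    # Index synonyms by first token so that new candidates can be opened quickly.
    by_first = {}
    for syn_tokens, entity_id in synonyms.items():
        by_first.setdefault(syn_tokens[0], []).append((syn_tokens, entity_id))

    hits = []
    candidates = []  # open candidates: (start position, synonym tokens, entity ID)
    for pos, raw_token in enumerate(tokens):
        token = raw_token.lower()
        # Open a new candidate for every synonym that starts with this token.
        candidates.extend((pos, s, e) for s, e in by_first.get(token, []))
        still_open = []
        for start, syn_tokens, entity_id in candidates:
            offset = pos - start
            if syn_tokens[offset] != token:
                continue  # prune rejected candidate
            if offset + 1 == len(syn_tokens):
                hits.append((start, pos + 1, entity_id))  # report accepted candidate
            else:
                still_open.append((start, syn_tokens, entity_id))
        candidates = still_open
    return hits


SYNONYMS = {
    ("betacellulin",): "685",
    ("epidermal", "growth", "factor", "receptor"): "1956",
}
TEXT = "Betacellulin binds to epidermal growth factor receptor".split()
print(find_mentions(TEXT, SYNONYMS))
```

Each word of the text is inspected exactly once; the cost per word depends only on the number of candidates that are open at that position.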
Seite 22
Method Validation
• Defining a manually curated benchmark set is extremely time-consuming.
• Use existing annotation from (an older version of) the TransPath database, i.e. extract proteins (141) and references to abstracts (490) as gold standard.
• Evaluate the method using various dictionaries and matching parameters.
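
One common way to evaluate against such a gold standard is precision and recall over (abstract, protein) pairs. The slides do not state which measures were used, so the following Python sketch is an assumption; the identifiers in the toy data are invented for illustration.

```python
def precision_recall(predicted, gold):
    """Compare predicted (abstract, protein) pairs with gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: abstract IDs "a1".."a3" are placeholders, not Medline IDs.
GOLD = {("a1", "EGFR"), ("a1", "BTC"), ("a2", "ERBB4"), ("a3", "EGFR")}
PREDICTED = {("a1", "EGFR"), ("a2", "ERBB4"), ("a2", "BTC")}

precision, recall = precision_recall(PREDICTED, GOLD)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same comparison for each dictionary and each setting of the matching parameters gives one precision/recall point per configuration.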

Seite 23
Method Validation
Biomedical name recognition is an active field of research. The BioCreative’03 competition, for instance, will provide further grounds for large-scale validation of proposed methods …

→ Results justify use of our method as a building block for interaction extraction via higher level semantic analysis ...

Seite 24
Protein - Protein Interactions

Example sentence:
“Betacellulin (BTC) has been demonstrated to directly bind to both EGFR
and HER4 and induces the growth of certain epithelial cell types.“

Seite 26
Protein - Protein Interactions

Example sentence:
“<concept name="~id=685"> has been demonstrated to directly bind to both
<concept name="~id=1956"> and <concept name="~id=2066"> and
induces the growth of certain epithelial cell types.“
<concept name="~id=685">
  <concept name="~LOCUSLINK">
    <ll>685@LOCUSLINK</ll>
  </concept>
  <concept name="~SWISSPROT">
    <sp>BTC_HUMAN@SWISSPROT</sp>
  </concept>
  <concept name="~PREFERRED">
    <pr>BTC</pr>
  </concept>
  <concept name="~SYNONYMS">
    <syn>BTC</syn>
    <syn>betacellulin</syn>
    <syn>Betacellulin precursor</syn>
  </concept>
  <concept name="~DISEASES">
    <dis>Intestinal Neoplasms</dis>
    <dis>Carcinoma, Squamous Cell</dis>
    <dis>Pancreatic Neoplasms</dis>
    <dis>Vulvar Neoplasms</dis>
    <dis>Haemophilus Infections</dis>
    <dis>Papilloma</dis>
    <dis>Head and Neck Neoplasms</dis>
    <dis>Pharyngeal Neoplasms</dis>
    …
  </concept>
</concept>
Seite 27
<concept name="~id=2066">
  <concept name="~LOCUSLINK">
    <ll>2066@LOCUSLINK</ll>
  </concept>
  <concept name="~SWISSPROT">
    <sp>ERB4_HUMAN@SWISSPROT</sp>
  </concept>
  <concept name="~PREFERRED">
    <pr>ERBB4</pr>
  </concept>
  <concept name="~SYNONYMS">
    <syn>ERBB4</syn>
    <syn>v-erb-a erythroblastic leukemia viral oncogene homolog 4</syn>
    <syn>v-erb-a erythroblastic leukemia viral oncogene homolog 4 (avian)</syn>
    <syn>HER4</syn>
    <syn>avian erythroblastic leukemia viral (v-erb-b2) oncogene homolog 4</syn>
    <syn>v-erb-a avian erythroblastic leukemia viral oncogene homolog-like 4</syn>
    <syn>ERB4</syn>
    <syn>Receptor protein-tyrosine kinase erbB-4 precursor</syn>
    <syn>EC 2.7.1.112</syn>
    …
  </concept>
  <concept name="~DISEASES">
    <dis>Glioblastoma</dis>
    <dis>Carcinoma, Endometrioid</dis>
    <dis>Hand Deformities, Congenital</dis>
    <dis>Neoplasms, Glandular and Epithelial</dis>
    <dis>Muscle Neoplasms</dis>
    <dis>Mammary Neoplasms</dis>
    …
  </concept>
</concept>
Seite 28
<concept name="~id=1956">
  <concept name="~LOCUSLINK">
    <ll>1956@LOCUSLINK</ll>
  </concept>
  <concept name="~SWISSPROT">
    <sp>EGFR_HUMAN@SWISSPROT</sp>
  </concept>
  <concept name="~PREFERRED">
    <pr>EGFR</pr>
  </concept>
  <concept name="~SYNONYMS">
    <syn>EGFR</syn>
    <syn>ERBB</syn>
    <syn>ERBB1</syn>
    <syn>Epidermal growth factor receptor</syn>
    <syn>EC 2.7.1.112</syn>
    …
  </concept>
  <concept name="~DISEASES">
    <dis>Pharyngeal Neoplasms</dis>
    <dis>Lymphatic Metastasis</dis>
    <dis>Brain Stem Neoplasms</dis>
    <dis>Neoplasms, Hormone-Dependent</dis>
    <dis>Fibroadenoma</dis>
    <dis>Mammary Neoplasms</dis>
    <dis>Papilloma</dis>
    <dis>Maxillary Sinus Neoplasms</dis>
    <dis>Bladder Neoplasms</dis>
    <dis>Vulvar Neoplasms</dis>
    <dis>Oropharyngeal Neoplasms</dis>
    <dis>Head and Neck Neoplasms</dis>
    …
  </concept>
</concept>
Seite 29
Protein - Protein Interactions

Example sentence:
“<concept name="~id=685"> has been demonstrated to directly bind to both
<concept name="~id=1956"> and <concept name="~id=2066"> and
induces the growth of certain epithelial cell types.“

Extracted information:
• BTC binds to EGFR
• BTC binds to ERBB4
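
How the two statements could be derived from the tagged sentence is sketched below in Python. A single regular expression stands in for the semantic rules and biomedical grammar of the actual system, so this is only an illustration; `BIND`, `PREFERRED` and `extract_binding` are hypothetical names, and the pattern covers just the "bind(s) to (both) X (and Y)" construction.

```python
import re

TAG = r'<concept name="~id=(\d+)">'
# Agent tag, arbitrary untagged text, "bind(s) to", optional "both", one or two target tags.
BIND = re.compile(
    TAG + r'[^<]*?\bbinds?\s+to\s+(?:both\s+)?' + TAG + r'(?:\s+and\s+' + TAG + r')?'
)

# Preferred names taken from the concept records of the example.
PREFERRED = {"685": "BTC", "1956": "EGFR", "2066": "ERBB4"}

SENTENCE = (
    '<concept name="~id=685"> has been demonstrated to directly bind to both '
    '<concept name="~id=1956"> and <concept name="~id=2066"> and '
    'induces the growth of certain epithelial cell types.'
)


def extract_binding(sentence):
    """Return (agent, 'binds to', target) triples found by the BIND pattern."""
    relations = []
    for match in BIND.finditer(sentence):
        agent, *targets = match.groups()
        for target in targets:
            if target is not None:
                relations.append(
                    (PREFERRED.get(agent, agent), "binds to", PREFERRED.get(target, target))
                )
    return relations


for relation in extract_binding(SENTENCE):
    print(" ".join(relation))
```

Working on identifiers instead of surface names means the pattern does not need to know any protein synonyms; name recognition has already resolved them.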

Seite 30
Protein Interaction SkillCartridge™

We developed the Protein Interaction SkillCartridge™ based on TEMIS Insight Discoverer. It processes text by

• defining semantic rules, based on our dictionaries and a specialized biomedical grammar,
• identifying the relations among genes and proteins.

Seite 31
Protein Interaction SkillCartridge™

In this example, concepts for activating and inhibiting processes are defined.

Seite 32
Extraction of Protein and Gene Interactions

<protein id=111067 pmid=12515821 snr=0/> activates <protein id=100423 pmid=12515821 snr=0/> by entering in a complex with <protein id=104331 pmid=12515821 snr=0/>, Abi1, and <protein id=113769 pmid=12515821 snr=0/>.

Suppression of <protein id=107215 pmid=12516092 snr=0/> transactivation by <protein id=108242 pmid=12516092 snr=0/> accompanies inhibition of <protein id=201409 pmid=12516092 snr=0/> induction.

Although <protein id=100694 pmid=12515826 snr=11/> inside raft clusters seems to be cleaved by <protein id=101049 pmid=12515826 snr=11/>, <protein id=100694 pmid=12515826 snr=11/> outside rafts undergo cleavage by alpha-secretase.

<protein id=100694 pmid=12515826 snr=2/> is cleaved by <protein id=101094 pmid=12515826 snr=2> or by alpha-secretase to initiate amyloidogenic (release of A beta) or nonamyloidogenic processing of <protein id=100694 pmid=12515826 snr=2/>, respectively.

The extracted concepts are listed in the skill cartridge studio.
The user can browse through the extracted concepts and view the text source.
Seite 33
Seite 34
Fraunhofer SCAI

Seite 35
Text Mining for the Context Specific Interpretation of
Clinical Data

Central hypothesis: Only the inclusion of a priori knowledge allows for a systematic analysis and interpretation of large-scale biomedical experiments.

In particular, biological interaction networks reconstructed from scientific publications are an invaluable, rich source of information.

Seite 36
• Chemical Compounds
• Disease Phenotypes
• Targets & Networks: proteins, their function(s) and their interactions

Seite 37
Combining Text Mining with Experimental Data Analysis

Focus: Interpretation of gene expression data

Aim:
Find sub-networks (significant areas) consisting predominantly of significantly regulated genes.

Specify behavior
• for expression measurements via a p-value,
• for functional relatedness via connectedness in biological networks.
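
The aim can be made concrete with a small Python sketch of a greedy sub-network search. It is a generic illustration under stated assumptions, not the SigAr-Search procedure: the gain function (the margin of -log10(p) over a significance level `alpha`), the stopping rule and the name `greedy_subnetwork` are all hypothetical, and no null model is included.

```python
import math


def greedy_subnetwork(graph, p_values, seed, alpha=0.05):
    """Grow a connected node set from `seed`, always adding the most significant neighbour.

    graph:    dict mapping node -> set of neighbouring nodes (undirected)
    p_values: dict mapping node -> expression p-value (must be > 0)
    A node's gain is -log10(p) + log10(alpha); growth stops once no neighbour has a positive gain.
    """
    def gain(node):
        return -math.log10(p_values[node]) + math.log10(alpha)

    selected = {seed}
    score = gain(seed)
    while True:
        frontier = {n for s in selected for n in graph[s]} - selected
        if not frontier:
            break
        # Ties are broken by node name to keep the result deterministic.
        best = max(frontier, key=lambda n: (gain(n), str(n)))
        if gain(best) <= 0:
            break
        selected.add(best)
        score += gain(best)
    return selected, score


# Invented toy network and p-values.
GRAPH = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}
P_VALUES = {"A": 0.001, "B": 0.01, "C": 0.5, "D": 0.0001}

area, area_score = greedy_subnetwork(GRAPH, P_VALUES, seed="A")
print(sorted(area), round(area_score, 3))
```

Connectedness is enforced by only ever adding neighbours of the current set; a statistical validation step, for example comparison with scores obtained on permuted p-values, would still be needed before calling an area significant.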

Seite 38
Combining Text Mining with Experimental Data Analysis
SigAr-Search (Sohler et al., GCB 2003) uses a greedy search strategy for exploring subgraphs and a statistical null model for validation of hypothetical SigArs.

Seite 39
Combining network information and expression data

Seite 41
Harnessing the Power of Semantic Text Analysis for the
Interpretation of Experimental Data

The combination of information extracted from text and experimental data can indeed foster the building of new working hypotheses. Intuitive visualization and navigation, together with the integration of heterogeneous data types, are mandatory for this approach.

Seite 42
• Chemical Compounds
• Disease Phenotypes: the association of genes and proteins with disease phenotype
• Targets & Networks

Seite 43
Statistical Association of Proteins and Diseases

→ Natural language processing is unsuitable because it is restricted to sentence-level relations.
→ Here, disease information is based on Medical Subject Heading (MeSH) annotation in Medline (example: “Osteoarthritis”).
→ Statistical methods are used to score whether a co-occurrence is relevant.
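
The slides do not name the statistic. One standard choice for scoring co-occurrence is the upper tail of the hypergeometric distribution, sketched here in Python as an assumption; the function name `cooccurrence_p_value` and the example counts are hypothetical.

```python
from math import comb


def cooccurrence_p_value(n_total, n_protein, n_disease, n_both):
    """P(X >= n_both) if disease annotation were independent of protein mentions.

    n_total:   number of abstracts in the collection
    n_protein: abstracts mentioning the protein
    n_disease: abstracts annotated with the disease heading
    n_both:    abstracts with both
    """
    upper = min(n_protein, n_disease)
    denominator = comb(n_total, n_disease)
    tail = sum(
        comb(n_protein, k) * comb(n_total - n_protein, n_disease - k)
        for k in range(n_both, upper + 1)
    )
    return tail / denominator


# Invented counts: 10 abstracts, 3 mention the protein, 4 carry the disease heading, 3 have both.
print(cooccurrence_p_value(10, 3, 4, 3))
```

A small p-value means the protein and the disease heading occur together more often than chance would suggest; ranking proteins by this value yields a list of top candidates for a disease.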
Seite 44
Osteoarthritis Disease Network

The 70 top-ranking proteins associated with osteoarthritis display an unexpected degree of connectedness.
Seite 45
Osteoarthritis Sub-Network

Disease context specific protein-protein interaction network
Seite 46
• Chemical Compounds: chemical compounds and their association to targets
• Disease Phenotypes
• Targets & Networks

Seite 47
Chemical Compound Name Recognition

• Multiple names for one compound: Acetylsalicylic Acid, 2-(Acetyloxy)benzoic Acid, Aspirin, Acetysal, Colfarit, Dispril, Easprin, Ecotrin, Endosprin, Magnecyl, Micristin, Polopirin, Polopiryna, …
• Ambiguous names in databases and common word names: Release, proven, Marathon, Universal, …
• Multi-word terms (!): 6-deoxy-N-(7-nitrobenz-2-oxa-1,3-diazol-4-yl)aminoglucose; 8-azidoadenosine diphosphate glucose
• Spelling variants
• Nested names: Glucose, Glucose-6-Phosphate, Glucose-6-Phosphate Dehydrogenase
Seite 48
Chemical compounds

Seite 49
Roads to go …..
In the future we will focus on the following topics:

• Generation of relevant dictionaries and grammar for pharmacology and toxicology
• Extraction of information from patents
• Extraction of compound information from various textual sources
• Context-specific interpretation of biomedical data from clinical research and functional genomics experiments
Seite 50
Fraunhofer SCAI – Text Mining Team

Dr. Juliane Fluck

Dr. Hartwig Deneke

Dr. Christian Gieger

Heinz Theo Mevissen

Daniel Hanisch

Special thanks also to Ralf Zimmer and Florian Sohler (LMU Munich) and to TEMIS group
Seite 51
