Symposium On Text Mining in The Life Science TEMIS

Symposium on
Text Mining in the Life Sciences

Text Mining in Clinical Genome Research I
Dr. Martin Hofmann

Department of Bioinformatics
Cooperation between TEMIS and Fraunhofer SCAI
TEMIS: industry leading software in text mining
• approved
• scalable
• standard compliant
• text format independent
Fraunhofer SCAI: biomedical domain expertise
• domain specific dictionaries
• ontologies
• software for biomedical entity recognition
From Genes to Drugs …
• Chemical Compounds: chemical compounds and their association to targets
• Disease Phenotypes: the association of genes and proteins with disease phenotype
• Targets & Networks: proteins, their function(s) and their interactions
Proprietary knowledge
Public knowledge
Experimental data
Patent information
Biological networks
• Metabolic networks (e.g. KEGG, BRENDA)
• Regulatory networks
  o Expert networks (e.g. Transpath, Biocarta, STKE…)
  o Large scale experiments (e.g. genome-wide location analysis)
• Protein interaction networks (e.g. DIP, BIND)
Data stored …
… in public databases
… in proprietary databases and tables
… and as free (unstructured) text
The challenge: … link between databases and text

Why text mining?
• Text is still by far the most important and most complete source of information in biology and medicine.
• Textual information is growing at a rate that makes it almost impossible to follow all relevant information, even in restricted fields.
• Information extracted from free text resides in the brains of individual researchers; the current state of knowledge is not made broadly available.
• Network information stored in databases (e.g. KEGG, Transpath) is often incomplete or not specific enough with respect to certain species or cell types.
For this reason, we have developed text mining methods mainly for the construction of interaction networks based on biomedical free text.
Challenges in text mining
• However, text is
  o mostly unstructured
  o redundant
  o ambiguous
  o open to different interpretations
• Complexity of the information source
• Linking text mining results to experimental data
Knowledge Management in the Biomedical Domain
• Information retrieval and organization: requires excellent text mining machinery (TEMIS Insight Discoverer Clusterer)
• Information extraction: requires natural language processing (TEMIS Insight Discoverer Extractor)
• Construction of biological networks: requires a sound understanding of biomedical entities (SCAI ProMiner)
• Integrated context specific analysis of data from different domains: requires biomedical ontologies (biomedical research community)
Making Use of Public Resources …
• Organization of documents by meaning
• Quick overview of many documents
• Exhibiting interesting relations between topics
E.g.: Clustering of about 1100 documents about “caspase 1” with respect to their MeSH annotation (C – Diseases and D – Chemicals and Drugs)
Clustering of documents containing “caspase 1”
Chem. Compounds:
/12 Heterocyclic Compounds ; Heterocyclic Compounds 1-Ring ; Chemical Actions ; Enzyme Inhibitors ; Enzyme Inhibitors ; Protease Inhibitors ; Protease Inhibitors ; Chemical Actions and Uses ; Alkaloids ; Heterocyclic Compounds with 4 or More Rings ; 18
/15 Polycyclic Hydrocarbons Aromatic ; Polycyclic Hydrocarbons ; Metabolic Diseases ; Nutritional and Metabolic Diseases ; Naphthacenes ; Tetracyclines ; Naphthols ; Tetracyclines ; Naphthalenes ; Congenital Hereditary and Neonatal Diseases and Abnormalities ; 11
/6 Inorganic Chemicals ; Nitrogen Compounds ; Elements ; Metals ; Amino Acids Peptides and Proteins ; Peptides ; Pathological Conditions Signs and Symptoms ; Pathologic Processes ; Immunologic and Biological Factors ; Amino Acids ; 39
Lipids:
/8 Lipids ; Lipids and Antilipemic Agents ; Membrane Lipids ; Sphingolipids ; Glycosphingolipids ; Glycolipids ; Glycosphingolipids ; Glycosphingolipids ; Glycolipids ; Glycoconjugates ; 31
Diseases:
/7 Neoplasms by Histologic Type ; Neoplasms ; Neoplasms Glandular and Epithelial ; Neoplasms Germ Cell and Embryonal ; Neuroectodermal Tumors ; Neuroectodermal Tumors ; Neoplasms Nerve Tissue ; Neoplasms Connective and Soft Tissue ; Amino Acids Peptides and Proteins ; 32
/10 Nervous System Diseases ; Central Nervous System Diseases ; Wounds and Injuries ; Disorders of Environmental Origin ; Brain Diseases ; Neurodegenerative Diseases ; Trauma Nervous System ; Pathological Conditions Signs and Symptoms ; Caspases ; Cysteine Endopeptidases ; 23
/13 Cardiovascular Diseases ; Vascular Diseases ; Ischemia ; ischemia ; Immunologic and Biological Factors ; Brain Diseases ; Pathologic Processes ; Nervous System Diseases ; Caspases ; Immunologic Factors ; 16
/14 Digestive System Diseases ; Liver Diseases ; Hepatitis ; Intestinal Diseases ; Gastrointestinal Diseases ; hepatitis ; Pancreatitis ; Pancreatic Diseases ; pancreatitis ; Immunologic and Biological Factors ; 15
/19 Hemic and Lymphatic Diseases ; Hematologic Diseases ; Lymphatic Diseases ; Lymphoproliferative Disorders ; Immunoproliferative Disorders ; Lymphoma ; Lymphoma ; Lymphoproliferative Disorders ; Lymphoma ; Bone Marrow Diseases ; 9
Information Extraction – Construction of Information Networks
• Detection of named entities in text (e.g. protein names)
• Finding of semantic associations (e.g. protein-protein interactions)
Techniques:
• Automatic pattern extraction
• Combination of natural language processing and statistical approaches
Protein Name Recognition
Challenges: multiple names for one gene; ambiguous names in databases; common word names; multi-word terms; spelling variants; permutations; nested names
Examples:
• F12A
• Neuronectin, GMEM, tenascin, HXB, cytotactin, hexabrachion
• p21, EPO, large T antigen
• WAS, STEP, TRAIL, iCE, StAR, …
• Interleukin 1 alpha; Tumor necrosis factor beta, …
• Collagen, type I, alpha 1; COL1A1; Collagen alpha 1(I) chain
• Alpha 1 collagen; Alpha-1 type I collagen; alpha 1(I) collagen
• TNF receptor 1; collagen, type I, alpha receptor
Dictionary-Based Approach
Biomedical dictionaries (SwissProt, TREMBL, LocusLink, …) allow for
• easy identification of multi-word terms
• matching of multiple synonyms to one biological entity
• identification of ambiguous names
• mapping of extracted knowledge to external data sources (e.g. expression data, gene ontology information, …)
Obviously, entities not contained in the dictionary cannot be found. But: Such entries are usually of minor interest as they cannot be automatically associated with relevant information.
For instance, our semi-automatic generation and curation procedure results in a human dictionary containing ~17,000 objects and ~100,000 synonyms. Spelling variants are handled by a special algorithm (token assignment and matching scores).
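The dictionary lookup described on this slide can be sketched as follows. This is a minimal illustration, not the ProMiner implementation: the three-entry dictionary is a hand-made excerpt keyed by LocusLink ID, and the function names `tokens`, `build_index` and `find_entities` are hypothetical. Synonyms are indexed as token tuples so that multi-word terms match as a unit, and a name that maps to more than one ID is visible as ambiguous.

```python
import re
from collections import defaultdict

# Hypothetical excerpt: LocusLink ID -> synonyms (names taken from the example slides).
DICTIONARY = {
    "685": ["BTC", "betacellulin", "Betacellulin precursor"],
    "1956": ["EGFR", "ERBB", "ERBB1", "Epidermal growth factor receptor"],
    "2066": ["ERBB4", "HER4", "ERB4"],
}

def tokens(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(dictionary):
    """Map each synonym (as a token tuple) to the set of entity IDs using it."""
    index = defaultdict(set)
    for entity_id, synonyms in dictionary.items():
        for synonym in synonyms:
            index[tuple(tokens(synonym))].add(entity_id)
    return index

def find_entities(text, index):
    """Scan left to right, preferring the longest synonym at each position."""
    toks = tokens(text)
    max_len = max(len(key) for key in index)
    hits, i = [], 0
    while i < len(toks):
        for n in range(min(max_len, len(toks) - i), 0, -1):
            key = tuple(toks[i:i + n])
            if key in index:
                # More than one ID in the list would signal an ambiguous name.
                hits.append((" ".join(key), sorted(index[key])))
                i += n
                break
        else:
            i += 1  # no synonym starts here
    return hits

INDEX = build_index(DICTIONARY)
SENTENCE = "Betacellulin (BTC) has been demonstrated to directly bind to both EGFR and HER4"
HITS = find_entities(SENTENCE, INDEX)
```

Because every hit carries a database ID, the result can be joined directly with external data such as expression measurements.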
Scoring a Match Based on Token Classes
Basic observation: Tokens in synonyms differ in importance when matches and/or mismatches occur during the search.
• Construct a scoring function based on token classes.
• Train coefficients with machine learning methods.
→ Need to define a search procedure to employ this scoring scheme.
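A token-class scoring function of this kind might look like the sketch below. The slides do not name the token classes or the coefficients, so the three classes (`specific`, `modifier`, `nondescriptive`), the word lists and the hand-set weights are all assumptions for illustration; in the approach described here the coefficients would be trained.

```python
import re

# Assumed token classes with hand-set (not trained) match/mismatch coefficients.
WEIGHTS = {
    "specific": {"match": 3.0, "mismatch": -3.0},
    "modifier": {"match": 1.5, "mismatch": -4.0},
    "nondescriptive": {"match": 0.2, "mismatch": -0.2},
}
MODIFIERS = {"alpha", "beta", "gamma", "receptor", "precursor"}
NONDESCRIPTIVE = {"protein", "factor", "type", "chain", "of", "the"}

def token_class(token):
    """Assign a token to one of the assumed classes."""
    if token.isdigit() or token in MODIFIERS:
        return "modifier"
    if token in NONDESCRIPTIVE:
        return "nondescriptive"
    return "specific"

def score(synonym, candidate):
    """Sum class-dependent rewards and penalties, ignoring token order."""
    syn_tokens = re.findall(r"[a-z0-9]+", synonym.lower())
    cand_tokens = set(re.findall(r"[a-z0-9]+", candidate.lower()))
    total = 0.0
    for token in syn_tokens:
        weights = WEIGHTS[token_class(token)]
        total += weights["match"] if token in cand_tokens else weights["mismatch"]
    # Tokens that occur only in the candidate count as mismatches as well.
    for token in cand_tokens - set(syn_tokens):
        total += WEIGHTS[token_class(token)]["mismatch"]
    return total
```

With these weights a wrong modifier is punished harder than a missing filler word, so "TNF receptor 2" scores far below "TNF receptor 1" for the synonym "TNF receptor 1", while a permutation of the same tokens keeps the full score.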
Fast Approximate Matching Algorithm
• The search process needs to be repeated on update or first inclusion of dictionaries.
• The Medline database is huge and rapidly growing.
Characteristics of the search algorithm:
o Inspect each word in the database only once.
o Keep a list of candidate solutions and extend their scores on token parsing.
o Prune rejected, report accepted candidates.
→ Search 15,000,000 abstracts for all human proteins and genes (17,000 entities) overnight.
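The three characteristics of the search can be sketched as a single pass over a token stream. This is a simplification under stated assumptions: candidates are extended by exact token comparison rather than by the score-based approximate matching the slide refers to, and the function name `search_stream` and the entity labels in the usage note are hypothetical.

```python
from collections import defaultdict

def search_stream(words, synonyms):
    """Single pass over `words`.

    `synonyms` maps non-empty token tuples to entity IDs. Returns
    (start, end, entity_id) triples with inclusive word offsets.
    """
    by_first = defaultdict(list)
    for synonym in synonyms:
        by_first[synonym[0]].append(synonym)

    candidates = []  # (synonym, index of the next expected token, start offset)
    accepted = []
    for pos, word in enumerate(words):  # each word is inspected only once
        started = [(synonym, 0, pos) for synonym in by_first.get(word, [])]
        survivors = []
        for synonym, nxt, start in candidates + started:
            if synonym[nxt] != word:
                continue  # prune rejected candidate
            if nxt + 1 == len(synonym):
                accepted.append((start, pos, synonyms[synonym]))  # report
            else:
                survivors.append((synonym, nxt + 1, start))  # extend
        candidates = survivors
    return accepted

SYNONYMS = {
    ("tnf", "receptor", "1"): "TNFR1",  # hypothetical entity labels
    ("tnf",): "TNF",
    ("egfr",): "EGFR",
}
MATCHES = search_stream("the tnf receptor 1 binds egfr".split(), SYNONYMS)
```

Nested names such as "TNF" inside "TNF receptor 1" are both reported here; deciding between them is left to a later step.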
Method Validation
Definition of a manually curated benchmark set is extremely time-consuming.
Use existing annotation from (an older version of) the TransPath database,
i.e. extract proteins (141) and references to abstracts (490) as gold standard.
Evaluate method using various dictionaries and matching parameters.
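An evaluation against such a gold standard is usually summarized by precision, recall and F-measure. The slides do not state which measures were reported, so the following is only a sketch of the standard computation; the pair representation (abstract ID, protein ID) and the name `evaluate` are assumptions.

```python
def evaluate(predicted, gold):
    """Precision, recall and F1 for sets of (abstract_id, protein_id) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy data, not taken from the TransPath benchmark.
GOLD = {(1, "A"), (1, "B"), (2, "A"), (3, "C")}
PREDICTED = {(1, "A"), (2, "A"), (2, "B")}
PRECISION, RECALL, F1 = evaluate(PREDICTED, GOLD)
```

Running the same function for each dictionary and parameter setting gives comparable numbers across configurations.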
Biomedical name recognition is an active field of research. The BioCreative’03 competition, for instance, will provide further grounds for large-scale validation of proposed methods …
→ Results justify use of our method as building block for interaction extraction via higher level semantic analysis ...
Protein-Protein Interactions
Example sentence:
“Betacellulin (BTC) has been demonstrated to directly bind to both EGFR and HER4 and induces the growth of certain epithelial cell types.”
Example sentence with recognized entities:
“<concept name="~id=685"> has been demonstrated to directly bind to both <concept name="~id=1956"> and <concept name="~id=2066"> and induces the growth of certain epithelial cell types.”

<concept name="~id=685">
  <concept name="~LOCUSLINK">
    <ll>685@LOCUSLINK</ll>
  </concept>
  <concept name="~SWISSPROT">
    <sp>BTC_HUMAN@SWISSPROT</sp>
  </concept>
  <concept name="~PREFERRED">
    <pr>BTC</pr>
  </concept>
  <concept name="~SYNONYMS">
    <syn>BTC</syn>
    <syn>betacellulin</syn>
    <syn>Betacellulin precursor</syn>
  </concept>
  <concept name="~DISEASES">
    <dis>Intestinal Neoplasms</dis>
    <dis>Carcinoma, Squamous Cell</dis>
    <dis>Pancreatic Neoplasms</dis>
    <dis>Vulvar Neoplasms</dis>
    <dis>Haemophilus Infections</dis>
    <dis>Papilloma</dis>
    <dis>Head and Neck Neoplasms</dis>
    <dis>Pharyngeal Neoplasms</dis>
    …
  </concept>
</concept>

<concept name="~id=2066">
  <concept name="~LOCUSLINK">
    …
  </concept>
  <concept name="~SWISSPROT">
    <sp>ERB4_HUMAN@SWISSPROT</sp>
  </concept>
  <concept name="~PREFERRED">
    <pr>ERBB4</pr>
  </concept>
  <concept name="~SYNONYMS">
    <syn>ERBB4</syn>
    <syn>v-erb-a erythroblastic leukemia viral oncogene homolog 4</syn>
    <syn>v-erb-a erythroblastic leukemia viral oncogene homolog 4 (avian)</syn>
    <syn>HER4</syn>
    <syn>avian erythroblastic leukemia viral (v-erb-b2) oncogene homolog 4</syn>
    <syn>v-erb-a avian erythroblastic leukemia viral oncogene homolog-like 4</syn>
    <syn>ERB4</syn>
    <syn>Receptor protein-tyrosine kinase erbB-4 precursor</syn>
    <syn>EC 2.7.1.112</syn>
    ...
  </concept>
  <concept name="~DISEASES">
    <dis>Glioblastoma</dis>
    <dis>Carcinoma, Endometrioid</dis>
    <dis>Hand Deformities, Congenital</dis>
    <dis>Neoplasms, Glandular and Epithelial</dis>
    <dis>Muscle Neoplasms</dis>
    <dis>Mammary Neoplasms</dis>
    ...
  </concept>
</concept>

<concept name="~id=1956">
  <concept name="~LOCUSLINK">
    <ll>1956@LOCUSLINK</ll>
  </concept>
  <concept name="~SWISSPROT">
    <sp>EGFR_HUMAN@SWISSPROT</sp>
  </concept>
  <concept name="~PREFERRED">
    <pr>EGFR</pr>
  </concept>
  <concept name="~SYNONYMS">
    <syn>EGFR</syn>
    <syn>ERBB</syn>
    <syn>ERBB1</syn>
    <syn>Epidermal growth factor receptor</syn>
    <syn>EC 2.7.1.112</syn>
    ...
  </concept>
  <concept name="~DISEASES">
    <dis>Pharyngeal Neoplasms</dis>
    <dis>Lymphatic Metastasis</dis>
    <dis>Brain Stem Neoplasms</dis>
    <dis>Neoplasms, Hormone-Dependent</dis>
    <dis>Fibroadenoma</dis>
    <dis>Mammary Neoplasms</dis>
    <dis>Maxillary Sinus Neoplasms</dis>
    <dis>Bladder Neoplasms</dis>
    <dis>Vulvar Neoplasms</dis>
    <dis>Oropharyngeal Neoplasms</dis>
    <dis>Head and Neck Neoplasms</dis>
    ...
  </concept>
</concept>

Extracted information:
BTC binds to EGFR
BTC binds to ERBB4
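The step from the tagged sentence to the two extracted relations can be illustrated with a single hand-written pattern. This is only a sketch: the real extraction relies on semantic rules and a biomedical grammar, whereas the regular expression `PATTERN`, the lookup table `PREFERRED` and the function `extract_binding` below are hypothetical stand-ins covering just the "bind to [both] X [and Y]" construction.

```python
import re

# Preferred names for the three entity IDs in the example sentence.
PREFERRED = {"685": "BTC", "1956": "EGFR", "2066": "ERBB4"}

CONCEPT = r'<concept name="~id=(\d+)">'
# Assumed pattern: AGENT ... bind(s) to [both] TARGET [and TARGET]
PATTERN = re.compile(
    CONCEPT
    + r"[^.]*?\bbinds?\s+to\s+(?:both\s+)?"
    + CONCEPT
    + r"(?:\s+and\s+" + CONCEPT + r")?"
)

def extract_binding(sentence):
    """Return (agent, 'binds to', target) triples using preferred names."""
    relations = []
    for match in PATTERN.finditer(sentence):
        agent, *targets = match.groups()
        for target in targets:
            if target is not None:
                relations.append(
                    (PREFERRED.get(agent, agent), "binds to", PREFERRED.get(target, target))
                )
    return relations

TAGGED = (
    '<concept name="~id=685"> has been demonstrated to directly bind to both '
    '<concept name="~id=1956"> and <concept name="~id=2066"> and induces the '
    "growth of certain epithelial cell types."
)
RELATIONS = extract_binding(TAGGED)
```

Working on entity IDs instead of surface names means the pattern does not need to know whether the text said "HER4" or "ERBB4".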
Protein Interaction SkillCartridge™
We developed the Protein Interaction Skill Cartridge, based on TEMIS Insight Discoverer, which processes text by
• defining semantic rules, based on our dictionaries and a specialized biomedical grammar,
• identifying the relations among genes and proteins.
In this example, concepts for activating and inhibiting processes are defined.
Extraction of Protein and Gene Interactions
<protein id=111067 pmid=12515821 snr=0/> activates <protein id=100423 pmid=12515821 snr=0/> by entering in a complex with <protein id=104331 pmid=12515821 snr=0/>, Abi1, and <protein id=113769 pmid=12515821 snr=0/>.
Suppression of <protein id=107215 pmid=12516092 snr=0/> transactivation by <protein id=108242 pmid=12516092 snr=0/> accompanies inhibition of <protein id=201409 pmid=12516092 snr=0/> induction.
Although <protein id=100694 pmid=12515826 snr=11/> inside raft clusters seems to be cleaved by <protein id=101049 pmid=12515826 snr=11/>, <protein id=100694 pmid=12515826 snr=11/> outside rafts undergo cleavage by alpha-secretase.
<protein id=100694 pmid=12515826 snr=2/> is cleaved by <protein id=101094 pmid=12515826 snr=2/> or by alpha-secretase to initiate amyloidogenic (release of A beta) or nonamyloidogenic processing of <protein id=100694 pmid=12515826 snr=2/>, respectively.
The extracted concepts are listed in the skill cartridge studio. The user can browse through the extracted concepts and view the text source.
Fraunhofer SCAI
Text Mining for the Context Specific Interpretation of Clinical Data
Central hypothesis: Only the inclusion of a priori knowledge allows for a systematic analysis and interpretation of large-scale biomedical experiments.
In particular, biological interaction networks reconstructed from scientific publications are an invaluable, rich source of information.
Targets & Networks: proteins, their function(s) and their interactions
Combining Text Mining with Experimental Data Analysis
Focus: Interpretation of gene expression data
Aim: Find sub-networks (a significant area) with predominantly significantly regulated genes.
Specify behavior
• for expression measurements via a p-value,
• for functional relatedness using connectedness in biological networks.
SigAr-Search (Sohler et al., GCB 2003) uses a greedy search strategy for exploring subgraphs and a statistical null model for validation of hypothetical SigArs.
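A greedy search for a connected sub-network of significantly regulated genes can be sketched as below. This is not the SigAr-Search algorithm itself: the per-gene gain (negative log p-value relative to a threshold `alpha`), the stopping rule and the name `greedy_subnetwork` are assumptions, and the statistical null model used for validation is omitted.

```python
from math import log10

def greedy_subnetwork(graph, pvalues, seed, alpha=0.05):
    """Grow a connected gene set from `seed`, adding the best neighbour each step.

    `graph` maps each gene to its neighbours; `pvalues` maps genes to
    expression p-values. Growth stops when no neighbour is significant.
    """
    def gain(gene):
        # Positive exactly when the gene's p-value is below alpha.
        return -log10(pvalues[gene]) + log10(alpha)

    selected = {seed}
    score = gain(seed)
    while True:
        frontier = {n for gene in selected for n in graph[gene]} - selected
        if not frontier:
            break
        best = max(frontier, key=gain)
        if gain(best) <= 0:
            break
        selected.add(best)
        score += gain(best)
    return selected, score

# Toy network and p-values for illustration only.
GRAPH = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
PVALUES = {"A": 0.001, "B": 0.01, "C": 0.5, "D": 0.0001}
AREA, AREA_SCORE = greedy_subnetwork(GRAPH, PVALUES, seed="A")
```

In the toy network the search picks up B and D but leaves out the unregulated gene C, so connectedness and significance are used together.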
Combining network information and expression data
Harnessing the Power of Semantic Text Analysis for the Interpretation of Experimental Data
The combination of information extracted from text and experimental data can indeed foster the building of new working hypotheses. Intuitive visualization and navigation, together with the integration of heterogeneous data types, are mandatory for this approach.
Disease Phenotypes: the association of genes and proteins with disease phenotype
Statistical Association of Proteins and Diseases
→ Natural language processing is unsuitable because it is restricted to sentence-level relations
→ Here, disease information is based on Medical Subject Heading annotation (e.g. “Osteoarthritis”) in Medline
→ Use of statistical methods to score whether a co-occurrence is relevant
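The slide does not say which statistic was used to score co-occurrence. One common choice, shown here purely as an assumption, is a one-sided hypergeometric test: given the number of abstracts mentioning the protein and the number annotated with the disease heading, how likely is an overlap at least as large as the observed one if the two were independent? The function name `cooccurrence_pvalue` is hypothetical.

```python
from math import comb

def cooccurrence_pvalue(n_total, n_protein, n_disease, n_both):
    """P(overlap >= n_both) under a hypergeometric null model.

    n_total   : abstracts in the collection
    n_protein : abstracts mentioning the protein
    n_disease : abstracts annotated with the disease heading
    n_both    : abstracts with both
    """
    upper = min(n_protein, n_disease)
    tail = sum(
        comb(n_disease, k) * comb(n_total - n_disease, n_protein - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n_total, n_protein)

# Toy counts for illustration only.
P_STRONG = cooccurrence_pvalue(n_total=10, n_protein=3, n_disease=4, n_both=3)
P_NONE = cooccurrence_pvalue(n_total=10, n_protein=3, n_disease=4, n_both=0)
```

Ranking proteins by such a p-value for one disease heading would give a top list of the kind used for the disease network.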
Osteoarthritis Disease Network
• top ranking 70 proteins associated with osteoarthritis
• display unexpected degree of connectedness
Osteoarthritis Sub-Network
• disease context specific protein-protein interaction network
Chemical Compounds: chemical compounds and their association to targets
Chemical Compound Name Recognition
Challenges: multiple names for one compound; ambiguous names in databases; common word names; multi-word terms (!); spelling variants; nested names
Examples:
• Acetylsalicylic Acid, 2-(Acetyloxy)benzoic Acid, Aspirin, Acetysal, Colfarit, Dispril, Easprin, Ecotrin, Endosprin, Magnecyl, Micristin, Polopirin, Polopiryna, …
• Release, proven, Marathon, Universal, …
• 6-deoxy-N-(7-nitrobenz-2-oxa-1,3-diazol-4-yl)aminoglucose
• 8-azidoadenosine diphosphate glucose
• Glucose, Glucose-6-Phosphate, Glucose-6-Phosphate Dehydrogenase
Chemical compounds
Roads to go …
In the future we will focus on the following topics:
• Generation of relevant dictionaries and grammar for pharmacology and toxicology
• Extraction of information from patents
• Extraction of compound information from various textual sources
• Context-specific interpretation of biomedical data from clinical research and functional genomics experiments
Fraunhofer SCAI – Text Mining Team
Dr. Juliane Fluck
Dr. Hartwig Deneke
Dr. Christian Gieger
Heinz Theo Mevissen
Daniel Hanisch
Special thanks also to Ralf Zimmer and Florian Sohler (LMU Munich) and to TEMIS group