
Information to Knowledge: A Biodiversity Phenotype Case

Hong Cui, Ph.D.


School of Information
Information, Knowledge, and Phenotype
• Information: derived from raw data and potentially useful
• Journal articles, news stories, business reports, lecture slides,
multimedia presentations
• Knowledge: understood information, can be used in a
reasoning process
• Resides in the brain of human beings
• Used in or derived by artificial intelligence, machine learning, data
mining
• Phenotype: the observable characteristics or traits of an
organism that are produced by the interaction of
the genotype and the environment: the physical expression
of one or more genes [Webster]
Organism Traits are “Undiscovered Public Knowledge” (Swanson, 1986)

Fig. 6. Formatted description of the common sunflower (Helianthus annuus) as constructed from traits extracted by the ETC. From the beta
version of the digital Flora of North America. See also the list of the individual traits in Fig. 5, and compare both with the original text in Fig. 1.
Source: http://dev.floranorthamerica.org/Helianthus_annuus.
Fig.1. Trait data coverage in plants and vertebrates. For most plant and vertebrate orders less than
10% of species are represented in these databases (a-b), and thousands of species in most countries
have no trait data at all (c-d). Adapted from Feng et al (in review).
Trait Extraction from Textual Descriptions

NSF DBI-1147266, DEB-1541509, DEB-1208567, etc.


Powered by…

• CharaParser: a parsing approach


• Unsupervised learning of PoS tags for domain terms
• Feed to a Stanford Parser
• Formulate semantic chunks in constituency-based parse trees
• Transform chunks to structure/character/character value/etc., e.g.,
leaf max length 3 cm
• MicroPIE: Named Entity Recognition
• For microbe taxonomic descriptions (growth conditions, metabolic
processes)
• Support Vector Machines to classify character types
• Heuristic rules to extract traits
• EULER:
• RCC-5 based reasoning for character-based alignments between
taxonomies.
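The five RCC-5 relations behind this kind of alignment can be illustrated on finite sets. This is a minimal sketch, not EULER's implementation: it assumes each taxon concept can be stood in for by a set of member items (a hypothetical simplification), and the relation names are the standard RCC-5 ones.

```python
def rcc5(a: set, b: set) -> str:
    """Return the RCC-5 relation that holds between two non-empty sets."""
    if not a or not b:
        raise ValueError("RCC-5 regions must be non-empty")
    if a == b:
        return "equals"
    if a < b:          # a is a strict subset of b
        return "proper part of"
    if a > b:          # a is a strict superset of b
        return "inverse proper part of"
    if a & b:          # some shared members, neither contains the other
        return "partially overlaps"
    return "disjoint from"


# Hypothetical member sets for two taxon concepts from different taxonomies.
concept_a = {"specimen1", "specimen2", "specimen3"}
concept_b = {"specimen2", "specimen3"}
print("concept_a", rcc5(concept_a, concept_b), "concept_b")
```

Exactly one of the five relations holds for any pair of non-empty sets, which is what makes them usable as alignment labels between two taxonomies.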
Fig. 5. Partial list of traits of the common sunflower (Helianthus annuus) as extracted by the ETC from the original description in Flora of
North America (Fig. 1). See also the formatted display of the same traits in Fig. 6. Source:
http://dev.floranorthamerica.org/Helianthus_annuus.
Trait Construction Challenge

Stipules sometimes caducous, rarely obsolete or wanting


Structure   Character   Character Value   Modifier
Stipules    ?           Caducous          Sometimes
Stipules    ?           Obsolete          Rarely
Stipules    ?           Wanting           Rarely

Existing glossaries/ontologies don’t agree
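The challenge can be made concrete with a toy parser for the example sentence. This is only a sketch under stated assumptions: the modifier list and the splitting rules are hypothetical and far simpler than CharaParser's parsing approach. The point is that structure, value, and modifier fall out of the text, while the character stays unknown until a glossary or ontology supplies it.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, deliberately tiny modifier list for this one example.
MODIFIERS = {"sometimes", "rarely", "often", "usually"}


@dataclass
class Trait:
    structure: str
    character: Optional[str]  # None = not resolvable from the text alone
    value: str
    modifier: Optional[str]


def parse(sentence: str) -> List[Trait]:
    """Split 'Structure [modifier] value, [modifier] value or value'."""
    structure, rest = sentence.split(" ", 1)
    traits = []
    for phrase in rest.split(","):
        words = phrase.split()
        modifier = None
        if words and words[0].lower() in MODIFIERS:
            modifier = words.pop(0).lower()
        # A modifier applies to every value joined by "or" in its phrase.
        for value in " ".join(words).split(" or "):
            traits.append(Trait(structure, None, value.strip(), modifier))
    return traits


for trait in parse("Stipules sometimes caducous, rarely obsolete or wanting"):
    print(trait)
```

Every record comes back with `character=None`, mirroring the "?" column: the sentence never says which character "caducous", "obsolete", or "wanting" is a value of.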


OTO: Ontology Term Organizer

• Multi-user environment supporting conflict flagging, comments, and history.

(Diagram labels: character extraction, terms, categorical glossaries, domain ontologies)
OTO: Group Terms
Term Report: History and Comments
Plant Glossaries Built in OTO
• Plant_glossary v0.1
• 3293 terms from the published FNA
Categorical Glossary + FNA v19
• New terms from 30 vols of FNA & FoC
• FNA vols 3-5, 7-8, 19-23, and 26-27;
• FoC vols 4-14, 18, and 20-25;
Issues

• Disagreements in term categorization


• Restricted vs. interpretative point of view
• “chlorophyllous and achlorophyllous denote presence or
absence of a green color, but they also implicitly
inform the reader if a plant is autotrophic or parasitic”
• Many technical terms with multiple senses.
• Ferruginous (rusty red) may also indicate that an
indumentum is present on the leaf
• Ambiguity in original description
• -like terms: appendage-like, ericaceous-form,
brickwork-like
• becoming phrases: becoming glabrous
Using OBO Ontologies

• OBO Foundry
• http://obofoundry.org/
• Anatomy ontologies
• UBERON, etc.
• Limb, fin, head, …
• BSPO (Biological Spatial Ontology)
• Anterior region, posterior region, …
• Quality ontology
• PATO (phenotype quality ontology)
• Color, shape, …
EQ Generation
EQ = Entity + Quality
Eye absent
UBERON:Eye PATO:absent

• Goal
• Convert character descriptions to EQ statements using the
set of ontologies
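An EQ statement can be pictured as a pair of ontology terms. The sketch below uses labels exactly as written on the slide ("UBERON:Eye", "PATO:absent"); the class names `Term` and `EQ` are hypothetical, and real ontology term identifiers are not shown here because the slide does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    ontology: str  # e.g. "UBERON", "BSPO", "PATO"
    label: str     # human-readable label, as on the slide

    def __str__(self) -> str:
        return f"{self.ontology}:{self.label}"


@dataclass(frozen=True)
class EQ:
    entity: Term   # what is being described
    quality: Term  # the quality it bears

    def __str__(self) -> str:
        return f"{self.entity} {self.quality}"


# The character description "Eye absent" as an EQ statement.
eye_absent = EQ(entity=Term("UBERON", "Eye"), quality=Term("PATO", "absent"))
print(eye_absent)
```

Making the classes frozen keeps EQ statements hashable, so they can be collected into sets and compared across curators or against machine output.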
Machine-against-People on EQ Generation
• Machine: CharaParser+EQ
• People: 3 post-doc curators
• Experiment:
• 202 characters selected randomly
• 3 post-doc curators created EQs
• Naïve round: based on character descriptions only
• Knowledge round: free access to other resources
• Ontologies used
• UBERON, BSPO, PATO
• Curators can add terms independently when needed to
ontologies in both rounds of curation
Ontologies Generated

Curation    Curator   Ontologies Used   Ontologies Produced
naïve       1         the initial set   naïve 1 augmented set
naïve       2         the initial set   naïve 2 augmented set
naïve       3         the initial set   naïve 3 augmented set
knowledge   1         the initial set   knowledge 1 augmented set
knowledge   2         the initial set   knowledge 2 augmented set
knowledge   3         the initial set   knowledge 3 augmented set

Terms added to ontologies by curators


Result: Initial vs. Aug ontologies

                               Average Precision   Average Recall
CP+EQ + initial ontologies     0.36                0.42
CP+EQ + augmented ontologies   0.41                0.49
curators                       0.58                0.59

Performance differences are statistically significant (p<0.05)

Augmented ontologies help improve EQ generation performance.
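For readers unfamiliar with the two measures, here is a minimal sketch of how precision and recall could be computed for one set of generated EQs against a curator's set. It assumes exact matching of whole EQ statements, which may well be stricter than the scoring actually used in the experiment, and the example pairs are toy data rather than results from the study.

```python
from typing import Set, Tuple

EQPair = Tuple[str, str]  # (entity label, quality label)


def precision_recall(predicted: Set[EQPair], gold: Set[EQPair]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted EQs against gold EQs."""
    if not predicted or not gold:
        raise ValueError("both sets must be non-empty")
    true_positives = len(predicted & gold)
    return true_positives / len(predicted), true_positives / len(gold)


# Toy data only: three machine EQs, four curator EQs, two in common.
machine = {("eye", "absent"), ("fin", "elongated"), ("tooth", "curved")}
curator = {("eye", "absent"), ("fin", "elongated"),
           ("limb", "present"), ("head", "round")}

precision, recall = precision_recall(machine, curator)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision asks how many of the generated EQs are right; recall asks how many of the curator's EQs were found.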


Result: Compare to BioCreative 2012

                  Precision            Recall               Precision   Recall
                  (BioCreative 2012)   (BioCreative 2012)   (now)       (now)
Curator average   0.54                 0.59                 0.58        0.59
CharaParser+EQ    0.24                 0.26                 0.41        0.49
Result: Naïve vs. Knowledge rounds
Curators modified more than 50% of EQs in the Knowledge round:
                 Changed/total states   # Added terms   # Removed terms   No change in term count
naïve_1_know_1   261/463                87              78                96
naïve_2_know_2   326/463                79              119               128
naïve_3_know_3   298/463                92              97                109

Changes are mixed types: adding, removing, or changing terms in EQs
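The "more than 50%" claim can be checked directly from the table. The short calculation below uses only the reported numbers; the one reading it relies on is that the three breakdown columns partition the changed statements, which is consistent with the fact that they sum to the reported changed counts.

```python
TOTAL = 463  # statements per curator, from the table

# (added terms, removed terms, no change in term count) per curator pair
breakdown = {
    "naïve_1_know_1": (87, 78, 96),
    "naïve_2_know_2": (79, 119, 128),
    "naïve_3_know_3": (92, 97, 109),
}

changed = {name: sum(parts) for name, parts in breakdown.items()}
shares = {name: count / TOTAL for name, count in changed.items()}

for name in breakdown:
    print(f"{name}: {changed[name]}/{TOTAL} = {shares[name]:.1%}")
# naïve_1_know_1: 261/463 = 56.4%
# naïve_2_know_2: 326/463 = 70.4%
# naïve_3_know_3: 298/463 = 64.4%
```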


Changes were made on more complex EQs

                             count   avg # of terms (naïve)   avg # of terms (knowledge)
naïve_1_know_1   changed     261     5.82                     5.61
naïve_1_know_1   unchanged   202     3.84                     3.84
naïve_2_know_2   changed     326     5.06                     5.57
naïve_2_know_2   unchanged   137     3.85                     3.88
naïve_3_know_3   changed     298     5.23                     5.29
naïve_3_know_3   unchanged   165     3.69                     3.68

Accessing knowledge does not reduce inter-curator variations


Issue 1: Quality Ontology

• Ontology search issues: phrases that appear to be a
good match but carry an unwanted meaning.
• “tooth crown distinct from root”,
• PATO:differentiated is_a PATO:cellular potency,
• “separate from”,
• PATO:separated from (is_a PATO:structure)
• PATO:far from (is_a PATO:position)
• PATO:adjacent to (is_a PATO:position)
• not PATO:in contact with (is_a PATO:structure).
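One way to catch such mismatches is to inspect a candidate's is_a ancestors before accepting a label match. The sketch below is an illustration only: the `IS_A` table contains just the links named on this slide (the real PATO hierarchy is deeper), and the set of acceptable parents is a hypothetical choice a curator or tool would have to make.

```python
from typing import Dict, List, Set

# Only the is_a links mentioned on the slide; not the full PATO hierarchy.
IS_A: Dict[str, str] = {
    "differentiated": "cellular potency",
    "separated from": "structure",
    "far from": "position",
    "adjacent to": "position",
    "in contact with": "structure",
}


def ancestors(term: str) -> List[str]:
    """Follow is_a links upward and return every ancestor found."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def plausible(candidates: List[str], allowed_parents: Set[str]) -> List[str]:
    """Keep candidates that have at least one ancestor in allowed_parents."""
    return [c for c in candidates if set(ancestors(c)) & allowed_parents]


# "distinct from" describes a spatial or structural relation, so a term
# that is_a "cellular potency" is rejected even though its label looks close.
hits = plausible(["differentiated", "separated from"], {"structure", "position"})
print(hits)
```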
Issue 2: Pre vs. Post Composition

• When good matches can’t be found in an ontology,


• propose a new term
• post-compose a term
• When to expect a pre-composed vs. a post-composed term
• Distal carpal bone 1
• Lateral pelvic glands
• ‘dermal sculpture on skull-roof’
• "dermal sculpture" does not exist in UBERON
• “surface sculpting* and (part_of some UBERON:dermatocranium)”
• “UBERON: dermatocranium and (bearer_of some PATO:sculpted
surface)”
• “UBERON: dermatocranium”
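Post-composition amounts to building a class expression from existing terms instead of requesting a new one. A minimal sketch, assuming the `A and (relation some B)` form shown in the examples above; the helper name `post_compose` is hypothetical, and the strings are plain text rather than validated ontology expressions.

```python
def post_compose(genus: str, relation: str, filler: str) -> str:
    """Build a post-composed class expression as plain text."""
    return f"{genus} and ({relation} some {filler})"


# Two alternative renderings of 'dermal sculpture on skull-roof'.
by_sculpture = post_compose(
    "surface sculpting", "part_of", "UBERON:dermatocranium")
by_bone = post_compose(
    "UBERON:dermatocranium", "bearer_of", "PATO:sculpted surface")

print(by_sculpture)
print(by_bone)
```

The two results describe the same anatomy from different starting points, which is why different curators can produce different, individually defensible entities for one description.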
Issue 3: Source Descriptions

• Description text can be interpreted in different ways, so
background knowledge, or the lack of it, sometimes
affects the resulting EQs.
• “Weak” in “Dermal sculpture on skull-roof weak” was
interpreted differently:
• CharaParser+EQ
• “PATO:decreased strength” has exact synonym “weak” in PATO
• Each curator has a different interpretation:
• “PATO:poorly developed”
• “PATO:decreased magnitude”
• “weakly sculptured surface (new term)”
• None of the three interpretations has “weak” as a synonym.
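A purely lexical matcher makes the machine's choice easy to reproduce. The sketch below assumes term resolution by exact synonym, which the slide suggests but does not spell out. Only the "weak" synonym is taken from the slide; the other synonym sets are left empty on purpose because the slide states only that "weak" is not among them.

```python
from typing import Dict, List, Set

# Toy synonym index; see the note above on what is and is not from the slide.
EXACT_SYNONYMS: Dict[str, Set[str]] = {
    "PATO:decreased strength": {"weak"},
    "PATO:poorly developed": set(),
    "PATO:decreased magnitude": set(),
}


def match_by_synonym(word: str) -> List[str]:
    """Return every term that lists the word as an exact synonym."""
    word = word.lower()
    return sorted(term for term, synonyms in EXACT_SYNONYMS.items()
                  if word in synonyms)


machine_choice = match_by_synonym("Weak")
print(machine_choice)
```

The curators' choices cannot be reached this way, since they come from reading "weak" in the context of sculpture on bone rather than from the string itself.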
Problem: Disconnection among the Three

Literature/Authors and Ontologies
• Authors’ terms not covered in ontologies
• Authors are not aware of ontology terms and relationships

Literature/Authors and Curators
• No communication between authors and curators
• Curators’ “weak” may not be the author’s “weak”

Ontologies and Curators
• Ontologies are not easy to use
• Inappropriate usage not identified
• Issues reported not fixed in time
• Term usage not tracked
But…
• Ontologies must reflect domain consensus

Working towards: make trait descriptions knowledge graphs that are “understood” by computers

(Diagram: Literature/Authors, Ontologies, Curators)
Currently:

• Author attitude survey


• Ontology-powered authoring platform prototypes
• Description Editor
• Matrix/Trait Editor
• Add2Ontology
• OUTPUT: human readable descriptions + computable
graphs
• Usability testing with
• Info Science students
• Bio Science students
• Botany/Plant Science authors
Trait/Character Recorder
Thank You
Acknowledgements:
NSF
Collaborators at University of Illinois Urbana-Champaign
Harvard University
University of Michigan
University of Manitoba
University of Southern Maine
University of South Dakota
Duke University
University of North Carolina
University of Florida
University of Ottawa, Canada
Agriculture and Agri-Food Canada
