Faculty Presentation For University of Arizona Human Language and Technology 2020 Homecoming

Information to Knowledge: A
Biodiversity Phenotype Case
Hong Cui, Ph.D.

School of Information
Information, Knowledge, and
Phenotype
• Information: derived from raw data and potentially
useful
• Journal articles, news stories, business reports, lecture slides,
multimedia presentations
• Knowledge: understood information, can be used in a
reasoning process
• Resides in the brain of human beings
• Used in or derived by artificial intelligence, machine learning, data
mining
• Phenotype: the observable characteristics or traits of an
organism that are produced by the interaction of
the genotype and the environment: the physical expression
of one or more genes [Webster]
Organism Traits are “Undiscovered
Public Knowledge” (Swanson, 1986)
Fig. 6. Formatted description of the common sunflower (Helianthus annuus) as constructed from traits extracted by the ETC. From the beta
version of the digital Flora of North America. See also the list of the individual traits in Fig. 5, and compare both with the original text in Fig. 1.
Source: http://dev.floranorthamerica.org/Helianthus_annuus.
Fig.1. Trait data coverage in plants and vertebrates. For most plant and vertebrate orders less than
10% of species are represented in these databases (a-b), and thousands of species in most countries
have no trait data at all (c-d). Adapted from Feng et al (in review).
Trait Extraction from Textual
Descriptions
NSF DBI-1147266, DEB-1541509, DEB-1208567, etc.

Powered by…
• CharaParser: a parsing approach

• Unsupervised learning of PoS tags for domain terms
• Feed to a Stanford Parser
• Formulate semantic chunks in constituency-based parse trees
• Transform chunks to structure/character/character value/etc., e.g.,
leaf max length 3 cm
• MicroPIE: Named Entity Recognition
• For microbe taxonomic descriptions (growth conditions, metabolic
processes)
• Support Vector Machines to classify character types
• Heuristic rules to extract traits
• EULER:
• RCC-5 based reasoning for character-based alignments between
taxonomies.
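RCC-5 distinguishes five relations between two regions: equal, proper part, proper inclusion, partial overlap, and disjointness. A minimal sketch, assuming each taxon concept is modelled as a plain set of members (or of diagnostic characters). The function name `rcc5` and the sample taxonomies are hypothetical and are not taken from EULER:

```python
def rcc5(a: set, b: set) -> str:
    """Return the RCC-5 relation that holds between two non-empty sets."""
    if not a or not b:
        raise ValueError("RCC-5 regions must be non-empty")
    if a == b:
        return "equals"
    if a < b:
        return "proper part of"
    if a > b:
        return "properly includes"
    if a & b:
        return "partially overlaps"
    return "disjoint"


# Hypothetical taxon concepts from two taxonomies, as sets of member labels.
taxonomy_1 = {"A": {"s1", "s2", "s3"}, "B": {"s4"}}
taxonomy_2 = {"X": {"s1", "s2"}, "Y": {"s3", "s4"}}

# Pairwise alignment: one RCC-5 relation per pair of concepts.
alignment = {
    (name_1, name_2): rcc5(members_1, members_2)
    for name_1, members_1 in taxonomy_1.items()
    for name_2, members_2 in taxonomy_2.items()
}
```

Exactly one of the five relations holds for any pair of non-empty sets, which is what makes the result usable as an alignment table between two taxonomies.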
Fig. 5. Partial list of traits of the common sunflower (Helianthus annuus) as extracted by the ETC from the original description in Flora of
North America (Fig. 1). See also the formatted display of the same traits in Fig. 6. Source:
http://dev.floranorthamerica.org/Helianthus_annuus.
Trait Construction Challenge
Stipules sometimes caducous, rarely obsolete or wanting

Structure   Character   Character Value   Modifier
Stipules    ?           Caducous          Sometimes
            ?           Obsolete          Rarely
            ?           Wanting           Rarely
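The rows above can be sketched as records in which the character is left empty, since the text never names it. This is a toy splitter for this one sentence pattern, not CharaParser's method; `TraitRow` and `split_statement` are hypothetical names:

```python
from typing import List, NamedTuple, Optional


class TraitRow(NamedTuple):
    structure: str
    character: Optional[str]  # None: the description does not name the character
    value: str
    modifier: str


def split_statement(statement: str) -> List[TraitRow]:
    """Toy splitter for '<Structure> <modifier> <value>[ or <value>], ...'."""
    structure, rest = statement.split(" ", 1)
    rows = []
    for clause in rest.split(","):
        modifier, values = clause.strip().split(" ", 1)
        for value in values.split(" or "):
            rows.append(TraitRow(structure, None, value.strip(), modifier))
    return rows


rows = split_statement("Stipules sometimes caducous, rarely obsolete or wanting")
for row in rows:
    print(row)
```

The splitter recovers structure, value, and modifier, but the `character` field stays `None`: deciding whether "caducous" describes duration, presence, or something else needs a glossary or ontology.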
Existing glossaries/ontologies don’t agree

OTO: Ontology Term Organizer
• Multi-user environment supporting conflict flagging, comments, and history.
[Diagram labels: character extraction, terms, categorical glossaries, domain ontologies]
OTO: Group Terms
Term Report: History and
Comments
Plant Glossaries Built in OTO
• Plant_glossary v0.1
• 3293 terms from the published FNA
Categorical Glossary + FNA v19
• New terms from 30 vols of FNA & FoC
• FNA vols 3-5, 7-8, 19-23, and 26-27;
• FoC vols 4-14, 18, and 20-25;
Issues
• Disagreements in term categorization

• Restricted vs. interpretative point of view
• “chlorophyllous and achlorophyllous denote presence or
absence of a green color, but they also implicitly
inform the reader if a plant is autotrophic or parasitic”
• Many technical terms with multiple senses.
• Ferruginous (rusty red) may also indicate that an
indumentum is present on the leaf
• Ambiguity in original description
• -like terms: appendage-like, ericaceous-form,
brickwork-like
• becoming phrases: becoming glabrous
Using OBO Ontologies
• OBO Foundry
• http://obofoundry.org/
• Anatomy ontologies
• UBERON, etc.
• Limb, fin, head, …
• BSPO (Biological Spatial Ontology)
• Anterior region, posterior region, …
• Quality ontology
• PATO (phenotype quality ontology)
• Color, shape, …
EQ Generation: EQ = Entity + Quality
Eye absent
UBERON:Eye PATO:absent
• Goal
• Convert character descriptions to EQ statements using the
set of ontologies
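A minimal sketch of the EQ idea, assuming a two-word description and tiny hand-made lookup tables. The term labels follow the slide ("UBERON:Eye", "PATO:absent") rather than real ontology identifiers, and `EQ` and `to_eq` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Optional

# Toy label-to-term lookups; real ontologies hold many thousands of terms.
ENTITY_TERMS = {"eye": "UBERON:Eye"}
QUALITY_TERMS = {"absent": "PATO:absent"}


@dataclass(frozen=True)
class EQ:
    entity: str
    quality: str


def to_eq(description: str) -> Optional[EQ]:
    """Map a two-word '<entity> <quality>' description to an EQ, or None."""
    words = description.lower().split()
    if len(words) != 2:
        return None
    entity = ENTITY_TERMS.get(words[0])
    quality = QUALITY_TERMS.get(words[1])
    if entity is None or quality is None:
        return None
    return EQ(entity, quality)


print(to_eq("Eye absent"))
```

Real character descriptions are rarely this simple, which is why a parser and full ontology search are needed in practice.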
Machine-against-People on EQ
Generation
• Machine: CharaParser+EQ
• People: 3 post-doc curators
• Experiment:
• 202 characters selected randomly
• 3 post-doc curators created EQs
• Naïve round: based on character descriptions only
• Knowledge round: free access to other resources
• Ontologies used
• UBERON, BSPO, PATO
• Curators can independently add terms to the ontologies
when needed, in both rounds of curation
Ontologies Generated
Curation    Curator   Ontologies Used   Ontologies Produced

naive       1         the initial set   naive 1 augmented set
knowledge   1         the initial set   knowledge 1 augmented set
Terms added to ontologies by curators

Result: Initial vs. Aug ontologies
                             Average Precision   Average Recall

CP+EQ + initial ontologies   0.36                0.42
CP+EQ + aug. ontologies      0.41                0.49
curators                     0.58                0.59
Performance differences are statistically significant (p<0.05)
Augmented ontologies help improve EQ generation performance.
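The slides do not say how precision and recall were scored, so the following is only a sketch of the general idea using exact matching between sets of EQ statements. The study's actual scoring may differ (for example by crediting partial matches), and the sample EQs are hypothetical:

```python
from typing import Set, Tuple


def precision_recall(predicted: Set, reference: Set) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted items against a reference."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical EQ statements for one character, as (entity, quality) pairs.
machine = {("UBERON:Eye", "PATO:absent"), ("UBERON:Fin", "PATO:absent")}
curator = {("UBERON:Eye", "PATO:absent")}

precision, recall = precision_recall(machine, curator)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Averaging these per-character values over all characters would give figures of the kind reported in the table.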

Result: Compare to BioCreative
2012
                  Precision            Recall               Precision   Recall
                  (BioCreative 2012)   (BioCreative 2012)   (now)       (now)

Curator average   0.54                 0.59                 0.58        0.59
CharaParser+EQ    0.24                 0.26                 0.41        0.49
Result: Naïve vs. Knowledge
rounds
Curators modified > 50% EQs in the Knowledge round:
                 Changed/       # Added   # Removed   No change in
                 total states   terms     terms       term count
naïve_1_know_1   261/463        87        78          96
naïve_2_know_2   326/463        79        119         128
naïve_3_know_3   298/463        92        97          109
Changes are mixed types: adding, removing, or changing terms in EQs
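The "> 50%" claim can be checked directly from the counts in the table. A short sketch; the variable names are illustrative only:

```python
TOTAL_EQS = 463

# (added terms, removed terms, no change in term count) per curator, from the table.
changes = {
    "naïve_1_know_1": (87, 78, 96),
    "naïve_2_know_2": (79, 119, 128),
    "naïve_3_know_3": (92, 97, 109),
}

# Number of changed EQs and their share of all EQs, per curator.
changed = {pair: sum(counts) for pair, counts in changes.items()}
shares = {pair: round(100 * n / TOTAL_EQS, 1) for pair, n in changed.items()}

for pair in changes:
    print(f"{pair}: {changed[pair]}/{TOTAL_EQS} changed ({shares[pair]}%)")
```

The three categories sum to the reported changed counts (261, 326, 298), giving shares of roughly 56%, 70%, and 64%.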

Changes were made on more complex EQs
                             count   avg # of terms   avg # of terms
                                     (naïve)          (knowledge)
naïve_1_know_1   changed     261     5.82             5.61
                 unchanged   202     3.84             3.84
naïve_2_know_2   unchanged   137     3.85             3.88
naïve_3_know_3   unchanged   165     3.69             3.68
Accessing knowledge does not reduce inter-curator variations

Issue 1: Quality Ontology
• Ontology search issues: phrases that appear to be a good match but have an unwanted meaning.
• “tooth crown distinct from root”
• PATO:differentiated (is_a PATO:cellular potency)
• “separate from”
• PATO:separated from (is_a PATO:structure)
• PATO:far from (is_a PATO:position)
• PATO:adjacent to (is_a PATO:position)
• not PATO:in contact with (is_a PATO:structure)
Issue 2: Pre vs. Post Composition
• When good matches can’t be found in an ontology,
• propose a new term
• post-compose a term
• When to expect a pre-composed vs. a post-composed term?
• Distal carpal bone 1
• Lateral pelvic glands
• “dermal sculpture on skull-roof”
• “dermal sculpture” does not exist in UBERON
• “surface sculpting* and (part_of some UBERON:dermatocranium)”
• “UBERON:dermatocranium and (bearer_of some PATO:sculpted surface)”
• “UBERON:dermatocranium”
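Post-composition combines existing terms into a class expression instead of adding a new term to the ontology. A minimal sketch that builds the expression as a plain string in the "E and (R some F)" pattern used above; `post_compose` is a hypothetical helper, not part of any ontology toolkit:

```python
def post_compose(entity: str, relation: str, filler: str) -> str:
    """Build a class expression of the form 'E and (R some F)'."""
    return f"{entity} and ({relation} some {filler})"


expression = post_compose(
    "UBERON:dermatocranium", "bearer_of", "PATO:sculpted surface"
)
print(expression)
```

Because the same phrase can be post-composed in several ways, or covered by a single pre-composed term, two curators can produce different but defensible results.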
Issue 3: Source Descriptions
• Description text can be interpreted in different ways, and
background knowledge, or the lack of it, sometimes
affects the resulting EQs.
• “Weak” in “Dermal sculpture on skull-roof weak” was
interpreted differently:
• CharaParser+EQ
• “PATO:decreased strength” has exact synonym “weak” in PATO
• Each curator has a different interpretation:
• “PATO:poorly developed”
• “PATO:decreased magnitude”
• “weakly sculptured surface (new term)”
• None of the three interpretations has “weak” as a synonym.
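The difference between the machine and the curators can be sketched as a plain string lookup over labels and exact synonyms. The toy table below records only what the slide states (the real PATO may list more synonyms), and `match_by_label_or_synonym` is a hypothetical name:

```python
from typing import Dict, List, Set

# Toy ontology fragment: only the synonym stated on the slide is recorded.
EXACT_SYNONYMS: Dict[str, Set[str]] = {
    "PATO:decreased strength": {"weak"},
    "PATO:poorly developed": set(),
    "PATO:decreased magnitude": set(),
}


def match_by_label_or_synonym(word: str) -> List[str]:
    """Return terms whose label or exact synonym equals the given word."""
    word = word.lower()
    hits = []
    for term, synonyms in EXACT_SYNONYMS.items():
        label = term.split(":", 1)[1]
        if word == label or word in synonyms:
            hits.append(term)
    return hits


print(match_by_label_or_synonym("weak"))
```

A lookup like this can only reach "PATO:decreased strength" from "weak"; the curators' choices depend on reading the description in context, which string matching does not capture.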
Problem: Disconnection among the Three
Literature/Authors and Ontologies:
* Authors’ terms not covered in ontologies
* Authors are not aware of ontology terms and relationships
Literature/Authors and Curators:
* No communication between authors and curators
* Curators’ “weak” may not be the author’s “weak”
Ontologies and Curators:
* Ontologies are not easy to use
* Inappropriate usage not identified
* Issues reported not fixed in time
* Term usage not tracked
But…
• Ontologies must reflect domain consensus
Working towards: make trait descriptions knowledge graphs that are “understood” by computers
[Diagram labels: Literature/Authors, Ontologies, Curators]
Currently:
• Author attitude survey

• Ontology-powered authoring platform prototypes
• Description Editor
• Matrix/Trait Editor
• Add2Ontology
• OUTPUT: human readable descriptions + computable
graphs
• Usability testing with
• Info Science students
• Bio Science students
• Botany/Plant Science authors
Trait/Character Recorder
Thank You
Acknowledgements:
NSF
Collaborators at University of Illinois Urbana-Champaign
Harvard University
University of Michigan
University of Manitoba
University of Southern Maine
University of South Dakota
Duke University
University of North Carolina
University of Florida
University of Ottawa, Canada
Agriculture and Agri-Food Canada
