
Author-Driven Computable [Phenotype] Data and Ontology Production for Taxonomists
Hong Cui1, Bruce Ford2, Julian Starr3, James Macklin4,
Anton Reznicek5, Noah W. Giebink1, Dylan Longert3,
Étienne Léveillé-Bourret6, Limin Zhang1
1. University of Arizona 2. University of Manitoba 3. University of Ottawa
4. Agriculture and Agri-Food Canada 5. University of Michigan 6. University of Montreal

This set of slides is licensed under the CC0 1.0 Universal License
Talk outline
● Currently morphological/phenotype publications are not FAIR [Findable, Accessible, Interoperable, and Reusable]
○ To computationally use published information
■ Professional curation: burdened by the high cost and high inter-curator variation
■ Automated information extraction/natural language processing techniques: extracted information remains ambiguous and contains extraction errors
■ Author curation: post-publication, low response rate, ontology issues
● Authors in the driver's seat: fast, consistent, computable phenotype data and ontology production
○ Themes
■ Authors produce FAIR data at the time of the publication: RDF data + human readable morphological descriptions
■ Directly contribute to relevant ontologies
■ Term standards emerge from community practices.
○ Outcomes
■ A 2019 survey of biologists' attitudes towards ontologies, curation, and potential solutions
■ Semantic representation of morphological characters
● Numerical Characters
● Colors
■ Software prototype implementation
● Character Recorder (Web application)
● Conflict Resolver (mobile application)
The Attitude Survey
● Attitudes of biologists towards adopting ontologies
○ November 2018 to February 2019
○ 130 responses received
○ 91 effective responses analyzed, all publish or consume phenotype data
■ https://uarizona.co1.qualtrics.com/jfe/form/SV_6VRPiQFGFNzYCwd
○ 28 questions
■ Current experience with, and overall attitude towards, controlled vocabularies/ontologies,
■ Awareness of the issues around ambiguous information and post-publication professional data curation,
■ Biologists' preferred solutions, and the effort and desired rewards for adopting a new authoring workflow.
1 Frustration with ambiguity in phenotypic descriptions (73%)
2 The existence of controlled vocabularies is widespread, but their use is not common
3 Author curation would better reflect the original meaning of phenotype data (80%)
4 Adopt a new authoring workflow to directly produce FAIR data (85%)
5 Extra effort I am willing to put to use a new authoring workflow (5% => 93%)
6 Unwillingness to use CV (22%)
7 Positive attitude towards CV (62%)

Additional Findings


• 50% of the respondents believe authors lack the skills to self-curate their writings.
• 64% of respondents would like to manually annotate their publications using ontologies.
• 65% of respondents would like to use an automated vocabulary checker.
• Preferences for rewards: citations > auto format publications > auto data conversion > monetary rewards
Paper to be published in DATABASE (Oxford) [Cui et al., in revision]
Semantic Representation:
Numerical Characters
Perigynium beak 1.6 mm
Carex petasata
Author color labels
Semantic Representation: Colors in FNA

● 2883 records of sRGB values for major organs of Carex species.

Luminance = 0.299*R + 0.587*G + 0.114*B
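The luminance formula above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the sample color are assumptions, not taken from the slides or from the project's code.

```python
def luminance(r: float, g: float, b: float) -> float:
    """Weighted sum of the R, G, B channels, using the slide's coefficients."""
    return 0.299 * r + 0.587 * g + 0.114 * b


# Hypothetical sample: a brownish color in 8-bit sRGB channels.
sample_rgb = (139, 69, 19)
print(round(luminance(*sample_rgb), 2))
```

Green carries the largest weight, so two colors with the same red and blue values can still differ noticeably in luminance.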
L*a*b color space groups colors better than RGB
a* = red (+) opposes green (-)
b* = yellow (+) opposes blue (-)
L = lightness, independent of a* and b*
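As a rough sketch of how sRGB values map to the L, a*, b* channels described above, the code below applies the standard sRGB to CIE XYZ to CIE L*a*b conversion with a D65 white point. The slides do not say which conversion or white point was used, so treat both as assumptions. The function names are illustrative.

```python
def _linearize(channel_8bit: float) -> float:
    """Undo the sRGB gamma curve for one 8-bit channel (0-255)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _f(t: float) -> float:
    """Piecewise cube-root function used by the CIE L*a*b definition."""
    delta = 6.0 / 29.0
    return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0


def srgb_to_lab(r: float, g: float, b: float) -> tuple:
    """Convert 8-bit sRGB to (L, a*, b*), assuming a D65 white point."""
    rl, gl, bl = _linearize(r), _linearize(g), _linearize(b)
    # Linear RGB to CIE XYZ (sRGB primaries, D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    # Normalize by the D65 reference white.
    fx, fy, fz = _f(x / 0.95047), _f(y / 1.0), _f(z / 1.08883)
    lightness = 116.0 * fy - 16.0
    a_star = 500.0 * (fx - fy)   # positive = red, negative = green
    b_star = 200.0 * (fy - fz)   # positive = yellow, negative = blue
    return lightness, a_star, b_star


print(srgb_to_lab(255, 0, 0))  # pure red: a* is strongly positive
```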

[Figure: Support Vector Machine (SVM) predicted labels. Left panel: SVM color classification in RGB space (only R and G shown, as B has small variance). Right panel: SVM color classification in L*a*b space (only a* and b* shown, as L is lightness).]
Neither color space separates the human color labels well

[Figure: human color labels]


Use L to separate basic colors into light, medium, and dark groups
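One way to picture the lightness grouping is a simple binning of L. The cut points below (40 and 70 on a 0-100 scale) are hypothetical placeholders chosen for illustration. The slides do not give the thresholds actually used.

```python
# Hypothetical cut points on the 0-100 lightness scale (not from the slides).
DARK_BELOW = 40.0
LIGHT_FROM = 70.0


def lightness_group(lightness: float) -> str:
    """Assign a lightness value to the dark, medium, or light group."""
    if lightness < DARK_BELOW:
        return "dark"
    if lightness < LIGHT_FROM:
        return "medium"
    return "light"


def qualified_label(basic_color: str, lightness: float) -> str:
    """Prefix a basic color label with its lightness group, e.g. 'dark brown'."""
    return f"{lightness_group(lightness)} {basic_color}"


print(qualified_label("brown", 25.0))
```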
Software Implementation
● Research prototype supports ontology-aware taxon-by-character matrix editing
● Character Recorder: http://shark.sbs.arizona.edu/chrecorder/public/
Create a character
Define a character
Input values
The matrix is transformed into RDF TriG files nightly
Text description is generated from the matrix
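The two outputs mentioned above, a TriG serialization and a generated text description, can be illustrated with a toy taxon-by-character matrix. This is a sketch under stated assumptions: the `ex:` namespace, the property names, the graph naming, the character name, and the sentence template are all hypothetical and do not reflect Character Recorder's actual RDF model or text generator.

```python
# Toy taxon-by-character matrix; the character name is illustrative.
matrix = {
    "Carex petasata": {"perigynium beak length": (1.6, "mm")},
}

EX = "http://example.org/"  # placeholder namespace, not the project's IRIs


def slug(label: str) -> str:
    """Turn a label into a simple local name for a prefixed IRI."""
    return label.replace(" ", "_")


def to_trig(matrix: dict) -> str:
    """Serialize each taxon's row as one named graph in TriG syntax."""
    lines = [f"@prefix ex: <{EX}> .", ""]
    for taxon, characters in matrix.items():
        lines.append(f"ex:{slug(taxon)}_graph {{")
        for character, (value, unit) in characters.items():
            lines.append(
                f"    ex:{slug(taxon)} ex:{slug(character)} "
                f'[ ex:value {value} ; ex:unit "{unit}" ] .'
            )
        lines.append("}")
    return "\n".join(lines)


def to_text(matrix: dict) -> str:
    """Generate one human-readable sentence per taxon."""
    sentences = []
    for taxon, characters in matrix.items():
        parts = [f"{name} {value} {unit}"
                 for name, (value, unit) in characters.items()]
        sentences.append(f"{taxon}: {'; '.join(parts)}.")
    return "\n".join(sentences)


print(to_trig(matrix))
print(to_text(matrix))
```

Because both outputs derive from the same matrix, the RDF data and the human-readable description stay consistent with each other.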

Clicking a cell of a categorical character invokes the Input Template


Input template: reuse values
Input template: select/create values
Input template: ontology suggested values



User-entered value “rusty” is replaced with ferruginous
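The replacement shown above amounts to a lookup from a user-entered term to the ontology's preferred term. A minimal sketch follows. Only the pair "rusty" and "ferruginous" comes from the slide. The table structure and function name are assumptions, and the real tool presumably consults the ontology rather than a hard-coded dictionary.

```python
# Synonym table: user-entered term -> preferred ontology term.
# Only the "rusty" entry is taken from the slide.
PREFERRED_TERMS = {"rusty": "ferruginous"}


def normalize_value(user_value: str) -> tuple:
    """Return (term to store, whether the user's term was replaced)."""
    key = user_value.strip().lower()
    preferred = PREFERRED_TERMS.get(key, key)
    return preferred, preferred != key


print(normalize_value("rusty"))
print(normalize_value("ferruginous"))
```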
Orange dot indicates that some data on this tab have been deprecated in the ontology.

Click on the wrench to accept a replacement term for the deprecated character or to dispute the deprecation
Conflict Resolver
● A mobile application
● Terms added by authors in Character Recorder may have issues (“conflicts”)
○ A term has no superclass or multiple superclasses
○ A term has a poor definition
○ Multiple synonyms are added as independent terms
● Such conflicts are collected from the ontology periodically and presented in the mobile app.
● The majority decision of any three experts is taken as the solution
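The superclass check and the three-expert majority rule can be sketched as follows. The mini-ontology, the term names other than "ferruginous", and the function names are hypothetical. How the app handles a three-way split is not stated in the slides, so the sketch simply returns `None` in that case.

```python
from collections import Counter

# Hypothetical mini-ontology: term -> list of superclasses.
superclasses = {
    "ferruginous": ["brown"],
    "rusty": [],                         # conflict: no superclass
    "reddish-brown": ["brown", "red"],   # conflict: multiple superclasses
}


def find_superclass_conflicts(superclasses: dict) -> list:
    """Return terms that have zero or more than one superclass."""
    return sorted(term for term, parents in superclasses.items()
                  if len(parents) != 1)


def majority_decision(votes: list):
    """Return the option chosen by at least two of three experts, else None."""
    if len(votes) != 3:
        raise ValueError("exactly three expert votes are expected")
    choice, count = Counter(votes).most_common(1)[0]
    return choice if count >= 2 else None


print(find_superclass_conflicts(superclasses))
print(majority_decision(["brown", "brown", "red"]))
```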
Usability Studies
● Can undergraduate biology students define numerical characters?
○ Yes, the median time for them to create a new character is < 5 min (Cui, et al. 2020)
● Does the character sharing feature reduce character variation?
○ Yes, by 48% (Cui, et al. 2020)
● Which approach is the most effective for novice users to add terms to an ontology?
○ A simple form, a wizard, and WebProtégé have different strengths and weaknesses but can be used effectively for different purposes (Zhang, et al. 2021)
● Currently recruiting Carex or other plant taxonomists for a comprehensive test of Character Recorder and Conflict Resolver.
○ 3-day experiment.
○ $200 / day honorarium
○ Contact hongcui@email.arizona.edu

Егор Камелев, CC0 Егор Камелев, CC0


References
● Cui, H., Zhang, L., Ford, B., Cheng, H-L., Macklin, J., Reznicek, A., Starr, J. (2020) Measurement Recorder: developing a useful tool for making species descriptions that produces computable phenotypes. Database (Oxford). DOI: 10.1093/database/baaa079/5995854
● Cui, H. et al. (submitted 2021) A survey of biologists’ attitude towards using ontologies to make the phenotypic data computable at the time of publication. Database (Oxford): The Journal of Biological Databases and Curation.
● Zhang, L., Yang, X., Cota, Z., Cui, H., Ford, B., Cheng, H-L., Macklin, J., Reznicek, A., Starr, J. (2021) Which methods are the most effective to enable novice users to participate in FAIR ontology creation? A usability study. Database (Oxford): The Journal of Biological Databases and Curation. DOI: https://doi.org/10.1093/database/baab035



Acknowledgements
● US National Science Foundation
● University of Manitoba for the multiple usability studies with UM biology
students
● Miss. Jocelyn Pender of Agriculture and Agri-Food Canada for her
constructive suggestions on the user interface design of Character
Recorder

