Author-Driven Computable (Phenotype) Data and Ontology Production For Taxonomists
This set of slides is licensed under the CC0 1.0 Universal License
Talk outline
● Currently morphological/phenotype publications are not FAIR [Findable, Accessible, Interoperable, and Reusable]
○ To computationally use published information
■ Professional curation: burdened by the high cost and high inter-curator variation
■ Automated information extraction/natural language processing techniques: extracted information remains ambiguous and contains extraction errors
■ Author curation: post-publication, low response rate, ontology issues
● Authors in the driver's seat: fast, consistent, computable phenotype data and ontology production
○ Themes
■ Authors produce FAIR data at the time of publication: RDF data + human-readable morphological descriptions
■ Directly contribute to relevant ontologies
■ Term standards emerge from community practices.
○ Outcomes
■ A 2019 survey of biologists' attitudes towards ontologies, curation, and potential solutions
■ Semantic representation of morphological characters
● Numerical Characters
● Colors
■ Software prototype implementation
● Character Recorder (Web application)
● Conflict Resolver (mobile application)
The Attitude Survey
● Attitudes of biologists towards adopting ontologies
○ November 2018 to February 2019
○ 130 responses received
○ 91 valid responses analyzed; all respondents publish or consume phenotype data
■ https://uarizona.co1.qualtrics.com/jfe/form/SV_6VRPiQFGFNzYCwd
○ 28 questions
■ Current experience with, and overall attitude towards, controlled vocabularies/ontologies
■ Awareness of the issues around ambiguous information and post-publication professional data curation
■ Biologists' preferred solutions, and the effort and desired rewards for adopting a new authoring workflow
1 Frustration with ambiguity in phenotypic descriptions (73%)
2 The existence of controlled vocabularies is widespread, but their use is not common
3 Author curation would better reflect the original meaning of phenotype data (80%)
4 Adopt a new authoring workflow to directly produce FAIR data (85%)
5 Extra effort I am willing to put to use a new authoring workflow (5% => 93%)
6 Unwillingness to use CV (22%)
● 2883 records of sRGB values for major organs of Carex species.
Luminance = 0.299*R + 0.587*G + 0.114*B
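The luminance formula above can be sketched in a few lines of Python. This is only an illustration: the function name `luminance` is a hypothetical choice, and the 0-255 channel range is an assumption, since the slides do not state how the sRGB values are scaled.

```python
def luminance(r, g, b):
    """Weighted sum of the R, G, B channels using the weights from the slide.

    Assumes all three channels share the same scale (for example 0-255);
    the result is on that same scale.
    """
    return 0.299 * r + 0.587 * g + 0.114 * b


if __name__ == "__main__":
    # Made-up sample colors, not records from the Carex data set.
    for name, rgb in [("pure green", (0, 255, 0)), ("pure blue", (0, 0, 255))]:
        print(name, round(luminance(*rgb), 2))
```

Because the green weight is the largest and the blue weight the smallest, a pure green reads as much lighter than a pure blue of the same channel intensity.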
L*a*b color space groups colors better than RGB
● L = lightness, independent of a* and b*
● a* = red (+) opposes green (-)
● b* = yellow (+) opposes blue (-)
[Figure: colors plotted in the a*-b* plane]
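The L, a*, and b* axes described above can be computed from sRGB values with the standard sRGB to CIE L*a*b* conversion. The following is a minimal sketch, assuming 8-bit sRGB input and a D65 reference white; the slides do not say which white point or library was used, and the function name `srgb_to_lab` is hypothetical.

```python
def srgb_to_lab(r, g, b):
    """Convert 8-bit sRGB to CIE L*a*b* (assumes a D65 reference white)."""

    def linearize(c):
        # Undo the sRGB transfer curve to get linear-light values in [0, 1].
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)

    # Linear sRGB -> CIE XYZ (D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white used to normalize XYZ.
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        # Cube root with a linear segment near zero, per the CIE definition.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    lightness = 116 * fy - 16   # L: 0 (black) to 100 (white)
    a_star = 500 * (fx - fy)    # a*: red (+) vs. green (-)
    b_star = 200 * (fy - fz)    # b*: yellow (+) vs. blue (-)
    return lightness, a_star, b_star


if __name__ == "__main__":
    # Made-up sample colors, not records from the Carex data set.
    for name, rgb in [("red", (255, 0, 0)), ("green", (0, 255, 0)),
                      ("yellow", (255, 255, 0)), ("blue", (0, 0, 255))]:
        print(name, [round(v, 1) for v in srgb_to_lab(*rgb)])
```

The signs of the second and third components follow the opponent axes listed above: red gives a positive a*, green a negative a*, yellow a positive b*, and blue a negative b*.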
[Figure: SVM color classification in RGB space (only R and G shown, as B has small variance)]
[Figure: SVM color classification in L*a*b space (only a* and b* shown, as L is lightness)]
Neither color space separates the human labels well
Text description is generated from the matrix
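As an illustration of generating a text description from a character matrix, here is a minimal sketch. It is not the Character Recorder implementation: the row layout (a dict of character name to value), the `(low, high, unit)` tuple for numerical characters, the function name `describe`, and the sample taxon and values are all hypothetical.

```python
def describe(taxon, characters):
    """Build a one-sentence description from one row of a taxon-by-character matrix.

    `characters` maps a character name to either a categorical value (str)
    or a numeric range given as (low, high, unit). Unscored cells
    (None or "") are skipped.
    """
    clauses = []
    for name, value in characters.items():
        if value is None or value == "":
            continue  # unscored character: leave it out of the text
        if isinstance(value, tuple):
            low, high, unit = value
            text = f"{low}-{high} {unit}"
        else:
            text = str(value)
        clauses.append(f"{name} {text}")
    if not clauses:
        return f"{taxon}: no characters scored."
    return f"{taxon}: " + "; ".join(clauses) + "."


if __name__ == "__main__":
    # Made-up row for illustration only; not data from the slides.
    row = {
        "leaf blade width": (2, 4, "mm"),
        "perigynium color": "green",
        "stem length": None,
    }
    print(describe("Carex sp. A", row))
    # prints "Carex sp. A: leaf blade width 2-4 mm; perigynium color green."
```

Keeping the matrix as the single source and deriving the prose from it means the human-readable description and the computable data cannot drift apart.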