BioScience 2015, Ellwood et al.
A goal of the biodiversity research community is to digitize the majority of the one billion specimens in US collections by 2020. Meeting
this ambitious goal requires increased collaboration and technological innovation and broader engagement beyond the walls of universities
and museums. Engaging the public in digitization promises to both serve the digitizing institutions and further the public understanding
of biodiversity science. We discuss three broad areas accessible to public participants that will accelerate research progress: label and ledger
transcription, georeferencing from locality descriptions, and specimen annotation from images. We illustrate each activity, compare useful
tools, present best practices and standards, and identify gaps in our knowledge and areas for improvement. The field of public participation in
digitization of biodiversity research specimens is in a growth phase with many emerging opportunities for scientists, educators, and the public,
as well as broader communication with complementary projects in other areas (e.g., the digital humanities).
Keywords: crowdsourcing, citizen science, digital humanities, digitization of biodiversity research collections, public participation in scientific research.
BioScience XX: 1–14. © The Author(s) 2015. Published by Oxford University Press on behalf of the American Institute of Biological Sciences. All rights
reserved. For Permissions, please e-mail: journals.permissions@oup.com.
doi:10.1093/biosci/biv005 Advance Access publication XX XXXX, XXXX
and OpenStreetMap, http://openstreetmap.org) has become increasingly important (Bonney et al. 2014, Kelty et al. 2014). Public participation is also known as citizen science (when scientists collaborate with the public) or crowdsourced science (in which contributions are made by a large, usually online and occasionally paid community of individuals; Wiggins and Crowston 2011). In the sciences, the need for a formalization of practice related to public participation and the establishment of supporting infrastructure has been met by several recent organizational developments. The Human Computation and Crowdsourcing meetings began as annual workshops sponsored by the Association for the Advancement of Artificial Intelligence in 2009 and became an annual conference in 2013. The biennial Citizen Cyberscience Summit in the United Kingdom began in 2010

public participation, we focus here on those activities that can be deployed online, where the number of potential participants is greater because it is less limited by monetary and physical constraints such as those related to onsite supervising personnel, workspace, and parking. Improvements and advancements made to online digitization tools for public participation might also lead to their widespread use onsite by paid staff.

iDigBio's Public Participation in Digitization of Biodiversity Specimens Workshop participants recognized 26 digitization activities in which the public could participate, some of which fit neatly into the last two task (i.e., activity) clusters of Nelson and colleagues (2012) described above and others that occur after the initial digitization of the specimen data and subsequent deployment of it online.
Table 1. Digitization activities identified by the participants of iDigBio's Public Participation in Digitization of Biodiversity Specimens Workshop, organized by the twelve crowdsourcing processes recognized by Dunn and Hedges (2013) for the humanities.

Transcribing • Into appropriate database fields.
Cataloging • Overlaps broadly with other processes (e.g., transcribing and georeferencing); identified by the production of structured, descriptive metadata.
Translating • Between a nonnative language and the native language (e.g., between Chinese and English in the United States).
Georeferencing • Assign latitude and longitude and measures of precision to collection localities not previously described in that way.
Recording and creating content • Provide location and other information on historical place names used in collection locality descriptions.
Mapping • Production of maps useful for identifying outliers that might be due to errors or something that is biologically interesting. • Production of maps useful for citizen science research.
(http://crowdflower.com). These benefits can be gained in formal classroom settings or in informal settings. The design and supplementary materials for online digitization activities in a classroom setting can emphasize foundational areas in the Next Generation Science Standards (National Research Council 2012), including scientific and engineering practices, crosscutting concepts, and disciplinary core ideas. ZooTeach (http://zooteach.org) is a repository for K–16 educational materials that use Zooniverse's citizen science tools (Masters 2013). Participants in informal and online learning experiences are diverse and include all ages, cultural and socioeconomic backgrounds, abilities, knowledge, and educational backgrounds. Their experiences are characterized as being self-motivated, guided by their own interests, voluntary, personal, embedded in a context, and open-ended (Falk and Dierking 2000, Falk et al. 2001, National Research Council 2009). These experiences provide crucial lifelong learning opportunities to increase science awareness, appreciation, interest, and understanding, with different types of digitization programs and activities being able to achieve a variety of learning outcomes.

Despite successful scientific advancements (e.g., Lintott et al. 2008), critics of these approaches cite data quality as a primary concern over the use of citizen science data (Penrose and Call 1995, Nerbonne and Vondracek 2003). In addition, citizen science is not well suited to all facets of scientific applications and workflows (Dickinson et al. 2010, Kremen et al. 2011). Description of data quality has been formalized in the areas of transcription (Hill et al. 2012) and georeferencing (e.g., the National Standard for Spatial Data Accuracy; http://fgdc.gov/standards/projects/
Table 2. Online tools for public participation in transcription of biodiversity specimen labels and field notebooks. Characteristics of each are described as applicable according to the given category. Values are valid as of February 2015, unless otherwise noted.

Transcription tool: Atlas of Living Australia's DigiVol
Taxonomic, geographic, and object type focus: Life; global, but especially Australia; specimens and field notebooks.
Training process: Onsite tutorials and forum.
Incentives: Recognition of every individual's contributions to each expedition, as well as those making the greatest contribution.
Contributors: 860
Transcriptions: 130,816
Interface: Zoom and pan in window or in separate window; all fields seen at once.
Validation: Each task has one transcription and one validation (proofread by an experienced transcriber).
is XML-TEI markup (http://tei-c.org/index.xml), which is important in the context of transcribing ledgers.

Gaps in our knowledge and areas for improvement. Despite recent recommendations from the Notes from Nature project (Hill et al. 2012) and limited research into motivations of citizen scientists (Rotman et al. 2014), we still lack a satisfactory understanding of several aspects of public participation in transcribing biodiversity specimen labels and ledgers. These include the most significant factors affecting efficiency, accuracy, initial motivation, and long-term engagement; the best algorithms to produce consensus transcription from multiple replicates; and the most effective data validation methods. Each of these also has clear relevance to the georeferencing and annotating activities. Improvements to transcription tools could enhance participant enjoyment and ease of use. For example, new functionality could give the contributor more control of their transcription experience, such as providing them with the ability to establish the criteria used to determine the specimens that they transcribe (e.g., on the basis of the collection supplying the specimen images or the occurrence of a word in the OCR text strings generated from images) or the ability to toggle between interfaces that show a single field at a time and multiple fields at a time. Furthermore, records could be sorted for transcription based on similarity (e.g., overall
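One simple family of algorithms for producing a consensus transcription from multiple replicates is a per-field majority vote after light normalization. The sketch below is illustrative only; the function and field names are our own and do not come from any of the transcription tools discussed.

```python
from collections import Counter

def consensus_transcription(replicates):
    """Derive a consensus record from replicate volunteer transcriptions.

    `replicates` is a list of dicts mapping field names to transcribed
    values; the consensus value for each field is the most frequent
    entry after trivial normalization (whitespace and case).
    """
    fields = {f for r in replicates for f in r}
    consensus = {}
    for field in sorted(fields):
        values = [r[field].strip() for r in replicates if r.get(field)]
        if not values:
            continue
        counts = Counter(v.lower() for v in values)
        best, _ = counts.most_common(1)[0]
        # Report the first original-case spelling of the winning value.
        consensus[field] = next(v for v in values if v.lower() == best)
    return consensus

replicates = [
    {"collector": "C. Hartman", "state": "Georgia"},
    {"collector": "C. Hartman", "state": "georgia"},
    {"collector": "G. Hartman", "state": "Georgia"},
]
print(consensus_transcription(replicates))
# {'collector': 'C. Hartman', 'state': 'Georgia'}
```

Real projects would add tie-breaking rules, per-field similarity metrics for free text, and escalation of low-agreement records to expert validators.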
• The image display should produce a clear view of all relevant text at an appropriate zoom level at once or via panning.
• Data entry fields should be accessible whilst viewing the image.
• Drop-down lists should be provided when the universe of acceptable responses can be populated from controlled vocabularies and is relatively small (e.g., the 50 US states); autocomplete functionality in free text fields should be provided when the number of acceptable responses is larger and cannot be fully populated from the beginning of the project (e.g., collector names).
• Dependencies in the acceptable values for fields should be built in (e.g., only those counties from the state of Georgia are available in a dropdown once the state is established as Georgia).
• Readily accessible examples and directions for each field should be available during the activity.
• Forums should be provided to enable volunteers to ask questions of the project manager and each other about specific specimens or ledgers or the general process of transcription.
• A task completion count should provide the public participant with both progress towards the project's digitization goal and the
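The controlled-vocabulary and field-dependency recommendations above can be sketched as follows. The state and county lists are illustrative placeholders, not an actual authority file; real interfaces would load these vocabularies from authoritative sources.

```python
# Hypothetical controlled vocabularies for a state -> county dependency.
STATES = {"Georgia", "Florida", "Alabama"}
COUNTIES = {
    "Georgia": {"Clarke", "Fulton", "Liberty"},
    "Florida": {"Alachua", "Leon", "Liberty"},
}

def county_choices(state):
    """Return the dropdown options valid once a state has been chosen."""
    if state not in STATES:
        raise ValueError(f"unrecognized state: {state!r}")
    return sorted(COUNTIES.get(state, set()))

print(county_choices("Georgia"))  # ['Clarke', 'Fulton', 'Liberty']
```

Constraining the county list only after the state is established both prevents invalid combinations and shortens the list a volunteer must scan.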
similarity of OCR text strings). Improvements could also address data quality issues by providing the ability for participants to return to earlier transcription records to correct what they later learn are transcription errors. The biodiversity research collections community would also benefit from greater sharing of best practices and tools with the digital humanities community, in which projects, such as the University College London's Transcribing Bentham Project (http://blogs.ucl.ac.uk/transcribe-bentham), the University of Iowa's Civil War Diaries and Letters Transcription Project (http://digital.lib.uiowa.edu/cwd), and the Medici Archives Project (http://medici.org), and standalone tools such as Ben Brumfield's FromThePage (http://beta.fromthepage.com) for transcription and Juxta (http://juxtasoftware.org) for the comparison of multiple transcriptions of a single text, represent significant overlap in objectives between the two communities.

Online activity 2: Georeferencing
Georeferencing, as applied to biodiversity research collections, is the inference of a geospatial geometry from the textual collection locality description on a label or in a ledger (figure 2; Guralnick et al. 2006).

Overview. The geospatial geometry is often expressed as a single point representing latitude and longitude, usually with an associated radius allowing representation of uncertainty (Wieczorek et al. 2004). However, localities could also be represented as multipoints, lines, multilines, polygons, and multipolygons to better reflect either the collection method or imprecision associated with the interpretation of a textual collection locality description. For example, sampling transects may be recorded as a line with start and stop coordinates, as is common in samples from trawlers. The expression of uncertainty is crucial to determining a data record's fitness for use (Wieczorek et al. 2004). For example, point data with an uncertainty of 10 km may be unsuitable for an analysis across 1-km-resolution environmental gradients. Georeferences as latitude and longitude coordinates and the datum on which the coordinates are based are typically lacking from terrestrial and inland aquatic specimens collected before the 1990s (Beaman and Conn 2003; marine specimens might differ). Where those are available, they can provide useful validation for textual descriptions or vice versa, because such latitude and longitude readings also have associated, and often unreported, uncertainties.

Public participants can be expected to be most efficient and accurate at georeferencing when they can read the language in which the label was written, can read relevant map types (e.g., topographic or nautical), and have some familiarity with the area in which the specimen was collected (i.e., experience on the ground or with locally used names). Useful emphases in training for the task can be placed on basic geographical skills such as identifying the locality information and interpreting locality types, interpreting geographic jargon, compass bearings, abbreviations, and formats, and understanding the common types of geographic projections (e.g., equal area), coordinate systems (e.g., Universal Transverse Mercator) and geodetic systems (e.g., World Geodetic System 1984). Training will also improve a participant's ability to interpret locality descriptions and uncertainties. For these skills, training emphases can be placed on finding and using relevant maps and indices of place names, and precisely describing the georeferencing
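The fitness-for-use reasoning above (a 10-km uncertainty is unsuitable against a 1-km environmental grid) can be made concrete with a crude screen. The `max_cells` threshold below is an assumption for illustration, not a community standard.

```python
def fit_for_use(uncertainty_m, grid_resolution_m, max_cells=1.0):
    """Crude fitness-for-use screen for a point-radius georeference.

    A record is flagged unsuitable when its uncertainty radius spans
    more than `max_cells` cells of the analysis grid.
    """
    return uncertainty_m <= max_cells * grid_resolution_m

# A 10-km uncertainty fails against a 1-km environmental grid...
print(fit_for_use(10_000, 1_000))   # False
# ...but passes against a 50-km climate surface.
print(fit_for_use(10_000, 50_000))  # True
```

The point is that the same record can be fit or unfit depending on the analysis, which is why uncertainty must travel with the coordinates.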
geographic origin of the specimen (e.g., Africa), rather than the collection that curates the specimen.

Best practices and standards. Best practice documents specific to georeferencing specimens include Guide to Best Practices for Georeferencing (Chapman et al. 2006), Principles and Methods of Data Cleaning—Primary Species and Species-Occurrence Data (Chapman 2005), and Guide to Best Practices for Generalising Sensitive Species Occurrence Data (Chapman and Grafton 2008). However, the geospatial community has produced many other best practice documents, including those related to standards (e.g., as at the Open Geospatial Consortium; http://opengeospatial.org/standards/bp) and commercial or open-source geographic information systems (e.g., as found at ESRI; http://esri.com). A useful

produce a useful consensus georeference. Still lacking are the ability to match georeferencing competencies with collection localities and sufficient strategies for assessing a user's specific georeferencing competencies initially and through time. A better understanding of how to enable collaboration and communication (e.g., by visualizing on a map the collection localities being discussed in a forum) is also needed.

Digital imaging and linking of field notes to specimens would likely provide a big benefit to georeferencing, because field notes can contain a wealth of information about collecting sites, including travel itineraries, site sketches, environmental information, and other remarks not often found on specimen labels. iDigBio's 2014 Digitizing from Source Materials Workshop (http://idigbio.org/wiki/index.php/Digitizing-From-Source-Materials) laid the groundwork
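One simple way to produce a consensus georeference from replicate submissions is a coordinate mean with an enclosing uncertainty radius. This is a sketch under stated simplifying assumptions (no outlier rejection, a spherical Earth, small separations), not an endorsed community workflow.

```python
import math

def consensus_georeference(points):
    """Combine replicate (lat, lon, uncertainty_m) georeferences.

    The consensus point is the coordinate mean; the consensus
    uncertainty is the largest distance from that mean to any
    submission plus that submission's own uncertainty, so the
    consensus circle covers every replicate.
    """
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)

    def dist_m(a, b):
        # Equirectangular approximation, adequate at small separations.
        x = math.radians(b[1] - a[1]) * math.cos(math.radians((a[0] + b[0]) / 2))
        y = math.radians(b[0] - a[0])
        return math.hypot(x, y) * 6_371_000

    radius = max(dist_m((lat, lon), (p[0], p[1])) + p[2] for p in points)
    return lat, lon, radius
```

A production workflow would first reject outlying submissions, weight contributors by demonstrated competency, and record the datum explicitly.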
Arkansas), the use of authoritative resources (e.g., taxonomic keys and illustrated glossaries), and the use of relevant terms (e.g., leaves and glaucous). Useful emphases in taxa-specific training can be placed on recognizing relevant features of the focal taxonomic group, correct usage of relevant terms, use of specific resources (e.g., a key to the millipedes of Arkansas) and the protocol for describing relevant resources and methods used for reaching the conclusion of an annotation. Process- and image-specific training can include identifying typical changes that can occur in the phenotype after preservation as a specimen (e.g., common color changes or pest damage patterns) and typical distortions introduced by an imaging technique (e.g., deviations from a rectilinear projection or chromatic aberrations).

Relatively many online applications enable public partici-

anticipated (e.g., many beetles are only identifiable by the number of segments on the tarsus, and without that part in the image, an annotation of taxonomic identity is difficult). Also, users should have easy access to tools for zooming and panning and designating an area of interest in the image to associate with the annotation. Finally, constraint of annotation terms to those in controlled vocabularies (e.g., from ontologies or taxonomic authority files) can enable semantic processing and reduces spelling errors. Recommendations made above in reference to transcription and georeferencing best practices are also relevant here, especially provision of a forum for the users to discuss annotations with each other and project scientists, leading to greater user proficiency and understanding.

Standards relevant to annotation specifically include the
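Constraining annotation terms to a controlled vocabulary, as recommended above, can be sketched as a normalization step that also suggests a correction for near misses. The term list here is a hypothetical stand-in for an ontology or taxonomic authority file.

```python
import difflib

# Hypothetical botanical term list; projects would draw this from an
# ontology or authority file rather than define it inline.
VOCABULARY = {"glaucous", "pubescent", "serrate", "glabrous"}

def normalize_term(term):
    """Accept a term only if it is in the controlled vocabulary,
    suggesting the closest valid spelling for near misses.

    Returns (accepted_term, suggestion); exactly one is non-None
    unless no plausible suggestion exists.
    """
    t = term.strip().lower()
    if t in VOCABULARY:
        return t, None
    close = difflib.get_close_matches(t, sorted(VOCABULARY), n=1)
    return None, (close[0] if close else None)

print(normalize_term("Glaucous"))  # ('glaucous', None)
print(normalize_term("glaucuos"))  # (None, 'glaucous')
```

Because every accepted term maps to a vocabulary entry, downstream semantic processing can rely on exact string matches instead of fuzzy reconciliation.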
social challenges, such as predicting biotic responses to climate change and invasive species. Here, we reviewed the state of public participation in three major areas of digitization: transcription, georeferencing, and annotation. Each of these activities contributes crucial data to research and offers educational opportunities, but public participation in transcription is the most advanced of the three. This is perhaps due to efficiencies that can be introduced into the latter two activities once the specimen's identity and collection locality description have been digitized.

Across the three major digitization tasks, several common needs for improvement can be noted. We recognize seven high-priority steps for the community to take in this area: (1) All of the public participation tools for biodiversity specimen digitization that we have discussed are relatively

Finally, the development of a public digitization project relies on somewhat ad hoc negotiations between a collection curator and the managers of relevant public participation tools, who each require different information in different formats. This can slow progress and is an area where standardization has the potential to make the creation and management of public digitization projects accessible not just to all collections curators but also to members of the public (e.g., a local chapter of a native plant society). Empowering the latter group has the potential to engage far more participants by better aligning the digitization projects that are available with the motivations of the public, making the projects collaborative or cocreated, rather than simply contributory (sensu Shirk et al. 2012). By contrast, opportunities for public engagement today are largely contingent on
Bird TJ, et al. 2014. Statistical solutions for error and bias in global citizen science datasets. Biological Conservation 173: 144–154.
Bonney R, Cooper CB, Dickinson J, Kelling S, Phillips T, Rosenberg KV, Shirk J. 2009. Citizen science: A developing tool for expanding science knowledge and scientific literacy. BioScience 59: 977–984.
Bonney R, Shirk JL, Phillips TB, Wiggins A, Ballard HL, Miller-Rushing AJ, Parrish JK. 2014. Next steps for citizen science. Science 343: 1436–1437.
Bonter DN, Cooper CB. 2012. Data validation in citizen science: A case study from Project FeederWatch. Frontiers in Ecology and the Environment 10: 305–307.
Brumfield B. 2012. Quality control for crowdsourced transcription. In Brumfield B, ed. Collaborative Manuscript Transcription. BlogSpot. (17 January 2015; http://manuscripttranscription.blogspot.com/2012/03/quality-control-for-crowdsourced.html)
Chapman AD. 2005. Principles and Methods of Data Cleaning: Primary Species and Species-Occurrence Data. Global Biodiversity Information Facility.
Hirschman L, et al. 2008. Habitat-Lite: A GSC case study based on free text terms for environmental metadata. OMICS: A Journal of Integrative Biology 12: 129–136.
Jenkins M. 2003. Prospects for biodiversity. Science 302: 1175–1177.
Jinbo U, Kato T, Ito M. 2011. Current progress in DNA barcoding and future implications for entomology. Entomological Science 14: 107–124.
Jordan RC, Gray SA, Howe DV, Brooks WR, Ehrenfeld JG. 2011. Knowledge gain and behavioral change in citizen-science programs. Conservation Biology 25: 1148–1154.
Kelty C, Panofsky A, Currie M, Crooks R, Erickson S, Garcia P, Wartenbe M, Wood S. 2014. Seven dimensions of contemporary participation disentangled. Journal of the Association for Information Science and Technology. Forthcoming. doi:10.1002/asi.23202
Kremen C, Ullman KS, Thorp RW. 2011. Evaluating the quality of citizen-scientist data on pollinator communities. Conservation Biology 25: 607–617.
Kumar N, Belhumeur PN, Biswas A, Jacobs DW, Kress WJ, Lopez IC,
Russell KN, Do MT, Huff JC, Platnick NI. 2007. Introducing SPIDA-Web: Wavelets, neural networks and internet accessibility in an image-based automated identification system. Pages 131–152 in MacLeod N, ed. Automated Taxon Identification in Systematics: Theory, Approaches and Applications. CRC Press Taylor & Francis Group.
Sheshadri A, Lease M. 2013. SQUARE: A benchmark for research on computing crowd consensus. Pages 156–164 in Proceedings of the First AAAI Conference on Human Computation and Crowdsourcing. Association for the Advancement of Artificial Intelligence. (17 January 2015; http://ir.ischool.utexas.edu/square/documents/sheshadri.pdf)
Shirk JL, et al. 2012. Public participation in scientific research: A framework for deliberate design. Ecology and Society 17 (art. 29).
Tschöpe O, Macklin JA, Morris RA, Suhrbier L, Berendsohn WG. 2013. Annotating biodiversity data via the Internet. Taxon 62: 1248–1258.
Wake DB, Vredenburg VT. 2008. Are we in the midst of the sixth mass extinction? A view from the world of amphibians. Proceedings of the National Academy of Sciences 105: 11466–11473.

scientist for the Center for Science Learning at the Florida Museum of Natural History, in Gainesville. She has formed innovative partnerships and developed numerous programs designed to promote science interest, understanding, and engagement. Paul Flemons is head of the Science Services and Infrastructure Branch and is manager of collection informatics at the Australian Museum, in Sydney. His focus is on research and development of innovative solutions to biodiversity informatics challenges, particularly Web-based applications for accessing and analysing biodiversity collection data. Robert Guralnick is an Associate Curator of Biodiversity Informatics at University of Florida. His research bridges from biodiversity informatics, especially the digitization and mobilization of biodiversity data, to scientific questions related to assessing drivers of broad-scale biospheric change. Gil Nelson is an assistant professor for research in the Institute for Digital Information and Scientific Communication at Florida State University, in Tallahassee, where he specializes in digitization research and practice for iDigBio. Greg Newman is a research scientist at the Natural Resource Ecology Laboratory at Colorado State University, in Fort Collins, whose research focuses on citizen science, ecological informatics