
Ref. Ares(2016)1344931 - 17/03/2016

Collection Management Systems: A Research Perspective


Fredrik Ronquist
2016-03-14

Introduction
As a systematist, you are likely to examine, sequence or analyze a number of specimens
in the course of your studies. Those specimens can be on loan from a natural history
collection, or they can be in a collection you build yourself. In either case, you will have
to manage information about the specimens and link that information to the results of
your studies, which may be presented in papers or in various online databases. What is
the best way of approaching these information management tasks?
Most systematists today keep some kind of personal collection database. It is often built
from scratch using commercial database software like FileMaker Pro or Microsoft
Access. There are also dedicated software packages like Specify 1 and Biota 2, which allow
you to manage collections data, and VoSeq 3, which is focused on handling DNA sequence
and voucher information.
Unfortunately, keeping a personal collection database has the effect of building
information silos that are not easily connected. For instance, if you extract and sequence
DNA from some small part of a specimen from a natural history museum, you are likely
to create your own identifier for the voucher specimen, and this is the number that is
likely to end up with the sequence submission record. However, once you have returned
the specimen to the natural history museum where it belongs, it is unlikely that it will be
possible to link any additional information that might become available about the
specimen in the future through your identifier. Most natural history museums simply do
not have systems in place for routinely assigning globally unique identifiers (GUIDs) to
specimens, which make it possible to find information about them online. Many
entomology collections do not even keep specimen-level databases today, except
possibly for type specimens.

Linking data through identifiers


There is no easy solution to these problems just yet. There have been several initiatives
to create systems of unique identifiers or acronyms for the public collections around the
world. Index Herbariorum, covering the world’s herbaria, is probably the best
known. There is also a list of insect and spider collections of the world, each one
associated with a unique acronym, typically four letters long. These acronyms are
commonly used in systematic entomology papers; you might already have seen them in
a listing of repositories of studied material. The list has a long history; the current
version is maintained online 4 by the Bishop Museum in Hawaii.

1 http://specifysoftware.org/
2 http://viceroy.eeb.uconn.edu/Biota/
3 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0039071
4 http://hbs.bishopmuseum.org/codens/codens-inst.html

In recent years, there have been attempts to merge lists like these into a global list. The
current initiative seems to be with the Consortium for the Barcode of Life, which is
maintaining GRBio 5, the global registry of biodiversity repositories. Unfortunately, the
list is partly outdated and contains a number of obvious problems. It does not appear
to be actively maintained.
Previously, there was a movement to assign life science identifiers (LSIDs) as GUIDs
for biodiversity information records, such as specimen records. However, many experts
are now abandoning LSIDs in favor of plain old uniform resource identifiers (URIs), that
is, internet addresses. Some institutions can already provide permanent
URIs for their collection objects, often including the institutional address, a globally
recognized acronym for the collection, and a unique catalog number for the object.
However, these are still early days in the adoption of such URI schemes, and many
institutions are still working on this or have not even started.
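As an illustration only, a URI of the kind described above could be assembled from its
three parts as in the following Python sketch. The base address, the acronym XMPL, the
catalog number and the path layout are all invented for the example; each institution
decides its own scheme, so always use the URI the institution gives you rather than
constructing one yourself.

```python
from urllib.parse import quote


def specimen_uri(base_url: str, collection_acronym: str, catalog_number: str) -> str:
    """Compose a specimen URI from an institutional address, a collection
    acronym and a catalog number (hypothetical layout: base/acronym/number)."""
    base = base_url.rstrip("/")
    # Percent-encode the variable parts so that spaces or slashes in a
    # catalog number cannot break the path structure of the URI.
    acronym = quote(collection_acronym.strip(), safe="")
    number = quote(catalog_number.strip(), safe="")
    return f"{base}/{acronym}/{number}"


# Hypothetical institution and specimen, for illustration only.
print(specimen_uri("https://collections.example.org/object/", "XMPL", "E 123456"))
```

The point of the sketch is simply that such a URI is predictable and stable as long as
the institution keeps its address, acronym and catalog numbers unchanged.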
In conclusion, there is currently no way for you to assemble information about
specimens in natural history collections and be sure you can link this information online
through a GUID to other data associated with that specimen. The best you can do is to
ask each natural history museum you borrow specimens from for a GUID,
preferably a URI, for each specimen they send you on loan or allow you to study. Then
use that GUID in your own database, and cite it in any online references you make to that
specimen in sequence databases or other online repositories to which you upload
results from your studies.
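One way of following this advice in a personal database is to store the museum's GUID
next to your own voucher code and to prefer the GUID whenever you cite the specimen.
The Python sketch below shows the idea. The class names, field names and example
values are hypothetical and do not come from Specify, DINA or any other system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VoucherRecord:
    local_id: str                    # identifier you assigned yourself
    institution_guid: Optional[str]  # GUID (preferably a URI) from the museum
    accessions: List[str] = field(default_factory=list)  # e.g. sequence records


class VoucherIndex:
    """Minimal in-memory stand-in for a personal collection database."""

    def __init__(self) -> None:
        self._by_local: Dict[str, VoucherRecord] = {}

    def add(self, record: VoucherRecord) -> None:
        self._by_local[record.local_id] = record

    def citable_id(self, local_id: str) -> str:
        # Prefer the museum's GUID; fall back to the local code only if
        # no GUID has been obtained yet.
        record = self._by_local[local_id]
        return record.institution_guid or record.local_id

    def missing_guids(self) -> List[str]:
        # Specimens for which you still need to ask the museum for a GUID.
        return sorted(
            r.local_id for r in self._by_local.values() if r.institution_guid is None
        )


# Hypothetical example data.
index = VoucherIndex()
index.add(VoucherRecord("FR-0001", "https://collections.example.org/object/XMPL/123"))
index.add(VoucherRecord("FR-0002", None))
print(index.citable_id("FR-0001"))
print(index.missing_guids())
```

Keeping both identifiers in the same record means that a sequence submitted under your
own code can still be traced back to the museum specimen later.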

Building your own collection


How do you manage the information about the specimens you have collected yourself?
Again, there is no simple answer, but one suggestion is to choose a natural history
institution as the home of your collection. Make sure that the institution will grant you a
long-term loan of your own collection if you move to another institution; you should not
take this for granted. Most aspiring systematists will end up moving among institutions
several times in their career, and usually need to be able to take their collection with
them on a long-term loan (or possibly shift the home of the specimens based on a
transfer agreement).
Once you have chosen a home collection, use their system for assigning GUIDs to your
specimens. This will minimally require some interaction with whatever system the
institution is using to maintain catalog numbers. A more radical and future-oriented
approach is to choose to maintain all of your collection-related information in the
institutional system. However, this is only a reasonable choice if they provide a modern
web-based system that is adequate for your needs. Most institutional collection systems
in use today do not meet the requirements for research use, but this may change in the
next few years thanks to projects like DINA (see below).
The institutional perspective is somewhat different. Typically, collection catalogs were
begun long ago by curators who assembled information about the
objects in the collection on index cards. When computers became more commonly
available, the information was transferred to databases purpose-built by each curator
for the part of the collections they were responsible for. The end result was a large
number of heterogeneous systems, built using different software packages, often
maintained by single individuals, and using different data formats. This is still the
situation in many institutions housing natural history collections today.

5
In recent years, this situation has started to change as natural history museums and
similar institutions have become increasingly aware of the value of digital assets and the
need for professional information management. Recent trends like the push for open
science, shared public data, the semantic web, and linked open data have accelerated
this development. This has led to a movement towards a coherent information
management strategy and a central institutional collection management system in most
places.
In the choice of a central system, an organization can opt to: (1) acquire a commercial
system (EMu being the major system used currently by large natural history museums);
(2) develop a system in-house; or (3) join other institutions in distributed open-source
development. There are many reasons suggesting that the third choice is going to be the
most flexible and cost-efficient solution in the long term (see separate PowerPoint
presentation).
The DINA 6 consortium is currently the largest initiative for producing a web-based
collection management system through distributed open-source development. The
consortium currently includes six organizations in six different countries, four of which
contribute actively to the development. Two of the BIG 4 institutions are among the core
members of the DINA consortium: the Swedish Museum of Natural History and the
Natural History Museum of Denmark.
DINA is based on the Specify data model. A hybrid DINA-Specify system, relying on the
Specify 6 Java client for core collection management tasks, is available from the DINA
team at the Swedish Museum of Natural History (contact Markus Skyttner,
markus.skyttnar@nrm.se). The hybrid system has been in production at the Swedish
Museum of Natural History since 2011, when the first components were installed.
A fully web-based DINA version is not expected to be available until 2018 according to
the current DINA roadmap. Functionality specifically tailored for researchers is not
currently on the roadmap, but the consortium already provides API specifications
that you can use to develop your own research client for the DINA-Web database
backend. If you are interested in exploring this, contact Markus Skyttner (e-mail above)
for more information on how to install and run a DINA backend that you can use to
communicate with the front-end client you develop. You can run the entire DINA system
on your laptop, and, if you like, you can share your front-end client with all other
institutional and individual DINA users through the DINA consortium and its GitHub
repository.
If you like the DINA approach and your institution is not a member of the DINA
consortium, you can ask the decision makers at your institution to consider the
possibility of joining the DINA initiative. An easy way of preparing yourself for a future
transition to the DINA system is to use Specify or a Specify-compatible data model for
your own private collection database.
Separately, you will find a PowerPoint presentation that gives you an introduction to the
DINA project, with pointers to web sites where you can find more information.

6 http://dina-project.net
Introduction to the DINA project

Fredrik Ronquist
Dept. Bioinformatics and Genetics
Swedish Museum of Natural History
Collection Management Systems
Institutional Choices:
1. Develop your own system in-house
2. Acquire a commercial system (e.g., EMu)
3. Partner with other institutions in distributed open-source development (e.g., DINA project)
The Case For Open Source
• Market considerations. Professional collection management systems are not
  viable commercial products in a pluralistic market.
• Long-term stability. An open-source software solution developed by
  institutions with long-term focus will be more stable than a commercial
  solution.
• Flexibility. A distributed open-source system must by necessity conform to a
  modular design based on open APIs. This favors flexibility and adaptability
  in a way that a commercial product will not.
• Cost effectiveness. Although some overhead is associated with distributed
  development, more development teams involved in the effort will result in a
  lower cost to the individual institution compared to in-house or commercial
  solutions.
The Case For Open Source (cont’d)
• Opt-in opt-out scheme. Institutions can participate in the development
  when they have resources to do so, and can opt out when they do not. At any
  single point in time, it should be feasible to have enough institutions involved
  for development to move forward at an acceptable pace.
• Community Control. A distributed open-source solution means that the
  community retains control over both the information standards and the
  system architecture and web service/API designs.
• Egalitarian. A professional open-source collection management system
  offers a better way for developing countries to catch up than any commercial
  product.
• Stable marketplace for extensions and services. A community-supported
  de-facto standard for collection management systems architecture will ensure
  that there is a stable market for various plugins, extensions and services based
  on the system.
EMu: The major commercial collection management system used in natural
history museums
Axiell group, owned by Swedish venture capitalists, recently acquired the company
behind EMu. Lack of competition = profit.
The Natural History Museum in the UK is one of several major natural history museums
currently running EMu. They have given Axiell 12 months to solve a number of
serious issues with the system; in parallel, open-source options are being reviewed.

Koha: originated in New Zealand, now 15% of the market for library management systems.
Atlas of Living Australia: originated in Australia (284 M SEK initial investment);
the world’s most complete system for integration, analysis and visualization
of biodiversity data; now 70+ developers around the world; running or being installed
in many countries in Europe and South America in addition to Australia.
DINA Consortium
(Digital Information system for NAtural history data)
• Core mission. Pool resources to develop an open-source web-based
  collection management system for natural history collections.
• Core Member. Required contribution 1.0 FTE to the project, of which at
  least 0.5 to the development effort. Voting member of the DINA Technical
  Committee (TC), which controls deliverables and deadlines for the 1.0 FTE
  contribution.
• Associate Member. No contribution requirements. Non-voting member of
  the Steering Group.
[Organization chart: Steering Group (all members); Technical Committee (core members); Task Force I and Task Force II; Development Teams I, II and III]
Current DINA Consortium
• Core Members
  • Agriculture and Agri-Food Canada, Ottawa
  • Estonia (University of Tartu)
  • Denmark (University of Copenhagen)
  • Sweden (Swedish Museum of Natural History)
• Associate Members
  • Museum für Naturkunde, Berlin
  • Royal Botanic Garden, Edinburgh
• Open to Additional Members
  • Memorandum of Cooperation and more information at http://dina-project.net
Challenges and Lessons Learned
• Commitment. Formalization of the collaboration and a good governance
  model is essential to ensure work towards common goals.
• Patience. It may take an institution with long-term perspective several years
  from a decision to join the consortium to actively contributing to the
  development.
• Respect. Different teams come with different backgrounds, different skill
  sets, and different external pressures. Striking the right balance between the
  cathedral (centrally controlled) and the bazaar (locally controlled) approach to
  collaborative development is crucial.
• Trust. A team needs to trust the other teams in the consortium to deliver
  according to agreements, so that consortium membership pays off. In the
  DINA consortium, we have just reached this stage, 5 years after we started
  collaborating with the Specify group in Kansas, USA.
DINA Versions
• DINA Light (“Specify”, “DINA-Specify Hybrid”)
  • Based largely on Specify 6 and the Specify data model, combined with new APIs and web
    clients (collection web portal, biological survey client, species pages, DNA barcode portal,
    loan request system)
  • Fully compatible with Specify 7
  • In production in Sweden since 2011. Currently includes many of the small Swedish collection
    databases (NRM entomology, geology; GNM entomology, SMTP) with several more on the
    way in (NRM zoology (part), GB herbarium, GNM zoology and geology).
• DINA Web
  • Modern, modular, service-oriented architecture, optimized for distributed development,
    based to a large extent on the Specify data model
  • DINA API guidelines and style guidelines adopted
  • Architectural road map, module overview and API blueprints under development
  • Core modules available in proto-DINA versions: collection web portal, species pages system,
    biological survey client, DNA barcode portal
  • Core modules under development: taxonomy module, collection manager, DNA sequence
    module, DINA data tool (batch uploading and editing)
DINA Light (DINA-Specify Hybrid) Original Version
[Diagram: a web interface and web client, Morphbank, and the Specify 6 Java “thick client”, connected to central databases and file archives (Specify database, Morphbank database, Morphbank image archive)]
DINA Web System Overview

Module                       Prototype address
Collection Manager           Not available yet
Biodiversity survey client   Naturfynd.se
Collection web portal        Naturarv.se
DNA barcode portal           DNA-key.se
Species pages                Naturforskaren.se

Underlying data stores: collection-related databases, media metadata, media files, BLAST DB, taxon info
Current Specify 6 client (Java stand-alone). Old technology, old-style interface.

Web-based Specify 7 client. Restricted functionality, same interface as Java client,
monolithic system with a code base that is difficult to work with

Modularization important to facilitate distributed development

Modern web form for the collection manager UI, developed in collaboration with
collection managers in the DINA consortium
Estonian Pluto-F project, contributes taxonomy module to DINA-Web
Canadian SeqDB project, contributes DNA sequence module to DINA-Web
More DINA Info
• DINA project wiki (http://dina-project.net)
  • Project introduction
  • Steering committee and technical committee information, minutes of meetings etc.
  • Status of the project in each of the participating institutions
• DINA GitHub repository (https://github.com/DINA-Web)
  • DINA API guidelines and style guidelines
  • Module map, system overview
  • Code for DINA modules
• DINA components in production in Sweden:
  • http://naturforskaren.se (species pages, in Swedish)
  • http://naturfynd.se (biodiversity survey client, requires login)
  • http://naturarv.se (collection web portal)
  • http://dna-key.se (DNA barcode portal)
  • https://www.dina-web.net/loan/ (loan request)
