Data sharing and retrieval using OAI-PMH

Ranjeet Devarakonda 1, Giri Palanisamy 1, James M. Green 2, Bruce E. Wilson 1

1 Oak Ridge National Laboratory, PO Box 2008, MS 6407, Oak Ridge, TN 37831 USA
2 Information International Associates, Oak Ridge, TN 37831 USA

ABSTRACT

There is a growing consensus on the need to store and archive digital data, particularly for publicly funded research. Long-term preservation of that data generally requires some form of institutional archive, such as those directed to particular scientific communities of practice. Given that data is often of use to multiple communities of practice, which may have differing norms for data and metadata structure and semantics, effective standards for data and metadata exchange are important factors in users being able to find and retrieve data. Toolsets that provide a coherent presentation of information across multiple standards are important for data search and access. One such toolset, Mercury, is an open-source metadata harvesting, data discovery, and access system, built for researchers to search for, share, and obtain spatiotemporal data used across a range of climate and ecological sciences. Mercury is used across multiple projects to provide a coherent search interface for spatiotemporal data described in any of several metadata formats. Mercury has recently been extended to enable harvesting and distribution of metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). In this paper we describe Mercury’s capabilities with multiple metadata formats in general and, more specifically, the results of our OAI-PMH implementations and the lessons learned.

KEYWORDS

Mercury Search System, Scientific data search, OAI-PMH, jOAI, Data sharing, Metadata, Ecological Informatics, Climate change, Environmental informatics, Spatiotemporal data

1. INTRODUCTION

A key conclusion of a recent United States Interagency Working Group on Digital Data (IWGDD) report on harnessing the power of digital data for science and society is the role of communities of practice in effective quality control, preservation, distribution, interpretation, and use of digital data (NSTC, 2009). A given researcher may, however, participate in multiple communities of practice, and may also need to draw on data from communities outside those he or she normally participates in. The data generated by a researcher, or by any other data generator, is potentially of use to multiple communities of practice and scientific disciplines. While that data may be archived in a particular repository serving one or more particular communities, it may need to be discoverable and usable by multiple communities. General-purpose search engines, such as Google and Bing, are currently generally ineffective at discovering scientific data, in part because they lack the specific semantics associated with a particular search and in part because they generally perform full-text searches rather than fielded searches of structured documents. A Google search on the term “Eagles” lacks the context to distinguish between multiple different meanings, whereas a data repository serving a biology community can presume that the searcher is referring to members of the genus Haliaeetus. Advances in the practice of scientific data management, the tools for managing data, the standards for data formats and metadata formats, and the understanding of the value of digital data have created a wide range of digital repositories focused on different applications. Nor are these repositories necessarily distinct. There may be a number of different repositories serving field ecologists, with distinctions based on funding agency, country, organizational affiliation, or other artifacts of historical origin.
These repositories generally have search tools that work within their particular holdings, but are often unable to search across the holdings of other repositories, due to various technical and sociological factors. From the end user perspective, this situation is problematic, as a comprehensive search for available digital data relevant to a research topic is nearly impossible, requiring knowledge of multiple repositories and the particular search interfaces of those repositories. Multiple approaches have been used for enabling search across multiple repositories, such as the Z39.50 distributed search method (Lynch, 1997). Distributed searches, however, can be problematic, both for response time and uptime. Search results can only be presented to the user as quickly as the slowest search agent returns (plus some processing time if the results are to be integrated), and the composite uptime is the product of the individual uptimes. As a result of the problems with distributed searches, repositories have turned to a harvest-and-index approach as a means to ensure rapid response, enable full integration of metadata from multiple sources, and provide acceptable uptime. However, harvesting can be an inefficient process, particularly if the metadata are completely reharvested regularly as a means to ensure that source changes are propagated into the search results.

Mercury (Devarakonda, 2010) is an open-source toolset for metadata authoring, harvesting, indexing, and searching, which implements a variety of harvesting protocols and provides a coherent view of metadata across a range of metadata standards, including the Federal Geographic Data Committee Content Standard for Digital Geospatial Metadata (FGDC CSDGM), Ecological Metadata Language (EML), the Global Change Master Directory’s Directory Interchange Format (GCMD DIF), Dublin Core, and ISO 19115. Mercury’s architecture includes 1) a harvesting engine to collect various metadata records from publicly available folders, web sites, FTP sites, and other network-accessible locations; 2) a powerful indexing engine based on Apache Lucene and Solr that can index billions of records; and 3) a search engine based on a service-oriented architecture, which can perform searches and distribute results through web user interfaces, web services, RSS feeds, and portlets.

Recently, we added the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH; OAI, 2010; Van de Sompel, 2004) as a means for both harvesting metadata from other repositories and enabling the distribution and reuse of metadata from repositories using the Mercury toolset. This new feature is an extension to Mercury’s harvester. OAI-PMH is a standard that is seeing increased use as a means for exchanging structured metadata. OAI-PMH providers must support Dublin Core as a metadata standard, with other metadata formats as optional. We have developed tools that enable Mercury to both consume and distribute metadata using OAI-PMH services in any of the metadata formats we support. By implementing these tools, we seek to significantly lower the technical barriers for users to find and use relevant data, regardless of the particular repository that is the authority for that data and the associated metadata.
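The cost of distributed search noted earlier in this section (response time bounded by the slowest agent, uptime equal to the product of the individual uptimes) can be made concrete with a small calculation. The following is a minimal sketch only; the function names and all of the figures are hypothetical and are not measurements from Mercury or any real repository.

```python
from math import prod


def composite_uptime(uptimes):
    """Probability that every repository in a distributed search is up at once,
    assuming the repositories fail independently of each other."""
    return prod(uptimes)


def distributed_response_time(agent_times, merge_time=0.0):
    """A distributed search finishes only after its slowest agent has answered,
    plus any time spent integrating the results."""
    return max(agent_times) + merge_time


# Hypothetical figures for five repositories (assumptions, not measurements).
uptimes = [0.99, 0.98, 0.995, 0.97, 0.99]
agent_times = [0.4, 1.2, 0.3, 2.5, 0.8]  # seconds

print(f"composite uptime: {composite_uptime(uptimes):.3f}")
print(f"response time: {distributed_response_time(agent_times, merge_time=0.2):.1f} s")
```

Under these assumptions, adding repositories can only lower the composite uptime and can only lengthen the response time, whereas a harvested index answers from a single local service.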

2. METHODS AND TECHNIQUES

Mercury harvests metadata records from several data providers around the world, builds a centralized index, and makes it searchable via Mercury’s search interface (see Figure 1). Once the records are harvested by the default harvester, they are exposed to the new OAI-PMH-based Mercury harvester.


Figure 1. Mercury Metadata Search

The Mercury OAI-PMH Handler is implemented using a Java-based, open-source Open Archives Initiative software package (jOAI), developed by Digital Learning Sciences at the University Corporation for Atmospheric Research (UCAR, 2010). This package allows metadata records from a file system to be exposed as items in an OAI data repository and made available through the data provider for harvesting. Remote harvesters that monitor the OAI data repository can effectively mirror the files or harvest them incrementally. For example, the PMH handler of NASA’s Global Change Master Directory (GCMD) consumes this structured metadata via ORNL’s OAI harvester service. Figure 2 describes the high-level metadata flow from ORNL to GCMD and also shows other potential metadata distribution standards.

For a number of reasons, our OAI-PMH provider is generally configured to expose only the metadata for which that repository is authoritative. A given repository may be harvesting from multiple different locations, for purposes of providing a coherent view to the user. That repository may not, however, have permission to redistribute the harvested metadata. Further, redistribution brings in a number of technical challenges. If repository A harvests from B, which harvests from C, which harvests from D, then an update to a metadata record at repository D will take three update cycles to reach users of repository A, which could be a significant delay, depending on the harvest frequency. Furthermore, if repository A harvests from B and from C, while both B and C harvest from D and D harvests from A, avoiding a perpetual update cycle and determining the authoritative instance for a particular metadata record may prove problematic. While this type of cyclic harvesting arrangement may seem contrived, the authors are familiar with a number of cases where such situations could occur.
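The two redistribution problems described above, propagation delay along a chain and perpetual update cycles, can both be checked on a simple directed graph of "harvests from" relationships. The sketch below is illustrative only and is not part of Mercury; the function names are hypothetical, and it assumes that each hop costs exactly one harvest cycle.

```python
from collections import deque


def update_cycles(harvests_from, source, target):
    """Harvest cycles needed for a change at `source` to reach `target`,
    or None if `target` never receives it (breadth-first search)."""
    # Invert the graph: for each repository, list who harvests from it.
    harvested_by = {}
    for repo, upstreams in harvests_from.items():
        for upstream in upstreams:
            harvested_by.setdefault(upstream, []).append(repo)
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        repo, hops = queue.popleft()
        if repo == target:
            return hops
        for downstream in harvested_by.get(repo, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append((downstream, hops + 1))
    return None


def has_harvest_cycle(harvests_from):
    """True if some repository can end up re-harvesting its own records."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}

    def visit(repo):
        color[repo] = GREY
        for upstream in harvests_from.get(repo, []):
            state = color.get(upstream, WHITE)
            if state == GREY or (state == WHITE and visit(upstream)):
                return True
        color[repo] = BLACK
        return False

    return any(color.get(r, WHITE) == WHITE and visit(r) for r in list(harvests_from))


# The two arrangements described in the text.
chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
cyclic = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}

print(update_cycles(chain, "D", "A"))  # 3
print(has_harvest_cycle(chain))        # False
print(has_harvest_cycle(cyclic))       # True
```

Exposing only authoritative metadata, as described above, keeps every record one hop from its source and removes the possibility of a cycle.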

2.1 OAI-PMH Overview

Metadata are exchanged between data providers and service providers as XML documents transmitted over HTTP. There can be multiple data providers and service providers; each service provider harvests metadata from several different data providers. These transfers are carried out using simple HTTP requests and responses. There are six different types of requests (see Figure 3). A harvester is not required to use all of them.
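As an illustration of how simple these requests are, the sketch below builds request URLs for the six verbs defined by OAI-PMH 2.0. The helper name `build_request_url` and the base URL `http://example.org/oai` are hypothetical; this is not Mercury's or jOAI's code.

```python
from urllib.parse import urlencode

# The six request types (verbs) defined by OAI-PMH 2.0.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request_url(base_url, verb, **arguments):
    """Return the GET URL for one OAI-PMH request.

    `from` is a reserved word in Python, so callers pass it as `from_`;
    a single trailing underscore is stripped from argument names.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in arguments.items():
        if value is not None:
            params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address, used only for illustration.
print(build_request_url("http://example.org/oai", "Identify"))
print(build_request_url("http://example.org/oai", "ListRecords",
                        metadataPrefix="oai_dc", from_="2010-05-01"))
```

The second call produces `http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2010-05-01`, which asks only for Dublin Core records changed on or after the given date.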

Figure 2. ORNL’s OAI-PMH Metadata flow (diagram labels: ORNL DAAC Metadata Records; ORNL’s OAI-PMH Handler Service; DIF, FGDC, ISO, DC; GCMD’s OAI-PMH; GCMD Data Discovery Service)

2.2 ORNL’s OAI-PMH Harvester

Generally, an OAI-PMH provider stores metadata in an autonomous OAI-PMH repository. This repository is identified by a unique, persistent base URL, the HTTP address to which requests are sent. To monitor metadata revisions, an OAI-PMH harvester can read when a record was added, modified, or deleted, which helps keep the data provider and the harvester synchronized. It typically uses the datestamp for this purpose, which by definition is the date and time of creation or modification of the Dublin Core metadata record. However, updating a resource does not necessarily modify its Dublin Core record, so the datestamp might not be the most reliable basis for an incremental harvesting approach. In the previous harvesting approach, incremental harvesting was unavailable, which resulted in long network connections and slowed processing until the entire load of metadata had been downloaded.
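The synchronization step described above can be sketched as follows: read the identifier, datestamp, and deletion status from each record header in a response, then apply them to a local index. This is a minimal sketch under stated assumptions; the sample response, the identifiers, and the helper names `parse_headers` and `synchronize` are invented for illustration, while the XML namespace and the `status="deleted"` attribute are those defined by OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response (an assumption, not output from a real repository).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-05-20T12:00:00Z</responseDate>
  <request verb="ListIdentifiers">http://example.org/oai</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:record-1</identifier>
      <datestamp>2010-05-01</datestamp>
    </header>
    <header status="deleted">
      <identifier>oai:example.org:record-2</identifier>
      <datestamp>2010-05-10</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def parse_headers(xml_text):
    """Extract identifier, datestamp, and deletion status from each header."""
    root = ET.fromstring(xml_text)
    headers = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        headers.append({
            "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            "deleted": header.get("status") == "deleted",
        })
    return headers


def synchronize(local_index, headers):
    """Apply harvested headers to a local {identifier: datestamp} index."""
    for header in headers:
        if header["deleted"]:
            local_index.pop(header["identifier"], None)
        else:
            local_index[header["identifier"]] = header["datestamp"]
    return local_index


index = {"oai:example.org:record-2": "2010-04-01"}
synchronize(index, parse_headers(SAMPLE_RESPONSE))
print(index)  # {'oai:example.org:record-1': '2010-05-01'}
```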


Figure 3. OAI-PMH Overview

In general, however, the OAI-PMH protocol provides reliable information on the revision date for metadata records, which ensures that the harvester only retrieves the records which have changed since the last harvest. This places less strain on both the PMH provider and the PMH client, and allows for more rapid update cycles.
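An incremental harvest of this kind can be sketched as a `ListRecords` request carrying a `from` date, followed by further requests for as long as the provider returns a resumption token. The code below is an assumption-laden sketch, not Mercury's harvester: `harvest_changes` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP client so that the example runs offline.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest_changes(base_url, fetch, last_harvest, metadata_prefix="oai_dc"):
    """Yield (identifier, datestamp, deleted) for records changed since
    `last_harvest`. `fetch` is any callable mapping a URL to response text."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix,
              "from": last_harvest}
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        for header in root.iterfind(".//oai:header", OAI_NS):
            yield (header.findtext("oai:identifier", namespaces=OAI_NS),
                   header.findtext("oai:datestamp", namespaces=OAI_NS),
                   header.get("status") == "deleted")
        token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
        if not token:
            break  # a missing or empty token marks the last page
        # A resumption token replaces every argument except the verb.
        params = {"verb": "ListRecords", "resumptionToken": token}


def _page(identifier, datestamp, token, deleted=False):
    """Build a tiny hand-written response page (illustration only)."""
    status = ' status="deleted"' if deleted else ""
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
            f"<record><header{status}><identifier>{identifier}</identifier>"
            f"<datestamp>{datestamp}</datestamp></header></record>"
            f"<resumptionToken>{token}</resumptionToken>"
            "</ListRecords></OAI-PMH>")


def fake_fetch(url):
    """Stand-in for an HTTP GET against a hypothetical two-page repository."""
    if "resumptionToken=page2" in url:
        return _page("oai:example.org:2", "2010-05-03", "", deleted=True)
    return _page("oai:example.org:1", "2010-05-02", "page2")


for change in harvest_changes("http://example.org/oai", fake_fetch, "2010-05-01"):
    print(change)
```

In a real deployment `fetch` would wrap an HTTP client, and the date of each successful harvest would be stored to serve as the next `from` value.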

A metadata crosswalk module manages the available metadata formats. This component converts one metadata format to another. Though it supports multiple formats, Dublin Core is mandatory for interoperability and standards compliance. This is another new value-added module in the Mercury system. Data providers or data sources have more flexibility in choosing a metadata standard, and can concentrate more on the content than on the style of presentation. Once metadata are harvested, by whatever means, Mercury extracts the available information from the metadata records to form a common representation used as the basis for the Lucene indexing. The full metadata record is also full-text indexed and available for the end user to examine as part of the search results.
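The idea of a crosswalk can be sketched as a lookup table from source fields to Dublin Core elements. The example below is illustrative only and does not reproduce Mercury's crosswalk module: the function name is hypothetical, and the source field names are simplified assumptions about flattened DIF and FGDC records rather than complete or authoritative mappings.

```python
# Illustrative crosswalk tables; the source field names are simplified assumptions.
CROSSWALKS = {
    "dif": {"Entry_ID": "identifier", "Entry_Title": "title", "Summary": "description"},
    "fgdc": {"title": "title", "abstract": "description", "origin": "creator"},
}


def to_dublin_core(record, source_format):
    """Map a flat {field: value} record onto Dublin Core element names.

    Fields without a mapping are dropped, which is why a crosswalk to
    Dublin Core is usually lossy for rich scientific metadata.
    """
    mapping = CROSSWALKS[source_format]
    dublin_core = {}
    for field, value in record.items():
        element = mapping.get(field)
        if element is not None:
            # Dublin Core elements are repeatable, so values are kept in lists.
            dublin_core.setdefault(element, []).append(value)
    return dublin_core


# Hypothetical record, for illustration only.
dif_record = {
    "Entry_ID": "example-0001",
    "Entry_Title": "Soil respiration measurements",
    "Summary": "Daily soil respiration at a forest site.",
    "Spatial_Coverage": "35.9N 84.3W",  # no mapping above, so it is dropped
}
print(to_dublin_core(dif_record, "dif"))
```

The dropped `Spatial_Coverage` field shows why a repository may prefer to exchange a richer format where both sides support it, keeping Dublin Core as the common baseline.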

3. FUTURE DIRECTIONS

While OAI-PMH significantly improves the ability of repositories to consolidate metadata from multiple different sources, providing users at least discovery-level metadata to enable scientific research, users are still likely to need to search multiple repositories. And once records have been located, the data access mechanisms for various repositories can be quite different. Some metadata specifications, such as ISO 19115 and GCMD DIF, provide means for data providers to indicate data access services for standard methods, such as Data Access Protocol (DAP) methods or Open Geospatial Consortium (OGC) web services. As data providers expand metadata to provide such information, and as metadata becomes more machine processable, users will be better able to directly access data without needing as much understanding of differing data locations and access methods. The Mercury development team is actively engaged in working with data providers to enable more transparent data access using these types of service descriptors. When working across communities of practice, there are also issues where different terms are used for the same concept and the same term is used for different concepts. Semantic mediation is one method for addressing this type of problem.

4. CONCLUSION

The OAI-PMH enhancement was a useful addition to the harvesting protocols in use for distributing metadata. Metadata exchanges between agencies are now carried out more readily. In our implementation, which is based on the standard and on open-source tools, we are able to supply metadata in multiple formats, based on transformations from our internal metadata structure. This enables distribution to multiple collaborating repositories in an efficient manner, and in the way that best suits the native capabilities of those collaborating institutions. OAI-PMH focuses on the transfer of metadata between data providers, and other common services such as metadata searching are outside its scope. Integration of Mercury with OAI-PMH is filling a key gap in searching, sharing, and obtaining spatiotemporal data across the scientific community, thus boosting its overall performance and usage. While the specification requires that Dublin Core metadata be an option, this can be a very limited metadata structure, particularly for complex scientific datasets. Metadata exchanges are carried out asynchronously via simple HTTP requests and responses, which also demonstrates the simplicity of the protocol.

5. ACKNOWLEDGMENTS

Mercury development has been funded by multiple different projects from the National Aeronautics and Space Administration (NASA), the United States Geological Survey (USGS), and the Department of Energy (DOE). Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725.

6. REFERENCES

Devarakonda, R., Palanisamy, G., Wilson, B. E., and Green, J. M. (2010) Mercury: reusable metadata management, data discovery and access system. Earth Science Informatics 3: 87-94. doi:10.1007/s12145-010-0050-7.

NAS (2010) Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age (ISBN 0-309-136857). The National Academies Press.

NSTC (2009) “Harnessing the Power of Digital Data for Science and Society.” Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council. Washington, D.C. Available at: http://www.nitrd.gov/About/Harnessing_Power_Web.pdf

OAI (2010) Open Archives Initiative Protocol for Metadata Harvesting. Interoperability through Metadata Exchange. Retrieved May 2010. Available at: http://www.openarchives.org/pmh/

Suleman, H., and Fox, E. A. (2001) “A Framework for Building Open Digital Libraries,” D-Lib Magazine 7#12.

Lynch, C. A. (1997) “The Z39.50 Information Retrieval Standard. Part I: A Strategic View of Its Past, Present and Future,” D-Lib Magazine, April 1997.

UCAR (2010) jOAI software, developed by Digital Learning Sciences (DLS) (http://www.dlsciences.org/) at the University Corporation for Atmospheric Research (http://www.ucar.edu/). Retrieved May 2010. Available at: http://www.dlese.org/dds/services/joai_software.jsp

Van de Sompel, H., et al. (2004) “Resource Harvesting within the OAI-PMH Framework,” D-Lib Magazine 10#2.
