
PANGAEA - Providing access to geoscientific data using Apache Lucene Java
Uwe Schindler
PANGAEA / SD DataSolutions GmbH, uschindler@pangaea.de
My Background
• I am a committer and PMC member of Apache Lucene and Solr.
My main focus is on development of Lucene Java.
• Implemented fast numerical search and maintain the new
attribute-based text analysis API.
• Studied Physics at the University of Erlangen-Nuremberg and
work as a consultant and software architect for PANGAEA
(Publishing Network for Geoscientific & Environmental Data)
in Bremen, Germany, where I implemented the portal's
geospatial retrieval functions with Lucene Java.
• Talks about Lucene at various international conferences like
ApacheCon EU/US, Lucene Eurocon, Berlin Buzzwords and
various local meetups.
About PANGAEA
• since 1993
Information system for earth system science data hosted by AWI & MARUM
• 2001
Mandate of the International Council for Science (ICSU):
World Data Center for Marine Environmental Sciences (WDC-MARE)
• 2007
Mandate of the World Meteorological Organisation (WMO):
World Radiation Monitoring Center (WRMC)
• 2010 (certification in progress)
Mandate of the World Meteorological Organisation (WMO):
Data Collection and Processing Center (DCPC)
Network of World Data Centers
Geophysical Year 1957
[Figure: the World Data Centers by discipline, each listed with its host cities, plus the WDC Co-ordination Offices. Disciplines: Airglow, Astronomy, Atmospheric Trace Gases, Aurora, Cosmic Rays, Earth Tides, Geology, Geomagnetism, Glaciology, Human Interactions in the Environment, Ionosphere, Marine Environmental Sciences, Marine Geology and Geophysics, Meteorology, Nuclear Radiation, Oceanography, Paleoclimatology, Recent Crustal Movements, Remotely Sensed Land Data, Renewable Resources and Environment, Rockets and Satellites, Rotation of the Earth, Satellite Information, Seismology, Soils, Solar Activity, Solar Radio Emission, Solar Terrestrial Physics, Solid Earth Geophysics, Space Science, Space Science Satellites, Sunspot Index]
• Marine Environmental Sciences: Bremen, Germany (2001)
Why do we need Data Libraries?

- Good scientific practice
- Needed for verification of scientific
work
- Good availability of data for large
scale and complex scientific
approaches
- "Data recycling" is more effective
than reproduction
Geosciences before 1900

Turin papyrus, ~1160 BC
William Smith, 1815
HMS Challenger, 1875
Technical Improvements
ENIAC, 1944

Magnetometer
Development of the global climate
[Figure: three panels, two with the time axis "Thousands of years before present" and one covering "The last 1300 years"]
Information increase in empirical sciences
[Figure: curves for "Publications" and "Data", 1970 to 2010, vertical axis 0 to 30; the continuation after 2010 is marked "?"]
Archiving and publication of scientific data

• Data acquisition
• Quality assurance
• Long-term availability and access
Long term archive
• Open access & non-restricted data
o Creative Commons license
• Data accepted from individual scientists,
institutes, and science projects
• Long-term funding for basic operation
o hardware, software, system management &
organisation
• Long-term preservation of data
o Technical: security, migration of media
o Usability: preserving the integrity & semantics of
data sets
Contents
Data Types in PANGAEA
• Profiles => doi:10.1594/PANGAEA.701299
• Time series => doi:10.1594/PANGAEA.323487
• Sea bed photos => doi:10.1594/PANGAEA.319877
• Distributed samples => doi:10.1594/PANGAEA.51749
• Complex data => doi:10.1594/PANGAEA.108079
• Air photos => doi:10.1594/PANGAEA.323540
• Audio record => doi:10.1594/PANGAEA.339110
[Figure: profiles for PS1389-3, PS1390-3, PS1431-1, PS1640-1 and PS1648-1 showing IRD (grav/10 cm3), Sand (%), CaCO3 (%), TOC (%), Radio (%/sand) and Smect (%/clay) against Age (kyr), max. 233.55 kyr]
[Figure: map, 11° to 15° and 54° to 55°30', scale 1:2695194 at latitude 0°, showing the world vector shore line, geochemistry, and grain size classes KOLP A, KOLP B, KOLP DIN, KOEHN and KOEHN2. Source: Baltic Sea Research Institute, Warnemünde.]
Statistics (9/2010)

[Pie chart with segments: Water, Sediment, Ice, Atmosphere, Corals, unclassified]

Total number of data sets ~ 1 million
Data items ~ 8 billion
Now the technical details :-)
PANGAEA - Architecture

[Diagram: Editorial system; Sybase ASE (RDB); Harddisk + tape (silo); Apache Lucene; Middleware; Webserver; clients: Google Maps / Earth, PANGAEA search engine, …]
Indexing contents from relational database with dynamic updates
[Diagram: tables Staffs, Projects, Data Series, Data Set, Events; an Update Log; output: XML Data Set Description (Metadata)]
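One way to read the diagram above is that changes in the relational tables are recorded in an update log, and only the affected data set descriptions are rebuilt and replaced in the index. The following is a minimal sketch of that idea, not the PANGAEA implementation: a Python dict stands in for the Lucene index, and `LogEntry`, `apply_update_log` and `build_document` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    seq: int               # position in the update log (assumed increasing)
    dataset_id: int        # data set whose description must be rebuilt
    deleted: bool = False  # True if the data set was removed


def apply_update_log(index, log, last_seq, build_document):
    """Replay entries newer than last_seq and return the new position.

    index          -- dict standing in for the search index (id -> document)
    build_document -- callable that renders the metadata document for an id
    """
    pending = {}
    new_seq = last_seq
    for entry in sorted(log, key=lambda e: e.seq):
        if entry.seq > last_seq:
            pending[entry.dataset_id] = entry  # the latest entry per data set wins
            new_seq = max(new_seq, entry.seq)
    for dataset_id, entry in pending.items():
        if entry.deleted:
            index.pop(dataset_id, None)        # drop the stale document
        else:
            index[dataset_id] = build_document(dataset_id)  # replace in place
    return new_seq
```

Storing the returned position lets the next run skip everything that was already indexed, so a change to one project or event only costs the rebuild of the data sets the log names.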
Indexed Information
• Textual metadata: citation (authors, title),
abstract, measurement parameters,
methods, associated projects, comments,
documentation… (including field info for all
XML schema element types)
• Fulltext data set contents
• Geographical information:
latitude/longitude/BBOX/track, dates,
geological age, depth/elevation
[NumericField/NumericRangeQuery]

• Soon: Fulltext of attached external documentation
(PDF…)
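The NumericField/NumericRangeQuery pair named in the list above rests on indexing each number at several precisions, so that a range can be answered with a few coarse prefix terms instead of one term per distinct value. The sketch below illustrates that idea in plain Python for non-negative integers. It is a simplified model, not Lucene's code: `trie_terms`, `split_range` and `matches` are hypothetical names, and `step` plays the role of the precision step.

```python
def trie_terms(value, step, bits):
    """Terms stored for one value: its prefix at every precision level."""
    return {shift: value >> shift for shift in range(0, bits, step)}


def split_range(lo, hi, step, bits):
    """Cover [lo, hi] with (shift, low_prefix, high_prefix) sub-ranges."""
    parts = []
    shift = 0
    while True:
        diff = 1 << (shift + step)
        mask = ((1 << step) - 1) << shift
        has_lower = (lo & mask) != 0      # lo does not start a coarser block
        has_upper = (hi & mask) != mask   # hi does not end a coarser block
        next_lo = ((lo + diff) if has_lower else lo) & ~mask
        next_hi = ((hi - diff) if has_upper else hi) & ~mask
        if shift + step >= bits or next_lo > next_hi:
            parts.append((shift, lo >> shift, hi >> shift))
            break
        if has_lower:                     # ragged lower edge at this precision
            parts.append((shift, lo >> shift, (lo | mask) >> shift))
        if has_upper:                     # ragged upper edge at this precision
            parts.append((shift, (hi & ~mask) >> shift, hi >> shift))
        lo, hi = next_lo, next_hi
        shift += step
    return parts


def matches(value, parts, step, bits):
    """True if any stored prefix of value falls into one of the sub-ranges."""
    terms = trie_terms(value, step, bits)
    return any(a <= terms[shift] <= b for shift, a, b in parts)
```

The edges of the range are matched at full precision and the middle at coarser precision, which is why a smaller step means more terms per value in the index but fewer terms to visit per query.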
Geo-Retrieval with Lucene
Using scored queries
with KML regions as filters
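As a rough illustration of the combination named above, the following sketch keeps only the scored hits whose position lies inside a region and returns them in score order. It assumes the KML region has already been reduced to a single outer ring of (longitude, latitude) vertices and treats coordinates as planar, ignoring holes and the date line. The hit dictionaries and both function names are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge crosses the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def filter_hits(hits, polygon):
    """Drop hits outside the region, then order the rest by score."""
    kept = [h for h in hits if point_in_polygon(h["lon"], h["lat"], polygon)]
    return sorted(kept, key=lambda h: h["score"], reverse=True)
```

The filter only decides which hits survive; the ranking still comes from the text query's scores.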
Apache Lucene
as a fast Key-Value Store
• Lucene is used for almost every query on the
web-client
• Lots of keyword terms indexed for quick
retrieval of data sets
• Example: lookup of datasets related to
publications using DOI. PANGAEA is hit by
hundreds of DOI lookup queries per second
from scientific publishers.
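The key-value use can be pictured as an exact-match lookup from an untokenized keyword term to the data sets that carry it. The sketch below models this with an in-memory dict in place of a Lucene index. `DoiLookup` and `normalize_doi` are hypothetical names, and the DOI and data set IDs in the usage lines are made up for illustration.

```python
from collections import defaultdict


def normalize_doi(doi):
    """Strip an optional 'doi:' prefix and fold case (DOIs are case-insensitive)."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return doi.lower()


class DoiLookup:
    """Maps a publication DOI to the IDs of related data sets."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, dataset_id, publication_doi):
        # The whole DOI is one keyword term; it is never split into words.
        self._postings[normalize_doi(publication_doi)].add(dataset_id)

    def lookup(self, publication_doi):
        return sorted(self._postings.get(normalize_doi(publication_doi), ()))


store = DoiLookup()
store.add(101, "10.1000/Example.123")  # hypothetical DOI and data set IDs
store.add(102, "doi:10.1000/example.123")
related = store.lookup("DOI:10.1000/EXAMPLE.123")
```

Because each lookup is a single term match, its cost does not depend on how much text the data set descriptions contain.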
Live Presentation
Contact
Uwe Schindler

PANGAEA - Publishing Network for Geoscientific &
Environmental Data
MARUM, Leobener Str., 28359 Bremen, Germany
uschindler@pangaea.de

SD DataSolutions GmbH
Wätjenstr. 49, 28213 Bremen, Germany
uschindler@sd-datasolutions.de
Thank you!
Learn more about Apache Lucene at
www.lucidimagination.com