This article was downloaded by: [Mr Andrea Bergamasco]
On: 29 May 2012, At: 14:21
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered
office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Advances in Oceanography and Limnology
Publication details, including instructions for authors and
subscription information:
http://www.tandfonline.com/loi/taol20

Knowledge discovery in large model datasets in the marine environment: the THREDDS Data Server example
A. Bergamasco^a, A. Benetazzo^a, S. Carniel^a, F.M. Falcieri^a, T. Minuzzo^a, R.P. Signell^b and M. Sclavo^a
^a CNR-ISMAR, Castello 2737/F, 30122 Venice, Italy
^b U.S. Geological Survey, Woods Hole, MA 02540, USA

Available online: 25 May 2012

To cite this article: A. Bergamasco, A. Benetazzo, S. Carniel, F.M. Falcieri, T. Minuzzo, R.P. Signell
& M. Sclavo (2012): Knowledge discovery in large model datasets in the marine environment: the
THREDDS Data Server example, Advances in Oceanography and Limnology, 3:1, 41-50

To link to this article: http://dx.doi.org/10.1080/19475721.2012.669637

Advances in Oceanography and Limnology
Vol. 3, No. 1, June 2012, 41–50

Knowledge discovery in large model datasets in the marine environment: the THREDDS Data Server example
A. Bergamasco^a, A. Benetazzo^a, S. Carniel^a*, F.M. Falcieri^a, T. Minuzzo^a, R.P. Signell^b and M. Sclavo^a
^a CNR-ISMAR, Castello 2737/F, 30122 Venice, Italy; ^b U.S. Geological Survey, Woods Hole, MA 02540, USA
(Received 8 December 2011; final version received 15 February 2012)
In order to monitor, describe and understand the marine environment, many research institutions are involved in the acquisition and distribution of ocean
data, both from observations and models. Scientists from these institutions are
spending too much time looking for, accessing, and reformatting data: they need
better tools and procedures to make the science they do more efficient. The U.S.
Integrated Ocean Observing System (US-IOOS) is working on making large
amounts of distributed data usable in an easy and efficient way. It is essentially
a network of scientists, technicians and technologies designed to acquire, collect
and disseminate observational and modelled data resulting from coastal and
oceanic marine regions investigations to researchers, stakeholders and policy
makers. In order to be successful, this effort requires standard data protocols,
web services and standards-based tools. Starting from the US-IOOS approach,
which is being adopted throughout much of the oceanographic and meteorolog-
ical sectors, we describe here the CNR-ISMAR Venice experience in the direction
of setting up a national Italian IOOS framework using the THREDDS
(THematic Real-time Environmental Distributed Data Services) Data Server
(TDS), a middleware designed to fill the gap between data providers and data
users. The TDS provides services that allow data users to find the data sets
pertaining to their scientific needs, to access, to visualize and to use them in an
easy way, without downloading files to the local workspace. In order to achieve
this, it is necessary that the data providers make their data available in a standard
form that the TDS understands, and with sufficient metadata to allow the data
to be read and searched in a standard way. The core idea is then to utilize a
Common Data Model (CDM), a unified conceptual model that describes different
datatypes within each dataset. More specifically, Unidata (www.unidata.ucar.edu) has developed CDM specifications for many of the different kinds
of data used by the scientific community, such as grids, profiles, time series and swath
data. These datatypes are aligned with the NetCDF Climate and Forecast (CF)
Metadata Conventions and with the Climate Science Modelling Language (CSML);
CF-compliant NetCDF files and GRIB files can be read directly with no
modification, while non-compliant files can be modified to meet the appropriate
metadata requirements. Once standardized in the CDM, the TDS makes datasets
available through a series of web services such as OPeNDAP or Open Geospatial
Consortium Web Coverage Service (WCS), allowing the data users to easily
obtain small subsets from large datasets, and to quickly visualize their content by
using tools such as GODIVA2 or Integrated Data Viewer (IDV). In addition,
an ISO metadata service is available through the TDS that can be harvested

*Corresponding author. Email: sandro.carniel@cnr.it

ISSN 1947–5721 print/ISSN 1947–573X online
© 2012 Taylor & Francis
http://dx.doi.org/10.1080/19475721.2012.669637
http://www.tandfonline.com
by catalogue broker services (e.g. GI-cat) to enable distributed search across federated data servers. Examples of TDS datasets can be accessed at the CNR-ISMAR Venice site http://tds.ve.ismar.cnr.it:8080/thredds/catalog.html.
Keywords: large model output; NetCDF; CF convention; THREDDS

1. Introduction
The marine environment is characterized by a large number of complex dynamical
processes of societal importance, such as sea level rise, coastal flooding, coastal erosion [1],
harmful algal blooms and oil spills. In addition, climate change induced effects at global
or regional scales have not yet been completely understood, and as a consequence there is
large uncertainty about the effects they may have on coastal zones [2].
Many national and international research bodies and institutions are actively acquiring
marine data both in situ (e.g. current meters, wave riders, tide gauges, etc.) and remotely
(e.g. from satellite), as well as running complex, integrated numerical models with the aim
of monitoring and depicting the status of our seas.
Despite the growing number of datasets produced by observations and modelling,
ocean data is still often generated using custom formats, and distributed using a variety of
ad hoc methods, making it difficult to efficiently locate and access data from multiple
institutions. Scientists very often spend considerable effort on the time-consuming activity
of locating and retrieving data; and, even when successful, they still have to spend
considerable time properly organizing the data before plotting or analyzing it,
because of differing formats, conventions, etc.
Luckily, it is possible to use existing tools and techniques to overcome this problem,
turning non-standard datasets held at institutions into standard web services in a way that
puts little burden on the data providers [3,4]. These approaches have been applied to the
U.S. Integrated Ocean Observing System (US-IOOS, see http://www.ioos.gov) to make
collectively held oceanographic data easy to find and utilize. IOOS is a coordinated
network of organizations that work together to acquire, organize and distribute
observational and model data in the coastal ocean, to allow for improved understanding
and prediction of the marine environment [5].
CNR-ISMAR Venice is helping to set up a national Italian IOOS framework, with the
focus of making both its data and model results efficiently available to organizations and
research bodies interested in monitoring and predicting the dynamics of the coastal marine
ecosystem [6]. The Italian IOOS network is being designed to connect naturally into
the international IOOS framework. Such an infrastructure will help the understanding
and forecasting of locally important issues such as the effects of severe meteo storms, the
implications of climate variability effects on global-regional scales, a quantitative risk
assessment in coastal areas, etc.
Examples of stakeholders that will immediately benefit from simple and efficient
access to large ocean datasets and model results include: (a) companies involved
in the management of marine coastal resources, including fisheries; (b) institutions dealing
with the management of emergencies, including search and rescue and civil protection
activities; (c) marine scientists; (d) policymakers at local, regional, national and
international levels; (e) recreational users.
The IOOS approach is to standardize not on data formats, but on web services, and
to approve certain web services for certain types of data. For gridded data, the approved
IOOS services are currently Open Geospatial Consortium (OGC) Web Coverage Service
(WCS) and the OPeNDAP service, in agreement with the Climate and Forecast (CF)
convention [7].
This paper aims to highlight the efforts that CNR-ISMAR Venice has recently
carried out in this direction, discussing the basic ideas that have prompted the
oceanographic and meteorological communities to contribute to
''knowledge discovery'' by increasing model data interoperability, as well as
data-model intercomparison and validation.

2. From the Common Data Model to the THREDDS Data Server


To unify the access to scientific data, we start from the basic idea of building a Common
Data Model (CDM). Unidata (http://www.unidata.ucar.edu), a US government-funded
organization whose mission is to provide data services, tools and cyberinfrastructure for
the earth-system sciences, proposed to create a common model for different ''feature types'' of
commonly used scientific data (e.g. grids, profiles, time series, etc.). For each feature type,
readers can then be constructed to translate data from many different formats on disk into
a common model in memory. An implementation of this conceptual model is given by
the Unidata CDM, written in NetCDF-Java (http://www.unidata.ucar.edu/software/netcdf-java/CDM). The TDS, which utilizes NetCDF-Java, can for example read data into
the ''grid'' feature type from NetCDF3, NetCDF4, GRIB1, GRIB2, HDF4 and HDF5
files on disk.
Data already written according to the CF convention can be read directly into the CDM,
while other data can be adapted using the NetCDF Markup Language
(NcML), an XML dialect. The CDM additionally provides a standard API that can identify
geo-referenced coordinate systems, along with specialized queries oriented to data
structures commonly used in the earth science community.
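As a concrete sketch of the NcML step, the following Python snippet generates a minimal NcML fragment that layers CF attributes onto a variable in an otherwise unmodified file. The file name ("model_output.nc") and variable name ("temp") are hypothetical placeholders, not taken from the CNR-ISMAR catalogue:

```python
# Sketch: generate an NcML fragment that adds CF metadata to a
# non-compliant variable without rewriting the file on disk.
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
ET.register_namespace("", NCML_NS)

def ncml_add_cf_attributes(location, var_name, attributes):
    """Build an NcML <netcdf> element wrapping `location` and adding
    the given attribute name/value pairs to variable `var_name`."""
    root = ET.Element(f"{{{NCML_NS}}}netcdf", {"location": location})
    var = ET.SubElement(root, f"{{{NCML_NS}}}variable", {"name": var_name})
    for name, value in attributes.items():
        ET.SubElement(var, f"{{{NCML_NS}}}attribute",
                      {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")

ncml = ncml_add_cf_attributes(
    "model_output.nc", "temp",
    {"standard_name": "sea_water_potential_temperature", "units": "Celsius"})
print(ncml)
```

The generated fragment can then be referenced from a TDS catalog, so the original file is never rewritten.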
The TDS is a middleware that bridges the gap between data providers and data users by
delivering standardized metadata and data in a variety of standard services. The services
allow users to access and use data in a common and efficient way, allowing extraction of
just the data they need, without downloading entire datasets. This is particularly
important as datasets from numerical simulations of the atmosphere and ocean are
growing to hundreds of gigabytes, terabytes or even petabytes.
The adopted interoperability solution for CDM and TDS is shown in Figure 1.
The final aim is that of making the use of scientific data simple, allowing a more
efficient exchange of scientific information that is of high interest in several sectors of
importance. In addition to web services that deliver data, it is necessary to have web
services that allow users to find the data. The TDS ncISO service (ncISO is a package of
tools that facilitates the generation of ISO 19115 metadata from NetCDF data sources
stored in a TDS catalog) can be harvested by catalogue broker services [8] such as GI-CAT,
Geoportal Server (see Figure 1a), and Geonetwork [9]. These catalogue services internally
store the gathered information in a metadata common data model (ISO 19115-2), which
in turn can be accessed using standardized queries for data discovery and access such as
OpenSearch or OGC Catalog Services for the Web (CSW). This means the datasets along
with their data services could easily be picked up and used by geosciences integration
efforts like US-IOOS, INSPIRE or GEOSS.
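The OpenSearch side of this discovery chain boils down to filling a URL template published by the broker. A minimal sketch follows; the endpoint and template here are hypothetical, while the placeholder names ({searchTerms}, {count}, {startIndex}) come from the OpenSearch specification:

```python
# Sketch: fill an OpenSearch URL template as a catalogue client might.
from urllib.parse import quote

# Hypothetical template, as advertised in a broker's description document.
TEMPLATE = ("http://broker.example.org/opensearch?"
            "q={searchTerms}&count={count}&start={startIndex}")

def fill_opensearch_template(template, search_terms, count=10, start_index=0):
    """Substitute OpenSearch placeholders, URL-encoding the query terms."""
    return (template
            .replace("{searchTerms}", quote(search_terms))
            .replace("{count}", str(count))
            .replace("{startIndex}", str(start_index)))

url = fill_opensearch_template(TEMPLATE, "adriatic sea temperature", count=25)
print(url)
```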
The brokering approach can further reduce the demands on data providers, by letting
them provide any of a number of standard services, and independently providing the
Figure 1. The adopted interoperability solution for CDM and TDS.


Panel a: the user issues a query to a catalog broker service (like GI-CAT) that has harvested
metadata from TDS servers, and stored them in a common data model (CDM) for metadata
(in a database). The catalog broker service then returns the metadata for datasets that meet the user’s
query constraints.
Panel b: the CDM is contained within the TDS, which provides standard data and metadata services.
The metadata contains links to the actual standard data services such as OPeNDAP, WCS, etc.,
so that the user's application can immediately start doing useful things with the results.
translation into and out of the common data model. While the catalogue services are
metadata brokers, the THREDDS Data Server is a data broker (Figure 1b), allowing
many different formats of files, as well as OPeNDAP service datasets, to be transformed
into a common data model for actual arrays of data.
In addition to allowing non-conforming data to be virtually transformed into a
common data model, the TDS also has another important characteristic that makes things
easier for data providers and users: aggregation. This means that many individual files on
disk can be virtually joined into a single dataset accessible through the web services. Thus
oceanic and atmospheric model output, as well as remote sensing data, which are typically
present on a file system as numerous smaller files, can be accessed via a single OPeNDAP
or WCS URL. The TDS is simple to install, as it is a 100% Java servlet typically deployed
on Tomcat. A provider simply verifies that they have Sun Java, downloads and unzips
Tomcat, and then deploys the thredds.war file through the Tomcat GUI, a process that
typically takes less than one hour, and sometimes as little as 10 minutes. Configuring the
TDS for local datasets, of course, takes longer, but is still straightforward.
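The aggregation just described is configured with a short NcML fragment. The following Python sketch generates a "joinExisting" aggregation; the scan directory and time-dimension name are illustrative placeholders, not taken from an actual CNR-ISMAR configuration:

```python
# Sketch: an NcML "joinExisting" aggregation that presents many files
# in one directory as a single virtual dataset.
import xml.etree.ElementTree as ET

NCML = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

def ncml_join_existing(scan_location, dim_name, suffix=".nc"):
    """Build an NcML aggregation element joining all files matching
    `suffix` under `scan_location` along the existing dimension `dim_name`."""
    root = ET.Element(f"{{{NCML}}}netcdf")
    agg = ET.SubElement(root, f"{{{NCML}}}aggregation",
                        {"dimName": dim_name, "type": "joinExisting"})
    ET.SubElement(agg, f"{{{NCML}}}scan",
                  {"location": scan_location, "suffix": suffix})
    return ET.tostring(root, encoding="unicode")

agg_ncml = ncml_join_existing("/data/roms/adriatic/", "ocean_time")
print(agg_ncml)
```

Placed in the TDS catalog configuration, a fragment like this is what lets numerous smaller model-output files appear behind one OPeNDAP or WCS URL.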
The CNR-ISMAR Venice catalogue (http://tds.ve.ismar.cnr.it:8080/thredds/catalog.html)
represents one of the first Italian community examples taking the IOOS
approach. Once a user has selected an archived dataset, a description of the dataset (metadata)
appears, along with available web services and data viewers. At the moment, the CNR-
ISMAR Venice catalogue contains datasets from several different implementations of the
coupled hydrodynamic-wave-sediment model ROMS (www.myroms.org). Datasets are
regularly updated, and include cases from different geographic areas (e.g., the Adriatic
Sea and the Gulf of Lyon). For a thorough description of these test cases, see [1,10].

3. Advantages for the users


Once connected to a dataset on TDS, users can extract just the data they need using
a variety of methods. One popular method is to access data directly from MATLAB,
a common tool for scientific analysis and visualization. By using the NCTOOLBOX for
MATLAB (http://code.google.com/p/nctoolbox/) users can extract the data, have them
in their MATLAB workspace, and easily plot them or perform a more in-depth analysis.
Python users can use OPeNDAP access tools contained in packages such as NetCDF4-
Python (http://code.google.com/p/netcdf4-python/).
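Under the hood, such tools request only the needed hyperslab by appending a DAP2 constraint expression to the dataset URL. A minimal sketch of building such a request follows; the server URL and the variable's dimension layout (time, depth, lat, lon) are hypothetical, though the /thredds/dodsC/ path style is the usual TDS convention:

```python
# Sketch: build a DAP2 constraint expression requesting a subset of one
# variable, as OPeNDAP client libraries do under the hood.

def dap2_constraint(base_url, var_name, slices):
    """Return an OPeNDAP URL restricting var_name to the given
    (start, stride, stop) index triples, one per dimension (inclusive)."""
    hyperslab = "".join(f"[{a}:{b}:{c}]" for a, b, c in slices)
    return f"{base_url}?{var_name}{hyperslab}"

# Temperature at the first time step and one depth level, over a
# 100 x 100 index window:
url = dap2_constraint(
    "http://tds.example.org/thredds/dodsC/roms/adriatic.nc",
    "temp",
    [(0, 1, 0), (29, 1, 29), (0, 1, 99), (0, 1, 99)])
print(url)
```

Only the requested slice crosses the network, which is what makes working with multi-terabyte archives practical over ordinary connections.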
Other users may be more interested in browsing the results than in working directly
with arrays of data. Personnel at regional environmental protection agencies
(such as ARPA in Italy) may need to check the results of complex and integrated
numerical models, for instance forecasting the wave height or current regimes off
the Venice littoral zone. Instead of asking a research centre for dedicated production
or posting of these data, the user can connect directly to the TDS and quickly browse the
data using the viewer link for GODIVA2 [11]. GODIVA2 is an OGC Web Map Service
(WMS) client that allows users to easily create maps or animations that rely on images
generated by the TDS WMS service. This combination of quick browsing and efficient access
not only saves time for the end user, but allows for more effective use of the data.
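Behind each map it draws, GODIVA2 issues a standard WMS GetMap request to the TDS. The following sketch assembles such a request with Python's standard library; the server URL and layer name are hypothetical, while the query parameters follow WMS 1.1.1:

```python
# Sketch: assemble a WMS 1.1.1 GetMap URL like the ones a browsing
# client sends to the TDS WMS service.
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, width=512, height=512,
                   time=None, fmt="image/png"):
    """Build a GetMap URL for one layer over a lon/lat bounding box."""
    params = {
        "SERVICE": "WMS", "VERSION": "1.1.1", "REQUEST": "GetMap",
        "LAYERS": layer, "STYLES": "", "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width, "HEIGHT": height, "FORMAT": fmt,
    }
    if time is not None:
        params["TIME"] = time  # WMS time dimension, ISO 8601
    return base_url + "?" + urlencode(params)

url = wms_getmap_url(
    "http://tds.example.org/thredds/wms/roms/adriatic.nc",
    "temp", bbox=(12.0, 43.5, 16.0, 46.0), time="2007-07-03T00:00:00Z")
print(url)
```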
Figure 2 presents an example obtained from the above-mentioned catalogue under
the ''FIELD_AC test case'', which stores outputs resulting from an implementation of the
hydrodynamic numerical model ROMS coupled to the wave model SWAN in the Adriatic
Sea. This hindcast was carried out at 500 m resolution, spanning the year 2007, with both
Figure 2. Using the quick and intuitive GODIVA2 web map service client, users can pick
the ocean model variables they wish to visualize from a TDS catalog. After arranging a view in terms
of latitude, longitude, depth and time, the GODIVA2 service allows for mapping, drawing sections and
producing time animations. Shown here is the sea surface temperature from a high-resolution run
(500 m) of the northern Adriatic Sea using the ROMS-SWAN model, referring to July 3, 2007.

models forced with high-resolution atmospheric forcing provided by the model COSMO-
I7. Further details on the numerical implementations are given in [12].
Using GODIVA2 (which can be activated directly by clicking in the bottom area of the
metadata page that pops up), we can for instance visualize the field of potential temperature
at the model top level. Moreover, there is also the possibility of using embedded features
that take care of exporting the results to GoogleTM Earth maps, as shown in Figure 3.
When it is necessary to combine layers or create more complex visualizations, the user can
access the data using the Unidata Integrated Data Viewer (IDV), freely available at
http://www.unidata.ucar.edu/software/idv/, as shown in Figure 4.
Making things easier and more efficient for users can be particularly important during
emergency response situations. After the Fukushima incident, the US Navy rapidly spun
up a 1 km NCOM forecast model covering hundreds of km offshore of the Sendai power
plant. The model forecasts were officially made public as NetCDF3 files, one file for
each forecast time, packed into 9 GB tar.gz files delivered on an FTP site at NCEP.
This made the forecast data effectively inaccessible to researchers at sea with limited
bandwidth. To facilitate use, the 9 GB files were transferred to a THREDDS Data Server,
converted to NetCDF4 (which resulted in a total file size the same as the tar.gz file), and
Figure 3. The same example shown in Figure 2, exported now to GoogleTM Earth mapping.

Figure 4. Using the open-source software IDV, more complex images can be arranged, such as this
3D view of the northern Adriatic topography (in orange) with the sea surface temperature field
from a high-resolution run (500 m) of the northern Adriatic Sea using the ROMS-SWAN model
superimposed, referring to May 28, 2007. The figure also shows averaged (2D) velocity fields (black
arrows, plotted every 10th grid point) and contours of significant wave height (m).
Figure 5. Using NCTOOLBOX to access data from the TDS via OPeNDAP, displaying surface
current vectors and speed for the Fukushima region. The bottom panel shows the actual MATLAB
script that acquires and subsamples the data and produces the plot at the top.

virtually turned into a single CF-compliant dataset available via the TDS, allowing
distribution through OPeNDAP, WCS, and WMS services. This allowed efficient
subsetting and extraction by WHOI researchers at sea, who used the forecast data to
predict the movement of radioactive material they identified in surface water samples
(see Figure 5). The metadata service also allowed others searching for Fukushima products
to effectively locate this new dataset once it came online [13].
4. Conclusions and recommendations


Following the US IOOS approach, the Italian community is advancing toward a
''Euro-Mediterranean IOOS''. By taking an approach focused on minimizing the burden for
both providers and consumers of data, great strides are being made to overcome the
existing bottlenecks in the distribution and use of observational and modelled ocean data.
The TDS represents a free and supported solution that now allows data providers
(e.g. numerical modellers) to serve the data they produce, with no need for modification,
via standard web services. Since the TDS metadata service can be harvested by catalogue
brokering systems, data users can easily access the standardized data using standardized
queries for data discovery (e.g. OpenSearch, OGC Catalog Services, etc.). After deciding
which time/space portion has to be transferred, they can select among a variety of tools,
including 3D open-source viewers, to promptly visualize or carefully examine the data.
The success of this approach does require some expert knowledge on the part of the
person configuring the TDS. In particular, the standardization and aggregation of existing
files requires learning the NcML language and understanding the Common Data Model
requirements. However, this configuration is mostly a one-time effort, so that an expert
can assist in the initial implementation, and then local instances of the TDS can be
maintained by personnel without this detailed knowledge.
To continue to build on this success, we need to address a few challenges. While this
approach works well for structured grids, standards are just now being introduced for
unstructured grids. Standard handling of staggered grid information and velocity vectors
needs to be improved. Finally, issues related to the management and dissemination of
public data (i.e. an adequate data policy) must be faced when unlocking large model
outputs.
The brokering approach to harvest metadata from many different services and to read
data from many different formats into common data models greatly improves model
access and interoperability, unlocking information from other fields (e.g., social and
economic studies). These are very desirable properties in the direction of an ''INSPIRE-compliant
web service'' (see also http://inspire.jrc.ec.europa.eu), since they help to
lower the so-called ''Users and Data Producers entry barriers'' [8].

Acknowledgements
The authors thank Unidata for the technical support and help. This work was supported by the
Project ‘‘MARINA’’, funded by Regione Veneto within the initiatives of the law n. 15/2007. The
activity was partially supported by Projects PRIN 2008YNPNT9_005 and FIRB ‘‘DECALOGO’’
(code #RBFR08D825) and by the Project FIELD_AC, funded by the EC Fp7/2007–2013 under
grant agreement no. 242284.

References

[1] S. Carniel, M. Sclavo, and R. Archetti, Towards validating a last generation, integrated wave-
current-sediment numerical model in coastal regions using video measurements, Oceanological and
Hydrobiol. Studies 40 (2011), pp. 11–20, DOI: 10.2478/s13545-011-0036-1.
[2] D. Bellafiore, E. Bucchignani, S. Gualdi, S. Carniel, V. Djurdjevic, and G. Umgiesser, Assessment
of meteorological climate model inputs for coastal hydrodynamics modeling. Ocean Dyn. 62 (2012),
pp. 555–568, DOI: 10.1007/s10236-011-0508-2.
[3] R.P. Signell, S. Carniel, J. Chiggiato, I. Janecovic, J. Pullen, and C. Sherwood, Collaboration
tools and techniques for large model datasets, J. Marine Sys. 65 (2008), pp. 154–161, DOI:
10.1016/j.jmarsys.2007.02.013.
[4] R.P. Signell, Model data interoperability for the United States Integrated Ocean Observing
System (IOOS), Proceedings of the 11th International Conference on Estuarine and Coastal
Modeling, Seattle, WA, USA, 2010. DOI:10.1061/41121(388)14.
[5] S. Rayner, The U.S. Integrated Ocean Observing System in a global context, Marine Tech. Soc. J.
44 (2010), pp. 26–31, DOI:10.4031/MTSJ.44.6.1.
[6] A. Bergamasco, S. Carniel, M. Sclavo, and T. Minuzzo, From interoperability to knowledge
discovery using large model datasets in the marine environment: the THREDDS Data Server
example. Data Flow from Space to Earth 2011 International Conference, Venice, 21–23 March
2011. Available at http://www.space.corila.it/Program.htm
[7] J. de La Beaujardiere, C.J. Beegle-Krause, L. Bermudez, S. Hankin, L. Hazard, E. Howlett,
S. Le, R. Proctor, R.P. Signell, D. Snowden, and J. Thomas, Ocean and coastal data
management, Proc. OceanObs’09: Sustained Ocean Observations and Information for Society
(Vol. 2), Venice, Italy, 21–25 September 2009.


[8] S. Nativi, S. Khalsa, B. Domenico, M. Craglia, J. Pearlman, P. Mazzetti, and R. Rew,
The Brokering approach for Earth Science Cyberinfrastructure. EarthCube White Paper,
Oct 2011. Available at http://earthcube.ning.com/page/whitepapers.
[9] S. Nativi, S. Bigagli, P. Mazzetti, E. Boldrini, and F. Papeschi, GI-cat: a mediation solution
for building a clearinghouse catalog service. Advanced Geographic Information Systems & Web
Services, 2009. GEOWS'09. DOI: 10.1109/GEOWS.2009.34.
[10] A. Boldrin, S. Carniel, M. Giani, M. Marini, F. Bernardi Aubry, A. Campanelli, F. Grilli, and
A. Russo, The effect of Bora wind on physical and bio-chemical properties of stratified waters in
the Northern Adriatic, J. Geophys. Res. – Ocean 114 (2009), p. C08S92, DOI: 10.1029/
2008JC004837.
[11] J.D. Blower, K. Haines, A. Santokhee, and C.L. Liu, GODIVA2: interactive visualization of
environmental data on the Web, Phil. Trans. R. Soc. A 367 (1890) (2009), pp. 1035–1039, DOI:
10.1098/rsta.2008.0180.
[12] A. Benetazzo, A. Bergamasco, S. Carniel, A. Russo, and M. Sclavo, CNR-ISMAR contribution
in the ‘‘FIELD_AC’’ Project: the Gulf of Venice study case. Data Flow from Space to Earth 2011
International Conference, Venice, 21–23 March 2011. Available at http://www.space.corila.it/
Program.htm
[13] D. Neufeld and R.P. Signell, Case study Fukushima: open source data discovery in disaster
management, FOSS4G Conference, Denver, CO, USA, 2011. Available at
http://2011.foss4g.org/sessions/case-study-fukushima-open-source-data-discovery-disaster-management
