
Remote Sensing of Environment 202 (2017) 276–292


The Australian Geoscience Data Cube — Foundations and lessons learned


Adam Lewis a,⁎, Simon Oliver a, Leo Lymburner a, Ben Evans b, Lesley Wyborn b, Norman Mueller a,
Gregory Raevksi a, Jeremy Hooke a, Rob Woodcock c, Joshua Sixsmith a, Wenjun Wu a, Peter Tan a, Fuqin Li a,
Brian Killough d, Stuart Minchin a, Dale Roberts a,b, Damien Ayers a, Biswajit Bala a, John Dwyer e,
Arnold Dekker c, Trevor Dhu a, Andrew Hicks a, Alex Ip a, Matt Purss a, Clare Richards b, Stephen Sagar a,
Claire Trenham b, Peter Wang c, Lan-Wei Wang a
a Geoscience Australia, GPO Box 378, Canberra, ACT 2601, Australia
b Australian National University, Canberra, ACT 2601, Australia
c Commonwealth Scientific Industrial Research Organisation (CSIRO), GPO Box 1700, Canberra, ACT 2600, Australia
d NASA Langley Research Center, 1 Nasa Dr, Hampton, VA 23666, United States
e U.S. Geological Survey Earth Resources Observation and Science (EROS) Center, 47914 252nd Street, Sioux Falls, SD 57198, United States

Article info

Article history:
Received 27 July 2016
Received in revised form 7 February 2017
Accepted 12 March 2017
Available online 12 April 2017

Keywords:
Landsat
Time-series
Big data
Data cube
High performance computing
High performance data
Collection management
Geometric correction
Pixel quality
Australian Geoscience Data Cube

Abstract

The Australian Geoscience Data Cube (AGDC) aims to realise the full potential of Earth observation data holdings by addressing the Big Data challenges of volume, velocity, and variety that otherwise limit the usefulness of Earth observation data. There have been several iterations and AGDC version 2 is a major advance on previous work. The foundations and core components of the AGDC are: (1) data preparation, including geometric and radiometric corrections to Earth observation data to produce standardised surface reflectance measurements that support time-series analysis, and collection management systems which track the provenance of each Data Cube product and formalise re-processing decisions; (2) the software environment used to manage and interact with the data; and (3) the supporting high performance computing environment provided by the Australian National Computational Infrastructure (NCI).

A growing number of examples demonstrate that our data cube approach allows analysts to extract rich new information from Earth observation time series, including through new methods that draw on the full spatial and temporal coverage of the Earth observation archives. To enable easy uptake of the AGDC, and to facilitate future cooperative development, our code is developed under an open-source Apache License, Version 2.0. This open-source approach is enabling other organisations, including the Committee on Earth Observing Satellites (CEOS), to explore the use of similar data cubes in developing countries.
Crown Copyright © 2017 Published by Elsevier Inc. This is an open access article under the CC BY license (http://
creativecommons.org/licenses/by/4.0/).

1. Introduction: A vision for the Australian Geoscience Data Cube

In this paper we describe the Australian Geoscience Data Cube (AGDC), a 'Big Data' infrastructure that aims to realise the full potential of Earth observation data holdings for Australia. The AGDC is a collaborative initiative of Geoscience Australia, the National Computational Infrastructure (NCI), and the Australian Commonwealth Scientific Industrial Research Organisation (CSIRO). The AGDC was developed over several years as we sought to maximise the impact of land surface image archives that date from Australia's first participation in the Landsat program in 1979. The AGDC is demonstrating, through a growing number of example applications, an unprecedented ability to leverage Earth observations to support research and operational users. The AGDC is also demonstrating an architecture that can allow the remote sensing community to improve access to, manage, and take full advantage of increasing volumes of Earth observation data.

The characteristics of the AGDC are high quality calibration of satellite observations, the use of basic measurements (notably standardised surface reflectance for optical data), accurate geo-location, quality assessment and pixel level quality flags, structuring of data to facilitate analysis including time-series analysis, and the use of scientific file formats to facilitate efficient computation and exploratory data analysis. These characteristics allow automated workflows to be developed which produce continental-scale products drawing on the full time-series of available data.

The AGDC vision is of a 'Digital Earth' (Craglia et al., 2012), composed of observations of the Earth's oceans, surface and subsurface taken through space and time and stored in a high performance computing environment. A fully developed AGDC would allow governments, scientists and the public to monitor, analyse and project the state of the Earth, and will realise the full value of large Earth observation datasets by

⁎ Corresponding author. E-mail address: Adam.Lewis@ga.gov.au (A. Lewis).
http://dx.doi.org/10.1016/j.rse.2017.03.015

allowing rapid and repeatable continental-scale analyses of Earth properties through space and time.

This paper builds upon the initial concept and results that were developed as AGDC 'version one' (AGDCv1) and described in Lewis et al. (2016). It then presents the fundamental changes that we have made since then, responding to user feedback, computational challenges and lessons learnt, to build AGDC 'version two' (AGDCv2). In this paper we:

1. Describe the challenges that called for radical changes to traditional models of processing and analysis of Earth observation data, leading us to build the AGDC;
2. Define the foundations and core elements of the AGDC, including our approach to data preparation and collection management, the software environment used to manage and interact with the data, and the supporting high performance computing/high performance data environment;
3. Outline some examples that, in our view, demonstrate the benefits of our approach through the ability to rapidly complete analyses and produce products that draw on tens or hundreds of thousands of images without manual intervention;
4. Discuss the international uptake and potential relevance of the AGDC; and
5. Summarise general learnings from our AGDC journey and outline some important future directions.

2. The challenge

The Earth and its systems are complex and interconnected, and gaining a comprehensive understanding of these systems and how we interact with them is of critical national and international interest (Boyd and Crawford, 2012; Frankel and Reid, 2008; Hart and Saunders, 1997; Lynch, 2008).

Although many sensors are used to acquire data about the Earth and its systems, constellations of Earth observation satellites (EOS) may be the single most important source thereof (Australian Academy of Science, 2009). Satellites capture observations of diverse physical phenomena such as crustal deformation, soil mineralogy, vegetation and surface water conditions, sea surface temperature, and magnetic and gravimetric field intensities, over the entire globe and over long periods of time (decades). Of these constellations, the Landsat Program, operated by the United States Geological Survey (USGS) and NASA, represents the longest, most continuous, openly available Earth observation program in the world (Arvidson et al., 2001). Landsat missions have acquired moderate resolution multispectral data for over 40 years. Systematic analysis of these archives provides the capability to compare the changes that are now occurring on the Earth's surface with those that have occurred in the past.

However, extracting information from EOS data holdings poses an enormous technological challenge due to the volume and variety of the data. Satellite Earth observation data collections are 'Big Data'. The Oxford Dictionary defines 'Big Data' as "Datasets that are too large and complex to manipulate or interrogate with standard methods or tools". The term was first described by Laney (2001) as "the three V's: Volume, Velocity and Variety". Others have extended this list to include validity, veracity, value, and visibility. Historical sensor archives represent a 'volume and variety' challenge; however, the next generation of meteorological sensors now available (such as the GOES-R Advanced Baseline Imager (ABI), EUMETSAT (Met-10), and the Himawari-8 Advanced Himawari Imager (AHI)) means that Earth observation data are now facing a 'velocity' challenge too. Satellite Earth observation data are often highly structured and stored as large to very large binary data files, each of which may contain gigabytes or even terabytes of data. This 'Big Earth Data' is massive, diverse and growing at an increasing rate (Overpeck et al., 2011).

Users of satellite Earth observation data therefore face challenges of volume, as archives of data grow; of variety, as instruments produce finer resolution observations that must be related to existing archives to produce an on-going and consistent record; and of velocity, as the intervals between observations reduce from weeks to days, or from hours to minutes in the case of geostationary satellites. According to the international Committee on Earth Observation Satellites (CEOS) the number of Earth observation satellites grew from 12 in 1980 to over 69 operational EOS missions in 2014, with 137 missions planned for launch between 2015 and 2030 (CEOS, 2014). Considering only a handful of these missions, e.g., Landsat-8, Sentinel-1, -2, -3, and Himawari-8/9, we should expect at least a 20-fold increase in the global volumes of raw data acquired over a 15 year period. Our calculations of data growth for Australia are shown in Fig. 1. Adding processed data products derived from this raw data will lead to three to five times greater data volumes, consistent with the assessment of Overpeck et al. (2011).

New approaches are required for the management, analysis and distribution of Earth observation data and products in order to meet these big data challenges, which are compounded by growing expectations that Earth observation data streams will support the delivery of operational products as well as the on-going development of scientific knowledge (Mattmann, 2013). This challenge has been recognised for some time (Jensen, 1986; Schott, 2007; Schowengerdt, 2006). Furthermore, new Earth observation data analysis methods are emerging which demand real-time access to full archives of moderate resolution Earth observation data. These include:

• The application of time-series analysis techniques to accurately detect change (Griffiths et al., 2014; Kennedy et al., 2010; Masek et al., 2013; Zhu and Woodcock, 2014);
• The systematic characterisation of a particular cover type across multiple decades (Masek et al., 2008; Mueller et al., 2016; Sexton et al., 2013);
• The use of 'best available pixel' composites to overcome the challenges posed by heavily cloud affected settings (Hermosilla et al., 2015; Thompson et al., 2015; Zald et al., 2016); and
• The use of time series of Earth observation data and derived products as Essential Climate Variables (ECVs) (Hollmann et al., 2013) and Climate Data Records (Kim et al., 2015).

Fig. 1. The estimated volumes of EOS data produced by the Landsat-8, Sentinel-1, -2, -3 and Himawari-8/9 missions from 2014 to 2029 for Australia. Only 'raw' data are considered. Data volume estimates are based on CEOS (2014).

The challenge of manipulating continental/global archives of moderate resolution Earth observation data to enable these analyses is

internationally recognised. For example, the European Parliament (2016) considered a new approach to the dissemination of data from the Copernicus system, and determined "a properly integrated infrastructure system" to be essential for industry to be able to take full advantage of these new data sources. The private and research sectors have also recognised the challenges posed by 'big Earth observation data' with approaches such as Google Earth Engine™, Hexagon's M.appX™, Web Enabled Landsat Data (WELD) (Roy et al., 2010), Rasdaman (Baumann et al., 1998), EarthServer (Baumann et al., 2016) and the 'Data Rods' approach (https://disc.gsfc.nasa.gov/hydrology/data-rods-time-series-data) being developed. It is beyond the scope of this paper to provide a comprehensive comparison of these approaches. However, the AGDC has been specifically developed with a view to governments, decision makers and researchers having access to Earth observation data analysis platforms that are transparent (open source code) and traceable (with known data provenance). The use of open source code ensures that users have control over the ongoing availability of their data analysis platform.

A further challenge is that large collections of satellite Earth observation data are often dynamic, with new data being continuously added and/or existing data being updated (e.g., as calibration algorithms improve over time, errors are found in existing data, etc.). Data infrastructures are needed that can manage such flexible collections, for example maintaining sufficient data provenance that higher-level data products can be traced back to the original observation. These challenges require new approaches in the way we organise, store, manage and analyse these data: traditional approaches will not be sufficient (Mattmann, 2013).

Finally, as we start to harness the full spatial-temporal-spectral archive of Earth observation data, combining measurements and ancillary data from various instruments, a number of statistical challenges arise due to high-dimensionality and massive sample sizes. Phenomena that are well-known in the recent statistics literature such as noise accumulation, spurious correlation, incidental endogeneity and measurement errors (Fan et al., 2014) are all present in these Earth observation datasets. This leads to a number of critical issues that need to be addressed, as the use of classical (small data) methods may lead to erroneous statistical inferences and poor conclusions.

3. Foundations and core elements

In this part of the paper we describe the foundations of the AGDC, that is, the primary components that we have developed in order to unlock the potential of large Earth observation data stores and which we see as fundamental to future success. We also describe how our approach to these foundations has evolved as we have continued to develop the Data Cube from the preliminary forms described in Lewis et al. (2016) to the current form, which we refer to as 'AGDCv2'.

The Data Cube was originally implemented as regular, non-overlapping 'tiles' of gridded sensor data, 'stacked' according to the time of data capture (observation), leading to the visual metaphor of a 'cube' of data (Fig. 2). For Version 1 of the AGDC we chose a 4000 × 4000 array of pixels to be a complete tile. Expressed in geographic coordinates, the tiles were 1 degree by 1 degree and each pixel 0.00025 degrees on a side. In Version 2 of the AGDC, the tile size and the pixel size are parameters set at the time of data ingest. Ingestion configures the data for storage, effectively making an AGDCv2 formatted dataset. The ability to configure ingest parameters means that AGDCv2 has increased flexibility to support gridded observations of varying resolutions from various sources.

AGDCv2 also has the ability to index, rather than ingest, a dataset. The indexing process adds information to the Data Cube database which enables AGDCv2 functions to operate directly on the native data. These data can thus remain in their original configuration and be tailored during access to the user's need. The indexing approach simplifies addition of new sensor data whilst allowing those datasets to be accessed seamlessly via the API. Indexing is useful where it is undesirable to restructure the data, for example to avoid replication of large datasets (e.g., Himawari-8 AHI) or for datasets that do not form dense time-series, such as elevation models.

A key enabler of the AGDC has been the National Computational Infrastructure (NCI) at the Australian National University, a petascale High Performance Computing (HPC) facility. Given our expectations of rapid analysis of data over all of the Australian continent and for the full time length of the available datasets, it was clear that local desktop computers and high-end on-premise servers were no longer adequate. However, within this HPC environment AGDCv2 also progressed substantially from AGDCv1 in the specification of 'high performance data' (HPD), defined by Evans et al. (2015) as 'data that is carefully prepared, standardised and structured so that it can be used in Data-intensive Science on HPC'.

The foundations of the AGDCv2 are thus (1) data preparation and collection management protocols, (2) the software environment used to manage and interact with the data, and (3) the supporting, integrated HPC-HPD environment including the file structures and architectures to store and access hundreds of terabytes, and ultimately petabytes, of Earth observation data at full resolution and over long time scales.

3.1. Data and collection management protocols

The value of the AGDCv2 lies in the ability to work seamlessly with entire petabyte-scale satellite image archives - observations that cover decades of time and entire continents. AGDCv2 currently includes some 300,000 images from Landsat satellites collected continuously since 1986. Each Landsat scene includes approximately 36 × 10^6 pixels, each of which is a potentially useful observation, and each pixel holds a data value for each image band. AGDCv2 therefore holds some 10^13 potential measurements from Landsat of Australia's land and water surfaces, and provides the capability to conduct investigations across space and through time in ways that were previously impractical.

Whilst much of the data in the AGDCv2 is from the Landsat satellite series, and we describe this, the AGDCv2 also supports secondary measurements derived from surface reflectance and other data sources, such as 'fractional cover' (Scarth et al., 2010) which estimates the ground cover within each pixel.

Comparing observations through space and time requires a shift from 'raw' data collected by the sensor to calibrated, comparable measurements of the Earth's surface (Hilker et al., 2009). Comparable measurements are therefore an essential foundation for the AGDCv2.

As our work has progressed, we have increasingly focussed on the need for a solid data foundation for the AGDCv2. Several key protocols have emerged (Fig. 3): Firstly, geometric corrections ensure spatial alignment so that pixels of data can be 'stacked' as time-series of observations (3.1.1). Secondly, radiometric corrections convert satellite image pixels to gridded measurements of normalised surface reflectance, allowing measurements from different places and times to be validly compared (3.1.2). Quality assessments then produce quality flags and accuracy measures applied at both the dataset and pixel level, providing users with sufficient metadata to allow them to decide which specific observations are fit for any given purpose. The data are then spatially partitioned into tiles and packaged into netCDF files to be included in the Data Cube. Finally, products are derived, such as Fractional Cover. Products may also be 'ingested' back into the AGDCv2 for use in other analyses.

In order to ensure that scientific transparency and repeatability can be retained through numerous data preparation steps, as described in Fig. 3, we have developed more sophisticated protocols for collection management.

3.1.1. Spatial alignment (geometric accuracy) - comparable in space

Geometric corrections accurately locate pixels in space, allowing comparability between observations captured at different times, and

Fig. 2. The AGDC data cube concept (after Lewis et al., 2016). Landsat scenes are re-formatted as spatially consistent tiles of data. The spatial footprint of Landsat scenes changes over time,
whilst the tiled datasets maintain a constant footprint. Area shown is Brooks Island in Lake Eyre, central Australia, in 2009.
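The AGDCv1 tiling shown in Fig. 2 (1° × 1° tiles of 4000 × 4000 pixels, i.e. 0.00025° per pixel) implies a simple mapping from a geographic coordinate to a tile index and a pixel offset within that tile. The sketch below is our own illustration of that arithmetic; the function names are not part of the AGDC codebase:

```python
import math

TILE_DEG = 1.0          # AGDCv1 tile extent (degrees)
PIXELS_PER_TILE = 4000  # tile width/height in pixels
PIXEL_DEG = TILE_DEG / PIXELS_PER_TILE  # 0.00025 degrees per pixel

def tile_index(lon, lat):
    """(x, y) index of the 1-degree tile containing the point."""
    return math.floor(lon), math.floor(lat)

def pixel_offset(lon, lat):
    """(col, row) offset of the point within its tile, with row 0 at the
    tile's northern edge (image convention)."""
    tx, ty = tile_index(lon, lat)
    col = int((lon - tx) / PIXEL_DEG)
    row = int((ty + TILE_DEG - lat) / PIXEL_DEG)
    return col, row

# A point near Lake Eyre, central Australia, falls in tile (137, -29).
print(tile_index(137.3, -28.5))
```

In AGDCv2 the tile and pixel sizes are ingest-time parameters, so in practice these constants would come from the product's ingest configuration rather than being hard-coded.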

data from different sources to be combined in analyses. Relative accuracy can be achieved through pixel-to-pixel registration of images to ensure that the same target is represented by a single time series (Zhang et al., 2003). However, absolute spatial accuracy is necessary to enable comparison between data from different sensors and sources, such as using elevation models to apply terrain illumination corrections to imagery (Teillet et al., 1982).

Procedures for spatial correction of satellite imagery are well established. For Landsat these are described in Shlien (1979). Geometric corrections of Landsat data are implemented in the Level-1 Product Generation System (LPGS) provided by the USGS for International Landsat Cooperators (http://landsat.usgs.gov/Landsat_Processing_Details.php). The LPGS processes the raw satellite telemetry data, collected as paths, to produce gridded images. Systematic radiometric and geometric corrections are applied using data collected by the sensor and spacecraft to model the spacecraft sensor position and attitude (Gruen and Zhang, 2002; Hattori et al., 2000; Poli et al., 2004), providing a nominal geometric accuracy for low-relief areas at sea level. These are known as L1G products. The geometric accuracy of these 'systematic' products varies depending on the platform and sensor, from tens of pixels for Landsat 5 Thematic Mapper (TM) and Landsat 7 Enhanced Thematic Mapper Plus (ETM+) to single-pixel for Landsat-8 Operational Land Imager (OLI).

Further corrections are applied if ancillary data are available to reference the data to known locations (ground control points), allowing the spacecraft model to be refined, and to correct for terrain distortions, leading to L1T products with positional accuracy similar to that of the ground control points. The LPGS uses the Global Land Surveys (GLS) datasets (Gutman et al., 2008) as control points. The Australian Geographic Reference Image (AGRI) (Rottensteiner et al., 2009; Lewis et al., 2011; Ravanbakhsh et al., 2012) was developed to support improvements in GLS-2000 and to thereby ensure the absolute spatial accuracy of Landsat imagery for Australia. AGRI provides a spatially correct reference image at a 2.5-metre resolution across Australia, allowing improvement in the control point dataset used in Landsat processing, as illustrated in Fig. 4.

To assess the accuracy of Landsat L1G data products over all of Australia we resampled the AGRI dataset and compared it to the 15 m panchromatic band from the GLS dataset. Normalised cross correlation was used to match points between the reference AGRI image and the GLS image. We were thus able to identify those regions of Australia where large errors are indicated in the GLS-2000 dataset, and to work with the United States Geological Survey to improve accuracy in those places.

Geometric correction is a prerequisite to radiometric correction as the latter requires ancillary data, which must be combined with the image data without introducing errors due to positional mismatching. Geometric accuracy is most critical when applying terrain-illumination corrections, which use slope and aspect to modify the reflectance estimates at each pixel, allowing for irradiance variations at each pixel

and for bidirectional reflectance distribution function (BRDF) effects due to the inclination of the terrain (Li et al., 2012). We have found that slope and aspect must be estimated at a resolution comparable to the imagery, and with sub-pixel locational accuracy, in order to avoid artefacts in the terrain corrections.

To provide a further check on the geometric correction accuracy of Landsat data in AGDCv2, we include the results of an image matching algorithm (gVerify), which provides a measure of the circular error at the 90th percentile (ce90) for match results against the geometric control source (GLS). The Geometric Quality Assessment (GQA) allows a user to set a threshold to exclude data with low geometric accuracy from an analysis, e.g., ce90 < 0.5 (pixels). GQA allows us to retain in AGDCv2 data that would previously have been excluded due to failure to achieve a given level of geometric accuracy. Data with low GQA scores are not discarded because different analyses will have different tolerances for geometric accuracy.

Fig. 3. The process flow from raw data to products from the Data Cube.

3.1.2. Radiometric correction - comparability of measurements in time and space

Radiometric correction aims to produce consistent and comparable measurements from satellite images by correcting for a range of variations in observing conditions (Schott et al., 1988). Radiometric correction is therefore a prerequisite for systematic detection of change over time (Du et al., 2002) and large-scale automated analyses (Masek et al., 2008). Radiometric consistency between observations from the same sensor is therefore critical to AGDCv2 in order to allow users to perform analyses that compare measurements in space and time for that sensor. However, radiometric correction is also critical for comparing measurements collected by different sensors because it provides the user with confidence that any differences that are observed between measurements are due either to inter-sensor differences or on-ground change, rather than variations in observing conditions (Chen et al., 2005).

Our basic measurement for optical remote sensing data in the AGDCv2 is surface reflectance for each image band, normalised for a nadir view with illumination at 45 degrees elevation, often referred to as Nadir Bidirectional reflectance distribution function (BRDF) Adjusted Reflectance, or NBAR.

To achieve measurement consistency, Geoscience Australia has developed automated processes to produce NBAR products from optical instruments on the Landsat 5, 7 and 8 satellites, using the methods of Li et al. (2010, 2012). The start point for NBAR processing is geometrically corrected data, i.e., in the case of Landsat, L1T data. Corrections are required for variations in solar illumination, the combined variations in sun and satellite view angles, the presence of aerosols and atmospheric moisture content (radiative transfer modelling), and the BRDF of the target (Fig. 5). The NBAR correction requires ancillary data including a digital surface model (DSM), atmospheric state parameters and bi-directional reflectance distribution function (BRDF) parameters. A physical model is used to apply corrections for atmospheric transfer, illumination and viewing geometry including terrain surface slope and aspect and reflectance distribution. Whilst a number of empirical methods have been previously developed to correct for topographic effects (e.g., Richter et al., 2009; Soenen et al., 2005), it is difficult to reliably apply these methods in automated processing workflows.

The surface reflectance correction uses coupled, physics-based, atmospheric transfer, BRDF and terrain models as described in Li et al. (2010, 2012). BRDF shape functions are derived from concurrent MODIS observations. The MODTRAN 5 radiative transfer model is applied (Berk et al., 1999), using the best available aerosol data from

Fig. 4. Improved geometric correction is possible by updating (a) the Global Land Survey (GLS) reference based on (b) the AGRI reference image. Blue crosses are GPS field measurements
across a road intersection.
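The normalised cross-correlation used to match points between the AGRI reference and GLS imagery (Section 3.1.1, Fig. 4) can be illustrated in a few lines. This is a deliberately minimal, brute-force sketch of chip matching in plain NumPy — our own illustration, not the gVerify implementation:

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalised cross-correlation of two equal-sized chips;
    1.0 indicates a perfect match, values near 0 indicate no correlation."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_shift(chip, window, max_shift=5):
    """Exhaustively test integer shifts of a reference chip over a larger
    search window; return (row_shift, col_shift, score) of the best match."""
    h, w = chip.shape
    best = (0, 0, -1.0)
    for dr in range(2 * max_shift + 1):
        for dc in range(2 * max_shift + 1):
            score = ncc(chip, window[dr:dr + h, dc:dc + w])
            if score > best[2]:
                best = (dr - max_shift, dc - max_shift, score)
    return best
```

Applied at many control points, the recovered shifts yield the displacement statistics (such as ce90) that summarise geometric error; a production system would use FFT-based correlation rather than this exhaustive search.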

Fig. 5. The radiometric correction process after Li et al. (2012).

Aeronet, AATSR, MISR, MODIS and climatology ancillary data sources. Elevation is modelled with the 1-arc-second SRTM Digital Surface Model (DSM). An advantage of a physics-based approach, and essential to our work where we are processing (and routinely re-processing) hundreds of thousands of scenes, is the ability to automate the radiometric corrections in robust, parallel workflows. Li et al. (2010) demonstrate that these corrections are effective by comparing results with field spectra collected at the time of satellite overpass, and in the side lap areas between adjacent Landsat passes.

Landsat Thematic Mapper (TM) imagery dating between March 1984 and December 1999 pre-dates the MODIS instruments and therefore cannot be corrected using concurrent MODIS observations. Li et al. (2013) investigated variations in MODIS BRDF shape functions for Australia acquired over a 10-year period from 2001 to 2011, discovering inter-annual variation of the shape function to be relatively small. This enabled the development of a seasonally adjusted representative BRDF shape function for Australia that we use to produce surface reflectance (NBAR) products for the pre-MODIS era. Similarly, Roy et al. (2016) propose a fixed set of BRDF spectral model parameters to allow Landsat NBAR generation in the USA.

3.1.3. Per pixel quality flags

It is important that users of EOS data are able to identify undesirable or failed observations in the data, such as pixels affected by clouds or instrument saturation. Traditionally, researchers rejected entire scenes which were only partly cloud-affected in favour of using cloud-free scenes. This was both a manual and time-consuming process, and resulted in significant loss of useful observations, particularly in regions that experience high cloud frequency. Automated analysis techniques, on the other hand, rely on the flagging of cloud, cloud shadow, instrument saturation and so on for each pixel. This allows users to retrieve all of the unaffected observations of the Earth's surface at a particular location as input into their analyses.

In the AGDCv2 we retain all observations, but apply a set of Pixel Quality Assessment (PQA) flags, as well as metadata at the dataset level. Rather than discarding data, our approach recognises that what may be considered 'noise' by one user may be a valid observation for another purpose, and that new algorithms will emerge that are able to use data once regarded as too 'noisy'. Users may exclude undesirable observations as appropriate for their specific analysis.

Whilst there are numerous pixel quality products available for MODIS data, prior to Landsat-8 there were no equivalent products for Landsat data, although several tests existed for cloud assessment including Automated Cloud Cover Assessment (ACCA) (Irish et al., 2006) and 'Function of mask' (Fmask) (Zhu and Woodcock, 2012).

We therefore developed a Pixel Quality Assessment (PQA) system applicable to the Australian Landsat collection that can be applied in automated high performance workflows (Sixsmith et al., 2013). PQA uses 16 bits to store the results of a number of tests that are applied to the data during processing. These include ACCA and Fmask cloud assessments, associated tests for cloud shadow, tests for sensor saturation and zero values, and sea/ocean flags. Table 1 shows the PQA bitmap assignment. The 16th bit is unused; however, a topographic shadow flag is anticipated in future.

Table 1
Pixel quality assessment bitmap values.

  Pixel quality test     Bit   Value    Cumulative sum
  Saturation band 1      0     1        1
  Saturation band 2      1     2        3
  Saturation band 3      2     4        7
  Saturation band 4      3     8        15
  Saturation band 5      4     16       31
  Saturation band 61⁎    5     32       63
  Saturation band 62⁎    6     64       127
  Saturation band 7      7     128      255
  Contiguity             8     256      511
  Land/Sea               9     512      1023
  Cloud (ACCA)           10    1024     2047
  Cloud (Fmask)          11    2048     4095
  Cloud shadow (ACCA)    12    4096     8191
  Cloud shadow (Fmask)   13    8192     16,383
  Topographic shadow     14    16,384   32,767
  Not used               15    32,768   65,535

  ⁎ Band 6 high- and low-gain modes of operation.

Application of PQA is illustrated in Fig. 6. PQA can be used to select pixels which are relevant to a specific analysis. For example, users interested in studying changes in coastal intertidal zones might choose to retain land/sea mask pixels, but to exclude all saturated, cloud and cloud shadow affected pixels.

In Fig. 6 all saturated pixels are also cloud-affected pixels. The binary representation and cumulative sum indicating a pass (1) or fail (0) for each quality test (reading from right to left) at points A, B and C are:

A: 0011001111101110 (13294) – Bands 1 and 5 saturated; Cloud detected using both ACCA and Fmask.

Fig. 6. (a) True colour image and (b) the corresponding Pixel Quality product.

B: 0000111111111111 (4095) – Cloud shadow detected; all other tests passed.
C: 0011111111111111 (16383) – Pixel is clear; all tests passed.

Rather than working directly with PQA bits, the AGDCv2 code supports text-based mask specifications, such as: cloud_acca = 'no_cloud', cloud_fmask = 'no_cloud', contiguous = True. This semantic abstraction also allows the PQA approach to be extended, for example to cover Surface Reflectance products from MODIS and from the United States Geological Survey Landsat (for example: cfmask = 'clear'), and is critically important as the number of sensors in the AGDCv2 increases. AGDCv2 users are not required to be conversant with the quality flags of multiple different sensor types, and the text-based abstraction of the pixel quality assessments increases the portability and reuse of AGDCv2 code and workflows, which are not 'hard-wired' to a particular specification of per pixel metadata.

3.1.4. Collection management - retaining the provenance of data and products

Rigorous collection management is essential if products from Earth observation analyses are to meet scientific, practical and potentially legal expectations of transparency and repeatability, for example metadata that allow the user to unequivocally trace a specific product back to the observations used to produce it. Earth observation products are frequently used in applications where traceability is important, such as national accounts (Caccetta et al., 2007), and in situations where 'chain of custody' information is required (Purdy, 2010; Purdy and Leung, 2012).
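The bitmap in Table 1 is compact enough to decode directly. The following sketch is illustrative only (it is not AGDCv2 source code); it reports which of the bit-0 to bit-13 tests a pixel failed, reproducing the readings at points A and C of Fig. 6:

```python
# Illustrative decoder for the Table 1 PQA bitmap (not AGDCv2 source code).
# Bit convention: 1 = test passed, 0 = test failed (e.g. saturated, cloudy).
# Bits 14-15 (topographic shadow, unused) are omitted as not yet populated.

PQA_TESTS = [
    "Saturation band 1", "Saturation band 2", "Saturation band 3",
    "Saturation band 4", "Saturation band 5", "Saturation band 61",
    "Saturation band 62", "Saturation band 7", "Contiguity", "Land/Sea",
    "Cloud (ACCA)", "Cloud (Fmask)",
    "Cloud shadow (ACCA)", "Cloud shadow (Fmask)",
]

def failed_tests(pqa: int) -> list:
    """Names of the quality tests that a pixel failed (bits 0-13)."""
    return [name for bit, name in enumerate(PQA_TESTS) if not (pqa >> bit) & 1]

# Point A of Fig. 6 (value 13294): bands 1 and 5 saturated, cloud flagged
# by both ACCA and Fmask.
print(failed_tests(0b0011001111101110))
# Point C (value 16383) passes every test, so the list is empty.
print(failed_tests(16383))
```

In practice users need not handle these bits: the text-based specifications described above (e.g., cloud_acca = 'no_cloud') express the same selections.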

Fig. 7. Inputs to processing of Earth observation raw archives to produce products.



However, establishing collection management systems that capture full provenance information is difficult. Collection management tools and automated approaches are needed because collections are complex and costly to maintain. Production of fit-for-purpose data products from raw satellite data archives involves many steps and a range of inputs. As algorithms, calibration parameters and ancillary inputs mature and improve, reprocessing of data is required to maintain the product suite as best practice. Fig. 7 illustrates many of the potential sources that could contribute to a change in an output product. As well as tracking provenance, our experience is that collection managers require mechanisms to:

• Evaluate whether a given change (e.g., in calibration parameters) warrants a full or partial reprocess of the archive; and
• Communicate the impact of the updates to existing users whose products may be impacted by the changes.
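The first of these mechanisms can be thought of as a simple decision rule. The sketch below is hypothetical (the threshold, names and three-way outcome are our illustration, not AGDCv2 code), with scope expressed as the proportion of the collection affected and significance as a comparison against the agreed benchmarks:

```python
# Hypothetical sketch of a reprocessing decision (not AGDCv2 code).
# scope: fraction of the collection affected by a proposed change (0.0-1.0).
# significant: True when the change exceeds the benchmark radiometric or
# geometric deviations agreed for the collection.

def reprocess_action(scope: float, significant: bool,
                     scope_threshold: float = 0.1) -> str:
    if not significant:
        # Below-benchmark changes accumulate in a backlog.
        return "backlog"
    if scope >= scope_threshold:
        # A significant, widespread change warrants a full reprocess.
        return "full reprocess"
    # Significant but narrow in scope: reprocess only the affected products.
    return "partial reprocess"

print(reprocess_action(scope=0.02, significant=True))
print(reprocess_action(scope=0.5, significant=False))
```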

Fig. 7 shows how multiple ancillary inputs are used, which change from time to time as quality is improved. Software code libraries, which may be internal to the processing agency or from third party providers, have specific versions. The hardware and operating system environments are also significant and are frequently upgraded. Finally, products are often chained, with one product being an input to the next, for example as higher levels of correction are produced. In Fig. 7, the metadata for Product C must capture the processing parameters of all of the earlier steps.

We use a concept of Managed Collections to address these challenges, which are fundamental to the success of data cube infrastructures. Consistent data, with known quality and input parameters, is essential if data cubes are to provide authoritative and consistent resources to users. Our approach differs significantly from other current models of processing, which often use common systems of differing versions across multiple platforms to produce "like" products.

Our concept of a managed collection includes:

• Software versioning, and governed production software upgrades;
• Ancillary input collection versioning and update control;
• Assessment of the scope and significance of a proposed change for the collection. Scope refers to the proportion of the collection affected, whereas significance is established through comparison with benchmarks and acceptable deviations from these in regard to radiometric and geometric changes, for example;
• Business processes, which determine a course of action based on the scope and significance of the change, for example to:
  ○ Add to a backlog until significance and scope reach thresholds;
  ○ Upgrade the collection;
  ○ Update components of the collection; and
  ○ Patch components of the collection.

System changes may be simple software patches or re-writes that improve function but have no material impact on the output products; of minor importance, requiring a partial re-processing because, for example, only some of the products in the collection will be affected; or of major importance, requiring a comprehensive re-processing of products to apply a ubiquitous improvement.

To manage AGDCv2 generated products we have adopted version numbering schemes widely used in software development practice (e.g., https://en.wikipedia.org/wiki/Software_versioning). AGDCv2 data and product collections are numbered using a three-level hierarchy to capture and communicate such changes (Fig. 8). Thus the first collection version would be V1.0.0. A software patch would produce V1.0.1, a further minor change would lead to V1.1.1, and a major change following that would produce a new collection, i.e., V2.0.0.

Fig. 8. Collection numbering approaches adapted from software versioning. The numbering of versions depends upon the scope and significance of each change.

A significant development of AGDCv2 is the progressive development of these automated collection management systems. Our systems capture the following contributors to change:

• data inputs, location and time of access; and
• software versions and repository locations.

Future advances will include dependent libraries, compilers, and version controlled ancillaries.

Managed collections require discipline and constraints on changes to processing systems, so that each change is both documented and considered in terms of the implications for further processing. In practice, not all data can meet our expectations of a managed collection now or in the near future. Therefore we have developed the capability to index pre-existing data collections to ensure that they can be utilised via the AGDCv2 API.

3.2. AGDC architecture and technologies

To ensure longevity and to lower the costs of maintenance, the AGDCv2 architecture is designed to be sustainable and flexible, and to use existing code libraries and utilities wherever possible. The AGDC is designed to continue to grow: to be expandable to incorporate multiple datasets from new platforms, scalable to cater for order-of-magnitude growth in data volumes, and flexible enough to support many applications of the data, ranging from routine processing through to interactive data exploration.

The key concepts of the modular, layered architecture of the AGDCv2 are illustrated in Fig. 9. It comprises four layers, from bottom to top as follows:

1. Data Acquisition and Inflow - Observations are collected and pre-processed to an 'analysis ready' level by various custodians;
2. Data Cube Infrastructure - analysis ready data are indexed into the AGDCv2, including ingestion into multi-dimensional datasets, currently netCDF4-CF1.6, with a suite of tools for task execution, discovery, visualisation and so on;
3. Data and Application Platform - Platforms and environments that allow routine generation of products, and exploration of new products in a 'virtual laboratory' environment; and
4. UI and Application Layer - A diverse set of applications is enabled by the underlying infrastructure.

This architecture has evolved significantly since our first proof of concept forms, which were used to complete our first national scale

Fig. 9. The architectural concepts of the Australian Geoscience Data Cube.

analysis (Lewis et al., 2016). In the following we present some of our most important technical and architectural choices.

3.2.1. Open source libraries

The AGDCv2 code (https://github.com/data-cube/agdc-v2) is written in Python (≥2.6, ≥3.3) and publicly available under the Apache License, Version 2.0. The code heavily leverages open source libraries including: pyyaml, sqlalchemy and jsonschema for file attribute and database access; netCDF4 for Data Cube dataset creation and access; Luigi (https://github.com/spotify/luigi) for managing workflows; numpy, xarray and dask for array handling; rasterio/gdal for geospatial data access; and scipy, pandas, scikit-image, and scikit-learn for analysis - see Fig. 9.

3.2.2. Data Cube datasets and NoSQL index store

A key element of our architecture is the use of a Data Cube dataset, referenced by an Index Store (Fig. 9). A Data Cube dataset is a netCDF4-CF 1.6 file (adherence to CF is not strictly required by the AGDC software, but its value is described more fully within Section 3.3.2) with defined spatial and temporal extent. For example, we have created Data Cube datasets for Landsat TM, ETM+ and OLI data spanning 100 × 100 km and one calendar year. These choices are configurable, are likely to differ for differing datasets, and can in principle be optimised to the analyses that are most required (for example, analysis of the time-series versus formation of continental mosaics for a specific time interval).

The following variables may be configured when ingesting data into the Data Cube:

• Coordinate Reference System
• Spatial resolution
• Chunk dimensions: x, y, t
• Compression
• Temporal depth: hour, day, month, year/s
• Tile dimension: x, y
• Data type: CF convention

The Index is queried to find Data Cube datasets relevant to a particular need, to report on the available data, to control processing, and to retrieve provenance information (that is, a given dataset's ancestral and descendent datasets and the system components used to produce it).

Whilst we initially used a fully relational database to manage the contents of AGDCv1, experience has led us to a hybrid model combining a minimalist relational database with use of 'Not Only SQL' (NoSQL). We found that the full relational database approach, already complex for a Data Cube containing only Landsat data, would not be sustainable as

we expanded the schema to support a growing number of sensors and products, and sought to add rules and fields to track provenance and to include full metadata fields, which are often peculiar to a specific sensor/product (e.g., Landsat path/row). In a relational model, constant changes are needed to the schema as new data sources are added. A further challenge will arise when multiple organisations seek to maintain compatible data cubes, because schemas will inevitably tend to diverge, limiting interoperability between data cubes.

Highly complex relational data models also raise efficiency questions if there are high numbers of largely-unused fields used to cater for special cases, and as the number of relational joins required for each query grows.

We have therefore chosen to use a more simplified structure. As shown in Fig. 10, the current AGDCv2 uses a simple five-table structure (metadata_type, dataset_type, dataset, dataset_source and dataset_location) with composite-type JSON documents.

3.2.3. The AGDCv2 application programming interface (API)

The AGDCv2 API provides a set of user functions to search, read and write a Data Cube. The API conceptualises the underlying data model and aims to provide a consistent and logical interface to the AGDCv2 functions. These are distinguished as:

1. Higher Level User Functions, which enable listing of products and measurements as well as data load capability; and
2. Low-Level Internal Functions, which enable data discovery, dataset grouping along defined non-spatial dimensions, and loading of data from product sources into a dataset object.

Detailed technical documentation on the AGDCv2 is available from the software repository documentation at https://github.com/opendatacube.

3.3. NCI high performance computing/high performance data environment

A key enabler for AGDCv2 has been the National Computational Infrastructure (NCI) facility, which has supported storage, management and analysis of the large Earth observation datasets that underpin the Data Cube. AGDCv2 data and products total ~1 Petabyte, which are supported by the NCI as part of a 10+ Petabyte set of nationally significant research data collections that span five major categories: 1) Earth system sciences, climate and weather model data assets and products; 2) Earth and marine observations and products, and geosciences; 3) terrestrial ecosystems; 4) water management and hydrology; and 5) astronomy, biosciences and social sciences (see http://geonetwork.nci.org.au). Within this, satellite Earth observations include Landsat, Sentinel-1, 2, 3, Himawari-8, MODIS, AVHRR, and ALOS data, which are made available primarily for researchers and government organisations to access and analyse, either locally or remotely.

3.3.1. NCI NERDIP infrastructure

The data management practice for the research data collections is formally organised and made available through NCI's National Environmental Research Data Interoperability Platform (NERDIP, http://nci.org.au/systems-services/national-facility/nerdip/) (Evans et al., 2015; National Computational Infrastructure, 2016) to meet the needs of the broad stakeholder and research communities. Management and operation of these large and diverse data collections is complex, and to ensure that users can find, evaluate, understand and then utilise the data, ultimately in new and unanticipated ways, NCI have benchmarked NERDIP against known and emerging data standards. One such relevant case is the US GEO Common Framework for Earth Observation Data (United States Government, 2016), which provides recommendations for data providers to standardise their data management practices. These recommendations are designed to both make Earth observation data more discoverable, accessible and usable, as well as to facilitate integration across multiple repositories. Aligning Earth systems science collections with this Framework lays the foundations for interdisciplinary science and the seamless integration of Earth observation data with climate, marine, environmental and geoscience data collections.

The NCI integrated High Performance Compute (HPC)-High Performance Data (HPD) infrastructure which underpins the current implementation of the AGDCv2 comprises a 1.2 PetaFlop supercomputer (Raijin), an HPC class 3000 OpenStack cloud system (Tenjin) and several highly connected large-scale high-bandwidth Lustre filesystems. The coupled HPC-HPD system has been designed (Evans et al., 2015; Wyborn and Evans, 2015) to support the mixed-mode and varying workflows of Data-intensive Science (Hey et al., 2009), and in particular, to analyse and model the high volume data collections from the Earth observation, climate and geoscience communities. The coupling of HPC and data infrastructure reduces the time taken to both generate and process the data, and enables further secondary independent analysis. This is particularly important for large volume datasets that are required for continental or global scale analysis, which are increasingly difficult to manage separately. Moreover, many analyses will use a combination of several large environmental datasets.

The NCI facility has been important in several aspects of AGDCv2, including data management practices, ensuring suitability for high performance analysis, interoperability (whether local or remote), and regularisation of data models. In particular, NCI established the use of netCDF4/HDF5 for the majority of the NERDIP datasets, as well as a
Fig. 10. The Data Cube uses a simple five-table structure (metadata-type, dataset-type, dataset, dataset_source and dataset_location) with composite-type JSONB documents.

common quality control for the application of metadata standards such as CF-1.6, and we have been able to adopt this in AGDCv2. Furthermore, NCI have established standards for performance for each of the datasets, which is supporting us to manage the configuration of chunking, compression, and internal data layouts in AGDCv2. The quality controls for the datasets also include testing against known packages and exemplar cases to ensure that both performance and functionality can meet the research community requirements.

The NCI integrated HPC-HPD environment can meet the needs of users that require access to the HPC to undertake high resolution continental scale analysis of the full archive of Australian Landsat data, but at the same time, it can accommodate those individual researchers who wish to access the data via standards based web services to extract minor data volumes to be processed elsewhere (e.g., in the cloud). Thus, any Earth observation researcher can always access the latest version of AGDCv2 data and derivative products. This also ensures that workflows developed via the API remain scalable against the entire AGDCv2 data holdings.

3.3.2. Data formats to facilitate transdisciplinary access and high performance analysis

One of the key guiding principles of AGDCv2 is to ensure that it uses data suitable for conversion to 'high performance data', moving away from data previously stored in sensor-specific or less accommodating formats such as GeoTIFF. This is a critical design element so that data cubes can accommodate the increasing variety of environmental data streams.

The datasets at NCI are stored as netCDF4/HDF5 files to capitalise on the underlying vocabulary standards and the supporting publicly accessible data services, libraries, packages and scientific toolkits. The AGDCv2 uses the Climate and Forecast (CF 1.6) conventions that are being used across NERDIP. This convention provides a definitive description of what the data in each variable represents, including the spatial and temporal properties of the data.

The use of netCDF as the storage profile of HDF5 is dual purpose in that it supports applications within NCI through the use of the widely used netCDF4/HDF5 libraries, as well as through the DAP protocol used by data services like THREDDS (Thematic Realtime Environmental Distributed Data Services, http://www.unidata.ucar.edu/software/thredds/current/tds/). This effectively web-enables the AGDCv2 datasets, as per one of the central design decisions for NERDIP.

4. Applications and capabilities

The value of the AGDCv2 is demonstrated through a growing number of successful applications and capabilities, some of which we draw on below. Many of these uses would be prohibitively expensive, time consuming or difficult without the advantages of the AGDCv2 architecture, which allows us to efficiently access measurements through time across the entire Australian continent.

4.1. Water observations from space

A number of studies have mapped surface water within parts of Australia (e.g., Thomas et al., 2011; Tulbure et al., 2016); however, with the Data Cube we are able to map surface water for all of Australia, drawing on every Landsat observation since the first Landsat Thematic Mapper collections in 1987. In contrast to localised studies, AGDCv2 allows models to be trained on geographically and temporally comprehensive datasets, producing more robust and accurate models across the full range of Australian environments. This is demonstrated in Water Observations from Space (WOfS) (Mueller et al., 2016), which is a comprehensive mapping of surface water for all of Australia (http://www.ga.gov.au/interactive-maps/#/theme/water/map/wofs). WOfS is produced by testing every pixel in the available Landsat archive for the presence of water, and summarising the results as the number of times water was detected at each location. AGDCv2 also enables workflows to be implemented that can be regularly and routinely run, updating WOfS as new data become available. Fig. 11 is an example of the WOfS product for a remote and arid area of Australia.

4.2. Best available pixel composites and temporal statistics

'Best available pixel' composites are used to combine multiple moderate resolution images to generate a seamless coverage for further analysis (Hermosilla et al., 2015; Roy et al., 2010; Thompson et al., 2015). However, the criterion for the 'best' pixel is application dependent. Data cubes allow a flexible approach to 'best available pixel' composites because the pixels can be chosen according to the application of interest. For example, we have been able to select 'best' pixels as those that were seen at high or low tide. We have also selected 'best' pixels as those that correspond to the lowest values of vegetation cover, and were therefore 'best' for certain geological studies.

However, as AGDCv2 retains all observations, the 'best pixel' concept can be extended to include temporal statistics, with the user choosing typical (or atypical) values of the pixel through median, mean or percentile values or other statistics of the time series; for example, 'the median surface reflectance in winter'. In Australia, over 500 Landsat observations may be available for a given location, starting from 1987. Products produced using temporal statistics also have advantages such as removing data outliers, producing stable estimates, reducing data volumes, and enabling new approaches to analysis.

Two examples that we have produced are:

• Seasonal median surface reflectance; and
• High tide and low tide median images.

Fig. 12 shows an image of Australia produced from the median surface reflectance values from Landsat 8 between April and October, for the years of 2014–2015. The composite is generated from an automated analysis using data from over 10,000 individual Landsat 'scenes', yet can be produced from the Data Cube in a matter of hours.

Median surface reflectance composites produced by us are now being used by the governments of the Australian states of South Australia and Victoria, which together manage a land area of 1,200,000 km². We were able to produce those datasets to the specifications of the state government scientists. Researchers in those states use these inputs to produce new, highly detailed maps of the land cover for each five-year interval from 1990 to 2015 (Cunningham et al., 2008).

High-tide and low-tide median images are a further variation on time-series statistics. Fig. 13 shows a synthetic image produced as the median value of surface reflectance from Landsat sensors between 1987 and 2016, drawing only on observations that coincide with the highest 10% and lowest 10% of observed tides. To produce this we linked each coastal observation to the Oregon State University tidal model (Egbert and Erofeeva, 2002). The images reveal intertidal features with consistency and detail, despite characteristically turbid coastal waters, thereby enabling more detailed characterisation of intertidal and shallow marine zones in the future. Unlike methods that use a small number of individual satellite images to map the intertidal zone, by utilising the full 28-year time series of observations the data cube approach overcomes the need for clear, high quality observations acquired concurrent to the time of low and high tide.

The low tide image (Fig. 13b) demonstrates the potential of statistical products for improved mapping in shallow, turbid waters. The shallow bank features are rarely visible in imagery due to the high turbidity of these inshore waters. The difference between the low and high tide datasets provides a basis for improved mapping of the intertidal zone. This has allowed us to produce a new national intertidal map product for Australia (http://dx.doi.org/10.4225/25/575F47F34675D).
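The effect of such temporal statistics can be illustrated with a small sketch (illustrative only; in the AGDCv2 the stack would be loaded through the API and flagged observations masked using PQA before the statistic is computed):

```python
import numpy as np

# Illustrative per-pixel median composite over a time stack (t, y, x).
# Observations failing the pixel quality tests are set to NaN beforehand,
# so nanmedian simply ignores them.
reflectance = np.array([
    [[0.10, 0.12], [0.30, 0.31]],     # time 1
    [[0.11, np.nan], [0.29, 0.33]],   # time 2 (one cloud-masked pixel)
    [[0.90, 0.13], [0.28, 0.32]],     # time 3 (one outlier, e.g. missed cloud)
])

composite = np.nanmedian(reflectance, axis=0)  # shape (y, x)
print(composite)
```

The 0.90 outlier at the first pixel, such as a residual undetected cloud, does not influence the median, which is one reason median composites are stable.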

Fig. 11. Water Observations from Space (WOfS) results for the braided river network of Coopers Creek in south-western Queensland, Australia. Colours indicate the frequency of
inundation, ranging from permanent water holes (blue) through to the extremes of the floodplains (red).
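The per-location summary behind Fig. 11 amounts to counting water detections against clear observations through time. A minimal sketch (illustrative only; the production WOfS workflow applies the water classifier of Mueller et al. (2016) across the full Landsat archive):

```python
import numpy as np

# Illustrative WOfS-style summary (not the production workflow).
# water: True where an observation was classified as water.
# clear: True where the observation passed the pixel quality tests.
# Both stacks have shape (t, y, x); here a 3-date, 1 x 2 pixel toy example.
water = np.array([[[True, False]], [[True, False]], [[False, False]]])
clear = np.array([[[True, True]], [[True, False]], [[True, True]]])

water_count = np.logical_and(water, clear).sum(axis=0)
clear_count = clear.sum(axis=0)
# Frequency of inundation per location, as mapped in Fig. 11.
frequency = water_count / clear_count
print(frequency)
```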

4.3. Time series water quality observation

An important advantage of the AGDCv2 over traditional remote sensing image analysis approaches is that all data points are retained, providing a richer stream of data. The use of quantitative measurements of surface reflectance means that data streams from multiple instruments can be combined, further densifying the data. Lymburner et al. (2016) demonstrate that data from three Landsat instruments, TM, ETM+ and OLI, can be combined to produce a single time series of measurements of total suspended matter spanning the past three decades.
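Because each instrument contributes the same physical quantity, merging the three series reduces to concatenation and a chronological sort. A sketch with invented dates and values:

```python
# Sketch: merging per-instrument time series of a water quality measurement
# (e.g. total suspended matter) into one series ordered by date.
# The dates and values below are invented for illustration only.
tm = [("1989-03-01", 21.0), ("1995-07-14", 18.5)]
etm = [("2001-05-20", 24.2), ("2009-11-02", 19.9)]
oli = [("2014-02-11", 22.7)]

combined = sorted(tm + etm + oli)   # one chronological series
dates, tsm = zip(*combined)
print(dates[0], dates[-1], len(tsm))
```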

Fig. 12. Median surface reflectance values from Landsat 8 between April and October, for the years of 2014–2015.

Fig. 13. The high tide composite (left panel) and low tide composite (right panel) median surface reflectance (1987–2015) for an estuarine area in the Kimberley, Western Australia.

For one lake lying on the overlap between adjacent Landsat paths, Lymburner et al. (2016) were able to retrieve over 500 valid measurements of total suspended matter between 1987 and 2014. The ability of the AGDCv2 to provide such measurements routinely and over the entire continent indicates the potential of Earth observation data cubes to support continental assessments that help to address global challenges such as those posed by the United Nations' Sustainable Development Goals (Griggs et al., 2013).

5. Discussion

5.1. International applications

Whilst the development of the AGDCv2 has necessarily been focused on the Australian region, we have developed the Data Cube with an awareness of the international need for improved approaches to the exploitation of satellite Earth observation data in order to effectively address global challenges. We have therefore chosen to build, and use, open source code and to work actively with the international community to facilitate open technical and scientific partnerships that we see as essential to global solutions.

One of the benefits of this approach has been the adoption of the AGDCv2 codebase by NASA's Systems Engineering Office for the Committee on Earth Observing Satellites (the CEOS-SEO). Through this initiative CEOS is exploring ways to overcome the obstacles of data size and data complexity which face individual user agencies, particularly within developing countries.

The CEOS-SEO has used the AGDC code to implement data cubes in Colombia and Kenya and is using those to explore the use of data cubes in a wider international context, with more diverse user needs, differing technical capabilities, and differing levels of computational infrastructure. The initial focus of the projects is on forest management and agriculture, but with the potential to serve far more applications. They are exploring a range of technical scenarios such as deployment via local computers, regional data hubs and cloud-based resources (e.g., Amazon Web Services).

CEOS has applied the AGDC concepts to small-scale data cubes (the land area of Kenya is less than one tenth that of Australia, and data volumes will be at least an order of magnitude smaller than the AGDCv2), allowing them to explore 'off the shelf' computing infrastructures to support data cubes in developing countries. We are yet to see the development of continental scale data cubes outside of Australia using integrated HPC-HPD facilities such as the NCI, which enable large data analysis through high processing capacity linked directly to high performance storage.

5.2. Lessons learned

The initial AGDCv1 prototype provided an important platform for proving the concept of an Earth observation 'Data Cube'. The system was, however, tightly coupled to the analysis of Landsat data and fell short of the common analytical environment for geospatial data that we had envisioned. Rigid database schemas meant that the addition of new data sources imposed a large development overhead. For some sensor data, fields remained unused or the schema needed expansion to accommodate new attributes. The simplified structure of AGDCv2 provides flexibility in terms of the attributes used to describe a given dataset, and one can add support for a new sensor by writing a high level ingest configuration in a relatively user friendly mark-up language.

AGDCv2 treats internal datasets (Data Cube formatted/tiled and reprojected) and external datasets (i.e., those retained in their native construct) as conceptually equal. If a dataset is to be used frequently, it may be ingested to become a Data Cube dataset, whereas datasets used less often may remain as indexed external datasets, accessed and temporarily ingested (warped to the analysis grid) only during analysis. We found this to be a crucial step forward because it allows researchers to retain control over how data are presented for analysis and removes the need to replicate datasets. The AGDCv2 has access to a variety of geophysical datasets hosted on the NCI, and by indexing rather than ingesting these datasets, custodians of the data are able to maintain their role, such as controls on and versioning of the data.

We have found data quality, including quality assurance and relevant measures of quality, to be of great importance. Both spatial accuracy and radiometric fidelity are crucial when conducting time-series analysis, and being able to trace and address data issues is essential. When managing hundreds of thousands of images this becomes a significant technical challenge, leading us to a focus on version control, collection management and accurate data calibration as described in this paper.

A collaborative and open-source approach has proven to be essential to us. The development of the AGDC has proven to be a large, complex, and on-going task requiring the collaborative efforts of many, and benefitting from diverse perspectives and contributions as a range of individuals and institutions have engaged in the work. As agencies, countries and international organisations seek to apply vast quantities of Earth observation data to increasingly pressing issues, it seems inevitable that they will need to cooperate in their efforts to build capabilities
akin to the AGDCv2. Collaboration and open-source approaches also open the door to consistency and standardisation, and these will be needed to progress from national to global perspectives.

5.3. Future directions

We anticipate that the future development of the AGDCv2 will focus on a number of key areas.

5.3.1. Web services
Web services will be increasingly important for a wide range of users to visualise, access, download and utilise data and information products. Increasing sophistication and reliability of web services will be required, including support for mobile device access to data for specific purposes.

5.3.2. Semantic layers
Semantic abstraction is needed to translate technical details into higher-level terms, allowing users to focus on analysis. The AGDCv2 API already supports some abstraction; for example, AGDCv2 can store full spectral response functions for an optical instrument whilst allowing the programmer-analyst to use familiar terms such as the band number. Similarly, as discussed above, the bit-mapping of a pixel quality band can be abstracted so that the analyst-user is not required to interpret a binary code such as 0000111111111111 (decimal value 4095), but can instead use labels such as “Cloud Shadow detected”. Abstraction will become increasingly important as data from more Earth observation sensors and other sources are included, since each dataset will have its own peculiarities. Each sensor will have its own band-number-to-spectral-wavelength mapping and its own per-pixel-metadata specification, and the AGDC will need to store the full technical detail of those whilst allowing Earth observation scientists and application developers to work with more familiar terms. Abstraction will simplify the process of creating products that retrieve and integrate data from multiple sensors; for example, an NDVI time series for a particular location could draw on data from multiple Landsat and Sentinel-2 instruments.

5.3.3. Virtual research environments/virtual laboratories
Virtual Research Environments (VREs, also termed Virtual Laboratories and Science Gateways) are increasingly being used to support a more dynamic approach to collaborative working. Users who are not co-located can dynamically work together at various scales, from the local to the continental, and share data, models, workflows, best practice, tools, etc.
The AGDC API enables the possibility of developing VREs that would enable Earth observation researchers to collaboratively develop and share new products and results. The API allows this to be done in a product development environment that is closely coupled with the production environment, reducing the effort required to develop and deploy new products at scale.
Virtual research environments can also enable decision makers to engage interactively with products and services generated from Earth observation data. For example, a catchment management agency could perform queries within a managed virtual laboratory to characterise the inundation dynamics of a particular wetland during a particular epoch and then compare that behaviour with a different epoch.

5.3.4. Inclusion of additional data streams
The design of AGDCv2 simplifies the steps to making additional Earth observation datasets accessible via the AGDCv2; datasets can be indexed by AGDCv2 rather than being physically reformatted and duplicated within the AGDC. Emerging data management protocols for high-performance data will facilitate the inclusion of various other Earth and environmental datasets (see http://geonetwork.nci.org.au) in cross-disciplinary analyses, and this will lead to a rapid diversification of datasets used in the AGDCv2. For example, the availability of climate and geophysical variables in interoperable grid formats would allow these to be easily included in analyses.

5.3.5. Global analysis grids and federated systems
To address international challenges (Griggs et al., 2013), increasingly global analyses will be needed, drawing on data from multiple sources including traditionally non-gridded variables such as statistical or demographic information stored in vector formats. Continental- and global-scale analyses are difficult with the gridding systems traditionally used in remote sensing analyses, introducing problems such as crossing zone boundaries of the Universal Transverse Mercator projection system, and distorting area calculations if simple longitude-latitude grid systems are used. Development will be needed to utilise emerging new approaches to tessellation of the globe, such as Discrete Global Grid Systems (Purss et al., 2016).
Addressing global problems will also call for federated approaches. In the future, the implementation of global interoperability and geospatial standards via the Earth Systems Grid Federation (Cinquini et al., 2014; Williams et al., 2015) will enable data fusion to be achieved without the bulk movement and restructuring of data across different networks. We see AGDCv2 as an important step toward establishing a common analytical framework to link very large multi-resolution and multi-domain datasets together for analysis. We see potential for the expansion of the AGDC concept toward a network of data cubes operating on large geoscientific and geospatial datasets, co-located in suitable HPC-HPD facilities, to address global as well as national challenges (Fig. 14).

Fig. 14. Conceptual representation of a global network of data cubes that contain the data for each continent.

5.4. Machine learning

A major enhancement of the AGDCv2 framework is that it combines consistently corrected data with machine-readable metadata, which enhances programmatic discovery, access and utilisation of the data. This means that the petascale archive is ready for use by off-the-shelf machine learning frameworks such as scikit-learn (Pedregosa et al., 2011). This will allow users to leverage the latest machine learning techniques to quickly prototype, analyse or develop new remote sensing products.

6. Conclusions

The collection of Earth observations continues apace, in particular through satellite platforms that are delivering terabytes of data each day to inform our understanding of Earth and how it is changing. Scientific instruments of improved quality are producing greater data volumes, geostationary satellites are delivering greater ‘velocity’ of data, and small satellites promise increasing diversity of data. The need to extract timely and relevant information from these rich data sources, and the challenges associated with that, are widely recognised on many levels. Free and open data policies have been a vital government response to this problem (Wulder et al., 2012), opening the field for the scientific and technical communities to address the corresponding big data challenges.

The Australian Geoscience Data Cube was developed, as AGDCv1, to allow us to analyse the full depth of the Australian Landsat archives at continental scale. AGDCv2 has now progressed to be a far more general and scalable infrastructure. The new AGDCv2 features can be considered in terms of how they address the key challenges posed by creating and analysing Earth observation ‘Big Data’. The various elements of this Big Data challenge include:

6.1. Volume

The AGDCv2 addresses the volume challenge presented by the Australian satellite Earth observation data collections by storing surface reflectance, pixel quality and higher-level products within a high-performance data (HPD) environment at the NCI. Data are thus
processed ‘on receipt’ and progressively added to the AGDCv2. The NASA Systems Engineering Office data cubes referred to previously - the Kenya Cube and the Colombia Cube - address the volume challenge by focussing on smaller spatial extents, thereby making those data cubes portable and deployable on smaller-scale infrastructure. The capability of the API to access and manipulate data that is either ingested or indexed reduces the need to duplicate major collections by allowing users to access data curated by different agencies on the same computing infrastructure.

6.2. Variety

The AGDCv2 addresses the variety challenge of Earth observation data through:

• Conversion of sensor-specific measurements of digital numbers into a common ‘analysis ready’ measurement (for example, surface reflectance);
• The storage of sensor-specific wavelength characteristics that allow users to choose how they wish to combine data from different sensors at the time of analysis;
• A semantic layer that interprets the per-pixel-metadata standards used for each sensor by different providers (in this case GA and USGS), allowing users to retrieve ‘observations of interest’ at the time of analysis; and
• Support for multiple spatial resolutions and reprojection, to enable users to combine data from different sensors according to their requirements.

In the future, the implementation of a standardised Discrete Global Gridding System (Purss et al., 2016) may further increase the variety of gridded and spatial data sources that can be accessed or analysed via the AGDCv2, and data cubes more broadly.

6.3. Veracity

The AGDCv2 addresses the veracity challenges of Earth observation data by implementing collection management processes and protocols and capturing detailed metadata about both the data inputs and the data processing systems. The metadata is captured automatically and is stored in adherence with pertinent national and international standards. Furthermore, the metadata is stored in a machine-readable form to enable a wider range of modes of access and downstream provenance tracking.

6.4. Velocity

The AGDCv2 addresses the velocity issue of Earth observation data via the use of both HPD and HPC to facilitate objective automated workflows rather than subjective manual image interpretation. However, there remain a number of challenges in terms of velocity, including unpredictable timelines for the ancillary data required for atmospheric correction, and developing algorithms to exploit the high-velocity data streams generated by geostationary meteorological satellites.

Acknowledgements

The AGDC has been developed through the efforts of many people over a number of years. The authors wish to recognise and thank in particular Brendan Brooke, Kelsey Druken, Xavier Ho, David Hudson, Alexis McIntyre, Simon Oldfield, Wendy Soh, Medhavy Thankappan, Jingbo Wang, and Fei Zhang, for their part in the development of the AGDC.
Fig. 2 was originally published by Taylor and Francis and is re-used here under a Creative Commons Attribution License.
This work was funded by the Australian Government through Geoscience Australia and the NCI.
Funding for the supporting HPC-HPD infrastructure at NCI came from the Australian Government Department of Education, through
the National Collaborative Research Infrastructure (NCRIS) and Education Investment Fund (EIF) Super Science Initiatives via the NCI, Research Data Storage Infrastructure (RDSI) and Research Data Services (RDS) projects (particularly the A1.1 National Earth Systems Science Data Services Domain).
This paper is published with the permission of the CEO, Geoscience Australia.

References

Arvidson, T., Gasch, J., Goward, S.N., 2001. Landsat 7's long-term acquisition plan — an innovative approach to building a global imagery archive. Remote Sens. Environ. 78:13–26. http://dx.doi.org/10.1016/S0034-4257(01)00263-2 (Landsat 7).
Australian Academy of Science, 2009. An Australian Strategic Plan for Earth Observations from Space. Australian Academy of Science, Canberra. https://www.science.org.au/files/userfiles/support/reports-and-plans/2015/earth-observations-from-space.pdf (Accessed 14 July 2016).
Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., Widmann, N., 1998. The multidimensional database system RasDaMan. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98. ACM, New York, NY, USA, pp. 575–577. http://dx.doi.org/10.1145/276304.276386.
Baumann, P., Mazzetti, P., Ungar, J., Barbera, R., Barboni, D., Beccati, A., Bigagli, L., Boldrini, E., Bruno, R., Calanducci, A., Campalani, P., Clements, O., Dumitru, A., Grant, M., Herzig, P., Kakaletris, G., Laxton, J., Koltsida, P., Lipskoch, K., Mahdiraji, A.R., Mantovani, S., Merticariu, V., Messina, A., Misev, D., Natali, S., Nativi, S., Oosthoek, J., Pappalardo, M., Passmore, J., Rossi, A.P., Rundo, F., Sen, M., Sorbera, V., Sullivan, D., Torrisi, M., Trovato, L., Veratelli, M.G., Wagner, S., 2016. Big data analytics for earth sciences: the EarthServer approach. Int. J. Digital Earth 9:3–29. http://dx.doi.org/10.1080/17538947.2014.1003106.
Berk, A., Anderson, G.P., Bernstein, L.S., Acharya, P.K., Dothe, H., Matthew, M.W., Adler-Golden, S.M., Chetwynd Jr., J.H., Richtsmeier, S.C., Pukall, B., et al., 1999. MODTRAN4 radiative transfer modeling for atmospheric correction. SPIE's International Symposium on Optical Science, Engineering, and Instrumentation. International Society for Optics and Photonics, pp. 348–353.
Boyd, D., Crawford, K., 2012. Critical questions for big data. Inf. Commun. Soc. 15:662–679. http://dx.doi.org/10.1080/1369118X.2012.678878.
Caccetta, P., Furby, S., O'Connell, J., Wallace, J.F., Wu, X., 2007. Continental monitoring: 34 years of land cover change using Landsat imagery. 32nd International Symposium on Remote Sensing of Environment, June 25–29, 2007, San José, Costa Rica, pp. 25–29.
CEOS, 2014. The Earth Observation Handbook: Special Edition for Rio+20. Committee on Earth Observation Satellites, Paris. http://www.eohandbook.com/ (Accessed 14 July 2016).
Chen, X., Vierling, L., Deering, D., 2005. A simple and effective radiometric correction method to improve landscape change detection across sensors and across time. Remote Sens. Environ. 98 (1), 63–79.
Cinquini, L., Crichton, D., Mattmann, C., Harney, J., Shipman, G., Wang, F., Ananthakrishnan, R., Miller, N., Denvil, S., Morgan, M., Pobre, Z., Bell, G.M., Doutriaux, C., Drach, R., Williams, D., Kershaw, P., Pascoe, S., Gonzalez, E., Fiore, S., Schweitzer, R., 2014. The earth system grid federation: an open infrastructure for access to distributed geospatial data. Futur. Gener. Comput. Syst. 36:400–417. http://dx.doi.org/10.1016/j.future.2013.07.002 (Special Sections: Intelligent Big Data Processing; Behavior Data Security Issues in Network Information Propagation; Energy-efficiency in Large Distributed Computing Architectures; eScience Infrastructure and Applications).
Craglia, M., de Bie, K., Jackson, D., Pesaresi, M., Remetey-Fülöpp, G., Wang, C., Annoni, A., Bian, L., Campbell, F., Ehlers, M., van Genderen, J., Goodchild, M., Guo, H., Lewis, A., Simpson, R., Skidmore, A., Woodgate, P., 2012. Digital Earth 2020: towards the vision for the next decade. Int. J. Digital Earth 5:4–21. http://dx.doi.org/10.1080/17538947.2011.638500.
Cunningham, S.C., Nally, R.M., Read, J., Baker, P.J., White, M., Thomson, J.R., Griffioen, P., 2008. A robust technique for mapping vegetation condition across a major river system. Ecosystems 12:207–219. http://dx.doi.org/10.1007/s10021-008-9218-0.
Du, Y., Teillet, P.M., Cihlar, J., 2002. Radiometric normalization of multitemporal high-resolution satellite images with quality control for land cover change detection. Remote Sens. Environ. 82:123–134. http://dx.doi.org/10.1016/S0034-4257(02)00029-9.
Egbert, G.D., Erofeeva, S.Y., 2002. Efficient inverse modeling of barotropic ocean tides. J. Atmos. Ocean. Technol. 19:183–204. http://dx.doi.org/10.1175/1520-0426(2002)019<0183:EIMOBO>2.0.CO;2.
Evans, B., Wyborn, L., Pugh, T., Allen, C., Antony, J., Gohar, K., Porter, D., Smillie, J., Trenham, C., Wang, J., Ip, A., Bell, G., 2015. The NCI high performance computing and high performance data platform to support the analysis of petascale environmental data collections. In: Denzer, R., Argent, R.M., Schimak, G., Hřebíček, J. (Eds.), Environmental Software Systems. Infrastructures, Services and Applications, IFIP Advances in Information and Communication Technology. Springer International Publishing, pp. 569–577.
Fan, J., Han, F., Liu, H., 2014. Challenges of big data analysis. Natl. Sci. Rev. 1:293–314. http://dx.doi.org/10.1093/nsr/nwt032.
Frankel, F., Reid, R., 2008. Big data: Distilling meaning from data. Nature 455:30. http://dx.doi.org/10.1038/455030a.
Griffiths, P., Kuemmerle, T., Baumann, M., Radeloff, V.C., Abrudan, I.V., Lieskovsky, J., Munteanu, C., Ostapowicz, K., Hostert, P., 2014. Forest disturbances, forest recovery, and changes in forest types across the Carpathian ecoregion from 1985 to 2010 based on Landsat image composites. Remote Sens. Environ. 151:72–88. http://dx.doi.org/10.1016/j.rse.2013.04.022 (Special Issue on 2012 ForestSAT).
Griggs, D., Stafford-Smith, M., Gaffney, O., Rockström, J., Öhman, M.C., Shyamsundar, P., Steffen, W., Glaser, G., Kanie, N., Noble, I., 2013. Policy: Sustainable development goals for people and planet. Nature 495:305–307. http://dx.doi.org/10.1038/495305a.
Gruen, A., Zhang, L., 2002. Sensor modeling for aerial mobile mapping with Three-Line-Scanner (TLS) imagery. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 34, 139–146.
Gutman, G., Byrnes, R., Masek, I., Covington, S., Justice, C., Franks, S., Headley, R., 2008. Towards monitoring land-cover and land-use changes at a global scale: the global land use survey 2005. Photogramm. Eng. Remote. Sens. 74, 6–10.
Hart, P., Saunders, C., 1997. Power and trust: Critical factors in the adoption and use of electronic data interchange. Organ. Sci. 8:23–42. http://dx.doi.org/10.1287/orsc.8.1.23.
Hattori, S., Ono, T., Fraser, C., Hasegawa, H., 2000. Orientation of high-resolution satellite images based on affine projection. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 33, 359–366.
Hermosilla, T., Wulder, M.A., White, J.C., Coops, N.C., Hobart, G.W., 2015. An integrated Landsat time series protocol for change detection and generation of annual gap-free surface reflectance composites. Remote Sens. Environ. 158:220–234. http://dx.doi.org/10.1016/j.rse.2014.11.005.
Hey, T., Tansley, S., Tolle, K.M., et al., 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, WA.
Hilker, T., Wulder, M.A., Coops, N.C., Seitz, N., White, J.C., Gao, F., Masek, J.G., Stenhouse, G., 2009. Generation of dense time series synthetic Landsat data through data blending with MODIS using a spatial and temporal adaptive reflectance fusion model. Remote Sens. Environ. 113:1988–1999. http://dx.doi.org/10.1016/j.rse.2009.05.011.
Hollmann, R., Merchant, C.J., Saunders, R., Downy, C., Buchwitz, M., Cazenave, A., Chuvieco, E., Defourny, P., de Leeuw, G., Forsberg, R., Holzer-Popp, T., Paul, F., Sandven, S., Sathyendranath, S., van Roozendael, M., Wagner, W., 2013. The ESA climate change initiative: satellite data records for essential climate variables. Bull. Am. Meteorol. Soc. 94:1541–1552. http://dx.doi.org/10.1175/BAMS-D-11-00254.1.
Irish, R.R., Barker, J.L., Goward, S.N., Arvidson, T., 2006. Characterization of the Landsat-7 ETM+ automated cloud-cover assessment (ACCA) algorithm. Photogramm. Eng. Remote. Sens. 72, 1179–1188.
Jensen, J.R., 1986. Introductory Digital Image Processing: A Remote Sensing Perspective. Univ. of South Carolina, Columbus.
Kennedy, R.E., Yang, Z., Cohen, W.B., 2010. Detecting trends in forest disturbance and recovery using yearly Landsat time series: 1. LandTrendr — temporal segmentation algorithms. Remote Sens. Environ. 114:2897–2910. http://dx.doi.org/10.1016/j.rse.2010.07.008.
Kim, Y., Kimball, J.S., Robinson, D.A., Derksen, C., 2015. New satellite climate data records indicate strong coupling between recent frozen season changes and snow cover over high northern latitudes. Environ. Res. Lett. 10:84004. http://dx.doi.org/10.1088/1748-9326/10/8/084004.
Laney, D., 2001. 3D data management: controlling data volume, velocity and variety. META Group Research Note 6, p. 70. https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf (Accessed 14 July 2016).
Lewis, A., Wang, L.-W., Coghlan, R., Geoscience Australia, 2011. AGRI: the Australian Geographic Reference Image: a Technical Report. Geoscience Australia, Canberra.
Lewis, A., Lymburner, L., Purss, M.B.J., Brooke, B., Evans, B., Ip, A., Dekker, A.G., Irons, J.R., Minchin, S., Mueller, N., Oliver, S., Roberts, D., Ryan, B., Thankappan, M., Woodcock, R., Wyborn, L., 2016. Rapid, high-resolution detection of environmental change over continental scales from satellite data – the Earth Observation Data Cube. Int. J. Digital Earth 9:106–111. http://dx.doi.org/10.1080/17538947.2015.1111952.
Li, F., Jupp, D.L., Reddy, S., Lymburner, L., Mueller, N., Tan, P., Islam, A., 2010. An evaluation of the use of atmospheric and BRDF correction to standardize Landsat data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 3 (3):257–270. http://dx.doi.org/10.1109/JSTARS.2010.2042281.
Li, F., Jupp, D.L.B., Thankappan, M., Lymburner, L., Mueller, N., Lewis, A., Held, A., 2012. A physics-based atmospheric and BRDF correction for Landsat data over mountainous terrain. Remote Sens. Environ. 124:756–770. http://dx.doi.org/10.1016/j.rse.2012.06.018.
Li, F., Jupp, D., Lymburner, L., Tan, P., McIntyre, A., Thankappan, M., Lewis, A., Held, A., 2013. Characteristics of MODIS BRDF shape and its relationship with land cover classes in Australia. Proceedings of the 20th International Congress on Modelling and Simulation. Adelaide, Australia.
Lymburner, L., Botha, E., Hestir, E., Anstee, J., Sagar, S., Dekker, A., Malthus, T., 2016. Landsat 8: Providing continuity and increased precision for measuring multi-decadal time series of total suspended matter. Remote Sens. Environ. 185, 108–118.
Lynch, C., 2008. Big data: how do your data grow? Nature 455:28–29. http://dx.doi.org/10.1038/455028a.
Masek, J.G., Huang, C., Wolfe, R., Cohen, W., Hall, F., Kutler, J., Nelson, P., 2008. North American forest disturbance mapped from a decadal Landsat record. Remote Sens. Environ. 112:2914–2926. http://dx.doi.org/10.1016/j.rse.2008.02.010.
Masek, J.G., Goward, S.N., Kennedy, R.E., Cohen, W.B., Moisen, G.G., Schleeweis, K., Huang, C., 2013. United States Forest disturbance trends observed using Landsat time series. Ecosystems 16:1087–1104. http://dx.doi.org/10.1007/s10021-013-9669-9.
Mattmann, C.A., 2013. Computing: A vision for data science. Nature 493:473–475. http://dx.doi.org/10.1038/493473a.
Mueller, N., Lewis, A., Roberts, D., Ring, S., Melrose, R., Sixsmith, J., Lymburner, L., McIntyre, A., Tan, P., Curnow, S., Ip, A., 2016. Water observations from space: mapping surface water from 25 years of Landsat imagery across Australia. Remote Sens. Environ. 174:341–352. http://dx.doi.org/10.1016/j.rse.2015.11.003.
National Computational Infrastructure, 2016. NERDIP. http://nci.org.au/systems-services/national-facility/nerdip/ (Accessed on 7 July 2016).
Overpeck, J.T., Meehl, G.A., Bony, S., Easterling, D.R., 2011. Climate data challenges in the 21st century. Science 331:700–702. http://dx.doi.org/10.1126/science.1197869.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Poli, D., Zhang, L., Gruen, A., 2004. Orientation of satellite and airborne imagery from multi-line pushbroom sensors with a rigorous sensor model. Int. Arch. Photogramm. Remote. Sens. 35, 130–135.
Purdy, R., 2010. Satellite Monitoring of Environmental Laws: Lessons to Be Learnt from Australia. (SSRN Scholarly Paper No. ID 1744018). Social Science Research Network, Rochester, NY.
Purdy, R., Leung, D., 2012. Evidence from Earth Observation Satellites: Emerging Legal Issues. Martinus Nijhoff Publishers.
Purss, M.B.J., Peterson, P., Gibb, R., Samavati, F., Strobl, P., 2016. Discrete global grid systems for handling big data from space. Proceedings 2016 Conference on Big Data from Space (BiDS'16), pp. 117–119. http://dx.doi.org/10.2788/854791.
Ravanbakhsh, M., Wang, L.W., Fraser, C.S., Lewis, A., 2012. Generation of the Australian geographic reference image through long-strip ALOS PRISM orientation. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 39, 225–229 (XXII ISPRS Congress, 25 August – 01 September 2012, Melbourne, Australia).
Richter, R., Kellenberger, T., Kaufmann, H., 2009. Comparison of topographic correction methods. Remote Sens. 1:184–196. http://dx.doi.org/10.3390/rs1030184.
Rottensteiner, F., Weser, T., Lewis, A., Fraser, C.S., 2009. A strip adjustment approach for precise georeferencing of ALOS optical imagery. IEEE Trans. Geosci. Remote Sens. 47:4083–4091. http://dx.doi.org/10.1109/TGRS.2009.2014366.
Roy, D.P., Ju, J., Kline, K., Scaramuzza, P.L., Kovalskyy, V., Hansen, M., Loveland, T.R., Vermote, E., Zhang, C., 2010. Web-enabled Landsat data (WELD): Landsat ETM+ composited mosaics of the conterminous United States. Remote Sens. Environ. 114:35–49. http://dx.doi.org/10.1016/j.rse.2009.08.011.
Roy, D.P., Zhang, H.K., Ju, J., Gomez-Dans, J.L., Lewis, P.E., Schaaf, C.B., Sun, Q., Li, J., Huang, H., Kovalskyy, V., 2016. A general method to normalize Landsat reflectance data to nadir BRDF adjusted reflectance. Remote Sens. Environ. 176:255–271. http://dx.doi.org/10.1016/j.rse.2016.01.023.
Scarth, P., Röder, A., Schmidt, M., Denham, R., 2010. Tracking grazing pressure and climate interaction—the role of Landsat fractional cover in time series analysis. Proceedings of the 15th Australasian Remote Sensing and Photogrammetry Conference, Alice Springs, Australia, pp. 13–17.
Schott, J.R., Salvaggio, C., Volchok, W.J., 1988. Radiometric scene normalization using pseudoinvariant features. Remote Sens. Environ. 26:1–16. http://dx.doi.org/10.1016/0034-4257(88)90116-2.
Schott, J.R., 2007. Remote Sensing: The Image Chain Approach. Oxford University Press, USA.
Schowengerdt, R.A., 2006. Remote Sensing: Models and Methods for Image Processing. Academic Press.
Sexton, J.O., Song, X.-P., Huang, C., Channan, S., Baker, M.E., Townshend, J.R., 2013. Urban growth of the Washington, D.C.–Baltimore, MD metropolitan region from 1984 to 2010 by annual, Landsat-based estimates of impervious cover. Remote Sens. Environ. 129:42–53. http://dx.doi.org/10.1016/j.rse.2012.10.025.
Shlien, S., 1979. Geometric correction, registration, and resampling of Landsat imagery. Can. J. Remote. Sens. 5:74–89. http://dx.doi.org/10.1080/07038992.1979.10854986.
Sixsmith, J., Oliver, S., Lymburner, L., 2013. A hybrid approach to automated Landsat pixel quality. 2013 IEEE International Geoscience and Remote Sensing Symposium - IGARSS, pp. 4146–4149. http://dx.doi.org/10.1109/IGARSS.2013.6723746.
Soenen, S.A., Peddle, D.R., Coburn, C.A., 2005. A modified sun-canopy-sensor topographic correction in forested terrain. IEEE Trans. Geosci. Remote Sens. 43:2148–2159. http://dx.doi.org/10.1109/TGRS.2005.852480.
Teillet, P.M., Guindon, B., Goodenough, D.G., 1982. On the slope-aspect correction of multispectral scanner data. Can. J. Remote. Sens. 8:84–106. http://dx.doi.org/10.1080/07038992.1982.10855028.
Thomas, R.F., Kingsford, R.T., Lu, Y., Hunter, S.J., 2011. Landsat mapping of annual inundation (1979–2006) of the Macquarie Marshes in semi-arid Australia. Int. J. Remote Sens. 32:4545–4569. http://dx.doi.org/10.1080/01431161.2010.489064.
Thompson, S.D., Nelson, T.A., White, J.C., Wulder, M.A., 2015. Mapping dominant tree species over large forested areas using Landsat best-available-pixel image composites. Can. J. Remote. Sens. 41:203–218. http://dx.doi.org/10.1080/07038992.2015.1065708.
Tulbure, M.G., Broich, M., Stehman, S.V., Kommareddy, A., 2016. Surface water extent dynamics from three decades of seasonally continuous Landsat time series at subcontinental scale in a semi-arid region. Remote Sens. Environ. 178:142–157. http://dx.doi.org/10.1016/j.rse.2016.02.034.
United States Government, 2016. Common Framework for Earth Observation Data. Committee on Environment, Natural Resources, and Sustainability. https://www.whitehouse.gov/sites/default/files/microsites/ostp/common_framework_for_earth_observation_data.pdf (Accessed on 9 July 2016).
Williams, D.N., Balaji, V., Cinquini, L., Denvil, S., Duffy, D., Evans, B., Ferraro, R., Hansen, R., Lautenschlager, M., Trenham, C., 2015. A global repository for planet-sized experiments and observations. Bull. Am. Meteorol. Soc. 97:803–816. http://dx.doi.org/10.1175/BAMS-D-15-00132.1.
Wulder, M.A., Masek, J.G., Cohen, W.B., Loveland, T.R., Woodcock, C.E., 2012. Opening the archive: How free data has enabled the science and monitoring promise of Landsat. Remote Sens. Environ. 122:2–10. http://dx.doi.org/10.1016/j.rse.2012.01.010.
Wyborn, L., Evans, B., 2015. Integrating ‘big’ geoscience data into the petascale national environmental interoperability platform (NERDIP): successes and unforeseen challenges. IEEE Big Data Conference 2015 Workshop on Big Data in the Geosciences. Santa Clara, California. http://dx.doi.org/10.1109/BigData.2015.7363981.
Zald, H.S.J., Wulder, M.A., White, J.C., Hilker, T., Hermosilla, T., Hobart, G.W., Coops, N.C., 2016. Integrating Landsat pixel composites and change metrics with lidar plots to predictively map forest structure and aboveground biomass in Saskatchewan, Canada. Remote Sens. Environ. 176:188–201. http://dx.doi.org/10.1016/j.rse.2016.01.015.
Zhang, X., Friedl, M.A., Schaaf, C.B., Strahler, A.H., Hodges, J.C.F., Gao, F., Reed, B.C., Huete, A., 2003. Monitoring vegetation phenology using MODIS. Remote Sens. Environ. 84:471–475. http://dx.doi.org/10.1016/S0034-4257(02)00135-9.
Zhu, Z., Woodcock, C.E., 2012. Object-based cloud and cloud shadow detection in Landsat imagery. Remote Sens. Environ. 118:83–94. http://dx.doi.org/10.1016/j.rse.2011.10.028.
Zhu, Z., Woodcock, C.E., 2014. Continuous change detection and classification of land cover using all available Landsat data. Remote Sens. Environ. 144:152–171. http://dx.doi.org/10.1016/j.rse.2014.01.011.
