
Project Summary

Pangeo: An Open Source Big Data Climate Science Platform

Climate, weather, and ocean simulations (Earth System Models; ESMs) are crucial tools for the
study of the Earth system, providing both scientific insight into fundamental dynamics as well as
valuable practical predictions about Earth’s future. Continuous increases in ESM spatial resolu-
tion have led to more realistic, more detailed physical representations of Earth system processes,
while the proliferation of statistical ensembles of simulations has greatly enhanced understanding
of uncertainty and internal variability. Hand in hand with this progress has come the generation of
Petabytes of simulation data, resulting in huge downstream challenges for geoscience researchers.
The task of mining ESM output for scientific insights (generically called “analysis” or “postpro-
cessing”) has now itself become a serious Big Data problem. Existing Big Data tools cannot easily
be applied to the analysis of ESM data, leading to a building crisis across a wide range of geo-
science fields. This is exactly the sort of problem EarthCube was conceived to address.
The proposed work will integrate a suite of open-source software tools (the Pangeo Platform)
which together can tackle petabyte-scale ESM datasets. The core technologies are the Python
packages Dask, a flexible parallel computing library which provides dynamic task scheduling,
and XArray, a wrapper layer over Dask data structures which provides user-friendly metadata
tracking, indexing, and visualization. These tools interface with netCDF datasets and understand
CF conventions. They will be brought to bear on four high-impact Geoscience Use Cases in at-
mospheric science, land-surface hydrology, and physical oceanography. Disciplinary scientists
will define workflows for each use case and interact with computational scientists to demonstrate,
benchmark, and optimize the software. The resulting software improvements will be contributed
back to the upstream open-source projects, ensuring long-term sustainability of the platform. The
end result will be a robust new software toolkit for climate science and beyond. This toolkit will
enhance the Data Science aspect of EarthCube. Collaborating Institutions: Lamont Doherty Earth
Observatory, National Center for Atmospheric Research, Continuum Analytics, NOAA Geo-
physical Fluid Dynamics Laboratory, NASA Goddard Institute for Space Studies.
Intellectual Merit: Some of the most exciting and ambitious ideas in climate science are currently
impossible to realize due to the computational burden of processing petabyte-scale datasets. The
Pangeo Platform will enable a new Big Data era in climate science in which disciplinary scientists
can realize their most ambitious goals. Beyond climate and related disciplines, multidimensional
numeric arrays are common in many fields of science (e.g. astronomy, materials science, microscopy).
However, the dominant Big Data software stack (Hadoop) is oriented towards tabular, text-based
data structures and cannot easily ingest petabyte-scale multidimensional numeric arrays. The pro-
posed work thus has potential to transform Data Science itself, enabling analysis of such datasets
via a novel, highly scalable, highly flexible tool with a syntax familiar to disciplinary researchers.
Broader Impacts: The Dask and XArray packages are already widely used in the scientific python
community, and the development of these tools resulting from our proposed work will greatly
benefit this upstream community, enhancing productivity in areas from finance to astronomy. Ad-
ditionally, training and educational materials for these tools will be developed, distributed widely
online, and integrated into existing educational curricula at Columbia. A workshop at NCAR in
the final year will help inform the broader community about Pangeo. Collaborators at other US
climate modeling centers will encourage adoption and participation in the Pangeo project by their
scientists.

Contents

1 Introduction
  1.1 Motivation: The Big Data Crisis in Earth System Modeling
  1.2 The Scientific Python Ecosystem
  1.3 Proposal Team, Project Goals, and Science Outcomes

2 Community Datasets

3 Geoscience Use Cases
  3.1 Atmospheric Moisture Budgets in CMIP Models (Seager, Henderson)
  3.2 Convective Parameters for Understanding Severe Thunderstorms (Tippett, Lepore)
  3.3 Spatial Downscaling and Bias Correction for Hydrologic Modeling (Hamman)
  3.4 Statistical and Spectral Analysis of Eddy-Resolving Ocean Models (Abernathey)

4 Technical Implementation
  4.1 Python Software Integration
    4.1.1 NetCDF
    4.1.2 XArray
    4.1.3 Dask
    4.1.4 MetPy
    4.1.5 Jupyter Notebook
  4.2 Deployment
    4.2.1 NCAR CMIP Analysis Platform and Research Data Archive
    4.2.2 Columbia Habanero HPC System
    4.2.3 Cloud Computing and Jetstream
  4.3 Benchmarking and Optimization of Use Cases
  4.4 Demonstration and Documentation

5 EarthCube Participation

6 Broader Impacts

7 Management Plan
  7.1 Roles of Different Institutions and Investigators
  7.2 Coordination and Communication
  7.3 Timeline

8 Sustainability Plan

(a) Professional Preparation
(b) Appointments
(c) Products
(d) Synergistic Activities

Project Description
1 Introduction
1.1 Motivation: The Big Data Crisis in Earth System Modeling
Earth’s climate system is experiencing unprecedented change, as anthropogenic greenhouse
emissions continue to perturb the global energy balance [51]. Understanding and forecasting
the nature of this change, and its impact on human welfare, is both a profound scientific chal-
lenge and an urgent societal problem. While other scientific fields (e.g. microbiology, particle
physics) are able to conduct laboratory experiments to better understand complex phenomena,
climate scientists only have one planet to observe. Consequently, Earth system models (ESMs,
a.k.a. climate models)—numerical simulations of the coupled interaction of ocean, atmosphere,
land, cryosphere, and biosphere—have become an indispensable tool for scientific inquiry, a “vir-
tual laboratory” in which hypotheses can be tested and predictions validated. The output of ESMs
constitutes a primary data source for thousands of geoscientists across the world. Despite this cen-
tral, cross-disciplinary role for ESMs, relatively few EarthCube projects are focused on ESM data.
Our proposed work will integrate a suite of open-source technologies and data archives to pro-
duce a comprehensive platform for the analysis of ESM simulations, transforming the way climate
and related science is done. In doing so, we will add cutting-edge Data Science techniques with
cross-disciplinary value to the EarthCube Geoscientist’s Workbench. By deploying our platform
on NSF-funded computing resources, we will lay the groundwork for scalable infrastructure.
The need for a new, integrated approach to ESM data is underscored by the explosive growth in
the volume of the datasets themselves. ESMs are traditional high performance computing (HPC)
applications which run on supercomputing clusters. Over the past decades, the computing power
of such clusters has increased exponentially, and ESMs have adapted by increasing resolution,
complexity, and ensemble size. This adaptation is possible because of the massively parallel ar-
chitecture of ESM codes, which are designed from the ground up to leverage HPC resources. At
the same time, there has been little change in how ESM output data is processed, analyzed, and
visualized: these tasks are most commonly accomplished on a single workstation or laptop using
MATLAB or similar software. This “post-processing” code is generally not suited to leverage HPC
resources. As ESM output datasets reach the petabyte scale (detailed in Sec. 2), this analysis work-
flow is becoming totally unsustainable, leading to a paradoxical crisis in climate science: the better
the models become, the harder it is to use them for research. A related crisis is the cost of simply
storing the ESM data; while once it was feasible for institutions or individuals to maintain local
copies of these datasets, petabyte-scale storage is prohibitively expensive. This is a quintessential
Big Data problem, compounded by a lack of interoperability between storage and analysis systems.
The only way to close these widening gaps is to move ESM analysis workflows to shared
computing resources capable of parallel task execution, massive data storage, and high-speed in-
put/output (I/O). Such environments exist on HPC clusters and cloud platforms. In particular,
the NSF has made major investments, via the NCAR Computational and Information Systems
Laboratory (CISL), in the recently launched Cheyenne HPC system, and in the XSEDE Jetstream
Cloud platform. However, hardware alone won’t solve the problem. A sophisticated software
layer is needed to effectively leverage computational resources in a way that allows disciplinary
scientists to easily transfer their workflows from desktop to HPC/cloud. Our proposed project
will provide this layer, creating new technical capability across resources that improves interoper-
ability.

1.2 The Scientific Python Ecosystem
Rather than developing new tools from scratch, our project seeks to integrate elements of the
Scientific Python ecosystem into the EarthCube Geoscientist’s Workbench. In the past decade, the
Python programming language has emerged as the dominant open-source platform for science,
partially displacing commercial tools such as MATLAB and IDL [44]. Python appeals to scientists
because it is free and open (facilitating reproducibility), connects easily to existing C/C++ and For-
tran codes, and also because of its powerful, constantly evolving capabilities for scientific analysis.
Rather than a single monolithic application like MATLAB, the scientific python platform consists
of many separate but related packages which work together and provide different functionality;
this scenario is often described as a “software ecosystem” [32]. The foundational packages of this
ecosystem are NumPy, which provides fast numerical array data structures [42, 59], SciPy, provid-
ing common scientific numerical routines, matplotlib [25], a full-featured plotting environment,
and the Jupyter Notebook, an interactive, web-based execution environment for python code [43].
All of this software is open-source and is supported and developed through a combination of com-
mercial, foundation, and volunteer efforts. The scientific python ecosystem is also widely used in,
and supported by, industry, particularly in the emerging field of Data Science [33]. Our proposed
work will use tools from, and contribute back to, this ecosystem. The size, health, and vibrancy
of this broader scientific python community is key to the long-term sustainability of our proposed
project.
1.3 Proposal Team, Project Goals, and Science Outcomes
Our team represents an ambitious collaboration between geoscientists from Columbia’s Lam-
ont Doherty Earth Observatory, one of the world’s leading university-based centers for climate
research, NCAR, an NSF-sponsored provider of computational services to the atmospheric and
related sciences, and Continuum Analytics, a dynamic Data Science startup oriented around the
open-source scientific python ecosystem. Each of these institutions has demonstrated leadership
and innovation in its respective geoscience / Data Science disciplines.
The long-term, overarching goal of our project, provisionally named Pangeo, is to aid the tran-
sition of geoscience research to the era of Big Data, particularly the analysis of gridded data, which
is not well supported by existing Big Data tools. The specific goal of our proposed work is to in-
tegrate existing open-source software packages into the Pangeo Platform and create four flagship
demonstrations, corresponding to our four Geoscience Use Cases, of this powerful tool. All these
use cases represent potentially transformative scientific inquiries which are currently hampered
by the overwhelming volume of ESM datasets.
2 Community Datasets
The datasets described here will be employed in our Geoscience Use Cases and are broadly rep-
resentative of the nature of datasets in these fields. In contrast to many other areas of geoscience,
ESM datasets for climate, atmospheric science, and physical oceanography are already highly
Findable, Accessible, Interoperable, and Re-usable (FAIR). The vast majority of ESM data is stored
in netCDF format following well-established CF-conventions, and the same netCDF archives are
analyzed by thousands of scientists. The main technical obstacle is the sheer size of these datasets.
The Coupled Model Intercomparison Project (CMIP) is the standard experimental protocol for
ESM centers. Since the project’s inception in 1995, nearly the whole international climate model-
ing community has participated. The CMIP3 multi-model dataset [31, 2239 citations] includes 12
experiments using 25 models from 17 modeling centers, and is hosted by PCMDI at LLNL. The
total size is 36TB. The CMIP5 dataset [53, 4214 citations] includes 110 more recent experiments
using 61 models from 29 modeling centers, resulting in a total of 3.3 PB. CMIP5 data is accessible
via the NCAR CMIP analysis system (see Sec. 4.2.1). The high citation counts for CMIP3 and
CMIP5 testify to their widespread usage across disciplines. CMIP6 is currently in preparation [18]
and is expected to become available incrementally over the next few years. It will include a 6-fold
increase in the number of experiments and number of models and is projected to produce 150PB
of data, dwarfing previous CMIP datasets. Meeting the challenges posed by CMIP6 is a major
motivation for our proposal.
Another important category of ESM dataset is Atmospheric Reanalysis, i.e. data-assimilating
hindcasts of the past evolution of the atmosphere. A widely used reanalysis product is
the ERA-Interim [14, 7099 citations]; its temporal range extends from 1979 to the present, and its size is
36.17 TB. ERA5, expected to be released in spring 2018, will provide data at a considerably higher
spatial and temporal resolution than ERA-Interim. It is expected to produce roughly 5 PB of data.
Like CMIP6, this volume will overwhelm existing analysis frameworks.
Large subsets of these datasets (and many other similar ones) are hosted on NCAR’s Globally
Accessible Data Environment (GLADE) storage system, one of the central data facilities for US
geoscience. However, the lack of scalable analysis tools means that scientists cannot take full
advantage of the available datasets and computational resources.
3 Geoscience Use Cases
Our proposal is fundamentally science driven. To illustrate the cross-disciplinary relevance
of the Pangeo Platform, we have assembled four Geoscience Use Cases which exemplify its po-
tential impact. Domain scientists with direct expertise in these use cases are funded members
of our proposal team, ensuring direct, immediate, high-impact geoscience outcomes. There is a
common thread to these use cases: with traditional tools, domain scientists are unable to pursue
their most exciting visions for the analysis of ESM datasets due to the overwhelming computa-
tional and storage costs. A platform that can leverage HPC and cloud computing resources for
ESM data analysis will enable these ambitious visions to be realized. Here we describe the science
background, current roadblocks, and desired outcomes for each use case. In the next section, we
describe the technical implementation of our proposed solution.
3.1 Atmospheric Moisture Budgets in CMIP Models (Seager, Henderson)
A major focus of climate research concerns characterizing, understanding and predicting hy-
drological variability and change, particularly droughts and floods, on timescales of hours to
decades. In this regard, the simulations of the historical period and projections of the future
performed for CMIP3 and CMIP5 (see Sec. 2) have been tremendously valuable, allowing an un-
precedented assessment of the ability of state-of-the-art climate models to properly reproduce
major hydrological events. In addition, the CMIP projections allow development of probabilistic
assessments of changing risk of drought and flood. As the CMIP6 data comes online over the next
few years, we are likely to see major advances in how models represent, and consequently predict,
such risks.
Key to assessing the validity of these results is an understanding of the physical mechanisms that
underlie the projected changes. The simplest mechanism is that a warming atmosphere holds more
moisture. Hence, a given circulation anomaly creates larger moisture convergences and divergences
and, consequently, stronger floods and droughts, and the mean contrast between climatologically dry
and wet zones intensifies. More interesting, and essential for fully understanding hydrological
variability and change, is the role of variability and change in atmospheric circulation (e.g. the
Hadley Cell, monsoons, subtropical anticyclones, jet streams, storm tracks, etc.).
Complete determination of the causes of hydrological events requires a full breakdown of the
moisture budgets in the models. A breakdown into components due to changes in humidity and
circulation must be done, followed by a breakdown into changes in moisture advection and mois-
ture divergence, and the relation of these to the causes of the changes in atmospheric flow. This
requires working with sub-daily resolution data on a model-by-model basis at the highest hor-
izontal and vertical resolutions possible, with careful attention to numerical methods for differ-
entiation and integration required to conserve moisture. A “best-practices” methodology for this
was developed using atmospheric reanalyses by Seager and Henderson (2013) and has been ap-
plied to the CMIP5 models (e.g. Seager et al., 2014a,b). This work was supported by NSF awards
AGS-1401400 and AGS-1243204, both with Seager as lead PI and Henderson as co-PI.
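For reference, and in simplified notation introduced here only for illustration (the full treatment, including transient-eddy and surface terms, follows Seager and Henderson, 2013), the vertically integrated moisture budget underlying this methodology takes the schematic form, written in LaTeX notation:

    \rho_w g\,(\overline{P} - \overline{E}) \;\approx\; -\nabla \cdot \int_0^{p_s} \overline{q\,\mathbf{u}}\; dp ,

and the change between two climate states can be decomposed schematically as

    \delta(P - E) \;\sim\; -\nabla \cdot \int_0^{p_s} \overline{\mathbf{u}}\,\delta q \; dp
                  \;-\; \nabla \cdot \int_0^{p_s} \delta\mathbf{u}\,\overline{q}\; dp
                  \;+\; \text{(transient-eddy and residual terms)} ,

where q is specific humidity, u the horizontal wind, p_s surface pressure, rho_w the density of water, overbars denote time means, and delta denotes the difference between climate states; the first term is the thermodynamic contribution and the second the dynamic contribution.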
Computing moisture budgets (as for temperature, humidity and vorticity budgets) is a time-
consuming and computationally intricate affair, which is probably why few other researchers have
attempted it as comprehensively as we have. However, the resulting products are exceedingly
useful, offering physical insight into modeled hydroclimate change and variability. Unfortunately,
for each new model generation (e.g., the soon-to-be CMIP6), the budgets need to be recomputed
and the prior analyses compared with those for the new model generation. Once a new generation
of model simulations and projections is produced, re-implementing the analysis can take years,
delaying significantly the research breakthroughs to be gained by intercomparison of different
model generations. The unprecedented scale of CMIP6 only heightens this problem.
The main technical obstacle to computing the budgets is simply a matter of accessing a large
amount of 5-dimensional data (time, model run, longitude, latitude, level). Our past workflows
required downloading over a PB of netCDF data, storing it locally on workstations (mostly pur-
chased with NSF funds), and implementing fragile, ad-hoc analysis pipelines that can take weeks
or months to complete. This way of working is not easily reproducible by other researchers, or
even by ourselves, as model data volumes continue to grow. The estimated data volumes of
CMIP6, which has higher-resolution models and more ensembles, render our past way of working
infeasible.
The moisture-budget analysis we developed represents a scientifically valuable, computation-
ally challenging Big Data climate science application, ideally suited for acceleration by the Pangeo
Platform. In this proposed work, we will re-implement the moisture budget calculations using
XArray and Dask and deploy this analysis on NCAR and Columbia HPC resources. Working to-
gether with NCAR and Continuum software engineers, we will identify and squash performance
bottlenecks. We will benchmark the parallel scalability of this new way of working against our
old, in-house system. As CMIP6 data comes online over the project performance period, the Pan-
geo Platform will enable us to calculate the moisture budgets in this new generation of models,
resulting in immediate, high impact science outcomes.
3.2 Convective Parameters for Understanding Severe Thunderstorms (Tippett, Lepore)
Flooding, tornadoes and severe thunderstorms cause billion-dollar weather and climate dis-
asters every year and are among the top drivers of catastrophic losses in the United States, with
estimated losses from severe thunderstorms alone totaling $14.5 billion (US) in 2016 [41]. Key
science questions about severe thunderstorms include estimating the current risk, understand-
ing past variability and projecting changes expected in a warmer climate. These questions have
gained urgency as recent studies have indicated increased variability in a broad range of metrics
of severe thunderstorm activity, including economic and insured losses [45], the number of days
on which many tornadoes occur [5, 17], and the number of tornadoes per outbreak [56]. The cli-
mate signals behind these trends are currently uncertain and the observed trends are different
from those expected under climate change [57].
Long-term historical records of severe thunderstorm activity are available in few parts of the
world, and even the quality of U.S. reports is inadequate to answer many climate questions [60].
Consequently, ESMs and reanalyses have a critical role to play in understanding the impact of climate
signals on severe thunderstorm activity. However, a fundamental challenge is that reanalysis and
ESMs do not resolve the spatial scales of severe thunderstorms. Despite this limitation, consid-
erable progress has been made using related large-scale meteorological parameters as proxies for
severe thunderstorm activity. Reanalysis and ESMs are able to resolve these proxies, and they have
been used to address questions about climate change [15] and year-to-year variability [54]. These
proxies are based on convective parameters (functions of winds, temperature and moisture) that
forecasters have found to be important “ingredients” for severe thunderstorm occurrence [e.g.,
6]. Similar approaches have been used for flash floods [16], tropical cyclones [55], and the US
rainfall intensity process [29, 30]. The development of more sophisticated severe thunderstorm
proxies has likely been held back by the volume of data (reanalysis fields at high resolution in time
and space) and the need for a platform that can both efficiently access the data and perform the
required statistical calculations [27].
In fact, most studies have focused on specific regions and specific reanalysis products and/or ESMs,
and are usually limited to only one ensemble member. This is due to computational constraints; con-
vective parameters need to be calculated from the 3D atmospheric data at the highest possible
spatial and temporal resolution using complex formulas. The volume of data as well as the com-
putational cost for such calculations quickly becomes burdensome, especially for high resolution
models with many ensemble members. At present, there is no publicly available archive of con-
vective parameters for the most commonly used reanalysis products and ESMs.
In this Use Case, we will use the Pangeo Platform to calculate the convective parameters from
a suite of reanalysis products and climate models stored locally on NCAR’s GLADE, generating a publicly
available archive of convective parameters and creating an unprecedented probabilistic view of these
important variables. We will also implement quantile mapping/regression capabilities necessary
to understand the drivers of variability in convective parameters. As new reanalysis and ESM
data become available (ERA5, CMIP6), we will be able to update the calculations with the pre-
dictions of the latest, most sophisticated models. This Use Case builds on results from past NSF
award EAR-0910721 [30], which funded C. Lepore’s postdoc.
3.3 Spatial Downscaling and Bias Correction for Hydrologic Modeling (Hamman)
As noted above, ESMs help us understand how large regional-to-global scale hydroclimate
will change in response to anthropogenic effects such as greenhouse gas emissions and land cover
change. However, there are some fundamental issues that limit the usefulness of information
we can get directly from the ESMs. First, modern ESMs continue to exhibit regional biases in
important hydroclimatic forcing variables (e.g. sea level pressure, temperature, precipitation) as
well as issues with climate dynamics such as teleconnections (ENSO, PDO, etc.) and storm in-
termittency and severity. These biases limit the information that can be directly extracted from
ESMs with regard to regional-to-local scale hydrology. Second, in the current generation of ESMs
(i.e., CMIP5 and CMIP6), the spatial resolution is not high enough (even in the highest resolution
models) to capture important hydrologic processes (e.g. mountain snow) that are controlled by
spatial heterogeneity in topography and other land surface characteristics. An important area of
research spanning the climate and water resource science communities revolves around finding
better ways to relate the climate changes from state-of-the-art ESM projections to real-world wa-
ter resource systems. In this Use Case, we discuss the approach of spatial downscaling and bias
correction of ESMs for hydrologic modeling and how ongoing research at NCAR will benefit from
the Pangeo Platform.
Downscaling is a procedure that relates information at large scales (ESMs) to local scales. With
the common goal of addressing the issues described above, a range of downscaling methodologies
have been developed that span a broad range of complexities from simple statistical tools to
full dynamic atmosphere models. Ongoing research in the NCAR Water Systems Program (par-
tially supported by NSF award AGS-0753581) is developing tools [22, 21] for understanding how
methodological choices in the downscaling step impact the inferences end-users may draw from
climate change impacts work.
The specific workflow to be implemented will enhance ongoing research in the NCAR Water
Systems Program by developing a Python-based toolkit for circulation-based statistical downscal-
ing. We will leverage the tools developed in this project to build an extensible and user-friendly
interface for existing tools currently written in low level languages (C and Fortran), specifically the
Generalized Analog Regression Downscaling (GARD) toolkit. Using this interface, GARD will be
extended to support additional downscaling methodologies that utilize high-level machine learn-
ing packages available in Python (e.g., scikit-learn, PyMVPA) and to operate on larger datasets
than is currently possible. The end product will be an open-source package that enables future
downscaling work throughout the climate impacts community.
3.4 Statistical and Spectral Analysis of Eddy-Resolving Ocean Models (Abernathey)
Of all the components of ESMs, the ocean is among the most data-intensive, due to the high
spatial resolution required to accurately model ocean physics. The resolution is dictated by the
ocean’s internal deformation radius, a fundamental length scale for ocean dynamics, which varies
with latitude from roughly 15-200 km [7, 23]. When the deformation radius is resolved by the
model grid, turbulent “mesoscale” eddies emerge spontaneously in the ocean, leading to turbu-
lent transport of heat, carbon, and other climatically relevant tracers. Historically, ocean models
have not attempted to resolve such small scales, employing grid sizes of 1 or 2 degrees and pa-
rameterizing the unresolved transports using diffusive closures [19]. Due to increases in HPC
capability, however, a new generation of high-resolution (approximately 0.1 degree) ocean mod-
els has become computationally feasible within ESMs. Such models are currently being run at
both NCAR and GFDL, two of the leading US climate modeling centers. They produce greatly
improved representations of the ocean circulation, which in turn leads to major impacts on the
overall climate system, including the uptake of heat and carbon in future climate change scenar-
ios [49, 20]. This new generation of ocean models thus represents a transformative breakthrough.
Unfortunately, these eddy-resolving ocean models generate unwieldy amounts of data. A sin-
gle snapshot of one model 3D state variable typically requires more than 1 GB of storage. Models
typically have at least a dozen such variables (temperature, salinity, velocities, chemical tracers,
etc.). Daily output frequency is desirable in order to capture the intrinsic timescales of ocean
motion, while century-long runs are necessary to capture intrinsic and forced climate variability.
Consequently, the ocean model outputs easily reach the Petabyte scale. As with our other use
cases, downloading the data is simply not an option. Fortunately, ocean model data is generally
stored on the HPC systems where it is produced. (For CESM simulations, this is the NCAR Yel-
lowstone / Cheyenne system; GFDL uses the NOAA / ORNL Gaea system.) It is thus a perfect
target for the Pangeo Platform, which aims to leverage shared supercomputing resources for ESM
data analysis.
The specific workflows to be implemented for this use case are linked to ongoing NSF award
OCE-1553593, “CAREER: Evolution of Ocean Mesoscale Turbulence in a Changing Climate” (PI
Ryan Abernathey, also a PI on this proposal). This work seeks to understand how the statistics of
ocean turbulence are evolving as the ocean warms, in part by analyzing the ocean component of
high-resolution ESMs. The analysis involves computing statistical distributions and wavenumber
spectra of the ocean velocity fields, aggregated in regional boxes. While relatively straightforward
conceptually, scaling these methods to Petabyte-scale ESM datasets is a major computational challenge.

Figure 1: Schematic of the Pangeo Platform. The end user connects to the HPC system with a standard web browser via the Jupyter Notebook. The user programs data analysis routines using XArray. Dask schedules the computations across compute nodes, reading data in parallel from the storage system as necessary.
The ocean model data to be analyzed are from the eddy-resolving CESM simulation of [49],
which are archived on the CISL HPSS storage system. (Abernathey has already accessed the data
as part of ongoing work.) So far the analysis has focused on just the surface fields [58], due to
the computational difficulty of working with full three-dimensional output. The Pangeo Platform
deployed on NCAR systems will facilitate a more ambitious, full-depth analysis of mesoscale
statistics and their temporal variability and trends. This understanding will lead to improved
estimates of ocean uptake of heat and carbon, a high-impact science outcome.
4 Technical Implementation
The technical implementation of the Pangeo Platform involves the following steps: (1) Inte-
grate existing python packages into a coherent system; (2) Deploy this system on NCAR HPC
resources and NSF Jetstream cloud computing resources; (3) Benchmark and tune based on the
Geoscience Use Cases; and (4) Write documentation and tutorials (demonstrations) for end users.
4.1 Python Software Integration
The basic building blocks of Pangeo already exist as community-maintained open-source soft-
ware projects. However these projects will require improvements both to better support Pangeo
use cases and to facilitate integration into a cohesive system. Here we briefly outline these projects
and their relationship and then describe each project in greater depth in the following sections.
Fig. 1 illustrates schematically how the elements work together.
1. NetCDF (4.1.1) is the standard file format for large volumes of complex geoscience data
2. XArray (4.1.2) provides intuitive computation on top of NetCDF data for commonly used
computations
3. Dask (4.1.3) sits beneath XArray and provides parallel and distributed computing on parallel
hardware
4. MetPy (4.1.4) is a collection of advanced routines for atmospheric data analysis
These projects can already work together today to some degree, enabling a large community
of researchers to intuitively perform computations on medium-sized datasets (10-100GB) on per-
sonal computers. In the sections below we describe each project and how it relates to the others in
more depth, as well as the necessary changes that will have to be made to support better collaboration
on very large (Petabyte) datasets. These activities are a centerpiece of our proposed work.

Figure 2: An example of how a dataset (netCDF or XArray) for a weather forecast might be structured. This dataset has three dimensions, time, y, and x, each of which is also a one-dimensional coordinate. Temperature and precipitation are three-dimensional data variables. Also included in the dataset are two-dimensional coordinates latitude and longitude, having dimensions y and x, and reference time, a zero-dimensional (scalar) coordinate. Figure and caption from [24].
4.1.1 NetCDF
NetCDF is the most common storage format for large multi-dimensional data in geoscience
domains. NetCDF extends the powerful HDF5 format commonly used in HPC environments, which
allows for efficient storage of large multi-dimensional arrays of regularly strided data and arbi-
trary metadata. NetCDF extends HDF5 by adding the Common Data Model (CDM) conventions
that coordinate metadata and link together different arrays with consistent labels and coordinates.
Implementations of the CDM are ubiquitous across the geosciences in the form of file formats and
data access protocols such as netCDF, HDF, and OPeNDAP. NetCDF and its associated metadata
conventions are the focus of existing EarthCube project “Advancing netCDF-CF for the Geoscience
Community,” and our project will leverage and build upon that group’s work.
The CDM is a natural way to store and share scientific datasets and, as such, is widely un-
derstood by scientists throughout the Earth science community. For example, we can associate
arrays that store coordinates like latitude, longitude, and time, to other arrays that store data vari-
ables like temperature or pressure. This association of how the arrays relate to each other allows
us to transform many values together as a single cohesive system. As the number of variables
increases, transforming everything together in a consistent manner can become challenging. The
netCDF Python library itself focuses entirely on exposing the functionality of the netCDF C li-
brary, and does not provide higher-level functionality for interpreting the metadata in the file and
assigned to variables. Tools like XArray, described in Section 4.1.2, sit atop the netCDF Python
API and provide this higher-level functionality to simplify user tasks.
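To make the distinction concrete, the following is a minimal sketch (the file and variable names are hypothetical) of reading data through the low-level netCDF4 Python interface; tools like XArray layer metadata interpretation on top of exactly this kind of access.

    import netCDF4

    # Open an existing netCDF file (hypothetical name) read-only.
    ds = netCDF4.Dataset('air_temperature.nc')

    # Variables are exposed essentially as the C library sees them.
    air = ds.variables['air']
    print(air.dimensions)             # e.g. ('time', 'lat', 'lon')
    print(getattr(air, 'units', ''))  # CF attributes must be read explicitly

    # Reading values yields a plain NumPy array; interpreting coordinates
    # and CF conventions is left entirely to the user.
    data = air[:]
    ds.close()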
4.1.2 XArray
XArray is a community-developed, open-source software project and Python package that
provides tools and data structures for working with N-dimensional labeled arrays [24]. The
XArray data model is fundamentally based on the Common Data Model (CDM) used with netCDF,
which provides a standard for metadata-enabled, self-describing scientific datasets. The labels
used by XArray come from the metadata described by the CDM; examples of labels (also re-
ferred to as coordinates) include latitude, longitude, time, and experiment number. Fig. 2 illus-
trates an example XArray dataset. Active XArray developers (Hamman and Abernathey) are PIs
on this proposal.
The development of XArray has been driven largely by the geoscientific computing community, and
has been motivated by the lack of existing extensible and scalable analysis tools that can be applied
to modern geoscientific datasets. XArray provides an interface designed to meet three primary
goals: (1) Provide data structures that include necessary metadata to facilitate label-based oper-
ations and eliminate common data coding mistakes that arise from detaching datasets from the
metadata that describes them; (2) Facilitate rapid analysis of array-based datasets by providing a
high-level analysis toolkit; and (3) Inter-operate with a range of existing packages in and out of
the geoscience domain.
Built on top of the XArray data model is a robust toolkit that includes the following key fea-
tures: (1) label-based indexing and arithmetic; (2) interoperability with the core scientific Python
packages (e.g., Pandas, NumPy, Matplotlib); (3) out-of-core computation on datasets that don’t fit
into memory (via Dask, see Section 4.1.3); (4) a wide range of serialization and input/output (I/O)
options (e.g. NetCDF 3/4, OPeNDAP (read-only), GRIB 1/2 (read-only), and HDF 4/5); and (5)
advanced multi-dimensional data manipulation tools such as group-by and resampling.
XArray’s high-level interface is well-documented, intuitive, and easy to use, even for those
new to Python. For the geoscience community, this facilitates relatively painless transitions to
Python and XArray. Once using XArray, geoscientists are able to quickly produce meaningful
science without (1) writing boilerplate code (e.g. reading/writing datasets) or (2) being bogged
down by excessive reliance on a particular metadata structure. The example code snippet below
demonstrates the clarity and concision of XArray syntax and produces Figure 3:
import xarray as xr

# Load a netCDF dataset
ds = xr.open_dataset('air_temperature.nc')
# Resample daily data to monthly means
ds = ds.resample('MS', dim='time', how='mean')
# Calculate a monthly climatology
climatology = ds.groupby('time.month').mean(dim='time')
# Calculate monthly anomalies
anomalies = ds.groupby('time.month') - climatology
# Plot an example monthly anomaly (June 2013)
anomalies.sel(time='2013-06')['air'].plot()

Figure 3: Example output from the XArray analysis code snippet shown above.

While the current version of XArray is beginning to facilitate new and improved data science
workflows for geoscientific datasets, there are a number of limitations that we hope to address
through this project, namely: (1) improvement of distributed computing and efficiency with Dask,
focusing on new functionality for parallel map/apply/groupby operations of arbitrary functions;
(2) enhancement of current XArray tools for integrating third-party packages within XArray,
including statsmodels (advanced statistics), scikit-learn (machine learning), pint (geophysical
units), and MetPy (Sec. 4.1.4); and (3) development of a new set of tutorials for XArray targeting
the geoscience user community which demonstrate XArray’s integration within the Pangeo Platform.
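As a point of reference for goal (1) above, here is a minimal sketch (the file pattern and chunk sizes are illustrative only) of how XArray already hands chunked, lazy computation to Dask; the proposed work aims to make this pattern scale smoothly to distributed clusters and to arbitrary user-defined functions.

    import xarray as xr

    # Open many netCDF files lazily as one dataset backed by Dask arrays;
    # no data is read into memory at this point.
    ds = xr.open_mfdataset('air_temperature_*.nc', chunks={'time': 365})

    # Build a lazy task graph for a monthly climatology of one variable.
    climatology = ds['air'].groupby('time.month').mean('time')

    # Trigger the actual (parallel, out-of-core) computation.
    result = climatology.compute()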
4.1.3 Dask
Dask is a system for parallel computing that coordinates well with Python’s existing scientific
software ecosystem. Data analysis tools like XArray use Dask to parallelize complex workloads
across many cores on a single workstation or across many machines in a distributed cluster. Dask
manages running tasks on different workers, tracking the location of intermediate results, mov-
ing data across the network, managing failed nodes, etc. The creator of Dask, Matt Rocklin of
Continuum Analytics, is a PI on this proposal.
Dask can be used at either a high level or low level. At a high level, Dask provides paral-
lel multi-dimensional arrays, tables, machine learning tools, etc. that parallelize existing popular
libraries like NumPy, Pandas, and Scikit-Learn respectively. For example Dask.array provides a dis-
tributed multi-dimensional array interface that is API-compatible with NumPy, Scientific Python’s
standard array solution. Dask arrays are composed of many smaller in-memory NumPy arrays dis-
tributed throughout a cluster of machines. Operations on a Dask array trigger many smaller op-
erations on the constituent NumPy arrays. Because Dask arrays use NumPy arrays internally they
are quite fast (NumPy is written largely in C) and interoperate well with most other scientific
Python libraries (NumPy is the standard shared data structure between most libraries). Because
Dask.array faithfully implements a large (and growing) subset of the NumPy programming
interface, users and downstream libraries (like XArray) benefit because they can easily switch be-
tween NumPy and Dask array workloads without costly rewriting.
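To illustrate this API compatibility, a brief sketch (array sizes and chunk shapes chosen arbitrarily) of moving the same expression from NumPy to Dask:

    import numpy as np
    import dask.array as da

    # NumPy: the entire array is held in memory and computed eagerly.
    x = np.random.random((1000, 1000))
    result_np = (x - x.mean(axis=0)).std()

    # Dask: the same expression on an array split into 250 x 250 chunks.
    # Operations build a task graph; .compute() evaluates it in parallel.
    y = da.random.random((1000, 1000), chunks=(250, 250))
    result_da = (y - y.mean(axis=0)).std().compute()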
At a low level, Dask is a dynamic distributed task scheduler. Any Dask computation (such as
those that arise from geophysical XArray commands) creates a task graph with dependencies (see
Fig. 4 for an example). Each task is composed of a function to be applied on some piece of data,
such as a single NumPy array chunk, or on the results of other dependent tasks. For non-trivial
computations, dependencies exist between tasks, such that the results of one computation are
required by others. Normal geophysical computations create a complex web of hundreds of
thousands of small tasks. It is Dask’s job to take this web of small tasks and map it intelligently
to the available computing resources in such a way that balances load, minimizes data transfer,
responds to busy or failed workers, etc.

Figure 4: A task graph arising from computing x + x.T on a blocked 3 x 3 matrix. Circles represent in-memory NumPy computations like add or transpose, while boxes represent data, in this case small chunks of the overall array. This small example shows how we can break down blocked array computations that can then be computed in parallel.
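At its lowest level, a Dask graph is simply a Python dictionary mapping keys to data or to tasks (tuples of a callable and its arguments). A toy sketch, executed here with Dask’s single-machine synchronous scheduler:

    from operator import add, mul
    import dask

    # A hand-written task graph: 'y' depends on 'x', and 'z' depends on 'y'.
    dsk = {
        'x': 1,
        'y': (add, 'x', 2),    # y = x + 2
        'z': (mul, 'y', 10),   # z = y * 10
    }

    # Evaluate the graph; other schedulers (threaded, distributed) walk the
    # same structure but execute independent tasks in parallel.
    print(dask.get(dsk, 'z'))  # -> 30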
Dask combines some (but not all) of the managed parallelism of systems like Hadoop or Spark
with some (but not all) of the algorithmic flexibility and numerical performance of MPI. This
middle-ground position is ideal for complex post-processing of large numeric datasets. Dask’s
high-level parallel array modules can hide most of the complexity of building parallel algorithms
while Dask’s low-level task scheduling can hide most of the complexity of doing load balancing,
handling network connections, responding to failures, etc. This reduces entry-barriers to dis-
tributed systems for scientists with only modest programming skills and allows them to focus on
their scientific objectives.
However there are some challenges to using Dask’s distributed scheduler with XArray on HPC
systems. While Dask is already commonly deployed on distributed systems in other scientific do-
mains and in for-profit enterprise, most use of Dask and XArray today in the geophysical sciences
is on single-machine workstations or notebook computers. This limited adoption is due to a few limita-
tions that we hope to correct through this project: (1) Deployment of Dask on typical job schedulers
used on HPC systems like Cheyenne is still a manual process. While this process is straightfor-
ward for IT staff or developers accustomed to cluster computing, it is challenging for a broader class
of researchers with more modest programming skills. (2) Dask was not originally designed to work
on high performance networks and doesn’t take advantage of fast interconnects. ESM computa-
tions can often be communication-bound, leaving this as a prominent performance bottleneck in
some cases. (3) The netCDF file format is well optimized for parallel reads using MPI. Dask is not
MPI-enabled and does not today benefit from this feature. We will experiment with integrating
parallel MPI read capability into Dask. (4) While Dask is commonly used today on many-terabyte
datasets, we anticipate needing to perform some optimizations when pushing towards Petabyte-
scale datasets like CMIP6.
Because Dask is used in other domains, these optimizations will have far-reaching effects in ge-
nomics, image processing, finance, machine learning, etc. Conversely, concurrent optimizations
in those fields also benefit applications in the geoscience fields. By developing our software sys-
tem in a more generally useful and open manner, our project both benefits the wider scientific
community and receives features and long-term maintenance from other sources.
4.1.4 MetPy
MetPy is an open-source Python library, sitting on top of standard scientific tools such as NumPy,
SciPy, matplotlib, and Dask, providing meteorological- and atmospheric science-specific func-
tionality. In this project, MetPy will provide domain-specific calculations to drive many of the use
cases, leveraging Dask and XArray to help scale to larger data sets. The creator of MetPy, Ryan
May of Unidata, is a PI on this proposal.
Today, MetPy supports a variety of common meteorology-specific calculations and plot for-
mats. As part of its collection of calculations, MetPy utilizes the pint library to provide support
for automated handling of physical quantity information, making calculations more robust and
eliminating an entire class of potential errors for users of the library. To integrate MetPy into the
Pangeo Platform, we will add hooks into XArray to facilitate its integration with pint and other
physical quantity packages. We will also expand MetPy’s collection of calculations to facilitate the
various Geoscience Use Cases (e.g. storm relative helicity, CAPE, CIN). These modifications to
MetPy will benefit its existing user community, as well as expand its potential audience.
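For example, the kind of unit safety that pint provides (shown here with pint directly; the quantities below are illustrative) catches dimensionally inconsistent operations before they produce silently wrong numbers:

    import pint

    ureg = pint.UnitRegistry()

    distance = 10 * ureg.kilometer
    elapsed = 20 * ureg.minute

    # Units propagate through arithmetic and can be converted explicitly.
    speed = (distance / elapsed).to(ureg.meter / ureg.second)
    print(speed)  # ~8.33 meter / second

    # Combining incompatible quantities raises a DimensionalityError
    # instead of returning a meaningless value:
    # distance + elapsed  # would raise pint.DimensionalityError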
4.1.5 Jupyter Notebook
All of the packages described above will run on remote HPC or cloud systems. To enable open-
ended, exploratory analysis, the main mode of user interaction with the Pangeo Platform will be
via interactive Jupyter Notebooks [43]. A Jupyter Notebook is a web application that supports
interactive code execution, display of figures and animations, and in-line explanatory text and
equations. Jupyter Notebooks are ubiquitous in the scientific python community. There is no
apparent difference to the user between a local Jupyter Notebook and one running on a remote
system. This seamless transition to remote execution is a key ingredient for moving data analysis
into the cloud.
4.2 Deployment
4.2.1 NCAR CMIP Analysis Platform and Research Data Archive
NCAR is the main NSF-sponsored provider of high-performance computing and data storage
services to the atmospheric and related sciences. All NSF-funded scientists are eligible to apply for
access to NCAR systems, and NCAR hosts a wide range of ESM and observational datasets. Con-
sequently, NCAR systems play a central role in NSF-funded infrastructure for geoscience.
The CMIP Analysis Platform (CMIP-AP) [37] is a service hosted by NCAR’s Computational
Information Systems Laboratory (CISL) [35] which provides users access to CMIP data [62] stored
locally on NCAR’s Globally Accessible Data Environment (GLADE) storage system [36]. The
CMIP-AP provides users access not only to NCAR’s own CMIP data but, through an “inter-
library-loan-like” system, to data from other ESM modeling institutions around the world. Users
of the CMIP-AP can request that remote CMIP datasets from other institutions be collected and
stored temporarily on NCAR’s GLADE storage system so that the various CMIP datasets can be
directly compared with NCAR’s own CMIP data. CMIP-AP users can then use NCAR’s Yellow-
stone [39] or Cheyenne [34] compute systems, and the Geyser and Caldera analysis and visualiza-
tion systems, to perform their comparative analysis and visualize the results. (Note that NCAR’s
Cheyenne compute system will supplant the Yellowstone compute system in April of 2017, though
Yellowstone is expected to remain operational until 2018.)
In addition to the CMIP-AP, NCAR’s Research Data Archive [38] also makes available a wide
variety of atmospheric reanalysis data. This data is available for download to an off-site location,
while much of it is directly accessible on NCAR’s GLADE storage system.
To facilitate analysis of the CMIP and reanalysis data (and other data stored on GLADE) for
the Geoscience Use Cases, we will demonstrate the deployment of the Pangeo Platform on the
NCAR HPC systems. This will involve deploying Dask jobs through the LSF and PBS Professional
job schedulers and workload managers, which are used for resource and job management on the
Yellowstone/Geyser/Caldera and Cheyenne computing systems, respectively. This will require a
demonstration of launching both batch and interactive jobs through these resource managers, and
providing the proper documentation and tools necessary to easily facilitate user job submissions
with these tools in the future.
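As a rough sketch of the intended user experience (the scheduler address is a placeholder), once scheduler and worker processes have been started through the batch system, an interactive session simply attaches to them with Dask's distributed client:

    from dask.distributed import Client

    # Connect to a dask-scheduler process launched by a batch job on the
    # cluster (placeholder host name and port).
    client = Client('tcp://scheduler-node:8786')
    print(client)  # reports connected workers, cores, and memory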
4.2.2 Columbia Habanero HPC System
Smaller, university-level HPC systems are a common part of geoscience infrastructure. In
addition to targeting NCAR systems, our proposed work will help scientists leverage these other
computing resources for Big Data analysis. We will deploy Pangeo on Columbia University’s
Habanero HPC system (purchase of four compute nodes and 100 TB of mass storage has been
budgeted), demonstrating how to launch the Dask scheduler and Jupyter notebooks. This resource
will also enable Lamont researchers to test and debug Geoscience Use Cases locally rather than
relying exclusively on NCAR computing for the entire project.
4.2.3 Cloud Computing and Jetstream
Cloud computing is a quickly growing approach to mainstream, commercial Big Data. Cloud
platforms are technologically distinct from HPC systems, typically with cheaper, easier to use
interfaces but with slower I/O and interconnect. Many commercial cloud platforms exist, such as
Amazon Web Services (AWS), Google Compute Cloud (Google CC), or Microsoft Azure. Typically,
commercial pricing makes using such services for scientific Big Data analysis cost-prohibitive, but
since Dask has already been demonstrated on AWS, Google CC, and Azure, it is reasonable to
assume that many of the use cases in the proposal could be easily ported to run on similar cloud
service platforms. For these reasons, NSF’s new XSEDE cloud computing platform, Jetstream
[52], presents a new and unique opportunity to test the Pangeo Platform in a cloud computing
environment.
We will deploy the Pangeo Platform on the Jetstream cloud computing platform (requesting
an allocation via XSEDE) and benchmark our Use Cases within the constraints of available storage
and computing. For comparison purposes, we will also deploy on other available cloud comput-
ing platforms, such as AWS, Google CC, or Azure, for which funds have been allocated in the
NCAR budget with this proposal.

4.3 Benchmarking and Optimization of Use Cases
Unlike traditional HPC codes that run the same simulation every time, Dask and XArray run a
wide variety of user-submitted computations that vary by several orders of magnitude in expected
effort, communication/computation ratios, etc. Our goal is to provide an interactive experience
(milliseconds to seconds of turnaround) for GB-scale datasets and to provide reasonable
batch runtimes (hours to days) for 100 TB to PB scale datasets for our Geoscience Use Cases.
Achieving this level of performance requires collecting representative benchmarks, profiling their
execution on the relevant HPC hardware to identify bottlenecks, and then addressing those bot-
tlenecks with Dask and XArray. Existing Dask benchmarks will be applied on NCAR systems and
increasingly large datasets to see where the scalability of Dask and XArray start to break down.
This will inform additional open-source work within those libraries, resulting in future releases
of Dask and XArray more appropriate for use on general HPC hardware. The calculations asso-
ciated with the four Geoscience Use Cases will define additional benchmarking targets to drive
further optimization. The optimization process will be an iterative dialog between the scientists
and developers on our team.
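A minimal sketch of the style of micro-benchmark we have in mind (synthetic data and arbitrary sizes), sweeping chunk size to expose scheduling and I/O overheads:

    import time
    import dask.array as da

    for chunk in (100, 500, 1000):
        x = da.random.random((4000, 4000), chunks=(chunk, chunk))
        start = time.time()
        x.mean(axis=0).compute()   # force execution of the task graph
        elapsed = time.time() - start
        print('chunks = (%d, %d): %.2f s' % (chunk, chunk, elapsed))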
4.4 Demonstration and Documentation
Each Use Case will be implemented as an interactive, self-documenting Jupyter Notebook [43].
This format combines rich text (including equations), code, and figures into a self-contained doc-
ument of a scientific study. These notebooks will constitute the definitive demonstrations of the
Pangeo Platform. They will be published online at the Pangeo website and will also be stored in
Github code repositories. In addition to merely reading the notebooks, any user with access to Yel-
lowstone / Cheyenne will be able to actually execute them, reproducing the analysis pipeline from
raw data to final figures. Development versions of these notebooks will be presented annually at
the EarthCube All Hands Meeting (AHM).
In addition to the Use Case notebooks, our team will continue to follow its strong track record
in producing well-documented code. The online documentation for XArray, Dask, and MetPy
is already comprehensive and user friendly; any additional features will follow this established
standard.
5 EarthCube Participation
Our community of computationally-focused climate and atmospheric scientists, hydrologists,
and oceanographers represents a wide swath of the NSF GEO community which currently
does not actively participate in EarthCube. If EarthCube intends to truly span the breadth of
GEO research, it must incorporate all disciplines. We firmly believe our community has a great
deal to contribute to the EarthCube mission; in particular, we bring a deep understanding of
high performance computing and Data Science to the table. The EarthCube Advisory Committee
Reverse Site Visit Report and Response [28] noted that “the data science component of the program
is underemphasized” and recommended that EarthCube “embrace discovery in both data science
and geoscience.” Given the mix of disciplinary and computational scientists on our team, we are
ideally poised to meet this goal.
We will participate fully in EarthCube governance, planning, demonstration, and assessment
activities. Our science users from Lamont will volunteer for the Science Committee, while our
programmers from NCAR will volunteer for the Technology & Architecture Committee. Annual
travel for our whole team to the EarthCube All-Hands meeting has been budgeted. We also intend
to help foster increased EarthCube presence on Github, which is the central tool our team uses
for communication, project management, and code version control. Our team’s well-developed
practices for continuous integration and testing could serve as a template for future EarthCube
software-related activities.
6 Broader Impacts
All of the Geoscience Use Cases in our proposal represent urgent problems in Earth System Sci-
ence with significant broader impacts for human society and well-being. These broader impacts
are described in the introduction to each Use Case and are not repeated here. By realizing our vision for the Pangeo Platform and enabling these Use Cases to overcome their Big Data challenges, we will advance these specific broader impacts.
Beyond our specific Use Cases, widespread adoption of the Pangeo Platform will enhance
the overall productivity and resource-efficiency of all science fields that work with big netCDF
datasets on HPC or cloud platforms, particularly in the data-intensive ESM world. Indeed, we
see no viable way for the community to confront CMIP6 without such a tool. NCAR is already a
funded participant in our proposal, and NCAR model users will naturally adopt Pangeo for their
analysis. To further encourage adoption of our platform, we will collaborate with the two other
leading ESM centers in the US: NASA Goddard Institute for Space Studies (GISS) and NOAA
Geophysical Fluid Dynamics Laboratory (GFDL). Users at these two centers will be encouraged
to try Pangeo on their own datasets and HPC systems and provide feedback to the Pangeo
developers via Github. We have obtained collaboration letters from leaders at these two institu-
tions (Gavin Schmidt, Director of GISS, and Venkatramani Balaji, Head, GFDL Modeling Systems
Group) to confirm their support for and participation in these activities. By engaging users via
NCAR, GFDL, and GISS, we will address the needs of nearly all ESM researchers in the US. A
week-long workshop aimed at graduate students, to be held at NCAR in year 3 (see budget),
will provide further education for prospective users. Pangeo tools will also be incorporated into
Abernathey’s undergraduate and graduate teaching at Columbia University.
Finally, we note that XArray and Dask are in use far beyond geoscience and academia; there are
active user bases in enterprise data science, industrial genomics, and the financial sector. The en-
hancements to these packages made via our proposed work will feed back on those communities,
potentially enhancing economic productivity and competitiveness of the US. Our collaboration
with the private data science startup Continuum Analytics represents a new, potentially transfor-
mative partnership between academia and industry.
7 Management Plan
7.1 Roles of Different Institutions and Investigators
Lamont Doherty Earth Observatory (Abernathey, Henderson, Lepore, Seager, Tippett) will
primarily focus on the development and implementation of the Geoscience Use Cases (see Sec. 3
for detailed investigator breakdown). NCAR developers (Paul, Del Vento, May) will lead the
deployment of the Pangeo Platform on Yellowstone / Cheyenne and will contribute to software
development (e.g. MetPy). Hamman (NCAR) is both a disciplinary scientist (see Use Case 3.3)
and a lead XArray developer. Continuum Analytics (Rocklin + software developer) will respond
to bug reports and feature requests from other investigators, contributing to XArray and Dask as
the need arises. Unfunded collaborators (G. Schmidt at GISS and V. Balaji at GFDL) will encourage
users to try out and give feedback on Pangeo.
7.2 Coordination and Communication
Technical communication among team members will primarily be via the Github website,
which is already the nexus of development for XArray, Dask, and MetPy. Github offers power-
ful features for issue tracking, code merging, and continuous integration which have gained near
universal adoption in the software development world. Additionally, we will establish a Slack
team for more informal communication. Finally, we will hold an annual face-to-face meeting at
Columbia (year 1) and Boulder (years 2 and 3); travel has been budgeted for this. The annual
meeting will serve as a review of project progress, at which challenges can be addressed and un-
expected problems corrected.
7.3 Timeline
In year 1, scientists will work on translating existing analysis routines for Geoscience Use
Cases from legacy systems into python (employing XArray and Dask). Any hurdles that arise will
be reported via the issue trackers on the relevant package Github pages and addressed by the
software developers. NCAR software developers will work with Continuum developers to de-
ploy the Pangeo Platform on Cheyenne and Jetstream, creating documentation and streamlining
the launch procedure. Basic Pangeo functionality will be in place on Cheyenne by the end of year
1 and will be demonstrated at EarthCube AHM. Year 2 will consist of focused testing, debugging,
benchmarking, and optimization of Geoscience Use Cases on Jetstream and Cheyenne. This task
will involve close collaboration between all project team members. By the end of year 2, efficient
performance on 100+ TB datasets will be achieved. A Big-Data-scale demonstration will be pre-
sented at EarthCube AHM. In year 3, we will continue optimization activities and will complete
the creation and publication of the Geoscience Use Case Jupyter notebooks. Developers and sci-
entists will collaborate on Pangeo technical documentation and new-user tutorials. The workshop
at NCAR will serve as a capstone to the project, informing the target cross-disciplinary geoscience
audience about the capabilities of the Pangeo project. Final demonstration of the platform will be
presented at EarthCube AHM.
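One plausible shape for the streamlined launch procedure described in year 1 above is sketched below, using the dask-jobqueue package (one possible tool, not a committed design choice) to request Dask workers through an HPC batch scheduler; the queue name, resource sizes, and worker count are placeholders, and the actual recipe for Cheyenne and Jetstream will be developed, tested, and documented during the project.

    from dask.distributed import Client
    from dask_jobqueue import PBSCluster

    # Describe one batch job's worth of Dask workers; queue name and
    # resource sizes are placeholders for site-specific settings
    # (an account/project code would also be required on Cheyenne).
    cluster = PBSCluster(queue="regular",
                         cores=36,
                         memory="100GB",
                         walltime="02:00:00")
    cluster.scale(10)  # request ten such worker jobs from the scheduler

    # Connect an interactive session (e.g. a Jupyter notebook) to the cluster.
    client = Client(cluster)
    print(client)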
8 Sustainability Plan
To ensure the long-term sustainability of the Pangeo Platform, we will actively nurture a com-
munity of users, developers, and other stakeholders by 1) maintaining and growing the already
successful Dask and XArray projects; 2) establishing a strong Pangeo user base at all major ESM
centers in the US; and 3) strengthening ties with industry Data Science partners.
Our core open-source building blocks emerged primarily via industry support and volunteer
effort on Github. The vibrant, high-traffic interactions evident on the Dask and XArray Github
repositories provide ample evidence for a true “grassroots” community surrounding these tools.
(Dask and XArray have, respectively, 1371 and 509 stars on Github.) When the EarthCube award
ends, this same community will continue developing and enhancing this foundation of the Pangeo
Platform for their own science needs. By participating actively in Dask and XArray development
during our award, we will lay the groundwork for this long-term sustainable future.
The main producers of ESM data in the US are NCAR, GISS, and GFDL. All of these organi-
zations are either funded (NCAR) or unfunded collaborators on our proposal. By encouraging
their scientists and collaborators to adopt the Pangeo Platform, a large population of users from
the geoscience community will be exposed to Pangeo. Our workshop and demonstration activ-
ities will further help entrain these users. We expect that the productivity gains these scientists
experience will make them enthusiastic Pangeo partners and contributors. The modeling centers
will thus be motivated to incorporate sustained Pangeo support into their core infrastructure.
Finally, our partnership with Continuum Analytics, the leading industry supporter of open-
source python-based Data Science tools, will open a new era in geoscientific software develop-
ment. The prominence of this company will help raise the profile of geoscience applications in
the Data Science world, helping inform an extremely broad community about the exciting chal-
lenges and opportunities in our field. We hope this will lead to a feedback loop of enthusiasm,
participation, and progress which will sustain Pangeo for decades.
Results from Prior NSF Support
No prior NSF support for C. Lepore, M. Tippett, J. Hamman, K. Paul, D. Del Vento, M. Rocklin.

Collaborative Research: The Upper Branch of the Southern Ocean Overturning in the South-
ern Ocean State Estimate: Water Mass Transformation and the 3-D Residual Circulation OCE-
1357133 (Ryan Abernathey) — 2/1/2014–2/1/2017, $101,621
Summary of results – Intellectual Merit: This project used the Southern Ocean State Estimate (SOSE)
to quantify the thermodynamic processes that help maintain the upper branch of the Southern
Ocean overturning circulation. An exciting and potentially transformative result was the revela-
tion that freshwater from melting sea ice plays a major role in converting upwelled circumpolar
deep water into lighter intermediate and mode water [1]. The importance of freshwater forcing was
verified in a suite of studies using the CESM high-resolution climate model [4, 40]. We also ex-
amined the relative roles of advection vs. isopycnal diffusion in the subduction of surface tracers,
obtaining another surprising result: under increased wind forcing, enhanced isopycnal mixing
(due to stronger mesoscale eddies) dominates the change in subduction [2].
Broader Impacts: This proposal supported the creation of a new package for the MITgcm ocean model called LAYERS. The LAYERS package supports online diagnostics of isopycnally averaged circulation and water mass transformation, providing greater numerical accuracy while reducing the effort required to obtain this useful information. LAYERS has been adopted broadly by the MITgcm community.
Publications: Listed in References Cited: [1], [4], [40], [2]

P2C2: Continental scale droughts in North America: Their frequency, character and causes over
the past millennium and near term future NSF grant AGS-14-01400 (Richard Seager and Naomi
Henderson) — 7/1/14–6/30/17, $776,807
Summary of results – Intellectual Merit: This award is supporting research into the causes and chang-
ing probability of droughts through analysis of instrumental data, historical ESM simulations,
model projections of the future, and tree ring-based reconstructions. Pan-continental droughts
arise when different modes of climate variability combine to generate drying across the continent.
Climate models assessed by the IPCC can realistically simulate the occurrence and spatial character of pan-continental droughts, but in general do so via a different mix of atmosphere-ocean processes. Pan-continental drought risk increases over the coming decades as precipitation changes and warming drive soil moisture in the Southwest and Plains down to levels unprecedented in the past millennium. The physical mechanisms of hydroclimate change that impact drought frequency and intensity across North America were analyzed via a detailed moisture budget breakdown of reanalyses and CMIP5 models, identifying contributions from thermodynamic and dynamic processes.
Broader Impacts: The research has addressed the causes of ongoing and future unusually widespread droughts and hydroclimate change, which, by impacting multiple regions, have important consequences for water resources, ecosystems, and agriculture. The work has also analyzed links between drought and other climate variability and the production of major crops in the Americas, which will enable better anticipation of food supply variability and prices. Findings have been presented at many venues attended by water and land managers, especially in southwestern North America, and have been readily communicated to decision makers.
Publications: Listed in References Cited: [3, 8, 10, 9, 11, 12, 46, 47, 48, 50]

EarthCube IA: Collaborative Proposal: Advancing netCDF-CF for the Geoscience Community
NSF grant GEO-1541031 (Ryan May) — 9/1/15–8/31/17, $1,091,266
Summary of results – Intellectual Merit: This award is funding the expansion of the Climate and Forecasting (CF) conventions for variable and attribute data in netCDF files. This includes extending the conventions to additional data types, such as radar data, satellite data, and geospatial information. The work is also examining how the conventions can leverage new capabilities available in the netCDF extended data model, and it encompasses the development of prototype implementations of support for these standards in several community tools.
Broader Impacts: Conventions for metadata allow for standard access to more varieties of data in
netCDF. This simplifies incorporation of new data into projects, accelerating the pace of research.
This project also funds several netCDF-CF workshops, enhancing collaboration between users of
netCDF in different scientific disciplines.
Publications: Listed in References Cited: [13, 26, 61]

References
[1] ⇤ R. Abernathey, I. Cerovečki, P. R. Holland, E. Newsom, M. Mazloff, and L. D. Talley. South-
ern Ocean water mass transformation driven by sea ice. Nature Geoscience, 9(8):596–601, 08
2016.

[2] ⇤ R. Abernathey and D. Ferreira. Southern ocean isopycnal mixing and ventilation changes
driven by winds. Geophysical Research Letters, 42:10,357–10,365, 2015.

[3] ⇤ W. Anderson, R. Seager, W. Baethgen, and M. Cane. Life cycles of agriculturally relevant
enso teleconnections in north and south america. International Journal of Climatology, 2016.

[4] ⇤ S. P. Bishop, P. R. Gent, F. O. Bryan, A. F. Thompson, M. C. Long, and R. P. Abernathey. Southern Ocean overturning compensation in an eddy-resolving climate simulation. Journal of Climate, 46:1575–1592, 2016.

[5] H. E. Brooks, G. W. Carbin, and P. T. Marsh. Increased variability of tornado occurrence in the united states. Science, 346(6207):349–352, 2014.

[6] H. E. Brooks, C. A. Doswell, III, and J. Cooper. On the environments of tornadic and nontor-
nadic mesocyclones. 9:606–618, 1994.

[7] D. B. Chelton, R. A. de Szoeke, M. G. Schlax, K. Naggar, and N. Siwertz. Geographical variability of the first baroclinic rossby radius of deformations. J. Phys. Oceanogr., 28:433–450, 1998.

[8] ⇤ S. Coats, J. E. Smerdon, B. Cook, R. Seager, E. R. Cook, and K. Anchukaitis. Internal ocean-
atmosphere variability drives megadroughts in western north america. Geophysical Research
Letters, 43(18):9886–9894, 2016.

[9] B. I. Cook, T. R. Ault, and J. E. Smerdon. Unprecedented 21st century drought risk in the
american southwest and central plains. Science Advances, 1(1):e1400082, 2015.

[10] ⇤ B. I. Cook, R. Seager, and J. E. Smerdon. The worst north american drought year of the last
millennium: 1934. Geophysical Research Letters, 41(20):7298–7305, 2014.

[11] ⇤ B. I. Cook, J. E. Smerdon, R. Seager, and S. Coats. Global warming and 21st century drying.
Climate Dynamics, 43(9-10):2607–2627, 2014.

[12] ⇤ B. I. Cook, J. E. Smerdon, R. Seager, and E. R. Cook. Pan-continental droughts in north america over the last millennium. Journal of Climate, 27(1):383–397, 2014.

[13] E. Davis, K. Kehoe, S. Collis, N. Guy and S. D. Peckham. CF Standard Names: Supporting
Increased Use of netCDF-CF Across the Geoscience Community. American Geophysical Union
Fall Meeting, San Francisco, CA, 2016.

[14] D. Dee, S. Uppala, A. Simmons, P. Berrisford, P. Poli, S. Kobayashi, U. Andrae, M. Balmaseda, G. Balsamo, P. Bauer, et al. The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. Quarterly Journal of the Royal Meteorological Society, 137(656):553–597, 2011.

[15] N. S. Diffenbaugh, M. Scherer, and R. J. Trapp. Robust increases in severe thunderstorm environments in response to greenhouse forcing. 101:16361–16366, 09 2013.

[16] C. A. Doswell, III, H. E. Brooks, and R. A. Maddox. Flash flood forecasting: An ingredients-
based methodology. Wea. Forecasting, 11:560–581, 1996.
[17] J. B. Elsner, S. C. Elsner, and T. H. Jagger. The increasing efficiency of tornado days in the
united states. Climate Dynamics, 45(3):651–659, 2015.
[18] V. Eyring, S. Bony, G. A. Meehl, C. A. Senior, B. Stevens, R. J. Stouffer, and K. E. Taylor.
Overview of the coupled model intercomparison project phase 6 (cmip6) experimental design
and organization. Geoscientific Model Development, 9(5):1937–1958, 2016.
[19] P. Gent, J. Willebrand, T. McDougal, and J. McWilliams. Parameterizing eddy-induced tracer
transports in ocean circulation models. J. Phys. Oceanogr., 25:463–475, 1995.
[20] S. M. Griffies, M. Winton, W. G. Anderson, R. Benson, T. L. Delworth, C. O. Dufour, J. P.
Dunne, P. Goddard, A. K. Morrison, A. Rosati, A. T. Wittenberg, J. Yin, and R. Zhang. Impacts
on ocean heat from transient mesoscale eddies in a hierarchy of climate models. Journal of
Climate, 28(3):952–977, 2015.
[21] E. Gutmann, I. Barstad, M. Clark, J. Arnold, and R. Rasmussen. The intermediate complexity
atmospheric research model (icar). Journal of Hydrometeorology, 17(3):957–973, 2016.
[22] E. Gutmann, T. Pruitt, M. P. Clark, L. Brekke, J. R. Arnold, D. A. Raff, and R. M. Rasmussen.
An intercomparison of statistical downscaling methods used for water resource assessments
in the united states. Water Resources Research, 50(9):7167–7186, 2014.
[23] R. Hallberg. Using a resolution function to regulate parameterizations of oceanic mesoscale
eddy effects. Ocean Modelling, 72:92–103, 2013.
[24] ⇤ S. Hoyer and J. J. Hamman. xarray: N-D labeled Arrays and Datasets in Python. Journal of
Open Research Software, 2017.
[25] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering,
9(3):90–95, 2007.
[26] B. W. Koziol, T. L. Whiteaker, D. L. Blodgett and R. Simons Representing Simple Geometry
Types in NetCDF-CF. American Geophysical Union Fall Meeting, San Francisco, CA, 2016.
[27] V. Lakshmanan, E. Gilleland, A. McGovern, and M. Tingley. Machine Learning and Data Mining
Approaches to Climate Science. Springer, 2015.
[28] K. Lehnert, R. Koskela, T. Ahern, K. Rubin, L. Powers, S. Goring, S. Peckham, J. Fredericks,
L. Yarmey, F. Kamalabadi, and M. Ramamurthy. Response to the earthcube advisory com-
mittee (rsv) report, Nov 2016.
[29] ⇤ C. Lepore, J. T. Allen, and M. K. Tippett. Relationships between hourly rainfall intensity and
atmospheric variables over the contiguous United States. Journal of Climate, 29(9):3181–3197,
2016.
[30] ⇤ C. Lepore, D. Veneziano, and A. Molini. Temperature and CAPE dependence of rainfall
extremes in the eastern United States. Geophys. Res. Lett., 42:74–83, 2015. 2014GL062247.
[31] G. A. Meehl, C. Covey, K. E. Taylor, T. Delworth, R. J. Stouffer, M. Latif, B. McAvaney, and
J. F. Mitchell. The wcrp cmip3 multimodel dataset: A new era in climate change research.
Bulletin of the American Meteorological Society, 88(9):1383–1394, 2007.

[32] K. J. Millman and M. Aivazis. Python for scientists and engineers. Computing in Science &
Engineering, 13(2):9–12, Mar 2011.

[33] R. A. Muenchen. The popularity of data science software, 2017.

[34] National Center for Atmospheric Research, Computational Information Systems Laboratory,
Boulder, CO. Cheyenne: SGI ICE XA System. doi:10.5065/D6RX99HX, 2017.

[35] National Center for Atmospheric Research, Computational Information Systems Laboratory,
Boulder, CO. Computational Resources Overview. https://www2.cisl.ucar.edu/, 2017.

[36] National Center for Atmospheric Research, Computational Information Systems Laboratory, Boulder, CO. Computational Resources Overview. https://www2.cisl.ucar.edu/resources/resources-overview, 2017.

[37] National Center for Atmospheric Research, Computational Information Systems Laboratory,
Boulder, CO. NCAR CMIP Analysis Platform. doi:10.5065/D60R9MSP, 2017.

[38] National Center for Atmospheric Research, Computational Information Systems Laboratory,
Boulder, CO. NCAR’s Research Data Archive. https://rda.ucar.edu, 2017.

[39] National Center for Atmospheric Research, Computational Information Systems Laboratory,
Boulder, CO. Yellowstone: IBM iDataPlex System. http://n2t.net/ark:/85065/d7wd3xhc,
2017.

[40] ⇤ E. Newsom, C. Bitz, F. Bryan, R. P. Abernathey, and P. Gent. Southern Ocean deep circu-
lation and heat uptake in a high-resolution climate model. Journal of Climate, 29:2597–2619,
2016.

[41] NOAA. National Centers for Environmental Information (NCEI) U.S. Billion-Dollar Weather
and Climate Disasters. https://www.ncdc.noaa.gov/billions/, 2017.

[42] T. E. Oliphant. Python for scientific computing. Computing in Science & Engineering, 9(3):10–
20, 2007.

[43] F. Perez and B. E. Granger. Ipython: A system for interactive scientific computing. Computing
in Science & Engineering, 9(3):21–29, 2007.

[44] J. M. Perkel. Programming: Pick up python. Nature, 518:125–126, 2015.

[45] J. Sander, J. F. Eichner, E. Faust, and S. M. Rising variability in thunderstorm-related u.s. losses as a reflection of changes in large-scale thunderstorm forcing. Weather, Climate, and Society, 5(4):317–331, 2013.

[46] ⇤ R. Seager and M. Hoerling. Atmosphere and ocean origins of north american droughts.
Journal of Climate, 27(12):4581–4606, 2014.

[47] ⇤ R. Seager, M. Hoerling, S. Schubert, H. Wang, B. Lyon, A. Kumar, J. Nakamura, and N. Hen-
derson. Causes of the 2011–14 california drought. Journal of Climate, 28(18):6997–7024, 2015.

[48] ⇤ R. Seager, D. Neelin, I. Simpson, H. Liu, N. Henderson, T. Shaw, Y. Kushnir, M. Ting, and
B. Cook. Dynamical and thermodynamical causes of large-scale changes in the hydrological
cycle over north america in response to global warming. Journal of Climate, 27(20):7921–7948,
2014.

[49] R. J. Small, J. Bacmeister, D. Bailey, A. Baker, S. Biship, F. Bryan, J. Caron, J. Dennis, P. Gent,
H. Hsu, M. Jochum, D. Lawrence, E. Munoz, P. diNezio, T. Scheitlin, B. Tomas, J. Tribbia,
Y. Tseng, and M. Vertenstein. A new synoptic scale resolving global climate simulation using
the Community Earth System Model. J. Adv. Modeling Earth Systems, 2014.

[50] ⇤ J. E. Smerdon, B. I. Cook, E. R. Cook, and R. Seager. Bridging past and future climate across
paleoclimatic reconstructions, observations, and models: A hydroclimate case study. Journal
of Climate, 28(8):3212–3231, 2015.

[51] S. Solomon, D. Qin, M. Manning, Z. Chen, M. Marquis, K. Averyt, M. Tignor, and H. Miller,
editors. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental
Panel on Climate Change. Cambridge University Press, 2007.

[52] C. Stewart, T. Cockerill, I. Foster, D. Hancock, N. Merchant, E. Skidmore, D. Stanzione, J. Taylor, S. Tuecke, G. Turner, M. Vaughn, and N. Gaffney. Jetstream: a self-provisioned, scalable science and engineering cloud environment. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, St. Louis, Missouri, volume 2792774, pages 1–8. ACM, 2015.

[53] K. E. Taylor, R. J. Stouffer, and G. A. Meehl. An overview of cmip5 and the experiment design.
Bulletin of the American Meteorological Society, 93(4):485–498, 2012.

[54] ⇤ M. K. Tippett. Changing volatility of U.S. annual tornado reports. 41:6956–6961, 2014.

[55] ⇤ M. K. Tippett, S. J. Camargo, and A. H. Sobel. A Poisson regression index for tropical
cyclone genesis and the role of large-scale vorticity in genesis. 24:2335–2357, 2011.

[56] ⇤ M. K. Tippett and J. E. Cohen. Tornado outbreak variability follows taylor’s power law of
fluctuation scaling and increases dramatically with severity. Nature Communications, 7:10668
EP –, 02 2016.

[57] ⇤ M. K. Tippett, C. Lepore, and J. E. Cohen. More tornadoes in the most extreme u.s. tornado
outbreaks. Science, 354(6318):1419–1423, 2016.

[58] ⇤ T. Uchida, R. P. Abernathey, and K. S. Smith. Seasonality in ocean mesoscale turbulence in a high resolution climate model. In Preparation, 2017.

[59] S. van der Walt, S. C. Colbert, and G. Varoquaux. The numpy array: A structure for efficient
numerical computation. Computing in Science & Engineering, 13(2):22–30, March 2011.

[60] S. M. Verbout, H. E. Brooks, L. M. Leslie, and D. M. Schultz. Evolution of the U.S. tornado
database: 1954-2003. 21:86–93, 2006.

[61] ⇤ C. Ward-Garrison, R. May, E. Davis and S. C. Arms. Serving Real-Time Point Observation
Data in netCDF using Climate and Forecasting Discrete Sampling Geometry Conventions.
American Geophysical Union Fall Meeting, San Francisco, CA, 2016.

[62] World Climate Research Programme. WCRP Coupled Model Intercomparison Project
(CMIP). https://www.wcrp-climate.org/wgcm-cmip, 2017.
