
TOWARDS A FRAMEWORK FOR OFFERING REMOTE SENSING DATA IN AN ANALYSIS-READY FORMAT

Jianghua Zhao1,2, Xuezhi Wang1, Yuanchun Zhou1, Qiming Qin3

1 Computer Network Information Center, Chinese Academy of Sciences, Beijing, China 100190
2 University of Chinese Academy of Sciences, Beijing, China 100049
3 Peking University, Beijing, China 100871

ABSTRACT

Diverse storage formats, archive dispersal, and inconsistent naming make it difficult for researchers and the general public to find and access remote sensing data. To facilitate the use of remote sensing data, this paper provides an integrated framework for directly reading remote sensing data in a widely compatible and analysis-ready format, the NumPy ndarray. The framework is composed of two main components. The first is the raster data processing and storage model. All the operational gridded remote sensing data are split into tiles and reorganized into n-dimensional arrays. The n-dimensional data arrays are then serialized into netCDF and stored in a distributed file system. The second is the spatiotemporal filter, which performs queries in parallel and has been encapsulated into Internet-accessible application programming interfaces (APIs). Finally, a scenario that calculates NDVI for a specified spatiotemporal range illustrates the efficiency and convenience that our platform provides for remote sensing data analysis.

Index Terms: remote sensing data, spatiotemporal retrieval, analysis-ready format, multi-dimensional array

1. INTRODUCTION

With the rapid development of remote sensing technology, the volume of global coverage image data is increasing greatly. To obtain higher-quality decision-making results, many applications, such as environmental change monitoring, use remote sensing data more and more frequently for comprehensive analysis.

To transform remote sensing data into scientific understanding, a great deal of research has been done, such as change detection[1], image fusion[2], edge detection[3], and land use and land cover extraction[4].

However, the diverse storage formats, archive dispersal, and inconsistent naming make it difficult for researchers and the general public to find and access these data. To obtain long time series data of a certain region, for instance, one first needs to find the image data by region and temporal query, and then download and read them. This workflow requires the user to know the storage structure, index fields, and data format, and thus fails to satisfy the demand for fast data discovery and on-demand reading in the big data era. How to remove barriers to data availability and accessibility has therefore become one of the key issues to be solved in geosciences and spatial information sciences.

For processing remote sensing data, there is a growing interest in open source alternatives to commercial software[5]. Python has become one of the fastest-growing programming languages in the remote sensing community in the last decade, and many libraries for processing remote sensing data can be accessed through Python. The Geospatial Data Abstraction Library (GDAL; www.gdal.org), for instance, provides a single abstract model for reading and writing all of the common image file formats and has been widely used. The Remote Sensing and GIS Library (RSGISLib; www.rsgislib.org) offers more than 300 commands for processing remote sensing data, including stacking image bands, image segmentation, and image-to-image registration. The Scikit-learn Python library [6], which requires data to be presented as NumPy ndarrays, contains a number of machine learning algorithms, including random forests and neural network models. All these libraries are leading us to a new world of processing remote sensing data.

Therefore, to facilitate the use of remote sensing data, we propose an architecture for data access in an analysis-ready format, the NumPy ndarray, which is the most popular data format for n-dimensional data[7]. Through the data access API we designed and implemented, users are able to read remote sensing image values for a given spatiotemporal range and other query conditions.

This paper is organized as follows. In Section 2, the system architecture of the platform is proposed. In Section 3, we give a detailed description of the data storage model. In Section 4, we present the spatiotemporal query process. Section 5 presents a case of querying and calculating NDVI using Landsat data. Finally, Section 6 offers conclusions and future directions.

978-1-5386-7150-4/18/$31.00 ©2018 IEEE 5262 IGARSS 2018


2. SYSTEM ARCHITECTURE

In this paper, we propose a framework to support fast raster data query. It is composed of two main components. The first is the raster data processing and storage model: all the operational gridded remote sensing data are split into tiles and reorganized into n-dimensional arrays, which are then serialized into netCDF and stored in a distributed file system. The second is the spatiotemporal filter, which performs queries in parallel.

Fig. 1 The system architecture diagram (layers from top to bottom: Data Access API with metadata request and spatiotemporal query; Spatiotemporal Filter with request parsing, spatial mask, temporal filter, grid intersection, and data locating across Node 1 to Node n, each with a query cache; Data Storage with storage manager, metadata DB, tiles store, and cube store; Data Processing with projection, tiles subdivision, metadata, index, and cube organization of remote sensing data)

Figure 1 gives an overview of the framework. It consists of the Data Access API, Spatiotemporal Filter, Data Storage, and Data Processing layers. The remote sensing data are first split into tiles and reorganized. Meanwhile, indexes are built, and metadata such as the projection, acquisition date, and the conditions under which the image was collected are recorded. After that, two main processes define the spatiotemporal filter, namely data locating and value reading. When searching for data, the query conditions are first parsed and the data are located through the index. Then the query is performed in parallel on different nodes. In the top layer, the query function is encapsulated into Python package APIs, which greatly facilitate the use of remote sensing data.

In the following sections, the data storage model and the data query process are described in detail.

3. DATA STORAGE MODEL

To provide data efficiently and in an easy-to-use way, the remote sensing data are restructured. All the data are first segmented based on a grid tessellation. Then the tiles at the same location are stacked together to form a cube. Meanwhile, an index indicating the data storage location is built. The process of restructuring remote sensing data is as follows:

1) Split the data into tiles with a grid tessellation. Grids of 0.5° × 0.5° in the WGS84 coordinate system are used. The steps of the tiling process are shown in Fig. 2.

Fig. 2 Steps needed to carry out remote sensing data tiling process (global tiling grids in latitude and longitude; grids intersecting the MBR of the satellite image in WGS84; reprojection and grid transformation to the satellite image at its pixel resolution; crop; image tiles array)

First, project the MBR (Minimum Boundary Rectangle) of the remote sensing data into WGS84 and perform an intersection analysis to select the grids that intersect the remote sensing image. Then, before splitting, transform the grids into the projection of the remote sensing image. Finally, read the values of the image within each grid boundary as a two-dimensional array. Locations outside the image boundary are filled with NoData.

2) Two-dimensional arrays of different bands are stacked together. Thus longitude, latitude, and band together form a cube data structure, which is stored in netCDF, a format suitable for network transmission and sharing. Moreover, an index is built for each dimension, and attributes of the grid such as the coordinate system, coordinate range (bbox), and spatial resolution are stored in the netCDF file as attributes. This hierarchical data structure makes it possible to read the values that satisfy given retrieval conditions, such as a specified location and band name.

3) Build an index for the tiles from multiple dimensions, such as time, sensor, topic, and GridID as the spatial attribute, and form a directory. GridID is determined by the relative position of the grid in latitude and longitude over the globe. It is initialized as (0, 0) at 0 degrees longitude, 90 degrees south latitude. Grid_X increases by 1 each time the longitude increases by one grid width, and Grid_Y increases by 1 each time the latitude increases by one grid height. In this paper, we set grid_size to 0.5° × 0.5°. Grid_X and Grid_Y are then calculated according to Equations 1 and 2.

(1)

(2)

The directory index makes it efficient to select data by providing fast filtering capabilities.
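The GridID rule described in words above can be sketched in a few lines of Python. This is one reading of that description, not the authors' implementation: the function names `grid_id` and `grids_covering_mbr` are hypothetical, and the sketch assumes floor division, longitude measured eastward from 0° and wrapped into [0, 360), and a bounding box that does not cross the 0° meridian.

```python
import math

GRID_SIZE = 0.5  # grid width and height in degrees, as stated in the text


def grid_id(lon, lat, grid_size=GRID_SIZE):
    """Return (Grid_X, Grid_Y) for a WGS84 point given in degrees.

    Assumption: the origin (0, 0) is the cell at 0 deg longitude,
    90 deg south latitude, and western longitudes wrap into [0, 360).
    """
    n_rows = round(180.0 / grid_size)
    grid_x = math.floor((lon % 360.0) / grid_size)
    # Clamp so that the north pole falls into the last row.
    grid_y = min(math.floor((lat + 90.0) / grid_size), n_rows - 1)
    return grid_x, grid_y


def grids_covering_mbr(min_lon, min_lat, max_lon, max_lat, grid_size=GRID_SIZE):
    """List the GridIDs of every cell touched by a bounding rectangle.

    Assumption: the rectangle does not cross the 0 deg meridian.
    """
    x0, y0 = grid_id(min_lon, min_lat, grid_size)
    x1, y1 = grid_id(max_lon, max_lat, grid_size)
    return [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]


if __name__ == "__main__":
    print(grid_id(116.3, 39.9))
    print(grids_covering_mbr(116.2, 39.8, 116.8, 40.2))
```

Because the GridID is a pure function of the coordinates, a lookup of this kind needs no search: the cells touched by a query rectangle can be enumerated directly and used as keys into the directory index.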

4. DATA RETRIEVE METHOD

When querying remote sensing data, two main processes define the query, namely data locating and value reading.

Typical query conditions include spatial, temporal, or other criteria, such as sensor type. To make it easy to select data that meet specific criteria, the data location is first obtained by parsing the query conditions and calculating the GridIDs in the form used by the directory index. As a result, there is no need to search cluster nodes that do not contain results.

After the data locations are computed, the query is performed in parallel on different nodes to read the values on demand. Results are read as NumPy arrays by intersection analysis of the spatial query conditions and the stored tiles. The NumPy arrays obtained from multiple cluster nodes are then aggregated. After that, the many image processing functions available through NumPy can be used for analysis.

Fig. 3 depicts the process of a region query, which finds the data that intersect or are covered by a given region.

Because it is very common for a single user to repeat the same queries, a caching technique is used to avoid searching again for values that have already been requested. All the query conditions and results stored in the cache are expired by an improved Least Recently Used (LRU) algorithm, which guarantees that each query is stored only once. When subsequent requests for the same data arrive, the earlier results are returned from the cache. If the query has not been performed before, the coordinate system of the query polygon is transformed into WGS84 and its Minimum Boundary Rectangle (MBR) is calculated. Grid_X and Grid_Y can then be obtained according to Equations 1 and 2, and the grids covered by the MBR can be retrieved. Then, by intersection analysis, the intersection geometries are returned.

Fig. 3 Flow chart of a spatial region query (region query; cache check, with results read from the cache on a hit; otherwise coordinate transformation into EPSG:4326, calculation of the MBR of the query polygon, and data locating; on each node: read netCDF, intersection analysis of the grid boundary with the query polygon's MBR, return of grid ID and intersect geometry, coordinate transformation of the intersect geometries according to the coordinate system of the netCDF, and values read by mask; results aggregation; results)

5. APPLICATION ANALYSIS

This section presents a real-world scenario of calculating NDVI from Landsat data for given spatial and temporal criteria.

Landsat data form one of the longest available series of satellite observations and are distributed by USGS in GeoTIFF format. In our work, 2.5 TB of Landsat data have been reorganized and stored, and the data retrieval method has been encapsulated as an API. Users can retrieve all the Landsat data stored in our system through the data access API from our Jupyter environment.

To illustrate how the platform can be leveraged for data access and analysis, Fig. 4 gives the details of a real-time NDVI calculation from Landsat remote sensing data.
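As a minimal sketch of the NDVI step on data delivered as NumPy arrays, the example below computes NDVI = (NIR - red) / (NIR + red). It assumes the red and near-infrared values (band 3 and band 4 of Landsat TM) have already been read into arrays of the same shape. The function name `ndvi` and the sample values are illustrative only and are not part of the platform's API.

```python
import numpy as np


def ndvi(red, nir):
    """Compute NDVI = (NIR - red) / (NIR + red) element-wise.

    Pixels where NIR + red is zero (for example, NoData filled with 0)
    are returned as NaN instead of raising a division warning.
    """
    red = np.asarray(red, dtype="float64")
    nir = np.asarray(nir, dtype="float64")
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out


if __name__ == "__main__":
    # Illustrative values standing in for band 3 (red) and band 4 (NIR).
    band3 = np.array([[50, 0], [30, 80]])
    band4 = np.array([[150, 0], [90, 80]])
    print(ndvi(band3, band4))
```

Since the result is an ordinary ndarray, it can be passed directly to plotting or machine learning libraries that accept NumPy input.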

Fig. 4 Spatiotemporal data access and NDVI calculation

From Fig. 4, we can see that, by giving the spatial query region and other search criteria, such as "LANDSAT", "L45TM", "EPSG:4326", and a time period, and calling the "info_by_geom" function, the results can be located by returning the info data. Then, by calling the "query_by_geom" function, the results are retrieved. Finally, NDVI can be calculated by reading the values of band3 and band4. The NDVI results are displayed below.

6. CONCLUSIONS

In this paper, a platform providing analysis-ready remote sensing data is presented. It consists of two main modules, one that splits remote sensing data into tiles and one that implements the spatiotemporal query algorithm. All the tiles are stored in a distributed file system, and metadata are stored to facilitate the processing of queries. A two-step spatiotemporal query algorithm is implemented. By performing each user's query in parallel, the platform can save cluster resources to support a large number of concurrent queries. In addition, caching is used to reduce system response time and improve data retrieval efficiency.

Our platform makes it easy to use remote sensing data. Instead of downloading massive amounts of data and clipping out the values of the desired region, users can access and read the desired data simply by connecting to the Internet and calling the encapsulated APIs. This will lead to innovation in data-driven analysis and will also greatly promote the popularization of remote sensing data in various fields.

This paper addresses the challenges of efficient remote sensing data access and makes it easy to process remote sensing data. However, the remote sensing data we provide have not been processed to remove cloud, and the corresponding quality data are not integrated. How to improve our data storage model and data retrieval method to support data storage with more information and of higher quality is another interesting topic, and it will be the next step of our research.

7. ACKNOWLEDGMENTS

This study was supported as part of the "Strategic Priority Research Program" of the Chinese Academy of Sciences (Grant No. XDA19020103). We would like to thank the anonymous reviewers for their constructive comments, which greatly improved the quality of our manuscript.

8. REFERENCES

[1] T. Leichtle, C. Geiß, M. Wurm, T. Lakes, and H. Taubenböck, "Unsupervised change detection in VHR remote sensing imagery–an object-based clustering approach in a dynamic urban environment," Int. J. Appl. Earth Obs. Geoinf., vol. 54, pp. 15–27, 2017.

[2] H. S. Jung and S. W. Park, "Multi-sensor fusion of Landsat 8 thermal infrared (TIR) and panchromatic (PAN) images," Sensors (Switzerland), vol. 14, no. 12, pp. 24425–24440, 2014.

[3] M. Han, X. Yang, and E. Jiang, "An Extreme Learning Machine based on Cellular Automata of edge detection for remote sensing images," Neurocomputing, vol. 198, pp. 27–34, 2016.

[4] P. Tokarczyk, J. D. Wegner, S. Walk, and K. Schindler, "Features, color spaces, and boosting: New insights on semantic classification of remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 1, pp. 280–295, 2015.

[5] V. T. T, "Object-based remote sensing image analysis with OSGeo tools," AGSE 2012–FOSS4G-SEA, p. 79, 2012.

[6] F. Pedregosa, R. Weiss, and M. Brucher, "Scikit-learn: Machine Learning in Python," J. Mach. Learn. Res., vol. 12, no. Oct, pp. 2825–2830, 2011.

[7] T. A. Collaboration et al., "Astropy: A community Python package for astronomy," Astron. Astrophys., vol. 558, p. A33, 2013.
