Big Data meets HPC

Rajani Kumar Pradhan, Pedro Hernandez Gelado
1 Introduction

It is estimated that every single day 2.5 quintillion bytes of data are produced in the whole world. How much of it can we store? And how much of it can we analyze? Data is becoming increasingly commoditised and the tool-kits developed to manage it offer huge potential; however, they have yet to be widely adopted in an academic context for scientific research in non-computer-science fields.

Apache Spark has become the "state-of-the-art" framework in Big Data, and it offers huge untapped potential to academic researchers who deal with amounts of data that are increasingly challenging to process in conventional systems. Scientific problems are now not only processor-bound tasks; they are increasingly becoming memory-bound problems. In this context we will study two different prototypical scientific applications that could benefit from applying Big Data techniques coupled with traditional HPC (High Performance Computing). Our objective: to optimize workflows, leverage cluster resources, increase parallelism and use distributed data storage.

The first use case will deal with analyzing the Integrated Multi-Satellite Retrievals for GPM (IMERG), a satellite precipitation product from the Global Precipitation Measurement (GPM) mission by the National Aeronautics and Space Administration (NASA) and the Japan Aerospace Exploration Agency (JAXA). The IMERG precipitation datasets are at daily scale and in NetCDF format, covering 2000 to 2020, around 230 GB in total. Processing and analyzing these datasets is a real challenge ahead of the scientific community, especially for R users. In this context, the main objective of this case study is to explore the state-of-the-art Spark architecture and framework to analyze and process geospatial (e.g., IMERG) datasets in a more efficient way.

Our second use case will deal with another area of research that is becoming increasingly memory-bound: post-processing Computational Fluid Dynamics (CFD) simulations. Aerodynamic researchers deal with increasingly large file outputs for several reasons, amongst them finer meshes, more complex geometries, and unsteady aerodynamic research, that is, simulations spanning many evolving time-steps. Open-source CFD tools like OpenFOAM and ParaView are able to leverage resources in the researcher's machine, but how can we
optimize CFD post-processing codes to use under-utilised clusters? This second use case will analyze the opportunities, challenges and pitfalls of using Apache Spark in this context.

2 Methods

In developing this project, we have designed an approach for budding Spark users who are attempting to optimize and port their data-intensive tasks to Spark for the first time. Benchmarks for each of these steps are developed in the results section of this report. A rough heuristic for the ratio of time savings and performance improvements to the work required from the researcher is the step order itself, 1, 2, 3, 4: the easiest time savings can be obtained in steps 1 and 2, whilst steps 3 and 4 require more work from the user.
2.1 Step 1: Read into Spark

Our strategy's first step is to read, pre-process and transform our serial scientific application's data, for example a .csv file, into a Spark data abstraction like RDDs, a DataFrame or parquet files.

2.2 Step 2: Apply pre-built Spark functions

If you can, always substitute any functions from the serial code you are porting with native Spark functions. Trivial functions that can be performed with simple SQL queries, groupBy, filters or map-reduce operations work best here. Notebooks are a fantastic tool for developing your code, as you can test small samples and benchmark early on against your legacy serial version. HINT: use the '%%time' magic in Jupyter to time cell execution. At any point, if you run into trouble, try to create minimum reproducible examples to share; many online forums, including Stack Overflow, have vibrant communities in which you can share Spark issues and questions. Once you are ready to build, create your slurm files and submit to your HPC system.

3 Results

Results are shown here for the two prototypical use cases studied.

3.1 Reading and processing of IMERG datasets

The main bottleneck in dealing with NetCDF datasets is that the data are stored in multiple dimensions. Therefore, before doing any analysis the datasets should be stored in a more user-friendly format. The R script to convert NetCDF to an R dataframe was slow: it takes approximately 220 seconds for 30 files of 30 MB each on the Little Big Data (LBD) cluster, which has xxx GB of RAM and 762 processors. However, as the R script is designed for a single machine, it does not use the benefits of multi-core clusters. Therefore, the same R script was applied through the spark.lapply() function, which parallelizes and distributes the computations among the nodes and their cores.

Table 1: Benchmarking R versus SparkR in reading and extracting one month of IMERG daily datasets (units are in seconds).

Method    Minimum    Mean
R         225        220
SparkR    25         26

Furthermore, the benchmarking was also performed for 1 year of datasets (365 files, each 30 MB) and with three different write functions. As shown in the figure, extracting the variable from each dataset and writing the output into parquet format is much faster than writing into .RDS or CSV format. Additionally, reading the parquet file is faster than reading the .RDS and CSV formats.

3.2 CFD Post-processing

In this use case, a Gamma-1 method code in Python that can be found here was ported into Spark with variable success. The Gamma-1 method is computed using the following formula:

    Γ1(P) = (1/N) Σ_S [(PM × UM) · z] / (‖PM‖ ‖UM‖)    (1)
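Equation 1 is evaluated independently at every point P over a window S of N neighbouring points M, which is what makes it a good candidate for distribution. A minimal serial NumPy sketch of the kernel (function and variable names here are our own, not those of the original Python code):

```python
import numpy as np

def gamma1(P, points, velocities):
    """Sketch of the Gamma-1 criterion of Eq. (1) at a point P.

    points:     (N, 2) positions of the points M in the window S around P
    velocities: (N, 2) velocity vectors UM at those points
    Returns the mean over S of (PM x UM) . z / (|PM| |UM|), i.e. the
    mean sine of the angle between the radius vector PM and UM.
    """
    PM = points - P
    # z-component of the 2-D cross product PM x UM
    cross_z = PM[:, 0] * velocities[:, 1] - PM[:, 1] * velocities[:, 0]
    norms = np.linalg.norm(PM, axis=1) * np.linalg.norm(velocities, axis=1)
    valid = norms > 0  # skip P itself and any zero-velocity points
    return float(np.mean(cross_z[valid] / norms[valid]))
```

For a solid-body rotation centred on P every velocity is perpendicular to its radius vector, so each term equals 1 and Γ1(P) = 1; for a uniform flow the terms cancel and Γ1(P) ≈ 0, which is how the criterion discriminates vortex cores. Since each grid point's window is independent, this per-point kernel is the unit of work that gets distributed when porting to Spark.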