
The convergence of HPC and Big Data/HPDA

Big Data meets HPC

Rajani Kumar Pradhan, Pedro Hernandez Gelado

Abstract
Scientific research can benefit from Big Data tools developed over the last decade for commercial data management, such as Hadoop and Apache Spark, by leveraging them alongside traditional High Performance Computing resources. High degrees of automated parallelism can be obtained, allowing the researcher to trivially optimize reads, writes and simple transformations. Certain complex operations, however, remain challenging to perform within Spark; traditional HPC can still fill these gaps.

1 Introduction

It is estimated that every single day 2.5 quintillion bytes of data are produced worldwide. How much of it can we store? And how much of it can we analyze? Data is becoming increasingly commoditised, and the tool-kits developed to manage it offer huge potential; however, they have yet to be widely adopted in an academic context for scientific research in non-computer-science fields.

Apache Spark has become the "state-of-the-art" framework in Big Data, and it offers huge untapped potential to academic researchers who deal with amounts of data that are increasingly challenging to process in conventional systems. Scientific problems are no longer only processor-bound tasks; they are increasingly becoming memory-bound problems. In this context we will study two different prototypical scientific applications that could benefit from applying Big Data techniques coupled with traditional HPC (High Performance Computing). Our objective: to optimize workflows, leverage cluster resources, increase parallelism and use distributed data storage.

The first use case deals with analyzing the Integrated Multi-Satellite Retrievals for GPM (IMERG), a satellite precipitation product from the Global Precipitation Measurement (GPM) mission by the National Aeronautics and Space Administration (NASA) and the Japan Aerospace Exploration Agency (JAXA). The IMERG precipitation datasets are at daily scale and in NetCDF format, cover 2000 to 2020, and amount to around 230 GB in total. Processing and analyzing these datasets is a real challenge ahead of the scientific community, especially for R users. In this context, the main objective of this case study is to explore the state-of-the-art Spark architecture and framework to analyze and process geospatial (e.g., IMERG) datasets in a more efficient way.

Our second use case deals with another area of research that is becoming increasingly memory-bound: post-processing Computational Fluid Dynamics (CFD) simulations. Aerodynamic researchers deal with increasingly larger file outputs due to several reasons, amongst them finer meshes, more complex geometries, and unsteady aerodynamic research, that is, simulations with many evolving time-steps. Open-source CFD tools like OpenFOAM and ParaView are able to leverage the resources of the researcher's machine, but how can we optimize CFD post-processing codes to use under-utilised clusters? This second use case analyzes the opportunities, challenges and pitfalls of using Apache Spark in this context.

2 Methods

In developing this project, we have designed an approach for budding Spark users who are attempting to optimize and port their data-intensive tasks to Spark for the first time. Benchmarks for each of these steps are developed in the results section of this report. A rough heuristic for the ratio of time savings and performance improvements to the work required from the researcher is the step order itself, 1, 2, 3, 4: the easiest time savings can be obtained in steps 1 and 2, whilst steps 3 and 4 require more work from the user.

2.1 Step 1: Read into Spark

Our strategy's first step is to read, pre-process and transform our serial scientific application's data, for example a .csv file, into a Spark data abstraction like RDDs, a DataFrame or parquet files.

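As an illustration, a minimal PySpark sketch of this first step is shown below; the file name, schema options and output path are placeholders rather than the project's actual data.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session on the cluster
    spark = SparkSession.builder.appName("step1-read-into-spark").getOrCreate()

    # Read the serial application's CSV export into a distributed DataFrame,
    # letting Spark infer column types from a sample of the data
    df = spark.read.csv("data/measurements.csv", header=True, inferSchema=True)

    # Persist the data in a columnar format; later steps can then re-read only
    # the columns they need instead of parsing CSV again
    df.write.mode("overwrite").parquet("data/measurements.parquet")
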
2.2 Step 2: Apply pre-built Spark functions

If you can, always substitute any functions from the serial code you are porting with native Spark functions. Trivial functions that can be performed with simple SQL queries, groupBy, filters or map-reduce operations work best here.

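For instance, a serial group-and-average loop can usually be replaced by one native Spark expression or an equivalent SQL query. The sketch below reuses the DataFrame df and session spark from the previous sketch; the column names are hypothetical.

    from pyspark.sql import functions as F

    # Native Spark replacement for a serial "group and average" loop:
    # filter the rows of interest, then aggregate per group in parallel
    daily_means = (
        df.filter(F.col("precipitation") >= 0)      # drop missing or negative values
          .groupBy("station_id")                    # one group per station
          .agg(F.avg("precipitation").alias("mean_precipitation"))
    )

    # The same operation expressed as a plain SQL query over a temporary view
    df.createOrReplaceTempView("measurements")
    daily_means_sql = spark.sql(
        "SELECT station_id, AVG(precipitation) AS mean_precipitation "
        "FROM measurements WHERE precipitation >= 0 GROUP BY station_id"
    )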

2.3 Step 3: Optimize loops

More complex for loops, while loops and recursive calls are the next bottlenecks we need to tackle. How can we port them to Spark if we can't find any native Spark substitutes? Use and exploit user-defined functions (UDFs), where you can perform row-wise operations using lambda functions. At all costs, avoid collect() and other actions that extract and serialize RDD data out of Spark, slowing you down!

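A minimal sketch of such a row-wise user-defined function follows; the unit conversion is only a stand-in for whatever per-row logic the original serial loop performed, and df is the DataFrame from the earlier sketches.

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Wrap the per-row logic of the old serial loop in a UDF so Spark can apply
    # it to every row in parallel instead of iterating on the driver
    mm_to_inches = F.udf(lambda mm: mm / 25.4 if mm is not None else None, DoubleType())

    converted = df.withColumn("precipitation_in", mm_to_inches(F.col("precipitation")))

    # Finish with an action that stays distributed (writing the result) rather
    # than collect()-ing everything back to the driver
    converted.write.mode("overwrite").parquet("data/converted.parquet")
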
2.4 Step 4: Test small, build than R. For instance, on average, R a read operation of 18 million records,
big takes 220 seconds to complete the pro- about 500mb of data, using a for loop.
cess, whereas for SparkR it was just 26 Instead, in Apache Spark we were able
Always begin testing with very small ex- seconds, which is approximately 8 times parallelize this read to leverage 72 vir-
cerpts from your large dataset. Jupyter as much as faster than R. tual cores, optimizing it to a mean read
time of µ= 41.3s, a 3240% increase creating automatic parallelism for sci- fine meshes that are difficult to post-
in performance. Tests were performed entific applications that have cascading process serially or in a personal com-
five times to account for variable cluster
algorithms, were the results of one op- puter. Nonetheless, it is not trivial to per-
loads and capacity, standard deviation eration are fed and called by another form complex operations on this data in
is cited as σ. function recursively. The author hence Spark, which shines in filtering, aggre-
struggled to completely port the code, gation, and native machine learning li-
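A typical notebook cell for this kind of small-sample benchmark might look like the following sketch, assuming the DataFrame df from the earlier examples; the sample size and grouping column are placeholders.

    %%time
    # Jupyter cell: benchmark the pipeline on a tiny excerpt before scaling up
    sample = df.limit(1000)                       # small slice of the full dataset
    sample.groupBy("station_id").count().show()   # quick sanity check of the logic
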
3 Results

Results are shown here for the two prototypical use cases studied.

3.1 Reading and processing of IMERG datasets

The main bottleneck in dealing with NetCDF datasets is that the data are stored in multiple dimensions. Therefore, before doing any analysis, the datasets should be stored in a more user-friendly format. The R script converting NetCDF to an R dataframe was slow, taking approximately 220 seconds for 30 files of 30 MB each on the Little Big Data (LBD) cluster, which has xxx GB of RAM and 762 processors. However, as the R script is designed for a single machine, it does not use the benefits of multi-core clusters. Therefore, the same R script was applied through the spark.lapply() function, which parallelizes and distributes the computations among the nodes and their cores.

Figure 1: IMERG image from NASA; more info on the dataset can be found here.

The benchmark results (run 5 times) show that SparkR is significantly faster than R (Table 1). For instance, on average, R takes 220 seconds to complete the process, whereas SparkR takes just 26 seconds, approximately 8 times faster than R.

Table 1: Benchmarking R versus SparkR in reading and extracting one month of IMERG daily datasets (units are in seconds).

    Method    Minimum   Mean
    R         225       220
    SparkR    25        26

Furthermore, the benchmarking was also performed for 1 year of data (365 files of 30 MB each) and with three different write functions. As shown in the figure, extracting the variable from each dataset and writing the output into Parquet format is much faster than writing into .RDS or CSV format. Additionally, reading the Parquet file is faster than reading the .RDS or CSV formats.

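The conversion itself was carried out in R through SparkR's spark.lapply(); since the other sketches in this report use Python, a roughly equivalent PySpark pattern is outlined below. It assumes xarray, pandas and pyarrow are available on the worker nodes, and the file list and variable name are placeholders rather than the actual IMERG layout.

    import xarray as xr
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("imerg-netcdf-to-parquet").getOrCreate()

    def convert_one(path):
        # Open one daily NetCDF file, flatten the gridded variable into a
        # 2-D table and write it out as Parquet (pandas + pyarrow on the worker)
        ds = xr.open_dataset(path)
        table = ds["precipitation"].to_dataframe().reset_index()
        table.to_parquet(path.replace(".nc", ".parquet"))
        ds.close()

    files = ["imerg/day_20100101.nc", "imerg/day_20100102.nc"]  # placeholder list

    # One conversion task per file, distributed over the cluster, mirroring
    # what spark.lapply() does for the R script
    spark.sparkContext.parallelize(files, len(files)).foreach(convert_one)
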
3.2 CFD Post-processing

In this use case, a Gamma-1 method code in Python, which can be found here, was ported into Spark with variable success. The Gamma-1 method is computed using the following formula:

    \Gamma_1(P) = \frac{1}{N} \sum_{M \in S} \frac{(\vec{PM} \times \vec{U}_M) \cdot \hat{z}}{\lVert \vec{PM} \rVert \, \lVert \vec{U}_M \rVert}    (1)

Figure 2: The Gamma-1 method, as visible, requires a calculation of the nearest neighbours before computing the vector multiplication.

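For reference, a minimal NumPy sketch of the serial Gamma-1 evaluation at a single point is given below; the array layout and the choice of neighbourhood S are assumptions, not the project's actual code.

    import numpy as np

    def gamma1_at_point(p, points, velocities):
        # Serial Gamma-1 value at point p, using every row of `points` as the
        # neighbourhood S; `points` and `velocities` are (N, 2) arrays
        pm = points - p                                   # vectors PM for each neighbour M
        cross = pm[:, 0] * velocities[:, 1] - pm[:, 1] * velocities[:, 0]  # (PM x U_M) . z
        norms = np.linalg.norm(pm, axis=1) * np.linalg.norm(velocities, axis=1)
        valid = norms > 0                                 # exclude the point itself
        return np.mean(cross[valid] / norms[valid])
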
3.2.1 Read Improvements

By using Spark we can obtain significant performance upgrades over our serial code in pandas, where it took a mean of µ = 22 minutes and 19 seconds to perform a read operation of 18 million records, about 500 MB of data, using a for loop. Instead, in Apache Spark we were able to parallelize this read to leverage 72 virtual cores, optimizing it to a mean read time of µ = 41.3 s, a 3240% increase in performance. Tests were performed five times to account for variable cluster loads and capacity; the standard deviation is cited as σ.

Table 2: Read time measurements for 5 samples, mean µ and standard deviation σ.

    Method            µ [s]   σ [s]
    Serial for loop   1339    107
    Spark DataFrame   41.3    4.0

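The parallel read can be expressed as a single wildcard load instead of a Python loop over files. The sketch below reuses the Spark session from the earlier examples and assumes, purely for illustration, that the frames are stored as CSV files.

    # Serial pandas pattern being replaced: one file at a time in a Python loop
    # frames = [pd.read_csv(f) for f in sorted(glob.glob("frames/*.csv"))]

    # Spark pattern: a single wildcard read, split across all available cores
    frames = spark.read.csv("frames/*.csv", header=True, inferSchema=True)
    print(frames.count())   # forces the distributed read and reports the record count
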
3.2.2 Neighbours in radius

The author attempted to optimize a Euclidean radius neighbours search with Spark using a native locality-sensitive hashing (LSH) implementation in the Apache Spark library MLlib known as BucketedRandomProjectionLSH. However, poor performance improvements, in fact degraded performance, were obtained, even when accounting for the fact that the legacy function performed an exhaustive, itemized serial search of 47,000 records for each run, versus an LSH model built once for every query in its vanilla implementation, and an enhanced LSH precomputed model (only built once) repeatedly queried for each point. The test results were obtained from five runs and are cited on a per-point-queried basis; the Spark implementations used 400 cores versus one serial core.

Table 3: Radius neighbours search performance.

    Method            µ [s]   σ [s]
    LSH vanilla       6.58    0.63
    LSH precomputed   4.87    0.75
    Legacy            0.484   0.0358

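A minimal sketch of how such a query can be written with MLlib's BucketedRandomProjectionLSH is shown below; the bucket length, number of hash tables, radius and column names are illustrative values rather than the settings used in the benchmark, and points_df stands for a DataFrame of particle positions.

    from pyspark.ml.feature import BucketedRandomProjectionLSH, VectorAssembler

    # Assemble the coordinates (columns "x" and "y" are assumed) into the
    # single vector column that MLlib expects
    assembler = VectorAssembler(inputCols=["x", "y"], outputCol="position")
    points = assembler.transform(points_df)

    # Build the LSH model once, then reuse it for every radius query
    lsh = BucketedRandomProjectionLSH(inputCol="position", outputCol="hashes",
                                      bucketLength=0.05, numHashTables=3)
    model = lsh.fit(points)

    # Approximate self-join: all pairs of points closer than the chosen radius
    radius = 0.01
    neighbour_pairs = model.approxSimilarityJoin(points, points, radius,
                                                 distCol="euclidean_distance")
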
3.2.3 Gamma-1

To perform the Gamma-1 computation for a record within Spark, a cascading set of UDFs (user-defined functions) would have to be executed: first a radius nearest-neighbours search for each point, followed by the computation of the Gamma-1 method for it.

Unfortunately, this is not possible, as in Spark worker nodes/executors (already performing an action) cannot call a further action upon other executor nodes. This is part of the basis of the Spark architecture and a key barrier to creating automatic parallelism for scientific applications that have cascading algorithms, where the results of one operation are fed into and called by another function recursively. The author hence struggled to completely port the code, even after porting all the separate components necessary to complete it.

3.2.4 Stagnation Point Location

Finally, another common CFD post-processing operation in unsteady fluid mechanics, when searching for areas with a high probability of a vortex core location, is to locate stagnation points in the flow, that is, points where the velocity is approximately 0. Spark shines in this type of single filtering and retrieval operation in comparison with locating these serially in pandas or even using OpenFOAM. The test input files were used, that is, 400 frames with 47,000 records each, or approximately 18.8 million records.

Table 4: Stagnation point search performance.

    Method           µ [s]   σ [s]
    Vanilla pandas   143     1.96
    Spark            21.41   3.4
    OpenFOAM         2-5     N/A

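The stagnation-point query reduces to a simple distributed filter, sketched below; the velocity column names and the tolerance are assumptions, with frames standing for the DataFrame of loaded CFD frames.

    from pyspark.sql import functions as F

    eps = 1e-3  # tolerance below which the velocity magnitude counts as zero

    # Keep only the records whose velocity is approximately zero; the filter
    # is evaluated in parallel across all 400 frames at once
    stagnation_points = frames.filter(
        F.sqrt(F.col("u") ** 2 + F.col("v") ** 2) < eps
    )
    stagnation_points.show()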

4 Discussion & Conclusions

Spark has been shown to be an extremely powerful tool for the management, wrangling and native-function analysis of large amounts of scientific data. It offers significant performance improvements in comparison with serial Python read, write and simple transformation codes, whilst leveraging cluster and HPC-level parallelism trivially. However, more complex operations that do not have native Spark equivalents in official or supported libraries, especially those in which cascading algorithms are computed repeatedly, are unfortunately beyond the scope of what Spark is ideally suited to perform easily, and are challenging for the researcher to port from Python into the Apache Spark framework.

In the specific case of CFD post-processing, Spark offers large potential to a researcher in the management and simple analysis of their CFD data, especially for unsteady flow problems with fine meshes that are difficult to post-process serially or on a personal computer. Nonetheless, it is not trivial to perform complex operations on this data in Spark, which shines in filtering, aggregation, and native machine learning library tools, but is not adequate to seamlessly port more complex cascade-type codes without significant modifications. This still leaves room for further work into how to transfer certain workloads from Spark to MPI, and vice versa, to have the "best of both worlds".

Future work could study the possibility of using new Spark 3.1.2 UDAFs (user-defined aggregate functions), Koalas and other tools that allow easy porting of Python code into Spark, as well as strategies for integrating classical HPC parallelism into hybrid Spark-HPC applications. Into this feeds studying how to best integrate proven, highly scalable parallelism frameworks like MPI with Big Data tools like Spark, to best aid the researcher attempting to deal with large files in their scientific research.

5 Acknowledgments

We want to profusely thank Giovanna Roda, Dieter Kvasnicka, Claudia Blaas-Schenner, Liana Akobian and the Vienna Scientific Cluster for hosting us during the summer. The team feeling, experience and cheery atmosphere at VSC have been fantastic, and we have learned lots thanks to them. Moreover, we want to thank Leon Kos and PRACE for offering us the opportunity to work on this fantastic subject. Thank you.

PRACE SoHPC Project Title: The convergence of HPC and Big Data/HPDA
PRACE SoHPC Site: Vienna Scientific Cluster, Austria
PRACE SoHPC Authors: Rajani Kumar Pradhan, Pedro Hernandez Gelado
PRACE SoHPC Mentor: Dr. Giovanna Roda, Technical University of Vienna, Austria
PRACE SoHPC Contact: Leon Kos, Univerza v Ljubljani, Phone: +12 324 4445 5556, E-mail: leon.kos@lecad.fs.uni-lj.si
PRACE SoHPC Software applied: Hadoop, Python, Apache Spark
PRACE SoHPC More Information: spark.apache.org, hadoop.apache.org, python.org
PRACE SoHPC Project ID: 2133
