
International Journal of Parallel Programming

https://doi.org/10.1007/s10766-021-00717-y

A Comparative Survey of Big Data Computing and HPC: From a Parallel Programming Model to a Cluster Architecture

Fei Yin1 · Feng Shi1


Received: 7 April 2020 / Accepted: 13 May 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature
2021

Abstract
With the rapid growth of artificial intelligence (AI), the Internet of Things (IoT) and
big data, emerging applications that cross stacks with different techniques bring new
challenges to parallel computing systems. These cross-stack functionalities require
one system to possess multiple characteristics, such as the ability to process data
under high throughput and low latency, the ability to carry out iterative and
incremental computation, transparent fault tolerance, and the ability to perform
heterogeneous tasks that evolve dynamically. However, high-performance com-
puting (HPC) and big data computing, as two categories of parallel computing
architecture, are incapable of meeting all these requirements. Therefore, by per-
forming a comparative analysis of HPC and big data computing from the per-
spective of the parallel programming model layer, middleware layer, and
infrastructure layer, we explore the design principles of the two architectures and
discuss a converged architecture to address the abovementioned challenges.

Keywords High-performance computing · Big data computing · Parallel


computing architecture · Iterative computation · Heterogeneous tasks · Converged
architecture · Cross-stack functionality

Corresponding author: Fei Yin, Planck@163.com
1 School of Computer Science, Beijing Institute of Technology, Beijing 100081, China

1 Introduction

The advent of big data makes it possible for artificial intelligence (AI) to take advantage of a data set of sufficient size to provide meaningful learning and results. To date, applications based on the supervised learning (SL) paradigm are maturing. An SL model that uses sample data with labels can be trained offline and deployed to serve predictions online. A good example is automatic speech recognition (ASR), such as Google Assistant and Amazon Alexa. Machine learning (ML) applications are expanding from
the supervised learning paradigm, in which static models are trained on offline data,
to a broader paradigm, exemplified by reinforcement learning (RL), in which
applications may operate in real environments, fuse and react to sensory data from
numerous input streams, perform continuous microsimulations, and close the loop
by taking actions that affect the sensed environment [65]. The self-driving
automobiles of Google and Tesla are examples of RL-based applications.
In the era of the IoT, massive numbers of sensing devices in home automation,
health care, smart cities, traffic management, industry, etc., collect and generate
enormous amounts of continuous data. Data are the fuel of the IoT, and big data
computing is the engine that powers the IoT, as it derives value from data. In [52],
Mehdi M. et al. presented some applications utilizing ML algorithms in the IoT,
such as traffic prediction by employing classification and clustering algorithms,
real-time energy usage prediction by employing linear regression, passenger travel pattern analysis by employing K-nearest neighbors, and fault detection for monitoring
public places by employing principal component analysis. Pattern recognition is
also widely used in IoT applications, such as identifying movement patterns by
employing the RNN model, human activity recognition or mobility prediction by
employing the LSTM model, and traffic sign detection by employing the CNN
model [60].
Introduced in 2013, high-performance data analytics (HPDA), developed by the
Pacific Northwest National Laboratory, has been used to explore, evaluate, and
demonstrate the application of HPC techniques to data analytics challenges [37].
HPDA, drawing from HPC and big data computing, is a rapidly evolving field
covering graph analytics, computing-intensive analytics, streaming analytics,
exploratory data analytics, etc. Demands for HPDA originate from the need to
obtain analysis results extremely quickly (e.g., real-time high-frequency analysis),
the need to address extreme problem complexity requiring high-capability analytics
(e.g., those found in large-scale scientific research and industrial settings), and the
case where patterns or insights are of an extremely valuable nature (e.g., economic,
scientific or social) [16].
The convergence of big data with AI, the IoT and HPDA creates great
opportunities. However, it also brings new challenges to the current parallel
computing systems. These applications, which span the different stacks of the IoT, AI and big data solutions, or even their combinations, exhibit features that need to be considered at the level of an aggregate architecture. We define them as cross-stack functionality applications.

1.1 Scope and Contributions of This Paper

In this survey, we review a wide range of applications that are based on cross-stack
solutions (i.e., AI+IoT+big data, HPC+big data, AI+big data, etc.) and compare
the workload characterization and software/hardware stacks of applications of
typical HPC, typical big data and these cross-stack functionality applications. We
discuss three fundamental topics in current parallel computing systems based on the

anatomy of HPC and big data computing architectures in an attempt to explore the
solutions to the challenges arising from cross-stack functionality applications from
the following aspects:
(1) What makes parallel programming frameworks distinct from each other, and which design principles determine the capabilities and efficiency in parallel computing, especially for iterative computation?
(2) How can we design a parallel system that has multiple capabilities in one system, provides stream and batch computation, and can handle heterogeneous tasks that evolve dynamically?
(3) How can we design a converged architecture in which the capabilities present in either HPC or big data computing stacks benefit the other?

To the best of our knowledge, our survey is the first to explore the three
abovementioned topics, which define the scope of this paper and direct our future
research on designing new parallel computing systems.

1.2 Organization

In this paper, we first provide a literature review on big data computing and HPC-
related works in Sect. 2. In Sect. 3, we compare typical HPC and typical big data
applications with cross-stack functionality applications, presenting their layered
software and hardware stack architectures and pointing out the challenges of these
applications. In Sect. 4, we study in detail the strengths and weaknesses of state-of-
the-art big data computing and HPC systems in the corresponding layered
architectures, i.e., the parallel programming model layer, middleware layer and
infrastructure layer. We conduct a comparative analysis and discuss the converged
architecture of HPC and big data computing and then conduct open discussions in
Sect. 5. Finally, we conclude this survey.

2 Literature Review

Several existing works focus on big data computing and HPC architectures, covering both the parallel programming model and the cluster architecture.
In [19], Doulkeridis C. et al. reviewed the state-of-the-art techniques in
improving the performance of parallel query processing using MapReduce. The
paper first provided an in-depth analysis of the weaknesses and limitations of the
MapReduce framework and then reviewed the existing approaches that improve the performance of query processing, categorizing them into eight groups: data access,
avoidance of redundant processing, early termination, iterative processing, query
optimization, fair work allocation, interactive real-time processing, and the
processing of n-way operations. The paper performed a comparative analysis of
the MapReduce-related technologies and systems according to these categories.
In [92], Zhang H. et al. presented important techniques for memory management,
including memory hierarchy (e.g., register, cache, main memory), memory
hierarchy utilization (e.g., register-conscious optimization, cache-conscious optimization), NUMA (see 4.4.2) (e.g., data partitioning, OLTP latency),
transactional memory, and nonvolatile random-access memory (NVRAM). Then,
they provided a review of in-memory data management and processing proposals.
Sakr S. et al. [75] surveyed a family of proposals and systems for big data
processing based on the original idea of the MapReduce framework. The extension
approaches to the original MapReduce framework were discussed, including joint
operations, iterative processing, data and process sharing, data indices and column
storage, pipelining and streaming. In addition, the authors reviewed the systems that
provide declarative interfaces for the MapReduce framework, as well as other
related big data processing systems with different architectures, such as structured
computations optimized for parallel execution (SCOPE), Dryad, and Spark.
In [4], Asaadi H. R. et al. carried out a comparative study of HPC and big data
programming models. Important issues in massive data parallelism, including fault
tolerance, memory model, storage and IO, and heterogeneous processors, were
discussed. The authors also analyzed HPC and big data software stacks, including
parallel file systems, resource managers and APIs. Three benchmark tests were
designed and conducted on Comet (the SDSC’s newest HPC resource cluster), with
programs implemented by the message-passing interface (MPI), openMP, Hadoop
and Spark, separately. From these experiments, the authors drew conclusions and
findings with regard to maintainability, execution flow, performance, scalability,
and fault tolerance.
In [16], by performing a comparative analysis of the distinct features,
characteristics and capabilities of both HPC and big data computing stacks, the
author extended these to related workloads of deep learning (DL) and HPDA. With
a closer analysis of the application/workload requirements and technique stack (i.e.,
HPC, DBC, DL), the author further teased out the differences in the software and
hardware architectures of these stacks. Finally, the author discussed mutual findings
in HPC and big data computing from the software layer to the infrastructure
architecture layer.

3 Requirements and Challenges

3.1 Typical HPC and Big Data Applications

HPC involves the use of parallel computing to run modeling and simulation
workloads in science, industry, and commerce that require significant amounts of
computation, which is called the third paradigm of scientific discovery. Classic HPC
modeling and simulation seek to represent the dynamics of a system based on its
mathematical model, which is usually derived from scientific or engineering laws
[16]. The core procedure of this scientific discovery is to solve equations. Due to the
complexity of these systems, these equations usually cannot be solved analytically;
thus, a numerical analysis method is needed. These workloads are often
computationally sensitive and require substantial memory. Moreover, numerical
simulation often generates a very large amount of intermediate data, where high-
throughput IO and low-latency interconnects are necessary. Therefore, a specific parallel computing architecture and corresponding parallel programming model software are required in HPC.
Big data computing is an emerging data science paradigm of multidimensional
information mining for scientific discovery and business analytics over a large-scale
infrastructure [48]. The way that laws are summarized directly from data is called
the fourth paradigm of scientific discovery [84]. Big data can be captured from any
source, such as social media, web activity, IoT, business activity, industry and
engineering data. The procedure of this scientific discovery can be treated as
knowledge discovery from data. Statistical models [46], built on sets of mathemat-
ical functions (e.g., the computation of the count, maximum, minimum, mean,
variance, and covariance) to describe the behavior of the objects in a target class in
terms of random variables and their associated probability distributions, are typical
modeling processes for big data that study the collection, analysis, interpretation or
explanation, and presentation of the data. Big data systems process massive amounts
of data efficiently, often with fast response times, and are typically characterized by
the 4 Vs, i.e., volume, variety, velocity, and veracity [71]. The workloads of
statistical models are data intensive, and current popular big data computing
systems, such as Hadoop, Spark and Flink, can process them.
ML, as another main technique for big data, is used to recognize complex
patterns and make intelligent decisions based on massive data. The workloads of
ML models are also data intensive. Regardless of the efficiency of workloads (either
local or global ML workloads [91]), big data programming frameworks can support
most ML algorithms.

3.2 Cross-Stack Functionality Applications

Cross-stack functionality applications bring new challenges to the current parallel


computing systems, which fail to fully meet the following requirements.
First, for AI-based applications, both SL and RL require massive input training
data to train numerous parameters, requiring large amounts of iterative computation
with high efficiency; in addition, the computation graph of an RL application is
heterogeneous and evolves dynamically [69], which requires dynamic task creation
or dataflow variation with the runtime.
Second, for IoT-based applications, systems are required to process massive
sensory data under high throughput and low latency; RL-based IoT applications (i.e., all kinds of human–robot interactions, autonomous driving) are required to
process sensory data as numerous input streams from a constantly changing
environment under low latency and perform iterative computations for
microsimulations.
Third, for HPDA-based applications, systems are required to provide stream-
oriented and graph computation; furthermore, iterative and incremental computation
are also essential.
In Table 1, we list several applications of typical HPC, typical big data, and
cross-stack functionality.


Table 1 Applications and algorithms

Typical HPC
  App: Medium/short-range weather forecasting; Algorithm: meteorological dynamics equations [32]
  Description: 1. Model physics parameterization at the subgrid scale. 2. Model discretization based on the subgrid, including the vertical exchange and horizontal exchange of momentum, heat, moisture, etc. 3. Obtain the initial conditions from observations. 4. Integrate the equations forward in time, starting from the initial conditions to the final boundary, to predict the physical values at a certain time.

  App: Protein dynamics; Algorithm: normal/stochastic differential equation [64]
  Description: 1. Create the simulation model available for the protein molecular mechanics. 2. Set the boundary conditions for the simulation area and select the potential function between the particles. 3. Set the initial location and the initial velocity for all the particles in the system. 4. Calculate the interactive force, the potential between the particles and every particle's location, and the velocity according to the dynamical equations. 5. Calculate every particle's velocity and location iteratively until the system is stable.

Typical big data computing
  App: The who-to-follow service of Twitter; Algorithm: PageRank [30]
  Description: 1. Calculate the initial PageRank matrix U_0 and the transition matrix M according to the web-relation graph. 2. Set the damping coefficient α (normally α = 0.85). 3. Calculate the PageRank matrix iteratively according to U_n = αM^T U_{n-1} + (1 − α)U_0 until it converges.

  App: Recommender systems, social networks; Algorithm: K-nearest neighbours [49, 68]
  Description: 1. Calculate the distance between the test data and each row of the training data. 2. Sort the calculated distances in ascending order based on the distance values. 3. Obtain the top k rows from the sorted array. 4. Memory-based.

Big Data+IoT+AI
  App: Fault detection in an industrial process; Algorithm: principal component analysis [61, 79]
  Description: 1. Take the whole data set consisting of d-dimensional samples. 2. Compute the d-dimensional mean vector. 3. Compute the scatter matrix (alternatively, the covariance matrix) of the whole data set. 4. Compute eigenvectors and corresponding eigenvalues. 5. Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d×k-dimensional matrix. 6. Use this d×k eigenvector matrix to transform the samples into the new subspace.

  App: Clone detection in the Internet of Things; Algorithm: multidimensional scaling [80]
  Description: 1. Assign points to arbitrary coordinates in the p-dimensional space. 2. Compute the Euclidean distances among all pairs of points to form the DHAT matrix. 3. Compare the DHAT matrix with the input D matrix by evaluating the stress function. 4. Adjust the coordinates of each point in the direction that best reduces the stress. 5. Repeat steps 2 to 4 until the stress no longer decreases.

  App: Human activity recognition; Algorithm: long short-term memory (LSTM) [67]
  Description: The model comprises convolutional, recurrent and softmax layers. 1. Transform sensor data through four convolutional operations. 2. Employ rectified linear units (ReLUs) to compute the feature maps. 3. Obtain the output of the model from a softmax layer.

HPDA (HPC+Big Data)
  App: Event analysis and recurrent pattern discovery; Algorithm: an ensemble of ML, signal processing, and statistical methods [37, 38]
  Description: 1. Event projection and transformation. 2. Event co-occurrence analysis. 3. Recurrent pattern assembly.
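To make the iterative pattern behind the PageRank entry in Table 1 concrete, the following minimal NumPy sketch (with an illustrative 3-page link graph; the matrix values are ours, not from [30]) repeats the update U_n = αM^T U_{n−1} + (1 − α)U_0 until it converges.

```python
import numpy as np

# Toy 3-page link graph (illustrative values only).
# Row i holds the outgoing-link probabilities of page i, so M^T is column-stochastic.
M = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
alpha = 0.85                 # damping coefficient
U0 = np.full(3, 1.0 / 3)     # initial PageRank vector
U = U0.copy()

# Iterate U_n = alpha * M^T * U_{n-1} + (1 - alpha) * U0 until convergence.
for _ in range(100):
    U_next = alpha * (M.T @ U) + (1 - alpha) * U0
    if np.abs(U_next - U).sum() < 1e-9:   # stop when the vector no longer changes
        U = U_next
        break
    U = U_next

print(U)   # converged PageRank scores
```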

3.3 Workload Characterization

The performance of a system is influenced by the characteristics of its hardware and


software components as well as of the load it has to process [55]. The authors
discuss the methodologies and techniques applied for constructing workload models
on different types of systems (e.g., centralized, distributed, and parallel systems). A
set of parameters that are able to describe the behavior of the workload must be
chosen, and the workload characterization process must take into account the
system architecture. In [16], the author proposed that the HPC and big data

computing workload types have affinity towards the capabilities offered by a
discrete stack architecture and hardware. In [89], the authors introduced the concept
of a functional workload model that captures in an implementation-independent
manner the load that the system needs to service. This functional workload,
including three components (i.e., the functions of abstraction, their load pattern, and
the data set they act upon), is designed to be representative of the demands put on
the system by an average use case within the application domain. In [50], the
authors point out that most big data workloads have a low ratio of floating point
operations to memory accesses, whereas HPC workloads generally have high ratios
of floating point operations to memory accesses. In [93], the authors chose a variety
of microarchitectural metrics, e.g., instruction mix, cache behavior, branch
execution, pipeline behavior, and parallelism, to gain a better comprehensive
understanding of the workload characterization.
The workload model of a typical HPC application or a typical big data
application depicts how this application will be designed and executed in an HPC or
a big data parallel computing environment. According to [16, 50, 89, 93], the
workload is closely related to the layered software stacks and is sensitive to their
underlying resource architectures. Table 2 offers a fine-grained exploration of the
respective software and hardware stacks representing disaggregated profile views
for typical big data, typical HPC, and cross-stack functionality applications. From an architectural perspective, cross-stack functionality applications have big data workloads and are similar to big data stacks in the parallel programming layer. In
addition, there are no unified parallel programming frameworks that can process all
the workloads of these applications, and this is also the most active research area in
big data computing. Because of their reliance on (dense) linear algebra and
numerical computation, these applications stand to benefit from HPC stacks in the
system layer and infrastructure layer. The emergence of cross-stack functionality
applications gradually blurs the boundary between HPC and big data computing. No
single architecture stack can process all the workloads of cross-stack functionality
applications.

3.4 Requirements and Challenges

Based on the above discussion, we summarize the requirements of cross-stack


functionality, as shown in Table 3, including both functional and nonfunctional
requirements. These requirements bring three major challenges to both HPC and big data computing, which are inherently connected to the three fundamental topics proposed above.
The first challenge is related to iterative computation, which is required by most
applications. In iterative computation mode, there are some nonfunctional
requirements: the requirement of high performance (e.g., high throughput and low
latency are essential in massive sensory data processing) and the requirement of
transparent fault tolerance. In addition, for some algorithms, nested iterations and
even more complex dependency structures are necessary.
The second challenge is the heterogeneous tasks produced by these cross-stack
functionality applications. The capabilities that are present in either the parallel


Table 2 Software and hardware stacks of typical big data, typical HPC, and converged applications

Use case
  Typical big data: [30, 49, 68]
  Typical HPC: [32, 64]
  Big Data+IoT+AI: [61, 67, 79, 80]
  HPDA: [37, 38]

Function view
  Typical big data: 1. Offline analytics: Sort, Grep, Naïve Bayes, K-means, PageRank. 2. Interactive analytics: Projection, Filter, Union, Cross Product, Difference, Join, Aggregate, Select, etc.
  Typical HPC: 1. Modeling. 2. Simulation.
  Big Data+IoT+AI: 1. Real-time analytics: streaming data processing. 2. Predictive analytics: ML, DL, RL, etc.
  HPDA: 1. Streaming analytics: rapidly analyze high-bandwidth, high-throughput streaming data. 2. Interactive analytics: explore and analyze massive streaming data. 3. Graph analytics: graph modeling, visualization.

Parallel programming view
  Typical big data: Execution engine: 1. Batch system, e.g., Hadoop. 2. Microbatch system, e.g., Spark. 3. Real-time/streaming system, e.g., Flink, Spark Streaming, Storm.
  Typical HPC: Execution engine: 1. Message passing, e.g., MPI. 2. Shared memory, e.g., OpenMP, PGAS.
  Big Data+IoT+AI and HPDA: Multiple execution engines from big data stacks: 1. Streaming system: Flink, Spark Streaming, Naiad, etc. 2. DL framework: TensorFlow, Torch, Theano, Caffe, Neon, etc. 3. RL framework: Ray, CIEL, etc.

System view
  Typical big data: 1. Data storage: HDFS, HBase, S3, MongoDB.
  Typical HPC: 1. Data storage: parallel file system, e.g., Lustre, GPFS. 2. Cluster mgmt. & scheduling: Slurm. 3. Communication libraries: RDMA (see 4.3.1), collective communication libraries, e.g., MPI. 4. Numerical libraries: PETSc, ScaLAPACK, BLAS. 5. Accelerator APIs: CUDA, OpenCL.
  Big Data+IoT+AI and HPDA: 1. Data storage and data ingestion from big data stacks: HDFS, HBase, Kafka, Flume, etc. 2. Cluster mgmt. & scheduling from both stacks: Mesos, Slurm. 3. Communication and computation libraries from HPC: communication libraries: RDMA (see 4.3.1); numerical libraries: PETSc, ScaLAPACK, BLAS; accelerator APIs: CUDA, OpenCL.

Infrastructure view
  Typical big data: computation sensitivity: low; IO sensitivity: depends; interconnect sensitivity: depends.
  Typical HPC: computation sensitivity: high; IO sensitivity: depends; interconnect sensitivity: high.
  Big Data+IoT+AI: computation sensitivity: high; IO sensitivity: depends; interconnect sensitivity: depends.
  HPDA: computation sensitivity: high; IO sensitivity: high; interconnect sensitivity: high.

programming framework of HPC (e.g., MPI, openMP) or those of big data


computing (e.g., MapReduce, Spark, Flink) fail to fully meet the requirements of
heterogeneous tasks. We take RL-based autonomous driving as a typical example.
For sensory data processing (e.g., status update, compensation, aggregation), the
system needs to provide data stream processing, supporting both iterative
computation and incremental computation. For RL training and prediction, the
system needs to provide data set processing and support both DL primitives and RL
simulation. In this scenario, the application produces tasks with widely different
execution times and resource requirements. One megatrend is to process hetero-
geneous tasks by a unified data processing engine in the parallel programming
model layer.
The third challenge is related to the HPC and big data computing architectures.
HPC and big data computing, targeting their respective fields and representing two
different technology stacks, cannot handle all the workloads of heterogeneous tasks
simultaneously. These workloads differ from those of typical big data or HPC, as
shown in Table 3. We can also observe in detail that HPC and big data computing
workload types have affinity towards the capabilities offered by the aggregate stack
architecture, ranging from the parallel programming model layer and system layer
to the infrastructure layer. In general, big data computing frameworks do not closely
couple with the architecture of parallel computers (e.g., cluster, MPP) and cannot
leverage cluster architecture and system software to improve the performance of big
data analytics, especially computation-sensitive and interconnect-sensitive work-
loads; in contrast, typical big data computing capabilities such as data streaming
processing, in situ processing and interactive analytics should also be added to HPC.
These results also indicate the necessity for the converged architecture of HPC and
big data computing.


Table 3 Requirements from cross-stack functionality applications

Functional requirements
  Offline big data analytics: recognizing and extracting meaningful patterns from enormous raw data. Insight from big data analytics can be delivered after a few minutes to several days of data generation.
  Real-time analytics: collect, process and analyze streaming data as it arrives, and respond in real time. Streaming data analytics should be ready in a range of a few hundreds of milliseconds to a few seconds.
  Interactive analytics: run complex analytic queries to search, explore, filter and aggregate data to gain insights and inform decisions. Interactive analytics should be ready in a range of a few tens of seconds to a few minutes.
  Predictive analytics: gain insights from data by training ML or DL models.
  Graph analytics: graph modeling, visualization, and evaluation for understanding large, complex networks [37].

Nonfunctional requirements
  Performance - High throughput: high throughput stands for both millions of tasks executed simultaneously and the ability to process massive data each second.
  Performance - Low latency: the ability of real-time and interactive responsiveness for ML applications to make predictions and learn from feedback [15].
  Fault tolerance - Transparent fault tolerance: the mechanism of fault tolerance is provided by a parallel programming framework and is transparent to developers.
  Execution model - Dynamic task creation: new tasks may be generated during execution based on the results or the durations of other tasks.
  Execution model - Arbitrary task/dataflow dependencies: DL primitives and RL simulations produce arbitrary and often fine-grained task dependencies (not restricted to bulk synchronous parallelism) [65].
  State model - Stateful computation: by definition, stateful operators interact with earlier computations or data observed in the recent past. Thus, since a state represents prior computational results or previously seen data, it must be persisted for subsequent use [71].

4 Anatomy of HPC and Big Data Computing

4.1 Architecture of HPC and Big Data Computing

4.1.1 HPC Architecture

HPC workloads were first run on some supercomputing platforms, which were
specialized and highly expensive. People have been trying to find alternatives for
the infrastructure of HPC, and the emergence of cluster computing follows the trend

of moving away from traditional specialized supercomputing platforms such as
Cray T3E to general-purpose clusters with high-speed interconnects and corre-
sponding software stacks. Figure 1 shows the HPC architecture.
Typical HPC applications are iterative and closely coupled based on mathemat-
ical models, where accelerator hardware (e.g., general-purpose GPU or FPGA),
accelerator APIs (e.g., CUDA or openCL) and numerical libraries (e.g., BLAS or
LAPACK) are essential. HPC workloads are interconnect-sensitive and computa-
tion-sensitive, where data are shared by message exchanges over high-speed
interconnects (e.g., InfiniBand-class fabrics or high-performance Ethernet) between
computing nodes with high bandwidth and low latency [16]. MPI and RDMA (see
4.3.1) are widely used communication libraries of HPC. They are specifically
designed for performance improvement by leveraging user-space communication
pathways that bypass the kernel space, avoiding context switches and using a zero-
copy buffer (e.g., mmap function), and by exploiting a high-speed interconnected
HPC infrastructure to reduce software overheads. HPC workloads are batch jobs
with large data sets on large distributed computer clusters. The storage is dedicated
to HPC with parallel distributed file systems, providing high speed, high bandwidth
and concurrent access to data.

Fig. 1 Typical HPC architecture. Parallel programming model layer: MPI framework (process parallelism, point-to-point/collective communication) and OpenMP/PGAS frameworks (thread parallelism, shared/partitioned data model). Middleware layer: communication libraries (MPI/PGAS/RDMA), resource and workload managers (Slurm/Torque/Sun Grid Engine), file systems (Lustre/NFS/GPFS), InfiniBand/Ethernet and iSCSI/FC/InfiniBand fabrics. Infrastructure layer (cluster/MPP): Linux/Unix operating systems, numerical libraries (BLAS, LAPACK), accelerator APIs (CUDA/OpenCL), CPU/GPGPU/FPGA, SAN/NAS storage

4.1.2 Big Data Architecture

The HPC architecture is not designed for the big data computing paradigm. The workloads of big data are data intensive; thus, a parallel programming framework (e.g., MapReduce, Spark or Flink), a distributed file system (e.g., HDFS) for sequential reading and writing, and fast direct-attached storage (e.g., SSDs) with low latency are essential. Figure 2 demonstrates the big data computing architecture.

Fig. 2 Typical big data computing architecture. Parallel programming model layer: Hadoop (MR/Tez/Pig/Hive/Mahout), Spark (SQL/Streaming/ML/GraphX), Flink (SQL/CEP/ML/Gelly), with framework abstractions for distributed data, checkpoints, state management and data operators. Middleware layer: resource and workload managers (YARN/Mesos), compute nodes (workers/executors), HDFS. Infrastructure layer (cloud/cluster): Linux on virtual or physical machines, direct attached storage (SATA/SAS/SSD), cloud storage (EBS/S3), clusters of workstations/SMPs
One of the first proposals for large-scale data processing in batches was
MapReduce [18], which offered a parallel programming framework running on the
clusters of commodity machines in terms of scalability and reliability as the first
generation of big data computing systems. Based on the original idea of the
MapReduce framework, a family of approaches and systems of large-scale data
processing have been implemented and are currently gaining much momentum in
both research and industrial communities [75], such as HDFS™, HBase™, Hive™,
Mahout™, Pig™, and Tez™, which build upon the Hadoop [35] ecosystem. In this
stage, the main workloads on big data clusters are batch-oriented.
MapReduce has drawbacks, such as a low-level programming model and a lack
of support for iterations [71]. To overcome these limitations, researchers designed
the second generation of big data processing systems, represented by Spark [18] and
Flink [7–9]. Both Spark and Flink provide in-memory performance and iterative
computation, although they process data via different dataflow models. In this stage,
researchers and communities focused on improving the parallel programming
framework to support both batch-oriented and stream-oriented data processing. Apart
from these, other excellent frameworks were developed, such as the graph
computing framework Pregel [53], stream processing framework MillWheel [1] and
Naiad [58, 59, 63], providing a differential/timely dataflow model to support both
incremental and iterative computation.

4.1.3 Summary

In Fig. 1 and Fig. 2, we identify three layers in two different architectures: a parallel
programming model layer, a middleware layer, and an infrastructure layer. We

make further comparisons between the two architectures at each layer and
summarize the design principles that can address the abovementioned challenges
(see Sect. 3.4).
In the parallel programming model layer, the HPC programming framework
provides low-level programming APIs, where users can explicitly control the
parallel program that is based either on process parallelism, which exchanges data
through a message-passing model, or on multithreads, which exchange data through
a shared (partitioned) data model. In contrast, the big data programming framework
provides rich and highly abstracted APIs that encompass the interface of distributed
data abstractions, interface of state management for intermediate data, checkpoint
interface, interface of manipulating data, etc., allowing for distributed data-
parallelism processing across hundreds or even thousands of servers in a cluster.
The significant difference between HPC and big data computing is that HPC
codes tend to generate large amounts of result data that have to be stored without
impeding the computation, while big data codes have to process large amounts of
input data [16], which results in totally different designs of the infrastructure and
resource management in the two architectures: the HPC approach historically
separates the data and computations, while big data computing colocates the
computations and data [45]. The structure of data exchange (between computing
nodes or between computing nodes and storage nodes) in HPC, i.e., communication
libraries and high-speed interconnects, enables computation-sensitive workloads to
be processed efficiently, while the structure of resource management in big data
computing that colocates computations and data enables data-intensive workloads to
be processed with high throughput.

4.2 Parallel Programming Model Layer

To address the first challenge, we compare big data programming frameworks


(Hadoop MapReduce, Spark, Flink, and Naiad) and HPC programming frameworks
(MPI, openMP and PGAS) in five aspects, i.e., data model, execution model, state
management, fault tolerance and communication model, to identify the factors that
affect the efficiency of iterative computation.
MapReduce [18] is a programming model and an associated implementation for
processing large data sets. Users specify a map function that processes a key-value
pair to generate a set of intermediate key-value pairs and a reduce function that
merges all intermediate values associated with the same intermediate key. Many
practical tasks are expressible in this model. Hadoop MapReduce is a subproject of
Apache Hadoop that implements the MapReduce programming model for large-
scale data processing.
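Hadoop MapReduce itself exposes a Java API; purely as an illustration of the user-defined map and reduce functions described above, the following word-count sketch uses the third-party Python library mrjob, which can run the job locally or submit it to a Hadoop cluster.

```python
from mrjob.job import MRJob


class WordCount(MRJob):
    def mapper(self, _, line):
        # Map: emit an intermediate (word, 1) pair for every word in the line.
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        # Optional local aggregation on the mapper side.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Reduce: merge all intermediate values associated with the same key.
        yield word, sum(counts)


if __name__ == "__main__":
    WordCount.run()
```

Running "python wordcount.py input.txt" executes the job locally; adding "-r hadoop" submits it to a Hadoop cluster.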
Spark [90] was developed and first used for research and production applications
at UC Berkeley. It provides a convenient language-integrated programming
interface, allowing a general-purpose programming language to be used at
interactive speed for in-memory data mining on clusters.
Flink [8] is an open-source system for processing streaming and batch data. It is
built on the philosophy that many classes of data processing applications, including
real-time analytics, continuous data pipelines, historic data processing (batch

processing), and iterative algorithms (ML, graph analysis), can be expressed and
executed as pipelined fault-tolerant dataflows.
Naiad [63] is a distributed system for executing data parallel, cyclic dataflow
programs. It offers high throughput for batch processors, low latency for stream
processors, and the ability to perform iterative and incremental computations.
The message-passing interface standard (MPI) is a message-passing library
standard based on the consensus of the MPI Forum at the Supercomputing’92
Conference. The MPI defines an extended message-passing model for parallel,
distributed programming in a distributed computing environment [29]. Multiple
implementations of the MPI have been developed and are widespread as parallel
frameworks to run on scalable distributed systems that comprise multiple computing
nodes integrated via high-speed interconnection networks.
OpenMP [66] is a shared memory multiprocessing application programming interface (API) for the easy development of shared memory parallel programs.
OpenMP is designed to exploit certain characteristics of a shared memory
architecture. The ability to directly access memory throughout the system (with
minimum latency and no explicit address mapping), combined with fast shared
memory locks, makes the shared memory architecture best suited for supporting
OpenMP [17].
The partitioned global address space (PGAS) [2] is a data parallel model. It
provides each process with a view of the global memory even though the memory is
distributed across multiple computers. The PGAS model extends the shared memory
model to a distributed memory setting, and data structures in partitioned global
space can be shared by all the computation threads with affinity to the data (affinity
is the association of a thread to a memory) [77]. The PGAS outperforms the shared
memory model in scalability. Several implementations of the PGAS model have
been designed for the HPC community, such as UPC [24] and OpenSHMEM [13].

4.2.1 Data Model

Hadoop MapReduce’s data model is based on the HDFS, where partitioned files can
be read or written by the MapReduce program in parallel. The MapReduce
framework operates exclusively on key-value pairs; that is, the framework views the
input to a job as a set of key-value pairs and produces a set of key-value pairs as the
output of the job, conceivably of different types. The data types for the serialization
and deserialization of data storage in the HDFS and those used in MapReduce
computations for key or value fields must implement the interface of the framework
to facilitate the functionalities (e.g., read, write, shuffle, sort) of the framework [54].
Spark uses a resilient distributed dataset (RDD) data model. RDDs are read-only
objects that are partitioned across a set of nodes, which are more abstract than
MapReduce’s file model. In comparison with distributed shared memory (DSM), the
advantages of RDDs are listed below: (1) the consistency of RDDs is trivial because
RDDs are immutable; (2) fine-grained and low-overhead fault recovery can be
achieved by using the lineage of RDDs; (3) workloads can be scheduled automatically and in a balanced way based on data locality; and (4) straggler mitigation is made
possible by using a backup task.
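A minimal PySpark sketch of these properties (the input path is a placeholder): each transformation returns a new immutable RDD, persist() keeps an intermediate RDD in memory for reuse, and lost partitions are recomputed from the lineage rather than from replicated state.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Each transformation returns a new immutable RDD; nothing executes yet.
lines = sc.textFile("hdfs:///data/input.txt")          # placeholder path
words = lines.flatMap(lambda l: l.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

counts.persist()                                        # keep in memory for reuse
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])   # action triggers execution
total = counts.count()                                  # reuses the cached RDD

print(top10, total)
sc.stop()
```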


Flink has two data types: one represents a finite data set, and the other represents
an unbounded data stream. It provides the special class DataSet and class Data-
Stream as collections of data representing the two data types in a program. These
two collections have the following key features: (1) a collection is initially created
by adding a source in the Flink program; (2) new collections are derived from these
initialized collections by transforming them with Flink APIs; and (3) these
collections are immutable once created.
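A minimal PyFlink sketch of the DataStream collection (illustrative data; the DataSet API is analogous for bounded inputs, and exact imports and signatures may differ slightly across Flink releases): a source creates the initial collection, and every transformation derives a new, immutable collection from it.

```python
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A source creates the initial (immutable) collection ...
ds = env.from_collection(
    [(1, "sensor-a"), (2, "sensor-b"), (3, "sensor-a")],
    type_info=Types.TUPLE([Types.INT(), Types.STRING()]))

# ... and each transformation derives a new collection from it.
doubled = ds.map(lambda t: (t[0] * 2, t[1]),
                 output_type=Types.TUPLE([Types.INT(), Types.STRING()]))
filtered = doubled.filter(lambda t: t[1] == "sensor-a")

filtered.print()
env.execute("datastream-demo")
```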
Every message in Naiad bears a logical timestamp consisting of epoch and loop
counters, where there is one loop counter for each of the k loop contexts that contain
the associated edge [63]. These loop counters explicitly distinguish different
iterations and allow a system to track forward progress as messages circulate around
the dataflow graph. The abovementioned features enable Naiad to support more
complex computation models such as nested loops, which is suitable for ML
algorithms such as MDS.
In the MPI, processes can exchange data by point-to-point communications or
collective communications. In HPC, the MPI can utilize the InfiniBand high-speed
interconnects and RDMA.
All variables in OpenMP possess one of the data-sharing attributes (i.e., Shared,
Private, Firstprivate, Lastprivate, Default). The variable with the Shared attribute is
used when all threads use the same copy of the variable; the Private attribute means
that all threads use local storage for this variable; the Firstprivate attribute is similar
to the Private attribute, but the variable is initialized on a fork; and the Lastprivate
attribute is also similar to the Private attribute, but the variable is updated upon
joining. Default is most often used to define the data-sharing attribute of most
variables in a parallel region.
Unified Parallel C (UPC) extends ISO C to a PGAS programming language that
allows programmers to exploit data locality and parallelism in their applications
[23]. UPC provides two types of variables: private variables and shared variables.
Normal C variables and objects are allocated in the private memory space for each
thread. Shared array elements can be distributed to threads in a round-robin fashion
with arbitrary block sizes. The block size and THREADS determine the affinity, i.e., the thread's local shared memory space in which a shared data item will reside.

4.2.2 Execution Model

The whole process of MapReduce involves four phases of execution: splitting,


mapping, shuffling and reducing. In the splitting phase, input data for a MapReduce
job are divided into fixed-size pieces called input splits consumed by the map
function. In the mapping phase, the data in each split are passed to a mapping
function to produce intermediate key-value pairs. The optional Combiner
performs local aggregation on the mapper output, which helps to minimize the data
transfer between Mapper and Reducer. The output of Combiner is passed to
Partitioner, which performs partitioning on the basis of the key(s). Shuffling is the
physical movement of data, which is done over the network. Once mapping and
shuffling are performed, the merged and sorted intermediate output is passed to the
reducing phase. In the reducing phase, Reducer takes the set of intermediate key-
value pairs produced by Mapper as the input and then runs a reducer function on
each of them to generate the output. The output of the reducer is stored in the HDFS
as the final output.
The Spark execution model can be defined in three phases: (1) creating a logical
plan; (2) translating the logical plan into a physical plan; and (3) executing tasks on
a cluster. In the first phase, a logical plan is created to show the steps to be executed
when an action is applied. Spark uses transformations (operators) on an RDD to
describe the data processing. Each transformation generates a new RDD such that
all transformations form a lineage or directed acyclic graph (DAG). In the second
phase, actions trigger the translation of the logical DAG into a physical execution
plan. The Spark Catalyst query optimizer creates a physical execution plan for the
DAG. In the third phase, tasks are scheduled and executed on the cluster. The
scheduler splits the graph into stages at wide transformations. The narrow transforma-
tions (transformations without data movement) are grouped (pipelined) together into
a single stage. Spark provides a unified engine that natively supports both batch and
streaming workloads. Instead of processing one record at a time, Spark Streaming
discretizes the streaming data into tiny, subsecond microbatches. In other words,
Spark Streaming’s Receivers accept data in parallel and buffer them in the memory
of Spark’s worker nodes. Then, the latency-optimized Spark engine runs short tasks
(tens of milliseconds) to process the batches and outputs the results to other systems.
Spark tasks are assigned dynamically to workers according to the locality of data
and available resources, unlike the traditional continuous operator model, where
computations are statically allocated to a node. This contributes to better load
balancing and faster fault recovery.
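The stage construction described above can be observed from an RDD's lineage; in this small PySpark sketch (illustrative data), the narrow map and filter are pipelined into one stage, while reduceByKey introduces a shuffle boundary and thus a second stage.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "stage-demo")

rdd = (sc.parallelize(range(1000))
         .map(lambda x: (x % 10, x))          # narrow: pipelined into the same stage
         .filter(lambda kv: kv[1] % 2 == 0)   # narrow: still the same stage
         .reduceByKey(lambda a, b: a + b))    # wide: shuffle => new stage

# The lineage shows the shuffle boundary that splits the DAG into two stages
# (the string may be returned as bytes in some PySpark versions).
print(rdd.toDebugString())

result = rdd.collect()   # the action triggers the physical plan and task scheduling
sc.stop()
```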
Flink uses streams for all workloads: streaming, SQL, microbatch and batch [47].
The program is parsed by a Flink compiler and optimized by a Flink optimizer. Each
submitted job is converted to a dataflow graph and passed onto Job Manager, which
creates an execution plan, and then the job graph is passed onto Task Manager,
where the tasks are finally executed (execution graph). Flink is a stream-computing
engine that takes a data set as a special case of a data stream with boundaries; thus,
Flink can process both batch and stream workloads in a single system. Regarding
the streaming capability, Flink is far better than Spark (as Spark handles streams in
the form of microbatches) and has native support for streaming.
Naiad uses a timely dataflow execution model based on a directed graph in which
stateful vertices send and receive logically time-stamped messages along directed
edges. Naiad employs timestamps to enhance dataflow computation, which is
essential in supporting an efficient and lightweight coordination mechanism. There
are three main features of Naiad: (1) structured loops for feedback; (2) stateful
dataflow vertices for record processing (without using global coordination); and (3)
the notification of vertices when all tuples have been received by the system for a
given round of input or loop iteration. While the first two features support low-
latency iterative and incremental computation, the third feature ensures that the
result is consistent [71]. The dataflow graph may contain nested cycles, and the
timestamps reflect this structure to distinguish data arising in different input epochs
and loop iterations.

An MPI program consists of autonomous processes, which typically run
concurrently on multiple cores and span multiple nodes. Processes communicate
with each other via calls to MPI functions, and these processes continue throughout
the entire execution. The starter process of an MPI is primarily responsible for
launching the MPI job. Most MPI implementations use a separate “mpiexec”
process for launching the MPI processes. However, in the MPICH-1 p4 channel implementation, the MPI rank 0 process is the starter process. The MPI
programming model provides two communication models: point-to-point commu-
nication and collective communication. Point-to-point communication involves
sending a message from one named process to another. A group of processes can
call collective communication operations to perform commonly used global
operations such as summation and broadcast [26]. Synchronization in the MPI is
implicit in each point-to-point or collective data movement.
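A minimal mpi4py sketch of this execution model (mpi4py is a Python binding for MPI; launch with, e.g., mpiexec -n 2 python demo.py): every process runs the same program, identifies itself by its rank, and exchanges data explicitly.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD   # default communicator spanning all launched processes
rank = comm.Get_rank()  # this process's identifier within the communicator
size = comm.Get_size()

if rank == 0 and size > 1:
    # Point-to-point: the named process 0 sends a message to the named process 1.
    comm.send({"step": 1, "payload": [1.0, 2.0]}, dest=1, tag=11)
    print(f"rank 0 of {size}: message sent")
elif rank == 1:
    msg = comm.recv(source=0, tag=11)
    print(f"rank 1 of {size}: received {msg}")

comm.Barrier()  # explicit synchronization point across all processes
```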
OpenMP is based upon the existence of multiple threads in the shared memory
programming paradigm, which uses the fork-join model of parallel execution. An
OpenMP program starts with a single process called a master thread. The master
thread executes sequentially until it encounters the first parallel region construct and
then creates a group of parallel threads [42]. The statements enclosed by the parallel
region construct are executed in parallel among threads in the same group. When
group threads complete the statements in the parallel region, they synchronize and
terminate, leaving the master thread only [27].
The UPC program runs as a number of threads working independently in an
SPMD fashion [41]. There are two compilation modes: static thread mode and
dynamic thread mode. In static thread mode, the number of threads is specified as
the program variable THREADS at the time of compiling. In dynamic thread mode,
the compiled code may run with varying numbers of threads. In UPC, there are two
types of memory consistency models that define the order in which read operations
“see” the results of write operations. In the strict memory consistency model, any
operation on shared data cannot be performed until the previous operation on it has
been completed. In the relaxed memory consistency model, a thread can operate on
shared data at any time, regardless of whether other threads operate on it or not.
Furthermore, UPC provides three synchronization primitives, i.e., barrier, fence and
lock, to control the interaction between threads.

4.2.3 State Management

The state can be defined as “the intermediate value of a specific computation that
will be used in subsequent operations during the processing of a data flow.” [71]
State management is a mechanism provided to users by a parallel programming
model for storing the intermediate state data in a computation. The state data
generally fall into two categories: static and dynamic. In the static state, the state of data
is saved at program initialization as data are loaded into memory. In this mode, data
are generally large and used repeatedly in iterative computation. In the dynamic
state, the state of data is updated repeatedly in iterative computations.
In Spark, the RDD bears the state information of data. Spark provides users with
the function to cache or persist intermediate RDDs by specifying this operation

explicitly. This method is very important for iterative computation, where the same
data sets are accessed repeatedly. Spark also offers two choices of specialized
abstractions on state management, namely, updateStateByKey and mapWithState,
the latter being an improved version of the former. The operator updateStateByKey
generates a new RDD by using a cogroup transformation with the former RDD;
thus, it is a full updating mechanism. The operator mapWithState provides an
incremental updating mechanism, which surpasses updateStateByKey in
performance.
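A minimal PySpark Streaming sketch of updateStateByKey (the socket source and checkpoint path are placeholders; mapWithState is, to our knowledge, exposed only in the Scala/Java APIs): the user-supplied function combines each key's previous state with the values arriving in the current microbatch.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "stateful-wordcount")
ssc = StreamingContext(sc, batchDuration=5)
ssc.checkpoint("/tmp/streaming-checkpoints")   # required for stateful operators


def update_count(new_values, running_count):
    # Combine the previous per-key state with the values of the current microbatch.
    return sum(new_values) + (running_count or 0)


counts = (ssc.socketTextStream("localhost", 9999)      # placeholder source
             .flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .updateStateByKey(update_count))

counts.pprint()
ssc.start()
ssc.awaitTermination()
```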
There are two fundamental states in Flink: the keyed state and the operator state.
The keyed state is bound to keys, available only to functions and operators that
process data from a KeyedStream. The operator state (or non-keyed state) is bound
to one parallel operator instance. The difference between them is that the operator
state is scoped per parallel instance of an operator (subtask), while the keyed state is
partitioned or shared based on exactly one state partition per key [74].
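A minimal PyFlink sketch of keyed state (illustrative data; class locations and signatures may vary across Flink releases): a ValueState registered in open() is automatically scoped to the key of the record being processed.

```python
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor


class RunningSum(KeyedProcessFunction):
    """Keeps one ValueState per key, i.e., keyed state scoped to the KeyedStream."""

    def open(self, runtime_context: RuntimeContext):
        self.total = runtime_context.get_state(
            ValueStateDescriptor("running_total", Types.LONG()))

    def process_element(self, value, ctx):
        current = (self.total.value() or 0) + value[1]
        self.total.update(current)          # state persisted by Flink, per key
        yield value[0], current


env = StreamExecutionEnvironment.get_execution_environment()
ds = env.from_collection([("a", 1), ("b", 2), ("a", 3)],
                         type_info=Types.TUPLE([Types.STRING(), Types.LONG()]))
ds.key_by(lambda t: t[0]).process(RunningSum()).print()
env.execute("keyed-state-demo")
```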

4.2.4 Fault Tolerance

Checkpointing is the main mechanism needed for fault tolerance in Spark. Spark
currently provides an API for checkpointing (a REPLICATE flag to persist) but
leaves the decision on which data to checkpoint to the users. Tasks are made fault
tolerant with the help of RDD lineage graphs and checkpointing; thus, they can
quickly recompute from checkpointing and recover from failures. The read-only
nature of RDDs makes them simpler to checkpoint than general shared memory.
Since consistency is not a concern, RDDs can be written out in the background
without requiring program pauses or distributed snapshot schemes [90].
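A minimal PySpark sketch of these mechanisms (paths are placeholders): replicated persistence corresponds to the REPLICATE-style storage levels, and an explicit checkpoint() writes the RDD to stable storage and truncates its lineage so that recovery does not recompute from the original source.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "ckpt-demo")
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   # placeholder directory

rdd = sc.parallelize(range(10**6)).map(lambda x: (x % 100, x))

# Persist with replication (a "REPLICATE"-style storage level).
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

# The user decides what to checkpoint; the checkpoint is written to stable
# storage once the next action materializes the RDD, and the lineage is truncated.
rdd.checkpoint()

print(rdd.count())           # action: materializes, replicates and checkpoints
print(rdd.isCheckpointed())  # True once the checkpoint has been written
sc.stop()
```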
Based on the distributed snapshot algorithms proposed by Chandy and Lamport
[12], Flink adopts an approach to take consistent snapshots of the current state of a
distributed system without missing information and without recording duplicates.
Flink injects stream barriers (similar to “markers” in the Chandy–Lamport
algorithm) into sources, which flow through operators and channels with data
records as part of the data stream. A barrier separates records into two groups: those
that are part of the current snapshot (a barrier signals the start of a checkpoint) and
those that are part of the next snapshot. An operator first aligns its barriers from all
incoming stream partitions to buffer data from faster partitions. Upon receiving a
barrier from every incoming stream, an operator checkpoints the barrier state to
persistent storage. Then, the operator forwards the barrier downstream. Once all
data sinks receive the barriers, the current checkpoint ends. Recovery from a failure
allows restoration of the latest checkpointed state and the restarting of sources from
the last recorded barrier [85]. The mechanism of fault tolerance in Naiad is similar
to that in Flink.
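In Flink, this barrier-based snapshotting is enabled per job; a minimal PyFlink sketch (the interval and settings are illustrative):

```python
from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Inject checkpoint barriers into all sources every 10 seconds; barrier
# alignment gives exactly-once state consistency as described above.
env.enable_checkpointing(10000, CheckpointingMode.EXACTLY_ONCE)
env.get_checkpoint_config().set_min_pause_between_checkpoints(5000)
```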
Rollback-recovery techniques are commonly used to provide fault tolerance to
parallel applications running on HPC systems to restart from a previously saved
state. Checkpoint-based rollback recovery is the main mechanism by which an
application can be rolled back to the most recent consistent state by using
checkpoint data [22]. There are two major approaches to implementing checkpoint/
restart systems: application-level (or operation-level) implementation and system-

level implementation. The former implementation provides approaches to integrat-
ing checkpointing codes into application codes or linking application programs to
some specific designed checkpointing libraries, such as the Cornell Checkpoint
Compiler (C3) [78], FT-MPI [25] and LA-MPI [28]. The latter implementation
offers a choice of periodic or nonperiodic mechanisms for parallel applications in
the OS kernel, such as CRACK [39] and BLCR [20]. This implementation is always
transparent to the programmer and usually does not modify the application code.
Both application-level and system-level implementations are tightly coupled to a
specific checkpoint/restart system. Researchers tend to design some frameworks to
support the integration of multiple checkpoint/restart systems into the code base,
thereby decoupling the application code from the checkpoint/restart system. LAM/
MPI [76] is a framework design that supports communication over both TCP and
Myrinet interconnects and supports the BLCR and SELF checkpoint/restart systems.
In [40], J Hursey et al. presented the design and implementation of an infrastructure
that has the general capabilities required for distributed checkpoint/restart and
realized these capabilities as extensible frameworks within Open MPI’s modular
component architecture to support checkpoint/restart fault tolerance in the Open
MPI project. In [21], Egwutuoha I. P., et al. presented a fault tolerance framework
for high-performance computing in the cloud. This framework uses process-level
redundancy (PLR) techniques, live migration and a fault tolerance policy to reduce
the wall-clock time of executing computation-intensive applications.
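As a toy illustration of application-level checkpoint/restart, not tied to any of the systems above (file names and the checkpoint interval are arbitrary), each MPI process periodically serializes its local state and, on restart, resumes from the last saved iteration.

```python
import os
import pickle
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
CKPT = f"ckpt_rank{rank}.pkl"          # per-process checkpoint file (illustrative)

# Restart path: reload the last saved local state if a checkpoint exists.
if os.path.exists(CKPT):
    with open(CKPT, "rb") as f:
        step, local_state = pickle.load(f)
else:
    step, local_state = 0, 0.0

while step < 1000:                      # stand-in for an iterative solver
    local_state += rank + 1             # fake computation
    step += 1
    if step % 100 == 0:                 # periodic application-level checkpoint
        comm.Barrier()                  # keep the set of checkpoints consistent
        with open(CKPT, "wb") as f:
            pickle.dump((step, local_state), f)

print(f"rank {rank}: state={local_state} after {step} steps")
```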

4.2.5 Communication Model

Collective communication is defined as communication operations that simultane-


ously involve a group of nodes [11]. We list here typical collective communications
used in parallel programming frameworks. These collective communications can be
classified in two dimensions, as shown in Table 4. One is based on communication
patterns, covering one-to-all patterns, all-to-one patterns and all-to-all patterns; the
other is based on data operation patterns, covering data redistribution and data
reduction. Collective algorithms are generalized parallel algorithms based on linear,
tree and dissemination-based communication. The factors that could directly or
indirectly influence the performance of a collective operation include the parallel
computer architecture, communication network, operation system, and specific
implementations of collective algorithms that are based on linear, tree and
dissemination-based communication [87].
A barrier is a special communication operation for synchronization between two
stages of parallel processing, such as the barrier between the map stage and the
reduce stage in the MapReduce programming pattern and the barrier between
supersteps in BSP patterns. The barrier operation is also used as a control event in dataflow programming patterns for taking distributed snapshots or driving iterative computations.
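The mpi4py sketch below is an illustrative exercise of the collective patterns listed in Table 4 together with a barrier; it can be launched with a command such as mpiexec -n 4 python collectives.py (the file name is arbitrary).

# Illustrative use of the collective operations listed in Table 4 (mpi4py).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

data = comm.bcast({"model": "v1"} if rank == 0 else None, root=0)       # one-to-all, redistribution
chunk = comm.scatter(list(range(size)) if rank == 0 else None, root=0)  # one-to-all, redistribution

partial = (chunk + 1) * rank
total = comm.reduce(partial, op=MPI.SUM, root=0)         # all-to-one, reduction
gathered = comm.gather(partial, root=0)                  # all-to-one, redistribution

everyone = comm.allgather(partial)                       # all-to-all, redistribution
global_sum = comm.allreduce(partial, op=MPI.SUM)         # all-to-all, reduction

comm.Barrier()                                           # synchronize before the next stage
if rank == 0:
    print(total, gathered, everyone[:2], global_sum)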
Popular big data frameworks can be divided into three parallel programming models: the batch-oriented model, the BSP model, and the stream-oriented model. HPC frameworks can be divided into message-passing and shared memory models.

Table 4 Communication operations

One to all: Broadcast (data redistribution): the sender delivers the same message to all receivers.
One to all: Scatter (data redistribution): the sender delivers different messages to different receivers.
All to one: Reduce (data reduction): different messages from different senders are combined by one receiver to form a single message.
All to one: Gather (data redistribution): different messages from different senders are concatenated together for one receiver.
All to all: All-gather (data redistribution): all processes perform their own broadcast; thus, all processes have the same set of received concatenated messages.
All to all: All-reduce (data reduction): the result of a reduce operation is available to all processes.

These frameworks, each with its own communication operations and execution flow, are suitable for certain types of applications, as shown in Table 5 below.

4.2.6 Summary

We summarize the essential patterns of the parallel programming model of big data
frameworks and HPC frameworks in Table 6.
We reach the following conclusions: (1) OpenMP and PGAS in the HPC architecture are based on thread-level parallelism and shared memory, which means that they must run on computers with a shared memory architecture, e.g., SMP (see 4.4.2) and NUMA (see 4.4.2), or another distributed shared memory architecture; these software and hardware constraints limit the data sizes and workloads that such systems can handle; (2) the MPI, a dominant model used in HPC, is a more general parallel processing framework that can run on clusters, MPPs (see 4.4.2) and other supercomputer architectures; (3) big data frameworks comprise batch-oriented frameworks (e.g., Hadoop MapReduce and Spark), stream-oriented frameworks (e.g., Flink and Naiad), and BSP frameworks (e.g., Pregel and Giraph), all of which can run on commodity clusters.
Both the MPI and big data frameworks support the scale-out computer architecture. The advantage of the MPI over big data programming models lies in its rich collective communication libraries, which enable the MPI to handle complex algorithms with iterative computations efficiently. However, all-to-all collectives are not supported by big data programming models.

Table 5 Parallel programming models, communication operations, execution flow, applications and computer architecture

Table 6 Comparison of big data frameworks and HPC frameworks

Big data frameworks
  MapReduce: data model: key-value pairs; execution model: batch; state management: N/A; fault tolerance: batch re-execution; communication: scatter, reduce.
  Spark: data model: RDD (coarse-grained immutable data set); execution model: DAG, batch/microbatch; state management: updateStateByKey (update by RDD cogroup), mapWithState (update by key); fault tolerance: recovery from RDD lineage; communication: scatter, gather, reduce, broadcast.
  Flink: data model: data set/data stream; execution model: dataflow execution graph, long-running tasks; state management: key state, operator state; fault tolerance: distributed snapshot checkpoint; communication: scatter, gather, reduce, broadcast.
  Naiad: data model: timestamps with epoch and loop counters, pointstamps; execution model: timely dataflow, long-running tasks, nested loops; state management: stateful vertices, partially ordered sequence; fault tolerance: distributed snapshot checkpoint; communication: scatter, gather, reduce, broadcast.

HPC frameworks
  MPI: data model: arbitrary; execution model: process parallel; state management: variables in processes; fault tolerance: application-level or system-level; communication: scatter, gather, reduce-scatter, all-reduce, all-gather, broadcast.
  PGAS: data model: shared data (scalar, array) and private data; execution model: multithreads on partitioned shared data; state management: private or shared data; fault tolerance: application-level or system-level; communication: scatter, gather, all-reduce, all-gather, broadcast.
  OpenMP: data model: shared, private, firstprivate, lastprivate, default; execution model: multithreads on shared data; state management: private or shared data; fault tolerance: application-level or system-level; communication: N/A.

In contrast, big data programming models generally provide data abstractions (e.g., DataSet and DataStream), convenient data processing interfaces and a system-level fault-tolerant architecture, which are not directly supported by MPI models. In the HPC architecture, MPI collective communication libraries are optimized either directly for high-speed interconnects (e.g., InfiniBand-class fabrics) or through an RDMA-based approach (see 4.3.1) over InfiniBand for large data movement.

4.3 Middleware Layer

The middleware layer bridges the parallel programming model layer and the infrastructure layer. We perform a comparative analysis of the different design ideas of HPC and big data computing at this layer to understand how big data programming frameworks exploit middleware components to improve data processing performance on clusters.

4.3.1 Communication Libraries

Remote direct memory access (RDMA) refers to the capability to access the
memory of one computer directly from another computer without involving the
processor or operating system on either computer. RDMA improves throughput and
performance because it frees up resources via zero-copy techniques that can be
roughly separated into three classes [5]: (1) the avoidance and optimization of in-
kernel data copying: to process data completely within a kernel; (2) bypassing the
kernel on the main data processing path: to allow direct data transfers between user-
space memory and hardware, with the kernel only managing and aiding these
transfers; and (3) the optimization of data transfer between a user application and
the kernel: to optimize CPU copies between the kernel and user space, which
maintains the traditional method of arranging communication. Additionally, RDMA requires no CPU involvement on the remote side: remote memory is accessed without any intervention by the remote processor(s). It also provides scatter/gather support, in which a scatter operation reads a data stream from one buffer and writes it into multiple memory buffers, while a gather operation reads data from multiple memory buffers and writes them as a single stream into one buffer.
To implement these features of RDMA, a new network fabric that supports the
native RDMA protocol is necessary. Thus far, there have been three types of
network fabric and protocol: InfiniBand, RDMA over converged Ethernet (RoCE)
and the Internet wide area RDMA protocol (iWARP). InfiniBand is a lossless
network fabric that supports RDMA natively from the beginning. It requires special
NICs and switches and is widely used in supercomputers and MPPs. RoCE is a
converged network protocol that allows RDMA to be performed over an Ethernet network. It encapsulates the InfiniBand transport protocol inside Ethernet frames, which allows RDMA to be used over standard Ethernet infrastructure (switches). RoCE requires RoCE-capable network adapter cards, and RoCE drivers are available in Linux, Microsoft Windows, and other common operating systems. iWARP is a network protocol that allows RDMA to run over TCP transport so that it can be deployed on
existing LAN/WAN networks. It employs a complex mix of layers, including direct
data placement (DDP), marker PDU aligned framing (MPA) and a separate RDMA
protocol (RDMAP), to deliver RDMA services over TCP/IP. iWARP is too complex
to achieve the same performance as that of the InfiniBand and RoCE solutions.
Collective communication libraries have been standardized as part of the MPI. Implementations of the standard, such as Open MPI, Intel MPI, MPICH, and basic linear algebra communication subprograms (BLACS), have gained momentum in
industries and research communities due to the importance of efficient communi-
cation for obtaining good application performance on parallel computers (e.g.,
cluster, MPP, supercomputer). The integration of the MPI and RDMA is a
promising research direction.
MPI libraries over RDMA and RDMA-enabled fabrics (e.g., InfiniBand) have
been widely used in the HPC architecture for better performance. Target and source memory regions must be registered prior to an RDMA operation. To
avoid the high cost of copying in/out of a static memory region and memory
registration prior to each RDMA operation, message buffers (a memory region) are
first allocated by applications and then registered and cached locally by the MPI
library. For the use cases of applications that reuse target and source buffers for
RDMA operations, the cost of the initial registration is amortized over the
subsequent RDMA operations [88]. MPI libraries over RDMA can take advantage
of RDMA features and achieve better performance when processing large messages
and collective communications.
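One practical consequence of this registration-caching behaviour is that applications benefit from allocating communication buffers once and reusing them across iterations. The mpi4py sketch below (the message size and iteration count are arbitrary assumptions) passes the same preallocated NumPy buffers to the buffer-based Allreduce on every call; whether the underlying MPI library actually caches RDMA registrations for these buffers depends on the MPI implementation and fabric.

# Reusing preallocated buffers across iterations (mpi4py + NumPy).
# Registration caching, if any, happens inside the underlying MPI/RDMA stack.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
N = 1 << 20                               # hypothetical message size (1M doubles)
send = np.empty(N, dtype=np.float64)      # allocated once ...
recv = np.empty(N, dtype=np.float64)      # ... and reused in every iteration

for it in range(50):
    send[:] = comm.Get_rank() + it        # refill the same buffer instead of reallocating
    comm.Allreduce(send, recv, op=MPI.SUM)  # buffer-based collective over the same memory regions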

4.3.2 Distributed File System

Local file systems, such as EXT and FAT, cannot meet the cluster system
requirements for file sharing. Distributed file systems are designed for clusters with
the following characteristics: (1) network transparency: remote and local file access
can be achieved through the same system call; (2) location transparency: the full
path of a file does not need to be bound to its file service; that is, the name or address
of the server is not part of the file path; and (3) location independence: because the
name or address of the server is not part of the file path, changing the file location
does not cause the file path to change. Lustre and the GPFS are distributed file
systems commonly used in HPC, while the HDFS is a dedicated distributed file
system for big data. We compare them in Table 7.

4.3.3 Resource Management System

A cluster resource management system (RMS) is a cluster manager that oversees the cluster's collection of resources, such as processors, memory and storage. It
maintains the status information of resources to know what resources are available
and can thus assign jobs to available machines. YARN and Mesos are two popular
RMSs used for big data, while Slurm is used in HPC.
YARN [36], part of the core Hadoop project, is the prerequisite for Enterprise
Hadoop, providing resource management and a central platform to deliver
consistent operations, security, and data governance tools across Hadoop clusters.

Table 7 Distributed file systems

Type: Lustre and the GPFS target HPC; the HDFS targets big data.
Standard: Lustre and the GPFS are POSIX compliant; the HDFS is not POSIX compliant.
Interconnects: Lustre supports InfiniBand, Gigabit Ethernet, 10 Gigabit Ethernet and Myrinet; the GPFS supports InfiniBand, Gigabit Ethernet and 10 Gigabit Ethernet; the HDFS supports Gigabit Ethernet and 10 Gigabit Ethernet.
Consistency: Lustre uses file locks and fine-grained metadata locks so that many clients can read and modify the same file or directory at the same time, and the Lustre distributed lock manager (LDLM) ensures that files are consistent across all clients and servers in the file system; the GPFS uses distributed locks with centralized management and a central coordinator; HDFS consistency is trivial because files are immutable.
Read/write: fine-grained in Lustre and the GPFS; coarse-grained in the HDFS.
High availability: Lustre realizes active-active failover through the shared storage partitions of OSTs (OSS targets); the GPFS provides fault tolerance through a logging policy (write-ahead and redo logs) and replicated data; the HDFS runs the NameNode in active-standby mode to protect the metadata, and DataNodes keep blocks in multiple replicas.
Throughput: the Lustre network allows full RDMA throughput and zero-copy communications when available, and Lustre stripes data across multiple storage targets (OSTs) so that file read/write operations leverage the throughput of many storage devices in parallel; the GPFS achieves high throughput by striping successive blocks across successive disks with sequential reads and writes in parallel; the HDFS provides high-throughput access to application data by serving data in parallel.
Scalability: the size and bandwidth of a Lustre cluster can be increased without interrupting services by adding new OSTs and MDTs; the GPFS can dynamically allocate system resources and supports the dynamic addition and deletion of hard disks without restarting while the file system is mounted; new DataNodes can be added to a Hadoop cluster dynamically.

It also extends Hadoop to incumbent and new techniques found within the data
center so that they can take advantage of cost-effective, linear-scale storage and
processing.
Mesos [31] is a platform for sharing commodity clusters between multiple
diverse cluster computing frameworks, such as Hadoop and MPI. It supports
different types of workloads, including container orchestration (Mesos containers,
Docker, etc.), analytics (Spark), big data techniques (Kafka, Cassandra) and much
more. YARN is specifically designed for Hadoop workloads, whereas Mesos is designed for all kinds of workloads. YARN is an application-level scheduler, and Mesos is an OS-level scheduler.
Slurm [44] is an open-source, fault-tolerant and highly scalable cluster
management and job scheduling system for large and small Linux clusters. It has three key functions. First, it assigns exclusive and/or nonexclusive access to
resources (compute nodes) to users for some duration of time so that they can
perform work. Second, it provides a framework for starting, executing, and
monitoring work (normally a parallel job) on the set of allocated nodes. Third, it
arbitrates contention for resources by managing a queue of pending work.
We summarize YARN, Mesos and Slurm in Table 8 by comparing and
contrasting the following features:

● Job types can be primarily categorized into parallel jobs and job arrays, both of which capture the types of parallelism that the scheduler can handle [73].
Parallel jobs are used to speed up computation, during which the processes are
launched simultaneously and communicate; job arrays allow users to submit a
series of jobs using a single submission command/script. HPC schedulers always
support both parallel jobs and job arrays, while big data schedulers tend to
support job arrays only.
● The scheduling strategy includes various scheduling algorithms. Queue mechanisms [33] indicate that jobs within a queue are ordered according to a scheduling policy, e.g., FCFS (first come, first served), SJF (shortest job first) and LJF (longest job first). Backfilling is the capability to schedule pending jobs upon the early completion of an executed job; two backfilling mechanisms are commonly used, namely, conservative backfilling [62] and easy backfilling [51] (a toy simulation of easy backfilling is sketched after this list). Job dependencies allow users to define execution dependencies between jobs so that the jobs are executed as DAGs.

Table 8 Resource management

Type: YARN targets big data; Mesos targets big data/HPC; Slurm targets HPC.
OS support: YARN and Mesos run on Linux; Slurm runs on Linux/*nix.
Job types: parallel jobs are supported only by Slurm; job arrays are supported by YARN, Mesos and Slurm.
Scheduling strategy: job dependencies and queue policies are supported by YARN, Mesos and Slurm; backfilling mechanisms are supported only by Slurm.
Resource management: static and dynamic resource management and resource-aware scheduling are supported by YARN, Mesos and Slurm; network-aware scheduling is supported only by Slurm.
Fault tolerance: checkpointing and job migration are supported only by Slurm; job restarting is supported by YARN, Mesos and Slurm.

● Resource management covers static and dynamic resource management,
focusing on resource-aware and network-aware capabilities. Resource-aware
and network-aware scheduling in big data schedulers (such as YARN) relies on
an HDFS replica mechanism and is not as sophisticated as that in HPC
schedulers (such as Slurm).
● Fault tolerance is the feature whereby the scheduler allows applications to checkpoint their runtime state, saved as a snapshot, so that a job can restart from the checkpointed state when a crash occurs. Related concepts are job migration and job restarting. Job migration means that the scheduler can move a job from one computational resource to another while it is being executed. Job restarting involves restarting the job if it is aborted or fails. Supporting checkpoints is a prerequisite for job migration but not for job restarting. Big data schedulers typically do not support checkpointing.
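To make the backfilling idea concrete, the toy Python simulator below applies easy backfilling to an FCFS queue: a waiting job may jump ahead of the queue head only if it can finish before the reservation computed for the head job. It is an illustrative sketch under simplifying assumptions (a single resource type and exactly known runtimes) and not how Slurm or any production scheduler is implemented.

# Toy easy-backfilling simulator (illustrative only).
import heapq

def easy_backfill(jobs, total_nodes):
    """jobs: list of (job_id, nodes_needed, runtime) in FCFS arrival order."""
    queue = list(jobs)
    running = []                      # heap of (end_time, nodes)
    free, now, starts = total_nodes, 0, {}

    def release_until(t):
        nonlocal free
        while running and running[0][0] <= t:
            free += heapq.heappop(running)[1]

    while queue:
        release_until(now)
        head_id, head_nodes, head_rt = queue[0]
        if head_nodes <= free:                           # FCFS: start the head job
            queue.pop(0)
            free -= head_nodes
            heapq.heappush(running, (now + head_rt, head_nodes))
            starts[head_id] = now
            continue
        # Reservation: earliest time at which enough nodes free up for the head job.
        future_free, reserve_at = free, now
        for end, nodes in sorted(running):
            future_free += nodes
            reserve_at = end
            if future_free >= head_nodes:
                break
        # Easy backfilling: later jobs that fit now and finish before the reservation.
        for job in list(queue[1:]):
            jid, jn, jrt = job
            if jn <= free and now + jrt <= reserve_at:
                queue.remove(job)
                free -= jn
                heapq.heappush(running, (now + jrt, jn))
                starts[jid] = now
        now = running[0][0] if running else reserve_at   # advance to the next completion
    return starts

# Job C backfills ahead of B without delaying B's start at t=10.
print(easy_backfill([("A", 4, 10), ("B", 8, 5), ("C", 2, 4)], total_nodes=8))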

4.4 Infrastructure Layer

4.4.1 High-speed Interconnects

An interconnection network plays an essential role in cluster and HPC architectures
and needs to incorporate fast interconnection technology to support high-bandwidth
and low-latency communication between nodes. Thus, many design factors should
be considered to improve interconnection network performance, such as the
topology, routing algorithm, power consumption, reliability and fault tolerance, and
congestion control. The features of different interconnection technologies are
summarized in Table 9.
Gigabit Ethernet provides reasonably high bandwidth at a low price, but its relatively high latency prevents it from being a good choice for HPC. Nevertheless, the low price of Gigabit Ethernet is appealing for cluster construction.

Table 9 Comparison of interconnection technologies

Gigabit/10 Gigabit Ethernet: bandwidth 100/1000 MBytes/s; latency <100 μs; (fat) tree topology [6]; no special feature.
InfiniBand: bandwidth 200/1000/3000 MBytes/s; latency <7 μs; Clos network topology [14]; special feature: RDMA.
Myrinet: bandwidth 230 MBytes/s; latency 10 μs; Clos network topology [14]; special feature: OS bypass.
SCI: bandwidth <320 MBytes/s; latency 1–2 μs; meshes of rings; special feature: tree-structured coherence directories.

InfiniBand is an I/O protocol designed to provide high-bandwidth, low-latency
interconnections for HPC. Released in 2002, it follows industry standards based on
the VIA concepts and supports the connection of various system components within
a system, such as interprocessor networks, I/O subsystems or multiprotocol network
switches. By incorporating powerful RDMA engines, it offloads CPU overhead.
Myrinet is popular for fast cluster networks. The key advantage of Myrinet is
that it operates in the user space, bypassing operating system interference and delay.
The Beowulf [82] cluster employs Ethernet as the management network and
Myrinet as the data network. Myrinet uses the “OS bypass” technique to speed up
communications between cluster nodes.
The scalable coherent interface (SCI), as an interconnection technology
standard specified for cluster computing, defines a directory-based cache
scheme that keeps the caches of connected processors coherent. In this way, it
can implement virtual shared memory.

4.4.2 Memory Access Model

For different memory access models, parallel computers can be categorized as either
shared memory models or nonshared memory models, as shown in Table 10.
Uniform memory access (UMA) is a shared memory architecture used in parallel
computers, where processors share the physical memory uniformly. In the UMA
architecture, the access time to a memory location is independent of which
processor makes the request or which memory chip contains the transferred data.
Nonuniform memory access (NUMA) is a computer memory design used in
multiprocessing, where the memory access time depends on the memory location
relative to the processor. Under NUMA, a processor can access its local memory
faster than nonlocal memory (memory local to another processor or memory shared
between processors). The benefits of NUMA are limited to specific workloads,
notably on servers where data are closely associated with certain tasks or users. The
NUMA architecture can be subdivided into cache-coherent NUMA (CC-NUMA),

Table 10 Parallel computing category based on the memory access model

Shared memory model
  Uniform memory access (UMA): PVP (Cray T-90); SMP (Intel SHV, Sun Fire, IBM R50, SGI Power Challenge)
  Nonuniform memory access (NUMA): COMA (KSR-1, DDM); CC-NUMA (Stanford DASH, SGI Origin 2000, Sequent NUMA-Q); NCC-NUMA (Cray T3E)

Nonshared memory model
  No remote memory access (NORMA): cluster (IBM SP2, DEC TruCluster, Beowulf, Berkeley NOW, HPVM); MPP (Intel/Sandia Option Red, Intel Paragon)

non-cache-coherent NUMA (NCC-NUMA) and the cache-only memory architecture (COMA).
The COMA is a type of cache-coherent nonuniform memory access (CC-NUMA)
architecture. Unlike traditional CC-NUMA architectures, in the COMA, each shared
memory module in the machine is a cache, where each memory line has a tag with
its address and state. When a processor references a line, it transparently takes it to
its private cache or the vicinity of the NUMA shared memory (local memory),
which may displace the valid line from its local memory. In fact, each shared
memory module acts as a very large cache, which is what gives the architecture its name.
Massively parallel processors (MPPs) and clusters are usually referred to as
having a shared-nothing architecture, in which each system consists of many loosely coupled processing units, each with its own CPU, memory and disk. Through high-speed
interconnects, the system functions as a whole and can scale out as new servers or
nodes are added to the system. The boundary between the MPP and cluster
gradually blurs, and the cost performance ratio of the cluster is better than that of the
MPP. Therefore, cluster technology is the mainstream trend of scalable parallel
computing, especially in the field of big data computing.

5 Comparative Analysis and Open Discussion

After comparing big data computing and HPC, we come to some conclusions. (1)
Iterative computation is supported by batch-oriented big data frameworks (Spark,
Harp, Twister), stream-oriented big data frameworks (Flink, Naiad) and message-passing frameworks (MPI) in scalable architectures (the shared memory architec-
ture is excluded). However, iterative computation behaves differently in these
frameworks because of their unique designs in the parallel programming model. (2)
How to process heterogeneous tasks in a single system is a fast-growing research
field. Cross-stack functionality applications require parallel computing systems to
process the heterogeneity of tasks with widely different execution times and
resource requirements. The state-of-the-art big data computing systems try to
integrate batch and stream processing in one data processing engine to support both
streaming computation and analytical computation. (3) Big data computing systems
tend to choose general commodity machines and networks to reduce investment and
improve adaptability, while HPC systems tend to choose specialized parallel
computers to achieve high performance. Cross-stack functionality applications
produce workloads that differ from those of typical big data or HPC. These
workload types must have affinity towards the capabilities offered by the aggregate
stack architecture. However, these capabilities that are present in either the HPC or
big data computing architecture cannot handle all these workloads, hence the need
for the converged HPC and big data computing architecture. These conclusions
represent the latest research progress and prospects of big data computing.

5.1 Iterative Computation

The capabilities (functional features) and performance (nonfunctional features) of
iterative computations depend on the design pattern in the parallel programming
model: data model (e.g., data set vs data stream), execution model (e.g., control flow
vs data flow), state management (e.g., intermediate data), fault tolerance (e.g.,
recovery from lineage vs recovery from distributed snapshot) and collective
communication. In Spark, a driver running on a separate master node controls the
Spark program, and the parallel regions in the driver are transferred to the cluster for
execution. In this model, complex control flow programs that need to be executed
serially, such as iterations and if-conditionals, run on the master node, while data
flow programs are executed on worker nodes. The advantage of this model lies in its
ease in handling complex algorithms with iterative computation or nested loops.
However, the disadvantages are also obvious: it is harder to carry out complex
optimizations on a data flow graph, and the efficiency in execution is low in that the
parallel regions need to be scheduled and executed dynamically. Flink translates the
data flow program into an execution graph before it is executed on distributed
nodes. Different from the short tasks launched in each of Spark's iterations, a Flink application can be executed as a long-running job. The advantage of the Flink model is that it
can optimize the execution graph easily at the beginning of execution and process
data with high throughput and low latency as a data flow model. Even though Flink
has nice data flow abstraction for programming, it is difficult to program in a strictly
data flow fashion for complex programs, primarily because control flow operations
such as if conditions and iterations are harder to code in the data flow paradigm.
Naiad shares the advantages and disadvantages of the Flink model; moreover, it provides nested loops as an extra feature for iterative computation. Benchmark tests show that the MPI performs better than Spark and Flink on complex iterative algorithms such as MDS and K-means, as presented in [81]. The reason for the better performance is that the MPI provides rich collective communication libraries and a low-level programming interface; thus, it can be optimized for different algorithms. However, the disadvantage is also apparent: users have to design their own data structures, manage intermediate data at a high cost, and provide their own fault tolerance mechanism. Through these comparative analyses, we reach
conclusions about iterative computation and motivate research in three directions:
(1) the optimization of collective communication, especially the all-to-all commu-
nication in a cluster; (2) the optimal mechanism of caching, persistence, and reuse
of the intermediate data in iteration; and (3) the convergence of the control flow and
data flow paradigms for the easy programming of complex control logic and for better iterative computation performance.
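The division of labour between driver-side control flow and cluster-side data flow can be seen in a minimal PySpark sketch of an iterative K-means-style computation (the data, iteration count and convergence threshold are arbitrary assumptions): the for loop and the convergence check run serially on the driver, while each map/reduceByKey pass is shipped to the workers, with the input RDD cached so that it is not re-materialized in every iteration.

# Illustrative iterative K-means skeleton in PySpark (driver-side loop, cluster-side dataflow).
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")
points = sc.parallelize([np.random.rand(2) for _ in range(10000)]).cache()  # reused across iterations

k = 3
centers = points.takeSample(False, k)                    # initial centers held on the driver

for it in range(20):                                     # control flow runs on the driver
    bc = sc.broadcast(centers)
    # Data flow executed on the workers: assign each point to its closest center.
    sums = (points
            .map(lambda p: (int(np.argmin([np.linalg.norm(p - c) for c in bc.value])), (p, 1)))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
            .collectAsMap())
    new_centers = [sums[i][0] / sums[i][1] if i in sums else centers[i] for i in range(k)]
    shift = sum(np.linalg.norm(new_centers[i] - centers[i]) for i in range(k))
    centers = new_centers
    if shift < 1e-4:                                     # convergence test on the driver
        break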

5.2 Heterogeneous Tasks in One System

Big data computing systems are generally classified into batch-oriented and stream-
oriented systems according to their data processing engine. The unified runtime
engine depends not only on the approaches of bounded and unbounded data
processing but also on the structure of the computation graph and the integration of
different data flow and control flow paradigms. Spark, as a batch processing system,
takes the stream data as a sequence of RDDs. An RDD is created at each time
interval and is processed by the Spark engine as a microbatch. Flink, in contrast, is a
stream processing system. A bounded data set is processed by Flink as a special case
of an unbounded data stream. Spark and Flink integrate the two data processing
procedures into one system by abstracting one procedure to the other, although this
abstraction is not very efficient or cost-effective. Spark and Flink can process
typical big data workloads; however, cross-stack functionality workloads are
beyond their capabilities.
To date, there are no unified parallel programming frameworks that can process
all the workloads of cross-stack functionality applications. For most of the offline
and interactive analytics scenarios, you can use Spark, Flink or Naiad; for scenarios
involving automatic speech recognition and other pattern recognition, your best
choice is a DL framework, such as TensorFlow [56], MXNet [83], etc.; and for
scenarios involving the RL-based IoT, you can try Ray [70], which is a dynamic
task graph computation model designed for RL. Therefore, designing a unified
parallel programming framework is the most active and promising research area in
big data computing.
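To give a flavour of the dynamic task graph style that Ray exposes, the sketch below mixes stateless remote tasks with a stateful actor, the two building blocks Ray combines for RL-style workloads; the task and actor bodies are hypothetical placeholders rather than a real RL pipeline.

# Illustrative Ray sketch: heterogeneous tasks (stateless tasks + a stateful actor).
import ray

ray.init()

@ray.remote
def rollout(policy_version, steps):
    # Placeholder for an environment simulation returning a reward estimate.
    return policy_version * 0.01 + steps * 0.001

@ray.remote
class ParameterStore:
    def __init__(self):
        self.version = 0
        self.rewards = []
    def report(self, reward):
        self.rewards.append(reward)
        self.version += 1
        return self.version
    def current_version(self):
        return self.version

store = ParameterStore.remote()
for _ in range(5):
    version = ray.get(store.current_version.remote())
    futures = [rollout.remote(version, s) for s in (10, 100, 1000)]  # tasks of widely varying cost
    mean_reward = sum(ray.get(futures)) / 3                           # results drive the next round
    ray.get(store.report.remote(mean_reward))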
Here, we present two research directions to close this section: (1) the design
of general-purpose underlying primitives and the unified programming model
should consider supporting both batch and stream processing, supporting hetero-
geneous computation, and coworking with other DL libraries and numeric libraries;
and (2) the design of a unified programming model should consider leveraging the
cluster architecture for better performance and avoid conflicts in cluster resource
scheduling.

5.3 Converged Architecture of HPC and Big Data Computing

In [86], Wenguang C. pointed out that although big data computing and HPC are
significantly different in many aspects, they still have similarities, and a trend of
mutual learning and convergence is emerging. In the convergence of HPC and big
data computing, on the one hand, big data computing should maintain the benefits of
processing extreme-scale data in parallel with scalability, reliability, transparent
fault tolerance, and ease of use; on the other hand, big data computing should learn
from HPC to process computation-sensitive and interconnect-sensitive workloads as
new requirements from these cross-stack functionality applications.
Harp [91] proposes a collective communication abstraction layer on several
common data abstractions, which provides flexible combinations of computational
models that adapt to a variety of applications. It provides six collective
communication operations (i.e., “broadcast”, “allgather”, “allreduce”, “regroup”,
“send messages to vertices”, and “send edges to vertices”) and hierarchical data
abstractions to support various collective communication patterns. Data abstractions
can be classified horizontally as arrays, key values, or vertices, edges and messages
in graphs or can be constructed vertically by basic types, partitions and tables. In
addition to the collective communication and hierarchical data abstraction layer,
Harp builds a MapCollective model that follows the BSP style and provides a MapCollective programming interface (a set of collective communication APIs) and
a data abstractions programming interface. It is suitable for processing algorithms
such as clustering, MDS, and force-directed graph drawing on large-scale data with
high efficiency. Harp is now designed in a pluggable way to bridge the gaps
between the Hadoop ecosystem and HPC system, improving the performance of big
data computing through clear communication abstraction while still retaining the
benefits of the Hadoop ecosystem, such as availability, productivity and fault
tolerance.
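Harp's "regroup" collective is essentially a repartitioning of key-value partitions by key. A rough analogue (not Harp's own API) can be written with an all-to-all exchange, as in the mpi4py sketch below, where the hash partitioning scheme and data are illustrative assumptions.

# Illustrative "regroup"-style repartition by key using an all-to-all exchange (mpi4py).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank starts with some local key-value pairs (hypothetical data).
local_kv = [(k, f"val-{rank}-{k}") for k in range(rank * 3, rank * 3 + 3)]

# Bucket pairs by destination rank (hash partitioning on the key).
outgoing = [[] for _ in range(size)]
for k, v in local_kv:
    outgoing[k % size].append((k, v))

# All-to-all: every rank sends one bucket to every other rank and receives one from each.
incoming = comm.alltoall(outgoing)
regrouped = [kv for bucket in incoming for kv in bucket]
print(rank, sorted(regrouped))   # all pairs whose key hashes to this rank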
Twister [23] is an enhanced MapReduce programming model that efficiently
supports iterative MapReduce computations. Unlike MapReduce tasks executed in
batches, Twister supports long-running map/reduce tasks and concatenates these
tasks with each other for data transfers via broadcast and scatter communication,
thereby efficiently supporting iterative computations. Twister separates variable data (the results computed in each iteration) from static data (typically the much larger of the two), which remains fixed throughout the computation. In the configure phase, Twister loads the static data onto the map and reduce tasks for reuse in each iteration. Twister2 further integrates HPC by launching Harp-DAAL (Data
Analytics Acceleration Library) and the batch or streaming data processing
capabilities of Apache Hadoop, Spark, Heron and Flink. Harp-DAAL is an HPC
communication collective in the Hadoop ecosystem with a kernel ML library
exploiting the Intel node library DAAL. Twister2 follows the BSP style and
Dataflow style. In [81], the authors compared the performance of three distinct algorithms, i.e., MDS, K-means and TeraSort, each implemented in Spark, Flink and MPI, and reached the following conclusions: (1) the MPI achieves the best performance but is the most difficult to program; and (2) in a similar benchmark test, Twister2 proved to perform much better than Spark and Flink, and it plans to support RDMA and other communication enhancements for further performance optimization.
In [9], the authors reported their experience in porting Spark to large production HPC systems
(Cray XC systems: Edison and Cori) to serve both scientific and data-intensive
workloads by exploiting a combination of HPC hardware support, the system
software configuration, and engineering changes to Spark and the application
libraries. Data movement is one of the determinants of Spark performance. There
are two kinds of data movement in Spark: vertical movement and horizontal
movement. The former refers to movement across the entire memory hierarchy
(including persistent storage). That is, to load source data blocks into the main
memory, to write back result data to persistent storage, and to cache or persist
intermediate data during the data processing procedure. The latter refers to the
random communication phase between work nodes, typically the shuffle stage.
However, HPC and big data computing have different architectures to handle data
movement in the two directions. In big data systems, disk I/O is optimized for
latency by using local disks, and the network between nodes is optimized primarily
for bandwidth. In contrast, HPC systems use a global parallel file system without
local storage (e.g., Lustre). The disk I/O of global storage is optimized primarily for
bandwidth, while the network is optimized primarily for latency. The authors
selected three Spark benchmarks to cover the performance space: (1) the SparkSQL

123
International Journal of Parallel Programming

benchmark is designed to capture the vertical data movement performance; (2) the
GroupBy benchmark is designed to capture the shuffle performance, which stresses
both horizontal and vertical data movement; and (3) the PageRank benchmark is
designed to capture the iterative computation performance. In the performance test,
the authors ported Spark to run on the Cray XC family by calibrating the single-node
performance when using the Lustre global file system against that of a workstation
with local SSDs. The result shows that file system metadata access latency
dominates in an HPC installation using Lustre, which results in single-node
performance up to 4X slower than that of a typical workstation. In view of the above
test, the authors proposed some software-based and hardware-based approaches to
alleviate the single-node performance gap when porting Spark to HPC and to
improve the scalability of Spark on HPC. For the software-based approach, to solve
the problem of metadata access latency and adapt to large-scale workloads, the
authors added a layer for pooling and caching open file descriptors and evaluated the
statically sized file pool using two eviction policies to resolve capacity conflicts, i.e.,
LIFO and FIFO, in which LIFO provides the best performance for the shuffle stage.
For the hardware-based approach, the authors evaluated a system with a layer of
nonvolatile storage (BurstBuffer) that sits between the processor memory and back-
end Lustre file system, improving scalability at the expense of per-node
performance. The experience of porting Spark to HPC proves that increasing the
local storage of nodes with large NVRAM by better caching near processors reduces
the intermediate data access overhead, namely, the vertical data movement
overhead. This experience can also be extended to other big data computing
systems such as Hadoop.
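The file-descriptor pooling layer described above can be pictured with a small Python sketch: open descriptors are cached in a statically sized pool, and on a capacity conflict one entry is evicted according to a LIFO or FIFO policy. The class below is an illustrative reconstruction of the idea, with the capacity, policy handling and file set as assumptions, not the code used in [9].

# Illustrative statically sized pool of open file descriptors with LIFO/FIFO eviction.
from collections import OrderedDict

class FilePool:
    def __init__(self, capacity, policy="LIFO"):
        self.capacity, self.policy = capacity, policy
        self.pool = OrderedDict()            # path -> open file object, in insertion order

    def get(self, path):
        if path in self.pool:                # hit: reuse the cached descriptor
            return self.pool[path]
        if len(self.pool) >= self.capacity:  # capacity conflict: evict one entry
            victim = next(reversed(self.pool)) if self.policy == "LIFO" else next(iter(self.pool))
            self.pool.pop(victim).close()
        f = open(path, "rb")                 # miss: pay the (metadata-heavy) open once
        self.pool[path] = f
        return f

    def close_all(self):
        for f in self.pool.values():
            f.close()
        self.pool.clear()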
Here, we present two research directions to close this section: (1) the design
of the hierarchical memory of clusters, which is optimized for vertical data
movement; and (2) the design of the topology of high-speed interconnects, which is
optimized for horizontal data movement, involving point-to-point communication
and collective communication.

6 Conclusions

The challenges brought by emerging applications with cross-stack functionality require iterative computation that is scalable, reliable and flexible, the handling of heterogeneous tasks in one system, and the processing of these workloads with affinity towards the capabilities offered by the layered architecture stacks. However,
the capabilities present in either HPC or big data computing cannot meet all the
requirements. By conducting a comparative analysis in detail on HPC and big data
computing ranging from the design patterns of parallel programming models and the
design ideas of system middleware to the infrastructure of parallel computers, and
by discussing the experience in the converged architecture of big data computing
and HPC, we expect to address these challenges of emerging applications from the
perspective of layered architecture stacks and the converged architecture. We
believe that this comparative survey will pave the way for research on the
converged architecture of HPC and big data computing in the new era.

References
1. Akidau, T., Balikov, A., Bekiroğlu, Kaya, et al.: MillWheel: fault-tolerant stream processing at
internet scale [J]. Proc. VLDB Endow. 6(11), 1033–1044 (2013)
2. Almasi, G.: PGAS (Partitioned Global Address Space) Languages [J]. Encycl. Parallel Comput. 1,
1539–1545 (2011)
3. Apache Giraph. https://giraph.apache.org/
4. Asaadi, H. R., Khaldi, D., Chapman, B. A. (2016) Comparative Survey of the HPC and Big Data
Paradigms: Analysis and Experiments [C]// IEEE International Conference on Cluster Computing
(CLUSTER). IEEE,:423–432.
5. Bröse E. ZeroCopy: Techniques, Benefits and Pitfalls [EB/OL]. https://static.aminer.org/pdf/PDF/
000/253/158/design_and_implementation_of_zero_copy_data_path_for_efficient.pdf
6. Browning, S. A. The Tree Machine: A Highly Concurrent Computing Environment [EB/OL]. 1980.
http://resolver.caltech.edu/CaltechCSTR:3760-tr-80.
7. Carbone, P., Fóra, G., Ewen, S et al. (2015) Lightweight Asynchronous Snapshots for Distributed
Dataflows [J]. Computer Science.
8. Carbone, P., Katsifodimos, A., Ewen, S., et al.: Apache flinktm: stream and batch processing in a
single engine [J]. IEEE Data Eng. Bulletin 38(4), 28–38 (2015)
9. Chaimov, N., Malony, A., Canon, S et al. Scaling Spark on HPC Systems [C]// Proceedings of the
25th ACM International Symposium on High-Performance Parallel and Distributed Computing –
HPDC. ACM, 2016:97–110.
10. Chambers, C., Raniwala, A., Perry, F et al. FlumeJava: Easy, Efficient Data-parallel Pipelines [C]//
ACM Sigplan Conference on Programming Language Design & Implementation. ACM, 2010.
11. Chan, E., Heimlich, M., Purkayastha, A., et al.: Collective communication: theory, practice, and
experience [J]. Concurrency Computat. Pract. Exper. 19(13), 1749–1783 (2007)
12. Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems
[J]. ACM Transact. Comput. Syst. (TOCS) 3(1), 63–75 (1985)
13. Chapman, B., Curtis, T., Pophale, S et al. Introducing OpenSHMEM: SHMEM for the PGAS
community [C]// Conference on Partitioned Global Address Space Programming Model. 2010.
14. Clos, C.: A study of non-blocking switching networks [J]. Bell Syst. Tech. J. 32(2), 406–424 (1953)
15. Crankshaw, D., Bailis, P., Gonzalez, J.E., et al.: The missing piece in complex analytics: low latency,
scalable model management and serving with velox [J]. European J. Obstet. Gynecol. Reprod. Biol
185, 181–182 (2014)
16. Cristina P. The technology Stacks of High Performance Computing & Big Data Computing: What
They Can Learn from Each Other. https://www.etp4hpc.eu/pujades/files/bigdata_and_hpc_FINAL_
20Nov18.pdf
17. Dagum, L., Menon, R.: OpenMP: an industry standard api for shared-memory programming [J].
IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)
18. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters [J]. Commun. ACM
51(1), 107–113 (2008)
19. Doulkeridis, C., Nørvåg, Kjetil: A survey of large-scale analytical query processing in mapreduce [J].
VLDB J. Int. J. Very Large Data Bases 23(3), 355–380 (2014)
20. Duell, J., Hargrove, P., Roman, E. The Design and Implementation of Berkeley Lab’s Linux
Checkpoint/Restart [R]. Berkeley Lab Technical Report LBNL-54941, 2002.
21. Egwutuoha, I P., Chen, S., Levy, D, et al. A Fault Tolerance Framework for High Performance
Computing in Cloud [C]//12th IEEE/ACM International Symposium on Cluster, Cloud and Grid
Computing. IEEE, 2012: 709–710.
22. Egwutuoha, I.P., Levy, D., Selic, B., et al.: A survey of fault tolerance mechanisms and checkpoint/
restart implementations for high performance computing systems [J]. J. Supercomput. 65(3), 1302–
1326 (2013)
23. Ekanayake, J., Hui, L., Zhang, B et al. Twister: A Runtime for Iterative MapReduce [C]// Pro-
ceedings of the 19th ACM International Symposium on High Performance Distributed Computing.
DBLP, 2010.
24. El-Ghazawi, T., Smith, L UPC: Unified parallel [C]//ACM/IEEE Conference on High Performance
Networking & Computing. DBLP, 2006.
25. Fagg G E, Dongarra J. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic
world [C]// Proceedings of EuroPVM-MPI 2000. Springer, 2000.

26. Foster I. The MPI Programming Model. https://www.mcs.anl.gov/*itf/dbpp/text/node95.html


27. Gan G, Manzano J. TL-DAE: Thread-Level Decoupled Access/Execution for OpenMP on the
Cyclops-64 Many-Core Processor [C]// Languages and Compilers for Parallel Computing, 22nd
International Workshop, LCPC 2009. Springer, 2009.
28. Graham, R.L., Choi, S.E., Daniel, D.J., et al.: A Network-failure-tolerant message-passing system for
terascale clusters [J]. Int. J. Parallel Prog. 31, 285–303 (2003)
29. Gropp W, Huss-Lederman S, Lumsdaine A, et al. (1998) MPI: The Complete Reference. Volume 2,
the MPI-2 Extensions [M]. Cambridge: MIT Press, .
30. Gupta P, Goel A, Lin J, et al. WTF: The Who to Follow Service at Twitter [C]// Proceedings of the
22nd international conference on World Wide Web. ACM, 2013.
31. Hindma B, Konwinski A, Zaharia M, et al. Mesos: A Platform For Fine-Grained Resource Sharing in
the Data Center [C]// Proceedings of the 8th USENIX conference on Networked systems design and
implementation. NSDI, 2011.
32. Holtslag A A M, De Bruijn E I F, Pan H L. A High Resolution Air Mass Transformation Model for
Short-Range Weather Forecasting [J]. Monthly Weather Review, 1990, 118(8):1561–1575. http://
docs.jboss.org/drools/release/6.0.0.Final/drools-docs/html/HybridReasoningChapter.html#ReteOO
33. Hovestadt M, Kao O, Keller A, et al. Scheduling in HPC Resource Management Systems: Queuing
vs. Planning [C]// Job Scheduling Strategies for Parallel Processing, 9th International Workshop.
Springer, 2003.
34. https://en.wikipedia.org/wiki/Iterative_method
35. https://hadoop.apache.org/
36. https://hortonworks.com/apache/yarn/
37. https://www.pnnl.gov/computing/hpda/
38. https://www.pnnl.gov/computing/HPDA/ResearchAreas/Tasks/HPDA_EventAnalysis_17.pdf
39. Hua Z, Jason N. CRAK: Linux Checkpoint/Restart as a Kernel Module [R]. Technical Report CUCS-
014–01, Department of Computer Science, Columbia University, 2001.
40. Hursey J, Squyres J M, Mattox T I, et al. The Design and Implementation of Checkpoint/Restart
Process Fault Tolerance for Open MPI [C]// 2007 IEEE International Parallel and Distributed Pro-
cessing Symposium. IEEE, 2007.
41. Husbands P: Unified Parallel C. https://pdfs.semanticscholar.org/9b65/
a5dfffbfc9165cc7f2a366f54f8085f51773.pdf
42. Introduction to OpenMP. https://www3.nd.edu/*zxu2/acms60212-40212/Lec-12-OpenMP.pdf
43. Isard, M.: Dryad: distributed data-parallel programs from sequential building block [J]. SIGOPS
Oper. Syst. Rev. 41(3), 59–72 (2007)
44. Jette M A, Yoo A B, Grondona M. Slurm: Simple Linux Utility for Resource Management [C]//
Proceedings of Job Scheduling Strategies for Parallel Processing. Springer, 2003.
45. Jha, S., Qiu, J., Luckow, A et al. (2014), A Tale of Two Data-Intensive Paradigms: Applications,
Abstractions, and Architectures 3. https://arxiv.org/abs/1403.1528
46. Jiawei, H., Micheline, K.: Data mining: concepts and techniques [J]. Data Min. Concep. Models
Methods Algorithms Second Edition 5(4), 1–18 (2006)
47. Kaur, D., Chadha, R., Verma, N.: Comparison of micro-batch and streaming engine on real time data
[J]. Int. J. Eng.Sci. Res. Techonol. 4, 756–761 (2017)
48. Kune, R., Konugurthi, P.K., Agarwal, A., et al.: The anatomy of big data computing [J]. Software
Pract. Exper. 46(1), 79–105 (2016)
49. Lathia N, Hailes S, Capra L. kNN CF: A Temporal Social Network [C]// Proceedings of the 2008
ACM Conference on Recommender Systems. ACM, 2008.
50. Lei W, Jianfeng Z, Chunjie L, et al. BigDataBench: A big data benchmark suite from internet
services [C]// IEEE International Symposium on High Performance Computer Architecture. IEEE,
2014.
51. Lifka D. The ANL/IBM SP scheduling system [C]// Workshop on Job Scheduling Strategies for
Parallel Processing. Springer, 1995: 295–303.
52. Mahdavinejad, M.S., Rezvan, M., Barekatain, M., et al.: Machine learning for internet of things data
analysis: a survey [J]. Digital Commun. Netw. 3, 161–175 (2018)
53. Malewicz G, Austern M H, Bik A J C, et al. Pregel: A System for Large-scale Graph Processing [C]//
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM,
2010.
54. Mapreduce Tutorial. https://hadoop.apache.org/docs/r3.1.1/hadoop-mapreduce-client/hadoop-
mapreduce-client-core/MapReduceTutorial.html

55. Maria C C, Giuseppe S. Workload Characterization: A Survey [C]// Proceedings of the IEEE. IEEE,
1993, 81(8):1136–1150. https://doi.org/10.1109/5.236191
56. Martı́n Abadi, Paul Barham, Jianmin Chen, et al. TensorFlow: A System for Large-scale Machine
Learning [C]//12th USENIX Symposium on Operating Systems Design and Implementation. USE-
NIX, 2016, 265–283.
57. Marz N. Trident. https://github.com/nathanmarz/storm/wiki/Trident-tutorial. 2012.
58. McSherry F, Isaacs R, Isard M, et al. Composable Incremental and Iterative Data-Parallel Compu-
tation with Naiad [R]. Microsoft Research, 2012. https://www.microsoft.com/en-us/research/wp-
content/uploads/2012/10/naiad.pdf
59. McSherry F, Murray D G, Isaacs R, and Isard M. Differential Dataflow [C]// Proceedings of 6th
Biennial Conference on Innovative Data Systems Research. 2013. http://cidrdb.org/cidr2013/Papers/
CIDR13_Paper111.pdf
60. Mehdi, M., Ala, A.F., Sameh, S., et al.: Deep learning for iot big data and streaming analytics: a
survey [J]. IEEE Commun. Surv. Tutorials 1(1), 99 (2017)
61. Mina J, Verde C. Fault Detection Using Dynamic Principal Component Analysis by Average Esti-
mation [C]// IEEE International Conference on Electrical & Electronics Engineering. IEEE, 2005.
62. Mu’Alem, A.W., Feitelson, D.G.: Utilization, predictability, workloads, and user runtime estimates in
scheduling the ibm sp2 with backfilling [J]. IEEE Transact. Parallel Distributed Syst. 6(12), 529–543
(2001)
63. Murray D G, McSherry F, Isaacs R, et al. Naiad: A Timely Dataflow System [C]// ACM Symposium
on Operating Systems Principles (SOSP). ACM, 2013: 439–455.
64. Neumaier, A.: Molecular modeling of proteins and mathematical prediction of protein structure [J].
SIAM Rev. 39(3), 407–460 (1997)
65. Nishihara R, Moritz P, Wang S, et al. Real-Time Machine Learning: The Missing Pieces. 2017.
https://arxiv.org/abs/1703.03924
66. OpenMP Application Program Interface. OpenMP Architecture Review Board. 2008. http://www.openmp.org/mp-documents/spec30.pdf
67. Ordónez, F.J., Roggen, D.: Deep convolutional and lstm recurrent neural networks for multimodal
wearable activity recognition [J]. Sensors 16(1), 115 (2016)
68. Pan R, Dolog P, Xu G. KNN-Based Clustering for Improving Social Recommender Systems [C]//
Agents and Data Mining Interaction: 8th International Workshop, ADMI 2012. Springer, 2013.
https://doi: 10.1007/978-3-642-36288-0_11
69. Philipp M, Nishihara R, Stephanie W, et al. Ray: A Distributed Framework for Emerging AI
Applications [C]// Proceedings of the 13th USENIX Symposium on Operating Systems Design and
Implementation. 2018. https://arxiv.org/abs/1712.05889
70. Philipp M, Robert N, Stephanie W, et al. Ray: A Distributed Framework for Emerging AI Appli-
cations [C]// USENIX Symposium on Operating Systems Design and Implementation. USENIX,
2018.
71. Quoc-Cuong, T., Juan, S., Volker, M.: A survey of state management in big data processing systems
[J]. Int. J. Very Large Data Bases 27(6), 847–872 (2018)
72. Ramalingam, G.: Bounded Incremental Computation [M]. Springer, Berlin (1996)
73. Reuther, A., Byun, C., Arcand, W., et al.: Scalable System Scheduling for HPC and Big Data [J].
J. Parallel Distrib. Comput. 111(1), 76–92 (2018)
74. Richer S. A Deep Dive into Rescalable State in Apache Flink. 2017. https://flink.apache.org/features/
2017/07/04/flink-rescalable-state.html
75. Sakr, S., Liu, A., Fayoumi, A.: The family of mapreduce and large scale data processing systems [J].
ACM Comput. Surv. 46(1), 1–44 (2013)
76. Sankaran, S., Squyres, J.M., Barrett, B., et al.: The lam/mpi checkpoint/restart framework: system-
initiated checkpointing[J]. Int. J. High Perform. Comput. Appl. 19(4), 479–493 (2005)
77. Saraswat V, Almasi G, Bikshandi G, et al. The Asynchronous Partitioned Global Address Space
Model. http://www.cs.rochester.edu/u/cding/amp/papers/full/The%20Asynchronous%20Partitioned
%20Global%20Address%20Space%20Model.pdf
78. Schulz M, Bronevetsky G, Fernandes R, et al. Implementation and Evaluation of a Scalable
Application-level Checkpoint-recovery Scheme for MPI Programs [C]// Proceedings of the 2004
ACM/IEEE Conference on Supercomputing. IEEE, 2004.
79. Severson, K., Chaiwatanodom, P., Braatz, R.D.: Perspectives on process monitoring of industrial
systems [J]. Annu. Rev. Control. 42, 190–200 (2016)
80. Stephen P B. Multidimentional Scaling. 1997. http://www.analytictech.com/borgatti/mds.htm

81. Supun, K., Pulasthi, W., Saliya, E., et al.: Anatomy of machine learning algorithm implementations
in mpi, spark, and flink [J]. Int. J. High Perform. Comput. 32(1), 61–73 (2018)
82. The Beowulf Cluster site. http://www.beowulf.org
83. Tianqi Chen, Mu Li, Yutian Li, et al. MXNet: A Flexible and Efficient Machine Learning Library for
Heterogeneous Distributed Systems. Neural Information Processing Systems, Workshop on Machine
Learning Systems. 2016.
84. Tony H, Stewart T, Kristin T. The Fourth Paradigm: Data-Intensive Scientific Discovery [M].
Microsoft Research. 2009. https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/
Fourth_Paradigm.pdf
85. Tzoumas K. High-throughput, Low-latency, and Exactly-once Stream Processing with Apache
Flink™. 2015. https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-
stream-processing-with-apache-flink
86. Wenguang, C.: Big data and high performance computing [J]. Big Data Res. 1(001), 20–27 (2015)
87. Chen, W.: Big data and high performance computing [J]. Big Data Research 1(001), 20–27 (2015) (in Chinese). http://www.infocomm-journal.com/bdr/article/2015/2096-0271/2096-0271-1-1-00020.shtml
88. Wickramasinghe U , Lumsdaine A . A Survey of Methods for Collective Communication Opti-
mization and Tuning. 2016. ArXiv, abs/1611.06334.
89. Woodall T S, Shipman G M, Bosilca G, et al. High Performance RDMA Protocols in HPC [C]//
European Pvm/mpi Users Group Conference on Recent Advances in Parallel Virtual Machine &
Message Passing Interface. Springer, 2006.
90. Yanpei C, Francois R, Randy K. From TPC-C to Big Data Benchmarks: A Functional Workload
Model [R]. 1st Workshop on Specifying Big Data Benchmarks, 2012, 8163: 28–43. https://doi.org/
10.1007/978-3-642-53974-9_4
91. Zaharia M, Chowdhury M, Das T, et al. Resilient Distributed Datasets: A fault-tolerant Abstraction
for In-memory Cluster Computing [C]// Proceedings of the 9th USENIX conference on Networked
Systems Design and Implementation. USENIX Association, 2012.
92. Zhang B, Ruan Y, Qiu J. Harp: Collective Communication on Hadoop [C]// 2015 IEEE International
Conference on Cloud Engineering. IEEE, 2015.
93. Zhang, H., Chen, G., Ooi, B.C., et al.: In-memory big data management and processing: a survey [J].
IEEE Trans. Knowl. Data Eng. 27(7), 1920–1948 (2015)
94. Zhen J, Jianfeng Z, Lei W et al. Characterizing and Subsetting Big Data Workloads [C]// 2014 IEEE
International Symposium on Workload Characterization. IEEE, 2014.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
