https://doi.org/10.1007/s10766-021-00717-y
Abstract
With the rapid growth of artificial intelligence (AI), the Internet of Things (IoT) and big data, emerging applications that cross stacks with different techniques bring new challenges to parallel computing systems. These cross-stack functionalities require one system to possess multiple characteristics, such as the ability to process data under high throughput and low latency, the ability to carry out iterative and incremental computation, transparent fault tolerance, and the ability to perform heterogeneous tasks that evolve dynamically. However, high-performance computing (HPC) and big data computing, as two categories of parallel computing architecture, are incapable of meeting all these requirements. Therefore, by performing a comparative analysis of HPC and big data computing from the perspective of the parallel programming model layer, middleware layer, and infrastructure layer, we explore the design principles of the two architectures and discuss a converged architecture to address the abovementioned challenges.
1 Introduction
The advent of big data makes it possible for artificial intelligence (AI) to take
advantage of a data set of sufficient size to provide meaningful learning and results.
To date, applications based on the supervised learning (SL) paradigm are maturing.
An SL model that uses sample data with labels can be trained offline and deployed
International Journal of Parallel Programming
In this survey, we review a wide range of applications that are based on cross-stack solutions (i.e., AI+IoT+big data, HPC+big data, AI+big data, etc.) and compare the workload characterization and software/hardware stacks of typical HPC applications, typical big data applications and these cross-stack functionality applications. We
discuss three fundamental topics in current parallel computing systems based on the
anatomy of HPC and big data computing architectures in an attempt to explore the
solutions to the challenges arising from cross-stack functionality applications from
the following aspects:
(1) What makes parallel programming frameworks distinct from each other, and which design principles determine their capabilities and efficiency in parallel computing, especially for iterative computation?
(2) How can we design a parallel system that offers multiple capabilities in one system, provides both stream and batch computation, and can handle heterogeneous tasks that evolve dynamically?
(3) How can we design a converged architecture in which the capabilities present in either the HPC or big data computing stack benefit the other?
To the best of our knowledge, our survey is the first to explore the three
abovementioned topics, which define the scope of this paper and direct our future
research on designing new parallel computing systems.
1.2 Organization
In this paper, we first provide a literature review on big data computing and HPC-
related works in Sect. 2. In Sect. 3, we compare typical HPC and typical big data
applications with cross-stack functionality applications, presenting their layered
software and hardware stack architectures and pointing out the challenges of these
applications. In Sect. 4, we study in detail the strengths and weaknesses of state-of-
the-art big data computing and HPC systems in the corresponding layered
architectures, i.e., the parallel programming model layer, middleware layer and
infrastructure layer. We conduct a comparative analysis and discuss the converged
architecture of HPC and big data computing and then conduct open discussions in
Sect. 5. Finally, we conclude this survey.
2 Literature Review
Some works focus on big data computing and HPC architectures, including the parallel programming model and cluster architecture.
In [19], Doulkeridis C. et al. reviewed the state-of-the-art techniques in
improving the performance of parallel query processing using MapReduce. The
paper first provided an in-depth analysis of the weaknesses and limitations of the
MapReduce framework and then surveyed existing approaches that improve the
performance of query processing, categorized into eight groups: data access,
avoidance of redundant processing, early termination, iterative processing, query
optimization, fair work allocation, interactive real-time processing, and the
processing of n-way operations. The paper performed a comparative analysis of
the MapReduce-related technologies and systems according to these categories.
In [92], Zhang H. et al. presented important techniques for memory management,
including memory hierarchy (e.g., register, cache, main memory), memory
hierarchy utilization (e.g., register-conscious optimization, cache-conscious
HPC involves the use of parallel computing to run modeling and simulation
workloads in science, industry, and commerce that require significant amounts of
computation, which is called the third paradigm of scientific discovery. Classic HPC
modeling and simulation seek to represent the dynamics of a system based on its
mathematical model, which is usually derived from scientific or engineering laws
[16]. The core procedure of this scientific discovery is to solve equations. Due to the
complexity of these systems, these equations usually cannot be solved analytically;
thus, a numerical analysis method is needed. These workloads are often
computationally sensitive and require substantial memory. Moreover, numerical
simulation often generates a very large amount of intermediate data, where high-
throughput IO and low-latency interconnects are necessary. Therefore, it is required
Table 2 Software and hardware stacks of typical big data, typical HPC, and converged applications

Typical big data (use cases: [30, 49, 68]). Function view: 1. Offline analytics: Sort, Grep, Naïve Bayes, K-means, PageRank. 2. Interactive analytics: Projection, Filter, Union, Cross Product, Difference, Join, Aggregate, Select, etc. Parallel programming view (execution engine): 1. Batch system, e.g., Hadoop. 2. Microbatch system, e.g., Spark. 3. Real-time/streaming system, e.g., Flink, Spark Streaming, Storm. System view: 1. Data storage: HDFS, HBase, S3, MongoDB.

Typical HPC (use cases: [32, 64]). Function view: 1. Modeling. 2. Simulation. Parallel programming view (execution engine): 1. Message passing, e.g., MPI. 2. Shared memory, e.g., OpenMP, PGAS. System view: 1. Data storage: parallel file systems, e.g., Lustre, GPFS. 2. Cluster management and scheduling: Slurm. 3. Communication libraries: RDMA (see 4.3.1), collective communication libraries, e.g., MPI. 4. Numerical libraries: PETSc, ScaLAPACK, BLAS. 5. Accelerator APIs: CUDA, OpenCL.

Converged, Big Data+IoT+AI (use cases: [61, 67, 79, 80]). Function view: 1. Real-time analytics: streaming data processing. 2. Predictive analytics: ML, DL, RL, etc.

Converged, HPDA (use cases: [37, 38]). Function view: 1. Streaming analytics: rapidly analyze high-bandwidth, high-throughput streaming data. 2. Interactive analytics: explore and analyze massive streaming data. 3. Graph analytics: graph modeling, visualization.

Both converged categories. Parallel programming view: multiple execution engines from big data stacks: 1. Streaming systems: Flink, Spark Streaming, Naiad, etc. 2. DL frameworks: TensorFlow, Torch, Theano, Caffe, Neon, etc. 3. RL frameworks: Ray, CEIL, etc. System view: 1. Data storage and data ingestion from big data stacks: HDFS, HBase, Kafka, Flume, etc. 2. Cluster management and scheduling from both stacks: Mesos, Slurm. 3. Communication and computation libraries from HPC: communication libraries: RDMA (see 4.3.1); numerical libraries: PETSc, ScaLAPACK, BLAS; accelerator APIs: CUDA, OpenCL.
HPC workloads were first run on supercomputing platforms, which were specialized and highly expensive. Researchers have long sought alternatives for HPC infrastructure, and the emergence of cluster computing follows this trend
The HPC architecture is not designed for the big data computing paradigm. The
workloads of big data are data intensive; thus, a parallel programming framework
(e.g., MapReduce, Spark or Flink), distributed file system (e.g., HDFS) for
[Fig. 1 HPC architecture: middleware (resource and workload managers: Slurm/Torque/Sun Grid Engine; file systems: Lustre/NFS/GPFS), interconnects (InfiniBand/Ethernet; iSCSI/FC/InfiniBand), and hardware (CPU/GPGPU/FPGA)]
sequential reading and writing, and fast direct attached storage (e.g., SSDs) with low
latency are essential. Figure 2 demonstrates the big data computing architecture.
One of the first proposals for large-scale batch data processing was MapReduce [18], which offered a scalable, reliable parallel programming framework running on clusters of commodity machines and became the first generation of big data computing systems. Based on the original idea of the
MapReduce framework, a family of approaches and systems of large-scale data
processing have been implemented and are currently gaining much momentum in
both research and industrial communities [75], such as HDFS™, HBase™, Hive™,
Mahout™, Pig™, and Tez™, which build upon the Hadoop [35] ecosystem. In this
stage, the main workloads on big data clusters are batch-oriented.
MapReduce has drawbacks, such as a low-level programming model and a lack
of support for iterations [71]. To overcome these limitations, researchers designed
the second generation of big data processing systems, represented by Spark [18] and
Flink [7–9]. Both Spark and Flink provide in-memory performance and iterative
computation, although they process data via different dataflow models. In this stage,
researchers and communities focused on improving the parallel programming
framework to support both batch-oriented and stream-oriented data processing. Apart
from these, other excellent frameworks were developed, such as the graph
computing framework Pregel [53], stream processing framework MillWheel [1] and
Naiad [58, 59, 63], providing a differential/timely dataflow model to support both
incremental and iterative computation.
4.1.3 Summary
In Fig. 1 and Fig. 2, we identify three layers in two different architectures: a parallel
programming model layer, a middleware layer, and an infrastructure layer. We
make further comparisons between the two architectures at each layer and
summarize the design principles that can address the abovementioned challenges
(see Sect. 3.4).
In the parallel programming model layer, the HPC programming framework
provides low-level programming APIs, where users can explicitly control the
parallel program that is based either on process parallelism, which exchanges data
through a message-passing model, or on multithreads, which exchange data through
a shared (partitioned) data model. In contrast, the big data programming framework
provides rich and highly abstracted APIs that encompass the interface of distributed
data abstractions, interface of state management for intermediate data, checkpoint
interface, interface of manipulating data, etc., allowing for distributed data-parallel processing across hundreds or even thousands of servers in a cluster.
The significant difference between HPC and big data computing is that HPC
codes tend to generate large amounts of result data that have to be stored without
impeding the computation, while big data codes have to process large amounts of
input data [16], which results in totally different designs of the infrastructure and
resource management in the two architectures: the HPC approach historically
separates the data and computations, while big data computing colocates the
computations and data [45]. The structure of data exchange (between computing
nodes or between computing nodes and storage nodes) in HPC, i.e., communication
libraries and high-speed interconnects, enables computation-sensitive workloads to
be processed efficiently, while the structure of resource management in big data
computing that colocates computations and data enables data-intensive workloads to
be processed with high throughput.
processing), and iterative algorithms (ML, graph analysis), can be expressed and
executed as pipelined fault-tolerant dataflows.
Naiad [63] is a distributed system for executing data parallel, cyclic dataflow
programs. It offers high throughput for batch processors, low latency for stream
processors, and the ability to perform iterative and incremental computations.
The message-passing interface standard (MPI) is a message-passing library
standard based on the consensus of the MPI Forum at the Supercomputing’92
Conference. The MPI defines an extended message-passing model for parallel,
distributed programming in a distributed computing environment [29]. Multiple
implementations of the MPI have been developed and are widespread as parallel
frameworks to run on scalable distributed systems that comprise multiple computing
nodes integrated via high-speed interconnection networks.
OpenMP [66] is a shared memory multiprocessing application programming interface (API) for the easy development of shared memory parallel programs.
OpenMP is designed to exploit certain characteristics of a shared memory
architecture. The ability to directly access memory throughout the system (with
minimum latency and no explicit address mapping), combined with fast shared
memory locks, makes the shared memory architecture best suited for supporting
OpenMP [17].
The partitioned global address space (PGAS) [2] is a data parallel model. It
provides each process with a view of the global memory even though the memory is
distributed across multiple computers. The PGAS model extends the shared memory
model to a distributed memory setting, and data structures in partitioned global
space can be shared by all the computation threads with affinity to the data (affinity
is the association of a thread to a memory) [77]. The PGAS outperforms the shared
memory model in scalability. Several implementations of the PGAS model have
been designed for the HPC community, such as UPC [24] and OpenSHMEM [13].
Hadoop MapReduce’s data model is based on the HDFS, where partitioned files can
be read or written by the MapReduce program in parallel. The MapReduce
framework operates exclusively on key-value pairs; that is, the framework views the
input to a job as a set of key-value pairs and produces a set of key-value pairs as the
output of the job, conceivably of different types. The data types for the serialization
and deserialization of data storage in the HDFS and those used in MapReduce
computations for key or value fields must implement the interface of the framework
to facilitate the functionalities (e.g., read, write, shuffle, sort) of the framework [54].
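The key-value contract described above can be sketched in a few lines of pure Python. This is an illustrative toy engine, not Hadoop's actual API; the function names (map_phase, shuffle, reduce_phase) are ours:

```python
from collections import defaultdict

def map_phase(mapper, records):
    # Apply the user-supplied mapper to every input record,
    # yielding (key, value) pairs.
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(reducer, groups):
    # Run the user-supplied reducer on each (key, [values]) group,
    # visiting keys in sorted order as MapReduce does.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# Word count expressed against the key-value interface.
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
result = reduce_phase(wc_reduce, shuffle(map_phase(wc_map, lines)))
# result["the"] == 2; every other word maps to 1
```

The point of the sketch is the narrow interface: the framework sees only key-value pairs, which is what lets it serialize, shuffle and sort user data without knowing its meaning.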
Spark uses a resilient distributed dataset (RDD) data model. RDDs are read-only
objects that are partitioned across a set of nodes, which are more abstract than
MapReduce’s file model. In comparison with distributed shared memory (DSM), the
advantages of RDDs are listed below: (1) the consistency of RDDs is trivial because
RDDs are immutable; (2) fine-grained and low-overhead fault recovery can be
achieved by using the lineage of RDDs; (3) workloads are scheduled automatically based on data locality and are easy to balance; and (4) straggler mitigation is made
possible by using a backup task.
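Advantage (2) can be illustrated with a toy sketch of the lineage idea, assuming a hypothetical ToyRDD class (not Spark's API): each object is immutable and records the transformation that produced it, so a lost partition is recomputed from its parent instead of being restored from a replica:

```python
class ToyRDD:
    # Deliberately minimal, immutable RDD-like object (hypothetical): a
    # partition is either a source with materialized data or is derived from
    # a parent via a recorded transformation (its lineage).
    def __init__(self, parent=None, transform=None, data=None):
        self.parent = parent
        self.transform = transform
        self._cache = list(data) if data is not None else None

    def map(self, fn):
        # Record the transformation instead of eagerly copying data.
        return ToyRDD(parent=self, transform=lambda part: [fn(x) for x in part])

    def collect(self):
        # Materialize (and cache) this partition, recomputing from the
        # parent via the lineage if it is not already in memory.
        if self._cache is None:
            self._cache = self.transform(self.parent.collect())
        return list(self._cache)

    def evict(self):
        # Simulate losing the computed partition, e.g., on node failure.
        self._cache = None

base = ToyRDD(data=[1, 2, 3])
squared = base.map(lambda x: x * x)
first = squared.collect()      # [1, 4, 9]
squared.evict()                # "lose" the partition
recovered = squared.collect()  # recomputed from lineage, no replica needed
```

Because the objects are immutable, recomputation is always safe, which is exactly why consistency (advantage 1) is trivial.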
Flink has two data types: one represents a finite data set, and the other represents
an unbounded data stream. It provides the special class DataSet and class Data-
Stream as collections of data representing the two data types in a program. These two collections have the following key features: (1) a collection is initially created
by adding a source in the Flink program; (2) new collections are derived from these
initialized collections by transforming them with Flink APIs; and (3) these
collections are immutable once created.
Every message in Naiad bears a logical timestamp consisting of epoch and loop
counters, where there is one loop counter for each of the k loop contexts that contain
the associated edge [63]. These loop counters explicitly distinguish different
iterations and allow a system to track forward progress as messages circulate around
the dataflow graph. The abovementioned features enable Naiad to support more
complex computation models such as nested loops, which is suitable for ML
algorithms such as MDS.
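One simplified way to model these timestamps is as (epoch, loop counters) tuples under a product order. This is only a sketch of the idea: Naiad's actual could-result-in relation also accounts for the structure of the dataflow graph.

```python
def leq(a, b):
    # Simplified product order on Naiad-style logical timestamps
    # (epoch, loop_counters): a may precede b only if every component of a
    # is <= the matching component of b.
    (epoch_a, loops_a), (epoch_b, loops_b) = a, b
    return epoch_a <= epoch_b and all(x <= y for x, y in zip(loops_a, loops_b))

t_iter1 = (0, (1,))   # epoch 0, loop iteration 1
t_iter2 = (0, (2,))   # epoch 0, iteration 2 of the same loop context
t_epoch1 = (1, (0,))  # a later input epoch, iteration 0

# Messages from iteration 1 can influence iteration 2, but not vice versa,
# which is how the system tracks forward progress through iterations.
earlier_first = leq(t_iter1, t_iter2)      # True
later_first = leq(t_iter2, t_iter1)        # False
```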
In the MPI, processes can exchange data by point-to-point communications or
collective communications. In HPC, the MPI can utilize the InfiniBand high-speed
interconnects and RDMA.
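To make the collective vocabulary concrete, the following pure-Python sketch models only the semantics of two MPI collectives; the function names and list-of-ranks representation are illustrative, not mpi4py's API, and a real MPI library would realize these results with logarithmic message rounds over the interconnect:

```python
def allreduce_sum(rank_values):
    # Semantics of MPI_Allreduce with op=SUM: every rank ends up holding
    # the global reduction result.
    total = sum(rank_values)
    return [total] * len(rank_values)

def reduce_scatter_sum(rank_vectors):
    # Semantics of MPI_Reduce_scatter: element-wise sum of the per-rank
    # vectors, then one equal block of the result scattered to each rank.
    p = len(rank_vectors)
    summed = [sum(col) for col in zip(*rank_vectors)]
    block = len(summed) // p
    return [summed[r * block:(r + 1) * block] for r in range(p)]

after_allreduce = allreduce_sum([1, 2, 3, 4])        # [10, 10, 10, 10]
after_rs = reduce_scatter_sum([[1, 1, 1, 1],
                               [2, 2, 2, 2]])        # [[3, 3], [3, 3]]
```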
All variables in OpenMP possess one of the data-sharing attributes (i.e., Shared,
Private, Firstprivate, Lastprivate, Default). The variable with the Shared attribute is
used when all threads use the same copy of the variable; the Private attribute means
that all threads use local storage for this variable; the Firstprivate attribute is similar
to the Private attribute, but the variable is initialized on a fork; and the Lastprivate
attribute is also similar to the Private attribute, but the variable is updated upon
joining. Default is most often used to define the data-sharing attribute of most
variables in a parallel region.
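The shared-versus-private distinction can be mimicked in Python's threading module. This is an analogy, not OpenMP itself: one global variable plays the Shared role and threading.local plays the Private role.

```python
import threading

lock = threading.Lock()
shared_total = 0            # "shared": one copy visible to all threads
local = threading.local()   # "private": each thread gets its own storage

def worker(values):
    global shared_total
    local.partial = 0              # private accumulator, no lock needed
    for v in values:
        local.partial += v
    with lock:                     # shared variable: updates must synchronize
        shared_total += local.partial

threads = [threading.Thread(target=worker, args=([1, 2],)),
           threading.Thread(target=worker, args=([3, 4],))]
for t in threads:
    t.start()
for t in threads:
    t.join()
# shared_total == 10: both private partial sums merged into the shared copy
```

The private accumulator avoids contention on the shared variable, which is the same reason OpenMP reductions use thread-private partial results.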
Unified Parallel C (UPC) extends ISO C to a PGAS programming language that
allows programmers to exploit data locality and parallelism in their applications
[23]. UPC provides two types of variables: private variables and shared variables.
Normal C variables and objects are allocated in the private memory space for each
thread. Shared array elements can be distributed to threads in a round-robin fashion
with arbitrary block sizes. The block size and THREADS determine the affinity, i.e., the thread in whose local shared memory space a shared data item resides.
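The block-cyclic affinity rule can be written down directly. The constants below are assumed values for illustration; they stand in for UPC's THREADS constant and a declaration's blocking factor:

```python
THREADS = 4   # stand-in for UPC's THREADS constant (assumed value)
BLOCK = 3     # blocking factor, as in a declaration like: shared [3] int a[N]

def affinity(i, block=BLOCK, threads=THREADS):
    # Thread with affinity to element i of a block-cyclically distributed
    # shared array: blocks of `block` elements are dealt round-robin.
    return (i // block) % threads

owners = [affinity(i) for i in range(15)]
# elements 0-2 live with thread 0, 3-5 with thread 1, ..., 12-14 wrap to 0
```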
value pairs produced by Mapper as the input and then runs a reducer function on
each of them to generate the output. The output of the reducer is stored in the HDFS
as the final output.
The Spark execution model can be defined in three phases: (1) creating a logical
plan; (2) translating the logical plan into a physical plan; and (3) executing tasks on
a cluster. In the first phase, a logical plan is created to show the steps to be executed
when an action is applied. Spark uses transformations (operators) on an RDD to
describe the data processing. Each transformation generates a new RDD such that
all transformations form a lineage or directed acyclic graph (DAG). In the second
phase, actions trigger the translation of the logical DAG into a physical execution
plan. The Spark Catalyst query optimizer creates a physical execution plan for the
DAG. In the third phase, tasks are scheduled and executed on the cluster. The
scheduler splits the graph into stages at shuffle boundaries: the narrow transformations (transformations without data movement) are grouped (pipelined) together into a single stage. Spark provides a unified engine that natively supports both batch and
streaming workloads. Instead of processing one record at a time, Spark Streaming
discretizes the streaming data into tiny, subsecond microbatches. In other words,
Spark Streaming's Receivers accept data in parallel and buffer them in the memory of Spark's worker nodes. Then, the latency-optimized Spark engine runs short tasks
(tens of milliseconds) to process the batches and outputs the results to other systems.
Spark tasks are assigned dynamically to workers according to the locality of data
and available resources, unlike the traditional continuous operator model, where
computations are statically allocated to a node. This contributes to better load
balancing and faster fault recovery.
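The stage-splitting rule described above can be sketched as a small function. This is a toy model, not Spark's actual scheduler: it walks a lineage in order, pipelining consecutive narrow transformations into one stage and starting a new stage at every wide (shuffle) transformation.

```python
def split_into_stages(lineage):
    # lineage: ordered list of (transformation_name, is_wide) pairs.
    # Consecutive narrow transformations are pipelined into one stage;
    # each wide transformation begins a new stage.
    stages, current = [], []
    for name, is_wide in lineage:
        if is_wide and current:
            stages.append(current)
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

lineage = [("map", False), ("filter", False),
           ("reduceByKey", True),      # shuffle boundary: new stage
           ("map", False)]
stages = split_into_stages(lineage)
# stages == [["map", "filter"], ["reduceByKey", "map"]]
```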
Flink uses streams for all workloads: streaming, SQL, microbatch and batch [47].
The program is parsed by a Flink compiler and optimized by a Flink optimizer. Each
submitted job is converted to a dataflow graph and passed onto Job Manager, which
creates an execution plan, and then the job graph is passed onto Task Manager,
where the tasks are finally executed (execution graph). Flink is a stream-computing
engine that takes a data set as a special case of a data stream with boundaries; thus,
Flink can process both batch and stream workloads in a single system. Regarding streaming capability, Flink surpasses Spark (which handles streams in the form of microbatches) because it supports streaming natively.
Naiad uses a timely dataflow execution model based on a directed graph in which
stateful vertices send and receive logically time-stamped messages along directed
edges. Naiad employs timestamps to enhance dataflow computation, which is
essential in supporting an efficient and lightweight coordination mechanism. There
are three main features of Naiad: (1) structured loops for feedback; (2) stateful
dataflow vertices for record processing (without using global coordination); and (3)
the notification of vertices when all tuples have been received by the system for a
given round of input or loop iteration. While the first two features support low-
latency iterative and incremental computation, the third feature ensures that the
result is consistent [71]. The dataflow graph may contain nested cycles, and the
timestamps reflect this structure to distinguish data arising in different input epochs
and loop iterations.
The state can be defined as “the intermediate value of a specific computation that
will be used in subsequent operations during the processing of a data flow.” [71]
State management is a mechanism provided to users by a parallel programming
model for storing the intermediate state data in a computation. State data generally come in two modes: static and dynamic. In the static mode, the state is saved at program initialization as data are loaded into memory; such data are generally large and used repeatedly in iterative computation. In the dynamic mode, the state is updated repeatedly during iterative computation.
In Spark, the RDD bears the state information of data. Spark provides users with
the function to cache or persist intermediate RDDs by specifying this operation
explicitly. This method is very important for iterative computation, where the same
data sets are accessed repeatedly. Spark also offers two choices of specialized
abstractions on state management, namely, updateStateByKey and mapWithState,
the latter being an improved version of the former. The operator updateStateByKey
generates a new RDD by using a cogroup transformation with the former RDD;
thus, it is a full updating mechanism. The operator mapWithState provides an
incremental updating mechanism, which surpasses updateStateByKey in
performance.
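The full-versus-incremental contrast can be modeled with plain dictionaries. This is a conceptual sketch of the two update strategies, not Spark's API; the state is a key-to-count mapping and each batch carries new counts:

```python
def update_state_by_key(old_state, batch):
    # Full update: cogroup the entire previous state with the new batch and
    # rebuild every key, including keys absent from this batch.
    keys = set(old_state) | set(batch)
    return {k: old_state.get(k, 0) + batch.get(k, 0) for k in keys}

def map_with_state(state, batch):
    # Incremental update: touch only the keys present in this batch.
    for k, v in batch.items():
        state[k] = state.get(k, 0) + v
    return state

full = update_state_by_key({"a": 5, "b": 2}, {"a": 1})
incr = map_with_state({"a": 5, "b": 2}, {"a": 1})
# Both yield {"a": 6, "b": 2}, but the full update re-emits "b" even though
# no new data arrived for it, which is the cost mapWithState avoids.
```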
There are two fundamental states in Flink: the keyed state and the operator state.
The keyed state is bound to keys, available only to functions and operators that
process data from a KeyedStream. The operator state (or non-keyed state) is bound
to one parallel operator instance. The difference between them is that the operator
state is scoped per parallel instance of an operator (subtask), while the keyed state is partitioned or sharded with exactly one state partition per key [74].
Checkpointing is the main mechanism needed for fault tolerance in Spark. Spark
currently provides an API for checkpointing (a REPLICATE flag to persist) but
leaves the decision on which data to checkpoint to the users. Tasks are made fault
tolerant with the help of RDD lineage graphs and checkpointing; thus, they can
quickly recompute from checkpointing and recover from failures. The read-only
nature of RDDs makes them simpler to checkpoint than general shared memory.
Since consistency is not a concern, RDDs can be written out in the background
without requiring program pauses or distributed snapshot schemes [90].
Based on the distributed snapshot algorithms proposed by Chandy and Lamport
[12], Flink adopts an approach to take consistent snapshots of the current state of a
distributed system without missing information and without recording duplicates.
Flink injects stream barriers (similar to “markers” in the Chandy–Lamport
algorithm) into sources, which flow through operators and channels with data
records as part of the data stream. A barrier separates records into two groups: those
that are part of the current snapshot (a barrier signals the start of a checkpoint) and
those that are part of the next snapshot. An operator first aligns its barriers from all
incoming stream partitions to buffer data from faster partitions. Upon receiving a
barrier from every incoming stream, an operator checkpoints the barrier state to
persistent storage. Then, the operator forwards the barrier downstream. Once all
data sinks receive the barriers, the current checkpoint ends. Recovery from a failure
allows restoration of the latest checkpointed state and the restarting of sources from
the last recorded barrier [85]. The mechanism of fault tolerance in Naiad is similar
to that in Flink.
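The barrier-alignment step can be sketched for a single operator with several input channels. This is a toy model of the behavior described above, not Flink's implementation: records seen before a channel's barrier belong to the current snapshot, while records after it (the faster partition) are buffered until every channel's barrier has arrived.

```python
def align_barriers(channels):
    # channels: per-input-channel lists of records, each containing one
    # "BARRIER" marker. Returns (records in the current snapshot,
    # records buffered for the next snapshot).
    in_snapshot, buffered = [], []
    for chan in channels:
        seen_barrier = False
        for item in chan:
            if item == "BARRIER":
                seen_barrier = True
            elif seen_barrier:
                buffered.append(item)    # belongs to the next snapshot
            else:
                in_snapshot.append(item)
    # At this point the operator would checkpoint its state, forward the
    # barrier downstream, and replay the buffered records.
    return in_snapshot, buffered

fast = ["r1", "r2", "BARRIER", "r3"]  # barrier arrives early; r3 is buffered
slow = ["s1", "BARRIER"]
current_snapshot, next_snapshot = align_barriers([fast, slow])
# current_snapshot == ["r1", "r2", "s1"]; next_snapshot == ["r3"]
```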
Rollback-recovery techniques are commonly used to provide fault tolerance to
parallel applications running on HPC systems to restart from a previously saved
state. Checkpoint-based rollback recovery is the main mechanism by which an
application can be rolled back to the most recent consistent state by using
checkpoint data [22]. There are two major approaches to implementing checkpoint/restart systems: application-level (or operation-level) implementation and system-level implementation.
frameworks, having their own communication operations and execution flow, are
suitable for certain types of applications, as shown in Table 5 below.
4.2.6 Summary
We summarize the essential patterns of the parallel programming model of big data
frameworks and HPC frameworks in Table 6.
We reach the following conclusions: (1) OpenMP and PGAS in the HPC architecture are based on thread parallelism and shared memory, which means that they must run on computers with a shared memory architecture, e.g., SMP (see 4.4.2) and NUMA (see 4.4.2), or another distributed shared memory architecture; these software and hardware constraints limit the data sizes and workloads such systems can handle; (2) the MPI, the dominant model in HPC, is a more general parallel processing framework that can run on clusters, MPP (see 4.4.2) and other supercomputer architectures; (3) big data frameworks comprise batch-oriented frameworks (e.g., Hadoop MapReduce and Spark), stream-oriented frameworks (e.g., Flink and Naiad), and BSP frameworks (e.g., Pregel and Giraph), all of which can run on commodity clusters.
Both the MPI and big data frameworks support the scale-out computer
architecture. The advantage of the MPI over big data programming models lies in
the rich collective communication libraries that enable the MPI to handle complex
algorithms with iterative computations efficiently. However, all-to-all collectives
are not supported by big data programming models. In contrast, big data
Table 5 Parallel programming models, communication operations, execution flow, applications and
computer architecture
Table 6 Comparison of big data frameworks and HPC frameworks

Big data frameworks:
MapReduce. Data model: key-value pairs. Execution model: batch. State management: N/A. Fault tolerance: batch re-execution. Communication model: scatter, reduce.
Spark. Data model: RDD (coarse-grained immutable data set). Execution model: 1. DAG; 2. batch/microbatch. State management: 1. updateStateByKey (full update by RDD cogroup); 2. mapWithState (update by key). Fault tolerance: recovery from RDD lineage. Communication model: scatter, gather, reduce, broadcast.
Flink. Data model: data set / data stream. Execution model: 1. dataflow execution graph; 2. long-running task. State management: 1. keyed state; 2. operator state. Fault tolerance: distributed snapshot checkpoint. Communication model: scatter, gather, reduce, broadcast.
Naiad. Data model: 1. timestamp with epoch and loop counters; 2. pointstamp. Execution model: 1. timely dataflow; 2. long-running task; 3. nested loops. State management: 1. stateful vertices; 2. partially ordered sequence. Fault tolerance: distributed snapshot checkpoint. Communication model: scatter, gather, reduce, broadcast.

HPC frameworks:
MPI. Data model: arbitrary. Execution model: process parallel. State management: variables in process. Fault tolerance: application-level or system-level fault tolerance. Communication model: scatter, gather, reduce-scatter, all-reduce, all-gather, broadcast.
PGAS. Data model: shared data (scalar, array), private data. Execution model: multithreads on partitioned shared data. State management: private or shared data. Fault tolerance: application-level or system-level fault tolerance. Communication model: scatter, gather, all-reduce, all-gather, broadcast.
OpenMP. Data model: shared, private, firstprivate, lastprivate, default. Execution model: multithreads on shared data. State management: private or shared data. Fault tolerance: application-level or system-level fault tolerance. Communication model: N/A.
The middleware layer bridges the parallel programming model layer and
infrastructure layer. We perform a comparative analysis on different design ideas
between HPC and big data computing to understand how big data programming
frameworks exploit these structures to improve data processing performance on
clusters.
Remote direct memory access (RDMA) refers to the capability to access the
memory of one computer directly from another computer without involving the
processor or operating system on either computer. RDMA improves throughput and
performance because it frees up resources via zero-copy techniques that can be
roughly separated into three classes [5]: (1) the avoidance and optimization of in-
kernel data copying: to process data completely within a kernel; (2) bypassing the
kernel on the main data processing path: to allow direct data transfers between user-
space memory and hardware, with the kernel only managing and aiding these
transfers; and (3) the optimization of data transfer between a user application and
the kernel: to optimize CPU copies between the kernel and user space, which
maintains the traditional method of arranging communication. Additionally, RDMA realizes remote memory access without any intervention of the remote processor(s). It also provides the collective communication
feature, which includes scatter collective communication (to support the reading of
data streams from one buffer and writing them into multiple memory buffers) and
gather collective communication (to support the reading of data from multiple
memory buffers and writing them as a stream into one buffer).
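The scatter and gather semantics just described can be modeled on byte buffers. This is only a model of the data movement, not an RDMA API; the function names are illustrative:

```python
def rdma_scatter(stream, sizes):
    # Model of scatter semantics: read one contiguous byte stream and split
    # it into multiple destination buffers of the given sizes.
    out, pos = [], 0
    for size in sizes:
        out.append(stream[pos:pos + size])
        pos += size
    return out

def rdma_gather(buffers):
    # Model of gather semantics: read from multiple buffers and write them
    # out as one contiguous stream.
    return b"".join(buffers)

stream = b"abcdefgh"
parts = rdma_scatter(stream, [3, 2, 3])   # [b"abc", b"de", b"fgh"]
roundtrip = rdma_gather(parts)            # gather inverts scatter
```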
To implement these features of RDMA, a new network fabric that supports the
native RDMA protocol is necessary. Thus far, there have been three types of
network fabric and protocol: InfiniBand, RDMA over converged Ethernet (RoCE)
and the Internet wide area RDMA protocol (iWARP). InfiniBand is a lossless
network fabric that supports RDMA natively from the beginning. It requires special
NICs and switches and is widely used in supercomputers and MPPs. RoCE is a
converged network protocol that allows performing RDMA over an Ethernet
network. It encapsulates the InfiniBand protocol inside the Ethernet protocol. This
allows RDMA to be used over standard Ethernet infrastructure (switches). RoCE requires RoCE-capable network adapter cards, and RoCE drivers are available in Linux, Microsoft Windows, and other common operating systems. iWARP is a
network protocol that allows running RDMA over TCP transport to patch up
Local file systems, such as EXT and FAT, cannot meet the cluster system
requirements for file sharing. Distributed file systems are designed for clusters with
the following characteristics: (1) network transparency: remote and local file access
can be achieved through the same system call; (2) location transparency: the full
path of a file does not need to be bound to its file service; that is, the name or address
of the server is not part of the file path; and (3) location independence: because the
name or address of the server is not part of the file path, changing the file location
does not cause the file path to change. Lustre and GPFS are distributed file
systems commonly used in HPC, while HDFS is a dedicated distributed file
system for big data. We compare them in Table 7.
A cluster resource management system (RMS) is a cluster manager that oversees a
cluster's resources, such as processors, memory and storage. It maintains status
information on these resources to know which are available and can thus assign
jobs to available machines. YARN and Mesos are two popular RMSs used for big
data, while Slurm is used in HPC.
YARN [36], part of the core Hadoop project, is the prerequisite for Enterprise
Hadoop, providing resource management and a central platform to deliver
consistent operations, security, and data governance tools across Hadoop clusters.
It also extends Hadoop to existing and new technologies found within the data
center so that they can take advantage of cost-effective, linear-scale storage and
processing.
Mesos [31] is a platform for sharing commodity clusters between multiple
diverse cluster computing frameworks, such as Hadoop and MPI. It supports
different types of workloads, including container orchestration (Mesos containers,
Docker, etc.), analytics (Spark), big data technologies (Kafka, Cassandra) and
much more.
● Job types can be primarily categorized into parallel jobs and job arrays, both of
which capture the types of parallelism that the scheduler can handle [73].
Parallel jobs are used to speed up computation, during which the processes are
launched simultaneously and communicate; job arrays allow users to submit a
series of jobs using a single submission command/script. HPC schedulers always
support both parallel jobs and job arrays, while big data schedulers tend to
support job arrays only.
● The scheduling strategy includes various scheduling algorithms. Queue mech-
anisms [33] order the jobs within a queue according to a scheduling policy, e.g.,
FCFS (first-come, first-served), SJF (shortest job first) and LJF (longest job
first); backfilling is the capability to schedule pending jobs upon the early
completion of an executed job. Two backfilling mechanisms are commonly
used, namely, conservative backfilling [62] and EASY backfilling [51].
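The EASY backfilling idea, namely that the head-of-queue job holds a reservation while smaller jobs may jump ahead if they cannot delay it, can be sketched as follows. This is a simplified single-step model for illustration, not the algorithm from [51]; the dictionary fields and parameter names are assumptions:

```python
def easy_backfill(free_nodes: int, queue: list[dict], now: int,
                  next_release: int) -> list[dict]:
    """Pick jobs to start now from an FCFS queue with EASY backfilling.

    queue items: {"nodes": int, "runtime": int}, in arrival order.
    next_release: the time enough nodes free up for the head job if it
    cannot start immediately (i.e., its reservation start time).
    """
    started = []
    if not queue:
        return started
    head, rest = queue[0], queue[1:]
    if head["nodes"] <= free_nodes:      # head job starts right away
        started.append(head)
        free_nodes -= head["nodes"]
        reservation = None
    else:
        reservation = next_release       # head holds a reservation
    for job in rest:                     # try to backfill later jobs
        fits_now = job["nodes"] <= free_nodes
        # a backfilled job must not push back the head's reservation
        ends_in_time = reservation is None or now + job["runtime"] <= reservation
        if fits_now and ends_in_time:
            started.append(job)
            free_nodes -= job["nodes"]
    return started

queue = [{"nodes": 4, "runtime": 10},   # head: must wait for 4 nodes
         {"nodes": 1, "runtime": 2},    # short: safe to backfill
         {"nodes": 1, "runtime": 8}]    # long: would delay the head
easy_backfill(free_nodes=2, queue=queue, now=0, next_release=5)
# only the short job is started ahead of the waiting head job
```

Conservative backfilling differs in that every queued job, not just the head, receives a reservation that backfilled jobs must respect.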
According to their memory access models, parallel computers can be categorized as
either shared memory or nonshared memory architectures, as shown in Table 10.
Uniform memory access (UMA) is a shared memory architecture used in parallel
computers, where processors share the physical memory uniformly. In the UMA
architecture, the access time to a memory location is independent of which
processor makes the request or which memory chip contains the transferred data.
Nonuniform memory access (NUMA) is a computer memory design used in
multiprocessing, where the memory access time depends on the memory location
relative to the processor. Under NUMA, a processor can access its local memory
faster than nonlocal memory (memory local to another processor or memory shared
between processors). The benefits of NUMA are limited to specific workloads,
notably on servers where data are closely associated with certain tasks or users. The
NUMA architecture can be subdivided into cache-coherent NUMA (CC-NUMA)
and non-cache-coherent NUMA (NCC-NUMA).
After comparing big data computing and HPC, we come to some conclusions. (1)
Iterative computation is supported by batch-oriented big data frameworks (Spark,
Harp, Twister), stream-oriented big data frameworks (Flink, Naiad) and message-
passing frameworks (MPI) in scalable architectures (the shared memory
architecture is excluded). However, iterative computation behaves differently in these
frameworks because of their unique designs in the parallel programming model. (2)
How to process heterogeneous tasks in a single system is a fast-growing research
field. Cross-stack functionality applications require parallel computing systems to
process the heterogeneity of tasks with widely different execution times and
resource requirements. The state-of-the-art big data computing systems try to
integrate batch and stream processing in one data processing engine to support both
streaming computation and analytical computation. (3) Big data computing systems
tend to choose general commodity machines and networks to reduce investment and
improve adaptability, while HPC systems tend to choose specialized parallel
computers to achieve high performance. Cross-stack functionality applications
produce workloads that differ from those of typical big data or HPC. These
workloads require the combined capabilities of the aggregate stack; however,
neither the HPC architecture nor the big data computing architecture alone
provides all of them, hence the need for a converged HPC and big data
computing architecture. These conclusions
represent the latest research progress and prospects of big data computing.
Big data computing systems are generally classified into batch-oriented and stream-
oriented systems according to their data processing engine. The unified runtime
engine depends not only on the approaches of bounded and unbounded data
processing but also on the structure of the computation graph and the integration of
different data flow and control flow paradigms. Spark, as a batch processing system,
takes the stream data as a sequence of RDDs. An RDD is created at each time
interval and is processed by the Spark engine as a microbatch. Flink, in contrast, is a
stream processing system. A bounded data set is processed by Flink as a special case
of an unbounded data stream. Spark and Flink each integrate the two data processing
procedures into one system by abstracting one procedure in terms of the other,
although this abstraction is not very efficient or cost-effective. Spark and Flink can process
typical big data workloads; however, cross-stack functionality workloads are
beyond their capabilities.
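Spark's microbatch treatment of a stream, chopping an unbounded sequence of timestamped records into fixed-interval batches that are then run through the batch engine, can be sketched conceptually. This is a toy model of the idea, not Spark's API:

```python
from collections import defaultdict

def microbatch(records, interval):
    """Group (timestamp, value) records into fixed-width time buckets,
    the way a microbatch engine turns an unbounded stream into a
    sequence of small batch jobs (one RDD per interval, in Spark's terms)."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[ts // interval].append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0, "a"), (1, "b"), (5, "c"), (11, "d")]
microbatch(stream, interval=5)   # [['a', 'b'], ['c'], ['d']]
```

Flink's inverse abstraction needs no such chopping: a bounded data set is simply a stream that happens to end, so the same record-at-a-time engine processes both.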
To date, there is no unified parallel programming framework that can process
all the workloads of cross-stack functionality applications. For most of the offline
and interactive analytics scenarios, you can use Spark, Flink or Naiad; for scenarios
involving automatic speech recognition and other pattern recognition, your best
choice is a DL framework, such as TensorFlow [56], MXNet [83], etc.; and for
scenarios involving the RL-based IoT, you can try Ray [70], which is a dynamic
task graph computation model designed for RL. Therefore, designing a unified
parallel programming framework is the most active and promising research area in
big data computing.
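The dynamic task-graph model that Ray implements, tasks submitted while earlier results arrive, with futures as graph edges, can be imitated with the standard library. The sketch below uses `concurrent.futures` as a stand-in and is not Ray's actual API; the `rollout`/`train` functions are illustrative placeholders for RL work:

```python
from concurrent.futures import ThreadPoolExecutor

def rollout(policy_version: int) -> int:
    return policy_version * 2          # pretend simulation result

def train(results: list[int]) -> int:
    return sum(results)                # pretend policy update

with ThreadPoolExecutor(max_workers=4) as pool:
    # The task graph is built dynamically: each completed result can
    # trigger the submission of new tasks, rather than the whole graph
    # being declared up front as in a static dataflow system.
    futures = [pool.submit(rollout, v) for v in range(4)]
    results = [f.result() for f in futures]
    new_policy = pool.submit(train, results).result()
```

In Ray the same pattern runs across a cluster, with tasks and actors scheduled at millisecond granularity, which is what makes it suitable for RL control loops.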
Here, we present two research directions to close this section: (1) the design
of general-purpose underlying primitives and the unified programming model
should consider supporting both batch and stream processing, supporting hetero-
geneous computation, and coworking with other DL libraries and numeric libraries;
and (2) the design of a unified programming model should consider leveraging the
cluster architecture for better performance and avoid conflicts in cluster resource
scheduling.
In [86], Chen pointed out that although big data computing and HPC are
significantly different in many aspects, they still have similarities, and a trend of
mutual learning and convergence is emerging. In the convergence of HPC and big
data computing, on the one hand, big data computing should maintain the benefits of
processing extreme-scale data in parallel with scalability, reliability, transparent
fault tolerance, and ease of use; on the other hand, big data computing should learn
from HPC to process computation-sensitive and interconnect-sensitive workloads as
new requirements from these cross-stack functionality applications.
Harp [92] proposes a collective communication abstraction layer on several
common data abstractions, which provides flexible combinations of computational
models that adapt to a variety of applications. It provides six collective
communication operations (i.e., “broadcast”, “allgather”, “allreduce”, “regroup”,
“send messages to vertices”, and “send edges to vertices”) and hierarchical data
abstractions to support various collective communication patterns. Data abstractions
can be classified horizontally as arrays, key values, or vertices, edges and messages
in graphs or can be constructed vertically by basic types, partitions and tables. In
addition to the collective communication and hierarchical data abstraction layer,
Harp builds a MapCollective model that follows the BSP style and provides a
benchmark is designed to capture the vertical data movement performance; (2) the
GroupBy benchmark is designed to capture the shuffle performance, which stresses
both horizontal and vertical data movement; and (3) the PageRank benchmark is
designed to capture the iterative computation performance. In the performance test,
the authors ported Spark to run on the Cray XC family by calibrating the single-node
performance when using the Lustre global file system against that of a workstation
with local SSDs. The result shows that file system metadata access latency
dominates in an HPC installation using Lustre, which results in single-node
performance up to 4X slower than that of a typical workstation. In view of the above
test, the authors proposed some software-based and hardware-based approaches to
alleviate the single-node performance gap when porting Spark to HPC and to
improve the scalability of Spark on HPC. For the software-based approach, to solve
the problem of metadata access latency and adapt to large-scale workloads, the
authors added a layer for pooling and caching open file descriptors and evaluated the
statically sized file pool using two eviction policies to resolve capacity conflicts, i.e.,
LIFO and FIFO, in which LIFO provides the best performance for the shuffle stage.
For the hardware-based approach, the authors evaluated a system with a layer of
nonvolatile storage (BurstBuffer) that sits between the processor memory and back-
end Lustre file system, improving scalability at the expense of per-node
performance. The experience of porting Spark to HPC proves that increasing the
local storage of nodes with large NVRAM by better caching near processors reduces
the intermediate data access overhead, namely, the vertical data movement
overhead. This experience can also be extended to other big data computing
systems such as Hadoop.
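The statically sized descriptor pool with LIFO eviction can be sketched as follows. This is a simplified model of the caching layer's behavior, not the authors' implementation; the class and method names are assumptions:

```python
class FDPool:
    """Cache open file handles up to a fixed capacity; on a capacity
    conflict, evict the most recently added entry (LIFO), the policy
    the Spark-on-Lustre study found best for the shuffle stage."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.order: list[str] = []         # insertion order; end = newest
        self.handles: dict[str, object] = {}

    def open(self, path: str, opener=open):
        if path in self.handles:               # hit: reuse the handle and
            return self.handles[path]          # skip the metadata round trip
        if len(self.handles) >= self.capacity:  # capacity conflict
            victim = self.order.pop()           # LIFO: evict the newest
            self.handles.pop(victim).close()
        handle = opener(path)                   # the one real metadata access
        self.handles[path] = handle
        self.order.append(path)
        return handle
```

Each cache hit avoids a round trip to the Lustre metadata server, which is precisely the latency the study identified as dominating single-node performance.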
Here, we present two research directions to close this section: (1) the design
of the hierarchical memory of clusters, which is optimized for vertical data
movement; and (2) the design of the topology of high-speed interconnects, which is
optimized for horizontal data movement, involving point-to-point communication
and collective communication.
6 Conclusions
References
1. Akidau, T., Balikov, A., Bekiroğlu, Kaya, et al.: MillWheel: fault-tolerant stream processing at
internet scale [J]. Proc. VLDB Endow. 6(11), 1033–1044 (2013)
2. Almasi, G.: PGAS (Partitioned Global Address Space) Languages [J]. Encycl. Parallel Comput. 1,
1539–1545 (2011)
3. Apache Giraph. https://giraph.apache.org/
4. Asaadi, H. R., Khaldi, D., Chapman, B. A. (2016) Comparative Survey of the HPC and Big Data
Paradigms: Analysis and Experiments [C]// IEEE International Conference on Cluster Computing
(CLUSTER). IEEE,:423–432.
5. Bröse E. ZeroCopy: Techniques, Benefits and Pitfalls [EB/OL]. https://static.aminer.org/pdf/PDF/
000/253/158/design_and_implementation_of_zero_copy_data_path_for_efficient.pdf
6. Browning, S. A. The Tree Machine: A Highly Concurrent Computing Environment [EB/OL]. 1980.
http://resolver.caltech.edu/CaltechCSTR:3760-tr-80.
7. Carbone, P., Fóra, G., Ewen, S et al. (2015) Lightweight Asynchronous Snapshots for Distributed
Dataflows [J]. Computer Science.
8. Carbone, P., Katsifodimos, A., Ewen, S., et al.: Apache Flink™: stream and batch processing in a
single engine [J]. IEEE Data Eng. Bulletin 38(4), 28–38 (2015)
9. Chaimov, N., Malony, A., Canon, S et al. Scaling Spark on HPC Systems [C]// Proceedings of the
25th ACM International Symposium on High-Performance Parallel and Distributed Computing –
HPDC. ACM, 2016:97–110.
10. Chambers, C., Raniwala, A., Perry, F et al. FlumeJava: Easy, Efficient Data-parallel Pipelines [C]//
ACM Sigplan Conference on Programming Language Design & Implementation. ACM, 2010.
11. Chan, E., Heimlich, M., Purkayastha, A., et al.: Collective communication: theory, practice, and
experience [J]. Concurrency Computat. Pract. Exper. 19(13), 1749–1783 (2007)
12. Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems
[J]. ACM Transact. Comput. Syst. (TOCS) 3(1), 63–75 (1985)
13. Chapman, B., Curtis, T., Pophale, S et al. Introducing OpenSHMEM: SHMEM for the PGAS
community [C]// Conference on Partitioned Global Address Space Programming Model. 2010.
14. Clos, C.: A study of non-blocking switching networks [J]. Bell Syst. Tech. J. 32(2), 406–424 (1953)
15. Crankshaw, D., Bailis, P., Gonzalez, J.E., et al.: The missing piece in complex analytics: low latency,
scalable model management and serving with velox [J]. European J. Obstet. Gynecol. Reprod. Biol
185, 181–182 (2014)
16. Cristina P. The technology Stacks of High Performance Computing & Big Data Computing: What
They Can Learn from Each Other. https://www.etp4hpc.eu/pujades/files/bigdata_and_hpc_FINAL_
20Nov18.pdf
17. Dagum, L., Menon, R.: OpenMP: an industry standard api for shared-memory programming [J].
IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)
18. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters [J]. Commun. ACM
51(1), 107–113 (2008)
19. Doulkeridis, C., Nørvåg, Kjetil: A survey of large-scale analytical query processing in mapreduce [J].
VLDB J. Int. J. Very Large Data Bases 23(3), 355–380 (2014)
20. Duell, J., Hargrove, P., Roman, E. The Design and Implementation of Berkeley Lab’s Linux
Checkpoint/Restart [R]. Berkeley Lab Technical Report LBNL-54941, 2002.
21. Egwutuoha, I P., Chen, S., Levy, D, et al. A Fault Tolerance Framework for High Performance
Computing in Cloud [C]//12th IEEE/ACM International Symposium on Cluster, Cloud and Grid
Computing. IEEE, 2012: 709–710.
22. Egwutuoha, I.P., Levy, D., Selic, B., et al.: A survey of fault tolerance mechanisms and checkpoint/
restart implementations for high performance computing systems [J]. J. Supercomput. 65(3), 1302–
1326 (2013)
23. Ekanayake, J., Hui, L., Zhang, B et al. Twister: A Runtime for Iterative MapReduce [C]// Pro-
ceedings of the 19th ACM International Symposium on High Performance Distributed Computing.
DBLP, 2010.
24. El-Ghazawi, T., Smith, L UPC: Unified parallel [C]//ACM/IEEE Conference on High Performance
Networking & Computing. DBLP, 2006.
25. Fagg G E, Dongarra J. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic
world [C]// Proceedings of EuroPVM-MPI 2000. Springer, 2000.
55. Calzarossa, M., Serazzi, G.: Workload characterization: a survey [C]// Proceedings of the IEEE.
IEEE, 1993, 81(8): 1136–1150. https://doi.org/10.1109/5.236191
56. Martı́n Abadi, Paul Barham, Jianmin Chen, et al. TensorFlow: A System for Large-scale Machine
Learning [C]//12th USENIX Symposium on Operating Systems Design and Implementation. USE-
NIX, 2016, 265–283.
57. Marz N. Trident. https://github.com/nathanmarz/storm/wiki/Trident-tutorial. 2012.
58. McSherry F, Isaacs R, Isard M, et al. Composable Incremental and Iterative Data-Parallel Compu-
tation with Naiad [R]. Microsoft Research, 2012. https://www.microsoft.com/en-us/research/wp-
content/uploads/2012/10/naiad.pdf
59. McSherry F, Murray D G, Isaacs R, and Isard M. Differential Dataflow [C]// Proceedings of 6th
Biennial Conference on Innovative Data Systems Research. 2013. http://cidrdb.org/cidr2013/Papers/
CIDR13_Paper111.pdf
60. Mehdi, M., Ala, A.F., Sameh, S., et al.: Deep learning for iot big data and streaming analytics: a
survey [J]. IEEE Commun. Surv. Tutorials 1(1), 99 (2017)
61. Mina J, Verde C. Fault Detection Using Dynamic Principal Component Analysis by Average Esti-
mation [C]// IEEE International Conference on Electrical & Electronics Engineering. IEEE, 2005.
62. Mu’Alem, A.W., Feitelson, D.G.: Utilization, predictability, workloads, and user runtime estimates in
scheduling the ibm sp2 with backfilling [J]. IEEE Transact. Parallel Distributed Syst. 6(12), 529–543
(2001)
63. Murray D G, McSherry F, Isaacs R, et al. Naiad: A Timely Dataflow System [C]// ACM Symposium
on Operating Systems Principles (SOSP). ACM, 2013: 439–455.
64. Neumaier, A.: Molecular modeling of proteins and mathematical prediction of protein structure [J].
SIAM Rev. 39(3), 407–460 (1997)
65. Nishihara R, Moritz P, Wang S, et al. Real-Time Machine Learning: The Missing Pieces. 2017.
https://arxiv.org/abs/1703.03924
66. OpenMP Application Program Interface. OpenMP Architecture Review Board. 2008. http://www.
openmp.org/mp-documents/spec30.pdf
67. Ordónez, F.J., Roggen, D.: Deep convolutional and lstm recurrent neural networks for multimodal
wearable activity recognition [J]. Sensors 16(1), 115 (2016)
68. Pan R, Dolog P, Xu G. KNN-Based Clustering for Improving Social Recommender Systems [C]//
Agents and Data Mining Interaction: 8th International Workshop, ADMI 2012. Springer, 2013.
https://doi.org/10.1007/978-3-642-36288-0_11
69. Philipp M, Nishihara R, Stephanie W, et al. Ray: A Distributed Framework for Emerging AI
Applications [C]// Proceedings of the 13th USENIX Symposium on Operating Systems Design and
Implementation. 2018. https://arxiv.org/abs/1712.05889
70. Philipp M, Robert N, Stephanie W, et al. Ray: A Distributed Framework for Emerging AI Appli-
cations [C]// USENIX Symposium on Operating Systems Design and Implementation. USENIX,
2018.
71. Quoc-Cuong, T., Juan, S., Volker, M.: A survey of state management in big data processing systems
[J]. Int. J. Very Large Data Bases 27(6), 847–872 (2018)
72. Ramalingam, G.: Bounded Incremental Computation [M]. Springer, Berlin (1996)
73. Reuther, A., Byun, C., Arcand, W., et al.: Scalable System Scheduling for HPC and Big Data [J].
J. Parallel Distrib. Comput. 111(1), 76–92 (2018)
74. Richer S. A Deep Dive into Rescalable State in Apache Flink. 2017. https://flink.apache.org/features/
2017/07/04/flink-rescalable-state.html
75. Sakr, S., Liu, A., Fayoumi, A.: The family of mapreduce and large scale data processing systems [J].
ACM Comput. Surv. 46(1), 1–44 (2013)
76. Sankaran, S., Squyres, J.M., Barrett, B., et al.: The lam/mpi checkpoint/restart framework: system-
initiated checkpointing[J]. Int. J. High Perform. Comput. Appl. 19(4), 479–493 (2005)
77. Saraswat V, Almasi G, Bikshandi G, et al. The Asynchronous Partitioned Global Address Space
Model. http://www.cs.rochester.edu/u/cding/amp/papers/full/The%20Asynchronous%20Partitioned
%20Global%20Address%20Space%20Model.pdf
78. Schulz M, Bronevetsky G, Fernandes R, et al. Implementation and Evaluation of a Scalable
Application-level Checkpoint-recovery Scheme for MPI Programs [C]// Proceedings of the 2004
ACM/IEEE Conference on Supercomputing. IEEE, 2004.
79. Severson, K., Chaiwatanodom, P., Braatz, R.D.: Perspectives on process monitoring of industrial
systems [J]. Annu. Rev. Control. 42, 190–200 (2016)
80. Borgatti, S. P. Multidimensional Scaling. 1997. http://www.analytictech.com/borgatti/mds.htm
81. Supun, K., Pulasthi, W., Saliya, E., et al.: Anatomy of machine learning algorithm implementations
in mpi, spark, and flink [J]. Int. J. High Perform. Comput. 32(1), 61–73 (2018)
82. The Beowulf Cluster site. http://www.beowulf.org
83. Tianqi Chen, Mu Li, Yutian Li, et al. MXNet: A Flexible and Efficient Machine Learning Library for
Heterogeneous Distributed Systems. Neural Information Processing Systems, Workshop on Machine
Learning Systems. 2016.
84. Tony H, Stewart T, Kristin T. The Fourth Paradigm: Data-Intensive Scientific Discovery [M].
Microsoft Research. 2009. https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/
Fourth_Paradigm.pdf
85. Tzoumas K. High-throughput, Low-latency, and Exactly-once Stream Processing with Apache
Flink™. 2015. https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-
stream-processing-with-apache-flink
86. Wenguang, C.: Big data and high performance computing [J]. Big Data Res. 1(001), 20–27 (2015)
87. Chen, W.: Big data and high performance computing [J]. Big Data 1(001), 20–27 (2015) (in
Chinese). http://www.infocomm-journal.com/bdr/article/2015/2096-0271/2096-0271-1-1-00020.shtml
88. Wickramasinghe U , Lumsdaine A . A Survey of Methods for Collective Communication Opti-
mization and Tuning. 2016. ArXiv, abs/1611.06334.
89. Woodall T S, Shipman G M, Bosilca G, et al. High Performance RDMA Protocols in HPC [C]//
European Pvm/mpi Users Group Conference on Recent Advances in Parallel Virtual Machine &
Message Passing Interface. Springer, 2006.
90. Yanpei C, Francois R, Randy K. From TPC-C to Big Data Benchmarks: A Functional Workload
Model [R]. 1st Workshop on Specifying Big Data Benchmarks, 2012, 8163: 28–43. https://doi.org/
10.1007/978-3-642-53974-9_4
91. Zaharia M, Chowdhury M, Das T, et al. Resilient Distributed Datasets: A fault-tolerant Abstraction
for In-memory Cluster Computing [C]// Proceedings of the 9th USENIX conference on Networked
Systems Design and Implementation. USENIX Association, 2012.
92. Zhang B, Ruan Y, Qiu J. Harp: Collective Communication on Hadoop [C]// 2015 IEEE International
Conference on Cloud Engineering. IEEE, 2015.
93. Zhang, H., Chen, G., Ooi, B.C., et al.: In-memory big data management and processing: a survey [J].
IEEE Trans. Knowl. Data Eng. 27(7), 1920–1948 (2015)
94. Zhen J, Jianfeng Z, Lei W et al. Characterizing and Subsetting Big Data Workloads [C]// 2014 IEEE
International Symposium on Workload Characterization. IEEE, 2014.