
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE

Concurrency Computat.: Pract. Exper. 2016; 28:2412–2415


Published online 5 March 2016 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3813

EDITORIAL

Parallel and distributed computing for Big Data applications

1. INTRODUCTION

Nowadays, large volumes of data are being collected and analyzed at unprecedented scales. The
use of data-driven mathematical models to analyze large amounts of available data is changing the
way decisions are made in almost every sector of society, from business to customers, from
government to science. The abundance of data and the need to collect, store, and process
them have raised many technical challenges for Big Data applications. These challenges relate
to data heterogeneity, inconsistency and incompleteness, privacy and data ownership,
scale, and timeliness [1]. Parallel and distributed computing is of paramount importance,
especially for mitigating the scale and timeliness challenges.
This special issue contains eight papers presenting recent advances in parallel and distributed
computing for Big Data applications, focusing on their scalability and performance. Four papers
were carefully selected from the 2014 Workshop on Parallel and Distributed Computing for Big Data
Applications, held in conjunction with the International Symposium on Computer Architecture
and High Performance Computing (SBAC-PAD 2014) at the University Pierre and Marie
Curie, Paris, France, on October 24, 2014. Authors of the selected workshop papers were
invited to submit extended versions based on the recommendations of the workshop’s technical
program committee.
In addition, to provide a broader contribution on topics related to parallel and distributed
computing, an open Call for Papers was publicly announced and distributed. In response,
11 papers were submitted. All submissions (including the extended versions from the workshop)
were subject to a rigorous three-cycle review process. As a result, eight papers were accepted
for publication in this special issue. Submissions co-authored by the guest editors were handled by
independent editors in order to guarantee a blind peer review process.

2. PAPERS IN THIS ISSUE

In [2], BIGhybrid, a simulator for MapReduce applications executing on hybrid distributed
infrastructures, is proposed. The simulator has been validated with the Grid5000 experimental
platform. The hybrid infrastructure combines Cloud and Desktop Grid infrastructures
to provide a low-cost and scalable solution for Big Data analysis. The simulator is based on two
existing classes of MapReduce runtime environments: BitDew-MapReduce, designed for Desktop
Grids, and BlobSeer-Hadoop, designed for Cloud computing. The main goal is to carry out accurate
simulations of MapReduce executions in a hybrid distributed infrastructure. BIGhybrid reproduces
the behavior of both infrastructures, including the volatile behavior and fault tolerance
mechanisms of Desktop Grids and the disk contention of Cloud platforms. Both system types execute
as independent parallel simulations with several configurations. To evaluate the accuracy of
BIGhybrid, a statistical analysis with real-world MapReduce applications executing on the Grid5000
platform is presented.
In [3], an API for the implementation of load balancing strategies is proposed. Applications that
ingest data from a variable number of sources, such as mobile devices or social networks during large
events (e.g., the Olympic Games), may be required to accommodate both subtle and large variations
in the rate at which data arrive. To provide fast scalability in such scenarios, the design of some
NoSQL database management systems has employed peer-to-peer techniques to efficiently allocate

Copyright © 2016 John Wiley & Sons, Ltd.

additional resources (e.g., on the cloud) to distribute the storage and manipulation of new
incoming data. However, it is not trivial to anticipate which kind of strategy will efficiently
maintain adequate performance regarding response time, scalability, and reliability at any time.
As a consequence, bad decisions on how to distribute incoming data over the resources may lead to
load imbalance and poor performance. The API design aims to separate load balancing strategies from
the rest of the code. Such separation of concerns provides an easy and flexible way to replace the
load balancing strategy according to the application requirements, without changing the rest of the
application code. An evaluation of the API is presented, with experiments and a discussion of how it
can be used in a number of existing systems.
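The separation of concerns described for [3] can be illustrated with a minimal sketch. This is not the API from the paper: the names LoadBalancingStrategy, LeastLoaded, RoundRobin, and Store are hypothetical, and the "load" of a node is assumed to be a simple item count.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class LoadBalancingStrategy(ABC):
    """Hypothetical interface: decides which node receives an incoming item."""

    @abstractmethod
    def choose_node(self, loads: Dict[str, int]) -> str:
        ...


class LeastLoaded(LoadBalancingStrategy):
    def choose_node(self, loads: Dict[str, int]) -> str:
        # Pick the node holding the fewest items; ties are broken by node name.
        return min(sorted(loads), key=lambda node: loads[node])


class RoundRobin(LoadBalancingStrategy):
    def __init__(self) -> None:
        self._next = 0

    def choose_node(self, loads: Dict[str, int]) -> str:
        # Cycle through the nodes in name order, ignoring their current load.
        nodes = sorted(loads)
        node = nodes[self._next % len(nodes)]
        self._next += 1
        return node


class Store:
    """Application code: it never inspects how a node is chosen."""

    def __init__(self, nodes: List[str], strategy: LoadBalancingStrategy) -> None:
        self.loads = {node: 0 for node in nodes}
        self.strategy = strategy

    def ingest(self, item) -> str:
        # Delegate placement to the strategy, then record the new item.
        node = self.strategy.choose_node(self.loads)
        self.loads[node] += 1
        return node
```

Replacing RoundRobin() with LeastLoaded() when constructing Store changes the placement policy without touching Store.ingest, which is the kind of flexibility the paragraph above attributes to the API.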
In [4], Application-Guided I/O Scheduling for parallel file systems (AGIOS), a tool to improve
parallel input-output (I/O) performance, is proposed. AGIOS is sensitive to storage device
type. The proposed solution is evaluated over five different scheduling algorithms. The analyses
follow a synchronous approach because its overhead has a higher impact and also because
small delays in request processing can lead to longer execution times. AGIOS focuses on
parallel file systems and offers a user-level library. Automatic I/O scheduling selection is
achieved through machine learning algorithms and an interface to analyze data. The results
indicate that AGIOS is an attractive solution for parallel I/O scheduling, with good results when
compared with related work.
A study on memory management for Spark is presented in [5]. Spark is a relevant parallel and
distributed computing framework designed to support the execution of scalable and resilient
applications. A major abstraction implemented in Spark is the Resilient Distributed Dataset (RDD).
An RDD is a collection of objects, which can be partitioned and stored (in memory) across the nodes.
Because RDDs are transient, they need to be recomputed each time their values are used by an action
computation. The selection of RDD partitions and their distribution across the memory of multiple
computing nodes has a large influence on overall application performance. The paper addresses
this challenge in two ways. First, a new algorithm to select which RDDs should be kept in memory
is proposed. The algorithm aims to reduce the overhead and to speed up the whole computation,
and it bases its decisions on the dependencies between RDDs, which are expressed by a directed
acyclic graph. A second algorithm is proposed to replace partitions when the memory is full. This
algorithm chooses the partition to be replaced based on the cost to recompute it and its size.
Experiments show that Spark with the new algorithms achieves better performance.
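As a rough illustration of a replacement policy driven by recomputation cost and size, consider the sketch below. The editorial does not state how [5] combines the two quantities, so the scoring rule used here (evict the partition with the lowest recomputation cost per megabyte freed) is purely an assumption, as are the names Partition and pick_victim. Partition sizes are assumed to be strictly positive.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    name: str
    size_mb: float         # memory the cached partition occupies (assumed > 0)
    recompute_cost: float  # estimated time to rebuild it from its lineage


def pick_victim(cached: List[Partition]) -> Optional[Partition]:
    """Return the partition to evict, or None if nothing is cached.

    Assumed rule (not taken from the paper): evict the partition that is
    cheapest to recompute relative to the memory it frees.
    """
    if not cached:
        return None
    return min(cached, key=lambda p: p.recompute_cost / p.size_mb)
```

Under this assumed rule, a large partition that is cheap to rebuild is evicted before a small one that is expensive to rebuild.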
A high-performance data processing system named Watershed-ng is proposed in [6]. The proposal
is a re-engineering of Watershed, a previous framework that focuses on the processing of continuous
data streams by supporting object-oriented abstractions for the implementation of applications
based on the filter-stream paradigm. Like other Big Data frameworks (e.g., MapReduce), the original
Watershed supports abstractions for the implementation of the computation needed by the
applications, while the communication is deeply embedded into the run-time system. In the new
proposal, Watershed-ng separates the implementation of the computation and communication into
different classes, a design that allows users to choose among MPI, TCP sockets, or shared memory
implementations of the underlying communication channels, according to the application's
requirements. Also, the new architecture uses Apache YARN (Yet Another Resource Negotiator) as the
process scheduler and resource manager and Zookeeper for the coordination of the application
processes. Watershed-ng is evaluated in terms of performance and the amount of code necessary to
write applications. The new system presented performance comparable with systems like Hadoop and
Spark, with noticeable advantage over previous systems like the original Watershed and Anthill.
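The idea of keeping computation and communication in different classes can be sketched as follows. This is not Watershed-ng code: Channel, InMemoryChannel, and Filter are hypothetical names, and the in-memory queue merely stands in for the shared memory, TCP, or MPI channels mentioned above.

```python
import queue
from abc import ABC, abstractmethod
from typing import Any, Callable


class Channel(ABC):
    """Hypothetical transport interface; other transports would implement the same methods."""

    @abstractmethod
    def send(self, item: Any) -> None: ...

    @abstractmethod
    def receive(self) -> Any: ...

    @abstractmethod
    def empty(self) -> bool: ...


class InMemoryChannel(Channel):
    """Shared-memory stand-in backed by a thread-safe FIFO queue."""

    def __init__(self) -> None:
        self._q: "queue.Queue[Any]" = queue.Queue()

    def send(self, item: Any) -> None:
        self._q.put(item)

    def receive(self) -> Any:
        return self._q.get_nowait()

    def empty(self) -> bool:
        return self._q.empty()


class Filter:
    """Computation only: applies fn to each input item and forwards the result."""

    def __init__(self, fn: Callable[[Any], Any], inbound: Channel, outbound: Channel) -> None:
        self.fn = fn
        self.inbound = inbound
        self.outbound = outbound

    def run(self) -> None:
        # Drain the inbound channel; the filter never sees the transport details.
        while not self.inbound.empty():
            self.outbound.send(self.fn(self.inbound.receive()))
```

Because Filter depends only on the Channel interface, a different transport could be supplied without modifying the filter's computation.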
In [7], the scalability of MapReduce computations is studied. Proposed in 2004, MapReduce
became very popular and widespread, being adopted for the execution of various Big Data
applications such as web indexing, social network applications, processing of logs and raw data,
query processing on large-volume datasets, and many others. Practice shows that MapReduce
implementations can efficiently process dozens of petabytes per day on thousands of machines. In
the paper, a Bulk Synchronous Parallelism-based model is proposed, which estimates the computation,
communication, and synchronization costs for the execution of MapReduce applications. With this
model, some asymptotic bounds for the scalability of MapReduce applications for several
communication scenarios are demonstrated. These theoretical bounds show that MapReduce applications
can achieve sub-linear speedups under specific circumstances. Theoretical results are corroborated
by empirical and simulation experiments.
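For readers unfamiliar with Bulk Synchronous Parallelism, the textbook BSP cost function shows why communication and synchronization limit speedup. The derivation below is a generic sketch, not the model of [7]; all symbols are assumptions introduced here: S supersteps, local work w_s, at most h_s messages sent or received per processor, per-message cost g, barrier cost \ell, and total work W that is assumed to divide evenly over p processors.

```latex
% Cost of a BSP computation with S supersteps on p processors
T(p) = \sum_{s=1}^{S} \left( w_s + g\,h_s + \ell \right)

% Assume perfectly divisible work: w_s = W_s / p, with W = \sum_{s=1}^{S} W_s,
% and a sequential run that pays no communication or barrier cost: T(1) = W.
\mathrm{Speedup}(p) = \frac{T(1)}{T(p)}
  = \frac{W}{\dfrac{W}{p} + \sum_{s=1}^{S} \left( g\,h_s + \ell \right)}
  = \frac{p}{1 + \dfrac{p}{W} \sum_{s=1}^{S} \left( g\,h_s + \ell \right)}
  < p
```

Under these assumptions the speedup is strictly below p whenever the communication and synchronization terms are nonzero, and the gap widens as p grows relative to W.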
One of the challenges in Big Data is having to relate data from files with heterogeneous data
formats. This dataflow challenge is addressed in [8]. These heterogeneous raw data files are often
generated by a sequence of computer simulations, each of which has a set of parameters and input
data. These simulations can be modeled as scientific workflows and managed by a parallel Scientific
Workflow Management System. Although several Scientific Workflow Management Systems
provide sophisticated mechanisms for executing large-scale scientific workflows in distributed
high-performance computing environments, most of them execute the workflow in an “offline” way, as a
black box. Online monitoring and debugging may save significant amounts of workflow execution
time when unexpected data behavior can be detected well before the end of the workflow execution.
In most executions, the debugging process has to explore the content of data files, iteratively
searching for what caused the anomalous execution. Because each workflow
execution often consists of thousands of tasks that are executed in parallel, manual monitoring
and debugging of dataflow generation is not feasible. The paper presents dataflow
analytical techniques and monitoring analysis experiences using a real use case from the astronomy
domain. The approach is based on provenance data related to domain data extracted at runtime
from raw data files. Users query a database to relate domain, provenance, and execution data, gaining
access to a detailed view of dataflow generation.
In [9], an automated and scalable tracking and caching mechanism is proposed to evaluate
continuous queries over voluminous data streams. Over recent years, the number and diversity of
applications that perform online monitoring and control on data collected from sensor networks,
scientific observational equipment, and social networks have increased. In this challenging scenario,
data streams arrive at high transmission rates from a large number of sources, resulting in
voluminous datasets that should be dispersed and managed over multiple resources. In addition, the
execution of continuous queries on such voluminous and distributed datasets raises scalability
challenges when the number of clients and data sources increases. The proposed mechanism evaluates
continuous queries over data stored in a distributed system. A dormant cache scheme is proposed
to alleviate the strain on the cache caused by intensive memory requirements. Furthermore, a
scheduling algorithm is proposed to decide which data should be maintained in the dormant cache.
Empirical evaluation on a private cluster and on Amazon AWS (Amazon Web Services) is presented.

ACKNOWLEDGEMENTS

The guest editors would like to thank all the authors who submitted to this special issue for their
effort to produce high-quality papers. We would also like to express our special gratitude to
Editor-in-Chief Prof. Geoffrey Fox for his guidance and effort during this process. Finally, we would
like to thank all the reviewers invited to collaborate in this special issue: Antônio Tadeu Azevedo
Gomes, Andrea Charao, Daniel de Oliveira, Edson Norberto Cáceres, Erika Rosas, Estevam Hruschka Jr,
Fabricio A.B. da Silva, Gianpaolo Cugola, Lucia Drumond, Luciana Arantes, Marco Netto, Marta Mattoso,
Mauricio Marin, Navendu Jain, Nelson Ebecken, Pedro Velho, Pierre Sens, Rafael Ferreira da Silva,
Renato Ishii, Rodolfo Jardim de Azevedo, Veronica Gil-Costa, and Xinyu Que.

REFERENCES
1. Jagadish HV, Gehrke J, Labrinidis A, Papakonstantinou Y, Patel JM, Ramakrishnan R, Shahabi C. Big data and its
technical challenges. Communications of the ACM 2014; 57(7):86–94.
2. Anjos JCS, Fedak G, Geyer CFR. BIGhybrid: a simulator for MapReduce applications in hybrid distributed
infrastructures validated with the Grid5000 experimental platform. Concurrency and Computation: Practice and
Experience 2015; 28(8):2416–2439.
3. Antoine M, Pellegrino L, Huet F, Baude F. A generic API for load balancing in distributed systems for big data
management. Concurrency and Computation: Practice and Experience 2015; 28(8):2440–2456.
4. Boito FZ, Kassick RV, Navaux POA, Denneulin Y. Automatic I/O scheduling algorithm selection for parallel file
systems. Concurrency and Computation: Practice and Experience 2015; 28(8):2457–2472.
5. Duan M, Li K, Tang Z, Xiao G, Li K. Selection and replacement algorithms for memory performance improvement in
Spark. Concurrency and Computation: Practice and Experience 2015; 28(8):2473–2486.
6. Rocha R, Hott B, Dias V, Ferreira R, Meira W, Guedes D. Watershed-ng: an extensible distributed stream processing
framework. Concurrency and Computation: Practice and Experience 2016; 28(8):2487–2502.
7. Senger H, Gil-Costa V, Arantes L, Marcondes CAC, Marin M, Sato LM, Silva FAB. BSP cost and scalability analysis
for MapReduce operations. Concurrency and Computation: Practice and Experience 2015; 28(8):2503–2527.
8. Silva V, Oliveira D, Valduriez P, Mattoso M. Analyzing related raw data files through dataflows. Concurrency and
Computation: Practice and Experience 2015; 28(8):2528–2545.
9. Tolooee C, Malensek M, Pallickara SL. A scalable framework for continuous query evaluations over multidimensional,
scientific datasets. Concurrency and Computation: Practice and Experience 2016; 28(8):2546–2563.

HERMES SENGER
Universidade Federal de São Carlos (UFSCar)
São Carlos, SP, Brazil
E-mail: senger.hermes@gmail.com

CLAUDIO GEYER
Universidade Federal do Rio Grande do Sul (UFRGS)
Porto Alegre, RS, Brazil
